
Technical Report, IDE0806, January 2008

Functional Analysis of Real World Truck Fuel Consumption Data

Master’s Thesis in Computer Systems Engineering

Georg Vogetseder


School of Information Science, Computer and Electrical Engineering, Halmstad University

Box 823, S-301 18 Halmstad, Sweden


Acknowledgement

If it looks like a duck, and quacks like a duck, we have at least to consider the possibility that we have a small aquatic bird of the family anatidae on our hands. Douglas Adams (1952-2001)

Thanks to my family, especially my mother Eva and friends.


Abstract

This thesis covers the analysis of sparse and irregular fuel consumption data of long distance haulage articulated trucks. It is shown that this kind of data is hard to analyse with multivariate as well as with functional methods. To be able to analyse the data, Principal Components Analysis through Conditional Expectation (PACE) is used, which enables the use of observations from many trucks to compensate for the sparsity of observations in order to get continuous results.

The principal component scores generated by PACE can then be used to get rough estimates of the trajectories for single trucks as well as to detect outliers.

The data centric approach of PACE is very useful for enabling functional analysis of sparse and irregular data. Functional analysis is desirable for this data because it sidesteps feature extraction and enables a more natural view of the data.


Contents

Acknowledgement
Abstract
List of Figures
List of Tables

1 Introduction
  1.1 Background
  1.2 Motivation and Novelty
  1.3 Related Work
  1.4 Limitations
  1.5 Outline

2 Methods
  2.1 General Statistical Methods
    2.1.1 PCA
    2.1.2 Hierarchical Clustering
    2.1.3 Validation Methods
    2.1.4 Diagrams
  2.2 Functional Data Analysis
  2.3 Principal Components Analysis through Conditional Expectation

3 The Vehicle Application and Data Description
  3.1 Data
    3.1.1 Impurities in the Truck Data
    3.1.2 Data structure

4 Results
  4.1 Basic Data Analysis
    4.1.1 Data Binning
    4.1.2 Feature Extraction
    4.1.3 Function Fitting
  4.2 Application of PACE
    4.2.1 Baseline PACE Results
    4.2.2 Number of Principal Components
    4.2.3 Error Assumptions in PACE
    4.2.4 Different Kernel Functions

List of Figures

3.1 Fuel Consumption between Observations
3.2 Fuel consumption plot generated from the raw data
3.3 Histograms of the original and the cleaned data
3.4 Fuel consumption plot generated from the clean data
3.5 Scatter plot and histograms
3.6 Histogram of the distance between observations
4.1 Distribution and mean/variance of binned data
4.2 Boxplots of binned data
4.3 Outlier detection based on feature extraction
4.4 Straight line fitting
4.5 Plot of mean function and principal components
4.6 Scree Plot
4.7 Smoothed covariance matrix
4.8 Reconstructed curves versus mean function and raw observations of selected trucks
4.9 Reconstructed curves and raw measurements for all trucks
4.10 Reconstructed traces of misfitted trucks
4.11 Comparison of reconstructed trajectories with differing number of PCs
4.12 Reconstructed trajectories without measurement error assumed
4.13 A comparison of µ with different smoothing kernels
4.14 A comparison of 3 PCs with different smoothing kernels
4.15 Distribution of all mean curves
4.16 Graph of all mean curves
4.17 Trucks with a high influence on the results of PACE
4.19 Normal Distribution Plots of the PC scores
4.20 Histograms of the probability of trucks
4.21 Samples of truck probability
4.22 PACE Results of Speed Data
4.23 PACE Results on Seasonal Fuel Consumption
4.24 Selected trucks from the Seasonal Fuel Consumption Data

List of Tables

4.1 MSE of PACE with 8 principal components
4.2 MSE of PACE with 3 PCs
4.3 MSE of PACE with 4 PCs
4.4 MSE of PACE with 29 PCs
4.5 MSE of PACE with 8 PCs and error cut-off

1 Introduction

1.1 Background

The original idea for analyzing this data came from Volvo Parts AB, one of the main business units of Volvo Group AB. The role of Volvo Parts is to provide solutions and tools to the after-market, which includes vehicle electronics diagnostic tools. When a truck is in the workshop, the vehicle electronics data is read out from the truck using diagnostics tools from Volvo Parts and transmitted to a central database.

This data, which is collected from the vehicle's electronic systems, is called logged vehicle data (LVD) and originates from sensors within the truck. Several electronic subsystems supply information for LVD, which can include data from the electronic suspension, the transmission, and most importantly from the Engine Electronic Control Unit. The current main use of LVD is seemingly just basic analysis, e.g. remote diagnostics of faulty components and simple statistics.

One of the problems with analysing LVD is the relative lack of observations. The source of this lack is the data retrieval process: it is time consuming, which makes it a cost factor for the workshops. This negatively affects the adoption rate of the procedure in the field, which leads to the data composition detailed in Section 3.1.

The basic idea behind the problems detailed in this thesis is to expand the usefulness of the data for Volvo Parts, retrieving additional new information from it and providing means to access this information. This is done by using recent advanced statistical


techniques. As a starting point to the application of these techniques, the analysis of the fuel consumption data contained in LVD was suggested.

Fuel consumption data is very interesting from a statistical point of view. This interest stems from being a major cost factor, as well as being influenced by a high number of other factors, such as:

• Usage patterns of the operator, i.e. the driving style and habits
• Maintenance of the truck
• Gross Combination Weight usage, i.e. the cargo of the truck
• Environment, i.e. hilliness, road condition, etc.

The influence of these and other factors makes this data a good indicator. But the mass of influences also makes an exact determination of the underlying cause impossible. Additionally, some of these influences might cancel each other out, thus removing information. If it is possible to extract information from fuel consumption data, then it should work for the rest of the data too.

1.2 Motivation and Novelty

From LVD, it should be possible to extract information on hidden trends, i.e. the principal components (see Section 2.1.1) that are common to all similar trucks. Based on these components, it should be possible to determine if a truck is unrelated to other trucks, i.e. an outlier, and to predict future developments in fuel consumption when the truck's behavior is similar to that of other vehicles.

It is very easy to take the last observation of each truck in a group of similar trucks to determine abnormal fuel consumption, but it is hardly possible to calculate underlying trends or other information from these facts.


The analysis of this data can be done in at least two ways. The most obvious choice in methodology would be the use of multivariate statistics, but for several reasons detailed below, the central methodology for this thesis is functional statistics. Functional statistics focuses on analysing the data as functions, rather than as a set of discrete values.

Multivariate statistics are a set of methods which work on more than one variable at a time. Some examples of these methods are regression analysis, principal components analysis and artificial neural networks. In principle, functional statistics are also part of this set, as both have multiple variables as input. However, the focus on handling the input variables as continuous functions rather than arbitrary variables separates the two fields.

As the observation of trucks in the workshop does not happen regularly, i.e. the observations cannot be fitted to a grid, it is difficult to incorporate all information from the input into variables for use in multivariate statistics. Therefore, features like the mean, variance, duration of all observations, date of the first observation, odometer count at the last observation, etc. have to be extracted from the data before analysis is possible. Inevitably, the extraction of this knowledge leads to information loss, which is problematic with this already sparse data. The process of discovering and selecting important features for multivariate analysis is difficult and time consuming, and the success of all further analysis steps depends on extracting and selecting the features that minimize the information loss and maximize the information content. Feature extraction also creates an additional layer of data processing and introduces a large number of tunable knobs.

Functional Data Analysis (FDA), on the other hand, preserves the information present in the data and does not need feature extraction at all. Furthermore, it facilitates a more natural handling of the data, describing not only more or less abstract features of the data, but a function which resembles the data. The choice of functional over multivariate data analysis is also motivated by the ability to analyse the functional properties of the data, e.g. its derivatives. Additionally, FDA does not introduce a high number of additional parameters, unlike multivariate analysis.


However, multivariate analysis has an advantage over FDA when a high number of different functions have to be analysed at the same time. FDA has problems visualizing such higher dimensional data and requires a high amount of data for each dimension (curse of dimensionality).

The most important step in FDA is the transformation of the discrete data to a functional basis. Again, the irregular and sparse nature of the data makes this transformation difficult. To be able to perform FDA on this data, a method called Principal Components Analysis through Conditional Expectation (PACE) is applied. The foundation of PACE is the assumption that a smooth function underlies the sparse data. Under this assumption, it is possible to use even irregular data for the discovery of principal components.

The main novel aspect of this thesis is the application of FDA and PACE to automotive data. Previously, they have successfully been applied to biological data, economic processes and bidding in online auction houses, but not to automotive data. PACE itself is highly interesting for the data at hand, because it is able to work on it without the need for feature extraction or regular observations.

The methods used in this work can be used to describe the actual fuel consumption of the observed trucks in customer hands. This means the methods applied to LVD are driven by data and not by a model.

1.3 Related Work

General sources of information on data analysis – related to this work – are The Elements of Statistical Learning [1], Functional Data Analysis [2] and Nonparametric Functional Data Analysis [3].


Functional Data Analysis for Sparse Auction Data [5] combines the PACE approach with linear regression to predict closing prices of online auctions.

The most related of the few public papers on fuel consumption in heavy trucks is Heavy Truck Modeling for Fuel Consumption Simulations and Measurements [6]. This work deals with building a simulation model of fuel consumption. Another paper, which discusses methods to reduce idle fuel consumption in North American long distance trucks and highlights typical driver behavior, is Analysis of Technology Options to Reduce the Fuel Consumption of Idling Trucks [7].

Additional information on doing PCA on sparse and irregular data can be found in Principal component models for sparse functional data [8] and Sparse Principal Component Analysis [9]. More related to PACE is Properties of principal component methods for functional and longitudinal data analysis [10]. Another paper related to the estimation of functional principal component scores is [11]. Knowledge relating to linear regression analysis for longitudinal data can be found in [12].

1.4 Limitations

The scope of this thesis is to research the possibilities for the application of FDA methods to the sparse and irregular automotive data from LVD. It is outside the scope of this thesis to establish a conclusive theory about a true long term fuel consumption model of all truck engines.


1.5 Outline

2 Methods

This chapter is divided into three parts. General Statistical Methods describes non-functional methods which are fundamental to this work. Functional Data Analysis provides an introduction into this field. The final part, Principal Components Analysis through Conditional Expectation gives an overview of this crucial method.

2.1 General Statistical Methods

This section introduces general statistical concepts used in this thesis and a number of tools to visualize data and test results.

2.1.1 Principal Component Analysis

One of the constitutional methods for analysing LVD is the Karhunen-Loève transformation, universally known as Principal Component Analysis (PCA). PCA is also the foundation of Functional Principal Component Analysis (FPCA) [1, 13].

Basically, PCA is a method to explore data by finding the most important ways in which the variables in the data differ from one another. It can compress the data by discovering a low number of linear combinations of input variables which contribute most to the variability of the input. These linear combinations are found by constructing a linear basis for the data in which the retained variability is maximal.


Mathematically speaking, the goal is to reduce or compress high dimensional data X to lower dimensional data Y .

To do this reduction, a number of algorithms are available; here, a method involving the calculation of the covariance matrix is described.

The first step is to calculate the mean vector µ for each variable:

\mu_i = \frac{1}{K_i} \sum_{j=1}^{K_i} x_{ij}, \qquad i = 1, \ldots, N

where N denotes the number of variables and K_i the number of observations of one variable.

Subsequently, µ is subtracted from every observation in X; the centred data is denoted as X − X̄.

In the next step, the covariance matrix cov(X − X̄) has to be calculated. Covariance is a measure of how two variables vary together. If the two variables vary in the same direction, the covariance will be positive; if they vary in opposite directions, the covariance will be negative. A covariance matrix is the result of calculating the covariance for all members of two vectors. The resulting matrix gives the degree of correlation between the input vectors.

To find a mapping M that transforms the high dimensional data into low dimensional data, the M that maximizes M^T cov(X − X̄) M has to be found. It can be shown that the best (variance maximizing) mapping is formed by the eigenvectors of the covariance matrix. Hence, PCA has to solve the eigenproblem to get the transformation matrix:

\mathrm{cov}(X - \bar{X})\, M = \lambda M

The eigenproblem has to be solved d times with different principal eigenvalues λ to get the principal eigenvectors (or principal components). The low dimensional representation Y can then be computed by simple multiplication of the centred data with M: Y = (X − X̄) M.
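As an illustration only, the following sketch carries out these steps with NumPy; the function name and shape conventions are assumptions for this example, and the thesis itself does not prescribe an implementation.

```python
import numpy as np

def pca(X, n_components):
    """PCA via an eigendecomposition of the covariance matrix.

    X : (n_samples, n_variables) data matrix.
    Returns the low dimensional representation Y and the components M.
    """
    X_centred = X - X.mean(axis=0)          # subtract the mean vector mu
    C = np.cov(X_centred, rowvar=False)     # covariance matrix cov(X - X_bar)
    eigvals, eigvecs = np.linalg.eigh(C)    # solve cov(X - X_bar) M = lambda M
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues in decreasing order
    M = eigvecs[:, order[:n_components]]    # principal components
    Y = X_centred @ M                       # Y = (X - X_bar) M
    return Y, M
```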


2.1.2 Hierarchical Clustering

Hierarchical clustering is a relatively simple method [1] to segment data into related groups. Clustering is used within this thesis to test whether distinct clusters of trucks can be found from the extracted features. Hierarchical clustering needs a dissimilarity measure between the elements. The standard choice for measuring the dissimilarity is the Euclidean distance, which is also used in this thesis.

When the distance between all possible pairs of elements has been calculated, the clusters can be built. For building these clusters, there are two different approaches: the agglomerative approach, which starts with as many clusters as there are individuals, and the divisive method, which starts with one big cluster that is then split into smaller clusters. Agglomerative methods are guaranteed to have a monotonically increasing level of dissimilarity between merged clusters, growing with the level of merging. This property is not guaranteed for divisive approaches.

The second choice when building the clusters is the measure of distance between two clusters:

• Single Linkage – The link between the clusters is defined by the smallest distance between elements in the two clusters.

• Complete Linkage – The link is defined by the largest distance between elements in the two clusters, the opposite of the first method.

• Average Linkage – Uses the average distance between all pairs of elements in both clusters.
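A minimal sketch of this clustering step using SciPy is given below; the random feature matrix and the chosen number of clusters are purely illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
features = rng.random((50, 2))                 # placeholder per-truck feature vectors
dists = pdist(features, metric="euclidean")    # pairwise Euclidean dissimilarities

# Agglomerative clustering with the three linkage rules described above.
for method in ("single", "complete", "average"):
    Z = linkage(dists, method=method)
    labels = fcluster(Z, t=5, criterion="maxclust")   # cut the tree into 5 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes
```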

2.1.3 Validation Methods

A number of methods to validate the results and to estimate variation were used in the scope of this thesis. These include brief usage of bootstrap, jackknife and various cross validation methods, such as k-fold and leave-one-out [1].


Jackknifing can be used to estimate bias and standard error. The jackknife is very similar to k-fold and leave-one-out cross validation, as it systematically removes one or more observations from a sample and then recalculates the results, once for every possible leave-out.
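As a small illustration of the jackknife idea (not code from the thesis), the leave-one-out estimates of bias and standard error of a sample mean can be computed like this; the data values are invented.

```python
import numpy as np

x = np.array([2.4, 2.6, 2.5, 2.7, 2.3])   # hypothetical fuel mileage values
n = len(x)

# Recalculate the statistic once for every possible left-out observation.
loo_means = np.array([np.delete(x, i).mean() for i in range(n)])
bias = (n - 1) * (loo_means.mean() - x.mean())
std_err = np.sqrt((n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2))
print(bias, std_err)
```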

2.1.4 Diagrams

A number of special diagrams were used to illustrate some results of this thesis. Those diagrams are dendrograms, boxplots and scree plots [1, 2].

• Dendrograms are tree diagrams which are used to illustrate the result of a clustering algorithm. An example of such a diagram is Figure 4.3. On the vertical axis the distance between clusters is plotted. A horizontal line denotes a split between classes at this specific distance measure. This implies that a split at a higher distance value has a higher dissimilarity between the split classes than a split at a lower distance value.

• Boxplots describe groups of data – such as binned data – through five statistical properties. A boxplot example can be seen in Figure 4.2. The box represents the lower and the upper quartile, showing where half of the data is contained. The line in this box illustrates the median of the data in this group. The whiskers attached to this box extend to the furthest data point, up to a maximum of 1.5 times the distance between the quartiles. Data points outside of this boundary are usually marked with a cross, indicating a possible outlier.

• Scree plots give an indication of the relevance of a principal component (eigenfunction) by indicating the accumulated eigenvalue up to the n-th principal component. This plot can be used to select a suitable number of eigenfunctions. An example of a scree plot is Figure 4.6.

2.2 Functional Data Analysis


Functional Data Analysis (FDA) treats a set of observations not as a vector in discrete time, but as a continuous function. The analysis of functions rather than discrete samples has advantages over multivariate analysis.

One of these advantages is that the rate of change or derivatives of these functions can easily be calculated and analysed. FDA also includes variants of multivariate methods like PCA. Functional PCA, like ordinary PCA, not only provides a method for dimensionality reduction, but also characterizes the main modes of variation around a mean function.

To perform FDA on discretely sampled data, the data has to be converted to a continuous, functional format. This means a function has to be fitted to the sampled data points. It is not feasible to convert every dataset to a functional form. Especially in the case of sparse and irregular observations, this task is very difficult, but central to the success of functional data analysis.

Usually, the methods used to convert data into a functional format are interpolation and smoothing, or more generally function fitting. A very simple method for this conversion would be a least squares fit of a first order polynomial (a straight line). Usually, a more flexible method is used for this step, namely spline interpolation. Depending on the underlying data, other fits like Fourier bases are possible. FDA is easily applicable if the measurements were made with regular spacing and the data is complete over the observation duration. In the opposite case, it is very difficult to estimate the complete trajectory when only a single subject is taken into the calculation.

2.3 Principal Components Analysis through Conditional Expectation


PACE is an algorithm for extracting the principal components from irregular and sparse data. It also provides an estimation of individual smooth trajectories of the data. PACE assumes that the data is randomly located, with a random number of observations per subject. Furthermore, it assumes that the data is determined by an underlying smooth trajectory.

The first step in PACE is the estimation of the smooth mean function µ, by using a local linear line smoother on all measurements combined into one pool of data. The choice of the smoothing parameter, or bandwidth, is made automatically [14] or by hand in this step.

The covariance surface can then be calculated like a regular covariance matrix. This raw covariance surface is stripped of the variance (the main diagonal). The raw matrix is then smoothed utilizing a local linear surface smoother. The bandwidth is chosen by leave-one-curve-out cross-validation. The smoothing step is necessary to fill in for missing observations. The estimation of these two model components shares the same smoothing kernel. The choice of a smoothing kernel is discussed in Chapter 4.

From these model components, it is possible to calculate the estimates of the eigenvalues and eigenfunctions, i.e. the functional principal components of sparse and irregular data.
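To make the first of these steps concrete, the following sketch re-implements only the mean-function estimation: a local linear smoother with an Epanechnikov kernel applied to the pooled observations. The fixed bandwidth and the function names are assumptions for illustration; the actual PACE implementation also selects the bandwidth automatically and goes on to estimate the covariance surface and the scores.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel with support on [-1, 1]."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def local_linear_mean(x_pool, y_pool, grid, bandwidth):
    """Estimate the smooth mean function mu on a grid by local linear
    regression over the pooled observations of all trucks."""
    mu = np.empty(len(grid))
    for i, t in enumerate(grid):
        w = epanechnikov((x_pool - t) / bandwidth)
        # Weighted least squares fit of a line centred at t;
        # its intercept is the local linear estimate of mu(t).
        X = np.column_stack([np.ones_like(x_pool), x_pool - t])
        XtW = X.T * w
        beta = np.linalg.pinv(XtW @ X) @ (XtW @ y_pool)
        mu[i] = beta[0]
    return mu
```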

3 The Vehicle Application and Data Description

The purpose of this chapter is to outline the connection between the methods proposed in Chapter 2 and the application of those methods on the Volvo data.

3.1 Volvo Truck Data

The original data received from Volvo Parts AB consists of 2027 observations of 267 trucks. It was collected between June 2004 and May 2007 in North America.

All trucks have the same engine and are configured as articulated trucks for long distance transports on smooth roads. The gross combination weight (GCW), which includes the weight of the towed trailer and the truck itself, is 36 tons, the US federal GCW limit. Data is retrieved when a truck is in a workshop that is equipped to read out the onboard electronics and performs this procedure. It is then sent to the Volvo headquarters in Gothenburg for storage and analysis.

The data from each observation contains only information from one of the truck's onboard electronic systems, the Engine Control Unit (ECU). From this data, two variables are mainly relevant for this thesis:

• Total distance driven

• Total amount of fuel consumed


Figure 3.1: This figure shows the distribution of the fuel consumption when the fuel mileage is calculated only between two observations. The outliers visible in this figure can be explained by a high amount of idling between two close observations. When the fuel mileage is calculated accumulatively, those outliers do not occur.

These variables are not reset when the ECU is read out in the workshop and therefore behave accumulatively. Using these variables as a basis to calculate the fuel consumption per distance or time has an averaging effect, as it includes all former mileage data. This is necessary because of the unevenly distributed data. If a truck was read out twice within a very short span of time, the fuel consumption in this interval is possibly vastly different from the normal fuel consumption behavior of the truck, for instance because the truck was not moved very far within this time span but was idling for some time. The outliers caused by this effect can be seen in Figure 3.1. These outliers are the reason for not using the difference in fuel amounts between two observations as a calculation basis in this thesis. The accumulative approach allows the affected observations to remain in the dataset.
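A small numerical sketch of the two calculation bases; the readout values are invented for illustration and show how a short interval between readouts produces an incremental outlier while the accumulative mileage stays stable.

```python
import numpy as np

# Hypothetical readouts of the accumulative ECU counters at workshop visits.
distance_km = np.array([120_000.0, 121_000.0, 180_000.0, 250_000.0])
fuel_l = np.array([48_000.0, 48_800.0, 72_000.0, 100_000.0])

accumulative = distance_km / fuel_l                    # total distance over total fuel
incremental = np.diff(distance_km) / np.diff(fuel_l)   # change since previous readout

print(accumulative)   # stays close to 2.5 km/l at every readout
print(incremental)    # the short second interval drops to 1.25 km/l, an apparent outlier
```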

3.1.1 Impurities in the Truck Data


Figure 3.2: Fuel consumption plot generated from the raw data. The lines are linear interpolations between the observations.

The following kinds of impurities were found in the data:

• Incomplete Observations – A truck is missing one or more variables that would be required for analysis. The observations from this individual cannot be used for the calculations.

• Physically impossible changes in accumulative variables – Between two observations of a single truck, accumulative variables changed to a smaller value. This means that a later observation in time has, for example, a smaller total driving distance than an earlier measurement. This is physically impossible, but observable if the ECU has been replaced or the contents of the ECU were erased during a software update. This criterion applies to 44 trucks. Although it would be possible to use a subset of the observations from each of these trucks, this was not done, because the quality of the measurements might have been compromised and manually cleaning the data is a time consuming task for very few usable measurements.


final data, but the remaining observations of this truck are used. Phenomena like these might occur when the data acquisition process in the workshop was interrupted or a transmission error occurred.

• Early Observations – These observations are too early in the life of the truck to give meaningful information. The removal of these observations is motivated by the unusual fuel consumption of a truck in this state. The unusual fuel consumption is caused by the high number of short trips the truck has to make before it can be put into regular service. Examples are drives to paint shops or truck customizers as well as transfers to the customer. The number of observations purged when this criterion is set to remove all measurements below 10000 km is 150; when all measurements before 1000 km are deleted, the number of observations drops by 100. See Figure 3.3.

From the 269 initial individual trucks, 56 trucks are removed. In terms of observations, from the original 2027 observations, 1320 remained in the data set when the lower border for observations is set to 1000 km. See Figure 3.4 for a plot of the cleaned fuel consumption data. The most visible change compared to Figure 3.2 is the lower number of outliers at roughly 0 kilometers, which is mostly an effect of the removal of very early observations.

3.1.2 Data structure

Some properties of the data make the task of analysing it inherently difficult. Most of these properties stem from the sparsity of the data. Sparseness in this case means that every truck has been observed on average just 7.405 times, with a standard deviation of 2.4083 observations. The sparseness of the data is visualized in Figure 3.5.


Figure 3.3: This comparison shows the number of observations on the raw data versus the cleaned data. The overall reduction in the number of observations as well as the lower amount of observations at the beginning is noticeable.

[Figure 3.4: Fuel mileage [km/l] over driven distance [km], generated from the cleaned data]

Figure 3.5: The scatter plot in this figure highlights the sparse and irregular distribution of the data. The histograms describe the distribution of the observations along the axes.

• The mean distance driven at the first observation is at 303232 kilometers, deviating by 133609 kilometers, which means that most of the trucks are not observed from the beginning, but later on in their life-cycle.

• The density of measurements varies. This implies that the placement of measurements is irregular throughout the duration of their observation. As the trucks are independent of each other, the times when observations happen are not correlated with each other. For a visual representation of the irregular duration between the measurements, see Figure 3.6. This figure indicates a non-normal distribution. The average distance between observations is 52020 kilometers, with a standard deviation of 61858 kilometers.

• Unsupported curvature. The irregular placement and the sparsity of the data cause this property to occur. If a part of a curve has a high curvature, which can be approximated by \|d^2y/dx^2\| or (d^2y/dx^2)^2. When this is the case, the relative


Figure 3.6: This figure shows the distribution of distances between two observations of the same truck.

3.2 Approach

The first part of analyzing the truck data, which is described in Section 4.1, is to establish results with basic multivariate analysis as a baseline to which the results of the functional analysis can be compared. This part shows pitfalls and difficulties when applying standard multivariate methods to the data.

The first possible way to do multivariate analysis is feature extraction. It is a difficult task to find relevant features to extract. A simple statistical feature will be extracted from the data to give an idea of how feature extraction works. The second possibility for multivariate analysis is to put the observations into bins. This is done in order to align the data onto a vertical grid.


These steps should lead to two results: A simple outlier detection, based on a clustering of the extracted features and a variance and mean estimation for the data, based on the binned data.

The task of estimating fuel consumption behavior for a single truck, outside of its observation duration, using the extracted features is very hard. This is because the mapping between the values of the features and a function is not available. Additionally, information from other, similar trucks is not taken into consideration.

The last step in Basic Analysis (Sect. 4.1) is a demonstration of the main problem of applying FDA on the data at hand: The difficulty of fitting a function to a single truck.

The main task of this thesis is to apply the PACE algorithm to the data (Sect. 4.2) and to try out the various options within the PACE algorithm. In this section, the results of PACE in general will be assessed, as well as the differences between PACE runs with different options, both in terms of the PACE generated functions and of general statistical properties, such as the mean function.

The first advantage of using the PACE algorithm in comparison to the basic methods is that there is no need to pre-process the data, i.e. to extract features or otherwise process the data. This non-parametric input of the data is complemented by a number of options to tune the algorithm itself for various needs (amount of information retained, whether the input data has measurement errors, etc.).

The next step is to try out a number of methods which can be applied to the results of PACE, for example to calculate the probability of the fuel consumption of a particular truck, given all the other trucks.

PACE makes it possible to analyse the sparse and irregular data at hand and enables the use of additional techniques from FDA, whereas using only multivariate data analysis or ordinary FDA on the same data is very difficult and does not incorporate the information gathered from the other trucks.

4 Results

4.1 Basic Data Analysis

The aim of this section is to provide an overview of basic multivariate analysis possibilities with the available data. Functional methods are applied from Section 4.2 onwards.

4.1.1 Data Binning

One approach, as described in the previous chapter, is the creation of a vertical grid for the data domain followed by binning the data into a limited number of “buckets” along the time or distance axis, similar to creating a histogram. If there is more than one observation of a truck in one of these bins, an average of these measurements is put into the bin. This has to be done to avoid biasing in case of dense observations of a truck within a short timespan.

The size and the quantity of the bins is crucial for binning. With the data at hand, 25 bins were used, which results in a size of 36087 kilometers per bin.

In Figure 4.1 the number of observations per bin, as well as an estimation of the mean function and the variance of the data can be seen.

In Figure 4.2 a boxplot of the binned data and one of the results of bootstrapping [1] the mean value per bin (10000 bootstrap samples) are illustrated.
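A sketch of this binning step, assuming the data is held as one (distance, mileage) array pair per truck; the bin width follows from the 25 bins of 36087 km mentioned above.

```python
import numpy as np

def bin_observations(trucks, n_bins=25, x_max=25 * 36087.0):
    """Bin per-truck observations along the distance axis.

    trucks : list of (distance_km, mileage_kmpl) array pairs, one per truck.
    Returns the per-bin mean and standard deviation of the truck averages.
    """
    edges = np.linspace(0.0, x_max, n_bins + 1)
    per_bin = [[] for _ in range(n_bins)]
    for dist, mileage in trucks:
        idx = np.clip(np.digitize(dist, edges) - 1, 0, n_bins - 1)
        for b in np.unique(idx):
            # Average multiple observations of the same truck in one bin
            # to avoid biasing the bin towards densely observed trucks.
            per_bin[b].append(mileage[idx == b].mean())
    means = np.array([np.mean(v) if v else np.nan for v in per_bin])
    stds = np.array([np.std(v) if v else np.nan for v in per_bin])
    return means, stds
```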

[Figure 4.1: Histogram of observations per bin, and mean and deviation of the binned fuel mileage [km/l] over distance driven [km]]

[Figure 4.2: Boxplots of the binned data and of the bootstrapped mean per bin]

4.1.2 Feature Extraction

The features retrieved from all observations of a single truck are used to construct a simple outlier detector with hierarchical clustering.

The goal of this simple outlier detector is to find trucks whose mean deviates significantly from the mean of the entire data. A single extracted feature was used in this case:

\Delta_{\mathrm{Truck}} = (\mu_{\mathrm{Truck}} - \mu_{\mathrm{All}})^2
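A sketch of computing this feature, assuming per-truck arrays of mileage observations and taking µ_All as the mean over the pooled observations; the resulting values could then be fed into the clustering sketch from Section 2.1.2.

```python
import numpy as np

def delta_feature(truck_mileages):
    """Squared deviation of each truck's mean mileage from the overall mean.

    truck_mileages : list of 1-D arrays, one array of fuel mileage
                     observations per truck.
    """
    mu_all = np.concatenate(truck_mileages).mean()            # overall mean
    mu_truck = np.array([m.mean() for m in truck_mileages])   # per-truck means
    return (mu_truck - mu_all) ** 2
```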

[Figure 4.3: Outlier detection based on feature extraction; dendrogram of the clustered classes and fuel mileage plot]

4.1.3 Function Fitting

Finding a plausible function that fits the data of the trucks well is difficult because of the open-ended nature of the measurements. If a set of observations has a defined start and end of its measurements, i.e. the data is fully observed, it is easy to interpolate the data in between, even if the data within this span is sparse. This property of the data at hand is also discussed in Section 3.1.

If the set of data is not fully observed, it is almost impossible to get a reliable fit outside the observation span of a single entity. A reliable fit outside of this span is necessary for performing FDA on this data, as FDA needs the same set of basis functions, or in the case of spline interpolation, the same knots, for all functions to work.

It was not possible to get a good fit on this data with splines where all of the knots are placed identically for all truck entities. Also, polynomial fits, i.e. the approximation of the data with low order (< 5) polynomials, did not result in a stable fit for the available data. The most reliable fit under these conditions was generated by fitting a linear function to the fuel consumption observations. These results in fitting the sparse and irregular data motivate the idea of combining the observations by means of PACE, to be able to get better fits from the reconstructed trajectories.
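The straight line fit itself is simple; a sketch using an ordinary least squares fit of a first order polynomial per truck could look as follows (np.polyfit stands in here, the thesis does not name the routine that was actually used).

```python
import numpy as np

def fit_line_per_truck(trucks):
    """Least squares straight line fit for every truck.

    trucks : list of (distance_km, mileage_kmpl) array pairs.
    Returns one (slope, intercept) pair per truck.
    """
    fits = []
    for dist, mileage in trucks:
        slope, intercept = np.polyfit(dist, mileage, deg=1)
        fits.append((slope, intercept))
    return fits
```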

[Figure 4.4: Straight line fits of fuel mileage [km/l] over distance driven [km]]

4.2 Application of PACE

The goal of this section is to elaborate on the application of the PACE method to the truck data, focusing only on fuel consumption per kilometer over the distance axis. Along with the results of this first application, some options available for fine-tuning the method will be presented and a general estimate of variability will be given.

4.2.1 Baseline PACE Results

The data in use for this initial run of the PACE method is the cleaned set, with all trucks removed which have fewer than 2 observations. Additionally, every observation that happened before a threshold of 10000 km has been removed. The PACE method has some interchangeable sub-methods. For the baseline results, mostly the same parts as in the original method described in [4] were used. Thus, the kernel used for smoothing the mean function is the Epanechnikov kernel [4] and the input data is assumed to contain measurement errors.

A small discrepancy to the original method is the choice of using the Fraction of Variance Explained¹ (FVE) instead of the Akaike Information Criterion [1] (AIC) to select the number of PCs. The FVE threshold is set at 95% of variance explained.

Regarding Figure 4.5, the smoothed mean curve should be taken with a grain of salt; especially the variance plots and the measurement density plots in Figure 3.3 should be considered. The number of PCs selected by FVE is 8, which accounts for 96.57% of the total variation. The scree plot (Section 2.1.4) of the principal components from this analysis can be seen in Figure 4.6. The first, strong principal component is almost a straight line, which basically shifts the mean from its starting point closer to the position of the measurements. The second and the fourth principal components seem to serve partially as correctives for trucks with a higher initial fuel economy than the average truck. The smoothed covariance matrix generated and used by PACE is visualized as a color matrix in Figure 4.7.

¹ The sum of the eigenvalues of a certain number of eigenfunctions divided by the sum of all eigenvalues.
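Written out with the estimated eigenvalues λ_i of the functional principal components, this selection rule amounts to choosing the smallest K with

\mathrm{FVE}(K) = \frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i} \lambda_i} \geq 0.95,

which for the baseline run yields K = 8.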

[Figure 4.5: Smooth mean curve and the first principal components (55.72%, 11.88%, 8.65% and 4.03% of the variance), plotted over distance driven [km]]

[Figure 4.6: Scree plot of the fraction of variance explained (%) against the number of principal components]

[Figure 4.7: Smoothed covariance matrix over distance driven [km]]


Figure 4.8: These plots exhibit the mean curve (red), the corresponding original observations (green) and the reconstructed curve (blue). Vehicles 14 and 106 have high values on all major PC scores, with opposite signs. Number 92 has the lowest PC scores overall; trucks 72 and 4 have average PC scores. High PC scores lead to extreme values, especially on the strong first PC.

From the estimated PCA scores, the mean function µ and the principal component functions, the individual traces of the trucks can be reconstructed, which should give a rough estimate of the behavior of the truck. A number of selected reconstructions can be viewed in Figure 4.8 and a collection of all traces and the original measurements can be seen in Figure 4.9.

As a next step, for an analysis of the results, the goodness-of-fit of the original measurements versus the reconstructed traces is assessed. To estimate the goodness-of-fit, the mean squared error [1] between the discrete observations and the estimated reconstruction is considered. However, the irregular measurement intervals make an assessment of the results difficult.


Figure 4.9: This graph shows all reconstructed traces (gray) and original measurements (blue). Note how the traces tend to follow the observations, especially when the relative occurrence of observations is low.

[Figure 4.10: Reconstructed traces of misfitted trucks (vehicles 73, 102, 106 and 202)]

Method                     Max. MSE   Mean MSE   Median MSE   Std. MSE
Mean MSE per Truck         0.189%     0.0343%    0.0209%      0.0383%
Median MSE per Truck       0.238%     0.0215%    0.0096%      0.0331%
All Observations Pooled    0.679%     0.0310%    0.0089%      0.0629%

Table 4.1: MSE of the reconstructed traces by PACE and the original observations with 8 PCs. In the last column, the standard deviation of the MSE is given.

One way to get reliable error measurements is to use the median of the individual MSE as the error measure. A good example of a bad fit is truck #102 (Figure 4.10), which is, when the median MSE is used, the third worst fitting truck, in contrast to the mean MSE, where the truck is ranked 63rd.

A counter-example is provided by vehicle #202, which is ranked 3rd using the median and 19th with the mean MSE. In this example, one of the observations is a strong outlier, which influences the median MSE because of the low number of observations of this truck.

Because both measurement methods have their respective merits, both are used for judging the fit of the individual trucks. In addition to these two methods, which view the trucks as separate entities, all truck observations are also pooled and the overall MSE is given. The results can be seen in Table 4.1.
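A sketch of how these three summaries can be computed, under the assumption that the reconstructed trajectory has already been evaluated at each truck's observation points; the names and data layout are illustrative, not from the thesis.

```python
import numpy as np

def summarize(errors):
    """Max, mean, median and standard deviation of a set of error values."""
    e = np.asarray(errors)
    return e.max(), e.mean(), np.median(e), e.std()

def mse_tables(observed, reconstructed):
    """observed, reconstructed : lists of per-truck 1-D arrays of equal length.

    Returns the three summaries used in the MSE tables: squared errors
    aggregated per truck by their mean, per truck by their median, and
    pooled over all observations.
    """
    sq_err = [(o - r) ** 2 for o, r in zip(observed, reconstructed)]
    mean_per_truck = [e.mean() for e in sq_err]
    median_per_truck = [np.median(e) for e in sq_err]
    pooled = np.concatenate(sq_err)
    return summarize(mean_per_truck), summarize(median_per_truck), summarize(pooled)
```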

After the establishment of these baseline results, various parts of the PACE method can be changed to see their influence on the results.

4.2.2 Number of Principal Components


Figure 4.11: These plots show how much reconstructed traces vary with different numbers of PCs involved.

Method                     Max. MSE    Mean MSE   Median MSE   Std. MSE
Mean MSE per Truck         0.2653%     0.0451%    0.0264%      0.0514%
Median MSE per Truck       0.255%      0.0296%    0.0139%      0.0430%
All Observations Pooled    0.729%      0.0421%    0.0121%      0.0818%

Table 4.2: MSE of the reconstructed traces with 3 PCs (76.69% variance retained).

Since the only difference to the baseline result is the number of PCs, graphs of the mean function and of the PCs themselves will be omitted. Only the MSE tables and the reconstructed trajectories of selected trucks will be shown. For the baseline table, see Table 4.1, and for a comparative visualization of reconstructed trajectories see Figure 4.11.

As expected, the MSE results from the variations (Tables 4.2, 4.3, 4.4) behave analogously to the scree plot visible in Figure 4.6.

Method                     Max. MSE    Mean MSE   Median MSE   Std. MSE
Mean MSE per Truck         0.233%      0.0405%    0.0247%      0.0453%
Median MSE per Truck       0.263%      0.0260%    0.0110%      0.0401%
All Observations Pooled    0.800%      0.0374%    0.0102%      0.0756%

Table 4.3: MSE of the reconstructed traces with 4 PCs.

Method                     Max. MSE    Mean MSE   Median MSE   Std. MSE
Mean MSE per Truck         0.179%      0.0319%    0.0196%      0.0358%
Median MSE per Truck       0.220%      0.0200%    0.0084%      0.0309%
All Observations Pooled    0.635%      0.0286%    0.0081%      0.0578%

Table 4.4: MSE of the reconstructed traces with 29 PCs (99.99% variance retained).

When using a lower number of PCs, the error increases, whereas a high number of principal components does not necessarily improve the error performance much. This means the fraction of variance retained, as shown in the scree plot, is directly reflected in the size of the MSE.

4.2.3 Error Assumptions in PACE

There are two possibilities to tune the behavior of PACE regarding “measurement errors”:

• The assumption that the observations contain no "measurement errors".

• In addition to the presence of "measurement errors", the estimated errors are cut off at the quartiles for the estimation of the error variance σ.

The notion of "measurement errors" in this context is a bit misleading, as PACE assumes an underlying smooth function. The accumulative fuel consumption data is precise enough, but the variation of the observations around this smooth function can be considered as noise. The assumptions on measurement error mostly influence the calculation of the PC scores.


Method                     Max. MSE    Mean MSE   Median MSE   Std. MSE
Mean MSE per Truck         0.173%      0.0316%    0.0199%      0.0350%
Median MSE per Truck       0.214%      0.0195%    0.0087%      0.0298%
All Observations Pooled    0.628%      0.0286%    0.0081%      0.0584%

Table 4.5: MSE of the reconstructed traces with 8 PCs. For the estimation of the error variance, all data outside the quartiles were cut off.


Figure 4.12: Reconstructed traces of selected trucks with no measurement error assumed. The influence of this assumption can be seen clearly in vehicle #105, where PC scores are maximized to fit at the observation points.

Baseline PACE with error cut off:

Table 4.5 shows that the MSE with error cut-off is almost as small as the MSE with 29 PCs, which is a clear improvement over the baseline. Basically, the additional cut-off leads to fewer outliers, which seems to improve the performance in comparison with the baseline results.

Using PACE under the assumption of zero measurement error:


Figure 4.13: This figure shows the effects of using different kernels for smoothing the mean curve µ. The Gaussian kernel produces a very smooth mean curve, whereas the rectangular kernel picks up noise from the measurements. The Epanechnikov kernel produces a compromise between these kernel variants.

4.2.4 Different Kernel Functions

Usually, the Epanechnikov kernel [4] is the standard choice for the smoothing steps in the PACE method. This kernel function has compact support and definite endpoints. Alternative choices of kernels are rectangular and Gaussian kernels [4]. Whereas the rectangular kernel has definite endpoints, the Gaussian kernel extends to infinity. For the smooth mean curve and the principal components, the rectangular kernel has the effect of adding some noise to the curves, whereas the Gaussian kernel has stronger smoothing properties. In Figure 4.13 all three mean curves and in Figure 4.14 the three most significant PC curves are visible.
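For reference, the standard forms of these three kernels, on the rescaled argument u = (x − t)/h with bandwidth h, are

K_{\mathrm{Epa}}(u) = \tfrac{3}{4}\,(1 - u^2)\,\mathbf{1}_{\{|u| \le 1\}}, \qquad K_{\mathrm{rect}}(u) = \tfrac{1}{2}\,\mathbf{1}_{\{|u| \le 1\}}, \qquad K_{\mathrm{Gauss}}(u) = \tfrac{1}{\sqrt{2\pi}}\, e^{-u^2/2}.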

In comparison, the overall MSE of the pooled data is slightly higher for PACE with a rectangular kernel than it is with an Epanechnikov kernel (0.0351% mean, 0.0113% median in the rectangular case versus 0.031% mean, 0.0089% median with the Epanechnikov kernel). In the Gaussian case, the fit is worse than with the other kernels (0.0468% mean, 0.0162% median).


Figure 4.14: This Figure shows the effect of using different kernels for smoothing on the PCs. The order of the PCs is visualized by the thickness of the lines, i.e. the thickest line depicts the first principal component. Generally, the same observations as in Figure 4.13 apply.

4.2.5 Variances

There are two different variances in the results: model variance and data variance. As the different names indicate, the variances result from different sources and therefore must be handled differently.

The model variance is based on the question of how sure we are of a model. One way to assess this variance is to do leave-one-curve-out cross validation on the smooth mean curve. This enables a visualization of how much influence a single curve has on the overall result of the mean curve or the principal components.

The data variance represents the density of measurements in a certain part of the curve. For calculating the variance and the confidence interval of a certain part of the curve, the number of trucks that influence a part of the curve has to be given. There are two different approaches to this:


Another approach is the use of reconstructed curves as a basis for calculating the variances. There are two different implementations of this approach. Either the reconstructed curves are taken into account only within the interval of their real observations, i.e. only observations which are relevant for this particular interval are incorporated, or the complete reconstructed curves are used, which ignores the number of real observations in a part of the curve.

4.2.5.1 Model Variance

The data used for this experiment is generated by PACE using 8 principal components, with every observation that happened before the truck had run 1000 kilometers removed. The method used to generate the necessary data to analyze the model variance is leave-one-curve-out cross validation. This validation method generates a PACE result excluding one truck at a time from the data. Thus, there are as many PACE results generated as there are trucks.

Model variance gives us two results. The first result is a ranking of the trucks most influential on the mean curve or on a PC. This result is found by comparing the PACE result which excludes a particular truck with the overall PACE result.

Additionally, the model variance gives the distribution and variation of a particular result from PACE. In Figure 4.15 the very peaky distribution of all the mean curves at different points can be seen. Figure 4.16 shows all leave-one-out mean curves, the overall mean curve µ and the standard deviation σ curves. The average deviation of σ from µ is 0.0016 km/l, the maximal deviation 0.0039 km/l. Figure 4.17 is a fuel consumption plot of interesting vehicles with regard to their influence on the mean curve or on the PCs.

4.2.5.2 Data Variance


Figure 4.15: The distribution of all mean curves generated with the leave-one-curve-out method at various points. Two properties of the mean curves are visible, namely the peakiness of the distribution and the higher deviation from the mean at 50000 km and at 750000 km.

[Figure 4.16: Graph of all leave-one-out mean curves, fuel consumption [km/l] over distance [km]]


Figure 4.17: Plot of trucks with a high influence on the results of PACE. Trucks 43 and 15 have a strong influence on the µ curve since they provide data at the end of all observations, where data is very sparse. Vehicle 88 is the truck with the smallest influence on µ. It has both a small observation duration and average measurements. Truck #23 has the highest influence on the first PC and truck #185 has the smallest influence on the first PC.

real data support is existent. Both results can be seen in Figure 4.18. Both methods deliver a similar result. The main difference is the resolution of the result based on PACE, which is much higher. However, unlike the binning results, the estimated data between the observations is also incorporated into the variance results, which means that regions with low data support are also represented in the variance.

4.3 Prediction of Fuel Consumption with PACE

Prediction in this case essentially is the usage of the reconstructed trajectories from the PC scores to estimate the fuel consumption of a truck at a certain point.³

As the baseline to measure the effectiveness of the prediction, the value of the last available measurement will be used as the predicted value. This straight assumption works well on the accumulative data because the fuel consumption usually develops in an almost straight line.

³ If the data — unlike the available truck data — is not open ended, an alternative to the direct


Figure 4.18: The standard deviation extracted from the binned data is visible on the left. The right graph shows the standard deviation of the data, reconstructed from the observation duration and the trajectories regenerated from the PC scores of PACE.

For testing the prediction of new observations, the last observation of the truck to predict is removed from the data, and the PACE results are calculated without it. The prediction at the time of the removed observation is taken from these results. This procedure is done for each available truck.
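A sketch of this leave-last-observation-out test for the straight line baseline; the PACE based prediction would instead evaluate the trajectory reconstructed without the held-out point. The data layout is an assumption for illustration.

```python
import numpy as np

def last_value_errors(trucks):
    """Relative error when the held-out last observation of each truck is
    predicted by the value of its previous observation.

    trucks : list of (distance_km, mileage_kmpl) array pairs, sorted by distance.
    """
    errors = []
    for dist, mileage in trucks:
        if len(mileage) < 2:
            continue                    # nothing to hold out
        predicted = mileage[-2]         # "straight line" last value prediction
        actual = mileage[-1]            # held-out last observation
        errors.append(abs(predicted - actual) / actual)
    return np.array(errors)
```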

In this general prediction test, straight line prediction produces a maximum error of 5.04% and a mean error of 0.58% with a standard deviation of 0.81%, whereas using the reconstructed trajectories for prediction produces a maximum error of 5.49% and a mean error of 1.25% with a standard deviation of 1.06%.

These results emphasize the straight nature of the data. In general, these results show that it is better to assume steady continuing fuel consumption behavior for forward prediction.


Using the reconstructed trajectories directly for prediction is affected by the assumption of the presence of a measurement error – i.e. a basic underlying deviation even at the points with known observations, and the bad fit which usually occurs when dealing with outliers. However, given the relatively constant measurements of individual trucks and the preexisting error between the actual observations and the trajectories, the prediction works and is quite stable regarding the removal of observations.

4.4 Detection of Outliers with PACE

The main idea behind outlier detection with PACE, in particular with the PC scores, is to be able to quantify how normal and likely the fuel consumption behavior of a single truck is.

As the first step in quantifying this probability, the distribution of the PC scores has to be found. In this case, the scores are normally distributed, which can be seen in Figure 4.19. This makes the calculation of probabilities for a single PC score easily possible. By using the probabilities from just the first principal component, the same outliers as with simple feature extraction (Section 4.1.2) can be found. Basically, the same outliers can be found by just using the raw PC scores.

However, if the probabilities of several PCs are calculated, it is possible to calculate the “normality” of a truck. The distribution of these probabilities, with a varying count of PCs used can be seen in Figure 4.20. In Figure 4.21 a few example fuel consumption plots of trucks along with their normality can be seen. For these samples, the weighted four first PCs were used.
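The following sketch shows one way such a normality score could be computed, assuming that the two-sided tail probability of each score under a fitted normal distribution is used and that the eigenvalue weighting enters as exponents of a product; neither choice is spelled out in the thesis, so this is an illustrative interpretation only.

```python
import numpy as np
from scipy.stats import norm

def normality_scores(pc_scores, eigenvalues=None):
    """Combine per-PC probabilities into a single normality score per truck.

    pc_scores   : (n_trucks, n_pcs) array of estimated PC scores
    eigenvalues : optional per-PC weights, e.g. the eigenvalues
    """
    n_trucks, n_pcs = pc_scores.shape
    weights = (np.ones(n_pcs) if eigenvalues is None
               else np.asarray(eigenvalues) / np.sum(eigenvalues))
    probs = np.ones(n_trucks)
    for j in range(n_pcs):
        mu, sigma = pc_scores[:, j].mean(), pc_scores[:, j].std()
        # Two-sided tail probability under the fitted normal distribution:
        # close to 1 for scores near the centre, close to 0 for extreme scores.
        p = 2 * norm.sf(np.abs(pc_scores[:, j] - mu) / sigma)
        probs *= p ** weights[j]
    return probs
```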

[Figure 4.19: Normal probability plots of the scores of the first four PCs]


Figure 4.20: These histograms show the likelihoods for the occurrence of a single truck with different counts of PCs used for the calculation (1 PC, 3 PCs, 4 PCs, and 4 PCs weighted). When multiple PCs are used, the result is the product of the probabilities of all principal components. In the rightmost histogram the likelihoods are weighted by the eigenvalues of the PCs.

(Figure 4.21: example fuel mileage plots over distance for individual trucks with their normality probabilities: Vehicle #88 78.84 %, #92 75.60 %, #4 50.74 %, #72 53.48 %, #6 25.93 %, #16 18.20 %, #14 2.36 % and #106 11.84 %.)


Figure 4.22: This figure shows the mean curve of the average vehicle speed, the first four PCs (explaining 43 %, 16 %, 13 % and 7 % of the variance) and a scatter plot of all available observations. The mean curve is an indicator that trucks with a high odometer count have a higher average speed.

4.5 Expansion of our Application

As an example of the application of PACE to other data, the average vehicle speed is used. Furthermore, in this section the PACE method is used on cyclic fuel consumption data, even though PACE was developed for longitudinal data.

The results of PACE on the average vehicle speed can be seen in Figure 4.22. Compared to the results from the fuel consumption data, the speed data shows a similar distribution of the observations.


Figure 4.23: The left figure shows the mean fuel consumption when the fuel consumption is observed over the year. The peak at the very beginning and the very end of the year is probably caused by the lack of observations at this time of the year. On the right, the first four PCs are shown. The first PC is a linear offset, whereas the second and the third components probably show seasonal effects.
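A minimal sketch of the re-indexing step behind this cyclic view is given below, in Python. The file name and column names are hypothetical; the only point is that each readout is mapped from its calendar date onto the day of the year before PACE is applied.

```python
import pandas as pd

# Hypothetical raw records: one row per readout with a timestamp and the fuel mileage.
obs = pd.read_csv("lvd_observations.csv", parse_dates=["readout_date"])

# Map each observation onto a cyclic time axis (day of the year, 1-366)
# instead of the open-ended odometer axis used in the earlier sections.
obs["day_of_year"] = obs["readout_date"].dt.dayofyear
cyclic = obs[["truck_id", "day_of_year", "fuel_mileage"]]
```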


(Example cyclic fuel consumption plots over the day of the year for Vehicles #121, #192, #190 and #6.)


5 Discussion

A natural perspective for a continuation of this work would be the application of the methods to different datasets, especially to more specialized ones. For example, data of a similar quantity to the data used here, but from a corporate fleet of quasi identical trucks which are in service within the same climate zone, with similar loads, etc. Such data would likely be better suited for research on detecting trends, as well as for the detection of trucks with unexpected behavior, i.e. outliers. An example of such a dataset would be data from Scandinavian long distance trucks, where the differences in fuel consumption between summer and winter should be clearly visible. To further the research on seasonal variation of fuel consumption, an expansion of PACE to cyclic data might be useful.

Furthermore, the analysis of data containing more observations might be interesting, as many small underlying influences in the fuel consumption data could be uncovered. With such datasets, research on the asymptotic properties, as well as on the distribution of the data, would be more useful than with the small amount of mixed data at hand. With denser data, it might also be viable to switch to calculating fuel consumption based on the amount of fuel used between two observations instead of the fuel amount consumed since the truck was manufactured.
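A minimal sketch of this suggested interval-based calculation is shown below, in Python, assuming per-truck arrays of the cumulative odometer and total fuel counters; the example numbers are made up.

```python
import numpy as np

def interval_mileage(odometer_km, total_fuel_l):
    """Fuel mileage [km/l] between consecutive observations, computed from
    the cumulative odometer and cumulative fuel counters of one truck."""
    d_km = np.diff(np.asarray(odometer_km, dtype=float))
    d_l = np.diff(np.asarray(total_fuel_l, dtype=float))
    return d_km / d_l          # one mileage value per interval between readouts

# Hypothetical readouts:
# interval_mileage([500000, 520000, 545000], [200000, 207700, 217300])
# -> array([2.5974..., 2.6041...])
```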


6 Conclusion

Sparse and irregular data is hard to analyse with multivariate and functional statistics. The steps which make analysis with those two approaches hard are feature extraction and data interpolation. Feature extraction needs a careful selection of relevant features, and the manual work this involves is not very desirable. The main problem with interpolation of this data is its open-ended nature: outside the given observations for a single truck, it is very difficult to estimate a function without knowledge of the underlying model.

However, it is possible to analyse such data in a functional way if the Principal Components Analysis through Conditional Expectation (PACE) method is used. If the Gaussian assumptions made by PACE are acceptable, the method provides a completely data centric approach to extract a mean curve and principal components from the data, as well as complete trajectories regenerated from the principal component scores of the individuals.

These results can be used as a basis for further analysis, such as classification and regression. While these tasks can also be approached with feature extraction, PACE uses all available data and is largely non-parametric. Also, the functional approach of PACE keeps the data in a more natural format than the abstract extracted features of the individuals.

Most of the variation in the data used in this work (long-distance articulate truck fuel consumption data) can be captured with a small number of principal components.


However, the data does not contain highly significant general trends or easily separable clusters. Some outlying individuals are contained in this data, but because of the meta-data nature of the fuel consumption, it is not possible to distinguish between a possible truck fault and environmental influences.

Fuel consumption is difficult to predict, as it can change very rapidly when the environment changes. The available truck data has no definitive start or end; samples were taken at arbitrary times and are connected only by the truck configuration. Thus, prediction for this data can only give an educated guess about the fuel consumption based on the data from the other trucks, not for the individual truck itself.




List of Abbreviations

LVD . . . Logged Vehicle Data
EECU . . . Engine Electric Control Unit
FDA . . . Functional Data Analysis
FD . . . Functional Data
PCA . . . Principal Component Analysis
PC . . . Principal Component
PCs . . . Principal Components
PACE . . . Principal Component Analysis through Conditional Expectation
MSE . . . Mean Squared Error
AIC . . . Akaike Information Criterion
FVE . . . Fraction of Variance Explained
