
Modeling of fuel consumption in a road network

ZEHUA CHEN

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Modeling of fuel consumption in a road network

ZEHUA CHEN

Master in Machine Learning
Date: June 27, 2020

Supervisor: Ying Liu (KTH), Georg von Zedtwitz-Liebenstein (Scania Group)

Examiner: Erik Fransén

School of Electrical Engineering and Computer Science

Host company: Scania Group


Abstract

Fuel consumption accounts for a large portion of the operational cost for logistics companies, and hence building an accurate fuel prediction model is of key importance. Recently, machine learning methods have been widely used in this area, and data like historical GPS data, road conditions and weather conditions are involved when building such models.

This study aims at investigating the possibility of replacing the road condition features with an index that is constructed by aggregating the data collected and maintained by Scania. Two normalization methods are used for building the fuel consumption index, and some commonly used models, including Support Vector Machine, Random Forest and Gradient Boosted Machines, are trained and evaluated both with and without it. The experimental results show that the Random Forest model outperforms the others in most cases. By comparing the results with previous studies, we can see that replacing road conditions with a fuel consumption index can lead to almost the same performance of the machine learning models.

To guarantee the reliability of this index, approximately 4000 to 5000 samples are expected for each road segment; this is however not realistic for many of them. When predicting fuel consumption for a given route, the more road segments with adequate samples it contains, the higher the predictive accuracy we can expect.


Sammanfattning

For logistics companies, fuel consumption accounts for a large part of the operating cost, and it is therefore important to build an accurate model for fuel prediction. In recent years, machine learning methods have been widely used in this area, and data such as historical GPS positions, road conditions and weather conditions are involved when building these models.

This study aims to investigate the possibility of replacing the features describing the road conditions with an index constructed from the data collected and maintained by Scania. Two normalization methods are used for building the fuel consumption index. A set of common models including the Support Vector Machine, Random Forest and Gradient Boosted Machines are trained and evaluated both with and without normalization.

The experimental results show that the Random Forest model outperforms the others in most cases. By comparing the results with previous studies, we can see that replacing the road condition features with a fuel consumption index can lead to almost the same performance of the machine learning models.

To guarantee the reliability of this index, approximately 4000 to 5000 samples are needed for each road segment, which is however not realistic for many of the segments. When predicting the fuel consumption for a given route, the more road segments with adequate data samples it contains, the higher the expected prediction accuracy.


Acknowledgements

I want to thank Georg von Zedtwitz-Liebenstein, who is my supervisor at Scania, as well as the entire KSEA group, for providing me all the necessary resources and guidance through the whole process of my thesis project. Thanks to my academic supervisor Ying Liu at KTH for all the help I received, including the academic instructions that allowed the research to be conducted in an appropriate manner, and also the help regarding my student visa. Thanks to my examiner Erik Fransén, for examining my thesis report and helping me extend my student visa.

I also want to give thanks to my fellow thesis workers Joar Nykvist and Mengdi Xue for all the time we spent together solving problems, having lunch and on the way home.

Finally, thanks to my parents, for the tremendous support I have received from you, both financially and morally; that is why I can live and study thousands of miles away from my home to pursue a promising future.


1 Introduction 1

1.1 Research Question . . . 2

1.2 Objective . . . 2

1.3 Scope . . . 2

1.4 Ethics, Sustainability and Social Aspects . . . 3

2 Background 4
2.1 Related works . . . 4

2.1.1 Fuel consumption prediction . . . 4

2.1.2 Map-matching methods . . . 5

2.2 Hidden Markov Model . . . 6

2.3 Machine Learning . . . 8

2.3.1 Support Vector Regression . . . 8

2.3.2 Tree Based Methods . . . 10

3 Methods 15
3.1 Map matching . . . 15

3.1.1 HMM based map-matching method . . . 15

3.2 Normalization method . . . 17

3.3 Performance metrics . . . 18

3.3.1 Mean Absolute Error . . . 18

3.3.2 Root Mean Squared Error . . . 18

3.3.3 Mean Absolute Percentage Error . . . 19

3.4 Feature engineering . . . 19

3.4.1 Pearson Correlation . . . 20

4 Data collection and processing 21
4.1 Fleet Management System Data . . . 21

4.2 Weather Data . . . 24

4.3 Vehicle configurations . . . 26


4.4 Road network model . . . 26

4.4.1 Fuel consumption index . . . 27

5 Results 29
5.1 Data exploration . . . 29

5.1.1 Fuel Consumption index . . . 29

5.1.2 Final Dataset . . . 30

5.2 Training Results . . . 31

5.2.1 Support Vector Machine . . . 32

5.2.2 Random Forest . . . 33

5.2.3 Gradient Boosted Machine . . . 34

5.3 Exploratory experiments . . . 36

5.3.1 Experiments on new dataset . . . 37

5.3.2 Experiments on refined dataset . . . 37

5.3.3 Experiments of feature selection . . . 39

6 Discussion 43
6.1 Experiment results . . . 43

6.1.1 Map Matching Server . . . 43

6.1.2 Fuel Consumption Index . . . 44

6.1.3 Performance and Features . . . 45

7 Conclusions 47
7.1 Future works . . . 48

7.1.1 Data Quality . . . 48

7.1.2 Feature Engineering . . . 49

7.1.3 Machine Learning Model . . . 50

Bibliography 51

A Data structure 54


Introduction

Generally, when people open Google Maps and search for an optimal route from point A to point B, the only thing they care about is the travelling time. But things become more complicated when it comes to long-haul transportation in the logistics industry: there are other factors that need to be considered, such as the locations of workshops and the fuel consumption, which is one of the three major sources of cost for truck owners (the other two are truck maintenance and drivers). Saving fuel cost is always of key importance for the profitability of a company and for reducing air pollution, and both truck manufacturers and operators are dedicated to this goal.

This project is conducted at Scania Group, which is one of the world's leading truck manufacturers and traffic solution providers. Scania trucks have excellent performance regarding fuel economy, and it is worth investigating whether there are any other methods, beyond the trucks themselves, that can be used to reach higher fuel efficiency. Here an interesting assumption is raised: the route that saves the most time may not be the optimal option when fuel consumption is considered.

Building models for accurate and reliable predictions of a vehicle's fuel consumption is a key capability for Scania in the area of connected services and solutions. Fuel consumption prediction requires a rich set of input data to describe the vehicle, its operation and the surrounding environment. Data regarding the vehicle's configuration and its operation is core business for Scania. But when it comes to describing the surrounding environment, such as the road inclination and road traffic speed, Scania is currently relying on data from external suppliers, which can be both costly and inconvenient for many reasons.


The main area that this study involves is machine learning, which may be the most remarkable branch of artificial intelligence in the past decade. Machine learning refers to algorithms that can learn from past experience and give predictions about the future. It can be further divided into subfields including supervised learning, unsupervised learning and reinforcement learning. In this study, we will focus on supervised learning methods.

1.1 Research Question

This study aims to build an index based on historical data to indicate how much each road segment contributes to the fuel consumption, and to evaluate machine learning methods for fuel consumption prediction based on this index. The questions to be examined in this project include:

1. Is it possible to build a fuel consumption index to replace the road conditions in fuel prediction models? If so, how?

2. Which machine learning model based on this fuel consumption index performs best?

3. What are the strengths and weaknesses of this method?

1.2 Objective

This thesis sets out to provide an alternative solution to the modeling of the road network's effect on the vehicle's fuel consumption, in order to get rid of the dependence on third-party map data suppliers. To be specific, this study will build machine learning based fuel consumption prediction models which do not need road conditions as input parameters; this method can significantly benefit stakeholders like transportation companies.

1.3 Scope

Since the main purpose of this study is to explore the possibility of building a fuel consumption index, the methods and data used here will be limited for the sake of time and resources.


As will be stated in later chapters, a large number of repetitive records is required to guarantee the quality of the models, so the data will only be selected from a small area in Sweden. Previous research compares models built on data sampled at a 1-minute frequency and at a 10-minute frequency separately. The data used in this study has sampling intervals ranging from 5 to 10 minutes, and the influence of the sampling frequency is excluded from consideration. Finally, only supervised machine learning methods will be investigated here; others like semi-supervised learning and reinforcement learning are excluded. Deep learning methods are also out of the scope of this paper.

1.4 Ethics, Sustainability and Social Aspects

The major ethical concern is the privacy of the data. The data used in this study is collected from various sources, including GPS records and configuration data from the connected vehicles and weather data from SMHI. The SMHI data is open and free to use for any research purpose [1]. When collecting the data from vehicles, Scania guarantees that the whole process is in compliance with the local laws and that the truck owners are aware of it. In this study, as a thesis worker I only have access to the data that is necessary; data regarding the private information of drivers and companies is not accessible.

The main motivations behind this project are finding environmentally friendly and economical transport solutions. Figuring out the optimal route with respect to fuel consumption is beneficial for the truck owners, since a large amount of expenditure can be saved by doing so. Reducing the consumption of fossil fuel is also good for the environment; this is becoming increasingly important against the background of global warming, which has become one of the top concerns for the whole society nowadays.


Background

For truck manufacturers like Scania, who hold a large amount of historical operational data, it is meaningful to make the most of that data. Scania does a lot of research regarding mining hidden information from the data and building fuel prediction models, and recent work has turned from simulation methods to machine learning models. In these machine learning-based methods, a rich set of input data is required to train the models, describing the vehicle, its operational details and the surrounding environment.

Data regarding the vehicle's configuration and its operation is core business for Scania. But when it comes to describing the surrounding environment, such as the road inclination and road traffic speed, Scania currently relies on data from external suppliers, which can be both costly and inconvenient for many reasons. This study sets out to provide a possibility to get rid of the dependence on those external suppliers.

2.1 Related works

2.1.1 Fuel consumption prediction

Since fuel consumption is always a key factor for the profitability of any commercial transportation activity, a lot of related research has been done since the last century.

Predicting fuel consumption is complicated in many respects, since it depends on quite many parameters, both internal and external. The early studies tend to simplify the problem, for example by considering the relationships between only a subset of the parameters and the fuel consumption [2, 3], and to focus on statistical methods. There are also many limits imposed on the scenarios, like the type of vehicles or the working environments. These studies leave us many valuable results and show how each parameter can affect the fuel consumption.

However, these simplified models can hardly be adopted widely due to the complexity of real-world conditions. To get a more generalized model, we need to model as many parameters as possible.

While mathematicians tend to use statistical methods, those who are familiar with vehicle engineering have created simulation-based methods [4].

Scania has a mature simulation-based solution for fuel consumption prediction and is always looking to refine this kind of system. Other thesis workers have done research projects at Scania before, and the possibilities for improvement are fully discussed in [5].

The simulation methods require a profound insight into the vehicle configurations. To make the simulation system work accurately, many descriptive parameters have to be provided. The system builds separate models to measure each vehicle subsystem's contribution to the fuel consumption and then integrates them. This process can be slow due to the tremendous number of parameters and is highly vulnerable to variance.

Machine learning is another commonly used method in this area. Some thesis workers have done excellent work on similar topics at Scania [6, 7, 8].

All these studies use traditional machine learning methods including Support Vector Machine, Artificial Neural Networks and some tree-based methods.

The latest work comes from Henrik [8], who did experiments and made comparisons on four regression models including linear regression, support vector machine, artificial neural network and random forest. He also offered some valuable tips for building the training data set. He concludes that the random forest has the best performance in terms of the prediction variance.

Besides, he has some interesting findings, such as that data sampled at a 10-minute frequency is better for training than 1-minute frequency data due to the lower variance, and that data on driver behavior does not contribute to the final prediction.

The previous works have built a complete set of methodologies for building machine learning based fuel consumption prediction models. This study will not try to seek innovation in any machine learning method, but rather focus on more practical perspectives for the sake of the stakeholders, as stated before.

2.1.2 Map-matching methods

An important technique involved in this work is map-matching, which is a prerequisite for the later machine learning research. Previous studies use existing systems like DigitalReality provided by Volkswagen. But such a system is not available in this study for various reasons, so we will turn to open source solutions here.

A lot of effort has been made to figure out a way to map massive GPS points to real road segments. Lili Cao and John Krumm propose a two-step algorithm to build a road network from massive GPS points [9]. They first filter out points with significant noise and then apply a simple road generation algorithm. However, this method needs quite a large number of recorded points, which is not applicable in this project; moreover, the final topological model we get in this way is only an approximation of the real road model.

Another popular method to solve this problem is the map-matching algorithm. Map matching is the problem of, given a sequence of GPS coordinates, finding the roads they belong to in the real world. The easiest way to achieve this is to choose the road with the minimum distance from the point, but this is very error prone due to the errors in the records and the complexity of the road network.

While many geometrical algorithms perform quite well in many cases, there is still the potential risk of failure because of the sparseness of the data.

The most commonly used probabilistic model for map matching is the Hidden Markov Model. Paul Newson and John Krumm described such an HMM model that accounts for the measurement noise and reaches a reasonable balance between the GPS positions and the feasibility of the path [10]. In this method, they treat the potential road segments as hidden states and the GPS points as the observations.

BMW Car IT GmbH designed and implemented an online map-matching system based on the Hidden Markov Model. This system provides both online and offline map-matching functionalities, and the code is fully open source and available online.

2.2 Hidden Markov Model

The Hidden Markov Model can be seen as an extension of the Markov Model, which is a powerful probabilistic model that describes the relationships between a sequence of random variables (states) [11]. Generally, when predicting the next state in the sequence, we may intuitively think that all the previous states matter, which would make the model extremely complicated as the sequence grows. A strong assumption that the Markov Model makes is that only the present state contributes to the prediction of the next one; the past can simply be ignored.


$$P(s_i \mid s_{i-1}, \ldots, s_2, s_1) = P(s_i \mid s_{i-1}) \qquad (2.1)$$

The Hidden Markov Model is nowadays widely used in sequential data processing, like speech recognition and natural language processing for example.

In recent years, some deep learning models like RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) have shown better performance on such tasks, while HMMs can still contribute to many traditional probabilistic problems.

Figure 2.1: A simple example of a Hidden Markov Model. $S = s_1, s_2, \ldots, s_n$ are the hidden states, $a_{ij}$ is the transition probability and $b_i(o_t)$ is the emission probability.

Table 2.1: Formulation of Hidden Markov Model

Symbol | Explanation
$S = s_1, s_2, \ldots, s_n$ | A sequence of hidden states.
$A = a_{11}, \ldots, a_{ij}, \ldots, a_{NN}$ | The transition probability matrix, where $a_{ij}$ represents the probability of changing from state $i$ to state $j$.
$O = o_1, o_2, \ldots, o_T$ | A sequence of observations.
$B = b_i(o_t)$ | The emission probability matrix, where $b_i(o_t)$ represents the probability that $o_t$ is observed when the present hidden state is $s_i$.
$\pi = \pi_0, \pi_1, \ldots, \pi_n$ | The probability distribution of the initial state, where $\pi_i$ represents the probability that the Markov process starts from state $i$ and $\sum_{i=1}^{n} \pi_i = 1$.

The Markov process assumes that all the states are observable, while in real life many of them are actually not. The Hidden Markov Model introduces a sequence of hidden states, which means they cannot be observed directly. Correspondingly, there is a sequence of observable objects which are called observations. A Hidden Markov Model is formulated in Table 2.1 [11], and Figure 2.1 shows an example of an HMM. Generally, the Hidden Markov Model can be used to solve three classical problems [11]:

1. Evaluation Problem: Given an HMM $\lambda = (A, B)$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, calculate the probability that the sequence is generated by the model, i.e. $P(O \mid \lambda)$.

2. Decoding Problem: Given an HMM $\lambda = (A, B)$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, determine the most likely hidden state sequence $S = s_1, s_2, \ldots, s_n$.

3. Learning Problem: Given the observation sequence $O = o_1, o_2, \ldots, o_T$ and the state sequence $S = s_1, s_2, \ldots, s_n$, learn the model parameters $A, B, \pi$ which maximize $P(O \mid \lambda)$.
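The decoding problem is the one that matters for HMM-based map matching (finding the most likely road sequence behind a series of GPS observations), and it is classically solved with the Viterbi algorithm. The following is a minimal illustrative sketch, assuming the transition matrix A, emission matrix B and initial distribution pi are given as NumPy arrays; it is only a sanity-check example, not the implementation used later in this work.

```python
import numpy as np

def viterbi(A, B, pi, observations):
    """Minimal Viterbi decoder: most likely hidden state sequence.

    A:  (N, N) transition matrix, A[i, j] = P(next state j | state i)
    B:  (N, M) emission matrix,  B[i, o] = P(observation o | state i)
    pi: (N,)   initial state distribution
    observations: sequence of observation indices o_1 .. o_T
    """
    N, T = A.shape[0], len(observations)
    delta = np.zeros((T, N))               # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)      # back-pointers to the previous state

    delta[0] = pi * B[:, observations[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, observations[t]]

    # Backtrack from the most probable final state
    states = np.zeros(T, dtype=int)
    states[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states
```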

2.3 Machine Learning

Countless machine learning models have emerged in the past decades, and it is not easy to tell which one may perform best on the fuel consumption prediction problem. If we have a look at the built-in algorithms in sci-kit learn, we will see tens of models there, but apparently we do not need to try all of them.

In Henrik's work, he did experiments on Linear Regression, Artificial Neural Networks, Support Vector Machine and Random Forest, and concluded that Random Forest slightly outperforms the other three models [8]. However, SVM has quite a close performance compared to Random Forest, and it even performs better in Svärd's study [12]. Randal Olson's study shows that the Gradient Tree Boosting algorithm has the best performance on most problems while Naive Bayes methods are the worst [13]. Based on the previous research, we will put the focus on Support Vector Machine, Random Forest and Gradient Boosted Machines in this study.

2.3.1 Support Vector Regression

Given a dataset $(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_n, y_n)$, the Support Vector Machine is dedicated to finding the most robust separating hyperplane that maximizes the margin (defined as twice the distance from the decision boundary to the nearest sample point, shown as the width between the dashed lines in Figure 2.2). The hyperplane can be represented as $f(x) = w^T x + b$, where $w$ is the normal vector to the hyperplane and $\frac{b}{\|w\|}$ represents the offset of the hyperplane from the origin along the normal vector $w$. SVM treats some samples as more important than others: only the points located on the dashed lines in Figure 2.2 are finally used to determine the decision boundary, and these points are called the support vectors. Another key insight behind SVM is that, due to the dimensionality or complex distribution of the samples, they may not be separable in the original feature space but may be separable in a higher-dimensional space. Here the so-called kernel trick is introduced, which maps the original samples to other feature spaces, where a separating hyperplane may then be found.

The Support Vector Machine was originally proposed for solving binary classification problems, but it can be slightly modified to adapt to the regression case. While the classification case is to find a margin outside of which all samples are located, the regression case is to figure out an $\epsilon$-tube which contains most samples. Given that such a hyperplane may not always exist, the restrictions can sometimes be relaxed to some extent by introducing slack variables $\xi_i^+$ and $\xi_i^-$. Then the optimization problem can be formulated as [14]:

$$\min_{w,\,b,\,\xi^+,\,\xi^-} \; \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^{m}\left(\xi_i^+ + \xi_i^-\right) \qquad (2.2)$$

$$\text{s.t.}\quad y_i - (w^T x_i + b) \le \epsilon + \xi_i^+, \quad (w^T x_i + b) - y_i \le \epsilon + \xi_i^-, \quad \xi_i^+ \ge 0,\; \xi_i^- \ge 0 \qquad (2.3)$$

This formulation can be further converted to a dual problem:

$$\min_{\alpha^+,\,\alpha^-} \; \frac{1}{2}\sum_{i,j}\left(\alpha_i^+ - \alpha_i^-\right)\left(\alpha_j^+ - \alpha_j^-\right) k(x_i, x_j) + \epsilon\sum_{i=1}^{m}\left(\alpha_i^+ + \alpha_i^-\right) - \sum_{i=1}^{m} y_i\left(\alpha_i^+ - \alpha_i^-\right) \qquad (2.4)$$

$$\text{s.t.}\quad \sum_{i=1}^{m}\alpha_i^+ = \sum_{i=1}^{m}\alpha_i^-, \qquad 0 \le \alpha_i^+ \le \frac{C}{m}, \qquad 0 \le \alpha_i^- \le \frac{C}{m} \qquad (2.5)$$


where $\alpha^+, \alpha^-$ are Lagrange multipliers, $C \ge 0$ is a penalty parameter that can be used to control the effect of the slack variables, $m$ is the number of samples and $k(x_i, x_j)$ is the kernel function. Given the complexity of the features and the limitation of the linear kernel, only the radial basis function kernel will be considered in this study, which is given by 2.6.

$$k(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}} \qquad (2.6)$$

Figure 2.2: An illustration of Support Vector Regression.

In some definitions, like in sci-kit learn, $\gamma$ is used to express $\frac{1}{2\sigma^2}$, so the formula above can be written as 2.7 [15]. Together with $C$ and $\epsilon$, $\gamma$ will also be taken as one of the free hyperparameters to be fine-tuned in this study.

$$k(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^2} \qquad (2.7)$$
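For illustration, a minimal sketch showing that the $\sigma$-based form (2.6) and the $\gamma$-based form (2.7) of the RBF kernel coincide when $\gamma = 1/(2\sigma^2)$; the data here is random and purely for the sanity check.

```python
import numpy as np

def rbf_kernel_sigma(X, Z, sigma):
    """RBF kernel as in (2.6): exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def rbf_kernel_gamma(X, Z, gamma):
    """RBF kernel as in (2.7): exp(-gamma * ||x - z||^2)."""
    sq_dists = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

# The two forms coincide for gamma = 1 / (2 * sigma^2)
X = np.random.rand(5, 3)
sigma = 2.0
assert np.allclose(rbf_kernel_sigma(X, X, sigma),
                   rbf_kernel_gamma(X, X, 1.0 / (2.0 * sigma ** 2)))
```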

2.3.2 Tree Based Methods

Regression Trees

The decision tree may be one of the most fundamental and intuitive models that can be used to explicitly represent a decision process. It is also one of the most straightforward and interpretable non-parametric supervised learning methods. Just as its name suggests, it is a tree-like model built upside down, with the root node at the top and multiple leaf nodes at the bottom. Decision tree building is implemented by recursive partitioning: the original dataset is split into multiple subsets at each node, and the model tries to make these subsets as pure as possible in this course. The partitioning process continues until all samples in a child node are from the same class or some other stopping criterion is met. There are various criteria to measure the purity of datasets, like ID3, C4.5, the Gini index and so on.

Classification And Regression Trees, also known as CART, is the default built-in algorithm in the sci-kit learn library and can cope with both classification and regression problems. CART uses the Gini index as the metric for splitting nodes, which is shown in 2.8, where $p_i$ is the probability of a sample being classified as class $i$, $i \in \{1, 2, \ldots, n\}$, $D_i$ is the size of the subset in which all the samples belong to class $i$ and $D$ indicates the size of the entire data set. In this study, we will use CART since it has been proven to be the most efficient algorithm for regression problems.

$$Gini(D) = \sum_{i=1}^{n} p_i \sum_{k \neq i} p_k = \sum_{i=1}^{n} p_i (1 - p_i) = 1 - \sum_{i=1}^{n}\left(\frac{D_i}{D}\right)^2 \qquad (2.8)$$

Things become a little different when it comes to the regression case. Compared to the Gini index, the mean squared error is used as the metric for regression. The input space is split into multiple distinct and non-overlapping regions $R_1, R_2, \ldots, R_n$. For each observation that falls into region $R_i$, the prediction is given by the average of the responses of the training samples in $R_i$.

The process of building a regression tree can be summarized as follows (a small sketch follows Figure 2.3):

1. Given the original input set $D = (x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$, for each attribute feature and the corresponding values, calculate the RSS of the split sub-datasets, which is given by 2.9.

2. Find the optimal attribute-value pair and split the original set into two subsets.

3. Repeat step 1 and step 2 until the stopping conditions are met.

$$RSS = \sum_{x_i \in R_1}\left(y_i - c_1\right)^2 + \sum_{x_i \in R_2}\left(y_i - c_2\right)^2 \qquad (2.9)$$

$$c_1 = \frac{1}{N_1}\sum_{x_i \in R_1} y_i, \qquad c_2 = \frac{1}{N_2}\sum_{x_i \in R_2} y_i \qquad (2.10)$$

Figure 2.3: A simple decision tree for the regression case. The predicted value at each leaf node is the mean of all samples that meet the conditions along the branches.
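A minimal sketch of this splitting criterion for a single feature, using the RSS of equation 2.9 and the leaf means of equation 2.10; the data and names below are illustrative only, not the CART implementation used later.

```python
import numpy as np

def best_split_1d(x, y):
    """Find the threshold on a single feature that minimizes the RSS (eq. 2.9).

    x, y: 1-D NumPy arrays of feature values and targets.
    Returns (best_threshold, best_rss, c1, c2), where c1/c2 are the
    leaf predictions (eq. 2.10) for the left/right regions.
    """
    best = (None, np.inf, None, None)
    for threshold in np.unique(x)[:-1]:
        left, right = y[x <= threshold], y[x > threshold]
        c1, c2 = left.mean(), right.mean()
        rss = np.sum((left - c1) ** 2) + np.sum((right - c2) ** 2)
        if rss < best[1]:
            best = (threshold, rss, c1, c2)
    return best

# Toy usage: the split should separate the two clusters of targets.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split_1d(x, y))
```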

Random Forest

Random Forest is a sort of ensemble method that creates a cluster of decision trees, and the final prediction is a combination of the results from these trees. Random Forest is inspired by the theory of the wisdom of crowds. Generally, a trained decision tree is highly sensitive to the data: it can fit the training data set very well, which means it has high variance and relatively low bias. This property makes it a good choice for some ensemble methods [16].

To guarantee the efficiency of ensemble methods, the predictive ability of a single model should be better than random guessing, and the individual predictors should be diverse from each other. Random Forest introduces two sources of randomness to boost the model's generalization ability [16]:

1. Sampling from the dataset with replacement when building a single decision tree; this process is also known as bagging.

2. Only a subset of features will be selected when splitting the nodes.

After getting all the single predictors, Random Forest will output the averaged value as the final prediction, since this is a regression problem. Some hyperparameters, like the number of features used in splitting and the size of the data used for building trees, need to be fine-tuned for the performance of the model.
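A minimal sketch of how these two sources of randomness map onto the corresponding options of sci-kit learn's RandomForestRegressor; the data and hyperparameter values below are placeholders, not those used in the experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder training data: rows are routes, columns are features
# such as estimated weight, average speed, weather and the fuel index.
X_train = np.random.rand(1000, 8)
y_train = np.random.rand(1000)          # fuel consumption per route

model = RandomForestRegressor(
    n_estimators=300,      # number of bagged trees
    bootstrap=True,        # randomness 1: sample the data with replacement
    max_samples=0.8,       # fraction of the data drawn for each tree
    max_features="sqrt",   # randomness 2: subset of features per split
    random_state=0,
)
model.fit(X_train, y_train)
predictions = model.predict(X_train[:5])   # averaged over all trees
```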


Gradient Boosted Trees

The boosting method is also an ensemble method. Compared to the bagging used by Random Forest, boosting also generates a series of predictors, but in a sequential fashion. In each round of generation, the boosting method puts the focus on those samples which were misclassified by the last predictor: they are assigned higher weights and then used for training the next model. Once enough weak models are collected, we use the same strategy as before, for example simply taking the average, to combine them into a more powerful model.

Boosted trees form an additive model that takes advantage of decision trees and boosting methods. A simple model is initialized at first, then more weak models are trained to fit the residual between the existing model and the real values. Considering the training dataset $D = (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the Gradient Boosted Trees algorithm can be formulated as follows [17]:

1. Initialize a simple regression tree model $f_0$ that minimizes $\sum_{i=1}^{m} L(y_i, f_0(x))$, where $c$ is the output value at the leaf node:

$$f_0(x) = \underset{c}{\operatorname{argmin}} \sum_{i=1}^{m} L(y_i, c) \qquad (2.11)$$

2. For each sample, compute the negative gradient given by 2.12.

$$r_{ti} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{t-1}(x)} \qquad (2.12)$$

3. Fit a regression tree on the negative gradients $r_{ti}$ from step 2 as the $t$-th weak model. The output regions are $R_{tj}$, $j = 1, 2, \ldots, J$, corresponding to the leaf nodes.

4. For each leaf node, compute the best output value which minimizes the loss function, given by 2.13.

$$c_{tj} = \underset{c}{\operatorname{argmin}} \sum_{x_i \in R_{tj}} L\left(y_i, f_{t-1}(x_i) + c\right) \qquad (2.13)$$

5. Update the model by adding the weak predictor as shown in 2.14, where $\eta$ is the learning rate.

$$f_t(x) = f_{t-1}(x) + \eta \sum_{j=1}^{J} c_{tj}\, I(x \in R_{tj}) \qquad (2.14)$$

In this study, we will turn to a modern machine learning framework called XGBoost, which provides a robust and efficient implementation of the Gradient Boosted Trees algorithm [18]. XGBoost allows us to add regularization terms and customize the loss function to avoid over-fitting problems, and it has many well-designed optimization mechanisms with respect to the calculation of derivatives. It also enables the model to be trained in a multi-GPU environment, which can significantly boost the training speed [18].
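A minimal sketch of training such a model through the XGBoost scikit-learn wrapper; the hyperparameter values and data below are placeholders rather than the settings tuned in this study.

```python
import numpy as np
from xgboost import XGBRegressor

# Placeholder data: features per route and fuel consumption as the target.
X_train = np.random.rand(1000, 8)
y_train = np.random.rand(1000)

model = XGBRegressor(
    n_estimators=500,      # number of boosting rounds (weak trees)
    learning_rate=0.05,    # eta in equation 2.14
    max_depth=6,           # depth of each weak regression tree
    reg_lambda=1.0,        # L2 regularization term on leaf weights
    subsample=0.8,         # row subsampling per boosting round
)
model.fit(X_train, y_train)
y_pred = model.predict(X_train[:5])
```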


Methods

3.1 Map matching

3.1.1 HMM based map-matching method

The original data collected from the connected vehicles is just a series of GPS points with coordinates, accumulated consumed fuel, headings and some other information. The first step of data preparation is to map these points to real roads in the world. Thus some geoinformation techniques need to be involved, and among them an algorithm called map-matching is a powerful tool for this purpose. A lot of research has been done in this direction. Lili Cao and John Krumm propose a two-step algorithm to build a road network from massive GPS points [9]. They first filter out points with significant noise and then apply a simple road generation algorithm. However, this method needs quite a large number of recorded points, which is not applicable in this project; moreover, the final topological model we get in this way is only an approximation of the real road model. There are some other similar methods for road network generation based on different implementations, and they all face more or less the same problems as above.

Currently there are various methods for map-matching that can be roughly divided into geometrical methods and probabilistic methods. Due to the error that may be caused by matching one point at a time, many improved geometrical algorithms that try to match multiple points simultaneously have been proposed [19, 20].

While many geometrical algorithms perform quite well in many cases, there is still the potential risk of failure because of the sparseness of the data.

The most commonly used probabilistic model for map matching is the Hidden Markov Model. Paul Newson and John Krumm described such an HMM model that accounts for the measurement noise and reaches a reasonable balance between the GPS positions and the feasibility of the path [10]. In this method, they treat the potential road segments as hidden states and the GPS points as the observations. Their experiments show that such methods can reduce the side effects brought about by the noise and have strong robustness. Once the HMM-based method was proposed, a lot of related efforts were made to improve it. Nowadays it is the most widely adopted method in many open source map-matching libraries.

Map-matching server

Since the map-matching algorithm is not the main research topic of this study, we decided to turn to an existing open-source software package to solve this problem. At the beginning of the project, we limited our choices to a small range of well-verified tools including SharedStreets [21], GraphHopper [22], MapBox [23] and Barefoot [24], where the built-in map-matching algorithms are all HMM-based. Some experiments were done on these tools and, after thorough comparisons taking factors like cost and data format into consideration, Barefoot was finally chosen for the map-matching task. Barefoot is a software package designed and developed by BMW Car IT GmbH [25]. This system provides both online and offline map-matching functionalities.

OpenStreetMap

Besides, it is also worth introducing the open-source project OpenStreetMap to make it easier to understand what the geographical data look like. The OpenStreetMap (OSM) project was launched in 2004, dedicated to creating an accurate copyright-free map database available to anyone; it allows the community to freely access and edit the information, which keeps it continuously improving [26]. OpenStreetMap does not only include the fundamental connectivity and geometry; the contributors can add much valuable information such as the names of streets, the type of roads and even the speed limits. Some of the information may not be complete, but it can still provide crucial references to validate experiment results in the later work. The OSM data consists of four parts, shown in Table 3.1 [27, 28]. Barefoot defines its own key-road mapping mechanism based on the OpenStreetMap reference system.


Table 3.1: The elements in OpenStreetMap data

Element | Explanation
Node | A node represents a single point on the map.
Relation | A relation specifies the relationships between multiple objects.
Way | The geometry information of a road; generally consists of multiple nodes.
Tag | A tag is an optional field that provides descriptive information for nodes, relations or ways.

3.2 Normalization method

In the following sections, we will see the necessity to normalize the fuel consumption due to the diversity between different types of vehicles. In this study, we will introduce two normalization methods: min-max normalization and z-score normalization.

Min-max normalization

Min-max normalization (or feature scaling) can transform all values to the same range [a,b] as shown in 3.1 [29].

$$X' = a + \frac{(X - X_{min})(b - a)}{X_{max} - X_{min}} \qquad (3.1)$$

Z-score normalization

Z-score (or standard score) is calculated by subtracting the mean value and then dividing by the standard deviation, as shown in 3.2 [29]. The normalized data set has zero mean and unit standard deviation.

$$X' = \frac{X - \mu}{\sigma} \qquad (3.2)$$

where $\mu$ is the mean of the dataset and $\sigma$ is the standard deviation of the dataset.

The premise for computing the z-score is that the mean and standard deviation are known, but in this case we only have a subset of the whole dataset, so we will use a variant of the z-score which is also referred to as the t-score [29]:

$$X' = \frac{X - \mu}{\sigma / \sqrt{m}} \qquad (3.3)$$

where $m$ is the number of road segments belonging to a certain vehicle. The absolute value of the z-score represents the distance between the original value and the mean value of the population, so the z-score is also very commonly used as a metric to detect outliers in the data set.

3.3 Performance metrics

3.3.1 Mean Absolute Error

Mean Absolute Error (MAE) is one of the most widely used metrics in machine learning. It is the average absolute difference between the actual values and predictions where all the samples have equal weight [14].

$$MAE = \frac{1}{n}\sum_{j=1}^{n}\left|y_j - \hat{y}_j\right| \qquad (3.4)$$

3.3.2 Root Mean Squared Error

Bias and variance are the most common metrics to evaluate the performance of machine learning models. Bias measures the difference between the output of the model and the true values; it quantifies how well a model can fit the training data. The lower the bias is, the more accurate the estimation given by the model will be. The variance describes the stability of the model when it is working on a different dataset. Monitoring the variance is very useful to avoid overfitting problems during the training phase.

To get a machine learning model that can fit the data in a reasonable way, it is necessary to find a balance between bias and variance. The root mean squared error (RMSE) is a powerful metric to measure the error which can consider both the bias and variance [14], and this is what will be used in this project.

$$RMSE = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(y_j - \hat{y}_j\right)^2} \qquad (3.5)$$


Generally, RMSE would be a better choice compared to MAE if the error follows a normal distribution, and it is also more sensitive to outliers. When comparing different models by the same metric, it is beneficial to inspect their generalization abilities, that is, how well they can deal with the outliers in the dataset.

3.3.3 Mean Absolute Percentage Error

The percentage error is very commonly used in classification problems since it is very intuitive and easy to interpret. It is also possible to measure the error of predictions with continuous values by percentage as expressed by the following equation[30]:

$$MAPE = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{True_i - Pred_i}{True_i}\right| \qquad (3.6)$$

In fact, it is not that popular in regression cases because it may be misleading sometimes. For example, the percentage error is highly sensitive to the magnitude of the actual values, and it is error-prone given that the denominator can sometimes be zero. However, since it is suggested in Henrik's work [8], I will keep it in this study for the sake of comparison.

In practical applications, any single metric can only emphasize one aspect of the model errors; a combination of various metrics can provide insight into model performance from different angles.
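For reference, the three metrics expressed as small NumPy functions (a sketch with illustrative argument names):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error, equation 3.4."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Squared Error, equation 3.5."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, equation 3.6.

    Undefined when any true value is zero, which is one reason it can
    be misleading for regression targets close to zero.
    """
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))
```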

3.4 Feature engineering

Before we feed all the variables into the machine learning models, it would be meaningful to explore their relations to our target. Here we use Pearson correlation to examine the relationships between two variables.

Intuitively, the date and time may also contribute a lot to fuel consumption since they are good indicators for the change of seasons and traffic flows. The traffic flow may vary a lot on weekends in many countries and hence influence driving behaviors. Therefore we add another feature, 'day of week', for differentiating weekdays and weekends.

The most straightforward idea to express the date would be splitting it into year, month and day; since all the data come from 2018, we will only focus on 'month' and 'day' here. However, 'day' is a periodic variable: it resets to 1 at the beginning of each month, which means it cannot show the variation of the season. A solution to this problem would be to combine the variables 'month' and 'day' somehow, so we add an extra variable 'day of year' to indicate the day number within the current year counted from January 1st. We will explore the expressiveness of these variables by observing the relationships between them and the fuel consumption later.
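A minimal pandas sketch of deriving these temporal features; the column name timestamp is a placeholder for whatever the matched-route records actually use.

```python
import pandas as pd

# Placeholder frame with one timestamp per matched route.
routes = pd.DataFrame({"timestamp": pd.to_datetime([
    "2018-01-01 08:15", "2018-06-17 13:40", "2018-12-24 18:05"])})

routes["month"] = routes["timestamp"].dt.month
routes["day"] = routes["timestamp"].dt.day
routes["hour"] = routes["timestamp"].dt.hour
routes["day_of_week"] = routes["timestamp"].dt.dayofweek   # 0 = Monday
routes["day_of_year"] = routes["timestamp"].dt.dayofyear   # counted from Jan 1st
```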

3.4.1 Pearson Correlation

The Pearson correlation coefficient is often used to measure the linear relationship between two variables. Its value ranges from -1 to 1, where 1 means a total positive linear correlation and 0 means no linear correlation between the two variables. The formula of the Pearson correlation is shown below [31]:

$$\rho_{XY} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y} \qquad (3.7)$$

where $\sigma_X$ is the standard deviation of $X$, $\sigma_Y$ is the standard deviation of $Y$, $\mu_X$ is the mean of $X$ and $\mu_Y$ is the mean of $Y$. The general guideline for interpreting the Pearson correlation value is shown in Table 3.2.

Table 3.2: A general explanation of Pearson Correlation

Absolute value | Correlation
.00-.19 | very weak
.20-.39 | weak
.40-.59 | moderate
.60-.79 | strong
.80-1.0 | very strong


Data collection and processing

4.1 Fleet Management System Data

The Fleet Management System (FMS) database stores data collected from connected vehicles operated by different companies all around the world. The connected vehicles send messages to Scania regularly; each message contains information including the instantaneous coordinates, odometer, heading and fuel level, as well as temporal information like the date and time when the message was recorded. The recorded messages are sent out later in a batch fashion. The raw features used in this study are shown in Table 4.1. The data is limited to a relatively small area in Sweden, as the bounding box in Figure 4.1 shows. The time period of the data is also confined to 2018, from January 1st to December 31st.

Table 4.1: Features selected from FMS database

Feature | Description
id | The identifier of the vehicle
timestamp | The time when the message is recorded
latitude | The latitude of the vehicle position when the message is recorded
longitude | The longitude of the vehicle position when the message is recorded
odometer | The accumulated distance this vehicle has travelled
accumulated fuel consumption | The accumulated consumed fuel
accumulated weight | The accumulated weight of the vehicle
heading | The direction of the vehicle


Figure 4.1: The bounding box of a small area where the data is collected from in this project. The longitude ranges from 13.9957 E to 18.3672 E and latitude is from 57.7154 N to 60.1961 N.

Each record in the raw FMS data set is a single GPS point. Two records that are consecutive in time are fed to the map-matching server we set up before, and a reasonable real-world route is output if one exists. After running all the points through the map-matching server, there is a new data set in which each item represents a route. We can then further calculate the travelling time, fuel consumption and average speed on this route by using the accumulated fuel consumption and timestamps of the two endpoints.

In this study we will also consider the calculated weight of the vehicles. This parameter comes from the FMS database as well and is considered an accurate estimate of the weight of the vehicle in a certain time period. The estimated weight is given by expression 4.1:

$$Estimated\_weight_{t_2 - t_1} = \frac{Cal\_weight_{t_2} - Cal\_weight_{t_1}}{t_2 - t_1} \qquad (4.1)$$


where $Cal\_weight_{t_2}$ and $Cal\_weight_{t_1}$ are the accumulated weights recorded at times $t_2$ and $t_1$ respectively; the result is the estimated weight in the time period $t_1$ to $t_2$.

The weight of the vehicle will vary over time due to the fuel consumption or loading/unloading of the cargo. However, the change in vehicle weight should be smooth in a consecutive time period if the cargo factor is not considered.

In practice, we notice that the estimated weight calculated from a very short time interval is extremely unstable with high variance. To smooth the weight and make it as accurate as possible, a minimum valid time interval is set as 10 minutes and the maximum time interval as 360 minutes, that is, 6 hours. The features of this new data set are shown in Table 4.2.

Table 4.2: Features of matched routes

Feature | Description
id | The identifier of the vehicle
roads | An array containing the ids of the road segments in this route
fuel consumption | The consumed fuel on this route
estimated weight | The estimated weight in a certain time period
travelling time | The time spent on this route
distance | The travelled distance calculated from the odometer
total length | The total length of this route, given by Barefoot
average speed | The average speed on this route

There is quite a lot of exceptional data in the new data set. Some of it is due to errors in the raw FMS data, and some is generated by the map-matching server. It is extremely difficult to detect all the errors in the data set. Here several filters based on multiple criteria are set up to remove unqualified data as much as possible (a sketch of these filters follows the list):

1. Remove all the data lacking fuel records or any other parameters. Missing values will not be interpolated in this study.

2. Compare the route length from the map database and the travelled distance given by the odometer records; if the difference between these two is larger than 10% of the odometer distance, we consider this route very error-prone and hence abandon it.

3. The average speed on each matched route is derived from the travelled distance (from the odometer) and the travelling time. If the calculated speed exceeds 90 km/h (the maximum speed limit for trucks in Sweden) or is below 15 km/h, this route is removed from the dataset. The reasons for setting a minimum speed limit are complicated. When the average speed on a certain road is incredibly low, it could be due to a traffic jam, an accident, or the vehicle simply being stopped for loading cargo. Given the difficulty of distinguishing these reasons one by one, the records with low speed are simply taken as outliers and removed from this study.

4. Restrict the time span of each pair of matched points to the range from 5 minutes to 10 minutes. The motivation behind this setting is to reduce the risk of potential wrong matches, even though some correct matches will also be removed. For example, there may be various possible routes between two points, especially when the time interval between the two points is very large. In this case, the map-matching engine cannot guarantee the reliability of the matching results. From previous studies, we also know that the fuel consumption derived from data recorded at a high frequency fluctuates strongly.

5. When building the fuel consumption indices, an extra filter is applied. To guarantee the reliability of the normalized results, we expect there to be enough historical records for a single vehicle. Thus I set the threshold for the number of matched routes to 20; vehicles whose number of routes is below this threshold are removed.
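A minimal pandas sketch of filters 2 to 4 above; the column names follow Table 4.2 but are placeholders, and the travelling time is assumed to be stored in seconds.

```python
import pandas as pd

def filter_routes(routes: pd.DataFrame) -> pd.DataFrame:
    """Apply the route-level quality filters described above."""
    r = routes.dropna()                     # filter 1: no missing parameters

    # filter 2: odometer distance and map length must agree within 10 %
    diff = (r["total_length"] - r["distance"]).abs()
    r = r[diff <= 0.10 * r["distance"]]

    # filter 3: plausible average speed for trucks in Sweden (km/h)
    r = r[(r["average_speed"] >= 15) & (r["average_speed"] <= 90)]

    # filter 4: time span between the matched points of 5 to 10 minutes
    r = r[(r["travelling_time"] >= 5 * 60) & (r["travelling_time"] <= 10 * 60)]
    return r
```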

4.2 Weather Data

The weather data used in this project comes from SMHI [32]. The raw data is processed in basically the same manner as in Henrik's work [8].

The SMHI data can be accessed through the SMHI Open Data REST API [33], but the challenge is that only one parameter from a certain weather station can be accessed at a time. Another difficulty is the diversity of parameters recorded at each station. For example, some stations record the air humidity but some do not.

SMHI performs a validation process on the historical records; thus data older than three months can be considered highly reliable. This puts some restrictions on the range of FMS data we are using. I extracted the weather data in April 2019, which means the historical data up to January 1st, 2019 is validated. There are 905 weather stations and 38 weather parameters in total in SMHI's database; 240 stations are still active and 227 of them have records including the following six parameters for 2018.

In this study I will only use the 6 parameters suggested by related work as the most influential ones for fuel consumption, as shown in Table 4.3. The air temperature, wind direction, wind speed, humidity and current weather are recorded once per hour; precipitation is recorded every 15 minutes.

Table 4.3: Features selected as weather conditions

Feature | Description
Air temperature | The instantaneous temperature value, recorded once per hour
Wind direction | The instantaneous wind direction
Wind speed | The instantaneous wind speed
Relative humidity | The air humidity
Precipitation | The rainfall and snowfall
Current weather | An index used to describe the current weather condition

Matching weather data with route records

To deal with the problem of diverse parameters and to better organize the weather data, I also reformat the weather data into JSON and import it into MongoDB. Data in MongoDB is structured as JSON documents, so we can easily add or remove fields in a single data record.

The data collected by a certain station can only represent the weather condition at the exact location of that station. For an arbitrary position, we can use the weather records from the station which is nearest to that position. It is obvious that the closer the station is, the more accurate the data will be. I set the date and time as the key for indexing all weather data. When matching the weather data with the routes, we first check the data from the nearest station; if some parameters are missing, we turn to the second nearest station, and so on, until we have collected all the parameters we need or we have iterated over all the stations within 20 kilometers, which is the maximum distance I set to guarantee the accuracy of the weather data.

To efficiently search for the nearest station, I import the basic information of all the active weather stations into MongoDB. By means of the geospatial index functionality of MongoDB, we can create a sphere index on the coordinates of the stations, and MongoDB will perform the $near operation to get a list of stations ordered from nearest to farthest [34].
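A minimal pymongo sketch of this lookup, assuming a stations collection whose documents store a GeoJSON point under a location field (the database, collection and field names are placeholders).

```python
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")
stations = client["weather"]["stations"]

# A 2dsphere index on the station coordinates enables $near queries.
stations.create_index([("location", GEOSPHERE)])

def stations_near(lon: float, lat: float, max_km: float = 20.0):
    """Return stations within max_km, ordered from nearest to farthest."""
    return stations.find({
        "location": {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": max_km * 1000,   # metres
            }
        }
    })
```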


4.3 Vehicle configurations

With assistance from employees at Scania, I extracted 35 parameters of the vehicle configurations. However, not all of these parameters contribute to the fuel consumption, so here I take only the few of them suggested by Henrik, as shown in Table 4.4.

Table 4.4: Features selected as vehicle configurations

Feature Description

Technical total weight The maximum load weight of vehicle.

Engine stroke volume The volume of the engine.

Horsepower The power of the engine.

Rear axle gear ratio The rear axle gear ratio.

Emission level The standard indicating the amount of pollutants released to the air.

Overdrive If the vehicle has overdrive system or not.

Ecocruise If the vehicle has ecocruise system or not.

Scania has done some research regarding modeling the influence of driver behaviors on fuel consumption. In an interview with Scania's employees, we learned that some driver behavior parameters are highly correlated with others, like the road conditions. The previous study [8] also shows that the driver behaviors may be misleading when predicting the fuel consumption because of the low data quality and the hidden noise. The driver behaviours are therefore all excluded in this paper for the reasons mentioned above.

4.4 Road network model

Before we go further into the fuel consumption index part, we first need to figure out how to manage the map data in this project. Figure A.3 shows the structure of a single road segment in JSON format, and this is how it is managed in the database. In MongoDB, each road segment is stored as a document; it contains five fields, including the road id, the fuel consumption indices and the count of samples used for computing the index. We add a sample count here to ensure that the fuel consumption indices can be continuously updated when new data comes in. To boost the query speed when matching fuel indices with routes, we create an index on the field gid, which is the id of the road segment [34].

The matched routes given by Barefoot are in standard GeoJSON format, which can be directly imported into MongoDB. As we can see from Figure A.2, each route document contains three sub-fields: geometry holds a series of coordinates located at the corners of the route, which can be utilised for visualization; properties contains information including the travelling time, fuel consumption, the sequence of road segment ids and the road length; id is the identifier of the vehicle.

4.4.1 Fuel consumption index

The construction of the fuel consumption index is built on two fundamental assumptions:

1. It is possible to get an efficient descriptor somehow as long as we have enough data.

2. For a route consisting of various road segments, the fuel consumption on each segment is approximately proportional to the length of that segment.

Actually, the second assumption is coarse and hence may cause some potential errors. Because of the diversity of road conditions, even two road segments with the same length may lead to different fuel consumption. However, for a limited number of adjacent segments belonging to the same route, we can simply assume that they have very similar road conditions.

The data in FMS is recorded at different times, under different weather conditions or even with different drivers, but what remains the same is the road conditions. The basic idea is that as long as we have enough records, we can somehow figure out a way to remove the impacts brought about by other parameters but only keep the road conditions.

The most intuitive method would be taking the average of all fuel consumption records on the same road segment, but there are some risks that need our attention. First, there is the diversity of vehicles: the fuel consumption of different vehicles on the same road segment may be at completely different magnitudes due to factors like the weight of the vehicle. We cannot simply average the values or add them up in this case, given that the larger values will have a significantly larger impact on the final result and may lead to an unexpected bias.

Here we introduce the z-score normalization and min-max normalization to map the values measured at various scales to a common scale.

The z-score based fuel consumption index is computed as follows:


1. For each vehicle, normalize its fuel consumption on each road segment by z-score normalization.

2. For each road segment, average all the normalized fuel consumption on this road segment.

3. When the above two steps are finished, we have fuel consumption indices with zero mean. It is beneficial to unify the sign of these indices, so we subtract the minimum fuel consumption index value from all the road segments' fuel consumption indices.

This process can be summarized as equation 4.2.

$$X' = \frac{1}{n}\sum_{i=1}^{n}\frac{X_i - \mu}{\sigma/\sqrt{m}} - \min FCI_{z\text{-}score} \qquad (4.2)$$

where $n$ is the number of times a road segment has been recorded and $FCI_{z\text{-}score}$ is the sequence containing the fuel consumption indices of all the road segments.

Once the fuel consumption index is built for all the road segments, we can get a sequence of indices for each route in the data set. Due to the variance in the number of road segments contained in each route, it is necessary to derive a single metric as the fuel consumption index of the route. The most straightforward method would be taking the average of the index sequence; a more refined method is using the weighted mean value, where the weights are the lengths of the road segments, as shown in 4.3.

$$FCI_{route} = \frac{1}{n}\sum_{i=1}^{n} l_i \cdot fci_i \qquad (4.3)$$

where $n$ is the number of road segments in a route, and $l_i$ and $fci_i$ are the corresponding length and fuel consumption index of a certain road segment.
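A minimal pandas sketch of the per-segment index construction described above, using the plain z-score rather than the t-score variant of equation 3.3 for brevity; the column names are placeholders.

```python
import pandas as pd

def z_score_fuel_index(samples: pd.DataFrame) -> pd.Series:
    """Compute a z-score based fuel consumption index per road segment.

    samples: one row per (vehicle, road segment) fuel record with columns
             'vehicle_id', 'segment_id', 'fuel'.
    """
    df = samples.copy()

    # Step 1: normalize each vehicle's records so that vehicles of very
    # different sizes become comparable (assumes each vehicle has enough
    # records, cf. the threshold of 20 matched routes above).
    grouped = df.groupby("vehicle_id")["fuel"]
    df["fuel_norm"] = (df["fuel"] - grouped.transform("mean")) / grouped.transform("std")

    # Step 2: average the normalized consumption per road segment.
    index = df.groupby("segment_id")["fuel_norm"].mean()

    # Step 3: shift by the minimum so that all indices share the same sign.
    return index - index.min()
```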


Results

5.1 Data exploration

Before applying the machine learning algorithms to the data, it is beneficial to perform some preliminary analysis on the dataset. This will give us a good understanding of the problem. We will examine the distribution, trend, correlations and any other possible patterns in the data; this will pave the way for the later feature engineering and modeling work.

5.1.1 Fuel Consumption index

Here I choose a representative road segment to illustrate the difference between mean and median; the normalized fuel consumption and the trend of the fuel consumption indices are plotted.

Hopefully, the trend of the fuel consumption index converges once enough data is added to the computation. Theoretically, the mean and median values are the same if the data is normally distributed. The left plot in Figure 5.1 shows a nearly perfect normal distribution, which is also illustrated in the right plot, where we can see that the mean and median values converge nicely.

Figure 5.1: The distribution of the normalized fuel consumption of road segment 8324 and the trend of the fuel consumption index as more and more samples are added into the calculation.

Figure 5.3 and Figure 5.4 display the scatter plots of the min-max based fuel consumption index and the z-score based index separately. The distributions of the weighted mean and the mean are almost the same, with slight differences. Figure 5.2 visualizes the fuel consumption indices of a subset of road segments. Generally, the segments centered around cities or towns have high fuel consumption indices. In contrast, the segments in suburban areas have relatively low indices. This is in accordance with common sense, given that traffic jams in urban areas cause frequent braking and starting, which leads to high fuel consumption. We should note that this illustration certainly contains some inaccurate fuel consumption indices because of the lack of samples as well as some other reasons.

5.1.2 Final Dataset

Once we have collected the data in the complete feature space, we can have a look at the correlations to get a preliminary impression of the relationships between the features. Figure A.1 is a correlation matrix based on the Pearson correlation coefficient, from which we can tell that there may be weak relationships between some temporal features and the fuel consumption. This is intuitively explainable, since the variable month can indicate the change of seasons, which significantly affects the weather and hence influences the road conditions and vehicle conditions. Similarly, the weather conditions change across the day, and the driver may also behave differently at daytime and at night.

There are also some variables that are highly correlated with each other, like the average speed and the route length (distance). Given that we have limited the travelling time on each route to a small interval from 5 min to 10 min, the average speed is almost proportional to the length, so it is possible to consider only one of these two variables in the training phase. In this study, the feature distance is removed from the data set.

Figure 5.2: The visualization of the fuel consumption indices of a subset of road segments. A dark color indicates a large index while a light color means the opposite.

The values of different features are recorded at various scales, so it is beneficial to transform them to a common scale before feeding them into the machine learning models. This process is not necessary for models that treat each feature separately, like the random forest, but it is essential for others like the Support Vector Machine. In this study, we turn to the z-score again to complete this normalization process.

5.2 Training Results

Generally, the most reliable way to evaluate the performance on a certain dataset would be using cross-validation to eliminate any potential bias from a bad split of the dataset. However, cross-validation is an extremely slow process given a huge amount of data. So we will simply use a random split in the experiments: 80% will serve as the training set, and the remaining 20% is used as the test set.


Figure 5.3: The scatter plot of min-max score based fuel consumption index.

Figure 5.4: The scatter plot of z-score based fuel consumption index.

5.2.1 Support Vector Machine

The Support Vector Machine model is implemented with the help of sci-kit learn [35]. There are three hyperparameters in total to be adjusted: $\gamma$, $\epsilon$ and $C$. We fix the kernel function as the radial basis function kernel, as it is shown to be a more robust kernel than other options for data sets with a high-dimensional feature space. After a thorough grid search on the parameter space, we finally reach the best performance with $\gamma = 0.02$, $\epsilon = 0.1$ and $C = 20$. The results of the Support Vector Machine are shown in Table 5.1.
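A minimal sketch of such a grid search with sci-kit learn; the data and the parameter grid shown are placeholders, not the exact grid searched in this study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Placeholder data standing in for the normalized route feature matrix.
X = np.random.rand(2000, 8)
y = np.random.rand(2000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={
        "gamma": [0.01, 0.02, 0.05, 0.1],
        "epsilon": [0.05, 0.1, 0.2],
        "C": [1, 10, 20, 50],
    },
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```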

Compared to the error in Henrik's work [8] (0.74), the error in this study is quite high. The worse performance on this dataset is mainly attributed to the inferior quality of the dataset. More discussions regarding this will come in
