Estimating fuel consumption using regression and machine learning

LUKAS EKSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES
STOCKHOLM, SWEDEN 2018
Degree Project in Systems Engineering (30 ECTS credits)
Degree Programme in Aerospace Engineering (120 credits)
KTH Royal Institute of Technology, 2018
Supervisor at Scania: Henrik Wentzel
Supervisor at KTH: Johan Karlsson
Examiner at KTH: Johan Karlsson
TRITA-SCI-GRU 2018:378 MAT-E 2018:80
Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci
Abstract

This thesis focuses on investigating the usage of statistical models for estimating the fuel consumption of heavy duty vehicles. Several statistical models are assessed, along with machine learning using artificial neural networks.

Data recorded by sensors on board trucks in the EU describe the operational usage of the vehicle. The usage of this data for estimating the fuel consumption is assessed, and several variables originating from the operational data are modelled and tested as possible input parameters.

The estimation model for real world fuel consumption uses 8 parameters describing the operational usage of the vehicles and 8 parameters describing the vehicles themselves. The operational parameters describe the average speed, topography, variation of speed, idling, and more. This model has an average relative error of 5.75%, with a prediction error less than 11.14% for 95% of all tested vehicles.

When only vehicle parameters are considered, it is possible to make predictions with an average relative error of 9.30%, with a prediction error less than 19.50% for 95% of all tested vehicles.

Furthermore, a computer software called the Vehicle Energy Consumption Calculation tool (VECTO) must be used to simulate the fuel consumption for all heavy duty vehicles, according to legislation by the EU. Running VECTO is a slow process, and this thesis also investigates how well statistical models can be used to quickly estimate the VECTO fuel consumption. The model estimates VECTO fuel consumption with an average relative error of 0.32% and with a prediction error less than 0.65% for 95% of all tested vehicles.
Sammanfattning

This report focuses on investigating the use of statistical models for estimating the fuel consumption of heavy vehicles. Several statistical models are evaluated, together with machine learning using artificial neural networks.

Data recorded by sensors on board Scania trucks in the EU describe the operation of the vehicles. The use of these data for estimating fuel consumption is examined, and several variables derived from the operational data are modelled and tested as possible input parameters.

The estimation model for real-world fuel consumption uses 8 parameters describing the usage of the vehicle and 8 parameters describing the vehicle itself, including average speed, topography, speed variation and share of idling. This model has an average relative error of 5.75%, with a prediction error of less than 11.14% for 95% of the tested vehicles.

If only vehicle parameters are considered as input, it is possible to make predictions with an average relative error of 9.30%, with a prediction error of less than 19.50% for 95% of the tested vehicles. A computer program called VECTO must be used to simulate the fuel consumption of all heavy vehicles according to EU legislation. Running VECTO is a time-consuming process, and this report also investigates how well statistical models can be used to quickly estimate the VECTO fuel consumption. The model estimates the VECTO fuel consumption with an average relative error of 0.32% and with a prediction error of less than 0.65% for 95% of the tested vehicles.
Acknowledgements

I would like to express my gratitude to Henrik Wentzel, Johan Karlsson, Marcus Forslin, Antonius Kies, and Göran Svensson, for the help and guidance they have given me.
Contents

1 Introduction

2 Theory
  2.1 Preliminary math
    2.1.1 Basic statistical measures
    2.1.2 Normalizing
    2.1.3 Standardized Euclidean distance
    2.1.4 Prediction error
    2.1.5 Categorical variables
  2.2 Regression
    2.2.1 Linear regression
    2.2.2 KNN regression
    2.2.3 Multivariate adaptive regression splines
  2.3 Artificial neural network
    2.3.1 Artificial neuron
    2.3.2 Artificial neural network
    2.3.3 Input
    2.3.4 Output
    2.3.5 Learning
    2.3.6 Batch learning
    2.3.7 Evaluating observations

3 Data
  3.1 Vehicle parameters
  3.2 Operational parameters
    3.2.1 Bin data
    3.2.2 Considered variables
  3.3 Output data
  3.4 Data gathering
    3.4.1 Estimation scenario A
    3.4.2 Estimation scenarios B and C

  4.2 Models
    4.2.1 Linear regression
    4.2.2 KNN regression
    4.2.3 KNN regression with inverse distance weighting
    4.2.4 Multivariate adaptive regression splines (MARS)
    4.2.5 Artificial neural network
  4.3 Choice of input parameters
    4.3.1 Estimation scenario A - VECTO fuel consumption
  4.4 Estimation scenarios B and C - Real world fuel consumption
  4.5 Selection of model

5 Results
  5.1 Estimation scenario A - VECTO fuel consumption
    5.1.1 Selected model
  5.2 Estimation scenario B - Real world fuel consumption using only vehicle parameters
    5.2.1 Selected model
  5.3 Estimation scenario C - Real world fuel consumption using vehicle and operational parameters
    5.3.1 Selected model
  5.4 Results summary
    5.4.1 Comparison of estimation scenarios B and C
    5.4.2 Variable influence

6 Discussion and conclusion
  6.1 Discussion
  6.2 Conclusion

References

A Glossary
  A.1 Acronyms
  A.2 Terms
B Fundamental statistics
C Assumed values for bin data
D Additional tables
E Activation functions
F.2 Estimation scenario B
F.3 Estimation scenario C
G Additional figures
H Code
Chapter 1
Introduction
This report is the result of a thesis project carried out at Scania AB, in cooperation with the Royal Institute of Technology (KTH). The thesis was conducted in the group YDMC at Scania in Södertälje, Sweden, which specializes in analysis of fuel consumption and, by extension, CO2 emissions. It is of great interest to be able to quickly and accurately estimate the fuel consumption (FC) and CO2 emissions of vehicles. Current methods for estimation rely on simulation, which is too slow and too inaccurate. More advanced simulations require several minutes to complete, which is too slow to be implemented in a sales process, and they are still not very accurate. Furthermore, new legislation by the European Union requires all new heavy duty vehicles (HDVs) to be simulated using a software called the Vehicle Energy Consumption Calculation Tool (VECTO). The simulation tool is used for certification of the vehicles. Therefore, it is also of great interest to be able to quickly estimate and deliver the VECTO-certified CO2 emissions to haulage contractors. This paper assesses statistical models for predicting fuel consumption. Several regression models are created and tested. Furthermore, the use of artificial intelligence (AI) in the form of artificial neural networks is also tested. The paper also investigates the usage of operational data - data uploaded from vehicles in use - and their potential role in estimating the real world FC.
When VECTO is used for certification purposes, simulations are performed for a set of predefined transport missions, each with a set of predefined transport payloads. Since the driving patterns are predefined, the FC obtained through VECTO, for a given predefined transport mission, can be regarded as a function only of the vehicle itself. This thesis focuses on the transport mission called long haulage with reference payload, which simulates transportation across a large distance, with few stops and predominantly motorway speeds. A previous thesis by Max af Klintberg [1] assessed how several regression models can be used to evaluate the VECTO FC for the long haulage cycle. That thesis has heavily influenced the methods assessed in this thesis for estimating VECTO FC. New prediction models are tested and their performance compared.

Figure 1.1: The three estimation scenarios
The main objective of this thesis is to investigate how well prediction models can be used to estimate fuel consumption. Estimation of two different values of fuel consumption - VECTO and real world - is assessed. The VECTO fuel consumption describes the average fuel consumption during one simulated transport mission. The real world fuel consumption describes the average fuel consumption in a longer term perspective.

The input for the models should be a set of variables that can be derived from vehicle specifications and/or operational data. The estimation of VECTO is done solely based on vehicle data, and does not consider any operational data. Estimation of real world fuel consumption considers both vehicle and operational data. A scenario in which the real world FC is estimated solely based on vehicle parameters is also assessed.

Consequently, three estimation scenarios can be introduced:
• Estimation scenario A - Estimation of VECTO fuel consumption with vehicle data as input.
• Estimation scenario B - Estimation of real world fuel consumption with vehicle data as input.
• Estimation scenario C - Estimation of real world fuel consumption with vehicle data and operational data as input.
The three estimation scenarios are also illustrated in Figure 1.1.
Chapter 2
Theory
2.1 Preliminary math
This section introduces mathematical concepts which form a basis for the following sections. Methods for describing properties in a quantifiable manner, measuring similarity between observations, and modelling estimation error are explained.
2.1.1 Basic statistical measures
Let X_1, X_2, ..., X_n be random variables, forming a vector

    X = (X_1, X_2, ..., X_n).    (2.1)

Let x_ij be the i:th independent observation of the j:th random variable. Assume that m ∈ N independent observations have been drawn for all n ∈ N random variables. These observations then form a data set X = {x_i ∈ R^n, i = 1, ..., m}. Further, let x̂_j be the sample mean and s_j the sample standard deviation of the j:th random variable, obtained as described in Appendix B.
The sample covariance q_jk of the j:th and k:th random variables is defined as

    q_{jk} = \frac{1}{m-1} \sum_{i=1}^{m} (x_{ij} - \hat{x}_j)(x_{ik} - \hat{x}_k).    (2.2)

The Pearson correlation coefficient r can then be computed using

    r_{jk} = \frac{q_{jk}}{s_j s_k}.    (2.3)

The Pearson correlation coefficient takes a value in the range [−1, 1], where 1 is positive linear correlation, −1 is negative linear correlation and 0 is no correlation. The value r_jk² is called the coefficient of determination, which takes a value in the range [0, 1], where 1 is linear correlation and 0 is no correlation.
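As an illustration (not code from the thesis), the sample covariance (Eq. 2.2) and Pearson correlation coefficient (Eq. 2.3) can be sketched with NumPy; the data values below are invented:

```python
import numpy as np

def pearson_r(x, y):
    """Sample covariance (Eq. 2.2) divided by the product of the sample
    standard deviations (Eq. 2.3), all with the 1/(m-1) convention."""
    m = len(x)
    x_hat, y_hat = x.mean(), y.mean()
    q_xy = np.sum((x - x_hat) * (y - y_hat)) / (m - 1)
    s_x = np.sqrt(np.sum((x - x_hat) ** 2) / (m - 1))
    s_y = np.sqrt(np.sum((y - y_hat) ** 2) / (m - 1))
    return q_xy / (s_x * s_y)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.0])

r = pearson_r(x, y)   # close to 1: strong positive linear correlation
r2 = r ** 2           # coefficient of determination
```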
2.1.2 Normalizing
Consider a random variable x_j in a sample, with sample mean x̂_j and sample standard deviation s_j, obtained as in subsection 2.1.1. The sample can be normalized to have a new sample mean x̂'_j and sample standard deviation s'_j. This is done by modifying all observations x_ij according to

    x_{ij} := \frac{x_{ij} - \hat{x}_j}{s_j} s'_j + \hat{x}'_j.    (2.4)

2.1.3 Standardized Euclidean distance
For comparing the similarity of quantitative components of observations, a measure of distance between the observations is needed. A formulation of the distance is derived below.

According to [2], a metric d(x_1, x_2) on the set X is a function such that for all x_1, x_2, x_3 ∈ X, the following holds:

1. d(x_1, x_2) ≥ 0
2. d(x_1, x_2) = 0 if, and only if, x_1 = x_2
3. d(x_1, x_2) = d(x_2, x_1)
4. d(x_1, x_2) ≤ d(x_1, x_3) + d(x_2, x_3).

A metric can then be considered to be the distance between observations. There are many ways to define a metric; the following metrics are relevant for this thesis.
The Euclidean distance between x_1 ∈ R^n and x_2 ∈ R^n is the L_2-norm (also called the Euclidean norm) of the difference x_1 − x_2, i.e. ||x_1 − x_2||_2. The L_2-norm of a vector v ∈ R^n is defined as

    \|v\|_2 = \sqrt{\sum_{j=1}^{n} v_j^2}.    (2.5)

Therefore, the Euclidean distance d_euc(x_1, x_2) is defined as

    d_{euc}(x_1, x_2) = \|x_1 - x_2\|_2 = \sqrt{\sum_{j=1}^{n} (x_{1,j} - x_{2,j})^2}.    (2.6)
Problems can arise when the Euclidean distance is used and the components of the vectors vary in size, since larger components can come to dominate the value of the norm. The standardized Euclidean distance normalizes the components in X and makes them dimensionless. If x̂_j and s_j are the sample mean and sample standard deviation of component j in the data set X, the standardized Euclidean distance d_euc,s(x_1, x_2) is defined as

    d_{euc,s}(x_1, x_2) = \sqrt{\sum_{j=1}^{n} \left( \frac{x_{1,j} - \hat{x}_j}{s_j} - \frac{x_{2,j} - \hat{x}_j}{s_j} \right)^2}.    (2.7)
2.1.4 Prediction error
Consider a set of m observations of one random variable, y = {y_i, i = 1, ..., m}, and a set of predictions of the same variable, ŷ = {ŷ_i, i = 1, ..., m}. The prediction error for each observation is defined as

    \hat{e}_i = \hat{y}_i - y_i    (2.8)

and the relative prediction error is

    \hat{e}_{rel,i} = \frac{\hat{e}_i}{y_i} = \frac{\hat{y}_i - y_i}{y_i}.    (2.9)

The root-mean-squared error (RMSE) is defined as

    RMSE(y) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \hat{e}_i^2}.    (2.10)

The root-mean-squared error of the relative errors is

    RMSE_{rel}(y) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \hat{e}_{rel,i}^2}.    (2.11)
Quantiles
The performance of an estimation model can also be measured by quantiles of the prediction error. A quantile is associated with a probability p, and gives a threshold that the prediction error stays below with probability p. This is useful for measuring the performance with regard to outliers. This thesis uses quantiles for p = 95%, 99% and 99.5%.
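The error measures above can be sketched as follows; the observation and prediction vectors are invented for illustration:

```python
import numpy as np

y = np.array([100.0, 200.0, 150.0, 120.0, 180.0])      # observations
y_hat = np.array([105.0, 190.0, 155.0, 118.0, 200.0])  # predictions

e = y_hat - y          # prediction errors (Eq. 2.8)
e_rel = e / y          # relative prediction errors (Eq. 2.9)

rmse = np.sqrt(np.mean(e ** 2))           # Eq. 2.10
rmse_rel = np.sqrt(np.mean(e_rel ** 2))   # Eq. 2.11

# The 95% quantile of the absolute relative error: a threshold that
# 95% of the prediction errors fall below, useful for judging outliers
q95 = np.quantile(np.abs(e_rel), 0.95)
```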
2.1.5 Categorical variables
Categorical variables are variables that take one of a finite set of possible values. They are not trivially quantifiable, i.e. it is not clear how the values of a categorical variable relate to each other, other than being equal or not. There are several ways to implement categorical variables in a quantitative model.
Dummy variables
Assume that a categorical variable x_c is given, which can take one value from {v_1, ..., v_n}. This categorical variable can be represented by a series of binary variables y_j, j = 1, ..., n − 1 [3], such that

    y_j(x_c) = \begin{cases} 0 & \text{if } x_c = v_1 \text{ or } x_c \neq v_{j+1} \\ 1 & \text{if } x_c = v_{j+1}. \end{cases}    (2.12)

The result is n − 1 binary variables. For example, if x_c is set to the first value v_1, the resulting vector is

    v|_{x_c=v_1} = (0, 0, ..., 0).

If x_c is set to the second or third value:

    v|_{x_c=v_2} = (1, 0, ..., 0)
    v|_{x_c=v_3} = (0, 1, ..., 0).

Exclusivity
Assume that a categorical variable x_c is given, which can take one value from {v_1, ..., v_n}, and that x_c is one component of the vectors x ∈ X. A subset can be created for every value v_1, ..., v_n, and X can then be split into these subsets depending on the categorical variable. This results in n different data sets.

This effectively removes the categorical variable x_c from the data sets, and the resulting data sets are of the form Y = {x_i ∈ R^{n−1}}. The drawback of this method is that some of the resulting data sets may be small, and using this method on several categorical variables will partition the subsets into even smaller subsets.
Penalized distance
Using a penalized distance, d_p(·, ·), is a method of measuring distance between categorical observations. Assume that x_1c and x_2c are two observations of the categorical variable x_c, and that a penalized distance weight w_c > 0 is defined for x_c. The penalized distance between the two observations is then

    d_{pc}(x_{1c}, x_{2c}) = w_c I(x_{1c} \neq x_{2c}) = \begin{cases} w_c & \text{if } x_{1c} \neq x_{2c} \\ 0 & \text{otherwise.} \end{cases}    (2.13)

The concept can be expanded to vectors of categorical variables. Assume that the vectors x_1 ∈ R^n and x_2 ∈ R^n are observations of the same set of categorical variables, and that penalized distance weights w_i, i = 1, ..., n are defined for all variables. The penalized distance between the two categorical observations is then

    d_p(x_1, x_2) = d_{p1}(x_{11}, x_{21}) + d_{p2}(x_{12}, x_{22}) + ... + d_{pn}(x_{1n}, x_{2n})
                  = w_1 I(x_{11} \neq x_{21}) + w_2 I(x_{12} \neq x_{22}) + ... + w_n I(x_{1n} \neq x_{2n}).    (2.14)
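A small sketch of the penalized distance (Eq. 2.13/2.14), assuming the indicator fires when the two categories differ, so that identical observations get distance zero, as the metric axioms require. The vehicle attributes and weights below are invented:

```python
def penalized_distance(x1, x2, w):
    """Penalized distance over categorical vectors (Eq. 2.14): each
    component contributes its weight w_i when the two observations
    differ, and 0 when they agree."""
    return sum(wi if a != b else 0.0 for a, b, wi in zip(x1, x2, w))

# Hypothetical categorical vehicle attributes, invented for illustration
truck_a = ("diesel", "6x2", "sleeper")
truck_b = ("diesel", "4x2", "sleeper")
w = (1.0, 2.0, 0.5)   # penalized distance weights, one per variable

d = penalized_distance(truck_a, truck_b, w)   # only the axle config differs -> 2.0
```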
2.2 Regression
Regression analysis is a branch of statistics that aims to estimate the relationship between variables. A set of independent variables X (input variables) is used to estimate one or more dependent variables Y (output variables).

A regression model f(X) is constructed using a set of training data, which is a collection of previous observations consisting of both independent and dependent variables, where x_i ∈ X and y_i ∈ Y form a pair (x_i, y_i). The regression model is then constructed to make estimates Ŷ of the dependent variable, using the independent variables:

    \hat{y}_i = E(y_i | x_i) = f(x_i).    (2.15)

The above equation combined with Equation 2.8 yields the following expression for the prediction error:

    \hat{e}_i = f(x_i) - y_i.    (2.16)

The aim of regression analysis is to construct the prediction function f so as to minimize the prediction error. Assume a set of m previous observations, so that i = 1, ..., m. A measure of the total prediction error can be formulated as

    E = \frac{1}{m} \sum_{i=1}^{m} |\hat{e}_i|    (2.17)

or

    E = \frac{1}{m} \sum_{i=1}^{m} \hat{e}_i^2.    (2.18)

The process of constructing a regression model can then be formulated as

    \min_{f} \sum_{i=1}^{m} (f(x_i) - y_i)^2,    (2.19)

which is called the method of least squares.

There are different methods for regression analysis. The ones used in this thesis are linear regression, K nearest neighbors regression, multivariate adaptive regression splines, and artificial neural networks.
2.2.1 Linear regression
Assume that there are k independent variables. Linear regression assumes that the relationship between X and Y is linear, i.e. that f(X) is a linear function:

    f(x_i, \beta) = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + ... + \beta_k x_{i,k} = \beta^T a_i    (2.20)

where a_i = (1, x_i^T)^T ∈ R^{k+1}. The parameter vector β ∈ R^{k+1} is obtained through

    \beta = \arg\min_{\beta} \sum_{i=1}^{m} (f(x_i, \beta) - y_i)^2.    (2.21)

Let A ∈ R^{m×(k+1)} be the design matrix, and y a column vector of all the observations of the dependent variable:

    A = \begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{pmatrix} = \begin{pmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_m^T \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}.

The optimization problem in Equation 2.21 can then be written as

    \beta = \arg\min_{\beta} (A\beta - y)^T (A\beta - y) = \arg\min_{\beta} \Phi(\beta).    (2.22)

Simplification of the objective function Φ yields

    \Phi(\beta) = (A\beta - y)^T (A\beta - y)
                = (A\beta)^T A\beta - (A\beta)^T y - y^T (A\beta) + y^T y
                = \beta^T A^T A \beta - 2 (A\beta)^T y + y^T y.    (2.23)

Differentiating twice gives

    \frac{\partial \Phi}{\partial \beta} = 2 A^T A \beta - 2 A^T y    (2.24)

    \frac{\partial^2 \Phi}{\partial \beta^2} = 2 A^T A.    (2.25)

The second derivative is positive semi-definite and therefore Φ is convex. The optimum of Equation 2.21 and Equation 2.22 is then obtained from

    \frac{\partial \Phi}{\partial \beta} = 2 A^T A \beta - 2 A^T y = 0 \;\Rightarrow\; A^T A \beta = A^T y.    (2.26)

If A^T A is invertible, there exists a unique solution

    \beta^* = (A^T A)^{-1} A^T y.    (2.27)

Thus, the prediction for a new observation x_0 can be computed with

    \hat{y}_0 = f(x_0) = \beta^{*T} a_0 = \beta^{*T} (1, x_0^T)^T.    (2.28)
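The least-squares fit (Eq. 2.27) and prediction (Eq. 2.28) can be sketched with NumPy; np.linalg.lstsq solves the same normal equations more stably than forming the inverse of A^T A explicitly. The data values are invented:

```python
import numpy as np

# Training data: m = 5 observations of k = 2 independent variables
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([5.1, 4.9, 11.2, 10.8, 16.0])

# Design matrix with a leading column of ones, rows a_i^T = (1, x_i^T)
A = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares solution beta*; equivalent to (A^T A)^{-1} A^T y (Eq. 2.27)
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Prediction for a new observation x_0 (Eq. 2.28)
x0 = np.array([2.5, 2.5])
y0_hat = beta @ np.concatenate(([1.0], x0))
```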
2.2.2 KNN regression
K nearest neighbors (KNN) regression is a non-parametric regression model, where K ∈ N is a parameter of the model. An observation x_0 is evaluated by measuring the standardized Euclidean distance to the observations in the training data, x_i, i = 1, ..., m. The value of the dependent variable is then set equal to the average over the K nearest observations, i.e. the K observations with the shortest distance.

The distance over the quantitative input variables,

    d_{euc,s}(x_0^{(q)}, x_i^{(q)}),    (2.29)

is obtained as explained in Equation 2.7. The penalized distance is used to evaluate the distance over the categorical input variables x_i^{(c)}, according to Equation 2.14, which yields

    d_p(x_0^{(c)}, x_i^{(c)}).    (2.30)

The total distance between the observations is then

    d_{tot}(x_0, x_i) = d_{euc,s}(x_0^{(q)}, x_i^{(q)}) + d_p(x_0^{(c)}, x_i^{(c)}).    (2.31)

Inverse distance weighting

KNN regression may also implement inverse distance weighting (IDW). The value of the dependent variable o for a new observation x_0 is then a weighted average over the K nearest observations in the training data:

    f(x_0) = \begin{cases} \frac{\sum_{k=1}^{K} w_k(x_0, x_k) o_k}{\sum_{k=1}^{K} w_k(x_0, x_k)} & \text{if } d_{tot}(x_0, x_k) \neq 0 \;\forall k \\ o_i & \text{if } d_{tot}(x_0, x_i) = 0 \text{ for some } i. \end{cases}    (2.32)

In the above equation, x_1, x_2, ..., x_K are the K nearest observations to x_0. The weights are inversely proportional to the distance:

    w_k(x_0, x_k) = \frac{1}{d_{tot}(x_0, x_k)^u}    (2.33)

where u ∈ R_+ is a parameter that can be used to control the magnitude of the distance weighting.
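A compact sketch of KNN regression with inverse distance weighting (Eq. 2.32/2.33). For brevity it uses a plain Euclidean distance instead of the combined standardized-plus-penalized distance d_tot of Eq. 2.31; the data are invented:

```python
import numpy as np

def knn_idw_predict(x0, X_train, y_train, K=3, u=2.0):
    """KNN regression with inverse distance weighting. If an exact match
    exists in the training data (distance zero), its value is returned,
    as in the second case of Eq. 2.32."""
    d = np.sqrt(((X_train - x0) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:K]
    d_k, y_k = d[nearest], y_train[nearest]
    if np.any(d_k == 0):                  # exact match in the training data
        return y_k[d_k == 0].mean()
    w = 1.0 / d_k ** u                    # inverse distance weights (Eq. 2.33)
    return np.sum(w * y_k) / np.sum(w)

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0])

pred = knn_idw_predict(np.array([1.2]), X_train, y_train, K=2)
```

With u = 2, the prediction is pulled strongly toward the closest neighbour, so the result lies between 1 and 2 but much closer to 1.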
2.2.3 Multivariate adaptive regression splines
Multivariate adaptive regression splines (MARS) is a nonlinear regression algorithm introduced in [4]. The algorithm results in an estimation function that is a finite sum of basis functions B_M(·):

    f(x_i) = \sum_{M=1}^{M_{final}} c_M B_M(x_i)    (2.34)

where c_M, for M = 1, ..., M_final, are coefficients. An example of a function fitted to data by MARS can be seen in Figure 2.1.

Figure 2.1: The function f(x) = 0.10968 + 1.1890 · max(0, x − 5.5102) created by MARS to fit the data.

The MARS algorithm consists of two sequential parts:

1. The forward stepwise part finds a set of M_max basis functions.
2. The backward stepwise part tries to reduce the set of basis functions provided by the forward stepwise part, to increase generalizability.

The number of basis functions in the final set, M_final, where 1 ≤ M_final ≤ M_max, is selected as a trade-off between generalizability and fit to the training data.
Basis functions
The original MARS algorithm implements methods to form basis functions of quantitative input variables, as explained in [4]. Furthermore, an approach for managing categorical input variables is given in [5]. These methods are explained here.

The most basic form of basis function considered by MARS is

    B_1(x_i) = 1.    (2.35)

All new basis functions are created as a product of a previously created basis function and some function b(x_ij) of one variable x_j. Assume there are M basis functions already created; basis function M + 1 will then be of the form

    B_{M+1}(x_i) = B_{M_p,M+1}(x_i) \cdot b(x_{ij}).    (2.36)

The function b(x_ij) can assume two different forms, depending on whether x_j is quantitative or categorical.

For quantitative input variables, MARS uses hinge functions with one quantitative variable as a parameter. Assume x_j is a quantitative input variable. The hinge functions are of the form

    b_a(x_{ij}) = h(x_{ij}) = \max(0, x_{ij} - c)    (2.37)

or

    b_b(x_{ij}) = h(x_{ij}) = \max(0, c - x_{ij})    (2.38)

where c is a constant, called the knot. The two above equations are considered a mirrored pair.
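A mirrored pair of hinge functions can be sketched directly; the knot value is taken from the Figure 2.1 example:

```python
def hinge_pair(c):
    """Return a mirrored pair of hinge functions with knot c
    (Eq. 2.37 and Eq. 2.38)."""
    return (lambda x: max(0.0, x - c),   # b_a: active above the knot
            lambda x: max(0.0, c - x))   # b_b: active below the knot

h_a, h_b = hinge_pair(5.5102)

h_a(7.0)   # positive: 7.0 is above the knot
h_b(7.0)   # 0.0: the mirror is inactive there
```

At any point, at most one function of the pair is non-zero, which is what lets MARS model a kink at the knot.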
Assume now that x_j is a categorical input variable, which assumes K_j different values, so that x_ij ∈ C_j where C_j = {C_j1, ..., C_jK_j}. Let A_j1, A_j2, ... be the subsets of C_j. According to the binomial theorem, there are

    2^{K_j}    (2.39)

subsets of a set with K_j elements, as explained in [6]. MARS considers functions with a binary output value, of the form

    b_a(x_j) = I(x_{ij} \in A_{jl})    (2.40)

or

    b_b(x_j) = I(x_{ij} \notin A_{jl})    (2.41)

where A_jl is one of the subsets of C_j and

    I(x_{ij} \in A_{jl}) = \begin{cases} 1 & \text{if } x_{ij} \in A_{jl} \\ 0 & \text{otherwise} \end{cases} \qquad
    I(x_{ij} \notin A_{jl}) = \begin{cases} 1 & \text{if } x_{ij} \notin A_{jl} \\ 0 & \text{otherwise.} \end{cases}
Thus, the basis functions created by MARS are products of 1 and a number of hinge functions, defined in Equation 2.37 and Equation 2.38, and categorical functions, defined in Equation 2.40 and Equation 2.41. The interaction of a basis function is the number of hinge and categorical functions present in the basis function. Some examples of basis functions are:

• B(x_i) = 1, the most basic basis function considered, which has interaction = 0.
• B(x_i) = h_1(x_ij), where h_1(x_ij) = max(0, x_ij − c_1) and c_1 is the knot of h_1. This basis function has interaction = 1.
• B(x_i) = h_1(x_ij) · I(x_ik ∈ A_kl), where A_kl is a subset of the values that the categorical variable x_k can assume, h_1(x_ij) = max(0, x_ij − c_1), and c_1 is the knot of h_1. This basis function has interaction = 2.
• B(x_i) = h_1(x_ij) · h_2(x_it), where h_1(x_ij) = max(0, x_ij − c_1) and h_2(x_it) = max(0, x_it − c_2), and c_1 and c_2 are the knots of h_1 and h_2. This basis function has interaction = 2.

The function in Figure 2.1 has two basis functions, B_1(x_i) = 1 and B_2(x_i) = max(0, x_i1 − 5.5102). B_2 implements a hinge function with a knot at 5.5102.
Finding coefficients
For a set of basis functions B = {B_1(·), B_2(·), ..., B_M(·)}, the coefficients c = (c_1, c_2, ..., c_M) present in Equation 2.34 are found using linear regression. The design matrix is obtained by evaluating the basis functions in B for every observation x_i in the training data:

    A = \begin{pmatrix} B_1(x_1) & B_2(x_1) & \dots & B_M(x_1) \\ B_1(x_2) & B_2(x_2) & \dots & B_M(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ B_1(x_m) & B_2(x_m) & \dots & B_M(x_m) \end{pmatrix}.    (2.42)

The output vector is y = (y_1, y_2, ..., y_m). This results in a linear least-squares problem. If A^T A is invertible, the solution c* for B is obtained using Equation 2.27.
Forward stepwise part
The forward stepwise part of MARS creates a basis function set B containing M_max basis functions. The set B increases in size as the algorithm proceeds. At the start of the algorithm, the set contains only the basis function B_1(x_i) = 1. New basis functions created by MARS are products of one of the basis functions already in B and a hinge function or categorical function. Furthermore, basis functions are added in pairs, i.e. two basis functions are added simultaneously.

Consider B = {B_1(·), B_2(·), ..., B_M(·)}, i.e. there are currently M basis functions, and assume M < M_max. A pair of new basis functions, B_{M+1}(·) and B_{M+2}(·), should now be added. The new pair is based on the same parent basis function B_{M_p,M+1}(·) ∈ B and appends a mirrored pair b_{a,M+1} and b_{b,M+1}, so that

    B_{M+1}(x_i) = B_{M_p,M+1}(x_i) \, b_{a,M+1}(x_{i,j_{M+1}})
    B_{M+2}(x_i) = B_{M_p,M+1}(x_i) \, b_{b,M+1}(x_{i,j_{M+1}})

where x_{j_{M+1}} is one of the variables x_j, j = 1, ..., n. If x_{j_{M+1}} is quantitative, the new basis functions take the form

    B_{M+1}(x_i) = B_{M_p,M+1}(x_i) \max(0, x_{i,j_{M+1}} - c_{M+1})
    B_{M+2}(x_i) = B_{M_p,M+1}(x_i) \max(0, c_{M+1} - x_{i,j_{M+1}})

for some knot c_{M+1} used for this pair. If x_{j_{M+1}} is categorical, the new basis functions take the form

    B_{M+1}(x_i) = B_{M_p,M+1}(x_i) \, I(x_{i,j_{M+1}} \in A_{M+1})
    B_{M+2}(x_i) = B_{M_p,M+1}(x_i) \, I(x_{i,j_{M+1}} \notin A_{M+1})

where A_{M+1} is one of the subsets A_j1, A_j2, ... of C_j.

Thus, for every new pair B_{M+1}(·) and B_{M+2}(·), the forward stepwise algorithm must determine:

• B_{M_p,M+1}(x_i) - one of the basis functions already in B.
• x_{j_{M+1}} - the variable to use in the mirrored pair b_{a,M+1} and b_{b,M+1}.
• b_{a,M+1}(x_{j_{M+1}}) and b_{b,M+1}(x_{j_{M+1}}) - the mirrored pair of b(·)-functions to use. If x_{j_{M+1}} is
  - quantitative: b_{a,M+1} and b_{b,M+1} are hinge functions as described in Equation 2.37 and Equation 2.38. The knot location c_{M+1} has to be determined.
  - categorical: b_{a,M+1} and b_{b,M+1} are categorical functions as described in Equation 2.40 and Equation 2.41. The set A_{M+1} has to be determined.

Determining the above values is an iterative process. MARS considers every combination of B_{M_p,M+1}(x_i) ∈ B and x_{j_{M+1}}. Furthermore, the algorithm considers a range of different ways to formulate b_{a,M+1}(x_{i,j_{M+1}}) and b_{b,M+1}(x_{i,j_{M+1}}):

• If x_{j_{M+1}} is quantitative, then the b_{M+1}(·) are a pair of hinge functions as described in Equation 2.37 and Equation 2.38. There is an infinite number of possible knot locations; however, MARS considers every observation x_{i,j_{M+1}}, i = 1, ..., m in the training data as a possible location for the knot.
• If x_{j_{M+1}} is categorical, then the b_{M+1}(·) are a pair of categorical functions as described in Equation 2.40 and Equation 2.41. All possible subsets A_j1, A_j2, ... of C_j are candidates to form A_{M+1}.

For each of these combinations, two candidate basis functions B̂_1 and B̂_2 are created:

    \hat{B}_1(x_i) = B_{M_p,M+1}(x_i) \, b_{a,M+1}(x_{i,j_{M+1}})
    \hat{B}_2(x_i) = B_{M_p,M+1}(x_i) \, b_{b,M+1}(x_{i,j_{M+1}}).

MARS then considers the set consisting of B, B̂_1 and B̂_2:

    \hat{B} = \{B_1, B_2, ..., B_M, \hat{B}_1, \hat{B}_2\}    (2.43)

and thereafter solves the linear least-squares problem with the design matrix provided by Equation 2.42 given B̂, which yields the coefficients ĉ ∈ R^{M+2} corresponding to B̂. Thereafter the RMSE with respect to the training data is calculated for the current solution.

When the RMSE has been computed for every combination, the pair B̂_1, B̂_2 associated with the smallest RMSE is selected and added to B. Hence M := M + 2. The procedure repeats until M ≥ M_max.
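The forward stepwise search can be sketched as a brute-force loop. This simplified version handles only quantitative variables with hinge pairs, ignores the speed-ups discussed later, and is not the thesis's implementation:

```python
import numpy as np

def forward_pass(X, y, M_max=5):
    """Simplified MARS forward pass: every basis function already in the
    set is a candidate parent, every training value a candidate knot, and
    the (parent, variable, knot) combination with the lowest training
    RMSE wins. Pairs of mirrored hinge functions are added until the set
    holds M_max basis functions."""
    m, n = X.shape
    basis = [lambda x: 1.0]                          # B_1(x) = 1
    coef = np.array([y.mean()])
    while len(basis) < M_max:
        best = None
        for parent in basis:
            for j in range(n):
                for c in X[:, j]:
                    cand = basis + [
                        lambda x, p=parent, j=j, c=c: p(x) * max(0.0, x[j] - c),
                        lambda x, p=parent, j=j, c=c: p(x) * max(0.0, c - x[j]),
                    ]
                    A = np.array([[B(x) for B in cand] for x in X])
                    cc, *_ = np.linalg.lstsq(A, y, rcond=None)
                    rmse = np.sqrt(np.mean((A @ cc - y) ** 2))
                    if best is None or rmse < best[0]:
                        best = (rmse, cand, cc)
        _, basis, coef = best                        # keep the winning pair
    return basis, coef

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = np.maximum(0.0, X[:, 0] - 5.0)   # a hinge-shaped target, invented
basis, coef = forward_pass(X, y, M_max=3)
```

Because the target itself is a hinge, the search places the knot near 5 and the three-function model fits almost exactly, which illustrates why the exhaustive knot search is effective but expensive.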
Backward stepwise part
The algorithm featured in the forward stepwise part is a greedy algorithm and results in overfitting of the model given by B, meaning that the model has adapted to noise in the training data and does not generalize well. Therefore, the backward stepwise part attempts to remove basis functions from B. Removing these functions will increase the prediction error on the training data; however, the basis functions that give the smallest increase in RMSE are removed one at a time. The final basis function set B* is selected as a trade-off between generalizability and fit to the training data.

The algorithm proceeds as follows:

1. The initial set is B_1 := B, obtained from the forward stepwise algorithm.

2. The basis function that can be removed with the smallest increase in lack of fit (LOF) is removed from the previous set B_M. The resulting set of basis functions is B_{M+1}. The process is repeated until there is a set that contains only one basis function. The result is a range of basis function sets B_1, B_2, ..., B_{M_max}, where N(M) = M_max − M + 1 is the number of basis functions in B_M.

3. The final set B_final is determined using generalized cross-validation (GCV). Let RMSE_M be the root-mean-squared error obtained from performing regression with the basis function set B_M on the training data. Furthermore, MARS defines the effective number of parameters (ENOF) of B_M as

    ENOF_M = N(M) + \psi \, \frac{N(M) - 1}{2}    (2.44)

where ψ = 2.3 is a penalty, which is fixed in this thesis. The GCV for B_M is then

    GCV_M = \frac{RMSE_M^2}{(1 - ENOF_M / m)^2}.    (2.45)

The final set of basis functions is the set that has the lowest GCV:

    M^* = \arg\min_{M \in \{1,...,M_{max}\}} GCV_M,    (2.46)
    B^* := B_{M^*}, \quad M_{final} := N(M^*).

Given B*, the corresponding optimal coefficients c* = (c_1, c_2, ..., c_{M_final}) are obtained using linear least-squares fitting with the design matrix provided in Equation 2.42.
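The GCV selection (Eq. 2.44/2.45) can be sketched as below, assuming the denominator uses the number of training observations m; the candidate RMSE values are invented to show that GCV can prefer a smaller basis set than the one with the lowest training error:

```python
import numpy as np

def gcv(rmse, n_basis, m, psi=2.3):
    """Generalized cross-validation score: the training RMSE is inflated
    by a penalty on the effective number of parameters (Eq. 2.44), so
    larger basis-function sets must earn their extra complexity."""
    enof = n_basis + psi * (n_basis - 1) / 2
    return rmse ** 2 / (1.0 - enof / m) ** 2

m = 200   # invented number of training observations

# Candidate sets from the backward pass as (N(M), RMSE_M) pairs: the
# training RMSE always improves with more basis functions
candidates = [(1, 4.0), (3, 2.1), (5, 2.0), (9, 1.98)]

scores = [gcv(r, n, m) for n, r in candidates]
best = candidates[int(np.argmin(scores))]   # GCV picks 5 basis functions,
                                            # not the 9-function set
```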
Optimization
The forward stepwise algorithm may be very computationally demanding. For every new basis function to add to B, the algorithm considers every combination of:

• the basis functions already in B;
• every input variable x_j;
• every point x_ij in the training data if x_j is quantitative, or every subset of C_j if x_j is categorical.

For every such combination, the MARS algorithm solves the linear problem resulting from Equation 2.42. Solving the problem using the normal-equation solution given in Equation 2.27 is computationally fast, but since the number of columns in the design matrix is proportional to the number of basis functions in B̂, the computation time for solving each linear problem increases as B grows.

The computational time for the entire forward stepwise part grows rapidly with M_max and the training data size, as described in [7]. Furthermore, the computation time grows exponentially with the number of values K_j of every categorical variable x_j, according to Equation 2.39.

Several methods of reducing the amount of computation in the forward stepwise part are presented in [7]. These include:

• Introducing an interaction limit. This limits the interaction of basis functions to a certain positive integer. Basis functions in B whose interaction equals the interaction limit are not considered as factors in new basis functions.
• Introducing a cutoff count and a priority queue for which basis functions in B to consider. The basis functions in B are ordered in descending order according to their latest computed improvement of fit - the reduction in LOF. The algorithm then only considers the basis functions at the top of the priority queue. The cutoff count is the number of basis functions considered.
• Ignoring categorical parameters with many subsets.
• Only considering some of the points x_ij of a quantitative parameter x_j as possible knot locations for hinge functions. The points should be selected so that they are evenly distributed over the range of values of the variable.
2.3 Artificial neural network
Artificial neural networks (ANNs) are systems of computation in the field of AI that can be used for regression and classification. Inspired by biological neural networks, which exist in brains, artificial neural networks have a set of artificial neurons distributed in different layers. There is one input layer and one output layer. Between them, there can also be a number of hidden layers. Signals are passed from the input layer through the network to the output layer. The neurons are connected by synapses, each of which has a synaptic weight that signifies how the transported value is amplified between the two neurons connected by the synapse. The neural network can learn by adjusting the synaptic weights to produce certain outputs depending on the inputs. This section describes the type of artificial neural network considered in this thesis.
2.3.1 Artificial neuron
A neuron that has $n$ input synapses receives input values $x_1, x_2, \ldots, x_n$ and produces one output value. The procedure is described in Figure 2.2.
Figure 2.2: An artificial neuron
The following components are used by the neuron:
1. Inputs $x_1, x_2, \ldots, x_n$. There is one input per input synapse, i.e. one input from every neuron in the previous layer.
2. Synapse weights $w_1, w_2, \ldots, w_n$. The weights of the input synapses.
3. Total net input $u$, the weighted sum of all the inputs:
$$u = \sum_{i=1}^{n} w_i x_i. \tag{2.47}$$
4. Output value $y$. The total net input is transformed using an activation function $g$ to produce the output:
$$y = g(u). \tag{2.48}$$
The output of the neuron can then be defined as
$$y(x_1, x_2, \ldots, x_n) = g\left(\sum_{i=1}^{n} w_i x_i\right). \tag{2.49}$$
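The computation in Equations 2.47–2.49 can be illustrated with a minimal sketch; the function name and example values are illustrative, not from the thesis.

```python
import math

def neuron_output(weights, inputs, g):
    """One artificial neuron: y = g(sum_i w_i * x_i), Equations 2.47-2.49."""
    u = sum(w * x for w, x in zip(weights, inputs))  # total net input, Eq. 2.47
    return g(u)                                      # activation, Eq. 2.48

# Example: two inputs with the logistic activation function.
logistic = lambda u: 1.0 / (1.0 + math.exp(-u))
y = neuron_output([0.5, -0.25], [1.0, 2.0], logistic)  # u = 0, so y = 0.5
```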
Activation function
The activation function $g : \mathbb{R} \to \mathbb{R}$ is a function that transforms the total net input. The choice of activation function for an artificial neural network (ANN) is directly linked to the performance of the network. Non-linear activation functions give ANNs their ability to work with non-linear patterns.
Some common activation functions are
• Logistic function $g(u) = \sigma(u) = \dfrac{1}{1 + e^{-u}}$
• Hyperbolic tangent $g(u) = \tanh(u) = \dfrac{e^u - e^{-u}}{e^u + e^{-u}}$
• Identity function $g(u) = u$
• Rectified linear unit (ReLU) $g(u) = \begin{cases} u & \text{for } u \geq 0 \\ 0 & \text{for } u < 0 \end{cases}$
• Leaky rectified linear unit (L-ReLU) $g(u) = \begin{cases} u & \text{for } u \geq 0 \\ 0.01u & \text{for } u < 0 \end{cases}$
More information about activation functions can be found in Appendix E.
2.3.2 Artificial neural network
The type of ANN considered in this thesis is called a feed-forward artificial neural network. The neurons are organized into layers. All neurons in the hidden layers and the output layer are called receiving neurons. All receiving neurons are connected through synapses to every neuron in the previous layer, so that the outputs of a layer are the inputs of the next layer, as can be seen in Figure 2.3. Every synapse has a specific weight associated with it. Furthermore, all neurons in the hidden layers and the output layer have a transfer function associated with them. It is possible to have different transfer functions for different neurons in the same ANN; however, this thesis only considers one transfer function for all hidden layer neurons and one transfer function for the output layer neurons. This concept is explained further in subsection 2.3.4.
Bias neuron
It may be beneficial to add a constant term to the ANN. This is done by adding a neuron with a constant output, called a bias neuron. All neurons in the hidden layers and the output layer are connected to the bias neuron by regular synapses. This allows the weight of the constant term to be adjusted in the same way as other weights in the ANN. The output value of the bias neuron is usually set to $y_b = 1$. An example of an ANN with a bias neuron can be seen in Figure 2.4.
Figure 2.3: Example of an artificial neural network with 3 input values, 1 output value and one hidden layer with 4 neurons. The arrows represent synapses. Every synapse has an individual weight $w$ associated with it.
2.3.3 Input
The input values of an ANN can be normalized as explained in subsection 2.1.2. According to [8], this increases the computational efficiency of the learning process.
Categorical input variables are managed using the dummy variable approach explained in subsection 2.1.5.
2.3.4 Output
Many activation functions have a limited codomain. For example:
• The codomain of the logistic function $\sigma(u) = \dfrac{1}{1 + e^{-u}}$ is described by $0 < \sigma(u) < 1 \;\forall\, u \in \mathbb{R}$.
• The codomain of the hyperbolic tangent function $\tanh(u)$ is described by $-1 < \tanh(u) < 1 \;\forall\, u \in \mathbb{R}$.
The output values of a pattern that the ANN is to learn must be in the codomain of the transfer functions of the output layer. To achieve this, two methods are considered:
Figure 2.4: Example of an artificial neural network that features a bias neuron.
1. Normalizing the output data so it is distributed in a range compatible with the codomain of the output transfer functions. The normalization is performed as described in subsection 2.1.2.
2. Setting the transfer function of the output layer neurons to some function $g(u)$ whose codomain includes the range of the output values. The identity function $g(u) = u$ can conveniently be used, as its codomain is $\mathbb{R}$.
2.3.5 Learning
The learning process of an ANN is done using the backpropagation algorithm (BPA). Using this algorithm, the network adapts to the training data by adjusting the synaptic weights.
Assume an ANN created for a set of data consisting of $n_{\mathrm{in}}$ input variables and $n_{\mathrm{out}}$ output variables, and which contains $m$ observations. Let $v_i \in \mathbb{R}^{n_{\mathrm{in}}}$ and $o_i \in \mathbb{R}^{n_{\mathrm{out}}}$ be the input and output vectors for the $i$:th observation. There are therefore $n_{\mathrm{in}}$ neurons in the input layer and $n_{\mathrm{out}}$ neurons in the output layer. There are $N_H$ hidden layers. The numbers of neurons in the hidden layers are $n_{h_i}$, $1 \leq i \leq N_H$. The total number of layers is $N = N_H + 2$.
The number of neurons in all layers is described by the set
$$n = \{n_{\mathrm{in}}, n_{h_1}, n_{h_2}, \ldots, n_{h_{N_H}}, n_{\mathrm{out}}\}. \tag{2.50}$$
The activation functions for the hidden layer neurons and output layer neurons are $g_h(u)$ and $g_o(u)$ respectively. Furthermore, assume $g_h(u)$ and $g_o(u)$ are $C^0$ or of a higher level of differentiability.
The BPA can be summarized by the following stages:
1. Forward pass - The procedure of sending values through the ANN to obtain an estimate of the output values and a metric of the prediction error.
2. Backward pass - The process of finding the influence of the synaptic weights on the prediction error.
3. Updating synaptic weights using the Gradient descent algorithm.
For this subsection, the following notation is introduced:
• $L_H = \{2, 3, \ldots, N-1\}$ is the collection of hidden layers. $L = 1$ represents the input layer and $L = N$ represents the output layer.
• $n^{(L)}$ is the number of neurons in layer $L$, indexed as the set in Equation 2.50.
• $y^{(L)}$ is a vector consisting of the current output values of the neurons in layer $L$. Consequently, $y_j^{(L)}$ is the output value of the $j$:th neuron in layer $L$.
• $\hat{o}_i$ are the output values generated by the ANN for the $i$:th observation, that is, the estimations of $o_i$ made with the input $v_i$.
• $u_j^{(L)}$ is the total net input of the $j$:th neuron in layer $L$.
• $w_{jk}^{(L)}$ is the synaptic weight of the synapse from neuron $j$ in layer $L-1$ to neuron $k$ in layer $L$.
• If the ANN features a bias neuron, $w_{bk}^{(L)}$ is the synaptic weight of the synapse from the bias neuron to neuron $k$ in layer $L$.
Cost function
A cost function is defined to represent the error of prediction. The BPA is used to minimize the cost function. The cost function used in this thesis is the Quadratic cost function.
Assume that the ANN has been used to predict the $i$:th observation, where the target values are $o_i \in \mathbb{R}^{n_{\mathrm{out}}}$ and the predicted values are $\hat{o}_i \in \mathbb{R}^{n_{\mathrm{out}}}$. The quadratic cost function is defined as
$$\phi(o_i, \hat{o}_i) = \frac{1}{2} \sum_{k=1}^{n_{\mathrm{out}}} (\hat{o}_{ik} - o_{ik})^2. \tag{2.51}$$
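As a brief sketch, the quadratic cost of Equation 2.51 in code (function name illustrative):

```python
def quadratic_cost(o, o_hat):
    """phi(o_i, o_hat_i) = 0.5 * sum_k (o_hat_ik - o_ik)^2, Equation 2.51."""
    return 0.5 * sum((oh - t) ** 2 for oh, t in zip(o_hat, o))

cost = quadratic_cost([1.0, 2.0], [1.0, 4.0])  # 0.5 * (0 + 4) = 2.0
```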
Forward pass
The forward pass is the process of sending values through the ANN from the input layer to the output layer. Before the forward pass, an input vector $v_i$ is assigned to the input neurons, so that
$$y^{(1)} := v_i.$$
Thereafter, all receiving neuron layers compute their output values $y_j^{(L)}$ according to Equation 2.49, sequentially, starting with the first receiving layer, i.e. layer $L = 2$. At this point, the following holds for the input and output vectors:
$$\hat{o}_i = y^{(N)}. \tag{2.52}$$
The estimations for the $i$:th observation are thereby obtained, and the prediction error $\phi(o_i, \hat{o}_i)$ can be calculated using Equation 2.51.
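The forward pass can be sketched for a fully connected feed-forward network. This assumes each layer is represented by a weight matrix whose row $k$ holds the weights into receiving neuron $k$; the representation and names are illustrative, not from the thesis.

```python
def forward_pass(v, layers, g_h, g_o):
    """Propagate input vector v through the network (Equation 2.49 per
    neuron); returns the output vector o_hat = y^(N), Equation 2.52."""
    y = list(v)  # y^(1) := v_i
    for idx, W in enumerate(layers):
        g = g_o if idx == len(layers) - 1 else g_h  # output vs hidden layer
        y = [g(sum(w * x for w, x in zip(row, y))) for row in W]
    return y

# Example with identity activations: 2 inputs, one hidden layer of
# 2 neurons, and 1 output neuron.
o_hat = forward_pass([1.0, 2.0],
                     [[[1.0, 1.0], [0.0, 1.0]],   # hidden layer weights
                      [[1.0, 1.0]]],              # output layer weights
                     g_h=lambda u: u, g_o=lambda u: u)  # o_hat == [5.0]
```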
Backward pass
The objective of the backward pass is to find the partial derivative $\partial\phi/\partial w_{jk}^{(L)}$ for all synaptic weights. To find these partial derivatives, the chain rule is applied. The derivatives are computed sequentially, starting with the final layer. The formulation of the chain rule product used for all synaptic weights is
$$\frac{\partial\phi}{\partial w_{jk}^{(L)}} = \frac{\partial\phi}{\partial y_k^{(L)}} \cdot \frac{\partial y_k^{(L)}}{\partial u_k^{(L)}} \cdot \frac{\partial u_k^{(L)}}{\partial w_{jk}^{(L)}}. \tag{2.53}$$
For the synapses leading into the output layer, $L = N$, Equation 2.53 yields
$$\frac{\partial\phi}{\partial w_{jk}^{(N)}} = \frac{\partial\phi}{\partial y_k^{(N)}} \cdot \frac{\partial y_k^{(N)}}{\partial u_k^{(N)}} \cdot \frac{\partial u_k^{(N)}}{\partial w_{jk}^{(N)}}. \tag{2.54}$$
For the first partial derivative, the following holds, as explained in Equation 2.52:
$$\frac{\partial\phi}{\partial y_k^{(N)}} = \left.\frac{\partial\phi}{\partial \hat{o}_{ik}}\right|_{\hat{o}_{ik} = y_k^{(N)}}. \tag{2.55}$$
Differentiation yields
$$\frac{\partial\phi}{\partial \hat{o}_{ik}} = \frac{\partial}{\partial \hat{o}_{ik}}\left[\frac{1}{2}\sum_{k=1}^{n_{\mathrm{out}}} (\hat{o}_{ik} - o_{ik})^2\right] = \frac{\partial}{\partial \hat{o}_{ik}}\,\frac{1}{2}(\hat{o}_{ik} - o_{ik})^2 = \hat{o}_{ik} - o_{ik} \;\Rightarrow\; \frac{\partial\phi}{\partial y_k^{(N)}} = y_k^{(N)} - o_{ik}. \tag{2.56}$$
The second partial derivative is given by differentiating Equation 2.48, using the transfer function $g_o(u)$. This yields:
$$\frac{\partial y_k^{(N)}}{\partial u_k^{(N)}} = \left.\frac{dg_o}{du}\right|_{u = u_k^{(N)}}. \tag{2.57}$$
The third partial derivative $\partial u_k^{(N)}/\partial w_{jk}^{(N)}$ can be obtained by differentiating Equation 2.47, according to
$$\frac{\partial u_k^{(N)}}{\partial w_{jk}^{(N)}} = \frac{\partial}{\partial w_{jk}^{(N)}}\left[\sum_{i=1}^{n} w_i y_i^{(N-1)}\right] = y_j^{(N-1)} \tag{2.58}$$
where $y_j^{(N-1)}$ is the output of the neuron in the preceding layer that sends its output through the current synapse.
By considering Equation 2.56, Equation 2.57 and Equation 2.58, all partial derivatives of Equation 2.54 are known and the procedure can be used to calculate $\partial\phi/\partial w_{jk}^{(L)}$ for $L = N$.
For partial derivatives of synaptic weights leading into a hidden layer, i.e. for $L \in L_H$, Equation 2.53 becomes
$$\frac{\partial\phi}{\partial w_{jk}^{(L)}} = \frac{\partial\phi}{\partial y_k^{(L)}} \cdot \frac{\partial y_k^{(L)}}{\partial u_k^{(L)}} \cdot \frac{\partial u_k^{(L)}}{\partial w_{jk}^{(L)}} \quad \forall\, L \in L_H. \tag{2.59}$$
The second and third partial derivatives of the above product are obtained in a manner consistent with the output layer,
$$\frac{\partial y_k^{(L)}}{\partial u_k^{(L)}} = \left.\frac{dg_h}{du}\right|_{u = u_k^{(L)}} \tag{2.60}$$
and
$$\frac{\partial u_k^{(L)}}{\partial w_{jk}^{(L)}} = \frac{\partial}{\partial w_{jk}^{(L)}}\left[\sum_{i=1}^{n} w_i y_i^{(L-1)}\right] = y_j^{(L-1)} \tag{2.61}$$
for all $L \in L_H$. Because the backward pass starts with the last layer, all partial derivatives for layer $L + 1$ are known. This can be used to calculate the partial derivative $\partial\phi/\partial y_k^{(L)}$ $\forall\, L \in L_H$:
$$\frac{\partial\phi}{\partial y_k^{(L)}} = \sum_{l=1}^{n^{(L+1)}} w_{kl}^{(L+1)} \cdot \frac{\partial\phi}{\partial u_l^{(L+1)}} = \sum_{l=1}^{n^{(L+1)}} w_{kl}^{(L+1)} \cdot \frac{\partial\phi}{\partial y_l^{(L+1)}} \cdot \frac{\partial y_l^{(L+1)}}{\partial u_l^{(L+1)}} \quad \forall\, L \in L_H, \tag{2.62}$$
where $n^{(L+1)}$ is the number of neurons in layer $L + 1$. The partial derivatives $\partial\phi/\partial y_l^{(L+1)}$ and $\partial y_l^{(L+1)}/\partial u_l^{(L+1)}$ have previously been obtained, and Equation 2.62 can be computed.
By considering Equation 2.62, Equation 2.60 and Equation 2.61, all partial derivatives of Equation 2.59 are known and the procedure can be used to calculate $\partial\phi/\partial w_{jk}^{(L)}$ for all $L \in L_H$.
If the ANN has a bias neuron, the synaptic weights $w_{bk}^{(L)}$ are updated in the same way as the synaptic weights $w_{jk}^{(L)}$.
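The backward pass can be sketched for the special case of one hidden layer with logistic $g_h$, identity $g_o$, and the quadratic cost. The vectorized layout (one weight matrix per layer) and all names are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def backward_pass(v, o, W_h, W_o):
    """Gradients of phi w.r.t. all weights for one observation (v, o).
    W_h: hidden-layer weights (n_h x n_in); W_o: output weights (n_out x n_h).
    Assumes g_h = logistic, g_o = identity, cost as in Equation 2.51."""
    # Forward pass, keeping layer outputs (needed by the chain rule, Eq. 2.53).
    y_h = 1.0 / (1.0 + np.exp(-(W_h @ v)))   # hidden outputs
    o_hat = W_o @ y_h                        # network outputs (Eq. 2.52)
    # Output layer: dphi/dy = y - o (Eq. 2.56) and dg_o/du = 1 (Eq. 2.57).
    delta_o = o_hat - o
    grad_W_o = np.outer(delta_o, y_h)        # Eq. 2.58: multiply by y_j^(N-1)
    # Hidden layer: propagate the error through the output weights (Eq. 2.62);
    # the logistic derivative is y_h * (1 - y_h) (Eq. 2.60).
    delta_h = (W_o.T @ delta_o) * y_h * (1.0 - y_h)
    grad_W_h = np.outer(delta_h, v)          # Eq. 2.61
    return grad_W_h, grad_W_o
```

The returned gradients are the quantities $\partial\phi/\partial w_{jk}^{(L)}$ needed by the gradient descent update described next in the text.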
Gradient descent
After the backward pass is completed, all partial derivatives are known.
Gradient descent is used to update the synaptic weights. The method is explained in detail in [9]. The weights are updated according to
$$w_{jk}^{(L)} := w_{jk}^{(L)} - \eta\,\Delta w_{jk}^{(L)}, \quad \text{where} \quad \Delta w_{jk}^{(L)} = \frac{\partial\phi}{\partial w_{jk}^{(L)}} \tag{2.63}$$
for all $L$, $j$ and $k$. The value $\eta$ is called the step size, and is central to the gradient descent method. Specifically for ANNs, $\eta$ is usually called the learning rate. The choice of learning rate greatly affects training performance: a high value may result in divergence and failure of the algorithm, whereas a low value may result in entrapment in local minima or slow learning.
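The update of Equation 2.63 can be sketched for one layer's weights stored as a nested list; the layout and names are illustrative only.

```python
def gradient_descent_step(weights, gradients, eta):
    """w := w - eta * dphi/dw (Equation 2.63), element-wise for one layer."""
    return [[w - eta * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, gradients)]

# One update with learning rate eta = 0.1.
updated = gradient_descent_step([[1.0, 2.0]], [[0.5, -1.0]], eta=0.1)
```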
2.3.6 Batch learning
During training, the observations in the training data may be used multiple times. Each complete cycle of the entire training data through the BPA is called a training epoch.
The BPA can also be defined so that the forward pass and backward pass for several observations in the training data are performed before the synaptic weights are updated. This is called batch learning. Let $\gamma \geq 1$ be the batch size. During batch learning, a sample of $\gamma$ observations is randomly selected from the training data. The forward pass and backward pass are performed for these observations. The weights are then updated using gradient descent as defined in Equation 2.63, but instead letting $\Delta w_{jk}^{(L)}$ be the average of $\partial\phi/\partial w_{jk}^{(L)}$