Estimating fuel consumption using regression and machine learning

LUKAS EKSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES
STOCKHOLM, SWEDEN 2018
Degree Project in Systems Engineering (30 ECTS credits)
Degree Programme in Aerospace Engineering (120 credits)
KTH Royal Institute of Technology, 2018
Supervisor at Scania: Henrik Wentzel
Supervisor at KTH: Johan Karlsson
Examiner at KTH: Johan Karlsson
TRITA-SCI-GRU 2018:378 MAT-E 2018:80
Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci
Abstract

This thesis focuses on investigating the usage of statistical models for estimating the fuel consumption of heavy duty vehicles. Several statistical models are assessed, along with machine learning using artificial neural networks.

Data recorded by sensors on board trucks in the EU describe the operational usage of the vehicle. The usage of this data for estimating the fuel consumption is assessed, and several variables originating from the operational data are modelled and tested as possible input parameters.

The estimation model for real world fuel consumption uses 8 parameters describing the operational usage of the vehicles and 8 parameters describing the vehicles themselves. The operational parameters describe the average speed, topography, variation of speed, idling, and more. This model has an average relative error of 5.75%, with a prediction error less than 11.14% for 95% of all tested vehicles.

When only vehicle parameters are considered, it is possible to make predictions with an average relative error of 9.30%, with a prediction error less than 19.50% for 95% of all tested vehicles.

Furthermore, a computer software called the Vehicle Energy Consumption Calculation tool (VECTO) must be used to simulate the fuel consumption for all heavy duty vehicles, according to legislation by the EU. Running VECTO is a slow process, and this thesis also investigates how well statistical models can be used to quickly estimate the VECTO fuel consumption. The model estimates VECTO fuel consumption with an average relative error of 0.32% and with a prediction error less than 0.65% for 95% of all tested vehicles.
Sammanfattning

This report focuses on investigating the use of statistical models for estimating the fuel consumption of heavy vehicles. Several statistical models are evaluated, together with machine learning using artificial neural networks.

Data recorded by sensors on board Scania trucks in the EU describe the operation of the vehicles. The use of these data for estimating fuel consumption is examined, and several variables derived from the operational data are modelled and tested as possible input parameters.

The estimation model for real-world fuel consumption uses 8 parameters describing the usage of the vehicle and 8 parameters describing the vehicle itself, including average speed, topography, speed variation and share of idling. This model has an average relative error of 5.75%, with a prediction error of less than 11.14% for 95% of the tested vehicles.

If only vehicle parameters are considered as input, it is possible to make predictions with an average relative error of 9.30%, with a prediction error of less than 19.50% for 95% of the tested vehicles. A computer program called VECTO must be used to simulate the fuel consumption of all heavy vehicles according to EU legislation. Running VECTO is a time-consuming process, and this report also investigates how well statistical models can be used to quickly estimate the VECTO fuel consumption. The model estimates the VECTO fuel consumption with an average relative error of 0.32% and with a prediction error of less than 0.65% for 95% of the tested vehicles.
Acknowledgements

I would like to express my gratitude to Henrik Wentzel, Johan Karlsson, Marcus Forslin, Antonius Kies, and Göran Svensson, for the help and guidance they have given me.
Contents

1 Introduction

2 Theory
  2.1 Preliminary math
    2.1.1 Basic statistical measures
    2.1.2 Normalizing
    2.1.3 Standardized Euclidean distance
    2.1.4 Prediction error
    2.1.5 Categorical variables
  2.2 Regression
    2.2.1 Linear regression
    2.2.2 KNN regression
    2.2.3 Multivariate adaptive regression splines
  2.3 Artificial neural network
    2.3.1 Artificial neuron
    2.3.2 Artificial neural network
    2.3.3 Input
    2.3.4 Output
    2.3.5 Learning
    2.3.6 Batch learning
    2.3.7 Evaluating observations

3 Data
  3.1 Vehicle parameters
  3.2 Operational parameters
    3.2.1 Bin data
    3.2.2 Considered variables
  3.3 Output data
  3.4 Data gathering
    3.4.1 Estimation scenario A
    3.4.2 Estimation scenarios B and C

  4.2 Models
    4.2.1 Linear regression
    4.2.2 KNN regression
    4.2.3 KNN regression with inverse distance weighting
    4.2.4 Multivariate adaptive regression splines (MARS)
    4.2.5 Artificial neural network
  4.3 Choice of input parameters
    4.3.1 Estimation scenario A - VECTO fuel consumption
  4.4 Estimation scenarios B and C - Real world fuel consumption
  4.5 Selection of model

5 Results
  5.1 Estimation scenario A - VECTO fuel consumption
    5.1.1 Selected model
  5.2 Estimation scenario B - Real world fuel consumption using only vehicle parameters
    5.2.1 Selected model
  5.3 Estimation scenario C - Real world fuel consumption using vehicle and operational parameters
    5.3.1 Selected model
  5.4 Results summary
    5.4.1 Comparison of estimation scenarios B and C
    5.4.2 Variable influence

6 Discussion and conclusion
  6.1 Discussion
  6.2 Conclusion

References

A Glossary
  A.1 Acronyms
  A.2 Terms
B Fundamental statistics
C Assumed values for bin data
D Additional tables
E Activation functions
F.2 Estimation scenario B
F.3 Estimation scenario C
G Additional figures
H Code
Chapter 1
Introduction
This report is the result of a thesis project carried out at Scania AB, in cooperation with the Royal Institute of Technology (KTH). The thesis was conducted in the group YDMC at Scania in Södertälje, Sweden, which specializes in analysis of fuel consumption and, by extension, CO2 emissions. It is of great interest to be able to quickly and accurately estimate the fuel consumption (FC) and CO2 emissions of vehicles. Current methods for estimation rely on simulation, which is too slow and too inaccurate. More advanced simulations require several minutes to complete, which is too slow to be implemented in a sales process, and they are still not very accurate. Furthermore, new legislation by the European Union requires all new heavy duty vehicles (HDVs) to be simulated using a software called the Vehicle Energy Consumption Calculation Tool (VECTO). The simulation tool is used for certification of the vehicles. Therefore, it is also of great interest to be able to quickly estimate and deliver the VECTO-certified CO2 emissions to haulage contractors. This paper assesses statistical models for predicting fuel consumption. Several regression models are created and tested. Furthermore, the use of artificial intelligence (AI) in the form of artificial neural networks is also tested. The paper also investigates the usage of operational data - data uploaded from vehicles in use - and their potential role in estimating the real world FC.
When VECTO is used for certification purposes, simulations are performed for a set of predefined transport missions, each with a set of predefined transport payloads. Since the driving patterns are predefined, the FC obtained through VECTO, for a given predefined transport mission, can be regarded as a function only of the vehicle itself. This thesis focuses on the transport mission called long haulage with reference payload, which simulates transportation across a large distance, with few stops and predominantly motorway speeds. A previous thesis by Max af Klintberg [1] assessed how several regression models can be used to evaluate the VECTO FC for the long haulage cycle. That thesis has heavily influenced the methods assessed in this thesis for estimating VECTO FC. New prediction models are tested and their performance compared.

Figure 1.1: The three estimation scenarios
The main objective of this thesis is to investigate how well prediction models can be used to estimate fuel consumption. Estimation of two different values of fuel consumption - VECTO and real world - is assessed. The VECTO fuel consumption describes the average fuel consumption during one simulated transport mission. The real world fuel consumption describes the average fuel consumption in a longer term perspective.

The input for the models should be a set of variables that can be derived from vehicle specifications and/or operational data. The estimation of VECTO is done solely based on vehicle data, and does not consider any operational data. Estimation of real world fuel consumption considers both vehicle and operational data. A scenario in which the real world FC is estimated solely based on vehicle parameters is also assessed.

Consequently, three estimation scenarios can be introduced:
• Estimation scenario A - Estimation of VECTO fuel consumption with vehicle data as input.
• Estimation scenario B - Estimation of real world fuel consumption with vehicle data as input.
• Estimation scenario C - Estimation of real world fuel consumption with vehicle data and operational data as input.
The three estimation scenarios are also illustrated in Figure 1.1.
Chapter 2
Theory
2.1 Preliminary math
This section introduces mathematical concepts which form a basis for the following sections. Methods for describing properties in a quantifiable manner, measuring similarity between observations, and modelling estimation error are explained.
2.1.1 Basic statistical measures
Let X_1, X_2, ..., X_n be random variables, forming a vector

    X = (X_1, X_2, ..., X_n).    (2.1)

Let x_ij be the i:th independent observation of the j:th random variable. Assume that m ∈ N independent observations have been drawn for all n ∈ N random variables. These observations then form a data set X = {x_i ∈ R^n, i = 1, ..., m}. Further, let x̂_j be the sample mean and s_j the sample standard deviation of the j:th random variable, obtained as described in Appendix B.
The sample covariance q_jk of the j:th and k:th random variables is defined as

    q_{jk} = \frac{1}{m-1} \sum_{i=1}^{m} (x_{ij} - \hat{x}_j)(x_{ik} - \hat{x}_k).    (2.2)

The Pearson correlation coefficient r can then be computed using

    r_{jk} = \frac{q_{jk}}{s_j s_k}.    (2.3)

The Pearson correlation coefficient takes a value in the range [−1, 1], where 1 is positive linear correlation, −1 is negative linear correlation and 0 is no correlation. The value r_jk² is called the coefficient of determination, which takes a value in the range [0, 1], where 1 is linear correlation and 0 is no correlation.
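As an illustration (not code from the thesis), the sample covariance (Eq. 2.2) and Pearson correlation coefficient (Eq. 2.3) can be sketched with NumPy; the data values below are invented:

```python
import numpy as np

def pearson_r(x, y):
    """Sample covariance (Eq. 2.2) divided by the product of the sample
    standard deviations (Eq. 2.3), all with the 1/(m-1) convention."""
    m = len(x)
    x_hat, y_hat = x.mean(), y.mean()
    q_xy = np.sum((x - x_hat) * (y - y_hat)) / (m - 1)
    s_x = np.sqrt(np.sum((x - x_hat) ** 2) / (m - 1))
    s_y = np.sqrt(np.sum((y - y_hat) ** 2) / (m - 1))
    return q_xy / (s_x * s_y)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.0])

r = pearson_r(x, y)   # close to 1: strong positive linear correlation
r2 = r ** 2           # coefficient of determination
```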
2.1.2 Normalizing
Consider a random variable x_j in a sample, with sample mean x̂_j and sample standard deviation s_j, obtained as in subsection 2.1.1. The sample can be normalized to have a new sample mean x̂'_j and sample standard deviation s'_j. This is done by modifying all observations x_ij according to

    x_{ij} := \frac{x_{ij} - \hat{x}_j}{s_j} s'_j + \hat{x}'_j.    (2.4)

2.1.3 Standardized Euclidean distance
For comparing the similarity of quantitative components of observations, a measure of distance between the observations is needed. A formulation of the distance is derived below.

According to [2], a metric d(x_1, x_2) on the set X is a function such that for all x_1, x_2, x_3 ∈ X, the following holds:

1. d(x_1, x_2) ≥ 0
2. d(x_1, x_2) = 0 if, and only if, x_1 = x_2
3. d(x_1, x_2) = d(x_2, x_1)
4. d(x_1, x_2) ≤ d(x_1, x_3) + d(x_2, x_3).

A metric can then be considered to be the distance between observations. There are many ways to define a metric; the following metrics are relevant for this thesis.
The Euclidean distance between x_1 ∈ R^n and x_2 ∈ R^n is the L_2-norm (also called the Euclidean norm) of the difference x_1 − x_2, i.e. ||x_1 − x_2||_2. The L_2-norm of a vector v ∈ R^n is defined as

    \|v\|_2 = \sqrt{\sum_{j=1}^{n} v_j^2}.    (2.5)

Therefore, the Euclidean distance d_euc(x_1, x_2) is defined as

    d_{euc}(x_1, x_2) = \|x_1 - x_2\|_2 = \sqrt{\sum_{j=1}^{n} (x_{1,j} - x_{2,j})^2}.    (2.6)
Problems can arise when the Euclidean distance is used and the components of the vectors vary in size, since larger components can come to dominate the value of the norm. The standardized Euclidean distance normalizes the components in X and makes them dimensionless. If x̂_j and s_j are the sample mean and sample standard deviation of component j in the data set X, the standardized Euclidean distance d_euc,s(x_1, x_2) is defined as

    d_{euc,s}(x_1, x_2) = \sqrt{\sum_{j=1}^{n} \left( \frac{x_{1,j} - \hat{x}_j}{s_j} - \frac{x_{2,j} - \hat{x}_j}{s_j} \right)^2}.    (2.7)
2.1.4 Prediction error
Consider a set of m observations of one random variable, y = {y_i, i = 1, ..., m}, and a set of predictions of the same variable, ŷ = {ŷ_i, i = 1, ..., m}. The prediction error for each observation is defined as

    \hat{e}_i = \hat{y}_i - y_i    (2.8)

and the relative prediction error is

    \hat{e}_{rel,i} = \frac{\hat{e}_i}{y_i} = \frac{\hat{y}_i - y_i}{y_i}.    (2.9)

The root-mean-squared error (RMSE) is defined as

    RMSE(y) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \hat{e}_i^2}.    (2.10)

The root-mean-squared error of the relative errors is

    RMSE_{rel}(y) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \hat{e}_{rel,i}^2}.    (2.11)
Quantiles
The performance of an estimation model can also be measured by quantiles of the prediction error. A quantile is associated with a probability p, and gives a threshold that the prediction error stays below with probability p. This is useful for measuring the performance with regard to outliers. This thesis uses quantiles for p = 95%, 99% and 99.5%.
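The error measures above can be sketched as follows; the observation and prediction vectors are invented for illustration:

```python
import numpy as np

y = np.array([100.0, 200.0, 150.0, 120.0, 180.0])      # observations
y_hat = np.array([105.0, 190.0, 155.0, 118.0, 200.0])  # predictions

e = y_hat - y          # prediction errors (Eq. 2.8)
e_rel = e / y          # relative prediction errors (Eq. 2.9)

rmse = np.sqrt(np.mean(e ** 2))           # Eq. 2.10
rmse_rel = np.sqrt(np.mean(e_rel ** 2))   # Eq. 2.11

# The 95% quantile of the absolute relative error: a threshold that
# 95% of the prediction errors fall below, useful for judging outliers
q95 = np.quantile(np.abs(e_rel), 0.95)
```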
2.1.5 Categorical variables
Categorical variables are variables that take one of a finite set of possible values. They are not trivially quantifiable, i.e. it is not clear how the values of a categorical variable relate to each other, other than being equal or not. There are several ways to implement categorical variables in a quantitative model.
Dummy variables
Assume that a categorical variable x_c is given, which can take one value from {v_1, ..., v_n}. This categorical variable can be represented by a series of binary variables y_j, j = 1, ..., n − 1 [3], such that

    y_j(x_c) = \begin{cases} 0 & \text{if } x_c = v_1 \text{ or } x_c \neq v_{j+1} \\ 1 & \text{if } x_c = v_{j+1}. \end{cases}    (2.12)

The result is n − 1 binary variables. For example, if x_c is set to the first value v_1, the resulting vector is

    v|_{x_c=v_1} = (0, 0, ..., 0).

If x_c is set to the second or third value:

    v|_{x_c=v_2} = (1, 0, ..., 0)
    v|_{x_c=v_3} = (0, 1, ..., 0).

Exclusivity
Assume that a categorical variable x_c is given, which can take one value from {v_1, ..., v_n}, and that x_c is one component of the vectors x ∈ X. A subset can be created for every value v_1, ..., v_n, and X can then be split into these subsets depending on the categorical variable. This results in n different data sets.

This effectively removes the categorical variable x_c from the data sets, and the resulting data sets are of the form Y = {x_i ∈ R^{n−1}}. The drawback of this method is that some of the resulting data sets may be small, and using this method on several categorical variables will partition the subsets into even smaller subsets.
Penalized distance
Using a penalized distance, d_p(·, ·), is a method of measuring distance between categorical observations. Assume that x_1c and x_2c are two observations of the categorical variable x_c, and that a penalized distance weight w_c > 0 is defined for x_c. The penalized distance between the two observations is then

    d_{pc}(x_{1c}, x_{2c}) = w_c I(x_{1c} \neq x_{2c}) = \begin{cases} w_c & \text{if } x_{1c} \neq x_{2c} \\ 0 & \text{otherwise.} \end{cases}    (2.13)

The concept can be expanded to vectors of categorical variables. Assume that the vectors x_1 ∈ R^n and x_2 ∈ R^n are observations of the same set of categorical variables, and that penalized distance weights w_i, i = 1, ..., n are defined for all variables. The penalized distance between the two categorical observations is then

    d_p(x_1, x_2) = d_{p1}(x_{11}, x_{21}) + d_{p2}(x_{12}, x_{22}) + ... + d_{pn}(x_{1n}, x_{2n})
                  = w_1 I(x_{11} \neq x_{21}) + w_2 I(x_{12} \neq x_{22}) + ... + w_n I(x_{1n} \neq x_{2n}).    (2.14)
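A small sketch of the penalized distance (Eq. 2.13/2.14), assuming the indicator fires when the two categories differ, so that identical observations get distance zero, as the metric axioms require. The vehicle attributes and weights below are invented:

```python
def penalized_distance(x1, x2, w):
    """Penalized distance over categorical vectors (Eq. 2.14): each
    component contributes its weight w_i when the two observations
    differ, and 0 when they agree."""
    return sum(wi if a != b else 0.0 for a, b, wi in zip(x1, x2, w))

# Hypothetical categorical vehicle attributes, invented for illustration
truck_a = ("diesel", "6x2", "sleeper")
truck_b = ("diesel", "4x2", "sleeper")
w = (1.0, 2.0, 0.5)   # penalized distance weights, one per variable

d = penalized_distance(truck_a, truck_b, w)   # only the axle config differs -> 2.0
```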
2.2 Regression
Regression analysis is a branch of statistics that aims to estimate the relationship between variables. A set of independent variables X (input variables) is used to estimate one or more dependent variables Y (output variables).

A regression model f(X) is constructed using a set of training data, which is a collection of previous observations consisting of both independent and dependent variables, where x_i ∈ X and y_i ∈ Y form a pair (x_i, y_i). The regression model is then constructed to make estimates Ŷ of the dependent variable, using the independent variables:

    \hat{y}_i = E(y_i | x_i) = f(x_i).    (2.15)

The above equation combined with Equation 2.8 yields the following expression for the prediction error:

    \hat{e}_i = f(x_i) - y_i.    (2.16)

The aim of regression analysis is to construct the prediction function f so as to minimize the prediction error. Assume a set of m previous observations, so that i = 1, ..., m. A measure of the total prediction error can be formulated as

    E = \frac{1}{m} \sum_{i=1}^{m} |\hat{e}_i|    (2.17)

or

    E = \frac{1}{m} \sum_{i=1}^{m} \hat{e}_i^2.    (2.18)

The process of constructing a regression model can then be formulated as

    \min_{f} \sum_{i=1}^{m} (f(x_i) - y_i)^2,    (2.19)

which is called the method of least squares.

There are different methods for regression analysis. The ones used in this thesis are linear regression, K nearest neighbors regression, multivariate adaptive regression splines, and artificial neural networks.
2.2.1 Linear regression
Assume that there are k independent variables. Linear regression assumes that the relationship between X and Y is linear, i.e. that f(X) is a linear function:

    f(x_i, \beta) = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + ... + \beta_k x_{i,k} = \beta^T a_i    (2.20)

where a_i = (1, x_i^T)^T ∈ R^{k+1}. The parameter vector β ∈ R^{k+1} is obtained through

    \beta = \arg\min_{\beta} \sum_{i=1}^{m} (f(x_i, \beta) - y_i)^2.    (2.21)

Let A ∈ R^{m×(k+1)} be the design matrix, and y a column vector of all the observations of the dependent variable:

    A = \begin{pmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{pmatrix} = \begin{pmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_m^T \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}.

The optimization problem in Equation 2.21 can then be written as

    \beta = \arg\min_{\beta} (A\beta - y)^T (A\beta - y) = \arg\min_{\beta} \Phi(\beta).    (2.22)

Simplification of the objective function Φ yields

    \Phi(\beta) = (A\beta - y)^T (A\beta - y)
                = (A\beta)^T A\beta - (A\beta)^T y - y^T (A\beta) + y^T y
                = \beta^T A^T A \beta - 2 (A\beta)^T y + y^T y.    (2.23)

Differentiating twice gives

    \frac{\partial \Phi}{\partial \beta} = 2 A^T A \beta - 2 A^T y    (2.24)

    \frac{\partial^2 \Phi}{\partial \beta^2} = 2 A^T A.    (2.25)

The second derivative is positive semi-definite and therefore Φ is convex. The optimum of Equation 2.21 and Equation 2.22 is then obtained from

    \frac{\partial \Phi}{\partial \beta} = 2 A^T A \beta - 2 A^T y = 0 \;\Rightarrow\; A^T A \beta = A^T y.    (2.26)

If A^T A is invertible, there exists a unique solution

    \beta^* = (A^T A)^{-1} A^T y.    (2.27)

Thus, the prediction for a new observation x_0 can be computed with

    \hat{y}_0 = f(x_0) = \beta^{*T} a_0 = \beta^{*T} (1, x_0^T)^T.    (2.28)
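The least-squares fit (Eq. 2.27) and prediction (Eq. 2.28) can be sketched with NumPy; np.linalg.lstsq solves the same normal equations more stably than forming the inverse of A^T A explicitly. The data values are invented:

```python
import numpy as np

# Training data: m = 5 observations of k = 2 independent variables
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([5.1, 4.9, 11.2, 10.8, 16.0])

# Design matrix with a leading column of ones, rows a_i^T = (1, x_i^T)
A = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares solution beta*; equivalent to (A^T A)^{-1} A^T y (Eq. 2.27)
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Prediction for a new observation x_0 (Eq. 2.28)
x0 = np.array([2.5, 2.5])
y0_hat = beta @ np.concatenate(([1.0], x0))
```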
2.2.2 KNN regression
K nearest neighbors (KNN) regression is a non-parametric regression model, where K ∈ N is a parameter of the model. An observation x_0 is evaluated by measuring the standardized Euclidean distance to the observations in the training data, x_i, i = 1, ..., m. The value of the dependent variable is then set equal to the average over the K nearest observations, i.e. the K observations with the shortest distance.

The distance over the quantitative input variables,

    d_{euc,s}(x_0^{(q)}, x_i^{(q)}),    (2.29)

is obtained as explained in Equation 2.7. The penalized distance is used to evaluate the distance over the categorical input variables x_i^{(c)}, according to Equation 2.14, which yields

    d_p(x_0^{(c)}, x_i^{(c)}).    (2.30)

The total distance between the observations is then

    d_{tot}(x_0, x_i) = d_{euc,s}(x_0^{(q)}, x_i^{(q)}) + d_p(x_0^{(c)}, x_i^{(c)}).    (2.31)

Inverse distance weighting

KNN regression may also implement inverse distance weighting (IDW). The value of the dependent variable o for a new observation x_0 is then a weighted average over the K nearest observations in the training data:

    f(x_0) = \begin{cases} \frac{\sum_{k=1}^{K} w_k(x_0, x_k) o_k}{\sum_{k=1}^{K} w_k(x_0, x_k)} & \text{if } d_{tot}(x_0, x_k) \neq 0 \;\forall k \\ o_i & \text{if } d_{tot}(x_0, x_i) = 0 \text{ for some } i. \end{cases}    (2.32)

In the above equation, x_1, x_2, ..., x_K are the K nearest observations to x_0. The weights are inversely proportional to the distance:

    w_k(x_0, x_k) = \frac{1}{d_{tot}(x_0, x_k)^u}    (2.33)

where u ∈ R_+ is a parameter that can be used to control the magnitude of the distance weighting.
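A compact sketch of KNN regression with inverse distance weighting (Eq. 2.32/2.33). For brevity it uses a plain Euclidean distance instead of the combined standardized-plus-penalized distance d_tot of Eq. 2.31; the data are invented:

```python
import numpy as np

def knn_idw_predict(x0, X_train, y_train, K=3, u=2.0):
    """KNN regression with inverse distance weighting. If an exact match
    exists in the training data (distance zero), its value is returned,
    as in the second case of Eq. 2.32."""
    d = np.sqrt(((X_train - x0) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:K]
    d_k, y_k = d[nearest], y_train[nearest]
    if np.any(d_k == 0):                  # exact match in the training data
        return y_k[d_k == 0].mean()
    w = 1.0 / d_k ** u                    # inverse distance weights (Eq. 2.33)
    return np.sum(w * y_k) / np.sum(w)

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0])

pred = knn_idw_predict(np.array([1.2]), X_train, y_train, K=2)
```

With u = 2, the prediction is pulled strongly toward the closest neighbour, so the result lies between 1 and 2 but much closer to 1.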
2.2.3 Multivariate adaptive regression splines
Multivariate adaptive regression splines (MARS) is a nonlinear regression algorithm introduced in [4]. The algorithm results in an estimation function that is a finite sum of basis functions B_M(·):

    f(x_i) = \sum_{M=1}^{M_{final}} c_M B_M(x_i)    (2.34)

where c_M, for M = 1, ..., M_final, are coefficients. An example of a function fitted to data by MARS can be seen in Figure 2.1.

Figure 2.1: The function f(x) = 0.10968 + 1.1890 · max(0, x − 5.5102) created by MARS to fit the data.

The MARS algorithm consists of two sequential parts:

1. The forward stepwise part finds a set of M_max basis functions.
2. The backward stepwise part tries to reduce the set of basis functions provided by the forward stepwise part, to increase generalizability.

The number of basis functions in the final set, M_final, where 1 ≤ M_final ≤ M_max, is selected as a trade-off between generalizability and fit to the training data.
Basis functions
The original MARS algorithm implements methods to form basis functions of quantitative input variables, as explained in [4]. Furthermore, an approach for managing categorical input variables is given in [5]. These methods are explained here.

The most basic form of basis function considered by MARS is

    B_1(x_i) = 1.    (2.35)

All new basis functions are created as a product of a previously created basis function and some function b(x_ij) of one variable x_j. Assume there are M basis functions already created; basis function M + 1 will then be of the form

    B_{M+1}(x_i) = B_{M_p,M+1}(x_i) \cdot b(x_{ij}).    (2.36)

The function b(x_ij) can assume two different forms, depending on whether x_j is quantitative or categorical.

For quantitative input variables, MARS uses hinge functions with one quantitative variable as a parameter. Assume x_j is a quantitative input variable. The hinge functions are of the form

    b_a(x_{ij}) = h(x_{ij}) = \max(0, x_{ij} - c)    (2.37)

or

    b_b(x_{ij}) = h(x_{ij}) = \max(0, c - x_{ij})    (2.38)

where c is a constant, called the knot. The two above equations are considered a mirrored pair.
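A mirrored pair of hinge functions can be sketched directly; the knot value is taken from the Figure 2.1 example:

```python
def hinge_pair(c):
    """Return a mirrored pair of hinge functions with knot c
    (Eq. 2.37 and Eq. 2.38)."""
    return (lambda x: max(0.0, x - c),   # b_a: active above the knot
            lambda x: max(0.0, c - x))   # b_b: active below the knot

h_a, h_b = hinge_pair(5.5102)

h_a(7.0)   # positive: 7.0 is above the knot
h_b(7.0)   # 0.0: the mirror is inactive there
```

At any point, at most one function of the pair is non-zero, which is what lets MARS model a kink at the knot.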
Assume now that x_j is a categorical input variable, which assumes K_j different values, so that x_ij ∈ C_j where C_j = {C_j1, ..., C_jK_j}. Let A_j1, A_j2, ... be the subsets of C_j. According to the binomial theorem, there are

    2^{K_j}    (2.39)

subsets of a set with K_j elements, as explained in [6]. MARS considers functions with a binary output value, of the form

    b_a(x_j) = I(x_{ij} \in A_{jl})    (2.40)

or

    b_b(x_j) = I(x_{ij} \notin A_{jl})    (2.41)

where A_jl is one of the subsets of C_j and

    I(x_{ij} \in A_{jl}) = \begin{cases} 1 & \text{if } x_{ij} \in A_{jl} \\ 0 & \text{otherwise} \end{cases} \qquad
    I(x_{ij} \notin A_{jl}) = \begin{cases} 1 & \text{if } x_{ij} \notin A_{jl} \\ 0 & \text{otherwise.} \end{cases}
Thus, the basis functions created by MARS are products of 1 and a number of hinge functions, defined in Equation 2.37 and Equation 2.38, and categorical functions, defined in Equation 2.40 and Equation 2.41. The interaction of a basis function is the number of hinge and categorical functions present in the basis function. Some examples of basis functions are:

• B(x_i) = 1, the most basic basis function considered, which has interaction = 0.
• B(x_i) = h_1(x_ij), where h_1(x_ij) = max(0, x_ij − c_1) and c_1 is the knot of h_1. This basis function has interaction = 1.
• B(x_i) = h_1(x_ij) · I(x_ik ∈ A_kl), where A_kl is a subset of the values that the categorical variable x_k can assume, h_1(x_ij) = max(0, x_ij − c_1), and c_1 is the knot of h_1. This basis function has interaction = 2.
• B(x_i) = h_1(x_ij) · h_2(x_it), where h_1(x_ij) = max(0, x_ij − c_1) and h_2(x_it) = max(0, x_it − c_2), and c_1 and c_2 are the knots of h_1 and h_2. This basis function has interaction = 2.

The function in Figure 2.1 has two basis functions, B_1(x_i) = 1 and B_2(x_i) = max(0, x_i1 − 5.5102). B_2 implements a hinge function with a knot at 5.5102.
Finding coefficients
For a set of basis functions B = {B_1(·), B_2(·), ..., B_M(·)}, the coefficients c = (c_1, c_2, ..., c_M) present in Equation 2.34 are found using linear regression. The design matrix is obtained by evaluating the basis functions in B for every observation x_i in the training data:

    A = \begin{pmatrix} B_1(x_1) & B_2(x_1) & \dots & B_M(x_1) \\ B_1(x_2) & B_2(x_2) & \dots & B_M(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ B_1(x_m) & B_2(x_m) & \dots & B_M(x_m) \end{pmatrix}.    (2.42)

The output vector is y = (y_1, y_2, ..., y_m). This results in a linear least-squares problem. If A^T A is invertible, the solution c* for B is obtained using Equation 2.27.
Forward stepwise part
The forward stepwise part of MARS creates a basis function set B containing M_max basis functions. The set B increases in size as the algorithm proceeds. At the start of the algorithm, the set contains only the basis function B_1(x_i) = 1. New basis functions created by MARS are products of one of the basis functions already in B and a hinge function or categorical function. Furthermore, basis functions are added in pairs, i.e. two basis functions are added simultaneously.

Consider B = {B_1(·), B_2(·), ..., B_M(·)}, i.e. there are currently M basis functions, and assume M < M_max. A pair of new basis functions, B_{M+1}(·) and B_{M+2}(·), should now be added. The new pair is based on the same parent basis function B_{M_p,M+1}(·) ∈ B and appends a mirrored pair b_{a,M+1} and b_{b,M+1}, so that

    B_{M+1}(x_i) = B_{M_p,M+1}(x_i) \, b_{a,M+1}(x_{i,j_{M+1}})
    B_{M+2}(x_i) = B_{M_p,M+1}(x_i) \, b_{b,M+1}(x_{i,j_{M+1}})

where x_{j_{M+1}} is one of the variables x_j, j = 1, ..., n. If x_{j_{M+1}} is quantitative, the new basis functions take the form

    B_{M+1}(x_i) = B_{M_p,M+1}(x_i) \max(0, x_{i,j_{M+1}} - c_{M+1})
    B_{M+2}(x_i) = B_{M_p,M+1}(x_i) \max(0, c_{M+1} - x_{i,j_{M+1}})

for some knot c_{M+1} used for this pair. If x_{j_{M+1}} is categorical, the new basis functions take the form

    B_{M+1}(x_i) = B_{M_p,M+1}(x_i) \, I(x_{i,j_{M+1}} \in A_{M+1})
    B_{M+2}(x_i) = B_{M_p,M+1}(x_i) \, I(x_{i,j_{M+1}} \notin A_{M+1})

where A_{M+1} is one of the subsets A_j1, A_j2, ... of C_j.

Thus, for every new pair B_{M+1}(·) and B_{M+2}(·), the forward stepwise algorithm must determine:

• B_{M_p,M+1}(x_i) - one of the basis functions already in B.
• x_{j_{M+1}} - the variable to use in the mirrored pair b_{a,M+1} and b_{b,M+1}.
• b_{a,M+1}(x_{j_{M+1}}) and b_{b,M+1}(x_{j_{M+1}}) - the mirrored pair of b(·)-functions to use. If x_{j_{M+1}} is
  - quantitative: b_{a,M+1} and b_{b,M+1} are hinge functions as described in Equation 2.37 and Equation 2.38. The knot location c_{M+1} has to be determined.
  - categorical: b_{a,M+1} and b_{b,M+1} are categorical functions as described in Equation 2.40 and Equation 2.41. The set A_{M+1} has to be determined.

Determining the above values is an iterative process. MARS considers every combination of B_{M_p,M+1}(x_i) ∈ B and x_{j_{M+1}}. Furthermore, the algorithm considers a range of different ways to formulate b_{a,M+1}(x_{i,j_{M+1}}) and b_{b,M+1}(x_{i,j_{M+1}}):

• If x_{j_{M+1}} is quantitative, then the b_{M+1}(·) are a pair of hinge functions as described in Equation 2.37 and Equation 2.38. There is an infinite number of possible knot locations; however, MARS considers every observation x_{i,j_{M+1}}, i = 1, ..., m in the training data as a possible location for the knot.
• If x_{j_{M+1}} is categorical, then the b_{M+1}(·) are a pair of categorical functions as described in Equation 2.40 and Equation 2.41. All possible subsets A_j1, A_j2, ... of C_j are candidates to form A_{M+1}.

For each of these combinations, two candidate basis functions B̂_1 and B̂_2 are created:

    \hat{B}_1(x_i) = B_{M_p,M+1}(x_i) \, b_{a,M+1}(x_{i,j_{M+1}})
    \hat{B}_2(x_i) = B_{M_p,M+1}(x_i) \, b_{b,M+1}(x_{i,j_{M+1}}).

MARS then considers the set consisting of B, B̂_1 and B̂_2:

    \hat{B} = \{B_1, B_2, ..., B_M, \hat{B}_1, \hat{B}_2\}    (2.43)

and thereafter solves the linear least-squares problem with the design matrix provided by Equation 2.42 given B̂, which yields the coefficients ĉ ∈ R^{M+2} corresponding to B̂. Thereafter the RMSE with respect to the training data is calculated for the current solution.

When the RMSE has been computed for every combination, the pair B̂_1, B̂_2 associated with the smallest RMSE is selected and added to B. Hence M := M + 2. The procedure repeats until M ≥ M_max.
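The forward stepwise search can be sketched as a brute-force loop. This simplified version handles only quantitative variables with hinge pairs, ignores the speed-ups discussed later, and is not the thesis's implementation:

```python
import numpy as np

def forward_pass(X, y, M_max=5):
    """Simplified MARS forward pass: every basis function already in the
    set is a candidate parent, every training value a candidate knot, and
    the (parent, variable, knot) combination with the lowest training
    RMSE wins. Pairs of mirrored hinge functions are added until the set
    holds M_max basis functions."""
    m, n = X.shape
    basis = [lambda x: 1.0]                          # B_1(x) = 1
    coef = np.array([y.mean()])
    while len(basis) < M_max:
        best = None
        for parent in basis:
            for j in range(n):
                for c in X[:, j]:
                    cand = basis + [
                        lambda x, p=parent, j=j, c=c: p(x) * max(0.0, x[j] - c),
                        lambda x, p=parent, j=j, c=c: p(x) * max(0.0, c - x[j]),
                    ]
                    A = np.array([[B(x) for B in cand] for x in X])
                    cc, *_ = np.linalg.lstsq(A, y, rcond=None)
                    rmse = np.sqrt(np.mean((A @ cc - y) ** 2))
                    if best is None or rmse < best[0]:
                        best = (rmse, cand, cc)
        _, basis, coef = best                        # keep the winning pair
    return basis, coef

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = np.maximum(0.0, X[:, 0] - 5.0)   # a hinge-shaped target, invented
basis, coef = forward_pass(X, y, M_max=3)
```

Because the target itself is a hinge, the search places the knot near 5 and the three-function model fits almost exactly, which illustrates why the exhaustive knot search is effective but expensive.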
Backward stepwise part
The algorithm featured in the forward stepwise part is a greedy algorithm and results in overfitting of the model given by B, meaning that the model has adapted to noise in the training data and does not generalize well. Therefore, the backward stepwise part attempts to remove basis functions from B. Removing these functions will increase the prediction error on the training data; however, the basis functions that give the smallest increase in RMSE are removed one at a time. The final basis function set B* is selected as a trade-off between generalizability and fit to the training data.

The algorithm proceeds as follows:

1. The initial set is B_1 := B, obtained from the forward stepwise algorithm.

2. The basis function that can be removed with the smallest increase in lack of fit (LOF) is removed from the previous set B_M. The resulting set of basis functions is B_{M+1}. The process is repeated until there is a set that contains only one basis function. The result is a range of basis function sets B_1, B_2, ..., B_{M_max}, where N(M) = M_max − M + 1 is the number of basis functions in B_M.

3. The final set B_final is determined using generalized cross-validation (GCV). Let RMSE_M be the root-mean-squared error obtained from performing regression with the basis function set B_M on the training data. Furthermore, MARS defines the effective number of parameters (ENOF) of B_M as

    ENOF_M = N(M) + \psi \, \frac{N(M) - 1}{2}    (2.44)

where ψ = 2.3 is a penalty, which is fixed in this thesis. The GCV for B_M is then

    GCV_M = \frac{RMSE_M^2}{(1 - ENOF_M / m)^2}.    (2.45)

The final set of basis functions is the set that has the lowest GCV:

    M^* = \arg\min_{M \in \{1,...,M_{max}\}} GCV_M,    (2.46)
    B^* := B_{M^*}, \quad M_{final} := N(M^*).

Given B*, the corresponding optimal coefficients c* = (c_1, c_2, ..., c_{M_final}) are obtained using linear least-squares fitting with the design matrix provided in Equation 2.42.
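The GCV selection (Eq. 2.44/2.45) can be sketched as below, assuming the denominator uses the number of training observations m; the candidate RMSE values are invented to show that GCV can prefer a smaller basis set than the one with the lowest training error:

```python
import numpy as np

def gcv(rmse, n_basis, m, psi=2.3):
    """Generalized cross-validation score: the training RMSE is inflated
    by a penalty on the effective number of parameters (Eq. 2.44), so
    larger basis-function sets must earn their extra complexity."""
    enof = n_basis + psi * (n_basis - 1) / 2
    return rmse ** 2 / (1.0 - enof / m) ** 2

m = 200   # invented number of training observations

# Candidate sets from the backward pass as (N(M), RMSE_M) pairs: the
# training RMSE always improves with more basis functions
candidates = [(1, 4.0), (3, 2.1), (5, 2.0), (9, 1.98)]

scores = [gcv(r, n, m) for n, r in candidates]
best = candidates[int(np.argmin(scores))]   # GCV picks 5 basis functions,
                                            # not the 9-function set
```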
Optimization
The forward stepwise algorithm may be very computationally demanding. For every new basis function to add to B, the algorithm considers every combination of:

• the basis functions already in B;
• every input variable x_j;
• every point x_ij in the training data if x_j is quantitative, or every subset of C_j if x_j is categorical.

For every such combination, the MARS algorithm solves the linear problem resulting from Equation 2.42. Solving the problem using the normal-equation solution given in Equation 2.27 is computationally fast, but since the number of columns in the design matrix is proportional to the number of basis functions in B̂, the computation time for solving each linear problem increases as B grows.

The computational time for the entire forward stepwise part grows rapidly with M_max and the training data size, as described in [7]. Furthermore, the computation time grows exponentially with the number of values K_j of every categorical variable x_j, according to Equation 2.39.

Several methods of reducing the amount of computation in the forward stepwise part are presented in [7]. These include:

• Introducing an interaction limit. This limits the interaction of basis functions to a certain positive integer. Basis functions in B whose interaction equals the interaction limit are not considered as factors in new basis functions.
• Introducing a cutoff count and a priority queue for which basis functions in B to consider. The basis functions in B are ordered in descending order according to their latest computed improvement of fit - the reduction in LOF. The algorithm then only considers the basis functions at the top of the priority queue. The cutoff count is the number of basis functions considered.
• Ignoring categorical parameters with many subsets.
• Only considering some of the points x_ij of a quantitative parameter x_j as possible knot locations for hinge functions. The points should be selected so that they are evenly distributed over the range of values of the variable.
2.3 Artificial neural network
Artificial neural networks (ANNs) are systems of computation in the field of AI that can be used for regression and classification. Inspired by biological neural networks, which exist in brains, artificial neural networks have a set of artificial neurons distributed in different layers. There is one input layer and one output layer. Between them, there can also be a number of hidden layers. Signals are passed from the input layer through the network to the output layer. The neurons are connected by synapses, each of which has a synaptic weight that signifies how the transported value is amplified between the two neurons connected by the synapse. The neural network can learn by adjusting the synaptic weights to produce certain outputs depending on the inputs. This section describes the type of artificial neural network considered in this thesis.
2.3.1 Artificial neuron
A neuron that has $n$ input synapses receives input values $x_1, x_2, \ldots, x_n$ and produces one output value. The procedure is described in Figure 2.2.
Figure 2.2: An artificial neuron
The following components are used by the neuron:
1. Inputs $x_1, x_2, \ldots, x_n$. There is one input per input synapse, i.e. one input from every neuron in the previous layer.
2. Synapse weights $w_1, w_2, \ldots, w_n$. The weights of the input synapses.
3. Total net input $u$, the weighted sum of all the inputs:
$$u = \sum_{i=1}^{n} w_i x_i. \tag{2.47}$$
4. Output value $y$. The total net input is transformed using an activation function $g$ to produce the output:
$$y = g(u). \tag{2.48}$$
The output of the neuron can then be defined as
$$y(x_1, x_2, \ldots, x_n) = g\left(\sum_{i=1}^{n} w_i x_i\right). \tag{2.49}$$
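The computation in Equations 2.47–2.49 can be illustrated with a minimal sketch; the function name and example values are illustrative, not from the thesis.

```python
import math

def neuron_output(weights, inputs, g):
    """One artificial neuron: y = g(sum_i w_i * x_i), Equations 2.47-2.49."""
    u = sum(w * x for w, x in zip(weights, inputs))  # total net input, Eq. 2.47
    return g(u)                                      # activation, Eq. 2.48

# Example: two inputs with the logistic activation function.
logistic = lambda u: 1.0 / (1.0 + math.exp(-u))
y = neuron_output([0.5, -0.25], [1.0, 2.0], logistic)  # u = 0, so y = 0.5
```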
Activation function
The activation function $g : \mathbb{R} \to \mathbb{R}$ is a function that transforms the total net input. The choice of activation function for an artificial neural network (ANN) is directly linked to the performance of the network. Non-linear activation functions give ANNs their ability to work with non-linear patterns.
Some common activation functions are
• Logistic function $g(u) = \sigma(u) = \dfrac{1}{1 + e^{-u}}$
• Hyperbolic tangent $g(u) = \tanh(u) = \dfrac{e^u - e^{-u}}{e^u + e^{-u}}$
• Identity function $g(u) = u$
• Rectified linear unit (ReLU) $g(u) = \begin{cases} u & \text{for } u \geq 0 \\ 0 & \text{for } u < 0 \end{cases}$
• Leaky rectified linear unit (L-ReLU) $g(u) = \begin{cases} u & \text{for } u \geq 0 \\ 0.01u & \text{for } u < 0 \end{cases}$
More information about activation functions can be found in Appendix E.
2.3.2 Artificial neural network
The type of ANN considered in this thesis is called a feed-forward artificial neural network. The neurons are organized into layers. All neurons in the hidden layers and the output layer are called receiving neurons. All receiving neurons are connected through synapses to every neuron in the previous layer, so that the outputs of a layer are the inputs of the next layer, as can be seen in Figure 2.3. Every synapse has a specific weight associated with it. Furthermore, all neurons in the hidden layers and the output layer have a transfer function associated with them. It is possible to have different transfer functions for different neurons in the same ANN; however, this thesis only considers one transfer function for all hidden layer neurons and one transfer function for the output layer neurons. This concept is explained further in subsection 2.3.4.
Bias neuron
It may be beneficial to add a constant term to the ANN. This is done by adding a neuron with a constant output, called a bias neuron. All neurons in the hidden layers and the output layer are connected to the bias neuron by regular synapses. This allows the weight of the constant term to be adjusted in the same way as other weights in the ANN. The output value of the bias neuron is usually set to $y_b = 1$. An example of an ANN with a bias neuron can be seen in Figure 2.4.
Figure 2.3: Example of an artificial neural network with 3 input values, 1 output value and one hidden layer with 4 neurons. The arrows represent synapses. Every synapse has an individual weight $w$ associated with it.
2.3.3 Input
The input values of an ANN can be normalized as explained in subsection 2.1.2. According to [8], this increases the computational efficiency of the learning process.
Categorical input variables are managed using the dummy variable approach explained in subsection 2.1.5.
2.3.4 Output
Many activation functions have a limited codomain. For example:
• The codomain of the logistic function $\sigma(u) = \dfrac{1}{1 + e^{-u}}$ is described by $0 < \sigma(u) < 1 \;\forall\, u \in \mathbb{R}$.
• The codomain of the hyperbolic tangent function $\tanh(u)$ is described by $-1 < \tanh(u) < 1 \;\forall\, u \in \mathbb{R}$.
The output values of a pattern that the ANN is to learn must be in the codomain of the transfer functions of the output layer. To achieve this, two methods are considered:
Figure 2.4: Example of an artificial neural network that features a bias neuron.
1. Normalizing the output data so it is distributed in a range compatible with the codomain of the output transfer functions. The normalization is performed as described in subsection 2.1.2.
2. Setting the transfer function of the output layer neurons to some function $g(u)$ whose codomain includes the range of the output values. The identity function $g(u) = u$ can conveniently be used, as its codomain is $\mathbb{R}$.
2.3.5 Learning
The learning process of an ANN is done using the backpropagation algorithm (BPA). Using this algorithm, the network adapts to the training data by adjusting the synaptic weights.
Assume an ANN created for a set of data consisting of $n_{\mathrm{in}}$ input variables and $n_{\mathrm{out}}$ output variables, and which contains $m$ observations. Let $v_i \in \mathbb{R}^{n_{\mathrm{in}}}$ and $o_i \in \mathbb{R}^{n_{\mathrm{out}}}$ be the input and output vectors for the $i$:th observation. There are therefore $n_{\mathrm{in}}$ neurons in the input layer and $n_{\mathrm{out}}$ neurons in the output layer. There are $N_H$ hidden layers. The numbers of neurons in the hidden layers are $n_{h_i}$, $1 \leq i \leq N_H$. The total number of layers is $N = N_H + 2$.
The number of neurons in all layers is described by the set
$$n = \{n_{\mathrm{in}}, n_{h_1}, n_{h_2}, \ldots, n_{h_{N_H}}, n_{\mathrm{out}}\}. \tag{2.50}$$
The activation functions for the hidden layer neurons and output layer neurons are $g_h(u)$ and $g_o(u)$ respectively. Furthermore, assume $g_h(u)$ and $g_o(u)$ are $C^0$ or of a higher level of differentiability.
The BPA can be summarized by the following stages:
1. Forward pass - The procedure of sending values through the ANN to obtain an estimate of the output values and a metric of the prediction error.
2. Backward pass - The process of finding the influence of the synaptic weights on the prediction error.
3. Updating synaptic weights using the Gradient descent algorithm.
For this subsection, the following notation is introduced:
• $L_H = \{2, 3, \ldots, N-1\}$ is the collection of hidden layers. $L = 1$ represents the input layer and $L = N$ represents the output layer.
• $n^{(L)}$ is the number of neurons in layer $L$, indexed as the set in Equation 2.50.
• $y^{(L)}$ is a vector consisting of the current output values of the neurons in layer $L$. Consequently, $y_j^{(L)}$ is the output value of the $j$:th neuron in layer $L$.
• $\hat{o}_i$ are the output values generated by the ANN for the $i$:th observation, that is, the estimations of $o_i$ made with the input $v_i$.
• $u_j^{(L)}$ is the total net input of the $j$:th neuron in layer $L$.
• $w_{jk}^{(L)}$ is the synaptic weight of the synapse from neuron $j$ in layer $L-1$ to neuron $k$ in layer $L$.
• If the ANN features a bias neuron, $w_{bk}^{(L)}$ is the synaptic weight of the synapse from the bias neuron to neuron $k$ in layer $L$.
Cost function
A cost function is defined to represent the error of prediction. The BPA is used to minimize the cost function. The cost function used in this thesis is the Quadratic cost function.
Assume that the ANN has been used to predict the $i$:th observation, where the target values are $o_i \in \mathbb{R}^{n_{\mathrm{out}}}$ and the predicted values are $\hat{o}_i \in \mathbb{R}^{n_{\mathrm{out}}}$. The quadratic cost function is defined as
$$\phi(o_i, \hat{o}_i) = \frac{1}{2} \sum_{k=1}^{n_{\mathrm{out}}} (\hat{o}_{ik} - o_{ik})^2. \tag{2.51}$$
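As a brief sketch, the quadratic cost of Equation 2.51 in code (function name illustrative):

```python
def quadratic_cost(o, o_hat):
    """phi(o_i, o_hat_i) = 0.5 * sum_k (o_hat_ik - o_ik)^2, Equation 2.51."""
    return 0.5 * sum((oh - t) ** 2 for oh, t in zip(o_hat, o))

cost = quadratic_cost([1.0, 2.0], [1.0, 4.0])  # 0.5 * (0 + 4) = 2.0
```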
Forward pass
The forward pass is the process of sending values through the ANN from the input layer to the output layer. Before the forward pass, an input vector $v_i$ is assigned to the input neurons, so that
$$y^{(1)} := v_i.$$
Thereafter, all receiving neuron layers compute their output values $y_j^{(L)}$ according to Equation 2.49, sequentially, starting with the first receiving layer, i.e. layer $L = 2$. At this point, the following holds for the input and output vectors:
$$\hat{o}_i = y^{(N)}. \tag{2.52}$$
The estimations for the $i$:th observation are thereby obtained, and the prediction error $\phi(o_i, \hat{o}_i)$ can be calculated using Equation 2.51.
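The forward pass can be sketched for a fully connected feed-forward network. This assumes each layer is represented by a weight matrix whose row $k$ holds the weights into receiving neuron $k$; the representation and names are illustrative, not from the thesis.

```python
def forward_pass(v, layers, g_h, g_o):
    """Propagate input vector v through the network (Equation 2.49 per
    neuron); returns the output vector o_hat = y^(N), Equation 2.52."""
    y = list(v)  # y^(1) := v_i
    for idx, W in enumerate(layers):
        g = g_o if idx == len(layers) - 1 else g_h  # output vs hidden layer
        y = [g(sum(w * x for w, x in zip(row, y))) for row in W]
    return y

# Example with identity activations: 2 inputs, one hidden layer of
# 2 neurons, and 1 output neuron.
o_hat = forward_pass([1.0, 2.0],
                     [[[1.0, 1.0], [0.0, 1.0]],   # hidden layer weights
                      [[1.0, 1.0]]],              # output layer weights
                     g_h=lambda u: u, g_o=lambda u: u)  # o_hat == [5.0]
```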
Backward pass
The objective of the backward pass is to find the partial derivative $\partial\phi/\partial w_{jk}^{(L)}$ for all synaptic weights. To find these partial derivatives, the chain rule is applied. The derivatives are computed sequentially, starting with the final layer. The formulation of the chain rule product used for all synaptic weights is
$$\frac{\partial\phi}{\partial w_{jk}^{(L)}} = \frac{\partial\phi}{\partial y_k^{(L)}} \cdot \frac{\partial y_k^{(L)}}{\partial u_k^{(L)}} \cdot \frac{\partial u_k^{(L)}}{\partial w_{jk}^{(L)}}. \tag{2.53}$$
For the synapses leading into the output layer, $L = N$, Equation 2.53 yields
$$\frac{\partial\phi}{\partial w_{jk}^{(N)}} = \frac{\partial\phi}{\partial y_k^{(N)}} \cdot \frac{\partial y_k^{(N)}}{\partial u_k^{(N)}} \cdot \frac{\partial u_k^{(N)}}{\partial w_{jk}^{(N)}}. \tag{2.54}$$
For the first partial derivative, the following holds, as explained in Equation 2.52:
$$\frac{\partial\phi}{\partial y_k^{(N)}} = \left.\frac{\partial\phi}{\partial \hat{o}_{ik}}\right|_{\hat{o}_{ik} = y_k^{(N)}}. \tag{2.55}$$
Differentiation yields
$$\frac{\partial\phi}{\partial \hat{o}_{ik}} = \frac{\partial}{\partial \hat{o}_{ik}}\left[\frac{1}{2}\sum_{k=1}^{n_{\mathrm{out}}} (\hat{o}_{ik} - o_{ik})^2\right] = \frac{\partial}{\partial \hat{o}_{ik}}\,\frac{1}{2}(\hat{o}_{ik} - o_{ik})^2 = \hat{o}_{ik} - o_{ik} \;\Rightarrow\; \frac{\partial\phi}{\partial y_k^{(N)}} = y_k^{(N)} - o_{ik}. \tag{2.56}$$
The second partial derivative is given by differentiating Equation 2.48, using the transfer function $g_o(u)$. This yields:
$$\frac{\partial y_k^{(N)}}{\partial u_k^{(N)}} = \left.\frac{dg_o}{du}\right|_{u = u_k^{(N)}}. \tag{2.57}$$
The third partial derivative $\partial u_k^{(N)}/\partial w_{jk}^{(N)}$ can be obtained by differentiating Equation 2.47, according to
$$\frac{\partial u_k^{(N)}}{\partial w_{jk}^{(N)}} = \frac{\partial}{\partial w_{jk}^{(N)}}\left[\sum_{i=1}^{n} w_i y_i^{(N-1)}\right] = y_j^{(N-1)} \tag{2.58}$$
where $y_j^{(N-1)}$ is the output of the neuron in the preceding layer that sends its output through the current synapse.
By considering Equation 2.56, Equation 2.57 and Equation 2.58, all partial derivatives of Equation 2.54 are known and the procedure can be used to calculate $\partial\phi/\partial w_{jk}^{(L)}$ for $L = N$.
For partial derivatives of synaptic weights leading into a hidden layer, i.e. for $L \in L_H$, Equation 2.53 becomes
$$\frac{\partial\phi}{\partial w_{jk}^{(L)}} = \frac{\partial\phi}{\partial y_k^{(L)}} \cdot \frac{\partial y_k^{(L)}}{\partial u_k^{(L)}} \cdot \frac{\partial u_k^{(L)}}{\partial w_{jk}^{(L)}} \quad \forall\, L \in L_H. \tag{2.59}$$
The second and third partial derivatives of the above product are obtained in a manner consistent with the output layer,
$$\frac{\partial y_k^{(L)}}{\partial u_k^{(L)}} = \left.\frac{dg_h}{du}\right|_{u = u_k^{(L)}} \tag{2.60}$$
and
$$\frac{\partial u_k^{(L)}}{\partial w_{jk}^{(L)}} = \frac{\partial}{\partial w_{jk}^{(L)}}\left[\sum_{i=1}^{n} w_i y_i^{(L-1)}\right] = y_j^{(L-1)} \tag{2.61}$$
for all $L \in L_H$. Because the backward pass starts with the last layer, all partial derivatives for layer $L + 1$ are known. This can be used to calculate the partial derivative $\partial\phi/\partial y_k^{(L)}$ $\forall\, L \in L_H$:
$$\frac{\partial\phi}{\partial y_k^{(L)}} = \sum_{l=1}^{n^{(L+1)}} w_{kl}^{(L+1)} \cdot \frac{\partial\phi}{\partial u_l^{(L+1)}} = \sum_{l=1}^{n^{(L+1)}} w_{kl}^{(L+1)} \cdot \frac{\partial\phi}{\partial y_l^{(L+1)}} \cdot \frac{\partial y_l^{(L+1)}}{\partial u_l^{(L+1)}} \quad \forall\, L \in L_H, \tag{2.62}$$
where $n^{(L+1)}$ is the number of neurons in layer $L + 1$. The partial derivatives $\partial\phi/\partial y_l^{(L+1)}$ and $\partial y_l^{(L+1)}/\partial u_l^{(L+1)}$ have previously been obtained, and Equation 2.62 can be computed.
By considering Equation 2.62, Equation 2.60 and Equation 2.61, all partial derivatives of Equation 2.59 are known and the procedure can be used to calculate $\partial\phi/\partial w_{jk}^{(L)}$ for all $L \in L_H$.
If the ANN has a bias neuron, the synaptic weights $w_{bk}^{(L)}$ are updated in the same way as the synaptic weights $w_{jk}^{(L)}$.
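The backward pass can be sketched for the special case of one hidden layer with logistic $g_h$, identity $g_o$, and the quadratic cost. The vectorized layout (one weight matrix per layer) and all names are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def backward_pass(v, o, W_h, W_o):
    """Gradients of phi w.r.t. all weights for one observation (v, o).
    W_h: hidden-layer weights (n_h x n_in); W_o: output weights (n_out x n_h).
    Assumes g_h = logistic, g_o = identity, cost as in Equation 2.51."""
    # Forward pass, keeping layer outputs (needed by the chain rule, Eq. 2.53).
    y_h = 1.0 / (1.0 + np.exp(-(W_h @ v)))   # hidden outputs
    o_hat = W_o @ y_h                        # network outputs (Eq. 2.52)
    # Output layer: dphi/dy = y - o (Eq. 2.56) and dg_o/du = 1 (Eq. 2.57).
    delta_o = o_hat - o
    grad_W_o = np.outer(delta_o, y_h)        # Eq. 2.58: multiply by y_j^(N-1)
    # Hidden layer: propagate the error through the output weights (Eq. 2.62);
    # the logistic derivative is y_h * (1 - y_h) (Eq. 2.60).
    delta_h = (W_o.T @ delta_o) * y_h * (1.0 - y_h)
    grad_W_h = np.outer(delta_h, v)          # Eq. 2.61
    return grad_W_h, grad_W_o
```

The returned gradients are the quantities $\partial\phi/\partial w_{jk}^{(L)}$ needed by the gradient descent update described next in the text.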
Gradient descent
After the backward pass is completed, all partial derivatives are known.
Gradient descent is used to update the synaptic weights. The method is explained in detail in [9]. The weights are updated according to
$$w_{jk}^{(L)} := w_{jk}^{(L)} - \eta\,\Delta w_{jk}^{(L)}, \quad \text{where} \quad \Delta w_{jk}^{(L)} = \frac{\partial\phi}{\partial w_{jk}^{(L)}} \tag{2.63}$$
for all $L$, $j$ and $k$. The value $\eta$ is called the step size, and is central to the gradient descent method. Specifically for ANNs, $\eta$ is usually called the learning rate. The choice of learning rate greatly affects training performance: a high value may result in divergence and failure of the algorithm, whereas a low value may result in entrapment in local minima or slow learning.
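The update of Equation 2.63 can be sketched for one layer's weights stored as a nested list; the layout and names are illustrative only.

```python
def gradient_descent_step(weights, gradients, eta):
    """w := w - eta * dphi/dw (Equation 2.63), element-wise for one layer."""
    return [[w - eta * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, gradients)]

# One update with learning rate eta = 0.1.
updated = gradient_descent_step([[1.0, 2.0]], [[0.5, -1.0]], eta=0.1)
```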
2.3.6 Batch learning
During training, the observations in the training data may be used multiple times. Each complete cycle of the entire training data through the BPA is called a training epoch.
The BPA can also be defined so that the forward pass and backward pass for several observations in the training data are performed before the synaptic weights are updated. This is called batch learning. Let $\gamma \geq 1$ be the batch size. During batch learning, a sample of $\gamma$ observations is randomly selected from the training data. The forward pass and backward pass are performed for these observations. The weights are then updated using gradient descent as defined in Equation 2.63, but instead letting $\Delta w_{jk}^{(L)}$ be the average of $\partial\phi/\partial w_{jk}^{(L)}$