Forecasting future delivery orders to support vehicle routing and selection


SECOND CYCLE, 30 CREDITS, STOCKHOLM, SWEDEN 2018


GUSTAF ENGELBREKTSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Degree Programme in Information and Communication Technology 300 ECTS

Date: October 4, 2018

Industrial supervisor: Johan Frisk, Fleet 101 AB

Supervisor: Somayeh Aghanavesi

Examiner: Elena Troubitsyna

Swedish title: Förutsägelse av framtida leveransorder för att stödja val av fordon samt deras ruttplanering


Abstract

Courier companies receive delivery orders at different times in advance. Some orders are known long beforehand, while others arise at very short notice. Currently the order delegation, i.e. deciding which car is going to drive which order, is performed completely manually by a transport leader (TL), who uses experience to guess upcoming orders. If delivery orders could be predicted beforehand, algorithms could create suggestions for vehicle routing and vehicle selection.

This thesis used the data set from a Stockholm-based courier company. The Stockholm area was divided into zones using agglomerative clustering and K-Means, where the zones were used to group deliveries into time-sliced Origin Destination (OD) matrices. One cell in one OD-matrix contained the number of deliveries from one zone to another during one hour. Long-Short Term Memory (LSTM) Recurrent Neural Networks were used for the prediction. The training features consisted of prior OD-matrices, week day, hour of day, month, precipitation, and the air temperature.

The LSTM-based approach performed better than the baseline: the Mean Squared Error was reduced from 1.1092 to 0.07705 and the F1 score increased from 41% to 52%. All features except for the precipitation and air temperature contributed noticeably to the predictive power.

The result indicates that it is possible to predict some future delivery orders, but that many are random and independent of prior deliveries. Letting the model train on data as it is observed would likely boost the predictive power.


Sammanfattning (Summary)

Courier companies receive delivery orders at different times in advance. Some orders are known long in advance, while others arise at short notice. At present the order delegation, i.e. delegating which car drives which order, is carried out manually by a transport leader (TL), who uses experience to guess future orders. If delivery orders could be predicted in advance, vehicle routes and vehicle selection could be suggested by algorithms.

This thesis used a data set from a Stockholm-based courier company. The Stockholm area was divided into zones using agglomerative clustering and K-Means, and the zones were used to group deliveries into time-sliced Origin Destination matrices (OD-matrices). One cell in an OD-matrix contains the number of deliveries from one zone to another during one hour. Long Short-Term Memory (LSTM) neural networks were used for the prediction. The model was trained on prior OD-matrices, week day, hour, month, precipitation, and air temperature.

The LSTM-based approach performed better than the baseline: the mean squared error decreased from 1.1092 to 0.07705 and the F1 score increased from 41% to 52%. Precipitation and air temperature did not contribute noticeably to the prediction performance. The result indicates that it is possible to predict some delivery orders, but that a large share are random and independent of prior deliveries. Letting the model train on new data as it is observed would likely increase the predictive power.


Contents

1 Introduction
1.1 Background
1.2 Problem
1.3 Goal & purpose
1.4 Data
1.5 Contribution to research
1.6 Ethics and societal issues
1.7 Delimitations
1.8 Thesis outline

2 Theory
2.1 Clustering
2.2 Artificial Neural Networks
2.3 Routing Problems
2.4 Problem specific terms
2.5 Performance metrics
2.6 Related research

3 Method
3.1 Overview
3.2 Data description
3.3 Zoning
3.4 Data pre-processing
3.5 Measuring the prediction performance
3.6 Baseline prediction: The calendar model
3.7 Long-Short Term Memory based prediction

4 Results
4.1 Hyperparameter optimisation
4.2 Final model

5 Discussion
5.1 Analysis
5.2 Future work
5.3 The Long-Short Term Memory based approach
5.4 Contribution to research
5.5 Conclusions

Bibliography


1 Introduction

Goods transportation is vital for our society to work. The total revenue for the freight market in Sweden in 2015 was 275 billion SEK [56]. The road freight market operates with slim profit margins [21]. Many steps in the transportation chain, such as driving vehicles and managing vehicle fleets, are performed manually by people [15]. Automation of steps in the transport chain is becoming feasible with modern research and hardware [14]. Optimising the transport chain by, for example, minimising driving distances can lead to economic and environmental improvements [15].

1.1 Background

Fleet 101 is a company developing software named K2 that is used for transport management in courier and goods transport companies. The software is sold to various customers in the freight industry, mostly located in the Nordic countries. Each customer runs their own instance of the software. The software is used to keep track of the whole transport chain, from incoming delivery orders to sending out invoices to customers [16].

One large and important part of the transport chain is to delegate delivery jobs among available vehicles, something currently performed manually by people. K2 is used to keep track of vehicle attributes such as position and load capacity, as well as to delegate transport jobs [16].

DHL [14] describes the research field of anticipatory logistics, which aims to make supply chains more efficient. The research field has been identified as a high-impact trend for the logistics business area. Anticipatory logistics includes, for example, anticipatory shipping, which can be used to predict future shipments. Taylor [52] interpreted DHL's predictions and split them into three parts:

• Autonomous logistics, which includes self-driving vehicles.

• Internet of Things (IoT), which refers to, for example, delivery vehicles being connected to the internet, including sensors on vehicles.

• Artificial intelligence and logistics, which refers to the many possibilities of using AI and machine learning to optimize logistics, the area connected to this thesis.

Many areas belonging to anticipatory shipping, such as the Internet of Things, are in use today. However, the research is far from finished [9].

The vision of the company is to automate manual transport management using state-of-the-art approaches, such as modern machine learning methods. One part of the automation consists of anticipating future incoming transport orders.

1.2 Problem

Future delivery orders are not always known in advance. Occasionally the orders are known far in advance, but more frequently they come in during the day. Since not all delivery orders are deterministic and known in advance, it is less straightforward to automate the task of delegating orders among vehicles using well-established and well-studied approaches, such as solvers for the Vehicle Routing Problem (VRP) [43].

The number of transport orders is not uniform over time. Peak periods exist and transport managers have experience of them [27, 26]. Possible factors that affect the number of transport orders are, for example, type of day, month, and the nationwide economic situation. As a simple example, there are usually more transport orders before Christmas in December than in July in Stockholm, something displayed in the data set used in this thesis.

Currently known orders can be input into a route planning software used by the company. The route planning software accepts a list of delivery orders from point A to B together with the available vehicles as input. The software outputs near-optimal driving routes for the vehicles. The near-optimal solutions, however, assume that future delivery orders are static, which is not the case. If predicted future orders can be inserted together with known future orders, it is likely that the output will become more usable. The software does not accept any sort of stochastic information about pick-up hotspots or similar.

The data set used is from a large Swedish courier company. The company has different service types, ranging from different types of business-to-business (B2B) deliveries during working hours and pre-planned home deliveries during the evening to priority rush transports at any time. For these jobs the company has different vehicles, ranging from small vans to lorries. This thesis targets B2B deliveries during working hours.

1.3 Goal & purpose

[Figure 1.1. Diagram labels: Previous delivery data; Delivery forecast; Route optimisation; This thesis; The future, using a proprietary solver.]

Figure 1.1: The purpose: predict deliveries.

As displayed in Figure 1.1, the purpose of this degree project is to assist transport planning by presenting predictions about future transport orders, based on previous delivery data. The research question is formulated as follows: How can transport management be assisted using predictions based on historical data?

The goal is to predict future transport orders from location A to B short-term, e.g., during the rest of the day, the next day, or the next hour. The goal is not to predict transport orders long-term, i.e., for future years. The prediction should either be readable by a human or in a format that can be used by a computer for route optimization, as displayed in Figure 1.1. The goal is visualised in Figure 1.2.

[Figure 1.2. Diagram labels: Region; Arlanda; Bromma; Årsta; Sigtuna.]

Figure 1.2: The goal: predict deliveries between zones in a region.

1.4 Data

The data set resides in a database. The relevant data consists of about 15 years of historical delivery orders containing different forms of delivery deadlines, pick-up addresses, and delivery addresses. An address usually has a city, zip code, street address, and street number. Most addresses saved in the database during the last four to five years also have coordinates with varying degrees of accuracy.

1.5 Contribution to research

In the literature study (Section 2.6), the newest research that used neural network approaches focused on passenger transport. Older research that focused on freight transportation usually used statistical methods. No research was found in which recurrent neural networks generate predictions in the form of Origin-Destination (OD) pairs, instead of only origin hot spots, in a freight management context.

1.6 Ethics and societal issues

There is an increasing need for responsible and sustainable transports [14]. Due to the large and global scale of transports, even minor reductions in driving distances can have a large impact.

One ethical issue concerning this thesis is the automation of manual labour. If the transport leader role is fully or partially automated, the work burden of transport leaders decreases, possibly leading to less need for them and worse job security. Another ethical and legal issue is the usage of historical data. This thesis tries to avoid these issues by, for example, adding random noise to the data. Random noise makes, for example, coordinates less sensitive, while the prediction performance measurements are assumed not to be affected.

1.7 Delimitations

This thesis focuses on making predictions for Business-to-Business (B2B) deliveries. The objective is not to predict home deliveries to private persons or occasional random events requiring specialized transports. The reason for predicting B2B deliveries is that they are dynamic: the optimal routing and selection of delivery vehicles depend on future, not yet known delivery jobs. This contrasts with home deliveries to individuals, where all orders are known at route planning time.

1.8 Thesis outline

This chapter introduces the thesis. Chapter 2: Theory provides a theoretical background for concepts used in the thesis and an overview of previous research in the area. Chapter 3: Method describes how the prediction was performed. Chapter 4: Results presents the performance of the prediction model. Finally, Chapter 5: Discussion analyses the prediction performance and discusses the work.


2 Theory

This chapter aims to give the reader the necessary background to understand the work performed. Related research is also presented.

2.1 Clustering

Two forms of clustering using unsupervised machine learning were used for the thesis. They are presented here.

2.1.1 K-Means Clustering

The K-Means clustering algorithm works in the way described below.

1. Decide the number of centroids, k, to be created.

2. Select k random initial centroids.

3. For each of the k centroids, create a cluster of all points that are closer to that centroid than to any other.

4. Create k new centroids by calculating the center of mass of all points in each cluster.

5. Repeat the previous two steps until the centroids no longer change.

Due to the nature of K-Means clustering, it only supports Euclidean distances in theory [4].
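The steps above can be sketched in a few lines of Python. This is a minimal illustration using NumPy and is not the implementation used in the thesis; the function name `k_means`, the fixed random seed, the iteration cap, and the sample points are assumptions made for the example.

```python
import numpy as np


def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` (shape [n, d]) into k clusters; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct points as the random initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the centre of mass of its cluster
        # (an empty cluster keeps its old centroid; this is an assumption).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical sample data: two well-separated groups of three points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = k_means(points, k=2)
```

The centre-of-mass update in step 4 is what ties the algorithm to Euclidean distance: the mean is the point that minimises the summed squared Euclidean distance to the cluster members.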


2.1.2 Agglomerative Clustering

Agglomerative clustering is a type of hierarchical clustering. The idea is that each point at first is its own cluster, then clusters are merged by combining close clusters. One advantage with agglomerative clustering is that the distance metric does not have to be Euclidean [33].
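As an illustration of merging close clusters with a non-Euclidean metric, the sketch below uses the great-circle (haversine) distance between coordinates. It is a naive example under stated assumptions: single linkage (the distance between two clusters is the distance between their closest members) is chosen only for brevity, and the function names and sample coordinates are hypothetical, not taken from the thesis.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def agglomerative(points, n_clusters, distance):
    """Naive single-linkage agglomerative clustering; returns lists of point indices."""
    # Every point starts as its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Find the two clusters whose closest members are nearest to each other.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair into one cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Illustrative, approximate coordinates: two in central Stockholm, two near Arlanda.
points = [(59.33, 18.06), (59.34, 18.07), (59.65, 17.93), (59.64, 17.92)]
zones = agglomerative(points, n_clusters=2, distance=haversine_km)
```

Because the distance function is passed in as an argument, any pairwise metric, such as driving time, could be substituted without changing the merging loop.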

2.2 Artificial Neural Networks

An Artificial Neural Network (ANN), often simply called a neural network, is a machine learning method inspired by the human brain. ANNs have many benefits; for example, they are non-linear, which allows them to capture non-linear relationships in the input. An ANN consists of a set of information-processing neurons, where a single neuron has three elements: connecting links, an adder, and an activation function [23].

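[Figure 2.1. Diagram of a neuron: the input signals x_1, x_2, ..., x_m are multiplied by the weights w_1, w_2, ..., w_m and, together with the bias b_k, summed in the summing junction Σ; the activation function φ(·) then gives the output y_k.]

Figure 2.1: Neuron. Adapted from Figure 1.5 in [23].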

In Figure 2.1 a neuron is displayed. The connecting links (input signals) have their own weights, and the weighted inputs are summed in the adder (summing junction). The activation function defines the output y_k. The activation function can, for example, be a simple threshold function that returns 1 or -1 depending on the adder's output [23].
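The computation of a single neuron can be written out directly. The sketch below is a minimal example; the function name `neuron_output` and the choice of zero as the threshold are assumptions made for illustration.

```python
def neuron_output(inputs, weights, bias):
    """Output y_k of a single neuron with a threshold activation function."""
    # Summing junction: weighted sum of the input signals plus the bias b_k.
    v = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Threshold activation (threshold at 0 is an assumption): 1 if v >= 0, otherwise -1.
    return 1 if v >= 0 else -1


# Hypothetical inputs and weights: 0.5 * 1.0 + (-0.25) * 2.0 + 0.1 = 0.1, so y is 1.
y = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
```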

[Figure 2.2. Diagram of a feed forward network with an input layer, a hidden layer, and an output layer.]

Figure 2.2: Artificial Neural Network.

Typically, when neurons are combined to form an ANN, they are arranged in layers. In a feed forward neural network (FFNN), displayed in Figure 2.2, neurons are combined into one input layer, one output layer, and an arbitrary number of hidden layers. A single neuron can be linear but when they are combined the whole net becomes non-linear [23].

Activation function

Each hidden neuron in a neural network computes a regressive output, usually ranging from 0 to 1. The function used for the calculation is called the activation function. An activation function common in RNNs is the hyperbolic tangent, which ranges from -1 to 1. To rescale the output to a classification format, softmax can be applied after the last layer. Softmax gives a score to each class such that the scores of all possible classes sum to one [37, 62].
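Both functions are short enough to write out. The sketch below uses only the Python standard library; the function names and the sample scores are assumptions made for illustration.

```python
import math


def tanh(v):
    """Hyperbolic tangent activation; the output lies between -1 and 1."""
    return math.tanh(v)


def softmax(scores):
    """Rescale raw scores into class scores that are positive and sum to one."""
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of the last layer for three classes.
class_scores = softmax([2.0, 1.0, 0.1])
```

The largest raw score always receives the largest class score, so the predicted class can be read off as the index of the maximum.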

Loss function

A loss function is a function measuring the cost or performance of a prediction. It is used during training for the network to measure the cost when weights are updated. The mean squared error can be used as a loss function. Another loss function, used for categorical output, is cross-entropy [37].

2.2.1 Long Short Term Memory Recurrent Neural Networks

A Recurrent Neural Network (RNN) differs from an FFNN in that it has feedback loops [23]. FFNNs as well as RNNs are bad at handling prior dependencies, i.e., historical data, something that Long Short-Term Memory (LSTM) networks solve by having a memory cell. A typical LSTM unit consists of a cell, an input gate, an output gate, and a forget gate. The cell is the memory itself and is able to remember data for a long time. The input gate controls the input activations. The output gate controls the output flow from the cell. The forget gate controls the LSTM's memory through a self-recurrent connection, i.e., it controls the cell (the memory). Each gate works as a simple ANN with no hidden layers and has an activation function. Different variations of the LSTM unit exist [6].
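To make the roles of the cell and the gates concrete, the sketch below computes one time step of a commonly used LSTM formulation with NumPy. It is an illustration only, not the network used in the thesis: the function names, the packing of all gate weights into a single matrix `W`, the order of the gates within it, and the random sample data are assumptions.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def lstm_step(x, h_prev, c_prev, W, b):
    """One time step of an LSTM unit.

    x: input vector [n_in]; h_prev, c_prev: previous output and cell state [n_hidden];
    W: weights [4 * n_hidden, n_in + n_hidden]; b: biases [4 * n_hidden].
    """
    n = h_prev.shape[0]
    # Each gate is a single layer applied to the current input and the previous output.
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0 * n:1 * n])   # input gate: how much new information enters the cell
    f = sigmoid(z[1 * n:2 * n])   # forget gate: how much of the old cell state is kept
    o = sigmoid(z[2 * n:3 * n])   # output gate: how much of the cell is exposed
    g = np.tanh(z[3 * n:4 * n])   # candidate values to write into the cell
    c = f * c_prev + i * g        # updated memory cell
    h = o * np.tanh(c)            # output of the unit
    return h, c


rng = np.random.default_rng(0)
n_in, n_hidden = 3, 2
W = rng.normal(size=(4 * n_hidden, n_in + n_hidden))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):   # a sequence of five random input vectors
    h, c = lstm_step(x, h, c, W, b)
```

The cell state `c` is carried from step to step and is only rescaled by the forget gate, which is what lets information persist over many time steps.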

Training

Neural networks need to be trained. Back-propagation through time is a common way to train LSTM RNNs [22, 6]. Back-propagation computes the gradient of the cost function by propagating the errors backwards: the algorithm starts at the last layer and works its way towards the first layer. Gradient descent is usually used as the optimisation algorithm that decides to what extent the weights should be updated [37]. ADAM is an optimisation method combining two older optimisation methods. ADAM has built-in learning rate decay [28].

Overfitting

When a machine learning model starts to learn too much detail from the training data, it will generalise less and perform worse on the test data. This condition is called overfitting. Two popular ways to combat overfitting in neural networks are dropout and regularisation. Dropout works by randomly dropping hidden neurons and then re-adding them. Regularisation works by adding an extra term to the loss function, which results in the network preferring to learn smaller weights [37].


2.2.2 Convolutional Neural Networks

Deep Convolutional Neural Networks (CNNs) have good performance for image classifications. Residual Networks (ResNets) introduce residual connections in a Deep CNN [24].

2.3 Routing Problems

The travelling salesman problem (TSP) asks in what order a salesperson should visit each city in a given set exactly once in order to minimize the total distance and time spent. The TSP is NP-hard and has several variations [45].

2.3.1 The Vehicle Routing Problem

[Figure 2.3. Diagram with a depot and the customers C1 to C6, connected by two vehicle routes.]

Figure 2.3: VRP visualised. C1 to C6 represent different customers, and the depot is the common start location. The blue line is the routing for one vehicle, the red line the routing for another vehicle.

One generalization of the TSP is the Vehicle Routing Problem (VRP). It was first described in 1959 and asks in what order one or more trucks should visit service stations from a main station [13]. Like the TSP, the VRP is NP-hard [30]. Polynomial-time approximation algorithms exist for solving the VRP [10]. The problem is visualised in Figure 2.3, where the two colours represent different routes.


2.3.2 The Pick-up and Delivery Problem

[Figure 2.4. Diagram with the points C1 to C8.]

Figure 2.4: PDP visualised. Boxes are pick-up points, circles drop-off points.

The pick-up and delivery problem (PDP) is related to the VRP. It describes how one or more vehicles should perform deliveries between different locations [36]. The problem is illustrated in Figure 2.4. The definition of the PDP can vary. One definition is that a vehicle can only serve one delivery at a time, while another is that a vehicle can serve multiple transport orders at the same time [8].

A problem closely related to the PDP is the dial-a-ride problem, which in short describes how taxis should be routed to pick up and drop off passengers when ride-sharing can be used [12].

2.3.3 VRP & PDP classes

The VRP and PDP have several variations. Only the relevant ones are presented here, but the problems have a wide range of applications. As an example, the VRP has even been applied to military aircraft mission planning [46].


Time Windowed

In the time-windowed flavour, time constraints are introduced. Typically each transport order has an earliest allowable pick-up time and a latest allowable delivery time [36].

Dynamic & Stochastic

There are times when not all orders are known during the planning stage. The dynamic (or online, real-time) flavour defines the case when additional transport orders occur after the planning stage, during the operation. The stochastic flavour extends the dynamic, by describing that previous knowledge about future unknown transport orders can be taken into consideration in the planning step [47].

Different strategies to satisfy dynamic demand exist. One strategy, called double horizon, describes how route distance should be minimized in the short term, while favouring empty vehicles in the long term. Different waiting approaches describe when and where vehicles should wait when time allows; the reason for waiting is to be able to satisfy new delivery jobs faster. Fruitful regions define vehicle re-routing to areas where the probability of future requests is high [27].

2.4 Problem specific terms

This section gives a short introduction to some terms and methods used in the literature found.

2.4.1 Origin Destination Matrix

An Origin Destination (OD) matrix describes the flow of, for example, passengers, goods, or data between different zones. Table 2.1 displays a sample OD-matrix giving information about the flow between the zones a, b, and c. The flow a → a is 1, the flow c → b is 8, etc.

Table 2.1: Sample OD-matrix (rows: origin zone, columns: destination zone).

        a    b    c
   a    1    2    3
   b    4    5    6
   c    7    8    9


One application of OD-matrices is that they assist traffic planning, usually on a large scale. Typically the set of all zones, i.e. the complete matrix, forms a region or city while the size of the individual zones varies. However, in this thesis an OD-matrix will give information about the number of delivery orders between zones. It is also possible to add a time dimension to the OD-matrix, by adding discrete time intervals as a new dimension in the matrix [42].
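A time-sliced OD-matrix of delivery counts can be represented as a three-dimensional array. The sketch below is a minimal example with NumPy; the function name, the tuple layout of a delivery, and the sample deliveries are assumptions made for illustration.

```python
import numpy as np


def build_od_matrices(deliveries, n_zones, n_slots):
    """Count deliveries per (time slot, origin zone, destination zone)."""
    od = np.zeros((n_slots, n_zones, n_zones), dtype=int)
    for slot, origin, destination in deliveries:
        od[slot, origin, destination] += 1
    return od


# Hypothetical deliveries as (time slot, origin zone index, destination zone index).
deliveries = [(0, 0, 1), (0, 0, 1), (0, 2, 1), (1, 1, 0)]
od = build_od_matrices(deliveries, n_zones=3, n_slots=2)
```

Here `od[t]` is the OD-matrix for time slot `t`, and `od[t, i, j]` is the number of deliveries from zone `i` to zone `j` during that slot.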

2.4.2 Simple calendar method to create OD-matrices

Some literature compares new methods of predicting OD-matrices with a simple calendar model. The calendar model varies a bit between studies, but the basic principle is usually the same. The principle is that, for each origin-destination pair, week days are split into different groups (working day, school holiday, etc.) and historical data is mapped to the groups. The time can be split into discrete intervals if a time dimension is desired [55].
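One possible reading of this principle is to predict, for each day group, time slot, and OD pair, the mean of the historical counts in that group. The sketch below illustrates that reading only; it is not necessarily the calendar model used as the baseline in this thesis or in [55], and the function names, the day groups, and the sample observations are assumptions.

```python
from collections import defaultdict


def fit_calendar_model(observations):
    """Mean historical count for every (day type, time slot, origin, destination) key."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day_type, slot, origin, destination, n_orders in observations:
        key = (day_type, slot, origin, destination)
        sums[key] += n_orders
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def predict(model, day_type, slot, origin, destination):
    """Predicted count for an OD pair; keys never seen in the history default to 0."""
    return model.get((day_type, slot, origin, destination), 0.0)


# Hypothetical history: one row per observed day, time slot, and OD pair.
observations = [
    ("working day", 9, "a", "b", 4),
    ("working day", 9, "a", "b", 2),
    ("school holiday", 9, "a", "b", 1),
]
model = fit_calendar_model(observations)
```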

2.4.3 Autoregressive models

Autoregressive Integrated Moving Average (ARIMA) is a statistical and regressive model of a random process. It is assumed that the future is a linear function of historical data. To apply an ARIMA model, a model is constructed and parameters are then found [63].

Vector Autoregression (VAR) is a generalization of the basic autoregressive model that allows for multiple dependent variables as input. Usually VAR performs better than simpler univariate autoregressive models [59].
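The linear assumption behind these models can be illustrated with the simplest building block, a univariate autoregressive model of order p, where the next value is a constant plus a weighted sum of the p previous values. The sketch below fits the parameters by ordinary least squares with NumPy. It covers neither the integration and moving-average parts of ARIMA nor the multivariate case of VAR, and the function names and the synthetic series are assumptions made for illustration.

```python
import numpy as np


def fit_ar(series, p):
    """Least-squares fit of x_t = c + a_1 * x_{t-1} + ... + a_p * x_{t-p}."""
    series = np.asarray(series, dtype=float)
    # Each row holds a constant term and the p values preceding the target, newest first.
    rows = [np.concatenate(([1.0], series[t - p:t][::-1])) for t in range(p, len(series))]
    targets = series[p:]
    coeffs, *_ = np.linalg.lstsq(np.array(rows), targets, rcond=None)
    return coeffs  # [c, a_1, ..., a_p]


def predict_next(series, coeffs):
    """One-step-ahead forecast from the last p observations."""
    p = len(coeffs) - 1
    lags = np.asarray(series[-p:], dtype=float)[::-1]
    return coeffs[0] + coeffs[1:] @ lags


# Synthetic series that follows x_t = 1 + 0.5 * x_{t-1} exactly.
series = [0.0]
for _ in range(20):
    series.append(1.0 + 0.5 * series[-1])
coeffs = fit_ar(series, p=1)
forecast = predict_next(series, coeffs)
```

On this noise-free series the fit recovers the generating parameters; on real order counts the residual would not be zero.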

2.5 Performance metrics

The Mean Squared Error (MSE) is defined as:

\[
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - t_i)^2
\]

where $N$ is the number of predictions, $y_i$ a prediction, and $t_i$ an expected value.


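According to [33], precision and recall are defined as:

\[
\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}
\]

\[
\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
\]

and F1 score as:

\[
\text{F1 score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]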
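The metrics can be computed directly from their definitions. The sketch below is a minimal example in plain Python for binary labels; the function names and the sample values are assumptions, as is the choice to return 0 when a denominator is zero.

```python
def mean_squared_error(predictions, targets):
    """MSE: the mean of the squared differences between predictions and expected values."""
    return sum((y - t) ** 2 for y, t in zip(predictions, targets)) / len(predictions)


def precision_recall_f1(predicted, actual):
    """Precision, recall and F1 score for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    # Returning 0.0 for an empty denominator is an assumption, not part of the definition.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical values: squared errors 0, 1 and 4 give an MSE of 5/3.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 1.0, 5.0])
# Two true positives, one false positive and one false negative.
precision, recall, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```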

2.6 Related research

This section gives an overview of previous research on predicting future transportation demand, in both a passenger and a freight context. There are related research fields, such as forecasting demand in electrical networks [39], but this section focuses on more closely related fields, including traffic flow forecasting such as [54].

In general, the first step of making a prediction is to divide the whole area to be predicted into smaller zones [57]. There are two classes of approaches: those that predict only the origin, and those that predict both the origin and the destination. Section 2.6.1 looks at research that mostly predicts the origin, while Section 2.6.2 presents research that predicts both the origin and destination.

2.6.1 Forecasting origin and/or destination as separate entities for VRP & PDP solvers

Ichoua, Gendreau, and Potvin [27] stated in 2007 that solution approaches for the VRP that anticipate future demand are not yet mature, but that research interest exists. However, Ritzinger, Puchinger, and Hartl [47] claimed in 2016 that the research interest in the Dynamic and Stochastic VRP has increased during the last few years and that it is, for example, now possible to process knowledge about demand using modern statistical approaches. For example, in 2018 van Engelen et al. [58] incorporated historical demand with empty vehicle re-routing in the dial-a-ride problem.

An example of how future demand predictions can be generated and taken into consideration in the planning step is presented in an article by Schilde, Doerner, and Hartl [49]. Patients were transported from their homes to a hospital or from a hospital to home. Around half of the requests were known in the morning while the other half were dynamic and appeared during the day. The inter-arrival times of the dynamic requests were found to have an exponential distribution. The return transports resulting from the first dynamic request were found to have a gamma distribution.

Swihart and Papastavrou [51] created and analysed a model for the PDP. It was discovered that dynamic requests arrived according to Poisson processes in geographical zones. In another report, Garrido and Mahmassani [20] assumed that the dynamic requests arrive according to a Poisson distribution. The Poisson distribution assumption was used in conjunction with an autoregressive model to predict future orders short-term. In 2000 Garrido and Mahmassani [19] continued their research by modelling demand with an econometric model; however, they discovered that their model's prediction did not correspond fully to a real sample. Newer research from 2016 by Vonolfen and Affenzeller [61] confirms that the arrival of transport orders can be seen as a Poisson process.

2.6.2 Forecasting demand as origin-destination pairs

Tsekeris and Tsekeris [57] write about the traditional four-stage transport planning process for passenger transport, whose stages are trip generation, trip distribution, mode choice, and traffic assignment. Trip generation refers to forecasting passenger transport by using econometric models. Trip distribution refers to allocating the demand from the previous step into an origin destination matrix. Mode choice refers to splitting the OD-matrix into different modes of transport, for example private car or public transport. Traffic assignment maps the OD-matrix onto a transport network, i.e., which routes will be used. A Ph.D. thesis by Peterson [42] also states that the generation of OD-matrices on a large scale is a well-studied problem.

Tsekeris and Tsekeris [57] write that, for example, seasonal exponential smoothing can be used to predict the medium-term or long-term transport demand. They present an overview of the methods used for forecasting and state that modern approaches occasionally combine steps from the classical methods, such as averaging. Relevant methods include Kalman filtering, autoregressive models such as ARIMA, genetic algorithms, and artificial neural networks. In general the paper is aimed at a macro-level scale, i.e., predicting the OD-graph for commuters in a region.

Toqué et al. [55] predicted public transport demand city-wide as OD-matrices using Long-Short Term Memory (LSTM) Recurrent Neural Networks (RNN). Their RNN model had one LSTM layer. For training of the model they used gradient-based optimization, where they experimented with the hidden state size. Their input to the model was prior OD-matrices at 300 time stamps; the output was a predicted OD-matrix at the time stamp to be predicted. The RNN method was compared with two more conventional methods: a calendar model and a Vector Autoregressive (VAR) model. The calendar method consisted of putting historical rides into 15-minute slots for different day types, where one day type was, for example, "Monday to Wednesday" and another was "school holiday". The VAR and LSTM methods outperformed the calendar method.

In an article from 2018, Li et al. [31] developed an algorithm to predict OD-matrices for taxi trips in a large city. It was done by combining non-negative matrix factorization (NMF) with an autoregressive model. Li et al. stated that statistical models, including for example maximum likelihood and Bayesian inference, are unsuitable for short-term prediction of OD-matrices. Li et al. write that the reason is that those statistical models assume that all transports must end in identical time windows, an assumption that cannot be made since transport orders have different time lengths.

The reason why Li et al. [31] chose NMF over regression or neural networks was that they claimed that it would otherwise not be possible to detect the purpose of the travel (i.e. commute, leisure, etc.), as described by Peng et al. [41]. For this thesis it is not required to factor in the purpose of a trip, since the main purpose is always the same, i.e., to deliver a package. Li et al. also ruled out Kalman filtering, since Ming-jun and Shi-ru [34] found that its predictions are delayed.

Zhang, Zheng, and Qi [64] developed a deep learning method for crowd flow prediction and compared it with some autoregressive models. Their method combined convolutional neural networks that looked at different time intervals. The deep learning method performed better than autoregressive models such as ARIMA and Vector Autoregression (VAR). Crowd flow prediction is a bit different from predicting OD-matrices, since the problem is about predicting flows between neighbouring grids. Nevertheless the result is interesting, as it shows that neural networks can work better than autoregressive models.

Alonso-Mora, Wallar, and Rus [3] looked at the dial-a-ride problem in New York City. They divided the city into a grid system with equal areas and divided the historical data into 15-minute intervals for each week day. A clustering algorithm was used to merge grids into larger grids, where a probability distribution then was found for each OD-pair. The prediction was then used in their routing algorithm.

Azzouni and Pujolle [6] used a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) to predict Origin-Destination pairs, in an approach reminiscent of Toqué et al. [55]. As input to the LSTM, Azzouni and Pujolle mapped a prior N × N OD-matrix into a vector of length N². The output was then a vector of length N² that could be mapped back into an N × N OD-matrix. They also presented a method for continuous prediction over time.
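The mapping between a matrix and a vector is a plain reshape. The sketch below shows it with NumPy, using the sample values from Table 2.1; the row-by-row ordering is an assumption, since the ordering used in [6] is not stated here.

```python
import numpy as np

N = 3
# The sample OD-matrix from Table 2.1 (rows: origin, columns: destination).
od_matrix = np.arange(1, N * N + 1).reshape(N, N)

# Flatten the N x N matrix row by row into a vector of length N^2 (model input).
vector = od_matrix.reshape(N * N)

# Map a vector of length N^2 (model output) back into an N x N OD-matrix.
restored = vector.reshape(N, N)
```

With this ordering, the flow from zone `i` to zone `j` sits at index `i * N + j` of the vector.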

Tian and Pan [54] also predicted OD-matrices using LSTM RNNs. However, they also compared the performance with some other approaches, including Support Vector Machines and Feed Forward Neural Networks. The LSTM RNN had the best performance. They state that LSTM RNNs have good performance for short-term predictions, due to LSTMs' ability to remember long-term data.

In a Ph.D. thesis by Larsen [29], parcel pick-up points for mail trucks were analysed. It was determined that pick-up locations were highly dynamic since only a small subset of the locations were known in advance.

2.6.3 Summary

We can conclude that for OD-matrices, autoregressive and machine learning approaches outperform the simple calendar method [55]. Recent research has shown that approaches using neural networks are feasible and can work better than autoregressive methods [64, 55, 6, 54]. LSTM RNNs are used over simple RNNs, since training simple RNNs using back-propagation is difficult when modelling long-range dependencies [6].

Some research, for example [31, 3, 48], also focuses on finding clusters for pick-up hot spots.

In general there is more research about incorporating dynamic and stochastic information into VRP than PDP [8]. A lot of the research about predicting OD-matrices is on a macro-scale, for example city-wide transportation, while less research has been performed on a micro-scale.

The overview of the related work shows that there is no previous research that aims at predicting deliveries at a smaller scale. Another difference is that this thesis used classification output, instead of the regression output that previous research used.


Method

This chapter presents the method used to predict future deliveries. The data, zoning approach, and prediction method are described. The experimental setup is also presented.

3.1 Overview

As shown in Section 2.6, a lot of research focused on how a relatively simple forecast could be used in a VRP or PDP solver, where previous knowledge about only the origin was used. For example the previous knowledge might be modelled as a Poisson process. One advantage of knowing where future pick-up hot spots are, is that free vehicles can be routed to those positions. For example, it is common that taxi drivers drive to an airport when they have no passengers, since the drivers anticipate future orders from the airport [66].

An interview with a transport leader was conducted, with the purpose of further understanding how transport leaders work, how they could be helped by a prognosis, and which factors their experience about upcoming deliveries is based on. The interviewee explained that they never have the problem of too few orders and empty vehicles. They may choose to keep some vehicles empty on standby for important incoming jobs with a short deadline, but they did not currently have the need to route empty vehicles to locations where future orders are believed to occur.

Since the route optimization software used by Fleet 101 does not accept stochastic information about pick-up hotspots, inserting only future pick-up points instead of data derived from OD-matrices would require a solution where the destination point is inferred from the pick-up point. One way to solve this problem would be to place the predicted pick-up hotspots as delivery jobs with some arbitrary destination, where the predicted jobs have the constraint that they must be delivered after the real jobs, which would make the end destination less important.

3.1.1 Predicting OD-matrices

The decision was made to predict the OD-matrices using RNNs with LSTMs. The reason was twofold. Firstly, recent research presented in Section 2.6 showed that approaches using RNNs with LSTMs performed better than autoregressive methods. Secondly, it is easier to add additional features to neural networks than to autoregressive models, since for example an LSTM RNN can capture dependencies between the features [55]. An example of an additional feature is the weather.

To save time and resources in the implementation, the high-level machine learning API Keras [2] was used with the machine learning framework TensorFlow [1] as back-end. Recent research papers found in the literature study that used LSTM RNN approaches to predict OD-matrices also used Keras and/or TensorFlow [55, 6].

Figure 3.1: Method overview. (Flowchart: Database → addresses and delivery times → addresses to zones → time-sliced OD-matrices → Neural Network → prediction.)

The method is summarised in Figure 3.1. First data is retrieved from a database, then time-sliced OD-matrices are created and fed into the Neural Network, and finally predictions are made. The time slices are one hour long, i.e., one time-sliced OD-matrix contains all deliveries for one hour.


3.2 Data description

The data resided in a Microsoft SQL database with a total size of about 170 GB, including all data not relevant for this thesis. Available data ranged from 2003 to the early spring of 2018, where less data was available for the first years and more data existed for later years. The relevant parts of the data set were related to transport orders; the relevant features of the transport orders are presented below.

• Pick-up & delivery addresses. Some address fields are:

– Address lines. The content can be for example a company name or a street address. The address lines are not clean; occasionally a field can contain a company name but more frequently a street address.

– Zip-code. It exists in nearly all addresses, independently of whether the address lines describe a company or street address. The zip-code is in nearly all cases clean, i.e., the field is not used to describe something else.

– City. Is usually clean, however at times the field is used for other things. Frequently the names of cities or areas are shortened, for example ”Ö-malm” instead of ”Östermalm”.

– Coordinates. Since around 2014-2015 coordinates are generally available with varying accuracy; since 2015 most addresses have coordinates with high accuracy. Not all addresses since 2015 are guaranteed to have coordinates.

• Date.

• Earliest allowable pick-up time.

• Latest allowable delivery time (deadline). Together with the earliest allowable pick-up time a time window is formed.

In addition to the provided data set described above, external data in the form of a calendar was available. The calendar could for example map dates to week-days and tell whether a day is a public holiday in Sweden or not. Weather data for a weather station in central Stockholm was also retrieved from the Swedish Meteorological and Hydrological Institute (SMHI) [50]. The weather data retrieved contained the daily precipitation in millimetres and the temperature in Celsius at 06:00, 12:00 & 18:00.

Figure 3.2: Weekly hourly distribution of all available delivery deadlines, including home deliveries. The x-axis shows the hour of the week from 0 to 168; 24-48 is Tuesday, 48-72 Wednesday, etc. The y-axis has been normalised.

Figure 3.2 displays the delivery deadline distributions for an average week during a three month period. The highest peaks are at 16:00 and 17:00, which are typical deadlines since non-urgent deliveries typically can be delivered at any time during working hours the same day. The peak at 22:00 represents home deliveries. It can be observed that almost no deliveries happen on Saturdays and Sundays.

3.3 Zoning

Since an Origin-Destination (OD) matrix was predicted, the size of the matrix needed to be decided, i.e., a good level of detail had to be found. Letting each possible street be its own zone would not be feasible, since the matrix would be too sparse and too large. The goal of the zoning was to divide the Stockholm area into a set of zones, where the zones had a similar number of deliveries in them. The purpose was to use the zones to get a prediction between two points that could be used in route optimisation software. This section describes two different approaches to creating zones. The first approach uses zip-codes as zones and the second uses clustering to create larger zones. In the end the clustering approach was deemed more feasible.

3.3.1 Using zip-codes as zones

It was assumed that the sizes of zip-code areas in Sweden are correlated to either the population or the number of packages. That means that a zip-code area in central Stockholm can be tiny (a single block), while a zip-code area in the countryside can be large (a small town). The digits in Swedish zip-codes are structured according to geographical area, for example all zip-codes beginning with 10 or 11 are located in Stockholm [44].

One drawback of using zip-codes for dividing zones is that they may change [44]. Zip-code data in Sweden is not open and freely available. Since no resource was found that lists all of these changes with their dates, it is not feasible to update zip-codes retroactively. It was assumed that the changing zip-codes problem is minor and that it would not have a noticeable impact on the result.

Using zip-codes as zones was performed by letting the first three digits form a zone, e.g., an address with the zip aaabb belongs to a zone named aaa. The problem with using the zip-code approach for zoning is that the OD-matrices become either too sparse or too small, see Figure 3.3 for an example with sparsity. In the figure it can be seen that most cells are black, meaning that no deliveries occur between those two zones. Due to the sparsity and results of basic experiments performed, it was decided that using zip-codes would be infeasible.
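The prefix-based zoning can be sketched as follows. The helper name `zip_to_zone`, the handling of spaces, and the choice to return `None` for fields that are not five-digit zip-codes are assumptions for illustration, not the implementation used in the thesis.

```python
def zip_to_zone(zip_code, digits=3):
    """Return the zone name formed by the leading digits of a zip-code."""
    cleaned = "".join(ch for ch in zip_code if ch.isdigit())
    if len(cleaned) != 5:
        return None  # assumption: anything but five digits is treated as unclean
    return cleaned[:digits]


# Hypothetical sample fields, including one unclean value.
samples = ["114 28", "11851", "10044", "n/a"]
zones_3 = [zip_to_zone(z) for z in samples]
zones_2 = [zip_to_zone(z, digits=2) for z in samples]
```

With three digits the first two sample addresses fall into different zones, while with two digits they collapse into the same zone, which illustrates the loss of detail when fewer digits are used.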

If the detail level were lowered by using only two digits, nearly all addresses in the Stockholm city area (Kungsholmen, Södermalm, Vasastaden, etc.) would belong to the same zone.

3.3.2 Creating zones by clustering

As an alternative to zip-code zoning, a clustering approach using unsupervised machine learning was implemented. Most addresses in the data set have had coordinates since around 2015, allowing addresses from 2015 and onward to be used with this approach. Algorithms from the machine learning library Scikit-Learn [40] were used for the implementation.

Figure 3.3: OD-matrix for deliveries during one Wednesday between 3-digit zip-codes. To see differences more easily the OD-matrix has been plotted in a logarithmic scale. The plot has been normalized (colour scale from 0.0000 to 0.0125). Black means no deliveries between zones, lighter colours indicate more deliveries.

Distance metric

A distance metric between coordinates is required for clustering. A basic distance metric between two coordinates is the Euclidean distance. Approximating a distance in a grid-based city may work well using Euclidean distances; however, since Stockholm is a city consisting of many islands, using Euclidean distances leads to undesirable distances between points. An example of the undesirable behaviour is displayed in Figure 3.4, where points with water in between are close according to the distance metric but far apart for a car.

To solve the problem with Euclidean distances, driving times between points were used as a distance metric instead. To calculate the driving times the Open Source Routing Machine (OSRM) [32] was used. OSRM was used to retrieve a distance matrix containing driving times between all points. One drawback of using driving times is the time required to compute the distance matrix. Calculating a distance matrix for 1000 points is near instant, whereas computing a distance matrix for 2000 points takes significantly longer. The matrix size grows as O(points²), meaning that the time increase is quadratic. Calculating a distance matrix for more than 2000-3000 arbitrary points was deemed infeasible.

Figure 3.4: Zones created with a Euclidean distance metric. Note how the pink points in the top left belong to the same cluster as the pink points in the bottom left. Map from OpenStreetMap contributors [38]. For privacy reasons noise has been added to the coordinates, which is why some coordinates are in the sea.

Handling outliers

Some addresses may have wrong coordinates, placing them in for example another city. Other addresses may be isolated from other addresses, due to occasional deliveries to the countryside. In short, outliers need to be handled so that they do not affect the rest of the clustering negatively, either by removing them or by taking them into account in the clustering, which requires a clustering algorithm able to handle outliers. K-Means does not handle outliers well in its basic form [18].

Figure 3.5: Outlier detected using agglomerative clustering.

Compared with for example K-Means, agglomerative clustering does not try to create zones with an equal number of points and often places outlying points into small clusters, as displayed in Figures 3.5 and 3.6. This behaviour was exploited to find outlying points and then remove them. Each cluster containing fewer than four points had its points removed from the set of all points, before the final clustering was performed.
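The removal of small clusters can be sketched as below, assuming that a cluster label per point is already available from a first clustering pass. The function name `remove_small_clusters` and the toy coordinates are hypothetical.

```python
from collections import Counter


def remove_small_clusters(points, labels, min_size=4):
    """Keep only points whose cluster contains at least min_size points."""
    sizes = Counter(labels)
    return [p for p, lab in zip(points, labels) if sizes[lab] >= min_size]


# Toy data: cluster 0 has five points, clusters 1 and 2 are small outlier clusters.
points = [(59.33, 18.06), (59.34, 18.07), (59.32, 18.05), (59.33, 18.08),
          (59.35, 18.06), (57.70, 11.97), (59.86, 17.64), (59.85, 17.63)]
labels = [0, 0, 0, 0, 0, 1, 2, 2]
kept = remove_small_clusters(points, labels)
```

The labels from this first pass are only used for filtering; the remaining points are then clustered again from scratch.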

Figure 3.6: Clustered coordinates using agglomerative clustering, centroids are marked with crosshairs. In total 30 clusters, some are outside of the map borders. Notice the differences in cluster sizes. Map from OpenStreetMap contributors [38].


Clustering using K-Means

After removing outliers, different clustering algorithms and their parameters were experimented with. Figure 3.7 displays the result of first removing outliers with the method described and then clustering into 30 zones with K-Means. Figure 3.6 displays clustering using agglomerative clustering. Notice how K-Means created clusters with more equal sizes than agglomerative clustering. A common problem with many of the tested algorithms was that they grouped all points not too far away from Stockholm city together into a single large cluster, while they created small clusters from points further away from the city centre.

K-Means was deemed to give the most desirable result, despite theoretically not supporting non-Euclidean distances. K-Means performed best since the number of points in each zone turned out to be roughly equal and most zones seemed to cover reasonable areas.

Figure 3.7: Clustered coordinates using K-Means, centroids are marked with crosshairs. In total 30 clusters, some are outside of the map borders. Map from OpenStreetMap contributors [38].


Classifying unobserved points

The centroids are marked with crosshairs in Figure 3.7. The coordinates of a centroid were calculated by taking the mean of all coordinates in a cluster. The calculation of the latitude is displayed in Equation 3.1, where C is the set of all points c in a cluster. The longitude is calculated using the same equation.

\[ \text{Centroid latitude} = \frac{\sum_{i=1}^{|C|} c_{i,\text{latitude}}}{|C|} \tag{3.1} \]

Only the centroids were used for classifying points. This means that the size of the required distance matrix grows linearly with the number of points to be classified, O(|clusters| · |points|), where the number of clusters is constant. This allowed for fast distance matrix retrieval from OSRM. The classification was simply performed by finding the closest centroid to an unknown point, where the distance was measured as the driving time. Note that since all points were classified using the centroids, some points in Figure 3.7 belong to other clusters in the final run.
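The centroid calculation in Equation 3.1 and the nearest-centroid classification can be sketched as follows. Since no routing engine is available in a self-contained example, the great-circle (haversine) distance is used here as a stand-in for the driving time; this substitution, the function names, and the toy coordinates are assumptions for illustration only.

```python
import math


def centroid(points):
    """Mean latitude and longitude of a cluster, as in Equation 3.1."""
    lat = sum(p[0] for p in points) / len(points)
    lon = sum(p[1] for p in points) / len(points)
    return (lat, lon)


def haversine_km(a, b):
    """Great-circle distance in km; a stand-in for the driving time (assumption)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def classify(point, centroids, distance=haversine_km):
    """Return the index of the closest centroid under the given distance."""
    return min(range(len(centroids)), key=lambda k: distance(point, centroids[k]))


# Two toy clusters of (latitude, longitude) points.
clusters = [
    [(59.33, 18.06), (59.34, 18.07), (59.32, 18.05)],
    [(59.40, 17.94), (59.42, 17.96)],
]
centroids = [centroid(c) for c in clusters]
zone_a = classify((59.335, 18.065), centroids)
zone_b = classify((59.405, 17.930), centroids)
```

Each new point requires only one distance per centroid, so the number of distance look-ups is the number of clusters times the number of points. Replacing `haversine_km` with a function that looks up driving times would correspond more closely to the approach described above.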

Method summary

The method described above can be summarised with the steps below.

1. Decide the number of clusters desired.

2. Retrieve a distance matrix. Distance metric: driving time.

3. Cluster using agglomerative clustering, remove outlying points from the data set. Distance metric: driving time.

4. Discard the clustering result.

5. Cluster using K-Means. Distance metric: driving time.

6. Calculate centroids using the Euclidean distance.

7. Discard the points used for the construction, only keep the centroids.

8. Classify new points by finding the closest centroid. Distance metric: driving time.


3.4 Data pre-processing

The data needs to be pre- and post-processed. This section describes how the data was processed to be able to be used in the baseline prediction in Section 3.6 and the machine learning prediction approach in Section 3.7.

3.4.1 Data split

The data set described in Section 3.2 was split into training, validation, and test data. Older data was not used, since data containing coordinates was preferred to avoid having to geocode address lines into coordinates, a step requiring cleaning of address lines and paid services. The data split is presented below.

• Training set. The data set contained deliveries from 2015-01-01 to 2016-12-31. Note that for most experiments presented in Section 4 the training set only used data from 2016. By having at least a full year represented in the training data, all seasonal variations and public holidays are represented.

• Validation set. Data ranging from 2017-01-01 to 2017-06-30. This period does not cover all periods of the year, however some holidays such as Easter & Midsummer are included.

• Test set. Data ranging from 2017-07-01 to 2018-02-28. This period covers different periods of the year and some public holidays such as the Christmas period.

Splitting the data set into one year each for the validation and test sets would probably capture all seasonal variations better. However, the data was not split in that way for two reasons. Firstly, it was assumed that the training data would be too far back in time, meaning that deliveries that usually occurred in 2015 would not occur two years later. The data split above did not solve this problem completely, but it made the training set and test set closer in time. Secondly, it allowed different training periods to be used in the experiments without changing the validation and test sets.


3.4.2 Grouping addresses into OD-matrices

Delivery addresses with their corresponding times needed to be grouped together into time-sliced Origin-Destination matrices. The time of a delivery is the latest allowable delivery time, i.e., the deadline. A cell in an OD-matrix represents the number of deliveries from one zone to another. All cells on the diagonal represent the number of deliveries within a zone. Each OD-matrix represents one time-slice, i.e., an OD-matrix contains all deliveries between two times. An example of time-sliced OD-matrices is displayed in Figure 3.8, where OD-matrices for a single week have been created with each time-slice being one day long.

Figure 3.8: OD-matrices plotted as heat maps for a sample week, with one panel per day (Monday to Sunday). To see differences more easily the OD-matrices have been plotted in a logarithmic scale. 29 zones were used, resulting in 841 cells for each OD-matrix.

For the resulting model presented in Section 3.7 each time-slice was one hour long and only OD-matrices in the range of 7:00 to 18:00 were included. For instance, the first OD-matrix for one day contained all deliveries with deadlines from 7:00 to 7:59, the next 8:00 to 8:59 etc. Deliveries with times outside of this range were assumed to be other types of deliveries such as home deliveries, which were not to be predicted. Note that, for example, a delivery with deadline 13:30 was only included in the 13:00-13:59 OD-matrix and not in any prior ones.
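The grouping into hourly OD-matrices can be sketched as follows. The tuple layout `(origin_zone, destination_zone, deadline)`, the function name, and the choice to treat hours 7 through 18 inclusive as valid are assumptions for illustration; the exact boundary handling in the thesis implementation is not stated here.

```python
from datetime import datetime

FIRST_HOUR, LAST_HOUR = 7, 18  # assumed inclusive hour range


def build_od_matrices(deliveries, n_zones):
    """Group deliveries into one OD-matrix per (date, hour) of the deadline."""
    matrices = {}
    for origin, dest, deadline in deliveries:
        if not FIRST_HOUR <= deadline.hour <= LAST_HOUR:
            continue  # e.g. evening home deliveries are not predicted
        key = (deadline.date(), deadline.hour)
        if key not in matrices:
            matrices[key] = [[0] * n_zones for _ in range(n_zones)]
        matrices[key][origin][dest] += 1
    return matrices


# Hypothetical deliveries between three zones.
deliveries = [
    (0, 2, datetime(2017, 3, 1, 13, 30)),
    (0, 2, datetime(2017, 3, 1, 13, 5)),
    (1, 1, datetime(2017, 3, 1, 16, 0)),
    (2, 0, datetime(2017, 3, 1, 22, 0)),  # outside the range, skipped
]
od = build_od_matrices(deliveries, n_zones=3)
slot_13 = od[(datetime(2017, 3, 1).date(), 13)]
```

A delivery with deadline 13:30 ends up only in the 13:00-13:59 matrix, and the diagonal cell `[1][1]` in the 16:00 matrix counts a delivery within zone 1.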

3.4.3 Additional features

In addition to the prior OD-matrices themselves, some additional features were added. The additional features were

• hour {7, 8, 9, . . . , 18},

• day of week or public holiday with eight possible values,

• month with twelve possible values,

• precipitation,

• and air temperature.

The weather features are described in more detail in Section 3.2.

3.4.4 Input encoding

The input to a neural network can be either continuous in a regression format or in a categorical or ordinal format. If the input is in a regression format, it is scaled to for example −1 to 1 if the hyperbolic tangent is used as an activation function. If the input is in a categorical or ordinal format it is one-hot encoded. All inputs were one-hot encoded, since basic experiments indicated that an ordinal format performed better. The ordinal encoding is described next.

Encoding OD-matrices

Each cell was transformed into one of the discrete values {0, 1, 2}, where 0 means zero deliveries, 1 one delivery, and 2 two or more deliveries. The reason for handling everything with two or more deliveries as the same class is that the exact number of deliveries between zones is harder to predict and less interesting to know. The most important thing to know in the output is whether or not a transport will happen between two zones, since a delivery vehicle usually can fit more than one package. Since an ordinal input was desirable each matrix was one-hot encoded, resulting in each matrix being three times as large.
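The cell encoding can be sketched as follows for a flattened OD-matrix. The function names are hypothetical and the snippet only illustrates the clipping to {0, 1, 2} followed by one-hot encoding.

```python
def clip_cell(count):
    """Map a delivery count to 0, 1, or 2 (two or more)."""
    return min(count, 2)


def one_hot(value, n_classes=3):
    """One-hot encode a class index."""
    return [1 if k == value else 0 for k in range(n_classes)]


def encode_matrix(flat_counts):
    """Encode each cell of a flattened OD-matrix as three one-hot values."""
    encoded = []
    for count in flat_counts:
        encoded.extend(one_hot(clip_cell(count)))
    return encoded


encoded = encode_matrix([0, 1, 5, 2])
```

Four cells become twelve values, so the encoded vector is three times as long as the flattened matrix.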

Encoding additional features

The hour, day of week, and month features were one-hot encoded.

The weather features, precipitation and air temperature, were split into discrete intervals (bins) and one-hot encoded. The bins for the precipitations were

{0, 0.01, 1.0, 3.0, 5.0, 7.0, 10.0, 20.0}

meaning that 2 mm of rain belongs to the bin 1.0 and everything from 20 mm and above belongs to the bin 20.0. The bins for the temperatures were

{−20, −15, −10, −5, 0, 5, 10, 20, 25}

i.e., mostly 5-degree intervals.
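The binning can be sketched by treating the listed values as lower bin edges, which is consistent with 2 mm belonging to the bin 1.0 and 20 mm and above belonging to the bin 20.0. Clamping values below the lowest edge into the first bin is an assumption, as are the function names.

```python
from bisect import bisect_right

PRECIP_BINS = [0, 0.01, 1.0, 3.0, 5.0, 7.0, 10.0, 20.0]
TEMP_BINS = [-20, -15, -10, -5, 0, 5, 10, 20, 25]


def bin_index(value, lower_edges):
    """Index of the last edge that is <= value; values below the first edge are clamped."""
    return max(bisect_right(lower_edges, value) - 1, 0)


def one_hot_bin(value, lower_edges):
    """One-hot encode a value according to its bin."""
    idx = bin_index(value, lower_edges)
    return [1 if k == idx else 0 for k in range(len(lower_edges))]


rain_bin = PRECIP_BINS[bin_index(2.0, PRECIP_BINS)]
heavy_bin = PRECIP_BINS[bin_index(35.0, PRECIP_BINS)]
temp_bin = TEMP_BINS[bin_index(12, TEMP_BINS)]
weather_vector = one_hot_bin(2.0, PRECIP_BINS) + one_hot_bin(12, TEMP_BINS)
```

With eight precipitation bins and nine temperature bins the combined weather vector has 17 entries.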

3.4.5 Constructing the LSTM input

The input to an LSTM unit needs to be in a specific shape. First of all, a target is required when training a Neural Network, requiring the problem to be transformed into a supervised learning problem. Secondly, the input needs to have multiple samples in the correct dimensions.

To transform the problem into a supervised learning problem a sliding window technique was used. Let $x_i$ denote an OD-matrix at time $i$. If $x_i$ is the target then all OD-matrices at $i-1$ and earlier are prior matrices, i.e., available features for training. For the next OD-matrix $x_{i+1}$, the prior OD-matrices are instead the matrices $x_i$ and earlier.

The tensors, the inputs to an LSTM unit, were constructed by first flattening each OD-matrix, i.e., reshaping the 2D matrix to a 1D vector. Then samples were constructed from the OD-matrices and, if required, additional features (month etc.) were appended. To construct a single sample let the target be the flattened OD-matrix $x_i$ and let the time steps, the training vectors, be

\[ x_{i-10}, x_{i-9}, \ldots, x_{i-1} \]

where the vectors are flattened OD-matrices with optional additional features appended.


A sample is now

\[ \{x_{i-10}, x_{i-9}, \ldots, x_{i-1} \mid x_i\} \]

with $x_i$ being a flattened OD-matrix target free from any additional features and $x_{i-1}$ and earlier being flattened OD-matrices for training with optional additional features. The next sample is the next window in the sliding window principle, beginning at $x_{i-9}$ and ending at $x_{i+1}$. The first four samples will therefore be in the format

\[
\begin{aligned}
&\{x_{i-10}, x_{i-9}, \ldots, x_{i-1} \mid x_i\} \\
&\{x_{i-9}, x_{i-8}, \ldots, x_i \mid x_{i+1}\} \\
&\{x_{i-8}, x_{i-7}, \ldots, x_{i+1} \mid x_{i+2}\} \\
&\{x_{i-7}, x_{i-6}, \ldots, x_{i+2} \mid x_{i+3}\}
\end{aligned}
\]

where the target is to the right of the | sign. Together a set of samples following each other form a batch.
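The sliding window construction can be sketched as below. For readability each time step is represented by an integer instead of a flattened OD-matrix, and the function name is hypothetical. Appending additional features to the input vectors, but not to the target, is left out of the sketch.

```python
def sliding_window_samples(series, window=10):
    """Return (inputs, target) pairs where inputs are the `window` prior items."""
    samples = []
    for i in range(window, len(series)):
        samples.append((series[i - window:i], series[i]))
    return samples


# Stand-in series: item t represents the flattened OD-matrix at time t.
series = list(range(14))
samples = sliding_window_samples(series)
first_inputs, first_target = samples[0]
second_inputs, second_target = samples[1]
```

A series of 14 time steps yields four samples, and each consecutive sample is shifted by one time step.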

3.5 Measuring the prediction performance

To measure the performance of the prediction mean square error (MSE) was used, since it is a common error metric used by for example both Toqué et al. [55] and Azzouni and Pujolle [6].

\[ \text{accuracy} = \frac{\text{correctly classified cells}}{\text{total cells}} \tag{3.2} \]

Since it can be hard to get an intuition for how good the performance is, the F1 score was also measured. Using only the accuracy given in Equation 3.2 would lead to the accuracy being high even if the model only predicts no deliveries at all, since the sparsity of the matrix would lead to many cells with no deliveries being correctly classified. The F1 score considers both the precision and recall of a classification, which avoids this bias in the accuracy. MSE and the F1 score are defined in Section 2.5.
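The difference between accuracy and F1 score on a sparse matrix can be illustrated with a small sketch. The F1 score below is a binary simplification that treats any cell with at least one delivery as the positive class; this is an assumption made for illustration and not necessarily the definition used in Section 2.5.

```python
def accuracy(true, pred):
    """Share of cells classified correctly, as in Equation 3.2."""
    return sum(1 for t, p in zip(true, pred) if t == p) / len(true)


def mse(true, pred):
    """Mean squared error over all cells."""
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)


def f1_delivery(true, pred):
    """Binary F1 where 'at least one delivery' is the positive class (assumption)."""
    tp = sum(1 for t, p in zip(true, pred) if t > 0 and p > 0)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p > 0)
    fn = sum(1 for t, p in zip(true, pred) if t > 0 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Sparse flattened OD-matrix: 18 empty cells and two cells with deliveries.
true = [0] * 18 + [1, 2]
all_empty = [0] * 20               # predicts no deliveries at all
guess = [0] * 17 + [1, 1, 0]       # one hit, one false alarm, one miss

acc_empty, f1_empty = accuracy(true, all_empty), f1_delivery(true, all_empty)
acc_guess, f1_guess = accuracy(true, guess), f1_delivery(true, guess)
```

Both predictions classify 18 of 20 cells correctly, but only the second one finds any delivery, which the F1 score reflects.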

3.6 Baseline prediction: The calendar model

To be able to evaluate the machine learning approach to predicting OD-matrices, a baseline was needed. A simple model was implemented, named the calendar model. The calendar model split the OD-matrices into slots based on hour and day type and took the average value over the training set for each slot. If there are 18 − 7 = 11 available hours in one day and eight day types (Mon-Sun & public holiday), there are in total 11 · 8 = 88 slots. To predict a day, the slot at the corresponding time and day type was simply returned. A sample result with continuous output for a full day (not an hour slot) is displayed in Figure 3.9. To get a classification prediction instead of a continuous one, all cells were simply transformed in the same way as described in Section 3.4.4.
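The slot averaging of the calendar model can be sketched as follows. The tuple layout `(hour, day_type, flattened OD-matrix)` and the function names are assumptions for illustration, and the conversion of the continuous averages into classes is left out.

```python
from collections import defaultdict


def fit_calendar_model(training):
    """Average the flattened OD-matrices per (hour, day type) slot."""
    sums = {}
    counts = defaultdict(int)
    for hour, day_type, vec in training:
        key = (hour, day_type)
        if key not in sums:
            sums[key] = [0.0] * len(vec)
        sums[key] = [s + v for s, v in zip(sums[key], vec)]
        counts[key] += 1
    return {key: [s / counts[key] for s in total] for key, total in sums.items()}


def predict_calendar(model, hour, day_type):
    """Return the stored average for the matching slot."""
    return model[(hour, day_type)]


# Hypothetical training data with 2 x 2 OD-matrices flattened to length 4.
training = [
    (9, "Mon", [0, 2, 1, 0]),
    (9, "Mon", [2, 0, 1, 0]),
    (9, "Tue", [4, 0, 0, 0]),
]
model = fit_calendar_model(training)
monday_9 = predict_calendar(model, 9, "Mon")
```

The prediction for a given hour and day type is the same regardless of what happened in the preceding hours, which is what makes the model a simple baseline.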

Figure 3.9: Comparison for a real day and the calendar prediction for the same day. Panels: (a) Real, (b) Calendar. Both axes show zone indices from 0 to 28; the colour scale ranges from 0 to 16.

3.7 Long-Short Term Memory based prediction

The input to the LSTM was, as described in Section 3.4.5, prior OD-matrices with optional additional features. The output was a predicted OD-matrix. A few different models were implemented and they are presented in this section.

3.7.1 Network output

Since the output from an LSTM is hard to interpret directly, a densely-connected Neural Network layer was placed as the last step in all models. The output was therefore in the same format in all models. The output is a one-hot encoded flattened OD-matrix, where each one-hot encoded cell group contains the softmax scores, i.e. the probabilities, for the different classes of that cell. The output can therefore be transformed into an OD-matrix by picking the class with the highest softmax score in each cell group.

(a) Part of the Neural Network output: 0.1 0.7 0.2 | 0.8 0.1 0.1

(b) Transformed output: 1 0

Figure 3.10: Output processing.

Figure 3.10a displays a part of the output. Recall that there are three bins to choose from, which means that the figure will be transformed into two cells when processing the output, as displayed in Figure 3.10b.
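The output processing can be sketched as picking the index of the largest score within each group of three values. The function name `decode_output` is hypothetical; the input values are the ones shown in Figure 3.10a.

```python
def decode_output(scores, n_classes=3):
    """Pick the class with the highest score in each group of n_classes values."""
    cells = []
    for start in range(0, len(scores), n_classes):
        group = scores[start:start + n_classes]
        cells.append(max(range(n_classes), key=lambda k: group[k]))
    return cells


decoded = decode_output([0.1, 0.7, 0.2, 0.8, 0.1, 0.1])
```

The first group has its highest score at class 1 and the second at class 0, giving the two cells shown in Figure 3.10b.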

3.7.2 Implemented models

Three different models were implemented, one sequential and two different parallel models. This section describes the models and their different input variations that are evaluated in Chapter 4. In total there were five variations of input and model, described below.

SMNoAF: Sequential Model No Additional Features

The first and simplest model only took prior OD-matrices as input and no additional features such as day type. The input is described in Section 3.4.5; in short, each input sample is ten prior flattened OD-matrices.

SMNoW: Sequential Model No Weather

This model was identical to SMNoAF, except that the input also consisted of additional features (excluding the weather) appended to the OD-matrix. The model is visualised in Figure 3.11. Since the parallel models were expected to perform better, no experiment with the weather and the sequential model was performed.


Figure 3.11: SMNoW. (Diagram: prior OD-matrices and additional features → Concatenate → LSTM layer(s) → Dense layer.)

PMNoW: Parallel Model No Weather

The parallel model separated the prior OD-matrices and additional features into one LSTM for the prior OD-matrices and one dense layer for the additional features. The input to the LSTM is ten prior OD-matrices as in the sequential model, while the input to the dense layer is the additional features for a given time directly. The model is displayed in Figure 3.12a. This variation has no weather as input. Dropout is added after the dense layer for the additional features to prevent overfitting.

PMNWD: Parallel Model Weather Dense

This model was exactly the same as PMNoW, except that it also had the weather inserted with the additional features into the dense layer.

PMWLSTM: Parallel Model Weather LSTM

This model was created with the hypothesis that weather on the previous days can affect, for example, production in a factory. The idea was to place the weather for the ten previous days into its own LSTM model. The model is displayed in Figure 3.12b.


(a) PMNoW. (Diagram: additional features → Dense layer → Dropout; prior OD-matrices → LSTM layer(s); both branches → Concatenate → Dense.)

(b) PMWLSTM. (Diagram: additional features → Dense layer → Dropout; prior OD-matrices → LSTM layer(s); weather → LSTM layer; all branches → Concatenate → Dense layer.)

Figure 3.12: Two parallel models.

3.7.3 Hyperparameter selection

The models have many possible hyperparameters that can be tuned and experimented with. The parameters that were experimented with are presented in Section 3.7.4, while this section states the hyperparameters and configuration used for all experiments.

Gradient based optimisation with the ADAM-optimiser was used for training. Some of the research found, for example [55, 5], also used the ADAM-optimiser. The activation function used for the LSTMs was the hyperbolic tangent, since it was used by [55, 7, 25, 62, 65] among others.

Categorical cross entropy was used as loss function for the network and not MSE, since softmax was used as activation function on the last dense layer to allow categorical output. The batch size used was 32, meaning the network was trained with 32 samples in each iteration.
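For a single cell group of three classes, the softmax activation and the categorical cross entropy can be written out as in the sketch below. How the loss is aggregated over all cell groups in the network is not shown, and the function names and the small constant `eps` are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probs, eps=1e-12):
    """Cross entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(one_hot_target, probs))


uniform = softmax([0.0, 0.0, 0.0])        # no preference between the classes
confident = softmax([0.0, 5.0, 0.0])      # strong preference for class 1
target = [0, 1, 0]                        # true class: one delivery
loss_uniform = categorical_cross_entropy(target, uniform)
loss_confident = categorical_cross_entropy(target, confident)
```

The loss is ln 3 for the uniform prediction and approaches zero as the probability assigned to the true class approaches one.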

3.7.4 Experiment setup

The models in Section 3.7.2 were evaluated and compared. In addition hyperparameter optimisation was performed by trying different hyperparameters on the PMNoW model. A final model was also trained with the best model and hyperparameters found. The metrics used when presenting the results were the test data MSE, the F1 score on the test data, and the standard deviation of both metrics. Since the MSE and F1 score were calculated on each OD-matrix separately, the standard deviation partially served as an indication of whether or not models actually managed to fit the data or simply tried to predict some sort of average. Due to the time needed to train all models, each experiment was run only once.

Common default parameters

Unless anything else is stated, all experiments used the configuration and hyperparameters stated below by default. The data was not shuffled, since the data was seen as a time series.

• 500 neurons for the LSTM layer(s).

• Training time 100 epochs.

• Early stopping: when the validation MSE has not improved for five epochs the training is stopped. I.e., a model was trained for 100 epochs or until the validation MSE stopped improving, whichever came first.

• Regularisation with the l1 & l2 norm on all LSTM layer(s), to prevent overfitting.

• Dropout with rate 0.4.

• Train data range: 2016-01-01 to 2016-12-31

• Validation data range: 2017-01-01 to 2017-06-30

• Test data range: 2017-07-01 to 2018-02-28

• Learning rate of 0.0001

• Batch size 32 and window size 10 (10 prior OD-matrices), resulting in input tensors having the shape 32 × 10 × N, where N is the number of features. If there are 29 zones, 32 one-hot encoded hour, month, and day type features, and 17 one-hot encoded weather features, then N = 29² · 3 + 32 + 17 = 2572.
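The early stopping rule in the list above can be sketched in plain Python, given the validation MSE per epoch. The function name and the example history are hypothetical, and the sketch only mirrors the stated rule (patience of five epochs, at most 100 epochs), not the actual training code.

```python
def early_stopping_epoch(val_mse_per_epoch, patience=5, max_epochs=100):
    """Return the epoch at which training stops under patience-based early stopping."""
    best = float("inf")
    epochs_without_improvement = 0
    history = val_mse_per_epoch[:max_epochs]
    for epoch, val in enumerate(history, start=1):
        if val < best:
            best = val
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(history)


# Hypothetical validation MSE per epoch: best value at epoch 3, later dip at epoch 9.
history = [0.50, 0.40, 0.30, 0.31, 0.32, 0.30, 0.33, 0.31, 0.20]
stopped_at = early_stopping_epoch(history)
```

In this example training stops after epoch 8, before the improvement at epoch 9 is seen, which is the kind of premature termination that motivates also running experiments without early stopping.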


Training period

The viability of using both of the available years for training versus only one year was explored, by training the model with the common parameters and changing the time range for the training data to also include 2015.

Number of epochs

Different numbers of epochs were compared; naturally, early stopping was not used. The reason for this experiment was to see what happens with the loss, validation MSE, and test MSE as the time spent training is increased. For example it could be possible that early stopping would terminate training because the validation MSE temporarily worsens before suddenly improving again. Another reason was to see if overfitting occurs. The following numbers of epochs were evaluated:

• 10

• 30

• 50

• 100

• 200

Number of neurons

The number of neurons used in the LSTM layer(s) was evaluated. The hypothesis was that too few neurons will not work at all, while more neurons will improve the result given enough training time. The following numbers of neurons were tested:

• 50

• 100

• 250

• 500

• 750

• 1000

Number of hidden layers

In the PMNoW model the LSTM part (”LSTM layer(s)”) in Figure 3.12a had hidden layers. Two to four hidden layers were tested. Hidden layers were only tested with this model, due to time constraints and since it was assumed that the parallel models would perform better than the sequential ones.


Learning rates

Different learning rates were evaluated. The evaluated learning rates were:

• 0.001

• 0.0005

• 0.0001

• 0.00005

• 0.00001

The rates were selected since basic experiments showed that learning rates in this range seemed interesting, i.e., differences could be noted. No early stopping was used when testing learning rates, since the training loss and validation MSE were analysed over epochs.

An experiment was also performed where the learning rate was reduced by a factor of 10 when the training loss plateaued. Plateauing in this context is defined as the training loss not improving for three epochs.
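The plateau rule can be sketched as a schedule computed from the training loss per epoch. The initial learning rate of 0.0001 is taken from the common default parameters and is an assumption for this particular experiment; the function name and the example losses are hypothetical.

```python
def plateau_schedule(train_losses, initial_lr=0.0001, factor=0.1, patience=3):
    """Return the learning rate used in each epoch under a reduce-on-plateau rule."""
    lr = initial_lr
    best = float("inf")
    wait = 0
    rates = []
    for loss in train_losses:
        rates.append(lr)  # rate in effect during this epoch
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor  # reduce by a factor of 10
                wait = 0
    return rates


# Hypothetical training losses: no improvement during epochs 3 to 5.
losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.7]
rates = plateau_schedule(losses)
```

The rate stays at its initial value for the first five epochs and is ten times smaller from epoch 6, after three consecutive epochs without improvement in the training loss.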