Comparison of machine learning algorithms for real-time vehicle selection in transport management


STOCKHOLM, SWEDEN 2018


DAVID FAGERLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Abstract

This thesis compares algorithms for a dynamic pickup and delivery problem, in which new orders arrive throughout the day. The dispatcher's job is to assign incoming orders to a fleet of vehicles. Historical data is used to train the algorithms, where the objective is to select the same vehicle as the human dispatchers based on information about the delivery and the vehicles. The idea is to learn latent variables, which are common in the real world but difficult to incorporate in route optimization. The data set is compiled from deliveries made by a courier company located in Stockholm, Sweden.

The studied algorithms are logistic regression, support vector machine, decision tree, feedforward neural network, and permutation invariant neural network. An additional data set based on the same deliveries adds all current orders for each active vehicle and is used by the permutation invariant neural network. The results show that feedforward neural networks and decision trees performed best in terms of top-1 accuracy and top-3 accuracy, respectively. The best performing technique for class imbalance mitigation was oversampling (duplicating samples from the minority class), which outperformed undersampling (removing samples from the majority class) and a weighted cost function (additional cost when misclassifying the minority class).

Summary

This report compares algorithms for a dynamic pickup and delivery problem, with new orders coming in throughout the day. The dispatcher's job is to assign incoming orders to a fleet of vehicles. Historical data is used to train the algorithms, where the goal is to select the same vehicle as the human dispatchers based on information about the order and the vehicles. The idea is to learn latent variables that are common in daily operations but difficult to use in route optimization. The data has been compiled from deliveries made by a courier company in Stockholm, Sweden.

The studied algorithms are logistic regression, support vector machine, decision tree, artificial neural network, and permutation invariant neural network. An additional data set based on the same deliveries is used to add all current orders for each active vehicle, which is used by the permutation invariant neural network. The results show that artificial neural networks and decision trees performed best based on top-1 accuracy and top-3 accuracy, respectively. The best method for reducing class imbalance was oversampling (duplicating samples from the minority class), which performed better than undersampling (removing samples from the majority class) and a weighted cost function (additional cost when misclassifying the minority class).

1 Introduction
1.1 Background
1.2 Problem Statement
1.3 Contribution

2 Background
2.1 Vehicle Routing Problem
2.2 Multinomial Logistic Regression
2.3 Support Vector Machines
2.3.1 Non-linear classifier
2.4 Decision Trees
2.4.1 Pruning
2.4.2 Selection measures
2.5 Artificial Neural Networks
2.5.1 Feedforward networks
2.5.2 Optimization algorithms
2.5.3 Batch Normalization
2.5.4 Dropout
2.5.5 Word Embedding
2.5.6 Recurrent Neural Networks
2.6 Permutation Equivariance and Invariance
2.7 Learning dispatch decisions
2.8 Related works
2.8.1 Automated dispatching
2.8.2 Permutation Equivariance and Invariance
2.8.3 Differences to previous research

3 Method
3.1 Data
3.1.1 Features
3.1.2 Target
3.1.3 Description of Data sets
3.1.4 Training and test set
3.2 Preprocessing
3.3 Problem formulation
3.4 Neural networks
3.4.1 Sample definition and Loss function
3.4.2 Feedforward Architecture
3.4.3 Permutation invariant Architecture
3.4.4 Class Imbalance
3.5 Model implementation
3.5.1 Decision tree and Logistic regression
3.5.2 SVM
3.5.3 Neural networks
3.6 Hyperparameter Optimization
3.7 Evaluation
3.7.1 Metrics
3.7.2 Sum digits
3.7.3 Interpretation of learned rules from decision tree
3.7.4 Class imbalance mitigation methods
3.7.5 Permutation invariant network with and without the free text
3.7.6 Comparing decisions made in the morning and afternoon
3.7.7 Time to train and run inference

4 Results
4.1 Hyperparameter optimization
4.2 Sum digits
4.3 Interpretation of learned rules from decision tree
4.4 Class imbalance mitigation methods
4.5 Permutation invariant network with and without the free text
4.6 Comparing decisions made in the morning and afternoon
4.7 Time to train and run inference
4.8 Final Comparison

5 Discussion
5.1 Analysis
5.2 Comparison with previous work
5.3 Practical Application
5.4 Sustainability, Societal and Ethical Issues
5.5 Conclusions
5.6 Future work

Bibliography

1 Introduction

This chapter introduces the background to the thesis. Section 1.1 describes the problem and its domain, and motivates why the problem is interesting. Section 1.2 states the problem and outlines the goals of the thesis. Section 1.3 states the contributions made in this thesis.

1.1 Background

The transportation sector is one of the world's largest industries, and worldwide transportation revenue was estimated at $4.7 trillion in 2016 [31]. The transportation industry is currently undergoing a digital transformation with access to more data through telematics and better tools for analytics [42].

In dynamic vehicle routing and dispatching problems, a number of vehicles are assigned requests/deliveries. The dynamic part means that new deliveries arrive over time, i.e., we deal with incomplete information. Dispatching problems can be divided into the following sub-problems:

• Courier services pick up packages from customers and bring them to a depot for further shipping.

• Dial-a-Ride is a door-to-door transportation service for the elderly and people with disabilities.

• Emergency services try to place ambulances in such a way that they can reach a patient quickly.

This thesis is concerned with vehicle dispatching in courier services, which is often called the dynamic and stochastic Pickup and Delivery Problem (PDP). Dynamic means that we have incomplete information, i.e., not all deliveries are known in advance, and stochastic means that one or more variables are random, e.g., the location of pickup and delivery. This means that new packages arrive throughout the day with a local pickup and delivery location.

Each vehicle is assigned multiple deliveries. Each delivery has a pickup and a delivery address, and time windows for pickup and delivery. Currently the order assignment is made by humans. Automating the assignment is crucial to allow for 1-hour deliveries and to handle the dynamic workload with new orders coming in throughout the day [5]. It can also enable more efficient re-planning in case of disruptions [1], and faster pickups for companies in the e-commerce industry when a customer returns a product, which allows the companies to resell the products faster [5]. Using an automated system for dispatching limousines resulted in a 100% increase in weekly orders with the same number of planners/dispatchers [11]. This enables companies to grow their business with reduced costs.

1.2 Problem Statement

Today companies still rely on human operators for dispatching incoming deliveries to vehicles. The fleet dispatchers use their experience to decide which vehicle is most suitable for a new delivery based on vehicle location, current vehicle load, package type, etc. Acquiring experience takes a long time, and due to stress their careers are often short, which leads to a low supply of good dispatchers [54]. Automated decision making could provide dispatchers with vehicle suggestions to ease their workload, and in the future potentially replace human dispatchers altogether.

One common approach to vehicle dispatching is route optimization, which tries to minimize or maximize an objective function subject to a number of constraints. The objective function could be the total distance traveled, the total delivery time, the number of vehicles used, the total number of deliveries served, or a weighted combination of them. An alternative approach is to assume a fuzzy objective and learn the human dispatchers' decisions.

This thesis focuses on the latter approach and learns the dispatchers' decisions. Some advantages of this approach are that the model can learn complex latent variables, e.g., that certain drivers are more suitable for carrying heavy packages, and that it is less computationally demanding during inference since no route optimization is performed. If there are too many incoming deliveries to serve manually, the algorithm could give vehicle suggestions to the human dispatchers, enabling the operator to focus on more complex deliveries. The solution can also be used as a starting point for route optimization, which could reduce the time to find a quality solution. Improved dispatching decisions could be used to reduce fuel consumption, emissions, or cost, depending on which objective function is used.

Given information about a new order and the active vehicles, we want to predict the most suitable vehicle for the given order. The aim of this study is to train a machine learning model to select a vehicle for an incoming order. The training is based on historical dispatch decisions from human operators. The decisions made by the model are compared with the decisions made by the human operators. The first research question is how well machine learning can learn human dispatcher decisions, and the second question is which algorithms are more suitable for making those decisions.

The models compared in this thesis are Neural Networks, Support Vector Machines (SVM), Logistic Regression, and Decision Trees. The goal of this thesis is to compare different machine learning methods and evaluate their performance on vehicle selections.

1.3 Contribution

The data sets used in previous research [45, 10, 54] are often small, which prohibits larger models. The largest previous data set was used by Chen et al. [10], with 25 days of decisions for training and 10 days for testing. The data set used in this thesis has two months of data for training and testing. The size of the data set was constrained by the computational complexity of the permutation invariant model (see Section 3.1). In addition, this thesis uses a more complex environment with free text containing additional information such as time windows, package dimensions, weight, and type of package.

This thesis utilizes a permutation invariant network to incorporate information about the other active vehicles.

Previous studies [54, 45, 10, 61] have not used detailed information about the vehicles' existing load, such as package type, pickup location, delivery location, and time windows. The thesis also compares different machine learning methods and evaluates their performance on vehicle selection.

2 Background

This chapter describes the relevant theoretical frameworks for this thesis. Section 2.1 introduces the Vehicle Routing Problem and its variants. Sections 2.2 to 2.5 describe the different classifiers used in this thesis. Section 2.6 describes the concepts of permutation equivariance and invariance and their use in Artificial Neural Networks. The problem of defining optimality is presented in Section 2.7. Section 2.8 describes related work.

2.1 Vehicle Routing Problem

The Vehicle Routing Problem (VRP) was introduced by Dantzig and Ramser [12] in 1959. It is a generalization of the Traveling Salesman Problem (TSP) [25]. The goal of TSP is to find the shortest route through a number of nodes, e.g., cities, that visits all nodes and ends in the start node. Dantzig and Ramser [12] used linear programming to find a near-optimal solution for the "Truck Dispatching Problem" (TDP). Laporte et al. describe VRP as:

The vehicle routing problem (VRP) consists of designing least cost delivery routes through a set of geographically scattered customers, subject to a number of side constraints [25, p.1].

In addition, VRP has one or more depots with a given number of vehicles that serve the customers' demand.

There are multiple variations of VRP, a few of them being:

• VRP with Pickup and Delivery (VRPPD), where packages are picked up and delivered between certain locations

• VRP with Time Windows (VRPTW), which has time constraints for each customer

• Capacitated VRP (CVRP), which has constraints on the vehicles' load capacity

• Open VRP (OVRP), where there is no depot

The problem in this thesis is an Open Pickup and Delivery Problem with Time Windows. Two approaches that have been used to solve this problem are optimization algorithms [43, 17, 37, 35] and learning the decisions from human dispatchers [45, 54, 10]. This thesis uses the latter approach.

2.2 Multinomial Logistic Regression

Multinomial logistic regression is a simple parametric linear algorithm that is easy to interpret. It has been used in many different areas including stock performance [60], type of pregnancy [53], and anomaly intrusion detection [63]. Logistic regression is used for a binary dependent variable, and multinomial logistic regression is a generalization for multiclass problems (with more than two classes). If we let $K$ denote the number of classes, the idea of multinomial logistic regression is to pick a reference class and fit $K - 1$ binary logistic regression models against it. We have input features $\vec{x} = [1, x_1, \ldots, x_n]$ and parameters $\vec{w} = [w_0, w_1, \ldots, w_n]$, with $n$ being the number of input features. We set $x_0 = 1$ and insert $w_0$ to account for the bias term, i.e., $w_0$ is the bias term. If we pick $K$ as the reference class, we get:

$$\log\left(\frac{P(y = i \mid \vec{x})}{P(y = K \mid \vec{x})}\right) = \vec{w}_i \cdot \vec{x}, \quad \forall i \in \{1, \ldots, K - 1\}$$

And the probabilities:

$$P(y = i \mid \vec{x}) = P(y = K \mid \vec{x}) \cdot e^{\vec{w}_i \cdot \vec{x}}, \quad \forall i \in \{1, \ldots, K - 1\}$$

$$P(y = K \mid \vec{x}) = 1 - \sum_{k=1}^{K-1} P(y = K \mid \vec{x}) \cdot e^{\vec{w}_k \cdot \vec{x}} = \frac{1}{1 + \sum_{k=1}^{K-1} e^{\vec{w}_k \cdot \vec{x}}}$$
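The probabilities above can be evaluated directly once the weight vectors are known. The following is a minimal sketch, assuming NumPy is available; the function name `multinomial_probabilities` and the example weights are hypothetical and not taken from the thesis.

```python
import numpy as np

def multinomial_probabilities(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Class probabilities with class K as the reference class.

    W has shape (K-1, n+1): one weight vector per non-reference class.
    x has shape (n+1,) and starts with the constant 1 for the bias term.
    """
    scores = np.exp(W @ x)                    # e^{w_i . x} for i = 1..K-1
    p_ref = 1.0 / (1.0 + scores.sum())        # P(y = K | x)
    return np.append(p_ref * scores, p_ref)   # [P(y=1|x), ..., P(y=K|x)]

# Illustrative values only: K = 3 classes, n = 2 features.
W = np.array([[0.5, 1.0, -1.0],
              [-0.2, 0.3, 0.8]])
x = np.array([1.0, 2.0, 0.5])
probs = multinomial_probabilities(W, x)
```

The returned vector sums to one, and the log-ratio of each entry to the last entry recovers the corresponding linear score.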


2.3 Support Vector Machines

Support Vector Machines (SVM) are models that can be used for classification and regression. SVMs have been used for many different tasks such as image recognition [28], image segmentation [62], and text classification [18]. The simplest type of SVM is the linear SVM. In Figure 2.1 there are two linearly separable classes of points. In general, there can be many possible linear separations. The idea behind SVM is to place the decision boundary so that the distance between the decision boundary and the closest points is maximized, i.e., maximizing the margin around the decision boundary.

We assume that the data set consists of points $(\vec{x}_i, y_i)$ with $y_i \in \{-1, 1\}$. We want to pick the two hyperplanes that separate the points with the largest distance between them (dotted lines in Figure 2.1). To allow data that is not linearly separable we introduce slack variables $\xi_i$. A positive $\xi_i$ enables $\vec{x}_i$ to violate the margin at a cost proportional to $\xi_i$. If $C$ is large then the cost will be high for points not respecting the margin, and the further the point is on the wrong side of the margin (larger $\xi_i$) the larger the increase in cost will be. For a smaller $C$ the margin can be larger, with the trade-off being a higher training error.

This can be modeled as the following optimization problem:

$$\begin{aligned} \underset{\vec{w},\, b,\, \xi}{\text{minimize}} \quad & \|\vec{w}\| + C \sum_{i=1}^{N} \xi_i \\ \text{subject to} \quad & y_i(\vec{w} \cdot \vec{x}_i - b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, N. \end{aligned} \tag{2.1}$$

where $\vec{w}$ is the normal vector to the separating hyperplane, $b$ is the bias variable for the hyperplane, and $N$ is the number of training points. $C$ is a regularization term where large values increase the penalty for points on the wrong side of the margin.
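To make the role of the slack variables concrete, the sketch below evaluates the objective of Equation 2.1 for a given hyperplane. It assumes NumPy and uses the smallest slack that satisfies the constraints, $\xi_i = \max(0, 1 - y_i(\vec{w} \cdot \vec{x}_i - b))$; the function name and the toy data are hypothetical.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slack) for the formulation in Equation 2.1."""
    margins = y * (X @ w - b)
    slack = np.maximum(0.0, 1.0 - margins)   # smallest feasible xi_i per point
    return np.linalg.norm(w) + C * slack.sum(), slack

# Toy data: the third point lies inside the margin and needs slack 0.5.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([1.0, 0.0])
objective, slack = soft_margin_objective(w, 0.0, X, y, C=10.0)
```

Raising `C` increases the cost of the same margin violation, which is the trade-off described in the text.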

2.3.1 Non-linear classifier

Boser, Guyon, and Vapnik [7] introduced a way to create non-linear classifiers using the kernel trick. The idea is to transform the input space into a higher dimensional feature space to account for data that is not linearly separable. Instead of computing the actual transformation, they apply the kernel trick [7], which replaces inner products in the feature space with a kernel function evaluated in the input space. The kernels used in this thesis are:

• Linear: $k(x, x') = x^T x' + c$

• Polynomial: $k(x, x') = (\gamma x^T x' + c)^d$, $d = 3$

• Radial Basis Function (RBF): $k(x, x') = e^{-\gamma \|x - x'\|^2}$

• Sigmoid: $k(x, x') = \tanh(\gamma x^T x' + c)$

where $\gamma = \frac{1}{\text{num\_features}}$ and $c = 0.0$.

Figure 2.1: Maximum margin hyperplane and support vectors for an SVM classifier with two classes
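The four kernels can be written as plain functions of two input vectors. This is an illustrative sketch assuming NumPy; the function names are hypothetical, `z` stands for the second argument $x'$, and the constant offset is assumed to be zero as stated above.

```python
import numpy as np

def linear_kernel(x, z, c=0.0):
    return x @ z + c

def polynomial_kernel(x, z, gamma, c=0.0, d=3):
    return (gamma * (x @ z) + c) ** d

def rbf_kernel(x, z, gamma):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, gamma, c=0.0):
    return np.tanh(gamma * (x @ z) + c)

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.0])
gamma = 1.0 / x.shape[0]   # gamma = 1 / num_features
```

Each function returns a scalar similarity, so a kernel SVM never needs the explicit high-dimensional feature vectors.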

2.4 Decision Trees

Decision trees can be used for both classification and regression tasks and have been used in previous research on pickup and delivery dispatching [10], job scheduling [30], and container transportation [50]. Decision trees have the advantage of being a white box model, which is easy for a human to interpret.

A decision tree can be modeled as an acyclic directed graph with two types of nodes. The internal nodes are decision nodes and the leaf nodes are the target nodes. A decision tree partitions the data recursively into smaller subsets based on a test at each internal node, and the subsets get more homogeneous for each test. Classification trees have categorical values as targets, while regression trees have a continuous variable as target. Each node in the tree except the root has a single parent, and all nodes except the leaf nodes have two or more children. The data is partitioned until all subsets are homogeneous or a stop condition is met. A large tree is more difficult to understand and is more likely to overfit. Figure 2.2 shows an example of a classification tree with 4 features ($x_1, \ldots, x_4$) and 4 classes ($A, B, C, D$).

Figure 2.2: A decision tree with input features $x_1, \ldots, x_4$, threshold values $a, b, c, d, e$, and outcomes $A, B, C, D$

2.4.1 Pruning

Mingers [34] found that reducing the tree size, called pruning, improved the accuracy by 20-25%. Reduced Error Pruning (REP) is one pruning method that was shown by Esposito et al. [14] to perform well both in terms of accuracy and tree size. REP was introduced by Quinlan [47] and starts with a complete tree. For each internal node, a validation set is used to compare the classification error rate when the subtree rooted at that node is kept with the error rate when the subtree is turned into a leaf node. If the error rate for the pruned subtree is the same or lower, the node is pruned and turned into a leaf node. This procedure is repeated until no more pruning can be performed without a higher error rate.
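The pruning procedure can be sketched as a bottom-up recursion over a binary tree. The code below is a simplified illustration under stated assumptions, not the implementation used in the thesis: the `Node` class, the threshold tests, and the stored majority-class `prediction` at every node are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Sample = Tuple[Sequence[float], str]   # (feature vector, class label)

@dataclass
class Node:
    prediction: str                    # majority class of the training data at this node
    feature: Optional[int] = None      # index of the tested feature; None for a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None      # branch taken when x[feature] <= threshold
    right: Optional["Node"] = None     # branch taken when x[feature] > threshold

    def is_leaf(self) -> bool:
        return self.feature is None

def predict(node: Node, x: Sequence[float]) -> str:
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction

def errors(node: Node, samples: Sequence[Sample]) -> int:
    return sum(predict(node, x) != y for x, y in samples)

def reduced_error_prune(node: Node, validation: Sequence[Sample]) -> Node:
    """Prune children first, then replace this subtree by a leaf if that is no worse."""
    if node.is_leaf():
        return node
    left_val = [s for s in validation if s[0][node.feature] <= node.threshold]
    right_val = [s for s in validation if s[0][node.feature] > node.threshold]
    node.left = reduced_error_prune(node.left, left_val)
    node.right = reduced_error_prune(node.right, right_val)
    leaf = Node(prediction=node.prediction)
    if errors(leaf, validation) <= errors(node, validation):
        return leaf
    return node

# The right subtree overfits: its second split only hurts on the validation set.
tree = Node(prediction="B", feature=0, threshold=0.5,
            left=Node(prediction="A"),
            right=Node(prediction="B", feature=1, threshold=0.5,
                       left=Node(prediction="B"), right=Node(prediction="A")))
validation = [([0.0, 0.0], "A"), ([1.0, 0.0], "B"), ([1.0, 1.0], "B")]
pruned = reduced_error_prune(tree, validation)
```

In this example the root split is kept because removing it would misclassify the class A sample, while the right subtree collapses into a single leaf.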

2.4.2 Selection measures

There are several methods to decide which attribute to split on. Information gain [48] measures the change in entropy with knowledge about the attribute, and is defined as

$$IG(T, a) = H(T) - H(T \mid a)$$

$$H(T) = -\sum_{i} p(x_i) \log p(x_i)$$

$$H(T \mid a) = -\sum_{i,j} p(x_i, y_j) \log\left(\frac{p(x_i, y_j)}{p(y_j)}\right)$$

where $x_i$ ranges over the classes in $T$ and $y_j$ over the values of attribute $a$.

Another selection measure is the GINI index, which was used in Breiman et al. [8]. The GINI index

. . . measures the 'impurity' of an attribute with respect to the classes. [34, p.7]

The measure is defined as

$$I_G(p) = 1 - \sum_{i} p(x_i)^2$$
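Both selection measures can be computed from label counts. The sketch below is illustrative only; it assumes base-2 logarithms (the text does not fix the base) and computes the conditional entropy as the weighted entropy of each attribute value's subset, which is equivalent to the definition of $H(T \mid a)$. All function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(T, a) = H(T) - H(T | a) for one candidate attribute."""
    n = len(labels)
    conditional = 0.0
    for value in set(attribute_values):
        subset = [l for l, a in zip(labels, attribute_values) if a == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

labels = ["A", "A", "B", "B"]
perfect_split = ["left", "left", "right", "right"]   # separates the classes
useless_split = ["left", "right", "left", "right"]   # tells us nothing
```

A split that separates the classes completely yields the full entropy as gain, whereas an uninformative split yields zero.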

2.5 Artificial Neural Networks

2.5.1 Feedforward networks

Artificial Neural Networks (ANN) consist of a number of interconnected nodes/neurons that each perform a non-linear transformation of a linear combination of inputs ($\vec{x}$) and weights ($\vec{w}$):

$$o(\vec{x}, \vec{w}) = \sigma\left(\sum_{i=1}^{N} x_i w_i\right) = \sigma(\vec{w} \cdot \vec{x})$$

where $o(\cdot)$ is the output value and $\sigma(\cdot)$ is the activation function.

A feedforward network is a type of ANN where the nodes do not form a cycle, i.e., the output of a node is not fed back to itself. The output from one layer is used as input to the next layer; see Figure 2.3 for an example of an ANN. Using the network in Figure 2.3, we denote the weights from the input layer to the hidden layer ($w^1_{ij}$) as $W_1$, and the weights from the hidden layer to the output layer ($w^2_{ij}$) as $W_2$. The biases are denoted by $b_1$ and $b_2$ for the hidden layer and output layer respectively. The output of the network can then be written as:

$$\hat{y}(x, w) = \sigma(W_2\, \sigma(W_1 \vec{x} + b_1) + b_2)$$

For a network with a linear activation function on the output layer, the outer $\sigma$ is the identity.

There are several activation functions, for example the sigmoid function [6] and the Rectified Linear Unit (ReLU) [38], which is the most popular activation function [27].

$$\sigma_{\text{sigmoid}}(z) = \frac{1}{1 + e^{-z}}$$

$$\sigma_{\text{ReLU}}(z) = \max(0, z)$$
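A forward pass through a network with one hidden layer, using the activation functions above, can be sketched as follows. The example assumes NumPy, takes the layer sizes from Figure 2.3 (3 inputs, 2 hidden units, 1 output), and uses a linear output activation by default; the weight values and function names are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2, hidden_activation=relu, output_activation=lambda z: z):
    """Compute the network output from the hidden layer and the output layer."""
    hidden = hidden_activation(W1 @ x + b1)
    return output_activation(W2 @ hidden + b2)

W1 = np.array([[1.0, -1.0, 0.5],
               [-2.0, 0.0, 1.0]])   # input -> hidden, shape (2, 3)
b1 = np.array([0.0, 0.5])
W2 = np.array([[1.0, 2.0]])         # hidden -> output, shape (1, 2)
b2 = np.array([-1.0])
x = np.array([1.0, 2.0, 3.0])
y_hat = forward(x, W1, b1, W2, b2)
```

Passing `output_activation=sigmoid` instead gives an output between 0 and 1, as would be used for a binary classifier.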

Backpropagation is typically used to train the weights by computing the gradient of the error or loss function $J(\theta)$. The error or loss function measures the difference between the network's output and the expected output. Backpropagation was discovered independently by several researchers during the 1970s and 1980s [64, 26, 40, 52]. Mean Squared Error (MSE) is typically used as a loss function for regression tasks, and is defined as:

$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

Cross-entropy loss is common for classification tasks, and is defined as:

$$J(\theta) = -\sum_{i=1}^{N} p(\vec{x}_i) \log(q(\vec{x}_i))$$

where $p(\cdot)$ is the true distribution, $q(\cdot)$ is the model distribution, and $N$ is the number of samples.
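The two loss functions can be sketched in a few lines. This example assumes NumPy, represents the true distribution of each sample as a one-hot row, and adds a small constant `eps` inside the logarithm to avoid `log(0)`; both choices are assumptions for illustration rather than details taken from the thesis.

```python
import numpy as np

def mean_squared_error(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

def cross_entropy(p, q, eps=1e-12):
    """p: true distributions, q: model distributions, both of shape (N, classes)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))

mse = mean_squared_error([1.0, 2.0], [1.0, 4.0])
ce = cross_entropy([[1.0, 0.0]], [[0.5, 0.5]])
```

A confident wrong prediction is penalized much more heavily by cross-entropy than an uncertain one, since the loss grows without bound as the predicted probability of the true class approaches zero.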

2.5.2 Optimization algorithms

In order to update the weights an optimization algorithm is used. Popular optimization algorithms are typically based on Stochastic Gradient Descent (SGD), which was introduced by Robbins and Monro [51]. SGD has the following form:

$$w := w - \eta \nabla J_i(w)$$

where $J_i(w)$ is the loss for a single sample and $\eta$ is the learning rate, which specifies how big the updates should be. Backpropagation is typically used to compute the gradient of the loss function. Variants of SGD that incorporate information about previous gradients are outlined below.
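The SGD update can be illustrated on a toy problem where the gradient is known in closed form. The sketch below assumes NumPy; the helper `sgd`, the per-sample loss $J_i(w) = (w - s_i)^2 / 2$, and the hyperparameter values are hypothetical choices for the example.

```python
import numpy as np

def sgd(grad_fn, w, samples, learning_rate=0.1, epochs=50, seed=0):
    """Apply w := w - eta * grad J_i(w), visiting the samples in random order."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float).copy()
    for _ in range(epochs):
        for i in rng.permutation(len(samples)):
            w -= learning_rate * grad_fn(w, samples[i])
    return w

# For J_i(w) = (w - s_i)^2 / 2 the gradient is w - s_i and the optimum is the sample mean.
samples = np.array([1.0, 2.0, 3.0])
w_final = sgd(lambda w, s: w - s, w=np.array([0.0]), samples=samples)
```

With a constant learning rate the iterate does not settle exactly on the optimum; it keeps fluctuating in a small neighborhood around the sample mean.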


Figure 2.3: A fully connected neural network with 3 input features, 2 hidden units, and one output feature

AdaGrad was created by Duchi, Hazan, and Singer [13]. It uses an adaptive learning rate by computing separate learning rates for each parameter based on the size of the previous gradients. The update with AdaGrad is computed as follows:

$$g_{t,i} = \nabla_{x_i} J(x^{(t)}), \qquad G_{t,i} = \sum_{j=1}^{t} g_{j,i}^2$$

$$x^{(t+1)}_i = x^{(t)}_i - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}}\, g_{t,i}$$

where $x^{(t)}_i$ is the value of parameter $i$ at timestep $t$, $G_{t,i}$ is the sum of squared gradients for parameter $i$ up to timestep $t$, and $\epsilon$ is a small constant that prevents division by zero. Since the learning rate depends on the sum of previous squared gradients, the algorithm can be informally interpreted as giving a lower learning rate to frequent features and a larger learning rate to infrequent features.

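RMSProp was proposed by Tieleman and Hinton [59]. It uses an exponentially decaying average of the squared gradients to normalize the learning rate.

$$E[g^2_{t+1,i}] = \gamma E[g^2_{t,i}] + (1 - \gamma)\, g^2_{t+1,i}$$

$$x^{(t+1)}_i = x^{(t)}_i - \frac{\eta}{\sqrt{E[g^2_{t+1,i}] + \epsilon}}\, g_{t+1,i}$$

where $\gamma$ is the decay rate that weights the average of the previous squared gradients against the current squared gradient, and $\epsilon$ is a small constant that prevents division by zero.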

Adam was introduced by Kingma and Ba [24]. It computes an exponentially decaying average of the squared gradients ($v_t$), as in RMSProp, and an exponentially decaying average of the gradients ($m_t$). $m_t$ and $v_t$ are estimates of the first moment (mean) and second moment (uncentered variance) of the gradients. It is defined as:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{2.2}$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{2.3}$$

$$x_t = x_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t \tag{2.4}$$

The variables $\hat{m}_t, \hat{v}_t$ are bias-corrected estimates of the first and second moment. $\beta_1, \beta_2$ are the hyperparameters for the decay rates, and $\epsilon$ is used to prevent division by zero. In the original paper [24] the authors propose $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$.
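The three adaptive update rules differ only in the statistics they keep about past gradients. The following sketch writes each rule as a single-step function, assuming NumPy; the function names and default learning rates are hypothetical, and the RMSProp step is assumed to use the freshly updated running average, as in the common formulation.

```python
import numpy as np

def adagrad_step(x, g, G, lr=0.01, eps=1e-8):
    """G accumulates the sum of squared gradients."""
    G = G + g ** 2
    return x - lr / np.sqrt(G + eps) * g, G

def rmsprop_step(x, g, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    """avg_sq is the exponentially decaying average of squared gradients."""
    avg_sq = gamma * avg_sq + (1.0 - gamma) * g ** 2
    return x - lr / np.sqrt(avg_sq + eps) * g, avg_sq

def adam_step(x, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """t is the 1-based timestep used for the bias correction."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return x - lr / (np.sqrt(v_hat) + eps) * m_hat, m, v

# A few Adam steps on f(x) = x^2, whose gradient is 2x.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 11):
    x, m, v = adam_step(x, 2.0 * x, m, v, t)
```

On the first step the bias correction makes the Adam update roughly `lr` times the sign of the gradient, independent of the gradient's magnitude.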

2.5.3 Batch Normalization

Batch Normalization was introduced by Ioffe and Szegedy [21]. It is a method used to normalize the input to each layer. When training with Stochastic Gradient Descent (SGD), the input to one layer depends on the parameters in the previous layers. When the parameters change during training, the input distribution to the next layer changes as well. This is called covariate shift [55]. Covariate shift hinders learning since the model has to adapt to a changing input distribution. Batch normalization reduces the dependence on scale, which allows for a higher learning rate [21], by normalizing each input in each layer as follows:

$$\hat{x}^{(k)}_l = \frac{x^{(k)}_l - \mu(x^{(k)}_l)}{\sqrt{\sigma^2(x^{(k)}_l)}}, \qquad y^{(k)}_l = \gamma^{(k)}_l \hat{x}^{(k)}_l + \beta^{(k)}_l$$

where $x^{(k)}_l$ is the $k$th input for layer $l$, $\hat{x}^{(k)}_l$ is its normalized value, and the mean $\mu(\cdot)$ and variance $\sigma^2(\cdot)$ are estimated over a mini-batch. The parameters $\gamma$ and $\beta$ are used to scale and shift the normalized value, since normalization may change the input representation. Ioffe and Szegedy [21] applied Batch Normalization to a variant of the Inception Network [58] and achieved the same accuracy in less than half the number of training steps, $13.3 \cdot 10^6$ steps compared with $31.0 \cdot 10^6$ steps.

Figure 2.4: A dropout neural network. (a) Before dropout: a network with two layers. (b) After dropout: the same network with thinned nodes. The nodes marked with a cross are disabled.
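The batch normalization transform for one layer can be sketched as below. The example assumes NumPy and adds a small constant `eps` under the square root for numerical stability, which is an assumption beyond the formula in the text; it also covers only the training-time computation with mini-batch statistics.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x has shape (batch_size, features); gamma and beta have shape (features,).
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Two features on very different scales end up with comparable ranges.
batch = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
out = batch_norm(batch, gamma=np.ones(2), beta=np.zeros(2))
```

With `gamma = 1` and `beta = 0` every feature has approximately zero mean and unit variance over the mini-batch, regardless of its original scale.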

2.5.4 Dropout

Dropout is a technique introduced by Srivastava et al. [56] which randomly ignores nodes during training. The benefit is that it addresses overfitting by preventing the nodes from co-adapting and developing dependencies on each other. Dropout works by disabling random nodes, together with their incoming and outgoing connections, with probability $p$. During testing all nodes are enabled. Disabling nodes during training allows the network to approximately combine many different networks efficiently. An example of a dropout network is shown in Figure 2.4.
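A dropout layer can be sketched as a random mask applied to the activations. The example below assumes NumPy and uses the "inverted dropout" variant, which rescales the surviving activations during training so that no rescaling is needed at test time; this variant is an assumption for the illustration and may differ from the scheme used in the thesis.

```python
import numpy as np

def dropout(activations, p, training, rng):
    """Disable each node with probability p during training (inverted dropout)."""
    if not training or p == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
hidden = np.ones(10_000)
train_out = dropout(hidden, p=0.5, training=True, rng=rng)
test_out = dropout(hidden, p=0.5, training=False, rng=rng)
```

During training roughly half of the activations are zeroed and the rest are doubled, so the expected activation is unchanged; at test time the input passes through untouched.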

2.5.5 Word Embedding

Word embedding is a group of techniques for representing words or phrases with vectors of real numbers. A language vocabulary is often very big, and the "curse of dimensionality" is a big problem here. By mapping words into lower dimensional vectors it is possible to reduce the sparsity by capturing the words' properties in a lower dimensional continuous feature vector, and to represent words used in similar contexts with similar feature vectors. An early implementation of word embedding was developed by Bengio et al. [3] with one linear projection layer and one non-linear hidden layer. The model yielded better perplexity than the state of the art at the time (a smoothed trigram model).

Mikolov et al. [33] developed two new word embedding models. Both models have one input layer and one output layer. The first model is the Continuous Bag-of-Words model (CBOW), where the current word is predicted based on the surrounding words. The second model is the continuous skip-gram model, where the surrounding words are predicted based on the current word.
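The difference between the two models is easiest to see in the training examples they are built from. The sketch below only generates those examples for a token list and does not train an embedding; the function names, the window size, and the sample sentence are hypothetical.

```python
def cbow_pairs(tokens, window=2):
    """(context words, target word): predict the current word from its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(current word, context word): predict each neighbour from the current word."""
    return [(target, c) for context, target in cbow_pairs(tokens, window) for c in context]

tokens = ["pickup", "at", "the", "depot"]
cbow = cbow_pairs(tokens, window=1)
skipgram = skipgram_pairs(tokens, window=1)
```

CBOW produces one example per position with several input words, whereas skip-gram produces one example per (word, neighbour) pair.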

2.5.6 Recurrent Neural Networks

A Recurrent Neural Network (RNN) is a type of ANN with a connection between the hidden layer and itself (see Figure 2.5). RNNs are typically used for sequential input where the inputs are not independent. An RNN allows the model to have a kind of memory by capturing information from previously calculated steps and using it to compute the new output.

The steps to compute the output for timestep $t$ are:

$$a_t = W s_{t-1} + U \vec{x}_t + b \tag{2.5}$$

$$s_t = \tanh(a_t) \tag{2.6}$$

$$o_t = V s_t + c \tag{2.7}$$

$W$ contains the weights for the previous state ($s_{t-1}$), $U$ the weights for the input features ($\vec{x}_t$), and $V$ the weights for the output. $b$ and $c$ are the bias terms. In practice RNNs have difficulty capturing longer dependencies, with the gradients either "blowing up" or "vanishing" [20]. Hochreiter and Schmidhuber [19] introduced a new kind of RNN called Long Short Term Memory (LSTM) that is able to capture longer term dependencies. LSTM uses memory cells to store information from previous timesteps. The computation steps are as follows [19]:

$$i_t = \sigma(W_i \vec{x}_t + U_i s_{t-1} + b_i) \tag{2.8}$$

$$f_t = \sigma(W_f \vec{x}_t + U_f s_{t-1} + b_f) \tag{2.9}$$

$$o_t = \sigma(W_o \vec{x}_t + U_o s_{t-1} + b_o) \tag{2.10}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \vec{x}_t + U_c s_{t-1} + b_c) \tag{2.11}$$

$$s_t = o_t \odot \tanh(c_t) \tag{2.12}$$

$W_q$ and $U_q$ are the weights for the input features ($\vec{x}_t$) and recurrent connections ($s_{t-1}$) respectively, with $q$ being either the input gate ($i$), output gate ($o$), forget gate ($f$), or memory cell ($c$). $\sigma(\cdot)$ is the sigmoid function and $\odot$ denotes element-wise multiplication.

Figure 2.5: A folded and unfolded Recurrent Neural Network. The network has a number of hidden units as state ($s$) and uses information from previous timesteps to compute the next state.
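A single timestep of Equations 2.5 to 2.7 and 2.8 to 2.12 can be sketched as two small functions. The example assumes NumPy; the function names, the `params` dictionary layout, and the all-zero weights in the demonstration are hypothetical and chosen only so the result is easy to check by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, s_prev, W, U, V, b, c):
    s_t = np.tanh(W @ s_prev + U @ x_t + b)   # Equations 2.5 and 2.6
    o_t = V @ s_t + c                         # Equation 2.7
    return s_t, o_t

def lstm_step(x_t, s_prev, c_prev, params):
    """params maps each of 'i', 'f', 'o', 'c' to a tuple (W_q, U_q, b_q)."""
    def pre(q):
        W, U, b = params[q]
        return W @ x_t + U @ s_prev + b
    i_t, f_t, o_t = sigmoid(pre("i")), sigmoid(pre("f")), sigmoid(pre("o"))
    c_t = f_t * c_prev + i_t * np.tanh(pre("c"))   # Equation 2.11
    s_t = o_t * np.tanh(c_t)                       # Equation 2.12
    return s_t, c_t

# With all-zero weights every gate equals 0.5, so half of the old cell state is kept.
hidden, inputs = 2, 3
zero_params = {q: (np.zeros((hidden, inputs)), np.zeros((hidden, hidden)), np.zeros(hidden))
               for q in "ifoc"}
s_t, c_t = lstm_step(np.ones(inputs), np.zeros(hidden), np.ones(hidden), zero_params)
```

Processing a sequence amounts to calling the step function once per timestep and feeding the returned state back in as `s_prev` (and `c_prev` for the LSTM).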

2.6 Permutation Equivariance and Invariance

Traditional methods require the input to be of fixed size. Permutation equivariance and invariance are important when dealing with sets, where the input lacks a specific order and has a variable size. They can be used for outlier detection [65], the Multiple Traveling Salesman Problem [22], and 3D point clouds [46]. We denote by $S_N$ the symmetric group on $N$ elements, i.e., the group of all bijections of $\{1, \ldots, N\}$ to itself. A function $f$ is permutation equivariant iff

$$f(\pi x) = \pi f(x), \quad \forall \pi \in S_N$$

A function $f$ is permutation invariant iff

$$f(\pi x) = f(x), \quad \forall \pi \in S_N$$

Zaheer et al. [65] designed a deep network architecture for both the equivariant and invariant models. They state that a function $f$ is invariant iff it decomposes to $f(X) = \rho\left(\sum_{x \in X} \phi(x)\right)$ [65], for the set $X$ and the transformations $\phi$ and $\rho$. The invariant model uses one or several fully connected layers to represent the transformations $\phi$ and $\rho$. The model sums all $\phi(x)$ and applies $\rho$ to the sum.

A function $f_\Theta(\vec{x}) = \sigma(\Theta \vec{x})$, $f_\Theta : \mathbb{R}^N \to \mathbb{R}^N$, with weights $\Theta \in \mathbb{R}^{N \times N}$ is equivariant iff [49]

$$\Theta = \lambda I + \gamma (\mathbf{1}\mathbf{1}^T), \quad \gamma, \lambda \in \mathbb{R} \tag{2.13}$$

One variation on Equation 2.13, which was shown to perform better in some applications [49], is:

$$f(\vec{x}) = \sigma(\lambda I \vec{x} - \gamma (\max \vec{x}) \mathbf{1})$$

A generalization of this equation to higher dimensions ($K$ features for each object in the set), with a factoring to reduce the number of parameters, is:

$$f(\vec{x}) = \sigma(\beta + (\vec{x} - \mathbf{1}(\max \vec{x}))\Gamma), \quad \Gamma \in \mathbb{R}^{K \times K}$$

This is a fully connected layer with max-normalization within the set. A permutation equivariant function can be transformed into a permutation invariant function by applying a pooling operation over the set dimension [49]. The pooling operation should be commutative, such as maximization and summation [49].
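Both properties can be checked numerically by permuting the rows of a set and comparing outputs. The sketch below assumes NumPy, uses `tanh` as the non-linearity $\sigma$, and takes the maximum per feature over the set dimension; the random weights and the single-layer choices for $\phi$ and $\rho$ are hypothetical.

```python
import numpy as np

def equivariant_layer(X, Gamma, beta):
    """Max-normalized layer; X holds N set elements with K features, shape (N, K)."""
    return np.tanh(beta + (X - X.max(axis=0, keepdims=True)) @ Gamma)

def invariant_model(X, phi, rho):
    """Apply rho to the sum of phi over the set elements."""
    return rho(np.sum([phi(x) for x in X], axis=0))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                       # a set of 5 objects with 3 features
Gamma, beta = rng.normal(size=(3, 3)), rng.normal(size=3)
A, B = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
phi = lambda x: np.tanh(A @ x)                    # per-element transformation
rho = lambda z: B @ z                             # transformation of the pooled sum
perm = rng.permutation(5)

layer_out = equivariant_layer(X, Gamma, beta)
layer_out_permuted = equivariant_layer(X[perm], Gamma, beta)
```

Permuting the input rows permutes the output rows of the equivariant layer in the same way, and pooling that output over the set dimension removes the dependence on order altogether.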

2.7 Learning dispatch decisions

When learning from human dispatchers’ previous decisions, the assumption is that the decisions approximate the optimal solution, i.e., the decisions are good. This assumption may not hold all the time and the decision quality will vary. According to one dispatcher who was interviewed, the decisions will be worse when there are many incoming orders, which is typical during the afternoon. This makes it harder for the algorithms to learn good decisions, since not all decisions in the training data are good. One improvement is to use route optimization to generate the optimal solutions and then feed those decisions into the algorithms to improve the learning. This was done by Chen et al. [10].


2.8 Related works

2.8.1 Automated dispatching

Potvin, Shen, and Rousseau [45] used a neural network to estimate a vehicle’s quality for a specific delivery and then picked the vehicle with the highest quality score. To incorporate the information about the other vehicles, they translated the inputs with respect to the best attribute value over all vehicles. The routing expert who was used to create the data set was not a professional dispatcher and the deliveries were simulated. The results were, however, promising, with the model selecting the same vehicle as the human dispatcher in 89% of the test samples.

Shen et al. [54] developed a program to assist dispatchers in an express mail company without capacity constraints. For each delivery they estimated the travel time by minimizing the cost of inserting the pickup and delivery location in the existing routes and the additional lateness introduced from inserting the pickup and delivery location. They used 90 deliveries and corresponding dispatcher decisions as training data for a 3-layer backpropagation neural network, and 50 deliveries as test data. Features from a single driver were used as input and the output was a single quality score (1 for select driver, and 0 otherwise). The model output for each driver was ranked and compared to the dispatcher’s decision. The network was shown to perform close to the human dispatcher in empty travel time, lateness at pickup, and lateness at delivery.

Chen et al. [10] introduced a Data Mining-based Dispatching System (DMDS) to learn dispatching rules in the intermodal freight industry. Decision trees were used for learning the rules and the solution was then used as input to an optimization algorithm in order to improve on the dispatchers’ results. They complemented the load attributes, e.g., pickup and delivery location, start and end time of a load, and required trailer type, with driver attributes like the driver’s start and end time, home location, and remaining work hours. Based on the trained decision trees, the most important attributes were distance between load and driver, difference between a driver’s remaining work hours and a load’s service duration, estimated remaining time for a driver to finish the current task, and the driver’s remaining work hours. They show that using the DMDS in the optimization algorithm gives a small performance improvement compared to an optimization-only approach. They also show that training the DMDS with the optimized solutions gives slightly better results: 5.6% lower empty travel mileage and a 1.5% reduction in empty ratio.

Vukadinovic, Teodorovic, and Pavkovic [61] used a neural network to learn decisions for "loading, transport and unloading of gravel by inland water transportation." [61, p.1], where the goal was to predict the number of barges assigned to each tug. The input was the suitability for barge i to be assigned to tug j, where suitability was a function of tug j’s barge capacity and the difference in release time for the barge and tug. The network was trained with heuristic simulated annealing. The network was only trained on 56 samples and tested on 16 samples. The network performed slightly worse than the dispatcher, but the results were promising.

Riessen, Negenborn, and Dekker [50] developed a system for real-time container transport planning, to be able to provide instant decisions for incoming orders. They used decision trees to determine which service (train, barge, truck) to use for a single container. The decision tree was trained on historical data that had first been optimized. They showed that their model could reduce transportation costs by 3% over a greedy approach.

Mojtaba Maghrebi, Claude Sammut, and Travis Waller [36] developed a model for dispatching Ready Mixed Concrete (RMC). They used decision trees to prioritize customers for dispatching Ready Mixed Concrete and used several features, e.g., distance to depot, unloading time, travel time, required amount etc. The model was tested by simulating a plant with three customers for 200 days, and the data was sent to a human dispatcher to prioritize. The data was then used for training and testing (90-10 split) and the model approached an 80% accuracy.

2.8.2 Permutation Equivariance and Invariance

Zaheer et al. [65] designed a neural network architecture called Deep Sets which operates on unordered variable length feature sets. This means that the input is made up of sets of features where the number of elements in each set is not constant, and the elements in the sets do not have any inherent order. The permutation invariant model transforms each instance $x_i$ in the set with one or more feedforward layers ($\phi(\cdot)$) into the representation $\phi(x_i)$. Then the sum of the transformations is computed and passed as input to another network, e.g., a feedforward network. They also propose an equivariant model. This model consists of a non-linear transformation applied to a weighted combination of the input and the sum of the input, i.e., to $\lambda I x + \gamma(\mathbf{1}\mathbf{1}^T)x$. The invariant model was shown to be comparable to other techniques in outlier detection, point cloud classification and image tagging.

Gardner et al. [16] proposed a permutation invariant network architecture called Convolutional Deep Averaging Networks (CDANs). The architecture is similar to Zaheer et al. [65], with a transformation (embedding) of all input instances followed by a pooling operation. Given a set $X_i = (x_1, \dots, x_{l_i})$, $x_k \in \mathbb{R}^d$, the network applies an embedding function $f : \mathbb{R}^d \to \mathbb{R}^m$ to all input features $x_k$, where $m$ is the embedding size. The output from the embedding function is then pooled. The embedding function $f(\cdot)$ can be arbitrary, with the restriction that it is compatible with backpropagation. In [16], $f(\cdot)$ is represented with a Multi Layer Perceptron/Feedforward network. Zaheer et al. [65] perform tests with summation, averaging, and max pooling functions on a point cloud classification task. The best performing pooling operation was summation and they showed that non-linear transformations outperform linear transformations.

2.8.3 Differences to previous research

The main differences in this thesis compared with previous studies are a comparison of multiple algorithms on a single data set, and the use of permutation invariant layers in a neural network to use more information about the vehicles. The previous studies on automated dispatching have each focused on a single algorithm; it could therefore be valuable to compare multiple algorithms on the same data set and evaluate the performance. The permutation invariant network makes it possible to use a set of deliveries as input to the network. This has the potential to create better vehicle selections by having the ability to analyze all deliveries for all active vehicles, for example by learning that deliveries with similar pickup and delivery locations should preferably be assigned to the same vehicle.


Method

This chapter motivates the methods used in this thesis and the steps to prepare the data, and to run and evaluate the models. It starts by describing the properties and features of the two data sets as well as the target and training/test split (Section 3.1). Then preprocessing is performed to encode categorical features and scale continuous features (Section 3.2). The problem is formulated as a binary classification problem (Section 3.3). The following sections outline the architecture of the two neural networks and explain the models’ implementation (Section 3.4 and Section 3.5 respectively). This chapter ends by listing all hyperparameters (Section 3.6) and describing the metrics used for evaluation and the tests performed (Section 3.7).

3.1 Data

The data used in this thesis is from a Swedish transportation company.

The complete data set had more than 4 years of deliveries. However, this was deemed infeasible to run due to high computational complexity for the permutation invariant network. The permutation invariant network uses information about all deliveries for all vehicles, which increases the size of the input data. The computation is also hampered by having variable length input, which has to be transformed to be of fixed size to increase the speed of computation. However, it is still very slow, and running 35 epochs took approximately 19 hours.

The data set used contains 31,309 deliveries over approximately 3 months, with on average 41 active vehicles for each delivery. The deliveries took place in the first 3 months of 2018, which was the time period with the least amount of missing data. The fields that were sometimes missing were home location, coordinates for pickup and delivery, and start and end times. The dispatchers are grouped by region, and for that reason only data from the Stockholm area is used. Stockholm was chosen since most deliveries are made in the Stockholm area.

Table 3.1: Performance metrics from Chen et al. [10] between a human dispatcher and a decision tree model.

Metric                   Dispatcher   Model
Empty travel mileage     2911.4       2688.6
Loaded travel mileage    7668.3       7448.3
Total travel mileage     10579.8      10136.9
Empty ratio (%)          27.52        26.52
Loaded ratio (%)         72.48        73.48

There are two data sets used in this thesis: one for the permutation invariant neural network, and one for the other algorithms. The differences between the data sets are outlined in Section 3.1.3. However, the main features are the same and described below in Section 3.1.1.

3.1.1 Features

The initial features were taken from Chen et al. [10] because their model performed slightly better than a human dispatcher, see Table 3.1. They write that the fact that the model outperformed the human dispatcher may be caused by randomness in, for example, vehicle speed and travel time in the real world which are assumed to be constant in the simulation.

All features from Chen et al. [10] were used except trailer type, estimated time to finish current task, and estimated mileage during work hours. These are not applicable since the vehicles do not have trailers, and the deliveries come in throughout the day, which means that it is not possible to determine remaining mileage. The transportation company did not perform one task at a time; instead, a new package could be collected and delivered if it was on the way to the current collection or delivery location. Distance between driver and load was also substituted by travel time between load and driver due to inefficient computation of distance. The features are shown in Table 3.2. Additional features have been added based on feedback from a human dispatcher on what was deemed important when making the decisions; these features are shown in Table 3.3. The data contains a free text field (LText) which is used for arbitrary information that is deemed important. This field could contain information about pickup and delivery times, where to enter the building, whom to ask for, package dimensions etc. The package type indicates how urgent the shipping is (1 hour shipping etc.), and the number of deliveries is important to be able to balance the load over all vehicles, which increases the efficiency. Computation of the distances and travel times was done by Open Source Routing Machine (OSRM) [32]. The feature DLDur replaced DLDist (distance between driver and load) since OSRM did not have support for distance when computing more than one source-destination pair. Computing each source-destination pair individually was deemed too time consuming.

Table 3.2: Initial features used in all models. Based on the features from Chen et al. [10].

Name         Description
LSTime       Start time of a load
LETime       End time of a load
LCurLoc      Current location of a load
LDist        Travel mileage of a load
LEstTime     Estimated service duration of a load
DBoard       Driver ID
DSTime       Start time of a driver
DETime       End time of a driver
DHomeLoc     Home location of a driver
DCurLoc      Current location of a driver
DRemTime     A driver’s estimated remaining work hours
DLTimeDiff   Difference between a driver’s remaining work hours and a load’s service duration
DLDur        Estimated travel time between a driver and a load (proxy for DLDist)


Table 3.3: Additional features used in all models based on feedback from a human dispatcher.

Name           Description
LText          Free text (package dimensions, time windows etc.)
LPkgType       The type of package (delivered in 3h, or urgent, etc.)
DNumRequests   The number of existing deliveries for the driver

3.1.2 Target

The data contains deliveries and the associated vehicle that was assigned and made each delivery. The selection of a vehicle for a delivery was made by human dispatchers. The data also contains the other vehicles which were active at the time the delivery was assigned but which did not get selected. The target is therefore the vehicle selected by the human dispatcher, and the goal is for the algorithm to select the same vehicle as the human dispatcher did from a fleet of vehicles.

3.1.3 Description of Data sets

One data set is used for the permutation invariant neural network, and one data set is used for decision trees, logistic regression, SVM, and the feedforward neural network. The idea was to use the permutation invariant neural network to take into account the assigned deliveries for all other vehicles. Since the vehicles have different numbers of other deliveries, this requires the algorithm to handle sequences of varying length, which is difficult to do for logistic regression, SVM, and decision trees. Therefore, this additional information will only be added to the permutation invariant neural network. An additional feature called DCurrLoad is a list with a vehicle’s other assigned deliveries. The information for each delivery is a vector with: LSTime, LETime, LCurLoc, LDist, and LEstTime.

Permutation invariant data set

The features for the permutation invariant neural network are all features in Table 3.2 and Table 3.3, as well as DCurrLoad for all other vehicles (see Figure 3.2).


Secondary data set

The features in Table 3.2 and Table 3.3 are the ones used in the data set for the following algorithms: decision tree, logistic regression, SVM, and feedforward neural network.

3.1.4 Training and test set

The data set is divided into a training set, validation set, and test set.

60% of the data set is used for training, 20% as a validation set, and 20% as a test set. The sets are divided with respect to temporal order such that the validation and test sets consist of deliveries made after the deliveries in the training set. Due to training being slow, k-fold cross-validation would be too time consuming. A large training set allows the models to have more parameters and reduces the risk of overfitting.
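A temporal 60/20/20 split of this kind can be sketched as follows. The record layout and the field name `assigned_at` are hypothetical and only serve to illustrate sorting by time before slicing.

```python
def temporal_split(deliveries, train_frac=0.6, val_frac=0.2):
    """Split records into train/validation/test by time, oldest first."""
    ordered = sorted(deliveries, key=lambda d: d["assigned_at"])
    n = len(ordered)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]  # everything after the validation period
    return train, val, test

# Ten toy deliveries with integer timestamps in scrambled order.
timestamps = [7, 2, 9, 4, 1, 8, 3, 10, 6, 5]
deliveries = [{"id": i, "assigned_at": t} for i, t in enumerate(timestamps)]
train, val, test = temporal_split(deliveries)
```

Slicing after sorting guarantees that no validation or test delivery precedes a training delivery, which a random shuffle would not.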

3.2 Preprocessing

The next step after the data extraction is to encode categorical features and normalize the continuous features. The categorical variables LPkgType and DBoard are encoded using a one hot scheme. One hot encoding is a technique that

... transforms a single variable with n observations and d distinct values, to d binary variables with n observations each. Each observation indicating the presence (1) or absence (0) of the dichotomous binary variable. (Potdar and Kinnerkar [44, p.7])

The continuous variables are transformed with min-max normalization such that the variables are in the interval [0, 1] using the formula:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

The free text feature is split into words and given as input to the embedding layer.
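The two numeric preprocessing steps can be illustrated with a short standard-library sketch. The example values for LPkgType and DLDur are hypothetical, and mapping a constant column to 0 is an assumption made here to avoid division by zero.

```python
def one_hot(values):
    """Map each categorical value to a binary vector with a single 1."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded

def min_max(values):
    """Scale continuous values into [0, 1]; a constant column maps to 0 (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

pkg_types = ["urgent", "3h", "urgent", "1h"]  # hypothetical LPkgType values
durations = [12.0, 30.0, 21.0, 48.0]          # hypothetical DLDur values in minutes

categories, pkg_encoded = one_hot(pkg_types)
durations_scaled = min_max(durations)
```

In practice the minimum and maximum would be computed on the training set only and reused for the validation and test sets, so that no information leaks from later deliveries.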


3.3 Problem formulation

Before running the algorithms on the preprocessed data we must determine how to model the task. The dispatching problem can be modeled as a binary classification problem where a vehicle is either picked or rejected. The input is information about the load (pickup and delivery location, earliest pickup, latest delivery, package dimensions and weight etc.) and about the driver (current location, home location, working hours etc.). The problem is therefore to select the most suitable vehicle $x_i^t$ at time $t$ from a set of available vehicles $X^t$. The target vector $y$ consists of two classes, ’non-suitable’ and ’suitable’, and can be seen as a one hot encoded vector of length 2. The vector [1, 0] is defined as ’non-suitable’ and the vector [0, 1] is defined as ’suitable’. The model output is the probability of the sample belonging to each class. The algorithms are fed the active vehicles one at a time and output the probability of each vehicle being ’suitable’. The vehicle with the highest probability is assigned the delivery. In the next section (3.4) the architectures for the two neural networks are described.
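The selection step described above, scoring each active vehicle independently and assigning the delivery to the highest scoring one, can be sketched as follows. The function `toy_score` is a hypothetical stand-in for a trained model's probability of ’suitable’, and the vehicle records are invented for illustration.

```python
def select_vehicle(vehicles, score_fn):
    """Score each active vehicle on its own and return the best id with all scores."""
    scores = {v["id"]: score_fn(v) for v in vehicles}
    best = max(scores, key=scores.get)
    return best, scores

def toy_score(vehicle):
    """Stand-in for a trained model's P('suitable'); closer vehicles score higher."""
    return 1.0 / (1.0 + vehicle["travel_minutes"])

active = [
    {"id": "A", "travel_minutes": 12.0},
    {"id": "B", "travel_minutes": 4.0},
    {"id": "C", "travel_minutes": 25.0},
]
chosen, scores = select_vehicle(active, toy_score)
```

Keeping all scores, and not only the winner, is what later allows the vehicles to be ranked for the top 3 accuracy.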

3.4 Neural networks

3.4.1 Sample definition and Loss function

Each batch consists of a single set $X_i$ containing one or more vehicles. Since only one vehicle was selected for the delivery, the target vector will have one sample with the class [0, 1] (’suitable’, i.e., the vehicle picked by the human dispatcher) and the rest have class [1, 0] (’non-suitable’, vehicles not picked by the human dispatcher). The loss function used is categorical cross entropy, which is defined in Section 2.5.1.

3.4.2 Feedforward Architecture

The first model is a simple feedforward network with batch normalization (see Section 2.5.3). It has two hidden layers and one output layer. The model can be seen in Figure 3.1. The input is a vector with the features in Tables 3.2 and 3.3. The output layer uses the softmax activation function.


Figure 3.1: Network architecture for the feedforward neural network. (Diagram labels: Load Information, Driver Information, Fully connected Layer, Batch Normalization, Fully connected Layer, Batch Normalization, Output.)

3.4.3 Permutation invariant Architecture

Figure 3.2 shows the neural network model, where Load Information are the attributes in Tables 3.2 and 3.3 which start with L, and Driver Information are the attributes in Tables 3.2 and 3.3 which start with D. The parts that handle unordered variable length input have Deep Sets [65] layers. The Deep Sets layer has an embedding function, represented by two fully connected layers, which is applied separately to all existing orders (load information/DCurrLoad) for each vehicle, similar to a convolution. The weights in the fully connected layers are shared between all instances in a set. The result of the embedding function is pooled using summation, which results in a fixed size representation of all current deliveries for each vehicle. If the deliveries for vehicle $i$ are $X_i = \{x_1, x_2, \dots, x_{Q_i}\}$, with $x_j$ being the features for delivery $j$, the Deep Sets layer applies two feedforward layers, which we denote as $\phi(\cdot)$, to all $x_j$, giving $\phi(x_j)$. The results from the feedforward network are then pooled together with sum pooling, which gives deep_sets_output $= \phi(x_1) + \phi(x_2) + \dots + \phi(x_{Q_i})$. This output has a fixed size regardless of the number of deliveries ($Q_i$). The procedure for the second Deep Sets layer is analogous.

These representations are then used as input to another Deep Sets layer, which results in a fixed size representation of all vehicles and their deliveries. This representation is called global vehicle information. Global vehicle information is only computed once and then concatenated with the load, driver, and text information for each active vehicle for a specific delivery. The free text for the current order is transformed into a fixed size representation through an embedding layer and an LSTM layer. The text representation is then concatenated with the global vehicle information, and the information about the order and the current vehicle. The output layer has two nodes and uses the softmax activation function to get a score for good and bad selection.
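The two stacked pooling stages behind global vehicle information can be sketched in plain Python. This is an illustration under simplifying assumptions, not the network used in the thesis: $\phi$ is a single tanh layer instead of two fully connected layers, and all weights and delivery features are hypothetical.

```python
import math

def phi(x, w):
    """Shared embedding applied to every element: tanh of a weighted sum per unit."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]

def sum_pool(vectors, size):
    """Element-wise sum; an empty set pools to the zero vector."""
    pooled = [0.0] * size
    for v in vectors:
        pooled = [p + vi for p, vi in zip(pooled, v)]
    return pooled

def global_vehicle_information(fleet, w_delivery, w_vehicle):
    """Two stacked sum-pooling stages: deliveries -> vehicle, vehicles -> fleet."""
    per_vehicle = [
        sum_pool([phi(d, w_delivery) for d in deliveries], len(w_delivery))
        for deliveries in fleet
    ]
    return sum_pool([phi(v, w_vehicle) for v in per_vehicle], len(w_vehicle))

# Hypothetical weights: 2 delivery features -> 3 units -> 2 units.
w_delivery = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.6]]
w_vehicle = [[0.1, 0.2, -0.3], [0.4, -0.5, 0.6]]

# Three vehicles carrying 2, 0 and 3 current deliveries respectively.
fleet = [
    [[0.1, 0.9], [0.4, 0.2]],
    [],
    [[0.7, 0.3], [0.5, 0.5], [0.0, 1.0]],
]
# Same fleet with vehicles reordered and one delivery list reversed.
reordered = [list(reversed(fleet[2])), fleet[0], fleet[1]]

info = global_vehicle_information(fleet, w_delivery, w_vehicle)
info_reordered = global_vehicle_information(reordered, w_delivery, w_vehicle)
```

The output has the same size whatever the number of vehicles or deliveries, and it is unchanged (up to floating point rounding) when vehicles or their deliveries are reordered.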

3.4.4 Class Imbalance

One problem with having each batch be a single set is that the class ’suitable’ ([0, 1]) has a single sample while the ’non-suitable’ class has on average 40 samples. Problems with skewed data are called class imbalance problems and can be observed in, for example, fraud detection and medical diagnosis. There are several ways to address this problem, with the two common categories being sampling based solutions and cost based solutions. With the sampling approach the idea is to either add samples from the minority class (oversampling), or remove samples from the majority class (undersampling). In the cost based approaches the algorithm is altered so that the cost of misclassifying a sample from the minority class is larger than that of misclassifying a sample from the majority class. In a neural network the loss is multiplied by a weight for each sample, e.g., the weight $w$ could be added to the categorical cross entropy (see Section 2.5.1) like this:

$$J(\theta) = -\sum_{i=0}^{N} w_i \, p(\vec{x}_i) \log(q(\vec{x}_i))$$

with $N$ being the number of samples. The weights for the cost based approach were chosen by manually testing a range of values and then picking the ones giving the best performance.
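The weighted loss can be written out directly. In this sketch the targets are the one hot vectors [0, 1] and [1, 0] used in this chapter, the predictions and the weight values are hypothetical, and a small `eps` is added inside the logarithm as a numerical safeguard that is not part of the formula.

```python
import math

def weighted_cross_entropy(targets, predictions, weights):
    """J = -sum_i w_i * sum_c p_i[c] * log(q_i[c]) for one-hot targets p and outputs q."""
    eps = 1e-12  # guards against log(0); not part of the formula
    total = 0.0
    for p, q, w in zip(targets, predictions, weights):
        total -= w * sum(pc * math.log(qc + eps) for pc, qc in zip(p, q))
    return total

# One delivery with three vehicles; the first one was picked by the dispatcher.
targets = [[0, 1], [1, 0], [1, 0]]
predictions = [[0.4, 0.6], [0.7, 0.3], [0.9, 0.1]]

unweighted = weighted_cross_entropy(targets, predictions, [1.0, 1.0, 1.0])
# Hypothetical weights that double the cost of the 'suitable' sample.
weighted = weighted_cross_entropy(targets, predictions, [2.0, 1.0, 1.0])
```

Raising the weight of one class scales only that class's contribution to the loss, so the gradient pushes harder on errors made for that class.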

Sampling and cost based approaches are tested in Section 4.4.

3.5 Model implementation

Scikit-learn is a machine learning library for Python [41] and was used to implement the following algorithms: decision trees, logistic regression, and SVM.


Figure 3.2: Network architecture for the permutation invariant neural network. The Deep Sets layers are described in Section 2.8.2. (Diagram labels: Delivery Text, Word Embedding, LSTM, Text Information, Load Information, Driver Information, Load Info 1 ... Load Info Q_1 through Load Info 1 ... Load Info Q_N, Fully connected Layer, Sum Pooling, Vehicle 1 Information ... Vehicle N Information, Deep Sets Layer, Global Vehicle Information, Concatenate Layer, Dropout, Output.)


3.5.1 Decision tree and Logistic regression

The decision tree uses the scikit-learn [41] library, which uses an optimized version of the Classification And Regression Tree (CART) algorithm. The LIBLINEAR [15] library is used for logistic regression.

3.5.2 SVM

Since the problem is modeled as a binary classification problem, both Support Vector Classification (SVC) and Support Vector Regression (SVR) can be used. SVR was chosen due to faster computation time. In the regression case the output is a score $s_i \in (0, 1)$ that represents the suitability of assigning the delivery to vehicle $i$, with the desired output $s_i = 1$ for the ’suitable’ vehicle and $s_i = 0$ for ’non-suitable’ vehicles.

SVR is implemented using the scikit-learn [41] library. Scikit-learn uses LIBSVM [9] for Support Vector Machines.

3.5.3 Neural networks

The neural networks are implemented in Keras [23] and Tensorflow [2], since both are among the most popular ML libraries when looking at activity on GitHub and Stack Overflow [29].

3.6 Hyperparameter Optimization

The hyperparameter optimization used random search, which was proposed by Bergstra et al. [4]. Random search was shown to be more efficient than grid search both theoretically and empirically [4]. The data set used for hyperparameter optimization contains approximately 20% of the whole data set, i.e., 7000 training samples and 2000 test samples, due to computational constraints. The objective for the optimization was top 1 accuracy. Top 1 accuracy was chosen over top 3 accuracy because a system that should replace the dispatcher has to give the best vehicle pick, and because the two metrics were assumed to correlate. The values for the hyperparameter intervals were picked after a quick manual coarse search.

The hyperparameters for each algorithm and their possible values are listed below.

Logistic Regression

• Penalty norm ∈ [l1, l2]

• Inverse of regularization strength (C) ∈ [$10^{-4}$, 5]

• Tolerance for stopping criteria ∈ [$10^{-7}$, $10^{-2}$]

Decision Tree

• Minimum leaf size ∈ [2, 300]

• Max-Depth ∈ [2, 100]

Support Vector Machines

• Penalty error term ∈ [$10^{-4}$, 5]

• Epsilon ∈ [$10^{-3}$, 1]

• Kernel ∈ [’linear’, ’polynomial (degree 3)’, ’rbf’, ’sigmoid’]

ANN

• Learning rate ∈ [$10^{-7}$, 0.1]

• Dropout ∈ [0.01, 1]

• Embedding size ∈ [10, 100]

• LSTM units ∈ [10, 300]

• Deep Sets hidden units ∈ [8, 256]

• Deep Sets output units ∈ [8, 256]

• Activation function ∈ [tanh, relu]

• Optimizer ∈ [Adam, Adagrad, RMSProp, SGD]
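Random search over such ranges can be sketched as below, here for the logistic regression ranges listed above. Sampling C and the tolerance log-uniformly is an assumption (the sampling distribution is not stated), and `toy_objective` is a hypothetical placeholder for training a model and measuring its top 1 accuracy.

```python
import math
import random

def log_uniform(rng, low, high):
    """Sample so that each order of magnitude in [low, high] is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_logistic_regression_config(rng):
    """Draw one configuration from the logistic regression ranges listed above."""
    return {
        "penalty": rng.choice(["l1", "l2"]),
        "C": log_uniform(rng, 1e-4, 5.0),
        "tol": log_uniform(rng, 1e-7, 1e-2),
    }

def random_search(evaluate, n_trials, seed=0):
    """Keep the configuration with the highest score over n_trials random draws."""
    rng = random.Random(seed)
    best_config, best_score = None, -math.inf
    for _ in range(n_trials):
        config = sample_logistic_regression_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_objective(config):
    """Placeholder for validation top 1 accuracy; peaks at C = 1."""
    return -abs(math.log10(config["C"]))

best_config, best_score = random_search(toy_objective, n_trials=50)
```

Unlike grid search, every trial draws a fresh value for each hyperparameter, so 50 trials explore 50 distinct values of C instead of a handful of grid points.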

3.7 Evaluation

Section 3.7.1 describes the metrics used for evaluation, and Sections 3.7.2 and 3.7.4 to 3.7.7 describe the different tests carried out.

3.7.1 Metrics

The two main metrics used for evaluation are accuracy and top 3 accuracy. Accuracy is computed by giving all active vehicles for each delivery a probability/score between 0 and 1, where 1 is a good selection (’suitable’) and 0 is a bad selection (’non-suitable’). The probabilities are then sorted from highest to lowest. The models are evaluated by computing the accuracy of the model’s predictions compared to the dispatcher’s decisions, i.e., the number of times the vehicle selected by the human dispatcher got the highest score divided by the number of test samples. The top 3 accuracy is the number of times the vehicle selected by the human dispatcher got one of the three highest scores, divided by the number of test samples.

Table 3.4: Example prediction for 3 deliveries and 3 vehicles (A, B, C).

Delivery   Prediction A   Prediction B   Prediction C
1          [0.4,0.6]      [0.7,0.3]      [0.9,0.1]
2          [0.1,0.9]      [0.5,0.5]      [0.7,0.3]
3          [0.2,0.8]      [0.8,0.2]      [0.7,0.3]

Figure 3.3: Histogram for a simple example with 3 deliveries and 3 vehicles (A, B, C).

The histograms in the results section (Chapter 4) show the distribution of the algorithm’s placement or rank of the actual vehicle selected by the dispatcher, so rank 2 means that the human dispatcher’s selected vehicle got the second highest score. We have a simple example with 3 vehicles (A, B, C) and 3 deliveries, where all deliveries were made by vehicle A. The model makes predictions for each vehicle for every delivery, which are shown in Table 3.4. We see that the correct vehicle got the highest predicted score (rank 1) twice (delivery 2 and 3), and the second highest score (rank 2) in one delivery (delivery 1). The ranks are then used as data to create the histograms.
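The rank, top k accuracy and histogram counts can be computed as in the following sketch. The scores are hypothetical ’suitable’ probabilities for four deliveries and four vehicles, chosen for illustration and not taken from Table 3.4.

```python
from collections import Counter

def rank_of_selected(scores, selected):
    """Rank (1 = highest score) of the dispatcher's vehicle among all scored vehicles."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(selected) + 1

def top_k_accuracy(ranks, k):
    """Fraction of deliveries whose selected vehicle is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical 'suitable' probabilities; vehicle A was selected in every delivery.
predictions = [
    ({"A": 0.6, "B": 0.3, "C": 0.1, "D": 0.2}, "A"),
    ({"A": 0.5, "B": 0.7, "C": 0.2, "D": 0.1}, "A"),
    ({"A": 0.1, "B": 0.4, "C": 0.8, "D": 0.6}, "A"),
    ({"A": 0.9, "B": 0.2, "C": 0.3, "D": 0.4}, "A"),
]
ranks = [rank_of_selected(scores, selected) for scores, selected in predictions]
top1 = top_k_accuracy(ranks, 1)
top3 = top_k_accuracy(ranks, 3)
histogram = Counter(ranks)  # rank -> number of deliveries, the data behind a histogram
```

For these scores the ranks are 1, 2, 4 and 1, so the top 1 accuracy is 0.5 and the top 3 accuracy is 0.75.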


3.7.2 Sum digits

This test was done to validate the Deep Sets implementation. The test consists of 15 digits between 0 and 10 as input, and the target is the sum of the digits. An example with all digits being 1 would have input $x = \{1, 1, 1, \dots, 1\}$ and the target would be $y = 15$. The input and output were one hot encoded. The networks were identical except for the first layer. There were two layers after the first layer, with ReLU and softmax as activation functions respectively. Three types of input layers were used: fully connected, LSTM, and Deep Sets. The LSTM and Deep Sets layers had input shape = (batch size, number of digits, size of digit), and the fully connected layer had input shape = (batch size, number of digits × size of digit). The goal is to learn how to do addition.
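One sample for this test could be generated as sketched below. The sketch assumes the digits are the integers 0 to 9, so that the one hot target has 136 classes for sums 0 to 135; the exact digit range and encoding sizes used in the thesis are not specified beyond the description above.

```python
import random

NUM_DIGITS = 15
DIGIT_CLASSES = 10                                   # assumes digits 0..9
SUM_CLASSES = NUM_DIGITS * (DIGIT_CLASSES - 1) + 1   # possible sums 0..135

def one_hot(index, size):
    """Binary vector of the given size with a single 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec

def make_sample(rng):
    """Return (set-shaped input, flattened input, one-hot target) for one example."""
    digits = [rng.randrange(DIGIT_CLASSES) for _ in range(NUM_DIGITS)]
    as_set = [one_hot(d, DIGIT_CLASSES) for d in digits]  # (number of digits, size of digit)
    flat = [bit for row in as_set for bit in row]          # (number of digits x size of digit)
    target = one_hot(sum(digits), SUM_CLASSES)
    return as_set, flat, target

as_set, flat, target = make_sample(random.Random(0))
```

The set-shaped input corresponds to what the LSTM and Deep Sets layers receive, and the flattened input to what the fully connected layer receives.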

3.7.3 Interpretation of learned rules from decision tree

The resulting graph from the decision tree is analyzed to find the features that give the most "pure" split of the data, i.e., the lowest Gini impurity.
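As a small worked example of this criterion, the following sketch computes the Gini impurity $1 - \sum_c p_c^2$ of a node and the weighted impurity after a threshold split. The travel times and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two children produced by value <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical travel times (minutes) and labels (1 = picked by the dispatcher).
travel_minutes = [3, 5, 8, 15, 20, 30]
picked = [1, 1, 0, 0, 0, 0]

parent = gini(picked)                                              # 4/9
clean_split = split_impurity(travel_minutes, picked, threshold=6)  # 0.0
poor_split = split_impurity(travel_minutes, picked, threshold=18)  # 1/3
```

A threshold of 6 minutes separates the two classes completely and therefore has impurity 0, whereas a threshold of 18 minutes leaves a mixed left child.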

3.7.4 Class imbalance mitigation methods

This test explores which type of class imbalance mitigation technique performs best for this data set using the permutation invariant model. The three techniques tested are:

• Undersampling. Remove half of the majority class (non-selected vehicles).

• Oversampling. Duplicate the single minority class (selected vehicle) by half the number of the majority class (non-selected vehicles).

• Weighted Cost function. Set the weights for the minority class to 0.6 and the weights for the majority class to 1.0. The chosen values performed best in a coarse manual search.
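The three techniques can be sketched on a single delivery with 40 non-selected vehicles. The sketch interprets oversampling as repeating the selected vehicle until it appears half as often as the majority class, which is one possible reading of the description above; the vehicle ids are hypothetical.

```python
import random

def undersample(selected, others, rng):
    """Keep the selected vehicle and a random half of the non-selected vehicles."""
    kept = rng.sample(others, len(others) // 2)
    return [selected] + kept

def oversample(selected, others):
    """Repeat the selected vehicle so it appears half as often as the majority class."""
    copies = max(1, len(others) // 2)
    return [selected] * copies + list(others)

def sample_weights(labels, minority_weight=0.6, majority_weight=1.0):
    """Per-sample loss weights using the values given in the list above."""
    return [minority_weight if y == 1 else majority_weight for y in labels]

selected = ("v0", 1)                             # (vehicle id, label)
others = [("v%d" % i, 0) for i in range(1, 41)]  # 40 non-selected vehicles

under = undersample(selected, others, random.Random(0))
over = oversample(selected, others)
weights = sample_weights([label for _, label in [selected] + others])
```

Undersampling changes the class ratio from 1:40 to 1:20, oversampling changes it to 20:40, and the weighted cost function leaves the data unchanged and only rescales each sample's loss.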


3.7.5 Permutation invariant network with and without the free text

The free text (LText) contains short pieces of information which are important to the driver. An additional model will be trained without the text information, while using the same hyperparameters. The hyperparameters were the ones giving the best performance in the hyperparameter optimization. The performance difference can then be analyzed to examine if the text information improves the performance of the network.

3.7.6 Comparing decisions made in the morning and afternoon

The idea is to investigate if the dispatchers’ decisions differ depending on when the decision was made. For this test two models were trained on orders dispatched during the morning and afternoon respectively. The models will then be evaluated on three test sets: morning, afternoon, and all. If there is a performance gap between the models, it could indicate that the decisions made in the morning and afternoon are different.