STOCKHOLM, SWEDEN 2018
Comparison of machine learning algorithms for real-time vehicle selection in transport management
DAVID FAGERLUND
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Abstract
This thesis compares algorithms for a dynamic pickup and delivery problem in which new orders arrive throughout the day. The dispatcher's job is to assign incoming orders to a fleet of vehicles. Historical data is used to train the algorithms, where the objective is to select the same vehicle as the human dispatchers based on information about the delivery and the vehicles. The idea is to learn latent variables, which are common in the real world but difficult to incorporate in route optimization. The data set is compiled from deliveries made by a courier company located in Stockholm, Sweden.
The studied algorithms are logistic regression, support vector machine, decision tree, feedforward neural network, and permutation invariant neural network. An additional data set based on the same deliveries adds all current orders for each active vehicle and is used by the permutation invariant neural network. The results show that feedforward neural networks and decision trees performed best in terms of top-1 accuracy and top-3 accuracy, respectively. The best performing technique for class imbalance mitigation was oversampling (duplicating samples from the minority class), which outperformed undersampling (removing samples from the majority class) and a weighted cost function (additional cost when misclassifying the minority class).
Summary
This report compares algorithms for a dynamic pickup and delivery problem, with new orders arriving throughout the day. The dispatcher's job is to assign incoming orders to a fleet of vehicles. Historical data is used to train the algorithms, where the goal is to select the same vehicle as the human dispatchers based on information about the order and the vehicles. The idea is to learn latent variables that are common in daily operations but difficult to use in route optimization. The data has been compiled from deliveries from a courier company in Stockholm, Sweden.
The algorithms studied are logistic regression, support vector machine, decision tree, artificial neural network, and permutation invariant neural network. An additional data set based on the same deliveries is used to add all current orders for each active vehicle, which is used in the permutation invariant neural network. The results show that artificial neural networks and decision trees performed best based on top-1 accuracy and top-3 accuracy, respectively. The best method for reducing class imbalance was oversampling (duplicating samples from the minority class), which performed better than undersampling (removing samples from the majority class) and a weighted cost function (extra cost when misclassifying the minority class).
1 Introduction
1.1 Background
1.2 Problem Statement
1.3 Contribution
2 Background
2.1 Vehicle Routing Problem
2.2 Multinomial Logistic Regression
2.3 Support Vector Machines
2.3.1 Non-linear classifier
2.4 Decision Trees
2.4.1 Pruning
2.4.2 Selection measures
2.5 Artificial Neural Networks
2.5.1 Feedforward networks
2.5.2 Optimization algorithms
2.5.3 Batch Normalization
2.5.4 Dropout
2.5.5 Word Embedding
2.5.6 Recurrent Neural Networks
2.6 Permutation Equivariance and Invariance
2.7 Learning dispatch decisions
2.8 Related works
2.8.1 Automated dispatching
2.8.2 Permutation Equivariance and Invariance
2.8.3 Differences to previous research
3 Method
3.1 Data
3.1.1 Features
3.1.2 Target
3.1.3 Description of Data sets
3.1.4 Training and test set
3.2 Preprocessing
3.3 Problem formulation
3.4 Neural networks
3.4.1 Sample definition and Loss function
3.4.2 Feedforward Architecture
3.4.3 Permutation invariant Architecture
3.4.4 Class Imbalance
3.5 Model implementation
3.5.1 Decision tree and Logistic regression
3.5.2 SVM
3.5.3 Neural networks
3.6 Hyperparameter Optimization
3.7 Evaluation
3.7.1 Metrics
3.7.2 Sum digits
3.7.3 Interpretation of learned rules from decision tree
3.7.4 Class imbalance mitigation methods
3.7.5 Permutation invariant network with and without the free text
3.7.6 Comparing decisions made in the morning and afternoon
3.7.7 Time to train and run inference
4 Results
4.1 Hyperparameter optimization
4.2 Sum digits
4.3 Interpretation of learned rules from decision tree
4.4 Class imbalance mitigation methods
4.5 Permutation invariant network with and without the free text
4.6 Comparing decisions made in the morning and afternoon
4.7 Time to train and run inference
4.8 Final Comparison
5 Discussion
5.1 Analysis
5.2 Comparison with previous work
5.3 Practical Application
5.4 Sustainability, Societal and Ethical Issues
5.5 Conclusions
5.6 Future work
Bibliography
Introduction
This chapter introduces the background to the thesis. Section 1.1 describes the problem domain and motivates why the problem is interesting. Section 1.2 states the problem and outlines the goals of the thesis. The last section (1.3) states the contributions made in this thesis.
1.1 Background
The transportation sector is one of the world's largest industries, and worldwide transportation revenue was estimated at $4.7 trillion in 2016 [31]. The transportation industry is currently undergoing a digital transformation with access to more data through telematics, and better tools for analytics [42].
In dynamic vehicle routing and dispatching problems, there are a number of vehicles which are assigned requests/deliveries. The dynamic part means that new deliveries arrive over time, i.e., we deal with incomplete information. Dispatching problems can be divided into the following sub-problems:
• Courier services pick up packages from customers and bring them to a depot for further shipping.
• Dial-a-Ride is a door-to-door transportation service for the elderly and people with disabilities.
• Emergency services try to place ambulances in such a way that they can reach a patient quickly.
This thesis is concerned with vehicle dispatching in courier services, which is often called the dynamic and stochastic Pickup and Delivery Problem (PDP). Dynamic means that we have incomplete information, i.e., not all deliveries are known in advance, and stochastic means that one or more variables are random, i.e., the locations of pickup and delivery. This means that new packages arrive throughout the day with a local pickup and delivery location.
Each vehicle is assigned multiple deliveries. Each delivery has a pickup and delivery address, and time windows for pickup and delivery. Currently the order assignment is made by humans. Automating the assignment is crucial to allow for 1-hour deliveries and to handle the dynamic workload with new orders coming in throughout the day [5]. It can also enable more efficient re-planning in case of disruptions [1]. Automation can also enable faster pickups for companies in the e-commerce industry when a customer returns a product, which allows the companies to resell the products faster [5]. Using an automated system for dispatching limousines resulted in a 100% increase in weekly orders with the same number of planners/dispatchers [11]. This enables the companies to grow their business with reduced costs.
1.2 Problem Statement
Today companies still rely on human operators for dispatching incoming deliveries to vehicles. The fleet dispatchers use their experience to decide which vehicle is most suitable for a new delivery based on vehicle location, current vehicle load, package type, etc. Acquiring experience takes a long time, and due to stress dispatchers' careers are often short, which leads to a low supply of good dispatchers [54]. Automated decision making could provide dispatchers with vehicle suggestions to ease their workload, and in the future potentially replace human dispatchers altogether.
One common approach to vehicle dispatching is route optimization, which tries to minimize or maximize an objective function subject to a number of constraints. The objective function could be total distance traveled, total delivery time, number of vehicles used, total number of deliveries served, or a weighted combination of them. An alternative approach is to assume a fuzzy objective and learn the human dispatchers' decisions.
This thesis will focus on the latter approach and learn the dispatchers' decisions. Some advantages of this approach are that the model can learn complex latent variables, e.g., that certain drivers are more suitable for carrying heavy packages, and that it is less computationally demanding during inference since no route optimization is performed. If there are too many incoming deliveries to serve manually, the algorithm could give vehicle suggestions to the human dispatchers, enabling the operator to focus on more complex deliveries. The solution can also be used as a starting point for route optimization, which could reduce the time to find a quality solution. Improved dispatching decisions could be used to reduce fuel consumption, emissions, or cost, depending on which objective function is used.
Given information about a new order and the active vehicles, we want to predict the most suitable vehicle for the given order. The aim of this study is to train a machine learning model to select a vehicle for an incoming order. The training is based on historical dispatch decisions from human operators. The decisions made by the model are compared with the decisions made by the human operators. The first research question is how well machine learning can learn human dispatcher decisions, and the second question is which algorithms are more suitable for making those decisions.
The models compared in this thesis are Neural Networks, Support Vector Machines (SVM), Logistic Regression, and Decision Trees. The goal of this thesis is to compare different machine learning methods and evaluate their performance on vehicle selections.
1.3 Contribution
The data sets used in previous research [45, 10, 54] are often small, which prohibits larger models. The largest previous data set was used by Chen et al. [10], with 25 days of decisions for training and 10 days for testing. The data set used in this thesis has two months of data for training and testing. The size of the data set was constrained by the computational complexity of the permutation invariant model (see Section 3.1). In addition, this thesis uses a more complex environment with free text containing additional information such as time windows, package dimensions, weight, and type of package.
This thesis utilizes a permutation invariant network to incorporate information about the other active vehicles.
Previous studies [54, 45, 10, 61] have not used detailed information about the vehicles' existing load, such as package type, pickup location, delivery location, and time windows. The report will also compare different machine learning methods and evaluate their performance on vehicle selection.
Background
This chapter describes the relevant theoretical frameworks for this thesis. Section 2.1 introduces the Vehicle Routing Problem and its variants. Sections 2.2 to 2.5 describe the different classifiers used in this thesis. Section 2.6 describes the concepts of permutation equivariance and invariance and their use in Artificial Neural Networks. The problem of defining optimality is presented in Section 2.7. Section 2.8 describes related work.
2.1 Vehicle Routing Problem
The Vehicle Routing Problem (VRP) was introduced by Dantzig and Ramser [12] in 1959. It is a generalized Traveling Salesman Problem (TSP) [25]. The goal of TSP is to find the shortest route between a number of nodes, e.g., cities, that ends in the start node and where all nodes are visited. Dantzig and Ramser [12] used linear programming to find a near optimal solution for the "Truck Dispatching Problem" (TDP). Laporte et al. describe VRP as:
The vehicle routing problem (VRP) consists of designing least cost delivery routes through a set of geographically scattered customers, subject to a number of side constraints [25, p.1].
In addition, VRP has one or more depots with a given number of vehicles that serve the customers' demand.
There are multiple variations on VRP with a few of them being:
• VRP with Pickup and Delivery (VRPPD), which has packages that are picked up and delivered between certain locations
• VRP with Time Windows (VRPTW), which has time constraints for each customer
• Capacitated VRP (CVRP), which has constraints on the vehicles' load capacity
• Open VRP (OVRP), where there is no depot
The problem in this thesis is an Open Pickup and Delivery Problem with Time Windows. Two approaches that have been used to solve this problem are optimization algorithms [43, 17, 37, 35] and learning the decisions from human dispatchers [45, 54, 10]. This thesis will use the latter approach.
2.2 Multinomial Logistic Regression
Multinomial logistic regression is a simple parametric linear algorithm that is easy to interpret. It has been used in many different areas, including stock performance [60], type of pregnancy [53], and anomaly intrusion detection [63]. Logistic regression is used for a binary dependent variable, and multinomial logistic regression is a generalization that is used for multiclass problems (with more than two classes). If we let $K$ denote the number of classes, the idea of multinomial logistic regression is to pick a reference class and compute the probabilities with $K - 1$ binary logistic regression models. We have input features $\vec{x} = [1, x_1, \ldots, x_n]$ and parameters $\vec{w} = [w_0, w_1, \ldots, w_n]$, with $n$ being the number of input features. We set $x_0 = 1$ and insert $w_0$ to account for the bias term, i.e., $w_0$ is the bias term. If we pick $K$ as the reference class, we get:
\[ \log\left(\frac{P(y = i \mid \vec{x})}{P(y = K \mid \vec{x})}\right) = \vec{w}_i \cdot \vec{x}, \quad \forall i \in \{1, \ldots, K - 1\} \]
And the probabilities:
\[ P(y = i \mid \vec{x}) = P(y = K \mid \vec{x}) \cdot e^{\vec{w}_i \cdot \vec{x}}, \quad \forall i \in \{1, \ldots, K - 1\} \]
\[ P(y = K \mid \vec{x}) = 1 - \sum_{k=1}^{K-1} P(y = K \mid \vec{x}) \cdot e^{\vec{w}_k \cdot \vec{x}} \]
Solving the last equation for $P(y = K \mid \vec{x})$ gives $P(y = K \mid \vec{x}) = 1 / \left(1 + \sum_{k=1}^{K-1} e^{\vec{w}_k \cdot \vec{x}}\right)$.
2.3 Support Vector Machines
Support Vector Machines (SVM) are models that can be used for classification and, using the same principles, for regression. SVMs have been used for many different tasks such as image recognition [28], image segmentation [62], and text classification [18]. The simplest type of SVM is the linear SVM. In Figure 2.1 there are two linearly separable classes of points. In general, there could be many different possible linear separations. The idea behind SVM is to place the decision boundary so that the distance between the decision boundary and the nearest points is maximized, i.e., maximizing the margin around the decision boundary.
We assume that the data set consists of points $(\vec{x}_i, y_i)$ with $y_i \in \{-1, 1\}$. We want to pick the two hyperplanes that separate the points with the largest distance between them (dotted lines in Figure 2.1). To allow non-linearly separable data we introduce slack variables $\xi_i$. A positive $\xi_i$ enables $\vec{x}_i$ to violate the margin at a cost proportional to $\xi_i$. If $C$ is large then the cost will be high for points not respecting the margin, and the further a point is from the margin (larger $\xi_i$) the larger the increase in cost will be. However, for a smaller $C$ the margin can be larger, with the trade-off being a higher training error.
This can be modeled as the following optimization problem:
\[ \begin{aligned} \underset{\vec{w},\, b,\, \xi}{\text{minimize}} \quad & \|\vec{w}\| + C \sum_{i=1}^{N} \xi_i \\ \text{subject to} \quad & y_i(\vec{w} \cdot \vec{x}_i - b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, N. \end{aligned} \tag{2.1} \]
where $\vec{w}$ is the normal vector to the separating hyperplane, $b$ is the bias variable for the hyperplane, and $N$ is the number of points. $C$ is a regularization term where large values increase the penalty for points on the wrong side of the margin.
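As a rough illustration of problem (2.1), the sketch below evaluates the soft-margin objective for a fixed hyperplane instead of solving the optimization problem. It assumes that each slack variable takes the smallest value allowed by the constraints, i.e., $\xi_i = \max(0, 1 - y_i(\vec{w} \cdot \vec{x}_i - b))$. The function name `soft_margin_objective` and the toy points are hypothetical and are not taken from the thesis implementation.

```python
import math

def soft_margin_objective(w, b, points, labels, C):
    """Evaluate ||w|| + C * sum(xi_i) for a fixed hyperplane (w, b).

    Each slack xi_i is the smallest value that satisfies
    y_i * (w . x_i - b) >= 1 - xi_i and xi_i >= 0.
    """
    norm_w = math.sqrt(sum(wj * wj for wj in w))
    slacks = []
    for x, y in zip(points, labels):
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) - b)
        slacks.append(max(0.0, 1.0 - margin))
    return norm_w + C * sum(slacks), slacks

# Hypothetical toy data: two points per class, one of each inside the margin.
points = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (-0.5, 0.0)]
labels = [1, 1, -1, -1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, points, labels, C=10.0)
```

With a large `C` the two points inside the margin dominate the objective, which is the behavior described above: violations become expensive relative to the size of the margin.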
2.3.1 Non-linear classifier
Figure 2.1: Maximum margin hyperplane and support vectors for an SVM classifier with two classes
Boser, Guyon, and Vapnik [7] introduced a way to create non-linear classifiers with the kernel trick. The idea is to map the input space into a higher dimensional feature space to account for non-linearly separable data. The kernel trick [7] avoids computing the actual transformation, since only inner products in the feature space are needed. The kernels used in this thesis are:
• Linear: $k(x, x') = x^T x' + c$
• Polynomial: $k(x, x') = (\gamma x^T x' + c)^d$, $d = 3$
• Radial Basis Function (RBF): $k(x, x') = e^{-\gamma \|x - x'\|^2}$
• Sigmoid: $k(x, x') = \tanh(\gamma x^T x' + c)$
where $\gamma = \frac{1}{\text{num\_features}}$ and $c = 0.0$.
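The four kernels listed above can be written out directly. The following is a minimal sketch in plain Python; the function names and the example vectors are assumptions made for illustration, and the defaults follow the values stated in the text ($d = 3$, constant term $0.0$, $\gamma = 1 / \text{num\_features}$).

```python
import math

def dot(x, z):
    """Inner product x^T z."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z, c=0.0):
    return dot(x, z) + c

def polynomial_kernel(x, z, gamma, c=0.0, d=3):
    return (gamma * dot(x, z) + c) ** d

def rbf_kernel(x, z, gamma):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, z, gamma, c=0.0):
    return math.tanh(gamma * dot(x, z) + c)

# Hypothetical example vectors with two features.
x, z = [1.0, 2.0], [3.0, 4.0]
gamma = 1.0 / len(x)  # gamma = 1 / num_features
values = {
    "linear": linear_kernel(x, z),
    "polynomial": polynomial_kernel(x, z, gamma),
    "rbf": rbf_kernel(x, z, gamma),
    "sigmoid": sigmoid_kernel(x, z, gamma),
}
```

Each function returns the similarity that the SVM would use in place of an explicit inner product in the transformed feature space.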
2.4 Decision Trees
Decision trees can be used for both classification and regression tasks and have been used in previous research on pickup and delivery dispatching [10], job scheduling [30], and container transportation [50]. Decision trees have the advantage of being a white box model, which is easy for a human to interpret.
A decision tree can be modeled as an acyclic directed graph with two types of nodes. The internal nodes are decision nodes and the leaf nodes are the target nodes. A decision tree partitions the data recursively into smaller subsets based on a test at each internal node, and the subsets get more homogeneous with each test. Classification trees have categorical values as targets, while regression trees have a continuous variable as target. Each node in the tree except the root has a single parent, and all nodes except the leaf nodes have two or more children. The data is partitioned until all subsets are homogeneous or a stop condition is met. A large tree is more difficult to understand and is more likely to overfit. Figure 2.2 shows an example of a classification tree with 4 features ($x_1, \ldots, x_4$) and 4 classes (A, B, C, D).
Figure 2.2: A decision tree with input features $x_1, \ldots, x_4$, threshold values a, b, c, d, e, and outcomes A, B, C, D
2.4.1 Pruning
Mingers [34] found that reducing the tree size, called pruning, improved the accuracy by 20-25%. Reduced Error Pruning (REP) is one pruning method that was shown by Esposito et al. [14] to perform well both in terms of accuracy and tree size. REP was introduced by Quinlan [47] and starts with a complete tree. For each internal node, a validation set is used to compare the classification error rate when the subtree is kept with the error rate when the subtree is turned into a leaf node. If the error rate for the pruned subtree is the same or lower, then the node is pruned and turned into a leaf node. This procedure is repeated until no more pruning can be performed without a higher error rate.
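The REP procedure can be sketched on a small hand-built tree. The representation below is an assumption made for this example: internal nodes are dictionaries with a `feature`, a `threshold`, two children, and a `majority` label (the most common training class at that node), while leaves only hold a `label`. The sketch prunes bottom-up and is not the implementation used in the thesis.

```python
def predict(node, x):
    """Route a sample down the tree until a leaf is reached."""
    while "label" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

def errors(node, samples):
    """Number of validation samples the (sub)tree misclassifies."""
    return sum(1 for x, y in samples if predict(node, x) != y)

def reduced_error_prune(node, samples):
    """Prune bottom-up; `samples` are the validation samples reaching `node`."""
    if "label" in node:
        return node
    left = [(x, y) for x, y in samples if x[node["feature"]] <= node["threshold"]]
    right = [(x, y) for x, y in samples if x[node["feature"]] > node["threshold"]]
    node["left"] = reduced_error_prune(node["left"], left)
    node["right"] = reduced_error_prune(node["right"], right)
    leaf = {"label": node["majority"]}
    # Replace the subtree when the leaf is at least as accurate on validation data.
    if errors(leaf, samples) <= errors(node, samples):
        return leaf
    return node

# Hypothetical tree: the split on feature 1 hurts validation accuracy.
tree = {
    "feature": 0, "threshold": 0.5, "majority": "A",
    "left": {"label": "A"},
    "right": {
        "feature": 1, "threshold": 0.5, "majority": "B",
        "left": {"label": "B"}, "right": {"label": "A"},
    },
}
validation = [([0.0, 0.0], "A"), ([0.2, 1.0], "A"), ([1.0, 0.0], "B"), ([1.0, 1.0], "B")]
pruned = reduced_error_prune(tree, validation)
```

In this example the inner split is replaced by a leaf predicting B, while the root split is kept because turning it into a leaf would increase the validation error.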
2.4.2 Selection measures
There are several methods to decide which attribute to split on. Information gain [48] measures the change in entropy given knowledge about the attribute, and is defined as
\[ IG(T, a) = H(T) - H(T \mid a) \]
\[ H(T) = -\sum_{i} p(x_i) \log(p(x_i)) \]
\[ H(T \mid a) = -\sum_{i,j} p(x_i, y_j) \log\left(\frac{p(x_i, y_j)}{p(y_j)}\right) \]
where $x_i$ ranges over the class labels in $T$ and $y_j$ over the values of attribute $a$.
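Information gain can be computed from label counts. The sketch below assumes a single categorical attribute and base-2 logarithms (the text does not fix the base), and uses the identity $H(T \mid a) = \sum_j p(y_j) H(T \mid a = y_j)$, which is equivalent to the definition above. The function names and the example labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """H(T) = -sum_i p(x_i) * log2 p(x_i) over the class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(T, a) = H(T) - H(T|a) for one categorical attribute a."""
    total = len(labels)
    conditional = 0.0
    for value in set(attribute_values):
        subset = [y for y, a in zip(labels, attribute_values) if a == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional

# Hypothetical example: which vehicle type was chosen for four deliveries.
chosen = ["van", "van", "bike", "bike"]
package_size = ["large", "large", "small", "small"]   # separates the classes perfectly
weekday = ["mon", "tue", "mon", "tue"]                # carries no information
gain_size = information_gain(chosen, package_size)
gain_weekday = information_gain(chosen, weekday)
```

An attribute that separates the classes perfectly recovers the full entropy of the target, whereas an uninformative attribute has zero gain, so the first attribute would be preferred for the split.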
Another selection measure is the Gini index, which was used by Breiman et al. [8]. The Gini index
. . . measures the ’impurity’ of an attribute with respect to the classes. [34, p.7]
The measure is defined as
\[ I_G(p) = 1 - \sum_{i} p(x_i)^2 \]
2.5 Artificial Neural Networks
2.5.1 Feedforward networks
Artificial Neural Networks (ANN) consist of a number of interconnected nodes/neurons that each perform a non-linear transformation of a linear combination of inputs ($\vec{x}$) and weights ($\vec{w}$):
\[ o(\vec{x}, \vec{w}) = \sigma\left(\sum_{i=1}^{N} x_i w_i\right) = \sigma(\vec{w} \cdot \vec{x}) \]
where $o(\cdot)$ is the output value and $\sigma(\cdot)$ is the activation function.
A feedforward network is a type of ANN where the nodes do not form a cycle, i.e., the output of a node is not fed back to itself. The output from one layer is used as input to the next layer; see Figure 2.3 for an example of an ANN. Using the network in Figure 2.3, we denote the weights from the input layer to the hidden layer ($w^1_{ij}$) as $W_1$, and the weights from the hidden layer to the output layer ($w^2_{ij}$) as $W_2$. The biases are denoted by $b_1$ and $b_2$ for the hidden layer and output layer, respectively. The output for a network with a linear activation function on the output layer can then be written as:
\[ \hat{y}(\vec{x}, w) = W_2\, \sigma(W_1 \vec{x} + b_1) + b_2 \]
There are several activation functions, for example the sigmoid function [6] and the Rectified Linear Unit (ReLU) [38], which is the most popular activation function [27].
\[ \sigma_{\text{sigmoid}}(z) = \frac{1}{1 + e^{-z}} \]
\[ \sigma_{\text{ReLU}}(z) = \max(0, z) \]
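A forward pass through a network with the shape of Figure 2.3 (3 input features, 2 hidden units, one output) can be sketched as follows. The weights are hypothetical values chosen for illustration, and the sketch assumes a ReLU hidden layer and, as stated in the text, a linear output layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def dense(weights, bias, inputs):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]

def forward(x, W1, b1, W2, b2, activation=relu):
    """Hidden layer with a non-linear activation, followed by a linear output layer."""
    hidden = [activation(a) for a in dense(W1, b1, x)]
    return dense(W2, b2, hidden)

# Hypothetical weights: 3 inputs -> 2 hidden units -> 1 output.
W1 = [[0.5, -1.0, 0.0], [1.0, 1.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [[2.0, 1.0]]
b2 = [0.5]
y_hat = forward([1.0, 2.0, 3.0], W1, b1, W2, b2)
```

The first hidden unit receives a negative pre-activation and is zeroed by ReLU, so only the second hidden unit contributes to the output. Passing `activation=sigmoid` swaps in the sigmoid function instead.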
Backpropagation is typically used to train the weights by computing the gradient of the error or loss function $J(\theta)$. The loss function measures the difference between the network's output and the expected output. Backpropagation was discovered independently by several researchers during the 1970s and 1980s [64, 26, 40, 52]. Mean Squared Error (MSE) is typically used as a loss function for regression tasks, and is defined as:
\[ J(\theta) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \]
Cross-entropy loss is common for classification tasks, and is defined as:
\[ J(\theta) = -\sum_{i=1}^{N} p(\vec{x}_i) \log(q(\vec{x}_i)) \]
where $p(\cdot)$ is the true distribution, $q(\cdot)$ is the model distribution, and $N$ is the number of samples.
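Both loss functions are short to write down in code. In the sketch below the cross-entropy is expanded over the classes of each sample, which is an assumption about how $p(\vec{x}_i)$ and $q(\vec{x}_i)$ are represented: each is taken to be a list of class probabilities, with a one-hot list for a hard label. The function names and numbers are hypothetical.

```python
import math

def mean_squared_error(targets, predictions):
    """J = (1/N) * sum_i (y_i - y_hat_i)^2."""
    n = len(targets)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions)) / n

def cross_entropy(true_dists, model_dists):
    """J = -sum over samples and classes of p * log(q).

    Terms with p = 0 are skipped, since they contribute nothing to the sum.
    """
    total = 0.0
    for p, q in zip(true_dists, model_dists):
        total -= sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0.0)
    return total

mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
# One sample whose true class is the first of three candidate vehicles.
ce = cross_entropy([[1.0, 0.0, 0.0]], [[0.5, 0.25, 0.25]])
```

For a one-hot target the cross-entropy reduces to the negative log-probability that the model assigns to the correct class.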
2.5.2 Optimization algorithms
In order to update the weights an optimization algorithm is used. Popular optimization algorithms are typically based on Stochastic Gradient Descent (SGD), which was introduced by Robbins and Monro [51]. SGD has the following form:
\[ w := w - \eta \nabla J_i(w) \]
where $J_i(w)$ is the loss for a single sample and $\eta$ is the learning rate, which specifies how big the updates should be. Backpropagation is typically used to compute the gradient of the loss function. Variants of SGD that incorporate information about previous gradients are outlined below.
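Before turning to those variants, the plain SGD update can be sketched on a small least-squares problem. The loss $J_i(w) = (w \cdot x_i - y_i)^2$, the data, and the learning rate are assumptions chosen for this example, not settings from the thesis.

```python
def sgd_step(w, grad, eta):
    """w := w - eta * grad J_i(w) for a single sample."""
    return [wj - eta * gj for wj, gj in zip(w, grad)]

def squared_error_gradient(w, x, y):
    """Gradient of J_i(w) = (w . x - y)^2 with respect to w."""
    residual = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [2.0 * residual * xj for xj in x]

# Hypothetical noise-free data generated by y = 1 + 2 * x1 (x0 = 1 carries the bias).
samples = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w = [0.0, 0.0]
for epoch in range(500):
    for x, y in samples:
        w = sgd_step(w, squared_error_gradient(w, x, y), eta=0.05)
```

Because the weights are updated after every sample rather than after a full pass over the data, each step is cheap. On this noise-free data the weights approach the generating values of 1 for the bias and 2 for the slope.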
Figure 2.3: A fully connected neural network with 3 input features, 2 hidden units, and one output feature
AdaGrad was created by Duchi, Hazan, and Singer [13]. It uses an adaptive learning rate by computing separate learning rates for each feature based on the size of the previous updates. The gradient update with AdaGrad is computed as follows:
\[ g_{t,i} = \nabla_{w_i} J(w^{(t)}), \qquad G_{t,i} = \sum_{j=1}^{t} g_{j,i}^2 \]
\[ w_i^{(t+1)} = w_i^{(t)} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}}\, g_{t,i} \]
where $g_{t,i}$ is the gradient of the loss with respect to parameter $i$ at timestep $t$, $G_{t,i}$ is the sum of squared gradients for parameter $i$ up to timestep $t$, and $\epsilon$ is a small constant that prevents division by zero. Since the learning rate depends on the sum of previous squared gradients, the algorithm can be informally interpreted as giving a lower learning rate to frequent features and a larger learning rate to infrequent features.
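A single AdaGrad update can be sketched as below. The function name, the default values for `eta` and `eps`, and the example gradients are assumptions for illustration only.

```python
import math

def adagrad_step(w, grad, G, eta=0.1, eps=1e-8):
    """One AdaGrad update; G holds the running sum of squared gradients per parameter."""
    G = [Gi + gi * gi for Gi, gi in zip(G, grad)]
    w = [wi - eta / math.sqrt(Gi + eps) * gi for wi, gi, Gi in zip(w, grad, G)]
    return w, G

# Hypothetical gradients: the second parameter receives gradients ten times smaller.
w, G = [0.0, 0.0], [0.0, 0.0]
w, G = adagrad_step(w, [1.0, 0.1], G)
first_step = list(w)
w, G = adagrad_step(w, [1.0, 0.1], G)
```

After the first step both parameters have moved by almost the same amount, although their gradients differ by a factor of ten. This shows the per-parameter normalization. The second step is smaller than the first because the accumulated sum keeps growing.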
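RMSProp was proposed by Tieleman and Hinton [59]. It uses an exponentially decaying average of the squared gradients to normalize the learning rate.
\[ E[g^2]_{t,i} = \gamma E[g^2]_{t-1,i} + (1 - \gamma) g_{t,i}^2 \]
\[ w_i^{(t+1)} = w_i^{(t)} - \frac{\eta}{\sqrt{E[g^2]_{t,i} + \epsilon}}\, g_{t,i} \]
where $\gamma$ is the decay rate that weights the average of the previous squared gradients against the current squared gradient, and $\epsilon$ is a small constant that prevents division by zero.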
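The RMSProp update differs from AdaGrad only in how the squared gradients are accumulated. The sketch below uses hypothetical default values for `eta`, `gamma`, and `eps`; they are assumptions for this example rather than settings reported in the thesis.

```python
import math

def rmsprop_step(w, grad, avg_sq, eta=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp update; avg_sq is the decaying average of squared gradients."""
    avg_sq = [gamma * a + (1.0 - gamma) * g * g for a, g in zip(avg_sq, grad)]
    w = [wi - eta / math.sqrt(a + eps) * g for wi, g, a in zip(w, grad, avg_sq)]
    return w, avg_sq

# Hypothetical single-parameter example.
w, avg_sq = rmsprop_step([1.0], [2.0], [0.0])
```

Because old squared gradients decay instead of accumulating forever, the effective learning rate does not shrink towards zero as it can with AdaGrad.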
Adam was introduced by Kingma and Ba [24] and computes an exponentially decaying average of the squared gradients ($v_t$), like RMSProp, and an exponentially decaying average of the gradients ($m_t$). $m_t$ and $v_t$ are estimates of the first moment (mean) and second moment (uncentered variance) of the gradients. It is defined as:
\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{2.2} \]
\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{2.3} \]
\[ w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t \tag{2.4} \]
The variables $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates of the first and second moment. $\beta_1$ and $\beta_2$ are the hyperparameters for the decay rates, and $\epsilon$ is used to prevent division by zero. In the original paper [24] the authors propose $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$.
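Equations 2.2 to 2.4 translate almost line by line into code. In the sketch below the function name and the example gradient are hypothetical; the default decay rates and $\epsilon$ are the values proposed in the original paper, while the default learning rate is an assumption.

```python
import math

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at timestep t (t starts at 1)."""
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]        # (2.2)
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]    # (2.2)
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]                         # (2.3)
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]                         # (2.3)
    w = [wi - eta / (math.sqrt(vh) + eps) * mh                            # (2.4)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

# Hypothetical first step with a single parameter and gradient -6.
w, m, v = adam_step([0.0], [-6.0], [0.0], [0.0], t=1, eta=0.1)
```

On the first step the bias correction cancels the factors $(1 - \beta_1)$ and $(1 - \beta_2)$, so the parameter moves by approximately $\eta$ in the direction opposite to the gradient, regardless of the gradient's magnitude.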
2.5.3 Batch Normalization
Batch Normalization was introduced by Ioffe and Szegedy [21]. It is a method that is used to normalize the input for each layer. In Stochastic Gradient Descent (SGD) the input to one layer depends on the parameters in the previous layers. When the parameters change during training, the input distribution to the next layer changes as well. This is called covariate shift [55]. Covariate shift hinders the learning since the model has to adapt to a changing input distribution. Batch normalization reduces the dependence on scale, which allows for a higher learning rate [21], by normalizing each input in each layer as follows:
\[ \hat{x}_l^{(k)} = \frac{x_l^{(k)} - \mu(x_l^{(k)})}{\sqrt{\sigma^2(x_l^{(k)})}}, \qquad y_l^{(k)} = \gamma_l^{(k)} \hat{x}_l^{(k)} + \beta_l^{(k)} \]
where $x_l^{(k)}$ is the $k$th input for layer $l$, $\hat{x}_l^{(k)}$ is its normalized value, and the mean $\mu(\cdot)$ and variance $\sigma^2(\cdot)$ are estimated over a mini-batch. The parameters $\gamma$ and $\beta$ are used to scale and shift the normalized value, since normalization may change the input representation. Ioffe and Szegedy [21] applied Batch Normalization to a variant of the Inception Network [58] and achieved the same accuracy in less than half the number of training steps, $13.3 \cdot 10^6$ steps compared with $31.0 \cdot 10^6$ steps.
Figure 2.4: A dropout neural network. (a) Before dropout: a network with two layers. (b) After dropout: the same network with thinned nodes using dropout. The nodes marked with a cross are disabled.
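The batch normalization transform for one layer can be sketched as follows. The sketch normalizes every feature over a mini-batch and then applies the scale $\gamma$ and shift $\beta$. The small constant `eps` added to the variance is an assumption for numerical stability, and the function name and example batch are hypothetical.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature k over the mini-batch, then scale and shift."""
    n = len(batch)
    num_features = len(batch[0])
    means = [sum(row[k] for row in batch) / n for k in range(num_features)]
    variances = [
        sum((row[k] - means[k]) ** 2 for row in batch) / n for k in range(num_features)
    ]
    return [
        [
            gamma[k] * (row[k] - means[k]) / math.sqrt(variances[k] + eps) + beta[k]
            for k in range(num_features)
        ]
        for row in batch
    ]

# Hypothetical mini-batch with two samples; the second feature has a much larger scale.
batch = [[1.0, 100.0], [3.0, 300.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

Both features end up on the same scale after normalization, which is the reduced dependence on scale mentioned above.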
2.5.4 Dropout
Dropout is a technique introduced by Srivastava et al. [56] which randomly ignores nodes during training. The benefit is that it addresses overfitting by preventing the nodes from co-adapting and developing dependencies on each other. Dropout works by disabling random nodes and their incoming and outgoing connections with probability $p$. During testing all nodes are enabled. Disabling nodes during training allows the network to approximately combine many different networks efficiently. An example of a dropout network is shown in Figure 2.4.
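Dropout applied to the activations of one layer can be sketched as below. The sketch uses the "inverted dropout" variant, which is an assumption of this example: surviving activations are scaled by $1 / (1 - p)$ during training so that nothing needs to be rescaled at test time. The function name and the `rng` parameter are hypothetical.

```python
import random

def dropout(activations, p, training, rng=random):
    """Disable each node with probability p during training.

    Surviving activations are scaled by 1 / (1 - p) (inverted dropout, an
    assumption of this sketch). At test time all nodes are enabled.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the example is repeatable
train_out = dropout([1.0] * 1000, p=0.5, training=True, rng=rng)
test_out = dropout([1.0, 2.0, 3.0], p=0.5, training=False)
```

Each call during training samples a different thinned network, as in Figure 2.4, while the test-time call leaves the activations unchanged.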
2.5.5 Word Embedding
Word embedding is a group of techniques for representing words or phrases with vectors of real numbers. A language vocabulary is often very large, which makes the "curse of dimensionality" a significant problem. By mapping words into lower dimensional vectors it is possible to reduce the sparsity by capturing the words' properties in a lower dimensional continuous feature vector, and to represent words used in similar contexts with similar feature vectors. An early implementation of word embedding was developed by Bengio et al. [3] with one linear projection layer and one non-linear hidden layer. The model yielded better perplexity than the state of the art at the time (a smoothed trigram model).
Mikolov et al. [33] developed two new word embedding models. Both models have one input layer and one output layer. The first model is the Continuous Bag-of-Words model (CBOW), where the current word is predicted based on the surrounding words. The second model is the continuous skip-gram model, where the surrounding words are predicted based on the current word.
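The difference between the two models is easiest to see in the training examples they are given. The sketch below only builds those examples from a tokenized text; it does not train an embedding. The window size, the function names, and the example sentence are assumptions for illustration.

```python
def context_windows(tokens, window):
    """Yield (current word, surrounding words) for every position in the text."""
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield word, left + right

def cbow_examples(tokens, window=2):
    """CBOW: the surrounding words are the input, the current word is the target."""
    return [(context, word) for word, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the current word is the input, each surrounding word is a target."""
    return [(word, c) for word, context in context_windows(tokens, window) for c in context]

# Hypothetical free-text note attached to a delivery.
tokens = "fragile parcel pickup before noon".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
```

CBOW produces one example per position with several context words as input, while skip-gram produces one example per (word, neighbor) pair.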
2.5.6 Recurrent Neural Networks
A Recurrent Neural Network (RNN) is a type of ANN with a connection between the hidden layer and itself (see Figure 2.5). RNNs are typically used for sequential input where the inputs are not independent. An RNN allows the model to have a kind of memory by capturing information from previously calculated steps and using it to compute the new output.
The steps to compute the output for timestep $t$ are:
\[ a_t = W s_{t-1} + U \vec{x}_t + b \tag{2.5} \]
\[ s_t = \tanh(a_t) \tag{2.6} \]
\[ o_t = V s_t + c \tag{2.7} \]
$W$ is the weights for the previous state ($s_{t-1}$), $U$ is the weights for the input features ($\vec{x}_t$), and $V$ is the weights from the state to the output. $b$ and $c$ are the bias terms. In practice, RNNs have difficulty capturing longer dependencies, with the gradients either "blowing up" or "vanishing" [20].
Figure 2.5: A folded and unfolded Recurrent Neural Network. The network has a number of hidden units as state ($s$) and uses information from previous timesteps to compute the next state.
Hochreiter and Schmidhuber [19] introduced a new kind of RNN called Long Short-Term Memory (LSTM) that is able to capture longer term dependencies. LSTM uses memory cells to store information from previous timesteps. The computation steps are as follows [19]:
\[ i_t = \sigma(W_i \vec{x}_t + U_i s_{t-1} + b_i) \tag{2.8} \]
\[ f_t = \sigma(W_f \vec{x}_t + U_f s_{t-1} + b_f) \tag{2.9} \]
\[ o_t = \sigma(W_o \vec{x}_t + U_o s_{t-1} + b_o) \tag{2.10} \]
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c \vec{x}_t + U_c s_{t-1} + b_c) \tag{2.11} \]
\[ s_t = o_t \odot \tanh(c_t) \tag{2.12} \]
$W_q$ and $U_q$ are the weights for the input features ($\vec{x}_t$) and recurrent connections ($s_{t-1}$) respectively, with $q$ being either the input gate ($i$), output gate ($o$), forget gate ($f$), or memory cell ($c$). $\sigma(\cdot)$ is the sigmoid function and $\odot$ denotes element-wise multiplication.
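One timestep of the simple RNN (equations 2.5 to 2.7) and of the LSTM (equations 2.8 to 2.12) can be sketched in plain Python. The helper functions, the layout of `params`, and the all-zero example weights are assumptions made for this illustration.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vectors):
    """Element-wise sum of any number of equally long vectors."""
    return [sum(values) for values in zip(*vectors)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_step(x_t, s_prev, W, U, V, b, c):
    """Equations 2.5 to 2.7: returns the new state s_t and the output o_t."""
    a_t = add(matvec(W, s_prev), matvec(U, x_t), b)
    s_t = [math.tanh(a) for a in a_t]
    o_t = add(matvec(V, s_t), c)
    return s_t, o_t

def lstm_step(x_t, s_prev, c_prev, params):
    """Equations 2.8 to 2.12; params[q] = (W_q, U_q, b_q) for q in 'i', 'f', 'o', 'c'."""
    def pre_activation(q):
        W_q, U_q, b_q = params[q]
        return add(matvec(W_q, x_t), matvec(U_q, s_prev), b_q)

    i_t = [sigmoid(a) for a in pre_activation("i")]
    f_t = [sigmoid(a) for a in pre_activation("f")]
    o_t = [sigmoid(a) for a in pre_activation("o")]
    candidate = [math.tanh(a) for a in pre_activation("c")]
    c_t = [f * c_old + i * g for f, c_old, i, g in zip(f_t, c_prev, i_t, candidate)]
    s_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return s_t, c_t

# Hypothetical sizes: 1 input feature, 2 hidden units, all weights zero.
zeros_hh = [[0.0, 0.0], [0.0, 0.0]]
zeros_hx = [[0.0], [0.0]]
zero_b = [0.0, 0.0]
s_t, o_t = rnn_step([1.0], [0.0, 0.0], zeros_hh, zeros_hx, [[0.0, 0.0]], zero_b, [0.5])
params = {q: (zeros_hx, zeros_hh, zero_b) for q in "ifoc"}
lstm_s, lstm_c = lstm_step([1.0], [0.0, 0.0], [2.0, -2.0], params)
```

With all weights set to zero every gate equals 0.5, so the memory cell keeps half of its previous value. The example shows how the forget gate controls what is carried over between timesteps.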
2.6 Permutation Equivariance and Invariance
Traditional methods require the input to be of fixed size. Permutation equivariance and invariance are important when dealing with sets, where the input lacks a specific order and has a variable size. They have been used for outlier detection [65], the Multiple Traveling Salesman Problem [22], and 3D point clouds [46]. We denote by $S_N$ the symmetric group on $N$ elements, i.e., the group of all bijections of $\{1, \ldots, N\}$ to itself. A function $f$ is permutation equivariant iff
\[ f(\pi x) = \pi f(x), \quad \forall \pi \in S_N \]
A function $f$ is permutation invariant iff
\[ f(\pi x) = f(x), \quad \forall \pi \in S_N \]
Zaheer et al. [65] designed a deep network architecture for both the equivariant and invariant models. They state that a function $f$ is invariant iff it decomposes to $f(X) = \rho\left(\sum_{x \in X} \phi(x)\right)$ [65], for the set $X$ and the transformations $\phi$ and $\rho$. The invariant model uses one or several fully connected layers to represent the transformations $\phi$ and $\rho$. The model sums all $\phi(x)$ and applies $\rho$ to the sum.
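The sum decomposition can be checked numerically with a toy model. In the sketch below, `phi` and `rho` are fixed hand-written functions standing in for the fully connected layers, which is an assumption made to keep the example short; the set elements are hypothetical two-feature descriptions of vehicles.

```python
import itertools

def phi(x):
    """Per-element transformation (a fixed toy feature map in this sketch)."""
    return [x[0] + x[1], x[0] * x[1]]

def rho(pooled):
    """Transformation applied to the pooled representation."""
    return 2.0 * pooled[0] - pooled[1]

def invariant_model(elements):
    """f(X) = rho(sum over x in X of phi(x))."""
    pooled = [0.0, 0.0]
    for x in elements:
        pooled = [p + f for p, f in zip(pooled, phi(x))]
    return rho(pooled)

# Hypothetical set of three vehicles, each described by two features.
vehicles = [(1.0, 2.0), (3.0, 0.5), (0.0, 4.0)]
outputs = {invariant_model(list(perm)) for perm in itertools.permutations(vehicles)}
smaller_set = invariant_model(vehicles[:2])
```

All six orderings of the set give the same output. The same model also accepts a set of a different size, which is the property needed when the number of active vehicles varies.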
A function $f_\Theta(\vec{x}) = \sigma(\Theta \vec{x})$, $f_\Theta : \mathbb{R}^N \to \mathbb{R}^N$, with weights $\Theta \in \mathbb{R}^{N \times N}$ is equivariant iff [49]
\[ \Theta = \lambda I + \gamma (\mathbf{1}\mathbf{1}^T), \quad \gamma, \lambda \in \mathbb{R} \tag{2.13} \]
One variation on equation 2.13 which was shown to perform better in some applications [49] is:
\[ f(\vec{x}) = \sigma(\lambda I \vec{x} - \gamma (\max \vec{x}) \mathbf{1}) \]
A generalization of this equation to higher dimensions ($K$ features for each object in the set) with a factoring to reduce the number of parameters is:
\[ f(\vec{x}) = \sigma(\beta + (\vec{x} - \mathbf{1}(\max \vec{x}))\Gamma), \quad \Gamma \in \mathbb{R}^{K \times K} \]
This is a fully connected layer with max-normalization within the set. A permutation equivariant function can be transformed into a permutation invariant function by applying a pooling operation over the set dimension [49]. The pooling operation should be commutative, such as maximization and summation [49].
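The max-normalized layer and the pooling step can be sketched as follows. The sketch assumes that the maximum is taken per feature over the set and that $\sigma$ is a ReLU; both are choices made for this example, and the weights and the input set are hypothetical.

```python
def equivariant_layer(X, Gamma, beta, activation=lambda z: max(0.0, z)):
    """f(X) = sigma(beta + (X - 1 max(X)) Gamma) for a set X of N rows with K features."""
    K = len(Gamma)
    col_max = [max(row[k] for row in X) for k in range(K)]  # max over the set dimension
    out = []
    for row in X:
        centered = [row[k] - col_max[k] for k in range(K)]
        out.append([
            activation(beta[j] + sum(centered[k] * Gamma[k][j] for k in range(K)))
            for j in range(K)
        ])
    return out

def max_pool(rows):
    """Pool over the set dimension; max is commutative, so the result is invariant."""
    return [max(column) for column in zip(*rows)]

# Hypothetical set of three objects with K = 2 features each.
X = [[1.0, 2.0], [3.0, 0.0], [2.0, 2.0]]
Gamma = [[1.0, 0.0], [0.5, -1.0]]
beta = [4.0, 0.0]
out = equivariant_layer(X, Gamma, beta)
permuted_out = equivariant_layer([X[2], X[0], X[1]], Gamma, beta)
```

Permuting the input rows permutes the output rows in the same way (equivariance), and pooling over the rows removes the dependence on order altogether (invariance).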
2.7 Learning dispatch decisions
When learning from human dispatchers' previous decisions, the assumption is that the decisions approximate the optimal solution, i.e., that the decisions are good. This assumption may not hold all the time, and the decision quality will vary. According to one dispatcher who was interviewed, the decisions are worse when there are many incoming orders, which is typical during the afternoon. This makes it harder for the algorithms to learn to make good decisions, since not all decisions in the training data are good. One improvement is to use route optimization to generate optimal solutions and then feed those decisions into the algorithms to improve the learning; this was done by Chen et al. [10].
2.8 Related works
2.8.1 Automated dispatching
Potvin, Shen, and Rousseau [45] used a neural network to estimate a vehicle's quality for a specific delivery and then picked the vehicle with the highest quality score. To incorporate the information about the other vehicles, they translated the inputs with respect to the best attribute value over all vehicles. The routing expert who was used to create the data set was not a professional dispatcher, and the deliveries were simulated. The results were, however, promising, with the model selecting the same vehicle as the human dispatcher in 89% of the test samples.
Shen et al. [54] developed a program to assist dispatchers in an express mail company without capacity constraints. For each delivery they estimated the travel time by minimizing the cost of inserting the pickup and delivery location in the existing routes and the additional lateness introduced by the insertion. They used 90 deliveries and corresponding dispatcher decisions as training data for a 3-layer backpropagation neural network, and 50 deliveries as test data. Features from a single driver were used as input, and the output was a single quality score (1 for select driver, and 0 otherwise). The model outputs for the drivers were ranked and compared to the dispatcher's decision. The network was shown to perform close to the human dispatcher in empty travel time, lateness at pickup, and lateness at delivery.
Chen et al. [10] introduced a Data Mining-based Dispatching System (DMDS) to learn dispatching rules in the intermodal freight industry. Decision trees were used for learning the rules and the solution was then used as input to an optimization algorithm in order to improve on the dispatchers' results. They complemented the load attributes, e.g., pickup and delivery location, start and end time of a load, and required trailer type, with driver attributes like the driver's start and end time, home location, and remaining work hours. Based on the trained decision trees, the most important attributes were the distance between load and driver, the difference between a driver's remaining work hours and a load's service duration, the estimated remaining time for a driver to finish the current task, and the driver's remaining work hours. They show that using the DMDS in the optimization algorithm gives a small performance improvement compared to an optimization-only approach. They also show that training the DMDS with the optimized solutions gives slightly better results: 5.6% lower empty travel mileage and a 1.5% reduction in empty ratio.
Vukadinovic, Teodorovic, and Pavkovic [61] used a neural network to learn decisions for "loading, transport and unloading of gravel by inland water transportation" [61, p.1], where the goal was to predict the number of barges assigned to each tug. The input was the suitability for barge i to be assigned to tug j, where suitability was a function of tug j's barge capacity and the difference in release time for the barge and tug. The network was trained with heuristic simulated annealing. The network was trained on only 56 samples and tested on 16 samples. It performed slightly worse than the dispatcher, but the results were promising.
Riessen, Negenborn, and Dekker [50] developed a system for real-time container transport planning, to be able to provide instant decisions for incoming orders. They used decision trees to determine which service (train, barge, truck) to use for a single container. The decision tree was trained on historical data that had first been optimized. They showed that their model could reduce transportation costs by 3% over a greedy approach.
Mojtaba Maghrebi, Claude Sammut, and Travis Waller [36] developed a model for dispatching Ready Mixed Concrete (RMC). They used decision trees to prioritize customers and used several features, e.g., distance to depot, unloading time, travel time, and required amount. The model was tested by simulating a plant with three customers for 200 days; the data was sent to a human dispatcher to prioritize. The data was then used for training and testing (90-10 split) and the model approached an accuracy of 80%.
2.8.2 Permutation Equivariance and Invariance
Zaheer et al. [65] designed a neural network architecture called Deep Sets which operates on unordered variable length feature sets. This means that the input is made up of sets of features where the number of elements in each set is not constant and the elements do not have any inherent order.
The permutation invariant model transforms each instance $x_i$ in the set with one or more feedforward layers ($\phi(\cdot)$) into the representation $\phi(x_i)$. The sum of the transformations is then computed and passed as input to another network, e.g., a feedforward network. They also propose an equivariant model. This model consists of a non-linear transformation applied to a weighted combination of the input and the sum of the input, i.e., $f(x) = \lambda I x + \gamma (\mathbf{1}\mathbf{1}^T) x$. The invariant model was shown to be comparable to other techniques in outlier detection, point cloud classification and image tagging.
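The weighted combination in the equivariant model can be illustrated with a small sketch. It is not taken from [65]: the values of `lam` and `gamma`, the toy input, and the optional `activation` argument (a stand-in for the non-linear transformation) are assumptions chosen only to show that permuting the input set permutes the output in the same way.

```python
def equivariant_layer(x, lam, gamma, activation=None):
    """Apply lam * I x + gamma * (1 1^T) x to a set given as a list of feature vectors."""
    dim = len(x[0])
    # (1 1^T) x replaces every element by the sum over all elements of the set.
    total = [sum(row[k] for row in x) for k in range(dim)]
    out = [[lam * row[k] + gamma * total[k] for k in range(dim)] for row in x]
    if activation is not None:
        # Optional element-wise non-linearity (an assumption, e.g. math.tanh).
        out = [[activation(v) for v in row] for row in out]
    return out


# Toy set with three elements and two features each (illustrative values).
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
perm = [2, 0, 1]

out = equivariant_layer(x, lam=0.5, gamma=0.1)
out_perm = equivariant_layer([x[i] for i in perm], lam=0.5, gamma=0.1)

# Equivariance: permuting the input rows permutes the output rows identically.
expected = [out[i] for i in perm]
is_equivariant = all(
    abs(a - b) < 1e-12
    for row_a, row_b in zip(out_perm, expected)
    for a, b in zip(row_a, row_b)
)
```

Summing the rows of `out` afterwards would give a permutation invariant representation, which is the pooling step used by the invariant model.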
Gardner et al. [16] proposed a permutation invariant network architecture called Convolutional Deep Averaging Networks (CDANs). The architecture is similar to Zaheer et al. [65] with a transformation (embedding) of all input instances and then a pooling operation. Given a set $X_i = (x_1, \ldots, x_{l_i})$, $x_k \in \mathbb{R}^d$, the network applies an embedding function $f : \mathbb{R}^d \to \mathbb{R}^m$ to all input features $x_k$, where $m$ is the embedding size. The output from the embedding function is then pooled. The embedding function $f(\cdot)$ can be arbitrary with the restriction that it is compatible with backpropagation. In [16], $f(\cdot)$ is represented with a Multi Layer Perceptron/Feedforward network. Zaheer et al. [65] perform tests with summation, averaging, and max pooling functions on a point cloud classification task. The best performing pooling operation was summation and they showed that non-linear transformations outperform linear transformations.
2.8.3 Differences to previous research
The main differences in this thesis compared with previous studies are
a comparison of multiple algorithms on a single data set, and the use of
permutation invariant layers in a neural network to use more informa-
tion about the vehicles. The previous studies on automated dispatch-
ing have focused on a single algorithm, therefore, it could be valuable
to compare multiple algorithms on the same data set and evaluate the
performance. The permutation invariant network makes it possible to
use a set of deliveries as input to the network. This has the potential
to create better vehicle selections by having the ability to analyze all
deliveries for all active vehicles, for example by learning that deliveries
with similar pickup and delivery locations should preferably be
assigned to the same vehicle.
Method
This chapter motivates the methods used in this thesis and the steps to prepare the data, and to run and evaluate the models. It starts by describing the properties and features of the two data sets as well as the target and training/test split (Section 3.1). Then preprocessing is performed to encode categorical features and scale continuous features (Section 3.2). The problem is formulated as a binary classification problem (Section 3.3). The following sections outline the architecture of the two neural networks and explain the model implementation (Section 3.4 and Section 3.5 respectively). The chapter ends by listing all hyperparameters (Section 3.6) and describing the metrics used for evaluation and the tests performed (Section 3.7).
3.1 Data
The data used in this thesis is from a Swedish transportation company.
The complete data set had more than 4 years of deliveries. However, running on all of it was deemed infeasible due to the high computational complexity of the permutation invariant network. The permutation invariant network uses information about all deliveries for all vehicles, which increases the size of the input data. The computation is also hampered by the variable length input, which has to be transformed to a fixed size to increase the speed of computation. Even so, it is still very slow: running 35 epochs took approximately 19 hours.
The data set used contains 31,309 deliveries over approximately 3 months, with on average 41 active vehicles for each delivery. The deliveries took place in the first 3 months of 2018, which was the time period with the least amount of missing data. The data that was sometimes missing were home location, coordinates for pickup and delivery, and start and end times. The dispatchers are grouped by region and for that reason only data from the Stockholm area is used. Stockholm was chosen since most deliveries are made in the Stockholm area.

Table 3.1: Performance metrics from Chen et al. [10] between a human dispatcher and a decision tree model.
Metric                   Dispatcher   Model
Empty travel mileage     2911.4       2688.6
Loaded travel mileage    7668.3       7448.3
Total travel mileage     10579.8      10136.9
Empty ratio (%)          27.52        26.52
Loaded ratio (%)         72.48        73.48
There are two data sets used in this thesis: one for the permutation invariant neural network, and one for the other algorithms. The differences between the data sets are outlined in Section 3.1.3. However, the main features are the same and described below in Section 3.1.1.
3.1.1 Features
The initial features were taken from Chen et al. [10] because their model performed slightly better than a human dispatcher, see Table 3.1. They write that the fact that the model outperformed the human dispatcher may be caused by randomness in, for example, vehicle speed and travel time in the real world which are assumed to be constant in the simulation.
All features from Chen et al. [10] were used except trailer type, estimated time to finish current task, and estimated mileage during work hours. These are not applicable since the vehicles do not have trailers, and the deliveries come in throughout the day, which means that it is not possible to determine the remaining mileage. The transportation company did not perform one task at a time; instead, a new package could be collected and delivered if it was on the way to the current collection or delivery location. Distance between driver and load was also substituted by travel time between load and driver due to inefficient computation of distance. The features are shown in Table 3.2.

Table 3.2: Initial features used in all models. Based on the features from Chen et al. [10].
Name         Description
LSTime       Start time of a load
LETime       End time of a load
LCurLoc      Current location of a load
LDist        Travel mileage of a load
LEstTime     Estimated service duration of a load
DBoard       Driver ID
DSTime       Start time of a driver
DETime       End time of a driver
DHomeLoc     Home location of a driver
DCurLoc      Current location of a driver
DRemTime     A driver's estimated remaining work hours
DLTimeDiff   Difference between a driver's remaining work hours and a load's service duration
DLDur        Estimated travel time between a driver and a load (proxy for DLDist)

Additional features have been added based on feedback from a human dispatcher on what was deemed important when making the decisions; these features are shown in Table 3.3. The data contains a free text field (LText) which is used for arbitrary information that is deemed important. This field could contain information about pickup and delivery times, where to enter the building, whom to ask for, package dimensions etc. The package type indicates how urgent the shipping is (1 hour shipping etc.), and the number of deliveries is important to be able to balance the load over all vehicles, which increases the efficiency. Computation of the distances and travel times was done by Open Source Routing Machine (OSRM) [32]. The feature DLDur replaced DLDist (distance between driver and load) since OSRM did not have support for distance when computing more than one source-destination pair. Computing each source-destination pair individually was deemed too time consuming.
Table 3.3: Additional features used in all models based on feedback from a human dispatcher.
Name           Description
LText          Free text (package dimensions, time windows etc.)
LPkgType       The type of package (delivered in 3h, or urgent, etc.)
DNumRequests   The number of existing deliveries for the driver
3.1.2 Target
The data contains deliveries and, for each delivery, the vehicle that was assigned to it and made the delivery. The selection of a vehicle for a delivery was made by human dispatchers. The data also contains the other vehicles which were active at the time the delivery was assigned but which did not get selected. The target is therefore the vehicle selected by the human dispatcher, and the goal is for the algorithm to select the same vehicle as the human dispatcher did from a fleet of vehicles.
3.1.3 Description of Data sets
One data set is used for the permutation invariant neural network, and one data set is used for decision trees, logistic regression, SVM, and feedforward neural network. The idea was to use the permutation invariant neural network to take into account the assigned deliveries for all other vehicles. Since the vehicles have different numbers of other deliveries, this requires the algorithm to handle sequences of varying length, which is difficult to do for logistic regression, SVM, and decision trees. Therefore, this additional information is only added to the permutation invariant neural network. An additional feature called DCurrLoad is a list of a vehicle's other assigned deliveries. The information for each delivery is a vector with: LSTime, LETime, LCurLoc, LDist, and LEstTime.
Permutation invariant data set
The features for the permutation invariant neural network are all fea-
tures in Table 3.2 and Table 3.3, as well as DCurrLoad for all other
vehicles (see Figure 3.2).
Secondary data set
The features in Table 3.2 and Table 3.3 are the ones used in the data set for the following algorithms: decision tree, logistic regression, SVM, and feedforward neural network.
3.1.4 Training and test set
The data set is divided into a training set, validation set, and test set.
60% of the data set is used for training, 20% is used as validation set, and 20% is used as test set. The sets are divided with respect to temporal order such that the validation and test sets consist of deliveries made after the deliveries in the training set. Because training is slow, k-fold cross-validation would be too time consuming. A large training set allows the models to have more parameters and reduces the risk of overfitting.
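A temporal 60/20/20 split of this kind can be sketched as follows. The record layout and the field name `assigned_at` are hypothetical and only serve to illustrate ordering by time before cutting the data; they are not taken from the data set.

```python
def temporal_split(deliveries, train_frac=0.6, val_frac=0.2):
    """Split deliveries into train/validation/test sets in temporal order."""
    # Sort by the (hypothetical) assignment timestamp so later sets only
    # contain deliveries made after those in the earlier sets.
    ordered = sorted(deliveries, key=lambda d: d["assigned_at"])
    n = len(ordered)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


# Ten toy deliveries with integer timestamps in shuffled order.
toy = [{"id": i, "assigned_at": t}
       for i, t in enumerate([5, 1, 9, 3, 7, 2, 8, 4, 10, 6])]
train, val, test = temporal_split(toy)
```

No shuffling is applied, so every validation and test delivery is later in time than every training delivery.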
3.2 Preprocessing
The next step after the data extraction is to encode categorical features and normalize the continuous features. The categorical variables LPkgType and DBoard are encoded using a one hot scheme. One hot encoding is a technique that
. . . transforms a single variable with n observations and d distinct values, to d binary variables with n observations each. Each observation indicating the presence (1) or absence (0) of the dichotomous binary variable. (Potdar and Kinnerkar [44, p.7])
The continuous variables are transformed with min-max normalization such that the variables are in the interval [0, 1] using the formula:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
The free text feature is split into words and given as input to the
embedding layer.
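The two transformations described in this section can be sketched in a few lines. This is a minimal illustration written for this text, not the preprocessing code used in the thesis; the helper names and the toy values are assumptions.

```python
def one_hot(values):
    """Encode a categorical variable with d distinct values as d binary variables."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = [[1 if index[v] == j else 0 for j in range(len(categories))]
               for v in values]
    return encoded, categories


def min_max(values):
    """Scale a continuous variable to [0, 1] with x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: avoid division by zero (a choice made for this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy package types (stand-ins for LPkgType) and toy travel times (stand-ins for DLDur).
pkg_encoded, pkg_categories = one_hot(["urgent", "3h", "urgent"])
scaled = min_max([10.0, 20.0, 30.0])
```

In a train/validation/test setting, the minimum and maximum would typically be taken from the training set and reused for the other sets.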
3.3 Problem formulation
Before running the algorithms on the preprocessed data we must determine how to model the task. The dispatching problem can be modeled as a binary classification problem where a vehicle is either picked or rejected. The input is information about the load (pickup and delivery location, earliest pickup, latest delivery, package dimensions and weight etc.) and about the driver (current location, home location, working hours etc.). The problem is therefore to select the most suitable vehicle $x_{ti}$ at time $t$ from a set of available vehicles $X_t$. The target vector y consists of two classes, 'non-suitable' and 'suitable', which can be seen as a vector of length 2 that is one hot encoded. The vector [1, 0] is defined as 'non-suitable' and the vector [0, 1] is defined as 'suitable'. The model output is the probability of the sample belonging to each class. The algorithms are fed the active vehicles one at a time and output the probability of each vehicle being 'suitable'. The vehicle with the highest probability is assigned the delivery. In the next section (3.4) the architectures for the two neural networks are described.
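The selection step can be sketched as below: each active vehicle is scored separately and the vehicle with the highest 'suitable' probability is chosen. The function `toy_model` is a hypothetical stand-in for a trained classifier, and the dictionary fields are illustrative names rather than fields from the data set.

```python
def rank_vehicles(load, vehicles, predict_suitable):
    """Score the vehicles one at a time and return their ids, best first."""
    scored = [(predict_suitable(load, v), v["id"]) for v in vehicles]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [vehicle_id for _, vehicle_id in scored]


def in_top_k(ranking, chosen_id, k):
    """True if the dispatcher's vehicle is among the k highest ranked vehicles."""
    return chosen_id in ranking[:k]


def toy_model(load, vehicle):
    # Hypothetical scorer: a shorter travel time gives a higher 'suitable' score.
    return 1.0 / (1.0 + vehicle["travel_time"])


vehicles = [
    {"id": "A", "travel_time": 12.0},
    {"id": "B", "travel_time": 4.0},
    {"id": "C", "travel_time": 7.0},
]
ranking = rank_vehicles({"pkg_type": "urgent"}, vehicles, toy_model)
selected = ranking[0]
```

Keeping the full ranking, rather than only the best vehicle, makes it possible to check whether the dispatcher's choice is among the k highest ranked vehicles.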
3.4 Neural networks
3.4.1 Sample definition and Loss function
Each batch consists of a single set $X_i$ containing one or more vehicles. Since only one vehicle was selected for the delivery, the target vector will have one sample with the class [0, 1] ('suitable', i.e., the vehicle picked by the human dispatcher) and the rest have class [1, 0] ('non-suitable', vehicles not picked by the human dispatcher). The loss function used is categorical cross entropy, which is defined in Section 2.5.1.
3.4.2 Feedforward Architecture
The first model is a simple feedforward network with batch normal-
ization (see Section 2.5.3). It has two hidden layers and one output
layer. The model can be seen in Figure 3.1. The input is a vector with
the features in Tables 3.2 and 3.3. The output layer uses the softmax
activation function.
Figure 3.1: Network architecture for the feed forward neural network. [Diagram blocks: Load Information and Driver Information (inputs), two Fully connected Layer blocks each followed by Batch Normalization, and Output.]
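A forward pass through a network of this shape can be sketched in plain Python. The layer sizes, the ReLU activation and its placement after batch normalization, the random initialization, and the omission of learned batch normalization scale and shift parameters are all assumptions made for illustration; the hyperparameters actually used are listed in Section 3.6.

```python
import math
import random


def dense(batch, weights, bias):
    # weights holds one row of input weights per output unit.
    return [[sum(w * x for w, x in zip(row, sample)) + b
             for row, b in zip(weights, bias)] for sample in batch]


def batch_norm(batch, eps=1e-5):
    # Normalize each unit to zero mean and unit variance over the batch.
    n = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(sample, means, variances)] for sample in batch]


def relu(batch):
    return [[max(0.0, v) for v in sample] for sample in batch]


def softmax(batch):
    out = []
    for sample in batch:
        top = max(sample)  # subtract the maximum for numerical stability
        exps = [math.exp(v - top) for v in sample]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def init_layer(n_out, n_in, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(batch, params):
    (w1, b1), (w2, b2), (w3, b3) = params
    h = relu(batch_norm(dense(batch, w1, b1)))   # hidden layer 1
    h = relu(batch_norm(dense(h, w2, b2)))       # hidden layer 2
    return softmax(dense(h, w3, b3))             # two-class output


rng = random.Random(0)
params = [init_layer(8, 4, rng), init_layer(8, 8, rng), init_layer(2, 8, rng)]
# One batch: five vehicles with four (already normalized) features each.
batch = [[rng.random() for _ in range(4)] for _ in range(5)]
probs = forward(batch, params)
```

Each row of `probs` holds the two class probabilities for one vehicle, with index 1 corresponding to 'suitable' under the encoding of Section 3.3.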
3.4.3 Permutation invariant Architecture
Figure 3.2 shows the neural network model where Load Information are the attributes in Tables 3.2 and 3.3 which start with L, and Driver Information are the attributes in Tables 3.2 and 3.3 which start with D.
The parts that handle unordered variable length input have Deep Sets [65] layers. The Deep Sets layer has an embedding function, represented by two fully connected layers, which is applied separately to all existing orders (load information/DCurrLoad) for each vehicle, similar to a convolution. The weights in the fully connected layers are shared between all instances in a set. The result of the embedding function is pooled using summation, which results in a fixed size representation of all current deliveries for each vehicle. If the deliveries for vehicle $i$ are $X_i = \{x_1, x_2, \ldots, x_{Q_i}\}$, with $x_j$ being the features for delivery $j$, the Deep Sets layer applies two feedforward layers, which we denote as $\phi(\cdot)$, to every $x_j$, giving $\phi(x_j)$. The results from the feedforward network are then pooled together with sum pooling, which gives deep_sets_output $= \phi(x_1) + \phi(x_2) + \cdots + \phi(x_{Q_i})$, a vector of fixed size regardless of the number of deliveries ($Q_i$). The procedure for the second Deep Sets layer is analogous.
These representations are then used as input to another Deep Sets layer which results in a fixed size representation of all vehicles and their deliveries. This representation is called global vehicle information.
Global vehicle information is only computed once and then concatenated
with the load, driver, and text information for each active vehicle for a
specific delivery. The free text for the current order is transformed into a fixed size representation through an embedding layer and an LSTM layer. The text representation is then concatenated with the global vehicle information, and the information about the order and the current vehicle. The output layer has two nodes and uses the softmax activation function to get a score for good and bad selection.
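The two stacked Deep Sets layers can be sketched as follows. For brevity, the embedding $\phi(\cdot)$ is reduced to a single shared tanh layer instead of two fully connected layers, and the weights, feature dimensions, and toy fleet are made-up values; the sketch only illustrates how sum pooling turns a varying number of deliveries and vehicles into fixed size vectors.

```python
import math


def phi(x, weights):
    # Shared embedding: one fully connected tanh layer (simplified stand-in).
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def deep_sets_layer(elements, weights):
    """Embed every element with shared weights, then sum-pool to a fixed size."""
    pooled = [0.0] * len(weights)
    for x in elements:
        pooled = [p + e for p, e in zip(pooled, phi(x, weights))]
    return pooled


# Hypothetical weights: 3 delivery features -> 2, then 2 -> 4.
W_delivery = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2]]
W_vehicle = [[0.1, 0.7], [-0.3, 0.2], [0.6, -0.5], [0.05, 0.4]]

# Each vehicle has a different number of assigned deliveries (cf. DCurrLoad).
fleet = {
    "vehicle_1": [[0.1, 0.5, 0.3], [0.9, 0.2, 0.4]],
    "vehicle_2": [[0.4, 0.4, 0.1]],
    "vehicle_3": [],
}

# First layer: one fixed size vector per vehicle, whatever its number of deliveries.
vehicle_repr = [deep_sets_layer(d, W_delivery) for d in fleet.values()]
# Second layer: one fixed size vector for the whole fleet.
global_info = deep_sets_layer(vehicle_repr, W_vehicle)
```

Because summation is commutative, reordering the deliveries of a vehicle or the vehicles in the fleet leaves `global_info` unchanged up to floating point rounding.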
3.4.4 Class Imbalance
One problem with having each batch being a single set is that the class 'suitable' ([0, 1]) has a single sample while the 'non-suitable' class has on average 40 samples. Problems with skewed data are called class imbalance problems and can be observed in, for example, fraud detection and medical diagnosis. There are several ways to address this problem, with the two common categories being sampling based solutions and cost based solutions. With the sampling approach the idea is to either add samples from the minority class (oversampling), or remove samples from the majority class (undersampling). In the cost based approaches the algorithm is altered so that the cost of misclassifying a sample from the minority class is larger than that of misclassifying a sample from the majority class. In a neural network the loss is multiplied by a weight for each sample, e.g., the weight $w$ could be added to the categorical cross entropy (see Section 2.5.1) like this:
$$J(\theta) = -\sum_{i=0}^{N} w_i \, p(\vec{x}_i) \log(q(\vec{x}_i))$$
with $N$ being the number of samples. The weights for the cost based approach were chosen by manually testing a range of values and then picking the ones giving the best performance.
Sampling and cost based approaches are tested in Section 4.4.
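The cost based and oversampling approaches can be sketched as follows. The class weights, predictions, and sample names are made-up values for illustration (the weights in the thesis were tuned manually), and the per-sample weight is chosen here from the true class of each sample, which is one possible way to set $w_i$.

```python
import math


def weighted_cross_entropy(targets, predictions, class_weights):
    """Categorical cross entropy where each sample is weighted by its true class."""
    loss = 0.0
    for p, q in zip(targets, predictions):
        w = class_weights[p.index(1)]  # weight of the true class of this sample
        loss -= w * sum(p_c * math.log(q_c) for p_c, q_c in zip(p, q) if p_c > 0)
    return loss


def oversample(samples, targets, minority):
    """Duplicate minority class samples until both classes have equal counts."""
    minority_idx = [i for i, t in enumerate(targets) if t == minority]
    n_major = len(targets) - len(minority_idx)
    extra = max(0, n_major - len(minority_idx))
    dup = [minority_idx[k % len(minority_idx)] for k in range(extra)] if minority_idx else []
    idx = list(range(len(targets))) + dup
    return [samples[i] for i in idx], [targets[i] for i in idx]


# One batch: three 'non-suitable' vehicles ([1, 0]) and one 'suitable' ([0, 1]).
targets = [[1, 0], [1, 0], [0, 1], [1, 0]]
predictions = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]

plain = weighted_cross_entropy(targets, predictions, class_weights=[1.0, 1.0])
weighted = weighted_cross_entropy(targets, predictions, class_weights=[1.0, 3.0])

samples = ["v1", "v2", "v3", "v4"]
new_samples, new_targets = oversample(samples, targets, minority=[0, 1])
```

With a minority weight of 3, the error on the 'suitable' sample contributes three times as much to the loss as it does in the unweighted case.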
3.5 Model implementation
Scikit-learn is a machine learning library for Python [41] and was used
to implement the following algorithms: decision trees, logistic regression,
and SVM.
[Figure 3.2: diagram of the permutation invariant architecture. Recoverable block labels: Delivery Text, Word Embedding, LSTM, Text Information; Load Information, Driver Information; Load Info 1 ... Load Info Q1 (Vehicle 1 Information) through Load Info 1 ... Load Info QN (Vehicle N Information); Fully connected Layer, Dropout, and Sum Pooling blocks inside two Deep Sets Layers; Global Vehicle Information; Concatenate Layer; Fully connected Layer; Output.]