
A metaheuristic for vehicle routing problems based on reinforcement learning

DAVID ÖDLING

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES


A metaheuristic for vehicle routing problems based on reinforcement learning

DAVID ÖDLING

Degree Projects in Optimization and Systems Theory (30 ECTS credits)
Degree Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2018

Supervisor at KTH: Anders Forsgren
Examiner at KTH: Anders Forsgren


TRITA-SCI-GRU 2018:039
MAT-E 2018:13

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

The vehicle routing problem is an old and well-studied problem that arises in last mile logistics. The rapid growth of e-commerce, in particular the increasing demand for time-scheduled home deliveries on the customer's terms, is making the problem ever more relevant.

By minimizing the cost and environmental impact, we arrive at a mathematical problem called the vehicle routing problem with time windows.

Since the problem is NP-hard, heuristic methods are often used. In practice, they work very well and typically offer a good tradeoff between speed and quality. However, since heuristics are often tailor-made to fit the needs of the underlying problem, no known algorithm dominates the others on all problems.

One way to overcome the need for specialization is to produce heuristics that are adaptive. In this thesis, an offline learning method is proposed to generate an adaptive heuristic using local search heuristics and reinforcement learning.

The reinforcement learning agents explored in this thesis are situated in both discrete and continuous state representations. Common to all state spaces is that they are inspired by human-crafted reference models, where the last action and the result of that action define the state. Four different reinforcement learning models are evaluated in the various environments.

By evaluating the models on a subset of the Solomon benchmark instances, we find that all models but one improve upon a random baseline. The average learner from each of the successful models was slightly worse than the human-crafted baseline. However, the best among the generated models was an actor-critic based model which outperformed the best human baseline model.

Due to the scalar objective function, the results are not directly comparable to the Solomon benchmark results with hierarchical objectives. Nonetheless, the results are encouraging as a proof of principle, with results in line with the human-crafted baseline.

The results indicate two clear paths for further work. First, applying the formulation to more complex environments with more actions and more powerful state spaces. Second, investigating models based on stochastic policies and recurrent neural networks to cope with the inherently partially observed environment.


Sammanfattning

Vehicle routing is an old and well-studied optimization problem that arises in last mile logistics. With ever-growing e-commerce, the demand for time-scheduled home deliveries on the customer's terms is increasing. Minimizing the cost and the environmental impact then becomes a vehicle routing problem with time windows.

Since the optimization problem is NP-hard, heuristic solution methods are used. This thesis investigates the possibility of generating heuristics by training on similar problems. More specifically, we generate a local search based heuristic by formulating the learning problem as a reinforcement learning problem.

The methods used are based on both discrete and continuous state spaces. Common to the state spaces is that they are inspired by the information used by human-crafted heuristics, where the previous choice and its result carry the information. Four different reinforcement learning models are evaluated on different problems and state spaces.

By training the models on different types of problems from the well-known Solomon instances and evaluating them on an independent test set, we find that all models but one are better than chance. None of the models beat the best reference model on average, since the variation in the outcome was large, but they all come very close. The best among all models, an actor-critic agent, achieved a better result than the best reference model.

Since a scalar objective function has been used, the results are not directly comparable to others on the Solomon problems, which are to be optimized with a hierarchical objective. Despite this, the results are good and show that the introduced method works well, since it performs in line with the reference models based on the same information.

The results point to two directions for further work. The first is a more powerful state space with more information and more possible actions. The second is to apply stochastic policy based models, or equivalent, to overcome the inherent incompleteness of the state spaces.


Contents

1 Introduction
  1.1 Objectives
  1.2 Disposition
2 The vehicle routing problem
  2.1 Definition
    2.1.1 NP-hardness
    2.1.2 Time variable
    2.1.3 Capacitated Vehicle Routing Problems with Time Windows
  2.2 Solution Methods
    2.2.1 Exact
    2.2.2 Heuristics
    2.2.3 Local search
  2.3 Previous work
    2.3.1 Learning methods
    2.3.2 Offline methods
3 Reinforcement learning
  3.1 Markov Decision Process
    3.1.1 Value iteration
    3.1.2 Policy iteration
    3.1.3 Observability
  3.2 Model free learning
    3.2.1 Introduction
    3.2.2 Exploration versus exploitation
    3.2.3 Off policy versus On policy
    3.2.4 Tabular Q-learning
    3.2.5 Tabular Sarsa
    3.2.6 Deep Q-Networks
    3.2.7 Actor critic
4 Problem formulation
  4.1 Policy formulation
5 Method
  5.1 State space
    5.1.1 Discrete
    5.1.2 Continuous
  5.2 Action space
    5.2.1 Discrete agents
    5.2.2 Continuous agents
    5.2.3 Learning procedure
  5.3 Architectural choices
    5.3.1 Objective function
    5.3.2 Soft constraints
    5.3.3 Reward signal
    5.3.4 Stop criteria
    5.3.5 Initial solution
  5.4 Benchmarks
    5.4.1 Reference models
  5.5 Experiments
    5.5.1 Data set
    5.5.2 Models
    5.5.3 Optimization setting
    5.5.4 Problem types
    5.5.5 State spaces
  5.6 Summary of experiments
  5.7 Implementation details
    5.7.1 Hyperparameters
6 Results
  6.1 Training set
    6.1.1 Convergence
    6.1.2 Learned policies
  6.2 Test set
    6.2.1 Results
    6.2.2 Comparison to the reference models
  6.3 Qualitative tests
    6.3.1 Geometric structure
    6.3.2 Time structure
    6.3.3 State space
  6.4 Best model results
    6.4.1 Comparison to best known results
7 Discussion
  7.1 Analysis
    7.1.1 Convergence
    7.1.2 Learning capacity
    7.1.3 Problem structure
    7.1.4 Time structure
    7.1.5 State space
    7.1.6 Learnt policies
  7.2 Modeling choices
    7.2.1 Objective function
    7.2.2 Reward normalization
    7.2.3 Model architectures
  7.3 Further work
8 Conclusions
Bibliography

1 Introduction

The vehicle routing problem is a fundamental problem in last mile logistics: given a set of goods to deliver and a fleet of vehicles, what is the most efficient way to deliver these goods? With the rise of modern e-commerce and an increased demand for time-scheduled home deliveries on the customer's terms, the problem has become ever more relevant.

Being an old and well-studied problem, there are a myriad of techniques for attempting to solve it. The most successful class of solvers is based on heuristics, which do not guarantee a globally optimal solution. Exact solvers are typically not even attempted on real-world sized problems. In fact, almost all world records on large and complex problems are achieved with heuristics. They typically offer a clear tradeoff between solution quality and speed, which is often valuable in applications.

With time being scarce, there is no single algorithm which dominates the others. As pointed out by Andrychowicz et al. (2016), the No Free Lunch Theorems for Optimization by Wolpert and Macready (1997) say that in the setting of combinatorial optimization, specialization to a subclass of problems is essential for performance in general.

The need for specialization does not imply that there cannot exist a general algorithm; it simply implies that it needs to adapt to the given problem at hand. In vehicle routing problems, the location is typically the same while the deliveries and the available fleet may change from day to day. Clearly, there is some persistent information that may be used to adapt the solver for the problem at hand.

Many solvers aimed at breaking new records learn about the given problem online. This works well over long timespans when attempting to improve best known solutions to academic benchmarks, but in reality time is often the limiting factor. The question is then whether it is possible to incorporate some general information about the problem beforehand, making the optimization more efficient, at least over shorter timespans. This thesis uses model free reinforcement learning towards that goal.

1.1 Objectives

Informally, the objective of this thesis is to use reinforcement learning in order to generate a heuristic that is conditioned on the type of problem at hand. In the reinforcement learning formulation, the state space will contain some information about the problem state, while the action space is the set of local search heuristics.

More generally, the underlying problem studied in this thesis is NP-hard. As with many problems of this class, such problems can be reformulated into each other, meaning that if one can be solved, many others can be too. Hence, in the broadest possible form, the main problem can be stated as follows.

Given a vehicle routing problem on a graph g and a distribution G of problem instances, can we learn a heuristic solver, defined by a policy π, that generalizes to unseen instances from G?

If the vehicle routing problem is solvable to a sufficient degree using this technique, then many more problems can be solved in the same manner.

More specifically, the scope of the thesis is to examine the following.

• Evaluate the learning capacity of the models, do they generalize?

• Evaluate different state spaces that the agents learn from.

• Compare different learning methods in this setting.

• Analyze the solutions, do they correspond to known heuristics?

In chapter 6 the results are presented, and in chapter 7 they are discussed in juxtaposition with each objective.


Scope

The main focus of this thesis is the Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) based on the Solomon benchmarks provided by Gehring and Homberger (Gehring and Homberger 1999). They provide a reference to the best-known solutions while being hard enough to serve as a benchmark. In principle, the methods used are chosen so that they apply to a larger variety of Vehicle Routing Problems (VRP) and are not specifically optimized for the Solomon instances.

Finally, the scope of this thesis is to test the principle of using reinforcement learning to generate a heuristic that can generalize. As such, success will be measured by the performance in relation to human-crafted heuristics given the same information.

1.2 Disposition

The thesis is based on theory from two distinct fields, local search heuristics and model free reinforcement learning. The reader familiar with both of these may move directly to chapter 4, where the reinforcement learning problem formulation is introduced.

The next chapter, chapter 2, provides a concise introduction to the vehicle routing problem and its solution methods, and ends with a review of previous work on generating heuristics. With the vehicle routing problem and its solution methods introduced, chapter 3 contains the theoretical basics of model-free reinforcement learning. With the theoretical prerequisites in place, we can properly introduce the problem formulation in chapter 4. This chapter contains the main contribution of this thesis, as it formalizes the meta optimization problem of generating a heuristic as a reinforcement learning problem. The specifics of the method and implementation, as well as the experiments, are found in chapter 5.

The results of the experiments outlined in chapter 5 are presented with some observational comments in chapter 6. Finally, the conclusions to be drawn from the results and future directions are presented in chapter 8.


2 The vehicle routing problem

2.1 Definition

The vehicle routing problem is a fundamental problem in logistics. Given a fleet of vehicles and a set of customers to visit, each with some goods and a chosen time slot, the question arises: by which vehicle and at what time shall each customer be visited in order to minimize the total cost?

2.1.1 NP-hardness

The subproblem of finding the sequence in which to visit the customers[1] so as to minimize the cost is often referred to as the Traveling Salesman Problem (TSP). A feasible solution to the simplest variant, with a complete graph and with no constraints, can be found in polynomial time. However, the decision problem of determining the stop condition, the existence of a sequence with a lower cost than the current best cost, is NP-complete (Papadimitriou 1977). The corresponding optimization problem of finding the global optimum can be shown to be NP-hard. The particular type of problem studied here, the VRP with time windows, is at least as hard and hence NP-hard as well (Lenstra and Kan 1981).

[1] Where each customer is visited once, except for the depot which is first and last.

In general, we have a Mixed Integer Linear Program of the form

    min_{x,t}  f(x, t)
    s.t.       g(x, t) ≤ 0,
               h(x, t) ≤ 0,
               x ∈ Z^n,  t ∈ R^n.

Typically, the functions f, g, h are linear but may also be non-linear. Hence, real-world VRPs are often non-convex and contain both discrete and continuous decision variables. For instance, travel times are, due to traffic, in general non-convex with respect to the time of travel. Moreover, the constraints may be vehicle specific, such as load constraints, but also global, such as restrictions on the number of vehicles at the loading bay. Thus, the VRP is very hard indeed. In this thesis, we will focus on simpler VRP problems, but choose algorithms and solution methods that can handle more general, real-world like problems.

2.1.2 Time variable

In practice, a majority of the costs are due to the drivers' wages. Moreover, in today's densely populated areas traffic patterns have a significant impact. Thus, the average travel time between customers varies as a function of the start time, typically in a non-convex manner.

Whenever the travel times are constant with respect to the start time and the time window constraints are soft with convex penalty functions, there exist methods running in polynomial time for finding the optimal start times (Ibaraki et al. 2008).

Whenever the cost is independent of time, there is no cost of waiting. This implies that it is sufficient to wait at the customer until the start of the time window to stay optimal in the time variable. This eliminates the time dependence entirely and makes the time variable dependent on the path only. This simplification will be used throughout this thesis, since we will focus on costs derived from distance only.


2.1.3 Capacitated Vehicle Routing Problems with Time Windows

This thesis will focus on a class of vehicle routing problems with a homogeneous fleet of vehicles, Euclidean times and distances, and time window constraints, often abbreviated as CVRPTW. To formally define the problem we introduce a directed graph G = (N, E), with vertices N and edges E. Let C be the set of customer vertices and D the depots, where C ∪ D = N. The time window for each vertex i is defined by [a_i, b_i]. The service time at each vertex is s_i ∈ R+ and the load capacity of the vehicles is C_v. Let F be the set of feasible solutions to the problem and let s ∈ F be a solution. Finally, let c : s → R+ be the cost function of the solution s. The key variables are summarized below.

• n, the number of customers
• m, the number of vehicles
• C_v, the load capacity of vehicle v
• s_v ∈ F, a feasible route for vehicle v
• c : s_v → R+, the cost of s_v
• t_{i,j}, the travel time between nodes i and j
• a_i, the start of the time window at node i
• b_i, the end of the time window at node i
• s_i, the service time at node i

Mixed integer program formulation

We will present a multi-commodity network flow formulation. Let us introduce the source depot D− and the corresponding sink D+, and let the decision variable x_{i,j,k} ∈ X be a binary variable denoting the use of arc e_{i,j} ∈ E by vehicle k, where V denotes the set of vehicles available. The cost of one route c(s_v) is the sum of the costs of its edges c_{i,j}, that is, c(s_v) = Σ_{(i,j)∈s_v} c_{i,j}. Finally, to express the time of arrival constraints, we introduce T_{i,k} as the time of arrival at vertex i by vehicle k. The resulting set of equations is shown in Equation 2.1.

    min_{x∈X}  Σ_{i∈E} Σ_{j∈E} Σ_{k∈V} c_{i,j} x_{i,j,k}                             (2.1a)

    s.t.  Σ_{k∈V} Σ_{j∈C\{i}} x_{i,j,k} = 1                     ∀ i ∈ C              (2.1b)
          Σ_{j∈C} x_{i,j,k} = 1                                 ∀ k ∈ V, i ∈ D−      (2.1c)
          Σ_{i∈C\{n}} x_{n,i,k} − Σ_{j∈C\{n}} x_{j,n,k} = 0     ∀ k ∈ V, n ∈ C       (2.1d)
          Σ_{i∈C} x_{i,j,k} = 1                                 ∀ k ∈ V, j ∈ D+      (2.1e)
          Σ_{i∈C} c_i Σ_{j∈C} x_{i,j,k} ≤ C_v                   ∀ k ∈ V              (2.1f)
          x_{i,j,k} (T_{i,k} + s_i + t_{i,j} − T_{j,k}) ≤ 0     ∀ i, j, k            (2.1g)
          a_i ≤ T_{i,k} ≤ b_i                                   ∀ k ∈ V, i ∈ C       (2.1h)
          x_{i,j,k} ∈ {0, 1},   T_{i,k} ∈ R+                                          (2.1i)

The formulation above is convenient since it makes it easy to translate the problem definition into the lines of the equation. The objective function (a) is to minimize the total cost, defined by the sum of the cost per used edge e_{i,j} between the nodes in the graph. Each customer is visited exactly once by one vehicle (b). The depot source nodes are used by all vehicles (c). All vehicles arriving at a customer also depart from it (d). All vehicles use the sink depot to mark the end of the route (e). The vehicle load capacity may not be exceeded (f). The time of arrival at a customer is at least the time of arrival at the previous node plus its service time and the travel time, whenever the corresponding edge is used (g). The arrival time is greater than or equal to the time window start and less than or equal to the window end at each customer (h). Finally, the decision variable for using edge e_{i,j} with vehicle k is binary, while the arrival times are positive real numbers (Desaulniers, Madsen, and Ropke 2014).

This formulation is non-linear due to the arrival time constraint (g) but can be linearized. However, since the focus of this thesis is not exact methods, we do not consider such reformulations.
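To make the constraints concrete, the sketch below (in Python, not part of the thesis) propagates arrival times along a single route and checks the rules corresponding to (2.1f)-(2.1h); the instance encoding as plain dictionaries is an assumption for illustration only.

    # Hypothetical encoding: a route is a list of node ids starting and ending at the
    # depot; t, a, b, s, q are dicts of travel times, window starts/ends, service times
    # and demands (q[depot] = 0); cap is the vehicle load capacity C_v.
    def route_cost_and_feasibility(route, t, a, b, s, q, cap):
        """Return (cost, feasible) for one route under the CVRPTW constraints."""
        cost, load = 0.0, 0.0
        T = a[route[0]]                                # start within the depot window
        for i, j in zip(route, route[1:]):
            cost += t[(i, j)]                          # distance-based edge cost c_ij (2.1a)
            T = max(a[j], T + s[i] + t[(i, j)])        # arrival propagation (2.1g), wait if early
            if T > b[j]:                               # time-window end violated (2.1h)
                return cost, False
            load += q[j]
        return cost, load <= cap                       # capacity constraint (2.1f)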


2.2 Solution Methods

2.2.1 Exact

Exact solution methods have always been an active area of research. The main types of solution methods are based on column generation, dynamic programming, branch and bound, or some combination of these techniques. However, for problem sizes of practical use, about 100 vertices and more, very few solvers have solved problems to global optimality. For practical applications, low complexity and the ability to adjust the tradeoff between speed and accuracy are important. There are some methods that provide this and still are proven to converge to the global optimum, such as column generation methods. But since heuristics far outperform these methods they are rarely used in practice. For a good survey on exact as well as heuristic methods, see Toth and Vigo (2014).

Problem size   Pure CP   Column generation   Hybrid CG-CP
25             20 %      100 %               86 %
50             7 %       48 %                24 %
100            0 %       24 %                7 %

Table 2.1: The fraction of solved Solomon instances. (Adapted from Rousseau, Gendreau, and Pesant 2002)

2.2.2 Heuristics

A heuristic method does not guarantee that a globally optimal solution will be found. Heuristics are instead judged solely on their empirical performance. Most solvers used in real applications are heuristics due to their good tradeoff between speed, solution quality and the ability to scale to large instances.

In a very broad sense there are three types: constructive, improvement-based, and more adaptive combinations of these, often referred to as metaheuristics. There is a myriad of approaches within each class, but due to the nature of the metaheuristics it is the richest in variety. Some of the most fascinating approaches rely on nature-inspired algorithms such as genetic programming (Vidal et al. 2013) and ant colonies (Donati et al. 2008). However, the class of metaheuristics that will be used in this thesis is based on a technique called local search.

Local search neighborhood

A key feature in local search is the notion of a neighborhood N. Let F be the set of all feasible solutions and let N(x) denote the set of solutions in a neighborhood around x. We can then define a local optimum s* with respect to N as

    c(s*) ≤ c(s)   ∀ s ∈ N(s*),                                          (2.2)

in contrast to a globally optimal solution ŝ,

    c(ŝ) ≤ c(s)   ∀ s ∈ F.                                               (2.3)

The main idea behind local search is to use the fact that being globally optimal also implies being locally optimal. Essentially, by making sure that the solution is locally optimal, one hopes that this will imply the reverse, locally optimal → globally optimal, even though there is no such guarantee. In practice, the average performance is more important than theoretical guarantees, and in this light the technique works well, with average solutions close to the global optimum. For example, the so-called 2-opt neighborhood has on average been shown to be 5% above the Held-Karp lower bound[2] (Johnson and McGeoch 1997).

If we were to search through all possible combinations of routes, we would need to go through an exponential number of combinations. In the worst possible case, when all permutations are feasible, that would be

    |F| ≤ (|C| + |K|)! / |K|!.                                           (2.4)

This is obviously infeasible even for the small instances considered in this thesis, where |C| ≈ 100 and |K| ≈ 20 yields |F| ≈ 10^180.
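As a quick sanity check of the bound (2.4), the snippet below (an illustration, not from the thesis) evaluates it for the instance sizes quoted above.

    import math

    C, K = 100, 20                                     # customers and vehicles
    bound = math.factorial(C + K) // math.factorial(K)
    print(f"|F| <= {float(bound):.3e}")                # about 2.7e+180, on the order of 10^180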

The idea of the neighborhood is thus to search through a smaller subset around a current solution s, the neighborhood N(s). A natural way to quantify the size of a neighborhood is to measure the distance ρ between solutions in the neighborhood by the cardinality of their symmetric difference. If solutions x_1 and x_2 differ in k edges, then ρ(x_1, x_2) = k. With ρ defining the cardinal difference on X, the neighborhood induced by a cardinal difference of k is formally defined by the set

    N_k(x) = { s_i ∈ F | ρ(s_i, x) = k }.                                (2.5)

One example of a neighborhood of this kind is the k-opt neighborhood. The size of this neighborhood is |N_k(s)| = C(n, k) ≈ O(n^k). The simplest neighborhood is called 2-opt and is of size O(n^2). We will introduce the neighborhoods used in this thesis more thoroughly in the method chapter. For an introduction and survey, see Toth and Vigo (2014).

[2] With one vehicle, in the Cartesian metric, with no time windows or other constraints.

Figure 2.1: Illustration of a 2-opt neighborhood move.
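The 2-opt move illustrated in Figure 2.1 amounts to removing two edges and reversing the segment between them. A minimal sketch on a route represented as a list of node ids is given below; it is an illustration only and ignores feasibility, which the local search handles separately through penalized costs.

    def two_opt_move(route, i, j):
        """Remove edges (route[i], route[i+1]) and (route[j], route[j+1]) and
        reconnect by reversing the segment in between (requires i < j)."""
        return route[:i + 1] + route[i + 1:j + 1][::-1] + route[j + 1:]

    # Removing edges (1, 2) and (4, 5) reverses the middle segment:
    print(two_opt_move([0, 1, 2, 3, 4, 5, 0], 1, 4))   # [0, 1, 4, 3, 2, 5, 0]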

2.2.3 Local search

The general procedure of most local search algorithms is as follows.

1) Generate one or more constructive start solutions.

2) Perform local search until some stop criterion is met.

Since finding a feasible start guess is also NP-hard, non-feasible solutions need to be handled. Hence, the set of solutions explored by the neighborhood needs to be extended so that the search is done in the set of all solutions s ∈ S, where the set of feasible solutions is a subset F ⊂ S. The most common way is to use soft constraints, where the penalty functions are constructed so that more feasible solutions are always preferred to less feasible ones. We will use the following decomposition,

    min_{x∈S}  f(x) + µΨ(x),

where Ψ(x) = 0 when x is feasible and Ψ(x) > 0 when x is infeasible. A procedure for the cost function implementation with penalty functions is found in Algorithm 5 in the appendix.
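A minimal sketch of the scalar penalized objective f(x) + µΨ(x); the split of Ψ into a capacity and a time-window excess is an assumed example, while the actual penalty procedure is the one in Algorithm 5 in the appendix.

    def penalized_cost(distance, capacity_excess, time_window_excess, mu=1e3):
        """f(x) + mu * Psi(x): Psi is zero for feasible solutions and grows with the
        amount of violation, so more feasible solutions are always preferred."""
        psi = capacity_excess + time_window_excess
        return distance + mu * psi

    # A slightly shorter but infeasible solution loses to a feasible one:
    print(penalized_cost(950.0, 0.0, 0.0))   # 950.0
    print(penalized_cost(900.0, 5.0, 0.0))   # 5900.0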


Many heuristics make use of a hierarchical cost function, where the primary objective is to reduce the number of vehicles and the secondary objective is to reduce the cost. This will not be used here, since the formulation of the reinforcement learning objective that we will present requires a scalar cost.

Constructive start

Constructive algorithms are typically used to generate initial solutions.

They generate a solution by creating routes without any start reference.

However, since the fleet of vehicles is limited and time windows are present, the resulting solution may not be feasible. While trying to maintain feasibility, the key feature of the start guess is to keep a low order of complexity (Hasle and Kloster 2007). The most famous algorithms are Clarke Wright (Clarke and Wright 1964) and Christofides (Christofides, Mingozzi, and Toth 1981). Even though they are most suited for symmetric metrics and no time windows, either the problem can be approximated to these conditions, or the algorithms can be adjusted slightly to work well in practice. There are also other approaches, for example, cluster-based methods such as (Dondo and Cerdá 2007) that can be more efficient when clustered solutions are expected.

We will make use of Clarke Wright in this thesis due to its simplicity and easy implementation. The main idea is the following observation.

Consider the two sets of routes in Figure 2.2.

Figure 2.2: The Clarke Wright savings idea illustrated.

The cost of the first is higher, assuming no violation of any constraint, with C_1 = c_{0i} + c_{i0} + c_{0j} + c_{j0}, compared to the joined route with C_2 = c_{i0} + c_{0j} + c_{i,j}. The saving derived from joining the nodes i, j is thus s_{i,j} = C_1 − C_2 = c_{0i} + c_{j0} − c_{i,j}. Since the distance metric used in this thesis is symmetric, we get s_{i,j} = c_{i0} + c_{0j} − c_{i,j}. By sorting the edges by this savings measure, the routes that are created are thus different from a greedy solution, since they account for the cost to and from the depot. This is the main idea behind the Clarke Wright savings, and the computational complexity of this procedure is O(n^2 log n).

Two additions can be made to account for the time windows. First, the direction of i, j matters due to the asymmetry in time windows. Second, we add a factor τ_{i,j} = τ max(0, a_j − t_{i,j} − b_i) to account for the possibility of waiting time. The full algorithm, Algorithm 6, is presented in the appendix.
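A sketch of the savings computation with the time-window correction described above; the data layout and the sign of the correction (a penalty that makes merges with expected waiting less attractive) are assumptions, and the exact rule is the one in Algorithm 6 in the appendix.

    def clarke_wright_savings(customers, c, t, a, b, tau=1.0, depot=0):
        """Directed savings s_ij = c_i0 + c_0j - c_ij - tau * max(0, a_j - t_ij - b_i),
        returned as candidate merges sorted by decreasing saving."""
        savings = []
        for i in customers:
            for j in customers:
                if i == j:
                    continue
                s_ij = c[i, depot] + c[depot, j] - c[i, j]           # classic savings term
                s_ij -= tau * max(0.0, a[j] - t[i, j] - b[i])        # waiting-time correction
                savings.append((s_ij, i, j))
        return sorted(savings, reverse=True)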

With an initial solution in place, the local search can begin. The next sections will go through a set of methods for this step.

Iterated local search

When taking a step in some neighborhood N(s) into s′, there will potentially be a different point s″ ∈ N(s′) that we can continue to improve the solution with. Eventually, s will be locally optimal with respect to N. To further improve the solution, one idea is simply to perturb the solution far away from the locally optimal s. This algorithm is often referred to as ILS.

Another way of moving away from a local optimum is to restrict the admissible moves in the neighborhood to those not already visited, using a Tabu list, or to change the objective function such that already visited solutions are penalized.

While the best current solution is always saved and only updated with improving solutions, the moves accepted in the neighborhood N(s) may not always improve the current solution s. Various strategies exist, but the most popular ones are to choose the best improvement, the first improvement, or, most notably, a stochastic acceptance criterion, typically inspired by simulated annealing. For example, a simulated annealing type stochastic acceptance criterion has some temperature parameter T such that the probability of accepting a non-improving solution s′, P(accept s′ | c(s′) > c(s)) ∝ T, anneals towards only accepting improving solutions as T → 0.
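A minimal sketch of a simulated annealing style acceptance criterion of the kind described above; the Metropolis form and the geometric cooling schedule in the comment are common choices, not necessarily the ones used in the thesis.

    import math
    import random

    def accept(cost_new, cost_current, temperature):
        """Always accept improving moves; accept a worsening move with a probability
        that shrinks with the cost increase and with the temperature."""
        if cost_new <= cost_current:
            return True
        return random.random() < math.exp(-(cost_new - cost_current) / temperature)

    # Annealing, e.g. T_k = T_0 * alpha^k with 0 < alpha < 1, means that only
    # improving moves are accepted as the temperature approaches zero.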

Variable Neighborhood Search

As there are many ways to form a neighborhood, two questions quickly arise: which neighborhoods to choose, and in what order shall they be applied? One natural approach is to begin with small neighborhoods, such as the O(n^2) ones, and move into larger ones when reaching a local minimum. When applying more than one neighborhood in this fashion, one speaks of a variable neighborhood search procedure, abbreviated VNS.

In this framework, the neighborhood move is often selected at random, which is why it is referred to as shaking, and a weaker intra-route neighborhood, typically some k-opt, is used to make the move locally optimal (Mladenović and Hansen 1997).

However, another popular approach is to choose the best move in each neighborhood, while the neighborhood N_i is instead chosen at random (Penna, Subramanian, and Ochi 2013). This approach will be used as a benchmark later since it is very fast. Methods using the best move in each neighborhood are often referred to as Variable Neighborhood Descent based methods, abbreviated VND.

Large Neighborhood Search

Unfortunately, it does not take long to reach a local minimum with respect to a set of neighborhoods. Instead of ILS or VND based methods built on polynomial sized neighborhoods, one can also consider larger neighborhood structures, but instead of searching through them completely, simply sample from them. This is typically done by first removing a set of nodes and then reinserting them based on some more or less weak solver. The most common is to simply use a greedy or nearest neighbor insertion (Kritzinger et al. 2015) and sample the removed nodes in some stochastic manner.

2.3 Previous work

With these general solution methods in place, we can go through what has been done to make them more adaptive to the problem at hand. We will begin by examining what methods have been used to make the solvers more adaptive to the given problem. Secondly, we will see how offline learning methods have been formalized and used to learn from examples.


2.3.1 Learning methods

Within the framework of local search, a lot of methods have been developed to overcome the fact that the choice of neighborhood and strategy are highly problem sensitive. For the algorithm to be effective on a large number of instances, it has to adapt.

In general, there are two approaches. One can try to generate a heuristic that suits the given problem using some generative method or by selecting an existing heuristic by some selective method (Burke et al. 2013).

We will primarily focus on generative methods since they are most tractable in practice. Developing one heuristic is time-consuming enough, so utilizing the building blocks of it as efficiently as possible is a more practically viable option than developing a set of different heuristics as needed in selective variants.

Generative methods that adapt the choice of neighborhood based on some problem-specific information are referred to as Adaptive Variable Neighborhood Search (Kritzinger et al. 2015). For example, one can vary the choice of neighborhood based on previous success. This can be done in an online fashion, where the probability of choosing neighborhood N_i is updated using the following update rule,

    π_i = ρ x_i / X_i + (1 − ρ) π_i,    P(π_i) = π_i / Σ_j π_j,              (2.6)

where π_i is the probability of neighborhood i, ρ is the learning rate, X_i the number of times i has been selected and x_i its immediate improvement (Stenger et al. 2013).
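A sketch of the update rule (2.6) and the induced roulette-wheel selection; the names follow the equation and everything else is an illustrative assumption.

    import random

    def update_weight(pi_i, x_i, X_i, rho):
        """Equation (2.6): blend the old weight with the average improvement x_i / X_i."""
        return rho * (x_i / X_i) + (1.0 - rho) * pi_i

    def sample_neighborhood(weights):
        """Roulette-wheel selection with P(i) = pi_i / sum_j pi_j."""
        names = list(weights)
        return random.choices(names, weights=[weights[n] for n in names])[0]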

In the LNS based methods, there are similar approaches where the neighborhood is chosen based on how likely a node is to improve the solution if it is part of the sampled neighborhood (Pisinger and Ropke 2007). However, this thesis will not focus on methods of this character.

Among the previous work, few use reinforcement learning to learn a generative local search metaheuristic. One of the few that are similar to this thesis is due to Meignan, Koukam, and Créput 2010, who devised an online metaheuristic using a pool of reinforcement learning actors. The problem formulation was to solve the credit assignment problem of which sequence of neighborhood steps was responsible for improving the best result. As such, each actor learned a mapping of the expected future value Q(s, a) in a state s ∈ S and action a ∈ A. Using this, they extended the roulette wheel formulation of the AVNS into being conditioned on the current state as follows,

    π(a | s) = Q(a, s) / Σ_{a′∈A} Q(a′, s),                                  (2.7)

where the action space A is the set of operators acting on some neighborhood N_a(x) and the state space is the previous action taken, s_t = a_{t−1}. The learning procedure used to learn Q was reinforcement learning. They use two different reward signals, σ_1 for improving the current solution and σ_2 for improving the global best. Moreover, the pool of agents also learns using mimetism, where the best agent's Q-values are shared with the others at the mimetism rate ρ.

All of these methods learn online, making the introduced models essentially equal during the first couple of iterations. In environments with a clear structure and little time, this is not satisfactory. As time almost always is scarce, the question thus arises: if we instead could train offline based on some information about the problem, could we do better within some timeframe?

2.3.2 Offline methods

As pointed out by Andrychowicz et al. 2016, the No Free Lunch Theorem by Wolpert and Macready 1997 shows that in the setting of combinatorial optimization, no algorithm is able to do better than a random strategy in expectation. The only way to improve is by specialization. Let us formalize this specialization into an optimization problem. Given a set of graphs g ∼ G and an optimization algorithm r parametrized with θ, yielding the cost c(g; θ), we wish to solve the minimization problem

    min_θ  E_{g∼G} [ c(g; θ) ].                                              (2.8)

This formulation is not new, but the majority of such formulations are framed as selective, rather than generative, meta-optimization problems. For example, (Rasku, Kärkkäinen, and Musliu 2016) introduced a problem dependent feature space, with geometric as well as algorithmically generated features, and a continuous set of parameters that choose the configuration for the solver. Since the function is not differentiable, Bayesian optimization methods are often used, e.g. SMAC, CMA-ES, F-Race (Groër, Golden, and Wasil 2010) (Hutter, Hoos, and Leyton-Brown 2011). This approach works well for selective heuristics, but is not well suited for generative approaches.

There are, however, a couple of examples of generative algorithms based on reinforcement learning. One of the first was given by Zhang and Dietterich 2000, who used a low-level generative model similar to LNS applied to job shop scheduling. In more recent work, an advanced graph kernel technique based on deep learning and reproducing kernel Hilbert spaces was used to form graph state embeddings to train a fully generative model using Q-learning (H. Dai, Khalil, et al. 2017). This work is the closest to the aims of this thesis regarding the learning goal and setting.

Based on the research done in this thesis, no one has, however, formulated an offline reinforcement learning problem to generate a metaheuristic for the CVRPTW in a local search framework. This is the primary focus of this thesis.


3 Reinforcement learning

In order to formalize the learning objective, the following sections introduce some basic theory and notation that the reader familiar with reinforcement learning may skip, moving directly to chapter 4.

3.1 Markov Decision Process

A Markov Decision Process is a stochastic process in discrete time. The mathematical space this process lives in has some well defined states X and for each state a reward signal r : X × A → R. The agent in this world has some set of actions A and a probability distribution P(y | x, a), assigning the probability of moving into a given state y given the current state x ∈ X and action a ∈ A. The goal in this environment is to find a policy, π : X → A, that maximizes the discounted sum of the rewards R over all time steps t up to T,

    R = Σ_{t=0}^{T} γ^t r_{t+1},                                             (3.1)

where t ∈ [0, T] denotes the current time and γ ∈ [0, 1) the discount factor.

The solution to this problem is called the stationary policy π and may be stochastic, yielding a probability distribution over the actions A given the state X,

    a_t ∼ π(x_t),                                                            (3.2)

where the probability density function is given by π(a | x). For finite time MDPs, the optimal policy is deterministic, but a more general stochastic formulation will be useful later. The key insight needed to solve for the optimal stationary policy, due to Bellman (1954), is often referred to as Bellman's principle of optimality.

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

To formalize this, let us introduce the action-value function Q(x, a; π), the total expected reward from state x and action a onwards, given the policy π,

    Q(x, a; π) = E[ Σ_{t=0}^{T} γ^t r_t | X_0 = x, A_0 = a, π ].

By following π*, the optimal policy, from any time step t onwards, the principle of optimality yields the following recursion relation,

    Q(x_t, a_t; π*) = E[ r_t + γ Q(x_{t+1}, a_{t+1}; π*) | X_0 = x, A_0 = a, π* ].

The optimal deterministic policy can then be stated through the optimal action-value function Q*(x, a). The principle of optimality thus dictates that we must follow the path generated by choosing the action that maximizes the expected value in state x,

    π(a | x) = δ_{a,s},   s = argmax_{a∈A} Q*(x, a),

where δ_{a,s} is the Kronecker delta function.

The expected value can be defined in terms of the transition probability P(x, a, y), the expected reward r(x, a), and the state value function V*(x) as follows,

    Q*(x, a) = r(x, a) + γ Σ_{y∈X} P(x, a, y) V*(y),
    V*(x) = sup_{a∈A} Q*(x, a).


We can now express the optimal policy π* ∈ Π_stat as a function of Q* or V*,

    Σ_{a∈A} π*(a | x) Q*(x, a) = V*(x),   ∀ x ∈ X.                           (3.3)

It is now clear that knowledge of either Q* or V* is sufficient to find an optimal stationary policy π*.

3.1.1 Value iteration

The easiest way of solving for π* is through the value function V*(x) in the Bellman equation,

    V*(x) = max_{a∈A} [ r(x, a) + γ Σ_{y∈X} P(x, a, y) V*(y) ].

This functional equation may be represented using the non-linear operator T, with T ∘ V*(x) = V*(x). The function V* is thus a fixed point of the Bellman operator T. The question is then: does V* always exist and how can we find it?

First, let us conclude that T is continuous and bounded, since the max operator is continuous and bounded and the rewards are finite. Moreover, it can be shown that T is a so-called contraction mapping, so that for any two points f, g in a Banach space of value functions there exists a 0 ≤ λ < 1 such that

    |T f − T g| ≤ λ |f − g|.

Hence, by the fixed point theorem (Goebel and Kirk 1972), there exists a unique solution V* that can be found by simply iteratively applying the operator T to some starting point until the fixed point is reached,

    T(V_k) = V_{k+1}.                                                        (3.4)

Similarly, an action-value iteration can be used to find the optimal Q* in the same manner, T(Q_k) = Q_{k+1} (Sutton and Barto 1998). This process is often referred to as value iteration and is the central component in all value based algorithms presented in the next section.
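A tabular value iteration sketch that repeatedly applies the Bellman operator T until the fixed point is approximately reached; the dictionary encoding of P and r is an assumption for illustration.

    def value_iteration(states, actions, P, r, gamma=0.9, tol=1e-8):
        """Iterate V <- T V with (T V)(x) = max_a [ r[x][a] + gamma * sum_y P[x][a][y] * V(y) ],
        where P[x][a] maps successor states y to probabilities."""
        V = {x: 0.0 for x in states}
        while True:
            V_new = {
                x: max(r[x][a] + gamma * sum(p * V[y] for y, p in P[x][a].items())
                       for a in actions)
                for x in states
            }
            if max(abs(V_new[x] - V[x]) for x in states) < tol:
                return V_new
            V = V_new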

3.1.2 Policy iteration

Given an initial policy π_0 with the induced value function V_{π_0}, we can always find a policy π_1 that is at least as good, V_{π_1} ≥ V_{π_0}, by increasing the probability of the best action a* in state x. This is the main idea of policy iteration methods.

In practice, the first step is to solve for the value function under the current policy π_0; this is a |X| × |X| sized linear system of equations,

    V(x; π_0) = R(x, π_0(x)) + γ Σ_y P(x, π_0(x), y) V(y; π_0).

When the value function V_{π_0} is found, one can always do at least as well by choosing π_1 as follows,

    π_1(x) = argmax_a Σ_{y∈X} P(x, a, y) [ R(x, a, y) + γ V(y; π_0) ].

This two-stage iterative process is the basis for the more advanced methods introduced in the next section.
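The corresponding policy iteration sketch: exact evaluation of the current policy by solving the linear system above, followed by greedy improvement; the numpy array encoding of P and R is an assumption.

    import numpy as np

    def policy_iteration(P, R, gamma=0.9):
        """P[a, x, y] is the transition probability, R[x, a] the expected reward.
        Returns a deterministic policy (action index per state) and its value function."""
        A, X, _ = P.shape
        policy = np.zeros(X, dtype=int)
        while True:
            # Policy evaluation: solve (I - gamma * P_pi) V = R_pi.
            P_pi = P[policy, np.arange(X), :]
            R_pi = R[np.arange(X), policy]
            V = np.linalg.solve(np.eye(X) - gamma * P_pi, R_pi)
            # Policy improvement: greedy one-step lookahead.
            Q = R.T + gamma * P @ V                    # shape (A, X)
            new_policy = Q.argmax(axis=0)
            if np.array_equal(new_policy, policy):
                return policy, V
            policy = new_policy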

3.1.3 Observability

For fully observed MDPs as introduced here, if there is an optimal policy, there is a deterministic optimal policy. However, if the MDP state is not completely Markov, that is, it is not a sufficient statistic for the future, a stochastic policy may be optimal. Consider for example the game of rock paper scissors; there is typically no state that can with certainty determine the next choice of the opponent. Hence, the optimal policy is a uniform distribution over the actions.

Another property of partially observed environments is that the model of the world is typically not given in terms of a reward function R or transition probabilities P(x, y, a) over some enumerable and discrete set of states and actions. Thus, to find the optimal policy π*, other methods have to be used than the ones presented thus far.

The next section will go through ways of finding deterministic as well as stochastic policies without knowing anything about the world.

We still need observations of some states x, which may or may not be partially observed, and corresponding reward observations r from choosing some discrete action a.


3.2 Model free learning

3.2.1 Introduction

A more real-world like environment is typically such that we have a set of actions a ∈ A and we can sample states x ∼ X and rewards r ∼ R(x) from the environment, but we do not know the actual model governing the rewards and transitions. Just as before, the goal is to find the optimal policy π*.

There are two ways to proceed: either model free, by learning a policy directly without learning a model, or model based, by first learning a model and then using the previous section's methods to find the best policy for it. We will focus on model free methods, even though both are applicable. The advantage of model free methods is that almost no domain specific knowledge has to be incorporated into the model.

As stated in the previous section, knowledge of either Q* or V* is sufficient to find the optimal strategy. The simplest possible way is to learn either one of them by Monte Carlo sampling.

By definition, V^π(x) is the expected return starting from state x. By sampling the actions a_t ∼ π(x_t) and rewards for each period [0, . . . , T] from the policy π, N times, so that every state-action pair is visited, we get an unbiased estimate,

    V^π(x) = (1/N) Σ_{n=1}^{N} Σ_{t=0}^{T} γ^t r(x_t, a_t; π, x_0 = x),

that converges asymptotically by the law of large numbers. The policy may then be learned using policy iteration. However, this is typically not viable for most applications, since the state and action spaces may be continuous, and more efficient methods exist.

A more efficient way is to estimate the value function using bootstrapping,

    V(x_t) = r_t + γ r_{t+1} + . . . + γ^n V(x_{t+n}).

This bootstrapping is often referred to as temporal difference learning. Moreover, methods based on the simplest case, n = 1, are often referred to as one-step methods and will be used in this thesis, since they are efficient to use where the state space is large (Sutton and Barto 1998).

3.2.2 Exploration versus exploitation

Both Monte Carlo based methods and temporal difference methods illuminate the need to maintain some degree of exploration. Since the optimal policy may be completely deterministic and both methods need to visit all states, the sampling based on the current π cannot also be completely deterministic, since that typically entails not visiting all states. When possible, the easiest solution to this is to simply vary the start state x_0. But when only interacting with the environment, this may not always be possible. For this thesis, we make use of policies that enable some tradeoff between moving towards the optimal policy and exploring the state space so that the optimal policy can be detected.

An alternative approach to make sure sufficiently many state-action pairs are visited is to introduce some positive probability ϵ > 0 for each action during learning. By annealing ϵ towards zero, the optimal deterministic policy may be found in the limit, as long as ϵ is annealed sufficiently slowly (Sutton and Barto 1998).

3.2.3 Off policy versus On policy

By distinguishing between the policy used during learning, π_l, and some target policy, π_t, we can differentiate between two types of learning methods. Methods where π_t is learned such that it is independent of π_l are called off-policy methods. These are often based on the deterministic target policy π ∼ argmax_a Q(x, a). Methods where the target policy and the learning policy are the same are called on-policy.

The main difference between off-policy and on-policy algorithms is that on-policy algorithms include the cost of exploration in the policy, while off-policy algorithms do not. The most illustrative example of the difference is the cliff walking problem. As seen in Figure 3.1, the on-policy learner typically learns a safer path than the off-policy learner, which will insist on taking the dangerous but optimal path to the end. In the limit where the exploration is annealed sufficiently slowly towards the greedy policy, they are both equal (Sutton and Barto 1998).


Figure 3.1: The difference between off-policy and on-policy solutions illustrated by the cliff walk example.

3.2.4 Tabular Q-learning

One of the most famous and influential learning algorithms is known as Q-learning (Watkins and Dayan 1992). It is an off-policy algorithm with the fixed greedy target policy π(x) ∼ argmax_{a∈A} Q(x, a). The classical formulation, based on discrete states and a tabular Q(s, a) function, is stated below,

    Q(s_t, a_t) = r_{t+1} + γ max_a Q(s_{t+1}, a).                           (3.5)

By updating the Q-values with some learning rate α, the procedure has been shown to converge to optimality for finite and discrete state spaces as long as every state is visited infinitely often (Littman 1996).

The complete algorithm is found in Algorithm 1.

Algorithm 1 Q-learning

1: procedure One-step Q-learning
2:   initialize Q(x, a) arbitrarily
3:   for all episodes do
4:     Choose a from π_l(x)
5:     Take action a
6:     Observe r, y
7:     Q(x, a) ← Q(x, a) + α (r + γ max_{a′} Q(y, a′) − Q(x, a))
8:     x ← y
9:   end for
10: end procedure
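A runnable sketch of Algorithm 1 with an ϵ-greedy learning policy; the environment interface (reset() and step(a) returning the next state, the reward and a done flag) is an assumption in the style of common RL toolkits, not the thesis' implementation.

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes, alpha=0.1, gamma=0.95, eps=0.1):
        """One-step off-policy Q-learning with epsilon-greedy exploration."""
        Q = defaultdict(float)                          # Q[(state, action)], zero-initialized
        for _ in range(episodes):
            x, done = env.reset(), False
            while not done:
                if random.random() < eps:               # explore
                    a = random.choice(actions)
                else:                                   # exploit the current greedy action
                    a = max(actions, key=lambda u: Q[(x, u)])
                y, r, done = env.step(a)
                target = r + gamma * max(Q[(y, u)] for u in actions)
                Q[(x, a)] += alpha * (target - Q[(x, a)])
                x = y
        return Q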


3.2.5 Tabular Sarsa

The on-policy version of the Q-learning algorithm is very similar. The only difference is that the updates are done based on the current policy,

    Q(s_t, a_t) = r_{t+1} + γ Q(s_{t+1}, a_{t+1}).                           (3.6)

As discussed in the section about on-policy versus off-policy methods, the optimal solution can still be found, but in practice there is typically a difference in the final policy. The complete algorithm is found in Algorithm 2.

Algorithm 2 SARSA

1: procedure One-step SARSA
2:   initialize Q(x, a) arbitrarily
3:   for all episodes do
4:     Choose a from π(x)
5:     Take action a
6:     Observe r, y
7:     Sample next action a′ ∼ π(y)
8:     Q(x, a) ← Q(x, a) + α (r + γ Q(y, a′) − Q(x, a))
9:     x ← y
10:   end for
11: end procedure

3.2.6 Deep Q-Networks

When applied to large state spaces, problems arise with the tabular methods. For one, memory is limited, so storing all state-action pairs in memory may not be feasible. More importantly, it may not be possible to reach all state-action pairs within a reasonable amount of time.

The solution is a representation that generalizes in some fashion, so that not all states need to be enumerated. The most common way to achieve this is to use function approximation methods or space-filling tiles. We will primarily consider function approximation methods on continuous state spaces, where the states are embedded in some vector space R^n.

The Q-learning rule is still valid; the main difference is that we now introduce some parameterization θ of the Q function,

    Q(x_t, a; θ) = r_{t+1} + γ max_a Q(x_{t+1}, a; θ),                       (3.7)

but in order to learn the parameters, we need to set up an objective function to minimize.

The objective function L is typically chosen to be the MSE due to its favorable bias-variance decomposition,

    L = ( r + γ Q(x_{t+1}, a_{t+1}) − Q(x_t, a_t) )^2.                       (3.8)

The error terms in the loss function are often referred to as the temporal difference error,

    δ_t = r_{t+1} + γ max_a Q(x_{t+1}, a) − Q(x_t, a_t).                     (3.9)

Unfortunately, the learning algorithm used in the tabular case diverges when using a non-linear function approximator such as a neural network. This is due to the correlations between the state, action and predicted Q-values, making the target non-stationary.

One way to make the regression problem stationary is to sample the training observations in an independent manner through a simple shuffling of the training points. This is often done using something called experience replay. It is a queue with a fixed length, where new training data are added and the oldest are removed in a first-in, first-out manner. By sampling the data in the queue uniformly, the target becomes more stationary and the training is then more prone to converge.

To speed up learning, one can bias the uniform sampling into one accounting for the significance of each training example. Using a sampling weight α we get

    P(i) = p_i^α / Σ_i p_i^α,                                                (3.10)

where α = 0 corresponds to the uniform case. We use a priority proportional to the temporal difference error, Equation 3.9, p_i = |δ_i| + ϵ, where ϵ is added to ensure a positive probability for all values.

To correct for the bias in training, we use importance sampling weights as follows,

    w_i = ( 1 / (N P(i)) )^β,                                                (3.11)

where β = 1 fully compensates for the bias (Schaul et al. 2015).
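A sketch of proportional prioritized sampling with the importance-sampling correction of Equations (3.10) and (3.11); the buffer layout and the normalization by the maximum weight (a common practical choice) are assumptions.

    import random

    def sample_batch(td_errors, batch_size, alpha=0.6, beta=1.0, eps=1e-6):
        """td_errors[i] = delta_i for each stored transition; returns sampled indices
        and their normalized importance-sampling weights."""
        p = [(abs(d) + eps) ** alpha for d in td_errors]     # p_i = (|delta_i| + eps)^alpha
        total = sum(p)
        probs = [pi / total for pi in p]                     # P(i), Equation (3.10)
        idx = random.choices(range(len(p)), weights=probs, k=batch_size)
        N = len(p)
        weights = [(1.0 / (N * probs[i])) ** beta for i in idx]   # Equation (3.11)
        w_max = max(weights)
        return idx, [w / w_max for w in weights]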

Finally, in standard Q-learning there is a tendency to overestimate the future reward, which may cause divergence or sub-optimal policies (Thrun and Schwartz 1993). Luckily, there is a way to fix this issue that was introduced by Van Hasselt, Guez, and Silver 2016. By adding a new set of parameters, θ′, that is not used in the prediction but only in the update rule, we get a less biased estimate of the Q-values. The result holds as long as θ′ is updated sufficiently slowly so that the correlation to θ is minimized. The update rule is thus adjusted to the following,

    Y_t = R_{t+1} + γ Q( x_{t+1}, argmax_a Q(x_{t+1}, a; θ_t); θ′_t ).

The neural network used to parametrize the function approximation is a function of the form

    f_j(x) = a( θ_j + Σ_i w_{i,j} x_i ),

for every layer j in the network, where a is some activation function. The complete neural network is thus a composition of the layers, making the output of one layer the input to the next,

    . . . f_{j+1} ∘ f_j ∘ f_{j−1} . . .

We will primarily focus on a two-layer architecture, x → f_1 ∘ f_0(x). The parameters w_{i,j} ∈ R^n are often referred to as the neurons and θ_j ∈ R as the bias terms. The activation functions are chosen such that they provide an easy first order derivative, making derivatives fast to compute. One of the most well used is the sigmoid function,

    a_sigmoid(x_i) = 1 / (1 + e^{−x_i}).

It has been shown that a neural network with sigmoid activation functions, given enough neurons, can approximate any continuous and bounded function (Cybenko 1989). This is often referred to as the universal approximation theorem. However, the result relies upon the fact that a bound for the function can be found and used to translate the function onto the unit sphere. For this thesis, we cannot do this since the range of the Q-values is not known in advance, but we can use sigmoid activations for the hidden layers while the last layer is chosen to be the identity, a(x_j) = x_j, to maintain an unbounded range.
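A minimal numpy sketch of the two-layer architecture x → f_1 ∘ f_0(x) described above, with a sigmoid hidden layer and an identity output layer; the layer sizes are arbitrary and only the forward pass is shown.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def q_network(x, W0, b0, W1, b1):
        """f_0: sigmoid hidden layer, f_1: identity output layer (one Q-value per action)."""
        h = sigmoid(W0 @ x + b0)                        # f_0(x) = a(theta_0 + W_0 x)
        return W1 @ h + b1                              # f_1(h) = theta_1 + W_1 h, unbounded range

    # Example with 4 state features, 8 hidden units and 3 actions (arbitrary sizes):
    rng = np.random.default_rng(0)
    W0, b0 = rng.normal(size=(8, 4)), np.zeros(8)
    W1, b1 = rng.normal(size=(3, 8)), np.zeros(3)
    print(q_network(rng.normal(size=4), W0, b0, W1, b1))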


3.2.7 Actor critic

Since the environment can be partially observed, the optimal policy may be non-deterministic. Moreover, a direct policy search through stochastic policies is not only applicable in partially observed environments; in fact, it can be more efficient in approaching a deterministic policy than a value iteration based deterministic policy search. Moreover, since the policy is continuous, it is also more efficient in handling continuous action spaces.

The main idea behind model free policy search methods is as follows. Take some well-chosen parametrization of a policy π(x, a; θ) and the corresponding expected reward function J(θ). This expected reward can be reformulated as

    J(θ) = E_π[r] = Σ_{x∈X} ρ(x) Σ_{a∈A} π_θ(x, a) R(x, a),

where ρ(x) is some distribution over the states x and R(x, a) is the reward of starting in state x and choosing action a.

Let us proceed by taking the gradient of J(θ) and relating it to the instantaneous reward R(x, a),

    ∇_θ J(θ) = ∇_θ Σ_x ρ(x) Σ_a π(x, a) R(x, a)
             = Σ_x ρ(x) Σ_a π(x, a) (∇_θ π(x, a) / π(x, a)) R(x, a)
             = Σ_x ρ(x) Σ_a π(x, a) ∇_θ log π(x, a) R(x, a)
             = E[ ∇_θ log π(x, a) R(x, a) ],

where the expectation is taken over x ∼ ρ and a ∼ π.
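A sketch of the resulting score-function estimate E[∇_θ log π(x, a) R(x, a)] for a softmax policy with a linear score per action; the parametrization is an assumed example, not the architecture used later in the thesis.

    import numpy as np

    def softmax_policy(theta, x):
        """pi(a | x) for a linear score per action: theta has shape (A, n), x shape (n,)."""
        z = theta @ x
        z -= z.max()                                    # numerical stability
        p = np.exp(z)
        return p / p.sum()

    def policy_gradient_estimate(theta, samples):
        """Average of grad_theta log pi(x, a) * R over sampled (x, a, R) triples."""
        grad = np.zeros_like(theta)
        for x, a, R in samples:
            p = softmax_policy(theta, x)
            grad_log = -np.outer(p, x)                  # -pi(a'|x) * x for every action a' ...
            grad_log[a] += x                            # ... plus x for the action actually taken
            grad += R * grad_log
        return grad / len(samples)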

For any multistep MDP, the total average reward can be shown to follow the same relationship, where the one step reward is exchanged for the future reward R(x, a) = Q^π(x, a; θ). However, by using a bootstrapped value of Q^π, bias may be introduced. Luckily, if the parametrization of π(x, a; θ) is chosen so that it is compatible with the value function V^π(x; w),

    ∇_w Q(x, a; w) = ∇_θ log π(x, a; θ),

and the value function is chosen so as to minimize the MSE,

    V_w : min_w E[ (Q^θ(x, a) − Q^w(x, a))^2 ],
