Linköpings universitet

### Linköping University | Department of Computer and Information Science

### Master’s thesis, 30 ECTS | Computer science

### 2021 | LIU-IDA/LITH-EX-A--21/035--SE

## Combinatorial Optimization with Pointer Networks and Reinforcement Learning

*Kombinatorisk Optimering med Pointer Networks och Reinforcement Learning*

**Axel Holmberg**

**Wilhelm Hansson**

Supervisor: George Osipov

Examiner: Cyrille Berger


**Copyright**

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Axel Holmberg, Wilhelm Hansson

**Abstract**

Given the complexity and range of combinatorial optimization problems, solving them can be computationally easy or hard. There are many ways to solve them, but all available methods share a problem: they take a long time to run and have to be rerun when new cases are introduced. Machine learning could prove a viable solution to solving combinatorial optimization problems due to the possibility for models to learn and generalize, eliminating the need to run a complex algorithm every time a new instance is presented. Uniter is a management consulting firm that provides services within product modularization. Product modularization results in the possibility for many different product variations to be created based on customer needs. Finding the best combination given a specific customer's need will require solving a combinatorial optimization problem. Based on Uniter's need, this thesis sought to develop and evaluate a machine learning model consisting of a Pointer Network architecture and trained using Reinforcement Learning.

The task was to find the combination of parts yielding the lowest cost, given a use case. Each use case had different attributes that specified the need for the final product. For each use case, the model was tasked with selecting the most appropriate combination from a set of 4000 distinct combinations. Three experiments were conducted: examining if the model could suggest an optimal solution after being trained on one use case, if the model could suggest an optimal solution of a previously seen use case, and if the model could suggest an optimal solution of an unseen use case. For all experiments, a single data set was used. The suggested model was compared to three baselines: a weighted random selection, a naive model implementing a feed-forward network, and an exhaustive search.

The results showed that the proposed model could not suggest an optimal solution in any of the experiments. In most tests conducted, the proposed model was significantly slower at suggesting a solution than any baseline. The proposed model had high accuracy in all experiments, meaning it suggested almost only feasible solutions in the allowed solution space. However, when the model converged, it suggested only one combination for every use case, with the feed-forward baseline showing the same behavior. This behavior could suggest that the model misinterpreted the task and identified a solution that would work in most cases instead of suggesting the optimal solution for each use case. The discussion concludes that an exhaustive search is preferable for the studied data set and that an alternate approach using supervised learning may be a better solution.

**Acknowledgments**

We want to thank our supervisor George Osipov for insightful discussions throughout the thesis work and our company supervisor Edvin Modigh for providing any material we needed as well as insightful discussions from Uniter's perspective.

**Contents**

**Abstract**

**Acknowledgments**

**Contents**

**List of Figures**

**List of Tables**

**1 Introduction**

1.1 Motivation

1.2 Aim

1.3 Research questions

1.4 Delimitations

**2 Theory**

2.1 Combinatorial Optimization

2.2 Introduction to Neural Networks

2.3 Embeddings

2.4 Training a Neural Network

2.5 Recurrent Neural Networks

2.6 Sequence to Sequence (seq2seq)

2.7 Attention mechanism

2.8 Pointer net (Ptr-Net)

2.9 Reinforcement Learning

**3 Method**

3.1 Ensuring replicability

3.2 Data set

3.3 Preprocessing of data set

3.4 Model

3.5 Training

3.6 Baselines

3.7 Experiments

3.8 Evaluation

3.9 Technical Setup

**4 Results**

4.1 Experiment 1

4.2 Experiment 2

4.3 Experiment 3

**5 Discussion**

5.1 Results

5.2 Method

5.3 The work in a wider context

5.4 Future work

**6 Conclusion**

**Bibliography**

**Appendices**

**A Graphs loss functions**

A.1 Training on an individual use case

A.2 Training on multiple use cases

**List of Figures**

2.1 A neuron (unit) in a neural network

2.2 Feed-forward Neural Network

2.3 Sigmoid activation function

2.4 ReLU activation function

2.5 The procedure used to convert output values to probabilities

2.6 Illustration of representing data as one-hot encoded or as embeddings

2.7 Learning rate that is too small

2.8 Learning rate that is too large

2.9 Good learning rate

2.10 A Recurrent Neural Network in its compact state and unfolded state with loss function

2.11 LSTM cell

2.12 A seq2seq RNN architecture with an encoder, a decoder and the context

2.13 Seq2seq architecture vs Pointer Network solving the same problem

2.14 The Reinforcement Learning setup

2.15 The actor-critic architecture

3.1 Processing of one sequence step of the input

3.2 How the encoder receives its input

3.3 Architecture of actor model

3.4 Architecture of critic model

3.5 From uniform to weighted probability distribution

3.6 Actor model for the baseline feed-forward network

3.7 Critic model for the baseline feed-forward network

4.1 Probability distributions for individual tests in experiment 1

4.2 The probability distribution presented by the model in experiment 2

4.3 Distribution of optimality gaps in experiment 2

4.4 The probability distribution presented by the model in experiment 3

4.5 Distribution of optimality gaps in experiment 3

5.1 The set of optimal options as a subset of all options

A.1 Graphs showing losses when training individual use cases experiment test number 1

A.2 Graphs showing losses when training individual use cases experiment test number 2

A.3 Graphs showing losses when training individual use cases experiment test number 3

A.4 Graphs showing losses when training individual use cases experiment test number 4

A.5 Graphs showing losses when training individual use cases experiment test number 5

A.6 Graphs showing losses when training individual use cases experiment test number 6

A.7 Actor Loss for Experiment 2

A.8 Critic Loss for Experiment 2

**List of Tables**

3.1 Three use cases with two attributes each

3.2 Table with examples of three performance series

3.3 Hyperparameters for the model

3.4 Hyperparameters for training the network

3.5 Summary and comparison of experiments conducted

3.6 Specifications of computers used for conducting the experiments

3.7 Packages and libraries with version numbers for Computer 1

3.8 Packages and libraries with version numbers for Computer 2

4.1 Results from six individual tests using the proposed model

4.2 Results from six individual tests when using Weighted Random Selection

4.3 Optimality gap for individual tests for exhaustive search. Lower is better.

4.4 Optimality gap for individual tests with feed-forward baseline. Lower is better.

4.5 Average execution time for the model and each baseline in Experiment 1

4.6 Consolidated results from six individual tests using the proposed model and the three baselines

4.7 Average execution time for the model and each baseline in Experiment 2

**1 Introduction**

Problems where one wants to find an optimal solution in a finite set of possibilities are more formally called combinatorial optimization problems (COPs). These problems appear everywhere, from packing a knapsack to more complex problems such as large-scale transportation and route planning. Many of these everyday combinatorial optimization problems share the characteristic that a suggested solution can be verified in polynomial time. However, finding the correct or optimal solution often requires a search of the whole solution space.

Combinatorial problems also occur in processes and decision-making within manufacturing industries. Customers of companies working within manufacturing seek the optimal setup of a combination of key features of a product fulfilling their requirements. The optimal configuration can be defined by different criteria, with the most obvious being the combination that yields the lowest cost while satisfying the given requirements.

A simple example of this would be a powered drill. When developing a battery-powered drill, one would first look at a few key factors that are essential for the end-user of the drill. Some examples of key end-user factors would be how long the drilling session will be, how often the drill will be used, and how strong the person drilling is. From these factors, one can then calculate how large the battery needs to be, how short the recharge time should be, and how heavy the battery should be. These factors will give a large number of combinations of different setups that might be required by the users of the drill. One could go for a one-size-fits-all approach with just one battery size. However, a better approach in many cases is to design modular products where there are several options for the battery size so that each user can choose the size they need.

In the case of a drill, there might not be many parts subject to variation, and searching the whole solution space for the optimal solution would be computationally easy. However, as one can imagine, finding the optimal solution becomes harder when increasing the number of choices and constraints, as all parts may not fit with each other. To tackle this problem, we choose to turn to the field of machine learning and, more specifically, Pointer Networks, recurrent neural networks (RNNs), and deep reinforcement learning (DRL), since these methods have previously been used to solve similar combinatorial optimization problems [2, 23, 28].

**1.1 Motivation**

Uniter is a management consulting firm that provides services within product modularization. The result of product modularization is the possibility to reuse the components a product is made of when designing new products. It lowers R&D costs in product development, allows for greater product diversification, and tailors products to customer needs. However, since modular components allow for different combinations to be generated, modularization introduces a new problem when a company is in the phase of selecting components for a product. Multiple combinations can fit a single use case, and one must then choose which combination to use. If the objective is the cheapest combination, one needs to find the combination that satisfies it. This task can be defined as a combinatorial optimization problem, and it can be solved using dedicated solvers or methods such as exhaustive search. However, since the solution to one such optimization cannot be generalized, the optimization would need to be performed for each new use case introduced. When the solution set is large, it will take a lot of time to get the optimal solution for every use case. New use cases and large solution sets are typical of the situations Uniter and its customers face. Uniter is looking for a solution that can generalize in order to decrease calculation and response time.

**1.2 Aim**

This study aims to identify if a machine learning approach using reinforcement learning and Pointer Networks can solve a combinatorial optimization problem at Uniter.

**1.3 Research questions**

1. How well does a reinforcement learning agent using a Pointer Network for policy generation perform when trained on a single instance of this problem? Does it find an optimal solution? If not, how good is the approximation it obtains?

2. How well does a reinforcement learning agent using a Pointer Network for policy generation perform when trained on multiple instances of this problem? Does it find optimal solutions? If not, how good is the approximation it obtains?

3. How well does a reinforcement learning agent using a Pointer Network for policy generation trained on multiple instances of this problem generalize to unseen samples?

**1.4 Delimitations**

The data set used in this report belongs to Uniter and can therefore not be displayed or revealed in full. A theoretical example and the structure of the data are explained to ensure the possibility of future experiments.

In addition, the model is only tested on one data set due to time constraints. Multiple data sets would have been beneficial for drawing more definitive conclusions regarding trends that are visible in the results. Multiple data sets would also remove the uncertainty about whether the data set itself is the cause of some aspects of the results.

Also, the study only focuses on one method to solve the problem, even though there may be several viable machine learning solutions to the problem.

**2 Theory**

This chapter explains the underlying theory for the method used to carry out the different experiments. It starts with a section explaining the basics of neural networks and then describes the types of neural networks used in this report, mainly Recurrent Neural Networks and Pointer Networks. It goes on to explain what Reinforcement Learning is and how it works. It ends by describing how one can combine these two techniques and brings up a few related works that have used the same setup to solve similar problems.

**2.1 Combinatorial Optimization**

Combinatorial optimization is a branch of applied mathematics that combines combinatorics, linear programming, and algorithm theory to identify the optimal solution in a finite solution set. As the name implies, combinatorial optimization problems (COPs) are discrete, and COPs appear in many different fields, such as production, transportation, and communication network design [8]. As explained by W.J. Cook [8], many combinatorial optimization problems share the property that checking a given solution for correctness can be done in polynomial time relative to the input size. However, this does not imply that the optimal solution can be found in polynomial time in most cases. For many COPs, deciding whether an instance admits a solution of a specific cost is an NP-complete problem [12]. The corresponding optimization problems are then NP-hard, and in computational complexity theory, NP-hard problems are widely believed not to be solvable in polynomial time [12].
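The asymmetry between checking one solution and finding the best one can be illustrated with a minimal Python sketch. It enumerates every combination of a toy modular product and keeps the cheapest feasible one. The module names, costs, capacities, and the capacity requirement are hypothetical and chosen only for illustration. They are not taken from the Uniter data set.

```python
from itertools import product

# Hypothetical options per module: (name, cost, capacity contribution).
BATTERIES = [("small", 10, 2), ("medium", 18, 4), ("large", 30, 7)]
MOTORS = [("basic", 15, 1), ("strong", 25, 3)]


def is_feasible(combo, required_capacity):
    """Verifying one candidate is cheap: a single pass over its parts."""
    return sum(part[2] for part in combo) >= required_capacity


def cost(combo):
    """Total cost of a combination of parts."""
    return sum(part[1] for part in combo)


def exhaustive_search(required_capacity):
    """Visit every combination and return the cheapest feasible one."""
    feasible = [combo for combo in product(BATTERIES, MOTORS)
                if is_feasible(combo, required_capacity)]
    return min(feasible, key=cost) if feasible else None


best = exhaustive_search(required_capacity=6)
print([part[0] for part in best], cost(best))
```

With two modules the search space has only six combinations. The number of combinations grows multiplicatively with every added module, which is what makes exhaustive search expensive for larger products.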

**2.2 Introduction to Neural Networks**

To understand the essential parts of the method of this report, one first has to understand neural networks. Neural Networks (NNs) are networks of units (neurons), inspired by the human brain. Feed-forward NNs are directed acyclic graphs consisting of multiple layers, each emulating a function [13]. For example, in a network with three layers, see Figure 2.2, there is one input layer and one output layer, as well as a hidden layer in between. This gives three functions $f^{(1)}$, $f^{(2)}$, and $f^{(3)}$, where $f^{(1)}$ is the transformation applied in the first (input) layer, $f^{(2)}$ in the second (hidden) layer, and $f^{(3)}$ in the third (output) layer. The neural network can be seen as a composition of these functions, $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$, which takes an input $x$ and outputs a result. An example would be mapping an input $x$ to a category $y$ with the function $y = f(x; \theta)$, where $\theta$ represents the weights and biases, explained below. In this case, one wants to learn the parameters $\theta$ that achieve a good approximation of $y$.

Figure 2.1: A neuron (unit) in a neural network

Figure 2.2: Feed-forward Neural Network

Each unit in a neural network consists of two parts. The first part is a linear function,

$$a = \sum_{i=1}^{D} w_i x_i + b_0, \tag{2.1}$$

where $x_1, \ldots, x_D$ are the inputs, $w_1, \ldots, w_D$ are the weights, and $b_0$ is the bias, see Figure 2.1. The output $a$, which is a linear combination of the inputs and parameters, is called the activation. The second part is an activation function, which according to Bishop [4] helps the network learn and make sense of the mapping between the inputs and the corresponding outputs. The general equation of an activation function is

$$z = h(a), \tag{2.2}$$

where $h$ is a non-linear function.
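As a small illustration of Equations 2.1 and 2.2, the sketch below computes the output of a single unit in plain Python. The input values, weights, and bias are hypothetical, and `math.tanh` is used only as one example of a non-linear function $h$.

```python
import math


def unit_output(x, w, b0, h):
    """Compute z = h(a) with a = sum_i w_i * x_i + b0."""
    a = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b0
    return h(a)


# Hypothetical inputs, weights, and bias for a unit with D = 3 inputs.
x = [1.0, 2.0, -1.0]
w = [0.5, -0.25, 0.5]
b0 = 0.5

z = unit_output(x, w, b0, h=math.tanh)
print(z)
```

A full layer repeats this computation once per unit, each unit having its own weights and bias.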

Each activation function has its strengths and weaknesses, and in a neural network, multiple activation functions of different sorts can be used. Two activation functions are mainly used in this report: the rectified linear unit (ReLU) and, for the final output layer, the sigmoid. ReLU is a non-linear function that only activates the neuron when the output of the linear function is greater than zero, see Figure 2.4. The equation for ReLU is $h(a) = \max(0, a)$. Sigmoid is a non-linear activation function that scales the input value to a range between zero and one, see Figure 2.3. Unlike ReLU, it also gives a non-zero value for negative inputs. The equation for sigmoid is $h(a) = 1/(1 + e^{-a})$.

Figure 2.3: Sigmoid activation function

Figure 2.4: ReLU activation function
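The two activation functions can be written directly from their equations. The following is a minimal sketch in plain Python, not the implementation used in the experiments, and it is not hardened against overflow for very large negative inputs.

```python
import math


def relu(a):
    """h(a) = max(0, a): passes positive values, blocks negative ones."""
    return max(0.0, a)


def sigmoid(a):
    """h(a) = 1 / (1 + exp(-a)): squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


for a in (-2.0, 0.0, 2.0):
    print(a, relu(a), round(sigmoid(a), 3))
```

For the negative input, ReLU returns zero while the sigmoid still returns a small positive value, which is the difference described above.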

The raw output values of the units in the final layer, before they are normalized, are commonly called logits and can be used as input for a SoftMax function to generate a probability distribution over the output values. More generally, a SoftMax function can take a vector of arbitrary real values and create a probability distribution, see Figure 2.5.

Figure 2.5: The procedure used to convert output values to probabilities.
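The conversion from arbitrary output values to probabilities can be sketched as follows. The input values are hypothetical, and subtracting the maximum before exponentiating is a common numerical safeguard that does not change the result.

```python
import math


def softmax(values):
    """Map arbitrary real values to a probability distribution."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

The resulting values are positive, sum to one, and preserve the ordering of the inputs, so the largest input receives the highest probability.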

The number of units in each layer varies with the purpose of the model and is typically found through trial and error, by checking which number achieves the best overall model performance. The input layer must have the same number of units as there are values in each input data point, and thereby depends on how the data is structured. Depending on the purpose of the network, the output layer can consist of multiple units or a single unit. For example, when the network is used for predicting actions, the output layer may have the same number of units as the number of labels to be predicted. The number of units in the hidden layer(s) is generally a matter of tuning to see what fits the model.

**2.3 Embeddings**

Independent of type, a neural network expects the input to be a vector containing continuous numerical values. The vector could either have one or multiple dimensions based on the problem at hand. Categorical features do not fulfill this requirement, and they present a challenge when used as input for a neural network.

D. Jurafsky and J. H. Martin [17] present different ways of encoding categorical data. One approach presented by the authors is one-hot encoding, where a vector with the same length as the number of values or categories is created. Any given value is then represented by a vector in which all positions are zero, except for the index representing the value, which is marked with a one. However, the authors highlight some issues with this approach. One problem is that there will be a lot of sparse vectors. Another is that if there are relationships between the values in a given set, a one-hot encoding fails to represent them. Lastly, there is the problem of a fixed-size vocabulary: if one wants to add a new value to a given set, the representation of each value would need to be recalculated.

Another approach explained by Jurafsky and Martin [17] uses the concept of embeddings. Here, one decides on a vector of fixed length. Each value is then represented as one instance of this vector, with continuous numerical values in more than one of the vector positions. A textbook example of this approach is encoding words to be represented in a high-dimensional space. The continuous numerical values of each vector are tuned as part of the training process. Machine learning models are often implemented with one or multiple embedding layers, which perform the encoding from categorical feature values to embeddings. One effect of embeddings is that the dimensionality of a given data point can be reduced or increased, which is useful when learning latent features.

Figure 2.6: Illustration of representing data as one-hot encoded or as embeddings.
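The difference between the two representations can be sketched in a few lines of Python. The category names and the embedding size are hypothetical, and the embedding values are only randomly initialized here. In a real model they would be adjusted during training together with the other weights.

```python
import random

CATEGORIES = ["red", "green", "blue"]  # hypothetical categorical feature
EMBEDDING_DIM = 2  # assumed size; normally a tuned hyperparameter


def one_hot(value):
    """Sparse vector with a single 1 at the index of the category."""
    return [1.0 if c == value else 0.0 for c in CATEGORIES]


# Embedding table: one dense row of continuous values per category.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
                   for _ in CATEGORIES]


def embed(value):
    """Look up the dense vector for a category."""
    return embedding_table[CATEGORIES.index(value)]


print(one_hot("green"))
print(embed("green"))
```

The one-hot vector grows with the number of categories, while the embedding keeps its fixed length regardless of how many categories exist.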

**2.4 Training a Neural Network**

The training and different training parameters are crucial to the model’s performance and take time and computing power to get right. In a book by Goodfellow et al. [13] from 2016, training a neural network is explained as a process to minimize or maximize a few select performance metrics. This goal is achieved by adjusting the network’s weights based on feedback on how well the network currently performs.

**2.4.1 Initializing Network Weights**

The initialization of network weights can significantly affect the outcome of the training phase, as it can slow down convergence and even prevent the network from ever converging. Goodfellow et al. [13] explain that network weights are usually randomly initialized to values around zero by sampling either from a Gaussian or a uniform distribution. The choice between the two has not been shown to impact the outcome significantly. However, the scale of the sample range can affect the network's ability to generalize.

**2.4.2 Epochs and batching**

Bishop [4] explains that a network is trained by updating its parameters through back-propagation at set intervals over several epochs. An epoch is a full pass over the entire training data. Each epoch can be divided into several mini-batches. A mini-batch contains $n$ training instances, and a prediction is made for each of them. These $n$ predictions are then evaluated, and the parameters of the network are updated based on this evaluation.

The size of mini-batches can affect the model in multiple ways. Some of these are [13]:

• Some hardware, such as GPUs, achieves better runtimes with specific array sizes, making it common to use batch sizes that are powers of two, from 16 to 256.

• Larger mini-batches provide a more accurate estimate of the gradient, but with less than linear returns.

• Small batches do not utilize multicore architectures well, which motivates using some absolute minimum batch size.

• Smaller batches can offer a regularizing effect. Generalization error is often best for a batch size of one. However, training with such a small mini-batch size requires a low learning rate, see Section 2.4.5, to maintain stability because of the high variance in the gradient estimate. A longer runtime is also expected with a smaller mini-batch size, since more update steps are needed to iterate through the entire data set.

**2.4.3 Evaluating predictions based on loss**

When compiling the model, it is possible to use different loss functions. According to Jurafsky and Martin [17], the loss function is used to compute the loss when training the model, which is the distance between the model output and the correct output. The aim is to achieve a low loss on the testing data as well, since a low loss on the training data alone can indicate over-fitting. A few examples of loss functions are mean squared error (MSE), binary cross-entropy, mean absolute error (MAE), and categorical cross-entropy.
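Three of the loss functions mentioned above can be sketched in plain Python as follows. The labels and predictions are hypothetical, and the binary cross-entropy assumes predicted probabilities strictly between zero and one.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, y_pred):
    """y_true holds 0/1 labels, y_pred probabilities strictly in (0, 1)."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 0.0, 1.0]
y_pred = [0.75, 0.25, 0.5]

print(mse(y_true, y_pred))
print(mae(y_true, y_pred))
print(binary_cross_entropy(y_true, y_pred))
```

All three return zero or a value close to zero when the predictions match the labels, and grow as the predictions move away from them.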

**2.4.4 Gradient Descent optimization**

Most deep learning algorithms involve some optimization, where the optimization seeks to either maximize or minimize some function $f(x)$ by altering the parameters of $f$ [13]. One way to do the optimization in a neural network is by using Gradient Descent optimization. Gradient Descent is a method where one uses the first-order derivative to find a minimum by taking small steps in the direction opposite to the gradient, slowly approaching the minimum, see Figure 2.9 for a visual representation. For non-convex functions, the minimum found this way may be a local rather than the global minimum.

S. Ruder [24] explains that Gradient Descent updates can be performed at different intervals: after a full pass over the training data (batch), after each training sample, or after each mini-batch. This report makes use of mini-batch Gradient Descent, where an update occurs for every mini-batch of $n$ training samples, see Equation 2.3. Note that Equation 2.3 is for a supervised setting. The difference in an unsupervised setting is that the loss is calculated beforehand and inserted into the equation,

$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)}), \tag{2.3}$$

where $\theta \in \mathbb{R}^D$ are the model's parameters in $J(\theta)$, $\nabla_\theta J(\theta)$ is the gradient, and $\eta$ is the learning rate.
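The mini-batch update in Equation 2.3 can be illustrated with a minimal sketch that fits a one-parameter model $y = \theta x$ using a mean squared error loss. The data, the model, the learning rate, the batch size, and the number of epochs are all hypothetical choices made for illustration.

```python
def gradient(theta, xs, ys):
    """Gradient of the mean squared error of the model y = theta * x."""
    n = len(xs)
    return sum(2.0 * x * (theta * x - y) for x, y in zip(xs, ys)) / n


def minibatch_gradient_descent(xs, ys, eta=0.01, n=2, epochs=200):
    """Apply theta <- theta - eta * gradient once per mini-batch of size n."""
    theta = 0.0
    for _ in range(epochs):
        for i in range(0, len(xs), n):
            batch_x, batch_y = xs[i:i + n], ys[i:i + n]
            theta = theta - eta * gradient(theta, batch_x, batch_y)
    return theta


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]  # data generated with theta = 2

theta = minibatch_gradient_descent(xs, ys)
print(theta)
```

Each epoch here performs two parameter updates, one per mini-batch, and the estimate approaches the value 2 that generated the data.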

**2.4.5 Learning rate**

Learning rate is a hyperparameter that adjusts the size of the step taken along the negative gradient during gradient-based optimization [13]. The learning rate can have a large effect on the optimizer and affects how the model learns over time, see Figures 2.7, 2.8, and 2.9 for a visual representation of how different learning rates can affect gradient descent.

Figure 2.7: Learning rate that is too small

Figure 2.8: Learning rate that is too large

Figure 2.9: Good learning rate
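The three situations in Figures 2.7 to 2.9 can be reproduced numerically on a simple function. The sketch below runs gradient descent on $f(x) = x^2$, which has its minimum at zero. The three learning rates are hypothetical values picked to show slow progress, divergence, and fast convergence.

```python
def descend(eta, x0=1.0, steps=20):
    """Run gradient descent on f(x) = x**2, whose gradient is 2 * x."""
    x = x0
    for _ in range(steps):
        x = x - eta * 2.0 * x
    return x


print(descend(0.001))  # too small: barely moves away from the start
print(descend(1.1))    # too large: overshoots and moves away from zero
print(descend(0.3))    # reasonable: ends up close to the minimum
```

For this function each step multiplies $x$ by $1 - 2\eta$, so the iterates diverge as soon as the magnitude of that factor exceeds one.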

**2.4.6 ADAM-optimizer**

Adaptive Moment Estimation, or Adam for short [24], is a method that computes adaptive learning rates for each parameter. Ruder [24] goes on to explain that Adam stores an exponentially decaying average of past squared gradients $\nu_t$, like Adadelta and RMSprop. In addition, Adam keeps an exponentially decaying average of past gradients $m_t$, similar to momentum:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
\nu_t &= \beta_2 \nu_{t-1} + (1 - \beta_2) g_t^2
\end{aligned}
\tag{2.4}
$$

where $m_t$ and $\nu_t$ are estimates of the first moment, the mean, and the second moment, the uncentered variance, of the gradients, respectively. $m_t$ and $\nu_t$ are initialized as vectors of zeros. As a result, they are biased towards zero, especially during the initial time steps and when the decay rates are low, which occurs when $\beta_1$ and $\beta_2$ are close to 1. This bias is counteracted by computing bias-corrected first- and second-moment estimates:

$$
\begin{aligned}
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{\nu}_t &= \frac{\nu_t}{1 - \beta_2^t}
\end{aligned}
\tag{2.5}
$$

where $\hat{m}_t$ and $\hat{\nu}_t$ are then used to update the parameters:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{\nu}_t} + \epsilon} \hat{m}_t \tag{2.6}$$

Algorithm 1 shows Adam in a supervised setting. However, the only difference between the supervised and unsupervised setting is the input to the loss function $L$, see line 3 in Algorithm 1. In the supervised setting, the true labels $y_{true}$ can be part of the loss calculation. However, other metrics must be used in the unsupervised setting, further discussed in Section 2.9.

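**Algorithm 1** The Adam Algorithm [13]

**Require:** Step size $\epsilon$

**Require:** Exponential decay rates for moment estimates, $\rho_1$ and $\rho_2$ in $[0, 1)$

**Require:** Small constant $\delta$ used for numerical stabilization

**Require:** Initial parameters $\boldsymbol{\theta}$

Initialize 1st and 2nd moment variables $\mathbf{s} = 0$, $\mathbf{r} = 0$

Initialize time step $t = 0$

1: **while** stopping criterion not met **do**

2: Sample a minibatch of $m$ examples from the training set $\{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(m)}\}$ with corresponding targets $\mathbf{y}^{(i)}$

3: Compute gradient: $\mathbf{g} \leftarrow \frac{1}{m} \nabla_{\boldsymbol{\theta}} \sum_i L(f(\mathbf{x}^{(i)}; \boldsymbol{\theta}), \mathbf{y}^{(i)})$

4: $t \leftarrow t + 1$

5: Update biased first moment estimate: $\mathbf{s} \leftarrow \rho_1 \mathbf{s} + (1 - \rho_1) \mathbf{g}$

6: Update biased second moment estimate: $\mathbf{r} \leftarrow \rho_2 \mathbf{r} + (1 - \rho_2) \mathbf{g} \odot \mathbf{g}$

7: Correct bias in first moment: $\hat{\mathbf{s}} \leftarrow \frac{\mathbf{s}}{1 - \rho_1^t}$

8: Correct bias in second moment: $\hat{\mathbf{r}} \leftarrow \frac{\mathbf{r}}{1 - \rho_2^t}$

9: Compute update: $\Delta\boldsymbol{\theta} = -\epsilon \frac{\hat{\mathbf{s}}}{\sqrt{\hat{\mathbf{r}}} + \delta}$

10: Apply update: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \Delta\boldsymbol{\theta}$

11: **end while**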
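The steps of Algorithm 1 can be sketched in plain Python as shown below. This is a minimal illustration, not the optimizer used in the experiments. The variable names follow the algorithm, while the toy objective, its `TARGET` values, the step size, and the fixed number of steps used as the stopping criterion are assumptions. The gradient is computed exactly instead of from a sampled mini-batch.

```python
import math


def adam(grad_fn, theta, epsilon=0.001, rho1=0.9, rho2=0.999,
         delta=1e-8, steps=1000):
    """Minimize a function given its gradient, following Algorithm 1."""
    s = [0.0] * len(theta)  # first moment estimate
    r = [0.0] * len(theta)  # second moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        # Update the biased first and second moment estimates.
        s = [rho1 * s_i + (1 - rho1) * g_i for s_i, g_i in zip(s, g)]
        r = [rho2 * r_i + (1 - rho2) * g_i * g_i for r_i, g_i in zip(r, g)]
        # Correct the bias caused by initializing s and r to zero.
        s_hat = [s_i / (1 - rho1 ** t) for s_i in s]
        r_hat = [r_i / (1 - rho2 ** t) for r_i in r]
        # Compute and apply the element-wise update.
        theta = [th - epsilon * sh / (math.sqrt(rh) + delta)
                 for th, sh, rh in zip(theta, s_hat, r_hat)]
    return theta


TARGET = [3.0, -1.0]  # hypothetical minimizer of the toy objective


def grad_fn(theta):
    """Gradient of sum_i (theta_i - TARGET_i) ** 2."""
    return [2.0 * (th - tg) for th, tg in zip(theta, TARGET)]


theta = adam(grad_fn, [0.0, 0.0], epsilon=0.05, steps=2000)
print(theta)
```

Because the update divides by the square root of the second moment estimate, both parameters initially move at a similar pace even though their gradients differ in magnitude.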

**2.5 Recurrent Neural Networks**

A subclass of neural networks is Recurrent Neural Networks (RNNs). RNNs were first developed in 1985 by Rumelhart et al. [25] and are derived from feed-forward networks. RNNs can handle sequential data such as text and voice because of their ability to use their own output as input to the network. Using the output as input to the next step helps the network capture context, since it receives the previous state when deciding the next state. An RNN can be seen in Figure 2.10, both in its compact state and unfolded over a few time steps.

Figure 2.10: A Recurrent Neural Network in its compact state and unfolded state with loss function. Credit: Goodfellow et al. [13].

2.5. Recurrent Neural Networks

A classic example of an RNN is the "What’s for dinner?"-example. In this example, one wishes to predict what the dinner will be tonight. The input can be the day of the week, the month, the current weather, and dinner yesterday. The output, which can be pizza, ham-burger, or sushi, is then predicted. Using a regular feed-forward network, one would be able to make good predictions. However, if one misses what was for dinner for the past week, the network would not work anymore. The architecture of RNNs solves this problem. If the input is the output from the previous day, one can unroll the predictions back to when one was confident of the input. Using this known input, it is possible to step forward and predict what there will be for dinner tonight.

However, Goodfellow et al. [13] also bring up some problems with RNNs. One of the inputs to an RNN is the output of the previous time step, and problems can occur when the same weights are multiplied repeatedly over time. When the weights are large, the values are amplified at every step and tend to infinity, hence exploding. When the weights are small, the values shrink at every step and tend to zero, hence vanishing. The possibility of exploding or vanishing values must be taken into account when using Recurrent Neural Networks.
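The effect of repeated multiplication can be illustrated with a scalar toy example. The sketch below is only an illustration under simplifying assumptions: a single scalar weight stands in for the recurrent weight matrix, and the function name `repeated_scaling` and the values 1.5 and 0.5 are hypothetical.

```python
def repeated_scaling(weight, steps, start=1.0):
    """Multiply a value by the same weight once per time step."""
    value = start
    history = []
    for _ in range(steps):
        value *= weight
        history.append(value)
    return history

large = repeated_scaling(1.5, 50)  # weight above 1: the value explodes
small = repeated_scaling(0.5, 50)  # weight below 1: the value vanishes

print(f"after 50 steps with weight 1.5: {large[-1]:.3e}")
print(f"after 50 steps with weight 0.5: {small[-1]:.3e}")
```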

**2.5.1**

**Long Short-Term Memory**

A solution to vanishing gradients is Long Short-Term Memory (LSTM), which was introduced by Hochreiter and Schmidhuber [14] in 1997. According to Goodfellow et al. [13], LSTMs can mitigate vanishing gradients by using an LSTM cell instead of a unit that only applies an element-wise nonlinearity to the affine transformation of recurrent units and inputs. LSTM cells have a recurrence inside the cell, like a self-loop, that complements the recurrence of the network itself. The cells work similarly to ordinary recurrent units and use the same inputs as well. However, each cell also has more parameters and a system of gating units that control the flow of information inside the cell. In total, it has one input and three gates: an input gate, an output gate, and a forget gate, see Figure 2.11.

In addition, it also has a state unit $s_i^{(t)}$. The weight in the self-loop is controlled by a forget gate unit $f_i^{(t)}$ for cell $i$ and time step $t$. The forget gate sets the weight between 0 and 1 with the help of a sigmoid unit. Equation 2.7 shows the structure of the forget gate:

$$f_i^{(t)} = \sigma\Big(b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)}\Big) \qquad (2.7)$$

where:

• $\mathbf{h}^{(t)}$ is the current hidden unit
• $\mathbf{x}^{(t)}$ is the current input unit
• $\mathbf{b}^f$ are the biases
• $\mathbf{U}^f$ are the input weights
• $\mathbf{W}^f$ are the recurrent weights

This leads to the following equation for the state unit $s_i^{(t)}$:

$$s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)} \sigma\Big(b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)}\Big) \qquad (2.8)$$

where the parameters are the same as in Equation 2.7 and $g_i^{(t)}$, the external input gate, is computed in a similar way to $f_i^{(t)}$. The equation for $g_i^{(t)}$ is:

$$g_i^{(t)} = \sigma\Big(b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)}\Big) \qquad (2.9)$$

The output of the LSTM cell, $h_i^{(t)}$, is calculated via the following equation (Equation 2.10):

$$h_i^{(t)} = \tanh\big(s_i^{(t)}\big) q_i^{(t)} \qquad (2.10)$$

where $q_i^{(t)}$, the output gate, is calculated by Equation 2.11:

$$q_i^{(t)} = \sigma\Big(b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)}\Big) \qquad (2.11)$$

where:

• $\mathbf{b}^o$ are the biases
• $\mathbf{U}^o$ are the input weights
• $\mathbf{W}^o$ are the recurrent weights

Overall, this system of gates and the ability to "forget" signals help prevent vanishing gradients in RNNs.
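One time step of an LSTM layer, following Equations 2.7 to 2.11, could be sketched as below. This is a plain-Python illustration and not the implementation used in the thesis. The names `lstm_step`, `affine`, and the parameter keys `"f"`, `"g"`, `"o"`, and `"c"` are hypothetical. As an assumption, the cell input in the state update is given its own parameter set `"c"`; passing the forget gate parameters for `"c"` reproduces Equation 2.8 exactly as written above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(b, U, W, x, h_prev, i):
    """Compute b_i + sum_j U[i][j] * x_j + sum_j W[i][j] * h_prev_j."""
    return (b[i]
            + sum(U[i][j] * x[j] for j in range(len(x)))
            + sum(W[i][j] * h_prev[j] for j in range(len(h_prev))))

def lstm_step(params, x, h_prev, s_prev):
    """Advance every cell i by one time step; returns (h, s)."""
    h, s = [], []
    for i in range(len(s_prev)):
        f = sigmoid(affine(*params["f"], x, h_prev, i))  # forget gate, Eq. 2.7
        g = sigmoid(affine(*params["g"], x, h_prev, i))  # input gate, Eq. 2.9
        q = sigmoid(affine(*params["o"], x, h_prev, i))  # output gate, Eq. 2.11
        c = sigmoid(affine(*params["c"], x, h_prev, i))  # cell input (assumed own parameters)
        s_i = f * s_prev[i] + g * c                      # state update, Eq. 2.8
        s.append(s_i)
        h.append(math.tanh(s_i) * q)                     # cell output, Eq. 2.10
    return h, s

# One cell, two inputs, all parameters zero, so every gate equals 0.5.
zeros = ([0.0], [[0.0, 0.0]], [[0.0]])
params = {"f": zeros, "g": zeros, "o": zeros, "c": zeros}
h, s = lstm_step(params, x=[1.0, -2.0], h_prev=[0.0], s_prev=[1.0])
print(h, s)
```

With all gates at 0.5, the new state is 0.5 · 1.0 + 0.5 · 0.5 = 0.75, which shows how the forget gate scales the previous state while the input gate scales the new contribution.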

**2.6**

**Sequence to Sequence (seq2seq)**

Sequence to sequence (seq2seq) was proposed back in 2014 in two different papers, first by Cho et al. [6] and then a few months later by Sutskever et al. [26]. The idea behind a sequence to sequence network is to use two RNNs, one as an encoder and one as a decoder, producing a sequence from a sequence. This led to the two names used in these papers: the encoder-decoder architecture and the seq2seq architecture. A seq2seq network processes the input sequence in the encoder. The encoder then emits a context, $C$, that is, in most cases, a simple function of its final hidden state, which is the state of the network after it has processed the entire sequence. The decoder is then conditioned on the fixed-length vector, $C$, to generate an output sequence. The input sequence and output sequence can have different lengths. The parts of the seq2seq network are trained jointly to maximize the average of $\log P(\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(n_y)} \mid \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n_x)})$ over all the sequences of the training set, which contains pairs of $\mathbf{x}$ and $\mathbf{y}$. $\mathbf{h}_{n_x}$ is the last state of the encoder and is in most cases used as a hidden representation, $C$, of the input sequence that the decoder uses as input. See Figure 2.12 for a visual representation of the seq2seq architecture.

**Figure 2.12: A seq2seq RNN architecture with an encoder, a decoder and the context C.**
Credit: Goodfellow et al. [13].

**2.7**

**Attention mechanism**

When reading longer input sequences, for example, a sentence consisting of upwards of 60 words, an RNN usually struggles with distinguishing the significance of individual elements. In theory, it is possible to solve this with enough time and training, but it is, in most cases, not a practical solution. Another solution to this problem, called the attention mechanism, was proposed by Bahdanau et al. [7] in 2015. Goodfellow et al. [13] describe an attention-based system as having three components:

• An input process that reads data and converts it into distributed representations. The representations have one feature vector for each word position.

• A memory consisting of feature vectors that store the output of the input process. The memory emulates a sequence of facts, accessible later in any order needed.

• A process that uses the memory to perform one task at each time step sequentially. This process can give attention to specific or several elements in memory, which would be required when translating a sentence read in the input process.

In summary, an attention mechanism can be understood as a weighted average of feature vectors $\mathbf{h}^{(t)}$ with weights $\alpha^{(t)}$. $\mathbf{h}^{(t)}$ can either be hidden units of a neural network or input data to the model. The weights, $\alpha^{(t)}$, are produced by the model itself and are usually valued between zero and one. They are trained to concentrate around a single $\mathbf{h}^{(t)}$ so that the weighted average approximates reading that specific time step precisely. The weights are usually produced by applying a softmax function to relevance scores emitted by another part of the model. It is more computationally expensive to use the attention mechanism than to index the desired $\mathbf{h}^{(t)}$ directly, but direct indexing cannot be trained with gradient descent.
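The weighted average described above can be sketched as follows. The relevance scores are given directly here, whereas in a real model they would be emitted by another part of the network. The function names `softmax` and `attend` and the example values are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Return the attention weights and the weighted average of the feature vectors."""
    alphas = softmax(scores)
    dim = len(features[0])
    context = [sum(a * h[k] for a, h in zip(alphas, features)) for k in range(dim)]
    return alphas, context

# Three feature vectors; the third one receives a much higher relevance score.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, context = attend(features, scores=[0.0, 0.0, 10.0])
print(alphas, context)
```

Because the third score dominates, the weights concentrate on the third feature vector and the weighted average is close to reading that vector directly, while every step remains differentiable.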

The seq2seq architecture cannot handle output dictionaries whose size varies with the length of the input sequence, and therefore the architecture poses problems when trying to solve combinatorial optimization problems [28].

**2.8**

**Pointer net (Ptr-Net)**

Pointer Net (Ptr-Net), also called Pointer Network, was proposed by Vinyals et al. [28] in 2015 and is a combination of the attention mechanism and the seq2seq architecture that uses attention as a pointer to select a member of the input sequence as the output. This change results in the new architecture being able to handle variable-sized output dictionaries, and the network can generalize to data with more time steps than the training data. The architectural change makes Ptr-Nets a good fit for neural learning on discrete problems, where the solution is a single selection from or permutation of the input sequence.

Figure 2.13: An illustration of how a seq2seq architecture and a Pointer Network solve the same problem. The left shows the seq2seq architecture and the right shows the Pointer Network.

The report by Vinyals et al. [28] continues to explain that a Ptr-Net is a small and simple modification of the attention model that allows for an output dictionary of variable length. The attention model used by Vinyals et al. can be defined as follows:

$$\begin{aligned} u_j^i &= v^T \tanh(W_1 e_j + W_2 d_i), \quad j \in \{1, \ldots, n\} \\ a_j^i &= \mathrm{softmax}(u_j^i), \quad j \in \{1, \ldots, n\} \\ d_i' &= \sum_{j=1}^{n} a_j^i e_j \end{aligned} \qquad (2.12)$$

Since a seq2seq model uses a softmax distribution over a fixed-size output dictionary to compute the conditional probability given the context, it cannot be used when the size of the output dictionary is equal to the length of the input sequence. To solve this problem, Vinyals et al. model the conditional probability with the attention mechanism of Equation 2.12 and obtain the following equation:

$$\begin{aligned} u_j^i &= v^T \tanh(W_1 e_j + W_2 d_i), \quad j \in \{1, \ldots, n\} \\ p(C_i \mid C_1, \ldots, C_{i-1}, P) &= \mathrm{softmax}(u^i) \end{aligned} \qquad (2.13)$$

where the vector $u^i$ is normalized by softmax to be an output distribution over the dictionary of inputs, $v$, $W_1$ and $W_2$ are learnable parameters of the output model, and $C_i$, $i \in \{1, \ldots, n\}$, are the vectors with the probabilities for each time step. Instead of using the encoder state $e_j$ to propagate extra information to the decoder, it uses $u_j^i$ as pointers to the input elements.
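The pointing step of Equation 2.13 can be sketched in plain Python as below. The sketch only illustrates the scoring and normalization. The encoder states $e_j$ and the decoder state $d_i$ would normally come from trained RNNs, and the identity matrices, the vector `v`, and the function names are hypothetical values chosen to make the example runnable.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_distribution(encoder_states, decoder_state, W1, W2, v):
    """Return softmax(u^i), a distribution over the positions of the input."""
    Wd = matvec(W2, decoder_state)
    scores = []
    for e_j in encoder_states:
        We = matvec(W1, e_j)
        hidden = [math.tanh(a + b) for a, b in zip(We, Wd)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))  # u^i_j
    return softmax(scores)

identity = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
short_inputs = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
long_inputs = short_inputs + [[0.5, 0.5], [2.0, 2.0]]

probs_short = pointer_distribution(short_inputs, [0.0, 0.0], identity, identity, v)
probs_long = pointer_distribution(long_inputs, [0.0, 0.0], identity, identity, v)
print(len(probs_short), len(probs_long))
```

The same parameters produce a distribution with three entries for the short input and five entries for the long input, which is the property that lets the output dictionary follow the input length.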

The paper also notes that the approach is targeted explicitly at problems whose output is discrete and corresponds to positions in the input. The authors apply it to three combinatorial problems: the convex hull problem, the traveling salesman problem, and Delaunay triangulation. They find that Pointer Networks are applicable and can learn solutions to all three problem types, which further supports the potential of Ptr-Nets for combinatorial optimization problems.

A study by Mottini and Acuna-Agost uses the Pointer Network architecture to predict which airline itinerary a passenger is going to choose [23]. A passenger is often presented with multiple itineraries matching a search and preference query given by the passenger. However, the passenger chooses at most one itinerary, and knowing which itineraries are more likely to be selected is valuable business knowledge for airlines and travel agencies. Mottini and Acuna-Agost seek to solve this task using machine learning, and in particular, Pointer Networks. However, since the purpose of Pointer Networks is to predict a sequence of items based on the input, Mottini and Acuna-Agost choose to use a simplified version of the original model. The main change is made on the decoder side, where the decoder RNN is removed, and the formula for the comparison between the encoder states and decoder vector is changed. Conceptually, the change can be interpreted as the new model only being able to point once instead of multiple times. Mottini and Acuna-Agost then use the conditional probability distribution generated by the pointing mechanism to sort the itineraries from most likely to least likely to be chosen. No changes were made on the encoder side, but the authors added an extra preprocessing layer. This layer did two things: numerical features in the itinerary were normalized to the interval [0, 1] to remove sensitivity to scale, and categorical features in the itinerary (e.g., destination/origin) were mapped to vectors using embeddings. The output of the preprocessing layer was then fed to the input of the encoder in the Pointer Network. The authors conclude that their proposed method outperforms the methods perceived as standard within the industry.

**2.9**

**Reinforcement Learning**

In this section, the concept of Reinforcement Learning (RL) is presented. Reinforcement Learning focuses on finding a strategy that maximizes the accumulated reward. The basis for RL is an agent that tries to figure out the most valuable actions to take given the current state of its environment.

**2.9.1**

**Elements of Reinforcement Learning**

In their book Reinforcement Learning: an introduction, Sutton and Barto [27] go into detail about all aspects of Reinforcement Learning (RL). They start with a presentation about fundamental elements within Reinforcement Learning. The explanations below are based on Sutton and Barto’s work.

The main concept in RL is to place an agent in an environment, give the agent a goal, and then let the agent observe the state of the environment, interact with it, and receive feedback on how well it achieves the given goal. Besides these basic concepts, four sub-elements make up the RL method. Each of them is listed below with an explanation of its purpose. The Reinforcement Learning setup is illustrated in Figure 2.14.

Figure 2.14: The Reinforcement Learning setup

**• Policy:** The policy is a state-action mapping and the core of the agent since it defines how the agent should act based on the perceived state of the environment. The actual policy can be of different types: a lookup table, a stochastic model, or a more complex search procedure. Independent of type, the policy always presents an action or a probability distribution over different actions.

**• Reward signal:** As a result of each action taken, the agent receives a response in the form of a number. This response is called the reward, and the agent will try to maximize the accumulated reward by updating the policy to promote actions yielding a higher reward. This reward is the only thing the agent is aware of, and correlating the reward with the goal objective is part of the implementation. The difficulty of defining what should give a high reward and what should give a lower reward can vary. However, there is always the risk of the agent exploiting something missed in the specification, leading to unwanted behavior, which is further discussed in Section 2.9.

**• Value function:** If an agent always chooses the action that results in the highest immediate reward, it could risk missing out on greater future rewards. The larger immediate reward could also result in the accumulated reward being lower. An analogy of this would be for a person to choose between candy and vegetables for dinner. The former would be tasty in the present (high reward), but over time the latter choice would result in an overall better life situation (larger future reward). The purpose of the value function is to predict the future rewards possible given the agent’s current state, allowing the agent to make decisions based on future values rather than current rewards.

**• Model of the environment (optional):** Before an agent chooses to take a specific action, it could be allowed to test its action on a model of the environment, hence evaluating different scenarios. This type of RL approach is called model-based learning and is a higher-level approach than the low-level trial-and-error approach, where the agent gets the reward after the action is taken. One could either choose to give a toy model of the environment, have the agent create a model while performing trial-and-error, or skip this element altogether (model-free approach).

These four elements can be extended to define more concepts within RL. Also of relevance for this thesis are the following concepts:

**• Action-Value Function:** Given the current state, one can define the Action-Value Function as $Q^{\pi}(s, a)$. This expression is interpreted as the expected received value of action $a$ given state $s$ under policy $\pi$.

**• Exploration Versus Exploitation:** A balance between using previous knowledge and exploring new actions must be struck each time the agent decides on the next action. There are different methods to tackle this dilemma. One such method is $\epsilon$-greedy, where $\epsilon$ is a small probability with which the agent explores by selecting a random action. Otherwise, with probability $1 - \epsilon$, the agent acts greedily, meaning it exploits previous knowledge.
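An $\epsilon$-greedy action selection can be sketched as below. The sketch follows the convention of Sutton and Barto [27], where $\epsilon$ is the probability of taking a random, exploratory action. The function name `epsilon_greedy` and the example action values are assumptions for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

rng = random.Random(0)
q_values = [0.1, 0.7, 0.3]  # hypothetical estimated action values

greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
random_actions = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(20)]
print(greedy_action, random_actions)
```

With `epsilon=0.0` the agent always exploits and picks the action with the highest estimated value, while `epsilon=1.0` makes every choice exploratory.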

**2.9.2**

**Approximate method and Policy gradient**

When using RL, the methods for creating the policy can be divided into tabular solutions and approximate solutions [27]. In the former case, the state and action spaces are small enough for the model to store an approximation of the value function in tabular form. This method often results in the agent finding the optimal value function and the optimal policy. However, when the state and action spaces increase in size, it is no longer feasible to store the value function approximation as a table due to computational and hardware limits. In this case, one must use approximation methods. Conceptually, approximation methods compress information about specific states and actions [27], which leads to a loss of data on individual states. However, it also results in the model being able to generalize between states, allowing the agent to make predictions when presented with previously unseen states [27].

One such approximate method, called Deep Reinforcement Learning (DRL), was suggested in 2013 by Mnih et al. at DeepMind Technologies [21]. DRL uses one or multiple neural networks to approximate a value or policy function. This approach turned out to perform well when presented with a large state or action space.

**Policy Gradient Methods**

Many methods within RL are action-value based, meaning that they learn the values of actions and then select actions based on their estimated value. However, Policy Gradient Methods use a different approach where the goal of training is to learn a parameterized policy that does not need to consult the value function. The following equation shows the parameterized policy,

$$\pi(a \mid s, \theta) = \Pr\{A_t = a \mid S_t = s, \theta_t = \theta\} \qquad (2.14)$$

where $\theta \in \mathbb{R}^{d'}$ is the policy’s parameter vector and the equation can be interpreted as the probability that action $a$ is taken at time $t$ given that the current state is $s$ at time $t$ with parameter vector $\theta$. When representing the policy with a neural network, $\theta$ is the parameters of the network. A performance measure $J(\theta)$ is then defined, and the gradient of this scalar measurement is used to learn the policy. The goal will be to maximize performance, and Equation 2.15 shows the update that will approximate gradient ascent in $J$.

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)} \qquad (2.15)$$

where $\alpha$ is the learning rate and $\widehat{\nabla J(\theta_t)} \in \mathbb{R}^{d'}$ is a stochastic estimate for which $\mathbb{E}[\widehat{\nabla J(\theta_t)}]$ approximates $\nabla J(\theta_t)$ (the gradient of the performance measure) with respect to $\theta$. The value-function is, as previously mentioned, not considered when using basic policy gradient methods. However, there is a setup called actor-critic where the value-function is estimated and used during training. In this setup the output from the estimated value-function is used to lower the variance, and this modification has been shown to lower the model training time. Actor refers to the learned policy and critic to the estimated value-function.

**REINFORCE**

REINFORCE is a policy-gradient learning algorithm first presented by Williams [29] in 1992. The paper is highly theoretical, contains longer mathematical proofs of the algorithm, and many of the area-specific words have changed since the time of writing. Sutton and Barto make a more recent presentation of the REINFORCE algorithm in their book Reinforcement Learning: an introduction [27], republished 2018 and the authors reference the original paper by Williams [29]. It is the more recent presentation by Sutton and Barto [27] that will be used when explaining the REINFORCE learning algorithm.

The overall strategy of policy gradient methods requires a way to obtain samples $\widehat{\nabla J(\theta_t)}$ for which Equation 2.16 is true, i.e., the expectation of the sample gradient is proportional to the actual gradient. Based on this condition and using the policy gradient theorem, both Williams [29] and Sutton and Barto [27] derive the proportionality in Equation 2.17, which yields the REINFORCE update presented in Equation 2.18.

$$\nabla J(\theta) \propto \mathbb{E}\Big[\widehat{\nabla J(\theta_t)}\Big] \qquad (2.16)$$

$$\nabla J(\theta) \propto \mathbb{E}\left[G_t \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}\right] \qquad (2.17)$$

$$\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} \qquad (2.18)$$

where $G_t$ is the return and $\frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$ yields a vector. This vector can be interpreted as the direction in parameter space which will have the largest increase of probability of repeating action $A_t$ on future visits to state $S_t$. Extending Equation 2.18 to use a baseline to lower variance and training time results in Equation 2.19.

$$\theta_{t+1} = \theta_t + \alpha (G_t - b(S_t)) \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} \qquad (2.19)$$

where $b(S_t)$ is a baseline and an estimate of the state-value function. Using the identity $\nabla \ln x = \frac{\nabla x}{x}$, Sutton and Barto [27] rewrite the expression in Equation 2.19 to the expression in Equation 2.20, which is the version of the REINFORCE update that will be used in this thesis. The proportionality in Equation 2.17 will also be used.

$$\theta_{t+1} = \theta_t + \alpha (G_t - b(S_t)) \nabla \ln \pi(A_t \mid S_t, \theta_t) \qquad (2.20)$$
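The update in Equation 2.20 can be illustrated on a toy problem. The sketch below is not the training code of this thesis. It assumes a two-armed bandit with a single state and one-step episodes, so the return $G_t$ equals the reward, a softmax policy over action preferences $\theta$, and a running average of returns as the baseline. The function name `reinforce_bandit`, the rewards, and the step sizes are hypothetical.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(rewards, steps=3000, alpha=0.2, beta=0.1, seed=0):
    """REINFORCE with baseline on a one-state problem with deterministic rewards."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # one preference per action
    baseline = 0.0                # b(S_t) for the single state
    for _ in range(steps):
        pi = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=pi)[0]
        G = rewards[action]
        for k in range(len(theta)):
            # For a softmax policy: d ln pi(action) / d theta_k = 1[k == action] - pi_k
            grad_ln_pi = (1.0 if k == action else 0.0) - pi[k]
            theta[k] += alpha * (G - baseline) * grad_ln_pi
        baseline += beta * (G - baseline)  # running average of observed returns
    return softmax(theta)

probs = reinforce_bandit(rewards=[0.0, 1.0])
print(probs)
```

After training, the policy should place most of its probability on the second action, which is the only one that yields a reward.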

**Actor-critic training**

Actor-critic training is a Reinforcement Learning method. Sutton and Barto [27] explain that actor-critic algorithms learn both policies and value functions. The actor is the part that learns policies, and the critic is the part that learns about whatever policy is currently being followed by the actor in order to criticize the actor’s choices. The critic uses a Temporal Difference (TD) algorithm to learn the state-value function for the actor’s current policy. Thanks to its value function, the critic can critique the actor’s choices by sending TD errors, $\delta$, to the actor. A positive $\delta$ means that the action was good as it led to a state with a better value than expected. However, a negative $\delta$ means that the action was bad as it led to a state with a worse value than expected. Based on this, the actor continually updates its policy. Both the critic’s and the actor’s networks receive input consisting of multiple features representing the current state of the agent’s environment. On each transition from state $S_t$ to state $S_{t+1}$, taking action $A_t$ and receiving reward $R_{t+1}$, the algorithm computes the TD error, $\delta$, which is then used for updating both the critic and the actor.

Figure 2.15: The actor-critic architecture.
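The TD error that the critic sends to the actor can be sketched as follows. The formula used is the standard one-step TD error from Sutton and Barto [27], $\delta = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$. The discount factor $\gamma$, the tabular value estimates, the step size, and the function names are assumptions made for this illustration.

```python
def td_error(reward, value_next, value_current, gamma=0.99):
    """One-step TD error: delta = R + gamma * V(S') - V(S)."""
    return reward + gamma * value_next - value_current

def critic_update(values, state, next_state, reward, alpha=0.1, gamma=0.99):
    """Move the critic's estimate for `state` toward the TD target; return delta."""
    delta = td_error(reward, values[next_state], values[state], gamma)
    values[state] += alpha * delta
    return delta

values = {"A": 0.0, "B": 1.0}  # hypothetical state-value estimates
delta = critic_update(values, "A", "B", reward=0.5, alpha=0.1, gamma=0.9)
print(delta, values["A"])
```

Here $\delta$ is positive, so the transition led to a better outcome than the critic expected, and an actor would increase the probability of the action that was taken.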

**2.9.3**

**Combining Neural Networks and Reinforcement Learning**

In a paper by Bello et al. [2] Pointer Networks are combined with Reinforcement Learning to solve combinatorial optimization problems. The paper’s primary focus is the Traveling Salesman Problem, but the authors also show that their model can solve other types of COPs as well, the Knapsack problem being one.

Bello et al. [2] start by discussing previous work using the seq2seq model to perform machine translation. They highlight that the seq2seq model uses the same factorization as they do and that a seq2seq model could solve the task at hand. However, the authors mention two issues. The first is that a seq2seq model has a fixed size output vocabulary $\{1, 2, \ldots, n\}$ and can therefore not generalize to inputs with more than $n$ cities. The second is that the seq2seq paper used conditional log-likelihood to optimize the parameters. To use this method, one needs access to the ground truth, in this case the optimal permutation for each TSP instance (i.e., the correct traveling route). Bello et al. seek to implement a method that addresses both of these problems.

To address the issue of fixed size output vocabulary, Bello et al. use Pointer Networks [28] to allow the model to point to a given input rather than predicting an index in an output vocabulary. Vinyals et al. implemented and trained a Pointer Network model using a supervised loss function comprising conditional log-likelihood to solve instances of the TSP. In this approach as well, Bello et al. point to undesirable aspects that they seek to avoid. The first aspect is that the supervised system needs labeled data for training, and the model performance is dependent on the quality of the labeled data. Secondly, it is computationally expensive and may be infeasible to compile a high-quality labeled data set for new problem statements that are NP-hard. The third aspect addresses the fact that one might not seek to replicate the result of another algorithm but rather to find a competitive solution.

Bello et al. suggest using Reinforcement Learning to solve the traveling salesman problem and other COPs [2]. The authors also highlight the fact that COPs usually have straightforward reward mechanisms, such as the total tour length in the case of the TSP, which makes them suitable for a Reinforcement Learning approach.

To summarise their approach, Bello et al. propose to optimize the parameters of a Pointer Network by using a model-free policy-based Reinforcement Learning approach. The RL agent’s training objective is the expected tour length and the goal is to find the shortest tour.

Bello et al. used an actor-critic design when training their RL agent. For encoding each city’s coordinates, both the actor and the critic used the same encoder architecture. However, the actor and the critic did not share encoder parameter values. The actor’s task was to suggest a permutation of the city coordinates using a decoder based on the Pointer Network architecture. The critic’s job was to estimate the tour length of a given set of city coordinates. When trained on multiple sets of cities, the critic’s prediction will approximate the expected value of the current set’s tour length. The approximation is used to reduce the variance when estimating the gradient for the actor according to Equation 2.21, where $b(s_i)$ is the critic’s prediction of the tour length for a sample $s_i$ consisting of a set of cities.

$$\nabla J(\theta) \approx \frac{1}{B} \sum_{i=1}^{B} \big(L(\pi_i \mid s_i) - b(s_i)\big) \nabla_{\theta} \log p_{\theta}(\pi_i \mid s_i) \qquad (2.21)$$

Quentin Deffense wrote a master’s thesis in 2020 [9] on solving combinatorial optimization problems with Reinforcement Learning and neural networks, focusing on the traveling salesman problem. The thesis is heavily inspired by Bello et al. [2] and uses a similar method with Pointer Networks and Reinforcement Learning with the actor-critic setup. It tries to recreate the procedure used by Bello et al. and adds one more search strategy, 2-opt, to improve the performance, as the model does not necessarily produce locally optimal solutions. Deffense goes into depth explaining the model and its different parts. Deffense achieves a result close to that of Bello et al., but not quite as good. However, the model was able to generalize to variable problem sizes within a specific range.

**3**

**Method**

This chapter explains the method used to carry out the different experiments. It starts by defining the data and the terms that are unique to Uniter and essential for understanding the study method. Then follows a thorough explanation of the model architecture and how the training works. Next follows a description of model testing and result evaluation. In addition, the three baselines are described in detail to give a better context of them relative to the primary model in this report. Lastly, the technical setup, with the specifications of the computers and the packages and libraries used, each with a corresponding version number, is presented for future replicability.

**3.1**

**Ensuring replicability**

The data set provided by Uniter is not an open data set and contains business-critical information. Although the data is not available as a whole, Section 3.2 describes it conceptually. Moreover, the preprocessing performed and the model architecture are described in Section 3.3 and Section 3.4, and the training procedure in Section 3.5.

**3.2**

**Data set**

The data set used for this report was acquired directly from Uniter and is currently in use for one of their customers. However, the data set is not open, so an example with the same composition will explain the structure of the actual data set.

For the rest of this example, we suppose that the data set consists of parts in a computer. This highly modular product can be put together in different configurations based on a customer’s needs. The task is to match a use case with the right performance step from each performance series.

**3.2.1**

**Use case**

The use cases are central in this data set and describe each customer’s needs. The needs are paramount when deciding what configuration works for a specific customer. A use case consists of several different attributes that can be either numerical or categorical features. A numerical feature can be how long someone sits at the computer and can be represented by a number, for example, 5 hours. A categorical feature is a value selected from a predefined group of options for that feature. For example, suppose the feature describes the user’s profession. In that case, the categories could be {photographer, programmer, office use}, and if the user is, for example, a programmer, that category would be selected. A few examples of selections based on these two attributes would lead to the structure of Table 3.1.

Table 3.1: Four use cases with two attributes each

| **Name** | **Hours at computer** | **Profession** |
| --- | --- | --- |
| heavy_hacker | 24 | programmer |
| powerpoint_magician | 5 | office use |
| lazy_influencer | 2 | photographer |
| noob_gamer | 2 | office use |

The data set used in this report had about 3000 rows, each with about 125 attributes per use case, meaning it covers a wide range of different uses with very different needs.

**3.2.2**

**Performance series**

A performance series is a series of several choices. A performance series can be, using the same example as before, a choice of battery size, which CPU and how much RAM the user needs. That would be three different performance series. Each performance series would then consist of a few different performance steps. Table 3.2 shows an example of different choices for each of the suggested performance series.

Table 3.2: Examples of three performance series

| **Battery size** | **CPU** | **RAM** |
| --- | --- | --- |
| 30 Wh | Pentium | 4 GB |
| 40 Wh | i3 | 8 GB |
| 50 Wh | i5 | 16 GB |
| 60 Wh | i7 | 32 GB |
| 70 Wh | i9 | 64 GB |

Engineers create formulas to calculate which performance steps are applicable for the use case. For example, a programmer may need a more powerful CPU and more RAM than an office user, and a person that sits at the computer for 8 hours may need a larger battery than one that uses it for 2 hours. The required processor and battery size are calculated using these formulas. Continuing on the example, a use case with 2 hours at the computer may have all the performance steps in the battery size performance series available, while a use case with 12 hours at the computer may only have 60 Wh and 70 Wh available. The same reasoning applies to the CPU.

In addition, there are also limitations on which performance step fits with which performance step from another performance series. For example, a 30 Wh battery may be too weak for an i9 CPU, making it incompatible. This limitation is called an interface problem.

Also, each performance step has a metric that one can compare to other performance steps. This metric can be many different things and some examples are cost, performance or performance per cost unit, but it is always a numerical value.

**3.2.3**

**Finding the right combination for each use case**

After one selection from each performance series has been picked, one has a combination that makes up the final product. A combination can either be allowed or forbidden based on interface problems or limitations incurred by the attributes in the use case, as explained above.

Finding an allowed combination can be harder or easier depending on the data set. Some data sets have no interface problems, while some have many. So finding an allowed combination can be a very complex or a simple procedure. However, finding an allowed combination is only a part of finding the right combination of performance steps for a use case. Finding the optimal combination is the hard part. An optimal combination is an allowed combination that maximizes or minimizes, depending on the data set, the combined metric of its performance steps. For example, if each computer part in the example in Table 3.2 had a cost, one would want to pick the cheapest combination available to that use case.

**3.2.4**

**Defining the combinatorial optimization problem**

Defined as a combinatorial optimization problem, the problem has the following specification:

$$\begin{aligned} I &: \text{the set containing all use cases} \\ x &: \text{a specific use case} \\ y &: \text{a specific combination of performance steps} \\ m(x, y) &: \text{cost of all components in } y \\ g &: \min \end{aligned} \qquad (3.1)$$

Given the example with computer components, $m(x, y)$ would be:

$$m(x, y) = \text{cost of selected battery} + \text{cost of selected CPU} + \text{cost of selected RAM} \qquad (3.2)$$
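For a small instance, the problem defined above can be solved by exhaustive search, which makes the roles of the use case limitations, the interface problems, and the measure $m(x, y)$ concrete. The sketch below is only an illustration and not the method of this thesis. All costs, the set of allowed steps, the forbidden pairs, and the function name `cheapest_combination` are hypothetical.

```python
from itertools import product

def cheapest_combination(series, cost, allowed_steps, forbidden_pairs):
    """Enumerate all combinations and return the cheapest allowed one."""
    best, best_cost = None, float("inf")
    for combo in product(*series.values()):
        # Limitations from the use case: every step must be allowed.
        if any(step not in allowed_steps for step in combo):
            continue
        # Interface problems: certain pairs of steps cannot be combined.
        if any(a in combo and b in combo for a, b in forbidden_pairs):
            continue
        total = sum(cost[step] for step in combo)  # m(x, y)
        if total < best_cost:                      # g = min
            best, best_cost = combo, total
    return best, best_cost

series = {
    "battery": ["30 Wh", "60 Wh", "70 Wh"],
    "cpu": ["i3", "i9"],
    "ram": ["8 GB", "16 GB"],
}
cost = {"30 Wh": 20, "60 Wh": 40, "70 Wh": 50,
        "i3": 100, "i9": 300, "8 GB": 30, "16 GB": 60}
allowed_steps = {"60 Wh", "70 Wh", "i9", "8 GB", "16 GB"}  # hypothetical use case
forbidden_pairs = [("30 Wh", "i9"), ("60 Wh", "i9")]       # hypothetical interface problems

best, best_cost = cheapest_combination(series, cost, allowed_steps, forbidden_pairs)
print(best, best_cost)
```

The number of combinations grows multiplicatively with the number of performance series and steps, which is why exhaustive search quickly becomes impractical and a learned policy becomes interesting.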

**3.3**

**Preprocessing of data set**

The data set used for this study contained both numerical and categorical features. Categorical features present a challenge for a neural network since it expects its input to be only numerical values.

**3.3.1**

**Use cases**

Every possible class for a categorical feature was known and could be mapped to a numerical index ranging from 0 to $N$, where $N = \text{total number of classes} - 1$. This encoding was done for each categorical feature individually, meaning that no two categorical features shared a class vocabulary. Encoding the categorical feature values was a prerequisite for the embedding layers, which were part of the final model.
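The index encoding can be sketched as follows. The feature names and classes are hypothetical, and sorting the classes to fix the index order is an assumption made here so that the mapping is reproducible.

```python
def build_vocabulary(known_classes):
    """Map each known class of one categorical feature to an index 0..N."""
    return {label: index for index, label in enumerate(sorted(set(known_classes)))}


# Hypothetical categorical features with their known classes.
KNOWN_CLASSES = {
    "environment": ["indoor", "outdoor"],
    "mount": ["wall", "floor", "ceiling"],
}

# One vocabulary per feature: no two features share a class vocabulary.
VOCABULARIES = {
    feature: build_vocabulary(classes) for feature, classes in KNOWN_CLASSES.items()
}


def encode_use_case(use_case):
    """Replace every categorical value with its per-feature index."""
    return {feature: VOCABULARIES[feature][value] for feature, value in use_case.items()}
```

For example, `encode_use_case({"environment": "outdoor", "mount": "wall"})` looks up each value in the vocabulary of its own feature, so the same index can appear in several features without conflict.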

Numerical features were scaled as part of the data preprocessing. Values were scaled to the range [0, 1] according to Equation 3.3.

$$
\begin{aligned}
\mathbf{X}_{\sigma} &= \frac{\mathbf{X} - \min(\mathbf{X})}{\max(\mathbf{X}) - \min(\mathbf{X})} \\
\mathbf{X}_{\text{scaled}} &= \mathbf{X}_{\sigma} \cdot (U - L) + L
\end{aligned}
\tag{3.3}
$$

where $\mathbf{X}$ is a column vector of all values for a given feature and $L$ and $U$ are the lower and upper limits for the values of $\mathbf{X}_{\text{scaled}}$. $L$ and $U$ are variable parameters, usually $L = 0$ and $U = 1$.
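A minimal sketch of this scaling for one feature column is given below. It assumes the conventional min-max form, in which the normalized value is multiplied by the width of the target range, U − L, before L is added. The function name and the handling of a constant column are assumptions, since the text does not cover that case.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Scale one feature column to the range [lower, upper]."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # Assumption: a constant feature is mapped to the lower limit.
        return [lower for _ in values]
    return [
        (value - smallest) / (largest - smallest) * (upper - lower) + lower
        for value in values
    ]
```

With the default limits, the smallest value of the column maps to 0 and the largest to 1.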

**3.3.2**

**Performance steps encoding**

The performance steps were also categorical and encoded following the same idea as categorical features of each use case.

**3.3.3**

**Compiling a state for the RL environment.**

At each step the agent took, it was presented with a new state of the environment. Based on previous work [2, 28], it was decided that the agent would be given all possible combinations of performance steps as a sequence of data points. Therefore, each use case needed to be expanded along the time step axis. The expansion was made by duplicating the data points of the use case, once per possible combination. The expanded use case was then concatenated with each of the possible combinations. The final data set has a shape similar to time series data. However, unlike ordinary time series data, there is no dependence between the data points along the time axis.

Figure 3.1: Processing of one sequence step. This processing is done for each choice before all choices are given as a sequence to the encoder network.

Figure 3.2: All choices are given to the encoder as a sequence where the i:th step contains all features of current use case and the i:th combination of performance steps. The blue and yellow parts are shared for all steps in the sequence but the green part changes for each step in the sequence.
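The expansion and concatenation described above can be sketched as follows. The function name and the example values are hypothetical, and the sketch assumes that the use-case features and performance steps have already been encoded as numbers.

```python
from itertools import product


def compile_state(use_case_features, series_options):
    """Build the input sequence for one use case.

    Step i of the sequence holds a copy of the use-case features
    followed by the i:th combination of performance steps.
    """
    combinations = product(*series_options)
    return [list(use_case_features) + list(combo) for combo in combinations]


# Hypothetical example: two use-case features and two performance series
# with 2 and 3 encoded performance steps, giving 2 * 3 = 6 sequence steps.
state = compile_state([0.25, 3], [[0, 1], [0, 1, 2]])
```

Each row shares the same leading use-case values, and only the trailing combination changes from row to row, which matches the layout illustrated in Figure 3.2.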

**3.4**

**Model**

In this section, the different neural network architectures that were used are presented. Two deep neural networks are used simultaneously during training, and Section 3.5 describes their interaction in more detail. The training goal is to adjust the weights of the neural network representing the actor so that it suggests the optimal combination of performance steps given a use case.

In an encoder-decoder setup, like Seq2Seq and Pointer Networks, the actor and the critic both represent decoders with different purposes. Both also need an encoder to generate a hidden representation of the input data. For this, an encoder was implemented, and two instances of it were created, one for the actor and one for the critic. The actor and the critic thus share the encoder architecture, but the actual weights and embedding values are not shared.

**Encoder**

The architecture of the encoder resembles that of the original Pointer Networks presented by Vinyals et al. [28] and also used by Bello et al. [2].

First, categorical features are encoded using an embedding layer for each category. The embedding layer was not shared between categories. Limitations in the library used for implementing the model forced all vectors to be of the same length when concatenated during input data preparation. For categorical features, this was solved by setting the same embedding size for all features. However, it posed a problem for numerical features. An auxiliary neural network was therefore used to generate an embedding vector containing information about all numerical features. This additional network consisted of one fully connected layer outputting a vector of the same size as the embedding size.

All embedding vectors (including the numerical embedding vector) are then concatenated and given to an LSTM layer. The purpose of this LSTM layer is to generate a hidden representation of the input data. This representation was given to the decoder, which in the actor’s case was a Pointer Network and in the critic’s case was a second LSTM layer followed by two fully connected layers.
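The input preparation for one sequence step can be sketched in plain Python as follows. This is an illustration only and not the implementation used in the study: the embedding size, the random initialization, the absence of an activation function in the fully connected layer, and all function names are assumptions. The LSTM layer itself is omitted.

```python
import random

EMBEDDING_SIZE = 4  # assumed value for the sketch
rng = random.Random(0)


def make_embedding(num_classes, size=EMBEDDING_SIZE):
    """One lookup table per categorical feature; tables are not shared."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(size)] for _ in range(num_classes)]


def make_dense(num_inputs, size=EMBEDDING_SIZE):
    """Weights and bias of one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(num_inputs)] for _ in range(size)]
    return weights, [0.0] * size


def dense(layer, inputs):
    """Apply the fully connected layer (assumed linear) to the inputs."""
    weights, bias = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]


def encoder_input(categorical_indices, numerical_values, embeddings, numeric_layer):
    """Concatenate all embedding vectors with the numerical embedding vector."""
    vectors = [table[index] for table, index in zip(embeddings, categorical_indices)]
    vectors.append(dense(numeric_layer, numerical_values))
    return [value for vector in vectors for value in vector]


# Hypothetical step: two categorical features (3 and 5 classes), two numerical features.
embeddings = [make_embedding(3), make_embedding(5)]
numeric_layer = make_dense(2)
step = encoder_input([2, 4], [0.1, 0.9], embeddings, numeric_layer)
```

Because the auxiliary layer outputs a vector of the embedding size, every vector in the concatenation has the same length, regardless of how many numerical features there are.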

**Actor**

The pointer mechanism designed by Vinyals et al. [28] and used by Bello et al. [2] assumes
that one wants to produce a sequence by pointing back to the input sequence multiple times.
However, the prediction of our model should only be a distribution over the input sequence
at the first step. Therefore, we modify the pointer mechanism in Equation 2.13 to be that of
Mottini and Acuna-Agost [23], who also wanted only the predictive distribution of the first
step. Their formulation of the pointer mechanism is presented in
Equation 3.4:
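$$
\begin{aligned}
d &= \tanh(W_2 e_n + b) \\
u_j &= d^{\top} W_1 e_j \\
p(y_j \mid \mathbf{X}) &= \frac{\exp(u_j)}{\sum_{k=1}^{n} \exp(u_k)}
\end{aligned}
\tag{3.4}
$$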
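As a rough illustration, Equation 3.4 can be written in a few lines of plain Python. The function names, the list-of-lists weight layout, and the reading of $e_n$ as the last encoder state are assumptions made for this sketch, which is not the implementation used in the study.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def pointer_distribution(encoder_states, w1, w2, b):
    """Return p(y_j | X) for every encoder state e_j, following Equation 3.4."""
    e_n = encoder_states[-1]  # assumed: e_n is the last encoder state
    d = [math.tanh(z + b_i) for z, b_i in zip(matvec(w2, e_n), b)]
    scores = [
        sum(d_i * p_i for d_i, p_i in zip(d, matvec(w1, e_j)))
        for e_j in encoder_states
    ]
    # Subtracting the maximum score leaves the softmax unchanged but avoids overflow.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical inputs: three encoder states of size 2 and identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probabilities = pointer_distribution(states, identity, identity, [0.0, 0.0])
```

The result contains one probability per input step, so its length always equals the length of the input sequence.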

Here, $d$ is no longer the hidden state of the i:th decoder step (as was the case in Equation 2.13)
but rather a single hidden state independent of $i$. The other modification is the computation
of the alignment vector between the decoder vector and the encoder states, which is now
calculated using a simpler equation. Mottini and Acuna-Agost [23] also mention that this
modification results in better computational performance. In our RL model, the resulting
distribution $p(\mathbf{Y} \mid \mathbf{X})$ was then used by the agent to select a combination. During training, the

Figure 3.3: Architecture of actor model

distribution was used as the probability distribution when randomly selecting the action to
take. This was done to encourage the model to explore during the training process. During
inference, a greedy strategy was used to select the combination which had the largest
probability mass, i.e. the combination with the highest value of $p(y_j \mid \mathbf{X})$ was selected. Figure 3.3 illustrates the design of