(1)

Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2019 | LIU-IDA/LITH-EX-A--19/052--SE

Automatic parameter tuning in localization algorithms

Automatisk parameterjustering av lokaliseringsalgoritmer

Martin Lundberg

Supervisor: Cyrille Berger
Examiner: Ola Leifler



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Many algorithms today require a number of parameters to be set in order to perform well in a given application. The tuning of these parameters is often difficult and tedious to do manually, especially when the number of parameters is large. It is also unlikely that a human can find the best possible solution for difficult problems. To be able to automatically find good sets of parameters could both provide better results and save a lot of time.

The prominent methods Bayesian optimization and Covariance Matrix Adaptation Evolution Strategy (CMA-ES) are evaluated for automatic parameter tuning in localization algorithms in this work. Both methods are evaluated using a localization algorithm on different datasets and compared in terms of computational time and the precision and recall of the final solutions. This study shows that it is feasible to automatically tune the parameters of localization algorithms using the evaluated methods. In all experiments performed in this work, Bayesian optimization was shown to make the biggest improvements early in the optimization, but CMA-ES always overtook it and went on to reach the best final solutions after some time. This study also shows that automatic parameter tuning is feasible even when using noisy real-world data collected from 3D cameras.


Acknowledgments

Thanks to Ola Petersson and Kevin Kjellén at SICK IVP for suggesting the project idea and for showing great interest and enthusiasm throughout the project, and for always being available for discussion and assistance. Thanks to Annette Lef for all our discussions which helped bring the work forward. Thanks also to everyone else at SICK IVP who made me feel very welcome during the thesis work.

Thanks also to my supervisor Cyrille Berger and examiner Ola Leifler for providing good feedback and discussion throughout the thesis work.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 SICK IVP
  1.3 Aim
  1.4 Research questions
  1.5 Delimitations
2 Theory
  2.1 Black-box optimization
  2.2 Evaluation metrics
    2.2.1 Precision and recall
    2.2.2 OSPA
  2.3 Bayesian optimization
    2.3.1 Gaussian processes
    2.3.2 Acquisition function
    2.3.3 Dealing with hyperparameters
    2.3.4 Algorithm summary
  2.4 CMA-ES
    2.4.1 Sampling
    2.4.2 Updating the mean
    2.4.3 Updating the covariance matrix
    2.4.4 Step size control
    2.4.5 Stopping criteria
    2.4.6 Boundary handling
    2.4.7 Algorithm summary
    2.4.8 Variations of CMA-ES
3 Method
  3.1 Localization algorithm
  3.2 Objective function
  3.3 Datasets
    3.3.1 Dataset 1
    3.3.2 Dataset 2
  3.4 Implementation and hardware
  3.5 Optimizing the objective function
    3.5.1 Bayesian optimization
    3.5.2 CMA-ES
  3.6 Evaluation
  3.7 Experiments
    3.7.1 Experiment 1
    3.7.2 Experiment 2
4 Results
  4.1 Experiment 1
    4.1.1 Bayesian optimization
    4.1.2 CMA-ES
    4.1.3 Comparison
  4.2 Experiment 2
    4.2.1 High resolution camera
    4.2.2 Low resolution camera
5 Discussion
  5.1 Results
    5.1.1 Experiment 1
    5.1.2 Experiment 2
  5.2 Method
    5.2.1 Comparison of methods
    5.2.2 Reliability and validity
    5.2.3 Source criticism
  5.3 The work in a wider context
    5.3.1 Trust in AI systems
    5.3.2 Will humans lose their jobs?
6 Conclusion
  6.1 Future work


List of Figures

2.1 Optimal assignment of points between two sets.
2.2 Gaussian process regression.
2.3 Randomly sampled functions from GPs using different kernel functions.
2.4 Acquisition functions for Bayesian optimization.
2.5 Convergence of CMA-ES for a simple problem.
2.6 Covariance matrix estimation in CMA-ES.
2.7 Several steps of the algorithm with the corresponding evolution path marked in blue.
2.8 Three different scenarios in step size control.
3.1 An example scene showing a number of hard drives in a bin.
3.2 An example scene recorded using the low resolution camera.
3.3 An example scene recorded using the high resolution camera.
3.4 A scene with marked ground truth positions in the annotation tool.
3.5 Classifying predicted positions as true or false positives.
4.1 Convergence plot for each variation of Bayesian optimization.
4.2 Convergence plot for each variation of CMA-ES.
4.3 Convergence of both Bayesian optimization and CMA-ES.
4.4 Visual comparison between manual baseline, Bayesian optimization and CMA-ES.
4.5 Convergence of both Bayesian optimization and CMA-ES on high resolution data.
4.6 Visual comparison for high resolution data.
4.7 Convergence of both Bayesian optimization and CMA-ES on low resolution data.


List of Tables

3.1 The parameters of the localization algorithm which were automatically tuned.
3.2 Evaluated acquisition functions.
3.3 Evaluated kernel functions.
3.4 Evaluated variations of CMA-ES.
4.1 Training times for Bayesian optimization.
4.2 Final OSPA values for the runs where Bayesian optimization did not converge.
4.3 OSPA values on test dataset for Bayesian optimization.
4.4 Precision and recall on test dataset for Bayesian optimization.
4.5 Training time of different variations of CMA-ES.
4.6 The number of optimization runs where CMA-ES did not converge.
4.7 Results of the best found solution evaluated on the test dataset.
4.8 Training time of the selected variations of both methods.
4.9 Final OSPA values for the runs which did not converge on high resolution data.
4.10 Results of the best found solution evaluated on the high resolution test dataset.
4.11 Training time of the selected variations of both methods.
4.12 Final OSPA values for the runs which did not converge on low resolution data.

1 Introduction

Many algorithms today require multiple parameters to be set in order to perform well, for example in the fields of machine learning [1], robotics [2] and controllers [3]. The tuning of these parameters is often difficult and tedious to do manually, especially when the number of parameters is large [4]. Performing this task manually can also introduce human error, and a human is unlikely to find the best set of parameters for a given application [5]. This is because with each additional parameter the dimensionality of the parameter space increases, which leads to good solutions being further apart. To be able to perform this task automatically could both provide better results and save a lot of time.

Automatic parameter tuning has previously been done with Bayesian optimization using Gaussian processes (GP), and it is often said to be one of the most effective methods for parameter tuning [1], [6]. Another family of algorithms that has been applied to this problem is evolutionary algorithms. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) has been successfully applied to parameter tuning of, for example, deep neural networks [1] and automatic speech recognition [7]. Loshchilov et al. have shown CMA-ES to be very competitive among black-box optimization algorithms, especially when many objective function evaluations are required [8].

One category of algorithms that is often difficult to tune manually due to a large number of parameters is localization algorithms. The objective of this type of algorithm is to determine the position of one or more objects in an environment. The localization algorithm used in this work will be used to locate several 3D objects in an environment consisting of a point cloud. The point cloud, in turn, is made up of data obtained from a 3D camera. The algorithm is developed internally at SICK IVP and uses around 20 parameters. In general, a localization algorithm will try to match the points of one point cloud, from a 3D scan of a target object, to a point cloud of the reference model being located. This is done by trying to find some 3D translation and rotation that minimizes the distance between each point in the target and the closest point in the reference [9].

1.1 Motivation

According to SICK IVP, the parameterization of localization algorithms is a crucial part of achieving accurate results. The idea behind the process is to find a balance between high measurement accuracy in perfect conditions and the ability to robustly locate objects in imperfect conditions. Given the time-consuming nature of the tuning process, a lot of time could be saved if it were performed automatically [5]. Reducing the amount of time spent on manually tuning the parameters could also be of great financial interest, since that time could instead be spent on other tasks.

The amount of manual labour when tuning the parameters manually is also increased by the fact that the same algorithm can be applied to several different, but similar, problems. For example, the same localization algorithm might be used to locate different types of objects, which might require a different set of parameters for each type of object to achieve accurate results.

To solve this problem and reduce the manual labour required for successfully applying a localization algorithm to a given problem, this thesis will investigate the methods CMA-ES and Bayesian optimization using GPs to automatically tune the parameters of localization algorithms using labelled training data. The findings from this work should also be applicable to localization algorithms more generally. The comparison between CMA-ES and Bayesian optimization using GPs will also be indicative of their performance for similar problems.

It would also be interesting to know how the type of data used affects the parameter tuning process. In an industrial application, it would be convenient for the user to be able to give a few example scans of the objects to locate along with annotated positions of the objects as input to the system and then have the system automatically find good parameters for locating the desired objects.

While automatic parameter tuning in general is an active research field, to the best of the author's knowledge there are few studies on parameter tuning in localization algorithms specifically. In [9], the parameters of a Particle Swarm Optimization-based localization algorithm are tuned successfully using a genetic algorithm. A different genetic algorithm has also been shown to successfully tune the parameters of two different algorithms in the related field of image segmentation in [10]. The limited amount of prior work on parameter tuning in localization algorithms further motivates the importance of this work.

1.2 SICK IVP

This work will be carried out in cooperation with SICK IVP in Linköping. SICK IVP develops software for industrial applications using machine vision, and they make use of many machine learning techniques. One of the areas they work in involves localization of 3D objects. One example of usage for this technique is a bin picking system, where a robotic arm has to pick up objects from a box of many objects, where the objects are potentially differently shaped but share enough properties to be considered part of the same class of objects. Finding good methods for automatically tuning the parameters of the localization algorithms would be of great interest for the company.

1.3 Aim

The aim of this thesis is to investigate and evaluate Bayesian optimization using GPs and CMA-ES for automatically tuning the parameters of localization algorithms. It will also investigate which type of data and labelling of the data is necessary during the parameter tuning process.

1.4 Research questions

The research question is:

1. Which of the methods, Bayesian optimization using GPs or CMA-ES, provides the best results with respect to the computational time of finding a solution and the precision and recall of the final solution, when applied to automatic parameter tuning of localization algorithms?

1.5 Delimitations

This work will only evaluate the methods on the localization algorithms developed and used in-house at SICK IVP, but the findings from evaluating the methods on these algorithms should also be applicable to localization algorithms more generally.

2 Theory

2.1 Black-box optimization

Function optimization is the process of finding a set of parameters $\mathbf{x} = (x_1, x_2, \dots, x_n)$ of $n$ dimensions which produces a minimum or maximum for a function $f(\mathbf{x})$, usually called the objective function or cost function. Global optimization refers to finding a global optimum rather than a local optimum. Many real-world optimization problems are multi-modal, meaning the objective function has several optima. Often the objective function has no mathematical expression which can be optimized analytically, and the multi-modal nature of the objective function means that derivative-based methods such as gradient descent will generally not work [11].

Solutions to these kinds of problems are found in the field of black-box optimization. Black-box algorithms do not rely on any information about the problem domain to find a solution, and can thus be applied to many real-world problems without modification. A black-box algorithm only relies on being able to evaluate a solution according to the objective function. [11]

In real-world optimization problems, the objective function is often expensive to evaluate. For example, when searching for good parameters for machine learning algorithms, a complete model often has to be trained for each function evaluation to be able to determine how good a set of parameters is [12]. This means that an evaluation of the objective function could take hours or even days, and it is therefore desirable for the optimization algorithm to reduce the number of function evaluations.

2.2 Evaluation metrics

This section introduces the metrics used in this work.

2.2.1 Precision and recall

A prediction which is classified as a correct prediction is called a true positive (TP) and a prediction which is incorrect is called a false positive (FP). On the other hand, not predicting what should have been predicted is called a false negative (FN). Using this, the measures precision and recall are defined as

$$\text{precision} = \frac{TP}{TP + FP} \tag{2.1}$$

$$\text{recall} = \frac{TP}{TP + FN}. \tag{2.2}$$

Intuitively, precision measures how many of the predictions were correct predictions, and maximizing the precision would minimize the number of false positives. Recall on the other hand measures how many of the actual positives were correctly predicted, and maximizing recall would minimize the number of false negatives.
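As a concrete illustration of (2.1) and (2.2), the following small Python snippet (illustrative, not from the thesis) computes both measures from raw counts:

```python
def precision_recall(tp, fp, fn):
    """Precision (2.1) and recall (2.2) from prediction counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Example: 8 correct detections, 2 spurious detections, 4 missed objects
print(precision_recall(tp=8, fp=2, fn=4))  # (0.8, 0.666...)
```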

2.2.2 OSPA

Figure 2.1: Optimal assignment of points between two sets represented by ○ and × respectively.

To get a measure of the distance between two sets of points, the Optimal Subpattern Assignment (OSPA) metric [14] can be used. The OSPA metric takes into account both the distances between the points in each set and the cardinality error of the sets, that is, if one of the sets contains more points than the other. It achieves this by using an optimal assignment between the sets of points, so that each point in one set corresponds to exactly one point in the other set and so that the total distance between the assigned points is minimal. If there are more points in one of the sets, the extra points are left unassigned and a penalty is applied. Figure 2.1 shows an optimal assignment between two sets of points, where the red point has no corresponding point in the other set and is left unassigned.

The OSPA metric includes a cut-off parameter $c$ which determines the distance at which assigned points no longer count as a correct matching [14]. In Figure 2.1, the blue line indicates an assignment where the distance is greater than $c$ and has been cut off. In the OSPA metric, assigned points which are further away from each other than the cut-off distance receive the same penalty as points which are not assigned at all. The cut-off distance between points $v$ and $w$ of the different sets is defined as $d^{(c)}(v, w) = \min(c, d(v, w))$, where $d(v, w)$ is an arbitrary distance metric (Euclidean distance, for instance). With this cut-off distance in mind, the OSPA metric is defined as

$$\bar{d}^{(c)}_p(V, W) = \left( \frac{1}{n} \left( \min_{\pi \in \Pi_n} \sum_{i=1}^{m} d^{(c)}(v_i, w_{\pi(i)})^p + c^p (n - m) \right) \right)^{1/p} \tag{2.3}$$

where $m \le n$, $V = \{v_1, \dots, v_m\}$ and $W = \{w_1, \dots, w_n\}$ are the finite sets of points to compare, and $\Pi_k$ is the set of permutations on $\{1, 2, \dots, k\}$ for any $k \in \mathbb{N}$ [14]. In (2.3), $\pi$ is the assignment of points between the sets that minimizes the sum $\sum_{i=1}^{m} d^{(c)}(v_i, w_{\pi(i)})^p$, and $w_{\pi(i)}$ denotes the point in $W$ that is assigned to $v_i$ in $V$ under that optimal assignment. Essentially, the metric finds an optimal assignment between the points in both sets and takes the sum of the cut-off distances between the assigned points, plus a penalty for unassigned points in either set. $1 \le p < \infty$ is the order of the OSPA metric, and a higher value of $p$ penalizes estimated points which are not close to any points in the ground truth set more heavily.
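To make (2.3) concrete, the following is a minimal Python sketch of the OSPA metric, using SciPy's linear-sum-assignment solver (the Hungarian algorithm) for the optimal assignment; the function name and interface are illustrative assumptions, not from the thesis:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def ospa(V, W, c=1.0, p=2):
    """OSPA distance (2.3) between point sets V (m x d) and W (n x d)."""
    V, W = np.asarray(V, float), np.asarray(W, float)
    m, n = len(V), len(W)
    if m > n:                                # the metric is symmetric; ensure m <= n
        V, W, m, n = W, V, n, m
    if n == 0:
        return 0.0
    # Pairwise distances, cut off at c, raised to the p-th power
    D = np.minimum(cdist(V, W), c) ** p
    # Optimal assignment of the m points in V to points in W
    rows, cols = linear_sum_assignment(D)
    total = D[rows, cols].sum() + c ** p * (n - m)   # cardinality penalty
    return (total / n) ** (1.0 / p)
```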


2.3 Bayesian optimization

A good approach for optimizing expensive black-box functions is Bayesian optimization [12]. In Bayesian optimization, a prior belief about the objective function is used to guide the search. The prior is updated according to Bayes’ theorem, which gives a posterior belief on the objective function. One perspective is that the function to optimize is modelled by a surrogate function that is cheaper to evaluate. The usual choice is to use a Gaussian process (GP), a distribution over functions, as the prior for the optimized function. Promising solutions can be sampled from the distribution and evaluated using the objective function. The GP is then updated according to the evaluated points of the actual objective function to give a posterior distribution that more closely matches the objective function.

This section will give an introduction to Bayesian optimization using GPs.

2.3.1 Gaussian processes

The name Bayesian optimization indicates that the method relies on a prior probability distribution (often just called prior) to work, in this case a prior on the objective function, that is meant to encode any intuitions or domain expertise that exists for the problem [15]. The prior used in Bayesian optimization is often, but not always, a GP [6], [16].

A GP is a stochastic process. Instead of describing random variables which are scalars (or vectors in the case of multivariate distributions) like a regular probability distribution, a GP describes random functions, so that any finite set of function values $f(\mathbf{x}_1), \dots, f(\mathbf{x}_t)$ make up a multivariate normal distribution [17]. This means that drawing a sample from a GP returns an entire function over some domain. A GP is completely specified by its mean function $m(\mathbf{x})$ and its covariance function (also called kernel function) $k(\mathbf{x}, \mathbf{x}')$ and is written as

$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')). \tag{2.4}$$

A way to intuitively understand GPs is to think of a function as a very long vector where each element specifies a function output f(x) for an input x [17]. A GP can then be thought of as a function that, instead of returning a scalar value for a point x, returns the mean and variance of a normal distribution that describes the possible values of f at the point x [16].

To sample a function from a GP in a collection of points $\mathbf{x}_1, \dots, \mathbf{x}_t \in \mathbb{R}^n$, recall that the function values by definition jointly make up a multivariate normal distribution:

$$f_{1:t} \sim \mathcal{N}(m_{1:t}, K) \tag{2.5}$$

where $f_{1:t}$ is used to denote the column vector of function values $(f(\mathbf{x}_1), \dots, f(\mathbf{x}_t))^T$, $m_{1:t}$ is the mean vector given by the GP's mean function, $(m(\mathbf{x}_1), \dots, m(\mathbf{x}_t))^T$, and $K$ is the $(t \times t)$ covariance matrix given by evaluating the kernel function for each pair of points [17]:

$$K = \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_t) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_t, \mathbf{x}_1) & \cdots & k(\mathbf{x}_t, \mathbf{x}_t) \end{bmatrix}. \tag{2.6}$$

Sampling is then performed by evaluating $m_{1:t}$ and $K$ and sampling points from a normal distribution [17].
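As an illustration of this sampling procedure, the following minimal NumPy sketch draws functions from a zero-mean GP prior using the SE kernel defined later in (2.11); the small jitter term is a standard numerical-stability trick and an assumption of this sketch, not part of the text:

```python
import numpy as np

def se_kernel(x1, x2, length_scale=1.0):
    """Squared exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2)), 1-D inputs."""
    r2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-r2 / (2.0 * length_scale ** 2))

# Sample 4 functions from a zero-mean GP prior on a 1-D grid
xs = np.linspace(0, 5, 100)
K = se_kernel(xs, xs) + 1e-9 * np.eye(len(xs))   # jitter for numerical stability
samples = np.random.multivariate_normal(np.zeros(len(xs)), K, size=4)
```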

Regression

The point of using a GP prior in Bayesian optimization is to be able to estimate a value for the objective function $f(\mathbf{x}_{t+1})$ at a point $\mathbf{x}_{t+1}$ for which $f$ has not been evaluated, based on a number of previously evaluated points $\mathbf{x}_1, \dots, \mathbf{x}_t$ with function values $f_{1:t} = (f(\mathbf{x}_1), f(\mathbf{x}_2), \dots, f(\mathbf{x}_t))^T$. This is done by Gaussian process regression [17].


Graphically, Gaussian process regression can loosely be thought of as generating all possible functions from the prior and then rejecting those that do not fit the observed data points; see Figure 2.2 [17]. In practice, though, this technique would be too computationally inefficient. Instead, the joint normal prior distribution is conditioned on the observed data, which is a well known statistical operation. The joint distribution of the sampled function values $f_{1:t}$ and the unknown function value $f_{t+1}$ can be written as

$$\begin{bmatrix} f_{1:t} \\ f_{t+1} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m_{1:t} \\ m_{t+1} \end{bmatrix}, \begin{bmatrix} K & k \\ k^T & k(\mathbf{x}_{t+1}, \mathbf{x}_{t+1}) \end{bmatrix} \right) \tag{2.7}$$

according to the prior, where $K$ is the covariance matrix as defined in (2.6) and $k = (k(\mathbf{x}_{t+1}, \mathbf{x}_1), k(\mathbf{x}_{t+1}, \mathbf{x}_2), \dots, k(\mathbf{x}_{t+1}, \mathbf{x}_t))^T$ is a column vector [16] (which gives the correct covariance matrix for the joint distribution since the kernel function is symmetric [17], that is, $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$). The predictive posterior distribution on $f_{t+1}$ is then given by conditioning the prior distribution on the observed data $\mathcal{D}_{1:t} = \{(\mathbf{x}_1, f(\mathbf{x}_1)), \dots, (\mathbf{x}_t, f(\mathbf{x}_t))\}$ [15], [17]:

$$f_{t+1} \mid \mathcal{D}_{1:t}, \mathbf{x}_{t+1} \sim \mathcal{N}(m_t(\mathbf{x}_{t+1}), \sigma_t^2(\mathbf{x}_{t+1})) \tag{2.8}$$

where

$$m_t(\mathbf{x}_{t+1}) = m(\mathbf{x}_{t+1}) + k^T K^{-1} (f_{1:t} - m_{1:t}) \tag{2.9}$$

$$\sigma_t^2(\mathbf{x}_{t+1}) = k(\mathbf{x}_{t+1}, \mathbf{x}_{t+1}) - k^T K^{-1} k. \tag{2.10}$$

The predicted value of $f_{t+1}$ can be sampled by evaluating (2.9) and (2.10) and then generating samples from the posterior distribution given by (2.8) as described above [17]. Figure 2.2 shows an example GP prior being conditioned on a set of observed data points, producing a posterior GP. Note that in practical implementations, the matrix inverse in (2.9) and (2.10) is usually not calculated, since it is an expensive and numerically unstable operation. Instead, Cholesky decomposition is used [17].
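A minimal sketch of this computation, assuming a zero mean function and using the Cholesky decomposition as recommended above, might look as follows (the function name and interface are illustrative; `kernel` is any kernel callable, for example the se_kernel sketch above):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior(X_train, f_train, X_test, kernel, noise=1e-9):
    """Posterior mean and variance of a zero-mean GP, via Cholesky (2.9)-(2.10)."""
    K = kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = kernel(X_train, X_test)            # covariances to each test point (columns)
    L = cho_factor(K)                        # Cholesky factor instead of explicit inverse
    alpha = cho_solve(L, f_train)            # K^{-1} f
    mean = K_s.T @ alpha                     # (2.9) with m(x) = 0
    v = cho_solve(L, K_s)                    # K^{-1} k for each test point
    var = np.diag(kernel(X_test, X_test)) - np.sum(K_s * v, axis=0)   # (2.10)
    return mean, var
```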

Mean function

The mean function of the GP is most commonly set to a constant, $m_0(\mathbf{x}) = m$, even if other choices also exist [15]. The value of $m$ must then be estimated from the data along with the parameters of the kernel function; see Section 2.3.3.

Kernel function

The kernel function used for the GP is crucial since it encodes any prior knowledge about the objective function and determines the smoothness of the estimated surrogate function [17]. The kernel only depends on the input variables and is meant to describe the similarity between two points x and x′.

A common choice for the kernel function is the squared exponential (SE) kernel, defined as

$$k_{\text{SE}}(r) = \exp\left(-\frac{r^2}{2l^2}\right) \tag{2.11}$$

where $r = \|\mathbf{x} - \mathbf{x}'\|$ is the Euclidean norm of the vector $\mathbf{x} - \mathbf{x}'$ and $l$ is the characteristic length-scale, a hyperparameter of the kernel function itself (not to be confused with the overall parameters being optimized by the whole Bayesian optimization procedure) [17]. Note that $k_{\text{SE}}(r) = 1$ when $r = 0$ (that is, when $\mathbf{x} = \mathbf{x}'$) and $k_{\text{SE}}(r) \to 0$ as $r \to \infty$.

Figure 2.2: Illustration of GP regression in one dimension, where the mean of the GP is shown with a black line and the shaded area represents the 95 % confidence band. (a) shows a GP prior with constant mean 0 and 4 functions sampled randomly from the prior, represented by the coloured lines. (b) shows the GP posterior after conditioning the distribution on the 4 sampled data points (red dots), and 4 new functions sampled randomly from the posterior.

However, the SE kernel is often too smooth to accurately model real objective functions, and the Matérn class of kernels is often recommended instead. A general formula for the Matérn kernels is given in [17], which depends on a parameter $v$. Since functions sampled from a GP using a Matérn kernel are $k$-times differentiable if and only if $v > k$, the parameter $v$ can be used to control the smoothness of the functions, where higher values of $v$ produce smoother functions. The Matérn kernels can be simplified when the value of $v$ is half-integer, and $v = 3/2$ and $v = 5/2$ in particular are usually the most interesting ones, since $v = 1/2$ gives functions which are very rough, and with $v \ge 7/2$ the functions are so smooth that they are often hard to distinguish from functions produced by the SE kernel [17]. The Matérn 3/2 and Matérn 5/2 kernel functions are given by

$$k_{\text{M32}}(r) = \left(1 + \frac{\sqrt{3}r}{l}\right) \exp\left(-\frac{\sqrt{3}r}{l}\right) \tag{2.12}$$

$$k_{\text{M52}}(r) = \left(1 + \frac{\sqrt{5}r}{l} + \frac{5r^2}{3l^2}\right) \exp\left(-\frac{\sqrt{5}r}{l}\right). \tag{2.13}$$

Figure 2.3 illustrates how functions sampled from GPs with different kernel functions differ. The SE kernel function produces very smooth functions, while the Matérn 5/2 and Matérn 3/2 kernels produce increasingly rough functions.

By introducing a separate length-scale hyperparameter $l_d$, $d = 1, \dots, n$, for each input dimension, the automatic relevance determination (ARD) Matérn kernels are obtained [12]:

$$k_{\text{AM32}}(\mathbf{x}, \mathbf{x}') = l_0 \left(1 + \sqrt{3}\, r(\mathbf{x}, \mathbf{x}')\right) \exp\left(-\sqrt{3}\, r(\mathbf{x}, \mathbf{x}')\right) \tag{2.14}$$

$$k_{\text{AM52}}(\mathbf{x}, \mathbf{x}') = l_0 \left(1 + \sqrt{5}\, r(\mathbf{x}, \mathbf{x}') + \frac{5}{3} r^2(\mathbf{x}, \mathbf{x}')\right) \exp\left(-\sqrt{5}\, r(\mathbf{x}, \mathbf{x}')\right) \tag{2.15}$$

where $l_0$ has been introduced as a covariance amplitude hyperparameter and

$$r(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{d=1}^{n} \frac{(x_d - x'_d)^2}{l_d^2}} \tag{2.16}$$


Figure 2.3: Randomly sampled functions from GPs using different kernel functions (SE, Matérn 3/2 and Matérn 5/2). All kernel functions used the same length-scale $l = 1$.

where n is the number of input dimensions. By using the ARD kernels, the relevance of each input dimension can be learned from data, which leads to irrelevant input being removed from the inference [17].
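As an illustration, a direct NumPy transcription of the ARD Matérn 5/2 kernel (2.15) with the scaled distance (2.16) might look like this sketch (function and parameter names are illustrative):

```python
import numpy as np

def ard_matern52(X1, X2, l0=1.0, lengthscales=None):
    """ARD Matérn 5/2 kernel (2.15) with per-dimension length-scales (2.16).
    X1: (m, n) array, X2: (k, n) array; returns an (m, k) kernel matrix."""
    if lengthscales is None:
        lengthscales = np.ones(X1.shape[1])
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales   # scaled differences
    r = np.sqrt((diff ** 2).sum(-1))                          # distance r(x, x')
    return l0 * (1 + np.sqrt(5) * r + 5.0 / 3.0 * r ** 2) * np.exp(-np.sqrt(5) * r)
```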

2.3.2 Acquisition function

Bayesian optimization essentially consists of two parts: a predictive model (GP regression) and an acquisition function for choosing where to sample the objective function next [18]. The GP regression described previously produces a posterior over functions. The acquisition function uses this posterior to decide which point should be sampled next. After sampling the objective function in $t$ points, the next point to sample, $\mathbf{x}_{t+1}$, is given by $\mathbf{x}_{t+1} = \operatorname{argmax}_{\mathbf{x}} u(\mathbf{x})$, where $u$ denotes the acquisition function [12].

There are several acquisition functions that can be used and this section will introduce the most common ones. Figure 2.4 shows the acquisition functions presented. Note that while this section describes the maximization of an objective function f(x), the same arguments hold for minimization by defining g(x) = −f(x) and then maximizing g(x) instead.

Probability of improvement

The probability of improvement (PI) acquisition function maximizes the probability that the new point to sample is better than the previous best. It is defined as

$$u_{\text{PI}}(\mathbf{x}) = P(f(\mathbf{x}) \ge f(\mathbf{x}^+)) = \Phi\left(\frac{m_t(\mathbf{x}) - f(\mathbf{x}^+)}{\sigma_t(\mathbf{x})}\right) \tag{2.17}$$

where $\Phi(\cdot)$ is the normal cumulative distribution function and $\mathbf{x}^+$ denotes the best solution out of the solutions evaluated so far [16]. $m_t(\mathbf{x})$ and $\sigma_t(\mathbf{x})$ are given by (2.9) and (2.10).

The PI acquisition function may seem like an intuitive strategy, but it has a big issue in practice: it focuses purely on exploitation rather than exploration, since points with a high probability of giving a small improvement over the current best will be preferred over points which could give a very big improvement but with less certainty [16].

Figure 2.4: At the top is a GP posterior distribution which has been fitted to 4 data points. Below it, the corresponding PI and EI acquisition functions are shown with their respective maxima (points to sample next) shown with green triangles.

Expected improvement

Rather than just maximizing the probability that the newly generated solution is better than the previous best as with the PI acquisition function, it might be a good idea to take into account how big of an improvement a new solution is expected to produce. This is the idea behind the expected improvement (EI) acquisition function [19], [20].

Assume that the objective function has already been sampled in $t$ points $\mathbf{x}_{1:t}$, with

$$f(\mathbf{x}^+) = \max_{i=1,\dots,t} f(\mathbf{x}_i) \tag{2.18}$$

being the best objective function value observed so far. If the objective function is evaluated at a new point $\mathbf{x}_{t+1}$, then the new best value is given by

$$f^+_{t+1} = \max(f(\mathbf{x}_{t+1}), f(\mathbf{x}^+)) \tag{2.19}$$

and the improvement over the previous best is then

$$f^+_{t+1} - f(\mathbf{x}^+) = \max(f(\mathbf{x}_{t+1}) - f(\mathbf{x}^+), 0). \tag{2.20}$$

It would be desirable to maximize the improvement as given by (2.20) directly [15]; however, that assumes $f(\mathbf{x}_{t+1})$ is already known for all $\mathbf{x}_{t+1}$, which it is not. Instead, the GP posterior is used to maximize the expected value of the improvement given the already observed points:

$$u_{\text{EI}}(\mathbf{x}) = \mathbb{E}\left[\max(f(\mathbf{x}) - f(\mathbf{x}^+), 0) \mid \mathcal{D}_{1:t}\right]. \tag{2.21}$$


The EI acquisition function can be expressed more explicitly as

$$u_{\text{EI}}(\mathbf{x}) = (m_t(\mathbf{x}) - f(\mathbf{x}^+)) \, \Phi\left(\frac{m_t(\mathbf{x}) - f(\mathbf{x}^+)}{\sigma_t(\mathbf{x})}\right) + \sigma_t(\mathbf{x}) \, \varphi\left(\frac{m_t(\mathbf{x}) - f(\mathbf{x}^+)}{\sigma_t(\mathbf{x})}\right) \tag{2.22}$$

where $\Phi(\cdot)$ is the normal cumulative distribution function, $\varphi(\cdot)$ is the normal probability density function, and $m_t(\mathbf{x})$ and $\sigma_t(\mathbf{x})$ are given by (2.9) and (2.10). A complete derivation of (2.22) can be found in [15].
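Given posterior means and standard deviations from the GP (for example from the gp_posterior sketch above), both acquisition functions (2.17) and (2.22) amount to a few lines of vectorized code; this sketch is illustrative, not from the thesis:

```python
import numpy as np
from scipy.stats import norm

def acquisition_pi(mean, std, f_best):
    """Probability of improvement (2.17), for maximization."""
    z = (mean - f_best) / std
    return norm.cdf(z)

def acquisition_ei(mean, std, f_best):
    """Expected improvement (2.22), for maximization."""
    z = (mean - f_best) / std
    return (mean - f_best) * norm.cdf(z) + std * norm.pdf(z)
```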

Optimizing the acquisition function

Having defined the acquisition function, it needs to be optimized to determine where to sample the objective function next. This is also a global optimization problem, just like the original problem of optimizing the objective function. Unlike the objective function, however, sampling the acquisition function is cheap, and its derivatives can be computed. This means that it is relatively easy to optimize, for example by repeatedly running gradient ascent for local search from many randomly selected start points and then picking the best solution found [15]. The maxima of the PI and EI acquisition functions for an example GP posterior are shown in Figure 2.4.
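A minimal sketch of such a multi-start strategy, here using SciPy's bounded local optimizer rather than hand-rolled gradient ascent (names and defaults are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def propose_next(acq, bounds, n_restarts=25, rng=None):
    """Maximize an acquisition function acq(x) by multi-start local search.
    bounds: array of shape (n, 2) with (min, max) per dimension."""
    rng = np.random.default_rng(rng)
    best_x, best_val = None, -np.inf
    for _ in range(n_restarts):
        x0 = rng.uniform(bounds[:, 0], bounds[:, 1])
        # scipy only minimizes, so minimize the negated acquisition
        res = minimize(lambda x: -acq(x), x0, bounds=bounds)
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x
```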

2.3.3 Dealing with hyperparameters

By using the ARD Matérn kernels, a number of hyperparameters have been introduced which control the behaviour of the GP, namely the covariance amplitude $l_0$ and the length-scales $l_1, \dots, l_n$. If using a constant mean function, the constant $m$ is also a hyperparameter which must be set. This raises the potential issue that the parameter tuning problem has only been pushed one step into the future, since now there are new parameters that need to be tuned instead of those of the objective function. However, the hyperparameters of Bayesian optimization, summarized as $\theta$, can be inferred from data.

There are a few approaches for estimating the hyperparameters from data. One approach is to find the maximum likelihood estimate (MLE). The idea is to find the hyperparameter values which make the observed data as likely as possible under the prior [18]. The likelihood of the data given the hyperparameters, $P(\mathcal{D}_{1:t} \mid \theta)$, is a multivariate normal distribution, and the MLE is given by

$$\hat{\theta} = \operatorname*{argmax}_{\theta} P(\mathcal{D}_{1:t} \mid \theta) \tag{2.23}$$

which can be optimized by a standard numerical optimizer.

Another common approach is, rather than optimizing the likelihood, to optimize the marginal likelihood. This method is called empirical Bayes [18]. Under the Gaussian process, the log marginal likelihood is given by

$$\log P(\mathbf{f} \mid X, \theta) = -\frac{1}{2} \mathbf{f}^T K^{-1} \mathbf{f} - \frac{1}{2} \log |K| - \frac{n}{2} \log 2\pi \tag{2.24}$$

where $X$ is a matrix of input vectors $\mathbf{x}$ and $|K|$ denotes the determinant of $K$ [17]. To maximize the log marginal likelihood, the partial derivatives of (2.24) can be computed and then a gradient-based method can be used [17].

Empirical Bayes can be seen as an approximation of the fully Bayesian approach [18]. In this approach, marginalization is done over all possible values of the hyperparameters by using the integrated acquisition function [12]:

$$\hat{u}(\mathbf{x}) = \int u(\mathbf{x}; \theta) P(\theta \mid \mathcal{D}_{1:t}) \, d\theta. \tag{2.25}$$

This integral is generally intractable, but it can be approximated by sampling using a Markov chain Monte Carlo (MCMC) method such as slice sampling [21].


2.3.4 Algorithm summary

Algorithm 1 shows the pseudocode for Bayesian optimization, based on [16], [18].

Sample objective function at $t_0$ initial space-filling points
for $t = (t_0 + 1), (t_0 + 2), \dots$ do
    Update the posterior GP on $\mathcal{D}_{1:t-1}$
    Find $\mathbf{x}_t$ by optimizing the acquisition function over the GP: $\mathbf{x}_t = \operatorname{argmax}_{\mathbf{x}} u(\mathbf{x} \mid \mathcal{D}_{1:t-1})$
    Sample objective function at $\mathbf{x}_t$: $y_t = f(\mathbf{x}_t)$
    Add $y_t$ to the data: $\mathcal{D}_{1:t} = \{\mathcal{D}_{1:t-1}, (\mathbf{x}_t, y_t)\}$
end for

Algorithm 1: Bayesian optimization.
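In practice, off-the-shelf implementations of this loop exist. As an illustration, assuming the scikit-optimize package is available, a minimal usage sketch could look like the following, where the quadratic objective is a toy stand-in for an expensive black-box objective such as the localization objective in this work:

```python
from skopt import gp_minimize

# Toy stand-in for an expensive black-box objective
def objective(x):
    return (x[0] - 0.3) ** 2 + (x[1] + 0.7) ** 2

result = gp_minimize(
    objective,
    dimensions=[(-2.0, 2.0), (-2.0, 2.0)],   # box bounds per parameter
    acq_func="EI",                            # expected improvement
    n_calls=30,                               # total objective evaluations
    n_initial_points=10,                      # space-filling initial design
    random_state=0,
)
print(result.x, result.fun)
```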

2.4 CMA-ES

Another approach for optimizing black-box functions is found in the field of evolutionary algorithms, where inspiration is taken from natural evolution [22]. Evolutionary algorithms can be divided into several smaller fields which focus on different aspects of evolution. For example, the genetic algorithm (and its variants) performs recombination of solution proposals, called individuals in the field of evolutionary computing, according to their fitness, and introduces random errors in the genes (mutation) [11]. Evolution strategies do away with the mechanisms of genetics and instead focus on selection of the fittest individuals, mutation and recombination [22].

This work will investigate a particular evolutionary algorithm called the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). It has been shown to be very competitive among black-box optimization algorithms [23], [24].

In CMA-ES, individuals are drawn from a multivariate normal distribution. This set of solutions is called a population. Each individual in the population is then evaluated using the objective function and the mean and covariance matrix of the normal distribution are updated to better match the best individuals of the population. This process is repeated for several iterations, called generations. See Figure 2.5 for an illustration of how CMA-ES converges. This section will first introduce the standard CMA-ES algorithm in more detail, and then present some variations which have been shown to improve performance. Hansen provides an overview of the standard CMA-ES algorithm in [25] and [26], and this section is a combination of these two sources.

2.4.1 Sampling

The first step needed is the generation of new individuals, which is done by sampling a multivariate normal distribution, using the equation

$$\mathbf{x}^{(g+1)}_k \sim \mathbf{m}^{(g)} + \sigma^{(g)} \mathcal{N}(0, C^{(g)}) \quad \text{for } k = 1, \dots, \lambda \tag{2.26}$$

for generation $g = 0, 1, \dots$, where $\mathbf{x}^{(g+1)}_k$ is individual number $k$ in generation $g+1$, $\mathbf{m}^{(g)}$ is the mean of the search distribution in generation $g$, $\sigma^{(g)}$ is the standard deviation, or step size, in generation $g$, $\mathcal{N}(0, C^{(g)})$ is the multivariate normal distribution with covariance matrix $C^{(g)}$ in generation $g$, and $\lambda$ is the population size.
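Since $\mathbf{m}^{(g)} + \sigma^{(g)} \mathcal{N}(0, C^{(g)})$ is simply $\mathcal{N}(\mathbf{m}^{(g)}, \sigma^{(g)2} C^{(g)})$, the sampling step is one line in NumPy (the function name is illustrative):

```python
import numpy as np

def sample_population(mean, sigma, C, lam, rng):
    """Draw lambda individuals according to (2.26)."""
    return rng.multivariate_normal(mean, (sigma ** 2) * C, size=lam)

# Example: rng = np.random.default_rng(0); pop = sample_population(m, 0.5, C, 10, rng)
```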

The sampled individuals are then evaluated using the objective function, which gives a value for how good, or how fit, an individual is. The next question is how to update the mean and covariance of the search distribution for the next generation based on these samples.


Figure 2.5: Illustration of the convergence of CMA-ES for a 2-dimensional problem over 6 generations (panels (a)-(f) show generations 0-5). The light shaded lines show the global optimum of the problem. The solid black ellipse represents the search distribution in each step and the dots show a population of sampled individuals. In each generation, the search distribution's mean and covariance matrix are updated to better reflect the most successful individuals.

2.4.2 Updating the mean

For updating the mean, a weighted average of the $\mu$ best individuals in the current population $\mathbf{x}^{(g+1)}_1, \dots, \mathbf{x}^{(g+1)}_\lambda$ is used:

$$\mathbf{m}^{(g+1)} = \sum_{i=1}^{\mu} w_i \mathbf{x}^{(g+1)}_{i:\lambda}, \qquad \sum_{i=1}^{\mu} w_i = 1 \tag{2.27}$$

where $w_1 \ge w_2 \ge \dots \ge w_\mu \ge 0$ are weight coefficients and $\mathbf{x}^{(g+1)}_{i:\lambda}$ denotes the $i$-th best ranking individual in generation $g+1$, meaning $f(\mathbf{x}^{(g+1)}_{1:\lambda}) \le f(\mathbf{x}^{(g+1)}_{2:\lambda}) \le \dots \le f(\mathbf{x}^{(g+1)}_{\lambda:\lambda})$ (in the case of minimization), where $f$ is the objective function. A more general form of (2.27) including a learning rate $c_m$ is presented in [26], but Hansen notes that $c_m$ is usually set to 1, in which case the equations are equal.

For use later, the measure

$$\mu_{\text{eff}} = \left( \sum_{i=1}^{\mu} w_i^2 \right)^{-1} \tag{2.28}$$

is defined. Since $\sum_{i=1}^{\mu} w_i = 1$, it can be derived that $1 \le \mu_{\text{eff}} \le \mu$. According to Hansen, $\mu_{\text{eff}} \approx \lambda/4$ is usually a reasonable setting.
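A direct transcription of the weighted recombination (2.27) might look like this sketch (illustrative names; a minimization problem is assumed, so individuals are ranked by ascending objective value):

```python
import numpy as np

def update_mean(population, fitness, mu, weights):
    """Weighted recombination (2.27): weighted mean of the mu best individuals."""
    order = np.argsort(fitness)          # ascending: best individual first
    best = population[order[:mu]]        # the mu best individuals, shape (mu, n)
    return weights @ best                # sum_i w_i * x_{i:lambda}
```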


2.4.3 Updating the covariance matrix

The strength of CMA-ES lies in smart updates of the covariance matrix. As a first step, the covariance matrix is estimated using the already sampled points, which is done by using the empirical covariance matrix

$$C^{(g+1)}_{\text{emp}} = \frac{1}{\lambda - 1} \sum_{i=1}^{\lambda} \left( \mathbf{x}^{(g+1)}_i - \frac{1}{\lambda} \sum_{j=1}^{\lambda} \mathbf{x}^{(g+1)}_j \right) \left( \mathbf{x}^{(g+1)}_i - \frac{1}{\lambda} \sum_{j=1}^{\lambda} \mathbf{x}^{(g+1)}_j \right)^T, \tag{2.29}$$

assuming that $\lambda$ is large enough. Note that the expression $\frac{1}{\lambda} \sum_{j=1}^{\lambda} \mathbf{x}^{(g+1)}_j$ in (2.29) estimates the mean of the actual distribution in generation $g+1$ by the sample mean of the sampled individuals [27]. If the true mean value in generation $g$ is used instead, the covariance matrix can be estimated by

$$C^{(g+1)}_\lambda = \frac{1}{\lambda} \sum_{i=1}^{\lambda} \left( \mathbf{x}^{(g+1)}_i - \mathbf{m}^{(g)} \right) \left( \mathbf{x}^{(g+1)}_i - \mathbf{m}^{(g)} \right)^T. \tag{2.30}$$

Both $C^{(g+1)}_{\text{emp}}$ and $C^{(g+1)}_\lambda$ are unbiased estimators of $C^{(g)}$, meaning that $\mathbb{E}[C^{(g+1)}_{\text{emp}} \mid C^{(g)}] = C^{(g)}$. While (2.29) re-estimates $C^{(g)}$ based on a number of sampled points, (2.30) re-estimates the covariance matrix based on the actual mean of the distribution. Another perspective is that $C^{(g+1)}_{\text{emp}}$ estimates the covariance matrix based on a number of points, while $C^{(g+1)}_\lambda$ estimates the covariance matrix based on a number of steps, from the mean of the previous generation to sampled points of the new one, written as the vectors $\mathbf{x}^{(g+1)}_i - \mathbf{m}^{(g)}$.

So far, the original covariance matrix has just been re-estimated. To direct the search towards better individuals, weighted selection, similar to how the mean was estimated in (2.27), is implemented:

$$C^{(g+1)}_\mu = \sum_{i=1}^{\mu} w_i \left( \mathbf{x}^{(g+1)}_{i:\lambda} - \mathbf{m}^{(g)} \right) \left( \mathbf{x}^{(g+1)}_{i:\lambda} - \mathbf{m}^{(g)} \right)^T \tag{2.31}$$

where, as before, $\mathbf{x}^{(g+1)}_{i:\lambda}$ is the $i$-th best ranking individual in generation $g+1$ and where the uniform weight of $\frac{1}{\lambda}$ is replaced by more flexible weights $w_i$. The distribution with covariance matrix $C^{(g+1)}_\mu$ will be more likely to generate samples closer to the $\mu$ most successful samples in generation $g$. Figure 2.6 shows how the covariance matrix is estimated from $\mu$ selected samples.

Rank-µ update

The problem with updating the covariance matrix by (2.31) is that it requires a large value of $\mu_{\text{eff}}$ as defined in (2.28), and by extension large populations, to be a reliable estimator. However, to achieve a fast search it would be preferable to generate smaller populations, since that would mean fewer evaluations of the objective function. To balance these requirements, memory of previous generations is used.

The mean of the estimated covariance matrices from all generations is given by

$$C^{(g+1)} = \frac{1}{g+1} \sum_{i=0}^{g} \frac{1}{\sigma^{(i)2}} C^{(i+1)}_\mu \tag{2.32}$$

where the covariance matrix from each generation $g$ is "normalized" by the step size $\sigma^{(g)2}$ to make matrices from different generations comparable. This becomes a good estimator for the selected steps after a large enough number of generations. However, information from all generations is given the same weight. It is reasonable to assume that more recent generations have incorporated more relevant information which should be exploited.

Figure 2.6: Covariance matrix estimation in CMA-ES, in three panels: (a) sampling points, (b) estimation, (c) new search distribution. The gray lines indicate the objective function, with the optimum in the top right corner. (a) shows the 95 % confidence region of the initial search distribution $\mathcal{N}(0, I)$ with $\lambda = 100$ sampled points. In (b), the covariance matrix has been updated according to (2.31) with $w_i = 1/\mu$, where the step vectors from the current generation's mean to the $\mu = 30$ best individuals are shown in the figure and the next generation's mean, estimated by (2.27), is represented by a blue ×. (c) shows the new search distribution for the next generation.

For this purpose, exponential smoothing is introduced, which results in the following formula for updating the covariance matrix:

$$C^{(g+1)} = (1 - c_\mu) C^{(g)} + c_\mu \frac{1}{\sigma^{(g)2}} C^{(g+1)}_\mu \tag{2.33}$$

where the initial covariance matrix is set to the identity matrix, $C^{(0)} = I$, and a learning rate $c_\mu \in [0, 1]$ for the covariance matrix is introduced. By substituting (2.31) for $C^{(g+1)}_\mu$, the formula implementing the rank-$\mu$ update is

$$C^{(g+1)} = (1 - c_\mu) C^{(g)} + c_\mu \sum_{i=1}^{\mu} w_i \left( \frac{\mathbf{x}^{(g+1)}_{i:\lambda} - \mathbf{m}^{(g)}}{\sigma^{(g)}} \right) \left( \frac{\mathbf{x}^{(g+1)}_{i:\lambda} - \mathbf{m}^{(g)}}{\sigma^{(g)}} \right)^T. \tag{2.34}$$

The choice of $c_\mu$ is critical for successfully optimizing functions. If $c_\mu = 0$, no learning is performed and $C^{(g+1)} = C^{(0)}$ for all generations. On the other hand, if $c_\mu = 1$, no information from previous generations is used and the covariance matrix is updated as in (2.31), except for a scaling with the step size $\sigma^{(g)2}$. In [26] it is noted that the parameter value seems to be independent of the objective function; through empirical evaluations, the author found that $c_\mu \approx \min(1, \mu_{\text{eff}}/n^2)$ is a good choice that worked for all non-noisy objective functions tried so far.

Rank-one update

So far, the covariance matrix has been estimated using all the selected steps in a single generation. The rank-one update instead uses only one single selected step for updating the covariance matrix. By using a single sample ($\mu = 1$), (2.34) can be modified into the rank-one update rule:

$$C^{(g+1)} = (1 - c_1) C^{(g)} + c_1 \left( \frac{\mathbf{x}^{(g+1)}_{1:\lambda} - \mathbf{m}^{(g)}}{\sigma^{(g)}} \right) \left( \frac{\mathbf{x}^{(g+1)}_{1:\lambda} - \mathbf{m}^{(g)}}{\sigma^{(g)}} \right)^T \tag{2.35}$$

which adds the maximum likelihood term of the normalized step $\frac{\mathbf{x}^{(g+1)}_{1:\lambda} - \mathbf{m}^{(g)}}{\sigma^{(g)}}$ into the covariance matrix $C^{(g)}$.


Cumulation

Figure 2.7: Several steps of the algorithm with the corresponding evolution path marked in blue.

Next, the concept of an evolution path is introduced. An evolution path is a sequence of steps taken by the algorithm over a number of generations, which can be expressed as a sum of the step vectors:

$$\mathbf{p}^{(g+1)} = \sum_{i=g+1-s}^{g} \frac{\mathbf{m}^{(i+1)} - \mathbf{m}^{(i)}}{\sigma^{(i)}} \tag{2.36}$$

where $s$ is the number of steps to include in the evolution path. The summation of the evolution path is what is called cumulation. See Figure 2.7 for an illustration. Note that the step size $\sigma^{(g)}$ is disregarded in the evolution path. By introducing exponential smoothing in the same way as in (2.33), the formula for the evolution path becomes

$$\mathbf{p}^{(g+1)}_c = (1 - c_c) \mathbf{p}^{(g)}_c + \sqrt{c_c (2 - c_c) \mu_{\text{eff}}} \, \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}} \tag{2.37}$$

where $\mathbf{p}^{(g)}_c \in \mathbb{R}^n$ is the evolution path in generation $g$ (with $\mathbf{p}^{(0)}_c = 0$) and $c_c \in [0, 1]$ is a smoothing constant for the evolution path. The normalization factor $\sqrt{c_c (2 - c_c) \mu_{\text{eff}}}$ is chosen so that $\mathbf{p}^{(g+1)}_c \sim \mathcal{N}(0, C)$ holds for every generation.

Now the rank-one update rule in (2.35) can be adapted to use the evolution path:

$$C^{(g+1)} = (1 - c_1) C^{(g)} + c_1 \mathbf{p}^{(g+1)}_c \mathbf{p}^{(g+1)T}_c \tag{2.38}$$

where $c_1 \approx 2/n^2$ is an empirically validated choice for the rank-one learning rate according to Hansen [26].

For the final update rule for the covariance matrix, rank-one updates using cumulation and rank-$\mu$ updates are combined. Combining (2.38) and (2.34) results in the final covariance matrix update rule:

$$C^{(g+1)} = (1 - c_1 - c_\mu) C^{(g)} + \underbrace{c_1 \mathbf{p}^{(g+1)}_c \mathbf{p}^{(g+1)T}_c}_{\text{rank-one update}} + \underbrace{c_\mu \sum_{i=1}^{\mu} w_i \left( \frac{\mathbf{x}^{(g+1)}_{i:\lambda} - \mathbf{m}^{(g)}}{\sigma^{(g)}} \right) \left( \frac{\mathbf{x}^{(g+1)}_{i:\lambda} - \mathbf{m}^{(g)}}{\sigma^{(g)}} \right)^T}_{\text{rank-}\mu\text{ update}} \tag{2.39}$$

which combines the benefits of both the rank-one update and rank-µ update. It uses the evolution path in the rank-one update to incorporate correlations between generations, which is important for small populations, while also using the information in the current population efficiently by using the rank-µ update, which is important for larger populations.
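As a sketch, the combined update (2.39) is only a few lines given the evolution path and the normalized steps of the $\mu$ best individuals (names are illustrative):

```python
import numpy as np

def update_covariance(C, p_c, steps, weights, c1, c_mu):
    """Combined rank-one and rank-mu covariance update (2.39).
    steps: (mu, n) array of normalized steps (x_{i:lambda} - m) / sigma,
    ordered from best to mu-th best individual."""
    rank_one = np.outer(p_c, p_c)
    rank_mu = sum(w * np.outer(y, y) for w, y in zip(weights, steps))
    return (1 - c1 - c_mu) * C + c1 * rank_one + c_mu * rank_mu
```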

2.4.4 Step size control

So far, the center of the search distribution has been moved by updating the mean, and the shape of the distribution has been altered to prefer successful individuals by updating the covariance matrix. What remains is to adapt the step size $\sigma^{(g)}$, or the "scale", of the distribution. This is necessary because, even with the largest reliable learning rate, the covariance matrix update in (2.39) is too slow to be competitive. The step size control is done by cumulative step length adaptation (CSA), which utilizes an evolution path as in Section 2.4.3.


• If the evolution path is short, the individual steps in the path cancel each other out and the step size should be decreased. See Figure 2.8a.

• If the evolution path is long, the steps are pointing in a similar direction and could potentially be replaced by fewer, longer steps. In this scenario, the step size should be increased. See Figure 2.8c.

• The desired scenario occurs when the steps are close to perpendicular to each other as in Figure 2.8b.

Figure 2.8: Three different scenarios for the evolution path (marked in blue) after six steps of comparable size: (a) the step size is too long, (b) the desired scenario, (c) the step size is too short.

The question remaining then is how to decide if an evolution path is long or short. To make this decision, the actual evolution path will be compared to the expected length under random selection. Under random selection, each step is independent and uncorrelated, which is the desired situation. So in the desired scenario, the length of the evolution path should be approximately the same as the expected length. If the actual evolution path is biased to be longer than expected, the step size is increased, and if the evolution path is shorter than expected the step size is decreased.

The conjugate evolution path is constructed in a similar way to (2.37):

$$\mathbf{p}^{(g+1)}_\sigma = (1 - c_\sigma) \mathbf{p}^{(g)}_\sigma + \sqrt{c_\sigma (2 - c_\sigma) \mu_{\text{eff}}} \; C^{(g)-\frac{1}{2}} \, \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}} \tag{2.40}$$

where $\mathbf{p}^{(g)}_\sigma \in \mathbb{R}^n$ is the conjugate evolution path in generation $g$ (with $\mathbf{p}^{(0)}_\sigma = 0$), $c_\sigma \in [0, 1]$ is a smoothing constant for the evolution path, and the normalization factor $\sqrt{c_\sigma (2 - c_\sigma) \mu_{\text{eff}}}$ is chosen as in (2.37).

In the evolution path defined in (2.37), the expected length depends on the direction. In (2.40), $C^{(g)-\frac{1}{2}}$ is introduced to rescale the step $\frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}}$ in the coordinate system given by the eigenvectors of $C^{(g)}$ (more details in [26]). The purpose of this is to make sure that the expected length of $\mathbf{p}^{(g+1)}_\sigma$ does not depend on its direction. A further consequence is that, given that $\mathbf{p}^{(0)}_\sigma \sim \mathcal{N}(0, I)$, under random selection $\mathbf{p}^{(g+1)}_\sigma \sim \mathcal{N}(0, I)$ for any sequence of covariance matrices.

The realized evolution path's length is given by the Euclidean norm $\|\mathbf{p}^{(g+1)}_\sigma\|$ and its expected length is given by $\mathbb{E}[\|\mathcal{N}(0, I)\|]$. The step length update is performed by comparing these two values:

$$\sigma^{(g+1)} = \sigma^{(g)} \exp\left( \frac{c_\sigma}{d_\sigma} \left( \frac{\|\mathbf{p}^{(g+1)}_\sigma\|}{\mathbb{E}[\|\mathcal{N}(0, I)\|]} - 1 \right) \right) \tag{2.41}$$


where $d_\sigma \approx 1$ is a damping parameter. In the desired scenario, where $\|\mathbf{p}^{(g+1)}_\sigma\| = \mathbb{E}[\|\mathcal{N}(0, I)\|]$, the exponent in (2.41) is 0 and the step size is unchanged. In the other scenarios, the step size is adapted to be longer or shorter. By using the exponential factor for the update rule, the step size can be adapted very quickly. Using the evolution path for updating the step size is a heuristic that is based on intuition and has been validated empirically as a good strategy [25].

2.4.5 Stopping criteria

When running the algorithm, there needs to be some way to know when to stop searching. Since in general there is no way for the algorithm to know for sure if the current best solution is a global optimum, there are eight stopping criteria defined that are designed to stop (or restart, see Section 2.4.8) the execution when better solutions are unlikely to be found:

• TolFun: the range of the best values of the last $10 + \lceil 30n/\lambda \rceil$ generations, together with all values of the current generation, is close to zero (that is, less than TolFun).

• TolX: the standard deviation of the search distribution in all coordinates and all components of $\sigma \mathbf{p}_c$ are smaller than TolX.

• NoEffectAxis: the mean $\mathbf{m}$ is not affected when adding a 0.1-standard deviation to it in the principal axis direction of $C$.

• NoEffectCoord: the mean $\mathbf{m}$ is not affected when adding a 0.2-standard deviation in any single coordinate.

• ConditionCov: the condition number of the covariance matrix is larger than $10^{14}$ (larger values can lead to numerical errors [28]).

• EqualFunValues: the range of the best values of the last $10 + \lceil 30n/\lambda \rceil$ generations is zero (similar to TolFun).

• Stagnation: the algorithm keeps track of the best and the median values in the last 20 % of generations (at least $120 + 30n/\lambda$ and at most 20000 generations are kept track of), and if the median of the most recent 30 % of values is not better than the median of the 30 % oldest values in both histories, the algorithm stops.

• TolXUp: $\sigma \times \max(\operatorname{diag}(D))$ is increased by more than TolXUp ($10^4$ by default).

The first two criteria are problem dependent and need to be set to appropriate values for each application.

2.4.6 Boundary handling

The CMA-ES does not by itself impose any bounds on the values of $\mathbf{x}$. In some applications there may be bounds on how far the values are allowed to grow in each dimension. There are several proposed ways to handle this, which are to some extent problem dependent. In the case of box-constraints, where each dimension of $\mathbf{x}$ must be strictly inside the bounds of a minimum and maximum value, one approach is to repair the solution with penalization. An invalid solution is repaired by moving it to the closest solution on the bounds. A wrapper function around the objective function is defined, which evaluates the true objective function on the repaired solution $\mathbf{x}_{\text{repaired}}$ and applies a penalty based on how big of a repair was necessary [26]:

$$g(\mathbf{x}) = f(\mathbf{x}_{\text{repaired}}) + \alpha \|\mathbf{x} - \mathbf{x}_{\text{repaired}}\|^2 \tag{2.42}$$

where $\alpha$ is a constant defining how much an infeasible solution should be penalized.
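A minimal sketch of this repair-and-penalize wrapper for box constraints, with np.clip providing the closest feasible point (the names and the default value of $\alpha$ are illustrative assumptions):

```python
import numpy as np

def penalized_objective(f, lower, upper, alpha=1.0):
    """Wrap objective f with box-constraint repair and penalty (2.42)."""
    def g(x):
        x_repaired = np.clip(x, lower, upper)    # closest feasible point on the bounds
        return f(x_repaired) + alpha * np.sum((x - x_repaired) ** 2)
    return g
```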


2.4.7 Algorithm summary

The pseudocode in Algorithm 2 summarizes the algorithm that has been derived in this section and is based on [25]. The algorithm has several parameters that can be defined to tune its behaviour. Many of them have default values defined by the authors of the algorithm which are independent of the objective function and are not recommended to be changed. The parameters which are problem dependent, and therefore need to be set by the user when running the algorithm, are the initial distribution mean $\mathbf{m}^{(0)}$ and the initial step size $\sigma^{(0)}$.


Set parameters to default values:
    $\lambda = 4 + \lfloor 3 \ln n \rfloor$, $\mu = \lfloor \lambda/2 \rfloor$
    $w_i = \frac{\ln(\mu+1) - \ln i}{\sum_{j=1}^{\mu} (\ln(\mu+1) - \ln j)}$ for $i = 1, \dots, \mu$, so that $\sum_{i=1}^{\mu} w_i = 1$
    $\mu_{\text{eff}} = \left(\sum_{i=1}^{\mu} w_i^2\right)^{-1}$
    $c_\sigma = \frac{\mu_{\text{eff}} + 2}{n + \mu_{\text{eff}} + 5}$, $d_\sigma = 1 + 2 \max\left(0, \sqrt{\frac{\mu_{\text{eff}} - 1}{n + 1}} - 1\right) + c_\sigma$
    $c_c = \frac{4 + \mu_{\text{eff}}/n}{n + 4 + 2\mu_{\text{eff}}/n}$, $c_1 = \frac{2}{(n + 1.3)^2 + \mu_{\text{eff}}}$, $c_\mu = \min\left(1 - c_1,\ 2\,\frac{\mu_{\text{eff}} - 2 + 1/\mu_{\text{eff}}}{(n + 2)^2 + \mu_{\text{eff}}}\right)$

Choose (problem dependent) $\mathbf{m}^{(0)} \in \mathbb{R}^n$ and $\sigma^{(0)} \in \mathbb{R}$

Initialize evolution paths $\mathbf{p}^{(0)}_\sigma = 0$ and $\mathbf{p}^{(0)}_c = 0$, and covariance matrix $C^{(0)} = I$

For generation $g = 0, 1, 2, \dots$, until stopping criteria met:
    Sample new population of $\lambda$ individuals: $\mathbf{x}^{(g+1)}_k \sim \mathbf{m}^{(g)} + \sigma^{(g)} \mathcal{N}(0, C^{(g)})$ for $k = 1, \dots, \lambda$
    Evaluate the objective function $f(\mathbf{x}_k)$ for $k = 1, \dots, \lambda$
    Update the mean: $\mathbf{m}^{(g+1)} = \sum_{i=1}^{\mu} w_i \mathbf{x}^{(g+1)}_{i:\lambda}$
    Step size control:
        $\mathbf{p}^{(g+1)}_\sigma = (1 - c_\sigma) \mathbf{p}^{(g)}_\sigma + \sqrt{c_\sigma (2 - c_\sigma) \mu_{\text{eff}}}\; C^{(g)-\frac{1}{2}} \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}}$
        $\sigma^{(g+1)} = \sigma^{(g)} \exp\left(\frac{c_\sigma}{d_\sigma}\left(\frac{\|\mathbf{p}^{(g+1)}_\sigma\|}{\mathbb{E}[\|\mathcal{N}(0, I)\|]} - 1\right)\right)$
    Update the covariance matrix:
        $\mathbf{p}^{(g+1)}_c = (1 - c_c) \mathbf{p}^{(g)}_c + \sqrt{c_c (2 - c_c) \mu_{\text{eff}}}\; \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}}$
        $C^{(g+1)} = (1 - c_1 - c_\mu) C^{(g)} + c_1 \mathbf{p}^{(g+1)}_c \mathbf{p}^{(g+1)T}_c + c_\mu \sum_{i=1}^{\mu} w_i \left(\frac{\mathbf{x}^{(g+1)}_{i:\lambda} - \mathbf{m}^{(g)}}{\sigma^{(g)}}\right)\left(\frac{\mathbf{x}^{(g+1)}_{i:\lambda} - \mathbf{m}^{(g)}}{\sigma^{(g)}}\right)^T$

Algorithm 2: CMA-ES.
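In practice, a maintained implementation of CMA-ES is available in Hansen's pycma Python package. Assuming it is installed, a minimal ask-and-tell usage sketch might look like this (the sphere function is a toy stand-in for an expensive objective such as the localization objective in this work):

```python
import cma

# Toy stand-in objective: sphere function
def objective(x):
    return sum(xi ** 2 for xi in x)

# Initial mean (problem dependent) and initial step size sigma0
es = cma.CMAEvolutionStrategy(8 * [0.5], 0.3)
while not es.stop():
    solutions = es.ask()                                    # sample a population (2.26)
    es.tell(solutions, [objective(s) for s in solutions])   # rank and update
print(es.result.xbest, es.result.fbest)
```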


2.4.8 Variations of CMA-ES

Since the initial introduction of the CMA-ES in [28], [29], several variations and extensions of the basic algorithm have been proposed to improve its performance [24]. While going through all of them is outside the scope of this work, a few of the most prominent variations will be introduced in this section.

Restart strategies

When running the regular CMA-ES described so far, it is possible for the algorithm to end up in a local optimum. A way to prevent this and promote a more global search is to use restart strategies, where the algorithm is restarted from a different starting point instead of terminating on the stopping criteria [30]. It is also common for the restart strategy to change some parameters of the search to increase the likelihood of performing a more successful search the next time.

One restart strategy for the CMA-ES, called IPOP-CMA-ES, increases the population size for each restart [30]. The default population size in the CMA-ES is set to $\lambda_{\text{def}} = 4 + \lfloor 3 \ln n \rfloor$, where $n$ is the problem dimension [25]. While a small population size is desirable to keep down the number of objective function evaluations, a larger population size may also be needed to successfully find a global optimum, especially on multi-modal objective functions. In IPOP-CMA-ES, the population size is increased by a factor of 2 each time the search is restarted (the factor 2 has been found empirically), which has been shown to produce superior results compared to a local restart strategy where the population size is unchanged across restarts [30].

Another restart strategy is the bi-population CMA-ES called BIPOP-CMA-ES [31]. This strategy keeps two separate regimes running interlaced with each other, where each regime has its own function evaluation budget to keep track of how many objective function evaluations have been performed under each regime. The first regime is to perform restarts where the population size is increased with each restart as in IPOP-CMA-ES. The second regime uses a smaller population size defined by

$$\lambda_s = \left\lfloor \lambda_{\text{def}} \left( \frac{1}{2} \frac{\lambda_l}{\lambda_{\text{def}}} \right)^{U[0,1]^2} \right\rfloor \quad (2.43)$$

where $\lambda_{\text{def}} = 4 + \lfloor 3 \ln n \rfloor$ is the default population size, $\lambda_l$ is the most recent population size from the first regime and $U[0, 1]$ denotes independent uniformly distributed random numbers in the range $[0, 1]$ [31].
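As a small numerical illustration of (2.43), the second-regime population size could be drawn as follows; all values here are placeholders for the sake of the example.

```python
import numpy as np

n = 10                               # problem dimension (example value)
lam_def = 4 + int(3 * np.log(n))     # default population size
lam_l = 4 * lam_def                  # latest first-regime size (example value)
u = np.random.uniform()              # one draw of U[0, 1]
lam_s = int(lam_def * (0.5 * lam_l / lam_def) ** (u ** 2))  # floor of (2.43)
```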

The BIPOP-CMA-ES starts by doing one regular run of CMA-ES with the default population size $\lambda_{\text{def}}$. After this initial run, it starts to alternate between the two regimes, where the first regime is always used for the first and last restarts. For the other restarts, the regime with the lowest function evaluation budget is chosen. There is an upper bound of nine on the number of restarts, so the largest population size that can be reached using this strategy is $2^9 \lambda_{\text{def}}$ [31].

The criteria used by the BIPOP-CMA-ES to know when to terminate one run of the CMA-ES and perform a restart are mostly the same as the standard criteria defined in Section 2.4.5. A MaxIter criterion is also added which limits the number of generations in each run of CMA-ES, by default to $100 + 50(n+3)^2/\sqrt{\lambda}$ [31]. The BIPOP-CMA-ES launches new restarts until the target objective function value is reached or until the limit on the population size described above is reached [31].

Active covariance matrix updates

In the standard CMA-ES introduced in this section, the variance of the search distribution is increased in the direction of successful individuals, while the variance in less successful areas is allowed to decay passively. In the variation called Active-CMA-ES (or aCMA-ES), the search distribution is updated so that areas of unsuccessful individuals are actively discouraged when sampling new individuals [32]. Using active updates in CMA-ES has been shown to outperform CMA-ES without active updates on a number of functions [32], [33].

Recall the (passive) covariance matrix update rule in (2.39):

$$C^{(g+1)} = (1 - c_1 - c_\mu)\, C^{(g)} + c_1 p_c^{(g+1)} p_c^{(g+1)T} + c_\mu \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}^{(g+1)} y_{i:\lambda}^{(g+1)T} \quad (2.44)$$

where the notation $y_{i:\lambda}^{(g+1)} = \frac{x_{i:\lambda}^{(g+1)} - m^{(g)}}{\sigma^{(g)}}$ has been introduced for the step vectors for notational efficiency. When using active updates, the last term in the update rule is replaced by

$$\left( \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}^{(g+1)} y_{i:\lambda}^{(g+1)T} - \sum_{i=\lambda-\mu+1}^{\lambda} w_i\, y_{i:\lambda}^{(g+1)} y_{i:\lambda}^{(g+1)T} \right) \quad (2.45)$$

so that the $\mu$ worst individuals contribute negatively to the covariance matrix update, where Jastrebski and Arnold use uniform weights $w_i = 1/\mu$, $i = 1, \ldots, \mu$ and $\mu = \lambda/4$ in [32]. Making this modification means that the expectation of the covariance matrix is no longer stationary under random selection, as it is with the original covariance matrix update. This is solved by changing the scaling constants used in the update rule, as detailed in [32].
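As a small illustration of (2.45) with the uniform weights of [32], the rank-µ term could be computed as in the sketch below; the matrix Y is assumed to hold all λ ranked step vectors as rows, best first. This is our own illustration, not code from [32].

```python
import numpy as np

def active_rank_mu_term(Y, mu):
    """Rank-mu term with active updates; Y holds ranked step vectors (best first)."""
    w = np.full(mu, 1.0 / mu)             # uniform weights w_i = 1/mu as in [32]
    pos = (Y[:mu].T * w) @ Y[:mu]         # contribution of the mu best individuals
    neg = (Y[-mu:].T * w) @ Y[-mu:]       # negative contribution of the mu worst
    return pos - neg
```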

Elitist selection

The regular CMA-ES selects the µ best out of λ individuals in each generation to guide the search towards more successful individuals. After selection and the update of the search distribution, the individuals of the current generation die out and do not influence the search further. Elitist selection, on the other hand, selects the µ best individuals from the current generation and its parents [22]. This means that the parents in each generation are the µ all-time best individuals out of all solutions evaluated so far. Elitist selection ensures that exceptionally good individuals are allowed to have greater influence on the search. The downside is that elitist selection can make the search more prone to converge to local optima.
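A minimal sketch of this (µ + λ) elitist selection, assuming candidate solutions are stored as rows of NumPy arrays (all names here are illustrative):

```python
import numpy as np

def elitist_select(parents, offspring, f, mu):
    """Keep the mu all-time best out of parents and offspring combined."""
    pool = np.vstack([parents, offspring])    # parents survive into the pool
    order = np.argsort([f(x) for x in pool])  # rank by objective value
    return pool[order[:mu]]                   # the mu best become new parents
```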


3

Method

This chapter describes how the work was carried out: the localization algorithm that was used, the objective function that was optimized, the datasets used and how the actual implementation was made. It then explains how solutions were evaluated and which experiments were carried out.

3.1 Localization algorithm

The localization algorithm used in this work was a flat region-based algorithm implemented in SICK IVP’s bin picking system PLB (acronym for Part Locator in Bins). Objects are located by finding flat regions in the scene. The algorithm works in several steps. In the first step, the points in the 3D image are preprocessed by applying a smoothing convolution kernel. The normals of the points are estimated and any points whose normals deviate from the global up direction by more than a threshold angle are removed. The points are then segmented into clusters of coherent surfaces. If a point and its neighbours’ normal directions do not differ by too much, they are part of the same cluster. Clusters that have either too few or too many points are disregarded. The parameters in this step control the radius of the smoothing kernel, how tolerant the clustering should be to roughness and noise in the normals of the points and how close to each other points need to be for them to be considered neighbours. There is also a parameter controlling how curved a surface can be before being split into several clusters. The output of the first step is a list of, possibly curved, clusters of points.
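As an illustration of the normal-direction filtering in this first step, the threshold test could look like the following sketch; unit normals are assumed and this is not the PLB implementation.

```python
import numpy as np

def filter_by_normal_angle(points, normals, max_angle_deg):
    """Keep points whose unit normal is within max_angle_deg of the global up."""
    up = np.array([0.0, 0.0, 1.0])
    cos_angles = np.clip(normals @ up, -1.0, 1.0)   # dot product per point
    angles = np.degrees(np.arccos(cos_angles))      # deviation from up, degrees
    keep = angles <= max_angle_deg
    return points[keep], normals[keep]
```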

The second step in the algorithm goes through all the clusters from the first step and tries to split them into planar sub-clusters. It does this by checking how well the points fit to a plane, by looking at the difference in normal direction between the point and the plane and the distance from the point to the plane. This step produces a set of planar clusters which is used in the next step.
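A sketch of the planarity test used in this second step might look as follows: a point is accepted into a planar sub-cluster if both its distance to the plane and the deviation between its normal and the plane normal are below thresholds. The function and parameter names are illustrative, not PLB's.

```python
import numpy as np

def fits_plane(point, normal, plane_point, plane_normal, max_dist, max_angle_deg):
    """Accept a point if it lies close to the plane and its normal agrees
    with the plane normal (unit vectors assumed)."""
    dist = abs((point - plane_point) @ plane_normal)  # point-to-plane distance
    cos_a = np.clip(abs(normal @ plane_normal), 0.0, 1.0)
    angle = np.degrees(np.arccos(cos_a))              # normal deviation angle
    return dist <= max_dist and angle <= max_angle_deg
```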

The third step filters out candidate clusters so that only valid clusters remain. A minimal rectangle which encapsulates each cluster is generated, and clusters for which the rectangle’s size is outside of a minimum and maximum length and width are removed. There is also a criterion on the length/width ratio of the rectangle. This step also generates some additional metadata for the clusters, such as a confidence score for each found cluster and the picking order of valid clusters.
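The size and ratio criterion on the encapsulating rectangle could be sketched as below; the parameter names are illustrative and do not correspond to the actual PLB parameter names.

```python
def rectangle_valid(length, width, min_len, max_len, min_wid, max_wid, max_ratio):
    """Size and aspect-ratio criterion on a cluster's bounding rectangle."""
    if not (min_len <= length <= max_len and min_wid <= width <= max_wid):
        return False                      # rectangle outside the size window
    return (length / width) <= max_ratio  # reject overly elongated rectangles
```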
