Finding Optimal Jetting Waveform Parameters with Bayesian Optimization


DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Finding Optimal Jetting Waveform Parameters with Bayesian Optimization

STEFAN XUEYAN FU

KTH ROYAL INSTITUTE OF TECHNOLOGY


Finding Optimal Jetting Waveform Parameters with Bayesian Optimization

STEFAN XUEYAN FU

Degree Projects in Optimization and Systems Theory (30 ECTS credits)
Degree Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2018

Supervisors at Mycronic: Daniel Grafström, Gustaf Mårtensson
Supervisor at KTH: Xiaoming Hu


TRITA-SCI-GRU 2018:297
MAT-E 2018:67

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

Jet printing is a method in surface mount technology (SMT) in which small volumes of solder paste or other electronic materials are applied to printed circuit boards (PCBs). The solder paste is shot onto the boards by a piston powered by a piezoelectric stack. The characteristics of jetted results can be controlled by a number of factors, one of which is the waveform of the piezo actuation voltage signal. While in theory any waveform is possible, in practice, the signal is defined by seven parameters for the specific technology studied here.

The optimization problem of finding the optimal parameter combination cannot be solved by standard derivative-based methods, as the objective is a black-box function which can only be sampled through noisy and time-consuming evaluations. The current method for optimizing the parameters is an expert-guided grid search over the two most important parameters, while the remaining five are kept constant at default values. Bayesian optimization is a heuristic model-based search method for efficient optimization of possibly noisy functions with unavailable derivatives. An implementation of the Bayesian optimization algorithm was adapted for the optimization of the waveform parameters, and used to optimize various combinations of the parameters. Results from different trials produced similar values for the two known parameters, with differences within the uncertainty caused by noise. For the remaining five parameters, results were more ambiguous. However, a closer examination of the model hyperparameters showed that these five parameters had almost no impact on the objective function; thus, the best found parameter values were affected more by random noise than by the objective. It is concluded that Bayesian optimization might be a suitable and effective method for waveform parameter optimization, and some directions for further development are suggested based on the results of this project.

Sammanfattning (Swedish Abstract)

Jet printing is a method for applying solder paste or other electronic materials to printed circuit boards in surface mount electronics production. The solder paste is shot onto the boards by a piston driven by a piezoelectric unit. The quality of the jetted result can be affected by a number of factors, for example the waveform of the signal used to actuate the piezo unit. In theory any waveform is possible, but in practice a waveform defined by seven parameters is used. Finding optimal values for these parameters is an optimization problem that cannot be solved with derivative-based methods, since the objective function is a so-called black-box function that is only accessible through noisy and time-consuming evaluations. The current method for optimizing the parameters is a modified grid search over the two most important parameters, with the remaining five parameters fixed. Bayesian optimization is a heuristic model-based search method for data-efficient optimization of noisy functions for which derivatives cannot be computed. An implementation of Bayesian optimization was adapted for optimization of the waveform parameters and used to optimize a number of combinations of the parameters. All results gave similar values for the two known parameters, with differences within the uncertainty caused by measurement noise. The results for the remaining five parameters were contradictory, but a closer examination of the model hyperparameters showed that this was because these five parameters have only a minimal impact on the jetted result. The contradictory results can therefore be explained entirely as differences due to measurement noise. Based on the results, Bayesian optimization appears to be a suitable and efficient method for optimization of waveform parameters. Finally, some possibilities for further development of the method are suggested.


Acknowledgements

I would like to thank Daniel Grafström and Gustaf Mårtensson at Mycronic for your supervision, patience and enthusiasm; for providing expert knowledge in jet printing, for being available to help whenever issues arose, and for the many interesting discussions. I also want to thank my supervisor at KTH, Xiaoming Hu, for your advice in finding a project, and for providing feedback and suggestions. Finally, my thanks go to my parents for your support, not only during the thesis project, but also throughout the rest of my studies.

Thank you, Stefan Fu


Contents

1 Introduction
1.1 Background
1.2 Objective
1.3 Report Outline

2 Theoretic Background
2.1 Preliminaries
2.1.1 Marginal, Conditional, and Bayes’ Rule
2.1.2 Stochastic Processes
2.1.3 Gaussian Processes
2.2 Gaussian Process Regression
2.2.1 Regression
2.2.2 Covariance Kernels
2.2.3 Estimating Hyperparameters
2.3 Bayesian Optimization
2.3.1 Efficient Optimization of Black-box Functions
2.3.2 The Bayesian Optimization Algorithm
2.3.3 The Acquisition Function
2.3.4 Choosing Initial Points
2.3.5 Advantages and Disadvantages
2.3.6 Example

3 Problem Formulation and Scope
3.1 Optimization Target
3.2 Mathematical Formulation
3.2.1 The Search Space
3.3 Experimental Setup

4 Results and Analysis
4.1 Two Parameter Optimization
4.1.1 x1 and x2
4.1.2 x1 and x2 with ARD
4.1.3 x3 and x4
4.2 Four Parameter Optimization
4.3 Seven Parameter Optimization

5 Conclusions
5.1 Optimization
5.2 Review of Assumptions
5.3 Method Discussion

6 Further Work
6.1 The Bayesian Optimization Algorithm
6.1.1 Estimating Hyperparameters
6.1.2 The Noise Model
6.1.3 Incorporating Prior Knowledge
6.2 Multi-objective Optimization
6.3 Automation and Improvements to the Setup

Bibliography


List of Figures

1.1 Simplified overview of the jet printing ejector.
2.1 Example of three iterations of Bayesian optimization.
3.1 Illustration of relative break-off.
3.2 Examples from high speed camera.
3.3 Initial variations after parameter change from x1 = 35, x2 = 195 to x1 = 20, x2 = 150.
3.4 Periodical variations for x1 = 14, x2 = 219.
3.5 Schematic of Bayesian optimization iterations.
4.1 Final mean, standard deviation and acquisition function, optimization of x1 and x2.
4.2 Hyperparameter values estimated by maximum likelihood, optimization of x1, x2.
4.3 Final mean, standard deviation and acquisition function, optimization of x1 and x2 with ARD.
4.4 Hyperparameter values estimated by maximum likelihood, optimization of x1 and x2 with ARD.
4.5 Final mean, standard deviation and acquisition function, optimization of x3 and x4.
4.6 Hyperparameter values estimated by maximum likelihood, optimization of x3, x4.
4.7 Hyperparameter values estimated by maximum likelihood, optimization of x1, x2, x3, x4.
4.8 Values of lengthscales corresponding to x1 and x2 in linear scale, optimization of x1, x2, x3, x4.
4.9 Hyperparameter values estimated by maximum likelihood, optimization of all parameters.
4.10 Values of lengthscales corresponding to some of the parameters in linear scale, optimization of all parameters.


List of Tables

3.1 Search space and default parameter values.
4.1 Runs of Bayesian optimization.
4.2 Results and final hyperparameter values, optimization of x1 and x2.
4.3 Results and final hyperparameter values, optimization of x1 and x2 with ARD.
4.4 Results and final hyperparameter values, optimization of x3 and x4.
4.5 Results and final hyperparameter values, optimization of x1, x2, x3, x4.
4.6 Results and final hyperparameter values, optimization of all parameters.
5.1 Best found values of x1 and x2 from different trials.
5.2 Best found values of x3 and x4 from different trials.


Abbreviations

ARD Automatic Relevance Determination

CDF Cumulative Distribution Function

EGO Efficient Global Optimization

EI Expected Improvement

GP Gaussian Process

LCB Lower Confidence Bound

ML Maximum Likelihood

PCB Printed Circuit Board

PDF Probability Density Function

PI Probability of Improvement

RBF Radial Basis Function

SE kernel Squared Exponential kernel

SMT Surface Mount Technology

UCB Upper Confidence Bound


1 | Introduction

1.1 Background

Mycronic AB is a Swedish high-tech company that develops production equipment for electronics and display manufacturing. One of its areas of operation is surface mount technology (SMT), a method for producing electronic circuits by soldering components onto the surface of printed circuit boards (PCBs), as opposed to through-hole technology, in which components are mounted into holes in the circuit boards. SMT offers both reduced production cost and higher component density, and is the most prevalent method used in contemporary electronic applications.

One of the machines developed and manufactured by Mycronic is the MY700 jet printer. Solder paste is a mixture of flux and solder alloy particles, and is used to attach components to the PCBs. The solder paste jet printer is similar to an inkjet printer, except that it jets solder paste instead of ink. The jetting head used for dispensing solder paste inside the jet printer is known as the ejector, and uses a piston powered by a piezoelectric stack to shoot the solder paste onto the circuit board. As pulses of voltage are applied to the piezo, it moves the piston to eject solder paste. Each pulse deposits one dot on the circuit board, and rows of dots are jetted as strips by moving the ejector across the PCB. Figure 1.1 provides a simplified overview of the jet printing ejector.

Figure 1.1: Simplified overview of the jet printing ejector.

The required precision and accuracy of the resulting depositions are very high; for example, dot volumes are measured in nanoliters (nl) and dot sizes in micrometers (µm). Different aspects of jetting quality can be quantified: the positioning of dots refers to positioning accuracy, the shape to the roundness of the dots, and the number of satellites to the number of smaller bodies of solder paste which are inadvertently created. Moreover, the quality has to be consistent over the entire lifetime of the ejector. Typically, several of the quality goals are in conflict with each other; for example, jetting at higher speed can give better positioning, but may also result in more satellites.

Jetting characteristics can to some extent be controlled by a number of factors, including the temperature at which the jetting takes place, the pumping of paste into the jetting chamber, and the waveform (shape) of the piezo actuation voltage signal. These are represented by a number of parameters, which can be adjusted to achieve good jetting results. The current process for optimization of jetting parameters is an expert-guided multi-stage grid search, in which optimization of the waveform is one part. The expert is able to direct the search towards regions of the search space which appear promising based on previous experience. Furthermore, some combinations of parameters are known not to result in any jetting, for example if the voltage is too small. These can also be avoided by an expert, since many of the quantitative measures of jetting quality are undefined if nothing is jetted.

The current system supports a waveform parameterized by seven parameters, but in theory any waveform within the physical limitations of the piezo and the dynamics of the system is possible. However, optimization of the waveform is only performed for the two parameters that are known to have the greatest effect on the jetting quality, with the remaining parameters kept at default values. Relatively little is known about the effects of the remaining parameters. Even with the search space limited to two parameters, the optimization process is very time-consuming, since a large amount of data must be collected for each combination of parameters. In addition, the optimization must be repeated for many combinations of ejector models and solder pastes to fit different applications. It is therefore desirable to explore methods that could make the optimization process more efficient.

1.2 Objective

The objective of this thesis project is to investigate if machine learning strategies, in particular Bayesian optimization, can be used for optimization of the seven waveform parameters of the piezo actuation profile. More specifically, the goal is to explore if Bayesian optimization can be used to increase the optimization efficiency and expand the search space to include a larger number of parameters, and to automate all or parts of the process to reduce the amount of expert input needed.

1.3 Report Outline

This section has provided a short introduction to solder paste jetting and the need for more efficient optimization methods for the jetting parameters. A summary of the theoretical background, including Gaussian process regression and Bayesian optimization, is given in section 2. Section 3 defines the problem scope covered in this thesis, and includes some of the preparatory work in defining a suitable optimization target. Results are presented in section 4, and section 5 discusses the performance of Bayesian optimization for optimization of the jetting parameters. Finally, some ideas for further work are presented in section 6.


2 | Theoretic Background

2.1 Preliminaries

2.1.1 Marginal, Conditional, and Bayes’ Rule

For continuous and possibly multivariate random variables a and b with joint probability density p(a, b)¹, the marginal density is given by

p(a) = \int_{-\infty}^{\infty} p(a, b) \, db    (2.1)

and the conditional density is

p(a \mid b) = \frac{p(a, b)}{p(b)}.    (2.2)

Using (2.2) twice gives Bayes’ rule:

p(a \mid b) = \frac{p(b \mid a) \, p(a)}{p(b)}.    (2.3)

If a and b are jointly Gaussian, i.e.

\begin{bmatrix} a \\ b \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} A & C \\ C^T & B \end{bmatrix} \right),    (2.4)

then their marginal distributions are

a \sim \mathcal{N}(\mu_a, A)    (2.5)
b \sim \mathcal{N}(\mu_b, B),    (2.6)

and their respective conditional distributions given the other are [1]

a \mid b \sim \mathcal{N}\!\left(\mu_a + C B^{-1}(b - \mu_b),\; A - C B^{-1} C^T\right)    (2.7)
b \mid a \sim \mathcal{N}\!\left(\mu_b + C^T A^{-1}(a - \mu_a),\; B - C^T A^{-1} C\right).    (2.8)

1The notation p is used for probability density functions instead of f to avoid confusion with the optimization objective function f (x).
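The conditioning formula (2.7) can be checked numerically; a small sketch with NumPy, using arbitrary illustrative values for the joint distribution:

```python
import numpy as np

# Joint Gaussian over (a, b), equation (2.4); values are illustrative.
mu_a, mu_b = np.array([1.0]), np.array([-2.0])
A = np.array([[2.0]])    # Cov(a, a)
B = np.array([[3.0]])    # Cov(b, b)
C = np.array([[1.5]])    # Cov(a, b)

b_obs = np.array([0.5])  # an observed value of b

# Conditional distribution a | b, equation (2.7):
#   mean = mu_a + C B^{-1} (b - mu_b),  cov = A - C B^{-1} C^T
B_inv = np.linalg.inv(B)
mean_cond = mu_a + C @ B_inv @ (b_obs - mu_b)
cov_cond = A - C @ B_inv @ C.T

print(mean_cond)  # [2.25]
print(cov_cond)   # [[1.25]]
```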


2.1.2 Stochastic Processes

Definition 2.1.1. A stochastic process Z = {Z(t) | t ∈ T } is a collection of random variables Z(t) which are all defined on a common probability space (Ω, F , P).

The following functions are defined for a stochastic process Z: the mean function

m_Z(t) = \mathbb{E}[Z(t)],    (2.9)

the variance function \mathrm{Var}_Z(t) = \mathbb{E}[Z^2(t)] - m_Z^2(t), the autocorrelation function R_Z(t, s) = \mathbb{E}[Z(t) Z(s)], and the autocovariance function

k_Z(t, s) = \mathbb{E}[(Z(t) - m_Z(t))(Z(s) - m_Z(s))] = R_Z(t, s) - m_Z(t) m_Z(s).    (2.10)

In machine learning settings, the autocovariance function is sometimes just called the covariance function or covariance kernel. Since the autocorrelation function is symmetric [2], it follows that the autocovariance function is also symmetric, k_Z(t, s) = k_Z(s, t). Additionally, k_Z(t, s) is positive semi-definite [1], so for any set of points t_1, ..., t_n the matrix K with entries K_{ij} = k_Z(t_i, t_j) is a positive semidefinite matrix.
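This positive semidefiniteness property is easy to check numerically for a concrete covariance function; a quick sketch with NumPy, here using the squared exponential covariance function introduced in section 2.2.2:

```python
import numpy as np

# Gram matrix of the squared exponential covariance function
# k(t, s) = exp(-(t - s)^2 / 2) on ten arbitrary input points.
t = np.linspace(0, 1, 10)
K = np.exp(-(t[:, None] - t[None, :])**2 / 2)

# A valid covariance kernel must yield a symmetric positive
# semidefinite Gram matrix for any choice of input points.
assert np.allclose(K, K.T)                      # symmetric
assert np.all(np.linalg.eigvalsh(K) > -1e-10)   # eigenvalues >= 0 up to roundoff
```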

2.1.3 Gaussian Processes

A Gaussian process (GP) can be interpreted as a generalization of the multivariate Gaussian distribution to infinite dimensions. One definition of a Gaussian process is:

Definition 2.1.2. A Gaussian process is a stochastic process for which any finite subset of the random variables Z(t) has a joint Gaussian distribution, i.e. for all t_1, ..., t_n \in T,

[Z(t_1) \; \cdots \; Z(t_n)]^T \sim \mathcal{N}(\mu(t_1, ..., t_n), \, C(t_1, ..., t_n)),    (2.11)

where the entries of \mu(t_1, ..., t_n) are given by

\mu_i = m_Z(t_i)    (2.12)

and C(t_1, ..., t_n) is the Gram matrix of k_Z for the input points t_1, ..., t_n, with entries

c_{ij} = k_Z(t_i, t_j).    (2.13)

A Gaussian process is defined uniquely by its mean and autocovariance functions, and can thus be written as

Z(t) \sim \mathcal{GP}(m_Z(t), k_Z(t, s)).    (2.14)

Being stochastic processes, Gaussian processes satisfy the Kolmogorov consistency property: the joint distribution of n variables from Z must be the same as the marginal distribution obtained by marginalizing a joint distribution over any superset of the same n variables [1] [2].


2.2 Gaussian Process Regression

Gaussian processes are used in machine learning, e.g. for supervised learning. In the regression setting, training data consisting of training inputs and corresponding outputs are given, and the objective is to predict outputs corresponding to new inputs. Training inputs have dimension D and are denoted x_i, i = 1, ..., n, with the matrix of all training inputs denoted X. Similarly, the training outputs are scalars y_i, i = 1, ..., n, and form a vector y. The outputs are related to the inputs by some unknown underlying function y = f(x). All the training data is denoted \mathcal{D} = (X, y) = \{(x_i, y_i) \mid i = 1, ..., n\}. Finally, new data is denoted by \hat{x}_i, \hat{X}, \hat{y}_i, \hat{y} and \hat{\mathcal{D}}. With this notation, the objective is to estimate \hat{y} \mid \hat{X}, X, y.

2.2.1 Regression

Modelling the underlying function y = f(x) between inputs and outputs as a Gaussian process F = \{f(x) \mid x \in \mathcal{X}\} with mean function m(x) and covariance function k(x, x'),

f(x) \sim \mathcal{GP}(m(x), k(x, x')),    (2.15)

the outputs y, \hat{y} corresponding to given inputs X, \hat{X} by Definition 2.1.2 follow a (finite) multivariate Gaussian distribution

\begin{bmatrix} y \\ \hat{y} \end{bmatrix} \,\Big|\, X, \hat{X} \sim \mathcal{N}\!\left( \begin{bmatrix} m(X) \\ m(\hat{X}) \end{bmatrix}, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, \hat{X}) \\ K(\hat{X}, X) & K(\hat{X}, \hat{X}) \end{bmatrix} \right),    (2.16)

where m(X) = [m(x_1) \cdots m(x_n)]^T, K(X, X) has elements K_{ij} = k(x_i, x_j), etc. Thus, K is positive semidefinite and forms a valid covariance matrix. The term \sigma_n^2 I in the first block of the covariance matrix represents possible noise in the training outputs, y = f(x) + \varepsilon_n, with \sigma_n = 0 if there is no noise; see section 2.2.2 and (2.33) on covariance kernels. The remaining three blocks have no noise term, as only the training outputs y are affected by noise.

Modeling f(x) as a Gaussian process (2.15) does not assume that f(x) has the shape of a normal distribution or a mixture of Gaussians. Instead, the assumption made is that for each given \bar{x}, the value of f(\bar{x}) is drawn from a normal distribution with mean m(\bar{x}), and variance and covariances given by the covariance function k. Equivalently, the finite-dimensional vector [y \; \hat{y}]^T \mid X, \hat{X} in (2.16) corresponds to a vector drawn from a multivariate normal distribution. Assumptions on the covariance function typically correspond to assumptions on the structure and smoothness of f(x), and are covered in more detail in section 2.2.2.

Regression is done by Bayesian inference. Conditioning on y (using (2.8)) gives

\hat{y} \mid \hat{X}, X, y \sim \mathcal{N}(\mu_{\hat{y}}, C_{\hat{y}}),    (2.17)

where

\mu_{\hat{y}} := m(\hat{X}) + K(\hat{X}, X) [K(X, X) + \sigma_n^2 I]^{-1} [y - m(X)]    (2.18)

C_{\hat{y}} := K(\hat{X}, \hat{X}) - K(\hat{X}, X) [K(X, X) + \sigma_n^2 I]^{-1} K(X, \hat{X}).    (2.19)

The predicted mean and covariance matrix of the conditional distribution of \hat{y} \mid \hat{X}, X, y can then be used to obtain predictions for the values of \hat{y}. Defining a loss function L(\hat{y}_{\text{pred}}, \hat{y}_{\text{true}}), the optimal prediction is the one that minimizes the expected loss. If the predicted distribution is Gaussian and the loss function is symmetric, i.e. the penalties for overestimating and underestimating are equal, then the best prediction is the predicted mean \mu_{\hat{y}} [1]. This will be the case in this report, and the predicted mean \mu_{\hat{y}} will be referred to as the prediction.

The mean function m(x) and covariance function k(x, x') correspond to the Bayesian prior, p(a) in Bayes’ rule (2.3), and (2.15) is sometimes referred to as the GP prior. The training outputs y correspond to the likelihood, and (2.17)–(2.19) correspond to the posterior. The prior mean m(x) can be used to incorporate expert knowledge about f(x) [3]. If no prior knowledge exists, the prior mean is often set to zero. Setting m(x) = 0 restricts neither the posterior mean nor the prediction, which becomes

\mu_{\hat{y}} = K(\hat{X}, X) [K(X, X) + \sigma_n^2 I]^{-1} y.    (2.20)

In (2.20), the best predictions \mu_{\hat{y}} for \hat{y} are linear combinations of the training outputs y, i.e. Gaussian process regression is a linear smoother. GP regression can therefore be viewed as estimating the underlying function as a sum of radial basis functions, with each basis function centered at a training input. It is also possible to use a non-deterministic prior mean function; see chapter 2.7 in [1].
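Equations (2.18)–(2.20) translate directly into code. Below is a minimal sketch with NumPy, assuming a zero prior mean and a squared exponential kernel; the data, length scale and noise level are illustrative choices, not values from this project:

```python
import numpy as np

def se_kernel(X1, X2, l=1.0):
    """Squared exponential kernel matrix: k(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * l**2))

def gp_predict(X, y, X_new, l=1.0, noise_var=0.01):
    """Posterior mean (2.18) and covariance (2.19) with zero prior mean, m(x) = 0."""
    K = se_kernel(X, X, l) + noise_var * np.eye(len(X))   # K(X, X) + sigma_n^2 I
    K_s = se_kernel(X_new, X, l)                          # K(X_new, X)
    K_ss = se_kernel(X_new, X_new, l)                     # K(X_new, X_new)
    K_inv = np.linalg.inv(K)
    mu = K_s @ K_inv @ y              # predicted mean, equation (2.20)
    cov = K_ss - K_s @ K_inv @ K_s.T  # predictive covariance, equation (2.19)
    return mu, cov

# Noisy observations of an illustrative underlying function f(x) = sin(x).
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 8).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(8)

mu, cov = gp_predict(X, y, np.array([[2.5]]))
```

Here `mu[0]` approximates sin(2.5) and `cov[0, 0]` gives the predictive variance at that point.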

2.2.2 Covariance Kernels

While the mean function m(x) can be set to zero without restricting the GP posterior, the covariance function, also known as the covariance kernel, is key in Gaussian process regression.

The covariance kernel specifies the structure of the Gaussian process. Assuming that input pairs x and x' which are close² to each other in input space are also close together in output space amounts to a smoothness assumption on the GP. As with the mean function, prior knowledge about f(x) can be incorporated into the covariance kernel, e.g. if the function is known to be periodic [1] [3]. As stated in previous sections, in order for a function k(x, x') to be a valid covariance kernel, it must be symmetric and positive semi-definite.

A covariance kernel is stationary if it can be written as a function of \tau := x - x', i.e. k(x, x') = k(x - x') = k(\tau), and isotropic if it is a function of only r := \|\tau\| = \|x - x'\|. A stationary kernel is invariant to all translations in input space, and an isotropic kernel is invariant to all rigid motions. Isotropic covariance kernels are also referred to as radial basis functions (RBFs) in other machine learning applications [4].

The Squared Exponential and Matérn Kernels

A very common isotropic covariance kernel is the squared exponential (SE) kernel:

k_{\text{SE}}(r) = \exp\!\left(-\frac{r^2}{2 l^2}\right) = \exp\!\left(-\frac{\|x - x'\|^2}{2 l^2}\right).    (2.21)

The parameter l is the characteristic length scale and determines the rate at which the covariance between two points is expected to decrease as the distance increases. Points separated by a distance r are assumed to be highly correlated if r < l [5]. The SE kernel is also referred to as the RBF kernel or Gaussian kernel. From (2.21), it can clearly be seen that the SE kernel is infinitely differentiable; the corresponding assumption made for the Gaussian process is that it has mean square derivatives of all orders, i.e. \partial^k f(x) / \partial x_i^k exists for all i and k, which is a smoothness assumption that is often too strong for practical modeling [6].

²For example, in Euclidean distance.

A more practical class of isotropic covariance functions is the Matérn class:

k_{\text{Matérn}}(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{l}\right)^{\nu} K_{\nu}\!\left(\frac{\sqrt{2\nu}\, r}{l}\right),    (2.22)

where K_\nu is a modified Bessel function of the second kind. The Matérn class has an additional parameter \nu compared to the SE kernel, which represents the smoothness of the covariance function. The GP is m times mean square differentiable if and only if \nu > m [6]. The SE kernel is obtained from the Matérn class when \nu \to \infty. The most commonly used values for \nu are half-integers, as the covariance functions then simplify to products of an exponential and a polynomial. For most practical applications, \nu = 1/2 is considered too rough and \nu \geq 7/2 is often difficult to distinguish from the SE kernel [1], leaving \nu = 3/2 and \nu = 5/2 as two of the most commonly used covariance functions:

k_{\text{Matérn } 3/2}(r) = \left(1 + \frac{\sqrt{3}\, r}{l}\right) \exp\!\left(-\frac{\sqrt{3}\, r}{l}\right)    (2.23)

k_{\text{Matérn } 5/2}(r) = \left(1 + \frac{\sqrt{5}\, r}{l} + \frac{5 r^2}{3 l^2}\right) \exp\!\left(-\frac{\sqrt{5}\, r}{l}\right).    (2.24)

While all three covariance functions presented here are monotonically decreasing as functions of r = \|x - x'\|, this is not necessary for a function to be a valid covariance kernel. Some other covariance functions include dot product and polynomial covariance functions, which are commonly used for the kernel trick in support vector machine classification [1] [4].
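The kernels (2.21), (2.23) and (2.24) are straightforward to implement; a sketch with NumPy (the length scale l = 1 is an illustrative default):

```python
import numpy as np

def k_se(r, l=1.0):
    """Squared exponential kernel, equation (2.21)."""
    return np.exp(-r**2 / (2 * l**2))

def k_matern32(r, l=1.0):
    """Matern kernel with nu = 3/2, equation (2.23)."""
    a = np.sqrt(3) * r / l
    return (1 + a) * np.exp(-a)

def k_matern52(r, l=1.0):
    """Matern kernel with nu = 5/2, equation (2.24)."""
    a = np.sqrt(5) * r / l
    return (1 + a + 5 * r**2 / (3 * l**2)) * np.exp(-a)

r = np.linspace(0.0, 3.0, 7)
# All three kernels equal 1 at r = 0 and decay monotonically with r.
print(k_se(r)[0], k_matern32(r)[0], k_matern52(r)[0])  # 1.0 1.0 1.0
```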

Automatic Relevance Determination Kernels

One limitation of the isotropic covariance functions is that all dimensions of x are assumed to have the same decay of covariances. In practical applications, f(x) is often anisotropic, and it is not possible to re-scale the problem to become isotropic, since the length scales in the different dimensions are not known. The so-called automatic relevance determination (ARD) kernels can be obtained by modifying the common isotropic kernels, letting r be defined by

\frac{r^2(x, x')}{l^2} = (x - x')^T M (x - x'),    (2.25)

where M is a positive semidefinite matrix. In the case of the SE and Matérn kernels, this can correspond to defining a vector l = [l_1, ..., l_D]^T of length scales, one for each dimension of the input space, and letting M = \operatorname{diag}(l)^{-2}, allowing different length scales in different dimensions. The resulting ARD SE kernel becomes

k_{\text{SE, ARD}}(x, x') = \exp\!\left(-\tfrac{1}{2} (x - x')^T \operatorname{diag}(l)^{-2} (x - x')\right),    (2.26)

and similar expressions are obtained for the Matérn kernels. For M = l^{-2} I, the isotropic covariance functions are recovered. If dimension i is estimated to have little or no impact on the output, setting the corresponding length scale very large makes the covariance function almost independent of that input [7]. The length scales can be estimated together with the other hyperparameters (see section 2.2.3 below), and the relevance of each dimension of the input is then indicated by the size of the corresponding length scale.
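A sketch of the ARD SE kernel (2.26) for a pair of input vectors (the length scale values are illustrative):

```python
import numpy as np

def k_se_ard(x1, x2, lengthscales):
    """ARD squared exponential kernel, equation (2.26)."""
    d = (x1 - x2) / np.asarray(lengthscales)  # scale each dimension by its own l_i
    return np.exp(-0.5 * np.dot(d, d))

x1 = np.array([0.0, 0.0])
x2 = np.array([1.0, 1.0])

# Equal length scales reduce to the isotropic SE kernel.
print(k_se_ard(x1, x2, [1.0, 1.0]))   # exp(-1) ~ 0.3679
# A very large length scale in dimension 2 makes the kernel
# nearly independent of that input dimension.
print(k_se_ard(x1, x2, [1.0, 1e6]))   # ~ exp(-0.5) ~ 0.6065
```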

Forming New Kernels

The sum, product and convolution of valid covariance kernels all form new covariance kernels [1].

In particular, the kernels (2.21), (2.23), and (2.24) are often multiplied by a variance parameter \theta_0, since k(x, x') = \theta_0 is a valid covariance function. \theta_0 represents the signal variance of the process:

k_{\text{SE}, \theta_0}(r) = \theta_0 \exp\!\left(-\frac{r^2}{2 l^2}\right) = \theta_0 \exp\!\left(-\frac{\|x - x'\|^2}{2 l^2}\right)    (2.27)

k_{\text{Matérn } 3/2, \theta_0}(r) = \theta_0 \left(1 + \frac{\sqrt{3}\, r}{l}\right) \exp\!\left(-\frac{\sqrt{3}\, r}{l}\right)    (2.28)

k_{\text{Matérn } 5/2, \theta_0}(r) = \theta_0 \left(1 + \frac{\sqrt{5}\, r}{l} + \frac{5 r^2}{3 l^2}\right) \exp\!\left(-\frac{\sqrt{5}\, r}{l}\right).    (2.29)

Sums of kernels are used when modeling a process with noise. Assuming that the noise \varepsilon is additive, i.e.

y = f(x) + \varepsilon,    (2.30)

and independent identically distributed Gaussian with zero mean and variance \sigma_n^2, it can be modeled as

\varepsilon \sim \mathcal{GP}(0, \sigma_n^2 \delta(x, x')),    (2.31)

where \delta denotes the Dirac delta function, i.e. all variances are \sigma_n^2 and all covariances are zero. The function k_{\text{noise}}(x, x') = \sigma_n^2 \delta(x, x') is symmetric and positive semidefinite, and is in fact a valid covariance function. The process with noise can then be modeled using a sum of covariance kernels,

k(x, x') = k_{\text{process}}(x, x') + k_{\text{noise}}(x, x').    (2.32)

The corresponding covariance matrix for given points X is then

K(X, X) = K_{\text{process}}(X, X) + \sigma_n^2 I,    (2.33)

which is the term used in (2.16). Different assumptions on the noise can be modeled by replacing \sigma_n^2 I. For example, if the noise is instead assumed to have some, but possibly low, correlation in space, i.e. if x and x' are close together in input space then \varepsilon(x) and \varepsilon(x') are somewhat correlated, then a second Matérn kernel with a small length scale l_n and variance \theta_0 = \sigma_n^2 can be used. A similar approach is used when incorporating prior knowledge into the covariance kernel; e.g. monthly mean temperature can be modeled by a Matérn kernel for long-term trends, a periodic kernel for seasonal variations, and noise [1].


2.2.3 Estimating Hyperparameters

Gaussian process regression includes a number of model selection decisions, for example the choice of prior mean and covariance functions, and estimation of the corresponding hyperparameters. The hyperparameters which are to be estimated depend on the choice of covariance kernel. For the ARD SE and Matérn kernels, the hyperparameters \theta consist of the length scales l, the variance \theta_0 and the noise variance \sigma_n, \theta = \{l, \theta_0, \sigma_n\}. One possible method for estimating the hyperparameters is Bayesian model selection using the maximum likelihood (ML) estimate, i.e. maximizing the probability of the obtained outputs given the inputs and model parameters. Bayesian model selection is often done in stages: model parameters, hyperparameters and structure are often estimated separately. Each of the stages can also be performed by manual selection, for example the choice of GP prior and covariance function.

Since Bayesian optimization with a GP model is a non-parametric method, the model parameters are not well-defined. One interpretation is to use the function values f = f(x); combining (2.1) and (2.2), the marginal likelihood p(y \mid X) becomes

p(y \mid X) = \int_{-\infty}^{\infty} p(y \mid f, X) \, p(f \mid X) \, df.    (2.34)

Letting the prior mean m(x) = 0, since this does not restrict the predictions, and using the prior from section 2.2.1, the log marginal likelihood is obtained as

\log p(y \mid X, \theta) = -\frac{1}{2} y^T [K(X, X) + \sigma_n^2 I]^{-1} y - \frac{1}{2} \log \left| K(X, X) + \sigma_n^2 I \right| - \frac{n}{2} \log 2\pi,    (2.35)

where |\cdot| in the second term denotes the matrix determinant. See chapters 2.2 and 5.4 in [1] for more details. The first term in (2.35) can be interpreted as how well the model fits the data, the second term as a “complexity penalty”, since the determinant is smaller for smoother covariance matrices, and the final term as a normalization constant. Thus, the maximum likelihood estimate of the hyperparameters attempts to balance model fit against complexity, known as the bias-variance tradeoff in machine learning, and thereby reduce overfitting [1] [3].

The log marginal likelihood is then maximized using standard methods, for example by taking its partial derivatives with respect to the hyperparameters,

\frac{\partial}{\partial \theta_j} \log p(y \mid X, \theta) = \frac{1}{2} y^T K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} y - \frac{1}{2} \operatorname{tr}\!\left( K^{-1} \frac{\partial K}{\partial \theta_j} \right),    (2.36)

and setting them to zero; again, see [1] for details. However, finding the maximum likelihood estimate is often not that straightforward. The log marginal likelihood often has multiple local optima, corresponding to different interpretations of the data: outputs that vary rapidly can be explained by an underlying function with a very small length scale, but also by a long-term trend with noise. A larger number of hyperparameters corresponds to a more complex model, allowing for a greater number of different interpretations of the results, and more data is needed to obtain accurate estimates of the hyperparameter values.

2.3 Bayesian Optimization

2.3.1 Efficient Optimization of Black-box Functions

A typical optimization problem can be formulated as

\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & x \in \mathcal{X}, \end{aligned}    (2.37)

where f(x) : \mathcal{X} \to \mathbb{R} is the objective function and \mathcal{X} \subseteq \mathbb{R}^D is the feasible region for x. The goal is to find a globally optimal solution x^* such that f(x^*) \leq f(x) for all x \in \mathcal{X}, i.e. x^* = \operatorname{argmin}_{x \in \mathcal{X}} f(x).

In the case where f(x) is a known function and its derivatives are available, there exists a large number of methods which use the so-called optimality conditions on the gradients to identify local minima. If the problem is convex, a local minimum is also a global minimum; if f(x) is non-convex, however, global optimality is difficult to prove without comparing all local optima. Some examples of method classes using the optimality conditions are line search methods and interior methods [8].

In many applications, f (x) is a black-box function and is thus only available through (possibly noisy) measurements, for example if f (x) is the result of a simulation or real-world experiment.

In these cases, the gradients of f(x) are unavailable, and the optimality conditions cannot be used. Approximating the gradients using finite differences is problematic for multiple reasons.

First, the accuracy of the finite difference approximation is limited. If the value of f(x) is accurate to four significant digits, gradients approximated using the standard forward difference are only accurate to two significant digits [8]. Second, if evaluations of the objective function f(x) are noisy, the approximations can be very inaccurate, especially if f(x) is near-constant, so that the gradients of f(x) are smaller than or of the same order of magnitude as the noise.

Finally, using finite differences requires more evaluations of f(x), which is problematic if f(x) is expensive³ to evaluate [9].
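The noise issue can be illustrated with a short numerical sketch (the function and numbers below are invented for illustration): forward differences on a slowly varying objective, observed with even modest noise, produce derivative estimates dominated entirely by the noise term σ/h.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_noisy(x, sigma=1e-2):
    """A slowly varying objective observed with additive Gaussian noise."""
    return np.sin(0.1 * x) + sigma * rng.standard_normal()

# True derivative at x = 0 is 0.1 * cos(0) = 0.1, while the noise std is 1e-2.
x0, h = 0.0, 1e-4
fd_estimates = [(f_noisy(x0 + h) - f_noisy(x0)) / h for _ in range(100)]

# The forward-difference error is O(h) + O(sigma / h); with h = 1e-4 the noise
# term is of order 1e2, swamping the true slope of 0.1.
print(np.std(fd_estimates))   # spread of the estimates is on the order of 1e2
```

Choosing a larger h reduces the noise term but increases the truncation error, so no step size rescues the estimate once σ is comparable to the function's variation.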

Therefore, it is often preferred to use methods which do not require information about the derivatives. In general, these perform worse than their derivative-based counterparts if derivatives actually are available, but better than derivative-based methods using finite difference approximations. Derivative-free optimization algorithms can be classified into direct search algorithms and model-based search algorithms. Direct search methods generate a sequence of trial solutions with the goal to converge to an optimal solution. Model-based methods instead attempt to construct a surrogate model by sampling the black-box objective function, and then optimize the surrogate model [10].

The main drawback of most derivative-free methods is that they converge slowly. Many methods evaluate the function in a pattern around each trial point to determine the next trial point, thus many evaluations are needed for each iteration. Just as in the case of finite differences, the number of evaluations of the objective function becomes limiting if the evaluations are expensive.

For example, the direct search method compass search has a slower than linear convergence, compared to the superlinear convergence of derivative-based quasi-Newton methods [8].

2.3.2 The Bayesian Optimization Algorithm

Bayesian optimization is a heuristic model-based search method for efficient global optimization⁴ of possibly noisy functions with unavailable derivatives. The general idea is similar to other model-based methods: sample the objective function f(x) to construct a surrogate model, which is cheap to evaluate and for which derivatives are available, and then iteratively improve the model by sampling more points, using Bayes' rule to update the model. The search pattern of Bayesian optimization has been shown to be similar to the search strategy used by humans for optimization in one dimension, as it takes into account all known information and beliefs to sample in locations which are believed to be optimal or have a high uncertainty [12].

³ For example, costly in time or material resources.

⁴ Bayesian optimization is sometimes referred to simply as efficient global optimization (EGO) [11].

Overview

In Bayesian optimization, the optimization problem is treated as a decision problem [13]. Evaluation points x1, ..., xN are determined sequentially using all the available information at each point in time, and evaluations give outputs y1, ..., yN from the objective function f(x) with additive noise

yn = f(xn) + εn.    (2.38)

All the available information, or data, at step n is denoted Dn = {(xi, yi) | i = 1, ..., n}, and is used to determine the next evaluation point xn+1. The main advantages are that the algorithm is data efficient, works when f(x) is only available through noisy evaluations, and does not require the objective function to be convex, making it well-suited for efficient global optimization.

The Bayesian optimization algorithm consists of two major parts: the surrogate model representing the current knowledge about the black-box function f(x), and an acquisition function α(x; D) representing the expected utility of sampling f(x) at x. In each iteration, the acquisition function is maximized to find the next point to evaluate. This is usually done using standard optimization methods, since α is a known function with gradients that can be computed. The function is then evaluated and the surrogate is updated using Bayes' rule. The algorithm terminates when some criterion has been met, usually that a predefined evaluation budget has been spent [3].

Algorithm 1 Bayesian Optimization

1: Select initial points {x0}.
2: Evaluate the objective function at the initial points: {y0} = f({x0}) + ε.
3: Construct a surrogate model from the data D0 = {(x0, y0)}.
4: for n = 1, 2, ... do
5:     Maximize the acquisition function to get the next evaluation point: xn = argmax_{x∈X} α(x; Dn−1).
6:     Evaluate the objective function at the obtained point: yn = f(xn) + ε.
7:     Add the new point and its objective value to the data: Dn = Dn−1 ∪ {(xn, yn)}.
8:     Update the surrogate model using Bayes' rule.
9: end for
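Algorithm 1 can be sketched end to end in one dimension. The sketch below assumes a zero-mean GP surrogate with a squared-exponential kernel and hand-picked fixed hyperparameters (a real implementation re-estimates them each iteration), noise-free evaluations, and a dense grid in place of a proper optimizer for step 5; EI is written in its minimization form (improvement = ybest − µ), and Forrester's test function serves as the objective.

```python
import numpy as np
from scipy.stats import norm

def kernel(a, b, ell=0.15, s2=25.0):
    """Squared-exponential kernel; hyperparameters fixed by hand for brevity."""
    return s2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def gp_posterior(xs, X, y, noise=1e-4):
    """Posterior mean and std of a zero-mean GP at the test points xs."""
    Kinv = np.linalg.inv(kernel(X, X) + noise * np.eye(len(X)))
    Ks = kernel(xs, X)
    mu = Ks @ Kinv @ y
    var = kernel(xs, xs).diagonal() - np.einsum('ij,jk,ik->i', Ks, Kinv, Ks)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, y_best):
    """EI for minimization: the improvement is y_best - mu."""
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def objective(x):
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)    # Forrester's function

X = np.linspace(0, 1, 6)                 # initial points (step 1)
y = objective(X)                         # initial evaluations (step 2)
grid = np.linspace(0, 1, 501)

for _ in range(20):                      # steps 4-9
    mu, sigma = gp_posterior(grid, X, y)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X, y = np.append(X, x_next), np.append(y, objective(x_next))

x_best = X[np.argmin(y)]
print(x_best)    # close to the known global minimum near x = 0.757
```

With only 6 initial points and 20 iterations, the sampled points cluster around the global minimum basin, illustrating the data efficiency that motivates the method.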

Assumptions

The standard assumptions for the Bayesian optimization algorithm are that the objective function f(x) is Lipschitz-continuous, i.e. there exists a so-called Lipschitz constant L ≥ 0 such that

‖f(x1) − f(x2)‖ ≤ L ‖x1 − x2‖   ∀ x1, x2 ∈ X,    (2.39)

and that the search space X is bounded in all dimensions. It is also commonly assumed that the noise is additive as in (2.38), and that the noise terms ε are independent, identically distributed and Gaussian with zero mean [7], i.e.


ε ∼ N(0, σn²).    (2.40)

It is possible to modify the Bayesian optimization algorithm so that the noise is not required to be independent and identically distributed, by changing the way the noise is modeled. When Gaussian processes are used, this can be accomplished by modifying the noise term in the covariance function (2.33) [1]. In fact, the independent and identically distributed assumption on the noise rarely holds in reality; for example, the noise can be multiplicative and/or heteroscedastic. However, a model with i.i.d. noise is simpler and often a sufficiently good approximation.

The Surrogate Model

There are many different models that can be used for Bayesian optimization, but Gaussian processes are by far the most commonly used. The main advantage of GPs concerns tractability.

The normal distribution is its own conjugate prior, thus the posterior will also be Gaussian and can be calculated by (2.17), (2.18) and (2.19) to update the model in each iteration. Similarly, the marginal likelihood (2.34) used in estimation of the hyperparameters can be calculated without numerical integration. As covered in section 2.2, the GP regression model is very flexible, with smoothness assumptions that can be adapted through different choices of covariance kernel and hyperparameter values.

2.3.3 The Acquisition Function

The acquisition function, also sometimes called the utility function or infill function, is used to determine the next location in the search space to sample. The goal of any optimization involving uncertainty is to minimize the expected loss or regret by finding a point x_est which is as close to the true minimum x_true as possible, i.e. minimizing |f(x_est) − f(x_true)|. This can be a very difficult and computationally demanding problem to solve, so Bayesian optimization instead uses a myopic (short-sighted) strategy. In each iteration, the point with the best expected utility is chosen, without regard to future iterations. This strategy is known to be suboptimal in terms of evaluation budget efficiency, but still gives good results [5]. The next evaluation point is therefore determined by maximizing the acquisition function: xn+1 = argmax_{x∈X} α(x; Dn, θ).

Typically, the acquisition function depends on all previous observations Dn and the current model hyperparameters θ, i.e. α = α(x; Dn, θ). The acquisition function balances the search between exploration and exploitation, known as the exploration-exploitation trade-off, directing the search to points where the uncertainty is high or where the objective is expected to have good values. This corresponds to the balance between global and local search.

For the Gaussian process model, since a Gaussian process is uniquely defined by its mean function µ(x; D, θ) (2.18) and (auto)covariance function k(x, x′; D, θ) (2.19), acquisition functions are commonly defined through these and the variance σ²(x; D, θ) = k(x, x; D, θ). This is another advantage of the GP model, as an acquisition function with a known analytic form is much easier to maximize than the unknown objective f(x).

For simplicity, the dependence on all known data D and hyperparameters θ will be implicit in the definitions of acquisition functions. The notation φ(·) denotes the PDF of the standard normal distribution, and Φ(·) the corresponding CDF.

φ(x; µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))    (2.41)

Φ(x) = (1/√(2π)) ∫₋∞ˣ e^(−t²/2) dt    (2.42)

The probability of improvement (PI) aims to maximize the probability of improving over the current best known point xbest = argmin_{xi} f(xi):

αPI(x) = Φ((µ(x) − f(xbest) − ξ)/σ(x)).    (2.43)

The main disadvantage of PI is that it focuses solely on exploitation, since it is likely that the points with the greatest probability of improvement are the ones near the current best point with only very small expected improvement. The parameter ξ is therefore added to encourage exploration by requiring the improvement to be greater than ξ, possibly with some cooling schedule.

The expected improvement (EI) acquisition function takes both the probability of improvement and the size of the improvement into account:

αEI(x) = (µ(x) − f(xbest)) Φ((µ(x) − f(xbest))/σ(x)) + σ(x) φ((µ(x) − f(xbest))/σ(x))   if σ(x) > 0,
αEI(x) = 0   if σ(x) = 0.    (2.44)

The lower confidence bound (LCB) criterion for minimization problems, or upper confidence bound (UCB) for maximization, uses the confidence bounds with a parameter κ:

αLCB(x) = µ(x) − κσ(x),    (2.45)
αUCB(x) = µ(x) + κσ(x).    (2.46)

Of the described acquisition functions, EI is the most commonly used [3] [7] [14]. There are many global optimization methods for optimizing the acquisition function [3], for example the derivative-free DIRECT optimizer [7] and the quasi-Newton method limited-memory BFGS (L-BFGS) [15].

2.3.4 Choosing Initial Points

The Bayesian optimization algorithm requires a number of initial points {x0} to construct the first model. The number of initial points and where in the search space they are placed can have a very large effect on the convergence of the algorithm. For example, if all the initial points lie in the same sub-region of the search space, the model will be less reliable further away from that region. For an objective function f(x) with several local minima, it is then possible that the true global minimum is never sampled, due to an erroneous assumption made by the model. The sampling of initial points can be seen as a purely exploratory phase, whose purpose is to obtain an overview of the function's behavior. The number of initial points and the method for choosing their locations can therefore have a large impact on the performance of the algorithm.

There is no consensus method for determining the number of initial points to use, as it is heavily dependent on the expected behavior of the objective function and the number of dimensions. Thus the number of initial points ranges from two-thirds of the total number of evaluations [11] down to two initial points for a 1D optimization [7]. For sampling locations, the simplest and most commonly used methods are grid search and random search. A comparison between grid and random search for hyperparameter optimization showed that random search outperforms grid search, especially for problems with low effective dimensionality, as more distinct values of the

“important” dimensions are tested [16]. The same applies to choosing initial points for Bayesian optimization: it is desirable that the initial points cover the feasible region with minimal overlaps.

More complex methods include Latin hypercube sampling and Sobol sequence sampling [15]. It is possible to incorporate prior knowledge into the choice of initial points, for example by designing a grid which has a higher density of points in known regions of interest.
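A sketch of generating initial points by Latin hypercube sampling with SciPy's quasi-Monte Carlo module; the bounds below reuse the x1, x2 ranges of this problem purely as an example, and `qmc.Sobol` provides Sobol sequences through the same interface.

```python
import numpy as np
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=2, seed=0)
unit_points = sampler.random(n=10)        # 10 points in the unit square

# Scale to a 2D search space, e.g. x1 in [0, 50] and x2 in [120, 220].
points = qmc.scale(unit_points, l_bounds=[0, 120], u_bounds=[50, 220])

# Defining property of a Latin hypercube design: each of the 10 equal-width
# bins of every dimension contains exactly one point, so the design covers
# the feasible region with minimal overlap.
bins = np.floor(unit_points * 10).astype(int)
print(sorted(bins[:, 0].tolist()))    # -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Compared to pure random sampling, this guarantees that every marginal range is covered, which matters most for the "important" dimensions discussed above.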

2.3.5 Advantages and Disadvantages

Bayesian optimization is a heuristic method. As stated previously, the myopic strategy based on the acquisition function is known not to be optimally efficient in terms of the number of evaluations. While there exist proofs of convergence, they rely on circumstances that rarely hold in practice, for example that the hyperparameters are known in advance [5]. However, the Bayesian optimization algorithm has been shown to perform well, both on various test functions [11] and on more practical tasks, including optimization of machine learning hyperparameters [14] [17] and robot gait optimization [18].

The main advantages of Bayesian optimization have already been stated: it is data efficient, makes very few and very unrestrictive assumptions, and can handle noise. The main limitations of Bayesian optimization are the lack of guaranteed convergence to the global optimum, and the computation time. In each iteration, the inverse of an n × n matrix is calculated, where n is the number of data points collected. The matrix inverse has a time complexity of O(n³), making the algorithm unsuitable for larger amounts of data. However, for problems where evaluations of the objective function are expensive, the computation time can be a relatively small cost [1].

2.3.6 Example

An example of a few iterations of Bayesian optimization is displayed in figure 2.1. The underlying function is a test function known as Forrester's function, f(x) = (6x − 2)² sin(12x − 4), and it is optimized by Bayesian optimization using a GP model with a Matérn 5/2 kernel and the EI acquisition function. The figures were generated in Python using GPyOpt [15].


(a) After the first iterations the samples are explained as a near-constant function with a lot of noise.

(b) The model changes once enough data has been obtained.

(c) The region where the acquisition function has non-negligible values becomes more limited, making the search more refined.

Figure 2.1: Example of three iterations of Bayesian optimization.


3 | Problem Formulation and Scope

3.1 Optimization Target

The main focus of this thesis project is on the Bayesian optimization algorithm and its use in optimization of the jetting waveform parameters. In order to conduct tests of the Bayesian optimization algorithm, jetting quality will be evaluated using a high-speed camera instead of jetting dots onto a test board. This allows for near-unlimited testing using a simpler development rig “lillaMy”, instead of using more expensive time on full jet printing machines. The scope of this project does not include defining an objective that captures the different aspects of jetting quality (positioning, shape, satellites, etc.) since these are not obtainable from the high-speed camera. For simplicity, the main focus will be on a scalarized objective capturing as many aspects of jetting quality as possible, as opposed to a multi-objective optimization problem.

Some suggestions on how to extend the method to multi-objective optimization are discussed in section 6.2.

The optimization target used is the average relative break-off, illustrated in figure 3.1. The relative break-off is defined as the cross-sectional area of the parts that have broken off from the main body (highlighted in blue) relative to the cross-sectional area of all ejected solder paste (the sum of the red and blue areas). Since the jetting speed can vary depending on the parameters, the relative break-off is measured when the main part of the dot (red in the figure) has completely passed some set distance d. Based on existing knowledge, the amount of break-off should loosely correlate with the satellite level when jetting onto boards. Intuitively, this is because secondary bodies of solder paste land outside of the desired location for the dot and form satellites.

Figure 3.1: Illustration of relative break-off. The illustration has been rotated 90 degrees clockwise.

Two examples using images from the high speed camera are displayed in figure 3.2.


(a) Example of low relative break-off

(b) Example of high relative break-off

Figure 3.2: Examples from high speed camera. Images have been rotated 90 degrees.

Finally, the scope will also be limited to optimizing jetting quality in steady state shooting. In the initial exploratory tests, it was found that jetting quality often needed some time to settle after a significant change in parameter values, see figure 3.3 for an example, with jetting speed included for clarity. Typically, between 2000 and 5000 dots had to be jetted before results were completely stable.

Figure 3.3: Initial variations after a parameter change from x1 = 35, x2 = 195 to x1 = 20, x2 = 150. (a) Relative break-off and (b) jetting speed (tip speed, m/s), both plotted against dot number.

It was also found that some combinations of parameters caused the jetting quality to vary periodically, for example as in figure 3.4. The objective function is therefore defined as a long average over 30600 dots, which is significantly longer than the longest period time, and the initial 5100 dots are not used, in order to allow time for the jetting quality to stabilize.


Figure 3.4: Periodical variations for x1 = 14, x2 = 219. (a) Relative break-off and (b) jetting speed (tip speed, m/s), both plotted against dot number.

In the color coding of figure 3.1, the objective function is roughly defined as

f(x) = mean(area of break-off / total area) = mean(area of blue part / area of both parts).    (3.1)

3.2 Mathematical Formulation

The problem of finding optimal waveform parameters can be stated as an optimization problem in the form of (2.37):

minimize f(x)
subject to x ∈ X,    (3.2)

where x = (x1, x2, x3, x4, x5, x6, x7)ᵀ represents the seven waveform parameters, and X is the search space. Each of the parameters corresponds to a shape characteristic of the voltage pulse. x1 and x2 correspond to the two parameters that are known to be the most important and are currently optimized for. The main difficulties of the problem are the issues outlined in section 2.3.1: f(x) is not a known function, its derivatives are unavailable, it is time-consuming to sample, and evaluations are noisy; thus Bayesian optimization should fit the problem.

3.2.1 The Search Space

The search space X is bounded by a hyperrectangle, which fulfills the Bayesian optimization assumption that the search space should be bounded. The search space and default values for each of the parameters are given in table 3.1. The values of x6 and x7 are arbitrary when the remaining parameters are at their defaults, thus their defaults are simply set to 0.


Table 3.1: Search space and default parameter values.

Parameter   Search range       Default value
x1          0 ≤ x1 ≤ 50        Not used
x2          120 ≤ x2 ≤ 220     Not used
x3          10 ≤ x3 ≤ 80       50
x4          10 ≤ x4 ≤ 80       50
x5          0 ≤ x5 ≤ 220       0
x6          10 ≤ x6 ≤ 80       0
x7          10 ≤ x7 ≤ 80       0

The existing knowledge on x1 and x2 is used to further limit X by adding a constraint to keep the search away from regions where jetting quality is known to be poor and some measures of quality are undefined:

5 ≤ x2/x1 ≤ 20.    (3.3)

Finally, the parameters can only take integer values: x1, x2, x3, x4, x5, x6, x7 ∈ Z.
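The constraints can be collected into a small feasibility check; the bounds are copied from table 3.1, while the helper names are our own. Note that the tabulated defaults for x6 and x7 (0) lie outside their stated search ranges, so the example point uses in-range values instead.

```python
# Search ranges from table 3.1.
BOUNDS = {'x1': (0, 50), 'x2': (120, 220), 'x3': (10, 80), 'x4': (10, 80),
          'x5': (0, 220), 'x6': (10, 80), 'x7': (10, 80)}

def is_feasible(x):
    """x: dict mapping names to the seven integer waveform parameters."""
    if any(not isinstance(v, int) for v in x.values()):
        return False                                   # integrality constraint
    if any(not (lo <= x[k] <= hi) for k, (lo, hi) in BOUNDS.items()):
        return False                                   # hyperrectangle bounds
    if x['x1'] == 0:
        return False                                   # ratio (3.3) undefined
    return 5 <= x['x2'] / x['x1'] <= 20                # ratio constraint (3.3)

good = dict(x1=35, x2=195, x3=50, x4=50, x5=0, x6=10, x7=10)
bad = dict(good, x1=45)              # ratio 195/45 ≈ 4.3 violates (3.3)
print(is_feasible(good), is_feasible(bad))    # -> True False
```

In the optimizer, such a check would be applied both to candidate initial points and to the maximizer of the acquisition function before each evaluation.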

3.3 Experimental Setup

An existing implementation of Bayesian optimization for Python, GPyOpt [15], is adapted (in Python) to create a Bayesian optimization algorithm for optimization of the waveform parameters. The Bayesian optimization algorithm is then connected to a (from its perspective) black-box objective function, with parameters x as input and average relative break-off as output. Evaluations of the objective function consist of jetting in the "lillaMy" development rig, capturing jetting images using a high-speed camera, and analysis of the high-speed video in Matlab.

Figure 3.5: Schematic of Bayesian optimization iterations.


4 | Results and Analysis

The Bayesian optimization algorithm was run using a Gaussian process model with a zero mean prior and a Matérn 5/2 covariance kernel with independent and identically distributed noise, using maximum likelihood to estimate hyperparameters. The choice of the Matérn 5/2 kernel reflects an assumption that the target objective can be modeled by a GP which is twice mean square differentiable. Exploratory tests found no significant difference in performance between the expected improvement (EI) and lower confidence bound (LCB) acquisition functions, and EI was chosen as it is the more commonly used one. Inputs were normalized to be between 0 and 1 for all runs except for the first test involving ARD (automatic relevance determination). To verify the results, all tests involving jetting were done using an ejector model and solder paste type which are both very well known. The best values of x1 and x2 for jetting quality (not necessarily for relative break-off) for this ejector-paste combination are x1 = 35 and x2 = 195.

In total, results from seven complete optimization trials for waveform parameters are presented.

These start at the known x1 and x2, and end at an attempt to optimize all seven parameters simultaneously; see table 4.1 for an overview. Results are presented in the form of the best found parameter values and hyperparameter values. For the two-parameter optimizations, the final GP model is also visualized.

The evaluation budget for most trials was set to 60 evaluations, which was the number of evaluations that could be done in a day. The distribution between initial points and Bayesian optimization iterations varies slightly, as tests were re-run with more initial points when the previous amount proved insufficient to explore the search space. For example, if a certain region was not explored initially and the optimizer started with an erroneous estimate of the objective value in that region, that region often would not be explored at all, in spite of the fact that it could potentially contain good objective values.

Table 4.1: Runs of Bayesian optimization.

Trial   Description                            # initial points   # iterations
A1      x1, x2                                 10                 40
A2      x1, x2                                 10                 50
B       x1, x2 with ARD                        20                 40
C       x3, x4 with good values of x1, x2      15                 45
D       x3, x4 with bad values of x1, x2       10                 50
E       x1, x2, x3, x4 with ARD                30                 30
F       All seven parameters with ARD          150                50


4.1 Two Parameter Optimization

4.1.1 x1 and x2

The Bayesian optimization algorithm was first tested on the already well-known parameters x1 and x2, with remaining parameters fixed at default values as shown in table 3.1. The final estimates of mean, standard deviation and acquisition function are shown in figure 4.1. The best found parameters are presented together with the final estimated values of the hyperparameters in table 4.2.

(a) A1

(b) A2

Figure 4.1: Final mean, standard deviation and acquisition function, optimization of x1 and x2.

Table 4.2: Results and final hyperparameter values, optimization of x1 and x2.

Trial   Best found parameters   Final hyperparameter values
A1      x1 = 38, x2 = 190       θ0 = 2.81, l = 0.184, σn² = 0.00597
A2      x1 = 40, x2 = 203       θ0 = 2.15, l = 0.183, σn² = 0.166

The best found values from A1 and A2 are close to each other despite differences in sampled points and evaluation noise, indicating that they are close to the optimal values for relative break-off. The small difference between A1 and A2 can be explained as noise, since figure 4.1 shows that the entire region in the upper right corner and along the right-hand boundary of the search space gives good results. Indeed, the average relative break-off varies only by a few
