
Optimizing process parameters to increase the quality of the output in a separator : An application of Deep Kernel Learning in combination with the Basin-hopping optimizer


“Statisticians, like artists, have the bad habit of falling in love with their models.”


Abstract

Achieving optimal efficiency of production in the industrial sector is a process that is continuously under development. Separators produced by Alfa Laval are found in several industrial installations, and it is therefore of interest to make these separators operate more efficiently. The separator investigated here separates impurities and water from crude oil, and its separation performance is partially affected by the settings of its process parameters. This thesis investigates whether optimal or near optimal process parameter settings, which minimize the water content in the output, can be obtained. Furthermore, it investigates whether the settings of a session can be tested to draw conclusions about their suitability for the separator. The data used in this investigation originates from sensors of a factory-installed separator. It consists of five variables which are related to the water content in the output. Two additional variables, related to time, are created to reinforce this relationship. Using this data, optimal or near optimal process parameter settings may be found with an optimization technique. For this procedure, a Gaussian Process with the Deep Kernel Learning extension (GP-DKL) is used to model the relationship between the water content and the sensor data. Three models with different kernel functions are evaluated, and the GP-DKL with a Spectral Mixture kernel is demonstrated to be the most suitable option. This combination is used as the objective function in a Basin-hopping optimizer, resulting in settings which correspond to a lower water content. Thus, it is concluded that optimal or near optimal settings can be obtained. Furthermore, the process parameter settings of a session can be tested by utilizing the Bayesian properties of the GP-DKL model. However, due to the large posterior variance of the model, it cannot be determined whether the process parameter settings are suitable for the separator.


Acknowledgements

First, I want to give my sincere thanks to my supervisor at Linköping University, Oleg Sysoev, who has been very supportive in discussing the methodology of this thesis and providing suggestions for improvements.

I would like to thank Harald Hermansson and Lennart Haggård at Decerno AB for arranging this thesis, introducing me to the company, and providing support during the process.

I would also like to thank Jimmie Karlsson and Jan Ackalin at Alfa Laval who have pushed this idea forward and followed up on the process.

A special thanks goes to Harpal Singh at Alfa Laval, who has been there to answer questions, introduced me to the relevant people, and made sure that I had a place to sit. Many thanks also go to Axel Werkelin for letting me have a place to stay during my visits to Stockholm.

Last but not least I would like to thank Berglind Dúna Sigurðardóttir for supporting me by proofreading my thesis and providing me with suggestions for improvements.


Contents

1 Introduction
  1.1 Aims
  1.2 Related work
  1.3 Ethical considerations
2 Data
  2.1 Variables
  2.2 Introducing time
    2.2.1 Between events
    2.2.2 Between shutdowns
  2.3 Preparation of data
    2.3.1 Missing data
3 Method
  3.1 Gaussian process regression
    3.1.1 Deep kernel learning
  3.2 Optimization procedure
    3.2.1 Basin-hopping
    3.2.2 k-Nearest Neighbour
  3.3 Investigation of process settings
4 Results
  4.1 Performance of models
  4.2 Optimal process settings
  4.3 Investigating process settings
5 Discussion
  5.1 Models
  5.2 Optimization procedure
  5.3 Testing settings
6 Conclusion
Appendix
  A Complementary figures


List of Figures

1.1 An S class separator, illustrated by Alfa Laval [3]
2.1 Water content in the output and events in the separator
2.2 Target variable y with marked restarts of the separator
2.3 Un-aggregated sample of the target variable y
3.1 An illustration of finding optimal or near optimal process parameters
3.2 Deep kernel learning, inspired by Wilson et al. [31]
4.1 Prediction on test data using DKL with RBF kernel
4.2 Prediction on test data using DKL with Periodic kernel
4.3 Prediction on test data using DKL with SM kernel
4.4 π from 200 different initial values
4.5 The minimized Water in Output from optimization
4.6 Simulation of the Basin-hopping optimizer
4.7 Testing the process parameters of a session
A.1 Introduced complementary time variables
A.2 Example of missing data occurrences
A.3 Distribution of Pressure out


List of Tables

2.1 Variables in X, respective groups and descriptive statistics
4.1 Resulting performance of the models on the test data


1 Introduction

Optimizing the efficiency of production has been an integral part of the modern industrial sector since its beginning in the late eighteenth century, when a handful of inventions transformed the way of manufacturing [16]. Recently, tools such as statistics, machine learning and AI have been applied to optimize processes in a wide range of areas, such as robotics, monitoring of the environment, and extraction of information [22].

Alfa Laval, founded in 1883, produces pumps, valves, heat exchangers and separators [2]. The company continuously works on making its products more efficient, while producing good results and following the regulations prescribed by law. This thesis focuses on a separator manufactured by Alfa Laval, more specifically a separator for heavy oil.

A separator is a machine that can separate fluids, or unwanted particles in a fluid, and its design may be tailored to a specific task. Figure (1.1) is an illustration of the separator investigated in this thesis. This separator takes pre-heated heavy oil and separates it continuously to clean out impurities. After separation, cleaned oil (yellow) is continuously pumped away, while impurities and water (blue) gather in the periphery of the separator's bowl. In the outlet, the water content of the oil is measured, and if it is too high, the separator disposes of the accumulated impurities and water [3]. If the oil in the output contains large amounts of water, it may damage machinery further on in the process. It is this water content that is investigated in this thesis; the title refers to it as "quality", since the larger the water content, the worse the quality of the output.

1.1 Aims

Over the years, these separators have been made more reliable and cost effective. It is of interest to find ways to further improve them by optimizing different process parameters. It is also known that the condition of a separator deteriorates with time due to constant and heavy use, and the effectiveness of the separator may therefore decrease. Consequently, the process parameter settings need to be adjusted over time to maintain the production of a good-quality product.



Figure 1.1: An S class separator, illustrated by Alfa Laval [3].

Motivated by this fact, the following aims for this thesis are formulated:

• While in operation mode, can optimal or near optimal process parameter settings be obtained to minimize the water content for the current state of the separator?

• Can the process parameter settings in a session be tested to see whether they are suitable for the separator?

1.2 Related work

Su and Chiang [24] investigate a way to optimize process parameters such that the average IC wire bonding strength is maximized. For this task, a Neural Network (NN) combined with a Genetic Algorithm (GA) is used to obtain optimal parameters for the industrial process. The NN, with a single hidden layer, models a non-linear multivariate relationship between the covariates and the response. A GA is then applied to evaluate the NN and obtain the parameter settings which maximize the average IC wire bonding strength predicted by the NN.

A more recent work, conducted by Pfrommer et al. [19], also adopts the combination of an NN and a GA, in this case to maximize a shear angle in the textile industry. In contrast to the work of Su and Chiang [24], Pfrommer et al. [19] put more emphasis on the structure of the NN, where the notable difference is the depth of the network: instead of a single hidden layer, they use four hidden layers in order to extract features that predict the shear angle with high performance.

Studies using Gaussian Processes (GPs) combined with Bayesian Optimization (BO) for making decisions are conducted by Wang et al. [26] and Shahriari et al. [22].



The GP function f is applied to the data to obtain a posterior, but is assumed to be expensive to evaluate. BO is thereafter applied to evaluate the assumed expensive function f by maximizing an acquisition function (utility function). Wang et al. [26] construct a utility function which favours points with high variance and high mean. When the number of observations T is large, Shahriari et al. [22] consider the usage of Sparse GPs in combination with BO. The results obtained from the BO procedure are used to find the optimal decision settings to increase productivity.

Studies with GPs on medical time series data are conducted by Cheng et al. [6], where a GP regression with a sparse kernel is used to predict hospital patient values. Cheng et al. [6] use this model on a time series because it can capture irregularly sampled observations in a probabilistic way. The chosen kernel is a mixture of spectral kernels, which allows periodic behaviour as well as short-term and long-term dependencies. The mixture includes the Radial Basis Function (RBF) kernel and the Periodic kernel.

Great success has been achieved by combining NNs with Reinforcement Learning (RL) in a procedure known as Deep Q-Learning (DQL). To enhance the performance of this procedure, Xuan et al. [32] replace the NN in DQL with the GP and Deep Kernel Learning (GP-DKL) methodology. Xuan et al. [32] use a simulation of a cart, where the optimal decision of which direction the cart should move is learned. Results indicate an average performance improvement when using the GP-DKL, but also more uncertainty in the performance.

1.3 Ethical considerations

This thesis is conducted in order to gain a new understanding of how to make a separator more efficient, in terms of optimizing process parameters to minimize the water content in the output. The methodology presented in this thesis may also be applied to investigate other variables in the separator, provided that data of sufficient quality exists. For instance, such investigations could lead to a better understanding of power consumption, predictive maintenance and emissions.

The information provided in this thesis does not concern any individuals or groups. This thesis only contains information about some attributes of a separator.


2 Data

Data is produced by different sensors installed on the separator, which take measurements every second. The data is collected and stored in a database, which can be accessed through an online IIoT (Industrial Internet of Things) interface. The separator operates in a factory and is therefore considered suitable for modeling the behaviour of a separator in a real-world scenario.

2.1 Variables

Throughout this thesis, process parameter settings are denoted by z; these are variables that can be changed before or inside the separator. Variables that are affected by z, but cannot themselves be changed, are denoted by x. The variable of interest, the response variable, is denoted by y. The whole data set X contains the variable y and the variable groups x and z, such that X = {y, x, z}. Henceforth, the variable groups x and z are collected in a T × D matrix P = [x z], and the observations in P are denoted p_1, p_2, ..., p_T.

Table 2.1: Variables in X , respective groups and descriptive statistics

Name             Group   Mean     Standard deviation   NA (%)
Temperature      z       97.93    1.59                 < 0.01
tpeak            z       27.86    19.00                0.0
Pressure out     z       2.05     0.44                 0.0
Pressure before  x       0.20     0.12                 0.0
Pressure inside  x       1.97     0.30                 0.0
tshut            x       351.52   217.13               0.0
Event            x       0.02     0.14                 5.02
Water in Output  y       87.48    1.606                0.0



The Temperature variable in Table (2.1) is a continuous variable that varies around its mean. It is this mean, or expected value, that can be changed by an operator. In the industrial process, this variable may be modified right before the separator.

The variables tpeak and tshut in the Table denote different time periods, and as illustrated in the Table, the two variables belong to different groups. Moreover, the mean values of tpeak and tshut are not to be interpreted as the average time between events or shutdowns, since these variables are running counts over time. For instance, the value 27.86 is not the average time between events, but rather the average of the running count of time steps since the last event.

As seen in the Table, there are three different Pressure variables. The variable representing the pressure in the outlet of the separator is of interest to Alfa Laval. This variable is continuous, similarly to Temperature, and may be controlled and changed. The other two Pressure variables in the Table are also continuous and are measured before and inside the separator. Together, the three variables may also be interpreted as a flow through the separator.

The Event variable is binary and denotes whether an event occurs (1) or not (0); an event is a necessary action in the separation process. From the Table it is observed that the mean is close to 0 and the standard deviation is low, which indicates that events occur at only a small fraction of the time points. A sample of the Event variable is visualized in Figure (2.1).

Water is undesirable in the output of the separator, and the goal is therefore to keep it at a minimum. This variable is denoted Water in Output and depends on all the presented variables. For instance, it is known by Alfa Laval that when an event happens, the water content in the output increases.

Another variable that is measured in the separator and used in the pre-processing of the data is the Motor Current (MC). According to information provided by Alfa Laval, this variable does not affect the water content, since it currently only has one power setting: either full or zero power. Therefore, this variable is not used in the optimization procedure. However, it is used in the pre-processing, with the purpose of removing time periods where the MC variable is zero or close to zero.

2.2 Introducing time

The variables tpeak and tshut, introduced in Table (2.1), denote the time between events and between shutdowns in the separator. This section motivates the creation of these variables.



Figure 2.1: (a) Water content in the output, (b) events in the separator

2.2.1 Between events

In discussion with Alfa Laval, it is of interest to investigate the optimal or near optimal time between events to obtain a lower water content. The time between events is a parameter which is controllable by an operator.

Looking at Figure (2.1), it may be seen that between each event there is a certain pattern, a "trend", in the Water in Output. The collection of events occurring between time points 300 and 350 in Figure (2.1b) also has a visible effect on the Water in Output in Figure (2.1a), where the water content first decreases and thereafter increases. Investigation of Figure (2.1) leads to the conclusion that introducing a new time variable tpeak, which counts the time since the last event (peak), may contribute to finding a relationship between the variable y and the variables in P. The variable tpeak is created according to the pseudo code in Algorithm (1), and a sample of the result is displayed in Figure (A.1a) in appendix (A). The Figure illustrates that the events between time points 300 and 350 from Figure (2.1) are handled correctly.

Algorithm 1 Creation of tpeak
1: set tpeak to an empty list
2: set t = 0
3: for i in 1 to T do
4:   if Event[i] ≠ 1 then
5:     t += 1
6:   else
7:     t = 0
8:   append t to tpeak
9: end for
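As a complement to the pseudo code, a minimal Python sketch of Algorithm 1 is given below. The use of pandas and the column name "Event" are assumptions about the data layout, not details taken from the thesis.

```python
import pandas as pd

def time_since_last_event(event: pd.Series) -> pd.Series:
    """Running count of time steps since the last event (Algorithm 1)."""
    t_peak, t = [], 0
    for e in event:
        if e != 1:          # no event at this time step: keep counting
            t += 1
        else:               # an event occurred: reset the counter
            t = 0
        t_peak.append(t)
    return pd.Series(t_peak, index=event.index, name="tpeak")

# Example (assumed data frame): df["tpeak"] = time_since_last_event(df["Event"])
```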



Figure 2.2: Target variable y with marked restarts of the separator

2.2.2 Between shutdowns

Figure (2.2) is a visualization of 10 000 minutes of the Water in Output variable, where the vertical lines denote restarts of the separator. The Figure illustrates that the level of y may differ between the start-ups and shutdowns of the separator. Thus, the variable tshut is introduced, which counts the elapsed time from the start-up of the machine until its shutdown. The time between a start-up and a shutdown will hereinafter be known as a session.

The procedure for creating this variable is presented in Algorithm (2), and a sample of the variable is shown in Figure (A.1b) in appendix (A). The approach utilizes the MC variable, as seen in Algorithm (2). The purpose of UL (Upper Limit) is to identify shutdowns, since the values of the MC variable may at times be very close to 0 but still be considered to indicate that the separator is turned off.

Algorithm 2 Creation of tshut
1: set Upper Limit: UL
2: set tshut to an empty list
3: set t = 0
4: for i in 1 to T do
5:   if 0 ≤ MC[i] ≤ UL then
6:     t = 0
7:   else
8:     t += 1
9:   append t to tshut
10: end for
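A corresponding Python sketch of Algorithm 2 is shown below; the column name "Motor Current" and the pandas layout are assumptions.

```python
import pandas as pd

def time_since_startup(motor_current: pd.Series, upper_limit: float) -> pd.Series:
    """Running count of time steps since the last start-up (Algorithm 2).

    The counter resets whenever the Motor Current lies between 0 and the
    upper limit UL, i.e. when the separator is considered to be turned off.
    """
    t_shut, t = [], 0
    for mc in motor_current:
        if 0 <= mc <= upper_limit:   # separator considered off: reset
            t = 0
        else:                        # separator running: keep counting
            t += 1
        t_shut.append(t)
    return pd.Series(t_shut, index=motor_current.index, name="tshut")

# Example (assumed names): df["tshut"] = time_since_startup(df["Motor Current"], UL)
```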



2.3 Preparation of data

As previously mentioned, the raw unprocessed data from the IIoT system is measured every second. In this thesis, the data is aggregated into minutes, where each minute is the mean of its seconds. One might argue that aggregating the data in such a way results in a loss of information. However, it is known that the general behaviour of the separator changes over longer periods of time, so it might not be necessary to investigate every individual second to estimate a model that represents the general behaviour of the separator. Aggregating the data also reduces the computational expense of estimating this model. By the same argument, one could aggregate seconds into, for instance, hours to obtain a model representing an even more general behaviour of the separator. However, Table (2.1) suggests that important variables, such as tpeak, might then contribute considerably less to the estimation of a model.

Figure 2.3: Un-aggregated sample of the target variable y

As illustrated in the right panel of Figure (2.3), the un-aggregated Water in Output variable seems to take only a few distinct levels (88.5, 88.6, 88.7, and 88.8). Information from Alfa Laval states that when sending data from the separator to the database, an aggregation or approximation is required to save bandwidth; hence, the data may have been rounded to the first decimal. Aggregating the data into minutes may provide a more general approximation of the behaviour.

Since the purpose is to find optimal or near optimal parameters while a separator is in operation mode, it is not interesting to investigate the behaviour of the separator when powered off. Therefore, all observations whose corresponding MC value is zero or close to zero are removed.



The data is split into training, validation and test sets with the proportions 85%, 7.5% and 7.5%, respectively. In this thesis, kernel based methods are used. This involves estimating a covariance matrix over all observations, and inverting such a matrix carries a risk of numerical errors. To reduce this risk, the variables described in Table (2.1) are scaled to a similar mean and standard deviation. Each dimension of P is scaled to zero mean and unit variance, such that

\[
V_s = \frac{V - \mu_V}{s_V} \tag{2.1}
\]

where V is a given dimension (variable) of P, \mu_V is the variable mean and s_V is the variable standard deviation. The result of scaling V is that V_s ∼ N(0, 1) [1].
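A minimal sketch of the data preparation described above is given below, assuming the raw data is a pandas DataFrame with a DatetimeIndex; the column name "Motor Current", the constant UL and the chronological split are assumptions, and the thesis does not state whether the scaling statistics are computed on the training split alone.

```python
import pandas as pd

def prepare(raw: pd.DataFrame, upper_limit: float):
    """Aggregate to minutes, drop powered-off periods, split and scale (eq. 2.1)."""
    df = raw.resample("1min").mean()                 # minute means of second-level data
    df = df[df["Motor Current"] > upper_limit]       # keep only operation mode

    n = len(df)                                      # 85 / 7.5 / 7.5 % split
    train = df.iloc[: int(0.85 * n)]
    valid = df.iloc[int(0.85 * n): int(0.925 * n)]
    test = df.iloc[int(0.925 * n):]

    # eq. (2.1); the thesis scales the dimensions of P, all columns are scaled here for brevity
    mu, sd = train.mean(), train.std()
    return (train - mu) / sd, (valid - mu) / sd, (test - mu) / sd
```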

2.3.1 Missing data

Investigating Table (2.1), it is observed that there are missing values. The highest proportion of missing values is around 5 percent, which is regarded as low. The cause of these missing values is unknown, and they appear to occur at random. Using the definitions from Donders et al. [7], the missing data may be regarded as a Missing At Random (MAR) case. An example of the missing data in the Temperature variable is shown in Figure (A.2) in appendix (A).

A suggestion from Donders et al. [7] is to use a multivariate normal distribution to incorporate the relationships between the observations. To perform such an imputation, all observations with missing values are first removed from X. Thereafter, a mean vector µ_X and a covariance matrix Σ_X are estimated, with dimensions corresponding to the number of variables in X. These components are then used to sample from a multivariate normal distribution to fill in the missing parts of the data.

The Event variable, seen in Table (2.1), contains binary values and is therefore a special case. Prior knowledge from Alfa Laval declares that if the value of Event at time point i is unknown, then it takes the value of the last known time point.
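A simplified sketch of the imputation is given below, assuming numpy and pandas; drawing one independent multivariate-normal sample per incomplete row (rather than a conditional draw) is my simplification, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

def impute(df: pd.DataFrame, cont_cols: list, seed: int = 0) -> pd.DataFrame:
    """Fill continuous gaps from a fitted multivariate normal; forward-fill Event."""
    rng = np.random.default_rng(seed)
    complete = df[cont_cols].dropna()                 # complete cases only
    mu = complete.mean().to_numpy()                   # estimate of mu_X
    sigma = complete.cov().to_numpy()                 # estimate of Sigma_X

    out = df.copy()
    for i in out.index[out[cont_cols].isna().any(axis=1)]:
        draw = rng.multivariate_normal(mu, sigma)     # sample a full observation
        for j, col in enumerate(cont_cols):
            if pd.isna(out.at[i, col]):
                out.at[i, col] = draw[j]              # fill only the missing entries

    out["Event"] = out["Event"].ffill()               # binary Event: last known value
    return out
```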


3 Method

In this chapter, the proposed methods used to address the aims presented in section 1.1 are described. To be able to find optimal parameters for sessions, a Gaussian Process regression with a Deep Kernel Learning (GP-DKL) extension is estimated to find a non-linear relationship between the response variable y, the covariates (x) and the process parameters (z). The Basin-hopping optimizer thereafter evaluates this GP-DKL regression to minimize the predicted value ŷ and thus obtain optimal or near optimal parameters. Furthermore, it is of interest to investigate whether the process parameters of a session, which are proposed by the optimization algorithm, produce the expected or a lower amount of water content.

Figure (3.1) is an illustration of how the collection of methods used in this thesis interacts. Note that the images do not represent the results of this thesis, only the methodology. With this in mind, the following three points correspond to the steps in the Figure.

1. Eight example observations are shown where the y-axis corresponds to the Water in Output variable.

2. With these observations and variables, a GP-DKL is estimated.

3. After the estimation of the GP-DKL, an iterative process known as the Basin-hopping optimizer is used to evaluate the GP-DKL to find which values correspond to the minimum of the water content, thus finding optimal or near optimal process parameter settings.

The variables in the groups z and x are hereinafter assumed to be scaled according to equation (2.1).

3.1 Gaussian process regression

The usage of kernels is, as previously mentioned in connection with Cheng et al. [6], an ideal way to handle time, periodic patterns, and observations that are similar to one another.


Figure 3.1: An illustration of finding optimal or near optimal process parameters

The Gaussian Process (GP) is a kernel-based method which operates in the Bayesian framework. As interpreted by Rasmussen and Williams [20], a GP is a collection of random variables, where any finite number of them has a joint Gaussian distribution. A random variable in a GP may be seen as an observation p_i from P. Recall that the observations in P are denoted p_1, p_2, ..., p_i, ..., p_T. Each observation may be expressed with a function f, which results in a marginal Gaussian distribution

\[
f(p_i) \sim \mathcal{N}\big(m(p_i), \kappa(p_i, p_i)\big) \tag{3.1}
\]

where m is the mean function and \kappa is the covariance function. These functions may be expressed as

\[
m(p_i) = \mathbb{E}\big[f(p_i)\big], \qquad \kappa(p_i, p_i) = \mathbb{E}\big[(f(p_i) - m(p_i))(f(p_i) - m(p_i))\big]
\]

where the mean function is henceforth set to zero.



It is noted that κ(p_i, p_i) is a scalar function, which allows the "kernel trick" to be applied. This technique maps the input space into a high- or even infinite-dimensional feature space, in which it may be easier to find relationships between variables and observations. This enables the GP to find complex and powerful relationships between an observation p_i and y_i in the input space [5].

The kernel functions considered for computing the covariance κ in this thesis are the common Radial Basis Function (RBF), the Periodic kernel and the Spectral Mixture (SM) kernel. The motivation for the RBF kernel is that, since the data is a time series, it is interesting to look at the similarity between observations [21]. The motivation for the Periodic kernel is similar; in addition, Figure (2.2) shows a periodic pattern which might be of interest to take into consideration [6, 9]. Any stationary kernel function may be approximated by the SM kernel, provided enough mixture components [29]; the number of mixture components used in this thesis is 2. Since the RBF and Periodic kernels are stationary, their attributes may be approximated by the SM kernel. To give an intuitive understanding of how a similarity can be computed with a kernel function, the RBF kernel is taken as an example:

\[
\kappa(p_i, p_j) = \nu^2 \exp\left(-\frac{\lVert p_i - p_j \rVert^2}{2\ell^2}\right) \tag{3.2}
\]

where the hyperparameter ν is a scale factor and ℓ is a length scale. It may be noted from equation (3.2) that since ||p_i - p_i|| = 0, we have κ(p_i, p_i) = ν². Hence, if two observations are similar, the value of the RBF kernel is close to ν².
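As a small illustration of equation (3.2), the following numpy sketch computes the RBF kernel for two observations; the hyperparameter values are arbitrary.

```python
import numpy as np

def rbf_kernel(p_i: np.ndarray, p_j: np.ndarray, nu: float, ell: float) -> float:
    """RBF kernel of equation (3.2): nu^2 * exp(-||p_i - p_j||^2 / (2 * ell^2))."""
    sq_dist = float(np.sum((p_i - p_j) ** 2))
    return nu ** 2 * np.exp(-sq_dist / (2.0 * ell ** 2))

# identical observations give the maximum value nu^2
assert np.isclose(rbf_kernel(np.ones(7), np.ones(7), nu=1.5, ell=2.0), 1.5 ** 2)
```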

The set of hyperparameters of the RBF, Periodic or SM kernel will henceforth be denoted θ.

As mentioned earlier, the GP may be expressed as a joint Gaussian distribution over a finite collection of random variables, where these random variables may be seen as the observations in the data. Equation (3.1) gives the distribution of a single observation. To express the joint Gaussian distribution over all observations, a T × T covariance matrix is required, which can be estimated as

\[
K(P, P) = K_{i,j} = \kappa(p_i, p_j) \tag{3.3}
\]

where i, j = 1, 2, ..., T. Here K denotes the function for computing the kernel matrix, and K the actual kernel matrix. The diagonal of this matrix contains the variances of the observations and the off-diagonal elements are the covariances between observations. Furthermore, note that the values on the diagonal are equal, since κ(p_i, p_i) = ν². The joint Gaussian distribution may therefore be expressed as

\[
\mathbf{f} = \big[f(p_1), f(p_2), \ldots, f(p_T)\big]^\top \sim \mathcal{MN}\big(\mathbf{0}, K(P, P)\big) \tag{3.4}
\]



where 0 is a T-dimensional zero vector. Similarly, the joint Gaussian distribution may be computed for the test data P_*, which is a T_test × D matrix, such that

\[
\mathbf{f}_* = \big[f(p^*_1), f(p^*_2), \ldots, f(p^*_{T_{test}})\big]^\top \sim \mathcal{MN}\big(\mathbf{0}, K(P_*, P_*)\big) \tag{3.5}
\]

where K(P_*, P_*) is a T_test × T_test kernel matrix. Equations (3.4) and (3.5) may thereafter be used to obtain the posterior predictive conditional distribution

\[
\mathbf{f}_* \mid P, \mathbf{y}, \theta, P_* \sim \mathcal{MN}\big(\mathbb{E}[\mathbf{f}_*], \operatorname{cov}(\mathbf{f}_*)\big) \tag{3.6}
\]
\[
\mathbb{E}[\mathbf{f}_*] = K(P_*, P)\big[K(P, P) + \sigma^2 I\big]^{-1}\mathbf{y} \tag{3.7}
\]
\[
\operatorname{cov}(\mathbf{f}_*) = K(P_*, P_*) - K(P_*, P)\big[K(P, P) + \sigma^2 I\big]^{-1}K(P, P_*) \tag{3.8}
\]

where σ² is the noise variance. If there are T_test test observations and T training observations, then K(P_*, P) is a T_test × T matrix; following equation (3.3), it is computed as κ(p*_i, p_j) with i = 1, 2, ..., T_test and j = 1, 2, ..., T.

The hyperparameters θ are learned by maximizing the log marginal likelihood

\[
\log p(\mathbf{y} \mid P, \theta) = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\log\big|K(P, P) + \sigma^2 I\big| - \frac{1}{2}\mathbf{y}^\top\big[K(P, P) + \sigma^2 I\big]^{-1}\mathbf{y} \tag{3.9}
\]

such that θ* = argmax_θ log p(y | P, θ), where θ* denotes the optimal or near optimal values. The noise variance is also treated as part of the hyperparameters. This log marginal likelihood will henceforth be referred to as LL.

Computing the exact GP regression may seem like a straightforward task, but from equations (3.6) to (3.9) it may be noted that the number of operations needed to solve the linear system involving [K(P, P) + σ²I]⁻¹ grows rapidly with T. Inference in a GP requires O(T³) operations and O(T²) storage, so the computational expense increases quickly with T. After inference, the predictive mean E[f_*] requires O(T) operations per observation p_* [31]. To simplify the notation for prediction with the GP, the expected value of the posterior predictive conditional distribution, E[f_*], will be denoted ŷ_*.
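For reference, a minimal numpy sketch of the exact GP equations (3.7)-(3.9) is given below; it is purely illustrative, since the thesis uses the scalable GP-DKL/KISS-GP formulation described next rather than this O(T³) computation.

```python
import numpy as np

def gp_posterior(K, K_star, K_star_star, y, noise_var):
    """Posterior mean and covariance, equations (3.7) and (3.8)."""
    A = K + noise_var * np.eye(len(y))               # K(P,P) + sigma^2 I
    mean = K_star @ np.linalg.solve(A, y)            # E[f*]
    cov = K_star_star - K_star @ np.linalg.solve(A, K_star.T)
    return mean, cov

def log_marginal_likelihood(K, y, noise_var):
    """Log marginal likelihood LL, equation (3.9)."""
    A = K + noise_var * np.eye(len(y))
    _, logdet = np.linalg.slogdet(A)
    return (-0.5 * len(y) * np.log(2 * np.pi)
            - 0.5 * logdet
            - 0.5 * y @ np.linalg.solve(A, y))
```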

3.1.1 Deep kernel learning

It is of interest to keep the flexibility and high performance of a GP in both inference and prediction. The major issue, however, is that inference in an exact GP requires O(T³) operations, plus O(T) operations for each predictive observation p_*, making it practical only up to a few thousand observations [30]. Therefore, a more sophisticated method is necessary to efficiently estimate the kernel matrix K.

Deep kernel learning (DKL), proposed by Wilson et al. [31], combines the properties of neural networks with flexible non-parametric kernel methods. To obtain a scalable kernel representation for large data sets, deep kernel learning is combined with Kernel Interpolation for Scalable Structured GPs (KISS-GP), proposed by Wilson and Nickisch [30].



The combination with KISS-GP results in powerful performance, with only O(T) training and inference operations and O(1) prediction operations per observation p_*.

As mentioned, a Neural Network (NN) is used in combination with KISS-GP. An NN, as interpreted by Goodfellow et al. [10], is a method that can perform both classification of categorical response variables and regression on continuous response variables. The NN defines a non-linear mapping

\[
Y^{NN} = g(P, W) \tag{3.10}
\]

where Y^NN is a T × C matrix containing the outputs y^NN_1, y^NN_2, ..., y^NN_T. This output matrix is not explicitly tied to a response variable when combined with kernel methods, which will become apparent later on; therefore C may be set to an arbitrary discrete number. The weights in the matrices W = {W_1, W_2, ..., W_L} (layers) are tuned to obtain the best function approximation. The network is built as a hierarchical structure, where each matrix W_l ∈ W represents a hidden layer with H_l hidden units. In every layer l, the matrix operation

\[
\tilde{h}_l = h_{l-1} W_l, \qquad h_l = \Phi(\tilde{h}_l) \tag{3.11}
\]

is performed, where the weight matrix W_l is an H_{l-1} × H_l matrix. The intermediate output of the layer is passed through an activation function Φ to achieve the non-linearity. The activation function used in this thesis is the ReLU, which maps values as Φ(h) = max(0, h). This activation function is a popular choice, since its derivative is equal to 1 for positive inputs, which increases numerical stability and speeds up convergence when training the network [10]. To further speed up convergence in combination with the ReLU activation function, the weight matrices in W are initialized as W_l ∼ N(0, 2/H_l) [12].

The observations p_i are fed into the network in a forward propagation, and equation (3.11) is applied for every layer. To obtain a new continuous representation from the NN, a linear activation function is applied to the output layer.

The output of an NN is evaluated in a loss function. Typically, this loss is a metric of how "wrong" the network is when estimating the response variable, hence the word "loss". Here, however, the network is combined with kernel methods, and the log marginal likelihood LL (see equation (3.9)) is used as the loss function. Hence, as stated at the end of section (3.1), this "loss" may rather be seen as a "utility" function that is to be maximized.

The value of LL is used to compute a gradient, which is used in a backward propagation. The goal is to tune the weights in W to maximize the value of LL. The gradient is used in a gradient descent optimizer; the one used in this thesis is Adam, proposed by Kingma and Ba [15]. The Adam algorithm is a suitable optimizer for this problem, since the experiments in [15] demonstrate that it consistently outperforms other methods on a variety of models and data sets.



It is also well suited to problems with large amounts of data or many parameters, because it has small memory requirements and works well for stationary as well as non-stationary objectives [15]. The weights in the network are tuned with

\[
W_l^{(k)} = W_l^{(k-1)} - \eta \cdot \frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon} \tag{3.12}
\]

for k = 1, 2, ... iterations, or epochs, where m̂_k and v̂_k are the bias-corrected moment estimates, η is the step length or learning rate, and ε serves to improve numerical stability. The estimation of m̂_k and v̂_k depends on a chosen setting of certain parameters; the settings used when training the GP-DKL follow the recommendations in [15, p. 2]. The values of η used in this thesis are 0.1, 0.01, and 0.001, applied in decreasing order during training: 0.1 at the start, 0.01 further into the training, and 0.001 during the final part. This schedule was decided from empirical testing of the models.

The output of the NN is thereafter used in the kernel function κ, such that

\[
K(Y^{NN}, Y^{NN}) = K^{deep}_{ij} = \kappa(y^{NN}_i, y^{NN}_j) \tag{3.13}
\]

where i, j = 1, 2, 3, ..., T and K^deep is a T × T kernel matrix estimated from the output of the deep NN. The network acts as a feature extractor and captures non-stationary and hierarchical structures [31].

Figure 3.2: Deep kernel learning, inspired by Wilson et al. [31]

The NN architecture used in this thesis is [D-1000-1000-500-50-2], that is, four hidden layers with a 2-dimensional output. This architecture is proposed and used by Wilson et al. [31] when T > 6000. Figure (3.2) illustrates the forward propagation of the GP-DKL, where the observations in P are passed through the hidden layers of the NN, producing a 2-dimensional output. This output is then passed to the kernel matrix function K, and the result is used in equations (3.6) to (3.9).
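A hedged gpytorch sketch of such a GP-DKL model is shown below. The layer sizes, the two-dimensional output, the SM kernel with two mixture components and the KISS-GP grid follow the text, while the zero mean, the feature scaling onto the grid and the mapping of u = 1000 inducing points to gpytorch's per-dimension grid_size argument are my assumptions, not the thesis's implementation.

```python
import torch
import gpytorch

class FeatureExtractor(torch.nn.Sequential):
    """The [D-1000-1000-500-50-2] network used as a feature extractor."""
    def __init__(self, d_in: int):
        super().__init__(
            torch.nn.Linear(d_in, 1000), torch.nn.ReLU(),
            torch.nn.Linear(1000, 1000), torch.nn.ReLU(),
            torch.nn.Linear(1000, 500), torch.nn.ReLU(),
            torch.nn.Linear(500, 50), torch.nn.ReLU(),
            torch.nn.Linear(50, 2),
        )

class GPDKL(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.feature_extractor = FeatureExtractor(train_x.size(-1))
        self.mean_module = gpytorch.means.ZeroMean()
        base_kernel = gpytorch.kernels.SpectralMixtureKernel(num_mixtures=2, ard_num_dims=2)
        # KISS-GP: interpolate onto a regular grid over the 2-dimensional NN output
        # (32 x 32 ~ 1000 grid points; gpytorch's grid_size is per dimension)
        self.covar_module = gpytorch.kernels.GridInterpolationKernel(
            base_kernel, grid_size=32, num_dims=2)

    def forward(self, x):
        z = self.feature_extractor(x)
        z = z - z.min(0)[0]                      # scale the extracted features
        z = 2.0 * (z / z.max(0)[0]) - 1.0        # to [-1, 1] so they lie on the grid
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(z), self.covar_module(z))
```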



This merge of the NN and kernel methods joins the weights W of the NN and the hyperparameters θ of the kernel function into the deep kernel parameters γ = {W, θ}. These parameters are jointly tuned with the Adam optimizer to maximize LL in equation (3.9) [31].
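A sketch of the joint training loop is given below, assuming the GPDKL class from the previous sketch and pre-scaled train_x and train_y tensors; the number of epochs per learning-rate stage is an assumption, only the decreasing schedule 0.1, 0.01, 0.001 is taken from the text.

```python
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPDKL(train_x, train_y, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)  # the LL objective

model.train()
likelihood.train()
for lr, n_epochs in [(0.1, 30), (0.01, 30), (0.001, 30)]:    # epoch counts assumed
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # jointly tunes gamma = {W, theta}
    for _ in range(n_epochs):
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)     # maximize LL by minimizing its negative
        loss.backward()
        optimizer.step()
```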

At the start of this section, the KISS-GP method was mentioned. This method was introduced by Wilson and Nickisch [30] with the purpose of obtaining a scalable kernel representation that approximates the exact kernel matrix. To obtain this scalable kernel matrix, the input to the kernel matrix function K is adjusted: the input is a grid U containing u regularly spaced inducing points placed over a 2-dimensional lattice, thus forming a u × 2 matrix. The dimensionality of this grid matches the output dimension of the NN, hence 2 dimensions. In experiments conducted by Wilson and Nickisch [30], the size of the grid U lies in the interval u ∈ [2500, 5000] for 59 306 observations. Considering this and the number of observations in this thesis, the size of U is set to u = 1000. The grid represents the structure of the output of the NN (Y^NN = g(P, W)) and is thus an approximation. The grid U is passed through the kernel matrix function, such that K^deep_U = K(U, U), resulting in a u × u matrix. Finally, to obtain the scalable kernel matrix, K^deep_U is used in

\[
S K^{deep}_U S^\top = K_{KISS}
\]

where S is a T × u matrix of interpolation weights with only 4 non-zero entries per row; thus, each row of S uses 4 weights to approximate an observation. Since the matrix K^deep_U is generated from a stationary kernel function and the inputs U are regularly spaced on a grid, this matrix is a Toeplitz matrix¹ [30, p. 7]. The kernel matrix may also be expressed as a Kronecker product² K^deep_U = K_1 ⊗ · · · ⊗ K_J, where K^deep_U is divided into J parts. These structures enable inference by solving K⁻¹_KISS y with linear conjugate gradients³, which only involves matrix-vector products and requires only a small number of iterations to converge to machine precision [30].

Evaluation

The RBF, Periodic and SM kernels are each modeled separately with the GP-DKL method. The proposed approach for evaluating these models is to compare them using the mean absolute error (MAE), a 95 percent credible interval (CI), and visual inspection.

¹A Toeplitz matrix is a matrix in which each diagonal has a constant value. An example of a Toeplitz matrix is found in [14, p. 156-157].

²The Kronecker product between an m × n matrix A and a p × q matrix B is denoted A ⊗ B and results in an mp × nq block matrix. An example of the Kronecker product is found in [14, p. 156-157].

³The Conjugate Gradient method is an iterative method for solving large systems of linear equations, such as Ax = b. Detailed examples of different versions of this algorithm may be found in [23, p. 49-53].



The MAE is a measure of how "wrong" the model's predictions are; it is therefore preferable to minimize the MAE. The measure is computed as

\[
MAE = \frac{1}{T_{test}} \sum_{i=1}^{T_{test}} \lvert \hat{y}^*_i - y^*_i \rvert \tag{3.14}
\]

where ŷ*_i is the prediction and y*_i is the true point in the test data. Willmott and Matsuura [28] argue that the MAE is a natural measure of the average error, since it is less affected by outliers in the data. Investigating Figure (2.2), it is seen that the data has a regular pattern of spikes, occurring approximately every minute. It is questionable whether these spikes should be seen as outliers, since their cause is known and taken into consideration when modelling the GP-DKL. However, their magnitudes differ greatly and the model might have difficulties estimating them with high accuracy. The spikes are therefore treated as unpredictable events, i.e. as outliers. Hence, equation (3.14) is a suitable choice for evaluating the models.

The CI utilizes the posterior mean and variance of every predictive observation ŷ_*. The CI is defined as

\[
\hat{y}^*_i \pm 1.96 \cdot \sqrt{\operatorname{cov}(\mathbf{f}_*)_{ii}}
\]

where 1.96 is the quantile of the normal distribution that corresponds to 95 percent of the distribution [9]. The interval is used such that if a true point of y_* in the test data lies within it, the point is considered to be correctly estimated. Since the interval is computed for every observation in P_*, the number of test points inside it can be determined.

The MAE and CI measures might not be sufficient to fully evaluate the performance of the models. Therefore, a visual comparison is also performed, in which each model's predictions on the test data are inspected separately.
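A short numpy sketch of the two evaluation measures, assuming arrays of test targets, posterior means and posterior variances:

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error, equation (3.14)."""
    return float(np.mean(np.abs(y_pred - y_true)))

def ci_coverage(y_true: np.ndarray, post_mean: np.ndarray, post_var: np.ndarray) -> float:
    """Share of test points falling inside the 95 percent credible interval."""
    half_width = 1.96 * np.sqrt(post_var)            # 1.96 * sqrt(cov(f*)_ii)
    inside = (y_true >= post_mean - half_width) & (y_true <= post_mean + half_width)
    return float(inside.mean())
```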

3.2 Optimization procedure

The next step is to obtain optimal or near optimal values of z such that

\[
z_{opt} = \arg\min_{z^* \in p^*} \omega(p^*) \tag{3.15}
\]

where z_opt denotes the optimal or near optimal values, and the objective function ω(p*) computes the estimated water content for the test point p* = [z* x*]^⊤. Note that z* and x* are vectors belonging to the variable groups z and x, respectively, as illustrated in Table (2.1). Also recall that p* is a D-dimensional vector and is an observation from P_*, hence a point from the test data. As seen in equation (3.15), the interest is to tune the values of z* such that ω is minimized. The values of x* are kept fixed throughout the optimization process.



They consequently act as an "anchor to reality": intuitively, one can think of an operator working with the separator, where a display shows the proposed values z_opt together with the corresponding x*, and the operator can thereafter adjust the process parameters accordingly.

3.2.1 Basin-hopping

The proposed method for tuning the values of z is the Basin-hopping optimizer. The application of the Basin-hopping optimizer uses the estimated GP-DKL model, where the objective function ω in equation (3.15) is given by the GP-DKL prediction, f(p*) → ω(p*).

Basin-hopping is a global optimization algorithm which uses a local optimizer to minimize or maximize the objective function f(p*) and runs for n = 1, 2, ..., N iterations. The intuition of the Basin-hopping algorithm is that the values in z* are randomly perturbed and then used as an initial point in the local optimizer, giving the possibility of reaching a new local minimum at every iteration. If one were to run Basin-hopping "enough" times, one would find the set of values z_opt corresponding to the global minimum [13, 25].

The local minimization optimizer used is the Limited-memory Broyden-Fletcher-Goldfarb-Shanno with Bound constraints (L-BFGS-B) optimizer. This algorithm operates on an approximation of the inverse Hessian matrix to steer the search in the right direction. In the original BFGS method, this approximation is a dense D_z × D_z matrix, where D_z is the dimension of z*. The limited-memory variant (L-BFGS) only stores a few vectors that represent the inverse Hessian matrix [17]. The algorithm may also be bounded (L-BFGS-B) by constraints that prevent the search from considering values of z* outside a pre-defined interval a ≤ z* ≤ b [4]. The purpose of the bounds a and b is to prevent the algorithm from finding "illegal" values; for instance, an "illegal" value could be a negative Temperature. The values of a and b are set to the minimum and maximum values of the variables in the group z, with the motivation that since these values have been recorded before, they can be recorded again. An exception is made for the Pressure out variable, since the minimum of this variable is zero. The scaled version of the Pressure out variable is visualized in Figure (A.3) in appendix (A); a few values tend towards -4.5, which corresponds to an un-scaled value of zero or close to zero, and the lower bound a for this variable is marked in the Figure.

As mentioned, before entering the L-BFGS-B algorithm, a random perturbation of z* is performed by the Basin-hopping optimizer, such that the perturbed vector becomes the initial D_z-dimensional vector z̃_0. As explained by Byrd et al. [4], the L-BFGS-B algorithm takes the vector z̃_0 and iterates for k_max iterations to obtain the estimate z̃_n^global, which corresponds to a potential global minimum, such that z̃_{k_max} = z̃_n^global. The estimates are updated as

\[
\tilde{z}_{k+1} = \tilde{z}_k + \eta \cdot \nabla_k \tag{3.16}
\]

where k = 1, 2, ..., k_max, η is the step length or learning rate and ∇_k is the descent direction at iteration k. This direction "pushes" the values in z̃_k towards a better estimate z̃_{k+1} that minimizes the objective function. As demonstrated in [17, p. 774-776], ∇_k is based on the approximated inverse Hessian matrix.

To prevent the values in z̃_k from taking "illegal" values, the bounds are incorporated as

\[
\tilde{z}_k =
\begin{cases}
a & \text{if } \tilde{z}_k < a \\
\tilde{z}_k & \text{if } a \le \tilde{z}_k \le b \\
b & \text{if } \tilde{z}_k > b
\end{cases}
\]

As mentioned, for every iteration of the Basin-hopping algorithm the values in z* are perturbed and then entered as an initial point of the L-BFGS-B algorithm. Each iteration yields a potential set of values z̃_n^global, corresponding to a potential global minimum. This set is accepted or rejected based on the previously obtained z̃_{n-1}^global and its corresponding objective function value. At the last iteration of the Basin-hopping algorithm, the z̃_N^global with the lowest objective function value is chosen by the algorithm as z_opt [13, 25].
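A sketch of this procedure with scipy's basinhopping is given below; the trained model, likelihood, the fixed covariates x_fixed, the initial point z_init and the training matrix z_train (used for the bounds) are assumed to exist, and the column order [z x] in p* is an assumption.

```python
import numpy as np
import torch
import gpytorch
from scipy.optimize import basinhopping

def omega(z, x_fixed, model, likelihood):
    """Objective: GP-DKL posterior mean of Water in Output for p* = [z* x*]."""
    p = torch.as_tensor(np.concatenate([z, x_fixed]), dtype=torch.float32).unsqueeze(0)
    model.eval(); likelihood.eval()
    with torch.no_grad(), gpytorch.settings.fast_pred_var():
        return likelihood(model(p)).mean.item()

# bounds a <= z* <= b: observed min/max of each (scaled) z-variable
bounds = list(zip(z_train.min(axis=0), z_train.max(axis=0)))

result = basinhopping(
    lambda z: omega(z, x_fixed, model, likelihood),
    x0=z_init,                                        # z* from a sampled test point
    niter=100,                                        # 100 Basin-hopping iterations
    minimizer_kwargs={"method": "L-BFGS-B", "bounds": bounds},
)
z_opt = result.x
```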

3.2.2 k-Nearest Neighbour

It might not be necessary to use an advanced gradient-based algorithm to find the optimal or near optimal values z_opt. Therefore, a simpler approach is also considered: the k-Nearest Neighbour (k-NN) algorithm. The k-NN is applied to equation (3.15), such that kNN(p*) → ω(p*), analogously to the previous section. The difference from the Basin-hopping optimizer is that the k-NN takes an initial point p* as input and simply considers an arbitrarily chosen integer k as the number of closest points in Euclidean space [33]. It achieves this by computing

\[
d(p^*_i, p^*_j) = \sqrt{\lVert p^*_i - p^*_j \rVert^2} \tag{3.17}
\]

which is the Euclidean distance between the initial observation p*_i and an observation p*_j in the test data. The k observations with the smallest distances are obtained [33]. Among these observations, it is investigated which one has the lowest Water in Output. This point is denoted p_opt, and from this point the values z_opt are obtained, where z_opt ⊂ p_opt.

The considered values of k are 3, 5 and 10. If the value of k is low, it is assumed that the values x* in p* and x_opt in p_opt are similar enough, such that x* ⇔ x_opt.



This assumption is expected to hold for low values of k. However, if a too large value of k is chosen, then x* ⇎ x_opt, as a result of the Euclidean distance becoming too large.
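A sketch of the k-NN alternative using sklearn is shown below; the array layout and the index list z_cols marking the z-columns are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_optimal_settings(p_init: np.ndarray, P_test: np.ndarray,
                         y_test: np.ndarray, z_cols: list, k: int = 5) -> np.ndarray:
    """Among the k test points closest to p_init (Euclidean distance, eq. 3.17),
    return the z-values of the point with the lowest Water in Output."""
    nn = NearestNeighbors(n_neighbors=k).fit(P_test)
    _, idx = nn.kneighbors(p_init.reshape(1, -1))
    neighbours = idx[0]
    best = neighbours[np.argmin(y_test[neighbours])]   # p_opt: lowest water content
    return P_test[best, z_cols]                        # z_opt is a subset of p_opt

# Example (assumed names): z_opt = knn_optimal_settings(p_q, P_test, y_test, [0, 1, 2], k=5)
```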

Evaluation

It is interesting to investigate whether Basin-hopping performs significantly better than the k-NN at finding z_opt. To conduct such an investigation, the following steps are iterated for q = 1, 2, ..., Q:

1. Sample an observation p*_q = [z* x*]^⊤ from the test data, which works as an initialization point for both the Basin-hopping and the k-NN algorithm.

2. The Basin-hopping and the k-NN provide the optimal values z*_BH and z*_kNN, respectively.

3. Merge the values obtained in step 2 with x* to obtain the vectors p*_BH = [z*_BH x*]^⊤ and p*_kNN = [z*_kNN x*]^⊤.

4. These vectors are stacked into a 2 × D matrix P* and passed through the GP-DKL to obtain a posterior mean vector [ŷ*_BH ŷ*_kNN]^⊤ = µ_{P*} and a 2 × 2 covariance matrix K_{P*}, hence a bivariate Gaussian distribution.

5. From this distribution, a random sample of n draws is taken, {ŷ^S_1, ..., ŷ^S_n} = Ŷ^S, where Ŷ^S ∼ MN(µ_{P*}, K_{P*}), thus forming an n × 2 matrix.

6. Count how often the Basin-hopping algorithm is better than the k-NN, such that

\[
p\big(\hat{y}^*_{BH} < \hat{y}^*_{kNN}\big) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\{\hat{Y}^S_{j1} < \hat{Y}^S_{j2}\} = \pi_q
\]

The sampling from the test data in step 1 only considers observations that are not part of an ongoing event. Recall that an event is related to the water content, as visualized in Figure (2.1), and that an event needs to happen eventually in order to dispose of accumulated impurities and water [3]. It is therefore not interesting to sample a point p*_q from within an event. Consequently, using these sampled points as initial points results in optimized points that correspond to periods when an event is not occurring.

The n samples obtained in step 5 form a numerical representation of the posterior bivariate normal probability density function. According to the law of large numbers, if one were to draw a very large or infinite number of samples, n → ∞, the estimated mean vector of the n samples would equal the "true" mean µ_{P*}; the same applies to the estimated sample covariance or kernel matrix [27]. Hence, with a large n it is possible to compute a sufficient approximation of the estimated probability p(ŷ*_BH < ŷ*_kNN) in step 6. In this thesis, a large n is taken to be 10 000.
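Steps 4 to 6 can be sketched as follows, assuming the trained GP-DKL model and likelihood; the helper is written generically as p(ŷ_a < ŷ_b) so that it can be reused in section 3.3.

```python
import numpy as np
import torch
import gpytorch

def prob_first_lower(p_a, p_b, model, likelihood, n=10_000, seed=0):
    """Estimate p(yhat_a < yhat_b) from n draws of the joint bivariate posterior."""
    P = torch.as_tensor(np.vstack([p_a, p_b]), dtype=torch.float32)   # 2 x D matrix
    model.eval(); likelihood.eval()
    with torch.no_grad(), gpytorch.settings.fast_pred_var():
        post = likelihood(model(P))                  # bivariate Gaussian posterior
        mu = post.mean.numpy()
        cov = post.covariance_matrix.numpy()
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mu, cov, size=n)                # n x 2 matrix
    return float(np.mean(samples[:, 0] < samples[:, 1]))

# pi_q for one initial point: pi_q = prob_first_lower(p_bh, p_knn, model, likelihood)
```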



Steps 1 to 6 are performed for Q iterations, and for each iteration the estimated probability π_q is saved. If π_q > 0.5, the Basin-hopping optimizer finds estimates of z_opt that generally produce a lower water content for iteration q.

3.3 Investigation of process settings

When the optimal or near optimal values z_opt are found, they are inserted into the separator. To change the process settings, the separator needs to be restarted. Thereafter, the separator is expected to produce a certain, or lower, amount of water content. It is therefore interesting to investigate whether the separator produces a higher than expected amount of water content in the output with the settings from z_opt.

The parameters of a session (se) of the separator are denoted p^se_0 and may intuitively be thought of as the first time step of the session. All consecutive observations p^se_{0+i} = [z^se_{0+i} x^se_{0+i}]^⊤, where i = 1, 2, ..., T_se, are individually merged with p^se_0, thus forming the 2 × D matrix

\[
P_i =
\begin{bmatrix}
p^{se}_0 \\
p^{se}_{0+i}
\end{bmatrix}
\]

which is passed through the GP-DKL, producing a bivariate normal distribution. This merge is done for the whole session, resulting in T_se bivariate Gaussian distributions. T_se denotes the maximum time of a session of the separator, i.e. until a restart of the separator.

The observation p^se_{0+i} may produce a significantly higher estimated posterior value ŷ^se_{0+i} than the estimated expected posterior value ŷ^se_0, such that p(ŷ^se_0 < ŷ^se_{0+i}) ≥ 0.95. The estimation of this probability is similar to step 5 of the Evaluation in section (3.2), where n samples are drawn as {ŷ^se_1, ..., ŷ^se_n}_i = Y^se_i ∼ MN(µ^se_i, K^se_i), forming an n × 2 matrix with mean vector µ^se_i = [ŷ_0 ŷ_{0+i}] and covariance (kernel) matrix K^se_i. The probability is then estimated as

\[
p\big(\hat{y}^{se}_0 < \hat{y}^{se}_{0+i}\big) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\{Y^{se}_{j1} < Y^{se}_{j2}\} = \pi^{se}_i
\]

If π^se_i ≥ 0.95, it can be concluded that for the observation p^se_{0+i}, at least 95 percent of the samples in the first dimension of Y^se_i are lower than those in the second dimension. The value 0.95 is the one typically used in statistical testing; however, it may be changed depending on the desired outcome of the test [11].

Since the second dimension is related to the observation p^se_{0+i}, it can be concluded that this point contributes significantly to a higher than expected water content. Recall that the observation p^se_{0+i} is a result of the parameter settings z_opt, where [z_opt x*]^⊤ ≈ [z^se_0 x^se_0]^⊤; thus the parameter settings z_opt do not produce a lower water content for point 0 + i.



Evaluation

Within the time frame of this thesis it is not possible to obtain optimal settings, insert them into the separator, and observe the outcome. Consequently, the evaluation is conducted on a session from the test data.

It is assumed that when the new process parameters are inserted into the separator, the closest consecutive observations after a restart will reflect the values of these process parameters and their outcome, such that [z_opt x*]^⊤ ≈ [z^se_0 x^se_0]^⊤ = p^se_0, where z^se_0 and x^se_0 make up the first observation of a session (se) and p^se_0 is a D-dimensional vector. However, taking only the first observation might give a poor representation of the process parameters due to possible errors or faulty measurements in the data. Furthermore, Figures (2.1a) and (2.2) show that, for a session, the period from the start of the session to the first event generally exhibits a decreasing pattern of water content. Therefore, this period is discarded, and to reduce the influence of possible errors or faulty measurements, the first five observations thereafter are averaged to represent p^se_0.

The first five points after the first event of a session are thus averaged and assumed to act as the process parameter settings of the session. As previously described, these values are merged with each p^se_{0+i} to form the 2 × D matrix P_i. The posterior mean and covariance are estimated and used to estimate the probability p(ŷ^se_0 < ŷ^se_{0+i}) = π^se_i. For every observation for which π^se_i ≥ 0.95 holds, an "error" is recorded, such that

\[
e_i =
\begin{cases}
1 & \text{if } \pi^{se}_i \ge 0.95 \\
0 & \text{otherwise}
\end{cases}
\]

The average of these "errors" denotes the proportion of observations whose estimated water content is significantly higher than the estimated expected water content.
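Reusing the prob_first_lower helper sketched in section (3.2), the error proportion of a session can be computed as follows; the variable names are assumptions.

```python
import numpy as np

def session_error_rate(p0_se, session_points, model, likelihood, threshold=0.95):
    """Proportion of observations p_{0+i}^se whose estimated water content is
    significantly higher than that of the averaged session settings p_0^se."""
    errors = [1 if prob_first_lower(p0_se, p_i, model, likelihood) >= threshold else 0
              for p_i in session_points]
    return float(np.mean(errors))
```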


4 Results

The results presented in this chapter are based on the methods presented in chapter 3 and the data presented in chapter 2. First, the results of the GP-DKL models, estimated on training data and evaluated on test data, are presented. Second, the results from the optimization procedure are obtained, using the most suitable GP-DKL model from the first step. Third, the process parameters at the start of a session are tested, again using the most suitable GP-DKL model from the first step.

4.1 Performance of models

As mentioned in the earlier chapters, the first step is to model the relationship between y and P. Three separate GP-DKL models are fitted to the training data with the RBF, Periodic and SM kernels, respectively. The GP-DKL models are estimated using the library gpytorch [8], which uses the pytorch backend. All three models are trained with validation data to monitor convergence and the general performance of the models for each epoch. The resulting convergence is presented in Figure (A.4) in appendix (A). The resulting models are used to predict the test data and obtain the results.

Table 4.1: Resulting performance of the models on the test data

Method   RBF       Periodic   SM
95% CI   0.97924   0.97722    0.97924
MAE      0.48839   0.45129    0.43243

The CI and MAE results of the models are presented in Table (4.1). According to the 95% CI, the RBF and SM kernels capture the same proportion of the test data within the interval. Looking at the MAE, it is observed that the SM kernel provides the lowest MAE value. A conclusion is that the SM kernel is the most suitable kernel for this data. To further investigate the model performances, the predictions of each kernel are visualized.



Figure 4.1: Prediction on test data using DKL with RBF kernel

Figures (4.1), (4.2) and (4.3) illustrate the predictions of the RBF, Periodic and SM kernels, respectively. It is clear that the Periodic and SM kernels provide a preferable generalization when predicting the test data, while the RBF kernel provides an acceptable generalization. For each kernel, the model has a tendency to predict an increase in water content when an event occurs. The major differences between the kernels appear in the time periods [250-500] and [1500-1750]: the RBF and Periodic kernels both have difficulty providing correct predictions in these periods, although the Periodic kernel seems to provide the better estimate. Adding the SM kernel to the comparison, it can be seen that it provides the best prediction of the test data.

Considering the results presented in this section, it is concluded that the most suitable kernel for this problem is the SM kernel, given the hyperparameter recommendations in section (3.1.1). This kernel is hereinafter used to obtain the results presented in sections (4.2) and (4.3).



Figure 4.2: Prediction on test data using DKL with Periodic kernel



4.2 Optimal process settings

As mentioned in section (3.2), the proposed algorithm for obtaining optimal or near optimal process settings is the Basin-hopping optimizer, using the basinhopping function from [13]. As explained in section (3.2.2), a sophisticated optimizer such as Basin-hopping might not be necessary to obtain optimal or near optimal process settings. Therefore, the simpler k-NN algorithm is also used, with the implementation from the sklearn library [18].

Performing the steps in the Evaluation part of section (3.2) 200 times results in 200 different sampled test data points from P∗, which serve as initial points in both algorithms. Using 200 different initial points thus produces 200 values of π. For each initial point, the Basin-hopping optimizer performs 100 iterations.
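The k-NN baseline is defined in section (3.2.2); one plausible variant, sketched below under the assumption that the k nearest training settings to the initial point are retrieved and the one with the lowest observed Water in Output is returned, illustrates why it can only propose settings that already exist in the data. The arrays P_train and y_train and this selection rule are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_candidate(p_init, P_train, y_train, k=5):
    # Find the k training settings closest to the initial point and
    # return the one with the lowest observed Water in Output.
    nn = NearestNeighbors(n_neighbors=k).fit(P_train)
    _, idx = nn.kneighbors(p_init.reshape(1, -1))
    best = idx[0][np.argmin(y_train[idx[0]])]
    return P_train[best], y_train[best]
```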

Figure 4.4: π from 200 different initial values

Figure (4.4) illustrates that for the majority of the initial data points (for all k = 3, 5 and 10), the Basin-hopping algorithm finds settings zopt which minimize the water content better than the k-NN.


Figure 4.5: The minimized Water in Output from optimization

Figure (4.5) illustrates the initial values of Water in Output and the values the optimization algorithms manage to obtain. When investigating the Figure further, it is noted that the k-NN algorithm, with k = 3, 5 and 10, is only able to find values of Water in Output that are in a similar range as the initial values. This is reasonable considering that the k-NN only considers data points that exist in the data. The Basin-hopping optimizer, on the other hand, is shown to be able to find values that lower the water content further.

From the results in Figures (4.4) and (4.5) it can be concluded that the Basin-hopping optimizer is able to find zopt which minimizes the Water in Output better than the zopt found by the k-NN. This is possible since the Basin-hopping algorithm is a global optimizer with a larger search space, and it therefore has the possibility of finding values that do not exist in the data. However, because of this larger search space, the Basin-hopping optimizer might not obtain a realistic combination of settings for zopt. For instance, suppose there exists a rule for the separator stating that "when the temperature is 95 degrees, the pressure can not physically be larger than 1.9". The current implementation of the objective function in the Basin-hopping optimizer does not have such rules incorporated; it only relies on the GP-DKL model, which has "learned" the relationship between y and P. Hence, the optimizer may find values for zopt that defy this rule, such as a temperature of 95 combined with a pressure higher than 1.9. With this in mind, the values in the output zopt can be investigated individually.
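One possible way of incorporating such physical rules, not implemented in this thesis, would be to add a penalty term to the objective. The sketch below is a hedged illustration using the hypothetical temperature/pressure rule above; the indices, threshold values and penalty weight are assumptions.

```python
def constrained_objective(z, objective, temp_idx=0, press_idx=1, penalty=1e3):
    # objective: e.g. the predicted Water in Output from the GP-DKL model.
    # Hypothetical rule: at a temperature of 95 degrees the pressure cannot
    # physically exceed 1.9; violations are penalized so the optimizer is
    # pushed back towards the feasible region.
    value = objective(z)
    if z[temp_idx] >= 95.0 and z[press_idx] > 1.9:
        value += penalty * (z[press_idx] - 1.9)
    return value
```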

Figure 4.6: Simulation of the Basin-hopping optimizer

The 200 results of the Basin-hopping optimization are illustrated in Figure (4.6). These values are obtained given the trained model and the boundaries in the L-BFGS-B algorithm. It can be seen that the majority of the initial values representing the Temperature process parameter are sampled around its mean and result in higher temperature values after optimization. This indicates that the water content is lower when the temperature is higher. For the values representing the Pressure out process parameter, it can be seen that the optimal values correspond to either low or high pressure. It can also be noted that if the initial values are high, the resulting optimal values are low, and if the initial values are low, the resulting optimal values are high. Hence, the water content is lower when the pressure is either low or high. For the values representing the Time elapsed between events in the separator, it can be seen that to achieve minimal water content, the time between events generally needs to increase.


4.3 Investigating process settings

Figure 4.7: Testing the process parameters of a session

When the separator restarts, the optimal process settings can be installed in the separator, and the expected result is a low amount of water content. All consecutive data points p^{se}_{0+i} are jointly tested with p^{se}_{0}, according to the Evaluation part of section (3.3).

Figure (4.7) displays an example of a session that starts as expected, but where the water content increases at approximately time point 100. The lower panel of Figure (4.7) displays the settings for this session and the actual values of Water in Output. The upper panel displays the proportion π^{se}_{i} of the sampled posterior distribution that is larger than the sampled posterior distribution of the initial point p^{se}_{0} (the process settings). Hence, if π^{se}_{i} ≈ 0.5, then the distributions of the points p^{se}_{0} and p^{se}_{0+i} are similar. From this Figure it is seen that all π^{se}_{1}, ..., π^{se}_{T} < 0.95. This can be interpreted as the model not capturing any significantly higher values of Water in Output; hence the process parameter settings provide the expected behaviour in this session.
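A minimal sketch of how such a proportion can be computed from the GP-DKL posterior is shown below, assuming p0 and p_i are tensors holding the initial point and the i-th data point of the session, and that n_samples joint posterior draws are used; the exact sampling scheme is described in section (3.3).

```python
import torch
import gpytorch

def pi_se(model, likelihood, p0, p_i, n_samples=5000):
    # Sample the posterior predictive jointly at the initial point and at point i,
    # and return the proportion of draws where point i exceeds the initial point.
    model.eval(); likelihood.eval()
    with torch.no_grad(), gpytorch.settings.fast_pred_var():
        post = likelihood(model(torch.stack([p0, p_i])))
        samples = post.sample(torch.Size([n_samples]))  # shape: (n_samples, 2)
    return (samples[:, 1] > samples[:, 0]).float().mean().item()
```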

The values that are closer to the limit of 0.95 occur while an event is in progress. The reason for this can be seen in Figure (4.3), where it is concluded that the GP-DKL model with the SM kernel estimates the water content to be higher during an event. Therefore, all the peaks seen in the upper panel of Figure (4.7) represent an event from the Event variable. As already discussed in section (2.2.1), the Event has an effect on the Water in Output. It is therefore reasonable to conclude that if π^{se}_{i} ≥ 0.95 when an event occurs, the time


5 Discussion

In this chapter, the contents of this thesis are discussed. The chapter is divided into three sections, where each section discusses the results, issues and possible improvements of the methods.

5.1 Models

The Bayesian properties of the GP made it possible to perform tests on single observations. These tests are used to decide which optimization algorithm is needed and whether the process parameter settings of a session produce expected values of Water in Output. Moreover, since a posterior mean and variance of a normal distribution are estimated for every data point, they can be used to draw random samples from that distribution. Using a GP is therefore an ideal choice, and with the DKL extension it is possible to process larger quantities of data while still maintaining prediction accuracy. The DKL extension allows the usage of NNs and all the benefits that come with them, where the main benefit is the feature extraction prior to the kernel. The joint learning in the Adam optimizer allows the parameters W in the network and the hyperparameters θ in the kernel function to adjust to one another.

A problem that arises when using NNs is that there is no general architecture for a specific problem. In this thesis, the architecture provided by Wilson et al. [31] is used, which includes four hidden NN layers with the ReLU activation function. The ReLU function is a popular choice for deep networks and has been shown to work well in most cases [10], hence its use in this thesis. However, the appropriate number of hidden layers may be different for the data in this thesis, since it is very different from the data used in [31]. Another important aspect of the NN is the learning rate η it is trained with. As mentioned, the values 0.1, 0.01 and 0.001 are used in this thesis. These fixed learning rates might be too restrictive for the Adam optimizer. Therefore, an adaptive learning rate could be used, based on gradient descent [15].

As mentioned, the NN is primarily seen as a feature extractor prior to the kernel; therefore, it might be interesting to use an output dimension higher than 2 in the NN. Wilson

References
