Efficient Sampling of Gaussian Processes under Linear Inequality Constraints


Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Statistics and Machine Learning

2021 | LIU-IDA/STAT-A–21/020–SE

Efficient Sampling of Gaussian Processes under Linear Inequality Constraints

Bayu Beta Brahmantio

Supervisor: Johan Alenlöv
Examiner: Fredrik Lindsten


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

In this thesis, newer Markov Chain Monte Carlo (MCMC) algorithms are implemented and compared in terms of their efficiency in the context of sampling from Gaussian processes under linear inequality constraints. Extending a Gaussian process framework that uses a Gibbs sampler, two MCMC algorithms, Exact Hamiltonian Monte Carlo (HMC) and Analytic Elliptical Slice Sampling (ESS), are used to sample from the truncated multivariate Gaussian distributions that arise in Gaussian process regression models with linear inequality constraints.

In terms of generating samples from Gaussian processes under linear inequality constraints, the proposed methods generally produce samples that are less correlated than samples from the Gibbs sampler. Time-wise, Analytic ESS turns out to be the faster choice, while Exact HMC produces the least correlated samples.


Acknowledgments

I would like to thank my supervisor, Johan Alenlöv, for his guidance and supervision, which significantly helped me at every phase of this thesis.

I would also like to express my gratitude to Fredrik Lindsten. As an examiner, he provided insightful suggestions and feedback for this thesis. As a former research project supervisor, I am grateful that he introduced me to this topic and gave me ideas for this thesis.

Lastly, I would like to thank my family and friends for their endless support during the whole study period.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
   1.1 Motivation
   1.2 Related Works
   1.3 Aim
2 Theory
   2.1 Truncated Multivariate Gaussian Distribution
   2.2 Exact Hamiltonian Monte Carlo for Truncated Multivariate Gaussians
   2.3 Analytic Elliptical Slice Sampling
   2.4 Gibbs Sampling for Truncated Multivariate Gaussians
   2.5 Gaussian Process
   2.6 Autocorrelation and Effective Sample Size
3 Method
   3.1 Methods and Hyperparameters
   3.2 Experiments
4 Results
   4.1 Truncated Multivariate Gaussian Distributions
   4.2 Gaussian Process under Linear Inequality Constraints
5 Discussion
   5.1 Results
   5.2 Method
   5.3 Ethical Considerations
6 Conclusion
   6.1 Future Work
Bibliography


List of Figures

3.1 Plot of Σ₁₀, the covariance matrix used in Experiment 4.
3.2 Training points for GP.
4.1 Plots of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [−1, −1]ᵀ, and b = [1, 1]ᵀ.
4.2 ACF values of the first 10 lags from 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler in each variable of the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [−1, −1]ᵀ, and b = [1, 1]ᵀ.
4.3 Plots of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [−8, −0.2]ᵀ, and b = [8, 0.2]ᵀ.
4.4 ACF values of the first 10 lags from 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler in each variable of the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [−8, −0.2]ᵀ, and b = [8, 0.2]ᵀ.
4.5 ACF values of the first 10 lags from 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler in each variable of the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [0.5, 0.5]ᵀ, and b = [8, 8]ᵀ.
4.6 Plots of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [0.5, 0.5]ᵀ, and b = [8, 8]ᵀ.
4.7 Plots of the first three lags ACF values of the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = 0₁₀, a = −1₁₀, b = 1₁₀, and Σ as in Figure 3.1.
4.8 Sample paths of unconstrained GP.
4.9 Sample paths of GP under boundedness constraint.
4.10 The first three lags ACF values from samples of GPs under boundedness constraint.
4.11 Sample paths of GP under monotonicity constraint.
4.12 The first three lags ACF values from samples of GPs under monotonicity constraint.
5.1 Illustration of an ellipse E in a single iteration of Analytic ESS for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [0.5, 0.5]ᵀ, and b = [8, 8]ᵀ. The orange dashed ellipse depicts E given an initial state x and auxiliary variable ν, while the blue ellipses represent the contour of the prior.


List of Tables

4.1 Estimated moments, number of effective samples, and number of effective samples per second of CPU runtime of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [−1, −1]ᵀ, and b = [1, 1]ᵀ.
4.2 Estimated moments, number of effective samples, and number of effective samples per second of CPU runtime of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [−8, −0.2]ᵀ, and b = [8, 0.2]ᵀ.
4.3 Estimated moments, number of effective samples, and number of effective samples per second of CPU runtime of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [0.5, 0.5]ᵀ, and b = [8, 8]ᵀ.
4.4 Number of effective samples and number of effective samples per second of CPU runtime of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = 0₁₀, a = −1₁₀, b = 1₁₀, and Σ as in Figure 3.1.
4.5 CPU runtime (seconds) of the Exact HMC sampler, Analytic ESS sampler with J = 5, and Gibbs sampler for the GP given the training data and the boundedness constraint between a = 0 and b = 1.
4.6 CPU runtime (seconds) of the Exact HMC sampler, Analytic ESS sampler with J = 5, and Gibbs sampler for the GP given the training data and the monotonicity constraint df(x)/dx ≥ 0.


1 Introduction

1.1 Motivation

Gaussian processes have become a popular choice for modelling various complex systems. One of the reasons is that they allow us to model or simulate the underlying function that generates the observed data. This is especially desirable in situations where only a limited amount of data is available because running experiments is expensive. The use of Gaussian distributions also makes it possible to build tractable models that inherit the nice properties of Gaussian random variables. Besides that, working in a Bayesian framework lets us build a model that incorporates uncertainty quantification into its predictions.

One of the many applications of a Gaussian process as a supervised learning model is regression. The aim of Gaussian process regression is to create a model that predicts function values given a set of training data. By working directly in the function space, we avoid having to fix the number of parameters and only need to tune the hyperparameters of the mean and covariance functions.

In some cases, the domain and range of the approximated function are limited to certain values. This arises naturally for many real-world quantities, e.g., height, speed, and weight. Aside from that, many physical phenomena can be described by limiting their first and second derivatives to a certain range. Therefore, we want to create a predictor that can accommodate such restrictions and ultimately make better predictions under them. In other words, our Gaussian process regression model needs to incorporate this information about the constrained function values to produce better and more realistic predictions.

In most cases, sampling from a truncated multivariate Gaussian distribution is needed to sample from a constrained Gaussian process. However, sampling from a high-dimensional truncated Gaussian distribution is not a straightforward task. Exact methods are possible, but as the number of dimensions grows they suffer from increasing rejection rates. To alleviate this, Markov Chain Monte Carlo (MCMC) sampling is preferred. Ideally, we would draw independent samples from our MCMC sampler. Since this is usually not realistic, it is desirable to pick a sampler that draws less correlated samples from our Gaussian process model, in order to produce accurate estimates of the underlying function. At the same time, we must also consider the runtime of the sampler.


1.2 Related Works

Many studies have been done on incorporating constraints into Gaussian process regression models. In Jidling, Wahlström, Wills, and Schön (2017), the covariance function is transformed so that the Gaussian process satisfies linear equality constraints. Maatouk (2017) and López-Lopera, Bachoc, Durrande, and Roustant (2018) used a similar framework, approximating the Gaussian process by a finite-dimensional approximation that can incorporate inequality constraints. In Agrell (2019) and Da Veiga and Marrel (2020), the inequality constraints are imposed at a finite set of points called virtual observation locations. Using this framework, the constraints only need to hold at a finite set of points in order to sample from the constrained Gaussian processes.

A large number of methods to generate MCMC samples from a truncated Gaussian distribution have been studied. Methods that use Gibbs sampling (Gelfand, Smith, and Lee 1992, Kotecha and Djuric 1999, and Rodríguez-Yam, Davis, and Scharf 2004) are widely used because of their simplicity. Recent methods have shown that it is possible to incorporate more advanced mechanisms to generate samples. In Pakman and Paninski (2014), a Hamiltonian system is constructed to simulate truncated Gaussian distributions. Fagan, Bhandari, and Cunningham (2016) showed that it is possible to adapt Elliptical Slice Sampling (ESS) (Murray, Adams, and MacKay 2010) to sample from truncated Gaussian distributions.

1.3 Aim

López-Lopera, Bachoc, Durrande, and Roustant (2018) implemented several MCMC methods in their framework and found that an HMC-based sampler works best, while Agrell (2019) used a Gibbs sampler and an exact method by Botev (2017). Hence, this thesis project focuses mainly on evaluating MCMC methods within the framework introduced by Agrell (2019) for incorporating linear inequality constraints into Gaussian process regression models.

Specifically, the main aim of this thesis project is to evaluate the performance of the Exact HMC algorithm (Pakman and Paninski 2014) and Analytic ESS (Fagan, Bhandari, and Cunningham 2016) for sampling from truncated multivariate Gaussian distributions in the context of sampling from Gaussian process regression models under linear inequality constraints. These sampling methods will be compared to the Gibbs sampling method (Kotecha and Djuric 1999) as a baseline. Finally, this thesis project aims to answer the following research questions:

1. Can Exact HMC and Analytic ESS be implemented for Gaussian processes under linear inequality constraints?

2. How does each method perform against the others and against the baseline method?


2 Theory

2.1 Truncated Multivariate Gaussian Distribution

The truncated multivariate Gaussian (TMG) distribution is a probability distribution obtained from a multivariate Gaussian random variable by bounding it under some linear inequality constraints (Li and Ghosh 2015). Let w ∈ ℝᵈ be a random vector from a d-dimensional Gaussian random variable W with mean µ ∈ ℝᵈ and covariance matrix Σ ∈ ℝᵈˣᵈ. The probability density function (pdf) of W is

$$p_W(w) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(w-\mu)^\top\Sigma^{-1}(w-\mu)\right). \tag{2.1.1}$$

The corresponding TMG distribution X, obtained by bounding W between a and b, has density

$$p_X(x) = \frac{p_W(x)}{\int_a^b p_W(x)\,dx}\,\mathbb{1}(a \le x \le b) = \frac{\exp\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right)}{\int_a^b \exp\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right)dx}\,\mathbb{1}(a \le x \le b), \tag{2.1.2}$$

where a, b ∈ ℝᵈ. The notation $\int_a^b$ denotes the multiple integral $\int_{a_1}^{b_1}\cdots\int_{a_d}^{b_d}$ and 1(·) is an indicator function. From now on, p_X(x) is referred to simply as p(x). We also write X ∼ TN(µ, Σ, a, b) to denote that the random variable X has a TMG distribution with parameters µ and Σ, bounded between a and b.

We can always rewrite constraints given in terms of a and b in the form of F and g. Using this, we get the more general pdf

$$p(x) = \frac{\exp\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right)}{\int_{Fx+g\ge 0}\exp\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right)dx}\,\mathbb{1}(Fx+g \ge 0), \tag{2.1.3}$$

where F is an m × d matrix which, together with the m × 1 vector g, defines all m inequality constraints of p(x). The integral $\int_{Fx+g\ge 0}$ denotes the multiple integral over the region where the condition Fx + g ≥ 0 holds. We now write this as X ∼ TN(µ, Σ, F, g); that is, X has a TMG distribution with parameters µ and Σ, bounded under the constraint matrices F and g.

Moreover, using another parametrization of the Gaussian density (Schön and Lindsten 2011), the TMG pdf (2.1.3) can be rewritten as

$$p(x) = \frac{1}{Z}\exp\left(-\frac{1}{2}x^\top\Lambda x + \nu^\top x\right)\mathbb{1}(Fx+g \ge 0), \tag{2.1.4}$$

where Z is a normalizing constant, Λ = Σ⁻¹, and ν = Σ⁻¹µ. Using linear transformations (Hossain 2014, Pakman and Paninski 2014), (2.1.4) can be transformed into

$$p(x) \propto \exp\left(-\frac{1}{2}x^\top x\right)\mathbb{1}(Fx+g \ge 0), \tag{2.1.5}$$

so that it is distributed as TN(0, I_d, F, g) for suitably transformed matrices F and g. This transformation lets us work in a whitened frame, where the data are generated from a standard Gaussian distribution. To transform back to the original distribution, we apply the reverse transformation.
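As a concrete illustration of this whitening step, the following minimal sketch (an assumption of how the transformation could be implemented, not the thesis code; the helper name whiten_tmg and the use of a Cholesky factor are choices made here) maps a TMG problem stated in terms of µ, Σ, F, and g to the whitened frame and provides the reverse map:

import numpy as np

def whiten_tmg(mu, Sigma, F, g):
    """Map X ~ TN(mu, Sigma, F, g) to the whitened frame Z ~ TN(0, I, F_w, g_w).

    Writing X = mu + L Z with Sigma = L L^T (Cholesky), the constraint
    F X + g >= 0 becomes (F L) Z + (F mu + g) >= 0.
    """
    L = np.linalg.cholesky(Sigma)
    F_w = F @ L
    g_w = F @ mu + g
    unwhiten = lambda z: mu + L @ z  # reverse transformation back to the original frame
    return F_w, g_w, unwhiten

# Example: the box constraint a <= x <= b written as F x + g >= 0
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
a, b = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
F = np.vstack([np.eye(2), -np.eye(2)])
g = np.concatenate([-a, b])
F_w, g_w, unwhiten = whiten_tmg(mu, Sigma, F, g)

Samples drawn in the whitened frame are then mapped back through unwhiten, which is the reverse transformation mentioned above.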

2.2 Exact Hamiltonian Monte Carlo for Truncated Multivariate Gaussians

Exact Hamiltonian Monte Carlo (HMC) for truncated multivariate Gaussians (Pakman and Paninski 2014) considers the exact paths of particle trajectories in a Hamiltonian system whose potential energy is proportional to the negative logarithm of the TMG density. Suppose our target distribution p(x) is a TMG distribution (2.1.5), where x ∈ ℝᵈ. We can set the momenta to be normally distributed, s ∼ N(0, I_d). The Hamiltonian system can then be described as

$$H = U(x) + K(s) = \frac{1}{2}x^\top x + \frac{1}{2}s^\top s, \tag{2.2.1}$$

subject to

$$Fx + g \ge 0, \tag{2.2.2}$$

for some matrices F and g.

The equations of motion for the Hamiltonian system in (2.2.1) are

$$\frac{\partial x_i}{\partial t} = \frac{\partial H}{\partial s_i} = s_i, \qquad \frac{\partial s_i}{\partial t} = -\frac{\partial H}{\partial x_i} = -x_i, \qquad i = 1, \dots, d. \tag{2.2.3}$$

In this sense, we want the particles in the Hamiltonian system to move only inside the constrained space. The exact trajectory of a particle following the equations above is

$$x_i(t) = s_i(0)\sin(t) + x_i(0)\cos(t). \tag{2.2.4}$$

A particle follows this trajectory until it hits a wall, in other words, until it hits one of the constraints. Mathematically, this can be described as $F_h x(t_h) + g_h = 0$, where $F_h$ is one of the rows of F and $g_h$ the h-th element of g. Let $t_h$ be the time when the particle hits wall h. It hits the wall with velocity $\dot{x}(t_h)$, which can be decomposed into

$$\dot{x}(t_h) = \mathrm{proj}_{n}\,\dot{x}(t_h) + \mathrm{proj}_{F_h}\,\dot{x}(t_h), \tag{2.2.5}$$

where $\mathrm{proj}_{n}\,\dot{x}(t_h)$ is the projection of $\dot{x}(t_h)$ onto the normal vector n perpendicular to $F_h$, and

$$\mathrm{proj}_{F_h}\,\dot{x}(t_h) = \left(\frac{F_h\,\dot{x}(t_h)}{\lVert F_h\rVert}\right)\frac{F_h}{\lVert F_h\rVert} = \frac{F_h\,\dot{x}(t_h)}{\lVert F_h\rVert^2}F_h = \alpha_h F_h. \tag{2.2.6}$$

By inverting the direction of $\mathrm{proj}_{F_h}\,\dot{x}(t_h)$, we obtain the reflected velocity

$$\dot{x}_R(t_h) = \mathrm{proj}_{n}\,\dot{x}(t_h) - \mathrm{proj}_{F_h}\,\dot{x}(t_h) = \dot{x}(t_h) - 2\alpha_h F_h, \tag{2.2.7}$$

which is used as the new initial velocity $s_i(0)$ in the trajectory equation (2.2.4) so that the particle can continue its path. It can be shown that Exact HMC satisfies the detailed balance condition (Pakman and Paninski 2014), which leaves the canonical distribution invariant.

The algorithm for one iteration of Exact HMC can be found in Algorithm 1. The one hyperparameter that can be changed is the maximum travelling time of the particle, T, which in practice is often set to π/2. In short, the algorithm follows the trajectory of the particle for a total time T. Whenever the trajectory hits a wall $F_h$, new momenta are calculated to simulate the reflection, and the trajectory continues in the same way until it reaches T. The new position x′ is then used as the initial position of the next iteration.

Algorithm 1: Exact HMC

Require: current position x = x(0), maximum travelling time T, constraint matrices F and g
Sample initial momenta s(0) ∼ N(0, I_d)
t = 0
while t < T do
    Find the time t_h the particle needs to reach the first wall/constraint, where F_h x(t_h) + g_h = 0
    if t_h < (T − t) then
        t = t + t_h
        x(t_h) = s(0) sin(t_h) + x(0) cos(t_h)
        ẋ_R(t_h) = ẋ(t_h) − 2 (F_h ẋ(t_h) / ‖F_h‖²) F_h
        Continue the trajectory using the reflected position and momenta:
        x(0) = x(t_h), s(0) = ẋ_R(t_h)
    else
        t = T − t
        x′ = x(t) = s(0) sin(t) + x(0) cos(t)
return x′
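For concreteness, the following sketch shows how one iteration of Algorithm 1 could look in Python for the whitened target (2.1.5). It is an illustrative implementation written for this text (the function name, the closed-form hit-time computation, and the numerical tolerance are choices made here, not taken from the thesis):

import numpy as np

def exact_hmc_step(x0, F, g, T=np.pi / 2, rng=None):
    """One Exact HMC iteration for the whitened TMG p(x) ∝ exp(-x.T x / 2) 1(Fx + g >= 0)."""
    rng = rng or np.random.default_rng()
    x = x0.copy()
    s = rng.standard_normal(x.shape[0])          # initial momenta s(0) ~ N(0, I_d)
    t_used = 0.0
    while True:
        # Along the trajectory (2.2.4): F_h x(t) + g_h = u_h cos(t + phi_h) + g_h
        fx, fs = F @ x, F @ s
        u = np.sqrt(fx**2 + fs**2)
        phi = np.arctan2(-fs, fx)
        t_hit = np.full(F.shape[0], np.inf)
        reachable = u > np.abs(g)                # walls the trajectory can actually touch
        acos = np.arccos(np.clip(-g[reachable] / u[reachable], -1.0, 1.0))
        for sign in (1.0, -1.0):
            cand = (-phi[reachable] + sign * acos) % (2 * np.pi)
            cand[cand < 1e-10] = np.inf          # ignore the wall we are currently on
            t_hit[reachable] = np.minimum(t_hit[reachable], cand)
        h = int(np.argmin(t_hit))
        t_h = t_hit[h]
        if t_used + t_h < T:                     # bounce off wall h and keep travelling
            t_used += t_h
            x_new = s * np.sin(t_h) + x * np.cos(t_h)
            v = s * np.cos(t_h) - x * np.sin(t_h)            # velocity at the wall
            v = v - 2.0 * (F[h] @ v) / (F[h] @ F[h]) * F[h]  # reflection (2.2.7)
            x, s = x_new, v
        else:                                    # spend the remaining time and return x'
            t = T - t_used
            return s * np.sin(t) + x * np.cos(t)

The hit times are found by writing each constraint along the trajectory (2.2.4) as a single cosine, which is what allows the wall-hitting time to be computed exactly rather than numerically.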

2.3 Analytic Elliptical Slice Sampling

Analytic Elliptical Slice Sampling was introduced by Fagan, Bhandari, and Cunningham (2016) as an exact implementation of Elliptical Slice Sampling (ESS) (Murray, Adams, and MacKay 2010), which is a variant of Slice Sampling (Neal 2003). In essence, Slice Sampling samples uniformly from the (d+1)-dimensional region under a function f(x) that is proportional to the density p(x). This is done by Gibbs sampling from the joint distribution of (x, y), where y is an auxiliary variable: we alternately sample y | x ∼ U(0, f(x)) and x | y ∼ U(S), where S = {x : f(x) > y}. To obtain samples of x, we sample from p(x, y) and marginalize out y by simply ignoring it.

Elliptical Slice Sampling, on the other hand, uses a Gaussian prior as an auxiliary variable to sample from a posterior distribution of the form

$$p(x) = \frac{1}{Z}\,\mathcal{N}(x; 0, \Sigma)\,L(x), \tag{2.3.1}$$

where Z is the normalization constant, N(0, Σ) is a multivariate Gaussian prior, and L is a likelihood function. We introduce an auxiliary variable ν ∼ N(0, Σ) so that, given an initial state x, the update ellipse is

$$x' = x\cos(\theta) + \nu\sin(\theta), \qquad \theta \in [0, 2\pi]. \tag{2.3.2}$$

The new state x′ is accepted if it satisfies L(x′) > y, where y ∼ U(0, L(x)).

In ESS, x′ is found by iteratively shrinking the interval [0, 2π] towards θ = 0 (the initial state) until a θ is found that satisfies L(x′) > y. Analytic ESS avoids the shrinking process by analytically defining the set of possible proposed states,

$$\mathcal{S}(y, \mathcal{E}) = \{x' \in \mathcal{E} : L(x') > y\}, \tag{2.3.3}$$

where $\mathcal{E} = \{x' : x'(\theta) = x\cos(\theta) + \nu\sin(\theta)\}$.

In the case of a TMG, we can use the target distribution as in (2.1.5), that is,

$$p(x) \propto \exp\left(-\frac{1}{2}x^\top x\right)\mathbb{1}(Fx+g \ge 0). \tag{2.3.4}$$

Hence, the prior is N(0, I_d), while the indicator function term acts as the likelihood. Notice that, since y ∼ U(0, L(x)) and L(x) = 1(Fx + g ≥ 0) ∈ {0, 1}, we have y ∼ U(0, 1), and a proposal x′ is always accepted if x′ ∈ S(y, E) (and always rejected if x′ ∉ S(y, E)). Hence, we can reformulate S as

$$\mathcal{S}(\mathcal{E}) = \{x' \in \mathcal{E} : \mathbb{1}(Fx'+g \ge 0) = 1\}. \tag{2.3.5}$$

Using this notation, we can think of S(E) as a portion, or slice, of the ellipse E that falls inside the constraints described by F and g. The pseudocode for one iteration of Analytic ESS can be found in Algorithm 2.

Algorithm 2: Analytic ESS

Require: current state x, constraint matrices F and g
Sample auxiliary variable ν ∼ N(0, I_d)
Define E = {x′ : x′(θ) = x cos(θ) + ν sin(θ)}
Define S(E) = ⋂_{j=1}^{m} {θ ∈ [0, 2π] : F_jᵀ x′(θ) + g_j ≥ 0}
Sample θ ∼ U(S(E))
return x′ = x′(θ)

It is also possible to sample J points from a single ESS iteration, as demonstrated by Fagan, Bhandari, and Cunningham (2016). In an ESS setting, this is done by sampling J different values of y, and for each of them uniformly sampling one point from S(y, E). One of the J points is then randomly selected as the next initial state. In the case of Analytic ESS, we simply sample J points from S(E).
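The sketch below illustrates how the slice S(E) can be computed analytically in the whitened frame. It is a simplified example written for this text (the function name, the use of arc endpoints combined with a midpoint feasibility check, and the handling of J are implementation choices, not the thesis code):

import numpy as np

def analytic_ess_step(x, F, g, J=1, rng=None):
    """One Analytic ESS iteration for the whitened TMG p(x) ∝ exp(-x.T x / 2) 1(Fx + g >= 0)."""
    rng = rng or np.random.default_rng()
    nu = rng.standard_normal(x.shape[0])
    a, b = F @ x, F @ nu                       # F_j x'(θ) + g_j = a_j cos θ + b_j sin θ + g_j
    r, phi = np.hypot(a, b), np.arctan2(b, a)

    # Closed-form endpoints of the feasible arc of every constraint that can change sign
    cuts = [0.0, 2 * np.pi]
    active = r > np.abs(g)
    delta = np.arccos(np.clip(-g[active] / r[active], -1.0, 1.0))
    for lo, hi in zip(phi[active] - delta, phi[active] + delta):
        cuts += [lo % (2 * np.pi), hi % (2 * np.pi)]
    cuts = np.sort(np.array(cuts))

    # Keep the sub-arcs whose midpoints satisfy all constraints: this is the slice S(E)
    feasible, lengths = [], []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        mid = 0.5 * (lo + hi)
        if np.all(F @ (x * np.cos(mid) + nu * np.sin(mid)) + g >= 0):
            feasible.append((lo, hi))
            lengths.append(hi - lo)
    lengths = np.array(lengths)

    # Sample J angles uniformly over S(E): pick a sub-arc ∝ its length, then a point inside it
    draws = []
    for _ in range(J):
        k = rng.choice(len(feasible), p=lengths / lengths.sum())
        lo, hi = feasible[k]
        theta = rng.uniform(lo, hi)
        draws.append(x * np.cos(theta) + nu * np.sin(theta))
    return draws if J > 1 else draws[0]

Each constraint contributes an arc of feasible angles with closed-form endpoints, and the slice is their intersection; one of the J returned points would then be kept as the next state of the chain.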

2.4 Gibbs Sampling for Truncated Multivariate Gaussians

Gibbs sampling for TMG distributions uses the fact that the conditional of a TMG distribution is also a TMG (Kotecha and Djuric 1999). Therefore, we can use the full conditionals to generate samples.

Let $f(x_i \mid x_{-i}) = f(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_d)$ be the conditional distribution of variable i given the rest of the variables, denoted −i. The joint distribution of $x_i$ and $x_{-i}$ is given by

$$z = \begin{bmatrix} x_i \\ x_{-i} \end{bmatrix} \sim \mathcal{N}(\mu, \Sigma) \quad\text{with}\quad \mu = \begin{bmatrix} \mu_i \\ \mu_{-i} \end{bmatrix} \quad\text{and}\quad \Sigma = \begin{bmatrix} \Sigma_{i,i} & \Sigma_{i,-i} \\ \Sigma_{-i,i} & \Sigma_{-i,-i} \end{bmatrix}. \tag{2.4.1}$$

The conditional distribution $x_i \mid x_{-i}$ is also a normal distribution, with mean and covariance

$$\mu_{i.-i} = \mu_i + \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}(x_{-i} - \mu_{-i}), \qquad \Sigma_{i.-i} = \Sigma_{i,i} - \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}.$$

To sample from a TMG distribution given mean µ, covariance matrix Σ, and vectors of lower and upper bounds a and b, we can draw samples consecutively from $f(x_i \mid x_{-i})$ subject to $a_i \le x_i \le b_i$. In the case where constraint matrices F and g are used, the corresponding bounds $a_i$ and $b_i$ on the i-th variable can be derived from the constraints given the values of the remaining (d − 1) variables.

The algorithm for Gibbs sampling for TMG distributions can be found in Algorithm 3, as implemented by Wilhelm and Manjunath (2010). The function Φ in the algorithm stands for the cumulative distribution function (cdf) of a standard normal distribution.

Algorithm 3: Gibbs Sampling

Require: initial value x⁽⁰⁾, mean µ, covariance matrix Σ, bounds a and b
for j = 1 to N do
    Draw $x_i^{(j)} \mid x_1^{(j)}, \dots, x_{i-1}^{(j)}, x_{i+1}^{(j-1)}, \dots, x_d^{(j-1)}$:
    for i = 1 to d do
        Draw U ∼ U(0, 1)
        Draw $x_{i.-i} = \mu_{i.-i} + \sigma_{i.-i}\,\Phi^{-1}\!\left[U\left(\Phi\!\left(\tfrac{b_i - \mu_{i.-i}}{\sigma_{i.-i}}\right) - \Phi\!\left(\tfrac{a_i - \mu_{i.-i}}{\sigma_{i.-i}}\right)\right) + \Phi\!\left(\tfrac{a_i - \mu_{i.-i}}{\sigma_{i.-i}}\right)\right]$
return X ∈ ℝ^{N×d}
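As an illustration, here is a minimal numpy/scipy sketch of one full Gibbs sweep, written for this text and assuming componentwise bounds a ≤ x ≤ b (the function name and the use of scipy.stats.norm for Φ and Φ⁻¹ are choices made here):

import numpy as np
from scipy.stats import norm

def gibbs_sweep(x, mu, Sigma, a, b, rng=None):
    """One Gibbs sweep for X ~ TN(mu, Sigma, a, b), using the inverse-CDF step of Algorithm 3."""
    rng = rng or np.random.default_rng()
    d = len(mu)
    x = x.copy()
    for i in range(d):
        rest = [j for j in range(d) if j != i]
        S_rr_inv = np.linalg.inv(Sigma[np.ix_(rest, rest)])
        s_ir = Sigma[i, rest]
        # Conditional moments of x_i given the current values of the other coordinates
        m = mu[i] + s_ir @ S_rr_inv @ (x[rest] - mu[rest])
        s = np.sqrt(Sigma[i, i] - s_ir @ S_rr_inv @ Sigma[rest, i])
        # Inverse-CDF draw from N(m, s^2) truncated to [a_i, b_i]
        lo, hi = norm.cdf((a[i] - m) / s), norm.cdf((b[i] - m) / s)
        u = rng.uniform(0.0, 1.0)
        x[i] = m + s * norm.ppf(lo + u * (hi - lo))
    return x

Repeating the sweep N times and stacking the states gives the N × d sample matrix returned by Algorithm 3.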

2.5 Gaussian Process

Gaussian Process Regression

A Gaussian process (GP) is a stochastic process such that, when evaluated at a finite number of points, it has a joint Gaussian distribution (Rasmussen and Williams 2005). Consider functions f : ℝ^{n_x} → ℝ. A GP over f can be specified by its mean and covariance function:

$$\mu(x) = \mathbb{E}[f(x)] : \mathbb{R}^{n_x} \to \mathbb{R}, \qquad K(x, x') = \mathbb{E}[(f(x) - \mu(x))(f(x') - \mu(x'))] : \mathbb{R}^{n_x} \times \mathbb{R}^{n_x} \to \mathbb{R}. \tag{2.5.1}$$

We write f ∼ GP(µ(x), K(x, x′)) for a GP over the function f with mean function µ(x) and covariance function K(x, x′).

We can view the GP as a prior distribution over f. To see this, consider a set of inputs X = [x₁, ..., x_M]ᵀ. To sample function values at the inputs, f | X, we sample from the multivariate Gaussian distribution N(µ(X), K(X, X)), with mean and covariance

$$\mu(X) = [\mu(x_1), \dots, \mu(x_M)]^\top \in \mathbb{R}^M, \qquad K(X, X) \in \mathbb{R}^{M \times M}, \ \text{where}\ K(X, X)_{i,j} = K(x_i, x_j). \tag{2.5.2}$$

Given a set of training inputs X = [x₁, ..., x_N]ᵀ and their corresponding observed values Y = [y₁, ..., y_N]ᵀ, and assuming that $y_i = f(x_i) + \varepsilon$ with $\varepsilon$ i.i.d. N(0, σ²), we can find the posterior predictive distribution of f* for new test inputs X* given the training data X and Y, which is still a Gaussian distribution, f* | X, X*, Y ∼ N(E[f*], cov(f*)), where

$$\begin{aligned}
\mathbb{E}[f^*] &= \mu(X^*) + K(X^*, X)\,[K(X, X) + \sigma^2 I_N]^{-1}(Y - \mu(X)),\\
\operatorname{cov}(f^*) &= K(X^*, X^*) - K(X^*, X)\,[K(X, X) + \sigma^2 I_N]^{-1}K(X, X^*).
\end{aligned} \tag{2.5.3}$$
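As a small illustration of (2.5.3) with the zero mean and squared-exponential kernel used later in this thesis, the following sketch computes the unconstrained posterior (it is written for this text; the function names, the jitter value, and the vectorized kernel for one-dimensional inputs are assumptions):

import numpy as np

def sq_exp_kernel(X1, X2, sigma_f2=0.5, l=0.1):
    """Squared-exponential kernel K(x, x') = σ_f² exp(-(x - x')² / (2 l²)) for 1-D inputs."""
    return sigma_f2 * np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2.0 * l**2))

def gp_posterior(X, Y, X_star, noise_var=1e-6, kernel=sq_exp_kernel):
    """Posterior predictive mean and covariance (2.5.3) for a zero-mean GP."""
    K = kernel(X, X) + noise_var * np.eye(len(X))
    K_s = kernel(X_star, X)                       # K(X*, X)
    K_ss = kernel(X_star, X_star)                 # K(X*, X*)
    alpha = np.linalg.solve(K, Y)                 # [K(X,X) + σ² I]⁻¹ (Y - µ(X)), with µ = 0
    mean = K_s @ alpha
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov

The small noise_var acts as jitter for the noiseless training data used later in the experiments.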

Gaussian Process under Linear Operators

Let L be a class of linear operators that takes f : ℝ^{n_x} → ℝ and produces another function Lf : ℝ^{n_x} → ℝ^{n_c}. If f ∼ GP(µ(x), K(x, x′)), then Lf is also a GP, because GPs are closed under linear operators (Rasmussen and Williams 2005). The mean and covariance function of Lf are

$$\mathbb{E}[\mathcal{L}f(x)] = \mathcal{L}\mu(x) : \mathbb{R}^{n_x} \to \mathbb{R}^{n_c}, \qquad \operatorname{cov}(\mathcal{L}f(x), \mathcal{L}f(x')) = \mathcal{L}K(x, x')\mathcal{L}^\top : \mathbb{R}^{n_x} \times \mathbb{R}^{n_x} \to \mathbb{R}^{n_c \times n_c}, \tag{2.5.4}$$

where LK(x, x′) and K(x, x′)Lᵀ indicate whether the linear operator is applied to the left or right argument of K(x, x′). In other words, LK(x, x′) = LK(x, ·) and K(x, x′)Lᵀ = (LK(x′, ·))ᵀ.

Gaussian Process under Linear Inequality Constraints

Suppose we want to impose constraints on f ∼ GP(µ(x), K(x, x′)). To do this, we can condition the GP on the event a(x) ≤ Lf(x) ≤ b(x), where a(x) ≤ b(x) for all x, in addition to conditioning on X, X*, and Y.

Using the notation of conditional expectation (Da Veiga and Marrel 2020), the mean of the constrained GP can be written as

$$\mathbb{E}[f^* \mid X, X^*, Y, a(x) \le \mathcal{L}f(x) \le b(x)\ \forall x \in \Omega], \tag{2.5.5}$$

where Ω is the domain of Lf. To approximate the mean of the constrained GP above (2.5.5), we can choose a set of points Xᵛ = [x₁ᵛ, ..., x_Sᵛ]ᵀ from Ω such that the constraints hold at those points,

$$\mathbb{E}[f^* \mid X, X^*, Y, a(x_i^v) \le \mathcal{L}f(x_i^v) + \varepsilon_i^v \le b(x_i^v)\ \forall i = 1, \dots, S], \tag{2.5.6}$$

where εᵢᵛ is an error term that is i.i.d. N(0, σᵥ²).

The set of points Xᵛ is also called the virtual observation locations (Agrell 2019). To incorporate this into the conditioning of the GP, denote

$$C(X^v) = \bigcap_{i=1}^{S}\{a(x_i^v) \le \mathcal{L}f(x_i^v) + \varepsilon_i^v \le b(x_i^v)\} \tag{2.5.7}$$

as the event that the constraints are satisfied at all points of Xᵛ. The posterior predictive distribution given the constraints, f* | X, X*, Y, C(Xᵛ), is then

$$f^* \mid X, X^*, Y, C(X^v) \sim \mathcal{N}\left(\mu(X^*) + A(C - \mathcal{L}\mu(X^v)) + B(Y - \mu(X)),\ \Sigma^*\right), \tag{2.5.8}$$

where

$$C \sim \mathcal{TN}\left(\mathcal{L}\mu(X^v) + A_1(Y - \mu(X)),\ B_1,\ a(X^v),\ b(X^v)\right), \tag{2.5.9}$$

and where

$$\begin{aligned}
A &= B_3 B_1^{-1}, \qquad B = A_2 - A A_1, \qquad \Sigma^* = B_2 - A B_3^\top,\\
A_1 &= (\mathcal{L}K(X^v, X))(K(X, X) + \sigma^2 I_N)^{-1},\\
A_2 &= K(X^*, X)(K(X, X) + \sigma^2 I_N)^{-1},\\
B_1 &= \mathcal{L}K(X^v, X^v)\mathcal{L}^\top + \sigma_v^2 I_S - A_1 K(X, X^v)\mathcal{L}^\top,\\
B_2 &= K(X^*, X^*) - A_2 K(X, X^*),\\
B_3 &= K(X^*, X^v)\mathcal{L}^\top - A_2 K(X, X^v)\mathcal{L}^\top.
\end{aligned} \tag{2.5.10}$$

In essence, to sample from the posterior predictive distribution given the training data and the constraints, f* | X, X*, Y, C(Xᵛ), we need to sample values from the distribution of C, which is a TMG distribution bounded between a(Xᵛ) = [a(x₁ᵛ), ..., a(x_Sᵛ)]ᵀ and b(Xᵛ) = [b(x₁ᵛ), ..., b(x_Sᵛ)]ᵀ.

The parameters defined in (2.5.10) carry information about the covariance functions between training, test, and virtual observation locations. Covariance function values between training locations are computed for A₁ and A₂, but A₁ carries additional information between training and virtual observation locations, while A₂ has information between training and test locations. For B₁, the covariance function values within the virtual observation locations are computed, as well as between training and virtual observation locations. On the other hand, B₂ confines the calculations to training and test points. Notice that all calculations that involve virtual observation locations are subject to the linear operator L, so as to carry the information of the linear operator.
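To make the bookkeeping in (2.5.10) concrete, here is a simplified sketch for the boundedness case (where L is the identity), assuming a zero mean function and reusing the sq_exp_kernel helper from the earlier sketch; sample_tmg stands for any of the TMG samplers described in this chapter and is a placeholder, not a specific library function:

import numpy as np

def constrained_gp_params(X, Y, X_star, X_v, kernel, noise=1e-6, noise_v=1e-6):
    """Compute A, B, Σ*, and the TMG parameters of C in (2.5.8)-(2.5.10), with L = identity."""
    Kxx = kernel(X, X) + noise * np.eye(len(X))
    Kinv = np.linalg.inv(Kxx)
    A1 = kernel(X_v, X) @ Kinv                           # (L K(Xv, X)) (K + σ² I)⁻¹
    A2 = kernel(X_star, X) @ Kinv                        # K(X*, X) (K + σ² I)⁻¹
    B1 = kernel(X_v, X_v) + noise_v * np.eye(len(X_v)) - A1 @ kernel(X, X_v)
    B2 = kernel(X_star, X_star) - A2 @ kernel(X, X_star)
    B3 = kernel(X_star, X_v) - A2 @ kernel(X, X_v)
    A = B3 @ np.linalg.inv(B1)
    B = A2 - A @ A1
    Sigma_star = B2 - A @ B3.T
    c_mean = A1 @ Y                                      # L µ(Xv) + A1 (Y - µ(X)), with µ = 0
    return A, B, Sigma_star, c_mean, B1

def sample_constrained_path(X, Y, X_star, X_v, a_v, b_v, kernel, sample_tmg, rng=None):
    """Draw one sample path of f* | X, X*, Y, C(Xv): first C from its TMG, then f*."""
    rng = rng or np.random.default_rng()
    A, B, Sigma_star, c_mean, B1 = constrained_gp_params(X, Y, X_star, X_v, kernel)
    C = sample_tmg(c_mean, B1, a_v, b_v)                 # any TMG sampler (HMC, ESS, Gibbs)
    mean = A @ C + B @ Y                                 # µ(X*) + A(C - Lµ(Xv)) + B(Y - µ(X)), µ = 0
    L = np.linalg.cholesky(Sigma_star + 1e-8 * np.eye(len(X_star)))
    return mean + L @ rng.standard_normal(len(X_star))

For the monotonicity case, the kernel blocks that involve Xᵛ would instead use the derivative kernels given in (2.5.12) below.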

To give an example, consider a GP over f : ℝ → ℝ, f ∼ GP(0, K(x, x′)), with covariance function

$$K(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2l^2}\right). \tag{2.5.11}$$

In the case of boundedness constraints, we restrict the values of f directly, a(x) ≤ f(x) ≤ b(x). Hence L = 1 and LK(X, X′) = LK(X, X′)Lᵀ = K(X, X′), which is the same as the original covariance function defined in (2.5.11).

For the monotonicity constraint, the values of the first derivative of f are restricted instead, a(x) ≤ df(x)/dx ≤ b(x). Thus L represents d/dx, and LK(X, X′) becomes (d/dx)K(X, X′), where ((d/dx)K(X, X′))ᵢ,ⱼ = (d/dx)K(xᵢ, x′ⱼ). At the same time, LK(X, X′)Lᵀ = (d²/dx dx′)K(X, X′). Precisely,

$$\begin{aligned}
\frac{d}{dx}K(x, x') &= \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2l^2}\right)\left(-\frac{1}{l^2}(x - x')\right) = -K(x, x')\,\frac{x - x'}{l^2},\\
\frac{d^2}{dx\,dx'}K(x, x') &= \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2l^2}\right)\frac{1}{l^2}\left(1 - \frac{1}{l^2}(x - x')^2\right) = K(x, x')\,\frac{1}{l^2}\left(1 - \frac{(x - x')^2}{l^2}\right).
\end{aligned} \tag{2.5.12}$$

Notice that a different type of constraint results in different values of the parameters in (2.5.10), and thus in different sample values from the posterior predictive distribution given the constraints, f* | X, X*, Y, C(Xᵛ).
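A small sketch of these covariance blocks, using the parameter values σ_f² = 0.5 and l = 0.1 chosen in Chapter 3 (the function names are illustrative, the kernel is repeated for self-containment, and the derivative is taken with respect to the first argument as in (2.5.12)):

import numpy as np

def sq_exp_kernel(x, xp, sigma_f2=0.5, l=0.1):
    """K(x, x') = σ_f² exp(-(x - x')² / (2 l²)), evaluated on grids of 1-D inputs."""
    d = x[:, None] - xp[None, :]
    return sigma_f2 * np.exp(-d**2 / (2 * l**2))

def dk_dx(x, xp, sigma_f2=0.5, l=0.1):
    """L K(x, x') with L = d/dx (derivative w.r.t. the first argument), eq. (2.5.12)."""
    d = x[:, None] - xp[None, :]
    return -sq_exp_kernel(x, xp, sigma_f2, l) * d / l**2

def d2k_dxdxp(x, xp, sigma_f2=0.5, l=0.1):
    """L K(x, x') Lᵀ, the mixed second derivative in eq. (2.5.12)."""
    d = x[:, None] - xp[None, :]
    return sq_exp_kernel(x, xp, sigma_f2, l) * (1.0 / l**2) * (1.0 - d**2 / l**2)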

2.6 Autocorrelation and Effective Sample Size

Let $x_i$ be a time series at time i. The autocorrelation function can be defined as

$$\rho_t = \frac{\operatorname{cov}(x_i, x_{i-t})}{\operatorname{var}(x_i)} \tag{2.6.1}$$

for lag t. Ideally, we want the samples generated by our sampler to be independent. However, samples from an MCMC chain will in general be correlated with the samples at nearby lags, i.e., autocorrelated. Hence, samples that are less autocorrelated are preferred, because we want the generated samples to accurately represent the distribution of interest.

To assess the effective number of independent samples from an MCMC chain, the number of effective samples is calculated as

$$N_{\mathrm{eff}} = \frac{N}{1 + 2\sum_{t=1}^{\infty}\rho_t}, \tag{2.6.2}$$

where N is the number of samples (Gelman, Carlin, Stern, and Rubin 2004). The value of N_eff can be larger than N in cases where the ACF values are negative at odd lags (Stan Development Team 2019). In a multivariate setting, the effective sample size is defined as

$$N_{\mathrm{eff}} = N\left(\frac{|\Lambda|}{|\Sigma|}\right)^{1/d}, \tag{2.6.3}$$

where Λ is the sample covariance matrix, d is the number of dimensions, and Σ is the estimated covariance matrix for the Markov chain (Vats, Flegal, and Jones 2017). Using this metric, we can compare the sampling performance of two sampling methods: a sampler that generates more correlated samples will have a lower number of effective samples.
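As an illustration of how these quantities can be estimated from a chain in practice, here is a simple sketch written for this text (the truncation of the autocorrelation sum at the first negative value is a pragmatic choice and not necessarily the estimator used in the thesis):

import numpy as np

def autocorr(x, max_lag):
    """Sample autocorrelation ρ_t of a 1-D chain for t = 0, ..., max_lag (eq. 2.6.1)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.mean(x**2)
    return np.array([np.mean(x[t:] * x[:len(x) - t]) / var if t else 1.0
                     for t in range(max_lag + 1)])

def effective_sample_size(x, max_lag=100):
    """Univariate N_eff = N / (1 + 2 Σ ρ_t), truncating at the first negative ρ_t (an assumption)."""
    rho = autocorr(x, max_lag)[1:]
    neg = np.where(rho < 0)[0]
    rho = rho[:neg[0]] if len(neg) else rho
    return len(x) / (1.0 + 2.0 * np.sum(rho))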


3 Method

3.1 Methods and Hyperparameters

Truncated Multivariate Gaussian Distribution

As previously mentioned in Section 2.1, any truncated multivariate Gaussian distribution can be transformed into a whitened frame. Hence, the pdf form in (2.1.5) is used for both the Exact HMC and Analytic ESS algorithms. For all samplers, there is a burn-in period of 100 iterations. To ensure reproducibility, a random seed of 1234 is used in all cases.

Exact Hamiltonian Monte Carlo

Exact HMC has one hyperparameter that can be tweaked, namely the maximum travelling time of the particle, T. While there are calculations for obtaining a value of T that makes sampling as efficient as possible, they are often not helpful in practice (Pakman and Paninski 2014). Hence, the value T = π/2 is used, as it is a safe choice that balances the number of bounces and the travelling distance of the particle.

Analytic Elliptical Slice Sampling

The hyperparameter J determines how many samples are generated from one ellipse. The values J = 1, 5, and 10 will be used to compare the performance of the Analytic ESS sampler on the different TMG distributions. One of the J values will then be selected for the comparison on GPs under linear inequality constraints.

Gaussian Process

Throughout this thesis project, a zero mean function is used for the Gaussian process. For the covariance function, the squared-exponential function is used:

$$K(x, x') = \sigma_f^2 \exp\left(-\frac{\lVert x - x'\rVert^2}{2l^2}\right), \tag{3.1.1}$$

where ‖x − x′‖ is the Euclidean distance between x and x′. The parameters are chosen as σ_f² = 0.5 and l = 0.1. Since the focus of this thesis is on comparing the sampling algorithms, the GP hyperparameters are not optimized.

3.2 Experiments

In this thesis, simulated data will mainly be used, since the focus is on the MCMC algorithms for TMG distributions. The performance of the sampling methods will be compared on a few cases of TMG distributions:

1. X ∼ TN(µ, Σ, a = [−1, −1]ᵀ, b = [1, 1]ᵀ)

2. X ∼ TN(µ, Σ, a = [−8, −0.2]ᵀ, b = [8, 0.2]ᵀ)

3. X ∼ TN(µ, Σ, a = [0.5, 0.5]ᵀ, b = [8, 8]ᵀ)

4. X ∼ TN(0₁₀, Σ₁₀, a = −1₁₀, b = 1₁₀)

where

$$\mu = \begin{bmatrix}0\\0\end{bmatrix}, \qquad \Sigma = \begin{bmatrix}1 & 0.8\\0.8 & 1\end{bmatrix}, \tag{3.2.1}$$

0₁₀, 1₁₀ ∈ ℝ¹⁰, and Σ₁₀ is shown in Figure 3.1.

Figure 3.1: Plot of Σ₁₀, the covariance matrix used in Experiment 4.

The choices of the cases are based on different conditions of TMG distributions. In the first case, we have a fairly simple example where the mean of the distribution is located in the center of the support. In the second case, we have a narrow TMG distribution. In the third case, we are trying to sample from a low-probability region of the untruncated multivariate Gaussian distribution. Finally, we have a high-dimensional case with d = 10.

Since Experiments 1 to 3 are done in 2 dimensions, we will be able to visualize the generated samples from each method. For all methods, the autocorrelation and effective sample size will be calculated. To see how many effective samples a sampler generates in a given time, the number of effective samples per second of CPU runtime will also be computed.

In the context of Gaussian processes under linear inequality constraints, the example from Agrell (2019) is followed, given by the one-dimensional function

$$f(x) = \frac{1}{3}\left(\tan^{-1}(20x - 10) - \tan^{-1}(-10)\right). \tag{3.2.2}$$

From the function above, training points are obtained at 7 input locations, $x_i = 0.1 + (i+1)^{-1}$ for i = 1, ..., 7. The training points are assumed to be noiseless, hence y = f(x). An illustration of the training points can be seen in Figure 3.2.

Figure 3.2: Training points for GP.

Two cases of constraints will be used to compare the performance of the sampling algorithms:

1. Boundedness constraint: 0 ≤ f(x) ≤ 1.

2. Monotonicity constraint: df(x)/dx ≥ 0.

The aforementioned types of constraints are selected because of their simplicity. In practice, it is possible to implement more complicated constraints or even combinations of different constraint types.

To incorporate these constraints into the GP model, we design the covariance function to include the linear operator. For the boundedness case, the covariance function is the same as defined in (3.1.1). For the monotonicity case, we use the first and second derivatives of the covariance function, as given in (2.5.12), for the covariance matrices that involve the virtual observations.

In both cases, because the domain is x ∈ [0, 1], a finite set of virtual observation locations can be chosen to make sure that the constraints hold at those points. Hence, 30 equally spaced points in [0, 1] will be used as Xᵛ, the virtual observation locations. For the test inputs X*, 500 equally spaced points in [0, 1] will be selected.
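For reference, this setup can be written down as follows (a short sketch under the assumptions stated above; in particular, the training locations follow the reconstruction x_i = 0.1 + (i+1)⁻¹ and the variable names are illustrative):

import numpy as np

# Target function from Agrell (2019), eq. (3.2.2)
f = lambda x: (np.arctan(20 * x - 10) - np.arctan(-10)) / 3.0

# Noiseless training data at 7 input locations
X_train = np.array([0.1 + 1.0 / (i + 1) for i in range(1, 8)])
Y_train = f(X_train)

# 30 equally spaced virtual observation locations and 500 test inputs on [0, 1]
X_v = np.linspace(0.0, 1.0, 30)
X_star = np.linspace(0.0, 1.0, 500)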

For each case, Exact HMC, Analytic ESS, and the Gibbs sampler will be used to generate 200 sample paths. The three methods will be compared graphically to show whether the produced sample paths obey the constraints or not. The ACF values of the generated samples will also be checked at every test input to analyze the autocorrelation between samples. In addition, the sampling time of each sampling method will be compared.

4 Results

In this chapter, the results of the experiments described in Section 3.2 are presented through figures and numbers. The chapter is divided into two sections. The first section presents the results of the comparative study of the different MCMC samplers on different TMG distributions. The second section focuses on the results and the performance of the MCMC algorithms in the context of Gaussian processes under linear inequality constraints.

4.1 Truncated Multivariate Gaussian Distributions

First TMG Distribution

The first distribution to compare is the TMG distribution with parameters

$$\mu = \begin{bmatrix}0\\0\end{bmatrix}, \qquad \Sigma = \begin{bmatrix}1 & 0.8\\0.8 & 1\end{bmatrix}, \qquad a = \begin{bmatrix}-1\\-1\end{bmatrix}, \qquad b = \begin{bmatrix}1\\1\end{bmatrix}. \tag{4.1.1}$$

The true mean and covariance matrix of this distribution are

$$\mu = \begin{bmatrix}0\\0\end{bmatrix}, \qquad \Sigma = \begin{bmatrix}0.2639571 & 0.1243004\\0.1243004 & 0.2639571\end{bmatrix}. \tag{4.1.2}$$

Figure 4.1 shows 10000 generated samples for the distribution above. For the sample plots from the Analytic ESS samplers, J in ESS(J) denotes the number of samples drawn on a single ellipse, as explained in Section 2.3. Each plot also includes histograms of the samples along each axis.

Figure 4.1: Plots of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [−1, −1]ᵀ, and b = [1, 1]ᵀ. (Panels: (a) HMC, (b) ESS(1), (c) ESS(5), (d) ESS(10), (e) Gibbs.)

In Table 4.1, we can see the estimated mean and covariance matrix, the number of effective samples, and the number of effective samples per second of CPU runtime for the samples in Figure 4.1. The value marked with an asterisk in the N_eff/CPU column indicates the highest number of effective samples per second of CPU runtime of all samplers.

Sampler      x̄                                              Q                                                      N_eff     N_eff/CPU
HMC          [0.00664586, 0.00332774]ᵀ                      [[0.25912417, 0.11655442], [0.11655442, 0.2619817]]    10259.8   3613.8*
ESS, J=1     [0.00149456, 0.00337604]ᵀ                      [[0.2509207, 0.12306735], [0.12306735, 0.25752818]]    10149.8   62.3
ESS, J=5     [0.00390068, 0.00164058]ᵀ                      [[0.24282278, 0.11112609], [0.11112609, 0.23882308]]   11059.6   242.2
ESS, J=10    [0.00079829, 0.0007009]ᵀ                       [[0.24315261, 0.11706768], [0.11706768, 0.24088345]]   8430.6    298.5
Gibbs        [6.37593466 × 10⁻⁵, 8.32749025 × 10⁻⁴]ᵀ        [[0.26909097, 0.13006213], [0.13006213, 0.27046932]]   9642.9    542.9

Table 4.1: Estimated moments, number of effective samples, and number of effective samples per second of CPU runtime of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [−1, −1]ᵀ, and b = [1, 1]ᵀ.

The autocorrelation function (ACF) plots in Figure 4.2 present the ACF values of the first 10 lags for the samples from each sampler in Figure 4.1. The ACF calculations are done for both variables: the left-hand side of Figure 4.2 shows the ACF values for the first variable (X₁), while the right-hand side presents the values for the second variable (X₂).

Figure 4.2: ACF values of the first 10 lags from 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler in each variable of the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [−1, −1]ᵀ, and b = [1, 1]ᵀ.

Second TMG Distribution

The second distribution is the TMG distribution with parameters

$$\mu = \begin{bmatrix}0\\0\end{bmatrix}, \qquad \Sigma = \begin{bmatrix}1 & 0.8\\0.8 & 1\end{bmatrix}, \qquad a = \begin{bmatrix}-8\\-0.2\end{bmatrix}, \qquad b = \begin{bmatrix}8\\0.2\end{bmatrix}, \tag{4.1.3}$$

with true mean and covariance matrix

$$\mu = \begin{bmatrix}0\\0\end{bmatrix}, \qquad \Sigma = \begin{bmatrix}0.36848791 & 0.01060989\\0.01060989 & 0.01326236\end{bmatrix}. \tag{4.1.4}$$

The purpose of comparing generated samples from this distribution is to simulate the condition where the support of the TMG distribution is narrow. The generated samples, along with histograms along each axis, can be seen in Figure 4.3.

Figure 4.3: Plots of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [−8, −0.2]ᵀ, and b = [8, 0.2]ᵀ. (Panels: (a) HMC, (b) ESS(1), (c) ESS(5), (d) ESS(10), (e) Gibbs.)

The estimated moments and numbers of effective samples for the samples in Figure 4.3, both unnormalized and normalized by CPU time, are shown in Table 4.2.

Sampler      x̄                              Q                                                      N_eff     N_eff/CPU
HMC          [0.00596755, 0.00018826]ᵀ      [[0.37223397, 0.01106454], [0.01106454, 0.01324386]]   8611.8    741.5
ESS, J=1     [0.00290735, 0.00020221]ᵀ      [[0.37510172, 0.01115079], [0.01115079, 0.01313496]]   10429.2   30.4
ESS, J=5     [0.00187388, 0.0004976]ᵀ       [[0.36657137, 0.01108339], [0.01108339, 0.01319566]]   11992.2   133.3
ESS, J=10    [0.00640117, 0.00123323]ᵀ      [[0.41437014, 0.00900725], [0.00900725, 0.01297812]]   10685.4   194.9
Gibbs        [0.00016817, 0.00028968]ᵀ      [[0.37651692, 0.01223855], [0.01223855, 0.01345002]]   10866.5   638.8

Table 4.2: Estimated moments, number of effective samples, and number of effective samples per second of CPU runtime of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [−8, −0.2]ᵀ, and b = [8, 0.2]ᵀ.

Figure 4.4 shows the ACF values of the first 10 lags in each variable for the samples from each sampling method in Figure 4.3.

Figure 4.4: ACF values of the first 10 lags from 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler in each variable of the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [−8, −0.2]ᵀ, and b = [8, 0.2]ᵀ.

Third TMG Distribution

The third TMG distribution has the following parameters:

$$\mu = \begin{bmatrix}0\\0\end{bmatrix}, \qquad \Sigma = \begin{bmatrix}1 & 0.8\\0.8 & 1\end{bmatrix}, \qquad a = \begin{bmatrix}0.5\\0.5\end{bmatrix}, \qquad b = \begin{bmatrix}8\\8\end{bmatrix}, \tag{4.1.5}$$

along with the following true mean and covariance matrix:

$$\mu = \begin{bmatrix}1.257853\\1.257853\end{bmatrix}, \qquad \Sigma = \begin{bmatrix}0.2950358 & 0.1571118\\0.1571118 & 0.2950358\end{bmatrix}. \tag{4.1.6}$$

The mean of the untruncated distribution is located outside the support of the TMG distribution. This sometimes happens in cases where the values in the tail of the distribution are of interest. The plots of the generated samples from each sampler can be seen in Figure 4.6.

Table 4.3 presents the estimated mean and covariance of the samples from each sampler, along with the number of effective samples, divided and not divided by CPU runtime in seconds, for the samples in Figure 4.6.

Sampler      x̄                              Q                                                      N_eff    N_eff/CPU
HMC          [1.25885906, 1.26349034]ᵀ      [[0.29036326, 0.14970013], [0.14970013, 0.28997145]]   8504.5   524.0
ESS, J=1     [1.25300642, 1.25588111]ᵀ      [[0.28283208, 0.1537371], [0.1537371, 0.29404967]]     2345.9   12.9
ESS, J=5     [1.27892596, 1.29781343]ᵀ      [[0.28637714, 0.15064811], [0.15064811, 0.3291746]]    718.6    16.1
ESS, J=10    [1.28374485, 1.28558345]ᵀ      [[0.32047451, 0.18124769], [0.18124769, 0.30509886]]   430.7    17.2
Gibbs        [1.2640684, 1.26334693]ᵀ       [[0.30760738, 0.16783637], [0.16783637, 0.30564369]]   8678.0   370.2

Table 4.3: Estimated moments, number of effective samples, and number of effective samples per second of CPU runtime of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [0.5, 0.5]ᵀ, and b = [8, 8]ᵀ.

Figure 4.5 shows the ACF values of the first 10 lags for the samples in Figure 4.6, generated by the different sampling methods.

Figure 4.5: ACF values of the first 10 lags from 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler in each variable of the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [0.5, 0.5]ᵀ, and b = [8, 8]ᵀ.

Figure 4.6: Plots of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [0.5, 0.5]ᵀ, and b = [8, 8]ᵀ. (Panels: (a) HMC, (b) ESS(1), (c) ESS(5), (d) ESS(10), (e) Gibbs.)

Fourth TMG Distribution

In this case, we choose a TMG distribution with d = 10 and parameters

$$\mu = 0_{10}, \qquad a = -1_{10}, \qquad b = 1_{10}, \tag{4.1.7}$$

where 0₁₀, 1₁₀ ∈ ℝ¹⁰ and Σ is as in Figure 3.1. Since this is a high-dimensional distribution, showing pairwise distribution plots and the estimated moments would be impractical. The effective sample sizes are presented in Table 4.4, while the ACF values of the first 3 lags for each method are shown in Figure 4.7.

Sampler      N_eff     N_eff/CPU
HMC          9036.1    1113.4
ESS, J=1     11060.0   39.1
ESS, J=5     9833.4    129.2
ESS, J=10    11278.7   255.7
Gibbs        277.5     3.3

Table 4.4: Number of effective samples and number of effective samples per second of CPU runtime of 10000 sampled values from the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = 0₁₀, a = −1₁₀, b = 1₁₀, and Σ as in Figure 3.1.

Figure 4.7: Plots of the first three lags ACF values of the Exact HMC sampler, Analytic ESS sampler with J = 1, 5, 10, and Gibbs sampler for the TMG distribution with µ = 0₁₀, a = −1₁₀, b = 1₁₀, and Σ as in Figure 3.1. (Panels: (a) HMC, (b) ESS(1), (c) ESS(5), (d) ESS(10), (e) Gibbs.)

4.2 Gaussian Process under Linear Inequality Constraints

Unconstrained GP

From the predictive distribution given linear inequality constraints (2.5.8 and 2.5.9), sampling from a GP under linear inequality constraints requires us to sample from a TMG distribution first. Hence, for each sampler, sample paths of the constrained GP are generated in order to compare the validity and quality of the samplers. At each test input, the ACF values of the first three lags of the samples from each sampler are also compared. The value J = 5 is used for Analytic ESS, as it gives a compromise between the number of effective samples and the runtime.

Figure 4.8 shows how the unconstrained GP specified in Section 3.1 samples paths with and without the training data described in Section 3.2. To sample paths from this GP, a finite set of 500 uniformly spaced points between 0 and 1 is chosen as the test locations X*, at which values are sampled from the Gaussian distribution.

Figure 4.8: Sample paths of unconstrained GP.

GP with Boundedness Constraints

The first type of constraint to impose is the boundedness constraint, which restricts the values of f to be between 0 and 1. Figure 4.9 shows 200 sample paths of the constrained GP for each sampling method, along with their sample means.

Figure 4.9: Sample paths of GP under boundedness constraint.

Figure 4.10 shows the ACF values of the first three lags for the GP samples in Figure 4.9 under the boundedness constraint. The ACF values are calculated at each test point, X* = [x₁*, ..., x₅₀₀*].

Figure 4.10: The first three lags ACF values from samples of GPs under boundedness constraint.

In Table 4.5, we can see how long each sampling method takes to sample 200 paths of the GP under the boundedness constraint. The number marked with an asterisk indicates the fastest runtime of the three samplers.

Method   Runtime (s)
HMC      2.46
ESS      1.15*
Gibbs    6.48

Table 4.5: CPU runtime (seconds) of the Exact HMC sampler, Analytic ESS sampler with J = 5, and Gibbs sampler for the GP given the training data and the boundedness constraint between a = 0 and b = 1.

GP with Monotonicity Constraint

In Figure 4.11, the sampled paths of the GP under the monotonicity constraint are shown for each sampler. Specifically, we restrict the function f to be monotonically increasing, that is, df(x)/dx ≥ 0.

Figure 4.11: Sample paths of GP under monotonicity constraint.

Figure 4.12 shows the ACF values of the samples from Figure 4.11, generated from the GP under the monotonicity constraint.

Figure 4.12: The first three lags ACF values from samples of GPs under monotonicity constraint.

The CPU runtime to draw 200 samples for each sampler of the GP under the monotonicity constraint in Figure 4.11 is presented in Table 4.6.

Method   Runtime (s)
HMC      9.96
ESS      0.79
Gibbs    3.75

Table 4.6: CPU runtime (seconds) of the Exact HMC sampler, Analytic ESS sampler with J = 5, and Gibbs sampler for the GP given the training data and the monotonicity constraint df(x)/dx ≥ 0.

5 Discussion

5.1 Results

There are some interesting things to note in the results. The discussion is divided into two parts: the first focuses on the results from the experiments on the different TMG distributions (Section 4.1), and the second discusses the results from Section 4.2 regarding GPs under linear inequality constraints.

Truncated Multivariate Gaussian Distributions

First TMG Distribution

As we can see from Figure 4.1, all of the plots of the generated samples look similar, judging both by the scatter of the samples and by the histograms. This observation is confirmed by the estimated means and covariance matrices in Table 4.1, which are not far off the true mean and covariance of the TMG distribution (4.1.2).

Looking at the numbers of effective samples in Table 4.1, Analytic ESS with J = 10 produces the samples with the lowest number of effective samples. This result is expected, because a higher value of J, the number of samples generated on each ellipse of the Analytic ESS sampler, tends to produce more correlated samples. The Gibbs sampler also produces a lower number of effective samples compared to the Exact HMC sampler and the Analytic ESS samplers with J = 1 and J = 5.

However, when normalized by the CPU runtime, the numbers of effective samples from all Analytic ESS samplers fall below the Gibbs sampler, while the Exact HMC sampler has the highest number of effective samples per second of CPU runtime. This is because it takes more time to analytically calculate the ellipse and slice for a single iteration of Analytic ESS than to run a single iteration of Exact HMC or the Gibbs sampler.

The autocorrelation function (ACF) plots tell us that the Gibbs sampler produces samples that are more highly correlated than those of the other samplers. As seen in Figure 4.2, the ACF values of samples from the Gibbs sampler are quite high in the first two lags. On the other hand, the ACF values of samples from Exact HMC and the Analytic ESS samplers are consistently close to zero. This behaviour is expected because both Exact HMC and Analytic ESS have update rules that allow them to explore the sample space more. Hence, the samples produced by Exact HMC and Analytic ESS are less correlated.

This particular distribution is a relatively simple one, since the mean is in the center of the support and the variances and covariances are moderate, so we can still see the shape of the untruncated distribution. In this scenario, Exact HMC works well because the shape of the distribution is convenient, and we can expect that most of the time the particle does not hit the walls.

Second TMG Distribution

All of the distribution plots in Figure 4.3 appear to be similar, which indicates that they are from the same distribution. The estimated moments from each sampler, shown in Table 4.2, are close to the true moments as well, although the estimated covariances from the samples of Analytic ESS with J = 10 are quite far from the true covariances.

An interesting result is found in the numbers of effective samples in Table 4.2. The number of effective samples of the HMC sampler is the lowest among all samplers. However, when the numbers of effective samples are scaled by the CPU runtime, the same pattern as before emerges, in which Exact HMC has the highest number of effective samples per second, followed by the Gibbs sampler and the Analytic ESS samplers with J = 10, 5, and 1 respectively. The analytic calculations of ellipses and slices in Analytic ESS could, once again, contribute to this behaviour.

Looking at the ACF values (Figure 4.4), the values of samples from the Exact HMC sampler are relatively higher at the first lag. We can also observe that the first-lag ACF value of samples from the Analytic ESS sampler with J = 5 is very low, which results in an estimated number of effective samples significantly larger than 10000. Overall, however, all samplers produce samples that are not highly correlated, because the ACF values are close to zero.

Third TMG Distribution

Figure 4.6 tells us that all samplers generate similar samples, and indeed the comparison of moments in Table 4.3 shows that they are relatively close to the true moments. However, samples from the Analytic ESS sampler with J = 10 and the Gibbs sampler have estimated covariance values that are quite far from the truth.

A similar pattern to the previous distributions emerges for the number of effective samples per second. Exact HMC once again comes out on top, followed by the Gibbs sampler and the Analytic ESS samplers with J = 10, 5, and 1 respectively. One thing to notice is that there is a large gap in the number of effective samples per second between all Analytic ESS samplers and the other two methods.

As discussed by Murray, Adams, and MacKay (2010) and Fagan, Bhandari, and Cunningham (2016), ESS tends to work well when the prior aligns properly with the target (posterior) distribution and the likelihood is weak. In this case, we are trying to generate samples from a low-probability region of a multivariate Gaussian distribution. The prior has mean µ = [0, 0]ᵀ, while the likelihood is the indicator of the region 1(x₁, x₂ > 0.5). The posterior is then a TMG supported within that region.

We can see that most of the prior mass lies outside the support of the posterior distribution, and this results in less effective exploration of the posterior space. As illustrated in Figure 5.1, the set of possible new states (the dashed orange line inside the red box) is limited to only a portion of the whole ellipse. It also affects the exploration of the posterior space because oftentimes the auxiliary variable ν does not change the ellipse in a significant way. Fagan, Bhandari, and Cunningham (2016) use expectation propagation to adjust the parameters of the prior to alleviate this problem.

Figure 5.1: Illustration of an ellipse E in a single iteration of Analytic ESS for the TMG distribution with µ = [0, 0]ᵀ, σ₁² = σ₂² = 1, σ₁,₂ = σ₂,₁ = 0.8, a = [0.5, 0.5]ᵀ, and b = [8, 8]ᵀ. The orange dashed ellipse depicts E given an initial state x and auxiliary variable ν, while the blue ellipses represent the contour of the prior.

The ACF plots in Figure 4.5 confirm the numbers of effective samples, because the samples from all Analytic ESS samplers are more highly correlated with each other. Notice that there is a slowly decreasing trend for the Analytic ESS samplers and the Gibbs sampler, although the latter already reaches an ACF of around zero at lag 4. On the other hand, the ACF values of samples from Exact HMC are consistently close to zero.

Fourth TMG Distribution

There is a significant difference between the number of effective samples from the Gibbs sampler and the other methods, as seen in Table 4.4. While both Exact HMC and Analytic ESS generally produce numbers of effective samples close to N = 10000, the Gibbs sampler only produces about 277 effective samples. Exact HMC still produces the largest number of effective samples per second due to its short runtime.

The numbers of effective samples are in line with the ACF values shown in Figure 4.7, which show that the ACF values of samples from the Gibbs sampler are close to 1 for lags 1 to 3. On the other hand, the ACF values of samples from Exact HMC and Analytic ESS are consistently close to zero. This behaviour can be explained by the structure of the covariance matrix in Figure 3.1, in which the covariance values are relatively high for neighboring variables. This affects the performance of the Gibbs sampler because it limits the exploration of the sample space and therefore increases the correlation between samples. On the flip side, high covariance values do not affect the performance of Exact HMC and the Analytic ESS sampler as much, because their update rules are not based on the full conditional of a single variable given all other variables.

Comparison of the Sampling Methods

We discussed the performance of the Exact HMC sampler and the Analytic ESS samplers with J =

On top of that, it runs in significantly less time. Both facts result in very high numbers of effective samples per second, which indicates that it is a more efficient sampler than Analytic ESS and the Gibbs sampler.

Analytic ESS failed to beat the Gibbs sampler in terms of the number of effective samples generated per second. This is because it takes longer to draw the same number of samples with Analytic ESS than with the other two methods. The choice of the number of draws made on a single ellipse (J) is a trade-off between the correlatedness of the samples and the runtime of the algorithm. Generally, a higher value of J results in more correlated samples but less running time.

In the third case, Analytic ESS produced a lower number of effective samples because the samples are highly correlated, as the sampler suffers from a misaligned prior. In the high-dimensional case (the fourth experiment), the Gibbs sampler suffers because the variables are highly correlated with each other, while the other methods thrive.

Gaussian Process under Linear Inequality Constraints

Unconstrained GP

In the plots of samples from an unconstrained GP (Figure 4.8), we can see how the unconstrained GP behaves with and without the training data. Without the training data, we sample paths from the prior distribution, which has µ = 0. As soon as the training data is introduced, we sample from a posterior distribution whose mean and covariance are more tailored to the training data. We can also observe that the paths vary more the further x ∈ X* is from the training data.

GP with Boundedness Constraints

In the comparison plots of samples from GPs under the boundedness constraint (Figure 4.9), all 200 sampled paths lie inside the constraints. The sample means of all GPs also look quite similar. Another thing to notice is that the sampled paths from the Gibbs sampler are more closely grouped together than those of the other two. This indicates that the sampled values from the Gibbs sampler are more highly correlated. As discussed earlier in this section, the Gibbs sampler suffers in high dimensions, especially when the covariance values are high. This results in the Gibbs sampler not exploring the sample space properly, hence the compactness of the sample paths.

This is also confirmed by the ACF values for the first three lags of the sampled paths of the GP under the boundedness constraint in Figure 4.10. Although the ACF values fluctuate and are relatively lower at values of x closer to the training points, in general they are higher and closer to 1 compared to the ACF values for the other two GPs, which used Exact HMC and Analytic ESS.

In terms of the runtime of GP sampling, there is a stark difference between the three algorithms. The GP with the Analytic ESS sampler runs about twice as fast as the GP with Exact HMC when sampling 200 paths. Meanwhile, the GP with the Gibbs sampler needs around 6.5 seconds, which is more than two times slower than the GP with Exact HMC. One possible explanation is that there are significantly more bounces in a single iteration of Exact HMC in the constrained Gaussian process case, in which we use 30 dimensions. Analytic ESS, on the other hand, spends most of its runtime analytically computing the feasible sampling region, which is still quite time-consuming but scales better with the number of dimensions.

GP with Monotonicity Constraint

The 200 sampled paths for each GP under the monotonicity constraint, using the different sampling methods, can be seen in Figure 4.11. Notice that the sampled paths are predominantly
