Particle-Based Online Bayesian Learning of Static Parameters with Application to Mixture Models

(1)

IN

DEGREE PROJECT MATHEMATICS, SECOND CYCLE, 30 CREDITS

STOCKHOLM SWEDEN 2020,

Particle-Based Online Bayesian Learning of Static Parameters with Application to Mixture Models

RUTGER FUGLESANG

KTH ROYAL INSTITUTE OF TECHNOLOGY

(2)

(3)

Particle-Based Online Bayesian Learning of Static Parameters with Application to Mixture Models

RUTGER FUGLESANG

Degree Projects in Mathematical Statistics (30 ECTS credits)

Degree Programme in Applied and Computational Mathematics (120 credits) KTH Royal Institute of Technology year 2020

Supervisor at KTH: Jimmy Olsson Examiner at KTH: Jimmy Olsson

(4)

TRITA-SCI-GRU 2020:306 MAT-E 2020:078

Royal Institute of Technology School of Engineering Sciences KTH SCI

SE-100 44 Stockholm, Sweden URL: www.kth.se/sci

(5)

Sammanfattning

This thesis investigates the possibility of using Sequential Monte Carlo methods (SMC) to create an online algorithm to infer properties from a dataset, such as unknown model parameters. Statistical inference from data streams tends to be difficult, and this is particularly the case for parametric models, which will be the focus of this paper. We develop a sequential Monte Carlo algorithm sampling sequentially from the model’s posterior distributions. As a key ingredient of this approach, unknown static parameters are jittered towards the shrinking support of the posterior on the basis of an artificial Markovian dynamic allowing for correct pseudo-marginalisation of the target distributions. We then test the algorithm on a simple latent Gaussian model, latent Gausian Mixture Model (GMM), as well as a variable dimension laten GMM. All tests and coding were done using Matlab. The outcome of the simulation is promising, but more extensive comparisons to other online algorithms for static parameter models are needed to really gauge the computational efficiency of the developed algorithm.

(6)

(7)

Partikelbaserad Bayesiansk realtidsinlärning av statiska modellparameterar med tillämpning på mixturmodeller

Author: Rutger Fuglesang Supervisor: Prof. Jimmy Olsson

Sammanfattning

Detta examensarbete undersöker möjligheten att använda Sekventiella Monte Carlo me- toder (SMC) för att utveckla en algoritm med syfte att utvinna parametrar i realtid givet en okänd modell. Då statistisk slutledning från dataströmmar medför svårigheter, särskilt i parameter-modeller, kommer arbetets fokus ligga i utvecklandet av en Monte Carlo algoritm vars uppgift är att sekvensiellt nyttja modellens posteriori fördelningar. Resultatet är att okän- da, statistiska parametrar kommer att förflyttas mot det krympande stödet av posteriorn med hjälp utav en artificiell Markov dynamik, vilket tillåter en korrekt pseudo-marginalisering utav mål-distributionen. Algoritmen kommer sedan att testas på en enkel Gaussisk-modell, en Gaussisk mixturmodell (GMM) och till sist en GMM vars dimension är okänd. Kodningen i detta projekt har utförts i Matlab.

(8)

(9)

1 Introduction

1.1 Bayesian Inference

Today Bayesian inference plays an important role in a wide range of scientific and engineering fields, such as signal processing, automatic control, biology, and mathematical finance to mention just a few. The ability to be able to update and improve a model continuously as data arrives with an online algorithm has clear efficiency and computational appeal over clunkier batch methods. An online algorithm also allows decisions to be made in real time, which could be critical in some applications. For instance, in financial fraud-detection where millions of transactions are made every day, real-time decision making is of utmost importance [10].

1.2 The problem

In static parameter models we have a sequence of i.i.d (independent and identically distributed) random variables (Y_t), t ∈ N and a parametric model fθ, θ ∈ R^d where we would like to ’learn’ the value of the parameter θ based on observed data. However, finding an efficient online algorithm for the task has been a fundamental problem in the area for some time.

Some modern attempts, including this paper, utilize Sequential Monte Carlo (SMC) methods. In order to find an online algorithm for the static parameter θ we introduce N particles which are propagated with every new data point that arrives. Here the goal for an online algorithm is that the posterior π(θ|y1:t) can be updated given the particle’s previous position and solely the newest data point. The fundamental difficulty with achieving this goal is that due to the posterior consistency (see section 2.1.2), the support for the posterior shrinks for every propagation step.

As a result, the particles may collapse too quickly without having explored the full support of the actual posterior.

1.3 Related research

There is a plethora of recently published algorithms that tackle the issue. One class of such algorithms operate under the framework of variational inference (VI) [13]. These methods attempt to approximate the posterior density by a variational density qθ characterized by a set of parameter θ. The latter focuses on optmizing the parameters by minimizing a divergence measure between qθ

and the posterior, for instance using the Kullback-Leibler divergence measure.

A recent method called ’Nested Particle Filters’ [15] involves, as the title implies, using a particle filter within a particle filter to assure that the propagation of the main SMC particles is done in a manner that explores the full posterior.

Another technique operating under the SMC framework is to create a kernel for θ given y_1:t from which new particles are drawn. While alleviating the issue of section 1.2, this jittering technique has a key requirement that the built kernel must shrink at an appropriate rate to control the variance of the jittered particles. For instance, [6] introduces a plausible jittering scheme, which is however rather complicated and relies on the assumption that the posterior consistency happens at a prescribed rate. It is this train of thought that has led to the project at hand and raised the question if a similar manner of perturbation can be achieved via a simpler scheme.

(12)

Various methods have been suggested, with various degrees of success. Some attempts lack a theoretical guarantee of convergence while others are only valid for a small, specific set of conditions [5].

1.4 This Thesis

1.4.1 Aim

This project aims to find an online algorithm for Bayesian inference of static parameters θ via SMC methods. We do so by seeking a simple method of jittering by introducing a Markovian dynamic on the SMC particles. In this thesis the aforementioned dynamic will be implemented via single steps of a Gibbs’ sampler (see Section 2.4.2), but other ways are possible. Thus we can succinctly summarize the primary objective of this paper as follows: Is it possible to introduce a Markovian dynamic on the particles as a form of jittering so as to ensure full exploration of the posterior.

1.4.2 Report Structure

The report is set up in the following manner. Section 2 elementarily covers the mathematical background required to follow the work at hand. Section 3 goes through the concept of jittering via a Markovian dynamic imposed upon the SMC particles. The latter also includes a thorough proof that shows that the sought posterior remains unchanged. Afterwards in Section 4 the idea is put to the test on 3 models of increasing difficulty. Firstly, on a simple Gaussian model in Section 4.1,then on a more difficult latent Gaussian Mixture Model in Section 4.2 and finally on a variable dimensional GMM in Section 4.3. Finally, Section 5 then summarizes the work and discusses possible extensions.

(13)

2 Background

2.1 Bayesian statistics

Let us ease in to the math behind this paper with a quick recap of what Bayesian statics entail. In 1763 Thomas Bayes [3] introduced his famous theorem, which in its raw form is stated as:

π(θ|y) =p(y|θ)p(θ) p(y)

The innocuous looking theorem has developed a rather large school of thought in statistics that is based upon it. In Bayesian statistics we usually consider y the data, θ a set of parameters pertaining to the distribution of said data. Then we have the distributions: π(y|θ) is the likelihood (that is, the probability distribution of the data given a set of parameters), p(θ) the prior (a probability distribution chosen to reflect the statistician’s prior beliefs of the variable), p(θ|y) the posterior and p(y) the probability of the data.

In the context of Bayesian statistics we are often interested in inferring the values of θ. This is often done by declaring π(θ|y) ∝ p(y|θ)p(θ), where we know the distribution up to a normalizing constant. The normalizing constant can then either be figured out, if for instance we are using conjugate priors, or alternatively be worked around with methods such as Importance Sampling. Let us briefly look at a simple example of a weighted coin to really exemplify how this can be applied.

2.1.1 Weighted Coin Example

Say that we have been flipping a coin and that we begin to suspect that it may be weighted. We will let θ denote the ’weightness’ of it. That is θ = 0.5 would signify a balanced coin where the probability of turning heads is equal to that of getting tails. We may first say the the every ’coin flip trial’ is Bernoulli distributed. That entails that the outcome is distributed as:

p(y = k|θ) = θ^y(1 − θ)^1−y

where y = 0 denotes a result of ’heads’ and y = 1 one of ’tails’. And as each coin flip is independent from the latter if a series of coin flips y1:n were done, the probability of that outcome would be:

p(y_1:n|θ) =

n

Y

i=1

θ^yⁱ(1 − θ)^1−yⁱ

Now to really apply a proper Bayesian approach, we need to choose a prior p(θ) that is compatible.

In this case we may suggest θ ∼ β(1, 1) where a β(α, γ) has the probability density function:

p(θ) = θ^a(1 − θ)^γ−1 B(α, γ)

Here B(α, β) is the beta-function. Now if we for the moment ignore the denominator of Bayes’

theorem and focus on the numerator we find:

p(y|θ)p(θ) = θ^a(1 − θ)^γ−1 B(α, γ)

n

Y

i=1

θ^yⁱ(1 − θ)^1−yⁱ

(14)

So if we have n trials (i.e. coin flips) and we let z denote the number of tails, then we have:

p(y|θ)p(θ) = θ^α+z(1 − θ)^γ+n−z B(α, γ) Without touching the denominator we know that:

π(θ|y) = p(y|θ)p(θ)

p(y) ∝ p(y|θ)p(θ) = θ^α+z(1 − θ)^γ+n−z B(α, γ)

But we know that a probability distribution that is proportional to another is equal to that one as well (since they both must integrate to one). Thus we can see from the form of the equation above that the posterior is distributed as π(θ|y) ∼ β(α + z, γ + n − z + 1). But how about α and γ? Well these variables really signify our priors. If we want to signify that we have no prior knowledge we may set them to α = 1 and γ = 1, yielding a uniform, or flat prior; otherwise one could set them asymmetrically to reflect our belief of the coin. But to conclude if we ever suspect a coin to be weighted, we simply need to flip it n times, count the z-times it lands as tails and then look at the posterior above.

2.1.2 Posterior Consistency

Consider a sequence of i.i.d. stochastic variables Xn, n = 1, 2, ... modeled as (X1, ..., Xn) ∼ pn((x1, ..., xn)|θ), θ ∈ Θ; θ ∼ Πn

where Πn(·|(X1, ..., Xn)) denotes the posterior of θ based on the n-samples. We can assume that with a growing n the set of samples X1, ..., Xn provides more and more information on θ. Once again, in the i.i.d. case we have pn((x1, ..., xn)|θ) = Qn

i=1f (xi|θ). Here if the true value of the hidden parameter θ is given by θ0∈ Θ we would intuitively expect Πn(·|(X1, ..., Xn)) to converge to the Dirac measure at the true value, δθ0. This convergence or contraction of the distribution is what is referred to as posterior consistency [19].

While posterior consistency seems as, and often is, a purely positive outcome of a Bayesian model, it can come with some negative consequences. In the case of Sequential Monte Carlo methods for static parameter models the posterior consistency converges so quickly that it can be quite difficult to move the SMC particles and properly explore the full support of the posterior.

2.2 Markov Models

This thesis presents an algorithm that is essentially a hybrid of an Markov Chain Monte Carlo algorithm and a Sequential Monte Carlo one. To understand these we need to reiterate from the ground up starting with Markov models. A sequence {Xt}_t∈N of stochastic variables are considered Markovian if the probability of being in state X_tconditional on all the previous states X_0:t−1only depends on the last state X_t−1. The transition probability of going from X_t−1 to X_tis given by a Markov transition kernel K, which can be defined as followes:

Definition 2.1. Markov Kernel

Let {X, A} and {Y, B} be measurable spaces. A markov kernel K : X 7→ Y with source (X, A) and target (Y, B) is a map K : B × X 7→ [0, 1] with the following properties:

(15)

i For every b ∈ B, the map x 7→ K(B, x) is A-measurable

ii For every x ∈ X, the map B 7→ K(B, x) is a probability measure on (Y, B)

In other words it associates to each point x ∈ X a probability measure K(x, dy) : b 7→ K(b, x) on (Y, B) such that for every measurable set b ∈ B, the map x 7→ K(b, x) is measurable with respect to the σ-algebra A. [14]

The formal definition above is rather clunky to any reader unfamiliar with the measure- theory formalities. So for the purposes of this paper we can loosely say that for the Markov chain X_0:t the markov kernel at K(x_t−1, X_t) equates to the transition probability measure p(x_t|X_t−1).

Having defined the Kernel we may give a rigorous definition of what constitutes a Markov Chain.

Definition 2.2. Markov Chain

Let (Ω, F , F, P) be a filtered probability space and let K be a Markov transition kernel on a measurable space (X, X ). An X-valued stochastic process {Xt}t≥0 is said to be a Markov chain under P , with respect to the filtration F and with transition kernel K if it is F-adapted and for all k ≥ 0 and A ∈ X

P (Xt+1∈ A|Ft) = Kt(Xt, A)

The distribution of X0is called the initial distribution of the chain, and X is called the state space [8].

In many cases, including this thesis, it is important that the distribution produced by the kernel is stationary. This means that the probability measure remains unchanged in the Markov chain as time progress. Equivalently we may say that a kernel is invariant (or formally that the probability measure π is invariant for a given kernel) if:

π(y) = Z

K(x, y)π(x)dx (1)

A Markov chain is denoted as reversible if:

K(y, x)f (y) = K(x, y)f (x)

Note that a reversible Markov chain is also invariant. This notion of reversibility or invariance is of utmost importance in this paper as the theoretical proof for the particle framework in section 3 assumes an invariant Markov Kernel.

2.3 Monte Carlo methods

As the algorithm developed in this applies MCMC moves within an SMC framework it is important that we fully understand these concepts. Firstly let us iterate what a Monte Carlo method entails.

Broadly speaking it is any method where we are using random samples to estimate a result. The basis of which is the law of large numbers: If we have a sequence of random variables X1, ..., XN

sampled from some distribution, then:

lim

N →∞

1 N

N

X

i=1

X_i→ µ

where µ is the common mean of X. To exemplify how a Monte Carlo method can be used practically, we shall below go through a classic example of estimating π via MC.

(16)

2.3.1 MC estimate of π

If we have a circle with radius r = ¹₂ centered in a unit square, we know that the area of the circle is Ac= ^π₄. Assuming that we can sample from the uniform U (0, 1) distribution, then we can sample points (x, y) in the unit square by drawing via x ∼ U (0, 1), y ∼ U (0, 1). We can also categorize if a point is within the circle or not by its distance from the center, as a point is within the circle if x²+ y²< 1 (Note that p(x²+ y²= 1) = 0 so it doesn’t matter how we treat the boundary), and outside otherwise. So after sampling N points, categorizing them all, and letting N_c denote the number of points inside the circle. Then, our estimate of π is given by:

π^{M C}_N 4 = N_c

N

2.4 Markov Chain Monte Carlo

Here we recap what MCMC methods entail, as these will be implemented in conjunction with a Sequential Monte Carlo framework for the algorithm developed. While basic MC methods are occasionally of use, they suffer in large dimensional settings and as such are somewhat restricted in their use. Markov chain Monte Carlo (MCMC) methods are a natural extension of the latter, and are much less restrictive. To estimate an aspect of a distribution π we in a pure MC setting simulate an i.i.d. sequence X1, ..., XN from the distribution. However, we may consider simulating a Markovian sequence instead. In that case, if we want to estimate the value of some function f (for instance if we want the mean then f (x) = x) under some distribution, π, then this would take the form:

π^MCMC_N (f ) = 1 N

N

X

i=1

f (X_i) For MCMC to be practical though we have 3 key requirements [8]:

i Simulating the chain {Xi}_i∈N given an arbitrary initial value X1 is an easily implementable process.

ii The chain {Xi}_i∈N satisfies conditions need to guarantee the convergence towards π, irrespec- tively of the initial value of X1.

To clarify how this may actually work we shall go through two MCMC algorithms which are relevant to this paper.

2.4.1 Metropolis-Hastings

The Metropolis-Hastings algorithm was in an earlier version published developed in 1950 by N.

Metropolis. The latter was quite narrow in its’ applications and in 1970 W.K. Hastings [17] improved upon it giving it the form we shall go through below. The algorithm is quite versatile in its applications, and can for instance be used to sample from a distribution π, which we sample from directly. This is done by simulating from a transition density g, which we call the proposal density.

(17)

Algorithm 1 Metropolis-Hastings Algorithm Set X1arbitrarily

for i = 1:N do

Generate X⁰ ∼ g(X_i, •)

Set Xi+1= X⁰ w.pr. min(1,^π(X

0)g(X⁰,X_i) π(Xi)g(Xi,X⁰)) Set Xi+1= Xi otherwise

end for

Above the ratio A(X, X⁰) = min(1,^π(X_π(X⁰^)g(X⁰^,Xⁱ⁾

i)g(X_i,X⁰)) is referred to as the acceptance ratio.

We will later on refer to f racπ(X⁰)g(X⁰, X_i)π(X_i)g(X_i, X⁰) as α(X, X⁰). The acceptance ratio has that given form because in that manner the sequence {X_i}_i≥1 satisfies the detailed balance in Equation 1. The Metropolis-Hastings algorithm will not be used directly in this paper, however it is a precursor to the Reversible Jump Monte Carlo algorithm, which will be implemented in the variable dimension model. Now let us show that the acceptance ratio above is reversible (and recall that reversibility implies stationarity):

Given that we are in a state x and are moving to x⁰, then we need the following to hold:

π(x)p(x, x⁰) = π(x⁰)p(x⁰, x) This is equivalent to:

Z

A

π(x)K(x, A⁰)dx = Z

A⁰

π(x⁰)K(x⁰, A)dx⁰, ∀A, A⁰ (2) Where K(x, B) :=R

Bp(x, y)dy is the transition kernel of Metropolis-Hastings. Now note that:

K(x, A⁰) = Z

A⁰

g(x, x⁰)α(x, x⁰)dx⁰+

Z

g(x, x⁰)[1 − α(x, x⁰)]dx⁰

· I_{x∈A0} (3) The above states that starting at a state x, we will be in the set A⁰if we propose a move to x⁰ ∈ A⁰ and accept it. Or we if propose any move and reject it, given that x ∈ A⁰. Now let us insert Equation 3 into Equation 2 and denote the termR g(x, x⁰)[1 − α(x, x⁰)]dx⁰ as F (x) . First the left hand side becomes:

Z

A

π(x)

"

Z

A⁰

g(x, x⁰)α(x, x⁰)dx⁰+F (x) · I_{x∈A⁰_}

# dx

= Z

A

Z

A⁰

π(x)g(x, x⁰)α(x, x⁰)dx⁰dx + Z

A∩A⁰

F (x) · I_{x∈A⁰_}dx Similarly the right hand side becomes:

Z

A⁰

π(x⁰)

"

Z

A

g(x⁰, x)α(x⁰, x)dx+F (x⁰) · I_{x⁰_∈A}

# dx⁰

= Z

A⁰

Z

A

π(x⁰)g(x⁰, x)α(x⁰, x)dxdx⁰+ Z

A⁰∩A

F (x⁰) · I_{x⁰_∈A}dx⁰

(18)

As the terms with the indicator functions are clearly equal when we set the two sides together we have:

Z

A

Z

A⁰

π(x)g(x, x⁰)α(x, x⁰)dx⁰dx = Z

A⁰

Z

A

π(x⁰)g(x⁰, x)α(x⁰, x)dxdx⁰ (4) Thus we see that reversibility and the detailed balance equation hold, with this acceptance ratio.

2.4.2 The Gibbs’ sampler

The Gibbs sampler is an MCMC algorithm that generates a Markov chain and will be implemented in the algorithms described later on in this paper. The algorithm was first described in 1986 by the brothers Stuart and Donald Gaiman [4] and takes the following form:

If we have data ~y distributed with dependencies on x₁, x₂, ..., x_n and we have access to all the posteriors p(x_k|~y, x₁, ..., x_k−1, x_k+1, .., x_n), k = 1, 2, ..., n then we can draw from the full posterior as follows:

Algorithm 2 Gibbs sampler Set/Draw all x⁰_k from priors for m = 1:M do

for k=1:n do

θ^m_k ∼ p(xk|~y, x^m₁, ..., x^m_k−1, x^m−1_k+1, .., x^m−1_n ) end for

end for

All θ^m_k are then sampled from the full posterior. It is important to note that a single step of the Gibbs’ sampler (i.e. m = 1 above). Since this result is important for the developments to come, we provide the proof (as given by [8]) in the following:

Theorem 1. Each of the m individual steps of the Gibbs sampler is π reversible and thus admits π as a stationary probability density function.

Consider the step that updates the kth component and denote by Kk the corresponding transition kernel. We can always write λ = dx_kN dx−k where dx_k and dx_−k are measures on X_k and X_−k, respectively, such that dx_k dominates π_k(·|x_−k) for all values of x_k∈ Xk. With these notations,

K_k(x, dx⁰) = δ_x_−k(dx⁰_−k)π(x⁰_k|x_−k)dx⁰_k Hence, for any functions f1, f2∈ Fb(X),

Z Z

f1(x)f2(x⁰)π(x)dxK(x, dx⁰)

= Z

f1(x)π(x)dxk

Z

f2(x⁰_k, x_−k)πk(x⁰_k|x_−k)dx⁰_k

dx_−k

where (x⁰_k, x_−k) refers to the element u of X such that uk = x⁰_k and u_−k = x_−k. Because π(xk, x_−k)πk(x⁰_k|x_−k) = πk(xk|x_−k)π(x⁰_k, x_−k), we may also write

(19)

Z Z

f1(x)f2(x⁰)π(x)λ(dx)K(x, dx⁰)

= Z

f₂(x⁰_k, x_−k)π(x⁰_k, x_−k)dx⁰_k Z

f₁(x_k, x_−k)π_k(x_k|x−k)dx_k

dx_−k

which is the same expression as before where the roles of f1 and f2 have been exchanged, thus showing that the detailed balance condition (equation 1) holds.

2.5 Sequential Monte Carlo

Sequential Monte Carlo methods, also known as particle filters, are a set of methods used to solve filtering problems, and in the context of this paper Bayesian inference filtering problems. In the context of the present thesis, Particle filters use a set of particles or samples to represent the posterior from which we are trying to sample from. The particle filter methodology is often used to solve hidden Markov model (HMM) filtering problems as well as other nonlinear filtering problems.

2.5.1 Problem formulation

The typical SMC problem looks as follows. We want to estimate sequentially sequences I(ft)t≥0of expectations

I(ft) = E^p(x0:t)[ft(x0:t)] = Z

ft(x0:t)pt(x0:t)dx0:t. (5) Here f_tis some function of interest, which is integrable with respect to the posterior. For instance, if we want an estimate of the conditional mean then we simply have f_t= x_0:t.

To gain a better understanding of how this theory is applied, let us take a look at one of the developments that lead to the earliest successful SMC algorithm, the Sequential Importance Sampling with Resampling algorithm. FIND REF

2.5.2 Importance Sampling

Assume we want to approximate integrals I(f_t) as in Equation 5, which is intractable since it is only known up to a normalizing constant i.e. p(x0:t) = ^g(x_c^0:t⁾, where c is unknown. We can then introduce an instrumental distribution k(x0:t), which has a support that covers that of the target distribution. The integral I(ft) can then be rewritten as:

I(ft) = Z

ft(x0:t)p(x0:t)dx0:t

= Z

f_t(x_0:t)p(x0:t)

k(x_0:t)k(x_0:t)dx_0:t

= Z

ft(x0:t)p(x_0:t)

k(x0:t)k(x0:t)dx0:t

= Z

ft(x0:t)ω(x0:t)k(x0:t)dx0:t

(20)

Here ω(x_0:t) = ^p(x_k(x^0:t⁾

0:t) is referred to as the importance weight. If we then simulate N i.i.d. particles from k(x0:t) we can then formulate a Monte Carlo estimate for I(ft) as:

IM C(ft) = PN

i=1ft(xⁱ_0:t)ω(xⁱ_0:t) PN

i=1ω(xⁱ_0:t)

Importance sampling is a general Monte Carlo integration method but it, in the form presented, is not suitable for an online algorithm as we need all the point x1:tbefore being able to estimating I(ft). Thus, if a new data point xt+1 arrives, the computation needs to be done anew as it cannot be updated recursively. This problem leads us to one of the earliest SMC methods, the Sequential Importance Sampling algorithm.

2.5.3 Sequential Importance Sampling (SIS)

The earliest SMC attempts can be summarized by the Sequential Importance Sampling (SIS) algorithm, which as the name implies is a sequential version of the classic Importance Sampling method and has been known since the 1970’s.

As in the last case we want to evaluate integrals of the form of equation 5, but are operating under the same constraints as outlined in the Importance Sampling case. The SIS framework functions by approximating the integral using particles. A particle X0:n = X0, ..., Xn is a draw with corresponding importance weight w0:n= w(X0:n). Since we are unable to draw directly from the target distribution we instead draw from an importance distribution k(x0:t). Now to be able to function recursively, i.e. have an online algorithm we need the instrumental density to be written as a product of transition densities:

k(x0:t) = k0(x0)

n

Y

k=1

k(x0:k−1; xk)

Given a particle X0:nthe new state Xn+1can be sampled from the transition density k(x0:n; xn+1).

The updated particle X0:n+1is then simply set as X0:n+1= (X0:n; Xn+1). The importance weights can also be updated in an online fashion as follows:

w_n+1= g_n+1(X_0:n+1) kn+1(X0:n+1)

= gn+1(X0:n+1) gn(X0:n)kn(X0:n; Xn+1)

gn(X0:n) kn(X0:n)

= gn+1(X0:n+1)

g_n(X_0:n)k_n(X_0:n; X_n+1)wn

Now the Monte Carlo estimate of equation 5 is given by:

I_{M C}(f_n) =

N

X

i=1

wⁱ_n PN

i=1w_nⁱf_n(X_0:nⁱ )

The sum within the sum above is simply there to self-normalize the weights and takes care of the unknown normalizing constant of the target distribution.

(21)

2.5.4 Weight degeneracy

While the SIS algorithm shows promising results for a few initial iterations it quickly degenerates.

This flaw is a result of the multiplicative nature of updating the weights and is referred to as weight degeneracy. What happens is that quite quickly a few weights dominate all the others leaving our Monte Carlo estimate flawed. As a result, the algorithm historically did not find any practical application till a solution in the form of resampling was put forth in 1990’s, which brings us to the first truly successful implementation of Sequential Monte Carlo methods: The Sequential Importance Sampling with Resampling algorithm.

2.5.5 Sequential Importance Sampling with Resampling (SISR)

In 1993 Gordon et al [12] solved the weight degeneracy issue by adding a resampling step after a single sweep of the SIS algorithm. Resampling is a straightforward step that functions as follows:

Given N particles with corresponding importance weights, (X_0:nⁱ , w_0:nⁱ ), new indices Iⁱ are drawn via a multinomial process with probabilities PN^w¹ⁿ

j=1w^j_n, ...,PN^w^Nⁿ

j=1w_n^j. Thus, Iⁱ = k w. pr. w_n^k

PN j=1wn^j

Afterwards we reset the particles as X_0:nⁱ = X_0:n^Iⁱ and we set all weights as w_nⁱ =

g_n+1(X_0:n+1)

gn(X0:n)kn(X0:n;Xn+1). This resampling has become a staple of all SMC methods and the method above of multinomial resampling (named as such as we are using a multinomial process) is simply the first and most simple way of doing so. Usually resampling is implemented systematically after every propagation step, however much optimization can be done on when and how often to apply the step. The algorithm in pseudo-code can then be written as:

Algorithm 3 Given (X_0:tⁱ , wⁱ_0:t) for i in nParticles do

X^˜_0:tⁱ = X_tⁱ w.pr. _{P ω}^ωⁱ^ti t

X_t+1ⁱ ∼ kt(X^˜_0:tⁱ , X_t+1ⁱ ) end for

for i in nParticles do X_0:t+1ⁱ ←− (X^˜_0:tⁱ , X_t+1ⁱ ) ωⁱ_t+1= _g ^gⁿ⁺¹^(X^0:n+1⁾

n(X_0:n)k_n(X_0:n;X_n+1)

end for

2.6 Reversible Jump Markov Chain Monte Carlo

Later, in Section 4.3 we will be tackling a variable dimensional model. That is a multidimensional model (we will be looking at a Gaussian mixture model) where the dimensionality is unknown. To accomplish this we need a method for our SMC particles to explore a variety of dimensions. So to this end we will be implementing the Reversible Jump Markov Chain Monte Carlo (RJMCMC)

(22)

algorithm.

In his 1995 paper [7], P. Green presented the RJMCMC algorithm, which allows one to move between models with different dimensions. The algorithm is essentially that of a Metropolis- Hastings but with specific trans-dimensional proposals chosen to fulfill the stationarity of the desired distribution.

For RJMCMC we want to make moves from x ∈ R^k to x⁰ ∈ R^k⁰. The moves are done by drawing random varibles W and W⁰ and then using a deterministic function that maps (x, w) to (x⁰, w⁰) such that (x⁰, w⁰) = h(x⁰, w⁰). We may also need to fill w or w⁰ with auxiliary variables so that the dimension of the (x, w) matches that of x⁰, w⁰. I.e. if w ∈ R^d and w⁰ ∈ R^d0 then we have the dimension matching requirement:

k + d = k⁰+ d⁰

Now since x is known prior to the move, and the move is purely deterministic given (x, w), then the candidate transition density ofthe Metropolis-Hastings algorithm, g(x, x⁰), reduces to q(w). That is the density of the random variable we drew. In this case the reversibility equation we derived for the MH algorithm (Equation 4) becomes:

Z

A

Z

A⁰

π(x)q(w)A(x, x⁰)dwdx = Z

A⁰

Z

A

π(x⁰)q(w⁰)A(x⁰, x)dx⁰dw⁰

Now for the above to properly match we need to use the substition on the integrals from x⁰ and w⁰ back to x and w:

dx⁰dw⁰ =

∂(x⁰, w⁰)

∂(x, w)

dxdw Here

∂(x⁰,w⁰)

∂(x,w)

is the absolute value of the determinant of the Jacobian of the transformation from (x, w) to (x⁰, w⁰). The invariance equality then holds by letting the acceptance ratio of the RJMCMC algorithm be:

A(x, x⁰) = min

1,π(x⁰)q(w⁰) π(x)q(w) ·

∂(x⁰, w⁰)

∂(x, w)

It is in this manner we will, in the variable dimension model, be proposing and accepting moves which allow the particles to change dimensions.

(23)

3 Particle-based Online Learning

3.1 Introduction

As outlined in the previous section, the posterior consistency proves to be somewhat of a nuisance in the attempt to find an SMC-based online algorithm for static parameter models as the posterior converges before the SMC particles can explore the posterior. One method of solving this is to periodically move the particles. However, these nudges must be done in such a manner that does not actually change the posterior. This technique of jittering, while feasible, can lead to some convoluted techniques [6]. The idea presented in this thesis is to perform jittering on the basis of a Markovian dynamics leaving the posterior invariant.

3.2 Mathematical Framework and Theorem

For our markovian dynamic to even be a possibility we need to first and foremost show that, given some assumptions, the filtered distribution from this new setup coincides with that of a classic Bayes model. For this section we will be using π to denote the classical posterior and π for the Markovian setup.

Specifically we want to show that the posterior of the new model with a Markovian dynamic is equal the classic Bayes posterior is given by

π(θ|y1:t) = p(y1:t|θ)π(θ) R p(y1:t|θ⁰)π(θ⁰)dθ⁰ And since we have i.i.d. data:

π(θ|y_1:t) = Qt

i=1p(yi|θ)π(θ) R p(y1:t|θ⁰)π(θ⁰)dθ⁰

In order to allow for a sequential representation of the posterior, we define the model

p(y1:t, θ1:t) = p(y1|θ1)π(θ1)

t−1

Y

s=1

Ks(θs, θs+1)p(ys+1|θs+1)

where we assume that the Markov kernel K leaves the posterior invariant:

Z

K_t(θ_t, θ_t+1)π(θ_t|y1:t)dθ_t= π(θ_t+1|y1:t)

And what we want to show is then:

π(θt|y1:t) ∝ π(θt|y1:t)

Since we know that if a probability density function is proportional to another probability density function, then the two of them must be equal thus we are really attempting to show:

π(θt|y1:t) = π(θt|y1:t) Let us summarize this as a theorem:

(24)

Theorem 2. A sequence of i.i.d stochastic variables (Y1, ..., Yt); t ∈ N modeled conditionally on a set of unknown parameters θ ∈ Θ as in p(y|θ) and the parameter θ modeled as a Markov Chain who’s filtered posterior at each time step t, πt(θt|y1:t), is invariant for the Markov Kernel Kt(θt, θt+1).

Then :

π(θt|y1:t) = πt(θt|y1:t) Proof. First, for all t and (θ, θ⁰)

Z

Kt(θ, θ⁰)

t

Y

s=1

p(ys|θ)p(θ)dθ ∝ Z

Kt(θ, θ⁰)π(θ|y1:t)dθ

= π(θ⁰|y1:t) ∝

t

Y

s=1

p(ys|θ⁰)π(θ⁰)

(6)

Now, for all 1 ≤ k ≤ t Z

. . . Z

π(θk)

k

Y

l=1

p(yl|θk)

t−1

Y

s=k

K(θs, θs+1)p(ys+1|θs+1)dθk. . . dθt−1

= Z

. . .

Z Z

π(θ_k)

k

Y

l=1

p(y_l|θk)K(θ_k, θ_k+1)dθ_k

!

p(y_k+1|θk+1)

t−1

Y

s=k+1

K(θ_s, θ_s+1)p(y_s+1|θs+1)dθ_k+1. . . dθ_t−1

∝ Z

. . . Z

π(θk+1)

k+1

Y

l=1

p(yl|θk)

t−1

Y

s=k+1

K(θs, θs+1)p(ys+1|θs+1)dθk+1. . . dθt−1

(7) In the step above we have applied Equation 6 between the last two lines. Now repeating Equation 7 yields

π(θt|y1:t) ∝ Z

. . . Z

p(θ1:t, y1:t)dθ1. . . dθt

= Z

. . . Z

π(θ1)p(y1|θ1)

t−1

Y

s=k+1

K(θs, θs+1)p(ys+1|θs+1)dθ1. . . dθt−1

∝ Z

. . . Z

π(θ₂)

2

Y

l=1

p(y_l|θ2)

t−1

Y

s=2

K(θ_s, θ_s+1)p(y_s+1|θs+1)dθ₂. . . dθ_t−1

∝ . . . ∝ Z

π(θt−1)

t−1

Y

l=1

p(yl|θt−1)K(θt−1, θt)p(yt|θt)dθt−1

(8)

In Equation 8 where we have ∝ . . . ∝ we are simply recursively repeating the steps from Equation 7. Now applying Equation 6 a final time gives

π(θt|y1:t) ∝ π(θt)

t

Y

l=1

p(yl|θt) ∝ π(θt|y1:t)

(25)

π(θt|y1:t) ∝ π(θt)

t

Y

l=1

p(yl|θt) ∝ π(θt|y1:t)

We have now shown that the filtered posterior of the particles from the Markovian dynamic is proportional to that from a classic setup.

3.3 Algorithm

Now that we have gone through the theory behind the algorithm, let us delve into how it practically works. First let us reiterate the goal of the algorithm. We have a datapoints Y1:n distributed via p(y|~θ) where ~θ is a set of latent variables. Here our goal is to sample from the posterior π(~θ|y), which can be used to, for instance, find the values of the unknown variables.

3.3.1 Step 1: Initiating particles

As mentioned earlier the algorithm is essentially a MCMC step within an SMC framework. The first step is then to instantiate the N SMC particles θ_1:N. Each particle θ_i is a vector that holds an estimate for each unknown variable of ~θ as well as the priors for all of these. So if ~θ were to be composed of 5 variables than each particle θi may be composed of something like 9 elements.

In the following we will denote by ψt the set of parameters governing the posterior of πt(θ|y1:t). It is then a simple matter of instantiating the particles as follows:

Algorithm 4

for i in nParticles do for j in ~θ do

Draw^jθⁱ₀ ∼ πjψ₀ⁱ(^jθ) end for

end for

In the form above^jθ₀ⁱ designates the j-th sub-component of the i-th particle. The 0 signifies that no datapoints have been processed as of yet. Accordingly, ^jψⁱ₀ is the set of parameters or priors necessary for the update of this specific component of θⁱ₀. For conciseness we will simply be writing

"Set/Draw all θ₀ⁱ from priors" for this step from now on.

3.3.2 Step 2: Resampling

For every sweep of the algorithm, as in all SMC methods, we must resample. This can be done before or after the particles are propagated. The merits of which are marginal and the differences are beyond the scope of this papaer. However, in this algorithm we have chosen to resample before propagation. For our algorithm we use the likelihood of a particle given the latest datapoints, or set of datapoints i.e p(y_t|θⁱ_t) as the weight. Thus, we resample in the following manner:

(26)

Algorithm 5 Resampling for i in nParticles do θ^˜ⁱ_t= θⁱ_tw.pr. _{P p(y}^p(y^t^|θ^tⁱ⁾

t|θⁱ_t)

end for

3.3.3 Step 3: Propagation

After having initiated the particles and resampled we’re ready to begin propagating the particles for each data point that arrives. It is at this point that an MCMC step is used, in all our model implementations we’ll be using a single Gibb’s step to fulfill this purpose but this is by no means the only way to do so and other MCMC methods could be used instead.

Algorithm 6 Propagation Given a datapoint y_t for i in nParticles do

for j in ~θ do ψ_tⁱ= f (ψ_t−1ⁱ , y_t)

jθⁱ_t∼^jKjψⁱ_t(^jθt−1,^jθt) end for

end for

As we are choosing conjugate priors for the estimates of ~θ in the particles, then these priors can be updated in a deterministic manner given a datapoint, which is why in we have ψ_t+1ⁱ = f (ψ_tⁱ, y_t). In the pseudo-code above^jKjψⁱ_t(^jθ_t−1,^jθ_t) signifies the transition kernel we are drawing propagating the particles via. It is here that we are applying our MCMC step.

3.3.4 Complete algorithm

Putting the three steps together the algorithm can be summarized in the following manner:

Algorithm 7

Set/Draw all θⁱ₀from priors for t in nData do

for i in nParticles do θ^˜ⁱ_t= θⁱ_tw.pr. _{P p(y}^p(y^t^|θ^tⁱ⁾

t|θⁱ_t)

for j in ~θ do ψ_t+1ⁱ = f (ψ^˜_tⁱ, yt)

jθⁱ_t∼^jKjψⁱ_t(^jθt−1,^jθt) end for

end for end for

The algorithm is in essence quite simple but actually implementing it can become quite cumbersome as we’ll see in the next section.

(27)

4 Model Implementations

4.1 Simple Gaussian Model

First we want to test the setup on a simple model, such as one for normally distributed data:

Y |θ ∼ N (µ, σ²) With Priors on θ = [µ, σ²]:

µ ∼ N (m,1

τ) i.e. τ is the precision of µ σ²∼ IG(a, b)

In this setup we have chosen simple conjugate priors with known posteriors:

p(σ²|~y, µ) ∼ IG(a + t/2, b +X1

2(y_i− µ)²)) p(µ|~y, σ²) ∼ N +_σ¹2P yi

τ +_σⁿ2

, (τ + n σ²)⁻¹

4.1.1 Mathematical background

The posteriors given above are found in the following manner:

p(µ, σ²|~y) ∝ p(µ, σ²~y) = p(~y|µ, σ²)p(µ)p(σ²) p(µk, σ², ~y) = (

t

Y

i=1

1 σ√

2πe⁻¹²⁽^yi−µ^σ ⁾²) 1 τ√

2πe⁻¹²⁽^µ−m^τ ⁾² b^a

Γ(a)(σ²)^−a−1e^−b^σ2 p(σ²|y, µ) ∝ 1

σ^t(σ²)^−a−1e^−b^σ2

t

Y

i=1

e⁻¹²⁽^yi−µ^σ ⁾²

∝ σ^{−2a−2−t}e^−b^σ2e^{P −}¹²⁽^yi−µ^σ ⁾²

∝ (σ²)^{−a−1−t/2}e⁻¹^σ2^(b^k⁺^P¹²^(yⁱ^−µ)²⁾ Thus we have: p(σ²|µ,~y) ∼ IG(a + t/2, b +X1

2(y_i− µ)²)). Now for p(µ|σ², ~y):

p(µ|σ², ~y) ∝ exp

−1

2((µ − m)τ )²−1

2(Xy_i− m σ )²

∝ exp

−1 2

τ 2 σ²(µ²(n

τ² + σ²) − 2µ_k(σ²m + 1 τ²P yi

) + C)

∝ exp −1 2

τ²(_τⁿ2 + σ²)

σ² (µ²− 2µ(σ²m +_τ2P y¹ _i)

n

τ² + σ² + D)

!

∝ exp

−1 2(n

σ²_k + τ²)(µ −τ²m +_σ¹2P yi n

σ² + τ² )

(28)

From here it is clear that:

p(µ|σ², ~y) ∼ N (τ²m +_σ¹₂P yi n

σ² + τ² , ( n

σ² + τ²)⁻¹) These posteriors can now be applied in the particle based sampler.

4.1.2 Particle based sampler with Gibbs step

One approach to implementing the Markovian dynamic in a particle based sampler, if the priors allow, to introduce a Gibbs step for every single particle. Here let our particles be θ_tⁱ = [µⁱ_t, σ_t²ⁱ, τ_tⁱ, mⁱ_t, aⁱ_t, bⁱ_t]. Where τ, m and a, b are the sufficient statistics needed for the update of µ and σ² respectively:

Algorithm 8

Set/Draw all θⁱ₀from priors for t in nData do

for i in nParticles do θ^˜ⁱ_t= θⁱ w.pr. p(yt|θ_tⁱ) aⁱ_t+1= a^˜ⁱ+ 0.5 bⁱ_t+1= bⁱ_t+ (yt− µ^˜ⁱ_t)² σ_t+1² ∼ IG(aⁱ_t+1, bⁱ_t+1) τ_t+1ⁱ = τ_t^˜ⁱ+_σ2¹

t+1i

mⁱ_t+1=

τ_t+1ⁱ m^˜ⁱ_t+ ^yt

σi 2t+1

τ_t+1ⁱ + ¹

σi 2t+1

µⁱ_t+1∼ N (mⁱ_t+1, ¹

τ_t+1ⁱ + ¹

σ2t+1i

) end for

end for

An implementation of the algorithm (using MATLAB) can be found in appendix A.

4.1.3 Results

Running this algorithm for a set of 500 datapoints normally distributed as N (2, 5) with hyperpa- rameters a = 1, b = 1, m = 0, τ = 3 gives the following results.

It is interesting to note that the algorithm above is essentially identical to one arrived at in [1] although they do not approach the problem in the same manner with a clear Markovian Dynamic.

(29)

(a) (b)

Figure 1: Plot of the particles after propagation and the analytical (unnormalized) joint posterior

4.2 Mixed Gaussian Model

Let us now investigate the possibilities of a particle-based algorithm for a mixture of gaussians.

Mixture models are a set of probabilistic models for representing the presence of subcategories in an overall population without data points identifying their membership to specific groups. Formally we could say that a distribution f is a mixture of K components if:

f (X) =

d

X

k=1

αkfk(X)

d

X

k=1

αk = 1

αk> 0, k = 1, 2, ..., d

Generally such a model is useful for inferring properties of the subcategories of a population when the datapoints are not labeled with which subcategory they pertain to. There are a wide variety of applications for such models from healthcare[9] or finance [11] to much more. The applications of GMMs are enormous as a GMM is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific nonzero amount of error by a Gaussian mixture model with enough components [18].

In the context of unsupervised machine learning the K-means clustering algorithm is a popular choice to classify datapoints to groups. However this method fails to give a probabilistic measure on the uncertainty on the classification. Thus many models are then done as Gaussian Mixture Models (GMM), i.e. where every f_k is a normal distribution.

For GMMs there are quite efficient batch algorithms for latent parameter inference, but not so many definitive online ones, particularly not MCMC-based algorithms. Thus the next section will investigate if the algorithm described in the previous section can be extended to a GMM.

Particle-Based Online Bayesian Learning of Static Parameters with Application to Mixture Models

Particle-Based Online Bayesian Learning of Static Parameters with Application to Mixture Models

RUTGER FUGLESANG

Particle-Based Online Bayesian Learning of Static Parameters with Application to Mixture Models

RUTGER FUGLESANG

Partikelbaserad Bayesiansk realtidsinlärning av statiska modellparameterar med tillämpning på mixturmodeller

Author: Rutger Fuglesang Supervisor: Prof. Jimmy Olsson

Contents

1 Introduction

1.1 Bayesian Inference

1.2 The problem

1.3 Related research

1.4 This Thesis

2 Background

2.1 Bayesian statistics

2.2 Markov Models

2.3 Monte Carlo methods

2.4 Markov Chain Monte Carlo

2.5 Sequential Monte Carlo

2.6 Reversible Jump Markov Chain Monte Carlo

3 Particle-based Online Learning

3.1 Introduction

3.2 Mathematical Framework and Theorem

3.3 Algorithm

4 Model Implementations

4.1 Simple Gaussian Model

4.2 Mixed Gaussian Model