• No results found

Particle-Based Online Bayesian Learning of Static Parameters with Application to Mixture Models

N/A
N/A
Protected

Academic year: 2022

Share "Particle-Based Online Bayesian Learning of Static Parameters with Application to Mixture Models"

Copied!
50
0
0

Loading.... (view fulltext now)

Full text

(1)

IN

DEGREE PROJECT MATHEMATICS, SECOND CYCLE, 30 CREDITS

STOCKHOLM SWEDEN 2020,

Particle-Based Online Bayesian Learning of Static Parameters with Application to Mixture Models

RUTGER FUGLESANG

KTH ROYAL INSTITUTE OF TECHNOLOGY

(2)
(3)

Particle-Based Online Bayesian Learning of Static Parameters with Application to Mixture Models

RUTGER FUGLESANG

Degree Projects in Mathematical Statistics (30 ECTS credits)

Degree Programme in Applied and Computational Mathematics (120 credits) KTH Royal Institute of Technology year 2020

Supervisor at KTH: Jimmy Olsson Examiner at KTH: Jimmy Olsson

(4)

TRITA-SCI-GRU 2020:306 MAT-E 2020:078

Royal Institute of Technology School of Engineering Sciences KTH SCI

SE-100 44 Stockholm, Sweden URL: www.kth.se/sci

(5)

Sammanfattning

This thesis investigates the possibility of using Sequential Monte Carlo methods (SMC) to create an online algorithm to infer properties from a dataset, such as unknown model para- meters. Statistical inference from data streams tends to be difficult, and this is particularly the case for parametric models, which will be the focus of this paper. We develop a sequential Monte Carlo algorithm sampling sequentially from the model’s posterior distributions. As a key ingredient of this approach, unknown static parameters are jittered towards the shrinking support of the posterior on the basis of an artificial Markovian dynamic allowing for correct pseudo-marginalisation of the target distributions. We then test the algorithm on a simple la- tent Gaussian model, latent Gausian Mixture Model (GMM), as well as a variable dimension laten GMM. All tests and coding were done using Matlab. The outcome of the simulation is promising, but more extensive comparisons to other online algorithms for static parameter models are needed to really gauge the computational efficiency of the developed algorithm.

(6)
(7)

Partikelbaserad Bayesiansk realtidsinlärning av statiska modellparameterar med tillämpning på mixturmodeller

Author: Rutger Fuglesang Supervisor: Prof. Jimmy Olsson

Sammanfattning

Detta examensarbete undersöker möjligheten att använda Sekventiella Monte Carlo me- toder (SMC) för att utveckla en algoritm med syfte att utvinna parametrar i realtid givet en okänd modell. Då statistisk slutledning från dataströmmar medför svårigheter, särskilt i parameter-modeller, kommer arbetets fokus ligga i utvecklandet av en Monte Carlo algoritm vars uppgift är att sekvensiellt nyttja modellens posteriori fördelningar. Resultatet är att okän- da, statistiska parametrar kommer att förflyttas mot det krympande stödet av posteriorn med hjälp utav en artificiell Markov dynamik, vilket tillåter en korrekt pseudo-marginalisering ut- av mål-distributionen. Algoritmen kommer sedan att testas på en enkel Gaussisk-modell, en Gaussisk mixturmodell (GMM) och till sist en GMM vars dimension är okänd. Kodningen i detta projekt har utförts i Matlab.

(8)
(9)

Contents

1 Introduction 6

1.1 Bayesian Inference . . . 6

1.2 The problem . . . 6

1.3 Related research . . . 6

1.4 This Thesis . . . 7

1.4.1 Aim . . . 7

1.4.2 Report Structure . . . 7

2 Background 8 2.1 Bayesian statistics . . . 8

2.1.1 Weighted Coin Example . . . 8

2.1.2 Posterior Consistency . . . 9

2.2 Markov Models . . . 9

2.3 Monte Carlo methods . . . 10

2.3.1 MC estimate of π . . . 11

2.4 Markov Chain Monte Carlo . . . 11

2.4.1 Metropolis-Hastings . . . 11

2.4.2 The Gibbs’ sampler . . . 13

2.5 Sequential Monte Carlo . . . 14

2.5.1 Problem formulation . . . 14

2.5.2 Importance Sampling . . . 14

2.5.3 Sequential Importance Sampling (SIS) . . . 15

2.5.4 Weight degeneracy . . . 16

2.5.5 Sequential Importance Sampling with Resampling (SISR) . . . 16

2.6 Reversible Jump Markov Chain Monte Carlo . . . 16

3 Particle-based Online Learning 18 3.1 Introduction . . . 18

3.2 Mathematical Framework and Theorem . . . 18

3.3 Algorithm . . . 20

3.3.1 Step 1: Initiating particles . . . 20

3.3.2 Step 2: Resampling . . . 20

3.3.3 Step 3: Propagation . . . 21

3.3.4 Complete algorithm . . . 21

4 Model Implementations 22 4.1 Simple Gaussian Model . . . 22

4.1.1 Mathematical background . . . 22

4.1.2 Particle based sampler with Gibbs step . . . 23

4.1.3 Results . . . 23

4.2 Mixed Gaussian Model . . . 24

4.2.1 Model Description . . . 25

4.2.2 Mathematical background . . . 25

4.2.3 Algorithm . . . 27

4.2.4 Results . . . 28

(10)

4.2.5 Alternate resampling scheme . . . 28

4.3 Variable dimension model . . . 31

4.3.1 Birth and Death moves . . . 31

4.3.2 Results . . . 35

5 Discussion 37 5.1 Future Work . . . 37

Appendices 38

A Code for Simple Gaussian 38

B Code for GMM 39

(11)

1 Introduction

1.1 Bayesian Inference

Today Bayesian inference plays an important role in a wide range of scientific and engineering fields, such as signal processing, automatic control, biology, and mathematical finance to mention just a few. The ability to be able to update and improve a model continuously as data arrives with an online algorithm has clear efficiency and computational appeal over clunkier batch methods. An online algorithm also allows decisions to be made in real time, which could be critical in some ap- plications. For instance, in financial fraud-detection where millions of transactions are made every day, real-time decision making is of utmost importance [10].

1.2 The problem

In static parameter models we have a sequence of i.i.d (independent and identically distributed) random variables (Yt), t ∈ N and a parametric model fθ, θ ∈ Rd where we would like to ’learn’ the value of the parameter θ based on observed data. However, finding an efficient online algorithm for the task has been a fundamental problem in the area for some time.

Some modern attempts, including this paper, utilize Sequential Monte Carlo (SMC) meth- ods. In order to find an online algorithm for the static parameter θ we introduce N particles which are propagated with every new data point that arrives. Here the goal for an online algorithm is that the posterior π(θ|y1:t) can be updated given the particle’s previous position and solely the newest data point. The fundamental difficulty with achieving this goal is that due to the posterior consistency (see section 2.1.2), the support for the posterior shrinks for every propagation step.

As a result, the particles may collapse too quickly without having explored the full support of the actual posterior.

1.3 Related research

There is a plethora of recently published algorithms that tackle the issue. One class of such algo- rithms operate under the framework of variational inference (VI) [13]. These methods attempt to approximate the posterior density by a variational density qθ characterized by a set of parameter θ. The latter focuses on optmizing the parameters by minimizing a divergence measure between qθ

and the posterior, for instance using the Kullback-Leibler divergence measure.

A recent method called ’Nested Particle Filters’ [15] involves, as the title implies, using a particle filter within a particle filter to assure that the propagation of the main SMC particles is done in a manner that explores the full posterior.

Another technique operating under the SMC framework is to create a kernel for θ given y1:t from which new particles are drawn. While alleviating the issue of section 1.2, this jittering technique has a key requirement that the built kernel must shrink at an appropriate rate to control the variance of the jittered particles. For instance, [6] introduces a plausible jittering scheme, which is however rather complicated and relies on the assumption that the posterior consistency happens at a prescribed rate. It is this train of thought that has led to the project at hand and raised the question if a similar manner of perturbation can be achieved via a simpler scheme.

(12)

Various methods have been suggested, with various degrees of success. Some attempts lack a theoretical guarantee of convergence while others are only valid for a small, specific set of conditions [5].

1.4 This Thesis

1.4.1 Aim

This project aims to find an online algorithm for Bayesian inference of static parameters θ via SMC methods. We do so by seeking a simple method of jittering by introducing a Markovian dynamic on the SMC particles. In this thesis the aforementioned dynamic will be implemented via single steps of a Gibbs’ sampler (see Section 2.4.2), but other ways are possible. Thus we can succinctly summarize the primary objective of this paper as follows: Is it possible to introduce a Markovian dynamic on the particles as a form of jittering so as to ensure full exploration of the posterior.

1.4.2 Report Structure

The report is set up in the following manner. Section 2 elementarily covers the mathematical background required to follow the work at hand. Section 3 goes through the concept of jittering via a Markovian dynamic imposed upon the SMC particles. The latter also includes a thorough proof that shows that the sought posterior remains unchanged. Afterwards in Section 4 the idea is put to the test on 3 models of increasing difficulty. Firstly, on a simple Gaussian model in Section 4.1,then on a more difficult latent Gaussian Mixture Model in Section 4.2 and finally on a variable dimensional GMM in Section 4.3. Finally, Section 5 then summarizes the work and discusses possible extensions.

(13)

2 Background

2.1 Bayesian statistics

Let us ease in to the math behind this paper with a quick recap of what Bayesian statics entail. In 1763 Thomas Bayes [3] introduced his famous theorem, which in its raw form is stated as:

π(θ|y) =p(y|θ)p(θ) p(y)

The innocuous looking theorem has developed a rather large school of thought in statistics that is based upon it. In Bayesian statistics we usually consider y the data, θ a set of parameters pertaining to the distribution of said data. Then we have the distributions: π(y|θ) is the likelihood (that is, the probability distribution of the data given a set of parameters), p(θ) the prior (a probability distribution chosen to reflect the statistician’s prior beliefs of the variable), p(θ|y) the posterior and p(y) the probability of the data.

In the context of Bayesian statistics we are often interested in inferring the values of θ. This is often done by declaring π(θ|y) ∝ p(y|θ)p(θ), where we know the distribution up to a normalizing constant. The normalizing constant can then either be figured out, if for instance we are using conjugate priors, or alternatively be worked around with methods such as Importance Sampling. Let us briefly look at a simple example of a weighted coin to really exemplify how this can be applied.

2.1.1 Weighted Coin Example

Say that we have been flipping a coin and that we begin to suspect that it may be weighted. We will let θ denote the ’weightness’ of it. That is θ = 0.5 would signify a balanced coin where the probability of turning heads is equal to that of getting tails. We may first say the the every ’coin flip trial’ is Bernoulli distributed. That entails that the outcome is distributed as:

p(y = k|θ) = θy(1 − θ)1−y

where y = 0 denotes a result of ’heads’ and y = 1 one of ’tails’. And as each coin flip is independent from the latter if a series of coin flips y1:n were done, the probability of that outcome would be:

p(y1:n|θ) =

n

Y

i=1

θyi(1 − θ)1−yi

Now to really apply a proper Bayesian approach, we need to choose a prior p(θ) that is compatible.

In this case we may suggest θ ∼ β(1, 1) where a β(α, γ) has the probability density function:

p(θ) = θa(1 − θ)γ−1 B(α, γ)

Here B(α, β) is the beta-function. Now if we for the moment ignore the denominator of Bayes’

theorem and focus on the numerator we find:

p(y|θ)p(θ) = θa(1 − θ)γ−1 B(α, γ)

n

Y

i=1

θyi(1 − θ)1−yi

(14)

So if we have n trials (i.e. coin flips) and we let z denote the number of tails, then we have:

p(y|θ)p(θ) = θα+z(1 − θ)γ+n−z B(α, γ) Without touching the denominator we know that:

π(θ|y) = p(y|θ)p(θ)

p(y) ∝ p(y|θ)p(θ) = θα+z(1 − θ)γ+n−z B(α, γ)

But we know that a probability distribution that is proportional to another is equal to that one as well (since they both must integrate to one). Thus we can see from the form of the equation above that the posterior is distributed as π(θ|y) ∼ β(α + z, γ + n − z + 1). But how about α and γ? Well these variables really signify our priors. If we want to signify that we have no prior knowledge we may set them to α = 1 and γ = 1, yielding a uniform, or flat prior; otherwise one could set them asymmetrically to reflect our belief of the coin. But to conclude if we ever suspect a coin to be weighted, we simply need to flip it n times, count the z-times it lands as tails and then look at the posterior above.

2.1.2 Posterior Consistency

Consider a sequence of i.i.d. stochastic variables Xn, n = 1, 2, ... modeled as (X1, ..., Xn) ∼ pn((x1, ..., xn)|θ), θ ∈ Θ; θ ∼ Πn

where Πn(·|(X1, ..., Xn)) denotes the posterior of θ based on the n-samples. We can assume that with a growing n the set of samples X1, ..., Xn provides more and more information on θ. Once again, in the i.i.d. case we have pn((x1, ..., xn)|θ) = Qn

i=1f (xi|θ). Here if the true value of the hidden parameter θ is given by θ0∈ Θ we would intuitively expect Πn(·|(X1, ..., Xn)) to converge to the Dirac measure at the true value, δθ0. This convergence or contraction of the distribution is what is referred to as posterior consistency [19].

While posterior consistency seems as, and often is, a purely positive outcome of a Bayesian model, it can come with some negative consequences. In the case of Sequential Monte Carlo methods for static parameter models the posterior consistency converges so quickly that it can be quite difficult to move the SMC particles and properly explore the full support of the posterior.

2.2 Markov Models

This thesis presents an algorithm that is essentially a hybrid of an Markov Chain Monte Carlo algorithm and a Sequential Monte Carlo one. To understand these we need to reiterate from the ground up starting with Markov models. A sequence {Xt}t∈N of stochastic variables are considered Markovian if the probability of being in state Xtconditional on all the previous states X0:t−1only depends on the last state Xt−1. The transition probability of going from Xt−1 to Xtis given by a Markov transition kernel K, which can be defined as followes:

Definition 2.1. Markov Kernel

Let {X, A} and {Y, B} be measurable spaces. A markov kernel K : X 7→ Y with source (X, A) and target (Y, B) is a map K : B × X 7→ [0, 1] with the following properties:

(15)

i For every b ∈ B, the map x 7→ K(B, x) is A-measurable

ii For every x ∈ X, the map B 7→ K(B, x) is a probability measure on (Y, B)

In other words it associates to each point x ∈ X a probability measure K(x, dy) : b 7→ K(b, x) on (Y, B) such that for every measurable set b ∈ B, the map x 7→ K(b, x) is measurable with respect to the σ-algebra A. [14]

The formal definition above is rather clunky to any reader unfamiliar with the measure- theory formalities. So for the purposes of this paper we can loosely say that for the Markov chain X0:t the markov kernel at K(xt−1, Xt) equates to the transition probability measure p(xt|Xt−1).

Having defined the Kernel we may give a rigorous definition of what constitutes a Markov Chain.

Definition 2.2. Markov Chain

Let (Ω, F , F, P) be a filtered probability space and let K be a Markov transition kernel on a mea- surable space (X, X ). An X-valued stochastic process {Xt}t≥0 is said to be a Markov chain under P , with respect to the filtration F and with transition kernel K if it is F-adapted and for all k ≥ 0 and A ∈ X

P (Xt+1∈ A|Ft) = Kt(Xt, A)

The distribution of X0is called the initial distribution of the chain, and X is called the state space [8].

In many cases, including this thesis, it is important that the distribution produced by the kernel is stationary. This means that the probability measure remains unchanged in the Markov chain as time progress. Equivalently we may say that a kernel is invariant (or formally that the probability measure π is invariant for a given kernel) if:

π(y) = Z

K(x, y)π(x)dx (1)

A Markov chain is denoted as reversible if:

K(y, x)f (y) = K(x, y)f (x)

Note that a reversible Markov chain is also invariant. This notion of reversibility or invariance is of utmost importance in this paper as the theoretical proof for the particle framework in section 3 assumes an invariant Markov Kernel.

2.3 Monte Carlo methods

As the algorithm developed in this applies MCMC moves within an SMC framework it is important that we fully understand these concepts. Firstly let us iterate what a Monte Carlo method entails.

Broadly speaking it is any method where we are using random samples to estimate a result. The basis of which is the law of large numbers: If we have a sequence of random variables X1, ..., XN

sampled from some distribution, then:

lim

N →∞

1 N

N

X

i=1

Xi→ µ

where µ is the common mean of X. To exemplify how a Monte Carlo method can be used practically, we shall below go through a classic example of estimating π via MC.

(16)

2.3.1 MC estimate of π

If we have a circle with radius r = 12 centered in a unit square, we know that the area of the circle is Ac= π4. Assuming that we can sample from the uniform U (0, 1) distribution, then we can sample points (x, y) in the unit square by drawing via x ∼ U (0, 1), y ∼ U (0, 1). We can also categorize if a point is within the circle or not by its distance from the center, as a point is within the circle if x2+ y2< 1 (Note that p(x2+ y2= 1) = 0 so it doesn’t matter how we treat the boundary), and outside otherwise. So after sampling N points, categorizing them all, and letting Nc denote the number of points inside the circle. Then, our estimate of π is given by:

πM CN 4 = Nc

N

2.4 Markov Chain Monte Carlo

Here we recap what MCMC methods entail, as these will be implemented in conjunction with a Sequential Monte Carlo framework for the algorithm developed. While basic MC methods are occasionally of use, they suffer in large dimensional settings and as such are somewhat restricted in their use. Markov chain Monte Carlo (MCMC) methods are a natural extension of the latter, and are much less restrictive. To estimate an aspect of a distribution π we in a pure MC setting simulate an i.i.d. sequence X1, ..., XN from the distribution. However, we may consider simulating a Markovian sequence instead. In that case, if we want to estimate the value of some function f (for instance if we want the mean then f (x) = x) under some distribution, π, then this would take the form:

πMCMCN (f ) = 1 N

N

X

i=1

f (Xi) For MCMC to be practical though we have 3 key requirements [8]:

i Simulating the chain {Xi}i∈N given an arbitrary initial value X1 is an easily implementable process.

ii The chain {Xi}i∈N satisfies conditions need to guarantee the convergence towards π, irrespec- tively of the initial value of X1.

To clarify how this may actually work we shall go through two MCMC algorithms which are relevant to this paper.

2.4.1 Metropolis-Hastings

The Metropolis-Hastings algorithm was in an earlier version published developed in 1950 by N.

Metropolis. The latter was quite narrow in its’ applications and in 1970 W.K. Hastings [17] improved upon it giving it the form we shall go through below. The algorithm is quite versatile in its applications, and can for instance be used to sample from a distribution π, which we sample from directly. This is done by simulating from a transition density g, which we call the proposal density.

(17)

Algorithm 1 Metropolis-Hastings Algorithm Set X1arbitrarily

for i = 1:N do

Generate X0 ∼ g(Xi, •)

Set Xi+1= X0 w.pr. min(1,π(X

0)g(X0,Xi) π(Xi)g(Xi,X0)) Set Xi+1= Xi otherwise

end for

Above the ratio A(X, X0) = min(1,π(Xπ(X0)g(X0,Xi)

i)g(Xi,X0)) is referred to as the acceptance ratio.

We will later on refer to f racπ(X0)g(X0, Xi)π(Xi)g(Xi, X0) as α(X, X0). The acceptance ratio has that given form because in that manner the sequence {Xi}i≥1 satisfies the detailed balance in Equation 1. The Metropolis-Hastings algorithm will not be used directly in this paper, however it is a precursor to the Reversible Jump Monte Carlo algorithm, which will be implemented in the variable dimension model. Now let us show that the acceptance ratio above is reversible (and recall that reversibility implies stationarity):

Given that we are in a state x and are moving to x0, then we need the following to hold:

π(x)p(x, x0) = π(x0)p(x0, x) This is equivalent to:

Z

A

π(x)K(x, A0)dx = Z

A0

π(x0)K(x0, A)dx0, ∀A, A0 (2) Where K(x, B) :=R

Bp(x, y)dy is the transition kernel of Metropolis-Hastings. Now note that:

K(x, A0) = Z

A0

g(x, x0)α(x, x0)dx0+

Z

g(x, x0)[1 − α(x, x0)]dx0



· I{x∈A0} (3) The above states that starting at a state x, we will be in the set A0if we propose a move to x0 ∈ A0 and accept it. Or we if propose any move and reject it, given that x ∈ A0. Now let us insert Equation 3 into Equation 2 and denote the termR g(x, x0)[1 − α(x, x0)]dx0 as F (x) . First the left hand side becomes:

Z

A

π(x)

"

Z

A0

g(x, x0)α(x, x0)dx0+F (x) · I{x∈A0}

# dx

= Z

A

Z

A0

π(x)g(x, x0)α(x, x0)dx0dx + Z

A∩A0

F (x) · I{x∈A0}dx Similarly the right hand side becomes:

Z

A0

π(x0)

"

Z

A

g(x0, x)α(x0, x)dx+F (x0) · I{x0∈A}

# dx0

= Z

A0

Z

A

π(x0)g(x0, x)α(x0, x)dxdx0+ Z

A0∩A

F (x0) · I{x0∈A}dx0

(18)

As the terms with the indicator functions are clearly equal when we set the two sides together we have:

Z

A

Z

A0

π(x)g(x, x0)α(x, x0)dx0dx = Z

A0

Z

A

π(x0)g(x0, x)α(x0, x)dxdx0 (4) Thus we see that reversibility and the detailed balance equation hold, with this acceptance ratio.

2.4.2 The Gibbs’ sampler

The Gibbs sampler is an MCMC algorithm that generates a Markov chain and will be implemented in the algorithms described later on in this paper. The algorithm was first described in 1986 by the brothers Stuart and Donald Gaiman [4] and takes the following form:

If we have data ~y distributed with dependencies on x1, x2, ..., xn and we have access to all the posteriors p(xk|~y, x1, ..., xk−1, xk+1, .., xn), k = 1, 2, ..., n then we can draw from the full posterior as follows:

Algorithm 2 Gibbs sampler Set/Draw all x0k from priors for m = 1:M do

for k=1:n do

θmk ∼ p(xk|~y, xm1, ..., xmk−1, xm−1k+1, .., xm−1n ) end for

end for

All θmk are then sampled from the full posterior. It is important to note that a single step of the Gibbs’ sampler (i.e. m = 1 above). Since this result is important for the developments to come, we provide the proof (as given by [8]) in the following:

Theorem 1. Each of the m individual steps of the Gibbs sampler is π reversible and thus admits π as a stationary probability density function.

Consider the step that updates the kth component and denote by Kk the corresponding transition kernel. We can always write λ = dxkN dx−k where dxk and dx−k are measures on Xk and X−k, respectively, such that dxk dominates πk(·|x−k) for all values of xk∈ Xk. With these notations,

Kk(x, dx0) = δx−k(dx0−k)π(x0k|x−k)dx0k Hence, for any functions f1, f2∈ Fb(X),

Z Z

f1(x)f2(x0)π(x)dxK(x, dx0)

= Z 

f1(x)π(x)dxk

Z

f2(x0k, x−kk(x0k|x−k)dx0k

 dx−k

where (x0k, x−k) refers to the element u of X such that uk = x0k and u−k = x−k. Because π(xk, x−kk(x0k|x−k) = πk(xk|x−k)π(x0k, x−k), we may also write

(19)

Z Z

f1(x)f2(x0)π(x)λ(dx)K(x, dx0)

= Z 

f2(x0k, x−k)π(x0k, x−k)dx0k Z

f1(xk, x−kk(xk|x−k)dxk

 dx−k

which is the same expression as before where the roles of f1 and f2 have been exchanged, thus showing that the detailed balance condition (equation 1) holds.

2.5 Sequential Monte Carlo

Sequential Monte Carlo methods, also known as particle filters, are a set of methods used to solve filtering problems, and in the context of this paper Bayesian inference filtering problems. In the context of the present thesis, Particle filters use a set of particles or samples to represent the pos- terior from which we are trying to sample from. The particle filter methodology is often used to solve hidden Markov model (HMM) filtering problems as well as other nonlinear filtering problems.

2.5.1 Problem formulation

The typical SMC problem looks as follows. We want to estimate sequentially sequences I(ft)t≥0of expectations

I(ft) = Ep(x0:t)[ft(x0:t)] = Z

ft(x0:t)pt(x0:t)dx0:t. (5) Here ftis some function of interest, which is integrable with respect to the posterior. For instance, if we want an estimate of the conditional mean then we simply have ft= x0:t.

To gain a better understanding of how this theory is applied, let us take a look at one of the developments that lead to the earliest successful SMC algorithm, the Sequential Importance Sampling with Resampling algorithm. FIND REF

2.5.2 Importance Sampling

Assume we want to approximate integrals I(ft) as in Equation 5, which is intractable since it is only known up to a normalizing constant i.e. p(x0:t) = g(xc0:t), where c is unknown. We can then introduce an instrumental distribution k(x0:t), which has a support that covers that of the target distribution. The integral I(ft) can then be rewritten as:

I(ft) = Z

ft(x0:t)p(x0:t)dx0:t

= Z

ft(x0:t)p(x0:t)

k(x0:t)k(x0:t)dx0:t

= Z

ft(x0:t)p(x0:t)

k(x0:t)k(x0:t)dx0:t

= Z

ft(x0:t)ω(x0:t)k(x0:t)dx0:t

(20)

Here ω(x0:t) = p(xk(x0:t)

0:t) is referred to as the importance weight. If we then simulate N i.i.d. particles from k(x0:t) we can then formulate a Monte Carlo estimate for I(ft) as:

IM C(ft) = PN

i=1ft(xi0:t)ω(xi0:t) PN

i=1ω(xi0:t)

Importance sampling is a general Monte Carlo integration method but it, in the form presented, is not suitable for an online algorithm as we need all the point x1:tbefore being able to estimating I(ft). Thus, if a new data point xt+1 arrives, the computation needs to be done anew as it cannot be updated recursively. This problem leads us to one of the earliest SMC methods, the Sequential Importance Sampling algorithm.

2.5.3 Sequential Importance Sampling (SIS)

The earliest SMC attempts can be summarized by the Sequential Importance Sampling (SIS) algo- rithm, which as the name implies is a sequential version of the classic Importance Sampling method and has been known since the 1970’s.

As in the last case we want to evaluate integrals of the form of equation 5, but are operating under the same constraints as outlined in the Importance Sampling case. The SIS framework functions by approximating the integral using particles. A particle X0:n = X0, ..., Xn is a draw with corresponding importance weight w0:n= w(X0:n). Since we are unable to draw directly from the target distribution we instead draw from an importance distribution k(x0:t). Now to be able to function recursively, i.e. have an online algorithm we need the instrumental density to be written as a product of transition densities:

k(x0:t) = k0(x0)

n

Y

k=1

k(x0:k−1; xk)

Given a particle X0:nthe new state Xn+1can be sampled from the transition density k(x0:n; xn+1).

The updated particle X0:n+1is then simply set as X0:n+1= (X0:n; Xn+1). The importance weights can also be updated in an online fashion as follows:

wn+1= gn+1(X0:n+1) kn+1(X0:n+1)

= gn+1(X0:n+1) gn(X0:n)kn(X0:n; Xn+1)

gn(X0:n) kn(X0:n)

= gn+1(X0:n+1)

gn(X0:n)kn(X0:n; Xn+1)wn

Now the Monte Carlo estimate of equation 5 is given by:

IM C(fn) =

N

X

i=1

win PN

i=1wnifn(X0:ni )

The sum within the sum above is simply there to self-normalize the weights and takes care of the unknown normalizing constant of the target distribution.

(21)

2.5.4 Weight degeneracy

While the SIS algorithm shows promising results for a few initial iterations it quickly degenerates.

This flaw is a result of the multiplicative nature of updating the weights and is referred to as weight degeneracy. What happens is that quite quickly a few weights dominate all the others leaving our Monte Carlo estimate flawed. As a result, the algorithm historically did not find any practical application till a solution in the form of resampling was put forth in 1990’s, which brings us to the first truly successful implementation of Sequential Monte Carlo methods: The Sequential Importance Sampling with Resampling algorithm.

2.5.5 Sequential Importance Sampling with Resampling (SISR)

In 1993 Gordon et al [12] solved the weight degeneracy issue by adding a resampling step after a single sweep of the SIS algorithm. Resampling is a straightforward step that functions as follows:

Given N particles with corresponding importance weights, (X0:ni , w0:ni ), new indices Ii are drawn via a multinomial process with probabilities PNw1n

j=1wjn, ...,PNwNn

j=1wnj. Thus, Ii = k w. pr. wnk

PN j=1wnj

Afterwards we reset the particles as X0:ni = X0:nIi and we set all weights as wni =

gn+1(X0:n+1)

gn(X0:n)kn(X0:n;Xn+1). This resampling has become a staple of all SMC methods and the method above of multinomial resampling (named as such as we are using a multinomial process) is simply the first and most simple way of doing so. Usually resampling is implemented systematically after every propagation step, however much optimization can be done on when and how often to apply the step. The algorithm in pseudo-code can then be written as:

Algorithm 3 Given (X0:ti , wi0:t) for i in nParticles do

X˜0:ti = Xti w.pr. P ωωiti t

Xt+1i ∼ kt(X˜0:ti , Xt+1i ) end for

for i in nParticles do X0:t+1i ←− (X˜0:ti , Xt+1i ) ωit+1= g gn+1(X0:n+1)

n(X0:n)kn(X0:n;Xn+1)

end for

2.6 Reversible Jump Markov Chain Monte Carlo

Later, in Section 4.3 we will be tackling a variable dimensional model. That is a multidimensional model (we will be looking at a Gaussian mixture model) where the dimensionality is unknown. To accomplish this we need a method for our SMC particles to explore a variety of dimensions. So to this end we will be implementing the Reversible Jump Markov Chain Monte Carlo (RJMCMC)

(22)

algorithm.

In his 1995 paper [7], P. Green presented the RJMCMC algorithm, which allows one to move between models with different dimensions. The algorithm is essentially that of a Metropolis- Hastings but with specific trans-dimensional proposals chosen to fulfill the stationarity of the desired distribution.

For RJMCMC we want to make moves from x ∈ Rk to x0 ∈ Rk0. The moves are done by drawing random varibles W and W0 and then using a deterministic function that maps (x, w) to (x0, w0) such that (x0, w0) = h(x0, w0). We may also need to fill w or w0 with auxiliary variables so that the dimension of the (x, w) matches that of x0, w0. I.e. if w ∈ Rd and w0 ∈ Rd0 then we have the dimension matching requirement:

k + d = k0+ d0

Now since x is known prior to the move, and the move is purely deterministic given (x, w), then the candidate transition density ofthe Metropolis-Hastings algorithm, g(x, x0), reduces to q(w). That is the density of the random variable we drew. In this case the reversibility equation we derived for the MH algorithm (Equation 4) becomes:

Z

A

Z

A0

π(x)q(w)A(x, x0)dwdx = Z

A0

Z

A

π(x0)q(w0)A(x0, x)dx0dw0

Now for the above to properly match we need to use the substition on the integrals from x0 and w0 back to x and w:

dx0dw0 =

∂(x0, w0)

∂(x, w)

dxdw Here

∂(x0,w0)

∂(x,w)

is the absolute value of the determinant of the Jacobian of the transformation from (x, w) to (x0, w0). The invariance equality then holds by letting the acceptance ratio of the RJMCMC algorithm be:

A(x, x0) = min



1,π(x0)q(w0) π(x)q(w) ·

∂(x0, w0)

∂(x, w)



It is in this manner we will, in the variable dimension model, be proposing and accepting moves which allow the particles to change dimensions.

(23)

3 Particle-based Online Learning

3.1 Introduction

As outlined in the previous section, the posterior consistency proves to be somewhat of a nuisance in the attempt to find an SMC-based online algorithm for static parameter models as the posterior converges before the SMC particles can explore the posterior. One method of solving this is to periodically move the particles. However, these nudges must be done in such a manner that does not actually change the posterior. This technique of jittering, while feasible, can lead to some convoluted techniques [6]. The idea presented in this thesis is to perform jittering on the basis of a Markovian dynamics leaving the posterior invariant.

3.2 Mathematical Framework and Theorem

For our markovian dynamic to even be a possibility we need to first and foremost show that, given some assumptions, the filtered distribution from this new setup coincides with that of a classic Bayes model. For this section we will be using π to denote the classical posterior and π for the Markovian setup.

Specifically we want to show that the posterior of the new model with a Markovian dynamic is equal the classic Bayes posterior is given by

π(θ|y1:t) = p(y1:t|θ)π(θ) R p(y1:t0)π(θ0)dθ0 And since we have i.i.d. data:

π(θ|y1:t) = Qt

i=1p(yi|θ)π(θ) R p(y1:t0)π(θ0)dθ0

In order to allow for a sequential representation of the posterior, we define the model

p(y1:t, θ1:t) = p(y11)π(θ1)

t−1

Y

s=1

Kss, θs+1)p(ys+1s+1)

where we assume that the Markov kernel K leaves the posterior invariant:

Z

Ktt, θt+1)π(θt|y1:t)dθt= π(θt+1|y1:t)

And what we want to show is then:

π(θt|y1:t) ∝ π(θt|y1:t)

Since we know that if a probability density function is proportional to another probability density function, then the two of them must be equal thus we are really attempting to show:

π(θt|y1:t) = π(θt|y1:t) Let us summarize this as a theorem:

(24)

Theorem 2. A sequence of i.i.d stochastic variables (Y1, ..., Yt); t ∈ N modeled conditionally on a set of unknown parameters θ ∈ Θ as in p(y|θ) and the parameter θ modeled as a Markov Chain who’s filtered posterior at each time step t, πtt|y1:t), is invariant for the Markov Kernel Ktt, θt+1).

Then :

π(θt|y1:t) = πtt|y1:t) Proof. First, for all t and (θ, θ0)

Z

Kt(θ, θ0)

t

Y

s=1

p(ys|θ)p(θ)dθ ∝ Z

Kt(θ, θ0)π(θ|y1:t)dθ

= π(θ0|y1:t) ∝

t

Y

s=1

p(ys0)π(θ0)

(6)

Now, for all 1 ≤ k ≤ t Z

. . . Z

π(θk)

k

Y

l=1

p(ylk)

t−1

Y

s=k

K(θs, θs+1)p(ys+1s+1)dθk. . . dθt−1

= Z

. . .

Z Z

π(θk)

k

Y

l=1

p(ylk)K(θk, θk+1)dθk

!

p(yk+1k+1)

t−1

Y

s=k+1

K(θs, θs+1)p(ys+1s+1)dθk+1. . . dθt−1

∝ Z

. . . Z

π(θk+1)

k+1

Y

l=1

p(ylk)

t−1

Y

s=k+1

K(θs, θs+1)p(ys+1s+1)dθk+1. . . dθt−1

(7) In the step above we have applied Equation 6 between the last two lines. Now repeating Equation 7 yields

π(θt|y1:t) ∝ Z

. . . Z

p(θ1:t, y1:t)dθ1. . . dθt

= Z

. . . Z

π(θ1)p(y11)

t−1

Y

s=k+1

K(θs, θs+1)p(ys+1s+1)dθ1. . . dθt−1

∝ Z

. . . Z

π(θ2)

2

Y

l=1

p(yl2)

t−1

Y

s=2

K(θs, θs+1)p(ys+1s+1)dθ2. . . dθt−1

∝ . . . ∝ Z

π(θt−1)

t−1

Y

l=1

p(ylt−1)K(θt−1, θt)p(ytt)dθt−1

(8)

In Equation 8 where we have ∝ . . . ∝ we are simply recursively repeating the steps from Equation 7. Now applying Equation 6 a final time gives

π(θt|y1:t) ∝ π(θt)

t

Y

l=1

p(ylt) ∝ π(θt|y1:t)

(25)

π(θt|y1:t) ∝ π(θt)

t

Y

l=1

p(ylt) ∝ π(θt|y1:t)

We have now shown that the filtered posterior of the particles from the Markovian dynamic is proportional to that from a classic setup.

3.3 Algorithm

Now that we have gone through the theory behind the algorithm, let us delve into how it practically works. First let us reiterate the goal of the algorithm. We have a datapoints Y1:n distributed via p(y|~θ) where ~θ is a set of latent variables. Here our goal is to sample from the posterior π(~θ|y), which can be used to, for instance, find the values of the unknown variables.

3.3.1 Step 1: Initiating particles

As mentioned earlier the algorithm is essentially a MCMC step within an SMC framework. The first step is then to instantiate the N SMC particles θ1:N. Each particle θi is a vector that holds an estimate for each unknown variable of ~θ as well as the priors for all of these. So if ~θ were to be composed of 5 variables than each particle θi may be composed of something like 9 elements.

In the following we will denote by ψt the set of parameters governing the posterior of πt(θ|y1:t). It is then a simple matter of instantiating the particles as follows:

Algorithm 4

for i in nParticles do for j in ~θ do

Drawjθi0 ∼ πjψ0i(jθ) end for

end for

In the form abovejθ0i designates the j-th sub-component of the i-th particle. The 0 signifies that no datapoints have been processed as of yet. Accordingly, jψi0 is the set of parameters or priors necessary for the update of this specific component of θi0. For conciseness we will simply be writing

"Set/Draw all θ0i from priors" for this step from now on.

3.3.2 Step 2: Resampling

For every sweep of the algorithm, as in all SMC methods, we must resample. This can be done before or after the particles are propagated. The merits of which are marginal and the differences are beyond the scope of this papaer. However, in this algorithm we have chosen to resample before propagation. For our algorithm we use the likelihood of a particle given the latest datapoints, or set of datapoints i.e p(ytit) as the weight. Thus, we resample in the following manner:

(26)

Algorithm 5 Resampling for i in nParticles do θ˜it= θitw.pr. P p(yp(ytti)

tit)

end for

3.3.3 Step 3: Propagation

After having initiated the particles and resampled we’re ready to begin propagating the particles for each data point that arrives. It is at this point that an MCMC step is used, in all our model implementations we’ll be using a single Gibb’s step to fulfill this purpose but this is by no means the only way to do so and other MCMC methods could be used instead.

Algorithm 6 Propagation Given a datapoint yt for i in nParticles do

for j in ~θ do ψti= f (ψt−1i , yt)

jθitjKjψit(jθt−1,jθt) end for

end for

As we are choosing conjugate priors for the estimates of ~θ in the particles, then these priors can be updated in a deterministic manner given a datapoint, which is why in we have ψt+1i = f (ψti, yt). In the pseudo-code abovejKjψit(jθt−1,jθt) signifies the transition kernel we are drawing propagating the particles via. It is here that we are applying our MCMC step.

3.3.4 Complete algorithm

Putting the three steps together the algorithm can be summarized in the following manner:

Algorithm 7

Set/Draw all θi0from priors for t in nData do

for i in nParticles do θ˜it= θitw.pr. P p(yp(ytti)

tit)

for j in ~θ do ψt+1i = f (ψ˜ti, yt)

jθitjKjψit(jθt−1,jθt) end for

end for end for

The algorithm is in essence quite simple but actually implementing it can become quite cumbersome as we’ll see in the next section.

(27)

4 Model Implementations

4.1 Simple Gaussian Model

First we want to test the setup on a simple model, such as one for normally distributed data:

Y |θ ∼ N (µ, σ2) With Priors on θ = [µ, σ2]:

µ ∼ N (m,1

τ) i.e. τ is the precision of µ σ2∼ IG(a, b)

In this setup we have chosen simple conjugate priors with known posteriors:

p(σ2|~y, µ) ∼ IG(a + t/2, b +X1

2(yi− µ)2)) p(µ|~y, σ2) ∼ N +σ12P yi

τ +σn2

, (τ + n σ2)−1



4.1.1 Mathematical background

The posteriors given above are found in the following manner:

p(µ, σ2|~y) ∝ p(µ, σ2~y) = p(~y|µ, σ2)p(µ)p(σ2) p(µk, σ2, ~y) = (

t

Y

i=1

1 σ√

2πe12(yi−µσ )2) 1 τ√

2πe12(µ−mτ )2 ba

Γ(a)(σ2)−a−1e−bσ2 p(σ2|y, µ) ∝ 1

σt2)−a−1e−bσ2

t

Y

i=1

e12(yi−µσ )2

∝ σ−2a−2−te−bσ2eP −12(yi−µσ )2

∝ (σ2)−a−1−t/2e−1σ2(bk+P12(yi−µ)2) Thus we have: p(σ2|µ,~y) ∼ IG(a + t/2, b +X1

2(yi− µ)2)). Now for p(µ|σ2, ~y):

p(µ|σ2, ~y) ∝ exp



−1

2((µ − m)τ )2−1

2(Xyi− m σ )2



∝ exp



−1 2

τ 2 σ22(n

τ2 + σ2) − 2µk2m + 1 τ2P yi

) + C)



∝ exp −1 2

τ2(τn2 + σ2)

σ22− 2µ(σ2m +τ2P y1 i)

n

τ2 + σ2 + D)

!

∝ exp



−1 2(n

σ2k + τ2)(µ −τ2m +σ12P yi n

σ2 + τ2 )



(28)

From here it is clear that:

p(µ|σ2, ~y) ∼ N (τ2m +σ12P yi n

σ2 + τ2 , ( n

σ2 + τ2)−1) These posteriors can now be applied in the particle based sampler.

4.1.2 Particle based sampler with Gibbs step

One approach to implementing the Markovian dynamic in a particle based sampler, if the pri- ors allow, to introduce a Gibbs step for every single particle. Here let our particles be θti = [µit, σt2i, τti, mit, ait, bit]. Where τ, m and a, b are the sufficient statistics needed for the update of µ and σ2 respectively:

Algorithm 8

Set/Draw all θi0from priors for t in nData do

for i in nParticles do θ˜it= θi w.pr. p(ytti) ait+1= a˜i+ 0.5 bit+1= bit+ (yt− µ˜it)2 σt+12 ∼ IG(ait+1, bit+1) τt+1i = τt˜i+σ21

t+1i

mit+1=

τt+1i m˜it+ yt

σi 2t+1

τt+1i + 1

σi 2t+1

µit+1∼ N (mit+1, 1

τt+1i + 1

σ2t+1i

) end for

end for

An implementation of the algorithm (using MATLAB) can be found in appendix A.

4.1.3 Results

Running this algorithm for a set of 500 datapoints normally distributed as N (2, 5) with hyperpa- rameters a = 1, b = 1, m = 0, τ = 3 gives the following results.

It is interesting to note that the algorithm above is essentially identical to one arrived at in [1] although they do not approach the problem in the same manner with a clear Markovian Dynamic.

(29)

(a) (b)

Figure 1: Plot of the particles after propagation and the analytical (unnormalized) joint posterior

4.2 Mixed Gaussian Model

Let us now investigate the possibilities of a particle-based algorithm for a mixture of gaussians.

Mixture models are a set of probabilistic models for representing the presence of subcategories in an overall population without data points identifying their membership to specific groups. Formally we could say that a distribution f is a mixture of K components if:

f (X) =

d

X

k=1

αkfk(X)

d

X

k=1

αk = 1

αk> 0, k = 1, 2, ..., d

Generally such a model is useful for inferring properties of the subcategories of a popu- lation when the datapoints are not labeled with which subcategory they pertain to. There are a wide variety of applications for such models from healthcare[9] or finance [11] to much more. The applications of GMMs are enormous as a GMM is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific nonzero amount of error by a Gaussian mixture model with enough components [18].

In the context of unsupervised machine learning the K-means clustering algorithm is a popular choice to classify datapoints to groups. However this method fails to give a probabilistic measure on the uncertainty on the classification. Thus many models are then done as Gaussian Mixture Models (GMM), i.e. where every fk is a normal distribution.

For GMMs there are quite efficient batch algorithms for latent parameter inference, but not so many definitive online ones, particularly not MCMC-based algorithms. Thus the next section will investigate if the algorithm described in the previous section can be extended to a GMM.

References

Related documents

46 Konkreta exempel skulle kunna vara främjandeinsatser för affärsänglar/affärsängelnätverk, skapa arenor där aktörer från utbuds- och efterfrågesidan kan mötas eller

The increasing availability of data and attention to services has increased the understanding of the contribution of services to innovation and productivity in

Av tabellen framgår att det behövs utförlig information om de projekt som genomförs vid instituten. Då Tillväxtanalys ska föreslå en metod som kan visa hur institutens verksamhet

Svetlana Bizjajeva and Jimmy Olsson: Antithetic Sampling for Sequential Monte Carlo Methods with Application to State Space Models..

suggested that the two algorithms performs almost equally well on a fixed data set, with the online algorithm being somewhat faster. However as mentioned in Chapter 5 , as

The main contribution of this thesis is the exploration of different strategies for accelerating inference methods based on sequential Monte Carlo ( smc) and Markov chain Monte Carlo

We study Vasicek’s closed form approximation for large portfolios with the mixed binomial model using the beta distribution and a two-factor model inspired by Merton as

1754 Department of Electrical Engineering, Linköping University. Linköping University SE-581 83