Monte Carlo methods for inference in dynamical systems
Fredrik Lindsten
Uppsala University
March 30, 2016
1 / 34 fredrik.lindsten@it.uu.se Uppsala University
Identification of state-space models
Consider a nonlinear, discrete-time state-space model:
  x_{t+1} | x_t ~ f_θ(x_{t+1} | x_t),
  y_t | x_t ~ g_θ(y_t | x_t),
  x_1 ~ µ(x_1).
We observe y_{1:T} := (y_1, . . . , y_T) and wish to estimate θ.
Bayesian model: θ is a random variable with prior density p(θ).
Aim: Compute p(θ | y_{1:T}).
Gibbs sampler for SSMs
Aim: Find p(θ | y_{1:T}).
Alternate between updating θ and updating x_{1:T}
⇒ Gibbs sampling.
Crash course in MCMC (I/IV)
Markov kernel
Let {X_k}_{k≥0} be a Markov chain on a general state space.
The Markov transition kernel is given by
  P(x_{k−1}, A) = P(X_k ∈ A | X_{k−1} = x_{k−1}).
Stationary distribution
If π is a probability distribution such that
  π(A) = ∫ P(x, A) π(x) dx,
we say that π is a stationary distribution for P.
Crash course in MCMC (II/IV)
Ergodic theorem
If {X_k}_{k≥0} is ergodic with stationary distribution π, then
  E_π[g(X)] = ∫ g(x) π(x) dx   (average over space)
            = lim_{m→∞} (1/m) Σ_{k=1}^m g(X_k)   (average over "time").
Markov chain Monte Carlo: We want to compute E_π[g(X)].
1. Construct an ergodic Markov kernel P with stationary distribution π.
2. Simulate {X_k}_{k=1}^m according to P for large m.
3. Approximate E_π[g(X)] ≈ (1/m) Σ_{k=1}^m g(X_k).
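The three-step recipe above can be sketched with a random-walk Metropolis–Hastings kernel. The target (a standard normal) and the step size are illustrative choices, not from the slides:

```python
import math
import random

def metropolis_hastings(log_target, x0, n_iters, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings: an ergodic kernel whose
    stationary distribution has density proportional to exp(log_target)."""
    rng = random.Random(seed)
    x, chain = x0, []
    lp = log_target(x)
    for _ in range(n_iters):
        x_prop = x + rng.gauss(0.0, step)          # symmetric proposal
        lp_prop = log_target(x_prop)
        if math.log(rng.random()) < lp_prop - lp:  # accept/reject
            x, lp = x_prop, lp_prop
        chain.append(x)
    return chain

# Target: standard normal; estimate E[X^2] = 1 by the ergodic average.
chain = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_iters=50000)
est = sum(x * x for x in chain) / len(chain)
```

For a long enough chain the ergodic average `est` settles near the spatial average E_π[X²] = 1, which is exactly the "time average = space average" statement of the ergodic theorem.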
Crash course in MCMC (III/IV)
Two common methods for constructing P:
1. Metropolis–Hastings
2. Gibbs sampling
Update the components X_j of the vector X = (X_1, . . . , X_d) in turn,
  X_j ~ π(x_j | x_1, . . . , x_{j−1}, x_{j+1}, . . . , x_d),  j = 1, . . . , d.
Crash course in MCMC (IV/IV)
ex) Sample from
  π(x, y) = N( (x, y); (10, 10), [[2, 1], [1, 1]] ),
i.e. a bivariate Gaussian with mean (10, 10) and covariance matrix [[2, 1], [1, 1]].
Gibbs sampler:
- Draw x' ~ π(x | y);
- Draw y' ~ π(y | x').
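A minimal Python sketch of this two-variable Gibbs sampler, using the exact full conditionals of the Gaussian above (x | y ~ N(y, 1) and y | x ~ N(10 + (x − 10)/2, 1/2)); variable names are illustrative:

```python
import random

def gibbs_bivariate_normal(n_iters, seed=0):
    """Gibbs sampler for N((10,10), [[2,1],[1,1]]), alternating the exact
    full conditionals x | y ~ N(y, 1) and y | x ~ N(10 + (x-10)/2, 1/2)."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0          # deliberately poor starting point
    samples = []
    for _ in range(n_iters):
        x = rng.gauss(y, 1.0)                           # draw x | y
        y = rng.gauss(10 + 0.5 * (x - 10), 0.5 ** 0.5)  # draw y | x
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(20000)
burned = samples[1000:]                  # discard burn-in
mean_x = sum(s[0] for s in burned) / len(burned)
mean_y = sum(s[1] for s in burned) / len(burned)
```

After burn-in, the empirical means approach (10, 10) even though the chain starts far away at (0, 0).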
[Figure: Gibbs samples from π(x, y), plotted in the (X, Y) plane.]
Gibbs sampler for SSMs, cont'd
Aim: Find p(θ, x_{1:T} | y_{1:T}).
Alternate between updating θ and updating x_{1:T}.
MCMC: Gibbs sampling for state-space models. Iterate:
- Draw θ[k] ~ p(θ | x_{1:T}[k − 1], y_{1:T});
- Draw x_{1:T}[k] ~ p(x_{1:T} | θ[k], y_{1:T}).
The above procedure results in a Markov chain {θ[k], x_{1:T}[k]}_{k≥1} with stationary distribution p(θ, x_{1:T} | y_{1:T}).
Linear Gaussian state-space model (I/II)
ex) Gibbs sampling for linear system identification:
  x_{t+1} = A x_t + v_t,  v_t ~ N(0, Q),
  y_t = C x_t + e_t,  e_t ~ N(0, R),
with θ = (A, C, Q, R).
Gibbs sampling: Iterate:
- Draw θ' ~ p(θ | x_{1:T}, y_{1:T});
- Draw x'_{1:T} ~ p(x_{1:T} | θ', y_{1:T}):
  - Run a Kalman filter for t = 1, . . . , T.
  - Simulate backward, x'_T, x'_{T−1}, . . . , x'_1.
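For the scalar special case (A = a, C = c), the two inner steps can be sketched as forward filtering followed by backward simulation; the parameter values and the short data record below are hypothetical:

```python
import random

def ffbs(y, a, c, q, r, p0, seed=0):
    """Forward filtering, backward sampling for the scalar linear Gaussian
    SSM  x[t+1] = a*x[t] + v,  y[t] = c*x[t] + e,  x[1] ~ N(0, p0).
    Returns one draw from p(x_{1:T} | theta, y_{1:T})."""
    rng = random.Random(seed)
    T = len(y)
    ms, Ps = [], []
    m, P = 0.0, p0
    for t in range(T):                    # Kalman filter (forward pass)
        if t > 0:
            m, P = a * m, a * a * P + q   # time update (predict)
        S = c * c * P + r                 # innovation variance
        K = P * c / S                     # Kalman gain
        m, P = m + K * (y[t] - c * m), P - K * c * P  # measurement update
        ms.append(m); Ps.append(P)
    x = [0.0] * T                         # backward simulation
    x[T - 1] = rng.gauss(ms[T - 1], Ps[T - 1] ** 0.5)
    for t in range(T - 2, -1, -1):
        G = Ps[t] * a / (a * a * Ps[t] + q)
        mean = ms[t] + G * (x[t + 1] - a * ms[t])   # x_t | x_{t+1}, y_{1:t}
        var = Ps[t] - G * a * Ps[t]
        x[t] = rng.gauss(mean, var ** 0.5)
    return x

# Hypothetical data: a slowly varying state observed in noise.
y = [0.1, 0.3, 0.2, 0.5, 0.4, 0.6]
x_draw = ffbs(y, a=0.9, c=1.0, q=0.1, r=0.1, p0=1.0)
```

Each call produces one exact joint draw of the full state trajectory, which is what the Gibbs sweep over x_{1:T} requires.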
[Figure: Bode plot (magnitude in dB and phase in degrees vs. frequency in rad/s) of the identified system: true system, posterior mean, and 95 % credibility intervals.]
Gibbs sampler for general SSM?
What about the general nonlinear/non-Gaussian case?
- Draw θ[k] ~ p(θ | x_{1:T}[k − 1], y_{1:T}); OK!
- Draw x_{1:T}[k] ~ p(x_{1:T} | θ[k], y_{1:T}). Hard!
One-at-a-time alternative: x_t[k] ~ p(x_t | θ[k], x_{t−1}[k], x_{t+1}[k − 1], y_t).
Idea: Approximate p(x_{1:T} | θ, y_{1:T}) using a particle filter.
The particle filter
Consider state inference for a fixed value of θ.
The particle filter approximates p(x_t | θ, y_{1:t}), t = 1, . . . , T, by
  p̂_N(x_t | θ, y_{1:t}) := Σ_{i=1}^N ( w_t^i / Σ_ℓ w_t^ℓ ) δ_{x_t^i}(x_t).
t = 1, importance sampling from the prior: For i = 1, . . . , N,
- Sample x_1^i ~ µ(x_1);
- Compute weights w_1^i = g_θ(y_1 | x_1^i).
The particle filter
t > 1:
- Resampling: {x_{t−1}^i, w_{t−1}^i}_{i=1}^N → {x̃_{t−1}^i, 1/N}_{i=1}^N.
- Propagation: x_t^i ~ f_θ(x_t | x̃_{t−1}^i).
- Weighting: w_t^i = g_θ(y_t | x_t^i).
⇒ {x_t^i, w_t^i}_{i=1}^N
N.B. This is the simplest PF; many extensions and improvements are available.
[Figure: One pass of the particle filter: weighting, resampling, propagation, weighting, resampling.]
The particle filter
Algorithm: Bootstrap particle filter (all steps for i = 1, . . . , N)
1. Initialize (t = 1):
   (a) Draw x_1^i ~ µ(x_1).
   (b) Set w_1^i = g_θ(y_1 | x_1^i).
2. For t = 2, 3, . . . :
   (a) Draw a_t^i ~ Discrete({w_{t−1}^j}_{j=1}^N).
   (b) Draw x_t^i ~ f_θ(x_t | x_{t−1}^{a_t^i}).
   (c) Set w_t^i = g_θ(y_t | x_t^i).
Theoretical justification:
- Strongly consistent, with a central limit theorem at rate √N.
- Stable as t → ∞ under mixing assumptions.
- Unbiased estimate of the normalizing constant p(y_{1:t} | θ).
- . . .
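A minimal Python sketch of the bootstrap PF above, which also accumulates the (unbiased) estimate of log p(y_{1:T} | θ). The function interface and the linear Gaussian toy model at the bottom are illustrative assumptions, not from the slides:

```python
import math
import random

def bootstrap_pf(y, N, f_sample, g_logdensity, mu_sample, seed=0):
    """Bootstrap particle filter. f_sample(rng, x) draws from f_theta(. | x),
    g_logdensity(y, x) evaluates log g_theta(y | x), mu_sample(rng) draws x1.
    Returns the final particles/log-weights and log p_hat(y_{1:T})."""
    rng = random.Random(seed)
    x = [mu_sample(rng) for _ in range(N)]             # t = 1: sample prior
    logw = [g_logdensity(y[0], xi) for xi in x]        # ... and weight
    logZ = 0.0
    for t in range(1, len(y)):
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        logZ += m + math.log(sum(w) / N)               # likelihood factor
        anc = rng.choices(range(N), weights=w, k=N)    # (a) resample
        x = [f_sample(rng, x[j]) for j in anc]         # (b) propagate
        logw = [g_logdensity(y[t], xi) for xi in x]    # (c) weight
    m = max(logw)
    logZ += m + math.log(sum(math.exp(lw - m) for lw in logw) / N)
    return x, logw, logZ

# Hypothetical model: x[t+1] = 0.9 x[t] + N(0, 0.5), y[t] = x[t] + N(0, 1).
f = lambda rng, x: 0.9 * x + rng.gauss(0.0, 0.5 ** 0.5)
g = lambda yt, x: -0.5 * math.log(2 * math.pi) - 0.5 * (yt - x) ** 2
mu = lambda rng: rng.gauss(0.0, 1.0)
x_T, logw_T, logZ = bootstrap_pf([0.2, -0.1, 0.4, 0.0], N=500, f_sample=f,
                                 g_logdensity=g, mu_sample=mu)
```

The weights are handled in log space with a running max-shift to avoid underflow; multinomial resampling via `random.choices` is the simplest choice and is what the Discrete step in the algorithm denotes.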
PF illustration
[Figure: Particle filter illustration; particle trajectories, State vs. Time.]
Filtering or smoothing?
- It is possible to reconstruct state trajectories by tracing the genealogy of the particles:
  (. . . , x_{T−2}^{a_{T−1}^{a_T^i}}, x_{T−1}^{a_T^i}, x_T^i) =: x_{1:T}^i.
- At t = T the particle filter provides an approximation of the joint smoothing distribution p(x_{1:T} | θ, y_{1:T}):
  p̂_N(x_{1:T} | θ, y_{1:T}) := Σ_{i=1}^N ( w_T^i / Σ_ℓ w_T^ℓ ) δ_{x_{1:T}^i}(x_{1:T}).
Path degeneracy
[Figures: Path degeneracy; for early time points the surviving particle trajectories collapse onto a single ancestral path (State vs. Time).]
Sampling based on the PF
Recall Gibbs sampler:
Want to sample from p(x_{1:T} | θ, y_{1:T}) for a fixed value of θ.
With P(x*_{1:T} = x_{1:T}^i) ∝ w_T^i we get x*_{1:T} approximately distributed according to p(x_{1:T} | θ, y_{1:T}).
[Figure: A trajectory drawn from the PF approximation (State vs. Time).]
Particle MCMC idea
Problems with this approach:
- Based on a PF ⇒ approximate sample.
- N → ∞ required for a correct MCMC.
- A lot of wasted computations.
Analyze the PF and the MCMC together ⇒ particle MCMC.

Particle Gibbs sampling
Let x'_{1:T} = (x'_1, . . . , x'_T) be a fixed reference trajectory.
- Sample only N − 1 particles in the standard way.
- Set the Nth particle deterministically: x_t^N = x'_t and a_t^N = N.
C. Andrieu, A. Doucet and R. Holenstein, Particle Markov chain Monte Carlo methods.
Journal of the Royal Statistical Society: Series B, 72:269-342, 2010.
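The conditioning step can be sketched as a bootstrap conditional particle filter, i.e. one application of the PG kernel. The helper-function interface and the toy model in the usage example are assumptions, not from the slides:

```python
import math
import random

def conditional_pf(y, x_ref, N, f_sample, g_logdensity, mu_sample, seed=0):
    """One application of the PG kernel (bootstrap conditional PF):
    particle N is clamped to the reference trajectory x_ref; afterwards a
    new trajectory is drawn with probability proportional to w_T^i."""
    rng = random.Random(seed)
    T = len(y)
    particles = [[mu_sample(rng) for _ in range(N - 1)] + [x_ref[0]]]
    logw = [g_logdensity(y[0], xi) for xi in particles[0]]
    ancestors = []
    for t in range(1, T):
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        anc = rng.choices(range(N), weights=w, k=N - 1) + [N - 1]  # a_t^N = N
        xt = [f_sample(rng, particles[t - 1][anc[i]]) for i in range(N - 1)]
        xt.append(x_ref[t])                         # clamp particle N
        particles.append(xt)
        ancestors.append(anc)
        logw = [g_logdensity(y[t], xi) for xi in xt]
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    i = rng.choices(range(N), weights=w, k=1)[0]    # P(i) prop. to w_T^i
    traj = [0.0] * T
    for t in range(T - 1, -1, -1):                  # trace the genealogy
        traj[t] = particles[t][i]
        if t > 0:
            i = ancestors[t - 1][i]
    return traj

# Hypothetical linear Gaussian toy model and an all-zero reference trajectory:
f = lambda rng, x: 0.9 * x + rng.gauss(0.0, 0.5 ** 0.5)
g = lambda yt, x: -0.5 * (yt - x) ** 2
mu = lambda rng: rng.gauss(0.0, 1.0)
y = [0.2, -0.1, 0.4, 0.0, 0.3]
x_new = conditional_pf(y, x_ref=[0.0] * 5, N=20, f_sample=f,
                       g_logdensity=g, mu_sample=mu)
```

Iterating this map (feeding `x_new` back in as `x_ref`) is exactly the x_{1:T}-update of the particle Gibbs sampler.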
Particle Gibbs
[Figures: Particle Gibbs; the fixed reference trajectory and the N − 1 newly generated particle trajectories (State vs. Time).]
- Sampling x*_{1:T} with P(x*_{1:T} = x_{1:T}^i) ∝ w_T^i stochastically maps x'_{1:T} into x*_{1:T}.
- This implicitly defines a Markov kernel (the PG kernel) on X^T, the space of state trajectories.
Particle Gibbs
Theorem (Stationary distribution)
The joint smoothing distribution p(x_{1:T} | θ, y_{1:T}) is a stationary distribution of the PG kernel for any N ≥ 1.
C. Andrieu, A. Doucet and R. Holenstein, Particle Markov chain Monte Carlo methods.
Journal of the Royal Statistical Society: Series B, 72:269-342, 2010.
Theorem (Uniform ergodicity)
The PG kernel is uniformly ergodic under weak conditions.
Additionally, under strong mixing conditions the convergence rate does not degenerate as T → ∞, provided that N ∝ T.
F. Lindsten, R. Douc, and E. Moulines, Uniform ergodicity of the Particle Gibbs sampler.
Scandinavian Journal of Statistics, 42(3): 775-797, 2015.
ex) Particle Gibbs
Stochastic volatility model:
  x_{t+1} = 0.9 x_t + v_t,  v_t ~ N(0, θ),
  y_t = e_t exp((1/2) x_t),  e_t ~ N(0, 1).
Consider the ACF of θ[k] − E[θ | y_{1:T}].
[Figures: ACF vs. lag for PG with T = 100 (left) and T = 1000 (right), for N = 5, 20, 100, 1000.]
Ancestor Sampling
Particle Gibbs:
Let x'_{1:T} = (x'_1, . . . , x'_T) be a fixed reference trajectory.
- Sample only N − 1 particles in the standard way.
- Set the Nth particle deterministically: x_t^N = x'_t.
- Ancestor sampling: instead of setting a_t^N = N, sample a_t^N ∈ {1, . . . , N} with
  P(a_t^N = j) ∝ w_{t−1}^j f_θ(x'_t | x_{t−1}^j).
F. Lindsten, M. I. Jordan and T. B. Schön, Particle Gibbs with Ancestor sampling, Journal of Machine Learning Research, 15: 2145-2184, 2014.
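The ancestor-sampling step in isolation might look as follows in Python; the Gaussian transition density and the numbers in the usage example are illustrative:

```python
import math
import random

def sample_ancestor(rng, x_prev, logw_prev, x_ref_t, f_logdensity):
    """PGAS ancestor step: draw a_t^N = j with probability proportional
    to w_{t-1}^j * f_theta(x'_t | x_{t-1}^j)."""
    logp = [lw + f_logdensity(x_ref_t, xj)
            for lw, xj in zip(logw_prev, x_prev)]
    m = max(logp)                       # max-shift for numerical stability
    w = [math.exp(lp - m) for lp in logp]
    return rng.choices(range(len(x_prev)), weights=w, k=1)[0]

# Hypothetical transition x_t | x_{t-1} ~ N(0.9 x_{t-1}, 0.1); equal weights.
f_ld = lambda xt, xp: -0.5 * (xt - 0.9 * xp) ** 2 / 0.1
a = sample_ancestor(random.Random(0), x_prev=[0.0, 1.0, 2.0],
                    logw_prev=[0.0, 0.0, 0.0], x_ref_t=1.8,
                    f_logdensity=f_ld)
```

Because the transition density strongly favours the particle whose propagated mean (0.9 · 2.0 = 1.8) matches the reference point, the third particle is chosen with overwhelming probability; this reassignment of the reference trajectory's ancestry is what breaks the path degeneracy of plain PG.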
Particle Gibbs with Ancestor Sampling
[Figures: Particle Gibbs with ancestor sampling; regenerated particle trajectories (State vs. Time).]
PGAS vs. PG
[Figures: Particle trajectories for PGAS (top) and PG (bottom), State vs. Time.]
ex) Stochastic volatility model, cont'd
Stochastic volatility model:
  x_{t+1} = 0.9 x_t + v_t,  v_t ~ N(0, θ),
  y_t = e_t exp((1/2) x_t),  e_t ~ N(0, 1).
Consider the ACF of θ[k] − E[θ | y_{1:T}].
[Figures: ACF vs. lag for PG (left) and PGAS (right) with T = 1000, for N = 5, 20, 100, 1000.]
ex) Wiener system identification
[Diagram: Wiener system; the input u_t passes through linear dynamics G followed by a static nonlinearity h(·) to produce the output y_t, with noise sources v_t and e_t.]
Semi-parametric model: a state-space model for G and a Gaussian process model for h(·).
[Figures: Bode plot of G (magnitude and phase vs. frequency; true, posterior mean, 99 % credibility) and the estimated nonlinearity h(z) (true, posterior mean, 99 % credibility).]
F. Lindsten, T. B. Schön and M. I. Jordan, Bayesian semiparametric Wiener system identification, Automatica, 49(7): 2053-2063, 2013.
ex) Gaussian process state-space models
A GP-SSM is a flexible nonparametric dynamical systems model:
  f(·) ~ GP(m_θ(x_t), k_θ(x_t, x'_t)),
  x_{t+1} | f, x_t ~ N(f(x_t), Q),
  y_t | x_t ~ g_θ(y_t | x_t).
[Figure: Posterior over the transition function f(x, u).]
Idea: Marginalize out f(·) and infer x_{1:T} (and θ) directly.
Marginalization of f introduces non-Markovian dependencies in {x_t}.
R. Frigola, F. Lindsten, T. B. Schön and C. E. Rasmussen, Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC, Conference on Neural Information Processing Systems (NIPS), 2013.
ex) Epidemiological forecasting
Figure: Disease activity (I_t) over an eight-year period: true values, one-month-ahead predictions, and 95 % credibility intervals. The first half of the data is used for inference, the second half for one-month-ahead predictions.
Inference (and prediction) for an SIR model with environmental noise and seasonal fluctuations:
  S_{t+dt} = S_t + µP dt − µS_t dt − (1 + F v_t) β_t S_t P^{−1} I_t dt,
  I_{t+dt} = I_t − (γ + µ) I_t dt + (1 + F v_t) β_t S_t P^{−1} I_t dt,
  R_{t+dt} = R_t + γ I_t dt − µ R_t dt.
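A forward simulation of this discretisation can be sketched as follows; all parameter values, the noise scale v_t ~ N(0, dt), and the constant β in place of the seasonal β_t are illustrative assumptions:

```python
import random

def simulate_sir(P, beta, gamma, mu, F, dt, n_steps, s0, i0, seed=0):
    """Euler-style forward simulation of the SIR recursion above, with
    environmental noise v_t entering the transmission term.
    A constant beta is used here; the slides use a seasonal beta_t."""
    rng = random.Random(seed)
    S, I, R = s0, i0, P - s0 - i0
    path = []
    for _ in range(n_steps):
        v = rng.gauss(0.0, dt ** 0.5)                 # environmental noise
        infections = (1 + F * v) * beta * S * I / P * dt
        dS = mu * P * dt - mu * S * dt - infections   # births/deaths/infection
        dI = -(gamma + mu) * I * dt + infections      # recovery/death/infection
        dR = gamma * I * dt - mu * R * dt             # recovery/death
        S, I, R = S + dS, I + dI, R + dR              # simultaneous update
        path.append((S, I, R))
    return path

path = simulate_sir(P=1e6, beta=1.5, gamma=0.5, mu=0.02, F=0.1,
                    dt=0.01, n_steps=1000, s0=9e5, i0=1e5)
```

Because the infection term enters S and I with opposite signs and the birth rate µP balances the per-capita deaths, S + I + R stays equal to the population size P at every step, which is a convenient sanity check on the discretisation.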
Applications of PGAS
- Oceanography [Marcos et al., JGR, 2015]
- Econometrics [Nonejad, Economics Letters, 2015]
- Synaptic plasticity models [Linderman et al., NIPS, 2014]
- Probabilistic programming [van de Meent et al., AISTATS, 2015]
- Power disaggregation [Valera et al., NIPS, 2015]
I . . .
Ongoing: Blocking
[Diagram: the time interval 1, . . . , T divided into blocks J_1, . . . , J_5, with odd- and even-numbered blocks drawn on separate rows.]
- Update blocks of states using PG(AS).
- Stable as T → ∞ using fixed N.
- Opens up for parallelisation.
- Possible to apply blocking also over components of the state vector (spatial blocking).
S. S. Singh, F. Lindsten, and E. Moulines, Blocking Strategies and Stability of Particle Gibbs Samplers, arXiv preprint, arXiv:1509.08362, 2015.
Ongoing: Spatio-temporal models
Spatio-temporal model:
- The state x_t is spatially structured.
- High dimensionality prevents conventional methods!
- Using multiple nested particle filters we can handle dimensions of order 100–1000.
Figure: Drought probability in Sahel region in 1989.
C. A. Naesseth, F. Lindsten, and T. B. Schön, Nested Sequential Monte Carlo Methods.
Ongoing: Probabilistic graphical models
[Diagram: recursive decomposition of grid-structured graphical models into smaller subgraphs (e.g. G_3 built from copies of G_2, and G_{(4,4)} from copies of G_{(2,4)}).]
- Solving static problems using sequential methods.
- The Divide-and-Conquer PF extends particle filtering from chains to trees.
- In particular: hierarchical Bayesian models.
- Applications in genetics, linguistics, social science, . . .
F. Lindsten, A. M. Johansen, C. A. Naesseth, B. Kirkpatrick, T. B. Schön, J. Aston, and A. Bouchard-Côté, Divide-and-Conquer with Sequential Monte Carlo, arXiv preprint, arXiv:1406.4993v2, 2015.
S. Linderman, C. H. Stock, and R. P. Adams.
A framework for studying synaptic plasticity with neural spike train data.
In Advances in Neural Information Processing Systems (NIPS) 27. 2014.
M. Marcos, F. M. Calafat, A. Berihuete, and S. Dangendorf.
Long-term variations in global sea level extremes.
Journal of Geophysical Research, 120(12):8115-8134, 2015.
N. Nonejad.
Flexible model comparison of unobserved components models using particle Gibbs with ancestor sampling.
Economics Letters, 133:35-39, 2015.
I. Valera, F. Ruiz, L. Svensson, and F. Perez-Cruz.
Infinite factorial dynamical model.
In Advances in Neural Information Processing Systems (NIPS) 28. 2015.
J.-W. van de Meent, H. Yang, V. Mansinghka, and F. Wood.
Particle Gibbs with ancestor sampling for probabilistic programs.
In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.