Monte Carlo methods for inference in dynamical systems
Fredrik Lindsten
Uppsala University
March 30, 2016
1 / 34 fredrik.lindsten@it.uu.se Uppsala University
Identification of state-space models
Consider a nonlinear, discrete-time state-space model:
  x_{t+1} | x_t ~ f_θ(x_{t+1} | x_t),
  y_t | x_t ~ g_θ(y_t | x_t),
  x_1 ~ µ(x_1).
We observe y_{1:T} := (y_1, . . . , y_T) and wish to estimate θ.
Bayesian model: θ is a random variable with prior density p(θ).
Aim: Compute p(θ | y_{1:T}).
Gibbs sampler for SSMs
Aim: Find p(θ | y_{1:T}).
Alternate between updating θ and updating x_{1:T}
⇒ Gibbs sampling.
Crash course in MCMC (I/IV)
Markov kernel
Let {X_k}_{k≥0} be a Markov chain on a general state space.
The Markov transition kernel is given by
  P(x_{k−1}, A) = P(X_k ∈ A | X_{k−1} = x_{k−1}).
Stationary distribution
If π is a probability distribution such that
  π(A) = ∫ P(x, A) π(x) dx,
we say that π is a stationary distribution for P.
Crash course in MCMC (II/IV)
Ergodic theorem
If {X_k}_{k≥0} is ergodic with stationary distribution π, then
  E_π[g(X)] = ∫ g(x) π(x) dx   (average over space)
            = lim_{m→∞} (1/m) Σ_{k=1}^m g(X_k)   (average over "time").
Markov chain Monte Carlo: We want to compute E_π[g(X)].
1. Construct an ergodic Markov kernel P with stationary distribution π.
2. Simulate {X_k}_{k=1}^m according to P for large m.
3. Approximate E_π[g(X)] ≈ (1/m) Σ_{k=1}^m g(X_k).
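The three-step recipe above can be sketched with a random-walk Metropolis–Hastings kernel. The target (a standard normal) and the step size are illustrative choices, not from the slides:

```python
import math
import random

def metropolis_hastings(log_target, x0, n_iters, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings: an ergodic kernel whose
    stationary distribution has density proportional to exp(log_target)."""
    rng = random.Random(seed)
    x, chain = x0, []
    lp = log_target(x)
    for _ in range(n_iters):
        x_prop = x + rng.gauss(0.0, step)          # symmetric proposal
        lp_prop = log_target(x_prop)
        if math.log(rng.random()) < lp_prop - lp:  # accept/reject
            x, lp = x_prop, lp_prop
        chain.append(x)
    return chain

# Target: standard normal; estimate E[X^2] = 1 by the ergodic average.
chain = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_iters=50000)
est = sum(x * x for x in chain) / len(chain)
```

For a long enough chain the ergodic average `est` settles near the spatial average E_π[X²] = 1, which is exactly the "time average = space average" statement of the ergodic theorem.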
Crash course in MCMC (III/IV)
Two common methods for constructing P:
1. Metropolis–Hastings
2. Gibbs sampling
Update the components X_j of the vector X = (X_1, . . . , X_d) in turn,
  X_j ~ π(x_j | x_1, . . . , x_{j−1}, x_{j+1}, . . . , x_d),  j = 1, . . . , d.
Crash course in MCMC (IV/IV)
ex) Sample from
  π(x, y) = N( (x, y); (10, 10), [[2, 1], [1, 1]] ),
i.e. a bivariate Gaussian with mean (10, 10) and covariance matrix [[2, 1], [1, 1]].
Gibbs sampler:
- Draw x' ~ π(x | y);
- Draw y' ~ π(y | x').
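A minimal Python sketch of this two-variable Gibbs sampler, using the exact full conditionals of the Gaussian above (x | y ~ N(y, 1) and y | x ~ N(10 + (x − 10)/2, 1/2)); variable names are illustrative:

```python
import random

def gibbs_bivariate_normal(n_iters, seed=0):
    """Gibbs sampler for N((10,10), [[2,1],[1,1]]), alternating the exact
    full conditionals x | y ~ N(y, 1) and y | x ~ N(10 + (x-10)/2, 1/2)."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0          # deliberately poor starting point
    samples = []
    for _ in range(n_iters):
        x = rng.gauss(y, 1.0)                           # draw x | y
        y = rng.gauss(10 + 0.5 * (x - 10), 0.5 ** 0.5)  # draw y | x
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(20000)
burned = samples[1000:]                  # discard burn-in
mean_x = sum(s[0] for s in burned) / len(burned)
mean_y = sum(s[1] for s in burned) / len(burned)
```

After burn-in, the empirical means approach (10, 10) even though the chain starts far away at (0, 0).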
[Figure: Gibbs samples from π(x, y), plotted in the (X, Y) plane.]
Gibbs sampler for SSMs, cont'd
Aim: Find p(θ, x_{1:T} | y_{1:T}).
Alternate between updating θ and updating x_{1:T}.
MCMC: Gibbs sampling for state-space models. Iterate:
- Draw θ[k] ~ p(θ | x_{1:T}[k − 1], y_{1:T});
- Draw x_{1:T}[k] ~ p(x_{1:T} | θ[k], y_{1:T}).
The above procedure results in a Markov chain {θ[k], x_{1:T}[k]}_{k≥1} with stationary distribution p(θ, x_{1:T} | y_{1:T}).
Linear Gaussian state-space model (I/II)
ex) Gibbs sampling for linear system identification:
  x_{t+1} = A x_t + v_t,  v_t ~ N(0, Q),
  y_t = C x_t + e_t,  e_t ~ N(0, R),
with θ = (A, C, Q, R).
Gibbs sampling: Iterate:
- Draw θ' ~ p(θ | x_{1:T}, y_{1:T});
- Draw x'_{1:T} ~ p(x_{1:T} | θ', y_{1:T}):
  - Run a Kalman filter for t = 1, . . . , T.
  - Simulate backward, x'_T, x'_{T−1}, . . . , x'_1.
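For the scalar special case (A = a, C = c), the two inner steps can be sketched as forward filtering followed by backward simulation; the parameter values and the short data record below are hypothetical:

```python
import random

def ffbs(y, a, c, q, r, p0, seed=0):
    """Forward filtering, backward sampling for the scalar linear Gaussian
    SSM  x[t+1] = a*x[t] + v,  y[t] = c*x[t] + e,  x[1] ~ N(0, p0).
    Returns one draw from p(x_{1:T} | theta, y_{1:T})."""
    rng = random.Random(seed)
    T = len(y)
    ms, Ps = [], []
    m, P = 0.0, p0
    for t in range(T):                    # Kalman filter (forward pass)
        if t > 0:
            m, P = a * m, a * a * P + q   # time update (predict)
        S = c * c * P + r                 # innovation variance
        K = P * c / S                     # Kalman gain
        m, P = m + K * (y[t] - c * m), P - K * c * P  # measurement update
        ms.append(m); Ps.append(P)
    x = [0.0] * T                         # backward simulation
    x[T - 1] = rng.gauss(ms[T - 1], Ps[T - 1] ** 0.5)
    for t in range(T - 2, -1, -1):
        G = Ps[t] * a / (a * a * Ps[t] + q)
        mean = ms[t] + G * (x[t + 1] - a * ms[t])   # x_t | x_{t+1}, y_{1:t}
        var = Ps[t] - G * a * Ps[t]
        x[t] = rng.gauss(mean, var ** 0.5)
    return x

# Hypothetical data: a slowly varying state observed in noise.
y = [0.1, 0.3, 0.2, 0.5, 0.4, 0.6]
x_draw = ffbs(y, a=0.9, c=1.0, q=0.1, r=0.1, p0=1.0)
```

Each call produces one exact joint draw of the full state trajectory, which is what the Gibbs sweep over x_{1:T} requires.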
[Figure: Bode plot (magnitude in dB and phase in degrees vs. frequency in rad/s) of the identified system: true system, posterior mean, and 95 % credibility intervals.]
Gibbs sampler for general SSM?
What about the general nonlinear/non-Gaussian case?
- Draw θ[k] ~ p(θ | x_{1:T}[k − 1], y_{1:T}); OK!
- Draw x_{1:T}[k] ~ p(x_{1:T} | θ[k], y_{1:T}). Hard!
One-at-a-time alternative: x_t[k] ~ p(x_t | θ[k], x_{t−1}[k], x_{t+1}[k − 1], y_t).
Idea: Approximate p(x_{1:T} | θ, y_{1:T}) using a particle filter.
The particle filter
Consider state inference for a fixed value of θ.
The particle filter approximates p(x_t | θ, y_{1:t}), t = 1, . . . , T, by
  p̂_N(x_t | θ, y_{1:t}) := Σ_{i=1}^N ( w_t^i / Σ_ℓ w_t^ℓ ) δ_{x_t^i}(x_t).
t = 1, importance sampling from the prior: For i = 1, . . . , N,
- Sample x_1^i ~ µ(x_1);
- Compute weights w_1^i = g_θ(y_1 | x_1^i).
The particle filter
t > 1:
- Resampling: {x_{t−1}^i, w_{t−1}^i}_{i=1}^N → {x̃_{t−1}^i, 1/N}_{i=1}^N.
- Propagation: x_t^i ~ f_θ(x_t | x̃_{t−1}^i).
- Weighting: w_t^i = g_θ(y_t | x_t^i).
⇒ {x_t^i, w_t^i}_{i=1}^N
N.B. This is the simplest PF; many extensions and improvements are available.
[Figure: One pass of the particle filter: weighting, resampling, propagation, weighting, resampling.]
The particle filter
Algorithm: Bootstrap particle filter (all steps for i = 1, . . . , N)
1. Initialize (t = 1):
   (a) Draw x_1^i ~ µ(x_1).
   (b) Set w_1^i = g_θ(y_1 | x_1^i).
2. For t = 2, 3, . . . :
   (a) Draw a_t^i ~ Discrete({w_{t−1}^j}_{j=1}^N).
   (b) Draw x_t^i ~ f_θ(x_t | x_{t−1}^{a_t^i}).
   (c) Set w_t^i = g_θ(y_t | x_t^i).
Theoretical justification:
- Strongly consistent, with a central limit theorem at rate √N.
- Stable as t → ∞ under mixing assumptions.
- Unbiased estimate of the normalizing constant p(y_{1:t} | θ).
- . . .
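A minimal Python sketch of the bootstrap PF above, which also accumulates the (unbiased) estimate of log p(y_{1:T} | θ). The function interface and the linear Gaussian toy model at the bottom are illustrative assumptions, not from the slides:

```python
import math
import random

def bootstrap_pf(y, N, f_sample, g_logdensity, mu_sample, seed=0):
    """Bootstrap particle filter. f_sample(rng, x) draws from f_theta(. | x),
    g_logdensity(y, x) evaluates log g_theta(y | x), mu_sample(rng) draws x1.
    Returns the final particles/log-weights and log p_hat(y_{1:T})."""
    rng = random.Random(seed)
    x = [mu_sample(rng) for _ in range(N)]             # t = 1: sample prior
    logw = [g_logdensity(y[0], xi) for xi in x]        # ... and weight
    logZ = 0.0
    for t in range(1, len(y)):
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        logZ += m + math.log(sum(w) / N)               # likelihood factor
        anc = rng.choices(range(N), weights=w, k=N)    # (a) resample
        x = [f_sample(rng, x[j]) for j in anc]         # (b) propagate
        logw = [g_logdensity(y[t], xi) for xi in x]    # (c) weight
    m = max(logw)
    logZ += m + math.log(sum(math.exp(lw - m) for lw in logw) / N)
    return x, logw, logZ

# Hypothetical model: x[t+1] = 0.9 x[t] + N(0, 0.5), y[t] = x[t] + N(0, 1).
f = lambda rng, x: 0.9 * x + rng.gauss(0.0, 0.5 ** 0.5)
g = lambda yt, x: -0.5 * math.log(2 * math.pi) - 0.5 * (yt - x) ** 2
mu = lambda rng: rng.gauss(0.0, 1.0)
x_T, logw_T, logZ = bootstrap_pf([0.2, -0.1, 0.4, 0.0], N=500, f_sample=f,
                                 g_logdensity=g, mu_sample=mu)
```

The weights are handled in log space with a running max-shift to avoid underflow; multinomial resampling via `random.choices` is the simplest choice and is what the Discrete step in the algorithm denotes.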
PF illustration
[Figure: Particle filter illustration; particle trajectories, State vs. Time.]
Filtering or smoothing?
- It is possible to reconstruct state trajectories by tracing the genealogy of the particles:
  (. . . , x_{T−2}^{a_{T−1}^{a_T^i}}, x_{T−1}^{a_T^i}, x_T^i) =: x_{1:T}^i.
- At t = T the particle filter provides an approximation of the joint smoothing distribution p(x_{1:T} | θ, y_{1:T}):
  p̂_N(x_{1:T} | θ, y_{1:T}) := Σ_{i=1}^N ( w_T^i / Σ_ℓ w_T^ℓ ) δ_{x_{1:T}^i}(x_{1:T}).
Path degeneracy
[Figures: Path degeneracy; for early time points the surviving particle trajectories collapse onto a single ancestral path (State vs. Time).]
Sampling based on the PF
Recall Gibbs sampler:
Want to sample from p(x_{1:T} | θ, y_{1:T}) for a fixed value of θ.
With P(x*_{1:T} = x_{1:T}^i) ∝ w_T^i we get x*_{1:T} approximately distributed according to p(x_{1:T} | θ, y_{1:T}).
[Figure: A trajectory drawn from the PF approximation (State vs. Time).]
Particle MCMC idea
Problems with this approach:
- Based on a PF ⇒ approximate sample.
- N → ∞ required for a correct MCMC.
- A lot of wasted computations.
Analyze the PF and the MCMC together ⇒ particle MCMC.

Particle Gibbs sampling
Let x'_{1:T} = (x'_1, . . . , x'_T) be a fixed reference trajectory.
- Sample only N − 1 particles in the standard way.
- Set the Nth particle deterministically: x_t^N = x'_t and a_t^N = N.
C. Andrieu, A. Doucet and R. Holenstein, Particle Markov chain Monte Carlo methods.
Journal of the Royal Statistical Society: Series B, 72:269-342, 2010.
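The conditioning step can be sketched as a bootstrap conditional particle filter, i.e. one application of the PG kernel. The helper-function interface and the toy model in the usage example are assumptions, not from the slides:

```python
import math
import random

def conditional_pf(y, x_ref, N, f_sample, g_logdensity, mu_sample, seed=0):
    """One application of the PG kernel (bootstrap conditional PF):
    particle N is clamped to the reference trajectory x_ref; afterwards a
    new trajectory is drawn with probability proportional to w_T^i."""
    rng = random.Random(seed)
    T = len(y)
    particles = [[mu_sample(rng) for _ in range(N - 1)] + [x_ref[0]]]
    logw = [g_logdensity(y[0], xi) for xi in particles[0]]
    ancestors = []
    for t in range(1, T):
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        anc = rng.choices(range(N), weights=w, k=N - 1) + [N - 1]  # a_t^N = N
        xt = [f_sample(rng, particles[t - 1][anc[i]]) for i in range(N - 1)]
        xt.append(x_ref[t])                         # clamp particle N
        particles.append(xt)
        ancestors.append(anc)
        logw = [g_logdensity(y[t], xi) for xi in xt]
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    i = rng.choices(range(N), weights=w, k=1)[0]    # P(i) prop. to w_T^i
    traj = [0.0] * T
    for t in range(T - 1, -1, -1):                  # trace the genealogy
        traj[t] = particles[t][i]
        if t > 0:
            i = ancestors[t - 1][i]
    return traj

# Hypothetical linear Gaussian toy model and an all-zero reference trajectory:
f = lambda rng, x: 0.9 * x + rng.gauss(0.0, 0.5 ** 0.5)
g = lambda yt, x: -0.5 * (yt - x) ** 2
mu = lambda rng: rng.gauss(0.0, 1.0)
y = [0.2, -0.1, 0.4, 0.0, 0.3]
x_new = conditional_pf(y, x_ref=[0.0] * 5, N=20, f_sample=f,
                       g_logdensity=g, mu_sample=mu)
```

Iterating this map (feeding `x_new` back in as `x_ref`) is exactly the x_{1:T}-update of the particle Gibbs sampler.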
Particle Gibbs
[Figures: Particle Gibbs; the fixed reference trajectory and the N − 1 newly generated particle trajectories (State vs. Time).]
- Sampling x*_{1:T} with P(x*_{1:T} = x_{1:T}^i) ∝ w_T^i stochastically maps x'_{1:T} into x*_{1:T}.
- This implicitly defines a Markov kernel (the PG kernel) on X^T, the space of state trajectories.
Particle Gibbs
Theorem (Stationary distribution)
The joint smoothing distribution p(x_{1:T} | θ, y_{1:T}) is a stationary distribution of the PG kernel for any N ≥ 1.
C. Andrieu, A. Doucet and R. Holenstein, Particle Markov chain Monte Carlo methods.
Journal of the Royal Statistical Society: Series B, 72:269-342, 2010.
Theorem (Uniform ergodicity)
The PG kernel is uniformly ergodic under weak conditions.
Additionally, under strong mixing conditions the convergence rate does not degenerate as T → ∞, provided that N ∝ T.
F. Lindsten, R. Douc, and E. Moulines, Uniform ergodicity of the Particle Gibbs sampler.
Scandinavian Journal of Statistics, 42(3): 775-797, 2015.
ex) Particle Gibbs
Stochastic volatility model:
  x_{t+1} = 0.9 x_t + v_t,  v_t ~ N(0, θ),
  y_t = e_t exp((1/2) x_t),  e_t ~ N(0, 1).
Consider the ACF of θ[k] − E[θ | y_{1:T}].
[Figures: ACF vs. lag for PG with T = 100 (left) and T = 1000 (right), for N = 5, 20, 100, 1000.]
Ancestor Sampling
Particle Gibbs:
Let x'_{1:T} = (x'_1, . . . , x'_T) be a fixed reference trajectory.
- Sample only N − 1 particles in the standard way.
- Set the Nth particle deterministically: x_t^N = x'_t.
- Ancestor sampling: instead of setting a_t^N = N, sample a_t^N ∈ {1, . . . , N} with
  P(a_t^N = j) ∝ w_{t−1}^j f_θ(x'_t | x_{t−1}^j).
F. Lindsten, M. I. Jordan and T. B. Schön, Particle Gibbs with Ancestor sampling, Journal of Machine Learning Research, 15: 2145-2184, 2014.
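The ancestor-sampling step in isolation might look as follows in Python; the Gaussian transition density and the numbers in the usage example are illustrative:

```python
import math
import random

def sample_ancestor(rng, x_prev, logw_prev, x_ref_t, f_logdensity):
    """PGAS ancestor step: draw a_t^N = j with probability proportional
    to w_{t-1}^j * f_theta(x'_t | x_{t-1}^j)."""
    logp = [lw + f_logdensity(x_ref_t, xj)
            for lw, xj in zip(logw_prev, x_prev)]
    m = max(logp)                       # max-shift for numerical stability
    w = [math.exp(lp - m) for lp in logp]
    return rng.choices(range(len(x_prev)), weights=w, k=1)[0]

# Hypothetical transition x_t | x_{t-1} ~ N(0.9 x_{t-1}, 0.1); equal weights.
f_ld = lambda xt, xp: -0.5 * (xt - 0.9 * xp) ** 2 / 0.1
a = sample_ancestor(random.Random(0), x_prev=[0.0, 1.0, 2.0],
                    logw_prev=[0.0, 0.0, 0.0], x_ref_t=1.8,
                    f_logdensity=f_ld)
```

Because the transition density strongly favours the particle whose propagated mean (0.9 · 2.0 = 1.8) matches the reference point, the third particle is chosen with overwhelming probability; this reassignment of the reference trajectory's ancestry is what breaks the path degeneracy of plain PG.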
Particle Gibbs with Ancestor Sampling
[Figures: Particle Gibbs with ancestor sampling; regenerated particle trajectories (State vs. Time).]
PGAS vs. PG
[Figures: Particle trajectories for PGAS (top) and PG (bottom), State vs. Time.]
ex) Stochastic volatility model, cont'd
Stochastic volatility model:
  x_{t+1} = 0.9 x_t + v_t,  v_t ~ N(0, θ),
  y_t = e_t exp((1/2) x_t),  e_t ~ N(0, 1).
Consider the ACF of θ[k] − E[θ | y_{1:T}].
[Figures: ACF vs. lag for PG (left) and PGAS (right) with T = 1000, for N = 5, 20, 100, 1000.]
ex) Wiener system identification
[Diagram: Wiener system; the input u_t passes through linear dynamics G followed by a static nonlinearity h(·) to produce the output y_t, with noise sources v_t and e_t.]
Semi-parametric model: a state-space model for G and a Gaussian process model for h(·).
[Figures: Bode plot of G (magnitude and phase vs. frequency; true, posterior mean, 99 % credibility) and the estimated nonlinearity h(z) (true, posterior mean, 99 % credibility).]
F. Lindsten, T. B. Schön and M. I. Jordan, Bayesian semiparametric Wiener system identification, Automatica, 49(7): 2053-2063, 2013.
ex) Gaussian process state-space models
A GP-SSM is a flexible nonparametric dynamical systems model:
  f(·) ~ GP(m_θ(x_t), k_θ(x_t, x'_t)),
  x_{t+1} | f, x_t ~ N(f(x_t), Q),
  y_t | x_t ~ g_θ(y_t | x_t).
[Figure: Posterior over the transition function f(x, u).]
Idea: Marginalize out f(·) and infer x_{1:T} (and θ) directly.
Marginalization of f introduces non-Markovian dependencies in {x_t}.
R. Frigola, F. Lindsten, T. B. Schön and C. E. Rasmussen, Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC, Conference on Neural Information Processing Systems (NIPS), 2013.
ex) Epidemiological forecasting
Figure: Disease activity (I_t) over an eight-year period: true values, one-month-ahead predictions, and 95 % credibility intervals. The first half of the data is used for inference, the second half for one-month-ahead predictions.
Inference (and prediction) for an SIR model with environmental noise and seasonal fluctuations:
  S_{t+dt} = S_t + µP dt − µS_t dt − (1 + F v_t) β_t S_t P^{−1} I_t dt,
  I_{t+dt} = I_t − (γ + µ) I_t dt + (1 + F v_t) β_t S_t P^{−1} I_t dt,
  R_{t+dt} = R_t + γ I_t dt − µ R_t dt.
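A forward simulation of this discretisation can be sketched as follows; all parameter values, the noise scale v_t ~ N(0, dt), and the constant β in place of the seasonal β_t are illustrative assumptions:

```python
import random

def simulate_sir(P, beta, gamma, mu, F, dt, n_steps, s0, i0, seed=0):
    """Euler-style forward simulation of the SIR recursion above, with
    environmental noise v_t entering the transmission term.
    A constant beta is used here; the slides use a seasonal beta_t."""
    rng = random.Random(seed)
    S, I, R = s0, i0, P - s0 - i0
    path = []
    for _ in range(n_steps):
        v = rng.gauss(0.0, dt ** 0.5)                 # environmental noise
        infections = (1 + F * v) * beta * S * I / P * dt
        dS = mu * P * dt - mu * S * dt - infections   # births/deaths/infection
        dI = -(gamma + mu) * I * dt + infections      # recovery/death/infection
        dR = gamma * I * dt - mu * R * dt             # recovery/death
        S, I, R = S + dS, I + dI, R + dR              # simultaneous update
        path.append((S, I, R))
    return path

path = simulate_sir(P=1e6, beta=1.5, gamma=0.5, mu=0.02, F=0.1,
                    dt=0.01, n_steps=1000, s0=9e5, i0=1e5)
```

Because the infection term enters S and I with opposite signs and the birth rate µP balances the per-capita deaths, S + I + R stays equal to the population size P at every step, which is a convenient sanity check on the discretisation.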
Applications of PGAS
- Oceanography [Marcos et al., JGR, 2015]
- Econometrics [Nonejad, Economics Letters, 2015]
- Synaptic plasticity models [Linderman et al., NIPS, 2014]
- Probabilistic programming [van de Meent et al., AISTATS, 2015]
- Power disaggregation [Valera et al., NIPS, 2015]
I . . .
Ongoing: Blocking
[Diagram: the time interval 1, . . . , T divided into blocks J_1, . . . , J_5, with odd- and even-numbered blocks drawn on separate rows.]
- Update blocks of states using PG(AS).
- Stable as T → ∞ using fixed N.
- Opens up for parallelisation.
- Possible to apply blocking also over components of the state vector (spatial blocking).
S. S. Singh, F. Lindsten, and E. Moulines, Blocking Strategies and Stability of Particle Gibbs Samplers, arXiv preprint, arXiv:1509.08362, 2015.
Ongoing: Spatio-temporal models
Spatio-temporal model:
- The state x_t is spatially structured.
- High dimensionality prevents conventional methods!
- Using multiple nested particle filters we can handle dimensions of order 100–1000.
Figure: Drought probability in Sahel region in 1989.
C. A. Naesseth, F. Lindsten, and T. B. Schön, Nested Sequential Monte Carlo Methods.
Ongoing: Probabilistic graphical models
[Diagram: recursive decomposition of grid-structured graphical models into smaller subgraphs (e.g. G_3 built from copies of G_2, and G_{(4,4)} from copies of G_{(2,4)}).]
- Solving static problems using sequential methods.
- The Divide-and-Conquer PF extends particle filtering from chains to trees.
- In particular: hierarchical Bayesian models.
- Applications in genetics, linguistics, social science, . . .
F. Lindsten, A. M. Johansen, C. A. Naesseth, B. Kirkpatrick, T. B. Schön, J. Aston, and A. Bouchard-Côté, Divide-and-Conquer with Sequential Monte Carlo, arXiv preprint, arXiv:1406.4993v2, 2015.
S. Linderman, C. H. Stock, and R. P. Adams.
A framework for studying synaptic plasticity with neural spike train data.
In Advances in Neural Information Processing Systems (NIPS) 27. 2014.
M. Marcos, F. M. Calafat, A. Berihuete, and S. Dangendorf.
Long-term variations in global sea level extremes.
Journal of Geophysical Research, 120(12):8115-8134, 2015.
N. Nonejad.
Flexible model comparison of unobserved components models using particle Gibbs with ancestor sampling.
Economics Letters, 133:35-39, 2015.
I. Valera, F. Ruiz, L. Svensson, and F. Perez-Cruz.
Infinite factorial dynamical model.
In Advances in Neural Information Processing Systems (NIPS) 28. 2015.
J.-W. van de Meent, H. Yang, V. Mansinghka, and F. Wood.
Particle Gibbs with ancestor sampling for probabilistic programs.
In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.