Conditional particle filters for system identification
Fredrik Lindsten
Division of Automatic Control, Linköping University, Sweden
Identification of state-space models
Consider a nonlinear, discrete-time state-space model,

$$x_{t+1} = f_t(x_t; \theta) + v_t(\theta), \qquad y_t = h_t(x_t; \theta) + e_t(\theta).$$

We observe $y_{1:T} = \{y_1, \dots, y_T\}$. Aim: estimate $\theta$ given $y_{1:T}$.

Latent variable model
Alternate between updating $\theta$ and updating $x_{1:T}$.

Frequentists:
• Find $\hat{\theta}_{\mathrm{ML}} = \arg\max_\theta p_\theta(y_{1:T})$.
• Use e.g. the EM algorithm.

Bayesians:
• Find $p(\theta \mid y_{1:T})$.
• Use e.g. Gibbs sampling.
Invariant distribution
Definition. Let $K(y \mid x)$ be a Markovian transition kernel and let $\pi(x)$ be some probability density function. $K$ is said to leave $\pi$ invariant if

$$\int K(y \mid x)\,\pi(x)\,\mathrm{d}x = \pi(y).$$

• A Markov chain $\{X_n\}_{n \ge 1}$ with transition kernel $K$ has $\pi$ as its stationary distribution.
• If the chain is ergodic, then $\pi$ is also its limiting distribution.
ex) Invariant distribution
[Figure: simulated chain trajectory ($x_t$ vs. time, shown over 500 and 10 000 steps) and the corresponding empirical density, illustrating convergence to the invariant distribution.]
Markov chain Monte Carlo
Objective: sample from an intractable target distribution $\pi(x)$.

Markov chain Monte Carlo (MCMC):
• Find a kernel $K(y \mid x)$ which leaves $\pi(x)$ invariant (the key step).
• Sample a Markov chain with kernel $K$.
• Discard the transient (burn-in) and use the sample path as approximate samples from $\pi$.
Gibbs sampler
ex) Sample from

$$\mathcal{N}\!\left( \begin{pmatrix} x \\ y \end{pmatrix}; \begin{pmatrix} 10 \\ 10 \end{pmatrix}, \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \right).$$

Gibbs sampler:
• Draw $x' \sim \pi(x \mid y)$;
• Draw $y' \sim \pi(y \mid x')$.

[Figure: scatter plot of the Gibbs chain in the $(X, Y)$-plane.]
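The two conditional draws above can be written out directly. A minimal Python sketch of this exact bivariate example follows; the conditional means and variances come from the standard Gaussian conditioning formulas, and the variable names are my own:

```python
import random

# Target: N([10, 10], [[2, 1], [1, 1]]).
# Gaussian conditioning gives:
#   x | y  ~ N(10 + (1/1)(y - 10), 2 - 1^2/1) = N(y, 1)
#   y | x  ~ N(10 + (1/2)(x - 10), 1 - 1^2/2) = N(10 + 0.5(x - 10), 0.5)

def gibbs(n_iter, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0          # deliberately bad starting point
    samples = []
    for _ in range(n_iter):
        x = rng.gauss(y, 1.0)                                  # x' ~ pi(x | y)
        y = rng.gauss(10.0 + 0.5 * (x - 10.0), 0.5 ** 0.5)     # y' ~ pi(y | x')
        samples.append((x, y))
    return samples

samples = gibbs(20000)
burned = samples[1000:]   # discard the burn-in
mean_x = sum(s[0] for s in burned) / len(burned)
mean_y = sum(s[1] for s in burned) / len(burned)
```

Despite the poor initialization, the chain quickly finds the mode near $(10, 10)$ and the sample means settle close to the true means.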
Gibbs sampler for SSMs
Aim: find $p(\theta, x_{1:T} \mid y_{1:T})$. Alternate between updating $\theta$ and updating $x_{1:T}$.

MCMC: Gibbs sampling for state-space models. Iterate:
• Draw $\theta[r] \sim p(\theta \mid x_{1:T}[r-1], y_{1:T})$;
• Draw $x_{1:T}[r] \sim p(x_{1:T} \mid \theta[r], y_{1:T})$.

The above procedure results in a Markov chain $\{\theta[r], x_{1:T}[r]\}_{r \ge 1}$ with stationary distribution $p(\theta, x_{1:T} \mid y_{1:T})$.

Linear Gaussian state-space model
ex) Gibbs sampling for linear system identification,

$$\begin{pmatrix} x_{t+1} \\ y_t \end{pmatrix} = \begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} x_t \\ u_t \end{pmatrix} + \begin{pmatrix} v_t \\ e_t \end{pmatrix}.$$

Iterate:
• Draw $\theta' \sim p(\theta \mid x_{1:T}, y_{1:T})$;
• Draw $x'_{1:T} \sim p(x_{1:T} \mid \theta', y_{1:T})$.

[Figure: Bode magnitude and phase plots showing the true system, the posterior mean, and a 95 % credibility region.]
Gibbs sampler for general SSM?
What about the general nonlinear/non-Gaussian case?
• Draw $\theta' \sim p(\theta \mid x_{1:T}, y_{1:T})$: OK!
• Draw $x'_{1:T} \sim p(x_{1:T} \mid \theta', y_{1:T})$: Hard!

Problem: $p(x_{1:T} \mid \theta, y_{1:T})$ is not available!
Idea: approximate $p(x_{1:T} \mid \theta, y_{1:T})$ using a particle filter.

The particle filter
• Resampling: $\{x_{1:t-1}^i, w_{t-1}^i\}_{i=1}^N \rightarrow \{\tilde{x}_{1:t-1}^i, 1/N\}_{i=1}^N$.
• Propagation: $x_t^i \sim R_t^\theta(\mathrm{d}x_t \mid \tilde{x}_{1:t-1}^i)$ and $x_{1:t}^i = \{\tilde{x}_{1:t-1}^i, x_t^i\}$.
• Weighting: $w_t^i = W_t^\theta(x_{1:t}^i)$.
$\Rightarrow \{x_{1:t}^i, w_t^i\}_{i=1}^N$

Equivalently, resampling and propagation can be merged into a single step by sampling an ancestor index $a_t$ together with the new state:
• Resampling + Propagation: $(a_t^i, x_t^i) \sim M_t^\theta(a_t, x_t) = \dfrac{w_{t-1}^{a_t}}{\sum_l w_{t-1}^l}\, R_t^\theta(x_t \mid x_{1:t-1}^{a_t})$.
• Weighting: $w_t^i = W_t^\theta(x_{1:t}^i)$.
The particle filter
Algorithm: Particle filter (PF)
1. Initialize ($t = 1$):
   (a) Draw $x_1^i \sim R_1^\theta(x_1)$ for $i = 1, \dots, N$.
   (b) Set $w_1^i = W_1^\theta(x_1^i)$ for $i = 1, \dots, N$.
2. For $t = 2, \dots, T$:
   (a) Draw $(a_t^i, x_t^i) \sim M_t^\theta(a_t, x_t)$ for $i = 1, \dots, N$.
   (b) Set $x_{1:t}^i = \{x_{1:t-1}^{a_t^i}, x_t^i\}$ and $w_t^i = W_t^\theta(x_{1:t}^i)$ for $i = 1, \dots, N$.
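A minimal Python sketch of this algorithm, instantiated as a bootstrap PF (taking $R_t^\theta$ to be the transition density and $W_t^\theta$ the observation likelihood). The scalar linear Gaussian model used here is my own choice for illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model (an assumption for this sketch):
#   x_{t+1} = 0.8 x_t + v_t,  v_t ~ N(0, 1);   y_t = x_t + e_t,  e_t ~ N(0, 0.5).
A, Q, R = 0.8, 1.0, 0.5

# Simulate data.
T = 50
x = np.zeros(T)
for t in range(1, T):
    x[t] = A * x[t - 1] + rng.normal(0.0, np.sqrt(Q))
y = x + rng.normal(0.0, np.sqrt(R), size=T)

def bootstrap_pf(y, N=500):
    T = len(y)
    particles = rng.normal(0.0, 1.0, size=N)          # x_1^i ~ R_1 (here: a prior guess)
    w = np.exp(-0.5 * (y[0] - particles) ** 2 / R)    # w_1^i = p(y_1 | x_1^i)
    w /= w.sum()
    means = [np.sum(w * particles)]
    for t in range(1, T):
        a = rng.choice(N, size=N, p=w)                # resampling: ancestor indices a_t^i
        particles = A * particles[a] + rng.normal(0.0, np.sqrt(Q), size=N)  # propagation
        w = np.exp(-0.5 * (y[t] - particles) ** 2 / R)                      # weighting
        w /= w.sum()
        means.append(np.sum(w * particles))
    return np.array(means)

filt_means = bootstrap_pf(y)
```

The weighted particle averages `filt_means` approximate the filtering means $\mathrm{E}[x_t \mid y_{1:t}]$ and track the simulated state closely.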
The particle filter
[Figure: particle trajectories (state vs. time) generated by the PF.]
Sampling based on the PF
With $P(x'_{1:T} = x_{1:T}^i) \propto w_T^i$ we get

$$x'_{1:T} \overset{\text{approx.}}{\sim} p(x_{1:T} \mid \theta, y_{1:T}).$$

[Figure: particle trajectories (state vs. time), from which one path is sampled according to the final weights.]
Problems
Problems with this approach:
• Based on a PF $\Rightarrow$ only an approximate sample.
• Does not leave $p(\theta, x_{1:T} \mid y_{1:T})$ invariant!
• Relies on a large $N$ to be successful.
• A lot of wasted computations.

To get around these problems: use a conditional particle filter (CPF), in which one prespecified path is retained throughout the sampler.

C. Andrieu, A. Doucet and R. Holenstein, "Particle Markov chain Monte Carlo methods", Journal of the Royal Statistical Society: Series B, 72:269-342, 2010.
Conditional PF with ancestor sampling
Algorithm: CPF with ancestor sampling (CPF-AS), conditioned on $x^\star_{1:T}$
1. Initialize ($t = 1$):
   (a) Draw $x_1^i \sim R_1^\theta(x_1)$ for $i \neq N$ and set $x_1^N = x_1^\star$.
   (b) Set $w_1^i = W_1^\theta(x_1^i)$ for $i = 1, \dots, N$.
2. For $t = 2, \dots, T$:
   (a) Draw $(a_t^i, x_t^i) \sim M_t^\theta(a_t, x_t)$ for $i \neq N$ and set $x_t^N = x_t^\star$.
   (b) Draw $a_t^N$ with $P(a_t^N = i) \propto w_{t-1}^i\, p(x_t^\star \mid \theta, x_{t-1}^i)$.
   (c) Set $x_{1:t}^i = \{x_{1:t-1}^{a_t^i}, x_t^i\}$ and $w_t^i = W_t^\theta(x_{1:t}^i)$ for $i = 1, \dots, N$.

F. Lindsten, M. I. Jordan and T. B. Schön, "Ancestor sampling for particle Gibbs", in Proceedings of the 2012 Conference on Neural Information Processing Systems (NIPS), (accepted).
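A Python sketch of CPF-AS in a bootstrap-type setting. The scalar linear Gaussian model, the bootstrap proposal, and all variable names are assumptions for this sketch, not the slides' general setting:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model (an assumption for this sketch):
#   x_{t+1} = 0.8 x_t + v_t,  v_t ~ N(0, 1);   y_t = x_t + e_t,  e_t ~ N(0, 0.5).
A, Q, R = 0.8, 1.0, 0.5

def cpf_as(y, x_star, N=20):
    """One CPF-AS sweep conditioned on x_star; returns a newly sampled path."""
    T = len(y)
    X = np.zeros((T, N))                    # particle positions
    anc = np.zeros((T, N), dtype=int)       # ancestor indices
    X[0] = rng.normal(0.0, 1.0, size=N)
    X[0, -1] = x_star[0]                    # condition: x_1^N = x*_1
    w = np.exp(-0.5 * (y[0] - X[0]) ** 2 / R)
    w /= w.sum()
    for t in range(1, T):
        anc[t] = rng.choice(N, size=N, p=w)                                 # resampling
        X[t] = A * X[t - 1, anc[t]] + rng.normal(0.0, np.sqrt(Q), size=N)   # propagation
        X[t, -1] = x_star[t]                # condition: x_t^N = x*_t
        # Ancestor sampling: P(a_t^N = i) propto w_{t-1}^i p(x*_t | x_{t-1}^i).
        pa = w * np.exp(-0.5 * (x_star[t] - A * X[t - 1]) ** 2 / Q)
        pa /= pa.sum()
        anc[t, -1] = rng.choice(N, p=pa)
        w = np.exp(-0.5 * (y[t] - X[t]) ** 2 / R)                           # weighting
        w /= w.sum()
    # Sample one trajectory propto the final weights, tracing its lineage backwards.
    j = rng.choice(N, p=w)
    path = np.zeros(T)
    for t in range(T - 1, -1, -1):
        path[t] = X[t, j]
        j = anc[t, j]
    return path

# Simulate data, then iterate the CPF-AS kernel from an arbitrary initial path.
T = 25
x = np.zeros(T)
for t in range(1, T):
    x[t] = A * x[t - 1] + rng.normal(0.0, np.sqrt(Q))
y = x + rng.normal(0.0, np.sqrt(R), size=T)

path = np.zeros(T)
for _ in range(50):
    path = cpf_as(y, path)
```

Each call defines one application of the Markov kernel of the theorem below: even with only $N = 20$ particles, iterating it moves the conditioned path from the all-zero initialization into the bulk of $p(x_{1:T} \mid \theta, y_{1:T})$.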
Conditional PF with ancestor sampling
Theorem
For any $N \ge 2$, the procedure:
(i) run CPF-AS($x^\star_{1:T}$);
(ii) sample with $P(x'_{1:T} = x_{1:T}^i) \propto w_T^i$;
defines a Markov kernel on $\mathsf{X}^T$ which leaves $p(x_{1:T} \mid \theta, y_{1:T})$ invariant.

Proof. Ask me later...
CPF vs. CPF-AS
[Figure: particle trajectories (state vs. time) for the CPF (left) and CPF-AS (right).]
Particle Gibbs with ancestor sampling
Bayesian identification: Gibbs + CPF-AS = PG-AS

Algorithm: PG-AS (particle Gibbs with ancestor sampling)
1. Initialize: set $\{\theta[0], x_{1:T}[0]\}$ arbitrarily.
2. For $r \ge 1$, iterate:
   (a) Draw $\theta[r] \sim p(\theta \mid x_{1:T}[r-1], y_{1:T})$.
   (b) Run CPF-AS($x_{1:T}[r-1]$), targeting $p(x_{1:T} \mid \theta[r], y_{1:T})$.
   (c) Sample with $P(x_{1:T}[r] = x_{1:T}^i) \propto w_T^i$.

For any number of particles $N \ge 2$, the Markov chain $\{\theta[r], x_{1:T}[r]\}_{r \ge 1}$ has stationary distribution $p(\theta, x_{1:T} \mid y_{1:T})$.

ex) Stochastic volatility
PG-AS for the stochastic volatility model,

$$x_{t+1} = \theta_1 x_t + v_t, \quad v_t \sim \mathcal{N}(0, \theta_2), \qquad y_t = e_t \exp\!\left(\tfrac{1}{2} x_t\right), \quad e_t \sim \mathcal{N}(0, 1).$$

[Figure: Markov chain traces of $\theta_1$ and $\theta_2$ vs. iteration number, and the resulting posterior densities, for $N = 5, 20, 100, 1000, 5000$.]
Maximum likelihood inference
Back to the frequentist objective,

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_\theta p_\theta(y_{1:T}).$$

Expectation maximization (EM). Iterate:
(E) $Q(\theta, \theta[r-1]) = \mathrm{E}_{\theta[r-1]}\!\left[\log p_\theta(x_{1:T}, y_{1:T}) \mid y_{1:T}\right]$;
(M) $\theta[r] = \arg\max_\theta Q(\theta, \theta[r-1])$.

Problem: the E-step requires us to solve a smoothing problem, i.e. to compute an expectation under $p_\theta(x_{1:T} \mid y_{1:T})$.

Particle smoother EM
Idea: use a particle smoother (PS) for the E-step,

$$p_{\theta'}(x_{1:T} \mid y_{1:T}) \approx \frac{1}{N} \sum_{j=1}^N \delta_{\tilde{x}^j_{1:T}}(x_{1:T}).$$

The E-step is approximated with

$$\widehat{Q}^N(\theta, \theta') \triangleq \frac{1}{N} \sum_{j=1}^N \log p_\theta(\tilde{x}^j_{1:T}, y_{1:T}).$$

Problems with PS-EM
.Problems with PS-EM
23(28)Problems with PS-EM,
• Doubly asymptotic – requiresN
→
∞andR→
∞ simultaneously to converge.• Relies on largeNto be successful.
• A lot of wasted computations.
Stochastic approximation EM
Assume, for the time being, that we can sample from $p_\theta(x_{1:T} \mid y_{1:T})$.

Stochastic approximation EM (SAEM): replace the E-step with

$$\widehat{Q}_r(\theta) = \widehat{Q}_{r-1}(\theta) + \gamma_r \left( \frac{1}{M} \sum_{j=1}^M \log p_\theta(\tilde{x}^j_{1:T}, y_{1:T}) - \widehat{Q}_{r-1}(\theta) \right),$$

where $\tilde{x}^j_{1:T} \overset{\text{i.i.d.}}{\sim} p_\theta(x_{1:T} \mid y_{1:T})$ for $j = 1, \dots, M$.

SAEM converges to a maximum of $p_\theta(y_{1:T})$ for any $M \ge 1$ under standard stochastic approximation conditions.

B. Delyon, M. Lavielle and E. Moulines, "Convergence of a stochastic approximation version of the EM algorithm", The Annals of Statistics, 27:94-128, 1999.
Stochastic approximation EM
• Bad news: we cannot sample from $p_\theta(x_{1:T} \mid y_{1:T})$.
• Good news: it is enough to sample from a uniformly ergodic Markov kernel leaving $p_\theta(x_{1:T} \mid y_{1:T})$ invariant.

We can use CPF-AS to sample the states!
SAEM-AS
Maximum likelihood identification: SAEM + CPF-AS = SAEM-AS

Algorithm: SAEM-AS (particle SAEM with ancestor sampling)
1. Initialize: set $\{\theta[0], x_{1:T}[0]\}$ arbitrarily.
2. For $r \ge 1$, iterate:
   (a) Run CPF-AS($x_{1:T}[r-1]$), targeting $p(x_{1:T} \mid \theta[r-1], y_{1:T})$.
   (b) Compute $\widehat{Q}_r(\theta) = \widehat{Q}_{r-1}(\theta) + \gamma_r \left( \sum_{i=1}^N w_T^i \log p_\theta(x_{1:T}^i, y_{1:T}) - \widehat{Q}_{r-1}(\theta) \right)$.
   (c) Compute $\theta[r] = \arg\max_\theta \widehat{Q}_r(\theta)$.
   (d) Sample with $P(x_{1:T}[r] = x_{1:T}^i) \propto w_T^i$.

ex) Stochastic volatility
SAEM-AS with $N = 10$ for the stochastic volatility model,

$$x_{t+1} = \theta_1 x_t + v_t, \quad v_t \sim \mathcal{N}(0, \theta_2), \qquad y_t = e_t \exp\!\left(\tfrac{1}{2} x_t\right), \quad e_t \sim \mathcal{N}(0, 1).$$

[Figure: traces of $\theta_1$ and $\theta_2$ vs. iteration number.]
Conclusions
Conclusions:
• Conditional particle filters are useful for identification!
• CPF-AS defines a kernel on $\mathsf{X}^T$ leaving $p_\theta(x_{1:T} \mid y_{1:T})$ invariant.
• CPF-AS consists of two parts:
  • Conditioning: ensures the correct stationary distribution for any $N$.
  • Ancestor sampling: mitigates path degeneracy and enables movement around the conditioned path.
• PG-AS for Bayesian inference and SAEM-AS for maximum likelihood inference. Both work with few particles.

Future work includes finding...
• ...stronger ergodicity results for CPF-AS (uniformly ergodic?).
• ...exact conditions for almost sure convergence of SAEM-AS.