Conditional particle filters for system identification
Fredrik Lindsten
Division of Automatic Control, Linköping University, Sweden
Identification of state-space models
Consider a nonlinear, discrete-time state-space model,

$$x_{t+1} = f_t(x_t; \theta) + v_t(\theta), \qquad y_t = h_t(x_t; \theta) + e_t(\theta).$$

We observe $y_{1:T} = \{y_1, \dots, y_T\}$. Aim: estimate $\theta$ given $y_{1:T}$.

Latent variable model
Alternate between updating $\theta$ and updating $x_{1:T}$.

Frequentists:
• Find $\hat{\theta}_{\mathrm{ML}} = \arg\max_\theta p_\theta(y_{1:T})$.
• Use e.g. the EM algorithm.

Bayesians:
• Find $p(\theta \mid y_{1:T})$.
• Use e.g. Gibbs sampling.
Invariant distribution
Definition. Let $K(y \mid x)$ be a Markovian transition kernel and let $\pi(x)$ be some probability density function. $K$ is said to leave $\pi$ invariant if

$$\int K(y \mid x)\,\pi(x)\,\mathrm{d}x = \pi(y).$$

• A Markov chain $\{X_n\}_{n \ge 1}$ with transition kernel $K$ has $\pi$ as its stationary distribution.
• If the chain is ergodic, then $\pi$ is also its limiting distribution.
ex) Invariant distribution
[Figure: simulated chain trajectory ($x_t$ vs. time, shown over 500 and 10 000 steps) and the corresponding empirical density, illustrating convergence to the invariant distribution.]
Markov chain Monte Carlo
Objective: sample from an intractable target distribution $\pi(x)$.

Markov chain Monte Carlo (MCMC):
• Find a kernel $K(y \mid x)$ which leaves $\pi(x)$ invariant (the key step).
• Sample a Markov chain with kernel $K$.
• Discard the transient (burn-in) and use the sample path as approximate samples from $\pi$.
Gibbs sampler
ex) Sample from

$$\mathcal{N}\!\left( \begin{pmatrix} x \\ y \end{pmatrix}; \begin{pmatrix} 10 \\ 10 \end{pmatrix}, \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix} \right).$$

Gibbs sampler:
• Draw $x' \sim \pi(x \mid y)$;
• Draw $y' \sim \pi(y \mid x')$.

[Figure: scatter plot of the Gibbs chain in the $(X, Y)$-plane.]
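The two conditional draws above can be written out directly. A minimal Python sketch of this exact bivariate example follows; the conditional means and variances come from the standard Gaussian conditioning formulas, and the variable names are my own:

```python
import random

# Target: N([10, 10], [[2, 1], [1, 1]]).
# Gaussian conditioning gives:
#   x | y  ~ N(10 + (1/1)(y - 10), 2 - 1^2/1) = N(y, 1)
#   y | x  ~ N(10 + (1/2)(x - 10), 1 - 1^2/2) = N(10 + 0.5(x - 10), 0.5)

def gibbs(n_iter, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0          # deliberately bad starting point
    samples = []
    for _ in range(n_iter):
        x = rng.gauss(y, 1.0)                                  # x' ~ pi(x | y)
        y = rng.gauss(10.0 + 0.5 * (x - 10.0), 0.5 ** 0.5)     # y' ~ pi(y | x')
        samples.append((x, y))
    return samples

samples = gibbs(20000)
burned = samples[1000:]   # discard the burn-in
mean_x = sum(s[0] for s in burned) / len(burned)
mean_y = sum(s[1] for s in burned) / len(burned)
```

Despite the poor initialization, the chain quickly finds the mode near $(10, 10)$ and the sample means settle close to the true means.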
Gibbs sampler for SSMs
Aim: find $p(\theta, x_{1:T} \mid y_{1:T})$. Alternate between updating $\theta$ and updating $x_{1:T}$.

MCMC: Gibbs sampling for state-space models. Iterate:
• Draw $\theta[r] \sim p(\theta \mid x_{1:T}[r-1], y_{1:T})$;
• Draw $x_{1:T}[r] \sim p(x_{1:T} \mid \theta[r], y_{1:T})$.

The above procedure results in a Markov chain $\{\theta[r], x_{1:T}[r]\}_{r \ge 1}$ with stationary distribution $p(\theta, x_{1:T} \mid y_{1:T})$.

Linear Gaussian state-space model
ex) Gibbs sampling for linear system identification,

$$\begin{pmatrix} x_{t+1} \\ y_t \end{pmatrix} = \begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} x_t \\ u_t \end{pmatrix} + \begin{pmatrix} v_t \\ e_t \end{pmatrix}.$$

Iterate:
• Draw $\theta' \sim p(\theta \mid x_{1:T}, y_{1:T})$;
• Draw $x'_{1:T} \sim p(x_{1:T} \mid \theta', y_{1:T})$.

[Figure: Bode magnitude and phase plots showing the true system, the posterior mean, and a 95 % credibility region.]
Gibbs sampler for general SSM?
What about the general nonlinear/non-Gaussian case?
• Draw $\theta' \sim p(\theta \mid x_{1:T}, y_{1:T})$: OK!
• Draw $x'_{1:T} \sim p(x_{1:T} \mid \theta', y_{1:T})$: Hard!

Problem: $p(x_{1:T} \mid \theta, y_{1:T})$ is not available!
Idea: approximate $p(x_{1:T} \mid \theta, y_{1:T})$ using a particle filter.

The particle filter
• Resampling: $\{x_{1:t-1}^i, w_{t-1}^i\}_{i=1}^N \rightarrow \{\tilde{x}_{1:t-1}^i, 1/N\}_{i=1}^N$.
• Propagation: $x_t^i \sim R_t^\theta(\mathrm{d}x_t \mid \tilde{x}_{1:t-1}^i)$ and $x_{1:t}^i = \{\tilde{x}_{1:t-1}^i, x_t^i\}$.
• Weighting: $w_t^i = W_t^\theta(x_{1:t}^i)$.
$\Rightarrow \{x_{1:t}^i, w_t^i\}_{i=1}^N$

Equivalently, resampling and propagation can be merged into a single step by sampling an ancestor index $a_t$ together with the new state:
• Resampling + Propagation: $(a_t^i, x_t^i) \sim M_t^\theta(a_t, x_t) = \dfrac{w_{t-1}^{a_t}}{\sum_l w_{t-1}^l}\, R_t^\theta(x_t \mid x_{1:t-1}^{a_t})$.
• Weighting: $w_t^i = W_t^\theta(x_{1:t}^i)$.
The particle filter
Algorithm: Particle filter (PF)
1. Initialize ($t = 1$):
   (a) Draw $x_1^i \sim R_1^\theta(x_1)$ for $i = 1, \dots, N$.
   (b) Set $w_1^i = W_1^\theta(x_1^i)$ for $i = 1, \dots, N$.
2. For $t = 2, \dots, T$:
   (a) Draw $(a_t^i, x_t^i) \sim M_t^\theta(a_t, x_t)$ for $i = 1, \dots, N$.
   (b) Set $x_{1:t}^i = \{x_{1:t-1}^{a_t^i}, x_t^i\}$ and $w_t^i = W_t^\theta(x_{1:t}^i)$ for $i = 1, \dots, N$.
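A minimal Python sketch of this algorithm, instantiated as a bootstrap PF (taking $R_t^\theta$ to be the transition density and $W_t^\theta$ the observation likelihood). The scalar linear Gaussian model used here is my own choice for illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model (an assumption for this sketch):
#   x_{t+1} = 0.8 x_t + v_t,  v_t ~ N(0, 1);   y_t = x_t + e_t,  e_t ~ N(0, 0.5).
A, Q, R = 0.8, 1.0, 0.5

# Simulate data.
T = 50
x = np.zeros(T)
for t in range(1, T):
    x[t] = A * x[t - 1] + rng.normal(0.0, np.sqrt(Q))
y = x + rng.normal(0.0, np.sqrt(R), size=T)

def bootstrap_pf(y, N=500):
    T = len(y)
    particles = rng.normal(0.0, 1.0, size=N)          # x_1^i ~ R_1 (here: a prior guess)
    w = np.exp(-0.5 * (y[0] - particles) ** 2 / R)    # w_1^i = p(y_1 | x_1^i)
    w /= w.sum()
    means = [np.sum(w * particles)]
    for t in range(1, T):
        a = rng.choice(N, size=N, p=w)                # resampling: ancestor indices a_t^i
        particles = A * particles[a] + rng.normal(0.0, np.sqrt(Q), size=N)  # propagation
        w = np.exp(-0.5 * (y[t] - particles) ** 2 / R)                      # weighting
        w /= w.sum()
        means.append(np.sum(w * particles))
    return np.array(means)

filt_means = bootstrap_pf(y)
```

The weighted particle averages `filt_means` approximate the filtering means $\mathrm{E}[x_t \mid y_{1:t}]$ and track the simulated state closely.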
The particle filter
[Figure: particle trajectories (state vs. time) generated by the PF.]
Sampling based on the PF
With $P(x'_{1:T} = x_{1:T}^i) \propto w_T^i$ we get

$$x'_{1:T} \overset{\text{approx.}}{\sim} p(x_{1:T} \mid \theta, y_{1:T}).$$

[Figure: particle trajectories (state vs. time), from which one path is sampled according to the final weights.]
Problems
Problems with this approach:
• Based on a PF $\Rightarrow$ only an approximate sample.
• Does not leave $p(\theta, x_{1:T} \mid y_{1:T})$ invariant!
• Relies on a large $N$ to be successful.
• A lot of wasted computations.

To get around these problems: use a conditional particle filter (CPF), in which one prespecified path is retained throughout the sampler.

C. Andrieu, A. Doucet and R. Holenstein, "Particle Markov chain Monte Carlo methods", Journal of the Royal Statistical Society: Series B, 72:269-342, 2010.
Conditional PF with ancestor sampling
Algorithm: CPF with ancestor sampling (CPF-AS), conditioned on $x^\star_{1:T}$
1. Initialize ($t = 1$):
   (a) Draw $x_1^i \sim R_1^\theta(x_1)$ for $i \neq N$ and set $x_1^N = x_1^\star$.
   (b) Set $w_1^i = W_1^\theta(x_1^i)$ for $i = 1, \dots, N$.
2. For $t = 2, \dots, T$:
   (a) Draw $(a_t^i, x_t^i) \sim M_t^\theta(a_t, x_t)$ for $i \neq N$ and set $x_t^N = x_t^\star$.
   (b) Draw $a_t^N$ with $P(a_t^N = i) \propto w_{t-1}^i\, p(x_t^\star \mid \theta, x_{t-1}^i)$.
   (c) Set $x_{1:t}^i = \{x_{1:t-1}^{a_t^i}, x_t^i\}$ and $w_t^i = W_t^\theta(x_{1:t}^i)$ for $i = 1, \dots, N$.

F. Lindsten, M. I. Jordan and T. B. Schön, "Ancestor sampling for particle Gibbs", in Proceedings of the 2012 Conference on Neural Information Processing Systems (NIPS), (accepted).
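A Python sketch of CPF-AS in a bootstrap-type setting. The scalar linear Gaussian model, the bootstrap proposal, and all variable names are assumptions for this sketch, not the slides' general setting:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model (an assumption for this sketch):
#   x_{t+1} = 0.8 x_t + v_t,  v_t ~ N(0, 1);   y_t = x_t + e_t,  e_t ~ N(0, 0.5).
A, Q, R = 0.8, 1.0, 0.5

def cpf_as(y, x_star, N=20):
    """One CPF-AS sweep conditioned on x_star; returns a newly sampled path."""
    T = len(y)
    X = np.zeros((T, N))                    # particle positions
    anc = np.zeros((T, N), dtype=int)       # ancestor indices
    X[0] = rng.normal(0.0, 1.0, size=N)
    X[0, -1] = x_star[0]                    # condition: x_1^N = x*_1
    w = np.exp(-0.5 * (y[0] - X[0]) ** 2 / R)
    w /= w.sum()
    for t in range(1, T):
        anc[t] = rng.choice(N, size=N, p=w)                                 # resampling
        X[t] = A * X[t - 1, anc[t]] + rng.normal(0.0, np.sqrt(Q), size=N)   # propagation
        X[t, -1] = x_star[t]                # condition: x_t^N = x*_t
        # Ancestor sampling: P(a_t^N = i) propto w_{t-1}^i p(x*_t | x_{t-1}^i).
        pa = w * np.exp(-0.5 * (x_star[t] - A * X[t - 1]) ** 2 / Q)
        pa /= pa.sum()
        anc[t, -1] = rng.choice(N, p=pa)
        w = np.exp(-0.5 * (y[t] - X[t]) ** 2 / R)                           # weighting
        w /= w.sum()
    # Sample one trajectory propto the final weights, tracing its lineage backwards.
    j = rng.choice(N, p=w)
    path = np.zeros(T)
    for t in range(T - 1, -1, -1):
        path[t] = X[t, j]
        j = anc[t, j]
    return path

# Simulate data, then iterate the CPF-AS kernel from an arbitrary initial path.
T = 25
x = np.zeros(T)
for t in range(1, T):
    x[t] = A * x[t - 1] + rng.normal(0.0, np.sqrt(Q))
y = x + rng.normal(0.0, np.sqrt(R), size=T)

path = np.zeros(T)
for _ in range(50):
    path = cpf_as(y, path)
```

Each call defines one application of the Markov kernel of the theorem below: even with only $N = 20$ particles, iterating it moves the conditioned path from the all-zero initialization into the bulk of $p(x_{1:T} \mid \theta, y_{1:T})$.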
Conditional PF with ancestor sampling
Theorem
For any $N \ge 2$, the procedure:
(i) run CPF-AS($x^\star_{1:T}$);
(ii) sample with $P(x'_{1:T} = x_{1:T}^i) \propto w_T^i$;
defines a Markov kernel on $\mathsf{X}^T$ which leaves $p(x_{1:T} \mid \theta, y_{1:T})$ invariant.

Proof. Ask me later...
CPF vs. CPF-AS
[Figure: particle trajectories (state vs. time) for the CPF (left) and CPF-AS (right).]
Particle Gibbs with ancestor sampling
Bayesian identification: Gibbs + CPF-AS = PG-AS

Algorithm: PG-AS (particle Gibbs with ancestor sampling)
1. Initialize: set $\{\theta[0], x_{1:T}[0]\}$ arbitrarily.
2. For $r \ge 1$, iterate:
   (a) Draw $\theta[r] \sim p(\theta \mid x_{1:T}[r-1], y_{1:T})$.
   (b) Run CPF-AS($x_{1:T}[r-1]$), targeting $p(x_{1:T} \mid \theta[r], y_{1:T})$.
   (c) Sample with $P(x_{1:T}[r] = x_{1:T}^i) \propto w_T^i$.

For any number of particles $N \ge 2$, the Markov chain $\{\theta[r], x_{1:T}[r]\}_{r \ge 1}$ has stationary distribution $p(\theta, x_{1:T} \mid y_{1:T})$.

ex) Stochastic volatility
PG-AS for the stochastic volatility model,

$$x_{t+1} = \theta_1 x_t + v_t, \quad v_t \sim \mathcal{N}(0, \theta_2), \qquad y_t = e_t \exp\!\left(\tfrac{1}{2} x_t\right), \quad e_t \sim \mathcal{N}(0, 1).$$

[Figure: Markov chain traces of $\theta_1$ and $\theta_2$ vs. iteration number, and the resulting posterior densities, for $N = 5, 20, 100, 1000, 5000$.]
Maximum likelihood inference
Back to the frequentist objective,

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_\theta p_\theta(y_{1:T}).$$

Expectation maximization (EM). Iterate:
(E) $Q(\theta, \theta[r-1]) = \mathrm{E}_{\theta[r-1]}\!\left[\log p_\theta(x_{1:T}, y_{1:T}) \mid y_{1:T}\right]$;
(M) $\theta[r] = \arg\max_\theta Q(\theta, \theta[r-1])$.

Problem: the E-step requires us to solve a smoothing problem, i.e. to compute an expectation under $p_\theta(x_{1:T} \mid y_{1:T})$.

Particle smoother EM
Idea: use a particle smoother (PS) for the E-step,

$$p_{\theta'}(x_{1:T} \mid y_{1:T}) \approx \frac{1}{N} \sum_{j=1}^N \delta_{\tilde{x}^j_{1:T}}(x_{1:T}).$$

The E-step is approximated with

$$\widehat{Q}^N(\theta, \theta') \triangleq \frac{1}{N} \sum_{j=1}^N \log p_\theta(\tilde{x}^j_{1:T}, y_{1:T}).$$

Problems with PS-EM
.Problems with PS-EM
23(28)Problems with PS-EM,
• Doubly asymptotic – requiresN
→
∞andR→
∞ simultaneously to converge.• Relies on largeNto be successful.
• A lot of wasted computations.
Stochastic approximation EM
Assume, for the time being, that we can sample from $p_\theta(x_{1:T} \mid y_{1:T})$.

Stochastic approximation EM (SAEM): replace the E-step with

$$\widehat{Q}_r(\theta) = \widehat{Q}_{r-1}(\theta) + \gamma_r \left( \frac{1}{M} \sum_{j=1}^M \log p_\theta(\tilde{x}^j_{1:T}, y_{1:T}) - \widehat{Q}_{r-1}(\theta) \right),$$

where $\tilde{x}^j_{1:T} \overset{\text{i.i.d.}}{\sim} p_\theta(x_{1:T} \mid y_{1:T})$ for $j = 1, \dots, M$.

SAEM converges to a maximum of $p_\theta(y_{1:T})$ for any $M \ge 1$ under standard stochastic approximation conditions.

B. Delyon, M. Lavielle and E. Moulines, "Convergence of a stochastic approximation version of the EM algorithm", The Annals of Statistics, 27:94-128, 1999.
Stochastic approximation EM
• Bad news: we cannot sample from $p_\theta(x_{1:T} \mid y_{1:T})$.
• Good news: it is enough to sample from a uniformly ergodic Markov kernel leaving $p_\theta(x_{1:T} \mid y_{1:T})$ invariant.

We can use CPF-AS to sample the states!
SAEM-AS
Maximum likelihood identification: SAEM + CPF-AS = SAEM-AS

Algorithm: SAEM-AS (particle SAEM with ancestor sampling)
1. Initialize: set $\{\theta[0], x_{1:T}[0]\}$ arbitrarily.
2. For $r \ge 1$, iterate:
   (a) Run CPF-AS($x_{1:T}[r-1]$), targeting $p(x_{1:T} \mid \theta[r-1], y_{1:T})$.
   (b) Compute $\widehat{Q}_r(\theta) = \widehat{Q}_{r-1}(\theta) + \gamma_r \left( \sum_{i=1}^N w_T^i \log p_\theta(x_{1:T}^i, y_{1:T}) - \widehat{Q}_{r-1}(\theta) \right)$.
   (c) Compute $\theta[r] = \arg\max_\theta \widehat{Q}_r(\theta)$.
   (d) Sample with $P(x_{1:T}[r] = x_{1:T}^i) \propto w_T^i$.

ex) Stochastic volatility
SAEM-AS with $N = 10$ for the stochastic volatility model,

$$x_{t+1} = \theta_1 x_t + v_t, \quad v_t \sim \mathcal{N}(0, \theta_2), \qquad y_t = e_t \exp\!\left(\tfrac{1}{2} x_t\right), \quad e_t \sim \mathcal{N}(0, 1).$$

[Figure: traces of $\theta_1$ and $\theta_2$ vs. iteration number.]
Conclusions
Conclusions:
• Conditional particle filters are useful for identification!
• CPF-AS defines a kernel on $\mathsf{X}^T$ leaving $p_\theta(x_{1:T} \mid y_{1:T})$ invariant.
• CPF-AS consists of two parts:
  • Conditioning: ensures the correct stationary distribution for any $N$.
  • Ancestor sampling: mitigates path degeneracy and enables movement around the conditioned path.
• PG-AS for Bayesian inference and SAEM-AS for maximum likelihood inference. Both work with few particles.

Future work includes finding...
• ...stronger ergodicity results for CPF-AS (uniformly ergodic?).
• ...exact conditions for almost sure convergence of SAEM-AS.