Conditional particle filters for system identification


Fredrik Lindsten

Division of Automatic Control, Linköping University, Sweden

Identification of state-space models

Consider a nonlinear, discrete-time state-space model,

    x_{t+1} = f_t(x_t; θ) + v_t(θ),
    y_t = h_t(x_t; θ) + e_t(θ).

We observe y_{1:T} = {y_1, ..., y_T}. Aim: estimate θ given y_{1:T}.

Latent variable model

Alternate between updating θ and updating x_{1:T}.

Frequentists:
• Find θ̂_ML = arg max_θ p_θ(y_{1:T}).
• Use e.g. the EM algorithm.

Bayesians:
• Find p(θ | y_{1:T}).
• Use e.g. Gibbs sampling.

Invariant distribution

Definition. Let K(y | x) be a Markovian transition kernel and let π(x) be some probability density function. K is said to leave π invariant if

    ∫ K(y | x) π(x) dx = π(y).

• A Markov chain {X_n}_{n≥1} with transition kernel K has π as its stationary distribution.
• If the chain is ergodic, then π is also its limiting distribution.
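The invariance condition is easy to check numerically in the discrete case, where the integral becomes a matrix-vector product. A minimal sketch; the 3-state transition matrix below is made up for illustration:

```python
import numpy as np

# A 3-state transition kernel K (rows sum to 1) and its stationary
# distribution pi, obtained as the left eigenvector of K for eigenvalue 1.
K = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
evals, evecs = np.linalg.eig(K.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()                      # normalize (and fix the eigenvector sign)

# Invariance: summing K(y|x) pi(x) over x (here: pi @ K) returns pi(y).
print(np.allclose(pi @ K, pi))      # True
```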

ex) Invariant distribution

[Figure: a simulated chain trajectory x_t over time (shown for 500 and for 10 000 steps) together with the empirical histogram of x converging to the invariant density.]

Markov chain Monte Carlo

Objective: Sample from an intractable target distribution π(x).

Markov chain Monte Carlo (MCMC):
• Find a kernel K(y | x) which leaves π(x) invariant (key step).
• Sample a Markov chain with kernel K.
• Discard the transient (burn-in) and use the remaining sample path as approximate samples from π.

Gibbs sampler

ex) Sample from,

    (x, y) ~ N( (10, 10), [2 1; 1 1] ).

Gibbs sampler:
• Draw x' ~ π(x | y);
• Draw y' ~ π(y | x').

[Figure: the zig-zag Gibbs sampling path over the target's contours in the (X, Y) plane.]
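The two conditional draws above can be implemented directly, since both full conditionals of a bivariate Gaussian are univariate Gaussians. A minimal sketch, assuming the example's target N((10, 10), [[2, 1], [1, 1]]); function and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: N(mu, Sigma) with mu = (10, 10) and Sigma = [[2, 1], [1, 1]].
# The full conditionals are Gaussian:
#   x | y ~ N(mu_x + (S_xy/S_yy)(y - mu_y), S_xx - S_xy^2/S_yy) = N(10 + (y - 10), 1)
#   y | x ~ N(mu_y + (S_xy/S_xx)(x - mu_x), S_yy - S_xy^2/S_xx) = N(10 + (x - 10)/2, 1/2)
def gibbs_bivariate_gaussian(n_iter=5000):
    x, y = 0.0, 0.0                      # arbitrary initialization
    samples = np.empty((n_iter, 2))
    for r in range(n_iter):
        x = 10.0 + (y - 10.0) + rng.normal(scale=1.0)
        y = 10.0 + 0.5 * (x - 10.0) + rng.normal(scale=np.sqrt(0.5))
        samples[r] = (x, y)
    return samples

samples = gibbs_bivariate_gaussian()
print(samples[1000:].mean(axis=0))       # close to (10, 10) after burn-in
```

Alternating the two one-dimensional draws traces out exactly the zig-zag path shown in the figure.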

Gibbs sampler for SSMs

Aim: Find p(θ, x_{1:T} | y_{1:T}).

Alternate between updating θ and updating x_{1:T}. MCMC: Gibbs sampling for state-space models. Iterate,
• Draw θ[r] ~ p(θ | x_{1:T}[r−1], y_{1:T});
• Draw x_{1:T}[r] ~ p(x_{1:T} | θ[r], y_{1:T}).

The above procedure results in a Markov chain {θ[r], x_{1:T}[r]}_{r≥1} with stationary distribution p(θ, x_{1:T} | y_{1:T}).


Linear Gaussian state-space model

ex) Gibbs sampling for linear system identification,

    [x_{t+1}; y_t] = [A B; C D] [x_t; u_t] + [v_t; e_t].

Iterate,
• Draw θ' ~ p(θ | x_{1:T}, y_{1:T});
• Draw x'_{1:T} ~ p(x_{1:T} | θ', y_{1:T}).

[Figure: Bode magnitude and phase plots of the identified system, showing the true system, the posterior mean, and a 95% credibility region.]

Gibbs sampler for general SSM?

What about the general nonlinear/non-Gaussian case?
• Draw θ' ~ p(θ | x_{1:T}, y_{1:T}); OK!
• Draw x'_{1:T} ~ p(x_{1:T} | θ', y_{1:T}). Hard!

Problem: p(x_{1:T} | θ, y_{1:T}) is not available in closed form.
Idea: Approximate p(x_{1:T} | θ, y_{1:T}) using a particle filter.


The particle filter

• Resampling: {x^i_{1:t−1}, w^i_{t−1}}_{i=1}^N → {x̃^i_{1:t−1}, 1/N}_{i=1}^N.
• Propagation: x^i_t ~ R^θ_t(dx_t | x̃^i_{1:t−1}) and x^i_{1:t} = {x̃^i_{1:t−1}, x^i_t}.
• Weighting: w^i_t = W^θ_t(x^i_{1:t}).

⇒ {x^i_{1:t}, w^i_t}_{i=1}^N

[Figure: the weighting, resampling, propagation cycle of the particle filter.]

The particle filter

• Resampling + Propagation, combined into one kernel:

    (a^i_t, x^i_t) ~ M^θ_t(a_t, x_t) = ( w^{a_t}_{t−1} / Σ_l w^l_{t−1} ) R^θ_t(x_t | x^{a_t}_{1:t−1}).

• Weighting: w^i_t = W^θ_t(x^i_{1:t}).

⇒ {x^i_{1:t}, w^i_t}_{i=1}^N

The particle filter

Algorithm: Particle filter (PF)
1. Initialize (t = 1):
   (a) Draw x^i_1 ~ R^θ_1(x_1) for i = 1, ..., N.
   (b) Set w^i_1 = W^θ_1(x^i_1) for i = 1, ..., N.
2. For t = 2, ..., T:
   (a) Draw (a^i_t, x^i_t) ~ M^θ_t(a_t, x_t) for i = 1, ..., N.
   (b) Set x^i_{1:t} = {x^{a^i_t}_{1:t−1}, x^i_t} and w^i_t = W^θ_t(x^i_{1:t}) for i = 1, ..., N.
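The algorithm above can be sketched in code. Below is a minimal bootstrap-style instance (the proposal R_t taken to be the transition density, so the weight function W_t reduces to the observation likelihood) for the stochastic volatility model used in the examples later; the interface is illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_pf(y, theta1, theta2, N=100):
    """Bootstrap PF for x_{t+1} = theta1*x_t + v_t, v_t ~ N(0, theta2),
    y_t ~ N(0, exp(x_t)). Returns particle paths and final weights."""
    T = len(y)
    paths = np.empty((T, N))
    x = rng.normal(0.0, np.sqrt(theta2), size=N)   # R_1: draw from the prior
    for t in range(T):
        if t > 0:
            # M_t: resample ancestor indices from the weights, then
            # propagate each particle through the transition density.
            a = rng.choice(N, size=N, p=w)
            paths[:t] = paths[:t, a]               # keep whole ancestral paths
            x = theta1 * paths[t - 1] + rng.normal(0.0, np.sqrt(theta2), size=N)
        paths[t] = x
        # W_t: bootstrap proposal, so the weight is the observation likelihood
        logw = -0.5 * (x + y[t] ** 2 * np.exp(-x))
        w = np.exp(logw - logw.max())
        w /= w.sum()
    return paths, w
```

Resampling whole paths (rather than only the current state) is what makes the filter return an approximation of the joint smoothing distribution p(x_{1:T} | θ, y_{1:T}), which is what the Gibbs step needs.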

The particle filter

[Figure: particle trajectories (state vs. time) generated by the PF over 25 time steps.]

Sampling based on the PF

• With P(x'_{1:T} = x^i_{1:T}) ∝ w^i_T we get x'_{1:T} approximately distributed according to p(x_{1:T} | θ, y_{1:T}).

[Figure: one trajectory selected from the weighted particle system.]

Problems

Problems with this approach,
• Based on a PF, so the sample is only approximate.
• Does not leave p(θ, x_{1:T} | y_{1:T}) invariant!
• Relies on large N to be successful.
• A lot of wasted computations.

To get around these problems, use a conditional particle filter (CPF): one prespecified path is retained throughout the sampler.

C. Andrieu, A. Doucet and R. Holenstein, "Particle Markov chain Monte Carlo methods", Journal of the Royal Statistical Society: Series B, 72:269-342, 2010.


Conditional PF with ancestor sampling

Algorithm: CPF with ancestor sampling (CPF-AS), conditioned on x*_{1:T}
1. Initialize (t = 1):
   (a) Draw x^i_1 ~ R^θ_1(x_1) for i ≠ N and set x^N_1 = x*_1.
   (b) Set w^i_1 = W^θ_1(x^i_1) for i = 1, ..., N.
2. For t = 2, ..., T:
   (a) Draw (a^i_t, x^i_t) ~ M^θ_t(a_t, x_t) for i ≠ N and set x^N_t = x*_t.
   (b) Draw a^N_t with P(a^N_t = i) ∝ w^i_{t−1} p(x*_t | θ, x^i_{t−1}).
   (c) Set x^i_{1:t} = {x^{a^i_t}_{1:t−1}, x^i_t} and w^i_t = W^θ_t(x^i_{1:t}) for i = 1, ..., N.

F. Lindsten, M. I. Jordan and T. B. Schön, "Ancestor sampling for particle Gibbs", in Proceedings of the 2012 Conference on Neural Information Processing Systems (NIPS), accepted.
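The conditioning and ancestor-sampling steps can be sketched as follows, again for the stochastic volatility model from the examples. This is an illustrative implementation, not code from the slides; particle N carries the reference trajectory and its ancestor index is drawn from the ancestor-sampling weights:

```python
import numpy as np

rng = np.random.default_rng(2)

def _weights(yt, x):
    # Normalized observation weights for y_t ~ N(0, exp(x_t)).
    logw = -0.5 * (x + yt ** 2 * np.exp(-x))
    w = np.exp(logw - logw.max())
    return w / w.sum()

def cpf_as(y, x_ref, theta1, theta2, N=5):
    """One sweep of CPF-AS for x_{t+1} = theta1*x_t + v_t, v_t ~ N(0, theta2),
    y_t ~ N(0, exp(x_t)), conditioned on the reference path x_ref.
    Returns a trajectory drawn with probability proportional to w_T^i."""
    T = len(y)
    paths = np.empty((T, N))
    x = rng.normal(0.0, np.sqrt(theta2), size=N)
    x[-1] = x_ref[0]                       # condition particle N at t = 1
    paths[0] = x
    w = _weights(y[0], x)
    for t in range(1, T):
        a = rng.choice(N, size=N, p=w)     # ancestors for i != N
        # Ancestor sampling: P(a_t^N = i) ∝ w_{t-1}^i p(x_t^* | x_{t-1}^i)
        logp = np.log(w) - (x_ref[t] - theta1 * paths[t - 1]) ** 2 / (2 * theta2)
        pa = np.exp(logp - logp.max())
        a[-1] = rng.choice(N, p=pa / pa.sum())
        paths[:t] = paths[:t, a]           # reindex histories by ancestors
        x = theta1 * paths[t - 1] + rng.normal(0.0, np.sqrt(theta2), size=N)
        x[-1] = x_ref[t]                   # condition particle N at time t
        paths[t] = x
        w = _weights(y[t], x)
    return paths[:, rng.choice(N, p=w)]    # P(x'_{1:T} = x^i_{1:T}) ∝ w_T^i
```

Note that step (b), the ancestor draw for the conditioned particle, is the only difference from a plain CPF, yet it is what lets the output trajectory differ from x_ref even in the early time steps.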

Conditional PF with ancestor sampling

Theorem. For any N ≥ 2, the procedure
(i) run CPF-AS(x*_{1:T});
(ii) sample x'_{1:T} with P(x'_{1:T} = x^i_{1:T}) ∝ w^i_T;
defines a Markov kernel on X^T which leaves p(x_{1:T} | θ, y_{1:T}) invariant.

Proof. Ask me later...


CPF vs. CPF-AS

[Figure: particle trajectories (state vs. time over 50 steps) for the CPF and for CPF-AS, side by side.]

Particle Gibbs with ancestor sampling

Bayesian identification: Gibbs + CPF-AS = PG-AS

Algorithm PG-AS: Particle Gibbs with ancestor sampling
1. Initialize: Set {θ[0], x_{1:T}[0]} arbitrarily.
2. For r ≥ 1, iterate:
   (a) Draw θ[r] ~ p(θ | x_{1:T}[r−1], y_{1:T}).
   (b) Run CPF-AS(x_{1:T}[r−1]), targeting p(x_{1:T} | θ[r], y_{1:T}).
   (c) Sample x_{1:T}[r] with P(x_{1:T}[r] = x^i_{1:T}) ∝ w^i_T.

For any number of particles N ≥ 2, the Markov chain {θ[r], x_{1:T}[r]}_{r≥1} has stationary distribution p(θ, x_{1:T} | y_{1:T}).

ex) Stochastic volatility

PG-AS for the stochastic volatility model,

    x_{t+1} = θ1 x_t + v_t,   v_t ~ N(0, θ2),
    y_t = e_t exp(x_t / 2),   e_t ~ N(0, 1).

[Figure: trace plots of θ1 and θ2 over 1000 PG-AS iterations for N = 5, 20, 100, 1000, 5000.]

[Figure: estimated posterior densities of θ1 and θ2 for the same values of N.]

Maximum likelihood inference

Back to the frequentist objective,

    θ̂_ML = arg max_θ p_θ(y_{1:T}).

Expectation maximization (EM). Iterate;
(E) Q(θ, θ[r−1]) = E_{θ[r−1]}[ log p_θ(x_{1:T}, y_{1:T}) | y_{1:T} ];
(M) θ[r] = arg max_θ Q(θ, θ[r−1]).

Problem: The E-step requires us to solve a smoothing problem, i.e. to compute an expectation under p_θ(x_{1:T} | y_{1:T}).


Particle smoother EM

Idea: Use a particle smoother (PS) for the E-step,

    p_{θ'}(x_{1:T} | y_{1:T}) ≈ (1/N) Σ_{j=1}^N δ_{x̃^j_{1:T}}(x_{1:T}).

The E-step is approximated with

    Q̂_N(θ, θ') ≜ (1/N) Σ_{j=1}^N log p_θ(x̃^j_{1:T}, y_{1:T}).

Problems with PS-EM

Problems with PS-EM,
• Doubly asymptotic: requires N → ∞ and R → ∞ simultaneously for convergence.
• Relies on large N to be successful.
• A lot of wasted computations.

Stochastic approximation EM

Assume for the time being that we can sample from p_θ(x_{1:T} | y_{1:T}). Stochastic approximation EM (SAEM): replace the E-step with,

    Q̂_r(θ) = Q̂_{r−1}(θ) + γ_r ( (1/M) Σ_{j=1}^M log p_θ(x̃^j_{1:T}, y_{1:T}) − Q̂_{r−1}(θ) ),

where x̃^j_{1:T} ~ p_θ(x_{1:T} | y_{1:T}) i.i.d. for j = 1, ..., M.

SAEM converges to a maximum of p_θ(y_{1:T}) for any M ≥ 1 under standard stochastic approximation conditions.

B. Delyon, M. Lavielle and E. Moulines, "Convergence of a stochastic approximation version of the EM algorithm", The Annals of Statistics, 27:94-128, 1999.
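To see the mechanics of the SAEM recursion, here is a toy instance with an exactly tractable E-step: x_i ~ N(θ, 1) latent and y_i | x_i ~ N(x_i, 1), so that marginally y_i ~ N(θ, 2), the MLE is θ̂ = mean(y), and p_θ(x | y) is Gaussian. The model and all names are illustrative, chosen only so the recursion can be checked against a known answer:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy latent-variable model: x_i ~ N(theta, 1), y_i | x_i ~ N(x_i, 1).
# Here the posterior is tractable: x_i | y_i ~ N((theta + y_i)/2, 1/2),
# so the E-step can be sampled exactly with M = 1 draw per iteration.
y = rng.normal(1.5, np.sqrt(2.0), size=200)

theta, S = 0.0, 0.0
for r in range(1, 2001):
    gamma = 1.0 / r                                  # sum γ_r = ∞, sum γ_r² < ∞
    x = rng.normal((theta + y) / 2, np.sqrt(0.5))    # simulated E-step draw
    # SA update of the sufficient statistic: since the complete-data
    # log-likelihood is -0.5*sum((x_i - theta)^2) + const, the M-step is
    # simply theta = (averaged) mean of the latent draws.
    S = S + gamma * (x.mean() - S)
    theta = S

print(theta)   # close to mean(y), the marginal MLE
```

The same two-line structure (stochastic update of a sufficient statistic, then a closed-form M-step) is what SAEM-AS uses, with the exact posterior draw replaced by a CPF-AS trajectory.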


Stochastic approximation EM

Bad news: We cannot sample from p_θ(x_{1:T} | y_{1:T}).

Good news: It is enough to sample from a uniformly ergodic Markov kernel leaving p_θ(x_{1:T} | y_{1:T}) invariant.

We can use CPF-AS to sample the states!


SAEM-AS

Maximum likelihood identification: SAEM + CPF-AS = SAEM-AS

Algorithm SAEM-AS: Particle SAEM with ancestor sampling
1. Initialize: Set {θ[0], x_{1:T}[0]} arbitrarily.
2. For r ≥ 1, iterate:
   (a) Run CPF-AS(x_{1:T}[r−1]), targeting p(x_{1:T} | θ[r−1], y_{1:T}).
   (b) Compute Q̂_r(θ) = Q̂_{r−1}(θ) + γ_r ( Σ_{i=1}^N w^i_T log p_θ(x^i_{1:T}, y_{1:T}) − Q̂_{r−1}(θ) ).
   (c) Compute θ[r] = arg max_θ Q̂_r(θ).
   (d) Sample x_{1:T}[r] with P(x_{1:T}[r] = x^i_{1:T}) ∝ w^i_T.

ex) Stochastic volatility

SAEM-AS with N = 10 for the stochastic volatility model,

    x_{t+1} = θ1 x_t + v_t,   v_t ~ N(0, θ2),
    y_t = e_t exp(x_t / 2),   e_t ~ N(0, 1).

[Figure: SAEM-AS parameter estimates of θ1 and θ2 over 1000 iterations.]

Conclusions

Conclusions,
• Conditional particle filters are useful for identification!
• CPF-AS defines a kernel on X^T leaving p_θ(x_{1:T} | y_{1:T}) invariant.
• CPF-AS consists of two parts:
  - Conditioning: ensures the correct stationary distribution for any N.
  - Ancestor sampling: mitigates path degeneracy and enables movement around the conditioned path.
• PG-AS for Bayesian inference and SAEM-AS for maximum likelihood inference. Both work with few particles.

Future work includes finding...
• ...stronger ergodicity results for CPF-AS (uniformly ergodic?).
• ...exact conditions for almost sure convergence of SAEM-AS.
