Inference in nonlinear state-space models using Particle Gibbs with Ancestor Sampling
Fredrik Lindsten
Division of Automatic Control, Linköping University, Sweden
Lund, November 29, 2013

Joint work with Michael I. Jordan and Thomas B. Schön
Identification of state-space models 2(33)

Consider a nonlinear discrete-time state-space model,

x_t ∼ f_θ(x_t | x_{t−1}),
y_t ∼ g_θ(y_t | x_t),   and   x_1 ∼ π(x_1).

We observe y_{1:T} = (y_1, ..., y_T) and wish to estimate θ.

• Frequentists: Find θ̂_ML = arg max_θ p_θ(y_{1:T}).
  - Use e.g. the Monte Carlo EM algorithm.
• Bayesians: Find p(θ | y_{1:T}).
  - Use e.g. Gibbs sampling.
Data augmentation 3(33)

Introduce x_{1:T} as latent variables and iterate between:

• Update θ given x_{1:T} (and y_{1:T}).
• Update x_{1:T} given θ (and y_{1:T}).

The central problem is to infer the latent states!

Monte Carlo: For a given θ, generate samples from a Markov kernel leaving p_θ(x_{1:T} | y_{1:T}) invariant.
Gibbs sampler for SSMs 4(33)

Aim: Find p(θ | y_{1:T}).

MCMC: Gibbs sampling for state-space models. Iterate:

• Draw θ[k] ∼ p(θ | x_{1:T}[k−1], y_{1:T}); OK!
• Draw x_{1:T}[k] ∼ p_{θ[k]}(x_{1:T} | y_{1:T}). Hard!

Problem: p_θ(x_{1:T} | y_{1:T}) is not available!

Idea: Approximate p_θ(x_{1:T} | y_{1:T}) using a particle filter.
The particle filter 5(33)

• Resampling: {x^i_{1:t−1}, w^i_{t−1}}_{i=1}^N → {x̃^i_{1:t−1}, 1/N}_{i=1}^N.
• Propagation: x^i_t ∼ R^θ_t(x_t | x̃^i_{1:t−1}) and x^i_{1:t} = {x̃^i_{1:t−1}, x^i_t}.
• Weighting: w^i_t = W^θ_t(x^i_{1:t}).

⇒ {x^i_{1:t}, w^i_t}_{i=1}^N
The particle filter 5(33)
• Resampling: P(a^i_t = j | F^N_{t−1}) = w^j_{t−1} / ∑_l w^l_{t−1}.
• Propagation: x^i_t ∼ R^θ_t(x_t | x^{a^i_t}_{1:t−1}) and x^i_{1:t} = {x^{a^i_t}_{1:t−1}, x^i_t}.
• Weighting: w^i_t = W^θ_t(x^i_{1:t}).

⇒ {x^i_{1:t}, w^i_t}_{i=1}^N
The particle filter 6(33)
Algorithm: Particle filter (PF)

1. Initialize (t = 1):
   (a) Draw x^i_1 ∼ R^θ_1(x_1) for i = 1, ..., N.
   (b) Set w^i_1 = W^θ_1(x^i_1) for i = 1, ..., N.
2. For t = 2, ..., T:
   (a) Draw a^i_t ∼ Discrete({w^j_{t−1}}_{j=1}^N) for i = 1, ..., N.
   (b) Draw x^i_t ∼ R^θ_t(x_t | x^{a^i_t}_{1:t−1}) for i = 1, ..., N.
   (c) Set x^i_{1:t} = {x^{a^i_t}_{1:t−1}, x^i_t} and w^i_t = W^θ_t(x^i_{1:t}) for i = 1, ..., N.
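As a concrete sketch, the algorithm above can be written in a few lines of Python for the common bootstrap choice R_t = f_θ, in which case the weight function W_t reduces to the observation likelihood g_θ(y_t | x_t). The function names and interface below are mine, not from the slides.

```python
import numpy as np

def bootstrap_pf(y, f_sample, g_logpdf, x1_sample, N, rng):
    """Bootstrap particle filter: proposal R_t = f_theta, so w_t = g_theta(y_t | x_t).

    y         : observations y_{1:T}, shape (T,)
    f_sample  : draws x_t ~ f_theta(. | x_{t-1}) for an array of particles
    g_logpdf  : evaluates log g_theta(y_t | x_t) elementwise
    x1_sample : draws N samples from the initial proposal
    Returns particle trajectories (N, T) and normalized final weights (N,).
    """
    T = len(y)
    traj = np.empty((N, T))
    traj[:, 0] = x1_sample(N, rng)
    logw = g_logpdf(y[0], traj[:, 0])
    for t in range(1, T):
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Resampling: multinomial draw on the normalized weights.
        a = rng.choice(N, size=N, p=w)
        traj = traj[a]                        # keep whole ancestral paths
        # Propagation.
        traj[:, t] = f_sample(traj[:, t - 1], rng)
        # Weighting.
        logw = g_logpdf(y[t], traj[:, t])
    w = np.exp(logw - logw.max())
    return traj, w / w.sum()
```

A usage sketch for a linear Gaussian model: pass `f_sample = lambda xp, rng: 0.9 * xp + rng.normal(0, 0.5, size=xp.shape)` and `g_logpdf = lambda yt, xt: -0.5 * (yt - xt) ** 2` (up to constants).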
The particle filter 7(33)

[Figure: particle trajectories over time (state vs. time, t = 5, ..., 25).]
Sampling based on the PF 8(33)

With P(x*_{1:T} = x^i_{1:T} | F^N_T) ∝ w^i_T we get x*_{1:T} approx. ∼ p_θ(x_{1:T} | y_{1:T}).

[Figure: one trajectory x*_{1:T} drawn from the weighted particle system (state vs. time, t = 5, ..., 25).]
Conditional particle filter with ancestor sampling 9(33)

Problems with this approach:

• Based on a PF ⇒ approximate sample.
• Does not leave p(θ, x_{1:T} | y_{1:T}) invariant!
• Relies on a large N to be successful.
• A lot of wasted computation.

Conditional particle filter with ancestor sampling (CPF-AS)

Let x'_{1:T} = (x'_1, ..., x'_T) be a fixed reference trajectory.

• At each time t, sample only N − 1 particles in the standard way.
• Set the Nth particle deterministically: x^N_t = x'_t.
• Generate an artificial history for x^N_t by ancestor sampling.
Conditional particle filter with ancestor sampling 10(33)

Algorithm: CPF-AS, conditioned on x'_{1:T}

1. Initialize (t = 1):
   (a) Draw x^i_1 ∼ R^θ_1(x_1) for i = 1, ..., N − 1.
   (b) Set x^N_1 = x'_1.
   (c) Set w^i_1 = W^θ_1(x^i_1) for i = 1, ..., N.
2. For t = 2, ..., T:
   (a) Draw a^i_t ∼ Discrete({w^j_{t−1}}_{j=1}^N) for i = 1, ..., N − 1.
   (b) Draw x^i_t ∼ R^θ_t(x_t | x^{a^i_t}_{1:t−1}) for i = 1, ..., N − 1.
   (c) Set x^N_t = x'_t.
   (d) Draw a^N_t with P(a^N_t = i | F^N_{t−1}) ∝ w^i_{t−1} f_θ(x'_t | x^i_{t−1}).
   (e) Set x^i_{1:t} = {x^{a^i_t}_{1:t−1}, x^i_t} and w^i_t = W^θ_t(x^i_{1:t}) for i = 1, ..., N.
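The algorithm above differs from the plain PF only in pinning particle N to the reference and in the ancestor-sampling draw (d). A minimal sketch, again for the bootstrap proposal R_t = f_θ and with an interface of my own choosing:

```python
import numpy as np

def cpf_as(y, x_ref, f_sample, f_logpdf, g_logpdf, x1_sample, N, rng):
    """Conditional PF with ancestor sampling, bootstrap proposal (R_t = f_theta).

    x_ref is the reference trajectory x'_{1:T}. Particle N is pinned to it,
    and its history is regenerated by the ancestor-sampling draw in step (d).
    """
    T = len(y)
    traj = np.empty((N, T))
    traj[: N - 1, 0] = x1_sample(N - 1, rng)
    traj[N - 1, 0] = x_ref[0]                                 # 1(b)
    logw = g_logpdf(y[0], traj[:, 0])
    for t in range(1, T):
        w = np.exp(logw - logw.max())
        w /= w.sum()
        a = np.empty(N, dtype=int)
        a[: N - 1] = rng.choice(N, size=N - 1, p=w)           # 2(a)
        # 2(d): P(a^N_t = i) proportional to w^i_{t-1} * f_theta(x'_t | x^i_{t-1})
        logp = np.log(w) + f_logpdf(x_ref[t], traj[:, t - 1])
        p = np.exp(logp - logp.max())
        a[N - 1] = rng.choice(N, p=p / p.sum())
        traj = traj[a]                                        # 2(e): relink paths
        traj[: N - 1, t] = f_sample(traj[: N - 1, t - 1], rng)  # 2(b)
        traj[N - 1, t] = x_ref[t]                               # 2(c)
        logw = g_logpdf(y[t], traj[:, t])                       # 2(e)
    w = np.exp(logw - logw.max())
    return traj, w / w.sum()
```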
The PGAS Markov kernel (I/II) 11(33)

Consider the procedure:

1. Run CPF-AS(N, x'_{1:T}) targeting p_θ(x_{1:T} | y_{1:T}).
2. Sample x*_{1:T} with P(x*_{1:T} = x^i_{1:T} | F^N_T) ∝ w^i_T.

[Figure: reference trajectory, particle system and sampled trajectory (state vs. time, t = 5, ..., 50).]
The PGAS Markov kernel (II/II) 12(33)

This procedure:

• Maps x'_{1:T} stochastically into x*_{1:T}.
• Implicitly defines a Markov kernel (P^N_θ) on (X^T, 𝒳^T), referred to as the PGAS (Particle Gibbs with ancestor sampling) kernel.

Theorem
For any number of particles N ≥ 1 and for any θ ∈ Θ, the PGAS kernel P^N_θ leaves p_θ(x_{1:T} | y_{1:T}) invariant:

p_θ(dx*_{1:T} | y_{1:T}) = ∫ P^N_θ(x'_{1:T}, dx*_{1:T}) p_θ(dx'_{1:T} | y_{1:T}).

F. Lindsten, M. I. Jordan and T. B. Schön, Ancestor sampling for particle Gibbs, in P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems (NIPS) 25, 2600-2608, 2012.
Proof idea (I/II) 13(33)

• Let x_t = {x^1_t, ..., x^N_t} and a_t = {a^1_t, ..., a^N_t}.
• Let b_{1:T} be the indices such that x*_{1:T} = x^{b_{1:T}}_{1:T} = (x^{b_1}_1, ..., x^{b_T}_T).

The PGAS procedure generates a collection of random variables

{x_{1:T}, a_{2:T}, b_T} ∈ Ω ≜ X^{NT} × {1, ..., N}^{N(T−1)+1}.

Define an extended target density on Ω,

φ_θ(x_{1:T}, a_{2:T}, b_T) = φ_θ(x^{b_{1:T}}_{1:T}, b_{1:T}) φ_θ(x^{−b_{1:T}}_{1:T}, a^{−b_{2:T}}_{2:T} | x^{b_{1:T}}_{1:T}, b_{1:T})

= [ p_θ(x^{b_{1:T}}_{1:T} | y_{1:T}) / N^T ]   (marginal)
  × ∏_{i=1, i≠b_1}^N R^θ_1(x^i_1) ∏_{t=2}^T ∏_{i=1, i≠b_t}^N [ w^{a^i_t}_{t−1} / ∑_l w^l_{t−1} ] R^θ_t(x^i_t | x^{a^i_t}_{1:t−1})   (conditional).
Proof idea (II/II) 14(33)

Key observations:

• φ_θ is a PDF on Ω.
• By construction, φ_θ admits p_θ(x_{1:T} | y_{1:T}) as a marginal.
• Each step of PGAS is a properly collapsed Gibbs step for φ_θ, starting from {x^{b_{1:T}}_{1:T}, b_{1:T}} = {x'_{1:T}, (N, ..., N)}.
• The law of x*_{1:T} is unaffected by setting b_{1:T} = (N, ..., N) at each iteration.

⇒ PGAS leaves φ_θ, and thereby also p_θ(x_{1:T} | y_{1:T}), invariant.
Ergodicity 15(33)

Theorem
Assume that there exist constants ε > 0 and κ < ∞ such that, for any θ ∈ Θ, t ∈ {1, ..., T} and x_{1:t} ∈ X^t, W^θ_t(x_{1:t}) ≤ κ and p_θ(y_{1:T}) ≥ ε.

Then, for any N ≥ 2, the PGAS kernel P^N_θ is uniformly ergodic. That is, there exist constants R < ∞ and ρ ∈ [0, 1) such that

‖(P^N_θ)^k(x'_{1:T}, ·) − p_θ(· | y_{1:T})‖_TV ≤ R ρ^k,   ∀ x'_{1:T} ∈ X^T.
Proof idea 16(33)

Show that P^N_θ satisfies a Doeblin condition, P^N_θ(x'_{1:T}, ·) ≥ ε p_θ(· | y_{1:T}).

Take A ∈ 𝒳^T. We can write

P^N_θ(x'_{1:T}, A) = E_{θ, x'_{1:T}}[1_A(x^{b_T}_{1:T})] = ∑_{j=1}^N E_{θ, x'_{1:T}}[ (w^j_T / ∑_l w^l_T) 1_A(x^j_{1:T}) ]

≥ (1/(Nκ)) ∑_{j=1}^{N−1} E_{θ, x'_{1:T}}[ w^j_T 1_A(x^j_{1:T}) ] = ((N−1)/(Nκ)) E_{θ, x'_{1:T}}[ W^θ_T(x^1_{1:T}) 1_A(x^1_{1:T}) ].

Compute the integral w.r.t. x^1_T. Repeat for t = T − 1, t = T − 2, etc.
PGAS for Bayesian identification 17(33)

Bayesian identification: PGAS + Gibbs.

Algorithm: PGAS for Bayesian identification

1. Initialize: Set {θ[0], x_{1:T}[0]} arbitrarily.
2. For k ≥ 1, iterate:
   (a) Draw x_{1:T}[k] ∼ P^N_{θ[k−1]}(x_{1:T}[k−1], ·).
   (b) Draw θ[k] ∼ p(θ | x_{1:T}[k], y_{1:T}).

For any number of particles N ≥ 2, the Markov chain {θ[k], x_{1:T}[k]}_{k≥1} has limiting distribution p(θ, x_{1:T} | y_{1:T}).
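The outer loop of this algorithm can be sketched as below. The two callables are placeholders of my own: `pgas_kernel` would wrap CPF-AS conditioned on the previous trajectory followed by drawing one path with probability proportional to its final weight, and `sample_theta` draws from p(θ | x_{1:T}, y_{1:T}) (with a conjugate prior, e.g. an inverse gamma on a noise variance, this is a standard posterior draw).

```python
import numpy as np

def pgas_gibbs(y, theta0, x0, pgas_kernel, sample_theta, n_iter, rng):
    """PGAS-within-Gibbs outer loop (steps (a) and (b) above).

    pgas_kernel(y, x_ref, theta, rng) -> new trajectory x_{1:T}[k]
    sample_theta(x, y, rng)           -> draw from p(theta | x_{1:T}, y_{1:T})
    Assumes a scalar theta for simplicity of storage.
    """
    theta = theta0
    x = np.asarray(x0, dtype=float)
    thetas = np.empty(n_iter)
    for k in range(n_iter):
        x = pgas_kernel(y, x, theta, rng)    # (a): x_{1:T}[k] ~ P^N_theta
        theta = sample_theta(x, y, rng)      # (b): theta[k] ~ p(theta | x, y)
        thetas[k] = theta
    return thetas, x
```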
ex) Stochastic volatility model 18(33)

Stochastic volatility model,

x_{t+1} = 0.9 x_t + v_t,   v_t ∼ N(0, θ),
y_t = e_t exp(½ x_t),   e_t ∼ N(0, 1).

Consider the ACF of θ[k] − E[θ | y_{1:T}].

[Figure: two panels, ACF vs. lag (0 to 200), for PG-AS (T = 1000) and PG (T = 1000), with N = 5, 20, 100, 1000.]
ex) Wiener system identification 19(33)

[Block diagram: u_t → G → h(·) → Σ → y_t, with process noise v_t entering G and measurement noise e_t entering at the sum.]

Semi-parametric model: state-space model for G, Gaussian process model for h(·).

[Figures: Bode plot (magnitude (dB) and phase (deg) vs. frequency (rad/s)) of G, and the static nonlinearity h(z); true system, posterior mean and 99% credibility intervals.]

F. Lindsten, T. B. Schön and M. I. Jordan, Bayesian semiparametric Wiener system identification, Automatica, 49(7): 2053-2063, July 2013.
ex) Gaussian process state-space models 20(33)

A GP-SSM is a flexible nonparametric dynamical systems model,

f(·) ∼ GP(m_θ(x_t), k_θ(x_t, x'_t)),
x_{t+1} | f, x_t ∼ N(f(x_t), Q),
y_t | x_t ∼ g_θ(y_t | x_t).

Idea: Marginalize out f(·) and do inference directly on x_{1:T} (and θ).

R. Frigola, F. Lindsten, T. B. Schön and C. E. Rasmussen, Bayesian inference and learning in Gaussian process state-space models with particle MCMC, Conference on Neural Information Processing Systems (NIPS), accepted for publication, Lake Tahoe, NV, USA, 2013.

By marginalizing over f, we introduce a non-Markovian dependence in {x_t}.
ex) Gaussian process state-space models, cont'd. 21(33)

! PGAS is well suited for tackling such non-Markovian problems.

[Figures: learned mean function f(x, u) over the state-input space, and a state trajectory x_t over time.]
Maximum likelihood identification 22(33)

Back to the frequentist objective, θ̂_ML = arg max_θ p_θ(y_{1:T}).

Expectation maximization (EM). Iterate:

(E) Q(θ, θ[k−1]) = E_{θ[k−1]}[log p_θ(x_{1:T}, y_{1:T}) | y_{1:T}],
(M) θ[k] = arg max_{θ∈Θ} Q(θ, θ[k−1]).

Problem: The E-step requires us to solve a smoothing problem, i.e. to compute an expectation under p_θ(x_{1:T} | y_{1:T}).
Particle smoothing EM 23(33)
Common approach: Use a particle smoother ⇒ SMC version of the Monte Carlo EM algorithm.
Particle smoothing EM (PSEM): Replace the E-step with, (S) Run a PS to generate ex i 1:T approx. ∼ p θ [ k − 1 ] ( x 1:T | y 1:T ) . (E) Q b PS k ( θ ) = N 1 ∑ N i = 1 log p θ ( ex i 1:T , y 1:T )
EM MCEM PSEM
Problems with PSEM 24(33)

Problems with PSEM:

• Based on a PS ⇒ biased approximation of Q.
• Doubly asymptotic – requires N → ∞ and k → ∞ simultaneously to converge.
• Relies on a large N to be successful.
• A lot of wasted computation.

Furthermore, the computational complexity of many particle smoothers is O(N²).
Stochastic approximation EM 25(33)

[Diagram: EM → MCEM → PSEM; EM → SAEM.]

Assume, for now, that we can sample from p_θ(x_{1:T} | y_{1:T}).

Stochastic approximation EM (SAEM): Replace the E-step with:

(S) Sample x̃_{1:T} ∼ p_{θ[k−1]}(x_{1:T} | y_{1:T}).
(E) Q̂_k(θ) = Q̂_{k−1}(θ) + γ_k (log p_θ(x̃_{1:T}, y_{1:T}) − Q̂_{k−1}(θ)).
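The (E)-step is a Robbins–Monro style stochastic approximation recursion. A toy sketch of that recursion (my own notation, on scalar values rather than the log-likelihood function) makes the mechanics concrete: with the common step size γ_k = 1/k it reduces exactly to the running mean of the simulated quantities.

```python
def sa_recursion(values, gamma=lambda k: 1.0 / k):
    """Stochastic approximation update Q_k = Q_{k-1} + gamma_k (q_k - Q_{k-1}),
    started at Q_0 = 0. Returns the sequence Q_1, Q_2, ... for inspection.
    With gamma_k = 1/k, Q_k equals the running mean of q_1, ..., q_k."""
    Q = 0.0
    history = []
    for k, q in enumerate(values, start=1):
        Q = Q + gamma(k) * (q - Q)
        history.append(Q)
    return history
```

Smaller, slowly decaying step sizes (e.g. γ_k = k^(-0.7)) instead weight recent draws more heavily, which is what makes SAEM practical with a single simulated trajectory per iteration.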
Markovian SAEM 26(33)

• The iterates {θ[k]}_{k≥0} converge to a maximizer of p_θ(y_{1:T}) under standard stochastic approximation conditions.
• Computationally much more efficient than MCEM when the simulation step is complicated.

B. Delyon, M. Lavielle and E. Moulines, Convergence of a stochastic approximation version of the EM algorithm, The Annals of Statistics, 27:94-128, 1999.

• Bad news: Not possible to sample from p_θ(x_{1:T} | y_{1:T}).
• Good news: It is enough to sample from a uniformly ergodic Markov kernel with stationary distribution p_θ(x_{1:T} | y_{1:T}).

We can use PGAS to sample the states!
PGAS for maximum likelihood identification 27(33)

Maximum likelihood identification: PGAS + SAEM.

[Diagram: EM → MCEM → PSEM; EM → SAEM → PSAEM.]

Particle SAEM (PSAEM): Replace the E-step with:

(S) Sample x_{1:T}[k] ∼ P^N_{θ[k−1]}(x_{1:T}[k−1], ·).
(E) Q̂_k(θ) = Q̂_{k−1}(θ) + γ_k (log p_θ(x_{1:T}[k], y_{1:T}) − Q̂_{k−1}(θ)).

F. Lindsten, An efficient stochastic approximation EM algorithm using conditional particle filters, Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
Particle SAEM – implementation details 28(33)

Consider exponential family models:

p_θ(x_{1:T}, y_{1:T}) = C exp(⟨ψ(θ), s(x_{1:T})⟩ − A(θ)).

We can then compute the auxiliary quantity as:

• S_k = S_{k−1} + γ_k (∑_{i=1}^N w^i_T s(x^i_{1:T}[k]) − S_{k−1}).
• Q̂_k(θ) = ⟨ψ(θ), S_k⟩ − A(θ).

Note:
• The SA update is on the sufficient statistics.
• We can use all the particles in the update.
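The update on S_k above is a single weighted average followed by the SA step; a minimal sketch (function name and interface are mine, and the weights are assumed to be the final CPF-AS weights, normalized inside the function):

```python
import numpy as np

def update_stats(S_prev, gamma_k, w_T, s_particles):
    """PSAEM sufficient-statistic update, using all weighted particles:

        S_k = S_{k-1} + gamma_k (sum_i w^i_T s(x^i_{1:T}[k]) - S_{k-1}),

    where w^i_T are the (normalized) final particle weights."""
    w = np.asarray(w_T, dtype=float)
    w = w / w.sum()                          # ensure normalization
    s = np.asarray(s_particles, dtype=float) # shape (N, dim_s)
    s_bar = w @ s                            # weighted particle average of s
    return np.asarray(S_prev, dtype=float) + gamma_k * (s_bar - S_prev)
```

Using every particle here (rather than only the sampled trajectory) reduces the variance of the SA update at no extra cost, since the particle system is available anyway.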
ex) Linear time series 29(33)

Proof of concept:

x_{t+1} = a x_t + v_t,   v_t ∼ N(0, σ_v²),
y_t = x_t + e_t,   e_t ∼ N(0, σ_e²),

with θ = (a, σ_v², σ_e²). With N = 15 we get:

[Figure: box plots of θ_k − θ̂_ML for 10·a_k, σ²_{v,k} and σ²_{e,k} at k = 100, 1000, 10000, 100000.]
ex) Nonlinear time series 30(33)

Consider,

x_{t+1} = a x_t + b x_t / (1 + x_t²) + c cos(1.2 t) + v_t,   v_t ∼ N(0, σ_v²),
y_t = 0.05 x_t² + e_t,   e_t ∼ N(0, σ_e²).

• Parameterization: θ = (a, b, c, σ_v², σ_e²).
• Relative error: ‖(θ[k] − θ̂_ML) ./ θ̂_ML‖₂.
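For reproducing experiments on this model, a simulator is a one-liner per equation. The slide does not state the parameter values used, so the ones in any call are illustrative only; the indexing convention (model time t starting at 1) is my assumption.

```python
import numpy as np

def simulate_benchmark(theta, T, rng):
    """Simulate the nonlinear time series above.
    theta = (a, b, c, sigma_v2, sigma_e2). Array index i corresponds to
    model time t = i + 1, so the input term is cos(1.2 * (i + 1))."""
    a, b, c, sv2, se2 = theta
    x = np.empty(T)
    y = np.empty(T)
    x[0] = rng.normal(0.0, np.sqrt(sv2))     # simple choice of initial state
    for t in range(T):
        y[t] = 0.05 * x[t] ** 2 + rng.normal(0.0, np.sqrt(se2))
        if t + 1 < T:
            x[t + 1] = (a * x[t] + b * x[t] / (1.0 + x[t] ** 2)
                        + c * np.cos(1.2 * (t + 1))
                        + rng.normal(0.0, np.sqrt(sv2)))
    return x, y
```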
ex) Nonlinear time series 31(33)

[Figure: average relative error vs. computational time (seconds), log-log scale, for PSAEM with N = 15 and PSEM with N = 15, 50, 100, 500.]