Inference in nonlinear state-space models using Particle Gibbs with Ancestor Sampling
Fredrik Lindsten
Division of Automatic Control, Linköping University, Sweden
Lund, November 29, 2013

Joint work with Michael I. Jordan and Thomas B. Schön
Identification of state-space models 2(33)

Consider a nonlinear discrete-time state-space model,

x_t ∼ f_θ(x_t | x_{t−1}),
y_t ∼ g_θ(y_t | x_t),   and   x_1 ∼ π(x_1).

We observe y_{1:T} = (y_1, ..., y_T) and wish to estimate θ.

• Frequentists: Find θ̂_ML = arg max_θ p_θ(y_{1:T}).
  - Use e.g. the Monte Carlo EM algorithm.
• Bayesians: Find p(θ | y_{1:T}).
  - Use e.g. Gibbs sampling.
Data augmentation 3(33)

Introduce x_{1:T} as latent variables and iterate between:

• Update θ given x_{1:T} (and y_{1:T}).
• Update x_{1:T} given θ (and y_{1:T}).

The central problem is to infer the latent states!

Monte Carlo: For a given θ, generate samples from a Markov kernel leaving p_θ(x_{1:T} | y_{1:T}) invariant.
Gibbs sampler for SSMs 4(33)

Aim: Find p(θ | y_{1:T}).

MCMC: Gibbs sampling for state-space models. Iterate:

• Draw θ[k] ∼ p(θ | x_{1:T}[k−1], y_{1:T}); OK!
• Draw x_{1:T}[k] ∼ p_{θ[k]}(x_{1:T} | y_{1:T}). Hard!

Problem: p_θ(x_{1:T} | y_{1:T}) is not available!

Idea: Approximate p_θ(x_{1:T} | y_{1:T}) using a particle filter.
The particle filter 5(33)

• Resampling: {x^i_{1:t−1}, w^i_{t−1}}_{i=1}^N → {x̃^i_{1:t−1}, 1/N}_{i=1}^N.
• Propagation: x^i_t ∼ R^θ_t(x_t | x̃^i_{1:t−1}) and x^i_{1:t} = {x̃^i_{1:t−1}, x^i_t}.
• Weighting: w^i_t = W^θ_t(x^i_{1:t}).

⇒ {x^i_{1:t}, w^i_t}_{i=1}^N
The particle filter 5(33)
• Resampling: P(a^i_t = j | F^N_{t−1}) = w^j_{t−1} / ∑_l w^l_{t−1}.
• Propagation: x^i_t ∼ R^θ_t(x_t | x^{a^i_t}_{1:t−1}) and x^i_{1:t} = {x^{a^i_t}_{1:t−1}, x^i_t}.
• Weighting: w^i_t = W^θ_t(x^i_{1:t}).

⇒ {x^i_{1:t}, w^i_t}_{i=1}^N
The particle filter 6(33)
Algorithm: Particle filter (PF)

1. Initialize (t = 1):
   (a) Draw x^i_1 ∼ R^θ_1(x_1) for i = 1, ..., N.
   (b) Set w^i_1 = W^θ_1(x^i_1) for i = 1, ..., N.
2. For t = 2, ..., T:
   (a) Draw a^i_t ∼ Discrete({w^j_{t−1}}_{j=1}^N) for i = 1, ..., N.
   (b) Draw x^i_t ∼ R^θ_t(x_t | x^{a^i_t}_{1:t−1}) for i = 1, ..., N.
   (c) Set x^i_{1:t} = {x^{a^i_t}_{1:t−1}, x^i_t} and w^i_t = W^θ_t(x^i_{1:t}) for i = 1, ..., N.
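As a concrete sketch, the algorithm above can be written in a few lines of Python for the common bootstrap choice R_t = f_θ, in which case the weight function W_t reduces to the observation likelihood g_θ(y_t | x_t). The function names and interface below are mine, not from the slides.

```python
import numpy as np

def bootstrap_pf(y, f_sample, g_logpdf, x1_sample, N, rng):
    """Bootstrap particle filter: proposal R_t = f_theta, so w_t = g_theta(y_t | x_t).

    y         : observations y_{1:T}, shape (T,)
    f_sample  : draws x_t ~ f_theta(. | x_{t-1}) for an array of particles
    g_logpdf  : evaluates log g_theta(y_t | x_t) elementwise
    x1_sample : draws N samples from the initial proposal
    Returns particle trajectories (N, T) and normalized final weights (N,).
    """
    T = len(y)
    traj = np.empty((N, T))
    traj[:, 0] = x1_sample(N, rng)
    logw = g_logpdf(y[0], traj[:, 0])
    for t in range(1, T):
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Resampling: multinomial draw on the normalized weights.
        a = rng.choice(N, size=N, p=w)
        traj = traj[a]                        # keep whole ancestral paths
        # Propagation.
        traj[:, t] = f_sample(traj[:, t - 1], rng)
        # Weighting.
        logw = g_logpdf(y[t], traj[:, t])
    w = np.exp(logw - logw.max())
    return traj, w / w.sum()
```

A usage sketch for a linear Gaussian model: pass `f_sample = lambda xp, rng: 0.9 * xp + rng.normal(0, 0.5, size=xp.shape)` and `g_logpdf = lambda yt, xt: -0.5 * (yt - xt) ** 2` (up to constants).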
The particle filter 7(33)

[Figure: particle trajectories over time (state vs. time, t = 5, ..., 25).]
Sampling based on the PF 8(33)

With P(x*_{1:T} = x^i_{1:T} | F^N_T) ∝ w^i_T we get x*_{1:T} approx. ∼ p_θ(x_{1:T} | y_{1:T}).

[Figure: one trajectory x*_{1:T} drawn from the weighted particle system (state vs. time, t = 5, ..., 25).]
Conditional particle filter with ancestor sampling 9(33)

Problems with this approach:

• Based on a PF ⇒ approximate sample.
• Does not leave p(θ, x_{1:T} | y_{1:T}) invariant!
• Relies on a large N to be successful.
• A lot of wasted computation.

Conditional particle filter with ancestor sampling (CPF-AS)

Let x'_{1:T} = (x'_1, ..., x'_T) be a fixed reference trajectory.

• At each time t, sample only N − 1 particles in the standard way.
• Set the Nth particle deterministically: x^N_t = x'_t.
• Generate an artificial history for x^N_t by ancestor sampling.
Conditional particle filter with ancestor sampling 10(33)

Algorithm: CPF-AS, conditioned on x'_{1:T}

1. Initialize (t = 1):
   (a) Draw x^i_1 ∼ R^θ_1(x_1) for i = 1, ..., N − 1.
   (b) Set x^N_1 = x'_1.
   (c) Set w^i_1 = W^θ_1(x^i_1) for i = 1, ..., N.
2. For t = 2, ..., T:
   (a) Draw a^i_t ∼ Discrete({w^j_{t−1}}_{j=1}^N) for i = 1, ..., N − 1.
   (b) Draw x^i_t ∼ R^θ_t(x_t | x^{a^i_t}_{1:t−1}) for i = 1, ..., N − 1.
   (c) Set x^N_t = x'_t.
   (d) Draw a^N_t with P(a^N_t = i | F^N_{t−1}) ∝ w^i_{t−1} f_θ(x'_t | x^i_{t−1}).
   (e) Set x^i_{1:t} = {x^{a^i_t}_{1:t−1}, x^i_t} and w^i_t = W^θ_t(x^i_{1:t}) for i = 1, ..., N.
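The algorithm above differs from the plain PF only in pinning particle N to the reference and in the ancestor-sampling draw (d). A minimal sketch, again for the bootstrap proposal R_t = f_θ and with an interface of my own choosing:

```python
import numpy as np

def cpf_as(y, x_ref, f_sample, f_logpdf, g_logpdf, x1_sample, N, rng):
    """Conditional PF with ancestor sampling, bootstrap proposal (R_t = f_theta).

    x_ref is the reference trajectory x'_{1:T}. Particle N is pinned to it,
    and its history is regenerated by the ancestor-sampling draw in step (d).
    """
    T = len(y)
    traj = np.empty((N, T))
    traj[: N - 1, 0] = x1_sample(N - 1, rng)
    traj[N - 1, 0] = x_ref[0]                                 # 1(b)
    logw = g_logpdf(y[0], traj[:, 0])
    for t in range(1, T):
        w = np.exp(logw - logw.max())
        w /= w.sum()
        a = np.empty(N, dtype=int)
        a[: N - 1] = rng.choice(N, size=N - 1, p=w)           # 2(a)
        # 2(d): P(a^N_t = i) proportional to w^i_{t-1} * f_theta(x'_t | x^i_{t-1})
        logp = np.log(w) + f_logpdf(x_ref[t], traj[:, t - 1])
        p = np.exp(logp - logp.max())
        a[N - 1] = rng.choice(N, p=p / p.sum())
        traj = traj[a]                                        # 2(e): relink paths
        traj[: N - 1, t] = f_sample(traj[: N - 1, t - 1], rng)  # 2(b)
        traj[N - 1, t] = x_ref[t]                               # 2(c)
        logw = g_logpdf(y[t], traj[:, t])                       # 2(e)
    w = np.exp(logw - logw.max())
    return traj, w / w.sum()
```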
The PGAS Markov kernel (I/II) 11(33)

Consider the procedure:

1. Run CPF-AS(N, x'_{1:T}) targeting p_θ(x_{1:T} | y_{1:T}).
2. Sample x*_{1:T} with P(x*_{1:T} = x^i_{1:T} | F^N_T) ∝ w^i_T.

[Figure: reference trajectory, particle system and sampled trajectory (state vs. time, t = 5, ..., 50).]
The PGAS Markov kernel (II/II) 12(33)

This procedure:

• Maps x'_{1:T} stochastically into x*_{1:T}.
• Implicitly defines a Markov kernel (P^N_θ) on (X^T, 𝒳^T), referred to as the PGAS (Particle Gibbs with ancestor sampling) kernel.

Theorem
For any number of particles N ≥ 1 and for any θ ∈ Θ, the PGAS kernel P^N_θ leaves p_θ(x_{1:T} | y_{1:T}) invariant:

p_θ(dx*_{1:T} | y_{1:T}) = ∫ P^N_θ(x'_{1:T}, dx*_{1:T}) p_θ(dx'_{1:T} | y_{1:T}).

F. Lindsten, M. I. Jordan and T. B. Schön, Ancestor sampling for particle Gibbs, in P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems (NIPS) 25, 2600-2608, 2012.
Proof idea (I/II) 13(33)

• Let x_t = {x^1_t, ..., x^N_t} and a_t = {a^1_t, ..., a^N_t}.
• Let b_{1:T} be the indices such that x*_{1:T} = x^{b_{1:T}}_{1:T} = (x^{b_1}_1, ..., x^{b_T}_T).

The PGAS procedure generates a collection of random variables

{x_{1:T}, a_{2:T}, b_T} ∈ Ω ≜ X^{NT} × {1, ..., N}^{N(T−1)+1}.

Define an extended target density on Ω,

φ_θ(x_{1:T}, a_{2:T}, b_T) = φ_θ(x^{b_{1:T}}_{1:T}, b_{1:T}) φ_θ(x^{−b_{1:T}}_{1:T}, a^{−b_{2:T}}_{2:T} | x^{b_{1:T}}_{1:T}, b_{1:T})

= [ p_θ(x^{b_{1:T}}_{1:T} | y_{1:T}) / N^T ]   (marginal)
  × ∏_{i=1, i≠b_1}^N R^θ_1(x^i_1) ∏_{t=2}^T ∏_{i=1, i≠b_t}^N [ w^{a^i_t}_{t−1} / ∑_l w^l_{t−1} ] R^θ_t(x^i_t | x^{a^i_t}_{1:t−1})   (conditional).
Proof idea (II/II) 14(33)

Key observations:

• φ_θ is a PDF on Ω.
• By construction, φ_θ admits p_θ(x_{1:T} | y_{1:T}) as a marginal.
• Each step of PGAS is a properly collapsed Gibbs step for φ_θ, starting from {x^{b_{1:T}}_{1:T}, b_{1:T}} = {x'_{1:T}, (N, ..., N)}.
• The law of x*_{1:T} is unaffected by setting b_{1:T} = (N, ..., N) at each iteration.

⇒ PGAS leaves φ_θ, and thereby also p_θ(x_{1:T} | y_{1:T}), invariant.
Ergodicity 15(33)

Theorem
Assume that there exist constants ε > 0 and κ < ∞ such that, for any θ ∈ Θ, t ∈ {1, ..., T} and x_{1:t} ∈ X^t, W^θ_t(x_{1:t}) ≤ κ and p_θ(y_{1:T}) ≥ ε.

Then, for any N ≥ 2, the PGAS kernel P^N_θ is uniformly ergodic. That is, there exist constants R < ∞ and ρ ∈ [0, 1) such that

‖(P^N_θ)^k(x'_{1:T}, ·) − p_θ(· | y_{1:T})‖_TV ≤ R ρ^k,   ∀ x'_{1:T} ∈ X^T.
Proof idea 16(33)

Show that P^N_θ satisfies a Doeblin condition, P^N_θ(x'_{1:T}, ·) ≥ ε p_θ(· | y_{1:T}).

Take A ∈ 𝒳^T. We can write

P^N_θ(x'_{1:T}, A) = E_{θ, x'_{1:T}}[1_A(x^{b_T}_{1:T})] = ∑_{j=1}^N E_{θ, x'_{1:T}}[ (w^j_T / ∑_l w^l_T) 1_A(x^j_{1:T}) ]

≥ (1/(Nκ)) ∑_{j=1}^{N−1} E_{θ, x'_{1:T}}[ w^j_T 1_A(x^j_{1:T}) ] = ((N−1)/(Nκ)) E_{θ, x'_{1:T}}[ W^θ_T(x^1_{1:T}) 1_A(x^1_{1:T}) ].

Compute the integral w.r.t. x^1_T. Repeat for t = T − 1, t = T − 2, etc.
PGAS for Bayesian identification 17(33)

Bayesian identification: PGAS + Gibbs.

Algorithm: PGAS for Bayesian identification

1. Initialize: Set {θ[0], x_{1:T}[0]} arbitrarily.
2. For k ≥ 1, iterate:
   (a) Draw x_{1:T}[k] ∼ P^N_{θ[k−1]}(x_{1:T}[k−1], ·).
   (b) Draw θ[k] ∼ p(θ | x_{1:T}[k], y_{1:T}).

For any number of particles N ≥ 2, the Markov chain {θ[k], x_{1:T}[k]}_{k≥1} has limiting distribution p(θ, x_{1:T} | y_{1:T}).
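The outer loop of this algorithm can be sketched as below. The two callables are placeholders of my own: `pgas_kernel` would wrap CPF-AS conditioned on the previous trajectory followed by drawing one path with probability proportional to its final weight, and `sample_theta` draws from p(θ | x_{1:T}, y_{1:T}) (with a conjugate prior, e.g. an inverse gamma on a noise variance, this is a standard posterior draw).

```python
import numpy as np

def pgas_gibbs(y, theta0, x0, pgas_kernel, sample_theta, n_iter, rng):
    """PGAS-within-Gibbs outer loop (steps (a) and (b) above).

    pgas_kernel(y, x_ref, theta, rng) -> new trajectory x_{1:T}[k]
    sample_theta(x, y, rng)           -> draw from p(theta | x_{1:T}, y_{1:T})
    Assumes a scalar theta for simplicity of storage.
    """
    theta = theta0
    x = np.asarray(x0, dtype=float)
    thetas = np.empty(n_iter)
    for k in range(n_iter):
        x = pgas_kernel(y, x, theta, rng)    # (a): x_{1:T}[k] ~ P^N_theta
        theta = sample_theta(x, y, rng)      # (b): theta[k] ~ p(theta | x, y)
        thetas[k] = theta
    return thetas, x
```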
ex) Stochastic volatility model 18(33)

Stochastic volatility model,

x_{t+1} = 0.9 x_t + v_t,   v_t ∼ N(0, θ),
y_t = e_t exp(½ x_t),   e_t ∼ N(0, 1).

Consider the ACF of θ[k] − E[θ | y_{1:T}].

[Figure: two panels, ACF vs. lag (0 to 200), for PG-AS (T = 1000) and PG (T = 1000), with N = 5, 20, 100, 1000.]
ex) Wiener system identification 19(33)

[Block diagram: u_t → G → h(·) → Σ → y_t, with process noise v_t entering G and measurement noise e_t entering at the sum.]

Semi-parametric model: state-space model for G, Gaussian process model for h(·).

[Figures: Bode plot (magnitude (dB) and phase (deg) vs. frequency (rad/s)) of G, and the static nonlinearity h(z); true system, posterior mean and 99% credibility intervals.]

F. Lindsten, T. B. Schön and M. I. Jordan, Bayesian semiparametric Wiener system identification, Automatica, 49(7): 2053-2063, July 2013.
ex) Gaussian process state-space models 20(33)

A GP-SSM is a flexible nonparametric dynamical systems model,

f(·) ∼ GP(m_θ(x_t), k_θ(x_t, x'_t)),
x_{t+1} | f, x_t ∼ N(f(x_t), Q),
y_t | x_t ∼ g_θ(y_t | x_t).

Idea: Marginalize out f(·) and do inference directly on x_{1:T} (and θ).

R. Frigola, F. Lindsten, T. B. Schön and C. E. Rasmussen, Bayesian inference and learning in Gaussian process state-space models with particle MCMC, Conference on Neural Information Processing Systems (NIPS), accepted for publication, Lake Tahoe, NV, USA, 2013.

By marginalizing over f, we introduce a non-Markovian dependence in {x_t}.
ex) Gaussian process state-space models, cont'd. 21(33)

! PGAS is well suited for tackling such non-Markovian problems.

[Figures: learned mean function f(x, u) over the state-input space, and a state trajectory x_t over time.]
Maximum likelihood identification 22(33)

Back to the frequentist objective, θ̂_ML = arg max_θ p_θ(y_{1:T}).

Expectation maximization (EM). Iterate:

(E) Q(θ, θ[k−1]) = E_{θ[k−1]}[log p_θ(x_{1:T}, y_{1:T}) | y_{1:T}],
(M) θ[k] = arg max_{θ∈Θ} Q(θ, θ[k−1]).

Problem: The E-step requires us to solve a smoothing problem, i.e. to compute an expectation under p_θ(x_{1:T} | y_{1:T}).
Particle smoothing EM 23(33)
Common approach: Use a particle smoother ⇒ SMC version of the Monte Carlo EM algorithm.
Particle smoothing EM (PSEM): Replace the E-step with, (S) Run a PS to generate ex i 1:T approx. ∼ p θ [ k − 1 ] ( x 1:T | y 1:T ) . (E) Q b PS k ( θ ) = N 1 ∑ N i = 1 log p θ ( ex i 1:T , y 1:T )
EM MCEM PSEM
Problems with PSEM 24(33)

Problems with PSEM:

• Based on a PS ⇒ biased approximation of Q.
• Doubly asymptotic – requires N → ∞ and k → ∞ simultaneously to converge.
• Relies on a large N to be successful.
• A lot of wasted computation.

Furthermore, the computational complexity of many particle smoothers is O(N²).
Stochastic approximation EM 25(33)

[Diagram: EM → MCEM → PSEM; EM → SAEM.]

Assume, for now, that we can sample from p_θ(x_{1:T} | y_{1:T}).

Stochastic approximation EM (SAEM): Replace the E-step with:

(S) Sample x̃_{1:T} ∼ p_{θ[k−1]}(x_{1:T} | y_{1:T}).
(E) Q̂_k(θ) = Q̂_{k−1}(θ) + γ_k (log p_θ(x̃_{1:T}, y_{1:T}) − Q̂_{k−1}(θ)).
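The (E)-step is a Robbins–Monro style stochastic approximation recursion. A toy sketch of that recursion (my own notation, on scalar values rather than the log-likelihood function) makes the mechanics concrete: with the common step size γ_k = 1/k it reduces exactly to the running mean of the simulated quantities.

```python
def sa_recursion(values, gamma=lambda k: 1.0 / k):
    """Stochastic approximation update Q_k = Q_{k-1} + gamma_k (q_k - Q_{k-1}),
    started at Q_0 = 0. Returns the sequence Q_1, Q_2, ... for inspection.
    With gamma_k = 1/k, Q_k equals the running mean of q_1, ..., q_k."""
    Q = 0.0
    history = []
    for k, q in enumerate(values, start=1):
        Q = Q + gamma(k) * (q - Q)
        history.append(Q)
    return history
```

Smaller, slowly decaying step sizes (e.g. γ_k = k^(-0.7)) instead weight recent draws more heavily, which is what makes SAEM practical with a single simulated trajectory per iteration.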
Markovian SAEM 26(33)

• The iterates {θ[k]}_{k≥0} converge to a maximizer of p_θ(y_{1:T}) under standard stochastic approximation conditions.
• Computationally much more efficient than MCEM when the simulation step is complicated.

B. Delyon, M. Lavielle and E. Moulines, Convergence of a stochastic approximation version of the EM algorithm, The Annals of Statistics, 27:94-128, 1999.

• Bad news: Not possible to sample from p_θ(x_{1:T} | y_{1:T}).
• Good news: It is enough to sample from a uniformly ergodic Markov kernel with stationary distribution p_θ(x_{1:T} | y_{1:T}).

We can use PGAS to sample the states!
PGAS for maximum likelihood identification 27(33)

Maximum likelihood identification: PGAS + SAEM.

[Diagram: EM → MCEM → PSEM; EM → SAEM → PSAEM.]

Particle SAEM (PSAEM): Replace the E-step with:

(S) Sample x_{1:T}[k] ∼ P^N_{θ[k−1]}(x_{1:T}[k−1], ·).
(E) Q̂_k(θ) = Q̂_{k−1}(θ) + γ_k (log p_θ(x_{1:T}[k], y_{1:T}) − Q̂_{k−1}(θ)).

F. Lindsten, An efficient stochastic approximation EM algorithm using conditional particle filters, Proceedings of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
Particle SAEM – implementation details 28(33)

Consider exponential family models:

p_θ(x_{1:T}, y_{1:T}) = C exp(⟨ψ(θ), s(x_{1:T})⟩ − A(θ)).

We can then compute the auxiliary quantity as:

• S_k = S_{k−1} + γ_k (∑_{i=1}^N w^i_T s(x^i_{1:T}[k]) − S_{k−1}).
• Q̂_k(θ) = ⟨ψ(θ), S_k⟩ − A(θ).

Note:
• The SA update is on the sufficient statistics.
• We can use all the particles in the update.
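The update on S_k above is a single weighted average followed by the SA step; a minimal sketch (function name and interface are mine, and the weights are assumed to be the final CPF-AS weights, normalized inside the function):

```python
import numpy as np

def update_stats(S_prev, gamma_k, w_T, s_particles):
    """PSAEM sufficient-statistic update, using all weighted particles:

        S_k = S_{k-1} + gamma_k (sum_i w^i_T s(x^i_{1:T}[k]) - S_{k-1}),

    where w^i_T are the (normalized) final particle weights."""
    w = np.asarray(w_T, dtype=float)
    w = w / w.sum()                          # ensure normalization
    s = np.asarray(s_particles, dtype=float) # shape (N, dim_s)
    s_bar = w @ s                            # weighted particle average of s
    return np.asarray(S_prev, dtype=float) + gamma_k * (s_bar - S_prev)
```

Using every particle here (rather than only the sampled trajectory) reduces the variance of the SA update at no extra cost, since the particle system is available anyway.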
ex) Linear time series 29(33)

Proof of concept:

x_{t+1} = a x_t + v_t,   v_t ∼ N(0, σ_v²),
y_t = x_t + e_t,   e_t ∼ N(0, σ_e²),

with θ = (a, σ_v², σ_e²). With N = 15 we get:

[Figure: box plots of θ_k − θ̂_ML for 10·a_k, σ²_{v,k} and σ²_{e,k} at k = 100, 1000, 10000, 100000.]
ex) Nonlinear time series 30(33)

Consider,

x_{t+1} = a x_t + b x_t / (1 + x_t²) + c cos(1.2 t) + v_t,   v_t ∼ N(0, σ_v²),
y_t = 0.05 x_t² + e_t,   e_t ∼ N(0, σ_e²).

• Parameterization: θ = (a, b, c, σ_v², σ_e²).
• Relative error: ‖(θ[k] − θ̂_ML) ./ θ̂_ML‖₂.
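For reproducing experiments on this model, a simulator is a one-liner per equation. The slide does not state the parameter values used, so the ones in any call are illustrative only; the indexing convention (model time t starting at 1) is my assumption.

```python
import numpy as np

def simulate_benchmark(theta, T, rng):
    """Simulate the nonlinear time series above.
    theta = (a, b, c, sigma_v2, sigma_e2). Array index i corresponds to
    model time t = i + 1, so the input term is cos(1.2 * (i + 1))."""
    a, b, c, sv2, se2 = theta
    x = np.empty(T)
    y = np.empty(T)
    x[0] = rng.normal(0.0, np.sqrt(sv2))     # simple choice of initial state
    for t in range(T):
        y[t] = 0.05 * x[t] ** 2 + rng.normal(0.0, np.sqrt(se2))
        if t + 1 < T:
            x[t + 1] = (a * x[t] + b * x[t] / (1.0 + x[t] ** 2)
                        + c * np.cos(1.2 * (t + 1))
                        + rng.normal(0.0, np.sqrt(sv2)))
    return x, y
```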
ex) Nonlinear time series 31(33)

[Figure: average relative error vs. computational time (seconds), log-log scale, for PSAEM with N = 15 and PSEM with N = 15, 50, 100, 500.]