
Uniform Ergodicity of the Particle Gibbs Sampler

Fredrik Lindsten, Randal Douc and Eric Moulines

Linköping University Post Print

N.B.: When citing this work, cite the original article.

Original Publication:

Fredrik Lindsten, Randal Douc and Eric Moulines, Uniform Ergodicity of the Particle Gibbs Sampler, 2015, Scandinavian Journal of Statistics, (42), 3, 775-797.

http://dx.doi.org/10.1111/sjos.12136

Copyright: Wiley: 12 months

http://eu.wiley.com/WileyCDA/

Postprint available at: Linköping University Electronic Press


Uniform ergodicity of the Particle Gibbs sampler

FREDRIK LINDSTEN

Department of Engineering, The University of Cambridge, and
Division of Automatic Control, Linköping University

RANDAL DOUC

Department CITI, Telecom Sudparis

ERIC MOULINES

Department LTCI, Telecom Paristech

November 21, 2014

Abstract

The particle Gibbs (PG) sampler is a systematic way of using a particle filter within Markov chain Monte Carlo (MCMC). This results in an off-the-shelf Markov kernel on the space of state trajectories, which can be used to simulate from the full joint smoothing distribution for a state space model in an MCMC scheme. We show that the PG Markov kernel is uniformly ergodic under rather general assumptions, which we carefully review and discuss. In particular, we provide an explicit rate of convergence which reveals that: (i) for a fixed number of data points, the convergence rate can be made arbitrarily good by increasing the number of particles, and (ii) under general mixing assumptions, the convergence rate can be kept constant by increasing the number of particles superlinearly with the number of observations. We illustrate the applicability of our result by studying in detail a common stochastic volatility model with a non-compact state space.

Keywords: particle Gibbs, particle Markov chain Monte Carlo, conditional sequential Monte Carlo, particle smoothing, state space models

1 Introduction

Statistical inference in general state space hidden Markov models involves computation of the posterior distribution of a set X_{s:t} := [X_s, ..., X_t] of hidden state variables conditionally on a record Y_{0:T} of observations, which we denote as φ_{s:t}⟨Y_{0:T}⟩. Of particular interest is the so-called joint smoothing distribution (JSD) φ_{0:T}⟨Y_{0:T}⟩. Any marginal or fixed-interval smoothing distribution can be obtained from the JSD by marginalization. The JSD can be expressed in closed form only in very specific cases, principally when the state space model is linear and Gaussian or when the state space of the hidden Markov chain is a finite set. In the vast majority of cases, nonlinearity or non-Gaussianity renders analytic solutions intractable.

This limitation has led to increased interest in computational strategies handling more general state and measurement equations. Among these, sequential Monte Carlo (SMC) methods play a central role. SMC methods refer to a class of algorithms approximating a sequence of probability distributions, defined on a sequence of probability spaces, by recursively updating a set of random particles with associated nonnegative importance weights. The SMC methodology has emerged as a key tool for approximating JSD flows in general state space models; see Doucet et al. (2000); Del Moral (2004); Doucet and Johansen (2011) for general introductions as well as applications and theoretical results for SMC methods.

However, a well-known problem with SMC methods is that the particle approximation of any marginal smoothing distribution φ_{t:t}⟨Y_{0:T}⟩ becomes inaccurate for t ≪ T. The reason is that the particle trajectories degenerate gradually as the interacting particle system evolves (Godsill et al., 2004; Fearnhead et al., 2010). To address this problem, several methods have been proposed; see Lindsten and Schön (2013) and the references therein. Among these methods, the recently introduced particle Markov chain Monte Carlo (PMCMC) framework, proposed in the seminal paper by Andrieu et al. (2010), plays a prominent role. PMCMC samplers make use of SMC (or variants thereof) to construct efficient, high-dimensional MCMC kernels which are reversible with respect to the JSD. These methods can then be used as components of more general sampling schemes relying on Markov kernels, for instance enabling joint state and parameter inference in general state space models.

Coupling SMC and MCMC is very useful since the distribution of the state sequence given the stream of observations is generally both high-dimensional and strongly dependent, rendering the design of alternative MCMC procedures, such as single-state Gibbs samplers and Metropolis-Hastings samplers, problematic. PMCMC has already found many applications in areas such as hydrology (Vrugt et al., 2013), finance (Pitt et al., 2012), systems biology (Golightly and Wilkinson, 2011), and epidemiology (Rasmussen et al., 2011), to mention a few. Several methodological developments of the framework have also been made; see e.g. Whiteley et al. (2010); Lindsten et al. (2014); Chopin and Singh (2014); Pitt et al. (2012).

PMCMC algorithms can, broadly speaking, be grouped into two classes of methods: those based on particle independent Metropolis-Hastings (PIMH) kernels and those based on particle Gibbs (PG) kernels. The two classes of kernels are motivated in different ways and they have quite different properties. The former class, PIMH, exploits the fact that the SMC method defines an unbiased estimator of the likelihood, which is used in place of the intractable likelihood in the MH acceptance probability. This method can thus be viewed as a special case of the pseudo-marginal method introduced by Beaumont (2003); Andrieu and Roberts (2009) and later analyzed by Andrieu and Vihola (2012); Lee and Latuszynski (2012). The latter class, PG, on the other hand relies on conditioning the underlying SMC sampler on a reference trajectory to enforce the correct limiting distribution of the kernel; see Section 3. This algorithm can be interpreted as a Gibbs sampler for an extended model where the random variables generated by the SMC sampler are treated as auxiliary variables.

One of the main practical issues with PMCMC algorithms is the choice of the number N of particles. Using fewer particles will result in faster computations at each iteration, but can at the same time result in slower mixing of the resulting Markov kernel. For a fixed computational budget, there is a trade-off between taking the number of particles N large enough to obtain a fast-mixing kernel and running many iterations of the MCMC sampler. Andrieu and Roberts (2009); Andrieu and Vihola (2012); Lee and Latuszynski (2012) investigate the rate of convergence of the pseudo-marginal method and characterize the approximation of the marginal algorithm by the pseudo-marginal algorithm in terms of the variability of their respective ergodic averages. Doucet et al. (2012) and Pitt et al. (2012) conclude, using partially heuristic arguments, that it is close to optimal to let N scale at least linearly with T.

The theoretical properties of the PG kernel, however, are not as well understood. Andrieu et al. (2010) establish under weak conditions that the PG kernel is φ-irreducible and aperiodic for any N ≥ 2 (see Meyn and Tweedie (2009) for definitions). However, this does not provide control of the rate of convergence of the iterates of the PG kernel to stationarity. In this work, we establish that the PG kernel is, under mild assumptions, uniformly ergodic. This property has already been established in an earlier work by Chopin and Singh (2014), but we give here a direct proof under weaker conditions, which in addition provides an explicit lower bound for the convergence rate.

During the preparation of this manuscript, a preprint was made available by Andrieu et al. (2013), who have independently found results similar to those presented here. Indeed, they establish basically the same lower bound on the minorizing constant for the PG kernel (which they refer to as the iterated conditional SMC kernel), though using a different proof technique based on a "doubly conditional" SMC algorithm. There are, however, several differences between the two contributions. We focus in particular on analyzing the minorizing constant under mixing conditions for the state space model which hold very generally, even if the state space is not compact (see Section 4.3). We then study how the number of particles N should be increased with the number of observations T. We show that under weak assumptions, it suffices to increase the number of particles N as T^δ, where δ ≥ 1 can be determined explicitly. This is in contrast with Andrieu et al. (2013), who effectively assume a compact state space; see Remark 3 and Section 4.2. On the other hand, Andrieu et al. (2013) study necessary (i.e., not only sufficient) conditions for uniform ergodicity and translate the convergence results for the PG kernel to a composite MCMC scheme for simulating both states and parameters of a state space model. Given these differences, we believe that the two contributions complement each other.

This paper is organized as follows: In Section 2 we introduce our notation, and in Section 3 we review the PG sampler and formally define the PG Markov kernel. In Section 4 we state the main results, starting with a minorization condition for the PG kernel followed by mixing conditions that allow for time uniform control of the convergence rate. In Section 5 we study, in detail, a commonly used stochastic volatility model with a non-compact state space to illustrate how the conditions of our results can be verified in practice. The proofs of the main theorems are postponed to Sections 6 and 7. An additional example as well as some of the proofs are given in the online supplementary material.

2 Notations and problem statement

We write ℕ and ℕ₊ for the sets of non-negative and positive integers, respectively. Let (X, 𝒳) and (Y, 𝒴) be two measurable spaces and let P(X) be the set of all probability measures on (X, 𝒳). Let M be a kernel on (X, 𝒳) and G a kernel on (X, 𝒴). Assume that for all x ∈ X, G(x, ·) is dominated by some common nonnegative measure κ on (Y, 𝒴) and denote by g(x, ·) its Radon-Nikodym derivative, i.e., for all (x, y) ∈ X × Y: g(x, y) = \frac{dG(x,\cdot)}{d\kappa}(y). Let {(X_t, Y_t), t ∈ ℕ} be a Markov chain with transition kernel defined by: for all (x, y) ∈ X × Y and all C ∈ 𝒳 ⊗ 𝒴,

$$((x, y), C) \mapsto \iint_C M(x, dx')\, G(x', dy') .$$

The sequence {X_t, t ∈ ℕ} is usually not observed, and inference should be carried out on the basis of the observations {Y_t, t ∈ ℕ} only. With µ ∈ P(X) being the initial distribution of the hidden state process, for all t ≥ 0 denote by

$$y_{0:t} \mapsto p_\mu(y_{0:t}) := \int \mu(dx_0)\, g(x_0, y_0) \prod_{s=1}^{t} M(x_{s-1}, dx_s)\, g(x_s, y_s)$$

the density of the observations Y_{0:t} with respect to κ^{⊗(t+1)}. In what follows we set, by abuse of notation, p_x(y_{0:t}) = p_{δ_x}(y_{0:t}) for all x ∈ X, where δ_x is the Dirac measure at x.

For all y ∈ Y, define the (unnormalized) kernel Q⟨y⟩ on (X, 𝒳) by

$$Q\langle y\rangle(x, A) = \int M(x, dx')\, g(x', y)\, \mathbf{1}_A(x') , \qquad (1)$$

and for all s ≤ t and all y_{s:t} ∈ Y^{t−s+1}, define the kernel Q⟨y_{s:t}⟩ on (X, 𝒳) by

$$Q\langle y_{s:t}\rangle(x, A) = Q\langle y_s\rangle Q\langle y_{s+1}\rangle \cdots Q\langle y_t\rangle(x, A) . \qquad (2)$$

With these notations, p_µ(y_{0:t}) = µ(g(·, y_0) Q⟨y_{1:t}⟩𝟏(·)), where 𝟏 is the constant function, 𝟏(x) = 1 for all x ∈ X. In what follows, we set by convention Q⟨y_{s:t}⟩𝟏(x) ≡ 1 for all s > t.

For all µ ∈ P(X) and all 0 ≤ s ≤ t, denote p_µ(y_{s:t} | y_{0:s−1}) := p_µ(y_{0:t})/p_µ(y_{0:s−1}) if p_µ(y_{0:s−1}) ≠ 0 and p_µ(y_{s:t} | y_{0:s−1}) := 0 otherwise, with the convention p_µ(y_{0:t} | y_{0:−1}) = p_µ(y_{0:t}).

A quantity of central interest is the JSD, given by

$$\phi_{\mu,0:t}\langle y_{0:t}\rangle(D) := \frac{1}{p_\mu(y_{0:t})} \int \mu(dx_0)\, g(x_0, y_0) \prod_{s=1}^{t} M(x_{s-1}, dx_s)\, g(x_s, y_s)\, \mathbf{1}_D(x_{0:t}) , \qquad (3)$$

for all D ∈ 𝒳^{⊗(t+1)}. With T being some final time point, the PG sampler (reviewed in the subsequent section) defines a Markov kernel which leaves the JSD φ_{µ,0:T}⟨y_{0:T}⟩ invariant. Samples drawn from the PG kernel can thus be used to draw inference about the states (and/or parameters) of the state space model.

3 The particle Gibbs sampler

Consider first an SMC sampler targeting the sequence of JSDs defined in (3). Let y_{0:T} be a fixed record of observations. At each time t, the SMC sampler approximates φ_{µ,0:t}⟨y_{0:t}⟩ by a set of weighted samples {(X^i_{0:t}, ω^i_t)}_{i=1}^N, in the sense that

$$\phi^N_{\mu,0:t}\langle y_{0:t}\rangle(h) := \sum_{i=1}^{N} \frac{\omega_t^i}{\sum_{\ell=1}^N \omega_t^\ell}\, h(X^i_{0:t})$$

is an estimator of φ_{µ,0:t}⟨y_{0:t}⟩(h) for a measurable function h : X^{t+1} → ℝ. These weighted samples can be generated in many different ways; see e.g. Doucet et al. (2000); Del Moral (2004); Doucet and Johansen (2011); Cappé et al. (2005) and the references therein. Here we review a basic method, though it should be noted that the PG sampler can be generalized to more advanced procedures; see Andrieu et al. (2010); Chopin and Singh (2014).

Initially, φ_{µ,0:0}⟨y_0⟩ is approximated by importance sampling. That is, we simulate independently {X^i_0}_{i=1}^N from a proposal distribution: X^i_0 ∼ r_0⟨y_0⟩(·). The samples, commonly referred to as particles, are then assigned importance weights, ω^i_0 = w_0⟨y_0⟩(X^i_0), where

$$w_0\langle y_0\rangle(x) = g(x, y_0)\, \frac{d\mu}{dr_0\langle y_0\rangle}(x)$$

(provided that r_0⟨y_0⟩ is such that µ ≪ r_0⟨y_0⟩).

We proceed inductively. Denote by F^N_t the filtration generated by the particles and weights up to the current time instant t: F^N_t := σ({(X^i_{0:s}, ω^i_s)}_{i=1}^N, 0 ≤ s ≤ t). Assume that we have at hand a weighted sample {(X^i_{0:t−1}, ω^i_{t−1})}_{i=1}^N approximating the JSD φ_{µ,0:t−1}⟨y_{0:t−1}⟩ at time t − 1. This weighted sample is then propagated sequentially forward in time. This is done by sampling, conditionally independently given the particle history F^N_{t−1}, for each particle i ∈ {1, ..., N} an ancestor index A^i_t with probability

$$\mathbb{P}\big(A_t^i = j \mid \mathcal{F}_{t-1}^N\big) = \frac{\omega_{t-1}^j}{\sum_{\ell=1}^N \omega_{t-1}^\ell} , \quad j \in \{1, \dots, N\} , \qquad (4)$$

and then by sampling a new particle position from the proposal kernel R⟨y_t⟩:

$$X_t^i \sim R\langle y_t\rangle\big(X_{t-1}^{A_t^i}, \cdot\big) . \qquad (5)$$

Finally, the particles are assigned importance weights given by

$$\omega_t^i = w\langle y_t\rangle\big(X_{t-1}^{A_t^i}, X_t^i\big) := \frac{dQ\langle y_t\rangle\big(X_{t-1}^{A_t^i}, \cdot\big)}{dR\langle y_t\rangle\big(X_{t-1}^{A_t^i}, \cdot\big)}\big(X_t^i\big) , \qquad (6)$$

where Q⟨y⟩ is defined in (1) and, as before, it is assumed that Q⟨y⟩(x, ·) ≪ R⟨y⟩(x, ·). The particle trajectories (i.e., the ancestral paths of the particles X^i_t, i ∈ {1, ..., N}) are constructed sequentially by associating the current particle X^i_t with the particle trajectory of its ancestor: X^i_{0:t} := (X^{A^i_t}_{0:t−1}, X^i_t). This results in a weighted particle system {(X^i_{0:t}, ω^i_t)}_{i=1}^N targeting φ_{µ,0:t}⟨y_{0:t}⟩, completing the induction. Two classical choices for the proposal kernel R⟨y⟩ are:

$$R\langle y\rangle(x, dx') = \begin{cases} M(x, dx') & \text{bootstrap filter,} \\[4pt] \dfrac{M(x, dx')\, g(x', y)}{\int M(x, dx'')\, g(x'', y)} & \text{fully-adapted filter.} \end{cases} \qquad (7)$$

The former is commonly used due to its simplicity, whereas the latter is preferable whenever it can be implemented since it is known to minimize the incremental weight variance; see Doucet and Johansen (2011) for further discussion.
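To make the recursion concrete, here is a minimal Python sketch of the bootstrap case (R⟨y⟩ = M, so the weight (6) reduces to w⟨y⟩(x, x′) = g(x′, y)). This is our own illustrative code, not part of the original article; the names `sample_m0`, `sample_m` and `log_g` are hypothetical placeholders for the model ingredients µ, M and g:

```python
import numpy as np

def bootstrap_filter(y, sample_m0, sample_m, log_g, N, rng):
    """Bootstrap SMC: multinomial resampling (4), propagation (5) via M,
    and weighting (6), which here reduces to omega_t^i = g(X_t^i, y_t)."""
    T = len(y) - 1
    x = sample_m0(N, rng)              # X_0^i ~ r_0<y_0> = mu
    traj = x[:, None]                  # ancestral paths X_{0:t}^i, shape (N, t+1)
    logw = log_g(x, y[0])              # log omega_0^i
    for t in range(1, T + 1):
        w = np.exp(logw - logw.max())
        a = rng.choice(N, size=N, p=w / w.sum())   # ancestor indices A_t^i, eq. (4)
        x = sample_m(traj[a, -1], rng)             # X_t^i ~ M(X_{t-1}^{A_t^i}, .), eq. (5)
        traj = np.hstack([traj[a], x[:, None]])    # X_{0:t}^i = (X_{0:t-1}^{A_t^i}, X_t^i)
        logw = log_g(x, y[t])                      # eq. (6), bootstrap case
    w = np.exp(logw - logw.max())
    return traj, w / w.sum()
```

The returned pair plays the role of the weighted sample {(X^i_{0:T}, ω^i_T)}; estimates of φ_{µ,0:T}⟨y_{0:T}⟩(h) are then weighted averages of h over the rows of `traj`.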

Assume now that T is some final time point and that we are interested in simulating from the JSD φ_{µ,0:T}⟨y_{0:T}⟩ using an MCMC procedure. For that purpose, it is required to define a Harris positive recurrent Markov kernel on the path space (X^{T+1}, 𝒳^{⊗(T+1)}) having the JSD φ_{µ,0:T}⟨y_{0:T}⟩ as its unique invariant distribution. The PG sampler accomplishes this by making use of SMC. From an algorithmic point of view, the difference between PG and a standard SMC sampler is that in the PG sampler, one particle trajectory, denoted as x′_{0:T} = (x′_0, ..., x′_T) ∈ X^{T+1}, is specified a priori. This trajectory is used as a reference for the sampler, as discussed below. The reference trajectory is taken into account by simulating only N − 1 particles in the usual way. The Nth particle is then set deterministically according to the reference. At the initialization, we thus simulate independently {X^i_0}_{i=1}^{N−1} with X^i_0 ∼ r_0⟨y_0⟩(·) and set X^N_0 = x′_0. We then compute importance weights for all particles: ω^i_0 = w_0⟨y_0⟩(X^i_0), for i = 1, ..., N.

Analogously, at any consecutive time point t, we sample the first N − 1 particles {(A^i_t, X^i_t)}_{i=1}^{N−1} conditionally independently given F^N_{t−1} according to (4)–(5). Note that these particles will depend on the reference trajectory through the resampling step (4). The Nth particle and its ancestor index are then set deterministically: X^N_t = x′_t and A^N_t = N. Finally, importance weights are then computed for all the particles according to (6). Note that, by construction, the Nth particle trajectory coincides with the reference trajectory for all t: X^N_{0:t} = x′_{0:t}.

After a complete pass of the above procedure, a trajectory X^⋆_{0:T} is sampled from among the particle trajectories at time T, with probabilities given by their importance weights, i.e.,

$$\mathbb{P}\big(X^\star_{0:T} = X^i_{0:T} \mid \mathcal{F}_T^N\big) = \frac{\omega_T^i}{\sum_{\ell=1}^N \omega_T^\ell} , \quad i \in \{1, \dots, N\} . \qquad (8)$$

The PG sampling procedure (summarized in Algorithm 1 in the online supplementary material) thus associates each trajectory x′_{0:T} ∈ X^{T+1} with a probability distribution on (X^{T+1}, 𝒳^{⊗(T+1)}), defining a Markov kernel on (X^{T+1}, 𝒳^{⊗(T+1)}). More specifically, this kernel is given by

$$P_{T,N}(x'_{0:T}, D) := \mathbb{E}\left[\frac{\sum_{i=1}^N \omega_T^i\, \mathbf{1}_D(X^i_{0:T})}{\sum_{\ell=1}^N \omega_T^\ell}\right] , \qquad (9)$$

for (x′_{0:T}, D) ∈ X^{T+1} × 𝒳^{⊗(T+1)}, where E refers to expectation with respect to the random variables generated by the PG algorithm. We refer to P_{T,N} as the PG kernel.
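For concreteness, one draw from the PG kernel (9), i.e., a conditional SMC sweep with the bootstrap proposal followed by the selection step (8), might be sketched as follows. This is our own simplified illustration (not the authors' implementation), using the same hypothetical model interface `sample_m0`, `sample_m`, `log_g` as above:

```python
import numpy as np

def pg_kernel(ref, y, sample_m0, sample_m, log_g, N, rng):
    """One draw from P_{T,N}(ref, .): conditional SMC with `ref` (shape (T+1,))
    frozen as the N-th particle trajectory, followed by the selection (8)."""
    T = len(y) - 1
    x = np.empty(N)
    x[:N - 1] = sample_m0(N - 1, rng)   # N - 1 free particles
    x[N - 1] = ref[0]                   # reference particle: X_0^N = x'_0
    traj = x[:, None]
    logw = log_g(x, y[0])
    for t in range(1, T + 1):
        w = np.exp(logw - logw.max())
        w /= w.sum()
        a = np.empty(N, dtype=int)
        a[:N - 1] = rng.choice(N, size=N - 1, p=w)      # free ancestors, eq. (4)
        a[N - 1] = N - 1                                # A_t^N = N (reference kept)
        x = np.empty(N)
        x[:N - 1] = sample_m(traj[a[:N - 1], -1], rng)  # bootstrap proposal, eq. (5)
        x[N - 1] = ref[t]                               # X_t^N = x'_t
        traj = np.hstack([traj[a], x[:, None]])
        logw = log_g(x, y[t])                           # bootstrap weights, eq. (6)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return traj[rng.choice(N, p=w)]                     # selection step, eq. (8)
```

Iterating `ref = pg_kernel(ref, ...)` produces a Markov chain on the space of state trajectories.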

By construction, the PG kernel leaves the JSD invariant:

$$\phi_{\mu,0:T}\langle y_{0:T}\rangle(D) = \int P_{T,N}(x'_{0:T}, D)\, \phi_{\mu,0:T}\langle y_{0:T}\rangle(dx'_{0:T}) , \quad \forall D \in \mathcal{X}^{\otimes(T+1)} .$$

Quite remarkably, this invariance property holds for any N ≥ 1.

Empirically, it has been found that the mixing of the PG kernel can be improved significantly by updating the ancestor indices A^N_t for t ∈ {1, ..., T}, either as part of the forward recursion (Lindsten et al., 2014) or in a separate backward recursion (Whiteley et al., 2010). We shall not specifically analyze these modified PG algorithms in this work. However, as detailed below, our analysis uses a worst-case bound to completely remove the influence of the reference trajectory (and thus also the corresponding ancestor indices), and therefore our uniform ergodicity result applies straightforwardly to these modified algorithms as well.

4 Main result

In this section we state the main results. First, in Section 4.1, we give a minorization condition for the PG kernel. For a fixed observation sequence y_{0:T}, the result of Section 4.1 implies uniform ergodicity of the PG kernel under weak conditions. Thereafter, in Sections 4.2 and 4.3, we discuss how to increase the number of particles N = N_T as a function of the number of observations T in order to obtain a non-degenerate lower bound as T → ∞.

4.1 Minorization condition

Define the sequence of nonnegative variables {B_{t,T}}_{t=0}^T by

$$B_{t,T} = \sup_{0 \le \ell \le T-t} \frac{|w\langle y_t\rangle|_\infty\, |Q\langle y_{t+1:t+\ell}\rangle \mathbf{1}|_\infty}{p_\mu(y_{t:t+\ell} \mid y_{0:t-1})} , \qquad (10)$$

where |w⟨y⟩|_∞ = sup_{(x,x′)} w⟨y⟩(x, x′), |Q⟨y_{t+1:t+ℓ}⟩𝟏|_∞ = sup_x Q⟨y_{t+1:t+ℓ}⟩𝟏(x) and, by convention, |Q⟨y_{t+1:t}⟩𝟏|_∞ = 1. We then have the following minorization condition for the PG kernel.

Theorem 1. For all x′_{0:T} ∈ X^{T+1} and D ∈ 𝒳^{⊗(T+1)},

$$P_{T,N}(x'_{0:T}, D) \ge \epsilon_{T,N}\, \phi_{\mu,0:T}\langle y_{0:T}\rangle(D) , \qquad (11)$$

where

$$\epsilon_{T,N} := \prod_{t=0}^{T} \frac{N-1}{2 B_{t,T} + N - 2} . \qquad (12)$$
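The product (12) is easy to explore numerically. The following sketch (our own, with arbitrary, purely illustrative values for B_{t,T}) reproduces the two qualitative conclusions stated in the abstract: for a fixed T the constant can be made arbitrarily close to 1 by increasing N, and taking N_T = λT keeps it bounded away from 0 as T grows:

```python
import math

def minorizing_constant(B, N):
    """epsilon_{T,N} of eq. (12): prod over t of (N - 1) / (2 * B_t + N - 2)."""
    return math.prod((N - 1) / (2 * b + N - 2) for b in B)

# (i) fixed T = 9 with hypothetical bounds B_{t,T} = 2: the constant approaches 1 in N
eps_fixed_T = [minorizing_constant([2.0] * 10, N) for N in (10, 100, 1000)]

# (ii) growing T with N_T = 20 * T: the constant stabilizes near exp(-3/20) ~ 0.86
eps_growing_T = [minorizing_constant([2.0] * T, 20 * T) for T in (10, 100, 1000)]
```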

Proof. The proof is postponed to Section 6. However, to provide some intuition for the result, the main ideas of the proof are outlined below. From (9) we have

$$P_{T,N}(x'_{0:T}, D) \ge (N-1)\,\mathbb{E}\left[\frac{\omega_T^1\, \mathbf{1}_D(X_{0:T}^1)}{\sum_{i=1}^N \omega_T^i}\right] \ge (N-1)\,\mathbb{E}\left[\mathbb{E}\left[\left.\frac{\omega_T^1\, \mathbf{1}_D(X_{0:T}^1)}{2|w\langle y_T\rangle|_\infty + \sum_{i=2}^{N-1}\omega_T^i}\,\right|\,\mathcal{F}_{T-1}^N\right]\right] ,$$

where, for the first inequality, we have simply discarded the Nth term (corresponding to the reference particle) and used the fact that the N − 1 weighted particles {(X^i_{0:T}, ω^i_T)}_{i=1}^{N−1} are equally distributed. For the second inequality, we bound the first and the last terms of the sum in the denominator by |w⟨y_T⟩|_∞. This has the effect that the random variables entering the numerator and the denominator of the expression are conditionally independent given F^N_{T−1}. By convexity of x ↦ 1/x and Jensen's inequality we therefore obtain the bound

$$P_{T,N}(x'_{0:T}, D) \ge (N-1)\,\mathbb{E}\left[\frac{\mathbb{E}[\omega_T^1\, \mathbf{1}_D(X_{0:T}^1) \mid \mathcal{F}_{T-1}^N]}{2|w\langle y_T\rangle|_\infty + (N-2)\,\mathbb{E}[\omega_T^2 \mid \mathcal{F}_{T-1}^N]}\right] .$$

The inner conditional expectations can be computed explicitly. Essentially, the result follows by repeating this procedure for time T − 1, then for T − 2, and so on.

Corollary 2. Assume that g(x, y) > 0 for all (x, y) ∈ X × Y and |w⟨y⟩|_∞ < ∞ for all y ∈ Y. Then, for a fixed observation sequence y_{0:T},

$$\epsilon_{T,N} \ge 1 + \frac{1}{N-1} \sum_{t=0}^{T} (1 - 2B_{t,T}) + O(N^{-2}) ,$$

and lim_{N→∞} ε_{T,N} = 1.

Proof. From the definition (12) we have

$$\epsilon_{T,N} = \exp\left\{-\sum_{t=0}^{T} \ln\left(1 + \frac{2B_{t,T} - 1}{N-1}\right)\right\} \ge \exp\left\{\frac{1}{N-1} \sum_{t=0}^{T} (1 - 2B_{t,T})\right\} .$$

For a fixed T, we thus obtain the result provided that B_{t,T} < ∞ for all t ∈ {0, ..., T}. However, the positivity of g implies that p_µ(y_{t:t+ℓ} | y_{0:t−1}) > 0 for all ℓ ≥ 0, and since |w⟨y⟩|_∞ < ∞ for all y ∈ Y, it is easily checked that |Q⟨y_{t+1:t+ℓ}⟩𝟏|_∞ ≤ ∏_{s=t+1}^{t+ℓ} |w⟨y_s⟩|_∞ < ∞, which immediately implies that B_{t,T} < ∞ for all t ∈ {0, ..., T}.

For a fixed observation sequence y_{0:T} (with finite T), Theorem 1 and Corollary 2 ensure uniform ergodicity of the PG kernel under weak conditions. In the following two sections we analyze the minorizing constant of Theorem 1 when T is taken to infinity. Specifically, we discuss how to increase the number of particles N = N_T as a function of the number of observations T in order to obtain a non-degenerate lower bound as T → ∞.

Remark 3. The minorization condition of Theorem 1 is similar to Proposition 6 of Andrieu et al. (2013). However, they express the minorizing constant in terms of the expectation of a likelihood estimator with respect to the law of a "doubly conditional" SMC algorithm. They do not pursue an analysis of the effect on the minorization condition of the forgetting of the initial condition of the state space model. To obtain a non-degenerate rate of convergence when T → ∞ they assume (in our notation) that the triangular array of variables {B_{t,T}}_{0≤t≤T} is uniformly bounded for T ≥ 0. This is the case, basically, only when the model satisfies strong mixing conditions, as we discuss in the subsequent section. Indeed, Andrieu et al. (2013, Proposition 14 and Lemma 17) is the same as our Proposition 5 below.

4.2 Strong mixing condition

We first assume a strong mixing condition for the kernel M:

(S-1) There exist positive constants (σ_−, σ_+), a nonnegative measure γ and an integer m ∈ ℕ₊ such that for all x ∈ X: σ_− γ(dx′) ≤ M^m(x, dx′) ≤ σ_+ γ(dx′).

This condition has been introduced by Del Moral and Guionnet (1999) to establish the uniform-in-time convergence of the particle filter. This condition, which typically requires that the state space is compact, is overly restrictive but it is often used in the analysis of state space models because it implies a form of uniform forgetting of the initial condition of the filter, which is key to obtaining long-term stability of the particle filter.

Proposition 4. Assume that g(x, y) > 0 for all (x, y) ∈ X × Y, that (S-1) holds with m = 1 and that the proposal kernel is fully adapted as defined in (7). Then, taking N_T ∼ λT for some λ > 0, we have

$$\liminf_{T\to\infty} \epsilon_{T,N_T} \ge \exp\left(\frac{1 - 2(\sigma_+/\sigma_-)^2}{\lambda}\right) > 0 .$$

Proof. See the online supplementary material.
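As a small numerical aid (our own, with arbitrary illustrative values), the limiting bound of Proposition 4 can be tabulated as a function of the mixing ratio σ₊/σ₋ and the scaling factor λ; it degrades quickly as the mixing ratio grows, which quantifies how restrictive (S-1) is in practice:

```python
import math

def pg_rate_lower_bound(sigma_ratio, lam):
    """Limiting lower bound of Proposition 4 for N_T ~ lam * T (fully adapted):
    exp((1 - 2 * (sigma_plus / sigma_minus)**2) / lam)."""
    return math.exp((1.0 - 2.0 * sigma_ratio ** 2) / lam)
```

For example, a mixing ratio of 2 with λ = 10 gives exp(−0.7) ≈ 0.50, while the same ratio with λ = 20 gives exp(−0.35) ≈ 0.70.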

Note that Proposition 4 holds whatever the observation sequence {y_t, t ∈ ℕ} may be. This is a consequence of the strong mixing condition (S-1), which yields a simple result but at the expense of an assumption that is rarely met in practice. If, instead of the fully adapted case, we consider the bootstrap filter (see (7)), we may also obtain a uniform-in-time bound. However, this requires an even stronger assumption: the existence of a lower and an upper bound for the observation likelihood.


Proposition 5. Assume that (S-1)–(S-2) hold and that the bootstrap proposal is used: R⟨y⟩(x, ·) = M(x, ·). Then, taking N_T ∼ λT for some λ > 0, we have

$$\liminf_{T\to\infty} \epsilon_{T,N_T} \ge \exp\left(\frac{1 - 2\delta^m \sigma_+/\sigma_-}{\lambda}\right) > 0 ,$$

where m and δ are defined in (S-1) and (S-2), respectively.

Proof. See the online supplementary material.

4.3 Moment assumption

Under the restrictive assumptions (S-1) and (S-2), we obtained non-degenerate uniform convergence bounds when N_T ∝ T. However, these conditions are hardly ever satisfied when the state space is non-compact. We now turn to the analysis of the minorization condition under a much weaker moment assumption.

When the strong mixing assumption is relaxed, however, we are no longer able to obtain bounds that hold uniformly with respect to the observations. Instead, we take a probabilistic approach and analyze the minorizing constant with respect to the distribution of the observation sequence {Y_t, t ∈ ℕ}. For this reason, it is also of interest to carry out the analysis for a parametric family of state space models {(M^θ, G^θ), θ ∈ Θ}, where Θ is a compact subset of a Euclidean space. Informally, this allows us to analyse the ergodicity of the PG kernel even when the algorithm is executed using a misspecified model. That is, assume that {Y_t, t ∈ ℕ} is generated according to (M^{θ⋆}, G^{θ⋆}), where θ⋆ is some "true" parameter of the model. Since θ⋆ is typically unknown, we assume that the PG algorithm is implemented using some estimate θ instead. The question is then: with respect to the law P̄^{θ⋆} (see definition below) of the observation sequence {Y_t, t ∈ ℕ}, can we expect the convergence rate of the PG algorithm (which depends on the realization of the observations) to be non-degenerate as T → ∞? In Theorem 6 below, we show that this is indeed the case, provided that N_T grows as a power of T.

We consider a sequence of parameters {θ_T, T ∈ ℕ} that becomes increasingly close to θ⋆ (in a sense that will be made precise in Theorem 6), converging at the rate 1/√T. The rationale for this assumption is that we are considering the large-T regime and can therefore expect θ_T to be close to θ⋆. We discuss this condition further below.

We emphasize that the result of this section concerns the dependence of the convergence rate of the PG sampler on the observation sequence for large T. For a fixed observation sequence Y_{0:T} = y_{0:T} (with finite T) we can instead appeal to Corollary 2, which ensures uniform ergodicity.

(A-1) For all θ ∈ Θ, the kernel M^θ has a unique stationary distribution denoted as π^θ.

In what follows, for θ ∈ Θ we let E^θ_µ and P^θ_µ refer to the expectation and probability, respectively, induced on ((X × Y)^ℕ, (𝒳 ⊗ 𝒴)^{⊗ℕ}) by a Markov chain {(X_t, Y_t), t ∈ ℕ} evolving according to the state space model (M^θ, G^θ) and starting with X_0 ∼ µ. For simplicity, we write Ē^θ = E^θ_{π^θ} and P̄^θ = P^θ_{π^θ}. For 1 ≤ s ≤ t, we write

$$p^\theta_{\mu,s}(y_{s:t}) = \int p^\theta_\mu(y_{0:t})\, \kappa^{\otimes s}(dy_{0:s-1})$$

for the marginal probability density function of Y_{s:t} with respect to κ^{⊗(t−s+1)}.

(A-2) There exist a constant σ_+ ∈ ℝ₊ and a nonnegative measure λ such that for all θ ∈ Θ and (x, A) ∈ X × 𝒳: M^θ(x, A) ≤ σ_+ λ(A).

Denote by m^θ(x, ·) the Radon-Nikodym derivative

$$m^\theta(x, x') = \frac{dM^\theta(x, \cdot)}{d\lambda}(x') .$$

Under (A-2), the stationary distribution π^θ is absolutely continuous with respect to λ. Furthermore, for notational simplicity it is assumed that the initial distribution µ is absolutely continuous with respect to λ. By abuse of notation, we write π^θ and µ also for the corresponding density functions.

(A-3) For all θ ∈ Θ and (x, x′, y) ∈ X² × Y, m^θ(x, x′) > 0 and g^θ(x, y) > 0.

Define the random variables B̃^θ_t⟨Y_t⟩ = |w^θ⟨Y_t⟩|_∞ / p^θ_{µ,t}(Y_t) and, for all (t, ℓ) ∈ ℕ × ℕ₊,

$$\widetilde{B}^\theta_t\langle Y_{t:t+\ell}\rangle := \frac{|w^\theta\langle Y_t\rangle|_\infty\, |Q^\theta\langle Y_{t+1:t+\ell}\rangle \mathbf{1}|_\infty}{p^\theta_{\mu,t}(Y_{t:t+\ell})} , \qquad (13)$$

$$\widetilde{C}^\theta_t\langle Y_{t:t+\ell}\rangle := \frac{|w^\theta\langle Y_t\rangle|_\infty \int \lambda(dx_{t+1})\, g^\theta(x_{t+1}, Y_{t+1})\, Q^\theta\langle Y_{t+2:t+\ell}\rangle \mathbf{1}(x_{t+1})}{p^\theta_{\mu,t}(Y_{t:t+\ell})} , \qquad (14)$$

where λ is defined in (A-2).

(A-4) There exist constants (ℓ⋆, α) ∈ ℕ₊ × (0, 1) such that

$$\sup_{t\in\mathbb{N}} \sup_{\theta\in\Theta} \mathbb{E}^\theta_\mu\big[(\widetilde{B}^\theta_t\langle Y_{t:t+\ell}\rangle)^\alpha\big] < \infty \quad \text{for all } \ell \in \{0, \dots, \ell_\star - 1\} , \text{ and} \qquad (15)$$

$$\sup_{t\in\mathbb{N}} \sup_{\theta\in\Theta} \mathbb{E}^\theta_\mu\big[(\widetilde{C}^\theta_t\langle Y_{t:t+\ell_\star}\rangle)^\alpha\big] < \infty . \qquad (16)$$

Theorem 6. Assume that (A-1), (A-2), (A-3), and (A-4) hold. Let θ⋆ ∈ Θ and let {θ_T, T ∈ ℕ} be a sequence of parameters such that

$$\limsup_{T\to\infty} T\, \bar{\mathbb{E}}^{\theta_\star}\left[\ln \frac{m^{\theta_\star}(X_0, X_1)\, g^{\theta_\star}(X_1, Y_1)}{m^{\theta_T}(X_0, X_1)\, g^{\theta_T}(X_1, Y_1)}\right] < \infty . \qquad (17)$$

Furthermore, assume that

$$\bar{\mathbb{E}}^{\theta_\star}\left[\ln \frac{\pi^{\theta_\star}(X_0)}{\mu(X_0)}\right] < \infty , \qquad (18)$$

where µ is the initial distribution used in the PG algorithm. Then, for all 0 < γ < α (where α is defined in (A-4)) and for all sequences of positive integers {N_T}_{T≥1} such that N_T ∼ T^{1/γ}, the sequence {ε^{−1}_{T,N_T}(θ_T)}_{T≥1} defined through (12) is P̄^{θ⋆}-tight (bounded in probability).

Proof. The proof is postponed to Section 7.

To understand condition (17), note that for any θ ∈ Θ,

$$D(\theta_\star \,\|\, \theta) := \bar{\mathbb{E}}^{\theta_\star}\left[\ln \frac{m^{\theta_\star}(X_0, X_1)\, g^{\theta_\star}(X_1, Y_1)}{m^{\theta}(X_0, X_1)\, g^{\theta}(X_1, Y_1)}\right]$$

is the expectation under the stationary distribution π^{θ⋆} of the Kullback-Leibler divergence between the conditional distributions p^{θ⋆}(X_1, Y_1 | X_0) and p^θ(X_1, Y_1 | X_0). Hence, D(θ⋆‖θ) ≥ 0 for all θ ∈ Θ and D(θ⋆‖θ⋆) = 0. Assuming that θ⋆ belongs to the interior of Θ and that the function θ ↦ D(θ⋆‖θ) is twice differentiable at θ⋆, a Taylor expansion at θ⋆ yields

$$D(\theta_\star \,\|\, \theta) = \frac{1}{2}(\theta_\star - \theta)^{\mathsf{T}} H_{\theta_\star} (\theta_\star - \theta) + o(\|\theta_\star - \theta\|^2) ,$$

where H_θ is the Hessian of θ ↦ D(θ⋆‖θ). Consequently, for regular statistical models, (17) holds provided that θ_T converges to θ⋆ at the rate 1/√T, i.e., θ_T = θ⋆ + ϱ_T/√T, where the sequence {ϱ_T, T ∈ ℕ} is bounded: sup_{T≥0} ‖ϱ_T‖ < ∞.

Remark 7. It should be noted that our results do not explicitly cover the case where the sequence {ϱ_T, T ∈ ℕ} is stochastic. Still, we believe that our results hint at the possibility of obtaining a non-degenerate lower bound on the minorizing constant also in the stochastic case, given that {ϱ_T, T ∈ ℕ} is tight, under conditions that are much weaker than the previously considered strong mixing assumption.

Remark 8. It is interesting to note that we do not require the initial distribution µ to be equal to π^{θ⋆}, but only that the Kullback-Leibler divergence (18) is bounded. Hence, we may use a quite arbitrary initial distribution and still obtain a sequence of inverse minorization constants that is tight with respect to P̄^{θ⋆}.

A straightforward generalization of the above result is to let the initial distribution belong to a parametric family of distributions, {µ^θ : θ ∈ Θ}. The condition (18) should then be replaced by

$$\limsup_{T\to\infty} \bar{\mathbb{E}}^{\theta_\star}\left[\ln \frac{\pi^{\theta_\star}(X_0)}{\mu^{\theta_T}(X_0)}\right] < \infty .$$

Allowing the initial distribution to depend on θ can be useful in some cases. For instance, if the stationary distribution π^θ is known, it may serve as the initial distribution.

5 Example – A stochastic volatility model

In this section we consider a concrete example to illustrate how the conditions of Theorem 6 can be verified in practice. An additional example is provided in the online supplementary material. We start by stating a technical lemma which will be very useful for checking the assumptions.

Lemma 9. Let (Z, 𝒵) be a measurable space and ξ a measure on (Z, 𝒵). Let α ∈ (0, 1) and let φ, ψ and q be nonnegative measurable functions such that

$$\int \psi(z)\,\varphi(z)\, \xi(dz) < \infty , \qquad (19)$$

$$\int \varphi^{-\frac{\alpha}{1-\alpha}}(z)\, q(z)\, \xi(dz) < \infty . \qquad (20)$$

Then ∫ ψ^α(z) q^{1−α}(z) ξ(dz) < ∞.

Proof. The result follows from Hölder's inequality; see the online supplementary material.
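Although the full proof is deferred to the supplementary material, the key step is a single application of Hölder's inequality with conjugate exponents 1/α and 1/(1 − α):

```latex
\int \psi^{\alpha} q^{1-\alpha}\, d\xi
  = \int (\psi\varphi)^{\alpha}
         \bigl(\varphi^{-\frac{\alpha}{1-\alpha}} q \bigr)^{1-\alpha} d\xi
  \le \Bigl(\int \psi\varphi \, d\xi\Bigr)^{\!\alpha}
      \Bigl(\int \varphi^{-\frac{\alpha}{1-\alpha}} q \, d\xi\Bigr)^{\!1-\alpha} ,
```

which is finite by (19) and (20).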

The canonical model for stochastic volatility in discrete-time data was introduced by Taylor (1982) and has since been worked out by many authors; see Shephard and Andersen (2009) for an up-to-date survey. In this model, the hidden volatility process {X_t, t ∈ ℕ} follows a first-order autoregression,

$$X_{t+1} = \phi X_t + \sigma W_{t+1} , \qquad (21)$$
$$Y_t = \beta \exp(X_t/2)\, U_t , \qquad (22)$$

where {W_t, t ∈ ℕ} and {U_t, t ∈ ℕ} are mutually independent white Gaussian noise sequences with mean zero and unit variance. We write θ = (φ, σ, β) ∈ Θ, where Θ is a compact subset of (−1, 1) × (0, ∞)². For δ > 0, let V_δ(x) = e^{δ|x|} and let µ be an arbitrary distribution on (ℝ, ℬ(ℝ)) for which µ(V_δ) < ∞.

For this model the transition density and the likelihood of the observation are given by
\[ m^\theta(x, x') = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{1}{2\sigma^2}(x' - \phi x)^2 \right) \quad\text{and}\quad g^\theta(x, y) = \frac{1}{\sqrt{2\pi\beta^2}}\, e^{-\left( x/2 + (y^2/2\beta^2)\, e^{-x} \right)} , \]
respectively. For any θ ∈ Θ, the autoregressive process {X_t, t ∈ ℕ} has a unique stationary distribution π_θ, which is zero-mean Gaussian with variance σ²/(1 − φ²). Hence, (A-1) is satisfied.
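For concreteness, the model (21)–(22) is straightforward to simulate; the sketch below (with illustrative parameter values of ours, not taken from the paper) also checks the stated stationary variance σ²/(1 − φ²) empirically.

```python
# Simulate the stochastic volatility model (21)-(22); parameter values are
# illustrative choices, not values used in the paper.
import numpy as np

rng = np.random.default_rng(0)
phi, sigma, beta, T = 0.95, 0.3, 0.7, 10_000

x = np.empty(T)
x[0] = rng.normal(0.0, sigma / np.sqrt(1 - phi**2))  # X_0 ~ pi_theta
for t in range(1, T):
    x[t] = phi * x[t - 1] + sigma * rng.normal()     # recursion (21)
y = beta * np.exp(x / 2) * rng.normal(size=T)        # observation (22)

stat_var = sigma**2 / (1 - phi**2)  # variance of the stationary law pi_theta
```

Started from its stationary law, the sample variance of the chain should be close to `stat_var` for a long trajectory.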

We consider the bootstrap proposal kernel as defined in (7), in which case w^θ⟨y⟩(x, x′) = g^θ(x′, y), so that (A-2) and (A-3) are readily satisfied. We finally check (A-4). It is easily shown that, for all θ ∈ Θ,
\[ \int_{-\infty}^{\infty} g^\theta(x, y)\,dx = \frac{D_1}{|y|} , \qquad D_1 := \frac{1}{\sqrt{2\pi}} \int_0^\infty \frac{e^{-u/2}}{\sqrt{u}}\,du , \tag{23} \]
\[ \sup_{x \in \mathbb{R}} g^\theta(x, y) = \frac{D_2}{|y|} , \qquad D_2 := \frac{1}{\sqrt{2\pi e}} . \tag{24} \]
We will check (A-4) with ℓ⋆ = 1, i.e., we show that
\[ \sup_{t \in \mathbb{N}} \sup_{\theta \in \Theta} \mathbb{E}_\mu^\theta\!\left[ (\tilde{B}_t^\theta\langle Y_t\rangle)^\alpha \right] < \infty \quad\text{and}\quad \sup_{t \in \mathbb{N}} \sup_{\theta \in \Theta} \mathbb{E}_\mu^\theta\!\left[ (\tilde{C}_t^\theta\langle Y_{t:t+1}\rangle)^\alpha \right] < \infty , \tag{25} \]
for any α ∈ (0, 1). Note that we cannot expect (25) to hold with α = 1 since
\[ \mathbb{E}_\mu^\theta\!\left[ \tilde{B}_t^\theta\langle Y_t\rangle \right] = \mathbb{E}_\mu^\theta\!\left[ \frac{|w^\theta\langle Y_t\rangle|_\infty}{p_{\mu,t}^\theta(Y_t)} \right] = \int \sup_{x \in \mathbb{R}} g^\theta(x, y)\,dy = D_2 \int \frac{dy}{|y|} = \infty . \]

We now turn to the proof of (25). Note first that
\[ \limsup_{|x| \to \infty} \sup_{\theta \in \Theta} \frac{\mathbb{E}_x^\theta[V_\delta(X_1)]}{V_\delta(x)} = 0 , \]
and, for any K < ∞, sup_{|x|≤K} sup_{θ∈Θ} 𝔼_x^θ[V_δ(X_1)] < ∞. Therefore, there exist constants λ_δ ∈ (0, 1) and b_δ < ∞ such that, for all x ∈ X, sup_{θ∈Θ} 𝔼_x^θ[V_δ(X_1)] ≤ λ_δ V_δ(x) + b_δ. This implies that, for all δ > 0, sup_{θ∈Θ} 𝔼_x^θ[V_δ(X_t)] ≤ λ_δ^t V_δ(x) + b_δ(1 + λ_δ + ⋯ + λ_δ^{t−1}) ≤ V_δ(x) + b_δ/(1 − λ_δ), and thus
\[ \sup_{t \in \mathbb{N}} \sup_{\theta \in \Theta} \mathbb{E}_\mu^\theta[V_\delta(X_t)] \le \mu(V_\delta) + b_\delta/(1 - \lambda_\delta) < \infty . \tag{26} \]
Using the fact that w^θ⟨y⟩(x, x′) = g^θ(x′, y) and (24), we get
\[ \mathbb{E}_\mu^\theta\!\left[ (\tilde{B}_t^\theta\langle Y_t\rangle)^\alpha \right] = \int |w^\theta\langle y_t\rangle|_\infty^\alpha \{p_{\mu,t}^\theta(y_t)\}^{1-\alpha}\,dy_t \le D_2^\alpha \int |y_t|^{-\alpha} \{p_{\mu,t}^\theta(y_t)\}^{1-\alpha}\,dy_t . \tag{27} \]

We apply Lemma 9 to establish a bound for (27). Consider the functions ψ and ϕ given by
\[ \psi(y) = \frac{1}{|y|} \quad\text{and}\quad \varphi(y) = \frac{|y|^\gamma}{|y|^2 \vee 1} , \]
with γ ∈ (0, 1) such that γα/(1 − α) < 1. With these definitions we get
\[ \int \varphi(y)\psi(y)\,dy = \int \frac{|y|^{\gamma - 1}}{|y|^2 \vee 1}\,dy < \infty , \]
showing that the first condition, (19), is satisfied. We now check (20):
\[ \int \varphi^{-\alpha/(1-\alpha)}(y_t)\, p_{\mu,t}^\theta(y_t)\,dy_t = \mathbb{E}_\mu^\theta\!\left[ \mathbb{E}_{X_t}^\theta\!\left[ \varphi^{-\alpha/(1-\alpha)}(Y_0) \right] \right] . \tag{28} \]

Since Y_0 = β exp(X_0/2) U_0, it follows that 𝔼_x^θ[φ^{−α/(1−α)}(Y_0)] = 𝔼[φ^{−α/(1−α)}(β e^{x/2} U)], where U is standard normal. We have
\[ \mathbb{E}\!\left[ \left( \frac{\beta^2 e^x U^2 \vee 1}{\beta^\gamma e^{\gamma x/2} |U|^\gamma} \right)^{\!\alpha/(1-\alpha)} \right] \le \mathbb{E}\!\left[ (\beta e^{x/2}|U|)^{-\gamma\alpha/(1-\alpha)}\, \mathbb{1}\{\beta|U|e^{x/2} \le 1\} \right] + \mathbb{E}\!\left[ (\beta^2 e^x U^2)^{\alpha/(1-\alpha)}\, \mathbb{1}\{\beta|U|e^{x/2} > 1\} \right] \le (\beta e^{x/2})^{-\gamma\alpha/(1-\alpha)}\, \mathbb{E}\!\left[ |U|^{-\gamma\alpha/(1-\alpha)} \right] + (\beta^2 e^x)^{\alpha/(1-\alpha)}\, \mathbb{E}\!\left[ |U|^{2\alpha/(1-\alpha)} \right] . \]
Since γα/(1 − α) < 1 it holds that 𝔼[|U|^{−γα/(1−α)}] < ∞ and, additionally, 𝔼[|U|^{2α/(1−α)}] < ∞. Therefore, there exist constants D_3 < ∞ and δ > 0 such that, for all x ∈ ℝ and θ ∈ Θ,
\[ \mathbb{E}_x^\theta\!\left[ \varphi^{-\alpha/(1-\alpha)}(Y_0) \right] \le D_3\, e^{\delta|x|} = D_3 V_\delta(x) . \tag{29} \]

Using (28), (29) and (26) verifies the second condition, (20). Lemma 9 can thus be used to conclude that 𝔼_μ^θ[(B̃_t^θ⟨Y_t⟩)^α] < ∞ for all α ∈ (0, 1). Since this holds for any t ∈ ℕ and θ ∈ Θ, the first part of (25) follows.

Next we check that, for all α ∈ (0, 1), 𝔼_μ^θ[(C̃_t^θ⟨Y_{t:t+1}⟩)^α] < ∞. Using (23) and (24), we get
\[ \mathbb{E}_\mu^\theta\!\left[ (\tilde{C}_t^\theta\langle Y_{t:t+1}\rangle)^\alpha \right] = \mathbb{E}^\theta\!\left[ \frac{ |w^\theta\langle Y_t\rangle|_\infty^\alpha \left( \int g^\theta(x_{t+1}, Y_{t+1})\,dx_{t+1} \right)^{\!\alpha} }{ (p_{\mu,t}^\theta(Y_{t:t+1}))^\alpha } \right] = \iint |w^\theta\langle y_t\rangle|_\infty^\alpha \left( \int g^\theta(x_{t+1}, y_{t+1})\,dx_{t+1} \right)^{\!\alpha} (p_{\mu,t}^\theta(y_{t:t+1}))^{1-\alpha}\,dy_{t:t+1} \le D_1^\alpha D_2^\alpha \iint |y_t y_{t+1}|^{-\alpha}\, (p_{\mu,t}^\theta(y_{t:t+1}))^{1-\alpha}\,dy_{t:t+1} . \tag{30} \]

We use again Lemma 9, now with
\[ \psi(y_0, y_1) = \frac{1}{|y_0||y_1|} \quad\text{and}\quad \varphi(y_0, y_1) = \frac{|y_0|^\gamma |y_1|^\gamma}{(y_0^2 \vee 1)(y_1^2 \vee 1)} , \]
where γ ∈ (0, 1) and γα/(1 − α) < 1. Note first that
\[ \iint \psi(y_0, y_1)\varphi(y_0, y_1)\,dy_{0:1} = \iint \left[ (|y_0||y_1|)^{1-\gamma} (y_0^2 \vee 1)(y_1^2 \vee 1) \right]^{-1} dy_{0:1} < \infty . \tag{31} \]
Hence, (19) is satisfied. We finally check (20). Using the conditional independence of the observations given the states and (29),
\[ \iint \varphi^{-\alpha/(1-\alpha)}(y_{t:t+1})\, p_{\mu,t}^\theta(y_{t:t+1})\,dy_{t:t+1} = \mathbb{E}_\mu^\theta\!\left[ \mathbb{E}_\mu^\theta\!\left[ \varphi^{-\alpha/(1-\alpha)}(Y_{t:t+1}) \,\middle|\, X_{t:t+1} \right] \right] = \mathbb{E}_\mu^\theta\!\left[ \mathbb{E}_{X_t}^\theta\!\left[ \left( \frac{\beta^2 e^{X_0} U^2 \vee 1}{\beta^\gamma e^{\gamma X_0/2} |U|^\gamma} \right)^{\!\alpha/(1-\alpha)} \right] \mathbb{E}_{X_{t+1}}^\theta\!\left[ \left( \frac{\beta^2 e^{X_0} U^2 \vee 1}{\beta^\gamma e^{\gamma X_0/2} |U|^\gamma} \right)^{\!\alpha/(1-\alpha)} \right] \right] \le D_3^2\, \mathbb{E}_\mu^\theta\!\left[ e^{\delta|X_t|}\, e^{\delta|X_{t+1}|} \right] . \]

From the Cauchy–Schwarz inequality we get
\[ \mathbb{E}_\mu^\theta\!\left[ e^{\delta|X_t|}\, e^{\delta|X_{t+1}|} \right] \le \left( \mathbb{E}_\mu^\theta\!\left[ e^{2\delta|X_t|} \right] \mathbb{E}_\mu^\theta\!\left[ e^{2\delta|X_{t+1}|} \right] \right)^{1/2} . \]
Applying (26) with δ replaced by 2δ yields (20). Using Lemma 9 thus establishes (25), and thereby (A-4) holds.

Provided that θ_T converges to θ⋆ at a rate 1/√T, we may therefore apply Theorem 6, which shows that for any γ ∈ (0, 1), the sequence \(\{\epsilon_{T,N_T}^{-1}(\theta_T)\}_{T \ge 1}\) is tight with N_T ∼ T^{1/γ}.

6 Proof of Theorem 1

We will now turn to the proof of the minorization condition in Theorem 1. As in the statement of the theorem, we will not explicitly indicate any possible dependence on unknown model parameters in the notation in this section. This is done for notational convenience and is without loss of generality. Throughout this section, P and E refer to probability and expectation, respectively, with respect to the random variables generated by the PG algorithm (the observation sequence y_{0:T} is treated as fixed). The proof is inductive and follows from a series of lemmas.

Lemma 10. Let X ≥ 0 and Y > 0 be independent random variables. Then, 𝔼[X/Y] ≥ 𝔼[X]/𝔼[Y].

Proof. By independence, 𝔼[X/Y] = 𝔼[X] 𝔼[1/Y], and since y ↦ 1/y is convex on (0, ∞), Jensen's inequality gives 𝔼[1/Y] ≥ 1/𝔼[Y].
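A small Monte Carlo illustration of Lemma 10 (the distributions below are arbitrary examples of ours, chosen only so that X ≥ 0 and Y > 0):

```python
# Monte Carlo illustration of Lemma 10: for independent X >= 0 and Y > 0,
# E[X/Y] >= E[X]/E[Y].
import numpy as np

rng = np.random.default_rng(1)
X = rng.gamma(2.0, 1.0, size=200_000)    # X >= 0
Y = rng.uniform(0.5, 2.0, size=200_000)  # Y > 0, independent of X

lhs = np.mean(X / Y)                     # estimates E[X/Y]
rhs = np.mean(X) / np.mean(Y)            # estimates E[X]/E[Y]
```

Here the gap is substantial (exactly 𝔼[X](𝔼[1/Y] − 1/𝔼[Y])), so the inequality is visible well beyond Monte Carlo noise.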

Lemma 11. Let f and h be nonnegative measurable functions. For t ∈ {0, …, T − 1}, we have
\[ \mathbb{E}\!\left[ \frac{\sum_{i=1}^N \omega_{t+1}^i f(X_{0:t+1}^i)}{\sum_{i=1}^N \omega_{t+1}^i h(X_{t+1}^i)} \,\middle|\, \mathcal{F}_t^N \right] \ge \frac{ \sum_{i=1}^N \omega_t^i \int Q\langle y_{t+1}\rangle(X_t^i, dx_{t+1})\, f(X_{0:t}^i, x_{t+1}) }{ \sum_{i=1}^N \omega_t^i \left[ \frac{N-2}{N-1}\, Q\langle y_{t+1}\rangle h(X_t^i) + \frac{2}{N-1} \sup_{(x,x')} w\langle y_{t+1}\rangle(x, x')\, h(x') \right] } , \tag{32} \]
and
\[ \mathbb{E}\!\left[ \frac{\sum_{i=1}^N \omega_0^i f(X_0^i)}{\sum_{i=1}^N \omega_0^i h(X_0^i)} \right] \ge \frac{(N-1)\, \mu(g(\cdot, y_0) f(\cdot))}{(N-2)\, \mu(g(\cdot, y_0) h(\cdot)) + 2 \sup_x [w\langle y_0\rangle(x) h(x)]} . \tag{33} \]

Proof. Using that ω_{t+1}^1 h(X_{t+1}^1) + ω_{t+1}^N h(X_{t+1}^N) ≤ 2 sup_{(x,x′)} w⟨y_{t+1}⟩(x, x′) h(x′), and that the weighted particles {(X_{t+1}^i, ω_{t+1}^i)}_{i=1}^{N−1} are conditionally i.i.d. given F_t^N, we get
\[ \mathbb{E}\!\left[ \frac{\sum_{i=1}^N \omega_{t+1}^i f(X_{0:t+1}^i)}{\sum_{i=1}^N \omega_{t+1}^i h(X_{t+1}^i)} \,\middle|\, \mathcal{F}_t^N \right] \ge \mathbb{E}\!\left[ \frac{\sum_{i=1}^{N-1} \omega_{t+1}^i f(X_{0:t+1}^i)}{\sum_{i=1}^N \omega_{t+1}^i h(X_{t+1}^i)} \,\middle|\, \mathcal{F}_t^N \right] \ge (N-1)\, \mathbb{E}\!\left[ \frac{ \omega_{t+1}^1 f(X_{0:t+1}^1) }{ \sum_{i=2}^{N-1} \omega_{t+1}^i h(X_{t+1}^i) + 2 \sup_{(x,x')} w\langle y_{t+1}\rangle(x, x') h(x') } \,\middle|\, \mathcal{F}_t^N \right] \ge (N-1)\, \frac{ \mathbb{E}\!\left[ \omega_{t+1}^1 f(X_{0:t+1}^1) \,\middle|\, \mathcal{F}_t^N \right] }{ \mathbb{E}\!\left[ \sum_{i=2}^{N-1} \omega_{t+1}^i h(X_{t+1}^i) \,\middle|\, \mathcal{F}_t^N \right] + 2 \sup_{(x,x')} w\langle y_{t+1}\rangle(x, x') h(x') } , \tag{34} \]
where the last inequality follows from Lemma 10. Consider first the numerator on the right-hand side of (34). We have
\[ \mathbb{E}\!\left[ \omega_{t+1}^1 f(X_{0:t+1}^1) \,\middle|\, \mathcal{F}_t^N \right] = \frac{1}{\sum_{l=1}^N \omega_t^l} \sum_{j=1}^N \omega_t^j \int R\langle y_{t+1}\rangle(X_t^j, dx_{t+1})\, w\langle y_{t+1}\rangle(X_t^j, x_{t+1})\, f(X_{0:t}^j, x_{t+1}) = \frac{1}{\sum_{l=1}^N \omega_t^l} \sum_{j=1}^N \omega_t^j \int Q\langle y_{t+1}\rangle(X_t^j, dx_{t+1})\, f(X_{0:t}^j, x_{t+1}) . \tag{35} \]


We now consider the denominator on the right-hand side of (34):
\[ \mathbb{E}\!\left[ \sum_{i=2}^{N-1} \omega_{t+1}^i h(X_{t+1}^i) \,\middle|\, \mathcal{F}_t^N \right] = (N-2)\, \mathbb{E}\!\left[ \omega_{t+1}^1 h(X_{t+1}^1) \,\middle|\, \mathcal{F}_t^N \right] = \frac{N-2}{\sum_{l=1}^N \omega_t^l} \sum_{j=1}^N \omega_t^j\, Q\langle y_{t+1}\rangle h(X_t^j) , \]
where the last identity follows from (35) with f(x_{0:t+1}) = h(x_{t+1}). The proof of (32) follows.

Consider now (33). Since the particles {X_0^i}_{i=1}^{N−1} are i.i.d., we obtain, using again Lemma 10,
\[ \mathbb{E}\!\left[ \frac{\sum_{i=1}^N \omega_0^i f(X_0^i)}{\sum_{i=1}^N \omega_0^i h(X_0^i)} \right] \ge (N-1)\, \frac{ \mathbb{E}\!\left[ \omega_0^1 f(X_0^1) \right] }{ \mathbb{E}\!\left[ \sum_{i=2}^{N-1} \omega_0^i h(X_0^i) \right] + 2 \sup_x w\langle y_0\rangle(x) h(x) } . \]
The numerator is given by 𝔼[ω_0^1 f(X_0^1)] = ∫ r_0⟨y_0⟩(dx_0) w⟨y_0⟩(x_0) f(x_0) = μ(g(·, y_0)f(·)). Similarly, we get 𝔼[Σ_{i=2}^{N−1} ω_0^i h(X_0^i)] = (N − 2) 𝔼[ω_0^1 h(X_0^1)] = (N − 2) μ(g(·, y_0)h(·)).

Define a sequence of nonnegative scalars {β_t}_{t=0}^T by the backward recursion β_T = |w⟨y_T⟩|_∞ and, for t = T − 1, T − 2, …, 0,
\[ \beta_t = |w\langle y_t\rangle|_\infty \left\{ \frac{2}{N-1} \sum_{\ell=1}^{T-t} \left( \frac{N-2}{N-1} \right)^{\!\ell-1} \beta_{t+\ell}\, |Q\langle y_{t+1:t+\ell-1}\rangle \mathbb{1}|_\infty + \left( \frac{N-2}{N-1} \right)^{\!T-t} |Q\langle y_{t+1:T}\rangle \mathbb{1}|_\infty \right\} . \tag{36} \]
Given {β_t}_{t=0}^T, define the functions {h_t}_{t=0}^T, h_t : X → ℝ₊, by the backward recursion h_T ≡ 1 and, for all t = T − 1, T − 2, …, 0,
\[ h_t : x \mapsto h_t(x) = \frac{N-2}{N-1}\, Q\langle y_{t+1}\rangle h_{t+1}(x) + \frac{2}{N-1}\, \beta_{t+1} . \tag{37} \]
Solving the backward recursion, (37) implies
\[ h_t(x) = \frac{2}{N-1} \sum_{\ell=1}^{T-t} \left( \frac{N-2}{N-1} \right)^{\!\ell-1} \beta_{t+\ell}\, Q\langle y_{t+1:t+\ell-1}\rangle \mathbb{1}(x) + \left( \frac{N-2}{N-1} \right)^{\!T-t} Q\langle y_{t+1:T}\rangle \mathbb{1}(x) . \tag{38} \]
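The agreement between the recursion (37) and the closed form (38) can be checked numerically in the simplest instance, a one-point state space, where the operator Q⟨y_t⟩ reduces to multiplication by a positive number. The numbers q[t] and beta[t] below are arbitrary stand-ins of ours for those quantities.

```python
# One-point-state-space check that the closed form (38) solves recursion (37).
import numpy as np

rng = np.random.default_rng(2)
T, N = 6, 20
q = rng.uniform(0.5, 1.5, size=T + 1)     # q[t] stands in for Q<y_t>1
beta = rng.uniform(0.1, 2.0, size=T + 1)  # stand-ins for the beta_t's
a, c = (N - 2) / (N - 1), 2 / (N - 1)

# recursion (37): h_T = 1, h_t = a * q[t+1] * h_{t+1} + c * beta[t+1]
h = np.empty(T + 1)
h[T] = 1.0
for t in range(T - 1, -1, -1):
    h[t] = a * q[t + 1] * h[t + 1] + c * beta[t + 1]

def h_closed(t):
    """Closed form (38), with Q<y_{t+1:t+l-1}>1 = product of q's."""
    s = 0.0
    for l in range(1, T - t + 1):
        prod_q = np.prod(q[t + 1:t + l])  # empty product = 1 when l = 1
        s += c * a**(l - 1) * beta[t + l] * prod_q
    return s + a**(T - t) * np.prod(q[t + 1:T + 1])
```

Both formulas agree to machine precision for every t, which is the content of the step from (37) to (38).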

For D ∈ 𝒳^{⊗(T+1)}, set f_T(x_{0:T}) = 𝟙_D(x_{0:T}) and, for t = T − 1, T − 2, …, 0,
\[ f_t(x_{0:t}) = \int Q\langle y_{t+1}\rangle(x_t, dx_{t+1})\, f_{t+1}(x_{0:t+1}) , \tag{39} \]
or equivalently, \( f_t(x_{0:t}) = \int \prod_{\ell=1}^{T-t} Q\langle y_{t+\ell}\rangle(x_{t+\ell-1}, dx_{t+\ell})\, \mathbb{1}_D(x_{0:T}) \).

Lemma 12. For any x′_{0:T} ∈ X^{T+1} and D ∈ 𝒳^{⊗(T+1)},
\[ P_{T,N}(x'_{0:T}, D) = \mathbb{E}\!\left[ \frac{\sum_{i=1}^N \omega_T^i\, \mathbb{1}_D(X_{0:T}^i)}{\sum_{i=1}^N \omega_T^i} \right] \ge \frac{(N-1)\, p_\mu(y_{0:T})}{(N-2)\, \mu(g(\cdot, y_0) h_0(\cdot)) + 2\beta_0}\, \phi_{\mu,0:T}\langle y_{0:T}\rangle(D) . \]


Proof. Note first that, by construction,
\[ \mathbb{E}\!\left[ \frac{\sum_{i=1}^N \omega_T^i\, \mathbb{1}_D(X_{0:T}^i)}{\sum_{i=1}^N \omega_T^i} \right] = \mathbb{E}\!\left[ \frac{\sum_{i=1}^N \omega_T^i f_T(X_{0:T}^i)}{\sum_{i=1}^N \omega_T^i h_T(X_T^i)} \right] . \tag{40} \]
We now show, by backward induction, that for all t ∈ {0, …, T − 1},
\[ \mathbb{E}\!\left[ \frac{\sum_{i=1}^N \omega_{t+1}^i f_{t+1}(X_{0:t+1}^i)}{\sum_{i=1}^N \omega_{t+1}^i h_{t+1}(X_{t+1}^i)} \right] \ge \mathbb{E}\!\left[ \frac{\sum_{i=1}^N \omega_t^i f_t(X_{0:t}^i)}{\sum_{i=1}^N \omega_t^i h_t(X_t^i)} \right] . \tag{41} \]
To obtain (41), note first that the tower property of the conditional expectation, Lemma 11, and (39) imply
\[ \mathbb{E}\!\left[ \frac{\sum_{i=1}^N \omega_{t+1}^i f_{t+1}(X_{0:t+1}^i)}{\sum_{i=1}^N \omega_{t+1}^i h_{t+1}(X_{t+1}^i)} \right] = \mathbb{E}\!\left[ \mathbb{E}\!\left[ \frac{\sum_{i=1}^N \omega_{t+1}^i f_{t+1}(X_{0:t+1}^i)}{\sum_{i=1}^N \omega_{t+1}^i h_{t+1}(X_{t+1}^i)} \,\middle|\, \mathcal{F}_t^N \right] \right] \ge \mathbb{E}\!\left[ \frac{ \sum_{i=1}^N \omega_t^i f_t(X_{0:t}^i) }{ \sum_{i=1}^N \omega_t^i \left[ \frac{N-2}{N-1}\, Q\langle y_{t+1}\rangle h_{t+1}(X_t^i) + \frac{2}{N-1} \sup_{(x,x')} w\langle y_{t+1}\rangle(x, x')\, h_{t+1}(x') \right] } \right] . \]
It follows directly from (36) and (38) that
\[ \sup_x w\langle y_0\rangle(x)\, h_0(x) \le \beta_0 , \tag{42} \]
\[ \sup_{x,x'} w\langle y_{t+1}\rangle(x, x')\, h_{t+1}(x') \le \beta_{t+1} , \qquad t \in \{0, \ldots, T-1\} . \tag{43} \]
Combining the inequality (43) with the definition of h_t in (37) yields
\[ \sum_{i=1}^N \omega_t^i \left[ \frac{N-2}{N-1}\, Q\langle y_{t+1}\rangle h_{t+1}(X_t^i) + \frac{2}{N-1} \sup_{(x,x')} w\langle y_{t+1}\rangle(x, x')\, h_{t+1}(x') \right] \le \sum_{i=1}^N \omega_t^i h_t(X_t^i) , \]
showing (41). Combining (41) with (40) and using (33) of Lemma 11 establishes that
\[ \mathbb{E}\!\left[ \frac{\sum_{i=1}^N \omega_T^i\, \mathbb{1}_D(X_{0:T}^i)}{\sum_{i=1}^N \omega_T^i} \right] \ge \mathbb{E}\!\left[ \frac{\sum_{i=1}^N \omega_0^i f_0(X_0^i)}{\sum_{i=1}^N \omega_0^i h_0(X_0^i)} \right] \ge \frac{(N-1)\, \mu(g(\cdot, y_0) f_0(\cdot))}{(N-2)\, \mu(g(\cdot, y_0) h_0(\cdot)) + 2\beta_0} , \]
where the last inequality stems from (42). The proof is completed by noting that
\[ \mu(g(\cdot, y_0) f_0(\cdot)) = \phi_{\mu,0:T}\langle y_{0:T}\rangle(D)\, p_\mu(y_{0:T}) . \]

Finally, to prove Theorem 1 it remains to show the following.

Lemma 13. With B_{t,T} defined as in (10), it holds that
\[ (N-2)\, \mu(g(\cdot, y_0) h_0(\cdot)) + 2\beta_0 \le (N-1)\, p_\mu(y_{0:T}) \prod_{t=0}^T \frac{2 B_{t,T} + N - 2}{N-1} . \tag{44} \]


Proof. Define, for t ∈ {0, …, T},
\[ \alpha_t = \frac{\beta_t}{p_\mu(y_{t:T} \mid y_{0:t-1})} , \tag{45} \]
with the convention p_μ(y_{0:T} | y_{0:−1}) = p_μ(y_{0:T}). In particular, α_0 = β_0 / p_μ(y_{0:T}). Eq. (36) implies
\[ \alpha_t = |w\langle y_t\rangle|_\infty \left\{ \frac{2}{N-1} \sum_{\ell=1}^{T-t} \left( \frac{N-2}{N-1} \right)^{\!\ell-1} \alpha_{t+\ell}\, \frac{ |Q\langle y_{t+1:t+\ell-1}\rangle \mathbb{1}|_\infty\, p_\mu(y_{t+\ell:T} \mid y_{0:t+\ell-1}) }{ p_\mu(y_{t:T} \mid y_{0:t-1}) } + \left( \frac{N-2}{N-1} \right)^{\!T-t} \frac{ |Q\langle y_{t+1:T}\rangle \mathbb{1}|_\infty }{ p_\mu(y_{t:T} \mid y_{0:t-1}) } \right\} . \]
The identity
\[ \frac{p_\mu(y_{t+\ell:T} \mid y_{0:t+\ell-1})}{p_\mu(y_{t:T} \mid y_{0:t-1})} = \frac{1}{p_\mu(y_{t:t+\ell-1} \mid y_{0:t-1})} \]
and the definition in (10) imply that
\[ \alpha_t \le B_{t,T} \left\{ \frac{2}{N-1} \sum_{\ell=1}^{T-t} \left( \frac{N-2}{N-1} \right)^{\!\ell-1} \alpha_{t+\ell} + \left( \frac{N-2}{N-1} \right)^{\!T-t} \right\} . \tag{46} \]
By a backward induction, define the sequence {α̃_t}_{t=0}^T as follows: set α̃_T = B_{T,T} and
\[ \tilde{\alpha}_t = B_{t,T} \left[ \frac{2}{N-1} \sum_{\ell=1}^{T-t} \left( \frac{N-2}{N-1} \right)^{\!\ell-1} \tilde{\alpha}_{t+\ell} + \left( \frac{N-2}{N-1} \right)^{\!T-t} \right] . \tag{47} \]
Since, by construction, α_T ≤ B_{T,T} = α̃_T, an elementary backward recursion using (46) shows that α_t ≤ α̃_t for all t ∈ {0, …, T}. Moreover,
\[ \tilde{\alpha}_{t-1} = B_{t-1,T} \left[ \frac{2}{N-1} \sum_{\ell=1}^{T-t+1} \left( \frac{N-2}{N-1} \right)^{\!\ell-1} \tilde{\alpha}_{t+\ell-1} + \left( \frac{N-2}{N-1} \right)^{\!T-t+1} \right] = B_{t-1,T} \left[ \frac{2}{N-1} \tilde{\alpha}_t + \frac{2}{N-1} \sum_{k=1}^{T-t} \left( \frac{N-2}{N-1} \right)^{\!k} \tilde{\alpha}_{t+k} + \left( \frac{N-2}{N-1} \right)^{\!T-t+1} \right] = \frac{2 B_{t-1,T}}{N-1}\, \tilde{\alpha}_t + B_{t-1,T}\, \frac{N-2}{N-1}\, \frac{\tilde{\alpha}_t}{B_{t,T}} = \frac{B_{t-1,T}}{B_{t,T}} \left( \frac{2 B_{t,T}}{N-1} + \frac{N-2}{N-1} \right) \tilde{\alpha}_t . \]
Therefore,
\[ \tilde{\alpha}_0 = B_{0,T} \prod_{t=1}^T \frac{2 B_{t,T} + N - 2}{N-1} . \tag{48} \]
Now, since
\[ h_0(x) = \frac{2}{N-1} \sum_{s=1}^T \left( \frac{N-2}{N-1} \right)^{\!s-1} \beta_s\, Q\langle y_{1:s-1}\rangle \mathbb{1}(x) + \left( \frac{N-2}{N-1} \right)^{\!T} Q\langle y_{1:T}\rangle \mathbb{1}(x) , \]
we have
\[ \mu(g(\cdot, y_0) h_0(\cdot)) = \frac{2}{N-1} \sum_{s=1}^T \left( \frac{N-2}{N-1} \right)^{\!s-1} \beta_s\, p_\mu(y_{0:s-1}) + \left( \frac{N-2}{N-1} \right)^{\!T} p_\mu(y_{0:T}) . \]


Plugging (45) into this equation and using that p_μ(y_{0:s−1}) = p_μ(y_{0:T}) / p_μ(y_{s:T} | y_{0:s−1}) yields
\[ \mu(g(\cdot, y_0) h_0(\cdot)) = \frac{2}{N-1} \sum_{s=1}^T \left( \frac{N-2}{N-1} \right)^{\!s-1} \alpha_s\, p_\mu(y_{0:T}) + \left( \frac{N-2}{N-1} \right)^{\!T} p_\mu(y_{0:T}) . \]
Finally, using that α_t ≤ α̃_t for all t ∈ {0, …, T} and then (47),
\[ (N-2)\, \mu(g(\cdot, y_0) h_0(\cdot)) + 2\beta_0 \le p_\mu(y_{0:T}) \left\{ (N-2) \left[ \frac{2}{N-1} \sum_{s=1}^T \left( \frac{N-2}{N-1} \right)^{\!s-1} \tilde{\alpha}_s + \left( \frac{N-2}{N-1} \right)^{\!T} \right] + 2\tilde{\alpha}_0 \right\} \le p_\mu(y_{0:T}) \left\{ (N-2)\, \frac{\tilde{\alpha}_0}{B_{0,T}} + 2\tilde{\alpha}_0 \right\} = p_\mu(y_{0:T})\, (N - 2 + 2 B_{0,T}) \prod_{t=1}^T \frac{2 B_{t,T} + N - 2}{N-1} , \]
where the last equality follows from (48). The proof follows.

7 Proof of Theorem 6

Define
\[ \hat{B}_t^\theta\langle Y_{0:t+\ell}\rangle := \frac{ |w^\theta\langle Y_t\rangle|_\infty\, |Q^\theta\langle Y_{t+1:t+\ell}\rangle \mathbb{1}|_\infty }{ p_\mu^\theta(Y_{t:t+\ell} \mid Y_{0:t-1}) } , \tag{49} \]
\[ \hat{C}_t^\theta\langle Y_{0:t+\ell}\rangle := \frac{ |w^\theta\langle Y_t\rangle|_\infty \int \lambda(dx_{t+1})\, g^\theta(x_{t+1}, Y_{t+1})\, Q^\theta\langle Y_{t+2:t+\ell}\rangle \mathbb{1}(x_{t+1}) }{ p_\mu^\theta(Y_{t:t+\ell} \mid Y_{0:t-1}) } . \tag{50} \]
Note that
\[ \hat{B}_t^\theta\langle Y_{0:t+\ell}\rangle = \tilde{B}_t^\theta\langle Y_{t:t+\ell}\rangle\, \frac{p_{\mu,t}^\theta(Y_{t:t+\ell})}{p_\mu^\theta(Y_{t:t+\ell} \mid Y_{0:t-1})} \quad\text{and}\quad \hat{C}_t^\theta\langle Y_{0:t+\ell}\rangle = \tilde{C}_t^\theta\langle Y_{t:t+\ell}\rangle\, \frac{p_{\mu,t}^\theta(Y_{t:t+\ell})}{p_\mu^\theta(Y_{t:t+\ell} \mid Y_{0:t-1})} , \tag{51} \]
where B̃_t^θ⟨Y_{t:t+ℓ}⟩ and C̃_t^θ⟨Y_{t:t+ℓ}⟩ are defined in (13) and (14), respectively.

Lemma 14. For all θ ∈ Θ, the sequence {Ĉ_t^θ⟨Y_{0:t+ℓ}⟩}_{ℓ≥0} defined in (50) is a (ℙ_μ^θ, {F_{t+ℓ}}_{ℓ≥0})-martingale, where F_t = σ(Y_{0:t}).

Proof. See the online supplementary material.

Lemma 15. For all 0 ≤ γ < 1 and all ℓ ∈ ℕ,
\[ \mathbb{E}_\mu^\theta\!\left[ (\hat{B}_t^\theta\langle Y_{0:t+\ell}\rangle)^\gamma \right] \le \mathbb{E}_\mu^\theta\!\left[ (\tilde{B}_t^\theta\langle Y_{t:t+\ell}\rangle)^\gamma \right] \quad\text{and}\quad \mathbb{E}_\mu^\theta\!\left[ (\hat{C}_t^\theta\langle Y_{0:t+\ell}\rangle)^\gamma \right] \le \mathbb{E}_\mu^\theta\!\left[ (\tilde{C}_t^\theta\langle Y_{t:t+\ell}\rangle)^\gamma \right] . \]


Lemma 16. Assume (A-2) and (A-4). Then,
\[ \sup_{t \ge 0} \sup_{\theta \in \Theta} \mathbb{E}_\mu^\theta\!\left[ \left( \sup_{\ell \ge 0} \hat{B}_t^\theta\langle Y_{0:t+\ell}\rangle \right)^{\!\gamma} \right] < \infty \]
for all 0 ≤ γ < α, where α is defined in (A-4).

Proof. Under (A-2), we obtain from the definitions of B̂_t^θ⟨Y_{0:t+ℓ}⟩ and Ĉ_t^θ⟨Y_{0:t+ℓ}⟩ that
\[ \sup_{\ell \ge 0} \hat{B}_t^\theta\langle Y_{0:t+\ell}\rangle \le \sum_{\ell=0}^{\ell_\star - 1} \hat{B}_t^\theta\langle Y_{0:t+\ell}\rangle + \sup_{\ell \ge \ell_\star} \hat{B}_t^\theta\langle Y_{0:t+\ell}\rangle \le \sum_{\ell=0}^{\ell_\star - 1} \hat{B}_t^\theta\langle Y_{0:t+\ell}\rangle + \sigma^+ \sup_{\ell \ge \ell_\star} \hat{C}_t^\theta\langle Y_{0:t+\ell}\rangle , \]
where σ⁺ and ℓ⋆ are defined in (A-2) and (A-4), respectively. Then, by subadditivity of u ↦ u^γ,
\[ \mathbb{E}_\mu^\theta\!\left[ \left( \sup_{\ell \ge 0} \hat{B}_t^\theta\langle Y_{0:t+\ell}\rangle \right)^{\!\gamma} \right] \le \sum_{\ell=0}^{\ell_\star - 1} \mathbb{E}_\mu^\theta\!\left[ (\hat{B}_t^\theta\langle Y_{0:t+\ell}\rangle)^\gamma \right] + (\sigma^+)^\gamma\, \mathbb{E}_\mu^\theta\!\left[ \sup_{\ell \ge \ell_\star} (\hat{C}_t^\theta\langle Y_{0:t+\ell}\rangle)^\gamma \right] . \]
Applying Lemma 15 and (15), it is thus sufficient to bound 𝔼_μ^θ[sup_{ℓ≥ℓ⋆} (Ĉ_t^θ⟨Y_{0:t+ℓ}⟩)^γ]. By Lemma 14, {Ĉ_t^θ⟨Y_{0:t+ℓ}⟩}_{ℓ≥0} is a {F_{t+ℓ}}_{ℓ≥0}-martingale, and since α ∈ (0, 1), the sequence {(Ĉ_t^θ⟨Y_{0:t+ℓ}⟩)^α}_{ℓ≥0} is a nonnegative {F_{t+ℓ}}_{ℓ≥0}-supermartingale. The Doob maximal inequality then applies: for all a > 0,
\[ a\, \mathbb{P}_\mu^\theta\!\left( \sup_{\ell \ge \ell_\star} (\hat{C}_t^\theta\langle Y_{0:t+\ell}\rangle)^\alpha \ge a \,\middle|\, \mathcal{F}_{t+\ell_\star - 1} \right) \le \mathbb{E}_\mu^\theta\!\left[ (\hat{C}_t^\theta\langle Y_{0:t+\ell_\star}\rangle)^\alpha \,\middle|\, \mathcal{F}_{t+\ell_\star - 1} \right] . \]
Taking expectations on both sides of the previous inequality and setting δ = a^{γ/α}, we obtain
\[ \mathbb{P}_\mu^\theta\!\left( \sup_{\ell \ge \ell_\star} (\hat{C}_t^\theta\langle Y_{0:t+\ell}\rangle)^\gamma \ge \delta \right) \le \delta^{-\alpha/\gamma}\, \mathbb{E}_\mu^\theta\!\left[ (\hat{C}_t^\theta\langle Y_{0:t+\ell_\star}\rangle)^\alpha \right] . \]
Combining this with the inequality 𝔼[U] ≤ 1 + ∫₁^∞ ℙ(U > δ) dδ, which holds for every nonnegative random variable U, we obtain under (A-4)
\[ \mathbb{E}_\mu^\theta\!\left[ \sup_{\ell \ge \ell_\star} (\hat{C}_t^\theta\langle Y_{0:t+\ell}\rangle)^\gamma \right] \le 1 + \int_1^\infty \delta^{-\alpha/\gamma}\,d\delta\; \mathbb{E}_\mu^\theta\!\left[ (\hat{C}_t^\theta\langle Y_{0:t+\ell_\star}\rangle)^\alpha \right] = 1 + \frac{\gamma}{\alpha - \gamma}\, \mathbb{E}_\mu^\theta\!\left[ (\hat{C}_t^\theta\langle Y_{0:t+\ell_\star}\rangle)^\alpha \right] . \]
The proof follows by applying again Lemma 15 under (A-4).

Proof of Theorem 6. For simplicity we use in this proof the notation p̄^θ(Y_{0:t}) := p_{π_θ}^θ(Y_{0:t}) and p̄^θ(Y_{t:s} | Y_{0:t−1}) := p_{π_θ}^θ(Y_{t:s} | Y_{0:t−1}). First note that
\[ \bar{\mathbb{P}}^{\theta_\star}\!\left( \frac{\bar{p}^{\theta_\star}(Y_{0:T})}{p_\mu^{\theta_T}(Y_{0:T})} > \rho \right) = \bar{\mathbb{P}}^{\theta_\star}\!\left\{ \ln \frac{\bar{p}^{\theta_\star}(Y_{0:T})}{p_\mu^{\theta_T}(Y_{0:T})} + \frac{p_\mu^{\theta_T}(Y_{0:T})}{\bar{p}^{\theta_\star}(Y_{0:T})} - 1 > \ln\rho + \frac{p_\mu^{\theta_T}(Y_{0:T})}{\bar{p}^{\theta_\star}(Y_{0:T})} - 1 \right\} \le \bar{\mathbb{P}}^{\theta_\star}\!\left\{ \ln \frac{\bar{p}^{\theta_\star}(Y_{0:T})}{p_\mu^{\theta_T}(Y_{0:T})} + \frac{p_\mu^{\theta_T}(Y_{0:T})}{\bar{p}^{\theta_\star}(Y_{0:T})} - 1 > \ln\rho - 1 \right\} . \]
Now, since ln(u) + u^{−1} − 1 ≥ 0 for all u > 0, we obtain using Markov's inequality, for all ρ > e = exp(1),
\[ \bar{\mathbb{P}}^{\theta_\star}\!\left( \frac{\bar{p}^{\theta_\star}(Y_{0:T})}{p_\mu^{\theta_T}(Y_{0:T})} > \rho \right) \le \frac{1}{\ln\rho - 1}\, \bar{\mathbb{E}}^{\theta_\star}\!\left[ \ln \frac{\bar{p}^{\theta_\star}(Y_{0:T})}{p_\mu^{\theta_T}(Y_{0:T})} + \frac{p_\mu^{\theta_T}(Y_{0:T})}{\bar{p}^{\theta_\star}(Y_{0:T})} - 1 \right] = \frac{1}{\ln\rho - 1}\, \bar{\mathbb{E}}^{\theta_\star}\!\left[ \ln \frac{\bar{p}^{\theta_\star}(Y_{0:T})}{p_\mu^{\theta_T}(Y_{0:T})} \right] . \]


This implies that for all K > 0 and all ρ > e,
\[ \bar{\mathbb{P}}^{\theta_\star}\!\left( \epsilon_{T,N_T}^{-1}(\theta_T) > K \right) \le \mathbb{E}_\mu^{\theta_T}\!\left[ \frac{\bar{p}^{\theta_\star}(Y_{0:T})}{p_\mu^{\theta_T}(Y_{0:T})}\, \mathbb{1}\!\left\{ \frac{\bar{p}^{\theta_\star}(Y_{0:T})}{p_\mu^{\theta_T}(Y_{0:T})} \le \rho \right\} \mathbb{1}\!\left\{ \epsilon_{T,N_T}^{-1}(\theta_T) > K \right\} \right] + \bar{\mathbb{P}}^{\theta_\star}\!\left( \frac{\bar{p}^{\theta_\star}(Y_{0:T})}{p_\mu^{\theta_T}(Y_{0:T})} > \rho \right) \le \rho\, \mathbb{P}_\mu^{\theta_T}\!\left( \epsilon_{T,N_T}^{-1}(\theta_T) > K \right) + \frac{1}{\ln\rho - 1} \sup_{T \ge 0} \bar{\mathbb{E}}^{\theta_\star}\!\left[ \ln \frac{\bar{p}^{\theta_\star}(Y_{0:T})}{p_\mu^{\theta_T}(Y_{0:T})} \right] . \tag{52} \]
We consider first the last term on the right-hand side. Note first that, by the tower property,
\[ \bar{\mathbb{E}}^{\theta_\star}\!\left[ \ln \frac{\bar{p}^{\theta_\star}(X_{0:T} \mid Y_{0:T})}{p_\mu^{\theta_T}(X_{0:T} \mid Y_{0:T})} \right] = \bar{\mathbb{E}}^{\theta_\star}\!\left[ \int \cdots \int \left( \ln \frac{\bar{p}^{\theta_\star}(x_{0:T} \mid Y_{0:T})}{p_\mu^{\theta_T}(x_{0:T} \mid Y_{0:T})} \right) \bar{p}^{\theta_\star}(x_{0:T} \mid Y_{0:T}) \prod_{i=0}^T \lambda(dx_i) \right] \ge 0 , \]
because this quantity is the expectation, under the stationary distribution, of a Kullback–Leibler divergence. This implies that
\[ \bar{\mathbb{E}}^{\theta_\star}\!\left[ \ln \frac{\bar{p}^{\theta_\star}(Y_{0:T})}{p_\mu^{\theta_T}(Y_{0:T})} \right] \le \bar{\mathbb{E}}^{\theta_\star}\!\left[ \ln \frac{\bar{p}^{\theta_\star}(Y_{0:T})}{p_\mu^{\theta_T}(Y_{0:T})} \right] + \bar{\mathbb{E}}^{\theta_\star}\!\left[ \ln \frac{\bar{p}^{\theta_\star}(X_{0:T} \mid Y_{0:T})}{p_\mu^{\theta_T}(X_{0:T} \mid Y_{0:T})} \right] = \bar{\mathbb{E}}^{\theta_\star}\!\left[ \ln \frac{\bar{p}^{\theta_\star}(X_{0:T}, Y_{0:T})}{p_\mu^{\theta_T}(X_{0:T}, Y_{0:T})} \right] = \bar{\mathbb{E}}^{\theta_\star}\!\left[ \ln \frac{\pi^{\theta_\star}(X_0)\, g^{\theta_\star}(X_0, Y_0)}{\mu(X_0)\, g^{\theta_T}(X_0, Y_0)} \right] + T\, \bar{\mathbb{E}}^{\theta_\star}\!\left[ \ln \frac{m^{\theta_\star}(X_0, X_1)\, g^{\theta_\star}(X_1, Y_1)}{m^{\theta_T}(X_0, X_1)\, g^{\theta_T}(X_1, Y_1)} \right] . \]
Hence, we obtain, under (17) and (18),
\[ \sup_{T \ge 0} \bar{\mathbb{E}}^{\theta_\star}\!\left[ \ln \frac{\bar{p}^{\theta_\star}(Y_{0:T})}{p_\mu^{\theta_T}(Y_{0:T})} \right] < \infty . \tag{53} \]

Assume first that
\[ \limsup_{K \to \infty} \sup_{T \ge 0} \mathbb{P}_\mu^{\theta_T}\!\left( \epsilon_{T,N_T}^{-1}(\theta_T) > K \right) = 0 . \tag{54} \]
The tightness of \(\{\epsilon_{T,N_T}^{-1}(\theta_T)\}_{T \ge 0}\) then follows by plugging (53) into (52), noting that (52) holds for all ρ > e, and combining with (54). To complete the proof, it thus remains to show (54). Rewriting the definition (12), we obtain
\[ \epsilon_{T,N_T}^{-1}(\theta_T) \le \prod_{t=0}^T \frac{2 B_t^{\theta_T} + N_T - 2}{N_T - 1} = \exp\!\left\{ \sum_{t=0}^T \ln\!\left( \frac{2 B_t^{\theta_T} + N_T - 2}{N_T - 1} \right) \right\} \le \exp\!\left\{ \sum_{t=0}^T \frac{2 B_t^{\theta_T} - 1}{N_T - 1} \right\} , \]
where B_t^θ := sup_{ℓ≥0} B̂_t^θ⟨Y_{0:t+ℓ}⟩. Using Markov's inequality, this implies that, for K > 1,
\[ \mathbb{P}_\mu^{\theta_T}\!\left( \epsilon_{T,N_T}^{-1}(\theta_T) > K \right) \le \mathbb{P}_\mu^{\theta_T}\!\left( \sum_{t=0}^T \frac{2 B_t^{\theta_T} - 1}{N_T - 1} > \ln K \right) \le \frac{1}{(\ln K)^\gamma}\, \mathbb{E}_\mu^{\theta_T}\!\left[ \left( \sum_{t=0}^T \frac{2 B_t^{\theta_T} - 1}{N_T - 1} \right)^{\!\gamma} \right] . \]
The proof of (54) follows by noting that N_T ∼ T^{1/γ} and by using
\[ \mathbb{E}_\mu^{\theta_T}\!\left[ \left( \frac{\sum_{t=0}^T 2 B_t^{\theta_T} - 1}{T^{1/\gamma}} \right)^{\!\gamma} \right] \le \mathbb{E}_\mu^{\theta_T}\!\left[ \frac{\sum_{t=0}^T (2 B_t^{\theta_T})^\gamma}{T} \right] \le 2^\gamma \sup_{t \ge 0} \sup_{\theta \in \Theta} \mathbb{E}_\mu^\theta\!\left[ (B_t^\theta)^\gamma \right] < \infty . \]


Acknowledgement

This work was supported by the project Learning of complex dynamical systems (Contract number: 637-2014-466) funded by the Swedish Research Council.

Supporting Information

Additional information for this article is available online, including:

Appendix S1 An algorithmic statement of the particle Gibbs sampler.

Appendix S2 An example of how the conditions of Theorem 6 can be verified.

Appendix S3 Proofs of Proposition 4, Proposition 5, Lemma 9, Lemma 14, and Lemma 15.

References

Andrieu, C., A. Doucet, and R. Holenstein (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B 72 (3), 269–342.

Andrieu, C., A. Lee, and M. Vihola (2013, December). Uniform ergodicity of the iterated conditional SMC and geometric ergodicity of particle Gibbs samplers. Preprint, arXiv:1312.6432.

Andrieu, C. and G. O. Roberts (2009). The pseudo-marginal approach for efficient Monte Carlo com-putations. The Annals of Statistics 37 (2), 697–725.

Andrieu, C. and M. Vihola (2012, October). Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. arXiv.org, arXiv:1210.1484.

Beaumont, M. A. (2003). Estimation of population growth or decline in genetically monitored popula-tions. Genetics 164 (3), 1139–1160.

Cappé, O., E. Moulines, and T. Rydén (2005). Inference in Hidden Markov Models. Springer.

Chopin, N. and S. S. Singh (2014). On particle Gibbs sampling. Bernoulli. Forthcoming.

Del Moral, P. (2004). Feynman-Kac Formulae - Genealogical and Interacting Particle Systems with Applications. Probability and its Applications. Springer.

Del Moral, P. and A. Guionnet (1999). Central limit theorem for nonlinear filtering and interacting particle systems. Annals of Applied Probability 9 (2), 275–297.

Doucet, A., S. J. Godsill, and C. Andrieu (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10 (3), 197–208.


Doucet, A. and A. Johansen (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan and B. Rozovskii (Eds.), The Oxford Handbook of Nonlinear Filtering. Oxford University Press.

Doucet, A., M. K. Pitt, and R. Kohn (2012, October). Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Preprint, arXiv:1210.1871.

Fearnhead, P., D. Wyncoll, and J. Tawn (2010). A sequential smoothing algorithm with linear compu-tational cost. Biometrika 97 (2), 447–464.

Godsill, S. J., A. Doucet, and M. West (2004, March). Monte Carlo smoothing for nonlinear time series. Journal of the American Statistical Association 99 (465), 156–168.

Golightly, A. and D. J. Wilkinson (2011). Bayesian parameter inference for stochastic biochemical network models using particle Markov chain Monte Carlo. Interface Focus 1 (6), 807–820.

Lee, A. and K. Latuszynski (2012, October). Variance bounding and geometric ergodicity of Markov chain Monte Carlo kernels for approximate Bayesian computation. Preprint, arXiv:1210.6703.

Lindsten, F., M. I. Jordan, and T. B. Schön (2014). Particle Gibbs with ancestor sampling. Journal of Machine Learning Research 15, 2145–2184.

Lindsten, F. and T. B. Schön (2013). Backward simulation methods for Monte Carlo statistical inference. Foundations and Trends in Machine Learning 6 (1), 1–143.

Meyn, S. and R. L. Tweedie (2009). Markov Chains and Stochastic Stability (2nd ed.). Cambridge University Press.

Pitt, M. K., R. S. Silva, P. Giordani, and R. Kohn (2012). On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics 171, 134–151.

Rasmussen, D. A., O. Ratmann, and K. Koelle (2011). Inference for nonlinear epidemiological models using genealogies and time series. PLoS Comput Biology 7 (8), e1002136.

Shephard, N. and T. Andersen (2009). Stochastic volatility: Origins and overview. In T. Andersen, R. Davis, J.-P. Kreiss, and T. Mikosch (Eds.), Handbook of Financial Time Series. Berlin: Springer.

Taylor, S. J. (1982). Financial returns modelled by the product of two stochastic processes – A study of daily sugar prices, 1961-79. In O. D. Anderson (Ed.), Time Series Analysis: Theory and Practice, Volume 1, pp. 203–226. Elsevier/North-Holland, Amsterdam.

Vrugt, J. A., J. F. ter Braak, C. G. H. Diks, and G. Schoups (2013). Hydrologic data assimilation using particle Markov chain Monte Carlo simulation: Theory, concepts and applications. Advances in Water Resources 51, 457–478.


Whiteley, N., C. Andrieu, and A. Doucet (2010). Efficient Bayesian inference for switching state-space models using discrete particle Markov chain Monte Carlo methods. Technical report, Bristol Statistics Research Report 10:04.

Fredrik Lindsten, Department of Engineering, University of Cambridge, Trumpington Street, Cam-bridge, CB2 1PZ, UK.


Uniform ergodicity of the Particle Gibbs sampler

-Supplementary material

Fredrik Lindsten
Department of Engineering, University of Cambridge, Cambridge, UK, and
Division of Automatic Control, Linköping University, Linköping, Sweden

Randal Douc
Department CITI, Institut Mines-Telecom/CNRS UMR 5157, Telecom Sudparis, Evry, France

Eric Moulines
Department LTCI, Institut Mines-Telecom/CNRS UMR 5141, Telecom Paristech, Paris, France

September 9, 2014

Section numbers, equation numbers, etc., in this supplementary material are prefixed with “S”. References without this prefix refer to the main document.

S1 Particle Gibbs algorithm

The Particle Gibbs (PG) sampling procedure, described in Section 3 of the main document, is summarized in Algorithm S1 below.

Algorithm S1 PG sampler

Input: Observations y_{0:T} and reference trajectory x′_{0:T}.
Output: Draw from the PG Markov kernel X⋆_{0:T} ∼ P_{T,N}(x′_{0:T}, ·).

1: Draw X_0^i ∼ r_0⟨y_0⟩(·) for i = 1, …, N − 1 and set X_0^N = x′_0.
2: Set ω_0^i = w_0⟨y_0⟩(X_0^i) for i = 1, …, N.
3: for t = 1 to T do
4:   Draw A_t^i w.p. P(A_t^i = j | F_{t−1}^N) = ω_{t−1}^j / Σ_{ℓ=1}^N ω_{t−1}^ℓ, j ∈ {1, …, N}, for i = 1, …, N − 1.
5:   Draw X_t^i ∼ R⟨y_t⟩(X_{t−1}^{A_t^i}, ·) for i = 1, …, N − 1.
6:   Set X_t^N = x′_t and A_t^N = N.
7:   Set X_{1:t}^i = (X_{1:t−1}^{A_t^i}, X_t^i) for i = 1, …, N.
8:   Set ω_t^i = w⟨y_t⟩(X_{t−1}^{A_t^i}, X_t^i) for i = 1, …, N.
9: end for
10: Draw J w.p. P(J = j | F_T^N) = ω_T^j / Σ_{ℓ=1}^N ω_T^ℓ, j ∈ {1, …, N}.
11: return X⋆_{0:T} := X_{0:T}^J.


S2 Example – A nonlinear model with additive measurement noise

In this section we study a second example (in addition to the one presented in Section 5 of the main document) and show how the conditions of Theorem 6 can be verified. We consider a class of nonlinear state space models where the latent process is observed in additive noise,

\[ X_{t+1} = h_\xi(X_t) + \sigma_W W_{t+1} , \tag{S1} \]
\[ Y_t = \phi X_t + \sigma_U U_t , \tag{S2} \]
where {W_t, t ∈ ℕ} and {U_t, t ∈ ℕ} are two independent sequences of i.i.d. standard Gaussian random variables and {h_ξ, ξ ∈ Ξ} is a parametric family of measurable real-valued functions, where Ξ is a compact subset of a Euclidean space. We denote by θ = (ξ, φ, σ_U, σ_W) the parameters of the model. It is assumed that θ ∈ Θ, where Θ is a compact subset of Ξ × (0, ∞)³. We assume that, for all ξ ∈ Ξ, x ↦ h_ξ(x) is continuous and sup_{ξ∈Ξ} lim sup_{x→∞} |h_ξ(x)|/|x| < 1. For any δ > 0, we set V_δ(x) = e^{δ|x|}. It is easily seen that there exist constants λ_δ ∈ (0, 1) and b_δ < ∞ such that
\[ \sup_{\theta \in \Theta} \mathbb{E}_x^\theta[V_\delta(X_1)] \le \lambda_\delta V_\delta(x) + b_\delta . \tag{S3} \]
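A Monte Carlo sanity check of a drift inequality of the form (S3) for one concrete instance; the function h, the noise level, δ, and the candidate constants λ, b below are all illustrative choices of ours, not quantities derived in the paper.

```python
# Monte Carlo check of a Foster-Lyapunov drift bound E_x[V(X_1)] <= lam*V(x)+b
# for the model X_1 = h(x) + sigma_W * W with V(x) = exp(delta*|x|).
import numpy as np

rng = np.random.default_rng(7)
sigma_W, delta = 0.5, 0.2
h = lambda x: 0.8 * np.tanh(x)   # an h_xi with lim sup |h(x)|/|x| < 1

def EV1(x, n=200_000):
    """Monte Carlo estimate of E_x[V_delta(X_1)]."""
    x1 = h(x) + sigma_W * rng.normal(size=n)
    return np.exp(delta * np.abs(x1)).mean()

lam, b = 0.9, 2.0                # candidate drift constants (assumed)
xs = np.linspace(-10, 10, 21)
ok = all(EV1(x) <= lam * np.exp(delta * np.abs(x)) + b for x in xs)
```

Because |h| is bounded here, 𝔼_x[V_δ(X_1)] stays bounded while λ V_δ(x) grows in |x|, so the inequality holds over the whole grid.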

The Markov chain is strong Feller and Harris recurrent, all compact sets are small, and the chain admits a single invariant distribution. Therefore, (A-1) and (A-2) are satisfied. Since both the transition density and the observation density are Gaussian, (A-3) is also readily satisfied. We will thus focus on verifying the moment assumption (A-4).

First, note that
\[ \sup_{\theta \in \Theta} \mathbb{E}_x^\theta[V_\delta(X_t)] \le \lambda_\delta^t V_\delta(x) + b_\delta(1 + \lambda_\delta + \cdots + \lambda_\delta^{t-1}) \le V_\delta(x) + b_\delta/(1 - \lambda_\delta) . \tag{S4} \]
We assume that the initial distribution μ is such that μ(V_δ) < ∞. Therefore,
\[ \sup_{t \in \mathbb{N}} \sup_{\theta \in \Theta} \mathbb{E}_\mu^\theta[V_\delta(X_t)] < \infty . \tag{S5} \]

Interestingly, for the model (S1)–(S2) it is possible to use the fully adapted proposal kernel (Doucet et al., 2000) as defined in Equation (7) of the main document, for which
\[ w^\theta\langle y\rangle(x, x') = \int m^\theta(x, x'')\, g^\theta(x'', y)\,dx'' = \frac{1}{\sqrt{2\pi(\phi^2\sigma_W^2 + \sigma_U^2)}} \exp\!\left( -\frac{\left( y - \phi h_\xi(x) \right)^2}{2(\phi^2\sigma_W^2 + \sigma_U^2)} \right) , \tag{S6} \]
for all (x, x′) ∈ ℝ × ℝ, y ∈ ℝ, and θ ∈ Θ. It can be seen that, for any θ ∈ Θ and any y ∈ ℝ,
\[ \int_{-\infty}^{\infty} g^\theta(x, y)\,dx = \frac{1}{\phi} \quad\text{and}\quad |w^\theta\langle y\rangle|_\infty \le \frac{1}{\sqrt{2\pi(\phi^2\sigma_W^2 + \sigma_U^2)}} , \]
which implies the existence of constants D_1 and D_2 such that
\[ \sup_{\theta \in \Theta} \int_{-\infty}^{\infty} g^\theta(x, y)\,dx \le D_1 \quad\text{and}\quad \sup_{\theta \in \Theta} |w^\theta\langle y\rangle|_\infty \le D_2 . \tag{S7} \]
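The closed form (S6) is just the Gaussian convolution ∫ m^θ(x, x″) g^θ(x″, y) dx″, and this can be verified by numeric quadrature; the values of x, y, the parameters, and the choice of h_ξ below are arbitrary illustrations of ours.

```python
# Quadrature check of the fully adapted weight (S6): the convolution of
# N(x''; h(x), sW^2) with N(y; phi*x'', sU^2) equals N(y; phi*h(x), phi^2*sW^2+sU^2).
import numpy as np

phi, sW, sU = 0.8, 0.4, 0.6
h = np.tanh                     # an example h_xi satisfying the assumptions
x, y = 1.3, -0.5

grid = np.linspace(-10, 10, 20_001)          # quadrature grid for x''
dx = grid[1] - grid[0]
m = np.exp(-(grid - h(x))**2 / (2 * sW**2)) / np.sqrt(2 * np.pi * sW**2)
g = np.exp(-(y - phi * grid)**2 / (2 * sU**2)) / np.sqrt(2 * np.pi * sU**2)
quad = np.sum(m * g) * dx                    # Riemann sum for the integral

v = phi**2 * sW**2 + sU**2
closed = np.exp(-(y - phi * h(x))**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
```

The quadrature and the closed form agree to high accuracy, confirming the standard Gaussian convolution identity behind (S6).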

Analogous bounds hold also if we instead consider the bootstrap proposal. To verify (A-4) we let ℓ⋆ = 1 and show that
\[ \sup_{t \in \mathbb{N}} \sup_{\theta \in \Theta} \mathbb{E}_\mu^\theta\!\left[ (\tilde{B}_t^\theta\langle Y_t\rangle)^\alpha \right] < \infty , \qquad \sup_{t \in \mathbb{N}} \sup_{\theta \in \Theta} \mathbb{E}_\mu^\theta\!\left[ (\tilde{C}_t^\theta\langle Y_{t:t+1}\rangle)^\alpha \right] < \infty , \tag{S8} \]


for some (and in fact any) α ∈ [0, 1). Consider first
\[ \mathbb{E}_\mu^\theta\!\left[ (\tilde{B}_t^\theta\langle Y_t\rangle)^\alpha \right] = \int |w^\theta\langle y_t\rangle|_\infty^\alpha \{p_{\mu,t}^\theta(y_t)\}^{1-\alpha}\,dy_t \le D_2^\alpha \int \{p_{\mu,t}^\theta(y_t)\}^{1-\alpha}\,dy_t , \tag{S9} \]
where the inequality follows from (S7). We apply Lemma 9 to establish a bound for the right-hand side of (S9). Let ψ(y) = 1 and ϕ(y) = 1/(1 ∨ |y|²). With these definitions, the first condition, (19), is satisfied. To check (20), note that
\[ \varphi^{-\alpha/(1-\alpha)}(y) = (1 \vee |y|^2)^{\alpha/(1-\alpha)} \le 1 + |y|^{2\alpha/(1-\alpha)} . \]
The integral in (20) may be expressed as
\[ \int \varphi^{-\alpha/(1-\alpha)}(y_t)\, p_{\mu,t}^\theta(y_t)\,dy_t = \mathbb{E}_\mu^\theta\!\left[ \varphi^{-\alpha/(1-\alpha)}(Y_t) \right] = \mathbb{E}_\mu^\theta\!\left[ \mathbb{E}_{X_t}^\theta\!\left[ \varphi^{-\alpha/(1-\alpha)}(Y_0) \right] \right] . \tag{S10} \]
Since Y_0 = φX_0 + σ_U U_0, we get that, for any x ∈ X, 𝔼_x^θ[φ^{−α/(1−α)}(Y_0)] ≤ 1 + 𝔼[|φx + σ_U U|^{2α/(1−α)}], where U is standard normal. This implies that there exists a constant D_3 such that, for all x ∈ X and all θ ∈ Θ,
\[ \mathbb{E}_x^\theta\!\left[ \varphi^{-\alpha/(1-\alpha)}(Y_0) \right] \le D_3 \left( 1 + |x|^{2\alpha/(1-\alpha)} \right) . \tag{S11} \]
Plugging this into (S10) and using (S5) verifies the second condition, (20). Lemma 9 can thus be used to conclude that 𝔼_μ^θ[(B̃_t^θ⟨Y_t⟩)^α] < ∞ for all α ∈ (0, 1). Since this holds for any t ∈ ℕ and θ ∈ Θ, we obtain the first part of (S8).

Next, we consider
\[ \mathbb{E}_\mu^\theta\!\left[ (\tilde{C}_t^\theta\langle Y_{t:t+1}\rangle)^\alpha \right] = \mathbb{E}_\mu^\theta\!\left[ \frac{ |w^\theta\langle Y_t\rangle|_\infty^\alpha \left( \int g^\theta(x_{t+1}, Y_{t+1})\,dx_{t+1} \right)^{\!\alpha} }{ \{p_{\mu,t}^\theta(Y_{t:t+1})\}^\alpha } \right] \le D_1^\alpha D_2^\alpha \iint \{p_{\mu,t}^\theta(y_{t:t+1})\}^{1-\alpha}\,dy_{t:t+1} . \]
We will again make use of Lemma 9 to bound this quantity. Proceeding analogously to the above, we let ψ(y_0, y_1) = 1 and
\[ \varphi(y_0, y_1) = \frac{1}{(y_0^2 \vee 1)(y_1^2 \vee 1)} , \tag{S12} \]
for which (19) is satisfied. To check (20), we use the conditional independence of the observations given the states and (S10) to get, for any θ ∈ Θ,
\[ \iint \varphi^{-\alpha/(1-\alpha)}(y_{t:t+1})\, p_{\mu,t}^\theta(y_{t:t+1})\,dy_{t:t+1} = \mathbb{E}_\mu^\theta\!\left[ \mathbb{E}_\mu^\theta\!\left[ \varphi^{-\alpha/(1-\alpha)}(Y_{t:t+1}) \,\middle|\, X_{t:t+1} \right] \right] \le \mathbb{E}_\mu^\theta\!\left[ \mathbb{E}_{X_t}^\theta\!\left[ 1 + |Y_0|^{2\alpha/(1-\alpha)} \right] \mathbb{E}_{X_{t+1}}^\theta\!\left[ 1 + |Y_0|^{2\alpha/(1-\alpha)} \right] \right] \le D_3^2\, \mathbb{E}_\mu^\theta\!\left[ \left( 1 + |X_t|^{2\alpha/(1-\alpha)} \right)\left( 1 + |X_{t+1}|^{2\alpha/(1-\alpha)} \right) \right] . \]
From the Cauchy–Schwarz inequality we get, using (S5),
\[ \sup_{t \in \mathbb{N}} \sup_{\theta \in \Theta} \mathbb{E}_\mu^\theta\!\left[ \varphi^{-\alpha/(1-\alpha)}(Y_{t:t+1}) \right] \le D_3^2 \sup_{t \in \mathbb{N}} \sup_{\theta \in \Theta} \mathbb{E}_\mu^\theta\!\left[ \left( 1 + |X_t|^{2\alpha/(1-\alpha)} \right)^{\!2} \right] < \infty . \]
This shows that (20) is satisfied for any θ ∈ Θ and any t ∈ ℕ, which, by Lemma 9, implies sup_{t∈ℕ} sup_{θ∈Θ} 𝔼_μ^θ[(C̃_t^θ⟨Y_{t:t+1}⟩)^α] < ∞ for all α ∈ [0, 1), verifying (A-4). Provided that θ_T converges to θ⋆ at a rate 1/√T, we may therefore apply Theorem 6, which shows that for any γ ∈ (0, 1), the sequence \(\{\epsilon_{T,N_T}^{-1}(\theta_T)\}_{T \ge 1}\) is tight with N_T ∼ T^{1/γ}.


S3 Proofs

Proof of Proposition 4. First, note that, for all ℓ ≥ 1,
\[ |Q\langle y_{t+1:t+\ell}\rangle \mathbb{1}|_\infty \le \sigma^+ \int \gamma(dx_{t+1})\, g(x_{t+1}, y_{t+1})\, Q\langle y_{t+2:t+\ell}\rangle \mathbb{1}(x_{t+1}) \tag{S13} \]
and
\[ p_\mu(y_{t:t+\ell} \mid y_{0:t-1}) \ge (\sigma^-)^2 \int \gamma(dx_t)\, g(x_t, y_t) \int \gamma(dx_{t+1})\, g(x_{t+1}, y_{t+1})\, Q\langle y_{t+2:t+\ell}\rangle \mathbb{1}(x_{t+1}) . \tag{S14} \]
Now, in the fully adapted case, we have
\[ R\langle y\rangle(x, dx') = \frac{M(x, dx')\, g(x', y)}{\int M(x, du)\, g(u, y)} , \]
so that, by the definition of w⟨y⟩,
\[ |w\langle y_t\rangle|_\infty = \sup_{x \in X} \int M(x, dx_t)\, g(x_t, y_t) \le \sigma^+ \int \gamma(dx_t)\, g(x_t, y_t) . \]
Combining this inequality with (S13) and (S14) yields
\[ B_{t,T} = \sup_{0 \le \ell \le T-t} \frac{|w\langle y_t\rangle|_\infty\, |Q\langle y_{t+1:t+\ell}\rangle \mathbb{1}|_\infty}{p_\mu(y_{t:t+\ell} \mid y_{0:t-1})} \le \left( \frac{\sigma^+}{\sigma^-} \right)^{\!2} . \]
By the definition of ε_{T,N} (see Equation (12) of the main document) we then obtain
\[ \epsilon_{T,N} = \prod_{t=0}^T \frac{N-1}{2 B_{t,T} + N - 2} \ge \left( \frac{N-1}{N - 2 + 2(\sigma^+/\sigma^-)^2} \right)^{\!T+1} . \]
Finally, letting N_T ∼ λT, we obtain the desired result.

Proof of Proposition 5. For the bootstrap filter, w⟨y⟩(x, x′) = g(x′, y). Therefore, |w⟨y⟩|_∞ = sup_{x∈X} g(x, y). On the other hand, for ℓ ≥ m,
\[ |Q\langle y_{t+1:t+\ell}\rangle \mathbb{1}|_\infty \le \sigma^+ \left( \prod_{s=t+1}^{t+m-1} \sup_{x \in X} g(x, y_s) \right) \int \gamma(dx_{t+m})\, g(x_{t+m}, y_{t+m})\, Q\langle y_{t+m+1:t+\ell}\rangle \mathbb{1}(x_{t+m}) \tag{S15} \]
and
\[ p_\mu(y_{t:t+\ell} \mid y_{0:t-1}) \ge \sigma^- \left( \prod_{s=t}^{t+m-1} \inf_{x \in X} g(x, y_s) \right) \int \gamma(dx_{t+m})\, g(x_{t+m}, y_{t+m})\, Q\langle y_{t+m+1:t+\ell}\rangle \mathbb{1}(x_{t+m}) . \tag{S16} \]
Combining (S15) and (S16) yields
\[ B_{t,T} \le \delta^m\, \frac{\sigma^+}{\sigma^-} . \tag{S17} \]


Proof of Lemma 9. The result follows from Hölder's inequality:
\[ \int \psi^\alpha(z)\, q^{1-\alpha}(z)\,\xi(dz) = \int \left[ \psi(z)\varphi(z) \right]^\alpha \left[ \varphi^{-\alpha/(1-\alpha)}(z)\, q(z) \right]^{1-\alpha} \xi(dz) \le \left( \int \psi(z)\varphi(z)\,\xi(dz) \right)^{\!\alpha} \left( \int \varphi^{-\alpha/(1-\alpha)}(z)\, q(z)\,\xi(dz) \right)^{\!1-\alpha} . \]

Proof of Lemma 14. For all ℓ ≥ 0,
\[ \mathbb{E}_\mu^\theta\!\left[ \hat{C}_t^\theta\langle Y_{0:t+\ell+1}\rangle \,\middle|\, \mathcal{F}_{t+\ell} \right] = |w^\theta\langle Y_t\rangle|_\infty \int p_\mu^\theta(y_{t+\ell+1} \mid Y_{0:t+\ell})\, \frac{ \int \lambda(dx_{t+1})\, g^\theta(x_{t+1}, Y_{t+1})\, Q^\theta\langle Y_{t+2:t+\ell}\rangle(x_{t+1}, dx_{t+\ell})\, Q^\theta\langle y_{t+\ell+1}\rangle \mathbb{1}(x_{t+\ell}) }{ p_\mu^\theta(Y_{t:t+\ell}, y_{t+\ell+1} \mid Y_{0:t-1}) }\, \kappa(dy_{t+\ell+1}) . \]
Combining this identity with
\[ p_\mu^\theta(Y_{t:t+\ell}, y_{t+\ell+1} \mid Y_{0:t-1}) = p_\mu^\theta(y_{t+\ell+1} \mid Y_{0:t+\ell})\, p_\mu^\theta(Y_{t:t+\ell} \mid Y_{0:t-1}) \]
and \(\int Q^\theta\langle y_{t+\ell+1}\rangle \mathbb{1}(x_{t+\ell})\, \kappa(dy_{t+\ell+1}) = M^\theta(x_{t+\ell}, X) = 1\), we obtain
\[ \mathbb{E}_\mu^\theta\!\left[ \hat{C}_t^\theta\langle Y_{0:t+\ell+1}\rangle \,\middle|\, \mathcal{F}_{t+\ell} \right] = \frac{ |w^\theta\langle Y_t\rangle|_\infty \int \lambda(dx_{t+1})\, g^\theta(x_{t+1}, Y_{t+1})\, Q^\theta\langle Y_{t+2:t+\ell}\rangle \mathbb{1}(x_{t+1}) }{ p_\mu^\theta(Y_{t:t+\ell} \mid Y_{0:t-1}) } = \hat{C}_t^\theta\langle Y_{0:t+\ell}\rangle , \]
which completes the proof.

Proof of Lemma 15. Using the expressions in Equation (51) of the main document, the proof of Lemma 15 follows from the inequality
\[ \mathbb{E}_\mu^\theta\!\left[ \left( \frac{\psi(Y_{t:t+\ell})}{p_\mu^\theta(Y_{t:t+\ell} \mid Y_{0:t-1})} \right)^{\!\gamma} \right] \le \mathbb{E}_\mu^\theta\!\left[ \left( \frac{\psi(Y_{t:t+\ell})}{p_{\mu,t}^\theta(Y_{t:t+\ell})} \right)^{\!\gamma} \right] , \tag{S18} \]
which holds for any nonnegative measurable function ψ : Y^{ℓ+1} → ℝ₊. We now show (S18). Note first that, by applying the tower property of the conditional expectation and then the Tonelli–Fubini theorem, we get
\[ \mathbb{E}_\mu^\theta\!\left[ \left( \frac{\psi(Y_{t:t+\ell})}{p_\mu^\theta(Y_{t:t+\ell} \mid Y_{0:t-1})} \right)^{\!\gamma} \right] = \mathbb{E}_\mu^\theta\!\left[ \mathbb{E}_\mu^\theta\!\left[ \left( \frac{\psi(Y_{t:t+\ell})}{p_\mu^\theta(Y_{t:t+\ell} \mid Y_{0:t-1})} \right)^{\!\gamma} \,\middle|\, Y_{0:t-1} \right] \right] = \mathbb{E}_\mu^\theta\!\left[ \int \psi^\gamma(y_{t:t+\ell}) \left\{ p_\mu^\theta(y_{t:t+\ell} \mid Y_{0:t-1}) \right\}^{1-\gamma} \kappa^{\otimes(\ell+1)}(dy_{t:t+\ell}) \right] = \int \psi^\gamma(y_{t:t+\ell})\, \mathbb{E}_\mu^\theta\!\left[ \left\{ p_\mu^\theta(y_{t:t+\ell} \mid Y_{0:t-1}) \right\}^{1-\gamma} \right] \kappa^{\otimes(\ell+1)}(dy_{t:t+\ell}) . \tag{S19} \]
By Jensen's inequality, \(\mathbb{E}_\mu^\theta[\{p_\mu^\theta(y_{t:t+\ell} \mid Y_{0:t-1})\}^{1-\gamma}] \le \{\mathbb{E}_\mu^\theta[p_\mu^\theta(y_{t:t+\ell} \mid Y_{0:t-1})]\}^{1-\gamma}\). On the other hand,
\[ \mathbb{E}_\mu^\theta\!\left[ p_\mu^\theta(y_{t:t+\ell} \mid Y_{0:t-1}) \right] = \int p_\mu^\theta(y_{t:t+\ell} \mid y_{0:t-1})\, p_\mu^\theta(y_{0:t-1})\, \kappa^{\otimes t}(dy_{0:t-1}) = p_{\mu,t}^\theta(y_{t:t+\ell}) . \tag{S20} \]
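The Jensen step above, that 𝔼[Z^{1−γ}] ≤ (𝔼[Z])^{1−γ} for a nonnegative random variable Z and 0 ≤ γ < 1, is easy to illustrate numerically; the distribution of Z below is an arbitrary example of ours.

```python
# Check of the Jensen step in the proof of Lemma 15: u -> u^{1-gamma} is
# concave for 0 <= gamma < 1, so E[Z^{1-gamma}] <= (E[Z])^{1-gamma}.
import numpy as np

rng = np.random.default_rng(6)
gamma = 0.4
Z = rng.gamma(3.0, 2.0, size=500_000)   # nonnegative random variable

lhs = np.mean(Z**(1 - gamma))
rhs = np.mean(Z)**(1 - gamma)
```

With this Z the gap between the two sides is well beyond Monte Carlo noise.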


Author address

Fredrik Lindsten

Department of Engineering, University of Cambridge Trumpington Street, Cambridge, CB2 1PZ, UK E-mail: fredrik.lindsten@eng.cam.ac.uk

References

Doucet, A., S. J. Godsill, and C. Andrieu (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10 (3), 197–208.
