
Linköping University Electronic Press

Report

Convergence in distribution for filtering processes associated to Hidden Markov Models with densities

Thomas Kaijser

Series: LiTH-MAT-R, ISSN 0348-2960, No. 2013:05

ISRN: LiTH-MAT-R–2013/05–SE

Available at: Linköping University Electronic Press
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-92590

14 May 2013

Department of Mathematics
Linköping University

Thomas Kaijser
Department of Mathematics, Linköping University, S-581 83 Linköping, Sweden; thkai@mai.liu.se

Abstract

A Hidden Markov Model generates two basic stochastic processes, a Markov chain, which is hidden, and an observation sequence. The filtering process of a Hidden Markov Model is, roughly speaking, the sequence of conditional distributions of the hidden Markov chain that is obtained as new observations are received.

It is well-known that the filtering process itself is also a Markov chain. A classical theoretical problem is to find conditions which imply that the distributions of the filtering process converge towards a unique limit measure.

This problem goes back to a paper of D Blackwell for the case when the Markov chain takes its values in a finite set, and to a paper of H Kunita for the case when the state space of the Markov chain is a compact Hausdorff space.

Recently, due to work by F Kochman, J Reeds, P Chigansky and R van Handel, a necessary and sufficient condition for the convergence of the distributions of the filtering process has been found for the case when the state space is finite. This condition has since been generalised to the case when the state space is denumerable.

In this paper we generalise some of the previous results on convergence in distribution to the case when the Markov chain and the observation sequence of a Hidden Markov Model take their values in complete, separable, metric spaces; it has though been necessary to assume that both the transition probability function of the Markov chain and the transition probability function that generates the observation sequence have densities.

Keywords: Hidden Markov Models, filtering processes, Markov chains on nonlocally compact spaces, convergence in distribution, barycenter, ergodicity, Kantorovich metric.

Mathematics Subject Classification (2000): Primary 60J05; Secondary 60F05.

1 Introduction

Let {Xn, n = 0, 1, 2, ...} be an aperiodic, irreducible Markov chain with finite state space S, transition probability matrix (tr.pr.m) P and initial distribution p0. Let g : S → A be a mapping from S to another space A and, for n = 1, 2, ..., define Yn = g(Xn). (The function g is sometimes called a lumping function.) For each s ∈ S and every integer n ≥ 1 define

Zn,s = Pr[Xn = s | Y1, Y2, ..., Yn]

and set Zn = (Zn,s, s ∈ S). Clearly Zn is a probability vector on the finite set S. It is also random. Hence {Zn, n = 1, 2, ...} is a sequence of random probability vectors. It is well-known that the sequence {Zn, n = 1, 2, ...} of conditional distributions is also a Markov chain. Let P denote the transition probability function (tr.pr.f) for this chain. A rather natural question to ask is under which conditions the Markov chain generated by P is ergodic in the sense that it is an aperiodic Markov chain such that its distributions tend to a unique limit distribution, which is independent of the initial distribution p0.
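Although the paper works in full generality, the finite set-up just described is easy to compute with. The following sketch (with an invented three-state chain and lumping function; none of this data is from the paper) performs one step of the filter recursion: propagate the current conditional distribution through P, keep only the mass on {s : g(s) = a} when the observation a arrives, and normalise.

```python
# Toy illustration (invented data): the filter recursion Z_n for a finite
# chain with a deterministic lumping function g.

P = [[0.7, 0.2, 0.1],          # tr.pr.m of the hidden chain on S = {0, 1, 2}
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]
g = [0, 0, 1]                  # lumping function g : S -> A = {0, 1}
p0 = [1.0, 0.0, 0.0]           # initial distribution

def filter_step(z, a):
    """One filter step: prior z on S, new observation a.

    Propagates z through P, restricts the mass to {s : g(s) = a} and
    normalises; the normalising constant c is the conditional probability
    of observing a (assumed positive here)."""
    w = [sum(z[t] * P[t][s] for t in range(len(z))) if g[s] == a else 0.0
         for s in range(len(z))]
    c = sum(w)
    return [ws / c for ws in w]

# Z_1, Z_2 for the observation sequence (Y_1, Y_2) = (0, 1):
z = filter_step(p0, 0)
z = filter_step(z, 1)
print(z)                       # a probability vector supported on {s : g(s) = 1}
```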

In the paper [17] from 2006 by F Kochman and J Reeds, the authors gave a sufficient condition for ergodicity and in the paper [6] from 2010 by P Chigansky and R van Handel the authors proved that this condition is also necessary.

In the classical paper [4] from 1957, D Blackwell conjectured that if the Markov chain {Xn, n = 0, 1, 2, ...} is indecomposable, then the tr.pr.f P has a unique invariant measure. However, in the paper [13] from 1975 a counterexample to Blackwell's conjecture was given, and in the same paper a rather weak sufficient condition for unicity was presented, a condition only slightly stronger than the condition introduced in [17] by Kochman and Reeds.

In the paper [16] from 2011 (see also [15]) the result obtained by Kochman and Reeds was generalised to the case when the state space S is denumerable. In the paper [6] by Chigansky and van Handel the main results of [15] and [16] are proved with different methods.

We shall now generalise the "set-up" described above somewhat. This time we let S be an arbitrary set, let F be a σ-algebra on S and let {Xn, n = 0, 1, 2, ...} be an aperiodic and ergodic Markov chain taking values in S which is generated by a tr.pr.f P : S × F → [0, 1] and an initial probability distribution p0 on (S, F). We shall denote the set of all probability measures on (S, F) by P(S, F). As above, we let g : S → A be a mapping from S onto another set A, which we assume is a denumerable set. We also assume that g is measurable. This implies that for each a ∈ A we can define a subset Sa ∈ F by Sa = {s : g(s) = a}. Clearly ∪_{a∈A} Sa = S.

Let us again - as above - for n = 1, 2, ..., define Yn = g(Xn) and let again Zn denote the conditional distribution of Xn given Y1, Y2, ..., Yn. But what do we more precisely mean by saying that Zn denotes the conditional distribution of Xn given Y1, Y2, ..., Yn?

Well, to illustrate, let us consider the distribution of Z1. If we assume that Y1 = a, it follows that Z1 is the probability distribution z1,a ∈ P(S, F), say, defined by

z1,a(F) = ( ∫_S P(s, F ∩ Sa) p0(ds) ) / ( ∫_S P(s, Sa) p0(ds) ), F ∈ F,

and the probability that Z1 will take the value z1,a is of course equal to the probability that Y1 = a, which is equal to

∫_S P(s, Sa) p0(ds).

Thus, the distribution of Z1 is a discrete distribution - a mass distribution - on the set P(S, F).

The distribution of Zn for n = 2, 3, ... can be described similarly, although the integral formulas describing the distribution become slightly more complicated.

Now, let us again raise the question under which conditions the sequence {Zn, n = 1, 2, ...} converges in distribution. In order to be able to give an answer to this question we need to define a topology for the set P(S, F). In this paper we shall make such assumptions that it will be convenient to use the topology generated by the metric induced by the total variation norm as the topology for the set P(S, F). In order to make this choice suitable we shall assume 1) that (S, F) is a complete, separable, metric space with a metric δ0 and that F is the Borel field generated by δ0, and 2) that there exists a σ-finite measure λ on (S, F) such that the tr.pr.f P : S × F → [0, 1] has a density kernel p : S × S → [0, ∞) with respect to the measure λ, that is, we have

P(s, F) = ∫_F p(s, t) λ(dt), ∀ s ∈ S, ∀ F ∈ F.

Having made these assumptions, we can restrict our considerations to the subset of probability measures in P(S, F) which have a density with respect to the base measure λ; we denote this set Pλ(S, F).

Next, let us return to the process {Zn, n = 1, 2, ...} of conditional distributions that we described above. If we make the assumptions that the tr.pr.f P has a density kernel p with respect to λ and that the initial distribution p0 belongs to the set Pλ(S, F), then it is not difficult to prove that the process {Zn, n = 1, 2, ...} is a Markov chain with state space Pλ(S, F).

Now, let µn denote the distribution of Zn, n = 1, 2, ... . In order to illustrate the kind of results that we shall prove in this paper, we shall now give sufficient conditions implying that the sequence {µn, n = 1, 2, ...} will converge in distribution towards a unique limit measure which is independent of p0. For, if 1) S is a compact, metric space, 2) π is a unique invariant measure for the tr.pr.f P which satisfies

lim_{n→∞} sup_{x∈P(S,F)} ||xP^n − π|| = 0,

(here || · || denotes the total variation norm), and 3) there exists an element a ∈ A such that the density kernel p satisfies

0 < d0 ≤ p(s, t) ≤ D0 < ∞, ∀ s, t ∈ Sa,

then {µn, n = 1, 2, ...} converges in distribution towards a unique limit measure on Pλ(S, F), which is independent of the initial distribution p0.

Next a few words on hidden Markov models. A hidden Markov model (HMM), as described in the classical paper [22] by L R Rabiner and B H Juang, consists of a finite state space S, a finite observation space A, a tr.pr.m P on S and a tr.pr.m R from S to A. This classical definition of a HMM is obtained by replacing the deterministic "lumping function" g introduced above by a "stochastic mapping" R. In the more modern literature, see e.g. [5] by O Cappé, E Moulines and T Ryden, one allows both the state space S and the observation space A to be measurable spaces, (S, F) and (A, A) say, and then, of course, the transition probability matrices P and R must be replaced by transition probability functions.

In this paper our definition of an HMM will be slightly different from the one given in for example [5], and will be based on a transition probability kernel from the state space (S, F) to the product space (S × A, F ⊗ A). (See Definition 4.2 below.)

As is well-known, a HMM gives rise to two stochastic processes {Xn, n = 0, 1, 2, ...} and {Yn, n = 1, 2, ...}, where the former process - often called the hidden Markov chain - is determined by the tr.pr.f P and the initial distribution p0, whereas the latter process - usually called the observation sequence - is determined by using the tr.pr.f R from S to A. (We have chosen to start the observation sequence at n = 1 for sake of convenience.)

Next, for n = 1, 2, ..., let Zn denote the conditional law of Xn given the observations Y1, Y2, ..., Yn. The process {Zn, n = 1, 2, ...} is usually called the filtering process (or the filter process).

One question regarding the filtering process {Zn, n = 1, 2, ...} often treated in the literature is whether Zn's dependence on the initial distribution p0 disappears as n → ∞. This property is often called the forgetting property. It is though not this question which is in focus in this paper. Our main focus will instead be on how the distribution, µn say, of Zn depends on the initial distribution p0, and more precisely whether µn's dependence on p0 vanishes as n → ∞ and whether the sequence {µn, n = 1, 2, ...} of distributions converges in distribution to a limit distribution independent of the initial distribution p0, as n → ∞.

Let us here also mention that if both the state space and the observation space are denumerable, then, as pointed out in the paper [3] by L Baum and T Petrie, a HMM can be "transformed" to the "set-up" with a lumping function, if one uses the two tr.pr.ms P and R to construct another tr.pr.m having the product S × A as state space and if one defines a lumping function g : S × A → A simply by g(s, a) = a.
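The transformation of Baum and Petrie can be sketched concretely in the finite case. The matrices P and R below are invented toy data, not taken from [3]; the code builds the product chain on S × A whose second coordinate is recovered by the lumping function g(s, a) = a.

```python
# Sketch (invented data): from the HMM pair (P, R) to the lumping set-up.

P = [[0.9, 0.1],               # tr.pr.m of the hidden chain on S = {0, 1}
     [0.2, 0.8]]
R = [[0.8, 0.2],               # tr.pr.m from S to A = {0, 1}: R[s][a]
     [0.3, 0.7]]

S, A = range(len(P)), range(len(R[0]))

# Product tr.pr.m on S x A: from (s, a) move to (t, b) with prob P[s][t]*R[t][b].
# The transition does not depend on the current a, and g(s, a) = a recovers
# the observation sequence as a lumping of this bivariate chain.
Q = {(s, a): {(t, b): P[s][t] * R[t][b] for t in S for b in A}
     for s in S for a in A}

for row in Q.values():
    assert abs(sum(row.values()) - 1.0) < 1e-12   # each row is a prob. vector
```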

The plan of the paper is as follows. In Section 2 we introduce some basic notations, present some simple relations to be used later and define various types of ergodicity; weak ergodicity, strong ergodicity and uniform ergodicity.

In Section 3 we introduce the notion "partition of transition probability functions", a generalisation of the notion "partition of transition probability matrices" introduced in [16].

In Section 4 we first introduce a notion we call HMM-kernel and then we make our definition of a Hidden Markov Model (HMM). Our definition covers the usual definition of a HMM with finite state space and finite observation space, and also covers the usual definition of a HMM with general state space and general observation space (sometimes also called “state space model”). We also introduce a smaller class of HMMs which we call HMMs with densities consisting of such HMMs for which both the state space and the observation space are complete, separable, metric spaces and the HMM-kernel is determined by a density.


In Section 4 we also introduce a notion which we have chosen to call random mapping. The classical name is random system with complete connections. Other names for this concept are e.g. learning model or iterated function system with place-dependent probabilities. One important property of a random mapping is that it induces a Markov kernel.

In Section 5 we define the filter kernel induced by a HMM with densities. The filter kernel we introduce is a tr.pr.f on the subset Pλ(S, F) of P(S, F) consisting of those probability measures which have a density with respect to a fixed σ-finite base measure λ on (S, F). We denote the filter kernel by P. To simplify notations we shall usually denote the set Pλ(S, F) by K. The topology which we shall use on K will be the topology induced by the metric determined by the total variation between probabilities.

In order to be able to prove that certain sets that occur in our investigation are measurable, we introduce a property which we call a regularity property and which means that the kernel determining the HMM satisfies a certain continuity condition. (See Definition 5.1.)

The problems we want to solve are thus 1) to find conditions on the HMM such that the dependence of the initial distribution for the distributions of the filtering process vanishes (weak ergodicity) and 2) to find conditions such that there exists a unique probability measure on the set K towards which the distributions of the filtering process converge for all choices of initial distributions (weak ergodicity with stationary measure).

In Section 6 we define the random mapping associated to a HMM with densities, and show that the filter kernel can be considered as the Markov kernel induced by this random mapping. This is an old observation and goes at least back to the paper [4] from 1957 by Blackwell.

In Section 7 we recall the definition of the Kantorovich distance between probability measures.

In Section 8 we introduce the set of probability measures on K which have the same barycenter. It was probably H Kunita who first observed the usefulness of the concept barycenter when studying the limit behaviour of filtering processes. (See [18].)

In the paper [16] it was proved that, for a denumerable state space S, the set of probabilities on K with equal barycenter is a tight set, and by using this property it is not difficult to prove that if the hidden Markov chain {Xn, n = 0, 1, 2, ...} has a stationary probability measure π ∈ K and the probability measure µ on K has barycenter π, then the sequence {µP^n, n = 1, 2, ...} is a tight sequence, where thus P denotes the filter kernel. Unfortunately, when we tried to generalise this result from the case when the state space is denumerable to the case when the state space is a complete, separable, metric space, we failed.

In Section 9 we recall the well-known notion called coupling, and we also introduce the well-known Vaserstein coupling, and in Section 10 we describe in detail what we mean by the Vaserstein coupling of a transition probability function with density.

In Section 11 we present the main theorem of this paper. In the theorem we introduce a condition, which we call Condition E, and which reads as follows: To every ρ > 0 there exist an integer N and a number α > 0 such that for any two probability measures µ and ν on K having barycenter equal to π (the invariant measure of the hidden Markov chain) we can find a coupling µ̃N of µP^N and νP^N such that

µ̃N(Dρ) ≥ α,

where

Dρ = {(x, y) ∈ K × K : ||x − y|| < ρ}.

One result of the main theorem is that if the HMM under consideration has a density which fulfills the regularity condition introduced in Section 5, and there exists a unique probability measure π in P(S, F) which is invariant with respect to the tr.pr.f P of the hidden Markov chain and is such that

lim_{n→∞} sup_{x∈P(S,F)} ||xP^n − π|| = 0, (1)

then the distributions of the filtering process converge in distribution towards a unique limit distribution, if Condition E is satisfied. In case the tr.pr.f P of the hidden Markov chain {Xn, n = 0, 1, 2, ...} is such that

lim_{n→∞} ||xP^n − π|| = 0, ∀ x ∈ P(S, F)

and also Condition E is satisfied, then we have not been able to prove the existence of a unique invariant measure for the filtering process; we have only been able to prove that the Kantorovich distance between the distributions of two filtering processes generated by two different initial distributions tends to zero as n → ∞.

In Sections 12 to 15 the proof of the main theorem is given. In Section 12 we prove two auxiliary theorems; both theorems are for a Markov chain taking its values in a complete, separable, metric, bounded space. The first theorem is based on two properties which were introduced in the paper [16] namely a property we call the shrinking property and a property we call Lipschitz equicontinuity. The second theorem is similar and based on a stronger property which we call the strong shrinking property.

In Section 13 we prove that the filtering process induced by a HMM with densities has the Lipschitz equicontinuity property, and in Section 14 we prove that the Kantorovich distance between a probability measure on the set K with barycenter x and the set of probability measures with barycenter y is equal to the total variation between x and y.

In Section 15 we conclude the proof of the main theorem by verifying that the hypotheses of the auxiliary theorems are satisfied under the various hypotheses of the main theorem.

In Section 16 we prove some inequalities for the nth composition of positive integral kernels. The proofs of these inequalities are based on some tricks used in the classical paper [9] on products of random matrices by H Furstenberg and H Kesten.

In Section 17 we introduce some further conditions and show, by using the inequalities proved in Section 16, that these conditions imply Condition E. In concrete situations it may be easier to verify the conditions introduced in Section 17 than to verify Condition E directly. In this section we also verify that the results in [17] and [16] are covered by the results of the present paper.


In Section 18 finally, we first apply our theorems to two examples. We end our paper with an example which shows that even if the hidden Markov chain of a HMM is uniformly ergodic (see (1)), the filtering process need not be weakly ergodic. In fact, the filtering process may even become a periodic process.

2 Basic notations

Let (S, F) be a measurable space. We let B[S] - or B[S, F] if we want to emphasize the σ-algebra F - denote the set of real, bounded, F-measurable functions on S, and we let Bu[S] or Bu[S, F] denote the set of real F-measurable functions on S, thus not necessarily bounded. For u ∈ B[S] we define

||u|| = sup{|u(s)| : s ∈ S},

we define

osc(u) = sup{u(s) − u(t) : s, t ∈ S}

and, if F ⊂ S, F ≠ ∅, we define

oscF(u) = sup{u(s) − u(t) : s, t ∈ F}.

We let B[(S1, F1), (S2, F2)] denote the set of measurable mappings from (S1, F1) to (S2, F2); we let P(S, F) denote the set of probability measures on (S, F), we let M(S, F) denote the set of finite signed measures on (S, F), we let Q(S, F) denote the set of finite non-negative measures on (S, F) and we let Q∞(S, F) denote the set of positive σ-finite measures on (S, F).

As is well-known, M(S, F) is a vector space. We use the notation δs to denote the Dirac measure at s ∈ S. Furthermore, if F ∈ F we let (F, FF) denote the measurable space with FF = {F′ ∈ F : F′ ⊂ F}.

If u ∈ Bu[S] and x ∈ M(S, F) we usually write

⟨u, x⟩ = ∫_S u(s) x(ds)

whenever the integral exists. Furthermore, if F ⊂ S, we let IF : S → {0, 1} denote the indicator function of the set F.

Next, if x, y ∈ M(S, F), we let F+(x, y) ∈ F denote a set in F such that

x(F ∩ F+(x, y)) ≥ y(F ∩ F+(x, y)), ∀ F ∈ F,

we set F−(x, y) = S \ F+(x, y), we define x ∨ y ∈ M(S, F) by

x ∨ y(F) = x(F ∩ F+(x, y)) + y(F ∩ F−(x, y)),

and we define x ∧ y ∈ M(S, F) by

x ∧ y(F) = x(F ∩ F−(x, y)) + y(F ∩ F+(x, y)).

If x ∈ M(S, F) we define x+ ∈ Q(S, F) by x+ = x ∨ 0, where thus 0 ∈ M(S, F) denotes the 0-measure, we define x− ∈ Q(S, F) by x− = (−x) ∨ 0, and we define |x| ∈ Q(S, F) by |x| = x+ + x−.

For x ∈ M(S, F) we also define ||x|| by ||x|| = |x|(S). It is of course well-known that || · || satisfies the properties of a norm and therefore M(S, F) can be regarded as a normed vector space.

The following inequality, which will be of use to us later, is easily proved by using the triangle inequality.

Lemma 2.1 Let x, y belong to a normed vector space and suppose that ||x|| > 0 and ||y|| > 0. Then

|| x/||x|| − y/||y|| || ≤ 2||x − y|| / ||x||. 2
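In the finite-dimensional case Lemma 2.1 is easy to check numerically. Here is a small sketch in the l1-normed space R^3; the two vectors are arbitrary invented data.

```python
# Numerical check of Lemma 2.1 in (R^3, l1) with invented vectors.

def l1(v):
    return sum(abs(c) for c in v)

x, y = [0.5, 0.3, 0.2], [0.2, 0.2, 0.1]

# Left-hand side: distance between the normalised vectors.
lhs = l1([xc / l1(x) - yc / l1(y) for xc, yc in zip(x, y)])
# Right-hand side: 2 ||x - y|| / ||x||.
rhs = 2 * l1([xc - yc for xc, yc in zip(x, y)]) / l1(x)
assert lhs <= rhs
```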

For x, y ∈ M(S, F) we call ||x − y|| the total variation between x and y. As is well-known, if we define δTV : M(S, F) × M(S, F) → [0, ∞) by δTV(x, y) = ||x − y||, then δTV determines a metric on M(S, F). We call δTV the total variation metric.

For r > 0 we define Qr(S, F) = {x ∈ Q(S, F) : ||x|| = r}. The following well-known and easily proved inequality will also be of use to us.

Lemma 2.2 Let r > 0, let x, y ∈ Qr(S, F) and let u ∈ B[S, F]. Then

|⟨u, x⟩ − ⟨u, y⟩| ≤ osc(u)(1/2)||x − y||. 2

Next, let δ0 : S × S → [0, ∞) be a metric on (S, F). We then always assume implicitly that the σ-algebra F is the Borel field induced by the metric δ0. Usually we then write (S, F, δ0) instead of (S, F) and sometimes we write (S, δ0) instead. We let C[S] or C[S, F] denote the set of real, bounded, continuous functions on (S, F). For u ∈ C[S] we define

γ(u) = sup{ |u(s) − u(t)| / δ0(s, t) : s, t ∈ S, s ≠ t },

define Lip[S] = {u ∈ C[S] : γ(u) < ∞} and Lip1[S] = {u ∈ Lip[S] : γ(u) ≤ 1}.

Next, let λ ∈ Q∞(S, F). We let L1λ[S] denote the set

{f ∈ Bu[S] : ∫_S |f(s)| λ(ds) < ∞}.

If a measure x ∈ Q(S, F) is such that there exists a function f ∈ L1λ[S] such that for all F ∈ F

x(F) = ∫_F f(s) λ(ds),

then we say that x has a density f with respect to the base measure λ, we say that x ∈ Qλ(S, F) and we call f a representative of x. If also x ∈ P(S, F), then we write x ∈ Pλ(S, F). If x, y ∈ Qλ(S, F) and f, g are representatives of x and y respectively, then it is well-known that

||x − y|| = ∫_S |f(s) − g(s)| λ(ds).

Furthermore, if (S, F) is a complete, separable, metric space then it is well-known that also (Qλ(S, F), δTV) is a complete, separable, metric space. (See e.g. the book [8] by R Dudley.)
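When λ is the counting measure on a finite set, a representative is simply a vector, and both the displayed identity for ||x − y|| and Lemma 2.2 can be checked directly. The densities and the test function below are invented toy data.

```python
# Counting-measure case (invented data): total variation as an l1 sum,
# plus a check of Lemma 2.2 for two probability measures (r = 1).

f = [0.5, 0.3, 0.2]            # representative (density) of x
g = [0.1, 0.3, 0.6]            # representative (density) of y

tv = sum(abs(a - b) for a, b in zip(f, g))      # ||x - y||, here about 0.8

u = [1.0, 0.0, 2.0]            # a bounded function on S; osc(u) = 2
lhs = abs(sum(ui * (a - b) for ui, a, b in zip(u, f, g)))   # |<u,x> - <u,y>|
osc = max(u) - min(u)
assert lhs <= osc * 0.5 * tv   # Lemma 2.2
```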

Next let (S1, F1) and (S2, F2) be two given measurable spaces. A mapping

K : S1 × F2 → [0, ∞)

will in this paper be called a transition function (tr.f) from (S1, F1) to (S2, F2) if

1) K(s, ·) belongs to Q(S2, F2) for all s ∈ S1, and

2) K(·, F) belongs to B[S1, F1] for all F ∈ F2.

We denote the set of all tr.fs from (S1, F1) to (S2, F2) by T Q((S1, F1), (S2, F2)). If a tr.f K from (S1, F1) to (S2, F2) is such that K(s, ·) ∈ P(S2, F2), ∀ s ∈ S1, then we call K a transition probability function (tr.pr.f) from (S1, F1) to (S2, F2) and we denote the set of all tr.pr.fs from (S1, F1) to (S2, F2) by T P((S1, F1), (S2, F2)). We often use the letter P to denote a tr.pr.f.

If K is a tr.f from (S1, F1) to (S1, F1) we say that K is a tr.f on (S1, F1) and we write K ∈ T Q((S1, F1)). Similarly, if P is a tr.pr.f from (S1, F1) to (S1, F1) we say that P is a tr.pr.f on (S1, F1) and we write P ∈ T P((S1, F1)). If P is a tr.pr.f on (S1, F1) we call P a Markov kernel on (S1, F1).

If K1 ∈ T Q((S1, F1), (S2, F2)) and K2 ∈ T Q((S2, F2), (S3, F3)) one can define a tr.f K1K2 ∈ T Q((S1, F1), (S3, F3)) by

(K1K2)(s, F) = ∫_{S2} K1(s, dt) K2(t, F), s ∈ S1, F ∈ F3.

We call K1K2 the product of K1 and K2.

As is well-known, K1(K2K3) = (K1K2)K3 (see e.g. section 1.1 of the book [23] by D Revuz). More generally, if Km ∈ T Q((Sm, Fm), (Sm+1, Fm+1)), m = 1, 2, ..., n, we can define

K1K2...Km ∈ T Q((S1, F1), (Sm+1, Fm+1)), m = 2, 3, ..., n

recursively by

K1K2...Km = (K1K2...Km−1)Km, m = 2, 3, ..., n.

Associated to a tr.f K ∈ T Q((S1, F1), (S2, F2)) we can define two linear mappings. We define T : B[S2, F2] → B[S1, F1] by

Tu(s) = ∫_{S2} K(s, dt) u(t).

We call T the transition operator associated to K.

Now suppose P is a tr.pr.f on (S, F) where F is a Borel field, and let T denote the associated transition operator. If T is such that

u ∈ C[S] ⇒ Tu ∈ C[S],

then the tr.pr.f P is called Feller continuous.

The following terminology is not standard and therefore we make a more formal definition.

Definition 2.1 Suppose (S, F, δ0) is a metric space and suppose P ∈ T P((S, F)).

I. If the associated transition operator T satisfies u ∈ Lip[S] ⇒ Tu ∈ Lip[S], then we call P Lipschitz-continuous.

II. If P is Lipschitz-continuous and also there exists a constant C > 0 such that the associated transition operator T satisfies

γ(T^n u) ≤ Cγ(u), n = 1, 2, ..., ∀ u ∈ Lip[S],

then we call P Lipschitz equicontinuous. 2

The other linear mapping, K̆, associated to a tr.f K ∈ T Q((S1, F1), (S2, F2)) is a mapping from Q(S1, F1) to Q(S2, F2) and is defined by

K̆x(F) = ∫_{S1} x(ds) K(s, F), F ∈ F2.

Instead of K̆x we shall usually write xK.

As is well-known, the following relation holds for u ∈ B[S2, F2], x ∈ Q(S1, F1) and K ∈ T Q((S1, F1), (S2, F2)) (see e.g. [23], section 1.1):

⟨Tu, x⟩ = ⟨u, xK⟩ (= ⟨u, K̆x⟩). (2)

Next, let (S1, F1) and (S2, F2) be two given measurable spaces, and let λ ∈ Q∞(S2, F2). We let Dλ[S1, S2] denote the subset of Bu[S1 × S2, F1 ⊗ F2] defined by

Dλ[S1, S2] = {f ∈ Bu[S1 × S2, F1 ⊗ F2] : f ≥ 0, ∫_{S2} f(s, t) λ(dt) < ∞, ∀ s ∈ S1}. (3)

If (S1, F1) = (S2, F2) we simply write Dλ[S1] instead of Dλ[S1, S1].

Furthermore, if K ∈ T Q((S1, F1), (S2, F2)) is such that there exists a function k ∈ Dλ[S1, S2] such that for all s ∈ S1 and F ∈ F2

K(s, F) = ∫_F k(s, t) λ(dt), (4)

then we say that the tr.f K has a density kernel k with respect to the base measure λ and we say that K belongs to the set T Qλ((S1, F1), (S2, F2)). If K is a tr.pr.f we usually write P instead of K, we write p instead of k, we call p a probability density kernel and we write P ∈ T Pλ((S1, F1), (S2, F2)). Furthermore, if (S1, F1) = (S2, F2) we simply write T Qλ((S1, F1)) instead of T Qλ((S1, F1), (S1, F1)) and T Pλ((S1, F1)) instead of T Pλ((S1, F1), (S1, F1)).

If k ∈ Dλ[S1, S2] and K ∈ T Qλ((S1, F1), (S2, F2)) denotes the tr.f defined by (4), we call K the tr.f determined by k ∈ Dλ[S1, S2].

An important observation is the following:

Proposition 2.1 Suppose K ∈ T Qλ((S1, F1), (S2, F2)) and that k ∈ Dλ[S1, S2] is a density kernel of K. Then, if x ∈ Q(S1, F1), it follows that xK ∈ Qλ(S2, F2) and, if we define f1 : S2 → [0, ∞) by

f1(t) = ∫_{S1} k(s, t) x(ds),

then f1 is a density of xK with respect to the base measure λ.

Proof. Let F ∈ F2. Then

xK(F) = ∫_{S1} ∫_F k(s, t) λ(dt) x(ds) = ∫_F f1(t) λ(dt),

from which the conclusion of the proposition follows. 2
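In the finite case, with λ the counting measure, Proposition 2.1 reduces to the familiar fact that the density of xK is the vector-matrix product of the density of x with the kernel matrix. A sketch with invented data:

```python
# Proposition 2.1, counting-measure case (invented data): f1 = x k.

k = [[0.7, 0.3],               # density kernel k(s, t); each row sums to 1
     [0.4, 0.6]]
x = [0.5, 0.5]                 # a probability measure on S1 (its density)

# f1(t) = sum_s k(s, t) x(s) is a density of xK with respect to lambda.
f1 = [sum(x[s] * k[s][t] for s in range(len(x))) for t in range(len(k[0]))]
print(f1)                      # approximately [0.55, 0.45]
```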

Next a few notations regarding Markov chains. Let again (S, F) be a measurable space and let P ∈ T P((S, F)). If x ∈ P(S, F) we denote the Markov chain which is generated by x and P by {Xn,x, n = 0, 1, 2, ...}. If x = δs we write {Xn,s, n = 0, 1, 2, ...}.

Definition 2.2 Let (S, F) be a measurable space. Suppose P ∈ T P((S, F)). If for any two probabilities x and y in P(S, F)

lim_{n→∞} ||xP^n − yP^n|| = 0,

then we say that P is strongly ergodic.

If furthermore there exists a measure π ∈ P(S, F) such that

lim_{n→∞} ||xP^n − π|| = 0, ∀ x ∈ P(S, F),

then we say that P is strongly ergodic with stationary measure π. 2

Definition 2.3 Let (S, F) be a measurable space. Suppose P ∈ T P((S, F)). If

lim_{n→∞} sup{||xP^n − yP^n|| : x, y ∈ P(S, F)} = 0,

then we say that P is uniformly ergodic.

If furthermore there exists a measure π ∈ P(S, F) such that

lim_{n→∞} sup{||xP^n − π|| : x ∈ P(S, F)} = 0,

then we say that P is uniformly ergodic with stationary measure π. 2

Before our next definition we define

Cuniform[S, F] = {u ∈ C[S, F] : u uniformly continuous}.

Definition 2.4 Let (S, F, δ0) be a metric space and let F be the Borel field generated by the metric δ0. Suppose P ∈ T P((S, F)). Let T denote the transition operator associated to P. If

lim_{n→∞} |T^n u(s) − T^n u(t)| = 0, ∀ u ∈ Cuniform[S, F] and ∀ s, t ∈ S,

then we say P is weakly ergodic.

If furthermore there exists a measure π ∈ P(S, F) such that

lim_{n→∞} |T^n u(s) − ⟨u, π⟩| = 0, ∀ u ∈ C[S] and ∀ s ∈ S,

then we say P is weakly ergodic with stationary measure π. 2
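For a finite state space the suprema in the definitions above can be computed explicitly, and weak, strong and uniform ergodicity all reduce to convergence of xP^n in total variation. The two-state chain below is invented toy data (its second eigenvalue is 0.6, which governs the rate of decay); the supremum over x ∈ P(S, F) of ||xP^n − π|| is attained at the Dirac measures.

```python
# Toy illustration of Definition 2.3 (uniform ergodicity) for a 2-state chain.

P = [[0.9, 0.1],
     [0.3, 0.7]]
pi = [0.75, 0.25]              # stationary distribution: pi P = pi

def step(x):                   # one step x -> xP
    return [sum(x[s] * P[s][t] for s in range(2)) for t in range(2)]

def tv(x, y):                  # total variation ||x - y||
    return sum(abs(a - b) for a, b in zip(x, y))

x, y = [1.0, 0.0], [0.0, 1.0]  # the Dirac measures delta_0 and delta_1
sup_tv = []
for n in range(1, 6):
    x, y = step(x), step(y)
    sup_tv.append(max(tv(x, pi), tv(y, pi)))
print(sup_tv)                  # decreases geometrically, by the factor 0.6
```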

We end this long section about notations with another simple but important observation.

Proposition 2.2 Suppose P ∈ T Pλ((S, F)), that π ∈ P(S, F) and that π = πP. Then π ∈ Pλ(S, F).


3 Partitions of transition probability functions

In [16] we introduced the notion "partition of a transition probability matrix". In this section we shall generalise this notion by defining partitions of transition probability functions.

Definition 3.1 Let (S1, F1) and (S2, F2) be two measurable spaces and let P be a tr.pr.f from (S1, F1) to (S2, F2). Let (A, A) be yet another measurable space and let

M ∈ T P((S1, F1), (S2 × A, F2 ⊗ A))

be a transition probability function from (S1, F1) to (S2 × A, F2 ⊗ A). If

M(s1, F2, A) = P(s1, F2), ∀ s1 ∈ S1, ∀ F2 ∈ F2,

then we call the tr.pr.f M a partition of the tr.pr.f P.

Furthermore, if there exists a measure λ ∈ Q∞(S2, F2), a probability density kernel p ∈ Dλ[S1, S2] such that

P(s, F2) = ∫_{F2} p(s, t) λ(dt), ∀ s ∈ S1, ∀ F2 ∈ F2,

a measure τ ∈ Q∞(A, A) and a measurable function m : S1 × S2 × A → [0, ∞) such that

M(s, F, B) = ∫_F ∫_B m(s, t, a) λ(dt) τ(da),

then we denote the partition M by (m, τ), we call m a density of M and call m a partition function. 2

Next let us present some simple facts about partitions of tr.pr.fs which we state without proof.

1) Let M1 ∈ T P((S1, F1), (S2 × A1, F2 ⊗ A1)) be a partition of P1 ∈ T P((S1, F1), (S2, F2)) and let M2 ∈ T P((S2, F2), (S3 × A2, F3 ⊗ A2)) be a partition of P2 ∈ T P((S2, F2), (S3, F3)). Define M3 ∈ T P((S1, F1), (S3 × A1 × A2, F3 ⊗ A1 ⊗ A2)) by

M3(s1, F3, B1 × B2) = ∫_{s3∈F3} ∫_{S2} M1(s1, ds2, B1) M2(s2, ds3, B2).

Then M3 is a partition of P1P2. We call M3 the product of M1 and M2.

2) Let Mi be a partition of Pi for i = 1, 2, 3 and suppose that the product P1P2P3 is well-defined. Then

(M1M2)M3 = M1(M2M3).

This implies that if P is a tr.pr.f on (S, F) and M ∈ T P((S, F), (S × A, F ⊗ A)) is a partition of P, then, for n = 2, 3, ..., the product M^n ∈ T P((S, F), (S × A^n, F ⊗ A^n)) is well-defined and is a partition of P^n.

We shall soon see the connection between partitions of tr.pr.fs and hidden Markov models.
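The defining identity M(s1, F2, A) = P(s1, F2) is easy to verify numerically in the finite case. The sketch below (invented matrices, not from the paper) builds a partition from p and r via m(s, t, a) = p(s, t) r(t, a), the HMM-type partition function, and checks that summing out the observation coordinate recovers P.

```python
# Toy check of the partition property (Definition 3.1) with invented data.

p = [[0.7, 0.3],               # probability density kernel p(s, t)
     [0.4, 0.6]]
r = [[0.8, 0.2],               # r[t][a]: observation probabilities
     [0.3, 0.7]]

# Partition function m(s, t, a) = p(s, t) * r(t, a).
m = [[[p[s][t] * r[t][a] for a in range(2)] for t in range(2)]
     for s in range(2)]

# M(s, {t}, A) = P(s, {t}): summing over the whole observation space A
# must return the transition probabilities of the underlying chain.
for s in range(2):
    for t in range(2):
        assert abs(sum(m[s][t]) - p[s][t]) < 1e-12
```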

Remark. Recall that a measure on the product space of two measurable spaces is determined by its values on the measurable rectangles. 2

4 Hidden Markov Models and Random Mappings

Let (S, F) and (A, A) be two measurable spaces and let (S × A, F ⊗ A) be the product space. Let ξ ∈ P(S × A, F ⊗ A) and let Λ ∈ T P((S × A, F ⊗ A)). The Markov chain generated by the tr.pr.f Λ and the initial distribution ξ is called a bivariate Markov chain. We denote a bivariate Markov chain by {(Xn,ξ, Yn,ξ), n = 0, 1, 2, ...}.

In this section we shall consider two special classes of bivariate Markov chains.

Definition 4.1 Let (S, F) and (A, A) be two measurable spaces, let M ∈ T P((S, F), (S × A, F ⊗ A)) and define P ∈ T P((S, F)) by

P(s, F) = M(s, F × A).

Then we call M a Hidden Markov Model kernel (HMM-kernel) and we call P the Markov kernel associated to the HMM-kernel M.

Remark. In other words, if P is a Markov kernel on (S, F), then any partition of P determines a HMM-kernel. 2

In the next definition we present our definition of a Hidden Markov Model.

Definition 4.2 Let (S, F) and (A, A) be two measurable spaces, let P be a tr.pr.f on (S, F) and let M ∈ T P((S, F), (S × A, F ⊗ A)) be a partition of P. The set

H = {(S, F), P, (A, A), M} (5)

is called a Hidden Markov Model (HMM). We call (S, F) the state space and call (A, A) the observation space.

Remark. Since the tr.pr.f P is determined by M we could have excluded P in the expression on the right hand side of (5). We have included P for sake of clarity. 2

Definition 4.3 Let

H = {(S, F ), P, (A, A), M } be a HMM, and define Λ ∈ T P((S × A, F ⊗ A)) by

Λ(s, a, F, B) = M (s, F, B), ∀s ∈ S, ∀a ∈ A, ∀F ∈ F , ∀B ∈ A.

Let x ∈ P(S, F), let α ∈ P(A, A), let ξ = x ⊗ α and let {(Xn,ξ, Yn,ξ), n = 0, 1, 2, ...} denote the bivariate Markov chain generated by Λ and ξ.

We call {Xn,ξ, n = 0, 1, 2, ...} the hidden Markov chain generated by H, and we write {Xn,x, n = 0, 1, 2, ...} instead of {Xn,ξ, n = 0, 1, 2, ...} since the first component is independent of the initial distribution α ∈ P(A, A). We call {Yn,ξ, n = 1, 2, ...} the observation sequence generated by H, and we write {Yn,x, n = 1, 2, ...} instead of {Yn,ξ, n = 1, 2, ...} since also the second component is independent of the initial distribution α if n ≥ 1. 2

Remark 1. Our definition of a HMM is similar to the definition given in [5]. Remark 2. It is for sake of convenience that we start the observation sequence

with n = 1 instead of n = 0. 2


Definition 4.4 Let (S, F, δ0) and (A, A, %) be two complete, separable, metric spaces and let

H = {(S, F, δ0), P, (A, A, %), M}

be a HMM. If there exist a measure λ ∈ Q∞(S, F) and a probability density kernel p ∈ Dλ[S] such that

P(s, F) = ∫_F p(s, t)λ(dt), ∀s ∈ S, ∀F ∈ F,

and a measure τ ∈ Q∞(A, A) and a measurable function m : S × S × A → [0, ∞) such that

M(s, F, B) = ∫_F ∫_B m(s, t, a)λ(dt)τ(da), ∀s ∈ S, ∀F ∈ F, ∀B ∈ A,

then we call H a HMM with densities, and we denote the HMM by

H = {(S, F, δ0), (p, λ), (A, A, %), (m, τ)}.

We call λ and τ base measures. 2

Remark 1. Our concept “HMM with densities” is a minor generalisation of the concept “fully dominated HMM” as defined in [5], Definition 2.2.3.

Remark 2. See Section 2 for the definition of Dλ[S]. 2

We now present three simple examples.

Example 4.1 Let (S, F) be a measurable space. Let {Si, i ∈ I} be a denumerable set of disjoint subsets of S such that ∪i∈I Si = S and such that Si ∈ F, i ∈ I, and let P ∈ T P((S, F)). Set A = I and let A denote the set of all subsets of I. Define M : S × F × A → [0, 1] simply by

M(s, F, B) = Σ_{a∈B} P(s, F ∩ Sa).

Then clearly M is a partition of P. 2

Example 4.2 (Standard hidden Markov model. The denumerable case.) Let S and A be denumerable sets and let F and A denote all subsets of S and A respectively.

Let p : S × S → [0, 1] be a transition probability matrix (tr.pr.m) on S and let r : S × A → [0, 1] be a tr.pr.m from S to A. Define M : S × F × A → [0, 1] simply by

M(s, F, B) = Σ_{t∈F} Σ_{a∈B} p(s, t)r(t, a)

and let P ∈ T P((S, F)) be defined by

P(s, F) = Σ_{t∈F} p(s, t).

Again it is clear that M is a partition of P. 2
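When S and A are finite, the kernel M of Example 4.2 can be tabulated directly. The following is a minimal sketch; the matrices p and r are illustrative choices, not taken from the text.

```python
# Toy instance of Example 4.2 with S = {0, 1} and A = {0, 1}.
# p is a tr.pr.m on S and r a tr.pr.m from S to A (illustrative values).
p = [[0.9, 0.1],
     [0.2, 0.8]]
r = [[0.7, 0.3],
     [0.4, 0.6]]

def M(s, F, B):
    """M(s, F, B) = sum over t in F and a in B of p(s, t) r(t, a)."""
    return sum(p[s][t] * r[t][a] for t in F for a in B)

S, A = [0, 1], [0, 1]
for s in S:
    # M(s, ., .) is a probability measure on S x A ...
    assert abs(M(s, S, A) - 1.0) < 1e-12
    # ... whose first marginal is P(s, .), i.e. M is a partition of P.
    for t in S:
        assert abs(M(s, [t], A) - p[s][t]) < 1e-12
```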

Our third example can be considered as the standard HMM as defined in [5], Definition 2.2.1.


Example 4.3 (Standard hidden Markov model. The general case.) Let (S, F) and (A, A) be two measurable spaces. Let P ∈ T P((S, F)) and R ∈ T P((S, F), (A, A)). Define M : S × F × A → [0, 1] simply by

M(s, F × B) = ∫_F R(y, B)P(s, dy).

Again it is clear that M is a HMM-kernel and that M is a partition of the Markov kernel P. 2

We shall next consider another special kind of bivariate Markov chain.

Definition 4.5 Let (S, F) and (A, A) be two measurable spaces. Let h ∈ B[S × A, S] be a measurable function from S × A to S and let Q ∈ T P((S, F), (A, A)) be a tr.pr.f from (S, F) to (A, A). We call the 4-tuple

R = {(S, F), (A, A), h, Q}

a random mapping. We call h the response function, we call Q the index probability, we call (S, F) the state space and we call (A, A) the index space. 2

Definition 4.6 Let

R = {(S, F ), (A, A), h, Q}

be a random mapping. For s ∈ S and F ∈ F , we define h−1(s, F ) ∈ A by

{a ∈ A : h(s, a) ∈ F }. We call the tr.pr.f M ∈ T P((S, F ), (S × A, F ⊗ A)) defined by

M (s, F × B) = Q(s, h−1(s, F ) ∩ B) (6)

the HMM-kernel induced by the random mapping R, we call the tr.pr.f P ∈ T P((S, F )) defined by

P (s, F ) = Q(s, h−1(s, F )) (= M (s, F × A)) (7)

the Markov kernel induced by the random mapping R and we call the hidden Markov model

H = {(S, F ), P, (A, A), M },

where P and M are defined by (7) and (6) respectively, the HMM induced by the random mapping R.

Furthermore, if {Xn,x, n = 0, 1, 2, ...} and {Yn,x, n = 1, 2, ...} denote the

hidden Markov chain and the observation sequence generated by the HMM H induced by the random mapping R and the initial distribution x ∈ K, we call

{Xn,x, n = 0, 1, 2, ...} the state sequence and {Yn,x, n = 1, 2, ...} the index

sequence generated by the random mapping R. 2

In case a random mapping R = {(S, F), (A, A), h, Q} is such that Q has a density, that is, if there exist a σ-finite measure τ on (A, A) and a function q : S × A → [0, ∞) such that q ∈ Dτ[S, A] and such that

Q(s, B) = ∫_B q(s, a)τ(da), ∀s ∈ S, ∀B ∈ A,

we usually denote the random mapping by R = {(S, F), (A, A), h, (q, τ)}.

Remark. The classical name for the 4-tuple

{(S, F ), (A, A), h, Q}

is random system with complete connections. (See e.g. the book [10] by M Iosifescu and R Theodorescu or the book [11] by M Iosifescu and S Grigorescu.) Another classical name is learning model. (See e.g. the book [20] by F Norman.) A later terminology, introduced by M Barnsley and coworkers, is iterated function system with place-dependent probabilities (see e.g. [2]). The terminology random mapping is inspired by the terminology used in the paper [7] by P Diaconis and D Freedman. In learning model theory the index space is called the event space and the index sequence {Yn,x, n = 1, 2, ...} is called the event sequence. (See e.g. [20].) 2

The motivation for introducing the concept random mapping is that there is a strong connection between the theory on filtering processes and the theory on iterations of random mappings, which we shall see later in this paper. (See Section 6.)

The study of random mappings has a long history (see e.g [10], [20], [14],[11]); here we shall just present a few basic facts.

First suppose R = {(S, F), (A, A), h, Q} is a random mapping and let P ∈ T P((S, F)) be the Markov kernel induced by R. Let T : B[S, F] → B[S, F] be the transition operator associated to the Markov kernel P. We have

Tu(s) = ∫_S u(t)P(s, dt) = ∫_A u(h(s, a))Q(s, da).

In order to state some further relations we need some further notations. Thus let {(S, F), (A, A), h, Q} be a random mapping. We set A^1 = A and define A^n iteratively by

A^{n+1} = A^1 × A^n, n = 1, 2, ... .

Similarly, we set A^1 = A and define the σ-algebras A^n iteratively by

A^{n+1} = A^n ⊗ A.

For (a1, a2, ..., an) ∈ A^n we use the notation

a^n = (a1, a2, ..., an)

and we write

B^n = B1 × B2 × ... × Bn, if Bm ∈ A, m = 1, 2, ..., n.

We define hn : S × A^n → S, n = 1, 2, ..., by first defining h1 = h, and then defining hn iteratively by

hn+1(x, a^{n+1}) = h(hn(x, a^n), an+1), n = 1, 2, ... .

We define Qn : S × A^n → [0, 1] iteratively by Q1 = Q and, for B0 ∈ A^n and Bn+1 ∈ A,

Qn+1(s, B0 × Bn+1) = ∫_{a^n∈B0} ∫_{an+1∈Bn+1} Q(hn(s, a^n), dan+1)Qn(s, da^n).


It is well-known (see [11]) and easily proved that hn is measurable for each positive integer n and that Qn ∈ T P((S, F), (A^n, A^n)). This implies that {(S, F), (A^n, A^n), hn, Qn} is also a random mapping for each positive integer n.

If P ∈ T P((S, F)) is the Markov kernel induced by the random mapping {(S, F), (A, A), h, Q} and P(n) denotes the Markov kernel induced by the random mapping {(S, F), (A^n, A^n), hn, Qn}, then it is easily proved that

P^n = P(n), n = 1, 2, ... .

Furthermore, if u ∈ B[S, F] and s ∈ S, then

∫_S u(t)P^n(s, dt) = E[u(Xn(s))] = E[u(hn(s, Y^n(s)))] = ∫_{A^n} u(hn(s, a^n))Qn(s, da^n) = T^n u(s),

where of course Y^n(s) = (Y1(s), Y2(s), ..., Yn(s)). We also have that, for n = 1, 2, ...,

Xn(s) = hn(s, Y^n(s)), s ∈ S,

a fact which we already have used in the previous string of equalities.
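For a finite toy random mapping these identities can be checked by brute force: the sketch below builds hn and qn by the recursions above, enumerates A^n, and verifies that P^n = P(n) for small n. The response function h and the index probability q are illustrative choices, not taken from the text.

```python
import itertools

# A finite random mapping: S = {0, 1}, A = {0, 1} (illustrative tables).
h = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}   # response function
q = {0: [0.6, 0.4], 1: [0.3, 0.7]}                 # q[s][a] = Q(s, {a})

# Induced Markov kernel P(s, {t}) = Q(s, {a : h(s, a) = t}).
P = [[sum(q[s][a] for a in (0, 1) if h[(s, a)] == t) for t in (0, 1)]
     for s in (0, 1)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def Pn_via_composition(s, n):
    """P^(n)(s, .) computed from the nth composition (h_n, q_n)."""
    dist = [0.0, 0.0]
    for an in itertools.product((0, 1), repeat=n):
        x, w = s, 1.0
        for a in an:          # h_{k+1}(s, a^{k+1}) = h(h_k(s, a^k), a_{k+1})
            w *= q[x][a]      # q_{k+1}(s, a^{k+1}) = q_k(s, a^k) q(h_k(s, a^k), a_{k+1})
            x = h[(x, a)]
        dist[x] += w
    return dist

Pn = P
for n in (1, 2, 3):           # check P^n = P^(n) for n = 1, 2, 3
    for s in (0, 1):
        for t in (0, 1):
            assert abs(Pn_via_composition(s, n)[t] - Pn[s][t]) < 1e-12
    Pn = matmul(Pn, P)
```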

Next suppose that the index probability Q ∈ T P((S, F), (A, A)) has a density q ∈ Dτ[S, A] with respect to a σ-finite measure τ on (A, A). We define qn : S × A^n → [0, ∞) iteratively by q1 = q and

qn+1(s, a^{n+1}) = qn(s, a^n)q(hn(s, a^n), an+1), n = 1, 2, ...,

and then we can express Qn ∈ T P((S, F), (A^n, A^n)), for n = 1, 2, ..., by

Qn(s, B) = ∫_B qn(s, a^n)τ^n(da^n),

where τ^n = τ ⊗ τ ⊗ ... ⊗ τ (n times). We call hn : S × A^n → S the nth composition of h : S × A → S, we call Qn ∈ T P((S, F), (A^n, A^n)) the nth composition of Q ∈ T P((S, F), (A, A)) and we call {(S, F), (A^n, A^n), hn, Qn} the nth composition of {(S, F), (A, A), h, Q}. In case Q has a density q with respect to the base measure τ we denote the nth composition of {(S, F), (A, A), h, (q, τ)} by {(S, F), (A^n, A^n), hn, (qn, τ^n)}.

5 The filter kernel associated to an HMM with densities.

From now on we shall mainly restrict our analysis to HMMs with densities, and we shall assume that the measurable space (S, F ) is a complete, separable,

metric space with metric δ0.

Thus let

H = {(S, F , δ0), (p, λ), (A, A, %), (m, τ )}

be a HMM with densities. As above let Pλ(S, F ) denote the subset of P(S, F )

consisting of all probability measures which have a density with respect to a σ-finite measure λ. To simplify notations we shall usually write K = Pλ(S, F).

Next, let δTV be the metric determined by the total variation on Pλ(S, F), and let E be the σ-algebra on Pλ(S, F) generated by δTV. In agreement with our notations introduced above, we let P(K, E) denote the set of probability measures on (K, E), we let Q(K, E) denote the set of finite, nonnegative measures on (K, E), we let B[K, E] denote the set of real, bounded, measurable functions on (K, E) and we let C[K, E] denote the set of real, bounded, continuous functions on (K, E).

Next let us for each a ∈ A define ma ∈ Dλ[S] simply by

ma(s, t) = m(s, t, a)

and let Ma ∈ T Q((S, F)) be the tr.f on (S, F) determined by the density kernel ma, that is,

Ma(s, F) = ∫_F ma(s, t)λ(dt) ( = ∫_F m(s, t, a)λ(dt) ), s ∈ S, F ∈ F. (8)

(See (3) for the definition of Dλ[S].) As we showed in Section 2 (see Proposition 2.1) the tr.f Ma induces a mapping M̆a from Qλ(S, F) to Qλ(S, F). As we mentioned above we shall usually write xMa instead of M̆a x.

We now define M : Qλ(S, F) × A → Qλ(S, F) by

M(x, a) = xMa. (9)

In order to simplify the proof of certain measurability properties for various quantities below, we now make the following definition.

Definition 5.1 Let

H = {(S, F , δ0), (p, λ), (A, A, %), (m, τ )}

be a HMM with densities. Let M : Qλ(S, F ) × A → Qλ(S, F ) be defined by (9)

and (8). If M : Qλ(S, F ) × A → Qλ(S, F ) is a continuous function then we

say that the partition (m, τ ) is regular.

Proposition 5.1 Let H1 = {(S, F, δ0), (p1, λ), (A1, A1, %1), (m1, τ1)} and H2 = {(S, F, δ0), (p2, λ), (A2, A2, %2), (m2, τ2)} be two hidden Markov models with densities and regular partitions, and with the same state space and the same base measure λ. Define m1,2 : S × S × A1 × A2 → [0, ∞) by

m1,2(s, t, a1, a2) = ∫_S m1(s, σ, a1)m2(σ, t, a2)λ(dσ)

and define τ1,2 as the product measure τ1 ⊗ τ2. Then (m1,2, τ1,2) is the product of the partitions (m1, τ1) and (m2, τ2) and furthermore the partition (m1,2, τ1,2) is also regular. 2

We give a proof for sake of completeness.

Proof. Set M1 = (m1, τ1), set M2 = (m2, τ2) and let M3 = M1M2 denote the product of the partitions M1 and M2. By definition

M3(s, F, B1, B2) = ∫_{σ∈S} ∫_{t∈F} M1(s, dσ, B1)M2(σ, dt, B2) = ∫_{σ∈S} ∫_{a1∈B1} ∫_{t∈F} ∫_{a2∈B2} m1(s, σ, a1)m2(σ, t, a2)λ(dσ)λ(dt)τ1(da1)τ2(da2) = ∫_{t∈F} ∫_{a1∈B1} ∫_{a2∈B2} m1,2(s, t, a1, a2)λ(dt)τ1,2(da1, da2),

from which it follows that M3 = (m1,2, τ1,2), and thereby we have shown that (m1,2, τ1,2) is the product of the partitions (m1, τ1) and (m2, τ2).

To prove that (m1,2, τ1,2) is regular we proceed as follows. Define M1 : Qλ(S, F) × A1 → Qλ(S, F), M2 : Qλ(S, F) × A2 → Qλ(S, F) and M3 : Qλ(S, F) × A1 × A2 → Qλ(S, F) by

M1(x, a1)(F) = ∫_{t∈F} ∫_{s∈S} m1(s, t, a1)x(ds)λ(dt), F ∈ F,

M2(x, a2)(F) = ∫_{t∈F} ∫_{s∈S} m2(s, t, a2)x(ds)λ(dt), F ∈ F,

and

M3(x, a1, a2)(F) = ∫_{t∈F} ∫_{s∈S} m1,2(s, t, a1, a2)x(ds)λ(dt), F ∈ F,

respectively. Since

m1,2(s, t, a1, a2) = ∫_S m1(s, σ, a1)m2(σ, t, a2)λ(dσ)

and

∫_F ∫_S m1(s, t, a1)x(ds)λ(dt) = xMa1(F),

we find that

M3(x, a1, a2)(F) = ∫_{t∈F} ∫_{s∈S} ∫_{σ∈S} m1(s, σ, a1)m2(σ, t, a2)λ(dσ)x(ds)λ(dt) = ∫_{t∈F} ∫_{σ∈S} m2(σ, t, a2)xMa1(dσ)λ(dt).

Hence

M3(x, a1, a2) = M2(xMa1, a2) = M2(M1(x, a1), a2)

because of (9), and since both M2 and M1 are continuous it follows that M3 is also continuous. 2

We next prove that if the observation space is denumerable and the partition function is bounded then the partition is regular.

Proposition 5.2 Let

H = {(S, F , δ0), (p, λ), (A, A, %), (m, τ )}

be a HMM with densities. Suppose that A is denumerable, that the metric % is the discrete metric and assume also that the partition function m is bounded. Then the partition (m, τ ) is regular.


Proof. Since % is the discrete metric it suffices to prove that for each a ∈ A the map M̆a : Qλ(S, F) → Qλ(S, F) is continuous in the topology induced by the total variation metric.

Thus let ε > 0 be given. Since the partition function m is assumed to be bounded we can find a constant C, say, such that m(s, t, a) < C, ∀s, t ∈ S, ∀a ∈ A.

Now let x, y ∈ Qλ(S, F) and let f and g be representatives of x and y with respect to the base measure λ. We find

||xMa − yMa|| = ∫_S | ∫_S (f(s) − g(s))m(s, t, a)λ(ds) | λ(dt) ≤ C ∫_S ∫_S |f(s) − g(s)|λ(ds)λ(dt) = C||x − y||.

Hence if ||x − y|| < ε/C it follows that ||xMa − yMa|| < ε. 2

Remark. Another, less trivial, result is presented in Section 18. (See Example 18.2 and Proposition 18.1.)

Our next aim is to introduce a notion for HMMs with densities and regular partitions, which we call the filter kernel.

Thus, as usual, let

H = {(S, F, δ0), (p, λ), (A, A, %), (m, τ)}

be a HMM with densities and assume that the partition (m, τ) is regular. We now define the set KA+ ⊂ K × A by

KA+ = {(x, a) ∈ K × A : ||xMa|| > 0}

and for each x ∈ K we let A+x be the set

A+x = {a ∈ A : ||xMa|| > 0}.

Since ||xMa|| = ||M(x, a)|| and M(x, a) is a continuous function, it follows that KA+ is an open set, and also that A+x ⊂ A is an open set for each x ∈ K. In particular KA+ ∈ E ⊗ A is measurable, and also A+x ∈ A for each x ∈ K. Furthermore, if E0 ∈ E is an open set and we define KA+(E0) as the subset of KA+ defined by

KA+(E0) = {(x, a) ∈ KA+ : xMa/||xMa|| ∈ E0},

then it follows from the continuity of the map M(·, ·) and Lemma 2.1 that KA+(E0) is an open set.

We now define the tr.pr.f P on (K, E) by

P(x, E) = ∫_{A+x} IE(xMa/||xMa||) ||xMa|| τ(da) (10)

and we define T : B[K, E] → B[K, E] by

Tu(x) = ∫_{A+x} u(xMa/||xMa||) ||xMa|| τ(da). (11)

That P(x, ·) is a probability measure in P(K, E) for every x ∈ K is easily proved, and that P(·, E) is E-measurable for each open E ∈ E follows from the fact that the set {(x, a) ∈ KA+ : xMa/||xMa|| ∈ E} is an open set if E is open. That P(·, E) is E-measurable for each E ∈ E then follows easily from the fact that the set B = {E ⊂ K : P(·, E) is measurable} is a σ-algebra and contains all open sets.

That T is the transition operator associated to P is evident from (10) and (11).

Definition 5.2 Let H = {(S, F , δ0), (p, λ), (A, A, %), (m, τ )} be a HMM with

densities such that (m, τ ) is regular, and let P be defined by (10). We call P the filter kernel induced by the HMM H or simply the filter kernel induced by the partition (m, τ ).

If µ ∈ P(K, E) we call the Markov chain {Zn,µ, n = 0, 1, 2, ...} generated by P and µ the filtering process generated by the HMM H and the initial distribution µ, or more simply the filtering process induced by the partition (m, τ) and the initial distribution µ.

If the initial distribution is the Dirac measure at x ∈ K we write {Zn(x), n = 0, 1, 2, ...} instead of {Zn,δx, n = 0, 1, 2, ...}.

To emphasize the dependence on the HMM H we may write PH instead of just P. We use a similar notation for the associated transition operator. 2

The following lemma is easy to prove yet quite important.

Lemma 5.1 Let H1 = {(S, F, δ0), (p1, λ), (A1, A1, %1), (m1, τ1)} and H2 = {(S, F, δ0), (p2, λ), (A2, A2, %2), (m2, τ2)} be two hidden Markov models with densities such that both (m1, τ1) and (m2, τ2) are regular. Let (m1,2, τ1,2) be the product of their partitions (m1, τ1) and (m2, τ2), let p3 denote the density of the Markov kernel determined by (m1,2, τ1,2) and let H3 be the HMM defined by

H3 = {(S, F, δ0), (p3, λ), (A1,2, A1,2), (m1,2, τ1,2)}.

Let PH1, PH2, PH3 denote the filter kernels induced by H1, H2 and H3 respectively and let TH1, TH2, TH3 denote the associated transition operators.

Then

a) PH1PH2 = PH3

and

b) TH1TH2 = TH3. 2

Proof. The equality in a) follows from the equality in b) if one uses the identity (2) of Section 2.

To prove b) let u ∈ B[K, E], and set u2 = TH2u. From (11) we find that

u2(x) = ∫_{A+2,x} u(xMa2/||xMa2||) ||xMa2|| τ2(da2).

Hence

TH1TH2u(x) = ∫_{A+1,x} u2(xMa1/||xMa1||) ||xMa1|| τ1(da1)
= ∫_{A+1,x} ∫_{A+2,y(a1)} u( ((xMa1/||xMa1||)Ma2) / ||(xMa1/||xMa1||)Ma2|| ) ||(xMa1/||xMa1||)Ma2|| τ2(da2) ||xMa1|| τ1(da1)
= ∫_{A+1,x} ∫_{A+2,y(a1)} u( xMa1Ma2/||xMa1Ma2|| ) ||xMa1Ma2|| τ2(da2)τ1(da1),

where y(a1) and A+2,y(a1) are defined by

y(a1) = xMa1/||xMa1||, a1 ∈ A+1,x,

and

A+2,y(a1) = {a2 ∈ A2 : ||y(a1)Ma2|| > 0},

respectively.

It is easily checked that the set

B(x) = {(a1, a2) ∈ A1 × A2 : ||xMa1Ma2|| > 0}

satisfies

B(x) = {(a1, a2) ∈ A1 × A2 : a1 ∈ A+1,x and a2 ∈ A+2,y(a1)}.

Hence

TH1TH2u(x) = ∫_{B(x)} u( xMa1Ma2/||xMa1Ma2|| ) ||xMa1Ma2|| τ1,2(da1, da2) = TH3u(x). 2

The following result is an immediate corollary of the previous lemma.

Corollary 5.1 Let Hn = {(S, F, δ0), (pn, λ), (An, An, %n), (mn, τn)}, n = 1, 2, ..., N, be a sequence of HMMs with densities having the same state space (S, F) and such that the partitions (mn, τn), n = 1, 2, ..., N, are regular. For n = 1, 2, ..., N let Pn denote the filter kernel induced by Hn and let Tn be the associated transition operator.

Set (m^1, τ^1) = (m1, τ1) and, for n = 1, 2, ..., N − 1, define the products (m^{n+1}, τ^{n+1}) recursively by

m^{n+1}(s, t, a^{n+1}) = ∫_S m^n(s, σ, a^n) m_{n+1}(σ, t, a_{n+1}) λ(dσ)

and

τ^{n+1} = τ^n ⊗ τ_{n+1},

and let p^n denote the density kernel of the Markov kernel determined by the partition (m^n, τ^n). From Proposition 5.1 we know that (m^n, τ^n) is also regular.

Let H^N denote the HMM defined by

H^N = {(S, F, δ0), (p^N, λ), (A^N, A^N), (m^N, τ^N)}, (12)

let P^N denote the filter kernel induced by H^N and let T^N be the associated transition operator. For (a1, a2, ..., aN) ∈ A^N we write Ma1 Ma2 ... MaN = M^N_{a^N}.

Then

a) P1P2...PN(x, E) = P^N(x, E) = ∫_{{a^N : ||xM^N_{a^N}|| > 0}} IE( xM^N_{a^N}/||xM^N_{a^N}|| ) ||xM^N_{a^N}|| τ^N(da^N), ∀x ∈ K, ∀E ∈ E,

b) T1T2...TN u(x) = T^N u(x) = ∫_{{a^N : ||xM^N_{a^N}|| > 0}} u( xM^N_{a^N}/||xM^N_{a^N}|| ) ||xM^N_{a^N}|| τ^N(da^N), ∀x ∈ K, ∀u ∈ B[K, E]. 2


We end this section recalling the notion of weak ergodicity.

Definition 5.3 Let H = {(S, F, δ0), (p, λ), (A, A, %), (m, τ)} be a HMM with densities such that the partition (m, τ) is regular, let P be the filter kernel induced by H and let T be the transition operator associated to P. We say that the filter kernel P is weakly ergodic if

lim_{n→∞} |T^n u(x) − T^n u(y)| = 0

for all x, y in K and all functions u ∈ C[K, E] which are uniformly continuous.

If furthermore there exists a measure ν ∈ P(K, E) such that

lim_{n→∞} ⟨u, µP^n⟩ = ⟨u, ν⟩, ∀µ ∈ P(K, E) and ∀u ∈ C[K, E],

then we say that P is weakly ergodic with stationary measure ν.

6 The random mapping associated to a Hidden Markov Model with densities.

Let again

H = {(S, F, δ0), (p, λ), (A, A, %), (m, τ)}

be a HMM with densities, such that the partition (m, τ) is regular. As before, set K = Pλ(S, F), let δTV denote the "total variation metric" on K and let E denote the σ-algebra generated by δTV.

Let x ∈ Qλ(S, F). As before, we write

||xMa|| = ∫_S ∫_S m(s, t, a)x(ds)λ(dt),

and for each a ∈ A we let xMa ∈ Q(S, F) be defined by

xMa(F) = ∫_{s∈S} ∫_{t∈F} m(s, t, a)λ(dt)x(ds), F ∈ F.

Also as above, we define M : Qλ(S, F) × A → Qλ(S, F) by M(x, a) = xMa. Since the partition (m, τ) is regular it follows by definition that M : K × A → K is a continuous function. We now define g : K × A → [0, ∞) by

g(x, a) = ||xMa|| (13)

and define h : K × A → K by

h(x, a) = xMa/||xMa||, if ||xMa|| > 0, (14)

h(x, a) = x, if ||xMa|| = 0. (15)

Now, since g(x, a) = ||M(x, a)|| and the mapping M : Qλ(S, F) × A → Qλ(S, F) is continuous, it follows that the mapping g : K × A → [0, ∞) is also continuous, and therefore in particular g is measurable. Furthermore,

∫_A ||xMa||τ(da) = 1,

since

∫_A ||xMa||τ(da) = ∫_A ∫_S ∫_S m(s, t, a)x(ds)λ(dt)τ(da) = ∫_S ∫_S ∫_A m(s, t, a)τ(da)λ(dt)x(ds) = ∫_S ∫_S p(s, t)λ(dt)x(ds) = ∫_S x(ds) = 1.

Therefore, if we define G : K × A → [0, 1] by

G(x, B) = ∫_B g(x, a)τ(da), (16)

it is clear that G(x, A) = 1 for all x ∈ K and that G(·, B) is measurable for each B ∈ A; therefore G defines a tr.pr.f from (K, E) to (A, A). Furthermore, using the inequality (2.1) and the fact that the set {(x, a) ∈ K × A : ||xMa|| = 0} is a closed set, it is not difficult to prove that h : K × A → K is measurable, and therefore {(K, E), (A, A), h, (g, τ)} defines a random mapping.

Definition 6.1 We call the random mapping {(K, E ), (A, A), h, G} defined by (14), (15), (13) and (16) the random mapping associated to the HMM {(S, F , δ0), (p, λ), (A, A, %), (m, τ )}.

Next, let Q denote the Markov kernel induced by the random mapping {(K, E), (A, A), h, G} (see formula (7) of Definition 4.6) and let P denote the filter kernel induced by the HMM {(S, F, δ0), (p, λ), (A, A, %), (m, τ)} (see Definition 5.2).

Observation 6.1 Q(x, E) = P(x, E), ∀x ∈ K, ∀E ∈ E.

Proof.

Q(x, E) = G(x, h^{-1}(x, E)) = ∫_{h^{-1}(x,E)} ||xMa||τ(da) = ∫_{{a:h(x,a)∈E}} ||xMa||τ(da) = ∫_{A+x} IE(xMa/||xMa||) ||xMa|| τ(da) = P(x, E). 2

Having made this observation we now make the following definition which partly replaces Definition 5.2.

Definition 6.2 Let H = {(S, F, δ0), (p, λ), (A, A), (m, τ)} be a HMM with densities such that the partition (m, τ) is regular, and let {(K, E), (A, A), h, G} be the random mapping associated to H. (See (13), (16), (14) and (15).)

For µ ∈ P(K, E), let {Yn,µ, n = 1, 2, ...} be the index sequence associated to the random mapping {(K, E), (A, A), h, G} and let {Zn,µ, n = 0, 1, 2, ...} be the state sequence associated to the random mapping {(K, E), (A, A), h, G}. (The state sequence and the index sequence generated by a random mapping were defined in Definition 4.6.)

Let P ∈ T P((K, E)) denote the Markov kernel induced by the random mapping {(K, E), (A, A), h, G}.

We now introduce the following terminology:

1) The Markov chain {Zn,µ, n = 0, 1, 2, ...} with values in (K, E) is called the filtering process associated to the HMM H.

2) The sequence {Yn,µ, n = 1, 2, ...} is called the observation sequence associated to the HMM H.

3) The Markov kernel P ∈ T P((K, E)) is called the filter kernel.

In case µ = δx for some x ∈ K we write Zn(x) instead of Zn,δx and Yn(x) instead of Yn,δx. 2

Remark. That a HMM induces a random mapping is by no means a new observation. Already in the paper [4] Blackwell proves a theorem for random mappings which he applies to the filtering process he is considering. (A finite state space and an observation sequence which is determined by a lumping function.) In Section 2.3.3 of the book [10] the connection between partially observed Markov chains and random mappings is described, and in the book [11] this connection is mentioned at several places. In the paper [12] from 1973 it is proved that the filtering process converges in distribution with geometric convergence rate in case the state space is finite and the tr.pr.m P has strictly positive elements, by showing that the associated random mapping is a so called "distance diminishing model". (See [20], Chapter 2.) In a recent paper by C Anton Popescu (see [1]) a similar result is proved. The connection between filtering processes and random mappings is also emphasized in [13]. 2

Now some further notations similar to those in the previous section. Set

A^1 = A, A^{n+1} = A^1 × A^n, n = 1, 2, ... .

For (a1, a2, ..., an) ∈ A^n we use the notation

a^n = (a1, a2, ..., an),

and consequently we also write

Y^n_µ = (Y1,µ, Y2,µ, ..., Yn,µ).

We define hn : K × A^n → K, n = 1, 2, ..., by first defining h1 = h, where h is defined by (14) and (15), and then defining hn iteratively by

hn+1(x, a^{n+1}) = h(hn(x, a^n), an+1), n = 1, 2, ... .

We define gn : K × A^n → [0, ∞), n = 1, 2, ..., iteratively by g1 = g and

gn+1(x, a^{n+1}) = gn(x, a^n)g(hn(x, a^n), an+1).

Recall that g is defined by (13).

As before we denote the n-fold product measure of τ on (A, A) by τ^n. For (a1, a2, ..., an) ∈ A^n we write Ma1 Ma2 ... Man = M^n_{a^n}.

Clearly {(K, E), (A^n, A^n), hn, (gn, τ^n)} is the nth composition of the random mapping {(K, E), (A, A), h, (g, τ)}.

Due to the special form of both h : K × A → K and g : K × A → [0, ∞) the following relations hold.

Proposition 6.1 Let H = {(S, F, δ0), (p, λ), (A, A, %), (m, τ)} be a HMM with densities such that the partition (m, τ) is regular. Let {(K, E), (A, A), h, (g, τ)} denote the random mapping associated to the HMM H, and for n = 1, 2, ... let {(K, E), (A^n, A^n), hn, (gn, τ^n)} be the nth composition of the random mapping {(K, E), (A, A), h, (g, τ)}.

Then

a) gn(x, a^n) = ||xM^n_{a^n}||, ∀x ∈ K, ∀a^n ∈ A^n,

b) hn(x, a^n) = xM^n_{a^n}/||xM^n_{a^n}||, if ||xM^n_{a^n}|| > 0.

Proof. See Lemma 5.1 and its proof. 2

We end this section with some formulas for the state sequence and the observation sequence of an HMM with densities and regular partition.

Thus, let {Zn(x), n = 0, 1, 2, ...} denote the filtering process and let {Yn(x), n = 1, 2, ...} denote the observation sequence associated to a HMM H = {(S, F, δ0), (p, λ), (A, A, %), (m, τ)} and the initial distribution δx ∈ P(K, E), and let {(K, E), (A, A), h, G} be the associated random mapping. Then

Zn(x) = hn(x, Y^n(x)), n = 1, 2, ... .

Furthermore, if we let P denote the filter kernel induced by H and let T : B[K, E] → B[K, E] denote the transition operator associated to P, then

Tu(x) = ∫_K u(y)P(x, dy) = E[u(h(x, Y1(x)))] = ∫_{A+x} u(h(x, a))g(x, a)τ(da) = ∫_{A+x} u(xMa/||xMa||) ||xMa|| τ(da) (17)

and, for n = 2, 3, ..., we have

T^n u(x) = ∫_K u(y)P^n(x, dy) = E[u(hn(x, Y^n(x)))] = ∫_{{a^n : ||xM^n_{a^n}|| > 0}} u(hn(x, a^n))gn(x, a^n)τ^n(da^n) = ∫_{{a^n : ||xM^n_{a^n}|| > 0}} u(xM^n_{a^n}/||xM^n_{a^n}||) ||xM^n_{a^n}|| τ^n(da^n), (18)

where the last equality is a consequence of Proposition 6.1.
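For a finite standard HMM with counting base measures, the update Zn+1 = Zn Ma / ||Zn Ma|| behind formula (17) becomes elementary matrix arithmetic: (Ma)_{st} = p(s, t)r(t, a). The sketch below uses illustrative matrices p and r and a hypothetical observation record, none of which come from the text.

```python
# Filter update for a finite standard HMM (as in Example 4.2), where
# lambda and tau are counting measures; p, r are illustrative choices.
p = [[0.9, 0.1], [0.2, 0.8]]
r = [[0.7, 0.3], [0.4, 0.6]]

def filter_step(x, a):
    """Return (h(x, a), g(x, a)) = (x M_a / ||x M_a||, ||x M_a||)."""
    xMa = [sum(x[s] * p[s][t] * r[t][a] for s in range(2)) for t in range(2)]
    norm = sum(xMa)                      # ||x M_a||
    return [v / norm for v in xMa], norm

x = [0.5, 0.5]                           # initial distribution in K
for a in [0, 1, 1, 0]:                   # a hypothetical observation record
    x, g = filter_step(x, a)
    assert abs(sum(x) - 1.0) < 1e-12     # Z_n stays a probability on S
```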

7 The Kantorovich distance on the space P(Pλ(S))

As above let K = Pλ(S, F), let δTV be the metric on K determined by the total variation, and let E be the σ-algebra on K generated by δTV.

If µ and ν belong to P(K, E), we let P(K2; µ, ν) denote the set of probability measures in P(K2, E2) defined by

P(K2; µ, ν) = {µ̃ ∈ P(K2) : µ̃(E × K) = µ(E), µ̃(K × E) = ν(E), ∀E ∈ E}.

The Kantorovich distance dK(µ, ν) between µ and ν is defined as

dK(µ, ν) = inf{ ∫_{K2} δTV(x, y)µ̃(dx, dy) : µ̃ ∈ P(K2; µ, ν) }. (19)


From the so called Kantorovich-Rubinstein theorem (see [8], Theorem 11.8.2) it follows that the Kantorovich distance dK can also be defined by

dK(µ, ν) = sup{ ∫_K u(x)µ(dx) − ∫_K u(x)ν(dx) : u ∈ Lip1[K] }, (20)

where

Lip1[K] = {u ∈ C[K] : sup{ |u(x) − u(y)|/δTV(x, y) : x, y ∈ K, δTV(x, y) > 0 } ≤ 1}.

From this representation it is easily proved that the Kantorovich distance dK is in fact a metric on P(K, E).

It is well-known that the topology on P(K, E) induced by the metric dK is equivalent to the weak topology (see e.g. [8], Chapter 11).
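For a two-point space with the discrete metric (a stand-in for δTV), the infimum in (19) can be found by brute force over the one-parameter family of couplings, and it agrees with the total variation distance (1/2)||µ − ν||, as one expects for the discrete metric. The probability vectors below are illustrative choices:

```python
# Two probability vectors on {0, 1} with d(x, y) = 1 if x != y else 0.
mu = [0.7, 0.3]
nu = [0.4, 0.6]

def coupling_cost(t):
    # A coupling of mu and nu on {0,1}^2 is determined by its mass t on (0,0);
    # the cost is the off-diagonal mass, i.e. the integral of d.
    pi = [[t, mu[0] - t], [nu[0] - t, 1.0 - mu[0] - nu[0] + t]]
    return pi[0][1] + pi[1][0]

# Feasible range of t: max(0, mu0 + nu0 - 1) <= t <= min(mu0, nu0).
lo, hi = max(0.0, mu[0] + nu[0] - 1.0), min(mu[0], nu[0])
grid = [lo + (hi - lo) * k / 1000 for k in range(1001)]
d_K = min(coupling_cost(t) for t in grid)

# For the discrete metric, d_K equals (1/2)||mu - nu||.
tv = 0.5 * (abs(mu[0] - nu[0]) + abs(mu[1] - nu[1]))
assert abs(d_K - tv) < 1e-6
```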

8 The barycenter

An important notion within the theory of filtering processes is the notion of barycenter. This concept was introduced into the theory of filtering processes by Kunita. (See [18].)

Let (S, F, δ0) be a complete, separable, metric space, and let λ ∈ Q∞(S, F). As before, let K = Pλ(S, F), let δTV be the metric on K determined by the total variation, let E be the σ-algebra on K generated by δTV and let P(K, E) denote the set of probability measures on (K, E).

Now let µ ∈ P(K, E). The barycenter of µ, which we denote by b(µ), is an element in K which is defined as follows. First, for each F ∈ F we let IF : S → {0, 1} denote the indicator function of F, defined as usual by

IF(s) = 1, if s ∈ F, IF(s) = 0, if s ∉ F.

For each F ∈ F we also define a mapping UF : K → [0, 1] by

UF(x) = ⟨IF, x⟩.

Since |⟨IF, x⟩ − ⟨IF, y⟩| ≤ ||x − y||/2, it is clear that UF ∈ C[K, E] and in particular UF ∈ B[K, E]. Therefore, we can define a mapping b(µ) : F → [0, 1] by

b(µ)(F) = ⟨UF, µ⟩ ( = ∫_K ⟨IF, x⟩µ(dx) ).

Since

b(µ)(S) = ∫_K ⟨IS, x⟩µ(dx) = ∫_K µ(dx) = 1,

b(µ)(S \ F) = ∫_K ⟨IS\F, x⟩µ(dx) = ∫_K ⟨IS, x⟩µ(dx) − ∫_K ⟨IF, x⟩µ(dx) = 1 − b(µ)(F), ∀F ∈ F,

and

b(µ)(∪_{i=1}^∞ Fi) = ∫_K ⟨Σ_{i=1}^∞ IFi, x⟩µ(dx) = Σ_{i=1}^∞ ∫_K ⟨IFi, x⟩µ(dx) = Σ_{i=1}^∞ b(µ)(Fi),

if Fi ∈ F, i = 1, 2, ..., and Fi ∩ Fj = ∅ if i ≠ j, it is clear that b(µ) ∈ P(S, F). Moreover, since λ(F) = 0 implies that UF(x) = ∫_F x(ds) = x(F) = 0 for all x ∈ K, it also follows that λ(F) = 0 implies that b(µ)(F) = 0. Hence b(µ) ∈ K.

We denote the set of all probability measures in P(K, E) with barycenter equal to x by P(K|x).
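When µ is finitely supported, the defining formula b(µ)(F) = ∫_K ⟨IF, x⟩µ(dx) reduces to a weighted average of the supported measures. A minimal sketch with illustrative weights and probability vectors on a two-point S:

```python
# mu puts weight ws[i] on the probability vector xs[i] (densities with
# respect to counting measure on S = {0, 1}); values are illustrative.
xs = [[0.2, 0.8], [0.6, 0.4], [0.5, 0.5]]
ws = [0.5, 0.3, 0.2]

# b(mu)(F) = sum_i ws[i] * xs[i](F); here computed pointwise on S.
b = [sum(w * x[t] for w, x in zip(ws, xs)) for t in range(2)]

assert abs(sum(b) - 1.0) < 1e-12   # b(mu) is again a probability on S
assert abs(b[0] - 0.38) < 1e-12    # weighted average of the first coordinates
```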

The following theorem is essentially due to Kunita. (See [18].)

Theorem 8.1 Let H = {(S, F, δ0), (p, λ), (A, A, %), (m, τ)} be a HMM with densities such that the partition (m, τ) is regular. (Recall that in the definition of a HMM with densities we assume that both (S, F, δ0) and (A, A, %) are complete, separable, metric spaces.) Let P ∈ T P((S, F)) be the Markov kernel determined by (p, λ). As usual, let K = Pλ(S, F), let δTV denote the total variation metric, let E be the σ-algebra determined by the total variation metric δTV and let P ∈ T P((K, E)) be the filter kernel induced by the HMM H.

Then for all x ∈ K

b(P^n(x, ·)) = xP^n, n = 1, 2, ... .

Proof. Let F ∈ F and let T denote the transition operator associated to the filter kernel P. From the definition of the barycenter we find

b(δxP)(F) = ⟨UF, δxP⟩ = ⟨TUF, δx⟩ = TUF(x) = ∫_{A+x} UF(xMa/||xMa||) ||xMa|| τ(da) = ∫_{A+x} ⟨IF, xMa/||xMa||⟩ ||xMa|| τ(da) = ∫_{A+x} ⟨IF, xMa⟩ τ(da) = ∫_{A+x} ∫_F xMa(dt)τ(da) = ∫_{A+x} ∫_F ∫_S m(s, t, a)x(ds)λ(dt)τ(da) = ∫_F ∫_S ∫_{A+x} m(s, t, a)τ(da)x(ds)λ(dt) = ∫_F ∫_S p(s, t)x(ds)λ(dt) = ∫_F xP(dt) = xP(F),

from which, since δxP = P(x, ·), the conclusion of the theorem follows for n = 1. That the conclusion also holds for n ≥ 2 follows from Corollary 5.1. 2

9 A few words on couplings

Definition 9.1 Let (K, E) be a measurable space. Let r > 0 and let µ, ν ∈ Qr(K, E). A measure µ̃ ∈ Qr(K2, E2) is called a coupling of µ and ν if

µ̃(E × K) = µ(E), ∀E ∈ E,

µ̃(K × E) = ν(E), ∀E ∈ E. 2


Definition 9.2 Let (K, E) be a measurable space. Let r > 0 and let µ, ν ∈ Qr(K, E). The measure µ̃ ∈ Qr(K2, E2) defined by

µ̃(E1 × E2) = (1/r)µ(E1)ν(E2), ∀E1, E2 ∈ E, (21)

is called the trivial coupling of µ and ν.

Definition 9.3 Let (K, E) be a measurable space. Let r > 0 and let µ, ν ∈ Qr(K, E). Let E+ ∈ E denote a set in E such that

µ(E ∩ E+) ≥ ν(E ∩ E+), ∀E ∈ E,

and set E− = K \ E+. (Recall that µ(E+) − ν(E+) = (1/2)||µ − ν||.)

The measure µ̃ ∈ Qr(K2, E2) defined by

µ̃(E1 × E2) = µ(E1 ∩ E2 ∩ E−) + ν(E1 ∩ E2 ∩ E+) + (µ(E1 ∩ E+) − ν(E1 ∩ E+))(ν(E2 ∩ E−) − µ(E2 ∩ E−))/(µ(E+) − ν(E+)), (22)

where the last term is omitted if µ = ν, is called the Vaserstein coupling of µ and ν.

Remark. In the paper [24] L Vaserstein introduced the coupling defined by (22) for probabilities defined on a denumerable set. 2

That the measures defined by (21) and (22) are couplings of µ and ν is easily checked. We shall usually denote the Vaserstein coupling by using a V as subscript.

We shall have use of the following well-known fact regarding the Vaserstein coupling.

Lemma 9.1 Let $(K, \mathcal{E})$ be a complete, separable, metric space. Let $r > 0$, let $\mu, \nu \in Q_r(K, \mathcal{E})$, let $\tilde{\mu}_V \in Q_r(K^2, \mathcal{E}^2)$ be the Vaserstein coupling of $\mu$ and $\nu$, and let
\[
D = \{(x, y) \in K \times K : x = y\}.
\]
Then $\tilde{\mu}_V(D) = r - \|\mu - \nu\|/2$. $\Box$

For a proof of the lemma when $r = 1$ see e.g. [19], Section 1.5, Theorem 5.2.
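On a finite set the Vaserstein coupling puts mass $\min\{\mu, \nu\}$ pointwise on the diagonal $D$, since the quotient term of (22) has off-diagonal support, so the identity of Lemma 9.1 can be sanity-checked numerically (invented weights, $r = 1$):

```python
import numpy as np

mu = np.array([0.5, 0.3, 0.2])            # invented probability weights
nu = np.array([0.1, 0.4, 0.5])

diag_mass = np.minimum(mu, nu).sum()      # mass the coupling puts on D
tv = np.abs(mu - nu).sum()                # ||mu - nu||

assert np.isclose(diag_mass, 1.0 - tv / 2)   # Lemma 9.1 with r = 1
```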

10  The Vaserstein coupling of Markov kernels with densities.

Let $(K, \mathcal{E})$ and $(A, \mathcal{A})$ be two measurable spaces and let $Q \in \mathcal{TP}((K, \mathcal{E}), (A, \mathcal{A}))$ be a tr.pr.f from $(K, \mathcal{E})$ to $(A, \mathcal{A})$.

Let $\psi$ be a positive $\sigma$-finite measure on $(A, \mathcal{A})$ and suppose that $q \in D_\psi[K, A]$ is a probability density kernel of $Q \in \mathcal{TP}((K, \mathcal{E}), (A, \mathcal{A}))$. As usual let $K^2 = K \times K$, $A^2 = A \times A$, $\mathcal{E}^2 = \mathcal{E} \otimes \mathcal{E}$ and $\mathcal{A}^2 = \mathcal{A} \otimes \mathcal{A}$.

We shall now define a tr.pr.f $\tilde{Q}_V \in \mathcal{TP}((K^2, \mathcal{E}^2), (A^2, \mathcal{A}^2))$ as follows.

We first define $\hat{q} : K^2 \times A \rightarrow [0, \infty)$ by
\[
\hat{q}(x, y, a) = \min\{q(x, a), q(y, a)\},
\]
we define $f : K^2 \times A \rightarrow [0, \infty)$ and $g : K^2 \times A \rightarrow [0, \infty)$ by
\[
f(x, y, a) = q(x, a) - \hat{q}(x, y, a) \tag{32}
\]
and
\[
g(x, y, a) = q(y, a) - \hat{q}(x, y, a)
\]
respectively, and we define
\[
\Delta(x, y) = \frac{1}{2} \int_A |q(x, a) - q(y, a)|\, \psi(da).
\]
Note that, since $q(x, \cdot)$ and $q(y, \cdot)$ are probability densities with respect to $\psi$, we have $\Delta(x, y) = \int_A f(x, y, a)\, \psi(da) = \int_A g(x, y, a)\, \psi(da)$.

To define $\tilde{Q}_V$ it suffices to specify $\tilde{Q}_V((x, y), \tilde{B})$ for sets $\tilde{B}$ which are rectangular. Thus let $(x, y) \in K^2$ and $B_1, B_2 \in \mathcal{A}$, and set $D = B_1 \cap B_2$. We now define
\[
\tilde{Q}_V((x, y), B_1 \times B_2) = \int_D \hat{q}(x, y, a)\, \psi(da) +
\frac{1}{\Delta(x, y)} \int_{B_2} \int_{B_1} f(x, y, a)\, g(x, y, b)\, \psi(da)\, \psi(db),
\]
where the last term is omitted if $\Delta(x, y) = 0$.

That $\tilde{Q}_V((x, y), \cdot)$ is a coupling of $Q(x, \cdot)$ and $Q(y, \cdot)$ for every $x, y \in K$ is easy to check. We present the arguments for the sake of completeness. Thus, let $x, y \in K$. Define
\[
A_+ = \{a \in A : q(x, a) \ge q(y, a)\}
\]
and $A_- = A \setminus A_+$, and note that $\int_{A_-} (q(y, b) - q(x, b))\, \psi(db) = \Delta(x, y)$. From the definition it now follows, if $\Delta(x, y) > 0$ and $B \in \mathcal{A}$, that
\begin{align*}
\tilde{Q}_V((x, y), B \times A) &= \int_{B \cap A_+} q(y, a)\, \psi(da) + \int_{B \cap A_-} q(x, a)\, \psi(da) \\
&\quad + \frac{1}{\Delta(x, y)} \int_{A_-} \int_{B \cap A_+} (q(x, a) - q(y, a))\,(q(y, b) - q(x, b))\, \psi(da)\, \psi(db) \\
&= \int_{B \cap A_+} q(y, a)\, \psi(da) + \int_{B \cap A_-} q(x, a)\, \psi(da) + \int_{B \cap A_+} (q(x, a) - q(y, a))\, \psi(da) \\
&= \int_B q(x, a)\, \psi(da) = Q(x, B).
\end{align*}
If $\Delta(x, y) = 0$ then clearly
\[
\tilde{Q}_V((x, y), B \times A) = \int_B q(x, a)\, \psi(da) = Q(x, B).
\]
In the same way one can show that for all $x, y \in K$ and $B \in \mathcal{A}$
\[
\tilde{Q}_V((x, y), A \times B) = Q(y, B).
\]

That furthermore $\tilde{Q}_V((\cdot, \cdot), B_1 \times B_2)$ is measurable for all $B_1, B_2 \in \mathcal{A}$ follows from the fact that all integrands are measurable, and therefore it follows that $\tilde{Q}_V((\cdot, \cdot), \tilde{B})$ is measurable for all $\tilde{B} \in \mathcal{A} \otimes \mathcal{A}$.

We call $\tilde{Q}_V$, as defined above, the Vaserstein coupling of the Markov kernel $Q$.
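To make the construction concrete, here is a small numerical sketch (invented numbers, not from the paper): take $A = \{0, 1, 2\}$ with $\psi$ the counting measure, so that each $q(x, \cdot)$ is a row of a stochastic matrix, and build $\tilde{Q}_V((x, y), \cdot)$ from $\hat{q}$, $f$ and $g$, normalising the quotient term by the total mass of $g$ (half the $L^1$ distance between the two densities):

```python
import numpy as np

# Two rows of an invented stochastic matrix: q[x, a] is a probability
# density w.r.t. counting measure psi on the finite set A = {0, 1, 2}.
q = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3]])

def kernel_coupling(qx, qy):
    """Vaserstein coupling of the measures with densities qx and qy."""
    qhat = np.minimum(qx, qy)       # \hat{q}(x, y, .)
    f = qx - qhat                   # surplus of qx, supported on A+
    g = qy - qhat                   # surplus of qy, supported on A-
    coupling = np.diag(qhat)        # common mass on the diagonal a = b
    mass = g.sum()                  # = (1/2) * sum_a |qx[a] - qy[a]|
    if mass > 0:                    # quotient term omitted when densities agree
        coupling += np.outer(f, g) / mass
    return coupling

C = kernel_coupling(q[0], q[1])
assert np.allclose(C.sum(axis=1), q[0])   # first marginal is Q(x, .)
assert np.allclose(C.sum(axis=0), q[1])   # second marginal is Q(y, .)
```

Since `f` and `g` have disjoint supports, the quotient term never charges the diagonal, so the diagonal mass of the coupling is exactly the common part $\sum_a \hat{q}(x, y, a)$.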

References
