Machine learning with state-space models, Gaussian processes and Monte Carlo methods

Andreas Svensson (Lindholm), 12 October 2018

Department of Information Technology, Uppsala University

[Figure: sugar consumption (50 g/day to 150 g/day) on the horizontal axis vs. happiness index (3 to 7) on the vertical axis, with a straight line fitted to the data points.]

The model (a straight line) is learned from data.

Data from World Happiness Report 2017, Food and Agriculture Organization of the United Nations.

Supervised Machine Learning

(Supervised) machine learning is about learning models from data.

The data can be much more complicated…

• Season & product price ↔ Customer demand
• Photo of a road ↔ Position of pedestrians
• X-ray image ↔ A medical report
• Recorded human speech ↔ A written sentence
• A pilot’s operation ↔ The trajectory of an aircraft
• …

[Figure: the sugar consumption vs. happiness index example with its fitted line.]

The model can be much more complicated…

• Neural network
• Random forest
• State-space model
• Gaussian process
• …

Learning models from data

Two approaches to learning

• Most model structures contain some (unknown) parameters θ
• The parameters can either be learned by selecting the \hat{\theta} that explains the data y best → point estimation approach
• … or by computing the parameter probabilities, given that the data y looked the way it did, p(θ | y) → Bayesian approach

\hat{\theta} = \arg\min_\theta \, -p(y \mid \theta) \qquad p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}

[Figure: the sugar consumption vs. happiness index example, illustrating the two approaches.]

Note! The only difference between the point estimation approach and the Bayesian approach is how the parameters are learned from data. There is no difference in the models.
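As a minimal illustration (a toy example of my own, not from the deck): for a Gaussian-mean model with known noise and a Gaussian prior, the point estimate is a single number, while the Bayesian approach yields a whole posterior distribution, here available in closed form by conjugacy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y_i = theta + noise, noise ~ N(0, sigma^2), prior theta ~ N(0, tau^2).
sigma, tau = 1.0, 2.0
theta_true = 4.2
y = theta_true + sigma * rng.standard_normal(20)

# Point estimation: pick the theta that explains the data best
# (the maximum-likelihood estimate, here the sample mean).
theta_hat = y.mean()

# Bayesian approach: compute the whole posterior p(theta | y).
# For this conjugate model it is Gaussian with known mean and variance.
n = len(y)
post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2)
post_mean = post_var * (y.sum() / sigma**2)

print(f"point estimate: {theta_hat:.2f}")
print(f"posterior:      N({post_mean:.2f}, {post_var:.3f})")
```

The point estimate discards the remaining uncertainty about θ; the posterior keeps it.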

Implementing the Bayesian approach

• \hat{\theta} (a point estimate) is a number, like 4.2. That’s easy to handle with a computer.
• p(θ | y) is a probability distribution, i.e., any non-negative function that integrates to 1, like the one below.

[Figure: an example posterior density p(θ | y), plotted for θ between −2 and 3, with peak height around 1.5.]

Ways to represent an arbitrary p(θ | y) in a computer:

• Assume it has a certain form (e.g. Gaussian), and store its parameters (a few numbers) → variational inference
• Generate samples from p(θ | y), and store those → Monte Carlo inference
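A sketch of the Monte Carlo idea (the target density and sampler settings are my own illustrative choices): represent p(θ | y) by draws from it, here generated with a basic random-walk Metropolis sampler, and answer questions about θ via sample averages.

```python
import numpy as np

rng = np.random.default_rng(1)

# Log of an unnormalized posterior over theta, standing in for p(theta | y).
# Here it is N(1, 0.3) up to an additive constant.
def log_post(theta):
    return -0.5 * (theta - 1.0) ** 2 / 0.3

# Random-walk Metropolis: the stored samples ARE the computer representation
# of p(theta | y).
theta, samples = 0.0, []
for _ in range(20000):
    prop = theta + 0.5 * rng.standard_normal()
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    samples.append(theta)
samples = np.array(samples[5000:])  # discard burn-in

# Any expectation under the posterior becomes a sample average.
print(f"E[theta | y]       ~ {samples.mean():.2f}")
print(f"P(theta > 1.5 | y) ~ {(samples > 1.5).mean():.2f}")
```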

Two interesting models

State-space models

x_{t+1} = f(x_t) + w_t
y_t = g(x_t) + e_t

t: time index, x_t: hidden state, y_t: observation, w_t: process noise, e_t: measurement noise

Applications within…

• automatic control
• economics
• mechanical engineering
• medicine
• finance
• biology
• signal processing
• …

[Figure: a hidden state trajectory x_t propagated through f, and the observations y_t generated from it through g.]
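To make the notation concrete, here is a small simulation of one nonlinear state-space model; f, g and the noise levels are my own choices for illustration, not a model from the thesis.

```python
import numpy as np

rng = np.random.default_rng(2)

T = 100
f = lambda x: 0.8 * x + np.sin(x)   # state transition (illustrative choice)
g = lambda x: 0.5 * x               # observation function (illustrative choice)
q, r = 0.1, 0.5                     # std of process / measurement noise

x = np.zeros(T)
y = np.zeros(T)
for t in range(T - 1):
    x[t + 1] = f(x[t]) + q * rng.standard_normal()   # x_{t+1} = f(x_t) + w_t
for t in range(T):
    y[t] = g(x[t]) + r * rng.standard_normal()       # y_t = g(x_t) + e_t

# In practice only y is recorded; the state trajectory x stays hidden.
print(y[:5])
```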

Monte Carlo for state-space models: The particle filter

The first and foremost problem when dealing with state-space models is that x_t is not in the collected data!

The particle filter provides a Monte Carlo solution for recovering x_t, by computing p(x_t | y_{1:t}) using the observations y_t and the functions f and g.
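A minimal bootstrap particle filter sketch (the linear-Gaussian toy model and all settings are my own): propagate particles through f, weight them by how well they explain y_t under g, and resample.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative model: x_{t+1} = 0.8 x_t + w_t,  y_t = x_t + e_t.
f = lambda x: 0.8 * x
g = lambda x: x
q, r = 0.5, 0.5

# Simulate ground truth so the filter output can be checked.
T = 50
x_true = np.zeros(T)
for t in range(T - 1):
    x_true[t + 1] = f(x_true[t]) + q * rng.standard_normal()
y = g(x_true) + r * rng.standard_normal(T)

# Bootstrap particle filter: N weighted particles approximate p(x_t | y_{1:t}).
N = 500
particles = rng.standard_normal(N)
x_filt = np.zeros(T)
for t in range(T):
    # Weight by the measurement likelihood N(y_t; g(x), r^2).
    logw = -0.5 * (y[t] - g(particles)) ** 2 / r**2
    w = np.exp(logw - logw.max())
    w /= w.sum()
    x_filt[t] = np.sum(w * particles)   # posterior-mean estimate of x_t
    # Resample, then propagate through the dynamics.
    idx = rng.choice(N, size=N, p=w)
    particles = f(particles[idx]) + q * rng.standard_normal(N)

print(f"RMSE of filtered state: {np.sqrt(np.mean((x_filt - x_true) ** 2)):.2f}")
```

The same loop works unchanged for nonlinear f and g, which is the point of the method.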

Learning state-space models

x_{t+1} = f(x_t) + w_t
y_t = g(x_t) + e_t

The particle filter provides a Monte Carlo solution for recovering x_t, by computing p(x_t | y_{1:t}) using the observations y_t and the functions f and g.

How can f and g be learned, then? By combining the particle filter with some other method. More in Papers I, III, IV, V, and VI.

Gaussian processes

\begin{bmatrix} f(x) \\ f(x') \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu(x) \\ \mu(x') \end{bmatrix}, \begin{bmatrix} \kappa(x,x) & \kappa(x,x') \\ \kappa(x',x) & \kappa(x',x') \end{bmatrix} \right)

[Figure: example functions f(x) drawn from the Gaussian process, evaluated at a growing set of inputs x.]
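The defining property above, that any finite set of function values is jointly Gaussian, can be used directly to draw example functions from a GP prior; the squared-exponential kernel below is my illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(4)

def kappa(xa, xb, ell=0.5):
    """Squared-exponential covariance kappa(x, x')."""
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / ell**2)

# Evaluate the GP (mu = 0) on a grid: the function values form one big Gaussian vector.
x = np.linspace(-2, 2, 80)
K = kappa(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability

# Each column is one sample function f(x) from the GP prior.
L = np.linalg.cholesky(K)
f_samples = L @ rng.standard_normal((len(x), 3))
print(f_samples.shape)
```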

Learning Gaussian processes

\begin{bmatrix} f(x) \\ f(x') \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu(x) \\ \mu(x') \end{bmatrix}, \begin{bmatrix} \kappa(x,x) & \kappa(x,x') \\ \kappa(x',x) & \kappa(x',x') \end{bmatrix} \right)

\Rightarrow\; f(x') \mid f(x) \sim \mathcal{N}\left( \mu(x') + \kappa(x',x)\,\kappa(x,x)^{-1}(f(x) - \mu(x)),\; \kappa(x',x') - \kappa(x',x)\,\kappa(x,x)^{-1}\kappa(x,x') \right)

The Gaussian process is non-parametric, meaning it contains all training data. No parameters to learn, then? Maybe not in the ”ideal” case, but…

• There are hyperparameters in κ (and µ)
• Sometimes parametric approximations are used (with parameters to learn)
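The conditional above is exactly what GP regression computes; a small numpy sketch (the kernel and the training data are my own illustrative choices, with µ = 0 and noise-free observations):

```python
import numpy as np

def kappa(xa, xb, ell=0.7):
    # Squared-exponential kernel kappa(x, x') (an illustrative choice).
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / ell**2)

# Training data: noise-free observations of f at a few inputs.
x_train = np.array([-1.5, -0.5, 0.3, 1.2])
f_train = np.sin(2 * x_train)

# Test inputs where we want f(x') | f(x).
x_test = np.linspace(-2, 2, 50)

K = kappa(x_train, x_train) + 1e-9 * np.eye(len(x_train))
Ks = kappa(x_test, x_train)      # kappa(x', x)
Kss = kappa(x_test, x_test)      # kappa(x', x')

# Posterior mean and covariance, straight from the conditional-Gaussian formula.
mean = Ks @ np.linalg.solve(K, f_train)        # kappa(x',x) kappa(x,x)^{-1} f(x)
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)      # kappa(x',x') - kappa(x',x) kappa(x,x)^{-1} kappa(x,x')

print(mean.shape, cov.shape)
```

With noise-free data the posterior mean interpolates the training points exactly, which the kernel length scale ell (a hyperparameter) then shapes between them.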

My thesis

• Monte Carlo methods
• State-space models
• Gaussian processes

My contributions

Collect data → Choose model structure → Learn model

Paper II (model validation: is the model structure good?) and Papers I, III, IV, V, VI, and VII (learning the model).

Paper I – A flexible state-space model for nonlinear dynamical systems

A computationally tractable combination of the state-space model and the Gaussian process model.

By combining a reduced-rank approximation of the Gaussian process with the state-space model, the model can be learned efficiently with a certain Monte Carlo method: particle Gibbs with ancestor sampling.

x_{t+1} = f(x_t) + v_t,  f ∼ GP
y_t = g(x_t) + e_t,  g ∼ GP

Paper I – Some more details

f \sim \mathcal{GP}(0, \kappa) \;\Leftrightarrow\; f(x) \approx \sum_{j=1}^{m} w^{(j)} \varphi^{(j)}(x), \quad w^{(j)} \sim \mathcal{N}(0, S(\lambda^{(j)}))

Hilbert Space Methods for Reduced-Rank Gaussian Process Regression. Arno Solin & Simo Särkkä, 2014.

x_{t+1} = \underbrace{\begin{bmatrix} w_1^{(1)} & \cdots & w_1^{(m)} \\ \vdots & & \vdots \\ w_{n_x}^{(1)} & \cdots & w_{n_x}^{(m)} \end{bmatrix}}_{A} \underbrace{\begin{bmatrix} \varphi^{(1)}(x_t) \\ \vdots \\ \varphi^{(m)}(x_t) \end{bmatrix}}_{\approx f(x_t)} + v_t, \quad v_t \sim \mathcal{N}(0, Q)

y_t = g(x_t) + e_t

p(A, Q) = \mathcal{MNIW}(A, Q; 0, V, \ell, \Lambda), \quad p(x_{t+1} \mid A, Q, x_t) = \mathcal{N}(x_{t+1}; A\varphi(x_t), Q)

\Rightarrow\; p(A, Q \mid x_{1:T}) = \mathcal{MNIW}\left(A, Q;\; \Psi(\Sigma + V^{-1})^{-1},\; (\Sigma + V^{-1})^{-1},\; \Lambda + \Phi - \Psi(\Sigma + V^{-1})^{-1}\Psi^{T},\; \ell + T n_x\right)

Paper I – Some more details

Input: Data y_{1:T}, priors on A, Q.
Output: K MCMC samples with p(x_{1:T}, A, Q | y_{1:T}) as invariant distribution.

1  Initialize A[0], Q[0].
2  for k = 0 to K do
3    Sample x_{1:T}[k+1] | A[k], Q[k] using particle Gibbs with ancestor sampling
4    Sample Q[k+1] | x_{1:T}[k+1]
5    Sample A[k+1] | Q[k+1], x_{1:T}[k+1]
6  end

Computationally efficient Bayesian learning of Gaussian process state space models. Andreas Svensson, Arno Solin, Simo Särkkä, and Thomas B. Schön, AISTATS 2016.

Nonlinear state space model identification using a regularized basis function expansion. Andreas Svensson, Thomas B. Schön, Arno Solin, and Simo Särkkä, IEEE CAMSAP 2015.
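The alternating structure of the sampler above (sample states given parameters, then parameters given states) is plain Gibbs sampling. As a structural illustration only, here is a minimal Gibbs sampler for a toy bivariate Gaussian, not the PGAS sampler itself.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy target: (a, b) jointly Gaussian, zero mean, unit variances, correlation rho.
# Gibbs alternates exact conditional draws, just as lines 3-5 above alternate
# between the states x_{1:T} and the parameters (A, Q).
rho = 0.8
a, b = 0.0, 0.0
samples = []
for k in range(20000):
    a = rho * b + np.sqrt(1 - rho**2) * rng.standard_normal()  # draw a | b
    b = rho * a + np.sqrt(1 - rho**2) * rng.standard_normal()  # draw b | a
    samples.append((a, b))
samples = np.array(samples[2000:])  # discard burn-in

print(f"sample correlation: {np.corrcoef(samples.T)[0, 1]:.2f}")
```

The chain's invariant distribution is the joint target, mirroring how the algorithm above targets p(x_{1:T}, A, Q | y_{1:T}).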

Paper II – Data consistency approach to model validation

”Am I using a good model structure?”

Important not least because all uncertainties in the Bayesian approach are conditional on the choice of model class!

Paper III – Learning dynamical systems with PSAEM

A method for learning parameter point estimates in nonlinear state-space models

A challenge (and strength!) of Monte Carlo methods is their inherent stochasticity: they produce a different result each time they are executed. That makes them hard to combine with other methods, such as Expectation Maximization (EM). A go-to solution is to average over multiple Monte Carlo runs to get something less stochastic. A more rigorous and efficient way is to make use of stochastic approximation.

Paper IV – Identification of jump Markov linear models using particle filters

The method from Paper III applied to the (non-trivial) special case of jump Markov linear models

Thanks to the particular structure of the jump Markov linear model, certain computations can be made analytically and the use of Monte Carlo approximations kept to a minimum, so-called Rao-Blackwellization.

s_{t+1} ∼ p(s_{t+1} | s_t)
z_{t+1} = A_{s_t} z_t + w_t
y_t = C_{s_t} z_t + e_t

Paper V – Learning state-space models with highly informative observations

A sequential Monte Carlo method for Bayesian learning of parameters when there is very little measurement noise

When there is very little measurement noise in state-space models, particle filters (a Monte Carlo method) may struggle, which obstructs parameter learning. The idea explored is to assume a certain level of measurement noise, and then gradually decrease that level.

x_{t+1} = f(x_t) + w_t
y_t = g(x_t) + e_t,  with e_t ≈ 0

Paper VI – Learning state-space models using particle-filter approximations

A novel way to use particle filters for parameter point estimation in state-space models

The particle filter can estimate p(y | θ) for nonlinear state-space models, but its stochasticity makes it hard to use within conventional optimization. With a reinterpretation of the particle filter, the estimate of p(y | θ) is made less stochastic and can be used within conventional optimization.

Paper VII – Marginalizing Gaussian process hyperparameters using SMC

A method for Bayesian learning of hyperparameters in Gaussian process models

When optimizing the unknown Gaussian process hyperparameters (i.e., learning point estimates), the result can sometimes be very sensitive to small changes in the data. By instead marginalizing them (i.e., learning the Bayesian way), this sensitivity is avoided. A sequential Monte Carlo sampler is one option for doing this.

Synthetic examples in the thesis

Example | Purpose | Page
x_{t+1} = 10 sinc(x_t / 7) + v_t,  y_t = x_t + e_t | Model learning | I-19
Narendra-Li benchmark | Model learning | I-19
P_Θ = N(µ, σ²) | Model validation | II-10
Polynomial regression | Model validation | II-12
AR(1) | Model validation | II-12
x_{t+1} = θ x_t + v_t,  y_t = x_t + e_t | Parameter learning | III-19
Infinite factorial dynamical model | Hyperparameter learning | III-22
Gaussian process state-space model | Hyperparameter learning | III-23
1-D jump Markov linear state-space model | Parameter learning | IV-13
2-D jump Markov linear state-space model | Parameter learning | IV-13
x_{t+1} = [1 θ₁; 0 0.1] x_t + [θ₂; 0] u_t + v_t,  y_t = [1 0] x_t | Parameter learning | V-16
x_{t+1} = atan(x_t) + θ₁ u_t + v_t,  y_t = x_t + e_t | Parameter learning | V-17
x_{t+1} = 0.5 x_t + b x_t / (1 + x_t²) + 8 cos(1.2 t) + q v_t,  y_t = 0.05 x_t² + e_t | Parameter learning | VI-11

Real-data examples in the thesis

Example | Purpose | Page
Cascaded water tanks | Model learning | I-21
Earthquake counts | Model validation | II-4
Kangaroo population evolution | Model validation | II-13
Wiener-Hammerstein benchmark | Parameter learning | V-19
Sarcos robot arm data | Model learning | VII-8
Oxygen sensors | Fault detection | VII-9
