Machine learning with state-space models, Gaussian processes and Monte Carlo methods
Andreas Svensson (Lindholm) 12 October 2018
Department of Information Technology, Uppsala University
[Figure: Happiness index (3–7) plotted against sugar consumption (50–150 g/day), with a fitted straight line.]
The model (a straight line) is learned from data.
Data from World Happiness Report 2017, Food and Agriculture Organization of the United Nations.
Supervised Machine Learning
(Supervised) Machine Learning is about learning models from data.
The data can be much more complicated…
• Season & product price ↔ Customer demand
• Photo of a road ↔ Position of pedestrians
• X-ray image ↔ A medical report
• Recorded human speech ↔ A written sentence
• A pilot’s operation ↔ The trajectory of an aircraft
• …
[Figure: the sugar consumption vs. happiness index example again.]
The model can be much more complicated…
• Neural network
• Random forest
• State-space model
• Gaussian process
• …
Learning models from data
Two approaches to learning
• Most model structures contain some (unknown) parameters θ
• The parameters can either be learned by selecting the θ̂ that explains the data y the best → point estimation approach
• … or by computing the parameter probabilities, given that the data y looked the way it did, p(θ | y) → Bayesian approach

θ̂ = arg min_θ − p(y | θ)        p(θ | y) = p(y | θ) p(θ) / p(y)
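The two approaches can be made concrete in a toy conjugate model (an illustration, not from the thesis): y_i ∼ N(θ, 1) with prior θ ∼ N(0, 1), where both the point estimate and the full posterior are available in closed form.

```python
import numpy as np

# Toy model (illustrative): y_i ~ N(theta, 1), prior theta ~ N(0, 1).
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=50)
n = len(y)

# Point estimation: theta_hat = arg min_theta -p(y | theta).
# For this model that is simply the sample mean.
theta_hat = y.mean()

# Bayesian approach: by conjugacy p(theta | y) is Gaussian,
# with mean n*ybar/(n + 1) and variance 1/(n + 1).
post_mean = n * y.mean() / (n + 1)
post_var = 1.0 / (n + 1)
```

The posterior mean is the point estimate shrunk towards the prior, and the posterior variance quantifies the remaining uncertainty, which a point estimate alone does not.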
Two approaches to learning
[Figure: the sugar consumption vs. happiness index data, fitted using the point estimation approach and the Bayesian approach.]
Note! The only difference between the point estimation and the Bayesian approach is how parameters are learned from data.
There is no difference in the models.
Implementing the Bayesian approach
• θ̂ (a point estimate) is a number, like 4.2. That’s easy to handle with a computer.
• p(θ | y) is a probability distribution, i.e., any non-negative function that integrates to 1, like
[Figure: an example posterior density p(θ | y) plotted over θ ∈ [−2, 3].]
Ways to represent an arbitrary p(θ | y) in a computer:
• Assume it has a certain form (e.g. Gaussian), and store its parameters (a few numbers) → Variational inference
• Generate samples from p(θ | y), and store those → Monte Carlo inference
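As a sketch of the Monte Carlo representation: once samples from p(θ | y) are stored, any posterior expectation becomes a plain average over them. The Gamma posterior below is an arbitrary stand-in, not a distribution from the thesis.

```python
import numpy as np

# Pretend p(theta | y) happens to be a Gamma(3, 1) distribution we can sample.
rng = np.random.default_rng(1)
samples = rng.gamma(shape=3.0, scale=1.0, size=100_000)

# The array of samples *is* the computer representation of p(theta | y):
post_mean = samples.mean()             # approximates E[theta | y] (= 3 here)
prob_above_4 = (samples > 4.0).mean()  # approximates P(theta > 4 | y)
```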
Two interesting models
State-space models

x_{t+1} = f(x_t) + w_t
y_t = g(x_t) + e_t

t: time index
x_t: hidden state
y_t: observation
w_t: process noise
e_t: measurement noise

Applications within…
• automatic control
• economics
• mechanical engineering
• medicine
• finance
• biology
• signal processing
• …
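A state-space model is easy to simulate forward: draw noise, push the state through f, and observe through g. The functions and noise levels below are illustrative choices, not from the thesis.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 100
f = lambda x: 0.9 * x + np.sin(x)  # state transition function (illustrative)
g = lambda x: 0.5 * x**2           # observation function (illustrative)

x = np.zeros(T)  # hidden states x_t
y = np.zeros(T)  # observations y_t
for t in range(T):
    y[t] = g(x[t]) + rng.normal(scale=0.1)          # measurement noise e_t
    if t + 1 < T:
        x[t + 1] = f(x[t]) + rng.normal(scale=0.5)  # process noise w_t
```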
State-space models

x_{t+1} = f(x_t) + w_t
y_t = g(x_t) + e_t

[Figure: hidden states x_t and observations y_t over time t; f maps the state at t = 7 to the state at t = 8, and g maps the state at t = 8 to the observation y_8.]
Monte Carlo for state-space models: The particle filter

The first and foremost problem when dealing with state-space models is that x_t is not in the collected data!

[Figure: the same illustration again; only the observations y_t are available, not the hidden states x_t.]

The particle filter provides a Monte Carlo solution for recovering x_t by computing p(x_t | y_{1:t}) using the observations y_t and the functions f and g.
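A minimal bootstrap particle filter can be sketched in a few lines: weight each particle by the measurement likelihood, form the filtering estimate, resample, and propagate through the dynamics. The linear model and noise levels below are illustrative stand-ins chosen so the sketch is self-contained.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: 0.9 * x  # transition function (illustrative)
g = lambda x: x        # observation function (illustrative)
q, r = 0.5, 0.3        # process / measurement noise standard deviations

# Simulate T steps of data from the model itself.
T = 50
x_true = np.zeros(T)
y = np.zeros(T)
for t in range(T):
    y[t] = g(x_true[t]) + rng.normal(scale=r)
    if t + 1 < T:
        x_true[t + 1] = f(x_true[t]) + rng.normal(scale=q)

# Bootstrap particle filter: N samples representing p(x_t | y_{1:t}).
N = 500
particles = rng.normal(scale=1.0, size=N)
filter_mean = np.zeros(T)
for t in range(T):
    logw = -0.5 * ((y[t] - g(particles)) / r) ** 2  # log p(y_t | x_t), up to a constant
    w = np.exp(logw - logw.max())
    w /= w.sum()
    filter_mean[t] = np.sum(w * particles)          # estimate of E[x_t | y_{1:t}]
    idx = rng.choice(N, size=N, p=w)                # resample
    particles = f(particles[idx]) + rng.normal(scale=q, size=N)  # propagate
```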
Learning state-space models

x_{t+1} = f(x_t) + w_t
y_t = g(x_t) + e_t

The particle filter provides a Monte Carlo solution for recovering x_t by computing p(x_t | y_{1:t}) using the observations y_t and the functions f and g.
How can f and g be learned, then?
By combining the particle filter with some other method. More in Papers I, III, IV, V, and VI.
Gaussian processes

[ f(x)  ]       ( [ µ(x)  ]   [ κ(x, x)   κ(x, x′)  ] )
[ f(x′) ]  ∼  N ( [ µ(x′) ] , [ κ(x′, x)  κ(x′, x′) ] )

[Figure: draws of f from the Gaussian process, plotted as f(x) against x.]
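Since any finite collection of function values is jointly Gaussian, draws from a GP prior can be generated by sampling one multivariate Gaussian at a grid of inputs. The squared-exponential kernel below is one common choice of κ, used here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def kappa(xa, xb, ell=0.5):
    """Squared-exponential kernel (an illustrative choice of kappa)."""
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / ell**2)

x = np.linspace(-2, 2, 80)
K = kappa(x, x) + 1e-9 * np.eye(len(x))  # small jitter for numerical stability
# Four draws of f at the grid points, with mu = 0:
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=4)
```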
Learning Gaussian processes

[Figure: Gaussian process draws of f(x) against x, before and after conditioning on observed points.]

[ f(x)  ]       ( [ µ(x)  ]   [ κ(x, x)   κ(x, x′)  ] )
[ f(x′) ]  ∼  N ( [ µ(x′) ] , [ κ(x′, x)  κ(x′, x′) ] )

⇒ f(x′) | f(x) ∼ N( µ(x′) + κ(x′, x) κ(x, x)⁻¹ (f(x) − µ(x)),  κ(x′, x′) − κ(x′, x) κ(x, x)⁻¹ κ(x, x′) )

The Gaussian process is non-parametric, meaning it contains all training data. No parameters to learn, then?
Maybe not in the ”ideal” case, but…
• There are hyperparameters in κ (and µ)
• Sometimes parametric approximations are used (with parameters to learn)
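The Gaussian conditioning formula above is all that is needed for prediction: given observed values f(x) at some training inputs, the values at test inputs are again Gaussian. A multi-point sketch, with an illustrative kernel and data, µ = 0, and noise-free observations:

```python
import numpy as np

def kappa(xa, xb, ell=0.5):
    # Squared-exponential kernel (illustrative choice).
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / ell**2)

x_train = np.array([-1.5, -0.5, 0.0, 1.0])
f_train = np.sin(3 * x_train)          # observed f(x); mu = 0 assumed
x_test = np.linspace(-2, 2, 50)

K = kappa(x_train, x_train) + 1e-9 * np.eye(len(x_train))
Ks = kappa(x_test, x_train)
alpha = np.linalg.solve(K, f_train)

# Conditional mean and covariance, matching the formula above:
post_mean = Ks @ alpha  # kappa(x', x) kappa(x, x)^-1 f(x)
post_cov = kappa(x_test, x_test) - Ks @ np.linalg.solve(K, Ks.T)
```

The posterior mean interpolates the training data, and the posterior variance shrinks towards zero at the training inputs: exactly the “contains all training data” behaviour of a non-parametric model.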
My thesis
Monte Carlo methods
State-space models
Gaussian processes
My contributions

[Figure: the workflow “Collect data → Choose model structure → Learn model”, annotated with where Papers I–VII contribute.]
Paper I – A flexible state-space model for nonlinear dynamical systems

A computationally tractable combination of the state-space model and the Gaussian process model.
By combining a reduced-rank approximation of the Gaussian process with the state-space model, the model can be learned efficiently with a certain Monte Carlo method: particle Gibbs with ancestor sampling.

x_{t+1} = f(x_t) + v_t,  f ∼ GP
y_t = g(x_t) + e_t,  g ∼ GP
Paper I – Some more details

f ∼ GP(0, κ)  ⇔  f(x) ≈ Σ_{j=0}^{m} w^{(j)} φ^{(j)}(x),   w^{(j)} ∼ N(0, S(λ^{(j)}))

Hilbert Space Methods for Reduced-Rank Gaussian Process Regression. Arno Solin & Simo Särkkä, 2014.

x_{t+1} = A φ(x_t) + v_t,   v_t ∼ N(0, Q)
y_t = g(x_t) + e_t

where A is the n_x × m matrix of weights, with row i holding w_i^{(1)}, …, w_i^{(m)}, and φ(x_t) = [φ^{(1)}(x_t), …, φ^{(m)}(x_t)]ᵀ, so that A φ(x_t) ≈ f(x_t).

p(A, Q) = MNIW(A, Q; 0, V, ℓ, Λ),   p(x_{t+1} | A, Q, x_t) = N(x_{t+1}; A φ(x_t), Q)
⇒ p(A, Q | x_{1:T}) = MNIW(A, Q; Ψ(Σ + V⁻¹)⁻¹, (Σ + V⁻¹)⁻¹, Λ + Φ − Ψ(Σ + V⁻¹)⁻¹Ψᵀ, ℓ + T n_x)
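The basis-function approximation can be sketched numerically. Following Solin & Särkkä, on an interval [−L, L] the eigenfunctions φ^{(j)} are sinusoids and S is the kernel’s spectral density; summing S(λ^{(j)}) φ^{(j)}(x) φ^{(j)}(x′) should then reproduce κ(x, x′). The values of L, ℓ and m below are illustrative.

```python
import numpy as np

L, ell, m = 4.0, 1.0, 50  # domain half-width, length-scale, number of basis functions

def phi(j, x):
    # Laplacian eigenfunctions on [-L, L] with Dirichlet boundary conditions.
    return np.sin(np.pi * j * (x + L) / (2 * L)) / np.sqrt(L)

def S(omega):
    # Spectral density of the squared-exponential kernel (sigma^2 = 1).
    return np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (omega * ell) ** 2)

x = np.linspace(-2, 2, 100)
js = np.arange(1, m + 1)
omegas = np.pi * js / (2 * L)       # square roots of the eigenvalues
Phi = phi(js[None, :], x[:, None])  # (100, m) matrix of basis functions

K_approx = Phi @ np.diag(S(omegas)) @ Phi.T
K_exact = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell**2)
max_err = np.abs(K_approx - K_exact).max()  # small away from the boundary
```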
Paper I – Some more details

Input: Data y_{1:T}, priors on A, Q.
Output: K MCMC samples with p(x_{1:T}, A, Q | y_{1:T}) as invariant distribution.
1 Initialize A[0], Q[0].
2 for k = 0 to K − 1 do
3     Sample x_{1:T}[k+1] | A[k], Q[k] using particle Gibbs with ancestor sampling
4     Sample Q[k+1] | x_{1:T}[k+1]
5     Sample A[k+1] | Q[k+1], x_{1:T}[k+1]
6 end
Computationally efficient Bayesian learning of Gaussian process state space models. Andreas Svensson, Arno Solin, Simo Särkkä, and Thomas B. Schön, AISTATS 2016.
Nonlinear state space model identification using a regularized basis function expansion. Andreas Svensson, Thomas B. Schön, Arno Solin, and Simo Särkkä, IEEE CAMSAP 2015.
Paper II – Data consistency approach to model validation
”Am I using a good model structure?”
Important not least because all uncertainties in the Bayesian approach are conditional on the choice of model class!
Paper III - Learning dynamical systems with PSAEM
A method for learning parameter point estimates in nonlinear state-space models
A challenge (and strength!) with Monte Carlo methods is their inherent stochasticity – they produce a different result each time they are executed. That makes them hard to combine with other methods, such as Expectation Maximization. A go-to solution is to average over multiple Monte Carlo runs, to get something less stochastic. A more rigorous and efficient way is to make use of stochastic approximation.
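The stochastic-approximation idea can be sketched in isolation: instead of averaging many independent Monte Carlo runs, each new noisy estimate is blended into a running value with a decaying step size (Robbins-Monro). The noisy “estimator” below is a stand-in for, e.g., a particle-filter output.

```python
import numpy as np

rng = np.random.default_rng(5)
true_value = 1.3  # the quantity the noisy Monte Carlo runs are estimating

Q = 0.0
for k in range(1, 2001):
    noisy = true_value + rng.normal(scale=1.0)  # one stochastic Monte Carlo run
    gamma = k ** -0.7  # step sizes: sum diverges, sum of squares converges
    Q = (1 - gamma) * Q + gamma * noisy  # stochastic approximation update
```

With these step sizes Q converges to the true value while re-using each Monte Carlo run only once.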
Paper IV - Identification of jump Markov linear models using particle filters
The method from Paper III applied to the (non-trivial) special case of jump Markov linear models
With the particular structure of the jump Markov linear model, certain computations can be made analytically, keeping the use of Monte Carlo approximations to a minimum (so-called Rao-Blackwellization).

s_{t+1} ∼ p(s_{t+1} | s_t)
z_{t+1} = A_{s_t} z_t + w_t
y_t = C_{s_t} z_t + e_t
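Rao-Blackwellization in miniature (an illustrative toy, not the paper’s model): to estimate E[s·z] with s ∼ Bernoulli(0.3) and z | s ∼ N(2s, 1), one can sample only the discrete s and handle z analytically, since E[s·z | s] = 2s². This reduces the Monte Carlo variance.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
s = rng.random(n) < 0.3  # discrete part, sampled

# Plain Monte Carlo: sample both s and z.
plain = (s * rng.normal(loc=2.0 * s, scale=1.0)).mean()

# Rao-Blackwellized: integrate z out analytically given s.
rao_blackwell = (2.0 * s.astype(float) ** 2).mean()

# Both estimate E[s * z] = 0.6; the second has lower variance.
```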
Paper V – Learning state-space models with highly informative observations

A sequential Monte Carlo method for Bayesian learning of parameters when there is very little measurement noise.
When there is very little measurement noise in state-space models, particle filters (a Monte Carlo method) may struggle, which obstructs parameter learning. The idea explored is to assume there is a certain level of measurement noise, and then gradually decrease that level.

x_{t+1} = f(x_t) + w_t
y_t = g(x_t) + e_t,   with e_t ≈ 0
Paper VI - Learning state-space models using particle-filter approximations
A novel way to use particle filters for parameter point estimation in state-space models
The particle filter can estimate p(y | θ) for nonlinear state-space models, but its stochasticity makes it hard to use within conventional optimization. With a re-interpretation of the particle filter, the estimate of p(y | θ) is made less stochastic and possible to use in conventional optimization.
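The core trick can be illustrated with common random numbers: a Monte Carlo estimate of the likelihood becomes a deterministic function of θ once the underlying random numbers are fixed, so a standard optimizer can be applied to it. The latent-variable model below is an illustrative toy, not the thesis example.

```python
import numpy as np

# Toy model: latent x ~ N(theta, 0.5^2), observations y_i | x ~ N(x, 1).
rng = np.random.default_rng(6)
y = rng.normal(loc=0.8, scale=1.0, size=20)

def neg_log_lik_estimate(theta, seed=0):
    # Fixing the seed fixes the randomness: repeated calls with the
    # same theta give exactly the same value.
    u = np.random.default_rng(seed).normal(size=(1000, 1))
    sim = theta + 0.5 * u  # simulated latent values
    lik = np.exp(-0.5 * (y[None, :] - sim) ** 2).mean(axis=0)  # MC estimate, up to a constant
    return -np.sum(np.log(lik))

a = neg_log_lik_estimate(0.5)
b = neg_log_lik_estimate(0.5)  # identical to a, by construction
```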
Paper VII - Marginalizing Gaussian process hyperparameters using SMC
A method for Bayesian learning of hyperparameters in Gaussian process models
When optimizing the unknown Gaussian process hyperparameters (i.e., learning point estimates), the result can sometimes be very sensitive to small changes in the data. By instead marginalizing them (i.e., learning the Bayesian way), this sensitivity is avoided. A sequential Monte Carlo sampler is one option for doing this.
Synthetic examples in the thesis
Example | Purpose | Page
x_{t+1} = 10 sinc(x_t/7) + v_t,  y_t = x_t + e_t | Model learning | I-19
Narendra-Li benchmark | Model learning | I-19
P_Θ = N(µ, σ²) | Model validation | II-10
Polynomial regression | Model validation | II-12
AR(1) | Model validation | II-12
x_{t+1} = θx_t + v_t,  y_t = x_t + e_t | Parameter learning | III-19
Infinite factorial dynamical model | Hyperparameter learning | III-22
Gaussian process state-space model | Hyperparameter learning | III-23
1-D jump Markov linear state-space model | Parameter learning | IV-13
2-D jump Markov linear state-space model | Parameter learning | IV-13
x_{t+1} = [1 θ₁; 0 0.1] x_t + [θ₂; 0] u_t + v_t,  y_t = [1 0] x_t | Parameter learning | V-16
x_{t+1} = atan(x_t) + θ₁u_t + v_t,  y_t = x_t + e_t | Parameter learning | V-17
x_{t+1} = 0.5x_t + b·x_t/(1 + x_t²) + 8 cos(1.2t) + q·v_t,  y_t = 0.05x_t² + e_t | Parameter learning | VI-11
Real-data examples in the thesis
Example | Purpose | Page
Cascaded water tanks | Model learning | I-21
Earthquake counts | Model validation | II-4
Kangaroo population evolution | Model validation | II-13
Wiener-Hammerstein benchmark | Parameter learning | V-19
Sarcos robot arm data | Model learning | VII-8
Oxygen sensors | Fault detection | VII-9