Adaptive Filtering and Nonlinear Models for Post-processing of Weather Forecasts

Institutionen för systemteknik
Department of Electrical Engineering

Master's thesis (Examensarbete)

Adaptive Filtering and Nonlinear Models for Post-processing of Weather Forecasts

Master's thesis carried out in Automatic Control at the Institute of Technology, Linköping University

by Adam Wettring
LiTH-ISY-EX--15/4876--SE

Linköping 2015

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet


Adaptive Filtering and Nonlinear Models for Post-processing of Weather Forecasts

Master's thesis carried out in Automatic Control at the Institute of Technology, Linköping University

by Adam Wettring
LiTH-ISY-EX--15/4876--SE

Supervisors (Handledare): Michael Roth, isy, Linköpings universitet; Fredrik Karlsson, SMHI
Examiner (Examinator): Fredrik Gustafsson, isy, Linköpings universitet


Avdelning, Institution / Division, Department: Division of Automatic Control, Department of Electrical Engineering, SE-581 83 Linköping
Datum / Date: 2015-06-18
Språk / Language: Engelska / English
Rapporttyp / Report category: Examensarbete
URL för elektronisk version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-XXXXX
ISRN: LiTH-ISY-EX--15/4876--SE
Titel / Title: Adaptive Filtering and Nonlinear Models for Post-processing of Weather Forecasts
Författare / Author: Adam Wettring


Abstract

Kalman filters have been used by SMHI to improve the quality of their forecasts. Until now they have used a linear underlying model to do this. In this thesis it is investigated whether the performance can be improved by the use of nonlinear models such as polynomials and neural networks. The results suggest that an improvement is hard to achieve by this approach and that it is likely not worth the effort to implement a nonlinear model.


Contents

Notation

1 Introduction
  1.1 Notational conventions
  1.2 Problem formulation
  1.3 Previous model
  1.4 Data sets
  1.5 Prospective challenges

2 Theory
  2.1 System identification
  2.2 NARMAX-models
  2.3 Neural networks
  2.4 Parameter estimation methods
    2.4.1 Least Mean Squares
    2.4.2 Recursive Least Squares
    2.4.3 The basic Kalman filter
    2.4.4 Levenberg-Marquardt backpropagation
    2.4.5 Sigma-point Kalman filters
  2.5 Feature selection
    2.5.1 Mutual information

3 Method
  3.1 Pilot study
  3.2 Initial experiments
  3.3 Identification procedure
    3.3.1 Choice of parameters
    3.3.2 Linear models
    3.3.3 Polynomial models
    3.3.4 Feed forward networks
  3.4 Tests for overall performance

4 Results
  4.1 Grid selection algorithms
  4.2 Tests on 24 hour Örskär forecasts
  4.3 Overall performance
    4.3.1 Linear models
    4.3.2 Polynomial models
    4.3.3 Single node feed forward sigmoid and radial basis networks

5 Conclusions
  5.1 Experience of using nonlinear models
  5.2 Method
  5.3 Further research

Notation

Common notation

Notation    Meaning
t           Time measured in hours
tk          Indexed time
y           Measured temperature
ftemp       Temperature forecast
u           Regressor vector
x           Parameter vector
v           Process noise vector
n           Measurement noise
e           Innovation noise
P           Covariance matrix for the parameters
Q           Covariance matrix for the model noise
R           Variance of the measurement noise
f           Parameter transition function
h           Regression function
F           Parameter transition matrix
H           Regression matrix


Abbreviations

Abbreviation    Meaning
arma            Auto-regressive moving average
armax           Auto-regressive moving average with exogenous inputs
narmax          Nonlinear auto-regressive moving average with exogenous inputs
ar              Auto-regressive
arx             Auto-regressive with exogenous inputs
narx            Nonlinear auto-regressive with exogenous inputs
oe              Output error
mse             Mean square error
mmse            Minimal mean square error
rmse            Root mean square error
lms             Least mean squares
rls             Recursive least squares
kf              Kalman filter
ekf             Extended Kalman filter
ukf             Unscented Kalman filter
cdkf            Central difference Kalman filter
lmb             Levenberg-Marquardt backpropagation
spkf            Sigma-point Kalman filter
nn              Neural network
lip             Linear in parameters
nlip            Nonlinear in parameters
siso            Single input, single output
miso            Multiple inputs, single output

1 Introduction

In meteorology it is common practice to use physical concepts from, for example, fluid mechanics when creating accurate forecasts. The governing equations for such systems are generally nonlinear and consequently challenging to solve analytically, so numerical methods are required to achieve reasonable results. This implies that the system needs to be discretized with respect to both time and space. In order to do this, the region of interest is divided into a square grid where all the relevant numerical properties for a given square are represented by the mean over the entire square. The resulting models used to generate forecasts are known as numerical forecast models. The spatial discretization has the consequence that the numerical forecast models are only able to produce forecasts for discrete grid points in space. Thus, it is desirable to find a way to produce forecasts at specific points which are not represented by the grid of the numerical forecast models. Reasonable examples of such points are the various measurement stations across the country. In this thesis an attempt is made to produce a statistical regression model that generates forecasts for a given measurement station, based on forecasts from two numerical forecast models, ec and c11, covering grid points in the neighbourhood of the investigated measurement station. The different measurement stations and their surrounding grid points are illustrated in Figure 1.1. Given previous experience of the chaotic nature of the underlying system, it is assumed that the sought model will vary over time and that it is reasonable to use adaptive techniques to update the model. In order for this to work, two essential questions need to be answered:

• What is a suitable model structure?

• What algorithm should be used to update the model?

Under the assumption that the general behaviour of the system is constant over time, it makes sense to use a parametrized model with time varying parameters. Such structures could for example consist of simple linear combinations of forecast values from the different grid points, polynomial models, auto-regressive models and neural networks. Since a goal of the thesis is to verify the possibility to improve the performance by the use of nonlinear model structures, a special focus is placed on nonlinear alternatives such as polynomials, narmax models and neural networks. For the parameter updates, different nonlinear Kalman filter techniques are investigated. Nonlinear modelling has been used together with Kalman filters before with good results (Sweeney et al. [2013], Crochet [2004], Louka et al. [2008], Homleid [1995]) and hopefully similar results can be achieved in this application.

1.1 Notational conventions

The following notational conventions hold for the remainder of this thesis unless otherwise specified. Bold letters are used for vectors and matrices and normal font letters for scalars. Furthermore, capital bold letters are used for matrices and lower case bold letters for vectors. For instance, y is a scalar, x is a vector and H is a matrix. Vectors are always assumed to be column vectors and subscripts are used to reference the elements of a vector, so that for example

$$\mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots \end{bmatrix}^T.$$

Similarly, a double subscript is used for matrices so that $H_{(i,j)}$ is the element at the i:th row and j:th column of the matrix H. In the context of a subscript, the colon is used to specify a range of subscripted elements, which might be a vector, e.g. $\mathbf{x}_{1:3} = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}^T$, or a set, e.g. $t_{1:3} = \{t_1, t_2, t_3\}$.

The hat is used for estimated values, so that for instance $\hat{\mathbf{x}}$ is an estimate of x. In a functional context, the expression g(a|b) is used to denote that g is a function of a given information about b. When a and b are time variables, which will be the case for most of this thesis, this reads

$$g(a|b) = g(t_k|t_{k-s}), \quad (k, s) \in \mathbb{N}, \; k \geq s.$$

1.2 Problem formulation

The main purpose of this thesis is to examine whether nonlinear modelling and Kalman filtering can be used for post-processing of weather forecasts in such a way that forecast errors are reduced compared to the existing linear models. Measurements from nine different measurement stations are given along with forecasts from two different numerical forecast models denoted c11 and ec. The forecasts cover grid points in the general area of each measurement station, as seen in Figure 1.1.

t         Time measured in hours.
yp(tk)    Scalar measurement of temperature from a given measurement station p at time tk.
u(tk)     Regressor for time tk.
x(tk)     Parameters of the model at time tk.
v(tk)     Model noise at time tk.
n(tk)     Measurement noise at time tk.
e(tk)     Innovation noise at time tk.

Table 1.1: Notation used in this chapter.

By using data from the numerical forecast models, and possibly past measurements from the station, one wishes to predict future temperature measurements at the station by using a regression model of the form

$$y_p(t_k) = \hat{y}_p(t_k|t_{k-1}) + e(t_k), \qquad \hat{y}_p(t_k|t_{k-1}) = h\big(\hat{\mathbf{x}}(t_{k-1}|t_{k-1}), \mathbf{u}(t_k)\big). \qquad (1.1)$$

The sequence e(tk) is a noise sequence which in the ideal case is white and Gaussian. The regression model is defined by the function h of the time dependent parameter vector x and the regressor u, consisting of forecasts from the two numerical forecast models and possibly past measurements from the measurement stations. The subscript p will be omitted whenever possible in order to make the text more concise.

As stated earlier, different types of nonlinear models will be investigated, in particular different types of narmax polynomials and neural networks. As the sought system is assumed to be affected by seasonal changes, some kind of adaptive algorithm has to be used in order to adjust the model parameters x. In the linear case there are different methods for dealing with such problems, for example lms (least mean squares) and rls (recursive least squares) (Gustafsson et al. [2010]). These algorithms are however designed for models that are linear in the parameters (lip), which can easily be formulated as a linear regression.

[Figure 1.1: The different measurement stations with grid points for the different forecast models. (a) Nine measurement stations spread out over Sweden. (b) Grid points for forecasts from the ec model (blue) and measurement station location (red); the closest grid point is 4.65 km from the measurement station. (c) Grid points for forecasts from the c11 model (blue) and measurement station location (red); the closest grid point is 3.97 km from the measurement station.]

These algorithms will thus be of no use for neural networks, which are generally nonlinear in the parameters. Instead, a sigma-point implementation, called the central difference Kalman filter (cdkf) (van der Merwe [2004]), will be used in these cases. In order to run the Kalman filters, the problem needs to be formulated on state space form. This is done by assuming that the optimal parameter vector x changes according to a random walk process:

$$\mathbf{x}(t_{k+1}) = \mathbf{x}(t_k) + \mathbf{v}(t_k),$$
$$y(t_k) = h\big(\mathbf{x}(t_k), \mathbf{u}(t_k)\big) + n(t_k). \qquad (1.3)$$

The more advanced models will later be compared to the simple linear models. The Kalman filter methods are also compared to gradient descent based methods like least mean squares (lms) and Levenberg-Marquardt backpropagation (lmb). Validation will be performed with respect to the root mean square error (rmse). The work may be divided into two parts:

• Find reasonable candidates for the function h in (1.1).

• Evaluate different methods for updating the parameters x in order to mini-mize the rmse.

1.3 Previous model

Prior to this thesis, the forecast error

$$y(t_k) - f_{\mathrm{temp}}(t_k) \qquad (1.4)$$

has been corrected by modelling the measurement as an affine function of the forecast,

$$\hat{y}(t_k) = x_1(t_k) + x_2(t_k) f_{\mathrm{temp}}(t_k), \qquad (1.5)$$

where $f_{\mathrm{temp}}(t_k)$ is the forecast from the grid point closest to the station. By setting

$$\mathbf{x}(t_k) = \begin{bmatrix} x_1(t_k) & x_2(t_k) \end{bmatrix}^T, \quad \mathbf{u}(t_k) = \begin{bmatrix} 1 & f_{\mathrm{temp}}(t_k) \end{bmatrix}^T \qquad (1.6)$$

it is possible to rewrite (1.5) as

$$\hat{y}(t_k) = \mathbf{u}^T(t_k)\,\mathbf{x}(t_k). \qquad (1.7)$$

Subsequently, the problem was formulated on state space form

$$\mathbf{x}(t_{k+1}) = \mathbf{x}(t_k) + \mathbf{v}(t_k),$$
$$y(t_k) = \mathbf{H}(t_k)\mathbf{x}(t_k) + n(t_k), \qquad (1.8)$$

where $\mathbf{H}(t_k) = \mathbf{u}^T(t_k)$ and the parameters x(tk) can be updated according to the Kalman filter algorithm explained in Section 2.4.3. By this approach it has been possible to correct the bias of the forecast, but not much more than that. However, by using a linear combination of the closest forecast from each of the two models the variance has been slightly reduced as well.
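As a concrete illustration of how this previously used model fits the notation above, the following is a minimal NumPy sketch (not taken from the thesis or SMHI code; all names and values are illustrative) of the regressor in (1.6) and the measurement equation in (1.8).

```python
import numpy as np

def affine_regressor(f_temp):
    """Regressor u(t_k) = [1, f_temp(t_k)]^T from (1.6)."""
    return np.array([1.0, f_temp])

# Random-walk parameter model (1.8): x(t_{k+1}) = x(t_k) + v(t_k),
#                                    y(t_k)     = H(t_k) x(t_k) + n(t_k)
f_temp = 14.2                 # example forecast value (illustrative)
u = affine_regressor(f_temp)
H = u.reshape(1, -1)          # H(t_k) = u^T(t_k), a 1x2 row vector
x = np.array([0.0, 1.0])      # zero bias and unit gain reproduce the raw forecast
y_hat = float(H @ x)          # predicted temperature at the station
```

With this parametrization, the Kalman filter of Section 2.4.3 only has to track the two parameters x1 and x2.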

[Figure 1.2: Schematic overview of when new data is available. (a) How often new data is available from the measurement stations and the two forecast models. (b) The available forecast hours from the two models.]

1.4 Data sets

Two years' worth of observations and forecast data have been provided. The observation data is collected on an hourly basis from the nine measurement stations in Figure 1.1a. The corresponding forecast data is taken from two different models labeled c11 and ec. Both models are grid based and thus it is impossible to get forecast information for the exact location of a measurement station. Instead the forecast data is taken from grid points in the general area of each station, as seen in Figures 1.1b and 1.1c. A simple approach is to only use data from the grid point which is closest to the station, but it might also be relevant to use data from several grid points in the prediction.

Table 1.2: Overview of the available forecast and measurement data.
• ec forecasts: generated every twelfth hour; forecasts generated for the next 48 hours; one forecast generated for every third hour of the forecasted period.
• c11 forecasts: generated every sixth hour; forecasts generated for the next 60 hours; one forecast generated for every third hour of the first 24 hours and one for every sixth hour of the remaining forecasted period.
• Measurements: available every hour.


Another concern from a practical point of view is that the observations and forecasts are not synchronised. A new batch of c11 forecasts is generated every sixth hour, including forecasts for the next 60 hours. Every such batch gives one forecast for every third hour during the first 24 hours and one forecast for every sixth hour during the rest of the 60 hour period. The ec model, however, provides a new batch of forecasts every twelfth hour, with forecasts for the next 48 hours. Each batch of ec forecasts gives one forecast for every third hour during the entire 48 hour period. The situation is summarized in Table 1.2 and Figure 1.2.

Since the observations are made every hour it is necessary to either interpolate between the forecasts so that one forecast is available for every hour, or to omit at least 2/3 (or 5/6 if the ec model is to be utilized in the model) of the observation data. The first alternative implies that possibly false information will be used during the identification process, while the second alternative implies that a lot of possibly useful information is lost. The second alternative is chosen, with the assumption that there is enough observation data even after discarding. Furthermore, since it is practical to use forecast data from both forecast models, the system will have a sampling resolution of 12 hours.

A summary of the total set of available forecast and measurement quantities can be found in Table 1.3 below, although the temperature quantities are the ones that have been used the most in this project.

Table 1.3: Available forecast and measurement quantities.
• ec model: U wind component (m/s), V wind component (m/s), Dew point temperature (K), Temperature (K), Low cloud cover (%), Mean sea level pressure (Pa), Total cloud cover (%), Total precipitation (kg/ms).
• c11 model: U wind component (m/s), V wind component (m/s), Dew point temperature (K), Temperature (K), Low cloud cover (%), Global radiation flux (W/m2), Pressure (Pa), Snow depth (m), Total cloud cover (%), Total precipitation (kg/ms).
• Measurements: Temperature (degree Celsius), Dew point temperature (K), Wind velocity (m/s), Wind direction (degree true).

1.5 Prospective challenges

The concepts of time invariance and stationarity are very important in the study of stochastic processes and of signals and systems. Since neither the inputs nor the outputs of the system are stationary, but rather drift with seasonal changes, many useful tools become much more difficult to use, such as cross correlation, spectral analysis and other standard methods that assume time invariance. One way to deal with this problem is to only use data collected during a short time interval, over which the inputs and outputs can be regarded as approximately stationary signals. However, the performance of most numerical estimation methods is related to the number of samples being used. This leads to a trade-off where it is desirable to use a lot of data to achieve higher accuracy in the estimates, but by doing so we might get false estimates since the samples have not been drawn from approximately identical distributions. This situation suggests that methods based on assumptions of time invariance should be avoided whenever possible.

During certain periods data is missing from some of the measurement stations. There are a number of ways to cope with this problem. In this project the choice has been made to remove all forecast data corresponding to the affected time step, in effect pretending that the time step never existed. This is a potential problem if the model relies heavily on past measurement and forecast values (e.g. ARX and ARMAX). A way around this problem is to use the existing model to estimate the missing measurements (Ljung [2009]). In this particular case this is a potentially dangerous approach, however, as periods of absent data are usually quite long, sometimes spanning an entire month or more. Given the chaotic nature of the system, this may lead to divergence and consequently disastrous results. Thus the simpler approach is chosen.

2 Theory

The following chapter will summarise the basic theory needed to understand the methods used in this thesis. As the modelling is a substantial part of the project a brief introduction to system identification is given in Section 2.1. Nonlinear models such as narmax and neural networks will be discussed in Sections 2.2 and 2.3. Kalman filter theory is discussed in Section 2.4.

2.1 System identification

System identification is the art of identifying a suitable mathematical model for a system by analysis of a given data set corresponding to the inputs and outputs of the system. Depending on the situation and what is known about the dynamics of the system, the modelling process and the resulting models might take different forms. In the case of a mechanical system, for instance, the governing equations are often known apart from some physical parameters that are to be estimated. In other cases the underlying physical models are unknown or too complicated to be used in practice. If that is the case one has to resort to what are called black box models. In black box models the output is modelled as a function of the inputs and outputs of the system. Usually the model is parametrised in some way so that the optimal parameters can be found by some optimisation algorithm. Different models are subsequently compared to each other with respect to some key values, normally the mean square error (mse), in order to find the best match. This is usually referred to as the validation process. It is of utmost importance that the identification process (i.e. finding the optimal parameters for a given model) and the validation process are not performed on the same data set, since the best candidate is then bound to overfit the estimation data. On the other hand, it is also important that the validation data has been collected under approximately the same circumstances as the estimation data, so that the basic behaviour of the system can be assumed to be the same. A good introduction to black box modelling in the linear case can be found in Ljung [2009] and a similar treatment of the nonlinear case can be found in Billings [2013].

2.2 NARMAX-models

If the system shows signs of nonlinear behaviour, it is necessary to propose some kind of nonlinear model based on the past inputs and outputs of the system. One such model is the narmax model, defined as

$$y(t_k) = f\big(y(t_{k-1}), y(t_{k-2}), \ldots, y(t_{k-n_y}),\; u(t_{k-d}), u(t_{k-d-1}), \ldots, u(t_{k-d-n_u}),\; e(t_{k-1}), e(t_{k-2}), \ldots, e(t_{k-n_e})\big) \qquad (2.1)$$

where d is a delay constant. narmax models can be seen as a generalization of the armax model in the linear case. Most nonlinear models can be seen as a special case of the formulation (2.1), for example Volterra series and recurrent neural networks (Billings [2013]). A reasonable first approach, though, is to assume that the system can be accurately modelled as a series of polynomial combinations of the inputs and outputs of the system. Such polynomials are called narmax polynomials and can generally be expressed on the form

$$y(t_k) = x_0 + \sum_{i_1=1}^{n} x_{i_1} u_{i_1}(t_k) + \sum_{i_1=1}^{n}\sum_{i_2=i_1}^{n} x_{i_1 i_2} u_{i_1}(t_k) u_{i_2}(t_k) + \cdots + \sum_{i_1=1}^{n}\cdots\sum_{i_l=i_{l-1}}^{n} x_{i_1 i_2 \cdots i_l} u_{i_1}(t_k) u_{i_2}(t_k) \cdots u_{i_l}(t_k) + e(t_k) \qquad (2.2)$$

where l is the polynomial order, in the sense that the maximal number of factors in each term is equal to l, and u is the input vector such that

$$\begin{bmatrix} u_1 & u_2 & \cdots & u_{n_y+n_u+n_e+1} \end{bmatrix} = \begin{bmatrix} y(t_{k-1}) & \cdots & y(t_{k-n_y}) & f_{\mathrm{temp},1}(t_k) & \cdots & f_{\mathrm{temp},n_u}(t_k) & e(t_{k-1}) & \cdots & e(t_{k-n_e}) \end{bmatrix},$$

where $n_y$ is the number of auto-regressive terms, $n_u$ is the number of inputs and $n_e$ is the number of past noise terms.
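To make the structure of (2.2) concrete, the monomials of a full polynomial model can be generated mechanically from the input vector. The sketch below is illustrative only (not from the thesis); it assumes the regressor u already contains the chosen past outputs, forecasts and noise terms.

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_regressor(u, order):
    """All products u_{i1}*...*u_{ik} for k = 0..order, i.e. the terms of (2.2).

    The constant 1 corresponds to x_0, the single factors to the first sum,
    the pairwise products to the second sum, and so on.
    """
    u = np.asarray(u, dtype=float)
    terms = [1.0]
    for k in range(1, order + 1):
        for idx in combinations_with_replacement(range(len(u)), k):
            terms.append(float(np.prod([u[i] for i in idx])))
    return np.array(terms)

# Two forecasts and one past measurement, full second degree polynomial:
# 1 constant + 3 linear + 6 quadratic terms = 10 parameters, as in Example 3.1.
u = np.array([14.2, 13.8, 15.0])        # [f_temp,1, f_temp,2, y(t_{k-1})], illustrative
phi = polynomial_regressor(u, order=2)
```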

2.3 Neural networks

Notation               Meaning
w^k_{i,j}              Weight/parameter of input i to node j in layer k
w^k_{b,j}              Bias weight of node j in layer k
u_i                    Input i to the network
y                      Output of the network
Ψ                      Activation function
Node                   Graphical interpretation of a function that connects a number of inputs to one output
Layer                  A collection of parallel nodes
Number of layers       The maximum number of nodes that an input signal has to traverse in order to reach the output
Feed forward network   None of the signals in the network are fed back to previous layers in the network
Recurrent network      Some of the signals are fed back to previous layers in the network

Table 2.1: Summary of the notation and terminology regarding neural networks.

A neural network can be explained as a complex chain of composite functions (Rojas [1996]). Values are given at the inputs of the network and are propagated through it. The method was initially developed by the machine learning community and was inspired by the way the human neural network works. The technique was first and foremost developed for classification problems, but has also been widely used for function approximation and regression. The simplest form of neural network for function approximation is the single layer feed forward network. An example of such a network can be seen in Figure 2.1. The three circles are called nodes and consist of an input function, which is usually a simple addition of the various inputs to the node, and an activation function which transforms the result from the input function to the node output. When using neural networks for function approximation and regression it is common to omit the activation function at the output node (or equivalently, set it equal to the identity function). Common choices of activation functions are the hyperbolic tangent sigmoid function

$$\Psi(x) = \frac{2}{1 + e^{-2x}} - 1 \qquad (2.3)$$

and the radial basis function

$$\Psi(x) = e^{-x^2}. \qquad (2.4)$$

There are other possible choices of activation functions, but these two are the ones that have been used in this thesis. The network in Figure 2.1 has one hidden layer with two hidden units. If the activation functions Ψ were hyperbolic tangent sigmoid functions according to (2.3), the output y could be formulated as a function of the inputs u1 and u2 as

$$y = \sum_{i=1}^{2} w^2_i \, \Psi_i\!\left(\sum_{j=1}^{2} w^1_{j,i} u_j + w^1_{b,i}\right) + w^2_b. \qquad (2.5)$$

[Figure 2.1: A single layer feedforward network with two nodes.]

Equation (2.5) is a lot harder to interpret than Figure 2.1, and the situation gets even worse when adding more inputs, nodes or layers. Therefore it is commonplace to use a graphical representation when talking about neural networks. The example above is a so called feed forward neural network. It is also possible to feed back the outputs from some of the nodes, creating what is called a recurrent neural network. Another possibility is to add more hidden layers, thus making a multilayer neural network.
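A minimal sketch (not part of the thesis) of the forward pass in (2.5), with the hyperbolic tangent sigmoid (2.3) in the hidden layer and a linear output node; the weight values below are arbitrary placeholders.

```python
import numpy as np

def tansig(x):
    """Hyperbolic tangent sigmoid (2.3): 2 / (1 + exp(-2x)) - 1."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def feedforward(u, W1, b1, w2, b2):
    """Single hidden layer feed forward network with a linear output node, as in (2.5).

    u  : input vector (n,)
    W1 : hidden layer weights (m, n), b1 : hidden layer biases (m,)
    w2 : output weights (m,),         b2 : output bias (scalar)
    """
    hidden = tansig(W1 @ u + b1)
    return float(w2 @ hidden + b2)

# Two inputs and two hidden nodes, as in Figure 2.1 (illustrative weights)
u  = np.array([0.3, -0.7])
W1 = np.array([[0.5, -0.2],
               [0.1,  0.4]])
b1 = np.array([0.0, 0.1])
w2 = np.array([1.2, -0.8])
y = feedforward(u, W1, b1, w2, b2=0.05)
```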

2.4 Parameter estimation methods

The different adaptive algorithms used in this project are briefly explained in the sections below. A more thorough treatment of the lms, rls and Kalman filter can be found in Gustafsson et al. [2010]. The Levenberg-Marquardt algorithm is further explained in Hagan and Menhaj [1994]. The sigma point approach to Kalman filters is thoroughly explained in van der Merwe [2004].

2.4.1 Least Mean Squares

For linear in parameters (lip) models, the least mean squares method (lms) is a gradient descent method based on the minimization of the squared prediction error

$$V(\hat{\mathbf{x}}) = \tfrac{1}{2} E\big((y - \hat{y})^2\big). \qquad (2.6)$$

If the regression vector is given by φ and the unknown parameters are given by x, so that

$$\hat{y} = \boldsymbol{\varphi}^T(t)\,\hat{\mathbf{x}}, \qquad (2.7)$$

the steepest descent direction is given by

$$\frac{d}{d\mathbf{x}} V(\mathbf{x}) = E\big(\boldsymbol{\varphi}(t)\,(y(t) - \boldsymbol{\varphi}^T(t)\mathbf{x})\big). \qquad (2.8)$$

The parameters are then updated by taking small steps in the direction of the steepest descent,

$$\hat{\mathbf{x}}(t_k) = \hat{\mathbf{x}}(t_{k-1}) + \mu\,\boldsymbol{\varphi}(t_k)\big(y(t_k) - \boldsymbol{\varphi}^T(t_k)\hat{\mathbf{x}}(t_{k-1})\big), \qquad (2.9)$$

where µ is the step size. A large µ gives faster convergence at the cost of a higher risk of divergence and higher noise sensitivity.
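A one-line NumPy sketch of the lms update (2.9) for a linear-in-parameters model ŷ = φᵀx̂; this is illustrative only and not taken from the thesis implementation.

```python
import numpy as np

def lms_update(x_hat, phi, y, mu=0.0001):
    """One lms step (2.9): a small step along the instantaneous gradient."""
    error = y - phi @ x_hat          # prediction error with the current parameters
    return x_hat + mu * phi * error  # step of length mu in the steepest-descent direction
```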

2.4.2 Recursive Least Squares

The recursive least squares (rls) method is another approach to the problem. The algorithm uses a weighted average of past squared errors,

$$V(\hat{\mathbf{x}}) = \sum_{k=1}^{N} \lambda^{N-k}\big(y(t_k) - \hat{y}(t_k)\big)^2, \qquad (2.10)$$

which after minimization gives the recursive algorithm

$$\hat{\mathbf{x}}(t_k) = \hat{\mathbf{x}}(t_{k-1}) + \mathbf{K}(t_k)\big(y(t_k) - \boldsymbol{\varphi}^T(t_k)\hat{\mathbf{x}}(t_{k-1})\big),$$
$$\mathbf{K}(t_k) = \frac{\mathbf{P}(t_{k-1})\boldsymbol{\varphi}(t_k)}{\lambda + \boldsymbol{\varphi}^T(t_k)\mathbf{P}(t_{k-1})\boldsymbol{\varphi}(t_k)},$$
$$\mathbf{P}(t_k) = \frac{1}{\lambda}\left(\mathbf{P}(t_{k-1}) - \frac{\mathbf{P}(t_{k-1})\boldsymbol{\varphi}(t_k)\boldsymbol{\varphi}^T(t_k)\mathbf{P}(t_{k-1})}{\lambda + \boldsymbol{\varphi}^T(t_k)\mathbf{P}(t_{k-1})\boldsymbol{\varphi}(t_k)}\right). \qquad (2.11)$$

The parameter λ works as a forgetting factor. A small λ gives a filter which is not much affected by old errors and results in faster convergence, at the cost of higher noise sensitivity, and vice versa.
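A corresponding sketch of one rls step (2.11), again for a linear regression ŷ = φᵀx̂; P is the quantity updated in (2.11), and the names and default forgetting factor are illustrative, not the thesis code.

```python
import numpy as np

def rls_update(x_hat, P, phi, y, lam=0.985):
    """One rls step (2.11) with forgetting factor lam."""
    denom = lam + phi @ P @ phi
    K = (P @ phi) / denom                                 # gain vector K(t_k)
    x_hat = x_hat + K * (y - phi @ x_hat)                 # parameter update
    P = (P - np.outer(P @ phi, phi @ P) / denom) / lam    # update of P(t_k)
    return x_hat, P
```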

2.4.3 The basic Kalman filter

After initiating the state estimate x̂(t1) and the covariance matrix P0, the Kalman filter is performed in two steps. The time update:

$$\hat{\mathbf{x}}(t_{k+1}|t_k) = \mathbf{F}(t_k)\hat{\mathbf{x}}(t_k|t_k),$$
$$\mathbf{P}(t_{k+1}|t_k) = \mathbf{F}(t_k)\mathbf{P}(t_k|t_k)\mathbf{F}^T(t_k) + \mathbf{Q}(t_k), \qquad (2.13)$$

and the measurement update:

$$\hat{\mathbf{x}}(t_k|t_k) = \hat{\mathbf{x}}(t_k|t_{k-1}) + \mathbf{K}(t_k)\big(y(t_k) - \mathbf{H}(t_k)\hat{\mathbf{x}}(t_k|t_{k-1})\big),$$
$$\mathbf{P}(t_k|t_k) = \mathbf{P}(t_k|t_{k-1}) - \mathbf{K}(t_k)\mathbf{H}(t_k)\mathbf{P}(t_k|t_{k-1}), \qquad (2.14)$$

where the Kalman gain K is defined as

$$\mathbf{K}(t_k) \triangleq \mathbf{P}(t_k|t_{k-1})\mathbf{H}^T(t_k)\big(\mathbf{H}(t_k)\mathbf{P}(t_k|t_{k-1})\mathbf{H}^T(t_k) + \mathbf{R}(t_k)\big)^{-1}. \qquad (2.15)$$

Under the condition that x(t0), n(tk) and v(tk) are Gaussian, the Kalman filter above is in fact the best possible estimator among all linear and nonlinear ones, in the sense that it minimizes the variance of the estimate. Even if the distributions are non-Gaussian, the Kalman filter still produces the best estimate among all linear filters, provided that the covariance matrices P(t0), Q(tk) and R(tk) reflect the correct second order moments. However, since the covariance matrices are generally unknown in most applications, it is usually necessary to try different configurations until a desired performance is obtained.
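The sketch below (illustrative, not the thesis code) applies the updates (2.13)-(2.15) to the parameter tracking problem (1.8), where F = I, H(t_k) = uᵀ(t_k) and the measurement is scalar; the tuning values are the ones quoted for the linear models in Chapter 4.

```python
import numpy as np

def kf_parameter_step(x_hat, P, u, y, Q, R):
    """One Kalman filter step for the random-walk parameter model (1.8)."""
    # Time update (2.13) with F = I
    P = P + Q
    # Measurement update (2.14)-(2.15) with H = u^T and a scalar measurement y
    H = u.reshape(1, -1)
    S = float(H @ P @ H.T) + R                  # innovation variance
    K = (P @ H.T) / S                           # Kalman gain, shape (n, 1)
    x_hat = x_hat + (K * (y - float(H @ x_hat))).ravel()
    P = P - K @ H @ P
    return x_hat, P

# Example with R = 1000, Q = 0.01*I, P0 = 0.1*I (the tuning used for the linear models)
n = 2
x_hat, P = np.zeros(n), 0.1 * np.eye(n)
Q, R = 0.01 * np.eye(n), 1000.0
x_hat, P = kf_parameter_step(x_hat, P, u=np.array([1.0, 14.2]), y=13.5, Q=Q, R=R)
```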

2.4.4 Levenberg-Marquardt backpropagation

The most basic algorithm to train neural networks is the classic backpropagation algorithm based on the gradient descent. In this project a modification of the backpropagation algorithm based on Levenberg-Marquardt optimization is used instead (Hagan and Menhaj [1994]). The algorithm is used as a default algorithm in the Matlab Neural Networks toolbox.

2.4.5 Sigma-point Kalman filters

The sigma point approach to Kalman filtering (Gustafsson [2012]) has been studied together with neural networks (van der Merwe [2004], Haykin [2001]) with promising results. The sigma point approach is based on an estimation of the parameter distribution through the use of so called sigma points. For a parameter vector of size L, the first two moments of the parameter distribution are approximated by 2L + 1 sigma points. An advantage of the sigma point approach is that it does not require any calculations of the Jacobian or Hessian, which is convenient in the case of neural networks where those expressions might become cumbersome to work with. There are different ways to implement the sigma point approach, e.g. the unscented Kalman filter (ukf), the central difference Kalman filter (cdkf) and the sigma point approach to the second order extended Kalman filter (Roth and Gustafsson [2011]). In this thesis, the cdkf, described in van der Merwe [2004], is used as the sigma-point kf.

2.5 Feature selection

Since there are many possible forecasts that can be used as inputs to the model it would be appropriate to have some kind of feature selection that decides which forecasts to use. Methods based on K-nearest neighbour and mutual information have been used in similar situations to find appropriate features in long-term time series predictions (Sorjamaa et al. [2007]). In the case when many forecasts are used it would also be preferable if all the inputs add unique information about the output so that redundancy is minimized. That is, two inputs should not be too similar to each other.

2.5.1 Mutual information

One way of doing this is to use an algorithm based on the mutual information

$$I(Y, X_i) = \iint p_{y,x_i}(y, x_i) \log \frac{p_{y,x_i}(y, x_i)}{p_y(y)\, p_{x_i}(x_i)} \, dy\, dx_i \qquad (2.16)$$

between the input Xi and the output Y, and the conditional mutual information

$$I(Y, X_i | X_j) = E_{X_j}\big[ I(Y, X_i) \,|\, X_j \big] = \int p_{x_j}(x_j) \iint p_{y,x_i|x_j}(y, x_i|x_j) \log \frac{p_{y,x_i|x_j}(y, x_i|x_j)}{p_{y|x_j}(y|x_j)\, p_{x_i|x_j}(x_i|x_j)} \, dy\, dx_i\, dx_j \qquad (2.17)$$

between the input Xi and the output Y given knowledge about the already chosen input Xj (Fleuret [2004]). The main problem with this approach is apparent from the above expressions, namely that it demands approximations of generally unknown distribution functions. There are different ways to do this. One is to use a K nearest neighbour approach (Kraskov et al. [2004]). A low value of K gives a low bias but a larger variance, and vice versa. Some initial experimentation led to the conclusion that K = 6 is a good trade-off, and that is the value used in this thesis. A simple algorithm that can deal with the issue of redundancy among the estimators is given in Fleuret [2004]. The algorithm can be summarized as follows:

$$v(1) = \arg\max_n \hat{I}(Y, X_n),$$
$$v(k+1) = \arg\max_n \left( \min_{l \leq k} \hat{I}(Y, X_n | X_{v(l)}) \right), \quad 1 \leq k \leq K. \qquad (2.18)$$
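A sketch of the greedy selection rule (2.18). The mutual information estimators are assumed to be provided, for example by a K-nearest-neighbour estimator with K = 6 as discussed above; the function names and interface below are illustrative placeholders, not from the thesis.

```python
def greedy_mi_selection(candidates, mi, cond_mi, n_select):
    """Greedy feature selection according to (2.18).

    candidates    : list of candidate input indices
    mi(i)         : estimate of I(Y, X_i)            (assumed helper)
    cond_mi(i, j) : estimate of I(Y, X_i | X_j)      (assumed helper)
    """
    selected = [max(candidates, key=mi)]    # v(1) = argmax_n I(Y, X_n)
    while len(selected) < n_select:
        remaining = [i for i in candidates if i not in selected]
        # v(k+1): the candidate whose worst-case conditional MI with respect to
        # the already selected inputs is largest, which penalizes redundant inputs
        best = max(remaining, key=lambda i: min(cond_mi(i, j) for j in selected))
        selected.append(best)
    return selected
```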

3 Method

A lot of different approaches have been investigated during the course of this project, and some of them have been more fruitful than others. Only the most significant approaches are discussed in the following sections, where the most important experiments are explained in more detail. A brief summary of some of the initial ideas and attempts is also given.

3.1 Pilot study

The project started out with an extensive pilot study spanning approximately one month, where the concepts of nonlinear modelling and neural networks were studied in order to prepare for the challenges ahead. During the same time the student also became familiar with the large amount of available data.

3.2 Initial experiments

The initial experiments were focused on the narmax polynomials mentioned in Section 2.2. Some time was also spent on trying to use model validation techniques based on correlation analysis (Billings and Voon [1985]) to find suitable candidates. This was initially seen as a promising approach, as it would presumably give information about which part of the narmax model needed more terms, but also as a means to justify the need for a nonlinear model in the first place. This approach unfortunately faced some challenges. The method assumes that each input sequence is generated from zero mean processes with odd moments equal to zero, which does not hold for the forecasts. The method would also be more complicated to work with when using multiple inputs.

There were also brief attempts to use the gamma test (Jones [2004]) to find an estimate of the smallest possible variance of the noise sequence which cannot be explained by the inputs. This was planned to be used as a reference for the various model types and also as a guideline for when to stop looking for improvements. The idea was later discarded, as the method is designed for time invariant systems, which makes it less suitable to compare with the adaptive models. Furthermore, large amounts of data are needed to give good results for inputs of higher dimensions. Pure auto-regressive models were also tested in order to investigate how much the past measurements would contribute to the total model. It turned out that the past measurements are of good use for short term forecasts but of less significance for long term forecasts.

3.3 Identification procedure

For the remainder of the project a more straightforward identification approach was used, based on testing different models and comparing their resulting rmse. The approach is described below, first in a general setting and then in more detail for the various model types. A sketch of the procedure is found in Figure 3.1.


First, a measurement station and a forecast hour are chosen. Then measurements are extracted in such a way that there is a measurement for each of the available forecasts. The first year of the resulting data is then used as training data and the second year is used as validation data. Sometimes measurements are missing for some of the forecasts; if that is the case, the corresponding time instance is discarded from the data set. Relevant grid points are then decided, either by choosing the closest grid points of the two forecast models or by the use of the mutual information approach explained in Section 2.5. Subsequently, a model type is chosen, starting with the linear models, to then continue with polynomial and neural network models. For each model type, the identification procedure starts off with a single forecast from a grid point selected either by the distance to the measurement station or by the maximum mutual information approach. The parameters of the model are then estimated by the use of batch methods (least squares for the lip models and Levenberg-Marquardt for the network models) over the training data. The resulting parameters are then used as initial estimates for the adaptive algorithms described in Section 2.4, which are then run sequentially over both sets of data. The parameters of the adaptive algorithms are tuned in advance to values corresponding to a slow adaptation rate, which in general yields good results. The rmse is then calculated for the validation data and is used as a measure of quality. Additional inputs are added one by one until the rmse of the estimate is no longer decreasing. Then, auto-regressive terms are added, again until the rmse of the estimate stops decreasing. The procedure is repeated for the other model types.

3.3.1 Choice of parameters

Ideally, the choice of parameters for the adaptive algorithms should be retuned for every new measurement station, forecast hour, model type and set of inputs. Since this would make the process extremely time consuming the parameters are tuned in advance so that a reasonably good performance is achieved. The resulting parameters are then used for the other possible model configurations unless some of the algorithms are severely underperforming with the current choice of parameters, in which case the relevant parameters are retuned.

3.3.2 Linear models

Linear models consisting of a linear combination of different forecasts and auto regressive terms, with an additional bias correction term, were used as a reference during the project. After a suitable set of inputs u1, u2, . . . , un has been selected, the parameters of the models were estimated by sorting the inputs and outputs of the training data into matrices

$$\boldsymbol{\Phi} = \begin{bmatrix} 1 & u_1(t_1) & \cdots & u_n(t_1) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & u_1(t_N) & \cdots & u_n(t_N) \end{bmatrix}, \quad \mathbf{Y} = \begin{bmatrix} y(t_1) \\ \vdots \\ y(t_N) \end{bmatrix}, \quad \boldsymbol{\Theta} = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_n \end{bmatrix},$$

where N is the size of the training data, and then applying the least squares method to the system

$$\mathbf{Y} = \boldsymbol{\Phi}\,\boldsymbol{\Theta}.$$

The resulting parameters are then used to initialize the kf, rls and lms algorithms. The algorithms were then run through both years of data, and the rmse was calculated for the second year. Additional forecasts were added until the performance stopped increasing, then additional auto regressive terms were added in the same way.
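The batch initialization described above amounts to an ordinary least squares fit; a minimal sketch (variable names illustrative, not the thesis code) could look as follows.

```python
import numpy as np

def batch_least_squares(U, y):
    """Solve Y = Phi * Theta in the least squares sense for the linear model.

    U : (N, n) matrix of inputs over the training data
    y : (N,) vector of measured temperatures
    """
    Phi = np.column_stack([np.ones(len(y)), U])      # prepend the bias column
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta                                     # used to initialize kf, rls and lms
```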

3.3.3 Polynomial models

Polynomial models were the first nonlinear models to be investigated. After a suitable collection of inputs has been chosen, the parameters were estimated using the same methodology as for the linear models. The full polynomial models described in (2.2) were used with increasing polynomial order until the performance of the predictions started decreasing.

Example 3.1: Full second degree polynomial. An example of a full polynomial is

$$\hat{y}_p(t_k) = \hat{x}_1 + \hat{x}_2 f_{\mathrm{temp},1}(t_k) + \hat{x}_3 f_{\mathrm{temp},2}(t_k) + \hat{x}_4 y(t_{k-1}) + \hat{x}_5 f_{\mathrm{temp},1} f_{\mathrm{temp},2} + \hat{x}_6 f_{\mathrm{temp},1} y(t_{k-1}) + \hat{x}_7 f_{\mathrm{temp},2} y(t_{k-1}) + \hat{x}_8 f_{\mathrm{temp},1}^2 + \hat{x}_9 f_{\mathrm{temp},2}^2 + \hat{x}_{10} y^2(t_{k-1}) \qquad (3.1)$$

where $f_{\mathrm{temp},1}$ is the temperature forecast from the closest ec grid point, $f_{\mathrm{temp},2}$ is the temperature forecast from the closest c11 grid point, and the time argument has been omitted for brevity in the higher order terms.

3.3.4 Feed forward networks

Before training, the training data was normalized to the range between −1 and 1 using the transform

$$y = 2\,\frac{y - y_{\min}}{y_{\max} - y_{\min}} - 1, \qquad u_i = 2\,\frac{u_i - u_{i,\min}}{u_{i,\max} - u_{i,\min}} - 1, \qquad (3.2)$$

where $y_{\max}$ and $y_{\min}$ are the maximum and minimum of the training output and $u_{i,\max}$ and $u_{i,\min}$ are the maximum and minimum of input i. Feed forward networks were then trained in batch form over the first year of data for the several forecast hours, using the Levenberg-Marquardt algorithm implemented in the Neural Networks toolbox of MATLAB. The resulting network was later updated adaptively over the course of the two years of data using the cdkf and the adaptive form of the Levenberg-Marquardt algorithm, where all the data had been transformed with (3.2) using the maximum and minimum values of the training data. The output was subsequently transformed back to the original range by using the inverse of the transform (3.2).

[Figure 3.2: A single layer feedforward network with one node and the closest ec and c11 forecasts as inputs, along with three autoregressive terms. The black dots on the vertices represent the network weights.]
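A sketch of the normalization (3.2) and its inverse as they might look in NumPy (illustrative, not the thesis code); the training minimum and maximum are reused when transforming the validation data.

```python
import numpy as np

def scale_to_unit(z, z_min, z_max):
    """Map z to [-1, 1] using the training minimum and maximum, as in (3.2)."""
    return 2.0 * (z - z_min) / (z_max - z_min) - 1.0

def scale_back(z_scaled, z_min, z_max):
    """Inverse of (3.2): map a network output back to the original range."""
    return (z_scaled + 1.0) / 2.0 * (z_max - z_min) + z_min

y_train = np.array([-12.3, 4.1, 18.7])   # illustrative temperatures
y_scaled = scale_to_unit(y_train, y_train.min(), y_train.max())
y_back = scale_back(y_scaled, y_train.min(), y_train.max())
```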

Single layer networks with one hidden node and a hyperbolic tangent sigmoid as transfer function were tried first, starting with one input and then increasing the number of inputs until the lowest rmse of the three candidates had stopped decreasing. Then, additional hidden units were added, one at a time, until the best candidate's rmse had stopped decreasing. The activation function of the network was subsequently changed to the radial basis function, and the same procedure was repeated.

Example 3.2. One example of a simple feed forward network is found in Figure 3.2. It is a single layer network with a hyperbolic tangent sigmoid, and it also happens to be the network with the best average performance over all the investigated stations and forecast hours in this project.

3.4 Tests for overall performance

The final experiments are performed by using each investigated model type on all the nine available measurement stations for the 6 h, 12 h, 24 h, 36 h and 48 h forecast hours. That is, for each model type 45 different prediction series are produced, using the kf, lms and rls for the lip models and the cdkf and lmb for the neural network models. For the neural network models, the clean batch prediction is given as well. By letting the sets

$$L = \{6, 12, 24, 36, 48\}$$

and

$$P = \{\text{Ritsem, Parkalompolo, Sylarna, Junsele, Kerstinbo, Stockholm, Örskär, Fårösund, Hörby}\},$$

the total mean rmse over the five forecast hours and nine measurement stations can then be computed as

$$\frac{1}{9}\sum_{p \in P} \frac{1}{5}\sum_{l \in L} \sqrt{\frac{1}{N}\sum_{k=1}^{N} \big(y_p(t_k) - \hat{y}_p(t_k|t_k - l)\big)^2}. \qquad (3.3)$$

This result is then used to identify the best model. Since the value will be different for each adaptive algorithm, the best result is used.
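The averaging in (3.3) can be written directly as nested means over stations and forecast hours; the sketch below assumes the prediction errors are stored per station and forecast hour (names illustrative, not the thesis code).

```python
import numpy as np

def total_mean_rmse(errors):
    """Mean rmse over stations and forecast hours as in (3.3).

    errors[p][l] is an array of prediction errors y_p(t_k) - y_hat_p(t_k | t_k - l)
    for station p and forecast hour l, evaluated over the validation year.
    """
    station_means = []
    for per_hour in errors.values():
        rmses = [np.sqrt(np.mean(np.square(e))) for e in per_hour.values()]
        station_means.append(np.mean(rmses))    # average over the five forecast hours
    return float(np.mean(station_means))        # average over the nine stations
```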

4 Results

The following chapter presents some of the results of the experimental procedure explained in Chapter 3. The results describe the performance of the investigated methods for a specific measurement station and forecast hour, as well as the average performance of the methods over all available measurement stations. The performance of the grid selection algorithm described in Section 2.5 is also discussed.

4.1 Grid selection algorithms

For each measurement station and forecast hour, the mutual information approach described in Section 2.5 is used to estimate which grid point is the best predictor of the temperature measurements. The selected grid point is then used as input to a one-input linear combination model which is Kalman filtered using the parameters

$$R = 1000, \quad Q = 0.01\,\mathbf{I}, \quad P_0 = 0.1\,\mathbf{I}.$$

The corresponding system with the closest grid point used as input is estimated accordingly. The difference

$$\mathrm{rmse}(\hat{y}_{\mathrm{MI}}) - \mathrm{rmse}(\hat{y}_{\mathrm{closest}})$$

between the rmse of the two methods is displayed in Table 4.1. Figure 4.1 shows in blue the stations that on average would benefit from using the mutual information (MI) approach, and the others in red.

Table 4.1: Difference between the rmse of the MI algorithm and of using the closest grid points.

               6 h        12 h       24 h       36 h       48 h
Ritsem         0          0.258204   0.298394   0.199714   0.057434
Parkalompolo   0.182313   -0.07621   -0.18415   -0.20813   -0.36133
Sylarna        -0.0182    0.037153   -0.15307   0.032446   0.019217
Junsele        0.058534   -0.25948   -0.2148    -0.13411   -0.18552
Kerstinbo      0.016942   0          0.00789    -0.00244   0.032486
Stockholm      -0.01255   -0.09017   -0.09594   -0.09374   -0.16917
Örskär         -0.11784   -0.11221   -0.03297   -0.24383   -0.2613
Fårösund       0          0          -0.02835   0.03124    -0.00073
Hörby          -0.14829   0          0.046669   0.024251   0.055985

[Figure 4.1: Stations where the grid points are better estimated by the MI approach than by the simple approach of using the closest grid points are coloured in blue, the rest are coloured in red.]

4.2 Tests on 24 hour Örskär forecasts

An example of the identification procedure described in Section 3.3, applied to the 24 hour temperature forecast of Örskär, is given in this section. For this particular station the MI approach performs better than using the closest grid points, as seen in Figure 4.1, and the MI approach was therefore used to choose the grid points for this station. A bar chart describing the rmse performance of the linear combination, second degree polynomial and some feed forward network models, as a function of their number of inputs, can be found in Figure 4.3. The diagrams show that the network models typically do not improve by the use of adaptive update algorithms. The best achieved performance of each specific model type is found in Table 4.2, with the corresponding bar chart in Figure 4.2. One thing worth noting about the results is that the best achieved result gets increasingly worse as the model structure gets more complicated, a fact that seems to hold both for the polynomial models and for the single layer feed forward networks. Based on the best performing neural network model, some more complex network architectures (simple recurrent structures and structures with multiple layers) were examined with the help of MATLAB's neural network toolbox. The architectures and corresponding rmse results are given in Figure 4.4. As the figure suggests, the additional complexity does not seem to be of much benefit for the rmse performance compared to the single node sigmoid network with three inputs.

Table 4.2: Best results for Örskär. The linear combination model produces the best results, although the second degree polynomial and single node networks yield comparable performance. It is also worth noting that the network models do not benefit from the adaptive algorithms in any of the cases.

                             RMSE      Inputs   Algorithm
Linear combination           0.9878    4        RLS
Second degree polynomial     1.0104    2        RLS
1 node sigmoid network       1.029     3        No adaptation
2 node sigmoid network       1.1195    2        No adaptation
3 node sigmoid network       1.1435    2        No adaptation
1 node radbas network        1.0349    2        No adaptation
2 node radbas network        1.1166    4        No adaptation
3 node radbas network        1.1073    4        No adaptation

4.3 Overall performance

In this section the same experiments are performed for all the measurement stations and forecast hours. The mean rmse is used to identify a model with good average performance, as explained in Section 3.4. For the lip models, the standard Kalman filter described in Section 2.4.3 is used with the covariance matrices in (4.1). For the nlip models a cdkf, described in Section 2.4.5, was used with the same covariance matrices and with $h = \sqrt{3}$. The lmb was used with parameter µ = 0.001. A comparison between the residuals of the best performing linear, polynomial and single layer sigmoid and radial basis network models is shown in Figure 4.11.

[Figure 4.2: The best performance achieved for some of the different model types.]

4.3.1 Linear models

The best linear model on average turned out to be an arx model consisting of the 5 closest grid points and three auto regressive terms. The results for the nine measurement stations are summarized in Table 4.3. The uppermost part of the table depicts the rmse of the Kalman filtered model with parameters

$$R = 1000, \quad Q = 0.01\,\mathbf{I}, \quad P_0 = 0.1\,\mathbf{I}. \qquad (4.1)$$

The table in the middle depicts the lms version with step length µ = 0.0001 and the lowermost table depicts the rls version with forgetting factor λ = 0.985. The residuals of the estimates from the second year of the Kalman filtered system and their corresponding histograms can be seen in Figures 4.5 and 4.6. The figures are for the 24 hour Örskär forecast and the 48 hour Parkalompolo forecast. They are picked since the former has a high signal to noise ratio and is thus easy to predict, whereas the latter has a low signal to noise ratio and is difficult to predict.

[Figure 4.3: The rmse for the various models, as a function of the number of inputs. Blue bars: kf for lip models, cdkf for network models. Green bars: lms for lip models, lmb for network models. Yellow bars: rls for lip models, no adaptation for network models. Horizontal line: best linear combination result.]

[Figure 4.4: Recurrent and multilayer extensions to the one node sigmoid network with three inputs.]


4.3.2 Polynomial models

Polynomial models are estimated using the same procedure as the linear models. The best second degree full polynomial model is given by the closest grid point, without any autoregressive terms. The rmse of this model is given in Table 4.4. Residuals and histograms for the 24 hour Örskär forecast and the 48 hour Parkalompolo forecast are given in Figures 4.7 and 4.8.

4.3.3 Single node feed forward sigmoid and radial basis networks

The best single node feed forward sigmoid network turned out to be the one with three auto regressive terms and forecasts from the closest ec and c11 grid points as inputs. Results are given in Table 4.5.

Table 4.3: rmse for all the investigated forecast hours and measurement stations. Linear combination, input: 3 closest ec, 2 closest c11, 3 AR. The different adaptive algorithms yield similar performance.

KF (R = 1000, Q = 0.01·I, P_0 = 0.1·I)
               6 h        12 h       24 h       36 h       48 h       Mean
Ritsem         1.555165   1.706084   1.787401   1.936869   2.067168   1.803265
Parkalompolo   1.94678    2.109415   2.264917   2.33848    2.611271   2.243736
Sylarna        1.489411   1.555576   1.59747    1.64622    1.690758   1.588499
Junsele        1.935092   1.933059   2.068807   2.158591   2.320031   2.073472
Kerstinbo      1.404449   1.593343   1.696479   1.784693   1.951061   1.678199
Stockholm      1.106744   1.162282   1.23117    1.321894   1.433421   1.24531
Örskär         0.97225    1.033665   1.078378   1.165907   1.276803   1.100283
Fårösund       0.979601   1.089772   1.151217   1.226665   1.290324   1.142203
Hörby          0.916696   1.139455   1.226564   1.328954   1.43827    1.204386
Mean           1.366116   1.478814   1.565367   1.654818   1.784781   1.562834

LMS (µ = 0.0001)
               6 h        12 h       24 h       36 h       48 h       Mean
Ritsem         1.620877   1.784013   1.86532    2.0184     2.163068   1.888229
Parkalompolo   1.984545   2.133876   2.265932   2.346027   2.598606   2.258714
Sylarna        1.518557   1.584608   1.622398   1.688156   1.732029   1.63944
Junsele        1.937465   1.932724   2.063763   2.142548   2.297098   2.115252
Kerstinbo      1.400752   1.615856   1.709475   1.804233   1.96278    1.695226
Stockholm      1.146917   1.186242   1.258876   1.3421     1.460289   1.278181
Örskär         1.005111   1.067146   1.118712   1.202765   1.291684   1.132252
Fårösund       1.02254    1.107722   1.174254   1.242713   1.287317   1.167158
Hörby          0.933029   1.177089   1.252698   1.345409   1.448206   1.227146
Mean           1.402691   1.515169   1.597471   1.685653   1.80802    1.603933

RLS (λ = 0.985)
               6 h        12 h       24 h       36 h       48 h       Mean
Ritsem         1.61077    1.755905   1.81992    1.950948   2.085303   1.837128
Parkalompolo   2.003337   2.133007   2.312048   2.384302   2.669274   2.289744
Sylarna        1.509907   1.587041   1.678467   1.699928   1.760298   1.639503
Junsele        1.944194   1.941748   2.082733   2.19477    2.362477   2.095438
Kerstinbo      1.421034   1.619154   1.734214   1.783762   1.968305   1.697399
Stockholm      1.087522   1.179417   1.241769   1.328859   1.450789   1.251849
Örskär         0.959694   1.024251   1.067777   1.142594   1.282475   1.090287
Fårösund       1.016604   1.106277   1.171564   1.257721   1.328718   1.170732
Hörby          0.939463   1.124517   1.225121   1.335809   1.458775   1.211104
Mean           1.386799   1.495316   1.591031   1.673735   1.816672   1.585459

[Figure 4.5: Örskär, Kalman filtered 24 hour forecast.]
[Figure 4.6: Parkalompolo, Kalman filtered 48 hour forecast.]

Table 4.4: rmse for all the investigated forecast hours and measurement stations. Full second degree polynomial, input: closest ec and c11 grid points, 1 AR. Here, the lms algorithm clearly does not work. The rls and kf yield acceptable performance.

KF (R = 1000, Q = 0.01·I, P_0 = 0.1·I)
               6 h        12 h       24 h       36 h       48 h       Mean
Ritsem         1.842455   2.053445   2.086616   2.220871   2.305169   2.101711
Parkalompolo   2.244277   2.435457   2.670793   2.749819   3.003798   2.620829
Sylarna        1.707838   1.871548   1.825714   1.879321   1.93915    1.844714
Junsele        2.18587    2.282789   2.436438   2.544014   2.912094   2.472241
Kerstinbo      1.685046   1.905364   1.990351   2.06066    2.312495   1.990783
Stockholm      1.190273   1.240159   1.335181   1.441854   1.59209    1.359911
Örskär         1.133736   1.348122   1.351818   1.425908   1.627994   1.377516
Fårösund       1.130893   1.285916   1.393671   1.445877   1.52005    1.355281
Hörby          1.031774   1.271447   1.371651   1.557425   1.766382   1.399736
Mean           1.572463   1.743805   1.829137   1.925083   2.108802   1.835858

LMS (µ = 2e-07)
               6 h        12 h       24 h       36 h       48 h       Mean
Ritsem         5.83628    5.96989    5.89189    5.85038    5.796001   5.868888
Parkalompolo   6.351706   6.470417   6.482701   6.465626   6.451692   6.444428
Sylarna        4.92557    5.040273   5.017291   4.936351   5.003392   4.984575
Junsele        6.049171   6.061764   6.085798   6.067997   6.06659    6.066264
Kerstinbo      4.937148   4.912465   4.926121   4.952881   4.979108   4.941545
Stockholm      4.260264   4.143135   4.151068   4.167948   4.209672   4.186417
Örskär         3.448455   3.410277   3.43928    3.424208   3.478258   3.440096
Fårösund       2.318189   2.331444   2.385217   2.444142   2.449116   2.385622
Hörby          3.754969   3.334623   3.413809   3.421191   3.461908   3.4773
Mean           4.653528   4.630477   4.643686   4.636747   4.655082   4.643904

RLS (λ = 0.985)
               6 h        12 h       24 h       36 h       48 h       Mean
Ritsem         1.618373   1.818598   1.903536   2.020669   2.171225   1.90648
Parkalompolo   2.147165   2.214334   2.39581    2.44482    2.677345   2.375895
Sylarna        1.536911   1.614504   1.699887   1.668557   1.720718   1.648115
Junsele        2.012718   2.006686   2.144405   2.26262    2.471753   2.179636
Kerstinbo      1.433631   1.581063   1.704362   1.766552   1.997658   1.696653
Stockholm      1.114823   1.101993   1.195942   1.274053   1.403515   1.218065
Örskär         1.098278   1.298187   1.259358   1.374164   1.546355   1.315268
Fårösund       1.047701   1.155196   1.303482   1.316979   1.40537    1.245745
Hörby          0.943006   1.117854   1.217429   1.349334   1.497521   1.225029
Mean           1.439178   1.545379   1.647134   1.71975    1.876829   1.645654

[Figure 4.7: Örskär, Kalman filtered 24 hour forecast.]
[Figure 4.8: Parkalompolo, Kalman filtered 48 hour forecast.]

Table 4.5: rmse for all the investigated forecast hours and measurement stations. Feed forward sigmoid network with one node, input: closest ec, closest c11, 3 AR. It is apparent that the recursive algorithms face difficulties in many cases.

CDKF (R = 100000000, Q = 0.01·I, P_0 = 0.1·I, h = 1.7321)
               6 h        12 h       24 h       36 h       48 h       Mean
Ritsem         1.741782   1.901232   2.048643   2.073258   2.16404    1.985791
Parkalompolo   2.064633   2.191452   2.294878   2.373864   2.621143   2.309194
Sylarna        1.570347   1.667339   1.689344   1.741754   4.543303   2.242417
Junsele        22.50985   18.09696   17.63939   17.37864   18.30749   18.78647
Kerstinbo      8.187076   11.58374   12.61019   11.06632   11.62594   11.01465
Stockholm      8.657043   6.081724   3.915434   3.789483   3.974322   5.283601
Örskär         1.301334   4.937457   4.194078   5.451443   1.474743   3.471811
Fårösund       8.079562   9.076712   9.856811   7.360522   7.099478   8.294617
Hörby          7.305896   11.41559   11.15986   12.77862   12.52143   11.03628
Mean           6.824169   7.439134   7.267625   7.112656   7.147988   7.158314

LM
               6 h        12 h       24 h       36 h       48 h       Mean
Ritsem         1.733994   1.873986   1.980882   2.56511    2.257178   2.08223
Parkalompolo   2.105044   2.239621   2.371566   2.449885   2.722121   2.377647
Sylarna        1.56875    1.634813   1.758122   1.912229   3.10796    1.996375
Junsele        3.164305   2.820897   2.993704   2.958993   3.091743   3.005929
Kerstinbo      6.586221   6.43359    10.53961   4.396618   3.92851    6.37691
Stockholm      2.152785   1.734442   1.833161   2.033205   2.043427   1.959404
Örskär         1.342979   2.449228   2.512393   2.997912   1.70674    2.20185
Fårösund       18.29438   3.162672   3.155455   3.147947   9.158927   7.383876
Hörby          5.495971   1.686041   1.836069   1.961012   2.225929   2.641004
Mean           4.716048   2.670588   3.220107   2.713657   3.360281   3.336136

No adaptation
               6 h        12 h       24 h       36 h       48 h       Mean
Ritsem         1.745413   1.906732   2.049178   2.076288   2.166121   1.988746
Parkalompolo   2.065247   2.192862   2.296166   2.374719   2.623833   2.310565
Sylarna        1.570119   1.666938   1.687311   1.734931   1.798478   1.691555
Junsele        2.116894   2.157753   2.289283   2.359743   2.492234   2.283182
Kerstinbo      1.432096   1.664455   1.731744   1.816561   1.976756   1.724322
Stockholm      1.245304   1.236683   1.317395   1.391617   1.507668   1.339733
Örskär         1.231332   1.3061     1.322557   1.383235   1.470311   1.342707
Fårösund       1.09673    1.132382   1.19581    1.222105   1.310847   1.191575
Hörby          0.969422   1.143059   1.244003   1.327185   1.446514   1.226037
Mean           1.496951   1.600774   1.681494   1.742931   1.865862   1.677602

[Figure 4.9: Örskär, 24 hour forecast without adaptation.]
[Figure 4.10: Parkalompolo, 48 hour forecast without adaptation.]

[Figure 4.11: Residuals and histograms for linear combination, second degree polynomial and single layer neural network models. The results correspond to the validation data. (a) Örskär, Kalman filtered 24 hour forecast for the best linear combination model. (b) Örskär, Kalman filtered 24 hour forecast for the best second degree polynomial model. (c) Örskär, 24 hour forecast without adaptation for the best single layer neural network model.]

5 Conclusions

In the following chapter, some of the insights that have been gained during the course of the project are summarized. Some of the ideas that were developed but were not properly investigated due to the time limit are discussed as areas for further research.

5.1 Experience of using nonlinear models

Different model types have been used in an attempt to find a nonlinear relationship between forecasts and measurements. None of the attempts have resulted in models with higher performance than the best linear models. However, a simple neural network model can, without any adaptive algorithm, perform almost as well as the best Kalman filtered linear model. One reason that the nonlinear models do not improve on the linear ones might be that there is no causal relationship between the forecasts and the measurements, as there would have been if the measurements were in some way generated by the forecasts. Instead, the forecasts have a simple but uncertain relationship to the measurements. The high variance in the forecast errors is likely a result of effects that the numerical forecast models do not account for. If that is the case, there is only so much that can be done with the data generated from the models in order to reduce the error variance. One such possibility is to combine forecasts from different grid points so that the errors cancel each other out. This is the approach that has been used before, in the form of adaptively weighted linear combinations, and it seems to be enough to achieve a well performing and robust model; a minimal sketch of such an update is given below.
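As an illustration of the linear combination approach, the sketch below shows a recursive least squares (RLS) update of the combination weights for one station. The forgetting factor value matches the one reported in Table 4.4; the function name, regressor layout and initialisation are illustrative assumptions and not taken from the implementation used in this work.

```python
import numpy as np

def rls_combination(forecasts, measurements, lam=0.985, delta=0.1):
    """RLS estimate of linear-combination weights with exponential forgetting.

    forecasts:    (T, n) array, one column per grid-point forecast
                  (append a column of ones if a bias term is wanted).
    measurements: (T,) array of observed values at the station.
    Returns the one-step-ahead corrected forecasts and the final weights.
    """
    T, n = forecasts.shape
    theta = np.zeros(n)          # combination weights
    P = delta * np.eye(n)        # parameter covariance, P_0 = delta * I
    corrected = np.zeros(T)
    for t in range(T):
        phi = forecasts[t]
        corrected[t] = phi @ theta                    # prediction before the update
        K = P @ phi / (lam + phi @ P @ phi)           # gain vector
        theta = theta + K * (measurements[t] - phi @ theta)
        P = (P - np.outer(K, phi) @ P) / lam          # covariance update with forgetting
    return corrected, theta
```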

Regarding the neural network models, it also seems difficult to improve the performance by the use of adaptive algorithms. One reason for this could be that the neural network models are initially trained using one whole year of data. Given that network models are fairly good at approximating functions, the situation might be such that the initial model is good enough to cover even the time variant aspects of the system. There are also a couple of difficulties that arise when dealing with the sigma-point Kalman filters. Kalman filter approaches require a lot of filter parameters to tune: two square matrices of the same dimension as the number of model parameters, one square matrix of the same dimension as the number of outputs, and additional parameters for the sigma-point distribution. Even for a simple single layer feed forward network with n inputs and m hidden nodes, a total of m(2 + n) + 1 parameters need to be estimated. The fact that each of the parameters has a distinctly different impact on the model, which is hard to get a good intuition about, makes the filter difficult to tune. The large number of tunable parameters in the Kalman filter is worth noting even for the linear models, although in that case it is a lot easier to predict the impact of each parameter. If not enough work is done to tune the noise matrices of the Kalman filter, simpler algorithms such as RLS are likely to perform just as well or even better, with only one tunable parameter and therefore less effort.
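To make the scale of the tuning problem concrete, the short sketch below counts the network parameters and the covariance entries that, in principle, are available for tuning in such a filter. The helper function is hypothetical, and the example assumes that the input configuration of Table 4.5 (closest EC, closest C11, AR 3) corresponds to five inputs.

```python
def spkf_tuning_size(n_inputs, n_hidden, n_outputs=1):
    """Parameter and tuning-matrix sizes for a single hidden layer network
    whose weights are estimated with a (sigma-point) Kalman filter.

    Weights: input-to-hidden (n_inputs * n_hidden), hidden biases (n_hidden),
    hidden-to-output (n_hidden), output bias (1) -> m(2 + n) + 1 for one output.
    """
    n_params = n_hidden * (2 + n_inputs) + 1   # network parameters = KF state dimension
    q_entries = n_params ** 2                  # process noise covariance Q
    p0_entries = n_params ** 2                 # initial covariance P_0
    r_entries = n_outputs ** 2                 # measurement noise covariance R
    return n_params, q_entries + p0_entries + r_entries

# One hidden node, five inputs: 8 parameters and 129 covariance entries.
print(spkf_tuning_size(n_inputs=5, n_hidden=1))
```

In practice most of these entries are collapsed by structural choices such as a scaled identity Q, but the count shows why hand tuning quickly becomes impractical.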

5.2

Method

The method that was finally used in the project is a simple and established way to test and validate different models. However, there are some things in the methodology that are worth mentioning. One of them is that advanced models with many parameters, such as neural network models with many nodes, usually need a lot of training data in order to converge to a good estimate. In order to make sure that candidates from both forecast models are available in the modelling, a sample time of 12 hours is needed, which leads to roughly 731 samples for one year of training data. This is not a very large amount of data, and nothing has been done to make sure that it is enough to estimate a good model. For the adaptive algorithms, a lot of the validation process has been done using the same tuning parameters. Furthermore, the Kalman filters mainly use a scaled identity matrix as the process noise covariance, Q. The optimal tuning parameters are likely to vary between different model types, and it is probable that the results from the Kalman filter models would improve if the task of tuning the parameters was given more work.

5.3

Further research

There are some questions and ideas that were under consideration during the project but were not thoroughly investigated due to the time limit. These might be interesting to look into if further research is to be made around this problem.

• The system at hand is under the influence of time-varying noise. That is, the signal to noise ratio is unlikely to be constant over time. Since the performance and optimal tuning of most adaptive algorithms depend on the signal to noise ratio, it could be a good idea to use some kind of change detection algorithm and switch the filter parameters when a change is detected. There is also a possibility that different model choices are best suited for certain seasons, in which case a change detection algorithm could be used to determine when to switch model. A minimal change detection sketch is given after this list.

• Develop and implement an algorithm which finds the best parameters for the filter. This is a task which is time consuming and complicated to do manually, especially when the optimal parameters vary over both time and space. This would also be of interest for algorithms dealing with feature selection or change detection, or any other algorithm which demands tuning of parameters. One example of such an algorithm is Bayesian optimization.

• Some attempts have been made to use feature selection algorithms to choose appropriate grid points and input properties for the model. The results during this project have not been very reliable, but are still promising enough to maintain the idea that some kind of input selection might be of good use. It is not certain that the best choice is static; it might change over time, and so it could be of interest to investigate whether this is a choice that could be made by an on-line algorithm.
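As a starting point for the change detection idea in the first item, the sketch below applies a two-sided CUSUM test to the filter residuals; an alarm could trigger re-tuning of the noise covariances or a model switch. This approach was not evaluated in this work, and the drift and threshold values are only placeholders that would need tuning.

```python
def cusum_alarms(residuals, drift=0.05, threshold=5.0):
    """Two-sided CUSUM test on (preferably normalised) filter residuals.

    Returns the time indices where either the positive or the negative
    test statistic exceeds the threshold; both statistics are reset
    after an alarm.
    """
    g_pos, g_neg = 0.0, 0.0
    alarms = []
    for t, e in enumerate(residuals):
        g_pos = max(0.0, g_pos + e - drift)   # drift upwards in the mean
        g_neg = max(0.0, g_neg - e - drift)   # drift downwards in the mean
        if g_pos > threshold or g_neg > threshold:
            alarms.append(t)
            g_pos, g_neg = 0.0, 0.0
    return alarms
```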


