Neural Networks in System Identification

J. Sjöberg, H. Hjalmarsson, L. Ljung

Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden. E-mail: hakan@isy.liu.se, sjoberg@isy.liu.se, ljung@isy.liu.se

Abstract. Neural Networks are non-linear black-box model structures, to be used with conventional parameter estimation methods. They have good general approximation capabilities for reasonable non-linear systems. When estimating the parameters in these structures, there is also good adaptability to concentrate on those parameters that have the most importance for the particular data set.

Keywords. Neural Networks, Parameter Estimation, Model Structures, Non-Linear Systems.

1. EXECUTIVE SUMMARY

1.1. Purpose

The purpose of this tutorial is to explain how Artificial Neural Networks (NN) can be used to solve problems in System Identification, to focus on some key problems and algorithmic questions for this, as well as to point to the relationships with more traditional estimation techniques. We also try to remove some of the "mystique" that sometimes has accompanied the Neural Network approach.

1.2. What's the problem?

The identification problem is to infer relationships between past input-output data and future outputs. Collect a finite number of past inputs u(k) and outputs y(k) into the vector φ(t):

    φ(t) = [y(t−1) ... y(t−n_a)  u(t−1) ... u(t−n_b)]^T                          (1)

For simplicity we let y(t) be scalar. Let d = n_a + n_b. Then φ(t) ∈ ℝ^d. The problem then is to understand the relationship between the next output y(t) and φ(t):

    ?   y(t) ←→ φ(t)   ?                                                          (2)

To obtain this understanding we have available a set of observed data (sometimes called the "training set")

    Z^N = {[y(t), φ(t)] | t = 1, ..., N}                                          (3)

From these data we infer a relationship

    ŷ(t) = ĝ_N(φ(t))                                                              (4)

We index the function g with a "hat" and N to emphasize that it has been inferred from (3). We also place a "hat" on y(t) to stress that (4) will in practice not be an exact relationship between φ(t) and the observed y(t). Rather, ŷ(t) is the "best guess" of y(t) given the information φ(t).

1.3. Black boxes

How to infer the function ĝ_N? Basically we search for it in a parameterized family of functions

    G = {g(φ(t), θ) | θ ∈ D_M}                                                    (5)

How to choose this parameterization? A good, but demanding, choice of parameterization is to base it on physical insight. Perhaps we know the relationship between y(t) and φ(t) on physical grounds, up to a handful of physical parameters (heat transfer coefficients, resistances, ...).

Then parameterize (5) accordingly.

This tutorial only deals with the situation when physical insight is not used, i.e. when (5) is chosen as a flexible set of functions capable of describing almost any true relationship between y and φ. This is the black-box approach.

Typically, function expansions of the type

    g(φ, θ) = Σ_k θ(k) g_k(φ)                                                     (6)

are used, where

    g_k(φ): ℝ^d → ℝ

and θ(k) are the components of the vector θ. For example, let

    g_k(φ) = φ_k   (the k:th component of φ),   k = 1, ..., d.

Then, with (1),

    y(t) = g(φ(t), θ)

reads

    y(t) + a_1 y(t−1) + ... + a_{n_a} y(t−n_a) = b_1 u(t−1) + ... + b_{n_b} u(t−n_b)

if a_i = −θ(i), b_i = θ(n_a + i), so the familiar ARX structure is a special case of (6), with a linear relationship between y and φ.
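To make this linear special case concrete, the following minimal Python sketch (not part of the original tutorial; the data, the model orders and the helper name arx_least_squares are all hypothetical) builds the regressor φ(t) of (1) from recorded input-output arrays, estimates θ by least squares in the spirit of (10), and recovers the ARX coefficients via a_i = −θ(i), b_i = θ(n_a + i).

    import numpy as np

    def arx_least_squares(y, u, na, nb):
        # Build phi(t) = [y(t-1)...y(t-na) u(t-1)...u(t-nb)]^T as in (1)
        # and fit y(t) = theta^T phi(t) by least squares, cf. (10).
        n0 = max(na, nb)
        Phi, Y = [], []
        for t in range(n0, len(y)):
            phi = np.concatenate([y[t - na:t][::-1], u[t - nb:t][::-1]])
            Phi.append(phi)
            Y.append(y[t])
        theta, *_ = np.linalg.lstsq(np.asarray(Phi), np.asarray(Y), rcond=None)
        a = -theta[:na]          # a_i = -theta(i)
        b = theta[na:na + nb]    # b_i = theta(na + i)
        return a, b

    # Hypothetical usage on simulated second-order data:
    rng = np.random.default_rng(0)
    u = rng.standard_normal(500)
    y = np.zeros(500)
    for t in range(2, 500):
        y[t] = 1.5*y[t-1] - 0.7*y[t-2] + u[t-1] + 0.5*u[t-2] + 0.05*rng.standard_normal()
    a_hat, b_hat = arx_least_squares(y, u, na=2, nb=2)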

1.4. Nonlinear black box models

The challenge now is the non-linear case: to describe general, non-linear, dynamics. How to select {g_k(φ)} in this general case? We should thus be prepared to describe a "true" relationship

    ŷ(t) = g_0(φ(t))

for any reasonable function g_0: ℝ^d → ℝ. The first requirement should be that {g_k(φ)} is a basis for such functions, i.e. that we can write

    [R1]:   g_0(φ) = Σ_{k=1}^∞ θ(k) g_k(φ)                                        (7)

for any reasonable function g_0, using suitable coefficients θ(k). There is of course an infinite number of choices of {g_k} that satisfy this requirement, the classical one perhaps being the basis of polynomials. For d = 1 we would then have

    g_k(φ) = φ^k

and (7) becomes a Taylor or Volterra expansion.

In practice we cannot work with infinite expansions like (7). A second requirement on {g_k} is therefore to produce "good" approximations for finite sums. In loose notation:

    [R2]:   ||g_0(φ) − Σ_{k=1}^n θ(k) g_k(φ)||  "decreases quickly as n increases"    (8)

There is clearly no uniformly good choice of {g_k} in this respect: it will all depend on the class of functions g_0 that are to be approximated.

1.5. Estimating ĝ_N

Suppose now that a basis {g_k} has been chosen, and we try to approximate the true relationship by a finite number of the basis functions:

    ŷ(t|θ) = g(φ(t), θ) = Σ_{k=1}^n θ(k) g_k(φ(t))                                (9)

where we introduce the notation ŷ(t|θ) to stress that g(φ(t), θ) is a "guess" for y(t), given the information in φ(t) and given a particular parameter value θ. The "best" value of θ is then determined from the data set Z^N in (3) by

    θ̂_N = arg min_θ Σ_{t=1}^N |y(t) − ŷ(t|θ)|²                                   (10)

The model will be

    ŷ(t) = ŷ(t|θ̂_N) = ĝ_N(φ(t)) = g(φ(t), θ̂_N)                                  (11)

1.6. Properties of the estimated model

Suppose that the actual data have been generated by

    y(t) = g_0(φ(t)) + e(t)                                                       (12)

where {e(t)} is white noise with variance λ. The estimated model (11) (i.e. the estimated parameter vector θ̂_N) will then be a random variable that depends on the realizations of both e(t), t = 1, ..., N and φ(t), t = 1, ..., N. Denote its expected value by

    E ĝ_N = g_n = Σ_{k=1}^n θ(k) g_k                                              (13)

where we used the subscript n to emphasize the number of terms used in the function approximation. Then, under quite general conditions,

    E |ĝ_N(φ(t)) − g_n(φ(t))|² = λ m/N                                            (14)

where E denotes expectation both with respect to φ(t) and θ̂_N. Moreover, m is the number of estimated parameters, i.e. dim θ. The total error thus becomes

    E |ĝ_N(φ(t)) − g_0(φ(t))|² = ||g_0(φ(t)) − g_n(φ(t))||² + λ m/N               (15)

The first term here is an approximation error of the type (8). It follows from (15) that there is a trade-off in the choice of how many basis functions to use. Each included basis function increases the variance error by λ/N, while it decreases the bias error by an amount that could be smaller than that. A third requirement on the choice of {g_k} is thus to


[R3] Have a scheme that allows the exclusion of spurious basis functions from the expansion.

Such a scheme could be based on a priori knowledge as well as on information in Z^N.
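As a rough numerical illustration of the trade-off in (15) (a sketch on simulated data, not an experiment from the tutorial), one can fit expansions with an increasing number n of polynomial basis functions and compare the resulting error with the variance term λm/N; the bias part dominates for small n, the variance part for large n.

    import numpy as np

    rng = np.random.default_rng(1)
    N, lam = 200, 0.1**2
    phi = rng.uniform(-1.0, 1.0, N)
    g0 = lambda x: np.sin(3.0 * x)                    # assumed "true" relationship
    y = g0(phi) + np.sqrt(lam) * rng.standard_normal(N)

    phi_test = np.linspace(-1.0, 1.0, 1000)
    for n in (2, 4, 8, 16):
        coeffs = np.polyfit(phi, y, deg=n - 1)        # n polynomial basis functions
        total = np.mean((np.polyval(coeffs, phi_test) - g0(phi_test)) ** 2)
        print(f"n = {n:2d}   total error ~ {total:.4f}   variance term ~ {lam*n/N:.4f}")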

1.7. Basis functions

Out of the many possible choices of basis functions, a large family of special ones has received most of the current interest. They are all based on just one fundamental function κ(φ), which is scaled in various ways, and centered at different points, i.e.

    g_k(φ) = κ(β_k^T(φ + γ̃_k)) = κ(β_k^T φ + γ_k) = κ(φ, θ_k)                    (16)

where γ_k = β_k^T γ̃_k and θ_k is the (d+1)-vector

    θ_k = [β_k, γ_k]                                                              (17)

Such a choice is not at all strange. A very simplistic approach would be to take κ(φ) to be the indicator function (in the case d = 1) for the interval [0, 1]:

    κ(φ) = 1 if φ ∈ [0, 1],   κ(φ) = 0 if φ ∉ [0, 1]

For a countable collection of θ_k (e.g. assuming all rational numbers) the functions g_k(φ) would then contain indicator functions for any interval, arbitrarily small and placed anywhere along the real axis. Not surprisingly, these {g_k} will be a basis for all continuous functions. Equivalently, it could be the threshold function

    κ(φ) = 1 if φ > 0,   κ(φ) = 0 if φ ≤ 0                                        (18)

since the basic indicator function is just the difference between two threshold functions.

1.8. What is the Neural Network Identification Approach?

The basic Neural Network (NN) used for System Identification (one hidden layer feedforward net) is indeed the choice (16) with a smooth approximation of (18), often

    κ(x) = 1/(1 + e^(−x))

Include the parameters in (16)-(17) among the parameters θ to be estimated, and insert into (9). This gives the Neural Network model structure

    ŷ(t|θ) = Σ_{k=1}^n α_k κ(β_k^T φ + γ_k),   θ = [α_k β_k γ_k], k = 1, ..., n   (19)

The n(d+2)-dimensional parameter vector θ is then estimated by (10).
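To make the structure (19) concrete, here is a minimal Python sketch (an illustration only, assuming numpy/scipy and a hypothetical training set given as an (N, d) array Phi of regressors and an (N,) array y of outputs) that parameterizes ŷ(t|θ) = Σ_k α_k κ(β_k^T φ(t) + γ_k) with the sigmoid κ and estimates the n(d+2) parameters by the least-squares criterion (10).

    import numpy as np
    from scipy.optimize import least_squares

    def nn_predict(theta, Phi, n):
        # One-hidden-layer sigmoid network (19).
        d = Phi.shape[1]
        alpha = theta[:n]
        beta = theta[n:n + n * d].reshape(n, d)
        gamma = theta[n + n * d:]
        z = Phi @ beta.T + gamma                  # (N, n) hidden-unit activations
        return (1.0 / (1.0 + np.exp(-z))) @ alpha

    def fit_nn(Phi, y, n, seed=0):
        # Estimate the n*(d+2)-dimensional theta by the criterion (10).
        d = Phi.shape[1]
        rng = np.random.default_rng(seed)
        theta0 = 0.1 * rng.standard_normal(n * (d + 2))
        res = least_squares(lambda th: nn_predict(th, Phi, n) - y, theta0)
        return res.x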

1.9. Why have Neural Networks attracted so much interest?

This tutorial points to two main facts.

1. The NN function expansion has good properties regarding requirement [R2] for non-linear functions g_0 that are "localized", i.e. whose nonlinear effects do not extend to infinity. This is a reasonable property for most real-life physical functions. More precisely, see (69) in Section 9.

2. There is a good way to handle requirement [R3] by implicit or explicit regularization (see Section 3).

1.10. Why has there been so much hesitation about Neural Networks in the statistics and system identification communities?

Basically, because the NN is, formally, just one of many choices of basis functions. Algorithms for achieving the minimum in (10), and statistical properties of the type (15), are all of a general character and well known for the more traditional model structures used. They have typically been reinvented and rediscovered in the NN literature and been given different names there.

This certainly has had an alienating effect on the "traditional estimation" communities.

1.11. Related approaches

Actually, the general family of basis functions (16) is behind both Wavelet Transform Networks and estimation of Fuzzy Models. A companion tutorial, Benveniste et al. (1994), explains these connections in an excellent manner.

1.12. Organization of the tutorial

In Section 2 we shall give some general background about function approximation. This overlaps and complements the corresponding discussion in Benveniste et al. (1994). Sections 3 and 4 deal with the fundamentals of estimation theory with relevance for Neural Networks. The basic Neural Network structures are introduced in Section 5. The question of how to fit the model structure to data is discussed in Section 6. In Sections 8 and 9 the perspective is widened


to discuss how Neural Networks relate to other black box non-linear structures. These sections also deal with questions similar to those in Benveniste et al. (1994). Section 11 describes the typical structures that Neural Networks give rise to when applied to dynamical systems. The final Section 12 describes applications and research problems in the area.

2. THE PROBLEM

2.1. Inferring relationships from data

A wide class of problems in disciplines such as classification, pattern recognition and system identification can be fit into the following framework.

A set of observations (data)

    Z^N = {y(t), ψ(t)}_{t=1}^N

of two physical quantities y ∈ ℝ^p and ψ ∈ ℝ^r is given. It may or may not be known which variables in ψ influence y. There may also be other, non-measured, variables v that influence y. Based on the observations Z^N, infer how the variables in ψ influence y.

Let φ be the variables in ψ that influence y; then we could represent the relation between φ, v and y by a function g_0:

    y = g_0(φ, v)                                                                 (20)

The problem is thus two-fold:

1. Find which variables in ψ should be used in φ.

2. Determine g_0.

In identification of dynamical systems, finding the right φ is the model order selection problem. Then t represents the time index and ψ(t) would be the collection of all past inputs and outputs.

There are two issues that have to be dealt with when determining g_0:

1. Only finitely many observations in the φ-space are available.

2. The observations are perturbed by the non-measurable variables {v(t)}.

1) represents the function approximation problem, i.e. how to do interpolation and extrapolation, which in itself is an interesting problem.

Notice that there would be no problem at all if y were given for all values of φ (if we neglect the non-measurable input), since the function would then in fact be defined by the data. 2) increases the difficulty further, since then we cannot infer exactly how φ influences y even at the points of observation. Blended together, these two problems are very challenging. Below we will try to disclose the essential ingredients. For further insight into this problem see also Benveniste et al. (1994).

2.2. Prior assumptions

Notice that as stated, the problem is ill-posed.

There will be far too many unfalsified models, i.e. models satisfying (20), if any function g and any non-measurable sequence {v(t)} are allowed. Thus, it is necessary to include some a priori information in order to limit the number of possible candidates. However, it is often difficult to provide a priori knowledge that is so precise that the problem becomes well-defined. To ease the burden it is common to resort to some general principles:

1) Non-measurable inputs are additive. This means that g_0 is additive in its second argument, i.e.

    g_0(φ, v) = g_0(φ) + v

This is, for example, a relevant assumption when {v(t)} is due mainly to measurement errors. Therefore v is often called disturbance or noise.

2) Try simple things first (Occam's razor).

There is no reason to choose a complicated model unless needed. Thus, among all unfalsified models, select the simplest one. Typically, the simplest means the one that in some sense has the smoothest surface. An example is spline smoothing. Among the class C² of all twice differentiable functions on an interval I, the solution to

    min_{g ∈ C²}  Σ_t (y(t) − g(φ(t)))² + ∫_I (g''(φ))² dφ

is given by the cubic spline, Wahba (1990).
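A minimal Python sketch of this smoothness principle, using scipy's UnivariateSpline as a stand-in for the cubic smoothing spline of Wahba (1990) (the data are hypothetical and the smoothing factor s controls the fit/roughness trade-off):

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(2)
    phi = np.linspace(0.0, 1.0, 100)
    y = np.sin(2 * np.pi * phi) + 0.1 * rng.standard_normal(100)

    interpolant = UnivariateSpline(phi, y, s=0.0)   # s = 0: fits every noisy point
    smooth = UnivariateSpline(phi, y, s=1.0)        # larger s: smoother, simpler surface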

Other ways to penalize the complexity of a function are information based criteria, such as AIC, BIC and MDL, regularization (or ridge penalties), cross-validation and shrinkage. We shall discuss these in Section 9. Part of this smoothness paradigm is that the roughness should be allowed to increase with the number of observations. If there is compelling evidence in the observations that the function is non-smooth, then the approximating function should be allowed to be more flexible. This also holds for which variables in ψ(t) should be included in φ(t). Thus both the dimension and the entries of φ(t) could depend on the observations Z^N. In pure approximation theory all these smoothness


criteria are rather ad hoc. It is first when the non-measurable inputs are taken into account that they can be given meaningful interpretations. This will be the main topic in Section 4, see also Sections 8-9.

2.3. Function classes

Thus, g_0 is assumed to belong to some quite general family G of functions. The function estimate ĝ_N^n, however, is restricted to belong to a possibly more limited class of functions, G_n say. This family G_n, where n represents the complexity of the class (typically n is the number of basis functions in the class), is a member of a sequence of families {G_n} that satisfies G_n → G. As explained above, the complexity of ĝ_N^n is allowed to depend on Z^N, i.e. n is a function of Z^N. We will indicate this by writing n(N).

In this perspective, an identification method can be seen as a rule to choose the family {G_n}, together with a rule to choose n(N) and an estimator that, given these, provides an estimate ĝ_N^{n(N)}. Notice that both the selection of {G_n} and of n(N) can be driven by data. This possibility is, as we shall see in Section 9, very important.

Typical choices of G are Hölder balls, which consist of Lipschitz continuous functions:

    {f : |f(x) − f(y)| ≤ C |x − y|}                                               (21)

and L_p Sobolev balls, which have derivatives of a certain order belonging to L_p:

    W_p^m(C) = {f : ∫ |f^(m)(t)|^p dt ≤ C^p}                                      (22)

Recently, Besov classes and Triebel classes, Triebel (1983), have been employed in wavelet analysis. The advantage with these classes is that they allow for spatial inhomogeneity. Functions in these classes can be locally spiky and jumpy.

2.4. Noise assumptions

The non-measurable input {v(t)} is also restricted to some family V. It is possible to classify these families into two categories:

1) Deterministic. Here {v(t)} is usually assumed to belong to a ball

    |v(t)| ≤ C_v   for all t.

This is known as unknown-but-bounded disturbances and dates back to the work of Schweppe (1973). This assumption leads to set estimation methods, see Milanese and Vicino (1991).

2) Stochastic. Here {v(t)} is a stochastic process with certain properties. This type, which we shall focus on, is the most common one.

However, for a connection between deterministic and stochastic disturbances, see Hjalmarsson and Ljung (1994). The advantage with this type is that it fits with the smoothness principle. A stochastic disturbance is typically non-smooth, as opposed to the function of interest g_0. This can be used to decrease the influence of the disturbance.

The challenge is to find identification methods that give good performance for as general families G and V as possible. For a chosen criterion (figure of merit) it is the interplay between the approximating properties of the method and the way that the disturbance corrupts the approximation that has to be considered. We shall delve into that issue in the sections that follow. Especially we shall examine what Artificial Neural Networks can offer in this respect.

2.5. Figures of merit

Since one is working in a function space it is natural to consider some norm of the error between the function estimate ĝ_N and the true function g_0. It is quite standard to use L_p-norms

    J_p(ĝ_N, g_0) = ||ĝ_N − g_0||_{L_p}^p = ∫ |ĝ_N(φ) − g_0(φ)|^p dP_φ(φ)         (23)

where P_φ is the probability distribution of φ. An estimator is almost surely convergent if

    J_p(ĝ_N, g_0) → 0  w.p. 1  as N → ∞,  for all g_0 ∈ G.

In order to compare different estimators one can consider rates of convergence: {ĝ_N} converges to g_0 with rate {f_N} if J_p(ĝ_N, g_0) ≍ f_N and f_N → 0. (Here a_N ≍ b_N means that the ratio a_N/b_N stays bounded away from 0 and ∞, i.e. 0 < lim inf a_N/b_N ≤ lim sup a_N/b_N < ∞.)

Another figure of merit is the expected value

    V_p^P(ĝ_N, g_0) = E[J_p(ĝ_N, g_0)]                                            (24)

where the expectation is taken over the probability space P of {v(t)}. With p = 2 one gets the integrated mean square error (IMSE)

    V_2^P(ĝ_N, g_0) = ∫ E[|ĝ_N(φ) − g_0(φ)|²] dP_φ(φ).                            (25)
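In practice the integral in (23) or (25) is rarely available in closed form; a simple Monte Carlo sketch (an illustration, with the function handles and the sample of φ assumed supplied by the user) estimates the p = 2 figure of merit by averaging over samples drawn from the distribution of φ:

    import numpy as np

    def empirical_J2(g_hat, g_0, phi_samples):
        # Monte Carlo estimate of (23) with p = 2: the integral over dP_phi is
        # replaced by an average over samples from the distribution of phi.
        return np.mean((g_hat(phi_samples) - g_0(phi_samples)) ** 2)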



This type of criterion is known as a risk measure in statistics. Based on the risk, various optimality properties can be defined:

It is natural to try to minimize the risk for the worst case: An estimator ĝ_N is said to be minimax if

    sup_{g_0 ∈ G} V_p^P(ĝ_N, g_0) = inf_{ĝ} sup_{g_0 ∈ G} V_p^P(ĝ, g_0).

Often it is too difficult to derive the minimax estimator and one has to resort to asymptotic theory: The estimator ĝ_N is asymptotically minimax if

    sup_{g_0 ∈ G} V_p^P(ĝ_N, g_0) = inf_{ĝ_N} sup_{g_0 ∈ G} V_p^P(ĝ_N, g_0)

as N → ∞. An even weaker concept is the minimax rate. The estimator ĝ_N attains the minimax rate if

    sup_{g_0 ∈ G} V_p^P(ĝ_N, g_0) ≍ inf_{ĝ_N} sup_{g_0 ∈ G} V_p^P(ĝ_N, g_0).      (26)

Notice that the risk will depend on the assumed distribution. To safeguard against uncertainty about the distribution it is possible to consider a whole family 𝒫 of distributions and use

    V_p^𝒫(ĝ_N, g_0) = sup_{P ∈ 𝒫} V_p^P(ĝ_N, g_0).

This is thus a minimax problem and is considered in robust statistics, Huber (1981). Notice that when rates of convergence are considered, the shape of the distribution is less important. For classes of distributions with unbounded support satisfying some mixing condition, the rate of convergence will be the same.

3. SOME GENERAL ESTIMATION RESULTS

The basic estimation set-up is what is called non-linear regression in statistics. The problem is as follows. We would like to estimate the relationship between a scalar y and φ ∈ ℝ^d. For a particular value φ(t) the corresponding y(t) is assumed to be

    y(t) = g_0(φ(t)) + e(t)                                                       (27)

where {e(t)} is supposed to be a sequence of independent random vectors, with zero mean values and variance

    E e(t)e^T(t) = λ                                                              (28)

To find the function g_0 in (27) we have the following information available:

1. A parameterized family of functions

    G_m = {g(φ(t), θ) | θ ∈ D_M ⊂ ℝ^m}                                            (29)

2. A collection of observed y-φ pairs:

    Z^N = {[y(t), φ(t)],  t = 1, ..., N}                                          (30)

The typical way to estimate g_0 is then to form the scalar-valued function

    V_N(θ) = (1/N) Σ_{t=1}^N |y(t) − g(φ(t), θ)|²                                 (31)

and determine the parameter estimate θ̂_N as its minimizing argument:

    θ̂_N = arg min_θ V_N(θ)                                                       (32)

The estimate of g_0 will then be

    ĝ_N(φ) = g(φ, θ̂_N)                                                           (33)

Sometimes a general, non-quadratic norm ℓ(·) is used in (31):

    V_N(θ) = (1/N) Σ_{t=1}^N ℓ(ε(t)),   ε(t) = y(t) − g(φ(t), θ)                  (34)

Another modification of (31) is to add a regularization term,

    W_N(θ) = V_N(θ) + δ |θ − θ#|²                                                 (35)

(and minimize W rather than V), either to reflect some prior knowledge that a good θ is close to θ#, or just to improve the numerical and statistical properties of the estimate θ̂_N. Again, the quadratic term in (35) could be replaced by a non-quadratic norm.
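A minimal Python sketch of the criteria (31)-(35) (an illustration only: the model function g, the data arrays and the starting point are assumed to be supplied by the user, and delta and theta_hash play the roles of δ and θ# in (35)):

    import numpy as np
    from scipy.optimize import minimize

    def V_N(theta, g, Phi, y):
        # Quadratic criterion (31): mean squared prediction error over Z^N.
        resid = y - g(Phi, theta)
        return np.mean(resid ** 2)

    def W_N(theta, g, Phi, y, delta, theta_hash):
        # Regularized criterion (35): V_N plus a pull towards theta#.
        return V_N(theta, g, Phi, y) + delta * np.sum((theta - theta_hash) ** 2)

    def estimate(g, Phi, y, theta0, delta=0.0, theta_hash=None):
        # Return theta_hat_N minimizing (31), or (35) when delta > 0, cf. (32).
        if theta_hash is None:
            theta_hash = np.zeros_like(theta0)
        return minimize(lambda th: W_N(th, g, Phi, y, delta, theta_hash), theta0).x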

Now, what are the properties of the estimated relationship ĝ_N? How close will it be to g_0? Following some quite standard results, see e.g. Ljung (1987), Söderström and Stoica (1989), we have the following properties. We will not state the precise assumptions under which the results hold. Generally it is assumed that {φ(t)} is (quasi-)stationary and has some mixing property (i.e. that φ(t) and φ(t+s) become less and less dependent as s increases). The estimate θ̂_N is a random variable that depends on Z^N. Let E denote expectation with respect to both e(t) and φ(t), t = 1, ..., N. Let

    θ* = E θ̂_N   and   g*(φ) = g(φ, θ*)

Then g*(φ) will be as close as possible to g_0(φ) in the following sense:

    arg min_{g ∈ G_m} E |g(φ) − g_0(φ)|² = g*(φ)                                  (36)


where the expectation E is over the distribution of φ that governed the observed sample Z^N. We shall call

    g*(φ) − g_0(φ)

the bias error. Moreover, if the bias error is small enough, the variance will be given approximately by

    E |ĝ_N(φ) − g*(φ)|² ≈ (m/N) λ                                                 (37)

Here m is the dimension of θ (the number of estimated parameters), N is the number of observed data pairs and λ is the noise variance. Moreover, the expectation is both over θ̂_N and over φ, assuming the same distribution for φ as in the sample Z^N. The total integrated mean square error (IMSE) will thus be

    E |ĝ_N(φ) − g_0(φ)|² = ||g*(φ) − g_0(φ)||² + (m/N) λ                          (38)

Here the double-bar norm denotes the functional norm, integrating over φ with respect to its distribution function when the data were collected.

Now, what happens if we minimize the regularized criterion W_N in (35)?

1. The value g*(φ) will change to the function that minimizes

    E |g(φ, θ) − g_0(φ)|² + δ |θ − θ#|²                                           (39)

2. The variance (37) will change to

    E |ĝ_N(φ) − g*(φ)|² ≈ (r(m)/N) λ                                              (40)

   where

    r(m) = Σ_{i=1}^m μ_i² / (μ_i + δ)²                                            (41)

   and μ_i are the eigenvalues (singular values) of E V_N''(θ*), the second derivative matrix (the Hessian) of the criterion.

How to interpret (41)? A redundant parameter will lead to a zero eigenvalue of the Hessian. A small eigenvalue of V'' can thus be interpreted as corresponding to a parameter (combination) that is not so essential: "a spurious parameter". The regularization parameter δ is thus a threshold for spurious parameters. Since the eigenvalues μ_i often are widely spread we have

    r(m) ≈ m# = the number of eigenvalues of V'' that are larger than δ

We can think of m# as "the efficient number of parameters in the parameterization". Regularization thus decreases the variance, but typically increases the bias contribution to the total error.

4. THE BIAS/VARIANCE TRADE-OFF

Consider now a sequence of parameterized function families

    G_n = {g_n(φ(t), θ) | θ ∈ D_M ⊂ ℝ^m},   n = 1, 2, 3, ...                      (42)

where n denotes the number of basis functions, as in (9).

In the previous section we saw that the integrated mean square error is typically split into two terms, the variance term and the bias term:

    V_2(ĝ_N^n, g_0) = V_2(ĝ_N^n, g_n) + V_2(g_n, g_0)                             (43)

where, according to (37),

    V_2(ĝ_N^n, g_n) ≈ (m/N) λ.                                                    (44)

The bias term, which is entirely deterministic, decreases with n. Thus, for a given family {G_n} there will be an optimal n = n(N) that balances the variance and bias terms.

Notice that (44) is a very general expression that holds almost regardless of how the sequence {G_n} is chosen. Thus, it is in principle only possible to influence the bias error. In order to have a small integrated mean square error it is therefore of profound importance to choose {G_n} such that the bias is minimized. An interesting possibility is to let the choice of {G_n} be data driven.

This may not seem like an easy task but here wavelets have proven to be useful, see Section 9.

When the bias and the variance can be exactly quantified, the integrated mean square error can be minimized w.r.t. n. This gives the optimal model complexity n(N) as a function of N. However, often it is only possible to give the rate with which the bias decreases as a function of n and the rate with which the variance increases with n. Then it is only possible to obtain the rate with which n(N) increases with N. Another problem is that if g_0 in reality belongs not to G but to some other class of functions, the rate will not be optimal. These considerations have led to the development of methods where the choice of n is based on the observations Z^N. Basically, n is chosen so large that there is no evidence in the data that g_0 is more complex than the estimated model, but not larger than that.

Then, as is shown in Guo and Ljung (1994), the bias and the variance are matched. These adaptive methods are discussed in Section 9.
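As a rough illustration of such a data-driven choice of n (a sketch only, not the particular scheme analyzed by Guo and Ljung (1994)), one can keep increasing n as long as the improvement of the fitted criterion exceeds the variance penalty λ·(added parameters)/N suggested by (15), and stop otherwise; fit_model is a hypothetical helper returning the parameter count and mean squared residual for a model with n basis functions.

    import numpy as np

    def choose_complexity(fit_model, Phi, y, lam, max_n=20):
        # Increase n until the decrease in mean squared fit no longer exceeds
        # the variance penalty lam * (extra parameters) / N, in the spirit of (15).
        N = len(y)
        prev_m, prev_loss = fit_model(Phi, y, 1)
        for n in range(2, max_n + 1):
            m, loss = fit_model(Phi, y, n)
            if prev_loss - loss <= lam * (m - prev_m) / N:
                return n - 1
            prev_m, prev_loss = m, loss
        return max_n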

To get an idea of upper bounds for the optimal rate of convergence, consider a simple linear regression problem: g_0(φ) = φ^T θ_0. The bias is
