
A Least Squares Interpretation of Sub-Space Methods for System Identification¹

Lennart Ljung and Tomas McKelvey

Dept. of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden,

Email: ljung@isy.liu.se, tomas@isy.liu.se.

CDC 1996, Kobe, Japan

Abstract

So called subspace methods for direct identification of linear models in state space form have drawn considerable interest recently. The algorithms consist of series of quite complex projections, and it is not so easy to intuitively understand how they work. They have also defied, so far, complete asymptotic analysis of their stochastic properties. This contribution describes an interpretation of how they work. It specifically deals with how consistent estimates of the dynamics can be achieved, even though correct predictors are not used. We stress how the basic idea is to focus on the estimation of the state-variable candidates, the $k$-step ahead output predictors.

1 Introduction

A linear system can always be represented in state space form as
$$\begin{aligned} x(t+1) &= Ax(t) + Bu(t) + w(t) \\ y(t) &= Cx(t) + Du(t) + v(t) \end{aligned} \qquad (1)$$
We shall generally let $n$ denote the dimension of $x$ and let $p$ be the number of outputs. To estimate such a model, the matrices can be parameterized either from physical grounds or as black boxes in canonical forms.

Then these parameters can be estimated using prediction error/maximum likelihood (ML) techniques. See, e.g., [5].

However, there are also other possibilities: So called subspace methods [9], [10], [2], [12], [13] form an interesting alternative to the ML approach. The idea behind these methods can be explained as first estimating the state vector $x(t)$, and then finding the state space matrices by a linear least squares procedure. These methods are most often described in a geometric framework, which gives nice projection interpretations.

¹This work was supported in part by the Swedish Research Council for Engineering Sciences (TFR), which is gratefully acknowledged.

We shall in this contribution describe the subspace approach in a conventional least squares estimation framework. This gives some complementary insight, which could be useful for the development of alternative algorithms and for the asymptotic analysis.

2 The Basic Idea

Let us for a moment assume that not only are $u$ and $y$ measured, but also the sequence of state vectors $x$. This would, by the way, fix the coordinate basis of the state-space realization. Now, with known $u$, $y$ and $x$, the model (1) becomes a linear regression: the unknown parameters (all of the entries of the matrices) mix with measured signals in linear combinations. To see this clearly, let

$$Y(t) = \begin{pmatrix} x(t+1) \\ y(t) \end{pmatrix}, \qquad \Theta = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \qquad \Phi(t) = \begin{pmatrix} x(t) \\ u(t) \end{pmatrix}, \qquad E(t) = \begin{pmatrix} w(t) \\ v(t) \end{pmatrix}$$

Then, (1) can be rewritten as
$$Y(t) = \Theta\,\Phi(t) + E(t) \qquad (2)$$
From this, all the matrix elements in $\Theta$ can be estimated by the simple least squares method. The covariance matrix for $E(t)$ can also be estimated easily as the sample sum of the squared model residuals. That will give the covariance matrices for $w$ and $v$, as well as the cross covariance matrix. These matrices will, among other things, allow us to compute the Kalman filter for (1). Note that all of the above holds without changes for multivariable systems, i.e., when the output and input signals are vectors.
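To make the regression (2) concrete, here is a minimal numpy sketch of this step; the function name and data layout are our own choices, and the state sequence $x$ is assumed to be known, as in the discussion above.

```python
import numpy as np

def estimate_state_space(x, u, y):
    """Least squares fit of (2): Y(t) = Theta Phi(t) + E(t).

    x : (N+1, n) states, u : (N, m) inputs, y : (N, p) outputs (rows are time).
    Returns A, B, C, D and the sample covariance of the residuals E(t),
    whose blocks estimate cov(w), cov(v) and their cross covariance.
    """
    N, n = u.shape[0], x.shape[1]
    Phi = np.hstack([x[:N], u])        # rows: Phi(t)^T = [x(t)^T, u(t)^T]
    Y = np.hstack([x[1:N + 1], y])     # rows: Y(t)^T = [x(t+1)^T, y(t)^T]
    Theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    Theta = Theta.T                    # (n+p) x (n+m) block matrix [A B; C D]
    A, B = Theta[:n, :n], Theta[:n, n:]
    C, D = Theta[n:, :n], Theta[n:, n:]
    E = Y - Phi @ Theta.T              # residuals, rows E(t)^T
    return A, B, C, D, E.T @ E / N
```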

The problem is where to get the state vector sequence $x$ from. For that we turn to basic realization theory, as developed by [3], [1] and [7]. (See Appendix 4.A in [5] for an account.) The basic results are as follows (see Lemmas 4A.1 and 4A.2 in [5] and their proofs):

Let a system be given by the impulse response representation
$$y(t) = \sum_{j=0}^{\infty}\left(h_u(j)u(t-j) + h_e(j)e(t-j)\right) \qquad (3)$$
where $u$ is the input and $e$ the innovations. Let the $k$-step ahead predictors be defined by
$$\hat{y}(t|t-k) = \sum_{j=k}^{\infty}\left(h_u(j)u(t-j) + h_e(j)e(t-j)\right) \qquad (4)$$
Notice, and this is important, that the inputs between time $t-k$ and $t$ are ignored: no attempt to predict their values from past data is made. Define

$$\hat{Y}_r(t) = \begin{pmatrix} \hat{y}(t|t-1) \\ \vdots \\ \hat{y}(t+r-1|t-1) \end{pmatrix} \qquad (5)$$

Then the following is true:

1. The system (3) admits an $n$-th order state space description if and only if the rank of $\hat{Y}_r(t)$ is at most $n$ for all $r$.

2. If rank $\hat{Y}_{n+1} = n$, then $n$ is the order of a minimal realization.

3. The state vector of any such minimal realization can be chosen as linear combinations of $\hat{Y}_n$ that form a basis for $\hat{Y}_r$, $r \geq n$, i.e.,
$$x(t) = L\hat{Y}_n(t) \qquad (6)$$
such that $x(t)$ spans also $\hat{Y}_{n+1}(t)$.

Remark: "Rank", "basis" and "span" refer to the matrix obtained from the sequence of vectors $\mathbf{Y}_N = [\hat{Y}_n(1), \hat{Y}_n(2), \ldots, \hat{Y}_n(N)]$.

Note that the common canonical state space representations correspond to $L$-matrices that just pick out certain rows of $\hat{Y}_n$. In general, we are not confined to such choices, but may pick $L$ so that $x(t)$ becomes a well conditioned basis.

It is clear that the facts above will allow us to find a suitable state vector from data. The only remaining problem is to estimate the $k$-step ahead predictors. The true predictor $\hat{y}(t+k|t)$ is given by (4) and is a linear function of $u(i), y(i),\ i \leq t$. For practical reasons the predictor is approximated so that it only depends on $s$ past data, $t-s+1 \leq i \leq t$. It can then efficiently be determined by another linear least squares projection directly on the input-output data. That is, set up the model
$$y(t+k) = (\theta_{k,\ell,s})^T\varphi_s(t+1) + (\eta_{k,\ell,s})^T\tilde{\varphi}_\ell(t+1) + \varepsilon(t+k) \qquad (7)$$
where
$$\varphi_s(t+1) = [y(t), u(t), \ldots, y(t-s+1), u(t-s+1)]^T \qquad (8)$$
$$\tilde{\varphi}_\ell(t+1) = [u(t+1), \ldots, u(t+\ell)]^T \qquad (9)$$
Estimate $\theta$ and $\eta$ in (7) using least squares, giving $\hat{\theta}^N_{k,\ell,s}$ and $\hat{\eta}^N_{k,\ell,s}$. The $k$-step ahead predictor is then
$$\hat{y}_{s,\ell}(t+k|t) = (\hat{\theta}^N_{k,\ell,s})^T\varphi_s(t+1) \qquad (10)$$
For large enough $s$ this will give a good approximation of the true predictors. Let us also introduce the predictor that includes future $u$:
$$\bar{y}_{s,\ell}(t+k|t) = (\hat{\theta}^N_{k,\ell,s})^T\varphi_s(t+1) + (\hat{\eta}^N_{k,\ell,s})^T\tilde{\varphi}_\ell(t+1) \qquad (11)$$

Remark: The complication with the term $\eta$ has the following reason: The values of $u(t+1), \ldots, u(t+k)$ affect $y(t+k)$, but should not be included in the predictor, as demanded by (4). If $u$ is not white noise, future inputs can be predicted from past ones. Without the $\eta$-term in (7), the first term would then attempt to include the (predicted) effects of $\tilde{\varphi}_\ell(t+1)$ on $y(t+k)$, thus giving the wrong result.
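As an illustration of (7)-(10), the following numpy sketch estimates the $k$-step ahead predictor for scalar signals. The function name and data layout are ours; this is a sketch of the regression just described, not the authors' implementation.

```python
import numpy as np

def k_step_predictor(y, u, k, s, ell):
    """Least squares estimate of the k-step ahead predictor, cf. (7)-(10).

    Scalar y and u for simplicity. For each usable t the regressors are
      phi_s(t+1)    = [y(t), u(t), ..., y(t-s+1), u(t-s+1)]   (8)
      phitil_l(t+1) = [u(t+1), ..., u(t+ell)]                 (9)
    and the target is y(t+k). Returns the estimates of theta and eta in (7)
    and the predictions yhat(t+k|t) of (10), which drop the future-input term.
    """
    N = len(y)
    Phi, Phit, target = [], [], []
    for t in range(s - 1, N - max(k, ell)):
        Phi.append([v for j in range(s) for v in (y[t - j], u[t - j])])
        Phit.append([u[t + 1 + j] for j in range(ell)])
        target.append(y[t + k])
    R = np.hstack([np.array(Phi), np.array(Phit)])
    coef, *_ = np.linalg.lstsq(R, np.array(target), rcond=None)
    theta_hat, eta_hat = coef[:2 * s], coef[2 * s:]
    yhat = np.array(Phi) @ theta_hat          # (10)
    return theta_hat, eta_hat, yhat
```

Keeping eta_hat as well gives the second predictor (11), which retains the future-input term.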

The method thus consists of the following steps:

Basic Subspace Algorithm (12)

1. Estimate $\hat{y}_{s,\ell}(t+k|t)$, $k = 1, \ldots, r$, using (10).
2. Form $\hat{Y}_r(t)$ in (5).
3. Estimate its rank and determine $L$ in (6).
4. Estimate $A, B, C, D$ and the noise covariance matrices using (2).

What we have described now is the subspace projection approach to estimating the matrices of the state-space model (1), including the basis for the representation and the noise covariance matrices. There are a number of variants of this approach. See, among several references, e.g. [10], [2].

The approach gives very useful algorithms for model estimation, and is particularly well suited for multivariable systems. The algorithms also allow numerically very reliable implementations. They contain a number of choices and options, like how to choose $\ell$, $s$ and $r$, and also how to carry out step number 3. There are also several "tricks" to do step 4 so as to achieve consistent estimates even for finite values of $s$. Accordingly, several variants of this method exist. In the following sections we shall give more algorithmic details around this approach.

3 Efficient calculation of $\hat{Y}_r(t)$

Let us now consider in somewhat more detail the subspace algorithm (12). In fact, there are many variants of these algorithms, and for a comprehensive treatment we refer to [11]. Here we shall only point to the essential features, and to what elements account for the consistency. Recall the basic estimates (7)-(11). The corresponding vectors of stacked predictors will be denoted by

$$\hat{Y}_r^{s,\ell}(t+1) = \begin{pmatrix} \hat{y}_{s,\ell}(t+1|t) \\ \vdots \\ \hat{y}_{s,\ell}(t+r|t) \end{pmatrix} \qquad (13)$$
and $\bar{Y}_r^{s,\ell}(t+1)$ analogously. Clearly, we can treat all predictors simultaneously: Let

$$Y_r(t+1) = \begin{pmatrix} y(t+1) \\ \vdots \\ y(t+r) \end{pmatrix} \qquad (14)$$
and stack $r$ equations like (7) on top of each other:
$$Y_r(t+1) = \Theta_{r,\ell,s}\,\varphi_s(t+1) + \Gamma_{r,\ell,s}\,\tilde{\varphi}_\ell(t+1) + E(t+1) \qquad (15)$$
Estimate the $pr \times s(p+m)$ matrix $\Theta_{r,\ell,s}$ (together with $\Gamma_{r,\ell,s}$) by least squares and then form
$$\hat{Y}_r^{s,\ell}(t+1) = \hat{\Theta}^N_{r,\ell,s}\,\varphi_s(t+1) \qquad (16)$$
$$\bar{Y}_r^{s,\ell}(t+1) = \hat{\Theta}^N_{r,\ell,s}\,\varphi_s(t+1) + \hat{\Gamma}^N_{r,\ell,s}\,\tilde{\varphi}_\ell(t+1) \qquad (17)$$
In fact, these quantities can be efficiently calculated by projections using the data vectors, without explicitly forming the matrices $\hat{\Theta}$ and $\hat{\Gamma}$.
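A sketch of how (15)-(17) can be carried out with a single least squares solve, assuming the regressor matrices have already been built (names and layout are ours):

```python
import numpy as np

# Phi  : N x s(p+m) matrix with rows phi_s(t+1)
# Phit : N x ell*m  matrix with rows phitilde_ell(t+1)
# Yr   : N x r*p    matrix with rows [y(t+1), ..., y(t+r)]
def stacked_predictors(Phi, Phit, Yr):
    R = np.hstack([Phi, Phit])
    coef, *_ = np.linalg.lstsq(R, Yr, rcond=None)     # solves (15)
    Theta_hat = coef[:Phi.shape[1]]                   # past-data block
    Gamma_hat = coef[Phi.shape[1]:]                   # future-input block
    Y_hat = Phi @ Theta_hat                           # (16): rows Yhat_r(t+1)^T
    Y_bar = Y_hat + Phit @ Gamma_hat                  # (17): includes future u
    return Y_hat, Y_bar
```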

4 Choice of order and basis

We now have the ($pr$-dimensional) vector $\hat{Y}_r(t)$ for $t = 1, \ldots, N$. If the system is of order $n$, this vector sequence has rank $n$, and we should select a basis for it by forming linear combinations
$$x(t) = L\hat{Y}_r(t) \qquad (18)$$
so that $x$ becomes well conditioned. To find the rank of $\{\hat{Y}_r(t),\ t = 1, \ldots, N\}$ we would form the ($pr \times N$) matrix
$$\mathbf{Y}_N = [\hat{Y}_r(1), \ldots, \hat{Y}_r(N)] \qquad (19)$$
We could then perform an SVD on $\mathbf{Y}_N$:
$$\mathbf{Y}_N = USV^T \qquad (20)$$
For added flexibility and options, the SVD could be carried out on a weighted version of $\mathbf{Y}_N$:
$$W_1\mathbf{Y}_N W_2 = USV^T \qquad (21)$$
(Here $W_1$ is a $pr \times pr$ matrix, while $W_2$ is $N \times N$.) In the sequel we will not use these weighting matrices, though. We would now examine the singular values in $S$ and put those below a certain threshold to zero. Denote the number of singular values above the threshold by $n$. Let $S_1$ be the upper $n \times n$ part of $S$. The corresponding $n$ columns of $U$ will be denoted by $U_1$ (thus a $pr \times n$ matrix), and the corresponding columns of $V$ are $V_1$, so that
$$USV^T \approx U_1 S_1 V_1^T \qquad (22)$$
There are now several candidates for the matrix $L$ in (18). The most common choice seems to be
$$L = (U_1 S_1^{1/2})^\dagger \qquad (23)$$
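In numpy terms, steps (19)-(23) might look as follows (unweighted case, threshold chosen by the user; the names are ours):

```python
import numpy as np

def choose_basis(YN, threshold):
    """Order and basis selection as in (19)-(23), without weightings.

    YN : pr x N matrix [Yhat_r(1), ..., Yhat_r(N)].
    Returns the estimated order n and the matrix L of (18).
    """
    U, sv, _ = np.linalg.svd(YN, full_matrices=False)    # (20)
    n = int(np.sum(sv > threshold))                      # kept singular values
    U1 = U[:, :n]                                        # pr x n, cf. (22)
    S1_sqrt = np.diag(np.sqrt(sv[:n]))                   # S_1^{1/2}
    L = np.linalg.pinv(U1 @ S1_sqrt)                     # (23), pseudoinverse
    return n, L
```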

5 Relationships for the true predictors

Suppose the true system can be described as (3), with its $k$-step ahead predictor given by (4). An immediate consequence is that
$$\hat{y}(t+k|t) = \hat{y}(t+k|t-1) + h_u(k)u(t) + h_e(k)e(t) \qquad (24)$$
Suppose now that the true system can be written as a difference equation
$$A_0(q)y(t) = B_0(q)u(t) + C_0(q)e(t) \qquad (25)$$
where the polynomials in the shift operator are all of degree at most $n$:
$$A_0(q) = I + A_1 q^{-1} + \ldots + A_n q^{-n} \qquad (26)$$
Then
$$\hat{y}(t+r|t) + A_1\hat{y}(t+r-1|t) + \ldots + A_n\hat{y}(t+r-n|t) = 0 \qquad (27)$$
for any $r > n$. The proof is immediate: Take equation (25) at time $t+r$ and project it onto the space spanned by $\{e(s), u(s),\ s \leq t\}$ (ignoring that future $u$ might be predicted from past values). The left hand side equals the left hand side of (27), while the right hand side is zero if $r > n$.

Let us now confine the discussion to the single output case. (It can be exactly transferred to the multi-output case, at the expense of somewhat more complex expressions.)

(4)

Now, look at $\hat{Y}_r(t)$ made up of the true predictors. Equation (27) means that any row $k > n$ can be written as a linear combination of the rows above it. The rank of $\hat{Y}_r(t)$ is thus at most $n$, and a possible basis is formed by the $n$ first rows. Let us first choose an $L$ that picks these rows:
$$x(t) = L\hat{Y}_r(t) = \hat{Y}_n(t) \qquad (28)$$
For component $k < n$ of this state vector we thus have
$$\begin{aligned} x_k(t+1) &= \hat{y}(t+k|t) \\ &= \hat{y}(t+k|t-1) + h_u(k)u(t) + h_e(k)e(t) \\ &= x_{k+1}(t) + h_u(k)u(t) + h_e(k)e(t) \end{aligned}$$
while for the last component
$$\begin{aligned} x_n(t+1) &= \hat{y}(t+n|t) \\ &= \hat{y}(t+n|t-1) + h_u(n)u(t) + h_e(n)e(t) \\ &= -a_1\hat{y}(t+n-1|t-1) - \ldots - a_n\hat{y}(t|t-1) + h_u(n)u(t) + h_e(n)e(t) \\ &= -a_1 x_n(t) - \ldots - a_n x_1(t) + h_u(n)u(t) + h_e(n)e(t) \end{aligned}$$

using (24) and (27). In matrix notation we have
$$x(t+1) = A_c x(t) + \begin{pmatrix} h_u(1) \\ h_u(2) \\ \vdots \\ h_u(n) \end{pmatrix}u(t) + \begin{pmatrix} h_e(1) \\ h_e(2) \\ \vdots \\ h_e(n) \end{pmatrix}e(t)$$
$$y(t) = C_c x(t) + h_u(0)u(t) + h_e(0)e(t)$$

where
$$A_c = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ -a_n & -a_{n-1} & -a_{n-2} & \cdots & -a_1 \end{pmatrix}, \qquad C_c = (1\ \ 0\ \ 0\ \ \cdots\ \ 0)$$

This is of course the standard observability canonical form. See 4].
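For reference, a small numpy helper that assembles this realization from given coefficients $a_1, \ldots, a_n$ and impulse response samples (scalar-output case; the names and calling convention are our own):

```python
import numpy as np

def observability_canonical(a, hu, he):
    """Assemble A_c, the input/noise vectors and C_c from a = [a1, ..., an]
    and impulse responses hu, he (indexed so that hu[k] is h_u(k)).

    Single-output sketch of the realization derived above.
    """
    n = len(a)
    Ac = np.zeros((n, n))
    Ac[:-1, 1:] = np.eye(n - 1)              # superdiagonal of ones
    Ac[-1, :] = -np.asarray(a)[::-1]         # last row: -a_n, ..., -a_1
    Bu = np.asarray([hu[k] for k in range(1, n + 1)])   # h_u(1), ..., h_u(n)
    Be = np.asarray([he[k] for k in range(1, n + 1)])   # h_e(1), ..., h_e(n)
    Cc = np.eye(1, n)                        # (1 0 ... 0)
    return Ac, Bu, Be, Cc
```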

Suppose now that we pick another $L$ in (28). Let us first note that, since the first $n$ rows form a basis, we can always write
$$\hat{Y}_r(t) = F\hat{Y}_n(t) = Fx(t)$$
for some $pr \times n$ matrix $F$. Choosing an arbitrary matrix $L$ to select the basis gives
$$\tilde{x}(t) = L\hat{Y}_r(t) = (LF)x(t) \qquad (29)$$
This shows that the new state vector will be a linear map of (28), so carrying out the update equations for $\tilde{x}$ will give us the same system, in a coordinate basis that corresponds to the similarity transformation $(LF)$.
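The similarity-transformation statement can be spelled out as follows; a generic sketch in which $T$ plays the role of $LF$ (assumed invertible):

```python
import numpy as np

def change_basis(A, B, C, T):
    """Similarity transform x_tilde = T x of a realization (A, B, C).

    The transformed realization describes the same input-output system.
    """
    Ti = np.linalg.inv(T)
    return T @ A @ Ti, T @ B, C @ Ti
```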

6 Consistency as the model order tends to infinity

To investigate consistency, somewhat heuristically, we shall assume that the number of data, $N$, is so large that all estimates are effectively equal to their limiting values. In this section we shall also assume that the model order $s$ used in (7) is so large that the influence on $y(t+k)$ from input-output data older than $t-s$ is negligible. Alternatively, we may assume that the true system can be described by an ARX-model of order $s$. We shall also assume that the future input horizon $\ell$ in (7) is chosen so that $\ell \geq r$, so that all effects of the inputs $u(i)$, $t+1 \leq i \leq t+k$, on $y(t+k)$ are properly accounted for, for $k$ up to $r$. All this means that (7) is a model structure that is capable of describing the $k$-step ahead predictors correctly. The estimates (10) will thus be the correct predictors.

For large $N$ and large $s$, the vector $\hat{Y}_r(t)$ will consequently have all the properties (24)-(29). The algorithm will therefore give the correct system description under these conditions. Note that this is true for all choices of $L$ that determine a basis for $\hat{Y}_r(t)$.

This approach can be seen as a variant of the two-stage methods described in Section 10.4 of [5]: First use a high order ARX-model to pick up the correct system dynamics (including the noise dynamics), then reduce the model order by forcing the higher order model to fit a lower order one. In the Mayne-Firoozan method [6], the innovations are used explicitly for this. In the subspace method, the predictions are used in a related way.

7 Estimating only A and C

We can make the discussion in the previous section more focused on the essential matters by concentrating on estimating $A$ and $C$. Once these matrices are fixed, estimating $B$ and $D$ in (1) is a linear regression problem, even for unknown $x$. (See (53), below.) We then lump the dependencies on future inputs into a term $\Lambda\tilde{\varphi}$ and set up a regression like

$$\begin{pmatrix} x(t+1) \\ y(t) \end{pmatrix} = \begin{pmatrix} A \\ C \end{pmatrix}x(t) + \Lambda\tilde{\varphi}_{\ell+1}(t) \qquad (30)$$
If we use the $k$-step ahead predictors $\hat{y}(t+k|t)$ for the states $x$ as in (28), we then find, exactly as above, that $A$ and $C$ will be consistently estimated if only the following three relationships hold:
$$\hat{y}(t+k|t) = \hat{y}(t+k|t-1) + \Lambda_1\tilde{\varphi}_{\ell+1}(t) + \xi(t) \qquad (31)$$
$$\hat{y}(t+n+1|t) = -a_1\hat{y}(t+n|t) - \ldots - a_n\hat{y}(t+1|t) + \Lambda_2\tilde{\varphi}_{\ell+1}(t) \qquad (32)$$
$$\operatorname{rank}\{\hat{Y}_n(t+1)\} = n \qquad (33)$$
for some $\Lambda_i$ and a sequence $\xi(t)$ that is uncorrelated with $\tilde{\varphi}_{\ell+1}(t)$ and $\varphi_s(t)$. These relationships clearly hold for the true $k$-step ahead predictors, as verified by (24), (27) and (6).

8 Modifications to achieve consistency even for finite model orders

The subspace methods can go one step further and achieve consistent estimates of the $A$, $B$, $C$ and $D$-matrices even without letting the model order $s$ tend to infinity. This is technically more involved. The basic idea is to establish that the key relations (31) and (32) will hold also for the approximate predictors $\hat{y}_{s,\ell}(t+k|t)$ and $\bar{y}_{s,\ell}(t+k|t)$ (defined by (10) and (11)), if only we play carefully with the "subscript orders" $s$ and $\ell$. We start by establishing two lemmas for these quantities, which show properties analogous to Levinson type recursions.

Lemma 1. Suppose that the true system can be described by (25), with $n$ as the maximal order of the polynomials, and that the system operates in open loop, so that $e$ and $u$ are independent. Let $\hat{y}_{s,\ell}(t+k|t)$ and $\bar{y}_{s,\ell}(t+k|t)$ be the limits of (10) and (11) as $N \to \infty$. Then for any $s$, any $r > n$ and any $\ell \geq r$
$$\hat{y}_{s,\ell}(t+r|t) + A_1\hat{y}_{s,\ell}(t+r-1|t) + \ldots + A_n\hat{y}_{s,\ell}(t+r-n|t) = 0 \qquad (34)$$
and
$$\bar{y}_{s,\ell}(t+r|t) + A_1\bar{y}_{s,\ell}(t+r-1|t) + \ldots + A_n\bar{y}_{s,\ell}(t+r-n|t) = B_0 u(t+r) + B_1 u(t+r-1) + \ldots + B_n u(t+r-n) \qquad (35)$$

Proof: Consider the equation (7). Suppress the indices $\ell$ and $s$ and let
$$\varphi(t+1) = \begin{pmatrix} \tilde{\varphi}_\ell(t+1) \\ \varphi_s(t+1) \end{pmatrix} \qquad (36)$$
Let $\zeta(t+1)$ be any vector of the same dimension as $\varphi(t+1)$ such that
$$E\,\zeta(t+1)\,C_0(q)e(t+r) = 0 \qquad (37)$$
Suppose that $\theta_k$ and $\eta_k$ are estimated from (7) using the IV-method with instruments $\zeta(t+1)$. Then the limiting estimates are given by
$$\begin{pmatrix} \eta_k \\ \theta_k \end{pmatrix}^T = E\,y(t+k)\zeta^T(t+1)\,\left[E\,\varphi(t+1)\zeta^T(t+1)\right]^{-1} \qquad (38)$$
Note also that we can write, for some $\mu_0$,
$$B_0(q)u(t+r) = \mu_0^T\tilde{\varphi}_\ell(t+1) \qquad (39)$$
if $r > n$ and $\ell \geq r$. Hence
$$\begin{aligned} &\begin{pmatrix} \eta_r \\ \theta_r \end{pmatrix}^T + A_1\begin{pmatrix} \eta_{r-1} \\ \theta_{r-1} \end{pmatrix}^T + \ldots + A_n\begin{pmatrix} \eta_{r-n} \\ \theta_{r-n} \end{pmatrix}^T \\ &= E\left[(A_0(q)y(t+r))\,\zeta^T(t+1)\right]\left[E\,\varphi(t+1)\zeta^T(t+1)\right]^{-1} \\ &= E\left[(B_0(q)u(t+r) + C_0(q)e(t+r))\,\zeta^T(t+1)\right]\cdot\left[E\,\varphi(t+1)\zeta^T(t+1)\right]^{-1} \\ &= \mu_0^T\,E\left[\tilde{\varphi}(t+1)\zeta^T(t+1)\right]\begin{pmatrix} E\,\tilde{\varphi}(t+1)\zeta^T(t+1) \\ E\,\varphi_s(t+1)\zeta^T(t+1) \end{pmatrix}^{-1} = \mu_0^T(I\ \ 0) = (\mu_0^T\ \ 0) \end{aligned}$$
Here we used (39) and (37) in the third last step, and the definition of a matrix inverse in the second last step. Since
$$\bar{y}_{s,\ell}(t+k|t) = \begin{pmatrix} \eta_k \\ \theta_k \end{pmatrix}^T\varphi(t+1), \qquad \hat{y}_{s,\ell}(t+k|t) = \begin{pmatrix} \eta_k \\ \theta_k \end{pmatrix}^T\begin{pmatrix} 0 \\ \varphi_s(t+1) \end{pmatrix}$$
we just need to multiply the above expression with $\varphi(t+1)$ to obtain the stated result.

It now only remains to show that $\zeta(t+1) = \varphi(t+1)$ obeys (37), so that the result holds for the least squares estimates. The vector $\varphi(t+1)$ contains a number of inputs, which are uncorrelated with the noise under open loop operation. It also contains $y(t)$ and older values of $y$, which are uncorrelated with $C_0(q)e(t+r)$ if $r > n$, since the order of $C_0$ is at most $n$. This concludes the proof.

Corollary: Suppose that the true system is given by
$$A_0(q)y(t) = B_0(q)u(t) + v(t) \qquad (40)$$
and that the parameters of the predictors are estimated from (7) using an instrumental variable method with instruments $\zeta(t+1)$ that are uncorrelated with $v(t+r)$. Then the result of the lemma still holds.

Notice that the Lemma holds for any $s$, which could be smaller than $n$.

Lemma 2. Let $\bar{y}$ and $\hat{y}$ be defined as above. (These thus depend on $N$, but this subscript is suppressed.) Then for any $N$, $s$, $k$ and any $\ell$
$$y(t) = \bar{y}_{s,\ell+1}(t|t-1) + \varepsilon(t) \qquad (41)$$
$$\bar{y}_{s+1,\ell}(t+k|t) = \bar{y}_{s,\ell+1}(t+k|t-1) + \tilde{h}^{s+1,\ell}_{k,N}\,\varepsilon(t) \qquad (42)$$
$$\hat{y}_{s+1,\ell}(t+k|t) = \hat{y}_{s,\ell+1}(t+k|t-1) + b^{s+1,\ell}_{k,N}\,u(t) + (\gamma^{s,\ell}_{k,N})^T\tilde{\varphi}_\ell(t+1) + \tilde{h}^{s+1,\ell}_{k,N}\,\varepsilon(t) \qquad (43)$$
where $\varepsilon(t)$ (the same in the three expressions) is uncorrelated with $\varphi_s(t)$, $u(t)$ and $\tilde{\varphi}_\ell(t+1)$, $t = 1, \ldots, N$. If the input sequence $\{u(t)\}$ is white, then
$$b^{s+1,\ell}_{k,N} \to h_u(k) \quad\text{and}\quad \gamma^{s,\ell}_{k,N} \to 0 \quad\text{as } N \to \infty \qquad (44)$$
where $h_u(k)$ is the true impulse response coefficient.

Proof: Let
$$\psi_1(t+1) = \begin{pmatrix} \tilde{\varphi}_\ell(t+1) \\ \varphi_{s+1}(t+1) \end{pmatrix} \quad\text{and}\quad \psi_2(t+1) = \begin{pmatrix} \tilde{\varphi}_{\ell+1}(t) \\ \varphi_s(t) \end{pmatrix}$$
The vector $\psi_2(t+1)$ contains the values $u(i),\ i = t+\ell, \ldots, t-s$, and $y(i),\ i = t-1, \ldots, t-s$. The vector $\psi_1(t+1)$ contains the same values, and in addition $y(t)$. Define $\varepsilon$ as the residuals from the least squares fit
$$y(t) = L_2\psi_2(t+1) + \varepsilon(t)$$
so that $\varepsilon(t) \perp \psi_2(t+1)$. With this we mean that
$$\sum_{t=1}^{N}\varepsilon(t)\psi_2(t+1) = 0$$
Note that $\varepsilon(t)$ will depend on $\ell$ and $s$, but not on $k$. Moreover, by definition we find that
$$L_2\psi_2(t+1) = \bar{y}_{s,\ell+1}(t|t-1) \qquad (45)$$
so the first result of the lemma has been proved.

Let
$$\psi_3(t+1) = \begin{pmatrix} \psi_2(t+1) \\ \varepsilon(t) \end{pmatrix}$$
It is clear that $\psi_1$ and $\psi_3$ span the same space, so that for some matrix $R$ (built up using $L_2$) we can write
$$\psi_1(t+1) = R\,\psi_3(t+1)$$
Now write
$$y(t+k) = \hat{K}_1\psi_1(t+1) + \tilde{\varepsilon}(t+k) \qquad (46)$$
where $\hat{K}_1$ is the LS-estimate, so that
$$\tilde{\varepsilon}(t+k) \perp \psi_1(t+1)$$
Let
$$\hat{K}_1 R = (K_2\ \ K_3) \qquad (47)$$
Clearly, by definition
$$\bar{y}_{s+1,\ell}(t+k|t) = \hat{K}_1\psi_1(t+1) = \hat{K}_1 R\,\psi_3(t+1) = K_2\psi_2(t+1) + K_3\varepsilon(t) \qquad (48)$$
Now rewrite (46) as
$$y(t+k) = \hat{K}_1\psi_1(t+1) + \tilde{\varepsilon}(t+k) = \hat{K}_1 R\,\psi_3(t+1) + \tilde{\varepsilon}(t+k) = K_2\psi_2(t+1) + K_3\varepsilon(t) + \tilde{\varepsilon}(t+k)$$
Both $\varepsilon(t)$ and $\tilde{\varepsilon}(t+k)$ are orthogonal to $\psi_2(t+1)$, so $K_2$ must be the least squares fit of $y(t+k)$ to $\psi_2(t+1)$, which means that
$$\bar{y}_{s,\ell+1}(t+k|t-1) = K_2\psi_2(t+1) \qquad (49)$$
Comparing (48) with (49) we have shown (42), with $\tilde{h}^{s+1,\ell}_{k,N} = K_3$. Moreover,
$$\bar{y}_{s+1,\ell}(t+k|t) = \hat{y}_{s+1,\ell}(t+k|t) + M_1\tilde{\varphi}_\ell(t+1)$$
$$\bar{y}_{s,\ell+1}(t+k|t-1) = \hat{y}_{s,\ell+1}(t+k|t-1) + M_2\tilde{\varphi}_{\ell+1}(t) = \hat{y}_{s,\ell+1}(t+k|t-1) + b_k u(t) + M_3\tilde{\varphi}_\ell(t+1)$$
(with $b_k$ being the first column of $M_2$). Applying (42) to the left hand sides of these two expressions, we have also proved (43), with $(\gamma^{s,\ell}_{k,N})^T = M_3 - M_1$ and $b^{s+1,\ell}_{k,N} = b_k$. The proof of (44) is straightforward and omitted here, since we will not need this result for the ensuing discussion. This concludes the proof of Lemma 2.

9 A consistent finite order Subspace Method

Based on Lemmas 1 and 2, several different subspace methods can be derived that consistently estimate the $A$, $B$, $C$ and $D$-matrices of a state-space model, even without letting the underlying model order tend to infinity. We here give just one such algorithm, which is a slight variant of the N4SID-method of [8]. It has the same first steps as the basic algorithm (12) of Section 2.

Algorithm SUBSP

1. Estimate $\bar{Y}_r^{s+1,\ell}(t)$, $\bar{Y}_r^{s,\ell+1}(t)$ and $\hat{Y}_r^{s,\ell+1}(t)$ using (16) and (17), with $\ell \geq r > \bar{n}$, where $\bar{n}$ is an upper bound of the system order.

2. Estimate the rank of $\hat{Y}_r^{s,\ell+1}$ and determine $L$ in (6), e.g. as in (19)-(23).

3. Introduce
$$\bar{x}(t+1) = L\bar{Y}_r^{s+1,\ell}(t+1) \qquad (50)$$
$$x(t) = L\bar{Y}_r^{s,\ell+1}(t) \qquad (51)$$

4. Estimate $A$, $C$ and $\Lambda$ in
$$\begin{pmatrix} \bar{x}(t+1) \\ y(t) \end{pmatrix} = \begin{pmatrix} A \\ C \end{pmatrix}x(t) + \Lambda\tilde{\varphi}_{\ell+1}(t) \qquad (52)$$
using the least squares method. (Notice the difference with (30). We here use "pseudostates" $\bar{x}$ and $x$ to make use of the subscript differences required to apply Lemma 2.)

References
