Weighting Matrices in Larimore Type of Subspace
Algorithms

Dietmar Bauer and Lennart Ljung
Department of Electrical Engineering
Linköping University, S-581 83 Linköping, Sweden
WWW: http://www.control.isy.liu.se
Email: ljung@isy.liu.se
2000-10-13
Report no.: LiTH-ISY-R-2305
Submitted to Automatica

Technical reports from the Automatic Control group in Linköping are available
by anonymous ftp at the address ftp.control.isy.liu.se.
Weighting matrices in Larimore type of subspace algorithms

Dietmar Bauer
Institute f. Econometrics, Operations Research and System Theory
TU Wien, Argentinierstr. 8, A-1040 Wien

Lennart Ljung
Division of Automatic Control, Department of Electrical Engineering,
Linköping University, SE-581 83 Linköping, Sweden,
e-mail: ljung@isy.liu.se

October 4, 2000
Abstract
In this paper the effect of some weighting matrices on the asymptotic variance of the
estimates of linear discrete time state space systems estimated using subspace methods
is investigated. The analysis deals with systems with white observed inputs or without
observed inputs and refers to the Larimore type of subspace procedures. The main result
expresses the asymptotic variance of the system matrix estimates in canonical form as a
function of some of the user choices, clarifying the question of how to choose them
optimally. It is shown that the CCA weighting scheme leads to optimal accuracy. The
expressions for the asymptotic variance can be implemented more efficiently than the
ones previously published.

Keywords: linear systems, discrete time systems, subspace methods, asymptotic variance
1 Introduction

Subspace algorithms are used for the estimation of linear, time invariant, discrete time,
finite dimensional black box state space models. The algorithms can be roughly divided
into Larimore type algorithms (Larimore, 1983) (one algorithm in this class is usually
called CCA, canonical correlation analysis), which estimate the state in the first step
and then extract the estimates of the system matrices from this state estimate, and MOESP
(multivariable output error state space) type algorithms (Verhaegen, 1994), which
estimate the observability matrix and use this estimate to obtain estimates of the system
matrices. The asymptotic properties of the Larimore type of approach have been derived in
a series of papers: (Peternell et al., 1996) derive the consistency, (Bauer et al., 1999)
prove asymptotic normality in the case of no observed inputs, and (Bauer, 1998) deals
with the general case. For the MOESP type of procedure, consistency and asymptotic
normality are dealt with in (Bauer and Jansson, 2000), while preliminary results on
consistency can also be found in (Jansson and Wahlberg, 1998; Verhaegen, 1994). The
asymptotic normality proof is very constructive in both cases, which led to formulas for
the asymptotic variance. However, these expressions were too complicated to directly
provide insight into the effect of certain user choices. Recently, simplifications of
these formulas have been found independently in (Jansson, 2000) for the MOESP case and in
(Bauer et al., 2000) for the Larimore type of procedures. These simpler expressions lie
at the heart of this paper, which derives the corresponding variance expressions as a
function of a certain weighting matrix. This expression can be used to optimize the user
choice with respect to the asymptotic accuracy of the estimated system.

The paper is organised as follows: In the next section the model set and the assumptions
are stated, and a short overview of the estimation algorithms is given. Section 3
presents the main results. Section 4 demonstrates the results in some numerical examples.
Finally, Section 5 concludes.
Throughout the paper the following notation will be used: $I_n$ denotes the $n \times n$
identity matrix, $0_{a \times b}$ the null matrix of the respective dimensions. Further,
$f_T = O(g_T)$, $g_T > 0$, means that $f_T/g_T$ remains bounded as $T \to \infty$, and
$f_T = o(g_T)$ means that $f_T/g_T \to 0$ a.s. Here $T$ is used to denote the sample
size. Convergence is denoted as usual with $\to$ and is always meant to be a.s. if not
stated explicitly. Prime is used to denote transposition of matrices. The Kronecker
product between two matrices $A$ and $B$ is denoted as $A \otimes B$. Finally,
$Q_T = \sqrt{T^{-1}\log\log T}$ is used, and $\doteq$ denotes equality up to terms of
order $o(T^{-1/2})$.
2 Model Set, Assumptions and Algorithm
This paper deals with linear, finite dimensional, discrete time, time invariant, state
space systems of the form

$$x_{t+1} = A x_t + B u_t + K \varepsilon_t, \qquad y_t = C x_t + D u_t + \varepsilon_t \qquad (1)$$

where $y_t \in \mathbb{R}^s$ denotes the observed output process, $u_t \in \mathbb{R}^m$
denotes the observed input process, and $\varepsilon_t \in \mathbb{R}^s$ the unobserved
white noise sequence. $x_t \in \mathbb{R}^n$ is the state sequence. Thus the true order
of the system is denoted by $n$. Here $A \in \mathbb{R}^{n \times n}$,
$B \in \mathbb{R}^{n \times m}$, $C \in \mathbb{R}^{s \times n}$,
$D \in \mathbb{R}^{s \times m}$, $K \in \mathbb{R}^{n \times s}$ are real matrices. The
system is assumed to be stable, i.e. all eigenvalues of $A$ are assumed to lie inside the
unit circle, and strictly minimum-phase, i.e. the eigenvalues of $A - KC$ are assumed to
lie inside the unit circle. The system matrices correspond to a pair of transfer
functions: Let $H(q) = I_s + C(qI_n - A)^{-1}K$ and let $G(q) = D + C(qI_n - A)^{-1}B$,
where $q$ denotes the forward shift operator. Furthermore, let $M_n$ denote the set of
all pairs of transfer functions that permit a state space representation of the form (1)
fulfilling the stability and the strict minimum-phase assumption on $H(q)$.
The white noise $\varepsilon_t$ is for simplicity assumed to be i.i.d. with mean zero,
nonsingular variance and finite fourth moments. More general assumptions in a martingale
difference framework can be found in (Bauer et al., 1999). The input is assumed to be
i.i.d. with mean zero and nonsingular variance, also having finite fourth moments. Input
and noise are assumed to be independent.
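As an illustration, the innovations form (1) can be simulated directly. The following
minimal sketch uses arbitrary placeholder matrices (not values from this paper) chosen so
that both the stability and the strict minimum-phase assumptions hold:

```python
import numpy as np

# Sketch: simulate model (1),
#   x_{t+1} = A x_t + B u_t + K e_t,   y_t = C x_t + D u_t + e_t,
# with illustrative (not paper-given) matrices and white input/noise.
rng = np.random.default_rng(0)
n, m, s, T = 2, 1, 1, 500
A = np.array([[0.5, 0.2], [0.0, 0.3]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
K = np.array([[0.4], [0.1]])

# both eigenvalue assumptions of the model set hold for this choice
assert max(abs(np.linalg.eigvals(A))) < 1
assert max(abs(np.linalg.eigvals(A - K @ C))) < 1

def simulate(A, B, C, D, K, u, e):
    """Run (1) forward from x_0 = 0 and return the output sequence."""
    x = np.zeros(A.shape[0])
    y = np.empty((len(e), C.shape[0]))
    for t in range(len(e)):
        y[t] = C @ x + D @ u[t] + e[t]
        x = A @ x + B @ u[t] + K @ e[t]
    return y

u = rng.standard_normal((T, m))  # white observed input
e = rng.standard_normal((T, s))  # white innovations
y = simulate(A, B, C, D, K, u, e)
```

Data generated this way is used in the sketches of the algorithm steps below.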
The subspace algorithms are based on the following stacked notation (see e.g. Bauer,
1998, Chapter 3): Let $Y^+_{t,f} = [y_t', y_{t+1}', \ldots, y_{t+f-1}']'$ and let
$U^+_{t,f}$ and $E^+_{t,f}$ be constructed analogously using $u_t$ and $\varepsilon_t$
respectively in the place of $y_t$. Let
$Z_{t,p} = [y_{t-1}', u_{t-1}', \ldots, y_{t-p}', u_{t-p}']'$. Here $f$ and $p$ are two
integer parameters, which have to be chosen by the user. See below for assumptions on the
choice of these integers. Then it follows from the system equations (1) that

$$Y^+_{t,f} = \mathcal{O}_f \mathcal{K}_p Z_{t,p} + \mathcal{U}_f U^+_{t,f} + \mathcal{E}_f E^+_{t,f} + \mathcal{O}_f (A - KC)^p x_{t-p}$$

Here $\mathcal{O}_f' = [C', A'C', \ldots, (A^{f-1})'C']$ and
$\mathcal{K}_p = [[K, B - KD], (A - KC)[K, B - KD], \ldots, (A - KC)^{p-1}[K, B - KD]]$.
Further, $\mathcal{U}_f$ is the matrix containing
$[CA^{j-2}B, \ldots, CB, D, 0_{s \times (f-j)m}]$ as its $j$-th block row, and
$\mathcal{E}_f$ contains $[CA^{j-2}K, \ldots, CK, I_s, 0_{s \times (f-j)s}]$ as its
$j$-th block row.
This equation builds the basis for all subspace algorithms, which can be described as
follows:

1. Regress $Y^+_{t,f}$ onto $U^+_{t,f}$ and $Z_{t,p}$ to obtain an estimate
$\hat\beta_z$ of $\mathcal{O}_f \mathcal{K}_p$ and an estimate $\hat\beta_u$ of
$\mathcal{U}_f$ respectively. Due to finite sample effects, $\hat\beta_z$ will typically
be of full rank.

2. For given $n$, find a rank $n$ approximation of $\hat\beta_z$ by using the SVD of
$\hat W^+_f \hat\beta_z \hat W_p = \hat U_n \hat\Sigma_n \hat V_n' + \hat R$. Here
$\hat\Sigma_n$ denotes the diagonal matrix containing the largest $n$ singular values in
decreasing order, $\hat U_n$ contains the corresponding left singular vectors as columns
and $\hat V_n$ the corresponding right singular vectors. Finally, $\hat R$ accounts for
the neglected singular values. The matrices $\hat W^+_f$ and $\hat W_p$ are weighting
matrices, which are chosen by the user. Further details are given below; for the moment
it is sufficient to note that these possibly data dependent matrices are assumed to be
nonsingular (a.s.). This leads to an approximation
$\hat{\mathcal{O}}_f \hat{\mathcal{K}}_p = (\hat W^+_f)^{-1} \hat U_n \hat\Sigma_n \hat V_n' (\hat W_p)^{-1}$.
The actual decomposition of this matrix into $\hat{\mathcal{O}}_f$ and
$\hat{\mathcal{K}}_p$ has no influence on the estimated transfer functions.

3. Using the estimates $\hat{\mathcal{O}}_f$, $\hat{\mathcal{K}}_p$ and $\hat\beta_u$,
obtain the system matrix estimates.
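To make steps 1-2 concrete, the following sketch runs the regression and the weighted SVD
for the no-input case (so $U^+_{t,f}$ drops out and the weights reduce to the sample
covariances of the stacked future and past). The system, all sizes, and the helper
`sym_pow` are illustrative assumptions, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
s, n, f, p, T = 1, 2, 5, 5, 3000

# illustrative stable, strictly minimum-phase system without observed input
A = np.array([[0.7, 0.3], [0.0, 0.5]])
C = np.array([[1.0, 0.0]])
K = np.array([[0.5], [0.2]])

e = rng.standard_normal((T, s))
y = np.empty((T, s))
x = np.zeros(n)
for t in range(T):
    y[t] = C @ x + e[t]
    x = A @ x + K @ e[t]

def sym_pow(M, a):
    """Symmetric power M^a of a positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** a) @ V.T

# columns of Yf are Y^+_{t,f}; columns of Zp are Z_{t,p} (most recent lag first)
idx = range(p, T - f + 1)
Yf = np.column_stack([y[t:t + f].ravel() for t in idx])
Zp = np.column_stack([y[t - p:t][::-1].ravel() for t in idx])
N = Yf.shape[1]

Gf, Gp = Yf @ Yf.T / N, Zp @ Zp.T / N
beta_z = (Yf @ Zp.T / N) @ np.linalg.inv(Gp)   # step 1: OLS estimate of O_f K_p

# step 2: SVD of W_f^+ beta_z W_p with CCA-type weights (no-input case)
U, sv, Vt = np.linalg.svd(sym_pow(Gf, -0.5) @ beta_z @ sym_pow(Gp, 0.5))
K_hat = Vt[:n] @ sym_pow(Gp, -0.5)             # rank-n estimate of K_p (in some basis)
x_hat = K_hat @ Zp                             # estimated state sequence
```

With these weights, the singular values are the sample canonical correlations between
stacked future and past, so they lie in $[0, 1]$; the state estimate is then passed to
step 3.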
In the second step an order has to be specified. Also the matrices $\hat W^+_f$ and
$\hat W_p$ have to be chosen. Corresponding to the matrix $\hat W_p$, typical choices are
$(\hat\Gamma_p)^{1/2}$ and $(\hat\Gamma^\perp_p)^{1/2}$, where
$\hat\Gamma_p = \frac{1}{T}\sum_{t=p+1}^{T} Z_{t,p}(Z_{t,p})'$ denotes the sample
variance of $Z_{t,p}$ and $X^{1/2}$ denotes the uniquely defined symmetric square root
of a matrix $X$. Further,
$\hat\Gamma^\perp_p = \hat\Gamma_p - \hat\Gamma_{u,z}'\hat\Gamma_u^{-1}\hat\Gamma_{u,z}$.
Here $\hat\Gamma_u$ denotes the sample variance of $U^+_{t,f}$ and $\hat\Gamma_{u,z}$
the sample covariance of $U^+_{t,f}$ and $Z_{t,p}$. Let
$\Gamma^\perp_p = E\hat\Gamma^\perp_p$ denote the expectation of the weighting matrix.
It follows from the assumptions on the inputs and the noise stated above that for any
fixed $p$ the estimate $\hat\Gamma^\perp_p$ converges to $\Gamma^\perp_p$. Furthermore,
the results stated e.g. in (Hannan and Deistler, 1988) imply that the two norm of these
matrices is bounded from below and from above a.s. uniformly for
$p = O((\log T)^a)$, $a < \infty$, i.e. for moderately growing size.
Corresponding to $\hat W^+_f$, typical choices include the identity matrix and
$(\hat\Gamma^{+\perp}_f)^{-1/2}$, using

$$\hat\Gamma^{+\perp}_f = \hat\Gamma^+_y - \hat\Gamma_{y,u}\hat\Gamma_u^{-1}\hat\Gamma_{u,y} \qquad (2)$$

where $\hat\Gamma^+_y$ stands for the sample variance of $Y^+_{t,f}$ and
$\hat\Gamma_{y,u} = \hat\Gamma_{u,y}'$ denotes the sample covariance of $Y^+_{t,f}$ and
$U^+_{t,f}$. In this paper the choice of the weighting $\hat W^+_f$ will be restricted
depending on the choice of the integer $f$: If $f$ is chosen to be fixed and finite, then
$\hat W^+_f$ is assumed to be chosen such that $\|\hat W^+_f - W^+_f\| = O(Q_T)$ for
some nonsingular matrix $W^+_f$. For $f \to \infty$, only
$\hat W^+_f = (\hat\Gamma^{+\perp}_f)^{-1/2}$ or a weighting matrix attached to a
frequency weighting transfer function (cf. Bauer, 1998) are considered. Let the
expectation be denoted with $\Gamma^{+\perp}_f$. Then analogous results hold true: The
error $\|\hat\Gamma^{+\perp}_f - \Gamma^{+\perp}_f\|_2 \to 0$, and the two norm of
$\Gamma^{+\perp}_f$ and thus of $\hat\Gamma^{+\perp}_f$ is bounded and bounded away from
zero for $f = O((\log T)^a)$, $a < \infty$. The name CCA (canonical correlation
analysis) will be reserved for the procedure using

$$\hat W_p = (\hat\Gamma^\perp_p)^{1/2} \quad \text{and} \quad \hat W^+_f = (\hat\Gamma^{+\perp}_f)^{-1/2} \qquad (3)$$
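The corrected covariance in (2) is just the sample variance of the stacked future with
the part explained by the stacked input removed. A sketch with random placeholder data
matrices (column $t$ holding $Y^+_{t,f}$ resp. $U^+_{t,f}$; dimensions are assumptions):

```python
import numpy as np

# Sketch of equation (2): remove the part of Y^+ explained by U^+ before weighting.
rng = np.random.default_rng(2)
fs, fm, N = 6, 3, 2000           # illustrative dimensions of Y^+_{t,f}, U^+_{t,f}
Yf = rng.standard_normal((fs, N))
Uf = rng.standard_normal((fm, N))

Gy = Yf @ Yf.T / N               # sample variance of Y^+_{t,f}
Gyu = Yf @ Uf.T / N              # sample covariance of Y^+ and U^+
Gu = Uf @ Uf.T / N               # sample variance of U^+_{t,f}

G_perp = Gy - Gyu @ np.linalg.inv(Gu) @ Gyu.T   # \hat\Gamma_f^{+perp}, eq. (2)
```

For $N$ much larger than the stacked dimensions, `G_perp` is symmetric positive definite
a.s., so its symmetric inverse square root (the CCA weight) is well defined.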
In the third step the difference between the two classes of procedures appears: Whereas
the Larimore type of procedures use $\hat{\mathcal{K}}_p$ to continue, the MOESP type of
procedures uses $\hat{\mathcal{O}}_f$ (for details see Bauer, 1998, Chapter 3). In this
paper only the Larimore type of procedures is considered.
3 Main Results

The main idea of the considered class of algorithms is to estimate the state in a first
step and to obtain the estimate of the system using this state estimate. Consider the
estimate $\hat{\mathcal{K}}_p = \hat S \hat V_n'(\hat W_p)^{-1}$. Here
$\hat S = ([\hat V_n'(\hat W_p)^{-1}]_n)^{-1}$ appears to be a convenient choice of
$\hat S$, where $[X]_n$ denotes the submatrix containing the first $n$ columns of $X$.
This is possible (a.s. asymptotically) if the first $n$ columns of $\mathcal{K}_p$ are
linearly independent (in one and thus in any representation). This holds true on a
generic subset of $M_n$, which is denoted with $M^+_n$. Let
$(\hat A_c, \hat B_c, \hat C_c, \hat D_c, \hat K_c)$ denote the estimated system, which
has been converted into the canonical form induced by the restriction that
$[\mathcal{K}_p]_n = I_n$, and let $(A_c, B_c, C_c, D_c, K_c)$ denote the corresponding
representation of the true system. Since the entries in a canonical form are system
invariants, the estimation accuracy of two procedures can be assessed by comparing the
asymptotic covariance matrix of the vectorisation of the estimated system in a canonical
form. This is done in the main result of this paper:
Theorem 3.1 Let $y_t$ be generated by a system $(A_c, B_c, C_c, D_c, K_c)$, such that
the corresponding pair of transfer functions is in $M^+_n$. The noise is assumed to be
i.i.d. with zero mean, nonsingular variance and finite fourth moments. The observed
input is restricted to be i.i.d. with zero mean, nonsingular variance and finite fourth
moments. Furthermore, it is assumed that noise and observed input are independent.
Assume that the Larimore type of procedure using $\hat W_p = (\hat\Gamma^\perp_p)^{1/2}$
is used to estimate the system and no delay is assumed, i.e. the entries in $D_c$ are
estimated and not restricted to be zero. Additionally, it is assumed that
$p \geq -d\log T/(2\log|\lambda_0|)$, $1 < d < \infty$, and $p = o((\log T)^a)$ holds
for some $a < \infty$, where $T$ denotes the sample size and
$\lambda_0 = \lambda_{\max}(A_c - K_cC_c)$, where $\lambda_{\max}$ denotes an eigenvalue
of maximum modulus. Corresponding to $\hat W^+_f$, it is assumed that either $f$ is
fixed and $\hat W^+_f$ is chosen such that there exists a nonsingular matrix $W^+_f$
with $\|\hat W^+_f - W^+_f\| = O(Q_T)$, or that $f \to \infty$ and $\hat W^+_f$ is
chosen according to equation (2). Then the asymptotic variance of
$\mathrm{vec}\,[\hat A_c - A_c, \hat B_c - B_c, \hat C_c - C_c, \hat D_c - D_c, \hat K_c - K_c]$
is of the form

$$M_1\left(\begin{bmatrix}\Gamma_x & 0\\ 0 & \Gamma_u\end{bmatrix}\otimes\Sigma\right)M_1' + M_2\left\{\Gamma_z \otimes (\mathcal{O}_f'W_2\mathcal{O}_f)^{-1}\mathcal{O}_f'W_2\left[\mathcal{E}_f(I_f\otimes\Sigma)\mathcal{E}_f'\right]W_2\mathcal{O}_f(\mathcal{O}_f'W_2\mathcal{O}_f)^{-1}\right\}M_2' \qquad (4)$$

where $W_2 = \lim_{T\to\infty}(W^+_f)'W^+_f$. Here the matrices
$M_1 \in \mathbb{R}^{[(n+s)(n+m)+ns]\times s(n+m)}$ and
$M_2 \in \mathbb{R}^{[(n+s)(n+m)+ns]\times\infty}$ do not depend on $f$ or $W^+_f$.
Finally, $\Sigma$ denotes the noise covariance matrix and
$\Gamma_z = EZ_{t,\infty}(Z_{t,\infty})'$.

The expression (4) as a function of $W^+_f$ is minimized by the CCA choice of the
weighting $W^+_f = (\Gamma^{+\perp}_f)^{-1/2}$ for each value of $f$. The minimum
variance decreases monotonically in $f$ for the CCA case.
The implications of the theorem are that it is always (i.e. for any choice of $f$)
optimal to use the CCA weighting scheme. The theorem also suggests to use $f \to \infty$
at some rate, which is in accordance with earlier simulation studies (cf. Bauer, 1998).
It also shows that no finite choice of $f$ can achieve the optimal accuracy in all
cases, since the decrease with respect to $f$ is in general strict. Furthermore, the
theorem provides a measure of how much of the attainable accuracy one loses by using any
method other than the optimal one. Note that in the theorem it has been assumed that $p$
tends to infinity as a function of the sample size. Therefore the above expression
should be viewed as the limit of the respective quantities for $p \to \infty$. It will
be part of the proof to demonstrate that this limit exists.
PROOF: Consider the estimation of the system matrices using the estimate of the state
sequence $\hat x_t = \hat{\mathcal{K}}_p Z_{t,p}$: This is done using ordinary least
squares. The expressions for the estimation error are easily derived to be the
following. Here $(A, B, C, D, K)$ denotes the true system in the representation
according to $[\mathcal{K}_p]_n = I_n$ and so does the true state $x_t$. Let
$\xi_t = \hat x_t - x_t$. Introduce the notation
$\langle a_t, b_t\rangle = T^{-1}\sum_{t=p+1}^{T-f} a_tb_t'$. Then:

$$[\hat C - C, \hat D - D] = [\langle\varepsilon_t - C\xi_t, \hat x_t\rangle, \langle\varepsilon_t - C\xi_t, u_t\rangle]\begin{bmatrix}\langle\hat x_t, \hat x_t\rangle & \langle\hat x_t, u_t\rangle\\ \langle u_t, \hat x_t\rangle & \langle u_t, u_t\rangle\end{bmatrix}^{-1}$$

$$[\hat A - A, \hat B - B] = [\langle\xi_{t+1} + K\varepsilon_t - A\xi_t, \hat x_t\rangle, \langle\xi_{t+1} + K\varepsilon_t - A\xi_t, u_t\rangle]\begin{bmatrix}\langle\hat x_t, \hat x_t\rangle & \langle\hat x_t, u_t\rangle\\ \langle u_t, \hat x_t\rangle & \langle u_t, u_t\rangle\end{bmatrix}^{-1}$$

$$\hat K - K = \langle\xi_{t+1} - A\xi_t + Bu_t - K(\hat\varepsilon_t - \varepsilon_t), \hat\varepsilon_t\rangle\langle\hat\varepsilon_t, \hat\varepsilon_t\rangle^{-1} \qquad (5)$$

Since $\hat\varepsilon_t - \varepsilon_t = -(\hat C - C)\hat x_t - C\xi_t$, it is
observed that a number of terms appear in these expressions:
$\langle\varepsilon_t, \hat x_t\rangle$, $\langle\xi_t, \hat x_t\rangle$,
$\langle\xi_{t+1}, \hat x_t\rangle$, $\langle\xi_{t+1}, u_t\rangle$,
$\langle\xi_t, u_t\rangle$, $\langle\xi_t, \hat\varepsilon_t\rangle$,
$\langle\xi_{t+1}, \hat\varepsilon_t\rangle$ and $\langle\varepsilon_t, u_t\rangle$.
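The regressions behind these error expressions can be sketched for the no-input case:
regress $y_t$ on the state estimate to get $\hat C$, form residuals $\hat\varepsilon_t$,
then regress $\hat x_{t+1}$ on $(\hat x_t, \hat\varepsilon_t)$ to get $\hat A$ and
$\hat K$. As an illustrative simplification, the exact simulated state stands in for
the subspace state estimate $\hat{\mathcal K}_pZ_{t,p}$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, s, T = 2, 1, 20000

# illustrative system; x_hat below is the exact simulated state, standing in
# for the subspace state estimate (an assumption for this sketch)
A = np.array([[0.6, 0.0], [0.0, 0.3]])
C = np.array([[1.0, 0.8]])
K = np.array([[0.5], [0.4]])

e = rng.standard_normal((T, s))
x = np.zeros((T + 1, n))
y = np.empty((T, s))
for t in range(T):
    y[t] = C @ x[t] + e[t]
    x[t + 1] = A @ x[t] + K @ e[t]
x_hat = x[:T]

# OLS of y_t on x_hat_t gives C; the residuals estimate the innovations
C_hat = y.T @ x_hat @ np.linalg.inv(x_hat.T @ x_hat)
eps_hat = y - x_hat @ C_hat.T

# OLS of x_hat_{t+1} on (x_hat_t, eps_hat_t) gives A and K jointly
R = np.hstack([x_hat, eps_hat])
AK = x[1:T + 1].T @ R @ np.linalg.inv(R.T @ R)
A_hat, K_hat = AK[:, :n], AK[:, n:]
```

With the exact state, the recovered matrices are close to the true ones; with the
subspace state estimate, the extra terms involving $\xi_t$ in (5) appear.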
Note that $\xi_t = (\hat{\mathcal{K}}_p - \mathcal{K}_p)Z_{t,p} - (A - KC)^px_{t-p}$.
Let
$P_{\mathcal K} = I_{p(s+m)} - \begin{bmatrix}I_n\\ 0\end{bmatrix}\mathcal{K}_p$.
Here all system matrices are assumed to be in the canonical form introduced by the
restriction $[\mathcal{K}_p]_n = I_n$. Further, let
$\mathcal{O}^\dagger_f = (\mathcal{O}_f'W_2\mathcal{O}_f)^{-1}\mathcal{O}_f'W_2$ denote
a generalized left inverse of $\mathcal{O}_f$. Recall that $W_2 = (W^+_f)'W^+_f$. In
the notation we refer to the case $f$ fixed and finite. In the CCA case with
$f \to \infty$ we obtain $W_2 = (\Gamma^{+\perp}_\infty)^{-1}$. This introduces no
further complications, however, since it is straightforward to show that all necessary
limits exist. For details on this see (Bauer, 1998). Then it has been shown in (Bauer
et al., 2000) that

$$(\hat{\mathcal{K}}_p - \mathcal{K}_p) = \mathcal{O}^\dagger_f(\hat\beta_z - \beta_z)P_{\mathcal K} + O(\|\hat\beta_z - \beta_z\|\,\|\hat W_p - W_p\|) + O(\|\hat\beta_z - \beta_z\|^2) \qquad (6)$$

where $\beta_z = EY^+_{t,f}(Z_{t,p})'\Gamma_p^{-1}$.
For any of the proposed weighting matrices,
$\|\hat W_p - W_p\| = O(Q_T\sqrt{fp})$. This follows from the uniform convergence of the
sample covariances as stated e.g. in (Hannan and Deistler, 1988, Theorem 5.3.2). It also
follows that $\|\hat\beta_z - \beta_z\|^2 = o(T^{-1/2})$. Therefore, for the asymptotic
distribution the term $\mathcal{O}^\dagger_f(\hat\beta_z - \beta_z)P_{\mathcal K}$ is
the essential one; the remaining terms do not show up in the asymptotic distribution, as
they are $o(T^{-1/2})$. Here the asymptotic distribution is to be understood for the
vectorisation of the matrix in the sense of (Lewis and Reinsel, 1985), as the dimensions
of the matrix increase according to $p \to \infty$: i.e. a zero mean vector
$x_T \in \mathbb{R}^{p(T)}$ is said to be distributed asymptotically normal if, for any
vector $l_T \in \mathbb{R}^{p(T)}$ such that $\|l_T\|_2 \leq M$ for some $M < \infty$,

$$\|[l_T', 0, \ldots]' - l\|_2 \to 0 \text{ for some vector } l \in \ell_2, \qquad E(l_T'x_T)^2 \to c \text{ for some } 0 \leq c < \infty,$$

the scalar product $l_T'x_T$ converges in distribution to a normal random variable.
$P_{\mathcal K}$ depends on $p$, but this is not reflected in the notation. Note the
fact that this expression does not depend on the weighting $\hat W_p$ and that regarding
the weighting $\hat W^+_f$ only the expectation $W^+_f$ has an influence. Thus it is
seen that the only essential stochastic terms in the equations above are
$\langle\varepsilon_t, x_t\rangle$, $\langle\varepsilon_t, u_t\rangle$ and
$\hat{\mathcal{K}}_p - \mathcal{K}_p \doteq \mathcal{O}^\dagger_f(\hat\beta_z - \beta_z)P_{\mathcal K}$,
where $\doteq$ denotes equality up to terms of order $o(T^{-1/2})$. From the assumptions
on $p$ in relation to the sample size, it follows that
$\|[\beta_z, 0] - \mathcal{O}_f\mathcal{K}_\infty\|_2 = O(\sqrt{p}\,|\lambda_0|^p) = o(T^{-1/2})$,
where $\lambda_0$ denotes a zero of $H(q)$ of maximum modulus. Then let
$Z^\perp_{t,p} = Z_{t,p} - \hat\Gamma_{z,u}\hat\Gamma_u^{-1}U^+_{t,f}$. Therefore

$$\hat\beta_z - \beta_z \doteq \langle\mathcal{E}_fE^+_{t,f}, Z^\perp_{t,p}\rangle\langle Z^\perp_{t,p}, Z^\perp_{t,p}\rangle^{-1}$$
as follows from straightforward calculations. The main idea of the proof now is to find
matrices $M_1 \in \mathbb{R}^{[(n+s)(n+m)+ns]\times s(n+m)}$ and
$M_{2,p} \in \mathbb{R}^{[(n+s)(n+m)+ns]\times np(m+s)}$ such that

$$\mathrm{vec}\,[\hat A_c - A, \hat B_c - B, \hat C_c - C, \hat D_c - D, \hat K_c - K] \doteq M_1\,\mathrm{vec}\left[\left\langle\varepsilon_t, \begin{bmatrix}x_t\\ u_t\end{bmatrix}\right\rangle\right] + M_{2,p}\,\mathrm{vec}\left[\mathcal{O}^\dagger_f\langle\mathcal{E}_fE^+_{t,f}, Z^\perp_{t,p}\rangle\right]$$

where $[M_{2,p}, 0_{[(n+s)(n+m)+ns]\times\infty}]$ converges in $\ell_1$ norm to a
matrix whose entries decrease exponentially with the column index. In this case the
sufficient conditions for asymptotic normality given in (Lewis and Reinsel, 1985) are
fulfilled (see e.g. Bauer et al., 1999). It will be clear from the derivation below that
$M_1$ and $M_2 = \lim_{T\to\infty}[M_{2,p}, 0_{[(n+s)(n+m)+ns]\times\infty}]$ do not
depend on $f$ or $W^+_f$. Note that

$$E\,\mathrm{vec}[\langle\varepsilon_t, x_t\rangle]\,\mathrm{vec}[\langle\varepsilon_t, x_t\rangle]' = \frac{1}{T^2}\sum_{t,s=p+1}^{T-f}E(x_tx_s'\otimes\varepsilon_t\varepsilon_s') \doteq \frac{1}{T}(\Gamma_x\otimes\Sigma)$$

where $\Gamma_x = \mathcal{K}\Gamma_z\mathcal{K}' = Ex_tx_t'$ and $\doteq$ again denotes
equality up to terms of order $o(T^{-1/2})$. Analogously,
$E\,\mathrm{vec}[\langle\varepsilon_t, u_t\rangle]\,\mathrm{vec}[\langle\varepsilon_t, u_t\rangle]' \doteq \frac{1}{T}(\Gamma_u\otimes\Sigma)$,
where $\Gamma_u = Eu_tu_t'$. Also
$E\,\mathrm{vec}[\langle\varepsilon_t, x_t\rangle]\,\mathrm{vec}[\langle\varepsilon_t, u_t\rangle]' = 0$.
Further note that for any component of
$\mathrm{vec}[\langle\varepsilon_t, x_t\rangle] \doteq \mathrm{vec}[\langle\varepsilon_t, \mathcal{K}_pZ_{t,p}\rangle]$
and any linear combination
$x'(\hat{\mathcal{K}}_p - \mathcal{K}_p)X_p \doteq x'\mathcal{O}^\dagger_f\mathcal{E}_f\langle E^+_{t,f}, Z^\perp_{t,p}\rangle\Gamma_p^{-1}P_{\mathcal K}X_p$
for some vectors $x \in \mathbb{R}^n$ and $X_p \in \mathbb{R}^{p(s+m)}$ such that
$\|[X_p', 0, \ldots]' - X\|_1 \to 0$, where $X$ is a vector in $\ell_1$ having elements
decreasing exponentially, one obtains that

$$E\frac{1}{T^2}\sum_{s,t=p+1}^{T-f}\varepsilon_{t,i}(\mathcal{K}_{p,j}Z_{t,p})(x'\mathcal{O}^\dagger_f\mathcal{E}_fE^+_{s,f})(X_p'P_{\mathcal K}'\Gamma_p^{-1}Z^\perp_{s,p})$$
$$= \frac{1}{T^2}\sum_{t=p+1}^{T-f}\sum_{s=\underline s}^{t}E\varepsilon_{t,i}(x'\mathcal{O}^\dagger_f\mathcal{E}_fE^+_{s,f})\,E\mathcal{K}_{p,j}Z_{t,p}(Z^\perp_{s,p})'\Gamma_p^{-1}P_{\mathcal K}X_p$$
$$= \frac{1}{T^2}\sum_{t=p+1}^{T-f}\sum_{s=\underline s}^{t}E\varepsilon_{t,i}(x'\mathcal{O}^\dagger_f\mathcal{E}_fE^+_{s,f})[\mathcal{K}_{p,j}, 0_{1\times(m+s)(t-s)}]\begin{bmatrix}H_{t-s,p}\\ \Gamma_p\end{bmatrix}\Gamma_p^{-1}P_{\mathcal K}X_p$$
$$= \frac{1}{T^2}\sum_{t=p+1}^{T-f}\sum_{s=\underline s}^{t}E\varepsilon_{t,i}(x'\mathcal{O}^\dagger_f\mathcal{E}_fE^+_{s,f})[\mathcal{K}_{p,j}, 0_{1\times(m+s)(t-s)}]\begin{bmatrix}\tilde{\mathcal{O}}_{t-s}\mathcal{K}_p\\ I_p\end{bmatrix}P_{\mathcal K}X_p + o(T^{-1}) = o(T^{-1})$$

where $H_{j,p} = EZ_{t,j}(Z_{t-j,p})'$ and $\underline s = \max\{p+1, t-f+1\}$. Here
mostly $\mathcal{K}_pP_{\mathcal K} = 0$ and
$\|H_{j,p}\Gamma_p^{-1} - \tilde{\mathcal{O}}_j\mathcal{K}_p\| = o(T^{-1/2})$ for $p$ as
specified in the Theorem is used (for a proof of the latter statement see e.g. Bauer,
1998). The latter fact is used in the replacement involved in the third equality. The
convergence follows from the convergence assumptions on $X_p$ and the analogous property
of $\mathcal{K}_p$. These properties allow the replacement of the limit for
$p \to \infty$ by the expression obtained for $p = \infty$, which will be done
frequently in the following in order to simplify notation. Thus let $H_j$ denote
$H_{j,\infty}$. For the covariance of elements of $\langle\varepsilon_t, u_t\rangle$
with $x'(\hat{\mathcal{K}}_p - \mathcal{K}_p)X_p$
analogous arguments hold. Therefore the two terms are asymptotically uncorrelated. The
first term, due to $\langle\varepsilon_t, x_t\rangle$ and
$\langle\varepsilon_t, u_t\rangle$, has been shown to have variance
$M_1(\mathrm{diag}(\Gamma_x, \Gamma_u)\otimes\Sigma)M_1'$, where $M_1$ can be shown to
be independent of $f$ and $W^+_f$ because of equation (5); see also below. Thus the only
effect of these choices is hidden in the second term. Consider the variance of
$\mathcal{O}^\dagger_f\langle\mathcal{E}_fE^+_{t,f}, Z^\perp_{t,p}\rangle\Gamma_p^{-1}P_{\mathcal K}x_p$
for some vector $x_p \in \mathbb{R}^{p(m+s)}$ such that
$\|[x_p', 0, \ldots]' - x\|_1 \to 0$ for some vector $x$ in $\ell_1$ having elements
decreasing exponentially:

$$\frac{1}{T^2}\sum_{s,t=p+1}^{T-f}E\,\mathcal{O}^\dagger_f\mathcal{E}_fE^+_{t,f}(Z^\perp_{t,p})'\Gamma_p^{-1}P_{\mathcal K}x_p\,x_p'P_{\mathcal K}'\Gamma_p^{-1}Z^\perp_{s,p}(E^+_{s,f})'\mathcal{E}_f'(\mathcal{O}^\dagger_f)'$$
$$= \frac{1}{T}\sum_{l=1-f}^{f-1}\left[\mathcal{O}^\dagger_f\mathcal{E}_f\,EE^+_{t,f}(E^+_{t+l,f})'\,\mathcal{E}_f'(\mathcal{O}^\dagger_f)'\right]x_p'P_{\mathcal K}'\Gamma_p^{-1}EZ^\perp_{t,p}(Z^\perp_{t+l,p})'\Gamma_p^{-1}P_{\mathcal K}x_p + o(T^{-1}) \qquad (7)$$
Note that for $l = 0$ the part due to $E^+_{t,f}$ is equal to
$\mathcal{O}^\dagger_f\mathcal{E}_f(I_f\otimes\Sigma)\mathcal{E}_f'(\mathcal{O}^\dagger_f)'$.
This is the central term in the expression for the asymptotic variance given in the
theorem. From the description of the estimation error in equation (5) it follows that a
matrix $M_2$ as used above exists. The construction of this matrix will be clarified
below. The theorem then is proved if, for any vectors $x_p, y_p$ postmultiplying
$\hat{\mathcal{K}}_p - \mathcal{K}_p$ in equations (5), it holds that
$Ex_p'P_{\mathcal K}'\Gamma_p^{-1}Z_{t,p}(Z_{t-j,p})'\Gamma_p^{-1}P_{\mathcal K}y_p \to 0$
for $j \neq 0$. This will be done below.
Consider the case $n < (s+m)$ first. Examining the expressions for the estimation errors
given in equation (5), one observes that only a number of terms are multiplying
$\hat{\mathcal{K}}_p - \mathcal{K}_p$: These terms converge to
$\Gamma_z\mathcal{K}'$, $[\tilde H_1', \Gamma_z]'\mathcal{K}'$,
$[0_{m\times s}, I_m, 0_{m\times\infty}]'$ and $[I_s, 0_{s\times\infty}]'$, where
convergence is in $\ell_1$ norm as required above for the vectors $x_p$. This follows
from the facts that $EZ_{t,\infty}u_t' = 0$, $EZ_{t,\infty}\varepsilon_t' = 0$, the
error bound $\|\hat{\mathcal{K}}_p - \mathcal{K}_p\| = O(Q_T\sqrt{pf})$ and the uniform
convergence of the sample covariances. Consider

$$EP_{\mathcal K}'\Gamma_z^{-1}Z_{t,\infty}(Z_{t-j,\infty})'\Gamma_z^{-1}P_{\mathcal K} = P_{\mathcal K}'\Gamma_z^{-1}\begin{bmatrix}H_j\\ \Gamma_z\end{bmatrix}\Gamma_z^{-1}P_{\mathcal K} = P_{\mathcal K}'\Gamma_z^{-1}\begin{bmatrix}0_{j(s+m)\times\infty}\\ I_\infty\end{bmatrix}P_{\mathcal K} = P_{\mathcal K}'\begin{bmatrix}0_{j(s+m)\times\infty}\\ \Gamma_z^{-1}\end{bmatrix}P_{\mathcal K} = \begin{bmatrix}0_{j(s+m)\times\infty}\\ \Gamma_z^{-1}\end{bmatrix}P_{\mathcal K} \qquad (8)$$

evaluating the expression for $p = \infty$ rather than dealing with the limit. Using the
exponential decrease, however, it is straightforward to show that in all situations
where the expression occurs, the limit and the expression for $p = \infty$ coincide. The
next to last equality follows from the block matrix inversion formula

$$\Gamma_z^{-1} = \begin{bmatrix}0_{(s+m)\times(s+m)} & 0_{(s+m)\times\infty}\\ 0_{\infty\times(s+m)} & \Gamma_z^{-1}\end{bmatrix} + \begin{bmatrix}I_{s+m}\\ -\Gamma_z^{-1}H_1'\end{bmatrix}\left(\Gamma_z(0) - H_1\Gamma_z^{-1}H_1'\right)^{-1}\left[I_{s+m}, -H_1\Gamma_z^{-1}\right] \qquad (9)$$

The projection $P_{\mathcal K}' = I_\infty - \mathcal{K}'[I_n, 0_{n\times\infty}]$, and
thus the last equality follows from $n < (s+m)$. Premultiplying with the above mentioned
terms from the left shows that the terms for $l \neq 0$ in equation (7) do not matter in
the case $n < (s+m)$. Take e.g.

$$E\mathcal{K}[H_1', \Gamma_z]P_{\mathcal K}'\Gamma_z^{-1}Z_{t,\infty}(Z_{t-j,\infty})'\Gamma_z^{-1}P_{\mathcal K} = \mathcal{K}[H_1', \Gamma_z]\begin{bmatrix}0_{j(s+m)\times\infty}\\ \Gamma_z^{-1}\end{bmatrix}P_{\mathcal K} = 0$$

This shows that in the case $n < (s+m)$ only the term for $l = 0$ in equation (7) is
nonzero, and thus the theorem holds in this case. Therefore assume $n \geq (s+m)$ from
now on.
In the theorem the asymptotic variance of
$(\hat A_c, \hat B_c, \hat C_c, \hat D_c, \hat K_c)$ is given. In order to show that
only the term for $l = 0$ in equation (7) is nonzero, it is sufficient to show this fact
for any invertible (possibly nonlinear) transformation of these matrices. It proves to
be convenient to consider $(\bar A_c, \bar B_c, \hat C_c, \hat D_c, \hat K_c)$, where
$\bar A_c = \hat A_c - \hat K_c\hat C_c$ and $\bar B_c = \hat B_c - \hat K_c\hat D_c$.
These estimates are obtained by transforming the estimates of the subspace algorithm
$(\bar A, \bar B, \hat C, \hat D, \hat K)$ into the particular canonical form defined by
$[\mathcal{K}_p]_n = I_n$, which is done using the transformation matrix $\hat S$,
defined as $\hat S = [[\hat K, \bar B], \ldots, \bar A^{n-1}[\hat K, \bar B]]_n$. Then
e.g. $\hat C_c - C = \hat C\hat S - C = (\hat C - C)\hat S + C(\hat S - I_n)$, where
$I_n$ is the limit of $\hat S$, as follows in a straightforward fashion from the
consistency results for $\mathcal{K}_p$ and the sample covariances used in the
regression.
Thus consider the estimation errors in the system matrix estimates more closely: Due to
the white noise assumption on the inputs, one obtains
$\langle\hat x_t, u_t\rangle \to 0$, and thus the estimation errors of $\hat C$ and
$\hat D$ can be treated independently. Consider $\hat C$ first:

$$\hat C - C \doteq \langle\varepsilon_t - C\xi_t, \hat x_t\rangle\langle\hat x_t, \hat x_t\rangle^{-1} \doteq \langle\varepsilon_t, x_t\rangle\Gamma_x^{-1} - \langle C\xi_t, x_t\rangle\Gamma_x^{-1} \doteq \langle\varepsilon_t, x_t\rangle\Gamma_x^{-1} - C(\hat{\mathcal{K}}_p - \mathcal{K}_p)\Gamma_p\mathcal{K}_p'\Gamma_x^{-1}$$

Here the error bound
$\|\hat{\mathcal{K}}_p - \mathcal{K}_p\| = O(Q_T\sqrt{pf})$ has been used to show e.g.
that $\langle\varepsilon_t, \hat x_t\rangle \doteq \langle\varepsilon_t, x_t\rangle$.
Next deal with $K$:

$$\hat K - K \doteq \langle\xi_{t+1}, \varepsilon_t\rangle\Sigma^{-1} + \langle-A\xi_t - K(\hat\varepsilon_t - \varepsilon_t), \varepsilon_t\rangle\Sigma^{-1} \doteq (\hat{\mathcal{K}}_p - \mathcal{K}_p)\begin{bmatrix}I_s\\ 0_{[p(s+m)-s]\times s}\end{bmatrix}$$
This follows from the error bound cited above and the uniform convergence of the sample
covariances, as e.g.
$\langle\xi_t, \varepsilon_t\rangle = (\hat{\mathcal{K}}_p - \mathcal{K}_p)\langle Z_{t,p}, \varepsilon_t\rangle - (A - KC)^p\langle x_{t-p}, \varepsilon_t\rangle \doteq 0$.
Also the fact that $\langle u_t, \hat\varepsilon_t\rangle \doteq 0$ has been used. For
$n \geq (s+m)$ the expression for $\hat K - K$ is zero, since the first $n$ columns of
$\hat{\mathcal{K}}_p - \mathcal{K}_p$ vanish in the canonical form. In this case simple
manipulations lead to

$$\bar A - \underline A \doteq \langle\xi_{t+1} - \underline A\xi_t, x_t\rangle\Gamma_x^{-1} \doteq [\hat{\mathcal{K}}_p - \mathcal{K}_p, 0_{n\times(m+s)}]\begin{bmatrix}H_{1,p}\\ \Gamma_p\end{bmatrix}\mathcal{K}_p'\Gamma_x^{-1} - \underline A(\hat{\mathcal{K}}_p - \mathcal{K}_p)\Gamma_p\mathcal{K}_p'\Gamma_x^{-1}$$

where $\underline A = A - KC$, $\underline B = B - KD$, and where again $\doteq$ denotes
equality up to terms of order $o(T^{-1/2})$. The same arguments show that
$\bar B = \hat B - \hat K\hat D \doteq \underline B$, where again $n \geq (s+m)$ is
used. Therefore, denoting $\hat T_i = \bar A^i[\hat K, \bar B]$ and
$T_i = \underline A^i[K, \underline B]$, the following recursion is obtained for
$i = 0, 1, \ldots$:

$$\hat T_{i+1} - T_{i+1} = \bar A(\hat T_i - T_i) + (\bar A - \underline A)T_i \doteq \tilde{\mathcal{K}}\begin{bmatrix}H_1\\ \Gamma_z\end{bmatrix}\mathcal{K}'\Gamma_x^{-1}T_i - \underline A\tilde{\mathcal{K}}\Gamma_z\mathcal{K}'\Gamma_x^{-1}T_i + \underline A(\hat T_i - T_i)$$
$$\doteq \tilde{\mathcal{K}}\begin{bmatrix}H_1\\ \Gamma_z\end{bmatrix}\mathcal{K}'\Gamma_x^{-1}T_i - \underline A^{i+1}\tilde{\mathcal{K}}\Gamma_z\mathcal{K}'\Gamma_x^{-1}T_0 + \sum_{j=1}^{i}\underline A^j\tilde{\mathcal{K}}\left\{\begin{bmatrix}H_1\\ \Gamma_z\end{bmatrix}\mathcal{K}'\Gamma_x^{-1} - \Gamma_z\mathcal{K}'\Gamma_x^{-1}\underline A\right\}T_{i-j}$$

where $\tilde{\mathcal{K}} = [\hat{\mathcal{K}}_p, 0] - \mathcal{K}$ and where
$\mathcal{K} = \mathcal{K}_\infty$ is used. The recursion is started at
$\hat T_0 = T_0$.
Note that due to the normalisation of $\mathcal{K}_p$, it follows that each column of
$I_n$ is equal to a column of $T_i$ for some index $i$. Let
$\underline n = \lfloor n/(s+m)\rfloor$, where $\lfloor x\rfloor$ denotes the greatest
integer smaller than $x$. Further, let $\underline m = n - \underline n(s+m)$. Then the
columns of $T_i$, $0 \leq i < \underline n$, and the first $\underline m$ columns of
$T_{\underline n}$ are vectors of the canonical basis. Therefore the columns of
$\hat C_c - C$ are equal to the columns of the following expressions for the respective
integers $i$ (recall $\underline A = A - KC$ and
$\tilde{\mathcal{K}} = [\hat{\mathcal{K}}_p, 0] - \mathcal{K}$):

$$(\hat C - C)T_{i-1} + C(\hat T_{i-1} - T_{i-1}) \doteq \langle\varepsilon_t, x_t\rangle\Gamma_x^{-1}T_{i-1} - C\underline A^{i-1}\tilde{\mathcal{K}}\Gamma_z\mathcal{K}'\Gamma_x^{-1}T_0 + \sum_{j=0}^{i-2}C\underline A^j\tilde{\mathcal{K}}\left\{\begin{bmatrix}H_1\\ \Gamma_z\end{bmatrix}\mathcal{K}'\Gamma_x^{-1} - \Gamma_z\mathcal{K}'\Gamma_x^{-1}\underline A\right\}T_{i-j-2} \qquad (10)$$

Thus the columns of
$\left\{\begin{bmatrix}H_1\\ \Gamma_z\end{bmatrix}\mathcal{K}'\Gamma_x^{-1} - \Gamma_z\mathcal{K}'\Gamma_x^{-1}\underline A\right\}$
and of $\Gamma_z\mathcal{K}'\Gamma_x^{-1}T_0$ appear as the vectors $x_p$ or $y_p$ in
equation (7). Now for $j > 0$ consider

$$ET_0'\Gamma_x^{-1}\mathcal{K}\Gamma_zP_{\mathcal K}'\Gamma_z^{-1}Z_{t,\infty}(Z_{t-j,\infty})'\Gamma_z^{-1}P_{\mathcal K} = T_0'\Gamma_x^{-1}\mathcal{K}\Gamma_zP_{\mathcal K}'\Gamma_z^{-1}\begin{bmatrix}0_{j(s+m)\times\infty}\\ I_\infty\end{bmatrix}P_{\mathcal K}$$
$$= T_0'\Gamma_x^{-1}\left(\mathcal{K} - \Gamma_x[I_n, 0_{n\times\infty}]\Gamma_z^{-1}\right)\begin{bmatrix}0_{j(s+m)\times\infty}\\ I_\infty\end{bmatrix}P_{\mathcal K} = -T_0'[I_n, 0_{n\times\infty}]\Gamma_z^{-1}\begin{bmatrix}0_{j(s+m)\times\infty}\\ I_\infty\end{bmatrix}P_{\mathcal K}$$

which is zero due to the block matrix inversion (9), since
$T_0'[I_n, 0_{n\times\infty}] = [I_{s+m}, 0_{(s+m)\times\infty}]$. Here
$\mathcal{K}P_{\mathcal K} = 0$ is used throughout.
The second term appearing above has the same property:

$$E\left(\Gamma_x^{-1}\mathcal{K}[H_1', \Gamma_z] - \underline A'\Gamma_x^{-1}\mathcal{K}\Gamma_z\right)P_{\mathcal K}'\Gamma_z^{-1}Z_{t,\infty}(Z_{t-j,\infty})'\Gamma_z^{-1}P_{\mathcal K}$$
$$= \left(\Gamma_x^{-1}\mathcal{K}[H_1', \Gamma_z] - \underline A'\Gamma_x^{-1}\mathcal{K}\Gamma_z\right)\left(I_\infty - \mathcal{K}'[I_n, 0_{n\times\infty}]\right)\Gamma_z^{-1}\begin{bmatrix}0_{j(s+m)\times\infty}\\ I_\infty\end{bmatrix}P_{\mathcal K}$$
$$= -\left(\Gamma_x^{-1}\mathcal{K}[H_1', \Gamma_z] - \underline A'\Gamma_x^{-1}\mathcal{K}\Gamma_z\right)\mathcal{K}'[I_n, 0_{n\times\infty}]\Gamma_z^{-1}\begin{bmatrix}0_{j(s+m)\times\infty}\\ I_\infty\end{bmatrix}P_{\mathcal K}$$
$$= -[C', 0_{n\times m}]T_0'[I_n, 0_{n\times\infty}]\Gamma_z^{-1}\begin{bmatrix}0_{j(s+m)\times\infty}\\ I_\infty\end{bmatrix}P_{\mathcal K} = 0$$

where the term not involving $\mathcal{K}'[I_n, 0_{n\times\infty}]$ vanishes by the same
shift argument as above. Here
$\mathcal{K}[H_1', \Gamma_z]\mathcal{K}' = Ex_t(x_{t+1})' = \Gamma_xA'$ has been used.
Corresponding to $D$, the situation is much easier, since it is easily seen that
$\hat D_c - D = \hat D - D \doteq \langle\varepsilon_t, u_t\rangle(Eu_tu_t')^{-1}$,
since $\langle\xi_t, u_t\rangle \doteq 0$. This follows from the bound on
$\|\hat{\mathcal{K}}_p - \mathcal{K}_p\|$ and the uncorrelatedness of $u_t$ and
$Z_{t,p}$ in the white noise case. Thus the asymptotic distribution of $\hat D_c$ is not
affected by the choice of $f$ or $W^+_f$.
Due to the definition of $\hat K_c$ and $\bar B_c = \hat B_c - \hat K_c\hat D_c$, it
follows that $\hat K_c \doteq K$ and $\bar B_c \doteq \underline B$ for $n \geq (s+m)$.
Thus it remains to consider the estimation of $\bar A_c$. Note that
$\bar A_c - \underline A \doteq -(\hat S - I_n)\underline A + (\bar A - \underline A) + \underline A(\hat S - I_n)$.
If $i$ is such that $T_i$ is a block column of the identity matrix, then
$\hat S T_i = \hat T_i$ and thus

$$-(\hat S - I_n)\underline AT_{i-1} + (\bar A - \underline A)T_{i-1} + \underline A(\hat S - I_n)T_{i-1} = -(\hat S - I_n)T_i + (\bar A - \underline A)T_{i-1} + \underline A(\hat T_{i-1} - T_{i-1}) \doteq 0$$

This is also true for the first $\underline m$ columns of $T_{\underline n - 1}$.
Therefore it remains to deal with the matrix
$X = [x_0', \ldots, x_{\underline n - 1}', \tilde x_{\underline n}']'$, where $X$
denotes the matrix built of the last $s+m$ columns of $\underline A$. Here
$x_i \in \mathbb{R}^{(s+m)\times(s+m)}$, $i = 0, \ldots, \underline n - 1$, and
$\tilde x_{\underline n} \in \mathbb{R}^{\underline m\times(s+m)}$. In order to unify
the notation, let
$x_{\underline n} = [\tilde x_{\underline n}', 0_{(s+m)\times(s+m-\underline m)}]'$.
Let $s_{\underline n - 1} = [0_{(s+m-\underline m)\times\underline m}, I_{s+m-\underline m}]'$,
$s_{\underline n} = [I_{\underline m}, 0_{\underline m\times(s+m-\underline m)}]'$,
$\bar s_{\underline n - 1} = [I_{s+m-\underline m}, 0_{(s+m-\underline m)\times\underline m}]'$
and $\bar s_{\underline n} = [0_{\underline m\times(s+m-\underline m)}, I_{\underline m}]'$
respectively. Finally, define
$Y = \begin{bmatrix}H_1\\ \Gamma_z\end{bmatrix}\mathcal{K}'\Gamma_x^{-1}$. Then for
$i = \underline n - 1$ and $i = \underline n$ one obtains

$$\left(\hat T_i - T_i - (\hat S - I_n)\underline AT_{i-1}\right)s_i = -\sum_{j=0}^{\underline n}(\hat T_j - T_j)x_js_i + \hat T_i - T_i$$
$$\doteq -\sum_{j=1}^{\underline n}\left[\tilde{\mathcal{K}}YT_{j-1}x_js_i - \underline A^j\tilde{\mathcal{K}}\Gamma_z\mathcal{K}'\Gamma_x^{-1}T_0x_js_i\right] + \tilde{\mathcal{K}}YT_{i-1}s_i - \underline A^i\tilde{\mathcal{K}}\Gamma_z\mathcal{K}'\Gamma_x^{-1}T_0s_i$$
$$\quad - \sum_{j=1}^{\underline n}\sum_{l=1}^{j-1}\underline A^l\tilde{\mathcal{K}}\left[Y - \Gamma_z\mathcal{K}'\Gamma_x^{-1}\underline A\right]T_{j-l-1}x_js_i + \sum_{j=1}^{i-1}\underline A^j\tilde{\mathcal{K}}\left[Y - \Gamma_z\mathcal{K}'\Gamma_x^{-1}\underline A\right]T_{i-j-1}s_i$$

It has been shown before that for the terms postmultiplied by $T_0$ and the terms
including $Y - \Gamma_z\mathcal{K}'\Gamma_x^{-1}\underline A$, only the covariance of
the respective term matters in the asymptotic distribution. It remains to consider

$$\tilde{\mathcal{K}}Y\left(T_{i-1}s_i - \sum_{j=1}^{\underline n}T_{j-1}x_js_i\right) = \tilde{\mathcal{K}}Y\left(\begin{bmatrix}0_{(s+m)\times(s+m)}\\ \vdots\\ 0_{\underline m\times(s+m)}\\ I_{s+m}\end{bmatrix} + \begin{bmatrix}x_1\\ x_2\\ \vdots\\ \tilde x_{\underline n}\\ 0_{(s+m)\times(s+m)}\end{bmatrix}\right)s_i$$

Denoting the matrix in brackets on the right hand side with $\tilde X$, we obtain
$\underline A\tilde X = T_0x_1$. Also
$A\tilde X = \underline A\tilde X + T_0\begin{bmatrix}C\\ 0_{m\times n}\end{bmatrix}\tilde X = T_0V$
for some matrix $V$. Therefore

$$E\tilde X'\Gamma_x^{-1}\mathcal{K}[H_1', \Gamma_z]P_{\mathcal K}'\Gamma_z^{-1}Z_{t,\infty}(Z_{t-j,\infty})'\Gamma_z^{-1}P_{\mathcal K} = \tilde X'\Gamma_x^{-1}\mathcal{K}[H_1', \Gamma_z]P_{\mathcal K}'\Gamma_z^{-1}\begin{bmatrix}0_{j(s+m)\times\infty}\\ I_\infty\end{bmatrix}P_{\mathcal K}$$
$$= -\tilde X'\Gamma_x^{-1}\mathcal{K}[H_1', \Gamma_z]\mathcal{K}'[I_n, 0_{n\times\infty}]\Gamma_z^{-1}\begin{bmatrix}0_{j(s+m)\times\infty}\\ I_\infty\end{bmatrix}P_{\mathcal K} = -\tilde X'A'[I_n, 0_{n\times\infty}]\Gamma_z^{-1}\begin{bmatrix}0_{j(s+m)\times\infty}\\ I_\infty\end{bmatrix}P_{\mathcal K} = 0$$

Here the second last equation follows from
$\mathcal{K}[H_1', \Gamma_z]\mathcal{K}' = Ex_t(x_{t+1})' = \Gamma_xA'$.
Summing up the findings up to now, it follows that

$$\mathrm{vec}\,[\bar A_c - \underline A, \bar B_c - \underline B, \hat C_c - C, \hat D_c - D, \hat K_c - K] \doteq M_1\,\mathrm{vec}\left\langle\varepsilon_t, \begin{bmatrix}x_t\\ u_t\end{bmatrix}\right\rangle + M_{2,p}\,\mathrm{vec}\left[\mathcal{O}^\dagger_f\mathcal{E}_f\langle E^+_{t,f}, Z^\perp_{t,p}\rangle\right] \qquad (11)$$

Here the matrices $M_1$ and $M_{2,p}$ can be found by tracing the computations so far.
Additionally, it has been shown that

$$E\left[M_1\left(\begin{bmatrix}x_t\\ u_t\end{bmatrix}\otimes\varepsilon_t\right)\right]\left[M_{2,p}\left(P_{\mathcal K}'\Gamma_p^{-1}Z_{t+j,p}\otimes\mathcal{O}^\dagger_f\mathcal{E}_fE^+_{t+j,f}\right)\right]' \to 0 \quad \forall j,$$
$$E\left[M_{2,p}\left(P_{\mathcal K}'\Gamma_p^{-1}Z_{t,p}\otimes\mathcal{O}^\dagger_f\mathcal{E}_fE^+_{t,f}\right)\right]\left[M_{2,p}\left(P_{\mathcal K}'\Gamma_p^{-1}Z_{t+j,p}\otimes\mathcal{O}^\dagger_f\mathcal{E}_fE^+_{t+j,f}\right)\right]' \to 0, \quad j \neq 0.$$

From the definition of $M_1$ and $M_{2,p}$, it follows that these matrices do not depend
on $W_2$ or $f$.
It can further be shown by straightforward operations that the terms given above are of
order $o(T^{-1/2})$ uniformly in $f = O((\log T)^a)$. Here the form of the weighting
matrices is used to show e.g. that the variance of
$\mathcal{O}^\dagger_f\mathcal{E}_fE^+_{t,f}$ is bounded uniformly in $f$. Therefore the
expression for the asymptotic variance of the estimated system matrices also holds in
the case $f \to \infty$ for CCA weights.

Note that

$$\mathcal{O}^\dagger_f\mathcal{E}_f(I_f\otimes\Sigma)\mathcal{E}_f'(\mathcal{O}^\dagger_f)' = (\mathcal{O}_f'W_2\mathcal{O}_f)^{-1}\mathcal{O}_f'W_2\left[\mathcal{E}_f(I_f\otimes\Sigma)\mathcal{E}_f'\right]W_2\mathcal{O}_f(\mathcal{O}_f'W_2\mathcal{O}_f)^{-1}.$$

This is minimized by $W_2^\circ = [\mathcal{E}_f(I_f\otimes\Sigma)\mathcal{E}_f']^{-1}$
with minimum
$(\mathcal{O}_f'(\mathcal{E}_f')^{-1}(I_f\otimes\Sigma^{-1})\mathcal{E}_f^{-1}\mathcal{O}_f)^{-1}$.
Some matrix algebra shows that this gives the identical variance for any choice
$W_2 = W_2^\circ + W_2^\circ\mathcal{O}_fX\mathcal{O}_f'W_2^\circ$ such that $W_2$ is
invertible. Thus the CCA weightings minimize the variance of the estimated system, since
in this case $W_2 = (\Gamma^{+\perp}_y)^{-1}$, where
$\Gamma^{+\perp}_y = \mathcal{E}_f(I_f\otimes\Sigma)\mathcal{E}_f' + \mathcal{O}_f\Gamma_x\mathcal{O}_f'$.
Therefore, due to the matrix inversion lemma,

$$(\Gamma^{+\perp}_y)^{-1} = (\mathcal{E}_f(I_f\otimes\Sigma)\mathcal{E}_f')^{-1} + (\mathcal{E}_f(I_f\otimes\Sigma)\mathcal{E}_f')^{-1}\mathcal{O}_fX\mathcal{O}_f'(\mathcal{E}_f(I_f\otimes\Sigma)\mathcal{E}_f')^{-1}$$

for a suitable matrix $X$. The lower triangular block Toeplitz structure of
$\mathcal{E}_f$ then shows that this minimum variance decreases monotonically in $f$,
since it ensures that $\mathcal{E}_f^{-1}\mathcal{O}_f$ is a submatrix of
$\mathcal{E}_\infty^{-1}\mathcal{O}_\infty$. This completes the proof.
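The minimization step used in the last part of the proof is the standard generalized
least squares ("sandwich") optimality. It can be checked numerically with random
placeholders for $\mathcal O_f$ and the noise term (all names and sizes here are
illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
fs, n = 8, 3                               # illustrative sizes (f*s and order n)
O = rng.standard_normal((fs, n))           # placeholder for O_f
Rt = rng.standard_normal((fs, fs))
S = Rt @ Rt.T + fs * np.eye(fs)            # placeholder for E_f (I_f kron Sigma) E_f'

def sandwich(W2):
    """(O' W2 O)^{-1} O' W2 S W2 O (O' W2 O)^{-1} for a weighting W2."""
    P = np.linalg.inv(O.T @ W2 @ O) @ (O.T @ W2)
    return P @ S @ P.T

V_opt = sandwich(np.linalg.inv(S))         # optimal weighting W2 = S^{-1}
V_id = sandwich(np.eye(fs))                # e.g. the N4SID-type choice W2 = I
diff = V_id - V_opt                        # should be positive semidefinite
```

The optimum equals $(\mathcal O_f'S^{-1}\mathcal O_f)^{-1}$, and any other weighting
gives a covariance that exceeds it in the positive semidefinite ordering, which is
exactly the ordering used in the theorem.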
This theorem clarifies a longstanding question about the optimal choices of the
weighting matrices for the algorithms dealt with in this contribution. It shows clearly
that CCA is the optimal choice in any situation where no input is present or the
observed input is white noise. It also shows that finite $f$ leads to nonefficient
estimation in the generic case. The subset of $M_n$ where finite $f$ also leads to
optimal estimates consists of ARX systems, as is easily seen from the form of the
essential term of the asymptotic variance of the parameter estimates. The amount of
accuracy which is lost by using a small $f$ is thus, in the case of using optimal
weights, determined by the magnitude of the noise zeros, as they govern the rate of
exponential decrease in the matrix $\mathcal{E}_f^{-1}\mathcal{O}_f$.
The theorem also has implications for different weightings: On the basis of the theorem
it is possible to assess the loss of efficiency due to using a nonoptimal weighting. One
could, for a particular weighting, calculate the optimal $f$ by minimizing the matrix
given in the theorem. The benefits of such methods must be doubted, however, since the
derivation of the theorem has been given on the grounds of knowing the order of the
system. In this case it does not seem to make sense to use any weighting other than the
optimal one.
Another question is also quite natural: the observed input has been assumed to be
white. Some of the analysis did not need this assumption and thus also holds for more
general inputs. In particular, equation (7) holds for more general inputs. This equation
could in principle be used to obtain optimal weightings with respect to specific measures.
Note that the equation defines a matrix, and there need not be a minimum for the matrix
minimization problem. However, we could for example minimize the trace of the matrix,
so that a scalar minimization problem is obtained, which could be solved using nonlinear
search methods. Such a process, however, involves optimisation issues, which one wants to
avoid in the first place by using subspace rather than prediction error methods. Thus such
a procedure is also not recommended, although possible. It is believed, however, that the
expressions given in this paper might be valuable also for the analysis of this general input
case.
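The trace scalarization mentioned above can be sketched in a few lines. Here asymptotic_variance is a hypothetical stand-in for the matrix V_f(W_f^+) of the theorem, not the paper's actual expression; the point is only how the matrix minimization becomes a scalar search over f:

```python
import numpy as np

# Hypothetical stand-in for the asymptotic variance matrix V_f(W_f^+);
# in practice this would be evaluated from the expressions of the theorem.
def asymptotic_variance(f: int) -> np.ndarray:
    # toy model: the variance shrinks geometrically towards a limit
    return np.eye(6) * (1.0 + 0.5 ** f)

# Scalarize via the trace and pick the best f over a grid
# (a crude alternative to a nonlinear search).
candidates = range(3, 21)
traces = {f: np.trace(asymptotic_variance(f)) for f in candidates}
f_opt = min(traces, key=traces.get)
```

For the toy model the trace is monotonically decreasing, so the grid search returns the largest candidate; with a real V_f(W_f^+) and a nonoptimal weighting the minimum can occur at a finite f.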
4 Numerical Illustration
In this section two examples will be given which illustrate the findings of the last section.
As a first case consider the following single-input single-output system without exogenous
inputs having state dimension three:
A = [ 0.532  0.4639  0.2855 ; 1  0  0.2568 ; 0  1  0.0054 ],   K = [ 1 ; 0 ; 0 ],   C = [ 0.532  0.4639  0.0413 ]
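The claim of the previous section, that the noise zeros govern the rate of decrease, can be checked numerically for this system. A minimal sketch (NumPy assumed; the entries are transcribed from the display above, where minus signs may have been lost in the typeset original):

```python
import numpy as np

# First example system, entries as transcribed from the text.
A = np.array([[0.532, 0.4639, 0.2855],
              [1.0,   0.0,    0.2568],
              [0.0,   1.0,    0.0054]])
K = np.array([[1.0], [0.0], [0.0]])
C = np.array([[0.532, 0.4639, 0.0413]])

# Eigenvalues of A are the system poles; eigenvalues of A - K C are the
# noise zeros, whose magnitudes govern the exponential rate discussed above.
poles = np.linalg.eigvals(A)
noise_zeros = np.linalg.eigvals(A - K @ C)
rho = max(abs(noise_zeros))
```

The largest noise-zero modulus rho is the quantity that determines how quickly optimality is approached as f grows.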
The noise is assumed to be white with variance equal to 1. For this system the asymptotic
variance is compared to the Cramer-Rao bound using the following measure: let F_I denote
Figure 1: The no input case: the left plot shows the measure E_f for the two weighting
schemes CCA and N4SID for the range of values f = 3, ..., 20. The right picture shows
a plot of the determinant of (O_f' W_2 O_f)^{-1} O_f' W_2 E_f E_f' W_2 O_f (O_f' W_2 O_f)^{-1} for the same two
procedures and the same range of integers f.
the Fisher information matrix with respect to the echelon canonical form in this case. Then
it is well known that the Cramer-Rao bound for the estimation is equal to F_I^{-1}. Thus let
V_f(W_f^+) denote the asymptotic variance of parameter estimates obtained from the
subspace procedure using the integer f and the weighting matrix W_f^+. Then the measure
E_f = tr[V_f(W_f^+) F_I] - 2ns - (n+s)m is used. For an efficient estimation method this is
equal to zero, otherwise positive. The left plot in Figure 1 shows this measure for the two
weighting schemes CCA (where the weighting is the inverse square root of the sample
covariance of the future outputs, cf. (3)) and N4SID (i.e. W_f^+ = I_{fs}). The right plot of
this figure shows det[(O_f' W_2 O_f)^{-1} O_f' W_2 E_f E_f' W_2 O_f (O_f' W_2 O_f)^{-1}], where W_2 = (W_f^+)' W_f^+.
The plots clearly reveal the identical behaviour of the two measures. It can also be seen
that for the CCA weights the measure E_f decreases to zero for f → ∞, whereas for the
N4SID weights the choice of f = n is optimal. For both weightings a converging behaviour
is observed for large f, which is also in accordance with the theory.
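The determinant shown in the right plots can be evaluated directly once O_f, E_f and the weighting are available. A minimal sketch with a synthetic E_f (the actual E_f of the theorem is not reproduced here; the hypothetical (A, C) pair serves only to build an extended observability matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, f = 3, 1, 10  # state dimension, number of outputs, future horizon

# Hypothetical system used only to build the extended observability matrix O_f.
A = np.diag([0.5, 0.3, 0.1])
C = rng.standard_normal((s, n))
O_f = np.vstack([C @ np.linalg.matrix_power(A, j) for j in range(f)])

# Synthetic stand-in for the (fs x n) matrix E_f appearing in the theorem.
E_f = rng.standard_normal((f * s, n))

def sandwich_det(W_plus):
    """det[(O' W2 O)^-1 O' W2 E E' W2 O (O' W2 O)^-1] with W2 = W_plus' W_plus."""
    W2 = W_plus.T @ W_plus
    G = np.linalg.inv(O_f.T @ W2 @ O_f)
    M = G @ O_f.T @ W2 @ E_f @ E_f.T @ W2 @ O_f @ G
    return float(np.linalg.det(M))

# N4SID-type choice: identity weighting.
val_identity = sandwich_det(np.eye(f * s))
```

Substituting a different W_plus (e.g. an estimated inverse square root of the future output covariance for the CCA case) and sweeping f reproduces curves of the kind shown in the figures.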
Figure 2: The white noise input case: the left plot shows the measure E_f for the two
weighting schemes CCA and N4SID for the range of values f = 2, ..., 20. The right picture
shows a plot of the determinant of (O_f' W_2 O_f)^{-1} O_f' W_2 E_f E_f' W_2 O_f (O_f' W_2 O_f)^{-1} for the same
two procedures and the same range of integers f.

The second example considers a system with an additional observed white noise input,
given by the system matrices
A = [ 0.393  2.022 ; 0.208  0.685 ],   B = [ 0.95 ; 1.00 ],   C = [ 0.326  0.743 ],   D = 0.95,   K = [ 1 ; 0 ]
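The count of 7 parameters stated below follows from the same formula 2ns + (n+s)m used in the measure of the first example; a quick check (NumPy assumed, entries transcribed from the display above):

```python
import numpy as np

# Second example system, values as transcribed from the text.
A = np.array([[0.393, 2.022],
              [0.208, 0.685]])
B = np.array([[0.95], [1.00]])
C = np.array([[0.326, 0.743]])
D = np.array([[0.95]])
K = np.array([[1.0], [0.0]])

n = A.shape[0]  # state dimension
s = C.shape[0]  # number of outputs
m = B.shape[1]  # number of observed inputs

# Free parameters in the canonical form: (n+s)m for (B, D) and
# 2ns for the remaining system matrices.
n_params = 2 * n * s + (n + s) * m
```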
The observed input and the unobserved noise are assumed to have mean zero and variance 1.
Thus there are a total of 7 parameters to be estimated. Analogously to the case of no
inputs define E_f = tr[V_f(W_f^+) F_I] - 7. Figure 2 shows the result of the calculation. The
pictures demonstrate again the identical behaviour of the two measures of accuracy. Again the
CCA weighting scheme is superior to the N4SID weighting scheme, and again it reaches the
Cramer-Rao lower bound for f → ∞. This illustrates the significance of the expressions
derived above.
5 Conclusions

In this paper the dependence of the asymptotic accuracy of the Larimore type of subspace
methods on the choice of the integer f and the weighting matrix W_f^+ has been
explored. It has been shown that the effects of these choices in the case of no observed
inputs or white observed inputs can be summarized in the term (O_f' W_2 O_f)^{-1} O_f' W_2 E_f (I - ...) E_f' W_2 O_f (O_f' W_2 O_f)^{-1},
as has been shown in Theorem 3.1. This term shows that the CCA
choice of the weighting according to (3) is optimal with respect to the asymptotic variance
for each f. It also follows that for this optimal choice the variance decreases with increasing
f, achieving the optimal accuracy for the choice f → ∞. For other weighting procedures
the expression can be used to optimise the choice of f. Finally, the new expressions for
the asymptotic variance might also lead to an efficient implementation of the computation
of the asymptotic variance, which could be used for practical implementation rather than
only for academic purposes.
Acknowledgements
This work was done while the first author was holding a post-doc position at the
Division of Automatic Control in Linkoping. Financial support from the EU TMR project
'SI' is gratefully acknowledged.
References
Bauer, D. (1998). Some Asymptotic Theory for the Estimation of Linear Systems Using
Maximum Likelihood Methods or Subspace Algorithms. PhD thesis. TU Wien.
Bauer, D. and M. Jansson (2000). Analysis of the asymptotic properties of the MOESP
type of subspace algorithms. Automatica 36(4), 497-509.
Bauer, D., M. Deistler and W. Scherrer (1999). Consistency and asymptotic normality of
some subspace algorithms for systems without observed inputs. Automatica 35, 1243-1254.
Bauer, D., M. Deistler and W. Scherrer (2000). On the impact of weighting matrices in
subspace algorithms. In: Proceedings of the IFAC Conference SYSID 2000. Santa
Barbara, California.
Hannan, E. J. and M. Deistler (1988). The Statistical Theory of Linear Systems. John
Wiley. New York.
Jansson, M. (2000). Asymptotic variance analysis of subspace identification methods. In:
Proceedings of the SYSID 2000 Conference. Santa Barbara, California.
Jansson, M. and B. Wahlberg (1998). On consistency of subspace methods for system
identification. Automatica 34(12), 1507-1519.
Larimore, W. E. (1983). System identification, reduced order filters and modeling via
canonical variate analysis. In: Proc. 1983 Amer. Control Conference 2 (H. S. Rao
and P. Dorato, Eds.). Piscataway, NJ. pp. 445-451. IEEE Service Center.
Lewis, R. and G. Reinsel (1985). Prediction of multivariate time series by autoregressive
model fitting. Journal of Multivariate Analysis 16, 393-411.
Peternell, K., W. Scherrer and M. Deistler (1996). Statistical analysis of novel subspace
identification methods. Signal Processing 52, 161-177.
Verhaegen, M. (1994). Identification of the deterministic part of MIMO state space models