IDENTIFICATION OF SPARSE LINEAR REGRESSIONS¹

Fredrik Gustafsson

Department of Electrical Engineering, Linköping University, S-581 83 Linköping

Abstract. Some important practical signals and systems can be modeled by very large linear regression models where it is reasonable to assume that most of the parameters are zero. We give an efficient method to solve this combined estimation and structure determination problem, and relate the result to Akaike's information criteria for structure selection and to criteria based on hypothesis testing. A recursive algorithm is derived, which can be applied to time-varying systems as well. Two practical examples show the efficiency of the approach.

Keywords. Linear regression, structure selection, maximum likelihood, sparse models, echoes, orthonormal bases, system identification, modeling, model selection.

1. INTRODUCTION

We will consider linear regression models of the form

y(t) = φ^T(t) θ + e(t)    (1)

where φ(t) is a regression vector and θ the corresponding parameter vector. Here e(t) is assumed to be white Gaussian noise with variance σ². It is implicitly assumed that the number of parameters d = dim θ is large (e.g. d > 100). The assumption in this contribution is that only a small number n of parameters (e.g. n < 10) are active:

θ(k_i) ≠ 0,  i = 1, 2, …, n
θ(l) = 0  otherwise    (2)

By k^n we denote the set of active indices k_1, k_2, …, k_n. We point out two important applications where this is the case.

1. Multipath signal propagation. In telephone communication, echoes are a big problem that can deteriorate the speech quality severely. In 4-wire loop telephony, the echoes come from circuit echo paths (Sondhi and Berkley 1980), while in mobile radio channels they are caused by room acoustic echo paths. The effect can be removed by equalization once a channel model has been identified.

The signal can be written as

y(t) = Σ_{i=1}^n θ(k_i) u(t − k_i) + e(t)    (3)

where u is the interesting speech signal. The indices k_i for active coefficients correspond to the time delays in

¹ Partially supported by AB Volvo.

the echo path. The same problem occurs in sonar applications (Burdic 1984), where the echoes are caused by reflections at the surface and the bottom of the sea, and also in geophysical signal processing (Robinson and Durrani 1986).
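As a concrete illustration, the echo model (3) can be simulated in a few lines; the delays and gains below are hypothetical (loosely mirroring the example in Section 4.2), and a white-noise input stands in for the speech signal u:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 2000, 100                       # data length and maximum delay
delays = [5, 6, 40]                    # hypothetical active delays k_i
gains = [5.0, 6.0, 1.5]                # hypothetical echo gains theta(k_i)

u = rng.standard_normal(N + d)         # white-noise stand-in for the speech signal
e = rng.standard_normal(N)             # measurement noise
# y(t) = sum_i theta(k_i) u(t - k_i) + e(t), eq. (3)
y = sum(g * u[d - k : d - k + N] for g, k in zip(gains, delays)) + e
```

The regression vector at time t then consists of the d delayed input samples u(t−1), …, u(t−d), of which only three carry nonzero coefficients.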

2. Approximation using basis functions. In system identification, the use of (orthonormal) basis functions has become a popular approach recently (Wahlberg 1991, Ninness 1993, Van den Hof et al. 1993). Here, the system is modeled by

y(t) = Σ_{i=1}^n θ(k_i) ⟨ψ_{k_i}(t), u(t)⟩ + e(t)    (4)

with ⟨ψ_{k_i}(t), u(t)⟩ denoting the scalar product of the basis function ψ_{k_i}(t) and the system input u(t). A similar problem occurs in function approximation using, for instance, polynomial basis functions. In this case, u(t) is the function to be approximated.

Note that both (3) and (4) are linear regressions. The problem is to estimate the number of active coefficients n, their positions k^n and their values θ(k_i). The upper bound d on the number of parameters is assumed to be known. Any conceivable method that claims to be optimal for this problem has to examine all possible combinations of k^n. If n were known, there would be C(d, n) different combinations to examine. Here, where n is unknown, there are Σ_{n=0}^d C(d, n) = 2^d different combinations, the latter expression coming from the fact that each coefficient can be either active or inactive. Examples of optimality criteria that will be surveyed are Akaike's BIC (Akaike 1981) and the ML approach. Another possible approach for function approximation is given in (Donoho 1992) and extended to dynamic system approximation in (Bodin and Wahlberg 1994b, Bodin and Wahlberg 1994a).

The contribution here is an approximation of the loss function for each k^n, so that all 2^d loss functions can be evaluated from only d scalar LS estimates. We give a recursive implementation that can be applied to time-varying systems as well. The total complexity corresponds approximately to the least mean square (LMS) algorithm, with a non-standard stepsize, applied to the full parameter vector.

The outline is as follows. Section 2 briefly surveys some possible approaches. The novel approximation of the loss function is derived in Section 3, Section 4 presents some examples, and Section 5 concludes the paper.

2. CLASSICAL APPROACHES

Consider the model (1) with the assumption (2). Assume that for each possible combination k^n = (k_1, k_2, …, k_n) we are given a parameter estimate θ̂_N(k^n) minimizing the sum of squared residuals,

V_N(k^n) = Σ_{t=1}^N [y(t) − φ^T(t) θ̂(k^n)]²    (5)
         = min_θ Σ_{t=1}^N [y(t) − φ^T(t) θ(k^n)]²    (6)

By θ(k^n) is meant the parameter vector in (1) under the constraint (2).

2.1 Including a penalty term

A standard approach to structure determination is to use one of Akaike's three tests FPE, AIC and BIC (Akaike 1981). The latter is identical to Rissanen's MDL criterion (Rissanen 1978). They can be written as follows.

• Loss function with penalty term:

k̂^n = argmin_{k^n} (1/N) V_N(k^n) + σ² n γ(N)    (7)

Here AIC corresponds to γ(N) = 2/N and BIC to γ(N) = log(N)/N. The first term is a measure of model fit while the second one is a penalty term preventing too complex models. This is a generic philosophy in model structure determination: the best model should be a tradeoff between model fit and complexity. The obvious drawback in using (7) is the huge computational complexity, since all possible k^n have to be examined. As pointed out in (?), the maximum likelihood approach leads to a similar criterion.

There is a well-known problem with AIC and BIC when the number of data is small. This has met considerable interest in econometrics, and there might be a similar problem here, where the number of data per parameter can be small as well. We will, however, not investigate that matter here.

It follows from known theory that the method is consistent if

γ(N) → 0,  N → ∞
N γ(N) → ∞,  N → ∞    (8)

see for instance equations (11.47) and (11.68a) in (Soderstrom and Stoica 1989). This means that if the true system can be described by one of the model structures, this structure will eventually be estimated as more and more data become available.
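For small d, the penalized criterion (7) can be minimized exactly by enumerating all 2^d subsets, which makes the combinatorial cost concrete. A sketch with the BIC penalty γ(N) = log(N)/N, assuming a known noise variance σ² = 1 and a hypothetical sparse truth:

```python
import itertools
import numpy as np

def bic_search(Phi, y, sigma2=1.0):
    """Minimize V_N/N + sigma2 * n * log(N)/N over all 2^d subsets (small d only)."""
    N, d = Phi.shape
    best, best_sub = np.inf, ()
    for n in range(d + 1):
        for sub in itertools.combinations(range(d), n):
            if sub:
                th, *_ = np.linalg.lstsq(Phi[:, list(sub)], y, rcond=None)
                V = np.sum((y - Phi[:, list(sub)] @ th) ** 2)
            else:
                V = np.sum(y ** 2)
            crit = V / N + sigma2 * n * np.log(N) / N
            if crit < best:
                best, best_sub = crit, sub
    return best_sub

rng = np.random.default_rng(1)
N, d = 200, 6                        # d small enough for exhaustive search
Phi = rng.standard_normal((N, d))
theta = np.zeros(d)
theta[[1, 4]] = [3.0, -2.0]          # hypothetical sparse truth
y = Phi @ theta + rng.standard_normal(N)
print(bic_search(Phi, y))            # typically recovers the support (1, 4)
```

Already at d = 100 this loop would need 2^100 least squares fits, which is the infeasibility the rest of the paper addresses.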

2.2 Hypothesis testing

A natural approach from a statistical viewpoint is hypothesis testing. We can for instance test the significance of each coefficient by using the hypotheses

H_0(k): θ_k = 0    (9)
H_1(k): θ_k ≠ 0    (10)

From the Gaussian noise assumption and classical least squares theory we have θ̂ ∈ N(θ, P), where P is the corresponding covariance matrix. We next choose a significance level, corresponding to a threshold c, and the test becomes

• Hypothesis testing: Keep the coefficients for which

|θ̂_N(k)| > c √(P_kk)    (11)

A possible problem is that we face a multiple hypothesis test but design d independent hypothesis tests, so we do not know the total confidence level.
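As a sketch of the test (11), the threshold c for a two-sided test at a given significance level follows from the Gaussian quantile; the coefficient values and covariances below are purely illustrative:

```python
from statistics import NormalDist

def coef_test(theta_hat, P_diag, alpha=0.05):
    """Two-sided test per coefficient: keep k when |theta_hat[k]| > c*sqrt(P_kk)."""
    c = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. c = 1.96 for alpha = 0.05
    return [k for k, (th, p) in enumerate(zip(theta_hat, P_diag))
            if abs(th) > c * p ** 0.5]

# illustrative numbers: two clearly nonzero coefficients among four
print(coef_test([2.1, 0.05, -1.8, 0.01], [0.04, 0.04, 0.04, 0.04]))  # → [0, 2]
```

Because the d tests are designed independently, the overall false-alarm probability grows with d, which is exactly the multiple-testing concern raised above.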

Another approach has been developed by (Donoho 1992) in the context of function approximation using a wavelet basis in (4). The solution is to first compute the full parameter estimate θ̂_N(1, 2, …, N) (in this application, there are as many regressors as data, d = N). Then we have the following rule:

• The Donoho test: Keep the coefficients for which

|θ̂_N(k)| > (1 + δ) √(log(N) P_kk)    (12)

where δ > 0 is a constant.

This criterion for wavelet coefficients was called wavelet shrinkage with hard thresholding in (Donoho 1992). The idea is quite congenial. It is based on the following result.

Suppose that θ = 0, which means that k^n = ∅, and that the estimate θ̂_N has a diagonal covariance matrix P. This assumes the regressors to be uncorrelated and implies that the parameter estimate is a white noise sequence. Then

P( max_i |θ̂(i)| / √(P_ii) < (1 + δ) √(log(N)) ) → 1.    (13)

That is, the Donoho test yields

P(k̂^n = ∅) → 1,  d → ∞    (14)

if δ > 0. In words, this result says that the probability of overmodeling tends to zero as the number of coefficients tends to infinity. This is automatically assured in the wavelet application if the number of data tends to infinity, since d = N. The method is particularly well suited for filtering noisy deterministic smooth signals, as will be pointed out in the examples later on. The test (12) has been used in (Bodin and Wahlberg 1994b, Bodin and Wahlberg 1994a) for smoothing empirical transfer function estimates and for thresholding using orthonormal bases, respectively, and the results are promising.

Suppose the covariance matrix for θ̂ is P = (σ²/N) I, which is the case when using orthonormal basis functions with a white input signal. Then the Donoho test can be written

|θ̂_N(k)| > (1 + δ) σ √(log(N)/N).    (15)

This particularly simple form, depending only on the estimate itself, will be used for comparison later on.
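The simple form (15) is a plain hard threshold on the coefficient estimates. A minimal sketch, with δ, σ and N as inputs and illustrative coefficient values:

```python
import numpy as np

def donoho_keep(theta_hat, sigma, N, delta=1.0):
    """Hard threshold (15): keep k when |theta_hat[k]| > (1+delta)*sigma*sqrt(log(N)/N)."""
    thr = (1 + delta) * sigma * np.sqrt(np.log(N) / N)
    return np.where(np.abs(theta_hat) > thr)[0]

theta_hat = np.array([0.02, 0.9, -0.01, -0.5, 0.03])
print(donoho_keep(theta_hat, sigma=1.0, N=256))  # → [1 3]
```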

2.3 Maximum Likelihood methods

If the structure k^n is considered as an unknown parameter, classical estimation approaches like the maximum likelihood (ML) method can be applied. The aim is, in the face of the data, to maximize the conditional probability density function p(y^N | k^n). Asymptotic expressions for the likelihood are given in (?). For the cases of a known or unknown noise variance σ² we have, from Table 3, cases 17 and 12:

• ML with known noise variance:

k̂^n = argmin_{k^n} (1/N) V_N(k^n) + σ² n log(N)/N    (16)

• ML with unknown noise variance:

k̂^n = argmin_{k^n} log(V_N(k^n)) + n log(N)/N    (17)

Note that (16) is (asymptotically) identical to BIC. The application decides which variant to use. For instance, in function approximation or signal recovery we might need to tune the approximation error or detail of the result. In communication problems the variance is unknown and the second variant is the most natural.

3. NEW APPROACH

There are two problems with the likelihood- and information-based approaches that make their direct application infeasible:

• The evaluation of the loss function involves the computation of the least squares solution to problems of very high dimension.

• We need to compare 2^d different model structures, and apparently there is no way to avoid computing one loss function for each of them.

We will give a solution to each of these problems. First, in the next subsection, we replace the least squares estimate in the loss function by the least mean square (LMS) estimate, which is much simpler to compute. Then, we point out that with this approximation all possible loss functions can be evaluated at once, and the total estimation scheme is as simple as the Donoho test.

3.1 Approximating the least squares loss function

Let φ(t; k^n) denote the regression vector where all elements but k_1, k_2, …, k_n are set to zero. If we define

f_N(k^n) = Σ_{t=1}^N φ(t; k^n) y(t)    (18)

R_N(k^n) = Σ_{t=1}^N φ(t; k^n) φ^T(t; k^n)    (19)

the least squares estimate can be written

θ̂_N(k^n) = R_N(k^n)⁻¹ f_N(k^n)    (20)

The loss function can now be written in a standard manner

V_N(k^n) = Σ_{t=1}^N (y(t) − φ^T(t; k^n) θ̂)²    (21)
         = Σ_{t=1}^N y²(t) − f_N^T(k^n) R_N⁻¹(k^n) f_N(k^n)    (22)
         = N σ̂_y² − f_N^T(k^n) θ̂_N^{RLS}(k^n)    (23)

where σ̂_y² is the variance of the observed output y(t).

The key idea is to replace R_N(k^n) by a diagonal matrix D_N(k^n) given by the diagonal elements of R_N(k^n):

D_N^{ii}(k^n) = Σ_{t=1}^N [φ^{k_i}(t)]².    (24)

This approximation of uncorrelated regressors can be motivated in different ways. Consider first the application (3). If the input is white noise, we have E(R) = D, so the approximation is obviously unbiased. If the input is autocorrelated, which is the case for speech signals for instance, the correlation decays quickly in time. For large time distances between the active coefficients, the approximation is almost unbiased. Secondly, the basis functions in (4) are often chosen orthonormal, leading to approximately uncorrelated regressors.

There are two consequences of this approximation:

• LMS requires much less computational power (no matrix inversion is needed).

• The estimate of one parameter θ_k remains the same regardless of which other indices are included in k^n. This is not true for the RLS estimate.

The second consequence is the most important one, since we only have to compute the full parameter estimate once and then all loss functions can be approximated.
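The second consequence can be checked numerically: with white regressors, the diagonal ("LMS-style") estimate f_N(k)/D_N^{kk} is computed in one pass and stays close to the full least squares solution. A sketch under the assumption of white regressors and a hypothetical sparse parameter vector:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 5000, 20
Phi = rng.standard_normal((N, d))      # white regressors: E(R_N) is diagonal
theta = np.zeros(d)
theta[[3, 11]] = [2.0, -1.0]           # hypothetical sparse truth
y = Phi @ theta + 0.1 * rng.standard_normal(N)

f = Phi.T @ y                          # f_N, eq. (18)
D = np.sum(Phi ** 2, axis=0)           # diagonal of R_N, eq. (24)
theta_lms = f / D                      # one pass gives every theta_hat(k) at once
theta_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]
print(np.max(np.abs(theta_lms - theta_ls)))   # small for white regressors
```

Each entry of theta_lms depends only on its own regressor column, which is why a single pass over the data covers all 2^d candidate structures at once.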

It remains to analyse the error introduced when replacing the RLS estimate with the LMS estimate.

Lemma 1. Consider the matrices D and R,

R_ij = r(i − j) = Σ_{t=1}^N u(t) u(t + j − i),  i, j = 1, 2, …, n
D_ii = r(0) = Σ_{t=1}^N u²(t),  i = 1, 2, …, n

We have

E lim_{N→∞} N tr(R D⁻¹ R D⁻¹ − I) = n(n − 1)    (25)

Remark 1. The definition of R and D assumes pre- and post-windowing in the least squares method.

Proof: We have

R D⁻¹ =
  [ 1            r(1)/r(0)    ⋯  r(n−1)/r(0)
    r(1)/r(0)    1            ⋯  r(n−2)/r(0)
    ⋮            ⋮            ⋱  ⋮
    r(n−1)/r(0)  ⋯  r(1)/r(0)    1           ]    (26)

and the diagonal elements of its square are given by

diag(R D⁻¹ R D⁻¹) =
  [ 1 + r²(1)/r²(0) + ⋯ + r²(n−1)/r²(0)
    r²(1)/r²(0) + 1 + ⋯ + r²(n−2)/r²(0)
    ⋮
    r²(n−1)/r²(0) + ⋯ + r²(1)/r²(0) + 1 ]

Now a standard result, see equation (11.9) in (Soderstrom and Stoica 1989), says that

N r²(k)/r²(0) → χ²(1)    (27)

in distribution. That is,

E lim_{N→∞} N tr(R D⁻¹ R D⁻¹ − I) = n(n − 1),

which proves the lemma. □

Lemma 2.

E lim_{N→∞} (1/√N) |f_N^T(k^n) θ̂_N^{RLS}(k^n) − f_N^T(k^n) θ̂_N^{LMS}(k^n)| ≤ σ_y² √(n(n − 1))

where σ_y² is the variance of the output y(t).

Proof: Using the matrix notation Y = (y(1), …, y(N))^T, Φ = (φ(1), …, φ(N))^T, we have

|f_N^T(k^n) θ̂_N^{RLS}(k^n) − f_N^T(k^n) θ̂_N^{LMS}(k^n)|
= |Y^T Φ (R⁻¹ − D⁻¹) Φ^T Y|
≤ |Y|² ‖Φ (R⁻¹ − D⁻¹) Φ^T‖₂
= N σ_y² ‖Φ (R⁻¹ − D⁻¹) Φ^T‖₂
≤ N σ_y² ( tr( Φ (R⁻¹ − D⁻¹) Φ^T Φ (R⁻¹ − D⁻¹) Φ^T ) )^{1/2}
= N σ_y² ( tr( Φ^T Φ (R⁻¹ − D⁻¹) Φ^T Φ (R⁻¹ − D⁻¹) ) )^{1/2}
= N σ_y² ( tr( (I − R D⁻¹)(I − R D⁻¹) ) )^{1/2}
= N σ_y² ( tr( I − 2 R D⁻¹ + R D⁻¹ R D⁻¹ ) )^{1/2}
= N σ_y² ( tr( R D⁻¹ R D⁻¹ − I ) )^{1/2}

where we have used tr(R D⁻¹) = tr(I) = n and ‖A‖₂ ≤ √(tr(A^T A)). Now the result follows from Lemma 1, linearity of the trace operator and Jensen's inequality: the square root is a strictly concave function, so E(√X) < √(E(X)). □

Note that for n = 1 LMS and RLS coincide, so there is no difference. Now we have an expression for the error, which in itself satisfies the conditions (8) for being a penalty term. We can thus include a multiple of the error term in the criterion, both to account for the approximation error introduced when simplifying the loss function and to act as a penalty term. Approximating n(n − 1) ≈ n², we finally get the following criterion:

k̂^n = argmin_{k^n} −(1/N) f_N^T(k^n) θ̂_N^{LMS}(k^n) + α σ_y² n / √N

A good starting value is α = 2. Since the error bound is a bit conservative, a smaller α might give better detectability. The difference compared to the threshold proposed in (Homer et al. 1994) is that log(N)/N is replaced by 1/√N. The criterion can conveniently be rewritten as keeping the coefficients whose values are larger than a certain threshold,

(1/N) f_N(i) θ̂_N^{LMS}(i) > α σ_y² / √N    (28)

This criterion is quite appealing, since no knowledge of the noise variance is required.
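A batch sketch of the criterion (28), with the suggested α = 2 and hypothetical data; f_N and the diagonal of R_N are computed in a single pass, as in the text:

```python
import numpy as np

def sparse_select(Phi, y, alpha=2.0):
    """Criterion (28): keep i when f_N(i)*theta_lms(i)/N > alpha*var(y)/sqrt(N)."""
    N = len(y)
    f = Phi.T @ y                      # f_N
    D = np.sum(Phi ** 2, axis=0)       # diagonal of R_N
    theta_lms = f / D
    score = f * theta_lms / N          # note f(i)*theta_lms(i) = f(i)**2/D(i) >= 0
    return np.where(score > alpha * np.var(y) / np.sqrt(N))[0]

rng = np.random.default_rng(3)
N, d = 2000, 100
Phi = rng.standard_normal((N, d))
theta = np.zeros(d)
theta[[5, 40]] = [1.5, -2.0]           # hypothetical sparse truth
y = Phi @ theta + rng.standard_normal(N)
print(sparse_select(Phi, y))           # should recover indices 5 and 40
```

Only the sample variance of y enters the threshold, which reflects the remark that no knowledge of the noise variance is required.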

3.2 Implementation

The LMS algorithm is very fast, so (28) can be implemented directly. We will, however, point out a recursive algorithm which can be applied to time-varying systems. The motivation is that the systems in telephone and sonar applications are time-varying, and there is a strong need for recursive estimation.

Algorithm 1. Choose a forgetting factor λ < 1 and compute recursively

f_t = λ f_{t−1} + (1 − λ) φ(t) y(t)    (29)
D_t = λ D_{t−1} + (1 − λ) diag(φ(t))²    (30)
σ̂²_{y,t} = λ σ̂²_{y,t−1} + (1 − λ) y²(t)    (31)
θ̂_t = D_t⁻¹ f_t    (32)

Keep the coefficients for which

f_t(k) θ̂_t(k) > 2 σ̂_y² √(1 − λ)    (33)

Note that f_t(k) and θ̂_t(k) have the same sign.
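Algorithm 1 can be transcribed almost line by line; the forgetting factor 0.999, the constant 2 in the test, and the simulated channel below are all illustrative choices:

```python
import numpy as np

def algorithm1(phi_seq, y_seq, lam=0.999, alpha=2.0):
    """Recursive sparse estimator, eqs. (29)-(33)."""
    d = phi_seq.shape[1]
    f = np.zeros(d)                  # f_t
    D = np.full(d, 1e-6)             # D_t, small init to avoid division by zero
    var_y = 0.0                      # recursively estimated output variance
    for phi, y in zip(phi_seq, y_seq):
        f = lam * f + (1 - lam) * phi * y          # (29)
        D = lam * D + (1 - lam) * phi ** 2         # (30)
        var_y = lam * var_y + (1 - lam) * y ** 2   # (31)
        theta = f / D                              # (32)
        # test (33), evaluated at every t for time-varying systems
        active = np.where(f * theta > alpha * var_y * np.sqrt(1 - lam))[0]
    return active, theta

rng = np.random.default_rng(4)
N, d = 3000, 50
Phi = rng.standard_normal((N, d))
theta_true = np.zeros(d)
theta_true[[5, 6]] = [5.0, 6.0]                    # hypothetical active taps
y = Phi @ theta_true + rng.standard_normal(N)
active, theta = algorithm1(Phi, y)
print(active)  # the active set should settle on the true taps
```

Since f_t(k) and θ̂_t(k) share the same sign, the left-hand side of (33) is f_t(k)²/D_t(k) ≥ 0, so the test never triggers on the sign of a coefficient alone.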

4. EXAMPLES

4.1 Local low-pass filtering using wavelets

Consider the signal in Figure 1. The signal shows the angular velocity of a non-driven wheel in a Volvo 850 GLT, using standard sensors in the ABS system. This particular test drive was performed on a gravel road with sampling interval 0.2 seconds. The roughness of the surface causes a noise in the wheel angular velocity compared to the pure unknown velocity signal. The purpose of this test is to classify gravel road from other roads by estimating the size of the noise. For this purpose, we first need to estimate the velocity signal by local low-pass filtering. Standard "global" low-pass filters are useless here due to the deterministic high-frequency content in the signal.

The wavelet basis is used in (4). There are 300 data points and thus 300 wavelet coefficients. The criterion

Fig. 1. Wheel velocity as a function of time

(15) with δ = 1 is applied to the wavelet transform coefficients, and the remaining parameters are inverse transformed. About 30% of the transform coefficients are kept in this example. The upper plot in Figure 2 shows the residuals, that is, the difference between the original and filtered signals, as a function of time, and the lower plot shows the residuals' histogram. Clearly, the histogram resembles a Gaussian probability density function, except for some outliers originating from the braking maneuvers. This was expected, since during 0.2 seconds, corresponding approximately to 4 meters, many small holes and ridges of different shapes add up to a total error which can be considered as a Gaussian variable according to the central limit theorem. This is one validation that the approach works. More details of this project can be found in (Gustafsson 1995). Finally, we remark that the bias in the residuals comes from an offset introduced in the wavelet transform.

This example illustrates an application where the assumption (2) holds. In an on-line approximation, other basis functions should be used and the coefficients estimated recursively, using Algorithm 1.

Fig. 2. Estimated noise sequence and its histogram

4.2 Echo detection

We here describe an example similar to one in (Homer et al. 1994). The underlying model is (3). The channel's impulse response is given in the upper plot of Figure 3, and it has three active taps. The input and measurement noise are independent Gaussian white noises with variances 1 and 16, respectively.

Fig. 3. Impulse response of the channel and simulated output

Figure 4 shows the result of Algorithm 1, averaged over 100 simulations. The correct taps were found in all simulations. The upper plot shows the estimated active taps. They converge quickly to the correct values 5, 6 and 1.5. Also, the number of estimated active taps converges quickly to the correct number 3, as shown in the lower plot.

Fig. 4. Estimated active taps (upper plot) and estimated number of active taps (lower plot)

5. CONCLUSIONS

We have investigated the combined structure determination and parameter estimation problem for linear regression models where most of the parameters are expected to be zero. Most conceivable methods, like BIC and the ML method, lead to minimization of a criterion including the least squares loss function, which must be evaluated a huge number of times. We have here proposed a way to simplify the loss function by using the least mean square estimate instead of the least squares one. An analysis provided an error term to be included in the criterion in combination with the usual penalty term, which can be absorbed into the former. The proposed algorithm is of very low complexity, corresponding to the least mean square algorithm. A recursive algorithm was pointed out that can be applied to time-varying systems. Finally, the algorithm was tested on two examples from completely different applications.

6. REFERENCES

Akaike, H. (1981). Modern development of statistical methods. In: Trends and Progress in System Identification (P. Eykhoff, Ed.). Pergamon Press, Oxford.

Bodin, P. and B. Wahlberg (1994a). Thresholding in high order transfer function estimation. In: Proc. 33rd IEEE Conf. on Decision and Control. IEEE. pp. 3400-3405.

Bodin, P. and B. Wahlberg (1994b). A wavelet approach to frequency response estimation. In: SYSID'94. IFAC. pp. 2441-2446.

Burdic, W.S. (1984). Underwater Acoustic System Analysis. Prentice-Hall, Englewood Cliffs, NJ.

Donoho, D.L. (1992). De-noising by soft-thresholding. Technical Report 409, Dept. of Statistics, Stanford University.

Gustafsson, F. (1995). Slip-based estimation of tire-road friction. In: Proc. of the 1995 European Control Conference, Rome. pp. 725-730.

Homer, J., B. Wahlberg, F. Gustafsson, I. Mareels and R. Bitmead (1994). LMS estimation of sparsely parametrized channels via structural detection. In: Proc. of the CDC, 1994. Florida, USA. pp. 257-262.

Ninness, B.M. (1993). Stochastic and Deterministic Modelling. PhD thesis. University of Newcastle.

Rissanen, J. (1978). Modeling by shortest data description. Automatica 14, 465-471.

Robinson, E.A. and T.S. Durrani (1986). Geophysical Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.

Soderstrom, T. and P. Stoica (1989). System Identification. Prentice Hall, New York.

Sondhi, M.M. and D.A. Berkley (1980). Silencing echoes on the telephone network. Proceedings of the IEEE 68, 948-963.

Van den Hof, P.M.J., P.S.C. Heuberger and J. Bokor (1993). Identification with generalized orthonormal basis functions: statistical analysis and error bounds. Selected Topics in Identification, Modelling and Control 6, 39-48.

Wahlberg, B. (1991). System identification using Laguerre models. IEEE Transactions on Automatic Control.
