IDENTIFICATION OF SPARSE LINEAR REGRESSIONS¹

Fredrik Gustafsson

Department of Electrical Engineering, Linköping University, S-581 83 Linköping

Abstract. Some important practical signals and systems can be modeled by very large linear regression models where it is reasonable to assume that most of the parameters are zero. We give an efficient method to solve this combined estimation and structure determination problem, and relate the result to Akaike's information criteria for structure selection and to criteria based on hypothesis testing. A recursive algorithm is derived, which can be applied to time-varying systems as well. Two practical examples show the efficiency of the approach.

Keywords. Linear regression, structure selection, maximum likelihood, sparse models, echoes, orthonormal bases, system identification, modeling, model selection.

1. INTRODUCTION

We will consider linear regression models of the form

y(t) = φ^T(t) θ + e(t)    (1)

where φ(t) is a regression vector and θ the corresponding parameter vector. Here e(t) is assumed to be white Gaussian noise with variance σ². It is implicitly assumed that the number of parameters d = dim θ is large (e.g. d > 100). The assumption in this contribution is that only a small number n of parameters (e.g. n < 10) are active:

θ(k_i) ≠ 0,  i = 1, 2, …, n
θ(l) = 0  otherwise    (2)

By k^n we denote the set of active indices k_1, k_2, …, k_n. We point out two important applications where this is the case.

1. Multipath signal propagation. In telephone communication, echoes are a big problem that can deteriorate the speech quality severely. In 4-wire loop telephony, the echoes come from circuit echo paths (Sondhi and Berkley 1980), while in mobile radio channels they are caused by room acoustic echo paths. The effect can be removed by equalization once a channel model has been identified.

The signal can be written as

y(t) = Σ_{i=1}^n θ(k_i) u(t − k_i) + e(t)    (3)

where u is the interesting speech signal. The indices k_i for active coefficients correspond to the time delays in

¹ Partially supported by AB Volvo.

the echo path. The same problem occurs in sonar applications (Burdic 1984), where the echoes are caused by reflections at the surface and the bottom of the sea, and also in geophysical signal processing (Robinson and Durrani 1986).
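As a concrete illustration, the echo model (3) can be simulated in a few lines; the delays and gains below are hypothetical (loosely mirroring the example in Section 4.2), and a white-noise input stands in for the speech signal u:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 2000, 100                       # data length and maximum delay
delays = [5, 6, 40]                    # hypothetical active delays k_i
gains = [5.0, 6.0, 1.5]                # hypothetical echo gains theta(k_i)

u = rng.standard_normal(N + d)         # white-noise stand-in for the speech signal
e = rng.standard_normal(N)             # measurement noise
# y(t) = sum_i theta(k_i) u(t - k_i) + e(t), eq. (3)
y = sum(g * u[d - k : d - k + N] for g, k in zip(gains, delays)) + e
```

The regression vector at time t then consists of the d delayed input samples u(t−1), …, u(t−d), of which only three carry nonzero coefficients.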

2. Approximation using basis functions. In system identification, the use of (orthonormal) basis functions has become a popular approach recently (Wahlberg 1991, Ninness 1993, Van den Hof et al. 1993). Here, the system is modeled by

y(t) = Σ_{i=1}^n θ(k_i) ⟨ψ_{k_i}(t), u(t)⟩ + e(t)    (4)

with ⟨ψ_{k_i}(t), u(t)⟩ denoting the scalar product of the basis function ψ_{k_i}(t) and the system input u(t). A similar problem occurs in function approximation using, for instance, polynomial basis functions. In this case, u(t) is the function to be approximated.

Note that both (3) and (4) are linear regressions. The problem is to estimate the number of active coefficients n, their positions k^n and their values θ(k_i). The upper bound d on the number of parameters is assumed to be known. Any conceivable method that claims to be optimal for this problem has to examine all possible combinations of k^n. If n were known, there would be C(d, n) different combinations to examine. Here, where n is unknown, there are Σ_{n=0}^d C(d, n) = 2^d different combinations, the latter expression coming from the fact that each coefficient can be either active or inactive. Examples of optimality criteria that will be surveyed are Akaike's BIC (Akaike 1981) and the ML approach. Another possible approach for function approximation is given in (Donoho 1992) and extended to dynamic system approximation in (Bodin and Wahlberg 1994b, Bodin and Wahlberg 1994a).

The contribution here is an approximation of the loss function for each k^n, so that all 2^d loss functions can be evaluated from only d scalar LS estimates. We give a recursive implementation that can be applied to time-varying systems as well. The total complexity corresponds approximately to the least mean square (LMS) algorithm, with a non-standard stepsize, applied to the full parameter vector.

The outline is as follows. Section 2 briefly surveys some possible approaches. The novel approximation of the loss function is derived in Section 3, Section 4 presents some examples, and Section 5 concludes the paper.

2. CLASSICAL APPROACHES

Consider the model (1) with the assumption (2). Assume that for each possible combination k^n = (k_1, k_2, …, k_n) we are given a parameter estimate θ̂_N(k^n) minimizing the sum of squared residuals,

V_N(k^n) = Σ_{t=1}^N [y(t) − φ^T(t) θ̂(k^n)]²    (5)
         = min_θ Σ_{t=1}^N [y(t) − φ^T(t) θ(k^n)]²    (6)

By θ(k^n) is meant the parameter vector in (1) under the constraint (2).

2.1 Including a penalty term

A standard approach to structure determination is to use one of Akaike's three tests FPE, AIC and BIC (Akaike 1981). The latter is identical to Rissanen's MDL criterion (Rissanen 1978). They can be written as follows.

• Loss function with penalty term:

k̂^n = argmin_{k^n} (1/N) V_N(k^n) + σ² n γ(N)    (7)

Here AIC corresponds to γ(N) = 2/N and BIC to γ(N) = log(N)/N. The first term is a measure of model fit while the second one is a penalty term preventing too complex models. This is a generic philosophy in model structure determination: the best model should be a tradeoff between model fit and complexity. The obvious drawback in using (7) is the huge computational complexity, since all possible k^n have to be examined. As pointed out in (?), the maximum likelihood approach leads to a similar criterion.

There is a well-known problem with AIC and BIC when the number of data is small. This has met considerable interest in econometrics, and there might be a similar problem here, where the number of data per parameter can be small as well. We will, however, not investigate that matter here.

It follows from known theory that the method is consistent if

γ(N) → 0,  N → ∞
N γ(N) → ∞,  N → ∞    (8)

see for instance equations (11.47) and (11.68a) in (Soderstrom and Stoica 1989). This means that if the true system can be described by one of the model structures, this structure will eventually be estimated as more and more data become available.
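For small d, the penalized criterion (7) can be minimized exactly by enumerating all 2^d subsets, which makes the combinatorial cost concrete. A sketch with the BIC penalty γ(N) = log(N)/N, assuming a known noise variance σ² = 1 and a hypothetical sparse truth:

```python
import itertools
import numpy as np

def bic_search(Phi, y, sigma2=1.0):
    """Minimize V_N/N + sigma2 * n * log(N)/N over all 2^d subsets (small d only)."""
    N, d = Phi.shape
    best, best_sub = np.inf, ()
    for n in range(d + 1):
        for sub in itertools.combinations(range(d), n):
            if sub:
                th, *_ = np.linalg.lstsq(Phi[:, list(sub)], y, rcond=None)
                V = np.sum((y - Phi[:, list(sub)] @ th) ** 2)
            else:
                V = np.sum(y ** 2)
            crit = V / N + sigma2 * n * np.log(N) / N
            if crit < best:
                best, best_sub = crit, sub
    return best_sub

rng = np.random.default_rng(1)
N, d = 200, 6                        # d small enough for exhaustive search
Phi = rng.standard_normal((N, d))
theta = np.zeros(d)
theta[[1, 4]] = [3.0, -2.0]          # hypothetical sparse truth
y = Phi @ theta + rng.standard_normal(N)
print(bic_search(Phi, y))            # typically recovers the support (1, 4)
```

Already at d = 100 this loop would need 2^100 least squares fits, which is the infeasibility the rest of the paper addresses.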

2.2 Hypothesis testing

A natural approach from a statistical viewpoint is hypothesis testing. We can for instance test the significance of each coefficient by using the hypotheses

H_0(k): θ_k = 0    (9)
H_1(k): θ_k ≠ 0    (10)

From the Gaussian noise assumption and classical least squares theory we have θ̂ ∈ N(θ, P), where P is the corresponding covariance matrix. We next choose a significance level, corresponding to a threshold c, and the test becomes

• Hypothesis testing: Keep the coefficients for which

|θ̂_N(k)| > c √(P_kk)    (11)

A possible problem is that we face a multiple hypothesis test but design d independent hypothesis tests, so we do not know the total confidence level.
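As a sketch of the test (11), the threshold c for a two-sided test at a given significance level follows from the Gaussian quantile; the coefficient values and covariances below are purely illustrative:

```python
from statistics import NormalDist

def coef_test(theta_hat, P_diag, alpha=0.05):
    """Two-sided test per coefficient: keep k when |theta_hat[k]| > c*sqrt(P_kk)."""
    c = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. c = 1.96 for alpha = 0.05
    return [k for k, (th, p) in enumerate(zip(theta_hat, P_diag))
            if abs(th) > c * p ** 0.5]

# illustrative numbers: two clearly nonzero coefficients among four
print(coef_test([2.1, 0.05, -1.8, 0.01], [0.04, 0.04, 0.04, 0.04]))  # → [0, 2]
```

Because the d tests are designed independently, the overall false-alarm probability grows with d, which is exactly the multiple-testing concern raised above.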

Another approach has been developed by (Donoho 1992) in the context of function approximation using a wavelet basis in (4). The solution is to first compute the full parameter estimate θ̂_N(1, 2, …, N) (in this application, there are as many regressors as data, d = N). Then we have the following rule:

• The Donoho test: Keep the coefficients for which

|θ̂_N(k)| > (1 + δ) √(log(N) P_kk)    (12)

where δ > 0 is a constant.

This criterion for wavelet coefficients was called wavelet shrinkage with hard thresholding in (Donoho 1992). The idea is quite congenial. It is based on the following result.

Suppose that θ = 0, which means that k^n = ∅, and that the estimate θ̂_N has a diagonal covariance matrix P. This assumes the regressors to be uncorrelated and implies that the parameter estimate is a white noise sequence. Then

P( max_i |θ̂(i)| / √(P_ii) < (1 + δ) √(log(N)) ) → 1.    (13)

That is, the Donoho test yields

P(k̂^n = ∅) → 1,  d → ∞    (14)

if δ > 0. In words, this result says that the probability of overmodeling tends to zero as the number of coefficients tends to infinity. This is automatically assured in the wavelet application if the number of data tends to infinity, since d = N. The method is particularly well suited for filtering noisy deterministic smooth signals, as will be pointed out in the examples later on. The test (12) has been used in (Bodin and Wahlberg 1994b, Bodin and Wahlberg 1994a) for smoothing empirical transfer function estimates and for thresholding using orthonormal bases, respectively, and the results are promising.

Suppose the covariance matrix for θ̂ is P = (σ²/N) I, which is the case when using orthonormal basis functions with a white input signal. Then the Donoho test can be written

|θ̂_N(k)| > (1 + δ) σ √(log(N)/N).    (15)

This particularly simple form, depending only on the estimate itself, will be used for comparison later on.
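The simple form (15) is a plain hard threshold on the coefficient estimates. A minimal sketch, with δ, σ and N as inputs and illustrative coefficient values:

```python
import numpy as np

def donoho_keep(theta_hat, sigma, N, delta=1.0):
    """Hard threshold (15): keep k when |theta_hat[k]| > (1+delta)*sigma*sqrt(log(N)/N)."""
    thr = (1 + delta) * sigma * np.sqrt(np.log(N) / N)
    return np.where(np.abs(theta_hat) > thr)[0]

theta_hat = np.array([0.02, 0.9, -0.01, -0.5, 0.03])
print(donoho_keep(theta_hat, sigma=1.0, N=256))  # → [1 3]
```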

2.3 Maximum Likelihood methods

If the structure k^n is considered as an unknown parameter, classical estimation approaches like the maximum likelihood (ML) method can be applied. The aim is, in the face of the data, to maximize the conditional probability density function p(y^N | k^n). Asymptotic expressions for the likelihood are given in (?). For the cases of a known or unknown noise variance σ² we have, from Table 3, cases 17 and 12:

• ML with known noise variance:

k̂^n = argmin_{k^n} (1/N) V_N(k^n) + σ² n log(N)/N    (16)

• ML with unknown noise variance:

k̂^n = argmin_{k^n} log(V_N(k^n)) + n log(N)/N    (17)

Note that (16) is (asymptotically) identical to BIC. The application decides which variant to use. For instance, in function approximation or signal recovery we might need to tune the approximation error or detail of the result. In communication problems the variance is unknown and the second variant is the most natural.

3. NEW APPROACH

There are two problems with the likelihood- and information-based approaches that make their direct application infeasible:

• The evaluation of the loss function involves the computation of the least squares solution to problems of very high dimension.

• We need to compare 2^d different model structures, and apparently there is no way to avoid computing one loss function for each of them.

We will give a solution to each of these problems. First, in the next subsection, we replace the least squares estimate in the loss function by the least mean square (LMS) estimate, which is much simpler to compute. Then, we point out that with this approximation all possible loss functions can be evaluated at once, and the total estimation scheme is as simple as the Donoho test.

3.1 Approximating the least squares loss function

Let φ(t; k^n) denote the regression vector where all elements but k_1, k_2, …, k_n are set to zero. If we define

f_N(k^n) = Σ_{t=1}^N φ(t; k^n) y(t)    (18)

R_N(k^n) = Σ_{t=1}^N φ(t; k^n) φ^T(t; k^n)    (19)

the least squares estimate can be written

θ̂_N(k^n) = R_N(k^n)⁻¹ f_N(k^n)    (20)

The loss function can now be written in a standard manner

V_N(k^n) = Σ_{t=1}^N (y(t) − φ^T(t; k^n) θ̂)²    (21)
         = Σ_{t=1}^N y²(t) − f_N^T(k^n) R_N⁻¹(k^n) f_N(k^n)    (22)
         = N σ̂_y² − f_N^T(k^n) θ̂_N^{RLS}(k^n)    (23)

where σ̂_y² is the variance of the observed output y(t).

The key idea is to replace R_N(k^n) by a diagonal matrix D_N(k^n) given by the diagonal elements of R_N(k^n):

D_N^{ii}(k^n) = Σ_{t=1}^N [φ^{k_i}(t)]².    (24)

This approximation of uncorrelated regressors can be motivated in different ways. Consider first the application (3). If the input is white noise, we have E(R) = D, so the approximation is obviously unbiased. If the input is autocorrelated, which is the case for speech signals for instance, the correlation decays quickly in time. For large time distances between the active coefficients, the approximation is almost unbiased. Secondly, the basis functions in (4) are often chosen orthonormal, leading to approximately uncorrelated regressors.

There are two consequences of this approximation:

• LMS requires much less computational power (no matrix inversion is needed).

• The estimate of one parameter θ_k remains the same regardless of which other indices are included in k^n. This is not true for the RLS estimate.

The second consequence is the most important one, since we only have to compute the full parameter estimate once and then all loss functions can be approximated.
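The second consequence can be checked numerically: with white regressors, the diagonal ("LMS-style") estimate f_N(k)/D_N^{kk} is computed in one pass and stays close to the full least squares solution. A sketch under the assumption of white regressors and a hypothetical sparse parameter vector:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 5000, 20
Phi = rng.standard_normal((N, d))      # white regressors: E(R_N) is diagonal
theta = np.zeros(d)
theta[[3, 11]] = [2.0, -1.0]           # hypothetical sparse truth
y = Phi @ theta + 0.1 * rng.standard_normal(N)

f = Phi.T @ y                          # f_N, eq. (18)
D = np.sum(Phi ** 2, axis=0)           # diagonal of R_N, eq. (24)
theta_lms = f / D                      # one pass gives every theta_hat(k) at once
theta_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]
print(np.max(np.abs(theta_lms - theta_ls)))   # small for white regressors
```

Each entry of theta_lms depends only on its own regressor column, which is why a single pass over the data covers all 2^d candidate structures at once.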

It remains to analyse the error introduced when replacing the RLS estimate with the LMS estimate.

Lemma 1. Consider the matrices D and R,

R_ij = r(i − j) = Σ_{t=1}^N u(t) u(t + j − i),  i, j = 1, 2, …, n
D_ii = r(0) = Σ_{t=1}^N u²(t),  i = 1, 2, …, n

We have

E lim_{N→∞} N tr(R D⁻¹ R D⁻¹ − I) = n(n − 1)    (25)

Remark 1. The definition of R and D assumes pre- and post-windowing in the least squares method.

Proof: We have

R D⁻¹ =
  [ 1            r(1)/r(0)    ⋯  r(n−1)/r(0)
    r(1)/r(0)    1            ⋯  r(n−2)/r(0)
    ⋮            ⋮            ⋱  ⋮
    r(n−1)/r(0)  ⋯  r(1)/r(0)    1           ]    (26)

and the diagonal elements of its square are given by

diag(R D⁻¹ R D⁻¹) =
  [ 1 + r²(1)/r²(0) + ⋯ + r²(n−1)/r²(0)
    r²(1)/r²(0) + 1 + ⋯ + r²(n−2)/r²(0)
    ⋮
    r²(n−1)/r²(0) + ⋯ + r²(1)/r²(0) + 1 ]

Now a standard result, see equation (11.9) in (Soderstrom and Stoica 1989), says that

N r²(k)/r²(0) → χ²(1)    (27)

in distribution. That is,

E lim_{N→∞} N tr(R D⁻¹ R D⁻¹ − I) = n(n − 1),

which proves the lemma. □

Lemma 2.

E lim_{N→∞} (1/√N) |f_N^T(k^n) θ̂_N^{RLS}(k^n) − f_N^T(k^n) θ̂_N^{LMS}(k^n)| ≤ σ_y² √(n(n − 1))

where σ_y² is the variance of the output y(t).

Proof: Using the matrix notation Y = (y(1), …, y(N))^T, Φ = (φ(1), …, φ(N))^T, we have

|f_N^T(k^n) θ̂_N^{RLS}(k^n) − f_N^T(k^n) θ̂_N^{LMS}(k^n)|
= |Y^T Φ (R⁻¹ − D⁻¹) Φ^T Y|
≤ |Y|² ‖Φ (R⁻¹ − D⁻¹) Φ^T‖₂
= N σ_y² ‖Φ (R⁻¹ − D⁻¹) Φ^T‖₂
≤ N σ_y² ( tr( Φ (R⁻¹ − D⁻¹) Φ^T Φ (R⁻¹ − D⁻¹) Φ^T ) )^{1/2}
= N σ_y² ( tr( Φ^T Φ (R⁻¹ − D⁻¹) Φ^T Φ (R⁻¹ − D⁻¹) ) )^{1/2}
= N σ_y² ( tr( (I − R D⁻¹)(I − R D⁻¹) ) )^{1/2}
= N σ_y² ( tr( I − 2 R D⁻¹ + R D⁻¹ R D⁻¹ ) )^{1/2}
= N σ_y² ( tr( R D⁻¹ R D⁻¹ − I ) )^{1/2}

where we have used tr(R D⁻¹) = tr(I) = n and ‖A‖₂ ≤ √(tr(A^T A)). Now the result follows from Lemma 1, linearity of the trace operator and Jensen's inequality: the square root is a strictly concave function, so E(√X) < √(E(X)). □

Note that for n = 1 LMS and RLS coincide, so there is no difference. Now we have an expression for the error, which in itself satisfies the conditions (8) for being a penalty term. We can thus include a multiple of the error term in the criterion, both to account for the approximation error introduced when simplifying the loss function and to act as a penalty term. Approximating n(n − 1) ≈ n², we finally get the following criterion:

k̂^n = argmin_{k^n} −(1/N) f_N^T(k^n) θ̂_N^{LMS}(k^n) + α σ_y² n / √N

A good starting value is α = 2. Since the error bound is a bit conservative, a smaller α might give better detectability. The difference compared to the threshold proposed in (Homer et al. 1994) is that log(N)/N is replaced by 1/√N. The criterion can conveniently be rewritten as keeping the coefficients whose values are larger than a certain threshold,

(1/N) f_N(i) θ̂_N^{LMS}(i) > α σ_y² / √N    (28)

This criterion is quite appealing, since no knowledge of the noise variance is required.
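A batch sketch of the criterion (28), with the suggested α = 2 and hypothetical data; f_N and the diagonal of R_N are computed in a single pass, as in the text:

```python
import numpy as np

def sparse_select(Phi, y, alpha=2.0):
    """Criterion (28): keep i when f_N(i)*theta_lms(i)/N > alpha*var(y)/sqrt(N)."""
    N = len(y)
    f = Phi.T @ y                      # f_N
    D = np.sum(Phi ** 2, axis=0)       # diagonal of R_N
    theta_lms = f / D
    score = f * theta_lms / N          # note f(i)*theta_lms(i) = f(i)**2/D(i) >= 0
    return np.where(score > alpha * np.var(y) / np.sqrt(N))[0]

rng = np.random.default_rng(3)
N, d = 2000, 100
Phi = rng.standard_normal((N, d))
theta = np.zeros(d)
theta[[5, 40]] = [1.5, -2.0]           # hypothetical sparse truth
y = Phi @ theta + rng.standard_normal(N)
print(sparse_select(Phi, y))           # should recover indices 5 and 40
```

Only the sample variance of y enters the threshold, which reflects the remark that no knowledge of the noise variance is required.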

3.2 Implementation

The LMS algorithm is very fast, so (28) can be implemented directly. We will, however, point out a recursive algorithm which can be applied to time-varying systems. The motivation is that the systems in telephone and sonar applications are time-varying, and there is a strong need for recursive estimation.

Algorithm 1. Choose a forgetting factor λ < 1 and compute recursively

f_t = λ f_{t−1} + (1 − λ) φ(t) y(t)    (29)
D_t = λ D_{t−1} + (1 − λ) diag(φ(t))²    (30)
σ̂²_{y,t} = λ σ̂²_{y,t−1} + (1 − λ) y²(t)    (31)
θ̂_t = D_t⁻¹ f_t    (32)

Keep the coefficients for which

f_t(k) θ̂_t(k) > 2 σ̂_y² √(1 − λ)    (33)

Note that f_t(k) and θ̂_t(k) have the same sign.
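Algorithm 1 can be transcribed almost line by line; the forgetting factor 0.999, the constant 2 in the test, and the simulated channel below are all illustrative choices:

```python
import numpy as np

def algorithm1(phi_seq, y_seq, lam=0.999, alpha=2.0):
    """Recursive sparse estimator, eqs. (29)-(33)."""
    d = phi_seq.shape[1]
    f = np.zeros(d)                  # f_t
    D = np.full(d, 1e-6)             # D_t, small init to avoid division by zero
    var_y = 0.0                      # recursively estimated output variance
    for phi, y in zip(phi_seq, y_seq):
        f = lam * f + (1 - lam) * phi * y          # (29)
        D = lam * D + (1 - lam) * phi ** 2         # (30)
        var_y = lam * var_y + (1 - lam) * y ** 2   # (31)
        theta = f / D                              # (32)
        # test (33), evaluated at every t for time-varying systems
        active = np.where(f * theta > alpha * var_y * np.sqrt(1 - lam))[0]
    return active, theta

rng = np.random.default_rng(4)
N, d = 3000, 50
Phi = rng.standard_normal((N, d))
theta_true = np.zeros(d)
theta_true[[5, 6]] = [5.0, 6.0]                    # hypothetical active taps
y = Phi @ theta_true + rng.standard_normal(N)
active, theta = algorithm1(Phi, y)
print(active)  # the active set should settle on the true taps
```

Since f_t(k) and θ̂_t(k) share the same sign, the left-hand side of (33) is f_t(k)²/D_t(k) ≥ 0, so the test never triggers on the sign of a coefficient alone.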

4. EXAMPLES

4.1 Local low-pass filtering using wavelets

Consider the signal in Figure 1. The signal shows the angular velocity of a non-driven wheel in a Volvo 850 GLT, using standard sensors in the ABS system. This particular test drive was performed on a gravel road with sampling interval 0.2 seconds. The roughness of the surface causes a noise in the wheel angular velocity compared to the pure unknown velocity signal. The purpose of this test is to classify gravel road from other roads by estimating the size of the noise. For this purpose, we first need to estimate the velocity signal by local low-pass filtering. Standard "global" low-pass filters are useless here due to the deterministic high-frequency content in the signal.

The wavelet basis is used in (4). There are 300 data points and thus 300 wavelet coefficients. The criterion

Fig. 1. Wheel velocity as a function of time

(15) with δ = 1 is applied to the wavelet transform coefficients, and the remaining parameters are inverse transformed. About 30% of the transform coefficients are kept in this example. The upper plot in Figure 2 shows the residuals, that is, the difference between the original and filtered signals, as a function of time, and the lower plot shows the residuals' histogram. Clearly, the histogram resembles a Gaussian probability density function, except for some outliers originating from the braking maneuvers. This was expected, since during 0.2 seconds, corresponding approximately to 4 meters, many small holes and ridges of different shapes add up to a total error which can be considered as a Gaussian variable according to the central limit theorem. This is one validation that the approach works. More details of this project can be found in (Gustafsson 1995). Finally, we remark that the bias in the residuals comes from an offset introduced in the wavelet transform.

This example illustrates an application where the assumption (2) holds. In an on-line approximation, other basis functions should be used and the coefficients estimated recursively, using Algorithm 1.

Fig. 2. Estimated noise sequence and its histogram

4.2 Echo detection

We here describe an example similar to one in (Homer et al. 1994). The underlying model is (3). The channel's impulse response is given in the upper plot of Figure 3, and it has three active taps. The input and measurement noise are independent Gaussian white noises with variances 1 and 16, respectively.

Fig. 3. Impulse response of the channel and simulated output

Figure 4 shows the result of Algorithm 1, averaged over 100 simulations. The correct taps were found in all simulations. The upper plot shows the estimated active taps. They converge quickly to the correct values 5, 6 and 1.5. Also, the number of estimated active taps converges quickly to the correct number 3, as shown in the lower plot.

Fig. 4. Estimated active taps (upper plot) and estimated number of active taps (lower plot)

5. CONCLUSIONS

We have investigated the combined structure determination and parameter estimation problem for linear regression models where most of the parameters are expected to be zero. Most conceivable methods, like BIC and the ML method, lead to minimization of a criterion including the least squares loss function, which must be evaluated a huge number of times. We have here proposed a way to simplify the loss function by using the least mean square estimate instead of the least squares one. An analysis provided an error term to be included in the criterion in combination with the usual penalty term, which can be absorbed into the former. The proposed algorithm is of very low complexity, corresponding to the least mean square algorithm. A recursive algorithm was pointed out that can be applied to time-varying systems. Finally, the algorithm was tested on two examples from completely different applications.

6. REFERENCES

Akaike, H. (1981). Modern development of statistical methods. In: Trends and Progress in System Identification (P. Eykhoff, Ed.). Pergamon Press, Oxford.

Bodin, P. and B. Wahlberg (1994a). Thresholding in high order transfer function estimation. In: Proc. 33rd IEEE Conf. on Decision and Control. IEEE. pp. 3400-3405.

Bodin, P. and B. Wahlberg (1994b). A wavelet approach to frequency response estimation. In: SYSID'94. IFAC. pp. 2441-2446.

Burdic, W.S. (1984). Underwater Acoustic System Analysis. Prentice-Hall, Englewood Cliffs, NJ.

Donoho, D.L. (1992). De-noising by soft-thresholding. Technical Report 409, Dept. of Statistics, Stanford University.

Gustafsson, F. (1995). Slip-based estimation of tire-road friction. In: Proc. of the 1995 European Control Conference, Rome. pp. 725-730.

Homer, J., B. Wahlberg, F. Gustafsson, I. Mareels and R. Bitmead (1994). LMS estimation of sparsely parametrized channels via structural detection. In: Proc. of the CDC, 1994. Florida, USA. pp. 257-262.

Ninness, B.M. (1993). Stochastic and Deterministic Modelling. PhD thesis. University of Newcastle.

Rissanen, J. (1978). Modeling by shortest data description. Automatica 14, 465-471.

Robinson, E.A. and T.S. Durrani (1986). Geophysical Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.

Soderstrom, T. and P. Stoica (1989). System Identification. Prentice Hall, New York.

Sondhi, M.M. and D.A. Berkley (1980). Silencing echoes on the telephone network. Proceedings of the IEEE 68, 948-963.

Van den Hof, P.M.J., P.S.C. Heuberger and J. Bokor (1993). Identification with generalized orthonormal basis functions: statistical analysis and error bounds. Selected Topics in Identification, Modelling and Control 6, 39-48.

Wahlberg, B. (1991). System identification using Laguerre models. IEEE Transactions on Automatic Control.
