
http://www.diva-portal.org

Preprint

This is the submitted version of a paper presented at ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing.

Citation for the original published paper:

Borpatra Gohain, P., Jansson, M. (2020)

RELATIVE COST BASED MODEL SELECTION FOR SPARSE HIGH-DIMENSIONAL LINEAR REGRESSION MODELS

In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5515-5519). IEEE

https://doi.org/10.1109/ICASSP40776.2020.9054045

N.B. When citing this work, cite the original published paper.

Permanent link to this version:

http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-273724


RELATIVE COST BASED MODEL SELECTION FOR SPARSE HIGH-DIMENSIONAL LINEAR REGRESSION MODELS

Prakash B. Gohain, Magnus Jansson

Division of Information Science and Engineering, KTH Royal Institute of Technology, Sweden

Email: pbg@kth.se, janssonm@kth.se

ABSTRACT

In this paper, we propose a novel model selection method named multi-beta-test (MBT) for the sparse high-dimensional linear regression model. The estimation of the correct subset in the linear regression problem is formulated as a series of hypothesis tests where the test statistic is based on the relative least-squares cost of successive parameter models. The performance of MBT is compared to existing model selection methods for high-dimensional parameter spaces, such as the extended Bayesian information criterion (EBIC), the extended Fisher information criterion (EFIC), residual ratio thresholding (RRT) and orthogonal matching pursuit (OMP) with a priori knowledge of the sparsity. Simulation results indicate that the performance of MBT in identifying the true support set surpasses that of EBIC, EFIC and RRT in certain regions of the considered parameter settings.

Index Terms: Beta distribution, high-dimensional inference, linear regression, model selection, OMP, sparse estimation.

1. INTRODUCTION

In this paper, we consider the standard linear regression model y = Ax + σe, where y is the n-dimensional vector of real measurement responses and A ∈ R^{n×p} is the known design matrix, where possibly p > n (or p ≫ n), comprising the column vectors {a_1, ..., a_p}, also known as regressors.

x ∈ R^p is the vector of unknown regressor coefficients and is assumed to be sparse, i.e., the support of x, given by S_0 = {i : x_i ≠ 0}, has cardinality |S_0| = k_0 ≪ min(n, p). The vector e ∈ R^n is the error or noise vector whose elements are assumed to be independent and identically Gaussian distributed, e ∼ N(0, I_n), where I_n is the n × n identity matrix. The parameter σ ≥ 0 is the unknown noise standard deviation. The goal of this paper is to design a method to accurately recover the true support S_0.

This work was supported in part by the Swedish Research Council under contract 2015-05484 and the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 742648).

Methods for model selection have been a major research topic for a long time, and many of these techniques have evolved from fields such as signal processing, information theory and statistics [1]. A wide variety of methods has been proposed to date. The classical Akaike information criterion (AIC) is a model selection principle proposed by Akaike in 1974 [2], which selects the model with the minimum AIC value from a set of candidate models. Another well known and widely used model selection criterion that evolved out of AIC is the Bayesian information criterion (BIC) [3]. It favours the model that has the minimum BIC value.

BIC is closely related to AIC; however, the penalty term in AIC is independent of the sample size, which makes AIC prone to over-fitting. BIC, on the other hand, carries a risk of under-fitting [4].

In recent times, model selection in high-dimensional spaces has gained increased attention, primarily because in the high-dimensional setting, where the number of measurements n is significantly smaller than the parameter dimension p (p ≫ n), the classical model selection criteria such as AIC and BIC, as well as other methods such as cross-validation [5] and generalized cross-validation [6], become ineffective and tend to select models with false parameters [7, 8, 9].

To overcome this problem, the extended BIC (EBIC) was proposed, which takes into consideration both the number of unknown parameters and the complexity of the model space [7].

EBIC can be regarded as a generic model selection method in the sense that it extends the set of competing models to the collection of all possible combinatorial models up to a maximum cardinality K, with K ≪ n. Under an asymptotic identifiability condition, EBIC can consistently select the true model as n tends to infinity [7]. However, the empirical performance of EBIC can sometimes be unsatisfactory for practical sizes of n. Moreover, in scenarios where n is fixed but the noise variance σ^2 tends to zero, the behaviour of EBIC is found to be inconsistent [10].

To overcome some of the issues of EBIC, a novel model selection criterion named the extended Fisher information criterion (EFIC) was proposed that addresses the problem of generic model selection for high-dimensional parameter vectors [10, 11]. It builds on EBIC and on Fisher-information-based model selection criteria. In these papers, the authors analyze the performance of EFIC for a linear regression problem as n → ∞ as well as when the noise variance σ^2 → 0, and show that for practical sizes of n, or when σ^2 → 0, EFIC outperforms EBIC.

Orthogonal matching pursuit (OMP) is a widely used algorithm for recovering high-dimensional sparse signal vectors in linear regression models. OMP is a step-wise forward-selection greedy algorithm that iteratively selects, at each step, the column of A having the maximum correlation with the current residual [12]. The algorithm stops when a pre-specified number of columns has been selected or when the residual is considered sufficiently small. However, such information is rarely available in practice, and a proper stopping criterion or model selection scheme is therefore needed to guarantee satisfactory performance.

Residual ratio thresholding (RRT) is a recent algorithm in the same genre that deals with the estimation of sparse regression vectors in high-dimensional linear regression models [13]. It operates with OMP to find a good estimate of the support S_0 from the data-dependent sequence of supports generated by OMP.

This paper presents a model selection method called the multi-beta-test (MBT) that provides an estimate of the true support S_0 for the sparse high-dimensional linear regression problem. OMP is employed for regressor selection to exemplify one potential use of our method. After selecting a certain number of regressors via OMP, a series of tests is conducted using the relative least-squares cost of consecutive parameter models as the test statistic to estimate the support S_0 of the unknown vector x. The performance of MBT is compared to EBIC, EFIC, RRT and OMP with a priori knowledge of the sparsity k_0, also referred to as the oracle property.

The paper is organized as follows: In Section 2 we describe the OMP algorithm. Section 3 presents the proposed method in detail. Section 4 shows the simulation results and Section 5 concludes the paper.

2. ORTHOGONAL MATCHING PURSUIT (OMP)

Given a design matrix A ∈ R^{n×p}, OMP (Algorithm 1) iteratively selects the optimal columns one by one until the stopping criterion is fulfilled. A required first step is to normalize all the columns of A to unit norm, i.e., ||a_i||_2 = 1 for all i = 1, 2, ..., p. OMP starts by initializing the residual vector r_0 = y, setting the counter i = 1 and the initial empty index set S_0^{OMP} = ∅. A column index is selected if that column has the maximum absolute correlation with the residual vector r_{i-1}, i.e., d_i = arg max_j |a_j^T r_{i-1}|, where d_i denotes the column index selected at the i-th iteration and a_j represents the j-th column of A. A_{S_i^{OMP}} denotes the sub-matrix of A formed from the columns indexed by S_i^{OMP}. Next, based on the current support S_i^{OMP}, the least-squares estimate of y, i.e., Π_{A_{S_i^{OMP}}} y, is evaluated, where Π_{A_{S_i^{OMP}}} = A_{S_i^{OMP}} A_{S_i^{OMP}}^† denotes the projection matrix onto span(A_{S_i^{OMP}}) and A_{S_i^{OMP}}^† = (A_{S_i^{OMP}}^T A_{S_i^{OMP}})^{-1} A_{S_i^{OMP}}^T is the Moore-Penrose pseudo-inverse of A_{S_i^{OMP}}. The estimate Π_{A_{S_i^{OMP}}} y is then used to update the residual r_i = y − Π_{A_{S_i^{OMP}}} y = (I_n − Π_{A_{S_i^{OMP}}}) y. The selection process is halted when the stopping criterion is achieved or when the number of selected regressors reaches a value fixed a priori.

Algorithm 1 OMP
1: procedure OMP
2:   Inputs: Design matrix A, observation vector y.
3:   Initialization: r_0 = y, S_0^{OMP} = ∅, i = 1
4:   Repeat:
5:     Find next column index d_i = arg max_j |a_j^T r_{i-1}|
6:     Add current index: S_i^{OMP} = S_{i-1}^{OMP} ∪ {d_i}
7:     Update residual: r_i = (I_n − Π_{A_{S_i^{OMP}}}) y
8:     Increment counter: i = i + 1
9:     Stop if the stopping condition is achieved
10:  Outputs: True support estimate Ŝ_0 = S_i^{OMP}; parameter vector estimate x̂ = (A_{Ŝ_0}^T A_{Ŝ_0})^{-1} A_{Ŝ_0}^T y
11: end procedure
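For concreteness, a minimal NumPy sketch of the OMP iteration described above is given below. It is not the authors' implementation; it assumes the columns of A are already unit-norm, uses a fixed number of iterations k_max as the stopping rule, and all function and variable names are illustrative.

import numpy as np

def omp(A, y, k_max):
    """Greedy OMP sketch: returns the selected index set and the LS estimate.

    A : (n, p) design matrix with unit-norm columns (assumed).
    y : (n,) observation vector.
    k_max : number of iterations, i.e., columns to select.
    """
    support = []                      # S_i^OMP
    r = y.copy()                      # residual r_0 = y
    for _ in range(k_max):
        # pick the column most correlated with the current residual
        d = int(np.argmax(np.abs(A.T @ r)))
        support.append(d)
        A_S = A[:, support]
        # least-squares fit on the current support and residual update
        x_S, *_ = np.linalg.lstsq(A_S, y, rcond=None)
        r = y - A_S @ x_S             # r_i = (I - Pi_{A_S}) y
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = x_S
    return support, x_hat

With data generated as in Section 4, a call such as omp(A, y, k_max=20) returns the first 20 selected indices in their order of selection, which is the input MBT operates on.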

3. PROPOSED METHOD: MBT

Let A_S ∈ R^{n×s} be a matrix constructed from some columns of the design matrix A with support S ⊂ {1, 2, ..., p} and cardinality |S| = s ≪ min(n, p). Let x_S ∈ R^s be the unknown regressor coefficient vector corresponding to A_S. The linear regression model can then be rewritten as y = A_S x_S + σe. The method of least squares provides an estimate of x_S by minimizing the least-squares cost function V_S(x_S) = ||y − A_S x_S||^2, where ||·|| denotes the Euclidean vector norm. The minimizer is x̂_S = (A_S^T A_S)^{-1} A_S^T y, and the least-squares estimate of the output response y is Π_{A_S} y [14]. Then, for the design matrix A_S with support S, the least-squares cost is

V_S = ||(I − Π_{A_S}) y||^2 = ||Π_{A_S}^⊥ y||^2 = y^T Π_{A_S}^⊥ y,   (1)

where Π_{A_S}^⊥ = I − Π_{A_S} is the orthogonal projection matrix onto the null space of A_S^T. Now consider a new matrix A_I ∈ R^{n×k} with support I having cardinality |I| = k, consisting of columns of the original design matrix A that are not present in A_S. Concatenating A_S and A_I forms the new matrix [A_S A_I]. The new least-squares cost after incorporating A_I is evaluated as

V_{[S,I]} = ||Π_{[A_S A_I]}^⊥ y||^2
          = ||y − Π_{A_S} y − Π_{Π_{A_S}^⊥ A_I} y||^2
          = ||Π_{A_S}^⊥ y − Π_{Π_{A_S}^⊥ A_I} y||^2
          = (Π_{A_S}^⊥ y − Π_{Π_{A_S}^⊥ A_I} y)^T (Π_{A_S}^⊥ y − Π_{Π_{A_S}^⊥ A_I} y)
          = V_S − y^T Π_{Π_{A_S}^⊥ A_I} y.   (2)

Using (1) and (2), the relative cost is evaluated as

w_I = (V_S − V_{[S,I]}) / V_S = (y^T Π_{Π_{A_S}^⊥ A_I} y) / (y^T Π_{A_S}^⊥ y).   (3)

Using the projection matrix expression Π_X = X (X^T X)^{-1} X^T, we can rewrite (3) as

w_I = (y^T Π_{A_S}^⊥ A_I (A_I^T Π_{A_S}^⊥ A_I)^{-1} A_I^T Π_{A_S}^⊥ y) / (y^T Π_{A_S}^⊥ y).   (4)

Thus w_I measures the decrease in the least-squares cost, relative to V_S, obtained by adding |I| = k extra columns of A.
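As an illustration, the relative cost in (3)-(4) can be computed directly from the residuals of the two nested least-squares fits, which is algebraically equivalent to (4). The following sketch (illustrative names, not from the paper) does exactly that:

import numpy as np

def ls_cost(A_sub, y):
    """Least-squares cost V_S = ||(I - Pi_{A_S}) y||^2 for the columns in A_sub."""
    coef, *_ = np.linalg.lstsq(A_sub, y, rcond=None)
    resid = y - A_sub @ coef
    return float(resid @ resid)

def relative_cost(A, y, S, I):
    """w_I = (V_S - V_[S,I]) / V_S, cf. (3), for index lists S and I."""
    V_S = ls_cost(A[:, list(S)], y)
    V_SI = ls_cost(A[:, list(S) + list(I)], y)
    return (V_S - V_SI) / V_S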

Next we derive the statistical properties of w_I. To this end, assume that S = S_0, the true support of x. Then the true data model is y = A_S x_S + σe with e ∼ N(0, I_n). Furthermore, the projection matrix Π_{A_S}^⊥ can be decomposed as Π_{A_S}^⊥ = U U^T, where U ∈ R^{n×(n−s)} is a semi-orthogonal matrix whose columns span the null space of A_S^T. We also denote ẽ = U^T e ∼ N(0, I_{n−s}), where s = |S|, and let Ã_I = U^T A_I ∈ R^{(n−s)×k}, where k = |I| and n − s > k. Hence, when S is the true support we can rewrite (4), using the above substitutions and the fact that Π_{A_S}^⊥ A_S = 0, as

w_I = (e^T U U^T A_I (A_I^T U U^T A_I)^{-1} A_I^T U U^T e) / (e^T U U^T e)
    = (ẽ^T Ã_I (Ã_I^T Ã_I)^{-1} Ã_I^T ẽ) / (ẽ^T ẽ)
    = (ẽ^T Π_{Ã_I} ẽ) / (ẽ^T ẽ)
    = (ẽ^T Π_{Ã_I} ẽ) / (ẽ^T (I − Π_{Ã_I} + Π_{Ã_I}) ẽ)
    = (ẽ^T Π_{Ã_I} ẽ) / (ẽ^T Π_{Ã_I} ẽ + ẽ^T Π_{Ã_I}^⊥ ẽ)
    = X_1 / (X_1 + X_2),

where X_1 and X_2 are independent random variables distributed as X_1 ∼ χ^2(k) and X_2 ∼ χ^2(n − s − k). It is well known that if X_1 and X_2 are independent chi-square distributed random variables, then w_I follows a beta distribution with parameters k/2 and (n − s − k)/2 [15], i.e.,

w_I ∼ B(k/2, (n − s − k)/2).   (5)
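The beta law in (5) is straightforward to check numerically. The sketch below (with arbitrary illustrative sizes n, s and k) repeatedly draws ẽ ∼ N(0, I_{n−s}), computes w_I = ẽ^T Π_{Ã_I} ẽ / ẽ^T ẽ for a fixed Ã_I, and compares the empirical distribution with B(k/2, (n − s − k)/2) via a Kolmogorov-Smirnov test:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, s, k = 100, 5, 3                        # arbitrary sizes with n - s > k
A_tilde = rng.standard_normal((n - s, k))  # fixed (n-s) x k matrix playing the role of \tilde{A}_I
# projection matrix onto span(A_tilde)
P = A_tilde @ np.linalg.solve(A_tilde.T @ A_tilde, A_tilde.T)

w = []
for _ in range(5000):
    e = rng.standard_normal(n - s)         # \tilde{e} ~ N(0, I_{n-s})
    w.append((e @ P @ e) / (e @ e))        # w_I = e^T Pi e / e^T e

# KS test against the Beta(k/2, (n-s-k)/2) law claimed in (5)
print(stats.kstest(w, "beta", args=(k / 2, (n - s - k) / 2)))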

Now, from (3) we see that minimizing the least-squares cost V_{[S,I]} over I is equivalent to maximizing the relative cost w_I. Let us denote

w_I^* = max_{I_i} w_{I_i},   I_i ⊂ {1, 2, ..., p} \ S,   |I_i| = k,

where {1, 2, ..., p} \ S denotes the set difference between the two sets and i = 1, 2, ..., C(p−s, k), with C(p−s, k) the binomial coefficient. The probability that the maximum relative cost w_I^* is less than some threshold γ can be expressed as

P(w_I^* < γ) = P(w_{I_1} < γ & ... & w_{I_{C(p−s,k)}} < γ)
             = 1 − P(w_{I_1} > γ or ... or w_{I_{C(p−s,k)}} > γ),

and therefore

P(w_I^* < γ) ≥ 1 − C(p−s, k) P(w_I > γ),   (6)

where the last inequality follows from the union bound.

Hence, (6) defines a lower bound on the probability P(w_I^* < γ). Setting this lower bound to some value β ∈ (0, 1) we get

C(p−s, k) P(w_I > γ) = 1 − β
P(w_I > γ) = (1 − β) / C(p−s, k)
P(w_I < γ) = 1 − (1 − β) / C(p−s, k) = ρ (say).

Since w_I ∼ B(k/2, (n − s − k)/2), the threshold γ can be evaluated as

γ = B^{-1}(ρ; k/2, (n − s − k)/2),   (7)

where B^{-1}(·) is the inverse beta cumulative distribution function.
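In code, (7) amounts to a single call to the inverse beta CDF (percent-point function); a sketch assuming SciPy, with illustrative parameter names, is:

from scipy.special import comb
from scipy.stats import beta as beta_dist

def mbt_threshold(n, p, s, k, beta_conf):
    """Threshold gamma in (7): rho = 1 - (1 - beta)/C(p - s, k),
    gamma = B^{-1}(rho; k/2, (n - s - k)/2). Assumes n - s - k > 0."""
    rho = 1.0 - (1.0 - beta_conf) / comb(p - s, k, exact=False)
    return beta_dist.ppf(rho, k / 2.0, (n - s - k) / 2.0)

# example: n = 60, p = 300, current support size s = 5, k = 1, beta = 0.95
print(mbt_threshold(60, 300, 5, 1, 0.95))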

The MBT procedure is summarized in Algorithm 2. First, specify the design matrix A and the measurement vector y. Then run k_max (> k_0) iterations of OMP to identify the k_max most appropriate column vectors of A in order of decreasing significance. This gives the OMP-generated index set S_{k_max}^{OMP}. Next, for each s = 1, ..., (k_max − 1), compute the least-squares cost V_{S_s^{OMP}} and generate the sequence of relative costs {w_I(k)} and the corresponding sequence of thresholds {γ(k)} for k = |I| = 1, ..., (k_max − s). The true sparsity k_0 is estimated as the first value of s at which w_I(k) < γ(k) for all k = 1, ..., (k_max − s).
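Before the formal listing in Algorithm 2, the selection rule just described can be sketched compactly as follows. This sketch is not the authors' implementation; it assumes that the least-squares costs V_{S_s^{OMP}}, s = 1, ..., k_max, along the OMP path are already available (e.g., computed as in Algorithm 1) and that n > k_max, and it reuses the threshold of (7).

from scipy.special import comb
from scipy.stats import beta as beta_dist

def mbt_estimate_sparsity(V, n, p, beta_conf=0.95):
    """V[s-1] holds the cost V_{S_s^OMP} for s = 1, ..., k_max.
    Returns the estimated sparsity k0_hat following Algorithm 2."""
    k_max = len(V)
    for s in range(1, k_max):
        passed = True
        for k in range(1, k_max - s + 1):
            w = (V[s - 1] - V[s + k - 1]) / V[s - 1]       # relative cost w_I(k)
            rho = 1.0 - (1.0 - beta_conf) / comb(p - s, k)
            gamma = beta_dist.ppf(rho, k / 2.0, (n - s - k) / 2.0)
            if w >= gamma:                                  # some test fails: keep growing s
                passed = False
                break
        if passed:
            return s                                        # w_I(k) < gamma(k) for all k
    return k_max                                            # fallback if no s passes all tests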

Algorithm 2 MBT as used with OMP
1: procedure MBT
2:   Input: Design matrix A, measurement vector y
3:   Run k_max iterations of OMP to get the index set S_{k_max}^{OMP}
4:   for s = 1 to (k_max − 1) do
5:     Compute V_{S_s^{OMP}} = y^T Π_{A_{S_s^{OMP}}}^⊥ y
6:     for k = |I| = 1, 2, ..., (k_max − s) do
7:       w_I(k) = (V_{S_s^{OMP}} − V_{S_{s+k}^{OMP}}) / V_{S_s^{OMP}}
8:       Compute the threshold γ(k) using (7)
9:     end for
10:    if w_I(k) < γ(k), ∀k then
11:      break
12:    else
13:      continue
14:    end if
15:  end for
16:  Estimated true sparsity: k̂_0 = s
17:  Outputs: Estimated true support Ŝ_0 = S_s^{OMP}; estimated parameter vector x̂_{Ŝ_0} = (A_{Ŝ_0}^T A_{Ŝ_0})^{-1} A_{Ŝ_0}^T y
18: end procedure

4. SIMULATION RESULTS

In this section, simulation results are provided to evaluate the performance of MBT. The performance is measured in terms of the probability of correct model selection versus two varying parameters: (1) the number of measurements n, and (2) the dimension of the parameter space p. Since a high-dimensional scenario is considered, p > n in all cases. The design matrix A ∈ R^{n×p} is generated with independent entries following the normal distribution N(0, 1). The columns of A are normalized to have unit Euclidean (l_2) norm. Since the parameter vector x is assumed to be sparse, the true support cardinality is fixed at |S_0| = k_0 = 5. The non-zero entries of x are taken as x_i = 1, where the indices i are drawn uniformly at random from {1, ..., p}. The SNR in dB is SNR(dB) = 10 log_10(σ_s^2 / σ^2), where σ_s^2 and σ^2 denote the signal and noise power, respectively. The signal power is computed as σ_s^2 = ||Ax||^2 / n. Based on σ_s^2 and the chosen SNR(dB), the noise power is set as σ^2 = σ_s^2 / 10^{SNR(dB)/10}. The probability of correct model selection is estimated over 1000 Monte Carlo trials. To maintain randomness in the data, a new design matrix A is generated at each Monte Carlo trial. For OMP we set k_max = 20.
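For reference, the data for one Monte Carlo trial under this setup can be generated as follows (a sketch; the SNR-to-noise-power conversion follows the formulas above and the variable names are illustrative).

import numpy as np

rng = np.random.default_rng()
n, p, k0, snr_db = 60, 300, 5, 5

# design matrix with i.i.d. N(0,1) entries and unit-norm columns
A = rng.standard_normal((n, p))
A /= np.linalg.norm(A, axis=0)

# sparse coefficient vector: k0 ones at positions drawn uniformly at random
x = np.zeros(p)
S0 = rng.choice(p, size=k0, replace=False)
x[S0] = 1.0

# noise power from the SNR definition: sigma^2 = sigma_s^2 / 10^(SNR/10)
sigma_s2 = np.linalg.norm(A @ x) ** 2 / n
sigma2 = sigma_s2 / 10 ** (snr_db / 10)

y = A @ x + np.sqrt(sigma2) * rng.standard_normal(n)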

Fig. 1 presents the probability of correct model selection versus the number of measurements n. For this case the parameters are SNR = 3 dB, p = 300 and k_0 = 5, and two different values of β are considered, β ∈ {0.95, 0.99}, to highlight the effect of β on the performance of MBT. It can be seen from the figure that, for the given setup and β = 0.95, the proposed MBT gives a higher probability of correct selection than EFIC, EBIC and RRT for lower values of n (< 120). On increasing n further, the maximum probability of correct selection for MBT (β = 0.95) settles at 0.95. In contrast, MBT with β = 0.99 achieves a maximum probability of 0.99 as n increases, but for lower numbers of measurements (n < 120) its performance degrades compared to EFIC and EBIC. The performance is also compared to OMP-oracle, i.e., OMP with a priori knowledge of the sparsity k_0, which is the best performance OMP can achieve.

Fig. 1. The probability of correct model selection versus n when SNR = 3 dB, p = 300, and k_0 = 5.

Fig. 2. The probability of correct model selection versus p when SNR = 5 dB, n = 60, and k_0 = 5.

Fig. 2 illustrates the probability of correct model selection versus the parameter dimension p for SNR = 5 dB, n = 60, k_0 = 5 and β ∈ {0.95, 0.99}. The performance of all the methods decreases as the parameter dimension p increases. However, MBT provides a higher probability of correct selection than EFIC, EBIC and RRT for larger model dimensions in this low-n, low-SNR setting.

5. CONCLUSION

In this paper, a novel model selection method called the multi-beta-test (MBT) is proposed that efficiently estimates the true support in a sparse high-dimensional linear regression model. Numerical simulations with synthetic data and performance comparisons with state-of-the-art methods show that MBT can provide performance similar to, or slightly better than, the existing methods. MBT involves a tuning parameter β, which needs to be adjusted to achieve optimal results. Formulating a way to set β automatically, in a data-driven fashion, is left as future work.


6. REFERENCES

[1] J. Ding, V. Tarokh, and Y. Yang, "Model selection techniques: An overview," IEEE Signal Processing Magazine, vol. 35, no. 6, pp. 16–34, Nov. 2018.

[2] H. Akaike, "A new look at the statistical model identification," IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, Dec. 1974.

[3] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[4] Y. Wang and Q. Liu, "Comparison of Akaike information criterion (AIC) and Bayesian information criterion (BIC) in selection of stock-recruitment relationships," Fisheries Research, vol. 77, no. 2, pp. 220–225, 2006.

[5] M. Stone, "Cross-validatory choice and assessment of statistical predictions," Journal of the Royal Statistical Society: Series B (Methodological), vol. 36, no. 2, pp. 111–133, 1974.

[6] P. Craven and G. Wahba, "Estimating the correct degree of smoothing by the method of generalized cross-validation," Numerische Mathematik, vol. 31, pp. 377–403, 1979.

[7] J. Chen and Z. Chen, "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, vol. 95, no. 3, pp. 759–771, 2008.

[8] K. W. Broman and T. P. Speed, "A model selection approach for the identification of quantitative trait loci in experimental crosses," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 64, no. 4, pp. 641–656, 2002.

[9] D. Siegmund, "Model selection in irregular problems: Applications to mapping quantitative trait loci," Biometrika, vol. 91, no. 4, pp. 785–800, 2004.

[10] A. Owrang and M. Jansson, "Model selection for high-dimensional data," in Proceedings of the 50th Asilomar Conference on Signals, Systems and Computers, IEEE, 2016, pp. 606–609.

[11] A. Owrang and M. Jansson, "A model selection criterion for high-dimensional linear regression," IEEE Transactions on Signal Processing, vol. 66, no. 13, pp. 3436–3446, 2018.

[12] T. T. Cai and L. Wang, "Orthogonal matching pursuit for sparse signal recovery with noise," IEEE Transactions on Information Theory, vol. 57, no. 7, pp. 4680–4688, 2011.

[13] S. Kallummil and S. Kalyani, "Signal and noise statistics oblivious orthogonal matching pursuit," in Proceedings of the 35th International Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research, Stockholm, Sweden, Jul. 2018, pp. 2429–2438, PMLR.

[14] S. Kay, Fundamentals of Statistical Signal Processing: Volume I, Estimation Theory, Prentice Hall, 1998.

[15] C. Walck, "Hand-book on statistical distributions for experimentalists," Report SUF-PFY/96-01, 1996.
