ASYMPTOTIC POWER AND THE BENEFIT OF UNDER-MODELING IN CHANGE DETECTION

F. Gustafsson
Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden
fredrik@isy.liu.se

B.M. Ninness
Department of Electrical and Computer Engineering, University of Newcastle, NSW 2308 Callaghan, Australia

Work done whilst the first author was visiting the University of Newcastle. The first author gratefully acknowledges funding from the Swedish Institute and the Swedish Natural Science Research Council during his visit.

Keywords: Change detection, asymptotic analysis, under-modeling, hypothesis tests.

Abstract

It is well known from experience that low order models perform well in change detection problems even if the true system is quite complicated. By computing the asymptotic power function, it is shown here that common hypothesis tests proposed in the literature attain their maximum power for a low order model, under certain conditions on how much each parameter changes.

1 Introduction

It is well known that low order models are usually preferable to the full model in change detection. An example is speech signals, where a high order AR model is used for modeling but a second order AR model is sufficient for segmentation, see Section 11.1.3 in [3]. One advantage of using low order models is of course lower complexity. Another heuristic argument is that, since the variance of the model is proportional to the model order, a change can be detected faster at a constant significance level with a low order model. It is customary to fix the significance level, that is, the probability of false alarm, and try to maximize the power of the test or, equivalently, minimize the average delay for detection. The power is of course a function of the true change, and with low order models there are always some changes that cannot be detected. However, for large enough model orders the variance contribution will inevitably outweigh the contribution from the system change. These ideas were presented in [6] and [7], of which this work is a continuation.

There is a close link between this work and identification of transfer functions, where the trade-off between bias and variance is well known, see for instance [5]. In the cited work, an expression for the variance term is derived which is asymptotic in model order and number of data. Asymptotically, the variance term is proportional to the model order, and since the bias is decreasing in the model order this shows that the model order cannot be increased indefinitely. That is, a finite model order gives the smallest overall error in the estimate although the true system might be infinite dimensional. We will adopt a similar approach for analyzing the power of change detection tests.

The contribution is as follows. A regression model is used for describing the signal. An expression for the power is derived which is asymptotic in model order and number of data. This expression coincides with a lower bound on the power for finite order and finite number of data. The result implies that the model order cannot be arbitrarily large. Given a number of possible regressors, we derive an explicit rule for which regressors to include in the low order model to maximize the power of the test as a function of the parameter change magnitude. The result is especially easy to interpret for Finite Impulse Response (FIR) models.

2 Change detection setup

2.1 Data model

We will here suppose that the true system or signal can be described by a linear regression model
\[ \mathcal{S}: \quad y_t = \varphi_t^T \theta_0 + e_t. \tag{1} \]
Here $\varphi_t$ is the regression vector, $y_t$ the measurement, $\theta_i$ the parameter vector and $e_t$ white noise. A standard approach to change detection is to compare a nominal model $\mathcal{M}_0$ to one estimated in a sliding window, $\mathcal{M}_1$, in the following manner, see [1] and [2]:
\[
\begin{array}{lll}
\text{Data} & (\varphi_1, y_1), \ldots, (\varphi_{t-N}, y_{t-N}) & (\varphi_{t-N+1}, y_{t-N+1}), \ldots, (\varphi_t, y_t) \\
\text{Notation} & Y & Y \\
\text{System} & Y = \Phi\theta_0 + E & Y = \Phi\theta + E \\
\text{Model} & Y = \Phi T \eta_0 + V & Y = \Phi T \eta + V
\end{array}
\]

where it is assumed that $\theta_0$, and consequently also $\eta_0$, is known. The sliding window of size $N$ should be long enough to give an accurate model but as short as possible to minimize the time to detection, so the second model has a much larger uncertainty. That is the rationale for assuming that $\theta_0$ is known.

The change detection task is formalized as a hypothesis test. The full test is
\[ H_0: \text{No change}, \quad \theta = \theta_0 \tag{2} \]
\[ H_1: \text{Change}, \quad \theta \neq \theta_0. \tag{3} \]
The question is whether a reduced order test
\[ H_0: \text{No change}, \quad \eta = \eta_0 \tag{4} \]
\[ H_1: \text{Change}, \quad \eta \neq \eta_0 \tag{5} \]
gives higher power. The data transformation matrix $T$ is used to characterize the low order model. The parameter vectors $\theta_0$, $\theta_1$ ($d \times 1$) and $\eta$ ($m \times 1$) will not be of the same size, since $T$ is $d \times m$. Generally, the reduced order model $\mathcal{M}_1$ will have correlated noise in $V$. A key point is that the expected value of the estimated $\eta$ under $H_0$ can be calculated from the noise-free signal obtained from a simulation, $Y_0 = \Phi\theta_0$.

2.2 The power function

We will assume that a test statistic $V(T)$ for measuring the distance between $\mathcal{M}_0$ and $\mathcal{M}_1$ is given. The test statistic is a function of the data transformation matrix $T$. The conditional mean and variance of the test statistic are denoted
\[ \mathrm{E}[V(T \mid H_i)] = \mu_i \tag{6} \]
\[ \mathrm{Var}[V(T \mid H_i)] = \sigma_i^2. \tag{7} \]
The test statistic will increase after a change, so $\mu_1 > \mu_0$. The hypothesis test is supposed to be of the following form:
\[ \text{Decide } H_0 \text{ if } V(T) - \mu_0 < C\sigma_0 \tag{8} \]
where $C$ is a threshold depending on the confidence level of the test. The confidence level and power of the test are defined by
\[ \alpha(T) = 1 - P(V(T \mid H_0) - \mu_0 > C\sigma_0) \tag{9} \]
\[ \beta(T) = P(V(T \mid H_1) - \mu_0 > C\sigma_0). \tag{10} \]
Define the power function's argument (PFA) as
\[ \mathrm{PFA} \triangleq \frac{\mu_1 - \mu_0 - C\sigma_0}{\sigma_1}. \tag{11} \]

It can be motivated in two ways why the PFA should be as large as possible. Suppose first that we can show a normal conditional distribution of the test statistic,
\[ V(T \mid H_0) \in \mathrm{N}(\mu_0, \sigma_0^2) \tag{12} \]
\[ V(T \mid H_1) \in \mathrm{N}(\mu_1, \sigma_1^2). \tag{13} \]
The significance level and power of the test are then
\[ \alpha = 1 - F(-C), \qquad \beta(T) = F\left(\frac{\mu_1 - \mu_0 - C\sigma_0}{\sigma_1}\right). \]
Here $F$ is the cumulative distribution function of the standard normal distribution. Since $F$ is a monotone function of its argument, maximizing $\beta(T)$ is the same as maximizing the PFA.

A normal distribution of the test statistic requires asymptotic analysis. A non-asymptotic argument for maximizing the PFA can be derived from Chebyshev's inequality. For $\mu_1 - \mu_0 - C\sigma_0 > 0$,
\[
\begin{aligned}
1 - \beta(T) &= P(V(T \mid H_1) - \mu_0 < C\sigma_0) \\
&= P(\mu_0 - V(T \mid H_1) > -C\sigma_0) \\
&= P(\mu_1 - V(T \mid H_1) > \mu_1 - \mu_0 - C\sigma_0) \\
&\leq \frac{\sigma_1^2}{(\mu_1 - \mu_0 - C\sigma_0)^2} = \mathrm{PFA}^{-2}
\end{aligned}
\]
\[ \Rightarrow \quad \beta(T) \geq 1 - \mathrm{PFA}^{-2}. \tag{14} \]
This proves that a lower bound on $\beta(T)$ increases as the PFA increases.

The test statistics that will be examined are all $\chi^2$ distributed under $H_0$. A more logical test is therefore
\[ \text{Decide } H_0 \text{ if } V(T) < h \tag{15} \]
where $h$ is a threshold taken from the $\chi^2$ distribution. Using the same argument that leads to (14) now gives
\[ \beta(T) \geq 1 - \frac{\sigma_1^2}{(\mu_1 - h)^2}. \]
Since the $\chi^2(m)$ threshold for $m$ degrees of freedom is approximately $m + C\sqrt{2m}$ for some constant $C$, see Figure 4, the PFA should still be large.

Thus, we would like to have small $\sigma_0$, small $\sigma_1$ and large $\mu_1 - \mu_0$. The last term is upper bounded by the true unknown parameter change $\|\theta_0 - \theta_1\|$. The smaller the model order $m$, the harder it is to find real changes. On the other hand, the influence on $\mu_1 - \mu_0$ of each new element in the parameter vector becomes smaller and smaller as we increase the model order, while the variance quantities $\sigma_i$ increase. Thus, it is not unlikely that there will be a good trade-off between change and variance for a low order model.
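As a small numerical illustration of the two arguments above, the following sketch evaluates the PFA (11), the power $F(\mathrm{PFA})$ under the normal assumption (12)-(13), and the Chebyshev lower bound (14). The moment values and the function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function F."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def pfa(mu0: float, sigma0: float, mu1: float, sigma1: float, c: float) -> float:
    """Power function argument (11): (mu1 - mu0 - C*sigma0) / sigma1."""
    return (mu1 - mu0 - c * sigma0) / sigma1


def gaussian_power(p: float) -> float:
    """Power under the normal assumption (12)-(13): beta = F(PFA)."""
    return normal_cdf(p)


def chebyshev_power_bound(p: float) -> float:
    """Lower bound (14): beta >= 1 - PFA^-2 (only informative when PFA > 1)."""
    if p <= 0.0:
        return 0.0  # the Chebyshev argument needs a positive PFA
    return max(0.0, 1.0 - p ** -2)


# Hypothetical moments of the test statistic, for illustration only.
mu0, sigma0, c = 10.0, 4.0, 2.0
for mu1, sigma1 in [(20.0, 5.0), (30.0, 6.0), (40.0, 7.0)]:
    p = pfa(mu0, sigma0, mu1, sigma1, c)
    print(f"PFA={p:.3f}  F(PFA)={gaussian_power(p):.3f}  "
          f"Chebyshev bound={chebyshev_power_bound(p):.3f}")
```

Both quantities grow with the PFA, which is the only property the subsequent analysis relies on; the Chebyshev bound is much more conservative than the Gaussian value.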

2.3 Test statistics

The test statistic we will study is
\[ V(T) = \|Y_i - Y_0 + E\|^2_{Q(T)} \tag{16} \]
where $\|x\|^2_A = x'Ax$ and
\[ Q(T) = \Phi T (T'\Phi'\Phi T)^{-1} T'\Phi'. \tag{17} \]
Here $Y_i + E$ is the true signal $Y = \Phi\theta_i + E$ in the sliding window, and $Y_0$ the simulated signal $\Phi\theta_0$ corresponding to no change. The use of (16) can be motivated in three ways. First, the well-known GLR test [8], using a two-model approximation suggested in [1], leads to (16). Second, the divergence test in [2] also gives (16) as test statistic for this case with known measurement noise. The third approach, also leading to (16), is to use the normalized parameter norm $\|\eta_0 - \hat\eta\|^2_{P^{-1}}$. The least squares estimate $\hat\eta$ and $\eta_0$ (assuming $Y = \Phi\theta_0$) are computed from
\[ \hat\eta = (T'\Phi'\Phi T)^{-1} T'\Phi' Y \tag{18} \]
\[ \eta_0 = (T'\Phi'\Phi T)^{-1} T'\Phi'\Phi\theta_0. \tag{19} \]
It follows immediately that
\[ \mathrm{PFA} = \frac{(Y_1 - Y_0)'Q(T)(Y_1 - Y_0) - C\,\mathrm{Var}^{1/2}(E'Q(T)E)}{\mathrm{Var}^{1/2}\big(E'Q(T)E + 2(Y_1 - Y_0)'Q(T)E\big)}. \]

3 Asymptotic analysis

The asymptotic analysis consists of two steps. If the model order $m$ is large, the central limit theorem can be used to show that $V(T \mid H_i)$ is approximately Gaussian distributed. If the number of data is large, then $\Phi'\Phi \approx NR$ for some $R$. With these approximations, it is possible to maximize $\beta(T)$ with respect to $T$.

3.1 Asymptotic distribution

The asymptotic distributions of the test statistics will be derived here. The following lemma is needed.

Lemma 1 Let $A$ be a symmetric $N \times N$ projection matrix (that is, $AA = A$) of rank $m$ and $E$ an $N \times 1$ vector of independent and identically distributed Gaussian variables with zero mean and variance $\lambda$. Then
\[ E'AE + C'E \in \mathrm{AsN}(\lambda m,\ 2\lambda^2 m + \lambda C'C) \tag{20} \]
asymptotically in $m$, where the $N \times 1$ vector $C$ is arbitrary.

Proof: The last term is $\mathrm{N}(0, \lambda C'C)$. Let $A = U'DU$ be the SVD of $A$. Since $A$ is a projection matrix of rank $m$, the diagonal matrix $D$ has $m$ ones and $N - m$ zeros on the diagonal. Define $\bar E = UE$, which is a vector of Gaussian variables. Since $U$ is orthogonal, the elements $\bar e_t$ are uncorrelated and thus independent with variance $\lambda$. Then
\[ \frac{1}{\lambda} E'AE = \frac{1}{\lambda} \bar E' D \bar E = \sum_{i=1}^{m} \frac{\bar e_i^2}{\lambda} \in \chi^2(m). \]
Thus, the first term scaled with $1/\lambda$ is $\chi^2(m)$. The $\chi^2(m)$ distribution has mean $m$ and variance $2m$ and thus, according to the central limit theorem, is asymptotically $\mathrm{N}(m, 2m)$. The first and second terms are uncorrelated, since their product contains only third order products of Gaussian variables. Since they are asymptotically Gaussian they are thus asymptotically independent. Thus, their sum is also Gaussian and the result follows. $\Box$

For a non-symmetric distribution like the $\chi^2$ one, a standard rule of thumb is that $m > 25$ gives a good Gaussian approximation. The approximation becomes better when the variance contribution $\lambda C'C$ from the exact Gaussian term dominates $2\lambda^2 m$. Furthermore, we remark that the Gaussian assumption on $E$ is not too crucial. A similar expression can be derived for other symmetric distributions.

This lemma can now be applied to $V(T)$, whose expansion is $(Y_i - Y_0)'Q(T)(Y_i - Y_0) + 2(Y_i - Y_0)'Q(T)E + E'Q(T)E$, so that $\mu_0 = \lambda m$ and $\mu_1 = \lambda m + (Y_1 - Y_0)'Q(T)(Y_1 - Y_0)$. It gives
\[ \sigma_0^2 = 2\lambda^2 m \]
\[ \sigma_1^2 = 2\lambda^2 m + 4\lambda (Y_1 - Y_0)'Q(T)(Y_1 - Y_0). \]
The resulting PFA, assuming high model orders $m$, is
\[ \mathrm{PFA} = \frac{(Y_1 - Y_0)'Q(T)(Y_1 - Y_0) - C\sqrt{2\lambda^2 m}}{\big(2\lambda^2 m + 4\lambda (Y_1 - Y_0)'Q(T)(Y_1 - Y_0)\big)^{1/2}}. \]
These are algebraic expressions in $T$, but it is still not clear how $T$ should be chosen to maximize the PFA. Further assumptions are needed.

3.2 Orthonormalization

We will use the assumption that the columns in $\Phi$ are quasi-stationary processes so
\[ \lim_{N \to \infty} \frac{1}{N} \Phi'\Phi = R \tag{21} \]
for some positive definite (assuming persistence of excitation) symmetric $R$. Now, the model can be orthonormalized by the transformation
\[ \bar\theta = R^{1/2}\theta \tag{22} \]
\[ \bar T = R^{1/2} T. \tag{23} \]
Here the square root is defined through $R^{1/2}R^{1/2} = R$. There is however no insight to gain by letting $N$ tend to
infinity, because it is easily realized that the PFA tends to infinity and thus the power function tends to 1 in that case. Using an abuse of notation assuming a large $N < \infty$, this transformation gives
\[ Y_i'Y_i = N \bar\theta_i'\bar\theta_i \tag{24} \]
\[ Q(T) = \frac{1}{N} \Phi R^{-1/2} \bar T \bar T' R^{-1/2} \Phi' \tag{25} \]
\[ Y_i'Q(T)Y_i = N \bar\theta_i' \bar T \bar T' \bar\theta_i. \tag{26} \]
The power function argument is now simplified to
\[ \mathrm{PFA} = \frac{N(\bar\theta_1 - \bar\theta_0)'\bar T\bar T'(\bar\theta_1 - \bar\theta_0) - C\sqrt{2\lambda^2 m}}{\big(2\lambda^2 m + 4\lambda N(\bar\theta_1 - \bar\theta_0)'\bar T\bar T'(\bar\theta_1 - \bar\theta_0)\big)^{1/2}}. \tag{27} \]
If $N$ tends to infinity, the PFA tends to infinity no matter how small $(\bar\theta_1 - \bar\theta_0)'\bar T\bar T'(\bar\theta_1 - \bar\theta_0)$ is. That means that eventually any change that is visible in this term will be detected. To overcome this problem in asymptotic analysis, the local approach has been suggested [4]. Here it is assumed that the parameter change decreases with $N$. Suppose for instance that $\bar\theta_1 - \bar\theta_0 = N^{-1/4}\nu$ for a fixed vector $\nu$ and that $m$ is increasing as $O(N^\gamma)$, where $0.5 < \gamma < 1$; then the PFA will tend to a constant in $N$ and $m$,

[Equation (28): limiting value of the PFA under the local approach; the expression is not legible in the source.]

4 Choosing the model

Now $\bar T$ will be restricted to pick out a particular subset of the parameters $\bar\theta$. Let $k_i$, $i = 1, 2, \ldots, m$, be the indices used in the low order model and $k_i$, $i = m+1, m+2, \ldots, d$, be the indices not used. Then the power function argument is
\[ \mathrm{PFA} = \frac{N\sum_{i=1}^{m} \big(\bar\theta_1^{(k_i)} - \bar\theta_0^{(k_i)}\big)^2 - C\sqrt{2\lambda^2 m}}{\Big(2\lambda^2 m + 4\lambda N\sum_{i=1}^{m} \big(\bar\theta_1^{(k_i)} - \bar\theta_0^{(k_i)}\big)^2\Big)^{1/2}}. \tag{29} \]
Now, the test criterion is quite easy to understand. If the parameters after the change were known, we would first re-order the parameters in descending order of $(\bar\theta_1^{(k_i)} - \bar\theta_0^{(k_i)})^2$. The question is now only how many terms to include, that is, how to choose $m$. To find the breaking point where the power of the test no longer increases, the difference in PFA for model orders $m + 1$ and $m$ is computed. The following lemma is helpful.

Lemma 2 Consider the expression
\[ \frac{A - \sqrt{B}}{\sqrt{B + CA}} \tag{30} \]
where $A$, $B$, $C$ are positive numbers. Let $a$ and $b$ be small perturbations of $A$ and $B$, respectively. Then, for sufficiently small $a$, $b$, the change in (30) is positive iff
\[ a > \frac{b}{C} \cdot \frac{1 + \dfrac{C}{\sqrt{B}}}{1 + \dfrac{2B + C\sqrt{B}}{AC}}. \tag{31} \]

Proof: Follows from the Taylor expansion $\sqrt{1 + x} \approx 1 + x/2$. $\Box$

The following theorem is now straightforward.

Theorem 1 Assume that the parameter vector is ordered in descending order of the true change $|\bar\theta_1 - \bar\theta_0|$. Then, asymptotically in $m$ and $N$, the power function increases as long as
\[ \big(\bar\theta_1^{m+1} - \bar\theta_0^{m+1}\big)^2 > \frac{\lambda}{2N} \cdot \frac{1 + 2C\sqrt{2/m}}{1 + \dfrac{\lambda(m + C\sqrt{2m})}{N\sum_{i=1}^{m} (\bar\theta_1^{i} - \bar\theta_0^{i})^2}}. \tag{32} \]

Proof: Define
\[ A = \frac{N}{C}\sum_{i=1}^{m} \big(\bar\theta_1^{i} - \bar\theta_0^{i}\big)^2, \quad a = \frac{N}{C}\big(\bar\theta_1^{m+1} - \bar\theta_0^{m+1}\big)^2, \quad B = 2m\lambda^2, \quad b = 2\lambda^2, \]
and let the constant $C$ in (30) take the value $4\lambda C$, where the $C$ on the right is the test threshold. The PFA is then, up to the positive factor $C$, of the form (30), where $a$ and $b$ denote the changes in $A$ and $B$ when the model order is increased by one. Substituting $(A, B, C, a, b)$ in (31), noting that $A/a \geq m$ and $B/b = m$ so that $a$, $b$ are small, gives the desired result. $\Box$

Assuming that $\sum_{i=1}^{m} (\bar\theta_1^{i} - \bar\theta_0^{i})^2 \gg \lambda m/N$, (32) simplifies to
\[ \big(\bar\theta_1^{m+1} - \bar\theta_0^{m+1}\big)^2 > \frac{\lambda}{2N}\left(1 + \sqrt{\frac{8}{m}}\, C\right). \tag{33} \]
Thus, the larger the noise variance, the smaller the number of data, or the larger the confidence level, the smaller the model order that should be used. In the almost noise free case, the full model should be used, which is also the case when the number of data in the test window is large. The local approach gives the PFA (28) and the model order should be increased as long as

[Equation (34): lower limit on $(\bar\theta_1^{m+1} - \bar\theta_0^{m+1})^2$ under the local approach, in terms of $\lambda$, $N$ and $m$; the expression is not legible in the source.]

That is, the assumption of a decreasing change in $N$ leads to a considerably smaller model.

It is generally not clear how the transformed parameter change $\bar\theta_1^{i} - \bar\theta_0^{i}$ should be ordered. Here FIR models with white noise input are a revealing special case. Then, because $R = I$, $\bar\theta_1^{i} - \bar\theta_0^{i} = b_1^{i} - b_0^{i}$. The natural order is the time order. A common and logical assumption for exponentially stable systems is that the impulse response is exponentially bounded, $|b^i| < C a^i$. We can now compute a suboptimal model order as the solution of (33) with the change replaced by its exponential bound, which gives approximately
\[ m \approx \frac{\log(\lambda) - \log(4NC)}{\log(a)}. \]
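The selection rule of this section can be sketched numerically as follows. The sketch evaluates the PFA (29) for each model order, the inclusion limit on the right-hand side of (32) and its simplification (33), and compares the order obtained by applying the rule sequentially with the order that maximizes (29) directly. The exponentially decaying change profile, the threshold value `c = 2.0` and all function names are assumptions made for illustration; `n = 50` and `lam = 0.1` mirror the values used in Section 5.

```python
import math


def pfa_for_order(delta_sq, m, n, lam, c):
    """PFA (29) using the first m squared (transformed) parameter changes."""
    s = sum(delta_sq[:m])
    num = n * s - c * math.sqrt(2.0 * lam ** 2 * m)
    den = math.sqrt(2.0 * lam ** 2 * m + 4.0 * lam * n * s)
    return num / den


def inclusion_limit(delta_sq, m, n, lam, c):
    """Right-hand side of (32): the (m+1)-th squared change must exceed it."""
    s = sum(delta_sq[:m])
    num = 1.0 + 2.0 * c * math.sqrt(2.0 / m)
    den = 1.0 + lam * (m + c * math.sqrt(2.0 * m)) / (n * s)
    return lam / (2.0 * n) * num / den


def simplified_limit(m, n, lam, c):
    """Right-hand side of (33), for an accumulated change much larger than lam*m/n."""
    return lam / (2.0 * n) * (1.0 + math.sqrt(8.0 / m) * c)


def order_from_rule(delta_sq, n, lam, c):
    """Increase m as long as the next squared change exceeds the limit in (32)."""
    m = 1
    while m < len(delta_sq) and delta_sq[m] > inclusion_limit(delta_sq, m, n, lam, c):
        m += 1
    return m


def order_from_scan(delta_sq, n, lam, c):
    """Model order that maximizes the PFA (29) by direct evaluation."""
    orders = range(1, len(delta_sq) + 1)
    return max(orders, key=lambda m: pfa_for_order(delta_sq, m, n, lam, c))


# Hypothetical squared changes, already sorted in descending order.
delta_sq = [(0.5 * 0.7 ** k) ** 2 for k in range(1, 21)]
n, lam, c = 50, 0.1, 2.0

rule_m = order_from_rule(delta_sq, n, lam, c)
best_m = order_from_scan(delta_sq, n, lam, c)
print("order from rule (32):   ", rule_m)
print("order maximizing (29):  ", best_m)
```

Since (32) comes from a first-order perturbation argument, the two orders need not coincide exactly in general; the sketch is only meant to show how the rule would be applied when the sorted changes are known.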

5 A simulated example

A simulated example will be used to illustrate how the power varies with the model order. The data are generated by the "Åström" system under the no change hypothesis,
\[ y_t = \frac{q^{-1} + 0.5q^{-2}}{1 - 1.5q^{-1} + 0.7q^{-2}}\, u_t + \sqrt{\lambda}\, e_t. \]
The change is a shift in the phase angle of the complex pole pair from 0.46 to 0.4. The corresponding impulse responses $b_i(k)$ for $i = 0, 1$ are plotted in Figure 1.

Figure 1: Impulse response before and after a change for the system under consideration. (Plot title: Impulse response before and after change.)

Both the input and noise are white Gaussian noise with unit variance. A FIR model is used, so $R = I$ and $\bar\theta_i^k = b_i(k)$. The noise variance is $\lambda = 0.1$ and the number of data in the sliding window is $N = 50$. No prior information of the change was used and the parameters are enumerated in ascending order in (29). Figure 2 shows $(\bar\theta_1^m - \bar\theta_0^m)^2$ together with the acceptance limit (32). From this plot we predict that model order 10 or possibly 16 will give the global maximum of the power function. Furthermore, there will be a large increase in power when going from model order 4 up to 8 and a local minimum at 12.

Figure 2: The true squared parameter change and the lower bound for inclusion in the test. (Plot title: Variance/change trade-off; horizontal axis: m.)

Figure 3 shows a Monte Carlo simulation for 1000 noise realizations. The $\chi^2$ test (15) was used, which gives the desired confidence level without any asymptotic assumptions. The upper plot shows the chosen and obtained confidence level. The lower plot shows the asymptotic power function using (29) and the result from the Monte Carlo simulation. Qualitatively, they are very similar, with local maxima and minima where expected and a large increase between model orders 4 and 8. The power from the Monte Carlo simulation is however much smaller. This depends on the Gaussian approximation of a $\chi^2$ term. The right tail of a $\chi^2$ distribution is heavier than that of the corresponding Gaussian distribution with the same variance. This implies that the threshold $h$ in the $\chi^2$ test is larger than the Gaussian threshold $m + C\sqrt{2m}$, see Figure 4, so the actual power is smaller than expected.

Figure 3: Significance level (upper plot) and power (lower plot) from asymptotic expression and simulation as a function of the number of parameters included in the test. (Plot titles: Significance level of the test, Power of the test; horizontal axis: Model order.)

Figure 4: Thresholds computed using the exact $\chi^2$ and asymptotic Gaussian distributions. (Plot title: Threshold for V(T) using chi-2 and Gaussian tests; horizontal axis: Model order.)

6 Conclusions

The influence of the choice of model order in change detection has been considered. The change detection approach was based on a comparison between a nominal model and one estimated in a sliding window. It was motivated by Chebyshev's inequality, and also by an asymptotic assumption of normal distribution, that the power function increases with the PFA. The PFA corresponds to three different measures of model difference: GLR, the divergence test and the normalized parameter norm. For a large sliding window and high model order, the central limit theorem was used to simplify the PFA. As an important special case, consider an FIR model with impulse response $b_0(t)$ before and $b_1(t)$ after the change. If the input is white noise, then the model order should be increased as long as
\[ \big(b_1^{m+1} - b_0^{m+1}\big)^2 > \frac{\lambda}{2N}\left(1 + \sqrt{\frac{8}{m}}\, C\right). \]
If the rate of decay of the coefficients is known, then a suboptimal model order can be computed a priori. A Monte Carlo simulation for a particular example indicated that the asymptotic result is qualitatively very useful even for small model orders and short windows.

References

[1] U. Appel and A.V. Brandt. Adaptive sequential segmentation of piecewise stationary time series. Information Sciences, 29(1):27-56, 1985.
[2] M. Basseville and A. Benveniste. Sequential detection of abrupt changes in spectral characteristics of digital signals. IEEE Trans. on Information Theory, 29:709-724, 1983.
[3] M. Basseville and I.V. Nikiforov. Detection of abrupt changes: theory and application. Information and system science series. Prentice Hall, Englewood Cliffs, NJ, 1993.
[4] A. Benveniste, M. Basseville, and B.V. Moustakides. The asymptotic local approach to change detection and model validation. IEEE Transactions on Automatic Control, 32:583-592, 1987.
[5] L. Ljung. Asymptotic variance expressions for identified black-box transfer function models. IEEE Transactions on Automatic Control, 30:834-844, 1985.
[6] B.M. Ninness and G.C. Goodwin. Robust fault detection based on low order models. Proceedings of IFAC Safeprocess 1991 Symposium, Baden Baden, Germany, 1991.
[7] B.M. Ninness and G.C. Goodwin. Improving the power of fault testing using reduced order models. In Proceedings of the IEEE Conference on Control and Instrumentation, Singapore, February 1992.
[8] A.S. Willsky and H.L. Jones. A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Transactions on Automatic Control, pages 108-112, 1976.
