A Study of the DWO Approach
to Function Estimation at a Given Point:
Approximately Constant and
Approximately Linear Function Classes
Alexander Nazin, Jacob Roll, Lennart Ljung
Division of Automatic Control
Department of Electrical Engineering
Linköpings universitet, SE-581 83 Linköping, Sweden
WWW: http://www.control.isy.liu.se
E-mail: nazine@ipu.rssi.ru, roll,ljung@isy.liu.se
December 22, 2003
Report no.: LiTH-ISY-R-2578
Technical reports from the Control & Communication group in Linköping are available at http://www.control.isy.liu.se/publications.
Alexander Nazin∗, Jacob Roll†, Lennart Ljung†
Abstract

In this report, the Direct Weight Optimization (DWO) approach to function estimation is studied for two special function classes: the classes of approximately constant and approximately linear functions. These classes consist of functions whose deviation from a constant/affine function is bounded by a known constant. Upper and lower bounds for the asymptotic maximum MSE are given, some of which also hold in the non-asymptotic case.
1 Introduction
In what follows, we study particular problems of estimating an unknown univariate function f0 : [−0.5, +0.5] → R at a fixed point ϕ∗ ∈ [−0.5, +0.5] from a given data set {ϕ(t), y(t)}_{t=1}^N with

y(t) = f0(ϕ(t)) + e(t), t = 1, …, N (1)

where {e(t)}_{t=1}^N is a random sequence of uncorrelated, zero-mean variables with a known constant variance Ee²(t) = σ² > 0; for the sake of simplicity, we assume a Gaussian distribution for e(t). The main study here is devoted to the equidistant fixed design, i.e.,

ϕ(t) = −0.5 + t/N, t = 1, …, N. (2)

We also discuss the extension to the uniform random design, where the regressors ϕ(t) are i.i.d. random variables uniformly distributed on [−0.5, +0.5], with {e(t)}_{t=1}^N independent of {ϕ(t)}_{t=1}^N.
The DWO estimator f̂N(ϕ∗) is the linear estimator defined by

f̂N(ϕ∗) = Σ_{t=1}^N wt y(t) (3)
∗Institute of Control Sciences, Profsoyuznaya str., 65, 117997 Moscow, Russia, e-mail: nazine@ipu.rssi.ru
†Div. of Automatic Control, Linköping University, SE-58183 Linköping, Sweden, e-mail: roll, ljung@isy.liu.se
with the weights w = (w1, …, wN)ᵀ minimizing the MSE upper bound

UN(w) = σ² Σ_{t=1}^N wt² + M² (1 + Σ_{t=1}^N |wt|)² → min_w (4)

subject to certain constraints depending on a priori information about f0. See [6, 7, 8] and the references therein for further details. A solution to the optimization problem is denoted by w∗, and its components wt∗ are called the DWO-optimal weights. Consequently, the estimate

f̂N(ϕ∗) = Σ_{t=1}^N wt∗ y(t) (5)

is called the DWO-optimal one.
The objective here is twofold. For each particular problem, we first find the MSE minimax lower bound among arbitrary estimators. Then we study both the DWO-optimal weights wt∗ and the DWO-optimal MSE upper bound UN(w∗); the latter is then compared with the MSE minimax lower bound. As we will see, some of the results obtained here hold for a fixed number of observations N, while others are asymptotic, as N → ∞.
Remark 1.1. Note that (3) represents a non-parametric estimator, since the number of parameters N is not fixed but is in fact the number of samples. See, e.g., [2].
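The setting (1)–(3) can be sketched numerically as follows; the particular choices of f0, M, σ, N and the uniform weights below are illustrative assumptions of this sketch, not prescribed by the report.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (assumptions, not from the report)
N, sigma, M = 200, 0.5, 0.1
phi_star = 0.0

# Equidistant fixed design (2)
phi = -0.5 + np.arange(1, N + 1) / N

# An illustrative member of F0(M): deviation from the constant 1.0 is <= M
def f0(x):
    return 1.0 + M * np.sin(20.0 * x)

# Observations (1)
y = f0(phi) + sigma * rng.standard_normal(N)

# A linear (weighted-sum) estimate (3); here with admissible uniform weights
w = np.full(N, 1.0 / N)
f_hat = float(w @ y)
```

Any admissible weight vector w may be plugged in here; the DWO approach then selects w by minimizing the bound (4).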
2 Approximately Constant Functions f0

Suppose f0 : [−0.5, +0.5] → R belongs to the following class of approximately constant functions:

F0(M) = {f(ϕ) = θ + r(ϕ) | θ ∈ R, |r(ϕ)| ≤ M} (6)

with a given a priori constant M > 0.
2.1 Minimax Lower Bound
Let f̃N = f̃N(y1N) be an arbitrary measurable function of the observation vector y1N = (y(1), …, y(N))ᵀ, that is, an arbitrary estimator.
Assertion 2.1. For any estimator f̃N,

sup_{f0∈F0(M)} E_{f0}(f̃N − f0(ϕ∗))² ≥ (2M)² + σ²/N (7)
Proof. Notice that for f0 ∈ F0(M) the observation model (1) reduces to

y(t) = f0(ϕ∗) + r̃(ϕ(t)) + e(t) (8)

with

r̃(ϕ(t)) = r(ϕ(t)) − r(ϕ∗), |r̃(ϕ(t))| ≤ 2M. (9)

Let q(·) denote the p.d.f. of N(0, σ²). Then the probability density of the observation vector y1N = (y(1), …, y(N))ᵀ is

p(y1N | f0) = Π_{t=1}^N q(y(t) − f0(ϕ(t))) = Π_{t=1}^N q(y(t) − f0(ϕ∗) − r̃(ϕ(t))) (10)

In other words, the initial problem is reduced to that of estimating a constant parameter θ1 = f0(ϕ∗) from its direct measurements (8), corrupted by both Gaussian and non-random but bounded noise. Furthermore,

sup_{f0∈F0(M)} E_{f0}(f̃N − f0(ϕ∗))² ≥ sup_{θ1} sup_{|r̃|≤2M} E_{θ1,r̃}(f̃N − θ1)² (11)

where the last supremum in the RHS is taken over all constant functions r̃(ϕ) ≡ r̃ bounded by 2M in absolute value, and the expectation therein is taken over the probability density (10) with θ1 = f0(ϕ∗) and r̃(ϕ) ≡ r̃. Applying the auxiliary Lemma A.1, we arrive at inequality (7).
2.2 DWO-Optimal Estimator
The constraints for the optimization problem (4) are here described by a single equality:

Σ_{t=1}^N wt = 1. (12)

Thus, the DWO-optimal weights do not depend on ϕ∗. Moreover, they are uniform, that is,

wt∗ = 1/N, t = 1, …, N. (13)
Assertion 2.2. The DWO-optimal upper bound for the function class (6) equals

UN(w∗) = (2M)² + σ²/N (14)

which coincides with the minimax lower bound (7).
Proof. The straightforward proof follows from (13) and (4).
This means that the DWO-optimal estimator is in fact minimax optimal among all admissible (even nonlinear) estimators for the problem considered here. Note that both (7) and (14) represent non-asymptotic results.
Remark 2.1. The obtained result (13) means that the arithmetic mean may be treated as the DWO-optimal estimator for the particular problem under consideration here.
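A quick numerical check of (13)–(14): plugging the uniform weights into the bound (4) gives exactly (2M)² + σ²/N. The values of N, M and σ below are illustrative.

```python
import numpy as np

def U_N(w, M, sigma):
    # MSE upper bound (4)
    return sigma**2 * np.sum(w**2) + M**2 * (1.0 + np.sum(np.abs(w)))**2

# Illustrative values (assumptions of this sketch)
N, M, sigma = 100, 0.2, 1.0

w_star = np.full(N, 1.0 / N)      # DWO-optimal uniform weights (13)
bound = U_N(w_star, M, sigma)     # equals (2M)^2 + sigma^2/N, cf. (14)
```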
Remark 2.2. It is easily seen that both the DWO-optimal upper bound and the MSE minimax lower bound remain the same for an arbitrary random design, as long as f0 ∈ F0(M). This is evident for the DWO-optimal upper bound, since both the function UN(w) in (4) and the constraint (12) do not depend on the regressors. The same is true for the MSE minimax lower bound, as may be observed both from the proof of Assertion 2.1 and from Lemma A.1. Thus, the results of this section hold for a random design as well.
3 Approximately Linear Functions f0

Again, we consider the problem of estimating f0(ϕ∗) for an unknown univariate function f0 : [−0.5, +0.5] → R, ϕ∗ ∈ [−0.5, +0.5], based on a given data set {ϕ(t), y(t)}_{t=1}^N with
y(t) = f0(ϕ(t)) + e(t), t = 1, …, N (15)

where {e(t)}_{t=1}^N is a random sequence of uncorrelated, zero-mean Gaussian variables with a known constant variance Ee²(t) = σ² > 0. First, we consider the equidistant design

ϕ(t) = −0.5 + t/N, t = 1, …, N. (16)

We now assume that f0 belongs to the class of approximately linear functions

F1(M) = {f(ϕ) = θᵀF(ϕ) + r(ϕ) | θ ∈ R², |r(ϕ)| ≤ M} (17)

with a given a priori constant M > 0 and

F(ϕ) = (1, ϕ)ᵀ. (18)

In other words, we assume that f0 may be an arbitrary function with the following property: there exists θ0 ∈ R² such that

|f0(ϕ) − θ0ᵀF(ϕ)| ≤ M ∀ϕ ∈ [−0.5, +0.5] (19)

with a given constant M > 0.
3.1 Minimax Lower Bound
Consider an arbitrary estimator f̃N = f̃N(y1N) for f0(ϕ∗), that is, an arbitrary measurable function of the observation vector y1N = (y(1), …, y(N))ᵀ.
Assertion 3.1. For any estimator f̃N,

sup_{f0∈F1(M)} E_{f0}(f̃N − f0(ϕ∗))² ≥ (2M)² + (σ²/N)(1 + 12ϕ∗²) (20)

Proof. Notice that for f0 ∈ F1(M) the observation model (15) reduces to

y(t) = θ1 + θ2 ϕ̃(t) + r̃(ϕ(t)) + e(t) (21)

with θ1 = f0(ϕ∗), θ2 ∈ R, ϕ̃(t) = ϕ(t) − ϕ∗, and

r̃(ϕ(t)) = r(ϕ(t)) − r(ϕ∗), |r̃(ϕ(t))| ≤ 2M. (22)

Let q(·) denote the p.d.f. of N(0, σ²). Then the probability density of the observation vector y1N = (y(1), …, y(N))ᵀ is

p(y1N | f0) = Π_{t=1}^N q(y(t) − f0(ϕ(t))) (23)
= Π_{t=1}^N q(y(t) − θ1 − θ2 ϕ̃(t) − r̃(ϕ(t))) (24)

In other words, the initial problem is reduced to that of estimating a constant parameter θ1 = f0(ϕ∗) from its direct measurements (21), corrupted by both the Gaussian noise e(t) and the non-random but bounded noise r̃(ϕ(t)). Furthermore,

sup_{f0∈F1(M)} E_{f0}(f̃N − f0(ϕ∗))² ≥ sup_θ sup_{|r̃|≤2M} E_{θ,r̃}(f̃N − θ1)² (25)

where the last supremum in the RHS is taken over all constant functions r̃(ϕ) ≡ r̃ bounded by 2M in absolute value, and the expectation therein is taken over the probability density (23) with θ1 = f0(ϕ∗) and r̃(ϕ) ≡ r̃. Applying the auxiliary Lemma A.2 with h = (1, 0)ᵀ, we arrive at inequality (20).
Remark 3.1. It is easily seen from the proof of Assertion 3.1 that if Lemma A.3 were applied instead of Lemma A.2, then the same MSE minimax lower bound (20) would be obtained for the uniform random design (and f0 ∈ F1(M)).
3.2 DWO-Optimal Estimator
Following the DWO approach [6], we now minimize the MSE upper bound (4) subject to the constraints

Σ_{t=1}^N wt = 1, Σ_{t=1}^N wt ϕ(t) = ϕ∗ (26)

The solution to this optimization problem, as well as its properties, turns out to depend on ϕ∗. Two different cases arise, which are studied separately below.
3.2.1 Positive DWO-Optimal Weights
Assertion 3.2. Assume

|ϕ∗| < 1/6 + O(N⁻¹), N → ∞. (27)

Then the DWO-optimal upper bound for the function class (17) equals

UN(w∗) = 4M² + (1 + 12ϕ∗²) σ² N⁻¹ + O(N⁻²) (28)

which asymptotically coincides with the minimax lower bound (20). Moreover, the DWO-optimal weights are

wt∗ = ((1 + 12ϕ∗ϕ(t))/N) (1 + O(N⁻¹)), t = 1, …, N. (29)

In particular, for ϕ∗ = 0 we arrive at asymptotically uniform weights

wt∗ = (1/N)(1 + O(N⁻¹)), t = 1, …, N (30)

(which correspond to the arithmetic mean estimator) with the DWO-optimal upper bound

UN(w∗) = 4M² + (σ²/N)(1 + O(N⁻¹)). (31)
Proof. The proof is based on the following auxiliary lemma which is proved in the Appendix.
Lemma 3.1. Suppose that the DWO-optimal solution w∗ in the sense of (4), (26) only contains positive components. Then the optimization problem (4), (26) is equivalent to the following one:

Σ_{t=1}^N wt² → min_w (32)

subject to the constraints

Σ_{t=1}^N wt = 1, Σ_{t=1}^N wt ϕ(t) = ϕ∗ (33)

Moreover, the converse statement holds: if the solution wopt to the optimization problem (32)–(33) has only positive components, then w∗ = wopt.
Now, let us prove (28)–(31) under assumption (27). Based on Lemma 3.1, one needs to minimize ‖w‖₂² subject to the constraints (33). If the solution to the latter problem has only positive components, then it is indeed w∗ for the initial optimization problem (4), (26). Applying the Lagrange multiplier technique, we arrive at

wt∗ = λ + µ(ϕ(t) − ϕ∗), t = 1, …, N (34)

with

(λ, µ)ᵀ = [ N, Σ_{t=1}^N (ϕ(t) − ϕ∗) ; Σ_{t=1}^N (ϕ(t) − ϕ∗), Σ_{t=1}^N (ϕ(t) − ϕ∗)² ]⁻¹ (1, 0)ᵀ (35)
= (1/DN) ( Σ_{t=1}^N (ϕ(t) − ϕ∗)², −Σ_{t=1}^N (ϕ(t) − ϕ∗) )ᵀ (36)

and

DN = N Σ_{t=1}^N (ϕ(t) − ϕ∗)² − (Σ_{t=1}^N (ϕ(t) − ϕ∗))². (37)

Note that

Σ_{t=1}^N ϕ(t) = 1/2, Σ_{t=1}^N ϕ²(t) = (N² + 2)/(12N) = N/12 + O(N⁻¹) (38)

and, since DN is invariant under a common shift of the regressors,

DN = N Σ_{t=1}^N ϕ²(t) − (Σ_{t=1}^N ϕ(t))² = (N² − 1)/12. (39)

Thus, from (34)–(39) it follows that

wt∗ = (12/(N² − 1)) ( N/12 + O(N⁻¹) − (ϕ∗ + ϕ(t))/2 + N ϕ∗ ϕ(t) ) (40)

and we arrive at (29), which means positive weights wt∗ for all t = 1, …, N iff |ϕ∗| < 1/6 (asymptotically). Furthermore, from (33) and (34)–(36) it follows that

Σ_{t=1}^N (wt∗)² = λ = (1/DN) Σ_{t=1}^N (ϕ(t) − ϕ∗)² (41)

and straightforward calculations lead to the desired results (28)–(31).
Remark 3.2. The exact (non-asymptotic) DWO-optimal weights wt∗ depend linearly on ϕ(t), as directly seen from (34). Note also that the analytic study of this subsection could be carried out because, in the considered case, the DWO-optimal weights are all non-negative, which led to a simpler, equivalent optimization problem (32), (33), which also has a non-negative solution. The opposite case, when there are negative components in the solution of problem (32), (33), is more difficult for an explicit analytic treatment; it is considered below via approximating sums by integrals.
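The exact weights (34)–(37) are easy to evaluate numerically. The sketch below, with illustrative N and ϕ∗ satisfying |ϕ∗| < 1/6, verifies the constraints (26), the identity (41), and the closeness to the asymptotic form (29).

```python
import numpy as np

# Illustrative sizes with |phi_star| < 1/6, cf. (27)
N, phi_star = 400, 0.1
phi = -0.5 + np.arange(1, N + 1) / N        # equidistant design (16)

d = phi - phi_star
S1, S2 = d.sum(), (d**2).sum()
D_N = N * S2 - S1**2                        # (37)
lam, mu = S2 / D_N, -S1 / D_N               # (35)-(36)
w_star = lam + mu * d                       # exact optimal weights (34)

# Asymptotic form (29) for comparison
w_asym = (1.0 + 12.0 * phi_star * phi) / N
```

For these values all weights come out positive, so by Lemma 3.1 they solve the full problem (4), (26) as well.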
3.2.2 Some DWO-optimal weights are non-positive
The previous subsection considered the relatively simple case of positive DWO-optimal weights. In order to understand, at least on a qualitative level, what may happen when w∗ also contains non-positive components, let us introduce the piecewise constant kernel functions Kw : [−0.5, +0.5] → R which correspond to an admissible vector w:

Kw(ϕ) = Σ_{t=1}^N 1{ϕ_{t−1} < ϕ ≤ ϕ_t} N wt (42)

where ϕ0 = −0.5 and ϕt = ϕ(t). One may now apply the following representations for the sums in (4), (26):

Σ_{t=1}^N |wt| = ∫_{−0.5}^{0.5} |Kw(u)| du (43)

Σ_{t=1}^N wt² = (1/N) ∫_{−0.5}^{0.5} Kw²(u) du (44)

Σ_{t=1}^N wt = ∫_{−0.5}^{0.5} Kw(u) du (45)

Σ_{t=1}^N wt ϕ(t) = ∫_{−0.5}^{0.5} u Kw(u) du + O(N⁻¹) (46)
Thus, the initial optimization problem (4), (26) may asymptotically, as N → ∞, be rewritten in the form of the following variational problem:

UN(K) = (σ²/N) ∫_{−0.5}^{0.5} K²(u) du + M² (1 + ∫_{−0.5}^{0.5} |K(u)| du)² → min_K (47)

subject to the constraints

∫_{−0.5}^{0.5} K(u) du = 1, (48)

∫_{−0.5}^{0.5} u K(u) du = ϕ∗. (49)

The minimization in (47) is meant to be over the admissible set D, that is, the set of all piecewise continuous functions K : [−0.5, +0.5] → R meeting the constraints (48), (49).
It is easily seen from (47) that asymptotically, as N → ∞, the influence of the first summand in the RHS of (47) becomes negligible compared to the second one. Hence, we first need to minimize

U(2)(K) = ∫_{−0.5}^{0.5} |K(u)| du → min_{K∈D} (50)

Note that if the latter problem had a unique solution K∗, it might give a good approximation to w∗ (in a criterial sense), based on (42), that is,

wt∗ ≈ (1/N) K∗(ϕt). (51)

However, the solution to (50) is not unique: it is attained at any non-negative kernel K ∈ D. Indeed, for an admissible K,

1 = ∫_{−0.5}^{0.5} K(u) du ≤ ∫_{−0.5}^{0.5} |K(u)| du (52)

with equality (the RHS equal to 1) if and only if K(u) is non-negative on [−0.5, +0.5]. A useful example of such a kernel is the uniform kernel function
Kuni∗(u) = (1/(1 − 2ϕ∗)) 1{|u − ϕ∗| ≤ 1/2 − ϕ∗}. (53)

Here and below in the current subsection we assume, for concreteness, that 0 ≤ ϕ∗ < 1/2. It is straightforward to verify that Kuni∗ ∈ D and, denoting by U(1)(K) = ∫_{−0.5}^{0.5} K²(u) du the integral in the first summand of (47), that

U(1)(Kuni∗) = 1/(1 − 2ϕ∗). (54)
Let us compare this value U(1)(Kuni∗) with that of U(1)(K∗), where the DWO-optimal kernel is known for |ϕ∗| ≤ 1/6 to be

K∗(u) = (1 + 12ϕ∗u) 1{|u| ≤ 1/2}. (55)

The latter equation corresponds to (29) and may be obtained directly from (47)–(49) in a manner similar to that of the previous subsection. Thus,

U(1)(K∗) = 1 + 12ϕ∗². (56)
Figure 1 demonstrates the criterial difference U(1)(Kuni∗) − U(1)(K∗) between the DWO-suboptimal uniform kernel Kuni∗ and the DWO-optimal kernel K∗ (solid line); the dashed line represents the relative difference w.r.t. the optimal value, that is, (U(1)(Kuni∗) − U(1)(K∗)) / U(1)(K∗); both as functions of ϕ∗ ∈ [0, 1/6].

Figure 1: The difference U(1)(Kuni∗) − U(1)(K∗) between the suboptimal uniform-kernel and optimal-kernel values (solid), and its relative value w.r.t. the optimal one, (U(1)(Kuni∗) − U(1)(K∗)) / U(1)(K∗) (dashed); both as functions of ϕ∗ ∈ [0, 1/6].
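The two curves behind Figure 1 can be reproduced directly from (54) and (56); the grid below is an illustrative discretization of ϕ∗ ∈ [0, 1/6].

```python
import numpy as np

# The constants compared in Figure 1, on a grid of phi_star in [0, 1/6]
phi_grid = np.linspace(0.0, 1.0 / 6.0, 101)
U1_uni = 1.0 / (1.0 - 2.0 * phi_grid)       # (54), uniform kernel
U1_opt = 1.0 + 12.0 * phi_grid**2           # (56), optimal kernel

diff = U1_uni - U1_opt                      # solid curve in Figure 1
rel = diff / U1_opt                         # dashed curve in Figure 1
```

As expected, the difference vanishes at ϕ∗ = 0 (both kernels are uniform there) and grows to 3/2 − 4/3 = 1/6 at ϕ∗ = 1/6.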
Let us now go further in studying the asymptotics of problem (47)–(49). In what follows in the current subsection, it is assumed that 1/6 < ϕ∗ < 1/2. Let D1∗ denote the set of all solutions to problem (50). One may then further minimize the first summand of the RHS in (47), that is,

U(1)(K) = ∫_{−0.5}^{0.5} K²(u) du → min_{K∈D1∗} (57)
Remark 3.3. In other words, a similar approach was applied in the previous subsection, where the first minimization problem (50) was easily resolved, since all the admissible kernel functions considered there were positive.
Based on Theorem 2 from [5], the solution to (57) should be sought among the non-negative functions of the form

K(u) = (λ1 + µ1 u) 1{a ≤ u ≤ 1/2} (58)

with parameters λ1, µ1 and a ∈ (−0.5, ϕ∗) which meet the following continuity equation:

λ1 + µ1 a = 0 (59)
In order to reduce the problem under consideration to that of the previous subsection, let us introduce a linear transformation of the interval [a, +0.5] onto [−0.5, +0.5] as follows:

u = ∆ + h t, t ∈ [−0.5, +0.5] (60)

with

∆ = a/2 + 1/4, h = 1/2 − a. (61)

Thus, if the linear and non-negative kernel K is admissible, i.e. K ∈ D, then

K1(t) = h K(∆ + h t) (62)

also represents a linear non-negative kernel and meets the following constraints:

∫_{−0.5}^{0.5} K1(t) dt = 1 (63)

∫_{−0.5}^{0.5} t K1(t) dt = (ϕ∗ − ∆)/h (64)

Moreover,

∫_{−0.5}^{0.5} K1²(t) dt = h ∫_a^{0.5} K²(u) du (65)
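The change of variables (60)–(65) can be checked numerically. The kernel below is an illustrative admissible linear kernel of the form (58)–(59) (it corresponds to the choice ϕ∗ = 1/4), and the trapezoidal-rule helper is an assumption of this sketch.

```python
import numpy as np

def integral(f, x):
    # trapezoidal rule
    return float(np.sum((f[1:] + f[:-1]) * np.diff(x) / 2.0))

# Illustrative choice: phi_star = 1/4, kernel (58)-(59) vanishing at u = a
phi_star = 0.25
a = 3.0 * phi_star - 1.0                    # = -1/4, cf. (72)
Delta, h = a / 2.0 + 0.25, 0.5 - a          # (61)

def K(u):
    slope = 2.0 / h**2                      # normalizes the integral over [a, 1/2] to 1
    return np.where((u >= a) & (u <= 0.5), slope * (u - a), 0.0)

t = np.linspace(-0.5, 0.5, 20001)
u = Delta + h * t                           # transformation (60)
K1 = h * K(u)                               # (62)

int_K1 = integral(K1, t)                    # (63): equals 1
mean_K1 = integral(t * K1, t)               # (64): equals (phi_star - Delta)/h
int_K1sq = integral(K1**2, t)               # LHS of (65)
int_Ksq = integral(K(u)**2, u)              # integral of K^2 over [a, 1/2]
```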
Thus, the optimization problem under consideration is indeed reduced to the following one:

∫_{−0.5}^{0.5} K1²(t) dt → min_{K1} (66)

with minimization among all positive kernels K1 subject to the constraints (63), (64). We also have to assume here that

(ϕ∗ − ∆)/h ≤ 1/6 (67)
in which case one may directly apply a solution similar to (29) from the previous subsection (that is, (55) in terms of kernel functions); it means that the optimal kernel is

K1∗(t) = (1 + 12 t (ϕ∗ − ∆)/h) 1{|t| ≤ 0.5} (68)

Also note that the structure of the solution (58) is automatically ensured through the minimization. Now, applying the inverse transformation w.r.t. (60)–(62), we finally arrive at the following optimal kernel:

K∗(u) = (1/h) K1∗((u − ∆)/h) (69)
= (1/h) (1 + 12 ((u − ∆)/h) ((ϕ∗ − ∆)/h)) 1{a ≤ u ≤ 0.5} (70)

The continuity equation (59) means that K1(−0.5) = 0, and it becomes

(ϕ∗ − ∆)/h = 1/6 (71)

which meets condition (67). From (71) and (61) it follows that

a = 3ϕ∗ − 1. (72)

It is interesting to notice that the parameter a goes from −0.5 to +0.5 as ϕ∗ goes from 1/6 to +0.5, as follows from (72). Now, the related optimal MSE upper bound becomes

UN(K∗) = 4M² + (σ²/N) · 8/(9(1 − 2ϕ∗)) (73)

which, in particular, coincides with (28) for ϕ∗ = 1/6, up to negligible terms. Thus, the following assertion is proved.
Assertion 3.3. Let 1/6 < ϕ∗ < 1/2. Then the asymptotically DWO-optimal kernel is

K∗(u) = (1/h) (1 + (2/h)(u − ∆)) 1{a ≤ u ≤ 0.5} (74)

with

h = (3/2)(1 − 2ϕ∗), ∆ = (6ϕ∗ − 1)/4, a = 3ϕ∗ − 1. (75)

The related DWO-optimal MSE upper bound is given by (73).
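A numerical check of Assertion 3.3 for an illustrative ϕ∗ ∈ (1/6, 1/2): the kernel (74)–(75) is non-negative, satisfies the constraints (48)–(49), and its squared integral equals the constant 8/(9(1 − 2ϕ∗)) appearing in (73). The trapezoidal-rule helper and grid size are assumptions of this sketch.

```python
import numpy as np

def integral(f, x):
    # trapezoidal rule
    return float(np.sum((f[1:] + f[:-1]) * np.diff(x) / 2.0))

phi_star = 0.3                                   # illustrative, in (1/6, 1/2)
h = 1.5 * (1.0 - 2.0 * phi_star)
Delta = (6.0 * phi_star - 1.0) / 4.0
a = 3.0 * phi_star - 1.0                         # (75)

u = np.linspace(-0.5, 0.5, 100001)
K = np.where(u >= a, (1.0 + (2.0 / h) * (u - Delta)) / h, 0.0)   # (74)

mass = integral(K, u)          # (48): equals 1
mean = integral(u * K, u)      # (49): equals phi_star
U1 = integral(K**2, u)         # equals 8/(9*(1 - 2*phi_star)), cf. (73)
```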
Figure 2 demonstrates U(1) for both the DWO-optimal (solid) and the uniform DWO-suboptimal (dashed) kernels, as functions of ϕ∗; their minimax lower bound 1 + 12ϕ∗² is represented by plus signs; the point related to ϕ∗ = 1/6 is marked by a star.

Figure 2: U(1) for the DWO-optimal (solid) and uniform DWO-suboptimal (dashed) kernels; their minimax lower bound 1 + 12ϕ∗² is represented by plus signs; the point related to ϕ∗ = 1/6 is marked by a star.
Remark 3.4. Theorem 2 from [5] indicates another possibility for the structure of an optimal kernel K∗, one that also contains a negative part. However, asymptotically (as N → ∞) this may not occur, since otherwise the main term of the MSE upper bound (47), namely the second summand of the RHS, is not minimized.
References
[1] A.V. Gol'denshlyuger and A.V. Nazin. Parameter estimation under random and bounded noises. Automation and Remote Control, 53(10, pt. 1):1536–1542, 1992.
[2] A. Juditsky, H. Hjalmarsson, A. Benveniste, B. Delyon, L. Ljung, J. Sjöberg, and Q. Zhang. Nonlinear black-box modeling in system identification: Mathematical foundations. Automatica, 31(12):1724–1750, 1995.
[3] V.Ya. Katkovnik and A.V. Nazin. Minimax lower bound for time-varying frequency estimation of harmonic signal. IEEE Trans. Signal Processing, 46(12):3235–3245, December 1998.
[4] A.S. Nemirovskii. Recursive estimation of parameters of linear plants. Automation and Remote Control, 42(4, pt. 6):775–783, 1981.
[5] Jacob Roll. Extending the direct weight optimization approach. Draft, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden, October 2003.
[6] Jacob Roll. Local and Piecewise Affine Approaches to System Identification. PhD thesis, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden, April 2003.
[7] Jacob Roll, Alexander Nazin, and Lennart Ljung. A non-asymptotic approach to local modelling. In The 41st IEEE Conference on Decision and Control, pages 638–643, December 2002.
[8] Jacob Roll, Alexander Nazin, and Lennart Ljung. Local modelling with a priori known bounds using direct weight optimization. In European Control Conference, Cambridge, September 2003.
A Appendix

A.1 Proof of Lemma 3.1
Proof. The converse statement is simple to demonstrate. Indeed, let D be the admissible set of vectors w defined by (33), and suppose that wtopt > 0 for all t = 1, …, N. Then for any w ∈ D,

UN(wopt) = σ² Σ_{t=1}^N (wtopt)² + 4M² (76)
≤ σ² Σ_{t=1}^N wt² + M² (1 + Σ_{t=1}^N wt)² (77)
≤ UN(w) (78)

Hence, w∗ = wopt due to the uniqueness of the minimizer, which follows from the strict convexity of the function UN(w).
Now assume that all wt∗ > 0. Let D+ denote the subset of admissible vectors w with all components positive. Obviously,

UN(w) = σ² Σ_{t=1}^N wt² + 4M², ∀w ∈ D+ (79)

and therefore w∗ represents the minimum point of

‖w‖₂² = Σ_{t=1}^N wt² (80)

among all vectors w ∈ D+. Note that w∗ is an interior point of the convex set D+ ⊂ D (w.r.t. the positivity constraints). Moreover, the function ‖w‖₂² is strictly convex. Hence, w∗ is the minimum point of ‖w‖₂² over D too. Thus, the lemma is proved.
A.2 Auxiliary Information Lower Bounds
Lemma A.1. Let θ̃N : R^N → R be an arbitrary estimator for θ, based on i.i.d. observations

y(t) = θ + r + e(t), t = 1, …, N (81)

with e(t) being Gaussian N(0, σ²) and |r| ≤ ε. Then

sup_θ sup_{|r|≤ε} E_{θ,r}(θ̃N − θ)² ≥ ε² + σ²/N (82)
Proof. Let τ = θ + r. The Fisher information for the Gaussian i.i.d. observations

y(t) = τ + e(t), t = 1, …, N, (83)

equals N/σ², as is well known. Now apply Theorem 1 from [1], a reduced version of which is reproduced below, under the above Gaussian i.i.d. assumptions for the observations (81), as
Proposition A.1. Let ρ > ε ≥ 0 and θ̃N : R^N → R be an arbitrary measurable function of the observation vector y1N = (y(1), …, y(N))ᵀ with the i.i.d. Gaussian components defined by (81). Then

sup_{|θ|≤ρ} sup_{|r|≤ε} E_{θ,r}(θ̃N − θ)² ≥ (ε² + σ²/N) / (1 + ((ρ − ε)√N/σ)⁻¹)² (84)

In order to end the proof of the lemma, observe that the LHS of (82) is a majorant for that of (84). Thus, letting ρ tend to infinity in (84), we arrive at (82).
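The bound (82) is in fact attained by the sample mean: its MSE under the model (81) equals r² + σ²/N, which is maximized at r = ±ε. The Monte Carlo sketch below (with illustrative N, σ, ε, θ and replication count) checks this.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes (assumptions of this sketch)
N, sigma, eps, theta = 50, 1.0, 0.3, 0.7
reps = 20000

r = eps                                                   # worst-case bounded noise
y = theta + r + sigma * rng.standard_normal((reps, N))    # model (81), repeated
theta_hat = y.mean(axis=1)                                # sample mean estimator
mse = float(np.mean((theta_hat - theta)**2))

bound = eps**2 + sigma**2 / N                             # RHS of (82)
```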
Below, the notation (x, y) for two real column vectors x, y stands for the inner product related to the Euclidean norm ‖·‖, that is, (x, y) = xᵀy. Note that the following lemma, as well as its proof, goes back to the arguments of Nemirovskii [4], which were further adapted in [1] to a particular problem of parameter estimation under both random and non-random but bounded noise; see also [3] and the references therein.
Lemma A.2. Let θ̃N : R^N → R² be an arbitrary estimator for θ = (θ1, θ2)ᵀ ∈ R², based on observations

y(k) = θᵀF(k) + r + e(k), k = 1, …, N (85)

with fixed regressors F(k) = (1, ϕ(k))ᵀ, ϕ(k) ∈ R, the noise e(k) being i.i.d. Gaussian N(0, σ²), and |r| ≤ ε. Then for any h = (h1, h2)ᵀ ∈ R², the following information inequality holds:

sup_θ sup_{|r|≤ε} E_{θ,r}(θ̃N − θ, h)² ≥ (εh1)² + hᵀ JN⁻¹ h (86)

with the Fisher information matrix

JN = (1/σ²) Σ_{k=1}^N F(k) Fᵀ(k) (87)

which is supposed to be invertible.
Proof. In order to make a rather long proof more readable, we present it in the form of sequential items.
1. Preliminary constructions. It suffices to assume h ≠ 0. To begin with, fix an arbitrary ρ > ε, and consider parameters θ belonging to the ball {‖θ‖ ≤ ρ}. Let

α = sup_{‖θ‖≤ρ} sup_{|r|≤ε} E_{θ,r}(θ̃N − θ, h)² (88)

which is assumed to be finite, without loss of generality. Obviously, for any sufficiently small µ > 0 there exists a modified estimator θ̄N, having a finite support in R^N and a norm bounded from above, for which

α + µ ≥ sup_{‖θ‖≤ρ} sup_{|r|≤ε} E_{θ,r}(θ̄N − θ, h)² (89)
Indeed, this modification might consist of the following two steps:

Step 1: Orthogonally project the vector θ̃N onto a ball (in R²) of sufficiently large radius C1 = C1(µ) < ∞;

Step 2: Introduce a sufficiently large constant C2 = C2(µ) < ∞ and define θ̄N to be zero outside the ball {‖y1N‖ ≤ C2} and equal to the result of the first step inside the ball.
2. Reducing to an auxiliary estimation problem. Let τ = (θ1 + r, θ2)ᵀ and notice that (85) may be represented as the observation model

y(k) = τᵀF(k) + e(k), k = 1, …, N, (90)

for which the Fisher information matrix

JN = Eτ [∇τ ln p(y1N | τ) ∇τᵀ ln p(y1N | τ)] (91)

for the Gaussian i.i.d. {e(k)}_{k=1}^N is well known to be represented by (87). Let us treat (θ̄N, h) as an estimator for (τ, h) and introduce the related bias function (for the model (90)), that is,

b(τ) = Eτ (θ̄N − τ, h) (92)
= ∫ (θ̄N(y1N) − τ, h) Π_{k=1}^N q(y(k) − τᵀF(k)) dy1N (93)

where q(·) stands for the p.d.f. of the Gaussian distribution N(0, σ²), as earlier. For our purposes it suffices (and is convenient) to consider τ belonging to the ball of radius ρ − ε, which is positive by construction. The reason is that if ‖τ‖ ≤ ρ − ε and |r| ≤ ε, then

‖θ‖ = ‖τ − (r, 0)ᵀ‖ ≤ ‖τ‖ + |r| ≤ ρ. (94)

Furthermore, one may see that, by the construction of the estimator θ̄N, the bias function b(τ) is continuously differentiable over the ball {‖τ‖ ≤ ρ − ε}; moreover, when differentiating b(τ), one may interchange the integral and the gradient, arriving at

∇b(τ) + h = ∫ (θ̄N(y1N) − τ, h) ∇τ Π_{k=1}^N q(y(k) − τᵀF(k)) dy1N (95)
3. Using the Cramér–Rao inequality. Thus, one may apply the well-known Cramér–Rao inequality

Eτ (θ̄N − τ, h)² ≥ b²(τ) + (∇b(τ) + h, JN⁻¹(∇b(τ) + h)) (96)

which for τ = θ + (r, 0)ᵀ leads to

E_{θ,r}(θ̄N − θ, h)² ≥ (b(τ) + r h1)² + (∇b(τ) + h, JN⁻¹(∇b(τ) + h)) (97)

and, together with (89), (94), to

α + µ ≥ (b(τ) + r h1)² + (∇b(τ) + h, JN⁻¹(∇b(τ) + h)) (98)

The latter inequality holds for any τ and r meeting the independent constraints ‖τ‖ ≤ ρ − ε and |r| ≤ ε. What is convenient in (98) is that only the first summand of the RHS depends on r. Maximizing the RHS of (98) over r subject to |r| ≤ ε and expanding the brackets in the second summand, we obtain

α + µ ≥ (|b(τ)| + ε|h1|)² + (h, JN⁻¹h) + (∇b(τ), JN⁻¹∇b(τ)) (99)
+ 2(∇b(τ), JN⁻¹h) (100)
4. Reducing to a particular trajectory inside the ball. Consider the information inequality (99), (100) on the particular part of the trajectory of the differential equation

τ̇ = −JN⁻¹h, τ(0) = 0, (101)

which goes inside the above-mentioned ball {‖τ‖ ≤ ρ − ε}. Denote the "time" variable in the differential equation (101) and in the related solution by t. The Fisher information matrix JN does not depend on τ here; hence, the solution to (101) (the particular part which we are interested in) is

τ(t) = −JN⁻¹h t, |t| ≤ T, (102)

with

T = (ρ − ε) ‖JN⁻¹h‖⁻¹. (103)

The idea is to integrate inequality (99), (100) along τ(t), |t| ≤ T. In order to do that, let us introduce the function

β(t) = b(τ(t)) (104)

and the constant

g = (JN⁻¹h, h) > 0 (105)

(which is indeed constant since JN does not depend on τ).
It follows from (101), (104) that

β̇(t) = −(∇b(τ), JN⁻¹h) (106)

Now apply the Cauchy–Schwarz inequality

(∇b(τ), JN⁻¹h)² ≤ (JN⁻¹∇b(τ), ∇b(τ)) (JN⁻¹h, h) (107)

which gives, for τ = τ(t),

(JN⁻¹∇b(τ(t)), ∇b(τ(t))) ≥ β̇²(t)/g. (108)

Thus, in view of (104)–(108), inequality (99), (100) leads to

α + µ ≥ (|β(t)| + ε|h1|)² + g + β̇²(t)/g − 2β̇(t) (109)

which may be resolved w.r.t. β̇(t), giving for all t, |t| ≤ T, the differential inequality

β̇(t) ≥ g − √g (α + µ − (|β(t)| + ε|h1|)²)^{1/2} (110)
≥ g − √g (α + µ − (εh1)²)^{1/2}. (111)
5. Integrating the information inequality. First observe that if the RHS of (111) is non-positive, then

α + µ ≥ (εh1)² + g. (112)

Consider the opposite situation, in which the RHS of (111) is positive. Then the function β(t) is increasing on the interval [−T, +T], and integrating (110), (111) leads to

2T (g − √g (α + µ − (εh1)²)^{1/2}) ≤ β(T) − β(−T) (113)
≤ 2 sup_{|t|≤T} |β(t)| (114)
≤ 2 (α + µ − (εh1)²)^{1/2}. (115)

The latter inequality (115) holds due to (109), from which it follows that

(|β(t)| + ε|h1|)² ≤ α + µ, |t| ≤ T, (116)

and hence |β(t)| ≤ (α + µ − (εh1)²)^{1/2}. Thus, (113)–(115) leads to

α + µ ≥ (εh1)² + g / (1 + (T√g)⁻¹)² (117)

Notice that (117) also follows from (112); therefore, in both cases one may proceed based on (117). One may simplify the denominator of the RHS of (117) somewhat, due to the expressions for T (103) and g (105) and the definition of the matrix operator norm. Indeed,

(T√g)⁻¹ = (ρ − ε)⁻¹ ‖JN⁻¹h‖ / (h, JN⁻¹h)^{1/2} ≤ (ρ − ε)⁻¹ ‖JN^{−1/2}‖ (118)

Substituting (118) into (117) and letting µ → +0 leads to

α ≥ (εh1)² + (h, JN⁻¹h) / (1 + (ρ − ε)⁻¹ ‖JN^{−1/2}‖)² (119)

Finally, letting ρ → +∞ and recalling the definition (88) of α leads directly to the desired inequality (86). Thus, Lemma A.2 is proved.
Remark A.1. As can be seen from the proof, Lemma A.2 may be naturally extended to non-Gaussian i.i.d. noise under additional regularity assumptions on its p.d.f. Note also that the proof uses no concrete form of the (non-random) regressors F(k), imposing only the assumption that the related Fisher information matrix is non-degenerate. An extension to the uniform random design is represented by Lemma A.3 below.
Lemma A.3. Let θ̃N : R^N → R² be an arbitrary estimator for θ = (θ1, θ2)ᵀ ∈ R², based on observations

y(k) = θᵀF(k) + r + e(k), k = 1, …, N (120)

with

1) regressors F(k) = (1, ϕ(k))ᵀ having random i.i.d. entries ϕ(k), uniformly distributed on the interval [−0.5, +0.5];

2) i.i.d. Gaussian random noise e(k) ∼ N(0, σ²);

3) {e(k)}_{k=1}^N and {ϕ(k)}_{k=1}^N independent;

4) finally, |r| ≤ ε.

Then, for any h = (h1, h2)ᵀ ∈ R², the following information inequality holds:

sup_θ sup_{|r|≤ε} E_{θ,r}(θ̃N − θ, h)² ≥ (εh1)² + hᵀ JN⁻¹ h (121)

with the Fisher information matrix

JN = (N/σ²) diag(1, 1/12) (122)
Proof. The proof is almost completely analogous to that of Lemma A.2. The only difference arises from the observation vector, which should now also include the random variables ϕ(k). Let

z1N = (y(1), ϕ(1), …, y(N), ϕ(N))ᵀ ∈ R^{2N} (123)

be the observation vector. When a combined parameter τ = (θ1 + r, θ2)ᵀ is introduced and the observation model (120) is reduced to

y(k) = τᵀF(k) + e(k), k = 1, …, N, (124)

the p.d.f. of the observations becomes

p(z1N | τ) = Π_{k=1}^N q(y(k) − τᵀF(k)) 1{|ϕ(k)| ≤ 0.5} (125)

with its Fisher information matrix

JN = Eτ [∇τ ln p(z1N | τ) ∇τᵀ ln p(z1N | τ)] (126)

represented by (122). All other items of the proof completely repeat those of Lemma A.2.
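The matrix (122) can be checked by Monte Carlo: for the uniform random design, E[F(k)Fᵀ(k)] = diag(1, 1/12), so that JN = (N/σ²) diag(1, 1/12). The draw count below is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)

draws = 1_000_000                       # illustrative Monte Carlo sample size
phi = rng.uniform(-0.5, 0.5, draws)
F = np.stack([np.ones(draws), phi])     # columns are F(k) = (1, phi(k))^T
EFF = (F @ F.T) / draws                 # estimates E[F(k) F(k)^T]

target = np.array([[1.0, 0.0], [0.0, 1.0 / 12.0]])
# Hence J_N = (N / sigma^2) * target, in agreement with (122).
```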