
A Study of the DWO Approach to Function Estimation at a Given Point: Approximately Constant and Approximately Linear Function Classes

Alexander Nazin (Institute of Control Sciences, Profsoyuznaya str. 65, 117997 Moscow, Russia),
Jacob Roll, Lennart Ljung (Division of Automatic Control, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden)

WWW: http://www.control.isy.liu.se
E-mail: nazine@ipu.rssi.ru, roll@isy.liu.se, ljung@isy.liu.se

December 22, 2003


Report no.: LiTH-ISY-R-2578

Technical reports from the Control & Communication group in Linköping are available at http://www.control.isy.liu.se/publications.


Abstract

In this report, the Direct Weight Optimization (DWO) approach to function estimation is studied for two special function classes: the classes of approximately constant and approximately linear functions. These classes consist of functions whose deviation from a constant/affine function is bounded by a known constant. Upper and lower bounds for the asymptotic maximum MSE are given, some of which also hold in the non-asymptotic case.

1  Introduction

In what follows, we study particular problems of estimating an unknown univariate function f0 : [−0.5, +0.5] → R at a fixed point ϕ∗ ∈ [−0.5, +0.5] from the given data set {ϕ(t), y(t)}_{t=1}^N with

y(t) = f0(ϕ(t)) + e(t) ,   t = 1, . . . , N ,        (1)

where {e(t)}_{t=1}^N is a random sequence of uncorrelated, zero-mean variables with a known constant variance E e²(t) = σ² > 0; for the sake of simplicity we assume a Gaussian distribution for e(t). The main study here is devoted to the equidistant fixed design, i.e.,

ϕ(t) = −0.5 + t/N ,   t = 1, . . . , N .        (2)

We also discuss the extension to the uniform random design, where the regressors ϕ(t) are i.i.d. random variables uniformly distributed on [−0.5, +0.5], and {e(t)}_{t=1}^N is independent of {ϕ(t)}_{t=1}^N.

The DWO estimator f̂_N(ϕ∗) is a linear estimator defined by

f̂_N(ϕ∗) = Σ_{t=1}^N w_t y(t)        (3)



with the weights w = (w_1, . . . , w_N)^T minimizing the MSE upper bound

U_N(w) = σ² Σ_{t=1}^N w_t² + M² ( 1 + Σ_{t=1}^N |w_t| )²  →  min_w        (4)

subject to certain constraints depending on a priori information about f0. See [6, 7, 8] and the references therein for further details. A solution to the optimization problem is denoted by w∗, and its components w_t∗ are called the DWO-optimal weights. Consequently, the estimate

f̂_N(ϕ∗) = Σ_{t=1}^N w_t∗ y(t)        (5)

is called the DWO-optimal one.

The objective here is twofold. For each particular problem, we first find the MSE minimax lower bound among arbitrary estimators. Then we study both the DWO-optimal weights w_t∗ and the DWO-optimal MSE upper bound U_N(w∗); the latter is then compared with the MSE minimax lower bound. As we will see, some of the results obtained here hold for a fixed number of observations N, while others are asymptotic, holding as N → ∞.

Remark 1.1. Note that (3) represents a non-parametric estimator, since the number of parameters, N, is not fixed but is in fact the number of samples. See, e.g., [2].
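As an aside, the weight optimization (4) is a convex program and can be solved numerically. The following minimal sketch (ours, not part of the report) sets it up with the cvxpy package for the equidistant design (2); the equality constraints used are those introduced later in (12) and (26), and all numerical values are arbitrary illustrative choices.

# Sketch: numerical solution of the DWO problem (4) under assumed constraints.
import numpy as np
import cvxpy as cp

N, sigma, M, phi_star = 100, 0.1, 0.05, 0.1
phi = -0.5 + np.arange(1, N + 1) / N              # equidistant design (2)

w = cp.Variable(N)
# MSE upper bound (4): sigma^2 * sum(w_t^2) + M^2 * (1 + sum(|w_t|))^2
objective = cp.Minimize(sigma**2 * cp.sum_squares(w)
                        + M**2 * cp.square(1 + cp.norm1(w)))
# constraints (26) of the approximately linear case; drop the second
# equality to obtain the approximately constant case (12)
constraints = [cp.sum(w) == 1, phi @ w == phi_star]
cp.Problem(objective, constraints).solve()

w_star = w.value                                   # DWO-optimal weights
U_star = objective.value                           # optimal upper bound U_N(w*)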

2  Approximately Constant Functions f0

Suppose f0 : [−0.5, +0.5] → R belongs to the following class of approximately constant functions

F0(M) = { f(ϕ) = θ + r(ϕ) | θ ∈ R, |r(ϕ)| ≤ M }        (6)

with a given a priori constant M > 0.

2.1  Minimax Lower Bound

Let f̃_N = f̃_N(y_1^N) be an arbitrary measurable function of the observation vector y_1^N = (y(1), . . . , y(N))^T, that is, an arbitrary estimator.

Assertion 2.1. For any estimator f̃_N,

sup_{f0 ∈ F0(M)} E_{f0} ( f̃_N − f0(ϕ∗) )² ≥ (2M)² + σ²/N .        (7)

Proof. Notice that for f0 ∈ F0(M) the observation model (1) reduces to

y(t) = f0(ϕ∗) + r̃(ϕ(t)) + e(t)        (8)

with

r̃(ϕ(t)) = r(ϕ(t)) − r(ϕ∗) ,   |r̃(ϕ(t))| ≤ 2M .        (9)

Let q(·) denote the p.d.f. of N(0, σ²). Then the probability density of the observation vector y_1^N = (y(1), . . . , y(N))^T is

p(y_1^N | f0) = Π_{t=1}^N q( y(t) − f0(ϕ(t)) ) = Π_{t=1}^N q( y(t) − f0(ϕ∗) − r̃(ϕ(t)) ) .        (10)

In other words, the initial problem is reduced to that of estimating a constant parameter θ1 = f0(ϕ∗) from its direct measurements (8), corrupted by both Gaussian and non-random but bounded noise. Furthermore,

sup_{f0 ∈ F0(M)} E_{f0} ( f̃_N − f0(ϕ∗) )² ≥ sup_{θ1} sup_{|r̃| ≤ 2M} E_{θ,r̃} ( f̃_N − θ1 )²        (11)

where the last supremum in the RHS is taken over all constant functions r̃(ϕ) ≡ r̃ bounded by 2M in absolute value, and the expectation therein is taken over the probability density (10) with θ1 = f0(ϕ∗) and r̃(ϕ) ≡ r̃. Applying the auxiliary Lemma A.1, we arrive at the inequality (7).

2.2  DWO-Optimal Estimator

The constraints for the optimization problem (4) are here described by the single equality

Σ_{t=1}^N w_t = 1 .        (12)

Thus, the DWO-optimal weights do not depend on ϕ∗. Moreover, they are uniform, that is,

w_t∗ = 1/N ,   t = 1, . . . , N .        (13)

Assertion 2.2. The DWO-optimal upper bound for the function class (6) equals

U_N(w∗) = (2M)² + σ²/N ,        (14)

which coincides with the minimax lower bound (7).

Proof. The straightforward proof follows from (13) and (4).

This means that the DWO-optimal estimator is in fact minimax optimal among all admissible (even nonlinear) estimators for the problem considered here. Note that both (7) and (14) represent non-asymptotic results.

Remark 2.1. The obtained result (13) means that the arithmetic mean may be treated as the DWO-optimal estimator for the particular problem under consideration here.

Remark 2.2. It is easily seen that both the DWO-optimal upper bound and the MSE minimax lower bound remain the same for an arbitrary random design, as long as f0 ∈ F0(M). This is evident for the DWO-optimal upper bound, since neither the function U_N(w) in (4) nor the constraint (12) depends on the regressors. The same is true for the MSE minimax lower bound, as may be observed from the proof of Assertion 2.1 and from Lemma A.1. Thus the results of this section hold for a random design as well.
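A small Monte Carlo sketch (ours, not from the report) illustrates that the arithmetic mean attains the bound (14): we pick a worst-case style member of F0(M) with r(ϕ(t)) = +M at every sample point and r(ϕ∗) = −M (our own illustrative choice of f0), and compare the empirical MSE with (2M)² + σ²/N.

# Sketch: empirical MSE of the arithmetic mean vs. the bound (14).
import numpy as np

rng = np.random.default_rng(0)
N, sigma, M, theta = 200, 0.1, 0.05, 1.0
n_runs = 20000

f_at_phi_star = theta - M                        # r(phi*) = -M
errs = np.empty(n_runs)
for i in range(n_runs):
    y = theta + M + sigma * rng.standard_normal(N)   # r(phi(t)) = +M at all samples
    errs[i] = y.mean() - f_at_phi_star               # arithmetic-mean estimate, cf. (13)

print(np.mean(errs**2))                          # empirical MSE
print((2 * M)**2 + sigma**2 / N)                 # bound (14)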


3  Approximately Linear Functions f0

Again we consider the problem of estimating f0(ϕ∗) for an unknown univariate function f0 : [−0.5, +0.5] → R, ϕ∗ ∈ [−0.5, +0.5], based on a given data set {ϕ(t), y(t)}_{t=1}^N with

y(t) = f0(ϕ(t)) + e(t) ,   t = 1, . . . , N ,        (15)

where {e(t)}_{t=1}^N is a random sequence of uncorrelated, zero-mean Gaussian variables with a known constant variance E e²(t) = σ² > 0. First, we consider the equidistant design

ϕ(t) = −0.5 + t/N ,   t = 1, . . . , N .        (16)

We now assume that f0 belongs to the class of approximately linear functions

F1(M) = { f(ϕ) = θ^T F(ϕ) + r(ϕ) | θ ∈ R², |r(ϕ)| ≤ M }        (17)

with a given a priori constant M > 0 and

F(ϕ) = ( 1  ϕ )^T .        (18)

In other words, we assume that f0 may be an arbitrary function with the following property: there exists θ0 ∈ R² such that

| f0(ϕ) − θ0^T F(ϕ) | ≤ M   ∀ ϕ ∈ [−0.5, +0.5]        (19)

with a given constant M > 0.

3.1  Minimax Lower Bound

Consider an arbitrary estimator f̃_N = f̃_N(y_1^N) for f0(ϕ∗), that is, an arbitrary measurable function of the observation vector y_1^N = (y(1), . . . , y(N))^T.

Assertion 3.1. For any estimator f̃_N,

sup_{f0 ∈ F1(M)} E_{f0} ( f̃_N − f0(ϕ∗) )² ≥ (2M)² + (σ²/N) ( 1 + 12 ϕ∗² ) .        (20)

Proof. Notice that for f0 ∈ F1(M) the observation model (15) reduces to

y(t) = θ1 + θ2 ϕ̃(t) + r̃(ϕ(t)) + e(t)        (21)

with θ1 = f0(ϕ∗), θ2 ∈ R, ϕ̃(t) = ϕ(t) − ϕ∗, and

r̃(ϕ(t)) = r(ϕ(t)) − r(ϕ∗) ,   |r̃(ϕ(t))| ≤ 2M .        (22)

Let q(·) denote the p.d.f. of N(0, σ²). Then the probability density of the observation vector y_1^N = (y(1), . . . , y(N))^T is

p(y_1^N | f0) = Π_{t=1}^N q( y(t) − f0(ϕ(t)) )        (23)
             = Π_{t=1}^N q( y(t) − θ1 − θ2 ϕ̃(t) − r̃(ϕ(t)) ) .        (24)

In other words, the initial problem is reduced to that of estimating a constant parameter θ1 = f0(ϕ∗) from its direct measurements (21), corrupted by both the Gaussian noise e(t) and the non-random but bounded noise r̃(ϕ(t)). Furthermore,

sup_{f0 ∈ F1(M)} E_{f0} ( f̃_N − f0(ϕ∗) )² ≥ sup_θ sup_{|r̃| ≤ 2M} E_{θ,r̃} ( f̃_N − θ1 )²        (25)

where the last supremum in the RHS is taken over all constant functions r̃(ϕ) ≡ r̃ bounded by 2M in absolute value, and the expectation therein is taken over the probability density (23) with θ1 = f0(ϕ∗) and r̃(ϕ) ≡ r̃. Applying the auxiliary Lemma A.2 with h = (1, 0)^T, we arrive at the inequality (20).

Remark 3.1. It is easily seen from the proof of Assertion 3.1 that if Lemma A.3 were applied instead of Lemma A.2, then the same MSE minimax lower bound (20) would be obtained for the uniform random design (and f0 ∈ F1(M)).

3.2  DWO-Optimal Estimator

Following the DWO approach [6], we now minimize the MSE upper bound (4) subject to the constraints

Σ_{t=1}^N w_t = 1 ,    Σ_{t=1}^N w_t ϕ(t) = ϕ∗ .        (26)

The solution to this optimization problem, as well as its properties, depends on ϕ∗. It turns out that two different cases arise, which are studied separately below.

3.2.1 Positive DWO-Optimal Weights

Assertion 3.2. Assume

|ϕ∗| < 1/6 + O(N⁻¹) ,   N → ∞ .        (27)

Then the DWO-optimal upper bound for the function class (17) equals

U_N(w∗) = 4M² + ( 1 + 12 ϕ∗² ) σ² N⁻¹ + O(N⁻²) ,        (28)

which asymptotically coincides with the minimax lower bound (20). Moreover, the DWO-optimal weights are

w_t∗ = [ (1 + 12 ϕ∗ ϕ(t)) / N ] ( 1 + O(N⁻¹) ) ,   t = 1, . . . , N .        (29)

In particular, for ϕ∗ = 0 we arrive at asymptotically uniform weights

w_t∗ = (1/N) ( 1 + O(N⁻¹) ) ,   t = 1, . . . , N        (30)

(which correspond to the arithmetic mean estimator), with the DWO-optimal upper bound

U_N(w∗) = 4M² + (σ²/N) ( 1 + O(N⁻¹) ) .        (31)

Proof. The proof is based on the following auxiliary lemma which is proved in the Appendix.

Lemma 3.1. Suppose that the DWO-optimal solution w∗ in the sense of (4), (26) only contains positive components. Then the optimization problem (4), (26) is equivalent to the following one:

Σ_{t=1}^N w_t²  →  min_w        (32)

subject to the constraints

Σ_{t=1}^N w_t = 1 ,    Σ_{t=1}^N w_t ϕ(t) = ϕ∗ .        (33)

Moreover, the converse statement holds: if the solution w^opt to the optimization problem (32)–(33) has only positive components, then w∗ = w^opt.

Now, let us prove (28)–(31) under assumption (27). Based on Lemma 3.1, one needs to minimize ‖w‖₂² subject to the constraints (33). If the solution to the latter problem has only positive components, then it is indeed w∗ for the initial optimization problem (4), (26). Applying the Lagrange function technique, we arrive at

w_t∗ = λ + µ ( ϕ(t) − ϕ∗ ) ,   t = 1, . . . , N ,        (34)

with

( λ, µ )^T = [ N ,  Σ_{t=1}^N (ϕ(t) − ϕ∗) ;  Σ_{t=1}^N (ϕ(t) − ϕ∗) ,  Σ_{t=1}^N (ϕ(t) − ϕ∗)² ]⁻¹ ( 1, 0 )^T        (35)
           = (1/D_N) ( Σ_{t=1}^N (ϕ(t) − ϕ∗)² ,  −Σ_{t=1}^N (ϕ(t) − ϕ∗) )^T        (36)

and

D_N = N Σ_{t=1}^N (ϕ(t) − ϕ∗)² − ( Σ_{t=1}^N (ϕ(t) − ϕ∗) )² .        (37)

Note that

Σ_{t=1}^N ϕ(t) = 1/2 ,    Σ_{t=1}^N ϕ²(t) = (N² + 2)/(12N) = N/12 + O(N⁻¹)        (38)

and

D_N = N Σ_{t=1}^N ϕ²(t) − ( Σ_{t=1}^N ϕ(t) )² = (N² − 1)/12 .        (39)

Thus, from (34)–(39) it follows that

w_t∗ = [ (12 + O(N⁻¹)) / N² ] [ N/12 + O(N⁻¹) − (ϕ∗ + ϕ(t))/2 + N ϕ∗ ϕ(t) ]        (40)


and we arrive at (29), which gives positive weights w_t∗ for all t = 1, . . . , N iff |ϕ∗| < 1/6. Furthermore, from (33) and (34)–(36) it follows that

Σ_{t=1}^N (w_t∗)² = λ = (1/D_N) Σ_{t=1}^N (ϕ(t) − ϕ∗)² ,        (41)

and straightforward calculations lead to the desired results (28)–(31).

Remark 3.2. The exact (non-asymptotic) DWO-optimal weights w_t∗ depend linearly on ϕ(t), as is directly seen from (34). Note also that the analytic study of this subsection was possible to carry out because, in the considered case, the DWO-optimal weights are all non-negative, which led to a simpler, equivalent optimization problem (32), (33), which also has a non-negative solution w_t∗. The opposite case, when there are negative components in the solution of problem (32), (33), is more difficult to treat analytically in explicit form; it is considered below via approximating sums by integrals.
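The exact weights (34)–(36) are easy to compute numerically. The sketch below (ours, not part of the report) evaluates them for the equidistant design and compares them with the leading term of the asymptotic expression (29); the numerical values are illustrative.

# Sketch: exact DWO-optimal weights (34)-(36) vs. the asymptotic form (29).
import numpy as np

N, phi_star = 200, 0.1                           # |phi*| < 1/6, so Lemma 3.1 applies
phi = -0.5 + np.arange(1, N + 1) / N             # equidistant design (2)

d = phi - phi_star
A = np.array([[N, d.sum()], [d.sum(), (d**2).sum()]])
lam, mu = np.linalg.solve(A, np.array([1.0, 0.0]))      # (35)-(36)
w_exact = lam + mu * d                                   # (34)

w_asympt = (1 + 12 * phi_star * phi) / N                 # leading term of (29)
print(np.abs(w_exact - w_asympt).max())                  # small discrepancy, O(N^-2) per weight
print(w_exact.sum(), w_exact @ phi)                      # constraints (26): ~1 and ~phi*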

3.2.2 Some DWO-Optimal Weights Are Non-Positive

The previous subsection considered the relatively simple case of positive DWO-optimal weights. In order to understand, at least on a qualitative level, what may happen when w∗ also contains non-positive components, let us introduce the piecewise constant kernel functions K_w : [−0.5, +0.5] → R which correspond to an admissible vector w:

K_w(ϕ) = Σ_{t=1}^N 1{ ϕ(t−1) < ϕ ≤ ϕ(t) } N w_t ,        (42)

where ϕ(0) = −0.5. Now one may apply the following representations for the sums in (4), (26):

Σ_{t=1}^N |w_t|    = ∫_{−0.5}^{0.5} |K_w(u)| du        (43)
Σ_{t=1}^N w_t²     = (1/N) ∫_{−0.5}^{0.5} K_w²(u) du        (44)
Σ_{t=1}^N w_t      = ∫_{−0.5}^{0.5} K_w(u) du        (45)
Σ_{t=1}^N w_t ϕ(t) = ∫_{−0.5}^{0.5} u K_w(u) du + O(N⁻¹)        (46)

Thus, the initial optimization problem (4), (26) may asymptotically, as N → ∞, be rewritten in the form of the following variational problem:

U_N(K) = (σ²/N) ∫_{−0.5}^{0.5} K²(u) du + M² ( 1 + ∫_{−0.5}^{0.5} |K(u)| du )²  →  min_K        (47)

subject to the constraints

∫_{−0.5}^{0.5} K(u) du = 1 ,        (48)
∫_{−0.5}^{0.5} u K(u) du = ϕ∗ .        (49)

The minimization in (47) is taken over the admissible set D, that is, the set of all piecewise continuous functions K : [−0.5, +0.5] → R meeting the constraints (48), (49).

It is easily seen from (47) that asymptotically, as N → ∞, the influence of the first summand in the RHS of (47) becomes negligible compared to the second one. Hence, we first need to minimize

U^{(2)}(K) = ∫_{−0.5}^{0.5} |K(u)| du  →  min_{K ∈ D} .        (50)

Note that if the latter problem had a unique solution K₂∗, then it might give a good approximation to w∗ (in a criterial sense), based on (42), that is,

w_t∗ ≈ (1/N) K₂∗(ϕ(t)) .        (51)

However, the solution to (50) is not unique, and it is attained at any non-negative kernel K ∈ D. Indeed, for an admissible K,

1 = ∫_{−0.5}^{0.5} K(u) du ≤ ∫_{−0.5}^{0.5} |K(u)| du        (52)

and the RHS exceeds 1 unless K(u) is non-negative on [−0.5, +0.5]. A useful example of such a kernel is the uniform kernel function

K_uni(u) = [ 1/(1 − 2ϕ∗) ] 1{ |u − ϕ∗| ≤ 1/2 − ϕ∗ } .        (53)

Here and below in the current subsection we assume, for concreteness, that 0 ≤ ϕ∗ < 1/2. It is straightforward to verify that K_uni ∈ D, and

U^{(1)}(K_uni) = 1/(1 − 2ϕ∗) ,        (54)

where U^{(1)}(K) = ∫_{−0.5}^{0.5} K²(u) du denotes the integral from the first summand of (47). Let us compare this value U^{(1)}(K_uni) with that of U^{(1)}(K∗), where the DWO-optimal kernel is known for |ϕ∗| ≤ 1/6 to be

K∗(u) = ( 1 + 12 ϕ∗ u ) 1{ |u| ≤ 1/2 } .        (55)

The latter equation corresponds to (29) and may be obtained directly from (47)–(49) in a manner similar to that of the previous subsection. Thus,

U^{(1)}(K∗) = 1 + 12 ϕ∗² .        (56)

Figure 1 demonstrates the criterial difference U^{(1)}(K_uni) − U^{(1)}(K∗) between the DWO-suboptimal uniform kernel K_uni and the DWO-optimal kernel K∗ (solid line); the dashed line represents the ratio of that difference to the optimal value, that is, ( U^{(1)}(K_uni) − U^{(1)}(K∗) ) / U^{(1)}(K∗); both are shown as functions of ϕ∗ ∈ [0, 1/6].

[Figure 1: The difference between the suboptimal uniform-kernel and the optimal-kernel constants (solid), U^{(1)}(K_uni) − U^{(1)}(K∗), and its value relative to the optimal constant (dashed), ( U^{(1)}(K_uni) − U^{(1)}(K∗) ) / U^{(1)}(K∗); both as functions of ϕ∗ ∈ [0, 1/6].]

Let us now go further in studying the asymptotics of the problem (47)–(49). In what follows in the current subsection, it is assumed that 1/6 < ϕ∗ < 1/2. Let D₁ denote the set of all solutions to problem (50). One may then further minimize the first summand of the RHS in (47), that is,

U^{(1)}(K) = ∫_{−0.5}^{0.5} K²(u) du  →  min_{K ∈ D₁} .        (57)

Remark 3.3. In other words, a similar approach was used in the previous subsection, where the first minimization problem (50) was easily resolved among all the positive admissible kernel functions.

Based on Theorem 2 from [5], the solution to (57) should be sought among the non-negative functions of the form

K(u) = ( λ1 + µ1 u ) 1{ a ≤ u ≤ 1/2 }        (58)

with parameters λ1, µ1 and a ∈ (−0.5, ϕ∗) which should meet the following continuity equation:

λ1 + µ1 a = 0 .        (59)

In order to reduce the problem under consideration to that of the previous subsection, let us introduce a linear transformation of the interval [a, +0.5] onto [−0.5, +0.5] as follows:

u = ∆ + h t ,   t ∈ [−0.5, +0.5] ,        (60)

with

∆ = a/2 + 1/4 ,   h = 1/2 − a .        (61)

Thus, if the linear and non-negative kernel K is admissible, i.e. K ∈ D, then

K1(t) = h K(∆ + h t)        (62)

also represents a linear non-negative kernel and meets the following constraints:

∫_{−0.5}^{0.5} K1(t) dt = 1 ,        (63)
∫_{−0.5}^{0.5} t K1(t) dt = (ϕ∗ − ∆)/h .        (64)

Moreover,

∫_{−0.5}^{0.5} K1²(t) dt = h ∫_a^{0.5} K²(u) du .        (65)

Thus, the optimization problem under consideration is indeed reduced to the following one:

∫_{−0.5}^{0.5} K1²(t) dt  →  min_{K1}        (66)

with minimization among all positive kernels K1 subject to the constraints (63), (64). We also have to assume here that

| (ϕ∗ − ∆)/h | ≤ 1/6 ,        (67)

in which case one may directly apply a solution similar to (29) from the previous subsection (that is, (55) in terms of kernel functions); it means that the optimal kernel is

K1∗(t) = ( 1 + 12 t (ϕ∗ − ∆)/h ) 1{ |t| ≤ 0.5 } .        (68)

Also note that the structure of the solution (58) is automatically ensured through the minimization. Now, applying the inverse transformation w.r.t. (60)–(62), we finally arrive at the following optimal kernel:

K∗(u) = (1/h) K1∗( (u − ∆)/h )        (69)
      = (1/h) ( 1 + 12 [(u − ∆)/h] [(ϕ∗ − ∆)/h] ) 1{ a ≤ u ≤ 0.5 } .        (70)

The continuity equation (59) means that K1∗(−0.5) = 0, and it becomes

(ϕ∗ − ∆)/h = 1/6 ,        (71)

which meets the condition (67). From (71) and (61) it follows that

a = 3ϕ∗ − 1 .        (72)

It is interesting to notice that the parameter a goes from −0.5 to +0.5 as ϕ∗ goes from 1/6 to +0.5, as follows from (72). Now, the related optimal MSE upper bound becomes

U_N(K∗) = 4M² + (σ²/N) · 8/( 9(1 − 2ϕ∗) ) ,        (73)

which coincides with (28) for ϕ∗ = 1/6, in particular, up to negligible terms. Thus, the following assertion is proved.


Assertion 3.3. Let 1/6 < ϕ∗ < 1/2. Then the asymptotically DWO-optimal kernel is

K∗(u) = (1/h) ( 1 + (2/h)(u − ∆) ) 1{ a ≤ u ≤ 0.5 }        (74)

with

h = (3/2)(1 − 2ϕ∗) ,   ∆ = (6ϕ∗ − 1)/4 ,   a = 3ϕ∗ − 1 .        (75)

The related DWO-optimal MSE upper bound is given by (73).
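The closed form (74)–(75) is easy to check numerically. The sketch below (ours, not part of the report) verifies, for one illustrative value of ϕ∗, that the kernel integrates to one, reproduces ϕ∗ as in (48)–(49), and yields the integral appearing in (73).

# Sketch: numerical check of the kernel (74)-(75) against (48), (49) and (73).
import numpy as np

phi_star = 0.3                                   # some value in (1/6, 1/2)
h = 1.5 * (1 - 2 * phi_star)                     # eq. (75)
Delta = (6 * phi_star - 1) / 4
a = 3 * phi_star - 1

edges = np.linspace(a, 0.5, 200001)              # midpoint rule on the support [a, 0.5]
u = 0.5 * (edges[:-1] + edges[1:])
du = edges[1] - edges[0]
K = (1 / h) * (1 + (2 / h) * (u - Delta))        # eq. (74)

print(np.sum(K * du))                            # ~1,     constraint (48)
print(np.sum(u * K * du))                        # ~phi*,  constraint (49)
print(np.sum(K**2 * du), 8 / (9 * (1 - 2 * phi_star)))   # integral entering (73)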

Figure 2 demonstrates U^{(1)} for both the DWO-optimal (solid) and the uniform DWO-suboptimal (dashed) kernels, as functions of ϕ∗; their minimax lower bound constant 1 + 12ϕ∗² is represented by plus signs, and the point related to ϕ∗ = 1/6 is marked by a star.

[Figure 2: U^{(1)} for the DWO-optimal (solid) and uniform DWO-suboptimal (dashed) kernels; the minimax lower bound constant 1 + 12ϕ∗² is represented by plus signs; the point related to ϕ∗ = 1/6 is marked by a star.]
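The curves behind Figures 1 and 2 follow directly from (54), (56) and (73). A short sketch (ours) that evaluates them:

# Sketch: the quantities plotted in Figures 1 and 2.
import numpy as np

def U1_uniform(ps):
    # eq. (54): integral of K_uni^2 is 1 / (1 - 2*phi*)
    return 1.0 / (1.0 - 2.0 * ps)

def U1_optimal(ps):
    # eq. (56) for |phi*| <= 1/6 and the integral behind (73) for larger phi*
    return np.where(ps <= 1.0 / 6.0,
                    1.0 + 12.0 * ps**2,
                    8.0 / (9.0 * (1.0 - 2.0 * ps)))

ps = np.linspace(0.0, 0.45, 10)
print(U1_uniform(ps) - U1_optimal(ps))           # Figure 1 quantity (relevant on [0, 1/6])
print(U1_optimal(ps))                            # Figure 2, solid curve
print(1.0 + 12.0 * ps**2)                        # Figure 2, lower bound constant (pluses)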

Remark 3.4. Theorem 2 from [5] indicates another possibility for the structure of an optimal kernel K∗, containing a negative part as well. However, asymptotically (as N → ∞) this cannot occur, since otherwise the main term of the MSE upper bound (47), namely the second summand of its RHS, would not be minimized.


References

[1] A.V. Gol'denshlyuger and A.V. Nazin. Parameter estimation under random and bounded noises. Automation and Remote Control, 53(10, pt. 1):1536–1542, 1992.

[2] A. Juditsky, H. Hjalmarsson, A. Benveniste, B. Delyon, L. Ljung, J. Sjöberg, and Q. Zhang. Nonlinear black-box modeling in system identification: Mathematical foundations. Automatica, 31(12):1724–1750, 1995.

[3] V.Ya. Katkovnik and A.V. Nazin. Minimax lower bound for time-varying frequency estimation of harmonic signal. IEEE Trans. Signal Processing, 46(12):3235–3245, December 1998.

[4] A.S. Nemirovskii. Recursive estimation of parameters of linear plants. Automation and Remote Control, 42(4, pt. 6):775–783, 1981.

[5] Jacob Roll. Extending the direct weight optimization approach. Draft, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden, October 2003.

[6] Jacob Roll. Local and Piecewise Affine Approaches to System Identification. PhD thesis, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden, April 2003.

[7] Jacob Roll, Alexander Nazin, and Lennart Ljung. A non-asymptotic approach to local modelling. In The 41st IEEE Conference on Decision and Control, pages 638–643, December 2002.

[8] Jacob Roll, Alexander Nazin, and Lennart Ljung. Local modelling with a priori known bounds using direct weight optimization. In European Control Conference, Cambridge, September 2003.

A  Appendix

A.1  Proof of Lemma 3.1

Proof. The converse statement is very simple to demonstrate. Indeed, let D be the admissible set of vectors w defined by (33), and suppose that w_t^opt > 0 for all t = 1, . . . , N. Then, for any w ∈ D,

U_N(w^opt) = σ² Σ_{t=1}^N (w_t^opt)² + 4M²        (76)
           ≤ σ² Σ_{t=1}^N w_t² + M² ( 1 + Σ_{t=1}^N w_t )²        (77)
           ≤ U_N(w) .        (78)

Hence, w∗ = w^opt due to their uniqueness, which follows from the strict convexity of the function U_N(w).


Now assume that all w_t∗ > 0. Let D₊ denote the subset of admissible vectors w with all components positive. Obviously,

U_N(w) = σ² Σ_{t=1}^N w_t² + 4M² ,   ∀ w ∈ D₊ ,        (79)

therefore w∗ is the minimum point of

‖w‖₂² = Σ_{t=1}^N w_t²        (80)

among all vectors w ∈ D₊. Note that w∗ is an interior point of the convex set D₊ ⊂ D (w.r.t. the positivity constraints). Moreover, the function ‖w‖₂² is strictly convex. Hence, w∗ is the minimum point of ‖w‖₂² over D as well. Thus, the lemma is proved.

A.2  Auxiliary Information Lower Bounds

Lemma A.1. Let θ̃_N : R^N → R be an arbitrary estimator for θ, based on the i.i.d. observations

y(t) = θ + r + e(t) ,   t = 1, . . . , N ,        (81)

with e(t) Gaussian N(0, σ²) and |r| ≤ ε. Then

sup_θ sup_{|r| ≤ ε} E_{θ,r} ( θ̃_N − θ )² ≥ ε² + σ²/N .        (82)

Proof. Let τ = θ + r. The Fisher information for the Gaussian i.i.d. observations

y(t) = τ + e(t) ,   t = 1, . . . , N ,        (83)

equals N/σ², as is well known. Now apply Theorem 1 from [1], a reduced version of which is reproduced below under the above Gaussian i.i.d. assumptions for the observations (81).

Proposition A.1. Let ρ > ε ≥ 0 and let θ̃_N : R^N → R be an arbitrary measurable function of the observation vector y_1^N = (y(1), . . . , y(N))^T with the i.i.d. Gaussian components defined by (81). Then

sup_{|θ| ≤ ρ} sup_{|r| ≤ ε} E_{θ,r} ( θ̃_N − θ )² ≥ ε² + (σ²/N) / ( 1 + (ρ − ε)⁻¹ σ/√N )² .        (84)

In order to complete the proof of the lemma, observe that the LHS of (82) is a majorant of that of (84). Thus, letting ρ tend to infinity in (84), we arrive at (82).

Below, the notation (x, y) for two real column vectors x, y stands for the inner product related to the Euclidean norm ‖·‖, that is, (x, y) = x^T y. Note that the following lemma, as well as its proof, goes back to arguments by Nemirovskii [4], which were further adapted in [1] to a particular problem of parameter estimation under both random and non-random but bounded noise; see also [3] and the references therein.


Lemma A.2. Let θ̃_N : R^N → R² be an arbitrary estimator for θ = (θ1, θ2)^T ∈ R², based on the observations

y(k) = θ^T F(k) + r + e(k) ,   k = 1, . . . , N ,        (85)

with fixed regressors F(k) = ( 1  ϕ(k) )^T, ϕ(k) ∈ R, the noise e(k) being i.i.d. Gaussian N(0, σ²), and |r| ≤ ε. Then, for any h = ( h1  h2 )^T ∈ R², the following information inequality holds:

sup_θ sup_{|r| ≤ ε} E_{θ,r} ( θ̃_N − θ, h )² ≥ (ε h1)² + h^T J_N⁻¹ h        (86)

with the Fisher information matrix

J_N = (1/σ²) Σ_{k=1}^N F(k) F^T(k) ,        (87)

which is supposed to be invertible.

Proof. In order to make a rather long proof more readable, we present it as a sequence of items.

1. Preliminary constructions. It suffices to assume h ≠ 0. To begin with, fix an arbitrary ρ > ε, and consider parameters θ belonging to the ball {‖θ‖ ≤ ρ}. Let

α = sup_{‖θ‖ ≤ ρ} sup_{|r| ≤ ε} E_{θ,r} ( θ̃_N − θ, h )²        (88)

which is assumed to be finite, without loss of generality. Obviously, for any sufficiently small µ > 0 there exists a modified estimator θ̄_N, having finite support in R^N and a norm bounded from above, for which

α + µ ≥ sup_{‖θ‖ ≤ ρ} sup_{|r| ≤ ε} E_{θ,r} ( θ̄_N − θ, h )² .        (89)

Indeed, this modification might consist of the following two steps:

Step 1: Orthogonally project the estimate θ̃_N onto a ball (in R²) of sufficiently large radius C1 = C1(µ) < ∞;

Step 2: Introduce a sufficiently large constant C2 = C2(µ) < ∞ and define θ̄_N to be zero outside the ball {‖y_1^N‖ ≤ C2} and equal to the result of the first step inside the ball.

2. Reducing to an auxiliary estimation problem. Let τ = (θ1 + r, θ2)^T and notice that (85) may be represented as the observation model

y(k) = τ^T F(k) + e(k) ,   k = 1, . . . , N ,        (90)

for which the Fisher information matrix

J_N = E_τ [ ∇_τ ln p(y_1^N | τ) ∇_τ^T ln p(y_1^N | τ) ]        (91)

for the Gaussian i.i.d. {e(k)}_{k=1}^N is well known to be given by (87). Let us treat (θ̄_N, h) as an estimator of (τ, h) and introduce the related bias function (w.r.t. the model (90)), that is,

b(τ) = E_τ ( θ̄_N − τ, h )        (92)
     = ∫ ( θ̄_N(y_1^N) − τ, h ) Π_{k=1}^N q( y(k) − τ^T F(k) ) dy_1^N ,        (93)

where q(·) stands for the p.d.f. of the Gaussian distribution N(0, σ²), as before. For our purposes it suffices (and is convenient) to consider τ belonging to the ball of radius ρ − ε, which is positive by construction. The reason is that if ‖τ‖ ≤ ρ − ε and |r| ≤ ε, then

‖θ‖ = ‖τ − (r, 0)^T‖ ≤ ‖τ‖ + |r| ≤ ρ .        (94)

Furthermore, one may see that, by the construction of the estimator θ̄_N, the bias function b(τ) is continuously differentiable over the ball {‖τ‖ ≤ ρ − ε}; moreover, when differentiating b(τ) one may interchange the integral and the gradient, arriving at

∇b(τ) + h = ∫ ( θ̄_N(y_1^N) − τ, h ) ∇_τ Π_{k=1}^N q( y(k) − τ^T F(k) ) dy_1^N .        (95)

3. Using the Cramér-Rao inequality. Thus, one may apply the well-known Cramér-Rao inequality

E_τ ( θ̄_N − τ, h )² ≥ b²(τ) + ( ∇b(τ) + h, J_N⁻¹ (∇b(τ) + h) ) ,        (96)

which for τ = θ + (r, 0)^T leads to

E_{θ,r} ( θ̄_N − θ, h )² ≥ ( b(τ) + r h1 )² + ( ∇b(τ) + h, J_N⁻¹ (∇b(τ) + h) )        (97)

and, together with (89), (94), to

α + µ ≥ ( b(τ) + r h1 )² + ( ∇b(τ) + h, J_N⁻¹ (∇b(τ) + h) ) .        (98)

The latter inequality holds true for any τ and r meeting the independent constraints ‖τ‖ ≤ ρ − ε and |r| ≤ ε. What is convenient in (98) is that only the first summand of the RHS depends on r. Maximizing the RHS of (98) over r subject to |r| ≤ ε and expanding the brackets in the second summand, we obtain

α + µ ≥ ( |b(τ)| + ε|h1| )² + ( h, J_N⁻¹ h ) + ( ∇b(τ), J_N⁻¹ ∇b(τ) )        (99)
        + 2 ( ∇b(τ), J_N⁻¹ h ) .        (100)

4. Reducing to a particular trajectory inside the ball. Consider the information inequality (99), (100) on the particular part of the trajectory of the differential equation

τ̇ = −J_N⁻¹ h ,   τ(0) = 0 ,        (101)

which goes inside the above mentioned ball {‖τ‖ ≤ ρ − ε}. Denote the "time" variable in the differential equation (101) and in the related solution by t. The Fisher information matrix J_N does not depend on τ here; hence, the solution to (101) (the particular part we are interested in) is

τ(t) = −J_N⁻¹ h t ,   |t| ≤ T ,        (102)

with

T = (ρ − ε) ‖J_N⁻¹ h‖⁻¹ .        (103)

The idea is to integrate the inequality (99), (100) along τ(t), |t| ≤ T. In order to do that, let us introduce the function

β(t) = b(τ(t))        (104)

and the constant

g = ( J_N⁻¹ h, h ) > 0        (105)

(which is really constant since J_N does not depend on τ). It follows from (101), (104) that

β̇(t) = − ( ∇b(τ), J_N⁻¹ h ) .        (106)

Now apply the Cauchy-Schwarz inequality

( ∇b(τ), J_N⁻¹ h )² ≤ ( J_N⁻¹ ∇b(τ), ∇b(τ) ) ( J_N⁻¹ h, h ) ,        (107)

which gives, for τ = τ(t),

( J_N⁻¹ ∇b(τ(t)), ∇b(τ(t)) ) ≥ β̇²(t)/g .        (108)

Thus, in view of (104)–(108), the inequality (99), (100) leads to

α + µ ≥ ( |β(t)| + ε|h1| )² + g + β̇²(t)/g − 2 β̇(t) ,        (109)

which may be resolved w.r.t. β̇(t), giving for all t, |t| ≤ T, the differential inequality

β̇(t) ≥ g − √g √( α + µ − (|β(t)| + ε|h1|)² )        (110)
     ≥ g − √g √( α + µ − (ε h1)² ) .        (111)

5. Integrating the information inequality. First observe that if the RHS of (111) is non-positive, then

α + µ ≥ (ε h1)² + g .        (112)

Consider the opposite situation, in which the RHS of (111) is positive. Then the function β(t) is increasing on the interval [−T, +T], and integrating (110), (111) leads to

2T ( g − √g √( α + µ − (ε h1)² ) ) ≤ β(T) − β(−T)        (113)
                                   ≤ 2 sup_{|t| ≤ T} |β(t)|        (114)
                                   ≤ 2 √( α + µ − (ε h1)² ) .        (115)

The latter inequality (115) holds due to (109), from which it follows that

sup_{|t| ≤ T} |β(t)| ≤ √( α + µ − (ε h1)² ) .        (116)

Thus, (113)–(115) lead to

α + µ ≥ (ε h1)² + g / ( 1 + (T √g)⁻¹ )² .        (117)

Notice that (117) also follows from (112); therefore, one may proceed based on (117) in both cases. One may simplify the denominator of the RHS of (117) somewhat, due to the expressions for T (103) and g (105) and the definition of the matrix operator norm. Indeed,

(T √g)⁻¹ = (ρ − ε)⁻¹ ‖J_N⁻¹ h‖ / (h, J_N⁻¹ h)^{1/2} ≤ (ρ − ε)⁻¹ ‖J_N^{−1/2}‖ .        (118)

Substituting (118) into (117) and letting µ → +0 lead to

α ≥ (ε h1)² + (h, J_N⁻¹ h) / ( 1 + (ρ − ε)⁻¹ ‖J_N^{−1/2}‖ )² .        (119)

Finally, letting ρ → +∞ and recalling the definition of α in (88) lead directly to the desired inequality (86). Thus, Lemma A.2 is proved.

Remark A.1. As can be seen from the proof, Lemma A.2 may be naturally extended to non-Gaussian i.i.d. noise under additional regularity assumptions on its p.d.f. Note also that the proof uses no concrete form of the (non-random) regressors F(k), imposing only the assumption that the related Fisher information matrix is non-degenerate. An extension to the uniform random design is given by Lemma A.3 below.
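As a small illustration (ours, not part of the report), the Fisher information matrix (87) can be evaluated for the equidistant design (2); after normalization it is close to the matrix (122) stated in Lemma A.3 below for the uniform random design.

# Sketch: Fisher information matrix (87) for the equidistant design.
import numpy as np

N, sigma = 1000, 0.1
phi = -0.5 + np.arange(1, N + 1) / N             # equidistant design (2)
F = np.column_stack([np.ones(N), phi])           # rows F(k)^T = (1, phi(k))
J_N = (F.T @ F) / sigma**2                       # eq. (87)
print(J_N * sigma**2 / N)                        # approx. [[1, 0], [0, 1/12]], cf. (122)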

Lemma A.3. Let θ̃_N : R^{2N} → R² be an arbitrary estimator for θ = (θ1, θ2)^T ∈ R², based on the observations

y(k) = θ^T F(k) + r + e(k) ,   k = 1, . . . , N ,        (120)

with

1) regressors F(k) = ( 1  ϕ(k) )^T having i.i.d. random entries ϕ(k) uniformly distributed on the interval [−0.5, +0.5];

2) i.i.d. Gaussian random noise e(k) ∼ N(0, σ²);

3) {e(k)}_{k=1}^N and {ϕ(k)}_{k=1}^N independent;

4) finally, |r| ≤ ε.

Then, for any h = ( h1  h2 )^T ∈ R², the following information inequality holds:

sup_θ sup_{|r| ≤ ε} E_{θ,r} ( θ̃_N − θ, h )² ≥ (ε h1)² + h^T J_N⁻¹ h        (121)

with the Fisher information matrix

J_N = (N/σ²) [ 1  0 ; 0  1/12 ] .        (122)


Proof. The proof is almost completely analogous to that of Lemma A.2. The only difference arises from the observation vector, which now also includes the random variables ϕ(k). Let

z_1^N = ( y(1), ϕ(1), . . . , y(N), ϕ(N) )^T ∈ R^{2N}        (123)

be the observation vector. When a combined parameter τ = (θ1 + r, θ2)^T is introduced and the observation model (120) is reduced to

y(k) = τ^T F(k) + e(k) ,   k = 1, . . . , N ,        (124)

the p.d.f. of the observations becomes

p(z_1^N | τ) = Π_{k=1}^N q( y(k) − τ^T F(k) ) 1{ |ϕ(k)| ≤ 0.5 }        (125)

with its Fisher information matrix

J_N = E_τ [ ∇_τ ln p(z_1^N | τ) ∇_τ^T ln p(z_1^N | τ) ]        (126)

given by (122). All other items of the proof completely repeat those of Lemma A.2.
