Technical report from Automatic Control at Linköpings universitet
Adaptive DWO Estimator of a Regression Function
Anatoli Juditski, Alexander Nazin, Jacob Roll, Lennart Ljung
Division of Automatic Control
E-mail: Anatoli.Iouditski@inrialpes.fr, nazine@ipu.rssi.ru, roll@isy.liu.se, ljung@isy.liu.se
14th June 2007
Report no.: LiTH-ISY-R-2794
Accepted for publication in Proc. NOLCOS 2004 - IFAC Symposium on Nonlinear Control Systems, Stuttgart
Address:
Department of Electrical Engineering Linköpings universitet
SE-581 83 Linköping, Sweden
WWW: http://www.control.isy.liu.se
Technical reports from the Automatic Control group in Linköping are available from http://www.control.isy.liu.se.
Abstract
We address a problem of non-parametric estimation of an unknown regression function f : [−1/2, 1/2] → R at a fixed point x0 ∈ (−1/2, 1/2) on the basis of observations (xi, yi), i = 1, …, n, such that yi = f(xi) + ei, where ei ∼ N(0, σ²) is unobservable, Gaussian i.i.d. random noise and xi ∈ [−1/2, 1/2] are given design points. Recently, the Direct Weight Optimization (DWO) method has been proposed to solve problems of this kind. The properties of the method have been studied for the case when the unknown function f is continuously differentiable with Lipschitz continuous derivative having an a priori known Lipschitz constant L. The minimax optimality and adaptivity with respect to the design have been established for the resulting estimator. However, in order to implement the approach, both L and σ need to be known. The subject of this submission is the study of an adaptive version of the DWO estimator which uses a data-driven choice of the method parameter L.
ADAPTIVE DWO ESTIMATOR OF A REGRESSION FUNCTION
Anatoli Juditsky∗, Alexander Nazin∗∗,¹, Jacob Roll∗∗∗, Lennart Ljung∗∗∗
∗ LMC-IMAG, BP 53, F-38041 Grenoble Cedex 9, France. Email: juditsky@inrialpes.fr
∗∗ Institute of Control Sciences, Russian Acad. Sci., Profsoyuznaya str., 65, 117997 Moscow, Russia. Email: nazine@ipu.rssi.ru
∗∗∗ Div. of Automatic Control, Linköping University, SE-58183 Linköping, Sweden. Emails: roll@isy.liu.se, ljung@isy.liu.se
Abstract: We address a problem of non-parametric estimation of an unknown regression function $f : [-1/2, 1/2] \to \mathbb{R}$ at a fixed point $x_0 \in (-1/2, 1/2)$ on the basis of observations $(x_i, y_i)$, $i = 1, \dots, n$, such that $y_i = f(x_i) + e_i$, where $e_i \sim N(0, \sigma^2)$ is unobservable, Gaussian i.i.d. random noise and $x_i \in [-1/2, 1/2]$ are given design points. Recently, the Direct Weight Optimization (DWO) method has been proposed to solve problems of this kind. The properties of the method have been studied for the case when the unknown function $f$ is continuously differentiable with Lipschitz continuous derivative having an a priori known Lipschitz constant $L$. The minimax optimality and adaptivity with respect to the design have been established for the resulting estimator. However, in order to implement the approach, both $L$ and $\sigma$ need to be known. The subject of this submission is the study of an adaptive version of the DWO estimator which uses a data-driven choice of the method parameter $L$.
Copyright © 2004 IFAC
Keywords: Non-parametric regression, Estimators, Adaptive algorithms, Mean-square error, Quadratic programming
1. INTRODUCTION
Consistent non-parametric estimation of a regression function $f : [-1/2, 1/2] \to \mathbb{R}$ based on its noisy observations
$$y_i = f(x_i) + e_i, \qquad i = 1, \dots, n,$$
at some given design points $x_i \in (-1/2, 1/2)$ is one of the basic problems for many applications, including non-linear system identification; see, e.g., (Ljung, 1999). Here the random noise $e_i$ is supposed to be i.i.d. with $\mathrm{E} e_i = 0$, $\mathrm{E} e_i^2 = \sigma^2$, $\sigma > 0$. A common approach to estimating $f(x_0)$ at a fixed point $x_0 \in (-1/2, 1/2)$ is to use a linear estimator
$$\hat f_n = \hat f_n(x_0) = \sum_{i=1}^n w_i y_i.$$
¹ Partially supported by the Swedish Royal Academy of Sciences via their research grant of 2003.
The problem then reduces to finding a good vector $w = (w_1, \dots, w_n)^T$ of weights $w_i = w_i(x_0; X_n)$, depending on the design $X_n = (x_1, \dots, x_n)$, which gives a small Mean-Square Error (MSE)
$$\mathrm{MSE}(\hat f_n, f) = \mathrm{E}\left[ \left( \hat f_n(x_0) - f(x_0) \right)^2 \,\middle|\, X_n \right]$$
over a given function class $\mathcal{F}$.
A classic family of weights is generated by kernel methods, where a kernel function $K$ and a bandwidth $h_n$ are used to determine the weights. Another widely used approach is local polynomial modelling, where the estimator is determined by locally fitting a polynomial to the given data; an appropriate kernel and bandwidth should also be determined here. See, e.g., (Fan and Gijbels, 1996) for both the details and further references.
Recently, in (Roll et al., 2003a) and (Roll et al., 2003b), the Direct Weight Optimization (DWO) method has been proposed to solve problems of the considered type; note that a similar approach was studied earlier in (Sacks and Ylvisaker, 1978). The main idea of DWO is to minimize the maximum MSE
$$R_n(\hat f_n) = \sup_{f \in \mathcal{F}} \mathrm{MSE}(\hat f_n, f)$$
or its “natural” upper bound $U_n(w)$. In particular, the properties of the method have been studied for the case when the unknown function $f$ is continuously differentiable with Lipschitz continuous derivative having an a priori known Lipschitz constant $L$. In this case, the upper bound on $R_n(\hat f_n)$ is a convex quadratic function of $w \in \mathbb{R}^n$, depending also on the parameters $\sigma$ and $L$. Moreover, it should be minimized subject to some linear constraints, and the problem reduces to a quadratic program (or to a cone program, in the multivariate case). A detailed study of the approach and simulation examples may be found in (Roll, 2003). In particular, minimax optimality and adaptivity with respect to the design have been established (for non-negative weights).
However, in order to implement the approach, both $L$ and $\sigma$ need to be known. The goal of this submission is to propose and study an adaptive version of the DWO estimator which uses a data-driven choice of the method parameter $L$. It turns out that the price for adaptivity to the unknown Lipschitz constant $L^*$ is a logarithmic factor in the MSE upper bound.
2. PROBLEM STATEMENT
Consider the problem of non-parametric estimation of the value $f(x_0)$ of an unknown function $f : [-1/2, 1/2] \to \mathbb{R}$ at a given point $x_0 \in (-1/2, 1/2)$, given a set of input-output pairs $\{(x_i, y_i)\}_{i=1}^n$ coming from the relation
$$y_i = f(x_i) + e_i, \qquad (1)$$
where $e_i \sim N(0, \sigma^2)$ is unobservable, i.i.d. Gaussian random noise and $x_i \in [-1/2, 1/2]$ are given design points (non-random, for the sake of simplicity); $\sigma > 0$ is supposed to be a priori known. The function $f$ is continuously differentiable with Lipschitz continuous derivative,
$$|f'(u) - f'(v)| \le L|u - v|, \qquad (2)$$
with the Lipschitz constant $L$ being a priori unknown; denote by $\mathcal{F}(L)$ the (non-parametric) class of all functions satisfying inequality (2). We consider the maximum mean-square error (MSE)
$$R_n(\hat f_n, L) = \sup_{f \in \mathcal{F}(L)} \mathrm{E}_f\{ (\hat f_n(x_0) - f(x_0))^2 \} \qquad (3)$$
as the risk of an estimator $\hat f_n$ over the function class $\mathcal{F}(L)$. Here and further on, the expectation $\mathrm{E}_f$ corresponds to the distribution of the observations (1) generated by the function $f$. Introduce
$$\tilde x_i = x_i - x_0. \qquad (4)$$
For an arbitrary linear estimator
$$\hat f_n = \sum_{i=1}^n w_i y_i, \qquad (5)$$
the following MSE upper bound holds true:
$$R_n(\hat f_n, L) \le U_n(w, L) = \left( \frac{L}{2} \sum_{i=1}^n |w_i|\, \tilde x_i^2 \right)^{\!2} + \sigma^2 \|w\|^2 \qquad (6)$$
with $\|\cdot\|$ standing for the Euclidean norm. Note that the last summand on the right-hand side of (6) represents the variance, and the first one the upper bound on the squared bias of the estimation error $\hat f_n(x_0) - f(x_0)$.
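For concreteness, the bound (6) is straightforward to evaluate numerically. A minimal sketch (our illustration; the function name and the toy design below are ours, not the paper's):

```python
def dwo_upper_bound(w, x, x0, L, sigma):
    """MSE upper bound U_n(w, L) of (6): squared-bias term plus variance term,
    with xt_i = x_i - x0 as in (4)."""
    bias_sq = (L / 2.0 * sum(abs(wi) * (xi - x0) ** 2 for wi, xi in zip(w, x))) ** 2
    variance = sigma ** 2 * sum(wi ** 2 for wi in w)
    return bias_sq + variance

# Quick check: uniform weights on an equidistant design in [-1/2, 1/2].
x = [-0.4, -0.2, 0.0, 0.2, 0.4]
w = [0.2] * 5                      # satisfies sum_i w_i = 1
print(dwo_upper_bound(w, x, x0=0.0, L=1.0, sigma=0.1))   # ≈ 0.0036
```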
The DWO approach to the given estimation problem with a priori known Lipschitz constant $L$ is to use estimator (5) with the weight vector $w = (w_1, \dots, w_n)^T$ being a solution to the following optimization problem:
$$U_n^*(L) = \min_{w \in \mathbb{R}^n} U_n(w, L) \qquad (7)$$
subject to the constraints
$$\sum_{i=1}^n w_i = 1, \qquad \sum_{i=1}^n w_i \tilde x_i = 0. \qquad (8)$$
Note that, computationally, this problem reduces to a quadratic program. Denote by $w^*(n, L)$ the minimizer of $U_n(w, L)$ in problem (7), (8). Thus, $U_n^*(L) = U_n(w^*(n, L), L)$. In what follows we extend the DWO approach to the case of an unknown Lipschitz constant $L$.
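As a rough illustration of how (7)-(8) can be attacked even without a dedicated QP package, the sketch below uses projected subgradient descent: a subgradient step on $U_n(w, L)$ followed by an exact projection onto the two linear constraints (8). This is our stand-in for the quadratic-programming solver the text refers to; all names and algorithmic choices here are our assumptions, not the paper's.

```python
def dwo_weights(x, x0, L, sigma, iters=10000):
    """Approximate the DWO weights w*(n, L) of problem (7)-(8) by projected
    subgradient descent (assumes at least two distinct design points)."""
    n = len(x)
    xt = [xi - x0 for xi in x]                    # xt_i = x_i - x0, as in (4)
    s1 = sum(xt)
    s2 = sum(t * t for t in xt)
    det = n * s2 - s1 * s1                        # det of A A^T for A = [1; xt]

    def project(w):
        # Exact projection onto {sum w = 1, sum w*xt = 0}:
        # w - A^T (A A^T)^{-1} (A w - b), with the 2x2 system solved by hand.
        r1 = sum(w) - 1.0
        r2 = sum(wi * t for wi, t in zip(w, xt))
        lam1 = (s2 * r1 - s1 * r2) / det
        lam2 = (n * r2 - s1 * r1) / det
        return [wi - lam1 - lam2 * t for wi, t in zip(w, xt)]

    w = project([0.0] * n)                        # feasible starting point
    for k in range(1, iters + 1):
        S = sum(abs(wi) * t * t for wi, t in zip(w, xt))
        # Subgradient of (L/2 * S)^2 + sigma^2 ||w||^2 at w:
        g = [(L * L / 2.0) * S * t * t * (1.0 if wi >= 0 else -1.0)
             + 2.0 * sigma ** 2 * wi for wi, t in zip(w, xt)]
        step = 1.0 / k ** 0.5                     # diminishing step size
        w = project([wi - step * gi for wi, gi in zip(w, g)])
    return w
```

On a symmetric equidistant design the resulting weights concentrate towards the points nearest $x_0$ and drive the bound (6) below the value attained by uniform weights; in practice one would, of course, call an off-the-shelf QP solver instead.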
3. ADAPTIVE DWO ESTIMATOR
Suppose that it is known a priori that $U_n^*(L^*) \in [U_n^{\min}, U_n^{\max}] \subset (0, \infty)$, where $L^*$ stands for the “true Lipschitz constant” of $f'(x)$. Let us fix $\alpha \in (0, 1)$ and a related integer $K = K(\alpha) > 1$. Consider a decreasing sequence $U = (U_k)$, $k = 1, \dots, K$, such that $U_1 = U_n^{\max}$, $U_K = U_n^{\min}$, and
$$U_k / U_{k-1} \le \alpha < 1, \qquad k = 2, \dots, K.$$
Note that, as $U_n^*(L)$ is monotonically increasing in $L$, for any $U_k$ one can easily find $L_k$ (e.g., by bisection) which gives $U_n^*(L_k) = U_k$. Thus, $L_{\max} = L_1$ stands for the maximum Lipschitz constant $L$. Note that typically $U_n^{\max} = O(\sigma^2)$ and
$$U_n^{\min} = O\!\left( \frac{\sigma^2 \sum_{i=1}^n \tilde x_i^2}{n \sum_{i=1}^n \tilde x_i^2 - \left( \sum_{i=1}^n \tilde x_i \right)^2} \right).$$
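Finding $L_k$ from $U_k$ only uses the monotonicity of $U_n^*(L)$, so plain bisection suffices. A sketch (the quadratic toy function standing in for $U_n^*$ is ours, purely for illustration):

```python
def find_L(U_target, U_star, L_lo=0.0, L_hi=100.0, tol=1e-10):
    """Bisection for L_k with U_star(L_k) = U_target, given that U_star is
    monotonically increasing and U_target lies between U_star(L_lo) and U_star(L_hi)."""
    while L_hi - L_lo > tol:
        L_mid = 0.5 * (L_lo + L_hi)
        if U_star(L_mid) < U_target:
            L_lo = L_mid
        else:
            L_hi = L_mid
    return 0.5 * (L_lo + L_hi)

def U_toy(L):
    # Toy increasing stand-in for U_n^*(L), NOT the true DWO value function.
    return 0.01 + L * L

print(find_L(0.05, U_toy))   # ≈ 0.2, since 0.01 + 0.2^2 = 0.05
```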
For each couple $(i, k)$ such that $2 \le i \le K$ and $1 \le k < i$, introduce
$$s_{ik} = \sigma_{ik} \sqrt{2 \ln \frac{U_i}{U_n^{\min}}} + 2 b_i$$
with $\sigma_{ik} = \sigma \| w^{(k)} - w^{(i)} \|$ and
$$b_i = \frac{L_i}{2} \sum_{j=1}^n \tilde x_j^2\, |w_j^{(i)}| .$$
Consider the following (adaptive DWO) algorithm, the idea of which arises from (Lepski, 1990).
1. For each $U_k \in U$, $k = 1, \dots, K$, compute the corresponding $L_k$ and the vector
$$w^{(k)} = w^*(n, L_k).$$
Form the related auxiliary estimates
$$\hat f_n^{(k)} = \sum_{i=1}^n w_i^{(k)} y_i, \qquad k = 1, \dots, K.$$
2. The adaptive estimate $\hat f_n$ is defined as follows: we call $i$, $1 \le i \le K$, admissible if
$$\hat f_n^{(i)} \in \bigcap_{1 \le k < i} \left[ \hat f_n^{(k)} - s_{ik},\ \hat f_n^{(k)} + s_{ik} \right].$$
Clearly, an admissible $i$ exists, e.g. $i = 1$ (because the intersection over an empty index set equals $\mathbb{R}$). Then we set $\hat i$ to be the largest of the admissible $i$:s and put $\hat f_n = \hat f_n^{(\hat i)}$.
This concludes the algorithm.
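In code, step 2 of the algorithm is a few lines once the auxiliary estimates and thresholds are available. A minimal sketch with 0-based indexing (names are ours; computing the actual $s_{ik}$ of course requires the weight vectors $w^{(k)}$):

```python
def select_adaptive(f_hat, s):
    """Lepski-type selection of Section 3: f_hat[i] are the auxiliary estimates
    (ordered by decreasing U_k) and s[i][k], k < i, are the thresholds s_{ik}.
    Returns the largest admissible index i-hat."""
    i_hat = 0                 # i = 0 is always admissible (empty intersection)
    for i in range(1, len(f_hat)):
        if all(abs(f_hat[i] - f_hat[k]) <= s[i][k] for k in range(i)):
            i_hat = i
    return i_hat

# Example: the third estimate drifts too far from the first one, so i-hat = 1.
f_hat = [1.00, 1.05, 1.50]
s = [[], [0.10], [0.20, 0.20]]
print(select_adaptive(f_hat, s))   # → 1
```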
4. MAIN RESULTS
Theorem 1. Consider the adaptive estimator $\hat f_n$ defined in the previous section, and let the parameter $\alpha \in (0, 1)$ be fixed. There exists an absolute constant $C(\alpha) < \infty$ such that, for any $f \in \mathcal{F}(L_{\max})$, the MSE of the adaptive estimator defined above satisfies
$$\mathrm{E}_f \left( \hat f_n - f(x_0) \right)^2 \le C(\alpha) \left[ U_n^*(L^*) \left( \ln \frac{U_n^*(L^*)}{U_n^{\min}} + 1 \right) + U_n^{\min} \ln \frac{U_n^{\max}}{U_n^*(L^*)} \right]. \qquad (9)$$
Note that the rough bounds on $U_n^{\min}$ and $U_n^{\max}$ can always be taken as follows (see Appendix B for their proof):
$$U_n^{\min} = \frac{\sigma^2}{n}, \qquad U_n^{\max} = \left( \frac{L_{\max}}{2} \right)^{\!2} + \sigma^2; \qquad (10)$$
the last expression for $U_n^{\max}$ is proved under the condition
$$\min_{1 \le i \le n} \tilde x_i < 0 < \max_{1 \le i \le n} \tilde x_i. \qquad (11)$$
With these bounds on the risk, one obtains
$$\ln(U_n^{\max} / U_n^{\min}) = \ln n + O(1), \qquad n \to \infty.$$
Hence, $\ln(U_n^*(L^*)/U_n^{\min})$ and $\ln(U_n^{\max}/U_n^*(L^*))$ are of the order $O(\ln n)$. Thus, we arrive at the following asymptotic result.
Corollary 2. Under the assumptions of Theorem 1,
$$\limsup_{n \to \infty} \frac{\mathrm{E}_f \left( \hat f_n - f(x_0) \right)^2}{U_n^*(L^*) \ln n} \le C(\alpha). \qquad (12)$$
Note that typically $U_n^*(L^*) = O(n^{-\nu})$ with $\nu > 0$ depending on the design. For instance, an equidistant design leads to $\nu = 4/5$; see (Roll, 2003) for the details.
5. CONCLUSION
The adaptive DWO algorithm assumes that an upper bound on the Lipschitz constant, $L_{\max} \ge L^*$, is known a priori. However, even when this bound is very large, the adaptive method will attain the same MSE (up to the log factor) as the unimplementable DWO estimator, which “knows” the exact Lipschitz constant. On the other hand, if the non-adaptive DWO estimator is used with the upper bound $L_{\max}$, its performance will degrade severely if $L^* \ll L_{\max}$.
Furthermore, it can be shown that the logarithmic loss in accuracy with respect to the unimplementable (“oracle”) algorithm, which attains the MSE $U_n^*(L^*)$, in a certain sense cannot be suppressed by any estimation procedure, and represents an unavoidable price for the lack of prior knowledge of $L^*$.
Acknowledgements. The authors would like to thank Oleg Lepski for fruitful discussions. The second author also thanks the Swedish Research Council (VR) for their support.
REFERENCES
Fan, J. and I. Gijbels (1996). Local Polynomial Modelling and Its Applications. Chapman & Hall.
Lepski, O.V. (1990). One problem of adaptive estimation in Gaussian white noise. Theory Probab. Appl. 35, 454–466.
Ljung, L. (1999). System Identification: Theory for the User, 2nd ed. Prentice Hall. Upper Saddle River, N.J.
Roll, J., A. Nazin and L. Ljung (2003a). Local modelling of nonlinear dynamic systems using direct weight optimization. In: 13th IFAC Symposium on System Identification. Rotterdam. pp. 1554–1559.
Roll, J., A. Nazin and L. Ljung (2003b). Local modelling with a priori known bounds using direct weight optimization. In: European Control Conference. Cambridge.
Roll, Jacob (2003). Local and Piecewise Affine Approaches to System Identification. PhD thesis. Linköping University.
Sacks, J. and D. Ylvisaker (1978). Linear estimation for approximately linear models. Ann. Statist. 6(5), 1122–1137.
APPENDIX A
Below is a sketch of the proof of Theorem 1.
1) First we note that, for $1 \le k < i \le K$,
$$\left| \hat f_n^{(k)} - \hat f_n^{(i)} \right| = \left| \sum_{j=1}^n \left( w_j^{(k)} - w_j^{(i)} \right) f(x_j) + \sum_{j=1}^n \left( w_j^{(k)} - w_j^{(i)} \right) e_j \right|$$
$$\le \left| \sum_{j=1}^n \left( w_j^{(k)} - w_j^{(i)} \right) \left( f(x_j) - f(x_0) - f'(x_0)\, \tilde x_j \right) \right| + \left| \sum_{j=1}^n \left( w_j^{(k)} - w_j^{(i)} \right) e_j \right|$$
$$\le \frac{L^*}{2} \sum_{j=1}^n \left( |w_j^{(k)}| + |w_j^{(i)}| \right) \tilde x_j^2 + |\xi_{ik}|\, \sigma_{ik}, \qquad (13)$$
where $\xi_{ik} \sim N(0, 1)$ (here the constraints (8) allow subtracting $f(x_0) + f'(x_0) \tilde x_j$ inside the first sum).
Further, for any $1 \le i, k \le K$,
$$b_i^2 + \sigma^2 \|w^{(i)}\|^2 \le \frac{L_i^2}{L_k^2} b_k^2 + \sigma^2 \|w^{(k)}\|^2$$
(as $w^{(i)}$ is the minimizer which corresponds to $L_i$), so that (by summing the $i$'s and $k$'s inequalities above)
$$b_i^2 + b_k^2 \le \frac{L_i^2}{L_k^2} b_k^2 + \frac{L_k^2}{L_i^2} b_i^2,$$
and, for $i > k$, as $L_k > L_i$, one obtains
$$b_k \le \frac{L_k}{L_i} b_i. \qquad (14)$$
2) Let $i^*$ be such that
$$U_n^*(L^*) \le U_{i^*} \le \alpha^{-1} U_n^*(L^*). \qquad (15)$$
Consider first the case $\hat i \ge i^*$. Then, due to the admissibility of $\hat i$ and the inequality $L^* \le L_{i^*}$,
$$\mathrm{E} |\hat f_n - f(x_0)|^2 \mathbf{1}\{\hat i \ge i^*\} \le 2\, \mathrm{E} |\hat f_n - \hat f_n^{(i^*)}|^2 \mathbf{1}\{\hat i \ge i^*\} + 2\, \mathrm{E} |\hat f_n^{(i^*)} - f(x_0)|^2$$
$$\le 2\, \mathrm{E}\, s_{\hat i\, i^*}^2 + 2\, U_n(w^{(i^*)}, L^*) \le 4\, \mathrm{E} \left[ 2 \sigma_{\hat i\, i^*}^2 \ln\!\left( \frac{U_{\hat i}}{U_n^{\min}} \right) + 4 b_{\hat i}^2 \right] + 2 U_{i^*}.$$
(Here and further on we use the simpler notation $\mathrm{E}$ for the expectation $\mathrm{E}_f$, the function $f$ being fixed.) On the other hand, since $U_{\hat i} \le U_{i^*}$ and $\|w^{(k)} - w^{(i)}\|^2 \le 2\|w^{(k)}\|^2 + 2\|w^{(i)}\|^2$, we have
$$\sigma_{\hat i\, i^*}^2 \ln\!\left( \frac{U_{\hat i}}{U_n^{\min}} \right) + 2 b_{\hat i}^2 \le 2 \left[ \sigma^2 \|w^{(\hat i)}\|^2 \ln\!\left( \frac{U_{\hat i}}{U_n^{\min}} \right) + b_{\hat i}^2 \right] + 2 \sigma^2 \|w^{(i^*)}\|^2 \ln\!\left( \frac{U_{\hat i}}{U_n^{\min}} \right) \le C\, U_{i^*} \ln\!\left( \frac{U_{i^*}}{U_n^{\min}} \right)$$
with some finite constant $C$. Thus
$$\mathrm{E} |\hat f_n - f(x_0)|^2 \mathbf{1}\{\hat i \ge i^*\} \le 8 \left( C \ln\!\left( \frac{U_{i^*}}{U_n^{\min}} \right) + 1 \right) U_{i^*} \qquad (16)$$
(a finer estimate can probably be obtained). We now aim to upper bound $\mathrm{E} |\hat f_n - f(x_0)|^2 \mathbf{1}\{\hat i < i^*\}$. To this end, for $1 \le k < i \le i^*$ we use (13):
$$| \hat f_n^{(i)} - \hat f_n^{(k)} | \le \frac{L^*}{2} \sum_{j=1}^n \left( |w_j^{(k)}| + |w_j^{(i)}| \right) \tilde x_j^2 + |\xi_{ik}|\, \sigma_{ik} = \frac{L^*}{L_k} b_k + \frac{L^*}{L_i} b_i + |\xi_{ik}|\, \sigma_{ik}$$
$$\le \frac{L_i}{L_k} b_k + b_i + |\xi_{ik}|\, \sigma_{ik} \quad (\text{as } i^* \ge i > k) \;\le\; 2 b_i + |\xi_{ik}|\, \sigma_{ik} \quad (\text{by } (14)).$$
Thus $\hat i < i^*$ only if, for some $k < i \le i^*$, $|\xi_{ik}| > \sqrt{2 \ln(U_i / U_n^{\min})}$. Further, since $\xi_{ik}$ has a standard Gaussian distribution, one may derive from Lemma 3 the inequality
$$\mathrm{E} \left( \sum_{j=1}^n e_j w_j^{(i)} \right)^{\!2} \mathbf{1}\{ |\xi_{ik}| > \lambda \} \le \sqrt{\frac{2}{\pi}} \left( \lambda + \lambda^{-1} \right) \sigma^2 \|w^{(i)}\|^2 e^{-\lambda^2/2}, \qquad (17)$$
and for $\lambda = \sqrt{2 \ln(U_i / U_n^{\min})}$ this expectation is
$$\le C \sqrt{\ln(U_i / U_n^{\min})}\; \sigma^2 \|w^{(i)}\|^2\, U_n^{\min} / U_i.$$
Now, since $\hat f_n = \hat f_n^{(\hat i)}$, one obtains
$$\left| \hat f_n^{(\hat i)} - f(x_0) \right| \mathbf{1}\{ \hat i < i^* \} \le \sum_{i < i^*} \left( \left| \mathrm{E} \hat f_n^{(i)} - f(x_0) \right| + \left| \hat f_n^{(i)} - \mathrm{E} \hat f_n^{(i)} \right| \right) \mathbf{1}\{ \hat i = i \} \qquad (18)$$
and, for $i < i^*$, by (14),
$$\left| \mathrm{E} \hat f_n^{(i)} - f(x_0) \right| \le \frac{L^*}{L_i} b_i \le b_{i^*} \qquad (19)$$
and, by (17) and $i^* \le C \ln(U_n^{\max} / U_{i^*})$,
$$\mathrm{E} \left| \hat f_n^{(i)} - \mathrm{E} \hat f_n^{(i)} \right|^2 \mathbf{1}\{ \hat i = i \} \le \sum_{k < i} \mathrm{E} \left( \sum_{j=1}^n e_j w_j^{(i)} \right)^{\!2} \mathbf{1}\{ |\xi_{ik}| > \lambda \} \le C i\, \sigma^2 \|w^{(i)}\|^2 \frac{U_n^{\min}}{U_i} \le C\, U_n^{\min} \ln \frac{U_n^{\max}}{U_{i^*}}. \qquad (20)$$
Thus, the bounds (18)–(20) lead to
$$\mathrm{E} \left| \hat f_n^{(\hat i)} - f(x_0) \right|^2 \mathbf{1}\{ \hat i < i^* \} \le 2 \sum_{i < i^*} \mathrm{E} \left( \left| \mathrm{E} \hat f_n^{(i)} - f(x_0) \right|^2 + \left| \hat f_n^{(i)} - \mathrm{E} \hat f_n^{(i)} \right|^2 \right) \mathbf{1}\{ \hat i = i \} \le 2 b_{i^*}^2 + C\, U_n^{\min} \ln \frac{U_n^{\max}}{U_{i^*}}. \qquad (21)$$
3) Finally, combining the upper bounds (16), (21) and using (15), we arrive at the result of the Theorem. □
Lemma 3. Let the random variables $\xi$ and $\eta$ be Gaussian with $\mathrm{E}\xi = \mathrm{E}\eta = 0$, $\mathrm{E}\xi^2 = \mathrm{E}\eta^2 = 1$. Then, for any non-random $\lambda > 0$, the following inequality holds:
$$\mathrm{E}\left( \eta^2 \mathbf{1}\{ |\xi| > \lambda \} \right) \le \sqrt{\frac{2}{\pi}} \left( \lambda + \lambda^{-1} \right) e^{-\lambda^2/2}. \qquad (22)$$
APPENDIX B
The bounds (10) are proved as follows. By definition (6), the solution to the minimization problem (7)–(8) is bounded from below via
$$U_n(w, L) \ge \sigma^2 \min_{w^T \mathbf{1}_n = 1} \|w\|^2 = \frac{\sigma^2}{n}, \qquad (23)$$
where $\mathbf{1}_n = (1, 1, \dots, 1)^T \in \mathbb{R}^n$. Similarly,
$$U_n^*(L^*) \le U_n^*(L_{\max}) \le U_n(w^+, L_{\max}), \qquad (24)$$
where the weight vector $w^+$ has non-negative entries and meets the constraints (8); such a vector exists due to (11). Thus, the second equality in (10) follows from (24), due to $|\tilde x_i| \le 1$ and $\|w^+\| \le 1$.
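The lower bound (23) is easy to sanity-check numerically: any weight vector on the hyperplane $\sum_i w_i = 1$ has $\|w\|^2 \ge 1/n$, with equality at the uniform weights. A small check (our illustration):

```python
import random

sigma, n = 0.1, 50
floor = sigma ** 2 / n                # the lower bound sigma^2 / n of (23)
random.seed(1)
for _ in range(100):
    w = [random.gauss(0.0, 1.0) for _ in range(n)]
    shift = (sum(w) - 1.0) / n        # project onto the hyperplane sum w = 1
    w = [wi - shift for wi in w]
    assert sigma ** 2 * sum(wi ** 2 for wi in w) >= floor - 1e-12
print("variance term never drops below sigma^2/n")
```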