
Technical report from Automatic Control at Linköpings universitet

Piecewise Linear Solution Paths for Parametric Piecewise Quadratic Programs with Application to Direct Weight Optimization

Jacob Roll

Division of Automatic Control

E-mail: roll@isy.liu.se

28th August 2007

Report no.: LiTH-ISY-R-2816

Address:

Department of Electrical Engineering, Linköpings universitet

SE-581 83 Linköping, Sweden

WWW: http://www.control.isy.liu.se


Technical reports from the Automatic Control group in Linköping are available from http://www.control.isy.liu.se.


Abstract

Recently, pathfollowing algorithms for parametric optimization problems with piecewise linear solution paths have been developed within the field of regularized regression. This paper presents a generalization of these algorithms to a wider class of problems, namely a class of parametric piecewise quadratic programs and related problems. It is shown that the approach can be applied to the nonparametric system identification method Direct Weight Optimization (DWO) and be used to enhance the computational efficiency of this method. The most important design parameter in the DWO method is a parameter (λ) controlling the bias-variance trade-off, and the use of parametric optimization with piecewise linear solution paths means that the DWO estimates can be efficiently computed for all values of λ simultaneously. This allows for designing computationally attractive adaptive bandwidth selection algorithms. One such algorithm for DWO is proposed and demonstrated in two examples.

Keywords: System identification, Non-parametric identification, Parametric optimization, Pathfollowing algorithms


Piecewise Linear Solution Paths for Parametric Piecewise Quadratic Programs with Application to Direct Weight Optimization

Jacob Roll

2007-08-28

Abstract

Recently, pathfollowing algorithms for parametric optimization problems with piecewise linear solution paths have been developed within the field of regularized regression. This paper presents a generalization of these algorithms to a wider class of problems, namely a class of parametric piecewise quadratic programs and related problems. It is shown that the approach can be applied to the nonparametric system identification method Direct Weight Optimization (DWO) and be used to enhance the computational efficiency of this method. The most important design parameter in the DWO method is a parameter (λ) controlling the bias-variance trade-off, and the use of parametric optimization with piecewise linear solution paths means that the DWO estimates can be efficiently computed for all values of λ simultaneously. This allows for designing computationally attractive adaptive bandwidth selection algorithms. One such algorithm for DWO is proposed and demonstrated in two examples.

1 Introduction

In many applications, one encounters optimization problems involving a trade-off between two terms to optimize, i.e., problems of the type

\min_{x \in P} \; L(x) + \lambda J(x)   (1)

where λ is a design parameter controlling the trade-off, and P is the feasible region. The problem (1) is a parametric optimization problem (Guddat et al., 1990), or can also be viewed as a special case of multiobjective optimization (Boyd and Vandenberghe, 2004). Examples include bias-variance trade-off issues, regularization (where the size of the penalty is to be estimated), etc.

In recent years, results in explicit model predictive control have led to a growing interest in multiparametric linear and quadratic programming, where it has been shown that the solutions to different classes of problems are piecewise affine functions of the parameters (see, e.g., Bemporad et al., 2002; Borrelli, 2003; Tøndel et al., 2003).

On the other hand, examples of (single-)parametric optimization problems in the form (1), with solutions that are piecewise affine functions of λ, can be found in the field of regularized regression. In the paper by Efron et al. (2004), the authors present a new estimation method, least angle regression (LARS), and show that the solutions to both LARS and LASSO (Tibshirani, 1996) can be efficiently computed for all values of λ simultaneously. As pointed out in (Rosset and Zhu, 2004, 2007), the key to these algorithms is that the solution paths (i.e., the optimal solutions x to the parametric optimization problem as a function of λ) are piecewise linear as λ varies from 0 to ∞. Similar results have recently also been shown for the related nn-garrote (non-negative garrote) method and grouped versions of all these methods (Yuan and Lin, 2006). In all these cases, having a single-parametric optimization problem allows for developing pathfollowing algorithms that exploit the piecewise linearity to efficiently find and represent the solution path.

This paper presents a generalization of the framework of pathfollowing algorithms for piecewise linear solution paths in (Efron et al., 2004; Rosset and Zhu, 2007; Yuan and Lin, 2006), and extends the problem class to a broad class of optimization problems. For the case of quadratic plus piecewise affine cost functions, an algorithm with explicit expressions for computation of the solution path is given.

A particular example of parametric problems in the form (1) occurs in Direct Weight Optimization (Roll, 2003; Roll et al., 2005a,b), which is a nonparametric identification/function estimation method. DWO computes pointwise function estimates, given data {y(t), ϕ(t)}_{t=1}^N from

y(t) = f0(ϕ(t)) + e(t) (2)

where f0 is the unknown function to be estimated, f0 : R^n → R, and e(t) are white noise terms. The idea of making pointwise function estimates has also appeared under names such as Model on Demand, lazy learning and least commitment learning (see, e.g., Atkeson et al., 1997a,b; Bontempi et al., 1999; Stenman, 1999, and references therein).

In order to estimate f0(ϕ∗) for a given point ϕ∗, the idea of DWO is to use an affine estimator

\hat{f}_N(\varphi^*) = w_0 + \sum_{t=1}^{N} w_t y(t)   (3)

and to select the weights (w0, w), w = (w1, . . . , wN), of the estimator by convex optimization. Assuming that f0 belongs to some function class F, the weights can be determined by minimizing a convex upper bound on the maximum mean-squared error (MSE)¹

mMSE(\varphi^*, w_0, w) = \sup_{f_0 \in F} E\left[ \left( f_0(\varphi^*) - \hat{f}_N(\varphi^*) \right)^2 \,\middle|\, \{\varphi(t)\}_{t=1}^{N} \right]   (4)

The resulting minimization problem is convex and can be written in the following abstract form:

\min_{w \in D} \; \lambda U^2(w) + V(w)   (5)

where w is a weight vector needed for computing the function estimate, U² is basically an upper bound on the squared bias, and V the variance term. The design parameter λ determines the trade-off between the flexibility of the function class and the noise variance (see Section 3 for more details).

¹Note that (4) is always convex in (w0, w), regardless of how the function class F is chosen, so in principle we could minimize the maximum MSE directly. However, for many function classes, (4) is difficult to compute, and we have to find an upper bound instead.

The computed estimate will of course depend on the choice of λ controlling the bias-variance trade-off. A method for selecting λ for the case when the noise variance is known was given in (Juditsky et al., 2004). It could also be chosen by using cross-validation or some other criterion (Härdle, 1990; Stenman, 1999). For all these methods, one needs to compute the DWO estimates for several different parameter values, which makes it desirable to be able to efficiently compute the entire solution path.

We will show that the developed pathfollowing algorithm can be applied to the DWO approach. This means that we can simultaneously compute the DWO values from (5) for all choices of λ, which would mean a great gain in computational efficiency. A cross-validation-based algorithm for selection of λ in DWO is also proposed.

The paper is organized as follows: Section 2 considers some specific problem classes for which the solution paths are piecewise linear, while Section 3 proposes how this property can be exploited in the DWO approach.

2 Piecewise linear solution paths

In this section, we will consider some specific classes of optimization problems of the type (1), which will be shown to have piecewise linear solution paths.

2.1 Piecewise quadratic plus piecewise affine cost function

First, we will consider a class of optimization problems in the form (1) where J(x) is piecewise linear and L(x) is a piecewise quadratic function. A general piecewise linear convex function can be written (Boyd and Vandenberghe, 2004)

J(x) = \max_k \{ c_k^T x + d_k \}   (6)

L(x) is supposed to be strictly convex and in the following form

L(x) = \tfrac{1}{2} x^T Q_i x + f_i^T x + r_i \quad \text{if } x \in X_i   (7)

where the polyhedral regions X_i = \{ x \mid \tilde{H}_i x \preceq \tilde{q}_i \}, i ∈ I (here ⪯ denotes componentwise inequalities), form a partition of the x space (for simplicity, we let the regions be closed sets, which means that they will intersect at the boundaries). Furthermore, we assume that for each λ ≥ 0, problem (1) has a unique, finite optimal solution.

We can now show the following lemma.

Lemma 1. The problem

\begin{aligned} \min_x \quad & \lambda \max_k \{ c_k^T x + d_k \} + L(x) \\ \text{subj. to} \quad & Ax = b \\ & \bar{A} x \preceq \bar{b} \end{aligned}   (8)

with L(x) given by (7) has a piecewise linear solution path, i.e., the optimal x ∈ R^n is a piecewise affine function of λ ∈ [0, ∞].

Proof. It is easy to see that the optimum of (8), which is unique and finite for given λ according to the assumptions, changes continuously with λ.

Now, we can partition the feasible set into a number of relatively open polyhedra together with a number of points (the corners of the polyhedra), denoted P_j (i.e., either P_j = relint(P_j) or P_j is a single point; for the definition of relative interior, see Boyd and Vandenberghe (2004)), such that on P_j, the cost function of (1) equals

\lambda (c_{k_j}^T x + d_{k_j}) + \tfrac{1}{2} x^T Q_{i_j} x + f_{i_j}^T x + r_{i_j}

Let the affine hull of P_j (Boyd and Vandenberghe, 2004) be described by

\text{aff}(P_j) = \{ x \mid \tilde{A}_j x = \tilde{b}_j \}

where \tilde{A}_j has full row rank.

Assume that the solution to (8) for a given λ lies in P_j. Then, since this solution is either in the relative interior of P_j or the only point of P_j, it is also the solution to

\begin{aligned} \min_x \quad & \lambda (c_{k_j}^T x + d_{k_j}) + \tfrac{1}{2} x^T Q_{i_j} x + f_{i_j}^T x + r_{i_j} \\ \text{subj. to} \quad & \tilde{A}_j x = \tilde{b}_j \end{aligned}   (9)

But the solution to this problem can be computed as

x = Q_{i_j}^{-1} \left[ \left( \tilde{A}_j^T (\tilde{A}_j Q_{i_j}^{-1} \tilde{A}_j^T)^{-1} \tilde{A}_j Q_{i_j}^{-1} - I \right) (f_{i_j} + c_{k_j} \lambda) + \tilde{A}_j^T (\tilde{A}_j Q_{i_j}^{-1} \tilde{A}_j^T)^{-1} \tilde{b}_j \right]   (10)

(see Appendix A). Here, x is linear in λ. This means that the solution to (8) must consist of a number of such linear pieces, one piece for every P_j that the solution path passes through, and hence, the solution path is piecewise linear.

Remark 1. The strict convexity condition for L(x) can be relaxed. It is sufficient that L(x) is strictly convex in a neighborhood of each point on the solution path, and convex elsewhere.

Having shown Lemma 1, let us try to outline an algorithm that computes the entire solution path. For simplicity, we assume that A has full row rank. We can rewrite the problem by introducing slack variables according to

\begin{aligned} \min_{x,s} \quad & \lambda s + L(x) \\ \text{subj. to} \quad & s \geq c_k^T x + d_k \\ & Ax = b \\ & \bar{A} x \preceq \bar{b} \end{aligned}   (11)

The Lagrangian function of (11) becomes

\mathcal{L}(s, x; \mu, \mu^{\bar{A}}, \mu^{A}) = \lambda s + L(x) - \sum_{k=1}^{m} \mu_k (s - c_k^T x - d_k) - (\mu^{\bar{A}})^T (\bar{b} - \bar{A} x) - (\mu^{A})^T (b - Ax)   (12)


Using a version of the Karush-Kuhn-Tucker (KKT) conditions (Rockafellar, 1970, Cor. 28.3.1) we can see that (x, s) is the optimal solution to (11) if and only if the following conditions are satisfied for some subgradient ν of L(x):

\begin{aligned} & \nu + \sum_{k=1}^{m} \mu_k c_k + \bar{A}^T \mu^{\bar{A}} + A^T \mu^{A} = 0 && (13a) \\ & \lambda - \sum_{k=1}^{m} \mu_k = 0 && (13b) \\ & s \geq c_k^T x + d_k && (13c) \\ & \bar{A} x \preceq \bar{b} && (13d) \\ & Ax = b && (13e) \\ & \mu_k (s - c_k^T x - d_k) = 0 && (13f) \\ & \mu_j^{\bar{A}} (\bar{b}_j - \bar{A}_j x) = 0 && (13g) \\ & \mu_k \geq 0, \quad \mu^{\bar{A}} \succeq 0 && (13h) \end{aligned}

The subgradient ν can be written as a convex combination of linear expressions:

\nu = \sum_{i \in I^a} \alpha_i (Q_i x + f_i)   (14a)

where

I^a = \{ i \mid x \in X_i \}, \qquad \sum_{i \in I^a} \alpha_i = 1, \qquad \alpha_i \geq 0   (14b)

The KKT conditions have a solution that is unique in (x, s), but not necessarily in (µ, µ^Ā, µ^A). Hence, we should keep the number of active constraints at a minimum, to ensure that we get a unique solution.

Denote the different sets of active constraints by K^a (for µ_k) and J^a (for µ_j^Ā). If we assume that the solution for the current λ is at a linear piece (not at a knot) in the interior of an X_i region, then (13a) and (13b) can be written as

\begin{aligned} & Q_i x + \sum_{k \in K^a} \mu_k c_k + \sum_{j \in J^a} \mu_j^{\bar{A}} \bar{A}_j^T + A^T \mu^{A} = -f_i && (15a) \\ & \sum_{k \in K^a} \mu_k = \lambda && (15b) \end{aligned}

Let us introduce the following notation:

K^a = \{ k_1, k_2, \ldots, k_{n_K} \}, \qquad J^a = \{ j_1, j_2, \ldots, j_{n_J} \}

\mu_{K^a} = \begin{bmatrix} \mu_{k_1} & \ldots & \mu_{k_{n_K}} \end{bmatrix}^T, \qquad \mu^{\bar{A}}_{J^a} = \begin{bmatrix} \mu^{\bar{A}}_{j_1} & \ldots & \mu^{\bar{A}}_{j_{n_J}} \end{bmatrix}^T

C_{K^a} = \begin{bmatrix} c_{k_1} & \ldots & c_{k_{n_K}} \end{bmatrix}^T, \qquad \bar{A}_{J^a} = \begin{bmatrix} \bar{A}_{j_1} \\ \vdots \\ \bar{A}_{j_{n_J}} \end{bmatrix}

d_{K^a} = \begin{bmatrix} d_{k_1} & \ldots & d_{k_{n_K}} \end{bmatrix}^T, \qquad \bar{b}_{J^a} = \begin{bmatrix} \bar{b}_{j_1} & \ldots & \bar{b}_{j_{n_J}} \end{bmatrix}^T

If we combine (15), (13f), (13g) and (13e), to obtain the solution we then need to solve

\begin{bmatrix} Q_i & 0 & C_{K^a}^T & \bar{A}_{J^a}^T & A^T \\ 0 & 0 & \mathbf{1}_{n_K}^T & 0 & 0 \\ C_{K^a} & \mathbf{1}_{n_K} & 0 & 0 & 0 \\ \bar{A}_{J^a} & 0 & 0 & 0 & 0 \\ A & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x \\ -s \\ \mu_{K^a} \\ \mu^{\bar{A}}_{J^a} \\ \mu^{A} \end{bmatrix} = \begin{bmatrix} -f_i \\ \lambda \\ -d_{K^a} \\ \bar{b}_{J^a} \\ b \end{bmatrix}   (16)

It now follows from Lemma 2 of Appendix B that if

\begin{bmatrix} C_{K^a} & \mathbf{1}_{n_K} \\ \bar{A}_{J^a} & 0 \\ A & 0 \end{bmatrix}   (17)

has full row rank, then the solution to (16) is unique.

To compute what happens for a small change in λ, we can solve

\begin{bmatrix} Q_i & 0 & C_{K^a}^T & \bar{A}_{J^a}^T & A^T \\ 0 & 0 & \mathbf{1}_{n_K}^T & 0 & 0 \\ C_{K^a} & \mathbf{1}_{n_K} & 0 & 0 & 0 \\ \bar{A}_{J^a} & 0 & 0 & 0 & 0 \\ A & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \partial x / \partial \lambda \\ -\partial s / \partial \lambda \\ \partial \mu_{K^a} / \partial \lambda \\ \partial \mu^{\bar{A}}_{J^a} / \partial \lambda \\ \partial \mu^{A} / \partial \lambda \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}   (18)
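The two linear systems (16) and (18) share the same coefficient matrix and can be assembled and solved together. The following numpy sketch is an illustration only (not part of the original report): the variable names are hypothetical, and nonempty active sets K^a, J^a and at least one equality constraint are assumed.

import numpy as np

def solve_kkt_piece(Qi, f_i, C_Ka, d_Ka, Abar_Ja, bbar_Ja, A, b, lam):
    """Solve (16) for the current point and (18) for the direction with respect
    to lambda, for fixed active sets K^a, J^a inside a single region X_i.
    (Hypothetical variable names; a sketch of the linear algebra only.)"""
    n = Qi.shape[0]
    nK, nJ, nA = C_Ka.shape[0], Abar_Ja.shape[0], A.shape[0]
    z = np.zeros
    ones_K = np.ones((nK, 1))
    # Block coefficient matrix shared by (16) and (18)
    M = np.block([
        [Qi,        z((n, 1)),  C_Ka.T,      Abar_Ja.T,   A.T],
        [z((1, n)), z((1, 1)),  ones_K.T,    z((1, nJ)),  z((1, nA))],
        [C_Ka,      ones_K,     z((nK, nK)), z((nK, nJ)), z((nK, nA))],
        [Abar_Ja,   z((nJ, 1)), z((nJ, nK)), z((nJ, nJ)), z((nJ, nA))],
        [A,         z((nA, 1)), z((nA, nK)), z((nA, nJ)), z((nA, nA))],
    ])
    rhs_point = np.concatenate([-f_i, [lam], -d_Ka, bbar_Ja, b])       # (16)
    rhs_dir   = np.concatenate([z(n), [1.0], z(nK), z(nJ), z(nA)])     # (18)
    sol = np.linalg.solve(M, rhs_point)   # [x, -s, mu_Ka, mu_Ja, mu_A]
    ddl = np.linalg.solve(M, rhs_dir)     # derivatives of the same unknowns
    x, s = sol[:n], -sol[n]
    dx_dlam, ds_dlam = ddl[:n], -ddl[n]
    return x, s, dx_dlam, ds_dlam

The output gives both the current point on the solution path and the rates at which x and s move as λ increases, which is what a pathfollowing step needs between two knots.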

If x belongs to an intersection between a number of regions \bigcap_{i \in I^a} X_i, we need to solve

\begin{bmatrix} \sum_{i \in I^a} \alpha_i Q_i & 0 & C_{K^a}^T & \bar{A}_{J^a}^T & A^T \\ 0 & 0 & \mathbf{1}_{n_K}^T & 0 & 0 \\ C_{K^a} & \mathbf{1}_{n_K} & 0 & 0 & 0 \\ \bar{A}_{J^a} & 0 & 0 & 0 & 0 \\ A & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x \\ -s \\ \mu_{K^a} \\ \mu^{\bar{A}}_{J^a} \\ \mu^{A} \end{bmatrix} = \begin{bmatrix} -\sum_{i \in I^a} \alpha_i f_i \\ \lambda \\ -d_{K^a} \\ \bar{b}_{J^a} \\ b \end{bmatrix}   (19)

instead of (16). Since α_i are unknown, this is not a system of linear equations, but as we will see, it can still be handled using mostly linear techniques. Let

H_{I^a} x = q_{I^a}   (20)

be a minimal number of constraints that restrict x to \text{aff}\left( \bigcap_{i \in I^a} X_i \right) (taking into account also the last three block rows of (19)). What we need to find is a solution to the combined problem (19) and (20).

Extend H_{I^a} to a square, non-singular matrix according to

\begin{bmatrix} H_{I^a} \\ H_{I^a}^{\perp} \end{bmatrix} \in \mathbb{R}^{n \times n}, \qquad H_{I^a} (H_{I^a}^{\perp})^T = 0

Given a particular solution x^* to (20), the general solution can be written as x = x^* + (H_{I^a}^{\perp})^T \beta.

Inserting this into the first block row of (19) and multiplying from the left by H_{I^a}^{\perp} gives

\sum_{i \in I^a} \alpha_i H_{I^a}^{\perp} \left( Q_i (x^* + (H_{I^a}^{\perp})^T \beta) + f_i \right) + H_{I^a}^{\perp} C_{K^a}^T \mu_{K^a} + H_{I^a}^{\perp} \bar{A}_{J^a}^T \mu^{\bar{A}}_{J^a} + H_{I^a}^{\perp} A^T \mu^{A} = 0   (21)

Now, since L(x) is continuous, the gradients of all the quadratic functions with indices in I^a have the same component along the common boundary of the regions X_i, i ∈ I^a. Hence, the first sum is independent of α, and we can choose any index l ∈ I^a and replace the sum according to

\sum_{i \in I^a} \alpha_i H_{I^a}^{\perp} \left( Q_i (x^* + (H_{I^a}^{\perp})^T \beta) + f_i \right) = H_{I^a}^{\perp} \left( Q_l (x^* + (H_{I^a}^{\perp})^T \beta) + f_l \right)

Hence, we can solve

\begin{bmatrix} H_{I^a}^{\perp} Q_l (H_{I^a}^{\perp})^T & 0 & H_{I^a}^{\perp} C_{K^a}^T & H_{I^a}^{\perp} \bar{A}_{J^a}^T & H_{I^a}^{\perp} A^T \\ 0 & 0 & \mathbf{1}_{n_K}^T & 0 & 0 \\ C_{K^a} (H_{I^a}^{\perp})^T & \mathbf{1}_{n_K} & 0 & 0 & 0 \\ \bar{A}_{J^a} (H_{I^a}^{\perp})^T & 0 & 0 & 0 & 0 \\ A (H_{I^a}^{\perp})^T & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \beta \\ -s \\ \mu_{K^a} \\ \mu^{\bar{A}}_{J^a} \\ \mu^{A} \end{bmatrix} = \begin{bmatrix} -H_{I^a}^{\perp} (Q_l x^* + f_l) \\ \lambda \\ -d_{K^a} \\ \bar{b}_{J^a} \\ b \end{bmatrix}   (22)

which, due to the minimality of (20), is still uniquely solvable. The solution can then be inserted into the first block row of (19) to solve for α.

As before, to compute what happens for a small change in λ, we can solve

\begin{bmatrix} H_{I^a}^{\perp} Q_l (H_{I^a}^{\perp})^T & 0 & H_{I^a}^{\perp} C_{K^a}^T & H_{I^a}^{\perp} \bar{A}_{J^a}^T & H_{I^a}^{\perp} A^T \\ 0 & 0 & \mathbf{1}_{n_K}^T & 0 & 0 \\ C_{K^a} (H_{I^a}^{\perp})^T & \mathbf{1}_{n_K} & 0 & 0 & 0 \\ \bar{A}_{J^a} (H_{I^a}^{\perp})^T & 0 & 0 & 0 & 0 \\ A (H_{I^a}^{\perp})^T & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \partial \beta / \partial \lambda \\ -\partial s / \partial \lambda \\ \partial \mu_{K^a} / \partial \lambda \\ \partial \mu^{\bar{A}}_{J^a} / \partial \lambda \\ \partial \mu^{A} / \partial \lambda \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}   (23)

To start the algorithm, we first have to compute the optimal solution for λ = 0 (let us denote this solution by x_0). This is an ordinary convex optimization problem. After this, we select a maximal subset of the constraints that are active at x_0, such that (17) has full row rank. (However, note that in K^a, it is sufficient to include only one index k to start with.)

The algorithm can now be described as follows:

Algorithm 2.1. Given: A parametric optimization problem of the type (8).

1. Set λ = 0.

2. Compute the optimal solution x_0 of (8) for λ = 0.

3. Let S = {(λ, x_λ)} = {(0, x_0)}.

4. Let K^a = {k} for some k in arg max_k {c_k^T x_λ + d_k}, and let J^a, I^a be maximal subsets of indices for which the corresponding constraints are active at x_0, such that (17) has full row rank. Compute H_{I^a} and H_{I^a}^⊥. (If I^a has only one member, H_{I^a} will be empty, and we can let H_{I^a}^⊥ = I.)

5. Compute (22) to get s, µ_{K^a}, µ^Ā_{J^a}, and µ^A.

6. Compute the directions given by (23).

7. Find the minimal δλ ≥ 0 such that one of the following conditions is satisfied:

(a) s + (∂s/∂λ)δλ = c_k^T(x_λ + (∂x/∂λ)δλ) + d_k and ∂s/∂λ < c_k^T ∂x/∂λ for some k ∉ K^a. Then move the corresponding k to K^a.

(b) µ_k + (∂µ_k/∂λ)δλ = 0 and ∂µ_k/∂λ < 0 for some k ∈ K^a. Then remove the corresponding k from K^a.

(c) Ā_j(x_λ + (∂x/∂λ)δλ) = b̄_j and Ā_j ∂x/∂λ > 0 for some j ∉ J^a. Then move the corresponding j to J^a.

(d) µ^Ā_j + (∂µ^Ā_j/∂λ)δλ = 0 and ∂µ^Ā_j/∂λ < 0 for some j ∈ J^a. Then remove the corresponding j from J^a.

(e) x_λ + (∂x/∂λ)δλ ∈ X_l for some l ∉ I^a, and x_λ + (∂x/∂λ)(δλ + ε) ∉ X_i for some ε > 0 and i ∈ I^a. Then include l in I^a.

(f) There would not be a subgradient ν satisfying (14) (with x replaced by x_λ + (∂x/∂λ)δλ) if δλ were increased further. Then remove one i with corresponding α_i = 0 from I^a.

Add δλ to λ, update x_λ, s, µ_{K^a}, µ^Ā_{J^a}, and µ^A, and recompute H_{I^a} and H_{I^a}^⊥. If there is no δλ ≥ 0 for which the conditions are satisfied, set λ = ∞.

8. Add the new pair (λ, x_λ) to S; S := {S, (λ, x_λ)}.

9. If λ = ∞, stop. Otherwise, go to 6.

When the algorithm has finished, S will contain the knots of the piecewise linear solution path, and any solution to (8) can be obtained by linear interpolation between two neighboring knots.

Note that only linear techniques are needed to find the solution path, except for step 7f, where α_i is a nonlinear function of λ. This step can be handled by simple line search. It can also be shown that all the changes of step 7 will lead to full row rank of (17).

Remark 2. Without restriction, we can assume non-degeneracy in the sense that each term c_k^T x + d_k in (6) is the maximal term on a set of positive measure, and that two terms are equal only on a set of measure zero. This may simplify implementations for computing the solution path. However, this is not necessary for Lemma 1 to hold.


2.2 Quadratic plus piecewise affine cost function

Now let us consider the class of optimization problems with positive definite quadratic L(x) and a J(x) which is a sum of absolute values of affine functions. We will also allow linear equality constraints. In other words, the problems we will consider can be written as

\begin{aligned} \min_z \quad & \lambda \sum_{k=1}^{m} |h_k^T z + g_k| + \tfrac{1}{2} z^T Q z + p^T z \\ \text{subj. to} \quad & Az = b \end{aligned}   (24)

with Q > 0. Provided that the problem is feasible, we can use the constraints to eliminate variables, to get an equivalent problem in the form

\min_{\tilde{z}} \; \lambda \sum_{k=1}^{m} |\tilde{h}_k^T \tilde{z} + \tilde{g}_k| + \tfrac{1}{2} \tilde{z}^T \tilde{Q} \tilde{z} + \tilde{p}^T \tilde{z}   (25)

We can now make the variable substitution x = \tilde{Q}^{1/2} \tilde{z} + \tilde{Q}^{-1/2} \tilde{p}. Hence, it turns out that it is sufficient to consider problems of the type

\min_x \; \lambda \sum_{k=1}^{m} |c_k^T x + d_k| + \tfrac{1}{2} x^T x   (26)

in more detail. This is a special case of (8), and hence (26) has a piecewise linear solution path.
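For completeness, the two reductions above (elimination of Az = b and the whitening substitution) can be carried out numerically as in the following numpy sketch. This is an illustration only and not part of the original report; the function and variable names are hypothetical, Q is assumed positive definite, and (24) is assumed feasible.

import numpy as np

def to_standard_form(Q, p, H, g, A, b):
    """Reduce (24) to the form (26). H stacks the vectors h_k^T as rows and g the
    offsets g_k. Returns C (rows c_k^T), d, and a map from x back to z."""
    # General solution of Az = b:  z = z_p + Z*ztil, with Z an orthonormal null-space basis
    z_p, *_ = np.linalg.lstsq(A, b, rcond=None)
    _, s, Vt = np.linalg.svd(A)
    Z = Vt[np.sum(s > 1e-12):].T
    # Reduced problem (25)
    Q_t = Z.T @ Q @ Z
    p_t = Z.T @ (Q @ z_p + p)
    H_t = H @ Z
    g_t = g + H @ z_p
    # Whitening substitution x = Q_t^{1/2} ztil + Q_t^{-1/2} p_t, giving problem (26)
    w_eig, V = np.linalg.eigh(Q_t)
    Q_isqrt = V @ np.diag(1.0 / np.sqrt(w_eig)) @ V.T
    C = H_t @ Q_isqrt                              # rows are c_k^T
    d = g_t - H_t @ np.linalg.solve(Q_t, p_t)      # offsets d_k
    def x_to_z(x):
        ztil = Q_isqrt @ (x - Q_isqrt @ p_t)
        return z_p + Z @ ztil
    return C, d, x_to_z

The returned C and d are exactly the data of (26), and x_to_z maps a point on the solution path of (26) back to the original variables of (24).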

We will now derive explicit expressions for the directions of the linear parts of the solution path. For simplicity, we will assume that the problem is non-degenerate in the sense that for all x, the vectors of the set C_x^0 = \{ c_k \mid c_k^T x + d_k = 0 \} are linearly independent.

Introducing slack variables, we can rewrite (26) as

\begin{aligned} \min_{x,s} \quad & \lambda \sum_{k=1}^{m} s_k + \tfrac{1}{2} x^T x \\ \text{subj. to} \quad & s_k \geq c_k^T x + d_k \\ & s_k \geq -c_k^T x - d_k \end{aligned}   (27)

The Lagrangian function of (27) becomes

\mathcal{L}(s, x; \mu) = \lambda \sum_{k=1}^{m} s_k + \tfrac{1}{2} x^T x - \sum_{k=1}^{m} \mu_k^+ (s_k - c_k^T x - d_k) - \sum_{k=1}^{m} \mu_k^- (s_k + c_k^T x + d_k)

We can now write the KKT conditions as

\begin{aligned} & x + \sum_{k=1}^{m} c_k \mu_k^+ - \sum_{k=1}^{m} c_k \mu_k^- = 0 && (28a) \\ & \lambda - \mu_k^+ - \mu_k^- = 0 && (28b) \\ & \mu_k^+ (s_k - c_k^T x - d_k) = 0 && (28c) \\ & \mu_k^- (s_k + c_k^T x + d_k) = 0 && (28d) \\ & \mu_k^{\pm} \geq 0 && (28e) \end{aligned}


Define the sets

\begin{aligned} K^+ &= \{ k : c_k^T x + d_k > 0 \} \\ K^- &= \{ k : c_k^T x + d_k < 0 \} \\ K^0 &= \{ k : c_k^T x + d_k = 0 \} \end{aligned}   (29)

For k ∈ K^+, we get µ_k^+ = λ and µ_k^- = 0 from (28b) and (28d). Similarly, for k ∈ K^- we obtain µ_k^+ = 0 and µ_k^- = λ. Using (28a) this implies that

x + \lambda \sum_{k \in K^+} c_k - \lambda \sum_{k \in K^-} c_k + \sum_{k \in K^0} c_k (2\mu_k^+ - \lambda) = 0   (30)

Since we would like to consider the linear parts of the solution path, we can assume 0 < µ_k^+ < λ in the last sum. Let us now add δλ to λ. The solution (x, s) to (28) will then change accordingly by (δx, δs), so that if δλ is small enough we have

x + \delta x + (\lambda + \delta\lambda) \left( \sum_{k \in K^+} c_k - \sum_{k \in K^-} c_k \right) + \sum_{k \in K^0} c_k \left( 2(\mu_k^+ + \delta\mu_k^+) - (\lambda + \delta\lambda) \right) = 0

which together with (30) implies that

\delta x + \delta\lambda \left( \sum_{k \in K^+} c_k - \sum_{k \in K^-} c_k \right) + \sum_{k \in K^0} c_k \left( 2\delta\mu_k^+ - \delta\lambda \right) = 0   (31)

If we introduce the notation K^0 = \{ k_1, \ldots, k_{n_0} \} and

C^0 = \begin{bmatrix} c_{k_1}^T \\ \vdots \\ c_{k_{n_0}}^T \end{bmatrix}, \qquad \delta M^+ = \begin{bmatrix} \delta\mu_{k_1}^+ \\ \vdots \\ \delta\mu_{k_{n_0}}^+ \end{bmatrix}   (32)

and similarly for C^+ and C^-, we can write (31) as

\delta x + \left( (C^+)^T \mathbf{1}_{n_+} - (C^-)^T \mathbf{1}_{n_-} \right) \delta\lambda + (C^0)^T \left( 2\delta M^+ - \mathbf{1}_{n_0} \delta\lambda \right) = 0   (33)

At the same time, for j ∈ K^0 it must hold that

c_j^T (x + \delta x) + d_j = 0 \;\Rightarrow\; c_j^T \delta x = 0

Hence, multiplying (33) by C^0 yields

C^0 \left( (C^+)^T \mathbf{1}_{n_+} - (C^-)^T \mathbf{1}_{n_-} \right) \delta\lambda + C^0 (C^0)^T \left( 2\delta M^+ - \mathbf{1}_{n_0} \delta\lambda \right) = 0   (34)

Since C^0 (C^0)^T is invertible, we get

\frac{\delta M^+}{\delta\lambda} = \frac{1}{2} \left( C^0 (C^0)^T \right)^{-1} C^0 \left( -(C^+)^T \mathbf{1}_{n_+} + (C^-)^T \mathbf{1}_{n_-} + (C^0)^T \mathbf{1}_{n_0} \right)   (35)


Inserting this into (33) results in

\frac{\delta x}{\delta\lambda} = \left( I - (C^0)^T \left( C^0 (C^0)^T \right)^{-1} C^0 \right) \left( -(C^+)^T \mathbf{1}_{n_+} + (C^-)^T \mathbf{1}_{n_-} + (C^0)^T \mathbf{1}_{n_0} \right)   (36)

Note that since this expression is locally constant, the solution x will locally change linearly as λ changes. Just as for the problem class considered in Section 2.1, this means that when computing the solution path, we only need to store the solutions and values of λ for the knots of the solution path, as the values in between can be obtained afterwards by simple linear interpolation.
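For illustration (not part of the original report; the array names C_plus, C_minus, C_zero are hypothetical and stack the vectors c_k^T for K^+, K^- and K^0 as rows), the directions (35) and (36) can be evaluated directly in numpy:

import numpy as np

def path_directions(C_plus, C_minus, C_zero):
    """Evaluate (35)-(36): the rates of change of mu^+ (for k in K^0) and of x
    with respect to lambda, valid on a linear piece where K^+, K^-, K^0 are fixed.
    Each argument stacks the corresponding row vectors c_k^T (possibly 0 rows)."""
    # Common factor  -(C+)^T 1 + (C-)^T 1 + (C0)^T 1  in (35) and (36)
    v = -C_plus.sum(axis=0) + C_minus.sum(axis=0) + C_zero.sum(axis=0)
    if C_zero.shape[0] == 0:
        # K^0 empty: (35) is vacuous and (36) reduces to dx/dlambda = v
        return np.zeros(0), v
    G = C_zero @ C_zero.T                 # C0 C0^T, invertible by non-degeneracy
    w = np.linalg.solve(G, C_zero @ v)    # (C0 C0^T)^{-1} C0 v
    dM_dlam = 0.5 * w                     # (35)
    dx_dlam = v - C_zero.T @ w            # (36)
    return dM_dlam, dx_dlam

The per-term rates (37) below are then obtained as C @ dx_dlam for any stacked rows C of interest, which is what Algorithm 2.2 uses to detect when an index should move between K^+, K^- and K^0.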

We can now give an algorithm for finding the solution path to a problem in the form (26). For the algorithm, we will need the rates of change in value of the terms c_k^T x + d_k as δx is added to x:

\frac{\partial (c_k^T x)}{\partial \lambda} = c_k^T \left( I - (C^0)^T \left( C^0 (C^0)^T \right)^{-1} C^0 \right) \left( -(C^+)^T \mathbf{1}_{n_+} + (C^-)^T \mathbf{1}_{n_-} + (C^0)^T \mathbf{1}_{n_0} \right)   (37)

Algorithm 2.2. Given: An optimization problem of the type (26).

1. Set λ = 0, x_λ = 0, and S = {(λ, x_λ)}.

2. Compute the sets K^+, K^- and K^0, as defined in (29).

3. Compute the values of µ_k^+ for all k from the KKT conditions.

4. Compute the directions given by (35) and (36).

5. Find the minimal δλ ≥ 0 such that one of the following conditions is satisfied:

(a) c_k^T x + d_k = 0, and the right hand side of (37) is negative, for some k ∈ K^+. Then move the corresponding k from K^+ to K^0.

(b) c_k^T x + d_k = 0, and the right hand side of (37) is positive, for some k ∈ K^-. Then move the corresponding k from K^- to K^0.

(c) µ_k^+ = λ and the corresponding element of the right hand side of (35) is positive for some k ∈ K^0. Then move the corresponding k from K^0 to K^+.

(d) µ_k^+ = 0 and the corresponding element of the right hand side of (35) is negative for some k ∈ K^0. Then move the corresponding k from K^0 to K^-.

Add δλ to λ. If there is no δλ ≥ 0 for which the conditions are satisfied, set λ = ∞.

6. Add the new pair (λ, x_λ) to S; S := {S, (λ, x_λ)}.

7. If λ = ∞, stop. Otherwise, go to 2.


2.3 Related solution paths for different problems

Apart from the classes described in the previous sections, there are several other problem classes that have piecewise linear solution paths. In fact, starting from one problem, we can derive a family of problems having the same solution path. This can be seen from the following observation (cf. Boyd and Vandenberghe (2004, Exercise 4.51)).

Observation 1. Suppose that L : D(L) ⊆ R^n → R(L) ⊆ R and J : D(L) → R(J) ⊆ R are convex functions defined on the same convex domain D(L), that

\min_x \; L(x) + \lambda J(x)   (38)

has a well-defined solution path² for λ ∈ [0, ∞], and that f_1 : R(L) → R and f_2 : R(J) → R are strictly increasing functions. Then the solution path of

\min_x \; f_1(L(x)) + \mu f_2(J(x))   (39)

for µ ∈ [0, ∞] is a subset of the solution path of (38).

Proof. It is trivial to see that λ = µ = 0 give the same optima. Now, if the observation is false, there is a µ_0 > 0 such that x_{µ_0} is a solution to (39) with µ = µ_0, but there is no λ such that x_{µ_0} is a solution to (38).

From (39) we get

J(x_{\mu_0}) = \min_{x : L(x) = L(x_{\mu_0})} J(x)   (40)

Hence, if x_{µ_0} does not belong to the solution path of (38), there is no x_λ on this solution path with L(x_λ) = L(x_{µ_0}). This means that for some value λ_0, there must be (at least) two solutions, x_1 and x_2, to (38), satisfying L(x_1) < L(x_{µ_0}) < L(x_2) (for large enough λ, the solutions x_λ must satisfy L(x_λ) ≥ L(x_{µ_0})). Furthermore, it must hold that

L(x_1) + \lambda_0 J(x_1) = L(x_2) + \lambda_0 J(x_2) < L(x_{\mu_0}) + \lambda_0 J(x_{\mu_0})

Due to the convexity of L and J, there is a ν ∈ (0, 1) such that L(νx_1 + (1 − ν)x_2) = L(x_{µ_0}) and

L(\nu x_1 + (1 - \nu) x_2) + \lambda_0 J(\nu x_1 + (1 - \nu) x_2) = L(x_2) + \lambda_0 J(x_2) < L(x_{\mu_0}) + \lambda_0 J(x_{\mu_0})

which implies that J(νx_1 + (1 − ν)x_2) < J(x_{µ_0}). But this contradicts (40), and the observation follows.

Example 1. Note that the solution paths do not need to be identical, as the following example shows: Let

L(x) = x^2, \qquad J(x) = (1 - x)^2, \qquad f_1(y) = y, \qquad f_2(y) = \begin{cases} y & y \leq \tfrac{1}{4} \\ y + \tfrac{1}{12} & y > \tfrac{1}{4} \end{cases}

²By a well-defined solution path we here mean that for all values of λ, there is at least one optimal solution.


Then the solution to

\min_x \; L(x) + \lambda J(x) = \min_x \; (1 + \lambda) x^2 - 2\lambda x + \lambda = \min_x \; (1 + \lambda) \left( x - \frac{\lambda}{1 + \lambda} \right)^2 + \frac{\lambda}{1 + \lambda}

is given by x = \frac{\lambda}{1 + \lambda}, which runs continuously from 0 towards 1 as λ → ∞. On the other hand,

\min_x \; f_1(L(x)) + \mu f_2(J(x)) = \min_x \begin{cases} (1 + \mu) \left( x - \frac{\mu}{1 + \mu} \right)^2 + \frac{\mu}{1 + \mu} & x \geq \tfrac{1}{2} \\ (1 + \mu) \left( x - \frac{\mu}{1 + \mu} \right)^2 + \frac{\mu}{1 + \mu} + \frac{\mu}{12} & x < \tfrac{1}{2} \end{cases}

For µ ≥ 1 we get the same solution as before, since the optimum then satisfies x ≥ 1/2. However, for µ ∈ [0, 1), letting x = 1/2 could give a smaller value than minimizing the second expression above. Hence, the minimum is given by

\min \left\{ \tfrac{1}{4}(1 + \mu), \; \frac{\mu}{1 + \mu} + \frac{\mu}{12} \right\}

where the first term is obtained by setting x = 1/2, and the second one by setting x = \frac{\mu}{1 + \mu}. Simple computations now give the overall solution

x = \begin{cases} \frac{\mu}{1 + \mu} & 0 \leq \mu \leq \tfrac{1}{2} \text{ or } \mu > 1 \\ \tfrac{1}{2} & \text{otherwise} \end{cases}

Remark 3. If f_1(L(x)) and f_2(J(x)) are convex, the solution paths are identical. This can be seen by applying Observation 1 to f_1(L(x)), f_2(J(x)), f_1^{-1}, and f_2^{-1}.

Figure 1: Cost functions for Example 2. (a) L(x) (solid) and J(x) (dashed). (b) The costs L(x) + λJ(x) (dotted) and f_1(L(x)) + µf_2(J(x)) (dash-dotted) for λ = µ = 1.

Example 2. If we do not assume convexity of L(x) and J(x), Observation 1 does not necessarily hold. For instance, let

L(x) = \begin{cases} -x & x \leq 0 \\ -2x & 0 < x \leq 1 \\ 2x - 4 & 1 < x \end{cases} \qquad J(x) = \begin{cases} -2x - 4 & x \leq -1 \\ 2x & -1 < x \leq 0 \\ x & 0 < x \end{cases}

(see Figure 1(a)) and let

f_1(y) = f_2(y) = \begin{cases} \tfrac{y}{2} & y \leq 0 \\ 2y & y > 0 \end{cases}

Then the solution path for (38) is

x_\lambda = \begin{cases} 1 & \lambda < 1 \\ \{-1, 1\} & \lambda = 1 \\ -1 & \lambda > 1 \end{cases}

while the solution path for (39) becomes

x_\mu = \begin{cases} 1 & \mu < 0.5 \\ [0, 1] & \mu = 0.5 \\ 0 & 0.5 < \mu < 2 \\ [-1, 0] & \mu = 2 \\ -1 & \mu > 2 \end{cases}

Figure 1(b) shows the cost functions of (38) and (39) when µ = λ = 1.

Where f_1, f_2, J and L are differentiable, the relationship between µ and λ can be established as follows (for simplicity, we only consider the case of Remark 3): For an optimal point x_µ on the solution path, for some λ it holds that

\nabla L(x_\mu) + \lambda \nabla J(x_\mu) = 0 = f_1'(L(x_\mu)) \nabla L(x_\mu) + \mu f_2'(J(x_\mu)) \nabla J(x_\mu)

If ∇J(x_µ) ≠ 0, this yields

\mu = \lambda \frac{f_1'(L(x_\mu))}{f_2'(J(x_\mu))}   (41)

If ∇J(x_µ) = 0, then ∇L(x_µ) = 0, and x_µ will be a minimum point for all λ and µ.

3 Direct weight optimization

The following section will outline how the piecewise linear solution path algorithm can be applied to Direct Weight Optimization (DWO, Roll (2003); Roll et al. (2005a,b)). DWO is a nonparametric method for system identification and function estimation, which for some function classes and under some assumptions has been shown to give optimal pointwise function estimates (in the sense of minimizing the maximum MSE; Nazin et al. (2006)).

Assume that we are given data {y(t), ϕ(t)}_{t=1}^N generated from

y(t) = f0(ϕ(t)) + e(t)   (42)

where f0 is an unknown function, f0 : R^n → R, and e(t) are zero-mean, i.i.d. random variables with variance σ², independent of ϕ(τ) for all τ. In this paper, we will assume that f0 belongs to a function class F, whose members can locally be approximately described by a given basis function expansion, with a known upper bound on the approximation error. More precisely, F is defined as follows:


Definition 1. Let F = F(D, D_θ, F, M) be the set of all functions f, for which there, for each ϕ0 ∈ D, exists a θ0(ϕ0) ∈ D_θ, such that

\left| f(\varphi) - \theta_0^T(\varphi_0) F(\varphi) \right| \leq M(\varphi, \varphi_0) \quad \forall \varphi \in D   (43)

Here, F(·) is a vector of given basis functions, θ0^T(ϕ0)F(ϕ) is a local (unknown) approximation of f(ϕ) around ϕ0, and M(ϕ, ϕ0) is a given upper bound on the approximation error. Figure 2 illustrates the definition for a case when θ0^T(ϕ0)F(ϕ) is linear and the bound M(ϕ, ϕ0) is quadratic. Examples of function classes that can be formulated in this way include the class of functions with Lipschitz continuous gradients (with a given Lipschitz constant L). Also systems with both stochastic and unknown-but-bounded noise terms can be handled within the DWO framework. For more details and examples of function classes covered by Definition 1, see, e.g., (Roll et al., 2005b).

Figure 2: Illustration of Definition 1: The true function f(ϕ) (thick line), the local approximation θ0^T(ϕ0)F(ϕ) (thin line), and the bounds θ0^T(ϕ0)F(ϕ) ± M(ϕ, ϕ0) (dashed).
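As a concrete instance of Definition 1 (the explicit expressions below are not written out in the report, but follow from the standard second-order Taylor bound), consider the class of functions whose gradient is Lipschitz continuous with constant L, i.e., ‖∇f(ϕ) − ∇f(ϕ')‖ ≤ L‖ϕ − ϕ'‖ for all ϕ, ϕ' ∈ D. With the affine basis

F(\varphi) = \begin{pmatrix} 1 \\ \varphi \end{pmatrix}, \qquad \theta_0(\varphi_0) = \begin{pmatrix} f(\varphi_0) - \nabla f(\varphi_0)^T \varphi_0 \\ \nabla f(\varphi_0) \end{pmatrix}

we get f(ϕ) − θ0^T(ϕ0)F(ϕ) = f(ϕ) − f(ϕ0) − ∇f(ϕ0)^T(ϕ − ϕ0), so (43) holds with

M(\varphi, \varphi_0) = \frac{L}{2} \| \varphi - \varphi_0 \|^2

i.e., m(ϕ1, ϕ2) = ½‖ϕ1 − ϕ2‖² in the scaled notation M = Lm used later in connection with (47).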

Now, given this information, how would we estimate f0(ϕ∗) for a given point ϕ∗? The idea behind DWO is to estimate f0(ϕ∗) by postulating that the estimate should be affine in y(t), i.e.,³

\hat{f}_N(\varphi^*) = w_0 + \sum_{t=1}^{N} w_t y(t)   (44)

and determine the weights (w0, w), w = (w1, . . . , wN), by minimizing an upper bound on the maximum MSE (4), i.e.,

\begin{aligned} mMSE(\varphi^*, w_0, w) &= \sup_{f_0 \in F} E\left[ \left( f_0(\varphi^*) - \hat{f}_N(\varphi^*) \right)^2 \,\middle|\, \{\varphi(t)\}_{t=1}^{N} \right] \\ &= \sup_{f_0 \in F} E\left[ \left( f_0(\varphi^*) - w_0 - \sum_{t=1}^{N} w_t \left( f_0(\varphi(t)) + e(t) \right) \right)^2 \,\middle|\, \{\varphi(t)\}_{t=1}^{N} \right] \end{aligned}   (45)

³This assumption is not very restrictive. For instance, note that any least-squares estimate based on a model that is linear in its parameters is affine in y(t).

where we have replaced f̂_N(ϕ∗) in (4) by using (42) and (44).

For the particular function class given by Definition 1, we can give an upper bound on the maximal MSE that leads to the following minimization (see Roll et al., 2005b):

\begin{aligned} \min_w \quad & \left( \sum_{t=1}^{N} |w_t| M(\varphi(t), \varphi^*) + M(\varphi^*, \varphi^*) \right)^2 + \sigma^2 \sum_{t=1}^{N} w_t^2 \\ \text{subj. to} \quad & \sum_{t=1}^{N} w_t F(\varphi(t)) - F(\varphi^*) = 0 \end{aligned}   (46)

This is a typical bias-variance trade-off problem, where the first term of (46) is an upper bound on the squared bias of f̂_N(ϕ∗), and the second term is the variance. The problem is convex, and the solution depends on the size of σ² compared to M. In many cases, both σ² and M are unknown, and have to be estimated along with the function. For simplicity, we may assume that at least the shape of the function M is given, so that M(ϕ1, ϕ2) = Lm(ϕ1, ϕ2), where m is a known function. Then, (46) may be rewritten as

\begin{aligned} \min_w \quad & \lambda \left( \sum_{t=1}^{N} m_t |w_t| + m_0 \right)^2 + \sum_{t=1}^{N} w_t^2 \\ \text{subj. to} \quad & \sum_{t=1}^{N} w_t F(\varphi(t)) - F(\varphi^*) = 0 \end{aligned}   (47)

where we have introduced the notation m_0 = m(ϕ∗, ϕ∗), m_t = m(ϕ(t), ϕ∗), and λ = L²/σ².

The design parameter λ can be interpreted as controlling the trade-off between the flexibility of the function class and the noise variance. A small λ will put effort on minimizing the variance term (by making more weights non-zero) at the expense of the bias error, while a large λ will decrease the bias error (by making the estimates more local; the exact properties depend on the shapes of F(ϕ) and m(ϕ1, ϕ2)). To select a good value for λ, problem (47) needs to be solved for various values of λ, and the solutions can be evaluated, e.g., by cross-validation (Härdle, 1990; Stenman, 1999). For known σ², one method for selecting λ was proposed in (Juditsky et al., 2004).

3.1 Solution path for DWO

Now we are ready to show that the DWO problem (47) has a piecewise linear solution path. Using Observation 1 and Remark 3, we can see that (47) has the same solution path as

\begin{aligned} \min_w \quad & \lambda \sum_{t=1}^{N} m_t |w_t| + \sum_{t=1}^{N} w_t^2 \\ \text{subj. to} \quad & \sum_{t=1}^{N} w_t F(\varphi(t)) - F(\varphi^*) = 0 \end{aligned}   (48)


But this problem is in the same form as (24), and hence has a piecewise linear solution path. Therefore, we can use Algorithm 2.2 to compute it, which leads to a significant gain in computational efficiency.
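As a point of reference only (not from the report), the following sketch sets up problem (48) for a single value of λ using the cvxpy modeling package (an external dependency assumed to be available); the names Phi, phi_star, m_fun, F_fun are hypothetical, and the pathfollowing algorithm of Section 2.2 replaces such per-λ solves.

import numpy as np
import cvxpy as cp

def dwo_weights(Phi, y, phi_star, m_fun, F_fun, lam):
    """Solve the reduced DWO problem (48) for a single lambda (reference sketch).
    Phi: (N, n) estimation regressors, y: (N,) outputs, phi_star: (n,) point of
    interest, m_fun(phi, phi0): bound shape m, F_fun(phi): vector of basis functions."""
    N = Phi.shape[0]
    m = np.array([m_fun(Phi[t], phi_star) for t in range(N)])
    F_mat = np.column_stack([F_fun(Phi[t]) for t in range(N)])   # columns F(phi(t))
    F_star = F_fun(phi_star)
    w = cp.Variable(N)
    cost = lam * cp.sum(cp.multiply(m, cp.abs(w))) + cp.sum_squares(w)
    prob = cp.Problem(cp.Minimize(cost), [F_mat @ w == F_star])
    prob.solve()
    # Estimate (44); w_0 is taken as 0 here, since (46)-(48) optimize over w only
    return w.value, float(w.value @ y)

# Hypothetical usage for the Lipschitz-gradient class (affine basis, quadratic bound):
# F_fun = lambda phi: np.concatenate(([1.0], phi))
# m_fun = lambda phi, phi0: 0.5 * np.linalg.norm(phi - phi0) ** 2
# w, f_hat = dwo_weights(Phi, y, phi_star, m_fun, F_fun, lam=1.0)

Solving this repeatedly on a grid of λ values is exactly the computation that the piecewise linear solution path makes unnecessary.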

3.2 An algorithm for DWO with unknown λ

With the algorithm for finding the solution path for a given DWO problem, the next step is to choose which λ = L²/σ² to use for the final function estimate. Here we will take a cross-validation approach (for other alternatives, see (Roll, 2003, Section 7.2), where the problem is discussed to some extent under the term "adaptive bandwidth selection", and (Juditsky et al., 2004)).

Assuming that a set of experimental data is collected, split it into two sets, called the estimation data set S_e and the calibration data set S_c. Now, the algorithm contains an offline and an online part:

Algorithm 3.1. Computation of DWO solution with estimated λ, using a calibration data set.

Offline part:

1. For each point ϕ(i) in the calibration data set, compute the DWO solution paths w^{(i)}(λ) and the corresponding function estimates f̂(ϕ(i), λ) using the estimation data set.

Online part:

1. Given a regression vector ϕ∗, for which the function value f(ϕ∗) is to be estimated, choose the k nearest neighbours of ϕ∗ from the calibration data set. Denote the set S_k^∗ ⊂ S_c.

2. Choose λ̂ as

\hat{\lambda} = \arg\min_{\lambda} \sum_{j : \varphi(j) \in S_k^*} \left( y(j) - \hat{f}(\varphi(j), \lambda) \right)^2

(we will call the function on the right hand side the k-nn cost function).

3. Compute the DWO solution f̂(ϕ∗, λ̂) using the estimation data set.

An advantage with this scheme is that the computation of the DWO solution paths for the calibration data can be done offline, once and for all. Hence, in the online part, we only need to find the k nearest neighbors and then minimize a cost function in one variable. In practice, the value of each term of the cost function can be computed for a number of representative λ values already during the offline part, which makes step 2 in the online part very fast.
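A minimal numpy sketch of the online part (illustrative only; calib_phi, calib_y, f_hat_grid and lambda_grid are hypothetical names for the quantities prepared during the offline part):

import numpy as np

def online_select_lambda(phi_star, calib_phi, calib_y, f_hat_grid, lambda_grid, k=30):
    """Online steps 1-2 of Algorithm 3.1: pick lambda-hat by the k-nn cost function.
    calib_phi: (Nc, n) calibration regressors, calib_y: (Nc,) calibration outputs,
    f_hat_grid[i, l]: offline DWO estimate f_hat(phi(i), lambda_grid[l])."""
    # 1. k nearest calibration neighbours of phi_star
    dists = np.linalg.norm(calib_phi - phi_star, axis=1)
    nn = np.argsort(dists)[:k]
    # 2. k-nn cost for every lambda on the grid, minimized over the grid
    knn_cost = ((calib_y[nn, None] - f_hat_grid[nn, :]) ** 2).sum(axis=0)
    lam_hat = lambda_grid[np.argmin(knn_cost)]
    return lam_hat, knn_cost

Step 3 is then a single DWO solve at λ̂, e.g. via the pathfollowing algorithm or a convex solver as in the sketch in Section 3.1.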

Having found an appropriate λ̂, all that remains is to find the DWO function estimate f̂(ϕ∗, λ̂) for that particular λ̂, which is a problem that can be efficiently solved, either through the pathfollowing algorithm, or with the help of some other convex optimization algorithm (Boyd and Vandenberghe, 2004).

The choice of k is a trade-off between relying on enough calibration data points, and at the same time keeping the k-nn cost function local enough. This can be compared to choosing the bandwidth for pilot estimates in, e.g., plug-in methods (Stenman, 1999). Preliminary experiments indicate that the quality of the function estimates is fairly robust to the choice of k, but this is something that should be further investigated.


3.3 Examples

Let us illustrate Algorithm 3.1 with two examples.

Figure 3: Data from the system (49). (a) Estimation data. (b) Calibration data. The 30 nearest neighbours of ϕ∗ are marked with circles (◦).

Example 3. First, consider the system

\begin{aligned} y(t) &= f_1(\varphi(t)) + e(t) \\ f_1(\varphi(t)) &= 0.5 + 0.4\varphi_1(t) - 0.2\varphi_2(t) + \varphi_1^2(t) - \varphi_2^2(t) \end{aligned}   (49)

where e(t) is a white noise term with variance 0.1. An estimation data set of 500 points was generated, where ϕ(t) was drawn from an N(0, I) distribution. The estimation data set is shown in Figure 3(a). Also, a similar calibration data set was generated.

Suppose now that we would like to estimate f_1(ϕ∗) for ϕ∗ = (0.5, 0.5)^T. To do this, the algorithm in Section 3.2 with k = 30 was used. Figure 3(b) shows the calibration data set with the 30 selected nearest neighbours highlighted. Running the algorithm gives the k-nn cost function shown in Figure 4(a), and the function estimates as a function of λ are plotted in Figure 4(b). We get λ̂ = 1.3461, which leads to one of the best function estimates in Figure 4(b). The resulting weights w(λ̂) are shown in Figure 4(c).

We can make some observations from the example. Choosing a value of λ means that we decide what trade-off between bias and variance to make. For the system (49), the overall bias does not increase that much when increasing the number of nonzero weights, since the effect of the positive second derivative in one direction is neutralised by the effect of the negative second derivative in another direction. This gives us a wider range of values of λ with approximately the same k-nn cost. This is not a major problem, since what we are primarily after is not λ in itself, but a good function value estimate f̂(ϕ∗, λ̂).


Figure 4: Result of running the DWO solution path algorithm on the system (49). (a) k-nn cost function. (b) Function value estimates f̂_1(ϕ∗, λ) (dotted line indicates f_1(ϕ∗), star (∗) indicates f̂_1(ϕ∗, λ̂)). (c) Weights corresponding to λ̂.


To confirm the argument above, let us instead consider the system

\begin{aligned} y(t) &= f_2(\varphi(t)) + e(t) \\ f_2(\varphi(t)) &= 0.5 + 0.4\varphi_1(t) - 0.2\varphi_2(t) + \varphi_1^2(t) + \varphi_2^2(t) \end{aligned}   (50)

and estimate f_2(ϕ∗) using the same experimental setup as before, with estimation and calibration data sets containing exactly the same regression vectors. For this system, the Hessian is positive definite everywhere, which means that the bias will increase with a decreasing λ (i.e., an increasing number of nonzero weights).

Figure 5: Result of running the DWO solution path algorithm on the system (50). (a) k-nn cost function. (b) Function value estimates f̂_2(ϕ∗, λ) (dotted line indicates f_2(ϕ∗), star (∗) indicates f̂_2(ϕ∗, λ̂)). (c) Weights corresponding to λ̂.

The resulting k-nn cost function, function estimates and weights are presented in Figure 5. Note that the differences in the k-nn cost function are much larger than for the previous system, which allows us to more unambiguously determine the value of λ̂.


Example 4. Consider the system

\begin{aligned} y(t) &= f(\varphi(t)) + e(t) \\ f(\varphi(t)) &= 1 + \sin(6 \arctan(\varphi_1(t) \varphi_2(t))) \end{aligned}   (51)

where e(t) is a white noise term with variance 0.1. An estimation data set of 500 points was generated, where ϕ(t) ∈ R² was drawn from a uniform distribution on [−2.5, 2.5]². A similar calibration data set was also generated, as well as a test set of 100 data points.

The function values were estimated for the test set using the proposed algorithm. For comparison, a sigmoidal neural network model was identified. Several models of different complexity were identified using the estimation data set, and the one performing best on the calibration data set, as well as on the test set, had 25 neurons. The results are shown in Table 1, where it can be observed that the performance of the proposed DWO algorithm is about 5% better than that of the best neural network found.

Estimation method              Cost
DWO with λ̂                     0.1436
Neural network (25 neurons)    0.1517

Table 1: Performance of the DWO solution path algorithm on a test set, measured as (1/N) Σ_j (f(ϕ∗(j)) − f̂(ϕ∗(j)))².

4 Conclusions

This paper has extended the use of piecewise linear solution paths suggested for LARS and LASSO in (Efron et al., 2004) to a more general setting, and proposed to use it for computation of DWO estimates. The benefit of exploiting the piecewise linear solution paths is that we can efficiently compute all solutions to a parametric optimization problem, such as the DWO problem for unknown bias-variance trade-off.

A cross-validation-based method for selecting the DWO weights was also proposed, and showed good performance in examples. There are many possible alternatives to this approach, including modified versions of Akaike's criteria (Akaike, 1973), Mallows' Cp criterion (Cleveland and Devlin, 1988; Mallows, 1973), etc. See also (Härdle, 1990; Stenman, 1999) and references therein. However, regardless of what method is used for choosing an appropriate λ value for the bias-variance trade-off, one can benefit from the reduced computational complexity obtained by simultaneously computing the solutions for all λ values.

A topic for further studies is to improve the computational complexity of Algorithms 2.1 and 2.2 by taking advantage of the specific problem structures. For instance, to further increase the efficiency of Algorithm 2.2, one could consider computing the expressions (35) and (36) recursively, similarly to what is done in, for instance, the RLS algorithm. Also, numerical issues should be studied in more detail. Furthermore, it would be interesting to compute the solutions in the other direction, i.e., for λ starting at ∞ and decreasing to 0. These could all be topics for further research.


References

H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, pages 267–281, 1973.

C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1-5):11–73, Feb. 1997a.

C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning for control. Artificial Intelligence Review, 11(1-5):75–113, Feb. 1997b.

A. Bemporad, M. Morari, V. Dua, and E. N. Pistikopoulos. The explicit linear quadratic regulator for constrained systems. Automatica, 38(1):3–20, Jan. 2002.

G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for local modelling and control design. International Journal of Control, 72(7-8):643–658, May 1999.

F. Borrelli. Constrained Optimal Control of Linear and Hybrid Systems, volume 290 of Lecture Notes in Control and Information Sciences. Springer-Verlag, 2003.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.

W. S. Cleveland and S. J. Devlin. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596–610, Sept. 1988.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

J. Guddat, F. Guerra Vasquez, and H. T. Jongen. Parametric Optimization: Singularities, Pathfollowing and Jumps. John Wiley & Sons, 1990.

W. Härdle. Applied Nonparametric Regression. Number 19 in Econometric Society Monographs. Cambridge University Press, 1990.

A. Juditsky, A. Nazin, J. Roll, and L. Ljung. Adaptive DWO estimator of a regression function. In NOLCOS, Stuttgart, Sept. 2004.

C. L. Mallows. Some comments on Cp. Technometrics, 15:661–676, 1973.

A. Nazin, J. Roll, and L. Ljung. Direct weight optimization for approximately linear functions: Optimality and design. In 14th IFAC Symposium on System Identification, Newcastle, Australia, Mar. 2006.

R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, New Jersey, 1970.

J. Roll. Local and Piecewise Affine Approaches to System Identification. PhD thesis, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden, Apr. 2003.


J. Roll, A. Nazin, and L. Ljung. Nonlinear system identification via direct weight optimization. Automatica, 41(3):475–490, Mar. 2005a.

J. Roll, A. Nazin, and L. Ljung. A general direct weight optimization framework for nonlinear system identification. In 16th IFAC World Congress on Automatic Control, Prague, Czech Republic, July 2005b.

S. Rosset and J. Zhu. Least angle regression - discussion. The Annals of Statistics, 32(2):469–475, Apr. 2004.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3), June 2007.

A. Stenman. Model on Demand: Algorithms, Analysis and Applications. PhD thesis, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden, 1999.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.

P. Tøndel, T. A. Johansen, and A. Bemporad. An algorithm for multi-parametric quadratic programming and explicit MPC solutions. Automatica, 39(3):489–497, Mar. 2003.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 68(1):49–67, 2006.

A Solution to the restricted parametric optimization problem (9)

In this section, the solution to

\begin{aligned} \min_x \quad & \lambda (c_{k_j}^T x + d_{k_j}) + \tfrac{1}{2} x^T Q_{i_j} x + f_{i_j}^T x + r_{i_j} \\ \text{subj. to} \quad & \tilde{A}_j x = \tilde{b}_j \end{aligned}   (52)

will be computed. The Lagrangian of (52) is

\mathcal{L}(x; \mu) = \lambda (c_{k_j}^T x + d_{k_j}) + \tfrac{1}{2} x^T Q_{i_j} x + f_{i_j}^T x + r_{i_j} - \mu^T (\tilde{A}_j x - \tilde{b}_j)

Differentiating with respect to x and setting to zero yields

0 = c_{k_j} \lambda + Q_{i_j} x + f_{i_j} - \tilde{A}_j^T \mu \;\Leftrightarrow\; x = Q_{i_j}^{-1} (\tilde{A}_j^T \mu - f_{i_j} - c_{k_j} \lambda)

Inserting this into \tilde{A}_j x = \tilde{b}_j gives

\tilde{A}_j Q_{i_j}^{-1} (\tilde{A}_j^T \mu - f_{i_j} - c_{k_j} \lambda) = \tilde{b}_j \;\Leftrightarrow\; \mu = (\tilde{A}_j Q_{i_j}^{-1} \tilde{A}_j^T)^{-1} \left( \tilde{A}_j Q_{i_j}^{-1} (f_{i_j} + c_{k_j} \lambda) + \tilde{b}_j \right)


and finally

x = Q_{i_j}^{-1} \left[ \left( \tilde{A}_j^T (\tilde{A}_j Q_{i_j}^{-1} \tilde{A}_j^T)^{-1} \tilde{A}_j Q_{i_j}^{-1} - I \right) (f_{i_j} + c_{k_j} \lambda) + \tilde{A}_j^T (\tilde{A}_j Q_{i_j}^{-1} \tilde{A}_j^T)^{-1} \tilde{b}_j \right]
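A quick numerical sanity check of this closed-form expression (illustration only, with randomly generated data; not part of the original report) compares it with a direct solve of the KKT system of (52):

import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 2
lam = 0.7
Q = rng.standard_normal((n, n)); Q = Q @ Q.T + n * np.eye(n)   # positive definite Q_{i_j}
A = rng.standard_normal((p, n))                                 # full-row-rank A-tilde_j
f, c, b = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(p)

# Closed-form expression for the solution of (52), i.e. equation (10)
Qi = np.linalg.inv(Q)
S = np.linalg.inv(A @ Qi @ A.T)
x_formula = Qi @ ((A.T @ S @ A @ Qi - np.eye(n)) @ (f + c * lam) + A.T @ S @ b)

# Reference: solve the KKT system of (52) directly
K = np.block([[Q, -A.T], [A, np.zeros((p, p))]])
rhs = np.concatenate([-(f + c * lam), b])
x_kkt = np.linalg.solve(K, rhs)[:n]

print(np.allclose(x_formula, x_kkt))   # True: both give the same minimizer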

B Uniqueness of (16)

Lemma 2. Assume that Q ∈ R^{m×m}, A ∈ R^{n×m}, B ∈ R^{p×m}, that Q = Q^T > 0, and that

\begin{bmatrix} A & \mathbf{1}_n \\ B & 0 \end{bmatrix}

has full row rank. Then

\begin{bmatrix} Q & 0 & A^T & B^T \\ 0 & 0 & \mathbf{1}_n^T & 0 \\ A & \mathbf{1}_n & 0 & 0 \\ B & 0 & 0 & 0 \end{bmatrix}

is nonsingular.

Proof. It is equivalent to show that

\begin{bmatrix} Q & 0 & A^T & B^T \\ 0 & 0 & \mathbf{1}_n^T & 0 \\ A & \mathbf{1}_n & 0 & 0 \\ B & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}   (53)

has a unique solution. To begin with, we can solve for x to get x = -Q^{-1} A^T z - Q^{-1} B^T w and

\begin{bmatrix} 0 & \mathbf{1}_n^T & 0 \\ \mathbf{1}_n & -A Q^{-1} A^T & -A Q^{-1} B^T \\ 0 & -B Q^{-1} A^T & -B Q^{-1} B^T \end{bmatrix} \begin{bmatrix} y \\ z \\ w \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}

If we multiply from the left by \begin{bmatrix} y & z^T & w^T \end{bmatrix}, this yields

0 = y \mathbf{1}_n^T z + z^T \mathbf{1}_n y + \begin{bmatrix} z^T & w^T \end{bmatrix} \begin{bmatrix} -A Q^{-1} A^T & -A Q^{-1} B^T \\ -B Q^{-1} A^T & -B Q^{-1} B^T \end{bmatrix} \begin{bmatrix} z \\ w \end{bmatrix} = - \begin{bmatrix} z^T & w^T \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix} Q^{-1} \begin{bmatrix} A^T & B^T \end{bmatrix} \begin{bmatrix} z \\ w \end{bmatrix}

since \mathbf{1}_n^T z = 0 by the second block row of (53). Hence, since Q > 0,

\begin{bmatrix} z^T & w^T \end{bmatrix} \begin{bmatrix} A & \mathbf{1}_n \\ B & 0 \end{bmatrix} = 0

Due to the full rank assumption, it follows that \begin{bmatrix} z^T & w^T \end{bmatrix} = 0. It now directly follows that also x = 0, and from the third block row of (53) we finally get y = 0.
