Doctoral Thesis in Applied and Computational Mathematics
Approaches to accelerate methods for solving systems of equations arising in nonlinear optimization
DAVID EK
Stockholm, Sweden 2020 www.kth.se
ISBN 978-91-7873-713-0 TRITA-SCI-FOU 2020:37
kth royal institute of technology
Approaches to accelerate methods for solving systems of equations arising in nonlinear optimization
DAVID EK
Doctoral Thesis in Applied and Computational Mathematics KTH Royal Institute of Technology
Stockholm, Sweden 2020
Academic Dissertation which, with due permission of the KTH Royal Institute of Technology,
is submitted for public defence for the Degree of Doctor of Philosophy on Friday the 18th of
December 2020, at 10:00 a.m. in U1, KTH Royal Institute of Technology, Brinellvägen 28A,
Stockholm, Sweden.
© David Ek
ISBN 978-91-7873-713-0
TRITA-SCI-FOU 2020:37
Printed by: Universitetsservice US-AB, Sweden 2020
Abstract
Methods for solving nonlinear optimization problems typically involve solving systems of equations. This thesis concerns approaches for accelerating some of those methods. In our setting, accelerating involves finding a trade-off between the computational cost of an iteration and the quality of the computed search direction. We design approaches for which theoretical results in ideal settings are derived. We also investigate the practical performance of the approaches within and beyond the boundaries of the theoretical frameworks with numerical simulations.
Paper A concerns methods for solving strictly convex unconstrained quadratic optimization problems. This is equivalent to solving systems of linear equations where the matrices are symmetric positive definite. The main focus is exact linesearch limited-memory quasi-Newton methods which generate search directions parallel to those of the method of preconditioned conjugate gradients. We give a class of limited-memory quasi-Newton methods. In addition, we provide a dynamic framework for the construction of these methods. The methods are meant to be particularly useful for solving sequences of related systems of linear equations. Such sequences typically arise as methods for solving systems of nonlinear equations converge.
Paper B deals with solving systems of nonlinear equations that arise in interior-point methods for bound-constrained nonlinear programming. Application of Newton's method gives sequences of systems of linear equations. We propose partial and full approximate solutions to the Newton systems. The partial approximate solutions are computationally inexpensive, whereas each full approximate solution typically requires a reduced Newton system to be solved. In addition, we suggest two Newton-like approaches, which are based upon the proposed partial approximate solutions, for solving the systems of nonlinear equations.
Paper C is focused on interior-point methods for quadratic programming.
We propose a structured modified Newton approach to solve the systems of nonlinear equations that arise. The modified Jacobians are composed of a previous Jacobian, plus one low-rank update matrix per succeeding iteration.
For a given rank restriction, we construct a low-rank update matrix such that the modified Jacobian becomes closest to the current Jacobian, in both 2-norm and Frobenius norm. The approach is structured in the sense that it preserves the nonzero pattern of the Jacobian.
The approaches suggested in Paper B and Paper C are motivated by asymptotic results in ideal theoretical frameworks. In particular, it is shown that the approaches become increasingly accurate as primal-dual interior-point methods converge. A demonstration of the practical performance is given by numerical results. The results indicate the performance and limitations of the approaches suggested.
We envisage that the approaches of Paper A, Paper B and Paper C can be useful in parallel, or in combination, with existing solvers in order to accelerate methods.
Paper D is more pedagogical in nature. We give a derivation of the method of conjugate gradients in an optimization framework. The result itself is well known but the derivation has, to the best of our knowledge, not been presented before.
Keywords: Nonlinear optimization, mathematical programming, interior-point methods, approximate solutions to systems of linear equations, method of conjugate gradients, quasi-Newton methods, modified Newton methods.
Sammanfattning

Methods for solving nonlinear optimization problems typically involve solving systems of equations. This thesis concerns strategies for accelerating some of these methods. In our setting, acceleration means finding a trade-off between the computational cost of an iteration and the quality of the computed search direction. We design strategies for which theoretical results in ideal settings are derived. We also investigate the practical performance of the strategies, within and beyond the boundaries of the theoretical settings, with numerical simulations.

Paper A treats methods for solving strictly convex quadratic optimization problems without constraints. This is equivalent to solving systems of linear equations where the matrices are symmetric and positive definite. The main focus is limited-memory quasi-Newton methods with exact linesearch that generate search directions parallel to those of the conjugate gradient method. We give a class of limited-memory quasi-Newton methods. In addition, we propose a dynamic framework for constructing these methods. The methods are intended to be particularly useful for solving sequences of related systems of linear equations. Such sequences typically arise as methods for solving systems of nonlinear equations converge.

Paper B concerns solving the systems of nonlinear equations that arise in interior-point methods for bound-constrained nonlinear optimization. Application of Newton's method gives sequences of systems of linear equations. We propose partial and full approximate solutions to the Newton systems. The partial approximate solutions are computationally cheap, whereas each full approximate solution typically requires a reduced Newton system to be solved. In addition, we propose two Newton-like strategies, based on the proposed partial approximate solutions, for solving the systems of nonlinear equations.

Paper C concerns interior-point methods for quadratic optimization with linear inequality constraints. We propose a structured modified Newton strategy for solving the systems of nonlinear equations that arise. The modified Jacobians are composed of a previous Jacobian plus one low-rank update matrix per succeeding iteration. For a given rank restriction, we construct a low-rank update matrix such that the modified Jacobian becomes closest to the current Jacobian, in both 2-norm and Frobenius norm. The strategy is structured in the sense that it preserves the sparsity pattern of the Jacobian.

The strategies proposed in Paper B and Paper C are motivated by asymptotic results in ideal theoretical settings. In particular, it is shown that the strategies become increasingly accurate as primal-dual interior-point methods converge. An account of their practical performance is given through numerical results. The results illustrate the practical performance of the proposed strategies as well as their limitations.

We envisage that the strategies of Paper A, Paper B and Paper C can be useful in parallel, or in combination, with existing solvers in order to accelerate methods.

Paper D is more pedagogical in nature. We give a derivation of the conjugate gradient method in an optimization framework. The result itself is well known, but the derivation has, to the best of our knowledge, not been presented before.

Keywords: Nonlinear optimization, mathematical programming, interior-point methods, approximate solutions to systems of linear equations, conjugate gradient method, quasi-Newton methods, modified Newton methods.
Acknowledgements
First and foremost, I would like to give my sincerest thanks to my advisor Anders Forsgren for granting me this opportunity. Thanks for sharing your knowledge, and for all the encouragement, guidance and support. Your enthusiasm, positive spirit and sense of humor have made this experience not only a privilege but also a pleasure. I am truly grateful for everything that I have learned from you, both academically and on a personal level, and for all the freedom that you have granted me throughout the process.
Also, many thanks to the other members of the faculty at the division, Krister Svanberg, Xiaoming Hu, Per Enqvist and Johan Karlsson, for sharing your knowledge and providing a friendly atmosphere, which have made my time at KTH more pleasant.
I also want to thank my friends and colleagues at KTH for the stimulating and friendly atmosphere. In particular, but without any particular order, thanks to Julian, Thomas, Parikshit, Samuel, Gerard and Simon, for the sports and other activities outside work which helped make this time extra enjoyable. I hope this continues.
Last but not least, thanks to all my friends and family for contributing to a lively and active free time. I am particularly grateful to John, Erik, Ani, and my parents Anna-Maria and Roger, for all the unconditional support and encouragement.
Finally, I would also like to acknowledge and thank the Swedish Research Council (VR) which partly financed the research included in this thesis.
Stockholm, November 2020 David Ek
Contents
Abstract iii
Acknowledgements vii
Contents ix
Part I: Introduction and Summary
1 Introduction 1
1.1 Unconstrained nonlinear programming . . . . 2
1.1.1 Linesearch methods . . . . 3
1.1.2 Quadratic problems . . . . 5
1.2 Constrained nonlinear programming . . . . 7
1.2.1 Primal-dual interior-point methods . . . . 8
Bound-constrained nonlinear programming . . . . 12
Quadratic programming . . . . 13
2 Thesis summary and contributions 15
2.1 Summary of appended papers . . . . 15
2.2 Main contribution . . . . 18
References 19
Part II: Scientific Papers

Paper A
Exact linesearch limited-memory quasi-Newton methods for minimizing a quadratic function
(joint with Anders Forsgren)
Preprint: https://arxiv.org/abs/1809.10590
(submitted to Computational Optimization and Applications)
Paper B
Approximate solution of system of equations arising in interior-point methods for bound-constrained optimization
(joint with Anders Forsgren)
Preprint: https://arxiv.org/abs/2004.04057
(submitted to Computational Optimization and Applications)

Paper C
A structured modified Newton approach for solving systems of nonlinear equations arising in interior-point methods for quadratic programming
(joint with Anders Forsgren)
Preprint: https://arxiv.org/abs/2009.07913
(submitted to SIAM Journal on Optimization)

Paper D
An optimization derivation of the method of conjugate gradients (joint with Anders Forsgren)
Preprint: http://arxiv.org/abs/2011.02337
(submitted to SIAM Review)
1 Introduction
Terminology associated with optimization tends to be ambiguous among the general public. Frequently, it is used in contexts of optimizing an experience, a strategy or some performance, etc. The goal is often to select the best decisions, according to some subjective measure, among some available alternatives through trial and error, or simply to make qualitative decisions which improve some previous state.
In these cases, optimization may be without regard to optimality, as it may not take into account all feasible alternatives or may not include a well-defined measure of the outcome. In contrast, this thesis concerns mathematical optimization, which typically means selecting, from a feasible set of alternatives, the elements that are best with respect to a certain well-defined measure.
The history of mathematical optimization is rich, with many subfields emerging and evolving over time. Optimization as a field began to take shape as mathematical programming gained traction in the 1940s, although its origins may be traced back further to work on continuous functions using calculus. As mathematical programming expanded and developed, it gave rise to a variety of subfields such as linear programming, nonlinear programming, integer programming and stochastic programming. During certain periods, some of these subfields were regarded as largely independent of each other and evolved accordingly. As theory and numerical methods developed across the subfields, they grew more interconnected, and nowadays all of them go under the umbrella of optimization. The problems arise in a great variety of areas, ranging from finance to healthcare to physics. A general finite real-valued optimization problem can be posed in the form
    minimize_{x ∈ F}   f(x),                                        (1.0.1)
where x is a vector of decision variables, F is the feasible set and f : D ⊆ R n → R is
the objective function. Depending on the nature of the objective function f and the
feasible set F these problems may be more or less challenging to solve. As problems
become more complex it quickly becomes nonviable to find analytic solutions and
instead numerical methods are employed. On a high level a numerical method is a
complete set of unambiguous protocols aimed to find the solution of a problem. The
goal in every optimization problem is to find globally optimal solutions, i.e., x ∗ ∈ F
1
1. Introduction
such that f (x ∗ ) ≤ f (x) for all x ∈ F . However, in general it may be intractable to determine the global properties of a problem. In consequence numerical methods often seek locally optimal solutions, i.e., x ∗ ∈ int(S) ∩ F , for a compact set S, such that f (x ∗ ) ≤ f (x) for all x ∈ S ∩ F . A typical tool to establish global properties of a solution is the concept of convexity. Indeed, locally optimal solutions are also globally optimal solutions if the problem is a convex problem, i.e., if F is a convex set and f (x) is a convex function for all x ∈ F [13, Ch. 4].
Notation

Throughout, ∇f and ∇²f denote the gradient and Hessian of a function f, respectively. Moreover, R(M) denotes the range of a matrix M and C² denotes the class of functions which are twice continuously differentiable. The notation "≻ 0" and "⪰ 0" means positive definite and positive semidefinite, respectively.
1.1 Unconstrained nonlinear programming
A twice continuously differentiable unconstrained optimization problem can be formulated as

    minimize_{x ∈ R^n}   f(x),                                      (P)

where f : R^n → R is assumed to be in C². A vector x* ∈ R^n that is a locally optimal solution of (P) satisfies

    ∇f(x*) = 0,                                                     (1.1.1a)
    ∇²f(x*) ⪰ 0.                                                    (1.1.1b)

Moreover, x* is a strict locally optimal solution if it in addition satisfies

    ∇²f(x*) ≻ 0.                                                    (1.1.1c)
The stationarity condition (1.1.1a) constitutes a first-order necessary condition for a locally optimal solution of (P). This condition combined with (1.1.1b) and (1.1.1c) forms second-order necessary conditions and second-order sufficient conditions, respectively. Numerical methods typically aim to find solutions which satisfy the stationarity condition (1.1.1a). For strictly convex problems, stationary solutions are also globally optimal solutions.
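As a concrete illustration (the function below is a hypothetical example, not taken from the thesis), the conditions in (1.1.1) can be verified numerically for a small strictly convex quadratic:

```python
import numpy as np

# Illustrative example: f(x) = x1^2 + x1*x2 + x2^2, a strictly convex
# quadratic with minimizer x* = 0.
def grad(x):
    return np.array([2.0 * x[0] + x[1], x[0] + 2.0 * x[1]])

def hess(x):
    return np.array([[2.0, 1.0], [1.0, 2.0]])

x_star = np.zeros(2)

# First-order necessary condition (1.1.1a): the gradient vanishes.
assert np.allclose(grad(x_star), 0.0)

# Second-order sufficient condition (1.1.1c): the Hessian is positive
# definite, checked here via its eigenvalues.
assert np.all(np.linalg.eigvalsh(hess(x_star)) > 0)
```

Since the sufficient conditions hold, x* = 0 is a strict locally optimal solution, and by convexity of this particular f also the global minimizer.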
The focus of this thesis is iterative methods where the aim at each iteration k is to improve a current solution x_k by moving to x_{k+1}. Two fundamental techniques to advance from x_k to x_{k+1} are linesearch methods and trust-region methods. A typical iteration in a trust-region method takes the form x_{k+1} = x_k + d_k, where the size of d_k is restricted by a parameter that depends on k. The step d_k is typically chosen such that x_{k+1} minimizes a local approximation of the objective function, see e.g., [17] [62, Ch. 4] [67, Ch. 6] for further details on trust-region methods.
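A minimal sketch of one trust-region step, using the Cauchy point: the minimizer of the local quadratic model along the steepest-descent direction within the radius. The function name and this particular step choice are illustrative assumptions; practical trust-region codes use more refined steps (e.g. dogleg or Steihaug-CG) together with a radius-update rule:

```python
import numpy as np

def cauchy_step(g, B, delta):
    # Minimize the model m(d) = g@d + 0.5*d@B@d along d = -tau*(delta/||g||)*g
    # subject to ||d|| <= delta (the Cauchy point).
    gnorm = np.linalg.norm(g)
    gBg = g @ B @ g
    if gBg <= 0.0:
        tau = 1.0                     # model decreases all the way to the boundary
    else:
        tau = min(gnorm**3 / (delta * gBg), 1.0)
    return -tau * (delta / gnorm) * g

# With B = I and a large radius the step is the full steepest-descent step -g;
# with a small radius the step is cut back to the trust-region boundary.
d_free = cauchy_step(np.array([1.0, 0.0]), np.eye(2), 10.0)
d_clip = cauchy_step(np.array([1.0, 0.0]), np.eye(2), 0.5)
```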
1.1.1 Linesearch methods
In a linesearch method, the search direction is calculated prior to the step length.
For a thorough introduction to these methods, see e.g., [62, Ch. 3] [67, Ch. 2].
A general iteration k takes the form

    x_{k+1} = x_k + α_k p_k,
where α_k is the step length and p_k is the search direction. The step length is typically determined such that sufficient decrease of some merit function is achieved at x_{k+1}. In general, the search direction is chosen such that p_k is a descent direction, i.e., ∇f(x_k)^T p_k < 0. This guarantees the existence of a step such that the objective function is decreased, i.e., the existence of α_k such that f(x_{k+1}) < f(x_k). The ideal strategy is exact linesearch, which, at iteration k, means choosing α_k as the solution of

    minimize_{α ≥ 0}   f(x_k + α p_k).                              (1.1.2)
The exact linesearch strategy involves solving an optimization problem on its own and may, in general, be too computationally expensive. More computationally tractable approaches include finding α k such that a certain condition is satisfied, e.g., Armijo, Wolfe, Goldstein, etc. [62, Ch. 3].
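One of the computationally tractable alternatives mentioned, backtracking until the Armijo sufficient-decrease condition holds, can be sketched as follows. The function name and the parameter defaults are common textbook choices, not values prescribed here:

```python
import numpy as np

def backtracking_armijo(f, grad_f, x, p, alpha0=1.0, c=1e-4, rho=0.5):
    # Shrink alpha until the Armijo condition
    #   f(x + alpha*p) <= f(x) + c*alpha*grad_f(x)@p
    # is satisfied; requires p to be a descent direction.
    fx, slope = f(x), grad_f(x) @ p
    assert slope < 0.0, "p must be a descent direction"
    alpha = alpha0
    while f(x + alpha * p) > fx + c * alpha * slope:
        alpha *= rho
    return alpha

# Example: f(x) = ||x||^2 with the steepest-descent direction.
f = lambda x: float(x @ x)
grad_f = lambda x: 2.0 * x
x = np.array([1.0])
alpha = backtracking_armijo(f, grad_f, x, -grad_f(x))
```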
Linesearch methods can be separated into subclasses which are determined by the order of the information that is utilized. The order of information affects the computational cost of an iteration. In the last couple of decades, there has been an increased interest in large-scale problems, partly due to the evolution of machine learning and the increased availability of large quantities of data. This has increased the demand for methods where each iteration is computationally inexpensive. Consequently, a substantial amount of research has been focused on first-order methods in recent decades. The perhaps most basic first-order method is gradient descent, in which the search direction at iteration k is chosen as the negative gradient, i.e., p_k = −∇f(x_k) [18]. Even though this is a descent direction, ∇f(x_k)^T p_k = −‖∇f(x_k)‖₂² < 0, it is well known that gradient descent may converge slowly [13, Ch. 9]. Other popular first-order methods are versions of coordinate descent, the alternating direction method of multipliers, stochastic gradient methods, etc. [3, 12]. In this work a particular focus is put on the method of conjugate gradients, in particular its linear version, which is described in detail in Section 1.1.2 below. For a thorough treatment of methods of conjugate gradients and their properties, see e.g., [50, 65], and [49] for a survey on nonlinear versions.
In regard to convergence rates, a general first-order method where each search direction is composed of a linear combination of observed gradients possesses at most a linear convergence rate in the vicinity of a strict locally optimal solution [60, Ch. 2].
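The slow linear convergence of gradient descent can be seen on a small ill-conditioned quadratic; the fixed step length 1/λ_max(H) and the example matrix are illustrative choices:

```python
import numpy as np

# f(x) = 0.5*x@H@x with condition number 100; gradient descent p_k = -grad f(x_k)
# with a fixed step length 1/lambda_max(H).
H = np.diag([1.0, 100.0])
x = np.array([1.0, 1.0])
step = 1.0 / np.max(np.linalg.eigvalsh(H))
for k in range(500):
    x = x - step * (H @ x)      # H @ x is the gradient of f
# The component along the small eigenvalue shrinks only by a factor 0.99
# per iteration, so hundreds of iterations still leave a visible error.
error = np.linalg.norm(x)
```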
The perhaps most well-known second-order method to find zeros of nonlinear equations is Newton's method, see e.g., Ortega and Rheinboldt [63]. At each iteration k, Newton's method finds a zero along a linearization of ∇f(x_k). This means solving

    ∇²f(x_k) p_k = −∇f(x_k).                                        (1.1.3)
If ∇²f(x_k) is positive definite then (1.1.3) gives the step, in both length and direction, to the minimizer of a local quadratic approximation at x_k. As iterates approach a solution the local quadratic approximations become increasingly accurate and a step length of one becomes natural. If ∇²f(x_k) is positive definite then (1.1.3) also generates descent directions and linesearch may be used as a globalization strategy. Many methods take unit steps when applicable and adjust otherwise, e.g., if an iterate becomes infeasible or does not produce sufficient decrease of a merit function.
In general the matrix of (1.1.3) may have undesirable properties, such as being singular, indefinite or expensive to factorize. Approaches which modify the matrix of (1.1.3) to attain certain properties are called modified Newton methods, see e.g., [62, Ch. 3] for an introduction to these methods. Paper C concerns a structured modified Newton approach where the matrix of (1.1.3) is modified so that the system becomes more computationally tractable to solve.
Newton's method may achieve rapid local convergence. If the Hessian is locally Lipschitz continuous around a strict locally optimal solution then Newton's method attains local quadratic convergence [63]. However, each iteration is, in general, computationally expensive compared to a first-order method as it requires the evaluation of second-order derivatives and the solution of the system (1.1.3).
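A bare Newton iteration based on (1.1.3), taking unit steps and omitting the globalization safeguards discussed above; the test function is a hypothetical example whose Hessian is positive definite everywhere:

```python
import numpy as np

def newton(grad_f, hess_f, x, tol=1e-10, max_iter=50):
    # Newton's method: solve the Newton system (1.1.3) and take a unit step.
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:
            break
        p = np.linalg.solve(hess_f(x), -g)   # Newton system
        x = x + p                            # unit step length
    return x

# Minimize f(x) = sum(x_i^4 + 0.5*x_i^2), whose unique minimizer is x* = 0.
grad_f = lambda x: 4.0 * x**3 + x
hess_f = lambda x: np.diag(12.0 * x**2 + 1.0)
x_star = newton(grad_f, hess_f, np.array([1.0, -0.5]))
```

Close to the solution the iterates converge quadratically, in line with the Lipschitz result cited above.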
Computationally tractable alternatives to second-order approaches include quasi-Newton methods. These were initially presented as variable metric methods in the 1950s by Davidon [20], and were a few years later characterized by Fletcher and Powell [33]. Quasi-Newton methods are first-order methods which sequentially build an approximation of local curvature as iterations proceed. Each iteration k may be interpreted as finding a zero along an approximate linearization of ∇f(x_k).
The search direction is given by

    B_k p_k = −∇f(x_k),                                             (1.1.4)

where B_k is an approximation of the true Hessian ∇²f(x_k). In comparison to Newton's method, the solution of (1.1.4) is the step to a zero along a linearization of ∇f(x_k) where the curvature information is approximated by B_k. If B_k is positive definite then (1.1.4) also gives the step to the minimizer of a local quadratic approximation at x_k with curvature information given by B_k. In particular, the Hessian approximation is sequentially built up as iterations proceed as B_{k-1} is updated by an update matrix U_k to

    B_k = B_{k-1} + U_k.

The matrix U_k is typically of low rank, composed of observed information of at most first order and chosen such that the iterates satisfy some kind of secant condition.
The iterates of the symmetric two-parameter Huang class satisfy the scaled secant condition

    B_k s_{k-1} = σ_k y_{k-1},

where s_{k-1} = x_k − x_{k-1}, y_{k-1} = ∇f(x_k) − ∇f(x_{k-1}) and σ_k is one of the free parameters [52]. The choice σ_k = 1 gives the well-known Broyden class where B_{k-1}
is updated by a rank-two matrix to

    B_k = B_{k-1} − (1 / (s_{k-1}^T B_{k-1} s_{k-1})) B_{k-1} s_{k-1} s_{k-1}^T B_{k-1} + (1 / (y_{k-1}^T s_{k-1})) y_{k-1} y_{k-1}^T + φ_{k-1} ω_{k-1} ω_{k-1}^T,

with

    ω_{k-1} = (s_{k-1}^T B_{k-1} s_{k-1})^{1/2} [ (1 / (y_{k-1}^T s_{k-1})) y_{k-1} − (1 / (s_{k-1}^T B_{k-1} s_{k-1})) B_{k-1} s_{k-1} ],
and φ_{k-1} as a free parameter [32]. The parameter φ_{k-1} = 0 gives the Broyden-Fletcher-Goldfarb-Shanno (BFGS) scheme and φ_{k-1} = 1 gives the Davidon-Fletcher-Powell (DFP) scheme. Quasi-Newton methods in the Broyden class generate identical iterates under exact linesearch [25]. Under reasonable assumptions, certain quasi-Newton schemes attain local superlinear convergence [24, Ch. 8-9] [14, 23] [62, Ch. 6] [31, Ch. 3].
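For φ_{k-1} = 0 the Broyden-class update above reduces to the BFGS formula, which can be written down directly and checked against the secant condition; the data below are synthetic:

```python
import numpy as np

def bfgs_update(B, s, y):
    # BFGS update (Broyden class with phi = 0):
    #   B_new = B - (B s s^T B)/(s^T B s) + (y y^T)/(y^T s).
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

rng = np.random.default_rng(0)
B = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)   # keeps y@s > 0 (curvature condition)
B_new = bfgs_update(B, s, y)

# The updated matrix is symmetric and satisfies the secant condition
# B_new s = y (the Huang condition above with sigma_k = 1).
assert np.allclose(B_new @ s, y)
assert np.allclose(B_new, B_new.T)
```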
1.1.2 Quadratic problems
We will be particularly interested in unconstrained quadratic optimization problems as such problems are fundamentally important. These problems can be formulated as

    minimize_{x ∈ R^n}   (1/2) x^T H x + c^T x,                     (QP)

where H = H^T ∈ R^{n×n} and c ∈ R^n. From (1.1.1) it follows that (QP) has a locally optimal solution only if H ⪰ 0 and there exists a vector x* ∈ R^n such that ∇f(x*) = 0, or equivalently,

    Hx* + c = 0,                                                    (1.1.5)

i.e., if −c ∈ R(H). Moreover, the solution is unique if H ≻ 0. The focus in this work is on the case where H ≻ 0. Solving (QP) is then equivalent to solving the system of linear equations in (1.1.5). As problems become large, direct methods typically become computationally intractable and iterative methods are considered. Herein, a particular emphasis is put on the method of conjugate gradients as proposed by Hestenes and Stiefel [51]. A derivation of the method in an optimization framework is given in Paper D. The method is an exact linesearch method that generates search directions which are mutually conjugate with respect to H, i.e., at iteration k the step length α_k is the solution of (1.1.2) and the search directions satisfy p_i^T H p_j = 0, i ≠ j. For quadratic problems it is rather natural to use exact linesearch since an explicit expression for the solution of (1.1.2) is attainable. The search directions and step lengths are given by
    p_k = −∇f(x_k),                                                                  k = 0,
    p_k = −∇f(x_k) + ((∇f(x_k)^T ∇f(x_k)) / (∇f(x_{k-1})^T ∇f(x_{k-1}))) p_{k-1},    k ≥ 1,
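A minimal sketch of the method for the strictly convex quadratic case, i.e. for solving Hx = −c. The step length used below is the exact-linesearch solution of (1.1.2) for a quadratic, α_k = −∇f(x_k)^T p_k / (p_k^T H p_k), and the example data are illustrative:

```python
import numpy as np

def cg_quadratic(H, c, x0, tol=1e-10):
    # Method of conjugate gradients for minimizing 0.5*x@H@x + c@x with
    # H symmetric positive definite; terminates in at most n steps in
    # exact arithmetic.
    x = x0
    g = H @ x + c                          # gradient of the quadratic
    p = -g
    for _ in range(len(x0)):
        if np.linalg.norm(g) <= tol:
            break
        Hp = H @ p
        alpha = -(g @ p) / (p @ Hp)        # exact linesearch (1.1.2)
        x = x + alpha * p
        g_new = H @ x + c
        beta = (g_new @ g_new) / (g @ g)   # quotient of squared gradient norms
        p, g = -g_new + beta * p, g_new
    return x

H = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([-1.0, -2.0])
x = cg_quadratic(H, c, np.zeros(2))
assert np.allclose(H @ x + c, 0.0)         # x solves (1.1.5)
```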