Distributed Multi-Agent Optimization via Dual Decomposition

HÅKAN TERELIUS


Abstract

In this master thesis, a new distributed multi-agent optimization algorithm is introduced. The algorithm is based upon the dual decomposition of the optimization problem, together with the subgradient method for finding the optimal dual solution.

The convergence of the new optimization algorithm is proved for communication networks with bounded time-varying delays and noisy communication. Further, an explicit bound on the convergence rate is given that shows the dependency on the network parameters.


Acknowledgements

There are a number of people who have been invaluable to me throughout my work on this thesis. This work began as a research project when I visited the California Institute of Technology, in Pasadena, USA, during the spring of 2009. I would like to thank the entire Control and Dynamical Systems department at Caltech for introducing me to the world of research, and for turning my visit into a pleasure.

My supervisor Prof. Richard M. Murray at Caltech, for inviting me to Caltech, always giving me new ideas and suggestions, and for being inspiring and encouraging every time I ran into his office in despair. Ph.D. Ufuk Topcu, for always having time for me, guiding me through the problems and helping me in every possible way. Your time has truly been invaluable.

Back in Sweden, I continued working on this project at the Automatic Control Laboratory, Royal Institute of Technology (KTH), Stockholm. I would also like to thank everyone at the department for giving me such a warm welcome that not even the water in Trosa canal could make me feel cold.

Prof. Henrik Sandberg, for inspiring discussions, support and a lot of patience throughout my work on this thesis.

Prof. Karl Henrik Johansson, for giving me the best gift a student can receive: the inspiration, encouragement and joy of doing research.

I also want to show my appreciation for the financial support I have received from Stiftelsen Frans von Sydows Hjälpfond during my studies at the Royal Institute of Technology, which has led to this master thesis.

Finally, I would like to express my gratitude to my family and friends, for supporting me throughout my visit at Caltech and the work on my master thesis.

Thank you!


Contents

Abstract i
Acknowledgements iii
Contents v
Notation xi
1 Introduction 1
1.1 Motivation . . . 1
1.2 Problem Formulation . . . 2
1.3 Outline . . . 2
2 Preliminaries 3
2.1 Mathematical Optimization . . . 3
2.2 Convex Optimization . . . 4
2.2.1 Concave Functions . . . 6
2.3 Subgradients . . . 6
2.4 Subgradient Methods . . . 8

2.4.1 Choosing Step Sizes . . . 8

2.4.2 Projected Subgradient Methods . . . 10

2.5 Decomposition Methods . . . 11
2.5.1 Primal Decomposition . . . 12
2.5.2 Dual Decomposition . . . 13
2.6 Graph Theory . . . 16
2.7 Distributed Optimization . . . 18
2.7.1 Centralized Optimization . . . 19
2.7.2 Decentralized Optimization . . . 19

2.8 Average Consensus Problem . . . 20

3 Primal Consensus Algorithm 23
3.1 Problem Definition . . . 23

3.2 Computation Model . . . 24


4.2 Centralized Algorithm . . . 36

4.3 Computation Model . . . 37

4.4 Decentralized Algorithms . . . 40

4.4.1 Halting Algorithm . . . 40

4.4.2 Constant-Delays Algorithm . . . 42

4.4.3 Time-Varying Delays Algorithm . . . 44

4.5 Convergence Analysis . . . 45

4.5.1 Constant-Delays Algorithm . . . 45

4.5.2 Time-Varying Delays Algorithm . . . 52

4.5.3 Noisy Communication Channels . . . 56

4.6 Communication Considerations . . . 61

4.6.1 Measuring Communication . . . 61

4.6.2 Primal Consensus Algorithm . . . 61

4.6.3 Dual Decomposition Algorithm . . . 62

4.7 Quadratic Objective Functions . . . 64

4.7.1 Simple Quadratic Example . . . 65

5 Numerical Results 69
5.1 Common Model . . . 69

5.2 Evaluating the Convergence Rate . . . 70

5.3 Comparing Primal and Dual Algorithms . . . 71

5.3.1 Connected Graph . . . 71

5.3.2 Line Graph . . . 75

5.3.3 Ring Graph . . . 78

5.3.4 Complete Graph . . . 81

5.4 Dual Decomposition Algorithm . . . 84

5.4.1 Without Bounded Subgradients . . . 84

5.4.2 Noisy Subgradients . . . 87

5.4.3 Halting Algorithm . . . 89

6 Conclusions and Remarks 93
6.1 Conclusions . . . 93

6.2 Summary of Contributions . . . 96

6.3 Future Research . . . 97

A Distributed Consensus Sum Algorithm 101


List of Figures

2.1 A convex function f , with the chord between the two points x and y. . . 5

2.2 A convex function f , with subgradients at the points A and B. . . . 7

2.3 An undirected graph with 7 nodes and 7 edges. . . 17

2.4 A directed graph with 7 nodes and 9 edges. . . 18

2.5 Communication graph for a centralized optimization problem. . . 19

2.6 Communication graph for a decentralized optimization problem. . . 20

4.1 Communication graph for the centralized optimization algorithm. . . 36

4.2 Communication graph for the decentralized optimization algorithm. . . 43

4.3 The intuition behind the delays in the convergence results. . . 51

4.4 Communication graph for the simple quadratic example. . . 66

4.5 The spectral radius of the dynamics matrix, for the quadratic optimiza-tion problem. . . 68

5.1 Connected communication graph for the first simulation. . . 72

5.2 The convergence rate’s dependency on the step size, for the first simulation. 73
5.3 The convergence rate for the first simulation, using the optimal step sizes. 75
5.4 Communication graph for the second simulation, with a line topology. . 76

5.5 The convergence rate’s dependency on the step size, for the line graph. . 77

5.6 The convergence rate for the second simulation, using the optimal step sizes. . . 78

5.7 Communication graph for the third simulation, with a ring topology. . . 79

5.8 The convergence rate’s dependency on the step size, for the ring graph. 80
5.9 The convergence rate for the third simulation, using the optimal step sizes. 81
5.10 Communication graph for the fourth simulation, with a complete graph topology. . . 82

5.11 The convergence rate’s dependency on the step size, for the complete graph. . . 83

5.12 The convergence rate for the fourth simulation, using the optimal step sizes. . . 84

5.13 Convergence rate with unbounded subgradients, and a small step size. . 85

5.14 Convergence rate with unbounded subgradients, and the optimal step size. 86
5.15 Convergence rate with unbounded subgradients, and a large step size. . 87


List of Algorithms

2.1 Primal Decomposition Algorithm . . . 13

2.2 Dual Decomposition Algorithm . . . 16

3.1 Primal Consensus Algorithm . . . 24

4.1 Centralized Optimization Algorithm . . . 36

4.2 Halted Decentralized Algorithm . . . 41

4.3 Constant-Delays Distributed Algorithm . . . 43

4.4 Time-Varying Delay, Distributed Algorithm . . . 45

4.5 Simple Quadratic Example Algorithm. . . 66

A.1 Leader Election Algorithm. . . 102


Notation

R The set of real-valued scalars.
C The set of complex-valued scalars.
(·)i The i:th element of the vector.
(·)i,j The (i, j):th element of the matrix.
(·)i,· The i:th row of the matrix.
(·)·,j The j:th column of the matrix.
(·)T The transpose of a matrix or vector.

x The n-dimensional real-valued optimization variable.

xi The i:th node’s estimate of the optimization variable.

λ The Lagrange dual optimization variables.

O (·) The Landau notation, describes the limiting behavior of a function when the argument tends towards infinity.

ρ(·) The spectral radius of the matrix.

1 The vector with each element equal to one.
||·||2 The Euclidean norm of the vector.


Chapter 1

Introduction

“To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science.”

Albert Einstein, 1879-1955.

1.1 Motivation

In the increasingly connected world we live in, the ability to compute an optimal distributed decision has gained a lot of attention in recent years [1, 2, 3, 4, 5, 6]. Distributed optimization problems appear in a broad range of practical applications; for example, the Internet consists of many users with different objectives [7]. The users are referred to as “agents”, and the interconnected network of agents is called a “multi-agent system”. The multi-agent systems considered here share an important feature: they consist of agents making local decisions while trying to coordinate the overall objective with the other agents in the system.

Continuing with the Internet as an example, each connected agent gains a utility from using the network, but since the capacity of the network is limited, each agent also has a cost associated with its usage of the network. Thus, each agent can formulate an optimization problem of maximizing its utility while minimizing its cost. Further, from a global perspective, we would like to maximize the total utility of all agents, while minimizing the total cost for all agents.

In this thesis, we are trying to tackle the underlying distributed optimization problem that is common to all these applications.

1.2 Problem Formulation

We consider an optimization problem, where a convex and potentially non-smooth objective function should be minimized. The objective function consists of N terms, each term being a function of the optimization variable x ∈ Rn,

$$\underset{x\in\mathbb{R}^n}{\text{minimize}} \quad \sum_{i=1}^{N} f_i(x).$$

Further, a connected graph, consisting of N agents, is associated with the optimization problem, and each agent is associated with one of the terms in the objective function. It is assumed that the part fi of the objective function is agent i’s own local objective function, and only agent i has the necessary knowledge to be able to compute it.

The goal in distributed multi-agent optimization is to solve this minimization problem in a distributed fashion, where the nodes are able to cooperatively find the optimal solution without any central coordinator.

1.3 Outline

The outline of this master thesis is as follows. In Chapter 2 we introduce the mathematical foundations on which this thesis is built.

In Chapter 3 we give a brief summary of the distributed multi-agent optimization algorithm developed by Nedić and Ozdaglar [2].

In Chapter 4 we develop a novel distributed optimization algorithm based on the dual decomposition principle. We further analyze its convergence rate with time-varying delays, and noisy communication. We also study the communication requirements necessary for the algorithm.

In Chapter 5 we present numerical simulations for the considered algorithms, and focus in particular on their convergence rate.


Chapter 2

Preliminaries

“To those who do not know mathematics it is difficult to get across a real feeling as to the beauty, the deepest beauty, of nature ... If you want to learn about nature, to appreciate nature, it is necessary to understand the language that she speaks in.”

Richard P. Feynman, 1967.

In this chapter, we present the mathematical foundations on which the following chapters are built. Many books and research papers have been devoted to these subjects, and it is unfortunately not possible to cover them here in all of their glory, but references are provided to some of the materials that have been very useful.

2.1 Mathematical Optimization

In mathematics, an optimization problem is the problem of finding an optimal solution among all feasible solutions. The standard form for an optimization problem can be written as [12, 13]

$$\begin{array}{ll} \underset{x\in\mathbb{R}^n}{\text{minimize}} & f(x),\\ \text{subject to} & c_i(x) \le b_i, \quad i = 1, \dots, m. \end{array} \qquad (2.1)$$

Thus, the standard problem is to choose the optimization variable x = (x1, . . . , xn) ∈ Rn in such a way that the scalar-valued objective function f : Rn → R attains its minimal value over the feasible domain. The feasible domain is determined by the

constraint functions ci(x) : Rn→ R, and can be defined as

D = {x | ci(x) ≤ bi i = 1, . . . , m} .

The points x ∈ D are said to be feasible points. The optimal value f* of problem (2.1) is the greatest lower bound of f over the feasible domain,
$$f^* = \inf \{\, f(x) \mid x \in D \,\}.$$

The optimal set Xopt of problem (2.1) is the set of feasible points that attain the optimal value,
$$X_{\mathrm{opt}} = \{\, x \in D \mid f(x) = f^* \,\}.$$

If the optimal set is non-empty, then the optimization problem is said to be solvable, and any vector in the optimal set is called an optimal solution (or global optimal solution), denoted by x*. Thus, for any other feasible point z ∈ D we must have f(x*) ≤ f(z).

The Euclidean distance between two points, x, y ∈ Rn, is defined as,

$$\|x - y\|_2 = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.$$

Further, the distance between a point x ∈ Rn and a set X ⊆ Rn is defined as the minimum distance from x to any point in X,

$$d(x, X) = \inf_{x' \in X} \|x - x'\|_2.$$

A vector x0 ∈ D is called a local optimal solution if it minimizes f in a neighborhood around x0, i.e., if there exists an r > 0 such that x0 solves the optimization problem (2.2):
$$\begin{array}{ll} \underset{x\in\mathbb{R}^n}{\text{minimize}} & f(x),\\ \text{subject to} & c_i(x) \le b_i, \quad i = 1, \dots, m,\\ & \|x - x_0\|_2 \le r. \end{array} \qquad (2.2)$$

An optimization problem like (2.1), but without any constraint functions is called an unconstrained optimization problem, and the feasible domain is then the entire Rn space. Similarly, if we want to emphasize that an optimization problem is constrained then we simply call it a constrained optimization problem.

2.2 Convex Optimization

A special and very important class of optimization problems (2.1) consists of those where both the objective function and the constraint functions are convex.

Definition 2.1. A function f : Rn → R is said to be convex if it satisfies the inequality

f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y), (2.3) for all x, y ∈ Rn and all α ∈ R, 0 ≤ α ≤ 1 (Fig. 2.1).

It is worth noticing that, in particular, all linear functions are convex, since they satisfy (2.3) with equality. Also, a quadratic function f(x) = xᵀQx + cᵀx + b is convex if the matrix Q is positive semidefinite.


Figure 2.1: A convex function f . The chord between (x, f (x)) and (y, f (y)) lies above the function graph between those two points.
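To make Definition 2.1 and the claim about quadratic functions concrete, the following small numerical check is a sketch (with arbitrarily chosen data, not taken from the thesis) that verifies inequality (2.3) for a quadratic function whose matrix Q is positive semidefinite.

```python
import numpy as np

# Numerical spot-check of the convexity inequality (2.3) for a quadratic
# f(x) = x^T Q x + c^T x + b with a positive semidefinite Q (data chosen arbitrarily).
rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))
Q = M.T @ M                      # Q = M^T M is positive semidefinite
c, b = rng.standard_normal(3), 0.7

f = lambda x: x @ Q @ x + c @ x + b

x, y = rng.standard_normal(3), rng.standard_normal(3)
for alpha in np.linspace(0.0, 1.0, 11):
    lhs = f(alpha * x + (1 - alpha) * y)
    rhs = alpha * f(x) + (1 - alpha) * f(y)
    assert lhs <= rhs + 1e-12    # the chord lies above the function graph
print("inequality (2.3) holds for the sampled points")
```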

Definition 2.2. A set C ⊆ Rn is said to be convex if the line segment between any two points in C also lies in C [14]. Thus, for any points x1, x2 ∈ C and any 0 ≤ α ≤ 1,
$$\alpha x_1 + (1 - \alpha) x_2 \in C.$$

Notice that if the constraint function ci(x) is a convex function, then the set of points satisfying the condition {x | ci(x) ≤ bi} is a convex set. Further, the intersection of convex sets is also convex, hence if all constraint functions ci are convex, then so is the feasible domain D.

The main reason that convex optimization has received much attention in the literature is the property given in the following theorem.

Theorem 2.1. For a convex optimization problem, any local optimal solution is also a global optimal solution.

Proof. Let x0 be a local optimal solution, i.e., assume there exists an r > 0 such that
$$f(x_0) \le f(x), \qquad (2.4)$$

for all x such that ||x − x0||2 ≤ r and ci(x) ≤ bi, i = 1, . . . , m. Assume that ˆx is a global optimal solution, with f (ˆx) < f (x0). Consider the line segment between x0 and ˆx, with the point x given by x = αˆx + (1 − α)x0, 0 < α < 1. By the convexity assumption (2.3) on the constraint function ci, we have

$$c_i(x) \le \alpha c_i(\hat{x}) + (1 - \alpha) c_i(x_0) \le \alpha b_i + (1 - \alpha) b_i = b_i.$$

Hence, x is also a feasible point. Further, by the convexity assumption (2.3) on the objective function, we have

$$f(x) \le \alpha f(\hat{x}) + (1 - \alpha) f(x_0) < f(x_0).$$
Let $\alpha = \frac{r}{2\|\hat{x} - x_0\|_2}$, and notice that 0 < α < 1 since $\|\hat{x} - x_0\|_2 > r$, and thus $\|x - x_0\|_2 < r$. This contradicts (2.4), and hence the assumption that $f(\hat{x}) < f(x_0)$ must be false.

The implication of this theorem is that, for convex optimization problems, it is enough to find a local optimal solution, which is in general a much easier problem than finding a global optimal solution. Many efficient algorithms exist with the purpose of finding a local optimal solution [12].

2.2.1 Concave Functions

A class of functions closely related to the convex functions is the class of concave functions.

Definition 2.3. A function f is said to be concave if −f is convex.

The optimization problem (2.1) can be rewritten as an equivalent maximization problem,

$$\begin{array}{ll} \underset{x\in\mathbb{R}^n}{\text{maximize}} & -f(x),\\ \text{subject to} & c_i(x) \le b_i, \quad i = 1, \dots, m. \end{array}$$

Thus, we realize that a convex minimization problem is equivalent to a concave maximization problem, and we will therefore refer to both of these as convex optimization problems.

2.3 Subgradients


Definition 2.4. A vector g ∈ Rn is a subgradient to f : X ⊆ Rn → R at the point x ∈ X if, for all other points z ∈ X, the following holds:
$$f(z) \ge f(x) + g^T(z - x). \qquad (2.5)$$
Further, the subdifferential of the function f at x ∈ X is the set of all subgradients to f at x, denoted by ∂f(x). Thus,
$$\partial f(x) = \left\{\, g : f(z) \ge f(x) + g^T(z - x) \ \ \forall z \in X \,\right\}. \qquad (2.6)$$
Similarly, for a concave function f : X ⊆ Rn → R, we call g a subgradient at x ∈ X if
$$f(z) \le f(x) + g^T(z - x) \qquad (2.7)$$
holds for all z ∈ X.

Remark. If f is a convex function, and differentiable at x, then there is exactly one subgradient of f at x, and it coincides with the gradient (Fig. 2.2).

Figure 2.2: A convex function f , with subgradients at the points A and B.
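As a concrete illustration of Definition 2.4, consider the pointwise maximum of affine functions, f(x) = max_i (a_iᵀx + b_i), which is convex but non-differentiable wherever the maximizer is not unique. Any row a_j attaining the maximum at x is a subgradient there. The following sketch uses hypothetical data, not taken from the thesis.

```python
import numpy as np

def f(x, A, b):
    """Pointwise maximum of affine functions: f(x) = max_i (a_i^T x + b_i)."""
    return np.max(A @ x + b)

def subgradient(x, A, b):
    """Return a subgradient of f at x: any maximizing row a_j works, since
    f(z) >= a_j^T z + b_j = f(x) + a_j^T (z - x) for all z."""
    j = np.argmax(A @ x + b)
    return A[j]

# One-dimensional example: f(x) = max(x, -x) = |x|.
A = np.array([[1.0], [-1.0]])
b = np.zeros(2)
print(subgradient(np.array([0.0]), A, b))   # [1.], a valid subgradient of |x| at 0
print(subgradient(np.array([-2.0]), A, b))  # [-1.], the gradient of |x| at -2
```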

2.4 Subgradient Methods

The subgradient method is a simple first-order algorithm for minimizing a non-differentiable convex function [16]. It is based on the well-known gradient descent method, originally proposed by Cauchy already in 1847 [17, 12], but extended to non-differentiable functions by replacing the gradient with a subgradient of the function.

Since the subgradient method is a first-order method, it can have a slow convergence rate compared to more advanced methods, such as Newton’s method or interior-point methods for constrained optimization. However, it does have some advantages: in particular, it can be used for non-differentiable functions and it has a lower memory requirement. Another reason that we are interested in the subgradient method is that it enables us to solve the optimization problem in a distributed fashion, as we shall see later.

Consider the unconstrained optimization problem
$$\text{minimize } f(x),$$
where f : Rn → R is a convex function. The subgradient method solves this optimization problem by the simple iterative algorithm
$$x(t+1) = x(t) - \alpha_t g(t). \qquad (2.8)$$
Here, x(t) denotes the estimate of the solution at step t, g(t) ∈ ∂f(x(t)) is a subgradient to f at x(t), and αt is the step size, chosen according to a step size rule which we will discuss next.
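A minimal sketch of iteration (2.8) in Python, here with a constant step size (one of the rules discussed next); the objective and subgradient oracle in the example are hypothetical and only serve to illustrate the update.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, alpha, num_iters):
    """Sketch of the subgradient iteration (2.8), x(t+1) = x(t) - alpha_t * g(t),
    here with a constant step size alpha_t = alpha.  Because the function value
    need not decrease at every step, the best iterate found so far is kept."""
    x = np.asarray(x0, dtype=float)
    f_best, x_best = f(x), x.copy()
    for t in range(num_iters):
        g = subgrad(x)            # any g in the subdifferential of f at x
        x = x - alpha * g         # subgradient step
        if f(x) < f_best:
            f_best, x_best = f(x), x.copy()
    return x_best, f_best

# Example: minimize f(x) = |x - 3| + |x + 1|; any point in [-1, 3] is optimal.
f = lambda x: abs(x[0] - 3.0) + abs(x[0] + 1.0)
subgrad = lambda x: np.array([np.sign(x[0] - 3.0) + np.sign(x[0] + 1.0)])
x_best, f_best = subgradient_method(f, subgrad, np.array([10.0]), alpha=0.5, num_iters=200)
print(x_best, f_best)   # ends up inside (or near) the optimal interval [-1, 3]
```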

2.4.1 Choosing Step Sizes

The convergence of the subgradient method in equation (2.8) depends on the choice of step sizes αt, and there are several different schemes in use. To analyze the convergence rates we need three assumptions:

Assumption 2.1. There exists at least one finite minimizer of f, denoted by x*. Let Xopt denote the optimal set, and let f* = f(x*) denote the optimal value of f.

Assumption 2.2. The norm of the subgradients is uniformly bounded by G,
$$\|g(t)\|_2 \le G \quad \forall t.$$

Assumption 2.3. The distance from the initial point to the optimal set is bounded by R,

d (x(0), Xopt) ≤ R.

For the normal gradient method the function value decreases at each iteration, but that is not necessarily the case for the subgradient method. Instead, it is the distance to the optimal set that decreases. Let x∗ be an optimal solution, then

$$0 \le \|x(t+1) - x^*\|_2^2 = \|x(t) - \alpha_t g(t) - x^*\|_2^2 = \|x(t) - x^*\|_2^2 - 2\alpha_t\, g(t)^T\left(x(t) - x^*\right) + \alpha_t^2 \|g(t)\|_2^2. \qquad (2.9)$$

Since g(t) ∈ ∂f(x(t)) is a subgradient to the function f, the subgradient definition (2.5) implies that $-g(t)^T(x(t) - x^*) \le -\left(f(x(t)) - f^*\right)$, and thus
$$0 \le \|x(t+1) - x^*\|_2^2 \le \|x(t) - x^*\|_2^2 - 2\alpha_t\left(f(x(t)) - f^*\right) + \alpha_t^2 \|g(t)\|_2^2.$$

This is a recursive expression in terms of the norm $\|x(t) - x^*\|_2^2$, and by expanding the relation back to the initial step t = 0, we get
$$0 \le \|x(0) - x^*\|_2^2 - 2\sum_{i=0}^{t}\alpha_i\left(f(x(i)) - f^*\right) + \sum_{i=0}^{t}\alpha_i^2\|g(i)\|_2^2. \qquad (2.10)$$
Since the function value f(x(i)) does not necessarily decrease at each iteration, we evaluate the method by remembering the best value fbest found up to step t. Thus, fbest(t) is defined as

$$f_{\text{best}}(t) = \min_{i=0,\dots,t} f(x(i)),$$
and with this definition, we have
$$\sum_{i=0}^{t}\alpha_i f(x(i)) \ge f_{\text{best}}(t)\sum_{i=0}^{t}\alpha_i.$$
Using this relation, and rearranging the terms in (2.10), yields
$$f_{\text{best}}(t) - f^* \le \frac{\|x(0) - x^*\|_2^2 + \sum_{i=0}^{t}\alpha_i^2\|g(i)\|_2^2}{2\sum_{i=0}^{t}\alpha_i}.$$

Notice that the optimal point x* can be chosen arbitrarily in Xopt, and thus $\|x(0) - x^*\|_2$ can be replaced with the distance to the optimal set, d(x(0), Xopt). Substituting the bounds from Assumptions 2.2 and 2.3 yields the main convergence result for the subgradient method,
$$f_{\text{best}}(t) \le f^* + \frac{R^2 + G^2\sum_{i=0}^{t}\alpha_i^2}{2\sum_{i=0}^{t}\alpha_i}. \qquad (2.11)$$

We can now give bounds on the convergence rate for the following step size rules.

Constant step size

The simplest choice is to use a constant step size αt = α, independent of t. In this case, the convergence result (2.11) becomes
$$f_{\text{best}}(t-1) \le f^* + \frac{R^2 + G^2\alpha^2 t}{2\alpha t}. \qquad (2.12)$$
Thus, the subgradient method will converge as $R^2/2\alpha t$ to within $G^2\alpha/2$ of optimality.

The result can be further strengthened by instead considering the average objective function value
$$f_{\text{avg}}(t-1) = \frac{1}{t}\sum_{i=0}^{t-1} f(x(i)).$$
Again, rearranging the terms in (2.10) shows that the average value satisfies the same bound as the best objective function value above,
$$f_{\text{avg}}(t-1) \le f^* + \frac{R^2 + G^2\alpha^2 t}{2\alpha t}. \qquad (2.13)$$

Constant step length

Instead of keeping the step size constant, by letting αt = γ/‖g(t)‖2 we ensure that the step length is constant. Here, γ > 0 is the length of the step in each iteration. The convergence result (2.11) becomes
$$f_{\text{best}}(t-1) \le f^* + \frac{R^2 + \gamma^2 t}{2\gamma t / G}. \qquad (2.14)$$
Thus, the subgradient method will converge as $GR^2/2\gamma t$ to within $G\gamma/2$ of optimality.

Square summable, but not summable step sizes

Both of the previously discussed step size rules only guarantee convergence to a neighborhood around the optimal value, but now we analyze a step size rule that can guarantee convergence to the optimal value. Let the step sizes satisfy
$$\alpha_t \ge 0, \qquad \sum_{t=0}^{\infty}\alpha_t^2 < \infty, \qquad \sum_{t=0}^{\infty}\alpha_t = \infty,$$
for example $\alpha_t = 1/(t+1)$.

Consider the convergence result (2.11): the numerator is finite, but the denominator tends towards infinity. Thus, this choice of step sizes guarantees convergence for the subgradient method, i.e.,
$$f_{\text{best}}(t) \to f^*, \qquad (2.15)$$
as t → ∞.

2.4.2 Projected Subgradient Methods

Subgradient methods can also be extended to constrained optimization problems. The projected subgradient method solves the constrained optimization problem (2.1) with the iterations

$$x(t+1) = P\left(x(t) - \alpha_t g(t)\right),$$

where P is the Euclidean projection onto the convex feasible set defined by the constraints ci(x) ≤ bi, i = 1, . . . , m. These iterations should be compared to the ordinary subgradient iterations (2.8).

Notice that the optimal set of the optimization problem is a subset of the feasible set defined by the constraint functions. Thus, the Euclidean projection onto the convex feasible set can only decrease the distance to the optimal solution, hence we have
$$\|x(t+1) - x^*\|_2^2 = \left\|P\left(x(t) - \alpha_t g(t)\right) - x^*\right\|_2^2 \le \|x(t) - \alpha_t g(t) - x^*\|_2^2.$$

By updating expression (2.9), it can be seen that the convergence result (2.11) for the ordinary subgradient method also holds for the projected subgradient method. Hence, the convergence bounds given in (2.12), (2.14) and (2.15) also hold for the projected subgradient method.
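The projected iteration differs from (2.8) only by the projection P, so a sketch needs just one extra step. The feasible set, objective and projection below are hypothetical choices made for illustration.

```python
import numpy as np

def projected_subgradient(f, subgrad, project, x0, alpha, num_iters):
    """Sketch of the projected subgradient method: an ordinary subgradient step
    followed by the Euclidean projection P onto the feasible set."""
    x = project(np.asarray(x0, dtype=float))
    f_best, x_best = f(x), x.copy()
    for _ in range(num_iters):
        x = project(x - alpha * subgrad(x))   # step, then project back onto the feasible set
        if f(x) < f_best:
            f_best, x_best = f(x), x.copy()
    return x_best, f_best

# Example: minimize |x - 5| subject to the box constraint 0 <= x <= 2.
f = lambda x: abs(x[0] - 5.0)
subgrad = lambda x: np.array([np.sign(x[0] - 5.0)])
project = lambda x: np.clip(x, 0.0, 2.0)      # Euclidean projection onto [0, 2]
print(projected_subgradient(f, subgrad, project, np.array([0.0]), 0.1, 100))
# converges to the boundary point x = 2
```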

2.5 Decomposition Methods

In mathematics, decomposition is the general concept of solving a problem by breaking it up into smaller subproblems, which can be solved independently, and then assembling the solution to the original problem from the solutions of the subproblems [18].

Decomposition methods have received a lot of attention for two particular reasons:

• If the complexity of the algorithm grows faster than linearly in the number of subproblems, then decomposing the problem into several subproblems can result in a significant performance gain. For example, consider the problem of inverting a block diagonal matrix, (2.16), with Gauss-Jordan elimination (a small code sketch follows after this list).
$$A = \begin{pmatrix} A_1 & 0 & \cdots & 0\\ 0 & A_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & A_m \end{pmatrix} \qquad (2.16)$$
Assume that each block Ai is an n × n matrix, and thus A is an nm × nm matrix. Inverting A directly with the general Gauss-Jordan elimination would require O((nm)³) operations. By the decomposition technique, the problem of inverting the matrix A can instead be solved by inverting the submatrices A1, . . . , Am of size n × n, which requires m·O(n³) operations in total.

• If the subproblems can be solved independently of each other, then the decomposition technique enables us to solve the subproblems in parallel. Traditionally, computers have been designed for serial computation, where tasks are solved one after another. Parallelization has for a long time mainly been used in high-performance computing, where large supercomputers have been built up from thousands of smaller and cheaper processors than what otherwise would have been possible [20, 21]. However, interest in it has grown lately due to the shift from single-core to multi-processor computer architectures, and ever faster networks [22, 23]. Thus, the ability to decompose a problem into many smaller subproblems that can be solved in parallel has never been more important. Decomposition methods can also improve the reliability of a large system [24].

Another aspect of parallelization that we will be more interested in is the application to multi-agent systems, where the decomposition methods yield a distributed optimization algorithm.
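The complexity argument in the first bullet above can be illustrated with a short sketch. The data is arbitrary, and numpy's general-purpose inverse is used in place of an explicit Gauss-Jordan elimination.

```python
import numpy as np

def invert_block_diagonal(blocks):
    """Invert a block diagonal matrix by inverting each block separately.
    Each of the m blocks is n x n, so the cost is m * O(n^3) instead of the
    O((n*m)^3) cost of inverting the full matrix directly.  The blocks are
    independent, so this loop could also run in parallel."""
    return [np.linalg.inv(A_i) for A_i in blocks]

# Small check against inverting the assembled matrix directly (blocks assumed invertible).
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((3, 3)) + 3 * np.eye(3) for _ in range(4)]
A = np.zeros((12, 12))
for i, A_i in enumerate(blocks):
    A[3 * i:3 * (i + 1), 3 * i:3 * (i + 1)] = A_i

direct = np.linalg.inv(A)
blockwise = invert_block_diagonal(blocks)
print(np.allclose(direct[0:3, 0:3], blockwise[0]))   # True
```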

A problem is called trivially parallelizable if the subproblems can be solved completely independently of each other. Consider for example the following optimization problem
$$\underset{x_1,\dots,x_n}{\text{minimize}}\ f(x_1,\dots,x_n) = \underset{x_1,\dots,x_n}{\text{minimize}}\ f_1(x_1) + \cdots + f_n(x_n).$$
This problem is trivially parallelizable since the subproblems
$$\underset{x_i}{\text{minimize}}\ f_i(x_i) \quad \forall i$$
can be solved independently of each other. Inverting the block diagonal matrix (2.16) is another trivially parallelizable example. More interesting situations occur when there is a coupling between the subproblems, and that is the situation which decomposition methods are trying to solve.

2.5.1 Primal Decomposition

The simplest decomposition method is called Primal Decomposition, because the optimization algorithm manipulates the primal variables [18]. Consider the following unconstrained optimization problem
$$\underset{x_1,x_2,y}{\text{minimize}}\ f(x_1,x_2,y) = \underset{x_1,x_2,y}{\text{minimize}}\ f_1(x_1,y) + f_2(x_2,y), \qquad (2.18)$$
where there is a coupling between the two subproblems f1 and f2 in the variable y. The variable y is commonly called the complicating variable, since it complicates the problem: if it were fixed, the two subproblems would decouple.

The primal decomposition method works by fixing the complicating variables, in this case y, and then solving the now decoupled subproblems. Thus, define
$$\phi_1(y) = \min_{x_1} f_1(x_1, y), \qquad \phi_2(y) = \min_{x_2} f_2(x_2, y),$$
and
$$\phi(y) = \phi_1(y) + \phi_2(y). \qquad (2.19)$$
The optimization problem (2.18) can then be expressed, using (2.19), as
$$\underset{x_1,x_2,y}{\text{minimize}}\ f(x_1,x_2,y) = \underset{y}{\text{minimize}}\ \phi(y).$$

This is called the master problem in primal decomposition. Notice that if the original problem is convex, then so is the master problem. The master problem can then be solved by any local optimization algorithm, for example the subgradient method.

If g1(y) is a subgradient to φ1(y) and g2(y) is a subgradient to φ2(y), then g1(y) + g2(y) is a subgradient to φ(y). The subgradient method solves the master problem by the iteration
$$y(t+1) = y(t) - \alpha_t\left(g_1(y(t)) + g_2(y(t))\right). \qquad (2.20)$$
Thus, the primal decomposition method together with the subgradient method can be used directly, as Algorithm 2.1, to solve the optimization problem in (2.18).

Algorithm 2.1: Primal Decomposition Algorithm
Input: Initial estimate y(0)
Output: Estimate y(T) at time T
1 for t = 0 to T−1 do
      // Solve the subproblems, can be done in parallel
2     Solve φ1(y(t)), and return g1(y(t))
3     Solve φ2(y(t)), and return g2(y(t))
      // Update the complicating variable
4     y(t + 1) = y(t) − αt (g1(y(t)) + g2(y(t)))
5 end
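Below is a minimal numerical sketch of Algorithm 2.1 on a toy problem chosen only for illustration, f1(x1, y) = (x1 − 1)² + (y − 2)² and f2(x2, y) = (x2 + 1)² + (y + 1)², for which the subproblems have closed-form solutions.

```python
# Minimal sketch of Algorithm 2.1 on a toy problem (hypothetical data):
#   f1(x1, y) = (x1 - 1)^2 + (y - 2)^2,   f2(x2, y) = (x2 + 1)^2 + (y + 1)^2.
# For fixed y the subproblems are solved in closed form, and the returned
# (sub)gradients of phi_1 and phi_2 drive the master update (2.20).

def solve_subproblem_1(y):
    x1 = 1.0                      # argmin over x1 for fixed y
    g1 = 2.0 * (y - 2.0)          # gradient of phi_1(y) = (y - 2)^2
    return x1, g1

def solve_subproblem_2(y):
    x2 = -1.0                     # argmin over x2 for fixed y
    g2 = 2.0 * (y + 1.0)          # gradient of phi_2(y) = (y + 1)^2
    return x2, g2

y, alpha = 0.0, 0.1
for t in range(100):
    _, g1 = solve_subproblem_1(y)   # can be done in parallel
    _, g2 = solve_subproblem_2(y)
    y = y - alpha * (g1 + g2)       # master update (2.20)
print(y)   # approaches 0.5, the minimizer of phi(y) = (y - 2)^2 + (y + 1)^2
```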

2.5.2 Dual Decomposition

Problem (2.18) can be rewritten as an equivalent constrained optimization problem by introducing two copies y1, y2 of the complicating variable y. The problem can be expressed as
$$\begin{array}{ll} \underset{x_1,x_2,y_1,y_2}{\text{minimize}} & f(x_1,x_2,y_1,y_2) = f_1(x_1,y_1) + f_2(x_2,y_2),\\ \text{subject to} & y_1 = y_2. \end{array} \qquad (2.21)$$

Notice how this simple reformulation of the optimization problem makes the objective function trivially parallelizable, but at the cost of adding the consistency constraint that requires the two copies to be equal.

We will proceed by applying the method of Lagrange multipliers [12, 25, 17]. Introduce the dual variable (or Lagrange multiplier) λ, and define the Lagrange function as
$$L(x_1,x_2,y_1,y_2,\lambda) = f_1(x_1,y_1) + f_2(x_2,y_2) + \lambda^T(y_1 - y_2). \qquad (2.22)$$

Further, the Lagrange dual function q is defined as the minimum value of the Lagrange function over all primal variables x1, x2, y1, y2, i.e.,
$$q(\lambda) = \inf_{x_1,x_2,y_1,y_2} L(x_1,x_2,y_1,y_2,\lambda). \qquad (2.23)$$

Notice that the Lagrange dual function is concave, even if the problem (2.21) is not convex, since it is the pointwise infimum of a family of functions that are affine in λ.

An important property of the Lagrange dual function is that it provides a lower bound on the optimal value, as is shown in Theorem 2.2.

Theorem 2.2 (Boyd [12]). Let f* be the optimal value of problem (2.21), and further let q be the Lagrange dual function defined in (2.23). Then, for any λ we have the lower bound
$$q(\lambda) \le f^*.$$

Proof. This can be realized by considering any feasible point (x̂1, x̂2, ŷ, ŷ) to (2.21), for which
$$L(\hat{x}_1,\hat{x}_2,\hat{y},\hat{y},\lambda) = f_1(\hat{x}_1,\hat{y}) + f_2(\hat{x}_2,\hat{y}) + \lambda^T(\hat{y} - \hat{y}) = f(\hat{x}_1,\hat{x}_2,\hat{y},\hat{y}).$$
Hence
$$q(\lambda) = \inf_{x_1,x_2,y_1,y_2} L(x_1,x_2,y_1,y_2,\lambda) \le L(\hat{x}_1,\hat{x}_2,\hat{y},\hat{y},\lambda) = f(\hat{x}_1,\hat{x}_2,\hat{y},\hat{y}).$$
Since (x̂1, x̂2, ŷ, ŷ) was an arbitrary feasible point, this holds in particular for the optimal solution to (2.21), thus giving us q(λ) ≤ f*.

Since the Lagrange dual function q(λ) gives us a lower bound on the optimal value f*, a natural problem is to maximize the lower bound. This leads us to the Lagrange dual problem,
$$\underset{\lambda}{\text{maximize}}\ q(\lambda). \qquad (2.24)$$

Let q* denote the optimal value of the convex optimization problem (2.24). Theorem 2.2 implies that q* ≤ f*, and this is also referred to as weak duality. However, if q* = f* then strong duality is said to hold. Strong duality does not hold in general, but, for example, Slater’s condition [12, 25] guarantees that strong duality holds for the convex optimization problem (2.21).

Theorem 2.3 (Slater’s condition). Consider the primal optimization problem
$$\begin{array}{ll} \underset{x\in\mathbb{R}^n}{\text{minimize}} & f(x),\\ \text{subject to} & c_i(x) \le b_i, \quad i = 1, \dots, m,\\ & h_j(x) = 0, \quad j = 1, \dots, k, \end{array}$$
where the functions f and ci are convex and the functions hj are affine. The feasible domain of this problem is
$$D = \{\, x \mid c_i(x) \le b_i,\ h_j(x) = 0, \quad i = 1, \dots, m,\ j = 1, \dots, k \,\}.$$
Slater’s condition states that strong duality holds if there exists an interior point x ∈ D with
$$c_i(x) < b_i, \quad i = 1, \dots, m.$$

Let us now continue with the dual decomposition, which, in contrast to the primal decomposition, works by fixing the dual variables λ. Define
$$\phi_1(\lambda) = \inf_{x_1,y_1}\left(f_1(x_1,y_1) + \lambda^T y_1\right), \qquad \phi_2(\lambda) = \inf_{x_2,y_2}\left(f_2(x_2,y_2) - \lambda^T y_2\right).$$
The Lagrange dual function can then be written as
$$q(\lambda) = \phi_1(\lambda) + \phi_2(\lambda),$$
and the corresponding maximization problem is the master problem in dual decomposition. Once again we can solve the master problem with the subgradient method. Notice that a subgradient to −φ1 is −y1 and a subgradient to −φ2 is y2, hence a subgradient to −q is y2 − y1, where y1 and y2 are obtained as the solutions to φ1 and φ2 respectively. The corresponding subgradient update rule is
$$\lambda(t+1) = \lambda(t) - \alpha_t\left(y_2(t) - y_1(t)\right).$$

Algorithm 2.2: Dual Decomposition Algorithm
Input: Initial estimate λ(0)
Output: Estimate λ(T) at time T
1 for t = 0 to T−1 do
      // Solve the subproblems, can be done in parallel
2     Solve φ1(λ(t)), and return y1(t)
3     Solve φ2(λ(t)), and return y2(t)
      // Update the dual variable
4     λ(t + 1) = λ(t) − αt (y2(t) − y1(t))
5 end
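For comparison, here is the same toy problem as above solved with Algorithm 2.2; again the objective functions are hypothetical and the subproblems are solved in closed form, so only the dual update is iterated.

```python
# Minimal sketch of Algorithm 2.2 on the toy problem (hypothetical data):
#   f1(x1, y1) = (x1 - 1)^2 + (y1 - 2)^2,  f2(x2, y2) = (x2 + 1)^2 + (y2 + 1)^2,
#   subject to y1 = y2.
# For fixed lambda the subproblems have closed-form minimizers, and the copies
# y1, y2 drive the dual update lambda(t+1) = lambda(t) - alpha*(y2(t) - y1(t)).

def solve_subproblem_1(lam):
    # argmin over (x1, y1) of f1(x1, y1) + lam * y1
    return 1.0, 2.0 - lam / 2.0          # (x1, y1)

def solve_subproblem_2(lam):
    # argmin over (x2, y2) of f2(x2, y2) - lam * y2
    return -1.0, -1.0 + lam / 2.0        # (x2, y2)

lam, alpha = 0.0, 0.5
for t in range(100):
    _, y1 = solve_subproblem_1(lam)      # can be done in parallel
    _, y2 = solve_subproblem_2(lam)
    lam = lam - alpha * (y2 - y1)        # dual (sub)gradient step
print(lam, y1, y2)   # lam -> 3, and both copies agree on y = 0.5
```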

2.6 Graph Theory

In this section, we introduce some notation and basic concepts from graph theory that will be useful for describing distributed multi-agent systems. A more rigorous description of this fascinating topic can be found in [26].

A graph G(V, E ) consists of a set of nodes (or vertices), denoted by V, and a set of edges E ⊆ V × V. We usually denote the number of nodes in a graph G by

NG = |V|, or simply N if the graph can be understood from the context. Further,

we usually label the nodes with numbers from 1 to N , thus V = {1, . . . , N }. We can further divide the graphs into two families, first the undirected graphs are the family of graphs where the edges are unordered pairs of vertices, hence (i, j) and (j, i) are considered to be the same edge, and (i, j) ∈ E if and only if (j, i) ∈ E . All other graphs are said to be directed graphs, and the edge (i, j) is said to be directed from node i to node j.

Two nodes that have an edge between them are said to be adjacent. The neighbors of a node i are those nodes which have an incoming edge from node i. We denote the neighbors of node i by Ni, thus

Ni= {j | (i, j) ∈ E } .

A path in G is a list of distinct vertices {v0, v1, . . . , vn} such that (vi, vi+1) ∈ E ,

i = 0, 1, . . . , n − 1. The number of edges in the path is the length of the path. The

path is said to go from v0 to vn.

A graph is strongly connected if there exists a path from every vertex to every other vertex in the graph. The distance dist(v0, vn), between two nodes v0 and vn, is the length of the shortest path between them, and ∞ if there is no such path. Further, the diameter of a graph is the greatest distance between any two nodes in the graph.

A cycle is a path {v0, v1, . . . , vn}, with the additional requirement that the edge (vn, v0) ∈ E exists. A graph without cycles is said to be acyclic. A graph where the greatest common divisor of the lengths of its cycles is one is said to be aperiodic.

An edge (i, i) is called a loop, and a graph without any loops is said to be simple. Notice that a strongly connected graph with at least one loop is aperiodic.

A tree is an undirected, connected graph with no cycles. In a tree, any two vertices are connected by a unique simple path, and a tree with N nodes has exactly N − 1 edges.

The usual way of picturing an undirected graph is by drawing a circle for each node, and joining two of the circles by a line if the corresponding nodes form an edge (Fig. 2.3).


Figure 2.3: An undirected graph with 7 nodes and 7 edges.

Similarly, a directed graph is pictured by drawing a circle for each node, and joining two circles with an arrow pointing from i to j if (i, j) forms an edge of the graph (Fig. 2.4).

Remark. When we are considering a multi-agent system we will usually think of the nodes as the agents, and the edges as the communication links between them.


Figure 2.4: A directed graph with 7 nodes and 9 edges.

2.7 Distributed Optimization

As we have mentioned before, the topic of this thesis is distributed optimization, which means that we have a set V of N agents that want to cooperatively solve an optimization problem. There are several possible ways to express the optimization problem, but we will focus on the case where the objective function f is already expressed as a sum of individual objective functions for the agents. Thus, we associate the local objective function fi with agent i, for each agent i ∈ V.

The optimization problem that the agents are trying to solve is
$$\underset{x\in\mathbb{R}^n}{\text{minimize}} \quad \sum_{i=1}^{N} f_i(x), \qquad (2.25)$$
thus, the local objective functions are coupled through the optimization variables. The agents can, for example, be computers in a computer network, or vehicles trying to stay in a certain formation. It is also assumed that only agent i has full knowledge about the function fi, thus the agents need to communicate with each other in order to solve the problem.

The problem is further restricted by a set of communication edges E , where an agent i can only communicate with agent j if there is an edge (i, j) ∈ E . In order for the problem to be solvable, it is assumed that the graph G(V, E ) is strongly connected.


2.7.1 Centralized Optimization

In centralized optimization there exists a special coordinator agent that is responsible for coordinating all other agents (Fig. 2.5). In view of the decomposition methods, the coordinator solves the master problem, while the other agents solve the subproblems.

Consider the Primal Decomposition Algorithm 2.1. There, agent 1 would solve the subproblem φ1(y(t)) and send the subgradient g1(y(t)) to the coordinator. Similarly, agent 2 would solve the subproblem φ2(y(t)) and send the subgradient g2(y(t)) to the coordinator. The coordinator would then be able to update the complicating variable y, and return the updated variable y(t + 1) to both agents.

Similarly, for the Dual Decomposition Algorithm 2.2, agents 1 and 2 would solve the subproblems φ1(λ(t)) and φ2(λ(t)) respectively. They then transmit the local copies of the complicating variables, y1(t) and y2(t), to the coordinator, who is then able to update the dual variable λ(t + 1).


Figure 2.5: Communication graph for a centralized optimization problem, where there is a special coordinator agent.

2.7.2 Decentralized Optimization

In decentralized optimization, there is no central coordinator, which means that all agents should be considered equivalent (Fig. 2.6). Compared to centralized optimization, this means that even the master problem has to be solved distributedly. Thus, decentralized optimization is in general a more difficult problem than the centralized optimization problem, and we will therefore focus on the decentralized optimization problem.

We then move on to the main contribution of this thesis: a decentralized Dual Decomposition Algorithm.


Figure 2.6: Communication graph for a decentralized optimization problem, where all agents are equal.

2.8 Average Consensus Problem

The average consensus problem is the distributed computational problem of finding the average of a set of initial values. The problem of distributed averaging comes up in many applications such as decentralized optimization, coordination of multi-agent systems, estimation and distributed data fusion [27, 5, 28, 6, 9].

Consider a connected network of N nodes, each node holding an initial value xi(0) ∈ R at time t = 0. The goal is to compute the average of these values, $\frac{1}{N}\sum_{i=1}^{N} x_i(0)$, with only local communication. Thus, a node i is only allowed to communicate with its neighbors Ni. Further, the goal is also that the nodes should reach consensus on the average value, i.e., every node’s value should converge to the average of the initial values,
$$\lim_{t\to\infty} x_i(t) = \frac{1}{N}\sum_{i=1}^{N} x_i(0) \quad \forall i \in \mathcal{V}. \qquad (2.26)$$

A standard algorithm to solve the average consensus problem is the following distributed linear iteration,
$$x_i(t+1) = \sum_{j=1}^{N} W_{i,j}\, x_j(t), \quad i = 1, \dots, N, \qquad (2.27)$$

where t = 0, 1, . . . are the time steps and W ∈ RN×N is the weight matrix. To enforce that each node only uses local information we have the requirement that Wi,j = 0 whenever node j is not a neighbor of node i (and j ≠ i).

Now, let x(t) be defined as the column vector with elements xi(t),
$$x(t) = \begin{bmatrix} x_1(t) & x_2(t) & \cdots & x_N(t) \end{bmatrix}^T,$$
then the update (2.27) can be written for the entire network as
$$x(t+1) = W x(t).$$
Expanding this equation recursively implies that x(t) = Wᵗ x(0), hence a necessary and sufficient condition for the convergence of this algorithm is that
$$\lim_{t\to\infty} W^t = \frac{\mathbf{1}\mathbf{1}^T}{N}, \qquad (2.28)$$

where 1 denotes the vector consisting of only ones. Boyd [27] showed the following theorem

Theorem 2.4. Equation (2.28) holds if and only if the following three conditions hold:

(i) $\mathbf{1}^T W = \mathbf{1}^T$

(ii) $W\mathbf{1} = \mathbf{1}$

(iii) $\rho\left(W - \mathbf{1}\mathbf{1}^T/N\right) < 1$

Before we continue, we will need another definition.

Definition 2.5.

• A vector v ∈ Rn is said to be a stochastic vector if the components vi are nonnegative and $\sum_{i=1}^{n} v_i = 1$.

• A matrix is said to be a stochastic matrix if all of its rows are stochastic vectors.

• A matrix is said to be a doubly stochastic matrix if all of its rows and columns are stochastic vectors.
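A small sketch of the linear iteration (2.27) on a ring graph. The Laplacian-based weight matrix W = I − εL used here is a standard choice (an assumption for this example, not prescribed by the text); for a connected undirected graph it is symmetric and doubly stochastic and satisfies the three conditions of Theorem 2.4 whenever 0 < ε < 1/d_max, where d_max is the largest node degree.

```python
import numpy as np

# Consensus iteration (2.27), x(t+1) = W x(t), on a 5-node ring (hypothetical data).
N = 5
edges = [(i, (i + 1) % N) for i in range(N)]          # ring topology

L = np.zeros((N, N))                                   # graph Laplacian
for i, j in edges:
    L[i, i] += 1; L[j, j] += 1
    L[i, j] -= 1; L[j, i] -= 1

eps = 0.3                                              # 0 < eps < 1/d_max = 1/2
W = np.eye(N) - eps * L                                # doubly stochastic weights

x = np.array([1.0, 4.0, 2.0, 8.0, 5.0])                # initial values x_i(0)
for t in range(200):
    x = W @ x                                          # weighted averaging with neighbors
print(x, x.mean())   # every entry approaches the initial average 4.0
```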


Chapter 3

Primal Consensus Algorithm

“The difficulty lies, not in the new ideas, but in escaping the old ones, which ramify, for those brought up as most of us have been, into every corner of our minds.”

John M. Keynes, 1935.

In this chapter we will review a distributed optimization method developed by A. Nedić and A. Ozdaglar [2, 3, 4]. This is the main method against which we will compare the optimization algorithm that we develop in the next chapter.

The distributed method is designed to let a multi-agent system find the optimal solution to the optimization problem of minimizing a sum of convex, not necessarily differentiable, objective functions. Every agent will try to minimize its own objective function, while only exchanging information with its closest neighbors over a time-varying topology.

Compared to the decomposition methods discussed in Sections 2.5.1 and 2.5.2, this method manipulates the primal variables, and the method can be viewed as a combination of the consensus algorithm, described in Section 2.8, and the subgradient method from Section 2.4. We will therefore denote this algorithm as the Primal Consensus Algorithm.

3.1 Problem Definition

Consider a network of N agents trying to minimize a common additive cost function. The agents want to cooperatively solve the unconstrained optimization problem

$$\underset{x\in\mathbb{R}^n}{\text{minimize}} \quad \sum_{i=1}^{N} f_i(x), \qquad (3.1)$$
where fi : Rn → R is a convex function only known by agent i.

We assume that all agents update their estimates simultaneously at the discrete times t = 0, 1, . . .. Let xi(t) ∈ Rn denote agent i’s estimate of the optimal solution at time t. The update rule that we consider is a combination of the average consensus update (2.27) and the subgradient method update (2.8). At each time step, every agent computes a weighted average of its neighbors’ current estimates, and then takes a step along the negative of its local subgradient,

$$x^i(t+1) = \sum_{j=1}^{N} W_{j,i}(t)\, x^j(t) - \alpha^i_t\, g^i(t). \qquad (3.2)$$

Algorithm 3.1: Primal Consensus Algorithm
Input: Initial estimates xi(0).
Output: Estimates xi(T) at time T
// The following algorithm is executed on each agent i.
1 for t = 0 to T−1 do
2     Compute a subgradient gi(t) to fi at the point xi(t).
      // Update the local estimate
3     xi(t + 1) = Σ_{j=1}^{N} Wj,i(t) xj(t) − αit gi(t)
4     Transmit the new primal variable xi(t + 1) to all neighbors.
5     Receive the primal variables xj(t + 1) from the neighbors.
6 end

To follow the notation used by A. Nedić and A. Ozdaglar [3, 2], the matrix W should be compared to the transpose of the consensus weight matrix. Further, αit > 0 is the step size used by agent i at time t, and gi(t) is the local subgradient to fi at agent i’s estimate xi(t).

Notice that the weights Wj,i are time dependent, and that this can correspond to a time-varying communication topology. If the weight Wj,i(t) is nonzero then that corresponds to agent i using agent j’s estimate at time t, hence the directed communication edge (j, i) is used during that time. This leads us to the definition of a time dependent directed graph (V, Et), with the edge set given by
$$\mathcal{E}_t = \{(j, i) \mid W_{j,i}(t) \neq 0\},$$
representing the communication at time t.
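A minimal sketch of update (3.2) on a fixed, undirected ring graph with a constant step size. The local objectives, weights and data below are hypothetical choices for illustration, not the computation model or simulation setup of the thesis.

```python
import numpy as np

# Primal Consensus update (3.2) on a ring of N agents, each with the local
# objective f_i(x) = |x - c_i| (so the sum is minimized at the median of the c_i).
# The doubly stochastic weights W = I - eps*L are an assumed standard choice.
N = 5
c = np.array([1.0, 2.0, 3.0, 7.0, 9.0])                 # local data, median = 3
edges = [(i, (i + 1) % N) for i in range(N)]            # ring topology

L = np.zeros((N, N))
for i, j in edges:
    L[i, i] += 1; L[j, j] += 1
    L[i, j] -= 1; L[j, i] -= 1
W = np.eye(N) - 0.3 * L                                 # symmetric, doubly stochastic

x = np.zeros(N)                                         # x[i] is agent i's estimate
alpha = 0.05                                            # constant step size
for t in range(2000):
    g = np.sign(x - c)                                  # local subgradients of |x_i - c_i|
    x = W @ x - alpha * g                               # consensus step + subgradient step
print(x)   # all estimates end up close to the median 3 (within roughly alpha)
```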

3.2 Computation Model

The first assumption states that each agent gives a significant weight to its own estimate as well as to all other estimates that are available to it. Also, the weight is zero for all estimates that are unavailable.

Assumption 3.1 (Weights Rule). The following is true for all t:

(a) There exists a real number η with 0 < η < 1 such that

(i) Wi,i(t) ≥ η for all i and t.

(ii) Wj,i(t) ≥ η for all i, j, t when (j, i) ∈ Et.

(iii) Wj,i(t) = 0 otherwise.

(b) The transpose of the weight matrix, W(t)ᵀ, is stochastic.

It is necessary for every agent to be able to influence every other agent, hence we assume that every pair of agents is able to influence each other an infinite number of times, possibly through a path of other agents.

Assumption 3.2 (Connectivity). The graph (V, E∞) is strongly connected, where the edge set is defined as
$$\mathcal{E}_\infty = \{(j, i) \mid (j, i) \in \mathcal{E}_t \text{ for infinitely many } t\}.$$

To further strengthen the Connectivity Assumption 3.2, we impose an upper bound on the intercommunication interval for all edges in E∞.

Assumption 3.3 (Bounded Intercommunication Interval). There exists an integer B ≥ 1 such that
$$(j, i) \in \mathcal{E}_\infty \;\Rightarrow\; (j, i) \in \bigcup_{k=t}^{t+B-1} \mathcal{E}_k, \quad \text{for all } t.$$

Similar to the average consensus problem, we require the weight matrix to be doubly stochastic at every time step to ensure that all agents’ estimates converge to the same value.

Assumption 3.4 (Doubly Stochastic Weights). The weight matrix W (t) is doubly stochastic for all t.

Assumption 3.5 (Simultaneous Information Exchange). The agents exchange information simultaneously,
$$(j, i) \in \mathcal{E}_t \;\Leftrightarrow\; (i, j) \in \mathcal{E}_t.$$

Let us introduce new weights Pj,i, called planned weights, that are used to determine the actual weights Wj,i. Let each agent i choose the planned weights Pj,i(t) that it would like to use if it receives an estimate from agent j during time step t. If, at time t, agent i is able to communicate with agent j, then it will send both its own estimate xi(t) and the planned weight Pj,i(t) to agent j. By the Simultaneous Information Exchange Assumption, if they are able to communicate then agent i will also receive the estimate xj(t) and the planned weight Pi,j(t) from agent j. Let the actual weights used by the agents be defined as

$$W_{j,i}(t) = \begin{cases} 1 - \sum_{k\neq i} W_{k,i}(t) & \text{if } i = j;\\ \min\left(P_{i,j}(t), P_{j,i}(t)\right) & \text{if } i \neq j \text{ and } (j, i) \in \mathcal{E}_t;\\ 0 & \text{otherwise.} \end{cases} \qquad (3.3)$$

Assumption 3.6 (Symmetric Weights). The following is true for all t:

(a) There exists a real number η with 0 < η < 1 such that

(i) Pj,i(t) ≥ η for all i, j, t.

(ii) $\sum_{j=1}^{N} P_{j,i}(t) = 1$.

(b) The actual weights are chosen according to equation (3.3).

We will now prove that this procedure guarantees that the weight matrix is doubly stochastic.

Proposition 3.1. Assume that the Simultaneous Information Exchange Assumption 3.5 and the Symmetric Weights Assumption 3.6 hold. Then so do the Weights Rule Assumption 3.1 and the Doubly Stochastic Weights Assumption 3.4.

Proof. First, notice that $\sum_{k\neq i} W_{k,i}(t) \le \sum_{k\neq i} P_{k,i}(t) \le 1 - \eta$, thus $W_{i,i}(t) = 1 - \sum_{k\neq i} W_{k,i}(t) \ge 1 - (1-\eta) = \eta$. Second, $\min(P_{i,j}(t), P_{j,i}(t)) \ge \eta$, thus $W_{j,i}(t) = \min(P_{i,j}(t), P_{j,i}(t)) \ge \eta$ if $(j,i) \in \mathcal{E}_t$, and $W_{j,i}(t) = 0$ otherwise.

Next, consider the column sum $\sum_{k} W_{k,i}(t) = W_{i,i}(t) + \sum_{k\neq i} W_{k,i}(t) = 1 - \sum_{k\neq i} W_{k,i}(t) + \sum_{k\neq i} W_{k,i}(t) = 1$, hence $W(t)^T$ is stochastic and the Weights Rule Assumption is satisfied.

Finally, the Simultaneous Information Exchange Assumption together with the equation (3.3) tells us that Wj,i(t) = Wi,j(t) = min(Pi,j(t), Pj,i(t)) if (j, i), (i, j) ∈ Et and Wj,i(t) = Wi,j(t) = 0 otherwise. Thus, W (t) is symmetric, and W (t)T being stochastic implies that W (t) is doubly stochastic.
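The construction (3.3) is easy to check numerically. The sketch below (hypothetical graph and planned weights) builds W from column-stochastic planned weights and verifies that it is symmetric and doubly stochastic, as Proposition 3.1 asserts.

```python
import numpy as np

# Numerical check of the planned-weights construction (3.3) on a small undirected
# graph.  Each agent i picks positive planned weights P[j, i] for itself and its
# neighbors that sum to one; the actual off-diagonal weights are min(P[i, j], P[j, i])
# and the diagonal absorbs the remaining mass.
N = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]                        # ring, undirected
neighbors = {i: [] for i in range(N)}
for i, j in edges:
    neighbors[i].append(j)
    neighbors[j].append(i)

rng = np.random.default_rng(0)
P = np.zeros((N, N))
for i in range(N):
    idx = neighbors[i] + [i]
    w = 0.1 + rng.random(len(idx))                              # positive planned weights
    P[idx, i] = w / w.sum()                                     # column i sums to one

W = np.zeros((N, N))
for i in range(N):
    for j in neighbors[i]:
        W[j, i] = min(P[i, j], P[j, i])                         # actual weight, eq. (3.3)
    W[i, i] = 1.0 - W[:, i].sum()                               # diagonal entry

print(np.allclose(W, W.T),
      np.allclose(W.sum(axis=0), 1.0),
      np.allclose(W.sum(axis=1), 1.0))                          # True True True
```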

Assumption 3.7 (Constant Step Size). The step size is constant and common to all agents,
$$\alpha^i_t = \alpha.$$

In the analysis, we will consider a related model where the agents cease to compute the subgradients gi(t) at some time t̂, but continue exchanging their estimates according to the consensus part of the algorithm.

Assumption 3.8 (Stopped Model). There exists a time t̂ such that
$$g^i(t) = \begin{cases} 0 & \text{if } t \ge \hat{t};\\ \text{a subgradient to } f_i \text{ at } x^i(t) & \text{otherwise,} \end{cases}$$
for all i.

Finally, we need some assumptions for the subgradient part, similar to Assumptions 2.1, 2.2 and 2.3 used in Section 2.4 about the subgradient method.

Assumption 3.9 (Subgradient Assumption). The following is true:

(i) The optimal set Xopt is nonempty.

(ii) The subgradients are uniformly bounded by G,
$$\|g^i\|_2 \le G$$
for all subgradients $g^i \in \partial f_i(x)$ and all i and x.

(iii) The distance from the initial points to the optimal set is bounded by R,
$$d\left(x^i(0), X_{\mathrm{opt}}\right) \le R,$$
for all i.

(iv) The initial values are bounded by αG,
$$\|x^i(0)\|_2 \le \alpha G$$
for all i.

3.3 Convergence Analysis

By recursively expanding the update rule (3.2) it is possible to express the estimate x^i(t+1) in terms of x^1(s), . . . , x^N(s), for any s ≤ t, as
$$\begin{aligned} x^i(t+1) = {} & \sum_{j=1}^{N} [W(s)W(s+1)\cdots W(t-1)W(t)]_{j,i}\, x^j(s) - \sum_{j=1}^{N} [W(s+1)\cdots W(t-1)W(t)]_{j,i}\, \alpha^j_s g^j(s) \\ & - \cdots - \sum_{j=1}^{N} [W(t-1)W(t)]_{j,i}\, \alpha^j_{t-2} g^j(t-2) - \sum_{j=1}^{N} [W(t)]_{j,i}\, \alpha^j_{t-1} g^j(t-1) - \alpha^i_t g^i(t). \end{aligned} \qquad (3.4)$$

Since the product W(s)W(s+1)···W(t) appears several times in the expression, let us define the matrix Φ(t, s), for any t ≥ s, as
$$\Phi(t, s) = W(s)W(s+1)\cdots W(t-1)W(t).$$
With this definition, the expression for the estimate x^i(t+1) in equation (3.4) can be simplified as
$$\begin{aligned} x^i(t+1) = {} & \sum_{j=1}^{N} [\Phi(t,s)]_{j,i}\, x^j(s) - \sum_{j=1}^{N} [\Phi(t,s+1)]_{j,i}\, \alpha^j_s g^j(s) \\ & - \cdots - \sum_{j=1}^{N} [\Phi(t,t-1)]_{j,i}\, \alpha^j_{t-2} g^j(t-2) - \sum_{j=1}^{N} [\Phi(t,t)]_{j,i}\, \alpha^j_{t-1} g^j(t-1) - \alpha^i_t g^i(t). \end{aligned} \qquad (3.5)$$

We are now going to study the properties of the Φ matrices, but first recall that the longest possible path in a graph with N nodes consists of N − 1 edges. Hence the Connectivity Assumption 3.2 implies that there is a path in E∞ between any two nodes of length at most N − 1. Further, the Bounded Intercommunication Interval Assumption 3.3 gives us an upper bound B on the time it takes the estimate xi to influence the estimate xj if the edge (i, j) belongs to E∞. Thus, let us define B̄ = (N − 1)B as the upper bound on the time it takes any node’s estimate to influence any other node’s estimate.


Lemma 3.2 (Lemma 4, [2]). Under the Weights Rule 3.1, Connectivity 3.2 and Bounded Intercommunication Interval 3.3 assumptions the following is true:

(a) The limit $\bar{\Phi}(s) = \lim_{t\to\infty}\Phi(t, s)$ exists for each s.

(b) The limit matrix $\bar{\Phi}(s)$ has identical columns and the columns are stochastic, $\bar{\Phi}(s) = \phi(s)\mathbf{1}^T$, where φ(s) is a stochastic vector.

(c) The columns converge to φ(s) with a geometric rate,
$$\left|[\Phi(t,s)]_{i,j} - [\phi(s)]_i\right| \le 2\,\frac{1+\eta^{-\bar{B}}}{1-\eta^{\bar{B}}}\left(1-\eta^{\bar{B}}\right)^{\frac{t-s}{\bar{B}}} \quad \forall i, j \text{ and } t \ge s.$$

If we also assume that the matrices W are doubly stochastic then we have the following corollary to Lemma 3.2

Corollary 3.3 (Propositions 1 and 2, [2]). Let the Weights Rule 3.1, Connectivity 3.2, Bounded Intercommunication Interval 3.3 and Doubly Stochastic Weights 3.4 assumptions hold (or the Connectivity 3.2, Bounded Intercommunication Interval 3.3, Simultaneous Information Exchange 3.5 and Symmetric Weights 3.6 assumptions hold). Then the following is true:

(a) The limit matrices $\bar{\Phi}(s) = \lim_{t\to\infty}\Phi(t, s)$ are doubly stochastic and correspond to a uniform steady state distribution for all s,
$$\bar{\Phi}(s) = \frac{1}{N}\mathbf{1}\mathbf{1}^T \quad \forall s. \qquad (3.6)$$

(b) The entries $[\Phi(t,s)]_{j,i}$ converge to $\frac{1}{N}$ as t → ∞ with a geometric rate,
$$\left|[\Phi(t,s)]_{i,j} - \frac{1}{N}\right| \le 2\,\frac{1+\eta^{-\bar{B}}}{1-\eta^{\bar{B}}}\left(1-\eta^{\bar{B}}\right)^{\frac{t-s}{\bar{B}}} \quad \forall i, j \text{ and } t \ge s.$$

Proof. Recall Proposition 3.1; it states that the Weights Rule and Doubly Stochastic Weights assumptions are implied by the Simultaneous Information Exchange and Symmetric Weights assumptions. Thus, we can proceed with the former assumptions.

(a) The assumption that W(t) is doubly stochastic for all t implies that the Φ(t, s) matrices are also doubly stochastic, since the product of two doubly stochastic matrices is also doubly stochastic. From Lemma 3.2 we know that the columns are identical, and since every row sums to one we have $[\phi(s)]_i = \frac{1}{N}$, i.e., $\bar{\Phi}(s) = \frac{1}{N}\mathbf{1}\mathbf{1}^T$.

(b) This follows directly from Lemma 3.2, with $[\phi(s)]_i = \frac{1}{N}$.

The Φ matrices we have just studied determine how the consensus part of the algorithm propagates, and we will now turn to the subgradient part. In particular, with the Constant Step Size Assumption 3.7, the iterates in equation (3.5) become
$$x^i(t+1) = \sum_{j=1}^{N} [\Phi(t,s)]_{j,i}\, x^j(s) - \alpha\sum_{r=s+1}^{t}\left(\sum_{j=1}^{N} [\Phi(t,r)]_{j,i}\, g^j(r-1)\right) - \alpha\, g^i(t). \qquad (3.7)$$

To further analyze the convergence, we will consider the related Stopped Model Assumption 3.8, where the agents cease to compute the subgradients at some time t̂. Let x̂ denote the iterates for the stopped model. Notice that x̂^i(t) = x^i(t) for t ≤ t̂, and for t > t̂ we have
$$\hat{x}^i(t) = \sum_{j=1}^{N} [\Phi(t-1,0)]_{j,i}\, x^j(0) - \alpha\sum_{r=1}^{\hat{t}}\left(\sum_{j=1}^{N} [\Phi(t-1,r)]_{j,i}\, g^j(r-1)\right), \qquad (3.8)$$
where we also let s = 0.

Using Corollary 3.3 it is evident that the limit $\lim_{t\to\infty}\hat{x}^i(t)$ exists and is independent of i. However, it does depend on the parameter t̂, thus let us define the limit as
$$y(\hat{t}) = \lim_{t\to\infty} \hat{x}^i(t).$$

By using the relation (3.6) from Corollary 3.3 we can express the limit as
$$y(\hat{t}) = \frac{1}{N}\sum_{j=1}^{N} x^j(0) - \alpha\sum_{r=1}^{\hat{t}}\left(\sum_{j=1}^{N}\frac{1}{N}\, g^j(r-1)\right).$$
Rewriting it as a recursive equation in t̂ yields the expression
$$y(\hat{t}+1) = y(\hat{t}) - \frac{\alpha}{N}\sum_{j=1}^{N} g^j(\hat{t}). \qquad (3.9)$$

Notice the similarity with the subgradient method update in (2.8). However, the vector g^j(t̂) is a subgradient to fj at x^j(t̂), not at y(t̂), but it can be used as an approximation of the subgradient at y(t̂), as is done in the following lemma.

Lemma 3.4 (Lemma 5, [2]). Let the sequence {y(t)} be generated by the iteration (3.9), and let the sequences {x^j(t)} be generated by (3.7). Let {d^j(t)} denote the sequence of subgradients to fj at y(t). For any x ∈ Rn and all t ≥ 0 we have
$$\|y(t+1) - x\|_2^2 \le \|y(t) - x\|_2^2 + \frac{2\alpha}{N}\sum_{j=1}^{N}\left(\|g^j(t)\|_2 + \|d^j(t)\|_2\right)\|y(t) - x^j(t)\|_2 - \frac{2\alpha}{N}\left[f(y(t)) - f(x)\right] + \frac{\alpha^2}{N^2}\Big\|\sum_{j=1}^{N} g^j(t)\Big\|_2^2.$$

Corollary 3.5. In particular, for any optimal solution x* ∈ Rn to (3.1) we have
$$\|y(t+1) - x^*\|_2^2 \le \|y(t) - x^*\|_2^2 + \frac{2\alpha}{N}\sum_{j=1}^{N}\left(\|g^j(t)\|_2 + \|d^j(t)\|_2\right)\|y(t) - x^j(t)\|_2 - \frac{2\alpha}{N}\left[f(y(t)) - f^*\right] + \frac{\alpha^2}{N^2}\Big\|\sum_{j=1}^{N} g^j(t)\Big\|_2^2.$$

In the following proposition we will give the main convergence result for the Primal Consensus Algorithm. It consists of both a uniform bound on the difference between y(t) and x^i(t), and upper bounds on the objective function for the average estimates y_avg(t) and x^i_avg(t), defined as
$$y_{\text{avg}}(t) = \frac{1}{t}\sum_{k=0}^{t-1} y(k), \qquad x^i_{\text{avg}}(t) = \frac{1}{t}\sum_{k=0}^{t-1} x^i(k).$$
The main convergence result is

Theorem 3.6 (Proposition 3, [2]). Let the Connectivity, Bounded Intercommunication Interval, Simultaneous Information Exchange, Symmetric Weights, Constant Step Size and Subgradient assumptions hold. We then have

(a) For every agent i, a uniform upper bound on $\|y(t) - x^i(t)\|_2$ is given by
$$\|y(t) - x^i(t)\|_2 \le 2\alpha G C_1 \quad \forall t \ge 0,$$
where
$$C_1 = 1 + \frac{N}{1 - \left(1-\eta^{\bar{B}}\right)^{1/\bar{B}}}\cdot\frac{1+\eta^{-\bar{B}}}{1-\eta^{\bar{B}}}.$$

(b) An upper bound on the objective function for the average values f(y_avg(t)) is given by
$$f(y_{\text{avg}}(t)) \le f^* + \frac{N R^2 + G^2\alpha^2 C t}{2\alpha t} \quad \forall t \ge 1, \qquad (3.10)$$
and an upper bound on the objective function for each agent i is given by
$$f(x^i_{\text{avg}}(t)) \le f^* + \frac{N R^2}{2\alpha t} + \frac{\alpha G^2 C}{2} + 2\alpha N G^2 C_1 \quad \forall t \ge 1,$$
where C = 1 + 8N C₁.

This shows that the error between y(t) and x^i(t) is bounded by a constant that depends on the step size α. The second part shows how the objective function converges to a neighborhood around the optimal value when a constant step size is used. The result is similar to, and should be compared with, the pure subgradient method in equation (2.13),
$$f_{\text{avg}}(t-1) \le f^* + \frac{R^2 + G^2\alpha^2 t}{2\alpha t}.$$

If we instead consider the minimal function value rather than the average, we get the following trivial corollary.

Corollary 3.7. Let f_best(t) be defined as
$$f_{\text{best}}(t) = \min_{i=0,\dots,t} f(y(i)),$$
then, since the objective function f is convex,
$$f_{\text{best}}(t-1) \le f(y_{\text{avg}}(t)) \le f^* + \frac{N R^2 + G^2\alpha^2 C t}{2\alpha t} \quad \forall t \ge 1.$$

Chapter 4

Dual Decomposition Algorithm

“Nothing tends so much to the advancement of knowledge as the application of a new instrument. The native intellectual powers of men in different times are not so much the causes of the different success of their labours, as the peculiar nature of the means and artificial resources in their possession.”

Sir Humphry Davy, 1812.

In this chapter we describe a new distributed optimization algorithm, based on the dual decomposition principle in Section 2.5.2.

The distributed optimization method is designed in the spirit of the Primal Consensus Algorithm from Chapter 3, where a multi-agent system tries to solve the optimization problem of minimizing a sum of convex objective functions. There are, however, some significant differences between the models, as we will see in Section 4.3. From the decomposition method we develop both centralized and decentralized algorithms, in Sections 4.2 and 4.4 respectively.

In particular, we are able to prove the convergence of the algorithms with time-varying delays, and noisy communication channels. The proofs are presented in Section 4.5. We further investigate some aspects of the communication model, and especially some ways of limiting the communication in Section 4.6.

Finally, in Section 4.7 we explore how the algorithm behaves for quadratic cost functions.

4.1 Dual Problem Definition

The optimization problem we are considering is similar to the one in the previous chapter. Consider a multi-agent network consisting of N agents, trying to minimize a common additive cost function,

$$\underset{x\in X\subseteq\mathbb{R}^n}{\text{minimize}} \quad \sum_{i=1}^{N} f_i(x), \qquad (4.1)$$

where fi : Rn → R is a convex function only known by agent i. Notice that we introduced the constraint set X ⊆ Rn, where we assume X is convex and has a nonempty interior so that Slater’s condition guarantees strong duality.

We will now apply the dual decomposition method from Section 2.5.2 to this problem, but first, let us partition the state x ∈ Rn as follows. Let
$$x = \begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_N \end{pmatrix},$$
where $x_i \in \mathbb{R}^{n_i}$ for i = 1, . . . , N and with $\sum_{i=1}^{N} n_i = n$. Notice that some ni could be zero. This partitioning can be done arbitrarily, but the idea is to associate part xi with agent i. In many cases there is a natural partitioning, where xi is agent i’s internal state; for example, in the problem with vehicle formations xi could contain vehicle i’s position and velocity.

The optimization problem in (4.1) can now be rewritten as
$$\underset{x\in X\subseteq\mathbb{R}^n}{\text{minimize}} \quad \sum_{i=1}^{N} f_i(x_1, x_2, \dots, x_N). \qquad (4.2)$$
Let each agent maintain its own estimate x^i of the optimization variable; thus x^i_j denotes part j of agent i’s estimate. The optimization problem (4.1) can be further rewritten as
$$\begin{array}{ll} \underset{x^1,\dots,x^N\in X\subseteq\mathbb{R}^n}{\text{minimize}} & \displaystyle\sum_{i=1}^{N} f_i(x^i_1, x^i_2, \dots, x^i_N),\\ \text{subject to} & x^1 = x^2 = \cdots = x^N, \end{array} \qquad (4.3)$$
such that the coupling between the objective functions is in the consistency constraints.

We now apply the method of Lagrange multipliers. Introduce the dual variables

λij for i = 1, . . . , N and j = 1, . . . , N , where λij is associated with the constraint
