
Licentiate Thesis in Electrical Engineering

Robust learning and control of linear dynamical systems

MINA FERIZBEGOVIC

Stockholm, Sweden 2020 www.kth.se

ISBN 978-91-7873-628-7 TRITA-EECS-AVL-2020:42

kth royal institute of technology


Robust learning and control of linear dynamical systems

MINA FERIZBEGOVIC

Licentiate Thesis in Electrical Engineering KTH Royal Institute of Technology Stockholm, Sweden 2020

Academic Dissertation which, with due permission of the KTH Royal Institute of Technology, is submitted for public defence for the Degree of Licentiate of Engineering on Thursday the 1st of October 2020, at 10:00 a.m. in Harry Nyquist, Malvinas väg 10, Stockholm.


© Mina Ferizbegovic

ISBN 978-91-7873-628-7 TRITA-EECS-AVL-2020:42

Printed by: Universitetsservice US-AB, Sweden 2020


Abstract

We consider the linear quadratic regulation problem when the plant is an unknown linear dynamical system. We present robust model-based methods based on convex optimization, which minimize the worst-case cost with respect to uncertainty around model estimates. To quantify uncertainty, we derive a method based on Bayesian inference, which is directly applicable to robust control synthesis.

We focus on control policies that can be iteratively updated after sequentially collecting data. More specifically, we seek to design control policies that balance exploration (reducing model uncertainty) and exploitation (control of the system) when exploration must be safe (robust). First, we derive a robust controller that minimizes the worst-case cost, with high probability, given the empirical observations of the system. This robust controller synthesis is then used to derive a robust dual controller, which updates its control policy after collecting data. An episode in which data is collected is called exploration, and an episode using the updated control policy is called exploitation. The objective is to minimize the worst-case cost of the updated control policy, while requiring that the worst-case cost during exploration stays within a given exploration budget. We study robust dual control in both finite and infinite horizon settings. The main difference between the two settings is that the latter does not consider the lengths of the exploration and exploitation phases, but rather approximates the cost using the infinite horizon cost. In the finite horizon setting, we discuss how different exploration lengths affect the trade-off between exploration and exploitation. Additionally, we derive methods that balance exploration and exploitation to minimize the cumulative worst-case cost over a fixed number of episodes. In this thesis, we refer to this problem as robust reinforcement learning. Essentially, it is a robust dual controller that aims to minimize the cumulative worst-case cost and updates its control policy in each episode.

Numerical experiments show that the proposed methods perform better than existing state-of-the-art algorithms. Moreover, the experiments indicate that the exploration prioritizes uncertainty reduction in the parameters that matter most for control.


Sammanfattning

Vi betraktar problemet med linjär-kvadratisk reglering där systemet är ett okänt linjärt dynamiskt system. Vi presenterar robusta modellbaserade metoder baserade på konvex optimering där den värsta möjliga kostnaden med avseende på osäkerheten kring modellens uppskattning minimeras. För att kvantifiera osäkerheten härleder vi en metod baserad på Bayesiansk inferens som är direkt applicerbar till syntetisering av robusta reglersystem.

Vi fokuserar på policyer som kan uppdateras efter sekventiellt insamlat data.

Specifikt så designar vi policyer som balanserar utforskning (reducering av osäkerheten i modellen) och exploatering (reglering av systemet) då utforskningen måste vara säker (robust). Vi börjar med att härleda en robust regulator som med hög sannolikhet är robust, givet observerat data. Syntesen av denna regulator används sedan för att härleda en robust dual regulator som uppdaterar sin policy efter insamlat data. En episod där data samlas in kallas för utforskning medan en episod där regleringspolicyn uppdateras kallas för exploatering. Målet är att minimera den värsta möjliga kostnaden av den uppdaterade regleringspolicyn och samtidigt begränsa den värsta möjliga kostnaden under utforskning genom att kräva att en utforskningsbudget hålls. Robust dual reglering behandlas både i fallet med ändlig och oändlig horisont. Den huvudsakliga skillnaden mellan de två fallen är att det senare inte tar hänsyn till längden av faserna för utforskning och exploatering, utan istället uppskattar kostnaden genom att anta oändlig horisont. I fallet med ändlig horisont behandlar vi hur olika längd på utforskningsfasen påverkar avvägningen mellan utforskning och exploatering. Vi härleder dessutom metoder som balanserar utforskning och exploatering för att minimera den kumulativa värsta möjliga kostnaden för ett fixerat antal episoder. Den typen av problem kallar vi för robust förstärkande inlärning. Den härledda metoden är i princip en robust dual regulator vars syfte är att minimera den kumulativa värsta möjliga kostnaden, och som uppdaterar sin regleringspolicy varje episod.

Numeriska experiment visar att de föreslagna metoderna har bättre prestanda än andra moderna algoritmer. Dessutom antyder experimenten att utforskningen prioriterar reduceringen av osäkerhet i den riktning som är viktigast för regleringens funktion.


Acknowledgements

It would not have been possible to write this thesis without the help and support of many colleagues and friends.

First and foremost, I would like to thank my supervisor Håkan for his guidance and genuine interest in my research. Thank you for providing detailed and prompt feedback on all manuscripts/notes I’ve sent you. I also express my warmest gratitude to my co-supervisor, Thomas, for his enthusiasm and encouragement. Thank you both for all the time you made available for me, and thank you for revising this thesis!

I would like to give special thanks to Jack for his kindness, enthusiasm, and patience. Almost all results developed in the thesis are the outcome of our collaboration. It has been a great pleasure to work with you!

I am sincerely thankful to Miguel for introducing me to his research into Weighted Null-Space Fitting during my first year. Thanks for your advice and support. Also, thank you and Demia for all hikes around Stockholm, exciting discussions, and dinners.

Many thanks to all colleagues and friends at the Division of Decision and Control. In particular, thanks to all members of the SYSID group (Inês, Javad, Matias, Mohamed, Robert B., Robert M., and Rodrigo) for the interesting discussions and lunches. Special thanks to Robert B. for translating the abstract to Swedish. Thanks to my office mates Hanxiao, Fei, Joana, Rodrigo, and Yuchao for providing a friendly environment. I had the pleasure of meeting He, Lissy, Peter, and Xuechun while taking WASP courses. Thank you for all the projects done together and the fun trips. Joana and Matin, thank you for listening over a cup of tea (or something else). Matias, thank you for always being there for me.

Finally, I want to thank my family for their unconditional support and optimism that have always lightened my journey. Thanks for ensuring that I have a great time whenever visiting home.

Mina Ferizbegovic, August 2020


To my family


Contents

Abstract iii
Sammanfattning v
Acknowledgements vii

1 Introduction 1
1.1 Linear Quadratic Regulator 3
1.2 Outline 6

2 Background 15
2.1 Markov Decision Process 15
2.2 The Linear Quadratic Regulator 16

3 Preliminaries 23
3.1 General problem statement 23
3.2 Uncertainty Quantification 28
3.A Proof of Lemma 3.2 29

4 Robust LQ-controllers 33
4.1 Problem statement 33
4.2 Designing a robust controller 34
4.3 Numerical experiments 35
4.4 Conclusion 36

5 Robust dual control 39
5.1 Problem statement 39
5.2 Design of a dual controller 40
5.3 Numerical experiments 44
5.4 Conclusion 45
5.A Proof of Theorem 5.1 46

6 Robust dual control in a finite horizon setting 47
6.1 Problem statement 47
6.2 Design of dual controller in a finite horizon setting 48
6.3 Numerical experiments 52
6.4 Conclusion 53

7 Robust reinforcement learning 55
7.1 Problem statement 55
7.2 Convex approximation to robust reinforcement learning problem 56
7.3 Experimental results 60
7.4 Conclusion 63
7.A Description of hardware in the loop experiment 63

8 Conclusion and future directions 67
8.1 Possible future directions 68

Bibliography 69


Chapter 1

Introduction

The modeling of dynamical systems is a fundamental task in many areas of science and engineering. Accurate models help us understand a system and predict its behavior. Models can be constructed in two ways: physical modeling and data-driven modeling. Physical modeling of a system involves using the laws and principles that govern the system. If physical modeling is not feasible, or is too expensive, the system is modeled using measured data. The latter approach to modeling is known as system identification [1].

In the field of automatic control, models are mainly used to construct a regulator in order to control a system. Model-based control has been a dominant paradigm since the 60s, and it tends to separate the modeling part from the control design part. Model-based control using the certainty equivalence (CE) principle involves two steps: i) approximation of the real system by a model using system identification techniques, and ii) design of a controller based on the CE principle, i.e., with the true system replaced by the model. The CE principle works well in the regime of low model errors. In the regime of moderate to large model errors, it is necessary to take the confidence intervals of the estimated model into account. Robust control methods guarantee that the resulting controller is stabilizing provided that the uncertain parameters lie within the confidence set. Even though robust control is effective in guaranteeing stability, it can result in poor average performance [2].

The most time-consuming task in model-based control is often obtaining the model [3]. Moreover, even the model that best fits the measured data is not necessarily the best model for the intended control application. Spurred by these problems, the field of identification for control has been active since the 90s. The main idea is that the obtained model is not an exact representation of the true system, but rather an approximation. The quality of the model must therefore depend not only on the identification data but also on the model’s intended application. Identification for control showed that high-performance controllers can be obtained with simple models (i.e., models with low complexity), provided they capture the most important dynamic features [4].

Simultaneous identification and control has been studied since the early 60s, when the term ‘dual control’ was first introduced by Feldbaum [5, 6]. In such a setting, decisions are made with two objectives in mind. First, there is a control objective, typically captured by a cost to be minimized. Second, due to the inherent uncertainty, there is a need to gather information about the unknown system. These two objectives are often competing, since gathering information involves disturbing the control process; this trade-off is known as the ‘dual effect’ (of decision). Though the formulation of optimal dual control was clear, synthesis via dynamic programming (DP) was intractable. This has made dual control almost impossible to use in real-world applications. Thus, the design of controllers with dual effect has relied on different types of approximations [7].

Adaptive control also addresses the issue of controlling uncertain dynamical systems. The first applications of adaptive control were to aircraft autopilots and ship steering. The major difficulty is the stability analysis, as the feedback parameters change at every sampling instant. Self-tuning adaptive controllers, pioneered by [8], and model reference adaptive systems [9] are the two main approaches. The differences between these two approaches are discussed in [10]. The works [11] and [12] have contributed significantly to adaptive control theory by providing algorithms, applicable to discrete-time and continuous-time systems, respectively, that guarantee global stability.

In recent years, the control of uncertain dynamical systems has witnessed a resurgence of interest due to the recent success of reinforcement learning (RL), particularly in computer games [13, 14]. The ‘dual effect’ in control corresponds to the exploration-exploitation trade-off in RL. Most RL methods can be classified as model-free methods, i.e., the collected data is directly mapped into control actions without explicitly building a model of the dynamics. However, these methods require many data samples to give a meaningful result and do not provide any robustness guarantees. Moreover, they are often sensitive to hyperparameters, making the obtained results hard to reproduce [15]. In order to employ RL in technologies like autonomous vehicles, robotic systems, or power systems, these methods must be safe and reliable; namely, we need to make RL methods robust. In this thesis, we discuss how this task can be accomplished.

Model-free methods have also been developed in the control community, where they are usually referred to as data-driven control methods. The oldest and most familiar example is PID tuning. Another important approach to tuning the parameters of feedback controllers is Iterative Feedback Tuning (IFT) [16, 17], a gradient-based, unbiased approach that iteratively updates the controller parameters using information from a closed-loop experiment with the most recent controller. Furthermore, Iterative Correlation-based Tuning (ICbT) [18] is based on the minimization of a correlation criterion between the closed-loop output error, i.e., the error between the achieved and desired output, and the reference signal.

Besides iterative methods, non-iterative methods have been proposed. The two most common non-iterative data-driven methods are Virtual Reference Feedback Tuning (VRFT) and Correlation-based Tuning (CbT). The basic idea of VRFT is to build a virtual reference signal such that the control design problem is converted into a standard identification problem, as described in [19], which can then be solved using instrumental variables (IV). Several extensions of this method are developed in [20, 21]. CbT was introduced in [22]; it is based on a correlation approach similar to ICbT and can be solved using a particular set of instrumental variables. A data-driven stability test for CbT, which guarantees internal stability, is proposed in [23]. The main limitation of these methods is the high variance of the instrumental variable (IV) estimates; thus, the Cramér-Rao lower bound cannot be achieved [24]. To enhance the statistical properties of these approaches, optimal input design has been proposed as a solution, cf. [25]. A different approach to improve the noise sensitivity is to regularize the controller/model, cf. [26, 27, 28, 29]. Unlike the previously mentioned data-driven methods, Data-enabled Predictive Control (DeePC) [30] learns a control input rather than the parameters of a feedback controller. In this approach, a model is learned implicitly through predictions of the system behavior.

The aforementioned methods deal with simultaneous learning and control and come from both the machine learning and control communities. In recent years, the topic has featured in successful workshops and invited sessions at two important conferences: the Conference on Decision and Control (CDC) and the new Learning for Dynamics and Control (L4DC) conference. To assess these different methods, there has been renewed interest in linear quadratic control with unknown dynamics, as this problem lies at the intersection of learning and control. It has been an important baseline problem in the RL community for answering questions about sample efficiency, robustness, the merits of different methods, etc. [15, 31]. In this thesis, we also consider the problem of linear quadratic control with unknown dynamics, with the aim of constructing robust (w.r.t. parameter uncertainty) methods that balance exploration and exploitation jointly, in particular via dual control. In the following section, we survey the literature on this problem, which involves work from both the control and machine learning communities.

1.1 Linear Quadratic Regulator

The linear quadratic regulator is one of the fundamental and best-studied problems in optimal control. The problem is to control a linear dynamical system subject to a quadratic cost. If the dynamics are known, the optimal strategy is well known; see [32]. On the other hand, the solution to the LQR problem with unknown dynamics has not been completely settled. In what follows, we give an overview of related work on this topic.

1.1.1 Related work

The interplay of learning and control of dynamical systems began with the introduction of ‘dual control’ [5, 6]. The solutions to optimal dual control problems require dynamic programming and are, in general, intractable [33], with a few restricted exceptions involving linear systems with finite state/decision spaces [34]. Computationally tractable solutions require simplifying approximations to the problem [35]. Possible ways to approximate the problem include adding a perturbation signal to the cautious controller [36], using a series expansion of the loss function [37], constraining the variance of the parameter estimates [38], and modifying the loss function [39]. Nevertheless, these early efforts established the importance of balancing ‘probing’ (exploration) with ‘caution’ (robustness).

Effective exploration has strong connections to the topic of experiment design; in particular, the value of choosing input signals with the purpose of the model in mind was recognized in [40]. Convex formulations [41, 42] ultimately led to the application-oriented and least-costly experiment design paradigms [43, 44], in which the objective is to reduce model uncertainty such that certain performance criteria can be achieved while minimizing the disruption to the system or the experiment time. This paved the way for application-oriented experiment design approaches to dual control [45], cf. also [46, 47, 48, 49, 50] for adaptive, dual and data-driven model predictive control applications. However, with some exceptions, e.g., [51] (upon which this work builds), these methods do not consider the robustness of the control strategies to model uncertainty, and indeed assume that the true model parameters are known. In contrast, we consider a worst-case design with respect to a set of models for robustness. Another aspect of identification for control is concerned with exploration for reduced complexity modeling [52, 53].

In the machine learning community, the LQR problem is studied under reinforcement learning with two main goals: providing regret bounds and ensuring robustness. It is important to underline that RL algorithms are analyzed in finite time; consequently, the uncertainty quantification necessary for robust control synthesis should also come with finite-time guarantees. However, as most existing results on uncertainty quantification are asymptotic, interest in finite-time guarantees on the estimates and their uncertainty has surged [54, 55, 56].

Robustness is studied in the so-called ‘coarse-ID’ family of methods, cf. [57, 58, 59], which is based on the recent robust control framework System Level Synthesis (SLS) [60]. In [59], sample complexity bounds, i.e., bounds on the number of training samples necessary to learn the system, are derived for LQR with unknown linear dynamics. This approach is extended to robust adaptive LQR in [57]; however, the policies are not optimized for exploration and exploitation jointly, and exploration is effectively random. On the other hand, the work of [61] eschews uncertainty quantification and demonstrates that the so-called certainty equivalent control policy attains nearly optimal regret $\tilde{\mathcal{O}}(\sqrt{T})$ for small uncertainty.

Works on model-based adaptive control such as [62, 63] employ the so-called ‘optimism in the face of uncertainty’ (OFU) principle, which is inspired by [64]. OFU optimistically selects control actions assuming that the true system behaves like the ‘best-case’ model in the uncertain set. This leads to optimal regret but requires the solution of intractable non-convex optimization problems. Alternatively, the works of [65, 66, 67] employ Thompson sampling, which optimizes the control action for a system drawn randomly from the posterior or uniform distribution over the set of uncertain models, given data.

In the RL community, there has also been considerable interest in model-free methods [68] for direct policy optimization [69, 70], as well as partially model-free methods based on spectral filtering [71, 72]. Other model-free methods are based on approximating the value function and the Q-function, as in least-squares temporal difference (LSTD) [73] and least-squares policy iteration (LSPI) [74], respectively. In [75], it has been shown empirically that the sample complexity of LSPI is worse than that of model-based methods.

Unlike the present thesis, none of the works above consider deriving control policies for exploration/exploitation jointly with safe and reliable exploration, which is essential for implementation on physical systems.

1.1.2 Main objectives and contributions of the thesis

The main objective of the thesis is to deliver control methods for linear systems that are safe and reliable. We focus on deriving model-based robust (w.r.t. parameter uncertainty) methods. At the same time, we seek to derive control policies that balance exploration and exploitation. For this purpose, we consider controllers with dual goals. First, the controller must minimize a given cost (exploitation); at the same time, it must gain information about the process in order to obtain a better model (exploration). The main difficulty is that obtaining a better model involves disturbing the control process, which conflicts with the exploitation goal.

The specific contributions of the thesis are:

1. We quantify uncertainty around the estimated model, in the form of a high-probability bound on the spectral norm of the parameter estimation error. This form is applicable to both robust control synthesis and the design of targeted exploration. By targeted exploration, we mean that we explore the system so as to reduce uncertainty in the parameters that matter most for control, i.e., for the model application.

2. We derive a convex procedure for the design of controllers with dual goals for uncertain linear systems. More precisely, we find an approximate solution to the problem sketched in Figure 1.1. The horizon is divided into an exploration phase and an exploitation phase. The cost of the exploration phase is constrained by a user-defined budget (green line). In this phase, we collect more data and thus gain more information about the system (red line). We then use the data collected during the exploration phase to design a control policy that minimizes the cost during the exploitation phase (blue area).

3. We derive a convex approximation to the problem of minimizing the worst-case quadratic cost for uncertain linear systems using a receding horizon strategy. In particular, we solve the problem depicted in Figure 1.2. The horizon is divided into N episodes, and we design N control policies. The goal is to minimize the cumulative worst-case cost (blue area). The worst-case cost during the i-th episode is based on the system information (red line) collected before the start of the episode.

Figure 1.1: Sketch of the robust dual control problem. The goal is to design exploration and exploitation control policies so as to minimize the worst-case cost of the exploitation policy (blue area), subject to a budget constraint (green line) on the worst-case cost of the exploration policy. The red line represents the information gain over time; as we collect more data, we gain more information about the system.

Figure 1.2: Sketch of the robust reinforcement learning problem. The goal is to design N policies so as to minimize the worst-case cost (blue area) over the time horizon [0, T]. The red line shows the increase in information, which improves over time as more data is collected to describe the system.

1.2 Outline

In this section, we provide an outline of the thesis.

Chapter 2 This chapter is devoted to the introduction of the linear quadratic regulator problem and the methods used to solve this problem when the dynamics are known and unknown, respectively. If the dynamics are known, the solution can be obtained by dynamic programming. For the case of unknown dynamics, model-free and model-based methods are introduced.

Chapter 3 We give the general statement of the problem, i.e., specification of the dynamics, cost, noise, policy, and we briefly introduce three problems we attempt to solve in the following chapters. As we are working with robust control, we need an approach to quantify uncertainty in parameter estimation, and we propose two approaches: the first approach is taken from [59], which is based on high-dimensional statistics, and the second approach is based on Bayesian inference. This chapter is partly based on:

• Mina Ferizbegovic, Jack Umenberger, Håkan Hjalmarsson, and Thomas B. Schön. Learning robust LQ-controllers using application oriented exploration. IEEE Control Systems Letters, volume 4, pages 19–24, 2020.

• Jack Umenberger, Mina Ferizbegovic, Thomas B. Schön, and Håkan Hjalmarsson. Robust exploration in linear quadratic reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS) 32, pages 15336–15346, 2019.

Chapter 4 We propose a robust control synthesis method to minimize the worst-case LQ cost with high probability, given empirical observations of the system. We use the high-probability bound on the spectral norm of the estimation error derived in the previous chapter. Our numerical experiments show better performance for the proposed robust LQ regulator over existing methods. This chapter is based on:

• Mina Ferizbegovic, Jack Umenberger, Håkan Hjalmarsson, and Thomas B. Schön. Learning robust LQ-controllers using application oriented exploration. IEEE Control Systems Letters, volume 4, pages 19–24, 2020.

Chapter 5 In this chapter, we consider the LQR problem with unknown dynamics. We propose an approximate robust dual controller that simultaneously regulates the system and reduces model uncertainty. The objective of the dual controller is to minimize the worst-case cost attained by a new robust controller, synthesized with reduced model uncertainty. The dual controller is subject to an exploration budget in the sense that it has constraints on its worst-case cost with respect to the current model uncertainty. The dual control strategy gives promising results when compared to the common greedy random exploration strategies. This chapter is based on:

• Mina Ferizbegovic, Jack Umenberger, Håkan Hjalmarsson, and Thomas B. Schön. Learning robust LQ-controllers using application oriented exploration. IEEE Control Systems Letters, volume 4, pages 19–24, 2020.


Chapter 6 We investigate robust dual control in a finite horizon setting. Unlike Chapter 5, we take transient features into account. In this case, we can explore the effect of the exploration phase’s length on the cumulative cost. The main drawback is that the calculation of this policy is much more computationally demanding than the robust dual controller presented in the previous chapter.

Chapter 7 This chapter concerns the problem of learning control policies for an unknown linear dynamical system so as to minimize a quadratic cost function. We present a method, based on convex optimization, that accomplishes this task robustly: i.e., we minimize the worst-case cost, accounting for system uncertainty given the observed data. The method balances exploitation and exploration, exciting the system in a way that reduces uncertainty in the model parameters to which the worst-case cost is most sensitive. Numerical simulations and an application to a hardware-in-the-loop servo-mechanism demonstrate appreciable performance and robustness gains over alternative methods. This chapter is based on:

• Jack Umenberger, Mina Ferizbegovic, Thomas B. Schön, and Håkan Hjalmarsson. Robust exploration in linear quadratic reinforcement learning. In Advances in Neural Information Processing Systems 32, pages 15336–15346, 2019.

Chapter 8 The final chapter summarizes the main conclusions and outlines possible directions for future work.

Contributions not included in this thesis

The following contributions are not included in the thesis:

• Stefanie Fonken, Mina Ferizbegovic, and Håkan Hjalmarsson. Consistent identification of dynamic networks subject to white noise using Weighted Null-Space Fitting. In Proceedings of the 21st IFAC World Congress, 2020.

• Mina Ferizbegovic, Miguel Galrinho, and Håkan Hjalmarsson. Weighted Null-Space Fitting for Cascade Networks with Arbitrary Location of Sensors and Excitation Signals. In Proceedings of the 57th IEEE Conference on Decision and Control (CDC), pages 4707–4712, 2018.

• Miguel Galrinho, Riccardo Prota, Mina Ferizbegovic, and Håkan Hjalmarsson. Weighted Null-Space Fitting for Identification of Cascade Networks. In Proceedings of the 18th IFAC Symposium on System Identification (SYSID), pages 856–861, 2018.

• Mina Ferizbegovic, Miguel Galrinho, and Håkan Hjalmarsson. Nonlinear FIR Identification with Model Order Reduction Steiglitz-McBride. In Proceedings of the 18th IFAC Symposium on System Identification (SYSID), pages 646–651, 2018.


Abbreviations

CE Certainty equivalent

DARE Discrete algebraic Riccati equation
DP Dynamic programming
HIL Hardware in the loop
iid Independent and identically distributed
iff If and only if
LSTD Least-squares temporal difference
LSPI Least-squares policy iteration
LTI Linear and time-invariant
LQ Linear quadratic
LQR Linear quadratic regulator
ML Maximum likelihood
PD Proportional-derivative
pdf Probability density function
PEM Prediction error method
RHS Right-hand side
RL Reinforcement learning
RRL Robust reinforcement learning
SDP Semi-definite program
OFU Optimism in the face of uncertainty
TS Thompson sampling
w.p. With probability
w.r.t. With respect to



Notation

$\mathbb{N}$ Set of natural numbers
$\mathbb{R}$ Set of real numbers
$\mathbb{R}_+$ Set of positive real numbers
$A^\top$ Transpose of $A$
$x_{1:n}$ Shorthand for the sequence $\{x_t\}_{t=1}^{n}$
$\lambda_{\max}(A)$ The maximum eigenvalue of $A$
$\lambda_{\min}(A)$ The minimum eigenvalue of $A$
$\|A\|_2$ The largest singular value of $A$
$\|A\|_F$ The Frobenius norm of $A$
$|x|_P$ $\sqrt{x^\top P x}$
$\otimes$ Kronecker product
$\mathrm{vec}(A)$ A vector wherein the columns of $A$ are stacked
$\mathbb{S}^n_+$ ($\mathbb{S}^n_{++}$) Cones of $n \times n$ symmetric positive semidefinite (definite) matrices
$A \preceq B$ $B - A \in \mathbb{S}^n_+$
$A \prec B$ $B - A \in \mathbb{S}^n_{++}$
$\mathcal{N}(m, R)$ Normal distribution with mean $m$ and covariance matrix $R$
$\mathrm{Unif}[a, b]$ The uniform probability density function with boundaries $a$ and $b$
$\chi^2_n(\delta)$ The chi-square distribution with $n$ degrees of freedom and probability $\delta$
$\mathrm{Tr}(A)$ Trace of $A$
$\mathbb{E}[X]$ Expected value of $X$
$p(x)$ Probability density function of $x$
$X \sim p(x)$ The random variable $X$ is distributed according to $p(x)$
$0_{n \times m}$ Matrix of dimensions $n \times m$ with only zero elements
$I_n$ Identity matrix of size $n$
$\mathrm{blkdiag}(Q, R)$ $\begin{bmatrix} Q & 0 \\ 0 & R \end{bmatrix}$



Chapter 2

Background

This chapter introduces the LQR problem in a Markov decision process (MDP) framework. In particular, we present solutions to the LQR problem with both known and unknown dynamics.

2.1 Markov Decision Process

The Markov decision process [76] is used for making decisions in an environment with Markovian dynamics. It is defined by four elements:

• the state space $\mathcal{X}$;

• the input (action) space $\mathcal{U}$;

• the transition probabilities $p(x_{t+1} \mid x_t, u_t)$ that describe the distribution of the next state $x_{t+1} \in \mathcal{X}$ conditioned only on the current state $x_t \in \mathcal{X}$ and the current action $u_t \in \mathcal{U}$ (as the process satisfies the Markov property);

• the cost (or reward) at each step, $c(x_t, u_t)$.

The goal is to find a sequence of inputs/actions $\{u_t\}_{t=0}^{T}$ that minimizes the infinite horizon average cost or the finite horizon cost, i.e.,

$$\lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}\left[\sum_{t=0}^{T-1} c(x_t, u_t)\right] \quad \text{and} \quad \mathbb{E}\left[\sum_{t=0}^{T-1} c(x_t, u_t)\right],$$

respectively. The sequence of actions is assumed to be history-dependent, i.e., the current action $u_t$ is allowed to be a mapping from the set of previous states and actions. A policy $\pi$ is a sequence of actions (or mappings) defined as

$$\pi = \{u_t(x_0, u_0, \dots, x_{t-1}, u_{t-1}, x_t)\}_{t=0}^{T}.$$

If the input and state spaces of the MDP are finite, it is well known that the cost-minimizing policy can be derived by DP-based algorithms such as policy or value iteration [76]. However, DP suffers from the curse of dimensionality, so these algorithms become hard to apply when the dimensions of the input and state space grow large. As our goal is to work with control problems in which the input and state spaces are infinite, finding a solution can become even more challenging.
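To make the DP recursion concrete before specializing to LQR, the sketch below runs finite-horizon backward dynamic programming on a small, randomly generated finite MDP; the dimensions, horizon, and cost values are arbitrary illustrative choices, not taken from the thesis.

```python
import numpy as np

# Hypothetical finite MDP: nx states, nu actions, horizon T (illustrative values).
rng = np.random.default_rng(0)
nx, nu, T = 5, 3, 20

P = rng.random((nu, nx, nx))                # P[u, x, y] = p(x_{t+1} = y | x_t = x, u_t = u)
P /= P.sum(axis=2, keepdims=True)           # normalize into transition probabilities
c = rng.random((nx, nu))                    # stage cost c(x, u)

# Backward recursion: J_T(x) = 0 and J_t(x) = min_u { c(x, u) + E[ J_{t+1}(x_{t+1}) ] }.
J = np.zeros(nx)
policy = np.zeros((T, nx), dtype=int)
for t in reversed(range(T)):
    Q_t = c + np.einsum('uxy,y->xu', P, J)  # Q_t(x, u) = c(x, u) + sum_y p(y|x,u) J_{t+1}(y)
    policy[t] = np.argmin(Q_t, axis=1)      # optimal action mu_t(x)
    J = Q_t.min(axis=1)                     # optimal cost-to-go J_t(x)

print("optimal expected cost from each initial state:", J)
```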

2.2 The Linear Quadratic Regulator

The best-studied problem in the field of optimal control is the LQR problem; see, e.g., the book [77] for an introduction. The LQR problem is a particular instance of an MDP in which the next state $x_{t+1}$ is a linear combination of the current state $x_t$ and the input $u_t$, corrupted by noise. More precisely, the linear dynamics are given by

$$x_{t+1} = A x_t + B u_t + w_t, \quad w_t \sim \mathcal{N}(0, \Sigma_w), \quad t \geq 0, \qquad (2.1)$$

where $A$ and $B$ are transition matrices, and $x_t \in \mathbb{R}^{n_x}$, $u_t \in \mathbb{R}^{n_u}$ and $w_t \in \mathbb{R}^{n_x}$ denote the state (which is assumed to be observed directly, without noise), input and process noise, respectively, at time $t$. The cost at each step is a quadratic function of the state $x_t$ and the control action $u_t$, namely $x_t^\top Q_t x_t + u_t^\top R_t u_t$, where $Q_t$ and $R_t$ are user-specified positive semidefinite and positive definite matrices, respectively. The cost after $T$ steps is

$$J(x_0) := \mathbb{E}\left[ x_T^\top Q x_T + \sum_{t=0}^{T-1} x_t^\top Q x_t + u_t^\top R u_t \right]. \qquad (2.2)$$

The so-called infinite time cost is given by

$$\lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=0}^{T} x_t^\top Q x_t + u_t^\top R u_t \right]. \qquad (2.3)$$

This corresponds to an MDP with state space $\mathcal{X} = \mathbb{R}^{n_x}$, input space $\mathcal{U} = \mathbb{R}^{n_u}$, transition probabilities $p(x_{t+1} \mid x_t, u_t) = \mathcal{N}(A x_t + B u_t, \Sigma_w)$, and cost at each step $c(x_t, u_t) = x_t^\top Q_t x_t + u_t^\top R_t u_t$.
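As a quick sanity check of the setup above, the sketch below simulates the dynamics (2.1) under a fixed linear feedback $u_t = K x_t$ and estimates the average cost (2.3) by Monte Carlo; the matrices and the gain are arbitrary illustrative values (the gain is merely assumed to be stabilizing), not from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 2-state, 1-input system (not from the thesis).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.eye(1)
Sigma_w = 0.01 * np.eye(2)
K = np.array([[-1.0, -2.0]])                # fixed linear policy u_t = K x_t (assumed stabilizing)

T = 10_000
x = np.zeros(2)
cost = 0.0
for t in range(T):
    u = K @ x
    cost += x @ Q @ x + u @ R @ u           # stage cost x'Qx + u'Ru
    w = rng.multivariate_normal(np.zeros(2), Sigma_w)
    x = A @ x + B @ u + w                   # dynamics (2.1)

print("empirical average cost:", cost / T)  # Monte Carlo estimate of (2.3)
```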

2.2.1 Control with known transition matrices

When the parameters of the system, $A$ and $B$, are known, the optimal solution to the LQR problem is well known, cf. [32]. The optimal solution can be obtained by DP, which is based on the optimality principle: if $\pi^* = \{\mu_0^*, \mu_1^*, \dots, \mu_{T-1}^*\}$ is the optimal policy for the horizon length $T$, then the optimal strategy minimizing the cost function from time $i$ to time $T$ is given by $\{\mu_i^*, \mu_{i+1}^*, \dots, \mu_{T-1}^*\}$. In other words, the optimal policy can be constructed recursively by first calculating a control policy at the final time $T-1$, then extending it to include the time instance $T-2$, and continuing in this manner until the complete control policy is obtained.

To provide a DP solution to the LQR problem with known dynamics, we follow [32]. We define the terminal cost at time $T$ as $J_T(x_T) = x_T^\top Q x_T$, and then recursively

$$J_t(x_t) = \min_{u_t} \mathbb{E}\left[ x_t^\top Q x_t + u_t^\top R u_t + J_{t+1}(A x_t + B u_t + w_t) \right]. \qquad (2.4)$$

We will prove by induction that, at each time, the optimal control law is a linear function of the current state. Observe that (2.4) with $t = T-1$ is

$$\begin{aligned}
J_{T-1}(x_{T-1}) &= \min_{u_{T-1}} \mathbb{E}\big[ x_{T-1}^\top Q x_{T-1} + u_{T-1}^\top R u_{T-1} \\
&\qquad + (A x_{T-1} + B u_{T-1} + w_{T-1})^\top Q (A x_{T-1} + B u_{T-1} + w_{T-1}) \big] \qquad (2.5a) \\
&= x_{T-1}^\top Q x_{T-1} + x_{T-1}^\top A^\top Q A x_{T-1} + \mathrm{Tr}(\Sigma_w Q) \\
&\qquad + \min_{u_{T-1}} \big[ u_{T-1}^\top (R + B^\top Q B) u_{T-1} + 2 x_{T-1}^\top A^\top Q B u_{T-1} \big], \qquad (2.5b)
\end{aligned}$$

where the last equality (2.5b) is obtained using the fact $\mathbb{E}[w_{T-1}] = 0$. In order to explicitly find the action $u_{T-1}$ attaining the minimum on the right-hand side of (2.5), we differentiate w.r.t. $u_{T-1}$ and set the derivative to 0, obtaining

$$(R + B^\top Q B) u_{T-1} = -B^\top Q A x_{T-1}.$$

The matrices $R$ and $B^\top Q B$ are positive definite and positive semidefinite, respectively. Thus, $R + B^\top Q B$ is an invertible matrix, and the optimal input is

$$u_{T-1} = -(R + B^\top Q B)^{-1} B^\top Q A x_{T-1}.$$

Now we can rewrite $J_{T-1}(x_{T-1})$, see (2.5), as

$$J_{T-1}(x_{T-1}) = x_{T-1}^\top P_{T-1} x_{T-1} + \mathrm{Tr}(\Sigma_w Q),$$

where the matrix $P_{T-1}$ is given by

$$P_{T-1} = A^\top\big(Q - Q B (B^\top Q B + R)^{-1} B^\top Q\big) A + Q.$$

The matrix $P_{T-1}$ is symmetric, and it is easily shown to be positive semidefinite. This follows from

$$x_{T-1}^\top P_{T-1} x_{T-1} = x_{T-1}^\top Q x_{T-1} + x_{T-1}^\top A^\top Q A x_{T-1} + \min_{u_{T-1}} \big[ u_{T-1}^\top (R + B^\top Q B) u_{T-1} + 2 x_{T-1}^\top A^\top Q B u_{T-1} \big], \qquad (2.6)$$

whose right-hand side equals $\min_{u_{T-1}} \big[ x_{T-1}^\top Q x_{T-1} + (A x_{T-1} + B u_{T-1})^\top Q (A x_{T-1} + B u_{T-1}) + u_{T-1}^\top R u_{T-1} \big] \geq 0$ for all $x_{T-1}$; hence $P_{T-1}$ is positive semidefinite. Since $J_{T-1}$ is a positive semidefinite quadratic function plus a constant term (depending on the noise variance), we can derive the control law similarly for step $T-2$. The cost from time $T-2$ to $T$ is

$$J_{T-2}(x_{T-2}) = x_{T-2}^\top Q x_{T-2} + x_{T-2}^\top A^\top P_{T-1} A x_{T-2} + \mathrm{Tr}\big(\Sigma_w (Q + P_{T-1})\big) + \min_{u_{T-2}} \big[ u_{T-2}^\top (R + B^\top P_{T-1} B) u_{T-2} + 2 x_{T-2}^\top A^\top P_{T-1} B u_{T-2} \big], \qquad (2.7)$$


and then we can calculate the optimal policy at time $T-2$:

$$u_{T-2} = -(R + B^\top P_{T-1} B)^{-1} B^\top P_{T-1} A x_{T-2}.$$

By substituting into (2.7), we have

$$J_{T-2}(x_{T-2}) = x_{T-2}^\top P_{T-2} x_{T-2} + \mathrm{Tr}\big(\Sigma_w (Q + P_{T-1})\big), \qquad (2.8)$$

where the matrix $P_{T-2}$ is given by

$$P_{T-2} = A^\top\big(P_{T-1} - P_{T-1} B (B^\top P_{T-1} B + R)^{-1} B^\top P_{T-1}\big) A + Q.$$

By iterating backwards towards $t = 0$, we obtain that, at every time $t$, the optimal policy is

$$u_t = K_t x_t,$$

where

$$K_t = -(R + B^\top P_{t+1} B)^{-1} B^\top P_{t+1} A, \qquad (2.9)$$

and

$$P_t = A^\top\big(P_{t+1} - P_{t+1} B (B^\top P_{t+1} B + R)^{-1} B^\top P_{t+1}\big) A + Q, \qquad (2.10)$$

with $P_T = Q$. Equation (2.10) is called the discrete-time Riccati equation.
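The backward recursion (2.9) and (2.10) is straightforward to implement numerically; the sketch below computes the time-varying gains $K_t$ for an arbitrary example system (the matrices are illustrative placeholders, not from the thesis).

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, T):
    """Backward Riccati recursion (2.9) and (2.10); returns gains K_0, ..., K_{T-1} and P_0."""
    P = Q.copy()                                        # terminal condition P_T = Q
    gains = []
    for _ in range(T):
        S = R + B.T @ P @ B
        K = -np.linalg.solve(S, B.T @ P @ A)            # gain (2.9)
        P = A.T @ (P - P @ B @ np.linalg.solve(S, B.T @ P)) @ A + Q  # Riccati update (2.10)
        gains.append(K)
    return gains[::-1], P                               # gains in forward-time order

# Illustrative example (placeholder matrices).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
K_seq, P0 = finite_horizon_lqr(A, B, Q, R, T=50)
print("first gain K_0:", K_seq[0])
```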

When the horizon is infinite, $P_t$ will converge to a steady-state solution under mild assumptions, stated in the following proposition taken from [32].

Proposition 2.1. Assume that the following assumptions hold:

• the pair $(A, B)$ is controllable, i.e., the matrix $[B, AB, \dots, A^{n_x-1}B]$ has full rank or, equivalently, there exists a linear policy $u_t = K x_t$ such that the system $x_{t+1} = (A + BK) x_t$ is asymptotically stable;

• the pair $(A, G)$, where $Q = G^\top G$, is observable, i.e., $(A^\top, G^\top)$ is controllable or, equivalently, if $u_t \to 0$ and $G x_t \to 0$, then $x_t \to 0$.

Under these assumptions, there exists a positive semidefinite matrix $P := \lim_{t\to\infty} P_t$, which satisfies the discrete algebraic Riccati equation (DARE):

$$P = A^\top P A - A^\top P B (B^\top P B + R)^{-1} B^\top P A + Q. \qquad (2.11)$$

It is well known that the optimal policy for the infinite horizon LQR is a stabilizing, time-invariant, linear function of the state: $u_t = K x_t$, where

$$K = -(B^\top P B + R)^{-1} B^\top P A. \qquad (2.12)$$

We can solve the infinite horizon LQR problem by, for example, semidefinite programming, cf. [78], which is explained in Chapter 4.
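Numerically, the DARE (2.11) and the gain (2.12) can be obtained with standard solvers; a minimal sketch using SciPy is shown below (the matrices are the same illustrative placeholders as above, and the residual check simply verifies that the returned $P$ is a fixed point of (2.10)).

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

P = solve_discrete_are(A, B, Q, R)                      # solves the DARE (2.11)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)      # optimal gain (2.12)

# Sanity checks: the closed loop A + BK is stable and P is a fixed point of (2.10).
assert np.max(np.abs(np.linalg.eigvals(A + B @ K))) < 1
S = R + B.T @ P @ B
residual = A.T @ (P - P @ B @ np.linalg.solve(S, B.T @ P)) @ A + Q - P
print("DARE residual norm:", np.linalg.norm(residual))
```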


2.2.2 Control with unknown transition matrices

In this setting, we assume that (A, B) can only be accessed through input/output data. There are two classes of methods to solve this problem: model-based and model-free methods.

Model-based methods for LQR

Model-based methods involve two steps: i) deriving a model, and ii) control design based on that model. The first step is equivalent to finding an estimate $(\hat{A}, \hat{B})$ of the true system dynamics $(A, B)$ from the observed data $\mathcal{D} = \{x_t, u_t\}_{t=1}^{n}$. A standard way to do this is by using the least squares estimator:

$$(\hat{A}, \hat{B}) = \arg\min_{A, B} \sum_{k=0}^{T-1} \| x_{k+1} - A x_k - B u_k \|_2^2. \qquad (2.13)$$

The solution to (2.13) is given by:

$$[\hat{A}, \hat{B}] = \left( \sum_{k=0}^{T-1} x_{k+1} \begin{bmatrix} x_k \\ u_k \end{bmatrix}^\top \right) \left( \sum_{k=0}^{T-1} \begin{bmatrix} x_k \\ u_k \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}^\top \right)^{-1}.$$
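A direct implementation of the least squares estimator (2.13) and its closed-form solution is sketched below; the data-generating system, the excitation, and the noise level are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
nx, nu, T = 2, 1, 500

# Collect data from (2.1) with random excitation.
X = np.zeros((T + 1, nx))
U = rng.normal(size=(T, nu))
for k in range(T):
    X[k + 1] = A @ X[k] + B @ U[k] + 0.1 * rng.normal(size=nx)

# Closed-form least squares: [A_hat, B_hat] = (sum x_{k+1} z_k') (sum z_k z_k')^{-1}, z_k = [x_k; u_k].
Z = np.hstack([X[:-1], U])                  # regressors z_k stacked row-wise, shape (T, nx + nu)
Theta = X[1:].T @ Z @ np.linalg.inv(Z.T @ Z)
A_hat, B_hat = Theta[:, :nx], Theta[:, nx:]
print("estimation errors:", np.linalg.norm(A_hat - A), np.linalg.norm(B_hat - B))
```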

At this point, we are concerned with the quality of the estimates $(\hat{A}, \hat{B})$. More precisely, we want to find confidence/credible intervals for these estimates. Credible intervals are used in Bayesian statistics and are analogous to confidence intervals in frequentist statistics. The main difference is that Bayesian credible intervals incorporate prior distributions and consider the estimated parameters as random variables.

In the following chapter, we suggest two ways to form high confidence/credibility regions, i.e., regions which contain the true system parameters with high probability. The first is based on high-dimensional statistics, and the second is derived using Bayesian inference. In particular, these regions will be of the following form:

$$\Theta_m(\mathcal{M}(\mathcal{D})) := \{A, B : X D X^\top \preceq I_{n_x},\ X = [\hat{A} - A, \hat{B} - B]\}, \qquad (2.14)$$

where $\mathcal{M}(\mathcal{D}) = \{\hat{A}, \hat{B}, D\}$ is a mapping from the observed data $\mathcal{D} = \{x_t, u_t\}_{t=1}^{n}$, and $D$ quantifies the uncertainty associated with the estimates $(\hat{A}, \hat{B})$.

A natural approach, now that $(\hat{A}, \hat{B})$ are available, is to design a controller, say $\hat{K}$, based on the CE principle, i.e., not accounting for the confidence intervals. In the regime of low model errors, we expect such a controller to stabilize the true system and, at the same time, attain a cost competitive with the optimal cost. This controller can be obtained by minimizing the finite horizon cost, as in (2.9), or the infinite horizon cost, as in (2.12).

For moderate and large model errors, the CE principle may lead to poor performance or even to non-stabilizing controllers. Therefore, robust (w.r.t. parameter uncertainty) stability and performance must be addressed in the solution. This thesis is focused on robust controller synthesis. Important results on robust controllers for LQR are presented in [59], where Coarse-ID Control is introduced. This method is based on SLS synthesis, and it consists of the following three steps:

1. Identify a coarse model $(\hat{A}, \hat{B})$ of the dynamical system.

2. Determine error bounds of the form

$$\|\hat{A} - A\|_2 \leq \epsilon_A, \quad \|\hat{B} - B\|_2 \leq \epsilon_B, \qquad (2.15)$$

using statistical tools such as the bootstrap.

3. Solve the robust optimization problem based on SLS synthesis, which guarantees that the controller stabilizes all models inside the confidence set.

The main difference between the Coarse-ID Control method and the robust synthesis presented in this thesis lies in the structure of the confidence regions: our robust controllers take into account the structure of the uncertainty represented by $D$ in (2.14), while Coarse-ID Control makes use of the spectral norms in (2.15) only.
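As an illustration of how bounds of the form (2.15) can be obtained in practice, the sketch below uses a simple parametric bootstrap around the least squares estimate: trajectories are resampled from the estimated model, the estimator is recomputed, and empirical quantiles of the spectral-norm errors serve as $\epsilon_A$ and $\epsilon_B$. This is only a sketch of the idea and not the exact procedure of [59]; the function and its arguments are hypothetical.

```python
import numpy as np

def bootstrap_bounds(A_hat, B_hat, Sigma_w, U, n_boot=100, quantile=0.95, seed=0):
    """Parametric bootstrap estimates of (eps_A, eps_B) in (2.15) around (A_hat, B_hat)."""
    rng = np.random.default_rng(seed)
    nx, nu, T = A_hat.shape[0], B_hat.shape[1], U.shape[0]
    errs_A, errs_B = [], []
    for _ in range(n_boot):
        # Resample a trajectory from the *estimated* model with the given inputs U.
        X = np.zeros((T + 1, nx))
        for k in range(T):
            w = rng.multivariate_normal(np.zeros(nx), Sigma_w)
            X[k + 1] = A_hat @ X[k] + B_hat @ U[k] + w
        # Re-estimate (A, B) by least squares on the resampled data.
        Z = np.hstack([X[:-1], U])
        Theta = X[1:].T @ Z @ np.linalg.inv(Z.T @ Z)
        errs_A.append(np.linalg.norm(Theta[:, :nx] - A_hat, 2))   # spectral norm errors
        errs_B.append(np.linalg.norm(Theta[:, nx:] - B_hat, 2))
    return np.quantile(errs_A, quantile), np.quantile(errs_B, quantile)
```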

Unlike the methods mentioned above, which run offline without re-estimating the transition matrices $(A, B)$, adaptive (online) LQR methods use the control input to refine our knowledge of $(A, B)$ while, at the same time, minimizing the control cost. We briefly mention the two most commonly used adaptive methods: optimism in the face of uncertainty (OFU) and Thompson sampling (TS). Usually, these methods proceed in epochs, and we therefore denote the confidence/credible region during epoch $i$ by $\Theta_m^i$.

Optimism in the face of uncertainty (OFU) for LQR. At each epoch, OFU methods require a new estimate $(\hat{A}, \hat{B})$ of the dynamics, together with the confidence/credible region around it. The controller is then designed for the pair $(A, B)$ in the confidence/credible region that minimizes the cost (2.3). Such a pair is denoted $(A_{\mathrm{ofu}}, B_{\mathrm{ofu}})$, and finding it requires solving a non-convex optimization problem, which can be cast as

$$(A_{\mathrm{ofu}}, B_{\mathrm{ofu}}) = \arg\min_{A, B \in \Theta_m^i} \mathrm{Tr}\, P(A, B),$$

where $P(A, B)$ is the solution to the following DARE:

$$P(A, B) = Q + A^\top P(A, B) A - A^\top P(A, B) B \big(B^\top P(A, B) B + R\big)^{-1} B^\top P(A, B) A.$$

After obtaining $(A_{\mathrm{ofu}}, B_{\mathrm{ofu}})$, the controller is calculated according to the CE principle, with the true system replaced by $(A_{\mathrm{ofu}}, B_{\mathrm{ofu}})$.

Thompson sampling (TS) for LQR. At each epoch, TS also requires a new estimate of the dynamics and the confidence/credible region around it. It is worth mentioning that, in RL, TS [79] is mostly referred to as posterior sampling. In the context of adaptive control for LQR problems, TS can be formulated slightly differently from its counterpart in RL, and we refer the reader to [65] for a detailed explanation. The method is carried out by drawing a sample $(A_{\mathrm{ts}}, B_{\mathrm{ts}})$ uniformly from the confidence/credible region. The uniform sampling is implemented as follows:

$$[A_{\mathrm{ts}}, B_{\mathrm{ts}}] = [\hat{A}, \hat{B}] + U^{1/(n_x(n_x+n_u))} \, \frac{\eta}{\|\eta\|_F} \, D_i^{-1/2},$$

where $U \sim \mathrm{Unif}([0, 1])$, $\eta \in \mathbb{R}^{n_x \times (n_x+n_u)}$ with each $\eta_{ij} \sim \mathcal{N}(0, 1)$, and $D_i$ is the uncertainty associated with $\Theta_m^i$, see (2.14). The controller is obtained according to the CE principle, with the true system replaced by $(A_{\mathrm{ts}}, B_{\mathrm{ts}})$.
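The sampling step above takes only a few lines to implement; the sketch below assumes the uncertainty matrix $D_i$ from (2.14) is symmetric positive definite and computes its inverse square root via an eigendecomposition (the function name and interface are illustrative).

```python
import numpy as np

def thompson_sample(A_hat, B_hat, D_i, rng):
    """Draw one candidate (A_ts, B_ts) from the credible region using the formula above."""
    nx, nu = A_hat.shape[0], B_hat.shape[1]
    d = nx * (nx + nu)
    eta = rng.normal(size=(nx, nx + nu))
    eta /= np.linalg.norm(eta, 'fro')                   # random direction on the unit sphere
    radius = rng.uniform() ** (1.0 / d)                 # radial factor U^{1/(nx(nx+nu))}
    evals, evecs = np.linalg.eigh(D_i)                  # D_i^{-1/2} via eigendecomposition
    D_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Theta = np.hstack([A_hat, B_hat]) + radius * eta @ D_inv_sqrt
    return Theta[:, :nx], Theta[:, nx:]

# Example usage (with previously computed A_hat, B_hat and an uncertainty matrix D_i):
# A_ts, B_ts = thompson_sample(A_hat, B_hat, D_i, np.random.default_rng(3))
```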

Model-free methods for LQR

We do not consider model-free methods in this thesis, but we provide a brief review for completeness. Model-free methods for LQR skip the identification step and instead learn a controller directly from data. They are mainly divided into two groups: approximate dynamic programming (ADP) and direct policy search. ADP uses the principle of optimality to solve an approximation of problem (2.2) using only input-output data. Direct policy search, on the other hand, updates the policy using data from previous runs so as to improve the obtained cost.

Instead of learning the transition dynamics, ADP learns an approximation of the value function or the Q-function. The Q-function represents the value of the cost when the initial state is $x_0$ and the first action is $u_0$, that is,

$$Q(x_0, u_0) = \min_{u_1, u_2, \dots} \mathbb{E}\left[ \sum_{t=0}^{T} x_t^\top Q x_t + u_t^\top R u_t \right], \quad x_{k+1} = A x_k + B u_k + w_k.$$

Unfortunately, the previous expression can only be used in the infinite horizon discounted LQR setting, see [80]. One way to avoid a discounted setting is to employ a relative Q-function, cf. [81]. In particular, LSPI, a model-free algorithm for LQR, is derived by approximating a Q-function. Another model-free algorithm, LSTD, is derived by approximating the value function $V(x) = \min_u Q(x, u)$.

For an excellent overview of ADP, see [80].

Direct policy search methods search for policies directly, without building any value or Q-function. These algorithms can mainly be classified as derivative-free optimization (DFO) [82]. Algorithms that directly search for control policies for LQR are analyzed in [70]. Other methods based on DFO, such as REINFORCE [83], involve perturbing the actions $u_t$ instead of perturbing the control policy. These methods do not use any model or state-space representation and are easy to implement. However, they are less effective than methods that compute actual gradients [84], and they can lead to high variance, which requires many new samples to find a stationary point.
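As a toy illustration of derivative-free policy search (not a specific algorithm from [70, 82, 83]), the sketch below perturbs a linear gain at random and keeps perturbations that reduce the simulated cost; all matrices are the same illustrative placeholders used earlier.

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

def rollout_cost(K, T=200, n_rollouts=20):
    """Average simulated cost of the policy u_t = K x_t over a few noisy rollouts."""
    total = 0.0
    for _ in range(n_rollouts):
        x = rng.normal(size=2)                  # random initial state
        for _ in range(T):
            u = K @ x
            total += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u + 0.05 * rng.normal(size=2)
    return total / n_rollouts

K = np.array([[-0.5, -0.5]])                    # initial guess (assumed stabilizing)
best = rollout_cost(K)
for _ in range(200):                            # plain random search over gains
    candidate = K + 0.1 * rng.normal(size=K.shape)
    c = rollout_cost(candidate)
    if c < best:                                # keep only improving perturbations
        K, best = candidate, c
print("gain found by random search:", K)
```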


Chapter 3

Preliminaries

In this chapter, we formulate the general problem statement that we seek to solve in the thesis. First, we state the basic concepts, such as the dynamics, the cost, and the form of the control policies. As we consider dual control methods, we explain what we mean by robust dual control in this thesis. We also demonstrate, through a simple example, the complexity of solving an optimal dual control problem.

As the main objective of the thesis is to deliver methods that minimize the worst-case cost, it is necessary to provide uncertainty quantification methods. We give two concrete suggestions for uncertainty quantification for the linear dynamical system given by (3.1). The first is a high-probability bound on the spectral norm of the system parameter estimation error, taken from [59]. Although it is in a form that applies to robust control synthesis and targeted exploration design, it cannot be directly applied to correlated time series data. In the second approach, which is the main contribution of this chapter, we remove that limitation.

3.1 General problem statement

In this section, we describe the general problem addressed in the thesis. We introduce the dynamics, cost function, control policies, and modeling assumptions used in the following chapters.

Dynamics and cost function. We are concerned with control of linear time-invariant systems

$$x_{t+1} = A x_t + B u_t + w_t, \quad w_t \sim \mathcal{N}(0, \sigma_w^2 I_{n_x}), \quad x_0 = 0, \qquad (3.1)$$

where $x_t \in \mathbb{R}^{n_x}$, $u_t \in \mathbb{R}^{n_u}$ and $w_t \in \mathbb{R}^{n_x}$ denote the state (which is assumed to be observed directly, without noise), input and process noise, respectively, at time $t$. The objective is to design a feedback control policy $u_t = \phi(\{x_{1:t}, u_{1:t-1}\})$ so as to minimize the cost function

$$\sum_{t=i}^{T} c(x_t, u_t),$$

