LQR and MPC control of a simulated data center


DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2017


ERIK BERGLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ENGINEERING SCIENCES


Degree Projects in Optimization and Systems Theory (30 ECTS credits) Degree Programme in Applied and Computational Mathematics (120 credits) KTH Royal Institute of Technology year 2017

Supervisors at ABB: Winston Garcia-Gabin, Kateryna Mischenko Supervisor at KTH: Xiaoming Hu


TRITA-MAT-E 2017:59 ISRN-KTH/MAT/E--17/59--SE

Royal Institute of Technology

School of Engineering Sciences

KTH SCI

SE-100 44 Stockholm, Sweden URL: www.kth.se/sci


Abstract

One of the largest contributions to a data center’s power usage is its cooling system. To decrease the energy usage of the cooling system, an automatic control scheme that adapts the capacity of the cooling units is needed. In this master thesis, a Simulink model of a data center is developed, along with several LQR and one MPC controller. The controllers control the outlet temperature and volumetric airflow of two CRAH units in the simulated data center. Simulations are performed in which the controllers are judged based on their estimated energy usage and how often the server temperatures in the data center exceed 35°C. Based on the experimental results, recommendations are made regarding what kinds of controllers to investigate in ABB’s further research.


Summary (translated from the Swedish "Sammanfattning")

One of the largest contributions to a data center’s energy consumption comes from its cooling system. To reduce the cooling system’s energy usage, an automatic control system is needed that adapts how large a share of the cooling units’ capacity is utilized. In this degree project, a Simulink model of a data center is developed, along with several LQR controllers and one MPC controller. The controllers regulate the outlet temperature and airflow of two CRAH units in the simulated data center. Simulations are performed in which the controllers are assessed by their estimated energy usage and by how often the server temperatures exceed 35 °C. Based on the experimental results, recommendations are given on which types of controllers should be investigated more closely in ABB’s continued research.


Acknowledgements

I would like to thank Xiaoming Hu for helping me get started with the thesis, and Winston Garcia-Gabin and Kateryna Mischenko for their advice and patient support throughout my work with it. I would like to thank my fellow master student Huang Zhang and his supervisor Xiaojing Zhang for their valuable contributions regarding the data center model. I would also like to thank my parents Anders and Astrid Berglund for their moral support during my work. Finally, I would like to thank my grandma Ingeborg Berglund for letting me live with her in Västerås during my time at ABB.


Abbreviations

ASHRAE the American Society of Heating, Refrigerating and Air-Conditioning Engineers

CRAH Computer Room Air Handler

LQR Linear Quadratic Regulator

MPC Model Predictive Controller


1 Introduction

1.1 Background

A data center can be defined as a facility that contains concentrated equipment to store, manage, process and exchange digital data and information. The need for such services has drastically increased, along with the development of IT equipment technology. To support the energy-intensive computing of a data center, as of 2004 most data centers used specialized computer room air conditioning systems [1]. Research efforts on data centers have been led by US-based organizations such as the Department of Energy, the Department of Defense, ASHRAE’s Technical Committee 9.9, The Green Grid and the Uptime Institute. According to [2], the research before 2012 was mostly concerned with IT equipment characteristics and safety. In comparison, research on energy savings in cooling of servers was limited. In 2007, the EPA published a report on data center efficiency identifying, among others, heat removal and control and management as topics in need of research [11]. The EU began making efforts in data center power efficiency by the end of 2010, launching the projects All4Green and CoolEmAll. Since then, 6 more projects aiming to reduce the environmental impact of data centers have been launched [10]. They were completed in 2016.


Figure 1 illustrates the different parts of the energy consumption of a typical data center. Since cooling accounts for nearly 40 % of the energy consumption of a data center, this is a suitable area to focus on when researching how to decrease the energy needs. The cooling can be done on several scales, from cooling of individual chips to cooling of the whole room. The room cooling system inside a typical data center can be described as follows: a CRAH cools the air inside of it by using cold water provided by an external cooler. A fan in the CRAH supplies this air to the data center, usually through a plenum. The server racks are organized into rows dividing the data center into cold aisles and hot aisles. The chilled air is supplied to the cold aisles, passes through the server racks and exits them into the hot aisles. The hot air rises to the roof and recirculates into the CRAH where it is cooled again [4]. This is illustrated in figure 2.

Figure 2: A diagram showing the layout of a data center with raised floor plenum cooling.

Several alternative ways of cooling have been suggested. A review of the state of the art of cooling systems from 2015 [5] considers liquid cooling instead of air cooling, and concludes that liquid cooling will be able to support a higher density of power than air cooling, increase the energy efficiency by removing the air as an intermediate step and make it easier to use the heat generated from the servers. The main disadvantage of liquid cooling is the risk of having liquid near the servers in case of failure, so which solution is the better alternative depends on whether this risk can be justified. The advantage of conventional air cooling is accessibility and maintainability, since it requires no pipes or barriers around the servers. Its disadvantage is lowered efficiency because of recirculation and bypass of air. This can be remedied by isolating the cold and hot aisles, but that increases cost and decreases maintainability. According to [12] page 430, data center users have stated that for power densities less than roughly 1600 W/m² (150 W/ft²), the extra cost for containment is not justifiable. Part of that cost is because of the need for more sensitive control, which could be decreased with an efficient control algorithm.

1.2 Approaches for controlling the data center

Temperature management of servers in data centers can be considered on several different scales, from the hardware in the server components, to the whole cooling architecture of a multi-room data center. Dynamic control of data center temperatures can be done on the server level, as in [25] and [7], rack level as in [9] and room level as in [24] and [6].

A study from 2006 [3] investigated the viability of using different parts of the cooling system of a data center as control variables. The conclusions were the following: The inlet temperature of the servers depends linearly on the CRAH supply temperature. The supply temperature would be suitable as a control variable, but it would not be enough to just change the supplied temperature if the fan speed were too low. Increasing the fan speed would decrease the difference between inlet and outlet temperature of a server, but the difference would not decrease linearly. Instead, a given decrease in airflow would have a greater effect for lower initial airflows. Closing a vent in the plenum increased the temperature near it but decreased it at neighboring vents. Manipulation of vents would thus allow more localized temperature control.


the data-center as a network of thermal and computational nodes. Test cases in the referenced dissertation show that the coordinated optimization of IT-load leads to efficiency improvements in cases where the servers are not placed so that they are all cooled efficiently. A realistic model containing both IT-loads and thermal dynamics can be quite complex, and if it is not tractable to coordinate the IT-load and the cooling, another approach would be to predict the IT-load distribution in the cooling efforts, as was done for control of a single server in [25].

A common approximation in the models used when controlling the data center is to disregard the dynamics of airflow. The motivation for this is that they are quicker than the temperature dynamics. The air recirculation from server outlets to inlets, and the airflow that bypasses the servers and goes directly to the cold aisle have generally been modeled as static functions of the supplied airflows. Modeling their contributions as constants can lead to a model where the time derivatives of server temperatures depend linearly on airflow, as in [6]. Modeling them as polynomials of the inflow to the servers gives a more realistic model, and has been used to treat the case of individual servers as in [7]. In [9], a thermal model that can estimate a dynamic airflow in real time is developed. This model is used to aid a PID controller by estimating unavailable temperatures. Changes in outlet airflows are generally assumed to affect the server temperatures immediately; as far as I know, no time delays have been considered in previous works.

1.3 Statement of thesis scope

The objective of this thesis was to develop several control schemes for the cooling system of a data center, consisting of 2 CRAH units, and investigate their performance in a preliminary study for ABB. The conclusions of the thesis would be used to determine the direction of further research of data center control. The performance of the control schemes is judged based on two objectives: minimizing the energy consumption of the cooling system and keeping the maximum temperatures of the servers in the data center below a certain setpoint.

The first part of the thesis was to develop a model of the data center. This model would be used both to develop model-based controllers and to evaluate the performance of those controllers. In order to check the performance with respect to the control objectives, the model would need to simulate the server temperatures and the power usage of the cooling system. These two parts will be elaborated on in two separate chapters.

The data center model was developed using energy balance equations and simplifying assumptions such as treating the servers and their enclosures as having one uniform temperature. The thermal dynamics are similar to the model in [25], except that they apply to servers in a data center instead of components of a server, using an empirically determined thermal mass for servers from [18]. The power usage of the cooling system is modeled similarly to [24], but incorporates a contribution from varying airflow as well as temperature. The most distinguishing feature of the current model compared to the previous ones is that it has a time delay for the controller outputs to take effect. However, tests of the model showed that this time delay was insignificant on the time scale of the servers’ thermal dynamics.

The second part of the thesis was to develop controllers based on the data center model. Several LQR controllers and one MPC controller were developed. The LQR controllers were tuned according to the heuristic that they need to make the controlled system adjust to a change in setpoint twice as fast as the open loop system would move to that point naturally given a change in server usage. The MPC was developed to minimize an energy functional while taking constraints on controller outputs and maximum server temperatures into account.

The model on which the LQR controllers were based was a linearized version of the dynamics of the mean temperatures of servers in two groups of server racks in one row of the data center. In the experiments with the controllers, they were used to control these mean temperatures, but also the maximum temperatures of the rack groups. One goal of the experiments was to investigate whether the controllers would have a stabilizing effect on the maximum temperatures even though their dynamics would not correspond entirely to the controllers’ internal models. If this was the case, control of the maximum temperatures could be a better way of avoiding overheating than controlling the mean temperatures.

Some of the LQR controllers used only airflow as a control variable, while others used the outlet temperature of the CRAHs. This was done in order to investigate whether there was any significant performance difference between airflow and temperature control. One LQR controller was also developed according to the same model as the others, but used both outlet temperature and airflow as control variables.

In order to compensate for discrepancies between the linearized model on which the LQR controllers were based and the real data center, some controllers had their state space extended with the integral of the difference between the controlled temperatures and their setpoints, making them LQI controllers. Another objective of the experiments with these controllers was to investigate the difference in controller performance with and without this extension.

The MPC was designed with a discretized version of the data center model that would ignore time delays. Several tuning experiments were performed in order to choose the algorithm for the MPC. The results of these tuning experiments showed that the posed optimization problem was nonconvex, so that only local minima would be found, if any at all. The computation time for the MPC simulations was also significantly longer than for the LQR simulations. The experiments with the MPC would show whether its ability to find local minima for energy usage and its capability of predicting the values of its controlled temperatures would warrant the effort of implementing it more efficiently so that it could work in real time.

Tests in a real data center have been done with a similar MPC in [6]. The main differences between the MPC in that study and the one in this thesis are that the MPC in that study only monitors rack inlet temperatures and does not take varying power usage of the servers into account, but has another control variable in the form of adjustable sizes of vents in the data center floor. The MPC in that study also has a slightly different objective function, which assumes that the power usage of the chiller depends linearly on the supplied air temperatures, and which penalizes too large changes in controller output.

The MPC and LQR controllers were used in two different test scenarios. In the first scenario, two servers in each rack would constantly work at minimum and maximum capacity respectively, increasing the difference between the maximum and minimum temperatures, making it inefficient to control the servers using mean temperatures. In the second scenario, the server load would shift between the servers with different thermal dynamics in each rack, which could present a problem when controlling the maximum temperatures. By comparing the controllers’ performance in these two extreme scenarios, conclusions could be drawn about whether a certain type of control scheme would be strictly better than another or if a controller’s performance depended heavily on the type of variations in IT load.

2 Thermal model of the servers

This chapter describes a dynamical model of the average temperatures of servers in a data center. The model is based on a module in the SICS-ICE data center used for research purposes [13]. The chapter is divided as follows: After stating the contributions of another master thesis student to this model, a description of the data center is given. Then, the assumptions made for developing the mathematical model are presented, followed by the resulting equations in a general form. Finally, the specific parameter values used in simulations are given, and some results of validation tests are shown.

2.1 Statement of contributions

The data center model described here is based on a steady state model developed by Huang Zhang as part of his thesis work at ABB [8]. Given his model, I modified it to include dynamics and implemented it in Simulink. The equations in sections 2.4.4 and 2.4.5 describe derivations of coefficients and equations in Huang’s original model. My contribution in these sections was to write down the derivations, explain the equations and to generalize them by naming parameters instead of using explicit parameter values. I also changed some parameters compared to Huang’s model. Another contribution of Huang is section 2.4.2; the equation there was his suggestion. Huang’s model included steady state special cases of the equations in sections 2.4.3 and 2.4.6, but generalizing the model to include dynamics was my contribution.

2.2 Data center layout

The data center modeled is a small scale slab floor data-center. It contains two rows of five server racks, four CRAHs and one UPS, battery pack and a single switchgear. The layout can be seen in


server racks and one hot aisle between them. An advantage of this setup compared to the more common one with floor vents is that there is less air bypass from the cool aisles and recirculation from the hot aisle. In the model of this data center, there are 18 servers per rack, enumerated from top to bottom. Figure 3 shows how the CRAH units and racks are located in relation to each other and how they are enumerated.

2.3 Assumptions

The model is based on several simplifying assumptions.

• All energy consumed by the servers is emitted as heat.

• All servers have the same power consumption for the same resource usage.

• All servers have an equally sized enclosure in their racks. A server along with its enclosure will be referred to as that server’s enclosing region.

• Any heat in the data center is assumed to be transported entirely by airflow, except for the emission of heat from the servers and removal of heat from the CRAHs.

• Air does not recirculate from the hot aisle to the inlets of the servers.

• A fixed proportion of the air is assumed to pass over the racks.

• Only heat and thermal energy are considered in the energy balance equations.

• The heat capacity and density of air are assumed to be constant; their dependency on temperature, pressure and humidity is ignored.

• The air velocity field originating from a certain CRAH is assumed to be constant over any cross section perpendicular to that CRAH.

• The temperature of an airflow is the same at the CRAH outlet and rack inlet.

• Airflows from different sources mix perfectly before entering a server. This mixing happens in non-overlapping regions in front of the server inlets, and during such a short time interval that inlet temperatures and airflows can be considered constant over this interval.

• Each CRAH has a fixed region of influence, a 2-dimensional region in the plane of the rack inlets below the height of the server racks through which air from that CRAH passes.

• The region of influence has the same size for all CRAHs.

• The region of influence of a CRAH has the same height as the server racks.

• The region of influence of a CRAH is assumed to be symmetrical with respect to the vertical plane perpendicular to the CRAH’s outlet that contains the CRAH’s midpoint.

• The regions of influence have the minimum width necessary for the racks next to the wall of the data center (racks 5 and 10) to have their inlets lie entirely inside the regions of influence of a CRAH (CRAHs 2 and 4 respectively).

• The air passing through the region of influence travels through the 3-dimensional region bounded by the convex hull of the points in the CRAH outlet and the region of influence of that CRAH.

• The lead time for a CRAH unit to change its output is ignored.

The lines extending from each CRAH unit in figure 3 show the region in which the air from that unit is assumed to travel, and the figure also shows the regions of influence. As seen from the figure, the inlets of racks 1 and 2 lie entirely within CRAH 1’s region of influence, while the inlets of racks 2, 3, 4 and 5 lie entirely within the region of influence of CRAH 2. CRAH 1 also has a partial intersection with the inlet of rack 3 and CRAH 2 a very small intersection with the inlet of rack 1. By symmetry of the data center, the inlets of racks 6-10 and the regions of influence of CRAHs 3 and 4 have the corresponding intersections.


Figure 3: A labeling of racks and CRAHs with a top-down view of the regions of influence. Note how the regions of influence of CRAHs 1 and 2, and of CRAHs 3 and 4, overlap, and that the regions of influence of CRAHs 1, 2, 3 and 4 intersect partially with the inlets of racks 3, 1, 8 and 6 respectively.

2.4 Model

2.4.1 Indices, parameters and variables

Table 1 describes the variables, parameters and indices used to model the data center and its power usage.

| Index | Description |
|---|---|
| $i$ | Index of the CRAH units, $i \in \{1, 2, 3, 4\}$ |
| $j$ | Index of the server racks, $j \in \{1, \dots, 10\}$ |
| $k$ | Index of the servers in each rack, $k \in \{1, \dots, 18\}$ |

| Parameter | Description |
|---|---|
| $p$ | Proportion of air from the CRAHs that passes above the racks, $p \in [0, 1)$ |
| $V_{\text{airflow}}$ | The volume of the region through which air is assumed to pass between the CRAH and its region of influence |
| $s_{i,j}$ | The area of the region through which airflow from CRAH $i$ enters rack $j$ |
| $s_{\text{influence}}$ | The area of the region of influence of a CRAH unit |
| $p_{\text{idle}}$ | The power usage of an idle server |
| $p_{\text{peak}}$ | The power usage of a server working at full capacity |
| $C_{\text{th}}$ | The thermal mass of the enclosing region of a server |
| $c_p$ | The specific heat capacity of the air in the data center |
| $\rho$ | The density of the air in the data center |

| Variable | Description |
|---|---|
| $t$ | Time measured from the start of a simulation |
| $t_{d_i}(t)$ | The time delay of the airflow from CRAH $i$ that reaches the region of influence at time $t$ |
| $u_{j,k}(t)$ | The resource usage of server $k$ in rack $j$ at time $t$, $u_{j,k}(t) \in [0, 1]$ |
| $P_{j,k}(t)$ | The power usage of server $k$ in rack $j$ at time $t$ |
| $a_i(t - t_{d_i}(t))$ | Volumetric airflow emitted from CRAH $i$ at time $t - t_{d_i}(t)$. Assumed to be $a_i(0)$ for $t - t_{d_i}(t) < 0$ |
| $a_{i,j}(t)$ | Volumetric airflow from CRAH $i$ entering rack $j$ at time $t$ |
| $A_{j,k}(t)$ | Volumetric airflow entering the enclosing region of server $k$ in rack $j$ at time $t$ |
| $T_{i,\text{out}}(t - t_{d_i}(t))$ | Outlet temperature of CRAH $i$ at time $t - t_{d_i}(t)$. Assumed to be $T_{i,\text{out}}(0)$ for $t - t_{d_i}(t) < 0$ |
| $T_{j,k,\text{in}}(t)$ | Inlet temperature of the enclosing region of server $k$ in rack $j$ at time $t$ |
| $T_{j,k,\text{out}}(t)$ | Outlet air temperature of the enclosing region of server $k$ in rack $j$ at time $t$ |

Table 1: Table describing the indices, parameters and variables of the model

2.4.2 Power usage of the servers

For simplicity, a linear equation for the power usage is assumed, as in [16]. With an idle server using power $p_{\text{idle}}$, a full capacity server using $p_{\text{peak}}$ and the proportion of resources used of server $k$ in rack $j$ at time $t$ being $u_{j,k}(t)$, the power usage of that server is

$$P_{j,k}(t) = p_{\text{idle}} + (p_{\text{peak}} - p_{\text{idle}})\,u_{j,k}(t). \tag{1}$$

2.4.3 Time delay

Because of the distance between the CRAHs and the nearest row of racks, there is a time delay in the airflow that reaches the racks. The time delay $t_{d_i}(t)$ at time $t$ must be such that the volume of air passing from a CRAH unit to its region of influence during the time interval $[t - t_{d_i}(t), t]$ equals the volume of the region the air is assumed to travel through. This yields the following equation:

$$\int_{t - t_{d_i}(t)}^{t} (1 - p)\,a_i(\tau)\,d\tau = V_{\text{airflow}}. \tag{2}$$
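As an illustration of equation (2), the following Python sketch approximates the time delay from a sampled airflow history by walking backwards in time until the accumulated air volume reaches the volume of the travel region. It is not the Simulink implementation used in the thesis: the function name `time_delay`, the sampling step `dt` and the piecewise-constant treatment of the airflow between samples are assumptions made for this example. The numerical values of $p$ and the travel volume are the ones listed in table 3.

```python
def time_delay(airflow_history, dt, p, v_airflow):
    """Approximate t_d in equation (2) from sampled CRAH airflow.

    airflow_history: samples of a_i in m^3/s, oldest first, spaced dt seconds
    apart; the last sample is the airflow at the current time t. All samples
    are assumed to be strictly positive. Each sample is treated as constant
    over one interval of length dt (an assumption of this sketch).
    """
    remaining = v_airflow  # volume still to be accounted for, m^3
    delay = 0.0
    # Walk backwards in time, one sample interval at a time.
    for a in reversed(airflow_history):
        flow = (1.0 - p) * a
        if flow * dt >= remaining:
            # The remaining volume is reached inside this interval.
            return delay + remaining / flow
        remaining -= flow * dt
        delay += dt
    # History exhausted: hold the oldest value, mirroring the model's
    # convention that a_i equals a_i(0) for negative times.
    return delay + remaining / ((1.0 - p) * airflow_history[0])


P_BYPASS = 0.25        # proportion of air passing above the racks (table 3)
V_AIRFLOW = 4.440375   # volume of the travel region in m^3 (table 3)

delay_max_flow = time_delay([2.18] * 100, 0.1, P_BYPASS, V_AIRFLOW)
delay_min_flow = time_delay([1.3] * 10, 0.1, P_BYPASS, V_AIRFLOW)
```

For a constant airflow, equation (2) reduces to $t_{d_i} = V_{\text{airflow}} / ((1 - p)\,a_i)$, so the two calls above should give delays of about 2.7 s and 4.6 s respectively.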

2.4.4 Airflows

The airflow from CRAH $i$ that passes through that CRAH’s region of influence at time $t$ is $(1 - p)\,a_i(t - t_{d_i}(t))$. By the assumption that the airflow is distributed evenly,

$$a_{i,j}(t) = \frac{s_{i,j}}{s_{\text{influence}}}\,(1 - p)\,a_i(t - t_{d_i}(t)) \tag{3}$$

and

$$A_{j,k}(t) = \frac{1}{18}\sum_{i=1}^{4} a_{i,j}(t). \tag{4}$$

A non-uniform distribution of airflow could also be simulated by changing the parameters $s_{i,j}$ or setting different weights on the sum in the right hand side of equation (4).
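Equations (3) and (4) can be sketched in a few lines of Python. The function names and the choice of rack 3 as the example are assumptions made for illustration; the numerical values are the ones given in tables 2 and 3 later in this chapter.

```python
S_INFLUENCE = 5.21375    # area of a CRAH's region of influence, m^2 (table 3)
P_BYPASS = 0.25          # proportion p of air passing above the racks (table 3)
SERVERS_PER_RACK = 18


def rack_airflow(delayed_crah_airflow, s_ij):
    """Equation (3): airflow from CRAH i entering rack j, in m^3/s."""
    return s_ij / S_INFLUENCE * (1.0 - P_BYPASS) * delayed_crah_airflow


def server_airflow(rack_airflows):
    """Equation (4): the rack's total inflow split evenly over its servers."""
    return sum(rack_airflows) / SERVERS_PER_RACK


# Rack 3 receives air from CRAH 1 and CRAH 2 (inflow areas from table 2).
# CRAHs 3 and 4 have zero inflow area for this rack and are left out.
s_rack3 = [0.26875, 1.29]
crah_airflows = [2.18, 2.18]  # delayed CRAH outputs a_i(t - t_di(t)), m^3/s

a_rack3 = [rack_airflow(a, s) for a, s in zip(crah_airflows, s_rack3)]
A_rack3 = server_airflow(a_rack3)
```

A non-uniform distribution could be tried in the same sketch by replacing the even split in `server_airflow` with a weighted one.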

2.4.5 Inlet temperatures

For an ideal fluid with constant specific heat capacity $c_p$, density $\rho$ and volume $V$, a temperature change $\Delta T$ corresponds to a change in internal energy $\Delta Q = c_p \rho V \Delta T$. If two volumes $V_1$ and $V_2$ of the same fluid with temperatures $T_1$ and $T_2$ perfectly mix, the net change in internal energy is 0. The temperature of the mixed fluid $T$ will be such that the cooler volume of fluid will gain as much energy as the hotter one loses, so

$$c_p \rho V_1 (T - T_1) = c_p \rho V_2 (T_2 - T) \iff T = \frac{V_1 T_1 + V_2 T_2}{V_1 + V_2}. \tag{5}$$

The above equation can be generalized by induction. Assume that if $n$ fluid volumes $V_1, \dots, V_n$ with temperatures $T_1, \dots, T_n$ mix, the resulting fluid will have volume $V = \sum_{m=1}^{n} V_m$ and temperature $T = \frac{\sum_{m=1}^{n} V_m T_m}{\sum_{m=1}^{n} V_m}$. Mixing in an extra volume of fluid $V_{n+1}$ with temperature $T_{n+1}$, the temperature of the resulting fluid is

$$\frac{V T + V_{n+1} T_{n+1}}{V + V_{n+1}} = \frac{\sum_{m=1}^{n+1} V_m T_m}{\sum_{m=1}^{n+1} V_m}$$

according to equation (5). To apply this to the data center, the assumption of perfect mixing of airflows is used. This is assumed to happen at non-overlapping regions outside the inlets of the racks. The time it takes for this mixing to occur is short enough that all airflows and temperatures can be considered constant over that time interval. Then the volume of air from each CRAH is proportional to the volumetric airflow of that CRAH reaching the considered inlet. The inlet temperature to the enclosing region of server $k$ in rack $j$ at time $t$ can be calculated as

$$T_{j,k,\text{in}}(t) = \frac{\sum_{i=1}^{4} a_{i,j}(t)\,T_{i,\text{out}}(t - t_{d_i}(t))}{\sum_{i=1}^{4} a_{i,j}(t)}. \tag{6}$$
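The mixing rule in equation (6) is a flow-weighted mean, which the following Python sketch makes explicit. The function name and the example flows and temperatures are hypothetical and only serve to illustrate the formula and the induction argument above.

```python
def mixed_temperature(flows, temperatures):
    """Flow-weighted mean temperature of perfectly mixed air streams.

    flows: volumetric airflows a_{i,j} reaching the inlet, in m^3/s.
    temperatures: the corresponding (delayed) CRAH outlet temperatures.
    """
    total = sum(flows)
    if total <= 0.0:
        raise ValueError("at least one stream must have positive airflow")
    return sum(f * t for f, t in zip(flows, temperatures)) / total


# Two streams: 1 m^3/s at 18 degC and 3 m^3/s at 26 degC.
t_two = mixed_temperature([1.0, 3.0], [18.0, 26.0])

# Induction step: mixing a third stream into the already mixed air gives
# the same result as mixing all three streams at once.
t_stepwise = mixed_temperature([1.0 + 3.0, 2.0], [t_two, 30.0])
t_at_once = mixed_temperature([1.0, 3.0, 2.0], [18.0, 26.0, 30.0])
```

A stream with zero airflow contributes nothing to either sum, which matches the convention that $s_{i,j} = 0$ for CRAHs that do not reach a rack.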

2.4.6 Outlet temperatures

The final step in the model derivation is to obtain a dynamical model of the server outlet temperatures. A simple type of model is a first order differential equation, and such an equation will be derived as follows: Assuming that there is no inflow from the hot aisle, no heat conduction and a uniform temperature in the enclosing region of a server, $T_{j,k,\text{out}}(t)$ will be equal to the temperature of the enclosing region of server $k$ in rack $j$. The air in the model is assumed to have a constant density $\rho$ and specific heat capacity $c_p$. The enclosing region of a server is assumed to have thermal mass $C_{\text{th}}$. By energy balance, the rate of change in internal energy is equal to the inflow minus the outflow. The inflow air has temperature $T_{j,k,\text{in}}(t)$ and replaces the outflow air, changing the internal energy of the air surrounding the server. Assuming all energy used by the server is radiated as heat,

$$C_{\text{th}}\,\frac{dT_{j,k,\text{out}}}{dt}(t) = P_{j,k}(t) + c_p \rho A_{j,k}(t)\,\bigl(T_{j,k,\text{in}}(t) - T_{j,k,\text{out}}(t)\bigr), \tag{7}$$

or equivalently,

$$\frac{dT_{j,k,\text{out}}}{dt}(t) = \frac{P_{j,k}(t)}{C_{\text{th}}} + \frac{c_p \rho A_{j,k}(t)}{C_{\text{th}}}\,\bigl(T_{j,k,\text{in}}(t) - T_{j,k,\text{out}}(t)\bigr). \tag{8}$$
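To get a feeling for the dynamics in equation (8), the sketch below integrates it for a single server with constant inputs. It uses a plain forward Euler scheme rather than the Simulink solver used in the thesis, and the function names and the example operating point (an idle server with 18 °C inlet air and an airflow of 0.0225 m³/s) are assumptions made for illustration. The physical constants are the ones listed in table 3.

```python
C_TH = 12000.0   # thermal mass of a server's enclosing region, J/K (table 3)
CP = 1007.0      # specific heat capacity of air, J/(kg K) (table 3)
RHO = 1.1614     # density of air, kg/m^3 (table 3)


def simulate_outlet_temperature(t_out0, t_in, power, airflow, duration, dt=1.0):
    """Integrate equation (8) with forward Euler for constant inputs."""
    t_out = t_out0
    for _ in range(int(round(duration / dt))):
        dT_dt = power / C_TH + CP * RHO * airflow / C_TH * (t_in - t_out)
        t_out += dt * dT_dt
    return t_out


def steady_state(t_in, power, airflow):
    """Outlet temperature at which the right hand side of (8) is zero."""
    return t_in + power / (CP * RHO * airflow)


def time_constant(airflow):
    """Time constant C_th / (c_p * rho * A) of the ODE for a fixed airflow."""
    return C_TH / (CP * RHO * airflow)


AIRFLOW = 0.0225  # hypothetical server airflow, m^3/s
final = simulate_outlet_temperature(18.0, 18.0, 130.0, AIRFLOW, duration=10000.0)
target = steady_state(18.0, 130.0, AIRFLOW)
tau = time_constant(AIRFLOW)
```

With these numbers the time constant is several hundred seconds, which is much longer than the transport delay of a few seconds implied by equation (2) for the airflows considered in this chapter.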


2.4.7 Complete modeling framework

To summarize, the data center model is described by the equations given below.

$$\begin{aligned}
P_{j,k}(t) &= p_{\text{idle}} + (p_{\text{peak}} - p_{\text{idle}})\,u_{j,k}(t), \\
\int_{t - t_{d_i}(t)}^{t} (1 - p)\,a_i(\tau)\,d\tau &= V_{\text{airflow}}, \\
a_{i,j}(t) &= \frac{s_{i,j}}{s_{\text{influence}}}\,(1 - p)\,a_i(t - t_{d_i}(t)), \\
A_{j,k}(t) &= \frac{1}{18}\sum_{i=1}^{4} a_{i,j}(t), \\
T_{j,k,\text{in}}(t) &= \frac{\sum_{i=1}^{4} a_{i,j}(t)\,T_{i,\text{out}}(t - t_{d_i}(t))}{\sum_{i=1}^{4} a_{i,j}(t)}, \\
\frac{dT_{j,k,\text{out}}}{dt}(t) &= \frac{P_{j,k}(t)}{C_{\text{th}}} + \frac{c_p \rho A_{j,k}(t)}{C_{\text{th}}}\,\bigl(T_{j,k,\text{in}}(t) - T_{j,k,\text{out}}(t)\bigr).
\end{aligned} \tag{9}$$

This is a nonlinear, continuous time model, although it is linear if all the control airflows $a_i(t)$, $i \in \{1, 2, 3, 4\}$, are held constant. It may be suitable to linearize the model if airflow is to be used as a control variable. For applications such as MPC, the differential and integral equations should be discretized.
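As a rough illustration of what such a linearization could look like, consider a single server and treat $A_{j,k}$, $T_{j,k,\text{in}}$ and $P_{j,k}$ directly as inputs, ignoring the time delay. The barred quantities and the $\delta$ notation below are introduced for this sketch only and do not come from the model above; the operating point is assumed to be a steady state of the last equation in (9).

```latex
% Steady-state operating point (assumed): the right hand side of the ODE vanishes.
\bar{P} + c_p \rho \bar{A}\,(\bar{T}_{\text{in}} - \bar{T}_{\text{out}}) = 0

% First-order expansion in the deviations
% \delta T_{\text{out}} = T_{\text{out}} - \bar{T}_{\text{out}}, and likewise for
% \delta T_{\text{in}}, \delta A and \delta P:
\frac{d\,\delta T_{\text{out}}}{dt} \approx
  -\frac{c_p \rho \bar{A}}{C_{\text{th}}}\,\delta T_{\text{out}}
  +\frac{c_p \rho \bar{A}}{C_{\text{th}}}\,\delta T_{\text{in}}
  +\frac{c_p \rho\,(\bar{T}_{\text{in}} - \bar{T}_{\text{out}})}{C_{\text{th}}}\,\delta A
  +\frac{1}{C_{\text{th}}}\,\delta P

% Using the steady-state condition, the airflow coefficient can be rewritten as
\frac{c_p \rho\,(\bar{T}_{\text{in}} - \bar{T}_{\text{out}})}{C_{\text{th}}}
  = -\frac{\bar{P}}{C_{\text{th}}\,\bar{A}}
```

The only term dropped is the product $\delta A\,(\delta T_{\text{in}} - \delta T_{\text{out}})$, which is the source of the nonlinearity. The negative airflow coefficient reflects that increasing the airflow lowers the outlet temperature whenever the server dissipates power.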

2.5 Parameter values

This section presents specific values of the parameters used for simulation of the data center and the reasons for choosing these values. The first subsection presents a table of the inflow areas, while the second lists the values of the other parameters.

2.5.1 Inflow areas

With the assumptions on how the airflow is distributed, and the fact that the cooling units used in the data center have a width of 2.25 m [20], the values of $s_{i,j}$ can be calculated using the widths and heights of the racks and the distances between the CRAHs given in figures 4 and 5.


Figure 5: A sideways view of the data center with measurements.

The values of $s_{i,j}$ for $i \in \{1, 2\}$, $j \in \{1, 2, 3, 4, 5\}$ are given in m² in table 2. By the symmetry of the data center, $s_{i,j} = s_{i-2,j-5}$ for $i \in \{3, 4\}$, $j \in \{6, 7, 8, 9, 10\}$. All values of $s_{i,j}$ not mentioned are 0, as CRAH units are assumed to only affect racks in the nearest row.

| $s_{i,j}$ (m²) | $j = 1$ | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| $i = 1$ | 1.29000 | 1.29000 | 0.26875 | 0 | 0 |
| $i = 2$ | 0.05375 | 1.29000 | 1.29000 | 1.29000 | 1.29000 |

Table 2: Table of the inflow areas for the racks and CRAHs on the left side of the data center

2.5.2 Other parameters

In table 3, the remaining parameters are listed. The lengths and areas are taken from given measurements in the specific data center (figures 4 and 5) and in the specifications of the CRAH units [20]. The thermal mass of the enclosing region of a server was taken to be the thermal mass of a server studied in [18], a server of the same size, although not of the same brand, as the ones in the modeled data center. Ignoring the thermal mass of the enclosure of the server is justified by a conclusion of [19], that the servers’ heat capacity has the most significant contribution to the data center dynamics and that the thermal mass of the racks can be ignored. The density and heat capacity of air are taken from the values in [17] for a temperature of 300 K.

As no data for the proportion of air traveling over the racks was found for the modeled data center, $p$ was treated as a tuning parameter, and reasonable results were obtained for $p = 0.25$. $p_{\text{peak}}$ and $p_{\text{idle}}$ were chosen so that $p_{\text{peak}}$ would be roughly twice $p_{\text{idle}}$ and so that their mean would be 200 W, the power consumption of a typical server in the modeled data center.

The CRAH outlets are rectangular, and the regions of influence were assumed to be so too. Therefore, $V_{\text{airflow}}$ was calculated as follows: Let $h_o$ be the height and $w_o$ be the width of a CRAH outlet, let $h_i$ be the height and $w_i$ be the width of a region of influence, and let $L$ be the distance from the CRAH outlet to the region of influence. The assumed region through which air travels from the CRAH outlet to the region of influence has, at a distance $x$ from the CRAH, a height $h(x) = \frac{h_i x + h_o (L - x)}{L}$ and a width $w(x) = \frac{w_i x + w_o (L - x)}{L}$. The volume of the region is then

$$V_{\text{airflow}} = \int_0^L h(x)\,w(x)\,dx = \frac{L}{6}\bigl(2(h_o w_o + h_i w_i) + h_o w_i + h_i w_o\bigr). \tag{10}$$
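The closed form in equation (10) can be checked numerically. The Python sketch below evaluates it with the dimensions from table 3 and compares it with a midpoint-rule approximation of the integral; the function names and the number of integration steps are choices made for this example.

```python
def airflow_volume(L, h_o, w_o, h_i, w_i):
    """Closed form of equation (10)."""
    return L / 6.0 * (2.0 * (h_o * w_o + h_i * w_i) + h_o * w_i + h_i * w_o)


def airflow_volume_numeric(L, h_o, w_o, h_i, w_i, n=1000):
    """Midpoint-rule approximation of the integral of h(x) * w(x) over [0, L]."""
    dx = L / n
    total = 0.0
    for m in range(n):
        x = (m + 0.5) * dx
        h = (h_i * x + h_o * (L - x)) / L  # height of the cross section at x
        w = (w_i * x + w_o * (L - x)) / L  # width of the cross section at x
        total += h * w * dx
    return total


# Dimensions in metres, taken from table 3.
DIMENSIONS = dict(L=1.2, h_o=1.875, w_o=1.225, h_i=2.15, w_i=2.425)

v_closed = airflow_volume(**DIMENSIONS)
v_numeric = airflow_volume_numeric(**DIMENSIONS)
s_influence = DIMENSIONS["h_i"] * DIMENSIONS["w_i"]  # area of the region of influence
```

Both volumes should agree with the value 4.440375 m³ listed in table 3, and the product $h_i w_i$ should reproduce the region-of-influence area 5.21375 m².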


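| Parameter | Value | Unit |
|---|---|---|
| $p$ | 0.25 | unitless |
| $L$ | 1.2 | m |
| $h_o$ | 1.875 | m |
| $w_o$ | 1.225 | m |
| $h_i$ | 2.15 | m |
| $w_i$ | 2.425 | m |
| $V_{\text{airflow}}$ | 4.440375 | m³ |
| $s_{\text{influence}}$ | 5.21375 | m² |
| $p_{\text{idle}}$ | 130 | W |
| $p_{\text{peak}}$ | 270 | W |
| $C_{\text{th}}$ | 12000 | J/K |
| $c_p$ | 1007 | J/(kg · K) |
| $\rho$ | 1.1614 | kg/m³ |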

Table 3: Table of the remaining parameters of the data center model

2.6 Model validation

To see that the model would give plausible outputs for given inputs, several tests were conducted. All simulations were done in Simulink, with the solver ode15s.

2.6.1 Steady-state validation

In this validation, different scenarios were studied in which the cooling system was set to work at either minimum, medium or maximum capacity. The servers’ resource utilization was set to either 0%, 50% or 95%. The different scenarios are described in table 4. The minimum and maximum input temperatures to the model were chosen according to ASHRAE’s recommendations, 18 °C and 27 °C respectively [21]. The maximum airflow was set to 2.18 m³/s, the airflow given in the specifications for the CRAH units [20], and the minimum airflow was assumed to be 1.3 m³/s. All CRAHs were set to have the same outlet airflow and temperature, and all servers had the same utilization.

| Scenario | CRAH outlet airflow | CRAH outlet temperature | Server utilization |
|---|---|---|---|
| 1 | 2.18 m³/s | 18 °C | 0% |
| 2 | 1.3 m³/s | 27 °C | 0% |
| 3 | 1.3 m³/s | 27 °C | 95% |
| 4 | 2.18 m³/s | 18 °C | 95% |
| 5 | 2.18 m³/s | 18 °C | 50% |
| 6 | 1.74 m³/s | 22.5 °C | 50% |

Table 4: A description of the test scenarios for the steady state validation

With the same resource utilization and inlet temperature for all servers, differences in server outlet temperature would be caused only by differences in inlet airflow. Therefore, four different outlet temperatures were obtained: one for racks 1 and 6, one for racks 2 and 7, one for racks 3 and 8, and one for racks 4, 5, 9 and 10.


Scenario   Outlet temperature   Outlet temperature   Outlet temperature   Outlet temperature
           racks 1 and 6        racks 2 and 7        racks 3 and 8        racks 4, 5, 9 and 10
1          22.7481 °C           20.4730 °C           22.0932 °C           22.9459 °C
2          34.9622 °C           31.1470 °C           33.8639 °C           35.2939 °C
3          43.1081 °C           35.3896 °C           40.8863 °C           43.7793 °C
4          27.6057 °C           23.0030 °C           26.2808 °C           28.0060 °C
5          25.3047 °C           21.8046 °C           24.2972 °C           25.6091 °C
6          31.6519 °C           27.2666 °C           30.3896 °C           32.0333 °C

Table 5: Steady state temperatures for the scenarios described in table 4

As seen in table 5, all temperatures except for those in scenario 3 lie around or below 35 °C, the maximum recommended operating temperature for a server according to Dell [23]. Overheating can be expected in scenario 3, as the servers are using almost all their resources while the cooling system works at minimum capacity. In the other scenarios, the temperatures are reasonable. For a better overview, the temperatures are plotted in figure 6. Note that the ordering of racks by outlet temperature is the same in all scenarios.

[Figure 6 plot: temperature (°C) for each rack group: racks 1 and 6, racks 2 and 7, racks 3 and 8, racks 4, 5, 9 and 10]

Figure 6: The steady state temperatures in the different scenarios. The graphs correspond to the following scenarios: green - 3, blue - 2, magenta - 6, black - 4, cyan - 5 and red - 1.

2.6.2 Step response validation

In this validation, the step response of the model was studied. The model was initially in the steady state described in scenario 1 in the previous section. Three different cases were studied, and in each case, there would be a step in either the temperature, airflow or server usage at time

500 s. The step would be from the minimum to the maximum value in the case of temperature,

from 0% to 95% in the case of server utilization, and from maximum to minimum in the case of airflow.

The results of the simulations are shown in figures 7, 8 and 9. The figures illustrate the similarities and differences in how the model responds to steps in different inputs. The server temperatures are in a reasonable range throughout the simulation. Four different temperatures were given, corresponding to the four different inlet airflows that a server could have in these scenarios. The step responses are stable first order responses, as can be expected from the model. The time constants can be calculated as $\tau = \frac{C_{th}}{c_p \rho A_{j,k}}$. The settling times are calculated as $4\tau$ for all servers. Figure 7 shows the step response for resource usage. In the steady state after the step response, the different temperatures are further apart, as the temperature impact of increased power usage depends on the rate of heat removal. The settling times are in the range 914 s - 1823 s.

[Figure 7 plot: temperature (°C) versus time (s)]

Figure 7: Step response when the resource usage increases from 0% to 100%. The graphs correspond to the following rack temperatures, with settling times: Orange - racks 4, 5, 9 and 10, settling time 1823 s. Cyan - racks 1 and 6, settling time 1753 s. Teal - racks 3 and 8, settling time 1508 s. Red - racks 2 and 7, settling time 914 s.
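The relation between server airflow, time constant and settling time can be sketched as follows. The constants come from table 3; the function names are illustrative, and the per-server airflow is back-computed from a reported settling time rather than taken from the thesis.

```python
C_TH = 12000.0  # thermal capacity of a server, J/K (table 3)
C_P = 1007.0    # specific heat capacity of air, J/(kg K) (table 3)
RHO = 1.1614    # density of air, kg/m^3 (table 3)


def time_constant(server_airflow):
    """tau = C_th / (c_p * rho * A_jk), with A_jk in m^3/s."""
    return C_TH / (C_P * RHO * server_airflow)


def settling_time(server_airflow):
    """Settling time taken as four time constants, as in the text."""
    return 4.0 * time_constant(server_airflow)


def airflow_for_settling_time(t_settle):
    """Invert the relation: per-server airflow implied by a settling time."""
    return 4.0 * C_TH / (C_P * RHO * t_settle)


# Per-server airflow implied by the fastest reported settling time (914 s).
a_fast = airflow_for_settling_time(914.0)
```

Because $\tau$ is inversely proportional to $A_{j,k}$, scaling all airflows by 1.3/2.18 lengthens every settling time by a factor 2.18/1.3, roughly 1.68, which is in line with the longer settling times reported for the airflow step.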

Figure 8 shows the step response for inlet temperature. In contrast to the step response for

resource usage, all temperatures have the same difference between them in the steady states before and after the step. The settling times are the same in the resource usage and inlet temperature step scenarios, as the time constant is not affected by power usage or inlet temperature.

[Figure 8 plot: temperature (°C) versus time (s)]

Figure 8: Step response when the inlet temperature increases from 18 °C to 27 °C. The graphs correspond to the following rack temperatures, with settling times: Orange - racks 4, 5, 9 and 10, settling time 1823 s. Cyan - racks 1 and 6, settling time 1753 s. Teal - racks 3 and 8, settling time 1508 s. Red - racks 2 and 7, settling time 914 s.

Figure 9 shows the step response for airflow. As in the step response for resource usage, the

temperatures move further apart after the step. Since the airflow is reduced, the settling times increase compared to the previous scenarios. In this case, they are in the range 1531 s - 3062 s.

[Figure 9 plot: temperature (°C) versus time (s)]

Figure 9: Step response when the airflow decreases from 2.18 m³/s to 1.3 m³/s. The graphs correspond to the temperatures of servers in the following racks: Orange - racks 4, 5, 9 and 10. Cyan - racks 1 and 6. Teal - racks 3 and 8. Red - racks 2 and 7.

In the previously shown step responses, all servers in a rack had the same power usage, and would therefore have the same temperature according to the model. However, the model should be able to calculate individual temperatures for each server. Since all servers in a rack are assumed to have the same inlet temperature and airflow, this allows a comparison of how different resource usages affect the server temperatures. To illustrate this, the following step response simulation was performed: The servers in rack 2 would begin in the same steady state as in previous simulations. At time 500 s, the servers would get an increase in resource usage between 0% and 85%. Figure 10 shows the results of this simulation. As before, the plot shows first order step responses.

[Figure 10 plot: temperature (°C) versus time (s), one curve per server, labeled by resource usage after the step, from 0% to 85% in steps of 5%]

Figure 10: Step response for servers in rack 2, when their resource usages start at 0% and are increased to percentages evenly distributed between 0% and 85% at time 500s.

Knowing that the model has an adequate behavior for simple test cases, a more advanced demonstration of the model’s capabilities was done. This time, the temperature of server 1 in rack 2 was monitored. The simulation started in its usual steady state. At several points in time, there would be steps in the input signals. At 500 s, all servers’ resource usage would go from 0% to 95%. At 1500 s, the airflow of CRAH 1 would be changed from 2.18 m³/s to 1.3 m³/s and at 3000 s, the same thing would happen with CRAH 2. At 4500 s, the outlet temperature of CRAH 1 would change from 18 °C to 27 °C and at 6000 s, the same thing would happen with CRAH 2. Finally, at 8500 s, all input signals would change to their original values. The results are shown in figure 11.

The simulation illustrates all the ways that the temperature of a server in rack 2 can be changed. The time it takes for the temperature to return to its original value is approximately 20 min, indicating that this is how long it would take for this server to return to normal temperatures after a cooling system failure.

[Figure 11 plot: temperature (°C) versus time (s)]

Figure 11: A simulation of the temperature of server 1 in rack 2 when its resource usage increases from 0% to 100% at 500 s, the airflow decreases from 2.18 m³/s to 1.3 m³/s for CRAH 1 at 1500 s and CRAH 2 at 3000 s, the output temperature increases from 18 °C to 27 °C for CRAH 1 at 4500 s and CRAH 2 at 6000 s, and all input signals are returned to their original values at 8500 s.

An important result revealed by the model validation tests in this section is that the time delays of the airflows from the CRAH units are much smaller than the settling times of the server temperatures. This means that they can safely be ignored when deriving the internal models for the controllers, described in sections 5 and 6.

3 Power usage of the cooling system

With the data center model developed, the next step is to model the power usage of the cooling system. The following subsections describe the derivation and validation of the data center’s power usage model.

3.1 Power usage model

The cooling power has two parts: the power used by the chiller supplying cold water to the CRAH, and the power used by the CRAH to maintain the airflow. The power usage of a CRAH can be estimated when its outlet temperature and airflow are known, as well as its inlet temperature. This is described in subsections 3.1.2 and 3.1.3. To simplify calculations, the inlet temperature is assumed to be the same for all CRAHs. Subsection 3.1.4 describes how to choose it appropriately.

3.1.1 Indices, parameters and variables

This chapter will reuse indices, parameters and variables introduced in table 1. It introduces the variables listed in table 6. As the parameters introduced in this chapter are not reused in equations other than those they are introduced in, they will not be listed in table 6, but will be given with explicit values and explanations when they are introduced.

Variable              Description
$T_{in}(t)$           Inlet temperature to the CRAH units at time $t$
$P_{i,cooling}(t)$    The heat removal rate of the chiller associated with CRAH $i$ at time $t$
$P_{i,fan}(t)$        The power usage of the fan of CRAH $i$ at time $t$

3.1.2 Power usage of the chiller

Firstly, an estimate of the power usage of the chiller will be derived. Let $T_{in}(t)$ be the inflow temperature and $T_{i,out}(t)$ be the supplied temperature of CRAH $i$ at time $t$. It is from here on assumed that $T_{in}(t)$ is greater than $T_{i,out}(t)$, since that will be the case in realistic situations where the output temperature of the CRAH does not change rapidly. If the CRAH shall supply air with a flow $a_i(t)$, the heat removal rate of its chiller must be

$$P_{i,cooling}(t) = c_p \rho a_i(t)\left(T_{in}(t) - T_{i,out}(t)\right). \tag{11}$$

The coefficient of performance (COP) of a chiller is the ratio of heat removed to the amount of work needed to remove that heat. To estimate the COP, a model is taken from [14]. This model has also been used in [24] and [25]. In this model, the COP is given as $COP(T_{i,out}) = 0.0068\,T_{i,out}^2 + 0.0008\,T_{i,out} + 0.458$. Then, the power consumed by the chiller at time $t$ in order to provide sufficient cooling to CRAH $i$ is estimated by

$$\frac{c_p \rho a_i(t)\left(T_{in}(t) - T_{i,out}(t)\right)}{COP(T_{i,out}(t))}.$$
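The chiller estimate above can be sketched in a few lines. The function names are illustrative, and the temperatures are assumed to be given in °C, as elsewhere in this chapter; the unit expected by the COP model is not restated in the text.

```python
C_P = 1007.0  # specific heat capacity of air, J/(kg K) (table 3)
RHO = 1.1614  # density of air, kg/m^3 (table 3)


def cop(t_out):
    """COP model quoted from [14]; t_out assumed to be in deg C."""
    return 0.0068 * t_out ** 2 + 0.0008 * t_out + 0.458


def heat_removal_rate(airflow, t_in, t_out):
    """Equation (11): heat removed by the chiller of one CRAH, in W."""
    return C_P * RHO * airflow * (t_in - t_out)


def chiller_power(airflow, t_in, t_out):
    """Electrical power needed to remove that heat, in W."""
    return heat_removal_rate(airflow, t_in, t_out) / cop(t_out)


# Example with a hypothetical CRAH inlet temperature of 25 deg C.
example_power = chiller_power(2.18, 25.0, 18.0)
```

Since the COP grows with the outlet temperature, supplying warmer air lowers the chiller power both through a smaller temperature difference and through a higher COP.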

3.1.3 Power usage of the fans

Having a model of the chiller’s power usage, the next step is to estimate that of the fans. The fan affinity laws state that the volumetric flow out of a fan is proportional to the rotation speed, and that the power usage is proportional to the rotation speed cubed [15]. To use this, a reference point with a given airflow and power usage is needed. According to the specifications for the SEE Cooler HDZ-2, an airflow of 2.18 m³/s corresponds to a power usage of 800 W. Therefore, with $a_i(t)$ being the airflow of CRAH $i$ at time $t$, the power usage in W at that time is

$$P_{i,fan}(t) = 800\left(\frac{a_i(t)}{2.18}\right)^3. \tag{12}$$
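Equation (12) translates directly into code. The sketch below uses the reference point stated in the text; the function and constant names are illustrative.

```python
REF_AIRFLOW = 2.18  # m^3/s, reference airflow from the CRAH specification
REF_POWER = 800.0   # W, fan power at the reference airflow


def fan_power(airflow):
    """Equation (12): cubic fan affinity law scaled to the reference point."""
    return REF_POWER * (airflow / REF_AIRFLOW) ** 3


power_at_min_airflow = fan_power(1.3)   # minimum airflow used in section 2.6.1
power_at_half_airflow = fan_power(1.09)  # half the reference airflow
```

The cubic dependence makes airflow reductions very effective: at the minimum airflow of 1.3 m³/s the fan draws about 170 W, and halving the airflow cuts the fan power to one eighth.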

3.1.4 Estimation of input temperature

What remains is to estimate $T_{in}(t)$. To simplify this, the air in the hot aisle is assumed to be perfectly mixed and the air entering the CRAHs will have the same temperature as that of the hot aisle. With the same kind of reasoning as in section 2.4.5, the temperature in the hot aisle is calculated as a weighted average of the temperatures of the airflows entering it. The weight of each temperature is the ratio of the corresponding airflow to the total emitted airflow of the CRAHs. This includes both airflows through the servers and airflows that bypass the racks. The latter kind of airflows are also assumed to have the same temperature as they had when leaving the CRAH unit. Time delays are applied to the bypassing airflows as well. With this and other previously stated assumptions, $T_{in}(t)$ can be calculated as

$$T_{in}(t) = \frac{\sum_{j=1}^{10}\sum_{k=1}^{18} A_{j,k}(t)\,T_{j,k,out}(t) + \sum_{i=1}^{4}\left(p + (1-p)\left(1 - \frac{\sum_{j=1}^{10} s_{i,j}}{s_{influence}}\right)\right) a_i(t - t_{d_i}(t))\,T_{i,out}(t - t_{d_i}(t))}{\sum_{i=1}^{4} a_i(t - t_{d_i}(t))}. \tag{13}$$

3.2 Validation of the power usage model

To see that the power computed by the power usage model is realistic, a validation simulation was performed. A particular consistency requirement for the power usage model is that at steady state, the power removed from the data center must equal the power emitted by the servers. Simulation results indicate that this is the case, and a mathematical proof of this property is presented at the end of this section.

3.2.1 Simulation results

Two simulations were performed in order to validate the power usage model. The scenario in both


power usage of the cooling system was monitored. The result can be seen in figure 12. The step response in power usage looks similar to the step response in rack outlet temperature. This is because when $T_{i,out}(t)$ and $a_i(t)$ are held constant, changes in cooling system power usage are proportional to changes in $T_{in}(t)$, which is a linear combination of the changes in $T_{j,k,out}(t)$.

[Figure 12 plot: power (W) versus time (s)]

Figure 12: Total power usage of the cooling system when the servers’ resource usage rises from 0% to 95% at 500 s.

A commonly used metric for data centers is the PUE, which is the total power usage of the data center divided by the power usage of the servers. Since the largest contributions of power usage come from the cooling system and the servers, the sum of them can be taken as the total data center power to estimate the PUE. Using the same data as above, a plot of the PUE could be created. This is shown in figure 13. The PUE shown is slightly lower than 1.7, the average PUE reported in a survey from 2014 [22]. However, if the neglected parts of the data center’s power usage contribute to 10% of the total power, then the PUE is initially 1.661, a value much closer to the average.
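The PUE estimate and the adjustment for neglected power can be written out as a short sketch. The function name and the `neglected_fraction` parameter are illustrative; the numeric inputs in the example are placeholders, not simulation results.

```python
def pue(cooling_power, server_power, neglected_fraction=0.0):
    """Estimate the PUE from cooling and server power (both in W).

    neglected_fraction is the share of the total data center power that is
    not covered by the cooling and server models (0 means nothing is neglected).
    """
    if server_power <= 0.0:
        raise ValueError("server_power must be positive")
    if not 0.0 <= neglected_fraction < 1.0:
        raise ValueError("neglected_fraction must lie in [0, 1)")
    modeled_total = cooling_power + server_power
    # If the modeled part is (1 - fraction) of the total, scale up accordingly.
    total = modeled_total / (1.0 - neglected_fraction)
    return total / server_power


# Placeholder numbers: cooling power equal to half the server power.
pue_modeled = pue(50.0, 100.0)
pue_adjusted = pue(50.0, 100.0, neglected_fraction=0.1)
```

The adjustment divides the modeled PUE by 0.9, so the quoted value of 1.661 corresponds to a modeled initial PUE of about 1.495.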

[Figure 13 plot: PUE versus time (s)]

Figure 13: PUE of the simulated data center when the servers’ resource usage rises from 0% to 95% at 500 s.

The second validation was to compare the rate of heat removal with the power emitted by the servers. The results can be seen in figure 14. Initially, the data center is in a steady state with the server power and heat removal rate equal to each other. After the step in the server power, the heat removal rate starts increasing as well, approaching the server power asymptotically. A comparison of figures 7 and 14 shows that as the temperature approaches a steady state, the heat

[Figure 14 plot: power (W) versus time (s), with curves for heat removal rate and server power usage]

Figure 14: A comparison of the power emitted by the servers and heat removal rate from the data center.

3.2.2 Proof of consistency

This section presents a proof of the fact that at steady state, the heat emission rate of the servers is equal to the heat removal rate of the CRAH units. Assume that the data center is in a steady state. For the remainder of this section, variables that are generally time-dependent will be written without explicitly indicating this time dependency, e.g. $T_{in}(t)$ and $a_i(t)$ will be written as $T_{in}$ and $a_i$. As seen from (11), the total rate of heat removal from the data center is

$$\sum_{i=1}^{4} P_{i,cooling} = \sum_{i=1}^{4} c_p \rho a_i (T_{in} - T_{i,out}). \tag{14}$$

Using the expression of $T_{in}$ from (13),

$$\sum_{i=1}^{4} P_{i,cooling} = c_p \rho \left( \sum_{j=1}^{10}\sum_{k=1}^{18} A_{j,k} T_{j,k,out} + \sum_{i=1}^{4}\left(p - 1 + (1-p)\left(1 - \frac{\sum_{j=1}^{10} s_{i,j}}{s_{influence}}\right)\right) a_i T_{i,out} \right), \tag{15}$$

which simplifies to

$$\sum_{i=1}^{4} P_{i,cooling} = c_p \rho \left( \sum_{j=1}^{10}\sum_{k=1}^{18} A_{j,k} T_{j,k,out} - \sum_{i=1}^{4}\sum_{j=1}^{10} (1-p)\frac{s_{i,j}}{s_{influence}}\, a_i T_{i,out} \right). \tag{16}$$

Using (3) to identify $a_{i,j}$ gives

$$\sum_{i=1}^{4} P_{i,cooling} = c_p \rho \left( \sum_{j=1}^{10}\sum_{k=1}^{18} A_{j,k} T_{j,k,out} - \sum_{i=1}^{4}\sum_{j=1}^{10} a_{i,j} T_{i,out} \right). \tag{17}$$

By (6), $T_{j,k,in}\sum_{i=1}^{4} a_{i,j} = \sum_{i=1}^{4} a_{i,j} T_{i,j}$, so

$$\sum_{i=1}^{4} P_{i,cooling} = c_p \rho \left( \sum_{j=1}^{10}\sum_{k=1}^{18} A_{j,k} T_{j,k,out} - \sum_{j=1}^{10} T_{j,k,in} \sum_{i=1}^{4} a_{i,j} \right). \tag{18}$$

Using (4) to identify $A_{j,k}$ and simplifying further yields

$$\sum_{i=1}^{4} P_{i,cooling} = \sum_{j=1}^{10}\sum_{k=1}^{18} c_p \rho A_{j,k} (T_{j,k,out} - T_{j,k,in}). \tag{19}$$

In steady state, the left hand side of (7) must be 0, which implies that $P_{j,k} = c_p \rho A_{j,k}(T_{j,k,out} - T_{j,k,in})$. Therefore, steady state implies

$$\sum_{i=1}^{4} P_{i,cooling} = \sum_{j=1}^{10}\sum_{k=1}^{18} P_{j,k}. \tag{20}$$

As seen above, the total heat removal rate of the CRAH units is equal to the total heat emission rate of the servers. This concludes the proof.
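The identity (20) can also be checked numerically. The sketch below builds a steady state from arbitrary inputs and compares the heat removed with the heat emitted. It assumes, as read off from the step (16) to (17) and from the coefficient in the server temperature model, that $a_{i,j} = (1-p)\frac{s_{i,j}}{s_{influence}} a_i$ and that the rack airflow is split evenly over its 18 servers. All input values are hypothetical.

```python
C_P = 1007.0           # J/(kg K), table 3
RHO = 1.1614           # kg/m^3, table 3
P_BYPASS = 0.25        # bypass fraction p, table 3
S_INFLUENCE = 5.21375  # m^2, table 3
N_CRAH, N_RACK, N_SERVER = 4, 10, 18


def steady_state_balance(a, t_out, s, power):
    """Return (heat removed by the CRAHs, heat emitted by the servers) in W.

    a[i]: CRAH airflow, t_out[i]: CRAH outlet temperature,
    s[i][j]: the quantity s_ij of the model, power[j][k]: server power.
    """
    # Airflow from CRAH i reaching rack j (assumed form, see lead-in).
    a_ij = [[(1.0 - P_BYPASS) * s[i][j] / S_INFLUENCE * a[i]
             for j in range(N_RACK)] for i in range(N_CRAH)]

    weighted_sum = 0.0  # sum over all servers of A_jk * T_jk_out
    for j in range(N_RACK):
        rack_flow = sum(a_ij[i][j] for i in range(N_CRAH))
        rack_t_in = sum(a_ij[i][j] * t_out[i] for i in range(N_CRAH)) / rack_flow
        server_flow = rack_flow / N_SERVER
        for k in range(N_SERVER):
            # Steady state of one server: P = c_p * rho * A * (T_out - T_in).
            server_t_out = rack_t_in + power[j][k] / (C_P * RHO * server_flow)
            weighted_sum += server_flow * server_t_out

    # CRAH inlet temperature from equation (13), without time delays.
    bypass_sum = sum(
        (P_BYPASS + (1.0 - P_BYPASS) * (1.0 - sum(s[i]) / S_INFLUENCE))
        * a[i] * t_out[i] for i in range(N_CRAH))
    t_in = (weighted_sum + bypass_sum) / sum(a)

    removed = sum(C_P * RHO * a[i] * (t_in - t_out[i]) for i in range(N_CRAH))
    emitted = sum(sum(row) for row in power)
    return removed, emitted


# Hypothetical inputs, chosen only to exercise the identity.
airflows = [2.18, 1.3, 1.74, 2.0]
outlet_temps = [18.0, 27.0, 22.5, 20.0]
areas = [[0.07 * S_INFLUENCE * (1.0 + ((i + j) % 3) / 4.0) for j in range(N_RACK)]
         for i in range(N_CRAH)]
powers = [[130.0 + 140.0 * ((j + k) % 5) / 4.0 for k in range(N_SERVER)]
          for j in range(N_RACK)]

removed, emitted = steady_state_balance(airflows, outlet_temps, areas, powers)
```

The two returned values agree up to floating point rounding, regardless of the chosen airflows, outlet temperatures and server powers.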

3.3 Suggestions for model improvements

When estimating the power usage as was done here, some complications have been overlooked. They are brought up here and can serve as suggestions for further work.

Firstly, the COP should in general not depend only on $T_{i,out}(t)$, but on $T_{in}(t)$ as well. More specifically, it should depend on the difference between $T_{i,out}(t)$ and $T_{in}(t)$, decreasing when the difference increases. In [14], ignoring the dependency on $T_{in}(t)$ was justified by the fact that there was only one cooling unit in the data center, so that if its output temperature changed by a certain amount, its input temperature would have changed by the same amount in a steady state, keeping $T_{in}(t) - T_{i,out}(t)$ constant for steady states. With four CRAH units, this is no longer true. Another issue with using the model from [14] is that individual cooling units have different COP.

This suggests that an improvement of the COP estimation would be to use a model adapted to real data from the modeled data center, with dependencies on both $T_{in}(t)$ and $T_{i,out}(t)$. Furthermore, instead of having one inlet temperature for all CRAHs, different temperatures should be estimated for improved accuracy.

Finally, one important component of the cooling system power usage was left out of the model: the internal fans of the servers. They are controlled by the IT equipment’s internal controllers, and depending on how they work, their power usage could be affected by the external airflow and temperature.

4 Preliminary comments to the controller chapters

With test models of the servers’ temperatures and the cooling system’s power usage, controllers for the data center could be designed. Two types of controllers were considered: LQR controllers, described in chapter 5, and MPC controllers, described in chapter 6. After that, chapter 7 will be dedicated to describing tests comparing the performance of the controllers.

The internal models of the controllers will be based on the previously derived models, but with some differences. As previously mentioned, time delays will be ignored when designing the controllers. Furthermore, the controller models will only include half of the data center and will be expressed in terms of variables associated with racks 1 to 5 and CRAH-units 1 and 2.

The two types of controllers will be designed with respect to different objectives. While the LQR controllers aim to keep both control and controlled variables near certain setpoints, the MPC aims to minimize an estimate of the cooling system’s power usage. While it is desirable to minimize energy usage with the LQR controllers too, it is not possible to express the energy usage of the cooling system as a quadratic function, which would be needed in order to minimize it explicitly with an LQR controller. An MPC controller using an objective function similar to that of the LQR controllers could have been designed for the purpose of having a fairer comparison between the two control design frameworks. However, this was not done in this thesis, as the main purpose of developing an MPC was to explicitly solve the problem of energy minimization under constraints on control and controlled variables.

Because of symmetry of the data center and the independence of the variables associated with its two halves, controllers with no terms in their objective functions depending on variables associated with both halves of the data center will be mathematically equivalent to two independent controllers. For the LQR controllers, this is the case. For the MPC controllers, it is not, but in order to reduce computational complexity, the data center is assumed to have the same power usage and initial server temperature distribution in both of its halves, which also means that the controller outputs in both halves are going to be equal. The generalization in order to consider the whole data center is straightforward.

5 LQR control

With the LQR controller, the objective is to keep both the server temperatures and the controller outputs at certain reference points. Five LQR controllers were developed: two of them using CRAH outlet temperature as the control variable while keeping the airflow constant, two varying the CRAH airflow with constant outlet temperature and one using both airflow and temperature as control variables.

In each pair of controllers sharing the same control variable, one is a classical LQR controller and the other an LQI controller. The controller using both airflow and temperature is also an LQI controller. The distinction between these types of controllers, along with other background theory, will be described in subsection 5.1. After that, subsection 5.2 will describe the derivations of the LQR controllers’ internal models. Finally, subsection 5.3 describes how the LQR controllers were tuned.

5.1 LQR Theory

LQR controllers aim to optimally control a system defined by the linear state-space equations

$$\dot{x} = Ax + Bu, \qquad y = Cx, \tag{21}$$

where $x$ is the vector of state-space variables, $u$ is the controller output and $y$ is the observed output of the system. The LQR controllers considered here are infinite horizon controllers for linear, time-invariant systems, i.e. they solve the problem

$$\min_u \int_0^\infty \left(x^T Q x + u^T R u\right) dt \quad \text{s.t.} \quad \dot{x} = Ax + Bu, \tag{22}$$

where $A$, $B$, $Q$ and $R$ are constant matrices, $R$ is positive definite and $Q$ is positive semidefinite. If $Q$ can be written as $C^T C$ such that $(A, B, C)$ describes an observable and reachable system, the problem defined in (22) has a solution $u = -R^{-1} B^T P x$, where $P$ is the unique positive definite solution of the algebraic Riccati equation [29]

$$A^T P + P A - P B R^{-1} B^T P + Q = 0. \tag{23}$$
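For a system with a single state and a single input, the Riccati equation (23) reduces to a quadratic equation that can be solved by hand, which makes the structure of the LQR solution easy to inspect. The sketch below covers only this scalar case and is not the matrix solver used for the controllers in this thesis; the numerical values are hypothetical.

```python
import math


def scalar_lqr(a, b, q, r):
    """Solve the scalar LQR problem for dx/dt = a*x + b*u.

    Returns (p, gain), where p is the positive solution of the scalar
    Riccati equation 2*a*p - p**2 * b**2 / r + q = 0 and u = -gain * x.
    """
    if b == 0.0 or q <= 0.0 or r <= 0.0:
        raise ValueError("requires b != 0, q > 0 and r > 0")
    root = math.sqrt(a * a + b * b * q / r)
    p = r * (a + root) / (b * b)
    gain = b * p / r  # scalar version of R^{-1} B^T P
    return p, gain


def riccati_residual(a, b, q, r, p):
    """Left hand side of the scalar Riccati equation; zero at the solution."""
    return 2.0 * a * p - p * p * b * b / r + q


# Hypothetical stable first-order system with time constant 250 s.
a, b, q, r = -0.004, 0.002, 1.0, 1.0
p, gain = scalar_lqr(a, b, q, r)
closed_loop_pole = a - b * gain
```

In this scalar case the closed-loop pole equals $-\sqrt{a^2 + b^2 q / r}$, so increasing $q$ relative to $r$ always makes the closed loop faster, which is the mechanism exploited when tuning the controllers.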

If equations (21) accurately describe the controlled system, the LQR controller will drive the system towards the steady state $x = 0$. In practice, model inaccuracies may cause $x = 0$ to not be a steady state of the true controlled system. In that case, the LQR controller can still have a stabilizing effect on the system, but make it settle in a state other than $x = 0$. An extension that fixes this problem is the LQI controller [30]. The LQI controller extends the state space with $v(t) = \int_0^t C x(s)\,ds$, yielding the new state space equations

$$\frac{d}{dt}\begin{bmatrix} x \\ v \end{bmatrix} = \begin{bmatrix} A & 0 \\ C & 0 \end{bmatrix}\begin{bmatrix} x \\ v \end{bmatrix} + \begin{bmatrix} B \\ 0 \end{bmatrix} u, \qquad y' = \begin{bmatrix} C & 0 \\ 0 & I \end{bmatrix}\begin{bmatrix} x \\ v \end{bmatrix}. \tag{24}$$

The feedback law for the LQI-controller is constructed from (24) analogously to how the feedback

law for the LQR controller was constructed from (21). Since the integral variables v will keep

increasing or decreasing until the system reaches a state such that Cx = 0, the LQI extension ensures that the system does not get stuck in an undesirable state.
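The augmented matrices in (24) can be assembled mechanically from $A$, $B$ and $C$. The following sketch uses plain nested lists; the function name and the example matrices are illustrative only.

```python
def lqi_augment(A, B, C):
    """Build the augmented state and input matrices of equation (24).

    A is n x n, B is n x m and C is p x n, all given as lists of rows.
    Returns (A_aug, B_aug) for the state [x; v] with dv/dt = C x.
    """
    n, m, p = len(A), len(B[0]), len(C)
    # Upper block rows: [A 0]; lower block rows: [C 0].
    A_aug = [list(A[i]) + [0.0] * p for i in range(n)]
    A_aug += [list(C[i]) + [0.0] * p for i in range(p)]
    # The input acts only on the original states.
    B_aug = [list(B[i]) for i in range(n)] + [[0.0] * m for _ in range(p)]
    return A_aug, B_aug


# Illustrative 2-state, 2-input system with both states integrated.
A_example = [[-1.0, 0.0], [0.0, -2.0]]
B_example = [[1.0, 0.5], [0.2, 1.0]]
C_example = [[1.0, 0.0], [0.0, 1.0]]
A_aug, B_aug = lqi_augment(A_example, B_example, C_example)
```

The LQR gain is then computed for the pair `(A_aug, B_aug)` with a cost matrix that also weights the integral states.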

5.2 LQR formulations

The LQR model formulations will be derived from the model described in chapter 2. As it was seen at the end of section 2.6.2 that the time delays were insignificant on the time scale of the data center’s dynamics, time delays will be ignored in this derivation. Since the variables will then all depend on the same time, and the LQR models will be time-invariant, the time dependency of the variables will not be written out explicitly. Any notation not introduced in this subsection is described in table 1. From equations (9), the explicit relation between $T_{j,k,out}$, $P_{j,k}$, $a_1$, $a_2$, $u_{j,k}$, $T_{1,out}$ and $T_{2,out}$ can be written as

$$\frac{dT_{j,k,out}}{dt} = \frac{p_{idle} + (p_{peak} - p_{idle}) u_{j,k}}{C_{th}} + \frac{c_p \rho (1-p)}{18\, C_{th}\, s_{influence}} \sum_{i=1}^{2} a_i s_{i,j} (T_{i,out} - T_{j,k,out}) = f(u_{j,k}, a_1, a_2, T_{1,out}, T_{2,out}, T_{j,k,out}). \tag{25}$$

To obtain models for LQR controllers, the equation must be linearized around an equilibrium point. The following notation convention will be used: Let $v$ be any variable. Then $v_0$ denotes the value of that variable around which the model is linearized, and $\Delta v = v - v_0$. The equilibrium point, $(u_{j,k,0}, a_{1,0}, a_{2,0}, T_{1,out,0}, T_{2,out,0}, T_{j,k,out,0})$, is found by fixing $u_{j,k,0}$, $a_{1,0}$, $a_{2,0}$, $T_{1,out,0}$ and $T_{2,out,0}$ and solving equation (26) for $T_{j,k,out,0}$.

$$f(u_{j,k,0}, a_{1,0}, a_{2,0}, T_{1,out,0}, T_{2,out,0}, T_{j,k,out,0}) = \frac{p_{idle} + (p_{peak} - p_{idle}) u_{j,k,0}}{C_{th}} + \frac{c_p \rho (1-p)}{18\, C_{th}\, s_{influence}} \sum_{i=1}^{2} a_{i,0} s_{i,j} (T_{i,out,0} - T_{j,k,out,0}) = 0. \tag{26}$$

A general linearized model around the equilibrium point can be expressed as

$$\frac{d\Delta T_{j,k,out}}{dt} = \frac{\partial f}{\partial u_{j,k}} \Delta u_{j,k} + \frac{\partial f}{\partial a_1} \Delta a_1 + \frac{\partial f}{\partial a_2} \Delta a_2 + \frac{\partial f}{\partial T_{1,out}} \Delta T_{1,out} + \frac{\partial f}{\partial T_{2,out}} \Delta T_{2,out} + \frac{\partial f}{\partial T_{j,k,out}} \Delta T_{j,k,out}, \tag{27}$$

where $\frac{\partial f}{\partial v}$ denotes the partial derivative of $f$ (defined in (25)) with respect to a variable $v$, evaluated at $(u_{j,k,0}, a_{1,0}, a_{2,0}, T_{1,out,0}, T_{2,out,0}, T_{j,k,out,0})$. For all LQR formulations, it is assumed that the input variables not used to control the data center are constant, so $\Delta u_{j,k}$ and, for some controllers, either $\Delta a_1$ and $\Delta a_2$ or $\Delta T_{1,out}$ and $\Delta T_{2,out}$ are set to 0. To ensure controllability, the number of controlled variables is reduced to 2: $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$. For simplicity, they are modeled as the average server temperature deviations from a given setpoint of racks 1 and 2, and racks 3, 4 and 5 respectively, i.e.

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} \frac{1}{36} \sum_{j=1}^{2} \sum_{k=1}^{18} \Delta T_{j,k,out} \\ \frac{1}{54} \sum_{j=3}^{5} \sum_{k=1}^{18} \Delta T_{j,k,out} \end{bmatrix}. \tag{28}$$
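The equilibrium condition (26) is linear in $T_{j,k,out,0}$ and can therefore be solved in closed form. The sketch below does this for one server and checks that the right hand side of (25) vanishes at the result. The constants are taken from table 3; the values used for $s_{i,j}$ are hypothetical placeholders, since the actual values are given elsewhere in the thesis.

```python
C_P, RHO, C_TH = 1007.0, 1.1614, 12000.0  # table 3
P_BYPASS, S_INFLUENCE = 0.25, 5.21375     # table 3
P_IDLE, P_PEAK = 130.0, 270.0             # W, table 3
N_SERVER = 18

# Common factor c_p * rho * (1 - p) / (18 * s_influence) from equation (25).
GAIN = C_P * RHO * (1.0 - P_BYPASS) / (N_SERVER * S_INFLUENCE)


def server_power(u):
    """Server power in W for resource usage u in [0, 1]."""
    return P_IDLE + (P_PEAK - P_IDLE) * u


def f(u, a, t_crah, s_j, t_server):
    """Right hand side of equation (25) for one server in rack j.

    a, t_crah and s_j are sequences over the CRAH index i = 1, 2.
    """
    exchange = sum(a_i * s_ij * (t_i - t_server)
                   for a_i, s_ij, t_i in zip(a, s_j, t_crah))
    return server_power(u) / C_TH + GAIN / C_TH * exchange


def equilibrium_temperature(u, a, t_crah, s_j):
    """Solve equation (26) for the server outlet temperature."""
    flow = sum(a_i * s_ij for a_i, s_ij in zip(a, s_j))
    mixed = sum(a_i * s_ij * t_i for a_i, s_ij, t_i in zip(a, s_j, t_crah)) / flow
    return mixed + server_power(u) / (GAIN * flow)


# Linearization inputs from section 5.3.1; s_j is a hypothetical placeholder.
a0, t0, s_j = [1.74, 1.74], [22.5, 22.5], [0.5, 0.3]
t_server0 = equilibrium_temperature(0.5, a0, t0, s_j)
```

A resource usage of 0.5 corresponds to a server power of 200 W with the idle and peak powers from table 3.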

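Considering everything above, the different LQR state space formulations, on the form described in (22), can now be stated. For the temperature LQR controller, the state space model is

$$\dot{x} = Ax + B_{temp} \begin{bmatrix} \Delta T_{1,out} \\ \Delta T_{2,out} \end{bmatrix}, \tag{29}$$

where

$$A = \frac{c_p \rho (1-p)}{18\, C_{th}\, s_{influence}} \begin{bmatrix} -\frac{1}{2} \sum_{j=1}^{2} \sum_{i=1}^{2} a_{i,0} s_{i,j} & 0 \\ 0 & -\frac{1}{3} \sum_{j=3}^{5} \sum_{i=1}^{2} a_{i,0} s_{i,j} \end{bmatrix} \tag{30}$$

and

$$B_{temp} = \frac{c_p \rho (1-p)}{18\, C_{th}\, s_{influence}} \begin{bmatrix} \frac{1}{2} \sum_{j=1}^{2} a_{1,0} s_{1,j} & \frac{1}{2} \sum_{j=1}^{2} a_{2,0} s_{2,j} \\ \frac{1}{3} \sum_{j=3}^{5} a_{1,0} s_{1,j} & \frac{1}{3} \sum_{j=3}^{5} a_{2,0} s_{2,j} \end{bmatrix}. \tag{31}$$

For the airflow LQR controller, the state space model is

$$\dot{x} = Ax + B_{air} \begin{bmatrix} \Delta a_1 \\ \Delta a_2 \end{bmatrix} \tag{32}$$

with

$$B_{air} = \frac{c_p \rho (1-p)}{18\, C_{th}\, s_{influence}} \begin{bmatrix} \frac{1}{36} \sum_{j=1}^{2} \sum_{k=1}^{18} s_{1,j} (T_{1,out,0} - T_{j,k,out,0}) & \frac{1}{36} \sum_{j=1}^{2} \sum_{k=1}^{18} s_{2,j} (T_{2,out,0} - T_{j,k,out,0}) \\ \frac{1}{54} \sum_{j=3}^{5} \sum_{k=1}^{18} s_{1,j} (T_{1,out,0} - T_{j,k,out,0}) & \frac{1}{54} \sum_{j=3}^{5} \sum_{k=1}^{18} s_{2,j} (T_{2,out,0} - T_{j,k,out,0}) \end{bmatrix}. \tag{33}$$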

For the two LQI controllers that use either airflow or temperature, the state space is extended from the above models with integrals of $x_1$ and $x_2$ as described in section 5.1. For the LQI controller that uses both airflow and outlet temperature, the state space model is the LQI extension of the model in equation (34):

$$\dot{x} = Ax + B_{temp} \begin{bmatrix} \Delta T_{1,out} \\ \Delta T_{2,out} \end{bmatrix} + B_{air} \begin{bmatrix} \Delta a_1 \\ \Delta a_2 \end{bmatrix}. \tag{34}$$

As for the matrices in the cost functions, $R$ is chosen to be the identity matrix for the controllers that use only one of temperature and airflow. For the controller that uses both outlet temperature and airflow,

$$R = \begin{bmatrix} r_{temperature} & 0 & 0 & 0 \\ 0 & r_{temperature} & 0 & 0 \\ 0 & 0 & r_{airflow} & 0 \\ 0 & 0 & 0 & r_{airflow} \end{bmatrix}, \tag{35}$$

where $r_{temperature}$ and $r_{airflow}$ are tuning parameters. For the classical LQR controllers, the cost matrix for the state variables is of the form

$$Q_{LQR} = q \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}, \tag{36}$$

and for the LQI controllers, it is

$$Q_{LQI} = \begin{bmatrix} 2q & 0 & 0 & 0 \\ 0 & 3q & 0 & 0 \\ 0 & 0 & 2q_{int} & 0 \\ 0 & 0 & 0 & 3q_{int} \end{bmatrix}, \tag{37}$$

where $q$ and $q_{int}$ are tuning parameters. The factors 2 and 3 come from the fact that $x_1$ is the temperature average of 2 racks, while $x_2$ is the temperature average of 3 racks. The procedure of choosing the tuning parameters will be explained in the next subsection.
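The cost matrices (35) to (37) are all diagonal, so they can be assembled from the tuning parameters with a small helper. This is an illustrative sketch; the function names are not from the thesis, and the example values are those listed in table 7 for the LQI controller using both airflow and temperature.

```python
def diag(values):
    """Return a square diagonal matrix as a list of rows."""
    return [[v if i == j else 0.0 for j in range(len(values))]
            for i, v in enumerate(values)]


def q_lqr(q):
    """Equation (36): state cost for the classical LQR controllers."""
    return diag([2.0 * q, 3.0 * q])


def q_lqi(q, q_int):
    """Equation (37): state cost for the LQI controllers."""
    return diag([2.0 * q, 3.0 * q, 2.0 * q_int, 3.0 * q_int])


def r_combined(r_temperature, r_airflow):
    """Equation (35): input cost when both temperature and airflow are used."""
    return diag([r_temperature, r_temperature, r_airflow, r_airflow])


# Tuning parameters of the combined LQI controller, as listed in table 7.
Q = q_lqi(q=1.2, q_int=2e-5)
R = r_combined(r_temperature=1.0, r_airflow=100.0)
```

The ordering of the diagonal of `R` follows (35), with the two temperature inputs first and the two airflow inputs last.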

5.3 Tuning

To completely define the LQR controllers, what remains is to state the points around which the data center model was linearized, and the tuning parameters in the LQR cost functions. In addition, a signal saturation was used for the LQI controller that uses both outlet airflow and temperature.

5.3.1 Linearization point

The linearization points were chosen by considering that if there are upper and lower bounds on the input variables, a reasonable choice is to linearize the model around the average of these bounds. This maximizes the minimum difference between the linearization point of a variable and any of its bounds. In section 2.6.1, upper and lower bounds on the control variables were given: 2.18 m³/s and 1.3 m³/s for the airflow and 27 °C and 18 °C for the temperature. The linearization points were thus 22.5 °C for the temperatures and 1.74 m³/s for the airflows.

The power usage at the linearization point was assumed to be 200 W for all servers. With the input variables fixed, the server temperatures to linearize around were determined by solving the equilibrium condition (26). Taking the mean of the steady state server temperatures for racks 1 and 2, and for racks 3, 4 and 5, gave the default reference point for the LQR controllers, i.e. the values subtracted from the mean temperatures obtained from the data center model to get $x_1$ and $x_2$ as defined in the previous section. Other reference points were used in other experiments, but if nothing else is mentioned, the experiments used these reference points: 29.4593 °C for the mean temperature of racks 1 and 2, and 31.4854 °C for that of racks 3, 4 and 5.

5.3.2 Signal saturation

The LQI controller using both temperature and airflow as control variables was constructed after some experiments had been performed with the other controllers. In these experiments, some of the controllers would output airflows or temperatures outside of the bounds decided on in section 2.6.1. To limit the outputs of the new controller, a signal saturation was connected to it. Whenever the LQI controller would call for an airflow or temperature outside of the upper or lower bounds, the signal saturation would clip this control output to the closest bound. The signal saturation was used in all experiments, both the tuning experiments and the ones described in section 7.
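The saturation amounts to clipping each control signal to its interval. The sketch below uses the bounds from section 2.6.1 and assumes that the signals are absolute values, not deviations from the linearization point; the function names and the ordering of the arguments are illustrative.

```python
AIRFLOW_BOUNDS = (1.3, 2.18)      # m^3/s, from section 2.6.1
TEMPERATURE_BOUNDS = (18.0, 27.0)  # deg C, from section 2.6.1


def saturate(value, lower, upper):
    """Clip value to the closest bound of [lower, upper]."""
    return min(max(value, lower), upper)


def saturate_control(t1_out, t2_out, a1, a2):
    """Clip the four control signals of the combined controller."""
    return (saturate(t1_out, *TEMPERATURE_BOUNDS),
            saturate(t2_out, *TEMPERATURE_BOUNDS),
            saturate(a1, *AIRFLOW_BOUNDS),
            saturate(a2, *AIRFLOW_BOUNDS))


# A request outside the bounds is moved to the nearest admissible value.
clipped = saturate_control(16.0, 22.5, 2.5, 1.74)
```

Signals already inside their bounds pass through unchanged, so the saturation only alters the control law when a bound is hit.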

5.3.3 Tuning experiments

The tuning parameters were determined by the heuristic that the controller should drive the system to a point twice as quickly as the open loop system would move there given a step input in power usage. More specifically, the time constant for the response in the first case should be half of that in the second case. This criterion was chosen in order to make sure that the controller would make the system respond significantly faster to inputs, but also to limit the aggressiveness of the controller.

Two types of tuning experiments were performed. First, an open loop simulation was done. The initial state of the data center was the linearization point described above, but the power usage was instantly set to 100%. The server temperatures were given some time to settle, and at simulation time t = 3000 s the power usage of all servers was dropped to 50% again. The total simulation time was 6000 s. The measured outputs were the mean temperature deviation of racks 1 and 2 and that of racks 3, 4 and 5, i.e. $x_1$ and $x_2$. Let $\tau_{1,rise}$ and $\tau_{1,fall}$ be the time constants for the response of $x_1$ to the increase and decrease in power usage respectively, and let $\tau_{2,rise}$ and $\tau_{2,fall}$ be the corresponding time constants for $x_2$. These time constants were determined by solving

$$\begin{aligned}
x_1(\tau_{1,rise}) &= e^{-1} x_1(0) + (1 - e^{-1}) x_1(3000), \\
x_1(\tau_{1,fall} + 3000) &= e^{-1} x_1(3000) + (1 - e^{-1}) x_1(6000), \quad \tau_{1,fall} > 0, \\
x_2(\tau_{2,rise}) &= e^{-1} x_2(0) + (1 - e^{-1}) x_2(3000), \\
x_2(\tau_{2,fall} + 3000) &= e^{-1} x_2(3000) + (1 - e^{-1}) x_2(6000), \quad \tau_{2,fall} > 0.
\end{aligned} \tag{38}$$

As the outputs of the data center model were discrete time-series, the equations above were solved

approximately by finding the values of x1 and x2 in the time series that would minimize the

absolute value of the difference between the left and right sides of the equations above.
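The approximate solution of (38) on a sampled signal can be sketched as follows. The function name is illustrative, and the sketch assumes that the time series contains samples at the start and end of the considered window; the test signal is synthetic, not simulation output.

```python
import math


def estimate_time_constant(times, values, t_start, t_end):
    """Estimate a time constant from a sampled step response, as in (38).

    The target level is e^-1 * x(t_start) + (1 - e^-1) * x(t_end). The
    estimate is the time offset of the sample closest to that level.
    """
    samples = [(t, x) for t, x in zip(times, values) if t_start <= t <= t_end]
    x_start, x_end = samples[0][1], samples[-1][1]
    target = math.exp(-1.0) * x_start + (1.0 - math.exp(-1.0)) * x_end
    # Skip the first sample so that the estimated time constant is positive.
    best_time, _ = min(samples[1:], key=lambda s: abs(s[1] - target))
    return best_time - t_start


# Synthetic first-order response with a known time constant of 439 s.
times = [float(t) for t in range(0, 3001)]
values = [10.0 * (1.0 - math.exp(-t / 439.0)) for t in times]
tau_estimate = estimate_time_constant(times, values, 0.0, 3000.0)
```

Because the response has not fully settled at the end of the window, the estimate is slightly smaller than the true time constant, but the difference is small when the window spans several time constants.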

With values of the time constants and of $x_1$ and $x_2$ at t = 3000 s, the next phase of the tuning experiments could begin. The experiment was similar to that for the open loop system, but with the controller connected, and the step input was in the controller’s reference signal instead of in the power usage. From t = 0 s to t = 3000 s, the values of $x_1(3000)$ and $x_2(3000)$ from the previous experiment would be added to the usual reference signal of the LQR controller. Time constants were computed in the same way as for the first tuning experiments.

The parameters of the different LQR controllers were adjusted by trial and error until one of the time constants of the closed-loop experiment was equal, within ±1 s, to half of the corresponding time constant in the open-loop experiment, and the other time constants were less than half of their open-loop counterparts. The results of the tuning experiments are shown in Table 7.

For the LQI controllers, $q$ was chosen significantly larger than $q_{\mathrm{int}}$. Despite this, $q_{\mathrm{int}}$ had a great impact on the controller behavior, as can be seen from the very different behavior of the LQI and LQR controllers.


| Controller | τ1,rise (s) | τ2,rise (s) | τ1,fall (s) | τ2,fall (s) | q | q_int | r_airflow | r_temperature |
|---|---|---|---|---|---|---|---|---|
| Open loop | 439 | 537 | 439 | 537 | - | - | - | - |
| LQI airflow | 219 | 171 | 206 | 142 | 0.087 | 5 × 10^-7 | - | - |
| LQR airflow | 219 | 245 | 206 | 218 | 0.056 | - | - | - |
| LQI temperature | 219 | 210 | 219 | 210 | 0.2 | 1.7 × 10^-5 | - | - |
| LQR temperature | 185 | 269 | 185 | 269 | 1.01 | - | - | - |
| LQI airflow & temperature | 219 | 242 | 167 | 151 | 1.2 | 2 × 10^-5 | 100 | 1 |

Table 7: Time constants for the LQR tuning experiments and the chosen tuning parameters giving those time constants.
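As an illustration of the stopping criterion used in the trial-and-error tuning, the check below applies it to the time constants reported in Table 7. It is a sketch under one interpretation of the criterion: at least one closed-loop time constant lies within ±1 s of half its open-loop value, and every other one is either within that tolerance or below half. The function name `meets_tuning_criterion` is hypothetical.

```python
def meets_tuning_criterion(open_loop, closed_loop, tol=1.0):
    """Check the half-time-constant heuristic for one controller.

    open_loop and closed_loop list the time constants in the order
    (tau_1_rise, tau_2_rise, tau_1_fall, tau_2_fall), in seconds.
    """
    halves = [0.5 * t for t in open_loop]
    # Which closed-loop constants hit half the open-loop value within tol?
    at_target = [abs(c - h) <= tol for c, h in zip(closed_loop, halves)]
    # Which closed-loop constants are strictly faster than the target?
    below = [c < h for c, h in zip(closed_loop, halves)]
    return any(at_target) and all(a or b for a, b in zip(at_target, below))

# Time constants taken from Table 7.
open_loop = [439, 537, 439, 537]
controllers = {
    "LQI airflow": [219, 171, 206, 142],
    "LQR airflow": [219, 245, 206, 218],
    "LQI temperature": [219, 210, 219, 210],
    "LQR temperature": [185, 269, 185, 269],
    "LQI airflow & temperature": [219, 242, 167, 151],
}
results = {name: meets_tuning_criterion(open_loop, taus)
           for name, taus in controllers.items()}
for name, ok in results.items():
    print(name, ok)
```

Note that for the LQR temperature controller it is the $x_2$ time constants (269 s against a target of 268.5 s) that satisfy the equality, while for the other controllers it is $\tau_{1,\mathrm{rise}}$.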

6 MPC control

With the MPC controller, the objective is to minimize the energy usage of the CRAH units over a given prediction horizon, while satisfying upper and lower bounds on the CRAH outlet airflow and temperature, and on the server temperatures. To simplify this problem, time-discrete versions of the models in sections 2 and 3 are used. Time delays are not considered, in accordance with the conclusion at the end of section 2.6.2. The optimization variables are the CRAH outlet airflow and temperature evaluated at the discrete points in time. The table below presents the parameters and variables used in the time-discrete formulation. For parameters not listed in this table, see the tables in section 2.

| Index | Description |
|---|---|
| n | Index of the time steps, n ∈ {0, ..., N} |

| Parameter | Description |
|---|---|
| N | Number of time steps for which the server temperatures are predicted |
| Δt | Length of a time step |
| t_n | The time at n time steps from the beginning of the prediction horizon, t_n = nΔt |
| u_{j,k} | The resource usage of server k in rack j. It is assumed to remain constant throughout the MPC's prediction horizon. |
| T_{j,k,n} | The time-discrete approximation of the temperature of server k in rack j at time t_n |
| a_{i,n} | The outlet airflow from CRAH unit i at time t_n |
| T_{i,n,out} | The outlet temperature of CRAH unit i at time t_n |

Table 8: A table showing the variables in the time-discrete formulation of the data center model.

6.1 Overview of MPC

MPC (Model Predictive Control) is a term for a broad range of control strategies with the following characteristics [28]:

• Explicit use of a model to predict the state of the controlled system at discrete points in time (the prediction horizon).

• Calculation of a sequence of control signals, one for each time point in the prediction horizon, that minimizes some objective function.

• A receding prediction horizon. The first control signal of the calculated sequence is used as input to the system and then the calculations are redone with the horizon displaced one step into the future.
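The three characteristics listed above can be illustrated with a toy receding-horizon loop. This is a deliberately simplified sketch and not the data center controller: the scalar model `x[n+1] = A*x[n] + B*u[n]`, the constants `A`, `B`, `R`, the quadratic cost and the finite set of candidate inputs are all assumptions made for the example, and exhaustive enumeration stands in for a proper optimization solver.

```python
from itertools import product

# Hypothetical scalar model and cost weight (not taken from the thesis).
A, B, R = 0.9, 0.5, 0.1

def mpc_step(x, horizon, inputs):
    """Return the first input of the cheapest input sequence over the horizon."""
    best_cost, best_seq = float("inf"), None
    # Enumerate every input sequence; the finite input set also acts as
    # upper and lower bounds on the control signal.
    for seq in product(inputs, repeat=horizon):
        xk, cost = x, 0.0
        for u in seq:
            xk = A * xk + B * u             # model-based prediction
            cost += xk ** 2 + R * u ** 2    # state deviation plus input effort
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq[0]

def simulate(x0, steps, horizon=4, inputs=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Run the receding-horizon loop: optimize, apply first input, shift."""
    x, trajectory = x0, [x0]
    for _ in range(steps):
        u = mpc_step(x, horizon, inputs)    # re-optimize at every time step
        x = A * x + B * u                   # apply only the first input
        trajectory.append(x)
    return trajectory

trajectory = simulate(2.0, 20)
print(trajectory[-1])
```

Only the first element of each optimized sequence is ever applied; the remaining elements are discarded and recomputed at the next step from the newly measured state, which is what makes the horizon recede.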
