
IT 19 089

Degree project, 30 credits (Examensarbete 30 hp)

December 2019

Temperature handler in radios using machine learning

Arnthor Helgi Sverrisson


Abstract

Temperature handler in radios using machine learning

Arnthor Helgi Sverrisson

Machine learning is revolutionising the field of automation in various industries, but powerful methods and tools also exist that do not rely on a learning process the way machine learning does. In this thesis, controllers for compensating for overheating in radio stations are built, evaluated and compared. The controllers are based on two different approaches: the first is based on model predictive control (MPC), and the second on reinforcement learning (RL). This report compares these two approaches and reports qualitative and quantitative differences.

Printed by: Reprocentralen ITC, IT 19 089

Examiner: Mats Daniels


Contents

1 Introduction
1.1 Current implementation
1.2 Expert system
1.3 Contribution
2 Theory
2.1 System identification
2.1.1 State space models
2.2 Model predictive control
2.2.1 Objective function
2.3 Reinforcement learning
2.3.1 Markov decision process
2.3.2 Q learning
2.3.3 Deep Q-learning
3 Experiment setup
3.1 System identification
3.1.1 Climate chamber experiment
3.1.2 State space model
3.2 Model predictive control
3.3 Reinforcement learning
3.3.1 Simulation
3.3.2 States
3.3.3 Actions
3.3.4 Reward function
3.3.5 Training process
4 Results
4.1 Model predictive control results
4.1.1 Comparison on different control horizons
4.1.2 Comparison on different weights
4.1.3 MPC controller results
4.2 Reinforcement learning results
4.2.1 Hyperparameter test
4.2.2 Train on validation data
4.3 Comparison on MPC and RL
5 Discussion
5.1 Conclusion
5.2 Future work


1. Introduction

Ericsson is a provider of Information and Communication Technology (ICT). The company offers services, software and infrastructure in ICT for telecommunications operators, traditional telecommunications and Internet Protocol (IP) networking equipment, mobile and fixed broadband, operations and business support services, cable television, IPTV, video systems, and an extensive services operation [23]. Their products, for example radios, are deployed all around the world and must therefore sustain all kinds of conditions. One of the challenges for the radios is heat. As an example, on the hottest officially recorded day in Phoenix, Arizona, the temperature went up to 50°C [15]. On top of that, radio products must handle traffic, which requires a lot of power consumption. The power amplifiers generate additional heat that the system needs to sustain. When conditions become this harsh, the radio needs to reduce its output power to cool down for hardware protection. The timing of when to start reducing output power, and by how much, is important. If done too early, unnecessary reduction and diminished serviceability might occur. If done too late, the operational temperature might continue to rise and reach a critical level, at which point the system shuts down.

1.1 Current implementation

Right now the reduction of output power, called back off power, in hot conditions is handled by rule-based control. The idea behind rule-based control is to encode human knowledge into automatic control. The temperature handling function in the radio unit is designed with if/else statements with manually defined thresholds. Several temperature thresholds are set: one triggers a timer, another triggers back off power, and yet another is the critical threshold that triggers shutdown of the whole system. When the timer is started, the time integral of the temperature is calculated, and this integral is not allowed to surpass another (manually defined) threshold, otherwise back off starts. The amount of back off power is decided by a formula and is proportional to the internal temperature, that is, the higher the temperature the more back off power is applied to reduce internal component temperature. These formulas and thresholds have been manually set and tuned through trial and error.
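The sketch below is an illustrative reconstruction of such a rule-based scheme, not Ericsson's actual implementation; all threshold values, the integral limit and the back off formula coefficient are hypothetical placeholders chosen only to show the structure (timer threshold, temperature integral, proportional back off, critical shutdown).

TIMER_THRESHOLD = 85.0      # degC, starts the integration timer (hypothetical)
BACKOFF_THRESHOLD = 95.0    # degC, triggers back off directly (hypothetical)
CRITICAL_THRESHOLD = 105.0  # degC, triggers shutdown (hypothetical)
INTEGRAL_LIMIT = 600.0      # degC*s allowed above TIMER_THRESHOLD (hypothetical)
SAMPLE_TIME = 67.0          # seconds between temperature samples

class RuleBasedHandler:
    def __init__(self):
        self.integral = 0.0

    def step(self, temp: float) -> float:
        """Return the requested back off in dB (negative), or raise on shutdown."""
        if temp >= CRITICAL_THRESHOLD:
            raise RuntimeError("critical temperature reached: shutting down")
        if temp >= TIMER_THRESHOLD:
            # accumulate the time integral of the excess temperature
            self.integral += (temp - TIMER_THRESHOLD) * SAMPLE_TIME
        else:
            self.integral = 0.0
        if temp >= BACKOFF_THRESHOLD or self.integral > INTEGRAL_LIMIT:
            # back off proportional to how far above the threshold we are
            return -0.1 * (temp - TIMER_THRESHOLD)
        return 0.0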


The radio contains several components, each of which can sustain a different maximum temperature (thresholds defined by the manufacturer). The temperature thresholds therefore differ between components, which makes the job of manually setting them even more difficult. Since these crucial parameters, thresholds and formulas are manually decided, there is room for exploring whether a more scientific approach works better. In this thesis I propose to solve this problem with model predictive control (MPC) [12] and reinforcement learning (RL) [19]. To do so I also perform system identification of how the internal temperature of the radio reacts to changes in its environment and build a simulator from that. The simulator is needed for both RL and MPC.

1.2 Expert system

One branch of artificial intelligence (AI) is expert systems, which are essentially systems whose intelligence and decision making process is based on human expert knowledge. Knowledge engineering is the act of encoding human expert knowledge into a set of rules a system can follow. Rules contain an IF part and a THEN part. Expert systems whose knowledge is represented in rule form are called rule-based systems [5]. Expert systems used to be a popular research field and were one of the first truly successful forms of AI software, but in recent years the focus of research has moved towards machine learning and away from expert systems [8]. In a machine learning approach, for example supervised learning, the system is told what to look for or what the solution should be, but not how to find the solution. Likewise, in RL the agent is told (relatively) what is a good or bad action, but not how to solve the problem; the algorithm finds that out. The current temperature controller in the radios is a rule-based system. Since methods like MPC and RL are emerging and have proven effective in some cases, it is interesting to see whether such controllers would outperform the current rule-based controller.

1.3 Contribution


2. Theory

This chapter introduces the theory behind the algorithms used. The main focus is on system identification, model predictive control and reinforcement learning.

2.1 System identification

A dynamic system is an object in which single or multiple inputs or variables produce an observable signal, usually referred to as output. The relationship between the inputs and outputs can be described with a mathematical formula, but identifying the system's behaviour can be tricky in some cases, especially when dealing with MIMO (multiple input, multiple output) systems. In some cases a physical model of the system is simple and/or known, but in most cases it is complicated and/or non-linear. Mathematical models can then be generated from statistical data. A dynamic system can be described by, for example, differential or difference equations, transfer functions, state-space equations, and pole-zero-gain models. The methodology for building a mathematical model of the system is called system identification [10].

2.1.1 State space models

A common way of representing a model of a system is the state space representation. For a state vector x (x ∈ R^n), output vector y (y ∈ R^q) and input vector u (u ∈ R^p), A is the state matrix (A ∈ R^(n×n)), B is the input matrix (B ∈ R^(n×p)), C is the output matrix (C ∈ R^(q×n)) and D is the feedforward matrix (D ∈ R^(q×p)). For a continuous time-invariant system, the state space representation is shown in equation 2.1:

x'(t) = Ax(t) + Bu(t)
y(t) = Cx(t) + Du(t)    (2.1)

When this formula is transformed from continuous time to discrete time it becomes

x(k + 1) = Ax(k) + Bu(k)
y(k) = Cx(k) + Du(k)    (2.2)

[24] [18]
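As a small illustration of how a discrete state space model like equation 2.2 can be stepped forward in time, the Python sketch below simulates an arbitrary two-state system; the matrices are made-up placeholders and not the model identified later in this thesis.

import numpy as np

# Minimal sketch of simulating the discrete state-space model in equation 2.2,
# x(k+1) = A x(k) + B u(k), y(k) = C x(k) + D u(k).
# The matrices below are arbitrary placeholders, not the identified radio model.
A = np.array([[0.98, 0.01], [0.00, 0.95]])
B = np.array([[0.02, 0.00], [0.00, 0.05]])   # inputs: ambient temperature, power
C = np.array([[1.0, 1.0]])                   # output: internal temperature
D = np.zeros((1, 2))

def simulate(u_sequence, x0):
    """Run the model forward over a sequence of inputs u(k) (shape: N x 2)."""
    x, outputs = x0, []
    for u in u_sequence:
        outputs.append(C @ x + D @ u)
        x = A @ x + B @ u
    return np.array(outputs)

y = simulate(np.tile([45.0, 40.0], (100, 1)), x0=np.array([20.0, 20.0]))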

2.2 Model predictive control

MPC originated in the late seventies [17] but has improved a lot since then. MPC is not a specific algorithm but rather an umbrella term for a range of control methods that make use of a model of the process to obtain the control signal by minimizing an objective function. What all MPC algorithms have in common is that they:

• make explicit use of a model to predict the process output
• minimize an objective function
• use a receding strategy, that is, at every time instance the prediction horizon is moved forward by one step and the optimization is recalculated with the new horizon

At the same time, MPC algorithms differ among themselves in, for example, how the cost function or noise is handled and the type of model used.

Figure 2.1. MPC strategy [4]. The input or control variable is shown as u and the output as y. N is the prediction horizon.

The strategy MPC algorithms follow, shown in figure 2.1, can essentially be summarized in three steps (a schematic sketch follows the list):

1. The future outputs over the prediction horizon, y(t + k|t), k = 1...N, are predicted at each time instance t using the process model.

2. The future control signals u(t + k|t), k = 1...N, are calculated by optimizing the objective function.

3. The first control signal u(t|t) is sent to the process and executed, the prediction horizon is moved to t + 1, and the whole process is repeated. [4]
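The following minimal Python sketch illustrates the receding-horizon idea on a toy scalar plant: at every time step a control move is chosen by minimizing a simple tracking objective over the prediction horizon, and only the first move is applied. The plant coefficients, the candidate grid and the objective are illustrative assumptions, not the controller developed in this thesis.

import numpy as np

# Toy scalar plant: x(k+1) = a x(k) + b u(k); values are arbitrary.
a, b = 0.9, 0.1
candidates = np.linspace(-1.0, 1.0, 21)   # candidate constant control moves

def objective(x, u_seq, ref):
    # step 1: predict future outputs over the horizon for a candidate sequence
    cost = 0.0
    for u in u_seq:
        x = a * x + b * u
        cost += (x - ref) ** 2            # output reference tracking term
    return cost

def mpc_step(x, ref, N=10):
    # step 2: choose the control minimizing the objective (brute-force search
    # over constant sequences, i.e. a control horizon of 1, prediction horizon N)
    return min(candidates, key=lambda u: objective(x, [u] * N, ref))

x, ref = 0.0, 1.0
for k in range(50):                       # step 3: apply first move, shift horizon
    u = mpc_step(x, ref)
    x = a * x + b * u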

2.2.1 Objective function

The objective of MPC is to find a solution, that is, a control policy, that minimizes an objective function. The objective function therefore represents how good a control policy is: the higher the value of the objective function for a given control policy, the 'worse' that policy is. In order to design an MPC controller, the objective function needs to be defined with appropriate criteria.

In control theory it is desirable that a controller can optimize several things at the same time, so the objective function usually contains three or four factors. As equation 2.3 shows, the objective function J is the sum of four factors

J(zk) = Jy(zk) + Ju(zk) + J∆u(zk) + Jε(zk) (2.3)

The different J functions in equation 2.3 measure different features of the controller. Those features are

• Jy is output reference tracking
• Ju is manipulated variable tracking
• J∆u is manipulated variable move suppression
• Jε is constraint violation

Jy is the factor that measures how closely the outputs follow a reference value for the output. For example, if a thermostat is set to 25°C, Jy becomes higher the further the current temperature is from the 25°C reference value.

Jy(zk) = Σ_{j=1}^{ny} Σ_{i=1}^{p} ( (w^y_{i,j} / s^y_j) [rj(k + i|k) − yj(k + i|k)] )²    (2.4)

where

• k - Current control interval.
• p - Prediction horizon.
• ny - Number of plant output variables.
• zk - Control parameters selected (quadratic program decision).
• yj(k + i|k) - Predicted value of the jth plant output at the ith prediction horizon step, in engineering units.
• rj(k + i|k) - Reference value for the jth plant output at the ith prediction horizon step.

• w^y_{i,j} - Tuning weight for the jth plant output at the ith prediction horizon step.

The factor Ju, shown in equation 2.5, measures how well a manipulated variable (MV) u follows a reference signal.

Ju(zk) = Σ_{j=1}^{nu} Σ_{i=0}^{p−1} ( (w^u_{i,j} / s^u_j) [uj(k + i|k) − uj,target(k + i|k)] )²    (2.5)

where u represents the MV and w^u_{i,j} is the tuning weight for the jth MV at the ith prediction horizon step.

In some cases it is not desirable that the controller makes sharp changes in the MV, so J∆u measures the change in the MV:

J∆u(zk) = Σ_{j=1}^{nu} Σ_{i=0}^{p−1} ( (w^∆u_{i,j} / s^u_j) [uj(k + i|k) − uj(k + i − 1|k)] )²    (2.6)

Lastly, Jε accounts for constraint violations (see equation 2.7):

Jε(zk) = ρε εk²    (2.7)

where

• εk - Slack variable at control interval k (dimensionless).
• ρε - Constraint violation penalty weight (dimensionless).

[1]
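To make the structure of equation 2.3 concrete, the sketch below evaluates the four terms for given predicted trajectories. The weights, scale factors and example trajectories are arbitrary assumptions; the controller in this thesis uses the MATLAB MPC Toolbox implementation of these terms.

import numpy as np

# Hedged sketch of evaluating the MPC objective in equation 2.3 for given
# predicted trajectories. Weights, scale factors and trajectories are made up;
# this only illustrates how the four terms are summed.

def mpc_objective(r, y, u, u_target, wy=1.0, wu=0.0, wdu=0.0, rho_eps=1e5,
                  sy=1.0, su=1.0, eps=0.0):
    Jy = np.sum(((wy / sy) * (r - y)) ** 2)                     # output tracking
    Ju = np.sum(((wu / su) * (u - u_target)) ** 2)              # MV tracking
    Jdu = np.sum(((wdu / su) * np.diff(u, prepend=u[0])) ** 2)  # move suppression
    Jeps = rho_eps * eps ** 2                                   # constraint violation
    return Jy + Ju + Jdu + Jeps

J = mpc_objective(r=np.full(10, 40.0), y=np.linspace(38, 40, 10),
                  u=np.full(10, 40.0), u_target=np.full(10, 40.0))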

2.3 Reinforcement learning

Reinforcement learning is a branch of machine learning. The idea is to learn from experience through trial and error. The decision maker is put into an environment to solve a task and is then told, through a so-called reward function, whether its actions are good or bad. The decision maker builds up a memory of its experiences [19].

2.3.1 Markov decision process

To formulate this problem mathematically, a mathematical framework called a Markov decision process (MDP) is used. The decision maker or controller is called the agent. The agent interacts with the environment, which can be a simulated or a real environment. At each time step t, the agent receives some representation of the environment's state, st ∈ S, and based on the state st, the agent selects an action at ∈ A.

For the action it selects, it receives a reward rt+1, which is a numerical value.


The agent does not only try to maximize the immediate reward but also the cumulative reward in the long run. The cumulative sum of future rewards, called the value Gt, is discounted with a factor γ called the discount factor. The value represents how good an action is for future states, while the reward only represents the immediate effect of the action. This discounted total sum of future rewards is shown in equation 2.8. Usually γ is set to a value in the interval 0 ≤ γ ≤ 1. If γ = 0, only the immediate reward is maximized. If γ ≥ 1 the infinite sum in 2.8 can diverge, but for 0 ≤ γ < 1 it remains finite (for bounded rewards). The closer γ is to 0, the more 'myopic' or shortsighted the agent is, while a γ close to 1 gives more importance to future rewards when the agent tries to maximize its objective.

Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... = Σ_{k=0}^{∞} γ^k Rt+k+1    (2.8)

Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... = Rt+1 + γ(Rt+2 + γRt+3 + ...) = Rt+1 + γGt+1    (2.9)

By following a policy π, i.e. a sequence of actions, it is then possible to calculate the expected 'value', qπ(s, a), of choosing action a in state s under policy π. The equation for this value is shown in equation 2.10. This value function is called the action-value function or q-value function:

qπ(s, a) = Eπ[Gt | St = s, At = a]    (2.10)

Eπ[·] denotes the expected value given that the agent follows policy π. For every problem there is an optimal policy π* that yields the highest possible q-value q*(s, a). Equation 2.9 shows that if the value of the next state, St+1, is known, then the value of the current state, St, can be found [19].

Figure 2.2. The agent-environment interaction in a Markov decision process [19].

2.3.2 Q learning


For a finite MDP problem M = (S, A, P) and a discount factor γ, the q-value q(s, a) for each possible state and action pair is stored as an entry in a memory table. During training, after each action the q-value for the state-action pair is updated according to equation 2.11, which is based on the Bellman equation.

q(st, at) ← q(st, at) + α [rt+1 + γ max_a q(st+1, a) − q(st, at)]    (2.11)

The α in equation 2.11 is the learning rate: the larger the learning rate, the more the new value is accepted and the old value rejected [19].
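A minimal tabular Q-learning update implementing equation 2.11 is sketched below; the state and action space sizes, α and γ are arbitrary example values.

import numpy as np

# Tabular Q-learning update (equation 2.11) on an arbitrary small problem.
n_states, n_actions = 10, 5
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

def q_update(s, a, r, s_next):
    target = r + gamma * np.max(Q[s_next])   # best value of the next state
    Q[s, a] += alpha * (target - Q[s, a])    # move the estimate toward the target

q_update(s=0, a=2, r=-1.0, s_next=1)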

2.3.3 Deep Q-learning

Some reinforcement learning problems are too complex to be set up as a finite MDP, for example because the state space is infinite or too big to be stored in a table; the larger the q-table, the longer the training process takes. In deep Q-learning the table is substituted with a neural network. The state space does not have to be discrete and finite (as in Q-learning) but can instead be a set of continuous values, which are then inputs to the neural network. The output is an approximation of the q-values for each a ∈ A, so the size of the output layer equals the number of actions in the action set A. Since neural networks are function approximators, they can work well for approximating the q-values. The quantity max_a q(st+1, a) can therefore be calculated in a single forward pass through the neural network for a given st+1.

The network is initialized with random weights θ. In simulation, experiences are gathered: actions, rewards, state and next state are stored as a tuple <st, a, r, st+1> in a dataset. Then the approximation of the q-values is updated towards the value Yk^Q shown in equation 2.12, where k is the training iteration and θ refers to the weights of the network.

Yk^Q = r + γ max_a q(st+1, a; θk)    (2.12)

The parameters θk are updated by stochastic gradient descent, minimizing the squared loss LDQN (see equation 2.13):

LDQN = (q(s, a; θk) − Yk^Q)²    (2.13)

The parameters are then updated as follows:

θk+1 = θk + α (Yk^Q − q(s, a; θk)) ∇θk q(s, a; θk)    (2.14)
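The sketch below implements equations 2.12-2.14 with a linear q-function approximator instead of a deep network, to keep the example self-contained; the dimensions, α and γ are arbitrary assumptions.

import numpy as np

# Equations 2.12-2.14 with a linear approximator q(s, a; theta) = theta[a] . s
# instead of a deep network. All sizes and constants are arbitrary examples.
state_dim, n_actions = 4, 5
theta = np.random.randn(n_actions, state_dim) * 0.01
alpha, gamma = 0.001, 0.9

def dqn_update(s, a, r, s_next):
    q_next = theta @ s_next                 # one forward pass gives all q-values
    target = r + gamma * np.max(q_next)     # Y_k^Q, equation 2.12
    q_sa = theta[a] @ s
    # gradient step on the squared loss (q(s,a;theta) - Y)^2, equations 2.13-2.14
    theta[a] += alpha * (target - q_sa) * s

dqn_update(np.ones(state_dim), a=1, r=-0.5, s_next=np.ones(state_dim))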


3. Experiment setup

The first part of the experiment is to understand the thermal physics of the radio and convert that into mathematical terms. The second part is to create a controller using the information from the mathematical model of the system. In this experiment two controllers are developed: first an MPC controller and then an RL controller.

3.1 System identification

In system identification, measured data is used to create a model of a system, whether it is a state space model, transfer function model, polynomial model, process model or grey-box model. There exist several methods that turn measured data into models; the one used here is N4SID [13]. The data used in this experiment came from a climate chamber experiment.

This experiment was done using Mathworks' Matlab and its System Identification Toolbox [11].

3.1.1 Climate chamber experiment

In Ericsson's office in Kista there is a so-called climate chamber, a chamber in which the temperature can easily be controlled. In June 2018 a test was conducted in the climate chamber where the temperature profile from a warm day in Phoenix, Arizona [15] was simulated. Inside the climate chamber a mobile communication radio transmitter was placed, and the power usage of a radio operating on a typical busy day in Hong Kong was also simulated. The heat and power usage together created a high internal temperature inside the radio so that the temperature handler could be tested. The test lasted 24 hours. The result of the experiment is shown in figure 3.1.

3.1.2 State space model

The climate chamber test created useful data on how ambient temperature and power usage affected the internal temperature of the radio. From this data it was possible to build a state space model.


Figure 3.1. This graph shows the 24 hour climate chamber test. The PaFinal (brown) line shows the internal temperature at one temperature sensor in the radio. The simulated temperature inside the climate chamber is also shown; both follow the axis on the right side in °C. The requested power and actual power are shown and follow the left axis, measured in dB. As can be seen from the graph, the radio follows the requested power until approximately 14:00, when the internal temperature of the radio is quite high. Then the temperature controller kicks in and starts to back off, and the actual power usage becomes lower than the requested power.

The temperature sensor with the highest recorded value was chosen as the output for system identification.

The inputs to the system were the measured ambient temperature inside the climate chamber and the simulated power usage. The output was the internal temperature. The state space model therefore describes how the internal temperature changes as power usage and ambient temperature change. The N4SID method was used to acquire the state space model of the system.

Equation 2.1 shows the structure of a state space model. The N4SID approach with those inputs and output gave the values of the matrices as follows:


3.2 Model predictive control

This part of the experiment was also done in Matlab, using the MPC Toolbox [1].

Once the model of the thermal dynamics of the system is in place, the plant can be defined. The inputs to the plant are ambient temperature and power usage; the output is internal temperature. To fit the MPC structure, an additional output is added: power usage. That output is just a delayed copy of the corresponding input. The reason power usage is both an input and an output is that it is the controlled variable, u, of the controller, but it is also an input since it affects the other output of the plant, the internal temperature.

As shown in figure 3.2, the MPC controller's inputs are two measured outputs and one measured disturbance. The two measured outputs are the outputs of the plant: power usage and internal temperature. The reference signal is the requested power from the radio's users. That is the power usage the radio wants to follow, but because of overheating this is not possible at all times.

The controller is supposed to prevent the radio from overheating and shutting down. A shutdown occurs once the internal temperature exceeds the shut-down limit. The shut-down limit varies between temperature sensors in the radio, but in this experiment it was set to 105°C, so a hard limit equal to the shut-down limit was set on the MPC output for internal temperature.

The MPC has weights on inputs and outputs, which penalize deviations from the reference signals, and a rate weight on the input, which penalizes sharp changes. In the results section, different values for these parameters are compared.

Figure 3.2. The MPC structure.

3.3 Reinforcement learning


3.3.1 Simulation

Since a state space model of the system had already been acquired, it was possible to build a simulator that simulates the change in internal temperature of the radio. The inputs to the system are ambient temperature and power usage. Other factors can influence the internal temperature of the radio, such as solar radiation and wind speed, but since the data used comes from the aforementioned climate chamber experiment, where only power usage and ambient temperature were varied, data only exists for how ambient temperature and power usage affect the system.

Each episode is a simulated period of several hours; episode lengths of both 24 hours and 9 hours were tested. To train the agent on different scenarios, the inputs are randomly generated before each episode is played out. For the ambient temperature, a sine function is used to simulate the change over the course of several hours, with one period of the sine function representing 24 hours.

To simulate ambient temperature, the ambient temperature profile from Phoenix used in the climate chamber test served as a benchmark. If that temperature profile can be approximated with a mathematical formula, it is possible to introduce some randomness into the formula and in that way randomly generate ambient temperatures for the simulator. Temperature swings over 24 hours resemble a sine function, so a sine function is used. Equation 3.1 shows how a sine function can be transformed into something that resembles 24 hour temperature swings.

f(x) = range · sin(x) + offset + noise    (3.1)

Here

• range is set as range = (highest − lowest)/2, where highest and lowest are the highest and lowest temperature values from the Phoenix data
• offset is set as the mean of the Phoenix data
• noise is a random value drawn from a Gaussian distribution centered at 0 with a spread of 5

This gives a function with features rather similar to the Phoenix data. For the simulator, random factors are multiplied onto offset, range and noise, and another random factor is used to shift the peak of the sine function (see the sketch below).
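A possible implementation of such a generator is sketched below. The Phoenix-derived constants (highest, lowest), the ranges of the random factors and the number of steps are assumptions used only for illustration.

import numpy as np

# Hedged sketch of the ambient-temperature generator in equation 3.1:
# range * sin(x) + offset + noise, with random factors on range, offset and the
# peak position. The constants below are placeholders, not the real Phoenix data.

def generate_ambient(highest=48.0, lowest=27.0, n_steps=1290, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    base_range = (highest - lowest) / 2.0
    base_offset = (highest + lowest) / 2.0        # stand-in for the Phoenix mean
    # random factors so each episode sees a different temperature profile
    range_factor = rng.uniform(0.7, 1.3)
    offset_factor = rng.uniform(0.9, 1.1)
    phase_shift = rng.uniform(0, 2 * np.pi)       # shifts the peak of the sine
    x = np.linspace(0, 2 * np.pi, n_steps)        # one sine period = 24 hours
    noise = rng.normal(0.0, 5.0, n_steps)
    return base_range * range_factor * np.sin(x + phase_shift) \
        + base_offset * offset_factor + noise

profiles = [generate_ambient() for _ in range(5)]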

Figure 3.3 shows random samples of temperature profiles generated by the simulator from the calculations described above.

The user-requested power of the radio was simulated using the same profile as in the Hong Kong test. It has two peaks, one in the morning and one in the afternoon. In the simulation, a random starting point in the requested power data is chosen, so the requested power is not exactly the same in each scenario. This is shown in figure 3.4.


Figure 3.3. The ambient temperature profile. The image shows 5 randomly generated ambient temperature profiles from the data generator in the simulation. The x-axis is time in seconds and the graph shows 86,000 seconds, or roughly 24 hours. The y-axis is temperature measured in °C.

During the training process the random factors in the simulator therefore had to be tuned a lot.

The temperature controller in the radio needs to be prepared for a wide variety of ambient temperatures and requested power, which is why the simulator presents different scenarios during training. This also helps prevent overfitting.

3.3.2 States

The state is the input to the neural network and is an array of values that are 'relevant' to the controller. Sometimes it can be tricky to determine what is and is not relevant. Including more values gives the controller more information about the problem, but at the same time means a bigger network and a longer time to train. Examples of values used in the state are:

• Current internal temperature
• Current ambient temperature
• Current requested power
• Back off


Figure 3.4. Five random samples from the requested output power generator. The y-axis is power in dB and the x-axis is time in seconds, showing 86,000 seconds or roughly 24 hours. As seen here, it is always the same series but with a different (randomly selected) starting point, clipped at the end.

3.3.3 Actions

The actions the controller could choose were as follows (see the sketch below):

• Increase the back off by 0.5 dB
• Increase the back off by 0.1 dB
• No change in back off
• Decrease the back off by 0.1 dB
• Decrease the back off by 0.5 dB

The back off starts at 0. The controller can then choose to increase it, decrease it or keep it the same, as long as the back off stays between −5 dB and 0 dB. This limit is set to make it easier for the RL agent to learn and search for the best solution, as it makes the solution space smaller. A controller that chooses to apply more power than the requested power (positive back off) is not wanted for this problem, so it is forbidden.
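A sketch of this action set and the clipping of the back off to the −5 dB to 0 dB range is shown below; the sign convention (a more negative value meaning more back off) is an assumption made for illustration.

# Discrete action set: change in back off per action, in dB.
# Assumed sign convention: more negative back off = more power reduction.
ACTION_DELTAS_DB = [-0.5, -0.1, 0.0, +0.1, +0.5]

def apply_action(back_off_db: float, action: int) -> float:
    """Return the new back off in dB, clipped to the [-5, 0] dB range."""
    return min(0.0, max(-5.0, back_off_db + ACTION_DELTAS_DB[action]))

back_off = apply_action(0.0, action=0)   # 'increase back off by 0.5 dB' -> -0.5 dB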

3.3.4 Reward function


Figure 3.5. The image shows 30 randomly generated scenarios from the simulator. The red series at the top show the internal temperature (in °C). The green series show the ambient temperature (in °C). The blue series show the output power (in dB).

Defining the reward function can however be tricky. In warm conditions the controller needs to find out whether it is better to back off or not, and if so by how much. The less it backs off, the more likely it is that the internal temperature becomes higher. The reward function therefore needs to reflect the balance of minimizing the back off while keeping the temperature below a certain limit.

At each time instance the reward function looks at three aspects when giving a reward (a sketch combining them follows this list):

• Back off: Since the controller should minimize the back off, the general rule is that the more back off the controller applies, the more negative the reward the agent receives.

• Lower temperature limit violation: When the internal temperature goes above a lower temperature limit, the agent receives a negative reward, and the higher above the limit the temperature is, the more negative the reward.
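A sketch of a reward function combining these aspects is given below. The shut-down reward of -40,000 is taken from section 4.2; the shutdown term itself, the lower temperature limit of 95°C and the penalty coefficients are assumptions for illustration only.

# Illustrative reward combining a back-off penalty, a penalty for exceeding a
# lower temperature limit, and a large negative reward on shutdown.
LOWER_TEMP_LIMIT = 95.0     # degC, hypothetical soft limit
SHUTDOWN_LIMIT = 105.0      # degC, shutdown threshold used in this thesis
SHUTDOWN_REWARD = -40_000.0 # value from section 4.2

def reward(back_off_db: float, internal_temp: float) -> float:
    if internal_temp >= SHUTDOWN_LIMIT:
        return SHUTDOWN_REWARD
    r = -10.0 * abs(back_off_db)                        # penalize applied back off
    if internal_temp > LOWER_TEMP_LIMIT:
        r -= 5.0 * (internal_temp - LOWER_TEMP_LIMIT)   # penalize excess temperature
    return r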


3.3.5 Training process

The neural network's weights are initialized with Xavier initialization [7]. The simulation starts with an exploration rate of 1 and gathers information into a memory array. For each action, the information stored in the memory array is:

• State
• Action chosen
• Reward received
• Next state
• A boolean which represents whether the episode is terminated or not (in that case there is no 'next state')

Once the memory is large enough, the training process begins. A random batch from the memory array is selected, and for each randomly selected tuple the Q-value is calculated according to the Bellman update (see equation 2.11).

The value of max_a Q(st+1, a) is found simply by using the next state as input to the neural network and taking the maximum value of the output. In the beginning, when the neural network is not trained, the outputs are wrong, but studies show that in most cases (not all) the neural network converges to a good Q-value function approximator [25]. Once the Q-value Q(st, at) is updated, backpropagation is performed with the target output set to the previous Q-value array with the updated Q(st, at). The batch size was 32, which means that the agent performs an action, a random batch of size 32 is selected from the memory array, the neural network is trained, and then the agent selects the next action.

The agent's exploration rate started at 1, meaning all actions are selected at random. After each episode the exploration rate was multiplied by a factor called the exploration rate decay. The value of the exploration rate decay varied and depended on the total number of episodes, but was in the range 0.95-0.9992. This means that the exploration rate becomes gradually lower as the training process goes on and the agent becomes more greedy.
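The skeleton below summarizes this training loop: an experience replay memory, epsilon-greedy action selection with per-episode exploration rate decay, and training on random batches of 32. The environment, the q-network and the gradient step are stubbed out with trivial placeholders so that only the structure described above remains.

import random
from collections import deque

N_ACTIONS, BATCH_SIZE = 5, 32
memory = deque(maxlen=50_000)               # experience replay memory
epsilon, epsilon_decay = 1.0, 0.997

def choose_action(state):
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)  # explore
    return 0                                # placeholder for argmax_a q(state, a)

def train_on_batch(batch):
    pass                                    # placeholder for the gradient step

for episode in range(1500):
    state, done = [0.0, 0.0, 0.0, 0.0], False
    for step in range(500):                 # episode length of 500 steps
        action = choose_action(state)
        next_state, reward, done = state, -1.0, step == 499   # stubbed environment
        memory.append((state, action, reward, next_state, done))
        if len(memory) >= BATCH_SIZE:
            train_on_batch(random.sample(list(memory), BATCH_SIZE))
        state = next_state
        if done:
            break
    epsilon *= epsilon_decay                # agent becomes gradually greedier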


4. Results

In this chapter, results are shown from the experiments described above.

4.1 Model predictive control results

The main hyperparameters for tuning were the prediction horizon, the control horizon and the weights in the objective function, on both the output and the input. To compare different parameter settings, the total back off was used as the performance metric. No controller violated the hard limit set on the internal temperature (105°C). The tests assume that perfect prediction is possible.

4.1.1 Comparison on different control horizons

The goal of this experiment is to see which control and prediction horizons fit this problem best. During this experiment the objective function weights were held constant, with the output weight as 1 and the input rate weight as 0. The control horizon cannot be larger than the prediction horizon. Pairs of control and prediction horizons from 1 to 89 were tested on the Phoenix and Hong Kong data, and the total back off was used as the performance metric; the lower the total back off, the better the controller did. The result is shown in figure 4.1.

The lowest total back off was obtained when the prediction horizon was set to 87 or 88 and the control horizon was set to 7. Note that those values are time steps and the sample time is 67 seconds, so 7 time steps, for example, equals 469 seconds. The lowest total back off value was 87.75. When the control horizon was set to 1 the controller did much worse for all values, so in the heatmap in figure 4.1 that row and column are left out for better color contrast.

4.1.2 Comparison on different weights

The goal of this experiment is to find which settings for the weights in the objective function were best for this problem. During this experiment the control horizon was set to 7 and the prediction horizon to 87, as those values gave the best result in the experiment described in section 4.1.1.


Figure 4.2. Finding the optimal setting for the weights in the objective function. The x-axis is the input rate weight from 0-1 and the y-axis is the output weight from 0.1-1. The value on the colorbar is the total back off for the given setting. When the output weight was set to 0 the results were much worse, so that row was removed from the heatmap for better color contrast.

the reference values, which in this case is the requested power. For the objective function rate weight on the input, 11 different values between 0 and 1 were also tested. The input rate weight is a parameter that tells the controller how undesirable it is to change the manipulated variable (output power) too rapidly.

The results are shown in figure 4.2.

4.1.3 MPC controller results

With the optimal settings found, the MPC controller can be created and tested in a simulator. Figure 4.3 shows how the controller performs and how it applies back off on the Phoenix and Hong Kong data. The settings used were:

• Prediction horizon: 87 time steps
• Control horizon: 7 time steps
• Objective function output weight: 1
• Objective function input rate weight: 0

This assumes perfect prediction. The total back off was 87.75 dB.


Figure 4.4. The MPC controller strategy with prediction error. The x-axis is time in seconds and goes up to 24 hours. The y-axis shows both temperature in °C and power in dB. The light blue series shows how the internal temperature changes over time (in °C), and the orange series is the ambient temperature (also in °C). The requested power is the light green series and the actual power that the controller controls is shown in pink; both are measured in dB.

The prediction error accumulates over time, so the larger the prediction horizon, the worse the controller became. The prediction horizon here is 10 time steps (670 seconds, or approximately 11 minutes). The total back off for this test was 105.94 dB.

MPC cannot operate with no prediction at all, but the prediction horizon was also tested at its minimum, 1 time step (67 seconds), with the same disturbance as in figure 4.4. That controller applied a total back off of 88.85 dB. Figure 4.5 shows how that controller performed.

To summarize the MPC controllers' performance, measured in total back off:

• Perfect prediction: 87.75 dB
• With prediction error, prediction horizon 10 time steps: 105.94 dB
• With prediction error, prediction horizon 1 time step: 88.85 dB


4.2 Reinforcement learning results

Early on in the thesis work, many different reward functions, simulation distributions, network structures, episode lengths, action spaces, state spaces and other hyperparameter settings for the RL method were tried. There are a lot of different settings and set-ups to try, and it was difficult to find one that gave a good result. Some settings gave somewhat more promising results than others, though none gave a truly good result. The settings that seemed to give better results than others were:

• Adam optimizer for the neural network training
• Reward function with no soft temperature limit
• Episode length of 500 steps
• Exploration rate decay of 0.997
• Number of episodes of 1500
• Batch size of 32 for training
• Back off limit of 5 dB, that is, the controller could not back off more than 5 dB (and it was also not allowed to set a higher power than the requested power)

The shut-down reward was set to -40,000. This value was chosen because if the controller were to apply the maximum back off of 5 dB for an entire episode (500 steps), the total episode reward would be about -30,000. Since it is 'worse' to shut down than to apply this much back off, the shut-down reward was chosen to be lower than -30,000. The distribution, or randomness, of the randomly generated ambient temperature in the simulator was also a factor that needed a lot of tuning. The ambient temperature could not be so high that the controller could not avoid shutdown even when applying maximum back off. Also, if too many scenarios required no back off at all, the agent tended to adopt that as its general strategy, applying no back off at all times, even when it was needed.

4.2.1 Hyperparameter test


Figure 4.6. The x-axis is the number of episodes, from 0 to 1500, and the y-axis is the total reward per episode. The blue series shows all episodes. Since the episodes are different scenarios, a validation series (orange) is added; the validation is performed every 10 episodes with no exploration and always uses the same ambient temperature and requested power. This reward graph is from training with learning rate 0.00025, discount factor 0.9 and a neural network with 4 hidden layers of 12 nodes each. The red line shows how the MPC controller described in section 4.1.3 would perform; that controller would get 730.3 in rewards.

One of the best results came when the learning rate was 0.00025, the discount factor was 0.9 and the neural network had 4 hidden layers with 12 nodes per hidden layer. The reward graph is shown in figure 4.6, and when the training was finished the controller was tested on the climate chamber data; that result is shown in figure 4.7. Another example of training is shown in figure 4.6. Not all 27 (plus more) results from the training will be shown here.

4.2.2 Train on validation data


Figure 4.9. The x-axis is the number of episodes, from 0 to 1500, and the y-axis is the total reward per episode. The blue series shows all episodes and the orange series shows the rewards from validation; the validation is performed every 10 episodes with no exploration. The upper red line shows how the MPC would perform if the same reward function were applied; the MPC would then receive 730.3 in rewards. The lower red line shows the reward obtained if the back off from the climate chamber test is evaluated with the same reward function; that controller would get 4739.9 in rewards.

This test was performed to make things simpler and to see whether the training would work, since the previous training did not go as expected.

The parameters used for this test were the same as those used for the training shown in figure 4.6, because that was one of the combinations that gave the best results in the hyperparameter test: learning rate 0.00025, discount factor 0.9 and a neural network with 4 hidden layers of 12 nodes each.

The reward graph from that training is shown in figure 4.9 and the controller's performance after training is shown in figure 4.10.

4.3 Comparison on MPC and RL

Figure 4.11 shows the different methods used for creating a controller and how much they backed off in the climate chamber scenario. The controllers shown are as follows:


Figure 4.11. Comparison between the results of the different controller methods. This graph shows how much back off the developed controllers apply in the same scenario. The RL controllers (green and purple series) apply much more back off than the MPC controllers (blue and orange series). The yellow series shows the amount of back off during the climate chamber test (see figure 3.1).

• MPC with perfect prediction from figure 4.4
• RL trained on validation from figure 4.10
• RL trained on random scenarios from figure 4.7


5. Discussion

In this section the results from the thesis work are concluded and discussed. I will also suggest ways forward and things to work on in the future.

5.1 Conclusion

The results were quite clearly in favor of MPC in all comparisons. I would therefore recommend, going forward with this project, focusing on the development of MPC. Another drawback of RL is that it is more computationally heavy because of the training step, so if RL is to be used in radios, that computational cost needs to be considered.

The best setting for MPC was with the input rate weight set to 0. This means that the controller is not 'punished' for changing the input too quickly, which makes sense because in this application restricting the rate of change of the input is not an important factor.

The best setting for the output power weight in the objective function was 1. This means that it is important that the controller follows the reference trajectory (in this case the requested power) closely, so this weight should be high.

For the prediction and control horizon, the best setting was a prediction horizon of 87 time steps (5829 seconds, or about 97 minutes) and a control horizon of 7 time steps (469 seconds, or approximately 8 minutes). It is interesting, though, that there is not much deviation in the results, so a shorter prediction horizon does not give much worse results. A longer prediction horizon is also harder to use in practice, as predictions further into the future are less accurate. In a real product it might therefore be advisable to use a shorter prediction horizon.

When prediction errors were added, the MPC controller started to perform worse, but by reducing the prediction horizon to 1 time step the performance improved again. So in a real application the controller could assess how good the predictions are and tune the prediction horizon accordingly: if predictions are good and accurate, use a long prediction horizon, but if predictions are bad, shorten it.


5.2 Future work

The work of this thesis can be further extended. What I would suggest working on is improving the system identification. This includes exploring which methods can be used for system identification and also which inputs are used. In this thesis, ambient temperature and output power were used because that data was available (from the climate chamber test), but other things, like solar radiation and wind speed, can influence the internal temperature. I would suggest gathering data about those attributes, trying to estimate their importance, and then building a model using this data.

The work spent on improving the system identification is also beneficial for building a simulator, which can save a lot of the time and money spent on tests on real products, like the climate chamber test.

The RL results were bad. But even if RL had hypothetically done well in all tests in this thesis, there would still be some things to consider. As mentioned before, training the RL agent is very computationally heavy compared to MPC or the rule-based controller, and Ericsson has a lot of radios around the world, so training an RL controller for each radio would require a lot of computational power. Before spending more time on improving RL and searching for the right set-up, some questions therefore need to be answered first, for example: is training RL agents feasible, and should they be trained in the radios, or should the data be sent from each radio and the training performed centrally?

But if Ericsson concludes that it is worth continuing to improve the RL, my suggestions to try are:

• Use an LSTM (long short-term memory) network for the state in RL [2]. Since the inputs can be interpreted as time series data, this method could work.
• In this thesis a simple DQN algorithm was used, but a lot of improvements on the DQN algorithm exist, like multi-step DQN [9], Double DQN [21], Prioritized Experience Replay [20], Dueling Networks [22] or model-based RL, that might be worth a try.

RL has some positives. RL can, as opposed to supervised learning, learn more than a human: supervised learning is limited to what it is taught, but RL can become smarter than its developers. So if RL were to give a good result, Ericsson could analyse the policies the RL agent chooses and learn from it. Another good thing with RL is that it is easier to modify the objective function (for example the back off rewards). On the other hand, such ML models are black boxes, so it is hard to understand why the agent takes a certain decision.


MPC (as suggested in this thesis), the radio can also adapt to and learn from its environment.

Prediction of the inputs, like ambient temperature, requested power, wind speed and solar radiation, can also be assessed. Once a better system identification model is acquired, it would be interesting to see how the controllers perform with perfect prediction, bad prediction and no prediction. In this way it is possible to estimate how accurate the prediction needs to be before it stops being useful (when 'no prediction' starts to outperform 'bad prediction'). It would then be interesting to create prediction models for, for example, requested power using LSTM or linear regression and see if it is possible to get a sufficiently accurate prediction model.


References

[1] Alberto Bemporad, Manfred Morari, and N. Lawrence Ricker. Model Predictive Control Toolbox: User's Guide. https://se.mathworks.com/help/pdf_doc/mpc/mpc_ug.pdf, 2019.

[2] Bram Bakker. Reinforcement learning with lstm in non-markovian tasks with long-term dependencies, 2001.

[3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.

[4] Eduardo F. Camacho and Carlos Bordons Alba. Model Predictive Control. Addison-Wesley Professional, 2 edition, 2007.

[5] Edward A. Feigenbaum, Peter Friedland, Bruce B. Johnson, H. Penny Nii, Herbert Schorr, Howard E. Shrobe, and Robert S. Engelmore. Knowledge-based systems in Japan (report of the JTEC panel). Commun. ACM, 37(1):17–19, 1994.

[6] Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, and Joelle Pineau. An introduction to deep reinforcement learning. CoRR, abs/1811.12560, 2018.

[7] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.

[8] J. Hendler. Avoiding another ai winter. IEEE Intelligent Systems, 23(2):2–4, March 2008.

[9] J. Fernando Hernandez-Garcia and Richard S. Sutton. Understanding multi-step deep reinforcement learning: A systematic study of the DQN target. CoRR, abs/1901.07510, 2019.

[10] L. Ljung. System Identification: Theory for the User. Prentice Hall information and system sciences series. Prentice Hall PTR, 1999.

[11] L. Ljung. System identification toolbox: User’s guide.

https://www.mathworks.com/help/pdf_doc/ident/ident.pdf, 2019.

[12] J.M. Maciejowski. Predictive Control: With Constraints. Prentice Hall, 2002.
[13] Mathworks. n4sid - estimate state-space model using subspace method, 2019. [Online; accessed 5-September-2019].

[14] Manfred Morari and Jay H. Lee. Model predictive control: Past, present and future. Computers and Chemical Engineering, 23:667–682, 1997.

[15] University of Arizona. Azmet : The arizona meteorological network. https://cals.arizona.edu/azmet/az-data.htm.

[16] Sasa V. Rakovic and William S. Levine. Handbook of Model Predictive Control. Birkhauser Basel, 09 2018.


[18] Derek Rowell. State-space representation of LTI systems. http://web.mit.edu/2.14/www/Handouts/StateSpace.pdf, October 2002.
[19] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.

[20] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. ICLR, 2016.

[21] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. CoRR, abs/1509.06461, 2015.

[22] Ziyu Wang, Nando de Freitas, and Marc Lanctot. Dueling network architectures for deep reinforcement learning. CoRR, abs/1511.06581, 2015.

[23] Wikipedia. Ericsson, 2019. [Online; accessed 5-September-2019].

[24] Wikipedia. State-space representation, 2019. [Online; accessed 5-June-2019].
[25] Zhuoran Yang, Yuchen Xie, and Zhaoran Wang. A theoretical analysis of deep Q-learning, 2019.
