
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2018

Reinforcement Learning for Uplink Power Control

ALAN GORAN

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Reinforcement Learning for Uplink Power Control

ALAN GORAN

Examiner: Joakim Jaldén
Academic Supervisor: Marie Maros
Industrial Supervisor at Ericsson: Roy Timo

Master’s Thesis

School of Electrical Engineering and Computer Science, Royal Institute of Technology, SE-100 44 Stockholm, Sweden

Stockholm, Sweden 2018


Abstract

Uplink power control is a resource management function that controls the signal's transmit power from a user device, i.e. a mobile phone, to a base-station tower. It is used to maximize the data rates while reducing the generated interference.

Reinforcement learning is a powerful learning technique that has the capability not only to teach an artificial agent how to act, but also to create the possibility for the agent to learn through its own experiences by interacting with an environment.

In this thesis we have applied reinforcement learning to uplink power control, enabling an intelligent software agent to dynamically adjust the user devices' transmit powers. The agent learns to find suitable transmit power levels for the user devices by choosing a value for the closed-loop correction signal in uplink. The purpose is to investigate whether or not reinforcement learning can improve the uplink power control in the new 5G communication system.

The problem was formulated as a multi-armed bandit at first, and then extended to a contextual bandit. We implemented three different reinforcement learning algorithms for the agent to solve the problem. The performance of the agent using each of the three algorithms was evaluated by comparing the performance of the uplink power control with and without the agent. With this approach we could discover whether the agent improves the performance or not. From simulations, it was found that the agent is in fact able to find a value for the correction signal that improves the average data rate, or throughput measured in Mbps, of the user devices' connections. However, it was also found that the agent does not make a significant contribution regarding the interference.


Referat

Reinforcement Learning for Uplink Power Control

Uplink power control is a resource management function that controls the signal's transmit power from a user device, that is, a mobile phone, to a base-station tower. It is used to maximize the data rates while reducing the generated interference.

Reinforcement learning is a powerful learning technique. It has the ability not only to teach an artificial agent how to act, but also to create the possibility for the agent to learn on its own, by interacting with an environment and gathering its own experiences.

In this thesis we have applied reinforcement learning to uplink power control, which enables an intelligent software agent to dynamically adjust the transmit power of the user devices' signals. The agent learns to find suitable transmit powers for the devices by choosing a value for the closed-loop correction signal in the uplink. The purpose is to investigate whether reinforcement learning can improve uplink power control in the new 5G communication system.

The problem was first formulated as a multi-armed bandit and then extended to a contextual bandit. We implemented three different reinforcement learning algorithms for the agent to solve the problem. The performance of the agent with each of the three algorithms was evaluated by comparing the performance of the uplink power control with and without the agent. With this approach we could discover whether or not the agent improves the performance. From our simulations it was found that the agent can in fact find a value for the correction signal that improves the data rate, the throughput measured in Mbps, of the user devices' average connections. However, it was also found that the agent does not make a significant contribution regarding the interference.


Contents

1 Introduction
1.1 Research Question
1.2 Outline/Overview

2 Uplink Power Control
2.1 Power Control in Telecommunications
2.2 Open Loop And Closed Loop Power Control
2.3 Deployment of UEs in Different Environments
2.4 Thesis Problem

3 Reinforcement Learning
3.1 Machine Learning
3.2 Reinforcement Learning
3.3 Multi-Armed Bandits
3.4 Exploration And Exploitation Dilemma
3.5 Finite Markov Decision Process
3.5.1 Reward Function And Goals

4 Method
4.1 Non-Associative Search
4.1.1 A k-Armed Bandit Problem in Uplink Power Control
4.2 Implemented Methods
4.2.1 Epsilon Greedy Algorithm
4.2.2 Upper Confidence Bound Action Selection Algorithm
4.2.3 Gradient Bandit Algorithms
4.3 Associative Search
4.4 Reward Functions
4.4.1 Local Reward
4.4.2 Global Reward
4.5 Simulation Setup
4.5.1 Evaluating The Performances

5 Results
5.1 Experiment 1: RL Using Local Reward
5.1.1 Results
5.1.2 Interpretation of Results
5.2 Experiment 2: RL Using Global Reward
5.2.1 Results
5.2.2 Interpretation of Results
5.3 Comparison between Local and Global Results
5.4 Experiment 3: Interference
5.4.1 Results
5.4.2 Interpretation of Results

6 Discussion and Conclusion
6.1 Discussion
6.1.1 Reinforcement Learning
6.1.2 Global Reward Function Failure
6.1.3 Other Reward Functions
6.1.4 Interference Not Found
6.2 Conclusion
6.3 Future Work
6.4 Ethics

Bibliography


Abbreviations

AI    Artificial Intelligence
ML    Machine Learning
MDP   Markov Decision Processes
RL    Reinforcement Learning
UCB   Upper Confidence Bound
UPC   Uplink Power Control
UE    User Equipment
SINR  Signal-to-Interference and Noise Ratio
CDF   Cumulative Distribution Function


Chapter 1

Introduction

The world is on the verge of a new age of interconnection as the fifth generation of wireless systems, known as 5G, is expected to launch in 2020. With cutting-edge network technology, 5G offers connections that are tens of times faster and more reliable than current connections. By providing the infrastructure needed to carry large amounts of data, the technology will bring us a big step closer to powering the Internet of Things [1].

Ericsson is one of the leading companies in telecommunications, and it is at Ericsson that this thesis was carried out. With this thesis we wish to improve one particular area, namely the scheduling decisions in future 5G base-stations. More specifically, the study is about Uplink Power Control (UPC).

Uplink power control is the controlling of a signal's transmission power, a signal that is transmitted from a User Equipment (UE), i.e. a mobile phone, to a base-station tower. UPC is generally used to maximize the quality of service while limiting the generated interference [2].

In this thesis we are studying the potential of Machine Learning (ML) in Uplink Power Control. The type of ML we study and implement is called Reinforcement Learning (RL). With RL algorithms we will attempt to train a system to find the optimal transmission power for each UE. Transmission power control is an important part of communication systems, and improving it allows us to reduce the interference between different UE signals as well as the UEs' energy consumption.

The aim is to provide better data-rates to the users by dynamically adjusting the users’ transmit powers. The Signal-to-Interference and Noise Ratio (SINR) is a reasonable first-order approximation to the user’s data rate. This approximation is used throughout the field. The idea is to use the SINR to train an RL-agent that sets the transmit power.

Reinforcement learning allows us to create intelligent learners and decision makers called agents. The agents can interact with an environment, which consists of everything outside of the agent's control [3]. Through this agent-environment interaction, the agent can learn to choose better actions in order to maximize some notion of cumulative reward. This maximization of the reward through the agent-environment interaction is the underlying theory of reinforcement learning. This type of machine learning is inspired by behaviourist psychology. For example, when a child is taking his first steps in an environment, the most effective way for him to learn how to walk is through his direct sensing connection to the environment. By training this connection, he gathers a complex set of information about the consequences of his actions and, through that, learns to achieve his goals. However, humans' biological perception of and connection to their surroundings is profoundly more advanced than what we are capable of providing a software agent with. Finding an optimal solution to a problem with RL is therefore not as simple as one might assume. Nevertheless, RL has shown very good results in recent years. It got a lot of attention when Google's DeepMind developed an RL system, called AlphaGo, that for the first time ever won against the human champion in the game of Go [4]. An AI system that could beat a human champion in Go was previously not thought to be possible due to the game's large search space and the difficulty of predicting the board moves.

1.1 Research Question

The thesis will study the use of reinforcement learning in uplink power control in 5G. The purpose is to investigate the use of RL's promising technology in the context of power control. We want to discover whether or not it is favourable to set the control parameters in uplink using RL algorithms.

Reducing the interference between the different UEs while achieving better data rates in the uplink transmission is the main goal. Lower interference, or higher SINR, is the measurement that defines how well the system performs.

1.2 Outline/Overview

The following two chapters contain the theory for both uplink power control and reinforcement learning. In chapter 2, we go through the background behind uplink power control and identify the problem we aim to solve. In chapter 3, the necessary reinforcement learning background is explained, as well as how it can be a solution for the uplink power control problem.

In chapter 4, we put those theories to use and explain all the methods that were implemented in this thesis. In addition, we also go through the difficulties that were encountered when implementing the algorithms. The results of the implemented methods are then summarized in chapter 5. Finally, we end this thesis with chapter 6, which consists of discussion, conclusions, future work and ethical aspects of the researched work.


Chapter 2

Uplink Power Control

The problem we are studying in this thesis is about Uplink Power Control (UPC).

We aim to improve the transmission power control between two nodes. In this chapter, we will explain the theory behind UPC as well as the thesis problem. It is necessary to understand the background of UPC before attempting to improve it. Specifically, we want to know about the functionality of UPC in 5G and the trade-offs between transmitting with more power and increased interference.

This chapter will also cover how user devices, or User Equipment (UE), differ in their behaviour when deployed in different environments, e.g., a dense city versus a sparse rural area.

2.1 Power Control in Telecommunications

Power control is the process of controlling the transmission power of a channel link between two nodes. The controlling of transmission power is an attempt to improve a communication signal and achieve better quality of service. Power controls are implemented on communication devices such as cell phones or any other kind of UE.

There are two kinds of channel links in telecommunication, downlink and uplink.

A communication signal is called downlink when it goes from base-station to UEs, and it is called uplink when it goes from UEs to base-station, see Figure 2.1. In this thesis, we are focusing on power control for uplink only.

Figure 2.1: An illustration separating the downlink and uplink communication signals between a base-station and a UE. The arrows represent the wireless communication signals.

Power control is needed to prevent too much unwanted interference between different UEs' communication signals, see Figure 2.2. It is undeniable that if there is only one mobile device on a site and there are no other devices to interfere with, then the higher the transmission power of the mobile device, the better the quality of the signal. However, this is not a good strategy when there are other UE signals within range. The idea is to decrease the output transmission power to an optimal value when other mobile devices are around. Decreasing the power level will reduce the interference between the devices. A good analogy is a room with only two persons in it: the louder they speak (higher transmission power), the better they will hear each other, but if there are multiple people in the room and everyone is speaking loudly, then hearing each other will be hard (more interference). Another advantage of decreasing the transmission power is that the mobile devices' energy consumption will be reduced.

2.2 Open Loop And Closed Loop Power Control

In uplink power control, we combine two mechanisms, open loop and closed loop.

The open loop’s objective is to compensate the path loss. Path loss is the reduction in power density of an electromagnetic wave when the wave is traveling through space. The open loop generates a transmit power depending on the downlink path loss estimation, see Equation 2.1. The disadvantage of open loop control is that its performance can be subjected to errors in estimating the path loss [5]. Therefore a closed loop control is introduced. The closed loop consists of a correction function δ, which is a feedback (i.e., measurements and exchange of control data) from the base-station to the UE in order to fine-tune the UE’s transmission power [5]. The UE’s transmitting power is set according to

P = \min\{ P_{\mathrm{CMAX}},\; P_0 + \alpha \, PL_{\mathrm{DL}} + 10\log_{10} M + \Delta_{\mathrm{MCS}} + \delta \},    (2.1)

where the open loop power control comprises the first part of the formula, P_0 + α·PL_DL + 10·log10(M) + Δ_MCS, and the δ term alone makes it a closed loop control. The terms stand for:


• P_CMAX is the maximum power allowed for UE transmission, which is 23 dBm (200 mW) for UE power class 3 according to the 3GPP standard [6],

• P_0 represents the power offset, that is, the minimum required power received at the base-station in uplink,

• PL_DL is the estimated downlink path loss calculated by the UE,

• α ∈ [0, 1] is a parameter used to configure the path loss compensation,

• M is the instantaneous Physical Uplink Shared Channel (PUSCH) bandwidth during a subframe, expressed in resource blocks,

• the 10·log10(M) term is used to increase the UE transmission power for larger resource block allocations,

• Δ_MCS increases the UE transmit power when transferring a large number of bits per Resource Element. This links the UE transmit power to the Modulation and Coding Scheme (MCS),

• δ is the correction parameter that is used to manually increase and decrease the transmission power.

Figure 2.2: An illustration of the interference between two UEs. Transmitter 1 and transmitter 2 are the respective transmitters on UE 1 and UE 2 while receiver 1 and receiver 2 are the receivers of the base-station. The receivers are physically located on the same tower but they are separated here in order to make the illustration clearer. The straight arrows between the transmitters and receivers represent the desired uplink communication signal from the UEs to the base-station. The dotted arrows represent the same transmitted signal picked up by the unintended receiver, transmitter 1 to receiver 2 and transmitter 2 to receiver 1, which causes an interference.
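Returning to Equation 2.1, the following Python sketch computes the transmit power for one UE. It is only a sketch: the function name is ours and the numerical values in the example call are made-up inputs, not parameters taken from the thesis or the 3GPP standard.

    import math

    def uplink_tx_power_dbm(p0_dbm, alpha, pl_dl_db, num_prb,
                            delta_mcs_db, delta_db, p_cmax_dbm=23.0):
        """Closed-loop uplink transmit power in dBm, following Equation 2.1."""
        open_loop = p0_dbm + alpha * pl_dl_db + 10 * math.log10(num_prb) + delta_mcs_db
        return min(p_cmax_dbm, open_loop + delta_db)

    # Example: 100 dB estimated path loss, 4 resource blocks, correction delta = +1 dB.
    print(uplink_tx_power_dbm(p0_dbm=-90.0, alpha=0.8, pl_dl_db=100.0,
                              num_prb=4, delta_mcs_db=0.0, delta_db=1.0))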

2.3 Deployment of UEs in Different Environments

User equipment is used in various places in our environment. We want to stay connected to fast internet in sparse rural areas as much as in dense cities. The deployment of the UEs plays a crucial role in the power control problem. The methods that decide the values of the parameters in Equation 2.1 are dependent on the UE deployment, which makes it very hard to find one method that fits all deployments.

This is where reinforcement learning becomes a topic of interest. Designing an agent that can quickly adapt to different environments and learn to maximize the quality of service may be a good solution.

2.4 Thesis Problem

We want to investigate whether or not using reinforcement learning algorithms will improve the power control, which in this context means reducing interference while achieving higher uplink data rates. The approach is to design a system using reinforcement learning that can control the outputted transmission power P by choosing different values for the δ parameter in Equation 2.1. Instead of having δ set to a single predetermined value, the system will train an agent to select different "best" values of δ for every UE. The agent will learn which δ values, also called actions, provide better results and which do not. When it finds a more desirable action it will repeatedly exploit that action; more on this is explained in chapter 3. The agent will be running at the base-station, and it feeds the result of the RL algorithm, which is the value of δ, as control information to the UEs.

The parameter δ can be either absolute or cumulative according to the 3GPP standard [6]. Only the first option is investigated in this thesis. The absolute values that the δ parameter can take consist of four values [6], see Table 2.1.

Action    Absolute δ [dB]
a0        -4
a1        -1
a2         1
a3         4

Table 2.1: A mapping between the agent's actions and the corresponding δ values [6].

An RL-agent is trained through evaluating the actions that were previously executed. We will therefore need a representation, or a measurement, of the environment that tells us how well the given action performed. This measurement is needed for the reward function, which is more thoroughly explained in subsection 3.5.1. What we choose to use for evaluating the performance is the Signal-to-Interference and Noise Ratio (SINR), which is a first-order approximation to the UE's data rate. Specifically, we can estimate the maximum transmitted data rate, also called throughput, by using the Additive White Gaussian Noise (AWGN) channel capacity formula from information theory,

\text{throughput} = \log(1 + \text{SINR}).    (2.2)
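The action set of Table 2.1 and the reward of Equation 2.2 can be expressed in a few lines of Python. This is only an illustrative sketch: the dictionary and function names are ours, the SINR in the example is a made-up linear-scale value, and the thesis obtains the SINR from Ericsson's simulator. The logarithm base is not specified in Equation 2.2; base 2 is used here, which only scales the reward.

    import math

    # Absolute delta values (in dB) the agent can choose between, Table 2.1.
    DELTA_DB = {0: -4, 1: -1, 2: 1, 3: 4}

    def throughput_reward(sinr_linear):
        """AWGN capacity approximation of Equation 2.2, used as the reward."""
        return math.log2(1.0 + sinr_linear)

    action = 3                                   # corresponds to delta = +4 dB
    print(DELTA_DB[action], throughput_reward(sinr_linear=15.0))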


Chapter 3

Reinforcement Learning

This chapter is dedicated to the theory behind reinforcement learning (RL). In this thesis, we have the problem explained in chapter 2 that we want to solve using several reinforcement learning algorithms. The currently existing solution to that problem uses different control methods to adjust the strength of uplink signals.

Because of the rapid developments in the field of reinforcement learning, it is a good idea to try to improve the current solution into a more advanced or intelligent one.

In order to implement RL algorithms, it is necessary to first acquire a general understanding of reinforcement learning, its advantages and disadvantages, as well as to broaden our knowledge about the reinforcement learning process. We also describe what category of "learning" reinforcement learning falls under and the differences between it and other types of learning methods. The background knowledge needed to comprehend how reinforcement learning can be useful for solving a problem in uplink power control is provided in this chapter.

3.1 Machine Learning

Machine learning is a type of Artificial Intelligence (AI) that allows software programs to predict the outcome of an action without being specifically programmed.

Machine learning is mainly broken down into supervised learning, unsupervised learning and reinforcement learning.

Supervised learning is learning from a data set that is equipped with labeled examples supplied by an experienced supervisor. Each provided data point describes a situation with a specification, or label, of the optimal action to take [3]. This type of machine learning is often used in categorizing objects. For example, we can train a program to identify cars in images by using a data set of images with labels that tell whether there is a car in each image or not. In order for the program to understand whether there is a car in an unfamiliar image, we need to have the program go through the labeled data set and get some sort of understanding of what a car looks like in an image.

Unsupervised learning is learning by training on data that is neither classified nor labeled. Unsupervised learning algorithms use iterative approaches to go through the data and arrive at a conclusion. It is about discovering structural aspects of the data while the system learns. Since the data set is unlabeled, the trained system has no evaluation of the accuracy of its output, a feature that separates unsupervised learning from supervised learning and reinforcement learning. This type of machine learning is useful when labeled data is scarce or unavailable, or when the tasks are too complex for supervised learning.

The other type of machine learning is Reinforcement Learning. Reinforcement learning shares some similarities with unsupervised learning. In particular, it does not train/depend on examples of correct behaviour [3]. However, the difference is that reinforcement learning’s goal is to maximize a reward signal instead of finding structures in unfamiliar data.

Reinforcement learning is the main subject of this thesis and it is explained in detail throughout the rest of the chapter. However, since the other kinds of learning are not within the scope of this thesis they are not further explained.

3.2 Reinforcement Learning

Reinforcement learning is the study of self-learning and its most essential feature is that, during training, the actions are evaluated after they are already taken. In fact, in the initial steps the system will not know for sure whether a given action is good or bad before executing that action. It does however estimate the outcome of all possible actions before choosing one of them. This makes reinforcement learning efficient when it comes to taking real-time decisions in an unknown environment [3]. For that reason, using RL may be a good method to solve our power control problem explained in section 2.4.

In reinforcement learning, there are learners and decision makers called agents.

The agent’s objective is to pick actions and reward every good action and punish, or negatively reward, every bad action that it takes. We can therefore argue that this type of machine learning is learning by doing, meaning an RL-agent has to take uncertain actions in the beginning that might lead to bad results, but it will eventu- ally learn and improve its performance after some exploration of the environment.

An environment in reinforcement learning is defined as a set of states that the agent interacts with and tries to influence through its action choices. The environment involves everything other than what the agent can control [3].


3.3 Multi-Armed Bandits

Multi-armed bandit is one of many different reinforcement learning problems. An easy way to explain the multi-armed bandit is through the classic example of the gambler at a row of slot machines in a casino. Imagine a gambler playing the slot machines. Since each machine has a fixed payout probability, the gambler has to decide which machines to play and how many times to play each machine, as well as making decisions about whether or not to continue playing the current machine or try playing a different one [7]. The gambler's goal is to discover a policy that maximizes the total payout, or the reward in RL terms. The actions in this problem are the pulls of the levers attached to the different slot machines. By repeatedly pulling the levers the gambler can observe the associated rewards and learn from these observations. When a sequence of trials is done, the gambler will be able to estimate the produced reward for each action, hence ensuring that optimal action choices are taken to fulfill the goal.

The problem that the gambler faces at each lever pull is the trade-off between the exploitation of the machine that delivers the highest payout based on the information the gambler already has, and the exploration of other machines to gather more information about the payout probability distribution of the machines.
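To make the casino analogy concrete, the sketch below simulates a small stationary multi-armed bandit in Python. The payout probabilities are arbitrary example numbers, not anything from the thesis; the point is only to show that repeatedly pulling levers and averaging the observed rewards lets the gambler estimate each machine's value.

    import random

    payout_prob = [0.2, 0.5, 0.35]           # hidden payout probability of each machine
    counts = [0] * len(payout_prob)          # times each lever has been pulled
    estimates = [0.0] * len(payout_prob)     # running average reward per lever

    for _ in range(10_000):
        a = random.randrange(len(payout_prob))               # pull a random lever (pure exploration)
        reward = 1.0 if random.random() < payout_prob[a] else 0.0
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a]  # incremental average

    print(estimates)   # approaches the hidden payout probabilities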

3.4 Exploration And Exploitation Dilemma

A great challenge in reinforcement learning is the trade-off between exploration and exploitation. For an agent to obtain higher rewards, or take more correct actions, it must exploit actions that are already tried out in the past and proved to deliver good rewards. However, it is also necessary for an agent to explore new actions and take the risk of possibly choosing bad actions in order to learn and make better decisions in the future.

The dilemma can be very complicated depending on the environment within which the agent is operating. In a simple static environment, we may want the agent to eventually stop exploring, because after a reasonable amount of exploration there may not be anything left to explore and taking actions for exploration purposes will only lead to bad performance. But in a more dynamic environment that is constantly subject to change, we do not want to stop exploring. This means there is a balancing problem: how often should we exploit and how often should we explore? This trade-off dilemma between exploration and exploitation is found in all of the reinforcement learning algorithms, but the method for balancing the trade-off varies from one to the other.

The explore-exploit trade-off methods we look into in this thesis are epsilon greedy, Upper-Confidence-Bound (UCB) and gradient bandit. We will explain these methods in chapter 4.


3.5 Finite Markov Decision Process

Markov Decision Processes (MDP) are the fundamental formalism for reinforcement learning problems as well as other learning problems in stochastic domains. MDPs are frequently used to model stochastic planning problems, game playing problems and autonomous robot control problems etc. The MDPs have in fact become the standard formalism for learning sequential decision making [8].

MDPs do not differ much from Markov processes. A Markov process is a stochastic process in which the past of the process is not important if you know the current state, because all the past information that could be useful to predict the future states is included in the present state. What separates MDPs from Markov processes is the fact that MDPs have an agent that makes decisions which influence the system over time. The agent's actions, given the current state, will affect the future states. Since the other types of Markov chains, the discrete-time and continuous-time Markov chains, do not include such a decision-making agent, the MDP can be seen as an extension of the Markov chain.

In RL problems, an environment is modelled through a set of states and these states are controlled through actions. The objective is to maximize a representation of the action's performance, called the reward. The decision maker, called the agent, is supposed to choose the best action to execute at a given state of the environment.

When this process is repeated, it is known as a Markov decision process, see Figure 3.1.

A finite MDP is defined as a tuple of (S, A, R) where S is a finite set of states, A a finite set of actions and R a reward function depending only on the previous state-action pair. The MDP provides a mathematical framework for solving decision making problems. The mathematical definition of the MDP is given by

p(r \mid s, a) := \Pr\{ R_t = r \mid S_{t-1} = s, A_{t-1} = a \},    (3.1)

which translates to the probability of producing the reward r given that the agent is handling state s by executing action a. It is through this definition that the agent makes decisions about what actions to take.

The states are some representation of the environment's situation. What the states specifically represent depends on what the designer designs them to be; in this thesis the states represent the different user equipment on the site. The gambler's multi-armed bandit problem explained in section 3.3 is simply a one-state Markov decision process, since the gambler is sitting at only one row of slot machines.

The actions, on the other hand, are simply a set of abilities that the agent can choose from in order to reach its goal, which is defined in the designed policy π. A set of actions can vary between low-level controls, such as the power applied to a hammer machine that strikes down a spike into a piece of wood, and high-level decisions, such as whether or not a self-driving car should crash into a wall or drive down a cliff. The actions can basically be any decision we want to learn how to make, whereas the states are any information that can be useful to know in making those decisions [3].

The MDP framework is useful in RL problems where the actions taken by the agent not only affect the immediate reward, but also the following future situations, or states. Hence the chosen action will also influence the future rewards [3]. The agent will in theory learn which specific action works best or produces maximum reward in each state. However, in order to maximize the reward in the long run, it will also have to choose specific actions that may not maximize the reward at a given time step or a given state but will result in a greater overall reward in future time steps or future states. So there is a trade-off between favoring immediate rewards and future rewards that is resolved through various RL methods.

Figure 3.1: A diagram of state, reward and action in reinforcement learning that shows the interaction between a learning agent and an environment. A_t denotes the action chosen by the agent based on the policy π, R_{t+1} denotes the reward (or the consequence) of the taken action, and S_{t+1} denotes the state the agent finds itself in after the action was taken [3].

The process diagram for the MDP is shown in Figure 3.1. It starts with the agent deciding to take an action A_t chosen from a set of possible actions A based on a predesigned policy π. This occurs every discrete time step t = 1, 2, 3, ..., T, where T is the final time step. The action A_t is normally chosen conditioned on knowing the state S_t; however, the state of the environment is usually not known at the initial starting point. The first action A_1 is randomly chosen in most cases. After the chosen action A_t is executed, we will be in the next time step and the consequence of the taken action is fed back to the agent in the form of a reward R_{t+1}. The reward system is used to evaluate the actions and it tells the agent how good the performance of a chosen action was. With the reward, the agent will update the likelihood of action A_t being picked in the future. Additionally, when action A_t is executed, the agent will find itself in a different state S_{t+1}; this information is read from the environment and sent to the agent in order for the agent to choose the next action A_{t+1} on that basis. The process will then repeat itself again. With this closed loop agent-environment interface the system becomes autonomous, see Figure 3.1.

The MDP framework is probably not sufficient to solve all decision-learning problems, but it is very useful and applicable in many cases [3]. It is capable of reducing any goal-directed learning problem into three signals exchanged between the agent and the environment: a signal that represents the choices made by the agent, called actions, a signal that represents the basis on which the choices are made, called states, and a signal that represents the agent's goal, called reward [3].
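The agent-environment loop of Figure 3.1 can be written down as a short, generic skeleton. The Environment and Agent interfaces below are our own minimal assumptions, not classes from the thesis or any particular library; any of the bandit agents from chapter 4 could be plugged into the same loop.

    class Environment:
        """Minimal interface: apply an action, return (next_state, reward)."""
        def step(self, action):
            raise NotImplementedError

    class Agent:
        """Minimal interface: pick an action for a state, learn from the reward."""
        def act(self, state):
            raise NotImplementedError

        def update(self, state, action, reward):
            raise NotImplementedError

    def run(agent, env, initial_state, num_steps):
        state = initial_state
        for t in range(num_steps):
            action = agent.act(state)               # A_t, chosen according to the policy
            next_state, reward = env.step(action)   # environment returns S_{t+1} and R_{t+1}
            agent.update(state, action, reward)     # learn from the consequence
            state = next_state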

3.5.1 Reward Function And Goals

The goal in a reinforcement learning problem is to maximize the total amount of reward that is generated by the environment and passed along to the agent. The goal is not to maximize the immediate reward, but to maximize the cumulative reward in the future. Therefore the reward signal is one of the vital parts of any reinforcement learning problem, and using the reward to formulate a goal is a very specific feature of reinforcement learning [3].

The best way to explain how reward functions work by using reward signals is through examples. Imagine a robot that we want to use during autumn for picking up fallen leaves in a garden. A reasonable reward function will give it a reward of zero most of the time, a reward of +1 for every leaf it picks up and stores in its basket, and a reward of -1 every time it makes a mistake, like crashing into something or breaking a vase. Since the agent is designed to maximize the reward, the goal will therefore be to pick up as many leaves as possible while avoiding mishaps like crashing or breaking vases.

Formulating the goals using only the reward signal might sound a little confined for more complicated situations, but it has in fact proved to be a very suitable way to do so [3]. Using the reward signal, one can create a complex reward function that shapes the main goal as well as subgoals for the agent. The reward function must also make sure that the agent is rewarded properly to achieve the end goal in the right way. Achieving the end goal the right way may be trickier than one thinks.

The agent will always learn to maximize its rewards, but this does not always mean that it is doing what the designer wants it to do. There are many examples where reward functions fail and cause the agent to learn to make matters worse. This failure in reward function design is common and has been seen in many RL problems.

There is something called the cobra effect: it happens when an attempt to solve a problem actually makes matters much worse, as an unexpected consequence [9]. An example of the cobra effect was found in the work done in [10], where they mention how a flaw in the reward function made the agent learn to do something other than the intended goal. In their work, reinforcement learning is used on a robot arm that is meant to learn how to pick up a block and move it as far away as possible, in other words, fully stretch the arm before putting down the block. So the reward function was designed to reward the agent based on how far the block got from the robot arm's base. The designer thought that with this reward function the agent would learn to pick the block up and move it as far away as possible before setting it down, which makes sense. But instead of doing that, the agent learned to maximize the reward by hitting the block as hard as it could and shooting it as far away as possible. This happened because the block got further away and the agent was rewarded better than for picking it up and setting it down. After further training time, it even learned to pick the block up and throw it. The reinforcement learning algorithm worked, in the sense that it did in fact maximize the reward. The goal was, however, not achieved at all because of the poorly designed reward function.

This proves how important it is to formulate the reward function correctly in an RL problem. The reward function is a way to tell the agent precisely what you want to achieve and not how you want to achieve it [3].


Chapter 4

Method

There are many different reinforcement learning methods with which to tackle a reinforcement learning problem. We will however only explain the methods that were implemented in this thesis, which are the epsilon greedy (ε-greedy), Upper-Confidence-Bound (UCB) and gradient bandit algorithms. We will go through the algorithms in detail, how they were implemented and what difficulties were found when the theory was put into practice.

The reason behind choosing the mentioned RL algorithms to solve our RL problem is that our problem is formulated as a multi-armed bandit, and according to Sutton's book on reinforcement learning [3], a multi-armed bandit problem can be solved through ε-greedy, UCB and gradient bandit algorithms. All three of them were implemented in this project.

This chapter starts with a simplified setting of the uplink power control problem, one that does not involve learning to handle more than one situation, or one UE. We formulate the problem as a non-associative search called a multi-armed bandit. With this setting we can avoid the complexity of a full reinforcement learning problem, yet still be able to evaluate how the implemented methods work. The problem was later reformulated as a more general reinforcement learning problem, an associative search, that is, when the solution methods involve learning to handle more than one situation.

The reward functions designed for the RL agent are stated and explained in a section of their own in this chapter. As mentioned in the previous chapter, the reward function plays a big role in reinforcement learning applications. It is therefore necessary to clarify what the different reward functions represent and why they are designed that way.

We will end this chapter by briefly explaining how the simulation works and which of its parameters need to be set before running a simulation. A detailed explanation of Ericsson's simulation is left out because it is not within the scope of this thesis. The simulation is only used to evaluate our implemented work.


4.1 Non-Associative Search

A non-associative search in our problem means handling the power transmission for only one UE. Since the power control problem was buried inside Ericsson’s existing uplink power control implementation, it was reasonable to formulate the problem as a non-associative multi-armed bandit problem. That way, it was easier to manipulate the already implemented power control code, and then try out the chosen methods, evaluate the results and identify further problems.

4.1.1 A k-Armed Bandit Problem in Uplink Power Control

In reinforcement learning, there are three important terms; states, actions and rewards. In order to draw up the power control problem as a k-armed bandit, we need to first identify what these terms correspond to.

We define the states as the different UEs; each UE is a state of its own. Since we are only considering one UE for now, the use of states is insignificant. The states play a bigger role in section 4.3, where the problem is reformulated as an associative search. We will therefore explain the role of states in more detail in section 4.3.

Considering the power control problem explained in section 2.4, we are repeatedly faced with a choice among k different values for the δ parameter; we call this choice the action selection. Each possible choice is an action the agent can choose from.

When an action is selected and executed, a reward that is dependent on that action is returned. In our case, the reward is equal to the throughput. If we for example select the action δ = 4 dB from Table 2.1 and transmit the power signal, then the SINR is measured and from that we get the throughput according to Equation 2.2 in chapter 2.

The throughput is directly considered to be the reward for that action. It is a linear value measured in megabits per second (Mbps), so the higher the throughput, the better the connection between a UE and a base station. The objective here is to maximize the reward and eventually make the transmitted uplink power converge.

Now that the actions, states and rewards are clarified and we know what they stand for in our power control problem from section 2.4, we propose three different algorithms to solve it. The algorithms are explained below.

4.2 Implemented Methods

The first step for all of the different implemented algorithms is the same, that is, choose and execute a random action. The agent's first decision on what action to take is simply a blind guess, because at this stage the agent has no knowledge about the environment. Every action is equally weighted and the agent is purely exploring. After the first step the agent will get a reward for the executed action and, along with that, a reference point against which to compare the rewards of other actions.

By repeating the action selection process and storing/accumulating the reward for each selected action, we can get the system to "learn" to become somewhat intelligent. The variable that stores the reward information is called the action-value; it is used to maximize the reward by concentrating the action selection on the choices that have proven to provide the system with a higher reward, which means higher SINR and higher throughput.

The action selected at time step t can be denoted as A_t, and the reward for that action as R_{t+1}, because the reward is returned in the next time step. The action-value for an action a is then the expected reward given that a is the selected action A_t [3],

Q_t(a) = \mathbb{E}\big[ R_{t+1} \mid A_t = a \big].    (4.1)

The action-values are a future estimate made based on previous rewards; we do not know their values with certainty until the actions are executed. If we knew the values before the execution, the problem would be irrelevant to solve because we would always choose the action that we know would produce the highest reward value.

The estimated value of action a at time step t is denoted in the implementation as Q_t(a).

After every action has been chosen and executed, there should be at least one action that performed better than the rest, or produced the highest reward. However, just because that action produced the highest reward once does not mean it is the optimal choice in the future. We will therefore need to explore other actions every now and then, which leads to the exploit-explore dilemma explained in section 3.4.

Every implemented algorithm handles this dilemma differently.

4.2.1 Epsilon Greedy Algorithm

In epsilon greedy, the action selection comes down to two choices: greedy and non-greedy choices. That choice is dependent on a predetermined constant called epsilon (ε), which is essentially a probability that resolves the exploit-explore dilemma that was mentioned in section 3.4. Since the constant ε is a probability, it can take values 0 ≤ ε ≤ 1, and it represents the exploration probability, which is explained in more detail below.

Action selection

Considering the estimates of all the action-values Q, there should always be at least one action whose estimate is higher than the other ones. This action is the greedy action, and when it is selected, it means that the algorithm is exploiting the current knowledge and not exploring to gain new knowledge. If there are two or more action-values that are equal and are together the highest in the list, then one of them is randomly chosen. All the other actions that are not greedy actions are called non-greedy actions. And if the agent chooses a non-greedy action, it means that the system is exploring and gaining new knowledge about the environment.

The remaining question is: how does the agent actually decide whether to choose a greedy or a non-greedy action?

Suppose we set ε to some value 0 ≤ ε ≤ 1. The algorithm then decides on choosing greedy or non-greedy actions according to the probability distribution

A \leftarrow \begin{cases} \text{greedy} & \text{with probability } 1 - \varepsilon \text{ (breaking ties randomly)} \\ \text{non-greedy} & \text{with probability } \varepsilon \end{cases}    (4.2)

where setting epsilon to a higher value will lead the algorithm to explore more and exploit less, and vice versa.

Exploitation is the way to go in order to maximize the reward/SINR at a given time step; however, exploration can lead to better total reward in future steps.

Assume that at a given time step t there is a greedy action (the one with the highest estimated action-value) which has been tried out multiple times before time step t, making us certain that its estimate is reliable, while on the other hand there are several other actions that are non-greedy (have lower estimated action-values) but whose estimates carry great uncertainty. That implies that among these non-greedy actions there may be an action that will produce a higher reward than the greedy action, but the agent does not know it yet, nor which one of them it is. It may therefore be wise to explore these lower-valued actions and discover actions that may produce a better reward than the greedy action.

Once the agent has chosen whether to take a greedy or a non-greedy action, it will then need to separate the greedy action from the non greedy ones. The agent does this through

greedy\_action = \arg\max_a \big[ Q_t(a) \big],    (4.3)

where greedy_action is an array that stores the highest estimated action-values, while the rest of the estimated action-values are stored in another array called nongreedy_actions. For the greedy actions, if two or more actions achieve the maximum Q_t, meaning they have the same value, then it chooses randomly between these actions.

Action-value estimation and update

To make sure that the reinforcement learning agent learns to select the optimal actions, it has to be updated with a reward after every execution of the previously selected action. Updating the agent in this case is the same as estimating the values of actions after every action taken.

There are several ways to estimate the values of the actions. A natural way to do the estimation is by computing the average of every produced reward at every time step that a given action a was chosen. For this the agent needs to store the number of times each action has previously been selected, N_t(a), and the respective rewards at each time step,

Q_t(a) = \frac{\text{sum of rewards when } a \text{ was selected}}{\text{number of times } a \text{ has been selected}},    (4.4)

where, by the law of large numbers, Q_t(a) will converge to a specific value when the denominator goes to infinity [3], provided that the expected reward of each action does not fluctuate over time. This method of estimating the action-value is called sample-average in [3] because every estimate is an average of the sample of relevant rewards. This method is one of the simpler methods to estimate action values, and not necessarily the best one. It does however serve its purpose.

In the implementation, this method requires a lot of memory and computation due to the growing number of saved rewards over time. The implementation of the estimation method was therefore modified and improved into a method called the incremental action-value.

Let the produced reward after the i-th selection of a given action be denoted R_i, and the action-value estimate after the action has been selected n - 1 times be denoted Q_n. Then Equation 4.4 is equivalent to

Q_n = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n - 1} = \frac{1}{n-1} \sum_{i=1}^{n-1} R_i.    (4.5)

The incremental action-value method is a more reasonable method to implement in code. Instead of maintaining a record of all the rewards produced for every action at every time step and carrying out larger and larger computations as the number of iterations increases, we can simply use an incremental formula that updates the average with a small computation every time a new reward needs to be processed. The formula for computing the new updated average of all n rewards was therefore converted to


Q_{n+1} = \frac{1}{n} \sum_{i=1}^{n} R_i
        = \frac{1}{n} \Big( R_n + \sum_{i=1}^{n-1} R_i \Big)
        = \frac{1}{n} \Big( R_n + (n-1) \frac{1}{n-1} \sum_{i=1}^{n-1} R_i \Big)
        = \frac{1}{n} \Big( R_n + (n-1) Q_n \Big)
        = Q_n + \frac{1}{n} \big[ R_n - Q_n \big].    (4.6)

This implementation requires memory only for the old estimate Q_n and the number of previous selections n for each action, and only the small computation of Equation 4.6 every time the agent is updated.
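A quick numerical check of Equation 4.6, using made-up reward values, shows that the incremental update reproduces the plain sample average without storing the full reward history:

    rewards = [2.0, 0.5, 3.0, 1.5, 2.5]    # arbitrary example rewards for one action

    q, n = 0.0, 0
    for r in rewards:
        n += 1
        q += (r - q) / n                   # Q_{n+1} = Q_n + (1/n)(R_n - Q_n)

    print(q, sum(rewards) / len(rewards))  # both print 1.9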

The pseudocode for the implemented bandit algorithm using ε-greedy action selection is shown in Algorithm 1.

Algorithm 1 Epsilon greedy algorithm applied to a bandit problem [3]

Initialize, for a = 1 to k:
    Q(a) ← 0
    N(a) ← 0
Select initial action: A ← random selection
Repeat:
    R ← execute action A and return the reward
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1 / N(A)) · [R − Q(A)]
    A ← arg max_a Q(a)    with probability 1 − ε (breaking ties randomly)
        a random action   with probability ε

To further improve Algorithm 1, the initial values of Q(a) can be changed from 0 to an optimistically high initial value. The optimistic initialization method was used by Sutton (1996) [11]; it is an easy way to provide the agent with some prior knowledge about the expected reward level. This method biases the agent toward the actions that have not yet been tried in the beginning steps of the learning process.

Since the non-selected actions are biased by their high initial values, the ε-greedy agent will most likely pick those actions first. This way the agent tries out all the actions in its first iterations. The bias is immediately removed once all the actions have been taken at least once, because their action-value estimates will then be replaced with a lower value than the initial optimistic value. The advantage of this method is that it does not miss out on trying all of the actions at the start, and it helps converge to the optimal action choice faster.
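A minimal runnable Python version of Algorithm 1, with the optimistic initial value as an optional parameter, could look as follows. The class and parameter names are our own, and in the thesis the reward would come from the simulator rather than from a toy function.

    import random

    class EpsilonGreedyAgent:
        def __init__(self, num_actions, epsilon=0.1, initial_q=0.0):
            self.epsilon = epsilon
            self.q = [initial_q] * num_actions   # action-value estimates Q(a)
            self.n = [0] * num_actions           # selection counts N(a)

        def act(self):
            if random.random() < self.epsilon:   # explore with probability epsilon
                return random.randrange(len(self.q))
            best = max(self.q)                   # exploit, breaking ties randomly
            return random.choice([a for a, v in enumerate(self.q) if v == best])

        def update(self, action, reward):
            self.n[action] += 1
            self.q[action] += (reward - self.q[action]) / self.n[action]

    # Optimistic initialization: a high initial Q makes the agent try every action early on.
    agent = EpsilonGreedyAgent(num_actions=4, epsilon=0.1, initial_q=5.0)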

4.2.2 Upper Confidence Bound Action Selection Algorithm

A greedy action selection method selects only the "best" action at a given time step, but the best action according to the agent's current knowledge at that time step is not necessarily the actual best action. Therefore, exploring and taking the risk of selecting a bad action from time to time can be beneficial. The ε-greedy method forces the agent to sometimes (depending on the probability value of ε) take non-greedy actions, which makes the agent gain more knowledge. There is however room for improvement. The non-greedy action in the ε-greedy algorithm is selected randomly, with no consideration of how nearly greedy the action is nor how uncertain its action-value is. A great improvement in these exploration steps would be to select an action among the non-greedy actions by taking into account their estimation uncertainties as well as how close they are to the maximum (the greedy action). This method of picking non-greedy actions according to their potential for being the optimal one, instead of blindly picking an action, is called the Upper Confidence Bound (UCB) action selection. The UCB method was developed by Auer, Cesa-Bianchi and Fischer in 2002 [12].

In order to take both the action-value estimate Q(a) and its uncertainty into account, we select the actions according to

A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right],    (4.7) [3]

where ln t is the natural logarithm of t, N_t(a) is the number of times that action a has been selected before time step t, and c is a constant larger than 0 that controls the degree of exploration. N_t(a) is updated (incremented by 1, similarly to Algorithm 1) before Equation 4.7 is evaluated, which means that under no circumstances is the value of N_t(a) equal to zero.

The square-root term in the UCB action selection method is a way to estimate the uncertainty, or the variance, in the estimate of the action-value [3]. By taking that into account and establishing a confidence level c, the maximized estimate is the upper bound of the true action-value of a given action a. If an action a has been selected much less than the other actions, then the square-root term of action a will be much higher than that of the other actions. This will thus favor action a by making it more probable to get selected, since its total value might become higher than the other actions' total values. The natural logarithm in the numerator of the uncertainty term makes the term less significant over a longer time, but still unbounded. The UCB action selection equation guarantees that all actions are eventually selected, but actions that have a low action-value estimate or that have been selected many times in comparison to the other actions will over time be selected fewer times.
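The UCB selection rule of Equation 4.7 translates directly into code. The sketch below assumes the same action-value list q and count list n as in the ε-greedy sketch above, and that every action has already been selected at least once so that N_t(a) is never zero, as noted in the text.

    import math

    def ucb_action(q, n, t, c=2.0):
        """Pick the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a)), Equation 4.7."""
        scores = [q[a] + c * math.sqrt(math.log(t) / n[a]) for a in range(len(q))]
        return scores.index(max(scores))

    # Example with four actions, arbitrary estimates and counts, at time step t = 100.
    print(ucb_action(q=[1.2, 0.8, 1.0, 0.9], n=[40, 10, 30, 20], t=100))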

4.2.3 Gradient Bandit Algorithms

Another possible way to find the optimal action in this problem is the gradient bandit algorithm. The gradient bandit does not, unlike the two previously discussed algorithms, use action-value estimates to select its optimal actions. It instead uses and learns through a numerical preference for every possible action, H_t : A → R.

The action preferences are updated and recorded after each iteration using the throughput between the two nodes (UE and base station), which is used as the reward; this is stated in Equation 4.9 further down.

In this algorithm, an action that has a higher preference than the other actions is more likely to be selected and executed. All the actions have a selection probability, which is determined through a soft-max distribution using the preferences of the actions.

In probability theory, a soft-max function (also called the normalized exponential function [13]) is used to represent a categorical distribution, which is a probability distribution over K different possible outcomes. It is calculated using the exponential values of the action preferences H, see Equation 4.8.

The probability of an action a being selected at time step t is calculated as follows,

P(A_t = a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}.    (4.8)

This calculation is done for all the actions in every iteration. The total probability of all actions should always be equal to one, with the exception of a very small deviation caused by rounding up or down due to the limited floating-point precision a computer can handle. Furthermore, the actions' initial preferences are all set to zero so that all actions have an equal probability of being selected at the start.

When an action, A_t = a, is selected and executed at time step t, the simulation returns a reward, R_t, which we can then use to update the preferences of the actions depending on the reward. The agent is also meant to keep a record of the average reward produced through all the previous time steps up to and including the current time step t. The average reward, R̄_t, is calculated just like in Equation 4.5 and is used as a reference point to indicate whether the returned reward R_t was smaller or greater than the average reward. If R_t is greater than the average reward R̄_t, then that action's preference is increased and the preferences of all the other actions are equally decreased for the next iteration, and vice versa if R_t is smaller than R̄_t, see Equation 4.9. Furthermore, increasing or decreasing the action preferences will equivalently affect the probability of the respective actions due to the calculation done in Equation 4.8. The action-preference update step for this algorithm is as follows,

H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar{R}_t)\big(1 - P(A_t)\big),    for the selected action a = A_t,
H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R}_t) P(a),    for all a ≠ A_t,    (4.9)

where α > 0 is a constant predefined step-size parameter [3].

The gradient bandit algorithm is expected to work better than the ε-greedy and UCB algorithms because of the way it computes the likelihood of actions being chosen. An advantage of the gradient algorithm is that for every selected and executed action, all of the action preferences are updated. When the agent is doing well, producing higher and higher rewards, the gradient algorithm will increase the likelihood of choosing the action that is causing the better performance while decreasing the likelihood of choosing the rest. This feature will cause the agent to keep exploiting a given action as long as there is an improvement in the produced rewards, and to turn its focus to exploring other actions when there is not. On the contrary, the ε-greedy and UCB algorithms are set to explore a fixed percentage of the time, and that value does not adapt to the performance of the agent.
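Equations 4.8 and 4.9 can likewise be sketched in a few lines of Python. The step size and the class name below are our own assumptions; in the thesis the reward is the measured throughput and R̄_t is the running average reward.

    import math
    import random

    class GradientBanditAgent:
        def __init__(self, num_actions, step_size=0.1):
            self.h = [0.0] * num_actions   # action preferences H_t(a), all start equal
            self.alpha = step_size
            self.avg_reward = 0.0          # baseline, the average reward so far
            self.t = 0

        def probabilities(self):
            exp_h = [math.exp(h) for h in self.h]          # soft-max, Equation 4.8
            total = sum(exp_h)
            return [e / total for e in exp_h]

        def act(self):
            return random.choices(range(len(self.h)), weights=self.probabilities())[0]

        def update(self, action, reward):
            self.t += 1
            self.avg_reward += (reward - self.avg_reward) / self.t
            p = self.probabilities()
            for a in range(len(self.h)):                   # preference update, Equation 4.9
                if a == action:
                    self.h[a] += self.alpha * (reward - self.avg_reward) * (1 - p[a])
                else:
                    self.h[a] -= self.alpha * (reward - self.avg_reward) * p[a]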

4.3 Associative Search

Associative search, also called contextual bandits [3], introduces the concept of states.

The states are representations of the environment, and in this thesis we designed the states to represent the individual UEs. In the non-associative search, the agent completely ignores the state of the environment because it only deals with one UE; states are therefore irrelevant in that case, unless we design them to represent something else.

We now reformulate the problem as an associative search, which means there are multiple UEs on the site, and we use the states to let the RL agent distinguish between them: the state tells the agent which UE it is currently dealing with. This is important because the agent needs to learn which values of δ work best for each UE. Each UE has different properties, and the optimal transmission power will not be the same for every UE; a UE can for example be near or far away from the base-station, or located in a crowded or uncrowded area. The agent also faces much bigger challenges when there are multiple UEs on site, such as interference and quality of service.

It has to find a way to minimize the interference while maximizing the overall throughput between the UEs and the base-station. It also has to take into account that the UEs' transmission powers are coupled: changing the transmission power of one UE changes not only the throughput of that UE, but also the throughput of the other UEs.

Figure 4.1: An illustration of the associative search problem. The agent on the left side of the figure commands the UEs to use the chosen action A_t. After a UE's uplink transmission is done, the throughput is measured and returned to the agent through the reward R_{t+1}. At the same time, the agent is informed through the state S_{t+1} which UE it needs to select an action for next. The indices n and m denote the number of UEs and actions, respectively. On the right side of the figure is the base-station tower that the UEs connect to.

The Markov Decision Process (MDP) framework explained in section 3.5 is used for this associative search problem. With an MDP, the agent can use the states to keep track of the environment's situation and try to find an optimal transmission power, i.e. an optimal δ value, for each of the UEs, see Figure 4.1. The agent still uses the methods explained in section 4.2 for selecting and rewarding actions, only now it runs a separate, parallel instance of them for each UE. Additionally, the agent now conditions its action choices on the states. Thus the action-value, or the estimated reward, given in Equation 4.1 is updated to

Q_t(s, a) = E[R_{t+1} | S_t = s, A_t = a], [3]   (4.10)

where s and a denote the given state of the environment and the selected action, respectively.
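The sketch below illustrates how the gradient bandit can be extended to this contextual setting by keeping one preference vector per state, i.e. per UE. The class and argument names are illustrative and not taken from the thesis code; the step-size value is a placeholder for the value in Table 4.1.

```python
import numpy as np

class ContextualGradientBandit:
    # Minimal sketch of the associative-search agent: one preference vector is
    # kept per state (i.e. per UE), so the choice of the closed-loop correction
    # delta is conditioned on which UE is currently scheduled (Equation 4.10).

    def __init__(self, n_ues, n_actions, step_size=0.1):
        self.H = np.zeros((n_ues, n_actions))   # per-UE action preferences
        self.avg_reward = np.zeros(n_ues)       # per-UE baseline \bar{R}_t
        self.counts = np.zeros(n_ues)
        self.step_size = step_size              # placeholder; see Table 4.1

    def _probs(self, ue):
        # Soft-max over the preferences of the given UE (Equation 4.8).
        exp_h = np.exp(self.H[ue] - self.H[ue].max())
        return exp_h / exp_h.sum()

    def select_action(self, ue):
        p = self._probs(ue)
        return np.random.choice(len(p), p=p)

    def update(self, ue, action, reward):
        # Update the running average reward, then apply Equation 4.9 for this UE.
        p = self._probs(ue)
        self.counts[ue] += 1
        self.avg_reward[ue] += (reward - self.avg_reward[ue]) / self.counts[ue]
        gap = reward - self.avg_reward[ue]
        one_hot = np.zeros_like(p)
        one_hot[action] = 1.0
        self.H[ue] += self.step_size * gap * (one_hot - p)
```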

4.4 Reward Functions

So far in this chapter, we have assumed that the reward value is always provided.

However, as explained in subsection 3.5.1, the reward function is more complicated than it appears. For that reason, different reward functions were designed and tried out.

Reward functions are especially important in this thesis because the reward function is the only way to let the agent know that the different UEs are coupled: the actions chosen for one UE affect the throughput of the other UEs through interference.

The reward function was designed and gradually improved as the results were evaluated. The different reward functions are explained in the subsections below.

4.4.1 Local Reward

Local reward is simply another name for the throughput between the base-station and a given UE. It is the first reward function that was designed in this thesis: the measured throughput is used directly as the reward, see Equation 2.2.

Using only the local reward, the agent will learn to take actions that maximize the throughput of the currently handled UE only, not the overall throughput. The agent will disregard how its actions affect the other UEs; this reward function effectively assumes that there is a single UE on the site and nothing to interfere with. The local reward is useful for analyzing how well the overall design of the system works, and it may even be enough to reach our goal of maximizing the total throughput. However, we want the agent to also take the other UEs into account when choosing actions, which is why we introduce the global reward.

4.4.2 Global Reward

Global reward is the average throughput of all the UEs that the agent handles simultaneously. Interference between the UEs only occurs when they are handled simultaneously, so computing the average throughput, i.e. the average local reward, is a way to capture that interference. The global reward is designed to shape the agent's goal: through it we can let the agent know that raising the overall throughput of all the UEs is what we want, and not just the throughput of each UE for itself.

Assume that the agent sets the δ parameter to a value that maximizes the throughput of a given UE but also lowers the throughput of the other UEs. Using the global reward, we expect the agent to recognize that this is not a good choice, because the reward decreases, and to eventually try something else.
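The difference between the two reward functions can be summarized in a few lines of code. The sketch below is only illustrative; the throughput numbers are invented and the function names are not taken from the thesis implementation.

```python
def local_reward(throughputs_mbps, ue):
    # Local reward: the throughput of the UE the agent is currently handling.
    return throughputs_mbps[ue]

def global_reward(throughputs_mbps):
    # Global reward: the average throughput of all UEs scheduled simultaneously,
    # so an action that boosts one UE at the expense of the others is penalised.
    return sum(throughputs_mbps) / len(throughputs_mbps)

# Example with three UEs scheduled in the same time step (one per sector).
throughputs = [12.4, 3.1, 7.8]                 # Mbps, invented for illustration
r_local = local_reward(throughputs, ue=0)      # 12.4
r_global = global_reward(throughputs)          # about 7.77
```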


4.5 Simulation Setup

We run the RL algorithms on an Ericsson-developed simulation. The simulation is equipped with the features needed to replicate a real-life scenario: we can simulate one or more sites, each consisting of a base-station and a number of UEs connected to that base-station. The environment is dynamic, so some of the UEs' properties change during a simulation run, e.g. the distances of the UEs to the base-station.

In the simulation settings, we can set the number of propagation frames, which is the number of times the power control program schedules each UE during a simulation. Accordingly, the number of propagation frames is also the number of times the agent can choose actions for the UEs, i.e. set the δ parameter in Equation 2.1 to one of the values stated in Table 2.1. The number of propagation frames was usually set to 4000 in our experiments.

Another parameter that needs to be set prior to running the simulation is the path loss compensation factor α, mentioned in Equation 2.1 in chapter 2. The α value was set to 1 in our experiments, which corresponds to full path loss compensation for the transmitted uplink signal.

Additionally, we have to define the number of sites and the total number of UEs we want to use in the simulation. A site consists of a base-station and all the UEs that are handled by that base-station. When we set the number of sites and UEs, the simulation program randomly drops the UEs at different locations on the sites.

The UEs can therefore end up far away from each other as well as very close to each other. This randomness plays a role in the interference, making every simulation run different from the next. For that reason, we set up the system so that every experiment is repeated exactly 50 times, and the results are averaged before being evaluated.

We repeat the process 50 times because the results stabilize after this number of runs.

The agent's ability to handle multiple UEs simultaneously depends on the number of sectors. Each site is divided into three sectors, see Figure 4.2, which means that the site's base-station can schedule three UEs simultaneously.

Accordingly, the agent can also handle three UEs at each time step. The number of sectors can be changed, but we had no reason to do so in this project.

The RL algorithms' constant parameters, i.e. the exploration probability ε, the degree of exploration c, and the step size α used in the ε-greedy, UCB, and gradient methods respectively, were set to the values that produced the best performance. These values could certainly be tuned further; however, for most of our simulations the values stated in Table 4.1 were used. They were decided after multiple trials.
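For reference, the experiment setup described above can be summarized as a configuration sketch. The Ericsson simulation's actual interface is proprietary, so all names below are invented for illustration, and the RL hyper-parameter values of Table 4.1 are deliberately not reproduced.

```python
# Hypothetical configuration sketch of the experiments described above.
EXPERIMENT_CONFIG = {
    "propagation_frames": 4000,      # scheduling opportunities per UE and run
    "path_loss_compensation": 1.0,   # alpha in Equation 2.1 (full compensation)
    "sectors_per_site": 3,           # the base-station schedules 3 UEs at a time
    "repetitions": 50,               # each experiment is averaged over 50 runs
    # The RL hyper-parameters (epsilon, c and the step size) take the values of
    # Table 4.1, which is not reproduced in this section.
}

def run_experiment(simulate_once, config=EXPERIMENT_CONFIG):
    # Repeat the same simulation the configured number of times and average the
    # results, as described in the text above.
    results = [simulate_once(config) for _ in range(config["repetitions"])]
    return sum(results) / len(results)
```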
