
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Bridging Sim-to-Real Gap in Offline Reinforcement Learning for Antenna Tilt Control in Cellular Networks

MAYANK GULATI

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Autonomous Systems
Date: January 25, 2021
Industrial Supervisors: Filippo Vanella and Ezeddin Al Hakim
Academic Supervisor: Alexandre Proutiere
Examiner: Gabor Fodor
School of Electrical Engineering and Computer Science
Host company: Telefonaktiebolaget LM Ericsson

Swedish title: Överbrygga Sim-to-Real Gap i inlärning av offlineförstärkning för antennlutningskontroll i mobilnät


Abstract

Antenna tilt is the angle between the antenna's radiation beam and the horizontal plane. This angle plays a vital role in determining the coverage of the network and its interference with neighbouring cells and adjacent base stations. Traditional methods for network optimization rely on rule-based heuristics to make antenna tilt decisions that achieve the desired network characteristics. However, these methods are quite brittle and incapable of capturing the dynamics of communication traffic. Recent advances in reinforcement learning have made it a viable alternative, but existing learning approaches are either confined to a simulation environment or restricted to off-policy offline learning. So far, no effort has been made to overcome both of these limitations at once and make such methods applicable in the real world. This work proposes a method for transferring reinforcement learning policies from a simulated environment to a real environment, i.e. sim-to-real transfer through the use of offline learning. The approach uses a simulated environment together with a fixed real-world dataset to compensate for the above limitations.

The proposed sim-to-real transfer technique utilizes a hybrid policy model, composed of a portion trained in simulation and a portion trained on offline real-world data from cellular networks. This makes it possible to merge samples from the real-world data with the simulated environment, modifying the standard reinforcement learning training procedure through knowledge sharing between the two environments' representations. On the one hand, simulation yields better generalization performance than conventional offline learning alone, since it complements offline learning with learning from unseen simulated trajectories. On the other hand, the offline learning procedure closes the sim-to-real gap by exposing the agent to real-world data samples. Consequently, this transfer learning regime enables us to establish optimal antenna tilt control, which in turn results in improved coverage and reduced interference with neighbouring cells in the cellular network.

Keywords

reinforcement learning, transfer learning, simulation-to-reality, simulator, real-world, real-world network data, remote electrical tilt optimization, cellular networks, antenna tilt, network optimization.


Sammanfattning

Antenna tilt is the angle formed by the radiation beam and the horizontal plane. This angle plays an important role in determining the coverage of the network and its interference with neighbouring cells and adjacent base stations. Traditional methods for network optimization rely on rule-based heuristics to make decisions about antenna tilt optimization in order to achieve the desired network characteristics. However, these methods are quite rigid and unable to capture the dynamics of communication traffic. Recent advances in reinforcement learning have made it a viable solution to this problem, but even this learning approach is either limited to its simulation environment or limited to off-policy offline learning. So far, no efforts have been made to overcome the previously mentioned limitations in order to make it applicable in the real world. This work proposes a method that consists of transferring reinforcement learning policies from a simulated environment to a real environment, i.e. sim-to-real transfer through the use of offline learning. The method uses a simulated environment and a fixed dataset to compensate for the aforementioned limitations.

The proposed sim-to-real transfer technique uses a hybrid policy model, consisting of one part trained in simulation and one part trained on offline real-world data from cellular networks. This makes it possible to merge samples from real-world data into the simulated environment, thereby modifying the standard reinforcement learning training procedures through knowledge sharing between the two environments' representations. On the one hand, simulation makes it possible to achieve better generalization performance compared to conventional offline learning, since it complements offline learning with learning through unseen simulated trajectories. On the other hand, the offline learning procedure makes it possible to close the sim-to-real gap by exposing the agent to real-world data samples. Consequently, this transfer learning regime makes it possible to establish optimal antenna tilt control, which in turn results in improved coverage and reduced interference with neighbouring cells in the mobile network.


Acknowledgments

Firstly, I would like to acknowledge KTH Royal Institute of Technology for providing me with the platform to do my master thesis at Ericsson Research. I would like to thank my academic supervisor Prof. Dr. Alexandre Proutiere for essential insights and constructive discussions about the project. I would also like to thank my thesis examiner Prof. Dr. Gabor Fodor for his guidance and for providing feedback on the thesis report.

I would like to express my deepest gratitude towards my industrial supervisors, Filippo and Ezeddin, not only for presenting me with such an interesting topic but also for their constant support and co-operation throughout the thesis. I enjoyed working together and had a great learning experience with them. I would further like to extend my thanks to my colleagues at Ericsson Research, especially my fellow master thesis students Henrik and Grigorios, for creative discussions. I am grateful to Ericsson for providing all the computational resources required to carry out the thesis work smoothly.

Finally, I would like to express my special thanks to my friends and family for their continuous support and appreciation; without them this would not have been possible.

Stockholm, January 2021
Mayank Gulati


Contents

1 Introduction
    1.1 Problem Description
    1.2 Research Questions
    1.3 Contributions
    1.4 Structure of the Thesis

2 Background
    2.1 Self-Organizing Networks
    2.2 The Reinforcement Learning Framework
        2.2.1 On-policy vs Offline Off-policy Learning
        2.2.2 Importance Sampling
        2.2.3 Model-based and Model-free Learning
        2.2.4 Reinforcement Learning Algorithms
        2.2.5 Deep Reinforcement Learning

3 Related Works
    3.1 Reinforcement Learning for Antenna Tilt Optimization
    3.2 Reinforcement Learning for Sim-to-Real
        3.2.1 The Simulation-to-Reality Paradigms
        3.2.2 Prior Arts in the Field of Sim-to-Real RL

4 Methodology
    4.1 Method Overview
    4.2 Explanation of the Architecture Blocks
    4.3 RL Framework Formulation Used in RET
    4.4 Artificial Neural Network Implementation

5 Results
    5.1 Simulation Phase Experiment
        5.1.1 Metrics in Online Learning
        5.1.2 Results in Online Learning
    5.2 Real Data Phase Experiment
        5.2.1 Metrics in Offline Learning
        5.2.2 Results in Offline Learning

6 Discussions
    6.1 Potential Future Work

7 Conclusions
    7.1 Sustainability, Ethics and Societal Impacts

Bibliography

A Additional Results


List of Figures

2.1 Illustration of the RET angle ψ
2.2 RL framework
3.1 Progressive network architecture for sim-to-real (from [27]); knowledge is shared from the simulation-based column 1 to the reality-based column 2 through dashed lateral connections, from left to right
4.1 Offline sim-to-real transfer architecture
4.2 Hybrid policy model abstract representation
4.3 Hybrid policy model ANN representation
5.1 Simulation training, evaluation and comparison of the RL agent with the rule-based agent in the first run
5.2 Simulation training, testing and comparison of the RL agent with the rule-based agent in the second run
5.3 Real data training, testing and comparison of the offline agent with the hybrid model agent having no adapter
5.4 Real data training, testing and comparison of the offline agent with the hybrid model agent having an adapter and the pre-trained simulation model from the first simulation run
5.5 Real data training, testing and comparison of the offline agent with the hybrid model agent having an adapter and the pre-trained simulation model from the second simulation run
A.1 Real data testing and comparison of the offline agent with the hybrid model agent having an adapter and the pre-trained simulation model from the first simulation run
A.2 Real data testing and comparison of the offline agent with the hybrid model agent having an adapter and the pre-trained simulation model from the first simulation run


List of Tables

4.1 Simulation training hyperparameters
4.2 Real-world training hyperparameters
6.1 Offline real data evaluation


Acronyms

3GPP    3rd Generation Partnership Project
5G      5th generation telecommunication
A2C     advantage actor critic
A3C     asynchronous advantage actor critic
AC      actor critic
ANN     artificial neural network
DA      domain adaptation
DDPG    deep deterministic policy gradient
DNN     deep neural network
DOF     degree of fire
DQN     deep Q-learning network
DR      domain randomization
DRL     deep reinforcement learning
HER     hindsight experience replay
KPI     key performance indicator
LTE     long-term evolution
MDP     Markov decision process
PPO     proximal policy optimization
RDPG    recurrent deterministic policy gradient
RET     remote electrical tilt
RL      reinforcement learning
sim-to-real     simulation-to-reality
SINR    signal to interference and noise ratio
SON     self-organizing network


Chapter 1

Introduction

There has been unprecedented growth in the demand for cellular networks due to the rapid increase in the number of mobile users and services. With the introduction of 5G technology, global mobile data traffic is expected to increase nearly five times by 2025 [1]. It is therefore crucial for wireless service providers to assure quality of service [2] and consistent coverage throughout the dedicated service region. However, even after recent advancements in cellular network technology, service providers find it extremely difficult to meet the huge demand for wireless data. Issues such as high congestion in the network and improper coverage cause major concerns, given the increasing complexity of the network and ever-increasing user mobility.

Within the broad domain of radio resource management there exists the field of cell shaping, which enables dynamic optimization of cellular networks by controlling the radio beam pattern based on the network's performance indicators. This kind of optimization approach mitigates the previously mentioned network issues. One of the principal ways to perform this optimization is to adjust the antenna tilt remotely at distinct cellular sites. This process of remote tilting is highly automated using various techniques, ranging from conventional rule-based methods [3], [4] to fairly advanced machine learning techniques based on reinforcement learning (RL) [5], [6], [7], [8], [9]. Rule-based techniques rely on handcrafted rules that cannot capture the dynamics of the environment, and hence suffer from a major setback due to their inability to adapt to a highly volatile cellular environment. RL, in contrast, provides a continual training paradigm in which a software agent continuously improves through feedback, received in the form of a reward, for every action taken in a given state. However, to the best of our knowledge, even the use of RL is largely limited to simulation environments [5], [6], [7], or only associated with post-hoc analysis through off-policy learning from offline data [9]. Therefore, neither of the two strategies considers the combination of a simulated and a real-world environment in order to achieve the optimal tilt angle. In this thesis, we analyse previous works on the use of RL for remote electrical tilt (RET) optimization and propose a framework for sim-to-real transfer learning that makes RL techniques applicable to real-world telecommunication scenarios.

1.1 Problem Description

The performance of cellular networks is highly influenced by the configuration of the antenna parameters at the network base stations. The configuration of these parameters, specifically the antenna tilt, contributes immensely to optimizing the radio network attributes in order to obtain the desired coverage, capacity and quality¹ of the wireless signal. Consequently, it results in better signal reception within a cell and reduced interference with neighbouring cells.

Automatic control of the antenna tilt is traditionally carried out with the help of rule-based or heuristic control methods. These methods have proven to be quite rigid and unable to adapt to different scenarios autonomously. Thus, they often demand manual intervention for timely maintenance and reconfiguration of the antenna parameters under various dynamic scenarios, which results in greater operational and capital expenses for network operators.

To tackle this issue, recent years have seen an explosion of RL techniques aimed at solving the RET optimization problem [5], [6], [7], [8], [9]. RL algorithms have shown superior performance compared to traditional rule-based or convex-optimization-based methods [3], [4], [11], and have also proved successful in adapting quickly to abrupt changes in the environment. Until recently, however, these superlative performances of RL agents were only attained in simulation environments. The RL algorithms fail to conform well to real-network scenarios because the simulation environment cannot capture the true dynamics of telecommunication networks and hence does not provide a true depiction of the real network.

¹ Quality refers to the extent of interference in the cellular network with respect to neighbouring cells and neighbouring sites. Higher quality corresponds to lower interference [10].


Moreover, the RL approach comes with its own inherent implication, i.e. the trade-off between exploration and exploitation. This learning approach requires taking a significant number of arbitrary actions in the exploration stage so that the RL agent can leverage these gathered experiences during the exploitation stage to finally arrive at the optimal policy. This makes it highly prone to taking unpredictable actions, which could lead to an overall degradation in performance for real networks. This is the major reason why such promising RL algorithms have not yet been deployed in real-world telecommunication applications.

1.2 Research Questions

Can we build a robust RL policy that performs better than conventional rule-based or control-based methods for controlling the antenna tilt? Furthermore, can this policy incorporate learning from both the simulation and an offline real-world dataset to establish an effective and adaptive decision-making policy?

1.3 Contributions

• Our approach copes with the simulation-to-reality problem by training an RL policy that combines learning from the simulation with learning from an offline dataset coming from the real-world environment. The outcome of such a hybrid training procedure is a policy with increased reliability and robustness towards discrepancies between the real-world and the simulation environment.

• It strikes a balance between exploration and exploitation by taking advantage of the simulation to promote exploration while favouring exploitation during real-world interaction. This aspect is very important in real-world applications in which the performance of the system under control cannot be degraded arbitrarily.

• The approach improves real-world sample efficiency, i.e. it reduces the number of real-world data samples required to learn an optimal policy, with the help of the simulation, which is used to generalize offline learning to unseen trajectories. The proposed solution utilizes offline off-policy samples to bring the policy closer to reality through the injection of real-world samples into the training procedure.

• Moreover, it allows training a policy safely, without interaction with the real environment. This is particularly important in RL applications where interaction with the real world can disrupt the performance of the real system because of uncontrolled exploration. Furthermore, the solution enhances both simulation training and offline training by integrating the two.

1.4 Structure of the Thesis

The thesis report consists of seven chapters. The current chapter explains the problem description and the research questions that we address to achieve the thesis objective. The subsequent chapters are organized as follows.

• Chapter 2 comprises the theoretical background, covering self-organizing networks (SON), the RL framework and its dichotomies. We then discuss some RL algorithms and finally touch upon deep reinforcement learning (DRL) and the motivation for using it for our problem.

• Chapter 3 reviews related works that use RL in the field of antenna tilt optimization. We then focus on simulation-to-reality paradigms and the literature pertaining to this field for RL.

• Chapter 4 provides an overview of the method used in the thesis project. We elaborate on the architecture blocks of the workflow and the specific RL formulation, followed by the neural network implementation used to execute offline transfer learning for remote electrical tilting through the hybrid policy model.

• Chapter 5 presents the results generated in two phases of experiments, covered in two separate sections: the first section covers the training and evaluation conducted in the simulation phase, while the second illustrates the training on the real, batched dataset carried out in the second phase. Along the way, we explain the different metrics used to analyse the experimental results.

• Chapter 6 summarizes the thesis work, discusses the merits and demerits of our implementation, and highlights future research avenues worth examining.

• Chapter 7 concludes the work. It is followed by a brief appendix containing slightly different offline data experiments.


Chapter 2

Background

This chapter introduces the core concepts used in the thesis. We review the fundamentals of self-organizing networks, which are widely used in radio networks. We then present the key RL terminology and algorithms, and finally motivate the use of DRL to fulfil the thesis objective.

2.1 Self-Organizing Networks

SON [12] is a network management framework introduced by the 3rd Generation Partnership Project (3GPP) that aims at configuring, optimizing, troubleshooting, and healing cellular networks autonomously. This technology helps avoid the capital and operational expenses otherwise incurred by repeated manual configuration during the deployment phase and during lifetime optimization. In [13] it is shown that, through SON, the antenna parameter configuration of tilt, frequency, and output power hugely influences the network's coverage with respect to traffic changes. The SON framework provides one of the prominent ways to perform RET optimization, i.e. to optimize the electrical antenna tilt angle remotely without being physically present at the cell site. The RET angle is defined as the angle between the main beam of the antenna pattern and the horizontal plane (see ψ in Figure 2.1).

Figure 2.1: Illustration of the RET angle ψ

2.2 The Reinforcement Learning Framework

This section covers the fundamental concepts of RL that provide the grounding needed to understand the topics of the thesis. Unless explicitly specified otherwise, these concepts are derived from Introduction to Reinforcement Learning [14].

In the context of RL, an agent interacts with a stochastic environment and observes reward feedback as a result of taking an action in a given state. Through this interaction, the agent learns a policy π, that is, a control law mapping states to probabilities over actions. At any time step t, the agent finds itself in state s_t, takes action a_t under the policy π, transitions to the next state s_{t+1}, and in turn receives a reward sample r_t as feedback from the interaction with the environment. This interaction happens in a cycle in which the agent evaluates and improves its policy π until it eventually arrives at an optimal policy π*. The RL problem is defined based on a Markov Decision Process (MDP), which models decision making in a discrete-time stochastic process. An MDP is an extension of a Markov chain: the agent's next state depends only on the current state and action, not on the history of previous states. It is described by a 5-element tuple M = (S, A, p, r, γ), where:

• S is the set of possible states that the agent can consider in the environ- ment.


• A is the set of possible actions that the agent can take at the given state s ∈ S to be able to interact with the environment.

• p : S × A × S → [0, 1] is the transition probability, describing the probability of going from state s to s′ when taking action a.

• r(s, a) is the reward function, received for being in a state s and taking action a.

• γ ∈ [0, 1] is a discount factor, that accounts for delayed rewards over time.

Figure 2.2: RL framework. At each time step the agent takes action a_t in state s_t and receives the reward r_{t+1} and the next state s_{t+1} from the environment.

The state value function V^π(s) quantifies the desirability of reaching state s over the course of the agent's interaction with the environment while following the policy π. Formally, it is the cumulative discounted reward, or expected return, the agent receives for being in state s and behaving under the policy π(a|s):

$$V^\pi(s) = \mathbb{E}_\pi\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \,\middle|\, s_t = s \right].$$

The state-action value function Q^π(s, a) represents the desirability of taking action a in state s when following policy π. In other words, it is the expected return the agent receives for taking action a in state s and thereafter behaving under the policy π(a|s):

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \,\middle|\, s_t = s,\, a_t = a \right].$$

The main objective of the RL agent is to learn an optimal policy π* that maximizes the expected cumulative reward. More formally, the optimal policy satisfies

$$\pi^* = \arg\max_{\pi} V^\pi(s).$$
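As a small numerical illustration (ours, not from the thesis), the discounted return inside these expectations can be computed from a sampled reward sequence as follows.

```python
def discounted_return(rewards, gamma):
    """Compute sum_i gamma^i * r_{t+i} for a sampled reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: discounted_return([1.0, 0.0, 2.0], 0.9) == 1.0 + 0.9*0.0 + 0.81*2.0 == 2.62
```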

2.2.1 On-policy vs Offline Off-policy Learning

One of the main dichotomies of RL methods is the distinction between on-policy and off-policy methods. The former aims at learning the same policy that is used to interact with the environment. Because the exploration happening during the learning process compromises performance, an on-policy approach can only result in a near-optimal policy rather than the optimal one.

The latter utilizes two separate policies: it aims at learning a policy (the target policy π) that is different from the one used to interact with the environment and collect trajectory samples (the behaviour policy π_0). The target policy acts as the candidate optimal policy, while the behaviour policy acts as the exploratory policy.

When, in addition to learning from samples generated by the behaviour policy, the agent has to base its learning solely on a fixed batch of data that cannot be expanded further, we speak of batch RL or offline RL. Here, the agent is deprived of the direct feedback from the environment that normally exists when interacting with it. This means that the agent needs to generalize from the fixed batch of past interactions, available in the form of a logged dataset, and eventually come up with a policy that performs well on future interactions.

2.2.2 Importance Sampling

Importance sampling allows estimating the value function of a policy π when the samples were collected using a different policy π_0. It is therefore a well-known technique in offline off-policy learning, and consists of re-weighting the returns using the importance sampling ratio, defined for a trajectory starting in state s_t and ending in state s_T as

$$\rho_{t:T} = \prod_{k=t}^{T} \frac{\pi(a_k \mid s_k)}{\pi_0(a_k \mid s_k)}.$$

This ratio compensates for the distribution mismatch between the logging policy π_0 and the learning policy π, and the policy estimator based on this technique provides an unbiased estimate of the value of a policy.


2.2.3 Model-based and Model-free Learning

The MDP makes it possible to formulate the RL problem mathematically, as it represents the dynamics of the environment using the five elements introduced in Section 2.2. In particular, it is equipped with a transition probability and a reward function. The transition probability gives the probability of moving to the next state given the current state and the action the agent takes in that state, and the reward function determines the reward associated with each state-action pair. These two functions together govern the dynamics of the environment and are therefore referred to as the model of the environment.

Model-based algorithms utilize the transition probability and the reward function in order to estimate the optimal policy. In some cases the agent might need to rely on an approximation of the transition function, while the reward function can be learnt through interaction with the environment (on-policy) or through the behaviour of another agent (off-policy). However, when only approximations of the true transition and reward functions are available, this might never lead to the optimal policy.

Model-free algorithms, in contrast, estimate the optimal policy without using a dynamical model of the environment. Here, the agent does not have access to the transition function; instead it uses the value function, or directly the policy, computed from the experience gathered by interacting with the environment.

2.2.4 Reinforcement Learning Algorithms

Q-learning

Q-learning is a value-based learning algorithm in which, for every state-action pair, the corresponding state-action value is stored in tabular form and updated according to the following rule:

$$Q^{\text{new}}(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right).$$

The major drawback of (tabular) Q-learning is that it is restricted to discrete and finite state and action spaces. This problem can be tackled by using deep neural networks (DNN) as function approximators to parameterize a continuous state space, as seen in Atari games.


This results in an effective representation of the high-dimensional sensory inputs and actions present in these games. This neural-network-based method, known as the deep Q-learning network (DQN) [15], also addresses the instabilities caused by the inter-dependency of the sequence of observations using:

• Experience Replay: To learn from past experiences and break the correlation between consecutive "experiences" (state, action, reward, next state), the experiences are cached in a replay buffer and randomly sampled to perform network updates.

• Separate Target Networks: Two networks, an online network and a target network, work in tandem to stabilize learning. The weight parameters of the target network, which is used to compute the bootstrap Q-values, are kept fixed for a while and updated less frequently than those of the primary online network; the update is done by periodically copying the weights from the primary network. A minimal sketch of these two mechanisms is given after this list.
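The following is a minimal, illustrative sketch of the two mechanisms above (our own, not the DQN implementation from [15]); `online_q` and `target_q` are assumed to be Keras-style models exposing `predict`, `get_weights` and `set_weights`.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Caches past transitions and samples them uniformly, breaking the
    correlation between consecutive experiences."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, float(done)))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(np.array, zip(*batch)))

def q_targets(batch, online_q, target_q, gamma=0.99):
    """Build r + gamma * max_a Q_target(s', a) regression targets for a batch."""
    states, actions, rewards, next_states, dones = batch
    targets = online_q.predict(states, verbose=0)
    next_max = target_q.predict(next_states, verbose=0).max(axis=1)
    targets[np.arange(len(actions)), actions.astype(int)] = (
        rewards + gamma * (1.0 - dones) * next_max
    )
    return states, targets

# The target network is refreshed periodically, e.g. every C training steps:
#   target_q.set_weights(online_q.get_weights())
```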

Policy Gradient

The policy gradient algorithm is a policy-based algorithm that optimizes the policy directly through gradient-based methods (e.g. stochastic gradient descent). The policy π_θ(a|s) is parameterized by a parameter vector θ. The objective function depends directly on the policy, and different strategies exist to optimize θ with the goal of maximizing it:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^\pi(s, a) \right].$$
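As a deliberately simple illustration of the score-function estimator above, the snippet below computes a single-sample policy gradient for a linear-softmax policy; this parameterization is an assumption made for the example only, not the policy class used in the thesis.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_sample(theta, features, action, q_value):
    """Single-sample estimate of grad_theta log pi_theta(a|s) * Q^pi(s, a)
    for pi_theta(a|s) proportional to exp(theta[a] @ features).
    theta: (n_actions, n_features), features: (n_features,)."""
    probs = softmax(theta @ features)
    grad_log_pi = -np.outer(probs, features)   # -pi(k|s) * phi(s) for every action k
    grad_log_pi[action] += features            # +phi(s) on the taken action's row
    return grad_log_pi * q_value
```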

Actor-Critic Algorithms

The actor critic (AC) family is a class of RL algorithms that combines policy-based and value-based approaches. The critic evaluates the policy using the value function, which in turn provides a better estimate for the actor, so that the actor can update the policy gradient in the direction suggested by the critic. A variant of this method, known as advantage actor critic (A2C), uses the advantage function in place of the value function; the advantage captures the temporal difference error and helps reduce the variance of the actor's policy updates.

To cope with high-dimensional state and action spaces, these algorithms are most commonly implemented using function approximators such as neural networks: the actor network, parameterized by θ, tries to learn the best possible action for a given state, while the critic network, parameterized by φ, evaluates the action based on the interaction with the environment and provides feedback to the actor network.

Value-function-based gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, V^\pi_\phi(s) \right],$$

Advantage function:

$$A^\pi_\phi(s, a) = r(s, a) + \gamma V^\pi_\phi(s') - V^\pi_\phi(s),$$

Advantage-function-based gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, A^\pi_\phi(s, a) \right].$$

2.2.5 Deep Reinforcement Learning

DRL blends artificial neural networks (ANN) with the RL framework: the neural networks empower the agent to learn the best possible action for a given state through interaction with the (virtual) environment in order to accomplish its goal. DRL addresses the curse-of-dimensionality problem of classical RL algorithms by using a neural network representation. When the state and action spaces become so large that it is no longer possible to store the state-action pairs in tabular form, neural networks are used as function approximators to estimate the value function.
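For illustration, such a value-function approximator can be a small fully connected network mapping a state vector to one Q-value per action. The sketch below is generic; the layer sizes, optimizer and loss are our assumptions, not the thesis configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_q_network(state_dim, n_actions, hidden=(64, 64)):
    """Fully connected approximator of Q(s, .) with one output per action."""
    net = models.Sequential()
    net.add(layers.Dense(hidden[0], activation="relu", input_shape=(state_dim,)))
    for width in hidden[1:]:
        net.add(layers.Dense(width, activation="relu"))
    net.add(layers.Dense(n_actions))                       # linear Q-value outputs
    net.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return net
```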


Chapter 3

Related Works

This chapter covers related works on RL in the field of telecommunications for down-tilt optimization, as well as contemporary works on simulation-to-reality transfer learning in RL.

3.1 Reinforcement Learning for Antenna Tilt Optimization

The use of RL is prevalent in various technical fields, and the telecommunication industry is no exception. In the past, researchers have applied RL primarily to optimize coverage and capacity in cellular networks. In [16], the authors use overlapping fuzzy membership functions to resolve the issue of handling continuous state and action spaces in the RL formulation. The reward function is designed to maximize the spectral efficiency¹ of mean users and edge users (at the cell boundary) using Q-learning, in order to achieve tilt angle optimization in a long-term evolution (LTE) simulation setup.

A similar fuzzy Q-learning approach is employed in [17], where the state space contains the down-tilt angle and the mean and edge spectral efficiency of the cellular network, and the reward function aims at improving the subsequent combined spectral efficiency. This work highlights the effects of simultaneous actions of neighbouring cells in the network and proposes a cluster-based strategy for effective cooperative learning in the simulation environment.

¹ The spectral efficiency refers to the rate of information transmitted over a given bandwidth in a specific communication system, interpreted as a measure of utilization of a defined frequency spectrum.

The work in [6] jointly optimizes antenna tilt and power to control coverage and capacity. It uses a fuzzy neural network combined with Q-learning on a hybrid SON architecture, which involves distributed learning of the self-optimizing parameters and centralized learning of the network management system for homogeneous cells in a simulation environment.

Recent work [7] uses DRL to optimize antenna tilt and the vertical and horizontal half-power beam widths in a heterogeneous cellular network setting, in a simulation environment. In order to tackle the scalability issues of multi-agent methods and the sub-optimality of a single agent, the authors utilize a hybrid method that combines an offline multi-agent solution with an online single agent. In offline multi-agent learning, each cell learns the cumulative behaviour of the neighbouring macro-cells for every state-action pair using a mean-field RL algorithm. This learning phase transfers information to the second phase, consisting of online Q-learning with a single agent using a DNN as function approximator. Here, the state space is formulated from the quantized and averaged signal to interference and noise ratio (SINR), and the reward function is also proportional to the SINR value.

3.2 Reinforcement Learning for Sim-to-Real

3.2.1 The Simulation-to-Reality Paradigms

Simulators provide an effective infrastructure to train RL policies without concerns about performance disruption caused by uncontrolled exploration, which represents one of the main limitations of training RL algorithms in real-world environments. However, it can be challenging to build reliable simulation models due to inherent modeling errors or uncontrollable stochastic effects of the real-world environment.

The discrepancy between the simulator model and the physical world is known as the simulation-to-reality (sim-to-real) gap. Different techniques have been explored in the literature to bridge this gap [18], [19], [20]. Two of the most prominent are:

• Domain Adaptation (DA): starting from a model trained on a task in a given domain (the source domain), adapt it to another domain (the target domain). In the context of sim-to-real, DA consists of updating the simulation parameters to bring them closer to the real-world observations.

• Domain Randomization (DR): an ample number of random variations are generated in the simulation environment, and the model is trained across these randomized settings. This makes the trained model robust to changes in conditions in the real-world environment.

3.2.2 Prior Arts in the Field of Sim-to-Real RL

Most of the sim-to-real RL literature is in the domain of robotics. In the following, we illustrate prior art on sim-to-real transfer techniques in the field of RL.

In [18], DR is carried out by randomizing the dynamics of a robotic arm simulator with the intention of developing policies that robustly adapt to the real-world robot dynamics. Furthermore, the authors use the universal policy formulation [21], in which a different goal g ∈ G is presented in every episode, resulting in a policy π(a|s, g) with an extra input g; the reward function is also conditioned on these episodic goals and becomes r(s_t, a_t, g). Hindsight experience replay (HER) [22] is introduced to exploit the sparse-reward setting by analyzing whether the goal g is satisfied at any point along the trajectory. The HER-based off-policy actor-critic algorithm uses the recurrent deterministic policy gradient (RDPG) [23] (a variant of deep deterministic policy gradient (DDPG) [24]) with a recurrent universal value function Q(s_t, a_t, y_t, g, µ), where y_t = y(h_t) is the value function's internal memory over the history of past states and actions h_t = (a_{t-1}, s_{t-1}, a_{t-2}, s_{t-2}, ...); this memory-based value function is only used during the training phase. The value function is also provided with an additional input µ that incorporates the random dynamics of the simulator.

In [19], to fully leverage learning in the simulator, the authors develop an asymmetric actor-critic approach in which the critic is trained on fully observed states while the actor has to contend with partially observed rendered images. By using DR in this asymmetric-input framework, they accomplish sim-to-real transfer without any need for training with real-world data. Similar to [18], this work also applies the off-policy DDPG [24] actor-critic setup along with HER [22] and the multi-goal RL formulation [21].

Here, the replay buffer is populated with previous episodic experiences (s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1}, g, g_o), where at a given time step t, s_t corresponds to the fully observed simulator state and o_t to the partially observed rendered image; similarly, g and g_o denote the fully observed episodic goal and its partially observed rendered counterpart, respectively. To effectively transfer policies to real-world scenarios, DR is employed by randomizing visual aspects, including texture, lighting, camera positions, and depth, when rendering scenes from the simulator.

Although [18] and [19] use off-policy learning techniques that reuse segments of past experience in the form of an experience replay, they do not consider a fully offline learning setting in which the agent can no longer interact with the environment: these setups assume online learning both in simulation and in the real-world setting.

The work in [20] utilizes both sim-to-real transfer paradigms, DR and DA: instead of configuring the randomization of the simulations manually, as in conventional DR, it performs DR with the help of DA by adjusting the simulation parameter distribution using a few real-world roll-outs interleaved with policy-based training. As the simulator is deterministic in nature, the authors purposely make it probabilistic by introducing a distribution over simulation parameters ξ ∼ p_φ(ξ), parameterized by φ. This captures the probabilistic dynamics of the simulator, P_{ξ∼p_φ} = P(s_{t+1} | s_t, a_t, ξ), and helps obtain generalized policies by finding the simulation parameter distribution p_φ(ξ) following the DR strategy, so as to achieve

$$\max_\theta \; \mathbb{E}_{P_{\xi \sim p_\phi}}\left[ \mathbb{E}_{\pi_\theta} R(\tau) \right].$$

The core of this approach is to reduce the discrepancy D between the real-world observed trajectories τ^{ob}_{real} = (o_{0,real}, ..., o_{T,real}) and the simulated observed trajectories τ^{ob}_ξ = (o_{0,ξ}, ..., o_{T,ξ}) by minimizing the objective

$$\min_\phi \; \mathbb{E}_{P_{\xi \sim p_\phi}}\left[ \mathbb{E}_{\pi_\theta, p_\phi} D\!\left(\tau^{ob}_\xi, \tau^{ob}_{real}\right) \right].$$

The authors use the KL-divergence between the old simulation distribution p_{φ_i} and the updated distribution p_{φ_{i+1}} to ensure that it does not leave the trust region of the policy π_{θ,p_{φ_i}}, using the policy-based algorithm proximal policy optimization (PPO) [25], which enables robust gradient updates in mini-batches, in contrast to the classic stochastic gradient descent used in conventional policy gradients. The authors demonstrate an impressive approach to using real-world roll-outs; however, the approach relies heavily on the continual integration of these roll-outs after a specified number of iterations of simulation training. This means that in one training cycle the RL agent is trained in simulation and then verifies the recently trained policy in the real environment through active interaction.

Progressive Networks [26] are ANNs designed for transferring knowledge across a sequence of machine learning tasks. The approach trains neural networks on subsequent tasks while utilizing the outputs of layers that were trained on prior tasks, so that a model pre-trained on one task can benefit the training of other tasks through transfer learning. This method works equally well across simulated and real experiments, as employed by the authors of [27] for robotic arm tasks; the model architecture can be seen in Figure 3.1.

During the training phase, the first column is trained in simulation using asynchronous advantage actor critic (A3C) [28], an accelerated form of the A2C algorithm that uses a multi-threaded training approach: a global network combines the training of multiple independent agents, each with its own trained parameters (weights), which interact with different copies of the environment in parallel in order to learn more efficiently. For the real environment, training is carried out by a single-threaded A2C agent in column 2 of the network, provided with visual input from the real-world robotic manipulator, while utilizing the simulation-trained parameters transferred through the dashed lateral (oblique) connections.

Even though their work uses parameters trained in simulation to aid real-world training, its scope is limited to online learning, where the RL agent can actively interact with the environment both in simulation and in the real world.


Figure 3.1: Progressive network architecture for sim-to-real, from [27]. Knowledge is shared from the simulation-based column 1 to the reality-based column 2 through the dashed lateral connections, from left to right.


Chapter 4

Methodology

This chapter illustrates the methodology used to implement the thesis objective. It starts with a brief overview of the procedural blocks used in the sim-to-real pipeline, followed by their detailed explanation. Subsequently, we elaborate on the RL formulation used for the RET use-case, followed by the neural network implementation of the hybrid policy model; finally, we cover the training and testing routines used in the simulation and with the real-world dataset.

4.1 Method Overview

Offline RL sim-to-real transfer architecture: The method consists of an architecture to train an RL policy by combining sim-to-real and offline off-policy learning techniques. The resulting offline sim-to-real architecture, depicted in Figure 4.1, is composed of the following components:

Figure 4.1: Offline sim-to-real transfer architecture (blocks (a)-(f) described below).

(a) The cellular network simulator consists of an imperfect simulation model of the true real-world environment. The simulator provides an online telecommunication environment with which the RL agent interacts to obtain an optimal policy well suited to the simulator. The simulation interface is equipped with the DR technique, so that by changing the simulation parameters we can capture a variety of randomized environment conditions.

(b) The offline real-world dataset consists of recorded observations from the real-world environment, collected according to a logging policy (e.g. a policy that was previously operating in the real-world system and has logged its interactions, or an ad-hoc logging policy optimized to explore the state-action space) from different telecommunication operators. This results in accumulated observations in the form of states, actions, and rewards obtained by following the actions governed by this policy.

(c) The hybrid policy model is a parametric machine learning model using an ANN as function approximator; we denote its parameters by θ. The model parameter vector is composed of a portion trained in simulation, represented by θ_s, and a portion trained in the real world, represented by θ_r, i.e. θ = θ_s ∪ θ_r. This is the enabling element for integrating training elements from the simulation environment and from the real-world data.

(d) The simulation interface training block trains the simulation parameters θ_s of the hybrid policy model in the simulation environment. Training utilizes DR so as to obtain a robust policy that works across various environmental settings. The simulation interface training block outputs the trained simulation weights, which we denote by θ_s^*.

(e) The sim-to-real parameter sharing happens by freezing and sharing a subset of the trained simulation parameters, θ_shared ⊆ θ_s^*, with the real-world model portion θ_r. The real-world training parameters θ_r and θ_shared are arranged so that their outputs (from the sim and real portions) at each layer can be combined by summation when the hybrid model is fed with input from the real-world dataset.

(f) The real-world training block trains the real-world parameters θ_r of the hybrid policy using off-policy offline learning with the help of the importance sampling technique. This training exploits what was learnt in the simulator through the shared parameters θ_shared, merging the outputs of θ_shared and θ_r at each training step. The output of the real-world training is denoted by θ_r^*. Inference is executed by the portion of the model containing both θ_shared and θ_r^*, which we denote by θ^* = θ_shared ∪ θ_r^* (a minimal code sketch of this arrangement is given after this list).
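The sketch below is a minimal Keras illustration of blocks (c), (e) and (f): a frozen, simulation-trained branch whose features are merged into a trainable real-world branch by summation. Layer names, sizes and the weight-loading helper are simplified assumptions, not the thesis code.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_hybrid_policy(n_inputs=4, n_actions=3, sim_weights=None):
    """Two-branch hybrid model: simulation portion theta_s (gradient-stopped,
    hence frozen) merged by summation into the trainable real portion theta_r."""
    inp = layers.Input(shape=(n_inputs,))

    # Simulation portion theta_s (pre-trained online; kept frozen here).
    sim = layers.Dense(100, activation="relu", name="sim_fc0")(inp)
    sim = layers.Dense(50, activation="relu", name="sim_fc1")(sim)
    shared = layers.Lambda(tf.stop_gradient, name="theta_shared")(sim)

    # Real-world portion theta_r, trained offline on the logged dataset.
    real = layers.Dense(50, activation="relu", name="real_fc")(inp)

    # Merge the two representations by summing their layer outputs.
    merged = layers.Add(name="merge")([real, shared])
    out = layers.Dense(n_actions, name="q_values")(merged)

    model = Model(inp, out)
    if sim_weights is not None:
        # Hypothetical loading of theta_s from the simulation phase:
        # sim_weights[i] is expected to be a [kernel, bias] pair.
        model.get_layer("sim_fc0").set_weights(sim_weights[0])
        model.get_layer("sim_fc1").set_weights(sim_weights[1])
    return model
```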

4.2 Explanation of the Architecture Blocks

• Network Simulator: The network simulator is a discrete-event simulation model, meaning that at any particular instant in time a single discrete event occurs that indicates a change in the state of the system. It is a mathematical model designed to take tilt angles as input and return network characteristics in the form of key performance indicators (KPI) as output. These KPIs describe the coverage, capacity, and quality of the network signal. In our experiments we use degree of fire (DOF) KPIs, where the DOF denotes alarming KPIs that express the extent of severity with respect to coverage, capacity and quality in the cell.

The simulator contains a collection of adjustable parameters which can be modified in order to emulate a variety of environmental scenarios. The set of adjustable scalar parameters, which we denote by µ, includes, but is not limited to, the antenna height, minimum and maximum antenna tilt, signal frequency, traffic per cell, and distance between adjacent base stations. These parameters µ are randomized using DR to enhance the training process and obtain a robust policy.


• Offline real-world dataset: We use a pre-collected offline dataset gathered by following one (or more) logging policies from different telecommunication operators. The logging policy comes from a rule-based agent, which acts as the behaviour policy for our off-policy learning method.

The collected dataset is pre-processed into a format suitable for training. This pre-processing routine involves data preparation, data cleaning, feature extraction, feature construction, etc. To execute offline off-policy learning on the real-world data through the importance sampling technique, the inverse propensity scores¹ are computed through logistic regression. Due to a huge imbalance in the dataset with respect to the three action classes {−1, 0, 1} mentioned in Section 4.3, we balance the dataset by down-sampling the most frequent action class and up-sampling the least frequent action class. Finally, the processed dataset containing 45,000 data samples is split (70-30%) into a training set (31,500 samples) and a test set (13,500 samples).

The dataset contains information about the telecommunication network characteristics, namely signal reception within each cell and interference with the neighbouring cells at the network base station. This information is expressed in the form of KPIs, namely the coverage, capacity and quality DOF, together with the antenna tilt angle, over different days and from geographic locations similar to the one modelled in the simulator. Each data sample in the processed dataset consists of these DOF KPIs and the actual antenna tilt angle (in degrees) of the current day and the following day. The data samples are independently and identically distributed, meaning that they are drawn from the same probability distribution and are mutually independent of one another.

¹ Inverse propensity scoring is a statistical tool for counterfactual inference that gives more weight to informative behavioural data with a low probability of occurrence under a given data distribution.

• Hybrid policy model: We take inspiration from the ability of Progressive Networks [26], [27] to share knowledge, in the form of shared parameters, by first training in simulation and then in reality, using the shared parameters to generalize the learning from one domain to the other. We extrapolate this idea, originally used on robotic manipulators with visual input, to the antenna tilt optimization use-case, using as numerical inputs the aforementioned KPIs that govern the characteristics of the telecommunication signal; we call the resulting method the hybrid policy model.

The hybrid model described in item (c) is instantiated here. Its two portions are trained using two separate loss functions but share the same RL framework formulation and algorithm. We use a value-function-based learning algorithm with an ANN function approximator, since the continuous state space can be parameterized by a set of parameters θ. The structure of the simulation portion and of the real-world portion, i.e. the number of neural network layers and the number of neurons in each layer, may differ between the two. Ideally, the simulation model portion has more trainable parameters than the offline real-world model portion, |θ_s| > |θ_r|, since online simulation training is exposed to a large variety of environmental settings and tries to optimize across all of them. The exact RL formulation and neural network representation are covered in the subsequent sections.

• Online Simulation Training: The simulation portion parameters θ_s of the hybrid model are trained in the simulation environment under different randomized settings µ using the DR approach. DR is done by extracting the ranges of values of the adjustable parameters µ from the offline real-world dataset and varying these values to simulate similar conditions in the simulation setup.

• Offline Real-world Training: The real-world portion parameters θ_r of the hybrid model are trained on the real-world offline dataset utilizing the shared pre-trained parameters θ_shared from simulation training. In offline off-policy learning, the gradient update of the training parameters θ_r uses the importance sampling ratio ρ_t, in the form of the propensities computed for each data sample during data processing, with the following update rule:

$$\theta^r_{t+1} \leftarrow \theta^r_t + \alpha\, \rho_t\, \delta_t\, \nabla \hat{v}(s_t, \theta^r_t),$$

where α is the learning rate, δ_t = r_{t+1} + γ v̂(s_{t+1}, θ^r_t) − v̂(s_t, θ^r_t) is the temporal difference error with discount factor γ, ∇v̂(s_t, θ^r_t) is the gradient of the value function with respect to θ^r_t, and ρ_t is the importance sampling ratio at time step t. After training the hybrid model in both the simulation and the real-world environment, it is evaluated on the test dataset by offline off-policy evaluation, using the importance sampling technique with the previously computed propensities. (A sketch of the propensity computation and of this importance-weighted update is given after this list.)
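The sketch below illustrates, under our own assumptions, the two steps just described: estimating the logging-policy propensities with logistic regression and performing one importance-weighted (inverse-propensity) regression update of the real-world portion for the γ = 0 case. The scikit-learn and Keras APIs are an illustrative choice; the thesis does not specify its exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_propensities(states, actions):
    """Estimate the logging policy pi_0(a|s) with logistic regression and
    return the propensity of each logged action (illustrative choice)."""
    clf = LogisticRegression(max_iter=1000).fit(states, actions)
    probs = clf.predict_proba(states)                       # shape (N, n_actions)
    cols = np.searchsorted(clf.classes_, actions)           # map action labels to columns
    return probs[np.arange(len(actions)), cols]

def offline_update(model, states, actions, rewards, propensities):
    """One importance-weighted regression step on a logged batch.

    With gamma = 0 the regression target is just the observed reward, and the
    per-sample loss is divided by the propensity (cf. Algorithm 2).
    `model` is assumed to be a compiled Keras model with one output per action.
    """
    idx = (actions + 1).astype(int)     # map tilt actions {-1, 0, 1} to columns {0, 1, 2}
    targets = model.predict(states, verbose=0)
    targets[np.arange(len(idx)), idx] = rewards
    model.fit(states, targets, sample_weight=1.0 / propensities,
              epochs=1, verbose=0)
```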

4.3 RL Framework Formulation Used in RET

The RL framework formulation includes the following elements²:

• State: includes a set of measured KPIs collected at the cell, cluster, or network level, depending on the implementation. In the simplest case the state contains recorded KPIs at the cell level for every cell, and the policy is trained across different cells independently. The state at time t is the vector

$$s_t = [c_{DOF}(t),\, q_{DOF}(t),\, \psi(t)] \in [0, 1] \times [0, 1] \times [\psi_{min}, \psi_{max}],$$

where ψ(t) is the current vertical tilt angle, and c_DOF(t), q_DOF(t) are measures of the alarming KPIs of coverage and quality in the cell.

• Action: comprises a discrete tilt variation from the current vertical antenna tilt angle: an at-most unit-step change consisting of up-tilting, no change, or down-tilting, a_t ∈ {−1, 0, 1} respectively. The tilt at the next step is deterministically computed as

$$\psi(t + 1) = \psi(t) + a_t.$$

• Reward: a function of the observed KPIs that describe the service quality perceived by the user equipment. We consider the change ∆_DOF between the current alarming KPIs c_DOF(t), q_DOF(t) and the next alarming KPI values c_DOF(t+1), q_DOF(t+1), which defines the reward function r(s_t, a_t):

$$\Delta_{DOF} = -\Big[\big(c_{DOF}(t+1) + q_{DOF}(t+1)\big) - \big(c_{DOF}(t) + q_{DOF}(t)\big)\Big],$$

$$r(s_t, a_t) = \begin{cases} \exp\!\left(|\Delta_{DOF}|\right) & \Delta_{DOF} \geq 0 \\ -\exp\!\left(|\Delta_{DOF}|\right) & \Delta_{DOF} < 0 \end{cases}$$

² The transition probabilities and the reward distribution are assumed to be unknown.
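As a small worked example of this reward, the function below implements the two formulas above directly (our illustration).

```python
import math

def ret_reward(c_dof_t, q_dof_t, c_dof_next, q_dof_next):
    """Reward from Section 4.3: positive (exponential in the improvement) when
    the summed coverage/quality DOF KPIs decrease, negative otherwise."""
    delta_dof = -((c_dof_next + q_dof_next) - (c_dof_t + q_dof_t))
    magnitude = math.exp(abs(delta_dof))
    return magnitude if delta_dof >= 0 else -magnitude

# Example: coverage and quality DOF both drop by 0.1 -> delta_dof = 0.2 -> reward ~ +1.22
```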


Figure 4.2: Hybrid policy model abstract representation. The left part depicts online learning, with the agent interacting with the simulator, and the right part depicts offline off-policy learning, with the agent learning from the real-world dataset.


4.4 Artificial Neural Network Implementation

The hybrid policy model is a neural-network-based function approximator (see Figure 4.3) used to estimate a continuous value function by means of regression³; it is trained in two phases. The output of the model represents the value function for all possible actions in the given state.

– Online simulation training phase: We employ Fully Connected layers, i.e. an ANN topology in which every neuron in the current layer is connected to every neuron in the next layer, to estimate the state-action value function. We consider the simple case with discount factor γ = 0, which means the value function becomes myopic and equal to the immediate reward; this is also known as a Contextual Bandit problem⁴. In the simulation, the agent interacts with the environment for a specified number of steps (see the hyperparameters in Table 4.1) and receives the reward for the action taken in each state. We also consider shorter episodes (where the end of an episode is marked by the done state) to closely replicate the behaviour of the data used in the real-world dataset training. The detailed hyperparameters used in simulation training are listed in Table 4.1. The ANN-based Fully Connected layers learn this reward by following the simulation training in Algorithm 1.

Once the simulation model portion converges, which is marked by convergence of the average reward of the training agent (Figure 5.1 (a) and Figure 5.2 (a)), the parameters of this model portion are frozen and a subset of these trained parameters, θ_shared (Fully Connected:0 and Fully Connected:1 in Figure 4.3), is shared with the real-world model used in the second training phase. Freezing of θ_s is achieved by stopping the gradient updates during back-propagation, with the help of custom Lambda layers (Lambda:0, Lambda:1 in Figure 4.3).

– Offline real-world training phase: Similar to the simulation training, the real model portion also estimates the myopic value function, i.e. the reward, using regression. Instead of online interaction with the environment, this training utilizes the recorded dataset derived from the logging policy. Since this is offline off-policy learning, we use the importance sampling method in the form of propensity weighting; the propensities are computed during data processing, as explained in Section 4.2. During training, the squared error between the true reward and the predicted reward is divided by the propensity to account for the distribution mismatch in the fully off-policy RL setting. Algorithm 2 covers the training process.

We train the real model portion parameters θ_r by utilizing the pre-trained, shared, frozen parameters θ_shared. Similar to [26], we also employ adapters, projection layers, and lateral connection layers. Adapters, represented by Alpha:0 and Alpha:1 in Figure 4.3, are learnable scalar parameters that adjust for the different scales of the inputs coming from the Lambda layers. Projection layers (Fully Connected:4, Fully Connected:6) take input from the higher dimensions and project it to lower dimensions, while the lateral connection layers (Fully Connected:5, Fully Connected:7) have the same dimensions as layers Fully Connected:9 and Fully Connected:10, respectively; these lateral connection layers are responsible for maintaining an identical composition of the shared parameters. (A code sketch of this adapter, projection and lateral-connection wiring is given after Figure 4.3.)

³ Regression is a predictive modeling technique used to study the relationship between a dependent variable (e.g. the state-action value function) and independent variable(s) (e.g. state, action), where the dependent variable is continuous while the independent variable(s) can be continuous or discrete.

⁴ In a Contextual Bandit problem, an agent receives rewards for the actions it takes over a sequence of turns. In every turn, the agent decides on an action considering the context (or state) of the current turn and the feedback received through rewards in previous turns.

Figure 4.3: Hybrid policy model ANN representation. The simulation column θ_s consists of layers Fully Connected:0-3; the frozen shared subset θ_shared (Fully Connected:0-1) feeds the real-world column θ_r (Fully Connected:8-11) through Lambda:0-1, the Alpha:0-1 adapters, the projection layers (Fully Connected:4, 6), the lateral connections (Fully Connected:5, 7) and the Add:0-1 merges.
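To complement the figure, the snippet below sketches in Keras how one such shared layer could be wired into the real-world column: stop-gradient Lambda, scalar Alpha adapter, projection, lateral connection, and element-wise Add. It is our reconstruction of the layer roles described above, not the thesis implementation, and the layer widths follow Table 4.2 only loosely.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ScalarAdapter(layers.Layer):
    """Learnable scalar (the 'Alpha' blocks) rescaling the frozen simulation
    activations before they enter the real-world column."""
    def build(self, input_shape):
        self.alpha = self.add_weight(name="alpha", shape=(),
                                     initializer="ones", trainable=True)

    def call(self, x):
        return self.alpha * x

def lateral_block(frozen_sim_features, real_features, width):
    """Wire one shared simulation layer into the real-world column.

    `real_features` must have last dimension `width` so the Add is valid.
    """
    shared = layers.Lambda(tf.stop_gradient)(frozen_sim_features)   # Lambda:i
    scaled = ScalarAdapter()(shared)                                # Alpha:i
    projected = layers.Dense(5, activation="relu")(scaled)          # projection layer
    lateral = layers.Dense(width, activation="relu")(projected)     # lateral connection
    return layers.Add()([real_features, lateral])                   # Add:i
```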


Algorithm 1: Simulation training

function getNextAction(State, prevAction)
    Qvalues ← modelPrediction(State, prevAction)
    Action ← epsilonGreedyPolicy(Qvalues)
    return Action
end function

function singleTrainingStep(State, Action)
    take Action in the environment and observe Reward
    Qvalues ← modelPrediction(State, Action)
    x ← Qvalues[Action]
    y ← Reward
    perform a gradient descent step on the mean squared error of (x, y)
end function

function SimulationTraining
    for each training step do
        if step = 0 or done then
            prevAction ← random action
        end if
        Action ← getNextAction(State, prevAction)
        prevAction ← Action
        singleTrainingStep(State, Action)
    end for
end function


Algorithm 2: Real-world data training and testing

function Trainer(State, Action, Reward, Propensity)
    Qvalues ← modelPrediction(State, Action)
    x ← Qvalues[Action]
    y ← Reward
    Loss ← |x − y| / Propensity
    update the model parameters to minimize Loss
end function

function Tester(test set of (State, Action, Propensity))
    AccumulatedRew ← 0
    matches ← 0
    for each (State, Action, Propensity) in the test set do
        Qvalues ← modelPrediction(State, Action)
        PredictedAction ← arg max_a Qvalues[a]
        if PredictedAction = Action then
            matches ← matches + 1
            x ← Qvalues[Action]
            AccumulatedRew ← AccumulatedRew + x / Propensity
        end if
    end for
    Performance ← AccumulatedRew / matches
end function
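For reference, a direct NumPy transcription of the Tester's propensity-weighted evaluation could look as follows (our sketch; `model` is assumed to be a Keras-style model, and logged tilt actions in {−1, 0, 1} are mapped to output columns {0, 1, 2}).

```python
import numpy as np

def offline_evaluation(model, states, actions, propensities):
    """Average Q-value / propensity over samples where the greedy action of
    the trained model matches the logged action (cf. Algorithm 2, Tester)."""
    q_values = model.predict(states, verbose=0)
    greedy = q_values.argmax(axis=1)
    idx = (actions + 1).astype(int)
    matches = greedy == idx
    if not matches.any():
        return 0.0
    chosen_q = q_values[np.arange(len(idx)), idx]
    return float(np.sum(chosen_q[matches] / propensities[matches]) / matches.sum())
```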


Hyperparameter                                      Value
Input size: [State, Action]                         4
Hidden layer 1 size: Fully Connected:0              100
Hidden layer 2 size: Fully Connected:1              50
Hidden layer 3 size: Fully Connected:2              20
Output layer size: Fully Connected:3                3
Discount factor (γ)                                 0
Episode length                                      50
Training steps                                      3000
Initial ε for the ε-greedy policy                   1
Decay of ε for the ε-greedy policy                  0.998
Learning rate                                       0.001

Table 4.1: Simulation training hyperparameters.

Hyperparameter                                      Value
Input size: [State, Action]                         4
Lambda layer size: Lambda:0                         100
Lambda layer size: Lambda:1                         50
Alpha layer size: Alpha:0                           1
Alpha layer size: Alpha:1                           1
Projection layer size: Fully Connected:4            5
Projection layer size: Fully Connected:6            5
Lateral connection layer size: Fully Connected:5    5
Lateral connection layer size: Fully Connected:7    5
Hidden layer 1 size: Fully Connected:8              10
Hidden layer 2 size: Fully Connected:9              5
Hidden layer 3 size: Fully Connected:10             5
Output layer size: Fully Connected:11               3
Discount factor (γ)                                 0
Batch size                                          1024
Number of epochs                                    500
Learning rate                                       0.001

Table 4.2: Real-world training hyperparameters.
