
Bandit Algorithms for Adaptive Modulation and Coding in Wireless Networks

ROMAIN DEFFAYET

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Bandit Algorithms for Adaptive Modulation and Coding in Wireless Networks

ROMAIN DEFFAYET

Master in Machine Learning
Date: July 9, 2020

Supervisor: György Dán
Examiner: Rolf Stadler

School of Electrical Engineering and Computer Science
Host company: Huawei Technologies Sweden AB

Swedish title: Bandit Algoritmer för Adaptiv Modulering och Kodning i trådlösa nätverk


Acknowledgements

I would like to express my appreciation to the people who helped me make this work possible:

First and foremost to Stojan Denic, who was my direct supervisor at the company and provided tremendous technical and moral support. Then to Damian Kolmas, senior engineer at the company, who guided me in the field of Wireless Networks and provided much appreciated insights and discussions. Then to Gunnar Peters, my direct manager, who trusted in me and made this possible.

On KTH’s side, to my supervisor György Dán, who guided me towards the completion of my thesis. Finally, my appreciation extends to all my colleagues who helped me to varying degrees during my time here.


Abstract

The demand for quality cellular network coverage has increased significantly in recent years and will continue to grow in the near future. This results from an increase in the amount of transmitted data, driven by new use cases (HD video, live streaming, online games, ...), but also from a diversification of the traffic, notably shorter and more frequent transmissions due to IOT devices and other telemetry applications. Cellular networks are becoming increasingly complex, and the need for better management of the network's properties is higher than ever. The combined effect of these two trends creates a trade-off: one would like to design algorithms that achieve high-performance decision-making, but one would also like them to do so in any setting that can be encountered in this complex network.

Instead, this thesis proposes to restrict the scope of the decision-making algorithms through on-line learning. The thesis focuses on initial MCS selection in Adaptive Modulation and Coding, in which one must choose an initial transmission rate that guarantees fast communication and a low error rate. We formulate the problem as a Reinforcement Learning problem and propose relevant restrictions to simpler frameworks such as Multi-Armed Bandits and Contextual Bandits. Eight bandit algorithms are tested and reviewed with emphasis on practical applications.

The thesis shows that a Reinforcement Learning agent can improve the utilization of the link capacity between the transmitter and the receiver. First, we present a cell-wide Multi-Armed Bandit agent, which learns the optimal initial offset in a given cell, and then a contextual augmentation of this agent that takes user-specific features as input. Under burst traffic, the proposed method achieves an 8% increase of the median throughput and a 65% reduction of the median regret in the first 0.5 s of transmission, compared to a fixed baseline.


Sammanfattning

Efterfrågan på mobilnät av hög kvalitet har ökat mycket de senaste åren och kommer att fortsätta öka under en nära framtid. Detta är resultatet av en ökad mängd trafik på grund av nya användningsfall (HD-videor, live streaming, onlinespel, ...) men kommer också från en diversifiering av trafiken, i synnerhet på grund av kortare och mer frekventa sändningar vilka kan vara på grund av IOT-enheter eller andra telemetri-applikationer. Mobilnätet blir allt komplexare och behovet av bättre hantering av nätverkets egenskaper är högre än någonsin.

Den kombinerade effekten av dessa två paradigmer skapar en avvägning: medan man vill utforma algoritmer som uppnår mycket hög prestanda vid beslutsfattning, skulle man också vilja att algoritmerna kan göra det i alla konfigurationer som kan uppstå i detta komplexa nätverk.

Istället föreslår denna avhandling att begränsa omfattningen av beslutsalgoritmerna genom att introducera online-inlärning. Avhandlingen fokuserar på första MCS-valet i Adaptiv Modulering och Kodning, där man måste välja en initial överföringshastighet som garanterar snabb kommunikation och minsta möjliga transmissionsfel. Vi formulerar problemet som ett Reinforcement Learning-problem och föreslår relevanta begränsningar för matematiskt enklare ramverk som Multi-Armed Bandits och Contextual Bandits. Åtta banditalgoritmer testas och granskas med hänsyn till praktisk tillämpning.

Avhandlingen visar att en Reinforcement Learning-agent kan förbättra användningen av länkkapaciteten mellan sändare och mottagare. Först presenterar vi en Multi-Armed Bandit-agent på cellnivå, som lär sig den optimala initiala MCSen i en given cell, och sedan en kontextuell utvidgning av denna agent med användarspecifika funktioner. Den föreslagna metoden uppnår en åttaprocentig (8%) ökning av medianhastigheten och en sextiofemprocentig (65%) minskning av medianångern vid skurvis trafik under de första 0,5 s av transmissionen, jämfört med ett fast referensvärde.


Contents

1 Introduction
2 Background
  2.1 AMC in Wireless networks
    2.1.1 MCS and BLER
    2.1.2 SINR and CQI
    2.1.3 ILLA and OLLA
  2.2 Reinforcement Learning
    2.2.1 Markov Decision Process
    2.2.2 Model-free RL
    2.2.3 Function Approximation
    2.2.4 Sample efficiency and exploration
    2.2.5 From bandits to RL
    2.2.6 Bandit algorithms
    2.2.7 ε-greedy algorithms
  2.3 Thompson sampling in Bandit problems
    2.3.1 Bayesian Inference for Multi-Armed Bandits
    2.3.2 Thompson Sampling for Multi-Armed Bandits
    2.3.3 Importance of the choice of prior for Gaussian Thompson Sampling
    2.3.4 Thompson Sampling for Contextual Bandits
3 Method
  3.1 Motivation for RL
  3.2 Difficulties and constraints
  3.3 Restriction of the Problem
    3.3.1 Global optimization
    3.3.2 Variable network parameters
  3.4 Definition of the problem
    3.4.1 On-line Cell-wide learning for common iOff
    3.4.2 Per-UE iOff Learning
  3.5 Data
    3.5.1 Getting the data
    3.5.2 Datasets
4 Results
  4.1 Multi-Armed Bandits for cell-wide iOff
  4.2 Contextual Bandits for per-UE iOff
5 Discussion
  5.1 On the importance of bayesian thinking in POMDPs
  5.2 On the application of RL in wireless networks research
  5.3 Lessons learned from the project
6 Conclusion
A Results without Optimistic Initialization
B Hyperparameters for Contextual Bandits


1 Introduction

Mobile and wireless communications have become part of the daily life of billions of people around the world, and the demand for fast and reliable transmissions keeps increasing. Since the first patent filed in 1907 by Nathan Stubblefield [1], describing a wireless phone using electromagnetic waves, successive improvements of the technology have made the use cases that we know today possible. For instance, the first call from a civilian phone was made in 1946, the first use of a cellular network happened in 1973, and the shift from analog to digital transmissions took place in 1992 with the appearance of the Global System for Mobile (GSM) communications, also called 2G. Since then, improvements have been agreed upon by the different actors of the field, coordinated by the 3rd Generation Partnership Project (3GPP), and distributed as successive generations as well as minor releases.

In recent years, operators and hardware providers have observed a change in demand. Firstly, there has been an increase in the amount of data transmitted because of new use cases such as video or live streaming. As a result, the current networks are becoming saturated in certain areas, which is one of the motivations for the development of new technology. At the same time, the demand for shorter transmissions has emerged and is expected to increase drastically in the next few years. This is due to the proliferation of connected devices, notably in homes and cities, which are often grouped under the general denomination "Internet Of Things" (IOT). Telemetry applications from already existing devices like mobile phones and computers also play a role in this recent trend. These types of transmissions also saturate the network, but in terms of connection capacity rather than data volume.



As a consequence, research directions that aim to solve these issues are thoroughly examined, including the problem of resource allocation: we want to distribute resources to the users in the best way possible, in order to improve the quality of service as well as the energy efficiency of the network. The field studied in this thesis belongs to this research area. It is called Adaptive Modulation and Coding (AMC) and aims to select an appropriate Modulation and Coding Scheme (MCS) for each user in the network. The quality of the MCS selection has a direct effect on the throughput of the transmission, but an aggressive strategy prioritizing throughput maximization makes the receiver prone to decoding failures. This trade-off makes AMC a non-trivial problem.

In particular, we studied the initial offset problem, in which we need to decide on an initial value for an algorithm commonly used in the field, called the Outer Loop for Link Adaptation (OLLA). We chose to formulate this problem as a Reinforcement Learning (RL) problem, and provide relevant restrictions to simpler settings, namely Multi-Armed Bandits (MAB) and Contextual Bandits (CB). The desired outcome was to determine whether RL techniques have the potential to show substantial gains on the AMC problem, and more broadly in wireless communication systems. It also constituted an opportunity to apply state-of-the-art techniques and evaluate them on a real-world problem, where resources are limited and time is constrained.

Previous work has been conducted on the formulation of AMC as an RL problem, notably in [2], [3] and [4], and has presented alternatives to OLLA. However, these approaches consider learning the optimal MCS over the course of the transmission. This is not possible with very short transmissions, which typically require only a few packets to be scheduled, and which constitute the motivation for initial offset tuning.

In this thesis, our contribution can be summed up in three points. First, we formulate the problem in a tractable way and discard irrelevant information. Then, using the proposed formulation, we implement well-known bandit algorithms for this problem and analyze their potential and limitations. Finally, we propose a method that improves the performance of the OLLA algorithm by choosing an appropriate initial offset in the face of uncertainty.


In chapter 2, we describe the background on the relevant parts of wireless network theory and RL, with emphasis on bandits and Thompson Sampling.

In chapter 3, we motivate the use of RL in this context, analyze the relevance and importance of possible restrictions of the problem, and present the scenarios in which the algorithms were trained. In chapter 4, we compare different algorithms and choices of hyper-parameters and provide an interpretation of the results. Finally, in chapter 5, we discuss the relevance of some user-specific features and provide an interpretation of the results in the light of bayesian reasoning.

A substantial part of the time was dedicated to literature review and problem formulation, as well as constructing the scenario within the company's simulator, which was carried out alongside the experiments.

Although there are no direct ethical and societal dilemmas in this work, it is worth noting that some choices that we make while designing the algorithms may affect the overall positioning of the end product in terms of user prioritiza- tion and fairness, and as a result, net neutrality. An example of such a choice and its implications is detailed in chapter 3. Moreover, the ecological sustain- ability of 5G technology is challenged [5], and even if its energy efficiency is expected to improve over time [6], the use cases allowed by this technology might create a rebound effect. This effect would be incompatible with the need for digital sobriety [7].


2 Background

2.1 AMC in Wireless networks

We consider a network composed of one Base Station (BS), or site, serving several User Equipments (UE) positioned within a certain range. Only downlink transmission is considered in this study. We use the context of 5G technology with Massive MIMO, but the ideas can be generalized to other types of networks. We study methods used to increase resource utilization efficiency depending on the quality of the wireless channel.

2.1.1 MCS and BLER

The Modulation and Coding Scheme (MCS) is a 5-bit index corresponding to a specific combination of modulation type, coding rate, and number of spatial streams. Agreeing on this index guarantees an appropriate demodulation of the signal at the UE, so that the BS and UE can communicate flawlessly.

Moreover, the choice of the MCS affects the performance of the network: if it is too conservative (i.e. low), the transmission will not utilize the full capacity of the link, which leads to a lower throughput and eventually to poor network efficiency [8]. However, if it is too aggressive (i.e. high), the quality of the channel (interference, noise, ...) will cause a high BLock Error Rate (BLER), and the transmission will not be acknowledged by the UE [9].

In practice, there is a BLER target under which the transmission can be carried out flawlessly [10]. This target is usually set to 10%.

It is thus important to choose the right MCS for each transmission in order to meet the expected BLER target. The process of selecting the best MCS for each packet sent is called Adaptive Modulation and Coding (AMC) or Link Adaptation (LA). A perfect AMC procedure would select the MCS which utilizes the full capacity of the wireless channel at all times.

2.1.2 SINR and CQI

Every time the BS transmits a signal to the UE (either a packet or a reference signal), the UE can measure a Signal-to-Interference-plus-Noise Ratio (SINR), which represents the quality of the transmission.

The UE then maps the SINR to a Channel Quality Indicator (CQI), quantized in a way that depends on the vendor and the model [2], but always encoded on 4 bits. This CQI is then sent back to the BS in order to suggest an appropriate new MCS [10].

However, the CQI report may be inaccurate, mostly because of three factors: the delay between the measurement of the SINR at the UE and its processing at the BS, the SINR-to-CQI mapping, and the SINR estimation itself (which can be biased and suffer from high variance) [2]. The next sub-section describes a method for the correction of this inaccuracy.

Figure 2.1: Diagram of the AMC, and how RL can be integrated. The red dotted lines represent wireless transmissions.

2.1.3 ILLA and OLLA

The collected CQI is taken as input of the Inner Loop for Link Adaptation (ILLA). This is a "fast" loop [11], which outputs the best choice of MCS for the given CQI. ILLA maps a 4-bit CQI to a 5-bit MCS.

The correction of CQI inaccuracy (or offset) is computed by the Outer Loop for Link Adaptation (OLLA), a "slow" adaptive loop first proposed in [11]. For this project, we consider that the correction of the CQI estimation takes place after ILLA, although in the literature it sometimes takes place before [9], [12]. This means that the correction provided by OLLA is added to the MCS given by ILLA, and the continuous corrected MCS is then quantized to fit in 5 bits. The loop works as follows:

Every time a packet is sent, the UE replies with an ACK/NACK signal denoting whether the packet was acknowledged or not. If OLLA receives an ACK, the offset is incremented by a certain step size to make the MCS selection more aggressive, and if OLLA receives a NACK, the offset is decremented. Note that we consider here that the offset is added to the MCS, although some previous works consider it subtracted [9]. The respective step sizes for ACK and NACK obey certain rules that link them and impose a maximum step size [10]; while the step-size selection constitutes an interesting playground for RL [2], it is left for future work.

Moreover, since OLLA is a recursive loop, it needs to take an initial value, that we will call here the iOff. If the iOff is too high, the beginning of the transmission will suffer from a high BLER, and will require re-transmissions [9], until a suitable MCS is reached by OLLA. Conversely, if the iOff is too low, the MCS selection is too conservative, which deteriorates the throughput.

OLLA will eventually find a good offset, but it may require excessive time to converge, as Figure 2.2 shows (note that OLLA does not converge to a single value, but converges on average [10]). Indeed, in the typical case in which $\Delta_{\mathrm{up}} = 0.1$ dB [10], compensating a 1 dB bias takes 10 iterations of OLLA, which means 10 packet transmissions [9].
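
To make the update rule above concrete, here is a minimal, illustrative sketch of the OLLA offset recursion (not the thesis or simulator code). The step sizes are assumptions: $\Delta_{\mathrm{up}} = 0.1$ dB as quoted above, and $\Delta_{\mathrm{down}} = \Delta_{\mathrm{up}}(1 - \mathrm{BLER}_{\mathrm{target}})/\mathrm{BLER}_{\mathrm{target}}$, one common convention for steering the loop towards a 10% BLER target.

```python
# Illustrative sketch of the OLLA update described above (not the thesis code).
# Assumed values: delta_up = 0.1 dB and a 10% BLER target; the relation
# delta_down = delta_up * (1 - target) / target is one common convention.

def olla_offset(acks, i_off=0.0, delta_up=0.1, bler_target=0.1):
    """Return the sequence of OLLA offsets (in dB) after each ACK/NACK."""
    delta_down = delta_up * (1.0 - bler_target) / bler_target
    offsets = []
    offset = i_off                       # initial offset (the iOff)
    for ack in acks:
        if ack:                          # ACK: be more aggressive
            offset += delta_up
        else:                            # NACK: back off
            offset -= delta_down
        offsets.append(offset)
    return offsets

# Example: starting 1 dB too low, ten consecutive ACKs compensate the bias.
print(olla_offset([True] * 10, i_off=-1.0)[-1])   # ~0.0 dB
```

Starting 1 dB too low, ten consecutive ACKs bring the offset back to roughly 0 dB, matching the ten-iteration example above.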

For this reason, choosing an appropriate value for the iOff is crucial, especially with sparse traffic, for which the convergence time is substantial and cannot be ignored. With the increasing occurrence of such traffic due to IOT and other telemetry applications [13], a suboptimal choice of iOff may have a stronger impact on the network’s efficiency.

The whole AMC process is described in Figure 2.1.

2.2 Reinforcement Learning

2.2.1 Markov Decision Process

Reinforcement Learning (RL) agents need to operate in an environment whose behavior is consistent enough to allow learning. A common way to model the environment is to use a Markov Decision Process (MDP) [14].


Figure 2.2: Convergence of OLLA in two different scenarios: (a) a favorable scenario and (b) an unfavorable one. The graphs were obtained by simulating the same experiment with different initial values for OLLA. On graph (a), OLLA takes around 0.25 s to converge, while on (b) it has not converged even after 0.5 s. The y-axis represents the offset value in dB.

An MDP is defined as a tuple $(T, \mathcal{S}, \mathcal{A}, (p(\cdot \mid s, a), r(s, a))_{s \in \mathcal{S}, a \in \mathcal{A}})$ containing:

• a time horizon $T \in \mathbb{N} \cup \{\infty\}$;
• a set of states $\mathcal{S}$ and a set of actions $\mathcal{A}$;
• state transition probabilities $p(s' \mid s, a) = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$;
• a reward function for each state-action pair, $r(s, a) = \mathbb{E}\left[r_{t+1} \mid s_t = s, a_t = a\right]$.

We can define the return as the discounted sum of rewards along a trajectory $\tau = (s_t, a_t, r_t)_{1 \le t \le T}$, where the discount factor is $\gamma \in [0, 1]$ [14]:

\[
R = \sum_{t=1}^{T} \gamma^{t-1} r_t
\]

Note that for an infinite time horizon we need $\gamma < 1$ for the return to be finite. Subsequently, we can define the expected return of a policy $\pi : s \mapsto a$ from trajectories sampled under $\pi$:

\[
R^{\pi} = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=1}^{T} \gamma^{t-1} r_t^{\tau} \right]
\]

In RL, we seek a policy $\pi$ that maximizes the expected return.
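
As a small illustration of the return defined above, the following snippet computes the discounted sum $\sum_t \gamma^{t-1} r_t$ for one trajectory; the rewards and the value of $\gamma$ are illustrative, not taken from the thesis.

```python
# Discounted return of a single trajectory, as defined above.

def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

print(discounted_return([1.0, 0.0, 2.0]))  # 1*1 + 0.9*0 + 0.81*2 = 2.62
```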


We can define tools for assessing the progress of the learning agent, namely value functions. The state-value function $V$ and the action-value function $Q$ are defined respectively, for a certain policy $\pi$, as:

\[
V^{\pi}(s) = \mathbb{E}\left[R^{\pi} \mid s_1^{\pi} = s\right] \quad \text{and} \quad Q^{\pi}(s, a) = \mathbb{E}\left[R^{\pi} \mid s_1^{\pi} = s, a_1^{\pi} = a\right]
\]

The goal of RL can be reformulated as the estimation of the true value function, that is, the value function of the optimal policy.

2.2.2 Model-free RL

In most environments, the rewards and transition probabilities are not known in advance. The reason can be that these probabilities cannot be compressed into a reasonable set of rules, or simply that computing them is too complex, which is the case in this project: the company's simulator is too complex for computing the conditional probability distributions. Furthermore, in such a complex system, the state most likely does not account for full information about the environment, which makes it a Partially Observable MDP (POMDP). Formally, the definition of a POMDP is obtained by adding an observation space $\Omega$ and observation probabilities $O(o \mid s', a)$ to the definition of an MDP.

We can split RL methods into two categories: Model-based methods, which need to know the model (or learn it) and use planning to find the optimal policy, and Model-free methods, which instead rely on exploration (i.e. interaction with the environment) to gather knowledge.

For the reasons explained above, we used Model-free RL algorithms. Model-based methods, where the MDP is learned by the agent over time, are promising [15] [16], particularly in terms of sample efficiency, but most advances in this domain are fairly new and lack analysis in broader cases, which makes them out of scope for this thesis.

Model-free RL is a broad field, composed of many (overlapping) sub-frameworks including Temporal Difference, Policy Gradient and Actor-Critic methods, but describing them is not the main point of this background section. However, it is important to note that a common point of all of these algorithms is that they keep in memory a belief about the environment's dynamics, through the value function (value-based methods) and/or directly in the policy (policy-based methods). Since we only use value-based methods in this thesis, we will from now on refer to this belief directly as the value function.


The value function is successively updated through the course of training, until the policy we derive from it is satisfactory. In other words, RL algorithms are estimators of the true value function, whose properties (bias, variance, ...) define their effectiveness.

2.2.3 Function Approximation

The structure of the estimator used for the value function can be very straightforward when the state-action space is small: we can typically store $V(s)$ (resp. $Q(s, a)$) for each $s \in \mathcal{S}$ (resp. $(s, a) \in \mathcal{S} \times \mathcal{A}$) in a table.

However, when the state-space is large or even continuous, this is intractable as the table cannot fit in memory. In that case, we can use a neural network as a function approximator to compute the estimates of the value function, and update it using a gradient-based optimization algorithm [17]. Combined with efficient Deep Learning platforms and GPU computations, this allows for a wide range of applications that were previously out of reach [18].

2.2.4 Sample efficiency and exploration

A major concern in Reinforcement Learning is how sample-efficient the algorithm is, i.e. how many data points or tuples of experience it needs in order to learn a near-optimal policy [19]. Although not an issue when working with simple simulated environments, this can become a tight bottleneck in real-world scenarios or computationally heavy simulations. Training algorithms on millions or even billions of samples is not uncommon in the literature [20].

Fortunately, relatively sample-efficient algorithms have been developed to face this issue [21] [22]. It is possible to initialize the parameters such that the agent learns more quickly [23], and to collect some of the data in advance in order to perform a "warm start". But in most cases, the main part of the potential gains comes from the exploration-exploitation trade-off:

Exploration is the ability of the agent to broaden its knowledge of the environ- ment by exploring new configurations. Exploitation is the ability to refine the knowledge that the algorithm already possesses and gather as much reward as possible with its current knowledge. A highly performing, sample efficient RL algorithm needs a balanced behavior between these two concepts.


2.2.5 From bandits to RL

We can restrict RL to simpler cases when possible, as shown in Figure 2.3, and doing so can help with both performance and sample efficiency.

The simplest form of RL commonly studied is called Multi-Armed Bandits. Here the state-space is a singleton, the agent is presented with a set of actions, and the goal is to find the most rewarding action. But rewards are noisy, so the agent needs to try each action several times, while keeping the exploration phase short, so that the agent maximizes its gain over the whole session.

A more complex setting is called Contextual Bandits. Now the state-space is not restricted to a singleton, and the reward depends on both the state and the action. However, note that the transition probabilities do not depend on the previous action: $p(s' \mid s, a) = p(s' \mid s)$. Thus, we consider a succession of i.i.d. experiments, or in other words the time horizon is $T = 1$. High-quality exploration is still the main source of concern, but generalization is also an issue now.

Finally, with general RL, actions have long-term effects because they affect the distribution of future states. Usually, sophisticated exploration is not the main concern when designing RL algorithms, because simple exploration heuristics have been shown to perform decently while not hurting the stability of the algorithm too much. Indeed, stability (i.e. the ability to converge to the optimal policy with high probability and a smooth learning curve) is usually of great concern in these settings.

2.2.6 Bandit algorithms

Let's consider a Stochastic Multi-Armed Bandit (MAB) problem defined as follows: for $t > 0$, the agent can take one of the actions $a_1, \dots, a_n$ and observes a reward $r_{t,k} \sim R_k$, randomly distributed with mean $\mathbb{E}[R_k] = \mu_k$. We will now refer to the process of choosing an action and observing a reward as an experiment.

The quality of the exploration phase in bandit problems, and thus the quality of the algorithm, is often measured using regret. It corresponds to the difference between the reward obtained by following the optimal policy and the reward obtained under the given policy:

\[
\mathrm{regret}(t) = r_{t,k^*} - r_{t,K_t}
\]

where $k^*$ denotes the optimal arm and $K_t$ the arm chosen at time $t$.


Figure 2.3: Diagram of a possible representation of different ML fields

It has been shown in [24] that every consistent bandit algorithm is subject to a logarithmic asymptotic lower bound for a given problem instance $I = (R_1, \dots, R_n)$, that is,

\[
R(T) \ge \log(T) \left[ \sum_{k=1}^{n} \frac{\mu^* - \mu_k}{D(\mu_k \,\|\, \mu^*)} + o(1) \right], \tag{2.1}
\]

where $R(T) = \mathbb{E}\left[\sum_{t=1}^{T} \mathrm{regret}(t)\right]$ is the expected cumulative regret after time $T$, $\mu^* = \max_k \mu_k$, and $D$ is the KL-divergence. A consistent algorithm is defined as an algorithm whose regret is sub-polynomial on every instance [25]. For example, the algorithm which consists in always choosing the same arbitrary arm will work well on instances where this arm is optimal but will have linear regret on other instances, so it is not consistent.

This bound means that, without previous knowledge of the particular problem instance, it is not possible to design an algorithm whose cumulative regret always stops growing after enough experiments. It might seem counter-intuitive at first sight, but the reason is that in a MAB problem we cannot know the true distribution of rewards, and will never know it.

Let's consider the toy example in which we have 2 actions and deterministic rewards $r_1 = 1$ and $r_2 = 2$. If we take each action 500 times and thus obtain 500 times $r_1$ and 500 times $r_2$, we may feel very confident that $a_2$ is the optimal action. However, it might happen that the distributions are actually not deterministic, and that these results are only due to "bad luck" when rewards are sampled from their distribution. This is unlikely, but possible. The lower bound (2.1) shows that we cannot a priori expect the regret of any bandit algorithm to be sub-logarithmic. Another interpretation of this bound is that every algorithm needs to make a logarithmic number of mistakes in order to find the optimal action [25].

The questions are whether we can design an algorithm able to achieve this asymptotic lower bound, and how well it performs in the first iterations.

2.2.7 ε-greedy algorithms

In the multi-armed bandit setting described above, a frequentist approach would consist in building a point estimate of each action's mean reward, $\tilde{\mu}_{\Phi} \approx \mu = (\mu_1, \dots, \mu_n)$, where $\Phi$ is the set of parameters defining the estimator. The greedy action is then the action with the highest estimated mean reward.

A very simple and naive exploration strategy would be to explore first, by taking random actions, and then exploit by always taking the greedy action. The expected cumulative regret is linear and depends on the duration of the exploration phase [26]. But we may want to mix exploration and exploitation in order to lower the cumulative regret. A common approach is the ε-greedy strategy: at every time step, we take a random action with probability ε and the greedy action with probability 1 − ε (see Algorithm 1).

The expected cumulative regret is also linear, because we continue to take random actions forever, but we empirically get better results than with initial random exploration in most problem instances. A natural idea is to make ε decay over time. If ε converges to 0, we can hope to obtain sub-linear regret. [26] has shown that under a certain decay schedule we can obtain logarithmic expected regret. However, this would require previous knowledge of the problem instance, which we do not have.

ε-greedy algorithms are good exploration heuristics, but they need tuning to design a schedule appropriate for the task. There is a trade-off between lowering the cumulative regret and having confidence in the successful outcome of the exploration phase on unseen problem instances.

ε-greedy heuristics are widely used, especially in general RL, because they have empirically been shown to provide "good enough" exploration while not hurting the stability of the algorithm [17]. Note that in general RL it would be infeasible to collect data in advance with random exploration, because it would mean exploring parts of the state-space that the agent would never observe under any "decent" policy, and as a consequence it would require an extremely long exploration phase.

Algorithm 1: ε-greedy algorithm
Hyperparameters: decay schedule $(\varepsilon_t)_{t \ge 0}$, initial estimates $\tilde{\mu}_{0,k}$ for each arm
for $t \ge 0$ do
    Draw a random number $p$ uniformly between 0 and 1
    if $p < \varepsilon_t$ then
        Select a random action $a_{K_t}$
    else
        Select the greedy action $a_{K_t}$ with $K_t = \arg\max_k \tilde{\mu}_{t,k}$
    end
    Observe reward $r_{t,K_t}$
    Update the reward estimate: $\tilde{\mu}_{t,K_t} \leftarrow \dfrac{t_k \, \tilde{\mu}_{t,K_t} + r_{t,K_t}}{t_k + 1}$, where $t_k$ is the number of times arm $k$ was pulled
end
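
For concreteness, here is a minimal runnable sketch of Algorithm 1. The Gaussian reward noise, the arm means and the $1/t$ decay schedule are illustrative assumptions, not the configuration used in the thesis.

```python
# Minimal epsilon-greedy bandit (Algorithm 1) on a synthetic problem.
import numpy as np

def epsilon_greedy(true_means, horizon=5000, eps0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(true_means)
    estimates = np.zeros(n)   # reward estimates per arm
    pulls = np.zeros(n)       # number of pulls per arm
    for t in range(1, horizon + 1):
        eps = eps0 / t                            # illustrative decay schedule
        if rng.random() < eps:
            k = int(rng.integers(n))              # explore
        else:
            k = int(np.argmax(estimates))         # exploit (greedy action)
        r = true_means[k] + rng.normal()          # noisy reward
        pulls[k] += 1
        estimates[k] += (r - estimates[k]) / pulls[k]   # running-mean update
    return estimates

print(epsilon_greedy([0.0, 0.5, 1.0]))  # estimates approach [0.0, 0.5, 1.0]
```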

2.3 Thompson sampling in Bandit problems

Some simple heuristics like -greedy algorithms have been shown to constitute good exploration methods, but many previous pieces of research have tried to build theoretically sound exploration algorithms, whose performance can be quantified thanks to asymptotic bounds in general cases. Considering the sequential nature of RL, which implies a growing knowledge of the problem instance over time, bayesian reasoning has inspired researchers from different fields to come up with bayesian exploration algorithms, notably Bayesian Upper Confidence Bound and Thompson Sampling. The latter is of particular interest, as it has been thoroughly explored, but its asymptotic behavior is still an open question in the general case [27].

2.3.1 Bayesian Inference for Multi-Armed Bandits

In a bayesian approach, we want to quantify the uncertainty of the estimate, to reflect the current knowledge about the problem instance. The Bayesian Inference process works as follows [28]:

• We choose a probability distribution model for the reward distribution of each action. This model is called the likelihood $f(r \mid a_k)$.


• We also choose a probability distribution model for a set of parameters $\theta$ of the likelihood (for example the mean and standard deviation of a normal distribution). This model is called the prior $\pi(\theta)$.

• When the agent chooses an action and observes a reward, we compute the posterior distribution for this action using Bayes' formula:
\[
P(\theta_k \mid a_k, r_k) \propto \pi(\theta_k)\, f(r_k \mid a_k) .
\]

• The posterior serves as the prior for the next experiment.

Because of the sequential nature of multi-armed bandits, we obtain a sequence of posteriors $\pi_t(\theta) = (\pi_t(\theta_1), \dots, \pi_t(\theta_n)) = (P(\theta_1 \mid H_t), \dots, P(\theta_n \mid H_t))$, where $H_t$ is the history of chosen actions and observed rewards before time $t$. With data comes better knowledge of the reward distributions, which means that the variance of the posterior $\pi_t$ should decrease to 0 as $t$ grows if the reward distribution does not change over time.

We often choose a conjugate prior of the likelihood as the prior model. This means that the prior and the posterior belong to the same distribution family, and we can derive a formula for updating the parameters of the prior. For example, in the Bernoulli Bandit problem, whose rewards are either 1 or 0 with a fixed probability of success, the likelihood is $R_k = \mathrm{Bernoulli}(p_k)$ and a common conjugate prior is $\pi_k = \mathrm{Beta}(\alpha_k, \beta_k)$ [28]. The parameter update after getting the reward is given by

\[
(\alpha_{k,t+1}, \beta_{k,t+1}) =
\begin{cases}
(\alpha_{k,t} + 1,\; \beta_{k,t}) & \text{if } r_{k,t+1} = 1 \\
(\alpha_{k,t},\; \beta_{k,t} + 1) & \text{if } r_{k,t+1} = 0
\end{cases}
\]

As a consequence, when arm $k$ has been played $t_k$ times, with $s$ successes and $f$ failures ($s + f = t_k$), the knowledge about arm $k$ can be represented by $\mu_k \sim \mathrm{Beta}(\alpha_0 + s, \beta_0 + f)$.
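
As a small worked example with illustrative numbers (not taken from the thesis): starting from a flat $\mathrm{Beta}(1, 1)$ prior and observing $s = 7$ successes and $f = 3$ failures for an arm gives

\[
\mu_k \mid H_t \sim \mathrm{Beta}(1 + 7,\; 1 + 3) = \mathrm{Beta}(8, 4), \qquad \mathbb{E}\left[\mu_k \mid H_t\right] = \frac{8}{8 + 4} \approx 0.67 .
\]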

Another question of interest is the choice of the initial parameters of the prior ($\alpha_0$ and $\beta_0$ in the previous example). If one does not have any particular belief about the distribution of $\theta$, we might want to use a non-informative prior, i.e. a prior conveying no information. Note that a flat uniform prior is not necessarily a non-informative prior, as it may not be invariant to reparameterization (Michael Jordan, 2010 [29]).

An example of this property is the Binomial likelihood $\mathrm{Binom}(n, p)$, where the flat prior on $p$, $\mathrm{Beta}(1, 1)$, is informative, while $\mathrm{Beta}(0.5, 0.5)$ is not (see Figure 2.4). Indeed, if we choose to estimate the log-odds $\theta = \log\frac{p}{1-p}$ instead of $p$, the flat prior on $p$ is not flat on $\theta$! However, $\mathrm{Beta}(0.5, 0.5)$ remains the same prior whatever the choice of parameterization. It might seem surprising that a non-flat prior is non-informative, but it makes sense: when the true $p$ is closer to 0 or 1, the observed data has more effect on the posterior than when $p$ is close to 0.5, and this non-flat prior compensates for this effect.

Figure 2.4: The flat prior Beta(1, 1) is informative because it does not compensate the effect of the likelihood.

However, the existence of an "absolutely non-informative" prior, and even the relevance of such a denomination, is extensively criticized in the bayesian community, and is at the core of a famous opposition between so-called objective and subjective bayesian thinking. Since proposing an original theory of bayesian statistics is out of scope for this thesis, we will only compare different choices of prior for the given problem instance and propose interpretations of their relative performance based on previous work.

2.3.2 Thompson Sampling for Multi-Armed Bandits

Thompson Sampling (TS) is a Bayesian exploration strategy for the MAB problem. The uncertainty contained in the posterior distribution is not only useful for a better representation of the expected reward distribution, but can also be used to select actions.

The idea behind TS, called Probability matching, is to sample an action from its probability of being optimal. At every time step, a sample $\tilde{\mu}_{k,t}$ is drawn from each action's posterior, and the selected action $a_{K_t}$ is given by $K_t = \arg\max_k \tilde{\mu}_{k,t}$.

This action will be used for getting a reward and performing the bayesian inference [30]. The pseudo-code is available in Algorithm 2.

Algorithm 2: Generic Thompson Sampling
Choose a prior distribution $\pi_0 = (\pi_{1,0}, \dots, \pi_{n,0})$
for $t \ge 0$ do
    Sample $\tilde{\mu}_t = (\tilde{\mu}_{1,t}, \dots, \tilde{\mu}_{n,t})$ from $\pi_t$
    Compute $K_t = \arg\max_k \tilde{\mu}_{k,t}$
    Choose action $a_{K_t}$ and observe reward $r_{t,K_t}$
    Update the posterior: $\pi_{t+1}(\mu) \propto \pi_t(\mu)\, f(r_{t,K_t} \mid \mu)$
end
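
A minimal runnable sketch of Algorithm 2 for the Bernoulli Bandit, combining the Beta conjugate update of section 2.3.1 with probability matching; the prior parameters and the success probabilities below are illustrative assumptions.

```python
# Beta-Bernoulli Thompson Sampling (Algorithm 2 specialized to Bernoulli rewards).
import numpy as np

def bernoulli_thompson(true_p, horizon=5000, alpha0=1.0, beta0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(true_p)
    alpha = np.full(n, alpha0)   # alpha0 + number of successes per arm
    beta = np.full(n, beta0)     # beta0 + number of failures per arm
    for _ in range(horizon):
        samples = rng.beta(alpha, beta)     # one draw per arm's posterior
        k = int(np.argmax(samples))         # probability matching
        r = rng.random() < true_p[k]        # Bernoulli reward
        alpha[k] += r                       # conjugate posterior update
        beta[k] += 1 - r
    return alpha / (alpha + beta)           # posterior mean estimates

print(bernoulli_thompson([0.2, 0.5, 0.8]))  # arm 2 ends up with the highest mean
```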

Thompson Sampling has been proven to achieve the asymptotic bound (2.1) for the Bernoulli Bandit in [27]. Moreover, in [31], TS is shown to be asymptotically optimal in mean value in the general case, which implies that the expected regret converges to 0 as the exploration goes on:

\[
\mathbb{E}\left[\mathrm{regret}_{TS}(t)\right] \xrightarrow[t \to \infty]{} 0 \tag{2.2}
\]

In other words, it means that Thompson Sampling cannot get stuck pulling a sub-optimal action, contrary to ε-greedy algorithms with decaying ε.

However, it does not achieve the asymptotic bound for every prior in the general case [32]. From now on, and for the sake of clarity, we will focus on Bandit problems with normally distributed rewards $R_k \sim \mathcal{N}(\mu_k, \sigma_k^2)$.

2.3.3 Importance of the choice of prior for Gaussian Thompson Sampling

While a Beta prior is commonly accepted as a good solution for the Bernoulli Bandit problem, it is yet unclear what the prior should be for Bandit problems with gaussian rewards. Note that we are only interested in $\mu = (\mu_k)_{1 \le k \le n}$, since we only want to determine the arm with the highest expected reward.

A common approach is to consider $\sigma = (\sigma_k)_{1 \le k \le n}$ to be known, such that the likelihood simplifies into a one-parameter model. Then, we can choose a normal conjugate prior $\pi_k = \mathcal{N}(\mu_{k,0}, \lambda_{k,0}^2)$. The posterior after arm $k$ has been played $t_k$ times is given by:

\[
\pi_{k,t_k} = \mathcal{N}(\mu_{k,t_k}, \lambda_{k,t_k}^2)
\quad \text{with} \quad
\mu_{k,t_k} = \frac{1}{\frac{1}{\lambda_{k,0}^2} + \frac{t_k}{\sigma_k^2}} \left( \frac{\mu_{k,0}}{\lambda_{k,0}^2} + \frac{\sum_{i=1}^{t_k} r_{k,i}}{\sigma_k^2} \right)
\quad \text{and} \quad
\lambda_{k,t_k}^2 = \left( \frac{1}{\lambda_{k,0}^2} + \frac{t_k}{\sigma_k^2} \right)^{-1}
\tag{2.3}
\]

We can verify that, as expected, $\mu_{k,t_k} \xrightarrow[t_k \to \infty]{} \mu_k$ and $\lambda_{k,t_k} \xrightarrow[t_k \to \infty]{} 0$.

This method is proven to achieve a logarithmic asymptotic bound [33] but, to the best of my knowledge, it is unknown whether it achieves the lower bound provided in Equation (2.1). Note that if we do not have expert knowledge on the values of $\mu_{k,0}$ and $\lambda_{k,0}$, we can choose to set $\lambda_{k,0}$ to $\infty$ to obtain a flat "non-informative" prior, which happens to be the Jeffreys prior [34] (whose general form is $\pi(\theta) \propto \sqrt{I(\theta)}$, where $I$ is the Fisher Information). TS with the Jeffreys prior achieves the lower bound in Equation (2.1) for every one-parameter likelihood model from the exponential family, to which the normal distribution belongs.

Algorithm 3: Gaussian Thompson Sampling with known variance
Hyperparameters: $\sigma \in (\mathbb{R}^+)^n$, $\mu_0 \in \mathbb{R}^n$, $\lambda_0 \in (\mathbb{R}^+ \cup \{+\infty\})^n$
for $t \ge 0$ do
    For each arm $k$, sample $\tilde{\mu}_k^{(t)}$ from $\pi_{k,t_k} = \mathcal{N}(\mu_{k,t_k}, \lambda_{k,t_k}^2)$
    Compute $K_t = \arg\max_k \tilde{\mu}_k^{(t)}$
    Choose action $a_{K_t}$ and observe reward $r_{t,K_t}$
    Update $\mu_{K_t,t_{K_t}}$ and $\lambda_{K_t,t_{K_t}}$ with Equation (2.3)
end
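
The following is a minimal sketch of Algorithm 3 on a synthetic problem, using the conjugate update of Equation (2.3). The prior parameters, the known $\sigma$ and the arm means are illustrative assumptions.

```python
# Gaussian Thompson Sampling with known reward variance (Algorithm 3).
import numpy as np

def gaussian_thompson(true_means, sigma=1.0, horizon=3000,
                      mu0=0.0, lam0=10.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(true_means)
    mu = np.full(n, mu0)          # posterior means
    lam2 = np.full(n, lam0 ** 2)  # posterior variances
    sums = np.zeros(n)            # running sums of rewards per arm
    counts = np.zeros(n)          # number of pulls per arm
    for _ in range(horizon):
        k = int(np.argmax(rng.normal(mu, np.sqrt(lam2))))  # sample and pick
        r = rng.normal(true_means[k], sigma)               # observe reward
        counts[k] += 1
        sums[k] += r
        # Equation (2.3): normal-normal conjugate posterior update
        precision = 1.0 / lam0 ** 2 + counts[k] / sigma ** 2
        mu[k] = (mu0 / lam0 ** 2 + sums[k] / sigma ** 2) / precision
        lam2[k] = 1.0 / precision
    return mu

print(gaussian_thompson([0.0, 1.0, 2.0]))  # the best arm's estimate is the most refined
```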

However, a big flaw of this approach is that we need expert knowledge and/or empirical tuning to determine the value of σ. If we cannot evaluate σ, we may end up with either a too optimistic or too conservative algorithm.

If σ is not known, we still have guarantees thanks to Honda et al. [32].

They consider priors of the form $\pi(\mu, \sigma^2) \propto (\sigma^2)^{-1-\alpha}$, $\alpha \in \mathbb{R}$. For such a prior, we can compute the marginal posterior on $\mu_k$ after arm $k$ has been played $t_k$ times [35]:


For $t_k > \max\{2, 2 - \lceil 2\alpha \rceil\}$, we define

\[
\Theta_{k,t}(\mu) = \sqrt{\frac{t_k (t_k + 2\alpha - 1)}{S_{k,t_k}}} \left( \mu - \mu_{k,t_k} \right)
\]

where $\mu_{k,t_k}$ and $S_{k,t_k}$ are respectively the empirical mean and sum of squares:

\[
\mu_{k,t_k} = \frac{1}{t_k} \sum_{i=1}^{t_k} r_{k,i} \quad \text{and} \quad S_{k,t_k} = \sum_{i=1}^{t_k} \left( r_{k,i} - \mu_{k,t_k} \right)^2
\]

Then, the posterior on $\Theta_{k,t}(\mu)$ is given by

\[
\forall t_k > \max\{2, 2 - \lceil 2\alpha \rceil\}, \quad \pi_{k,t_k}\!\left(\Theta_{k,t}(\mu) \mid H_t\right) = f_{t_k + 2\alpha - 1}\!\left(\Theta_{k,t}(\mu)\right)
\]

where $f_\nu$ is the t-distribution with $\nu$ degrees of freedom:

\[
f_\nu(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu \pi}\, \Gamma\!\left(\frac{\nu}{2}\right)} \left( 1 + \frac{x^2}{\nu} \right)^{-\frac{\nu+1}{2}}
\]

Note that we need to play each arm at least $\max\{2, 2 - \lceil 2\alpha \rceil\}$ times to get a proper (i.e. integrable) posterior. TS with this prior achieves Equation (2.1) for $\alpha < 0$ but not for $\alpha > 0$, which includes the Jeffreys prior, obtained for $\alpha = \frac{1}{2}$. Furthermore, it appears empirically that the Jeffreys prior suffers from polynomial regret on average. However, Honda et al. note that these poor average results obscure the fact that it actually performs better than other priors in most cases, and that a few bad trials largely influence the average. It will be interesting to compare the empirical performance of these priors on the problem instance considered in this thesis, and to determine whether this is acceptable. The pseudo-code is given in Algorithm 4.

2.3.4 Thompson Sampling for Contextual Bandits

In Contextual Bandits problems, the reward depends not only on the action, but also on the state (or context). As a consequence, the likelihood is now of the form $f(r \mid X, a)$, where $X$ is the context.

Bishop [36] and Riquelme et al. [37] provide formulas for Bayesian Linear Regression (BLR), i.e. for the case where we assume that the distribution of rewards is $R_k = X^T \zeta_k + \epsilon_k$, where $\epsilon_k \sim \mathcal{N}(0, \sigma_k^2)$. Then a conjugate prior can consist of a multivariate normal-inverse-gamma distribution:


Algorithm 4: Gaussian Thompson Sampling with unknown variance
Hyperparameters: $\alpha \in \mathbb{R}$
Play each arm $\max\{2, 2 - \lceil 2\alpha \rceil\}$ times and store the rewards
for $t > \max\{2, 2 - \lceil 2\alpha \rceil\}$ do
    for each arm $k$ do
        Sample $\tilde{\Theta}_{k,t}$ from $\pi_{k,t_k} = f_{t_k + 2\alpha - 1}$
        Compute $\mu_{k,t_k} = \frac{1}{t_k} \sum_{i=1}^{t_k} r_{k,i}$ and $S_{k,t_k} = \sum_{i=1}^{t_k} \left( r_{k,i} - \mu_{k,t_k} \right)^2$
        Get $\tilde{\mu}_{k,t} = \mu_{k,t_k} + \tilde{\Theta}_{k,t} \sqrt{\frac{S_{k,t_k}}{t_k (t_k + 2\alpha - 1)}}$
    end
    Compute $K_t = \arg\max_k \tilde{\mu}_{k,t}$
    Choose action $a_{K_t}$ and observe reward $r_{t,K_t}$
end

\[
\pi(\zeta, \sigma^2) = \pi(\zeta \mid \sigma^2)\, \pi(\sigma^2) = \text{N-}\Gamma^{-1}(\alpha, \beta, \mu, \Lambda)
\]

where $\sigma^2$ and $\zeta \mid \sigma^2$ follow respectively an Inverse-Gamma and a Normal distribution:

\[
\sigma^2 \sim \Gamma^{-1}(\alpha, \beta) \quad \text{and} \quad \zeta \mid \sigma^2 \sim \mathcal{N}(\mu, \sigma^2 \Lambda^{-1})
\]

The update of the parameters after observing $X = (X_1, \dots, X_{t_k})$ and $R = (r_1, \dots, r_{t_k})$ is given for arm $k$ by:

\[
\begin{aligned}
\alpha_{t_k} &= \alpha_0 + t_k/2 \\
\beta_{t_k} &= \beta_0 + \tfrac{1}{2}\left( R^T R + \mu_0^T \Lambda_0 \mu_0 - \mu_{t_k}^T \Lambda_{t_k} \mu_{t_k} \right) \\
\mu_{t_k} &= \Lambda_{t_k}^{-1} \left( \Lambda_0 \mu_0 + X^T R \right) \\
\Lambda_{t_k} &= X^T X + \Lambda_0
\end{aligned}
\tag{2.4}
\]

Riquelme et al. propose $\alpha_0 = \beta_0 = \eta \ge 1$, $\mu_0 = 0$, and $\Lambda_0 = \lambda \mathrm{Id}$ as the choice of prior hyperparameters.

By using these formulas, we can create a Thompson Sampling algorithm for Linear Contextual Bandits, whose pseudo-code is available in Algorithm 5.

However, a linear relationship between context and reward may lack representational power in the general case. Unfortunately, extending these formulas to the non-linear case can be very tedious. Furthermore, computing correlation matrices and their inverses becomes intractable in Bayesian Networks [37].

One possibility to avoid using Bayesian Networks while achieving good representational power is to use a feed-forward neural network to obtain a latent representation of the data, and to perform the linear Bayesian inference on this latent variable $Z_\Phi(X)$.


Algorithm 5: Contextual TS (Bayesian Linear Regression)
Hyperparameters: $\eta \in [1, +\infty)$, $\lambda \in \mathbb{R}$
Put $\alpha_{k,0} = \beta_{k,0} = \eta$, $\mu_{k,0} = 0$, $\Lambda_{k,0} = \lambda \mathrm{Id}$
for $t \ge 1$ do
    Observe the context $X_t$
    $\forall k$, sample $\tilde{\sigma}_{k,t}^2$ from $\pi_{k,t_k}(\tilde{\sigma}_k^2) = \Gamma^{-1}(\alpha_{k,t_k}, \beta_{k,t_k})$
    $\forall k$, sample $\tilde{\zeta}_{k,t}$ from $\pi_{k,t_k}(\tilde{\zeta}_k \mid \tilde{\sigma}_{k,t}^2) = \mathcal{N}(\mu_{k,t_k}, \tilde{\sigma}_{k,t}^2 \Lambda_{k,t_k}^{-1})$
    Predict rewards $\tilde{r}_{t,k} = X_t^T \tilde{\zeta}_{k,t}$
    Compute $K_t = \arg\max_k \tilde{r}_{t,k}$
    Choose action $a_{K_t}$ and observe reward $r_{t,K_t}$
    $\forall k$, update $\alpha_k$, $\beta_k$, $\mu_k$, $\Lambda_k$ using Equations (2.4)
end
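
Below is a compact, illustrative sketch of Algorithm 5 on a synthetic linear problem, implementing the normal-inverse-gamma updates of Equations (2.4); the environment, the hyperparameters $\eta$ and $\lambda$, and the noise level are assumptions, not the thesis setup.

```python
# Linear Thompson Sampling per arm via Bayesian linear regression (Algorithm 5).
import numpy as np

def linear_ts(true_zeta, horizon=2000, eta=1.0, lam=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n_actions, d = true_zeta.shape
    X = [[] for _ in range(n_actions)]      # contexts observed per arm
    R = [[] for _ in range(n_actions)]      # rewards observed per arm
    for _ in range(horizon):
        x = rng.normal(size=d)              # observe context X_t
        preds = np.empty(n_actions)
        for k in range(n_actions):
            Xk = np.array(X[k]).reshape(-1, d)
            Rk = np.array(R[k])
            tk = len(Rk)
            # Posterior parameters, Equations (2.4), with mu_0 = 0, Lambda_0 = lam*I
            Lam = Xk.T @ Xk + lam * np.eye(d)
            mu = np.linalg.solve(Lam, Xk.T @ Rk) if tk else np.zeros(d)
            alpha = eta + tk / 2.0
            beta = max(eta + 0.5 * (Rk @ Rk - mu @ Lam @ mu), 1e-6)  # safeguard
            # Sample sigma^2 ~ InvGamma(alpha, beta), then zeta | sigma^2
            sigma2 = 1.0 / rng.gamma(alpha, 1.0 / beta)
            zeta = rng.multivariate_normal(mu, sigma2 * np.linalg.inv(Lam))
            preds[k] = x @ zeta             # Thompson sample of the reward
        k = int(np.argmax(preds))
        r = x @ true_zeta[k] + rng.normal(scale=0.1)
        X[k].append(x)
        R[k].append(r)
    return [len(r) for r in R]              # number of pulls per arm

print(linear_ts(np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])))
```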

As proposed in [38], we can use the last hidden layer of a neural network trained to predict the reward, and replace the last linear layer with a Bayesian linear regression. We can thus hope to get both the strong representational properties of a neural network and the uncertainty estimate of the Bayesian regression. In this approach, it can be useful to regularly retrain the network on previously seen data [37], and the update frequencies of the neural network and of the Bayesian regression are important hyperparameters of the algorithm. We will now refer to this technique as Last-Layer Bayesian Linear Regression (LLBLR). The pseudo-code is given in Algorithm 6.

Another, simpler possibility is to approximate Bayesian inference: this can be done using variational techniques [39] or Monte-Carlo estimation of the posterior [37], but we chose to focus on an even simpler technique using Dropout.

Dropout consists in randomly zeroing out ("dropping") some neurons' values in a neural network at each forward pass, which compels the neural network to use all neurons when learning a representation of the data [40]. It has been shown to improve the performance of neural networks by avoiding overfitting.

Gal et al. [41] have shown that, more than just a technique to avoid overfitting, dropout can be used to estimate the uncertainty of the model. Furthermore, they showed that a neural network using dropout before every layer is equivalent to an approximation of a Deep Gaussian Process [42], and that performing a forward pass in a dropout network amounts to sampling from a distribution minimizing its KL divergence with the true Bayesian posterior, which means implicitly performing Thompson Sampling.


Algorithm 6: Contextual TS (Last-Layer Bayesian Linear Regression)
Hyperparameters: $\eta \in [1, +\infty)$, $\lambda \in \mathbb{R}$, $f_{\mathrm{net}} \in \mathbb{N}$, $f_{\mathrm{BLR}} \in \mathbb{N}$
Put $\alpha_{k,0} = \beta_{k,0} = \eta$, $\mu_{k,0} = 0$, $\Lambda_{k,0} = \lambda \mathrm{Id}$
for $t \ge 1$ do
    Observe the context $X_t$
    Perform a forward pass to obtain the last hidden layer's value $Z_\Phi(X_t)$
    $\forall k$, sample $\tilde{\sigma}_{k,t}^2$ from $\pi_{k,t_k}(\tilde{\sigma}_k^2) = \Gamma^{-1}(\alpha_{k,t_k}, \beta_{k,t_k})$
    $\forall k$, sample $\tilde{\zeta}_{k,t}$ from $\pi_{k,t_k}(\tilde{\zeta}_k \mid \tilde{\sigma}_{k,t}^2) = \mathcal{N}(\mu_{k,t_k}, \tilde{\sigma}_{k,t}^2 \Lambda_{k,t_k}^{-1})$
    Predict rewards $\tilde{r}_{t,k} = Z_\Phi(X_t)^T \tilde{\zeta}_{k,t}$
    Compute $K_t = \arg\max_k \tilde{r}_{t,k}$
    Choose action $a_{K_t}$ and observe reward $r_{t,K_t}$
    if $t \bmod f_{\mathrm{net}} = 0$ then
        Retrain the neural network on the past data $X_{t' \le t}$, $r_{t' \le t}$
        Perform a forward pass on the past data to update $Z_\Phi(X_{t' \le t})$
    end
    if $t \bmod f_{\mathrm{BLR}} = 0$ then
        Refit the BLR using $Z_\Phi(X_{t' \le t})$, $r_{t' \le t}$ and Equations (2.4)
    end
end

The performance of this algorithm depends tightly on the type of non-linearity used [41] and on the dataset it is applied to [37].
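
As a toy illustration of this idea (not the thesis implementation), the following sketch keeps dropout active at decision time, so that each forward pass of a small two-layer network yields one sample of the per-action reward estimates; the network sizes and the random, untrained weights are placeholders.

```python
# Dropout kept active at inference time as an implicit Thompson sample.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_actions = 8, 32, 7
W1 = rng.normal(scale=0.3, size=(n_features, n_hidden))   # placeholder weights
W2 = rng.normal(scale=0.3, size=(n_hidden, n_actions))    # (would be trained)

def dropout_forward(x, p_drop=0.5):
    """One stochastic forward pass (inverted dropout kept active)."""
    h = np.maximum(x @ W1, 0.0)                  # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop          # drop hidden units at random
    h = h * mask / (1.0 - p_drop)
    return h @ W2                                # per-action reward estimates

context = rng.normal(size=n_features)
sampled_rewards = dropout_forward(context)       # one posterior-like sample
action = int(np.argmax(sampled_rewards))         # probability matching
print(action, np.round(sampled_rewards, 2))
```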


3 Method

3.1 Motivation for RL

As explained in the previous chapter, choosing an appropriate iOff is important in order to eventually maximize the throughput, especially for short transmissions, so that we can minimize the radio resources required for transmitting a unit amount of data.

As for any optimization problem for which there is no known analytic solution, it is worth questioning whether Machine Learning (ML) is a good tool for solving the problem. In this case:

• The objective is smooth in the iOff (increasing the iOff by a tiny bit increases the throughput by a tiny bit, at least until the offset is so high that the 10% BLER target can no longer be guaranteed), which is a good property for ML with gradient-descent approaches [43].

• Simulating transmissions under different scenarios is possible at the company, which makes errors during early learning more forgiving and data collection easier (although far from insignificant).

• ML labels could be collected using past experience from large transmissions during which OLLA converged to the optimal offset [9]. However, gathering enough samples for supervised learning to be efficient would require a considerable amount of simulation time, which is a bottleneck in this work. In addition, it is unclear how to compute the optimal iOff because it changes over time with the traffic: should it be the final value of the simulation, or the mean or median value over a certain period of time, with or without a moving window?


In summary, the problem is suited for ML, but we cannot easily build the labels, and consequently cannot use Supervised Learning (SL). Moreover, we might want to fine-tune the algorithm on-line for better performance, which is not possible with SL. For these reasons, Reinforcement Learning (RL) (see Section 2.2) could be appropriate for this task, assuming we are able to define a workable environment.

That said, SL could still be interesting here, and a detailed assessment of its relevance and performance would be worth making. But several points of concern constitute a motivation to try other methods. One idea is to change the definition of the optimization problem: instead of trying to optimize the iOff, we try to directly maximize the throughput, since this is what we want to achieve in the end anyway. In this configuration, for the reasons mentioned above, RL with throughput as the reward appears suitable for the task.

Here, a straightforward and naive RL procedure would be to collect the cell's state at a certain timestep $t_k$, denoting the arrival of the $k$-th user in the cell, map it to a value for the iOff of this user, and assess the effect of this choice by measuring the throughput in the cell between $t_k$ and $t_{k+1}$.

3.2 Difficulties and constraints

The vague problem definition and RL procedure proposed in section 3.1 need to be made much more precise in order to arrive at an implementation. Besides, the precise problem definition must also be guided by the constraints induced by working with non-synthetic data and an already existing simulator. The main difficulties we were confronted with are listed below:

• RL models need large amounts of data to train on, as explained in section 2.2.4, but the complexity of the environment makes the simulations very costly: 1 second of simulated time typically takes around 20 minutes of real time. Even with parallelization, getting millions or billions of data points is unrealistic.

• The complexity of the environment also affects the performance of the algorithms: it is often hard to generalize over different network settings, channel conditions and types of traffic.

• Common RL algorithms are inherently limited: for instance, it may be hard to ensure consistency of the states and rewards if UEs arrive in the network at random.


• The simulator is complex and modifying it to fit the needs of a chosen scenario was sometimes either time-consuming or not an option.

The next section takes these constraints into account to restrict the problem to one that is tractable within the given time.

3.3 Restriction of the Problem

Creating a very general model, able to learn and predict the desired quantities based on full information, is illusory in such a complex system. Instead, we may want to restrict the problem to a simpler one in order to obtain a tractable scenario. The choice of techniques follows from the identified scenario(s). Before listing such scenarios, we propose two relevant restrictions of the problem.

3.3.1 Global optimization

Although the initial offset problem is UE-specific, we can work at different scales: UE, cell, site, cluster, or even the whole network. At a larger scale, we can in theory better represent the influence of the "outside world" (e.g. interference from other cells) and the interactions between agents (e.g. how UEs affect each other's allocated resources).

However, this comes at the cost of very high complexity, mainly for three reasons.

• Firstly, because we may have to feed more information to the RL algorithm as input, which makes training longer and hurts generalization.

• Secondly, because when representing cross-agent dependencies we end up in multi-agent settings, in which agents need to coordinate their strategies. Algorithms which take care of this are more complex, less stable and typically take longer to train. Furthermore, the credit-assignment problem arises when it is unclear which agent caused a change in the reward. On the other hand, we can train UEs independently if we ignore the need for cross-UE coordination, but the environment then appears non-stationary from the perspective of one UE. This framework is called Independent Learners. The question is whether the non-stationarity will prevent learning from succeeding.


• Finally, defining the reward is tricky. For optimization at a larger scale, we want a "global" reward, e.g. a common reward in one cell/site/cluster/network. But this global reward comes from UE-specific decisions, which means it is necessarily an aggregation of UE-specific metrics. For example, if we choose throughput as the reward, this throughput is derived from each UE's throughput using an aggregation strategy (e.g. sum, mean, median, etc). The choice of this aggregation strategy leads to different results, as Figure 3.1 shows. In practice, this means that depending on the chosen aggregation strategy we do not prioritize the same UEs: using an arithmetic mean gives more weight to outliers such as UEs on the cell edge or with an uncommon device, while using a median ignores those and optimizes solely the behavior of the majority. Note that this choice is not only a technical choice, but also a design choice: for instance, we might want to assign a higher priority to certain types of traffic, or to enforce net neutrality between different types of users.

Now we must assess the gains of working at a larger scale. The experiment depicted in Figure 3.2 shows that the action chosen for one UE barely affects the reward of the other UEs. We can thus assume that independent learners will be stable enough and that we can restrict the problem to optimizing each UE's initial offset regardless of the others. Note that this assumption holds for low-load settings, in which the network is not congested.

Nevertheless, even in high-load settings, the effect is more likely to be an increase of the reward of other UEs. Indeed, improving the choice of initial offset reduces the amount of resources allocated to the UE, which allows for allocating more resources for the others. This type of influence is less likely to make independent learners unstable because the non-stationarity induced by the improvement of other agents does not conflict with their belief, but only increases the true value function over time. It might even compensate for the typical over-estimation of the Q-function mentioned in [44].

We only focus on low-load settings for this thesis though, and the detailed analysis of cross-UE dependencies in high-load settings is left for future work.

3.3.2 Variable network parameters

As stated above, we cannot expect to easily create an algorithm able to optimize the initial offset in the general case, where the number of UEs and cells, the type of traffic, the outside influence and the channel conditions all vary at the same time in unpredictable ways.


Figure 3.1: In one site with 20 UEs, we observe the common reward for different values of the initial offset (we assign the same offset to every UE in the cell in each experiment). The maximum reward is obtained for iOff = −7 dB with the median as aggregation strategy, and for iOff = −1 dB when we use the arithmetic or harmonic mean. The per-UE reward used here is described in section 3.4.1.

Indeed, if we put this information in the state, the algorithm will suffer from the curse of dimensionality, increasing training time and worsening performance, and if we do not, we introduce non-stationarity. For this reason, we chose to restrict the scope of the work to fixed network parameters:

• 1 site, 3 cells, 20 UEs

• Fixed burst traffic: each UE requests one 10 kB file every $t \sim \mathcal{T}$, where $\mathcal{T}$ is an exponential distribution with mean 0.02 s (a small sketch of this traffic model is given after the list).

• We observe the reward over a span of 0.5 s.
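
A small sketch of the traffic model assumed in the list above (illustrative code, not the simulator): each UE requests a 10 kB file at exponentially distributed inter-arrival times with mean 0.02 s, observed over a 0.5 s window.

```python
# Generate per-UE file-request timestamps for the fixed burst-traffic scenario.
import numpy as np

def burst_arrivals(mean_interarrival=0.02, window=0.5, n_ues=20, seed=0):
    """Return a list (one entry per UE) of request timestamps within the window."""
    rng = np.random.default_rng(seed)
    arrivals = []
    for _ in range(n_ues):
        t, times = 0.0, []
        while True:
            t += rng.exponential(mean_interarrival)   # next 10 kB file request
            if t > window:
                break
            times.append(t)
        arrivals.append(times)
    return arrivals

per_ue = burst_arrivals()
print(sum(len(t) for t in per_ue))   # roughly 20 * 0.5 / 0.02 = 500 requests
```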


Figure 3.2: In one site with 20 UEs, we keep every UE's initial offset fixed to 0 except for one test UE. We observe the per-UE reward for each UE as we modify the initial offset of the test UE. The latter's reward varies as expected (blue curve), but the other UEs' rewards almost never change (grey curves).

3.4 Definition of the problem

Thanks to the previous analysis, we identified four possible scenarios which would constitute interesting topics of research:

• A priori learning for common iOff: in this scenario, we want to set the same iOff for every UE in the cell. The goal is to learn a mapping from the parameters of a cell (number of UEs, traffic, outside interference, etc) to the ideal common iOff for this cell. We would learn this mapping offline and make one prediction per cell in the network. Supervised Learning would most likely be an appropriate technique for this task. However, we chose not to study this scenario, as RL does not seem to be the most appropriate technique and the gains are most likely limited.

• OLLA replacement: although this scenario does not belong exactly to the initial offset problem, it is worth noting that it is possible to formulate OLLA as an MDP, and that we can apply infinite-time-horizon RL techniques to solve it. In this scenario, we would collect features (including ACK/NACK) every TTI and modify the offset accordingly. We also did not study this scenario, as it lies outside the given problem.

• On-line cell-wide learning for common iOff: contrary to the first scenario, we want to learn the common iOff on-line: we do not collect the cell's parameters but consider them part of the environment, and we want to learn the optimal common iOff for this particular cell as quickly as possible. This problem can be formulated as a Multi-Armed Bandit problem.

• Per-UE iOff learning: in this last scenario, the agent is located at the base station but takes decisions for each UE based on UE-specific features. We can either learn off-line and deploy the algorithm in every cell for inference, learn on-line so that the geography of the cell (obstacles, reflections, etc) is fixed, or use a mixed strategy consisting of an off-line phase followed by an on-line fine-tuning phase. Since we now have different states, but each UE's arrival corresponds to a new i.i.d. experiment, the appropriate family of techniques is Contextual Bandits.

3.4.1 On-line Cell-wide learning for common iOff

Every time a UE arrives in the cell, the agent is requested to choose an iOff for this UE, and observes the effect of its choice during the next 500 ms.

• The action-space is a priori continuous, and actions can typically take values between −10 dB and 10 dB. However, in order to speed up the learning process, we chose to discretize it into 7 actions: $\mathcal{A} = \{2, 0, -2, -4, -6, -8, -10\}$ (in dB).

• The reward is a UE-specific reward that can be referred to as "Perceived Goodput":

\[
R = \frac{1}{N_{\mathrm{TTI}}} \sum_{i=1}^{N_{\mathrm{TTI}}} TBS_i \cdot \mathrm{ACK/NACK}_i \tag{3.1}
\]

where $N_{\mathrm{TTI}}$ is the number of busy TTIs, i.e. TTIs in which data is transmitted, $TBS_i$ is the Transport Block Size, i.e. the number of bits transmitted during TTI $i$, and

\[
\mathrm{ACK/NACK}_i = \begin{cases} 1 & \text{if the transmission was acknowledged} \\ 0 & \text{otherwise} \end{cases}
\]
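
A minimal sketch of the "Perceived Goodput" reward of Equation (3.1), assuming per-TTI transport block sizes (in bits) and ACK flags are available; the numbers in the example are purely illustrative.

```python
# Perceived Goodput: average acknowledged bits per busy TTI (Equation 3.1).

def perceived_goodput(tbs_bits, acks):
    """Return the per-UE reward R given per-TTI block sizes and ACK flags."""
    assert len(tbs_bits) == len(acks)
    n_tti = len(tbs_bits)                       # busy TTIs only
    if n_tti == 0:
        return 0.0
    return sum(tbs * int(ack) for tbs, ack in zip(tbs_bits, acks)) / n_tti

# Three busy TTIs; the second transport block is not acknowledged.
print(perceived_goodput([8000, 12000, 8000], [True, False, True]))  # ~5333.3
```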
