
Distributed Optimisation in Multi-Agent Systems Through Deep Reinforcement Learning

Andreas Eriksson and Jonas Hansson

Abstract—The increased availability of computing power has made reinforcement learning a popular field of science in recent years. Reinforcement learning has lately been used in applications such as decreasing energy consumption in data centres, diagnosing patients in medical care, and text-to-speech software. This project investigates how well two different reinforcement learning algorithms, Q-learning and deep Q-learning, can be used as high-level planners for controlling robots inside a warehouse. A virtual warehouse was created, and the two algorithms were tested. The reliability of both algorithms was found to be insufficient for real-world applications, but the deep Q-learning algorithm showed great potential and further research is encouraged.

Index Terms—Reinforcement learning, distributed, optimisation, Q-learning, deep Q-learning, warehouse robots, artificial neural networks.

TRITA number: TRITA-EECS-EX-2019:127

I. INTRODUCTION

The term Machine Learning was coined in 1959 by Arthur Samuel, an American engineer and pioneer within the field of computer gaming and artificial intelligence [1]. Reinforcement learning, an area of machine learning, has recently been gaining popularity due to the rise in parallel computing power, which has made deep learning algorithms feasible.

Reinforcement learning algorithms have in particular had a lot of recent success in games. One example of the growing popularity of reinforcement learning is the artificial intelligence (AI) AlphaGo, developed by Google DeepMind, which became the first AI to beat a professional Go player [2].

Aside from developing AIs capable of beating professional human players at a multitude of different video and board games, Google DeepMind is also developing AIs to be used in more useful applications such as in text-to-speech software, used in virtual assistant software like Google Assistant, or more recently through a research programme with the University College London Hospital, where they are developing an algorithm capable of differentiating between healthy and cancerous tissue in the head and neck [3]. Google DeepMind also claims to have taken advantage of its AI in an effort to reduce its energy consumption in data centres. According to [4], they were able to reduce their data centre cooling bill by as much as 40%, which shows great potential if the same technology were to be used by data centres all over the world.

The purpose of this project is to examine whether reinforcement learning and deep reinforcement learning algorithms can be used for optimisation in complex dynamical systems. The specific scenario which is considered is a virtual warehouse with multiple warehouse robots (agents) operating within it at once.

Both of the reinforcement learning algorithms are tested in cases with one, two, and four agents respectively. A maximum of four robots operating inside the virtual warehouse at once was decided upon, since this should be sufficient to test the multi-agent aspect of the project without making the warehouse needlessly intricate. Mainly, the algorithms' performance is examined as the number of agents is increased, but the project also looks at how the algorithms can be implemented to fit well into a distributed system where every agent operates by itself and there is no central computer controlling all of the agents.

II. PRELIMINARIES

A. Markov Decision Process

A Markov decision process (MDP) is a stochastic control process that functions as a tool to mathematically model decision making [5]. This has made the MDP one of the building blocks of modern reinforcement learning. An MDP works by dividing a system into a finite number of discrete time stages. At each stage, the state of the system is observed, containing all the information that the decision maker (agent) needs in order to choose an action. After an action has been taken by the agent at time t, two things happen: the system evolves to a possibly different state at time t + 1, and the agent receives an immediate reward.

The action that decides the transition from the state at time t to the next state at time t + 1 is denoted a_t. The immediate reward obtained when transitioning from one state to another can be expressed by the reward function r(s_t, s_{t+1}) [6], where the received reward (or cost) between time t and t + 1 depends on the current state s_t and the transitioned state s_{t+1}. The information exchange between the agent and the environment is illustrated in Figure 1.

Fig. 1. Flowchart explaining the principle of an MDP: the agent takes action a_t in state s_t, and the environment returns the next state s_{t+1} and reward r_{t+1}.

The cumulative reward that the agent collects when transitioning between states can be expressed by the return function. A parameter γ (the discount rate), which functions as a weight between immediate and future reward, is added to the return function. The discounted return from time t to the terminal state at time T is given by the following expression

G_t = \sum_{k=0}^{T} γ^k r(s_{t+k+1}, s_{t+k+2}),    (1)

where 0 ≤ γ ≤ 1 and T is the final time step [7].
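As a small numerical illustration of Equation (1), the sketch below computes the discounted return of a hypothetical four-step episode; the rewards (−1 per ordinary step, +50 at the goal) are made up here for the example and only anticipate the reward scheme used later in this project.

```python
# Illustrative only: discounted return for a hypothetical four-step episode.
gamma = 0.95
rewards = [-1, -1, -1, 50]  # assumed step rewards: three ordinary moves, then the goal

G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)  # -1 - 0.95 - 0.9025 + 42.86875 = 40.01625
```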

For the agent to be able to know how beneficial it is to be in a given state, a value function v(s_t) is defined. The value function is an indicator of the expected future return that the agent can expect when in state s_t. The cumulative reward that the agent can expect in the future depends on what actions the agent takes. Therefore, value functions are defined with respect to the way the agent behaves, called policies. We can define the state-value function for policy π with the following equation

v_π(s_t) = E_π[G_t | s_t],    (2)

where E_π[·] denotes the expected value of a random variable given that the agent follows policy π [8].

A policy π is defined to be better than or equal to a policy π' if its expected return is greater than or equal to that of π' for all states. The policy (or policies) that is better than or equal to all other policies is called an optimal policy. Among all possible value functions there exists one optimal value function v_*(s_t) that follows the optimal policy. The optimal value function is given by

v_*(s_t) = \max_π v_π(s_t),    (3)

for all states s_t [9].

B. Reinforcement Learning

Reinforcement learning is a discipline of machine learning based on trial and error. The agent starts exploring the environment with no knowledge of it or any explicit instructions of what to do. Thus, the agent is forced to learn which actions would be best for any given state, by trying all possible actions at every state and observing the outcome in terms of the obtained reward.

Every time the agent takes an action it receives an immediate reward based on whether the action taken turned out to be good or bad. The next time the agent finds itself in a similar situation it should pick an action with its previous experiences in mind and therefore do better than last time. The long-term goal of the agent is to try to maximise its possible future reward.

Q-learning, introduced in [10], is one such reinforcement learning algorithm and is the one used in this report. It is a model-free temporal difference learning algorithm, meaning that it does not utilise the transition probabilities associated with the MDP directly. Instead it uses a learned action-value function Q for updating and determining its behaviour. According to [11] the Q-function is given by

Q(s_t, a_t) = Q(s_t, a_t) + α[r_{t+1} + γ \max_a Q(s_{t+1}, a) − Q(s_t, a_t)],    (4)

which is used iteratively to approximate the optimal action-value function

Q_*(s_t, a_t) = \max_π Q_π(s_t, a_t),    (5)

where Q_π is the action-value function for the policy π being followed. It should, however, be noted that Q-learning is an off-policy learning algorithm: it uses a greedy policy to determine the Q-values for upcoming states rather than the policy that is actually being followed. The optimal action-value function can be written in terms of V_* according to

Q_*(s_t, a_t) = E[r_{t+1} + γ V_*(s_{t+1}) | s_t, a_t].    (6)

For practical implementations of Q-learning the values calculated by the Q-function are stored in a data structure usually referred to as the Q-table, which has one element for every possible state-action pair.
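As a minimal sketch of how Equation (4) and the Q-table can be realised in Python (the variable names and the NumPy table layout are our own illustration, not the exact implementation used in this project):

```python
import numpy as np

n_states, n_actions = 100, 5          # ten-by-ten grid, five actions
alpha, gamma = 1.0, 0.95              # learning rate and discount factor used in this project

Q = np.zeros((n_states, n_actions))   # Q-table with one entry per state-action pair

def q_update(s, a, r, s_next):
    """One iteration of Equation (4): move Q(s, a) towards the bootstrapped target."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```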

C. Artificial Neural Networks

An artificial neural network (ANN) is a type of computing framework influenced by the structure of the biological neural networks found in animal brains. Artificial neural networks are made up of a large number of neurons, or nodes, interconnected in a layer-like structure consisting of an input layer and an output layer with some number of hidden layers in between.

Every neuron in the network receives an input in the form of one or more values, depending on the number of preceding connections and the respective weight of each connection, and uses a nonlinear activation function as well as a bias to compute an output value based on its inputs, which is then propagated deeper into the network.

Without these nonlinear activation functions the neural network would simply turn into a linear regression model, see [12]. Even though linear models can be used efficiently and reliably in simpler cases, they are limited to linear functions and thus cannot be used in more complex systems like the ones covered by this report.

For any given input data the ANN will produce an output largely dependent on the different weights of the connections between the neurons. The training procedure for the ANN consists of trying to improve the network's performance by tailoring all of these respective weights. This can be done through the use of error back-propagation where, for any given action, the difference between the expected output of the network and its actual output is calculated by a loss function.

The calculated error is then propagated backwards through the network, and used to adjust the weights and biases in the network through the use of an optimiser to try to lower future errors.

For the deep learning aspects of this project a deep feed-forward neural network is used. A feed-forward neural network, as is further explained in [12], means that information is only passed one way and there are no feedback connections in the form of loops in the network. As illustrated in Figure 2, the network that is used is fully connected, meaning that every neuron (excluding the neurons in the output layer) is connected to every neuron in the following layer.


Fig. 2. The general structure of a deep feed-forward neural network, one type of an artificial neural network.

The Adam optimiser, see [13], is used for training the neural network. According to [14], Adam is a gradient-based optimisation algorithm which works well in non-stationary settings and should therefore work well for the types of systems considered by this report. The loss function used is the mean square error between the expected output and the actual output.

D. Deep Reinforcement Learning

Many aspects of deep reinforcement learning are similar to regular reinforcement learning. The deep reinforcement learning algorithm used in this project is deep Q-learning, which builds on the Q-learning algorithm previously discussed in Section II-B. It still learns how to act optimally inside the environment through trial and error just like regular Q-learning, but rather than using the Q-function, see Equation 4, to update the entries of the Q-table one by one, it uses an artificial neural network in order to handle multiple adjacent states at once.

Like for most other deep reinforcement learning algorithms, the information passed on to the input layer of the network corresponds to the agent’s current state in the environment, while the output layer contains the different values calculated by the Q-function for all of the possible actions associated with that certain state.

One main advantage of deep Q-learning compared to regular Q-learning is the ability to handle cases where the agent finds itself in a previously unencountered state which is adjacent to an earlier encountered one. Because deep Q-learning operates on a continuous state space, it is still able to choose a reasonable action. Since the state space used in regular reinforcement learning algorithms is discrete, such algorithms cannot handle these kinds of cases at all.

III. METHOD

A. Environment

The environment is composed of a simulated virtual warehouse made up of a ten-by-ten grid containing in total 100 available spaces, reflecting the different possible states. The available actions for each state in the warehouse environment are to either move a single step in one of four directions (up, right, down, left) or to remain in the current position without moving. The warehouse environment can be described using an MDP; however, as explained in Section II-B, the transition probability matrix in this case contains only ones and zeros, since each action deterministically takes the agent from one state to a single next state.

All parts of the project were implemented and simulated using the programming language Python. Shown in Figure 3 is a graphical overview of the virtual warehouse for a single agent case implemented with the TkInter package, Python’s de-facto standard graphical user interface package.

The warehouse is made up of a ten by ten grid containing some arbitrarily placed obstacles (illustrating the possible placement of e.g. warehouse shelves or walls). The warehouse also keeps track of the starting positions for the agents as well as the locations of the goals where the respective agents are supposed to go.

Fig. 3. A graphical overview of the simulated virtual warehouse with obstacles as well as start (lower left corner) and goal (top right corner) locations for the one agent case.

Similar to what was used by Google DeepMind in [15], a parameter ε, referred to as the exploration rate, is used to determine whether the agent should pick an action at random (exploration) or if it should rely on its previous experiences and decide upon the action which the agent believes would be the best option for maximising the possible future reward (exploitation).

The probability for the agent deciding to explore the envi- ronment is given directly by ε while the probability for the agent deciding to opt for exploitation instead is 1 − ε.

The exploration rate is initialised as 1.0 in order to promote early exploration of the environment. Over time the exploration rate is incrementally decreased as the agent gets smarter and becomes increasingly able to decide upon the most promising action for its current state. As the number of elapsed episodes increases, ε approaches zero.
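A minimal sketch of this ε-greedy selection with a decaying exploration rate is shown below; the q_table is assumed to be a NumPy array as in the earlier sketch, and the decay factor is an assumption for illustration only, since the report does not state the exact schedule.

```python
import random

epsilon = 1.0          # start fully exploratory
epsilon_decay = 0.999  # assumed per-episode decay factor, for illustration only

def choose_action(state, q_table):
    """Explore with probability epsilon, otherwise exploit the learned Q-values."""
    if random.random() < epsilon:
        return random.randrange(5)        # pick one of the five actions at random
    return int(q_table[state].argmax())   # greedy action from the Q-table

# After every completed episode the exploration rate is reduced:
# epsilon *= epsilon_decay
```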

If an agent collides with an obstacle or another agent it receives a negative reward, while reaching its goal results in a positive reward. The specific values for these different situations are shown in Table I. If neither of these events occurs the agent receives a minor negative reward of −1 points in order to prevent it from staying in one spot or walking around in circles indefinitely. By giving it a small negative reward at every time step the agent gets an incentive to try to reach its goal as quickly as possible.

TABLE I
REWARDS GIVEN TO AGENTS IN DIFFERENT SITUATIONS

Situation                 Reward
Colliding with obstacle   -50
Colliding with agent      -50
Reaching its goal         +50
None of the above          -1
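The reward scheme in Table I can be expressed as a small helper function; the sketch below is our own illustration of that mapping, not code from the project.

```python
def reward(collided_with_obstacle, collided_with_agent, reached_goal):
    """Return the reward from Table I for the outcome of a single step."""
    if collided_with_obstacle or collided_with_agent:
        return -50
    if reached_goal:
        return 50
    return -1  # small penalty for every ordinary step
```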

B. Q-Learning for single agent case

For the single-agent case the start position for the agent and its goal are shown in Figure 3. A high value for γ has to be used in order to make sure that the agent values long-term rewards; γ is chosen as 0.95. Since the environment depicted by the MDP is deterministic, α is set to 1.

The Q-learning algorithm begins with the initialisation of the Q-table as a table of zeros. The Q-table, as Section II-B suggests, is used for storing the calculated values of the Q-function for each of the different encountered states s_t and every possible action a_t associated with that state. The state space for the scenario with only one agent is simply made up of every possible location in the warehouse (a ten-by-ten warehouse therefore results in a state space of 100 different possible states).

Every episode consists of the agent wandering around the warehouse, one step at a time, until it collides with either a wall or an obstacle or until it reaches its goal.

For every step the agent can either pick an action randomly in an effort to explore the environment or use the Q-table to select the action which would give it the maximum possible reward. The latter is referred to as exploitation. Whether the agent should decide on exploration or exploitation is determined randomly for every step, and the likelihood for each of the options to be selected is determined by the exploration rate.

When an action has been selected the warehouse has to check for possible collisions or whether the agent has reached its goal and, based on this information, give it an appropriate reward in accordance with Section III-A.

After the reward for the committed action has been calculated the algorithm uses Equation 4 to update the Q-value for the current state and action. The episode continues with the agent wandering around the warehouse until it either experiences a collision or reaches its goal. A new episode then begins and the procedure is repeated. After every completed episode the exploration rate is lowered, resulting in the agent having a slightly greater tendency to pick an action from the Q-table rather than choosing one randomly during the following episode.

By using a mix of exploration and exploitation the agent iteratively updates the Q-table in an effort to obtain a Q-table which depicts optimal behaviour inside the environment. After a number of episodes, depending on the size of the warehouse, the exploration rate reaches zero and the agent relies solely on the Q-table to make good decisions. If the training procedure was successful the agent should now be able to act functionally inside the warehouse. The algorithm is further explained using pseudocode in Algorithm 1.

Algorithm 1 Q-learning for single agent case
 1: Initialise Q-table
 2: for each episode do
 3:     for each step do
 4:         Decide between exploration or exploitation
 5:         if exploration then
 6:             Pick action at random
 7:         else if exploitation then
 8:             Pick action from Q-table
 9:         Check reward for committed action
10:         Update Q-table using Equation 4
11:         if agent collided or reached its goal then
12:             End episode
13:         else
14:             Take another step
15:     Reduce exploration rate
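A compact Python sketch of Algorithm 1 is given below. It assumes a warehouse environment object `env` with a Gym-like `reset`/`step` interface and `n_states`/`n_actions` attributes; this interface is our own assumption, since the report does not specify how the environment is exposed in code. The step numbers in the comments refer to Algorithm 1.

```python
import random
import numpy as np

def train_q_learning(env, n_episodes, alpha=1.0, gamma=0.95,
                     epsilon=1.0, epsilon_decay=0.999):
    """Tabular Q-learning for the single-agent warehouse, following Algorithm 1."""
    Q = np.zeros((env.n_states, env.n_actions))   # step 1: initialise Q-table
    for _ in range(n_episodes):                   # step 2: for each episode
        s = env.reset()
        done = False
        while not done:                           # step 3: for each step
            if random.random() < epsilon:         # steps 4-8: exploration or exploitation
                a = random.randrange(env.n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)         # step 9: reward for the committed action
            # step 10: update the Q-table using Equation (4)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next                            # steps 11-14: continue until collision or goal
        epsilon *= epsilon_decay                  # step 15: reduce exploration rate
    return Q
```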

C. Q-learning for two agent case

The algorithm can easily be applied to the case with multiple agents if some minor adjustments are made. Since we are considering distributed optimisation, every agent should have its own Q-table. Additionally, for the agents to be able to account for other agents in the warehouse, the state space has to be expanded to include the positions of some or all of the other agents as well. In general the size of the state space increases according to

N_n = N_1^n,    (7)

where N_n represents the size of the state space for the case with n agents and N_1 equals the size of the state space for only one active agent (100 in this case).
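As a trivial illustration of Equation (7) with N_1 = 100:

```python
N1 = 100                  # states for a single agent in the ten-by-ten warehouse
for n in (1, 2, 4):
    print(n, N1 ** n)     # 100, 10 000 and 100 000 000 joint states
```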

The parameters α and γ are kept unchanged from the single-agent case, as are the start and goal locations for the first agent in the warehouse. For the case with two agents, which is the only multi-agent case with Q-learning that is considered in this project, the second agent starts off in the top left corner and has to reach its goal in the bottom right corner.

In every episode the agents take turns performing actions. When calculating the reward for a committed action it is now necessary to check for collisions with the other agents. It should be noted that only the agent which caused the collision is given a negative reward; the victim of the collision is not penalised in any way. An episode ends when both agents have collided with something, when both agents have reached their goals, or if one agent has collided with something whilst the other agent has reached its goal.

Just like in the single-agent case the exploration rate is reduced for every completed episode. However, it should be noted that since the multi-agent cases entail considerably larger state spaces, the training procedure will demand much more computational time for the algorithm to converge; the rate at which the exploration rate decays must therefore be decreased. The algorithm is also explained in Algorithm 2.

Algorithm 2 Q-learning for two agent case
 1: Initialise Q-tables
 2: for each episode do
 3:     for each step do
 4:         for each agent do
 5:             if agent has not collided then
 6:                 Decide between exploration or exploitation
 7:                 if exploration then
 8:                     Pick action at random
 9:                 else if exploitation then
10:                     Pick action from Q-table
11:                 Check reward for committed action
12:                 Update Q-table using Equation 4
13:         if all agents collided or reached their goal then
14:             End episode
15:         else
16:             Take another set of steps
17:     Reduce exploration rate

D. Deep Q-learning for general case

Practical implementations of deep reinforcement learning methods using artificial neural networks involve a lot of matrix multiplications. The deep Q-learning algorithms are therefore implemented using the Python packages TensorFlow and Keras, see [16] and [17] respectively, since these packages significantly facilitate the implementation of artificial neural network models.

Overall the algorithm for deep Q-learning, see Algorithm 3, does not differ much from the implemented algorithm for regular Q-learning, the main difference being the usage of a deep artificial neural network instead of a Q-table. Although these work in vastly different ways, the neural network essentially fills the same function as the Q-table.

The neural network is implemented as a Keras Sequential model consisting of one input layer with six neurons, three hidden layers containing 32 neurons each and finally an output layer with 5 neurons depicting the different possible actions. Using this setup the neural network has a total of 2 373 trainable parameters according to the summary() function provided by Keras.

The neurons in the hidden layers use rectified linear units as their activation functions, and the output layer uses a linear activation function in the form of the identity function [18]. As explained in Section II-C, the Adam optimiser is used for training the neural network and the loss function is the mean square error. The parameter γ is kept at 0.95 just like in the Q-learning algorithms, but α is changed to the default learning rate for the Adam optimiser (0.01).
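A sketch of how such a network can be defined with Keras is shown below. It follows the architecture described above, but the exact layer definitions used in the project are not given in the report, so this is an approximation rather than the project's code.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_dqn(n_inputs=6, n_actions=5, learning_rate=0.01):
    """Deep Q-network: three hidden ReLU layers of 32 neurons and a linear output layer."""
    model = Sequential([
        Dense(32, activation="relu", input_shape=(n_inputs,)),
        Dense(32, activation="relu"),
        Dense(32, activation="relu"),
        Dense(n_actions, activation="linear"),   # one Q-value estimate per action
    ])
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss="mse")
    return model
```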

The action space is the same for the deep Q-learning implementation as for the implementation of regular Q-learning.

When it is time to select an action the algorithm still has to randomly choose between picking an action at random or selecting one based on previous experiences, in which case the algorithm uses the current state as input to the deep Q-network (DQN) and in turn receives an estimate of the best available action.

It should be noted, however, that for the deep Q-learning implementation the state space is slightly altered from the regular Q-learning cases. Instead of all agents knowing the positions of every other agent at all times, an agent is only allowed to check its immediate vicinity for nearby agents when it is about to make a move. The state space is made up of a combination of the current agent's coordinates in the warehouse and four binary numbers representing the possible existence of an adjacent agent in any of the four directions (up, down, right, and left). This explains why the input layer of the neural network consists of six neurons regardless of the number of agents operating within the warehouse.
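A sketch of how such a six-element state vector might be assembled is given below; the function name, the argument types, and the direction convention are hypothetical and chosen only to illustrate the representation described above.

```python
import numpy as np

def build_state(agent_pos, other_agent_positions):
    """State = own (x, y) plus four flags for agents directly up, down, right and left."""
    x, y = agent_pos
    # Direction convention is illustrative; the report does not specify the grid axes.
    neighbours = [(x, y + 1), (x, y - 1), (x + 1, y), (x - 1, y)]
    flags = [1.0 if cell in other_agent_positions else 0.0 for cell in neighbours]
    return np.array([x, y, *flags], dtype=np.float32)
```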

The reward for the committed action is determined in the same way as in the regular Q-learning algorithms, which is explained in Section III-B and Section III-C. After the reward has been determined, the algorithm stores important data from the current time step in its memory. This mainly includes information about the current state, the committed action, the next state, and the assigned reward.

When the number of stored memories reaches a predetermined threshold, the stored memories are used to train the DQN through the use of the Keras function fit. The memory is then wiped clean in preparation for being filled with new memories.
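A sketch of this memory-and-train step is shown below, reusing the DQN from the earlier sketch and a simple list-based memory. The batch threshold and the target computation follow the standard deep Q-learning bootstrapping rule; the report does not state these details explicitly, so they are assumptions for illustration.

```python
import numpy as np

memory = []        # (state, action, reward, next_state, done) tuples
BATCH_SIZE = 32    # assumed threshold before a training pass

def remember_and_train(model, state, action, reward, next_state, done, gamma=0.95):
    """Store one transition; once enough are collected, fit the DQN and clear the memory."""
    memory.append((state, action, reward, next_state, done))
    if len(memory) < BATCH_SIZE:
        return
    states = np.array([m[0] for m in memory])
    next_states = np.array([m[3] for m in memory])
    targets = model.predict(states, verbose=0)          # current Q-value estimates
    next_q = model.predict(next_states, verbose=0)      # bootstrapped values for next states
    for i, (_, a, r, _, d) in enumerate(memory):
        targets[i, a] = r if d else r + gamma * np.max(next_q[i])
    model.fit(states, targets, epochs=1, verbose=0)     # train the DQN on the stored memories
    memory.clear()                                      # wipe the memory, as described above
```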

The locations of the agents' goals and starting positions in the cases of one and two agents are the same as in their respective Q-learning counterparts, which are explained in Section III-B and Section III-C. In the case with four agents operating at once, an agent starts in each of the four corners of the warehouse and has to reach its individual goal set in the corner diagonally opposite of each respective robot.

Algorithm 3 Deep Q-learning for general case
 1: Initialise deep Q-networks
 2: for each episode do
 3:     for each step do
 4:         for each agent do
 5:             if agent has not collided then
 6:                 Decide between exploration or exploitation
 7:                 if exploration then
 8:                     Pick action at random
 9:                 else if exploitation then
10:                     Pick estimated best action from DQN
11:                 Check reward for committed action
12:                 Add agent step info to memory
13:         for every batch of memories do
14:             Train DQN with data from recent memories
15:             Clear memories
16:         if all agents collided or reached their goals then
17:             End episode
18:     Reduce exploration rate


IV. RESULTS

The entire system was simulated multiple times in order to separately examine the feasibility of using Q-learning and deep Q-learning in each of the different cases with up to four agents operating at once inside the warehouse.

A. One agent

For the simple case with only one agent the warehouse was initially simulated using regular Q-learning. The algorithm was found to converge in approximately 500 episodes, as illustrated in Figure 4.

Fig. 4. Total reward accumulated over each episode for regular Q-learning with one active agent.

The case was later also simulated using deep Q-learning which, as shown in Figure 5, resulted in the system needing only about 150 episodes of training to converge successfully.

Fig. 5. Total reward accumulated over each episode with deep Q-learning and one active agent.

B. Two agents

The system was also simulated both for Q-learning and deep Q-learning in the case with two agents operating at once.

The results from these simulations are shown in Figure 6 and Figure 7. The Q-learning algorithm needed about 250 000 episodes to converge nicely while the deep Q-learning algorithm converged in approximately 1 200 episodes.

Fig. 6. Total reward accumulated over each episode with Q-learning with two active agents.

Fig. 7. Total reward accumulated over each episode with deep Q-learning with two active agents.

C. Four agents

For the case with four agents operating at once the system was only simulated using the deep Q-learning algorithm. The results are presented in Figure 8.

Fig. 8. Total reward accumulated over each episode with deep Q-learning with four active agents.

D. Overview

Table II shows the average number of episodes needed for convergence for each of the different simulated cases.


TABLE II
A COMPARISON OF THE APPROXIMATE NUMBER OF EPISODES NEEDED FOR CONVERGENCE IN EACH OF THE DIFFERENT CASES

Case           Q-learning    Deep Q-learning
One agent      500           150
Two agents     250 000       1 200
Four agents    —             1 600

An analysis of how well the algorithms performed in each of the different cases after convergence was carried out. This was done by simply noting how often any of the agents crashed during the first one thousand episodes after the algorithm had clearly converged. The results of this inquiry can be found in Table III.

TABLE III
COLLISION STATISTICS FOR THE DIFFERENT REINFORCEMENT LEARNING ALGORITHMS AFTER CONVERGENCE

Case          Learning method     Collisions [%]
One agent     Q-learning          0.0
              Deep Q-learning     0.2
Two agents    Q-learning          0.1
              Deep Q-learning     12.5
Four agents   Q-learning          —
              Deep Q-learning     9.9

V. DISCUSSION

The main objective of the project was to simulate the system in different scenarios with varying numbers of agents operating at once. Distributed optimisation in multi-agent systems through reinforcement learning and deep reinforcement learning is examined separately below. Overall the results were consistent and clear conclusions could be drawn.

A. Q-learning

Figure 4 shows that Q-learning works very well in this case, as the agent successfully learns the optimal policy. It converges in about 400-500 episodes, which is deemed reasonable given the number of possible states and actions. Taking a closer look at Figure 4 and Table III, we can see that its post-convergence performance is good, with no collisions occurring after it has reached the optimal policy and the exploration rate has been lowered close to zero.

The main difference when moving to a multi-agent dynamical system is the increased state space, since every agent now also has to take the other agents into account, which results in the size of the state space growing exponentially. The Q-tables quickly become too large to work with, seemingly making Q-learning impractical for multi-agent dynamical systems such as the ones considered by this project. The increase in the number of necessary calculations also increases computation time. With more powerful computers this problem can always be dealt with to some extent, but we believe that switching to an alternative method would be an overall better approach.

We do, however, believe it should be noted that even though the computation times and number of episodes necessary for convergence for regular Q-learning in multi-agent systems quickly become high, the results show that the post-convergence performance is great, with collisions occurring in only 0.1% of episodes. The negative spike found at about 400 000 episodes in Figure 6 can be explained by the fact that the exploration rate never actually reaches zero, which means that there is still a very small chance that the agent will choose an action at random, even after the optimal policy has been reached, potentially making the agent go into a wall or the other agent. This is not considered a flaw of the algorithm, as it is simply a consequence of the way the exploration aspect was implemented. If the exploration rate was actually set to zero, instead of only approaching zero without ever reaching it, the agent would act according to the found optimal policy.

We also found some difficulties in deciding how quickly ε should be decreased. If ε is decreased too slowly the algorithm will take an unnecessarily long time to converge. On the other hand, reducing it too quickly would not give the agents enough time to explore the environment, and thus the optimal policy will not be found. To solve this problem it might be a good idea to introduce a feedback loop and lower ε based on the agents' performance, letting the agents determine their own actions to a higher degree as they get progressively smarter.

Even though a Q-learning solution would technically be possible in the four-agent case as well, no such simulations were done because they would take too long to compute. We also believe that such a simulation would not be necessary, as the point has already been proven: regular Q-learning works well for smaller, less complex systems but is not feasible in the complex dynamical systems considered by this report.

B. Deep Q-learning

Even in the simple case with only one active agent the differences between deep Q-learning and regular Q-learning become apparent. The solution using deep Q-learning converges considerably quicker, as shown in Table II, and with similar performance regarding the number of unwanted collisions after the algorithms are considered to have converged, deep Q-learning clearly stands out as the better approach.

In the dual-agent case the lower number of episodes necessary for the deep Q-learning algorithm to converge is even more apparent. According to Table II the regular Q-learning algorithm takes about 250 000 episodes to converge while deep Q-learning only needs approximately 1 200, which opens up the possibility for it to be considered a feasible algorithm. The other interesting metric to consider is the post-convergence performance, which for the dual-agent deep Q-learning algorithm is not as great as in the earlier cases. According to Table III the deep Q-learning algorithm still collides in 12.5% of episodes after it has converged. This is not good behaviour, and in this respect the deep Q-learning algorithm is considerably worse than the algorithm for regular Q-learning.


Comparing the deep Q-learning algorithm's performance in the four- and two-agent cases is very interesting. The number of episodes necessary for convergence does not increase very much in the four-agent case. This is most likely because the size of the state space used by this algorithm does not increase in the same way as the state space used by the Q-learning algorithm does. This indicates that the algorithm has the potential to be feasible in even more complex systems with a higher number of agents operating at once. The algorithm is, however, slightly more reliable after convergence in the four-agent case.

We believe this is caused by the fact that we had to remove the lower obstacle in Figure 3 in order to get the algorithm to converge successfully. For both multi-agent cases it seems like the optimal policy is never actually found. We are not sure why this happens, but we believe it is most likely due to ε being decreased too quickly, which results in the agents making their choices based on incomplete knowledge of the environment. Decreasing ε more slowly would allow the agents to get a better understanding of their environment and thus allow them to make better decisions. It is also possible that the poor performance is caused by the neural network that is used. Suggestions for possible future work which might help investigate this are mentioned in Section VII.

The difference in the number of collisions can also to some extent be explained by the fact that the exploration rate never actually reaches zero, and thus there is still a chance for the agents to collide randomly even after convergence.

In the regular Q-learning algorithms every agent knows the positions of the other agents at all times. For the deep Q-learning algorithm every agent is instead only aware of its own position in the warehouse combined with the knowledge of whether any other agents are nearby; thus the size of the state space is constant regardless of the number of agents operating in the warehouse. For the case with two agents this approach results in a larger state space than if every agent simply knew the location of the other agent. In the long run, however, this should make it easier to scale the algorithm for use in systems with a higher number of agents, while at the same time making it more appropriate for decentralised distributed systems with minimal agent-to-agent communication. If the state space was made up of all the agents' locations, as initially done in the dual-agent case with Q-learning, there would be a need for a central computer tasked with constantly updating every agent with current information about the other agents' whereabouts. This would eliminate the need for distributed computing.

VI. CONCLUSIONS

Most importantly, the project showed that regular reinforcement learning, at least Q-learning, does not seem like an appropriate method to be used in complex dynamical systems with multiple agents operating at once, since the number of episodes necessary for convergence increases exponentially with the number of agents present. The study clearly demonstrated that a better alternative would be the usage of deep reinforcement learning algorithms such as deep Q-learning.

Furthermore, the algorithms developed showed great potential for being used in distributed solutions, as they require no communication between agents or with any central computer.

The algorithms developed in this project are, however, not deemed reliable enough to be used in most real-world scenarios, as they would lead to an unacceptable number of crashes. The findings of this report are nonetheless very convincing, and further inquiries into the usage of deep reinforcement learning algorithms for these kinds of complex distributed dynamical systems are enthusiastically encouraged.

VII. FUTURE WORK

Even though the algorithms developed in this project show great potential and above all illustrate the possibility of using deep reinforcement learning algorithms in situations involving complex distributed dynamical systems, they carry some notable flaws, mainly regarding their post-convergence performance and lack of reliability.

It would be interesting to find the root of this unsatisfactory behaviour and, in particular, how it can be improved since this is the main issue preventing these algorithms from being used in a real world scenario. The failure of such a system in the real world could have severe societal and economic consequences, and in the worst case lead to the loss of human life. Developing a stable and reliable algorithm is therefore critical in order to allow for safe practical use.

Lowering the ε decay rate, as mentioned in Section V-A, could lead to convergence for the developed deep Q-learning algorithms. This will slow down the algorithm in terms of how quickly it learns, but this is regarded as acceptable if it means that the algorithms are able to find the optimal policy.

In an effort to improve the deep reinforcement learning algorithms developed in this project, an inquiry into the possible benefits of tweaking the neural network is suggested. For example, changing the number of trainable parameters might have a noticeable impact on the neural network's ability to learn. This can be done by simply increasing the number of neurons and layers. Furthermore, the benefits of using different types of activation functions, loss functions, and optimisers could be worthwhile to investigate.

Making a comparison of different types of artificial neural networks or reinforcement learning algorithms in different scenarios would also be interesting. One potentially interesting algorithm is Double Deep Q-learning, which according to [19] is supposed to perform better than deep Q-learning in some cases.

VIII. ACKNOWLEDGEMENTS

The authors of this report would like to especially thank our supervisors Rijad Alisic and Mina Ferizbegovic for supporting us through the project. We would also like to thank the members of group C3b, Daniel Dalbom and Petter Eriksson, as well as their supervisors Takuya Iwaki and Yassir Jedra, for contributing to helpful discussions.


REFERENCES

[1] A. L. Samuel, "Some studies in machine learning using the game of checkers," IBM Journal of Research and Development, vol. 3, no. 3, pp. 71–105, 1959.
[2] (2017, May) Google's AlphaGo defeats Chinese Go master in win for A.I. [Online]. Available: https://www.nytimes.com/2017/05/23/business/google-deepmind-alphago-go-champion-defeat.html
[3] (2019, Apr.) Google DeepMind targets NHS head and neck cancer treatment. [Online]. Available: https://www.bbc.com/news/technology-37230806
[4] (2019, Apr.) DeepMind AI reduces Google data centre cooling bill by 40%. [Online]. Available: https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/
[5] R. S. Sutton, Reinforcement Learning: An Introduction, 2nd ed., ser. Adaptive Computation and Machine Learning Series. Cambridge, MA: The MIT Press, 2017, ch. 3, p. 37.
[6] ——, Reinforcement Learning: An Introduction, 2nd ed., ser. Adaptive Computation and Machine Learning Series. Cambridge, MA: The MIT Press, 2017, ch. 3, p. 38.
[7] ——, Reinforcement Learning: An Introduction, 2nd ed., ser. Adaptive Computation and Machine Learning Series. Cambridge, MA: The MIT Press, 2017, ch. 3, p. 44.
[8] ——, Reinforcement Learning: An Introduction, 2nd ed., ser. Adaptive Computation and Machine Learning Series. Cambridge, MA: The MIT Press, 2017, ch. 3, p. 46.
[9] ——, Reinforcement Learning: An Introduction, 2nd ed., ser. Adaptive Computation and Machine Learning Series. Cambridge, MA: The MIT Press, 2017, ch. 3, p. 50.
[10] C. J. Watkins and P. Dayan, "Technical note: Q-learning," Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992.
[11] R. S. Sutton, Reinforcement Learning: An Introduction, ser. Adaptive Computation and Machine Learning Series. Cambridge, MA: The MIT Press, 1998, ch. 6, p. 148.
[12] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, ch. 6, pp. 164–167, http://www.deeplearningbook.org.
[13] (2019, Apr.) Optimizers - Keras documentation. [Online]. Available: https://keras.io/optimizers/
[14] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the Eighth Nobel Symposium, 2014.
[15] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, "Deep exploration via bootstrapped DQN," in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 4026–4034. [Online]. Available: http://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn.pdf
[16] Google Brain, "TensorFlow: An end-to-end open source machine learning platform," 2019, [Accessed Apr. 25, 2019]. [Online]. Available: https://www.tensorflow.org/
[17] F. Chollet, "Keras," 2019, [Accessed Apr. 25, 2019]. [Online]. Available: https://keras.io/
[18] (2019, Apr.) Activations - Keras documentation. [Online]. Available: https://keras.io/activations/
[19] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in AAAI'16: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 2094–2100.
