order to prevent it from staying in one spot or walking around in circles indefinitely. Giving it a small negative reward at every time step provides the agent with an incentive to reach its goal as quickly as possible.
TABLE I
REWARDS GIVEN TO AGENTS IN DIFFERENT SITUATIONS

Situation                  Reward
Colliding with obstacle      -50
Colliding with agent         -50
Reaching its goal            +50
None of the above             -1
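As a minimal illustration, the reward scheme of Table I can be expressed as a simple lookup; the dictionary keys and function name below are illustrative rather than taken from the project code.

```python
# Sketch of the reward scheme from Table I (names are illustrative).
REWARDS = {
    "collided_with_obstacle": -50,
    "collided_with_agent": -50,
    "reached_goal": +50,
    "none_of_the_above": -1,
}

def reward_for(situation: str) -> int:
    """Return the reward associated with the outcome of one step."""
    return REWARDS[situation]
```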
B. Q-Learning for single agent case
For the case with a single agent, the start position of the agent and its goal are shown in Figure 3. A high value of γ has to be used to make sure that the agent values long-term rewards; γ is chosen as 0.95. Since the environment described by the MDP is deterministic, α is set to 1.
The Q-learning algorithm begins with the initialisation of the Q-table as a table of zeros. The Q-table, as Section II-B suggests, is used for storing the calculated values of the Q-function for each of the different encountered states s_t and every possible action a_t associated with that state. The state space for the scenario with only one agent is simply made up of every possible location in the warehouse (a ten-by-ten warehouse therefore results in a state space of 100 different possible states).
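A small sketch of this initialisation, assuming the four movement actions up, right, down and left (the action set is not spelled out in this section):

```python
import numpy as np

GRID_SIZE = 10          # ten-by-ten warehouse
N_ACTIONS = 4           # assumed action set: up, right, down, left

# One row per possible agent position, one column per action,
# initialised to zero as described above.
q_table = np.zeros((GRID_SIZE * GRID_SIZE, N_ACTIONS))

def state_index(row: int, col: int) -> int:
    """Map a warehouse position to its row index in the Q-table."""
    return row * GRID_SIZE + col
```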
Every episode consists of the agent wandering around the warehouse, one step at a time, until it collides with either a wall or an obstacle or until it reaches its goal.
At every step the agent can either pick an action at random, in an effort to explore the environment, or use the Q-table to select the action expected to give the maximum possible reward. The latter is referred to as exploitation. Whether the agent explores or exploits is determined randomly at every step, and the likelihood of each option being selected is governed by the exploration rate.
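The exploration/exploitation choice described above is commonly realised as an ε-greedy rule; the sketch below assumes the Q-table layout from the previous snippet.

```python
import numpy as np

rng = np.random.default_rng()

def choose_action(q_table: np.ndarray, state: int, epsilon: float) -> int:
    """Epsilon-greedy selection: explore with probability epsilon,
    otherwise exploit the Q-table by taking its highest-valued action."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))   # exploration
    return int(np.argmax(q_table[state]))            # exploitation
```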
When an action has been selected, the warehouse has to check for possible collisions, or whether the agent has reached its goal, and based on this information give it the appropriate reward in accordance with Section III-A.
After the reward for the committed action has been calculated, the algorithm uses Equation 4 to update the Q-value for the current state and action. The episode continues with the agent wandering around the warehouse until it either experiences a collision or reaches its goal. A new episode then begins and the procedure is repeated. After every completed episode the exploration rate is lowered, giving the agent a slightly greater tendency to pick an action from the Q-table rather than choosing one at random during the following episode.
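Assuming Equation 4 is the standard tabular Q-learning update, a single update step could look as follows (with α = 1 and γ = 0.95 as chosen above):

```python
def update_q(q_table, state, action, reward, next_state, done,
             alpha=1.0, gamma=0.95):
    """One tabular Q-learning update, assuming Equation 4 is the usual rule
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    A terminal step (collision or goal) has nothing to bootstrap from."""
    target = reward if done else reward + gamma * q_table[next_state].max()
    q_table[state, action] += alpha * (target - q_table[state, action])
```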
By using a mix of exploration and exploitation the agent iteratively updates the Q-table, aiming to obtain a Q-table that encodes optimal behaviour inside the environment. After a number of episodes, depending on the size of the warehouse, the exploration rate reaches zero and the agent relies solely on the Q-table to make its decisions. If the training procedure was successful the agent should now be able to act sensibly inside the warehouse. The algorithm is further explained using pseudocode in Algorithm 1.
Algorithm 1 Q-learning for single agent case
 1: Initialise Q-table
 2: for each episode do
 3:     for each step do
 4:         Decide between exploration and exploitation
 5:         if exploration then
 6:             Pick action at random
 7:         else if exploitation then
 8:             Pick action from Q-table
 9:         Check reward for committed action
10:         Update Q-table using Equation 4
11:         if agent collided or reached its goal then
12:             End episode
13:         else
14:             Take another step
15:     Reduce exploration rate
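For completeness, the sketch below puts the pieces of Algorithm 1 together in Python. The grid size, obstacle layout, start and goal positions, action set and episode count are illustrative assumptions rather than values taken from the report.

```python
import numpy as np

GRID = 10
N_ACTIONS = 4                                # assumed: up, right, down, left
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]
OBSTACLES = {(3, 3), (3, 4), (6, 6)}         # illustrative obstacle layout
START, GOAL = (0, 0), (9, 9)                 # illustrative start and goal

def step(pos, action):
    """Apply one move and return (new_pos, reward, done) following Table I."""
    r, c = pos[0] + MOVES[action][0], pos[1] + MOVES[action][1]
    if not (0 <= r < GRID and 0 <= c < GRID) or (r, c) in OBSTACLES:
        return pos, -50, True                # collided with a wall or obstacle
    if (r, c) == GOAL:
        return (r, c), 50, True              # reached its goal
    return (r, c), -1, False                 # ordinary step

q = np.zeros((GRID * GRID, N_ACTIONS))       # Q-table initialised to zeros
alpha, gamma, epsilon = 1.0, 0.95, 1.0
rng = np.random.default_rng(0)

for episode in range(500):                   # roughly the episode count in Table II
    pos, done = START, False
    while not done:
        s = pos[0] * GRID + pos[1]
        if rng.random() < epsilon:           # exploration
            a = int(rng.integers(N_ACTIONS))
        else:                                # exploitation
            a = int(np.argmax(q[s]))
        pos, reward, done = step(pos, a)
        s_next = pos[0] * GRID + pos[1]
        target = reward if done else reward + gamma * q[s_next].max()
        q[s, a] += alpha * (target - q[s, a])
    epsilon = max(0.0, epsilon - 1 / 500)    # lower the exploration rate each episode
```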
C. Q-learning for two agent case
The algorithm can easily be applied to the case with multiple agents if some minor adjustments are made. Since we are considering distributed optimisation, every agent should have its own Q-table. Additionally, for the agents to be able to account for other agents in the warehouse, the state space has to be expanded to include the positions of some or all of the other agents as well. In general, the size of the state space increases in accordance with
N_n = N_1^n,   (7)
where N_n represents the size of the state space for the case with n agents and N_1 equals the size of the state space for only one active agent (100 in this case).
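To make the growth concrete, a joint state for two agents can be formed by combining both positions, giving N_1^2 = 100^2 = 10 000 distinct states. The encoding below is a hedged sketch and not necessarily the one used in the project.

```python
GRID = 10
N_SINGLE = GRID * GRID          # 100 states for a single agent

def joint_state(own_pos, other_pos):
    """One possible joint-state encoding for the two-agent case:
    combine the agent's own position with the other agent's position,
    giving N_SINGLE ** 2 = 10 000 distinct states (cf. Equation 7)."""
    own = own_pos[0] * GRID + own_pos[1]
    other = other_pos[0] * GRID + other_pos[1]
    return own * N_SINGLE + other
```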
The parameters α and γ are kept unchanged from the single agent case, as are the start and goal locations for the first agent in the warehouse. For the case with two agents, which is the only multi-agent case with Q-learning considered in this project, the second agent starts off in the top left corner and has to reach its goal in the bottom right corner.
In every episode the agents take turns performing actions.
When calculating the reward for a committed action it is now also necessary to check for collisions with the other agents. It should be noted that only the agent which caused the collision is given a negative reward; the victim of the collision is not penalised in any way. An episode ends when both agents have collided with something, when both agents have reached their goals, or when one agent has collided with something whilst the other agent has reached its goal. One possible form of this reward check is sketched below.
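The function name and argument layout in this sketch are illustrative and not taken from the project code; it computes the reward for the agent that just moved, so only the mover can be penalised for a collision.

```python
def reward_for_move(new_pos, other_pos, goal, obstacles, grid=10):
    """Reward for the agent that just moved. The other agent is stationary
    during this turn, so only the mover can cause, and be penalised for,
    a collision (cf. Table I)."""
    out_of_bounds = not (0 <= new_pos[0] < grid and 0 <= new_pos[1] < grid)
    if out_of_bounds or new_pos in obstacles:
        return -50          # collided with a wall or an obstacle
    if new_pos == other_pos:
        return -50          # collided with the other agent; only the mover is penalised
    if new_pos == goal:
        return +50          # reached its goal
    return -1               # ordinary step
```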
Just like in the single agent case the exploration rate is reduced after every completed episode. It should, however, be noted that since the multi-agent cases entail considerably larger state spaces, the training procedure demands many more episodes before convergence, as Table II shows.
TABLE II
A COMPARISON OF THE APPROXIMATE NUMBER OF EPISODES NEEDED FOR CONVERGENCE IN EACH OF THE DIFFERENT CASES

Case          Q-learning    Deep Q-learning
One agent            500                150
Two agents       250 000              1 200
Four agents            —              1 600
An analysis of how well the algorithm performed in each of the different cases after convergence was carried out. This was done by simply noting how often any of the agents crashed during the first one thousand episodes after the algorithm had, with certainty, converged. The results of this inquiry can be found in Table III.
TABLE III
COLLISION STATISTICS FOR THE DIFFERENT REINFORCEMENT LEARNING ALGORITHMS AFTER CONVERGENCE

Case          Learning method    Collisions [%]
One agent     Q-learning                    0.0
              Deep Q-learning               0.2
Two agents    Q-learning                    0.1
              Deep Q-learning              12.5
Four agents   Q-learning                      —
              Deep Q-learning               9.9
V. DISCUSSION
The main objective of the project was to simulate the system in different scenarios with varying numbers of agents operating at once. Distributed optimisation in multi-agent systems through reinforcement learning and deep reinforcement learning is examined separately below. Overall the results were consistent and clear conclusions could be drawn.
A. Q-learning
Figure 4 shows that Q-learning works very well in this case, as the agent successfully learns the optimal policy. It converges in about 400-500 episodes, which is deemed reasonable given the number of possible states and actions. Taking a closer look at Figure 4 and Table III, we can see that the post-convergence performance is good, with no collisions occurring (0.0 % of episodes) once the optimal policy has been reached and the exploration rate has been lowered close to zero.
The main difference when moving to a multi-agent dynamical system is the increased state space: since every agent now also has to take the other agents into account, the size of the state space grows exponentially (for four agents, Equation 7 already gives 100^4 = 10^8 states). The Q-tables quickly become too large to work with, seemingly making Q-learning impractical for multi-agent dynamical systems such as the ones considered in this project. The increase in necessary calculations also means that more time is needed to carry them out. With more powerful computers this problem can always be dealt with to some extent, but we believe that switching to an alternative method is an overall better approach.
We do, however, believe it should be noted that even though the computation times and the number of episodes necessary for convergence quickly become high for regular Q-learning in multi-agent systems, the results show that post-convergence performance is great, with collisions occurring in only 0.1 % of episodes. The negative spike at about 400 000 episodes in Figure 6 can be explained by the fact that the exploration rate never actually reaches zero, which means that there is still a very small chance that the agent chooses an action at random even after the optimal policy has been reached, potentially making it run into a wall or the other agent. This is not considered a flaw of the algorithm, as it is simply a consequence of how the exploration aspect was implemented. If the exploration rate were actually set to zero, instead of only approaching it without ever reaching it, the agent would act strictly according to the found optimal policy.
We also found some difficulties in deciding how quickly ε should be decreased. If ε is decreased too slowly the algorithm will take an unnecessarily long time to converge. On the other hand, reducing it too quickly would not give the agents enough time to explore the environment, and thus the optimal policy will not be found. To solve this problem it might be a good idea to introduce a feedback loop and lower ε based on the agents' performance, making the agents determine their own actions to a higher degree as they get progressively smarter; one possible realisation is sketched below.
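The window size and thresholds in this sketch are purely illustrative: ε is only lowered once the agents' recent success rate (goals reached per episode) is high enough.

```python
from collections import deque

def adaptive_epsilon(epsilon, recent_outcomes, window=100,
                     success_threshold=0.8, decay=0.99, floor=0.0):
    """Lower the exploration rate only when the success rate over the last
    `window` episodes is high enough (illustrative thresholds)."""
    if len(recent_outcomes) >= window:
        success_rate = sum(recent_outcomes) / len(recent_outcomes)
        if success_rate >= success_threshold:
            epsilon = max(floor, epsilon * decay)
    return epsilon

# Usage: keep a sliding window of 1/0 outcomes (goal reached or not)
outcomes = deque(maxlen=100)
```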
Even though a Q-learning solution would technically be possible in the four agent case as well, no such simulations were done because they would take too long to compute. We also believe that such a simulation would not be necessary, as the point has already been made: regular Q-learning works well for smaller, less complex systems but is not feasible in the complex dynamical systems considered in this report.
B. Deep Q-learning
Even in the simple case with only one active agent the differences between deep Q-learning and regular Q-learning become apparent. The deep Q-learning solution converges considerably quicker, as shown in Table II, and since post-convergence performance in terms of unwanted collisions is similar, deep Q-learning clearly stands out as the better approach.
In the dual-agent case the lower number of episodes necessary for the deep Q-learning algorithm to converge is even more apparent. According to Table II the regular Q-learning algorithm takes about 250 000 episodes to converge while deep Q-learning only needs approximately 1 200, which opens up the possibility for it to be considered a feasible algorithm. The other interesting metric to consider is the post-convergence performance, which for the dual-agent deep Q-learning algorithm is not as good as in the earlier cases. According to Table III the deep Q-learning algorithm still collides in 12.5 % of episodes after it has converged. This is not acceptable behaviour, and in this respect the deep Q-learning algorithm is considerably worse than the regular Q-learning algorithm.