Using Deep Reinforcement Learning For Adaptive Traffic Control in Four-Way Intersections


Department of Science and Technology (Institutionen för teknik och naturvetenskap)

LiU-ITN-TEK-A--19/027--SE

Using Deep Reinforcement Learning For Adaptive Traffic Control in Four-Way Intersections

Gustav Jörneskog

Josef Kandelan


LiU-ITN-TEK-A--19/027--SE

Using Deep Reinforcement Learning For Adaptive Traffic Control in Four-Way Intersections

Master's thesis in Electrical Engineering, carried out at the Institute of Technology, Linköping University

Gustav Jörneskog

Josef Kandelan

Supervisor: Evangelos Angelakis

Examiner: Nikolaos Pappas


Upphovsrätt (Copyright)

This document is made available on the Internet – or its future replacement – for a considerable time from the date of publication, provided that no extraordinary circumstances arise.

Access to the document implies permission for anyone to read, download, and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. A subsequent transfer of copyright cannot revoke this permission. All other use of the document requires the consent of the author. To guarantee authenticity, security, and accessibility, there are solutions of a technical and administrative nature.

The author's moral rights include the right to be mentioned as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or individuality.

For further information about Linköping University Electronic Press, see the publisher's website: http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/


Abstract

The consequences of traffic congestion include increased travel time, fuel consumption, and the number of crashes. Studies suggest that most traffic delays are due to nonrecurring traffic congestion. Adaptive traffic control using real-time data is effective in dealing with nonrecurring traffic congestion. Many adaptive traffic control algorithms used today are deterministic and prone to human error and limitation. Reinforcement learning allows the development of an optimal traffic control policy in an unsupervised manner. Most previous researchers have focused on maximizing the performance of the algorithm assuming perfect knowledge. We have implemented a reinforcement learning algorithm that only requires information about the number of vehicles and the mean speed of each incoming road to streamline traffic in a four-way intersection. The reinforcement learning algorithm is evaluated against a deterministic algorithm and a fixed-time control schedule. Furthermore, it was tested whether reinforcement learning can be trained to prioritize emergency vehicles while maintaining good traffic flow. The reinforcement learning algorithm obtains a lower average time in the system than the deterministic algorithm in eight out of nine experiments. Moreover, the reinforcement learning algorithm achieves a lower average time in the system than the fixed-time schedule in all experiments. At best, the reinforcement learning algorithm performs 13% better than the deterministic algorithm and 39% better than the fixed-time schedule. Finally, the reinforcement learning algorithm could prioritize emergency vehicles while maintaining good traffic flow.


Acknowledgments

We want to thank Cybercom Sweden AB for the opportunity to do our master's thesis together with them. We would specifically like to thank Johan Billman for giving us the opportunity to work with Cybercom and Johannes Westlund for the supervision. We would also like to thank our supervisor Evangelos Angelakis and examiner Nikolaos Pappas at Linköping University. Lastly, a big thank you to our colleagues at the Innovation Zone.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Method
  1.5 Delimitations
  1.6 Structure
2 Theory
  2.1 Reinforcement Learning
  2.2 Mathematical Formulation of Markov Decision Processes
  2.3 Q-learning
  2.4 Deep Q-learning
  2.5 Related Work
3 Method
  3.1 Environment
  3.2 Deterministic Algorithm
  3.3 Reinforcement Learning Algorithm For Isolated Four-Way Intersection
  3.4 Reinforcement Learning Algorithm For Green Wave
4 Results & Analysis
  4.1 Comparison Between the Algorithms for the Four-Way Intersection
  4.2 Relative Performance Between The Algorithms
  4.3 Reinforcement Learning for Prioritizing Emergency Vehicles While Maintaining Good Traffic Flow
  5.1 Collection of Data
  5.2 Different Technologies to Measure Relevant Data
6 Discussion
  6.1 Results
  6.2 Method
7 Conclusion


List of Figures

2.1 Agent and Environment Interaction.
3.1 Isolated four-way intersection.
3.2 Road for motorized vehicles with pedestrian crossings.
3.3 Action Timeline.
4.1 Average time in the system for p = 1/3, 1000 epochs.
4.2 Average time in the system during training for p = 1/3, 1000 epochs.
4.3 Average time spent in the intersection for p = 1/3, 2000 epochs.
4.4 Average time in the intersection during training of RL for p = 1/3, 2000 epochs.
4.5 Average time spent in the system with arrival rate p = 1, 2000 epochs.
4.6 Average time in the intersection during training of RL for p = 1, 2000 epochs.
4.7 Average Time in the System with Arrival Rate p = 1/5, 2000 epochs, 500 meters.
4.8 Average time in the system during training with arrival rate p = 1/5, 2000 epochs.
4.9 Average time in the system with arrival rate p = 1/5, 2000 epochs, 100 meters.
4.10 Average time in the system during training with arrival rate p = 1/5, 2000 epochs, 100 meters.
4.11 Relative Performance for Different Vehicle Distributions.
4.12 Relative Performance for Different Arrival Rates.
4.13 Average time for the vehicles, emergency vehicles and pedestrians during training of RL for Green Wave, Combined Weights.
4.14 Average time for the vehicles, emergency vehicles and pedestrians during training of RL for Green Wave, Emergency Weights.
4.15 Average time for the vehicles, emergency vehicles and pedestrians during training of RL for Green Wave, No Emergency Weights.
4.16 Average time for the vehicles, emergency vehicles and pedestrians during training of RL for Green Wave, No Emergency Weights, 2000 epochs.
5.1 High Level Architecture of The Traffic Control System.


List of Tables

3.1 Traffic light state description.
3.2 Default traffic light schedule for the isolated four-way intersection.
3.3 Available phases and phase duration for the connected intersections.
3.4 Vehicle Attributes.
3.5 State.
3.6 Possible Actions.
3.7 State Representation Green Wave.
4.1 Departing probability and road pair distribution.
4.2 Weights for emergency vehicles and no emergency vehicles.
4.3 Results Emergency Prioritization.
4.4 Average time in the system when the emergency vehicles are prioritized.
4.5 Average time in the system when vehicles and pedestrians are prioritized equally.


1 Introduction

This chapter will present the aim and research questions the thesis intends to answer and why the research is considered significant. Furthermore, an overview of the method, delimitations, and structure of the report will be presented.

1.1 Motivation

The consequences of traffic congestion include increased travel time, fuel consumption, and the number of crashes. The hourly cost of these consequences can be valued at approximately 50 % of the hourly wage rate for personal travel and 100 % of the hourly wage rate for business travel [6]. Traffic congestion can be recurring, such as rush hour traffic, or nonrecurring due to random events. More than 50 % of all traffic delays in the United States, and up to 67 % of traffic delays in urban areas, have been shown to be due to nonrecurring traffic congestion [6]. In order to minimize the impact of nonrecurring congestion, adaptive traffic control using real-time data is advantageous if not necessary. Most dynamic traffic control systems today use traffic lights. The algorithms controlling traffic lights can be generalized into two categories. The first category can be labeled fixed-time control, where all control parameters have been optimized offline using historical data [29]. Fixed-time control can be useful when trying to manage recurring traffic congestion, since the historical data will give an accurate estimation of future events. The second category can be labeled actuated control, where control parameters are determined online using real-time sensor data [29]. Actuated traffic control is an effective method to control both recurring and nonrecurring traffic congestion. However, the research shows that many algorithms within this category are deterministic. For example, a popular type of algorithm is the longest queue first or highest density first [10][16]. These types of algorithms are limited in that they require an industry expert to define how the traffic light should behave in each situation. Considering the complexity of intersections, this approach can lead to sub-optimal traffic control decisions due to human error or limitation [10].

Researchers have therefore gravitated towards reinforcement learning, which allows the algorithm to develop an optimal policy for taking actions given the state of the environment [11]. The algorithm does this by interacting with the environment and trying to maximize cumulative reward. Even though the reward and state representation are predefined, the algorithm may develop a state-action policy in an unsupervised manner. Reinforcement learning in the context of adaptive traffic signal control has shown promising results. For example, researchers have found that traffic signal control using reinforcement learning reduces vehicle delay in a four-way intersection by 47 % compared to the longest queue first deterministic algorithm [10].

1.2 Aim

This thesis aims to investigate how deep reinforcement learning can be used to streamline traffic light control and to evaluate it against popular methods used today. Furthermore, this thesis aims to investigate how a deep reinforcement learning-based traffic control system can be deployed. The thesis includes the development and evaluation of traffic light control algorithms and schedules using different approaches: a deep Q-learning algorithm, a deterministic algorithm, and a fixed-time control schedule. Moreover, a qualitative study of different sensors and deployment methods is included in the thesis.

1.3 Research questions

• How does a reinforcement learning-based traffic light control algorithm perform against a deterministic algorithm and a fixed-time control schedule in terms of minimizing the average time in the system?

• How does the arrival rate of vehicles affect the performance of the algorithms?

• How does the distribution of vehicles between incoming roads affect the performance of the algorithms?

• Can reinforcement learning be used to obtain a green wave for emergency vehicles while maintaining good traffic flow?

• How should a reinforcement learning-based traffic control system be deployed?

    What sensors should be used?

    How should units communicate?

1.4 Method

This thesis was structured in six stages. First, research was conducted to identify what has and has not been done in the research area and where value can be added to previous research. Furthermore, the methods used were researched to determine how traffic control algorithms were developed and tested. Second, the intersections were implemented in the simulation environment, specifically in Simulation of Urban Mobility. Third, data were generated, i.e., the flow of vehicles. Fourth, the algorithms were developed and implemented in Python. Fifth, the algorithms were tested and evaluated. Lastly, sensors and communication were researched, and a suggested deployment of the traffic control system was presented.

1.5 Delimitations

The reinforcement learning algorithm is trained without considering run-time, CPU, memory, latency and incomplete information. In reality, the time available to obtain convergence is highly limited and affected considerably by the aspects mentioned above. Furthermore, the only performance measurement is the average time in the system, and other aspects should be considered to measure the performance of the algorithms accurately. Lastly, this thesis will only consider two types of intersections.

1.6 Structure

The thesis is structured as follows. Chapter 2 provides the reader with theory and related work necessary to understand how the algorithms were developed and why. In Chapter 3, the development of the environment and the implementation of the algorithms are described. Chapter 4 contains results from the experiments and analysis. Chapter 5 includes a discussion regarding sensors, communication, and the proposed system architecture. In Chapter 6, the thesis is discussed, followed by a conclusion in Chapter 7.


2 Theory

This chapter will explain the relevant theory to understand reinforcement learning and present the related work used as a basis for the development of the algorithms.

2.1 Reinforcement Learning

Reinforcement learning (RL) is a type of machine learning that aims to maximize a numerical reward signal by learning how to map situations to actions [25]. More concretely, RL is the task of learning how a software agent should take a sequence of actions in an environment in order to maximize a notion of cumulative reward. The software agent interacts with the environment directly, and neither supervision nor complete knowledge of the environment is needed, which distinguishes RL from other computational approaches. The agent's interaction with the environment can generally be described as follows: consider a starting state of an environment s_t. The agent takes an action a_t at each time step t. Each action yields a reward r_t and a state transition to s_{t+1} [7]. The interaction between agent and environment can be seen in Figure 2.1.

Figure 2.1: Agent and Environment Interaction.

RL is generally defined as a discrete, stochastic control process where future states in an environment depend only on the current state of the environment. The agent has, therefore, no interest in the sequence of events that led to the current state of the environment. This implies that RL has the Markov property and thus is a Markov Decision Process (MDP) [25]. The way an agent behaves at a given time step is determined by a policy π(s_t). Generally speaking, the policy decides the probability of taking an action given a particular state of the environment. Accordingly, the process of RL aims to find a policy that maximizes cumulative reward [25]. The policy of an RL algorithm is developed over time as a consequence of the rewards given by the environment after each action. The RL algorithm seeks to maximize cumulative reward over time. This implies that the reward defines what good and bad actions are. As a consequence, the reward ultimately defines the goal of the RL algorithm [25]. However, the reward defines only the immediate worth of an action in a given state. The actual value of an action is determined not only by the immediate reward but also by accumulated future rewards starting from the state the environment ends up in. To take this into account, RL uses a value function to predict expected future reward by taking into account the probability of states following the current state and the reward in those states. Actions are chosen based on the value function to maximize total value, thus maximizing total reward over time. Some researchers consider the estimation of the value of a state to be the most critical aspect of RL [25].

2.2 Mathematical Formulation of Markov Decision Processes

Reinforcement learning is, as previously mentioned, mathematically formulated as an MDP. An MDP satisfies the Markov property, namely that the current state entirely characterizes the state of the environment [13]. An MDP consists of the following elements:

S: set of possible non-terminal states,
A: set of possible actions,
P(s'|s, a): probability of transitioning to state s' by taking action a in state s,
R(s, a, s'): immediate reward of the transition to state s' from state s by taking action a,
γ: discount factor of future rewards.

A state in the context of RL is a representation of the environment as it appears at a given time step. The set of states should therefore include all possible states the environment could be in, except for terminal states. Terminal states are the states in which the current episode of the algorithm ends, for example, when a goal is reached. An action is a decision the agent can take in a particular state, and the set of all actions includes all the decisions the agent can take. Note that not all actions in A are necessarily allowed in all states [25]. The transition probability P(s'|s, a) gives the probability of ending up in a state s' given a state-action pair (s, a). In a deterministic setting, the probability of the agent ending up in state s', given that action a in state s leads to state s', is 1.0, and the probability of ending up in any other state is 0.0. However, in a non-deterministic setting, where there is noise, the action will not necessarily lead to state s' all of the time. In either case, the sum of the probabilities of ending up in all states reachable from state s should equal 1.0 [22]. Reachable states from state s are determined by the allowed actions in that state. Lastly, the reward function R(s, a, s') gives a scalar that represents how useful it is to transition to state s' from state s by taking action a [13]. The solution to an MDP is a policy π(s), which is a function that determines what action the agent should take in each state.

To describe the RL process, consider a timeline starting at time step t = 0. The environment samples a starting state s_0 [13]. The current state is denoted s, and the state the environment ends up in after an action a is denoted s' [13]. From t = 0 until the environment reaches a terminal state, or after a pre-defined number of iterations, the following steps are taken in the given order:

1. The agent chooses an action a ∼ π(s),
2. The environment samples a reward r ∼ R(s, a, s'),
3. The environment samples a new state s' ∼ P(s'|s, a),
4. The agent receives reward r and state s'.

The objective of this process is to find an optimal policy that maximizes the expected cumulative discounted reward over time, see equation 2.1 [13].

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}\Big[\sum_{t>0} \gamma^{t} r_{t} \,\Big|\, \pi\Big] \tag{2.1}$$

As has been discussed, the value function approximates the value of being in a state. The state value is defined as the expected cumulative reward of starting at state s and following the current policy until a terminal state is reached or after a pre-defined number of iterations, see equation 2.2 [7].

$$V^{\pi}(s) = \mathbb{E}\Big[\sum_{t>0} \gamma^{t} r_{t} \,\Big|\, s_{0}=s,\, \pi\Big] \tag{2.2}$$

As has been mentioned previously, the immediate reward is a short-term representation of a state's value. Therefore, equation 2.2 gives the expected immediate cumulative discounted reward over time. This may be sub-optimal, as the actual value of the state is determined by accumulated future rewards as well as the immediate reward [22]. Therefore, equation 2.1 should be rewritten as the following:

$$\pi^{*}(s) = \arg\max_{a} \mathbb{E}\Big[\sum_{s'} P(s'|s,a)\, V^{\pi^{*}}(s')\Big] \tag{2.3}$$

The optimal value function can be written recursively using the Bellman equation as follows:

$$V^{\pi^{*}}(s) = r + \gamma \max_{a} \sum_{s'} P(s'|s,a)\, V^{\pi^{*}}(s') \tag{2.4}$$

By solving equation 2.4, the optimal policy can be obtained. This is done by value iteration, the process of updating the value function in a state based on the value of neighboring states and the probability of ending up in those states, see equation 2.5 [25].

$$V_{i+1}(s) = r + \gamma \max_{a} \sum_{s'} P(s'|s,a)\, V_{i}(s') \tag{2.5}$$
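To make the value-iteration update in equation 2.5 concrete, the short Python sketch below iterates it on a toy MDP until the value function stops changing. The toy transition table and rewards are illustrative assumptions, and the reward is written as action-dependent R[s][a], a slight generalization of the r that sits outside the max in equation 2.5.

import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    # P[s][a] is a list of (probability, next_state) pairs,
    # R[s][a] is the immediate reward of taking action a in state s.
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            # Equation 2.5: reward plus discounted expected value of the
            # successor states, maximized over the available actions.
            q = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                 for a in range(len(P[s]))]
            V_new[s] = max(q)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy two-state, two-action MDP (illustrative numbers only).
P = [[[(1.0, 0)], [(0.8, 1), (0.2, 0)]],
     [[(1.0, 0)], [(1.0, 1)]]]
R = [[0.0, 1.0],
     [0.0, 2.0]]
print(value_iteration(P, R))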

2.3 Q-learning

The value function formulated in equation 2.4 gives the value of a state but not the value of a state-action pair. Therefore, Q-learning uses a Q-value function, which is a slight modification of the value function defined previously. The Q-value function is defined as the expected cumulative reward of taking action a in a state s and then following the optimal policy, see equation 2.6 [7].

$$Q^{*}(s,a) = r + \gamma \sum_{s'} P(s'|s,a)\, \max_{a'} Q^{*}(s',a') \tag{2.6}$$

It follows that the optimal value function equals the optimal Q-value function, given that the best action is taken, see equation 2.7 [25].

$$V^{\pi^{*}}(s) = \max_{a} Q^{*}(s,a) \tag{2.7}$$

Furthermore, the optimal policy is the policy that gives the action that maximizes the Q-value, see equation 2.8 [22].

$$\pi^{*}(s) = \arg\max_{a} Q^{*}(s,a) \tag{2.8}$$

For each transition from state s to state s' by taking action a and receiving reward r, the Q-value function is updated according to equation 2.9. As i goes towards infinity, Q(s, a) goes towards Q*(s, a) [13]. The parameter α is the learning rate, which determines to what extent the Q-values should be updated.

$$Q_{i+1}(s,a) = Q_{i}(s,a) + \alpha\big(r + \gamma \max_{a'} Q_{i}(s',a') - Q_{i}(s,a)\big) \tag{2.9}$$
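For comparison with the function-approximation approach that follows, the tabular update in equation 2.9 can be sketched as below. The environment interface (reset and step) and the epsilon-greedy exploration are assumptions made only to show where the update fits; they are not taken from the thesis.

import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=500,
                       alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, otherwise
            # pick the action with the highest current Q-value.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Equation 2.9: move Q(s, a) towards the bootstrapped target.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q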

2.4 Deep Q-learning

In the traditional Q-learning approach, the Q-value function is a table of size number of states times number of actions. Considering that Q(s, a) needs to be calculated for every state-action pair, the traditional Q-learning approach becomes computationally infeasible as the number of state-action pairs grows beyond a certain number. Therefore, function approximators are used to estimate Q*(s, a). Deep Q-learning is simply Q-learning where Q*(s, a) is approximated by a neural network with weights θ, see equation 2.10 [8].

$$Q(s,a;\theta) = Q^{*}(s,a) \tag{2.10}$$

The neural network is trained to minimize the expected squared error between the predicted Q-value and the target Q-value, see equation 2.11 [8].

$$L = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s',a') - Q(s,a;\theta)\big)^{2}\Big] \tag{2.11}$$

The weights of the neural network are updated using gradient descent, see equation 2.12. The idea is that the weights become optimized over time to minimize the error in equation 2.11 [13].

$$\theta_{i+1} = \theta_{i} + \alpha\big(r + \gamma \max_{a'} Q(s',a') - Q(s,a;\theta)\big)\nabla_{\theta_{i}} Q(s,a;\theta_{i}) \tag{2.12}$$
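To keep the gradient in equation 2.12 explicit, the sketch below replaces the deep network with a linear function approximator, for which the gradient of Q with respect to the weights is simply the state vector. The state and action sizes anticipate Section 3.3; everything else is an illustrative assumption rather than the implementation used in this thesis.

import numpy as np

state_size, n_actions = 8, 6
gamma, alpha = 0.95, 0.01

# Linear stand-in for the deep network: Q(s, a; theta) = theta[a] . s
theta = np.zeros((n_actions, state_size))

def q_values(s):
    return theta @ s

def td_update(s, a, r, s_next):
    # Target of equation 2.11: r + gamma * max_a' Q(s', a').
    target = r + gamma * np.max(q_values(s_next))
    td_error = target - q_values(s)[a]
    # Equation 2.12: for a linear Q, the gradient of Q(s, a) w.r.t. theta[a] is s.
    theta[a] += alpha * td_error * s

With a deep network the same update is performed by backpropagation through the network instead of the hand-written gradient.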

2.5 Related Work

When researching machine learning techniques in the context of traffic signal control, RL seems to have undergone substantial research during recent years. However, within the scope of RL, there is extensive variation in learning algorithms, state space definitions, action space definitions, reward definitions, simulation software, road traffic networks, and probability distribution models for generating vehicle traffic. The state space definitions used in previous research can generally be categorized into two types: human-crafted attributes of traffic, and raw data. A commonly used human-crafted attribute to define the state space is queue length [21]. However, the more general category of state space definition seems to be raw data in the form of vehicle position and velocity [10][11][19]. The action space descriptions used in previous research are highly dependent on the type of road traffic network studied. Most studies researched have considered isolated four-way intersections. This implies that the determining factor when deciding the action space description is the number of lanes on each road. However, most research defines the action space as either two or four possible actions [10][11][19][20]. The possible actions are the different configurations of green light.

The researchers that define the action space as two actions consider a green light for all lanes north and south and a green light for all lanes east and west as the possible actions [10]. Exclusive green lights for left turns, i.e., green light for south to west and north to east, and green light for west to north and east to south, are considered part of the transition between the two actions. The researchers that define the action space as four actions include the above-mentioned exclusive left turns as actions in the action space [11][19][20].

RL typically includes a reward definition regardless of the learning algorithm. Some of the previous research seems to correlate the reward definition with the state space definition. This seems to apply to research where the state space is defined as human-crafted attributes. For example, in research where the state space definition is queue length, the reward is defined as the negative value of the average queue length [21]. Since RL typically aims to maximize the accumulated reward, the algorithm will favor actions that give the least negative value, i.e., the actions that give the lowest average queue length. For research where the state space definition is raw data, the correlation between state space and reward is less obvious. In these studies, the reward is usually defined as the difference between accumulated waiting time or staying time at the intersection before and after the action. This can be interpreted as setting the reward to the aim of the traffic light directly, and not through an indicator such as queue length, which indicates whether the accumulated delay or staying time in the system reduces [10][11]. Since it is more complex to measure vehicle delay or staying time at an intersection, some researchers use more straightforward reward definitions such as the net outflow at the considered intersection [20]. To our knowledge, most previous researchers have used a deep neural network to approximate Q-values and thus the optimal action given a particular state [10][11][19][21][20]. As has been mentioned in Section 1.1, deterministic algorithms are the most commonly used to control traffic lights.

The studied algorithms vary mostly in the way roads are prioritized, i.e., which traffic light phases are initiated and when. Furthermore, the duration of each phase is another factor that varies among researchers. The choice of phase and phase duration, in turn, affects which attributes of the intersection are of interest and thus what data need to be gathered. This information can be collected from vehicles that periodically broadcast their basic traveling data, such as location, speed, direction, and destination [27]. A common alternative to vehicles broadcasting data is sensors such as induction loops and cameras [1][14]. Common attributes studied are queue length, average speed, and vehicle density [24][27][14][16]. Usually, the roads with the highest vehicle density or queue length are prioritized, meaning that they will obtain a green light during the following phase [27][14]. After a road has been prioritized, the time it takes for the vehicles within a specified range or queue to cross the intersection is calculated, and the phase duration is set to the estimated time [27][14], making the scheduling of the traffic light dynamic.


3 Method

This chapter describes how the work was carried out by showcasing the development and implementation of the intersections and the algorithms.

3.1 Environment

There are two intersections considered in this study, illustrated in Figures 3.1 and 3.2. The intersection in Figure 3.1 is a four-way intersection where each incoming road allows traffic towards each direction. Each road contains three lanes, and each cardinal direction has one incoming and one outgoing road. Each road is 500 meters long and has a speed limit of 19.44 m/s. Each lane at the intersection is named after its cardinal direction together with an index. Depending on which index the lanes have, their characteristics differ from one another. Lanes with index 1 have two destinations, which means that drivers using these lanes can choose between turning right and continuing forward. Lanes with index 2, however, only have the opportunity of continuing forward through the intersection. Lastly, incoming lanes with index 3 can only proceed through the intersection by taking a left turn. The intersection is considered to be isolated in all experiments, which means that possible information from surrounding traffic lights is not considered. Only data within the intersection is taken into account.



Figure 3.1: Isolated four-way intersection.

A traffic light controls the traffic flow in the intersection. To emulate realistic traffic, the simulation software Simulation of Urban Mobility (SUMO) version 1.1.0 was used. SUMO is an open-source microscopic road traffic simulation package that is designed to simulate large networks of roads and intersections [17]. The main reason SUMO was chosen is that each vehicle can be modeled explicitly with its attributes and route, which provides the opportunity to introduce randomness in the simulations. Furthermore, using existing open-source simulation software allows for more experimentation within the limited time and makes it simpler for future studies to recreate the simulation environment. While the most common traffic lights use three states (red, yellow, and green), SUMO has four different states, presented in Table 3.1. The green state in SUMO is divided into two different green lights, where one indicates that the vehicle has priority over conflicting vehicles and the other that it does not.

Table 3.1: Traffic light state description.

Description                    Traffic Signal Character
Green light with priority      G
Green light with no priority   g
Yellow light                   y
Red light                      r

For a traffic light to be fully functional, there needs to be a traffic light schedule. The traffic light schedule is by default static with fixed phases and phase durations. The default settings of the schedule can be seen in Table 3.2. A phase consists of the state of each lane as described in Table 3.1, and the phase duration defines how long each phase will last. A cycle consists of eight phases and lasts for 90 seconds in total. The algorithms presented in this thesis will only control phases 1 and 5 with the associated phase durations. The remaining phases are defined as transition phases to assure safety through the intersection and will thus remain unchanged throughout all experiments.

Table 3.2: Default traffic light schedule for the isolated four-way intersection.

Phase   N1 N2 N3   E1 E2 E3   S1 S2 S3   W1 W2 W3   Phase Duration [s]
1       G  G  g    r  r  r    G  G  g    r  r  r    29
2       y  y  g    r  r  r    y  y  g    r  r  r    5
3       r  r  G    r  r  r    r  r  G    r  r  r    6
4       r  r  y    r  r  r    r  r  y    r  r  r    5
5       r  r  r    G  G  g    r  r  r    G  G  g    29
6       r  r  r    y  y  g    r  r  r    y  y  g    5
7       r  r  r    r  r  G    r  r  r    r  r  G    6
8       r  r  r    r  r  y    r  r  r    r  r  y    5

The intersection in Figure 3.2 includes pedestrians as well as motorized vehicles. There is a one-way road for vehicles that spans from west to east. The length of the road is 300 meters in total, and the speed limit is 13.89 m/s. Furthermore, there are two pedestrian crossings that go both ways, i.e., south to north and vice versa.

Figure 3.2: Road for motorized vehicles with pedestrian crossings.

The available phases for each traffic light are illustrated in Table 3.3. Observe that only the first traffic light is described in Table 3.3; however, both traffic lights have identical available phases. Phases one and three are considered the main phases and will be controlled. The other phases are considered to be transition phases and will thus not be controlled. Each transition phase will last for 5 seconds.


Table 3.3: Available phases and phase duration for the connected intersections.

Phase   W   C   N1  S1
1       G   G   r   r
2       y   y   r   r
3       r   r   G   G
4       r   r   r   r

3.1.1 Data

In this study, all vehicles have the same attributes, and all drivers and pedestrians follow the same traffic behavior models. However, this should not be confused with all drivers and pedestrians behaving the same. The specific situation and stochasticity [17] decide the exact behavior. The varying factors in the input data are mainly the arrival rate and routing of vehicles and pedestrians. Vehicle attributes, traffic behavior models, and the probability distributions used to define arrival rate and routing will be described in greater detail below.

3.1.2 Vehicle Type

All vehicles have the default settings in SUMO. The attributes of each vehicle are presented in Table 3.4. The only fixed values in Table 3.4 are the length and width of the vehicle itself. Acceleration, deceleration, and velocity, while limited within the given values, are decided by the specific behavior of the driver. The specific behavior of a driver is defined by a car following model and a lane changing model defined in Section 3.1.3.

Table 3.4: Vehicle Attributes.

Attribute              Value
Maximum Acceleration   2.6 m/s²
Maximum Deceleration   4.5 m/s²
Length                 5.0 m
Width                  1.8 m
Maximum Velocity       55.55 m/s

3.1.3 Traffic Behavior

There are several car-following models in SUMO; however, the default car-following model is used. The default car-following model in SUMO is based on the Krauss model with minor modifications [17]. The model aims to let each vehicle drive as fast as possible while always being able to avoid a collision [18]. However, in this study, the vehicles will never exceed the speed limit of the road. Furthermore, the lane-changing model determines lane choice on multi-lane roads. The default lane-changing model LC2013 is used in this study. In the default setting, lane changes occur mainly when necessary, i.e., when the current lane does not lead to where the vehicle is routed to go [5].



3.1.4 Arrival Rate & Route

Apart from vehicle type and traffic behavior, each vehicle has a route and departure time. A vehicle's route is chosen using a set of weights defining the probability of the source road and destination road of the vehicle. The destination roads are chosen uniformly, whereas the weights of the source roads are varied during experiments to control the distribution of incoming traffic amongst roads. The arrival rate is randomized using a binomial distribution where the input arguments are the maximum number of simultaneous arrivals and the expected arrival rate. The maximum number of simultaneous arrivals is fixed, and the arrival rate will be a varying factor in the experiments described in Section 6.1. Each route file covers 30 minutes of simulation.
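A minimal sketch of how such input data could be generated, assuming n = 12 maximum simultaneous arrivals (one per incoming lane, as in Section 4.1), uniformly drawn destinations, and illustrative function and variable names; the thesis does not specify this exact implementation, and writing the result to a SUMO route file is omitted.

import numpy as np

rng = np.random.default_rng()
roads = ["N", "S", "E", "W"]

def generate_departures(p, source_weights, n_max=12, sim_seconds=1800):
    # For every simulated second, draw the number of arriving vehicles from a
    # binomial distribution and give each vehicle a source and destination road.
    departures = []
    for t in range(sim_seconds):
        for _ in range(rng.binomial(n_max, p)):
            src = rng.choice(roads, p=source_weights)
            dst = rng.choice([r for r in roads if r != src])   # uniform destination
            departures.append((t, src, dst))
    return departures

# Experiment 1 of Table 4.1: even distribution between the incoming roads.
routes = generate_departures(p=1/3, source_weights=[0.25, 0.25, 0.25, 0.25])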

3.2 Deterministic Algorithm

To optimize traffic flow through an intersection, there are several important parameters to acknowledge. However, a traffic light control system is limited to controlling phase and phase duration. Therefore, parameters directly correlated with phase and phase duration are considered. To control phases, two common parameters considered in previous research are queue length and vehicle density. Queue length gives a limited representation of the state of the intersection, as vehicles that might end up in a queue are ignored. Therefore, this paper considers vehicle density as the determining parameter to control which phase should be prioritized. To further describe the prioritization of phases, consider the intersection presented in Figure 3.1 with its associated traffic light control schedule presented in Table 3.2. As can be seen in the main phases, roads with opposite cardinal directions have green light simultaneously. As the main phases constitute the phases the traffic control system can prioritize between, it is necessary to compare the combined density of the roads that obtain green light in those phases. Thus, the density of lanes N1-N3 and S1-S3 is combined, as is the density of lanes E1-E3 and W1-W3, see Figure 3.1. The road pair with the highest combined density is prioritized. It follows that the phase that gives a green light to the prioritized road pair is initiated during the following time step. Furthermore, to control phase duration, previous researchers usually estimate the time it takes for a certain number of vehicles to pass the intersection. The objective is usually to make sure all vehicles in a queue cross the intersection before the next phase. As has been discussed earlier, it is assumed in this study that the queue does not give an accurate representation of the state of the traffic flow in the intersection. Therefore, the phase duration is set to the time it takes for the vehicle that takes the longest in the prioritized road pair to cross the intersection. This is done regardless of whether there is a queue or not. To accurately estimate the time t_i it takes for vehicle i within the pre-defined distance to cross the intersection, equation 3.1 is used, where v_i, a_i, and d_i are the velocity, acceleration, and distance to the center of the intersection for vehicle i. Equation 3.1 is selected since the velocity of vehicle i at the current time step, i.e., the initial velocity, is known, as are the distance d_i and the acceleration a_i. It follows that the time t_i for vehicle i is the only unknown variable, which can be solved for given constant acceleration a_i. A compensating factor s_r is added for the prioritized road pair r. The factor s_r is defined as the total number of vehicles standing still in the prioritized road pair r divided by two. Vehicles with a velocity lower than 0.1 m/s are considered to be standing still. It is thus assumed that the reaction time is one second per vehicle. This is a major simplification, as vehicles are only affected by other vehicles standing still in front of them on the same road.

$$t_{i} = -\frac{v_{i}}{a_{i}} + \sqrt{\Big(\frac{v_{i}}{a_{i}}\Big)^{2} + \frac{2 d_{i}}{a_{i}}} + s_{r} \tag{3.1}$$
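Equation 3.1 transcribed as a Python helper; the function name is illustrative, and the guard against (near-)zero acceleration is an added assumption that the thesis does not discuss.

from math import sqrt

def estimate_crossing_time(v, a, d, n_standing):
    # s_r: number of standing vehicles in the prioritized road pair divided by two.
    s_r = n_standing / 2.0
    if abs(a) < 1e-6:
        # Vehicle at (roughly) constant speed: fall back to distance over speed.
        return d / max(v, 0.1) + s_r
    # Equation 3.1: solve d = v*t + a*t^2/2 for t, then add the compensation s_r.
    return -v / a + sqrt((v / a) ** 2 + 2.0 * d / a) + s_r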

To further describe the scheduling of phases and phase durations, consider Algorithm 1. Prioritization of phases is always made at the end of a full cycle, and the length of a cycle is always kept below the cycle length in the fixed schedule. Furthermore, a full cycle is always performed before the algorithm makes the next decision. Thus, prioritization will be made in every other main phase. Consider the intersection in Figure 3.1 and an environment in which the road pair consisting of the western and eastern roads has the highest accumulated vehicle density, in which case phase 5 will be chosen, as can be seen in the first if-statement. The starting phase of this cycle will thus be phase 5. Phase 5 will be initiated for the estimated time or the scheduled time, depending on whether the estimated time exceeds the scheduled time. This is done in order to limit the total cycle time to the total cycle time in the fixed schedule. This procedure occurs in the second if-statement. If the estimated time does not exceed the scheduled time, the excess time relative to the maximum green time allowed is saved in a buffer and can be used by later phases within the cycle. If the schedule in Table 3.2 is considered, the maximum green time would be 58 seconds, and the maximum cycle time would be 90 seconds. After phase 5 has run for the full phase duration, a transition is initiated to phase 1. This is to ensure that a full cycle always runs before the next decision is made, to avoid a specific phase running for too long. The schedule will thus always start with the phase that has been chosen by the algorithm and end on the opposite phase. Observe that there are only two main phases in both intersections considered in this study. If the estimated time exceeds the scheduled time, the phase duration will be set to the scheduled time. Lastly, a transition phase is initiated, i.e., phases 2 through 4, and the algorithm makes the next decision.



while simulation running do
    if take_action then
        density = getDensity(road_pairs);                        // Get vehicle density of each road pair
        road_pair = arg max(density);                            // Get prioritized road pair
        phase = getPhase(road_pair);                             // Choose the phase that gives the prioritized road pair green light
    else
        phase = getPhase(!= previous_phase);                     // Choose the opposite main phase
    end
    vehicles = getVehicles(road_pair);                           // Get vehicles on the prioritized road pair
    for vehicle v in vehicles do
        time(v) = estimateTime(v);                               // Estimate travel time (equation 3.1)
    end
    max_estimated_time = max(time);                              // Get longest travel time
    if max_estimated_time < scheduled_time and take_action then
        phase_duration = max_estimated_time;                     // Set phase duration to the estimated travel time
        scheduled_time = cycle_green_time - max_estimated_time;  // Save excess green time for later phases
    else
        phase_duration = scheduled_time;                         // Phase duration unchanged
    end
    initiatePhase(phase, phase_duration);                        // Run the chosen green phase
    initiateTransition(phase);                                   // Run the transition phases
end

Algorithm 1: Deterministic Algorithm Pseudocode.

3.3 Reinforcement Learning Algorithm For Isolated Four-Way Intersection

This section will describe the state, action and reward definition of the reinforcement learning algorithm for the four-way intersection. Furthermore, the included neural network topology and training procedure will be described.

3.3.1 State Definition

As has been discussed in Section 2.5, the most common state space definition amongst the previous researchers considered in this study is raw data in the form of the position and velocity of vehicles and pedestrians. This method has been shown to allow the RL algorithm to identify patterns without being limited to human-crafted attributes. However, this type of state definition is limited in terms of scalability, as the state size is highly dependent on the number of lanes and the length of each road. As the number of lanes and road length increases, the size of the state must increase in order to accurately represent the environment [10][11]. To define the state in a more scalable manner, the state of an intersection is defined as the density and mean speed of vehicles and pedestrians for each incoming road. An incoming road is defined as a road leading towards the intersection, and density is defined as the total number of vehicles and pedestrians on the road. This limits the size of the state to the number of incoming roads times the number of parameters considered for each road, regardless of the number of lanes and the length of each road. Density and mean speed consider vehicles and pedestrians that have not passed a traffic light. The moment a vehicle or pedestrian passes a traffic light, it is considered served and thus no longer included in the state representation. To exemplify, consider the intersection illustrated in Figure 3.1, which has eight roads in total containing three lanes each. The state will thus be an array of size eight, as there are four incoming roads where two parameters are considered per road, see Table 3.5.

Table 3.5: State.

Parameters   Density                              Mean Speed
Roads        N       S       E       W            N       S       E       W
Lanes        N1-N3   S1-S3   E1-E3   W1-W3        N1-N3   S1-S3   E1-E3   W1-W3
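A sketch of how this state vector could be read from SUMO through the TraCI Python API; the lane IDs are placeholders that depend on the network file, and averaging the per-lane mean speeds into one value per road is a simplification.

import traci

# Placeholder lane IDs; the real IDs depend on the SUMO network definition.
INCOMING = {
    "N": ["N_0", "N_1", "N_2"],
    "S": ["S_0", "S_1", "S_2"],
    "E": ["E_0", "E_1", "E_2"],
    "W": ["W_0", "W_1", "W_2"],
}

def get_state():
    # Build the 8-element state of Table 3.5: one density and one mean speed
    # per incoming road. Vehicles that have passed the light are no longer on
    # these lanes and are therefore excluded automatically.
    densities, mean_speeds = [], []
    for road, lanes in INCOMING.items():
        densities.append(sum(traci.lane.getLastStepVehicleNumber(l) for l in lanes))
        speeds = [traci.lane.getLastStepMeanSpeed(l) for l in lanes]
        mean_speeds.append(sum(speeds) / len(speeds))
    return densities + mean_speeds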

3.3.2 Actions

Most previous researchers define actions as the phases in which a green light is set for all lanes on a road simultaneously during a fixed phase duration. Other phases of a traffic light schedule are considered part of the transition between actions. For example, if the schedule in Table 3.2 is considered, phases one and five would be possible actions, and the duration of each action would be fixed. This method has shown promising results in general. However, the optimization of duration is limited to multiples of the fixed action duration. To allow for optimization of duration as well as choice of phase, the actions are defined as phases in which a green light is set for all lanes on the road simultaneously, where each phase has three different durations. Again, if the schedule in Table 3.2 is considered, the possible actions would be phase 1 for 5, 10, or 15 seconds and phase 5 for 5, 10, or 15 seconds, see Table 3.6. Thus, the action space is of size six, with two phases to choose between, where each phase has three different duration settings.

Table 3.6: Possible Actions.

Possible Action      1   2   3   4   5   6
Phase                1   1   1   5   5   5
Phase Duration [s]   5   10  15  5   10  15

An action is taken at the end of each green phase. As has been discussed, each action contains a phase and a phase duration. If the agent takes an action with a phase that is different from the current phase, a transition phase is executed before the start of the next green phase, see Figure 3.3, where a represents an action and t represents a time stamp. Considering the schedule in Table 3.2, the transition phases are phases 2-4 or phases 6-8, depending on whether phase 1 or 5 is the current green phase. No transition occurs if the agent takes an action that changes the current duration without changing the phase of the traffic light.

Figure 3.3: Action Timeline.

3.3.3 Reward

The reward is entirely based on the state definition and thus considers the density and mean speed of each road. By comparing the current state with the previous state, the agent gets a reward that indicates whether the action that led to the current state had a positive impact on the traffic flow or not. A positive impact is assumed to be decreased vehicle density and increased mean speed. The magnitude of the impact is determined by the increase or decrease relative to the previous state, see Equation 3.2, where k is a constant, α and β are weights, d is the sum of the vehicle density of all roads, and v is the sum of the mean speed of all roads. Indices s and s' indicate which state the density and mean speed are retrieved from, where s is the initial state and s' is the state the environment ends up in after action a is taken.

$$r = k\big(\alpha(d_{s} - d_{s'}) + \beta(v_{s'} - v_{s})\big) \tag{3.2}$$
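Equation 3.2 written out as a small helper, with the state laid out as in Table 3.5 (four densities followed by four mean speeds); the values of k, α, and β are placeholders, since the thesis does not state them.

def reward(prev_state, next_state, k=1.0, alpha=1.0, beta=1.0):
    # Positive when total density drops and total mean speed rises
    # between state s and state s' (equation 3.2).
    d_s, v_s = sum(prev_state[:4]), sum(prev_state[4:])
    d_s1, v_s1 = sum(next_state[:4]), sum(next_state[4:])
    return k * (alpha * (d_s - d_s1) + beta * (v_s1 - v_s))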

3.3.4 Deep Neural Network Topology

Since continuous values define the state, the state space approaches infinity. As the Q-value function needs to be calculated for each state-action pair, the traditional tabular Q-learning method becomes computationally infeasible. Thus, a function approximator is needed to estimate Q*(s, a). The previous researchers considered in this paper all use deep neural networks to approximate Q-values, as deep RL, per definition, is RL using a deep neural network to approximate Q*(s, a). The agent is therefore defined as a deep neural network with two hidden layers. The number of hidden layers is chosen arbitrarily; however, since at least two hidden layers define a deep neural network, the number is intentionally chosen to be at least 2. Both hidden layers are fully connected layers with 28 nodes, and both use the rectified linear unit as the activation function. As the neural network takes the state as input, the input layer contains the same number of nodes as the state size.

Furthermore, as all actions can be taken in all states, the number of nodes in the output layer is equal to the number of possible actions. For example, if the intersection in Figure 3.1 is considered, the input layer of the neural network would be of size 8, since the state definition consists of eight values (see Table 3.5), and the output layer would be of size 6, as there are six possible actions in each state (see Table 3.6). The output layer is a fully connected layer as well, using a linear activation function.
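The described topology can be written, for example, in Keras as below; the choice of framework is an assumption (the thesis does not state which library was used), and the optimizer and learning rate follow the gradient-descent update with α = 0.01 described in Section 3.3.5.

import tensorflow as tf

state_size, n_actions = 8, 6   # Tables 3.5 and 3.6

model = tf.keras.Sequential([
    # Two fully connected hidden layers with 28 nodes and ReLU activations.
    tf.keras.layers.Dense(28, activation="relu", input_shape=(state_size,)),
    tf.keras.layers.Dense(28, activation="relu"),
    # Linear output layer with one Q-value per possible action.
    tf.keras.layers.Dense(n_actions, activation="linear"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")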

3.3.5 Deep Neural Network Training

As the RL agent acts in the environment, experiences are stored in the order they occur in a list of memories of finite size. Each experience is defined by an initial state of the environment, the action taken by the agent, the state the environment ends up in, and the reward the agent receives. The size of the memory is arbitrarily set to 2000 experiences, and it will contain all experiences until the memory is full. After that, memories will be removed starting with the oldest. In order to train the neural network as the simulation is running, experience replay is used. The RL algorithm randomly samples a finite number of experiences from the memory each epoch. The samples are saved in a mini-batch which is used to train the neural network after each epoch. As it is not known how many experiences each simulation will yield, the batch size is set to the minimum number of experiences that can be generated during an epoch, rounded down to the closest ten. The reason for this is to avoid the number of experiences the algorithm tries to sample exceeding the total number of experiences in the memory during the first iterations, as it fills up over time. Each simulation runs for 1800 time steps, which represents half an hour of simulation, as each time step is defined as a second. If the schedule in Table 3.2 is considered, each epoch would generate a minimum of 58 experiences, given that each action leads to a transition that lasts for 16 seconds and each phase duration is maximized, i.e., 15 seconds. The batch size would thus be set to 50. As has been mentioned, samples are chosen randomly and not in the order they appear. The reason for this is to avoid inefficient learning, as consecutive samples can be highly correlated. The neural network is trained using the loss function in Equation 2.11, and the weights are updated using Equation 2.12 with learning rate α = 0.01. The target Q-value is thus the real reward in the sequence (s, a, r, s'), which is known since it has happened, plus a discounted estimate of future rewards. Future rewards are defined as the predicted rewards of the neural network starting at state s' and choosing the action that yields the highest Q-value from there, and this Q-value is compared to the Q-value that was predicted by the neural network at state s. The discount factor γ is arbitrarily set to 0.95, meaning that future rewards are weighted at 95% of their original value.
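A sketch of the replay memory described above (capacity 2000, oldest experiences dropped first, random mini-batch sampling); the class and method names are illustrative.

import random
from collections import deque

import numpy as np

class ReplayMemory:
    def __init__(self, capacity=2000):
        # A deque with maxlen drops the oldest experience once the memory is full.
        self.buffer = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=50):
        # Random sampling breaks the correlation between consecutive experiences;
        # never request more samples than are currently stored.
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        states, actions, rewards, next_states = map(np.array, zip(*batch))
        return states, actions, rewards, next_states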

Actions are chosen using an epsilon-greedy method to balance the trade-off between exploration of the state space and exploitation of optimal actions. Exploration of the state space, i.e., exploration of the environment, is done by choosing random actions with probability ε, while exploitation is done by choosing the action that yields the highest Q-value in a given state with probability 1 - ε. Initially, ε = 1, meaning that the agent will explore the environment 100% of the time, as the agent has no knowledge of the environment and thus no knowledge of how to best exploit optimal actions in it. For each epoch, ε decays by 0.5%, increasing the probability of exploitation over time. However, ε can never go below 1%, to allow occasional exploration even after the algorithm has learned an optimal policy. The assumption is that occasional changes could occur in the environment. Lastly, the number of episodes is set to 1000, as this has been shown to be enough for convergence in the conducted experiments, and the weights of the neural network are saved every tenth epoch. The pseudo-code for the deep RL algorithm is presented below.

for episode e in E do
    state = getState();                                   // Get initial state
    previous_action = 1;                                  // Initial traffic light phase
    while simulation running do
        action = predictAction(state);                    // Predict best action (epsilon-greedy)
        if previous_action != action then
            performTransition(previous_action, action);   // Run transition phases
        end
        performAction(action);                            // Run the chosen green phase and duration
        next_state = getState();                          // Get next state
        reward = getReward(state, action, next_state);    // Calculate reward
        agentRemember(state, action, reward, next_state); // Save experience
        state = next_state;
        previous_action = action;
    end
    replayMemory();                                       // Perform experience replay
    if episode % 10 == 0 then
        saveWeights();                                    // Save the weights of the deep neural network
    end
end

Algorithm 2: Reinforcement Learning Using Experience Replay Pseudocode.

3.4 Reinforcement Learning Algorithm For Green Wave

Experimentation will be performed using the two connected intersections illustrated in Figure 3.2. These experiments intend to test whether it is possible to train the RL to minimize the average time standing still for emergency vehicles. The aim is to do so while minimizing the average time in the system for vehicles and the average waiting time for crossing pedestrians when there are no emergency vehicles. Some changes are made to the RL model described above; these are described in the section below.

3.4.1 Reinforcement Learning Model Changes

This section will describe the changes made in the state, action, and reward definition from the algorithm described in Section 3.3.

3.4.1.1 State

The state is represented by the number of vehicles on each incoming road, the number of waiting pedestrians at each pedestrian crossing, and a binary value representing whether there are emergency vehicles on the incoming roads, see Table 3.7.


Table 3.7: State Representation Green Wave.

Parameters   No. Vehicles   No. Vehicles   No. Pedestrians   No. Pedestrians   If Emergency
Roads        W              C              N1 & S1           N2 & S2           W & C

3.4.1.2 Action

The available actions are the different combinations of green phases between the two traffic lights, i.e., the combinations of phase 1 and 3. The duration will not be pre-defined in these experiments. An action will last as long as it gets re-selected by the RL.

3.4.1.3 Reward

The reward is defined as the difference in density for vehicles and waiting pedestrians, see Equation 3.3. All parameters are similar to what has been described for Equation 3.2, with the exception that d_p represents the number of pedestrians waiting to cross the road.


4 Results & Analysis

This chapter presents the experiments performed and their results. Furthermore, the results will be analyzed objectively as well as subjectively.

4.1 Comparison Between the Algorithms for the Four-Way Intersection

The comparison between the fixed-time schedule (FA), the deterministic algorithm (DA), and the reinforcement learning algorithm (RL) tests performance with traffic flows that differ in arrival rate and in the distribution of vehicles between incoming roads. Furthermore, hyperparameters will be tweaked for the RL based on the results of the experiments, as it is difficult to know beforehand what leads to convergence. The performance is measured as the average time in the system, and all experiments in this section are performed on the intersection presented in Figure 3.1. The edge probabilities determine the distribution of vehicles over the incoming roads. The edge probabilities for the incoming roads denote the probability of vehicles departing from each incoming road, i.e., north (N), south (S), east (E), and west (W), see Table 4.1. The edge probability of outgoing roads is uniformly distributed throughout all experiments. Consider the edge probabilities as the distribution of incoming vehicles between the two road pairs north-south (NS) and east-west (EW), see Table 4.1. The arrival rates that will be tested are p = [1, 1/3, 1/5] vehicles per second. The maximum number of simultaneous arrivals n = 12 is constant throughout all experiments, as there are 12 incoming lanes.

Table 4.1: Departing probability and road pair distribution.

    Experiment                             1                           2                       3
    Departing Probability (N; S; E; W)     (0.25; 0.25; 0.25; 0.25)    (0.4; 0.3; 0.2; 0.1)    (0.7; 0.2; 0.07; 0.03)
    Road Pair Distribution (NS; EW)        (50%; 50%)                  (70%; 30%)              (90%; 10%)
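The following sketch shows one way the departures described above could be drawn each simulated second: the number of arrivals has mean p and is capped at n = 12, and each arriving vehicle is assigned an incoming road according to the departing probabilities in Table 4.1. The binomial draw is an assumption about the generator; only the rates, the cap, and the probabilities come from the text.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    def arrivals_one_second(p=1/3, n=12, roads=("N", "S", "E", "W"),
                            edge_probs=(0.4, 0.3, 0.2, 0.1)):
        # Binomial(n, p/n) has mean p and never exceeds n simultaneous arrivals.
        num_arrivals = rng.binomial(n, p / n)
        # Assign each arriving vehicle an incoming road using the departing probabilities.
        return list(rng.choice(roads, size=num_arrivals, p=edge_probs))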


4.1.1 Results and Analysis with Arrival Rate p = 1/3

Figure 4.1 shows the average time in the system with arrival rate p = 1/3 for the vehicle distributions presented in Table 4.1. These experiments aim to test how the algorithms perform when vehicles are distributed differently, given arrival rate p = 1/3, which is assumed to be an average flow of traffic. Initially, the RL was trained for 1000 epochs to see if it converges. The results can be seen in Figure 4.1.

Figure 4.1: Average time in the system for p=1/3, 1000 epochs.

4.1.1.1 Fixed Schedule Analysis

As can be seen in Figure 4.1, the DA and the RL algorithm both perform better than the FA for all three sets of edge probabilities. The average times in the system when the FA is implemented are 49.4, 48.8, and 51.0, in the order they appear in the bar diagram. As can be seen, the average time in the system decreases slightly when the edge probabilities change from [0.25; 0.25; 0.25; 0.25] to [0.4; 0.3; 0.2; 0.1], i.e., from an even distribution of vehicles between NS and EW to 70% of vehicles from NS and 30% from EW, see Table 4.1. However, when the distribution is 90% NS and 10% EW, the average time in the system is significantly higher compared to the results for the other two distributions. An even flow of vehicles is expected to yield better results than a more uneven distribution for the FA, as green time is divided evenly between the road pairs. Given an even distribution of vehicles, the decrease in travel time for vehicles on the road pair with a green light should, over time, be close to equal to the increase in travel time for vehicles on the road pair with a red light. The reason is that the speed limits and lengths of all roads are equal, all vehicles have identical attributes, and all drivers are expected to behave similarly in the intersection over time.

The hypothesis is that, as the difference in vehicle distribution between NS and EW increases significantly, the FA becomes less suitable for the flow of traffic, as it distributes green time evenly. When the road pair with a higher expected number of vehicles has a green light, this should benefit the overall average time in the system. The reason is that the travel time reduction for vehicles on the road pair with the green light is expected to be larger than the time increase on the opposite road pair, as there are more vehicles on the road pair with the green light. It follows that the opposite occurs when the road pair with fewer expected vehicles has a green light. However, considering the time it takes for vehicles to start up from standing still in a large queue compared to a smaller queue, the time loss is assumed to be greater when the road pair with fewer expected vehicles has a green light, since this creates a large queue on the opposite road. This is, of course, dependent on how big the difference is between the road pairs and on randomness in the input data. It also depends on how well the fixed green time, i.e., 29 seconds, fits the traffic flow. This could explain why a slight reduction in average time in the system occurs as the road weights change from [0.25; 0.25; 0.25; 0.25] to [0.4; 0.3; 0.2; 0.1].

4.1.1.2 Deterministic Algorithm Analysis

The average times in the system when the DA controls the traffic light are 45.2, 44.2, and 42.4. As can be seen, the DA performs better the larger the difference in vehicle distribution between NS and EW. This was expected, as the DA prioritizes based on vehicle density, and a larger difference in vehicle density between the road pairs allows the algorithm to prioritize more accurately. As the difference in vehicle distribution between road pairs increases, each decision to prioritize a road pair should yield a higher reduction in travel time compared to the increase in travel time on the opposite road pair. However, overly prioritizing one road pair can lead to massive queues on the opposite road pair, thus leading to an overall increase in average time in the system. As the DA always follows a full cycle and never exceeds the full cycle time, this seems to be avoided.

Furthermore, the DA should hypothetically allocate green time more accurately as the difference in vehicle distribution increases. Considering Equation 3.1, as the average number of vehicles on a road decreases, the expected number of vehicles standing still should decrease as well, reducing the allocated green time. Furthermore, as the edge probability decreases, it becomes less likely for a vehicle to be at the end of a road when the green time needed is estimated, also leading to a green time reduction. If the estimated green time needed is lower than the scheduled green time, the allocated green time is adjusted; otherwise, it is set to the default. A green time reduction should therefore lead to a better-fitting green time than the default and thus an overall reduction in average time in the system.

Furthermore, as the green time reduces, the cycle time reduces as well. This allows more decisions to be made by the algorithm during each simulation, as a decision is made once every cycle, which in turn allows a further reduction in average time in the system.

4.1.1.3 Reinforcement Learning Algorithm Analysis

The average times in the intersection when the trained RL controls the traffic are 46.4, 44.5, and 39.1. As expected, considering the results from the DA, the RL performs better as the difference in vehicle distribution between NS and EW increases. The hypothesis is that this occurs for the same reasons the DA performs better, and the FA performs worse, as the difference increases. As the number of vehicles differs significantly between road pairs, a right decision yields more, and a wrong decision costs more. However, the RL performs worse than the DA when the distribution is 50% NS 50% EW and 70% NS 30% EW, but significantly better when the distribution is 90% NS 10% EW.

It should be noted that the differences in average time in the system are not significant in these experiments. Future experiments will test whether a heavier traffic flow increases the difference between the DA and the RL. It was assumed that the RL would perform better than the DA in all experiments, so further experimentation was done. As the RL algorithm is highly dependent on many hyperparameters, this could be due to many factors. As can be seen in Figure 4.2, it seems as if further improvements can be made in training, specifically for the distribution 70% NS 30% EW, as the average time in the system still seems to be decreasing and has not fully converged.


Figure 4.2: Average time in the system during training for p = 1/3, 1000 epochs.

It should be noted that certain outliers are expected, as the minimum epsilon value is set to ε = 0.01, meaning the algorithm is allowed to take random actions at least 1% of the time. This is to allow further exploration of the environment now and then, regardless of how long the algorithm trains. Also, Figure 4.2 shows the average time in the system for each epoch, whereas the reward is the difference in vehicle density, as it is assumed that the number of vehicles directly correlates with the average time in the system. It follows that convergence of the average time in the system is interpreted as convergence of the reward and thus of the RL. As more vehicles get passed through, the average time in the system should decrease. To reiterate, this reward was chosen because it is more practical and less complex to measure the number of cars than the average time in the system in a real-life implementation. Further discussion regarding this will be conducted later in the thesis. For comparability, no hyperparameters were changed, but the number of epochs was increased to allow further exploitation using the same input data. The results can be seen in Figure 4.3.
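A sketch of the epsilon-greedy behaviour referred to above is shown below. Only the 0.01 exploration floor is stated in the text; the decay factor and the function names are assumptions.

    import random

    def select_action(q_values, epsilon):
        # With probability epsilon take a random action, otherwise the greedy one.
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return max(range(len(q_values)), key=lambda a: q_values[a])

    def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.01):
        # Epsilon shrinks during training but never drops below the 1% floor.
        return max(epsilon_min, epsilon * decay)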

4.1.1.4 Further Testing and Analysis for Reinforcement Learning Algorithm

The average times in the system for the RL when trained for 2000 epochs are presented in Figure 4.3. The values are 44.5, 43.8, and 42.3, in all cases better than the DA. Furthermore, the algorithm still seems to perform better the larger the difference in vehicle distribution between NS and EW. It should be noted that the RL performed significantly better for the distribution 90% NS 10% EW when trained for 1000 epochs. Considering that no changes were made to the algorithm or the input data, this is likely due to the initial exploration when the algorithm takes random actions. This showcases the importance of exploring the environment before exploiting the best actions.

Figure 4.3: Average time spent in the intersection for p = 1/3, 2000 epochs.

Furthermore, Figure 4.4 shows that the RL seems to have converged, as little to no change occurs during the last 500 epochs for all three distributions. As 2000 epochs seem to be the right amount of training for convergence, future experiments will be trained for 2000 epochs.


Figure 4.4: Average time in the intersection during training of RL for p = 1/3, 2000 epochs.

4.1.2 Results and Analysis with Arrival Rate p = 1

The same experiments as for arrival rate p = 1/3 are performed with arrival rate p = 1, and the results can be seen in Figure 4.5. Many of the patterns repeat themselves, such as the FA performing worse and the RL performing better the larger the difference in vehicle distribution. However, the DA seems to perform worse when the distribution is set to 70% NS and 30% EW. As the arrival rate increases, it becomes more likely that a vehicle enters the intersection at the time a decision is made. Furthermore, it becomes more likely that more vehicles are standing still on a road pair. Both these factors lead to a high estimated green time needed. As the cycle length is fixed to 90 seconds, however, the estimated green time will more often exceed the maximum green time of 58 seconds and thus be set to the default green time, see Section 3.2. This makes the algorithm behave like the FA when the arrival rate is high enough, as in this example. The reason it still performs better than the FA is the prioritization of road pairs with the highest density. The reason the DA performs better when the distribution is set to 90% NS and 10% EW is thus that the 10% leads to a low enough density for the algorithm to predict green times lower than the maximum, thus allowing green time distribution.
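The fallback behaviour described above can be summarised by the sketch below. The 29-second default and the 58-second maximum are the values quoted in the text; the exact allocation rule is given in Section 3.2, so this function is only an illustration of why the DA approaches the FA under heavy traffic.

    def allocate_green_time(estimated_green, default_green=29, max_green=58):
        # An estimate above the 58 s maximum falls back to the default schedule,
        # which is why the DA behaves like the fixed schedule under heavy traffic.
        if estimated_green > max_green:
            return default_green
        # Below the maximum, the estimate is used directly (a simplification;
        # the exact adjustment rule is described in Section 3.2).
        return estimated_green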

Figure 4.5: Average time spent in the system with arrival rate p = 1, 2000 epochs.

Furthermore, the results during training can be seen in Figure 4.6. It seems as if the average time in the system converges for the distributions 70% NS 30% EW and 90% NS 10% EW. Some variation occurs for the distribution 50% NS 50% EW during the last 500 epochs, but as the algorithm still performs better than the DA, no further tests will be performed.


Figure 4.6: Average time in the intersection during training of RL for p = 1, 2000 epochs.

4.1.3 Results and Analysis with Arrival Rate p = 1/5

Figure 4.7 shows the average time in the system for arrival rate p = 1/5. The algorithm settings are identical to the ones described for arrival rates p = 1/3 and p = 1. As can be seen, the general patterns repeat themselves for the FA and the DA. However, the RL does not behave as expected considering previous experiments.


Figure 4.7: Average Time in the System with Arrival Rate p= 1/5, 2000 epochs, 500 meters.

To determine if the algorithm has converged, the average time in the system for each epoch during the training was studied as can be seen in Figure 4.8.


Figure 4.8: Average time in the system during training with arrival rate p=1/5, 2000 epochs.

The RL does not appear to have converged, as some oscillation still seems to occur. Slight oscillation is expected, as 1% exploration is allowed at minimum. The average time in the system for the distribution 70% NS 30% EW seems to oscillate the least, which is the only case where the RL outperforms the DA. It was assumed that this was partially due to the state and reward considering the entire road in the RL. Each incoming road is 500 meters long, and the number of vehicles on the entire road is not necessarily relevant for actions. This was further tested by training the RL
