
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2019

A comparison of genetic algorithm and reinforcement learning for autonomous driving

KTH Bachelor Thesis Report


Abstract

This paper compares two different methods, reinforcement learning and a genetic algorithm, for designing an autonomous car's control system in a dynamic environment.

The research problem can be formulated as follows: how does the learning efficiency of reinforcement learning compare with that of a genetic algorithm for autonomous navigation through a dynamic environment?

In conclusion, the genetic algorithm outperforms reinforcement learning on mean learning time, despite the fact that the former shows a large variance; that is, the genetic algorithm provides better learning efficiency.

Keywords



Acknowledgements


Authors

Ziyi Xiang <zxiang@kth.com>

Information and Communication Technology KTH Royal Institute of Technology

Place for Project

Stockholm, Sweden

Examiner

Professor Örjan Ekeberg

KTH Royal Institute of Technology

Supervisor

Jana Tumová


Contents

1 Introduction
  1.1 Problem statement
  1.2 Delimitations
2 Theoretical Background
  2.1 Reinforcement learning
  2.2 Genetic algorithm
3 Methods
  3.1 Measurement evaluation
  3.2 The simulation
  3.3 The agent's control system
  3.4 The reinforcement learning's car controller
  3.5 The genetic algorithm's car controller
4 Result
  4.1 Reinforcement learning test
  4.2 Genetic algorithm test
  4.3 Comparison
5 Conclusion
6 Discussion
  6.1 Safety
  6.2 Implementation difficulty and time cost
  6.3 Further study

1 Introduction

Autonomous driving is widely debated among experts. It has good potential to improve safety over human drivers: the cars cannot get distracted and always obey traffic rules.

Autonomous cars are already being developed by many companies such as Volvo, Tesla, and Google [6, 14]. The cars use sensors such as radar or cameras to observe the environment. The movement control system predicts the environment based on the sensors' observations and makes the movement control decisions. These solutions can still be optimized, since full automation has not yet been achieved; most systems are at a partially automated level [2].

The level classification system, published by the Society of Automotive Engineers (SAE International) in 2014, is based on driver intervention and attentiveness [13]. According to their definition, a partially automated car system controls some functions such as speed and steering, but the driver must always be prepared to take control when needed, while a fully automated car can operate in all conditions without any driver input.

Autonomous cars are closely associated with machine learning and artificial intelligence. Machines in the industry need to become smarter in order to accomplish tasks of increasing difficulty. Machine learning is considered a technology that helps robots learn from and interact with the environment.

Deep reinforcement learning is widely used and is a very efficient way to design AI behaviors [5]. On the other hand, the genetic algorithm has also proved to be a successful technique for the optimization of automatic robot and vehicle guidance [4]. Both methods require a large set of samples to learn from, which can make producing sample data expensive [4, 5].

1.1 Problem statement

This paper compares two different methods, reinforcement learning and a genetic algorithm, for designing an autonomous car's control system in a dynamic environment. The research problem can be formulated as follows: how does the learning efficiency of reinforcement learning compare with that of a genetic algorithm for autonomous navigation through a dynamic environment? A dynamic environment here means that the environment is a real-time simulation; the agents continuously make decisions in this environment based on observations. The learning efficiency is defined by how many samples are used in order to obtain an acceptable result.

1.2 Delimitations

2 Theoretical Background

This study focuses on reinforcement learning and the genetic algorithm for designing a car control system. Both methods are capable of handling decision-making and optimization problems. Action prediction is a decision-making problem based on observation, and an optimized solution to such a problem can be obtained by an optimization algorithm [4, 5].

In this section, the paper introduces some basic theoretical concepts and research findings for both algorithms.

2.1 Reinforcement learning

2.1.1 Markov decision process

The Markov decision process (MDP) is defined by a tuple [1]:

(S, A, P(s, a, s'), γ, R)

S denotes a set of states:

s ∈ S

The state is defined by the agent itself and the environment surrounding the agent, such as velocity, position, mass, or distance to obstacles. In a real-life scenario, the environment can be observed by cameras and sensors, and the data obtained from these observations in turn represent the states of the vehicle.

A denotes a set of actions:

a ∈ A

The agent can take different actions in a given state, such as accelerating and rotating. The action set A is defined by the available choices the agent can make in a particular state.

A transition function:

P(s, a, s')

The transition function gives, when we take the current state s as the starting point and take an action a, the probability that the agent lands in the state s'.

A discount factor γ indicates how much future rewards the agent should care about compared to current rewards.

A reward function:

R(s, a, s')

The reward function represents the reward the agent receives when it takes action a in state s and lands in the new state s'.

The agent constantly observes the environment and acts based on these observations. After the agent performs an action, the reward associated with the observation of what just happened is calculated. The action policy is updated based on what the agent has learned, so future decision-making is influenced by previous attempts. This creates a feedback cycle, and the process repeats until the agent finds a policy that maximizes its reward.

The MDP describes the agent's decision-making process for optimal action selection [1]. The main objective is to find an optimized policy that maximizes future reward. A policy contains guidelines for the agent on which action to take in each and every state, whereas the optimized policy enables the agent to maximize the rewards in every state by choosing the best possible action.

The mathematical formulation of this objective is defined as follows, where π denotes the policy [1]:

max_π E[ Σ_{t=0}^{H} γ^t R(s_t, a_t, s_{t+1}) | π ]

H denotes the horizon, which indicates the length of the finite sequence of states. The policy π is a prescription that maps each state to a corresponding action. The discount factor γ indicates how much future rewards the agent should care about compared to current rewards.
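As a hypothetical illustration (not a case from the thesis): with γ = 0.9, a reward of 1 at every step, and a horizon of H = 3, the objective for that trajectory evaluates to

1 + 0.9 · 1 + 0.9^2 · 1 + 0.9^3 · 1 = 3.439

so rewards that lie further in the future contribute progressively less to the objective.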

Dynamic programming and policy optimization are two key approaches to solving the MDP problem [1, 10].

2.1.2 Dynamic programming

In dynamic programming, the program finds the policy that maximizes the expected reward based on past experiences [1].

A famous method for solving MDPs using the concepts of dynamic programming is called Q-learning. Q-learning learns the policy by trial and error from stored data. It uses past experiences to calculate the expected reward for each action in a given state and iteratively updates the reward estimates.

If we have a problem defined with n possible actions and a total of m finite states, we can create an m-by-n matrix and fill in, for every state-action pair, the maximal reward the agent can obtain from that point.

The problem with dynamic programming is its huge time and space complexity when the problem size scales up. If the state and action spaces are large or infinite, there is not enough memory to store all the data and the calculation becomes slow [1].
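As a minimal sketch of the m-by-n table described above (this is illustrative code, not the thesis implementation, which relies on ML-Agents; the state and action counts and the class itself are hypothetical), the core of tabular Q-learning is a single update rule applied after every observed transition:

using System;

class QLearningSketch
{
    const int numStates = 100;      // m finite states (assumed size)
    const int numActions = 3;       // n possible actions (assumed size)
    const double alpha = 0.1;       // learning rate
    const double gamma = 0.9;       // discount factor

    // One expected-reward value per state-action pair, all zero initially.
    readonly double[,] q = new double[numStates, numActions];

    // Standard Q-learning update after observing (s, a, reward, sNext).
    public void Update(int s, int a, double reward, int sNext)
    {
        double best = double.MinValue;
        for (int a2 = 0; a2 < numActions; a2++)   // best value reachable from sNext
            best = Math.Max(best, q[sNext, a2]);

        q[s, a] += alpha * (reward + gamma * best - q[s, a]);
    }
}

The table grows as m · n, which is exactly the memory problem described above when the state space becomes large or continuous.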

2.1.3 Policy optimization

In subsection 2.1.2, we introduced some concepts of reinforcement learning using dynamic programming. Dynamic programming has complexity problems; as a result, the maximum reward cannot be calculated efficiently. An alternative is to use a policy optimization method.

The objective of policy optimization is to find a policy function π with parameters θ that maximizes the total reward.

θ is a parameter or weight vector for the policy π. Gradient descent (an optimization method that iteratively adds or subtracts small values while searching for the optimal input) can be used to update the policy by changing θ.

By using policy optimization, the algorithm increases the probability of taking actions that give higher rewards and decreases the probability of actions that performed worse than the latest experience. It evaluates the performance of the policy and uses that to influence the next iteration.

The optimization method slightly changes the θ of the policy based on the latest performance. Unlike Q-learning, it does not store offline data in memory; it learns directly from how the agent is acting. Once the policy is updated, the old experience is discarded.
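The thesis does not write out the update rule; for reference, the standard policy-gradient form that this family of methods builds on estimates the gradient of the expected return J(θ) from sampled experience and takes a small step of size α in that direction:

∇θ J(θ) = E[ ∇θ log πθ(a_t | s_t) · R_t ]

θ <- θ + α ∇θ J(θ)

where R_t is the return observed after taking action a_t in state s_t. PPO, introduced next, constrains how large such a step is allowed to be.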

2.1.4 Proximal Policy Optimization

Proximal Policy Optimization (PPO) is a method based on the idea of policy optimization [7]. It has been shown to solve a wider range of problems than Q-learning, especially complex ones. It is the default reinforcement learning method used by OpenAI [7] and the Unity Machine Learning Agents Toolkit (ML-Agents) [8].

In policy optimization, the sampling operation is not efficient, because the data are used for only one policy update and are discarded after use. Moreover, the result is not stable, due to the large shifts in the distribution of observations and the policy. If the program takes a step too far from the previous policy, it changes the entire distribution of behavior in the environment; recovering the old policy is then difficult, and the policy can end up in a bad position. Therefore, a more stable algorithm is required.

PPO limits the policy update by defining a maximum distance, called a region. The algorithm optimizes a local approximation and finds the optimal point inside this bounded region; as a result, the updated policy can no longer move too far away from the old policy.
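The thesis does not reproduce the objective; for reference, the clipped surrogate objective from [7] that implements this bound reads

r_t(θ) = πθ(a_t | s_t) / πθ_old(a_t | s_t)

L_CLIP(θ) = E_t[ min( r_t(θ) Â_t , clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]

where Â_t is an estimate of the advantage of the chosen action and ε (typically around 0.2) sets the size of the bounded region, so the probability ratio r_t(θ) cannot move far from 1.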

2.1.5 Neural network and deep reinforcement learning

Deep reinforcement learning is based on neural networks. A neural network is a matrix-based network system inspired by the biological neural networks of animal brains [3].

In deep reinforcement learning, the algorithm represents the policy in the form of a neural network. The basic idea is to identify the correct action by utilizing a neural network that maps states into actions.

Let us introduce a simple neural network. This neural network can be seen as a 2-dimensional matrix consisting of multiple arrays: each neuron inside a so-called layer holds a number, and the layer itself is an array of numbers.

Figure 2.1: A simple neural network with 2 hidden layers of size 3

For each pair of connected neurons i and j, there is a weight value assigned to the connection between them. With each update, the neurons in the left layer update their connected neurons by adding their own values multiplied by the connection weights [3].

Once the neurons in the input layer receive the inputs, they update the connected neurons on the right by multiplying their own value with the connection weight and adding the result to the target value. This iteration repeats until all input values have passed through the network. We apply a mathematical function such as the sigmoid function [3] to limit each value to between 0 and 1. This is used for models where we have to predict probabilities in the range 0 to 1.

Sigmoid function:

g(z) = 1 / (1 + e^(-z))

The layers between the input layer and the output layer are called hidden layers [3]. The hidden layers are mainly used to increase the capacity of the network, which gives the neural network the opportunity to represent more advanced mappings. This is especially useful for solving large and complex problems, but increased complexity can also make learning more difficult.
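A minimal sketch of the layer update just described, written in C# like the rest of the project (illustrative code, not taken from the thesis):

using System;

static class LayerSketch
{
    static double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

    // weights[i, j] is the connection weight from left neuron i to right neuron j.
    public static double[] NextLayer(double[] left, double[,] weights)
    {
        int width = weights.GetLength(1);          // number of neurons in the right layer
        var right = new double[width];
        for (int j = 0; j < width; j++)
        {
            double sum = 0.0;
            for (int i = 0; i < left.Length; i++)
                sum += left[i] * weights[i, j];    // neuron value times connection weight
            right[j] = Sigmoid(sum);               // bound the result between 0 and 1
        }
        return right;
    }
}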

There are many variants of the neural network, but the basic concepts are similar. After observing the result, the network adjusts the weight values along the path using backpropagation. The details of backpropagation are not needed to understand the following chapters and are therefore not discussed here; for further explanation, see chapter 4 of the book "Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning" by James V. Stone [11].

2.2 Genetic algorithm

Genetic algorithms are methods for solving optimization problems inspired by survival of the fittest in Darwinian evolution. They are based on natural selection and are widely used to design AI behaviors and machine learning applications [9].

Genetic algorithms start from a set of randomly generated solutions to the given problem [9]. By observing the solutions' performance, the algorithm selects the most successful solutions for the reproduction of new solutions. The algorithm repeatedly creates multiple candidate solutions and observes their performance; after several iterations, the solutions evolve toward the optimum. The solutions are called chromosomes and are represented in the form of an array or a matrix [9]. An iteration of the genetic algorithm is called a generation, and the most successful agents in each generation are called parents. Each agent has its own individual solution in the form of an array or matrix, which is a mutated version of the reproduced solution inherited from its parents.

In each generation, the algorithm selects the two parents that performed best in the previous simulation and uses their chromosomes to create children. The parents' chromosomes are crossed over to create a new chromosome that combines their solutions. The program takes the child solution and replicates it to create the multiple solutions that will be used in the next generation.
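A minimal sketch of one such generation under the scheme just described (illustrative C#, not the thesis code; the flat chromosome layout, the fitness array, and the mutation rate are assumptions):

using System;
using System.Linq;

static class GenerationSketch
{
    const double mutationRate = 0.05;   // assumed probability p of mutating a gene

    public static double[][] NextGeneration(double[][] population, double[] fitness, Random rng)
    {
        // Indices of the two fittest chromosomes: the "parents".
        int[] order = Enumerable.Range(0, population.Length)
                                .OrderByDescending(i => fitness[i])
                                .ToArray();
        double[] father = population[order[0]];
        double[] mother = population[order[1]];

        // Crossover: each gene is taken from either parent with equal probability.
        double[] child = new double[father.Length];
        for (int g = 0; g < child.Length; g++)
            child[g] = rng.Next(2) == 0 ? father[g] : mother[g];

        // Replicate the child and mutate each copy to form the next generation.
        var next = new double[population.Length][];
        for (int n = 0; n < next.Length; n++)
        {
            next[n] = (double[])child.Clone();
            for (int g = 0; g < next[n].Length; g++)
                if (rng.NextDouble() < mutationRate)
                    next[n][g] = rng.NextDouble();
        }
        return next;
    }
}

Section 3.5.1 shows the crossover and mutation functions the thesis actually applies to the network's weight matrix.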


Figure 2.2: Genetic algorithm illustration

2.2.1 Genetic algorithm using neural network

The neural network is an efficient way to represent the mapping between the states and the actions. In deep reinforcement learning, the algorithm uses the neural network to represent the policy; by utilizing the network, it maps the states into the actions.

3 Methods

3.1 Measurement evaluation

A successful test is defined as a car trained by the respective method finishing 100 laps (figure 3.2a) on the racing track without any collision. Any contact with the walls destroys the car (figure 3.1a).

The learning efficiency, or learning-cost, is defined as the number of cars that get destroyed in a simulation.

Since a perfectly optimized solution is nearly impossible to find, we define an optimized solution as follows: if an agent controlled by the solution can finish 100 laps without collision, the solution is said to be optimized. An algorithm with better learning efficiency will destroy fewer cars before finding an optimized solution, so the study measures how many cars are destroyed on average to optimize the solution, which answers our problem statement.

Figure 3.1: (a) Track surrounded by walls

Figure 3.2: Aerial view; (a) one lap of the track

3.2 The simulation

3.2.1 The simulation environment and library

The simulator is created in Unity 3D, a widely used game development platform [8]. In Unity, we can import 3D assets such as roads and cars from the asset store instead of making our own, which saves time. Unity also has useful plugins such as the Unity Machine Learning Agents Toolkit (ML-Agents) for training intelligent agents [8]. ML-Agents is a library that can be used for building reinforcement learning.


3.2.2 The motion controller

The motion controller is a C# script attached to each car that allows our agents to control the car's position (figure 3.2a).

In our simulation, the car moves in the horizontal plane with a given speed and an angle for its facing direction. The speed and the facing direction are controlled by our agents.

Here is the pseudocode for the rotation and acceleration control.

UpdateMovement(rotationAngleFromAgent, accelerationSpeedFromAgent):
    directionAngle <- directionAngle + rotationAngleFromAgent
    speed <- speed + accelerationSpeedFromAgent
    transform(currentPosition, directionAngle, speed)

The update function is called once per frame; it calculates the car's new rotation angle and current speed based on the neural network's output. The agent controls the motion by sending the update values to the UpdateMovement function.

Transform is a built-in Unity function that moves an object in a given direction.
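A hedged sketch of how such a motion controller could look as a Unity MonoBehaviour (the thesis only gives pseudocode; the class name and the specific use of transform.Translate and Quaternion.Euler are assumptions):

using UnityEngine;

public class CarMotionController : MonoBehaviour
{
    float directionAngle;   // facing direction in degrees
    float speed;            // current forward speed

    // Called by the agent once per frame with the network's output.
    public void UpdateMovement(float rotationFromAgent, float accelerationFromAgent)
    {
        directionAngle += rotationFromAgent;
        speed += accelerationFromAgent;

        // Rotate around the vertical axis and move forward in the horizontal plane.
        transform.rotation = Quaternion.Euler(0f, directionAngle, 0f);
        transform.Translate(Vector3.forward * speed * Time.deltaTime, Space.Self);
    }
}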

3.2.3 The sensors

The sensor script allows the agent to observe the current state by continuously scanning the surrounding environment in each frame. The sensors consist of 7 laser beams with 30-degree angles between them, placed at the front of each car (figure 3.3).

Figure 3.3: Sensors
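A hedged sketch of such a sensor script using Unity raycasts (the class name, the use of Physics.Raycast, and the maximum range of 8.0 taken from section 3.4.1 are assumptions, not the thesis implementation):

using UnityEngine;

public class CarSensors : MonoBehaviour
{
    const int beamCount = 7;
    const float angleStep = 30f;    // degrees between beams
    const float maxRange = 8f;      // assumed maximum detection length

    // Returns the distance to the nearest wall along each of the 7 beams.
    public float[] Scan()
    {
        var distances = new float[beamCount];
        for (int i = 0; i < beamCount; i++)
        {
            // Spread the beams symmetrically around the car's forward direction.
            float angle = (i - beamCount / 2) * angleStep;
            Vector3 direction = Quaternion.Euler(0f, angle, 0f) * transform.forward;

            distances[i] = Physics.Raycast(transform.position, direction, out RaycastHit hit, maxRange)
                ? hit.distance
                : maxRange;
        }
        return distances;
    }
}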

3.3 The agent's control system

The study illustrates two high-level figures that summarize the design of the control system.

The control system in reinforcement learning has a feedback cycle (figure 3.4a). The feedback cycle requires the states (the laser observations), the actions (acceleration and rotation), and the reward system. The agent is connected to a neural network using the ML-Agents library, which automatically updates the neural network in each iteration (see the theory in section 2.1.4).

In the genetic algorithm, the program does not need a reward system, but it requires functions that select the parents and cross over and mutate the chromosomes (figure 3.5a).

The genetic algorithm builds on the most successful agents from previous attempts. In each so-called generation, a chosen number of cars are created and simultaneously sent along the track. When all cars have collided with the walls, the algorithm picks the two cars that stayed on the track the longest to be the parents.

The simulation sends 5 or 10 cars along the path (figure 3.6); when all the cars have been destroyed, it starts a new generation with updated neural networks. This process repeats until the program reaches the goal we have set.


Figure 3.6: Simulation of the GA with 5 cars per generation

3.4 The reinforcement learning's car controller

The car controller for the reinforcement learning agent interacts with the neural network and sends the speed and rotation changes to the motion controller.

In each frame update, the car receives the observation, an array of floating-point values from the sensors, and sends the observation to the neural network. There is a total of 7 lasers, so the size of the input layer is also 7. The network returns two corresponding actions, the acceleration and the rotation, so the output layer has a size of 2 (figure 3.7).

The size of the neural network is a trade-off between the quality of the mapping and the learning cost. A smaller network learns faster, but the mapping from states to actions will be simpler, which means it can be difficult to solve a large problem with multiple inputs.

The simulation uses a neural network of size 2x64, i.e. 2 hidden layers of length 64, which is considered large enough to solve our problem.

The program can use a built-in function from ML-Agents to directly access the output layer of the network. The first element of the output array represents the rotation direction and the second element represents the acceleration. The rotation speed and the acceleration speed are two predefined variables that can be changed.

Here is the pseudocode.

Rotation(outputLayer[0]):
    if outputLayer[0] is 1 then            // make a right turn
        directionAngle <- directionAngle + rotationSpeed
    if outputLayer[0] is 2 then            // make a left turn
        directionAngle <- directionAngle - rotationSpeed
    if outputLayer[0] is 0 then            // do nothing
    return directionAngle

Acceleration(outputLayer[1]):
    if outputLayer[1] is 1 then            // accelerate
        speed <- speed + accelerationSpeed
    if outputLayer[1] is 2 then            // decelerate
        speed <- speed - accelerationSpeed
    if outputLayer[1] is 0 then            // do nothing
    return speed

The controller passes the observed states using a built-in function called AddVectorObs(). AddVectorObs() takes a vector, in our case from the sensors, and sends it to the neural network.

Figure 3.7: The reinforcement learning neural network

3.4.1 The reward system

The reward system defines how much a policy should be rewarded or punished by observing the result.

The maximum length of each sensor vector is set to 8.0. We create a constant variable r which represents the ratio of the reward gains.

In each frame update, if the distance between the car and the wall is in the range of 0 to 2.5, the program adds a penalty of 5 times r for dangerous driving. If the car crashes, a penalty of 100 times r is added; otherwise the agent gets a reward of 1.

CheckReward(sensors):
    for i <- 0 to sensors.length
        if colliding then
            AddReward(-r * 100)
        else if sensors[i].distanceToWall <= 2.5 then
            AddReward(-r * 5)
        otherwise
            AddReward(1)

The controller and the reward system are the only parts that need to be implemented; ML-Agents contains the reinforcement learning implementation inside its library, which handles all network changes and iteration updates.
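A heavily hedged sketch of how the controller and reward system could be tied together as an ML-Agents Agent. Only AddVectorObs() and AddReward() are named in the thesis; the class layout, the 2019-era AgentAction signature, the rotationSpeed, accelerationSpeed, and r values, and the CarSensors and CarMotionController classes sketched earlier are assumptions:

using UnityEngine;
using MLAgents;   // namespace of the 2019-era ML-Agents releases; newer versions differ

public class CarAgent : Agent
{
    public CarSensors sensors;              // hypothetical sensor script (section 3.2.3)
    public CarMotionController motion;      // hypothetical motion controller (section 3.2.2)

    const float rotationSpeed = 2f;         // assumed predefined variable
    const float accelerationSpeed = 0.1f;   // assumed predefined variable
    const float r = 0.01f;                  // assumed reward ratio

    public override void CollectObservations()
    {
        // Pass the 7 laser distances to the neural network.
        foreach (float d in sensors.Scan())
            AddVectorObs(d);
    }

    public override void AgentAction(float[] vectorAction, string textAction)
    {
        // Discrete choices as in the pseudocode above: 0 = do nothing, 1 = +, 2 = -.
        int turn = (int)vectorAction[0];
        int accel = (int)vectorAction[1];
        float rotation = turn == 1 ? rotationSpeed : turn == 2 ? -rotationSpeed : 0f;
        float acceleration = accel == 1 ? accelerationSpeed : accel == 2 ? -accelerationSpeed : 0f;
        motion.UpdateMovement(rotation, acceleration);

        // Reward shaping from section 3.4.1: penalty near a wall, small reward otherwise.
        foreach (float d in sensors.Scan())
            AddReward(d <= 2.5f ? -r * 5f : 1f);
    }

    void OnCollisionEnter(Collision collision)
    {
        AddReward(-r * 100f);   // crash penalty
        Done();                 // end the episode
    }
}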

3.5 The genetic algorithm's car controller

The size of the neural network remains the same as the network in reinforcement learning, which is 2x64.

The interaction with the neural network gives the agent rotation and acceleration control; it is implemented in a similar way to the reinforcement learning version.

Rotation(outputLayer[0]):
    if outputLayer[0] is in range of 0.21 and 0.6 then     // make a right turn
        directionAngle <- directionAngle + rotationSpeed
    if outputLayer[0] is in range of 0.61 and 1 then       // make a left turn
        directionAngle <- directionAngle - rotationSpeed
    if outputLayer[0] is in range of 0 and 0.2 then        // do nothing
    return directionAngle

Acceleration(outputLayer[1]):
    if outputLayer[1] is in range of 0.21 and 0.6 then     // accelerate
        speed <- speed + accelerationSpeed
    if outputLayer[1] is in range of 0.61 and 1 then       // decelerate
        speed <- speed - accelerationSpeed
    if outputLayer[1] is in range of 0 and 0.2 then        // do nothing
    return speed

The neural network itself looks like a 2-dimensional matrix consisting of multiple arrays of different lengths (figure 3.8). The car controller receives the input array from the sensors and sends it to the neural network. The function utilizes the neural network, using the weight matrix, to map the observations to the actions.

The genetic algorithm uses the network weights to represent the chromosomes. The crossover and mutation functions are then applied to the matrix of weights. For each connected pair of neurons in the neural network there is a weight value, and the network has multiple layers, so the weights form a 3-dimensional matrix: the indices i, j, k in weights[i][j][k] stand for the layer number, the position in the current layer, and the position in the previous layer (figure 3.8).

Here is the pseudocode which describes the mapping from the observations to the actions.

FeedForward(inputs):
    neurons[0] <- inputs
    for i <- 1 to numberOfLayers - 1                // from the second layer through the network
        for j <- 0 to lengthOf(neurons[i])          // each neuron in the current layer
            value <- 0
            for k <- 0 to lengthOf(neurons[i - 1])  // each connected neuron in the previous layer
                value += weights[i - 1][j][k] * neurons[i - 1][k]
                // the connection weight between the two neurons
                // times the value of the left connected neuron
            neurons[i][j] <- Sigmoid(value)
    return neurons[numberOfLayers - 1]              // output layer

Each neuron in the left layer multiplies its value by the connection weight and sends the result to all connected neurons on the right. Each neuron on the right then sums the results from the whole left layer and applies a sigmoid function [12] to bound the result between 0 and 1. This process repeats until the updates have passed through all hidden layers and the network returns a solution.

3.5.1 Selection, crossover and mutation functions

In each generation, the genetic algorithm selects the two agents that survive the longest and combines their chromosomes using the crossover function. As a result, the children randomly inherit from their parents.

Crossover(father, mother):
    for i, j, k to size of weights
        rand <- Random(0 or 1)
        if rand is 1 then
            weights[i][j][k] <- father[i][j][k]
        else
            weights[i][j][k] <- mother[i][j][k]
    return weights

The mutation function goes through all elements in the weight matrix and, with probability p, replaces each value with a random number.

Mutate(weights):
    for i, j, k to size of weights
        if RandomFloat(0 to 1) < p then
            weights[i][j][k] <- RandomFloat(0 to 1)

4 Result

The results of the simulation are presented in this chapter and are used to answer the research question.

4.1 Reinforcement learning test

The statistics from the tables (table 4.1, table 4.2) are used to compare the learning-cost of achieving an optimized solution. In the reinforcement learning simulation, the program performs 10 tests with the same configuration to generate a reliable result. The simulation destroyed 3763 cars on average to optimize the solution, with a maximum learning-cost of 7054 cars and a minimum of 2934. The data points tend to cluster around three thousand, which supports the reliability of this data set.

Table 4.1: Number of cars destroyed to finish 100 laps


Figure 4.1: Number of cars destroyed to finish 100 laps

Reinforcement learning cannot find a good solution with just a few attempts and takes a longer time to learn, but the variance of its learning-cost is smaller and more stable. Unlike the genetic algorithm, reinforcement learning optimizes the solution in a stable way by constantly making small progress within a bounded region, which is defined by the ML-Agents implementation of reinforcement learning.

4.2 Genetic algorithm test

17 of 20 attempts finished 100 laps without colliding with the walls (table 4.2).

The genetic algorithm destroyed 3005 cars on average when using 10 cars in each generation, which is lower than the 3763 cars from reinforcement learning.


Table 4.2: Number of cars destroyed to finish 100 laps

SAMPLE    5 cars each generation    10 cars each generation
Test 1    12424                     3869
Test 2    13704                     2409
Test 3    9234                      4499
Test 4    8474                      819
Test 5    5234                      6559
Test 6    17263                     1789
Test 7    18712                     1299
Test 8    2164                      939
Test 9    5624                      3229
Test 10   15672                     4639

Figure 4.2: Number of cars destroyed to finish 100 laps


4.3 Comparison

The genetic algorithm outperforms reinforcement learning on mean learning time, despite the fact that the former shows a large variance.

Referring to the box diagrams 4.1 and 4.2, the genetic algorithm starts strong, finding optimized solutions efficiently, but this performance is unstable. Several trials have a small cost, which lowers the average learning-cost, but the genetic algorithm could be surpassed by the reinforcement learning variant if the number of tests were increased.

5 Conclusion

Both methods have proved to be successful techniques for autonomous control optimization, but they have different performance characteristics.

In conclusion, the genetic algorithm outperforms reinforcement learning on mean learning time, despite the fact that the former shows a large variance; that is, the genetic algorithm provides better learning efficiency. This answers our research question: how does the learning efficiency of reinforcement learning compare with that of a genetic algorithm for autonomous navigation through a dynamic environment?

6 Discussion

At the current stage, the learning rates are very similar for both algorithms. It is difficult to draw a firm conclusion from the current simulation, but the genetic algorithm is more likely to have a lower learning-cost than reinforcement learning to finish 100 laps. Reinforcement learning has a higher learning-cost, but its results have a smaller variance compared with the genetic algorithm. A smaller variance indicates that the data points tend to be close to the mean, which gives better statistical significance for the mean.

6.1 Safety

6.2 Implementation difficulty and time cost

The genetic algorithm is easy to implement, while reinforcement learning takes longer to design: small changes can lead to big differences in performance, and without a good reward system the agents perform poorly, costing more cars without finding a single solution. The U-curve in the training ground requires a sharp turn, and the agent can easily get stuck there if it cannot find a solution.

The genetic algorithm is easier to modify, for example by changing the mutation probability if the car gets stuck. The genetic algorithm also has a much better learning time to finish 100 laps, simply because multiple agents learn in parallel in the genetic algorithm, while reinforcement learning trains one agent at a time.

6.3 Further study

The implementation can still be optimized by adjusting the configuration.

In the genetic algorithm, the mutation probability and the number of cars in each generation could be tuned, which might result in a better learning rate and accuracy. For reinforcement learning, there is room for improvement in the design of the reward system.

For each algorithm, we only performed 10 tests, which is fairly small due to time limitations. The lasers provide a limited number of detections and do not cover all angles, so the agent cannot obtain full information about the surrounding environment, which can lead to wrong predictions. The inputs and outputs are single observations and actions; ideally, the agent should predict a combination of actions over a period of time instead of a single action per state.


References

[1] Abbeel, Pieter. "MDPs: Exact Methods". 2012. URL: https://people.eecs.berkeley.edu/~pabbeel/cs287-fa12/slides/mdps-exact-methods.pdf, visited 2019-05-01.

[2] Bimbraw, Keshav. Autonomous Cars: Past, Present and Future - A Review of the Developments in the Last Century, the Present Scenario and the Expected Future of Autonomous Vehicle Technology. Tech. rep. Thapar University, 2015.

[3] Dabbura, Imad. "Coding Neural Network - Forward Propagation and Backpropagation". 2018. URL: https://towardsdatascience.com/coding-neural-network-forward-propagation-and-backpropagtion-ccf8cf369f76, visited 2019-04-01.

[4] Fleming, Peter and Purshouse, Robin. Genetic Algorithms in Control Systems Engineering. Tech. rep. The University of Sheffield, May 2002.

[5] Fridman, Lex. "MIT 6.S094: Deep Learning for Self-Driving Cars". 2019. URL: https://selfdrivingcars.mit.edu/resources/.

[6] Hendrickson, Josh. "What Are the Different Self-Driving Car 'Levels' of Autonomy?". 2019. URL: https://www.howtogeek.com/401759/what-are-the-different-self-driving-car-levels-of-autonomy, visited 2019-05-10.

[7] Schulman, John et al. Proximal Policy Optimization Algorithms. Tech. rep. 2017.

[8] Juliani, A. et al. "Unity: A General Platform for Intelligent Agents". arXiv preprint arXiv:1809.02627, 2018. URL: https://github.com/Unity-Technologies/ml-agents, visited 2019-03-01.

[9] MathWorks. "What Is the Genetic Algorithm?". 2019. URL: https://de.mathworks.com/help/gads/what-is-the-genetic-algorithm.html, visited 2019-04-01.

[11] Stone, James V. Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning. Apr. 2019, pp. 37-62. ISBN: 9780956372819.

[12] TutorialsPoint. "Genetic Algorithms Tutorial". 2016. URL: https://www.tutorialspoint.com/genetic_algorithms/genetic_algorithms_tutorial.pdf, visited 2019-03-01.

[13] SAE International. "SAE International Releases Updated Visual Chart for Its 'Levels of Driving Automation' Standard for Self-Driving Vehicles". Warrendale, PA, 2018. URL: https://www.sae.org/news/press-room/2018/12/sae-international-releases-updated-visual-chart-for-its-%E2%80%9Clevels-of-driving-automation%E2%80%9D-standard-for-self-driving-vehicles, visited 2019-05-10.

