
DEGREE PROJECT IN THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2021

Conditional Imitation Learning for Autonomous Driving
Comparing two approaches

JESPER HAGSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Conditional Imitation Learning for Autonomous Driving: Comparing two approaches

JESPER HAGSTRÖM

Master in Computer Science
Date: January 5, 2021
Supervisor: Mika Cohen (KTH, FOI), Farzad Kamrani (FOI)
Examiner: Erik Fransén
School of Electrical Engineering and Computer Science
Host company: Swedish Defence Research Agency (FOI)
Swedish title: Betingad Imitationsinlärning för självkörning: Jämförelse av två angreppssätt


The aim of this study was to build, train, and test two different self-driving agents using machine learning techniques, specifically neural networks. To train the agents, a technique called Imitation Learning was used. Imitation Learning is a technique for teaching agents sequential decision-making through demonstrations from an expert (usually a human). Two slightly different network architectures were compared. The difference between them was that the control module for a specific intention (which here denotes, for example, which direction to take in an intersection) was placed either early or late in the neural network. The testing of the trained models observed driving capabilities such as the number of completed intentions and speed. The early model had a significantly higher speed than the late model and also appeared to drive generally "better" than the late model, although without significance in the statistical evaluation. This may be an effect of too small a sample size, which could have been remedied by using other tools during training and testing. It was also found that the early model's throttle values, accumulated during testing, were closer to the expert's throttle distribution.


Conditional Imitation Learning for Autonomous Driving: Comparing two approaches

Jesper Hagström, Royal Institute of Technology

Abstract—This study aimed to build, train, and test two different autonomous vehicle (AV) agents using machine learning techniques, specifically neural network architectures. To train the agents, a technique called Imitation Learning was used. Imitation Learning is an approach for learning sequential decision-making from demonstrations provided by an expert. Two slightly different neural network architectures were compared. The difference was that the intentional command module (which, for example, denotes what direction to take in an intersection) was located either at the beginning or at the end of the respective networks. The testing of the trained models examined their driving capabilities, such as intentions completed and speed. The early model was significantly faster than the late model and seemed to drive "better" in general, but without a significant difference from the late network. This could be an effect of the sample size being too small, which could have been rectified with different tools for training and testing. Additionally, it was found that the early model's gas values, acquired at testing time, were closer to the expert's gas distribution.

Index Terms—Imitation learning, Behavior Cloning, CIL, End-to-end learning, Autonomous driving, Policy Approximation, Virtual Battlespace, Multi-modal, Computer Vision.


1 INTRODUCTION

Imitation Learning is a promising approach for solving autonomous driving, i.e. autonomous cars that drive with the quality of humans and beyond. The car manufacturing company Tesla has been pioneering the field of autonomous driving, and other manufacturers follow. As of today, there seem to exist two main paradigms for cracking the nut of autonomous driving. On the one hand, there is mediated perception, where techniques like segmentation and identification are carried out to help decide what driving actions to take. On the other hand, we have behaviour reflex, where a supervised network is trained end-to-end (e.g. directly from visual input to action) to take action [5]. This study looks at the latter.

So, what questions arise when thinking about autonomous driving? The field of self-driving vehicles is a delicate matter in the sense that it gives rise to numerous problems with technical, ethical and societal implications. These range from a very simple question of decision making, such as which side of the road to drive on, to a more complex task such as lane overtaking. A question that arises with end-to-end learning is whether it is possible to learn a general self-driving policy just from the visual cue in the direction of motion. Although this is a relevant question to ask when working within this domain, the aim of the thesis is not to answer it.

• J. Hagström is a Master Thesis student at EECS (Department of Electrical Engineering and Computer Science), Royal Institute of Technology, Stockholm. E-mail: jhagstro@kth.se

This thesis has a smaller scope than learning general driving policies. It would be fair to say that self-driving's first and foremost objective is to stay on the road and to avoid obstacles. Although this is important, in this thesis it is secondary. The investigation is more specific and concerns intersections. An intricate problem appears when approaching an intersection: it is ambiguous what path to take, and a high-level intentional command is needed in order to make a decision. An analogue to this would be the passenger in a taxi who commands the driver with "Take next left, please!". This is an interesting field of research that has the potential to shed light on some of the delicate problems in path planning and autonomous driving in the field of end-to-end driving.

This thesis aims to build and compare two different agents that are given intentional commands at different locations in their respective neural network architectures. This discrepancy gives the intention different dependencies: in one case it depends on the visual cue, and in the other it is processed in parallel with it. The difference between the networks yields different driving characteristics, which are demonstrated through experiments.

1.1 Research question

The aim of this thesis is twofold. The first aim is to build two neural networks that can be trained as agents in a driving simulator. The second aim is to compare these two networks by testing their driving capabilities in equivalent settings.

The idea is that there exists a difference in driving capabilities when injecting a high-level intentional command at different locations in a neural network. To address this, the following question was formulated:

How does the location of injecting a high-level intentional controlling command into a neural network affect driving performance, in the domain of end-to-end driving?

Answering this will give a clearer understanding of how neural networks with slight differences differ from each other from a performance perspective in the field of end-to-end driving.

1.2 Delimitations

This thesis is being undertaken at Master level, which imposes a constraint on what is feasible in terms of time and resources.

The data for training the neural networks will be manually collected via a predetermined simulator: Virtual Battlespace 3 v 20.1.134 [25].

This is a major part of the thesis work and puts a constraint on the magnitude of the experiments. It also puts a constraint on the quantity of data collected. The collected driving data will be owned by the principal. The data collected is presented in Section 3.2.

1.3 Ethical considerations

Autonomous driving is a new paradigm that raises both ethical and societal issues. What if the autonomous car hits someone? Who is responsible in the case of a crash? This puts the legal system on the spot and raises discussions that are important to have before putting autonomous driving systems online. It also puts a lot of responsibility on the engineers developing these systems.

This thesis works in the domain of simulated autonomous driving, which luckily does not pose any danger to others. Even though simulation is beneficial from an ethical standpoint, there are things to consider. Especially if the system is intended to be deployed into the real environment, the discrepancies and similarities should be identified.

These could, for example, be gravitational forces and air dynamics affecting the vehicle. If these things are in place, it amplifies the transfer capabilities between the simulated and real environment [13].

Furthermore, simulation is beneficial from a sustainability standpoint. It makes experiments both resource-efficient and easier to scale than performing them in the real environment. In addition, it is beneficial to consider using pre-trained neural network models that generalise well and subsequently fine-tune them to the specific tasks. One example, from this thesis's point of view, would be to use ResNet [26] as a pre-trained image module in the self-driving networks to be trained.

2 THEORETICAL FRAMEWORK

This section covers Imitation Learning and its specifics in end-to-end driving. This is followed by the sub-domain Conditional Imitation Learning, related work, and neural networks in Imitation Learning.

2.1 Imitation Learning

The term Imitation Learning is at first glance almost self-explanatory. The concept can be thought of as when your kid intensively observes you, essentially to try to mirror your actions. To imitate is to copy some type of behaviour. The purpose of Imitation Learning is to efficiently learn a desired behaviour by imitating an expert's behaviour [20]. In this thesis, the expert is a human and the learner that mimics the expert is a neural network modelled on the computer, called an agent. When the agent is presented with an activity whose behaviour is very complex and unstructured, such as driving a car, things become demanding solution-wise. Instead of, for example, trying to manually construct a heuristic for the agent to create the desired behaviour, a demonstration could suffice as long as there is a possibility to transfer this knowledge from the expert to the learner. This field of giving demonstrations and integrating them into the agent's behaviour is called Imitation Learning [20] or Learning from Demonstration. Imitation Learning is widely used and will be the term used in this paper.

This field has been investigated for several years, and algorithms for optimizing learning and addressing its issues have been developed. The key issues have been formulated as generic questions such as what to imitate, how to imitate, when to imitate and whom to imitate [12].

When and whom are still, on the whole, unexplored. What to imitate addresses what qualities in the demonstration the demonstrator intends to transfer to the agent (i.e. what is important to produce the desired behaviour).

When figuring out what to imitate, the question of how to imitate it arises. In general, the agent and demonstrator exhibit dissimilar embodiment (physical inequivalence). An example would be when turning: the driver has the capability to freely turn her head to integrate more information into the decision making, while an agent might only have a statically mounted front camera.

Usually, the agent has more constraints and fewer degrees of freedom when building models of the real world. This is closely related to the correspondence problem, where the discrepancy between the imitator and the imitated is accentuated [19]. Due to computational cost when training an agent, we have to impose constraints such as a limited resolution of the agent's visual perception compared to that of the demonstrator (in our case, the resolution of the computer screen compared to the human eye). To deal with the correspondence problem these constraints have to be accounted for [2]. More precisely, perceptual and physical equivalence should be obtained to the highest possible extent. This means making sure that the information necessary to perform the task is available to both the demonstrator and the imitator, as well as ensuring physical feasibility [2].

2.1.1 Formulation of the Imitation Learning problem in the context of end-to-end driving

Imitation Learning aims to learn a policy (i.e. a set of rules that forms a behaviour) induced from demonstrations of an expert demonstrator (or demonstrators). Within this field, this often refers to something taking place over a series of time-steps, such as grasping and throwing a ball. This behaviour could be derived from a certain trajectory containing a set of states $\tau = [s_0, \ldots, s_T]$. In this thesis, with the tools we use, we make the assumption that the states are independent of each other, which implies time-invariance (we do not use the history to predict the present). So, at any arbitrary time-step $i$, there exists a state $s_i$ where some features describing that state are present (i.e. these features describe the system in some form). For example, in end-to-end driving these features could be the raw-pixel image, velocity and steering angle.

In end-to-end driving, one way to obtain a learned policy that reproduces the demonstrated behaviour is to learn a policy that maps from input to action. We can directly compute a mapping from states $s_t$ to control inputs $u_t$ as $u_t = \pi(s_t)$ via supervised learning methods such as training feed-forward neural networks. Imitation Learning is an umbrella term, and when supervised learning is used, such as in end-to-end driving, the specific term used is Behavioural Cloning (BC) [1, 17]. In this thesis, we will continue to use the term Imitation Learning for simplification.
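To make the mapping concrete, the following is a minimal behavioural cloning sketch in Python with TensorFlow/Keras (the libraries used later in this thesis). The state dimension, network sizes and the randomly generated data are purely illustrative assumptions; the sketch only shows the idea of fitting $u_t = \pi(s_t)$ by supervised regression.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative placeholder data: 1000 recorded states (16 arbitrary features
# each) and the expert's actions (steer, gas) for those states.
states = np.random.rand(1000, 16).astype("float32")
actions = np.random.uniform(-1.0, 1.0, size=(1000, 2)).astype("float32")

# pi(s): a small feed-forward network regressing actions from states.
policy = tf.keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(16,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="linear"),   # (steer, gas)
])
policy.compile(optimizer="adam", loss="mse")
policy.fit(states, actions, epochs=5, batch_size=32, verbose=0)

u_t = policy.predict(states[:1])   # u_t = pi(s_t) for a single recorded state
```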


In the broad sense, the expert is assumed to have a policy $\pi_E$ that follows a certain distribution. The expert is usually a human that acts in a certain way. By recording the actions of this expert, and implicitly its policy, we acquire a dataset $D = \{s_i\}_{i=1}^{N}$. From this dataset $D$ a learner policy $\pi_L$ is trained to get close to $\pi_E$. The general idea:

$$\pi_E \longrightarrow D \longrightarrow \pi_L. \tag{1}$$

These two policies can be seen as abstractions of the behaviours of the expert and learner respectively. Imitation Learning aims to use some type of optimization-based strategy to minimize the distance between the expert's and the learner's state distributions in order to create the learner's policy $\pi_L$, defined as follows:

$$\pi_L = \operatorname*{argmin}_{q,\,p} \Delta(q(s), p(s)), \tag{2}$$

where $q(s)$ is the distribution of states induced by the expert's policy, $p(s)$ is the distribution of states induced by the learner, and $\Delta(q, p)$ is some difference measure between $q$ and $p$.

Imitation Learning benefits from resource-efficient training methods, and thus we aim to identify the most compact representation of the behaviour [20]. This can be referred to as a parsimonious description (i.e. finding the simplest explanation) of a certain behaviour.

In end-to-end driving, this could mean representing the state by mapping the raw pixel image from a front-mounted camera to a certain steering angle and gas/brake factor, instead of including other features such as additional sensor data (e.g. lidar distances to objects, horizontal incline of the car, etc.), which would not by default give a better mapping but could rather impose the curse of dimensionality (i.e. the state space as well as the exploration grows exponentially with more features) [16].

It may seem counter-intuitive that we can predict the speed from a stationary image. Nevertheless, the image could be representative enough to understand what speed we should have at a certain point. For example, an image containing a tree 20 meters ahead could impose more braking, based on an inherent sub-policy of not driving into a tree derived from the expert policy $\pi_E$. Additional contextual features such as velocity ($v$) can be added to the stationary image.

It is worth noting that the abstraction level in end-to-end driving is chosen to be at the action-state level. Compared to more complex abstraction levels such as task or trajectory, the action-state level requires less training data and shorter training times [20].

The training on the action-state level is done via supervised learning, which makes the assumption that the data is independent and identically distributed (i.i.d.), which it is not [22].

2.1.2 Observability

Formulation 1 implies that there exists a discrepancy between the policy of the expert and that of the learner, delimited by the dataset $D$. To elaborate on this, the number of states and the complexity of the states, in combination with the size of the dataset, affect the mapping between the policies.

The dataset limits how the learner observes the environment compared to the expert. If we see the expert's observation of the environment as a baseline for observability, the learner observes only a part of this [20].

An example of this constraint manifests itself when we only record a subset of the frames that the expert sees. In this thesis, this happens at dataset recording, where 10 frames per second (fps) are recorded, in contrast to the expert observing 30 fps on a standard monitor.

The implication of this is that there exists an informational advantage in favour of the expert, which could manifest itself in a different learner policy.

2.2 Conditional Imitation Learning

The field of Conditional Imitation Learning was first proposed by [6] and aims to address the problem that the optimal action cannot be inferred from the perceptual input alone. For example, when approaching an intersection, the camera and speed measurement input are not sufficient to infer what path to take. According to [6], the mapping from the inputs to the control command is no longer a function. Trying to approximate a function is possible but bound to run into difficulties. This issue was recognised by [21], who observed an oscillation in the dictated travel direction upon approaching a dividing road.

This problem is addressed by giving the controller a high-level command both at training and test time. This frees the trained network from the task of planning so that it can focus on the task of driving. This makes Imitation Learning, and specifically vision-based driving, feasible in more complex urban environments where intersections are abundant.

As we have seen in formulation 1, the dataset induced by the expert is in the form of the state representation $D = \{s_i\}_{i=1}^{N}$. In the domain of end-to-end driving, and particularly this thesis, the state can be divided into observation-action pairs $D = \{(o_i, a_i)\}_{i=1}^{N}$, where $o_i$ denotes observational data and $a_i$ denotes the action that pairs with that observation. The mapping between these two is a supervised, model-free problem. In the standard case [3] this is done by optimizing the parameters $\theta$ of some function $F$ according to

$$\operatorname*{argmin}_{\theta} \sum_i \ell(F(o_i; \theta), a_i), \tag{3}$$

with regard to some loss function $\ell$. This formulation implies that there exists a function $E$ that maps observations to actions, $a_i = E(o_i)$ [6]. This assumption holds when performing simpler tasks such as lane following or obstacle avoidance, since an optimal action can, according to the expert's policy, be derived from these situations. In more complex situations this function $E$ is likely to break down. In the case of end-to-end driving, the subsequent actions when approaching an intersection are not described by the perceptual input but rather by the driver's internal state or intention. This decision is ambiguous and can thus not be solved with any form of linear approximation.

With that said, even if a trained controller in a more complex environment would stay on the road and make turns in intersections, the decision making would be arbitrary and not useful in real-world applications. Codevilla et al. [6] addressed this by modelling the internal state of the expert with a vector $h$ which, together with the observation, explains the expert's action, $a_i = E(o_i, h_i)$. In the general sense $h$ could be anything that enfolds the internal state, such as the expert's intention, goals and prior knowledge. The learning objective is rewritten as

$$\operatorname*{argmin}_{\theta} \sum_i \ell(F(o_i; \theta), E(o_i, h_i)). \tag{4}$$

To focus on intentional input, where the intention is a subset of the internal state, we define $c = c(h)$, where $c$ is the intention given by the expert, both at training and test time. This will act as an auxiliary input to help the decision making at, for example, intersections.

This alteration expands the dataset induced by the expert to $D = \{(o_i, c_i, a_i)\}_{i=1}^{N}$ and gives the learning objective

$$\operatorname*{argmin}_{\theta} \sum_i \ell(F(o_i, c_i; \theta), a_i). \tag{5}$$

This gives the learner additional information about the latent state of the expert. Formulation 5 is the foundation for this thesis and the building block of the function approximators.
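As a hedged illustration of formulation (5), the branch-masked loss below computes the L2 error only on the command branch indicated by a one-hot $c_i$. This is one common way to implement the conditional objective; the tensor shapes (four branches, two control outputs) follow the architectures described in Section 3.3, while the helper name and the random tensors are illustrative assumptions.

```python
import tensorflow as tf

def conditional_l2_loss(branch_preds, actions, commands):
    # branch_preds: (batch, 4, 2) - one (steer, gas) prediction per command branch
    # actions:      (batch, 2)    - expert actions a_i
    # commands:     (batch, 4)    - one-hot intention c_i selecting the active branch
    selected = tf.reduce_sum(branch_preds * tf.expand_dims(commands, -1), axis=1)
    return tf.reduce_mean(tf.square(selected - actions))

# Tiny illustrative call with random tensors.
preds = tf.random.uniform((8, 4, 2), -1.0, 1.0)
acts = tf.random.uniform((8, 2), -1.0, 1.0)
cmds = tf.one_hot(tf.random.uniform((8,), maxval=4, dtype=tf.int32), depth=4)
loss = conditional_l2_loss(preds, acts, cmds)
```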

2.3 Related work

In the late 1980s, the self-driving model ALVINN produced remarkable results in end-to-end driving, with fairly primitive technologies by today's standards [21]. With increasing computational power as well as the development of Graphics Processing Units (GPUs), which perform well on machine learning tasks, the research within the field of end-to-end driving has received serious attention. In 2016 a study was published by Bojarski et al. with powerful results in end-to-end driving [3]. The eight-layer CNN (called PilotNet) took three camera images as input in order to produce a steering angle. With no other priors, these results were very promising. The intuition behind the success is elaborated in [4]. Since PilotNet, other sensor information has been used. Moreover, other tasks (such as brake, acceleration and intention learning) have been introduced into neural networks


with different outcomes. This thesis is mainly based on the network created by Codevilla et al. [6], where intention-based end-to-end driving is in focus.

2.4 Neural Networks and Imitation Learning

Neural networks are a promising approach for solving the supervised learning problem of Imitation Learning and Behavioural Cloning [20]. When using a feed-forward neural network, the aim is to approximate a function that fits a particular dataset. The policy learned by the network is stationary and deterministic [20]. A stationary policy implies time-invariance, meaning that no memory of previously seen states is built into the network (i.e. the network only cares about the present). The deterministic part is that the feed-forward neural network's output (as built in this report) depends only on the input it is given. The stochastic part of feed-forward neural networks is only present during training time, due to parallel computation, initialization and sampling order.

3 METHODOLOGY

This section includes the system setup, data collection, neural network architecture, training and evaluation.

3.1 System setup

The programming, recording of data, training and evaluation were done on a Dell Precision M4800 laptop with an Intel i7-4800MQ CPU (2.7 GHz, 64-bit), 32 GB RAM and an NVIDIA Quadro K2100M GPU. The simulator used for recording data and evaluating the models was Virtual Battlespace 3 v.20.1.0 (VBS3) [25]. This simulator is proprietary; therefore, customized Dynamic Link Libraries (DLLs) were programmed in C++17 to communicate with it. The training platform and model were programmed in Python 3.7.4 using PyQt 5.9.7, TensorFlow 2.1.0 and Keras 2.3.1 as the main libraries. To drive the vehicle and record input, a Hori Racing Wheel Apex steering controller was used.

3.2 Data collection

3.2.1 Dataset bias

In real-world machine learning applications such as self-driving vehicles, data bias is a core problem [27]. Supervised learning methods, as part of Imitation Learning, are particularly sensitive to this problem since learning might be dominated by main modes such as driving straight [7]. Driving a car is a complex problem which includes both simple behaviours and complex ones tied to rare events happening while engaging with the environment. As a consequence, performance loss can happen as more data is collected, because the diversity of the dataset does not grow fast enough in comparison to the main modes of demonstration [7]. Even though supervised learning enables scaling to large datasets, the assumption of i.i.d. data, in addition to dataset biases, is prone to create distributional shift as well as causal confusion [7].

3.2.2 Distributional shift and causal confusion

When approaching Imitation Learning via supervised learning, the assumption that the training and testing data are i.i.d. has to be made. However, this assumption does not hold, because previous states affect future ones [9]. This leads to a distributional shift between the expert's and learner's distributions, meaning that the training and testing state distributions are different, induced respectively by the expert's and learner's policies. The problem this creates appears at test time: when the vehicle deviates from its path, the error propagates. This is referred to as compounding errors [9]. This problem was also observed by Pomerleau in the famous work on ALVINN [21].

If the learner makes a mistake (such as starting to deviate from the centre of the road) and thus ends up in a state which $\pi_E$ never visited, it will incur a maximal cost, making the error compound by a quadratic factor [22]. Several solutions to this problem have been suggested [22, 23, 18]. This thesis addresses the problem in a similar way to [18], yet more simplified and ad hoc.


One session of recording (20% of total recording time) was done in the manner of recording only correcting situations. For example, while driving at the centre of the road, the recording was paused, the expert deviated to the edge of the road, the recording was resumed, and the corrective data was recorded. This was done to eliminate deliberate deviating data from the dataset. The ratio of corrective data was inspired by [6].

Even though the distributional shift problem is addressed, an additional issue was identified by [9]: "causal misidentification", which can be described as not being able to identify the true cause of a particular action induced by the expert. In this thesis's context it is best described by the "inertia problem", a particular causal confusion identified by [7]. The inertia problem is that of misidentifying standing still (before an intersection) with the action of no acceleration. This results in the car not driving. In [9] this problem is addressed with either environmental rewards or expert online intervention. Codevilla et al. addressed it by speed regularization via their ResNet [14] perception module. This means that they forced the image module to predict the speed from the image alone, which resulted in the network gaining more knowledge about the scene with regard to speed than with just the speed module. This thesis does not aim to address this problem. However, speed will be a variable looked at when comparing the two networks.

3.2.3 Collected dataset

The task was to utilize Conditional Imitation Learning with the intentional commands just drive, take left, take right and drive straight in the upcoming intersection. An environment with an abundance of intersections (a city-like structure) was used to collect the data for this task.

A subarea of this map is shown in Fig. 1.

Figure 1: Subarea of the map (example).

The recorded data consists of driving by the author for approximately 1 hour and 40 minutes, of which 20 minutes consist of correcting scenarios as explained in Section 3.2.2. The data was recorded at 9 frames per second (due to system constraints), where each frame maps to two continuous controller values of steer and gas (between -1.0 and 1.0), as well as the intentional command at that particular frame and the current velocity of the vehicle. Before recording the data, the author defined some constraints on the driving:

• The maximum speed was set to 25 km/h via the simulator, due to the informational loss of the recording frame rate.
• The intentions were equally distributed to a high degree, especially between left and right.

These constraints created a framework for the expert to avoid arbitrary decisions. A typical driving scenario can be seen in Fig. 2.

The frames were recorded at 800x600 pixels and subsequently cropped halfway (i.e., removing the top half of the image) to 800x300. This was followed by downsampling to 200x88 pixels, which acted as the input to the image module in the neural networks (Fig. 7, 8).
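A minimal sketch of this preprocessing step is given below. The thesis does not specify which image library or value scaling was used, so Pillow and the division by 255 are assumptions; only the crop (dropping the top half of an 800x600 frame) and the resize to 200x88 follow the text.

```python
import numpy as np
from PIL import Image

def preprocess_frame(path):
    """Crop away the top half of an 800x600 frame and downsample to 200x88,
    the input size of the image module."""
    frame = Image.open(path)                 # 800x600 (width x height)
    frame = frame.crop((0, 300, 800, 600))   # keep the lower 800x300 half
    frame = frame.resize((200, 88))          # (width, height) for the network input
    return np.asarray(frame, dtype=np.float32) / 255.0  # scaling is an assumption
```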

3.2.4 Data distributions

The distributions of intentions, controller values of steering and gas, as well as speed, can be seen in Fig. 3, 4, 5 and 6. Due to the design of the driving scenario (Fig. 1) there exists a bias towards driving straight, which explains the over-representation of the just drive intention, full gas and straight steering in the dataset. The intention distribution shows a slight bias towards the left intention over the right.


Figure 2: A typical driving scenario when approaching an intersection. The injected intention will denote what lane in the fork to aim for.

Figure 3: The distribution of the intentions in the dataset.

This is also reflected in the steering distribution, where the mean is −0.0029, i.e. a slight bias towards left steering angles. The gas distribution is saturated between approximately 0.6 and 1.0 and also peaks around 0, which maps to letting the foot off the gas. The mean of the gas values in the dataset is 0.87 with a standard deviation of 0.24. The speed distribution has roughly the same characteristics as the gas distribution, with a mean of 21.81 km/h and a standard deviation of 4.57 km/h.
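The summary statistics and histograms in Figs. 3-6 can be reproduced from the recorded values with a few lines of NumPy/Matplotlib, as sketched below. The file names and the one-column CSV format are hypothetical; in this thesis the values were stored through VBS3 and the custom DLLs.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical one-column CSV exports of the recorded dataset.
steer = np.loadtxt("steering.csv")
gas = np.loadtxt("gas.csv")
speed = np.loadtxt("speed.csv")

for name, values in (("steering", steer), ("gas", gas), ("speed", speed)):
    print(f"{name}: mean={values.mean():.4f}, std={values.std():.2f}")
    plt.hist(values, bins=100)
    plt.xlabel(name)
    plt.ylabel("# Occurrences")
    plt.title(f"Histogram of {name} values")
    plt.show()
```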

Figure 4: Histogram of the steering values in the dataset (μ = −0.0029, σ = 0.13; maximum peak value 41985). Note: the y-axis is limited due to the maximum peak at ∼0.

Figure 5: Histogram of the gas throttle values in the dataset (μ = 0.87, σ = 0.24; maximum peak value 27133). Note: gas values < 0 are left out and the y-axis is limited due to the maximum peak at ∼1.0.

3.3 Architecture

Here we present the two neural network architectures that will be compared. The main difference between the two is the location at which the intentional high-level command is injected into the network. One receives the intentional command late in the network, close to identical to the work of [6]. The other is an altered version of the late network, where the intention is received early. These two networks will be referred to as the Late Intention Network (Fig. 7) and the Early Intention Network (Fig. 8).


Figure 6: Histogram of the speed values in the dataset (μ = 21.81 km/h, σ = 4.57 km/h; maximum peak value 18521). Note: the y-axis is limited due to the maximum peak at ∼25 km/h.

Figure 7: Neural network visualisation of the late intention network, where the intentional command c is injected close to the output layer (inputs: image i, speed v, command c; a concatenation layer and an intermediate layer precede the output).

3.3.1 Late Intention Network

This architecture is heavily influenced by Codevilla et al. [6], with some minor differences. To begin with, the Adam optimizer is replaced with Nadam [11]. The learning rate is initially 0.001, with a reduction by a factor of ten when no improvement in validation error is seen for 5 epochs.

The perception module takes an input image i and consists of 8 convolutional layers with filters (32, 32, 64, 64, 128, 128, 256, 256). Each convolutional layer is succeeded by Batch Normalization [15], and the last convolutional layer is followed by two Fully Connected (FC) layers of size 512, each with a dropout rate of 0.3.

Figure 8: Neural network visualisation of the early intention network, where the intentional command c is injected in parallel with the image i and speed v modules (a concatenation layer and an intermediate layer precede the output).

The speed module consists of a speed input v connected to two FC layers of size 128, each with a dropout rate of 0.5.

The outputs of the perception module and the speed module are concatenated and connected to an FC layer of size 512 with dropout 0.5. This is then connected to four parallel command modules representing the different intentions. The intentional command is injected at this point to select which branch to train/evaluate on. These four command modules consist of two FC layers with dropout 0.5 in between. Lastly, each command module is linked to an FC linear output layer of size two, representing the two controller values. Worth noting is that in [6] a three-valued output layer is used to separate gas and brake. Here, gas and brake (reverse) are instead collected in the same output, since gas and brake were never used at the same time when collecting data.

The two output values are mapped against steering and gas values between −1.0 and 1.0. The network is 15 layers deep including the perception module and 7 layers deep without it (input and output layers included). Furthermore, all layers apply the Rectified Linear Unit (ReLU) as activation function except the last layer, which uses a linear activation. The full architecture has 6.76 million trainable parameters. See Fig. 7 for the network schematics.
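The following Keras sketch assembles the late intention network from the description above. The filter counts, dropout rates, the four command branches, and the two-unit linear output follow the text; the convolution kernel sizes and strides, the size of the branch FC layers (256), and the use of a one-hot command to select the active branch are assumptions, since they are not specified here.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

IMG_SHAPE = (88, 200, 3)   # downsampled input frames (height, width, channels)
N_COMMANDS = 4             # left, straight, right, just drive

def build_late_intention_network():
    image = layers.Input(shape=IMG_SHAPE, name="image")
    speed = layers.Input(shape=(1,), name="speed")
    command = layers.Input(shape=(N_COMMANDS,), name="command")  # one-hot intention

    # Perception module: 8 convolutional layers, each followed by batch normalization.
    x = image
    for i, filters in enumerate((32, 32, 64, 64, 128, 128, 256, 256)):
        # Kernel size 3 and stride 2 on every other layer are assumptions.
        x = layers.Conv2D(filters, 3, strides=2 if i % 2 == 0 else 1,
                          padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)
    for _ in range(2):                               # two FC layers of size 512
        x = layers.Dense(512, activation="relu")(x)
        x = layers.Dropout(0.3)(x)

    # Speed module: two FC layers of size 128 with dropout 0.5.
    s = speed
    for _ in range(2):
        s = layers.Dense(128, activation="relu")(s)
        s = layers.Dropout(0.5)(s)

    # Joint layer before the command branches.
    j = layers.Concatenate()([x, s])
    j = layers.Dense(512, activation="relu")(j)
    j = layers.Dropout(0.5)(j)

    # Four parallel command branches, each ending in a linear output of size two.
    branches = []
    for _ in range(N_COMMANDS):
        b = layers.Dense(256, activation="relu")(j)
        b = layers.Dropout(0.5)(b)
        b = layers.Dense(256, activation="relu")(b)
        branches.append(layers.Dense(2, activation="linear")(b))   # (steer, gas)

    # Stack the branch outputs and let the one-hot command pick the active one.
    stacked = layers.Lambda(lambda t: tf.stack(t, axis=1))(branches)      # (batch, 4, 2)
    selected = layers.Lambda(
        lambda t: tf.reduce_sum(t[0] * tf.expand_dims(t[1], -1), axis=1)  # (batch, 2)
    )([stacked, command])

    return Model(inputs=[image, speed, command], outputs=selected)

late_model = build_late_intention_network()
late_model.summary()
```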


3.3.2 Early Intention Network

The architecture of the early intention network is fairly similar to that of the late network, with some differences:

• The intention module is located in parallel with the perception and speed modules instead of sequentially after them. Thus, the intention module does not depend on the output from the other modules. The intention module's input is the intention (as an integer-coded value), which also acts as a branch selector as in the late network.
• The network has 6.37 million trainable parameters, slightly fewer than the late network.
• The network is 13 layers deep including the perception module and 5 layers deep without it.

See Fig. 8 for network schematics.
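For comparison, a corresponding sketch of the early intention network is given below, with the intention concatenated in parallel with the image and speed modules, as in Fig. 8. The size of the intention embedding (64) and the single output head are assumptions; the text also describes the integer-coded intention acting as a branch selector, which is not reproduced in this simplified sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

IMG_SHAPE = (88, 200, 3)
N_COMMANDS = 4

def build_early_intention_network():
    image = layers.Input(shape=IMG_SHAPE, name="image")
    speed = layers.Input(shape=(1,), name="speed")
    command = layers.Input(shape=(N_COMMANDS,), name="command")  # one-hot intention

    # Perception module: the same 8-conv stack as in the late network sketch.
    x = image
    for i, filters in enumerate((32, 32, 64, 64, 128, 128, 256, 256)):
        x = layers.Conv2D(filters, 3, strides=2 if i % 2 == 0 else 1,
                          padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dropout(0.3)(x)

    # Speed module.
    s = layers.Dense(128, activation="relu")(speed)
    s = layers.Dropout(0.5)(s)

    # Intention module, in parallel with the image and speed modules; size 64 is assumed.
    c = layers.Dense(64, activation="relu")(command)

    # All three modules meet in one early concatenation, so the intention sits as
    # close to the input side of the network as the image and speed do.
    j = layers.Concatenate()([x, s, c])
    j = layers.Dense(512, activation="relu")(j)
    j = layers.Dropout(0.5)(j)
    out = layers.Dense(2, activation="linear")(j)   # (steer, gas)

    return Model(inputs=[image, speed, command], outputs=out)

early_model = build_early_intention_network()
```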

3.4 Training

Due to the effect of different initial conditions as well as the sampling order of the data [7] when training the models, the training was iterated. Both the early and the late network were trained two times with different seeds for the random processes involved in the training. This was done to make the experiments replicable. The seeds used were chosen arbitrarily as 3874562367 and 1497284469.

The loss function and metric used when training the networks was the mean squared error (MSE or L2-loss), which was also used in [6]. In [7] the L1-loss was experimented with, which is also proposed in [20] as an interesting approach. When using the L1-loss, the outliers in the distribution get a larger weight when updating the parameters. Due to the scope of this work the L1-loss was not experimented with; the L2-loss was used exclusively.

The hyperparameters used for training the networks were never experimented with, but rather heavily influenced by [6] with some minor changes. The optimizer used in [6, 7] was Adam with a learning rate of 0.0002. The networks in this project used the Nadam optimizer with a higher learning rate (0.001), due to very slow convergence with the lower learning rate. Reduction of the learning rate was also used, by a factor of ten when the learning reached a plateau (i.e. not improving the validation error), with a patience of 8 epochs and a cool-down of 2. The batch size was 120, the same as in [6]. The models were trained for a maximum of 150 epochs, and early stopping was used with a patience of 12 epochs to account for the learning rate reduction scheme.

The training was monitored using 10% of the training data as a hold-out set to identify training performance and avoid overfitting.
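A minimal Keras training configuration matching the description above is sketched below. The small stand-in model and random arrays only make the sketch self-contained; in the thesis, the model is one of the two networks above and the arrays are the recorded frames, speeds, commands and actions. The use of `validation_split` for the 10% hold-out and `restore_best_weights` are assumptions about how this was implemented.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

tf.random.set_seed(3874562367)   # one of the two seeds used in the thesis

# Stand-ins so the sketch runs on its own.
model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(4,))])
train_inputs = np.random.rand(256, 4).astype("float32")
train_targets = np.random.uniform(-1, 1, (256, 2)).astype("float32")

model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=0.001),
              loss="mse")        # L2-loss, as in [6]

callbacks = [
    # Reduce the learning rate by a factor of ten when the validation error plateaus.
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=8, cooldown=2),
    # Early stopping with a patience long enough for the LR reduction to take effect.
    EarlyStopping(monitor="val_loss", patience=12, restore_best_weights=True),
]

model.fit(train_inputs, train_targets,
          batch_size=120,
          epochs=150,
          validation_split=0.1,   # 10% hold-out set to monitor training
          callbacks=callbacks)
```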

3.5 Evaluation

A common practice when evaluating the training of neural networks is to evaluate the loss on some test set completely held out from the training. These measurements usually act as the performance metric of the models. Due to this thesis's objective, where driving capability is the performance metric, such a test set evaluation is not included. Only the validation set acts as unseen data, with its first and foremost objective being to monitor training performance.

In the field of autonomous driving, it is common to measure driving capabilities on a specific test track with some specific evaluation constraints. Since the work is done in a proprietary simulator, there is no access to a public benchmark dataset. Additionally, the aim of the thesis is to compare two models with each other and not against any baseline case; we therefore generated our own test tracks.

To evaluate the models, the following measurements are logged in each test track run: i) throttle values for the car, ii) average speed, iii) time to complete the test track, iv) number of turns completed, v) driven distance, and vi) the number of road drifts. A road drift does not imply test termination (if the vehicle recovers). However, if there is no recovery to the prospective path, the test is terminated.

The trained neural nets in this thesis are deterministic. Thus, the same starting point for the vehicle should create the same path in every test run. However, this is not the case, due to stochastic noise in the physical system. For example, the simulator runs on an operating system that uses a different amount of resources at any time-step, which creates input/output variations. To address this, every test track and model is run for 25 episodes to be able to perform statistical comparisons between the different runs.

Figure 9: Test track 1, including 16 intentions (excluding "just drive").

The measurements are analysed using means (µ), standard deviations (σ), a two-tailed heteroscedastic (i.e. unequal variance) Student's t-test (p-value) and effect size (Cohen's d). The effect size is a statistical aid to the t-test that shows, if there exists a significant difference, what magnitude the difference between the sample groups has. As a rule of thumb, d-values of 0.2, 0.5 and 0.8 correspond to small, medium and large effects respectively [8]. These definitions are used in this report when analysing the results. Additionally, a two-sample Kolmogorov-Smirnov test is used to compare distributions that are skewed.
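These statistical comparisons can be carried out with SciPy as sketched below: a two-tailed Welch's t-test, Cohen's d, and a two-sample Kolmogorov-Smirnov test. The randomly generated per-episode samples are placeholders for the logged measurements, and the pooled-standard-deviation form of Cohen's d is an assumption, since the thesis does not spell out its formula.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder per-episode speeds for 25 episodes per model (illustrative only).
early_speed = rng.normal(18.09, 3.30, size=25)
late_speed = rng.normal(15.63, 3.88, size=25)

# Two-tailed heteroscedastic (Welch's) t-test, as used for Table 1.
t_stat, p_value = stats.ttest_ind(early_speed, late_speed, equal_var=False)

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of the two samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

d = cohens_d(early_speed, late_speed)

# Two-sample Kolmogorov-Smirnov test for the skewed gas distributions (Table 2).
expert_gas = rng.normal(0.87, 0.24, size=5000).clip(0.0, 1.0)
early_gas = rng.normal(0.83, 0.13, size=5000).clip(0.0, 1.0)
ks_stat, ks_p = stats.ks_2samp(expert_gas, early_gas)

print(f"Welch t-test p = {p_value:.3f}, Cohen's d = {d:.2f}, KS p = {ks_p:.3g}")
```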

Two test tracks were created to account for differences between the models. These are presented in the following sections.

3.5.1 Test Track 1

The first track consists of 16 intentions that are injected into the model well before approaching an intersection. The weather, texture and surroundings were fairly similar to those of the training map. Test Track 1 is shown in Fig. 9.

3.5.2 Test Track 2

Test Track 2 consists of 6 intentions to be injected into the model. The major difference in this track compared to the training track is that some of the buildings are different (tall buildings) and more trees are present. Additionally, the roads and surroundings are of familiar texture except for one part of the road, which has white marking lines on it. Test Track 2 is shown in Fig. 10.

Figure 10: Test track 2, including 6 intentions (excluding "just drive").

4 RESULTS

This section reports the test results for the two architectures, the early intention network and the late intention network. The networks were trained with two random seeds to mitigate the stochastic nature of training neural networks, which can affect performance and convergence. The driving capabilities are presented in the following subsections.

4.1 Driving capabilities

Table 1 provides an overview of the driving capabilities. Both early and late ran for 25 episodes on both tracks, due to the stochastic nature of the physical system. The results are averages with standard deviations. The p-values denote the significance of the difference between the distributions produced by early and late respectively.


4.1.1 Test Track 1

The track consists of 16 intentions of left, right and straight, with "just drive" in between. The early architecture performed slightly better than the late architecture on intentions completed, speed and distance, but only with a highly significant result on speed, with a p-value of 0. The effect size (Cohen's d) was medium. Road drift was larger for the early model, and with greater standard deviation, although with a fairly low effect size. See Table 1 for results.

4.1.2 Test Track 2

This track, with slightly different surroundings, was easier for both model architectures to complete. The early model had a higher number of intentions completed, but not significantly. The strongest difference in this track was the same as in the first: the speed, with a medium effect size. The early architecture drove longer on average, with a confidence of 85% and a low-to-medium effect size. The early model deviated from the road more often than the late model, with high significance and a large effect size. One interesting thing to note is that the late model stopped at points when driving on a straight road; this behaviour can be deduced from Fig. 14. See Table 1 for results.

4.2 Gas and speed distributions

Here follow the cumulative gas and speed distributions from the different runs on each track. Track 1 consisted of around ∼50k data points and track 2 of ∼18k data points. The histograms are in higher resolution, with 300 bins, to give a clearer picture of the differences between the networks. The gas values were over-represented in the upper half (> 0.5) and thus the lower halves are not presented in Fig. 11 and 13. Worth noting is that the mean and deviation are calculated over the whole gas value range (0 to 1) for the gas distributions.

Table 2: Kolmogorov-Smirnov test and effect size between the gas values from the expert dataset and those produced by the learners in the testing scenario.

            Test track 1          Test track 2
            p      Cohen's d      p      Cohen's d
  Early     0      0.21           0      0.35
  Late      0      0.55           0      0.73

4.2.1 Test Track 1

As seen in Fig. 11, the gas distributions have different characteristics. The early network's gas values have a higher mean and a lower standard deviation. The late network's values have one very large peak at a gas value of 0.898.

In Fig. 12 it can be seen that the distributions deviate from each other, with different means and standard deviations. The late network exhibits a large peak at a speed around 7 km/h as well as a large peak at 19.05 km/h. The highest peak of the early network is at 19.90 km/h. In general, both architectures have speed values that they seem prone to converge to, as the distributions have a valley-peak like appearance.

The results in Table 2 reveal that both the testing distributions of the early and the late networks differ from the expert distribution in Fig. 5 with high confidence. However, the effect size is lower for the early testing distribution, with a low effect size of 0.21 versus a medium size of 0.55 for the late.

4.2.2 Test Track 2

The gas distributions in Fig. 13 have similar characteristics to those in Fig. 11, with similar means and standard deviations as for Track 1. The peak of the late network is at the same gas value as on Track 1.

Fig. 14 reveals a similar pattern to Fig. 12; however, the distributions have different means and standard deviations as well as a different number of high peaks. The late network exhibits three large peaks, at ∼0, ∼7 and ∼19 km/h, while the early network only has one larger peak at ∼19 km/h. Here the distribution landscape has the same appearance as in Fig. 12.


Table 1: Results from the experiments. Each track and model was sampled for 25 episodes. The p-value denotes the hypothesis testing between the distributions of early and late (i.e. how significant the difference is).

Test track 1 (16 intentions):
  Means                  Early           Late            p      Cohen's d
  Intentions completed   43%             37%             0.39   0.17
  Speed (km/h)           18.09 ± 3.30    15.63 ± 3.88    0      0.68
  Distance (m)           1166 ± 593      1008 ± 844      0.28   0.21
  Road drift (#)         2.06 ± 1.48     1.78 ± 0.76     0.24   0.24

Test track 2 (6 intentions):
  Means                  Early           Late            p      Cohen's d
  Intentions completed   81%             73%             0.43   0.25
  Speed (km/h)           17.26 ± 3.68    14.35 ± 4.99    0      0.66
  Distance (m)           691 ± 251       574 ± 243       0.14   0.47
  Road drift (#)         0.8 ± 0.95      0.15 ± 0.49     0.01   0.85

Figure 11: Gas-value distribution for track 1 (Early: μ = 0.83, σ = 0.13; Late: μ = 0.75, σ = 0.16; the late network peaks at 0.8983).

Figure 12: Speed-value distribution for track 1 (Early: μ = 18.09, σ = 3.3; Late: μ = 15.63, σ = 3.88). The maximum peaks for early and late are at 19.9013 and 19.0505 km/h respectively.

Figure 13: Gas-value distribution for track 2 (Early: μ = 0.83, σ = 0.13; Late: μ = 0.75, σ = 0.18; the late network peaks at 0.8983).

Test Track 2 in Table 2 (like Test Track 1) shows a significant difference between the expert and test distributions. There is also a discrepancy in effect size.

5 DISCUSSION

This section discusses the differences between the networks and their results followed by suggestions for further research.

5.1 General comparison

Here follows a general discussion regarding the results from Table 1.

It seems that the early architecture is more prone to complete the intentions to a higher degree and thus also drive further, although, given the high p-values, more testing is needed to account for the differences between episode runs.


Figure 14: Speed-value distribution for track 2 (Early: μ = 17.26, σ = 3.68; Late: μ = 14.35, σ = 4.99). The maximum peaks for early and late are at 19.0323 and 18.947 km/h respectively.

Nevertheless, it was found that only speed could pass as a significant difference at a 95% confidence level, for both track 1 and track 2. So, based on the experiments, it can be concluded that the early architecture drove faster than the late architecture. The effect size of this difference was medium, meaning that the actual difference between the means, in relation to their respective standard deviations, is of medium size.

The early architecture had a higher occurrence of road drift. One explanation could be that the higher speed of the vehicle resulted in a larger distance between two steering actions, which made the model more prone to deviate from its path. The p-values for road drift are fairly low, and on track 2 the difference passed at a 90% confidence level.

5.2 Architecture complexity

The early and late networks (Fig. 7, 8) have an equal number of nodes/neurons; however, variations in their structures result in a different number of connections (parameters). The late network has 15 layers (including the input and output layers and excluding the concatenation layer) and 6.76 million trainable parameters. The early network is shallower with 13 layers and fewer trainable parameters (6.37 million), so it requires slightly less storage space. As seen in Fig. 7 and 8, the difference between the architectures is that the command module is located either early or late in the network. This results in the high-level intentional command being located 3 layers from the output in the early network and 2 layers in the late network. Furthermore, this also changes the number of intermediate layers between the speed input and the output: 3 layers for the early network and 5 for the late network. In essence, these alterations give the early network more equality between the speed module and the command module.

Since the speed module in the late network has two more layers to traverse before reaching the output, it is more prone to lose information about the speed input. This could be one of the explanations why the early network has a higher speed.

Another thing to observe is that the command module in the late network depends on the output from the layers before it. This is not the case for the early network, as its command module only relies on the intentional command as input. The late network could have a harder time training the command modules due to this dependency. This could also explain why the early network completes more intentions on average, although more test samples are required to acquire reliable results.

5.3 Distributional differences

A visual interpretation of the gas value distributions from testing showed some interesting results (Fig. 11, 13). On both tracks, a gas peak occurred for the late network at the same value. One explanation for this coincidence could be that the architecture of the late network (Fig. 7) is deeper and the speed module is "further" from the output. This could potentially reduce the conservation of the speed input in the late layers.

Another thing found by ocular inspection was that the early network's distribution of gas values had a more symmetric and smoother look, with an emphasis on higher gas values. A hypothesis is that this could be interpreted as more realistic behaviour with smoother, rather than choppy, driving.

When comparing the gas distributions from testing the models against the training distribution in Table 2, both distributions were significantly different from the training distribution. However, the early model had a lower effect size on both tracks, which could be an indicator that it was closer to the expert's policy.

Speed values (Fig. 12, 14) for the tracks showed a valley-peak like landscape, indicating that certain speeds were preferred over others by the networks. A speed around 0 was exhibited by both networks on both tracks, but on track 2 the late network had a higher occurrence of 0 speed. This happened when the late network stopped the agent in the middle of a straight road at some points. This could be referred to as causal confusion [7], or in this case, the "inertia problem". This was not exhibited by the early model.

Both models had high peaks at certain speeds, but the early network's maximum peak was both slightly larger and located at a higher speed value. The early model was more prone to high speeds, which can also be seen in Table 1.

5.4 Methodological assessment and Virtual Battlespace 3

A predetermined decision was to use Virtual Battlespace 3 (VBS3) [25] as the simulation engine, on behalf of the project owner's interest. A sub-question for this project was to see whether VBS3 could act as a good candidate for creating self-driving agents with realistic behaviour.

VBS3 is proprietary but highly modular, with its own scripting language, SQF [24]. It is possible to include neural-network-trained agents without major configuration, although this includes changing parts of the proprietary system.

In this project, VBS3 acted as the simulator with which the data was recorded and the models were tested. The pipelines used for this were inefficient, since the operating system and other slow interfaces acted as the communication between VBS3 and the trained models. This set limits on the rate of recording data and of executing the models in the simulator. If these problems did not exist, the discrepancy between the expert's policy distribution and the recorded data would probably shrink. This could have the potential to make the learner's policy distribution $p(s)$ closer to the expert's policy distribution $q(s)$ and thus decrease the difference measure $\Delta(q, p)$. The rate restrictions also inhibited the speed of testing. If testing could be done in a more effective manner, more tracks could have been tested, which might have resulted in more adequate results. An alternative to achieve this is to use the CARLA simulator [10], which is open source and thus more modular. Furthermore, it is used in scientific research for the same application as in this project.

Another topic is that there were constraints on the available computing power. Since training neural nets is time-consuming, combined with the fact that the testing could not be done in an automated manner (and was thus even more time-consuming than the actual training), only two random seeds were employed. More seeds in the testing could have given a more nuanced result, but of course with a more complex testing scenario (which would add to the time required).

5.5 Further research

This section summarizes suggestions from the previous sections.

As discussed in Section 5.4, it would be beneficial to use a more suitable simulator for both the training and testing of the networks. One possible candidate would be the CARLA simulator [10], an open-source framework for self-driving.

The general comparison in Section 5.1 showed differences between the networks but these were not significant in all cases. More testing, with different random seeds, would be beneficial for comparison. Additionally, due to the small measurement differences between the networks more episodes would be required to get reliable results. This would be made easier if the testing could be done at a higher rate and not in real-time.

In Section 5.2 the difference between the architectures was highlighted. More research would be required to understand how information flows through them. A suggestion would be to monitor the contribution each module has to the output. This falls under the category of XAI (Explainable AI) and could facilitate the understanding of the results in this report.

Section 5.3 elaborated on the inertia problem and its occurrence in the late model but not in the early model. This should be tested further to see whether it holds true.

6 CONCLUSION

This thesis aimed to build, train, and compare two neural networks and their driving capabilities in the domain of Conditional Imitation Learning for end-to-end driving. The architectural difference between the networks was that the intentional command was injected either early or late. The trained networks demonstrated different driving capabilities, thus they learned two different driving policies. Note that there was no hyperparameter tuning of the models, which could have improved them further. The early model drives faster in general and also closer to the expert policy it was trained on. This model also seems to deviate from the road more often than the late model. Whether this is an effect of a low action-to-speed ratio or an inherent downside was highlighted but never decided. The two models had distributional differences in the testing scenarios, and the early model was highlighted as having a smoother distribution of gas values, closer to that of the expert. The intentions completed for each test track as well as the distance driven seemed to be better for the early model. However, apart from the speed, the statistical tests did not reach significance.

The lack of significance could be an effect of the small sample size, which was due to constraints in the testing environment. More samples would likely reduce the variance and thus achieve more reliable results. To facilitate training and testing a simulation framework with greater modularity would be required. For this, the CARLA simulator was suggested.

Driving capability is a subjective matter, which this thesis tries to handle by introducing metrics that can work as driving performance indicators. Nonetheless, it remains unclear whether, for example, a high speed is a trait of good driving skill. In this thesis it was considered to be, and since the mean speed of the early model was closer to that of the expert, the early model was considered to model the expert data better than the late model.

REFERENCES

[1] Michael Bain and Claude Sammut. A framework for behavioural cloning. In Machine Intelligence 15, pages 103–129, 1995.

[2] Aude G Billard, Sylvain Calinon, and Rüdiger Dillmann. Learning from humans. In Springer Handbook of Robotics, pages 1995–2014. Springer, 2016.

[3] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

[4] Mariusz Bojarski, Philip Yeres, Anna Choromanska, Krzysztof Choromanski, Bernhard Firner, Lawrence Jackel, and Urs Muller. Explaining how a deep neural network trained with end-to-end learning steers a car. arXiv preprint arXiv:1704.07911, 2017.

[5] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.

[6] Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–9. IEEE, 2018.

[7] Felipe Codevilla, Eder Santana, Antonio M López, and Adrien Gaidon. Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 9329–9338, 2019.

[8] Jacob Cohen. Statistical power analysis for the behavioral sciences. Academic press, 2013.


[9] Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems, pages 11693–11704, 2019.

[10] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. arXiv preprint arXiv:1711.03938, 2017.

[11] Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.

[12] Markus Ehrenmann, Oliver Rogalla, Raoul Zöllner, and Rüdiger Dillmann. Teaching service robots complex tasks: Programming by demonstration for workshop and household environments. In Proceedings of the 2001 International Conference on Field and Service Robots (FSR), volume 1, pages 397–402, 2001.

[13] Jeffrey Hawke, Richard Shen, Corina Gurau, Siddharth Sharma, Daniele Reda, Nikolay Nikolov, Przemysław Mazur, Sean Micklethwaite, Nicolas Griffiths, Amar Shah, et al. Urban driving with conditional imitation learning. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 251–257. IEEE, 2020.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[16] Mario Köppen. The curse of dimensionality. In 5th Online World Conference on Soft Computing in Industrial Applications (WSC5), volume 1, pages 4–8, 2000.

[17] Alex Kuefler, Jeremy Morton, Tim Wheeler, and Mykel Kochenderfer. Imitating driver behavior with generative adversarial networks, 2017.

[18] Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. DART: Noise injection for robust imitation learning. arXiv preprint arXiv:1703.09327, 2017.

[19] Chrystopher L Nehaniv, Kerstin Dautenhahn, et al. The correspondence problem. Imitation in Animals and Artifacts, 41, 2002.

[20] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan Peters. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179, 2018. ISSN 1935-8261. doi: 10.1561/2300000053. URL http://dx.doi.org/10.1561/2300000053.

[21] Dean A Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989.

[22] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661–668, 2010.

[23] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.

[24] Bohemia Interactive Solutions. SQF scripting language. URL https://sqf.bisimulations.com/.

[25] Bohemia Interactive Solutions. VBS3. URL https://bisimulations.com/products/vbs3.

[26] Sasha Targ, Diogo Almeida, and Kevin Lyman. Resnet in Resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029, 2016.

[27] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528. IEEE, 2011.


www.kth.se
