Visual Object Detection using Convolutional Neural Networks in a Virtual Environment



Examiner: Michael Felsberg
Computer Vision Laboratory, Department of Electrical Engineering (ISY)
Linköping University, SE-581 83 Linköping, Sweden
Copyright © 2019 Andreas Norrstig


Abstract

Visual object detection is a popular computer vision task that has been intensively investigated using deep learning on real data. However, data from virtual environments have not received the same attention. A virtual environment enables generating data for locations that are not easily reachable for data collection, e.g. aerial environments. In this thesis, we study the problem of object detection in virtual environments, more specifically an aerial virtual environment. We use a simulator to generate a synthetic data set of 16 different types of vehicles captured from an airplane.

To study the performance of existing methods in virtual environments, we train and evaluate two state-of-the-art detectors on the generated data set. Experiments show that both detectors, You Only Look Once version 3 (YOLOv3) and Single Shot MultiBox Detector (SSD), reach similar performance quality as previously presented in the literature on real data sets.

In addition, we investigate different fusion techniques between detectors trained on two different subsets of the data set, in this case a subset in which cars have fixed colors and a subset in which cars have varying colors. Experiments show that it is possible to train multiple instances of the detector on different subsets of the data set, and to combine these detectors in order to boost the performance.


Acknowledgments

First, I would like to thank my supervisor Astrid Lundmark for helping me make this report more readable than it otherwise would have been. Secondly, thanks to Saab Aeronautics for giving me the opportunity to conduct this thesis. Finally, thanks to my supervisor Abdelrahman Eldesokey and examiner Michael Felsberg for their additional suggestions and comments on this report.

Linköping, April 2019 Andreas Norrstig


Contents

Notation
1 Introduction
  1.1 Background
  1.2 Problem Formulation
  1.3 Motivation
  1.4 Delimitations
  1.5 Thesis Outline
2 Theory
  2.1 Neural Networks (NN) and Convolutional Neural Networks (CNNs)
    2.1.1 Convolution
    2.1.2 Pooling
    2.1.3 Convolution used in practice
    2.1.4 1 × 1 convolution
  2.2 Visual object detection
    2.2.1 Literature Review
  2.3 Metrics for object detection performance
    2.3.1 PASCAL VOC metric
    2.3.2 Metric for COCO
3 Method
  3.1 Data set generation
    3.1.1 Aim for data set generation
    3.1.2 Practical details for image generation
    3.1.3 Data set characteristics
  3.2 Detector training and evaluation of YOLOv3
  3.3 Detector training and evaluation of SSD
  3.4 Investigation outline
    3.4.1 Fusion of detections
    3.4.2 Detector evaluation before fusion
4 Results
  4.1 Results for YOLOv3
  4.2 Results for SSD
  4.3 Fusion detector results
5 Conclusion and Future Work
  5.1 Performance of State-of-the-Art CNNs on Synthetic Aerial Images
  5.2 Combining detectors by fusing detections
  5.3 Concluding statements
  5.4 Future Work
A Additional illustrations and performance metrics
  A.1 Detection made by YOLOv3 trained on all vehicles
    A.1.1 True detections
    A.1.2 False detections
  A.2 Detection made by the SSD detector
    A.2.1 True detections
    A.2.2 False detections
  A.3 Precision, Recall and F-measurements for individual car type
    A.3.1 YOLOv3 trained on all vehicles
    A.3.2 SSD trained on all vehicles


Notation

Abbreviation   Definition
AP             Average Precision
mAP            mean Average Precision
CNN            Convolutional Neural Network
FCNN           Fully Connected Neural Network
FN             False Negative
FP             False Positive
IoU            Intersection over Union
R-CNN          Regions with CNN
ReLU           Rectified Linear Unit
R-FCN          Region-based Fully Convolutional Networks
RPN            Region Proposal Network
SSD            Single Shot MultiBox Detector
TP             True Positive
XOR            Exclusive OR
YOLO           You Only Look Once


1 Introduction

Object detection is the task of recognizing and localizing different objects, normally in images depicting daily-life objects. In this thesis, the problem of object detection in the context of a virtual environment is investigated, more specifically object detection in the scenario of simulated aerial images where the objects of interest were represented by 16 different vehicle types. The purpose of this first chapter is to present the reasoning behind the problem investigated, why the work in this thesis was conducted and the reasoning behind the research questions stated.

1.1 Background

Object detection for daily-life objects has advanced dramatically in the last few decades [27]. Contrarily, other types of objects that exist in environments that are not so easily observed, such as aerial and naval environments, have not received the same attention. Xia et al. [27] made a contribution to this problem by producing a data set for object detection in aerial imagery, collected mainly from satellite images at an angle of 90°. However, they argued that there is a real need for more annotated data in the field of aerial imagery. In the literature, aerial imagery with a pitch angle of around 90° is referred to as Earth Vision, also known as Earth Observation and Remote Sensing.

The purpose of this thesis work is to investigate the problem of object detection in aerial images. The lack of annotated images was circumvented by generating a synthetic data set of aerial imagery at a camera pitch angle of 20°. These images were generated using the urban driving simulator CARLA [3]. With these simulated images, two state-of-the-art object detection models were trained and evaluated: You Only Look Once version 3 (YOLOv3) [21] and Single Shot MultiBox Detector (SSD) [18].

Related work has also made use of synthetic images generated from Grand Theft Auto V, a popular video game.

1.2 Problem Formulation

In this thesis, two main investigations are presented. The first investigation looks into whether state-of-the-art object detectors on real data would maintain their performance if trained on the generated synthetic dataset. The second investigation examines whether the performance of the trained detectors can be increased by training multiple instances of the detector on subsets of the original training set, exploiting the control over the generation of images. More precisely, the experiment is to split the training set into two subsets, where one contains vehicle models with multiple possible colors, while in the second subset all vehicles are single colored.

Therefore, the research questions this thesis aims to answer are the following:

1. Will state-of-the-art object detectors on real data maintain their performance if trained on our synthetically generated dataset?

2. Is it possible to boost the performance of the detectors by training multiple instances of a detector on different subsets of the dataset, and then perform fusion between those instances?

1.3 Motivation

This thesis work was conducted at Saab Aeronautics in Linköping, and the greatest motivation for the work conducted was the interest in what the state-of-the-art methods for object detection are and how they perform. This thesis work was therefore conducted with the intention of exploration, and most of the decisions regarding methods were taken due to the limitations described in Section 1.4.


1.4 Delimitations

In this section, the limitations relevant to this project are presented in decreasing order of impact.

The greatest limitation in many projects is time, and this project was no exception. This project was conducted as a Linköping University master thesis project and was to be conducted within a budget of twenty full working weeks, or correspondingly 800 working hours. It was this time budget that posed the main limitation, and for that reason many compromising decisions had to be made.

Another big reason behind the decision to focus on simulated data came from the information restrictions at Saab and from Lantmäteriet, which restrict what aerial images are allowed to be published. This decision also resulted in fewer restrictions on how the information in this report needed to be handled.

1.5 Thesis Outline

The structure of this report is as follows: in Chapter 2 an introduction to the basic building blocks of convolutional neural networks and theory about commonly used performance metrics are presented. Chapter 2 also contains an overview of modern object detection systems. Chapter 3 contains descriptions of the methods used for data set generation and detector evaluation. Chapter 4 presents the results gathered in this thesis. Finally, a conclusion and a discussion about the work conducted and possible future work are presented in Chapter 5.


2 Theory

In order to appreciate and understand the theory presented in this chapter, the reader is expected to have prior knowledge about supervised learning, especially neural networks. Knowledge about different activation functions and how neural networks may be trained using stochastic gradient descent, where the gradient is calculated with the help of, for example, the backpropagation algorithm, is also assumed. Reference [8] is a source of more information about these subjects. This chapter is divided into three sections: the first section presents a few fundamental building blocks of convolutional neural networks, the second section gives an overview of visual object detection and the progress made in the field in the last few years, and the final section presents popular performance metrics for object detection. This chapter is not a comprehensive presentation of the field of CNNs and object detection; therefore the reader is strongly encouraged to follow the references provided in this chapter in order to gain a deeper understanding of the methods described. The aim of this chapter is to describe concepts that are considered important by the author in order to understand the intuition for the architectures presented and the reasoning behind the methods used in this thesis work.

2.1 Neural Networks (NN) and Convolutional Neural Networks (CNNs)

The purpose of a neural network used in supervised learning is to learn the mapping between an input x and an output y. In other words, a neural network may be seen as a function approximator, y = f(x). Neural networks are normally implemented using layers, and a network can be implemented by stacking multiple functions together, y = f_3(f_2(f_1(x))). A simple example of a neural network with one hidden layer is

f_1(x) = g(W^T x + b_1),  (2.1)

where W is a weight matrix, b_1 a bias vector and g an activation function. A second layer then applies an affine transformation resulting in the final definition:

f_2(f_1(x)) = w^T g(W^T x + b_1) + b_2.  (2.2)

When all neurons in a layer are connected to every neuron in the following layer, the network is called a Fully Connected Neural Network (FCNN). The feedforward network is a very powerful architecture capable of, as stated by the universal approximation theorem [10], approximating any Borel measurable function with at least one hidden layer, using a squashing¹ function and a sufficient number of hidden units.

¹For the purpose of this report it is enough to think of a squashing function as a function mapping all real numbers to an open interval between zero and one; for a more formal definition see [8, 10].

Convolutional Neural Networks (CNNs) are a special type of neural networks, or in this case feedforward networks. CNNs are normally used when the inputs to the network are images. More information about the convolution step of these networks is presented in the following section.

If a network with only one hidden layer is able to approximate any function, what is actually the problem? Being able to realize any function is not the same as being able to learn any function, and in some cases the number of hidden units needed may not be computationally feasible. For these reasons, a tremendous amount of research is conducted in order to find different activation functions (g in Equation 2.1) and different architectures describing what structure the network should have. These questions about how a network is designed and what activation function should be used in order for the network to learn in the best way are still a field of active research. In the following sections, different types of layers specific to CNN architectures are presented. The output from the layers described in the following sections is, as with any network, passed through different types of activation functions.
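To make the layer stacking above concrete, the following is a minimal NumPy sketch of the two-layer network in Equation 2.2, using a sigmoid as the squashing function g. The input and layer sizes are arbitrary illustration values and are not tied to any network used later in this thesis.

```python
import numpy as np

def sigmoid(z):
    # A squashing function: maps all real numbers to the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_network(x, W, b1, w, b2):
    # Equation 2.2: f2(f1(x)) = w^T g(W^T x + b1) + b2
    hidden = sigmoid(W.T @ x + b1)   # f1(x), the hidden layer
    return w.T @ hidden + b2         # f2, an affine output layer

rng = np.random.default_rng(0)
x = rng.standard_normal(4)       # input with 4 features (arbitrary size)
W = rng.standard_normal((4, 8))  # hidden-layer weights, 8 hidden units
b1 = np.zeros(8)
w = rng.standard_normal(8)       # output-layer weights
b2 = 0.0
print(two_layer_network(x, W, b1, w, b2))  # a single scalar prediction y
```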

2.1.1 Convolution

Mathematically, convolution is defined as

s(t) = (x * w)(t) = \int x(a) w(t - a) \, da.  (2.3)

This operation describes a mapping of how the signal x(a) is weighted with the signal w(t). Normally, the function w(t) can be thought of as a filter applied to the signal x(t).

Figure 2.1: Simple illustration of how the kernel K slides over I in order to generate the output.

In the case of images, the input signal is two-dimensional and sampled. If I denotes the input image and K represents a filter, the convolution is then defined by Equation 2.4.

S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n) K(i - m, j - n)  (2.4)

The definitions given in Equations 2.3 and 2.4 are both realizations of what mathematically is called convolution, but normally when convolution is described in the setting of deep learning, the function actually meant is the cross-correlation function. Cross-correlation has similar properties to convolution, but one difference is that the cross-correlation function is not commutative. The cross-correlation function is defined as:

S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i + m, j + n) K(m, n)  (2.5)

which is almost the same as the function described in Equation 2.4, but without mirroring the kernel. Similar to the one-dimensional case, the function defined in Equation 2.5 can be described as the mapping from an input image to an output image where the kernel has been applied at multiple locations of the image. A simple example of the mapping conducted by the cross-correlation function is given in Figure 2.1. The size of the resulting image is not only dependent on the width and height of the kernel, but also on two parameters called stride and padding. The stride is defined as how many pixels the kernel is moved between producing consecutive elements of the output. The function including the stride parameter is defined as

S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i \cdot s + m, j \cdot s + n) K(m, n)  (2.6)

To maintain the image size, padding is used to handle the cases when the kernel is close to the borders of I. There are a few different options for how to pad an image; one of these is to add zeros at the border of I in order to prevent the output from decreasing in width and height. For the case where both padding and a non-unit stride have been used, the output width or height can be calculated as:

o = \lfloor (i + 2p - k) / s \rfloor + 1.  (2.7)

Here o, i, k represent the output size, input size and kernel size, while p, s represent the number of zeros appended to the input and the size of the steps taken by the kernel (i.e. the stride). More thorough descriptions of how to calculate the output of the convolutional operation are presented in [4, 8].
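As an illustration of Equations 2.6 and 2.7, the following is a small, unoptimized NumPy sketch of zero-padded, strided cross-correlation. It is written for clarity rather than speed, and the input and kernel values are arbitrary.

```python
import numpy as np

def cross_correlate2d(I, K, stride=1, padding=0):
    # Cross-correlation of Equation 2.6 with zero padding and a given stride.
    I = np.pad(I, padding)                  # append p zeros around the border of I
    k_h, k_w = K.shape
    o_h = (I.shape[0] - k_h) // stride + 1  # Equation 2.7 (padding already applied to I)
    o_w = (I.shape[1] - k_w) // stride + 1
    S = np.zeros((o_h, o_w))
    for i in range(o_h):
        for j in range(o_w):
            patch = I[i * stride:i * stride + k_h, j * stride:j * stride + k_w]
            S[i, j] = np.sum(patch * K)     # sum_m sum_n I(i*s + m, j*s + n) K(m, n)
    return S

I = np.arange(36, dtype=float).reshape(6, 6)
K = np.ones((3, 3))
S = cross_correlate2d(I, K, stride=2, padding=1)
# Equation 2.7: floor((6 + 2*1 - 3) / 2) + 1 = 3
print(S.shape)  # (3, 3)
```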

2.1.2 Pooling

Normally, convolution is not the only operation performed in a convolutional neural network. Another popular operation is called pooling, which is an operation with the purpose of reducing the dimensions of the input. This may be done using max pooling or average pooling, or some other mathematical function to combine values. Figure 2.2 is an illustration of how an image is pooled using a 2x2 max pooling filter. The main motivation for using pooling, other than reducing the dimensions of the feature map, is to reduce the spatial variance of the features. This is intuitively easy to understand when realizing that there are multiple feature maps fed into the pooling layer that will result in the same output. Similar to the convolutional operation, the output size of a pooling operation can be calculated as:

o = \lfloor (i - k) / s \rfloor + 1  (2.8)

where once again i, k, s represent the input size, the pooling window size and the stride of the pooling operation.
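A corresponding sketch of 2 × 2 max pooling, with the output size given by Equation 2.8, is shown below; it is purely illustrative and the input values are arbitrary.

```python
import numpy as np

def max_pool2d(I, k=2, stride=2):
    # Output size from Equation 2.8: floor((i - k) / s) + 1
    o_h = (I.shape[0] - k) // stride + 1
    o_w = (I.shape[1] - k) // stride + 1
    S = np.zeros((o_h, o_w))
    for i in range(o_h):
        for j in range(o_w):
            window = I[i * stride:i * stride + k, j * stride:j * stride + k]
            S[i, j] = window.max()  # keep only the largest activation in each window
    return S

I = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(I))  # shape (2, 2); small shifts of the input often give the same output
```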

2.1.3 Convolution used in practice

One of the reasons for using convolution in a CNN is to reduce the dimensionality of the input image, resulting in smaller feature maps whose size depends on the problem to be solved by the CNN. In order to understand how this is accomplished, it is important to understand how the convolutional operation maps a multidimensional image, where the third dimension is typically the spectral dimension, often represented by the three colors (RGB). The multi-dimensional convolution operation is performed by simply extending the summation over the third dimension, which we will assume has c values for each pixel. Alternatively, this operation can be described as generating c normal feature maps and combining these maps by adding, for every pixel, the c corresponding pixel values. This therefore once again results in a two-dimensional feature map. An illustration of convolution in multiple dimensions is presented in Figure 2.3. [8]

Figure 2.3: Visualization of how convolution is conducted in multiple dimensions.

2.1.4 1 × 1 convolution

A special case of convolution that has had a big impact on the field of neural network architecture design is the one by one convolution, initially described and investigated in [16]. It may first seem like an odd thing to perform convolution with a 1 × 1 kernel, but it is actually a very powerful operation which makes it possible to increase the number of layers and number of weights without decreasing the dimension of the features; this results in a more powerful architecture. One architecture that uses this type of layer is described in Section 2.2.1.
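Since a 1 × 1 kernel covers a single pixel, the operation reduces to a linear mixing of the channels at every spatial position, as the following NumPy sketch shows. The feature-map and filter sizes are arbitrary examples.

```python
import numpy as np

def one_by_one_convolution(feature_map, W):
    # feature_map has shape (H, W, c_in) and W has shape (c_in, c_out).
    # A 1x1 convolution is a matrix multiplication over the channel
    # dimension; the spatial resolution is left untouched.
    return feature_map @ W

rng = np.random.default_rng(0)
features = rng.standard_normal((56, 56, 256))  # arbitrary example sizes
W = rng.standard_normal((256, 64))             # 64 filters of size 1x1x256
mixed = one_by_one_convolution(features, W)
print(mixed.shape)  # (56, 56, 64): channels reduced, resolution kept
```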

2.2 Visual object detection

As mentioned in Chapter 1, visual object detection is the problem of localization and classification of objects in images.

2.2.1 Literature Review

In this section, an overview of contributions to solving the problem of object detection is presented. Some of the references included in this section use general methods in order to boost their architecture’s ability to learn. Two examples of such methods are batch normalization and dropout. The focus of this report is to present different architectures specific to object detection, therefore general methods such as batch normalization and dropout are only mentioned briefly.

The method of dropout is to, during training of the network, remove nodes with a given probability. At test time all nodes are used, but the weights out from the nodes are multiplied with the probability used for dropout. This ensures that weights do not depend too heavily on each other, resulting in a more robust learning process [26]. A brief description of batch normalization is given later, in the section "Incremental work for YOLO".

Figure 2.4: Visualization of a simple neural network architecture.

Image classification using CNN

One of the first complete pipelines using a CNN for image classification was presented in [14], where the authors present an architecture capable of classifying handwritten numbers on bank checks. After 1998, there was a period when not that many advancements were made due to the hardware limitations at the time, but in 2012 Alex Krizhevsky and others presented their CNN architecture for image classification [13]. This work (AlexNet) is considered by many as the big break for CNNs; one of the reasons for this is its first place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 [1]. After AlexNet, there have been many more suggested architectures for image classification, ZFNet [28] and VGG16 presented in [25] to name a few, and these types of networks have later been used as building blocks in architectures aimed at object detection.

Figure 2.4 shows a simple illustration of how the building blocks from Section 2.1 are combined to form an architecture for image classification, inspired by LeNet-5 from [14].

CNNs for object detection

In the context of object detection, a tremendous amount of work has been conducted, and many different ways to use different CNN architectures have been proposed. Figure 2.5 shows a timeline over recently suggested architectures [2, 6, 7, 18, 20–23], where R-CNN and YOLO stand for Regions with CNN and You Only Look Once respectively.


Figure 2.5: Timeline over work contributing to the architectures for object detection and one classifier (AlexNet).

Figure 2.6:Illustration of the R-CNN architecture. Taken from [7] with the approval from the author.

R-CNN

The paper [7] presents an architecture for object detection that combines a CNN for classification with regions of interest. This architecture is illustrated in Figure 2.6, taken from the original paper [7]. As illustrated in Figure 2.6, multiple patches of the image are passed through the CNN in order to determine a classification; this is the main drawback of the method, and succeeding work presents alternative methods in order to decrease the computational cost.

Fast R-CNN

In order to create an architecture more efficient than R-CNN, the authors of [6] presented a method where a VGG16 net from [25] was used to extract one feature map for the entire image. From this feature map, fixed-size feature vectors were extracted for all Regions of Interest (RoI) using max-pooling layers. From the feature vector extracted for each RoI, the system determines the class of the RoI using a softmax layer and refines the location of the object using regression. This architecture is illustrated in Figure 2.7, taken from [6].

Faster R-CNN

Similar to the work conducted by the authors of Fast R-CNN, the creators of Faster R-CNN [23] made even more use of a CNN architecture than previous work.

Figure 2.7: Illustration of the Fast R-CNN architecture. Taken from [6] with approval from the author.

In [23] the authors present an architecture that uses VGG16 as a Region Proposal Network (RPN); the same feature map is also used in the object scoring part of the pipeline. In other words, they used a CNN both for proposing the object bounds and for answering the question of whether there is an object or not. This architecture is illustrated with the same image used by the original authors in Figure 2.8.

You Only Look Once (YOLO)

In the paper presenting the YOLO architecture, the authors present their architecture as the first deep learning object detection system to perform object detection in real time [22], at 45 frames per second. The fast frame rate is accomplished by the simplicity of the architecture, where a tailored CNN handles the entire problem of object detection by simply mapping an input image to an appropriate tensor describing the objects in the image. In the final tensor there are parameters describing the location of the object in a section, a probability for there being an object in that section, and probabilities for the appropriate classes. The architecture of the YOLO network is illustrated with the same figure as in the original paper, see Figure 2.9. For the case illustrated in Figure 2.9 the final tensor is a 7x7x30 tensor, where 30 comes from the fact that the network is designed for the PASCAL VOC data set (20 different object classes), and for every sub-region of the image (the original image divided seven times along width and height) the network produces one class hypothesis and two potential bounding boxes (each represented by five values: x, y, width, height and a confidence).

SSD: Single Shot MultiBox Detector

The authors of the paper describing the SSD architecture [18] present a work similar to that of YOLO [22], where the systems are similar both in architecture and in performance, but with three main differences: the SSD architecture searches for objects in multiple scales of the feature maps extracted by a CNN, SSD uses anchor boxes in order to handle the bounding boxes, and SSD relies heavily on data augmentation and hard negative mining [18].


Figure 2.8: Illustration of the Faster R-CNN architecture. Taken from [23] with approval from the author.

Figure 2.9: Illustration of the YOLO architecture. Taken from [22] with approval from the author.


Figure 2.10: Illustration of the SSD architecture. Taken from [18] with approval from the author.

Hard negative mining means that during training the collection of negative patches is filtered such that the ratio between negative and positive patches is at most 3:1. The steps for data augmentation include random color distortion, random expansion, random cropping and random horizontal flipping. The architecture is illustrated in Figure 2.10, originally presented in [18].

R-FCN: Object Detection via Region-based Fully Convolutional Networks

In [2], Jifeng Dai and others present an architecture that is similar to Faster R-CNN [23], using a CNN to extract a feature map used for localization and detection, but with the difference that the feature maps extracted are meant to be position-sensitive. This means that multiple feature maps are needed in order to find one object. This architecture has been shown to produce state-of-the-art results at 83.6% mAP on the PASCAL 2007 data set [2]. The R-FCN architecture has also been presented as the best detector in a review paper [29], where results for PASCAL 2012 were shown to be 85% mAP.

Incremental work for YOLO

The main original author of the YOLO paper [22], Joseph Redmon, has presented two incremental works in [20, 21]. A few of the changes made in [20] were the implementation of batch normalization, the use of anchor boxes, and additionally the addition of a pass-through layer allowing the network to use more fine-grained features, similar to the SSD architecture. Batch normalization is the method of normalizing the output from one layer to another, with the intuition that it is the distribution from a previous layer that is of interest. Batch normalization is applied in order to speed up the convergence during training [11]; in the case of YOLOv2, batch normalization made the use of dropout unnecessary.

In the last paper [21], the authors present an updated CNN architecture with 53 convolutional layers instead of the previous 19 in [20], and YOLOv3 now predicts boxes at three different scales. These changes increased the mAP from 21.6% for YOLOv2 to 33.0% for YOLOv3 on the COCO data set. The YOLOv3 architecture is also shown to perform object detection at a speed of 78 fps in [21], compared to the frame rate of 45 for YOLOv1 [22], on the same hardware.

Figure 2.11: Visualization of intersection over union; the ground truth box is represented by a green box and the detected object by a red box.

2.3 Metrics for object detection performance

In order to be able to evaluate the performance of a detector, different evaluation metrics are used. There are a few different competitions in object detection; two of the most popular are PASCAL VOC [5] and COCO [17]. Due to differences in both data set and metric system between the two challenges, results between the two data sets are not comparable. In order to be able to determine which detector generated the best detections, both the PASCAL and the COCO challenge have defined their own metric, or scoring system. Both of the metrics are built upon two measures called precision and recall [8], defined as:

precision = TP / (TP + FP),    recall = TP / (TP + FN)  (2.9)

where TP, FP, and FN are the number of true positives, false positives, and false negatives respectively. More specific to the problem of object detection is the measure of intersection over union (IoU), which is used to determine an association between predicted detection boxes and ground truth boxes. Intersection over union is illustrated in Figure 2.11 and can also be expressed with set notation as:

IoU = |B_{gt} \cap B_p| / |B_{gt} \cup B_p|  (2.10)

where B_gt and B_p refer to the ground truth box and the predicted detection box, respectively.


2.3.1 PASCAL VOC metric

The PASCAL VOC challenge scores detectors with the Average Precision (AP), which up until 2010 was defined as the mean of the interpolated precision at eleven equally spaced recall levels:

AP = \frac{1}{11} \sum_{r \in \{0, 0.1, ..., 1\}} P_{interp}(r)  (2.11)

where

P_{interp}(r) = \max_{\tilde{r}: \tilde{r} \ge r} p(\tilde{r})  (2.12)

and p(r̃) is the precision at recall r̃. These equations are originally found in the paper describing the PASCAL VOC challenge [5]. The intuition behind this measure is to create an average over the precision-recall curve, see Figure 2.12, where precision is on the vertical axis and recall is on the horizontal axis.

In practice, increasing the recall means including more and more of the detections provided by the detector under test. Here it is implied that the detector has also provided a confidence alongside each detection, thereby providing the opportunity to add detections one by one in decreasing confidence order. The AP defined in Equation 2.11 therefore corresponds to the area under the P_interp curve in Figure 2.12. In order for a detection to be classified as a True Positive, it needs to have an IoU above 0.5, and there may only be one B_p associated with each B_gt box [5].

In 2010 the PASCAL VOC challenge updated the calculation of the AP. Instead of averaging over 11 equally spaced points, the measure changed to an average over all detections. This new definition is described by Equation 2.13. It should also be pointed out that this update is not reflected in the paper for PASCAL [5]; the new definition may, for example, be found in the source code for evaluation¹.

AP = \sum_n (r_{n+1} - r_n) P_{interp}(r_{n+1})  (2.13)

where the sum runs over all recall levels r_n at which the interpolated precision changes.

In order to give an intuition behind the metrics described, an example is presented. Figure 2.13 contains an illustration of how detections from a detector may look; these detections are labeled according to the likelihood of them being an object, with the most likely labeled as 1. In Figure 2.13 the ground truth boxes are illustrated using green boxes, while the red boxes represent detections. For a detection to be counted as a true positive, its IoU value has to be 0.5 or higher. The detections shown in Figure 2.13 give rise to the precision-recall graph in Figure 2.12, from which the AP value can be calculated with either definition presented in Equation 2.11 or 2.13.

1http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCdevkit_18-May-2011.


Figure 2.12: Visualization of a precision-recall curve with the associated P_interp curve. This figure was generated from the detections and values presented in Figure 2.13.

Figure 2.13: Illustration of hypothetical detections and the corresponding matching probability, IoU, Precision and Recall values.


2.3.2 Metric for COCO

The COCO challenge instead averages the AP over several IoU thresholds:

AP_{COCO} = \frac{1}{10} \sum_{IoU \in \{0.50, 0.55, ..., 0.95\}} AP^{IoU}  (2.14)

In Equation 2.14, AP^{IoU} refers to the AP defined in Equation 2.13 with the given IoU threshold. This definition may be found in¹, but is not presented in [17].
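The following sketch ties together the building blocks above: the IoU of Equation 2.10 and the 11-point interpolated AP of Equations 2.11 and 2.12. The precision and recall values in the example are made up for illustration and do not correspond to any figure in this thesis.

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2); Equation 2.10: |A ∩ B| / |A ∪ B|.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision_11pt(precision, recall):
    # Equation 2.11: average the interpolated precision (Equation 2.12)
    # at the recall levels 0, 0.1, ..., 1.
    precision, recall = np.asarray(precision), np.asarray(recall)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        above = precision[recall >= r]
        ap += above.max() if above.size else 0.0
    return ap / 11.0

# Detections added one by one in decreasing confidence order (toy values).
precision = [1.0, 1.0, 0.67, 0.75, 0.6, 0.67]
recall    = [0.2, 0.4, 0.4, 0.6, 0.6, 0.8]
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))        # ~0.14
print(average_precision_11pt(precision, recall))  # 11-point AP for the toy curve
```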


3 Method

This chapter is divided into four sections describing the approach taken during this thesis project. In the first section, a description of how the data collection was performed and some practical details about the data collection are presented. The second and third sections present practical details about the detectors investigated. In the final section, the different fusion methods are presented.

3.1 Data set generation

As described in Chapter 1, the main purpose of this thesis work is to explore the problem of object detection in synthetic aerial images captured at a pitch angle of about 20°. In order to gain insight into the problem, images of appropriate characteristics were needed. In the following sections, the motivation behind the choice of images and a description of how these images were generated are presented.

3.1.1 Aim for data set generation

The aim of the image generation conducted in this thesis work was to generate images simulating aerial images, in a sufficient amount to train the CNN architectures presented in Chapter 2. Since the simulated images are to represent images from airplanes, all the images were collected at the same height, with the reasoning that the height of the airplane is known and it is therefore possible to choose a detector trained for the appropriate height. In order to collect images of different characteristics, the pitch and azimuth angles were randomly sampled from the intervals 15°–25° and 0°–360°, respectively. The azimuth angle refers to the angle between the object of interest and the airplane in the xy-plane. In Figure 3.1 an illustration of these angles is presented, where α and φ represent the pitch and azimuth angle respectively.

Figure 3.1: Visualization of possible camera positions; the pitch and azimuth angles are represented by α and φ respectively. An example of a possible camera position and orientation is illustrated by the red arrow. The blue arrows represent a camera-related coordinate system.

All the possible camera positions are illustrated in Figure 3.1 by the area bounded by the two dashed circles around the z-axis. With the setup illustrated in Figure 3.1, it is apparent that an object placed at the origin is depicted in the center of the generated image. In order to generate images where the object is not constrained to the center of the image, images of twice the needed size were generated and then cropped to the appropriate size around a randomly picked point in the image. Two additional parameters changed between images: the weather conditions, and for half of the vehicles the color of the car, which was randomly chosen from a set of 15 colors.

In order to train and evaluate different detector architectures, separate sets for training, validation and test were needed. These sets were created by assigning the different locations in the virtual environment to the different sets: 20% of the locations were used for testing, and the remaining locations were once more split 80%/20% and assigned to training and validation respectively. At every position in the virtual environment, multiple images were generated from different camera positions.
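The sampling and splitting described above can be summarized with the short Python sketch below. It only illustrates the idea: the actual placement of cameras and vehicles is handled through the CARLA API, and the function and variable names here are hypothetical.

```python
import math
import random

def sample_camera_pose(height_m=200.0):
    # Pitch from 15-25 degrees and azimuth from 0-360 degrees, as in Section 3.1.1.
    pitch = math.radians(random.uniform(15.0, 25.0))
    azimuth = math.radians(random.uniform(0.0, 360.0))
    # Horizontal distance so that the camera sees the object at the sampled pitch.
    radius = height_m / math.tan(pitch)
    x, y = radius * math.cos(azimuth), radius * math.sin(azimuth)
    return (x, y, height_m), pitch, azimuth

def split_locations(locations, seed=0):
    # 20% of the locations for testing; the rest split 80/20 into training/validation.
    rng = random.Random(seed)
    shuffled = locations[:]
    rng.shuffle(shuffled)
    n_test = int(0.2 * len(shuffled))
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_valid = int(0.2 * len(rest))
    valid, train = rest[:n_valid], rest[n_valid:]
    return train, valid, test

train, valid, test = split_locations(list(range(300)))
print(len(train), len(valid), len(test))  # 192 / 48 / 60 locations
```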


Figure 3.2: Illustration of the two different maps (Town1 and Town2) and the corresponding object positions.

3.1.2 Practical details for image generation

The images were generated using the CARLA Simulator, described in [3] and accessible at their website¹. As described in Section 3.1.1, images were generated at different locations in the environment. CARLA supports two different maps with around 150 different locations where a vehicle can be placed. Additional locations were added to the simulator, resulting in a total of 300 locations. The different locations and their corresponding set type are presented in Figure 3.2.

The images were generated with an initial oversampling at a size of 1344 × 1344 pixels and then down-sampling these images by a factor of three in each dimension to a size of 448 × 448 pixels, in order to match the size of the images used in the PASCAL VOC challenge [5]. The reason for the initial oversampling was to alleviate aliasing effects, and the factor of three resulted from a trade-off between computation time and image quality. The Field of View (FOV) was set to 10°. All images were generated at a height of 200 m, resulting in a typical car being depicted with a size of about 28 pixels in width and 17 pixels in height. With the look-down angle set to 20°, a meter on the ground is depicted by approximately five pixels in the final image.
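The stated ground resolution can be checked with a rough pinhole-camera approximation, assuming the 10° field of view is measured across the image width and ignoring the perspective variation within the image:

```python
import math

height = 200.0           # camera height in meters
pitch = math.radians(20.0)
fov = math.radians(10.0)
image_width = 448        # pixels after down-sampling

slant_range = height / math.sin(pitch)                 # distance to the image center on the ground
ground_width = 2.0 * slant_range * math.tan(fov / 2)   # meters covered across the image center
pixels_per_meter = image_width / ground_width

print(round(slant_range), round(ground_width), round(pixels_per_meter, 1))
# ~585 m slant range, ~102 m across the image, ~4.4 pixels per meter,
# i.e. roughly the five pixels per meter stated above.
```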

The CARLA Simulator only supports 8 different car models with multiple colors; therefore only 16 of CARLA's 18 car models were used, in order to have the same number of classes in the varying-color subset and the constant-color subset used in the additional experiment. The two subsets are later referred to as S1 and S2 respectively. All the car models used are listed in Table 3.1.

All the cars with varying colors were modified to support 15 different colors. The CARLA simulator also supports 14 different weather conditions, all of which were used during the generation of images.


Figure 3.3: A Tesla and a SeatLeon marked with ground truth boxes.

Finally, all the images were generated by collecting 50 images of every car from different angles at every location on the map. In order to decrease the time needed to collect all the images, 10 cameras were used at the same time, resulting in only five changes of the car color and the weather condition per location. To avoid generating images depicting only background, images containing less than 1% of the car were removed. All these steps resulted in a total of 162641 images generated in 99 hours. Two of these images are shown in Figure 3.3; the ground truth boxes, here illustrated by the blue boxes, were obtained by simply using the segmentation images supported by the CARLA Simulator.
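A ground truth box can be read off a semantic segmentation image with a few lines of NumPy, as sketched below. The class id used for the vehicle is a hypothetical placeholder and not an actual CARLA label value.

```python
import numpy as np

def bounding_box_from_mask(segmentation, vehicle_id):
    # segmentation: 2D array of per-pixel class ids from a segmentation camera.
    # Returns (x_min, y_min, x_max, y_max) for the pixels labeled as the vehicle.
    ys, xs = np.where(segmentation == vehicle_id)
    if xs.size == 0:
        return None  # the car is not visible in this image
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy example: a 6x6 segmentation image where class id 10 marks the car.
seg = np.zeros((6, 6), dtype=int)
seg[2:4, 1:5] = 10
print(bounding_box_from_mask(seg, vehicle_id=10))  # (1, 2, 4, 3)
```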

3.1.3 Data set characteristics

This section presents a collection of figures describing the characteristics of the generated images. The distributions over the different car models are presented in Figure 3.4 and Table 3.2.


Figure 3.4: Distribution over vehicle types in the different data sets.

Vehicle                training   valid   test    Total
AudiA2                 6429       1643    2044    10116
AudiTT                 6344       1695    2013    10052
BmwGrandTourer         6573       1692    1993    10258
BmwIsetta              6133       1596    1953    9682
ChevroletImpala        6563       1703    2039    10305
CitroenC3              6450       1715    2031    10196
DodgeChargePolice      6457       1673    2034    10164
JeepWranglerRubicon    6528       1700    2059    10287
Mini                   6388       1676    2048    10112
Mustang                6475       1724    2046    10245
NissanMicra            6420       1664    2025    10109
NissanPatrol           6588       1725    2070    10383
SeatLeon               6390       1665    2021    10076
Tesla                  6424       1692    1990    10106
ToyotaPrius            6508       1672    1983    10163
VolkswagenT2           6553       1728    2106    10387
Total                  103223     26963   32455   162641

Table 3.2: Number of images of each vehicle type in the training, validation and test sets.


Figure 3.5: The illustration shows that no systematic difference exists with respect to azimuth and pitch angles.

In Figure 3.5 the distribution over camera positions for one of the cars is shown, with the positions marked with the set to which they belong. In Figure 3.6 the camera positions collected at locations 123 and 124 in Town1 (see Figure 3.2) are visualized. The size of the points in Figure 3.6 represents which weather condition was used. The figure shows that the majority of training and test points do not overlap, implying that the resulting images are different enough to be used for training and test respectively. An additional factor increasing the difference between training and test points is the orientation of the vehicle, which depends on the vehicle's direction of travel (left or right lane).

3.2 Detector training and evaluation of YOLOv3

In order to give as fair an answer to Question 1 as possible, no alterations to the network's hyper-parameters were made for training and testing. These hyper-parameters included settings like a batch size of 64 images, an initial learning rate of 10^-3, a momentum of 0.9 and a weight decay factor of 0.0005. Even the original sizes for the 9 anchor boxes were used ((10x13), (16x30), (33x23), (30x61), (62x45), (59x119), (116x90), (156x198), (373x326)). The performance of the detector would most likely be different if these anchor boxes had been altered appropriately for the smaller object sizes of our generated data set, but a high mean AP (mAP) was not the purpose of this thesis.

The implementation provided by the original authors was used. In order to find the best weights for the network, evaluations were conducted every 1000 iterations during training. All the results were compared using the validation set (mAP according to the PASCAL VOC 2007 definition).

The implementation of YOLOv3 is available at their website¹, where additional information about how to use and modify the network for arbitrary data sets was also obtained. In order to decrease the training time, the network was trained using weights pre-trained on ImageNet [13].

Figure 3.6: Illustration of the camera positions used at locations 123 and 124 in Town1; the weather condition is represented by the size of the points in the figure.

3.3 Detector training and evaluation of SSD

Unlike for the YOLO detector, the original source code was not used for the SSD detector, due to the convenience of using a Python implementation instead. The architecture was supplied by ChainerCV [19] at their website², which presents results comparable to those of the original authors in [18] (77.8% and 77.5% mAP respectively for the PASCAL VOC challenge).

The SSD detector has many similarities to the YOLO detector and a few differences, as presented in Section 2.2.1. The implementation used was the SSD version adapted for working on 512x512 images. In order for the generated images to pass through the network, an initial step of the ChainerCV-supplied SSD implementation was to resize the images to the appropriate size by randomly picking one of the five interpolation methods supported by OpenCV (linear, area, nearest, cubic, lanczos4) for every image passed through the network. For hyper-parameters, an initial learning rate of 1e-5, a momentum of 0.9 and a weight decay factor of 0.0005 were used. As for the YOLOv3 detector, no exploration of the hyper-parameters was conducted; the values were supplied as default values for the implementation.
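The resizing step can be sketched as follows; this is an illustration of the idea rather than the ChainerCV code actually used, and the function name is hypothetical.

```python
import random
import cv2

# The five OpenCV interpolation methods mentioned above.
INTERPOLATIONS = [
    cv2.INTER_LINEAR,
    cv2.INTER_AREA,
    cv2.INTER_NEAREST,
    cv2.INTER_CUBIC,
    cv2.INTER_LANCZOS4,
]

def resize_for_ssd(image, size=512):
    # Resize to the 512x512 input of the SSD512 model, picking a random
    # interpolation method for every image passed through the network.
    method = random.choice(INTERPOLATIONS)
    return cv2.resize(image, (size, size), interpolation=method)
```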

The detector was trained for 120000 iterations with a batch size of 8 images.

1https://pjreddie.com/darknet/yolo/


3.4 Investigation outline

In order to find an answer to research Question 2, experiments were made where multiple instances of the same detector were trained on different subsets of the generated images. An illustration of how the data set is split and how the multiple instances of the detector were trained is shown in Figure 3.7. In Figure 3.7 the YOLO_ nodes represent different instances of the YOLOv3 architecture. YOLO_big represents the instance of YOLOv3 with which the initial training on all vehicle types was conducted, as presented in Section 3.2. YOLO_1 and YOLO_2 represent the two instances trained on S1 and S2 respectively. S1 and S2 refer to the vehicle subsets presented in Table 3.1. The YOLO_fused node refers to the process of fusing detections from two detectors. Two different methods for fusing the detections produced by the two detectors (YOLO_1 and YOLO_2) are described in the following subsection.

3.4.1 Fusion of detections

Two different fusion methods were investigated. The methods described in this section are not suggestions for universally applicable methods; they were investigated only with the intention of showing the possibility of increasing the performance of a state-of-the-art detector by applying prior knowledge about the relevant data set. The first and simplest method was to simply trust the most confident detector.

For the second method, a more sophisticated approach was applied. For every image, all detections from both detectors were collected and the smallest patch including all detections was determined, by finding the corner (over all detections) closest to the top left corner of the image and the corner closest to the bottom right corner of the image. The patch width and height were then multiplied by three. These steps were taken with the intention of extracting an image depicting the relevant vehicle with some background, in order to train a classifier capable of classifying the depicted vehicle. An example of such a patch is illustrated by the bigger red box in Figure 3.8. With these images, a classifier was trained to distinguish between all sixteen vehicles, and the appropriate detector was chosen by comparing the likelihoods for the two corresponding subsets, obtained by summing the likelihoods over every individual vehicle type in each subset. The classifier trained was an instance of a ResNet50 network, originally presented in [9], with the implementation supplied by ChainerCV [19]. The patches were resized to 224 × 224 pixels using bilinear interpolation before being passed through the ResNet. The hyper-parameters used for the ResNet consisted of an initial learning rate of 0.025, a momentum of 0.9 and a weight decay factor of 0.0001; during training a batch size of 64 images was used. A similar solution using multiple detectors to increase performance has been presented in [12], but using different detectors all trained on the same problem rather than the same type of detector trained on different parts of the problem.

Figure 3.7: Illustration of the training and evaluation process for Research Question 2.

Figure 3.8: An illustration of how a patch is extracted from two detections; the extracted patch is depicted as the bigger red box.
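The second fusion method can be summarized with the sketch below. It is a simplified interpretation of the procedure described above (here the enlarged patch is centered on the union of the detections), and the data structures, class-probability format and function names are hypothetical.

```python
import numpy as np

def extract_patch(image, detections, scale=3.0):
    # Smallest patch covering all detection boxes, with width and height
    # multiplied by `scale`, clipped to the image borders.
    boxes = np.array([d["box"] for d in detections])  # boxes as (x1, y1, x2, y2)
    x1, y1 = boxes[:, 0].min(), boxes[:, 1].min()
    x2, y2 = boxes[:, 2].max(), boxes[:, 3].max()
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    x1, x2 = int(max(0, cx - w / 2)), int(min(image.shape[1], cx + w / 2))
    y1, y2 = int(max(0, cy - h / 2)), int(min(image.shape[0], cy + h / 2))
    return image[y1:y2, x1:x2]

def choose_detector(class_probs, subset_1, subset_2):
    # class_probs: dict mapping vehicle name -> probability from the classifier.
    # Pick the detector whose vehicle subset receives the larger summed likelihood.
    p1 = sum(class_probs[name] for name in subset_1)
    p2 = sum(class_probs[name] for name in subset_2)
    return "YOLO_1" if p1 >= p2 else "YOLO_2"
```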

3.4.2 Detector evaluation before fusion

For evaluation and comparison between all the detector instances, the definition of AP from PASCAL VOC before the update in 2010 (average over 11 different recall values) was used. As described in Section 3.2, the weights were chosen by determining the best version of weights considering the mAP on the evaluation set. Finally, the YOLO_big and the fused network were evaluated and compared on the test set.


4 Results

In this chapter, the results of this thesis work are presented. The final Section 4.3 presents results for the detector created by fusing detections from two YOLOv3 sub-detectors, in order to answer Question 2 posed in Section 1.2.

4.1 Results for YOLOv3

As described in Section 3.4.2, an mAP optimum was found by searching over multiple iterations and evaluating the performance on the appropriate validation set. In Figure 4.1, the loss and mAP values as a function of iterations are presented. An optimum was found at 51000 iterations, with a corresponding mAP of 73.3% and COCO mAP of 43.2%, also presented in Table 4.1. These values are to be compared with the results presented in [21], where the authors present results of 57.9% and 33% for VOC mAP and COCO mAP respectively, on the COCO data set. These differences may be explained by the differences in characteristics between our generated data set and the COCO dataset; arguments for this are presented in Section 5.1.

From the graphs in the top row of Figure 4.2 it can be seen that the measured mAP values are influenced by both precision and recall and not constrained by just one of the two. In the Appendix, additional measurements for every car type are presented in Figure A.17. For illustrations of a few good and bad detections, see Sections A.1.1 and A.1.2 respectively.


Figure 4.1: Loss and mAP performance for the YOLOv3 detector on the validation set.

Vehicle                AP       COCO AP
AudiA2                 81.3%    47.4%
AudiTT                 63.1%    36.5%
BmwGrandTourer         63.3%    38.8%
BmwIsetta              81.8%    42.8%
ChevroletImpala        71.0%    41.6%
CitroenC3              71.9%    42.8%
DodgeChargePolice      72.6%    47.0%
JeepWranglerRubicon    72.5%    42.8%
Mini                   81.8%    49.8%
Mustang                72.2%    39.8%
NissanMicra            62.6%    37.3%
NissanPatrol           72.7%    46.7%
SeatLeon               72.2%    42.5%
Tesla                  81.2%    43.7%
ToyotaPrius            70.5%    41.6%
VolkswagenT2           81.6%    50.5%
Mean                   73.3%    43.2%

Table 4.1: AP and COCO AP performance on the test set for the YOLOv3 detector.


Figure 4.2: Precision, recall and F-measure as functions of the IoU threshold. The top row shows YOLOv3 performance, the bottom row shows SSD performance.

4.2 Results for SSD

As previously described in Section 3.3, the SSD detector was trained for 120000 iterations. Figure 4.3, depicting the loss for the SSD network, shows that the network has reached a plateau in performance. During these iterations the mAP was calculated on the validation set at iterations 80000, 100000 and finally 120000, which resulted in mAP values of 56.2%, 56.9% and 57.0% respectively; therefore the weights corresponding to iteration 120000 were used for evaluation on the test set. With these final weights the result was 60.7% mAP. The AP values for every vehicle type are presented in Table 4.2. These low figures compared to the original results (76.9%) presented in [18] are explained by the SSD architecture's difficulty in detecting smaller objects, due to the severe down-sampling (in relation to the size of our objects) performed by the early layers of the SSD architecture [18]. These measurements may be compared with the original results for the object type plant, with corresponding AP values of 44.9%, 50.3% and 59.1% depending on how many images were used during training.

From the graphs in the bottom row of Figure 4.2, which present precision, recall and F-measure as a function of IoU, it is evident that the poor performance of the SSD does not originate from inaccurate detections but from objects not being detected (low recall for all IoU). The same metrics are presented for every vehicle in the Appendix in Figure A.18. In the Appendix a few good and bad detections made by the SSD detector are also presented, under Sections A.2.1 and A.2.2 respectively.


Figure 4.3: Illustration of the total loss, localization loss and classification loss for the SSD detector during training, on the validation set.

Vehicle                AP
AudiA2                 71.4%
AudiTT                 37.0%
BmwGrandTourer         50.6%
BmwIsetta              77.1%
ChevroletImpala        49.5%
CitroenC3              49.7%
DodgeChargePolice      68.7%
JeepWranglerRubicon    65.1%
Mini                   83.0%
Mustang                58.3%
NissanMicra            38.5%
NissanPatrol           69.3%
SeatLeon               56.3%
Tesla                  73.3%
ToyotaPrius            43.5%
VolkswagenT2           79.5%
Mean                   60.7%

Table 4.2: AP performance on the test set for the SSD detector.


Figure 4.4: mAP performance on the validation set for the two additional detectors (colored vehicles and single colored vehicles).

4.3 Fusion detector results

The weights for the two additional instances of the YOLOv3 detector were determined in the same manner as for the YOLO_big instance. The performance of the two networks on the validation set is illustrated in Figure 4.4. The networks were found to reach an optimum at 51000 and 45000 iterations for colored and single colored vehicles respectively. The AP measurements for the two networks (trained on S1 and S2 respectively) and the network trained on all vehicles are presented together in Table 4.3. From the results presented in Table 4.3 and in the column Max Fusion in Table 4.4, it is evident that some of the unknown vehicles (for YOLO_1 and YOLO_2) are wrongly classified, and therefore decrease the accuracy for other known vehicles. The AP values in Table 4.3 are calculated as if YOLO_1 and YOLO_2 were able to output all classes, which is the reason for the occurrences of undefined values (represented by -). The results for the two different fusion methods described in Section 3.4.1 are presented in Table 4.4. This table also contains an extra column with the AP measurements which would have been observed if the decision of the appropriate detector had been made perfectly. From Tables 4.3 and 4.4 it is shown that the method of combining the two detectors using a ResNet50 classifier results in an increase of 0.6% compared to the detector trained on all types of vehicles, with the ResNet50 network reaching an accuracy of 89% for classification on the test set. From the mAP values plotted in Figure 4.5 (for the appropriate subsets), depicting all three instances of the YOLOv3 detector, it is shown that the mAP of the YOLO_2 (single colored cars) detector is consistently higher than that of the YOLO_big detector.


Vehicle            YOLO_big   YOLO_1   YOLO_2
Mini               81.8%      -        81.5%
Mustang            72.2%      71.9%    -
NissanMicra        62.6%      59.5%    -
NissanPatrol       72.7%      -        68.6%
SeatLeon           72.2%      -        44.5%
Tesla              81.2%      -        64.6%
ToyotaPrius        70.5%      63.7%    -
VolkswagenT2       81.6%      -        80.9%
Non-zero mean      73.3%      63.4%    70.6%

Table 4.3: The AP performance compared between the different detectors. YOLO_big, YOLO_1 and YOLO_2 refer to the different detector instances with the same definitions as in Figure 3.7.

Figure 4.5: mAP performance on the validation set for all three instances of the YOLOv3 detector (all vehicles, colored vehicles and single colored vehicles).

Vehicle                Max Fusion   ResNet50 Fusion   *Perfect Fusion*
AudiA2                 77.4%        80.9%             81.8%
AudiTT                 51.6%        63.3%             71.7%
BmwGrandTourer         59.6%        63.3%             71.7%
BmwIsetta              81.3%        81.7%             81.7%
ChevroletImpala        63.1%        72.0%             72.2%
CitroenC3              58.6%        72.0%             72.2%
DodgeChargePolice      70.9%        72.5%             81.6%
JeepWranglerRubicon    72.3%        72.6%             72.6%
Mini                   81.6%        81.8%             81.8%
Mustang                72.2%        72.4%             72.5%
NissanMicra            59.4%        71.2%             71.6%
NissanPatrol           70.2%        72.3%             81.2%
SeatLeon               53.4%        71.4%             72.7%
Tesla                  67.2%        81.1%             81.8%
ToyotaPrius            61.4%        71.4%             72.2%
VolkswagenT2           81.0%        81.7%             81.8%
Mean                   67.6%        73.9%             76.3%
Non-zero mean          67.6%        73.9%             76.3%

Table 4.4: The AP performance for different methods of fusing detections, where "Max Fusion" means selecting the detector with the highest confidence. The *Perfect Fusion* column shows the hypothetical result if the correct detector were always chosen.


5 Conclusion and Future Work

In this final chapter, the conclusions of this thesis work are presented in the form of a discussion about the answers to the questions posed in Chapter 1. Some final concluding statements and a few suggestions for future work are also presented in this chapter.

5.1 Performance of State-of-the-Art CNNs on Synthetic Aerial Images

The first question posed in Chapter 1 was formulated as follows:

1. Will state-of-the-art object detectors on real data maintain their performance if trained on our synthetically generated dataset?

The results collected during this thesis work concerning question one are, for the YOLOv3 detector, found in Table 4.1 and were 73.3% and 43.2% using the PASCAL VOC AP and the COCO AP respectively. These results can be compared with 57.9% and 33%, which are the results presented by the original author of YOLOv3 in [21] on the COCO [17] test-dev set. The higher values indicate a greater accuracy for YOLOv3 on the generated data set; this may, as previously mentioned, be explained by the simplicity of the generated data set, which only depicts one target object in every image, and by the fact that all targets are of very similar size. The detection problem in the generated data set therefore appears to be a simpler one than in COCO [17] test-dev. Our results can also be compared to results produced using YOLOv2 (a predecessor to YOLOv3 with somewhat lower performance) for the PASCAL VOC 2007 challenge, with an mAP of 76.9% in [20], which indicates that the COCO test set is a harder problem than PASCAL VOC.


These results indicate that an architecture capable of learning and performing object detection on a data set containing daily-life objects may also be capable of the same on real aerial images, with the assumption that sufficient training data is available.

5.2 Combining detectors by fusing detections

The second question posed in Chapter 1 was formulated as follows:

2. Is it possible to boost the performance of the detectors by training multiple instances of a detector on different subsets of the dataset, and then perform fusion between those instances?

The results collected during this thesis work concerning question two can be found in Tables 4.3 and 4.4 and in Figure 4.5. The measurements presented in the tables show that it is possible to combine the detectors into a detector with a higher mAP than the detector simply trained on all vehicle types. The column "Perfect Fusion" also shows that this gain could be increased further if a better classification method were found.
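A minimal sketch of the classifier-based fusion is given below, assuming a ResNet50 fine-tuned as a binary classifier that decides, per image, whether the color-variant (S1) or single colored (S2) detector instance should be trusted. The weight file name and the choice of classifying the full image rather than a cropped detection are assumptions made for illustration, not the exact implementation used in the thesis.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Assumed: a ResNet50 with a two-way output head
# (0 = color-variant subset S1, 1 = single colored subset S2).
classifier = models.resnet50()
classifier.fc = torch.nn.Linear(classifier.fc.in_features, 2)
# classifier.load_state_dict(torch.load("subset_classifier.pth"))  # hypothetical weights
classifier.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def classifier_fusion(image: Image.Image, dets_s1, dets_s2):
    """Return the detections of the instance selected by the subset classifier."""
    with torch.no_grad():
        logits = classifier(preprocess(image).unsqueeze(0))
    return dets_s1 if int(logits.argmax(dim=1)) == 0 else dets_s2
```

In this sketch the whole image is classified; classifying the detected crop instead would be an equally valid design choice. The "Perfect Fusion" column corresponds to replacing the classifier output with the ground-truth subset label.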

Figure 4.5 shows that the detector for single colored vehicles has a top performance of 83.6% mAP (on the validation set), significantly higher than the performance of the two other detectors, which indicates that detecting single colored vehicles is a simpler problem with characteristics that the YOLOv3 detector is able to exploit. The same figure also supports the argument that the increase in mAP on the test set is not a coincidence confined to the test set.

The mAP difference between the big detector and the two combined detectors was determined to be 0.6% in favor of the combined detectors. This extra performance may seem like a small reward for all the extra work conducted, but the interesting implication, as stated in Section 1.2, is that even a state-of-the-art architecture (YOLOv3) capable of learning in the context of object detection may be boosted by applying prior knowledge about the problem at hand. In this specific case, the results indicate that the problem of detecting an object in images becomes more difficult if the object may vary in color. It therefore seems that the abstract problem formulation "find the car" is a harder problem for the YOLOv3 detector than "find the red car" (red being an arbitrarily chosen color).


5.3 Concluding statements

In this thesis, we have studied the computer vision problem of object detection in synthetically generated images depicting vehicles from a height of 200 m and a pitch angle of around 20◦. Two state-of-the-art detectors (YOLOv3 and SSD) were trained on the generated images, and both reached reasonable results. An additional investigation was conducted on the best performing detector, whose performance was boosted by applying prior knowledge about the data during training. This was done by dividing the image data set into two groups and training one instance of the detector on each subset. The results show that the investigated detector (YOLOv3), when trained in the usual way (i.e. on the whole data set simultaneously), was unable to find the same vehicle separation that was used to boost the performance by splitting the data set into two sub-groups.

5.4 Future Work

A few open suggestions for future work are to conduct, in the same virtual environment, additional experiments with different state-of-the-art detectors: for example, how the detectors' performance is affected by adding blur to the test images, whether an increase in detector resolution would increase the performance, or how much the viewing angle at test time can differ from the angles used during training before performance degrades.
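As an example of how the first suggestion could be set up, the hedged sketch below blurs the test images with Gaussian kernels of increasing size before re-running the evaluation; the directory paths and the evaluate_detector call are placeholders, not part of the thesis code.

```python
import glob
import os

import cv2

def blur_test_set(src_pattern: str, dst_dir: str, kernel_size: int) -> None:
    """Write Gaussian-blurred copies of the test images to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for path in glob.glob(src_pattern):
        img = cv2.imread(path)
        blurred = cv2.GaussianBlur(img, (kernel_size, kernel_size), 0)
        out_name = f"k{kernel_size}_{os.path.basename(path)}"
        cv2.imwrite(os.path.join(dst_dir, out_name), blurred)

for k in (3, 5, 9, 15):  # Gaussian kernel sizes must be odd
    blur_test_set("test_images/*.png", "blurred_test_images", k)
    # mAP_k = evaluate_detector("blurred_test_images")  # placeholder evaluation call
```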

A final interesting question not investigated in this thesis is how to transfer or adapt the detectors to the real world; one example is the process of creating more photo-realistic images, where generative adversarial networks [8, 15] present an interesting option in the field of image generation.


A Additional illustrations and performance metrics

Figure A.1: Illustrations depicting correct detections made by the YOLOv3 detector. Panels: AudiA2, AudiTT, BmwGrandTourer, BmwIsetta.

Figure A.2: Illustrations depicting correct detections made by the YOLOv3 detector. Panels: ChevroletImpala, CitroenC3, DodgeChargePolice, JeepWranglerRubicon.

Figure A.3: Illustrations depicting correct detections made by the YOLOv3 detector. Panels: Mini, Mustang, NissanMicra, NissanPatrol.

Figure A.4: Illustrations depicting correct detections made by the YOLOv3 detector. Panels: SeatLeon, Tesla, ToyotaPrius, VolkswagenT2.

Figure A.5: Illustrations depicting false detections made by the YOLOv3 detector. Panels: AudiA2, AudiTT, BmwGrandTourer, BmwIsetta.

Figure A.6: Illustrations depicting false detections made by the YOLOv3 detector. Panels: ChevroletImpala, CitroenC3, DodgeChargePolice, JeepWranglerRubicon.

Figure A.7: Illustrations depicting false detections made by the YOLOv3 detector. Panels: Mini, Mustang, NissanMicra, NissanPatrol.

Figure A.8: Illustrations depicting false detections made by the YOLOv3 detector. Panels: SeatLeon, Tesla, ToyotaPrius, VolkswagenT2.

[Figure panels: AudiA2, AudiTT, BmwGrandTourer, BmwIsetta, ChevroletImpala, CitroenC3, DodgeChargePolice, JeepWranglerRubicon, Mini, Mustang, NissanMicra, NissanPatrol, SeatLeon, Tesla, ToyotaPrius, VolkswagenT2.]

[Figure panels: AudiA2, AudiTT, BmwGrandTourer, BmwIsetta, ChevroletImpala, CitroenC3, DodgeChargePolice, JeepWranglerRubicon.]

Figure A.15: Illustrations depicting missing/false detections made by the SSD detector. Panels: Mini, Mustang, NissanMicra, NissanPatrol.

[Figure panels: SeatLeon, Tesla, ToyotaPrius, VolkswagenT2.]


A.3 Precision, Recall and F-measurements for individual car type

A.3.1 YOLOv3 trained on all vehicles

Figure A.17: Metrics over IoU for the YOLOv3 detector. Left column: color variant vehicles subset (S1); right column: constant vehicles subset (S2).


Figure A.18: Metrics over IoU for the SSD detector. Left column: color variant vehicles subset (S1); right column: constant vehicles subset (S2).
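For reference, the precision, recall, and F-measure curves in Figures A.17 and A.18 follow the standard definitions, with true positives, false positives, and false negatives counted at each IoU threshold (the balanced F1 variant is assumed here):

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```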


[1] Md. Zahangir Alom, Tarek M. Taha, Christopher Yakopcic, Stefan Westberg, Mahmudul Hasan, Brian C. Van Esesn, Abdul A. S. Awwal, and Vijayan K. Asari. The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches. CoRR, abs/1803.01164, 2018. URL http://arxiv.org/abs/1803.01164. Cited on page 10.

[2] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016. Cited on pages 10 and 14.

[3] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An Open Urban Driving Simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017. Cited on pages 1 and 21.

[4] Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. 2016. Cited on page 8.

[5] Mark Everingham, Luc Van Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2):303–338, June 2010. Cited on pages 15, 16, and 21.

[6] Ross Girshick. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, Dec 2015. doi: 10.1109/ICCV.2015.169. Cited on pages 10, 11, and 12.

[7] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014. Cited on pages 10 and 11.

[8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org. Cited on pages 5, 6, 8, 9, 15, and 39.


[11] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 448–456. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045167. Cited on page 14.

[12] Sezer Karaoglu, Yang Liu, and Theo Gevers. Detect2rank: Combining object detectors using learning to rank. IEEE Transactions on Image Processing, 25(1):233–248, Jan 2016. ISSN 1057-7149. doi: 10.1109/TIP.2015.2499702. Cited on page 28.

[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. Cited on pages 10 and 25.

[14] Yann Le Cun, Leon Bottou, and Yoshua Bengio. Reading checks with multilayer graph transformer networks. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-97), volume 1, pages 151–154. IEEE, 1997. Cited on page 10.

[15] Peilun Li, Xiaodan Liang, Daoyuan Jia, and Eric P. Xing. Semantic-aware Grad-GAN for Virtual-to-Real Urban Scene Adaption. CoRR, abs/1801.01726, 2018. URL http://arxiv.org/abs/1801.01726. Cited on page 39.

[16] Min Lin, Qiang Chen, and Shuicheng Yan. Network In Network. 2013. Cited on page 9.

[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10602-1. Cited on pages 15, 18, and 37.

[18] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single Shot MultiBox Detector. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016. Springer International Publishing, Cham, 2016.
