DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Data augmentation using military simulators in deep learning object detection applications

WILHELM ÖHMAN

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master in Computer Science
Date: September 10, 2019
Supervisor: Joel Brynielsson
Supervisor at company: Linus Luotsinen
Examiner: Elena Troubitsyna
School of Electrical Engineering and Computer Science
Host company: Swedish Defence Research Agency

Swedish title: Dataaugmentering med militära simulatorer för objektdetektion


Abstract

While deep learning solutions have made great progress in recent years, the requirement of large labeled datasets still limits their practical use in certain areas. This problem is especially acute in domains where even unlabeled data is a limited resource, such as the military domain. Synthetic data, i.e. artificially generated data, has recently attracted attention as a potential solution to this problem.

This thesis explores the possibility of using synthetic data in order to improve the performance of a neural network aimed at detecting and localizing firearms in images. To generate the synthetic data, the military simulator VBS3 is used. Using a Faster R-CNN architecture, multiple models were trained on a range of different datasets consisting of varying amounts of real and synthetic data. Moreover, the synthetic datasets were generated following two different philosophies: one dataset strives for realism, while the other forgoes realism in favor of greater variation. It was shown that the synthetic dataset striving for variation gave increased performance in the task of object detection when used in conjunction with real data. The dataset striving for realism gave mixed results.


Sammanfattning

Deep learning solutions have made great progress in recent years; however, the requirement of a large labeled dataset is a limiting factor. This is an even greater problem in domains where even unlabeled data is hard to come by, such as the military domain. Synthetic data has recently attracted attention as a potential solution to this problem. This thesis explores the possibility of using synthetic data as a way to improve the performance of a deep learning solution. The neural network in question has the task of detecting and localizing firearms in images. To generate synthetic data, the military simulator VBS3 is used. The neural network uses the Faster R-CNN architecture. With this, several models were trained on varying amounts of real and synthetic data. Furthermore, the synthetic datasets were generated following two different philosophies: one dataset tries to mimic the real world, while the other forgoes realism in favor of variation. It is shown that the dataset striving for variation led to increased performance in the task of detecting firearms, while the dataset striving for realism gave mixed results.


Contents

1 Introduction
  1.1 Purpose
  1.2 Research question
  1.3 Limitations
  1.4 Outline
2 Background
  2.1 Swedish Defence Research Agency
  2.2 VBS3
  2.3 IMFDB
  2.4 Machine learning
  2.5 Related work
    2.5.1 The reality gap
    2.5.2 KITTI and VKITTI
    2.5.3 Domain randomization
    2.5.4 Structured domain randomization
    2.5.5 Cut & paste
    2.5.6 Freezing weights with synthetic data
3 Theory
  3.1 Artificial neural networks
    3.1.1 Perceptrons & Feedforward Networks
    3.1.2 Activation functions
    3.1.3 Training a network
    3.1.4 Batch normalization
  3.2 Convolutional neural networks
    3.2.1 Convolutional layer
    3.2.2 Padding
    3.2.3 Pooling layers
  3.3 Object detection
    3.3.1 Region based CNN
  3.4 Data augmentation
  3.5 Synthetic data
  3.6 Transfer learning
  3.7 Evaluation metrics
4 Methodology
  4.1 IMFDB Dataset
  4.2 Generating synthetic data
  4.3 Core structure
    4.3.1 Non-realistic images
    4.3.2 Semi-realistic images
  4.4 Extracting bounding boxes
    4.4.1 Filtering out poor examples
  4.5 Network architecture used for object detection
  4.6 Evaluation
  4.7 Dataset augmentations
5 Results
  5.1 Training with the full amount of real data
    5.1.1 Training with non-realistic data
    5.1.2 Training with semi-realistic data
  5.2 Training with a small real world dataset
6 Discussion
  6.1 Evaluating the results
    6.1.1 Training with the full amount of IMFDB data
    6.1.2 Training with a small real world dataset
  6.2 Discussing the datasets
    6.2.1 Gap between the real and synthetic domains
    6.2.2 Non-realistic versus semi-realistic
  6.3 Limitations of the VBS3 simulator and improving the synthetic data
  6.4 Methodology critiques
  6.5 Ethics and sustainability
7 Conclusions
  7.1 Further research
Bibliography
A Extracting bounding box method
  A.1 Non-realistic
  A.2 Semi-realistic
B Network config file
C Results for all metrics
  C.1 Non-realistic data
  C.2 Semi-realistic data


Chapter 1

Introduction

During recent years great progress has been made in the field of computer vision. In the 2012 ImageNet classification challenge (http://image-net.org/challenges/LSVRC/2016/index), AlexNet [1] outperformed all other traditional methods for image classification by using convolutional neural networks. This started a new resurgence of interest and research in the field of computer vision, as well as in AI in general. Today, neural networks match, or even outperform, humans in the ImageNet classification challenge [2].

State of the art neural networks often contain millions of parameters, and in order to tune these parameters to yield high performance, the network needs to be trained on a proportional amount of data. This means that how well a specific network performs is not dictated solely by how well the network is structured, but is also largely dependent on the dataset it has trained on. In order for the network to properly generalize to unseen data, one needs to make sure that the dataset it has trained on is diverse enough to allow this generalization. Furthermore, machine learning algorithms are designed to find patterns, and if the dataset does not contain enough variation, the found pattern might not solve the problem at hand. An example of this could be a classification task over different types of vehicles. Many of the images of boats used to train the network might have water in the background. This could lead to the network basing its classification on the water in the background as opposed to the boat itself. If presented with a boat on land, it could have a hard time classifying it correctly. This also works in reverse: given an image with water as background, the network might still predict a boat, even though no boat is present.

The best way to remedy these problems is to acquire and use a larger dataset to iron out these biases in the data. Doing so, however, is a costly process, especially in the military domain where data is a very limited resource. Moreover, data is not easily shared even among allies. The ability to efficiently augment military datasets is critical for deep learning applications in this domain.

Generating synthetic data is a promising approach that can be used to address this challenge. Synthetic data in this context is artificially simulated data. This approach has been used to improve object detection of, for example, cars [3], and to teach a drone to fly using only CAD-constructed images [4]. It is, however, unknown to what extent this approach is applicable in the military domain.

By using VBS3, a state of the art military simulator containing a wealth of high quality firearm models, this thesis aims to evaluate whether synthetic data generated in this simulator can be used in conjunction with real life data to improve the performance of a deep neural network object detection model. The objects to be detected in this case are firearms.

1.1 Purpose

This thesis aims at exploring the possibility of using a synthetic image dataset to detect firearms in images. The synthetic images are generated in a state of the art military simulator called VBS3 and are then used, in conjunction with real images, to train a deep neural network.

1.2 Research question

The primary research question for this thesis is:

• Can military simulators be used to generate synthetic image data that augments a real-world dataset to improve the performance of deep learning (DL) applications focused on object detection?

Additionally, an effort will be made to also answer the following sub-questions:

• Does the ratio of real-world to synthetic data have an impact on the results?

• How does the amount of real-world data factor in when using synthetic data, if at all?

• How might the quality of the synthetic data affect the results? Which qualities are desirable in the synthetic data?

1.3 Limitations

The synthetic data is only generated through the military simulator VBS3. Use of this synthetic data will only be evaluated for the purpose of object detection of firearms. The network structure will not be changed in order to accommodate the augmented dataset, should this prove to be a factor. Testing will also only be conducted on one type of model architecture for object detection.

1.4 Outline

In Chapter 2 the background of the thesis is presented, including information related to the thesis as well as previous and related works on the subject. Chapter 3 goes more in depth into the theory needed to understand the network architecture used in this thesis. Based on the theory and background, Chapter 4 presents the method used to answer the research question. The results of using this method are reported in Chapter 5, followed by a discussion of the results in Chapter 6. Lastly, in Chapter 7 the reader will find the conclusions as well as possible future research based on this thesis.


Chapter 2

Background

In this chapter some general information related to the thesis will be presented, along with related works.

2.1 Swedish Defence Research Agency

The Swedish Defence Research Agency (FOI) is a state agency reporting to the Ministry of Defence. Its main focus is research and development of technology relating to military defence. As an agency directly involved in defence, it is of utmost importance for FOI to always be at the forefront of recent technology.

2.2 VBS3

VBS3 (https://bisimulations.com/products/virtual-battlespace) is a highly sophisticated military battle simulator designed by Bohemia Interactive, featuring thousands of highly detailed 3D models. Its main use in a military setting is as a training environment for users to learn and rehearse proper decision making during a wide range of military scenarios. It has been used by the majority of NATO partners during the last 5 years [5]. Its scripting language gives the user the ability to fully customize and create their own scenarios and missions.

2.3 IMFDB

The Internet Movie Firearms Database (IMFDB) is a wiki-based online database of firearms featured in movies. It includes both pictures of the firearms as they appear in film, as well as stock-type images of the firearms. The database also includes several types of explosives, such as hand grenades and mines.

2.4 Machine learning

Machine learning is a field of computer science that focuses on algorithms and statistical analysis that allow computer systems to solve tasks without using specific instructions. This is accomplished by allowing the model to train on a set of datapoints; the model then generalizes to unseen data based on the previously shown datapoints. Overfitting is a term used in machine learning denoting that the algorithm or network does not generalize well enough to the unseen test data, although it performs well on training data. Underfitting is the opposite of overfitting and means that the network is too generalized and performs poorly. Further, there are three primary ways of training: supervised learning, unsupervised learning, and reinforcement learning.

In supervised learning the available dataset is split into training data and testing data. The training data is labeled with the input and the desired output. By using the training data we teach the model to optimize an objective function. In a perfect scenario, the algorithm learns to correctly determine the output for unseen input, i.e. data not part of the training dataset. The performance of the algorithm is then evaluated on the test data. This type of training is commonly used for classification and regression tasks and is the type used and described in this thesis.

Unsupervised training involves data that is not labeled in any way and the tasks most often entail grouping and clustering of datapoints.

Reinforcement learning is a mix of the two aforementioned methods. As in supervised learning there is a learning signal, but it is not presented beforehand; the model must learn it by exploring. Most often the problem concerns an agent placed in an environment, where the agent tries to maximize a cumulative reward. The agent earns rewards by taking an action and then receiving a reward based on the outcome of that action.

2.5 Related work

The interest in using synthetic data as a complement or replacement to real world data has increased significantly in recent years. A great number of datasets containing synthetic data already exist, but the success of using them varies.

2.5.1 The reality gap

One of the most prominent issues with using synthetic datasets is called the reality gap. This term refers to the subtle and small differences between simulations and the real world. Sophisticated machine learning algorithms often learn to exploit these small discrepancies, making simulated environments difficult to learn from [6].

2.5.2 KITTI and VKITTI

One of the larger synthetic datasets is the VKITTI dataset [7], a synthetic dataset for the KITTI challenge [8]. As the KITTI challenge consists of many different parts, it is worth mentioning that references to the challenge here concern the object detection part.

The KITTI dataset itself is a dataset of videos from an autonomous car driving on roads in different environments, such as the city, suburban areas, etc. It collects sensory data and has labels for all objects, both in 3D and 2D. The VKITTI authors Gaidon et al. [7] use this dataset as a seed to generate realistic synthetic data, which allows for further variation in the scenes of the original dataset.

2.5.3 Domain randomization

Domain randomization (DR) is a method used to tackle the reality gap problem mentioned in Section 2.5.1. The goal of the method is to force the neural network to focus on the object of interest as much as possible during training, as real life images of objects often contain a certain bias. An example of this is that some objects are much more likely to appear in one scenario than in another. In such biased cases the network learns the object based on the scenario instead of learning the object itself.

In DR, as shown by Tremblay et al. [3], the authors construct heavily randomized scenarios in order to force the neural network to focus on the features of the object itself as opposed to the entire context of the image. The randomization includes, but is not limited to, lighting, textures of the objects, distractor objects (obscuring the view of the object of interest), camera position, and localization of the object in the scene. These heavily randomized scenarios do not strive for perfect realism or photorealistic quality; they instead do the opposite. By doing so, the network can no longer rely on the context that the object appears in, and as such should perform and generalize better on unseen images. This stands in contrast to other methods of using synthetic data where the goal is photorealistic quality. What DR also succeeds in is priming the network for changes in the domain, so the network can more easily adapt to the target domain.

This method has been shown to increase performance, especially when used in conjunction with a real world dataset, as shown by [3] and [6]. The work by Tremblay et al. [3] also shows that a model trained with their method of DR outperforms models trained on VKITTI, implying that it can be effective to forgo realism and instead aim for variation in the synthetic dataset.

2.5.4 Structured domain randomization

Prakash et al. [9] use a combination of the chaotic randomness of DR and the realism of VKITTI in a method they call structured domain randomization (SDR). In SDR an effort is made to preserve some of the context and structure of the problem at hand by taking a more structured approach to the randomization. The context of their paper is the same as in the paper on DR mentioned in Section 2.5.3, namely creating a synthetic dataset for the KITTI challenge.

They found that during testing with regular DR the network performed adequately on large objects, but struggled with smaller ones. SDR performed well on both, and Prakash et al. [9] argue that context is a necessity in order to accurately classify and detect small objects, based on the theory that the information needed to detect and classify such small objects cannot be stored in the few pixels that the object appears in. They also achieved better overall results using SDR plus real world data compared to regular DR plus real world data, and even achieved good results with models trained exclusively on synthetic data [9].

2.5.5 Cut & paste

Dwibedi, Misra, and Hebert [10] have shown that simply cutting out objects of interest and pasting them into an unseen environment can be an effective way of augmenting a dataset. It works similarly to domain randomization in that it forces the network to focus on the object instead of the background.

This method does have some notable drawbacks. It requires segmentation of the objects, which can be costly; the segmentation needs to be very precise, and simply cutting out the object means that the object is only viewed from one angle.

2.5.6 Freezing weights with synthetic data

Freezing weights has been shown to be able to increase performance when training with synthetic data [11], although it has had varied success in other studies: Tremblay et al. [3], for example, found that performance decreased when the weights were frozen.


Chapter 3

Theory

This chapter contains the theory behind regular and deep neural networks in general, as well as the Faster R-CNN architecture.

3.1 Artificial neural networks

Artificial neural networks (ANN) are a subset of machine learning algorithms. They are inspired by the biological structure of the brain and try to simulate it with neurons activating based on the input [12]. The practical use of ANNs is to approximate different functions by having them trained on an appropriate dataset.

3.1.1 Perceptrons & Feedforward Networks

The perceptron is the most basic and fundamental artificial neural network. It was first introduced by Frank Rosenblatt in 1962 [13]. It takes a number of input elements and produces a single output of 0 or 1 based on these inputs. At its disposal the perceptron has a number of weights equal to the number of input elements, where each weight represents the relative importance of its input value. Each input value is multiplied by its respective weight, and the results are summed up as the weighted sum

$$\sum_{i=0}^{n} w_i x_i$$

where $w_i$ is the accompanying weight of input $x_i$. The output of the perceptron is then based on whether this weighted sum exceeds a certain threshold.


Figure 3.1: A simple perceptron.

The output is described as

$$\text{output} = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > \text{threshold} \\ 0 & \text{otherwise} \end{cases}$$

This threshold is usually moved to the left-hand side of the inequality, where $b = -\text{threshold}$ is called the bias. The notation can then be rewritten as:

$$\text{output} = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i + b > 0 \\ 0 & \text{otherwise} \end{cases}$$

The neuron is a generalization of the perceptron. It behaves in much the same way, except that a different activation function is used to determine the output. Activation functions are covered in Section 3.1.2.

Combining several neurons yields a feedforward neural network, where each neuron has its own respective weights. By adding more layers of neurons together with non-linear activation functions, we further increase the non-linearity of the network, which in turn allows it to approximate more complex functions. This is called a deep feedforward neural network, or multi-layered perceptron. The layers between the input and output layers are called "hidden layers", as seen in Figure 3.2.
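As a concrete illustration of the weighted sum and threshold above, here is a minimal perceptron sketch in Python; the AND example and all values are illustrative additions, not taken from the thesis.

```python
import numpy as np

def perceptron(x, w, b):
    # Weighted sum plus bias, followed by the step activation:
    # output = 1 if sum_i(w_i * x_i) + b > 0, else 0.
    return 1 if np.dot(w, x) + b > 0 else 0

# Example: weights and bias chosen so the perceptron computes logical AND.
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))  # fires only for (1, 1)
```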


Figure 3.2: A simple example of a deep feed forward neural network.

3.1.2 Activation functions

The activation function decides how the neuron shall process the weighted sum of inputs. In the case of the perceptron, the activation function is a simple threshold function. Today the most common activation function is the rectified linear unit (ReLU) or a variation of it, such as the leaky ReLU. Tanh and sigmoid have historically seen use but have since fallen out of favor due to the vanishing gradient problem [14], explained in more detail further down. These activation functions can be seen in Figure 3.3. The activation function introduces non-linearity to the network, which is important if we are to approximate a complex function. The choice of activation function depends on the task and network architecture and can have a large impact when training a network; more on training in Section 3.1.3.

Figure 3.3: Activation functions.


The rectified linear unit function is defined as:

$$f_{\text{ReLU}}(x) = \max(0, x)$$

The function converts all negative values to zero. It is easy and fast to compute and has been shown to greatly accelerate convergence when used with stochastic gradient descent [15]. A problem that arises when using ReLU is that a neuron can "die" during training. If a large negative gradient flows through the neuron, it can update the weights in such a way that the neuron is never activated again [16]; the gradient will then forever be zero for that neuron. This is mostly an issue when the learning rate is too high.

The leaky ReLU lets through some of the negative values in an attempt to fix the dying ReLU problem. It uses a slight negative slope for values below zero and is defined as:

$$f_{\text{LeakyReLU}}(x) = \max(\alpha x, x)$$

where $\alpha$ is a small constant. This gives the neuron a chance to recover. The sigmoid and tanh functions are very similar to each other, with tanh being a scaled version of the sigmoid, ranging from -1 to 1 rather than from 0 to 1. The sigmoid function is defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

and tanh as

$$\tanh(x) = 2\sigma(2x) - 1$$

The sigmoid function has historically been used because it is a natural representation of the firing of a neuron. As mentioned earlier, both the tanh and the sigmoid function suffer from the vanishing gradient problem [14]. This problem occurs when the earlier layers of a deep neural network have gradients very close to 0, which happens when the deeper layers of the network are saturated at -1 or 1, or 0 and 1, the asymptotes of the tanh and sigmoid functions respectively (see Figure 3.3). This causes slow convergence during training and can also result in a poor local minimum. The ReLU and leaky ReLU, however, have a constant nonzero gradient for all positive inputs, so they do not suffer from the same saturation problem. The sigmoid function additionally suffers from not being zero-centered, hence the introduction of the tanh function. In deep networks today, ReLU and leaky ReLU are by far the most common.


Another activation function worth mentioning is the softmax function. It is mostly used in the output layer when working with a discrete task, and is defined as follows:

$$\text{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \ldots, K$$

where $K$ is the number of nodes in the layer. This normalization ensures that the sum of the activations equals 1. The activation values can then be viewed as a kind of certainty measure in classification tasks: the largest value corresponds to the class predicted by the network, and its magnitude indicates how certain the network is of the assigned class.
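A minimal sketch of the activation functions above in NumPy. The subtraction of max(z) in the softmax is a standard numerical-stability trick added here, not something discussed in the text.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # f_ReLU(x) = max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)      # small negative slope below zero

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # ranges from 0 to 1

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0  # scaled sigmoid, ranges from -1 to 1

def softmax(z):
    e = np.exp(z - np.max(z))            # stable: shifting z leaves result unchanged
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z), softmax(z).sum())      # activations sum to 1
```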

3.1.3 Training a network

The learning problem can be formulated as minimizing a given loss function $L$ with respect to a set of weights $W$. The loss function can be seen as a mapping of the error between the desired output of the network, $y$, and the predicted output $\hat{y}$.

Supplying a neural network with an input and calculating the output is called the forward pass. When a network is initialized, all weights are randomly assigned a starting value, so the first forward pass only produces random output. During supervised learning, the error of the forward pass is calculated according to the loss function. There are many different loss functions, and which one to use depends largely on the problem at hand.

In order to train the network, the gradient of the loss function is calculated across the entire network. This is done using the backpropagation algorithm, an efficient way of computing gradients in a directed graph (neural network) [17]. Using stochastic gradient descent [18], once the gradient has been calculated we can update the weights according to the calculated gradient and our learning rate as follows:

$$w_{i,j} \leftarrow w_{i,j} - \eta \frac{\partial L}{\partial w_{i,j}}$$

where $\eta$ is the learning rate. This is called the "backward pass". It is important to select the learning rate appropriately, as a learning rate that is too high will lead to unstable learning and overfitting, while one that is too low can lead to excessively long training and a poor local minimum.
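As a toy illustration of one forward and backward pass, here is a sketch for a single linear neuron under the squared loss $L = \frac{1}{2}(\hat{y} - y)^2$, for which $\partial L / \partial w_i = (\hat{y} - y) x_i$; all values are illustrative.

```python
import numpy as np

x = np.array([0.5, -1.0])  # input
y = 1.0                    # desired output
w = np.array([0.2, 0.4])   # randomly initialized weights
b = 0.0                    # bias
eta = 0.1                  # learning rate

y_hat = np.dot(w, x) + b   # forward pass
error = y_hat - y          # dL/dy_hat for L = 0.5 * (y_hat - y)^2

# Backward pass: w_ij <- w_ij - eta * dL/dw_ij
w = w - eta * error * x
b = b - eta * error
print(y_hat, w, b)
```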


3.1.4 Batch normalization

A limiting factor when training deep neural networks is that the distribution of each layer's inputs changes when the preceding layers of the network update as part of training. As a consequence, the network is very sensitive to the selection of the learning rate as well as to the initialization of the parameters. This is often referred to as internal covariate shift [19].

Batch normalization is a method that normalizes the inputs to each layer and allows the layers to learn more independently of each other. Benefits of using batch normalization are that it allows for a higher learning rate without training becoming unstable, and that the network is less sensitive to the initialization of its parameters [19].
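The normalization itself is not spelled out above; for reference, the transform from the cited paper [19] is, for each activation over a mini-batch $B = \{x_1, \ldots, x_m\}$ with learned scale and shift parameters $\gamma$ and $\beta$:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

where $\epsilon$ is a small constant for numerical stability.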

3.2 Convolutional neural networks

The feedforward network discussed in Section 3.1 only utilizes fully connected layers, meaning that each neuron's output is fed as input to each of the neurons in the succeeding layer; each neuron thus receives input from all neurons in the preceding layer. Processing images of any respectable size will lead to an abundance of parameters, as the input dimension of such an image is substantial. Apart from being computationally expensive, a network with an overwhelming number of parameters will often suffer from overfitting unless regularized properly [20]. To remedy this, convolutional neural networks (CNN) employ a couple of layer types that reduce the number of parameters.

Another aspect of the fully connected layer is that the spatial structure of the data is not taken into account: spatial locality is ignored. In images, spatial locality is often important and something we want the network to consider. By using convolutional layers we not only reduce the number of parameters through parameter sharing, but also make sure to take spatial locality into consideration.

3.2.1 Convolutional layer

A reasonable assumption to make regarding image analysis is that if a feature is relevant to compute at one spatial location, it is also important to compute at another. This opens up the possibility of parameter sharing which is an essential part of the convolutional layer.

The convolutional layer uses kernels, sometimes referred to as filters, to convolve, or slide, around the image. A kernel consists of a matrix of weights that processes a sub-region of the image. The dot product of the pixels of that sub-region and the parameters of the kernel is calculated and stored in the neuron connected to that sub-region, and the kernel then moves on to the next set of pixels. Depending on the stride used, this is done for every relevant sub-region of the image. The stride is the number of pixels the kernel moves by: a stride of 1 means that the kernel processes a sub-region at every pixel, a stride of 2 at every other pixel, and so on. It can also be seen as a measure of how many neurons are responsible for processing the feature map. Each of the weights in the kernel is adjustable, and as the network is trained the kernels learn to activate on different features of the image. Early kernels in the model learn basic shapes, such as lines and circles, while layers deeper in the network learn more abstract and specific shapes. A kernel "learns" a shape when the shape is represented by large values in the kernel matrix: when both the input sub-region and the kernel have large values, the resulting dot product is larger, i.e., the kernel has detected a feature. Figure 3.4 shows the activation map of one such kernel. One such slice of activation map is commonly referred to as a depth slice.

The assumption made at the start of this section is what allows the kernels to share parameters across all sub-regions of the image. Instead of having a separate kernel (weight matrix) at each sub-region, we use the same kernel for all sub-regions; this is why the kernels "slide". In some cases, however, it can be useful not to have this restriction of parameter sharing, for instance when there is a need to learn different features in different locations of the image. Facial recognition is such a case.
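A minimal sketch of this sliding dot product in NumPy, for a single channel and no padding; the 32x32 input and 5x5 kernel mirror the example in Figure 3.4, and the implementation is illustrative rather than the one used by any real framework.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    h, w = image.shape
    kh, kw = kernel.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product of the sub-region and the shared kernel weights.
            region = image[i*stride:i*stride + kh, j*stride:j*stride + kw]
            out[i, j] = np.sum(region * kernel)
    return out

image = np.random.rand(32, 32)
kernel = np.random.rand(5, 5)
print(conv2d(image, kernel).shape)  # (28, 28), as in Figure 3.4 (per channel)
```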

3.2.2 Padding

It is often desirable to preserve the input dimension, especially in very deep networks where repeatedly lowering the dimension causes a loss of information. In such cases padding can be added to the input, which consists of adding a frame of zeros around the image. Padding also aids the network in detecting features along the edges of the image, which can further increase performance.

3.2.3 Pooling layers

Figure 3.4: Example of a convolutional layer. The input is in this case a picture of size 32x32x3. A filter of size 5x5x3 convolves (slides) over all spatial locations. The dot product of the pixels at each sub-region and the parameters is calculated and stored in the neuron connected to that sub-region. This forms the resulting activation map, with a size of 28x28x1, using a stride of 1 and no padding.

Pooling layers can also be used to downsample the feature maps, reducing the number of parameters as well as the risk of overfitting. There are several kinds of pooling layers, with the most common being max pooling; others include average pooling and sum pooling. Pooling layers split the preceding layer's outputs into equally sized regions and, depending on the pooling type, downsample each region into a single value. Max pooling chooses the highest value, average pooling calculates the average of the nodes, and sum pooling calculates their sum. The pooling layer also introduces further non-linearity.
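A matching sketch of 2x2 max pooling in NumPy (illustrative; average and sum pooling would replace max with mean or sum):

```python
import numpy as np

def max_pool(feature_map, size=2):
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]  # drop ragged edges
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))  # keep the highest value per region

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fm))
# [[ 5.  7.]
#  [13. 15.]]
```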

3.3 Object detection

Object detection refers to the problem of localizing and classifying all objects of interest in an image. Each detection is most often represented by a bounding box around the object; an example can be seen in Figure 3.5. There exist multiple strategies for object detection, with the two main ones being the region based convolutional network (R-CNN) [21] and the single shot multibox detector (SSD) [22]. The architecture and the procedure for processing the data differ between the two, and both have different advantages and drawbacks. The R-CNN architecture, or more precisely the Faster R-CNN, has been shown to have an edge in detecting small objects specifically, while suffering from larger inference time [23]. The SSD is faster and yields comparatively good results on large objects, but generally performs worse at detecting small objects compared to the R-CNN [23]. In this thesis the R-CNN architecture is used. The Faster R-CNN architecture is the product of two successive iterations of the original R-CNN model. All three iterations are covered below.

Figure 3.5: A generic example of object detection.

3.3.1 Region based CNN

The region based CNN (R-CNN) structure consists of three modules: the first module is a category-independent region proposal, the second is a CNN that extracts a fixed-dimensional feature vector, and the last performs classification using support vector machines (SVM) [24]. In the first module an algorithm such as selective search [25] is used to extract regions of interest, where a region of interest is defined as an area of the image that may contain an object. Each of these regions is then cropped out of the image and resized. They are then fed into a CNN to extract relevant features, which in turn are fed into the SVM classifier. In addition to extracting features, the CNN also suggests 4 offset values to increase the precision of the bounding box [24]. The structure can be seen in Figure 3.6. The R-CNN model is not very fast, and much of the computational cost comes from individually cropping and feeding each region into the CNN module. The CNN module itself does not require any specific structure and can differ between R-CNN models; its main goal is simply to extract the features of the image in an efficient and precise way.


Figure 3.6: R-CNN architecture. Illustration inspired by Girshick et al. [24]

Fast R-CNN

The R-CNN has since been improved upon by the original creators; the second version is called Fast R-CNN. Instead of feeding each suggested region proposal into the CNN module independently, Fast R-CNN uses what is called Region of Interest (RoI) pooling. The RoI pooling layer shares the feature map generated by the CNN module for the entire image and produces a fixed-size feature map from an input of nonuniform size. This allows the network to share the extracted features when detecting and classifying the regions of interest, meaning the individually proposed regions do not need to be resized in order to be fed to the convolutional layers. Fast R-CNN also forgoes the use of SVMs in favor of a softmax output layer to predict classes, as this was shown to yield increased prediction accuracy [26].

Faster R-CNN

The Faster R-CNN is the latest iteration of R-CNN. It outperforms its predecessors both in speed and accuracy. It uses the same improvements as the Fast R-CNN, with the softmax output and region of interest pooling. In addition, it introduces a region proposal network (RPN). The RPN generates proposal regions in an image more quickly than a costly region proposal algorithm. Given an image of any size, the RPN in the Faster R-CNN outputs a set of rectangular object proposals along with their respective objectness scores, the objectness score being how certain the RPN is that the proposal is an object class (as opposed to background). The region proposal network shares the same convolutional layers as the classifier to extract the features of the image; effectively, the RPN and the RoI pooling layer work with the same extracted feature maps. The RPN uses a sliding window on these feature maps. At each window position, the features are mapped to an intermediate layer of a set number of dimensions, which then maps to the fully connected output layers. The network predicts a total of k region proposals per position and as such produces 4k + 2k outputs: 4k outputs are used to predict the coordinates of the boxes, and 2k to predict the likelihood of an object or background. These k proposals are all relative to k reference boxes, called anchors, each with a different scale and aspect ratio. Since the region proposal module is now part of the learning process, it can also adapt the region proposals to better fit the task data [21].
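To make the anchor idea concrete, here is a sketch generating the k reference boxes at a single sliding-window position; the scales and aspect ratios are illustrative defaults, not values taken from the thesis or from [21].

```python
import numpy as np

def make_anchors(center, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # k = len(scales) * len(ratios) anchors as (x1, y1, x2, y2) boxes.
    cx, cy = center
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)   # aspect ratio r = w / h, area kept near s^2
            h = s / np.sqrt(r)
            anchors.append((cx - w/2, cy - h/2, cx + w/2, cy + h/2))
    return np.array(anchors)

anchors = make_anchors((16, 16))
print(anchors.shape)  # (9, 4): the RPN then predicts 4k offsets and 2k scores
```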

3.4 Data augmentation

To expand a dataset there are several methods of data augmentation that can be applied when working with images. Because objects in images have certain invariances, images can be altered in ways that do not change what the image depicts. Figure 3.7 shows an example of translational invariance: we can recognize the object in all 3 images despite it being at different locations, yet a neural net will consider each of these objects as unique. Instead of requiring each unique image in the original dataset, we can apply different augmentations. In the case of the statue, we can take the main image, crop out the statue, translate it to different areas of the image, and adjust the label accordingly, artificially increasing the dataset. The same logic applies to rotation of the image.

3.5 Synthetic data

Synthetic data is artificially created data, as opposed to data gathered from real world measurements. In this thesis, synthetic data refers to images from the simulator VBS3, and real world data refers to images gathered from the IMFDB. Data created by the augmentation process described in Section 3.4, although technically synthetic, is not considered synthetic data in this context.

Figure 3.7: Example of translational invariance.

3.6 Transfer learning

It has been shown that many models in image analysis learn similar features in the earliest layers of the network; these features are often simple lines, contours, and colour blobs. Transfer learning attempts to utilize this fact in order to minimize training time. By starting from a network trained for some image analysis task, the training of such features for the task at hand (a different task than the network was previously trained for) can be skipped or only require fine-tuning, which speeds up the learning process. It can also improve the performance of the final trained network [27].
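A hedged sketch of this idea with Keras. The thesis itself fine-tunes an Inception v2 detection checkpoint through the TensorFlow object detection API; the classification backbone and two-class head below are illustrative stand-ins.

```python
import tensorflow as tf

# Reuse generic early-layer features from an ImageNet-trained backbone.
base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze: only the new head is trained

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),  # task-specific head
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```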

3.7 Evaluation metrics

The metrics used to evaluate the models in this thesis are the COCO evaluation metrics [28]. The COCO metrics include the mean average precision (mAP), sometimes referred to as just average precision, at different intersections over union (IoU), as well as mAP across different object scales. These object scales are small, medium and large, where the size of an object is defined as

$$\text{small}: \text{area} < 32^2, \qquad \text{medium}: 32^2 \leq \text{area} \leq 96^2, \qquad \text{large}: \text{area} > 96^2$$

where the area refers to the area of the bounding box for the object in number of pixels. The primary metric for the challenge is the average of the mAP scores over 10 different IoU thresholds, ranging from 0.5 to 0.95 with a 0.05 interval.


Additionally, the average recall (AR) across different scales, as well as AR given a number of detections per image, is measured.

To calculate the mAP, first let TP denote true positives, TN true negatives, FP false positives, and FN false negatives. Then

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

and

$$\text{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ are the predicted bounding box and the ground truth bounding box. For a prediction to be considered correct, the label has to match the ground truth and the IoU must be over a certain threshold.

All the predictions made by the classifier are then collected and sorted into 11 ranks, where each rank is a confidence threshold such that the recall of the model at that threshold equals [0, 0.1, ..., 0.9, 1]. The average precision is then defined as the mean of the precision values at each of these ranks.
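A minimal sketch of the IoU computation and an 11-point average precision; reading the ranking scheme above as the usual interpolated AP (taking the best precision at recall ≥ r) is our assumption.

```python
import numpy as np

def iou(a, b):
    # Boxes as (x1, y1, x2, y2); IoU = intersection area / union area.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def average_precision_11pt(recalls, precisions):
    # Mean of the best precision achievable at recall >= r,
    # for r in {0, 0.1, ..., 1.0}.
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 11.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ~= 0.143
```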


Chapter 4

Methodology

This chapter describes how the data was generated and which neural network architecture was used, as well as how the evaluations of the different models are done. First comes a small description of the original dataset and how it is labeled, followed by a more in-depth look at how the synthetic dataset was created and its characteristics. For the specific method of extracting bounding boxes, refer to Appendix A.

4.1 IMFDB Dataset

The dataset used as the real world dataset consists of images scraped from the IMFDB website, described in Section 2.3. All images in this dataset contain at least one firearm. The labeling was done by hand by FOI. The dataset consists of approximately 34000 images. As described in Section 2.3, explosives are featured on the website as well; they are, however, not included in the dataset used for training and evaluation. An example image from this dataset can be seen in Figure 4.1.


Figure 4.1: Image from the IMFDB dataset, in this case an image from Battleship Potemkin (1925).

4.2 Generating synthetic data

To generate the data, the military simulator VBS3, described in Section 2.2, was used. Since FOI already uses this simulator for other purposes, it is readily available, and FOI wants to make the most of the resources already available to them. In the following sections the term "unit" will refer to a 3D model in the simulator depicting a person.

Two types of images were created: one where the firearms are created as individual objects, meaning there is no unit holding the firearm, and one where one or multiple units carry the firearms. The first method has the advantage of showing the entire firearm and lets the network focus on the firearm specifically, without any bias towards a unit holding it. It also allows for more variation, as the scenes created do not strive for realism. This type of image will therefore be referred to as non-realistic; examples can be seen in Figures 4.4 and 4.5. This follows the same philosophy as domain randomization, mentioned in Section 2.5.3.

The second type of image, where units carry the firearms, instead strives for more realistic scenes, at the cost of slightly lower variation. As argued by Prakash et al. [9], the context that an object appears in can play an important role in the detection of small objects. This type of image will be referred to as semi-realistic; an example can be seen in Figure 4.6. Both of these methods use the same core structure for data generation, described in Section 4.3.

4.3 Core structure

The simulator contains several large scale maps set in different geotypical settings. A map in this context is a playable game environment. The maps used only act as variation in the background of the images, as there is no meaningful interaction with the map itself. Four maps were used for the data generation during this thesis:

• Geotypical Afghanistan
• Geotypical Eastern Europe
• Geotypical Southwest USA
• Geotypical Tropical

In addition to providing several maps the simulator also allows for variation in lighting and general look of the scenes. The parameters randomized are as follows:

• Time of day
• Weather
• In-scattering of light to the camera

The time of day varies between 9 and 17 in simulator time. The scattering of light randomizes the color of the sun rays, uniformly across the three RGB channels. Further, there are three randomized weather parameters for each image:

• Level of rain
• Level of overcast
• Level of fog


The camera itself is also placed at a random distance from the firearm, together with a random elevation and field of view. The camera is then randomly tilted by a small angle. The locations of the scenes themselves are chosen randomly, initially by picking a random location on the map; each succeeding scene is then sampled randomly within a small distance from the first. While selecting scenes randomly across the entire map would provide further variation, the quality of the rendered terrain would be lowered (by the simulator, to maintain a stable framerate) if the selected location was far from the preceding one. Therefore a cap on the randomization was introduced: a random location within a much larger radius is selected every 50 images, to still provide some of the variety offered by the map.
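A hedged sketch of this sampling logic in Python. All parameter names, ranges and the map size are illustrative assumptions; the actual generation is done through VBS3's scripting language, which is not shown here.

```python
import random

def sample_scene(prev_location, image_index, map_size=10000.0):
    if image_index % 50 == 0:
        # Every 50 images: jump to a fresh area for map-wide variety.
        location = (random.uniform(0, map_size), random.uniform(0, map_size))
    else:
        # Otherwise stay near the previous scene to keep terrain quality.
        location = (prev_location[0] + random.uniform(-50, 50),
                    prev_location[1] + random.uniform(-50, 50))
    return {
        "location": location,
        "time_of_day": random.uniform(9, 17),              # simulator hours
        "sun_color": [random.random() for _ in range(3)],  # RGB in-scattering
        "rain": random.random(),
        "overcast": random.random(),
        "fog": random.random(),
    }
```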

To keep this dataset similar to the IMFDB dataset, all images contain at least one firearm. Having negative samples (images with no firearms) in the dataset may have an impact on how well the model is able to learn; the motivation here was to ensure that any gain made was due to the synthetic data itself, and not to added negative samples. One could note that supplementing the dataset with negative data is not an expensive procedure, should one wish to include it. As this thesis is mainly focused on reducing the cost of creating a dataset, the creation of negative samples is not included.

4.3.1 Non-realistic images

For the images containing the firearms as individual objects, each image contained one or two firearms, selected either from a list of handguns or a list of rifles and machine guns, each with a 50% chance. The firearms were placed in front of the camera at a random distance and rotated randomly across all three axes (pitch, roll, yaw). In addition, a number of distractor objects (ranging between 1 and 25) were created and placed slightly behind the firearms. These served to provide further variation in the image, as many parts of the maps contain very open and homogeneous areas (such as a desert). The objects include models such as tables, chairs, persons, etc. Examples of the non-realistic images can be seen in Figures 4.4 and 4.5.

4.3.2 Semi-realistic images

The other type of images were the semi-realistic ones. These images were further divided into two equally likely scenarios: one with a single unit present in the image, and one with multiple units.


For the scenario with one unit present, the camera is closer to the firearm, allowing a more detailed view of it. For the other, the camera is further away in order to capture multiple units. This diversifies the dataset so that there are both images with a detailed view of the firearms and images where the firearms are further away and smaller.

All units are created frozen in a random animation. These animations include running, sitting, lying prone, crouching, different standing poses, etc. The number of units present in the second scenario is between 2 and 4, uniformly distributed. The unit models are male soldiers in different uniforms as well as male and female civilians. An example can be seen in Figure 4.6. There were no female soldier models in the version of VBS3 that was used, which is why they are not part of the synthetic datasets.

4.4 Extracting bounding boxes

Since the simulator is not built for the purpose of data generation, a few extra steps were necessary in order to extract accurate bounding boxes. The method used for extracting bounding boxes from the simulator is described in Appendix A. This method produced pixel-perfect segmentation of the firearms present in the image. Examples of the segmentation for the semi-realistic dataset can be seen in Figures 4.2 and 4.3.

4.4.1 Filtering out poor examples

There are many random factors when creating the dataset, and some images are bound to have the firearm obscured by another object to the point where it is not recognizable, or simply not in view at all. To sift out these images, the number of firearm pixels for each firearm is counted. If the total number of pixels does not exceed a certain threshold, a bounding box is not drawn for that firearm. If the image does not contain any bounding boxes at all, it is discarded.
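A minimal sketch of this filtering step; the threshold value and the boolean-mask representation of the segmentation are illustrative assumptions.

```python
import numpy as np

MIN_PIXELS = 100  # illustrative visibility threshold

def boxes_from_masks(masks):
    """masks: one boolean array per firearm, True where it is visible.
    Returns (x1, y1, x2, y2) boxes for sufficiently visible firearms."""
    boxes = []
    for mask in masks:
        if mask.sum() < MIN_PIXELS:
            continue  # firearm too obscured; draw no box for it
        ys, xs = np.nonzero(mask)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return boxes  # if empty, the caller discards the image entirely
```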


Figure 4.2: Handgun with corresponding pink segmentation

Figure 4.3: Scene with multiple units with one corresponding pink segmentation image.


Figure 4.5: Sample image of a non-realistic generated image along with generated bounding boxes. The label of the bounding box in the example is the name of the 3D model in the VBS3 simulator.

Figure 4.6: Sample image of a semi-realistic generated image along with generated bounding boxes. The label of the bounding box in the example is the name of the 3D model in the VBS3 simulator.

4.5 Network architecture used for object detection

As mentioned in Section 2, there are a number of different architectures for object detection. Since the Faster R-CNN has shown better performance specifically in the task of detecting small objects, it was deemed the most suitable architecture for the task of detecting firearms. Training an object detector from scratch is no easy task and would in itself take several weeks. Instead, we utilize the findings on transfer learning mentioned in Section 3.6. A pre-trained Inception v2 [29] model was used for the experiments in this thesis. The network had been pre-trained on the COCO dataset (http://cocodataset.org) [28], and this network is used as a fine-tuning checkpoint for all the different models.

The TensorFlow object detection API was utilized to create, train and evaluate the models. The pre-trained model was also acquired through its "model zoo".

The networks were trained with a step size of 0.0002. The exact settings for the network can be seen in Appendix B.

4.6 Evaluation

To gauge the impact of complementing the original dataset with synthetic data, several different models were trained, each with a unique dataset. First, a model was trained with none of the synthetic data. As mentioned in Section 4.5, each model uses the same baseline checkpoint of Inception v2 pre-trained on the COCO dataset. The remaining models were then individually trained on datasets consisting of the IMFDB data plus a certain amount of synthetic data. The evaluation is done exclusively on the IMFDB dataset, as there is no interest in detecting firearms in the synthetic data. All models were evaluated on 15% of the IMFDB dataset (5135 images). These 15% were randomly selected from the entire IMFDB dataset, with the remaining images used for training; the split was randomly generated for each model. The metrics gathered are the same as for the COCO dataset (see Section 3.7), with the primary performance metric being the mean average precision. The models were evaluated continuously during training and the highest mean average precision is recorded. The models were trained with 500, 1000, 2500, 5000, 7500, 10000, 12500, 15000 and 20000 images from the synthetic datasets.

Additionally, models were trained using only a portion of the IMFDB dataset together with synthetic data. These models were trained with 1000, 5000 and 10000 images from the IMFDB dataset along with 0, 5000, 10000, 15000 and 20000 synthetic images. They were trained with the exact same hyperparameters as the models trained with the full amount of IMFDB data, except for the number of epochs, which was increased to 10 (see Appendix B for the rest of the settings). These models were evaluated in the same way: continuously, with the highest mean average precision reported.
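A hedged sketch of the per-model dataset construction described above; function and variable names are illustrative.

```python
import random

def build_datasets(imfdb_images, synthetic_images, n_synthetic, eval_frac=0.15):
    # Fresh random split for every model: 15% of IMFDB held out for evaluation.
    shuffled = random.sample(imfdb_images, len(imfdb_images))
    n_eval = int(len(shuffled) * eval_frac)
    eval_set = shuffled[:n_eval]
    # Remaining real images plus the chosen number of synthetic images.
    train_set = shuffled[n_eval:] + random.sample(synthetic_images, n_synthetic)
    random.shuffle(train_set)  # real and synthetic domains interleaved
    return train_set, eval_set
```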


4.7 Dataset augmentations

The synthetic and real world data are shuffled together for all experiments, so that the domain of the image presented during training is random. The specific synthetic images in each dataset are also randomized. In addition to complementing the dataset with synthetic data, a few more data augmentations were applied during training of the models, including:

• Random brightness
• Random contrast
• Random hue
• Random saturation
• Random distortion of colors
• Random jitter boxes
• Random horizontal and vertical flips
• Random 90 degree rotations
• Random image scaling
• Random cropping of the image

The augmentations in this list are applied to all of the data, both real world and synthetic.
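A hedged sketch of the photometric and geometric augmentations above using tf.image. The thesis configures equivalent options through the TensorFlow object detection API; the ranges below are illustrative, and the box-dependent augmentations (jitter boxes, scaling, cropping) are omitted since they require matching label adjustments.

```python
import tensorflow as tf

def augment(image):
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.random_hue(image, max_delta=0.05)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    # Random 90 degree rotation: k in {0, 1, 2, 3} quarter turns.
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    return tf.image.rot90(image, k=k)
```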


Chapter 5

Results

This chapter presents the results produced by the method in Chapter 4.

5.1 Training with the full amount of real data

In this section the results of training with the full amount of real world data together with an increasing amount of synthetic data are presented. First come the models trained with the non-realistic data, followed by the models trained with semi-realistic data.

5.1.1 Training with non-realistic data

As described in Section 4.6, these are the results of networks trained with the full amount of real world data and an increasing amount of non-realistic synthetic data. In Table 5.1 the highest mean average precision (mAP) is reported for each of the models trained, with the top row being the amount of synthetic data used during training and the bottom row the reported mAP. The amount refers to the specific number of images used. The mAP is averaged over the 10 intersection over union (IoU) thresholds ranging from 0.5 to 0.95 with a 0.05 step interval. See Section 3.7 for the evaluation metrics.

For a detailed report regarding all metrics collected refer to Appendix C.

Synthetic  0      500    1000   2500   5000   7500   10000  12500  15000  20000
mAP        0.414  0.415  0.414  0.413  0.422  0.416  0.417  0.418  0.420  0.415

Table 5.1: Mean average precision of models trained with the specified amount of synthetic data.


Figure 5.1 shows how the mAP changes as the amount of synthetic data is increased.

Figure 5.1: Mean average precision (mAP) as a function of the amount of synthetic data.

Further, Figure 5.2 shows the average percentage increase across all metrics.

Figure 5.2: Average percentage increase across all metrics as a function of the amount of synthetic data.


5.1.2 Training with semi-realistic data

As described in Section 4.6, these are the results of networks trained with the full amount of real world data and an increasing amount of semi-realistic synthetic data.

Synthetic  0      500    1000   2500   5000   7500   10000  12500  15000  20000
mAP        0.414  0.412  0.413  0.419  0.407  0.405  0.413  0.413  0.411  0.414

Table 5.2: Mean average precision of models trained with the specified amount of synthetic data.

Figure 5.3: Mean average precision (mAP) as a function of the amount of synthetic data.

Figure 5.4: Average percentage increase across all metrics, plotted against the amount of semi-realistic synthetic data.


5.2 Training with a small real world dataset

The results presented here are the mAP values from training networks with a limited amount of data from the IMFDB dataset, along with the results from training networks with an augmented dataset. Table 5.3 shows the results from training with an augmented dataset using the non-realistic data, and Table 5.4 using the semi-realistic data.

IMFDB \ Synthetic   0      5000   10000  15000  20000
1000                0.215  0.245  0.244  0.246  0.233
5000                0.312  0.327  0.333  0.325  0.329
10000               0.356  0.365  0.359  0.356  0.348

Table 5.3: Mean average precision using non-realistic data. Top is the specified amount of synthetic images used, left is the specified amount of real images.

IMFDB \ Synthetic   0      5000   10000  15000  20000
1000                0.215  0.235  0.233  0.243  0.235
5000                0.312  0.318  0.303  0.309  0.320
10000               0.356  0.353  0.351  0.355  0.354

Table 5.4: Mean average precision using semi-realistic data. Top is the specified amount of synthetic images used, left is the specified amount of real images.


Chapter 6

Discussion

This chapter discusses the results produced using the method in Chapter 4, along with the method itself. There is also some analysis of the ethics of the results of the thesis, as well as potential impacts on sustainability.

6.1 Evaluating the results

This section discusses the results of the different experiments carried out in this thesis.

6.1.1 Training with the full amount of IMFDB data

The graphs in Figures 5.1 and 5.3 show the results of the experiments where the full amount of IMFDB data was used. The models trained with the non-realistic data show an increase in mAP, except for the model trained with 2500 added synthetic images. Most of the models trained with semi-realistic data show a slight decrease in mAP, except for the one trained with 2500 added synthetic images. This suggests that the non-realistic dataset is better suited for this problem than the semi-realistic one.

There does not seem to be a clear correlation between the amount of synthetic data and the performance of the network, at least not for the amounts of synthetic data used in these experiments. From the graphs we can see a slight peak in mAP at around 2500 and 5000 synthetic images for the semi-realistic and non-realistic data respectively. But looking at Tables 5.1 and 5.2, the second highest reported mAP is at 15000 synthetic images for the non-realistic dataset and 20000 synthetic images for the semi-realistic. One would expect there to be an optimal amount of synthetic images, as the model should overfit to the synthetic data once it constitutes an increasingly large part of the dataset. It could be the case that, since the training and validation sets are randomized, there is variation in the reported performance of the trained models that is not due to the amount of synthetic images, but rather to a good or bad composition of real and synthetic data in the training set, combined with an easy or hard validation set.

The average performance gain is also measured, in Figures 5.2 and 5.4. All models trained with added synthetic data show an average increase in the metrics collected. This could hint at there being certain scenarios where the added synthetic data is helpful for purposes other than simply mAP. This was, however, not tested in this thesis due to time constraints, but could be a possible area of further research.
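For clarity, the aggregation behind these figures can be sketched as follows, assuming baseline and augmented map the collected COCO metric names to their values; the names and structure are illustrative, not the thesis code.

# Average percentage increase of an augmented model over the baseline,
# taken across all collected metrics.
def average_percentage_increase(baseline, augmented):
    increases = [
        100.0 * (augmented[metric] - baseline[metric]) / baseline[metric]
        for metric in baseline
    ]
    return sum(increases) / len(increases)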

6.1.2 Training with a small real world dataset

Starting with the non-realistic data, we can see from Table 5.3 that when training with a small real world dataset, using non-realistic synthetic data gives increased mAP at all levels of real data (1000, 5000 and 10000 data points). Interestingly, when using 10000 real data points the performance drops as the amount of synthetic data increases, with a steep drop at 20000 synthetic images. This could hint at the model overfitting to the synthetic data as its share of the training set grows.

For the semi-realistic data, seen in Table 5.4, models trained with synthetic data outperform the baseline at the lower levels of real data (1000 and 5000 data points), while at 10000 real data points the model trained only on real data outperforms the models trained with synthetic data. A hypothesis to explain this could be that the models trained with a low amount of real world data do not have enough training data to learn even the basic shapes and patterns of the firearms. Adding the synthetic data helps the model detect these basic shapes and patterns, and so the performance is increased. With the larger amount of real world data, the model might have enough real training data to find these basic shapes and patterns and even learn more advanced ones. The large amount of synthetic data then becomes a detriment: the model instead overfits to the synthetic data, despite having enough resources to learn from the real data. That this effect is not present in the non-realistic data further strengthens the claim that the non-realistic dataset is better suited for this problem.

For both datasets there is quite a large performance gain for the models with only 1000 and 5000 real data points, showing that using synthetic data when there is a severe lack of real data might be a viable strategy to boost performance.

6.2 Discussing the datasets

This section discusses how the synthetic data relates to the real data, as well as why the non-realistic data performs better than the semi-realistic data.

6.2.1 Gap between the real and synthetic domains

To understand why the synthetic datasets perform as they do, one needs to consider the goal of the synthetic data: to improve performance when evaluating on the real dataset, in this case the IMFDB data. As mentioned, there are two types of images in the IMFDB dataset, the first being stock type images with a white background. These were not replicated in the synthetic data. The other type of image in the IMFDB dataset is images from movies, which one can imagine have a slight bias in how the image is structured. For example, a shot staring down the barrel of a firearm might be more common in such a dataset than in natural images, and is far less likely to occur when looking at the firearms from a random angle. Furthermore, the only models depicted in the simulator are modern firearms, primarily heavy assault rifles. This is another disconnect between the datasets, as the IMFDB has a much larger diversity in its firearms. The IMFDB tries to feature as many different firearms as possible in its database, and as such several of the firearms in the real dataset are not present in the synthetic data.

6.2.2 Non-realistic versus semi-realistic

From the results we can clearly see that the non-realistic dataset outperforms the semi-realistic one, both when training with the full amount of real data and when training with a small amount of real data. A couple of points could explain this. To start, consider the domains of the synthetic and real world datasets mentioned in Section 6.2.1. One is stock type images with a clear view of a firearm against a white background. Of the synthetic datasets, the non-realistic one clearly resembles these types of images more, with no one holding the firearm and the firearm often in clear view in the foreground. This could very well be one reason for the increased performance.


The non-realistic dataset also generally contains more variation. Firstly, the images in this dataset contain a clear view of the firearms from all angles, including ones that would be rare in the real world, such as views from below. Secondly, this dataset also includes a number of distractor objects. By having more objects present in the images, these could help the region proposal network suggest more appropriate regions of interest.

Having the firearms held by a unit (as in the semi-realistic dataset) also makes it much more likely that the firearm is obstructed, which could further speak in the non-realistic dataset's favor.

6.3 Limitations of the VBS3 simulator and improving the synthetic data

As the simulator was not designed with this kind of data generation in mind, its limitations led to some compromises that resulted in a slight decrease in quality of the synthetic images. The first issue encountered was extracting the bounding boxes of the firearms. While the simulator does contain functions to extract the 3D bounding box of a model, including firearms, these boxes are often significantly larger than the model's actual size. Coupled with the fact that converting the 3D bounding box from the game's world space to 2D screen coordinates further increases the discrepancy between the calculated and the true 2D bounding box, this makes the method very unreliable. Instead, the method described in Appendix A was used, which also had some issues, especially for the semi-realistic data. The masking function used in this method did not always function as intended and would sometimes mask objects (in this case firearm attachments) that use identical 3D models. This led to faulty bounding boxes in some of the images; an example can be seen in Figure 6.1. The non-realistic dataset did not suffer from this to the same extent, since all other firearms were hidden when capturing the segmented image. There were, however, some cases where noise in the background would cause an error in the extracted bounding box.
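A minimal sketch of the mask-based alternative, in the spirit of the method in Appendix A, is given below. It assumes the firearm's pixels have already been isolated into a binary mask from the segmented capture; this is illustrative, not the thesis code.

# Derive a tight 2D bounding box from a binary segmentation mask.
import numpy as np

def bbox_from_mask(mask):
    # mask: 2D boolean array, True where the firearm is rendered.
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # firearm fully occluded or off-screen
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())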

Figure 6.1: Issue with the method of extracting bounding boxes. Here we can see that the top left rifle was masked together with the rifle in the bottom right, leading to an incorrect bounding box.

Another issue is that the game engine the simulator runs on forcefully lowers the level of detail of models or terrain when they are first introduced on screen (in order to maintain a steady framerate). As a result, some of the images have a very low level of detail. This is further exacerbated by the fact that the camera is moved for every image, meaning that almost all of the models in the image are newly introduced ones. To combat this, a function that waits for a certain number of frames was introduced, as well as a limit on the distance that the camera could move. This had moderate success: most of the images were of higher quality, although there were still some with a low level of detail. These low quality images will most likely impact the results negatively, and were present in both datasets.

In their studies, Tremblay et al. [3] and Prakash et al. [9] both argue for the importance of variation in bridging the reality gap. Some of these variations were not present in the datasets generated during this thesis, either due to limitations of the simulator or due to the author's inability to produce them with the simulator in the available time. Variation in lighting was one of these, and was shown by Tremblay et al. [3] to be one of the most important aspects to include when generating synthetic data. While some variation was achievable in VBS3 by adjusting the time of day and the scattering of rays, it would be beneficial if this could be expanded further. As an example, the sun is the only light source used. This was a limiting factor in the semi-realistic images particularly, as the
