Object Detection using deep learning and synthetic data


Department of Science and Technology

Institutionen för teknik och naturvetenskap

Linköping University

Linköpings universitet

LiU-ITN-TEK-A--18/030--SE

Object Detection using deep learning and synthetic data

Love Lidberg


Thesis work carried out in Computer Engineering at the Institute of Technology, Linköping University

Love Lidberg

Supervisor: Pierangelo Dell'Acqua

Examiner: Jonas Unger


Copyright (Upphovsrätt)

This document is made available on the Internet – or its possible future replacement – for a considerable time from the date of publication, provided that no extraordinary circumstances arise.

Access to the document implies permission for anyone to read, download, and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. Transfer of the copyright at a later date cannot revoke this permission. All other use of the document requires the consent of the author. To guarantee authenticity, security and accessibility, there are solutions of a technical and administrative nature.

The moral rights of the author include the right to be mentioned as the author, to the extent required by good practice, when the document is used in the manner described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or individuality.

For further information about Linköping University Electronic Press, see the publisher's home page: http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/


Abstract

This thesis investigates how synthetic data can be utilized when training convolutional neural networks to detect flags with threatening symbols. The synthetic data used in this thesis consisted of rendered 3D flags with different textures and of flags cut out from real images. Models trained on synthetic data achieved an accuracy above 80%, compared to the 88% accuracy achieved by a data set containing only real images. The highest accuracy was achieved by combining real and synthetic data, showing that synthetic data can be used as a complement to real data. Some attempts to improve the accuracy were made using generative adversarial networks, but without achieving any encouraging results.


Acknowledgments

I would like to thank my supervisor at FOI, David Gustafsson, my supervisor at Linköping University, Pierangelo Dell'Acqua, and my examiner Jonas Unger for the help and input on my work. I would also like to thank my fellow thesis writers at FOI for the great companionship and all the daily table tennis breaks.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Aim
  1.4 Research questions
  1.5 Delimitations
2 Theory
  2.1 Machine Learning
  2.2 Neural Networks
  2.3 Convolutional Neural Networks
  2.4 Architecture of a CNN
  2.5 Transfer Learning
  2.6 Generative Models
  2.7 Data sets
  2.8 Evaluation Metrics
  2.9 Related Work
3 Method
  3.1 Frameworks, Software & Hardware
  3.2 Data sets
  3.3 Tests
  3.4 Evaluation Metrics
4 Results of the Tests
  4.1 Test: Classification
  4.2 Test: Object Detection
  4.3 Test: Adding Realism to Generated Images
  4.4 Test: Flags on Random Generated Background
5 Discussion
  5.1 Answer to Research Questions
6 Conclusion


List of Figures

2.1 An overview of the structure of the perceptron.
2.2 Activation functions
2.3 An example of a multi-layered perceptron.
2.4 Typical architecture for a convolutional neural network
2.5 Convolutional operation
2.6 Convolutional operation with zero padding.
2.7 Max pooling
2.8 Unpooling
2.9 The Inception module
2.10 An example of a residual block
2.11 Simple overview of a GAN. The generator, G, generates images from a random distribution, Z, and the discriminator tries to distinguish them from some real target domain.
3.1 Example of real images containing desired objects
3.2 Example of images labelled as 'Nothing'. The images are taken from the COCO data set.
3.3 Example of the different flags rendered in Blender. Three different textures were used for each flag.
3.4 Flags pasted into real images.
3.5 Images taken from game play of Grand Theft Auto V.
4.1 Images before and after they've gone through the 'fake to real' generator.
4.2 Images before and after they've gone through the 'real to fake' generator.


List of Tables

3.1 Data sets used for the classification test.
3.2 Data sets used in the Object Detection test.
4.1 Accuracy for rendered flags on different backgrounds.
4.2 Accuracy scores for the classification experiment.
4.3 Recall and precision scores for the classification experiment
4.4 Accuracy scores for the object detection experiment
4.5 Recall and precision scores for the object detection experiment

1 Introduction

1.1 Background

Millions of images are posted daily on social media around the world[5]. Though most images are harmless, some are discriminatory, threatening or call for violence and terrorism. In fact, terrorists have in the past been known to post to social media prior to their attacks, and terrorist groups use social media as a way to spread their propaganda and to recruit more people[31]. A person posting images containing terrorists and certain symbols to illustrate their views could be an indicator that this person may pose a threat in the future. The police want to take advantage of this situation to be able to prevent future attacks. Scanning through millions of images by hand to find the ones that contain objects of interest, for example a weapon or an ISIS flag, is time consuming and not sustainable.

To automate this task, one can employ machine learning and computer vision techniques. Machine learning is a field within computer science where computers learn to act on a task without being specifically programmed for that task[3]. Computer vision is another field, closely connected to artificial intelligence, that seeks to automate the tasks of the human visual system. Machine learning has previously proved able to automate many tasks such as spam filtering, playing games or predicting stock prices[1][20][32]. A machine learning area that has gained a lot of attention in the last decade is neural networks, often referred to as 'Deep Learning'. Deep learning has, with the rise of computing power and the massive amount of available data, shown to be a powerful technique that can be applied to many fields. Convolutional neural networks are a type of deep feed-forward neural network that has revolutionized many tasks within computer vision. Typical computer vision tasks such as image classification, object detection or semantic segmentation have been challenging for a long time. Nowadays, using deep learning is the standard approach to solve these problems. Instead of relying on hand-tuned features as before, convolutional neural networks can be trained on data and learn on their own which features are relevant for the problem. Convolutional neural networks need a lot of data for training in order to generalize and perform well on new images. The problem is when there is no data to train the network. In such cases, augmenting the existing data has shown to be one way to boost the result[30].


Transfer learning is another way[33]. 3D modelling software is nowadays a powerful tool that is able to produce images indistinguishable from real images. Can this technique be used when there is a shortage of data? Images that are produced artificially and are not true examples are often referred to as 'synthetic data'. This thesis will investigate how synthetic data can be used to find certain given symbols of hate in images by using deep learning.

1.2 Motivation

This thesis has been carried out at the Swedish Defence Research Agency, FOI. It is part of a bigger project that focuses on evaluating the use of big data in crime and terrorism prevention. Software that can scan through thousands of images quickly and automatically to find possible future perpetrators who have posted threatening or suspicious images would be a useful tool. This thesis will focus partly on delivering a model that can be a part of such software.

1.2.1 Swedish Defence Research Agency

FOI is an assignment-based authority working under the Swedish Ministry of Defence. Their main customers are the Armed Forces and the Swedish Defence Materiel Administration, but they also accept assignments from civil authorities. FOI is one of the leading research institutes in Europe in the areas of defence and security. Their aim is to contribute to a safer and more secure world by being a world leader in these areas.

1.3 Aim

The aim of this thesis is to train a deep convolutional neural network to detect symbols of hate in images and investigate how and if it is possible to improve the results of the model with synthetic training data.

1.4 Research questions

1. With limited data available, what efforts can be taken to boost the results of a convolutional neural network?

2. Is it possible to train a convolutional neural network using only synthetic training data to detect desired objects?

3. Can generative adversarial networks, GAN, be used for the production of synthetic data?

1.5 Delimitations

This thesis will only focus on identifying three different symbols in images. Object detection and classification are mostly considered to be solved problems; hence this thesis will focus on the use and production of synthetic data, how synthetic data can be made and how well it performs compared to real images.

2 Theory

This chapter presents the theory and the history behind the technology used in the thesis.

2.1 Machine Learning

As said in the previous chapter, machine learning can be seen as a subfield of artificial intelligence. Its goal is to make computers learn tasks from data without being explicitly programmed. Machine learning is often divided into three types: supervised learning, unsupervised learning and reinforcement learning. This thesis will not cover reinforcement learning. When the data that is being fed to the machine learning algorithm is labeled, it is called supervised learning. A common use for supervised learning is to predict the outcome of something uncertain or unknown based on previous data, for example to filter e-mails as spam or not spam automatically, or to predict the price of a stock the following month. In supervised learning the algorithm learns to estimate the probability p(y|x), where y is the output of the algorithm given the variable x. If the data instead is unlabeled, it is referred to as unsupervised learning. Unsupervised learning is often used to make sense out of unlabeled data, to cluster pieces of data together or to detect anomalies. When using a machine learning algorithm, one often desires either to classify something, for example whether there is a dog in an image, or to produce a continuous output such as the price of a house. The two tasks are referred to as classification problems and regression problems. A common problem in machine learning is overfitting or underfitting. Underfitting happens when a model fails to perform well; it can happen because the model is too simple for the problem. A model is overfitting when it performs very well on the training data but fails to generalize to new examples.

When training a model you want to make sure it generalizes well to new examples. A common way to ensure this is to divide the data set into three parts: a training set, a validation set and a test set. The training set is the set that the model will learn from, and the majority of the data set will go into this category. The validation set consists of examples from the data set that the model will test itself on to see how well it performs. The model never trains on these examples; they are only available to see how well the training is going. This set is mostly used to determine when a model is done training and for tuning hyperparameters. Finally, when a model is done with the training it is exposed to the test set; this set determines how well the model generalizes. The test set contains examples never seen by the model except during the test. When evaluating different models one commonly uses the same test set for all models to compare them and see which one performs best.

2.2 Neural Networks

Neural networks are a machine learning algorithm modelled to mimic how the human brain works. A neural network consists of neurons, or units, that take inputs from a previous layer, sum them together and then pass the sum through an activation function to generate an output. The precursor to the neural network, the perceptron, consisted of a single neuron that summed up the inputs after they had been multiplied with their corresponding weights and sent the sum through an activation function to produce an output, see figure 2.1. The algorithm was introduced in the 1950s by Frank Rosenblatt[25]. A perceptron can be described by the following equation:

Y = \varphi\left(\sum_{i=1}^{n} W_i \cdot X_i + b\right)   (2.1)

where X denotes the input, W is the corresponding weight, \varphi is the activation function and b is the bias. The bias shifts the decision boundary away from the origin and does not depend on any input value.

Figure 2.1: An overview of the structure of the perceptron.
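As a minimal illustration of equation 2.1, the forward pass of a perceptron can be written in a few lines of Python (a sketch with made-up weights, not code from the thesis):

    import numpy as np

    def step(z):
        # Step activation used by the original perceptron: 1 if z >= 0, else 0.
        return 1 if z >= 0 else 0

    def perceptron(x, w, b):
        # Weighted sum of the inputs plus the bias, passed through the activation.
        return step(np.dot(w, x) + b)

    # Example: a perceptron computing logical AND of two binary inputs.
    w = np.array([1.0, 1.0])
    b = -1.5
    print(perceptron(np.array([1, 1]), w, b))  # 1
    print(perceptron(np.array([0, 1]), w, b))  # 0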

The activation function is a function that takes the sum of a neuron and decides if this neuron should 'fire', or be activated. The activation function used for the first perceptron was the step function, which mapped the weighted sum of the neuron to either 0 or 1. Today there are a few other activation functions that are used in almost every network, see figure 2.2. The sigmoid function takes an input and produces an output anywhere between 0 and 1. It is commonly used in the last layer of a network for classification problems to produce a probability of the input belonging to a specific class. The Tanh activation function is similar to the sigmoid function but maps the input from -1 to 1, which produces a stronger gradient when performing back propagation. The two other most used activation functions are the rectified linear unit, ReLU, and leaky ReLU[22][19]. Both functions leave positive inputs untouched but suppress negative inputs. The non-leaky ReLU (figure 2.2 (b)) maps all negative inputs to zero, while the leaky ReLU function just suppresses them close to 0. By not being zero they still have a gradient, which can be helpful to avoid problems during training.

Figure 2.2: Activation functions — (a) Leaky ReLU, (b) ReLU, (c) Sigmoid, (d) Tanh.

The first neural networks consisted of more than one perceptron stacked on top of or next to each other, see figure 2.3. These networks are called feed-forward networks or multi-layered perceptrons, MLPs. They consist of an input layer that connects to each neuron in the next layer. The layers between the input and output layers are called hidden layers, where each neuron is connected to all neurons in the previous layer and all neurons in the next layer. A feed-forward network with only one hidden layer, provided with a sufficient number of neurons, has been shown to be able to approximate any nonlinear function[13]. A feed-forward network can have an arbitrary number of hidden layers. A network with two or more hidden layers is usually referred to as a deep network, which has coined the term often used when referring to neural networks today: Deep Learning.


Figure 2.3: An example of a multi layered perceptron.

When an input is passed through a network, or propagates forward, and produces an output, we refer to it as a forward pass. A newly constructed network would probably produce an arbitrary and useless output, because the weights in the network have not yet been optimized for the problem. To make the algorithm produce an output that better fits the data, we have to optimize the weights with respect to the training data. This is what is referred to as training the network.

To train the network, after every forward pass we compare the output to the correct label for the corresponding input and update the weights so they approximate the target labels better. To do this we have to have some way of measuring how far off the output of the network is from the correct label. Therefore we define a loss function that tells us how good, or bad, the output was. The most common loss functions are the least square error, LSE, and the least absolute error, LAE. The key difference between the two is that the LSE loss penalizes large errors more heavily than the LAE loss. The goal of the training is to minimize this error. Computing the gradient of the loss function tells us in which direction to move the weights of the last hidden layer. By taking advantage of the chain rule of derivatives we can find each neuron's gradient and propagate backwards through all hidden layers. Once all gradients are found we can update the corresponding weights by subtracting the gradient multiplied by a predefined learning rate from the current value of each weight, a procedure known as gradient descent. This is done for each example in the training set over and over until the loss of the network converges or the training is stopped manually. The algorithm described above for calculating the gradients is known as backpropagation and it is the standard approach to optimize the weights of a network[10].
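In symbols, with loss L and learning rate \eta, each weight w is updated as

    w \leftarrow w - \eta \frac{\partial L}{\partial w}

and this update is repeated for every weight in every layer after each backward pass.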

Neural networks are scalable and adding more hidden layers and units is easy. Adding more units allows the network to describe more complex functions, but it also means that neural networks have a tendency to easily overfit the training data. Regularization techniques can be used to avoid overfitting. When dealing with neural networks, the most common regularization technique is dropout[12]. Dropout simply means that some neurons are dropped and not used during some iterations of the training phase. This forces other neurons to activate instead and is a way to prevent relatively few neurons from controlling the output. Dropout is only used during training; once the model has been trained, all neurons must be available.

2.3 Convolutional Neural Networks

Computer vision is a subfield of artificial intelligence often highly connected to machine learning and neural networks in particular. Computer vision deals with how computers can gain a high-level understanding of digital images and videos. It seeks to mimic the capabilities of the human visual system. Nowadays convolutional neural networks, CNNs, are the standard approach to many of the tasks within computer vision. Since their breakthrough in the last decade, convolutional neural networks have achieved groundbreaking results in computer vision tasks such as image classification and object detection.

CNNs are a type of deep, feed-forward neural network. The differences compared to ordinary neural networks are the convolutional layer and how the inputs are fed into the network. Suppose that we give an image as input to a neural network. In a regular neural network every pixel in an image can be seen as a feature and an input. This is computationally very heavy, since a relatively low quality image of size 600x600x3 will have more than one million inputs.

CNNs handle the inputs in a different way and make use of convolutional layers, which are far more effective for this type of input.

2.4 Architecture of a CNN

This section covers the commonly used layers in CNNs.

Figure 2.4: Typical architecture for a convolutional neural network

2.4.1 Convolutional Layer

Section 2.3 explained why standard neural networks are not optimal when working with images. Instead of using each pixel as an input to each neuron, we can use a convolutional layer and slide, or convolve, over the image, which is far more effective. A convolutional layer can be seen as a stack of filters that takes in a 3D volume and outputs another 3D volume of feature maps. Each filter in the layer has a corresponding feature map in the output. For example, let a filter, often also referred to as a kernel, be a 3x3 matrix. To produce a feature map we multiply this filter element-wise with the input at some starting position and sum up the products. The output of each such operation corresponds to one cell in the feature map. After each operation we move the filter with a fixed stride and perform the same operation until we have slid over the whole input and produced a complete feature map. An example of this procedure can be seen in figure 2.5. Since all inputs are convolved with the same filters, which means all inputs share the same weights, we use far fewer parameters than a regular neural network. This allows us to add more filters and make deeper networks. Through the training process, the filters in the network learn features that previously had to be hand-engineered.

Figure 2.5: Convolutional operation

In some cases when using convolutional layers it is important to keep the spatial dimensions of the input. If we convolve an input with dimensions 10x10 with a filter of size 3x3, the output would be a feature map of size 8x8. To keep the dimensions in the output the same as in the input, we can pad the sides of the input with zeros, a procedure known as 'zero-padding'. This keeps the spatial dimensions after the convolutional layer. See figure 2.6 for an example.

Figure 2.6: Convolutional operation with zero padding.
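To make the sliding-filter and zero-padding procedure concrete, the following Python sketch computes a single feature map for one filter (a hypothetical helper written for illustration, not the implementation used in the thesis):

    import numpy as np

    def conv2d_single(image, kernel, stride=1, pad=0):
        # image: (H, W) input; kernel: (k, k) filter; returns one feature map.
        if pad > 0:
            image = np.pad(image, pad, mode="constant")  # zero-padding
        h, w = image.shape
        k = kernel.shape[0]
        out_h = (h - k) // stride + 1
        out_w = (w - k) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
                out[i, j] = np.sum(patch * kernel)  # element-wise multiply and sum
        return out

    # A 10x10 input with a 3x3 filter gives an 8x8 feature map without padding,
    # and keeps the 10x10 shape with pad=1, as described above.
    x = np.random.rand(10, 10)
    k = np.random.rand(3, 3)
    print(conv2d_single(x, k).shape)         # (8, 8)
    print(conv2d_single(x, k, pad=1).shape)  # (10, 10)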

2.4.2 Pooling Layer

In between convolutional layers it is common to use a pooling layer. A pooling layer has the role of sub-sampling the input to a lower dimension without losing important information. By decreasing the dimensions we lower the number of computations and parameters in the network, while also controlling overfitting. The most common pooling operation is max pooling, where we take the largest number from a spatial region and discard the rest, see figure 2.7. There are other, nowadays less common, pooling procedures such as average pooling and sum pooling, but max pooling is by far the most used. The pooling layer is applied in the same way as a convolutional layer: we have a fixed size of the pooling window that we slide with a fixed stride across the input. Most common is a 2x2 window with a stride of 2. The output of such a layer has half the width and height of the input.

Figure 2.7: Max pooling
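A corresponding sketch of 2x2 max pooling with a stride of 2 (again a hypothetical helper, not thesis code):

    import numpy as np

    def max_pool2d(feature_map, size=2, stride=2):
        # Keep the largest value in each size x size window and discard the rest.
        h, w = feature_map.shape
        out = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                window = feature_map[i * stride:i * stride + size,
                                     j * stride:j * stride + size]
                out[i, j] = window.max()
        return out

    x = np.arange(16).reshape(4, 4)
    print(max_pool2d(x))  # 4x4 input -> 2x2 output: half the width and height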

For some convolutional networks, such as autoencoders or generative adversarial networks, other pooling procedures are used, like unpooling, which can be seen as an up-sampling and the opposite of regular pooling. An unpooling layer takes in a 3D volume and outputs a 3D volume with a larger width and height. There are different methods for how the upsampling can be done. In figure 2.8 an example of unpooling is shown with a stride of 2 and a kernel size of 2x2. In this case each cell from the smaller matrix is expanded to a 2x2 matrix, with the value from the original cell placed somewhere in the new matrix; the position in the new matrix depends on the method used. If the network has a corresponding max pooling layer, the positions of the maxima can be saved and used later in the unpooling layer.

Figure 2.8: Unpooling

2.4.3 Fully Connected Layers

The last layers of a CNN are typically fully connected. The output of the last convolutional layer is flattened so that it can be seen as a long vector and can then be connected to a fully connected layer. The role of the fully connected layers is to make sense of the previous convolutional stages and come up with a prediction. Normally one to three layers are stacked after each other, with the last layer having as many neurons as there are classes to predict. Each neuron is responsible for the prediction of its corresponding class. If the task instead is to detect an object, the output layer might have additional neurons responsible for predicting the size of the bounding box of the object.

2.4.4 Activation Functions

Activation functions were described in section 2.2. In CNNs there are a few activation functions that are more frequently used depending on where they are placed in the network. ReLU or leaky ReLU are the most used in the convolutional layers. Fully connected layers, except for the last layer, often use ReLU as the standard activation. The last layer, which controls the output of the network, has different activations depending on the task. CNNs used for classification use sigmoid functions for single-class classification. For multi-class classification the activation depends on the labels. If an image can belong to more than one class, sigmoid is also used, since it maps all outputs to the interval [0,1]. If an image only belongs to one class, the softmax activation is used. It is similar to a sigmoid function as it maps to the same interval, but it makes the outputs of all neurons in the layer add up to 1.

2.4.5 Batch Normalization

Introduced by Ioffe and Szegedy in 2015, batch normalization has become a popular operation[14]. It is placed between layers and takes the output of its previous layer and normalizes it with regard to the current batch mean and standard deviation before sending it to the next layer as input. Batch normalization is used since it allows a higher learning rate, makes the network less sensitive to weight initialization and works as a regularization operation.
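In the standard formulation from Ioffe and Szegedy[14], a batch of activations x_1..x_m is normalized with the batch mean \mu_B and variance \sigma_B^2 and then scaled and shifted by the learned parameters \gamma and \beta:

    \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta

where \epsilon is a small constant added for numerical stability.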

2.5 Transfer Learning

Training a large CNN from scratch can be costly in terms of time and computing power, and it requires a vast amount of data. For smaller projects that do not have a sufficient amount of data, transfer learning can be a solution. To solve this problem we make use of pre-trained networks and retrain, or fine-tune, them to fit the desired task. Training large networks on huge data sets takes multiple days with multiple high-end GPUs. Since the convolutional layers in the pre-trained network hopefully have learned to recognize useful high and low level features, we can make use of them. To do this we freeze the weights of the convolutional layers and only run backpropagation on the last fully connected layers that control the output. Thus, we only need to make one small modification to the last layer to fit the purpose of our task. For instance, COCO has 91 different categories, which means 91 neurons in the last layer; this needs to be changed to the number of categories in the current classification task. Since 2010 the ImageNet Large Scale Visual Recognition Challenge, ILSVRC, has been held every year[26]. This is a competition where researchers compete to achieve the highest accuracy on a given data set in different computer vision tasks. In 2012 Alex Krizhevsky et al.[16] won the image classification competition with their network, AlexNet, by achieving a 15% top-5 error rate, beating the runner-up by over 10%. This is seen as the breakthrough of deep convolutional networks. Since the release of AlexNet the competition has seen its record broken multiple times.
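The thesis used Google's pre-trained Inception network together with TensorFlow's retraining code (see section 3.3.1). The snippet below is only a sketch of the same freeze-and-retrain idea using the Keras API; the number of classes and the data pipeline are assumptions:

    import tensorflow as tf

    NUM_CLASSES = 4  # assumed: ISIS, swastika, NMR and 'Nothing'

    # Inception pre-trained on ImageNet, without its original output layer.
    base = tf.keras.applications.InceptionV3(
        weights="imagenet", include_top=False, pooling="avg",
        input_shape=(299, 299, 3))
    base.trainable = False  # freeze the convolutional layers

    # New last layer: one neuron per class, softmax since every image
    # belongs to exactly one class.
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
    model = tf.keras.Model(inputs=base.input, outputs=outputs)

    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(training_data, ...)  # only the new top layer's weights are updated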

In 2014 Google won the ILSVRC with the network GoogLeNet, a 22-layer deep network[29]. Their network achieved a top-5 error rate of 6.67%. What was groundbreaking was not only the error rate but also the improved utilization of computing resources. While having a deeper architecture and higher accuracy than AlexNet, it used twelve times fewer parameters. The main contribution to this was the so-called 'Inception' module, see figure 2.9. The inception module introduced a new type of architecture for convolutional blocks. Each module forms a small network of its own, with multiple convolutional layers with different kernel sizes that are concatenated into a final output.


Figure 2.9: The Inception module

Google has, since the introduction of the inception network, released multiple pre-trained versions and made them publicly available.

Microsoft Research entered ILSVRC in 2015 and won not only the classification competition but also the semantic segmentation and object detection competitions with their deep residual network, ResNet[11]. ResNet was by far the deepest network in the competition, with over a hundred layers. The interesting thing with ResNet was that it introduced the new concept of residual blocks. Residual blocks are built from blocks of convolutional layers. Nowadays there are different implementations of residual blocks, but the ones in the original network consisted of two convolutional layers where the input is added to the output after being fed through the convolutional layers, see figure 2.10. This type of architecture forces each block to learn something new from the previous layer. Residual blocks also help with the vanishing gradient problem, a common problem for deep networks, and allow for deeper networks.

Figure 2.10: An example of a residual block
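A minimal sketch of the original two-layer residual block described above, written with the Keras functional API (the filter count and input shape are placeholders):

    import tensorflow as tf
    from tensorflow.keras import layers

    def residual_block(x, filters=64):
        # Two 3x3 convolutions; the block input is added back to their output.
        shortcut = x
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        y = layers.Add()([shortcut, y])       # the skip connection
        return layers.Activation("relu")(y)

    inputs = tf.keras.Input(shape=(32, 32, 64))
    model = tf.keras.Model(inputs, residual_block(inputs))
    model.summary()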

2.5.1 Classification & Object Detection

Humans can, to some degree, distinguish objects from each other without problem, no matter the viewpoint or distance. We can easily see objects even if they are partially truncated. Although many breakthroughs have happened in the last couple of years, this is still a problem for computers.

Image classification is the task of predicting which class, or classes, an image belongs to. It does not matter where in the image the important pieces of data are; the whole image is classified as one. Object detection, on the other hand, has the task of localizing where an object is found in the image and giving it a bounding box surrounding the object. Nowadays both tasks are solved using convolutional neural networks, but with somewhat different approaches. Image classification networks commonly have multiple convolutional layers followed by pooling layers, and lastly some fully connected layers to make a prediction. These networks take a full image as input.

Object detection could be done in a similar fashion, but instead of using the whole image as input we can slide over the image and use a small patch of the image as input. This technique is known as 'sliding window'. But sliding over the image and feeding each patch to the network is computationally costly.

Making object detection a faster operation has been an active research area in the last years. R-CNN gained a lot of attention with an approach that used multiple networks to identify where an object could be found in the image[8]. Since the release of R-CNN, faster versions have been built[7][24]. Another approach that is commonly used today is the single shot multibox detector, SSD[18]. In contrast to R-CNN, it uses a single network, which makes it much faster. Multiple bounding box predictions that get progressively smaller are made by the last layers in the SSD network, where the final prediction is the union of all these predictions.

2.6 Generative Models

A very active research area in the field of deep learning lies within generative models. The aim of a generative model is to learn to generate data from some known distribution or to translate data from one domain to another. The most common generative models are Variational Autoencoders and Generative Adversarial Networks.

2.6.1 Generative Adversarial Networks

Generative adversarial networks, commonly referred to as GANs, were introduced in 2014 by Ian Goodfellow[9]. A simple GAN consists of two networks, a generator and a discriminator, competing against each other. The generator tries to generate data that approximates some real distribution, while the discriminator determines what is generated data and what is data from the real distribution. The generator can be seen as an art forger that tries to replicate an artist, and the discriminator as an art expert that separates real paintings from the fake ones. Both networks learn from each other's output and get better and better until, ultimately, the discriminator can no longer distinguish real data from fake.


Figure 2.11: Simple overview of a GAN. The generator, G, generates images from a random distribution, Z, and the discriminator tries to distinguish them from some real target domain.

2.6.2 GAN Architectures

The discriminator is often an ordinary CNN, with some architectural differences depending on the task. Usually the discriminator outputs a single value between 0 and 1 describing its confidence that the image comes from the real distribution. The generator in the first GAN consisted of a random distribution connected to fully connected layers. In 2015 Radford et al.[23] introduced the DCGAN, which consisted of a random distribution followed by convolutional layers and yielded better looking images. Using convolutional layers is nowadays the standard approach for generating images.

GANs are not only used for generating images but also to translate images between different domains. For these networks the generators are often replaced by an autoencoder, since the output images are not generated from scratch but are produced by modifying pre-existing images. An autoencoder is a CNN consisting of an encoder network and a decoder network. The encoder tries to represent the image in some latent space and the decoder tries to replicate the original image, or replicate it with some desired modification, from the latent space. It can be used for image compression or to colorize black and white images.

2.6.3 Training a GAN

GANs are notoriously hard to train and several papers have been published on this topic; up to now there is no standard recipe that works for all networks[2][27]. They are trained using backpropagation like all other neural networks. Since GANs are made up of multiple networks, they come with some problems. Tuning one network can be hard, which makes tuning multiple networks simultaneously even harder. It is important not to let one network become too good relative to the other. If the generator is trained too heavily, it might discover some weakness in the discriminator and produce the same output regardless of input. On the other hand, if the discriminator gets too good at distinguishing between real and fake, it won't produce any useful gradients for the generator and the networks will probably not converge. Finding the right balance when alternating the training between the networks can be very challenging.

The discriminator is often an ordinary CNN and is therefore trained as one. Its objective is to minimize the probability of a fake sample being predicted as real and maximize the probability of a real sample being predicted as real. For the generator the loss function varies a lot depending on its task. The network can have multiple different losses, sometimes combined. The typical loss for the generator in a GAN is often referred to as the adversarial loss and comes from the output of the discriminator, since the generator wants to maximize the probability of the generated data being labelled as real. For image translation tasks it is often desired to keep the structure of the input image but have it modified in some way. To do this another loss function has to be added. This could be the absolute pixel-wise difference between the input and output of the generator, or a cyclic loss as in CycleGAN. CycleGAN was introduced in 2017 by Jun-Yan Zhu et al.[34] and has gained a lot of attention in this field. The architecture behind this network consists of two generators and two discriminators, one for each domain. The idea behind it is that an image should ideally be able to be translated back and forth between the two domains. So if an image from one domain is fed through the first generator, and the output from that generator is fed to the second generator, the result should ideally be the same as the original image. The difference is what is called the cyclic loss.

2.7 Data sets

To become good at prediction, a CNN needs a lot of images of the desired object to learn from, and the more variation the better. Collecting well annotated data can be both costly and time consuming. Luckily there are many publicly available data sets.

Sometimes no data, or not enough data, containing the objects of interest is available to train a network and make it generalize well enough. There might not exist a public data set, and manually searching for and finding relevant images is time consuming and sometimes not even possible if no public images exist. What can be done in this case? We can try new ways to either replicate data, generate new data or augment the existing data.

2.7.1 Data Augmentation

The most common operation when your data supply is limited is to augment the data you already have in different ways. It has been shown that cropping the image at random places (without losing important information in the image), or adjusting the colors or brightness of the image, can improve the network training[30]. It prevents the network from overfitting the training data. This is usually a good approach regardless of data set, since it helps the network generalize better after it has been exposed to more variation.

2.7.2 Synthetic Data

Synthetic data refers to data that have not been obtained by direct measurement. It is instead data that have been created artificially.

2.7.2.1 Rendered 3D Models

Rendered 3D models from software such as Blender or 3ds Max can today be indistinguishable from real photos for a human. We can take advantage of this when collecting data. Models of desired objects can be modelled and rendered, and then either cut into an ordinary image or the whole rendered image can be used as a standalone image. These images can then be used as synthetic data. An advantage of working with rendering software is the ability to automatically get information about the positions of objects in the picture. This makes labelling bounding boxes and classifying images an easier and less time consuming task.

2.7.2.2 Cut & Paste Objects

Another way to expand a data set with synthetic data is to cut the desired objects out of an image containing them and paste them onto other images. With this technique we get the same advantage with image labelling as with rendered images, once all desired objects have been cut out.

2.8 Evaluation Metrics

To evaluate how well a model works, we need some kind of metrics. In machine learning, when working with classification algorithms, we often use accuracy, recall and precision. When defining those metrics we use the following terms:

• True Positive (TP) - An image correctly predicted as belonging to a specific class.
• True Negative (TN) - An image correctly predicted as not belonging to the specific class.
• False Positive (FP) - An image wrongly predicted as belonging to the specific class.
• False Negative (FN) - An image wrongly predicted as not belonging to the specific class.

Accuracy is defined as the number of correctly classified data points divided by the total number of data points:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}   (2.2)

Precision is the number of correct predictions of a class divided by the total number of predictions for that class. A high precision value indicates a low number of false positives. Precision is defined as:

Precision = \frac{TP}{TP + FP}   (2.3)

Recall is the fraction of correct predictions for a class divided by the total number of instances of that class. A high recall value indicates a low number of false negatives. Recall is defined as:

Recall = \frac{TP}{TP + FN}   (2.4)

A high value for precision or recall individually is not equivalent to a good score, but combined they give a good indication of how well a model has performed.
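A small Python helper computing the three metrics from the counts above (a sketch, not the evaluation code used in the thesis):

    def classification_metrics(tp, tn, fp, fn):
        # Accuracy, precision and recall as defined in equations 2.2-2.4.
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return accuracy, precision, recall

    print(classification_metrics(tp=45, tn=40, fp=5, fn=10))  # (0.85, 0.9, 0.818...)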

2.9 Related Work

2.9.1 Synthetic Data

The cost of computational power and storage has decreased dramatically in the last decades, which has paved the way for CNNs to train on large amounts of data. The problem, however, is the availability of large, well annotated data sets. Previously, researchers have themselves been the ones putting data sets together and manually annotating the data for their task. This is time consuming work, and researchers have been looking for better ways to gather well annotated data sets. Some have been using Amazon's Mechanical Turk to crowd-source tasks such as labelling data. This, however, might not be a perfect solution, since data sets often require some kind of expertise to label the data correctly. Researchers have also been looking into using synthetic data as a substitute for real data. Movshovitz-Attias et al. used 3D models of cars rendered in different settings that they pasted onto random images from the PASCAL data set[21]. They used this method to train a model to estimate the viewpoint, and showed that combining rendered images with a small amount of real data improved the accuracy of their model. Dwibedi et al. looked into the possibility of reusing objects from images[4]. By cutting out desired kitchen objects from real images, pasting them onto backgrounds with a kitchen environment and combining them with a small amount of real data, they showed that they could beat models trained exclusively on real images.

2.9.2 Image Translation and Style Transfer

Gatys et al.[6] showed in 2015 that, using features learned by the convolutional layers of a CNN, it is possible to change the style of an image. In their paper they show examples where regular photos are translated to look like paintings. With the gained popularity of GANs, new techniques for image translation have appeared.

Isola et al.[15] and Liu et al.[17] showed promising results on image translation tasks. Isola et al. used paired images in different domains to, for example, translate black and white images to colored ones, while Liu et al., like CycleGAN, used unpaired images and showed how images could be translated between domains.

A paper from Apple by Shrivastava et al.[28] combined synthetic data and image translation. They used what they call a refiner network, which took a rendered image as input and tried to add realism to it. They had a discriminator whose mission was to separate the refined images from real images. They showed that by using this technique for generating training data, they were able to achieve higher accuracy than by using exclusively real images.

3 Method

This chapter walks through how the tests of the thesis work were carried out and which techniques were used.

3.1 Frameworks, Software & Hardware

All tests were carried out on a computer with Ubuntu 16.04 and an Nvidia GeForce GTX 1080 Ti GPU. All scripts were written in Python and all neural networks were taken from the TensorFlow library. Blender was used for 3D modelling and rendering of images and videos. Gimp was used for image processing.

3.2 Data sets

The tests in this thesis focused on trying to detect certain symbols in images. The symbols of interest were the swastika, the ISIS flag and the flag of the Nordic Resistance Movement. No public data set containing images with these symbols was available. For the classification tests, a category with images labelled as 'Nothing' was used. These images were randomly picked from the COCO data set.

3.2.1 Real Images

A data set consisting of real images with four different classes was collected. Images containing the objects to be detected were found by searching for keywords on Google Images. Regular images containing no objects of interest were taken from the COCO data set. A total of 412 real images with desired objects were collected. Among these images, 172 contained the ISIS flag, 97 contained the swastika symbol and 143 contained the flag of the Nordic Resistance Movement. 50 images per category were used as a test set and the rest were used for training. No validation set was constructed, since removing more images from the already quite small data set would make the results on the test set less reliable. The same test set was used for all tests.

Figure 3.1: Example of real images containing desired objects — (a) ISIS, (b) Swastika, (c) Nordic Resistance Movement.


Figure 3.2: Example of images labelled as ’Nothing’. The images are taken from the COCO data set.

3.2.2 3D Rendered Images

A 3D model of a flag was modelled in Blender. To make the flag look like it was blowing in the wind, a physics simulation was added. The model was rendered with three different textures for each category. Each rendering output 100 images of a flag on a transparent background. These images were then pasted onto regular images from the COCO data set. All images of flags were randomly cropped before being added to another image. A Python script was written to be able to generate a desired number of images easily.
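A sketch of what such a compositing script could look like using Pillow: a rendered flag with a transparent background is randomly cropped, pasted onto a COCO image at a random position, and the paste position is saved as a bounding box. The file names, the exact cropping and the label format are assumptions, not the actual script:

    import random
    from PIL import Image

    def composite(flag_path, background_path, out_image, out_label):
        flag = Image.open(flag_path).convert("RGBA")
        background = Image.open(background_path).convert("RGB")

        # Random crop of the flag before pasting, as described above.
        w, h = flag.size
        box = (random.randint(0, w // 4), random.randint(0, h // 4),
               w - random.randint(0, w // 4), h - random.randint(0, h // 4))
        flag = flag.crop(box)

        # Paste at a random position; the alpha channel acts as the paste mask.
        fw, fh = flag.size
        x = random.randint(0, max(background.width - fw, 0))
        y = random.randint(0, max(background.height - fh, 0))
        background.paste(flag, (x, y), flag)
        background.save(out_image)

        # The paste position directly gives the bounding box label.
        with open(out_label, "w") as f:
            f.write(f"{x} {y} {x + fw} {y + fh}\n")

    composite("flag_0001.png", "coco_0001.jpg",
              "synthetic_0001.jpg", "synthetic_0001.txt")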

Figure 3.3: Example of the different flags rendered in Blender. Three different textures were used for each flag.

3.2.3 Cut & Paste

For 10 images from each category in the real-image training set, the objects of interest were cut out and saved as separate images with transparent backgrounds using Gimp. The same augmentation procedure as for the images rendered from a 3D model was then applied.

3.2.4 Grand Theft Auto V

For one of the tests a large rendered data set was needed. Since no such data set was available, one had to be made. I decided to take images from Grand Theft Auto V, since it has in recent years been praised for its realistic graphics and it takes place in a realistic environment containing only real-life objects. A data set containing 14,966 still images from the campaign mode and video scenes was collected by grabbing three frames per second from one hour of game play.


Figure 3.5: Images taken from game play of Grand Theft Auto V.

3.2.5 Labelling

All images that were generated were automatically labelled and had a corresponding text file containing information about the bounding box for the object in the image. The real images that were collected were labelled and annotated by hand.

3.3 Tests

This section will walk through the tests conducted in this thesis.

3.3.1 Test 1: Classification

Since I was working with a small data set, I decided to use a pre-trained network. TensorFlow is developed by Google, and they have released their pre-trained Inception network together with code written for transfer learning. Because of this availability, and since Inception is known to be one of the state-of-the-art models, it was chosen for this test.

First, a test was carried out to see how well rendered images performed depending on background and augmentations. Four data sets were composed and compared: rendered objects on a black background, rendered objects on random background noise, rendered objects with a random image from the COCO data set as background, and the same as the previous with some random augmentations. The Inception net was trained on the data sets for a thousand steps, one step being equivalent to one batch's forward propagation through the network, which was enough steps for the network to reach a high accuracy on the training images. No modifications were made to the default configurations of the network. The next test was more thorough, using more data sets to evaluate how well synthetic data works as training data compared to real images. Multiple data sets of different sizes were constructed using both rendered images and objects cut out from real images. The data sets that did not perform well in the previous test were discarded. A set of real images was used as a benchmark. The complete list of data sets for this test is shown in table 3.1. This test used the same configuration as the previous one; no modifications were made to the default settings.

Table 3.1: Data sets used for the classification test.

Data sets                          Synthetic Images / Category
Rendered Images                    300, 600, 900, 1500
Cut & Paste                        300, 600, 900, 1500
Rendered + Cut & Paste             300, 600
Real Images + Rendered             150, 300, 600
Real Images + Cut & Paste          150, 300, 600
Real + Rendered + Cut & Paste      300, 600
Real Images                        -

3.3.2 Test 2: Object Detection

TensorFlow comes with an Object Detection API that is easy to adjust for one's own purposes. The API has multiple pre-trained networks, but I decided to use Inception again for consistency. As explained in the theory chapter, the network for classification differs from the network for detection. I decided to go with a single shot multibox detector network, since it is faster than R-CNN while achieving similar results.

The data sets scoring low in the previous tests were discarded and not taken into account this time, since retraining the network for each data set was considered too time consuming. The data sets used for the test are listed in the table below.

Table 3.2: Data sets used in the Object Detection test.

Data sets                          Synthetic Images / Category
Rendered Images                    900, 1500
Cut & Paste                        900, 1500
Rendered + Cut & Paste             300, 600
Real Images + Rendered             300, 600
Real Images + Cut & Paste          300, 600
Real + Rendered + Cut & Paste      300, 600
Real Images                        -

3.3.3 Test: Adding Realism to Rendered Images

Comparing images from GTA V and real images, it is easy to spot which ones are real and which ones are rendered. To add realism to rendered images, a generative adversarial network was set up. The architecture was set up as a CycleGAN with two generators and two discriminators. Each generator consisted of two down-sampling blocks with two convolutional layers and one max pooling layer, three residual blocks, and two up-sampling blocks that mirrored the down-sampling. The down-sampling layers started with 32 filters, which doubled for each layer. The number of filters was kept the same throughout the residual blocks. The last layer was a convolutional layer with three filters to make the output have the correct image dimensions. All convolutional layers used the ReLU function as activation, except the last layer, which used the sigmoid function to map the output image between zero and one. All layers except the last layer used batch normalization before the activation function.

The discriminators consisted of five convolutional layers, with the number of filters starting at 64 and doubling for each layer. The last layer produced a 16x16 feature map with one channel, containing probabilities of the image being real. All activation functions were ReLU functions, except for the last layer, which used the sigmoid function. All convolutional layers in the networks had a filter size of 3 and a stride of 1. All inputs to convolutional layers were zero-padded to keep the dimensions after the convolutional operation.

The loss function for the discriminators was defined as:

L_D = L_{real} + L_{fake}   (3.1)
L_{real} = 1 - D(y)   (3.2)
L_{fake} = D(G(x))   (3.3)

where D is the discriminator, G is the generator, x denotes a fake image and y a real image. The loss for the generators was defined as:

L_G = L_{adv} + \lambda L_{cyclic}   (3.4)
L_{cyclic} = |x - G_1(G_2(x))|   (3.5)
L_{adv} = 1 - D(x)   (3.6)

where the cyclic loss is the same as described in section 2.6.3.
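Given the discriminator outputs, the losses in equations 3.1-3.6 can be computed roughly as in the sketch below (TensorFlow tensors stand in for batches of images; this is an interpretation of the formulas above, not the original training code):

    import tensorflow as tf

    def discriminator_loss(d_real, d_fake):
        # L_D = L_real + L_fake = (1 - D(y)) + D(G(x)), equations 3.1-3.3.
        return tf.reduce_mean(1.0 - d_real) + tf.reduce_mean(d_fake)

    def generator_loss(d_fake, x, x_cycled, lam=10.0):
        # L_G = L_adv + lambda * L_cyclic, equations 3.4-3.6.
        # d_fake is the discriminator output on the translated image,
        # x_cycled is G1(G2(x)), i.e. the image translated there and back.
        adv = tf.reduce_mean(1.0 - d_fake)
        cyclic = tf.reduce_mean(tf.abs(x - x_cycled))
        return adv + lam * cyclic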

The data set used for this test consisted of the 14,966 images from Grand Theft Auto V and 15,000 random images from COCO. The GTA V images were used to investigate the possibility of adding realism to rendered images; in case of success, new 3D scenes containing the objects from the other tests would be modelled and rendered.

3.3.4 Test: Flags on Random Generated Background

Inspired by the paper from Apple, see section 2.9, where a refiner network was used to translate rendered images into realistic images, this test tried to replicate their approach with some modifications. In their test they had two data sets of pictures of eyes, one containing real images and another containing rendered ones. Since I did not have any completely rendered images, I took a slightly different approach. Rendered images containing symbols on a transparent background were pasted onto a randomly generated image. To know the location of the symbol, a binary mask was retrieved from the rendered image. The networks in this test were one generator, an autoencoder, and one discriminator. The loss for the generator was defined as:

L_G = L_{adv} + \lambda L_{reg}   (3.7)
L_{adv} = 1 - D(x)   (3.8)
L_{reg} = |x - G(x)| \odot B   (3.9)

where B is the binary mask of the symbol, multiplied element-wise with the absolute difference between the input image and the generated image. This forced the generator to keep the structure of the symbols, while the background was not affected by this term and was only linked to the adversarial loss. The discriminator in this test used the same loss function as in the test above.
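For equation 3.9, the binary mask restricts the L1 difference to the symbol pixels. A sketch of that computation (tensor names assumed):

    import tensorflow as tf

    def regularization_loss(x, generated, mask):
        # |x - G(x)| multiplied element-wise with the binary mask B, so only
        # the symbol region is forced to stay close to the input image.
        return tf.reduce_mean(tf.abs(x - generated) * mask)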

The generator had a similar architecture to the generators in the previous test. The only difference was that this network used six residual blocks in the middle and no max pooling layers; instead, the stride of the convolutional layers was set to 2, which works as a down-sampling method. The discriminator had the same architecture as in the previous test. The learning rate was the same for both networks and was kept at 0.0001, the batch size was set to 16 and lambda was set to 10.

3.4 Evaluation Metrics

I used accuracy to evaluate how well the tests involving classification and object detection went. For the object detection test, the highest scoring object found in an image determined the class it belonged to. An image was labelled as 'Nothing' if no object could be found in it. Recall and precision were used to analyze the results further. When finding threatening objects in images, it is important to find all images containing such objects rather than wrongly classify an image containing an object as nothing, even though these images might have to be filtered out manually afterwards, which could be tedious. Therefore, in this project a high recall value for images containing objects is important.

4 Results of the Tests

4.1 Test: Classification

The results of the first test, with rendered flags pasted onto different backgrounds, are shown in table 4.1. As shown in the table, pasting the flags onto random images from the COCO data set with augmentations gave the highest accuracy. Hence, images with augmentations were used in later experiments.

Table 4.1: Accuracy for rendered flags on different backgrounds.

                              Images / Category
Data set                      300        600        900
No background                 26.50%     -          -
Random background             26%        25.50%     -
COCO background               71%        70%        70%
COCO background w/ aug.       69.50%     73%        70%

The accuracy scores of the second experiment are shown in table 4.2. Precision and recall scores for the data sets that achieved the highest accuracy are shown in table 4.3. Real images achieved the highest score for an individual data set, but when synthetic data of either kind was added the accuracy was boosted by a couple of percentage points, showing that synthetic data can be used as a complement to real data.


Table 4.2: Accuracy scores for the classification experiment.

Data set          Images / Category   Accuracy %
3D                300                 69.50
3D                600                 70
3D                900                 73
Cut               300                 57.50
Cut               600                 55
Cut               900                 64
3D + Cut          300                 72
3D + Cut          600                 69.50
Real + 3D         150                 88
Real + 3D         300                 90
Real + 3D         600                 90
Real + Cut        150                 86
Real + Cut        300                 91.50
Real + Cut        600                 90
Real + 3D + Cut   300                 84.50
Real + 3D + Cut   600                 86
Real              -                   86

Table 4.3: Recall and precision scores for the classification experiment

                         Precision (%)                     Recall (%)
Data set                 ISIS    Nazi    NMR     Nothing   ISIS   Nazi   NMR   Nothing
Rendered                 78.72   71.15   92.59   52.16     74     74     50    92
Cut & Paste              82.86   96.77   90      43.86     58     60     36    100
Real + Rendered          90.74   89.13   89.8    88.24     98     82     88    90
Real + Cut & Paste       87.04   100     95.83   85.96     94     82     92    98
Rendered + Cut & Paste   70.8    83.72   88      60        76     72     44    96
Real + Rendered + Cut    88.46   100     81.13   82.46     82     76     84    94
Real Image               88.68   91.89   79.31   86.54     94     68     92    90

4.2 Test: Object Detection

The overall scores for all data sets were higher when using the object detection API. All accuracy scores can be seen in table 4.4. The precision and recall scores for the data sets with the highest accuracy are displayed in table 4.5.


Table 4.4: Accuracy scores for the object detection experiment.

Data set          Synthetic Images / Category   Accuracy %
3D                900                           79
3D                1500                          73.87
Cut               900                           57.29
Cut               1500                          52.26
3D + Cut          300                           80.9
3D + Cut          600                           82.41
Real + 3D         300                           93.97
Real + 3D         600                           91.46
Real + Cut        300                           93.97
Real + Cut        600                           91.96
Real + 3D + Cut   300                           88.94
Real + 3D + Cut   600                           86
Real              -                             89.95

Table 4.5: Recall and precision scores for the object detection experiment.

                         Precision (%)                     Recall (%)
Data set                 ISIS    Nazi    NMR     Nothing   ISIS   Nazi    NMR   Nothing
Rendered                 100     100     91      56        70     63      84    98
Cut & Paste              94.74   95.65   39.56   42.55     72     45      72    40
Real + Rendered          98      95.65   97.87   85.45     100    89      92    94
Real + Cut & Paste       90.91   97.87   97.96   94.08     100    93.88   96    89
Rendered + Cut & Paste   91.67   97.62   94.29   62.16     88     83.67   66    92
Real + Rendered + Cut    90.38   100     97.73   90.47     94     85.71   86    90
Real Image               93.88   93.48   97.83   77.59     92     88      90    90

4.3 Test: Adding Realism to Generated Images

The results of the experiment can be seen in figure 4.1 and figure 4.2. The images generated from fake to real showed no clear sign of added realism. The images translated from real to fake showed visible signs of having adapted to the other domain: they became visibly smoother but often ended up with smudges of irrelevant colors. The experiment did not achieve a result promising enough to motivate another experiment with new 3D scenes containing the desired symbols.


Figure 4.1: Images before and after they’ve gone through the ’fake to real’ generator.


Figure 4.2: Images before and after they’ve gone through the ’real to fake’ generator.

4.4 Test: Flags on Randomly Generated Background

The results for the images with backgrounds generated by a generative adversarial network can be seen in table 4.6. Examples of how the images could look after going through the generator are shown in figure 4.3. The images achieved worse accuracy than rendered flags pasted onto real images, which indicates that the generator failed to add realism to the flags and to generate a better background. The network never converged, which is another indicator of failure. The discriminator ended up predicting both real and fake images as real, which could be a sign that the generator became too powerful with respect to the discriminator.



Figure 4.3: Flags on backgrounds generated by a generative adversarial network.

Table 4.6: Accuracy for images generated by GAN.

Data set           Images / Category   Without real images   With real images
Generated images   300                 54.27%                88.44%
Generated images   600                 61.81%                79.90%


5 Discussion

The results show that a pre-trained network does not need a large amount of real images to learn to generalize to new domains. They also show that when little data exists to begin with, synthetic data can be a good complement.

The results indicate that both rendered flags and flags cut out from images, pasted onto random background images, can be used as synthetic data. Neither of them scored a high accuracy when the models were trained on them exclusively, but both helped boost the accuracy when combined with real images and functioned as a complement to real data. Each approach has some advantages over the other. For rendered images, once a model of the desired object has been created it is very easy to render it in different ways: changing the texture, lighting, wind speed, rotation and material is easily done and gives an endless number of possible rendered images. Without access to modelling software, or without modelling skills, cutting out objects from already collected images might be a better option. The advantage of cut-out objects is that they already come from the same domain that we wish to train our model on. Cutting out objects does not give as much control as the rendering approach, and the only augmentations that can be applied are the same as for regular images, for example rotation, scaling or adjusting brightness (a small illustrative sketch is given below). It also requires a lot of time to cut out objects from images cleanly without either losing relevant information or including unwanted background, and the task is hard to automate.

Due to time restrictions, none of the experiments was carried out in great depth. The data sets used for the experiments could, for example, have been expanded if time had not been a factor. The amount of real image data was not particularly large, and a bigger data set could have been obtained by searching for videos and cutting out relevant parts. The restrictions in size made the test set for the experiments quite small, and a small test set might not show how well a model truly generalizes. If the experiments were to be conducted again, a larger test set is advised.
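To make the cut-and-paste augmentation step mentioned above concrete, the following is a minimal sketch using Pillow. The function name and the rotation, scale and brightness ranges are illustrative assumptions, not the settings used in the experiments.

import random
from PIL import Image, ImageEnhance

def paste_with_augmentations(obj_rgba, background):
    """Paste a cut-out (or rendered) RGBA object onto a background image
    with simple augmentations: random rotation, scaling and brightness."""
    obj = obj_rgba.rotate(random.uniform(-30, 30), expand=True)   # rotation
    scale = random.uniform(0.5, 1.0)                              # scaling
    obj = obj.resize((max(1, int(obj.width * scale)),
                      max(1, int(obj.height * scale))))

    bg = background.convert("RGB").copy()
    x = random.randint(0, max(0, bg.width - obj.width))
    y = random.randint(0, max(0, bg.height - obj.height))
    bg.paste(obj, (x, y), mask=obj)        # the alpha channel acts as the paste mask

    return ImageEnhance.Brightness(bg).enhance(random.uniform(0.7, 1.3))  # brightness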

Time restrictions also play a part in how reliable the results obtained in the experiments are. All experiments have a random factor to them: training images were pasted on random backgrounds with random augmentations, and networks were initialized with random weights. Because of this, all experiments should be carried out multiple times to verify that the results remain the same. In this thesis I did not have the time to validate the results with multiple runs of every experiment.

The small amount of data made me decide not to use a validation set, since there was basically not enough data for it. Having a validation set would have made training the networks much easier, not only for knowing when a model had gone through enough iterations but also for tuning hyperparameters.

Trying to add realism to rendered images did not work especially well. The generator in the CycleGAN-inspired network that concentrated on generating real images did not add any visual effects indicating that the rendered images had become more realistic. The other generator, on the other hand, did make clear visual changes to the images, making them smoother and more painting-like. This could be because it is easier to translate images from the real domain to the rendered one. An experiment for future work could be to translate real images to rendered ones and then evaluate a model trained purely on rendered images.

The generative adversarial networks proved to be very difficult to train without constant supervision. All hyperparameters had to be tuned by trial and error, which wasted a lot of time. The generators and discriminators easily became too powerful if one was trained with more iterations than the other, and a good balance was hard to find. The experiments with generative adversarial networks did not turn out well and were an overall disappointment.

5.1 Answer to Research Questions

• With limited data available, what efforts can be taken to boost the results of a convolutional neural network?

This thesis has shown that using synthetic data, such as rendered symbols or symbols cut out from real images and pasted onto real images, can be one way to go. Using synthetic images boosted the results by several percentage points in more than one experiment. Augmenting the limited existing data is another option, but it lacks flexibility because of the limited set of augmentations that can be applied to an image.

• Is it possible to train a convolutional neural network using only synthetic training data to detect desired objects?

In the experiments in this thesis, networks trained on fully synthetic data sets achieved an accuracy above 80%. While this is still worse than a small data set containing real images, it is a promising result. This thesis focused more on the possibility of using synthetic data than on achieving a great result with one method. If more effort were put into rendering symbols with more variety, I believe even better results could be achieved. If no data at all exists for the wanted task, training a network on only synthetic data is absolutely an option, as it beats random guessing by a large margin.

• Can generative adversarial networks, GAN, be used for the production of synthetic data?

Generative adversarial models have shown promising results for image generation in recent years, but in this thesis they did not achieve any remarkable results. Rendered symbols on random backgrounds that had gone through the generator, in the hope of adding realistic features, performed worse than rendered symbols pasted onto random real images. The experiment that used fully rendered 3D scenes, with the goal of making them appear more realistic, did not achieve any visually appealing results. Generative adversarial networks are still a relatively new research area and will probably achieve great results in the future, but in this thesis the experiments using them were considered a failure.


6 Conclusion

This thesis investigated the possibility of using synthetic data as training data for a convolutional neural network. Synthetic data might not yet be good enough for a model to train on exclusively, but it can be used as a tool when no other data exists. This thesis showed that different approaches are feasible, both rendering new objects and reusing objects from existing data. Combined with real images, synthetic data can help boost the results. The generative adversarial approach did not yield any good results, but generative adversarial networks are at a relatively early stage of research and may be a solution in the future.


