
IT 17 089

Degree project, 30 credits
January 2018

Classification of offensive game-emblem drawings using CNN (convolutional neural networks) and transfer learning

John Tunell


Abstract

Classification of offensive game-emblem drawings using CNN (convolutional neural networks) and transfer learning

John Tunell

Convolutional neural networks (CNNs) have become an important tool for solving many of today's computer vision tasks. The technique is, however, costly, and training a network from scratch requires both a large dataset and adequate hardware. A solution to these shortcomings is to instead use a pre-trained network, an approach called transfer learning. Several studies have shown promising results applying transfer learning, but the technique requires further study. This thesis explores the capabilities of transfer learning when applied to the task of filtering out offensive cartoon drawings in the game Battlefield 1. GoogLeNet was pre-trained on ImageNet, and the last layers were then fine-tuned towards the target task and domain. The model achieved an accuracy of 96.71% when evaluated on the binary classification task of predicting non-offensive or swastika/penis content in Battlefield "emblems". The results indicate that a CNN trained on ImageNet is applicable even when the target domain is very different from the pre-trained network's domain.

Printed by: Reprocentralen ITC, IT 17 089


Acknowledgement

I would like to thank my supervisor, Håkan Rosenhorn, for all the advice and guidance given throughout the project. He openly shared his experience from a career as a software developer, which I’m very grateful for. The feedback and teaching sessions have given me valuable preparation for a career as a software developer. Our lunch break jogs both improved my fitness and gave me insights regarding the software development process. I will also miss working with all the other friendly co-workers at Uprise. I also want to thank my reviewer, Anders Brun. The meetings we had helped me stay focused on the research task and made sure I was going in the right direction.


Contents

1 Introduction
  1.1 Thesis Structure
2 Background
  2.1 Definitions and terminology
  2.2 Feedforward Networks
  2.3 Datasets
    2.3.1 Training set
    2.3.2 Validation set
    2.3.3 Test set
    2.3.4 Test set distribution
  2.4 Capacity, Overfitting and Underfitting
  2.5 Convolutional Neural Networks
    2.5.1 Convolutional layer
    2.5.2 Pooling layer
    2.5.3 Fully connected layer
3 Related work
  3.1 Transfer learning
  3.2 Research exploring transfer learning
  3.3 Research applying transfer learning
  3.4 GoogLeNet
    3.4.1 The inception module
4 Method
  4.1 Emblems in Battlefield
    4.1.1 How players create and use emblems in Battlefield 1 and Battlefield 4
    4.1.2 How offensive emblems are handled in the Battlefield games
  4.2 Method to approach the problem
    4.2.1 Step 1 - Determine goals and measurements
    4.2.2 Step 2 - Establish working end-to-end baseline model
    4.2.3 Step 3 - Determine bottlenecks in performance
    4.2.4 Step 4 - Repeatedly make incremental changes
  4.3 Additional guidelines when applying machine learning
    4.3.1 The process of knowing what to do next
    4.3.2 Create a common data warehouse
    4.3.3 Determine human-level performance on the task
    4.3.4 Plot performance on increasing dataset size and visualize worst errors
5 Experimental setup
  5.1 Software and hardware used during experiments
  5.2 Preprocessing
    5.2.1 Dataset augmentation
    5.2.2 Contrast normalization
  5.3 Dataset generation
  5.4 Feature extraction
  5.5 Machine learning framework - Tensorflow
6 Results
  6.1 Results iteration 1
    6.1.1 Step 1 - Determine goals and measurements
    6.1.2 Dataset extraction
    6.1.3 Step 2 - Establish working end-to-end pipeline and baseline model
    6.1.4 Performance benchmarks for first model
    6.1.5 Step 3 - Determine bottlenecks in performance
  6.2 Results iteration 2
    6.2.1 Step 4 - Repeatedly make incremental changes
    6.2.2 Performance benchmarks for the second model
  6.3 Results iteration 3
    6.3.1 Data augmentation experiments
    6.3.2 Final performance comparison between all models
    6.3.3 Performance on production test set
7 Discussion
  7.1 Future work
8 Conclusion
Bibliography

1 Introduction

Computer vision and object classification have in the last couple of years been dramatically improved by advances in deep learning and convolutional neural networks (CNNs) [18]. In the ImageNet competition 2012, the most reputable competition within computer vision, a group of researchers from the University of Toronto entered with a deep CNN algorithm called SuperVision. The team won the competition with an error rate of 16.4 percent, while the second best entry had an error rate of 26.2 percent [34]. The results were revolutionary, and the advances in computer vision driven by CNNs have been acknowledged as one of the top 10 breakthroughs of 2013 [26].

A CNN's main power lies in its deep structure, which allows the network to create discriminating features whose level of abstraction increases for each layer [33, 36, 9, 32]. Advances in hardware, larger datasets and more complex models are key factors behind the recent success of CNNs. Further advances in the field are, however, not only driven by increasing complexity. GoogLeNet, Google's winning submission to ImageNet 2014, used 12 times fewer parameters and produced significantly more accurate results than previous winners [17]. Recent research has started to investigate not only ways to improve the performance of CNNs measured in error rate, but also the performance measured in cost-effectiveness. In order for models to be put to real-world use, metrics like computational budget, memory consumption and dataset size requirements need to be considered [33].

Training a deep CNN from scratch can be both costly and complicated [10]. First, a large labeled dataset is required for training. In many domains, the amount of labeled data is limited, and collecting such a dataset might require experts to annotate images. Second, deep CNN training requires extensive memory and computational resources. Lacking adequate hardware makes the training process extremely time-consuming. Lastly, to avoid overfitting and ensure convergence, the training process needs to be repeated iteratively, trying out different parameters in the model [34]. This requires experience and also makes the process even more time-consuming.

To lower the cost of training CNNs, a promising alternative has emerged through research. Instead of training a CNN from scratch, an already trained CNN is used, one that has been trained on an existing large dataset from another domain. The CNN is then fine-tuned towards the target domain or task. This concept is called transfer learning. Using transfer learning, computer vision researchers have been able to significantly improve upon state-of-the-art performance on computer vision tasks within a large set of domains [30, 3, 24]. Yosinski et al. [35] emphasize the importance of further studies on the exact nature and extent to which transfer learning can be applied.

This thesis project is part of a computer vision internship at the game studio Uprise. Uprise is a sister studio of Dice, and owned by the global video game company Electronic Arts (EA). In the game Battlefield 1, players can draw what Uprise calls "emblems". Emblems are images that the user can bind to their profile and also display on their weapons and vehicles inside the game. Emblems are not allowed to contain offensive material. If they do, players are able to report them, and EA must handle the reported emblems in due time, often manually. Uprise would like to improve this process. During this thesis project, deep learning methods are evaluated on the task of filtering out these offensive emblems.

Problem formulation

This report sets out to answer the following problem formulation:

How well does a CNN perform on the task of classifying offensive drawings, created by players of the game Battlefield 1, when pretrained on ImageNet and fine-tuned on the target dataset?


1.1 Thesis Structure

Chapter 2

In the background chapter, essential machine learning concepts are introduced. Dataset partitioning strategies and the multilayer perceptron topology are explained. Furthermore, the chapter gives an introduction to convolutional neural networks.

Chapter 3

The related work chapter summarizes the body of research that has been done on transfer learning and CNNs. The chapter ends by describing the GoogLeNet architecture and its inception module.

Chapter 4

The Battlefield emblem system is explained in the method chapter, along with guidelines used to approach the machine learning problem.

Chapter 5

The preprocessing of the dataset and the augmentation techniques are described in the experimental setup chapter. The chapter also explains how the CNN was used as a feature extractor, and ends with a short introduction to the machine learning framework TensorFlow.

Chapter 6

The thesis work was divided into three iterations, each given its own section in the results chapter, along with performance benchmarks as the thesis work proceeded. The results are analyzed and discussed together with the presentation of the benchmarks, to make the reasoning easier to follow.

Chapter 7

The results are further discussed in the discussion chapter. Future work is also covered in the chapter.

Chapter 8

The final conclusion is given in the conclusion chapter.

2 Background

The thesis work relies heavily on research within the field of machine learning, deep learning, and convolutional neural networks. This chapter will present an introduction to terminology and concepts used during the study.

2.1 Definitions and terminology

This section introduces terminology that will be used in the thesis. The task of detecting spam or non-spam in emails will be used to illustrate the definitions. The section is based on the definitions presented by Mohri et al. [20].

• Examples - The instances in the dataset. The examples are usually the rows in a matrix or database. In our spam detection problem, an email would correspond to an example in our dataset. Examples are used to train and evaluate the model [20].

• Features - The set of attributes that are associated with an example [20]. The attributes are often represented as a vector, which corresponds to the columns in a matrix or database. The name of the sender, the presence of certain keywords in the message, the message length etc. would be considered features in the email example.

• Labels - The category or class value assigned to an example. An example email would have either the label spam or non-spam. When predicting a discrete value, the task is called classification. If the target value is continuous, the task of predicting the output is called regression.

• Hyperparameters - A model's configuration parameters are called hyperparameters. This can for example be the number of iterations we want the model to train on the dataset, or the learning rate of the model. Hyperparameters are not to be confused with the parameters of a neural network, also called weights, which are learned through backpropagation.

2.2 Feedforward Networks

The most essential parts of a deep learning model are the feedforward neural networks, also called multilayer perceptrons (MLPs) or artificial neural networks. Neural networks are inspired by the brain's information processing network, built up of neurons. Neurons are connected to each other in a large signaling network. Every neuron has multiple incoming connections. When a neuron receives incoming inputs, it sums up all the inputs and, if the value exceeds a given threshold, it fires. The signal is then passed on through connections to other neurons. Neural networks try to model this behavior. Figure 2.1 illustrates a single perceptron.

Fig. 2.1: Perceptron topology, illustration modified from Danilo Bargen [4]

The first layer is called the input layer and is often a vector of values, called a feature vector. The input values x are then multiplied with a weight w. The weight is also called a parameter and is often denoted with the symbol θ. A bias term is often introduced as x0 and w0, and acts as a threshold value for the activation.

The multiplied inputs are summed into a single value. The value is then passed through an activation function f that produces an output. There are many types of activation functions. One of the simplest is the step function, which outputs a 1 if the input is higher than a given threshold, and 0 if it is below.

To be able to produce more complex functions than linear functions, the model needs to be applied not only to x, but to a non-linear transformation of x. This can be seen as creating a new representation of x made up by the network. This is done by adding hidden layers. These hidden layers are used to produce the new feature representation that helps the model find mappings that achieve the desired output. Figure 2.2 illustrates the topology of a multilayer perceptron (MLP). In this figure, each input node is connected to every neuron, each with its own weight. The layers between the input layer and the final output layer are called hidden layers. Just as in the perceptron example, the output layer takes a feature vector as input, but in this case the input values have been transformed by the hidden layer and are not the raw data. The network then outputs a value based on the threshold and activation function.


Fig. 2.2: Multi layer perceptron topology, illustration modified from Satvik Beri [5]

The objective of a feedforward network or a multilayer perceptron is to find a hypothesis function h for a function f [14]. When solving a classification problem, we produce a function that, given an example with a feature vector x, outputs a class label prediction ŷ. The predicted label ŷ should be as close as possible to the ground truth class label y. A feedforward network finds the values for the parameters θ that result in the best approximation of the function. The network iteratively tunes the parameters to make the hypothesis function h, parameterized by θ, as similar as possible to the target function f [14].

$$\hat{y} = h_\theta(x)$$

The flow of information goes from an input x, through intermediate computations that are used to define $h_\theta$, ending up in an output ŷ. There are no backward connections between neurons; the features found by intermediate layers are strictly passed forward. This is why these models are called feedforward networks. In a feedforward network, a chain of functions is composed together, often represented by a directed acyclic graph (DAG), as has been shown in the previous figures. This is why we call these models networks [14]. A simple feedforward network example would be a network composed of three functions $f^{(1)}$, $f^{(2)}$, and $f^{(3)}$, connected together in a chain forming the complete function $h_\theta(x) = f^{(3)}_\theta(f^{(2)}_\theta(f^{(1)}_\theta(x)))$ [14]. In this example we would call $f^{(1)}$ the first layer, $f^{(2)}$ the second layer and $f^{(3)}$ the final layer or the output layer. The length of the chain is called the network's depth, making this network a three-layer-deep network. This is also where the term "deep learning" comes from, as the networks are composed of many layers, creating a deep network.
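To make the layer composition concrete, the following minimal NumPy sketch (illustrative only, not from the thesis; all layer sizes and weights are assumptions) runs a three-layer feedforward pass with a smooth sigmoid activation standing in for the step function described above.

```python
import numpy as np

def sigmoid(z):
    # Smooth activation function; squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """One forward pass: the chain of layers forms h_theta(x) = f3(f2(f1(x)))."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # each layer: activation(weights * inputs + bias)
    return a

# Illustrative shapes: 4 input features, two hidden layers, 1 output unit.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5)), rng.normal(size=(1, 3))]
biases = [np.zeros(5), np.zeros(3), np.zeros(1)]

x = rng.normal(size=4)               # one example's feature vector
y_hat = forward(x, weights, biases)  # prediction in (0, 1)
print(y_hat)
```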

When we say that we "train" the network, what we do is try to drive $h_\theta(x)$ to match the target function f(x) [14]. The data in our dataset provides us with noisy and approximate examples of f(x), evaluated at different training points [14]. Every example x is associated with a ground truth label y. The dataset specifies what the last output layer needs to produce, given the input x. What the layers in between should output is what the learning algorithm will learn. The learning algorithm tunes these layers by changing the weights θ to best implement an approximation of f(x). This is done through a technique called backpropagation. By propagating the mistakes backwards, tuning the weights to accomplish a better fit to the target function, we improve and learn a better function approximation $h_\theta$.

By comparing the network's output from the hypothesis function ŷ to the correct value y, we can estimate a distance between the guess and the correct answer. This comparison is done through a cost function J, also called a loss function. The goal of the algorithm is then to minimize the function J(θ) by tuning the weights of the network to produce the desired output. Mean squared error is one of many cost functions, illustrated in the equation below, where m is the number of training examples in the dataset.

$$MSE = J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2$$

By calculating the partial derivative for each weight in the network, we can find in which direction each weight needs to be adjusted in order to produce an output that is closer to the target function output. Gradient descent is one of the most common algorithms used to accomplish this within neural networks. Figure 2.3 illustrates the concept of gradient descent. The amount of change that is applied to a weight is decided by the gradient and a hyperparameter α called the learning rate.

Fig. 2.3: Gradient descent, illustration modified from Sebastian Raschka [25]

The step size of the weight change is thus determined by the gradient and the learning rate. Setting a high learning rate makes the gradient descent algorithm take larger steps, and a low learning rate makes the steps smaller. The algorithm is repeated until convergence. In the equation below, the symbol := means that the left-hand side is updated with the value calculated on the right-hand side:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
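As a minimal illustration (not the thesis implementation), the update rule can be written directly in NumPy for the MSE cost of a linear hypothesis; the synthetic data and learning rate below are assumptions chosen purely for the example.

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha):
    """One batch gradient descent update for MSE = 1/(2m) * sum((X@theta - y)^2)."""
    m = len(y)
    predictions = X @ theta                   # h_theta(x) for every example
    gradient = (X.T @ (predictions - y)) / m  # partial derivatives dJ/dtheta_j
    return theta - alpha * gradient           # theta_j := theta_j - alpha * dJ/dtheta_j

# Tiny synthetic dataset: y = 2*x, with a bias column of ones.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 2.0, 4.0, 6.0])

theta = np.zeros(2)
for _ in range(500):
    theta = gradient_descent_step(theta, X, y, alpha=0.1)
print(theta)   # approaches [0, 2]
```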

2.3 Datasets

In most machine learning applications, the dataset is divided into three subsets: a training set, a validation set and a test set. For classification tasks, every example in the dataset contains a number of features, x, and a target value y. Figure 2.4 gives an overview of how datasets normally are split.

Fig. 2.4: Dataset partitioning

2.3.1 Training set

The training set is used during training to tune the parameters or weights θ of the model. This is in most applications the largest of the datasets.

2.3.2 Validation set

The validation set is used during training to find the best hyperparameters for the model. The set is used to make an intermediate estimate of how well the model would perform on data that it has not trained on, to avoid overfitting to the training set. The performance during training is called the model's training error.

2.3.3 Test set

To evaluate the model's performance on completely unseen data, a test set is used. This evaluation is done when the model has finished training and has its hyperparameters and weights set. The performance on the test set shows how well the model generalizes, often measured as the test error or generalization error. All performance benchmarks are generated on the test set.

2.3.4 Test set distribution

When the data is divided into a training and a test set, a couple of assumptions need to be made. Firstly, the test and training set are assumed to have identical distributions, as they are drawn at random from the same distribution. Secondly, we assume that all examples in our dataset are independent of each other. These are called the i.i.d. assumptions (independent and identically distributed).

2.4 Capacity, Overfitting and Underfitting

The main challenge when constructing a machine learning algorithm is to create a model that performs well not only on the training data, but also on new unseen data [14]. The following section describes key concepts related to this challenge.

The model's ability to fit the training set is called the model's capacity. Conceptually it is the amount of freedom the model is given to calibrate itself towards the data presented. This could for example be the number of iterations the model gets to train on the data, or the number of parameters in the model.

A model with low capacity might struggle to fit the training data. This is called underfitting. On the other hand, a model with high capacity might become too specialized on the training set, essentially memorizing the output given a certain input. The model will then struggle on unseen data. This is called overfitting. Figure 2.5 illustrates the difference between overfitting and underfitting by showing a model that tries to fit a line to an example dataset.

Fig. 2.5: Illustrative example of overfitting, underfitting and optimal capacity. Illustration modified from Amar Gondaliya [13]

2.5 Convolutional Neural Networks

2.5.1 Convolutional layer

One of the most important layers in a convolutional network is the convolutional layer. The layer takes two arguments: input data and a kernel. The input data can either be the original image or the feature map of a previous layer. The output of a convolutional layer is called a feature map. The kernel is usually a square matrix, which slides over the image and "filters" it for features.

At each position, the kernel's weights are multiplied with the pixel values within the kernel, performing element-wise multiplication. The products are then summed into a single value, outputting the activation at that spatial location. Figure 2.6 illustrates the process. If a specific feature is present in the input, the activation will be high. In the first layer of a CNN, the weights of the kernels often come to act as edge detectors, finding the presence of vertical and horizontal lines in the image. By convolving the image with a set of filters, a stack of filtered images is sent to the next layer.

Fig. 2.6: Illustration displaying the convolution operation [14]

How far the kernel is moved at every step is called the kernel's stride. A stride of one corresponds to having the kernel move one pixel at each step. The region that is within the focus of the kernel is called the kernel's receptive field. The weights in the kernel are the same for every position in the image, which is called weight sharing or weight tying.


Fig. 2.7: A 7×7 image with a 3×3 kernel and a stride of one [7]

Fig. 2.8: The 5 × 5 output feature map [7]

The stride of the kernel affects the size of the output feature map. A high stride shrinks the feature map output. Examples are presented in Figure 2.7 and Figure 2.8 to illustrate the effect of the stride hyperparameter. The stride in this example is set to one. Figure 2.7 displays an input image of size 7 × 7 with a colored square showing the 3 × 3 kernel. The kernel is moved until it hits or would move past a border, resulting in an output feature map of 5 × 5. Figure 2.9 shows an input image of the same image and kernel size, but with a stride of two. With a stride of two, the kernel can only be moved three times on one row before it has to be moved down. Figure 2.10 shows the resulting 3 × 3 feature map, with a sample of three activation outputs.

Fig. 2.9: A 7×7 image with a 3×3 kernel and a stride of two [7]

Fig. 2.10: The 3 × 3 output feature map [7]

The final parameter that can be set in a convolutional layer is the amount of zeroes that should be added to all borders of the image, called zero padding or padding. Figure 2.11 shows a padding of two applied to a 32 × 32 × 3 image, resulting in a 36 × 36 × 3 image. Padding is used to preserve the size of the image during convolutions. Without padding, there is no input for the kernel outside the edges, so the kernel cannot be centered there. This results in a dimensionality reduction, which can be avoided with padding.

Fig. 2.11: A 32 × 32 image with a padding of two [7]
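As a rough sketch (not thesis code), the following NumPy function computes a single-channel convolution (strictly speaking a cross-correlation, as in most deep learning frameworks) and shows how stride and padding determine the output size; the example sizes match the 7 × 7 input and 3 × 3 kernel discussed above.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Single-channel 2D convolution (cross-correlation) with stride and zero padding."""
    if padding > 0:
        image = np.pad(image, padding)            # add zeros around the borders
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1                  # output height
    ow = (iw - kw) // stride + 1                  # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)    # element-wise multiply and sum
    return out

image = np.arange(49, dtype=float).reshape(7, 7)  # a 7x7 "image"
kernel = np.ones((3, 3))
print(conv2d(image, kernel, stride=1).shape)      # (5, 5), as in Figure 2.8
print(conv2d(image, kernel, stride=2).shape)      # (3, 3), as in Figure 2.10
```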

2.5.2 Pooling layer

After the convolutional layer, a pooling layer [14] is often applied. The layer is sometimes called a downsampling layer, emphasizing the layer's objective of decreasing the size of the image or feature map. A common pooling layer is the max pooling layer. In a max pooling layer, the highest value in the kernel's receptive field is the output of the operation. Figure 2.12 illustrates the pooling process of a 2 × 2 max pooling kernel with a stride of two, slid across a 4 × 4 feature map. By sliding the kernel over the feature map, we can both reduce the size of the feature map by summarizing "boxes" of the feature map, and at the same time become less sensitive to the exact spatial location of a feature. The relative location to other features is still retained. In the max pooling operation, it does not matter where in the receptive field the highest value is positioned.

Fig. 2.12: Image displaying the output of a 2 × 2 maxpool kernel, with a stride of two[7]
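A minimal NumPy sketch (illustrative only) of the 2 × 2, stride-two max pooling described above:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max pooling: keep the highest activation in each size x size window."""
    h, w = feature_map.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()
    return out

fmap = np.array([[1., 3., 2., 1.],
                 [4., 6., 5., 0.],
                 [7., 2., 9., 8.],
                 [1., 0., 3., 4.]])
print(max_pool(fmap))   # 2x2 output: [[6., 5.], [7., 9.]]
```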

2.5.3 Fully connected layer

The last layer of a convolutional neural network has the role of finding the connections between features and classes, and is called a fully connected layer (FCL). All the neurons in this layer are connected to all the neurons in the previous layer, much like a hidden layer in a multilayer perceptron. The features generated by the previous convolutions have by the end of the network reached a level of abstraction where the representations can take the form of hand detectors, feet detectors, cat detectors etc. The fully connected layer has the same number of neurons as there are classes, and outputs a vector representing the activations for each class. The role of the FCL is to find mappings between the activations and a certain class. This mapping is learned through forward passes and backpropagation, described in the previous feedforward network section. The layer before the final output layer produces the final feature map that is used for classification. This layer is sometimes called the bottleneck, and the feature maps that are used as input to the final output layer are called bottlenecks. A common activation function for the final output layer is the softmax activation function. The function is used for multi-class classification and is a generalization of logistic regression. A vector is produced as output, where each element represents a class and the probability that an example belongs to that class. The softmax output vector always sums to one.
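For reference, a small numerically stable softmax (a standard formulation, not taken from the thesis):

```python
import numpy as np

def softmax(logits):
    """Convert raw class activations into probabilities that sum to one."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])      # activations for three classes
probs = softmax(scores)
print(probs, probs.sum())               # e.g. [0.659 0.242 0.099], sum == 1.0
```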

3 Related work

The chapter begins with an introduction to the concept of transfer learning, followed by research exploring transfer learning as a theoretical concept. The second part summarizes research on transfer learning applied to real-world problems. In the last section, the GoogLeNet architecture is introduced.

3.1 Transfer learning

Transfer learning has in the last couple of years become a viable and common solution when applying machine learning to real-world problems. As stated in the introduction, the motivation is often that the process of training a network from scratch is too costly.

An assumption that often has to be made in machine learning is that the training dataset and future data have the same distribution and feature space [23]. When dealing with real-world problems, this assumption is not always true. The dataset available when applying machine learning might be small, and the task of labeling more data can prove expensive. On the other hand, we might have sufficient training data in another domain, but with a different feature space and distribution. To avoid the expensive operation of growing the target dataset, we want to transfer the knowledge learned in the other domain to the domain of our dataset. This is called transfer learning, and it has proven highly successful in recent years [23].

3.2 Research exploring transfer learning

Donahue et al. [8] explored how well a pre-trained state-of-the-art CNN generalizes to classification on images drawn from other domains. They took a state-of-the-art model, trained it on ImageNet, and then retrained the last layers on new datasets and tasks. The researchers evaluated three datasets: the first was the SUN-397 dataset, containing scenes like a dinner or a mosque, the second was an office dataset containing office-product images, and the last was the Caltech-UCSD bird dataset. Their results showed that the generality and semantic knowledge learned in the pre-trained network tend to cluster images into semantic categories that the network was never explicitly trained on. Their results were among the best ever attained on the used datasets. The model had been trained on the task of object recognition, but was also tested on scene recognition, a completely different task. The model performed surprisingly well, and was able to beat state-of-the-art accuracy by 2.9%.

Girshick et al. [12] propose an object detection algorithm that significantly improves on previous results on PASCAL VOC 2012. Their research builds on two insights. The first is to localize and segment objects into regions using bottom-up region proposals, and then apply state-of-the-art convolutional networks to these regions. The second is that it is highly effective to pre-train a CNN on an auxiliary task with large quantities of data and then fine-tune the network for the target task. They conclude that transfer learning is likely to be highly effective for a wide variety of computer vision problems where data is scarce.

Tajbakhsh et al. [34] set out to answer the following research question: Can the use of pre-trained deep CNNs, with sufficient fine-tuning, eliminate the need for training a deep CNN from scratch? Their experiments consistently demonstrated the following properties: 1) a pre-trained CNN with enough fine-tuning seems to outperform, or in the worst cases perform on par with, CNNs trained from scratch; 2) a CNN that is trained using fine-tuning proves to be more robust to different training set sizes than a CNN trained from scratch; 3) neither tuning all layers, called deep tuning, nor tuning just the last layer, called shallow tuning, gave the best results; 4) the best performance was achieved by layer-wise fine-tuning, iteratively finding the optimal number of layers whose weights should be fine-tuned during training.

Sinno Jialin Pan and Qiang Yang [23] did a survey study on transfer learning. In the survey, the authors categorize and review the current progress on transfer learning. The survey also focuses on defining the relationship between transfer learning and other related machine learning techniques. It concludes that most research shows that the transferability in transfer learning is, to a large degree, related to how similar the source and target domain or task are. We still lack a similarity measure that defines the distance between domains or tasks, and developing one is suggested as future research. The survey also covers what is called "negative transfer", that is, when the transferred knowledge actually decreases the model performance, which is also tightly coupled to the similarity between the source and target domain.

In the paper "How transferable are features in deep neural networks?", Yosinki et al. [35] experimentally try to quantify the generality versus specificity of neurons in each layer of a CNN. A phenomenon that has been observed across many CNNs, is

(25)

that the first layer often learn features for edge detection. This suggests that these features are somewhat general in that they are not only useful on the current dataset and task. For each layer, the network need to become more specialized towards the domain of the dataset and task, transitioning from general to specific. Yosinki et al found two distinct issues had a negative impact on transferability. The first issue they discovered was that the performance on the target task was negatively affected by the higher level neurons specialization towards their original task, which could be expected. The second issue they observed was that splitting networks between co-adapted neurons created optimization difficulties. Either of these described issues may dominate, depending on how many of the layers are "frozen" during retraining and fine-tuning towards the target domain and task. In line with previous results, the paper also prove that the transferability of features decrease with the similarity distance between the base and target task.

3.3 Research applying transfer learning

Saito and Matsui [28] highlight, in their paper on semantic vector representations of illustrations, the fact that many studies have been made on CNN performance on natural images, but that there is a lack of research focusing on illustrations. According to the authors, this is because of two technical issues. The first is the difficulty of recognizing illustrations, because of the diversity of their visual elements. Eye size, shapes of faces and bodies etc. vary a lot, not only between different artists, but also between drawings made by the same artist. The second issue is the lack of large open source datasets of illustrations. Large-scale annotated datasets like ImageNet are one of the driving factors behind the rapid development within image recognition. Such a dataset for illustrations does not exist.

Esteva et al. [11] researched, in their paper "Dermatologist-level classification of skin cancer with deep neural networks", the use of transfer learning in the context of dermatology. The study was very well received, and the researchers were able to produce a classification model that could classify skin lesion images with the accuracy of a board-certified dermatologist. They used the model GoogLeNet, pre-trained on ImageNet, and simply retrained it on their target dataset. An important note is that Esteva et al. had a large dataset, 129,450 clinical images. They used an interesting method of building a topological tree structure, where they summed the probabilities of each root node's children to produce the classification. The classifier matched the performance of professional dermatologists tested across critical diagnostic tasks for skin cancer, and is deployable on mobile devices. Several research papers within the field of medicine describe the effectiveness of feature extraction using pre-trained CNNs [31] [27] [15].


Al-Shabi et al. [29] propose in their paper an adult image recognition system that uses a mixture of CNNs. The most popular method to block access to websites that present adult content is to search the site for restricted words. More traditional methods have focused on handcrafting the features in adult images, like different positions and shapes. In contrast to these more traditional methods of adult-content detection, the system is an end-to-end machine learning model. The researchers manually collected 41,154 adult images from the internet, and then used the ILSVRC-2013 dataset as non-adult images. An ensemble of CNN classifiers was used, and each classifier's prediction on an image was weighted by its performance on the test set. The final model yielded an impressive accuracy of over 96%.

Moustafa [21] also explored the use of deep learning for classifying pornographic images. One of the differences from Al-Shabi et al. was that Moustafa used AlexNet and GoogLeNet as feature extractors, using the output from the last convnet layer (convolutional neural network layer). This allowed the last-layer classifier to be replaced with any kind of classifier, e.g. a support vector machine (SVM). The effect is a model that requires much less data to be trained, because it has fewer parameters that need to be adapted. By combining the predictions from both AlexNet and GoogLeNet into an ensemble convnet with different last-layer classifiers, the author noticed a significant increase in performance on the test set. The predictions from each classifier were weighted by the classifier's performance during testing. In a study by Zhou et al. [37], results showed that an ensemble of CNNs can produce state-of-the-art results on pornographic image classification. According to Zhou et al., a common technique for categorizing images as pornographic is based on image retrieval technology. A large image database with vast amounts of pornographic and normal content is first created. The image to be classified is then used as query input and compared with images in the database. The classes of the retrieval result then determine the class of the input image. The problem with this method is that, due to the high variety in adult images, it has proven difficult to build a database that covers a large enough set of images.

Several studies in the last year have shown the effectiveness of applying CNN ensemble classifiers and transfer learning to real-world problems. Huynh et al. [16] achieved state-of-the-art results on digital mammographic tumor classification by using transfer learning combined with an ensemble of classifiers. Akcay et al. [1] applied transfer learning to the problem of x-ray baggage image classification. Their model achieved 98.92% detection accuracy, outperforming previous work in the field. In the study "Transfer Learning with Convolutional Neural Networks for Classification of Abdominal Ultrasound Images", Cheng and Malhi [6] evaluated the use of CNNs and transfer learning within the field of abdominal ultrasound images. Their results show that their CNN model achieved a classification accuracy that slightly surpassed that of human radiologists.

3.4 GoogLeNet

In the ILSVRC14 competition, Google competed and won with a CNN model called GoogLeNet [33]. The model had a top-five error rate of 6.7%, pushing the state of the art. The revolutionary part of the algorithm was its architecture. Having only 22 layers, GoogLeNet uses twelve times fewer parameters than AlexNet, breaking the trend of ever larger CNN architectures. The size of a model has a huge impact on memory consumption, which engineers at Google realized would become a bottleneck when applying CNNs to real-world applications. Very large models might produce better results measured in accuracy, but can never be deployed on, for example, a mobile device. Figure 3.1 shows the complete network architecture.

Fig. 3.1: GoogLeNet CNN architecture. Illustration taken from the research paper "Going Deeper with Convolutions" [33]

3.4.1 The inception module

To achieve this more memory- and cost-efficient model, researchers at Google came up with a module they call "Inception". The module architecture is shown in Figure 3.2. The architecture makes use of a technique called Network-In-Network (NIN), presented in the paper "Network In Network" by Lin et al. [19]. Instead of applying a linear operation in the convolutional layer, a multilayer perceptron is used to capture the feature concepts in the input. The use of an MLP has been shown to do a better job of extracting features at each spatial location [19]. Figure 3.3 illustrates the Network-in-Network concept.

The 1x1 layers can be used to reduce a feature map of size 512x512x80 to a map of size 512x512x40 by applying 40 filters in the 1x1 convolution. The 1x1 convolutional layers displayed in the Inception module illustration are NIN layers. NIN layers are placed before the more computationally expensive 3x3 and 5x5 convolutions to reduce the dimensionality of the input.
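As a quick sanity check of that dimensionality reduction (an illustrative sketch written against a recent TensorFlow version, not the TensorFlow 1.1 used in the thesis), a 1x1 convolution with 40 filters maps an 80-channel feature map to a 40-channel one while leaving the spatial size untouched; the channels-last layout and the smaller spatial size below are assumptions to keep the demo light.

```python
import tensorflow as tf

# A batch of one feature map: 64x64 spatial locations with 80 channels.
feature_map = tf.random.normal([1, 64, 64, 80])

# 1x1 convolution with 40 filters: mixes channels at each spatial location,
# reducing 80 channels to 40 without touching the spatial dimensions.
reduce_1x1 = tf.keras.layers.Conv2D(filters=40, kernel_size=1)

print(reduce_1x1(feature_map).shape)   # (1, 64, 64, 40)
```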


Fig. 3.2: Inception module illustration [33]

Instead of having only a single convolution, the inception module has a composition of convolutions of differing sizes. This effectively makes the model able to "choose" whether it should use a 5x5, a 3x3 or a 1x1 convolution etc. at multiple layers. This keeps down the total number of parameters in the model and at the same time performs better than if the layer just had a single convolution.

Fig. 3.3: Figure illustrating the difference between a normal linear convolution layer and an MLPconv layer. Illustration taken from the paper "Network In Network" [19]

4 Method

This chapter begins by presenting what Battlefield emblems are, how they are created, and how they are reported. The next section introduces a workflow for machine learning problem solving that has been suggested by researchers within the field. The guidelines introduced are applied during the process of producing the thesis results.

4.1 Emblems in Battlefield

4.1.1 How players create and use emblems in Battlefield 1 and Battlefield 4

Uprise is an Electronic Arts studio located in Uppsala. The studio's main responsibility is the online platform and user interface surrounding the games, where players socialize, join games, buy merchandise etc.

One of the features provided on this platform is the possibility to create your own "emblem". An emblem is an image that is associated with a player profile and also displayed in the game, on weapons and vehicles. A player can either choose to import an already existing emblem from another player, or create their own. In the platform, players are presented with a web editor where they can draw their own emblem. Unlike common painting tools like Paint, the players are not given a brush, but instead a list of 105 symbols that can be composed into an emblem. The size, color and orientation of each symbol can be adjusted by the player. The editor also has a layer structure: a symbol can be placed behind or in front of another symbol. When a player submits their emblem, the emblem is stored as a PNG, and no check is made whether the emblem already exists in the database. The PNG is then exposed through a unique URL. Figure 4.1 shows a screenshot of the web editor in Battlefield companion.

Fan-based web pages like http://emblemsbf.com/ provide galleries where players can share their emblem creations. This results in a significant reuse of certain well-crafted emblems.


Fig. 4.1: Screenshot capture from the Battlefield companion emblem editor

Note that Battlefield companion does not allow players to import images that have not been created through the Battlefield web editor.

4.1.2 How offensive emblems are handled in the Battlefield games

Players can report other players' emblems if they find them offensive. Reported emblems are sent to the customer service department at Electronic Arts, where employees decide whether a reported emblem should be banned or not. If the reported emblem is banned by customer service, the emblem is flagged as "hidden" in Battlefield's emblem database. No additional metadata is currently stored except the date of the change.

4.2 Method to approach the problem

The problem was approached according to guidelines presented in the book "Deep Learning" [14] by Goodfellow et al., complemented by advice from the machine learning researcher Andrew Ng (Section 4.3).

4.2.1 Step 1 - Determine goals and measurements

The first step in applying machine learning to a problem is to determine the goals of the project, what metrics to use, and what target values the project should satisfy [14].

4.2.2 Step 2 - Establish working end-to-end baseline model

The next step is to establish a working end-to-end pipeline for the machine learning task and measure performance on a first baseline model.

4.2.3 Step 3 - Determine bottlenecks in performance

According to Goodfellow et al. [14], the following questions are of great importance when trying to determine bottlenecks in performance.

• Is the model overfitting?
• Is the model underfitting?
• Are there defects in the dataset?
• Are there defects in software?

4.2.4 Step 4 - Repeatedly make incremental changes

The last step when applying machine learning is to iteratively make changes to improve performance. The following tasks are often applied at this stage:

• Gather new data

• Adjust hyperparameters
• Change algorithm if necessary

4.3 Additional guidelines when applying machine learning

4.3.1 The process of knowing what to do next

Andrew Ng argues that the process of applying deep learning in practice is still being researched, but presents a few guidelines from his experience.

Fig. 4.2: Flow-chart displaying the process of applying deep learning. Illustration taken from "Nuts and Bolts of Applying Deep Learning" [22]

When deep learning is applied in practice, Ng argues that engineers often struggle to know what should be done next [22]. Ng presents a flow-chart approach for how resources are in many situations best spent, depending on performance benchmarks during training and testing.

If the training error is high, called underfitting, the first thing to do is to make the model bigger. In this situation, the model is not able to capture the structure of the data, and needs more freedom to adjust and fit. Training the model longer on the dataset should also be evaluated. If the previous approaches don't work, the model architecture might have to be changed [22]. If nothing works, the quality of the data might be the problem. The data could be too noisy, or not include features that make it possible to predict the output. The solution to this problem is to start over and collect cleaner data, or a dataset with a richer set of features.

If the error on the training set is low but the validation error is high, called overfitting, then our model is not generalizing. In most situations, the best option is to put effort into obtaining more data. Adding or increasing regularization measures, for example by decreasing the number of training epochs, can improve performance during testing. If these measures don't increase the performance on our test set, a different model architecture might be the last option.


The development test set (dev test set) is used to produce intermediate performance results once a classifier has been trained using the training set and fine-tuned using the validation set. When the validation error is low but the error on the dev test set is still high, the best option is again to extract more data and make sure that the data trained on is similar to the data the model is being tested on. Synthesizing data, for example by creating new rotated images or adding random noise, can be an option for increasing the dataset size.

The production test set (prod test set) should be extracted from the target application domain and have a data distribution that is identical to the domain where the model will be run. The work is done when the performance on the final production test set is satisfactory.

4.3.2 Create a common data warehouse

Ng suggests that creating a common data warehouse for the project speeds up development, making sure that the latest dataset is always reachable by the engineers in the project.

4.3.3 Determine human-level performance on the task

Determining human-level performance on the task, measured in accuracy, gives an idea of where the theoretical limit of performance lies, often called the optimal error rate. A dataset containing images often has some examples that are so blurry or misleading that they simply are not possible to label into a category with high confidence. Humans perform well on many of the tasks that are normally targeted with deep learning, leaving the gap between the optimal error and human-level performance relatively small. When iterating and improving the algorithm, it is easier to make progress while model performance is below human-level performance.

4.3.4 Plot performance on increasing dataset size and visualize worst errors

Running experiments using 1/8, 1/4, 1/2 etc. of the dataset gives insight into the expected performance gains if more data were extracted. A final tip presented by Ng is to visualize the model's worst errors. Looking at the incorrect classifications with the highest confidence can often reveal data that is incorrectly labeled, and gives a better understanding of which examples the model struggles on.

5 Experimental setup

The material and setup used to produce the experimental results are explained in this chapter. The preprocessing techniques and dataset augmentation procedures are described, together with the dataset generation method.

5.1 Software and hardware used during experiments

All experiments were conducted on the following hardware:

• Intel Xeon CPU E5-1650 v3 3.5GHz, 12 vCPUs
• NVIDIA GeForce GTX 980, 2048 CUDA cores
• 32GB RAM

The following software versions were used during classifier testing/training:

• Python 3.4
• TensorFlow 1.1.2
• Ubuntu 16.04

5.2 Preprocessing

Images need to have a standardized pixel range, for example the range [0,1] or [0,255]. This is the only preprocessing that is strictly required when running images through a CNN.

5.2.1 Dataset augmentation

More data can be produced by augmenting existing images, synthesizing additional data. Adding "noise" to images by rotating them, adding random brightness etc. are examples of augmentation techniques. The table below describes the distortions applied and experimented with during the thesis work. Figure 5.1 illustrates the rotation technique applied to some emblems.

Tab. 5.1: Data augmentations

• Random scale - Randomly scale the image by x%
• Random crop - Randomly crop the image by x%
• Random brightness - Randomly adjust the image brightness by x%
• Rotation - Rotate in steps of 20 degrees up to 340 degrees, synthesizing 17 images

Fig. 5.1: Rotation augmentation example
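A minimal sketch of the rotation augmentation (illustrative only, using Pillow rather than the thesis pipeline; the file name is hypothetical): every emblem is rotated in 20-degree steps, producing 17 synthesized variants per image.

```python
from PIL import Image

def rotation_augment(path, step=20, stop=340):
    """Synthesize rotated copies of an emblem image in `step`-degree increments."""
    original = Image.open(path)
    rotated = []
    for angle in range(step, stop + 1, step):   # 20, 40, ..., 340 -> 17 images
        rotated.append(original.rotate(angle))  # keeps the original image size
    return rotated

# Hypothetical file name, purely for illustration.
variants = rotation_augment("emblem.png")
print(len(variants))   # 17
```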

5.2.2 Contrast normalization

The difference in magnitude between the bright and the dark pixels in an image is called the image contrast. The amount of contrast in an image can often safely be reduced, to lower the variance and remove the need for the model to learn how to handle multiple contrast scales. One way to achieve this is global contrast normalization, which normalizes the contrast in every image.
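A common formulation of global contrast normalization subtracts the image mean and rescales by the pixel standard deviation; the small epsilon and the exact formulation below are assumptions for illustration, not the thesis's exact preprocessing.

```python
import numpy as np

def global_contrast_normalize(image, eps=1e-8):
    """Zero-center an image and divide by its pixel standard deviation."""
    image = image.astype(np.float64)
    centered = image - image.mean()
    return centered / (centered.std() + eps)   # eps avoids division by zero on flat images

img = np.random.randint(0, 256, size=(128, 128, 3))
normalized = global_contrast_normalize(img)
print(normalized.mean().round(6), normalized.std().round(6))   # ~0.0 and ~1.0
```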

5.3 Dataset generation

The complete dataset was iteratively divided into the parts listed below; a minimal sketch of the split follows the list. An increasing number of emblems became available as the thesis work progressed. In an effort to keep the class distribution within the complete dataset consistent, the labeling process aimed at a division of 45% non-offensive emblems, 30% swastika emblems and 25% penis emblems. A more detailed explanation of the emblem categorization and extraction process is presented in the results chapter. The dataset during iteration one had a size of 5000 emblems, iteration two 10 000 emblems and the third iteration 17 377 emblems. The class distribution between the training, validation and dev test sets was close to identical.

• Training set, 80% - At the end of each thesis iteration, 80% of the images were drawn at random from the dataset and put into a separate training set. This set was used to tune the weights/parameters of the model.

• Validation set, 10% - The validation set was used solely to graph the model's estimated generalization error for each epoch during training. 10% of the dataset was set apart for this. The validation set is normally used to tune the model's hyperparameters.

• Development test set, 10% - Henceforth, the development test set is called the dev test set. When the model had been fully trained, performance benchmarks were run on this set, which was kept separate from the training process.

• Production test set - Henceforth, the production test set is called the prod test set. At the end of the third iteration, 3650 emblems were drawn at random from the emblem database containing 8 032 703 emblems. The MD5 hashes of these images were then compared to the 17 377 emblems in the already labeled dataset, to make sure the model had never seen the emblems before. 523 emblems were matched and removed in this process, yielding a final production test set of 3127 emblems.
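A minimal sketch of the 80/10/10 split (illustrative only; the seed, shuffling strategy and placeholder file names are assumptions, not the thesis implementation):

```python
import random

def split_dataset(examples, train=0.8, val=0.1, seed=42):
    """Shuffle and split a list of examples into train / validation / dev-test sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

emblems = [f"emblem_{i}.png" for i in range(17377)]   # placeholder file names
train_set, val_set, dev_test_set = split_dataset(emblems)
print(len(train_set), len(val_set), len(dev_test_set))   # 13901 1737 1739
```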

5.4 Feature extraction

The CNN architecture GoogLeNet, pretrained on the ImageNet dataset, was used as a feature extractor during the thesis work. Figure 5.2 shows a sample from the ImageNet dataset. Emblem images were fed through the CNN, and the output feature map (the bottleneck) of the last convolutional layer was then used to train an MLP to classify the Battlefield 1 emblems. The feature map produced is a vector of length 2480, each element being a feature represented by a real number between zero and two.

Fig. 5.2: ImageNet sample. Image taken from Stanford Vision Lab [2]
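The idea can be sketched with a modern tf.keras model (the thesis itself used the TensorFlow 1.1 retraining pipeline, and InceptionV3 here stands in for GoogLeNet, so treat the details as assumptions): images are pushed through the frozen convolutional stack, and the pooled bottleneck vectors become the training data for a small classifier on top.

```python
import numpy as np
import tensorflow as tf

# Frozen convolutional feature extractor (InceptionV3 as a stand-in for GoogLeNet),
# pretrained on ImageNet, with the classification head removed.
extractor = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")
extractor.trainable = False

def bottlenecks(images):
    """Map a batch of emblem images to fixed-length bottleneck feature vectors."""
    preprocessed = tf.keras.applications.inception_v3.preprocess_input(images)
    return extractor(preprocessed, training=False)

# A small classifier trained only on the bottleneck features.
classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),   # non-offensive vs offensive
])

batch = np.random.rand(4, 299, 299, 3).astype("float32") * 255.0  # placeholder images
features = bottlenecks(batch)
print(features.shape)                 # (4, 2048) bottleneck vectors
print(classifier(features).shape)     # (4, 2) class probabilities
```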

5.5 Machine learning framework - Tensorflow

TensorFlow is an open-source software library for machine learning and numerical computation. The framework was developed by the Google Brain team within Google's Machine Intelligence research organization. In TensorFlow, a data flow graph is defined, where each node represents a mathematical operation. The edges between the nodes represent multidimensional data arrays, called tensors, that are passed between the nodes. The abstraction of a computational graph makes it possible to deploy the computation to multiple CPUs or GPUs, and to different devices with the same API.
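A minimal TensorFlow 1.x-style example (matching the TensorFlow 1.1.2 version listed above; illustrative only) of defining such a data flow graph and executing it in a session:

```python
import tensorflow as tf

# Build the data flow graph: nodes are operations, edges carry tensors.
a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
product = tf.multiply(a, b, name="product")
total = tf.add(product, tf.constant(1.0), name="total")   # total = a * b + 1

# Nothing has been computed yet; a session executes the graph.
with tf.Session() as sess:
    print(sess.run(total, feed_dict={a: 3.0, b: 4.0}))    # 13.0
```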

6 Results

6.1 Results iteration 1

6.1.1 Step 1 - Determine goals and measurements

The first step was to determine the goals for the project. In discussion with Uprise, the following objectives were decided for the thesis:

• The project should produce a categorizing service that, when presented with an emblem, will flag the emblem as offensive or not.

• The project should provide a good overview of the model's strengths and weaknesses.

No explicit key metric or target values were added to the objectives. The task of filtering out offensive content is similar to the task of spam detection in some ways. Both are binary classification tasks, and in both tasks the cost of incorrectly classifying an example as offensive/spam is higher than the cost of permitting an offensive/spam example. The dataset also has a heavily skewed distribution between the classes: about 99% of the emblems are non-offensive, when examining a sample of 1000 emblems drawn at random from the roughly eight million emblems.

The uneven class distribution renders metrics like accuracy and error rate less useful when evaluating the model on real-world samples. On a set randomly picked from the real-world dataset, a model that classified all examples as non-offensive would on average get an accuracy around 99%, which is misleading. We are not that interested in the examples that the model correctly flags as non-offensive, so we want to use metrics that don't take true negatives into account.

The primary focus is to minimize the number of examples that the classifier incorrectly flags as offensive, covered by the precision metric. A secondary goal is to catch as many offensive emblems as possible in the filter, covered by the recall metric. The F-measure takes both precision and recall into account.


TP = True Positive = Correctly predicting an offensive emblem as offensive

TN = True Negative = Correctly predicting a non-offensive emblem as non-offensive

FP = False Positive = Incorrectly predicting a non-offensive emblem as offensive

FN = False Negative = Incorrectly predicting an offensive emblem as non-offensive

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
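These metrics are straightforward to compute from confusion-matrix counts; a small illustrative helper (not thesis code, and the counts below are hypothetical):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return precision, recall, f_measure

# Hypothetical counts: 90 offensive emblems caught, 5 false alarms, 10 missed.
print(precision_recall_f1(tp=90, fp=5, fn=10))   # (0.947..., 0.9, 0.923...)
```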

An important note is that the class distribution within the sets used during training is far more even, rendering accuracy a useful measurement for model evaluation; it will be considered the key metric. The dev test set has the same class distribution as the training set.

6.1.2 Dataset extraction

Dataset labeling process

There are 8 032 703 uploaded emblems in Battlefield 1 as of April 2017. The emblems that have been reported and marked as offensive by customer service constitute the offensive dataset. This dataset consists of 4730 images for the game Battlefield 1. No additional data was stored regarding the images. In order to get a sense of the distribution within the offensive dataset, the dataset was sorted into offensive categories. The following categories were decided:

Tab. 6.1: Categories within the offensive dataset

Nazi symbol, Penis, Nude, Text, Miscellaneous

Emblems were labeled miscellaneous when none of the other labels applied. Figure 6.1 illustrates samples from each category.

Fig. 6.1: Sample emblems. From left to right: nude, miscellaneous, nazi symbol, penis and text.


Dataset Distribution

Tab. 6.2: Emblems hidden by customer service at Dice, categorized

Nazi symbol   Penis   Nude   Text   Miscellaneous   Total
2942          1146    265    110    267             4730

Fig. 6.2: Distribution among hidden emblems in BF1

The distribution between the classes is shown above. Most of the offensive emblems are Nazi symbols, followed by penis illustrations. To get a further understanding of the kind of emblems that are common in Battlefield, the 10 000 most used emblems were extracted. This was done by running an MD5 hash on all the emblems, grouping all emblems with the same hash, and then sorting the emblems by number of occurrences. The top 10 000 emblems are reused by players 1 557 720 times. Figure 6.3 shows the distribution among the top 1000 emblems, after manually sorting the set. Figure 6.4 displays the distribution between the offensive categories found in the top 1000 emblem dataset. The most common offensive classes are nude and miscellaneous. The reused drawings mostly consist of advanced illustrations, having multiple layers and being more artistic than the average emblem. One plausible explanation for why nude images are reused the most could be that they are too hard for the average player to draw themselves. In contrast, most people are capable of drawing a swastika or a penis.

Tab. 6.3: Distribution among top 1000 emblems after manual categorization

Non-offensive   Nazi symbol   Penis   Nude   Text   Miscellaneous   Total
907             3             8       50     4      28              1000

Fig. 6.3: Distribution among all top 1000 emblems BF1

Fig. 6.4: Distribution between offensive emblems in top 1000
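The hash-and-count step described above can be sketched roughly as follows. The sketch is illustrative only and not the code used in the project; the helper names and the emblem_paths input are hypothetical.

    # Illustrative sketch (not the thesis code): find the most reused emblems
    # by hashing the raw image bytes and counting identical hashes.
    import hashlib
    from collections import Counter

    def md5_of_file(path):
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()

    def most_reused(emblem_paths, top_n=10000):
        counts = Counter(md5_of_file(p) for p in emblem_paths)
        # (hash, number of occurrences) pairs, most reused first
        return counts.most_common(top_n)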


6.1.3 Step 2 - Establish working end-to-end pipeline and baseline model

A database was set up to store emblems and their labels. To make sure that the dataset did not include any duplicate emblems, an MD5 hash was computed for each emblem and used as the key in the database.

To streamline the process of collecting experimental results, a database was set up to automatically store classifier hyperparameters, which labels were included as offensive, and the classifier's performance on the test set. The end-to-end pipeline was set up in Google's open-source machine learning library TensorFlow, using the Python API. To avoid dependency issues and ensure deployment stability, the project made heavy use of containerization using Docker.

The pipeline looked as follows:

1. Choose which labels should be considered offensive (so that future models can include new categories as offensive).

2. Define a hyperparameter configuration file to be used for the run, including parameters such as the number of training epochs, learning rate, training batch size, etc. (a hypothetical example is sketched after this list).

3. The pipeline would then fetch the latest dataset from the database and spawn a Docker container, performing the classifier training and outputting the trained classifier as a graph file. Bottlenecks were produced once and reused via a cache folder, making repeated runs significantly faster. The classifier's performance is then automatically stored in the database.
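As a rough illustration of steps 2 and 3, a run could be driven by a small configuration file that is mounted into a training container. The sketch below is hypothetical: the key names, paths and image name are assumptions rather than the project's exact setup, while the numeric values match the training parameters reported later in this chapter (4000 epochs, learning rate 0.01, batch size 100).

    # Hypothetical sketch of pipeline steps 2-3 (not the thesis code): write a
    # hyperparameter configuration for one run, then spawn a Docker container
    # that mounts it. Key names, paths and the image name are assumptions.
    import json
    import subprocess

    config = {
        "training_epochs": 4000,
        "learning_rate": 0.01,
        "training_batch_size": 100,
        "offensive_labels": ["swastika", "penis"],
    }
    with open("/data/configs/run_001.json", "w") as f:
        json.dump(config, f, indent=2)

    subprocess.run([
        "docker", "run", "--rm",
        "-v", "/data/emblems:/data/emblems",                 # dataset and bottleneck cache
        "-v", "/data/configs/run_001.json:/app/config.json",
        "emblem-classifier-training:latest",                 # hypothetical image name
    ], check=True)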


The dataset distribution used to train the first baseline model is shown below. Training batch size is the number of images that are used each epoch for the forward pass and backpropagation. The data was not augmented in any way during iteration one or two. An accuracy of 91.6% on the test set was recorded for the baseline model.

The dataset used for the baseline model was created without applying any fine-grained separation within the offensive dataset, including images from all categories as offensive.

Tab. 6.4: Dataset baseline model

Non-offensive   Offensive   Total
1000            4000        5000

Tab. 6.5: Training parameters baseline model

Number of training epochs   Learning rate   Batch size during training
4000                        0.01            100

Tab. 6.6: Performance on test set

Model      Accuracy   F-measure   Precision   Recall   Test set size
baseline   0.9162     0.7843      0.7018      0.8889   542


6.1.4 Performance benchmarks for first model

During the manual process of labeling images into more fine-grained categories, it was concluded that the variety between images within some classes was very high. A sample from the miscellaneous-labeled emblems is shown in Figure 6.5.

Fig. 6.5: Sample from the miscellaneous category

The decision was made to limit the scope of the thesis to focus on filtering emblems containing swastikas and penises. This was due to the fact that swastikas were seen as the most offensive category. These were also the largest offensive categories in the hidden dataset.

Most of the images in the Nazi symbol category are swastikas, so it was also decided that swastikas should be considered first, and the Nazi symbol category was cleaned to contain only swastikas. Other symbols, like the blood drop cross and Confederate flags, were put into a separate category. The dataset was swept through a second time, and a few images were found in the wrong category. After the dataset clean-up, the model was trained again. By collecting more non-offensive labeled emblems, both from the top 10 000 dataset and at random from the eight-million set, the dataset was changed to contain a more even distribution between offensive and non-offensive emblems.


Hyperparameters and training set

Tab. 6.7: Training parameters

Training epochs   Learning rate   Training batch size
4000              0.01            100

Tab. 6.8: Dataset used during training in iteration one

Non-offensive   Swastikas   Penis   Total
2248            1539        1211    5000

Samples from the labeled categories are shown below as thumbnails. These were the emblems used during training. The categories non-offensive, swastika, and penis were used to train the model, resulting in a multi-class classification problem. During testing, performance is measured on the binary classification task of determining whether an emblem is offensive or non-offensive. If the classifier guesses penis or swastika, the guess is coded as an offensive guess.

(a) Non-offensive   (b) Swastikas   (c) Penises

Fig. 6.6: Emblem thumbnails from each of the categories
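The binary coding of the multi-class prediction described above can be sketched as follows (a hypothetical helper, not the thesis code):

    # Minimal sketch (hypothetical helper, not the thesis code): collapse the
    # three-way prediction into the binary decision used during evaluation.
    OFFENSIVE_LABELS = {"swastika", "penis"}

    def to_binary(predicted_label):
        return "offensive" if predicted_label in OFFENSIVE_LABELS else "non-offensive"

    assert to_binary("swastika") == "offensive"
    assert to_binary("non-offensive") == "non-offensive"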


Performance during training and test

The model took four minutes to generate bottlenecks (feature maps from the pre-trained GoogLeNet CNN) and 15 minutes to fine-tune the fully connected layer. For comparison, this procedure took four hours when run on the CPU, instead of the GPU.
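The bottleneck approach can be illustrated roughly as follows: each image is run once through the frozen, ImageNet-pre-trained convolutional network, the resulting feature vector is cached, and only a small softmax classifier is trained on top. The sketch below uses the Keras Inception application as a stand-in for the GoogLeNet graph actually used; the preprocessing, paths and caching details are assumptions.

    # Rough sketch of the bottleneck idea (a stand-in for the setup used in the
    # thesis): a frozen ImageNet-pre-trained Inception network produces one
    # feature vector ("bottleneck") per image, and only a small softmax head
    # is trained on top of the cached features.
    import tensorflow as tf

    base = tf.keras.applications.InceptionV3(
        weights="imagenet", include_top=False, pooling="avg")
    base.trainable = False  # feature extractor stays frozen

    def bottlenecks(images):
        # images: float array of shape (N, 299, 299, 3), already preprocessed
        return base.predict(images)

    # Small trainable head: one fully connected softmax layer over the three
    # classes (non-offensive, swastika, penis).
    head = tf.keras.Sequential([
        tf.keras.layers.Dense(3, activation="softmax", input_shape=(2048,))
    ])
    head.compile(optimizer=tf.keras.optimizers.SGD(0.01),
                 loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    # features = bottlenecks(train_images)   # computed once and cached on disk
    # head.fit(features, train_labels, batch_size=100, epochs=...)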

Fig. 6.7: Accuracy plot during training. Performance on the training batch in orange, validation performance in turquoise. The x-axis shows the number of epochs.

Fig. 6.8: Cross-entropy plot during training. Performance on the training batch in orange, validation performance in turquoise. The x-axis shows the number of epochs.

Figure 6.7 plots the accuracy for each epoch on the training batch and the validation set. The faded line is the actual performance at each epoch, and the solid line displays the smoothed-out performance across epochs, to more easily show the trend. Note that the training set performance is evaluated on the last 100 images, which gives rise to the large jitter in performance across epochs. The validation performance is evaluated on the complete validation set, 542 images, every ten epochs, making it much less prone to jitter.

The accuracy on both the training and validation sets increases drastically during the first 200 iterations. Performance on the validation set seems to stop increasing after around 2000 iterations, while performance on the training set continues to improve, reaching 100% accuracy on some training batches between epochs 3500 and 4000. The gap in accuracy between training and validation is, by the end of epoch 4000, close to 5%. Figure 6.8 plots the cross-entropy at each epoch, confirming the performance improvement shown in the accuracy plot.

             Predicted Pos   Predicted Neg   Total
Actual Pos   TP 245          FN 13           258
Actual Neg   FP 16           TN 268          284
Total        261             281             542

Tab. 6.9: Confusion matrix for the first iteration model

Performance across all measurements is shown in Table 6.10. The improvement in accuracy is largely dependent on changing the rules for which emblems are considered offensive in the dataset. Only considering swastikas and penises as offensive emblems gives the classifier a more well-defined concept, which proves easier to separate from non-offensive emblems.

Tab. 6.10: Performance on dev test set

Model             Accuracy   F-measure   Precision   Recall   Dev test set size
baseline          0.9162     0.7843      0.7018      0.8889   542
first iteration   0.9465     0.9441      0.9387      0.9491   542


Misclassified images

Fig. 6.9: Penises misclassified as non-offensive

Fig. 6.10: Non-offensive misclassified as penis

Fig. 6.11: Swastikas misclassified as non-offensive

Fig. 6.12: Non-offensive misclassified as swastika

In Figure 6.9, emblems 2, 3, 4 and 6 are penis illustrations in line with how most penis emblems look. Emblem 7 has an incorrect ground truth label; the emblem is a bandanna and one of the web editor's drawing symbols. Emblem 5 could be considered correctly labeled as a penis illustration: it depicts an armed soldier with a bullet starting at the crotch.

In Figure 6.10, the first and the last emblems have been incorrectly classified as penises. Both are characters from the cartoon show "SpongeBob SquarePants". The emblem furthest to the right depicts the character "Patrick", who is common in emblems. A sample of the "Patrick" drawings from the non-offensive category is shown in Figure 6.13, with the incorrectly classified emblem furthest to the right. The pink color, combined with Patrick's pointed head and eyeballs, proves hard for the classifier to separate from a drawing of a penis.


Fig. 6.13: Emblems from the non-offensive category containing the SpongeBob character Patrick

A pattern found in some of the non-offensive emblems classified as swastikas is the presence of an eagle in the center of the image. This is a common pattern in the swastika emblems, as shown in Figure 6.14.

Fig. 6.14: A small sample of the emblems in the swastika category containing eagles

6.1.5 Step 3 - Determine bottlenecks in performance

The accuracy reaches above 99% on the training batches when the model is given enough training epochs. The error on the validation set is considerably larger, indicating a problem due to overfitting or high variance. Monitoring the validation performance across epochs indicates that the problem is not due to excessive training: the validation performance shows no sign of either dropping or improving after 4000 epochs.

As presented in the method chapter, the options in this kind of situation are typically to gather more data, add or increase regularization, or try a new model architecture. Gathering more data is often the best alternative to start with, according to Ng [14], and was chosen as the goal for the second iteration.


6.2 Results iteration 2

6.2.1 Step 4 - Repeatedly make incremental changes

Common data warehouse and web-labeling service

To reduce the gap in accuracy between training and testing, the goal of the second sprint was to increase the size of the labeled dataset. After researching the databases for previous Battlefield games, another 25 000 emblems that had been marked as offensive were extracted. After discussions with Uprise, employees volunteered to help out with the fine-grained labeling of the dataset. In order for this labeling to be done, the database set up for the thesis needed to be exposed for labeling by others than myself; previously, the labeling was done solely on my local workstation. The next step in the thesis project was therefore to expose the database through a web user interface, where employees could click and label the dataset. A UI presenting each classifier experiment together with its hyperparameters and performance was also implemented. Figure 6.15 displays a screenshot of the labeling UI. Selecting an emblem marks it with a blue background, and the label can be submitted to the database by clicking the button at the top.

The dataset was increased from 5000 emblems to about 10 000 emblems over the following weeks. After the second iteration's data extraction and labeling phase, new experiments were run.

Fig. 6.15: Web-labeling service user-interface

References
