
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2019

Effects of Transfer Learning on Data Augmentation with Generative Adversarial Networks

ADAM JACOBS
OLLE BERGLÖF

Abstract

Data augmentation is a technique that acquires more training data by augmenting available samples, where the training data is used to fit model parameters. Data augmentation is utilized due to a shortage of training data in certain domains and to reduce overfitting. Augmenting a training dataset for image classification with a Generative Adversarial Network (GAN) has been shown to increase classification accuracy. This report investigates whether transfer learning within a GAN can further increase classification accuracy when utilizing the augmented training dataset. The method section describes a specific GAN architecture for the experiments that includes a label condition. When using transfer learning within this GAN architecture, a statistical analysis shows a statistically significant increase in classification accuracy for a classification problem with the EMNIST dataset, which consists of images of handwritten alphanumeric characters. In the discussion section, the authors analyze the results and motivate other use cases for the proposed GAN architecture.

Keywords

data augmentation, generative adversarial networks, GAN, image classification, transfer learning, image generator, generating training data, machine learning

Effekten av transferlärande på datautökning med generativt adversarialt nätverk

Sammanfattning

Data augmentation is a method that creates more training data by augmenting existing training data, where the training data is used to fit a model's parameters. Data augmentation is used because of a shortage of training data in certain domains and to reduce overfitting. Augmenting a training dataset for image classification with a generative adversarial network (GAN) has been shown to increase classification accuracy. This report investigates whether transfer learning within a GAN can further increase classification accuracy when an augmented training dataset is used. The method describes a specific GAN architecture that includes a label condition. When transfer learning is used within the chosen GAN architecture, a statistical analysis shows a statistically significant increase in classification accuracy for a classification problem on the EMNIST dataset, which contains images of handwritten letters and digits. The discussion analyzes the reasons behind the results and mentions further areas of application.


Acknowledgements

The authors would like to acknowledge our supervisor, Pawel Herman, for giving us much helpful feedback and for supporting the idea of this thesis. We also thank Quintus Roos for lending us computational power for the training of the GANs. Last, but not least, the authors thank their families, friends, and partners for their support.


Authors

Adam Jacobs and Olle Berglöf

Information and Communication Technology
KTH Royal Institute of Technology

Place for Project

Stockholm, Sweden

Supervisor

Pawel Herman

KTH Royal Institute of Technology

Examiner

Örjan Ekeberg


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Scope
  1.3 Thesis Outline
2 Background
  2.1 Image Classification
  2.2 Image Augmentation
  2.3 Neural Networks
    2.3.1 Artificial Neural Network
    2.3.2 Convolutional Neural Network
  2.4 Generative Models
    2.4.1 Generative Adversarial Network
    2.4.2 Conditional GAN
  2.5 Transfer Learning with Neural Networks
  2.6 Related Work
3 Method
  3.1 Dataset
    3.1.1 Utilized Data
  3.2 Implementations
    3.2.1 GANs
    3.2.2 Classifier
  3.3 Tools
  3.4 Workflow
  3.5 Evaluation
4 Results
  4.1 GAN Results
    4.1.1 DTR.1n
    4.1.2 DTR.4n
    4.1.3 DTR.10n
  4.2 Classification Results
    4.2.1 Classification Test Accuracy
    4.2.2 Comparison of All Classifiers
    4.2.3 Comparison of Classifier 2 and Classifier 3
5 Discussion
  5.1 Analysis of Transfer-learned GAN
  5.2 Limitations
  5.3 Future Work
  5.4 Ethical Aspects
  5.5 Conclusion
References
A GAN Architecture


Chapter 1

Introduction

Data augmentation is a technique used to acquire more training data by augmenting available samples, where training data refers to the data used in the process of fitting parameters for a model, commonly in classification [1]. Data augmentation is utilized due to a shortage of training data in certain domains. For images, data augmentation has traditionally been performed with cropping, rotation, reflection and scaling [2]. Results indicate that realistic synthetic data can reduce overfitting, the phenomenon where a model fits a limited dataset too closely [1], and thus increase the accuracy and generalization of a model [2].

Within the area of image classification, the Neural Network (NN) is an effective approach for achieving highly accurate results. However, NN-based models require large amounts of training data in order to achieve robustness and accurate classification [3]. Therefore, data augmentation techniques are commonly utilized in the training process of NNs on scarce datasets for image classification [2].

Classical methods for image data augmentation enable synthesis of training data with minimal variability, which has only a small effect on reducing overfitting [2]. However, one model that can achieve higher variability is the Generative Adversarial Network (GAN) [4].

The GAN model consists of two NNs, hence much training data is needed in the training process [4]. Concerning data augmentation with GANs, the training data that is to be augmented is usually scarce, and therefore a challenge arises in the training process of the GAN since it trains on the same limited dataset. However, a technique that can address the problem of scarce training datasets when training GANs is transfer learning: the learning paradigm of utilizing information gained while training in one domain and thereafter applying it to a related, although different, domain [5]. Transfer learning thus helps the training process of GANs, but it is uncertain what effect data augmentation with a transfer-learned GAN has on the actual performance of an image classifier.

1.1

Problem Statement

Scarce datasets are a common problem within image classification, which data augmentation alleviates [2]. This report investigates the impact of data augmentation with GANs that utilize transfer learning on the performance of image classification. Improving the impact of data augmentation on image classification performance opens up the opportunity to utilize smaller training datasets. To conclude, the research question in this thesis is: what effect does data augmentation with transfer-learned GANs have on image classification?

1.2

Scope

In the experiments of this report the EMNIST [6] dataset will be utilized, which contains images of handwritten alphanumeric characters. A Convolutional Neural Network (CNN) is used for classification, due to its wide acceptance as one of the primary models within image classification [7]. More specifically, the CNN named RESNET-18 [8] is used. Concerning GAN architectures, a Conditional GAN will be utilized. Transfer learning in this thesis consists of transferring knowledge of handwriting from a large set of letters to a small set of digits. Lastly, the effect in the research question is quantified by classification accuracy.

1.3

Thesis Outline

The second chapter of this report presents relevant theory and related research. The subsequent third chapter consists of the method of the experiments. The fourth chapter presents the results obtained from the experiments. The fifth and final chapter discusses the results and provides conclusions.


Chapter 2

Background

2.1

Image Classification

Image classification is the use of computer algorithms to approximate or select a type for a given image or the objects that are present in an image. The focus of this report is image classification where an image belongs to one of a set of classes [7]. A common paradigm, called supervised learning, that image classification algorithms utilize to achieve high classification results is learning from examples in the form of training data consisting of previously labeled images [7]. One useful model in image classification that can utilize supervised learning is the Neural Network (NN) [7]. However, in order to achieve accurate and robust classification results, NNs require much training data that also contains a versatile representation of the relevant objects to be classified, so that they accurately model the real world. The effort to program robust image classifiers has resulted in the common practice of utilizing image data augmentation in order to produce more training data [2].


2.2

Image Augmentation

Image augmentation is a technique used to acquire more training data by utilizing available samples. Image augmentation is utilized due to the existence of scarce training datasets in some domains, with the goal being reduction of overfitting within the classification problem [2]. Classical techniques for image augmentation are cropping, rotation, reflection and scaling. Previous results indicate that realistic synthetic data can reduce overfitting and hence increase a model’s accuracy and generalization [2]. A more recent method is using the Generative Adversarial Network (GAN) model for image augmentation by generating synthetic images with higher variability than classical methods, and therefore increasing the representative power of the training dataset and minimizing overfitting [4].

2.3

Neural Networks

2.3.1

Artificial Neural Network

An Artificial Neural Network (ANN), commonly called a Neural Network (NN), is a graph composed of neurons connected by weighted edges. Elements from a vector are put into the neurons in the input layer, passed through the hidden layer neurons, and finally through the output neurons, which produce an output vector. The structure of a NN is inspired by biological networks of neurons [9]. A visualization of a NN is seen in figure 2.3.1.

Figure 2.3.1: Structure of a NN with one hidden layer. The values in an input vector go into the neurons in the input layer and are fed forward into the hidden layer and thereafter the output layer, which produces an output vector. Within classification, the output commonly represents probabilities of each possible class of the input. For image classification, the input could be the pixel values of an image. The weights are scalar values that are used together with a whole layer's input to perform a transformation, which thereafter is fed forward into the next layer [10].

The learning process of a NN consists of two steps. During the first learning step, called feed forward, the input vector is fed forward all the way to the output layer. At all layers the vectors are linearly transformed, in the form of a dot product together with the weights, and thereafter put into an activation function that outputs a value that is forwarded to a neuron in the next layer. The sigmoid function is a common activation function, which is utilized in the example below. Each layer additionally contains a set of weight parameters.

For example, after receiving the input [x1, x2, x3] when there are three edges that point to a neuron n in the next layer with weight values w1, w2 and w3, the linear transform z = x1·w1 + x2·w2 + x3·w3 is put into the sigmoid function s(z) = 1/(1 + e^(−z)), which thereafter is fed as an input value into neuron n. Multiple values from different neurons make up the input vector for the neuron n.

For each training data vector the feed forward process is performed all the way to the output layer, followed by computing the error between the received output vector and the expected output of the training data points. The error is expressed with a specific loss function, for example the mean squared error. Following this computation, the second step, called backpropagation, is initiated by having the error propagate backwards in the network and tuning the weight parameters to produce output results that are more similar to the expected outputs in the training data [9].
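To make the feed-forward arithmetic above concrete, the following minimal Python sketch reproduces the single-neuron example; the input and weight values are illustrative and not taken from the thesis.

import math

def sigmoid(z):
    # Sigmoid activation: s(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical input vector [x1, x2, x3] and edge weights [w1, w2, w3].
x = [0.5, -1.0, 2.0]
w = [0.1, 0.4, -0.3]

# Linear transform z = x1*w1 + x2*w2 + x3*w3, followed by the activation.
z = sum(xi * wi for xi, wi in zip(x, w))
value_into_neuron_n = sigmoid(z)
print(z, value_into_neuron_n)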

2.3.2

Convolutional Neural Network

A Convolutional Neural Network (CNN) is a feed-forward NN that is not fully connected between each layer, meaning that not all neurons have edges to all other neurons within the next layer. CNNs are primarily used for input in the form of images and are inspired by the biological brain's visual cortex [11]. The images are usually represented as tensors with three dimensions: width, height, and color. Furthermore, CNNs are widely accepted as one of the primary models for image classification and recognition, due to their ability to learn high-level features in images [7].

The important addition in the CNN, compared to the NN, is the convolutional layers that perform subsampling of the fed-forward data. The subsampling process consists of performing a transformation with adjacent pixels (assuming the data represents image pixels) that produces scalar values, which are then fed forward again [11]. A CNN consists of one or multiple convolutional layers which, due to the subsampling, perform an extraction of image features that thereafter are fed into regular, fully connected NN layers for the classification [12]. With more convolutional layers, more image features can be extracted [2]. See a visual representation in figure 2.3.2.

Figure 2.3.2: Image of a CNN that visualizes the subsampling processes in the convolutional layers, where adjacent pixels of the input are transformed into scalar values; this is illustrated in the image by the rectangles that are transformed into a point in the next layer. This process is repeated throughout the network, and the result is finally fed into a regular artificial neural network to produce an output vector (red neurons) as the classification result [5].

2.4

Generative Models

2.4.1

Generative Adversarial Network

A generative adversarial network (GAN) is comprised of two NNs (commonly two CNNs): a generator G and a discriminator D. The generator G generates new instances of data, whilst the discriminator D attempts to decide the authenticity of the generated data samples, in other words whether the samples originate from G or from the actual training data. The goal of the training of G is to maximize the probability of D classifying the generated samples as authentic. If the discriminator becomes too good due to its training there will be issues in the training of the generator, and vice versa. On a high level, the steps in a GAN are the following: G takes in a sample from a probability distribution and outputs an image; the generated image, or a sample from the actual dataset, is fed into D; D returns a number between 0 and 1, where 1 represents authentic and 0 fake [4]. See figure 2.4.1 for a visualization of a GAN.

GANs typically use binary cross entropy as the loss function: L(x, y) = −[y log(x) + (1 − y) log(1 − x)] [4]. In the expression, x refers to the score ranging from 0 to 1 that is output by the discriminator (0 for fake, 1 for authentic), and y refers to the authenticity label of the sample that is put into the discriminator. When training the discriminator on authentic images y is 1, and when training on fake images y is 0. When training the generator, y is set to 1: the generated image is fed through the discriminator and the loss is calculated with the binary cross entropy function, using the discriminator's output as x.
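As a small numeric illustration of this loss (a sketch, not the thesis' implementation), the snippet below evaluates the binary cross entropy for the three situations described above; the discriminator score 0.9 is an arbitrary example value.

import math

def bce(x, y):
    # Binary cross entropy: -(y*log(x) + (1 - y)*log(1 - x)),
    # where x is the discriminator score and y the authenticity label.
    return -(y * math.log(x) + (1 - y) * math.log(1 - x))

score = 0.9  # illustrative discriminator output for some image

print(bce(score, 1))  # discriminator trained on an authentic image: small loss
print(bce(score, 0))  # discriminator trained on a fake image it believed: large loss
print(bce(score, 1))  # generator loss for that same fake image: small loss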

Figure 2.4.1: A sample from a random distribution is input into the generator, which thereafter outputs a fake, although realistic, image due to its weight parameters being tuned in a specific way during the training process. The discriminator learns to judge whether the generated images are real or not [13].

GANs are able to generate realistic synthetic images when properly trained. One example is realistic images of human faces that do not belong to real people [14]. See figure 2.4.2 for an example.


Figure 2.4.2: Images of synthetic people provided by a generator in a GAN [14].

One common failure when training GANs is mode collapse, which is when the generator collapses into a parameter setting that emits extremely similar points for different inputs. The generator will output the same point as long as the discriminator marks it as real. When the generator's output is rejected, it will instead switch to another similar point and repeat the same process. This phenomenon occurs since the discriminator processes each sample independently, so there is no mechanism that encourages the generator's outputs to differ from each other [15].

2.4.2

Conditional GAN

A Conditional GAN (CGAN) is a GAN with one modification: the label y of an image is given as an additional input, which is used to direct the generator to produce fake images of a specific image class. This allows the generated images to be easily used to train a classifier, since a classifier needs both data and corresponding labels during the training process [16]. See figure 2.4.3 for a visualization.


Figure 2.4.3: The generator’s input is both a random sample Z and a representation of a class label Y. When training on real data, the discriminator receives both the real image and the image’s label as input. When training on fake data, the discriminator receives the fake image and the label that it is supposed to belong to, according to the generator.

2.5

Transfer Learning with Neural Networks

Transfer learning is a method commonly used to train models by applying knowledge learnt in one domain to another. This learning paradigm makes it possible to train a Neural Network on a data sparse domain by leveraging learnt knowledge and computing power spent on a different, but related, domain. When using transfer learning in Neural Networks, the transferred knowledge is encoded in the weight parameters [17, 18].

2.6

Related Work

Only recently have GAN-generated synthetic images been experimented with in the training of NNs for image classification. In "Data Augmentation Generative Adversarial Networks" the authors motivate the paper by referring to the problem of "limited plausible alternative data" when using classical data augmentation methods, where the alternative data is the data produced by those methods. The result of the paper is a GAN architecture called the Data Augmentation Generative Adversarial Network (DAGAN), which learns an augmentation transformation rather than a model of a single class. Application to unseen classes is therefore possible, as the transformation does not depend on the class itself. The paper also shows that DAGAN increases classification accuracy by 13 percent when utilizing it for data augmentation in multiple experiments. In one experiment the training data is 1200 data points from the Omniglot dataset, and in another 100 data points per class from the EMNIST dataset [3].


Chapter 3

Method

3.1

Dataset

The dataset utilized in the experiments is the EMNIST dataset [6]. The dataset contains grayscale images of handwritten characters, both decimal digits and Latin letters. EMNIST has multiple ways to split the data between numbers and letters, where each split has in turn been divided into a designated training and test set [6].

In the experiments, two main datasets are utilized: one that contains all the letters and another that contains all the digits. Subsets of the digits, with varying sizes, are augmented in different ways and thereafter used to train classifiers with the same classifier model. The subsets of digits are referred to as DTR.xn (Digit TRaining) where x indicates the number of samples per class (digit) in the set. For example, DTR.1n contains one image per class and therefore the entire set contains ten images. Furthermore, another augmentation method also utilizes the set of letters. In this case the entire letter training set is used, with no subsets involved, and is referred to as LTR (Letter TRaining).


Below is a summary of the datasets used.

3.1.1

Utilized Data

• DTR.xn - Digit TRaining set - A subset of the EMNIST digit-training set with x samples per digit class.

• DTE - Digit TEsting set - The EMNIST digit-testing set. The number of samples is 40 000.

• LTR - Letter TRaining set - The EMNIST letter-training set. The number of samples is 124 800.

In total three different training sets are used: DTR.1n, DTR.4n and DTR.10n. These sizes are chosen to examine the effect of the different augmentation techniques at different levels of training data scarcity.
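A minimal sketch of how such subsets could be assembled with torchvision's EMNIST loader is shown below; the loader usage and the helper make_dtr are illustrative assumptions, not the code used in the thesis.

from collections import defaultdict

from torch.utils.data import Subset
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
digit_train = datasets.EMNIST("data", split="digits", train=True,
                              download=True, transform=to_tensor)   # source for DTR.xn
letter_train = datasets.EMNIST("data", split="letters", train=True,
                               download=True, transform=to_tensor)  # LTR
digit_test = datasets.EMNIST("data", split="digits", train=False,
                             download=True, transform=to_tensor)    # DTE

def make_dtr(dataset, samples_per_class):
    # Pick the first samples_per_class images of every digit class (DTR.xn).
    picked, counts = [], defaultdict(int)
    for idx, (_, label) in enumerate(dataset):
        if counts[label] < samples_per_class:
            picked.append(idx)
            counts[label] += 1
        if len(picked) == samples_per_class * 10:
            break
    return Subset(dataset, picked)

dtr_1n = make_dtr(digit_train, 1)    # 10 images
dtr_4n = make_dtr(digit_train, 4)    # 40 images
dtr_10n = make_dtr(digit_train, 10)  # 100 images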

3.2

Implementations

The purpose of the implementations is to investigate what effect different ways of augmenting training data, of varying sizes, have on the classification performance of a classifier that is trained on the augmented data. The two augmentation methods are referred to as the Vanilla GAN and the Transfer-learned GAN. Three classifiers are trained for each training dataset: one only on the digits in the training set, one on the digit training set together with the images generated by the Vanilla GAN, and one on the digit training set together with the images generated by the Transfer-learned GAN.

3.2.1

GANs

3.2.1.1 Variable Descriptions

• Z refers to the size of the noise vector that is fed forward into the generator. The chosen probability distribution used to generate noise input to the generator is the 100-dimensional standard Gaussian distribution. Thus Z is set to 100.

• C refers to the number of possible conditions in a training setting. For example, when the training data only consists of decimal digits there are ten possible conditions, making C=10. Thus C will be 10 in the cases when training only on decimal digits and 36 when training on both decimal digits and Latin letters. See section 3.2.1.5 for more details on the condition.

• The images in EMNIST are grayscale and 28 by 28 pixels in size. Thus an image sample can be represented as a 1x28x28 tensor. However, the pre-trained RESNET-18 network expects images that are 64 by 64 pixels in size as input. Therefore all samples used from EMNIST are scaled up to 64 by 64 pixels before being used (a minimal transform sketch follows this list). Due to this scaling, the generator's output tensors are 1x64x64.
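A sketch of the rescaling step, assuming a torchvision transform pipeline; the interpolation mode is left at its default since the thesis does not specify one.

from torchvision import transforms

# Upscale the 28x28 grayscale EMNIST images to 64x64 and convert them to
# 1x64x64 tensors, matching the input size expected by the networks.
emnist_transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])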

3.2.1.2 GAN Architecture

The GANs utilized in this report are Deep Convolutional GANs (DCGAN), an architecture that has been shown to produce more representative images than the original GAN architecture [19]. The same learning rate of 0.0002 and batch normalization momentum of 0.5 are used as proposed in the DCGAN paper [19]. The exact architectures of the generator and discriminator are graphically represented in appendix A. The only difference between the Vanilla GAN and the Transfer-learned GAN is the number of possible conditions. As the graphical representations in appendix A show, this means that only a single layer at the beginning of both the generator and the discriminator differs between the Vanilla GAN and the Transfer-learned GAN.

3.2.1.4 Transfer-learned GAN

The Transfer-learned GAN is trained on data from both DTR.xn and LTR, meaning there are 36 conditions (10 for the decimal digits and 26 for the Latin letters). Therefore the architecture of the Transfer-learned GAN is defined by appendix A with C=36.

3.2.1.5 Condition in GANs

Both the Vanilla GAN and the Transfer-learned GAN are conditional GANs, meaning that both the generators and discriminators also take a condition as input in addition to the noise and images. For both the discriminator and the generator, the condition is an encoded representation of a class label. For the generator the condition indicates what class of character should be generated, while for the discriminator the condition indicates what class of character the image is supposed to be. The class labels are encoded as conditions using one-hot encoding, meaning if there are C different class labels then the condition is encoded as a C dimensional vector, where all elements are 0 except for one which has the value 1. For example: the condition for class label 4, where there are ten class labels in total, is the vector [0, 0, 0, 0, 1, 0, 0, 0, 0, 0].

This one-dimensional one-hot encoding is used directly for the generators. Thus the dimensionalities of the inputs to the generators are Z for the input noise and C for the input condition.

The main input to the discriminators is a grayscale image of size 1x64x64. To be able to perform equivalent operations on the input condition, it is encoded as a Cx64x64 tensor. This is achieved by interpreting the one-dimensional one-hot encoding as a Cx1x1 tensor that is scaled up to Cx64x64. Figure 3.2.1 illustrates this encoding process. The resulting encoding can be interpreted as a 64x64 image with C color channels where all pixels are the same, with all channels set to 0 except one which is set to 1.

Figure 3.2.1: A visual representation of the conversion from one dimensional condition to three dimensional condition that acts as input to the discriminators. The white layers on both figures are layers filled with zeros while the shadowed layers are filled with ones. Both figures are C layers in depth. The figure to the left is Cx1x1 in size while the figure to the right is Cx64x64 in size.
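The two encodings can be produced with a few tensor operations; the sketch below uses illustrative variable names (c1d, c3d) that match the training-process figure and is an assumption rather than the thesis' exact code.

import torch

C = 10      # number of conditions when training only on digits
label = 4   # example class label

# 1-dimensional one-hot condition for the generator: a C-vector.
c1d = torch.zeros(C)
c1d[label] = 1.0            # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# 3-dimensional condition for the discriminator: Cx1x1 expanded to Cx64x64.
c3d = c1d.view(C, 1, 1).expand(C, 64, 64)
# Every pixel now has all channels set to 0 except channel `label`, which is 1.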

3.2.1.6 GAN Training Process

3.2.1.7 Training Process

One iteration during the training process is explained in figure 3.2.2. The training parameters were chosen empirically. The batch size is 10 and the number of epochs is 500. For the real and fake images to have the same distribution of class labels, the same batch of conditions is used for the fake images as for the real images in the same iteration.

1: b ← next mini-batch of labeled images from the training set
2: x ← images from b
3: y ← image labels from b
4: c1d ← 1d-one-hot(y)
5: c3d ← 3d-one-hot(y)
6: o ← feedforward(Discriminator, x, c3d)
7: e ← BCELoss(o, 1)
8: backprop(Discriminator, e)
9: z ← one mini-batch of Z-dimensional noise
10: x ← feedforward(Generator, z, c1d)
11: o ← feedforward(Discriminator, x, c3d)
12: e ← BCELoss(o, 0)
13: backprop(Discriminator, e)
14: o ← feedforward(Discriminator, x, c3d)
15: e ← BCELoss(o, 1)
16: backprop(Generator, e)

Figure 3.2.2: Steps in GAN training process. “c3d” is a 3-dimensional one-hot encoding of the condition. “c1d” is a 1-dimensional one-hot encoding of the condition. BCELoss refers to the binary cross entropy loss function.
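The following PyTorch sketch mirrors the steps in figure 3.2.2 for a single iteration. It assumes conditional generator and discriminator modules with the call signatures generator(z, c1d) and discriminator(images, c3d); these signatures, and the optimizers passed in, are simplifying assumptions and not a reproduction of the architecture in appendix A.

import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_iteration(generator, discriminator, g_opt, d_opt,
                    real_images, labels, Z=100, C=10):
    n = real_images.size(0)
    c1d = torch.zeros(n, C).scatter_(1, labels.view(-1, 1), 1.0)  # 1d one-hot
    c3d = c1d.view(n, C, 1, 1).expand(n, C, 64, 64)               # 3d one-hot

    # Steps 6-8: discriminator on real images, target label 1.
    d_opt.zero_grad()
    bce(discriminator(real_images, c3d), torch.ones(n, 1)).backward()
    d_opt.step()

    # Steps 9-13: discriminator on generated images, target label 0.
    z = torch.randn(n, Z)                 # standard Gaussian noise
    fake = generator(z, c1d)
    d_opt.zero_grad()
    bce(discriminator(fake.detach(), c3d), torch.zeros(n, 1)).backward()
    d_opt.step()

    # Steps 14-16: generator update, the same fakes scored with target 1.
    g_opt.zero_grad()
    bce(discriminator(fake, c3d), torch.ones(n, 1)).backward()
    g_opt.step()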

3.2.1.8 Training Data in Transfer-learned GAN

For the Transfer-learned GAN, training data from DTR.xn and LTR is evenly distributed in every mini-batch during training. This is implemented by filling half of the batch with samples from DTR.xn and the other half with samples from LTR. From DTR.xn, samples are randomly selected without repetition, but from LTR samples are randomly selected with repetition. This means that every training epoch will process every sample from DTR.xn exactly once, while this is not the case for LTR. Since LTR is larger than DTR.xn for all values used for x, a portion of LTR will not be used in every epoch. Furthermore, samples from LTR can be processed in multiple iterations in the same epoch since samples are selected with repetition.

This means that the Transfer-learned GAN is trained not only on decimal digits but also on a wide range of Latin letters. The transfer learning occurs incrementally during the training. The idea is that the GAN will learn high-level features from the letters that are applicable when generating digits. Intuitively, this can be thought of as the GAN learning general handwriting by seeing many examples of letters, which makes it better at generating digits. This is the transfer learning investigated in this thesis and should not be confused with pre-trained neural networks that are trained on two domains in two separate processes.
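A sketch of this mini-batch composition is shown below, assuming both datasets can be indexed to return (image, label) pairs; the generator function name is illustrative.

import random

def mixed_batches(dtr, ltr, batch_size=10):
    # Yield batches that are half DTR.xn (each sample once per epoch, no
    # repetition) and half LTR (sampled with repetition).
    half = batch_size // 2
    dtr_order = random.sample(range(len(dtr)), len(dtr))  # shuffled DTR indices
    for start in range(0, len(dtr_order), half):
        dtr_part = [dtr[i] for i in dtr_order[start:start + half]]
        ltr_part = [ltr[random.randrange(len(ltr))] for _ in range(half)]
        yield dtr_part + ltr_part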

3.2.2

Classifier

The classifier used is the convolutional neural network RESNET-18 [8], which has been pre-trained on the ILSVRC subset of ImageNet, a dataset of millions of images spanning 1000 categories [20]. The pre-training of this classifier is not related to the transfer learning this thesis investigates for the GANs: ImageNet contains photographs and no handwritten characters [20]. Most of the network is kept as is without being modified during training. The output layer of RESNET-18 is a fully connected layer with 1000 outputs. This last layer is replaced with a new fully connected layer with only 10 outputs in order to match the number of classes for the EMNIST digits. This last fully connected layer is the only part of the classifier that is trained. Figure 3.2.3 illustrates the architecture without detailing the internals of RESNET-18.

Figure 3.2.3: A visual representation of the classifier. The edge labels indicate the shape of the layers the edge feeds into.
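A minimal sketch of this classifier using torchvision's ImageNet-pretrained ResNet-18; converting the grayscale EMNIST images to the three input channels that ResNet-18 expects is assumed to happen elsewhere and is not shown here.

import torch.nn as nn
from torchvision import models

classifier = models.resnet18(pretrained=True)   # pre-trained on ImageNet (ILSVRC)

# Freeze the pre-trained layers; only the new output layer is trained.
for param in classifier.parameters():
    param.requires_grad = False

# Replace the 1000-way output layer with a 10-way layer for the EMNIST digits.
classifier.fc = nn.Linear(classifier.fc.in_features, 10)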

Three variants of the classifier are referred to in this thesis: Classifier 1 refers to the classifier that has been trained on DTR.xn without augmentation. Classifier 2 refers to the classifier that has been trained on both DTR.xn and the data generated by the Vanilla GAN. Classifier 3 refers to the classifier that has been trained on both DTR.xn and the data generated by the Transfer-learned GAN.

Both Classifier 2 and Classifier 3 train on datasets that are augmented by a GAN. This is implemented by having every mini-batch during the training process consist of half real samples from DTR.xn and half newly generated samples from the GAN.


3.3

Tools

All neural networks are implemented, trained and tested using PyTorch, a deep learning library for Python [21]. PyTorch supports training models on a GPU, which is typically much faster than training on a CPU [22]. Therefore the training is done on a GPU, specifically an Nvidia GeForce GTX 1050 Ti with 4 GB of GDDR5 memory.

3.4

Workflow

For all three digit training datasets (DTR.1n, DTR.4n, and DTR.10n) two GANs are trained, one Vanilla GAN and one Transfer-learned GAN. Thereafter three classifiers are trained for each of the digit training datasets. Classifier 1 trains on only DTR.xn, Classifier 2 trains on DTR.xn and images generated by the Vanilla GAN, and Classifier 3 trains on DTR.xn and images generated by the Transfer-learned GAN. The performances of the classifiers are evaluated by the classification accuracy on the DTE dataset. Figure 3.4.1 illustrates an overview of the overall workflow.

Figure 3.4.1: A flowchart of the overall workflow.

3.5

Evaluation

The performance of each classifier is measured as its classification accuracy: the number of correctly classified labels in the test set divided by the number of tested samples. To ensure reliability, the comparison of the classification accuracies of the proposed classifiers will be statistically tested. Two statistical tests will be made for each training dataset: one with three test groups, one per classifier; and one with two test groups, one for C2 and one for C3. Each sample in the groups is produced by re-training the classifier with new random weights. To evaluate differences and interactions among groups, analysis of variance (ANOVA) will be utilized [23]. ANOVA displays whether the variance in the results is due to differences within or between the groups. The used hypotheses are:

• H0: the means of all groups are equal

• H1: the means of at least two groups are not equal

ANOVA results in an F-statistic, which is then used to compute the p-value [23]. This report uses a significance level of 5 percent. Thus, if the p-value is lower than 0.05, the null hypothesis is rejected.
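A sketch of this test with scipy's one-way ANOVA follows; the accuracy lists below are placeholders for the re-training runs and are not the measured results.

from scipy.stats import f_oneway

# Placeholder accuracy samples from re-trained classifiers (illustrative only).
acc_c1 = [35.2, 34.1, 36.0, 35.6]
acc_c2 = [41.5, 42.0, 41.1, 42.3]
acc_c3 = [42.4, 41.9, 42.8, 42.1]

f_stat, p_value = f_oneway(acc_c1, acc_c2, acc_c3)  # three-group comparison
if p_value < 0.05:
    print("Reject H0: at least two group means differ")
else:
    print("Fail to reject H0")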


Chapter 4

Results

The results were produced via two separate activities. The first was the implementation of GANs that generate images of high enough quality. The process of generating images is described in the method section. The quality of the images was qualitatively evaluated by visually comparing them with images in the authentic training set. Samples of generated handwritten digits can be seen in section 4.1. The second activity was training the three different classifiers with the datasets described in the method section and the generated images produced in the first activity. A statistical comparison between the classifiers can be seen in section 4.2.

4.1

GAN Results

In this section, results are presented in the form of generated images from the two different GANs. The quality of the images was qualitatively evaluated by a visual comparison with the authentic training dataset. Results are displayed for each dataset and for both the Vanilla and the Transfer-learned GAN. Observe that the images are mirrored and rotated in the datasets. Authentic images are also shown in order to be able to compare the results.


4.1.1

DTR.1n

Figures 4.1.1 and 4.1.2 show authentic samples from DTR.1n. Figures 4.1.3 and 4.1.4 show synthetic samples from the Vanilla GAN when trained on DTR.1n. Figures 4.1.5 and 4.1.6 show synthetic samples from the Transfer-learned GAN, trained on DTR.1n and LTR.

Figure 4.1.1: The sample of a zero from the DTR.1n dataset.

Figure 4.1.2: The sample of a five from the DTR.1n dataset.

4.1.1.1 Vanilla GAN

Figure 4.1.3: 10 samples of zeros generated by the Vanilla GAN after being trained on DTR.1n.

Figure 4.1.4: 10 samples of fives generated by the Vanilla GAN after being trained on DTR.1n.


4.1.1.2 Transfer-learned GAN

Figure 4.1.5: 10 samples of zeros generated by the Transfer-learned GAN after being trained on DTR.1n.

Figure 4.1.6: 10 samples of fives generated by the Transfer-learned GAN after being trained on DTR.1n.

4.1.2

DTR.4n

Figures 4.1.7 and 4.1.8 show authentic samples from DTR.4n. Figures 4.1.9 and 4.1.10 show synthetic samples from the Vanilla GAN, trained on DTR.4n. Figures 4.1.11 and 4.1.12 show synthetic samples from the Transfer-learned GAN, trained on DTR.4n and LTR.

Figure 4.1.7: The samples of zeros from the DTR.4n dataset

Figure 4.1.8: The samples of fives from the DTR.4n dataset


Figure 4.1.10: 10 samples of fives generated by the Vanilla GAN after being trained on DTR.4n.

4.1.2.2 Transfer-learned GAN

Figure 4.1.11: 10 samples of zeros generated by the Transfer-learned GAN after being trained on DTR.4n.

Figure 4.1.12: 10 samples of fives generated by the Transfer-learned GAN after being trained on DTR.4n.

4.1.3

DTR.10n

Figures 4.1.13 and 4.1.14 show authentic samples from the DTR.10n. Figures 4.1.15 and 4.1.16 show synthetic samples from the vanilla GAN, trained on DTR.10n. Figures 4.1.17 and 4.1.18 show synthetic samples from the Transfer-learned GAN, trained on DTR.10n and LTR.

Figure 4.1.13: The samples of zeros from the DTR.10n dataset


4.1.3.1 Vanilla GAN

Figure 4.1.15: 10 samples of zeros generated by the Vanilla GAN after being trained on DTR.10n.

Figure 4.1.16: 10 samples of fives generated by the Vanilla GAN after being trained on DTR.10n.

4.1.3.2 Transfer-learned GAN

Figure 4.1.17: 10 samples of zeros generated by the Transfer-learned GAN after being trained on DTR.10n.

Figure 4.1.18: 10 samples of fives generated by the Transfer-learned GAN after being trained on DTR.10n.

4.2

Classification Results

Section 4.2.1 shows the mean and variance of the test classification accuracy for the classifiers. Section 4.2.2 statistically compares the three classifiers. Section 4.2.3 statistically compares classifier 2 and 3.

The following abbreviations are used in the ANOVA tables:

• DF - Degrees of freedom
• SS - Sum of squares
• MS - Variance estimate (mean square)
• F - The calculated F statistic
• P - The corresponding p-value

4.2.1

Classification Test Accuracy

Figures 4.2.1 and 4.2.2 show the means and variances of each classifier on the respective datasets.

               DTR.1n   DTR.4n   DTR.10n
Classifier 1   35.11    58.27    74.51
Classifier 2   41.71    60.86    76.54
Classifier 3   42.27    61.74    77.31

Figure 4.2.1: The average classification test accuracy (in percent) for all classifiers, for each training set.

               DTR.1n   DTR.4n   DTR.10n
Classifier 1   12.43     8.55     3.25
Classifier 2    9.17     7.76     2.19
Classifier 3   11.81     9.21     1.37

Figure 4.2.2: The variance of the measured classification test accuracy for all classifiers, for each training set.

4.2.2

Comparison of All Classifiers

An ANOVA was made for each training dataset, with the three different classifiers making up the three test groups. ANOVA displays whether the variance in the results is due to differences within or between the groups. The null hypothesis is that the means of the three groups are equal. The ANOVA table layout is from the University of Washington [24]. The significance level is 5 percent. Figures 4.2.3, 4.2.4, and 4.2.5 show the ANOVA tables for each dataset when comparing all three classifiers.

Observe that no p-values are exactly 0. The calculations are made using double precision floating point arithmetic, which cannot express values closer to 0 than about 10^-16. When calculated p-values are in that range they are simply represented as 0 in the tables, since it is not certain how small they actually are.

4.2.2.1 DTR.1n

Source    DF    SS         MS        F        P
Between   2     9382.33    4691.16   419.88   0
Within    885   9887.86    11.17     -        -
Total     887   19270.18   -         -        -

Figure 4.2.3: ANOVA values between and within groups when comparing all three classifiers on DTR.1n. P is the significance probability for F. With the significance level 5 percent, the null hypothesis is rejected.

4.2.2.2 DTR.4n

Source    DF    SS         MS        F        P
Between   2     9382.32    4691.16   419.88   0
Within    885   9887.85    11.17     -        -
Total     887   19270.18   -         -        -

Figure 4.2.4: ANOVA values between and within groups when comparing all three classifiers on DTR.4n. P is the significance probability for F. With the significance level 5 percent, the null hypothesis is rejected.


4.2.2.3 DTR.10n

Source    DF    SS        MS       F        P
Between   2     1240.40   620.20   272.00   0
Within    885   2017.93   2.28     -        -
Total     887   3258.33   -        -        -

Figure 4.2.5: ANOVA values between and within groups when comparing all three classifiers on DTR.10n. P is the significance probability for F. With the significance level 5 percent, the null hypothesis is rejected.

4.2.3

Comparison of Classifier 2 and Classifier 3

An ANOVA was made for each training dataset, with C2 and C3 making up the two test groups. ANOVA displays whether the variance in the results is due to differences within or between the groups. The null hypothesis is that the means of the two groups are equal. The significance level is 5 percent. Figures 4.2.6, 4.2.7, and 4.2.8 show the ANOVA tables for each dataset when comparing C2 and C3.

4.2.3.1 DTR.1n

Source    DF    SS        MS      F      P
Between   1     45.56     45.56   4.33   0.038
Within    590   6209.65   10.52   -      -
Total     591   6255.21   -       -      -

Figure 4.2.6: ANOVA values between and within groups when comparing C2 and C3 on DTR.1n. P is the significance probability for F. With the significance level 5 percent, the null hypothesis is rejected.

4.2.3.2 DTR.4n

Source    DF    SS        MS       F       P
Between   1     114.20    114.20   13.41   2.73e-4
Within    590   5024.09   8.52     -       -
Total     591   5138.30   -        -       -

Figure 4.2.7: ANOVA values between and within groups when comparing C2 and C3 on DTR.4n. P is the significance probability for F. With the significance level 5 percent, the null hypothesis is rejected.

4.2.3.3 DTR.10n

Source    DF    SS        MS      F       P
Between   1     87.45     87.45   48.88   7.41e-12
Within    590   1055.68   1.79    -       -
Total     591   1143.13   -       -       -

Figure 4.2.8: ANOVA values between and within groups when comparing C2 and C3 on DTR.10n. P is the significance probability for F. With the significance level 5 percent, the null hypothesis is rejected.


Chapter 5

Discussion

5.1

Analysis of Transfer-learned GAN

The result of the ANOVA method demonstrates a significant difference in the means of the classification accuracies of the three classifiers, as well as between C2 and C3. Therefore, in the experiments of this report, value is gained by utilizing the Transfer-learned GAN data augmentation scheme, which outperforms both the Vanilla GAN and using no data augmentation at all.

When comparing images from the Vanilla and Transfer-learned GAN there is a distinct difference in the variability of the samples. In other words, the images of digits from the Vanilla GAN are quite similar, in contrast to images from the Transfer-learned GAN, which exhibit variance in the digits' line and curve structures. The variation in the structure of the digits generated by the Transfer-learned GAN provides more realism to the training dataset, which increases the classification accuracy when testing on many human handwritten digits.

The Transfer-learned GAN provides a superior modeling of the digits, since real-world handwritten digits show much variability across handwriting styles. The increased variability can most likely be explained by the Transfer-learned GAN being incrementally trained on the letter training dataset, gaining knowledge influenced by the structure of letters. Letters have different structures than digits and will therefore influence the weights, which encode the curve and line features of the images, during training of the generator, and hence provide variation in the generation process of digits.

Additionally, this report demonstrates the possibility of utilizing highly scarce datasets for image classification when they are combined with transfer learning within the data augmentation model. The classification accuracy increased even when utilizing only one sample per digit class, which opens up the opportunity to utilize fewer samples for other datasets as well.

5.2

Limitations

The experiments with the GANs are limited to the specific GAN architecture described in the method section. Other GAN architectures could provide different results. The implementation in this thesis is considered to be naive in comparison with related research, which proposed a GAN architecture specifically for data augmentation, called the Data Augmentation Generative Adversarial Network (DAGAN) [3]. The DAGAN is constructed for its domain, whilst the GAN architecture in this report is general, hence the implementation is considered to be naive. With a more sophisticated architecture the Transfer-learned GAN could possibly utilize the transferred knowledge more efficiently.

Another factor is the limitation to the EMNIST dataset, which constituted the training and test data in the experiments. In contrast to the statistically significant results of the experiments within this thesis, the effect of transfer learning in the Transfer-learned GAN data augmentation scheme could vary for different datasets. Furthermore, the generator and discriminator in the GANs in this thesis consist of CNNs. Due to CNNs' superior ability to extract high-level features from images, there is a possibility of transfer learning in GAN data augmentation having a larger effect on more complex image datasets, due to the existence of more high-level features.

Lastly, there is a risk that the GANs in the experiments suffer from mode collapse, meaning there would be no variance in the generated samples.

5.3

Future Work

To begin with, the most straightforward research topic is continuing the work in this report by changing the scope. As mentioned in the limitations, the following can be experimented with: GAN architectures, datasets, and training parameters. Specifically, the authors see value in researching data augmentation for scarce datasets with more practical applications. One example is medical data, which is scarce due to patient confidentiality. Generating more medical data can help classifiers make better medical decisions and predictions, thus saving human lives.

Furthermore, the proposed GAN architecture in this thesis could possibly be applied to fields other than data augmentation, within the general subject of data generation. The first GAN paper was published in 2014 [4], which means that the general GAN model is still young and there is still much research to be done.

When training the GANs for the experiments the primary bottleneck was training time. To be able to produce empirically optimized GAN architectures, continued research is needed within the hardware and learning algorithm fields, which will provide more efficient and powerful training schemes for machine learning models, specifically neural networks.

5.4

Ethical Aspects

There are many opportunities in the possible generative power of GANs, which could result in both positive and hazardous applications. One positive example is that the work in this report could be further developed to aid in medical diagnosis. A hazardous example, which applies to GANs in general, is generating synthetic media, for example videos, that a human cannot tell apart from authentic material. Generally, if GANs are further developed there must be easily available structures and tools that can authenticate generated data and media. Otherwise there is a risk of groups being misinformed and deceived. Nonetheless, this discussion will not analyze the consequences of a GAN-generated media deception.

5.5

Conclusion

There is a statistically significant difference between the mean results of the classifiers that utilize data augmentation, where the Transfer-learned GAN results in a better classification accuracy. In conclusion, there is therefore a positive effect on classification accuracy when utilizing transfer learning within the specific Conditional GAN architecture described in this thesis. However, more research is needed in order to utilize the transfer-learned GAN data augmentation scheme more efficiently and for more practical applications.


Bibliography

[1] James, Gareth, Witten, Daniela, Hastie, Trevor, and Tibshirani, Robert. An introduction to statistical learning. Vol. 112. Springer, 2013.

[2] Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems. 2012, pp. 1097–1105.

[3] Antoniou, Antreas, Storkey, Amos, and Edwards, Harrison. "Data augmentation generative adversarial networks". In: arXiv preprint arXiv:1711.04340 (2017).

[4] Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. “Generative adversarial nets”. In: Advances in neural information processing systems. 2014, pp. 2672–2680.

[5] Saha, Sumit. A comprehensive guide to convolutional neural networks– the ELI5 way. 2018.

[6] Cohen, Gregory, Afshar, Saeed, Tapson, Jonathan, and Schaik, André van. “EMNIST: an extension of MNIST to handwritten letters”. In: arXiv preprint arXiv:1702.05373 (2017).

[7] Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, et al. “Imagenet large scale visual recognition challenge”. In: International journal of computer vision 115.3 (2015), pp. 211–252.


[8] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. “Deep residual learning for image recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770– 778.

[9] Kriesel, David. A Brief Introduction to Neural Networks. Available at http://www.dkriesel.com. 2007.

[10] Nasrabadi, Nasser M. “Pattern recognition and machine learning”. In: Journal of electronic imaging 16.4 (2007), p. 049901.

[11] Le Roux, Nicolas and Bengio, Yoshua. “Representational power of restricted Boltzmann machines and deep belief networks”. In: Neural computation 20.6 (2008), pp. 1631–1649.

[12] LeCun, Yann, Kavukcuoglu, Koray, and Farabet, Clément. “Convolutional networks and applications in vision”. In: Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE. 2010, pp. 253– 256.

[13] An intuitive introduction to Generative Adversarial Networks (GANs). https://www.freecodecamp.org/news/an-intuitive-introduction-to-generative-adversarial-networks-gans-7a2264a81394/. Accessed: 2019-05-31.

[14] Karras, Tero, Laine, Samuli, and Aila, Timo. “A style-based generator architecture for generative adversarial networks”. In: arXiv preprint arXiv:1812.04948 (2018).

[15] Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. “Improved techniques for training gans”. In: Advances in neural information processing systems. 2016, pp. 2234– 2242.

[16] Mirza, Mehdi and Osindero, Simon. “Conditional generative adversarial nets”. In: arXiv preprint arXiv:1411.1784 (2014).


[18] Yosinski, Jason, Clune, Jeff, Bengio, Yoshua, and Lipson, Hod. “How transferable are features in deep neural networks?” In: Advances in neural information processing systems. 2014, pp. 3320–3328.

[19] Radford, Alec, Metz, Luke, and Chintala, Soumith. “Unsupervised representation learning with deep convolutional generative adversarial networks”. In: arXiv preprint arXiv:1511.06434 (2015).

[20] Large Scale Visual Recognition Challenge 2014 (ILSVRC2014). http://image-net.org/challenges/LSVRC/2014/browse-synsets. Accessed: 2019-05-31.

[21] PyTorch. https://www.pytorch.org. Accessed: 2019-05-31.

[22] Raina, Rajat, Madhavan, Anand, and Ng, Andrew Y. “Large-scale deep unsupervised learning using graphics processors”. In: Proceedings of the 26th annual international conference on machine learning. ACM. 2009, pp. 873–880.

[23] Ståhle, Lars, Wold, Svante, et al. "Analysis of variance (ANOVA)". In: Chemometrics and intelligent laboratory systems 6.4 (1989), pp. 259–272.

[24] Lecture 7: Hypothesis Testing and ANOVA. https://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture7.pdf. Accessed: 2019-05-31.

Appendix A

GAN Architecture

[Graphical representations of the generator and discriminator architectures.]

