

Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 18 ECTS | Cognitive Science

2021 | LIU-IDA/KOGVET-G--21/010--SE

Synthetic Image Generation Using GANs

Generating Class-Specific Images of Bacterial Growth

Syntetisk bildgenerering med GANs

Marianne Mattila

Supervisor: Erik Marsja
Examiner: Arne Jönsson


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Mastitis is the most common disease affecting Swedish milk cows. Automatic image classification can be useful for quickly classifying the bacteria causing this inflammation, in turn making it possible to start treatment more quickly. However, training an automatic classifier relies on the availability of data. Data collection can be a slow process, and GANs are a promising way to generate synthetic data that adds plausible samples to an existing data set. The purpose of this thesis is to explore the usefulness of GANs for generating images of bacteria. This was done through researching existing literature on the subject, implementing a GAN, and evaluating the generated images. A cGAN capable of generating class-specific bacteria was implemented and improved upon. The images generated by the cGAN were evaluated using visual examination, rapid scene categorization, and an expert interview regarding the generated images. While the cGAN was able to replicate certain features in the real images, it failed in crucial aspects such as symmetry and detail. It is possible that other GAN variants may be better suited to the task. Lastly, the results highlight the challenges of evaluating GANs with current evaluation methods.


Acknowledgments

To Agricam, especially my external supervisor Erika Anderskär, for their trust and guidance. To my supervisor Erik Marsja for the helpful advice regarding both the thesis content and the writing process. To my friends for the invaluable support system.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Aim
  1.2 Research questions
  1.3 Delimitations
  1.4 Structure
2 Theory
  2.1 Artificial Neural Networks
  2.2 Convolutional Neural Networks
  2.3 Generative Adversarial Networks
  2.4 GAN variations
  2.5 Evaluating GANs
3 Method
  3.1 Data
  3.2 Model
  3.3 Evaluation
4 Results
  4.1 Visual Examination
  4.2 Rapid Scene Categorization
  4.3 Post-Test Interview
5 Discussion
  5.1 Results
  5.2 Method
  5.3 Continuation
6 Conclusion
Bibliography


List of Figures

3.1 Sample image from the dataset (ecoli present).
3.2 Distribution of labels in the dataset.
3.3 cGAN architecture. Left: discriminator. Right: generator.
4.1 Examples of images produced by different iterations of the model.


List of Tables

4.1 Rapid scene categorization results


1 Introduction

The most common disease affecting cows on Swedish farms is udder inflammation, also known as mastitis [15]. These inflammations cause the animals suffering and are a driving factor in the usage of antibiotics on farm animals. It is therefore of interest to develop solutions that can help with mastitis treatment and prevention.

Most types of mastitis are caused by bacteria, and in order to administer correct treatment it is important to first identify which type of bacteria is causing the inflammation. Correct identification may also help farmers understand which bacteria are present or have become a recurring problem on their farms, which aids them in taking preventative measures as well. This is where Agricam's product Bacticam comes in. Bacticam allows farmers to grow and analyze bacteria on-site, speeding up a process that could otherwise take days. This is done through taking images of bacterial growth plates and feeding them through an image classifier designed to tell bacteria apart.

Data collection is a resource-intensive part of training this classifier. Providing the classifier with more data to train on could improve its performance, allowing for quicker identification of bacteria and in turn quicker treatment of mastitis cases. Synthetic data generation is a field that has generated a lot of interest in recent years, as being able to augment and modify existing data sets may be a viable way to quickly and cost-efficiently increase the size of a data set. Models such as generative adversarial networks (GANs) have shown promise in generating images before and could perhaps be applied here as well.

1.1 Aim

The aim of the thesis is to evaluate whether generative adversarial networks can be useful for generating synthetic images of bacteria. The aim is to build a machine learning model based on a thorough literature study, finding inspiration from other fields. Should this model produce images of a high enough quality, the hope is that the images could be used to train and further improve the existing classifier.

1.2 Research questions

1. Are GANs a viable method of generating synthetic images of bacteria?
2. How well do GANs perform at generating different types of bacteria?

1.3 Delimitations

The focus of this thesis is to create a GAN based on existing literature, not to design or propose any new variant. Due to time limitations, not all types of generative models or GANs can be considered, as some of them require months of training. This thesis will also not compare different types of GANs, but instead focus on identifying the most suitable variant for the project, which is then implemented.

1.4 Structure

The thesis contains six chapters. Chapter 2 presents a theoretical background for GANs and the technological advances that made them possible, as well as the evaluation methods that currently exist for them. Chapter 3 describes the data set used and how the model was implemented and evaluated. Chapter 4 presents the evaluation results. Chapter 5 contains a discussion of the methodology and results. Lastly, chapter 6 summarizes the thesis and the findings.

2 Theory

The goal of this chapter is to provide the reader with a short overview of the theoretical background for the thesis. It starts off by describing artificial neural networks before building on them to explain convolutional neural networks. Lastly, generative adversarial networks are presented, including variations and evaluation of them.

2.1 Artificial Neural Networks

Artificial neural networks (ANNs) are loosely inspired by their biological counterpart. Rosenblatt [23] developed the perceptron, which is the artificial counterpart to the neuron and a precursor to modern ANNs.

ANNs consist of connected layers. Each layer contains a set of nodes that are connected to nodes in the next one through weights. The activation level of a given node is determined by the strength of the weights connecting to it as well as the chosen activation function, and in a fully connected layer each node in one layer maps to every node in the next. The learned representations of the model are stored in the weights, which are updated during the course of training [3].

During training, the model attempts to minimize the loss function that is used. The loss function keeps track of how well the network models the data and maps given input to the correct output. When errors are made, the weights of the network must be updated to improve its performance. This is commonly done through backpropagating the error through the network's layers, calculating which weights are responsible for the wrong output. A common method for updating the weights is gradient descent, through which the network attempts to reach a local minimum of the loss function [24].
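As a compact illustration, a single gradient descent step can be written as follows (a generic textbook formulation, not notation taken from the thesis), where $\eta$ is the learning rate and $L$ the loss function:

    $$ w \leftarrow w - \eta \, \nabla_w L(w) $$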

Artificial neural networks are capable of learning through updating their weights and can solve a wide range of problems. Sections 2.2 and 2.3 present and discuss a few variations and applications of ANNs.

2.2 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) build upon and share a lot in common with ANNs. CNNs are a type of neural network capable of working with images and are widely used in tasks such as image classification and recognition.


The input layer in a CNN consists of nodes representing pixels in the original image. For color images the data is commonly separated into 3 separate color channels representing RGB values, while a single channel is enough for black and white images.

The subsequent layers of a CNN come in three main types: convolutional layers, pooling layers, and fully connected layers [20].

As the name suggests, convolutional layers play an important role in CNNs. In a convolutional layer, kernels (or filters) are applied to the previous layer. These kernels are passed over the layer in a controllable pattern, computing the dot product of the area they cover during each pass. Each kernel identifies certain patterns in the previous layer. In early layers, kernels may simply detect horizontal or vertical lines, but in later layers kernels can detect composite features such as noses or eyes. It is the CNN's ability to learn image features that lies behind much of its success.

It is possible to modify the output of kernel calculations using strides or padding. Padding can be added to the edges of the image, in other words adding an empty border. Kernels larger than 1x1 won't pass over the edge pixels as many times as the inner pixels, and adding padding is a simple way to rectify this. Padding is especially useful when the edges of an image are important. Another way to modify kernel usage is to simply set a different number of strides, which determines how many steps the kernel is moved between each pass. While setting a larger number of strides is one possible way to down-sample a layer, many CNNs also use pooling layers, which effectively distill information from previous layers. A common example of this is max pooling, which simply saves the largest pixel value from a certain area of pixels. This reduces the amount of computation required while still maintaining much of the relevant information.
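To make these layer types concrete, the following is a minimal Keras sketch of a small CNN using the concepts above; the filter counts, image size, and class count are illustrative assumptions, not the thesis model:

    from tensorflow.keras import layers, models

    # A minimal CNN: convolution, pooling, and a fully connected output layer.
    model = models.Sequential([
        # 3 channels for an RGB input image; padding="same" adds a border
        layers.Conv2D(16, kernel_size=3, strides=1, padding="same",
                      activation="relu", input_shape=(64, 64, 3)),
        # Max pooling keeps the largest value from each 2x2 area
        layers.MaxPooling2D(pool_size=2),
        # strides=2 down-samples without a separate pooling layer
        layers.Conv2D(32, kernel_size=3, strides=2, padding="same",
                      activation="relu"),
        layers.Flatten(),  # flatten before the dense output layer
        layers.Dense(8, activation="softmax"),  # e.g. one node per class
    ])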

It is also possible to instead upsample layers through transposed convolutions. This involves adding pixel values based on existing ones, for example through replicating them (nearest neighbors) or performing interpolation between them [8].

Lastly, the final layers of a CNN are flattened and densely connected to the output layer, which for example would be a set of classes if performing a classification task.

2.3 Generative Adversarial Networks

Generative Adversarial Networks (GANs) are a type of generative model first proposed by Ian Goodfellow et al. [9]. GANs are artificial neural networks that learn the data distribution of a given data set and produce completely new yet plausible samples for it. This is achieved through using two separate neural networks that compete with each other: the discriminator and the generator. During the training process of a GAN, both networks improve together through a zero-sum game where one network's gain is the other's loss.
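Formally, Goodfellow et al. [9] describe this game as a minimax optimization over a shared value function, where the discriminator $D$ tries to maximize it and the generator $G$ tries to minimize it:

    $$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] $$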

While GANs can be used to generate many types of data, the focus of this thesis is on image generation. The upcoming sections therefore specifically describe the process of generating images with GANs.

2.3.1 Generator

The generator is the heart as well as the end product of the GAN. The generator's task is to transform noise vectors into realistic samples that are capable of fooling the discriminator. The noise vectors are randomly sampled from a latent space, typically a Gaussian (normal) distribution, and throughout training the generator learns to map this input to pixels in an image [25].

During training, the generator uses the discriminator’s feedback on which images appear more realistic. Over time this allows the generator to learn the data distribution and create increasingly realistic samples. After training is completed, the discriminator is discarded while the generator is kept and used to generate more samples.


2.3.2 Discriminator

While the generator learns to create realistic samples capable of fooling the discriminator, the discriminator attempts to distinguish generated samples from real ones. In other words, the discriminator’s job is to classify incoming samples correctly and in doing so provide helpful feedback to the generator.

2.3.3 Training

As GANs consist of two separate networks, training a GAN is done through alternating training of the generator and discriminator. Initially, the generator is tasked with creating a batch of fake images, generated from random noise vectors. The generated images are then mixed in with a set of real images before all images are fed to the discriminator. The discriminator then classifies these as real or fake, after which its weights are updated using the discrepancy between its guesses and the correct answers. Finally, the discriminator's classifications of the fake samples are used to improve the generator. These classifications show the generator which images were graded as more realistic or not convincing enough, and its weights are updated accordingly.
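The following sketch outlines one such alternating training step in Keras-style Python. The model names and the combined gan model (generator followed by a frozen discriminator) are hypothetical placeholders, not the thesis code:

    import numpy as np

    def train_step(real_images, generator, discriminator, gan,
                   batch_size=16, latent_dim=128):
        # 1. Create a batch of fake images from random noise vectors
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        fake_images = generator.predict(noise)

        # 2. Update the discriminator on real (label 1) and fake (label 0) images
        d_loss_real = discriminator.train_on_batch(
            real_images, np.ones((batch_size, 1)))
        d_loss_fake = discriminator.train_on_batch(
            fake_images, np.zeros((batch_size, 1)))

        # 3. Update the generator through the combined model: it improves
        #    when the frozen discriminator classifies its fakes as real
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        g_loss = gan.train_on_batch(noise, np.ones((batch_size, 1)))
        return d_loss_real, d_loss_fake, g_loss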

GAN training traditionally utilizes binary cross-entropy (BCE) as its loss function, which as the name suggests is suitable for binary output such as real or fake.
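For a true label $y \in \{0, 1\}$ (fake or real) and a predicted probability $\hat{y}$, binary cross-entropy is defined as:

    $$ L_{\mathrm{BCE}} = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right] $$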

A big challenge with training GANs is stabilizing the training process. This is largely because the discriminator's job is easier: learning to classify is an easier task than learning and mapping the features of images. In GANs this often leads to the discriminator improving more rapidly than the generator. When the discriminator improves too fast, it is unable to provide useful feedback to the generator. If the discriminator is rarely, if ever, fooled, the generator naturally can't learn which images are better or worse.

A common side effect of this training issue is mode collapse, where the discriminator learns to favor certain types of images. Using the discriminator's feedback, the generator is indirectly told to focus on generating only those images. This can be difficult to detect, but can be seen through, for example, generating a large set of images and looking for similarities in them. Mode collapse in a network trained to produce numbers may for example lead to it only generating the numbers 3 and 4. While attempts to fix this have been made, it remains a major challenge for GANs [2].

2.4 GAN variations

There are many variations of GANs, with some being more suitable for certain problems or data sets. This section describes the two variations that had the largest impact on the thesis. It starts with a short explanation of Deep Convolutional GANs before finishing with a description of how Conditional GANs work.

2.4.1 Deep Convolutional GANs

Deep Convolutional GANs, or DCGANs, have gained a lot of popularity in the GAN community since they were first created by Radford et al. [22]. The DCGAN paper describes a specific network architecture using certain constraints, such as removing pooling and fully connected layers altogether. Compared to other GANs of its time, DCGANs have been shown to lead to improved stability during training, better feature learning, and generated images of a higher quality.

DCGANs and architectures inspired by them have become the backbone of many GANs [26]. These types of networks provide an excellent starting point for many projects, as they yield good results at relatively low computational complexity.


2.4.2 Conditional GANs

Conditional GANs were first introduced by Mirza and Osindero [18] and build upon regular GANs to create a model capable of producing class-specific results. cGANs are trained on a labeled data set, which makes it possible to condition both the discriminator and generator on class labels. For real images, this involves feeding the correct image-label pair to the discriminator. For fake images, this means matching a randomly chosen label with an image created from a random noise vector. Furthermore, the added constraint means that the model not only needs to generate realistic images; they also need to match the provided label.
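Mirza and Osindero [18] express this as a conditional version of the GAN value function, where both networks receive the class label $y$ as extra input:

    $$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))] $$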

A few different ways of inserting the class labels have been suggested. While it is possible to simply add n nodes (where n is the number of classes) to the input layer, others have shown great success through encoding the label as another channel, added alongside the color channels in the discriminator [7].

A big perk of creating a cGAN instead of a classic (unconditional) GAN is that it mitigates some mode collapse. A cGAN is explicitly told that it needs to create images for certain classes and is penalized when it fails to learn what the data distribution of a certain class looks like. A cGAN cannot achieve excellent performance through simply repeating the same class over and over. Mode collapse across classes is also easier to monitor from cGAN output, as it is possible to explicitly tell the model which examples it needs to produce.

2.5 Evaluating GANs

Generating images with GANs is challenging due to a tradeoff between fidelity and diversity: the model needs to create images that are of a high quality yet also diverse enough, and in practice improving one usually comes at the cost of the other.

GANs are notoriously difficult to evaluate since there is no correct answer or ground truth to refer to for the generated data. This has led researchers to suggest a variety of quantitative and qualitative approaches for evaluating GANs, and while these may be helpful it is important to keep in mind that they typically have significant shortcomings. Borji [4] notes that while qualitative approaches may be helpful in determining how realistic the generated data is, these approaches may miss common GAN problems such as overfitting or mode collapse. Quantitative approaches attempt to identify and resolve these issues, but there is no guarantee that this matches how humans evaluate the generated data.

2.5.1 Quantitative Evaluation

Quantitative evaluation methods for GANs try to capture measurable differences between the real and generated images. There are a variety of options, from simple pixel-by-pixel comparison to more advanced methods that compare statistics of the two image distributions.

As of now, the two most popular quantitative evaluation methods for GANs are the Inception Score (IS) and the Fréchet Inception Distance (FID) [4]. Both of these rely on the pre-trained image classifier Inception-v3 with its output layer cut off. The images are fed through Inception-v3, and its new final layer captures high-level feature activations in the images. FID, which is itself an improvement of IS [10], uses the mean and covariance of these features to compare the data distributions.
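Concretely, with feature means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ for the real and generated image sets, FID is computed as [10]:

    $$ \mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right) $$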

FID has been shown to be largely consistent with human judgments and performs well in terms of discriminability, robustness, and computational efficiency [4]. Despite its popularity, FID also comes with certain drawbacks. FID relies on Inception-v3, which has been trained on the ImageNet data set consisting of a variety of real-life photos. According to [13], this makes applying FID to very different data sets relatively meaningless as well as misleading.


Furthermore, FID requires large sample sizes in order to provide reliable scores. It has been suggested that FID should be applied on a minimum of 50 000 samples in order not to overestimate the score.

2.5.2 Qualitative Evaluation

A quick way to evaluate GANs is to perform manual visual inspection of generated samples. While this is quick and intuitive, human evaluation is expensive, biased, and often inconsistent between judges [4]. Although visual inspection rarely captures things such as overfitting or mode collapse, it can still provide a quick entryway to seeing if and how well the model is learning and is thus commonly used by researchers during model training.

Other qualitative evaluation methods include investigating and visualizing network internals, finding the nearest neighbor of generated samples, evaluating mode collapse, and presenting generated samples to human judges [4]. When performing rating and preference judgment, the judges are presented with different samples and asked which ones they prefer. In rapid scene categorization, as used in [7], the judges are shown samples for very short periods of time (varying from milliseconds to seconds) and are asked to classify them as either real or fake.

3 Method

This chapter starts off with describing the data set used to train the model. It then describes how the model was designed, trained, and improved. Lastly, it goes over the chosen evaluation methods and how these were implemented.

3.1 Data

This section goes into detail about how the data set used to train the model was collected and processed.

3.1.1 About the data set

The data set for this project was collected and provided by Agricam using their Bacticam product, which makes it possible to grow and photograph bacterial cultures directly on farms using a specially designed cabinet.

The growth plates are photographed twice, after 24 hours and 48 hours respectively. Since some bacteria are easier to identify in certain lighting, each growth plate is photographed using lighting from both below and above at each photo opportunity, as shown on the left in figure 3.1.

The data set provided by Agricam consists of 9812 images split into a test, train, and validation set, of which only the train set was used to train the GAN. The data set used in this project only contains the red field of the growth plates, as seen on the right in figure 3.1. The hope was that this would make it easier to spot proper growth patterns, and that the same pattern would emerge on both sides of the image. Some bacteria, such as streptococcus, mainly grow on the red field of the plate, so it would also be possible to merge generated red fields with real images to train the classifier further.

All images in the data set are labeled as one of eight classes. The data set contains images of six different bacteria: ecoli, klebsiella, staphylococcus, staphylococcus aureus, streptococcus, and pyogenes. Some images are labeled as mixed flora, indicating contamination, while other images don't show any signs of bacterial growth (negative growth). The labels are not evenly distributed (as shown in figure 3.2), and Agricam has a special interest in supplementing the images of rarer bacteria such as klebsiella with synthetic images.


Figure 3.1: Sample image from the dataset (ecoli present).

Figure 3.2: Distribution of labels in the dataset.

3.1.2 Preprocessing

Some preprocessing of the data set was done by Agricam. This included centering the images, correcting the angle, extracting the red fields, and replacing everything around the plates with the color black. The images were also resized to 600x300 pixels, but due to computational limitations of the training platform and model they had to be further resized to 400x200 pixels.

Before training, the pixel values were normalized to a (-1, 1) interval instead of using RGB values ranging from 0 to 255. Firstly, larger input values may slow down or disrupt the learning process of neural networks and setting them to smaller values is good practice [5]. Secondly, this normalization was required since the generator outputs tanh activations within the same interval.
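A minimal sketch of this normalization (and its inverse, useful when saving generated samples) could look as follows:

    import numpy as np

    def normalize(images):
        # Map 8-bit pixel values from [0, 255] to [-1, 1],
        # matching the tanh output range of the generator
        return images.astype(np.float32) / 127.5 - 1.0

    def denormalize(images):
        # Map generated values from [-1, 1] back to displayable [0, 255]
        return ((images + 1.0) * 127.5).astype(np.uint8)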

3.2 Model

This section describes which type of model was chosen, how it was implemented, and which improvements were made. It also presents which code libraries were used to create the model.


3.2.1 Model choice

The end goal of the project was to supplement Agricam's existing data set with statistically plausible images in order to improve the performance of their bacteria classifier. Since this classifier requires labeled images, a cGAN was the best option, as it is able to produce not only images but also matching labels for them. Including conditional input in the form of a label is a prerequisite for this, and several studies have also shown that training a GAN with extra information may improve image quality. Luckily, a labeled data set was available.

3.2.2 Model design

The cGAN was based on an existing code skeleton [6] that was then modified to fit the project and improved upon using available research. This code skeleton is itself based on the DCGAN architecture, which has proven to be an efficient starting point for many GANs [22].

Class labels can be used in cGANs in a variety of ways. This model encodes the label as an extra channel. In the discriminator, the extra channel is introduced alongside the three regular color channels of the generated image. In the generator, the extra channel is added to the number of latent dimensions and in effect encoded into the generated image itself. While there are alternatives to this, such as encoding the label as a one-hot vector, the channel option has been shown to be successful by Denton et al. [7]. Figure 3.3 showcases this merging of channels in both the generator and discriminator.
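A minimal sketch of how such a label channel can be built for the discriminator in Keras is shown below. The embedding size and exact layer calls are illustrative assumptions; the thesis code follows the skeleton in [6]:

    from tensorflow.keras import layers

    n_classes = 8
    image_input = layers.Input(shape=(200, 400, 3))  # RGB image
    label_input = layers.Input(shape=(1,))           # integer class label

    # Embed the label and project it onto one image-sized feature map
    x = layers.Embedding(n_classes, 50)(label_input)
    x = layers.Dense(200 * 400)(x)
    label_channel = layers.Reshape((200, 400, 1))(x)

    # Concatenate the label channel with the three color channels -> 4 channels
    merged = layers.Concatenate(axis=-1)([image_input, label_channel])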

Noise vectors were generated using 128 latent dimensions, a common choice for GAN projects, and the batch size was 16. The discriminator was trained using separate mini-batches containing only real or only fake images.

3.2.3 Model improvements

Training stable GANs is a difficult task, since the generator and discriminator need to improve together and not just separately. The largest increases in performance in this model came from making the discriminator's job slightly more difficult, as classifying is an easier task than learning the data distribution of images (as explained in 2.3.3). Two techniques that achieve this led to an improvement in image quality, clear through initial visual inspection of generated images as well as performance monitoring.

As Salimans et al. [26] have shown, one-sided label smoothing can have a positive effect on model performance. The labels representing real images were reduced from 1.0 to 0.9, ensuring that the discriminator never outputs complete or near-complete certainty of an image's authenticity.
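In practice this only changes the target labels used when training the discriminator, roughly as follows:

    import numpy as np

    batch_size = 16
    # One-sided: real labels become 0.9 instead of 1.0,
    # while fake labels stay at 0.0
    real_labels = np.full((batch_size, 1), 0.9)
    fake_labels = np.zeros((batch_size, 1))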

The largest increase in quality came from applying a two time-scale update rule to the training process, as suggested by Heusel et al. [10]. Following this rule, individual learning rates were set for the discriminator and generator (LRD = 0.0004, LRG = 0.0001), making the discriminator's learning rate 4 times higher than the generator's.
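With Keras this amounts to giving each network its own optimizer; the use of Adam here is an assumption, as the thesis only specifies the learning rates:

    from tensorflow.keras.optimizers import Adam

    d_optimizer = Adam(learning_rate=0.0004)  # discriminator, 4x higher
    g_optimizer = Adam(learning_rate=0.0001)  # generator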

Another challenge in generating images with GANs is so-called checkerboard artifacts, in other words checkered pixel patterns. These artifacts may occur during the upsampling process and are especially noticeable in more colorful images. Odena et al. [21] suggest that using a kernel size that is divisible by the number of strides is a viable strategy for combating checkerboard artifacts. This strategy should work because the kernel then moves over all pixels a similar number of times; if the kernel size is not divisible, the kernel passes over every second pixel more often, which causes checkerboard artifacts over time.

Despite following the first strategy by setting the kernel size to 4 and strides to 2, an early version of the developed model still produced obvious checkerboard artifacts. These artifacts disappeared when using the second strategy described by Odena et al. [21], i.e. switching out transposed convolutions for a combination of upsampling and convolution. This is clearly shown in figure 3.3. Interpolation in the upsampling layers was done using nearest-neighbor interpolation, as bilinear interpolation led to droplet-like artifacts and less realistic images. Kernel size 4 was still used for the convolutional layers.

Figure 3.3: cGAN architecture. Left: discriminator. Right: generator.
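A sketch of such an upsampling block is shown below; the filter count and input shape are illustrative, but the structure (nearest-neighbor upsampling followed by a kernel-size-4 convolution) matches the description above:

    from tensorflow.keras import layers

    x = layers.Input(shape=(50, 100, 64))  # hypothetical feature map
    # Nearest-neighbor upsampling doubles the spatial resolution...
    y = layers.UpSampling2D(size=2, interpolation="nearest")(x)
    # ...and a regular convolution replaces the transposed convolution
    y = layers.Conv2D(filters=64, kernel_size=4, padding="same",
                      activation="relu")(y)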

3.2.4 Framework

The model was built using the Python library Keras [12], which in turn builds upon a TensorFlow backend. The computational library NumPy [19] is required to run Keras but was also used to generate input vectors and labels. Lastly, Matplotlib [16] was used for plotting and visualizing learning curves as well as generating and saving images.

The model was trained on Google's Colab platform using Colab Pro and a Tesla P100 PCIe 16GB GPU.

3.3 Evaluation

The final section of the method chapter describes which evaluation measures were used to evaluate the generated images.

3.3.1 Visual Examination

Every 5th epoch, the loss curves for the generator and discriminator were plotted on a graph and sample images were generated. This allowed for quick and easy inspection to see if and how well the model was learning and improving. It also made it possible to spot various strengths and weaknesses of the model, showcased in how plausible the generated samples were.

After training was completed, the generator was saved and used to generate fake images for further evaluation. A set of 100 images with randomized class labels was produced. The final data set for testing comprised these 100 fake images as well as 100 randomly chosen real images.

3.3.2 Rapid Scene Categorization

The model's performance was evaluated by a user with expert knowledge of bacterial growth plates. Testing was done through rapid scene categorization using the newly generated data set. A simple GUI was created that made it possible to present each image for a set amount of time and record the chosen answer for each image. No feedback was provided after picking an answer.

The expert took the test in a well-lit room, seated in front of a computer screen running the GUI script. They were informed that they would be asked to classify a set of images as either real or fake, but that the purpose of the test was not to get them all correct but to instead rely on a gut feeling for each one, especially since the images would only be visible for very short times. They were also offered the chance to rest in between rounds. The test designer remained in the room at all times and was available for questions.

Three rounds of testing were performed, starting with an initial display time of 250ms per image and then moving up to 500ms and finally 2000ms. The test taker opted to take a couple of minute-long breaks in between rounds but otherwise performed the test quickly and without confusion.
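A minimal sketch of the timed-display mechanism of such a GUI is shown below, assuming tkinter (the thesis does not name the GUI library); image loading and round handling are omitted:

    import tkinter as tk

    answers = []  # the chosen answer per image; no feedback is given

    def record(choice):
        answers.append(choice)  # "real" or "fake"

    def present(root, photo, display_ms):
        # Show the image, then remove it after display_ms milliseconds
        label = tk.Label(root, image=photo)
        label.pack()
        root.after(display_ms, label.destroy)

    root = tk.Tk()
    tk.Button(root, text="Real", command=lambda: record("real")).pack()
    tk.Button(root, text="Fake", command=lambda: record("fake")).pack()
    root.mainloop()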

3.3.3 Post-Test Interview

After completing the rapid scene categorization test, the expert was interviewed about their decision-making strategy. They were asked how their strategy had evolved throughout the rounds of testing and which image attributes they relied on when making decisions regarding an image's authenticity.

The set of images that was used during testing was split into three different sets: real images, fake images categorized as real, and fake images categorized as fake. After the initial interview questions the expert was shown the fakes that had been wrongly classified as authentic images. They were asked to elaborate on why these slipped through, returning to the decision-making strategy they had outlined in the first part of the interview. They were then allowed to analyze real and fake images side by side, without a time limit, and asked to freely discuss differences between them. Lastly, they were asked to summarize the current image quality and which attributes should be improved upon in order to generate more authentic images.

4 Results

This chapter presents a breakdown of the model's results. The first section explains the role and results of visual examination of the generated images. It then moves on to the results of the expert evaluation, starting with rapid scene categorization and finally presenting the interview answers.

4.1 Visual Examination

Figure 4.1 showcases a set of images generated by the model. Visual examination was responsible for evaluating which model improvements (described in chapter 3.2.3) led to increased image quality. Through visual examination it was possible to identify and rectify issues such as checkerboard artifacts and a high prevalence of noise. Early iterations of the model produced images of noticeably worse quality.

Through visual examination it was also possible to see certain attributes learned by the model. For example, it has learned to create empty plates for the label "negative growth", seen in row 4 of image 4.2. Row 8 in the same image also showcases patterns typical for pyogenes: small and detailed growth. This quick visual inspection indicated that the model was in fact learning some differences between classes of bacteria.

Upon closer examination of the generated images, it is also clear that several of them still contain noise. This can be seen for example as gray splotches in the black areas (image 4.2).

4.2 Rapid Scene Categorization

The shorter the images were displayed for, the more errors were made. Results are broken down in table 4.1, sorted by time displayed. True negatives are fake images classified as fake, false positives are fake images classified as real, true positives are real images classified as real, and lastly, false negatives are real images classified as fake. In each round of testing, 100 fake and 100 real images were presented to the expert in a randomized order.

As display time increased from one fourth of a second to two seconds, the classifications became increasingly accurate. This effect could be seen for both real and fake images but was stronger for the real images.

Figure 4.1: Examples of images produced by different iterations of the model.

Table 4.1: Rapid scene categorization results

Time displayed   True negatives   False positives   True positives   False negatives
250ms            54               46                78               22
500ms            72               28                85               15
2000ms           81               19                93               7

Two classes of bacteria were responsible for a majority of wrongly classified fakes in all three rounds of testing: negative growth and pyogenes. Negative growth is characterized by a lack of bacterial growth, while pyogenes often forms very delicate patterns. These and other images lacking obvious growth patterns were misidentified most often.

4.3 Post-Test Interview

As the results of the rapid scene categorization testing suggest, the main strategy developed by the expert was to consider the growth patterns of the bacteria. This strategy grew stronger as the rounds progressed. In early rounds they noted paying attention to whether the growth pattern matched on the left and right side, as authentic images should showcase the same growth pattern in both types of lighting. As the display time increased they also started paying attention to the specific pattern formed by the bacteria, such as edge formations.

When given ample time to compare the generated images with the real ones, the expert again highlighted the growth patterns. While the model occasionally generates detailed patterns, it trends towards patterns that are grainier and blurrier than authentic ones. Instead of forming clear, distinguishable circles and edges, the shapes often flow into each other. The model also underperforms at generating realistic degrees of transparency in the bacterial growth. However, the shape and color of the plates were deemed quite realistic until zoomed in on.

During longer inspection it became clear that while the quality and resolution of the generated images are noticeably lower than in real images, the largest issue is the lack of symmetry between the left and right side.

5 Discussion

This chapter discusses the results that were presented in the previous chapter. It also discusses the data and methods used to create the model. Lastly, it presents a few areas of interest for further research.

5.1 Results

Using visual inspection proved to be a quick and intuitive way to spot jumps in improvement of image quality. Obvious improvement could be seen after applying methods such as the two time-scale update rule [10], one-sided label smoothing [26], and replacing transposed convolutions with a combination of convolutions and upsampling [21]. While it may not be possible to spot finer improvement using visual inspection, it worked well for the purposes of this project considering the state of the developed model.

From the visual inspection it quickly became clear that there were clear differences between the generated and real images. The largest issue with the synthetic images is the model's inability to generate symmetrical patterns. Growth patterns should look roughly the same when lit from above or below, but the model rarely achieves anything close to this symmetry. This appears to be a regularly occurring problem with GANs, from simple DCGAN-based architectures [14] to state-of-the-art models [17].

The classifier will never encounter real images where the same growth pattern does not emerge in both types of lighting. The model’s failure to successfully and most importantly reliably replicate this symmetry means that the generated images would not be useful for training Agricam’s classifier, as no real images of bacterial growth plates would ever be this asymmetrical. Training the classifier on the generated images is instead likely to worsen its performance, as the new images are not representative of real bacterial growth. Finding a way to rectify this asymmetry remains a major task.

Rapid scene categorization and an interview with an expert on bacterial growth plates identified the same issues with lack of symmetry. The process also highlighted other issues with growth patterns, such as blurriness and poorly shaped and differentiated edges. However, there were signs that the model had picked up on other patterns. It learned to create realistic-looking plates of the correct color, and several of these images passed the rapid scene categorization test if there was no obvious bacterial growth on them. The model also learned to differentiate between certain classes, for example producing empty plates for the label "negative growth" and detailed growth patterns when instructed to create a plate showcasing a pyogenes colony. At the same time, applying only qualitative evaluation methods does not provide enough proof to claim that this is the case for all classes.

While it is clear that the model has a long way to go before it generates usable images, it has already learned to represent certain patterns in the original data set. It was also possible to generate large increases in quality through small adjustments to the code. It might be possible to improve the model further through tweaks to its architecture or parameters.

5.1.1 Evaluation Methods

It can be argued that this project is overly reliant on qualitative evaluation methods, which makes it susceptible to human errors such as subjective and inconsistent judgment. While it would have been possible to evaluate the project quantitatively with a method such as FID [10], there’s no point in evaluating just to evaluate. While the research shows that FID is the current leading standard for quantitatively evaluating GANs, the drawbacks of FID make it unsuitable for this project. This thesis relies on a data set that not only is very different from what Inception-v3 was trained on but also much smaller than the recommended size of 50 000 elements, both of which can cause FID to generate misleading metrics [13].

In this project, qualitative evaluation was also more than enough to see that the model fails to perform basic tasks such as replicating the growth pattern in both kinds of lighting. It doesn't take an expert to see that the model does not manage to match the left-hand pattern with the right-hand one, nor to understand that training a classifier on unrealistic images may make it worse at recognizing realistic ones. Qualitative methods utilize human intuition and expertise and provide an excellent springboard for deeper analysis. Should the model's performance improve, quantitative evaluation methods are likely to play an even larger role in properly evaluating it.

At the same time, there are ways even the qualitative evaluation process could be improved upon. Rapid scene categorization and interviewing were done using a sample size of 1, with only one expert participating in the tests. Increasing the sample size might lead to better insights into which patterns emerge, in turn increasing understanding of not only current model performance but also how it needs to be improved. There is also a possibility that the improvement in correct classifications of images was due to a training effect. However, this effect was minimized through not providing feedback after answers, letting the participant see their selections only after testing was completed.

However, there is no guarantee that these evaluation measures catch issues such as overfitting or mode collapse [4]. While qualitative evaluation provides a good starting point for analyzing image quality, there is no guarantee that these issues are not present in the generated images. Still, severe mode collapse is easy to spot through visual inspection of larger amounts of generated images. As shown in 4.2, the model clearly generates samples that differ not only between classes but also within them. Visually inspecting the generated images still remains a quick way to spot impending mode collapse, but quantitative evaluation will eventually be needed to confirm this hypothesis through a statistical comparison of the similarity between generated images.

The field of GAN evaluation is still in its infancy but is rapidly evolving. Future iterations of this project may be able to capitalize on new strategies that do not yet exist. Should the model begin to generate images so realistic that they fool humans, quantitative evaluation measures will play an important role in ensuring that this success has not been achieved through simple tricks.

5.2 Method

This section starts off by discussing the data set and how it was used. It then finishes off by discussing how the model was designed and implemented.


5.2.1 Data

The provided data set was quite unbalanced, as shown in figure 3.2. While many models may have been able to reach good performance through simply maximizing performance on the most common classes, a cGAN does not have this luxury [18]. For example, generating a hyper-realistic number 3 would not yield a high score if the paired label was a 4. During the entire training process a cGAN is forced to constantly attempt to match the generated image with the generated label, as doing otherwise leads to a jump in loss. Besides this, one of the aims of the project was to be able to generate images specifically for underrepresented classes. Choosing a cGAN for the project was a suitable decision not only to handle class imbalance but also to attempt to rectify it through class-specific image generation.

Ideally, the model would generate images of the whole growth plates and not just the red field. Using only the red field was, however, suitable for a pilot study, as the results would clearly show if the model was capable of generating symmetry where required as well as different patterns for different types of bacteria. As some bacteria only grow on the red field, it would also have been possible to merge the red field back into images of the entire plates, had the generated images been of high enough quality.

5.2.2 Model

Besides the quality issues mentioned in section 5.1, the generated images are also too small to be used by Agricam's classifier. The created cGAN is capable of training on and outputting images that are 200x400 pixels large, while Agricam's classifier takes in images that are 300x600 pixels large. Popular GAN architectures such as DCGAN [22] are commonly used to generate small images, based on data sets such as CelebA or ImageNet. While the DCGAN architecture is computationally efficient and produces great results [22], it may not be suitable for generating larger images. Achieving higher image resolution could be done through switching layer structure, picking another GAN variant altogether, or simply training on a more powerful GPU.

The model was also trained on an online platform that limits training times. While it is hard to guarantee that the model would have kept improving instead of starting to deteriorate, it might be of interest to train it for longer than the current 175 epochs. GANs are time-consuming to train, with training running anywhere from hours to months, and it is possible that training had to be stopped prematurely.

While it would have been possible to pick a different GAN variation, other variants come with drawbacks such as added complexity or drastically increased training times ranging from weeks to months. Creating a cGAN is but a small step from working with a regular GAN, which not only fit the time frame of the project but also its purpose: creating labeled images that can be used to train an image classifier. Lastly, cGANs also mitigate the issue of mode collapse [1], which commonly occurs in unconditional GANs and was clearly seen in an unconditional, early iteration of the model. While qualitative evaluation is not enough to be sure no mode collapse has taken place, the large sample of generated images showed distinct pattern formations both between and within classes.

5.3 Continuation

This project highlights the current state of GAN evaluation research. Evaluating GANs continues to be a tricky task, particularly for models that utilize small and specialized data sets. Despite methods such as FID finding widespread popularity in recent years, it remains clear that better, more universal evaluation methods are required. As this field is still rapidly expanding, it is important to stay up to date with current research to evaluate new alternatives.


While GANs, and more specifically cGANs, proved to be an interesting starting point for this project, it may be necessary to turn to more advanced variants instead. Since the images need to be of a higher resolution and quality, it may be necessary to accept the tradeoff of increased complexity and training times. GAN variants such as CycleGAN [27] or StyleGAN [11] have shown remarkably detailed results in recent years and may be more suitable to this kind of task, despite higher model complexity and longer training times.

Creating a model that can generate realistic images of bacterial growth would go a long way in supplementing the existing data set. Data collection is a slow and resource intensive process. The longer it takes to grow, photograph, and correctly classify an infection, the longer it takes to treat it correctly. Adding synthetically generated images of a high enough quality is a quick path to hopefully improving the performance of the classifier and in turn its classifications. This would help farmers not only start mastitis treatment faster but also gain an upper hand in recognizing recurring infections in order to take preventative measures. Being able to do this would not only reduce cost but also animal suffering, improving the well-being of livestock.

Succeeding in creating this model would not just improve Agricam’s product, but could possibly be an inspiration to others. Bacterial growth plates are used in a variety of fields. It is possible that a network capable of modeling the appearance of mastitis infections could also produce realistic images for similar data sets.

6 Conclusion

The aim of this thesis was to evaluate GANs as a method for generating synthetic images of bacteria. Based on the literature review and the project's purpose, a cGAN was chosen as the most suitable model for the project. The model was iterated upon through small tweaks to the layer structure and hyperparameters, leading to an increase in image quality. However, qualitative evaluation highlighted prevalent issues with symmetry and growth patterns in the generated images. While the model learned to replicate certain attributes such as shape, color, and class-unique attributes, the final images were not of a high enough quality or resolution to train the classifier with. This showcases the difficulty of creating and especially evaluating GANs. While the results are promising, it may be necessary to adopt a different GAN variant altogether to achieve the desired results.

This thesis constitutes a starting point for Agricam's journey towards realistic image generation. It has described common issues with developing GANs, provided ways to improve model performance, and highlighted the prevalent issue of GAN evaluation. Keeping up to date with research in this field may yield new ideas and models better suited to the complexity of the task.


Bibliography

[1] Sudarshan Adiga, Mohamed Adel Attia, Wei-Ting Chang, and Ravi Tandon. "On the Tradeoff Between Mode Collapse and Sample Quality in Generative Adversarial Networks". In: 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP). Anaheim, CA, USA: IEEE, Nov. 2018, pp. 1184-1188. DOI: 10.1109/GlobalSIP.2018.8646478.

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. Dec. 2017. arXiv: 1701.07875 [cs, stat]. URL: http://arxiv.org/abs/1701.07875.

[3] I. A. Basheer and M. Hajmeer. "Artificial Neural Networks: Fundamentals, Computing, Design, and Application". In: Journal of Microbiological Methods 43.1 (Dec. 2000), pp. 3-31. DOI: 10.1016/S0167-7012(00)00201-3.

[4] Ali Borji. Pros and Cons of GAN Evaluation Measures. Oct. 2018. arXiv: 1802.03446 [cs]. URL: http://arxiv.org/abs/1802.03446.

[5] Jason Brownlee. Deep Learning for Computer Vision: Image Classification, Object Detection, and Face Recognition in Python. Machine Learning Mastery, 2019.

[6] Jason Brownlee. How to Develop a Conditional GAN (cGAN) From Scratch. Machine Learning Mastery. July 2019. URL: https://machinelearningmastery.com/how-to-develop-a-conditional-generative-adversarial-network-from-scratch/.

[7] Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks. June 2015. arXiv: 1506.05751 [cs]. URL: http://arxiv.org/abs/1506.05751.

[8] Vincent Dumoulin and Francesco Visin. A Guide to Convolution Arithmetic for Deep Learning. Mar. 2016. arXiv: 1603.07285. URL: https://arxiv.org/abs/1603.07285.

[9] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. June 2014. arXiv: 1406.2661 [cs, stat]. URL: http://arxiv.org/abs/1406.2661.

[10] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Jan. 2018. arXiv: 1706.08500 [cs, stat]. URL: http://arxiv.org/abs/1706.08500.

[11] Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for Generative Adversarial Networks. Dec. 2018. arXiv: 1812.04948. URL: https://arxiv.org/abs/1812.04948.

[12] Keras: The Python Deep Learning API. URL: https://keras.io/.

[13] Shaohui Liu, Yi Wei, Jiwen Lu, and Jie Zhou. An Improved Evaluation Framework for Generative Adversarial Networks. July 2018. arXiv: 1803.07474 [cs]. URL: http://arxiv.org/abs/1803.07474.

[14] Vishnu Makkapati and Arun Patro. "Enhancing Symmetry in GAN Generated Fashion Images". In: Artificial Intelligence XXXIV. Ed. by Max Bramer and Miltos Petridis. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2017, pp. 405-410. DOI: 10.1007/978-3-319-71078-5_34.

[15] Mastit hos mjölkkor - SVA. URL: /djurhalsa/djursjukdomar-a-o/mastit-hos-mjolkkor/.

[16] Matplotlib: Python Plotting - Matplotlib 3.4.2 Documentation. URL: https://matplotlib.org/.

[17] Kyle McDonald. How to Recognize Fake AI-Generated Images. Medium. Dec. 2018. URL: https://kcimc.medium.com/how-to-recognize-fake-ai-generated-images-4d1f6f9a2842.

[18] Mehdi Mirza and Simon Osindero. Conditional Generative Adversarial Nets. Nov. 2014. arXiv: 1411.1784 [cs, stat]. URL: http://arxiv.org/abs/1411.1784.

[19] NumPy. URL: https://numpy.org/.

[20] Keiron O'Shea and Ryan Nash. An Introduction to Convolutional Neural Networks. Dec. 2015. arXiv: 1511.08458 [cs]. URL: http://arxiv.org/abs/1511.08458.

[21] Augustus Odena, Vincent Dumoulin, and Chris Olah. "Deconvolution and Checkerboard Artifacts". In: Distill 1.10 (2016), e3. DOI: 10.23915/distill.00003. URL: http://distill.pub/2016/deconv-checkerboard.

[22] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. Jan. 2016. arXiv: 1511.06434 [cs]. URL: http://arxiv.org/abs/1511.06434.

[23] F. Rosenblatt. The Perceptron, a Perceiving and Recognizing Automaton (Project Para). Cornell Aeronautical Laboratory, 1957.

[24] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall Series in Artificial Intelligence. Pearson Education Limited, 2016. ISBN: 978-1-292-15396-4.

[25] Pegah Salehi, Abdolah Chalechale, and Maryam Taghizadeh. Generative Adversarial Networks (GANs): An Overview of Theoretical Model, Evaluation Metrics, and Recent Developments. May 2020. arXiv: 2005.13178 [cs, eess]. URL: http://arxiv.org/abs/2005.13178.

[26] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs. June 2016. arXiv: 1606.03498 [cs]. URL: http://arxiv.org/abs/1606.03498.

[27] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. Mar. 2017.
