Linköpings universitet

### Linköping University | Department of Computer and Information Science

### Master’s thesis, 30 ECTS | Statistics and Machine Learning

### 2020 | LIU-IDA/STAT-A--20/023--SE

## Generating synthetic brain MR images using a hybrid combination of Noise-to-Image and Image-to-Image GANs

**Lennart Schilling**

Supervisor: Anders Eklund

Examiner: Fredrik Lindsten

**Copyright (Upphovsrätt)**

This document is kept available on the Internet - or its future replacement - for 25 years from the date of publication, provided that no extraordinary circumstances arise.

Access to the document implies permission for anyone to read, download, and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. A later transfer of the copyright cannot revoke this permission. All other use of the document requires the author's consent. Technical and administrative solutions exist to guarantee authenticity, security and accessibility.

The author's moral right includes the right to be named as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or distinctive character.

For further information about Linköping University Electronic Press, see the publisher's website http://www.ep.liu.se/.

**Copyright**

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

**Abstract**

Generative Adversarial Networks (GANs) have attracted much attention because of their ability to learn high-dimensional, realistic data distributions. In the field of medical imaging, they can be used to augment the often small image sets available. In this way, for example, the training of image classification or segmentation models can be improved to support clinical decision making. GANs can be distinguished according to their input. While Noise-To-Image GANs synthesize new images from a random noise vector, Image-To-Image GANs translate a given image into another domain. In this study, it is investigated if the performance of a Noise-To-Image GAN, defined by its generated output quality and diversity, can be improved by using elements of a previously trained Image-To-Image GAN within its training. The data used consists of paired T1- and T2-weighted MR brain images. With the objective of generating additional T1-weighted images, a hybrid model (Hybrid GAN) is implemented that combines elements of a Deep Convolutional GAN (DCGAN) as a Noise-To-Image GAN and a Pix2Pix as an Image-To-Image GAN. Thereby, starting from the dependency on an input image, the model is gradually converted into a Noise-To-Image GAN. Performance is evaluated by the use of an independent classifier that estimates the divergence between the generative output distribution and the real data distribution. When comparing the Hybrid GAN performance with the DCGAN baseline, no improvement, neither in the quality nor in the diversity of the generated images, could be observed. Consequently, it could not be shown that the performance of a Noise-To-Image GAN is improved by using elements of a previously trained Image-To-Image GAN within its training.

**Acknowledgments**

First, I would like to thank my supervisor Anders Eklund for his extraordinary help during the whole time of my thesis. I was not used to such good supervision from my previous studies in Germany and I am very happy that I could approach him with my open questions at any time.

Many thanks also to my examiner Fredrik Lindsten. I consistently felt that he had a great interest in my work and was able to give me very helpful feedback to further improve the quality of my thesis.

I felt that Lakshidaa’s opposition was fair and I am therefore very grateful for the collegial way in which she acted.

Last but not least, my biggest thanks go to Lea, who has always been there for me over the last two years and has always believed in me.

**Contents**

**Abstract**

**Acknowledgments**

**Contents**

**List of Figures**

**List of Tables**

**1** **Introduction**

1.1 Aim

1.2 Related work

1.3 Ethical considerations

**2** **Data**

**3** **Theory**

3.1 Convolutional Neural Networks

3.2 Discriminative vs. generative models

3.3 Generative Adversarial Networks

3.3.1 Architecture

3.3.2 Objective function

3.3.3 Common training problems

3.3.4 Deep Convolutional Generative Adversarial Network

3.3.5 Pix2Pix

**4** **Method**

4.1 Models

4.1.1 Baseline

4.1.2 Intermediate

4.1.3 Hybrid GAN

4.2 Data preprocessing

4.3 Training

4.3.1 Optimizer

4.3.2 Architectures

4.3.3 Hyperparameters

4.3.4 Techniques to counteract mode collapse

4.4 Evaluation

4.4.1 Standard metrics

4.4.2 Critic-based divergence estimation

**5** **Results**

5.1 Critic behaviour

5.2 Model training

5.3 Model comparison

5.4 Extended Hybrid GAN training results

**6** **Discussion**

6.1 Results

6.2 Method and future work

**7** **Conclusion**

**Bibliography**

**List of Figures**

1.1 Autoencoders.

2.1 MRI views of the brain.

2.2 Data example.

3.1 CNN example for image classification.

3.2 Discriminative vs. generative principle within a classification task.

3.3 Standard GAN architecture.

3.4 KL vs. JS divergence.

3.5 Example of a non-converging simulation.

3.6 Illustration of the mode collapse problem.

3.7 Activation functions.

3.8 DCGAN generator and discriminator.

3.9 Pix2Pix generator.

3.10 Pix2Pix architecture.

4.1 Illustration of increasingly added noise.

4.2 Interaction between baseline, intermediate and Hybrid GAN.

4.3 Hybrid GAN architecture.

4.4 Fitting a single Gaussian to a mixture of two Gaussians.

4.5 Minibatch Discrimination.

4.6 Different NSOF optimizations during critic training.

5.1 Critic training behaviour.

5.2 Critic- vs. Discriminator-based GAN evaluation.

5.3 Overview of all performed model training runs.

5.4 DCGAN samples.

5.5 Pix2Pix samples.

5.6 Hybrid GAN samples.

5.7 Divergence estimates at the beginning of Hybrid GAN training.

5.8 Divergence estimates for Hybrid GAN training with/without varying noise.

5.9 Divergence estimates for Hybrid GAN training with/without L1-distance.

**List of Tables**

2.1 Overview of the distribution of the data sets.

5.1 Selected training results from DCGAN training.

5.2 Selected training results from Pix2Pix training.

5.3 Selected training results from Hybrid GAN training.

5.4 Cross-model evaluation of the final generators.

A.1 Critic architecture and training settings.

A.2 DCGAN architecture and training settings.

A.3 Pix2Pix architecture and training settings.

**1 Introduction**

The recent upswing in Deep Learning (DL) applications in a variety of academic and industrial fields has also impacted the medical imaging sector.

As an example, DL models are increasingly adopted for computer-aided classification and segmentation of tissues within images acquired via Magnetic Resonance Imaging (MRI) or Computed Tomography (CT) to support clinical decision making [32].

Many of these applied models belong to the class of Supervised Learning. In this case, the model is asked to predict correct labels for new, previously unseen data based on the rules discovered within a provided labeled data set. To ensure the generalizability of the trained model on unseen data and to reduce overfitting, the model usually has to be provided with a large set of labeled data [32].

However, within the field of medical imaging, obtaining labeled medical data is challenging [20]. A lack of experts who annotate medical images and the fact that the data sets are often unbalanced with few pathological findings limit the training of DL models. Besides, privacy and data protection are particularly important when dealing with medical data. For example, the patient’s permission is usually required if diagnostic images are to be used in a publication or made available to the public [32, 58].

To overcome the data privacy problems associated with the insufficient quantity of diagnostic medical image data, it is often necessary to first augment the data in order to ensure a subsequent promising training of the model. Traditional techniques generally use geometric transformations such as rotation, scaling or cropping. However, these operations do not cover the whole variety of the data, for example regarding the size, shape, location and appearance of a specific pathology [58].

In contrast, Generative Adversarial Networks (GANs) provide a more promising alternative. First introduced in 2014 [14], they have received wide attention for their potential to learn underlying high-dimensional, real data distributions to generate new realistic samples that are not covered by the provided data set [20]. Compared to the traditional data augmentation techniques, GANs thus offer a more generic solution and have been used in many works to enlarge training images with promising results [58]. Within the field of medical imaging, it has been demonstrated that comparable results can be obtained when training the DL model on synthetic data provided by GANs rather than on real patient data [32].

GANs can be distinguished by many criteria, including the input they take to generate new samples. For most of the GAN learning systems discussed in the literature, a random noise vector is used as the input. The GAN then intends to learn how to map this noise vector to an output that matches the real data as closely as possible. In contrast, other GANs take an image of one type as input and transfer it to another image type. In this work, these two types of GANs are referred to as Noise-To-Image GAN and Image-To-Image GAN, respectively.

Within this study, the performance of GANs is assessed concerning differences in quality and diversity of its generated output in comparison to the real data distribution.

Compared to Image-To-Image GANs, it is generally more difficult to achieve high performance with Noise-To-Image GANs, resulting in relatively lower synthetic image quality and diversity. However, an image-to-image translation rather transfers an image to another image type with the same structure instead of generating completely new samples from the underlying distribution of the given data set, as a Noise-To-Image GAN does. In addition, the transfer presupposes the existence of images in multiple domains.

Instead of image translation, in many cases there is a greater interest in synthesizing completely new images to augment the existing data set.

**1.1 Aim**

Motivated by the challenge to achieve high performance in synthetic image generation with a Noise-To-Image GAN, this work aims to combine elements of both presented GAN types. Since an Image-To-Image GAN is expected to achieve higher performance, first completing its training and then using its acquired translation ability within the subsequent training of a Noise-To-Image GAN is intended to lead to higher quality and diversity of the synthesized Noise-To-Image GAN output.¹

The main research question of this study is summarized as follows:

Can the performance of a Noise-To-Image GAN be enhanced by the use of a previously trained Image-To-Image GAN?

Specifically, it can be divided into two sub-questions:

Can the generated output quality of a Noise-To-Image GAN be enhanced by the use of a previously trained Image-To-Image GAN?

Can the generated output diversity of a Noise-To-Image GAN be enhanced by the use of a previously trained Image-To-Image GAN?

**1.2 Related work**

The combination of a Noise-To-Image GAN with an Image-To-Image GAN has been performed before. Han et al. [19] have demonstrated that their developed two-step GAN-based data augmentation improves the training of a brain tumor detection model compared to the use of traditional augmentation techniques. In a first step, a Noise-To-Image GAN is trained to gradually increase the generated output size of brain images up to 256 × 256 pixels. Subsequently, an Image-To-Image GAN aims to translate the generated images for further refinement.

¹ This work is motivated by the example of data augmentation within the field of medical imaging. However, since the applications of GANs are not limited either to data augmentation or to the field of medical imaging, the performed study can be transferred to many other domains.


Guibas et al. [16] also implemented a two-stage pipeline using a pair of Noise-To-Image and Image-To-Image GANs to synthesize photorealistic images of retinal blood vessels. With the intention of decomposing the image generation process into less difficult parts, a Noise-To-Image GAN is trained in the first stage to synthesize diverse retinal vessel segmentation masks to capture the general geometry. Then, in a second stage, an Image-To-Image GAN translates the synthesized masks into the corresponding photorealistic images. In comparison with a single-stage GAN, which attempts to synthesize the photorealistic images directly, it has been shown that both quality and diversity are improved.

To generate 256 × 256 photo-realistic images conditioned on text descriptions, Zhang et al. [59] also decomposed the task into two sub-problems. In the first stage, a conditional Noise-To-Image GAN which takes a random noise vector and a given text description yields low-resolution images to reflect the basic shape and colors of the object. Then, conditioned on the generated images from the first stage, an Image-To-Image GAN translates them into higher resolution photo-realistic images. By this procedure, the authors demonstrated significant improvements in generating photo-realistic images conditioned on text descriptions.

Apart from these combinations of Noise-To-Image with Image-To-Image GANs, Wang et al. [55] studied domain adaptation applied to image generation with GANs. To transfer knowledge from a source domain to a target domain, the parameters of GAN models, pre-trained on different data sets to synthesize images, are used as the initial starting point for a new generation task. By fine-tuning with training images from the new domain, it has been demonstrated that using knowledge from pre-trained models can shorten convergence time and improve the quality of generated images.

The approach followed in this study is not based on stacking the two different GAN types, but on using information from a pre-trained Image-To-Image GAN within a Noise-To-Image GAN. The investigations of Wang et al. do not refer to the combination of the two GAN types. To the author’s knowledge, no studies have been published to date on the approach investigated here. However, the methodology applied within this work bears a certain resemblance to autoencoders.

The standard autoencoder (AE) [22] aims to reduce the dimensionality of its given input data. It consists of two networks, an encoder with parameters $\phi$ and a decoder with parameters $\theta$. While the encoder translates the original high-dimensional input $x$ into a corresponding latent low-dimensional code $z$ by $z = g_\phi(x)$, the decoder recovers the input from the code by $x' = f_\theta(z)$. Using stochastic gradient descent, the networks are trained to minimize the discrepancy between the original input $x$ and its reconstruction $x'$ by

$$\theta, \phi = \arg\min_{\theta,\phi} \frac{1}{n}\sum_{i=1}^{n} L\left(x^{(i)}, x'^{(i)}\right) = \arg\min_{\theta,\phi} \frac{1}{n}\sum_{i=1}^{n} L\left(x^{(i)}, f_\theta\left(g_\phi\left(x^{(i)}\right)\right)\right) \tag{1.1}$$

whereby the reconstruction loss function $L$ can be selected from various options as required. As an example, the squared error $L(\theta, \phi) = \frac{1}{n}\sum_{i=1}^{n}\left(x^{(i)} - f_\theta\left(g_\phi\left(x^{(i)}\right)\right)\right)^2$ may be selected.

A modification of the AE is the Denoising Autoencoder (DAE) [53]. The basic intention of this model is to learn latent representations that are robust to small irrelevant changes in input. The input $x$ is transformed into a partially noisy or corrupted version $\tilde{x}$ before it is processed by the autoencoder. This is performed with a stochastic mapping $\tilde{x} \sim q_D(\tilde{x} \mid x)$ that is not limited to a specific type of corruption process, meaning that different variations of added noise can be considered.


Following the same principles from the AE, the model parameters of the encoder and decoder networks are then updated based on the resulting loss

$$L(\theta, \phi) = \frac{1}{n}\sum_{i=1}^{n} L\left(x^{(i)}, f_\theta\left(g_\phi\left(\tilde{x}^{(i)}\right)\right)\right). \tag{1.2}$$

Consequently, the model is trained to reconstruct the original, uncorrupted input $x$ from the corrupted input $\tilde{x}$. With the added noise, the DAE is encouraged to gather insights from a combination of many input dimensions to reconstruct the denoised version, rather than focusing on isolated dimensions. This provides a good basis for learning a robust latent low-dimensional representation.
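To make the denoising loss in Equation 1.2 concrete, the following is a minimal sketch in plain Python. It assumes additive Gaussian noise as one possible corruption process $q_D$ and uses a hypothetical toy encoder and decoder (the encoder keeps only the mean of the input); none of these choices are taken from the thesis.

```python
import random

def squared_error(x, x_rec):
    """Squared error between an input vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))

def corrupt(x, sigma, rng):
    """One possible q_D (assumption): add zero-mean Gaussian noise to every dimension."""
    return [a + rng.gauss(0.0, sigma) for a in x]

def dae_loss(data, encoder, decoder, sigma=0.1, seed=0):
    """Average loss between each clean input x and the reconstruction
    obtained from its corrupted version x_tilde (cf. Equation 1.2)."""
    rng = random.Random(seed)
    total = 0.0
    for x in data:
        x_tilde = corrupt(x, sigma, rng)
        total += squared_error(x, decoder(encoder(x_tilde)))
    return total / len(data)

# Hypothetical toy networks: a 1-dimensional code holding the input mean.
encoder = lambda x: [sum(x) / len(x)]
decoder = lambda z: [z[0], z[0]]

data = [[1.0, 1.0], [2.0, 2.0]]
print(dae_loss(data, encoder, decoder, sigma=0.0))  # no corruption: plain AE loss
print(dae_loss(data, encoder, decoder, sigma=0.5))  # corrupted inputs
```

With `sigma=0.0` the function reduces to the standard AE reconstruction loss of Equation 1.1 with the squared error as $L$. In an actual DAE, the encoder and decoder would be neural networks whose parameters are updated by gradient descent on this loss.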

Another modification of the AE is the Adversarial Autoencoder (AAE) [34]. It is a probabilistic autoencoder that aims to match the aggregated posterior distribution of the latent code with a defined prior distribution. Let $q(z \mid x)$ be the encoding function of the autoencoder and $p_r$ the data distribution. The aggregated posterior distribution $q(z)$ is then obtained by

$$q(z) = \int_x q(z \mid x)\, p_r(x)\, dx. \tag{1.3}$$

The matching of the aggregated posterior with the defined prior distribution $p(z)$ is achieved by the usage of an additional discriminative adversarial network which aims to distinguish samples from both distributions as correctly as possible. In this way, the autoencoder is trained in two phases. In the reconstruction phase, the encoder and decoder are updated in the same way as in the AE or DAE by minimizing the discrepancy between the original input $x$ and its reconstruction $x'$. However, in the additional regularization phase, the adversarial classifier is first updated by minimizing its classification error. Moreover, the encoder is updated to maximize the classification error of the discriminative network. This leads the encoder to convert $x$ into a latent representation $z$ which approaches the prior distribution $p(z)$. As a result, the autoencoder is turned into a generative model in which the decoder learns how to map the defined prior distribution to the data distribution.

The three autoencoder variants presented are summarized in Figure 1.1. The relationship between the applied methodology within this study and the autoencoders will be clarified in the course of the thesis.

**1.3 Ethical considerations**

The analyses within this study are performed using MR brain images that are publicly available. Each image was provided to the author without personal information about the subjects or other identifying information.

The synthesis of new images is motivated by its resulting potential to augment medical image data sets. In this way, the need for personal data is reduced. As a result, the training of Deep Learning models that, for example, are able to detect a brain tumor from MR images, may be improved.

As more accurate predictions about patients’ health status may be obtained, it offers great potential to support the clinical decision-making of medical professionals. However, the healthcare sector is a field in which false conclusions from applied models can lead to significant consequences for patients [44]. Therefore, multiple ethical challenges must be taken into account.

In general, ethical automated decision-making should promote well-being, minimize harm and ensure that benefits and harms are shared equally among the affected groups. However, a certain extent of bias occurs with any data set [12]. Most existing public medical data sets are small and lack real-world variation. It follows that medical data may contain biases for or against a particular group in terms of gender, social, environmental or economic factors. Since the accuracy of a model strongly depends on the information included within the data on which it is trained, the system may underperform for certain groups so that the data biases may lead to discrimination against underrepresented subsets of a population. Due to the potentially significant consequences, the biases may, therefore, harm patients and negatively affect their health [44].

(a) Standard Autoencoder (AE) (b) Denoising Autoencoder (DAE) (c) Adversarial Autoencoder (AAE)

Figure 1.1: Autoencoders. The Standard Autoencoder in a) first encodes a higher-dimensional input $x$ into a lower-dimensional latent representation $z$. The decoder then converts the latent code $z$ to a reconstruction of the original input. Both encoder and decoder are updated by minimizing the discrepancy between the reconstructed input image $x'$ and the original input image $x$. As a result, the autoencoder learns how to reduce the dimensionality of the original input $x$ while preserving enough information to reconstruct the higher-dimensional original input. The Denoising Autoencoder in b) involves a small modification by first corrupting the original input by the usage of a stochastic function $q_D(\tilde{x} \mid x)$. In this way, it is encouraged to gather insights from a combination of many input dimensions to reconstruct the denoised version so that the latent representations are expected to be more robust to slight changes in the input. The Adversarial Autoencoder in c) expands the AE by an additional discriminative network. Given samples $z'$ from the defined prior distribution $p(z)$ and samples $z$ from the aggregated posterior distribution of the latent representations $q(z)$, its classification returns an additional adversarial loss which is used to update the discriminator itself and the encoder. Since the encoder is updated by maximizing the discriminative network’s loss, the autoencoder is forced to reconstruct the original input $x$ so that the obtained latent representation follows the defined prior distribution. After training, the decoder then represents a generative model which converts samples from the prior distribution to generated outputs following the data distribution $p_r$.

In this context, the implementation of Deep Learning models requires a high degree of transparency regarding the collection and processing of training data, including descriptions of the data and characteristics of the underlying patients [44]. Moreover, bias-related questions such as

• What kinds of biases are present in the data used to train and test the models?

• What are the potential risks that may arise from biases in the data?

• How should the remaining biases be treated?

need to be clarified [12].


Finding answers to such questions becomes more difficult when including synthesized data within the training data set. No patient directly underlies the generated images. Instead, all GAN outputs are based on the combination of different characteristics originating from the patients whose images are used to train the GAN. Consequently, the approach regarding the analysis of the data bias must be clarified. One possibility could be to use the characteristics of the underlying patients whose images are used for the GAN training.

Another ethical challenge concerns the patients’ privacy. The consent of the patients is usually required when dealing with medical image data. The use of synthesized training data reduces the amount of direct personal data. However, it is unclear to what extent the use of artificial images still requires the consent of the patients whose images were used to train the GAN. Since the GAN that generates the images is trained on their direct personal data, the synthesized images contain some personal information.

In addition to the examples presented, the use of artificially generated images to train Deep Learning models which are intended to be used as support for clinical decision-making offers great potential for further in-depth discussions. In general, a consistent approach is needed to address these challenges. Boundaries must be defined so that models trained based on artificial data are generally accepted. This raises further questions, for example regarding the responsibility to be assigned in case of erroneous analyses. Due to the serious consequences of incorrect analyses, the health sector is an area in which particular caution is required.

**2 Data**

The study of this work is performed on a brain image dataset provided by the Human Connectome Project (HCP) [52].

With the aim of making brain image data freely accessible to the scientific community, the HCP carried out the HCP Young Adult release, from which the data is taken. It consists of structural MR brain scans of 1113 healthy US-American adults aged 22 to 35 years, collected between 2012 and 2015.

Any three-dimensional MR image consisting of voxels can be seen as a stack of two-dimensional slices consisting of pixels. As Figure 2.1 shows, MRI of the brain can provide axial, sagittal and coronal slices [27].

In MRI, different types of images are acquired. T1- and T2-weighted MR images are characterized by tissue-specific differences in brightness. As an example, while areas of the brain filled with water appear dark in T1-weighted images, they appear bright in T2-weighted images [2].

For each subject, the HCP provides T1- and T2-weighted three-dimensional image pairs of the brain from axial, sagittal and coronal view. In this study, the axial images are used. Each imaged brain has a volume of 260 × 311 × 260, meaning it consists of 260 two-dimensional slices [2]. Each slice has a physical size of 0.7 × 0.7 mm.

One T1- and T2-weighted slice pair is extracted for each subject. Since the inner slices of the brain reflect the most information, slice 120 is chosen. Figure 2.2 shows an example of one provided image pair used within this work.

By extracting one image pair for each subject, 1113 image pairs are considered in total. When splitting the data for the training procedure, image pairs stick together, meaning that the T1- and T2-weighted images of the same subject are distributed to the same data set. The generalization ability of the models implemented within this study is estimated during the training with a separate validation set, which is also used as feedback for the further tuning of the models. After several iterations of training and tuning, the final models are evaluated on a test set to simulate how the models react to previously unseen data. The separation of the data set is performed using the rates for the training, validation and test set of 70%, 15% and 15%, respectively. Table 2.1 gives an overview of the numbers of considered image pairs within each set.

Figure 2.1: MRI views of the brain. Left: Axial view. Middle: Sagittal view. Right: Coronal view.

Figure 2.2: Data example. The figure shows a pair of provided axial T1- and T2-weighted inner slices (slice 120) of one subject. Left: T1-weighted slice. Right: T2-weighted slice. For each subject, such an image pair is used within this study.

| | Share (%) | Count (#) |
|---|---|---|
| Subjects | | 1113 |
| T1- and T2-weighted image pairs | | 1113 |
| Image pairs in training set | 70 | 779 |
| Image pairs in validation set | 15 | 167 |
| Image pairs in test set | 15 | 167 |

Table 2.1: Overview of the distribution of the data sets. Extracting one image pair of T1- and T2-weighted MR image per subject leads to a total number of 1113 considered image pairs. The data is split into a training-, validation- and test set with rates of 70%, 15% and 15%, resulting in 779, 167 and 167 image pairs within the resulting data sets.
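The split described in this chapter can be illustrated with a short sketch. It is not the procedure used in the thesis: the function name, the shuffling with a fixed seed and the rounding rule are assumptions chosen so that the set sizes of Table 2.1 are reproduced. Splitting at the subject level keeps each T1- and T2-weighted pair in the same set.

```python
import random

def split_subjects(subject_ids, train_rate=0.70, val_rate=0.15, seed=0):
    """Shuffle subject IDs and split them into training, validation and test sets.

    Each subject contributes one T1/T2 image pair, so splitting by subject
    guarantees that both images of a pair end up in the same set.
    """
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(train_rate * len(ids))
    n_val = round(val_rate * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remaining subjects
    return train, val, test

train, val, test = split_subjects(range(1113))
print(len(train), len(val), len(test))  # 779 167 167
```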

**3 Theory**

This chapter provides the fundamentals of Generative Adversarial Networks. A Convolutional Neural Network (CNN) with its convolution operation enables fast and high-quality classification and generation of images and is strongly integrated into the GAN architectures applied in this work. Following a description of this special kind of neural network, the difference between discriminative and generative models is shown. On these foundations, an overview of the general GAN system is given. Thereby, an increased focus is put on the analysis of the objective function as well as common training problems. The models used in this work, the Deep Convolutional Generative Adversarial Network (DCGAN) as a noise-to-image translation as well as the Pix2Pix as an image-to-image translation, are presented at the end.

**3.1 Convolutional Neural Networks**

A traditional fully connected network requires all layers to be one-dimensional vectors. Every neuron of a subsequent layer is connected to every neuron of the previous layer. It follows that the pixels of an image can only be processed independently so that local spatial structures of the image are not taken into consideration. Besides, given a high image resolution, this architecture results in many weights and biases to be learned, which makes the training of the network very difficult [40].

The Convolutional Neural Network tries to counteract the described problems by its main building block, namely the convolution operation. It transforms an input feature map into an output feature map by sliding a kernel, a matrix with a user-defined size, over the entire input. At every slide step, the value of the output is calculated for the current position of the kernel. This is done by elementwise multiplying the corresponding values of the kernel with the values of the overlapping input before the sum of these products is returned.¹ For a convolution that is performed using a two-dimensional input feature map $I$ and a two-dimensional kernel $K$, the output $S(i, j)$ is calculated by

$$S(i, j) = (I * K)(i, j) = \sum_{m}\sum_{n} I(m, n)\, K(i - m, j - n), \tag{3.1}$$

with $m \times n$ representing the shape of $K$ [15]. Since convolution is commutative, it can be equivalently rewritten as

$$S(i, j) = (K * I)(i, j) = \sum_{m}\sum_{n} I(i - m, j - n)\, K(m, n). \tag{3.2}$$

In the case of a three-dimensional input, the kernel is set to have the same depth. Each channel of the kernel is then applied to the corresponding input channel in the same described way before combining the results in one final two-dimensional output feature map [11].
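As an illustration of Equation 3.2, the following sketch evaluates the convolution sum in plain Python for a single-channel input. It is only a didactic example under stated assumptions: the function name is hypothetical, and the output is restricted to positions where the kernel lies completely inside the input (no padding, stride 1).

```python
def conv2d_valid(image, kernel):
    """Evaluate S(i, j) = sum_m sum_n I(i - m, j - n) K(m, n) at all positions
    where the kernel fully overlaps the input (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            total = 0.0
            for m in range(kh):
                for n in range(kw):
                    # Output index (i, j) corresponds to the full-convolution
                    # index (i + kh - 1, j + kw - 1), so I is read "backwards".
                    total += image[i + kh - 1 - m][j + kw - 1 - n] * kernel[m][n]
            out[i][j] = total
    return out

# 4 x 4 input with values 0..15 and a 2 x 2 kernel.
image = [[4 * r + c for c in range(4)] for r in range(4)]
kernel = [[1, 0],
          [0, -1]]
print(conv2d_valid(image, kernel))
# [[5.0, 5.0, 5.0], [5.0, 5.0, 5.0], [5.0, 5.0, 5.0]]
```

Every output value equals the difference between an input pixel and its upper-left diagonal neighbour, which is 5 everywhere for this input. Deep learning libraries typically implement the same sliding-window idea far more efficiently and for many kernels and channels at once.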

Consequently, given an image as an input, a pixel of the output feature map is not connected to all but only to a subset of pixels of the input feature map(s). In addition, as defined by the shape of the kernel, pixels are not processed independently, but in groups of nearby pixels. Within one convolutional layer, multiple kernels can be defined so that it returns a three-dimensional stack of all kernel output feature maps [40].

The kernel values are the weights of the network. However, when a kernel is slid over an input, all output pixels obtained from a kernel share the same weights. Each output pixel is, therefore, an expression of detecting the same feature, only at different positions in the input image. Also, for a given kernel, the same bias is used for each output pixel [40].

The complexity of the model is therefore reduced not only by the lower number of connections but also by the fact that the weights and bias per kernel are shared to transform an input to an output. For example, given an input image of 256 × 256 pixels, training a traditional fully connected network with a hidden layer of a modest 30 neurons would require a total of 256 × 256 × 30 = 1,966,080 weights plus 30 biases to be updated. In contrast, in a CNN with a convolutional layer defined by 20 kernels of size 5 × 5, only 5 × 5 × 20 = 500 weights plus 20 biases would need to be updated.
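The arithmetic of this comparison can be reproduced with a few lines of Python. The helper names `dense_params` and `conv_params` are hypothetical, and a single input channel is assumed for the convolutional layer, matching the grayscale example above.

```python
def dense_params(n_inputs, n_hidden):
    """Weights and biases of one fully connected layer."""
    weights = n_inputs * n_hidden
    biases = n_hidden
    return weights, biases

def conv_params(kernel_h, kernel_w, in_channels, n_kernels):
    """Weights and biases of one convolutional layer with shared kernels."""
    weights = kernel_h * kernel_w * in_channels * n_kernels
    biases = n_kernels
    return weights, biases

dense_w, dense_b = dense_params(256 * 256, 30)
conv_w, conv_b = conv_params(5, 5, 1, 20)  # one input channel assumed
print(dense_w, dense_b)  # 1966080 30
print(conv_w, conv_b)    # 500 20
```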

Training a CNN is performed by calculating the gradient of the defined cost function with respect to the weights and biases using backpropagation and updating the parameters by the defined optimization algorithm [40].

While the size of an output feature map is affected by the input shape and kernel size, it is also influenced by the defined stride and the decision about zero-padding. The stride determines how many units the kernel is shifted in each sliding step. Zero-padding extends the border of the input with zeros before the sliding process starts to allow the kernel to visit positions for which part of the kernel may lie outside the actual input feature map [13].
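The combined effect of input size, kernel size, stride and zero-padding on the output size can be written as a short formula. The sketch below assumes a square input of side `n`, a square kernel of side `k` and a symmetric padding of `padding` zeros on each border; the function name is illustrative.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Side length of the output feature map along one spatial axis."""
    return (n + 2 * padding - k) // stride + 1

no_padding = conv_output_size(256, 5)                      # kernel stays inside the input
same_size = conv_output_size(256, 5, stride=1, padding=2)  # padding preserves the size
halved = conv_output_size(256, 4, stride=2, padding=1)     # stride of two halves the size
print(no_padding, same_size, halved)  # 252 256 128
```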

After performing the convolution and running the linear activations through a nonlinear activation function, the output is often modified further by pooling operations. Pooling layers are commonly used directly after convolutional layers. By summarizing nearby input pixels, a pooling operation reduces the size of each output feature map of a convolutional layer. Again, a window with a defined shape is slid with a defined stride over each feature map. The most common kind of pooling is max pooling, where the overlapping region is summarized by the maximum value of its included pixels [13]. Pooling aims to reduce the number of parameters needed in the following layers [40] and increases invariance to small translations of the input [13].
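As a minimal sketch of max pooling, the following NumPy function slides a square window over a single two-dimensional feature map. The function name, the 2 × 2 window with stride two and the example values are assumptions chosen for illustration.

```python
import numpy as np

def max_pool2d(x, window=2, stride=2):
    """Summarize each window of a 2D feature map by its maximum value."""
    h, w = x.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + window, c:c + window].max()
    return out

feature_map = np.array([[0.1, 0.4, 0.0, 0.2],
                        [0.3, 0.2, 0.5, 0.1],
                        [0.0, 0.1, 0.9, 0.3],
                        [0.2, 0.6, 0.4, 0.7]])
pooled = max_pool2d(feature_map)
print(pooled)  # [[0.4 0.5] [0.6 0.9]]
```

The 4 × 4 input is reduced to 2 × 2, so any subsequent layer has to process only a quarter of the values.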

The described architecture with its convolutional and pooling layers refers to the mapping from a high-dimensional input to a low-dimensional output, as it is done in a classification task. However, sometimes the goal is to map a low-dimensional input to a high-dimensional output. Therefore, an operation that goes in the opposite direction of the normal convolution is required to map feature maps to a higher-dimensional space. Dumoulin et al. [11] indicate that transposed convolutions or fractionally strided convolutions can be used. In this way, both upsampling and convolution are performed in one step. However, separating these two operations by first resizing the input to a higher resolution and then performing a normal convolution operation is more useful in many contexts, such as when the CNN is implemented within a Generative Adversarial Network [42]. In this case, the convolution that follows the resizing must be done using zero-padding to avoid reducing the resolution again.
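The resize-then-convolve alternative can be sketched as follows. Nearest-neighbour resizing, the odd kernel size and the averaging kernel are assumptions for illustration; the text above only requires some form of resizing followed by a zero-padded convolution. As in most deep learning libraries, the kernel is slid without flipping.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Enlarge a 2D feature map by repeating every pixel `factor` times."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def conv2d_same(x, kernel):
    """Slide an odd-sized kernel over x with zero-padding so the shape is kept."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

small = np.arange(16, dtype=float).reshape(4, 4)
upsampled = upsample_nearest(small)         # 4 x 4 -> 8 x 8
kernel = np.full((3, 3), 1.0 / 9.0)         # illustrative averaging kernel
output = conv2d_same(upsampled, kernel)     # stays 8 x 8 thanks to zero-padding
print(upsampled.shape, output.shape)
```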


Figure 3.1: CNN example for image classification. Given an input image of shape 256 × 256 × 1, in total six different kernels are applied to the input within the convolutional layer. The pixel values of the input image are displayed at the bottom left. Zero-padding is performed before the convolution operation. A sliding step of Kernel 1 (shape 2 × 2 × 1) is illustrated as an example: The sum of the products between the kernel weights and the overlapping area of the input image equals ey and is transferred to the output feature map of the kernel. By default, the output values are then converted by a nonlinear activation function. In this example, the convolutional layer is defined by six kernels, so the number of output feature map channels is six. Zero-padding and a stride of one result in the shape of the input image being maintained. The subsequent pooling layer performs max pooling with a defined window of 2 × 2, thus summarizing the displayed region of input pixels with a maximum value of 0.4. Pooling is performed for all input channels. The pixels of the three-dimensional output feature map of the pooling layer are then processed in several traditional fully connected layers, with the last layer representing the class assignment.

Pooling operations that reduce the size of their inputs are usually not used if the objective is to map a low-dimensional input to a high-dimensional output.

With the aim of classification, the architecture of combined convolutional and pooling layers can be extended by adding traditional fully connected layers using the information from the last pooling layer for the classification. Figure 3.1 summarizes one example architecture of a CNN.

**3.2** **Discriminative vs. generative models**

Machine learning algorithms can be divided into discriminative and generative learning algorithms. Given training data with input x and target variable y, discriminative algorithms model p(y|x), the conditional distribution of the target for a given input. In a classification problem, the algorithm tries to find a decision boundary with which an instance can be directly assigned to a label depending on x. It tries to learn a direct mapping from input x to label y.

In contrast, generative learning algorithms first model the joint distribution of the input and the target, p(x, y). This is done by modeling p(x|y), the distribution of the input features given the target, and combining it with p(y), the class prior. Bayes' rule can then be used to derive p(y|x), the posterior distribution of the target given the input. An instance is then assigned to the class y with the highest posterior probability. Therefore, generative models do not try to find any boundary between the classes but try to model the distribution of the classes themselves [39]. Figure 3.2 illustrates the principles of the two learning algorithm types.

Figure 3.2: Discriminative vs. generative principle within a classification task. While discriminative models try to find a border between classes, generative models try to model the distribution of the classes themselves.
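The generative route to classification can be illustrated with a toy example. The sketch below assumes one-dimensional inputs, two classes, Gaussian class-conditional densities p(x|y) and equal class priors; all of these choices, as well as the names used, are illustrative assumptions rather than part of the method described above.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
x_class0 = rng.normal(-2.0, 1.0, size=500)  # synthetic training inputs with y = 0
x_class1 = rng.normal(2.0, 1.0, size=500)   # synthetic training inputs with y = 1

# Generative step: model p(x|y) per class and the class prior p(y).
params = {0: (x_class0.mean(), x_class0.std()),
          1: (x_class1.mean(), x_class1.std())}
prior = {0: 0.5, 1: 0.5}

def posterior(x):
    """Bayes' rule: p(y|x) is proportional to p(x|y) * p(y)."""
    joint = np.array([gaussian_pdf(x, *params[y]) * prior[y] for y in (0, 1)])
    return joint / joint.sum()

post = posterior(1.5)
print(post, post.argmax())
```

The predicted class is the one with the highest posterior probability; no decision boundary is fitted explicitly.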

**3.3** **Generative Adversarial Networks**

To learn how to model a data probability distribution, most generative models are based on the maximum likelihood principle. Using n training samples xi, the likelihood is computed as the product of the sample probabilities assigned by the model,

$$\prod_{i=1}^{n} p_{\theta}(x_i), \tag{3.3}$$

with $p_{\theta}(x_i)$ representing the probability that is assigned to $x_i$. The model parameters $\theta$ are updated so that the likelihood of the model following the data distribution is maximized. However, this requires a previous assumption about the underlying probabilistic model for which the parameters are optimized. By explicitly defining a certain form of $p_{\theta}(x)$, the complexity of a high-dimensional data distribution may not be represented correctly [23]. Generative Adversarial Networks are based on the concept of a simultaneous adversarial learning process rather than maximizing the likelihood [14].
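To make the maximum likelihood principle concrete, the following sketch assumes a univariate Gaussian as the explicitly defined model form, which is exactly the kind of prior assumption mentioned above. The logarithm of the likelihood from Eq. 3.3 is used for numerical stability; the function name and the synthetic data are illustrative.

```python
import numpy as np

def gaussian_log_likelihood(samples, mean, std):
    """Log of the product in Eq. 3.3 for a Gaussian model p_theta."""
    log_p = (-0.5 * np.log(2.0 * np.pi * std ** 2)
             - (samples - mean) ** 2 / (2.0 * std ** 2))
    return log_p.sum()

rng = np.random.default_rng(1)
samples = rng.normal(3.0, 2.0, size=1000)  # synthetic training samples x_i

# For a Gaussian, the likelihood is maximized by the sample mean and the
# (biased, ddof=0) sample standard deviation.
mle_mean = samples.mean()
mle_std = samples.std()

ll_mle = gaussian_log_likelihood(samples, mle_mean, mle_std)
ll_other = gaussian_log_likelihood(samples, 0.0, 1.0)
print(ll_mle > ll_other)
```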

**3.3.1** **Architecture**

The originally proposed GAN, hereafter referred to as Standard GAN, consists of two adversaries, a generative model, generator G, and a discriminative model, discriminator D. G is a differentiable function and tries to map a random noise input z, which is sampled from a defined prior distribution $p_z$, to the real data space as G(z). D, also a differentiable function, distinguishes whether the input comes either from the real data distribution $p_r$ or from the generator's distribution $p_g$. While G aims to generate an output that D classifies as real, D simultaneously tries to classify its inputs from the real and generator's data distribution as correctly as possible. Consequently, G and D compete with each other to achieve their individual goals. Using the objective function, both models update their parameters via backpropagation. This is done in an iterative process that alternates between k steps of optimizing D and one step of optimizing G. While D is updated in each step using minibatches of $p_g$ and $p_r$, G is updated with D classifying a minibatch of $p_g$ only. The gradient-based updates can be performed by any standard gradient-based learning rule [14]. Figure 3.3 summarizes the Standard GAN architecture.

The general intuition of the adversarial process is that by forcing both models to improve progressively, G's distribution $p_g$ converges towards the real data distribution $p_r$. The outcome of the GAN is a generative model for which no underlying probabilistic model has to be defined in advance. Furthermore, since G in many cases represents a simple, deterministic feed-forward network, sampling can be performed easily by forward propagation unlike many other generative models [23].

Figure 3.3: Standard GAN architecture. Generator G tries to map samples z from a defined random noise distribution $p_z$ to the real data distribution $p_r$ by G(z). Discriminator D takes samples x from $p_r$ and samples G(z) from the generator's output and tries to distinguish them as correctly as possible. While D updates its parameters based on its classifications of real samples x and synthesized samples G(z), G updates its parameters based on how D classifies its synthesized samples G(z) only.

**3.3.2** **Objective function**

**3.3.2.1** **Theoretical foundations**

For Standard GAN, the output of the discriminator determines the estimated probability of its input coming from the real data distribution $p_r$ [14].

Therefore, the adversarial learning process can be formulated such that D tries to maximize $\mathbb{E}_{x \sim p_r}[\log D(x)]$ and $\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$. In contrast, G attempts to have its output classified by D as real by minimizing $\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$ or, equivalently, $\mathbb{E}_{x \sim p_g}[\log(1 - D(x))]$. Since the combined objective function

$$\begin{aligned} \min_G \max_D L(G, D) &= \mathbb{E}_{x \sim p_r}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \\ &= \mathbb{E}_{x \sim p_r}[\log D(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D(x))] \end{aligned} \tag{3.4}$$

is sought to be optimized by both models with conflicting interests, the adversarial process forms a minimax relationship between the discriminator and the generator.

Based on the objective function, D is updated in each training step with a minibatch of m noise samples z transformed by the generator and m samples x from the real data distribution by ascending its stochastic gradient

$$\nabla_{\theta_D} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \right]. \tag{3.5}$$

In contrast, G is updated in each training step using a minibatch of m noise samples z by descending its stochastic gradient

$$\nabla_{\theta_G} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right). \tag{3.6}$$

The first term of the objective function, $\mathbb{E}_{x \sim p_r}[\log D(x)]$, does not influence the training of G, because G's parameters are updated with D classifying mapped noise samples D(G(z)) only [14].
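The quantities whose gradients appear in Eq. 3.5 and Eq. 3.6 are simple minibatch averages. The sketch below computes them from discriminator outputs only; the function names and the example probabilities are hypothetical, and the gradient computation itself (backpropagation through D and G) is left out.

```python
import numpy as np

def discriminator_objective(d_real, d_fake):
    """Minibatch average inside Eq. 3.5; D ascends the gradient of this value."""
    return np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_objective_saturating(d_fake):
    """Minibatch average inside Eq. 3.6; G descends the gradient of this value."""
    return np.mean(np.log(1.0 - d_fake))

d_real = np.array([0.9, 0.8, 0.95])  # hypothetical outputs D(x) on real samples
d_fake = np.array([0.1, 0.3, 0.2])   # hypothetical outputs D(G(z)) on generated samples

print(discriminator_objective(d_real, d_fake))
print(generator_objective_saturating(d_fake))
```

Note that `generator_objective_saturating` does not depend on `d_real`, which mirrors the remark that the first term of the objective does not influence the training of G.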

The main goal of generative models is to learn a generative distribution $p_g$ that is identical to the real data distribution $p_r$ by minimizing the differences between these two distributions [23]. To verify whether the defined objective function achieves this, two metrics to quantify the similarity of two probability distributions are first reviewed: the Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences.

The Kullback-Leibler divergence, $D_{KL}$, returns a scalar which defines how a probability distribution p diverges from another probability distribution q and is defined as

$$D_{KL}(p \| q) = -\int_x p(x) \log \frac{q(x)}{p(x)}\, dx = \int_x p(x) \log \frac{p(x)}{q(x)}\, dx = \int_x p(x) \left[ \log p(x) - \log q(x) \right] dx. \tag{3.7}$$

$D_{KL}$ is non-negative and returns the minimum zero if and only if p and q are the same distribution. However, since it is asymmetric ($D_{KL}(p \| q) \neq D_{KL}(q \| p)$), the KL divergence cannot be seen as a true distance measure. Comparing it to the Shannon entropy $H(p) = -\mathbb{E}_{x \sim p}[\log p(x)]$, it becomes obvious that $D_{KL}$ is just a small modification by adding in the second distribution q. Since the Shannon entropy can be interpreted as the expected amount of information in an event drawn from that distribution, $D_{KL}$ with its expectation of the log difference between the probability distributions shows how much information is lost when we approximate one distribution with another [13].
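The properties of the KL divergence can be verified on a small discrete example, where the integral in Eq. 3.7 becomes a sum. The two probability vectors below are arbitrary illustrative values.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence sum_x p(x) * log(p(x) / q(x)), natural logarithm."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute zero
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(kl_pq, kl_qp)         # two different non-negative values
print(kl_divergence(p, p))  # 0.0
```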

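Another measure of similarity between two probability distributions is the Jensen-Shannon divergence, which builds on the KL divergence by

$$D_{JS}(p \| q) = \frac{1}{2} D_{KL}\left(p \,\Big\|\, \frac{p + q}{2}\right) + \frac{1}{2} D_{KL}\left(q \,\Big\|\, \frac{p + q}{2}\right). \tag{3.8}$$

It ranges from 0 to 1 when the logarithm is taken to base 2 (from 0 to log 2 for the natural logarithm) and, in contrast to $D_{KL}$, it is symmetric ($D_{JS}(p \| q) = D_{JS}(q \| p)$) [56]. A comparison between both divergences is illustrated in Figure 3.4.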
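A discrete sketch of Eq. 3.8 shows the symmetry and the upper bound of the JS divergence. The natural logarithm is used here, so the maximum is log 2; the example distributions are illustrative.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence with the natural logarithm."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Discrete JS divergence following Eq. 3.8."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])
js_pq = js_divergence(p, q)
js_qp = js_divergence(q, p)

# Two distributions with disjoint supports reach the maximum value log 2.
js_disjoint = js_divergence(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(js_pq, js_qp, js_disjoint)
```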

Now, using these foundations about the Kullback-Leibler and Jensen-Shannon divergences, it can be shown that the defined objective function corresponds to minimizing the difference between $p_g$, the distribution of the generator output, and $p_r$, the real data distribution. Since it is the discriminator's intention to classify its input as correctly as possible, the optimal value for the discriminator, $D^{*}(x)$, is identified by differentiating the objective function in Eq. 3.4 with respect to $D(x)$ and setting the derivative to zero. Goodfellow et al. [14] show that for a fixed generator, the optimal discriminator is defined as

$$D^{*}(x) = \frac{p_r(x)}{p_r(x) + p_g(x)} \in [0, 1]. \tag{3.9}$$

Consequently, if the generator is also optimal, so that the distribution learned by G is identical to the real data distribution ($p_g = p_r$), the discriminator can only guess how its input is assigned ($D^{*}(x) = \frac{1}{2}$). At this point, when both models are at their optimal values, the objective function in Eq. 3.4 can be rewritten as

$$\begin{aligned} L(G^{*}, D^{*}) &= \int_x \left( p_r(x) \log D^{*}(x) + p_g(x) \log\left(1 - D^{*}(x)\right) \right) dx \\ &= \log \frac{1}{2} \int_x p_r(x)\, dx + \log \frac{1}{2} \int_x p_g(x)\, dx \\ &= -\log 4. \end{aligned} \tag{3.10}$$

Consequently, the global minimum of the objective function that is reached when $p_g = p_r$ equals $-\log 4$.


Figure 3.4: KL vs. JS divergence. Given two normal densities p(x) and q(x) on the upper left, the computed values for each x regarding the two KL divergences $D_{KL}(p \| q)$ and $D_{KL}(q \| p)$ are shown on the upper right. Based on the mean KL divergence values for each x in the lower left plot, in which m refers to $\frac{p + q}{2}$, all JS divergence values are plotted in the lower right. While $D_{KL}(p \| q)$ differs from $D_{KL}(q \| p)$, the JS divergences $D_{JS}(p \| q)$ and $D_{JS}(q \| p)$ are identical [56].

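Inserting the optimal discriminator from Eq. 3.9 into the definition of the JS divergence between $p_r$ and $p_g$ gives

$$\begin{aligned} D_{JS}(p_r \| p_g) &= \frac{1}{2} D_{KL}\left(p_r \,\Big\|\, \frac{p_r + p_g}{2}\right) + \frac{1}{2} D_{KL}\left(p_g \,\Big\|\, \frac{p_r + p_g}{2}\right) \\ &= \frac{1}{2} \left( \log 2 + \int_x p_r(x) \log \frac{p_r(x)}{p_r(x) + p_g(x)}\, dx \right) + \frac{1}{2} \left( \log 2 + \int_x p_g(x) \log \frac{p_g(x)}{p_r(x) + p_g(x)}\, dx \right) \\ &= \frac{1}{2} \left( \log 4 + \int_x p_r(x) \log D^{*}(x)\, dx + \int_x p_g(x) \log\left(1 - D^{*}(x)\right) dx \right) \\ &= \frac{1}{2} \left( \log 4 + L(G, D^{*}) \right). \end{aligned} \tag{3.11}$$

Rearranging this equation results in

$$L(G, D^{*}) = 2 D_{JS}(p_r \| p_g) - \log 4. \tag{3.12}$$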

Therefore, the objective function of a GAN does indeed correspond to minimizing the difference between $p_g$, the distribution of the generator output, and $p_r$, the real data distribution. If the discriminator has reached its optimal state, $D^{*}$, the objective is equivalent to minimizing the JS divergence [56].

Eq. 3.12 shows that for the optimal generator which has learned the identical real data distribution, $D_{JS}(p_r \| p_g) = 0$, so that the objective function achieves its global minimum $-\log 4$.
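The relationship between the objective at the optimal discriminator and the JS divergence can be checked numerically on discrete distributions, where the integrals become sums. The probability vectors below are arbitrary strictly positive examples, and the function names are illustrative.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence with the natural logarithm."""
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Discrete JS divergence following Eq. 3.8."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def objective_at_optimal_d(p_r, p_g):
    """L(G, D*) with the optimal discriminator D* from Eq. 3.9 plugged in."""
    d_star = p_r / (p_r + p_g)
    return float(np.sum(p_r * np.log(d_star) + p_g * np.log(1.0 - d_star)))

p_r = np.array([0.5, 0.3, 0.2])  # stand-in for the real data distribution
p_g = np.array([0.2, 0.2, 0.6])  # stand-in for the generator distribution

lhs = objective_at_optimal_d(p_r, p_g)
rhs = 2.0 * js_divergence(p_r, p_g) - np.log(4.0)
print(np.isclose(lhs, rhs))

# With p_g = p_r, the objective reaches its global minimum -log 4.
minimum = objective_at_optimal_d(p_r, p_r)
print(np.isclose(minimum, -np.log(4.0)))
```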


**3.3.2.2** **f-divergence**

In Standard GAN, the generator is trained to minimize $D_{JS}(p_r \| p_g)$ by the use of the discriminator. However, it was found that instead of the Jensen-Shannon divergence, many different divergences can be adopted to train a GAN. Nowozin et al. [41] have generalized the set of divergences a GAN can target to the family of f-divergences. Given the real data density $p_r$ and the generator's density $p_g$, the f-divergence is defined as

$$D_f(p_r \| p_g) = \int_{\chi} p_g(x)\, f\left(\frac{p_r(x)}{p_g(x)}\right) dx. \tag{3.13}$$

The divergence function f(u) is defined to be a convex function for which f(1) = 0 so that a divergence of zero is returned when $p_r(x) = p_g(x)$. Since the probability distributions $p_r$ and $p_g$ in Eq. 3.13 are not known during GAN training, the formula for the f-divergence is replaced by a tractable form. By using the convex function's Fenchel conjugate², the f-divergence can be lower-bounded by

$$D_f(p_r \| p_g) \geq \sup_{T \in \mathcal{T}} \left( \mathbb{E}_{x \sim p_r}[T(x)] - \mathbb{E}_{x \sim p_g}\left[f^{*}(T(x))\right] \right) \tag{3.14}$$

where $f^{*}$ represents the Fenchel conjugate and T the GAN's discriminator. As a difference of expectations, the divergence can be approximated given only samples from $p_r$ and $p_g$ instead of the true underlying, but unknown distributions.

Depending on the choice of the divergence function f(u), the generalized f-divergence represents various objectives that may be used to train a GAN. Among them is also the Standard GAN objective function for the case that $f(u) = u \log u - (u + 1) \log(u + 1)$. However, the formula also includes many divergences other than Jensen-Shannon [41].
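How the choice of f selects a particular divergence can be illustrated on discrete distributions, where Eq. 3.13 becomes a sum. The sketch below uses f(u) = u log u, which is the standard generator function of the KL divergence, and the Standard GAN choice named above; the probability vectors and function names are illustrative assumptions.

```python
import numpy as np

def f_divergence(p_r, p_g, f):
    """Discrete version of Eq. 3.13: sum_x p_g(x) * f(p_r(x) / p_g(x))."""
    return float(np.sum(p_g * f(p_r / p_g)))

def f_kl(u):
    """Generator function of the KL divergence."""
    return u * np.log(u)

def f_gan(u):
    """Divergence function of the Standard GAN objective."""
    return u * np.log(u) - (u + 1.0) * np.log(u + 1.0)

p_r = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.2, 0.6])

d_kl = f_divergence(p_r, p_g, f_kl)
d_gan = f_divergence(p_r, p_g, f_gan)

# Reference value: the objective L(G, D*) with D* = p_r / (p_r + p_g).
d_star = p_r / (p_r + p_g)
objective = float(np.sum(p_r * np.log(d_star) + p_g * np.log(1.0 - d_star)))
print(d_kl, d_gan, objective)
```

With `f_gan`, the sum reproduces the Standard GAN objective at the optimal discriminator. Note that `f_gan(1.0)` equals -log 4 rather than zero, so this choice matches the JS divergence only up to a factor of two and an additive constant, in line with Eq. 3.12.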

**3.3.2.3** **Discrepancy between theory and practice**

Training of Standard GAN can be understood as minimizing the Jensen-Shannon divergence between the real data distribution $p_r$ and the distribution of generator output $p_g$. It has been shown that if the divergence function f(u) is chosen accordingly, the objective function from Eq. 3.4 can also be represented in the form of an f-divergence.

Nevertheless, the objective function discussed in theory differs from the one used in practice. In theory, the same objective function is used to optimize both the discriminator and the generator. A comparison of the gradients from Eq. 3.5 and Eq. 3.6 reveals that the probability of the generator output being classified as fake is maximized by the discriminator and minimized by the generator.

However, it has been pointed out that the optimal discriminator between $p_r$ and $p_g$ is always perfect if their supports are disjoint or lie in low-dimensional manifolds [3]. Assuming that a contained object within a real image, sampled from $p_r$, is fixed, the image has to comply with a lot of constraints³. These limitations prevent the image from being able to take on a high-dimensional free form. Since the generator uses a low-dimensional noise vector z to translate it to a higher-dimensional output G(z), $p_g$ also lies in a low-dimensional manifold [56]. Given that $p_r$ and $p_g$ both lie in low-dimensional manifolds, they are likely to be disjoint. Since the discriminator is therefore able to perfectly classify real and generated images, its gradient transferred to the generator vanishes when the theoretical objective function is used:

$$\lim_{\| D - D^{*} \| \to 0} \nabla_{\theta_G} \mathbb{E}_{z \sim p(z)}\left[\log(1 - D(G(z)))\right] = 0. \tag{3.15}$$

As a result, using this objective function may not provide enough gradient and hence information for the generator to learn effectively in practice [3].

² The Fenchel conjugate of a convex function f(u) is defined as $f^{*}(t) = \sup_{u \in \mathrm{dom}_f} (ut - f(u))$.

³ For example, a brain should consist of three main parts: the cerebrum, cerebellum and brainstem.

To counteract this, the objective function is changed in practice with respect to the optimization of the generator. G is updated in practice by ascending its stochastic gradient

$$\nabla_{\theta_G} \mathbb{E}_{z \sim p(z)}\left[\log D(G(z))\right]. \tag{3.16}$$

In this way, the objective function offers much stronger gradients in learning. As a result, in contrast to the single shared objective function in theory, two mismatched generator and discriminator objective functions are used in practice [14].
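The difference in gradient strength can be made tangible with a small calculation. The sketch assumes that the discriminator output is a sigmoid of a logit a, that is D(G(z)) = sigmoid(a), and compares the derivatives of both generator objectives with respect to a. The function names and logit values are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_saturating(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a), the theoretical objective."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da log(sigmoid(a)) = 1 - sigmoid(a), the objective used in practice."""
    return 1.0 - sigmoid(a)

# Strongly negative logits mean that D confidently rejects the generated sample.
logits = np.array([-8.0, -4.0, 0.0, 4.0])
print(grad_saturating(logits))      # magnitude close to zero for a = -8
print(grad_non_saturating(logits))  # close to one for a = -8
```

When the discriminator rejects generated samples with high confidence, the gradient of the theoretical objective almost vanishes, whereas the practical objective still provides a strong learning signal.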

Following the representation of f-divergences, Nowozin et al. [41] refer to a simple relationship between the GAN discriminator and an estimate of the density ratio. Given that $p_g$ and $p_r$ are known, the optimal discriminator can be identified by the derivative of $f_D$, the divergence function targeted for the discriminator:

$$T^{*}(x) = f_D'\left(\frac{p_r(x)}{p_g(x)}\right). \tag{3.17}$$

If $f_D'$ is invertible, the optimal discriminator $T^{*}(x)$ can be used to obtain the ratio between the real data density and the density of the generator. Since $T^{*}(x)$ is not known in practice, the approximate density ratio is estimated by using the current discriminator $T(x)$ instead:

$$\frac{p_r(x)}{p_g(x)} = \left(f_D'\right)^{-1}\left(T^{*}(x)\right) \approx \left(f_D'\right)^{-1}\left(T(x)\right). \tag{3.18}$$

Consequently, using the current discriminator and the divergence function targeted for the discriminator, the approximated density ratio between the probability distributions can be estimated.

Since the f-divergence in Eq. 3.13 depends only on samples from $p_g$ and the density ratio at each point x, any f-divergence representing the generator objective function can now be approximated as

$$D_{f_G}(p_r \| p_g) = \mathbb{E}_{x \sim p_g}\left[f_G\left(\frac{p_r(x)}{p_g(x)}\right)\right] \approx \mathbb{E}_{x \sim p_g}\left[f_G\left(\left(f_D'\right)^{-1}(T(x))\right)\right] \tag{3.19}$$

where $f_G$ is the divergence function targeted for the generator.

In summary, the discriminator is optimized in practice in the same way as in theory by maximizing the lower bound in Eq. 3.14. However, the generator is optimized instead by minimizing the f-divergence from Eq. 3.19. Thereby, the current discriminator is used to approximate the density ratio between the probability distributions $p_r$ and $p_g$.

For Standard GAN, $f_D$ is defined in the same way as before by $f_D(u) = u \log u - (u + 1) \log(u + 1)$. $f_G$ needs to be defined as $f_G(u) = \log\left(1 + \frac{1}{u}\right)$ so that the generator's objective function corresponds to the one used in practice from Eq. 3.16. Then, it does not correspond to minimizing the Jensen-Shannon divergence as in theory but to minimizing the approximation of the f-divergence between the real data density and the generator density in Eq. 3.19 [45]. The objective function can be interpreted as

$$L(G, D^{*}) = D_{KL}(p_g \| p_r) - 2 D_{JS}(p_g \| p_r) \tag{3.20}$$

which is a composition of a reverse KL divergence term and a JS divergence term [3]. Consequently, the GAN training with the objective function used in practice can be understood as a divergence minimization, just as with the use of the theoretical objective function.

Due to the different behaviour of the gradient for the generator, the theoretical objective function is in the following referred to as the saturating objective function (SOF). In contrast, the term non-saturating objective function (NSOF) is used for the objective function applied in practice.
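The link between the generator divergence function $f_G(u) = \log(1 + 1/u)$ and the practical generator update can be made explicit in a few lines. The derivation below is a sketch that assumes the parametrization $T(x) = \log D(x)$; how T and D are related is not stated above, so this relation should be read as an assumption.

```latex
\begin{aligned}
f_D'(u) &= \log u - \log(u + 1) = \log\frac{u}{u + 1}, \\
\left(f_D'\right)^{-1}(t) &= \frac{e^{t}}{1 - e^{t}}
  \quad\Rightarrow\quad
  \left(f_D'\right)^{-1}\bigl(\log D(x)\bigr) = \frac{D(x)}{1 - D(x)}, \\
f_G\!\left(\frac{D(x)}{1 - D(x)}\right)
  &= \log\!\left(1 + \frac{1 - D(x)}{D(x)}\right) = -\log D(x).
\end{aligned}
```

Under this assumption, the approximated generator divergence becomes $\mathbb{E}_{x \sim p_g}[-\log D(x)]$, and minimizing it is the same as ascending the gradient of $\mathbb{E}_{z \sim p(z)}[\log D(G(z))]$.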


**3.3.3** **Common training problems**

Training of a GAN is generally considered to be difficult. The theoretical proof [14] that $p_g$ converges to $p_r$, the real data distribution, does not correspond to the situation observed in practice when gradient-based optimization is used. A reason for the inconsistency is that the proof is based on the convexity of the GAN objective function. Since both the generator and the discriminator are modeled by deep neural networks in practice, the convexity does not hold because the optimization is performed in the parameter space rather than on the probability function itself [23].

For Standard GAN, it has been demonstrated in Section 3.3.2.3 that the use of the SOF leads to problems with vanishing gradients. By using the alternative proposal (NSOF), this saturation is prevented. However, as Eq. 3.20 shows, the training process can then be understood as simultaneous minimization of the reverse KL divergence, $D_{KL}(p_g \| p_r)$, and maximization of the JS divergence, $D_{JS}(p_g \| p_r)$. Since these two objectives are contradictory, the gradient of the discriminator shows increasing variance and fluctuation as training progresses [23].

Non-convergence is considered to be the main problem faced by all different types of GANs [15]. Apart from these problems concerning Standard GAN, two main difficulties are described which may be associated with any GAN: Finding the Nash equilibrium and avoiding the problem of mode collapse.

**3.3.3.1** **Nash equilibrium**

Most deep models are trained by minimizing a cost function. However, instead of an optimization problem, the training of a GAN can rather be seen as finding a Nash equilibrium to a non-cooperative two-player game [48]. For both the generator and the discriminator, the loss depends on the other player's parameters. An update of one of the two models reduces its own loss. However, at the same time, the loss of the other player will increase. This sometimes allows both players to reach an equilibrium, but repeatedly undoing each other's progress often does not result in a useful state. Rather, GANs often tend to oscillate in practice and are not guaranteed to reach an equilibrium [15].

The difficulty of finding a Nash equilibrium can be illustrated by a simple example. Assuming that the objective function is given by f(x, y) = xy, one player tries to minimize it by controlling x and the other player aims to maximize it by controlling y.

The Nash equilibrium for such a game does not represent a local minimum of f(x, y). Instead, it is achieved by a saddle point, which is an optimal point for both players with respect to their parameters [13]. In this situation, the only saddle point and thus the Nash equilibrium is given for x=y=0. Then, a change of the parameter does not directly lead to an improvement for either player in terms of their objectives [15].

Since $\frac{\partial f}{\partial x} = y$ and $\frac{\partial f}{\partial y} = x$, x is updated with $x = x - \eta y$ and y with $y = y + \eta x$ within one iteration, where $\eta$ represents the learning rate and is set to $\eta = 0.1$ within this example. Figure 3.5 illustrates that instead of reaching the Nash equilibrium at x = y = 0, the gradient updates cause an increasing oscillation [56].

As in the example, in GAN training, it often happens that both the generator and the discriminator repeatedly update the model parameters forever instead of converging to the saddle point where neither player is able to reduce its loss [13].
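The oscillation in this toy game is easy to reproduce. The sketch below applies the two update rules simultaneously with a learning rate of 0.1; the starting point (1, 1) and the number of steps are assumptions chosen for illustration.

```python
import numpy as np

def simulate(x0=1.0, y0=1.0, lr=0.1, steps=100):
    """Simultaneous gradient updates for f(x, y) = x * y."""
    x, y = x0, y0
    history = [(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x  # df/dx = y and df/dy = x
        # x descends its gradient (minimizer), y ascends its gradient (maximizer).
        x, y = x - lr * grad_x, y + lr * grad_y
        history.append((x, y))
    return np.array(history)

trajectory = simulate()
distance = np.hypot(trajectory[:, 0], trajectory[:, 1])
print(distance[0], distance[-1])  # the distance to (0, 0) keeps growing
```

Each simultaneous update multiplies the squared distance to the equilibrium by 1 + η², so the trajectory spirals outwards instead of settling at x = y = 0.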

**3.3.3.2** **Mode collapse**

Usually, the real data distribution, pr, is represented by several modes. A GAN suffers from a mode collapse when the generator learns to map different input values z, sampled from the defined prior distribution pz, to the same or similar output G(z).

A typical scenario is that the generator G often synthesizes samples from the same mode, which the discriminator D initially misclassifies. As the training progresses, D eventually learns that the samples generated by G are incorrectly classified, which leads G to generate similar samples from another mode instead [23].

Figure 3.5: Example of a non-converging simulation. While x is updated with $x = x - \eta y$ to minimize xy, y is updated with $y = y + \eta x$ to maximize xy. The learning rate $\eta$ is set to 0.1. Instead of reaching the Nash equilibrium at x = y = 0, the updates cause an increasing oscillation [56].

Figure 3.6: Illustration of the mode collapse problem. Two different GANs are trained to converge to the target distribution, a mixture of Gaussians in a two-dimensional space. For different steps during training, the heatmaps of the distributions of the generator output are shown. While the generator in the top row quickly learns to converge to the real data distribution, the heatmaps in the bottom row represent the typical behaviour when mode collapse occurs. The generator synthesizes samples from only one mode at a time and rotates through the modes as the discriminator learns to classify them as fake [37].

Since this training behaviour continues in this form, $p_g$ does not converge to $p_r$, but instead rotates through the modes of the real data distribution. It does not converge to a fixed distribution, and only ever assigns significant probability mass to one mode at once [37]. Consequently, the samples generated lack the diversity of the real data [23]. Figure 3.6 illustrates the training of a GAN suffering from a mode collapse.

A potential cause of mode collapse is seen in the architecture of GAN’s iterative learning process. The SOF in Eq. 3.4 indicates that D should be trained to optimality for a fixed G. This minimax architecture is identical when using the NSOF instead.

However, as described in Section 3.3.1, the models update their parameters alternately between k steps of optimizing D and one step of optimizing G for computational reasons. In this way, it is not clear if it solves a minimax or maximin problem.

The solutions for both problems are not equal:

$$\min_G \max_D L(G, D) \neq \max_D \min_G L(G, D). \tag{3.21}$$

In case of a maximin problem, the minimization with respect to the generator lies within the inner loop of the optimization process. G then maps its input z, sampled from $p_z$, to the output G(z) with the highest probability of being classified as real by D. As training progresses and D learns to classify G(z) as fake, G again tries to find the other output with the highest probability of being classified as real by D.

(a) Sigmoid: $\frac{1}{1 + e^{-x}}$; (b) tanh: $\frac{e^{2x} - 1}{e^{2x} + 1}$; (c) ReLU: $\max(0, x)$; (d) Leaky ReLU: $x$ if $x > 0$, $\alpha x$ otherwise

Figure 3.7: Activation functions. The figure presents all activation functions that are used within this study. In this example, $\alpha$ is set to 0.2 for the Leaky ReLU function.

In this way, $p_g$ does not converge to all modes of $p_r$ because G believes that choosing only one mode is enough to make D classify its output as real [5, 15, 23].
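That a minimax and a maximin problem can have different solutions is already visible in a tiny discrete game. The 2 × 2 table below is a purely hypothetical objective in which G picks a row and D picks a column; it is not derived from any GAN.

```python
import numpy as np

# Hypothetical objective values L[g, d]: rows are choices of G, columns choices of D.
payoff = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])

# Minimax: D answers optimally to every choice of G, then G picks the best row.
minimax = payoff.max(axis=1).min()

# Maximin: G answers optimally to every choice of D, then D picks the best column.
maximin = payoff.min(axis=0).max()

print(minimax, maximin)  # 1.0 -1.0
```

The player in the inner loop always gets to react to a fixed opponent, which is why the order of minimization and maximization matters.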

**3.3.4** **Deep Convolutional Generative Adversarial Network**

As mentioned in Section 3.3.1, both the generator and the discriminator of a GAN are defined as differentiable functions. The Standard GAN implementation implies the use of traditional fully connected networks for both models [14].

The Deep Convolutional Generative Adversarial Network (DCGAN) [46] adapts the architecture by implementing both models as CNNs. The task of the discriminator is to perform classification by mapping a high-dimensional input to a single number as output. Thus, it is implemented as a CNN using convolutional layers. In contrast, the task of the generator is to map a low-dimensional input z to a high-dimensional output G(z). As described in Section 3.1, this can be achieved by repeatedly resizing the input to a higher resolution and then performing a normal convolution operation.

Based on the proposal of Springenberg et al. [50], a DCGAN is characterized by modified CNN architectures. Very often, in the case of classification, CNNs are structured by alternating convolutional and pooling layers, followed by fully connected layers (as shown in Figure 3.1). However, in the DCGAN discriminator, the functionality of pooling layers is replaced by the use of convolutional layers with an increased stride, which also fulfills the purpose of reducing the input to a lower dimension. This allows the discriminator to learn its own spatial downsampling. Besides, all fully connected layers on top of the convolutional layers are removed.

Since the kernel values and biases are updated during the training of a CNN, the distribution of each layer's inputs changes after each training step. Since this fact slows down the training, the problem can be addressed by normalizing the convolutional layer outputs to have zero mean and unit variance before passing them to the nonlinear activation function. Batch normalization [25] is proposed to be applied to all convolutional layers except the generator output layer and the first convolutional layer of the discriminator [46].
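The normalization step can be sketched in NumPy as follows. The sketch assumes a minibatch of feature maps in the layout (batch, height, width, channels) and normalizes each channel separately; the learnable scale and shift parameters of batch normalization [25] as well as the running statistics used at inference time are deliberately left out.

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    """Normalize each channel to zero mean and unit variance over the minibatch."""
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)  # eps avoids division by zero

rng = np.random.default_rng(0)
conv_output = rng.normal(3.0, 2.0, size=(8, 4, 4, 3))  # illustrative layer outputs
normalized = batch_normalize(conv_output)

print(normalized.mean(axis=(0, 1, 2)))  # approximately 0 per channel
print(normalized.var(axis=(0, 1, 2)))   # approximately 1 per channel
```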

Different activation functions are recommended by the DCGAN authors to be used for different layers. The Rectified Linear Unit (ReLU) [38] is used for all convolutional layers of the generator, except for the output layer, for which the hyperbolic tangent (tanh) is used. For the discriminator, the Leaky Rectified Linear Unit (Leaky ReLU) [33, 57] is applied to all convolutional layers. The output of the last convolutional layer is flattened and passed to a sigmoid function. Figure 3.7 provides an overview of the activation functions.
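For reference, the four activation functions can be written in NumPy as shown below. The formulas follow the definitions given with Figure 3.7, with the Leaky ReLU slope set to 0.2 as in that figure; the function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(2.0 * x) - 1.0) / (np.exp(2.0 * x) + 1.0)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.2):
    # Negative inputs are scaled by alpha instead of being set to zero.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(sigmoid(x))
print(tanh(x))
print(relu(x))
print(leaky_relu(x))
```

The output ranges differ: sigmoid maps to (0, 1), which suits the discriminator's probability output, while tanh maps to (-1, 1), which suits generated images whose pixel values are scaled to that range.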


Figure 3.8: DCGAN generator and discriminator. Top: The discriminator. Given an input image of the shape 256 × 256 × 1, several convolutional layers are stacked. For each convolutional layer, the kernels are slid over the input feature maps with an increased stride. For example, by using zero-padding and a stride of two, the first convolutional layer halves the shape of each input channel (from 256 × 256 to 128 × 128). Batch normalization is not applied to the first convolutional layer. Leaky ReLU is applied to all convolutional layer outputs. After the convolution stack, the feature maps are flattened and passed to a sigmoid function. Bottom: The generator. A reshaped random vector of the shape 1 × 1 × 512 is mapped to the final output image of shape 256 × 256 × 1 by a stack of upsampling and convolutional layers. While the upsampling layers increase the shape of the input feature maps, the convolutional layers maintain the shapes of the input feature maps. This can be reached by using zero-padding and a stride of one. Batch normalization and ReLU are applied to the outputs of all convolutional layers except the last one, for which tanh is used. For both generator and discriminator, no fully connected or pooling layers are used.

An example architecture for the generator and discriminator of a DCGAN is shown in Figure 3.8.
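The spatial sizes in such an architecture follow from the standard output-size formula for convolutions. The sketch below traces them under assumptions that are not taken from the figure: a kernel size of 4 with padding 1 for the strided discriminator layers, a kernel size of 3 with padding 1 for the generator convolutions, an upsampling factor of two, and four discriminator layers shown for brevity.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# Discriminator: each stride-2 convolution halves the resolution.
discriminator_sizes = [256]
for _ in range(4):  # hypothetical number of layers
    discriminator_sizes.append(
        conv_output_size(discriminator_sizes[-1], kernel=4, stride=2, padding=1)
    )

# Generator: upsample by two, then a stride-1 convolution that keeps the size.
generator_sizes = [1]
for _ in range(8):  # 2 ** 8 == 256, so eight doublings lead from 1 to 256
    upsampled = generator_sizes[-1] * 2
    generator_sizes.append(
        conv_output_size(upsampled, kernel=3, stride=1, padding=1)
    )

print(discriminator_sizes)
print(generator_sizes)
```

With a factor-two upsampling, exactly eight upsampling steps are needed to get from a 1 × 1 input to a 256 × 256 output.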

The authors have not proposed a different objective function from the one in Standard GAN. Therefore, in practice, training a DCGAN with the described architecture is performed by the use of the NSOF.

**3.3.5 Pix2Pix**

The Pix2Pix framework [26] performs image-to-image translation: the GAN is applied in a conditional setting, generating high-dimensional output images that are conditioned on corresponding input images. While Standard GAN and DCGAN aim to unconditionally transform a low-dimensional random noise input z into a higher-dimensional output G(z), the conditional GAN learns a mapping from an observed input image x together with z to generate G(x, z). The training data must consist of input-target image pairs so that the image translation from one domain to the other can be learned [54].

The Pix2Pix generator maps an input image to an output image using the U-Net [47] architecture. As in a standard encoder-decoder network [22], the input is downsampled through a series of convolutional layers until it reaches a bottleneck layer from which it is upsampled again. Given the image-to-image translation problem, the input and target images differ in the appearance of the surface but are based on the same underlying structure. It is therefore desirable that common low-level information flows directly across the network. Consequently, the U-Net architecture allows the generator to bypass the bottleneck layer for this shared information by adding skip connections between each layer i and layer n − i, where n is the total number of layers, to the encoder-decoder network. The selections of batch normalization and activation functions for the layers are strongly oriented towards DCGAN. Instead of providing the noise vector z directly as an additional input to the generator, it is implemented in the form of dropout and applied on the first three convolutional layers after the bottleneck [26]. Figure 3.9 demonstrates the architecture of the generator.

Figure 3.9: Pix2Pix generator. Starting from an input image, it is encoded using several convolutional layers in the same way as in the DCGAN discriminator. Again, batch normalization is not applied to the first convolutional layer. Leaky ReLU is applied to all convolutional layers up to the bottleneck layer that returns 1 × 1 feature maps. From this point on, the feature maps are decoded in the same way as in the DCGAN generator using the combination of upsampling and convolutional layers. Again, ReLU is used to activate the convolutional outputs. However, as in the DCGAN generator, the last convolutional layer uses tanh instead. Dropout is applied to the first three convolutional layers of the decoding part. A skip connection within this network is illustrated by concatenating the two 2 × 2 × 512 outputs of the convolutional layers to one 2 × 2 × 1024 output. The network could also be represented in the shape of a U, where the layers to be concatenated are positioned on the same level.
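The pairing of layers by skip connections and the effect of concatenation on the feature-map shape can be sketched as follows. The total of 16 layers is a hypothetical choice (eight encoding and eight decoding layers), and both function names are introduced only for this illustration.

```python
def skip_pairs(n_layers):
    """Pair each encoder layer i with decoder layer n - i (the bottleneck is unpaired)."""
    return [(i, n_layers - i) for i in range(1, n_layers // 2)]


def concat_channels(shape_a, shape_b):
    """Shape (height, width, channels) after concatenating along the channel axis."""
    height_a, width_a, channels_a = shape_a
    height_b, width_b, channels_b = shape_b
    if (height_a, width_a) != (height_b, width_b):
        raise ValueError("skip connections need matching spatial sizes")
    return (height_a, width_a, channels_a + channels_b)


print(skip_pairs(16))  # hypothetical n = 16; layer 8 is the bottleneck
print(concat_channels((2, 2, 512), (2, 2, 512)))
```

Because the paired layers have the same spatial size, concatenation only increases the number of channels, for example from 512 to 1024 for two 2 × 2 × 512 outputs.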

Because of the way the discriminator is designed, it is referred to as PatchGAN. The unknown image, which is either a fake image from the generator or the target image, is concatenated with the input image and then classified. However, instead of classifying the image as a whole, each n × n patch of the image is classified separately. Given an image of the shape 256 × 256, the discriminator by default returns a 30 × 30 image in which each pixel corresponds to a 70 × 70 patch of the unknown image. Again, the architecture of the stacked convolutional layers is strongly oriented towards DCGAN. Batch normalization is not applied to the first convolutional layer. Except for the last layer, whose output is forwarded to the sigmoid activation function, Leaky ReLU is used throughout the network. Consequently, with the difference that it does not return a single output but a 30 × 30 image, the structure of the discriminator is similar to the one from DCGAN shown in Figure 3.8 [26].
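The relation between the 256 × 256 input, the 30 × 30 output, and the 70 × 70 patch size can be checked with a short calculation. The layer configuration below (five convolutions with kernel size 4, strides 2, 2, 2, 1, 1, and padding 1) is an assumption of this sketch based on the commonly used PatchGAN setup; it is not stated in the text above.

```python
# (kernel, stride) per convolutional layer; assumed configuration, see lead-in.
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]


def output_size(size, layers, padding=1):
    """Spatial size along one axis after passing through all layers."""
    for kernel, stride in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


def receptive_field(layers):
    """Number of input pixels along one axis that influence a single output pixel."""
    field = 1
    # Walk backwards from one output pixel to the input image.
    for kernel, stride in reversed(layers):
        field = (field - 1) * stride + kernel
    return field


print(output_size(256, PATCHGAN_LAYERS))  # 30
print(receptive_field(PATCHGAN_LAYERS))   # 70
```

Under these assumptions, each of the 30 × 30 output values depends on a 70 × 70 region of the concatenated input, which matches the patch size given above.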

Generally, the objective function of a conditional GAN is given by

$$\min_G \max_D \mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))] \tag{3.22}$$

where x is the input image, y the target image, and z the provided noise. Thus, it corresponds to the SOF from Eq. 3.4, extended by the inclusion of the input image. However, given the vanishing gradient problem of the SOF, the objective function can be changed in accordance with Standard GAN or DCGAN so that the generator maximizes log D(x, G(x, z)) instead of minimizing log(1 − D(x, G(x, z))). Again, the classification of the real images, D(x, y), is only used to train the discriminator and not the generator.
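The motivation for this change can be illustrated numerically. The sketch assumes that the discriminator output is D = sigmoid(a) for a pre-activation value a (called `logit` below) and compares the derivatives of both generator objectives with respect to a; the function names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def saturating_grad(logit):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a)."""
    return -sigmoid(logit)


def non_saturating_grad(logit):
    """d/da log(sigmoid(a)) = 1 - sigmoid(a)."""
    return 1.0 - sigmoid(logit)


# Early in training the discriminator rejects fakes confidently: D is close to 0.
logit = -6.0
print("D(x, G(x, z))       :", sigmoid(logit))
print("saturating grad     :", saturating_grad(logit))
print("non-saturating grad :", non_saturating_grad(logit))
```

When the discriminator is confident that an image is fake, the derivative of log(1 − D) is close to zero, whereas the derivative of log D stays close to one, so the generator still receives a useful learning signal.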


Figure 3.10: Pix2Pix architecture. The generator G and discriminator D are trained using pairs of input images x and target images y. Given an input image, the generator with its U-Net architecture, in which noise z is provided via dropout, generates a fake image G(x, z). The L1-distance between the generated fake and the target is computed. The discriminator in form of a PatchGAN concatenates the fake image with the input image and outputs a 30 × 30 image (D(x, G(x, z))). Based on the classification of each patch combined with the computed L1-distance, the generator is updated. Within the same iteration, the target image is also given to the discriminator. Again, after concatenating it with the input image, the patches of the target get classified (D(x, y)). Based on the classification of the fake and the target image patches, the discriminator is updated.

The objective function of the generator used in Pix2Pix is a mix of the objective function of a conditional GAN with the L1-distance which is given by

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\lVert y - G(x, z) \rVert_1]. \tag{3.23}$$

It penalizes the distance between the target images and the images generated from the given input images. The combined generator objective function is then expressed as

$$\mathcal{L}_{P2P}(G) = \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G) \tag{3.24}$$

where λ is the regularization constant to adjust the weight of the L1-distance. In this way, the generator is not only forced to synthesize images which the discriminator fails to classify correctly but which also get very close to the actual paired target image [26].
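A minimal numerical sketch of the combined generator objective for a single image pair is given below. Several choices are assumptions of this illustration and not taken from the text: the adversarial term is written in the non-saturating form as a quantity to be minimized, the L1 term is the plain sum of absolute differences (implementations often average over pixels instead), the weight λ = 100 is a hypothetical default, and all values are made up.

```python
import math


def l1_distance(target, generated):
    """L1 norm of the difference between target and generated pixel values."""
    return sum(abs(t - g) for t, g in zip(target, generated))


def generator_loss(patch_probs_fake, target, generated, lam=100.0):
    """Adversarial term (non-saturating, averaged over patches) plus weighted L1 term."""
    adversarial = -sum(math.log(p) for p in patch_probs_fake) / len(patch_probs_fake)
    return adversarial + lam * l1_distance(target, generated)


# Hypothetical toy values: three pixels and two discriminator patch outputs.
target = [0.0, 0.5, 1.0]
generated = [0.1, 0.5, 0.8]
patch_probs_fake = [0.5, 0.5]  # D(x, G(x, z)) per patch

print(l1_distance(target, generated))
print(generator_loss(patch_probs_fake, target, generated))
```

With a large λ, the L1 term dominates the loss, which pushes the generator towards outputs that stay close to the paired target image while the adversarial term rewards outputs that the discriminator accepts as real.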

Given the descriptions of the generator and discriminator architectures and the modified objective function, Figure 3.10 summarizes the Pix2Pix architecture in the context of the conditional setting.