
LIU-IMT-TFK-A-20/586-SE

GAN-Based Synthesis of Brain Tumor Segmentation Data

Augmenting a dataset by generating artificial images

Linköping University

Supervisor
Anders Eklund, Department of Biomedical Engineering (IMT), Linköping University

Examiner
Magnus Borga, Department of Biomedical Engineering (IMT), Linköping University

Machine learning applications within medical imaging often suffer from a lack of data, as a consequence of restrictions that hinder the free distribution of patient information. In this project, GANs (generative adversarial networks) are used to generate data synthetically, in an effort to circumvent this issue. The GAN framework PGAN is trained on the brain tumor segmentation dataset BraTS to generate new, synthetic brain tumor masks with the same visual characteristics as the real samples. The image-to-image translation network SPADE is subsequently trained on the image pairs in the real dataset, to learn a transformation from segmentation masks to brain MR images, and is in turn used to map the artificial segmentation masks generated by PGAN to corresponding artificial MR images. The images generated by these networks form a new, synthetic dataset, which is used to augment the original dataset.

Different quantities of real and synthetic data are then evaluated in three different brain tumor segmentation tasks, where the image segmentation network U-Net is trained on this data to segment (real) MR images into the classes in question. The final segmentation performance of each training instance is evaluated over test data from the real dataset with the Weighted Dice Loss metric.


full dataset was made available.

A majority of the generated segmentation masks appear visually convincing to an extent (although somewhat noisy with regard to the intra-tumoral classes), while a relatively large proportion appear heavily noisy and corrupted. However, the translation of segmentation masks to MR images via SPADE proved more reliable and consistent.


I want to give my thanks to my supervisor Anders Eklund, for his responsive and useful support and guidance during the course of this project. I also want to thank David Abramian and Marco Cirillo for the implementation of the U-Net architecture that was used in this project. Finally, I would like to thank friends and family (my mother in particular) for the kindness and support I've received during the course of my five-year-long education.


Contents

1 Introduction
1.1 Aim
1.2 Problem statements

2 Theory
2.1 Machine learning
2.1.1 Supervised learning
2.1.2 Capacity, overfitting and underfitting
2.2 Neural networks
2.2.1 Activation functions
2.2.2 Gradient descent
2.3 Convolutional neural networks
2.3.1 Convolution
2.3.2 Pooling
2.3.3 Transposed convolution
2.3.4 Data augmentation
2.3.5 Batch normalization
2.4 Image segmentation
2.4.1 U-Net
2.4.2 Weighted Dice loss
2.5 Generative adversarial networks (GANs)
2.5.1 Wasserstein GAN
2.5.2 Progressively growing GAN (PGAN)
2.5.3 Conditional GAN
2.5.4 SPADE

3 Method
3.1 Dataset
3.1.1 Dataset split
3.2 Segmentation
3.3 PGAN
3.4 SPADE
3.5 Preprocessing

4 Results
4.1 PGAN
4.1.1 Full dataset
4.1.2 Reduced dataset
4.2 Preprocessing
4.3 SPADE
4.4 Segmentation

5 Conclusions
5.1 Discussion

References

1 Introduction

Figure 1.0.1: MR image of a brain (left) and its tumor segmentation (right). Segmentations of MR images can be performed with deep learning techniques; specifically, via the training of segmentation networks, which require large datasets of both MR images and corresponding annotations. In this project, synthetic MR images as well as their annotations (the orange, red and yellow parts of the right image) are created using GANs, in an effort to improve the training of a segmentation network.

Recent advances in deep learning and computer vision have demonstrated great promise when applied to the automation of tasks that are considered tedious, precise and time-consuming, and that may demand domain-specific expert knowledge to perform. This is relevant not least of all within the field of medical imaging, where such techniques could potentially be leveraged to assist medical experts in tasks and decisions that are especially sensitive and demanding (namely, in patient diagnosis). In the domain of medical image segmentation specifically, it is currently standard practice to annotate medical images such as MR scans by hand; a process which is time-consuming and to some extent subjective (i.e. different medical experts may create different annotations) – see figure 1.0.1 for an example. Thus, any advances towards the successful automation of these tasks would benefit both patients and medical experts greatly, and it is therefore of great interest to investigate if the encouraging results that have been achieved by deep learning and computer vision-based techniques can prove beneficial in this setting.

However, experimentation is impeded by the fact that medical data is especially scarce and difficult to obtain, since there are laws (e.g. GDPR) and other security considerations in place that hinder the free distribution of patient data (there are in fact few open neuroimaging datasets available that contain images and segmentations from more than 100 subjects). This is in contrast with other applications within computer vision, such as visual object recognition, which is supported by large, open datasets such as ImageNet [9] (which contains more than 14 million hand-annotated images). This limitation creates a problem, considering that even the most sophisticated machine learning models are only as good as the data they are exposed to.

One attempt at circumventing this issue involves the use of deep learning techniques to generate data artificially, in an effort to compensate for the lack of patient data in medical datasets. This could be achieved by utilizing GANs (Generative Adversarial Networks) – a deep learning framework that was proposed in 2014 as a method for generating highly realistic data given data sampled from the desired distribution, and which has since shown impressive and promising capabilities.

1.1 Aim

This thesis investigates the possibility and potential benefits of using GANs to synthesize new images from a real dataset of brain MR scans and corresponding hand-labeled segmentation masks (annotations), and evaluates the potential performance gains achieved when such synthetic images are used to augment the dataset on which a neural network is trained to solve a brain tumor segmentation task.

1.2 Problem statements

This thesis project focuses on the following questions:

• Can GANs be used to synthesize visually realistic MR images of brains and their corresponding tumor segmentation masks?

• Do these images improve the performance of a segmentation network, when used to augment a dataset of real images?

• How do different quantities of real and artificial images affect said performance?


2 Theory

This chapter contains theory relating to the concepts and methods that have been relevant in the execution of this project. It begins with an overview of basic machine learning concepts and terminology, followed by an introduction to neural networks, their design and the concepts surrounding them. This transitions into an introduction to convolutional neural networks and general deep learning concepts, followed by a section covering image segmentation, which focuses on the U-Net [27] architecture that has been employed in this project. The chapter subsequently directs its focus towards GANs [11], where the underlying theory is presented, followed by descriptions of the specific architectures (PGAN [16] and SPADE [25]) that have been used.

The contents of this chapter are targeted towards readers with engineering or mathematics backgrounds and no previous knowledge of machine learning or deep learning. Readers already familiar with a particular topic discussed in this chapter should feel free to skip it.


2.1 Machine learning

The term machine learning refers to methods within statistics and computer science that provide computers with the ability to solve problems and perform tasks without being given explicitly programmed solutions. This is achieved by exposing the machine learning model to data, in which the model detects patterns and underlying structures (a process referred to as "learning") which are used as a basis for a prediction or decision.

2.1.1 Supervised learning

Supervised machine learning methods are tasks where the model learns a mapping from input to output by being exposed to (or "trained" on) a training set consisting of pairs of input data and corresponding correct outputs (usually labeled by hand, by humans). A typical example is a classification task, where the model is trained to produce a correct label when given an input (e.g. to produce the word "dog" when shown an image of a dog, or to produce a value between 0 and 1 representing the risk for heart disease when given a vector of a patient's age, weight and cholesterol levels).

Formally, the supervised learning model is a mathematical function $f_\theta : X \to Y$ from input space $X$ to output space $Y$, that is shaped by its model parameters $\theta$ (e.g. regression parameters in a linear regression model, or weights and biases in a neural network – see section 2.2). The model is equipped with a loss function $L$ which measures some sort of "distance" $L(y, \hat{y})$ between the expected label $y$ and the predicted label $\hat{y} = f_\theta(x)$ (e.g. absolute value or squared difference). The goal of the training algorithm is to adjust, or fit, the model parameters $\theta$ to the training data, by minimizing the value of the loss function over the training data via some optimization procedure (e.g. maximum likelihood estimation, or backpropagation – see section 2.2.2).

Apart from the training set, two separate data sets (disjoint from the training set and each other) are utilized to measure the performance of the supervised learning model, during and after training respectively:

• The validation set, which is used to measure the current performance of the model during or between instances of training by feeding it to the model and subsequently applying some loss metric (for example, the loss function) to expected and predicted outputs, which in turn can be referred to when adjusting the settings (or hyperparameters) of the model.

• The test set, which is used strictly after all of the training and hyperparameter tuning has taken place, to evaluate the final performance of the machine learning model.

When a machine learning method does not make use of labeled examples, it is referred to as unsupervised learning. These are algorithms that specialize in analyzing the structure and finding relationships within an unlabeled dataset, without being given feedback during the learning process. Typical examples include clustering, probability density estimation and dimensionality reduction.

2.1.2 Capacity, overfitting and underfitting

The goal of any machine learning model is to perform well on new, previously unseen data, not just the data it has been trained on (in fact, this aspect is what separates a machine learning task from a pure optimization problem). The degree to which a model can express a large variety of different functions is called its capacity, or model complexity. A model's capacity reflects how well it is able to reduce its loss over the training data, since simpler models with fewer parameters (e.g. linear regression models as opposed to neural networks) will not be able to capture the complex relationships that may be present in a given dataset. However, it is possible to have too much of a good thing: if the model capacity is too great, the model will simply start to memorize the properties of the training set, and fail to adequately capture those of the validation or test set (i.e. data it has never encountered during training). This phenomenon is known as overfitting, and can be combated by choosing appropriately complex models for a given task, by putting constraints on the number or size of parameters in the model, or by refraining from training the model for too long. The opposite issue occurs when the model is too simple for the given task, which results in it failing to capture the properties of either the training set or the validation and test sets – this is referred to as underfitting.

2.2 Neural networks

A neural network (or artificial neural network) is a mathematical function consisting of layers of nodes, or neurons, that hold numbers, where each neuron in a given layer is connected by weights to the neurons in the succeeding layer. The value of a neuron is determined by a weighted sum of the neurons in the previous layer that are connected to it, where the coefficients of the sum are given by the weights that have been placed between each pair. After this sum has been calculated, it is passed through a so called activation function, usually a non-linear function, whose output is set as the final value of that neuron. In a so called fully connected network (or multilayer perceptron), each neuron in a given layer is connected by unique weights to each neuron in the following layer. The input to the neural network is given to the first layer, and these values are propagated forward through the so called hidden layers (the layers between the input and output layers), and finally arrive at the last layer, which holds the output of the network. See figure 2.2.1 for a diagram of a neural network with two hidden layers.

Figure 2.2.1: Artificial neural network with three inputs, two hidden layers and two outputs.

Formally, the value of neuron i in (non-input) layer k is given by:

$$h_i^{(k)} = \sigma\Big(\sum_{j=1}^{N} w_{ij}\, h_j^{(k-1)} + b_i\Big),$$

where $w_{ij}$ is the weight between neuron $j$ in the previous layer and neuron $i$ in the current layer, $b_i$ is a so called bias term (a constant value that is added to the sum), $N$ is the number of weights connected to the neuron in the current layer, and $\sigma(\cdot)$ is the activation function. In vector notation this simply becomes:

$$h^{(k)} = \sigma(W h^{(k-1)} + b),$$

where $W_{ij} = w_{ij}$, i.e. row $i$ of $W$ holds the weights that go to neuron $i$ in the current layer, and column $j$ holds the weights that come from neuron $j$ in the previous layer. The use of neural networks with multiple hidden layers is referred to as "deep learning".
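As a concrete illustration of this forward propagation (not taken from the thesis; the layer sizes follow figure 2.2.1, the hidden widths are chosen arbitrarily, and the weights are random), a minimal NumPy sketch:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Network shaped like figure 2.2.1: 3 inputs, two hidden layers (width 4), 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)),
           rng.standard_normal((4, 4)),
           rng.standard_normal((2, 4))]
biases = [rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(2)]

def forward(x, weights, biases):
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)  # h^(k) = sigma(W h^(k-1) + b), layer by layer
    return h

print(forward(np.array([0.5, -1.0, 2.0]), weights, biases))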

As indicated by their name, neural networks are loosely inspired by the known structure and mechanisms of animal brains. According to the universal approximation theorem, the type of neural network described in this section has the capacity to approximate any real-valued continuous function defined on a (compact) subset of $\mathbb{R}^n$ arbitrarily well. Thus, in theory, any such function can be represented by a neural network, given appropriate architecture and parameters (i.e. weights and biases) [8].

2.2.1 Activation functions

Mathematical activation functions are inspired by the creation of action potentials in biological neurons, which are signals that are activated and fired along a neuron whenever its level of stimulation exceeds a certain threshold. This particular mechanism is captured by a step function; by necessity however, gradient-based optimization methods (see section 2.2.2) usually make use of activation functions with meaningful derivatives. Some typical examples are as follows.

Sigmoid

$$f(x) = \frac{1}{1 + e^{-x}}$$

The sigmoid function is defined on the reals, and its output lies in $(0, 1)$, which can conveniently be interpreted as a probability.

Softmax

The softmax function $\sigma : \mathbb{R}^K \to \mathbb{R}^K$ takes a vector $z = (z_1, z_2, ..., z_K)$ and returns a vector $\sigma(z)$ of $K$ elements such that

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad \text{for } i = 1, ..., K;$$

i.e. it first applies exponentiation to each element in the input and then normalizes them by dividing by the sum of the exponentials – the resulting vector will have elements in the range $(0, 1)$ that sum to 1. This function is typically applied at the final layer of a neural network-based classifier, whose resulting values are interpreted as the network's confidence that its input belongs to a particular class, represented as a probability for each respective class.

ReLU

$$f(x) = \max(0, x)$$

The ReLU (or rectified linear unit) function is zero for negative inputs and the identity for positive inputs. This function has been shown to improve training performance when compared to networks that use activation functions with saturating derivatives such as the sigmoid [18], and has since been adopted as one of the most popular activation functions for deep neural networks.¹

¹ Note that this function is not differentiable everywhere (namely at $x = 0$). However, this does not pose a problem in practice, since $f'(0)$ can be set to an arbitrary value (e.g. 0, 0.5 or 1) without any issues.
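A minimal NumPy sketch of the three activation functions above (the max-subtraction in the softmax is a standard numerical-stability trick, assumed here rather than taken from the text):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtracting max(z) avoids overflow; result unchanged
    return e / e.sum()

def relu(x):
    return np.maximum(0.0, x)

z = np.array([1.0, 2.0, 3.0])
print(sigmoid(z))                    # elementwise, each value in (0, 1)
print(softmax(z))                    # non-negative values that sum to 1
print(relu(np.array([-1.0, 2.0])))   # [0. 2.]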


2.2.2 Gradient descent

In order to minimize a loss function over training data, machine learning models employ mathematical optimization algorithms. In neural networks, the preferred optimization procedure is gradient descent, in which the (negative) gradient of the loss function $L$ is calculated with respect to the weights and biases $w$ of the network, and is then used to iteratively update said weights and biases and guide them towards a local minimum²:

$$w_{n+1} = w_n - \gamma \nabla_w L(y, \hat{y}).$$

$\gamma$ is the so called learning rate, which determines the size of the step that is taken in the direction of the gradient, and is an important hyperparameter to tune: a learning rate that is too small will render progress steady but impractically slow, while a learning rate that is too large will make the weights oscillate or "zig-zag" over the error surface instead of steadily following a path towards a minimum.

In practice, the gradients are not calculated directly with respect to each individual weight in the network, since the large number of parameters would make this approach highly inefficient. Instead, the gradients are calculated one layer at a time, iterating backwards from the last layer and making efficient use of the chain rule to avoid redundant calculations. This method is called backpropagation, since the gradients are propagated backwards (from output to input) in the network after having been calculated.

A backpropagation step can be performed either by calculating the loss and corresponding gradients over all of the training data at once in every iteration, or by selecting a different subset (or batch) of the training data each time – the latter is less computationally intensive, but it might yield noisier gradients. In the latter case, the procedure is called stochastic gradient descent ("stochastic" since the subsets are randomly selected). The time, or number of iterations, required to go through all of the available data in such an optimization procedure is called an epoch.

² It is of interest to note that optimization methods that use information from the second derivative of the error function (such as Newton's method) are rarely used in deep learning, since the corresponding Hessian matrices would be rendered intractably large by the large number of network parameters.
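As a toy illustration of stochastic gradient descent with batches and epochs (the data, one-parameter model and learning rate below are made up for the example):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = 3.0 * x + 0.1 * rng.standard_normal(1000)   # underlying "true" weight is 3

w, gamma, batch_size = 0.0, 0.1, 32             # initial weight, learning rate, batch
for epoch in range(5):                          # one epoch = one pass over the data
    order = rng.permutation(len(x))             # random batches -> "stochastic"
    for i in range(0, len(x), batch_size):
        idx = order[i:i + batch_size]
        grad = np.mean(2 * (w * x[idx] - y[idx]) * x[idx])  # d/dw of squared error
        w -= gamma * grad                       # w_{n+1} = w_n - gamma * grad
print(w)                                        # close to 3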

2.3 Convolutional neural networks

A convolutional neural network is a type of neural network that is specialized in analyzing grid-like data, such as images. Whereas neurons in a traditional neural network layer are fully connected to every neuron in the succeeding layer and propagate their values forward via ordinary matrix multiplications, layers in a convolutional neural network are not fully connected and instead make use of the convolution operator to propagate values.

2.3.1 Convolution

In mathematics (and prominently in applications within signal processing and statistics), convolution (∗) is an operator acting on two functions (say, f and g) that produces a third function (f ∗ g) in the following manner:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau,$$

or in the discrete case:

$$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m].$$

In words, this operator accepts two functions $f$ and $g$, reflects $g$ about the vertical axis, shifts (or "slides") $g$ forward by $t$ (or $n$) units, and returns the integral (sum) of the product of the functions; the resulting function $f * g$ provides a mapping from the offset $t$ (or $n$) to the resulting integral (sum).

Figure 2.3.1: Illustration of the 2D convolution operator. Blue: input, red: kernel, green: output.

The convolution operator can be generalized to higher dimensions. When acting on 2D data such as images, the operation is performed between two discrete 2D "functions": an image (a matrix of pixel values) and a so called convolution kernel (or filter). In this context, convolution involves placing the kernel over the input image and performing element-wise multiplication between each overlapping element. The sum of these products becomes the pixel value of the output in the location corresponding to the placement of the kernel, and the rest of the output pixels are calculated in the same manner, i.e. by "sliding" the kernel over the input image³ – see figure 2.3.1 for an illustration. The following two hyperparameters are important when controlling the behaviour of a convolution kernel: the stride determines the width and height of the "steps" that are taken when sliding the kernel over the input, and the padding type affects the resolution of the output by padding the borders of the input with zeroes.

³ Observant readers may notice that this procedure makes no mention of the reflection or "flipping" of the elements in the kernel. Technically, this is not a true convolution, but what is called a cross-correlation. It is nevertheless referred to as a convolution by deep learning practitioners, since this aspect makes no practical difference in this context.
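A direct NumPy sketch of the 2D "convolution" described above (i.e. a cross-correlation, as noted in the footnote) with the stride and zero-padding hyperparameters; this is for clarity only, not the optimized routines used by deep learning libraries:

import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """'Convolution' as used in deep learning, i.e. cross-correlation:
    the kernel is slid over the image without being flipped."""
    if pad > 0:
        image = np.pad(image, pad)              # zero-pad the borders
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride + kh, j*stride:j*stride + kw]
            out[i, j] = np.sum(patch * kernel)  # elementwise product, then sum
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1., 0., -1.]] * 3)            # simple vertical edge detector
print(conv2d(img, edge, stride=1, pad=1).shape) # (4, 4): 'same'-style padding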

Image units in convolutional neural networks are usually composed of multiple arrays called channels, or feature maps, which are different representations of the same image bundled together (the most basic example is an image represented by three primary color channels: red, green and blue). When designing a convolutional layer, it is common practice to apply multiple convolutional filters at once – each of these filters is individually applied to every channel in the input image, and the results from the different channels are linearly combined into one output channel. The next channel in the output is formed by linearly combining the convolutions of the next filter with all of the respective input channels, and so on. During the training process, the different convolutional filters individually become specialized at extracting different kinds of features from the input.

Convolutional neural networks benefit from a number of advantages when compared to fully connected networks; these include:

• Sparse connectivity: When designing a fully connected neural network to receive image inputs with a typical resolution of thousands or millions of pixels, the number of weights and biases immediately becomes unreasonably large: if two consecutive layers in a fully connected network contain N and M neurons respectively, the number of weights between those layers is N × M. In a convolutional layer, the number of weights is simply the size of the convolutional kernel (which is determined by design, and is often as small as 3 × 3), times the number of filters. Apart from reducing computation and memory burdens, this greatly reduces the capacity for overfitting.

• Parameter sharing: Since convolution kernels are much smaller than the input, the detection of meaningful local features (such as edges) is facilitated by the convolution framework, since the same weights are applied locally in different parts of the image.

• Equivariance: The convolution operator commutes with translations⁴, i.e:

$$\tau_x(f * g) = (\tau_x f) * g = f * (\tau_x g),$$

where $\tau_x f$ is the translation of $f$ by $x$ (that is, $\tau_x f(y) = f(y - x)$). In the context of image processing, this has the following consequence: if every pixel in an image is shifted by a certain amount before undergoing convolution, the output will be equal to the convolution of the original image shifted by the same amount. Thus, if a convolutional network is trained to detect a certain object in an image, it will do so regardless of the position of the object. However, convolution is not equivariant to other transformations, such as scaling or rotation; these variations are handled by pooling techniques (see section 2.3.2), and in some cases data augmentation (see section 2.3.4).

The main unit of data in a deep learning framework is the tensor, which is a generalization of a matrix to higher dimensions, i.e. a multidimensional array⁵. A tensor organizes some data by its dimensions, or axes. For instance, a batch of images with three color channels can be packaged into a tensor by organizing them like (batch, width, height, channel), i.e. dimension 0 of the tensor corresponds to the samples in the batch, dimension 1 to the pixel values along the width of the images, and so on.

⁴ This is also true for the cross-correlation operator.

⁵ This is not to be equated with the concept of a tensor in algebra, which has additional properties.


2.3.2 Pooling

In a convolutional layer, the values typically undergo the following three operations in order: convolution, nonlinear activation and pooling. A pooling function is an operation that replaces each value in the activated output by a "summary" of the neighbourhood of that value. For instance, a max pooling operation is a window that slides over the output map and replaces each pixel with the maximum value of the pixels covered by the window, and an average pooling function replaces each pixel with the average of the pixels in the neighbourhood. Pooling has the important function of providing a network with robustness against small variations in the input to a layer, which is a property (formally called "invariance") that is not guaranteed by the convolution operator and the nonlinear activation functions by themselves. These variations include small translations or rotations, which are suppressed by the pooling function if applied successfully. Pooling also has the important effect of reducing the number of pixels in the outputs of the hidden layers (i.e. downsampling them), which reduces the number of parameters of the functions that act on them (e.g. convolution) and thus the total number of parameters of the entire network. This is especially significant when using fully connected layers before the output layer, which is common practice in deep convolutional neural networks.
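A minimal NumPy sketch of max pooling with a 2 × 2 window (assuming, for simplicity, input dimensions divisible by the window size):

import numpy as np

def max_pool(x, size=2):
    """Non-overlapping max pooling; assumes dimensions divisible by `size`."""
    h, w = x.shape
    # Reshape into size x size blocks and take each block's maximum.
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 1., 1., 0.],
              [2., 0., 0., 3.]])
print(max_pool(x))  # [[4. 8.]
                    #  [2. 3.]]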

2.3.3 Transposed convolution

When performing convolution on an image, the result is normally (depending on the padding type) slightly smaller than the input, since pixel values outside of the image are undefined. Transposed convolution⁶ is an operation that is similar to the convolution operator, but that goes "the other way around" by producing an output that is larger than the input image, while preserving the connectivity pattern of the convolution operator. The operation is equivalent to spacing out the elements and padding the borders of the input with a specified number of zeroes, before convolving the resulting image with a kernel – see figure 2.3.2. The resulting output is an image with a higher resolution than the input, such that the input can be obtained from it via convolution – in essence, transposed convolution is an operation that lets us recover the shape of a hypothetical image whose convolutional output is the input image. Just like with convolution, the weights of the transposed convolution kernel are shaped during training, and while convolution is useful for creating a low-dimensional abstract representation of the input, transposed convolution is useful for generating high resolution data from a low-dimensional input.

Figure 2.3.2: Illustration of a transposed convolution of a 3 × 3 kernel over a 2 × 2 input, with one zero inserted between each element and a two-pixel border of zeroes around it.

⁶ Also erroneously called the deconvolution operator, which is a completely different operation.
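The equivalence can be made concrete in NumPy: space out the input with zeros, pad the border, and convolve. The sketch below follows the configuration of figure 2.3.2 (3 × 3 kernel, 2 × 2 input, one inserted zero, a zero border of width two); it is illustrative, not the implementation used by any library:

import numpy as np

def transposed_conv2d(x, kernel, inserted_zeros=1, pad=2):
    """Upsample x by spacing it out with zeros, then apply an ordinary
    stride-1 'convolution' (cross-correlation) to the zero-padded result."""
    h, w = x.shape
    s = inserted_zeros + 1
    spaced = np.zeros((h + (h - 1) * inserted_zeros,
                       w + (w - 1) * inserted_zeros))
    spaced[::s, ::s] = x                        # insert zeros between the elements
    spaced = np.pad(spaced, pad)                # zero border around the input
    kh, kw = kernel.shape
    out = np.zeros((spaced.shape[0] - kh + 1, spaced.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(spaced[i:i + kh, j:j + kw] * kernel)
    return out

x = np.array([[1., 2.], [3., 4.]])
print(transposed_conv2d(x, np.ones((3, 3))).shape)  # (5, 5): larger than the input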


2.3.4 Data augmentation

In an effort to provide additional robustness against variations in the input, and to otherwise reduce the risk of overfitting, one might augment the dataset in question by generating "new" data from it by means of simple mathematical operations. This typically involves adding translated, rotated, scaled, flipped or deformed versions of the data to the dataset in order to artificially increase its size and variation, without actually obtaining new data. This can be very useful if the dataset is small or otherwise lacks variation (e.g. 100 pictures of cats that are all aligned at the same position in the image), and can improve the performance of the network by a significant amount at very little cost, if applied successfully. More advanced methods of data augmentation are also possible, such as augmentation via GANs, which is the main topic of this text and which will be described in detail in chapter 3.
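A sketch of classical augmentation for image/mask pairs of the kind used in segmentation (the transforms and probabilities are arbitrary example choices; note that the same transform must be applied to the image and its mask):

import numpy as np

def augment(image, mask, rng):
    """Randomly flip and rotate an image together with its segmentation mask.
    The same transform is applied to both, so the pair stays consistent."""
    if rng.random() < 0.5:
        image, mask = np.fliplr(image), np.fliplr(mask)
    k = rng.integers(0, 4)                      # rotate by 0, 90, 180 or 270 degrees
    return np.rot90(image, k), np.rot90(mask, k)

rng = np.random.default_rng(0)
image = np.zeros((256, 256))                    # placeholder MR slice
mask = np.zeros((256, 256), dtype=np.uint8)     # placeholder segmentation mask
aug_image, aug_mask = augment(image, mask, rng)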

2.3.5 Batch normalization

Batch normalization is a technique that has been shown to improve the speed, performance and stability of neural networks. It was introduced in 2015 [14] and has since become widely adopted, although the reasons why it is effective are still a topic of discussion. The motivation for batch normalization is the following:

When training a neural network, all of the weights of the different layers are changed simultaneously. This means that the distribution of the activations of each layer changes as well during training, which can pose a problem since this forces the network to adapt to constantly changing inputs. This problem is dubbed internal covariate shift, and is especially significant in deep networks, where small changes in shallower layers are amplified and propagated through the network and result in a large shift in the deeper layers. Batch normalization was proposed to combat this effect by normalizing the distributions of the activations at training time, in the following manner:

Let $B$ be a batch of activations of size $m$ at a given layer, and calculate the empirical mean and variance over $B$:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \quad \text{and} \quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2.$$

The mean and variance are calculated over each channel of $B$ separately, resulting in two vectors of values. Let $d$ be the dimension (number of channels) of the input, i.e. $B_i = (x^{(1)}, x^{(2)}, ..., x^{(d)})_i$, and normalize each dimension of the input separately:

$$\hat{x}_i^{(k)} = \frac{x_i^{(k)} - \mu_B^{(k)}}{\sqrt{\big(\sigma_B^{(k)}\big)^2 + \epsilon}}, \quad \text{for } k \in [1, d] \text{ and } i \in [1, m],$$

where $\epsilon$ is a small constant added for numerical stability. The resulting normalized activation $\hat{x}_i^{(k)}$ has zero mean and unit variance, and in order to restore its representational power (i.e. by allowing activations to have other means and variances), the following transformation is applied:

$$y_i^{(k)} = \gamma^{(k)} \hat{x}_i^{(k)} + \beta^{(k)},$$

where $\gamma^{(k)}$ and $\beta^{(k)}$ are parameters that are learned during training.

One might question the point of normalizing the activation to begin with, if the mean is to be set to an arbitrary value afterwards regardless. The reason is that the original mean of $B$ is determined by complicated interactions in the weights preceding $B$ that are difficult to control, whereas it is determined solely by $\beta$ after batch normalization, and the reparametrization is consequently much easier to learn via gradient descent.
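A NumPy sketch of the batch normalization forward pass for a batch of shape (m, d); in practice γ and β would be updated by backpropagation rather than fixed as here:

import numpy as np

def batch_norm(B, gamma, beta, eps=1e-5):
    """Normalize each of the d channels of an (m, d) batch, then rescale."""
    mu = B.mean(axis=0)                         # per-channel empirical mean
    var = B.var(axis=0)                         # per-channel empirical variance
    x_hat = (B - mu) / np.sqrt(var + eps)       # zero mean, unit variance
    return gamma * x_hat + beta                 # learned shift and scale

rng = np.random.default_rng(0)
B = 10.0 * rng.standard_normal((8, 4)) + 5.0    # batch of m = 8, d = 4 channels
y = batch_norm(B, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 and ~1 per channel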

2.4 Image segmentation

In image processing, image segmentation is the process of analyzing a visual input and dividing it into different regions, or segments, such that pixels within the same segment share certain characteristics. Formally, an image segmentation task is the problem of assigning a label to each pixel in an image according to some goal or criteria. So called semantic image segmentation specifically involves identifying what class or object each part of the image belongs to, and categorizing it accordingly – this task is equivalent to an ordinary classification task, where each pixel in the input is assigned a label describing which class it belongs to. The segmentation tasks in this project are exclusively of this kind, so the "semantic" descriptor is dropped throughout the rest of this text, and all problems of this kind are simply referred to as "image segmentation", or just "segmentation".

2.4.1 U-Net

Traditional image segmentation methods mainly make use of the intensity information of the pixels in an image, while it is known that humans take advantage of other kinds of knowledge when performing image segmentation by hand. Neural network-based methods are expected to overcome these issues, and have indeed been successful when applied to image segmentation problems. U-Net [27] is a convolutional neural network that was developed for biomedical image segmentation (specifically, to detect cell boundaries in biomedical images), that takes an image as an input and outputs a map of labels for each pixel (or a so called segmentation mask).


Figure 2.4.1: Illustration of the U-Net architecture. Blue boxes depict multi-channel feature maps; the corresponding number of channels is denoted on top of each box. The image resolutions are indicated by the heights of the boxes. White boxes represent feature maps that have been copied via the skip connections. The arrows denote the different operations. The image is inspired by figure 1 in [27].

U-Net follows a so called autoencoder structure that consists of two sub-structures: an encoder part that consists of a sequence of convolutional and max pooling layers intended to reduce the input to a set of abstract, low-dimensional representations of the original input, followed by a decoder part that consists of a sequence of transposed convolution layers intended to upsample said feature maps into the final output. The final output is one-hot encoded, i.e. it is a binary-valued multi-channel output where the channels correspond to the different classes and the ones and zeros in each channel signify the existence or absence of that particular class at a particular coordinate.

In an effort to avoid losing high-resolution information in the encoding process, and to preserve details that could otherwise be lost when downsampling the inputs via convolution and pooling, so called skip connections are placed between convolution and transposed convolution layers of the same shape. These simply involve saving a copy of each encoder layer before max pooling, and concatenating (i.e. stacking) them channel-wise with decoder layers of the same shape, before performing transposed convolution. See figure 2.4.1 for an illustration of the U-Net architecture.

2.4.2 Weighted Dice loss

When selecting a loss function for a segmentation network, one might naïvely choose to use the pixel accuracy metric, i.e. the percentage of correctly classified pixels in the network’s output image when compared to the correct output (or ground truth). However, this metric is unable to handle datasets with imbalanced classes, i.e. datasets with images that are dominated by one or a few classes while other classes constitute only a small part of the respective images they appear in. For instance, it is often the case that most images in a dataset are dominated by the background, in which case the network will learn to consistently output completely blank images since these consequently yield a high pixel accuracy.

One of many loss functions that account for this type of class imbalance is the weighted Dice loss [28] (also known as the weighted F1 score or generalized Dice overlap):

$$L(y, \hat{y}) = 1 - \frac{2 \sum_c \big( w_c \sum_{x,y} y_{x,y,c}\, \hat{y}_{x,y,c} \big)}{\sum_c \big( w_c \sum_{x,y} (y_{x,y,c} + \hat{y}_{x,y,c}) \big) + \epsilon},$$

where $(\cdot)_{x,y,c}$ is the (binary) value of the element at row $x$, column $y$ and channel $c$, and $w$ is a vector of class weights that is calculated in advance over all of the available training data:

$$w = \frac{\text{(total number of pixels in dataset)}}{\text{(total number of classes)} \cdot v + \epsilon},$$

where $v$ is a vector that counts the number of occurrences of each class in the dataset, i.e. element 0 holds the number of occurrences of class 0, element 1 holds the number of occurrences of class 1, and so on. The depicted "division" is performed element-wise over $v$, and the result is the vector $w$. In both equations, a small constant $\epsilon$ is added in the denominator for numerical stability.
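A NumPy sketch of these two computations for integer-labeled and one-hot encoded masks; this is an illustrative reimplementation, not the exact code used in the project:

import numpy as np

def class_weights(masks, eps=1e-8):
    """masks: integer-labeled training masks of shape (n, h, w)."""
    v = np.bincount(masks.ravel())              # occurrences of each class
    return masks.size / (len(v) * v + eps)      # (total pixels) / (classes * v + eps)

def weighted_dice_loss(y, y_hat, w, eps=1e-8):
    """y, y_hat: one-hot tensors of shape (h, w, c); w: per-class weights."""
    intersection = np.sum(w * np.sum(y * y_hat, axis=(0, 1)))
    total = np.sum(w * np.sum(y + y_hat, axis=(0, 1)))
    return 1.0 - 2.0 * intersection / (total + eps)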

2.5 Generative adversarial networks (GANs)

A generative adversarial network (GAN) is a machine learning framework that was proposed in 2014 as a method for learning the underlying probability distribution of a high-dimensional training set, and generating artificial examples from the same distribution that are not part of the original dataset [11]. In its most basic incarnation, a GAN consists of two neural networks: a generator and a discriminator. The generator's (G) role is to capture the data distribution by learning to output examples from its estimation of said distribution, and the discriminator's (D) role is to estimate the probability that a given sample came from the (real) training data, instead of from G. The two networks are trained by making them "compete" against each other in a two-player game, where the objective of D is to maximize the probability that it assigns correct labels to both training examples and outputs from G, and the objective of G is to maximize the probability that D makes mistakes.

Specifically, the role of $G$ is to transform a random distribution $p_z$ (usually a simple noise distribution, such as a Gaussian) into $p_g$, by taking a sample $z \sim p_z$ as an input and mapping it to a sample $x = G(z)$ from $p_g$, which is the generator's estimate of the true data distribution $p_d$. $D(x)$ represents the probability that $x$ comes from $p_d$ instead of from $p_g$, and $D$ is trained to maximize the probability of assigning the correct label when given both real examples and samples from $G$, which is simultaneously trained to minimize $\ln(1 - D(G(z)))$. This minimax game can be formulated as the following objective:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_d}[\ln D(x)] + \mathbb{E}_{z \sim p_z}[\ln(1 - D(G(z)))]. \tag{2.1}$$

In practice, D is not optimized to completion in the inner loop of the training procedure (this would be computationally impractical and would also lead to overfitting). Instead, the training procedure alternates between k steps of optimizing D, and one step of optimizing G – this has the effect of maintaining D near its optimal solution, as long as G changes slowly enough. This strategy is summarized in the following algorithm.

Stochastic gradient descent training of GANs

1: for number of training iterations do
2:   for $k$ steps do
3:     Sample a batch of $m$ noise samples $\{z^{(1)}, z^{(2)}, ..., z^{(m)}\}$ from $p_z$.
4:     Sample a batch of $m$ training examples $\{x^{(1)}, x^{(2)}, ..., x^{(m)}\}$ from $p_d$.
5:     Update the discriminator by ascending its gradient:
       $w_D \leftarrow w_D + \gamma \nabla_{w_D} \frac{1}{m} \sum_{i=1}^{m} \big[\ln D(x^{(i)}) + \ln(1 - D(G(z^{(i)})))\big]$.
6:   end for
7:   Sample a batch of $m$ noise samples $\{z^{(1)}, z^{(2)}, ..., z^{(m)}\}$ from $p_z$.
8:   Update the generator by descending its gradient:
       $w_G \leftarrow w_G - \gamma \nabla_{w_G} \frac{1}{m} \sum_{i=1}^{m} \ln(1 - D(G(z^{(i)})))$.
9: end for

It can be shown theoretically that the minimax game given by objective 2.1 has the unique global optimum $p_g = p_d$, and that the algorithm converges to this solution if $G$ and $D$ have sufficient capacity, and if $D$ is allowed to reach its optimum (given $G$) at every step of the algorithm [11].

In practice, when implemented exactly as described in this section, GANs tend to become unstable and difficult to train – $D$ must be steadily synchronized with $G$ in order for the training dynamics to remain stable, and the balance can be delicate. In particular, one must refrain from training $G$ too much without updating $D$, in order to avoid succumbing to the so called mode collapse problem, which is a scenario where $G$ maps too many values of $z$ to just a few different kinds of outputs. Fortunately, significant progress has been made in alleviating training instability of this kind since the original paper was published.
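As an illustration of this alternating scheme, a condensed TensorFlow/Keras sketch with made-up toy networks and dimensions (the thesis uses Keras only for the segmentation network; this block is purely illustrative):

import tensorflow as tf

# Toy generator and discriminator; all dimensions are arbitrary examples.
G = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu", input_shape=(16,)),
                         tf.keras.layers.Dense(2)])
D = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu", input_shape=(2,)),
                         tf.keras.layers.Dense(1, activation="sigmoid")])
opt_d = tf.keras.optimizers.Adam(1e-4)
opt_g = tf.keras.optimizers.Adam(1e-4)

def train_step(x_real, m=32, k=1):
    for _ in range(k):                          # k discriminator steps...
        z = tf.random.normal((m, 16))
        with tf.GradientTape() as tape:
            # Ascend ln D(x) + ln(1 - D(G(z))) by descending its negation.
            loss_d = -tf.reduce_mean(tf.math.log(D(x_real) + 1e-8)
                                     + tf.math.log(1.0 - D(G(z)) + 1e-8))
        opt_d.apply_gradients(zip(tape.gradient(loss_d, D.trainable_variables),
                                  D.trainable_variables))
    z = tf.random.normal((m, 16))
    with tf.GradientTape() as tape:             # ...then one generator step
        loss_g = tf.reduce_mean(tf.math.log(1.0 - D(G(z)) + 1e-8))
    opt_g.apply_gradients(zip(tape.gradient(loss_g, G.trainable_variables),
                              G.trainable_variables))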

2.5.1 Wasserstein GAN

It can be shown [11] that optimizing the standard GAN objective 2.1 with an optimal discriminator is equivalent to minimizing the so called Jensen-Shannon divergence between the probability distributions $p_g$ and $p_d$:

$$JS(p_d, p_g) = \frac{1}{2} KL(p_d, p_m) + \frac{1}{2} KL(p_g, p_m),$$

where $p_m = \frac{p_d + p_g}{2}$, and $KL$ is the so called Kullback-Leibler divergence:

$$KL(p, q) = \int p(x) \ln\Big(\frac{p(x)}{q(x)}\Big)\, dx.$$

Both of these divergences are measures of "distance" between probability distributions. It is possible to train GANs with respect to other distance measures than JS, and this possibility was investigated in [1], where this divergence was replaced with the so-called Wasserstein distance:

$$W(p_d, p_g) = \inf_{\gamma \in \Pi(p_d, p_g)} \mathbb{E}_{(x, y) \sim \gamma}[\lVert x - y \rVert],$$

where $\Pi(p_d, p_g)$ is the set of all joint distributions $\gamma(x, y)$ whose respective marginal distributions are $p_d$ and $p_g$. The motivation for the definition of this metric can be understood by considering the optimal transport problem: consider a distribution of "mass" $\mu(x)$ on a space $X$ (for example, a one-dimensional probability distribution, or in our specific case a probability distribution on the space of possible images). We wish to transport this mass in such a way that it is transformed into the distribution $\nu(x)$, defined on the same space. This problem can be imagined as the task of moving a pile of earth in the shape of $\mu$ into a hole in the ground in the shape of $\nu$, such that both the pile of earth and the hole in the ground vanish completely when the task is finished. A transport plan can be described by a function $\gamma(x, y)$, which gives an amount of mass to move from $x$ to $y$. The following two constraints are natural, and necessary in order for such a plan to be meaningful:

• The amount of mass moved from point x needs to be equal to the amount that was there to begin with.

• The amount of mass moved into point y needs to be equal to the depth of the hole that was there to begin with.

Mathematically, this means:

$$\int \gamma(x, y)\, dy = \mu(x) \quad \text{and} \quad \int \gamma(x, y)\, dx = \nu(y).$$

These constraints are equivalent to the requirement that $\gamma$ is a joint probability distribution with respective marginals $\mu$ and $\nu$. If we define the "cost" of moving $x$ to $y$ simply as $\lVert x - y \rVert$ (using some appropriate norm $\lVert \cdot \rVert$ on $X$), the cost of the transport plan $\gamma$ becomes:

$$\iint \lVert x - y \rVert\, \gamma(x, y)\, dx\, dy = \mathbb{E}_{(x, y) \sim \gamma}[\lVert x - y \rVert],$$

and thus the cost of the optimal transport plan is the greatest lower bound of the costs over all possible transport plans $\gamma$, i.e:

$$\inf_{\gamma \in \Pi(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma}[\lVert x - y \rVert],$$

where $\Pi(\mu, \nu)$ is the set of all transport plans (or equivalently, joint distributions with respective marginals $\mu$ and $\nu$).

In practice, this infimum is highly intractable, and provides only theoretical guidance rather than a means of calculation. The authors of [1] point to the Kantorovich-Rubinstein duality, which states:

$$W(p_d, p_g) = \sup_{\lVert f \rVert_L \leq 1} \mathbb{E}_{x \sim p_d}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)],$$

where the supremum is over the 1-Lipschitz functions, i.e. functions $f : X \to \mathbb{R}$ such that $|f(x_1) - f(x_2)| \leq \lVert x_1 - x_2 \rVert$ for all $x_1, x_2 \in X$. If $\lVert f \rVert_L \leq 1$ is instead replaced with $\lVert f \rVert_L \leq K$ (where $K$ is some positive constant), the left hand side becomes $K \cdot W(p_d, p_g)$. By this observation, the authors justify constructing a family of functions $\{f_w\}_{w \in \mathcal{W}}$ parametrized by the weights $w$, and instead solving the following problem:

$$\max_{w \in \mathcal{W}} \mathbb{E}_{x \sim p_d}[f_w(x)] - \mathbb{E}_{x \sim p_g}[f_w(x)].$$

If the original supremum is attained in such a solution, it provides a calculation of $W(p_d, p_g)$ up to some multiplicative constant.

To approximate the function $f_w$ that solves the maximization problem, the authors propose learning it with a neural network parametrized with weights $w$. This network is dubbed the "critic", and replaces the role of the standard discriminator in this framework. In order to enforce the Lipschitz constraint on $f_w$, the authors force the parameters $w$ to lie in a compact space by clipping the weights of $f_w$, i.e. by forcing each of them to lie in a closed interval $[-c, c]$. The complete training strategy is summarized in the following algorithm:

Wasserstein GAN (WGAN)

1: while $G$ has not converged do
2:   for $t = 0, ..., n_{critic}$ do
3:     Sample $m$ training examples $\{x^{(1)}, x^{(2)}, ..., x^{(m)}\}$ from $p_d$.
4:     Sample $m$ noise samples $\{z^{(1)}, z^{(2)}, ..., z^{(m)}\}$ from $p_z$.
5:     Update the critic:
       $w \leftarrow w + \gamma \nabla_w \big[\frac{1}{m} \sum_{i=1}^{m} f_w(x^{(i)}) - \frac{1}{m} \sum_{i=1}^{m} f_w(G(z^{(i)}))\big]$.
6:     $w \leftarrow \text{clip}(w, -c, c)$
7:   end for
8:   Sample $m$ noise samples $\{z^{(1)}, z^{(2)}, ..., z^{(m)}\}$ from $p_z$.
9:   Update the generator:
     $w_G \leftarrow w_G - \gamma \nabla_{w_G} \big[-\frac{1}{m} \sum_{i=1}^{m} f_w(G(z^{(i)}))\big]$.
10: end while

In contrast with the standard GAN framework, one is here encouraged to train the critic to optimality; when optimizing the JS divergence, a typical scenario is that the discriminator learns too quickly, which causes its generator gradients to saturate to zero. In other words: no matter how we change the weights of $G$, the gradients of $\ln(1 - D(G(z^{(i)})))$ in this scenario will be too small, since $D$ is too good at recognizing fakes from $G$ in its current state. It is demonstrated in [1] that this problem is alleviated with the WGAN framework, and that the generator can still learn even when the critic performs well. Furthermore, and perhaps more importantly, this also prevents the mode collapse problem, of which no sign is shown in the authors' experiments. Yet another benefit of the Wasserstein distance is the fact that it happens to correlate remarkably well with the perceived image quality of the generated images, which is not the case with images generated via the JS objective.

However, the remaining issue with this method is that weight clipping is a poor (albeit simple) way to enforce the Lipschitz constraint. An alternative was proposed by Gulrajani et al. [12], who noted that a differentiable function is 1-Lipschitz if and only if the norm of its gradient is at most 1 everywhere. Based on this fact, they suggested adding a term to the WGAN objective that penalizes the norm of the gradient of the critic with respect to its input. The resulting framework is called WGAN-GP (where "GP" is short for "Gradient Penalty"), and it is demonstrated that this method performs better than the standard WGAN method that employs weight clipping.
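A sketch of such a gradient penalty term in TensorFlow, for image batches of shape (batch, height, width, channels); following [12], the penalty is evaluated at random interpolates between real and generated samples, and the coefficient λ = 10 is the paper's default (the exact code used by PGAN is not reproduced here):

import tensorflow as tf

def gradient_penalty(critic, x_real, x_fake, lam=10.0):
    """WGAN-GP: penalize deviations of the critic's gradient norm from 1,
    evaluated at random interpolates between real and fake image batches."""
    eps = tf.random.uniform([tf.shape(x_real)[0], 1, 1, 1])
    x_hat = eps * x_real + (1.0 - eps) * x_fake      # random interpolates
    with tf.GradientTape() as tape:
        tape.watch(x_hat)
        score = critic(x_hat)
    grads = tape.gradient(score, x_hat)
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return lam * tf.reduce_mean(tf.square(norm - 1.0))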

2.5.2 Progressively growing GAN (PGAN)

Progressively growing GAN (Progressive GAN, or PGAN) is a GAN training methodology developed by NVIDIA in 2018 [16]. The key idea is to grow the generator and discriminator networks progressively: training starts with shallow, low-resolution networks, and layers that model increasingly fine details at higher resolutions are iteratively added during training. This method is observed to both speed up and stabilize training, which allows it to produce remarkably detailed and realistic images – see figure 5 in [16]. PGAN starts by training an initially shallow generator and discriminator with downsampled low-resolution versions of the training images. As the training progresses, new layers are added to the networks and the resolution of the training images is increased – this incremental form of training allows the GAN to start with discovering the large-scale structure of the image distribution before moving on to learn smaller details, instead of having to learn details on every scale simultaneously. Architecturally, the generator and discriminator networks are mirror images of each other and grow synchronously; see figure 2.5.1 for a diagram of a generator and discriminator that start with a low resolution of 4 × 4 and gradually progress to a high resolution of 1024 × 1024.

Figure 2.5.1: PGAN training progression. G is the generator, and D is the discriminator. The image is inspired by figure 1 from [16].

PGAN employs the WGAN-GP loss described in section 2.5.1. In order to compare the results of different GANs and evaluate the quality of results post-training, the authors of [16] use metrics that compare generated images with the corresponding (real) training set. Namely, by building on the idea that a successful generator should be able to capture realism at all scales, they estimate statistical similarity by calculating the so called sliced Wasserstein Distance (SWD, an approximation of the Wasserstein distance) over distributions of patches sampled from the images at representations of different resolutions – see [16] for a more detailed explanation of the SWD metric.

2.5.3 Conditional GAN

A standard GAN model is designed to learn a probability distribution over data, and to output random, noise-conditioned samples from its estimation of said distribution; it does not allow the user to control the specific type of output to receive from that distribution. The idea of a conditional GAN was proposed shortly after the original paper, as an extension that allows the user to control the modes of the generator's output by conditioning its noise input with some additional information [21]. For example, a standard GAN model may be trained to output images of the digits 0–9, but the digits it outputs are completely dependent on the noise input $z$, and are thus random. In a conditional GAN, however, one can condition the noise input with a label $y$, say for instance the integer '2', in order to receive a generated image of a '2' instead of a completely random digit.

The conditioning is achieved by feeding $y$ into the generator and discriminator via an additional layer between network and input, as depicted in figure 2.5.2. The new GAN objective becomes:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_d}[\ln D(x|y)] + \mathbb{E}_{z \sim p_z}[\ln(1 - D(G(z|y)))].$$

Figure 2.5.2: Conditional GAN illustration. The label y is combined with the noise input z and transformed into an encoding of both inputs via an intermediate layer, which is fed to the discriminator or generator. The image is inspired by figure 1 from [21].

Conditional GANs are commonly used in so called image-to-image translation tasks, where the goal is to transform one representation of an image into another. As an example, a conditional GAN can learn to generate images of a certain object when its input is conditioned on an image of the contours of that object (thus providing a "mapping" of sorts between the contour image and the complete image). Another useful application is in semantic image synthesis, where a conditional GAN can be trained to translate an image segmentation mask to an image that corresponds to the segmentation; effectively, such a GAN can be trained to provide an inverse to a model that is capable of creating such a segmentation.

2.5.4 SPADE

Figure 2.5.3: SPADE layer illustration.

In their 2019 paper "Semantic Image Synthesis with Spatially-Adaptive Normalization" [25], Park et al. improve upon the previous state-of-the-art image translation networks by introducing a so called SPADE (SPatially-Adaptive (DE)normalization) layer. Previous methods such as Pix2PixHD [29] feed the image segmentation mask directly as an input to a deep image translation network, an approach that the authors point out as suboptimal, since they find that the normalization layers of such a network tend to "wash away" important information from the segmentation mask as it passes through the network. SPADE is a normalization technique similar to batch normalization, that addresses this problem by basing the normalization parameters on the input segmentation mask, thereby allowing the mask to be repeatedly introduced at the intermediate layers of the generator instead of only at the first one.

SPADE is applied to an input by first performing channel-wise normalization on its activations, and subsequently modulating the result with parameters $\gamma$ and $\beta$, like in batch normalization. However, unlike in batch normalization, $\gamma$ and $\beta$ in SPADE are not scalars but tensors, produced by convolving the original semantic segmentation mask; see figure 2.5.3 for an illustration of this operation.
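A simplified Keras sketch of such a layer is given below; it is an illustration under assumptions (per-instance spatial statistics for the normalization, and illustrative filter counts and kernel sizes), not the exact configuration of [25]:

import tensorflow as tf
from tensorflow.keras import layers

class SimpleSPADE(layers.Layer):
    """Normalize activations, then modulate them with gamma/beta maps
    computed by convolving the (resized) segmentation mask."""
    def __init__(self, channels, hidden=128):
        super().__init__()
        self.shared = layers.Conv2D(hidden, 3, padding="same", activation="relu")
        self.conv_gamma = layers.Conv2D(channels, 3, padding="same")
        self.conv_beta = layers.Conv2D(channels, 3, padding="same")

    def call(self, x, mask):
        # Parameter-free normalization (here: per-instance spatial statistics).
        mu, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
        x_norm = (x - mu) / tf.sqrt(var + 1e-5)
        # Resize the (float, one-hot) mask to the activation's resolution,
        # then predict spatially varying gamma and beta maps from it.
        mask = tf.image.resize(mask, tf.shape(x)[1:3], method="nearest")
        h = self.shared(mask)
        return x_norm * (1.0 + self.conv_gamma(h)) + self.conv_beta(h)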

The authors of [25] argue that SPADE relieves the need to feed the segmentation mask to the first layer of the generator, and thus justify discarding the encoder (downsampling) part of the generator that is commonly used in architectures such as Pix2PixHD. This results in a simpler and more lightweight generator network that performs better with fewer parameters. See figure 4 in [25] for an illustration of the SPADE generator.


3 Method

Figure 3.0.1: Illustration of the method.

This chapter details the steps that have been taken to investigate the questions outlined in the introduction, i.e. the process of training a segmentation network on a dataset of brain MR images, synthesizing new brain image data with GANs, and including it in the segmentation. This process is illustrated in figure 3.0.1.

The chapter begins with an introduction to the BraTS dataset [20] that has been used throughout this project, and an explanation of how that data has been prepared for use in the segmentation network. This is followed by a section on segmentation, which provides details surrounding the application of U-Net and the architecture and hyperparameters that have been employed. Following this are two sections on PGAN and SPADE respectively, which explain how these GAN frameworks have been applied in this project. The chapter ends with a section that describes the preprocessing operations that have been applied to the synthetic segmentation masks.

3.1 Dataset

The Multimodal Brain Tumor Image Segmentation Benchmark (BraTS) [22] [20] [2] [5] [3] [4] is a dataset of volumetric MR scans and corresponding brain tumor segmentations of low- and high-grade glioma patients. The MR images in the dataset have been acquired with different clinical protocols and scanners from 19 different institutions, and are all available in 4 different formats:

• T1-weighted (T1)
• T1-weighted, contrast-enhanced (T1c)
• T2-weighted (T2)
• T2-weighted FLAIR image (FLAIR)

Because of time constraints, the experiments in this project have only been performed using the contrast-enhanced MR images (T1c). The corresponding ground truth segmentation images encompass the following intra-tumoral structures (and the background) as classes:

0. Background (BG)
1. Necrotic and non-enhancing tumor core (NCR/NET)
2. Peritumoral edema (ED)
3. GD-enhancing tumor (ET)

To quote [22]: "All the imaging datasets have been segmented manually, by one to four raters, following the same annotation protocol, and their annotations were approved by experienced neuro-radiologists."

The ground truth segmentations used in this project have been expanded with three more classes, in what is referred to as the complete version of the dataset with 7 classes. This version was created (outside of this project) by analyzing each subject with the function FAST [30] in the FSL software [15], with the purpose of obtaining the following new segmentations:

4. White matter (WM)
5. Grey matter (GM)
6. Cerebrospinal fluid (CSF)

When synthesizing images with GANs, the complete version of the ground truth has been used in favor of the incomplete version with 4 classes. This choice is motivated by the expectation that it is easier for an image-to-image translation network to synthesize complete MR images of the brain from fully segmented images, compared to using tumor segmentations that only cover a small part of the image.


3.1.1 Dataset split

The BraTS dataset used in this project consists of 210 pairs of 3D MR volumes and corresponding annotations, represented in the NIFTI [23] file format. Each volume contains 240 × 240 × 155 voxels with an isotropic voxel size of 1 × 1 × 1 mm. These .nii files were read into Python with the Nibabel [6] library and sliced axially (i.e. along a line going from the chin to the top of the head) into 155 2D slices each, resulting in a total of 210 × 155 = 32550 2D slices. The slicing was performed using Numpy [24], a Python library for matrix and array calculations, and the resulting slices were saved without loss or corruption of array data into separate .png files using the Python library PyPNG [26]; the MR images (which store 16-bit information) were saved in a greyscale uint16 file format, and the segmentation masks (which are simply integer-valued matrices where each integer represents a class) were saved in a greyscale uint8 file format. Before being saved as image files, each slice was padded with zeroes around the border to round the resolution up to the nearest power of two, i.e. to 256 × 256 (this was required by the scripts described in section 2.5.2). The resulting 32550 .png files were subsequently shuffled, and separated into training (80 %), validation (10 %) and test (10 %) data. See figure 3.2.1 for an image of 24 random samples from the training set.
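A sketch of this slicing-and-padding step with Nibabel and NumPy (paths, the placement of the padding, and the PNG-writing step are simplified assumptions; the project used PyPNG for lossless saving):

import numpy as np
import nibabel as nib

def slice_volume(nii_path, pad_to=256):
    """Load a 240 x 240 x 155 NIFTI volume and return its 155 axial slices,
    zero-padded to pad_to x pad_to."""
    volume = nib.load(nii_path).get_fdata()     # array of shape (240, 240, 155)
    h, w, n_slices = volume.shape
    pad_h, pad_w = pad_to - h, pad_to - w
    slices = []
    for k in range(n_slices):
        s = volume[:, :, k]                     # one axial 2D slice
        s = np.pad(s, ((pad_h // 2, pad_h - pad_h // 2),
                       (pad_w // 2, pad_w - pad_w // 2)))
        slices.append(s.astype(np.uint16))      # 16-bit greyscale for MR data
    return slices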

Figure 3.2.1: 24 random (non-empty) slices obtained from the BraTS dataset. The top image in each rectangle shows a T1c MR image, and the bottom image shows its corresponding color-coded segmentation mask. Black: BG, white: WM, grey: GM, blue: CSF, yellow: ED, orange: ET, red: NCR/NET.

3.2 Segmentation

The segmentation tasks in this project use an implementation of U-Net in the Python deep learning library Keras [7] (run on top of the machine learning platform TensorFlow [19]). The network architecture used in this project follows that of figure 2.4.1 – to reiterate: the encoder part consists of four levels with two convolutions per level, with 64, 128, 256 and 512 convolutional filters per convolution in each respective level. Consecutive encoder levels are connected by max pooling layers. The fifth level is the ”bridge” between the encoder and decoder parts, which consists of a convolution with 1024 filters followed by a transposed convolution. The decoder part that follows consists of four levels with two convolutions each, with 512, 256, 128 and 64 filters per convolution in each respective level, and consecutive levels are connected by transposed convolution layers. The last of these convolutions is followed by a final convolution layer in which the number of filters equals the number of classes in the given dataset. Each convolution layer uses ’same’ padding, and is followed by a batch normalization layer. Each transposed convolution layer uses a 3 × 3 kernel with ’same’ padding and a (2, 2) stride, and is followed by a batch normalization layer (before concatenation). The final convolution layer is followed by a softmax activation function applied over the channel axis, which results in a multi-channel segmentation map in which the value in a given channel corresponds to the probability that the pixel in question belongs to the class represented by that channel (a single-channel image can then be created post-training by taking the argmax of the segmentation map over the channel axis). The respective weights of the convolution and transposed convolution kernels are initialized with He normal initialization [13].
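The repeated building blocks of this architecture can be sketched in Keras as follows (a minimal sketch, assuming ReLU activations and a 1 × 1 kernel in the final classification layer, neither of which is fixed by the description above):

    from tensorflow.keras import layers

    def conv_block(x, filters):
        """Two 3 x 3 convolutions, each followed by batch normalization."""
        for _ in range(2):
            x = layers.Conv2D(filters, 3, padding='same',
                              kernel_initializer='he_normal')(x)
            x = layers.BatchNormalization()(x)
            x = layers.Activation('relu')(x)  # assumed activation
        return x

    def up_block(x, skip, filters):
        """Transposed convolution + batch norm, then skip concatenation."""
        x = layers.Conv2DTranspose(filters, 3, strides=(2, 2), padding='same',
                                   kernel_initializer='he_normal')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Concatenate()([x, skip])
        return conv_block(x, filters)

    # Final layer: one filter per class, softmax over the channel axis.
    # outputs = layers.Conv2D(n_classes, 1, activation='softmax')(x)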

Before being fed to the network, the MR images in the training, validation and test sets were scaled and normalized with respect to the training set. This was done by dividing each pixel value by a constant equal to the maximum pixel value in the training set, and then subtracting the scalar mean of all (scaled) pixel values in the training set. Furthermore, each segmentation mask was converted to a multi-channel one-hot encoded tensor before entering the segmentation network.
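In code, this amounts to something like the following sketch (the array name train_images is hypothetical; the statistics are computed once, over the training images only):

    import numpy as np
    from tensorflow.keras.utils import to_categorical

    scale = train_images.max()            # max pixel value in the training set
    mean = (train_images / scale).mean()  # scalar mean of the scaled values

    def prepare(mr_image, mask, n_classes):
        x = mr_image / scale - mean                      # scale, then center
        y = to_categorical(mask, num_classes=n_classes)  # one-hot channels
        return x, y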

The training was performed via stochastic gradient descent with the Adam [17] optimization algorithm, using a learning rate of 10⁻⁴ and a batch size of 8. The training and validation sets were read sequentially and shuffled after each respective epoch (i.e. after all of the images in each respective dataset had been used once). Weighted Dice Loss (see section 2.4.2) was used as the loss function and the validation metric, and the corresponding weight vector was calculated over the training set.
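One common formulation of such a loss is sketched below; the exact weighting in section 2.4.2 may differ in detail, and class_weights is assumed to be the precomputed weight vector:

    import tensorflow as tf

    def weighted_dice_loss(class_weights, eps=1e-7):
        w = tf.constant(class_weights, dtype=tf.float32)
        def loss(y_true, y_pred):
            # sum over batch and spatial axes, keeping the channel (class) axis
            intersection = tf.reduce_sum(y_true * y_pred, axis=(0, 1, 2))
            union = tf.reduce_sum(y_true + y_pred, axis=(0, 1, 2))
            dice = (2.0 * tf.reduce_sum(w * intersection) + eps) \
                   / (tf.reduce_sum(w * union) + eps)
            return 1.0 - dice
        return loss

    # model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    #               loss=weighted_dice_loss(class_weights))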

Using these settings, different training instances were created with varying numbers of real and synthetic training images, resulting in a different training set per instance (the validation and test data, however, remained identical between all instances). The number of segmentation classes was varied as well; apart from the complete version with 7 classes, the network was also trained with the incomplete version of the dataset (4 classes) as well as with a binary version (2 classes: tumor or non-tumor). The datasets with fewer than 7 classes were either read directly from disk when available, or created from a more complete dataset at training time by setting all irrelevant classes to 0 and the tumor classes to values in [1, 2, 3] (in the case of a 4-class problem) or to 1 (in the case of a binary problem), as sketched below.
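A minimal sketch of this label remapping (in the 7-class encoding, 1–3 are the tumor classes and 4–6 the FAST tissue classes):

    import numpy as np

    def to_4_class(mask7):
        out = mask7.copy()
        out[out >= 4] = 0  # drop WM/GM/CSF; BG and tumor classes 1-3 remain
        return out

    def to_binary(mask7):
        return np.isin(mask7, [1, 2, 3]).astype(np.uint8)  # tumor vs. non-tumor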

Each training instance was run for 150 epochs, and the network weights were saved every time a lower validation error was achieved at the end of an epoch. The results in chapter 4 are based on the network weights obtained after these 150 epochs (i.e. the weights that yielded the lowest validation error). The training was performed using an NVIDIA RTX 2080 Ti and two NVIDIA GTX 1080 GPUs (separately). See chapter 4 for the complete list of training instances.

3.3 PGAN

The official implementation of PGAN (hosted on GitHub) was downloaded and trained on the complete (7-class) dataset of segmentation masks from BraTS to generate a new, synthetic dataset of segmentation masks. Default settings were used, apart from changing the ’dynamic_range’ parameter in the configuration file from [0, 255] to [0, 6], to ensure that the transformations between the network values (which are continuous and lie in [−1, 1]) and the values of the segmentation masks (which assume discrete values between 0 and 6) were performed correctly. Additionally, slight modifications were made to the scripts responsible for generating and saving image files (namely, ’util_scripts.py’ and ’misc.py’), in order to ensure that the final generated outputs were saved as .png files with the same properties as the ones in section 3.1.1. Similar changes were also made to the scripts related to the calculation of the image metrics.
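The value mapping implied by this change can be sketched as a linear range adjustment followed by rounding (the function name is hypothetical):

    import numpy as np

    def to_labels(fake, drange_net=(-1.0, 1.0), drange_data=(0.0, 6.0)):
        scale = (drange_data[1] - drange_data[0]) / (drange_net[1] - drange_net[0])
        x = (fake - drange_net[0]) * scale + drange_data[0]  # [-1, 1] -> [0, 6]
        return np.clip(np.rint(x), 0, 6).astype(np.uint8)    # round to class labels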

Two training instances were created: one that used the ”full” dataset consisting of 100 % of the training data (26040 images), and one that used a ”reduced” dataset consisting of only the first 20 % of the images in the full dataset (5208 images). Completely empty segmentation masks (which comprised 15.66 % of the full dataset and 15.44 % of the reduced dataset) were discarded when loaded into the scripts, resulting in 21962 and 4404 training images respectively. The training script was set to run until the dataset had been sampled 12 · 10⁶ times in the training loop (as dictated by the parameter ’total_kimg’, which by default is set to 12000).

In both training instances, the network weights were saved after every ”tick” (or iteration) of the training loop. After the training, the SWD score (see section 2.5.2) was calculated over the generated images with respect to each of the saved weights, and the weights that yielded the lowest (average) distance were saved and used to generate a dataset of 100,000 images – this was done once per training instance, resulting in two new datasets of segmentation masks. Both training instances were run (separately) on an NVIDIA RTX 2080 Ti GPU.
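Conceptually, this model selection step reduces to the following (average_swd is a hypothetical helper standing in for the modified metric scripts):

    # saved_weight_paths lists the weights saved after each training tick;
    # average_swd(path) is assumed to return the mean SWD over images
    # generated with those weights.
    scores = {path: average_swd(path) for path in saved_weight_paths}
    best_weights = min(scores, key=scores.get)  # lowest average SWD wins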


3.4 SPADE

The official implementation of SPADE (hosted on GitHub) was downloaded and trained on the 7-class version of the BraTS dataset to learn a ”mapping” from segmentation masks to an MR image representation. The synthetic segmentation masks generated in section 3.3 were subsequently given to the trained network to generate their MR counterparts.

Similarly to section 3.3, the source code had to be modified slightly in order to adapt it to the properties of the dataset. In particular, the data loading and utilities scripts (namely, ’pix2pix_dataset.py’ and ’util.py’) had to be altered in places to accommodate single-channel MR images with values larger than 255, and to ensure that the transformations between tensor and image values were performed correctly. Again, small changes also had to be made (specifically to ’util.py’) in order to ensure that the generated MR images were saved as .png files with the desired properties.
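The kind of change involved can be illustrated with a sketch of the saving step, where a generated tensor in [−1, 1] is mapped back to a 16-bit greyscale .png (max_value, denoting the largest pixel value of the real MR data, is an assumption here):

    import numpy as np
    import png  # PyPNG

    def save_mr_png(tensor, path, max_value):
        img = (tensor + 1.0) / 2.0 * max_value  # [-1, 1] -> [0, max_value]
        img = np.clip(img, 0, 65535).astype(np.uint16)
        writer = png.Writer(width=img.shape[1], height=img.shape[0],
                            greyscale=True, bitdepth=16)
        with open(path, 'wb') as f:
            writer.write(f, img.tolist())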

Default settings were used, apart from disabling image preprocessing, image flipping, inclusion of instance maps and VGG loss calculation (via the input argument no_vgg_loss). Like in section 3.3, two training instances were created: one with 100 % of the training data, and one with the first 20 % of the training data. The instance with 100 % of the training data was trained for the default number of 50 epochs, and the instance with 20 % of the training data was trained for 250 epochs. The reduced instance was trained for five times as many epochs as the full instance in order to ensure that the network would be shown the same number of training images in both cases (as with PGAN), and because this, subjectively speaking, yielded better image quality.

After each training instance, the segmentation masks generated in section 3.3 were preprocessed (see section 3.5) and the resulting images were used with the final weights of the generator to create their MR counterparts. The segmentation masks that were created with 100 % of the training data were input to the SPADE generator trained with 100 % of the data, and the masks that were generated with 20 % of the training data were used with the SPADE generator that was trained with the same 20 % of the data, ultimately resulting in two datasets of synthetic segmentation masks and corresponding MR images. Each training instance was performed using two NVIDIA Tesla V100 GPUs simultaneously (at separate times).

3.5 Preprocessing

A simple preprocessing procedure was applied to the segmentation masks generated in section 3.3, in a quick effort to remove noisy or corrupted images from the synthetic datasets. This was done by comparing every image in a synthetic dataset to the entirety of the (corresponding) real dataset, by calculating the Z-score of each pixel value in the synthetic dataset with respect to the real dataset; i.e. by subtracting the empirical mean of the pixel values in the real dataset from each pixel value in the synthetic dataset, and subsequently dividing by the standard deviation of the pixel values in the real dataset. The mean and standard deviation were calculated over the batch axis of the real dataset, resulting in two two-dimensional arrays in which each value represents the mean or standard deviation of the pixels at a given location in the segmentation masks. The resulting standardized images thus measure how much each pixel at a specific location in a synthetic image deviates from the pixels at the same location in the real dataset.

Following this, each image in the standardized synthetic dataset was reshaped into a vector, and the Euclidean norm of each vectorized image was calculated, resulting in a new vector where each value corresponds to an image in the synthetic dataset. These values were created with the intent of providing a measure of how much each synthetic image deviates from the real dataset as a whole.
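A minimal sketch of this computation, assuming real and synth are arrays of shape (N, 256, 256) and with a small epsilon guarding against zero variance:

    import numpy as np

    mu = real.mean(axis=0)              # per-pixel mean over the batch axis
    sigma = real.std(axis=0)            # per-pixel standard deviation
    z = (synth - mu) / (sigma + 1e-8)   # standardize each synthetic image
    norms = np.linalg.norm(z.reshape(len(z), -1), axis=1)  # one score per image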
