Deep Learning with Importance Sampling for Brain Tumor MR Segmentation

Full text








Deep Learning with Importance

Sampling for Brain Tumor MR



Deep Learning with Importance

Sampling for Brain Tumor MR



Degree Projects in Optimization and Systems Theory (30 ECTS credits) Master’s Programme in Applied and Computational Mathematics (120 credits) KTH Royal Institute of Technology year 2021

Supervisor at Elekta Instrument AB: Kenneth Lau Supervisor at KTH: Xiaoming Hu


TRITA-SCI-GRU 2021:007 MAT-E 2021:001

Royal Institute of Technology

School of Engineering Sciences KTH SCI



Segmentation of magnetic resonance images is an important part of planning radiotherapy treat-ments for patients with brain tumours but due to the number of images contained within a scan and the level of detail required, manual segmentation is a time consuming task. Convolutional neural networks have been proposed as tools for automated segmentation and shown promising results. However, the data sets used for training these deep learning models are often imbalanced and contain data that does not contribute to the performance of the model. By carefully selecting which data to train on, there is potential to both speed up the training and increase the network’s ability to detect tumours.

This thesis implements the method of importance sampling for training a convolutional neural network for patch-based segmentation of three dimensional multimodal magnetic resonance images of the brain and compares it with the standard way of sampling in terms of network performance and training time. Training is done for two different patch sizes. Features of the most frequently sampled volumes are also analysed.



Segmentering av magnetröntgenbilder är en viktig del i planeringen av strålbehandling av patienter med hjärntumörer. Det höga antalet bilder och den nödvändiga precisionsnivån gör dock manuell segmentering till en tidskrävande uppgift. Faltningsnätverk har därför föreslagits som ett verktyg för automatiserad segmentering och visat lovande resultat. Datamängderna som används för att träna dessa djupinlärningsmodeller är ofta obalanserade och innehåller data som inte bidrar till modellens prestanda. Det finns därför potential att både skynda på träningen och förbättra nätverkets förmåga att segmentera tumörer genom att noggrannt välja vilken data som används för träning.



Firstly, I would like to thank my supervisor at Elekta, Kenneth Lau, for his support and guidance during the time of this thesis. His advice and ideas have helped provide a clear direction in the development of both the project and the report. Thanks also to the rest of the PAA team for their interest and input. Finally, I wish to thank my supervisor at KTH, Xiaoming Hu, for his feedback throughout the project.



1 Introduction 4

1.1 Background . . . 4

1.2 Problem Formulation . . . 5

1.3 Related Work . . . 5

1.3.1 Segmentation Using Deep Learning . . . 5

1.3.2 Data Sampling . . . 6

1.3.3 Comment on Earlier Work . . . 6

1.4 Thesis Outline . . . 6

2 Theoretical Background 8 2.1 Brain Tumours . . . 8

2.2 Magnetic Resonance Imaging . . . 9

2.3 Segmentation . . . 11

2.4 Convolutional Neural Networks . . . 12

2.4.1 Convolutional Neural Networks . . . 12

2.4.2 Segmentation Using CNNs . . . 14

2.5 Stochastic Gradient Descent . . . 15

2.5.1 Variants of Stochastic Gradient Descent . . . 16

2.5.2 Local and Global Minima . . . 17

2.6 Importance Sampling . . . 17

2.6.1 Gradient Norm Estimates . . . 18

2.6.2 The Importance Sampling Algorithm . . . 18

3 Methodology 20 3.1 Data . . . 20

3.1.1 Data Pre-Processing . . . 20

3.1.2 Patch Extraction and Data Split . . . 21

3.1.3 Data Augmentation . . . 22

3.2 Network Architecture . . . 22

3.3 Training and Sampling Strategies . . . 23

3.3.1 Uniform Sampling . . . 23

3.3.2 Importance Sampling . . . 23

3.4 Segmentation . . . 24


3.6 Sampling Analysis . . . 25

4 Results 27 4.1 Large Patches . . . 27

4.1.1 Distribution of Dice Scores . . . 29

4.1.2 Sampling Patterns . . . 31

4.2 Small Patches . . . 36

5 Discussion 46 5.1 Performance and Training Time . . . 46

5.2 Sampling Trends . . . 47

5.3 Validity and Robustness . . . 48

6 Conclusion 50 6.1 Future Work . . . 50

A Voxel Intensity Histograms 55 A.1 Large Patches . . . 55

A.1.1 Mean Voxel Intensity . . . 55

A.1.2 Standard Deviation of Voxel Intensity . . . 57

A.2 Small Patches . . . 58

A.2.1 Mean Voxel Intensity . . . 58


Chapter 1


Despite not being one of the most common types of cancer, brain tumours were estimated to have a mortality rate of 3.4 per 300,000 people worldwide [1]. Imaging of the brain plays a crucial part not only in diagnosing but also in treating patients with tumours as it allows physicians to both plan the dosage of radiotherapy treatment and to evaluate its effect over time. One of the most common methods for brain imaging is magnetic resonance (MR) imaging, particularly because it results in high contrast images of soft tissue while sparing the patient from harmful radiation.

Before incorporating MR images into treatment planning, the tumours captured are often first segmented to determine the location, morphology and subtype of any tumours present. An accurate segmentation is important to quantify the radiation dose that will be delivered to the tissue to ensure that the tumour receives enough radiation to target the entire structure while at the same time sparing the surrounding healthy tissue. Traditionally, this is done manually by a physician at the clinic but given the level of detail required this is a demanding and time consuming task. It is therefore of interest to develop automated methods for segmentation and given their recent success within other areas of computer vision, deep learning methods appear particularly promising for this [2]. By incorporating algorithms that are able to segment MR images automatically into the workflow as an assisting tool, there is potential to save both time and financial resources and thus provide more patients with better care.




The standard technique when it comes to sampling data during training is to do so uniformly from the training data set. While this ensures that the model is exposed to all samples at an approximately equal frequency it also means that time is spent training on redundant samples that do not contribute to the performance of the network for the given task. Recent studies indicate that this may be an inefficient strategy as all samples may not actually be equally important during the training phase [3, 4]. By choosing which samples the algorithm is exposed to in a more strategic manner, it might be possible to both speed up the training process and obtain a model that is more equipped to detect malignancies.


Problem Formulation

This thesis aims to investigate whether it is possible to decrease the time needed to train a deep neural network for brain tumour segmentation using MR images by employing different methods for sampling. This is done by

• providing a benchmark of the time it takes to train using uniform sampling, • applying a state of the art method for importance sampling,

• analysing the sampling patterns of the importance sampling algorithm.

The performance of the resulting models is measured both in terms of training time required for convergence and in terms of how well tumours are segmented.


Related Work


Segmentation Using Deep Learning

Research on the use of deep learning for the purpose of segmenting of brain tumours using MR images of the has been ongoing for the past few years and there is a vast number of articles available on this subject [5–9]. As with most image segmentation tasks, the current state of the art methods employ convolutional neural nets (CNN) in some capacity. The majority of the work implements or adapts different preexisting CNN architectures, such as DeepMedic [5], ResNet [6] and U-Net [7]. These have all been shown to segment MR images with an overall high accuracy, both when considering whole tumours and their substructures. Among those studies that compare different architectures, the general observation is that when sufficient data is available, deeper networks non-surprisingly tend to perform slightly better albeit at the price of a longer training time. In terms of training, stochastic gradient descent with uniform sampling is the method of choice in all articles consulted, sometimes using momentum.



Data Sampling

Despite uniform sampling being the default choice when training deep learning models, research on how to improve training has been ongoing for some time. Several different approaches have been proposed for speeding up learning through intelligent ways of sampling data. The strategies presented in the studies consulted as part of this thesis can be summarised into three categories outlined below.

One way of speeding up the training process is to simply reduce the size of the training data set in such a way that the remaining points form a good representation of the original set. This is for instance done in a work by Mirzasoleiman et al., where a subset of points that together provide an approximation of the full loss gradient is selected and trained on at each iteration [10] and Prahbu et al., where techniques from the field of active learning to construct the subset are used [11]. It is also possible to use the loss gradient directly. Vodrahalli et al., apply the norm of the loss gradient directly to compute an importance score for each sample and perform training only with the points with highest score [12].

Rather than removing part of the training data set, speedup can also be achieved by changing the distribution according to which the sampling is done. One way of implementing this is to sample data points according to their Euclidean distance from each other [13]. The assumption here is that points that sit closely together in Euclidean space will result in similar gradients and so this method is designed to favour points that together describe the diversity of the data set.

Another option is to base the sampling distribution on the gradient itself. Jiang et al., among others, use the norm of the loss gradient to determine a distribution for importance sampling [14]. Although this is shown to work significantly better than standard uniform sampling, computing the explicit values for the loss gradients is in itself a time consuming task. A possible solution for this is to compute an approximation to the gradient using uncertainty sets [15]. The approach taken by Katharopoulos et al. is to use the actual loss function as an approximation of its gradient [16]. A later work by the same authors further reduces the requirements for computational power by deriving an upper bound for the loss gradient which is shown to perform even better than the previous suggestion [4].


Comment on Earlier Work

There is a considerable amount of research done on the use of deep learning for segmentation as well as non-standard ways of sampling data for training. However, as far as the author is aware, there are no instances of these new sampling techniques having been applied to networks designed for the purpose of segmentation. This thesis aims to bring these two concepts together and investigate the performance of new ways of sampling on medical data.


Thesis Outline


Chapter 2

Theoretical Background

This chapter discusses the theory behind the methods deployed in this project. Firstly, a brief introduction to the underlying medical concepts of brain tumours, radiotherapy, magnetic resonance imaging and segmentation is given. Secondly, convolutional neural networks are introduced and the functions of their components in the context of image segmentation are discussed. Finally, the training algorithms for CNNs that will be used is presented.


Brain Tumours

Cancerous tumours in the brain [17] can be divided into two classes - primary and secondary tumours. Primary tumours originate in the brain whereas secondary tumours, also known as metastasis, typically have spread from other parts of the body. The most common type of primary tumour is gliomas, which arise in the glial components of the nervous system. These are in turn classified into four different grades indicating how rapid the growth and spread of the tumour is. Grade I gliomas are most commonly found in children. Grade II, sometimes also referred to as lower grade glioma, is the least invasive form of glioma found in adults, but may still be fatal in some cases. Grade III are highly malignant tumours with a high tendency to develop into grade IV glioma, also known as glioblastoma or glioblastoma multiforme. This is the most fatal type of glioma, partly because it tends to spread to other parts of the brain very rapidly.


Figure 2.1: The Leksell Gamma Knife


Magnetic Resonance Imaging

Treating patients with brain tumours requires high precision as it involves exposing the brain to ionising radiation. Before treatment is carried out the physician needs to know the exact location and morphology of the tumour so as to both minimise the amount of healthy tissue that is exposed to radiation as well as ensuring that no part of the tumour is missed. One way of achieving this is through magnetic resonance imaging [18]. It relies on the use of strong magnetic fields in order to measure the density of for instance hydrogen nuclei in the brain. Under normal conditions, the angle of the axes around which the nuclei rotate, also known as their spins, all point in random directions. When the hydrogen nuclei are exposed to a magnetic field their spins change to align with the direction of the magnetic field. Applying an electromagnetic wave in the radio frequency range then cause the spins to deflect away from the direction of the field, they are said to become excited. When the wave is switched off, the nuclei return to their previous state with spins aligned with the magnetic field. This relaxation causes the nuclei to emit signals. By measuring these signals, and exploiting the fact that the proton density vary between chemical compositions and therefore also between tissue types, high contrast images of the brain can be produced without exposing the patient to harmful radiation.

Another advantage of MR is that by varying the MR sequences different tissue contrasts can be enhanced. This allows for clear images of different kinds of structures which together enable a good visualisation of the brain. Two features that affect the contrast of an MR image are known as the T1

and T2times. These are intrinsic properties of biological tissue and are defined as the time it takes

for the excited spins of the hydrogen nuclei to recover to their previous state in the longitudinal and transverse directions respectively. By choosing different echo and repetitions times for a sequence one can determine what type of tissue to visualise. Images produced in such a way are said to be T1- or T2-weighted. It is also possible to increase the contrast of certain tissue types by injecting a

contrast enhancing agent, typically gadolinium, into the body. When this is done in an image with T1-weighting it is referred to as T1-contrast or T1C. Another weighting known as fluid-attenuated


Figure 2.3: An example of a segmented MR image with non-enhancing tumour core in purple, peritumoral edema in blue and nectrotic tumour core in GD-enhancing tumor in yellow.



The technique of partitioning an image into two or more non-overlapping segments based on some characteristic of the pixels is known as segmentation [19] and has many applications within a number of fields. In medical imaging it is often used to provide a distinction between different structures in the body, in particular between healthy and pathological tissue.

Brain tumour segmentation [2] involves identifying the location and shape of any tumours present. An example of a segmented slice of an MR image can be seen in Figure 2.3. This is typically a nontrivial task as the images often contain noise and other artefacts which make it difficult to distinguish between normal brain matter and tumour at a volume pixel (voxel) level. Consequently, performing segmentation manually, as is traditionally the case, requires a high level of skill. Given that a standard MR sequence of the brain contains a high number of two dimensional slices as well as multiple modalities, segmentation by hand is also very time consuming. In an attempt to combat this, several algorithms have been developed to automate the process, including thresholding and machine learning methods such as clustering and Bayesian classification [20]. As deep learning has gained interest in the machine learning community neural networks, and in particular convolutional neural networks, have gained popularity in the field of medical imaging as well, mainly because of their ability to take spatial correlation into account during predictions [2].



Convolutional Neural Networks

Before delving into the details of convolutional neural networks [21], first consider the general form of an artificial neural network (ANN). At its core, this a composition of functions

f (x) = f(n)◦ ... ◦ f(1)(x), (2.1)

where each f(i) denotes the ithlayer of the network and takes the form

f(i)(x) = h(i)(W(i)Tx + b(i)). (2.2) Here, the matrix W (i) and vector b(i), known as the weights and bias of the ith layer, define an affine transformation of x. The activation function h(i) is typically nonlinear in order to capture features in the data that the affine transformation cannot. A common choice of activation function is the rectified linear unit (ReLU), defined as h(i)(x) = max(x, 0) or the sigmoid function, h(i)(x) =

1/(1 + ex). An example of a simple network is shown in Figure 2.4.

Input layer Hidden layer Output layer x1 x2 x3 x4 x5 y

Figure 2.4: A simple ANN with one hidden layer.


Convolutional Neural Networks


2 -9 -4 -1 1 8 -5 1 -4 3 0 2 1 2 1 9 6 0 3 0 1 6 5 2 4 0 -1 1 0

Figure 2.5: An illustration of the convolution of a 2 × 2 kernel moving over a 4 × 4 image, resulting in a 3 × 3 feature map

Convolutional Layers

For ease of notation, first consider the case of a two-dimensional image I and a kernel function K, also of two dimensions. The convolutional operation is then defined as

(I ∗ K)(i, j) = M X m N X n I(m, n)K(i − m, j − n), (2.3) with (i, j) denoting the position in the output image, also known as the feature map. The kernel moves over the image with predetermined step sizes known as strides. This is illustrated in Figure 2.5. Applying this operation has the effect that each of the pixels in the output is only dependent on the pixels in a neighbourhood of the corresponding one in the input image, as K is typically smaller than I. Consequently, applying a convolutional layer results in a feature map where information has been compressed, often to a spatially lower resolution.

There are cases where one would like to make use of the features of the convolution without reducing the dimensions of the image. This can be achieved through padding - adding extra rows and columns in a symmetric manner to the edged of the image before applying the convolution. There are many different ways to choose the values of the added elements but the most common include all zeros or reflecting the image along the edges.

Transpose Convolutional Layers


-1 -5 0 2 21 10 8 8 2 0 0 0 0 0 1 5 0 0 2 1 0 0 0 0 0 0 -1 2 4

Figure 2.6: An illustration of the transpose convolution with a 2 × 2 kernel over a 2 × 2 image padded with zeros, resulting in a 3 × 3 output feature map

has been passed through a regular convolutional layer, it does not undo the convolution operation itself.

Actvation Layers

The role of the detection layer is to identify patterns present in the data that the affine convolution does not pick up on. As in the case of a standard ANN, this is done using a nonlinear activation function which is applied element-wise to the whole image. Similarly to regular networks, the ReLU function is a common choice for CNNs as well.

A Note on Dimensions

The above discussion deals with the case when the network takes two-dimensional data, such as a black and white 2D image, as its input. In many applications however, the input images may not necessarily be neither two-dimensional nor black and white. As the technology used for imaging has evolved, three dimensional images are more common, particularly within the medical field. Additionally, each image may be coloured and thus represented in terms of an RGB value, or as in the case of MR scans contain different modalities which require an extra dimension to be represented correctly. The operations and layers presented so far all generalise to higher dimensions in the manner that one would expect.


Segmentation Using CNNs

The goal of segmenting an image is to obtain a classification of each pixel or voxel among a given number of classes. In other words, given an input image X, the CNN should produce an output


Y = fθ(X), where θ represents the collection of parameters that define the network f , such that each

voxel ˆyi,j,k is labelled according to which class it belongs to. Here k denotes the third dimension


so that the output ˆY of the final layer has channel number equal to the number of classes C. The probability that a voxel ˆy belongs to class c can then be expressed by setting the final activation function s(·) to be the SoftMax function:

s(ˆyc) =




, c = 1, .., C. (2.4)

With this setup, the natural choice for objective or loss function when training the network is then the categorical cross-entropy

L(Y, ˆY ) =X

i,j,k C



−yci,j,klog(s(ˆyi,j,k)), (2.5)

where Y is the ground truth for which yci,j,k is one if voxel yi,j,k belongs to class c and zero

otherwise. Regularisation

As with all machine learning algorithms, one would like to obtain a model that not only performs well on the data it has been trained on but also generalises well on unseen data. There are a number of techniques that can be applied to avoid overfitting, most of which work by controlling the complexity of the model in some way, either by limiting the number of parameters in the model or by limiting the range of values that the parameters can take on. One instance of the latter, which is commonly used when working with CNNs, is L2-regularisation. It introduces a regularising term into the loss function that has the effect of drawing all kernel weights closer to the origin. Given a loss function L(·), the regularisation term can be added as


L(fθ(X), Y ) = L(fθ(X),Y )+λ||θ||2

2, (2.6)

where λ is a hyperparameter determining the strength of the regularisation.


Stochastic Gradient Descent

In order to obtain good predictions with supervised learning, the network needs to be trained on labelled data. This amounts to optimizing the parameters θ so as to minimize the empirical risk function defined on the data set (Xi, Yi)Ni=1 containing N samples, in other words to solve

min θ 1 N N X i=1 L(fθ(Xi), Yi). (2.7)


method of using mini batches enables faster computation while still providing an unbiased estimate of the actual gradient.

Given a batch gradient Gt, each parameter update is carried out as

θt+1= θt− αtGt, (2.8)

where the hyper parameter αt is known as the learning rate for iteration t. This may either be

kept constant throughout training or changed according to a schedule. The network parameters are consequently updated until a criterion for stopping is fulfilled. This may for instance be that the incremental change in the parameters falls below some threshold. The full SGD algorithm is shown in Algorithm 1.

Algorithm 1: Stochastic Gradient Descent

Inputs: Learning rate α, initial network parameters θ0;

Initialization t = 1;

while stopping criterion not met do

Sample a mini batch of n samples from the training set (Xi, Yi)Ni=1;

Compute gradient estimate Gt= 1n


j=1∇θt−1L(f (Xj), Yj);

Update θt= θt−1− αtGt;

t = t + 1; end


Variants of Stochastic Gradient Descent

While SGD computes an unbiased estimate of the true gradient at each iteration, using only a few samples at a time can make the algorithm sensitive to any noise present in those samples. One way to overcome this is to introduce a momentum term to the updating stage. This aggregates an exponentially decaying moving average of past gradient estimates and results in updates that take previous mini batches into account as well.

Another method for improving the performance of SGD is to use a learning rate that adapts according to gradient estimates that have been seen during the training so far. There are a number of different algorithms that make use of this strategy, such as RMSProp and and AdaGrad. The most commonly used for deep learning is ADAM and is also the ones applied in this project. It utilises momentum to compute estimates of both the first and second order moments of the gradient,

st= ρ1st−1+ (1 − ρ1)Gt (2.9)

rt= ρ2rt−1+ (1 − ρ2)Gt Gt, (2.10)

where ρ1and ρ2 are fixed decay rates and refers to elementwise multiplication. The parameters

of θ are subsequently updated with the bias-corrected versions of st and rt, see Algorithm 2 for


Algorithm 2: ADAM

Inputs: Global learning rate α, decay parameters ρ1, ρ2, small constant δ for numerical

stabilisation, initial network parameters θ0;

Initialization s0= 0, r0= 0, t = 1;

while stopping criterion not met do

Sample a mini batch of n samples from the training set (Xi, Yi)Ni=1;

Compute gradient Gt= n1P n

j=1∇θt−1L(f (Xj), Yj);

Update biased first moment st= ρ1st−1+ (1 − ρ1)Gt;

Update biased second moment rt= ρ2st−1+ (1 − ρ2)Gt Gt;

Correct bias in first moment ˆst= 1−ρstt 1

; Correct bias in second moment ˆrt=1−ρrtt

1 ; Compute update ∆θ = −αsˆt ˆ rt+δ; Update θt= θt−1+ ∆θ; t = t + 1; end


Local and Global Minima

Given an appropriate learning rate schedule, SGD is guaranteed to eventually converge to a min-imum point of the loss function. Had the loss been convex this minmin-imum would necessarily be a global one however due to the complexity of both the CNN and the loss itself, this is almost never the case in deep learning. So while the algorithm may converge, it may do so not to the global minimum but to a local one. The loss function may have a high number of local minima and the point of convergence is influenced both by the stochasticity of SGD as well as the starting value θ0.

To increase the chances of finding a minimum point with a low loss value it is therefore common to run the training several times, each with a different set of initial weights. Strategies for choosing the initial θ0 include drawing the weights according to a predetermined probability distribution,

such as a uniform or normal, or setting all weights to zero.


Importance Sampling

The stochastic element in SGD comes from the selection of samples for each mini batch. With the standard method, at each iteration a mini batch of samples are drawn without replacement according to a uniform distribution over the whole training set. Although this results in an unbiased estimate of the gradient of the loss ∇Lθ, it means that the algorithm will see all samples at an

approximately equal frequency regardless of how well they are classified by the current version of the CNN. Consequently, time and computational effort may be wasted on data that the network already handles well. Importance sampling [4] aims to rectify this by focusing the training on samples that the network struggles to classify correctly and thus increase the speed of convergence.

Let Pt denote the distribution according to which samples are chosen at iteration t and It the


and t + 1,

S = −EPt||θt+1− θ ∗||2

2− ||θt− θ∗||22 . (2.11)

It can be shown [4] that the above expression is minimized by selecting the distribution Pt that

minimizes Tr(VPt[G

(It)]), where G(i)= ∇

θtL(f (Xi, Yi)) is the gradient for sample i. The optimal

Pt for this has in turn been shown to be one that is proportional to the per-sample gradient norm

of the loss [22].


Gradient Norm Estimates

Although the ideal strategy would be to use a sampling distribution based on the norm of the loss gradient ||∇θL(f (Xi, Yi))||, the computation of this distribution is complex enough that one

may not actually gain anything in terms of training time. Instead, the algorithm proposed in [4] makes use of an upper bound of each ||Gi||2which is less computationally expensive. The following

notation describes the case of a simple fully connected ANN but the method and algorithm applies for CNNs as well.

First, let θ(l) ∈ RMl×Ml−1 denote the parameters for the lth layer out of a total of L, let σ(l)(·)

be the corresponding activation function, x(l) the output of layer l and z(l) = θ(l)x(l−1). Then define Σ0l(z) = diagσ0(l)(z1), ..., σ0(l)(zMl)  (2.12) ∆(l)i = Σ0l(z (l) i )θ T l+1... Σ0L−1(z (L−1) i )θ T L (2.13) ∇X(L) i L = ∇X(L) i L(f (Xi, Yi)). (2.14)

The upper bound for the gradient norm of each sample (Xi, Yi) is then given by

k∇θL(f (Xi), Yi)k2≤ Lη Σ 0 L(z (L) i )∇XL iL 2:= ˆG (i) (2.15) η = max l,i  x (l−1) i 2 ∆ (l) i 2  . (2.16)


The Importance Sampling Algorithm

The bound in (2.15) is considerably faster to compute than the gradient norm itself but as it depends on the iteration t it must still be recomputed at each time step. In addition, for large data sets determining these importance bounds for all samples is still time consuming. The importance sampling algorithm solves this by uniformly pre-sampling a large batch of B data points and computing the importance score and subsequent sampling distribution only for those. From this larger batch, a smaller one of b data points is then sampled without replacement according to the derived sampling distribution and used to compute the parameter update.


it is shown that in order to achieve the same variance reduction with standard sampling as with importance sampling with large and small batch sizes of B and b respectively, one needs to sample τ b data points, where τ is defined as

1 τ = 1 − 1 PB i=1g 2 i kg − uk2 2, (2.17)

with gi∝ kG(i)k2 and u = 1/B be the sample probabilities for importance and standard sampling

respectively. Using also the assumption that computing a backward pass of the gradient through the network is twice as time consuming as computing a forward pass, the relative extra time spent using importance sampling is (B + 3b)/3b. From this one can conclude that it is beneficial to use importance sampling when τ > (B + 3b)/3b. Algorithm 3 outlines the full method, which also includes the parameter aτ that is used to compute a running average of τ in order to achieve a

smoother estimate.

Algorithm 3: Training with Importance Sampling

Inputs: Large batch size B, small batch size b, smoothing parameter aτ, initial network

parameters θ0;

Initialization: t = 1, τ = 0;

while stopping criterion not met do if τ > (B + 3b)/3b then

U ← Bb uniformly sampled data points; gi∝ ˆG(i) ∀i ∈ U according to eq 2.15;

G ← b data points sampled with gi from U ;

wi ←Bg1i ∀i ∈ U ;

Update θtusing an SGD step with weights wi, U , θt−1;


U ← b uniformly sampled data points; wi ← 1 ∀i ∈ U ;

Update θtusing an SGD step with weights wi, U , θt−1;


Chapter 3


In this chapter, the methods used to obtain the results presented in Chapter 4 are outlined. This includes a description of the data set used, what pre-processing steps are taken, how patches to train on are extracted from the original images, the choice of network architecture as well as the methods used for training. Descriptions of the employed sampling strategies are provided as well as the relevant performance metrics. Finally the quantities used for analysing the sampling patterns from the importance sampling training are introduced.



The data set used for training and validation in this project is the one published for the 2020 BRaTS challenge [23–27]. It comprises 369 sets of 3T multi-modal MR scans performed pre-operatively on patients with confirmed diagnosis of glioblastoma and lower grade glioma. The images are obtained from a variety of scanners and a total of 19 institutions worldwide. Each scan contains 240 × 240 × 155 voxels large grey scale images with four modalities - T1, T1-contrast (T1C), T2

and T2 Fluid Attenuated Inversion Recovery (FLAIR) weighted sequences. The ground truth of

the data set has been obtained from images that have been segmented manually by experts into four classes - normal brain matter, gadolinium enhancing tumour, peritumoral edema, and finally necrotic and non-enhancing tumor core. For simplicity however, for this task the three tumour classes are merged together and segmentation is performed binary to only differentiate between tumour and healthy tissue.


Data Pre-Processing

As provided the data set has been pre-processed to ensure spatial alignment across all sequences, skull-stripped and interpolated to an isotropic resolution of 1 mm3. In order to further facilitate the


Bias Filed Correction

A bias field [28] refers to a smooth low frequency signal that is present in all MR scans due to small inhomogenities in the magnetic field used to produce the images. It has the effect of blurring the images slightly, which may make edges and contours of different structures less distinct and also cause changes in the pixel or voxel intensity. While the bias field artefacts do not affect the manual segmentation by a physician, it may degrade the performance of the model and is therefore often removed or at least reduced before training. This is done using the Insight Toolkit implementation of the N4ITK algorithm [29], which assumes that the field can be described through a non-parametric multiplicative model and uses an iterative strategy to estimate it.

Intensity Normalization

To further facilitate learning, all images are normalised to have a mean value of zero and standard deviation of one. This is done to ensure a common range of values over all dimensions and modalities. Normalisation is performed over each modality and over each patient.


Patch Extraction and Data Split

Due to limitations in the graphics processing unit (GPU) memory, training on the full 3D volumes is not possible. Instead the model is trained on smaller sections of each volume, known as patches. To improve model performance the resolution of the volumes is therefore decreased to 8 mm3 per voxel to increase the field of view contained within each patch. As the images contain a high amount of background voxels, that is voxels that are neither healthy brain tissue nor tumour, the images are also cropped to remove the majority of these. This results in images of 88 × 104 × 80 voxels. Each image is then split into several patches which are shown to the network one at a time. This is done using the two different strategies outlined below.

The data set is also randomly split into training and validation data. This is done on a patient level, so that all patches from one patient are put into exactly one of the training and validation sets. Approximately 80% of the data, 295 volumes, are used for training and the remaining 74 for validation.

Large Patches

For the first strategy, four patches of 88 × 104 × 32 voxels are extracted from each image, illustrated in Figure 3.1a, with an overlap of 16 voxels in the longitudinal direction. This leaves a total of 1476 patches in the data set, 1180 of which are used for training and 296 for validation.

Small Patches


104 80

88 32

(a) Large patches

104 80 88 32 32 32 (b) Small patches

Figure 3.1: Extraction of patches in blue from the full volumes in black.


Data Augmentation

To increase the variability of the data set in order to reduce overfitting, data augmentation is used on the training data. The images are randomly reflected about the sagittal axis and rotated at angles chosen uniformly in the range (0◦, 30◦). No augmentation is performed on the validation data.


Network Architecture

The experiments with different sampling methods in this thesis are all performed on the 3D deep convolutional network known as U-Net [30] which has been developed especially for the purpose of segmentation. Adapted from an earlier version [31] which required two dimensional data, this network consists of two main paths, an analysis path and a synthesis path. The idea behind this is to extend a normal CNN with another set of layers where upsampling operations replace the pooling layers in order to increase the resolution of the output image.

The analysis path follows the standard structure of a CNN with four steps each containing two 3 × 3 × 3 convolutions with a ReLu activation function followed by a 2 × 2 × 2 max pooling layer with strides equal to two in each dimension. In order to stabilise the learning, the original U-net architecture uses batch normalization before each ReLu activation function. However, because the memory limitations forces a batch size of one and eight respectively for the patch strategies, this has been swapped for instance normalization to avoid inaccuracies in the estimation of batch statistics [32]. The synthesis path has four steps consisting of an up-convolutional layer with a kernel size of 2×2×2 taking strides of two in each dimension followed by two standard convolutional layers with 3 × 3 × 3 kernels and ReLu activation functions. In addition, before the two convolutions are applied, the layer is concatenated with the layer of corresponding resolution in the analysis path. Finally, the last layer is convolutional with a kernel size of 1 × 1 × 1 to generate output with number of channels equal to two. The full architecture is visualised in Figure 3.2.


Figure 3.2: The architecture of the 3D U-Net with blue boxes representing feature maps and the number of channels shown above each feature map. Figure modified from [30].

is done according to the uniform distribution U defined as K(i, j, k) ∼ U r − 6 lin+ lout , 6 lin+ lout  ∀i, j, k, (3.1) where linand lout denotes the number of units in the input and output tensors respectively.


Training and Sampling Strategies


Uniform Sampling

To provide a benchmark for further investigations, the network is first trained using uniform sam-pling and SGD together with the ADAM optimizer as outlined in Algorithm 2. For patch pattern one, the available memory limits the batch size to one whereas for patch pattern two the smaller patches allow for a batch size of eight. In both cases, the learning rate is set to α = 10−4. The batch size is enforced due to memory constraints whereas the learning rate is selected empirically. The loss function is taken as binary cross-entropy with L2-regularization with a parameter of λ = 0.001

to reduce overfitting. The model is trained for 100 epochs.


Importance Sampling


parameter are investigated, B = 3, which is the default parameter shown to work well for the tasks considered in [4], and B = 10 and B = 6 for patch pattern one and two respectively. The difference in the larger pre-sampling parameter is due to the longer epoch time of the second patch pattern, making an epoch take approximately eight hours when sampling with B = 10. After some initial experiments, it is realised that training with τ computed according to Equation 2.17 results in the condition τ > (B + 3b)/3b not being fulfilled and importance sampling never actually used. Although this may be an indication that the time gained from training with importance sampling is limited, it is still of interest to observe the characteristics of the samples that are selected most often by the algorithm. The threshold value it therefore set to τ = 0 in all training runs. Due to the extended epoch run times when importance sampling is switched on, the number of epochs are reduced for some of the runs. The setups are summarised in Table 3.1.

B Epochs Large patches 3 100

10 50 Small patches 3 60 6 30

Table 3.1: Training configurations for importance sampling.



Given an instance of the CNN, during training or after, segmentation on a full volume, i.e. the data from one patient, is done by feeding one patch at a time to the network. The patches are then reassembled to represent the full volume. In the case of overlap, that is when a voxel in the original image exists in more than one patch, the maximum of the SoftMax outputs is selected to represent the voxel. Given that the SoftMax function can be thought of as the probability that a voxel belongs to either the background or tumour class, this corresponds to relying on the prediction that is most certain.


Performance Metrics

Because the vast majority of the voxels in each image represents either healthy tissue or background the accuracy percentage does not give a very good indication of how well the network is performing when it comes to actually detecting tumours. It is therefore more common to use other metrics that provide a better insight into how well the set of voxels classified by the network as tumours overlaps with the tumour voxels in the ground truth. There are a number of possible metrics that serve this purpose however the one used for this project is the Dice score. As the Dice score is one of the metrics used in the BraTS 2020 challenge, this allows for easy comparison with the performance of the segmentation models in this project to those of other studies. It is computed as



where T P , F P and F N denote the total number of true positive, false positive and false negative voxels respectively. A Dice score of one implies a segmentation exactly identical to the ground truth.

Besides the performance of the trained network, the main quantity of interest is the time taken for training. This is measured both in terms of seconds/epoch and time to convergence.


Sampling Analysis

When the models are finished training with importance sampling the sampling patterns of the algorithm are examined in order to investigate whether the images that are chosen most often have particular features in common. This includes the following aspects:

• Does the ground truth of the image contain a tumour?

• In the case that the ground truth contains tumour, does the image cover the whole tumour? If not, what proportion of the tumour is outside the field of view of the patch?

In addition, statistics of the tumours, known as radiomic features, contained in the patches are examined. This is done using the Pyradiomics toolbox [33], which generates a triangle mesh defining the approximate shape of each tumour, also known here as the Region of Interest (ROI). Letting Nf denote the number of triangles defining the mesh, the following features are of interest:

The volume V of the ROI, computed using the mesh, Vi= Oai· (Obi× Oci) 6 (3.3) V = Nf X i=1 Vi, (3.4)

where ai, bi, ci are the points defining the ith triangle of the mesh and O the origin of the


The surface area A of the ROI, computed as the sum of the areas of each triangle in the mesh, Ai= 1 2|aibi× aici| (3.5) A = Nf X i=1 Ai. (3.6)

In the case that part of a tumour is cut off when extracting the patch, the area of the cut is disregarded.

Finally, the mean and standard deviation of the voxel intensities in each patch are measured. Denoting the number of voxels in the patch that do not correspond to background as Np and the


mean = 1 Np Np X i=1 I(i). (3.7) standard deviation = v u u t 1 Np Np X i=1 (I(i) − ¯I)2, (3.8)


Chapter 4


This chapter contains the results from training with the different patch sizes and sampling strategies. This includes the loss functions, Dice scores and training times. For importance sampling, the characteristics of the observed sampling patterns are also presented.


Large Patches

Training with importance sampling is done both with pre-sampling parameter B = 3 and with B = 10. The resulting training and validation loss values are compared against that of uniform sampling in Figures 4.1 and 4.2. The comparison is made both in terms of epoch number and elapsed time. In terms of epochs, some speedup, particularly in the early stages, is observed both for the training and validation loss. There is however no clear distinction between the two choices of B. The speedup is not present when considering the losses with respect to run time however, as all three methods evolve similarly.


Figure 4.1: Training and validation loss when training with uniform sampling and importance sampling plotted against epoch number.

Figure 4.2: Training and validation loss when training with uniform sampling and importance sampling plotted against running time.


Figure 4.3: Average validation Dice score when training with uniform sampling and importance sampling with B = 3 and B = 10 plotted against epoch number (left) and running time (right).


Distribution of Dice Scores

In order to assess how the performances of the models are improving as the training progresses -i.e whether segmentation is improved in general or if only a few samples are handled increasingly well, the distribution of Dice scores is investigated. Comparisons are made for both instances of importance sampling against uniform sampling in Figures 4.4 and 4.5. For B = 3, every 20th

epoch is compared and for B = 10 every 10th. It is observed that overall, the models trained with


Figure 4.4: Histograms of validation dice scores when training with uniform sampling in blue and importance sampling with B = 3 in orange.

Figure 4.5: Histograms of validation dice scores when training with uniform sampling in blue and importance sampling with B = 10 in green.



Sampling Patterns

While the importance sampling algorithm bases its sampling probabilities on estimates of the loss gradient, it is of interest to investigate if any other conclusions can be drawn regarding the patches that are sampled most frequently. Over the course of the training, all samples are chosen at least once when B = 3. For B = 10 the algorithm selects all but two samples at least once. Furthermore, by comparing the number of sampling events for each patch it can be seen that some patches are drawn more frequently than in the case of uniform sampling and some not as often. Figure 4.6 illustrates the normalised distribution of sampling events for all patches. Before normalising, the mean number of sampling events are µlarge,3= 100.00 and µlarge,10= 50.08 for B = 3 and B = 10

respectively. The corresponding standard deviations are σlarge,3 = 36.24 and σlarge,10 = 28.10.

Instances of patches that are sampled the least and most often are shown in Figures 4.7 and 4.8. For further analysis, denote the sets of the most frequently sampled patches when B = 3 as

Slarge,30 = {patches sampled more than µlarge,3 times during training}

Slarge,31 = {patches sampled more than µlarge,3+ σlarge,3 times during training}

Slarge,32 = {patches sampled more than µlarge,3+ 2σlarge,3times during training}.

The corresponding sets for B = 10 are defined similarly.


(a) B = 3

(b) B = 10

Figure 4.7: Slices with indices 7, 14, 21 and 28 in the longitudinal direction of the patches that are sampled the least during training.

(a) B = 3

(b) B = 10

Figure 4.8: Slices with indices 7, 14, 21 and 28 in the longitudinal direction of the patches that are sampled the most during training.


Figure 4.9: The proportion of sampled patches containing tumours in the training set and most frequently sampled patches with importance sampling.

Tumour Statistics

Given that each patch is only 32 voxels wide in the longitudinal direction a tumour may be cut off once or twice when a patch is extracted. Depending on where these cutoffs are a patch may contain only a small part of the tumour. Figure 4.10 shows the distribution of nonzero tumour volumes contained in the patches for some of the epochs. It is observed that for both B = 3 and B = 10 the smaller tumours dominate the most frequently sampled patches. For B = 3, the distribution of patches sampled more than µ1,3+ 2σ1,3times stand out as the most common volume

is approximately 40 cm3 rather than 0-15 cm3.


In addition to the tumour volume, the surface area, Figure 4.11, of the tumours contained in the sampled patches is also investigated. Among the patches sampled more than two standard deviations over the mean, the tumours with the smallest surface areas are avoided when B = 3. For B = 10 some of the tumours in the patches that are most common during training have very small surface areas but the larger ones are avoided instead.

Figure 4.11: Distribution of tumour surface area in the training set (blue) and the sets of patches often sampled with importance sampling. Orange corresponds to Silarge,3and green to Silarge,10.


Figure 4.12: Distribution of the proportion of full tumours that are captured by the patches in the training set (blue) and the sets of patches often sampled with importance sampling. Orange corresponds to Silarge,3and green to Silarge,10.

Considering now the statistics of the whole patches as opposed to only the tumours, the mean and standard deviation of voxel intensities in each patch are investigated, see Figures 4.13 and 4.14. For brevity only the T1 intensities are included here, the remaining three modalities can be


Figure 4.13: Distribution of the mean T1voxel intensities in the training set and the sets of patches

often sampled with importance sampling. Blue corresponds to Silarge,3and orange to Silarge,10.

Figure 4.14: Distribution of the standard deviation of T1voxel intensities in the training set and the

sets of patches often sampled with importance sampling. Blue corresponds to Silarge,3 and orange to Silarge,10.


Small Patches


running time the opposite is true - especially in earlier epochs using importance sampling appears to slow the training down.

Figure 4.15: Training and validation loss when training with uniform sampling and importance sampling plotted against epoch number.

Figure 4.16: Training and validation loss when training with uniform sampling and importance sampling plotted against running time.


Figure 4.17: Average validation Dice score when training with uniform sampling and importance sampling with B = 3 and B = 6 plotted against epoch number (left) and running time (right).

Distribution of Dice Scores


Figure 4.18: Histograms of validation dice scores when training with uniform sampling in blue and importance sampling with B = 3 in orange.

Figure 4.19: Histograms of validation dice scores when training with uniform sampling in blue and importance sampling with B = 6 in green.


Sampling Patterns

Investigating the sampling frequency of each patch during training reveals that training with impor-tance sampling affects the number of times a sample is chosen. This is visualised using normalised histograms in Figure 4.20. The mean number of times a sample is trained on with B = 3 is

µsmall,3 = 60 with a standard deviation of σsmall,3 = 15.15. For B = 6, the mean and standard

deviation are µsmall,6 = 30 and σsmall,6 = 10.50. In both cases, all patches are sampled at least

once during the whole training period however the most popular patches are picked up 143 times for B = 3 and 94 times with B = 6. Examples of the least and most sampled patches are shown in Figures 4.21 and 4.22. As for the case of the large patches, define sets containing the most commonly used patches with B = 3 as

S0small,3= {patches sampled more than µsmall,3 times during training}

S1large,3= {patches sampled more than µsmall,3+ σsmall,3times during training}

S2small,3= {patches sampled more than µsmall,3+ 2σsmall,3 times during training},

and the corresponding ones for B = 6.


(a) B = 3

(b) B = 6

Figure 4.21: Slices with indices 7, 14, 21 and 28 in the longitudinal direction of the patches that are sampled the least during training.

(a) B = 3

(b) B = 6

Figure 4.22: Slices with indices 7, 14, 21 and 28 in the longitudinal direction of the patches that are sampled the most during training.

In the training set approximately 35.6% of the patches contain tumour to some extent. Figure 4.23 shows the corresponding proportions for the most commonly sampled patches with importance sampling. The general trend here is that the more a patch has been sampled the more likely it is to contain tumour to some extent. However, when B = 3 even the most commonly sampled patches altogether contain tumours to a smaller extent than the training set as a whole.


Figure 4.23: The proportion of sampled patches containing tumours in the training set and most frequently sampled patches with importance sampling.

is limited and so the mean tumour volume contained in a patch, among those that contain tumours at all, in the training set is lower than for the larger patches. The distribution of nonzero tumour volumes in the patches sampled with importance sampling are shown in Figure 4.24. For the case B = 3 the volume distributions of the patches that are trained on most often are very similar to that of the whole training set. For B = 6, it can be seen that for those samples that are chosen more than two standard deviations away from the mean number of times, tumours smaller than approximately 25 cm3 are slightly more prevalent. When considering the distributions of tumour

surface area, see Figure 4.25, the patterns are similar in that the patches that are sampled up to µ2,3+ σ2,3 and µ2,6+ σ2,6 times respectively have approximately the same characteristics as the


Figure 4.24: Distribution of tumour volumes in the training set (blue) and the sets of patches often sampled with importance sampling. Orange corresponds to Sismall,3 and green to Sismall,6.

Figure 4.25: Distribution of tumour surface area in the training set (blue) and the sets of patches often sampled with importance sampling. Orange corresponds to Sismall,3and green to Sismall,6.


Figure 4.26: Distribution of the proportion of full tumours that are contained in the training set (blue) and the sets of patches often sampled with importance sampling. Orange corresponds to

Silarge,3and green to Silarge,10.

Finally the distributions of the mean and standard deviations of voxel intensities in the sampled patches are presented in Figures 4.27 and 4.28. As with the large patches, only the T1 intensities

are shown here and the remaining three modalities can be found in Appendix A. The deviations from the distribution for the whole training set are again very small. In terms of mean values, the case B = 6 displays a slightly higher frequency of means in the larger range for both S1small,6 and S2small,6.

Figure 4.27: Distribution of the mean T1voxel intensities in the training set and the sets of patches


Figure 4.28: Distribution of the standard deviation of T1 voxel intensities in the training set and


Chapter 5


In this chapter the results presented in the previous chapter are discussed and put into context. Conclusions are drawn regarding both the results of the training and the patterns observed when training with importance sampling for each of the two patch sizes. Finally, some general points concerning the validity and robustness of the experiments are given.


Performance and Training Time

For both patch sizes, the benchmark of uniform sampling results in a model which converges both in terms of training and validation loss. Introducing importance sampling has a positive impact on the convergence of both the training and validation loss in terms of epoch number, particularly in the earlier stages of training. It is also noted that using a larger value for B results in a slightly faster convergence, although the differences are small. In terms of wall clock time there is however no real gain as the time taken to pre-sample and compute importance scores counteracts the gain in terms of epochs for both values of B, regardless of patch size. This is expected given that importance sampling with τ not manually set to zero stays in uniform sampling. Another interesting aspect is the regularising effect of importance sampling that is observed for the validation loss in the case of the larger patches.


Overall, the results suggest that there is enough variation in both sets of extracted training patches for importance sampling to reduce the variance of the gradients and give a smoother training with less fluctuations but not enough to actually speed up the training. This might partially be due to the overlap between slices, as the patches from the same patient contain a high amount of the same information in their fields of view. Using less overlap might increase the effect of importance sampling. In both cases, the importance trained models achieve better overall Dice scores on the validation data and as can be seen in Figure 5.1, a larger pre-sampling batch yields a better score. Compared to the top ranked papers in the 2020 BraTS challenge the models trained in this project are however not quite able to compete with the current state of the art. It should also be noted that since the volumes have been down sampled to a lower resolution due to memory limitations, the resulting segmentations are less precise. It would be interesting to see how the importance sampling algorithm fares on patches with higher resolution, as they should contain more variation.

Figure 5.1: Comparison of Dice scores in the top ranked papers in the BraTS 2020 segmentation challenge and the models trained with importance sampling. Note that ’BraTS best model’ [34] corresponds to the highest ranked model overall and ’BraTS best Dice score’ [35] to the model that achieved the highest Dice score in the binary tumour segmentation task.


Sampling Trends


When considering how often the importance sampling algorithm draws patches containing tumours, both values of B typically draw patches without tumour more often than one would with uniform sampling when working with large patches. The difference is most notable for the subsets of patches that are sampled very often, which suggests that the model struggles with properly classifying the voxels in patches without any tumour. While the number of false positive predictions have not been recorded when computing the Dice scores at each epoch, it is possible that exposing the model to patches without tumours to a higher extent makes it less prone to misclassifying healthy tissue. When the patches are small, the opposite trend is observed. This may be because this training set contains patches with tumour to a slightly lower extent than is ideal for the model to learn to recognise them properly.

Based on the analysis of the radiomic quantities, it appears that when patches are large importance sampling favours those with small tumours, both in terms of volume and surface area, to some extent. It is however interesting that the distribution of patches that are sampled most with B = 3 peaks a volume of approximately 40 cm3 while the one corresponding to B = 10 peaks at about

10 cm3. Combining this with the fact that the proportion of tumours in the full volumes that are

covered by the patch have most mass to the right in the histogram when B = 3 and more to the left with B = 10 could imply that training with a larger pre-sampling batch favours patches that have been extracted around tumours which are originally small. For the smaller patches any such trends are more difficult to detect as the distribution of volumes and surface areas among the most sampled patches is very similar to those of the whole training set.

For both patch sizes, the distribution of voxel intensities does not appear to greatly influence how often a sample is used during training. Although some minor changes to the distribution are observed, they are too small to rule out as being due to noise. The same can be said about the distribution of standard deviations. Here one might expect that images with low standard deviation would be sampled more often as a small spread of values indicate low contrast in the patch, however this is not observed for any of the B-values. Given that tumours in MR images are visualised through the contrast of the voxels, looking at higher order statistics, such as the kurtosis, skewness and number of peaks, in each patch might be more useful. Transforming the problem into a multiclass one by separating the three tumour classes that were used for segmentation in the original data set would likely also have an impact on the observed intensities. Then one would expect the intensity patterns to differ more between the modalities, as different modalities highlight different structures.

While some trends are observed, the differences between the set of the most frequently sampled patches and the training set as a whole are not large enough to say for certain if they in fact are trends or simply noise. The fact that importance sampling still has an effect on the stability of the losses and improves the Dice scores imply that there may be other underlying features that importance sampling picks out than those considered in this project. Further analysis of higher order features could potentially shed more light on what influences the sampling frequency of a patch.


Validity and Robustness


Chapter 6


By implementing importance sampling when training a 3D U-Net for segmentation on MR images of the brain, this thesis shows that it is possible to improve the performance of a deep learning algorithm on medical data by sampling data points selectively. In particular, it is demonstrated that importance sampling has a regularising effect on both the training and validation loss as well as the Dice scores of segmented volumes. The models trained with importance sampling also yield higher Dice scores when averaged over the whole validation data set.

Although improvement can be seen in terms of stability and performance incorporating importance sampling does not appear to speed up training in terms of runtime. When considering the minimiza-tion of the loss funcminimiza-tion as a funcminimiza-tion of time importance sampling progresses at approximately the same rate as uniform sampling, as the advantage gained by sampling strategically is cancelled out by the time it takes to perform the additional computations. The effect of the importance sampling is also mainly observed only in the earlier stages of training. This is however less important from a clinical perspective, as it is the performance of the network that determines whether it might be useful as a tool in practice.

Analysis of the images selected during importance sampling suggest that when using large patches, the algorithm favours small tumours to some extent and also uses patches without tumour more frequently. For smaller patches it is difficult to draw the same conclusions as the differences observed are too small to safely rule out as being noise.


Future Work

The results of this thesis indicate that it is possible to speed up learning and obtain both better and more stable results when training with importance sampling. As mentioned in the discussion, to build on these findings the first step should be to repeat the experiments in order to obtain more robust results. Beyond this, possible future work include:


estimates more often in the early epochs but sufficient to keep them for longer when the model has trained for some time.

• Experimenting with different values of the pre-sampling parameter B. The results presented in this project suggest that a larger pre-sampling batch leads to better performance and less fluctuations in the loss function but there is most likely a limit in to what extent the variance of the gradients can be reduced.



[1] Kh. Kalan Farmanfarma, M. Mohammadian, Z. Shahabinia, S. Hassanipour, H. Salehiniya. Brain cancer in the world: An epidemiological review. World Cancer Research Journal, 10.32113/wcrj_20197_1356, 2019.

[2] Zeynettin Akkus, Alfiia Galimzianova, Assaf Hoogi, Daniel Rubin & Bradley Erickson. Deep Learning for Brain MRI Segmentation: State of the Art and Future Directions. Journal of digital imaging, 30(4):449-459, 2017.

[3] Vighnesh Birodkar, Hossein Mobahi & Samy Bengio. Semantic Redundancies in Image-Classification Datasets: The 10% You Don’t Need. ArXiv abs/1901.11409, 2019.

[4] Angelos Katharopoulos & François Fleuret. Not All Samples Are Created Equal: Deep Learn-ing with Importance SamplLearn-ing. arXiv preprint arXiv:1803.00942, 2018.

[5] Kai Roman Laukamp, Frank Thiele, Georgy Shakirin, David Zopfs, Andra Faymonville, Marco Timmer, David Maintz, Micheal Perkuhn & Jan Borggrefe. Fully automated detection and segmentation of meningiomas using deep learning on routine multiparametric MRI. European Radiology, 29(1):124–132, 2019.

[6] Muhammed Talo, Ozal Yildirim, Ulas Baran Baloglu, Galip Aydin, U Rajendra Acharya. Convolutional neural networks for multi-class brain disease detection using MRI images. Com-puterized Medical Imaging and Graphics, 78(2019): 101673, 2019.

[7] Dmitry Lachinov, Evgeny Vasiliev, & Vadim Turlapov. Glioma Segmentation with Cascaded Unet. Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, p. 189-198, 2019.

[8] Sergio Pereira, Adriano Pinto, Victor Alves, & Carlos A Silva. Brain Tumor Segmentation Using Convolutional Neural Networks in MRI Images. IEEE Transactions on Medical Imaging, 35(5):1240-1251, 2016.

[9] Alexander de Brebisson & Giovanni Montana. Deep neural networks for anatomical brain seg-mentation. 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops, 10.1109/CVPRW.2015.7301312, 2015.


[11] Ameya Prabhu, Charles Dognins & Maneesh Singh. Sampling Bias in Deep Active Classifi-cation: An Empirical Study, p 4049-4059, 10.18653/v1/D19-1417, 2019.

[12] Kailas Vodrahalli, Ke Li & Jitendra Malik. Are All Training Examples Created Equal? An Empirical Study, 2018.

[13] R Manmatha, Chao-Yuan Wu, Alexander Smola, Alexander & Philipp Krahenbuhl. Sampling Matters in Deep Embedding Learning. 2017 IEEE International Conference on Computer Vision, p. 2859-2867, 2017.

[14] Angela Jiang, Daniel Wong, Giulio Zhou, David G Andersen, Jeffery Dean, Gregory Granger, Gauri Joshi, Michael Kaminsky, Michael Kozuch, Zachary Z Lipton, Padmanabhan Pillai. Ac-celerating Deep Learning by Focusing on the Biggest Losers. arXiv preprint arXiv:1910.00762, 2019.

[15] Tyler B Johnson & Carlos Guestrin. Training Deep Models Faster with Robust, Approximate Importance Sampling. Proceedings of the 32nd International Conference on Neural Informa-tion Processing Systems, p.7276–7286, 2018.

[16] Angelos Katharopoulos & François Fleuret. Biased importance sampling for deep neural net-work training. arXiv preprint arXiv:1706.00043, 2017.

[17] Randa El-Zein & Melissa Bondy, Margaret Wrench. Epidemiology of Brain Tumors. Brain Tumors. Contemporary Cancer Research, Humana Press, 2005.

[18] Dominik Weishaupt, Victor D Köchli & Borut Marincek. How does MRI work? An Intro-duction to the Physics and Function of Magnetic Resonance Imaging, 2nd edition. Springer, 2008.

[19] Salman Khan, Hossein Rahmani, Syed Afaq Ali Shah, Mohammed Bennamoun. A Guide to Convolutional Neural Networks for Computer Vision. Morgan & Claypool, 2018.

[20] Ivana Despotović, Bart Goossens & Wilfried Philips. MRI segmentation of the human brain: challenges, methods, and applications. Computational and Mathematical Methods in Medicine, (2015): 1-23, 2015.

[21] Ian Goodfellow, Yoshua Bengio & Aaron Courville. Deep learning. MIT Press, 2016.

[22] Deanna Needell, Nathan Srebro, Nathan & Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, 155(1), p 549-573, 2016.

[23] Bjoern Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Transactions on Medical Imaging : 34(10), p. 1993-2024, 2015.

[24] Spyridon Bakas, Hamed Akbari, Aristeidis Sotiras, Michel Bilello, Martin Rozycki, Justin S Kirby, et al. Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Nature Scientific Data, 4(1), 2017.





Relaterade ämnen :