Automated Kidney Segmentation in Magnetic Resonance Imaging using U-Net


By Andreas Östling

Department of Statistics

Uppsala University

Supervisors

Anna Bornefalk-Hermansson

Taro Langner

Fan Yang Wallentin


Abstract

Manual analysis of medical images such as magnetic resonance imaging (MRI) requires a trained professional, is time-consuming and results may vary between experts. We propose an automated method for kidney segmentation using a convolutional Neural Network (CNN) model based on the U-Net architecture. Investigations are done to compare segmentations between trained experts, inexperienced operators and the Neural Network model, showing near human expert level performance from the Neural Network. Stratified sampling is performed when selecting which subject volumes to perform manual segmentations on to create training data. Experiments are run to test the effectiveness of transfer learning and data augmentation and we show that one of the most important components of a successful machine learning pipeline is larger quantities of carefully annotated data for training.


Acknowledgements

The author would like to thank everyone at the MRI research group at the Department of Surgical Sciences, Section of Radiology for great feedback and many fruitful discussions. I would also like to thank Antaros Medical for providing the ground truth segmentations of the kidney volumes. Finally I would like to thank Uppsala Clinical Research Center (UCR) for great feedback and providing the contacts to make this thesis possible.

Uppsala, June 2019


1 Introduction

Medical imaging is an important part of providing high quality medical care for patients. Manually analysing medical images is a very time-consuming task that requires high precision and consistency, both in the procedures that create the images and from the medical experts who perform the analysis. Common reasons for segmenting an organ include knowing the proportion of fat tissue in the liver or the total volume of a kidney, since these measurements have been shown to be indicators of diseases such as chronic kidney disease (Yu et al. 2018). These measures require an estimate of the volume of the quantities of interest. To estimate the volume, an operator has to go through the 3D volume slice by slice (a slice being a 2D part of the 3D volume) and label each voxel as either kidney or not kidney. Advancements that allow image volumes to be captured at higher resolution lead to higher quality data, but also to a greater number of images that the expert needs to look at to see the full object of interest. In this study, segmentation took around one hour per subject when performed by a trained expert.

In addition to being time-consuming, segmentation is also subjective, leading to different segmentations from operators with varying levels of experience and personal style. An automated procedure for segmentation would reduce this variability and create more consistent estimates. Earlier attempts at estimating kidney volume involved either calculating the major and minor axes of a 3-dimensional ellipsoid or fitting an ellipse to each slice in the volume and then adding the ellipses together to create a volume (Holloway et al. 1983). Later proposed solutions to the automated segmentation problem involve a multi-atlas approach (Shimizu et al. 2007; Wolz et al. 2013; Iglesias and Sabuncu 2015). These methods consist of creating an atlas of images of healthy subjects and then evaluating new images by first performing reasonable deformations, such as aligning the joints, to match the atlas as closely as possible and then comparing or propagating labels for segmentation.

1.1 Problem Statement

In this paper we propose to solve the segmentation problem by training a convolutional Neural Network (CNN) with the U-Net (Ronneberger, Fischer, and Brox 2015) base architecture. Convolutional Neural Networks show promising results on similar tasks such as segmenting visceral adipose tissue (VAT) and subcutaneous adipose tissue (SAT) (Langner et al. 2019) as well as liver and tumour segmentation (Christ et al. 2017). Second, we compare the results of this algorithm to the intra-operator variability of a trained expert, as well as to an inexperienced operator, in kidney segmentations to be able to evaluate the performance of this automated procedure. Further experiments investigate the effectiveness of augmenting training data using deformations, of initializing the network with pre-trained weights and of a larger sample of training images.

2 Data

Data is collected by the UK Biobank (Sudlow et al. 2015), which is a charity organisation aimed at "[...] improving the prevention, diagnosis and treatment of a wide range of serious and life-threatening illnesses [...]". The study recruited 500 000 people aged 40 to 69 years between 2006 and 2010. A total of 100 000 MRI scans using a 1.5 T MR scanner with the dual-echo Dixon Vibe protocol covering neck to knee are planned (West et al. 2016). Around 38 000 subjects have had their scan completed so far. Including one more scanning station per subject to cover the head or the feet would increase total scan time by approximately 60 seconds per extra station per subject, or 69 days in total for all of the 100 000 subjects (West et al. 2016).

Creating ground truth data is expensive and time-consuming, and creating labels for the entire data set is infeasible. With this in mind we wanted a sampling method that would maximize variation in the training examples. The motivation is that showing multiple examples of very similar kidneys provides little additional information, leading to a small gain in explanatory power when trying to robustly identify kidneys in a large population. The chosen method is to perform stratified sampling on the variables age, gender and weight of patients to maximize variation in the training data and hopefully get a model that generalizes with fewer samples. The stratified sampling is done in batches of 32 patients. Two batches were completed, resulting in a total sample of 64 patients.
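The sampling step described above can be sketched as follows. This is only an illustration under stated assumptions: the thesis does not specify how the strata were formed, so the bin widths, the field names and the helper names `stratum_key` and `stratified_sample` are hypothetical, and the population is synthetic rather than UK Biobank data.

```python
import random
from collections import defaultdict


def stratum_key(subject, age_width=10, weight_width=20):
    """Assign a subject to a stratum by binning age and weight (hypothetical bin widths)."""
    return (subject["age"] // age_width,
            subject["gender"],
            int(subject["weight"] // weight_width))


def stratified_sample(subjects, batch_size=32, seed=0):
    """Draw batch_size subjects by cycling over strata, one subject per stratum per pass."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in subjects:
        strata[stratum_key(s)].append(s)
    for members in strata.values():
        rng.shuffle(members)
    keys = sorted(strata)
    chosen = []
    while len(chosen) < batch_size and any(strata[k] for k in keys):
        for k in keys:
            if strata[k] and len(chosen) < batch_size:
                chosen.append(strata[k].pop())
    return chosen


# Synthetic population for illustration only (not UK Biobank data).
rng = random.Random(1)
population = [
    {"id": i,
     "age": rng.randint(40, 69),
     "gender": rng.choice("FM"),
     "weight": rng.uniform(50, 120)}
    for i in range(1000)
]
batch = stratified_sample(population, batch_size=32)
```

Cycling over the strata means that rare combinations of age, gender and weight are represented before common combinations are sampled a second time, which is the stated goal of maximizing variation in a small batch.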

Data consists of two 3D volumes containing the water and fat signals of the MRI scans. The water and fat signals are extracted from the raw MRI signal (Berglund 2011). The kidney contains small amounts of fat and shows up as a darker area in the fat image, which is useful for finding the contour of the kidney. The water image shows more detail about the kidney itself.

The volume is sliced in the axial plane (see Figure 1) with a resolution of 224x174 pixels, each pixel being a 2.232x2.232 mm square. The distance between axial slices is 4.5 mm. Each subject is scanned at six overlapping stations to create the volume. The full MR signal is decomposed into the water signal and the fat signal. Figure 2 shows the coronal view for both the water and fat signals with arrows indicating borders between the six stations. The kidney is located in or at the intersection of the second and third station, making them the focus of this work. Restricting the data to these two stations reduces the number of slices that do not contain any voxels (3D pixels) belonging to the kidney. The purpose is twofold: to make the model train faster by showing it fewer slices which contain no examples of kidney, and to have a more balanced distribution of voxels which are kidney compared to those which are not.

3 Neural Networks

The multilayer perceptron, also known as a fully connected network, is the simplest Neural Network architecture. It consists of an input layer, at least one hidden layer and an output layer. The input layer is the chosen representation of the available input (independent) variables, for example age, gender, height or weight for an individual. If we apply this approach to an image then each pixel in the image is treated as a variable whose value is the intensity of the pixel. A square (grayscale) image with resolution 256x256 could therefore be represented by a vector of $256 \cdot 256 = 65536$ brightness values. If we include the RGB color channels or water signal and fat signal in the MRI setting then this


Figure 2: MR image decomposed into fat and water signal. Arrows denote borders between scanning stations.

Hidden layers can be stacked one after the other, with the output from the previous one being the input for the next one; the number of stacked layers determines the depth of the Neural Network.

3.1 A small fully connected network

Figure 3: A fully connected Neural Network with water and fat input signals in a 2x2 pixel image.


classifying kidney or no kidney in a 2x2 image. The first four input nodes (Water 1-4) are the four pixels that come from the water signal. The last four inputs (Fat 1-4) are the four pixels that come from the fat signal. Each node in the first hidden layer is the weighted sum of all of the input nodes from the previous layer followed by an activation function. The edges of the graph represent the weights (or learned parameters) and the arrows represent the sum. The nodes of the second layer are calculated in the same way, except that the nodes of the first hidden layer are used as input instead of the input layer. The most commonly used activation function for the hidden units is the rectified linear unit (ReLU) (Figure 4a). The final layer is the output layer, where the ReLU is replaced by a sigmoid activation function (Figure 4b), yielding an estimated probability that each pixel is part of a kidney. The probability is then thresholded to make the classification.

Figure 4: Two common activation functions for Neural Networks. (a) The Rectified Linear Unit (ReLU) function, $\sigma(x) = \max(0, x)$. (b) The Sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$.

The output is a vector of values that are either 0 or 1. If the output layer is of the same size as the input layer then the output vector can be rearranged back into a 2x2 grid and used to predict, for each pixel, whether it is part of a kidney or not.

3.2 Parameter updates

A Neural Network does not have a closed form solution in general; the weights are therefore updated by gradient descent. Calculating the gradient for all observations simultaneously quickly becomes inefficient with growing sample size and growing network size. What is typically done instead is to calculate the gradient for only a smaller portion of the observations and use that as an approximation of the gradient for the entire sample. The number of observations included for one update step is known as the batch size. This work utilizes, after some initial experiments, one of the extreme cases and has a batch size of 1.

The network in Figure 3 can be represented by the following matrix multiplications:

$$
\underbrace{\begin{pmatrix} o_1 \\ o_2 \\ o_3 \\ o_4 \end{pmatrix}}_{\text{Output}}
= \sigma_1\left(W_3 \begin{pmatrix} 1 \\ L_2 \end{pmatrix}\right), \quad
L_2 = \sigma_2\left(W_2 \begin{pmatrix} 1 \\ L_1 \end{pmatrix}\right), \quad
L_1 = \sigma_2\left(W_1 \underbrace{\begin{pmatrix} 1 \\ wat_1 \\ wat_2 \\ wat_3 \\ wat_4 \\ fat_1 \\ fat_2 \\ fat_3 \\ fat_4 \end{pmatrix}}_{\text{Input}}\right).
$$

Where $W_3$, $W_2$ and $W_1$ are parameter matrices to be learned, $\sigma_1(\cdot)$ is the sigmoid activation function, $\sigma_2(\cdot)$ is the ReLU activation function, $L_2$ and $L_1$ are vectors of intermediate values and the constant 1 (omitted from Figure 3) allows for an offset (sometimes referred to as bias) term. The activation functions are applied element wise.
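As a minimal sketch of these matrix multiplications, the forward pass can be written in plain Python. The hidden layer width, the random weights and the input values are assumptions made for illustration; they are not taken from Figure 3 or from the trained model.

```python
import math
import random


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def sigmoid(v):
    """Element-wise sigmoid."""
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def affine(W, v):
    """Multiply W by the vector (1, v); the leading 1 provides the offset term."""
    extended = [1.0] + list(v)
    return [sum(w * x for w, x in zip(row, extended)) for row in W]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def forward(x, W1, W2, W3):
    L1 = relu(affine(W1, x))
    L2 = relu(affine(W2, L1))
    return sigmoid(affine(W3, L2))


rng = random.Random(0)
hidden = 5  # hypothetical hidden layer width
W1 = random_matrix(hidden, 8 + 1, rng)       # 8 inputs plus the constant 1
W2 = random_matrix(hidden, hidden + 1, rng)
W3 = random_matrix(4, hidden + 1, rng)       # 4 outputs, one per pixel

x = [0.9, 0.8, 0.1, 0.2,   # wat1..wat4 (made-up intensities)
     0.1, 0.2, 0.7, 0.6]   # fat1..fat4 (made-up intensities)
probabilities = forward(x, W1, W2, W3)
labels = [1 if p > 0.5 else 0 for p in probabilities]
```

With untrained random weights the resulting labels carry no meaning; the sketch only shows how the offset term and the two activation functions enter the computation.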

The network is trained by iteratively updating the parameters of the layers. One update step starts with a forward pass, which consists of performing the matrix multiplications above and calculating the value of a loss function. Our choice of loss function is the negative log-likelihood of the Bernoulli distribution (excluding the constant which has derivative 0), also known as the binary cross-entropy loss:

$$-\log L(\hat{p} \mid y) = -\big(y \log(\hat{p}) + (1 - y) \log(1 - \hat{p})\big), \tag{1}$$

where $\hat{p}$ is the estimated probability of kidney by the model and $y$ is the true label. The next step is to calculate the gradient of the loss function with respect to the weights. We do this with backpropagation: starting from the output and applying the chain rule, the gradient is multiplied through in a backward pass until it has been accumulated all the way to the input. The weights are then updated in the negative direction of the gradient multiplied by the learning rate.
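The update step can be illustrated on the smallest possible case, a single sigmoid unit with batch size 1. For this unit the chain rule gives the derivative of the binary cross-entropy loss with respect to the pre-activation as $\hat{p} - y$. The weights, input, label and learning rate below are made-up values, and the helper names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(p_hat, y):
    """Binary cross-entropy loss of equation (1)."""
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))


def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sgd_step(w, b, x, y, lr):
    """One gradient descent step for a single sigmoid unit with batch size 1."""
    dz = predict(w, b, x) - y  # dLoss/dz, obtained with the chain rule
    w_new = [wi - lr * dz * xi for wi, xi in zip(w, x)]
    b_new = b - lr * dz
    return w_new, b_new


w, b = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1
loss_before = bce(predict(w, b, x), y)
w, b = sgd_step(w, b, x, y, lr=0.1)
loss_after = bce(predict(w, b, x), y)
```

In a deeper network the same quantity is propagated backwards through every layer, which is what backpropagation automates.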

The above description is a simplification for instructive purposes. In practice we apply the Adam (Kingma and Ba 2014) algorithm, which updates exponential moving averages of the first and second moment of the estimated gradient instead of only using the gradient at the current time step.
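A sketch of one Adam update for a single scalar parameter is given below. The default values for `beta1`, `beta2` and `eps` are those suggested by Kingma and Ba; the hyperparameters used for the experiments in this thesis are not stated here, and the toy loss and learning rate are assumptions for illustration.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy problem: minimise (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
```

Because the step is divided by the square root of the second moment estimate, the size of each update depends mainly on the learning rate rather than on the raw magnitude of the gradient.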

3.3 Convolutional Neural Networks

Fukushima (1988) proposed a very early version of the convolutional Neural Network structure. This was before modern-day computing power was available and the idea never became popular. One decade later LeCun et al. (1998) proposed a version which utilizes gradient based optimization and obtained good results for the handwritten digit classification problem, with examples for reading zip-codes and bank checks. Yet another decade later Krizhevsky, Sutskever, and Hinton (2012) introduced AlexNet, which showed large improvements in classifying objects on the ImageNet (Deng et al. 2009) dataset, which consists of several million images across 1000 different classes. The large improvement in accuracy came with the ability to utilize graphics processing units (GPUs) for training, greatly reducing the time needed to train these large models with millions of parameters.

A fairly recent paper by Zeiler and Fergus (2014) exemplifies the desire to understand these models on a deeper level. They show that early layers of the network represent simple features such as edges or color gradients while deeper layers can be representations of more complex objects such as humans, dogs or cars. For medical images these higher level abstractions could represent a liver or a kidney.

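$$
\underbrace{\begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}}_{I}
\ast
\underbrace{\begin{pmatrix}
1 & 0 & 1 \\
0 & 1 & 0 \\
1 & 0 & 1
\end{pmatrix}}_{K}
=
\underbrace{\begin{pmatrix}
0 & 2 & 2 & 3 & 1 \\
1 & 2 & 4 & 3 & 3 \\
1 & 2 & 3 & 4 & 1 \\
1 & 3 & 3 & 1 & 1 \\
2 & 2 & 1 & 1 & 0
\end{pmatrix}}_{I \ast K}
$$

Figure 5: An example of a 3x3 kernel with stride 1 applied to a 5x5 input with one layer of zero-padding (the outer border of zeros in $I$), resulting in a 5x5 output. (TikZ code: https://github.com/PetarV-/TikZ/tree/master/2D%20Convolution)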

A convolutional Neural Network uses convolutional layers where the nodes of a fully


convolutional kernel has three hyperparameters: size, padding and stride (Dumoulin and Visin 2016). The size determines the size of a sweeping window across the input image. Padding is primarily used to ensure that the output of the convolutional layer is the same size as the input, which allows for deeper networks without the layers deeper in the network becoming too small to contain information. The stride (default=1) determines how far the kernel moves along the input image for each output. Stacking multiple kernels at the same layer increases the number of output channels and allows the network to learn different shapes of the input. The next convolutional layer would then operate on its specified kernel size but across all of the channels of the previous layer.
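The roles of size, padding and stride can be made concrete with a small single-channel sketch in plain Python. The function name `conv2d` is hypothetical and this is not the PyTorch implementation used for the experiments; the input, kernel and expected output are the values of the example in Figure 5.

```python
def conv2d(image, kernel, padding=0, stride=1):
    """Slide a square kernel over a zero-padded 2D image and sum the element-wise products."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    padded = [[0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for i in range(h):
        for j in range(w):
            padded[i + padding][j + padding] = image[i][j]
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1
    return [
        [
            sum(kernel[a][b] * padded[i * stride + a][j * stride + b]
                for a in range(k) for b in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Unpadded 5x5 input and 3x3 kernel from Figure 5.
I = [
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
]
K = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]

same_size = conv2d(I, K, padding=1, stride=1)   # 5x5 output, as in Figure 5
strided = conv2d(I, K, padding=1, stride=2)     # 3x3 output
```

With one layer of zero-padding and stride 1 the output has the same 5x5 size as the input, while stride 2 reduces it to 3x3. The kernel in this example is symmetric, so it makes no difference that the sketch does not flip the kernel.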

A Max Pooling layer has no parameters to be learned. Instead of multiplying the input with a weight matrix, we simply take the maximum value as output for each position of a kernel. The max pooling layer serves the purpose of reducing the spatial dimension to allow a more general latent representation of the original image.
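A minimal sketch of max pooling on a single channel follows. The 2x2 window with stride 2 is an assumption (it is the common choice for U-Net style networks and is not stated explicitly in this section), and the feature map values are made up.

```python
def max_pool2d(image, size=2, stride=2):
    """Take the maximum over each size x size window of a 2D feature map."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [
        [
            max(image[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool2d(feature_map)  # [[4, 2], [2, 8]]
```

Each spatial dimension is halved, and no weights are involved.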

The up-sampling convolution (also known as deconvolution or transposed convolution) is implemented in PyTorch by interlacing zero padding in the input and then applying a convolution kernel to achieve the desired larger resolution.

Dumoulin and Visin (2016) include more detailed animated visualisations of the layers mentioned above at their GitHub page (https://github.com/vdumoulin/conv_arithmetic).

3.4 U-Net

The U-Net (Ronneberger, Fischer, and Brox 2015) is designed to create image segmentations from images instead of classifying images into categories. Segmentation can be viewed as a very high dimensional classification problem where each pixel (or voxel) is classified into one of several categories in general, or as either kidney or not kidney in this case. The U-Net architecture can be seen in Figure 6, where the resemblance to a U-shape is visible.

The U-Net architecture consists of an encoder part on the left, a decoder part on the right and concatenation connections between the encoder and the decoder on each level. One level of the encoder is a convolutional block followed by a max pooling block, feeding into the next level below. The decoder starts at the bottom of the U-shape with an up-sampling block with padding whose output is concatenated with the encoder output on the level above, represented by the purple edges connecting to (II). These up-sampling blocks are repeated until the original input size is recovered. Finally a softmax layer produces an output with the desired number of classes.
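The bookkeeping of channels and spatial sizes through the encoder and decoder can be traced with a short sketch. The channel counts are the labels shown in Figure 6 and the 256x256 input size is the one mentioned in section 4.3; that every convolution preserves the spatial size, that pooling halves it and that up-sampling doubles it are assumptions of this sketch, and the function name `unet_shapes` is hypothetical.

```python
def unet_shapes(input_size=256, in_channels=2,
                encoder_channels=(64, 128, 256, 512, 1024), out_channels=1):
    """Trace (stage, channels, height/width) through a U-Net with size-preserving convolutions."""
    trace = [("input", in_channels, input_size)]
    size = input_size
    skips = []
    # Encoder: convolutional block, then pooling on every level except the bottom one.
    for level, channels in enumerate(encoder_channels):
        trace.append((f"encoder {level}", channels, size))
        if level < len(encoder_channels) - 1:
            skips.append((channels, size))
            size //= 2
    # Decoder: up-sample, concatenate with the encoder output of the same level, convolve.
    for level, (skip_channels, skip_size) in enumerate(reversed(skips)):
        size *= 2
        assert size == skip_size, "up-sampled size must match the skip connection"
        trace.append((f"concat {level}", 2 * skip_channels, size))
        trace.append((f"decoder {level}", skip_channels, size))
    trace.append(("output", out_channels, size))
    return trace


shapes = unet_shapes()
for stage, channels, size in shapes:
    print(f"{stage:10s} {channels:5d} channels  {size}x{size}")
```

The trace shows why the concatenation doubles the channel count before each decoder block, and that the output recovers the 256x256 input resolution after four up-sampling steps.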


Figure 6: The U-net model starting with 2 input channels (water and fat) on the left, going down through the encoder with repeated max-pooling into 2D convolutions followed by a ReLU activation function until we reach the bottom of the encoder. The decoder part consists of repeated 2D convolutions followed by a ReLU into up-sampling layers into concatenations before finally exiting through a softmax layer, classifying kidney or not kidney on a pixel-per-pixel basis. Channel labels in the figure: 2 (input); 64, 128, 256, 512, 1024 (encoder); 512, 256, 128, 64 (decoder); 1 (output). (TikZ code adapted from: https://github.com/HarisIqbal88/PlotNeuralNet)


4 Training a Convolutional Neural Network

The aim of the model is to create segmentations of the kidney by classifying each voxel as kidney or not kidney; this classification is the model output. The model input is the pair of water and fat axial slices described in section 2.

4.1 Metrics

To evaluate the performance we compare the voxels classified as kidney in the ground truth data with the voxels classified as kidney by the model. The comparison is done on a subject by subject basis to make sure we compare entire kidney volumes and not just slices. The result of classified voxels can be summarised in a confusion matrix, seen in Table 1.

| | Prediction 0 | Prediction 1 |
|---|---|---|
| Label 0 | True Negative (TN) | False Positive (FP) |
| Label 1 | False Negative (FN) | True Positive (TP) |

Table 1: The confusion matrix for a binary classification problem

A very common measure for cases with a large number of true negative outcomes is the Dice score, also known as F1-score. The measure is the harmonic mean of precision and sensitivity (or recall) and is defined as

$$\text{Dice} = \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}} = \frac{2TP}{2TP + FP + FN}$$

where

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{sensitivity} = \frac{TP}{TP + FN}.$$

Equivalently the dice score can, with a slightly different interpretation, be defined as

$$\text{Dice} = \frac{2 \times |X \cap Y|}{|X| + |Y|} \tag{2}$$

that is, the ratio of 2 times the number of overlapping voxels to the total number of voxels in both ground truth and prediction. A perfect overlap has a dice score = 1 while a dice score = 0 would mean that none of the voxels in the predicted and ground truth segmentations overlap.

A visualisation of these regions can be seen in Figure 7, where the correctly classified pixels of this axial slice are colored in blue, the pixels that were classified as kidney by the model but not by the operator (false positive) in yellow and the pixels that were classified as kidney by the operator but not by the model (false negative) in red.

Figure 7: Example of predicted segmentation vs. ground truth segmentation in an axial slice. Yellow: Over segmented (FP). Red: Under segmented (FN). Blue: Correctly segmented (TP).
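The two equivalent definitions of the dice score can be checked against each other on a small made-up example. The label and prediction vectors below are illustrative only, the helper names are hypothetical, and returning 1.0 when both segmentations are empty is a convention chosen for this sketch.

```python
def confusion_counts(label, prediction):
    """Count TP, FP, FN and TN for two flat binary sequences of equal length."""
    pairs = list(zip(label, prediction))
    tp = sum(1 for l, p in pairs if l == 1 and p == 1)
    fp = sum(1 for l, p in pairs if l == 0 and p == 1)
    fn = sum(1 for l, p in pairs if l == 1 and p == 0)
    tn = sum(1 for l, p in pairs if l == 0 and p == 0)
    return tp, fp, fn, tn


def dice_score(label, prediction):
    """Dice = 2TP / (2TP + FP + FN); true negatives do not enter the score."""
    tp, fp, fn, _ = confusion_counts(label, prediction)
    denominator = 2 * tp + fp + fn
    return 1.0 if denominator == 0 else 2 * tp / denominator


label      = [0, 1, 1, 1, 0, 0, 0, 0]
prediction = [0, 0, 1, 1, 1, 0, 0, 0]

dice = dice_score(label, prediction)

# Set-based form of equation (2): X and Y hold the indices of kidney voxels.
X = {i for i, l in enumerate(label) if l == 1}
Y = {i for i, p in enumerate(prediction) if p == 1}
dice_sets = 2 * len(X & Y) / (len(X) + len(Y))
```

Both forms give 2/3 for this example. Appending any number of background voxels that are correctly predicted as background leaves the score unchanged, which is why the measure suits images where most voxels are not kidney.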

4.2 Dropout

Dropout (Srivastava et al. 2014) originates from a master's thesis by Nitish Srivastava and is a technique used as a regularization method for Neural Networks. For each node in the current layer, the node is removed by setting its output to 0 with probability p. This is done to prevent the network from overfitting and is, according to Srivastava et al. (2014), similar to creating an ensemble of multiple networks, all in one.

Dropout is only active during training. It is disabled, so that no connections are turned off, when the model is used for prediction, to retain as much information as possible.

Srivastava et al. (2014) identify an issue with dropout in their paper: when we switch the model from training to testing, the layers directly following a layer with dropout will receive a higher total activation input than during training. This increased activation would propagate throughout the rest of the network and affect the output. They propose to solve this by scaling all of the weights in the layer with dropout by the probability of keeping a node, $1 - p$ in the notation used here, during testing to reduce the output. In PyTorch this is instead implemented by multiplying the input by $\frac{1}{1-p}$ during training for the nodes that are kept for that forward pass.
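The training-time scaling can be sketched in a few lines. This is a plain-Python illustration of the idea and not PyTorch's implementation; the function name `dropout` and the example values are hypothetical, and `p` denotes the probability of dropping a node as in the text above.

```python
import random


def dropout(values, p, training, rng):
    """Zero each value with probability p during training and rescale the kept ones."""
    if not training or p == 0.0:
        return list(values)          # dropout is disabled for prediction
    scale = 1.0 / (1.0 - p)
    return [v * scale if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 100_000

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

train_mean = sum(train_out) / len(train_out)   # close to 1.0
eval_mean = sum(eval_out) / len(eval_out)      # exactly 1.0
```

Because the kept activations are scaled up during training, the expected total activation is the same in both modes and no correction is needed at test time.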

4.3 Data Augmentation

The subject volumes go through a series of preprocessing steps before they can be fed through the network. The first step is to slice the volume into axial slices to create 2D images, with great care taken to ensure that all slices from a single subject appear either in the training set or in the test set, never both. Slices from the same subject are correlated, so letting them appear in both sets would bias the evaluation and inflate the perceived performance. Our implementation also uses zero-padding on the 224x174 input images to match the 256x256 input layer of the model. It is common in image analysis to augment input data with deformations to artificially increase the number of images for training. An image, transformed in a reasonable way, improves the performance of the algorithm by adding plausible variations of the same object without the need for more labelled data (Ronneberger, Fischer, and Brox 2015; Çiçek et al. 2016).
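Keeping all slices of a subject on the same side of the split can be sketched as below. The seven folds follow the 7-fold cross validation reported in section 5.2; the slice records, the number of slices per subject and the helper names `subject_folds` and `split_slices` are assumptions for illustration.

```python
import random


def subject_folds(subject_ids, n_folds=7, seed=0):
    """Partition the unique subject ids into n_folds disjoint groups."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def split_slices(slices, validation_subjects):
    """Send every slice of a validation subject to the validation set, all others to training."""
    held_out = set(validation_subjects)
    train = [s for s in slices if s["subject"] not in held_out]
    valid = [s for s in slices if s["subject"] in held_out]
    return train, valid


# 64 subjects with 10 axial slices each (made-up slice count).
slices = [{"subject": subject, "index": z} for subject in range(64) for z in range(10)]
folds = subject_folds([s["subject"] for s in slices])
train, valid = split_slices(slices, folds[0])
```

Splitting on subject ids rather than on individual slices guarantees that no subject contributes to both sets in any fold.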

4.3.1 Brightness threshold and intensity normalisation

MRI devices produce, after some processing, an intensity value for each voxel, with a higher value corresponding to a stronger signal for that voxel. A bright voxel in the water image corresponds to a large proportion of water molecules inside that voxel. One downside of this imaging technique is that the scanner will not always produce the exact same brightness value for a chosen voxel due to external factors, which in turn leads to arbitrary ranges of values between scans. We have performed brightness thresholding and intensity normalisation to mitigate this issue in the following way: pixels brighter than 99% of all other pixels are clipped to that 99% brightness level. The slices are then normalized such that the darkest pixel has value 0 and the brightest pixel has value 1. This is performed on both the training data and the validation set.
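A minimal sketch of the two steps on a flat list of pixel intensities follows. Whether the 99% level is computed per slice or per volume is not stated above, so computing it per slice is an assumption, as are the nearest-rank percentile definition and the function name `clip_and_normalise`.

```python
import math


def clip_and_normalise(pixels, percentile=99.0):
    """Clip intensities at the given percentile, then rescale linearly to [0, 1]."""
    ordered = sorted(pixels)
    # Nearest-rank percentile: the smallest value with at least `percentile` % of pixels at or below it.
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    ceiling = ordered[rank - 1]
    clipped = [min(v, ceiling) for v in pixels]
    lo, hi = min(clipped), max(clipped)
    if hi == lo:
        return [0.0 for _ in clipped]   # constant slice, nothing to rescale
    return [(v - lo) / (hi - lo) for v in clipped]


slice_pixels = list(range(100)) + [10_000]   # one extreme outlier
normalised = clip_and_normalise(slice_pixels)
```

Without the clipping step the single outlier would compress every other pixel into the range 0 to 0.01; with it, the remaining intensities use the full range from 0 to 1.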

4.3.2 Elastic deformation

Elastic deformation (Çiçek et al. 2016; Ronneberger, Fischer, and Brox 2015) has two hyperparameters: the first is the number of points $p$ and the second is the intensity of the deformation $\sigma$. The procedure is done by first creating a grid of $p$ points, each with a displacement vector sampled from

$$\begin{pmatrix} x \\ y \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma & 0 \\ 0 & \sigma \end{pmatrix}\right).$$

This grid is then interpolated to create a displacement grid for the input image, which is again interpolated after the deformation. Figure 8a shows an unaltered image of a cat. Figure 8b shows a deformation with reasonably chosen hyperparameters (p = 8, σ = 2). Figure 8c shows a deformation that is too strong and would be of more harm than help if it was used as a training image. It is worth noting that the choice of hyperparameters and their effectiveness heavily depends on the resolution of the image that the deformation is being applied to.

The same deformation grid is applied to the water signal, fat signal and the kidney segmentation when performing the augmentations to ensure that both the input and output are altered in the same way.

4.3.3 Translation transformation

A translation transformation shifts the position of the image along a random vector in the XY-plane. This is done to simulate subjects shifted up, down, left or right on the examination table, which would influence the position of the kidney in the image. Translation is a special case of elastic deformation that uses only a single grid point for the displacement.
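A sketch of a random translation applied identically to the water slice, the fat slice and the segmentation mask is shown below. Filling the exposed border with zeros, shifting by whole pixels, the maximum shift and the function names are assumptions of this sketch; the thesis does not state these details.

```python
import random


def translate(image, dx, dy, fill=0):
    """Shift a 2D image dx pixels to the right and dy pixels down, filling exposed pixels."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            ni, nj = i + dy, j + dx
            if 0 <= ni < h and 0 <= nj < w:
                out[ni][nj] = image[i][j]
    return out


def random_translation(water, fat, mask, max_shift, rng):
    """Draw one random shift and apply it to the inputs and the label alike."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return tuple(translate(img, dx, dy) for img in (water, fat, mask))


mask = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
shifted_right = translate(mask, dx=1, dy=0)   # the 1 moves from column 1 to column 2

rng = random.Random(0)
water_t, fat_t, mask_t = random_translation(mask, mask, mask, max_shift=1, rng=rng)
```

Drawing the shift once and reusing it for all three images keeps the label aligned with the inputs, in the same way as described for the elastic deformation grid.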

4.4 Transfer learning

The purpose of transfer learning (Pan and Yang 2010; Huh, Agrawal, and Efros 2016) is to combat problems with a small number of observations and to reduce long training times for Neural Networks. A small sample size creates a risk that the model would not generalize well due to not seeing enough examples. Initializing the parameters with pretrained weights gives the model a head start by not having to relearn what edges and gradients look like. Available data is then used to fine-tune the model. The weights are initialized from a convolutional Neural Network with the VGG16 (Simonyan and Zisserman 2014) architecture trained on the ImageNet (Deng et al. 2009) dataset. The early layers of the VGG16 model match the encoder part of the U-Net fairly closely. These are the weights that are transferred. The rest of the layers (including the entire decoder) in the U-Net are initialized with random weights.


Figure 8: Varying levels of elastic deformation applied to an image of a cat. (a) A picture of a cat in a sink. (b) A reasonable deformation. (c) Too strong deformation.

5 Results

5.1 Inter/Intra-operator variability

The ideal case for segmentations is to map the MR image to the real body of the subject and know the true tissue type at each voxel. This type of objective ground truth is not possible and we have to rely on trained experts to manually identify different tissues and organs in order to segment the images. Non-standardized instructions, experience or personal style of the operators are possible sources of variation in the resulting segmentations.

A single expert created all segmentations for this study, making the segmentations more consistent with each other. Five out of the 64 labelled volumes were segmented by the expert twice to be able to estimate the intra-operator dice score, which is a measure of the consistency of the operator. The intra-operator dice score can be seen as a maximum achievable dice score for our model, since it is unreasonable to think that a model could create segmentations that are more consistent with the trained expert than the expert themselves. The intra-operator dice score for the five subjects is seen in Figure 9a and the mean dice score is 96.17%.

Figure 9: Dice scores for volumes that were segmented first by a trained expert and then again by the trained expert at a second time or by an inexperienced operator. (a) Intra-operator variability of dice score for 5 subjects. (b) Inter-operator variability of dice score for 8 volumes between a trained expert and an inexperienced operator.

Some of the MRI volumes that were segmented by the trained expert were also segmented by a second, inexperienced operator to be able to evaluate the performance of the network, comparing the dice score of the network output to both the intra- and inter-operator dice scores. In Figure 9b we see the inter-operator dice score between the trained expert and the inexperienced operator. The mean dice score for the inexperienced operator is 87.22%.

5.2 U-Net performance

Table 2 shows the 7-fold cross-validated dice score after the final iteration of the training process for U-Net trained with or without augmentations, as well as the difference between having access to labels for 32 or 64 subjects. Looking at the table we see that the highest dice score is 0.9530 when using the full set of 64 subjects with augmentations and using pretrained weights.


Figure 10: 7-fold cross-validated dice score when using data augmentation and/or pretrained weights. (a) 32 subjects. (b) 64 subjects.

| Dice score | 32 subjects | 64 subjects |
|---|---|---|
| Original | 0.9372 | 0.9458 |
| Original* | 0.9403 | 0.9415 |
| Augmented | 0.9408 | 0.9511 |
| Augmented* | 0.9463 | 0.9530 |

Table 2: Dice score of the network trained on augmented and non-augmented data of 32 or 64 volumes with or without pretrained weights. *Using pretrained weights


5000 training iterations with the first batch of 32 subjects as training data. The curves for the models that utilize augmentation lie above the curves of the models that do not. There is not a large difference between models initialized with pretrained weights versus random initialization. Similar patterns are seen in Figure 10b, which is the graph for models trained on both of the available batches of subjects for a total of 64 subjects. Compared to Figure 10a we see that all of the curves are higher for models trained on the full data set. Augmentations still seem to perform better than leaving data unaltered. The graphs maintain a slight upward slope even for the last part of training, indicating that the model might benefit from even more training time.

Figures 11a, 11b and 11c show examples of segmentations in different situations with small errors. Figure 11d is a rare occurrence where the error is large: one of the kidneys is almost completely ignored.

Figure 12 shows two examples of what could be cysts on the kidney; the network correctly does not classify these areas as kidney. Larger cysts, which are rarer, could potentially cause


Figure 11: (a) A typical segmentation result. (b) Segmentation only containing one of the kidneys. (c) Segmentation with a large liver overlapping the kidney. (d) Segmentation where the model almost completely ignores one of the kidneys (incorrectly). Yellow: Over segmented. Red: Under segmented. Blue: Correctly segmented.


Figure 12: Predicted segmentations in a validation set. Possible cysts denoted with red arrows. (a) Segmentation with possible cyst. (b) Segmentation with possible cyst. Yellow: Over segmented. Red: Under segmented. Blue: Correctly segmented.

6 Conclusions

Deep learning shows promising results in many fields, and medical imaging does not appear to be the exception. We have shown near human expert level of kidney segmentations with the U-Net Neural Network architecture. In addition the model greatly outperforms an inexperienced operator and could in general benefit medical image segmentation with consistent and accurate segmentations. The model does still have some weaknesses where it does not perform as expected; more carefully acquired training data could help cover these edge cases.

Using pretrained weights might be beneficial in some cases, and having the weights pretrained on medical data instead of the ImageNet data set might show larger improvements. Augmentation seems to improve performance at smaller sample sizes but also remains relevant as the sample size increases. The graphs for the validation dice still have some upward slope even after training for 250 000 iterations, indicating that the model could achieve an even higher dice score if the training process was allowed to run for longer.

In general, the most important part of creating high performing data driven models such as Neural Networks seems to be more data, and especially more data of high quality.


