Automated Kidney Segmentation in
Magnetic Resonance Imaging using U-Net
By Andreas Östling
Department of Statistics
Uppsala University
Supervisors
Anna Bornefalk-Hermansson
Taro Langner
Fan Yang Wallentin
Abstract
Manual analysis of medical images such as magnetic resonance imaging (MRI) requires a trained professional, is time-consuming and results may vary between experts. We propose an automated method for kidney segmentation using a convolutional Neural Network (CNN) model based on the U-Net architecture. Investigations are done to compare segmentations between trained experts, inexperienced operators and the Neural Network model, showing near human expert level performance from the Neural Network. Stratified sampling is performed when selecting which subject volumes to perform manual segmentations on to create training data. Experiments are run to test the effectiveness of transfer learning and data augmentation and we show that one of the most important components of a successful machine learning pipeline is larger quantities of carefully annotated data for training.
Acknowledgements
The author would like to thank everyone at the MRI research group at the Department of Surgical Sciences, Section of Radiology for great feedback and many fruitful discussions. I would also like to thank Antaros Medical for providing the ground truth segmentations of the kidney volumes. Finally I would like to thank Uppsala Clinical Research Center (UCR) for great feedback and providing the contacts to make this thesis possible.
Uppsala, June 2019
Contents
1 Introduction
1.1 Problem Statement
2 Data
3 Neural Networks
3.1 A small fully connected network
3.2 Parameter updates
3.3 Convolutional Neural Networks
3.4 U-Net
4 Training a Convolutional Neural Network
4.1 Metrics
4.2 Dropout
4.3 Data Augmentation
4.3.1 Brightness threshold and intensity normalisation
4.3.2 Elastic deformation
4.3.3 Translation transformation
4.4 Transfer learning
5 Results
5.1 Inter/Intra-operator variability
5.2 U-Net performance
6 Conclusions
1 Introduction
Medical imaging is an important part of providing high quality medical care for patients. Manually analysing medical images is a time-consuming task that requires high precision and consistency, both in the procedures that create the images and from the medical experts who perform the analysis. Organs are often segmented to obtain measurements such as the proportion of fat tissue in the liver or the total volume of a kidney, since these measurements have been shown to be indicators of diseases such as chronic kidney disease (Yu et al. 2018). These measures require an estimate of the volume of the quantities of interest. To estimate the volume an operator has to go through the 3D volume slice by slice (a slice being a 2D part of the 3D volume) and label each voxel as either kidney or not kidney. Advancements that allow image volumes to be captured at higher resolution lead to higher quality data but also to a greater number of images that the expert needs to examine to see the full object of interest. In this study, segmentation by a trained expert took around one hour per subject.
In addition to being time-consuming, segmentation is also subjective, leading to different segmentations from operators with varying levels of experience and personal style. An automated procedure for segmentation would reduce this variability and create more consistent estimates. Earlier attempts at estimating kidney volume involved either calculating the major and minor axes of a 3-dimensional ellipsoid or fitting an ellipse to each slice in the volume and then adding the ellipses together to create a volume (Holloway et al. 1983). Later proposed solutions to the automated segmentation problem involve a multi-atlas approach (Shimizu et al. 2007; Wolz et al. 2013; Iglesias and Sabuncu 2015). These methods consist of creating an atlas of images of healthy subjects and then evaluating new images by first performing reasonable deformations, such as aligning the joints, to match the atlas as closely as possible and then comparing or propagating labels for segmentation.
1.1 Problem Statement
In this paper we propose to solve the segmentation problem by training a convolutional Neural Network (CNN) with the U-Net (Ronneberger, Fischer, and Brox 2015) base architecture. Convolutional Neural Networks show promising results on similar tasks such as segmenting visceral adipose tissue (VAT) and subcutaneous adipose tissue (SAT) (Langner et al. 2019) as well as liver and tumour segmentation (Christ et al. 2017). Second, we compare the results of this algorithm to the intra-operator variability of a trained expert as well as to an inexperienced operator in kidney segmentations, to be able to evaluate the performance of this automated procedure. Further experiments investigate the effectiveness of augmenting training data using deformations, of initializing the network with pre-trained weights and of using a larger sample of training images.
2 Data
Data is collected by the UK Biobank (Sudlow et al. 2015) which is a charity organisation aimed at "[...] improving the prevention, diagnosis and treatment of a wide range of serious and
life-threatening illnesses [...]". The study recruited 500 000 people aged between 40-69 years in 2006-2010. 100 000 MRI scans using a 1.5 T MR-scanner with the dual-echo Dixon Vibe
protocol covering neck to knee are planned (West et al. 2016). Around 38 000 subjects have had their scan completed so far. Including one more scanning station per subject to cover the
head or the feet would increase total scan time by approximately 60 seconds per extra station per subject or 69 days total for all of the 100 000 subjects (West et al. 2016).
Creating ground truth data is expensive and time-consuming, and creating labels for the entire data set is infeasible. With this in mind we wanted a sampling method that would maximize variation in the training examples. The motivation is that showing multiple examples of very similar kidneys provides little additional information, leading to a small gain in explanatory power when trying to robustly identify kidneys in a large population. The chosen method is to perform stratified sampling on the variables age, gender and weight of patients to maximize variation in the training data and hopefully obtain a model that generalizes with fewer samples. The stratified sampling is done in batches of 32 patients. Two batches were completed, resulting in a total sample of 64 patients.
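The stratified draw described above can be illustrated with a minimal sketch. The binning (10-year age bands, 20 kg weight bands), the round-robin draw across strata and the synthetic population are assumptions made for illustration only; the text does not specify how the strata were formed.

```python
import random
from collections import defaultdict


def stratified_batch(subjects, batch_size, seed=0):
    """Draw batch_size subjects, cycling over the strata so each is represented."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in subjects:
        # Hypothetical binning: gender x 10-year age band x 20 kg weight band.
        key = (s["gender"], s["age"] // 10, s["weight"] // 20)
        strata[key].append(s)
    for members in strata.values():
        rng.shuffle(members)
    keys = sorted(strata)
    batch = []
    # Round-robin: take one subject per stratum until the batch is full.
    while len(batch) < batch_size and any(strata[k] for k in keys):
        for k in keys:
            if strata[k] and len(batch) < batch_size:
                batch.append(strata[k].pop())
    return batch


# Synthetic population standing in for the real cohort (assumption).
_rng = random.Random(1)
population = [
    {
        "id": i,
        "gender": _rng.choice("FM"),
        "age": _rng.randint(40, 69),
        "weight": _rng.randint(50, 119),
    }
    for i in range(1000)
]
batch = stratified_batch(population, 32)
```

Cycling over the strata spreads a batch of 32 across as many age, gender and weight combinations as possible before any combination is sampled twice.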
Data consists of two 3D volumes containing water and fat signal for the MRI scans. The
water and fat signals are extracted from the raw MRI signal (Berglund 2011). The kidney contains small amounts of fat and shows up as a darker area when looking at the fat image,
which is useful for finding the contour of the kidney. The water image shows more detail about the kidney itself.
The volume is sliced in the axial plane (see Figure 1) with a resolution of 224x174 pixels, each pixel being a 2.232x2.232 mm square. The distance between axial slices is 4.5 mm. Each subject is scanned at six overlapping stations to create the volume. The full MR signal is decomposed into the water signal and the fat signal. Figure 2 shows the coronal view for both the water and fat signals with arrows indicating the borders between the six stations. The kidney is located in or at the intersection of the second and third station, making them the focus of this work. Restricting the data to these two stations reduces the number of slices that do not contain any voxels (3D pixels) belonging to the kidney. The purpose is twofold: to make the model train faster by showing it fewer slices which contain no examples of kidney, and to obtain a more balanced distribution between voxels which are kidney and those which are not.
3 Neural Networks
The multilayer perceptron, also known as a fully connected network, is the simplest Neural Network architecture. It consists of an input layer, at least one hidden layer and an output layer.
The input layer is the chosen representation of the available input (independent) variables; examples are age, gender, height or weight for an individual. If we apply this approach to an image then each pixel in the image is treated as a variable whose value is the intensity of the pixel. A square (grayscale) image with resolution 256x256 could therefore be represented by a vector of 256 · 256 = 65536 brightness values. If we include the RGB color channels or water signal and fat signal in the MRI setting then this
Figure 2: MR image decomposed into fat and water signal. Arrows denote borders between scanning stations.
Hidden layers can be stacked one after the other, with the output from the previous one being the input for the next one; the number of stacked layers determines the depth of the Neural Network.
3.1 A small fully connected network
Figure 3: A fully connected Neural Network with water and fat input signals in a 2x2 pixel image.
Figure 3 shows a small fully connected network classifying kidney or no kidney in a 2x2 image. The first four input nodes (Water 1-4) are the four pixels that come from the water signal. The last four inputs (Fat 1-4) are the four pixels that come from the fat signal. Each node in the first hidden layer is the weighted sum of all of the input nodes from the previous layer followed by an activation function. The edges of the graph represent the weights (or learned parameters) and the arrows represent the sum. The nodes of the second layer are calculated in the same way, with the exception of using the nodes of the first hidden layer as input instead of the input layer. The most commonly used activation function for the hidden units is the rectified linear unit (ReLU) (Figure 4a). The final layer is the output layer where the ReLU is replaced by a sigmoid (Figure 4b) activation function, yielding an estimated probability that each pixel is part of a kidney. The probability is then thresholded to make the classification.
Figure 4: Two common activation functions for Neural Networks. (a) The Rectified Linear Unit (ReLU) function, $\sigma(x) = \max(0, x)$. (b) The Sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$.
The output is a vector of values that are either 0 or 1. If the output layer is of the same size as the input layer then the output vector could be rearranged back into a 2x2 grid and then used
to perform predictions about each pixel whether they are part of a kidney or not.
3.2 Parameter updates
A Neural Network does not have a closed form solution in general; weights are therefore updated by gradient descent. Calculating the gradient for all observations simultaneously quickly becomes inefficient with growing sample size and growing network size. What is typically done instead is to calculate the gradient for only a smaller portion of the observations and use that as an approximation of the gradient for the entire sample. The number of observations included for one update step is known as the batch size. This work utilizes, after some initial experiments, one of the extreme cases and has a batch size of 1.
The network in Figure 3 can be represented by the following matrix multiplications:
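\[
\underbrace{\begin{pmatrix} o_1 \\ o_2 \\ o_3 \\ o_4 \end{pmatrix}}_{\text{Output}}
= \sigma_1\!\left(W_3 \begin{pmatrix} 1 \\ L_2 \end{pmatrix}\right), \qquad
L_2 = \sigma_2\!\left(W_2 \begin{pmatrix} 1 \\ L_1 \end{pmatrix}\right), \qquad
L_1 = \sigma_2\!\left(W_1 \underbrace{\begin{pmatrix} 1 \\ wat_1 \\ wat_2 \\ wat_3 \\ wat_4 \\ fat_1 \\ fat_2 \\ fat_3 \\ fat_4 \end{pmatrix}}_{\text{Input}}\right).
\]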
Here W3, W2 and W1 are parameter matrices to be learned, σ1(·) is the sigmoid activation function, σ2(·) is the ReLU activation function, L2 and L1 are vectors of intermediate values and the constant 1 (omitted from Figure 3) allows for an offset (sometimes referred to as bias) term. The activation functions are applied element-wise.
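The matrix form of the network can be made concrete with a small sketch in plain Python. The hidden layer width of 6, the random weights and the example pixel values are hypothetical; only the structure (two ReLU layers, a sigmoid output, a constant 1 prepended for the offset, thresholding at 0.5) follows the description above.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def affine(W, v):
    """Multiply W by the vector v with the constant 1 prepended (offset term)."""
    v1 = [1.0] + list(v)
    return [sum(w * x for w, x in zip(row, v1)) for row in W]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
hidden = 6  # hypothetical width; not stated in the text
W1 = random_matrix(hidden, 8 + 1, rng)       # 8 inputs plus the constant 1
W2 = random_matrix(hidden, hidden + 1, rng)
W3 = random_matrix(4, hidden + 1, rng)       # 4 outputs, one per pixel


def forward(x):
    L1 = relu(affine(W1, x))
    L2 = relu(affine(W2, L1))
    return sigmoid(affine(W3, L2))


# wat1..wat4 followed by fat1..fat4 (made-up intensities in [0, 1]).
x = [0.2, 0.8, 0.7, 0.1, 0.9, 0.3, 0.2, 0.8]
probs = forward(x)
labels = [int(p > 0.5) for p in probs]  # threshold the probabilities
```

The four entries of `labels` can be rearranged into the 2x2 grid to give a per-pixel prediction.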
The network is trained by iteratively updating the parameters of the layers. One update step consists of first performing a forward pass, which consists of performing the matrix multiplications above and calculating the value of a loss function. Our choice of loss function is the negative log-likelihood of the Bernoulli distribution (excluding the constant which has derivative 0), also known as the binary cross-entropy loss:
\[
-\log L(\hat{p} \mid y) = -\bigl(y \log(\hat{p}) + (1 - y)\log(1 - \hat{p})\bigr), \tag{1}
\]
where $\hat{p}$ is the estimated probability of kidney by the model and $y$ is the true label. The next step is to calculate the gradient of the loss function with respect to the weights. We do this with backpropagation, starting from the output and, by applying the chain rule, multiplying the gradient in a backward pass until the gradient is accumulated all the way to the input. The weights are then updated in the negative direction of the gradient multiplied by the learning rate.
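To illustrate one update step, the following sketch applies the loss in equation (1) to a single sigmoid output unit with batch size 1. The weights, input and learning rate are made-up values; a full network would apply the same chain rule through every layer.

```python
import math


def bce(p_hat, y):
    """Binary cross-entropy for one observation, as in equation (1)."""
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(w, x, y, lr):
    """One forward pass, one backward pass and one parameter update."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p_hat = sigmoid(z)
    loss = bce(p_hat, y)
    # Chain rule: dLoss/dz = p_hat - y and dz/dw_i = x_i.
    grad = [(p_hat - y) * xi for xi in x]
    # Move against the gradient, scaled by the learning rate.
    new_w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return new_w, loss


w = [0.0, 0.0, 0.0]
x = [1.0, 0.5, -1.5]  # the leading 1 plays the role of the offset input
y = 1
losses = []
for _ in range(5):
    w, loss = sgd_step(w, x, y, lr=0.5)
    losses.append(loss)
```

With all weights at zero the first prediction is 0.5, so the first loss equals log 2, and each subsequent step lowers the loss on this observation.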
The above description is a simplification for instructive purposes. In practice we apply the Adam (Kingma and Ba 2014) algorithm, which updates exponential moving averages of the first and second moment of the estimated gradient instead of only using the gradient at the current time step.
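A minimal sketch of the Adam update rule as published by Kingma and Ba is shown below, applied to a toy one-parameter problem. The default decay rates are those suggested in the Adam paper; the objective function and learning rate are chosen only for illustration and are not the settings used in this work.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of parameters; t is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for initialising both averages at zero.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v


# Toy objective f(w) = (w - 3)^2 with gradient 2 (w - 3).
w, m, v = [0.0], [0.0], [0.0]
w, m, v = adam_step(w, [2 * (w[0] - 3.0)], m, v, t=1, lr=0.01)
first_step = w[0]
for t in range(2, 2001):
    w, m, v = adam_step(w, [2 * (w[0] - 3.0)], m, v, t, lr=0.01)
```

On the first step the bias-corrected ratio has magnitude close to 1, so the parameter moves by roughly the learning rate regardless of the size of the gradient.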
3.3 Convolutional Neural Networks
Fukushima (1988) proposed a very early version of the convolutional Neural Network structure. This was before modern-day computing power was available and the idea did not become popular at the time. One decade later LeCun et al. (1998) proposed a version which utilizes gradient based optimization and obtained good results for the handwritten digit classification problem, with examples for reading zip-codes and bank checks. More than a decade after that, Krizhevsky, Sutskever, and Hinton (2012) introduced AlexNet, which showed large improvements in classifying objects on the ImageNet (Deng et al. 2009) dataset, which consists of several million images across 1000 different classes. The large improvement in accuracy came with the ability to utilize graphics processing units (GPUs) for training, greatly reducing the time needed to train these large models with millions of parameters.
A fairly recent paper by Zeiler and Fergus (2014) exemplifies the desire to understand these models on a deeper level. They show that early layers of the network represent simple features
such as edges or color gradients while deeper layers can be representations of more complex objects such as humans, dogs or cars. For medical images these higher level abstractions could
represent a liver or a kidney.
Figure 5: An example of a 3x3 kernel with stride 1 applied to a 5x5 input with one layer of zero-padding (gray), resulting in a 5x5 output. (TikZ code: https://github.com/PetarV-/TikZ/tree/master/2D%20Convolution)
Input I (5x5, shown without the surrounding layer of zero-padding):
0 1 1 1 0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0
1 1 0 0 0
Kernel K:
1 0 1
0 1 0
1 0 1
Output I ∗ K:
0 2 2 3 1
1 2 4 3 3
1 2 3 4 1
1 3 3 1 1
2 2 1 1 0
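The numbers in Figure 5 can be reproduced with a short sketch in plain Python. The function slides the kernel without flipping it, as deep learning libraries usually do; since the kernel in the figure is symmetric, this gives the same result as a flipped convolution. The function name and signature are illustrative only.

```python
def conv2d(image, kernel, padding=0, stride=1):
    """Slide a square kernel over a zero-padded 2D image (single channel)."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    # Surround the image with `padding` layers of zeros.
    padded = [[0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1
    return [[sum(kernel[a][b] * padded[i * stride + a][j * stride + b]
                 for a in range(k) for b in range(k))
             for j in range(out_w)]
            for i in range(out_h)]


I = [
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
]
K = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
same_size = conv2d(I, K, padding=1, stride=1)  # 5x5, as in Figure 5
strided = conv2d(I, K, padding=1, stride=2)    # 3x3, every second position
```

With stride 2 the kernel visits every second position in each direction, so `strided` contains rows and columns 0, 2 and 4 of `same_size`.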
A convolutional Neural Network uses convolutional layers, where the nodes of a fully connected layer are replaced by convolutional kernels. A convolutional kernel has three hyperparameters: size, padding and stride (Dumoulin and Visin 2016). The size determines the size of a sweeping window across the input image. Padding is primarily used to ensure that the output of the convolutional layer is the same size as the input, which allows for deeper networks without the dimension of layers deeper in the network becoming too small to contain information. The stride (default=1) determines how far the kernel moves along the input image for each output. Stacking multiple kernels at the same layer increases the number of output channels and allows the network to learn different shapes in the input. The next convolutional layer then operates with its specified kernel size but across all of the channels of the previous layer.
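The effect of the three hyperparameters on the output size, and of stacking kernels on the number of parameters, can be sketched with two small helper functions. The size relation is the one given by Dumoulin and Visin (2016); the helper names are hypothetical, and the parameter count assumes one offset per output channel.

```python
def conv_output_size(n, kernel, padding=0, stride=1):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_parameter_count(in_channels, out_channels, kernel):
    """Each output channel has one kernel per input channel plus one offset."""
    return out_channels * (in_channels * kernel * kernel + 1)


same = conv_output_size(256, kernel=3, padding=1)    # padding keeps 256
shrunk = conv_output_size(256, kernel=3, padding=0)  # without padding: 254
first_layer = conv_parameter_count(in_channels=2, out_channels=64, kernel=3)
```

A 3x3 kernel with one layer of padding therefore preserves the spatial size, while the same kernel without padding removes one pixel on every side.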
A Max Pooling layer has no parameters to be learned. Instead of multiplying the input with
a weight matrix, we simply take the maximum value as output for each position of a kernel. The max pooling layer serves the purpose of reducing the spatial dimension to allow a more
general latent representation of the original image.
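As a small illustration, max pooling over non-overlapping windows can be written in a few lines. The 2x2 window size and the example values are assumptions for the sketch.

```python
def max_pool2d(image, size=2):
    """Take the maximum over non-overlapping size x size windows."""
    h, w = len(image), len(image[0])
    return [[max(image[r + a][c + b] for a in range(size) for b in range(size))
             for c in range(0, w - size + 1, size)]
            for r in range(0, h - size + 1, size)]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool2d(feature_map)  # 4x4 input becomes 2x2
```

Each spatial dimension is halved and no weights are involved, which matches the description of the layer as parameter-free.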
The up-sampling convolution (known as deconvolution or transpose convolution) is implemented in PyTorch by interlacing zero padding in the input and then applying a convolution kernel to achieve the desired larger resolution.
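The zero-interlacing idea can be sketched in plain Python as follows. This is an illustration of the principle rather than PyTorch's actual code: zeros are inserted between the input pixels, the result is padded by the kernel size minus one, and an ordinary stride 1 convolution with the flipped kernel is applied. All function names are hypothetical.

```python
def interlace_zeros(image, stride):
    """Insert stride - 1 zeros between neighbouring pixels."""
    h, w = len(image), len(image[0])
    out = [[0] * ((w - 1) * stride + 1) for _ in range((h - 1) * stride + 1)]
    for r in range(h):
        for c in range(w):
            out[r * stride][c * stride] = image[r][c]
    return out


def pad(image, p):
    """Surround the image with p layers of zeros."""
    width = len(image[0]) + 2 * p
    blank = [[0] * width for _ in range(p)]
    middle = [[0] * p + list(row) + [0] * p for row in image]
    return blank + middle + [[0] * width for _ in range(p)]


def correlate_valid(image, kernel):
    """Stride 1 sliding-window sum of products, without padding."""
    k = len(kernel)
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(k) for b in range(k))
             for j in range(len(image[0]) - k + 1)]
            for i in range(len(image) - k + 1)]


def transposed_conv2d(image, kernel, stride):
    k = len(kernel)
    flipped = [row[::-1] for row in kernel[::-1]]
    return correlate_valid(pad(interlace_zeros(image, stride), k - 1), flipped)


up = transposed_conv2d([[1, 2], [3, 4]], [[1, 1], [1, 1]], stride=2)
```

With a 2x2 kernel of ones and stride 2, every input pixel is copied into a 2x2 block, so the 2x2 input becomes a 4x4 output. A learned kernel would produce a weighted version of this.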
Dumoulin and Visin (2016) include more detailed animated visualisations of the layers
mentioned above at their github page (https://github.com/vdumoulin/conv_arithmetic).
3.4 U-Net
The U-Net (Ronneberger, Fischer, and Brox 2015) is designed to create image segmentations from images instead of classifying images into categories. Segmentation can be viewed as
a very high dimensional classification problem where each pixel (or voxel) is classified into one of several categories in general or as either kidney or not kidney in this case. The U-Net
architecture can be seen in Figure 6 where we see the resemblance to the U-shape.
The U-Net architecture consists of an encoder part on the left, a decoder part on the right and concatenating connections between the encoder and the decoder on each level. One level of the encoder is a convolutional block followed by a max pooling block, feeding into the next level below. The decoder starts at the bottom of the U-shape with an up-sampling block with padding, whose output is concatenated with the output of the encoder at the level above, represented by the purple edges connecting to (||) in Figure 6. These up-sampling blocks are repeated until the original input size is recovered. Finally a softmax layer produces an output with the desired number of classes.
Figure 6: The U-Net model starting with 2 input channels (water and fat) on the left, going down through the encoder with repeated max-pooling into 2D convolutions followed by a ReLU activation function until we reach the bottom of the encoder. The decoder part consists of repeated 2D convolutions followed by a ReLU into up-sampling layers into concatenations before finally exiting through a softmax layer, classifying kidney or not kidney on a pixel-per-pixel basis. Channel counts shown in the figure: input 2; encoder 64, 128, 256, 512, 1024; decoder 512, 256, 128, 64, each following a concatenation (||) with the encoder output of the same width; output 1. (TikZ code adapted from: https://github.com/HarisIqbal88/PlotNeuralNet)
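The channel counts in Figure 6 can be traced with a small bookkeeping sketch that records channels and spatial size at each level. It assumes 2x2 max pooling, up-sampling by a factor of two and padded convolutions that keep the spatial size, which is consistent with the figure but not spelled out in it. No actual layers are built.

```python
def unet_shapes(size=256, in_channels=2, base_channels=64, depth=4):
    """Trace (name, channels, size) through a U-Net-like encoder and decoder."""
    trace = [("input", in_channels, size)]
    skips = []
    channels = base_channels
    for level in range(depth):
        trace.append((f"encoder{level}", channels, size))
        skips.append((channels, size))  # kept for the concatenation
        size //= 2                       # assumed 2x2 max pooling
        channels *= 2
    trace.append(("bottom", channels, size))
    for level in reversed(range(depth)):
        size *= 2                        # up-sampling
        channels //= 2
        skip_channels, skip_size = skips[level]
        assert skip_size == size         # sizes must match to concatenate
        # (name, channels after concatenation, channels after the block, size)
        trace.append((f"decoder{level}", channels + skip_channels, channels, size))
    trace.append(("output", 1, size))
    return trace


trace = unet_shapes()
```

With a 256x256 input the bottom of the U has 1024 channels at 16x16 pixels, and the first decoder level concatenates 512 up-sampled channels with 512 encoder channels before reducing them to 512 again.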
4 Training a Convolutional Neural Network
The aim of the model is to create segmentations of the kidney by classifying each voxel as kidney or not kidney; this is the model output. The model input is the pair of water and fat axial slices described in Section 2.
4.1 Metrics
To evaluate the performance we are comparing the voxels classified as kidney in ground truth
data with the voxels classified as kidney or not kidney by the model. The comparison is done on a subject by subject basis to make sure we compare entire kidney volumes and not just slices.
The result of classified voxels can be summarised in a confusion matrix seen in Table 1.
            Prediction 0          Prediction 1
Label 0     True Negative (TN)    False Positive (FP)
Label 1     False Negative (FN)   True Positive (TP)
Table 1: The confusion matrix for a binary classification problem
A very common measure for cases with a large number of true negative outcomes is the Dice score, also known as the F1-score. The measure is the harmonic mean of precision and sensitivity (or recall) and is defined as
\[
\text{Dice} = \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}} = \frac{2\,TP}{2\,TP + FP + FN},
\]
where
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{sensitivity} = \frac{TP}{TP + FN}.
\]
Equivalently the Dice score can, with a slightly different interpretation, be defined as
\[
\text{Dice} = \frac{2\,|X \cap Y|}{|X| + |Y|}, \tag{2}
\]
where X and Y are the sets of voxels labelled as kidney in the ground truth and in the prediction. That is, the Dice score is two times the number of overlapping voxels divided by the total number of voxels in both ground truth and prediction. A perfect overlap has a Dice score of 1, while a Dice score of 0 would mean that none of the voxels in the predicted and ground truth segmentations overlap.
Figure 7: Example of predicted segmentation vs. ground truth segmentation in an axial slice. Yellow: Over segmented (FP). Red: Under segmented (FN). Blue: Correctly segmented (TP).
A visualisation of these regions can be seen in Figure 7 where we see the correctly classified pixels of this axial slice colored in blue, the pixels that were classified as kidney by the model
but not by the operator (False positive) in yellow and the pixels that were classified as kidney by the operator but not by the model (False negative) in red.
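The confusion matrix and the Dice score can be computed directly from two flattened label volumes, as in the sketch below. The toy label vectors are made up, and returning 1.0 when both segmentations are empty is a convention assumed here, not something stated in the text.

```python
def confusion_counts(truth, prediction):
    """Count TP, FP, FN and TN for two equally long sequences of 0/1 labels."""
    tp = sum(1 for t, p in zip(truth, prediction) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, prediction) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, prediction) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, prediction) if t == 0 and p == 0)
    return tp, fp, fn, tn


def dice_score(truth, prediction):
    tp, fp, fn, _ = confusion_counts(truth, prediction)
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 1.0  # assumed convention: two empty segmentations agree
    return 2 * tp / denominator


truth = [0, 1, 1, 1, 0, 0, 1, 0]
prediction = [0, 1, 1, 0, 1, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(truth, prediction)
score = dice_score(truth, prediction)

# Set form of equation (2): 2 |X ∩ Y| / (|X| + |Y|).
X = {i for i, t in enumerate(truth) if t == 1}
Y = {i for i, p in enumerate(prediction) if p == 1}
set_score = 2 * len(X & Y) / (len(X) + len(Y))
```

True negatives do not enter the score, which is why it remains informative when most voxels are background.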
4.2 Dropout
Dropout (Srivastava et al. 2014) originates from a master thesis by Nitish Srivastava and is a regularization technique for Neural Networks. For each node in the current layer, the node is removed with probability p by setting its output to 0. This is done to prevent the network from overfitting and is, according to Srivastava et al. (2014), similar to creating an ensemble of multiple networks, all in one.
Dropout is only active during training. It is disabled when the model is used for prediction, so that no connections are turned off and as much information as possible is retained.
Srivastava et al. (2014) describe an issue with dropout in their paper: when the model is switched from training to testing, the layers directly following a layer with dropout receive a higher total activation than during training. This increased activation would propagate throughout the rest of the network and affect the output. They propose to solve this by scaling all of the weights in the layer with dropout during testing by the probability of keeping a node, which is 1 − p in the notation used here, to reduce the output. In PyTorch this is instead implemented by multiplying the input by 1/(1 − p) during training for the nodes that are kept for that forward pass.
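The training-time rescaling can be sketched in plain Python as follows. This illustrates the scaling idea only and is not the PyTorch implementation; the function name and the test values are hypothetical.

```python
import random


def dropout(values, p, training, rng):
    """Zero each value with probability p and rescale the kept ones by 1/(1-p)."""
    if not training:
        return list(values)  # prediction mode: nothing is dropped or rescaled
    scale = 1.0 / (1.0 - p)
    return [v * scale if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 100000
train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)
train_mean = sum(train_out) / len(train_out)
```

Because the kept activations are scaled up during training, the expected total activation is the same in both modes, so nothing needs to be rescaled at prediction time.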
4.3 Data Augmentation
The subject volumes go through a series of preprocessing steps before they can be fed through the network. The first step is to slice the volume into axial slices to create 2D images, with great care taken to ensure that all slices from a single subject appear either in the training set or in the test set, never both. Slices from the same subject are correlated, so splitting them across both sets would bias the evaluation and inflate the perceived performance. Our implementation also uses zero-padding on the 224x174 input images to match the 256x256 input layer of the model. It is common in image analysis to augment input data with deformations to artificially increase the number of images for training. An image, transformed in a reasonable way, improves the performance of the algorithm by adding plausible variations of the same object without the need for more labelled data (Ronneberger, Fischer, and Brox 2015; Çiçek et al. 2016).
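The zero-padding step can be sketched as below. Centring the slice inside the 256x256 frame is an assumption, since the text only states that zero-padding is used, and the slice is taken here to be 174 rows by 224 columns.

```python
def pad_to(image, target_h, target_w, fill=0.0):
    """Place the image in the centre of a target_h x target_w frame of zeros."""
    h, w = len(image), len(image[0])
    top = (target_h - h) // 2
    left = (target_w - w) // 2
    out = [[fill] * target_w for _ in range(target_h)]
    for r in range(h):
        out[top + r][left:left + w] = image[r]
    return out


# A dummy slice of ones with 174 rows and 224 columns (assumed orientation).
slice_2d = [[1.0] * 224 for _ in range(174)]
padded = pad_to(slice_2d, 256, 256)
```

The same padding has to be applied to the water slice, the fat slice and the label so that the three stay aligned.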
4.3.1 Brightness threshold and intensity normalisation
MRI devices produce, after some processing, an intensity value for each voxel, with a higher value corresponding to a stronger signal for that voxel. A bright voxel in the water image corresponds to a large proportion of water molecules inside that voxel. One downside of this imaging technique is that the scanner will not always produce the exact same brightness value for a chosen voxel due to external factors, which in turn leads to arbitrary ranges of values between scans. We have performed brightness thresholding and intensity normalisation to mitigate this issue in the following way: the pixels brighter than 99% of all other pixels are thresholded to that 99% brightness level. The slices are then normalized such that the darkest pixel has value 0 and the brightest pixel has value 1. This is performed on both the training data and the validation set.
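A minimal sketch of the thresholding and normalisation is given below. The exact percentile rule (a simple rank-based index) and the handling of a constant slice are assumptions; the example slice is synthetic.

```python
def clip_and_normalise(image, percent=99):
    """Clip values above the given percentile, then rescale to the range [0, 1]."""
    flat = sorted(v for row in image for v in row)
    # Rank-based percentile using integer arithmetic (assumed rule).
    cap = flat[min(len(flat) - 1, (percent * len(flat)) // 100)]
    low = flat[0]
    span = cap - low
    if span == 0:
        return [[0.0 for _ in row] for row in image]  # constant slice
    return [[(min(v, cap) - low) / span for v in row] for row in image]


# Synthetic 10x20 slice with values 0..199 and one extreme outlier.
image = [[r * 20 + c for c in range(20)] for r in range(10)]
image[9][19] = 10000
normalised = clip_and_normalise(image)
```

Without the clipping step the single outlier would compress all other values into a narrow band close to 0 after rescaling.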
4.3.2 Elastic deformation
Elastic deformation (Çiçek et al. 2016; Ronneberger, Fischer, and Brox 2015) has two hyperparameters: the first is the number of points p and the second is the intensity of the deformation σ. The procedure is done by first creating a grid of p points, each with a displacement vector sampled from
\[
\begin{pmatrix} x \\ y \end{pmatrix} \sim N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma & 0 \\ 0 & \sigma \end{pmatrix}\right).
\]
This grid is then interpolated to create a displacement grid for the input image, which is again interpolated after the deformation. Figure 8a shows an unaltered image of a cat. Figure 8b shows a deformation with reasonably chosen hyperparameters (p = 8, σ = 2). Figure 8c shows a deformation that is too strong and would do more harm than good if it were used as a training image. It is worth noting that the choice of hyperparameters and their effectiveness heavily depends on the resolution of the image that the deformation is being applied to.
The same deformation grid is applied to the water signal, fat signal and the kidney segmentation when performing the augmentations to ensure that both the input and output are altered in the same way.
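The procedure can be sketched in plain Python under a few assumptions: the grid is taken to be p x p points, σ is read as the variance of each displacement component, bilinear interpolation is used for both the displacement field and the image, and the label is resampled with nearest neighbour so that it stays binary. None of these details are confirmed by the text.

```python
import math
import random


def displacement_field(height, width, p, sigma, rng):
    """Coarse p x p grid of random (dy, dx), bilinearly upsampled (p >= 2)."""
    std = math.sqrt(sigma)  # sigma read as the variance of each component
    coarse = [[(rng.gauss(0.0, std), rng.gauss(0.0, std)) for _ in range(p)]
              for _ in range(p)]
    field = []
    for r in range(height):
        gy = r * (p - 1) / max(height - 1, 1)
        y0 = min(int(gy), p - 2)
        ty = gy - y0
        row = []
        for c in range(width):
            gx = c * (p - 1) / max(width - 1, 1)
            x0 = min(int(gx), p - 2)
            tx = gx - x0
            d = []
            for k in range(2):
                top = (1 - tx) * coarse[y0][x0][k] + tx * coarse[y0][x0 + 1][k]
                bot = (1 - tx) * coarse[y0 + 1][x0][k] + tx * coarse[y0 + 1][x0 + 1][k]
                d.append((1 - ty) * top + ty * bot)
            row.append((d[0], d[1]))
        field.append(row)
    return field


def warp(image, field, nearest=False):
    """Sample the image at the displaced positions (clamped to the border)."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            dy, dx = field[r][c]
            y = min(max(r + dy, 0.0), h - 1.0)
            x = min(max(c + dx, 0.0), w - 1.0)
            if nearest:
                row.append(image[int(round(y))][int(round(x))])
                continue
            y0, x0 = min(int(y), h - 2), min(int(x), w - 2)
            ty, tx = y - y0, x - x0
            top = (1 - tx) * image[y0][x0] + tx * image[y0][x0 + 1]
            bot = (1 - tx) * image[y0 + 1][x0] + tx * image[y0 + 1][x0 + 1]
            row.append((1 - ty) * top + ty * bot)
        out.append(row)
    return out


rng = random.Random(0)
h, w = 16, 16
water = [[float((r + c) % 5) for c in range(w)] for r in range(h)]
fat = [[float((r * c) % 3) for c in range(w)] for r in range(h)]
label = [[1 if 4 <= r < 12 and 5 <= c < 11 else 0 for c in range(w)] for r in range(h)]

# One field, applied to both inputs and to the label.
field = displacement_field(h, w, p=8, sigma=2.0, rng=rng)
water_d, fat_d = warp(water, field), warp(fat, field)
label_d = warp(label, field, nearest=True)
identity = displacement_field(h, w, p=8, sigma=0.0, rng=rng)
```

Setting σ to 0 gives a zero displacement field, and warping with it returns the original image unchanged, which is a convenient sanity check.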
4.3.3 Translation transformation
A translation transformation shifts the position of the image along a random vector in the XY-plane. This is done to simulate subjects shifted up, down, left or right on the examination table, which would influence the position of the kidney in the image. Translation transformation is a special case of elastic deformation when using only a single grid point for the displacement.
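A translation by a whole number of pixels can be sketched as follows. Filling the vacated area with zeros and restricting the shift to integer offsets are assumptions made for the illustration.

```python
def translate(image, dy, dx, fill=0):
    """Shift the image dy rows down and dx columns right, filling with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            rr, cc = r + dy, c + dx
            if 0 <= rr < h and 0 <= cc < w:
                out[rr][cc] = image[r][c]
    return out


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
shifted = translate(image, dy=1, dx=-1)  # one row down, one column left
```

As with elastic deformation, the same offset has to be applied to the water slice, the fat slice and the label.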
4.4 Transfer learning
The purpose of transfer learning (Pan and Yang 2010; Huh, Agrawal, and Efros 2016) is to combat problems with a small number of observations and to reduce long training times for Neural Networks. A small sample size creates a risk that the model will not generalize well because it has not seen enough examples. Initializing the parameters with pretrained weights gives the model a head start by not having to relearn what edges and gradients look like. Available data is then used to fine-tune the model. The weights are initialized from a convolutional Neural Network with the VGG16 (Simonyan and Zisserman 2014) architecture trained on the ImageNet (Deng et al. 2009) dataset. The early layers of the VGG16 model match the encoder part of the U-Net fairly closely. These are the weights that are transferred. The rest of the layers (including the entire decoder) in the U-Net are initialized with random weights.
Figure 8: Varying levels of elastic deformation applied to an image of a cat. (a) A picture of a cat in a sink. (b) A reasonable deformation. (c) Too strong deformation.
5 Results
5.1 Inter/Intra-operator variability
The ideal case for segmentations is to map the MR image to the real body of the subject and
know the true tissue type at each voxel. This type of objective ground truth is not possible and we have to rely on trained experts to manually identify different tissues and organs in
order to segment the images. Non-standardized instructions, experience or personal style of the operators are possible sources of variation in the resulting segmentations.
A single expert created all segmentations for this study, making the segmentations more consistent with each other. Five out of the 64 labelled volumes were segmented by the expert twice to be able to estimate the intra-operator dice score, which is a measure of the consistency of the operator. The intra-operator dice score can be seen as a maximum achievable dice score for our model since it is unreasonable to think that a model could create segmentations that are more consistent with the trained expert than the expert themselves. The intra-operator dice score for the five subjects is seen in Figure 9a and the mean dice score is 96.17%.
Figure 9: Dice scores for volumes that were segmented first by a trained expert and then again by the trained expert at a second time or by an inexperienced operator. (a) Intra-operator variability of dice score for 5 subjects. (b) Inter-operator variability of dice score for 8 volumes between a trained expert and an inexperienced operator.
Some of the MRI-volumes that were segmented by the trained expert were also segmented by a second inexperienced operator to be able to evaluate the performance of the network,
comparing the dice score from the network output to the dice score of both intra- and inter-operator. In Figure 9b we see the inter-operator dice score between the trained expert and the
inexperienced operator. We can see that the mean Dice score for the inexperienced operator is 87.22%.
5.2 U-Net performance
Table 2 shows the 7-fold cross validated dice score after the final iteration in the training process for U-Net trained with or without augmentations on data as well as the difference between having access to labels for 32 or 64 subjects. Looking at the table we see that the highest dice score is 0.9530 when using the full set of 64 subjects with augmentations and using pretrained weights.
Figure 10: 7-fold cross validated dice score for 32 and 64 subjects when using data augmentation and/or pretrained weights. (a) 32 subjects. (b) 64 subjects.
Dice score    32 subjects    64 subjects
Original      0.9372         0.9458
Original*     0.9403         0.9415
Augmented     0.9408         0.9511
Augmented*    0.9463         0.9530
Table 2: Dice score of the network trained on augmented and non-augmented data of 32 or 64 volumes with or without pretrained weights. *Using pretrained weights
5000 training iterations with the first batch of 32 subjects as training data. The curves for the models that utilize augmentation lie above the curves of the models that do not. There
is not a large difference between models initialized with pretrained weights versus random initialization. Similar patterns are seen in Figure 10b which is the graph for models trained on
both of the available batches of subjects for a total of 64 subjects. Compared to Figure 10a we see that all of the curves are higher for models trained on the full data set. Augmentations
still seem to perform better than leaving data unaltered. The graphs maintain a slight upward slope even for the last part of training, indicating that the model might benefit from even more
training time.
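The 7-fold cross validation reported in Table 2, combined with the requirement from Section 4.3 that all slices of a subject stay on one side of the split, can be sketched as a split over subject identifiers rather than over slices. How the folds were actually formed is not stated, so the shuffling and the fold assignment below are assumptions.

```python
import random


def subject_folds(subject_ids, k=7, seed=0):
    """Shuffle the subjects and deal them into k folds of near-equal size."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def train_val_split(folds, val_index):
    """Use one fold for validation and all remaining folds for training."""
    val = set(folds[val_index])
    train = {s for i, fold in enumerate(folds) if i != val_index for s in fold}
    return train, val


folds = subject_folds(range(64), k=7)
splits = [train_val_split(folds, i) for i in range(len(folds))]
```

With 64 subjects and 7 folds, one fold holds 10 subjects and the others hold 9. Slices are assigned to the training or validation set only through the subject they belong to.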
Looking at Figures 11a, 11b and 11c we see some examples of segmentations in different situations with small errors. Figure 11d is a rare occurrence where the error is large: one of the kidneys is almost completely ignored.
Figure 12 shows two examples of what could be cysts on the kidney; the network correctly does not classify these areas as kidneys. Larger cysts, which are rarer, could potentially cause
(a) A typical segmentation result. (b) Segmentation only containing one of the kidneys. (c) Segmentation with a large liver overlapping the kidney. (d) Segmentation where the model almost completely ignores one of the kidneys (incorrectly).
Yellow: Over segmented. Red: Under segmented. Blue: Correctly segmented.
Figure 12: Predicted segmentations in a validation set. Possible cysts denoted with red arrows. (a) Segmentation with possible cyst. (b) Segmentation with possible cyst. Yellow: Over segmented. Red: Under segmented. Blue: Correctly segmented.
6 Conclusions
Deep learning shows promising results in many fields, and medical imaging does not appear to be an exception. We have shown near human expert level kidney segmentations with the U-Net Neural Network architecture. In addition, the model greatly outperforms an inexperienced operator and could in general benefit medical image segmentation by providing consistent and accurate segmentations. The model does still have some weaknesses where it does not perform as expected; more carefully acquired training data could help cover these edge cases.
Using pretrained weights might be beneficial in some cases, and having the weights pretrained on medical data instead of the ImageNet data set might show larger improvements. Augmentation seems to improve performance at smaller sample sizes but also remains relevant as the sample size increases. The graphs for the validation dice still have some upward slope even after training for 250 000 iterations, indicating that the model could achieve an even higher dice score if the training process was allowed to run for longer.
In general, the most important part of creating high performing data driven models such as Neural Networks seems to be more data, and especially more data of high quality.
References
[1] Johan Berglund. Separation of water and fat signal in magnetic resonance imaging advances in methods based on chemical shift. Acta Universitatis Upsaliensis, 2011. ISBN: 9789155481544.
[2] Patrick Ferdinand Christ et al. “Automatic Liver and Tumor Segmentation of CT and MRI Volumes using Cascaded Fully Convolutional Neural Networks”. In: (2017). arXiv: 1702.05970.
[3] Özgün Çiçek et al. “3D U-net: Learning dense volumetric segmentation from sparse annotation”. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 9901 LNCS. 2016, pp. 424–432. ISBN: 9783319467221. DOI: 10.1007/978-3-319-46723-8_49.
[4] Jia Deng et al. “ImageNet: A large-scale hierarchical image database”. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255. ISBN: 978-1-4244-3992-8. DOI: 10.1109/CVPR.2009.5206848.
[5] Vincent Dumoulin and Francesco Visin. “A guide to convolution arithmetic for deep learning”. In: (2016). arXiv: 1603.07285v2.
[6] Kunihiko Fukushima. “Neocognitron: A hierarchical neural network capable of visual pattern recognition”. In: Neural Networks 1.2 (1988), pp. 119–130. ISSN: 08936080. DOI: 10.1016/0893-6080(88)90014-7.
[7] H. Holloway et al. “Sonographic determination of renal volumes in normal neonates”. In: Pediatric Radiology 13.4 (1983), pp. 212–214. ISSN: 0301-0449. DOI: 10.1007/BF00973158.
[8] Minyoung Huh, Pulkit Agrawal, and Alexei A. Efros. “What makes ImageNet good for transfer learning?” In: (2016). arXiv: 1608.08614.
[9] Juan Eugenio Iglesias and Mert R. Sabuncu. “Multi-atlas segmentation of biomedical images: A survey”. In: Medical Image Analysis 24.1 (2015), pp. 205–219. ISSN: 13618423. DOI: 10.1016/j.media.2015.06.012. arXiv: 1412.3421.
[10] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In:
[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. 2012.
[12] Taro Langner et al. “Fully convolutional networks for automated segmentation of abdominal adipose tissue depots in multicenter water-fat MRI”. In: Magnetic Resonance in Medicine 81.4 (2019), pp. 2736–2745. ISSN: 15222594. DOI: 10.1002/mrm.27550.
[13] Yann LeCun et al. “Gradient-based learning applied to document recognition”. In:
Pro-ceedings of the IEEE 86.11 (1998), pp. 2278–2323. ISSN: 00189219. DOI: 10.1109/ 5.726791.
[14] Sinno Jialin Pan and Qiang Yang. “A Survey on Transfer Learning”. In: IEEE
Transac-tions on Knowledge and Data Engineering22.10 (2010), pp. 1345–1359.
[15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation”. In: Lecture Notes in Computer Science (including
subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9351 (2015), pp. 234–241. ISSN: 16113349. DOI:
10.1007/978-3-319-24574-4_28. arXiv: 1505.04597.
[16] Akinobu Shimizu et al. “Segmentation of multiple organs in non-contrast 3D abdominal CT images”. In: International Journal of Computer Assisted Radiology and Surgery
2.3-4 (2007), pp. 135–12.3-42. ISSN: 18616429.DOI: 10.1007/s11548-007-0135-z.
[17] Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: (2014). arXiv: 1409.1556.
[18] Nitish Srivastava et al. Dropout: A Simple Way to Prevent Neural Networks from
Over-fitting. Tech. rep. 2014, pp. 1929–1958.
[19] Cathie Sudlow et al. “UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age”. In: PLoS Med 12.3
(2015), p. 1001779.DOI: 10.1371/journal.pmed.1001779.
[20] Janne West et al. “Feasibility of MR-based body composition analysis In large scale population studies”. In: PLoS ONE 11.9 (2016). ISSN: 19326203. DOI: 10 . 1371 /
[21] Robin Wolz et al. “Automated abdominal multi-organ segmentation with subject-specific atlas generation”. In: IEEE Transactions on Medical Imaging 32.9 (2013), pp. 1723–
1730. ISSN: 02780062.DOI: 10.1109/TMI.2013.2265805.
[22] Alan S.L. Yu et al. “Baseline total kidney volume and the rate of kidney growth are associated with chronic kidney disease progression in Autosomal Dominant Polycystic
Kidney Disease”. In: Kidney International 93.3 (2018), pp. 691–699.ISSN: 15231755.
[23] Matthew D. Zeiler and Rob Fergus. “Visualizing and understanding convolutional net-works”. In: Lecture Notes in Computer Science (including subseries Lecture Notes in
Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 8689 LNCS. PART 1. 2014, pp. 818–833. ISBN: 9783319105895. DOI: 10.1007/978- 3- 319-