DEGREE PROJECT IN MEDICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020
An Analysis of Context Channel Integration Strategies for Deep Learning-Based Medical Image Segmentation
JOAKIM STOOR
KTH ROYAL INSTITUTE OF TECHNOLOGY
Master in Medical Engineering
Date: September 16, 2020
Supervisor: Chunliang Wang
Reviewer: Örjan Smedby
Examiner: Matilda Larsson
School of Engineering Sciences in Chemistry, Biotechnology and Health
Host company: Novamia AB
Swedish title: Strategier för kontextkanalintegrering inom
djupinlärningsbaserad medicinsk bildsegmentering
Abstract
This master's thesis investigates different approaches for integrating prior information into a neural network for segmentation of medical images. In the study, liver and liver tumor segmentation is performed in a cascading fashion. Context channels in the form of previous segmentations are integrated into a segmentation network at multiple positions and network depths using different integration strategies. Comparisons are made with the traditional integration approach where an input image is concatenated with context channels at a network's input layer. The aim is to analyze if context information is lost in the upper network layers when the traditional approach is used, and if better results can be achieved if prior information is propagated to deeper layers. The intention is to support further improvements in interactive image segmentation where extra input channels are common.
The results, however, are inconclusive. The quantitative results do not differentiate the methods from each other, and all methods show the ability to generalize to an unseen object class after training. There are no indications that the traditional concatenation approach underperforms compared to the other evaluated methods, and it cannot be concluded that meaningful context information is lost in the deeper network layers.
Sammanfattning
This master's thesis investigates different methods for integrating prior information into a neural network for segmentation of medical images. In the study, cascaded liver and liver tumor segmentation is performed, where previously obtained segmentations are used as context channels that are integrated into a segmentation network at several positions and network depths using different integration strategies. Comparisons are made against the traditional integration method, in which an image and given context channels are concatenated at a network's input layer. The purpose is to analyze whether networks that use the traditional method lose awareness of the given context information in the upper network layers, and whether better results can be achieved if prior information is propagated to deeper layers. The intention is to support further development in interactive image segmentation, where extra input channels are common. The results obtained in the study are, however, ambiguous. Based on the quantitative results it is not possible to distinguish the different methods from one another, and after training all methods show the ability to generalize to an unseen object class. There are no indications that the traditional method based on channel concatenation performs worse than the other methods, and from the results it cannot be concluded that meaningful context information is lost in the deeper layers.
Abbreviations
CNN  Convolutional Neural Network
CT   Computed Tomography
DNN  Deep Neural Network
FCN  Fully Convolutional Network
MRI  Magnetic Resonance Imaging
OOI  Object of Interest
ROI  Region of Interest
SE   Squeeze-and-Excitation
Contents
1. Introduction
2. Method
2.1. Data Processing and Augmentation
2.2. Network Architecture
2.3. Incorporation of Prior Information
2.3.1. Valve Filters and Relevance Maps
2.3.2. Symmetric Encoder Path for Relevance Map Propagation
2.4. Training Settings
2.5. Performance Evaluation
3. Results
3.1. Liver Segmentation
3.2. Liver Tumor Segmentation
3.3. Generalization to an Unseen Object Class
3.4. Visualization of Feature Maps
4. Discussion
4.1. Impact Analysis for Context Channel Integration
4.2. Initialization Strategies
4.3. Generalization Ability
4.4. Relevance for Interactive Segmentation
4.5. Future Work
5. Conclusion
Appendix A. State of the Art
Appendix B. Additional Visualizations
1. Introduction
Segmentation of anatomical structures to delineate tissue abnormalities and organs of interest is highly coveted in medicine. Clinically relevant use cases include segmentation of malignant tissue for cancer diagnosis, treatment planning, tracking and treatment response [1, 2]. Yet, diagnostic imaging procedures and manual segmentations are expensive and time consuming. Dense annotations require expert knowledge from trained physicians and can take several hours, so for large-scale clinical applications, manual approaches are not feasible. Fortunately, since the deep learning revolution began in 2012 [3], great strides have been taken in fully automated image segmentation where no manual input is given during the segmentation process.
Current state-of-the-art approaches in deep learning for fully automated image segmentation are based on convolutional neural networks (CNNs) that learn convolutional filter parameters to capture local relationships [4, 5, 6, 7]. These networks consist of several layers of such filters that successively extract information of increasing complexity from an input image. For segmentation tasks, the networks are often fully convolutional encoder-decoder architectures that not only encode high-level information from an input image, but also decode that information to create a dense segmentation map as an output. For these approaches, and for deep learning in general, large datasets are needed for generalization to unseen examples, and while this is generally not a concern in computer vision where images are relatively easy to come by, this is not so in medical image analysis.
Aside from the problem of acquiring ground truth annotations for medical images, patient safety regulations make it difficult to publicly distribute medical data without infringing on patient privacy laws [8]. For this reason medical datasets are generally smaller than publicly available databases of natural images, and this can ultimately hamper network performance [9].
Beyond the number of publicly available images, medical image acquisition through image modalities such as computed tomography (CT) and magnetic resonance imaging (MRI) imposes difficulties on the segmentation process due to different acquisition protocols, contrast agents, enhancements and scanner resolutions [1]. Furthermore, image complexity is exacerbated by noise, artefacts, pathologies and biological variations [10]. Considering the sparseness and complexity of medical data, medical image segmentation could benefit from systems that can leverage a medical expert's high-level anatomical understanding to find a compromise between user interaction and automation. Consequently, interactive and semi-automatic approaches that incorporate user input into automatic systems are of interest.
Interactive segmentation methods allow for iterative improvements through feedback signals based on manually corrected segmentation regions. Semi-automatic methods are similar, but they rely solely on prior information to initialize the segmentation process, and they are not iterative [11]. Common input types for integrating user interaction with deep neural networks (DNNs) include different encodings of input clicks, bounding boxes, and scribbles [12, 13, 14]. However, regardless of the nature of the input, all such approaches face the difficulty of embedding user input into the networks. Input signals are often used as prior information and encoded as extra input channels under the assumption that the information can be remembered during high-level feature extraction at the deeper network layers. Yet, recent results suggest that valuable information might get lost in the shallow layers [15], and in that case the information would not contribute to important parts of the segmentation process. Improved performance could therefore be achieved by solutions that better attend to the information in the manual input. On that account, it is possible that the latest progress in attention networks could be exploited for this purpose.
The attention mechanism in deep learning makes it possible to attend to image parts with meaningful information while disregarding parts with unimportant information [16, 17].
Most research within this area has focused on learnable attention where no interactive guidance is given to steer the network towards user-specified image regions. One example of such an approach is squeeze-and-excitation (SE) blocks for spatial and channel-wise attention [17, 18]. Solutions based on manual attention are more limited, but one strategy that approaches such a solution was developed by Eppel [15] to segment liquids in a vessel by using relevance maps derived from previous vessel segmentations. These relevance maps regulate the activations of image features within a network, and the basic building block can be applied at arbitrary network depths [19]. The solution is very similar to the traditional approach based on extra input channels, but it differs in the way it is conceptualized. Because it is implemented as a parallel computational unit, it is easier to tweak it to extract more information from the context channels by feeding that information forward through the network.
As it relates to manual attention and the integration of prior information, a thorough investigation of how manual input can best be embedded into DNNs for improved semantic segmentation is lacking. The contribution of this project is, consequently, to investigate how information from extra input channels can be used for this purpose. The results from the study are intended to support developments in deep learning-based interactive segmentation where extra input channels are frequently used. In the project, context channels in the form of previous segmentations are integrated into a segmentation network at multiple positions and network depths using different integration strategies. Comparisons are made with the traditional integration approach where an input image is concatenated with context channels at a network's input layer. The aim of the study is to analyze if context information is lost in the upper network layers when the traditional approach is used, and if better results can be achieved if prior information is propagated to deeper layers.
2. Method
The data used in this study comes from the Liver Tumor Segmentation Challenge (LiTS) [1]. The LiTS dataset consists of 201 contrast-enhanced abdominal CT volumes.
Each volume has a liver and liver tumor ground truth segmentation and the whole dataset is divided into a training set of 131 volumes and a test set of 70 undisclosed volumes.
In the study, all experiments were performed on the training images and evaluation was performed using K-fold cross validation.
Liver and liver tumor segmentation was performed separately in a cascading fashion. The liver was segmented in two succeeding orientations. It was first segmented in the axial view to extract an initial context channel which was then used together with the input image for segmentation in the sagittal view, thus, an auto-context scheme [20] was applied to improve on the first segmentation result. Liver tumors were only segmented in the axial view, but the segmentation was cascaded using two succeeding segmentation steps with different context channels. Liver segmentations were used as context channels in the first cascade, and the extracted tumor segmentations were used in the second one. For both liver and liver tumor segmentation, different approaches were used to integrate a given context channel into the segmentation process. These approaches were the main focus of the project, and they will be explained in due course.
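The cascading scheme above can be sketched as plain data flow, with a stub standing in for the trained networks. The `segment` function and the thresholding inside it are placeholders for illustration only, not part of the thesis implementation:

```python
# Hypothetical sketch of the cascaded segmentation scheme.
# `segment` stands in for a trained U-net forward pass; here it is a stub
# that thresholds pixel values so the data flow can be shown end to end.

def segment(image, context=None):
    """Stub network: takes an image and an optional context channel."""
    return [px > 0 for px in image]  # placeholder, not a real model

def cascade_liver(image):
    # Step 1: axial liver segmentation without context.
    axial_mask = segment(image)
    # Step 2: sagittal segmentation refined with the axial mask as
    # context channel (the auto-context scheme).
    return segment(image, context=axial_mask)

def cascade_tumor(image, liver_mask):
    # Cascade 1: tumor segmentation with the liver mask as context.
    tumor_mask = segment(image, context=liver_mask)
    # Cascade 2: refined with the first tumor mask as context.
    return segment(image, context=tumor_mask)
```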
Note: 13 of the 131 CT volumes in the training set do not contain any tumors. To avoid skewing the statistics during evaluation of segmentation performance, these volumes were excluded when tumor segmentation was performed.
2.1. Data Processing and Augmentation
The following preprocessing steps were applied to all volumes: image reorientation, image resizing and contrast normalization. After the first liver segmentation in the axial view, the binary outputs were used to crop the image volumes to only include slices with liver tissue; thereafter, only slices from the cropped volumes were used for training.
The orientation of the images within the dataset varied, thus, to simplify the learning problem so that all input images had the same direction and alignment, the images were reoriented to a uniform orientation (RAI) [21].
Image normalization was performed using the formula:

I' = (I − µ) / σ    (1)
where µ is the mean intensity and σ is the standard deviation. For the initial liver segmentation, image intensity values were first clamped to the range [−1024, 1024] and the images were then normalized with µ set to 0 HU and σ to 500 HU. Once an initial liver segmentation had been extracted, µ and σ were calculated using image voxels within the previously attained liver segmentation, and normalized image intensity values were then clamped to the range [−5, 5] to retain values up to 5σ away from the mean intensity value of the liver.
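The two-stage normalization described above can be sketched with NumPy. The clamping ranges and statistics are taken from the text; the function names are illustrative:

```python
import numpy as np

def normalize_initial(volume):
    """First pass: clamp HU values, then normalize with fixed mu/sigma."""
    v = np.clip(volume, -1024, 1024)
    return (v - 0.0) / 500.0  # mu = 0 HU, sigma = 500 HU

def normalize_with_liver(volume, liver_mask):
    """Second pass: statistics computed from voxels inside the liver mask."""
    mu = volume[liver_mask].mean()
    sigma = volume[liver_mask].std()
    v = (volume - mu) / sigma
    # Keep values within 5 sigma of the liver mean intensity.
    return np.clip(v, -5.0, 5.0)
```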
For all experiments, the input volumes were resliced to 2D slices in the specified orientation. The liver was segmented isotropically at 1.5 mm resolution and the images were downscaled with a factor of 2 to a resolution of 256 × 256. Liver tumors were segmented anisotropically at an image resolution of 512 × 512 to not disregard tiny liver tumors which would be hard to delineate otherwise. The tumors were only segmented in the axial view, so no isotropic resampling was necessary.
Affine transformations were applied after all the preprocessing steps by randomly sampling parameters for rotation (±10°), horizontal and vertical translation (±10 pixels), and scaling (±10 %).
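The parameter sampling could look as follows in NumPy. Sampling uniformly over the stated ranges is an assumption, as only the ranges are given:

```python
import numpy as np

def sample_affine_params(rng):
    """Sample augmentation parameters within the stated ranges
    (uniform distribution assumed)."""
    return {
        "rotation_deg": rng.uniform(-10, 10),
        "translate_px": rng.uniform(-10, 10, size=2),  # horizontal, vertical
        "scale": 1.0 + rng.uniform(-0.10, 0.10),
    }
```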
For inference, the only post-processing step that was always performed was resampling of predicted segmentation masks back to their original image resolutions. Predicted segmentation masks obtained from cropped volumes were also transformed back into the uncropped volume extents.
2.2. Network Architecture
All the evaluated networks that were selected for this study were based on modifications of the original U-net architecture [5]. U-net is an encoder-decoder architecture with convolutional blocks that consist of two succeeding layers of convolutional filters that perform 3×3 convolution. Each convolutional layer is followed by ReLU-activation, and the number of filters in each layer is a multiple of base B (see figure 2.1).
Figure 2.1: U-net architecture. The arrows denote the different operations. Light grey boxes denote multi-channel feature maps, and the number of channels, shown above each map, is a function of base B. Dark grey boxes represent copied feature maps. The dense prediction map has C output channels (set to 2 for all the experiments in the study).
In the encoding path of U-net, each block is followed by 2×2 max-pooling to decrease the spatial resolution of incoming feature maps. In the decoding path, transposed convolution is instead used to upsample feature maps to increase their spatial resolution. Skip connections are used between blocks of the encoder and decoder to retain more information about the spatial location of structures within the input, and this is achieved by concatenating feature maps from the encoder and decoder. Batch normalization layers [22] are optionally inserted between each convolutional layer and its succeeding activation function, and spatial dropout layers [23] are optionally used after every max-pooling and concatenation operation. Batch normalization is used to improve network performance, speed and stability, and spatial dropout is used to prevent overfitting. The depth of U-net is defined by the number of resolution levels (set to 5 in figure 2.1), and the deepest network layers are the so-called bottleneck layers. In this study, unless stated otherwise, all the network filter parameters were initialized by Xavier initialization [24] and the biases were initially set to zero.
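Assuming the standard U-net convention that channel counts double at each resolution level (the exact counts are defined in figure 2.1), the number of filters per encoder level as a function of base B can be computed as:

```python
def unet_channels(base, depth):
    """Channels per encoder level, assuming the common U-net scheme where
    the count is a multiple of base B that doubles at each level."""
    return [base * 2 ** level for level in range(depth)]
```

For the settings used in this study (base 32, depth 5), this gives 32 channels at the topmost level and 512 at the bottleneck.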
2.3. Incorporation of Prior Information
The traditional approach for integrating prior information into CNNs is simple: context channels are concatenated with the image, and the result is fed as the input to the neural network. The only architectural difference between this approach and approaches that use no context channels is the introduction of additional kernels in the first convolutional layer of the network; these kernels operate on the introduced context channels. In this project, the concatenation approach was viewed as the baseline, and it was compared to two alternative integration strategies based on valve filters.
2.3.1. Valve Filters and Relevance Maps
Given a region of interest (ROI) within an image that is known beforehand, an ROI map can be incorporated into a network via relevance maps that selectively highlight specific parts of an extracted image feature map [15], see figure 2.2. A relevance map is extracted by valve filters that operate on an incoming ROI map, or context channel. From a computational perspective, image and valve filters are alike, but they are part of separate convolutional layers that operate on either an image or a context channel. After both layers have been applied, each channel within the feature map has a corresponding channel within the relevance map, and the feature map and the relevance map are multiplied to produce a normalized feature map. ReLU is applied on the normalized feature map and the produced result is passed on through the network. Valve layers can replace normal convolutional layers to highlight specific image parts, and they can be applied at arbitrary network depths by downsampling the context channel.
Figure 2.2: Valve layer that operates on an incoming image (top-left) and context channel
(bottom-left). This layer is identical to Eppel’s approach in [15].
In contrast to the normal network weights, valve filter weights are zero-initialized and their bias terms are set to one. In this way, the effect of the filters is zero at the outset and then increases gradually during training. This initialization strategy differs from the one used by the traditional integration approach based on context channel and image concatenation, where all the network weights are initialized in the same way, including the weights that operate on the context channels.
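The effect of this initialization can be illustrated with NumPy. A 1×1 valve filter stands in for the spatial convolution (an assumption made for brevity): with zero weights and a bias of one, the relevance map starts as all ones, so the normalized feature map initially equals the plain feature map:

```python
import numpy as np

def valve_relevance(context, weights, bias):
    """Relevance map from a 1x1 valve filter (the real filters are spatial
    convolutions; a scalar weight is used here for illustration)."""
    return context * weights + bias

# Zero-initialized weights with bias one: the relevance map is 1 everywhere,
# so feature * relevance == feature and the valve has no effect at the outset.
context = np.array([[0.0, 1.0], [1.0, 0.0]])
features = np.array([[0.5, -2.0], [3.0, 1.0]])
relevance = valve_relevance(context, weights=0.0, bias=1.0)
normalized = np.maximum(features * relevance, 0.0)  # multiply, then ReLU
```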
Valve Layer Integration Strategies
In this project, valve layers were integrated into U-net in four different ways: at the topmost layer of the encoder, within the topmost skip connection, in the anterior layer of every block in the encoder, or within all the skip connections. In figure 2.3, context channels are integrated at all network depths, and at each block they are first downsampled to the required spatial resolution. For context channel integration at the topmost layer or within the topmost skip connection, the lower blocks in figure 2.3 are the same as in figure 2.1, thus no context channel downsampling is required.
Figure 2.3: Network configurations that support context channel integration via valve layers. In the path to the left, the anterior layer of each encoder block is a valve layer. In the other configuration, valve layers are integrated into the skip connections. The decoder path of either configuration is the same as the one in figure 2.1.
2.3.2. Symmetric Encoder Path for Relevance Map Propagation
The previously described integration methods utilise the provided context channels as they are, without extracting more high-level information from them in order to attend to specific parts of the context regions. A new integration strategy motivated by the valve approach was therefore developed to study how context information is incorporated and propagated throughout a network if more high-level extraction of relevance map information is made possible. The approach makes use of an additional encoder path that operates on relevance maps extracted from context channels.
With few exceptions, the additional path is symmetric to the original encoder path. At each network level, parallel convolutional layers operate on an image feature map or a relevance map. A departure from the valve approach is that a single relevance channel in a relevance map is shared between several feature channels within an image feature map. This is motivated by the observation that relevance channels in a relevance map seem to cluster into groups with similar back- and foreground highlighting, but also by the need to reduce the computational and spatial complexity, which increases with additional convolutional layers. The sharing factor is a hyperparameter that is constant throughout the network depths; it was set to 4 for all experiments (empirically chosen). Consequently, for network layers with 32 image feature channels, 8 relevance channels exist. As before, a normalized image feature map is created by multiplying a relevance channel with its corresponding image feature channels. The normalized image feature map and its related relevance map are then in parallel gated by ReLU-activation before they are fed to the succeeding layer to repeat the process. For a visual representation, see figure 2.4.
Figure 2.4: Network layer that supports context channel integration and propagation via valve layers. The figure illustrates an initial network layer with an incoming input image and context channel. The number of extracted feature channels is denoted by N and the number of relevance channels by N/M (M is the sharing factor). In the illustration, N is set to 8 and M to 2, i.e. 4 relevance channels exist. After ReLU activation, the two results are fed to the next layer for further processing.
For network integration, the layer in figure 2.4 replaces all the normal convolutional layers in the encoder path of figure 2.1 (except for the bottleneck layers). Max-pooling is applied to both the relevance and image feature maps, but only the image feature maps are copied and transferred to the decoder. Optional spatial dropout layers operate exclusively on image feature maps, while optional batch normalization is applied to both map types within the new layer, before ReLU-activation.
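Assuming that each relevance channel gates M consecutive feature channels (the exact grouping is not specified in the text), the channel sharing can be sketched with NumPy:

```python
import numpy as np

def apply_shared_relevance(features, relevance, M):
    """Gate an (N, H, W) feature map with an (N/M, H, W) relevance map,
    where each relevance channel is shared by M consecutive feature
    channels (sharing factor M; the grouping is an assumption)."""
    expanded = np.repeat(relevance, M, axis=0)  # (N/M, H, W) -> (N, H, W)
    return features * expanded
```

With a base of 32 and M = 4, a layer with 32 feature channels is gated by 8 relevance channels, as stated above.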
In contrast to the valve layers in figure 2.2, those in figure 2.4 are weight-initialized by Xavier initialization. The different initialization scheme is motivated by the fact that relevance maps are fed forward through the network; thus, there is a need to prevent layer activation outputs from exploding or vanishing during propagation [24]. The biases for the valve filters are set to one as usual.
2.4. Training Settings
Identical training settings were chosen for both segmentation tasks to better compare the ability of the evaluated methods to incorporate provided context information. All parameter values were empirically chosen after careful evaluation, and they can be seen in table 2.1.
Table 2.1: General parameter settings for the different segmentation problems. For either problem, the epoch pair denotes the number of epochs for each succeeding cascade.

Hyperparameter      Liver       Liver Tumor
U-net base          32          32
U-net depth         5           5
Batch Norm          Yes         Yes
Dropout rate        None        0.2
Optimizer           adam [25]   adam
Learning rate       10^-3       10^-4
Image Resolution    256         512
Batch Size          16          8
Epochs              50, 30      30, 10
K-folds             5           10
For all experiments, batch-based soft Dice loss [26] was used as the loss function, weighted batches with the same number of slices for each output class were used to stabilize training, and gradient norm clipping [27] with a threshold of 1.0 was used to prevent exploding gradients. In addition, a learning-rate step decay schedule was used for liver tumor segmentation, decreasing the learning rate by a factor of 0.1 every 10th epoch.
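The step decay schedule corresponds to the following rule; interpreting "every 10th epoch" as floor division over epoch indices is an assumption:

```python
def step_decay(initial_lr, epoch, drop=0.1, step=10):
    """Learning rate decayed by `drop` every `step` epochs, as used for
    liver tumor segmentation."""
    return initial_lr * drop ** (epoch // step)
```

Starting from the tumor segmentation rate of 10^-4, the rate becomes 10^-5 at epoch 10 and 10^-6 at epoch 20.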
2.5. Performance Evaluation
Quantitative segmentation results were evaluated by measuring volumetric overlap using the Dice coefficient, the Jaccard index, precision, and recall. Both per case scores (the mean score over all the individual volumes) and global scores (attained by viewing all the volumes as a single global volume) were used. For tumor segmentation, a global score is more stable since the size and number of tumors can differ considerably from volume to volume.
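Minimal NumPy sketches of the overlap metrics for binary masks (per case scores average these over volumes; global scores pool all volumes into one mask first):

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def jaccard(pred, gt):
    """Jaccard index (IoU): |A ∩ B| / |A ∪ B|."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union

def precision_recall(pred, gt):
    """Precision = TP / predicted positives, recall = TP / actual positives."""
    tp = np.logical_and(pred, gt).sum()
    return tp / pred.sum(), tp / gt.sum()
```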
To evaluate the statistical significance of the attained scores for tumor segmentation, a Student's t-test [28] over the K folds was performed between the best-performing context channel integration strategy and each of the other methods. Following common practice, a significance threshold of 0.05 was used for the p-value.
To evaluate the generalization ability of the different context channel integration approaches, a few image slices showing the liver with projections of at least one liver tumor were selected and manually segmented to delineate the left kidney with eclipsed kidney regions. Separate labels were used for the "normal" and eclipsed kidney regions, but before the map was fed to the network it was binarized so that the whole kidney region was presented as a single unit. For all selected image slices, all networks trained for liver tumor segmentation with liver contexts were in turn fed the binarized ground truth liver segmentations and the manual kidney segmentations.
3. Results
To summarize the different results concisely, a naming convention has to be agreed upon for the different approaches. In the following, the vanilla U-net without context channel integration is denoted the without context approach. All approaches based on valve filters are denoted by valve followed by the integration level, top for the first block and all for all blocks, followed by the integration position, encoder for the anterior layers of the blocks in the encoder path or skip for the skip connections. The traditional approach based on context channel concatenation is denoted by concat, and the symmetric encoder path for relevance map propagation is denoted by symmetric encoder.
3.1. Liver Segmentation
The quantitative liver segmentation results can be seen in tables 3.1 and 3.2. Mean and standard deviation values are provided for each metric, calculated over the 5 validation folds.
Table 3.1: Mean global and per case Dice and IoU scores for liver segmentation.
*denotes the initial segmentation in the axial view, the other approaches are in the sagittal view.
Method              Dice per case   Dice global     IoU per case    IoU global
Without Context *   .9530 ±.0032    .9547 ±.0042    .9111 ±.0052    .9133 ±.0076
Concat              .9591 ±.0038    .9610 ±.0046    .9221 ±.0061    .9250 ±.0085
Valve Top Encoder   .9589 ±.0034    .9609 ±.0047    .9218 ±.0054    .9249 ±.0086
Valve Top Skip      .9566 ±.0032    .9588 ±.0037    .9176 ±.0052    .9209 ±.0068
Valve All Encoder   .9594 ±.0035    .9616 ±.0043    .9226 ±.0058    .9260 ±.0080
Valve All Skip      .9584 ±.0036    .9603 ±.0045    .9208 ±.0058    .9237 ±.0083
Symmetric Encoder   .9596 ±.0036    .9618 ±.0043    .9231 ±.0058    .9264 ±.0080
Table 3.2: Mean global and per case precision and recall scores for liver segmentation.
*denotes the initial segmentation in the axial view, the other approaches are in the sagittal view.