DEGREE PROJECT IN MEDICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020
An Analysis of Context Channel Integration Strategies for Deep Learning-Based Medical Image Segmentation
JOAKIM STOOR
KTH ROYAL INSTITUTE OF TECHNOLOGY
Master in Medical Engineering
Date: September 16, 2020
Supervisor: Chunliang Wang
Reviewer: Örjan Smedby
Examiner: Matilda Larsson
School of Engineering Sciences in Chemistry, Biotechnology and Health
Host company: Novamia AB
Swedish title: Strategier för kontextkanalintegrering inom
djupinlärningsbaserad medicinsk bildsegmentering
Abstract
This master's thesis investigates different approaches for integrating prior information into a neural network for segmentation of medical images. In the study, liver and liver tumor segmentation is performed in a cascading fashion. Context channels in the form of previous segmentations are integrated into a segmentation network at multiple positions and network depths using different integration strategies. Comparisons are made with the traditional integration approach where an input image is concatenated with context channels at a network's input layer. The aim is to analyze if context information is lost in the upper network layers when the traditional approach is used, and if better results can be achieved if prior information is propagated to deeper layers. The intention is to support further improvements in interactive image segmentation where extra input channels are common.
The results, however, are inconclusive. The quantitative results do not differentiate the methods from each other, and all methods show the ability to generalize to an unseen object class after training. There are no indications that the traditional concatenation approach underperforms compared to the other evaluated methods, and it cannot be concluded that meaningful context information is lost in the deeper network layers.
Sammanfattning
This master's thesis investigates different methods for integrating prior information into a neural network for segmentation of medical images. In the study, cascaded liver and liver tumor segmentation is performed, where previously obtained segmentations are used as context channels that are integrated into a segmentation network at several positions and network depths using different integration strategies. Comparisons are made against the traditional integration method, in which an image and given context channels are concatenated at a network's input layer. The purpose is to analyze whether networks that use the traditional method lose awareness of the given context information in the upper network layers, and whether better results can be achieved if prior information is propagated to deeper layers. The intention is to support further development in interactive image segmentation, where extra input channels are common. The results obtained in the study are, however, ambiguous. Based on the quantitative results it is not possible to distinguish the different methods from one another, and after training all methods show the ability to generalize to an unseen object class. There are no indications that the traditional method based on channel concatenation performs worse than the other methods, and from the results it cannot be concluded that meaningful context information is lost in the deeper layers.
Abbreviations
CNN  Convolutional Neural Network
CT   Computed Tomography
DNN  Deep Neural Network
FCN  Fully Convolutional Network
MRI  Magnetic Resonance Imaging
OOI  Object of Interest
ROI  Region of Interest
SE   Squeeze-and-Excitation
Contents
1. Introduction
2. Method
2.1. Data Processing and Augmentation
2.2. Network Architecture
2.3. Incorporation of Prior Information
2.3.1. Valve Filters and Relevance Maps
2.3.2. Symmetric Encoder Path for Relevance Map Propagation
2.4. Training Settings
2.5. Performance Evaluation
3. Results
3.1. Liver Segmentation
3.2. Liver Tumor Segmentation
3.3. Generalization to an Unseen Object Class
3.4. Visualization of Feature Maps
4. Discussion
4.1. Impact Analysis for Context Channel Integration
4.2. Initialization Strategies
4.3. Generalization Ability
4.4. Relevance for Interactive Segmentation
4.5. Future Work
5. Conclusion
Appendix A. State of the Art
Appendix B. Additional Visualizations
1. Introduction
Segmentation of anatomical structures to delineate tissue abnormalities and organs of interest is highly coveted in medicine. Clinically relevant use cases include segmentation of malignant tissue for cancer diagnosis, treatment planning, tracking and treatment response [1, 2]. Yet, diagnostic imaging procedures and manual segmentations are expensive and time consuming. Dense annotations require expert knowledge from trained physicians and can take several hours, so for large-scale clinical applications, manual approaches are not feasible. Fortunately, since the deep learning revolution began in 2012 [3], great strides have been taken in fully automated image segmentation where no manual input is given during the segmentation process.
Current state-of-the-art approaches in deep learning for fully automated image segmentation are based on convolutional neural networks (CNNs) that learn convolutional filter parameters to capture local relationships [4, 5, 6, 7]. These networks consist of several layers of such filters that successively extract information of increasing complexity from an input image. For segmentation tasks, the networks are often fully convolutional encoder-decoder architectures that not only encode high-level information from an input image, but also decode that information to create a dense segmentation map as an output. For these approaches, and for deep learning in general, large datasets are needed for generalization to unseen examples, and while this is generally not a concern in computer vision where images are relatively easy to come by, this is not so in medical image analysis.
Aside from the problem of acquiring ground truth annotations for medical images, patient safety regulations make it difficult to publicly distribute medical data without infringing on patient privacy laws [8]. For this reason medical datasets are generally smaller than publicly available databases of natural images, and this can ultimately hamper network performance [9].
Beyond the number of publicly available images, medical image acquisition through image modalities such as computed tomography (CT) and magnetic resonance imaging (MRI) imposes difficulties on the segmentation process due to different acquisition protocols, contrast agents, enhancements and scanner resolutions [1]. Furthermore, image complexity is exacerbated by noise, artefacts, pathologies and biological variations [10]. Considering the sparseness and complexity of medical data, medical image segmentation could benefit from systems that can leverage a medical expert's high-level anatomical understanding to find a compromise between user interaction and automation. Consequently, interactive and semi-automatic approaches that incorporate user input into automatic systems are of interest.
Interactive segmentation methods allow for iterative improvements through feedback signals based on manually corrected segmentation regions. Semi-automatic methods are similar, but they rely solely on prior information to initialize the segmentation process, and they are not iterative [11]. Common input types for integrating user interaction with deep neural networks (DNNs) include different encodings of input clicks, bounding boxes, and scribbles [12, 13, 14]. However, regardless of the nature of the input, all such approaches face the difficulty of embedding user input into the networks. Input signals are often used as prior information and encoded as extra input channels under the assumption that the information can be remembered during high-level feature extraction at the deeper network layers. Yet, recent results suggest that valuable information might get lost in the shallow layers [15], and in that case the information would not contribute to important parts of the segmentation process. Improved performance could therefore be achieved by solutions that better attend to the information in the manual input. On that account, it is possible that the latest progress in attention networks could be exploited for this purpose.
The attention mechanism in deep learning makes it possible to attend to image parts with meaningful information while disregarding parts with unimportant information [16, 17].
Most research within this area has focused on learnable attention where no interactive guidance is given to steer the network towards user-specified image regions. One example of such an approach is squeeze-and-excitation (SE) blocks for spatial and channel-wise attention [17, 18]. Solutions based on manual attention are more limited, but one strategy that approaches such a solution was developed by Eppel [15] to segment liquids in a vessel by using relevance maps derived from previous vessel segmentations. These relevance maps regulate the activations of image features within a network, and the basic building block can be applied at arbitrary network depths [19]. The solution is very similar to the traditional approach based on extra input channels, but it differs in the way it is conceptualized. Because it is implemented as a parallel computational unit, it is easier to tweak it to extract more information from the context channels by feeding that information forward through the network.
As it relates to manual attention and the integration of prior information, a thorough investigation of how manual input can best be embedded into DNNs for improved semantic segmentation is lacking. The contribution of this project is, consequently, to investigate how information from extra input channels can be used for this purpose. The results from the study are intended to support developments in deep learning-based interactive segmentation where extra input channels are frequently used. In the project, context channels in the form of previous segmentations are integrated into a segmentation network at multiple positions and network depths using different integration strategies. Comparisons are made with the traditional integration approach where an input image is concatenated with context channels at a network's input layer. The aim of the study is to analyze if context information is lost in the upper network layers when the traditional approach is used, and if better results can be achieved if prior information is propagated to deeper layers.
2. Method
The data used in this study comes from the Liver Tumor Segmentation Challenge (LiTS) [1]. The LiTS dataset consists of 201 contrast-enhanced abdominal CT volumes.
Each volume has a liver and liver tumor ground truth segmentation and the whole dataset is divided into a training set of 131 volumes and a test set of 70 undisclosed volumes.
In the study, all experiments were performed on the training images and evaluation was performed using K-fold cross validation.
Liver and liver tumor segmentation was performed separately in a cascading fashion. The liver was segmented in two succeeding orientations. It was first segmented in the axial view to extract an initial context channel which was then used together with the input image for segmentation in the sagittal view, thus, an auto-context scheme [20] was applied to improve on the first segmentation result. Liver tumors were only segmented in the axial view, but the segmentation was cascaded using two succeeding segmentation steps with different context channels. Liver segmentations were used as context channels in the first cascade, and the extracted tumor segmentations were used in the second one. For both liver and liver tumor segmentation, different approaches were used to integrate a given context channel into the segmentation process. These approaches were the main focus of the project, and they will be explained in due course.
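The cascading scheme above can be sketched as plain data flow, with a stub standing in for the trained networks. The `segment` function and the thresholding inside it are placeholders for illustration only, not part of the thesis implementation:

```python
# Hypothetical sketch of the cascaded segmentation scheme.
# `segment` stands in for a trained U-net forward pass; here it is a stub
# that thresholds pixel values so the data flow can be shown end to end.

def segment(image, context=None):
    """Stub network: takes an image and an optional context channel."""
    return [px > 0 for px in image]  # placeholder, not a real model

def cascade_liver(image):
    # Step 1: axial liver segmentation without context.
    axial_mask = segment(image)
    # Step 2: sagittal segmentation refined with the axial mask as
    # context channel (the auto-context scheme).
    return segment(image, context=axial_mask)

def cascade_tumor(image, liver_mask):
    # Cascade 1: tumor segmentation with the liver mask as context.
    tumor_mask = segment(image, context=liver_mask)
    # Cascade 2: refined with the first tumor mask as context.
    return segment(image, context=tumor_mask)
```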
Note: 13 of the 131 CT volumes in the training set do not contain any tumors. To avoid skewing the statistics during evaluation of segmentation performance, these volumes were excluded when tumor segmentation was performed.
2.1. Data Processing and Augmentation
The following preprocessing steps were applied to all volumes: image reorientation, image resizing and contrast normalization. After the first liver segmentation in the axial view, the binary outputs were used to crop the image volumes to only include slices with liver tissue; thereafter, only slices from the cropped volumes were used for training.
The orientation of the images within the dataset varied, thus, to simplify the learning problem so that all input images had the same direction and alignment, the images were reoriented to a uniform orientation (RAI) [21].
Image normalization was performed using the formula:

I' = (I − µ) / σ    (1)
where µ is the mean intensity and σ is the standard deviation. For the initial liver segmentation, image intensity values were first clamped to the range [−1024, 1024] and the images were then normalized with µ set to 0 HU and σ to 500 HU. Once an initial liver segmentation had been extracted, µ and σ were calculated using image voxels within the previously attained liver segmentation, and normalized image intensity values were then clamped to the range [−5, 5] to retain values up to 5σ away from the mean intensity value of the liver.
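The two-stage normalization described above can be sketched with NumPy. The clamping ranges and statistics are taken from the text; the function names are illustrative:

```python
import numpy as np

def normalize_initial(volume):
    """First pass: clamp HU values, then normalize with fixed mu/sigma."""
    v = np.clip(volume, -1024, 1024)
    return (v - 0.0) / 500.0  # mu = 0 HU, sigma = 500 HU

def normalize_with_liver(volume, liver_mask):
    """Second pass: statistics computed from voxels inside the liver mask."""
    mu = volume[liver_mask].mean()
    sigma = volume[liver_mask].std()
    v = (volume - mu) / sigma
    # Keep values within 5 sigma of the liver mean intensity.
    return np.clip(v, -5.0, 5.0)
```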
For all experiments, the input volumes were resliced to 2D slices in the specified orientation. The liver was segmented isotropically at 1.5 mm resolution and the images were downscaled with a factor of 2 to a resolution of 256 × 256. Liver tumors were segmented anisotropically at an image resolution of 512 × 512 to not disregard tiny liver tumors which would be hard to delineate otherwise. The tumors were only segmented in the axial view, so no isotropic resampling was necessary.
Affine transformations were applied after all the preprocessing steps by randomly sampling parameters for rotation (±10°), horizontal and vertical translation (±10 pixels), and scaling (±10 %).
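The parameter sampling could look as follows in NumPy. Sampling uniformly over the stated ranges is an assumption, as only the ranges are given:

```python
import numpy as np

def sample_affine_params(rng):
    """Sample augmentation parameters within the stated ranges
    (uniform distribution assumed)."""
    return {
        "rotation_deg": rng.uniform(-10, 10),
        "translate_px": rng.uniform(-10, 10, size=2),  # horizontal, vertical
        "scale": 1.0 + rng.uniform(-0.10, 0.10),
    }
```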
For inference, the only post-processing step that was always performed was resampling of predicted segmentation masks back to their original image resolutions. Predicted segmentation masks obtained from cropped volumes were also transformed back into the uncropped volume extents.
2.2. Network Architecture
All the evaluated networks that were selected for this study were based on modifications of the original U-net architecture [5]. U-net is an encoder-decoder architecture with convolutional blocks that consist of two succeeding layers of convolutional filters that perform 3×3 convolution. Each convolutional layer is followed by ReLU-activation, and the number of filters in each layer is a multiple of base B (see figure 2.1).
Figure 2.1: U-net architecture. The arrows denote the different operations. Light grey boxes denote multi-channel feature maps, and the number of channels, shown above each map, is a function of base B. Dark grey boxes represent copied feature maps. The dense prediction map has C output channels (set to 2 for all the experiments in the study).
In the encoding path of U-net, each block is followed by 2×2 max-pooling to decrease the spatial resolution of incoming feature maps. In the decoding path, transposed convolution is instead used to upsample feature maps to increase their spatial resolution. Skip connections are used between blocks of the encoder and decoder to retain more information about the spatial location of structures within the input, and this is achieved by concatenating feature maps from the encoder and decoder. Batch normalization layers [22] are optionally inserted between each convolutional layer and its succeeding activation function, and spatial dropout layers [23] are optionally used after every max-pooling and concatenation operation. Batch normalization is used to improve network performance, speed and stability, and spatial dropout is used to prevent overfitting. The depth of U-net is defined by the number of resolution levels (set to 5 in figure 2.1), and the deepest network layers are the so-called bottleneck layers. In this study, unless stated otherwise, all the network filter parameters were initialized by Xavier initialization [24] and the biases were initially set to zero.
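Assuming the standard U-net convention that channel counts double at each resolution level (the exact counts are defined in figure 2.1), the number of filters per encoder level as a function of base B can be computed as:

```python
def unet_channels(base, depth):
    """Channels per encoder level, assuming the common U-net scheme where
    the count is a multiple of base B that doubles at each level."""
    return [base * 2 ** level for level in range(depth)]
```

For the settings used in this study (base 32, depth 5), this gives 32 channels at the topmost level and 512 at the bottleneck.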
2.3. Incorporation of Prior Information
The traditional approach for integrating prior information into CNNs is simple: context channels are concatenated with the image, and the result is fed as the input to the neural network. The only architectural difference between this approach and approaches that use no context channels is the introduction of additional kernels in the first convolutional layer of the network; these kernels operate on the introduced context channels. In this project, the concatenation approach was viewed as the baseline, and it was compared to two alternative integration strategies based on valve filters.
2.3.1. Valve Filters and Relevance Maps
Given a region of interest (ROI) within an image that is known beforehand, an ROI map can be incorporated into a network via relevance maps that selectively highlight specific parts of an extracted image feature map [15], see figure 2.2. A relevance map is extracted by valve filters that operate on an incoming ROI map, or context channel. From a computational perspective, image and valve filters are alike, but they are part of separate convolutional layers that operate on either an image or a context channel. After both layers have been applied, each channel within the feature map has a corresponding channel within the relevance map, and the feature map and the relevance map are multiplied to produce a normalized feature map. ReLU is applied on the normalized feature map and the produced result is passed on through the network. Valve layers can replace normal convolutional layers to highlight specific image parts, and they can be applied at arbitrary network depths by downsampling the context channel.
Figure 2.2: Valve layer that operates on an incoming image (top-left) and context channel
(bottom-left). This layer is identical to Eppel’s approach in [15].
In contrast to the normal network weights, valve filter weights are zero-initialized and their bias terms are set to one. In this way, the effect of the filters is zero at the outset and then increases gradually during training. This initialization strategy differs from the one used by the traditional integration approach based on context channel and image concatenation, where all the network weights are initialized in the same way, including the weights that operate on the context channels.
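The effect of this initialization can be illustrated with NumPy. A 1×1 valve filter stands in for the spatial convolution (an assumption made for brevity): with zero weights and a bias of one, the relevance map starts as all ones, so the normalized feature map initially equals the plain feature map:

```python
import numpy as np

def valve_relevance(context, weights, bias):
    """Relevance map from a 1x1 valve filter (the real filters are spatial
    convolutions; a scalar weight is used here for illustration)."""
    return context * weights + bias

# Zero-initialized weights with bias one: the relevance map is 1 everywhere,
# so feature * relevance == feature and the valve has no effect at the outset.
context = np.array([[0.0, 1.0], [1.0, 0.0]])
features = np.array([[0.5, -2.0], [3.0, 1.0]])
relevance = valve_relevance(context, weights=0.0, bias=1.0)
normalized = np.maximum(features * relevance, 0.0)  # multiply, then ReLU
```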
Valve Layer Integration Strategies
In this project, valve layers were integrated into U-net in four different ways: at the topmost layer of the encoder, within the topmost skip connection, in the anterior layer of every block in the encoder, or within all the skip connections. In figure 2.3, context channels are integrated at all network depths, and at each block they are first downsampled to the required spatial resolution. For context channel integration at the topmost layer or within the topmost skip connection, the lower blocks in figure 2.3 are the same as in figure 2.1, thus no context channel downsampling is required.
Figure 2.3: Network configurations that support context channel integration via valve layers. In the path to the left, the anterior layer of each encoder block is a valve layer. In the other configuration, valve layers are integrated into the skip connections. The decoder path of either configuration is the same as the one in figure 2.1.
2.3.2. Symmetric Encoder Path for Relevance Map Propagation
The previously described integration methods utilise the provided context channels as they are, without extracting more high-level information from them in order to attend to specific parts of the context regions. A new integration strategy motivated by the valve approach was therefore developed to study how context information is incorporated and propagated throughout a network if more high-level extraction of relevance map information is made possible. The approach makes use of an additional encoder path that operates on relevance maps extracted from context channels.
With few exceptions, the additional path is symmetric to the original encoder path. At each network level, parallel convolutional layers operate on an image feature map or a relevance map. A departure from the valve approach is that a single relevance channel in a relevance map is shared between several feature channels within an image feature map. This is motivated by the observation that relevance channels in a relevance map seem to cluster into groups with similar back- and foreground highlighting, but also by the need to reduce the computational and spatial complexity, which increases with additional convolutional layers. The sharing factor is a hyperparameter that is constant throughout the network depths; it was set to 4 for all experiments (empirically chosen). Consequently, for network layers with 32 image feature channels, 8 relevance channels exist. As before, a normalized image feature map is created by multiplying a relevance channel with its corresponding image feature channels. The normalized image feature map and its related relevance map are then in parallel gated by ReLU-activation before they are fed to the succeeding layer to repeat the process. For a visual representation, see figure 2.4.
Figure 2.4: Network layer that supports context channel integration and propagation via valve layers. The figure illustrates an initial network layer with an incoming input image and context channel. The number of extracted feature channels is denoted by N and the number of relevance channels by N/M (M is the sharing factor). In the illustration, N is set to 8 and M to 2, i.e. 4 relevance channels exist. After ReLU activation, the two results are fed to the next layer for further processing.
For network integration, the layer in figure 2.4 replaces all the normal convolutional layers in the encoder path of figure 2.1 (except for the bottleneck layers). Max-pooling is applied to both the relevance and image feature maps, but only the image feature maps are copied and transferred to the decoder. Optional spatial dropout layers operate exclusively on image feature maps, while optional batch normalization is applied to both map types within the new layer, before ReLU-activation.
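Assuming that each relevance channel gates M consecutive feature channels (the exact grouping is not specified in the text), the channel sharing can be sketched with NumPy:

```python
import numpy as np

def apply_shared_relevance(features, relevance, M):
    """Gate an (N, H, W) feature map with an (N/M, H, W) relevance map,
    where each relevance channel is shared by M consecutive feature
    channels (sharing factor M; the grouping is an assumption)."""
    expanded = np.repeat(relevance, M, axis=0)  # (N/M, H, W) -> (N, H, W)
    return features * expanded
```

With a base of 32 and M = 4, a layer with 32 feature channels is gated by 8 relevance channels, as stated above.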
In contrast to the valve layers in figure 2.2, those in figure 2.4 are weight-initialized by Xavier initialization. The different initialization scheme is motivated by the fact that relevance maps are fed forward through the network; thus, there is a need to prevent layer activation outputs from exploding or vanishing during propagation [24]. The biases for the valve filters are set to one as usual.
2.4. Training Settings
Identical training settings were chosen for both segmentation tasks to better compare the ability of the evaluated methods to incorporate provided context information. All parameter values were empirically chosen after careful evaluation, and they can be seen in table 2.1.
Table 2.1: General parameter settings for the different segmentation problems. For either problem, the epoch pair denotes the number of epochs for each succeeding cascade.

Hyperparameter      Liver       Liver Tumor
U-net base          32          32
U-net depth         5           5
Batch Norm          Yes         Yes
Dropout rate        None        0.2
Optimizer           adam [25]   adam
Learning rate       10^-3       10^-4
Image Resolution    256         512
Batch Size          16          8
Epochs              50, 30      30, 10
K-folds             5           10
For all experiments, batch-based soft Dice loss [26] was used as the loss function, weighted batches with the same number of slices for each output class were used to stabilize training, and gradient norm clipping [27] with a threshold of 1.0 was used to prevent exploding gradients. In addition, a learning-rate step decay schedule was used for liver tumor segmentation, decreasing the learning rate by a factor of 0.1 every 10th epoch.
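The step decay schedule corresponds to the following rule; interpreting "every 10th epoch" as floor division over epoch indices is an assumption:

```python
def step_decay(initial_lr, epoch, drop=0.1, step=10):
    """Learning rate decayed by `drop` every `step` epochs, as used for
    liver tumor segmentation."""
    return initial_lr * drop ** (epoch // step)
```

Starting from the tumor segmentation rate of 10^-4, the rate becomes 10^-5 at epoch 10 and 10^-6 at epoch 20.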
2.5. Performance Evaluation
Quantitative segmentation results were evaluated by measuring volumetric overlap using the Dice coefficient, the Jaccard index, precision, and recall. Both per case scores (the mean score over all the individual volumes) and global scores (attained by viewing all the volumes as a single global volume) were used. For tumor segmentation, a global score is more stable since the size and number of tumors can differ considerably from volume to volume.
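Minimal NumPy sketches of the overlap metrics for binary masks (per case scores average these over volumes; global scores pool all volumes into one mask first):

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def jaccard(pred, gt):
    """Jaccard index (IoU): |A ∩ B| / |A ∪ B|."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union

def precision_recall(pred, gt):
    """Precision = TP / predicted positives, recall = TP / actual positives."""
    tp = np.logical_and(pred, gt).sum()
    return tp / pred.sum(), tp / gt.sum()
```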
To evaluate the statistical significance of the attained scores for tumor segmentation, a Student's t-test [28] over the K folds was performed between the best-performing context channel integration strategy and each of the other methods. Following common practice, a significance threshold of 0.05 was used for the p-value.
To evaluate the generalization ability of the different context channel integration approaches, a few image slices showing the liver with projections of at least one liver tumor were selected and manually segmented to delineate the left kidney with eclipsed kidney regions. Separate labels were used for the "normal" and eclipsed kidney regions, but before the map was fed to the network it was binarized so that the whole kidney region was presented as a single unit. For all selected image slices, all networks trained for liver tumor segmentation with liver contexts were in turn fed the binarized ground truth liver segmentations and the manual kidney segmentations.
3. Results
To summarize the different results concisely, a naming convention has to be agreed upon for the different approaches. In the following, the vanilla U-net without context channel integration is denoted the without context approach. All approaches based on valve filters are denoted by valve followed by the integration level, top for the first block and all for all blocks, followed by the integration position, encoder for the anterior layers of the blocks in the encoder path or skip for the skip connections. The traditional approach based on context channel concatenation is denoted by concat, and the symmetric encoder path for relevance map propagation is denoted by symmetric encoder.
3.1. Liver Segmentation
The quantitative liver segmentation results can be seen in tables 3.1 and 3.2. Mean and standard deviation values are provided for each metric, calculated over the 5 validation folds.
Table 3.1: Mean global and per case Dice and IoU scores for liver segmentation.
*denotes the initial segmentation in the axial view, the other approaches are in the sagittal view.
Method              Dice per case   Dice global     IoU per case    IoU global
Without Context *   .9530 ±.0032    .9547 ±.0042    .9111 ±.0052    .9133 ±.0076
Concat              .9591 ±.0038    .9610 ±.0046    .9221 ±.0061    .9250 ±.0085
Valve Top Encoder   .9589 ±.0034    .9609 ±.0047    .9218 ±.0054    .9249 ±.0086
Valve Top Skip      .9566 ±.0032    .9588 ±.0037    .9176 ±.0052    .9209 ±.0068
Valve All Encoder   .9594 ±.0035    .9616 ±.0043    .9226 ±.0058    .9260 ±.0080
Valve All Skip      .9584 ±.0036    .9603 ±.0045    .9208 ±.0058    .9237 ±.0083
Symmetric Encoder   .9596 ±.0036    .9618 ±.0043    .9231 ±.0058    .9264 ±.0080
Table 3.2: Mean global and per case precision and recall scores for liver segmentation.
*denotes the initial segmentation in the axial view, the other approaches are in the sagittal view.