DEGREE PROJECT IN MEDICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019
Gland Segmentation with Convolutional Neural Networks: Validity of Stroma Segmentation as a General Approach
THOMAS BINDER
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH
KUNGLIGA TEKNISKA HÖGSKOLAN
MASTER THESIS
Gland segmentation with Convolutional Neural Networks: validity of stroma segmentation as a general approach
Author:
Thomas BINDER
Supervisors:
Pushpak PATI
Mehdi TANTAOUI
A thesis submitted in fulfillment of the requirements for the Master's degree,
carried out in the
Computational Pathology Team of IBM Research Zurich, School of Technology and Health
March 13, 2019
KUNGLIGA TEKNISKA HÖGSKOLAN
Abstract
School of Technology and Health, Master thesis
Gland segmentation with Convolutional Neural Networks: validity of stroma segmentation as a general approach
by Thomas BINDER
The analysis of glandular morphology within histopathology images is a crucial step in determining the stage of cancer. Manual annotation is a very laborious task: it is time consuming and suffers from the subjectivity of the specialists who label the glands. One of the aims of computational pathology is to develop tools that automate gland segmentation. Such an algorithm would improve the efficiency of cancer diagnosis. This is a complex task, as there is large variability in glandular morphologies and staining techniques. So far, specialised models focusing on only one organ have given promising results. This work investigated the idea of a cross-domain approximation. Unlike parenchymae, the stroma tissue that lies between the glands is similar throughout all organs in the body. Creating a model able to precisely segment the stroma would pave the way for a cross-organ model: it would be able to segment the tissue and therefore give access to gland morphologies of different organs. To address this issue, we investigated different new and former architectures, such as MILD-net, currently the best performing algorithm of the GlaS challenge. New architectures were created based on the promising U-shaped network as well as on Xception and ResNet for feature extraction. These networks were trained on colon histopathology images, focusing on glands and on the stroma.
The comparison of the different results showed that this initial cross-domain approximation goes in the right direction and encourages further development.
Acknowledgements
I would like to thank Dr. Maria Gabrani, who gave me the opportunity to work on this great and interesting project. I would also like to thank Pushpak Pati, Mehdi Tantaoui and Raul Catena for their extraordinary support throughout this thesis process. Of course, none of this would have been possible without the help of Ago Set-Agahyan and his wonderful team of data scientists. To all, thank you very much.
Contents
Abstract
Acknowledgements
1 Introduction
2 Method
2.1 Description of the dataset
2.2 Data augmentation
2.3 Network architectures
2.3.1 U-net Xception multitask
2.3.2 U-net ResNet multitask
2.3.3 Xception multitask with auxiliary classifiers
2.3.4 MILD-net
2.4 Training process
2.5 Post processing
3 Results
3.1 Evaluation metrics
3.2 Results for gland segmentation
3.3 Results for stroma segmentation
3.4 Comparing models on the breast dataset
4 Discussion
5 Conclusion
6 Future work
A State of the art in gland segmentation
A.1 Medical image analysis, the advent of deep learning
A.1.1 The importance of glands in cancer detection
A.1.2 The challenges of image analytics in pathogenesis quantification
A.1.3 State of previous technologies in medical image and gland analysis
A.2 Introduction to Deep learning and CNN
A.2.1 Deep Learning
Architecture
Training
A.2.2 Convolutional Neural Network
Introduction
Architecture of layers
Fully convolutional neural networks
A.3 State of the art for gland detection
A.3.1 State of the art in image segmentation techniques
A.3.2 Best performing architectures
A.3.3 Conclusion
B Xception and ResNet architectures
C Prediction results with Xception - training on stroma
D Prediction results with Xception - training on glands
E Prediction results with MILD-net - training on glands
F Upsampling blocks
G Deep Multi Channel Network
H Deep Contour Aware Network
List of Figures
2.1 Example of images in the dataset (bottom are malignant, top are benign) - reprinted from Gland Segmentation in Colon Histology Images: The GlaS Challenge Contest
2.2 Augmented images from the dataset
2.3 U-net Xception architecture
2.4 Xception multitask with auxiliary classifiers architecture
3.1 Comparison of learning curves for the Xception network: on-the-fly data augmentation (left) vs. learning on 24,000 examples per epoch (right)
3.2 Example of segmentation from the test sets with the normalised input on the left
3.3 Learning curve for Xception U-net on the stroma segmentation task
3.4 Sample from the breast dataset
3.5 Predictions with the MILD-net focusing on stroma
3.6 Predictions with the Xception U-net focusing on glands
3.7 Predictions with the Xception U-net focusing on stroma
3.8 Predictions with the MILD-net focusing on glands
A.1 Endocrine system (Wikipedia)
A.2 Examples of histological sections of colorectal adenocarcinoma from the public GlaS Challenge 2015 dataset
A.3 Simple neural network
A.4 Convolution (left) and transposed convolution (right)
A.5 2x2 max pooling (Wikipedia)
A.6 Inception block
A.7 Residual block
A.8 U-Net architecture, from "U-Net: Convolutional Networks for Biomedical Image Segmentation" by O. Ronneberger
A.9 Blueprint of MILD-Net, from "MILD-Net: Minimal Information Loss Dilated Network for Gland Instance Segmentation in Colon Histology Images" by S. Graham
A.10 Dilated convolution
A.11 Comparison of segmentation results for different networks including MILD-Net (last row), from "MILD-Net: Minimal Information Loss Dilated Network for Gland Instance Segmentation in Colon Histology Images" by S. Graham
B.1 Xception
B.2 ResNet
C.1 Predictions by Xception U-net on stroma
D.1 Predictions by Xception U-net on glands
E.1 Predictions by MILD-net
G.1 Example of a deeply supervised network, from "Training Deeper Convolutional Networks with Deep Supervision" by L. Wang
G.2 Blueprint of the DMN architecture, from "Gland Instance Segmentation by Deep Multichannel Side Supervision" by Yan Xu et al.
G.3 From left to right: original image, ground truth, FCN output, DMN output, from "Gland Instance Segmentation by Deep Multichannel Side Supervision" by Yan Xu
H.1 Architecture of the DCAN, from "DCAN: Deep Contour-Aware Networks for Accurate Gland Segmentation" by H. Chen et al. [40]
H.2 Results of the DCAN, from top to bottom: original image, output without contour prediction, and output with contour prediction
List of Tables
3.1 Results of the best performing architectures on gland segmentation
3.2 Results of the best performing architectures on stroma segmentation
A.1 Advantages and drawbacks of segmentation techniques
A.2 Results of the MILD-net
G.1 Results of the 2015 challenge and DMN results, from "Gland segmentation in colon histology images: The GlaS challenge contest"
H.1 Results of the 2015 challenge and DCAN results, from "Gland segmentation in colon histology images: The GlaS challenge contest"
List of Abbreviations
CNN Convolutional Neural Network
FCN Fully Convolutional Network
GlaS Gland Segmentation challenge
MILD-net Minimal Information Loss Dilated network
GPU Graphics Processing Unit
ASPP Atrous Spatial Pyramid Pooling
Chapter 1
Introduction
The process of determining the extent of malignancy is called cancer grading. To grade the cancer, pathologists need to segment the glands: the experts individually highlight all the different glands in the slide. Gland morphology is one of the primary criteria used in clinical practice to assess the cancer and plan the treatment of individual patients [1, 2]. Achieving a good level of reproducibility and robustness remains an important challenge in pathology practice [5, 6]. Digitised histology slides are becoming increasingly accurate and offer a potential automated solution to this problem.
Manual segmentation of glands is a very laborious process. Compared with other diagnostic technologies, tissue analysis is more invasive but provides better insight into the potential disease and the health of the patient. With the advent of computer vision and semantic segmentation, it is now possible to design convolutional neural networks (CNN) that are able to segment particular parts of a picture [39, 40]. Automated gland segmentation of tissue slides would allow extraction of quantitative features associated with gland morphology. The increasingly good quality of gland segmentation will pave the way for computer-assisted grading and increase the reproducibility and reliability of cancer grading.
Studies have already been carried out and some networks perform well on gland segmentation, with Dice scores and F1 scores above 0.85 [23, 39, 40, 42, 45]. However, they rely on single-organ training sets. Glands, or more generally speaking epithelial tissue, are very different from one organ to the other, and therefore these networks cannot be trained to segment glands from different organs. However, the connective tissue around glands is quite similar in every organ. This work investigates the value of focusing on connective tissue, i.e. the stroma, to find the morphologies of glands.
To make this assessment, the first part of this work consisted of implementing different new architectures for gland segmentation and comparing them to the current state of the art. Xception and ResNet are among the best performing networks on the ImageNet challenge and therefore made promising feature-extractor candidates for transfer learning. U-nets have shown promising results in segmentation tasks for biological images and were considered as well in this work. Once these architectures performed accurately on the gland segmentation task according to their F1 and Dice scores, they were used to learn the stroma. Stroma is the tissue that lies between the different glands. This tissue is very similar from one organ to another, and being able to segment it correctly would enable a potentially high-performing network on not only one but several organs. Models trained on glands and models trained on stroma were then tested on a dataset from another organ, and the results were compared to assess the feasibility and value of designing an architecture focusing on the stroma.
Chapter 2
Method
2.1 Description of the dataset
The first dataset used consists of 165 images of stained histological sections of stage T3 and T4 colorectal adenocarcinoma (see nomenclature). The T1, T2, T3 and T4 grades refer to the size and/or extent of the main tumor: the higher the number after the T, the larger the tumor or the more it has grown into nearby tissues. T categories may be further divided to provide more detailed annotations, such as T3a and T3b. Each section belongs to a different patient, and sections were processed in the laboratory on different occasions. Thus, the dataset exhibits high inter-subject variability in both stain distribution and tissue architecture. These data were released as part of the GlaS (Gland Segmentation) challenge of 2015 [21], whose purpose was to create an incentive to design network architectures able to precisely segment glands.
An expert pathologist graded each visual field as benign or malignant according to the overall architecture of glands in the tissue. The pathologist also delimited the boundary of each individual gland on the field. These manual annotations are used as ground truth for the automatic segmentation. Figure 2.1 below shows examples of images in the dataset. The green delimitation was drawn by the pathologist.
FIGURE 2.1: Example of images in the dataset (bottom are malignant, top are benign) - reprinted from Gland Segmentation in Colon Histology Images: The GlaS Challenge Contest
The data was separated into three different sets according to the challenge. A first set of 85 images and its ground truths made the training set. Two other sets, A of 60 images and B of 20 images, made the two test sets. Set B was used for on-site testing in 2015.
2.2 Data augmentation
These 85 images are not sufficient to robustly train a convolutional neural network [40, 44]. The first part of this work therefore focused on the development of a relevant data augmentation strategy. The first angle was the implementation of hand-crafted methods to augment the data: cropping, rotations, elastic distortions, vertical and horizontal flips and translations were used to create a first set of 1000 images. It was eventually discovered that the cropping used for the project created an important bias in the data, making the network unable to learn anything useful for the validation and test sets. A Python library called "Augmentor" [49] was then used to create a data augmentation pipeline that made the networks learn correct features. Figure 2.2 below shows some examples of screenings that were augmented. One can clearly notice the skewing and some distortions of the image.
FIGURE 2.2: Augmented images from the dataset
The code currently implemented is able to sample as many images as desired, along with their ground truths, from the 85 images of the training set. Most of the time, between 4000 and 8000 images were used to train the network. Before augmentation, 80 percent of the images were used for training while 20 percent were kept for validation. After training, the network was tested on 80 unseen images: test set A consists of 60 of these images, while test set B consists of the remaining 20 images that were used for on-site testing during the challenge.
Later in the project, offline data augmentation was dropped in favour of online data augmentation. Instead of sampling 4000 new images from the 85 originals into a new folder, the network randomly chose 4 or 8 images among the 85 and augmented them before training on this batch. This process is called "on-the-fly augmentation". It helps with variance reduction, as the network is never fed the same input twice. This process is the current trend in image analysis with CNNs and FCNs. It is also preferred for larger datasets as it allows for optimised data storage [50].
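The on-the-fly procedure described above can be sketched with plain NumPy. This is an illustrative generator, not the Augmentor pipeline actually used in the thesis; the function names (`augment_pair`, `on_the_fly_batches`) are hypothetical, and only simple flips and rotations are shown in place of elastic distortions and skewing.

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply the same random flip/rotation to an image and its mask."""
    k = rng.integers(0, 4)            # number of 90-degree rotations
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    if rng.random() < 0.5:            # horizontal flip
        image, mask = np.fliplr(image), np.fliplr(mask)
    if rng.random() < 0.5:            # vertical flip
        image, mask = np.flipud(image), np.flipud(mask)
    return image, mask

def on_the_fly_batches(images, masks, batch_size=4, seed=0):
    """Yield endlessly: each batch picks random originals and augments them,
    so the network never sees exactly the same input twice."""
    rng = np.random.default_rng(seed)
    while True:
        idx = rng.integers(0, len(images), size=batch_size)
        pairs = [augment_pair(images[i], masks[i], rng) for i in idx]
        yield (np.stack([p[0] for p in pairs]),
               np.stack([p[1] for p in pairs]))
```

In a real pipeline the generator would be passed directly to the training loop, so storage stays constant regardless of how many augmented examples are consumed.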
2.3 Network architectures
This work used four different architectures, presented in this section. The first part of the study focused on implementing architectures able to perform correctly compared to the current state of the art of gland segmentation. Once the architectures achieved F1 and Dice scores around 0.7 on the GlaS dataset, the best performing models were used to segment the stroma. The first three architectures had not been tried before and are inspired by the Deep Contour Aware Network and the U-net [40]. Coupling Xception, one of the best feature extractors for image classification [48], with the U-net was a promising idea. These networks perform a segmentation task, as they predict whether a pixel belongs to the background or to a gland. Once every pixel is labeled, objects are individualised to get distinct glands.
2.3.1 U-net Xception multitask
This new architecture is a U-net where the downsampling path uses the Xception network's weights [44, 48]. The activations of the different inception blocks are retrieved and concatenated in the upsampling path to create a U-net. There are two different upsampling paths: one to predict the masks and a second one to predict the contours. Once both feature maps are computed, all this information is fused together to draw the final mask prediction. A pixel is considered a gland pixel if its probability of being a gland is above a given threshold and its probability of belonging to a contour is below another given threshold. In practice, the best results were obtained with both thresholds set to 0.5, though further investigation could be made to fine-tune them. The exact architecture of Xception is given in appendix B. Appendix F gives a deeper insight into the network, specifically the different upsampling layers.
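The fusion rule described above (gland probability above one threshold, contour probability below another) can be sketched as follows; `fuse_predictions` is a hypothetical name used for illustration, not the thesis's actual implementation.

```python
import numpy as np

def fuse_predictions(p_gland, p_contour, t_gland=0.5, t_contour=0.5):
    """A pixel is glandular if it is likely gland AND unlikely contour.

    p_gland and p_contour are probability maps of the same shape,
    produced by the two upsampling paths; both thresholds default to
    the 0.5 value that worked best in practice."""
    return (p_gland > t_gland) & (p_contour < t_contour)
```

Pixels on gland boundaries get a high contour probability and are therefore excluded, which helps separate glands that touch each other.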
FIGURE 2.3: U-net Xception architecture
2.3.2 U-net ResNet multitask
This architecture is in essence the same as the U-net Xception multitask above, except that the downsampling path is handled by the ResNet50 rather than the Xception network. The ResNet, even though it was outperformed by Xception on ImageNet, is known to be an excellent feature extractor [44]. The exact architecture of the ResNet50 is given in appendix B. This architecture also presents the U shape of U-nets and had not been tried before.
2.3.3 Xception multitask with auxiliary classifiers
This architecture is composed as follows: the Xception network [48] forms the downsampling path, while the upsampling path this time does not rely on concatenation of activations. This network is therefore not a U-net. Another difference with the two previous architectures is that the last three activation blocks of the downsampling path, instead of only the last one, are used and fused together to create the output probability map. There are consequently six upsampling paths: three for the masks and three for the contours. At the end of each of these paths is an auxiliary classifier. Figure 2.4 below presents the principles of the network. This network uses the concept of deep supervision, exploiting the features learnt at different levels of the architecture to compute the correct labels.
FIGURE 2.4: Xception multitask with auxiliary classifiers architecture
2.3.4 MILD-net
As the current best performing architecture in gland segmentation, the MILD-net [23] was used in this work as well. This complex architecture is explained in detail in appendix A.3.2.
2.4 Training process
The first training process is done with offline data augmentation and consists of the following steps:
• Augmentation of the 85 training images according to the methods presented above
• Separation of the new dataset into a training and a validation set, making sure that there are as many benign as malignant cases in each of the sets. The training set consists of 80% of the augmented images, ground-truth masks and contours, while the validation set is made of the remaining 20%
• The network is trained remotely on the GPUs of IBM Research in Zurich. An SSH connection is made from the local machine to the GPUs to submit the training
• Once training is over, the model and loss history are downloaded locally
• The model is evaluated according to the methods presented in the next section on the 80 images of the test set
The second training process that was used is simpler and optimises data storage; it consists of the following steps:
• Loading the 85 original images in a data frame
• Normalising images by computing the mean and variance of the training set
• Separating these images according to the malignant/benign distribution in two sets of 66 images and 19 images
• For each epoch, 4 or 8 images are randomly chosen from the 66 training images, augmented, and fed to the network for training. This is repeated several times per epoch, usually around 100 times. Validation is done on augmented images from the 19 validation images
• Once training is over, the model and loss history are downloaded locally
• The model is evaluated according to the methods presented in the next section on the 80 images of the test set
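The normalisation step in the list above (mean and variance computed on the training set, then applied everywhere) might look like this minimal NumPy sketch; the function names are hypothetical and chosen for illustration.

```python
import numpy as np

def fit_normaliser(train_images):
    """Compute per-channel mean and std over the whole training set only,
    so no statistics leak from validation or test images."""
    stacked = np.stack(train_images).astype(np.float64)
    mean = stacked.mean(axis=(0, 1, 2))        # one value per RGB channel
    std = stacked.std(axis=(0, 1, 2)) + 1e-8   # avoid division by zero
    return mean, std

def normalise(image, mean, std):
    """Apply the training-set statistics to any image (train, val or test)."""
    return (image.astype(np.float64) - mean) / std
```

The same `mean` and `std` are reused at test time, which keeps the 80 held-out images on the same scale as the training inputs.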
2.5 Post processing
The raw output of the network is a probability map: each pixel has a probability of belonging to one class or the other. This map needs to be processed to yield precise boundaries. In order to improve the quality of the segmentation, it is therefore important to work on the post-processing as well as on the architecture of the network. The post-processing consists of the following steps:
• Smoothing of boundaries using standard image processing techniques such as erosions and dilations
• Deleting smaller patches of pixels - threshold was empirically set to 450 pixels for images of size 480x480
• Filling holes within objects that are labelled as glandular objects
• Labelling distinct objects to individualize the different segmented glandular objects
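The four steps above can be sketched with `scipy.ndimage`. This is an illustrative pipeline under assumed parameters (a 0.5 binarisation threshold, plus the 450-pixel minimum size quoted in the text), not the exact implementation used in the thesis.

```python
import numpy as np
from scipy import ndimage

def postprocess(prob_map, threshold=0.5, min_size=450):
    """Turn a probability map into labelled, individualised gland objects."""
    binary = prob_map > threshold
    # Smooth boundaries with a morphological opening (erosion then dilation).
    binary = ndimage.binary_opening(binary, iterations=2)
    # Fill holes (e.g. lumens) inside objects labelled as glandular.
    binary = ndimage.binary_fill_holes(binary)
    # Label connected components, then delete patches smaller than min_size.
    labels, n = ndimage.label(binary)
    sizes = ndimage.sum(binary, labels, index=range(1, n + 1))
    for obj_id, size in enumerate(sizes, start=1):
        if size < min_size:
            labels[labels == obj_id] = 0
    # Relabel so the surviving glands are numbered 1..k.
    labels, _ = ndimage.label(labels > 0)
    return labels
```

Each surviving label then corresponds to one segmented gland, which is what the object-level metrics of the next chapter operate on.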
Chapter 3
Results
3.1 Evaluation metrics
The performance of each model is evaluated based on the three criteria used in the GlaS challenge to compare performance: the accuracy of the detection of individual glands, the volume-based accuracy of the segmentation of individual glands, and the boundary-based similarity of the segmentation of individual glands.
One might assume that the volume-based segmentation accuracy is redundant with the boundary-based segmentation accuracy. However, the volume-based metric is defined and calculated from the label the algorithm assigns to each pixel, whereas the boundary-based metric uses the position the algorithm assigns to the boundary of each gland. Pixel labels may be fairly accurate while the boundary curves are very different.
The F1 score was used to measure the accuracy of individual gland detection. A segmented object that intersects with at least 50% of its ground truth object is counted as a true positive; otherwise it is counted as a false positive. The number of false negatives is calculated as the difference between the number of ground truth objects and the number of true positives.
The Dice index measures the similarity between two sets of samples. An object-level dice index is introduced in order to compare the segmented object to its ground truth. The index ranges from 0 to 1 where the higher the value, the more similar the segmented object and the ground truth are.
Finally, the boundary-based accuracy is computed with an object-level Hausdorff distance. The Hausdorff distance is the largest of the nearest-neighbour distances between the boundary pixels of the segmented object and those of its corresponding ground truth.
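Hedged sketches of the three metrics are shown below. These operate on single object pairs or on pre-computed counts; the actual challenge metrics aggregate them at object level over whole images, and the function names are hypothetical.

```python
import numpy as np

def dice(a, b):
    """Dice index between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def hausdorff(pts_a, pts_b):
    """Symmetric Hausdorff distance between two boundary point sets
    of shapes (N, 2) and (M, 2): the largest nearest-neighbour distance."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def detection_f1(n_true_pos, n_segmented, n_ground_truth):
    """Detection F1: TP = segmented objects covering >=50% of a GT object."""
    precision = n_true_pos / n_segmented if n_segmented else 0.0
    recall = n_true_pos / n_ground_truth if n_ground_truth else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```

A Dice of 1 means the segmented object and its ground truth coincide exactly, while a large Hausdorff value flags a boundary that strays far from the reference even when most pixels are labelled correctly.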
3.2 Results for gland segmentation
The first step of this work was to design network architectures rivalling the current best ones. Designing an accurate network for gland segmentation was an important foundation before moving one step closer to stroma segmentation. It was also an opportunity to learn more about the different techniques and technologies used for gland segmentation.
The first models that were trained yielded disappointing results on the test set, with F1 and Dice scores lower than 0.2. However, they performed precisely on the training and validation sets, with more than 95% accuracy in pixel prediction. The problem came from the data augmentation pipeline, which created an important bias in the training set: the networks were not able to generalize the features they learned.
Considerable effort was put into data augmentation, and the first interesting results were obtained through the use of the library "Augmentor" [49]. Table 3.1 below shows the different best performing networks.
TABLE 3.1: Results of the best performing architectures on gland segmentation
Scores were computed on the test sets A and B. Compared to the results of the state of the art, these results are encouraging, with F1 scores around 0.7 and Dice scores around 0.8. The only issue lies with the Hausdorff score. This issue is likely related to the difficulty the network has in correctly labelling the white pixels that sometimes surround the glands: white pixels can belong to a lumen within a gland, or simply represent the absence of tissue. The network needs to precisely understand the context of each pixel to make the right prediction.
An issue that was spotted with this training process was the important overfitting that arises after 5-6 epochs when training on a few thousand images. Implementing an online data augmentation process fixed the overfitting issue, as one can see in figure 3.1 below. The validation loss curve is closer to the training curve, which confirms that learning on new examples made it easier for the network to generalize the features learnt. Figure 3.2 shows an example of prediction on the test set.
Appendices D and E present more predictions of gland morphologies by the Xception U-net and MILD-net.
3.3 Results for stroma segmentation
Stroma segmentation with CNNs had never been tried before for cancer grading, as glands are the most important and discriminant structures. After the creation of architectures that performed correctly on the gland segmentation task, the objective was to focus on the stroma. This tissue, which lies between the glands, is quite similar from one picture to the other, as one can see in figure 2.1. The stroma mask is not exactly the complement of the glands' mask: there are numerous white patches of pixels, corresponding to lumens, that are neither gland nor stroma. Since no stroma masks were at our disposal, it was necessary to compute new masks from the existing gland masks. This was done by considering three distinct types of tissue: glands, white areas and stroma. The resulting stroma masks were validated with an expert histologist to ensure their correctness.
FIGURE 3.1: Comparison of learning curves for the Xception network: on-the-fly data augmentation (left) vs. learning on 24,000 examples per epoch (right)
FIGURE 3.2: Example of segmentation from the test sets with the normalised input on the left
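The three-class derivation of the stroma masks might be sketched as follows. The brightness threshold used to detect white areas is an assumption for illustration, not a value from the thesis, and `stroma_mask` is a hypothetical name.

```python
import numpy as np

def stroma_mask(image, gland_mask, white_threshold=220):
    """Derive a stroma mask as everything that is neither gland nor white.

    white_threshold is an ASSUMED brightness cut-off for lumen/background
    pixels; in the thesis the resulting masks were validated by an expert
    histologist rather than by a fixed threshold."""
    # A pixel is "white" if it is bright in every colour channel.
    white = image.min(axis=-1) >= white_threshold
    return ~gland_mask.astype(bool) & ~white
```

Splitting the image into glands, white areas and stroma this way keeps lumens out of both the gland and the stroma classes, which matches the three-tissue decomposition described above.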
As the two best performing architectures, both the Xception U-net and the MILD-net were tested on this new dataset. As one can see in table 3.2, the Xception network performed better on test set A while the MILD-net outperformed the Xception U-net on set B. Both architectures were tried with and without the second upsampling path yielding the contours. As the task does not focus on glands, it was important to check the relevance of contours.
TABLE 3.2: Results of the best performing architectures on stroma segmentation
Learning curves were computed to make sure that the network correctly learned and generalized features. The new training pipeline with on-the-fly learning once again helped with reducing overfitting. Figure 3.3 below shows the learning curve of the best performing Xception U-net for stroma segmentation. More predictions made by the Xception U-net focusing on the stroma are presented in Appendix C. One can notice an increase of the validation curve after epoch 10, which marks the beginning of slight overfitting.
FIGURE 3.3: Learning curve for Xception U-net on the stroma segmentation task
3.4 Comparing models on the breast dataset
The computational pathology team at IBM Research in Zurich had access to a few histological screenings of the breast. These images have a higher definition than the previous colon screenings, with a resolution of 2000x2000 pixels. Figure 3.4 below shows examples of such screenings; they represent glands within stroma tissue. This report only presents qualitative results of breast segmentation.
FIGURE 3.4: Sample from the breast dataset
Not all annotations were ready for testing in time for this report; this work is therefore based on results from 10 images of the breast dataset. The best performing models focusing on glands and on stroma were used on these images to assess their ability to segment the important parts of the screenings, namely the glands or the stroma.
As the results are highly dependent on the architecture and not as precise as for the colon screenings, the relevance of the architectures and of the stroma segmentation process was assessed by reviewing the predictions with an expert in manual gland annotation, a trained histopathologist from IBM Research. Figures 3.5, 3.6, 3.7 and 3.8 give samples of the networks' predictions. In figure 3.7 one can notice that the network detected most of the stroma and therefore yielded promising-looking glands according to the expert. Figure 3.5 shows that the MILD-net focusing on stroma was not able to detect the important structures. However, figure 3.8 shows that the MILD-net was able to detect some glands through the knowledge learnt from training on the colon. Meanwhile, the Xception U-net performed best when focusing on the stroma, as one can see in figures 3.6 and 3.7.
FIGURE 3.5: Predictions with the MILD-net focusing on stroma
FIGURE 3.6: Predictions with the Xception U-net focusing on glands
FIGURE 3.7: Predictions with the Xception U-net focusing on stroma
FIGURE 3.8: Predictions with the MILD-net focusing on glands
Chapter 4
Discussion
As previously stated, the analysis of glandular morphology within histopathology images is of the utmost importance in determining the stage of cancer and therefore in selecting the most suited treatment. Designing a single system able to precisely assess the morphologies of glands in not only one but several organs of the body would be a considerable breakthrough. It would pave the way for a quicker and bias-free method of cancer diagnosis, as well as supporting clinicians in prescribing treatments for patients. With this aim in sight, the first results on stroma segmentation are discussed below. The conclusions concerning segmentation coherence were reached with Raul Catena, a trained histologist and visiting researcher at IBM Research Zurich. He is competent in manual gland annotation and is therefore qualified to judge whether a segmentation is correct.
The first important results to discuss come from the learning curves obtained after training. Even though the architectures were kept the same, the difference in results between offline training and online training (also known as training on the fly) is substantial. As one can see in figure 3.1, feeding the network new random images every time made it considerably better at generalising. This can be explained by the considerable sizes of the architectures considered: the MILD-net has more than 80 million learnable parameters, while the Xception U-net has around 50 million.
These networks are able to learn very complex features and tend to overfit training sets that are not precisely and carefully prepared and monitored. On-the-fly training is a good way to prevent overfitting, as the network is always presented new, previously unseen images. Since the network is trying to find the boundaries of the stroma, it is important to implement elastic transformations as well as other standard data augmentation techniques such as Gaussian noise, Gaussian blur and rotations. Normalising the initial batch of 85 training images before augmentation also allowed for more robust results.
Regarding the segmentation itself, the first interesting result concerns the difference between the predictions of the Xception U-net and the MILD-net. Even though the MILD-net, as the current best performing architecture, is better at segmenting glands within the colon, it yielded poor results on the breast dataset when focusing on the stroma (see figure 3.5). This behaviour can be attributed to the strong specialisation of the MILD-net: it did not take advantage of transfer learning, which means it could only rely on the augmented images it was given and not on knowledge drawn from tens of thousands of images. The architecture of the MILD-net is excellent at learning the details of the images it is given, but the results showed that it did not perform well on a previously unseen organ. Xception, even though its results on the colon are not as good as the MILD-net's, gives more coherent and relevant results on the breast dataset. When focusing on glands, the MILD-net
gave interesting results, as figure 3.8 shows. Another advantage of the Xception U-net is its smaller size: the MILD-net architecture weighs around 1 GB, while an Xception U-net weighs around 550 MB. Nonetheless, both architectures were able to give acceptable first results on an unknown type of dataset. This can probably be attributed to their ability to extract and analyse features at different scales: the MILD-net makes use of its ASPP unit, while the Xception U-net gathers information at different scales thanks to its inception blocks.
Results with and without a focus on contours for stroma segmentation were also compared, and the relevance of contours was confirmed. The best performing algorithm for stroma segmentation used the contours of the glands to improve its prediction, even though one could have assumed that directly computing the masks would have been quicker and sufficient. Table 3.2 shows the difference in performance for stroma segmentation with and without contours. Segmenting glands or stroma tissue requires high precision. Using the contours gives more insight into the surroundings of each pixel and therefore allows for more accurate segmentation and differentiation between glands and stroma, especially when glands are very close to one another and the stroma is reduced to a thin line between them.
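To make the idea of a contour target concrete, here is a minimal NumPy sketch of how one can be derived from a binary gland mask. The 4-neighbour rule used here is an illustrative choice, not necessarily the exact procedure used in this work:

```python
import numpy as np

def mask_to_contour(mask):
    """Derive a one-pixel-wide contour map from a binary gland mask:
    a gland pixel belongs to the contour if any of its 4 neighbours
    is background (i.e. it survives a 4-neighbour erosion test)."""
    padded = np.pad(mask, 1)                       # background border
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~interior                        # gland minus interior

mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 1:4] = True             # a 3x3 'gland'
contour = mask_to_contour(mask)   # only the centre pixel is interior
```

Trained on such targets, the network learns to emphasise gland boundaries, which is precisely what helps separate glands when the stroma between them is thin.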
This preliminary study showed promising results for cancer diagnosis and grading. Stroma segmentation, as an initial cross-domain approximation, goes in the right direction. Many improvements are still needed to go from encouraging segmentations to very accurate ones. Nonetheless, the importance and relevance of contours, transfer learning, multi-scale feature extraction and precise data augmentation were confirmed for the task of designing a cross-organ algorithm.
Chapter 5
Conclusion
This work focused on paving the way for innovative methods to analyse glandular morphologies within histopathology images. The objective is to considerably improve the quality of cancer diagnosis, grading and treatment selection. In particular, this work focused on validating stroma segmentation as a route to a more general segmentation of glands.
The difficulty lies once again in the differences in morphology between glands of the same organ as well as between different organs. The network needs to learn complex features, as the stroma can range from thin lines between two nearby glands to large parts of the image. To this end, transfer learning for feature extraction, normalisation of the input, online training and contour segmentation showed promising results. The networks, trained with images from the colon, were able to yield relevant results for stroma segmentation on colon screenings and also promising results on breast images. Segmenting the stroma is therefore a relevant idea to investigate when designing algorithms able to support clinical decisions and expert annotations. This preliminary investigation of a cross-domain application needs further studies.
Chapter 6
Future work
This preliminary work calls for substantial further development. Several steps would be relevant to investigate in order to design better networks for stroma segmentation. These steps can be divided along three axes: the preparation of training data, network architectures and post-processing.
Regarding the training data, it would be interesting to investigate the results that can be achieved when the training set contains images not from a single organ but from several. Another important fact to take into account is that, so far, the stroma masks are not perfect, as they were computed from the masks of glands.
A network can only get as good as the targets it is given. Using expert annotations of the stroma, rather than deriving them from gland masks, is also a promising option to explore. Investigating the training pipeline can have an important impact and improve the results as well: fine-tuning the data augmentation pipeline could make the difference between a well-performing network and one that beats the state of the art. There are numerous data augmentation techniques that can be added, removed or tweaked to achieve higher segmentation accuracy.
There are several interesting ideas to investigate concerning architectures for stroma segmentation. The first would be designing a more complex architecture to segment the stroma, adding features such as dilated convolutions and skip connections. It is important to keep designing networks that do not overspecialise.
Assessing the impact of different downsampling paths would also be interesting, as Xception and ResNet did not give exactly the same results. Namely, the NAS-net could be a good candidate.
The post-processing steps are quite important, as histopathology screenings do not consist only of stroma and glandular tissue. Starting from an image with the segmentation of the stroma, computing the glandular morphologies is not a trivial task.
Considerable effort and thought should be invested in that direction as well.
More generally, an interesting idea to explore, even though it might lead to models that are too heavy, would be designing a network that segments glands on the one hand and the stroma on the other. After these parallel segmentations, the two pieces of information could be fused to improve the quality and accuracy of the prediction. This could be as simple as combining the outputs of the Xception U-net for gland segmentation and for stroma segmentation in the post-processing steps. It could also be interesting to use the output of stroma segmentation as a prior for gland segmentation.
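The proposed fusion could be as simple as a pixel-wise rule over the two probability maps. The scoring rule below is purely hypothetical, a sketch of the idea rather than a validated method:

```python
import numpy as np

def fuse_predictions(p_gland, p_stroma):
    """Fuse two independent pixel-wise probability maps. The stroma
    probability is treated as evidence *against* gland tissue, and
    vice versa; the class with the higher combined score wins."""
    gland_score = p_gland * (1.0 - p_stroma)
    stroma_score = p_stroma * (1.0 - p_gland)
    return (gland_score > stroma_score).astype(np.uint8)  # 1 = gland, 0 = stroma

p_gland = np.array([[0.9, 0.2], [0.6, 0.1]])   # hypothetical gland-network output
p_stroma = np.array([[0.1, 0.8], [0.3, 0.9]])  # hypothetical stroma-network output
fused = fuse_predictions(p_gland, p_stroma)
```

Where the two networks disagree only weakly, a more refined fusion (e.g. a learned combination layer) would likely be needed, but even this simple rule shows how the stroma output can act as a prior on the gland prediction.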
Appendix A
State of the art in gland segmentation
A.1 Medical image analysis, the advent of deep learning
A.1.1 The importance of glands in cancer detection
Glands are important histological structures present in most organ systems. They are the main mechanism for secreting proteins, carbohydrates and hormones. Different types of glands are found throughout the body of every living being. Based on their function, they are divided into two groups: endocrine and exocrine glands. Endocrine glands secrete substances that circulate through the bloodstream, while exocrine glands secrete substances through a duct onto an outer surface of the body [25]. Examples of endocrine glands are the pituitary gland and the thyroid gland, while sweat glands are an example of exocrine glands.
The endocrine system is a network of glands and organs in the body that produce hormones; it is a regulation system. Each hormone has a different purpose. Together they control how we respond to the changes we encounter in our inner and outer environment, including growth and development of the body and cells, body functions, mood, sexual function and reproduction. Consequently, abnormal behaviour of glands has an important impact [24]. Figure A.1 below shows the major endocrine glands.
FIGURE A.1: Endocrine system (Wikipedia)
Due to their important role in the regulation of the body, the morphology of glands has been widely used by pathologists to assess the degree of malignancy of glandular epithelium. Malignant glandular epithelium is also called adenocarcinoma [2]. An adenocarcinoma is a type of cancerous tumour that can occur in different parts of the body, as glandular epithelium is a type of tissue encountered in the majority of organs. Adenocarcinoma is the most prevalent form of cancer; there is consequently a pressing need to prevent and detect it in time [1].
A.1.2 The challenges of image analytics in pathogenesis quantification

Nowadays, pathologists assess patient biopsies and tissue resections to study the presence and grade of cancer. Histology screenings are studied in order to select the best treatment for the patient. Compared with other diagnostic technologies, tissue analysis is more invasive but provides better insight into the potential disease and the health of the patient. Pathologists assess tissue screenings under a microscope, which leads to diagnoses being affected by subjective judgment. This process is difficult: the staining intensity in the tissue, as well as the morphological and cellular architecture that indicate cancer and many diseases, depend heavily on the organ considered [3]. Indeed, the staining, or colour, of the sample depends on the organ and on the different methods used to process the tissue. The morphology of the gland also depends on the type of gland, and therefore on the organ.
On the other hand, assessing disease malignancy and progression is a complex and multifactorial molecular process. Diseases such as cancer exhibit tissue and cellular heterogeneities, which hinder differentiation between benign and malignant cell formations. The procedure is moreover very time-consuming. Therefore, a growing number of initiatives are trying to bring hospitals into the digital era through bright-field and fluorescence scanners. The aim is to convert needle biopsies and glass slides of tissue specimens into virtual microscopy images.
These high-resolution images will enable the use of advanced computer vision techniques. Figure A.2 below shows different shapes of glands within an adenocarcinoma. Each square of the grid stands for a 50x50 pixel crop. Note the difference in shape and staining intensity (colour) for glands of the same type. These samples show both benign and malignant glands.
FIGURE A.2: Examples of histological sections of colorectal adenocarcinoma from the public dataset GLaS Challenge 2015
The annotating process of histological structures is done by expert pathologists.
Since manual annotation suffers from limited reproducibility and requires considerable effort and time, automated segmentation methods are in high demand in clinical practice. This would allow for improved reliability as well as a reduction of
the workload of pathologists, who could then focus on other tasks. The challenges in gland labelling and segmentation consist in overcoming the following issues [4] [38] [39]:
• Huge variations in the morphology of glands (heterogeneous gland shapes) depending on the organ, the stage of the cancer and the histological grade; statistical shape models are difficult to use in this case
• The variability of intra- and extracellular matrices, which means that background portions of histological screenings contain significant noise
• The existence of screenings where glands touch one another, which makes it hard for automated methods to separate the individual glands
• Differences in tissue preparation procedures (namely sectioning and staining), which can cause inconsistencies and artefacts in the sample
• The use of clinical data, which makes it difficult to evaluate results and create robust deep learning networks; the best approach is to use clinical data for training and curated data for evaluation
A.1.3 State of previous technologies in medical image and gland analysis

The problem of gland detection is usually formulated as two sub-problems: gland labelling, or instance recognition, and segmentation [5] [6]. Gland labelling boils down to predicting whether a particular pixel of the image belongs to a gland or not (binary classification). Segmentation is the related issue of differentiating between glands. Since the interest lies in the morphology of glands, it is important to make sure that two glands that are very close within the image are not recognised as a single, bigger gland by the algorithm [39].
Gland labelling and segmentation have been intensively studied over the last few years. Several methods have been investigated, such as morphology-based methods [7] [8] [9] [10] and graph-based methods [11] [12]. Glands must be recognised individually in order to recover the right shapes and therefore the right diagnosis. Recognition of the exact morphology is thus of the utmost importance.
Gland segmentation alone is insufficient because it cannot detect each individual gland in histological images. Consequently, researchers have put more focus on gland instance recognition. This is essential for morphology assessment, which has proven to be not only a valuable tool for clinical diagnosis but also a necessary step in the cancer grading process.
Even though instance recognition and segmentation in histological screenings is a new subject, it has attracted researchers' interest in recent years. SDS (Simultaneous Detection and Segmentation) was one of the first works to raise these problems and proposed a first framework to solve the different challenges through deep learning [13]. New methods have emerged since, such as the hypercolumn [14] and MNC (Multi-task Network Cascades) [15], which focus on improving the feature extraction process. These algorithms all follow the same process of first detecting an object and then segmenting the instance inside a bounding box.
In the early days of medical image analysis, methods primarily depended on hand-crafted features and prior knowledge of the instances, shapes or structures that are
considered. Until then, masks had been used to detect the object of interest for segmentation. These masks were hand-crafted jointly by computer vision specialists and medical staff, given prior knowledge of the shapes being looked for [22]. Such hand-crafted features (specific kernels, filter banks or anchor boxes) work particularly well in natural images, where objects have a regular form, but not for glands, which show irregular shapes that increase the difficulty of their detection. For gland segmentation, the latest techniques introduced contour segmentation [39] [40] by specialising part of the network on segmenting the contours of glands. Deep learning, being versatile, is a promising and adaptive way to solve the different challenges raised by gland segmentation.
A.2 Introduction to Deep learning and CNN
A.2.1 Deep Learning Architecture
The concept of deep learning was introduced by LeCun and his team at the end of the 20th century [26]. Y. Bengio, well known for his work on deep learning, was working with LeCun at that time; they are both considered fathers of deep learning. What makes deep learning a current hot topic is the important progress made in the last decade in computational power and in the convergence of algorithms. The advent of competitions especially helped in designing more robust networks [27]. Deep learning is a fast-growing part of machine learning. Its structure is inspired by neuron communication within the brain: several layers of neurons are stacked and share information from one layer to the next. The concept boils down to training the network on input data and targets so as to produce the best predictions on previously unseen data.
The network is organised in several layers: one input layer, one output layer and multiple hidden layers. Each layer consists of several neurons that are partially or fully linked to the neurons of the previous and the next layer. Figure A.3 below illustrates a simple neural network and its different layers: in yellow the input layer, in blue and green the two hidden layers, and in red the output layer. The arrows represent the connections between neurons. The term deep learning comes from the fact that more and more hidden layers are added, going deeper into the network [30]. The output layer yields the prediction; in the illustration a training example is a 3D vector, and for each training example we want the network to yield a prediction as close as possible to the target. The phase in which the network is updated to best fit the training data is called the training phase.
FIGURE A.3: Simple neural network
Each neuron stores data before propagation to the next layer. If x_i is the data stored in layer i (also called the activation vector of layer i), then the relation between the activations of two consecutive layers is given by equation (A.1).
x_{i+1} = s(W_{i+1} x_i + b_{i+1})     (A.1)

W_{i+1} is a matrix called the weights of layer i+1; it scales the activation from layer i to layer i+1. b_{i+1} is called the bias of layer i+1. s is the activation function, whose purpose is to introduce non-linear properties into the network and to scale activations within the desired bounds. The last layer's activation function can differ from the one used in the hidden layers. Common activation functions are the rectified linear function (ReLU) and the softmax function; the latter is usually used in the last layer to output probabilities [33]. ReLU is currently one of the most commonly used functions in hidden layers; see equation (A.2): it yields x or 0.
ReLU(x) = max(0, x)     (A.2)
To find the best parameters, the network needs to be trained. A loss function is introduced to assess how well the network fits the training data: the higher the cost, the worse the performance. The network is improved so as to minimise this cost function. The choice of this function mostly depends on the goal of the network: for classification it is common to use the cross-entropy function [28], while for regression the mean squared error is preferred. Other functions range from logistic loss for logistic regression to hinge loss.
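The activation and loss functions mentioned above can be sketched in a few lines of NumPy:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), applied element-wise (eq. A.2)."""
    return np.maximum(0.0, x)

def softmax(z):
    """Turn raw scores into probabilities that sum to 1 (used in the last layer)."""
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(p, target):
    """Classification loss: penalises a low probability on the true class."""
    return -np.sum(target * np.log(p + 1e-12))

p = softmax(np.array([2.0, 1.0, 0.1]))              # highest score -> highest probability
loss = cross_entropy(p, np.array([1.0, 0.0, 0.0]))  # true class is the first one
```

Minimising the cross-entropy therefore pushes the softmax probability of the true class towards 1.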
Training
The parameters of the network need to be optimised to minimise the cost, ensuring the most accurate prediction on the training set. This optimisation is an iterative process. The training phase works as follows: the first step is a forward pass through the network to compute the predictions and the cost; the second step is the backward pass, or backpropagation, which computes the gradients of the loss function with respect to the parameters of the network, going from the output layer back to the input layer; the third and last step is the update of the parameters with these derivatives, following equation (A.3), where θ represents a parameter, η is the learning rate, which controls the importance given to the derivative, and L is the cost function.
θ_{i+1} = θ_i − η ∂L/∂θ_i     (A.3)
There are several algorithms in the literature for carrying out training: stochastic gradient descent, where an update is made one example at a time; batch gradient descent, where the whole training set is used before an update; and mini-batch gradient descent, where a subset of examples is used to compute each update [29]. The term epoch commonly refers to one entire pass through the training set.
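The three steps of the training phase can be sketched with mini-batch gradient descent on a toy regression problem (NumPy only, mean squared error as the loss; the problem itself is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: fit y = 3x with a single parameter theta and MSE loss.
x = rng.normal(size=100)
y = 3.0 * x

theta, eta = 0.0, 0.1      # parameter and learning rate (eq. A.3)
batch_size = 10

for epoch in range(20):                        # one epoch = one pass over the data
    order = rng.permutation(len(x))            # shuffle examples each epoch
    for start in range(0, len(x), batch_size):
        b = order[start:start + batch_size]    # mini-batch indices
        pred = theta * x[b]                    # 1. forward pass
        grad = 2.0 * np.mean((pred - y[b]) * x[b])  # 2. gradient of MSE w.r.t. theta
        theta -= eta * grad                    # 3. update: theta <- theta - eta*grad
```

After a few epochs theta converges to the true slope of 3, illustrating how repeated small updates over mini-batches minimise the cost.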
A.2.2 Convolutional Neural Networks

Introduction
Hubel and Wiesel showed that cat and monkey visual cortices contain neurons that individually respond to small regions of the visual field. Neighbouring cells have overlapping receptive fields, whose size and location vary across the cortex to form a visual map of the space [20]. Convolutions were designed in the Convolutional Neural Network (CNN) with this intuition of how an individual neuron responds to a visual stimulus: each neuron only processes data within its own receptive field. CNNs were created specifically to address the problems of computer vision [31].
Although fully connected feed-forward neural networks can be used for feature extraction and for image and data classification, their architecture is not optimised for image analysis. Processing images of large sizes requires a very large number of neurons, as each pixel of the image becomes an input variable.
The convolution operation reduces the problem to a much smaller number of parameters. Using fewer parameters also helps with the traditional problems of vanishing and exploding gradients in deep architectures during backpropagation, as well as preventing overfitting. Thanks to their receptive fields and reduced number of parameters, CNNs show formidable efficiency and robustness in image classification, object recognition and detection [16] and neural style transfer [32]. They rely on two very important principles: parameter sharing and sparsity of connections.
Architecture of layers
A standard CNN consists of three types of layers: convolutional layers, pooling layers and fully connected layers. In a convolutional layer, the input is convolved with a kernel to create a new map. More precisely, in such a layer the input is convolved several times with different kernels, yielding several maps that are concatenated at the end of the process. Figure A.4 below shows a convolution and a transposed convolution, whose purpose is not to downsample but to upsample the data [27]. In the leftmost image of figure A.4, the resulting map (top blue matrix) has the same size as the input data (lighter blue at the bottom), whereas in the rightmost image the output map is bigger than the input map. The white squares correspond to zeros added to the input; this process is called padding and is used to control the size of the output map.
FIGURE A.4: Convolution (left) and transposed convolution (right)
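A minimal NumPy sketch of a convolutional layer with "same" zero padding; as in most deep learning frameworks, the kernel is applied without flipping (i.e. as cross-correlation):

```python
import numpy as np

def conv2d_same(image, kernel):
    """2-D convolution with zero padding so the output keeps the
    input size ('same' padding, as in the left part of figure A.4)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))   # zeros around the border
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)
edge = np.array([[1.0, 0.0, -1.0]] * 3)   # a simple vertical-edge kernel
feature_map = conv2d_same(image, edge)    # same 4x4 size as the input
```

A convolutional layer applies many such kernels to the same input and stacks the resulting feature maps; here a single hand-written kernel suffices to show the mechanics.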
Each feature map is the result of convolving the same input image with a different kernel/filter. The purpose of the convolutional layer is to specialise the
neurons towards specific features in specific areas of the input image.
Pooling layers are designed to reduce the size of images while keeping the important features of the input; this is a common way to downsample a picture. Nothing is learned during gradient descent in this layer. The currently widespread pooling method is called MaxPooling; it ensures that the important features and information are kept. Figure A.5 shows an example of MaxPooling with a 2x2 kernel.
The maximum of each coloured region is used to create the output feature map.
FIGURE A.5: 2x2 MaxPooling (Wikipedia)
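2x2 MaxPooling is simple enough to sketch directly in NumPy:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 MaxPooling: keep the maximum of each non-overlapping 2x2 region,
    halving the spatial size. Nothing is learned in this layer."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [2, 2, 0, 1],
              [9, 8, 3, 4]])
pooled = max_pool_2x2(x)   # 4x4 -> 2x2
```

Each output value is the maximum of one coloured region in figure A.5; only the strongest activation of each region survives the downsampling.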
The last layers, used for classification, are commonly fully connected layers, in which every neuron of the previous layer is directly linked to all the neurons of the next layer. They are commonly used for image classification, adding the softmax function in order to yield, for each pixel, the probabilities of belonging to the different classes [31]. Nowadays, a fully connected layer can be seen as an appropriate 1x1 convolution. Classification is therefore possible with fully convolutional networks, as well as segmentation; segmentation is handled not by adding fully connected layers but by upsampling layers with transposed convolutions.
During the training phase, only the fully connected layers and convolutional layers learn new parameters. The weights and biases of the fully connected layers and of the kernels are updated after backpropagation, following equation (A.3).
Fully convolutional neural networks
The recent advances in deep learning have led to an increased interest in machine learning and computer vision techniques for image classification [16] [17] and object detection [18]. Namely, these advances put forward a new kind of deep learning algorithm based on the Convolutional Neural Network (CNN), called the Fully Convolutional Network (FCN). In previous CNNs, the last layers were fully connected (FC) layers that took the output of the convolutional layers, usually a 3D or 4D volume of extracted features, flattened it and passed it through a softmax unit for image classification. The FCN is made only of convolutional layers, meaning the last fully connected layers are convolutional as well. It enabled end-to-end training and testing for image labelling [19]. Detectors learn hierarchically embedded multiscale edge fields to directly take into account the low-, mid- and high-level information in contours and object boundaries. With the correct target it is therefore virtually possible to learn any kind of feature. The latest approaches in gland segmentation have tried to divide the problem into two sub-problems for the network: pixel differentiation between gland and background, and contour detection to separate individual glands [39] [40].
A.3 State of the art for gland detection
A.3.1 State of the art in image segmentation techniques
Before the advent of fully convolutional neural networks, researchers used, and still use, several image segmentation techniques based on [47]:
Thresholding: the simplest method; the image is divided with respect to pixel intensity. This works well when there is strong contrast between objects and the background, which is not the case in gland segmentation.

Edge discontinuity: based on rapid changes of intensity in an image. It is, once again, efficient only when there is strong contrast.

Region discontinuity: the image is divided into regions that share similar characteristics. A common method is region growing from seeds.

Clustering: there are two categories of clustering, hierarchical and partition-based. Both try to divide pixels based on their characteristics.

Watershed method: uses a topological interpretation of the image. The intensity is seen as basins with a hole at their minima from where water spills; when the water reaches the border of a basin, adjacent basins are merged, until the whole image is segmented.

Handcrafted features [42]: features extracted from the images; they do not fare well in a general context but can be efficient for specific tasks.

Table A.1 below summarises the advantages and disadvantages of these techniques.
TABLE A.1: Advantages and drawbacks of segmentation techniques
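Of these classical techniques, thresholding is the easiest to illustrate; a minimal sketch on a hypothetical intensity image:

```python
import numpy as np

def threshold_segment(image, t):
    """Simplest segmentation: split pixels by intensity. It only works
    when object and background contrast strongly, which is rarely the
    case in histopathology images."""
    return (image > t).astype(np.uint8)

image = np.array([[10, 200, 220],
                  [15, 180,  30],
                  [12,  14, 240]])   # hypothetical grey-level values
mask = threshold_segment(image, 128)
```

The binary mask separates bright from dark pixels, but any overlap between the intensity distributions of gland and stroma, which is the norm in stained tissue, breaks this approach.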
The techniques this work focuses on are those involving artificial neural networks. To understand how image segmentation is done with such a network, several concepts must be explained. The first important concept is transfer learning. Learning features from unstructured data such as images requires a tremendous amount of data, tens of thousands of images. One usually does not own enough data to robustly train a network to work in the real world.
Therefore, researchers have made various networks available with pre-trained weights for everyone to use in their own work. Using part of a network pre-trained on hundreds of thousands of images guarantees an efficient feature extractor for the problem at hand [46]: the knowledge learnt from the larger dataset (i.e. the weights) is used to learn relevant features on the smaller dataset. This is especially important for gland segmentation, as no large dataset exists for it as of now.
One of the current best performing networks on the ImageNet dataset is called Xception [48]. ImageNet is a dataset of a few million images spanning more than a thousand classes. Xception introduces a very computationally efficient way to grasp multi-scale features by using 1x1, 3x3 and 5x5 convolutions. An inception block is shown in figure A.6 below; it uses different tricks to boost performance. The GoogLeNet is based on this architecture; it is composed of 9 inception blocks stacked linearly.
FIGURE A.6: Inception block
This network has given the best results on the ImageNet challenge and is therefore an excellent general feature extractor. This network, along with the Residual Network (also known as ResNet) [44], is an efficient pre-trained network for image segmentation and classification. The residual network relies on residual blocks, which allow the network to easily learn the identity function; it thus becomes possible to create deeper architectures without making training more difficult. Figure A.7 below shows a residual block: it boils down to adding the activation from two layers before the current one, just before the activation function. This is called a skip connection.
FIGURE A.7: Residual block
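A residual block can be sketched in a few lines; the two-matrix formulation below is a simplification of a real residual unit, but it shows why skip connections make deep networks easy to train: with zero weights the block reduces to the identity function.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """Residual block: two weighted layers, then the input is added back
    (skip connection) just before the final activation (figure A.7)."""
    h = relu(w1 @ x)          # first layer
    h = w2 @ h                # second layer, pre-activation
    return relu(h + x)        # skip connection: add the input, then activate

x = np.ones(3)
# With zero weights the block passes its input through unchanged,
# so an 'unneeded' block cannot hurt the performance of a deep stack.
out = residual_block(x, np.zeros((3, 3)), np.zeros((3, 3)))
```

A stack of such blocks can therefore only refine the signal relative to the identity, instead of having to relearn it at every depth.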
The last important general concept is deep supervision. It helps networks learn important features in lower layers and makes training easier [34]. This approach is becoming more and more common in deep learning, as it helps the network learn
important features in the first layers. It also prevents the gradients from exploding or vanishing during training. It is implemented by taking the activation at some point in the network, upsampling it back to the input size and feeding it to an auxiliary classifier that is taken into account in the final loss of the network. This concept is explained more precisely in appendix G.
Moving to the specific case of medical image segmentation and gland instance recognition, U-nets have given excellent results [45]. U-Nets concatenate the feature maps from the downsampling path with the feature maps of the upsampling path, which has improved the precision of the prediction [41]. They are able to learn deep and meaningful features with few training images and are very fast. Figure A.8 below shows an example of a U-Net architecture: the features from the downsampling path are concatenated with the features at the same level in the upsampling path.
FIGURE A.8: U-Net architecture from "U-Net: Convolutional Networks for Biomedical Image Segmentation" by O. Ronneberger
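The skip connection at the heart of the U-Net, upsampling the decoder features and concatenating them channel-wise with the encoder features of the same level, can be sketched on feature-map shapes alone (nearest-neighbour upsampling is used here for simplicity; real U-Nets use transposed convolutions):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (channels, h, w) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def unet_merge(decoder_features, encoder_features):
    """U-Net skip connection: upsample the decoder features and concatenate
    them channel-wise with the same-level encoder features."""
    up = upsample2x(decoder_features)
    return np.concatenate([encoder_features, up], axis=0)

enc = np.zeros((64, 32, 32))    # encoder feature map at some level
dec = np.zeros((128, 16, 16))   # deeper decoder feature map
merged = unet_merge(dec, enc)   # 64 + 128 channels at the encoder resolution
```

The concatenated map carries both the high-resolution spatial detail of the encoder and the semantic content of the decoder, which is what makes the boundaries of the prediction precise.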
This architecture is very relevant to gland segmentation. Most of the specialised architectures used lately are based on this concatenation of features from the downsampling path with the upsampling path. The current best performing architecture in gland segmentation, the MILD-net (Minimal Information Loss Dilated Network), covered in the next section, uses this principle as well.
A.3.2 Best performing architectures
Gland segmentation is a more specific and harder task than general segmentation. Namely, high resolution is needed to precisely delimit the glands. It is also necessary to take into account morphological differences in glands depending on the disease and its grade. Researchers have implemented different architectures to specifically tackle gland segmentation, such as the Deep MultiChannel Network (see Appendix B) and the Deep Contour-Aware Network (see Appendix H). This section presents the current best architecture for gland segmentation.
The MILD-Net proposes an original framework by incorporating the downsampled original image back into the residual units of the CNN after max pooling. In figure A.9 one can see these minimal information loss units (lower green box). These units prevent the network from losing information as it goes deeper, since the input is regularly added back into the network. The network also introduces residual units (lower blue block) that add a shortcut from the activation of layer L to layer L+2, before the activation function, which helps train deeper networks. The intuition behind residual units is that it is easy for them to learn the identity function, so they will not negatively impact performance.
FIGURE A.9: Blueprint of MILD-Net from “MILD-Net: Minimal Information Loss Dilated Network for Gland Instance Segmentation in Colon Histology Images” by S. Graham
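The residual shortcut described above can be sketched with a toy numpy unit; the two linear layers stand in for the convolutions of a real network, and the shapes are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_unit(x, w1, w2):
    """Toy residual unit: two linear layers plus a shortcut from the
    input, added back just before the final activation."""
    out = relu(x @ w1)
    out = out @ w2
    return relu(out + x)  # shortcut: layer L added to layer L+2

# If the learned transformation collapses to zero, the unit reduces to
# the identity (for non-negative inputs), which is why stacking such
# units does not hurt the optimisation of deeper networks.
x = np.abs(np.random.rand(4, 8))
zeros = np.zeros((8, 8))
identity_out = residual_unit(x, zeros, zeros)
assert np.allclose(identity_out, x)
```

This is the property referred to in the text: a residual unit can do no worse than passing its input through unchanged.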
This network yields the current best results on the dataset provided by the GlaS challenge. Another difference from earlier networks [23] is the use of dilated convolutions (lower pink block), which increase the receptive field of the neurons without adding more parameters. This is efficient for this problem, as glands can have different scales depending on the grade of the cancer. Figure A.10 gives an example of such a convolution [36]. The inputs used to compute the new feature map are spread out.
The network is still able to extract the right features with a lower computational cost.
FIGURE A.10: Dilated convolution
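A minimal 1-D numpy sketch of the dilated convolution described above (the signal and kernel are illustrative): the kernel taps are spread `dilation` steps apart, so the receptive field grows while the parameter count stays fixed.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation):
    """1-D dilated convolution with valid padding: kernel taps are
    spaced `dilation` samples apart, enlarging the receptive field
    without adding parameters."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # effective receptive field
    return np.array([
        sum(kernel[j] * signal[i + j * dilation] for j in range(k))
        for i in range(len(signal) - span + 1)
    ])

signal = np.arange(10, dtype=float)
kernel = np.ones(3)

# dilation=1 is an ordinary convolution (receptive field 3);
# dilation=2 covers 5 input samples with the same 3 parameters.
assert len(dilated_conv1d(signal, kernel, 1)) == 8
assert len(dilated_conv1d(signal, kernel, 2)) == 6
```

With dilation 2 the first output sums samples 0, 2 and 4, matching the "spread out" sampling pattern shown in the figure.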
Figure A.9 also shows dropout layers in purple. Dropout is a method used to prevent the network from overfitting the training set. It prevents the neurons from becoming too interdependent by randomly shutting down some neurons during the forward and backward passes of the training phase [35]. Table A.2 below shows the performance of the MILD-Net compared to other high-performing architectures. One can see that the MILD-Net clearly outperforms all its predecessors.
TABLE A.2: Results of the MILD-Net
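The dropout mechanism described above can be sketched in a few lines of numpy; this is the "inverted" variant commonly used in practice, with an illustrative rate and tensor size:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: randomly zero a fraction `rate` of the
    neurons during training and rescale the survivors, so the
    expected activation is unchanged and no rescaling is needed
    at test time."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(1000)
out = dropout(a, rate=0.5, rng=rng)

# Survivors are scaled to 2.0, dropped neurons to 0.0; the mean stays
# close to 1.0 in expectation, so the layer statistics are preserved.
assert set(np.unique(out)) <= {0.0, 2.0}
```

Because each neuron may vanish on any given pass, no neuron can rely on a specific co-activated partner, which is the interdependence-breaking effect mentioned above.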
Figure A.11 shows, in the same manner as figure A.10, the segmentation results of several previous networks compared to the MILD-Net (last row). The MILD-Net performs better than most, particularly in the highlighted rectangular areas.
FIGURE A.11: Comparison of segmentation results for different networks including the MILD-Net (last row) from “MILD-Net: Minimal Information Loss Dilated Network for Gland Instance Segmentation in Colon Histology Images” by S. Graham
A.3.3 Conclusion
Gland segmentation has attracted much research interest in recent years. Most approaches focused on the glands themselves, yielding better and better results over the years but lacking the robustness and stability needed for cross-organ gland segmentation. This work will explain how the techniques introduced above, leveraged together, can help build an algorithm for cross-organ segmentation by focusing not only on the glands but also on the stroma, the tissue that lies between the glands.
Appendix B
Xception and ResNet architectures
FIGURE B.1: Xception
Xception architecture: the data first goes through the entry flow, then through the middle flow, which is repeated eight times, and finally through the exit flow. Note that all Convolution and SeparableConvolution layers are followed by batch normalization (not included in the diagram). All SeparableConvolution layers use a depth multiplier of 1 (no depth expansion). Image taken from “Xception: Deep Learning with Depthwise Separable Convolutions” by Google.
FIGURE B.2: ResNet
ResNet50 architecture: several residual learning blocks are stacked linearly. The end of the network can be changed depending on the application of the algorithm.
Appendix C
Prediction results with Xception - training on stroma
FIGURE C.1: Predictions by Xception U-Net on stroma