
Linköping University | Department of Computer and Information Science
Master's thesis, 30 ECTS | Statistics and Machine Learning
2021 | LIU-IDA/STAT-A--21/022--SE
Linköpings universitet, SE-581 83 Linköping

Patch-level classification of brain tumor tissue in digital histopathology slides with Deep Learning

Vasileia Kampouraki

Supervisor: Anders Eklund
Examiner: Annika Tillander



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Histopathology refers to the visual inspection of tissue under the microscope and is a core part of diagnosis. Manual inspection of histopathology slides is very time-consuming for pathologists and error-prone. Furthermore, diagnosis can sometimes differ among specialists. In recent years, convolutional neural networks (CNNs) have demonstrated remarkable performance in the classification of digital histopathology images. However, due to their high resolution, whole-slide images are of immense size, often gigapixels, making it infeasible to train CNNs directly on them. Instead, patch-level classification is used.

In this study, a deep learning approach for patch-level classification of glioblastoma (i.e. brain cancer) tissue is proposed. Four state-of-the-art models were evaluated (MobileNetV2, ResNet50, ResNet152V2, and VGG16), with MobileNetV2 being the best among them, achieving 80% test accuracy. The study also proposes a scratch-trained CNN architecture, inspired by the popular VGG16 model, which achieved 81% accuracy. Both models scored 87% test accuracy when trained with data augmentation. All models were trained and tested on randomly sampled patches from the Ivy GAP dataset, which consisted of 724 H&E images in total. Finally, as patch-level predictions cannot be used directly by pathologists, prediction results from two slides were presented in the form of whole-slide images. Post-processing was also performed on those two predicted WSIs in order to exploit the spatial correlations among the patches and increase the classification accuracy. The models were statistically compared using the Wilcoxon signed-rank test.


Acknowledgments

My greatest gratitude goes to my supervisors, Anders Eklund and Neda Haj-Hosseini, for giving me the opportunity to work on such an exciting thesis topic. Their guidance enlightened my way during this research and their positive energy made this journey a delightful experience. Thank you for the tremendous trust you showed in me.

Special thanks to Linköping University for providing me with all the hardware needed for this study and for giving me the opportunity to study this Master’s program.

I would like to thank all my amazing friends in Sweden for making those two years special and for sharing the burden during stressful times. I also extend my gratitude to all my loved ones in Greece. Even though they were far away, they were continuously supporting me.

Finally, the biggest thanks goes to my family; they have always been there for me and without their support and sacrifices, this life experience wouldn’t have been possible.

"We are what we repeatedly do. Excellence, then, is not an act, but a habit." - Aristotle


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Background
1.1.1 Glioblastoma
1.1.2 Histopathology
1.1.3 Digital pathology
1.2 Related work
1.3 Challenges
1.4 Objectives

2 Data
2.1 Data source
2.2 Glioblastoma anatomic labels

3 Theory
3.1 Artificial Neural Networks
3.2 Convolutional Neural Networks
3.2.1 Optimization
3.3 Transfer learning
3.4 Batch normalization
3.5 Regularization techniques
3.6 Evaluation metrics
3.6.1 Confusion matrix
3.6.2 Accuracy
3.6.3 Precision
3.6.4 Recall
3.6.5 F1-score
3.6.6 ROC curve

4 Methods
4.1 Workflow overview
4.2 Image preprocessing
4.2.1 Stain normalization
4.2.2 Patch extraction
4.3 Patch-level classification
4.3.1 Scratch-trained CNN
4.3.2 Pretrained CNN
4.3.3 Data augmentation
4.3.4 Post-processing
4.4 Statistical comparisons

5 Results
5.1 Model evaluation
5.2 Post-processing
5.2.1 Patient W10
5.2.2 Patient W6
5.3 Model comparison

6 Discussion
6.1 Methods
6.2 Results
6.3 Ethical considerations

7 Conclusion

Bibliography


List of Figures

2.1 Every H&E image of the Ivy GAP dataset has its corresponding label map. An example of a WSI and its label map is shown on the left [34]. On the right are the feature annotations along with their corresponding color in the label map.

2.2 Image of a resected tumor. Every tumor of the dataset was subdivided into tissue blocks of ∼9 × 7.5 mm, illustrated by the tiles. W5-1-1 represents the patient details in an encoded form; a patient with ID 5 has undergone one surgery, and one tumor was resected.

2.3 Example WSI of the excluded patient. The second image shows the inaccurate tumor borders.

2.4 Percentage of each tissue class in the dataset. Glioblastoma is defined by three major anatomic regions: Cellular Tumor (CT), Leading Edge (LE) and Infiltrating Tumor (IT). Within those regions there exist structural features such as Microvascular Proliferation (CTmvp), Pseudopalisading cells around necrosis (CTpan), Pseudopalisading cells no necrosis (CTpnn), Perinecrotic Zone (CTpnz), Hyperplastic Blood Vessels (HBV) and Necrosis (NEC).

3.1 An artificial neural network with one hidden layer. The nodes represent the input, hidden and output variables and the links represent the weights. The links coming from the red nodes represent the biases.

3.2 A typical architecture of a CNN. It consists of two parts: feature extraction and classification. Each convolutional layer has many filters that learn to extract important features. In the pooling layers, the image size is reduced along with the number of parameters to be learned. The fully connected layers receive as input the flattened output of the last pooling layer and the last fully connected layer performs the classification. Image from A combined deep CNN-LSTM network for the detection of novel coronavirus (COVID-19) using X-ray images by Zabirul Islam, 2020. Licensed under CC BY-NC-ND [20].

3.3 Example of a convolution operation, which involves the multiplication of a filter with the input data. Image from Deep Learning: A Practitioner's Approach by Josh Patterson and Adam Gibson (O'Reilly). Copyright 2017 Josh Patterson and Adam Gibson [33].

3.4 Example of an average and a max pooling operation for an input of size 4 × 4, a filter of size 2 × 2, and a stride of 2. The reduced size of the input after the pooling operation is 2 × 2. Image from Design of Power-Efficient Training Accelerator for Convolution Neural Networks. Licensed under CC BY 4.0 [18].

3.5 Left: Neural network with two hidden layers. Right: A random thinned network after applying dropout to the initial network on the left.

3.6 The training error decreases while the validation error starts increasing after some epochs; this is a sign of overfitting. With early stopping, the training is stopped at the point where the minimum validation error is achieved. Early stopping requires separate validation data, which are not used for training or testing.

3.7 Example of a ROC curve and its AUC value. The dashed line represents the case of random guessing.

4.1 Illustration of the steps involved in the preprocessing, training and post-processing phase of the classification task. Patches are extracted from the stain normalized WSIs and they are used for classification. The predicted patches are assembled back together resulting in a predicted WSI which finally gets smoothed using a post-processing algorithm.

4.2 Comparison of Macenko and Vahadane stain normalization methods on two different images.

4.3 Patch extraction from the original WSIs.

4.4 Patches representing all nine different parts of the tumor and glass background.

4.5 Diagram of the proposed 2D CNN architecture. The numbers in brackets indicate the output dimensions of every convolution block after max-pooling. The output of the 5th convolution block results in 12,800 dimensions after being flattened.

4.6 MobileNetV2 architecture. There are stride 1 blocks and stride 2 blocks [35] © 2018 IEEE.

4.7 Detailed presentation of the layers of the original MobileNetV2, where t: expansion factor, c: output channels, n: number of times a layer is repeated, and s: stride [35] © 2018 IEEE.

4.8 Three different data augmentation techniques applied on a cellular tumor (CT) patch. From top to bottom: 90° rotations, horizontal flips, shifts in brightness.

5.1 Confusion matrices of the proposed CNN architecture and the pretrained model, MobileNetV2. The diagonal of the matrices represents the correctly classified instances for each class. The color intensity increases with the number of classified instances.

5.2 ROC curves of the scratch-trained and the pretrained model, trained without and with data augmentation. The legend inside each plot indicates the AUC value of each class.

5.3 Original WSI and label map of patient W10. There are three anatomic features in this slide: CT (green), IT (pink) and LE (petrol).

5.4 From left to right: Original label map, predicted image, smoothed image with kernel size 448 × 448, 672 × 672, and 896 × 896 pixels.

5.5 Original WSI and label map of patient W6. There are five anatomic features in this slide: CT (green), IT (pink), LE (petrol), NEC (black) and CTpnz (light blue).

5.6 From left to right: Original label map, predicted image, smoothed image with kernel size 448 × 448, 672 × 672, and 896 × 896 pixels.


List of Tables

3.1 Abstract representation of a confusion matrix for a binary classification problem. The correct predictions are located in the diagonal of the matrix.

4.1 Structure of the residual bottleneck block which transforms input with height h, width w, and k channels to output with height h/s, width w/s, and k' channels, where s: stride, and t: expansion factor [35] © 2018 IEEE.

5.1 Test accuracy achieved from four state-of-the-art models, along with the number of frozen layers and trainable parameters.

5.2 Test accuracy achieved from the suggested and the pretrained CNN. The results of those two models trained with data augmentation are also presented.

5.3 Precision, Recall, F1-score and area under the ROC curve (AUC) for every model and every class. The weighted AUC value of every model is calculated as the sum of the AUC of every class weighted by the support of the respective class.

5.4 Prediction results for the patches coming from a WSI of patient W10. Evaluation metrics for every class are also presented. This table is connected to the visual representation of the prediction results in Figure 5.4.

5.5 Patch-level accuracies of the different kernel sizes used for image smoothing for patient W10.

5.6 Prediction results for the patches coming from a WSI of patient W6. This table is connected to the visual representation of the prediction results in Figure 5.6.

5.7 Patch-level accuracies of the different kernel sizes used for image smoothing for patient W6.

5.8 Statistical comparisons between the classifiers using the Wilcoxon signed-rank test.

1

Introduction

1.1 Background

1.1.1 Glioblastoma

Glioblastoma (GBM) is the most common and aggressive primary malignant brain tumor. The survival time is estimated to be around 15 months after receiving treatment, and the median age at diagnosis is 64 years. In some very rare cases, patients may survive up to 5 years. According to the 2013 Central Brain Tumor Registry of the United States (CBTRUS) report, the average annual incidence rate of GBM is 3.19/100,000 population [44, 32]. GBMs can be classified into two categories: primary, or de novo, and secondary. Primary GBMs appear without a known precursor, while secondary GBMs arise from low-grade tumors that later evolve into GBM. The first step of the diagnosis process usually involves a computed tomography (CT) or magnetic resonance imaging (MRI) scan, while the core stage of diagnosis entails biopsy or surgical resection. The factors that induce the occurrence of glioblastoma cannot be precisely defined, and lifestyle factors such as smoking, alcohol, drugs, or the use of mobile phones have not been proven to be associated with GBM. Even though aggressive treatments, including resection, are applied to glioblastoma patients, most tumors recur [7].

1.1.2 Histopathology

Histopathology refers to the study of changes in tissues caused by a disease or disorder [36]. When there is suspicion of malignancy, doctors usually take a biopsy, which they examine under a microscope. Before reaching the microscope, the biopsy specimen must be prepared. The first step of tissue slide preparation is chemical fixation, which protects the tissue from damage and decay. After fixation, the sample is dehydrated by replacing the water with alcohol, and the tissue is embedded in wax. The specimen is finally cut into thin sections that can be placed on glass slides.

In order to examine the tissue samples placed on glass and perform diagnosis, pathologists use microscopes. If no staining is used, the light from the microscope will pass through the sample and no information will be visible. Therefore, staining is important in order to make the tissue visible when placed under the microscope. Stains highlight different tissue components and offer a colored image.


There are several options for staining. The most popular stain used in histology is Hematoxylin and Eosin (H&E). Hematoxylin is responsible for the blue color observed in H&E images and highlights the cell nuclei, while Eosin stains the extracellular matrix and cytoplasm with a pink color. This method has been used by pathologists for over a century, and even today it remains a gold-standard method for diagnosis.

1.1.3 Digital pathology

The increasing cancer incidence rate has made diagnosis and grading of cancer a tedious procedure. Furthermore, pathologists need to extract several quantitative parameters, such as surface areas and lengths, for widely used grading systems [25]. Consequently, there is an immense need for doctors’ workflow assistance.

Thanks to the digitization of histopathology images, many complicated processes have been simplified. Digital pathology is a breakthrough technology established in the field of medicine around 20 years ago, when whole-slide imaging scanners became commercially available, and it has been gaining momentum since. Apart from image digitization, digital pathology also incorporates the analysis, segmentation, and diagnosis of digitized images. The advent of whole-slide images (WSIs), which are digitized histopathology slides, substantially enhanced pathology diagnosis by enabling computer-aided diagnosis, which automates the traditional diagnosis process and gives doctors a remarkable boost to their workflow. Although artificial intelligence (AI) used to be applied mostly for image-based diagnosis in radiology and cardiology, AI applications have seen an increase in pathology as well [30]. Pathologists and oncologists are the target group of this technology. Apart from speeding up the examination of histopathology images, a primary aim of digital pathology is to increase the accuracy of diagnosis, which is strongly affected by human factors. It therefore serves as a diagnosis enhancement tool for experts [4].

1.2 Related work

Even though several papers cover the topic of digital pathology and demonstrate different deep learning (DL) approaches for classification or segmentation, to the best of our knowledge, no paper has used the Ivy GAP dataset for classification [34]. Therefore, no direct comparison can be made between this thesis and the existing literature. However, plenty of interesting approaches have been applied to other histology datasets, like The Cancer Genome Atlas (TCGA), that can serve as a guideline for the present thesis as well. Traditional machine learning (ML) was initially the standard method for medical imaging [4]. However, DL has become increasingly popular, and CNNs have proven to offer state-of-the-art results, leading to a shift towards them. One substantial drawback of traditional ML compared to DL is that ML first needs hand-crafted image features to be computed for the diagnosis process, whereas DL incorporates feature identification into the training phase [9].

A recent paper by [21] presented a modified version of the popular VGG16 model. The PatchCamelyon dataset, with lymph node images of size 96 × 96 pixels and 10x magnification, was used for binary classification. Their best architecture achieved 95.46% area under the curve (AUC) and comprised the Adam optimizer and the Tanh activation function. The authors concluded that the location of the normalization layer plays an important role in model performance. Furthermore, the presence of a dropout layer did not appear to increase performance. Kurc et al. [24] suggested three different methods for the classification of adult diffuse glioma cases into two categories: oligodendroglioma and astrocytoma. Each of those three classification methods combined radiology with pathology images. WSIs were downloaded from the TCGA archive and MRIs were downloaded from the TCIA archive. Each WSI had a corresponding MRI, and their methods achieved accuracy values of 75%, 80%, and 90%. The method achieving the highest accuracy score, 90%, combined the predictions from an MRI and a WSI classification model to assign the final class. The second-best performing method integrated two classification models for MRI and WSI data through dropout-enabled ensemble learning. Finally, the method achieving 75% accuracy built two separate classification models, again for the MRIs and WSIs, and combined their classification results through a weighted average operation. The patch sizes selected for each method, in the order in which they were described, were 224 × 224, 448 × 448, and 512 × 512 pixels, respectively. An interesting approach that combines deep learning with machine learning is described in [48]. The authors used a brain tumor dataset from the MICCAI 2014 Brain Tumor Digital Pathology Challenge for binary classification into glioblastoma multiforme (GBM) and low-grade glioma (LGG). Different patch sizes were extracted depending on the magnification; more precisely, overlapping patches of size 336 × 336 and 672 × 672 pixels were selected for magnifications 20x and 40x, respectively. As a next step, patches corresponding to glass were discarded, and all remaining patches were resized to 224 × 224 pixels and fed to a pretrained CNN, AlexNet. Feature selection was performed only for the binary classification case, and the final classification was performed using a linear SVM classifier. Their method achieved 97.5% accuracy in the MICCAI challenge. Classification accuracy results were also provided for different patch sizes: patches of size 224 × 224, 448 × 448 and 672 × 672 pixels (corresponding to 40x magnification) showed accuracies of 91.1%, 93.3% and 97.8%, respectively. For 20x magnification, the corresponding patch sizes are half of the aforementioned sizes.

The work of [38] compares five popular pretrained networks for the classification of brain cancer patients' survival rate, using histopathology images from TCGA, for three different patch sizes. The highest classification accuracy was achieved using GoogleNet on patches of size 256 × 256 pixels, reporting an impressive 99% accuracy; this classifier was called DeepSurvNet. The paper published by [47] presented the results of training from scratch versus fine-tuning a CNN for a binary tumor classification problem. The authors chose GoogleNet as the CNN architecture, and the workflow was divided into two phases: pre-training and fine-tuning. During pre-training, they optimized GoogleNet's weights using a lymph node metastases dataset. The fine-tuning phase followed a similar pattern but with a different, smaller dataset for prostate cancer. The final comparison indicated that for 4196 prostate cancer training images, fine-tuned GoogleNet scored higher test accuracy than scratch-trained GoogleNet (84.3% vs. 78.8%).

Datasets often consist of images of varying color intensities. This color inconsistency can adversely affect the performance of computer-aided diagnosis (CAD) systems. Therefore, color normalization is a central part of image preprocessing in digital pathology, and several methods have been developed to address this problem. One of the most famous stain normalization algorithms is Macenko [26], which converts RGB values into their corresponding optical density (OD) values and calculates the stain vectors for every image by applying singular value decomposition (SVD) to the foreground pixels. This method uses a target image according to which it modifies any source image's color. Anghel et al. [1] developed an improved version of the Macenko algorithm, which reduces run-time and is more memory efficient. Furthermore, while Macenko does not handle artifacts, leading to a biased color normalization, their method can improve the classification accuracy even when low-quality images are involved. Another popular method that was based on Macenko is Vahadane [45]. Their method also estimates the stain vectors, but uses a dictionary learning approach instead of singular value decomposition. Their approach combines the stain density maps of an image with the stain color basis of a target image. This technique aims to modify an image's color while preserving biological structure information. Recent papers have introduced generative adversarial network (GAN)-based approaches, like CycleGAN [50] and StainGAN [37], that do not use target images. Even though such methods have proved to outperform previous state-of-the-art normalization techniques, they can introduce bias into the normalization. Since they generate their own information instead of modifying the existing information, it is difficult to determine whether the model is biased [14].


After reviewing the available literature on digital pathology and classification with deep learning, the results presented in several papers point to the conclusion that pretrained networks tend to offer higher classification accuracy, and that larger patch sizes are preferred for classification, whereas smaller patch sizes are better suited for segmentation [48]. Furthermore, if the stain normalization method demands the choice of a reference slide, the slide should preferably be picked by an expert. Finally, none of the papers reviewed for this study have focused on making statistical comparisons among different networks; the comparisons were made using evaluation metrics rather than statistical tests.

1.3 Challenges

Although AI has excellent applications in digital pathology and has contributed remarkable advancements, offering speed, convenience, and the possibility of remote diagnosis, several obstacles remain to be tackled.

One of the biggest challenges in digital pathology is the lack of labeled data. Asking pathologists to label images manually does not seem to be the optimal solution. Manual annotation is highly time-consuming and taxing, especially when images are not of high quality, but it is also costly for the organizations that need the annotations. One possible way to address this problem is by using active learning for annotation [3].

Dealing with whole-slide images means having to handle images of immense dimensions, often of size 100,000 × 100,000 pixels or even larger. This makes the conventional application of DL methods impossible, since such large images cannot be fed directly into a neural network. A big part of the DL approach in digital pathology therefore involves preprocessing those large images, where smaller patches (tiles) are extracted and serve as the new data. In some cases, downsampling to a level at which the most important information of the images is still kept may be needed.
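As an illustration, here is a minimal sketch of such patch (tile) extraction using the OpenSlide library; the file name, patch size, and output directory are hypothetical, and tiles are read at full resolution (level 0).

import os
import openslide

slide = openslide.OpenSlide("example_slide.svs")   # hypothetical file name
patch_size = 224
width, height = slide.dimensions                   # level-0 (full) resolution
os.makedirs("patches", exist_ok=True)

# read_region returns an RGBA PIL image; (x, y) is the top-left corner
# of the tile in level-0 coordinates
for y in range(0, height - patch_size + 1, patch_size):
    for x in range(0, width - patch_size + 1, patch_size):
        tile = slide.read_region((x, y), 0, (patch_size, patch_size)).convert("RGB")
        tile.save(f"patches/tile_{x}_{y}.png")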

Anything related to human life comes with high responsibilities as well. Incorporating AI into traditional medicine demands high trust and proven efficiency. It is difficult to build highly reliable algorithms that could theoretically even replace doctors. Additionally, it is hard to gain people's trust in machines, automation, and anything removed from human nature. Therefore, to make those algorithms practically applicable, we need to ensure that they are easily employed, highly trustworthy, and have a positive impact revenue-wise.

1.4 Objectives

This thesis endeavors to apply deep learning techniques to histopathology images by proposing a 2D convolutional neural network for patch-level classification of glioblastoma tissue subregions. As pretrained networks have shown remarkable success in several classification tasks, popular CNN architectures are also considered and compared on this classification task. Patch-level classification is a widely used technique for whole-slide images due to their large size. However, for the results to be useful to an expert, it helps to visualize the networks' predictions in the form of a WSI. Therefore, the results of this study are also illustrated in a way that could practically provide computer-aided diagnosis. Since deep learning has vast potential in the healthcare sector, the primary aim of this thesis is to contribute to the thorough research conducted in the medical imaging community. As mentioned in the related work section 1.2, the Ivy GAP dataset is relatively new, and to the best of our knowledge, no previously published papers have utilized it for classification purposes. Hence, this thesis could serve as a foundation for further improvements through scientific publications. However, it is essential to stress that this research aims to implement widely applicable methods rather than methods tailored to a specific dataset. Specifically, the current study will try to answer the following research questions:


• What is the best standard 2D CNN architecture for the Ivy GAP dataset?

• Does a pretrained network perform significantly better than a CNN trained from scratch?

• Can data augmentation improve the test accuracy?


2

Data

2.1 Data source

The image data provided for this study are from the Ivy Glioblastoma Atlas Project (Ivy GAP), a collaborative partnership between the Ben and Catherine Ivy Foundation, the Allen Institute for Brain Science, and the Ben and Catherine Ivy Center for Advanced Brain Tumor Treatment [34]. This project aims to offer an image archive dedicated to glioblastoma, which could serve as a tool for scientists to perform advanced analysis and potentially propose new methods that could improve the diagnosis and treatment of brain cancer. The Allen Institute offers in situ hybridization (ISH) data, RNA sequencing (RNA-Seq) data, and H&E-stained sections along with their corresponding label images. For this study, only the H&E images and their label maps were used. Figure 2.1 shows a representation of the Ivy GAP dataset, with an example H&E slide, its corresponding label map, and the colors representing the anatomic features of the tumor.

Figure 2.1: Every H&E image of the Ivy GAP dataset has its corresponding label map. An example of a WSI and its label map is shown on the left [34]. On the right are the feature annotations along with their corresponding color in the label map.


The dataset was built upon 42 tumors donated by 41 patients. After receiving the tumors, each tumor was subdivided into tissue blocks as shown in Figure 2.2, where each tile represents one tissue block.¹ The dataset provided for this study consists of 945 H&E-stained glioblastoma slides. From every tissue block, one H&E slide was produced, illustrated with green and yellow cubes in the right part of the figure, and about 22 H&E images were produced in total from each tumor specimen, resulting in the 945 images provided. However, the full Ivy GAP dataset consists of ∼12,000 histological images in total.

Figure 2.2: Image of a resected tumor. Every tumor of the dataset was subdivided into tissue blocks of ∼9 × 7.5 mm, illustrated by the tiles. W5-1-1 represents the patient details in an encoded form; a patient with ID 5 has undergone one surgery, and one tumor was resected.

For annotating the H&E anatomic features in the Ivy GAP dataset, White Marsh Forests, Inc. released a machine learning application called 'Mill', which performed semi-automated annotation. Since a semi-supervised algorithm produced the annotation labels, there were shifts in the annotations for some slides. After manual inspection,² a patient with 41 H&E images was excluded from the dataset due to some notably inaccurate annotations (Figure 2.3). Furthermore, 180 images in the provided dataset included tumor borders; they were discarded, as the borders would confuse the networks and the images could therefore not be used for classification. Eventually, the final dataset used for this study comprised 724 H&E-stained JPG images and their corresponding labeled images.

Figure 2.3: Example WSI of the excluded patient. The second image shows the inaccurate tumor borders.

The whole slides were originally scanned in SVS file format (∼5 GB each) using ScanScope® scanners (Aperio Technologies, Inc.; Vista, CA) at 20x magnification and a resolution of 0.5 µm/pixel. The raw images were converted to JPEG 2000 files of ∼0.4 GB each. The final histology images available for download are ∼15,000 × 18,000 pixels and ∼40 MB each.

¹ Source: http://glioblastoma.alleninstitute.org/ish/specimen/show/287469100
² Source: http://glioblastoma.alleninstitute.org/ish/specimen/show/706601


Figure 2.4: Percentage of each tissue class in the dataset. Glioblastoma is defined by three major anatomic regions: Cellular Tumor (CT), Leading Edge (LE) and Infiltrating Tumor (IT). Within those regions there exist structural features such as Microvascular Proliferation (CTmvp), Pseudopalisading cells around necrosis (CTpan), Pseudopalisading cells no necrosis (CTpnn), Perinecrotic Zone (CTpnz), Hyperplastic Blood Vessels (HBV) and Necrosis (NEC).

2.2 Glioblastoma anatomic labels

The primary anatomic regions are Leading Edge (LE), Infiltrating Tumor (IT), and Cellular Tumor (CT) [43]. Each of those three main regions often includes particular structural features, which are Hyperplastic Blood Vessels (HBV), Microvascular Proliferation (MVP), Necrosis (NE), Perinecrotic Zone (PNZ), and Pseudopalisading Cells around Necrosis (PAN). Since MVP, PAN, and PNZ are found within the CT region, the acronyms CTmvp, CTpan, and CTpnz are used instead. Those subcategories make glioblastoma (grade IV glioma) distinguishable from lower grades of glioma. Necrosis in particular is a characteristic sign of glioblastoma, and its presence is crucial for a tumor to be grade IV, or classified as glioblastoma in the World Health Organization classification system [7]. The white areas of the images, named Empty in the dataset, are the glass background and provide no information for the diagnosis. The percentages of all features in the dataset are presented with a barplot in Figure 2.4. As the barplot indicates, the Ivy GAP dataset is highly imbalanced in the percentage of classes.


3

Theory

This chapter aims to present the theory behind the methods that are implemented in this thesis.

3.1 Artificial Neural Networks

Deep learning (DL) is a subfield of machine learning (ML), and both belong to the broader field of artificial intelligence (AI). DL is a state-of-the-art approach in image processing and gained its popularity due to its high accuracy and its more straightforward application to images compared to traditional machine learning approaches, which are based on hand-crafted feature engineering. Thanks to those advantages, DL has been increasingly used in digital pathology as well.

DL comprises different neural network architectures, with convolutional neural networks (CNNs) being among the most popular ones. CNNs are mostly applied for image processing tasks, including object detection, classification and segmentation, and it is the architecture employed for the classification tasks of the present study.

The base of all deep learning network architectures is artificial neural networks (ANNs). ANNs are computational processing systems that resemble the way biological nervous systems operate [31]. An ANN consists of an input layer, one or more hidden layers, and an output layer. The architecture includes a large number of neurons that strive to learn from the input data and produce an optimized output. If multiple hidden layers exist, then we call it a deep neural network (DNN). However, there is no clear definition of when a network can be considered deep. All types of networks share the same basic components: nodes, edges, weights, biases, and activation functions.

Below, the formulas for an ANN with one hidden layer are presented [5]; for more hidden layers, the procedure is analogous. A neuron takes as input a vector x_i, i = 1, ..., D, where D is the number of input variables, and produces M linear combinations as follows:

a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}


Figure 3.1: An artificial neural network with one hidden layer. The nodes represent the input, hidden and output variables and the links represent the weights. The links coming from the red nodes represent the biases.

where j = 1, ..., M and the superscript (1) refers to the parameters of the first layer. The weights are denoted as w_{ji}^{(1)} and the biases as w_{j0}^{(1)}.

The values a_j are called activations, and they are subsequently transformed using a non-linear activation function h(\cdot) as follows:

z_j = h(a_j)

The choice of the activation function depends on the type of data and the distribution of the target variables [5]. For the multi-class classification task presented in the current thesis, tanh is used as the activation function for the hidden layers and softmax as the activation function for the output layer. Both functions are analyzed in detail in the Optimization subsection 3.2.1. The quantities z_j represent the outputs of the hidden units. Those values are again combined in a similar fashion in order to produce the output unit activations

a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}

where k = 1, ..., K and K denotes the number of outputs. Finally, those unit activations are transformed using the softmax activation function in the output layer to produce the outputs y_k. Let \sigma(\cdot) be the activation function of the output layer. The output can be computed as

y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)

where \mathbf{w} is a vector that contains both the weights and the biases. The evaluation of this expression is known as forward propagation. Figure 3.1 presents all those formulas in the form of a diagram.

The process of learning is based on updating the weights. The network starts with its initial weights, the data are fed into the network (feed-forward), and a prediction is made. The prediction is then compared to the ground truth, and the error is computed through a cost function and propagated back through the network (back-propagation). Back-propagation describes the process of calculating the error gradients and updating the weights. This process is repeated until the prediction error becomes very small and the predictions converge to the ground truth.
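For concreteness, the following is a minimal NumPy sketch of forward propagation through the one-hidden-layer network defined above; the dimensions D, M, and K and the random weights are arbitrary illustrations, not values from this thesis.

import numpy as np

rng = np.random.default_rng(0)
D, M, K = 4, 8, 3                        # inputs, hidden units, outputs

W1 = rng.normal(size=(M, D)); b1 = np.zeros(M)   # first-layer weights and biases
W2 = rng.normal(size=(K, M)); b2 = np.zeros(K)   # second-layer weights and biases

x = rng.normal(size=D)                   # one input vector

a = W1 @ x + b1                          # hidden activations a_j
z = np.tanh(a)                           # hidden outputs z_j = h(a_j)
a_out = W2 @ z + b2                      # output unit activations a_k
y = np.exp(a_out) / np.exp(a_out).sum()  # softmax outputs y_k
print(y, y.sum())                        # class probabilities, summing to 1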

3.2 Convolutional Neural Networks

CNNs are an enriched version of traditional ANNs. A CNN is a deep, feed-forward neural network that takes images as input and generates a predictive output. The name 'convolutional neural network' derives from the architecture, which contains several convolutional layers in which the network learns and extracts image features using convolution filters, or kernels. Each convolutional layer contains many filters that learn to extract different important features. CNNs are so popular for image classification because they exploit the fact that images have local spatial correlations, which substantially reduces the number of trainable weights. The basic CNN architecture consists of convolutional layers, pooling layers, and fully connected layers, as demonstrated in Figure 3.2.

Figure 3.2: A typical architecture of a CNN. It consists of two parts: feature extraction and classification. Each convolutional layer has many filters that learn to extract important features. In the pooling layers, the image size is reduced along with the number of parameters to be learned. The fully connected layers receive as input the flattened output of the last pooling layer and the last fully connected layer performs the classification. Image from A combined deep CNN-LSTM network for the detection of novel coronavirus (COVID-19) using X-ray images by Zabirul Islam, 2020. Licensed under CC BY-NC-ND [20].

An image is represented as a matrix where each entry corresponds to a pixel. The convolution filter slides over the image with a stride, and the values of the image are multiplied by the values of the filter [13]. This multiplication is called the convolution operation, and for an image with pixel coordinates q, r and a filter of size l × m, it is mathematically described as:

S(q, r) = (I * F)(q, r) = \sum_{l} \sum_{m} I(l, m) F(q - l, r - m)

where S is the output of the convolution operation, I is the input image and F is the filter. Since convolution is commutative, the above equation can be written equivalently as:

S(q, r) = (F * I)(q, r) = \sum_{l} \sum_{m} I(q - l, r - m) F(l, m).

The stride defines how many pixels the kernel slides to the right. The convolution operation results in a feature map of smaller dimension than the initial image. In Figure 3.3 an example of a convolution operation is illustrated.

Figure 3.3: Example of a convolution operation, which involves the multiplication of a filter with the input data. Image from Deep Learning: A Practitioner's Approach by Josh Patterson and Adam Gibson (O'Reilly). Copyright 2017 Josh Patterson and Adam Gibson [33].

Convolutional layers are followed by pooling layers, whose primary role is to reduce the dimensionality of feature maps produced by the convolutional layers, which will substantially reduce the number of weights that need to be learned in the last dense layers [4]. The most common types of pooling are the average and max pooling, where a filter slides over the image with a stride s and the pixels inside the area covered by the filter are replaced by the average or the maximum of all pixels, respectively. An example of average and max pooling for an input of size 4 × 4 is shown in Figure 3.4.

Figure 3.4: Example of an average and a max pooling operation for an input of size 4 × 4, a filter of size 2 × 2, and a stride of 2. The reduced size of the input after the pooling operation is 2 × 2. Image from Design of Power-Efficient Training Accelerator for Convolution Neural Networks. Licensed under CC BY 4.0 [18].


The fully connected (dense) layers are a type of ANN (described in Section 3.1), where all the neurons of one layer are connected to all the neurons of the next layer. They are the last layers of the CNN and they take as input the flattened output (a vector) of the last convolutional or pooling layer. The final layer uses an activation function that classifies the input as the class with the highest probability.
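As a toy illustration of these two operations, the following NumPy snippet performs one valid convolution (implemented, as in most DL frameworks, as cross-correlation) and one 2 × 2 max pooling with stride 2 on a made-up 4 × 4 input.

import numpy as np

image = np.arange(16.0).reshape(4, 4)         # made-up 4 x 4 input
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])              # 2 x 2 filter

# valid convolution with stride 1: slide the filter over the image
conv = np.zeros((3, 3))
for q in range(3):
    for r in range(3):
        conv[q, r] = np.sum(image[q:q + 2, r:r + 2] * kernel)

# 2 x 2 max pooling with stride 2 reduces the 4 x 4 input to 2 x 2
pooled = image.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(conv)
print(pooled)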

3.2.1 Optimization

DL algorithms are trained using an optimization process. The term optimization refers to the minimization or maximization of a function. When the task is minimization, the function is usually called cost function, loss function, or error function [13].

For multi-class classification problems, categorical cross-entropy is chosen as the loss function. Since the data are categorical, they must be one-hot encoded. With one-hot encoding, if we have K different classes, a vector of size K is produced; for a sample belonging to class k, where k = 1, ..., K, all vector components are 0 except for the k-th component, which is assigned the value 1. Let y_1, ..., y_K be the observed outputs for the K mutually exclusive classes and t_1, ..., t_K the one-hot encoded target values. Then, the cross-entropy loss is described by the following formula:

L(\mathbf{w}) = - \sum_{i=1}^{D} \sum_{k=1}^{K} t_{ki} \cdot \ln(y_k(\mathbf{x}_i, \mathbf{w})).

Cross-entropy is almost always paired with the softmax activation function. Softmax is usually used in the output layer and plays the role of the classifier. It takes as input the output unit activations a_k, as described in Section 3.1, and converts them into output probabilities using the following relationship:

y_k(\mathbf{x}, \mathbf{w}) = \frac{\exp(a_k(\mathbf{x}, \mathbf{w}))}{\sum_j \exp(a_j(\mathbf{x}, \mathbf{w}))}.
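A small NumPy sketch of the softmax activation and the categorical cross-entropy loss defined above, using hypothetical activations and one-hot targets:

import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(t, y):
    # t: one-hot targets, y: softmax outputs; epsilon avoids log(0)
    return -np.sum(t * np.log(y + 1e-12))

a = np.array([[2.0, 0.5, -1.0],
              [0.1, 0.2, 3.0]])   # output unit activations a_k for two samples
t = np.array([[1, 0, 0],
              [0, 0, 1]])         # one-hot encoded targets
print(categorical_cross_entropy(t, softmax(a)))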

For all the models trained in this study, the hyperbolic tangent (tanh) activation function was used in the hidden layers. The range of the tanh function is [-1, 1] and it is zero-centered. Tanh is described by the following formula:

h(a_j(\mathbf{x}, \mathbf{w})) = \frac{e^{2a_j} - 1}{e^{2a_j} + 1}

where a_j(\mathbf{x}, \mathbf{w}) are the linear combinations produced by the hidden layers.

The most popular algorithm to optimize a neural network is (stochastic) gradient descent. Gradient descent minimizes a loss function by updating the model’s param-eters in the opposite direction of the gradient of the loss function. The learning rate determines the size of the steps taken to reach a minimum.

Adaptive Moment Estimation (Adam) is a gradient-based algorithm that computes adaptive learning rates for different parameters from estimates of the first and second moments of the gradients [23]. There are two main advantages to this optimizer: it is computationally efficient, and it is well suited to problems with substantial amounts of data and/or parameters. Adam is used as the optimizer in all the CNN models presented in this study.
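As a sketch, here is a single Adam update step in NumPy with the default hyperparameters from [23]; the function and variable names are illustrative only.

import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad             # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2        # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)                # bias corrections for step t >= 1
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v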

3.3 Transfer learning

The utilization of CNNs pretrained on enormous databases, such as ImageNet, for the classification of images from different domains is referred to as transfer learning [46]. Pretrained CNNs can offer promising results on histopathology images, but they first need to be fine-tuned in order to learn domain-specific features, as those networks are trained on natural images, like animals or objects, and cannot recognize the specific biological structures of histopathology images. The first layers of a CNN learn general features, like curves and edges, that are applicable to any task; those layers can therefore be frozen and not trained. The last layers of a CNN are the ones that learn the task-specific features, and those need to be trained on the dataset of interest. The weights of the pretrained network are loaded and used as the initial weights of the network. During the fine-tuning process, the weights are continuously updated, enabling the model to learn the task-specific features [46]. On top of the frozen layers, custom layers can be added to increase the classification accuracy. As the majority of the layers are frozen, the majority of the model parameters are non-trainable, which leads to faster training and high accuracy thanks to the pretrained weights employed for the weight initialization.
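A minimal Keras sketch of this setup, assuming an ImageNet-pretrained MobileNetV2 base with most layers frozen and a new softmax head; the number of frozen layers and the assumption of ten output classes (nine tissue features plus glass background) are illustrative, not the exact thesis configuration.

import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers[:-20]:            # freeze all but the last 20 layers
    layer.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),   # assumed 10 classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])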

In this study, four popular pretrained networks were examined: MobileNetV2, ResNet50, ResNet152V2, and VGG16.

- MobileNetV2 is a CNN model optimized for mobile devices, introduced by Google in 2018 [35]. MobileNetV2 was released after MobileNetV1, and its architecture is based on its predecessor's; therefore, MobileNetV2 also uses depthwise separable convolutions. The superiority of those layers over traditional convolutional layers comes from the fact that they reduce computational cost. Specifically, MobileNetV2 reduces the computational cost by a factor of 8 to 9 compared to standard convolutions, with only a minimal reduction in accuracy. Apart from the depthwise separable convolutions employed by MobileNetV1, MobileNetV2 introduced two novelties. The first is linear bottlenecks between the layers, which prevent non-linearities from destroying a large part of the information. The second is shortcuts between the bottlenecks, which enhance the ability of a gradient to propagate across multiple layers [35].

- ResNet50 was introduced by Microsoft in 2015 and is one of several models in the ResNet family [16]. The number in the name indicates the number of layers, i.e. ResNet50 comprises 50 layers. Deep neural networks are able to learn complex features and offer higher accuracy compared to shallow networks. However, the deep learning community had to face the problem of vanishing gradients in deep networks, which would cause accuracy to saturate or drop quickly after some epochs. To address this problem, the authors introduced a residual learning framework. ResNet152V2 is a 152-layer network from the ResNet family, but it belongs to the version 2 models, an improvement over version 1. The difference from version 1 is that V2 models present a different arrangement of layers in the residual block [17].

- VGG16 was released in 2014 by the University of Oxford [39]. The paper's main outcome was the increase in accuracy obtained by using a small kernel size (3 × 3) and extending the depth of the network to 16 layers: 13 convolutional and 3 fully connected. Although it is a heavy model, due to its depth and number of convolutional layers, it has a simple architecture. The authors also released a deeper VGG model with 19 layers, VGG19. Their submission won first and second place in the ImageNet Challenge 2014 localization and classification tracks, respectively [39].

3.4 Batch normalization

During the training of a neural network, the distribution of each layer's inputs changes. This phenomenon is called internal covariate shift and can result in slower training, as it demands lower learning rates and careful parameter initialization [19]. A solution to this problem is the normalization of the layer inputs for each training mini-batch. Batch normalization loosens the requirements for very small learning rates and specific initializations, and in some cases it is enough to use batch normalization without Dropout. Batch normalization is implemented according to the following steps:

Let x be the values over a mini-batch B = {x_1, ..., x_m}. Calculating only the normalized values of x may change what the layer can represent. To address this, linear transformations of the x values, l = {l_1, ..., l_m}, are computed.

1. Calculate the mini-batch mean:

\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i

2. Calculate the mini-batch variance:

\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2

3. Normalize:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

where \epsilon is a constant added to the mini-batch variance for numerical stability.

4. Scale and shift the normalized value:

l_i = \gamma \hat{x}_i + \beta

where \gamma and \beta are parameters learned during training.
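A direct NumPy transcription of the four steps, with gamma and beta fixed for illustration (in a real network they would be learned):

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=0)                      # 1. mini-batch mean
    var = x.var(axis=0)                      # 2. mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # 3. normalize
    return gamma * x_hat + beta              # 4. scale and shift

batch = np.random.default_rng(1).normal(5.0, 2.0, size=(32, 16))
out = batch_norm(batch)
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])   # ~0 mean, ~1 std per feature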

3.5 Regularization techniques

Deep neural networks are very powerful models that can learn complex relationships between their inputs and outputs. However, a downside of those models is that they are prone to overfitting. Overfitting occurs when the model performs very well on the training data but poorly on unseen data. Dropout and early stopping are some of the most commonly used regularization techniques.

The idea of the Dropout technique is to randomly remove units of a network during training time [40]. Every unit has a probability p of being temporarily removed and this probability is called dropout rate. Those random deactivations prevent neurons from developing dependencies among each other. The application of dropout to a network results in several ’thinned’ networks. A thinned network consists of the units that survived dropout (Figure 3.5).

Figure 3.5: Left: Neural network with two hidden layers. Right: A random thinned network after applying dropout to the initial network on the left.

For many optimization algorithms the training error is a nonincreasing function [5]. However, when the error is measured on the validation set, it often shows a decreasing pattern at first but then it starts increasing as the network starts to overfit. With early stopping the training is stopped at the point of the smallest validation error, as shown in Figure 3.6, so that the model can keep a good generalization performance.
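A short Keras sketch combining the two techniques, a Dropout layer and an EarlyStopping callback; the layer sizes, dropout rate, and patience are illustrative.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="tanh", input_shape=(100,)),
    tf.keras.layers.Dropout(0.5),            # each unit dropped with p = 0.5
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])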


Figure 3.6: The training error decreases while the validation error starts increasing after some epochs; this is a sign of overfitting. With early stopping, the training is stopped at the point where the minimum validation error is achieved. Early stopping requires separate validation data, which are not used for training or testing.

3.6 Evaluation metrics

Evaluation metrics are used as the guideline in order to choose the optimal classifier. There are several evaluation metrics, each of them measuring different characteristics of a classifier. The most common metrics in classification tasks are: confusion matrix, accuracy, precision, recall, F1-score and the Receiver Operating Characteristic (ROC) curve.

3.6.1 Confusion matrix

The confusion matrix summarizes the classified instances; for every class, the number of correctly classified instances and the number of instances classified as one of the other classes are reported. In a binary classification problem, every instance is classified as either positive or negative [6]. The terms used in a confusion matrix and in the rest of the evaluation metrics presented here are true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). TP are positives correctly classified as positive, while FP are negatives incorrectly classified as positive. TN refer to negatives correctly classified as negative and, finally, FN refer to positives incorrectly classified as negative. The present study examines a multi-class classification problem; in that case, the logic of positives and negatives is similar, and the problem can be seen as many binary classification problems, one per class.

                      Predicted
                  Positive   Negative
Actual  Positive     TP         FN
        Negative     FP         TN

Table 3.1: Abstract representation of a confusion matrix for a binary classification problem. The correct predictions are located in the diagonal of the matrix.


3.6.2 Accuracy

The accuracy determines the number of instances that were correctly classified as either positive or negative out of all instances, and it is calculated as:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

3.6.3 Precision

Precision is the fraction that checks how precise the classifier is by counting the number of instances correctly classified as positive, compared to the total number of instances classified as positive, and it is calculated as:

\text{Precision} = \frac{TP}{TP + FP}

3.6.4 Recall

Recall is the fraction that represents the number of instances correctly classified as positive, compared to the total number of actual positives, and it is calculated as:

\text{Recall} = \frac{TP}{TP + FN}

3.6.5 F1-score

F1-score is the harmonic mean of precision and recall [29].

F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
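These metrics can be computed per class with scikit-learn, as in the following sketch on a small made-up multi-class example:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = [0, 0, 1, 1, 2, 2, 2]   # made-up ground-truth labels
y_pred = [0, 1, 1, 1, 2, 0, 2]   # made-up predictions

print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0)   # one value per class
print(precision, recall, f1, support)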

3.6.6 ROC curve

The receiver operating characteristic (ROC) curve is a graphical illustration of the discriminatory power of a binary classifier [10]. It is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at different thresholds, where

TPR = \frac{TP}{TP + FN} \quad \text{and} \quad FPR = \frac{FP}{TN + FP}.

Another terminology used in ROC curves is sensitivity and specificity, where sensitivity = recall and

\text{specificity} = \frac{TN}{FP + TN} = 1 - FPR.

An example of a ROC curve for a binary classification problem is illustrated in Figure 3.7.

ROC graphs are very useful tools in machine learning and they have been extensively used for medical decision making [11]. Classification accuracy alone is often not a good and objective evaluation metric. ROC curves have a great advantage that makes them superior to many other evaluation metrics: they are insensitive to class imbalance. While ROC curves are usually used for binary classification, where one class is considered positive and the other negative, they can also be used for multi-class problems. In the case of n different classes, n ROC curves are produced; for the ROC curve of class i, i = 1, ..., n, class i is used as the positive class and all other classes as the negative. In order to compare different classifiers, the area under the ROC curve (AUC) is calculated. The AUC represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [11]. Comparing classifiers according to AUC, the one with the highest AUC is considered the best in terms of discriminability. In the multi-class case, the AUC is not a scalar, but it can be computed as the sum of each class's AUC weighted by the support of the respective class:

AUC_{total} = \sum_{c_i \in C} AUC(c_i) \cdot p(c_i),

where c_i is class i and C is the set of all classes.

Figure 3.7: Example of a ROC curve and its AUC value. The dashed line represents the case of random guessing.

An ideal classifier would give values very close to the top-left corner. The diagonal line represents the case of random guessing. If a classifier appears in the lower right triangle then it performs worse than a random classifier. Since the perfect classifier is rather idealistic, one can sacrifice some specificity and aim for higher sensitivity [10].
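The weighted multi-class AUC defined above corresponds to scikit-learn's one-vs-rest ROC AUC with weighted averaging, sketched here on hypothetical class probabilities:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 2, 2])          # made-up labels for three classes
y_score = np.array([[0.7, 0.2, 0.1],        # predicted class probabilities
                    [0.5, 0.3, 0.2],
                    [0.2, 0.6, 0.2],
                    [0.1, 0.2, 0.7],
                    [0.3, 0.3, 0.4]])

print(roc_auc_score(y_true, y_score, multi_class="ovr", average="weighted"))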


4 Methods

This chapter presents the methods used in this study. It starts with a summary of the workflow and continues with a detailed explanation of all the methods. Training deep learning models demands powerful hardware, and GPUs are undoubtedly the heart of deep learning. For this project, a powerful computer was kindly provided by the Department of Biomedical Engineering (IMT), Campus US, Linköping University. The machine was equipped with two Nvidia GeForce RTX 2080 Ti graphics cards (11 GB of memory each), a 6-core CPU and 64 GB of RAM.

4.1 Workflow overview

This section includes a summary of all the steps followed during the implementation of the methods presented in this thesis. After receiving the Ivy GAP dataset images, all the H&E-stained slides were stain normalized. Due to the high dimensionality of the images, patches were extracted from the initial digitized histopathology slides, and they were used as the data for the CNN models' training, evaluation and testing. Subsequently, experiments were conducted in order to decide which pretrained and which custom architecture would be chosen as the best, in terms of classification accuracy, for the patch-level classification task of this study. The patch-level predictions of the models were eventually used for demonstrating the classification results as WSIs. Since the results were noisy, the next step included the post-processing of the predicted images to correct some patch-level predictions using the majority voting technique with three different kernel filter sizes. The workflow is illustrated in Figure 4.1.
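As a preview of the post-processing step, a minimal sketch of majority-vote smoothing is given below; it assumes the patch-level predictions have already been arranged into a two-dimensional grid of integer class labels, and the kernel size k corresponds to the filter sizes mentioned above:

import numpy as np
from scipy.ndimage import generic_filter

def majority_vote(pred_map, k=3):
    """Replace each patch label with the most frequent label in its k x k neighborhood."""
    def window_mode(window):
        values, counts = np.unique(window, return_counts=True)
        return values[counts.argmax()]
    return generic_filter(pred_map, window_mode, size=k, mode='nearest')

# Example: smooth a small grid of predicted labels with a 3x3 kernel
grid = np.array([[0, 0, 1], [0, 2, 0], [0, 0, 0]])
print(majority_vote(grid, k=3))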


Figure 4.1: Illustration of the steps involved in the preprocessing, training and post-processing phase of the classification task. Patches are extracted from the stain normalized WSIs and they are used for classification. The predicted patches are assembled back together resulting in a predicted WSI which finally gets smoothed using a post-processing algorithm.

4.2 Image preprocessing

Prior to model training, the dataset images underwent two stages of preprocessing. The preprocessing starts with the stain normalization of the original whole-slide images, which are subsequently divided into smaller images, called patches. Both preprocessing steps are analyzed in detail in the following two sections.

4.2.1 Stain normalization

From the acquisition of tissue up until digitization, several factors can introduce noise that results in color intensity variations. Often, images come from various clinical laboratories that may use different scanners and staining protocols. Furthermore, the conditions under which staining was performed can affect the staining results. Temperature, pH, the specimen's thickness, the stain's strength and other conditions adversely influence the final appearance of a slide [14]. Those variations can reduce the classification accuracy, as it can be difficult for a CNN to learn consistent features from inconsistently colored data. Therefore, an integral part of the preprocessing phase of digitized histology images is color normalization.

For stain normalization, two popular algorithms, Macenko [26] and Vahadane [45], were examined before deciding which one fits the dataset best. Initially, Vahadane was the intended method, since it is a descendant of Macenko with an improved structure-preserving methodology. Both methods need a target image according to which any source image's color is modified. An image including the majority of the anatomic features was selected as the target. After applying both methods to a part of the dataset, Macenko proved to offer better results, as Vahadane resulted in very pink images and some images even appeared to have a bubble-like structure (Figure 4.2c).


Figure 4.2: Comparison of the Macenko and Vahadane stain normalization methods on two different images. Panels: (a) Original, (b) Macenko, (c) Vahadane.

It is essential to mention that Macenko is not the best method when artifacts are present in the histology images or when images are of low quality in general [1]. The algorithm does not take such factors into account and can lead to biased results. However, Ivy GAP consists of finely cut tissues and the images are generally of high quality, which allows us to choose a normalization method that is not overly complex. The algorithm is implemented in the StainTools package for Python 3, which was used to stain-normalize the whole-slide images.
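As an illustration, the normalization step with StainTools typically looks as follows (the file paths are placeholders, and the exact calls should be verified against the installed package version):

import staintools

# Load the reference (target) image and a slide to be normalized
target = staintools.read_image("target_slide.png")
source = staintools.read_image("source_slide.png")

# Standardize brightness first, as recommended by the package
target = staintools.LuminosityStandardizer.standardize(target)
source = staintools.LuminosityStandardizer.standardize(source)

# Fit a Macenko normalizer to the target and transform the source image
normalizer = staintools.StainNormalizer(method='macenko')
normalizer.fit(target)
normalized = normalizer.transform(source)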

Macenko is an unsupervised normalization method and its central idea is that every pixel is a linear combination of the two H&E stain vectors [1]. The first step of the algorithm is the transformation of each RGB vector into its corresponding optical density (OD) domain as follows: \( OD = -\log_{10}(I) \), where I is the RGB color vector normalized to [0, 1]. The purpose of this transformation is to provide a space where a linear combination of stains results in a linear combination of OD values. After that, the algorithm calculates the H&E stain vectors of the WSI of interest by applying singular value decomposition (SVD) to the foreground pixels of the input image. Subsequently, the algorithm corrects the intensity variation caused by factors such as stain strength and staining procedure. Finally, the processed image is projected to a reference (or target) image so that, in the end, all images have similar color characteristics. The algorithmic steps of the Macenko method are presented in Algorithm 1.

By the end of the normalization process, the resulting stain-normalized images had similar color characteristics. The results indicated that the chosen stain normalization algorithm performed well on this dataset and that the variations in stain intensity that existed at the beginning were finely adjusted. Figure 4.2 shows two example images before and after stain normalization with both the Macenko and Vahadane methods.


Algorithm 1: Macenko algorithm

1. Convert RGB vectors to their corresponding optical density (OD) values
2. Remove pixels with optical density lower than the threshold β = 0.15 (almost no stain present)
3. Apply SVD on the OD tuples
4. Create the plane formed by the two vectors corresponding to the two largest singular values
5. Project the OD-transformed pixels onto the plane and normalize to unit length
6. Calculate the angle of each point with respect to the first SVD direction
7. Find the robust extremes (the αth and (100 − α)th percentiles) of the angles (α = 1 usually gives robust results)
8. Project the extreme values back to OD space and use the projection as the optical density matrix (ODM)
9. Calculate the intensity histograms for all pixels using the inverse of the ODM
10. Find the 99th percentile (robust maximum) of these intensity values
11. Scale and transform the intensity values to OD space and then back to RGB
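For illustration, a rough NumPy sketch of steps 1-8 (estimating the stain matrix) could look as follows; this is not the implementation used in this thesis, which relies on StainTools:

import numpy as np

def macenko_stain_matrix(img_rgb, beta=0.15, alpha=1):
    """Estimate the two H&E stain vectors of an RGB image (uint8, H x W x 3)."""
    # Step 1: convert RGB to optical density; +1 avoids log(0)
    I = img_rgb.reshape(-1, 3).astype(np.float64)
    od = -np.log10((I + 1) / 256.0)
    # Step 2: remove near-background pixels with OD below beta
    od = od[np.all(od > beta, axis=1)]
    # Steps 3-4: SVD; the two leading right-singular vectors span the stain plane
    _, _, vt = np.linalg.svd(od, full_matrices=False)
    plane = vt[:2].T                                   # 3 x 2 basis of the plane
    # Step 5: project OD pixels onto the plane
    proj = od @ plane
    # Step 6: angle of each projected pixel with respect to the first direction
    angles = np.arctan2(proj[:, 1], proj[:, 0])
    # Step 7: robust extremes of the angle distribution
    a_min, a_max = np.percentile(angles, [alpha, 100 - alpha])
    # Step 8: map the extreme angles back to OD space; columns ~ H and E vectors
    extremes = np.array([[np.cos(a_min), np.sin(a_min)],
                         [np.cos(a_max), np.sin(a_max)]]).T
    return plane @ extremes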

4.2.2 Patch extraction

As mentioned in the Challenges subsection (1.3), whole-slide images are of enormous dimensions. Due to their size (up to gigapixels), it is impossible to feed them to a convolutional neural network as they cannot fit into RAM or GPU memory. To tackle this problem, the images were preprocessed and patches were extracted. Patches are smaller images extracted from the initial digital histology images, as shown in Figure 4.3, and each patch is assigned a label.

Figure 4.3: Patch extraction from the original WSIs.

Every tissue class in our dataset is represented by a color (Figure 2.1). Since there are ten classes, including the glass background, ten RGB color vectors represent the tissue categories. To assign a label to every patch, each H&E image was combined with its corresponding label map and both were cut into patches simultaneously. For every label-map patch, the number of pixels belonging to each of the ten RGB vectors was calculated, and the RGB color covering the majority of the label-map patch surface defined the label assigned to the H&E patch. Tissue slides often contain artifacts, such as tissue folding, cracks, pen marks and air bubbles, generated during the preparation of the slides. A common artifact is tissue folding, which usually appears on the sides of a slide. An example of tissue folding can be observed on the left side of the WSI in Figure 4.3, where the tissue has a darker color. As tissue folding is considered an artifact, it provides no information and can distort the classification algorithms. For that reason, the 'Mill' semi-supervised annotation algorithm classified the tissue-folding parts of the H&E images as background so that they could be excluded. However, since the patches are cut according to the label images, many H&E patches labeled as Empty included tissue-folding parts. Therefore, a threshold was used, excluding patches from the Empty class with less than 50% white pixels.
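A simplified sketch of this labeling scheme is shown below; the color-to-class mapping and the near-white test are illustrative stand-ins, not the actual Ivy GAP color codes:

import numpy as np

PATCH = 224
# Hypothetical mapping from label-map colors to class names (not the real Ivy GAP codes)
COLOR_TO_CLASS = {(255, 255, 255): "Empty", (0, 128, 0): "CT",
                  (0, 0, 255): "LE", (255, 0, 0): "NEC", (128, 0, 128): "IT"}

def extract_patches(he_img, label_img):
    """Cut aligned 224 x 224 patches and label each by the majority label-map color."""
    h, w = label_img.shape[:2]
    for y in range(0, h - PATCH + 1, PATCH):
        for x in range(0, w - PATCH + 1, PATCH):
            lab = label_img[y:y + PATCH, x:x + PATCH].reshape(-1, 3)
            colors, counts = np.unique(lab, axis=0, return_counts=True)
            cls = COLOR_TO_CLASS.get(tuple(colors[counts.argmax()]))
            if cls is None:          # colors of the minor subclasses are omitted here
                continue
            patch = he_img[y:y + PATCH, x:x + PATCH]
            if cls == "Empty":
                # Keep Empty patches only if at least 50% of the pixels are near-white
                if np.all(patch > 220, axis=-1).mean() < 0.5:
                    continue
            yield patch, cls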

Figure 4.4: Patches representing all nine different parts of the tumor and the glass background.

This procedure resulted in roughly 3,600,000 patches of size 224 × 224 pixels, composing a rich dataset for patch-level classification. Patches from every class are presented in Figure 4.4.

4.3 Patch-level classification

Class imbalance is a very common phenomenon in real-life problems. Algorithms tend to produce biased results when there is a large difference in the number of instances among the classes of a dataset, ignoring the minority classes. Ivy GAP is a highly imbalanced dataset in terms of class percentages, and this problem therefore had to be addressed. For all the models trained, both with and without data augmentation, the class weights were adjusted at training time. The idea of using class weights is that the algorithm gives equal importance to all classes, as it would otherwise favor the majority class due to its higher frequency in the dataset. The weight adjustment was performed using the function compute_class_weight from the Scikit-learn library and passing the result as a parameter to the fit function in Keras. The minority classes get a higher weight while the majority classes receive a smaller weight, resulting in less biased results.
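For illustration, the weight adjustment described above could look as follows, assuming x_train and y_train hold the training patches and their integer labels, and model is a compiled Keras model:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# y_train: integer labels of the training patches, e.g. 0..3 for CT, LE, IT, NEC
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

# Keras scales each sample's loss by the weight of its class
model.fit(x_train, y_train, epochs=100, class_weight=class_weight)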

Furthermore, due to the extreme imbalance of the dataset, only four classes were chosen for training. Minor classes like CTpnn, which occupies only 0.1% of the dataset, are very hard for a network to learn due to their rare occurrence. Furthermore, the minor classes of the dataset are actually subclasses of the bigger class called CT. Therefore, it was decided that it would still be useful if the CNNs were trained to recognize the four major anatomic regions, CT, LE, IT and NEC, as those are also the most important features for the diagnosis of glioblastoma. Consequently, all the methods and results presented in this study were produced taking into account only those four major anatomic regions. The class Empty, which is the glass background, was not used as a class for the networks to recognize, since it does not offer any useful information.

As mentioned in the patch extraction section, the dataset consists of ∼3,600,000 patches. The dataset was split into training (80%), validation (10%) and test (10%) sets. As the number of patches was massive, only 10% of each of those three subsets was used for training, hyper-parameter tuning and testing; otherwise, the training would have taken too long. In the end, for all the models presented in this study, 240,651 patches were used for training, 30,082 for validation and 30,084 for testing.
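As an illustrative sketch (X and y stand for the patch data and their labels, which are not defined here), such a split can be performed with Scikit-learn:

from sklearn.model_selection import train_test_split

# 80% train, 10% validation, 10% test, stratified on the patch labels
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Keep only 10% of each subset to make training time manageable
X_train_small, _, y_train_small, _ = train_test_split(
    X_train, y_train, train_size=0.1, stratify=y_train, random_state=0)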

4.3.1 Scratch-trained CNN

The CNN architecture proposed in this study is based on a recently published paper [21]. After experimenting with different configurations, the authors suggested a novel architecture inspired by VGG16 [39]. Their architecture was compared to four state-of-the-art architectures: VGG16, VGG19, InceptionV3 [42] and ResNet [16]. Their model achieved an AUC of 95.46%, surpassing all four architectures in terms of performance.

The main differences with VGG16 are:

• Three convolutional layers per block instead of four.

• A normalization layer is added to every convolutional block.
• The fully connected layer consists of 512 nodes instead of 4096.

Additionally, in the CNN architecture presented in this study, two dropout layers with a dropout rate of 0.5 are added, one after each dense layer in the fully connected block. Dropout, like any other hyperparameter, can take different values, but there is no standard value that guarantees the best result for every classification task. However, 0.5 is a dropout rate that is widely used in dense layers, as it has proved to yield good results for a wide range of problems and network architectures [40]. The suggested architecture consists of 14 million trainable parameters, which is much lighter than the original VGG16 model with 134 million trainable parameters. The architecture is presented in Figure 4.5.

The convolution blocks play the role of the feature extractor, and the fully connected block operates as a classifier. Every convolution block consists of three convolutional layers, followed by a batch normalization layer and a max-pooling layer with 'same' padding. 'Same' padding prevents losing information from the corners of the images. Furthermore, the stride is set to 1 for per-pixel movement over the images. Every convolution block follows the same pattern but with a different number of kernels. However, in the fifth block the last two convolutional layers have no padding, to decrease the spatial dimensions. Finally, the output of the convolutional layers is flattened and the resulting vector becomes the input of the fully connected block.
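A hedged Keras sketch of such an architecture is given below; the filter counts per block and the use of two 512-node dense layers are assumptions made for this example, since the exact configuration is specified only in Figure 4.5:

from tensorflow.keras import layers, models

def conv_block(x, filters, valid_last_two=False):
    """Three 3x3 convolutions (stride 1), batch normalization, 'same'-padded max pooling.
    In the fifth block the last two convolutions use no padding ('valid')."""
    paddings = ['same', 'valid', 'valid'] if valid_last_two else ['same'] * 3
    for p in paddings:
        x = layers.Conv2D(filters, 3, strides=1, padding=p, activation='relu')(x)
    x = layers.BatchNormalization()(x)
    return layers.MaxPooling2D(2, padding='same')(x)

inputs = layers.Input(shape=(224, 224, 3))
x = inputs
# Filter counts are assumptions in the spirit of VGG16
for i, f in enumerate([64, 128, 256, 512, 512]):
    x = conv_block(x, f, valid_last_two=(i == 4))
x = layers.Flatten()(x)
x = layers.Dense(512, activation='relu')(x)
x = layers.Dropout(0.5)(x)       # dropout rate 0.5 after each dense layer
x = layers.Dense(512, activation='relu')(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(4, activation='softmax')(x)   # CT, LE, IT, NEC
model = models.Model(inputs, outputs)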

Since the model was trained from scratch, many epochs were needed until convergence; therefore, the model was trained for 100 epochs, using early stopping.
