Predicting Ovarian Malignancy based on Transvaginal Ultrasound Images using Deep Neural Networks


DEGREE PROJECT IN ENGINEERING PHYSICS, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2020

Filip Christiansen

Abstract

Ovarian cancer is the most lethal gynaecological malignancy; however, ovarian lesions are very common and only around 1% are malignant. Due to the large number of cases, patients are triaged by gynaecologists whose diagnostic accuracy varies considerably.

The aim of this study is to train and validate deep neural networks and, by comparison to subjective expert assessment, determine their potential in the triage of patients with ovarian tumours.

We used a transfer learning approach on pre-trained networks (VGG16, ResNet50, MobileNet), and a post-processing calibration to better align their confidence scores with the true certainty of their predictions. Our dataset contained 3077 transvaginal ultrasound images from 758 patients with ovarian tumours, where histological outcome from surgery or long-term follow-up (> 3 years) served as diagnostic ground truth. From our dataset, 150 cases (75 benign, 75 malignant), each containing 3 images, were held out for testing, while the remaining cases were used for training and model selection. The models were assessed based on sensitivity, specificity, and AUC, along with their corresponding 95% confidence intervals.

On the test set, our final model had a sensitivity of 96.0% (0.897–0.989), specificity of 86.7% (0.776–0.929), and AUC of 0.950 (0.906–0.985). When excluding the 12.7% (0.073–0.180) of cases most difficult to classify (based on the confidence score of the model output), our model had a sensitivity of 97.1% (0.909–0.994), specificity of 93.7% (0.856–0.978), and AUC of 0.958 (0.911–0.993). For comparison, the subjective expert assessment had a sensitivity and specificity of 96.0% and 88.0%, respectively.

We show that neural networks can be used to predict ovarian malignancy with high diagnostic accuracy, comparable to that of human experts, and thus have potential in the triage of patients with ovarian tumours.

KEYWORDS

Ultrasonography, Classification, Ovarian Neoplasm, Ovarian Tumours, Deep

Learning, Transfer Learning, Machine Learning, Computer-aided Diagnosis


Sammanfattning (Summary)

Ovarian cancer has the highest mortality among gynaecological cancers. Ovarian lesions are, however, common, and only around 1% are malignant. Because of the high prevalence, an initial local assessment (triage) is made of whether the patient should be referred for expert assessment, or whether follow-up at a local healthcare facility is sufficient. The triage is performed by gynaecologists without expert competence in ovarian cancer, who therefore show large variation in diagnostic accuracy.

The aim of this study is to evaluate, by comparison to subjective expert assessment, the potential of artificial neural networks for the triage of women with ovarian tumours.

We used transfer learning of pre-trained models (VGG16, ResNet50, MobileNet) and a calibration method for better probabilistic agreement between the outputs of the models and their underlying diagnostic accuracy. Our image material consisted of 3077 transvaginal ultrasound images from 758 women with ovarian tumours. All cases had a confirmed diagnosis through results from surgery or long-term follow-up (> 3 years). Of this material, 150 cases (75 benign, 75 malignant) with 3 images each were set aside for final validation of the model, while the remaining cases were used for training and model selection. The models were assessed based on sensitivity, specificity, and AUC, together with their 95% confidence intervals.

In validation, our final model had a sensitivity of 96.0% (0.897–0.989), specificity of 86.7% (0.776–0.929), and AUC of 0.950 (0.906–0.985). When excluding the 12.7% (0.073–0.180) of cases that were most difficult to classify, our model had a sensitivity of 97.1% (0.909–0.994), specificity of 93.7% (0.856–0.978), and AUC of 0.958 (0.911–0.993). For comparison, the subjective expert assessment had a sensitivity and specificity of 96.0% and 88.0%, respectively.

Our study shows that artificial neural networks can be used to differentiate between benign and malignant ovarian tumours with high diagnostic accuracy, comparable to that of experienced physicians in the field. We therefore judge that there is potential for using these models in the triage of women with ovarian tumours.


Title

Predicting Ovarian Malignancy based on Transvaginal Ultrasound Images using Deep Neural Networks *

Date

November 4, 2020

Author

Filip Christiansen <filipchr@kth.se>

School of Engineering Sciences, KTH Royal Institute of Technology

Examiner

Carlota Canalias

Department of Applied Physics, KTH Royal Institute of Technology

Supervisors

Kevin Smith, PhD, Associate Professor

School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology & Science for Life Laboratory

Elisabeth Epstein, MD, PhD, Associate Professor

Department of Clinical Science and Education, Karolinska Institute

Department of Obstetrics and Gynaecology, Södersjukhuset

* An altered and extended version of this report has been published in [1].


1 Introduction

2 Background

2.1 Artificial Neural Networks

2.2 Convolutional Layers

2.3 Training by Backpropagation

2.4 Dropout

2.5 Batch Normalization

2.6 Transfer Learning

2.7 Calibration: Temperature Scaling

3 Method

3.1 Dataset

3.2 Pre-processing

3.3 Data Augmentation

3.4 Training

3.5 Probabilistic Calibration and Model Ensembling

4 Results

4.1 Comments on Potential Sources of Bias

5 Conclusions and Future Work

References

A Appendix

A.1 Historical Diagnoses

A.2 Data Augmentation Details

A.3 Confidence Interval for a Binomial Proportion

A.4 Uncertainty Estimation by Bootstrapping


1 Introduction

Ovarian cancer is the most lethal gynaecological malignancy with a 5-year survival rate below 45% [2]. However, ovarian lesions are very common and only around 1% are malignant [3]. Transvaginal ultrasound assessment is today the established procedure to differentiate between benign and malignant ovarian lesions [4]. Due to a large number of cases, a time-consuming diagnostic process, and a shortage of experienced sonographers, patients are triaged by gynaecologists, not specialized in ovarian cancer and having a high variability in diagnostic accuracy [5].

The large potential for improvement, together with the successful use of computerized imaging tools for medical diagnostics in other areas in recent years, raises the question of whether such methods could be utilized in the triage of patients with ovarian tumours. There are several potential benefits to be gained from an improved triaging process. Today, the high diagnostic uncertainty and interobserver variability cause many unnecessary referrals to expert sonographers. This leads to long waiting times for patients and to unnecessary surgical procedures and medical costs. Benign tumours can be managed more conservatively, with ultrasound follow-up or less invasive and fertility-preserving surgery.

A recent study [6], using traditional machine learning methods (e.g. Linear Discriminants (LD), Support Vector Machines (SVM), Extreme Learning Machines (ELM)) on a set of manually selected feature descriptors, was able to automatically differentiate between benign and malignant ovarian tumours with a sensitivity and specificity of around 90% and 80%, respectively. However, the study was rather small, including only 384 images from 187 patients, which limits both the potential of the models and the statistical significance of the results.

In this study, we try to improve upon the results from [6] by using a transfer learning approach on deep neural networks pre-trained on ImageNet [7]. To the best of our knowledge, only one paper [8] has been published that applies convolutional neural networks (CNN) to transvaginal ultrasound images for ovarian tumour classification. While it shows promising results (92.5% accuracy), the paper lacks transparency, as it gives no insight into how the model was tested, nor whether a separate test set was used. Furthermore, the case mix is not reported, and the ultrasound images shown in the paper seem to be of poor quality. However, transfer learning of pre-trained CNNs has been successfully used for classification of tumours in ultrasound images for other types of cancer, with a few well-designed studies on thyroid [9, 10] and breast [11, 12] cancer. These and other studies have shown that a transfer learning approach can yield similar or superior diagnostic accuracy, compared to senior sonographers. With that said, our main research question for this thesis follows:

Can pre-trained deep neural networks, fine-tuned on transvaginal ultrasound images, predict ovarian malignancy with a high enough diagnostic accuracy for use in the triage of patients with ovarian tumours?


2 Background

In this chapter, relevant theory and terminology are presented. First, in Section 2.1, Artificial Neural Networks (ANN) are introduced. Then, Section 2.2 explains the concept of convolutional layers, followed by Section 2.3, which covers backpropagation, the algorithm by which ANNs are trained. The chapter then continues with Section 2.4 on the regularization technique called dropout, followed by Section 2.5, in which batch normalization is described. Section 2.6 explains the concept of transfer learning and how it can be used to reduce the reliance on large datasets. Lastly, a calibration method called temperature scaling is described in Section 2.7.

2.1 Artificial Neural Networks

Artificial Neural Networks (ANN) are computational models inspired by the distributed information processing in biological systems [13]. A basic ANN consists of an input and output layer, separated by some number of hidden layers, which in turn are collections of "vertically" stacked artificial neurons (Figure 2.1).

Figure 2.1: Schematic illustration of an Artificial Neural Network. [14]

Each artificial neuron (Figure 2.2), or node, is an elementary computational unit, defined by its weights and biases, its activation function, and its connections to other nodes. Each node receives a set of inputs from its connected nodes in the previous layer. After multiplication by the corresponding weights, the inputs are summed up and the bias is added. The output is then obtained by transformation by a non-linear activation function. Given a set of inputs, the ANN makes a prediction by successively propagating the information through the network of interconnected nodes [15].

At model creation, weights and biases are randomly initialized. Therefore, the predictive performance of the model will initially be no better than simple guessing. To change this, ANNs are "trained" with a technique called backpropagation [16], explained with an example in Section 2.3.

Figure 2.2: Schematic illustration of an artificial neuron.
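The computation performed by a single node, and by a layer of such nodes, can be sketched in a few lines of plain Python. This is an illustrative example only: the input values, weights, and biases are made up, and ReLU is assumed as the activation function.

```python
def relu(z):
    """Rectified linear unit, one common non-linear activation."""
    return max(0.0, z)


def neuron_output(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def layer_output(inputs, weight_rows, biases, activation=relu):
    """A fully connected layer: one neuron per row of weights."""
    return [neuron_output(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]


# Hypothetical inputs and parameters, chosen only for illustration.
x = [0.5, -1.0, 2.0]
hidden = layer_output(x, [[0.2, 0.4, 0.1], [-0.3, 0.1, 0.5]], [0.5, -1.0])
```

The first node has a positive weighted sum (about 0.4) and passes it on, while the second has a negative weighted sum (-0.25) and is silenced by the ReLU. Feeding `hidden` into another `layer_output` call corresponds to propagating the information one layer further.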

2.2 Convolutional Layers

Convolutional layers consist of a set of stacked filters and are the main building blocks of Convolutional Neural Networks (CNN). Given an image volume (technically a tensor), each filter is convolved across the width and height of the volume, producing a two-dimensional activation map. This means that the filter is moved across the image, while at each position, the dot product is computed between the parameters of the filter and the values of the image at the current position (Figure 2.3). The output from the convolutional layer is the volume of stacked activation maps from all filters, which in turn might be the input to another convolutional layer. The use of filters with shared weights across the image results in a shift-invariant representation. The idea behind this is that the ability to recognize an object should be independent of its location within the image [17].

Figure 2.3: Schematic illustration of a convolutional filter. [18]
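As a minimal sketch of the sliding dot product described above, the following plain-Python function applies one single-channel filter to a small image without padding and with a stride of one (both are assumptions made for brevity). The image and filter values are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the activation map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    activation_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Dot product between the filter and the image patch at (r, c).
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        activation_map.append(row)
    return activation_map


# A vertical edge: dark columns on the left, bright columns on the right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge_filter = [[-1, 1],
               [-1, 1]]
response = conv2d_valid(image, edge_filter)

# The same edge moved one pixel to the left.
shifted = [[0, 1, 1, 1],
           [0, 1, 1, 1],
           [0, 1, 1, 1]]
shifted_response = conv2d_valid(shifted, edge_filter)
```

Because the same weights are used at every position, moving the edge one pixel to the left moves the peak in the activation map by one pixel as well, which is the shared-weight behaviour the paragraph refers to.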


2.3 Training by Backpropagation

Training an image classification model in a supervised manner requires a training dataset $\{x_i, y_i\}_{i=1}^{N}$, where each $x_i$ is an image and $y_i$ its corresponding class label, one-hot encoded so that $y_{ij} = 1$ if $x_i$ belongs to class $j$ and $y_{ij} = 0$ otherwise.

Simplified, the model is trained by repeatedly giving it an image $x_i$ and letting it predict the conditional class probabilities

$$\hat{y}_{ij} = P(y_{ij} = 1 \mid x_i), \quad j = 1, \dots, C,$$

where $C$ is the number of possible classes. A confident correct classification of an image $x_i$ corresponds to $\hat{y}_{ij}$ being close to $y_{ij}$ for every class $j$. To achieve this, a loss function $L$ must be defined, which is typically the categorical cross-entropy loss, defined as

$$L(y_i, \hat{y}_i) = -\sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij}).$$

After a prediction has been made, the gradients of the loss function are computed with respect to the parameters of the model. Each parameter is then changed slightly in the direction opposite to the corresponding gradient, so that the model returns a more accurate prediction the next time it sees the same image. For clarification, the updated parameters at time-step $t + 1$ are given by

$$\Theta^{(t+1)} = \Theta^{(t)} - \eta \nabla_{\Theta} L,$$

where $\eta$ is the learning rate and $\Theta^{(t)}$ is the set of parameters at the previous time-step $t$. This way, the model learns to classify images by automatically finding a suitable representation from the raw pixel data of seen images [16].
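To make the update rule concrete, the sketch below performs one gradient-descent step on the smallest possible classifier: a single linear layer followed by softmax, trained with the cross-entropy loss on one sample. The model, the two-feature input, and the learning rate are assumptions chosen for illustration; a deep network applies the same rule to all of its layers through backpropagation.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_onehot, y_hat):
    """Categorical cross-entropy for one sample with a one-hot label."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, y_hat) if y > 0)


def predict(weights, x):
    """Linear layer (one weight row per class) followed by softmax."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return softmax(logits)


def sgd_step(weights, x, y_onehot, lr):
    """One update: theta <- theta - lr * gradient of the loss."""
    y_hat = predict(weights, x)
    # For softmax followed by cross-entropy, dL/dlogit_j = y_hat_j - y_j,
    # so dL/dw_jd = (y_hat_j - y_j) * x_d.
    return [[w - lr * (p - y) * v for w, v in zip(row, x)]
            for row, p, y in zip(weights, y_hat, y_onehot)]


x = [1.0, 2.0]          # hypothetical input features
y = [0.0, 1.0]          # one-hot label: the sample belongs to class 2
theta = [[0.0, 0.0], [0.0, 0.0]]

loss_before = cross_entropy(y, predict(theta, x))
theta = sgd_step(theta, x, y, lr=0.1)
loss_after = cross_entropy(y, predict(theta, x))
```

With all parameters at zero the model predicts both classes with probability 0.5, so `loss_before` equals log 2. After one step the probability assigned to the correct class rises and `loss_after` is smaller.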

2.4 Dropout

Dropout is a regularization technique used between fully connected layers to improve classification performance by reducing overfitting to training data. During training, individual nodes are "dropped", by setting their outputs to zero, with a probability of $1 - p$. To compensate for the resulting lower expected output to the next layer, the outputs of the remaining nodes are divided by $p$, the probability of a node being kept. Randomly dropping nodes during training forces the network to find a more redundant representation, by reducing co-adaptation between nodes [19]. Co-adaptation means that groups of nodes learn to respond to the same input, leading to a correlated representation. By avoiding this, the use of dropout results in a more robust feature representation that generalizes better to new data. A schematic illustration of dropout is shown in Figure 2.4.

Figure 2.4: Schematic illustration of dropout.
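A minimal sketch of dropout with rescaling is given below. It assumes that `keep_prob` denotes the probability of keeping a node, so each node is dropped with probability `1 - keep_prob` and the surviving outputs are scaled by `1 / keep_prob` to preserve the expected output. Note that some libraries instead parameterize dropout by the drop rate.

```python
import random


def dropout(outputs, keep_prob, rng, training=True):
    """Zero each output with probability 1 - keep_prob and rescale the survivors."""
    if not training:
        # At test time all nodes are used and no rescaling is needed.
        return list(outputs)
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in outputs]


rng = random.Random(0)            # fixed seed so the example is repeatable
activations = [1.0] * 10000       # hypothetical layer outputs
dropped = dropout(activations, keep_prob=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)
```

Roughly half of the outputs are set to zero and the rest are doubled, so `mean_after` stays close to the original mean of 1.0.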

2.5 Batch Normalization

When training deep neural networks, the distribution of the inputs to each layer changes as the parameters of the previous layers change. This is referred to as internal covariate shift, and it increases the risk of divergence. It thereby necessitates careful parameter initialization and the use of lower learning rates, which in turn slows down training. A common method to reduce this problem is batch normalization [20]. During training, it normalizes the input to each layer over each training batch. This is done by re-centering and re-scaling the input by the mean and variance of the current batch. However, at test time, a batch dependence is not desirable, since we want the output to depend only on the input. Furthermore, the samples might have to be processed one at a time. Therefore, instead of using the means and variances of a batch, exponential moving averages (EMA) from training are used at test time, thereby keeping the means and the variances fixed. Simply normalizing the input to each layer, to have zero mean and unit variance over the batch, reduces the representational power of the network. To fix this, batch normalization is followed by a linear transformation (a scale and a shift), whose parameters are learned during training, along with the original parameters of the model.
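The difference between training-time and test-time behaviour can be sketched as follows for a single scalar feature. The names `gamma` and `beta` for the learned scale and shift, and the small constant `eps` that guards against division by zero, follow common convention and are assumptions rather than notation from this report.

```python
import math


def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Normalize a batch with its own mean and variance, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    out = [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
    return out, mean, var


def batch_norm_inference(x, running_mean, running_var, gamma, beta, eps=1e-5):
    """Normalize one sample with fixed statistics collected during training."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta


batch = [2.0, 4.0, 6.0, 8.0]      # hypothetical activations of one feature
normalized, batch_mean, batch_var = batch_norm_train(batch, gamma=1.0, beta=0.0)

# At test time a single sample is normalized with stored statistics.
single = batch_norm_inference(5.0, running_mean=5.0, running_var=5.0,
                              gamma=2.0, beta=0.5)
```

In training mode the output depends on the other samples in the batch, whereas `batch_norm_inference` depends only on its own input and the stored statistics, which is the property wanted at test time.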

2.6 Transfer Learning

The architecture of most neural networks for image classification consists of a convolutional base, made up of multiple layers of stacked convolutional filters for representation learning, and a fully connected classifier. A schematic illustration is shown in Figure 2.5.

Figure 2.5: Schematic illustration of a Convolutional Neural Network for image classification. [21]

Transfer learning is the concept of taking a model that has been trained on a large dataset from one domain, the source domain, replacing the original classifier, and training only the parameters of the new classifier on a smaller dataset from another domain, the target domain. This is done by "freezing" the parameters of the original model, thereby keeping them unchanged. The idea behind this is that for the original model to be able to successfully classify images in the source domain, it must have learnt a good representation of important image features. While a large difference between the domains means that the learnt representation of the source domain may not be the most suitable representation for the target domain, this is often more than compensated for if there is limited data available in the target domain [22].

Domain adaptation is the task of adapting a model trained on data from a source domain to a target domain. When using transfer learning, domain adaptation is the motivation behind fine-tuning (parts of) the original model after having trained the new classifier. A common approach to fine-tuning is "unfreezing" the layers of the last convolutional block, one by one, while training with a low learning rate for a small number of epochs. While this might improve overall performance by finding a more suitable representation for the target domain, it also runs the risk of overfitting the training data, leading to poor generalization.

When using transfer learning on a model with batch normalization layers, one more thing has to be addressed, namely the fact that the means and variances of a batch, used for normalization during training in the target domain (transfer learning), will be significantly different from the exponential moving averages (EMA), used at test time, that were once computed during training in the source domain. To align the training and test behaviour of the model, the EMA statistics of all batch normalization layers have to be replaced with the corresponding statistics from the target domain [23]. This can be done by continually updating the EMA statistics during training in the target domain.
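The effect of continuing to update the EMA statistics in the target domain can be illustrated with a single running mean. The momentum of 0.99 and the two mean values are hypothetical numbers chosen for the example, not values taken from this study.

```python
def update_ema(running, batch_value, momentum=0.99):
    """One exponential-moving-average update of a running statistic."""
    return momentum * running + (1.0 - momentum) * batch_value


source_mean = 0.45         # hypothetical running mean inherited from the source domain
target_batch_mean = 0.20   # hypothetical batch mean observed in the target domain

running_mean = source_mean
for _ in range(1000):      # one update per training batch
    running_mean = update_ema(running_mean, target_batch_mean)
```

After enough batches the inherited source-domain value has decayed away and `running_mean` is close to the target-domain statistic, so the normalization applied at test time matches the one seen during training. If the statistics were frozen together with the weights, `running_mean` would remain at 0.45.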

2.7 Calibration: Temperature Scaling

For a given image, the final class prediction from a CNN is given by the most probable class label. This is the argmax of the confidence scores of the prediction, which in turn are the estimated conditional class probabilities $P(\text{class} \mid \text{observation})$. These are obtained by taking the softmax, defined by

$$\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$

for $i \in \{1, \dots, C\}$ and $z = (z_1, \dots, z_C)^T \in \mathbb{R}^C$, of the logits, which are the outputs from the last fully connected layer. In some cases, it is desirable to obtain not only the most probable class given an image, but also a reliable estimate of the conditional probability $P(\text{class} \mid \text{observation})$ of each class. Neural networks tend to be overconfident in their predictions, which is a result of minimizing the cross-entropy loss during training [24].

Temperature scaling is a post-processing technique to better align the confidence scores with the underlying class probabilities. This is done by dividing the logits $\{z_i\}_{i=1}^{C}$ by a scalar $T > 0$, referred to as the temperature, which is optimized to minimize the cross-entropy loss on the validation set [25]. For $T > 1$, the entropy is increased, which corresponds to less certainty in the predictions. As $T \to \infty$, the probabilities all approach $1/C$, i.e. a random guess and maximum entropy. An important aspect of temperature scaling is that, since dividing all logits by the same positive scalar preserves their order, it changes neither the predicted classes nor the accuracy of the model.
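A minimal sketch of temperature scaling is shown below. The validation logits and labels are hypothetical, and the temperature is found by a simple grid search, which is an assumption made to keep the example dependency-free; a gradient-based optimizer would normally be used.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of the logits after division by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Mean cross-entropy (negative log-likelihood) at a given temperature."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        total -= math.log(softmax_with_temperature(logits, temperature)[label])
    return total / len(labels)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation loss."""
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


# Hypothetical validation logits: four confident predictions, one of them wrong.
val_logits = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]]
val_labels = [0, 1, 1, 1]

grid = [0.5 + 0.1 * k for k in range(46)]   # candidate temperatures 0.5 to 5.0
T = fit_temperature(val_logits, val_labels, grid)

before = softmax_with_temperature(val_logits[0])
after = softmax_with_temperature(val_logits[0], T)
```

The uncalibrated model is about 98% confident on every sample but only correct on three out of four, so the fitted temperature is well above 1 and pulls the confidence down towards 75%. The largest probability still belongs to the same class before and after scaling, so the predictions are unchanged.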


3 Method

We used a transfer learning approach on the pre-trained deep learning models VGG16 [26], ResNet50 [27], and MobileNet [28], from the open-source library Keras, and used TensorFlow for fine-tuning the models on our dataset of transvaginal ultrasound images of ovarian tumours.

3.1 Dataset

Our dataset contained 3077 grayscale and power Doppler ultrasound images from 758 patients with ovarian tumours. All patients had undergone expert ultrasound assessment, using high-end ultrasound systems¹, at the departments for gynaecological ultrasound at Karolinska University Hospital and Södersjukhuset in Stockholm, Sweden. Eligibility criteria were surgery within 4 months of the ultrasound examination (n=633), or ultrasound follow-up for a minimum of 3 years (in case of a presumably benign lesion) (n=124). The histological diagnoses of patients who had undergone surgery are shown in Table A.1 in the Appendix. From our dataset, 150 cases (75 benign, 75 malignant), each containing 3 images, were left out for testing, while the remaining 607 cases were used for training and model selection (validation).

To ensure accurate results, only patients with a histological diagnosis obtained by surgery were included in the test set. It should be pointed out that all images from a given patient were included in the same set (training, validation, or test set).

Power Doppler ultrasound utilizes the Doppler effect² to estimate the movement of body fluids, in this case the blood flow. This is used to detect indicators of malignancy, such as presence of high blood flow and an abnormal structure of blood vessels [29] (Figure 3.1).

Figure 3.1: Example of grayscale and power Doppler ultrasound images.

¹ GE Voluson E8/E10 (5-9 MHz, 6-12 MHz), Philips iU22/EPIQ (3-10 MHz)

² The change in observed wavelength caused by movement of the source relative to the observer

3.2 Pre-processing

Each image was first manually cropped to the region of interest, to exclude organs irrelevant to the tumour, such as urinary bladder, gut, uterus, and major blood vessels. The pre-trained networks assume the input to be tensors of shape (224, 224, 3), i.e. square images of 224 by 224 pixels with 3 colour channels (RGB). Therefore, the images were downsampled and resized accordingly³. Furthermore, the pixel values were standardized channel-wise to have zero mean and unit variance over the dataset. This was done by subtracting the mean pixel intensity and dividing by the standard deviation of the pixels for each channel. For consistency, the pixel means and standard deviations were calculated solely on the training dataset. These same values were then also applied to the validation and test sets.
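The channel-wise standardization can be sketched as follows, with images represented as nested lists indexed as `image[row][column][channel]`. The tiny two-pixel images are made-up data; the point of the example is that the statistics come from the training images only and are then reused unchanged for any other image.

```python
import math


def channel_stats(images):
    """Per-channel mean and standard deviation over all pixels of all images."""
    n_channels = len(images[0][0][0])
    means, stds = [], []
    for ch in range(n_channels):
        values = [px[ch] for img in images for row in img for px in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds


def standardize(image, means, stds):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return [[[(px[ch] - means[ch]) / stds[ch] for ch in range(len(px))]
             for px in row] for row in image]


# Hypothetical training set: two images of 1 x 2 pixels with 3 channels each.
train_images = [
    [[[0.0, 10.0, 100.0], [2.0, 20.0, 200.0]]],
    [[[4.0, 30.0, 300.0], [6.0, 40.0, 400.0]]],
]
means, stds = channel_stats(train_images)

# A validation or test image is transformed with the training statistics.
test_image = [[[3.0, 25.0, 250.0]]]
test_standardized = standardize(test_image, means, stds)
```

Computing the statistics on the training set alone avoids leaking information from the validation and test sets into the pre-processing.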

3.3 Data Augmentation

We performed data augmentation during training to improve model generalization, by expanding the available training data and mimicking shifts in image properties in unseen domains. The transformations used in the augmentation process can be divided into three main categories [30], based on which aspect of the image is altered. These are image quality, spatiality, and appearance. Low image quality is mainly characterised by blurriness and low resolution, caused by scanner motion, low scanner resolution, or lossy image compression. We tried to imitate this by adding Gaussian noise, JPEG compression, and shifts in sharpness. The transformations related to the spatial shape were flips, rotations, crops, and scaling. They served the purpose of simulating variability in the shapes and positions of organs, the size of patients, and cropping to the region of interest. The appearance-related transformations used were shifts in brightness, contrast, and colour, mimicking differences between ultrasound systems and settings.⁴

3.4 Training

First, the original multi-class classifier of each model was replaced by a binary classifier, consisting of a fully connected layer of 1024 hidden nodes (512 for MobileNet), followed by ReLU activation⁵, dropout of 0.5 (0.2 for VGG16), and a last fully connected layer with two nodes and softmax activation for "semi"-probabilistic⁶ output. The hyperparameters, such as the learning rate, dropout rate, and the number of hidden nodes in the fully connected layer, were empirically set to maximize the performance on the validation set.

³ Using nearest-neighbour interpolation

⁴ For a more detailed description, see Section A.2 of the Appendix.

⁵ A common non-linear activation function

⁶ The model's confidence in the output does not fully align with the underlying probability

In the first training step, all weights of the original model were frozen (except for the exponential moving average (EMA) statistics of the batch normalization layers in ResNet50 and MobileNet), thereby only training our new classifier. In this step, we used an initial learning rate of 0.002 for VGG16, and 0.02 for ResNet50 and MobileNet. Then, the layers of the last convolutional block were unfrozen and fine-tuned one by one, until no more improvement could be seen. This meant that, for VGG16, the last three convolutional layers were fine-tuned with an initial learning rate of $2 \cdot 10^{-4}$, while in ResNet50 and MobileNet only the classifier was trained and the EMA statistics of the batch normalization [20] layers updated. Even for VGG16, fine-tuning the last convolutional layers only led to a slight improvement.

All networks were trained using backpropagation [16] by Stochastic Gradient Descent (SGD) with a Nesterov momentum (NAG) [31] of 0.9. A constant batch size of 32 samples was used throughout training, as we did not see any noticeable effect of altering the batch size. The imbalance in the number of Doppler and grayscale images, from benign and malignant cases, was addressed by training with weighted binary cross-entropy loss. The learning rate was decayed by a factor of 0.5 after every four consecutive epochs without improvement in validation accuracy. Training was stopped after 15 consecutive epochs without improvement in validation loss, which was also the metric used to store the best model weights. We also experimented with training separate models for Doppler and grayscale ultrasound images, but this idea was abandoned due to inferior performance.
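The learning rate decay and stopping rule described above can be sketched as a small simulation over per-epoch validation metrics. This is one possible reading of the rule, in which the patience counter for the learning rate is reset after each decay; the metric sequences are made up, and the function name `simulate_schedule` is hypothetical.

```python
def simulate_schedule(val_accuracy, val_loss, initial_lr,
                      decay_factor=0.5, decay_patience=4, stop_patience=15):
    """Return the learning rate used in each epoch and the epoch of the best loss."""
    lr = initial_lr
    best_acc, epochs_since_acc = float("-inf"), 0
    best_loss, epochs_since_loss = float("inf"), 0
    best_epoch = None
    history = []
    for epoch, (acc, loss) in enumerate(zip(val_accuracy, val_loss)):
        history.append(lr)  # learning rate used during this epoch
        # Halve the learning rate after `decay_patience` epochs without better accuracy.
        if acc > best_acc:
            best_acc, epochs_since_acc = acc, 0
        else:
            epochs_since_acc += 1
            if epochs_since_acc == decay_patience:
                lr *= decay_factor
                epochs_since_acc = 0
        # Track the best validation loss; its epoch is where weights would be stored.
        if loss < best_loss:
            best_loss, epochs_since_loss, best_epoch = loss, 0, epoch
        else:
            epochs_since_loss += 1
            if epochs_since_loss == stop_patience:
                break
    return history, best_epoch


# Hypothetical metrics: improvement for two epochs, then a plateau.
accuracy = [0.70, 0.80] + [0.80] * 20
loss = [0.60, 0.50] + [0.55] * 20
lr_history, best_epoch = simulate_schedule(accuracy, loss, initial_lr=0.002)
```

In this made-up run the learning rate is halved three times during the plateau, training stops after 17 epochs, and the weights from the second epoch (index 1) would be the ones kept.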

3.5 Probabilistic Calibration and Model Ensembling

As a post-processing procedure, the models were calibrated using temperature scaling [25], described in Section 2.7, in order for their confidence scores to be better aligned with the true underlying certainty in their predictions. The model calibration is best visualised using reliability diagrams, in which the accuracy is plotted against the model confidence. For an overconfident model, the confidence score will exceed the expected accuracy. In Figure 3.2, an example of this is shown for the VGG16-based model, where the accuracy falls below the dashed line for the uncalibrated model to the left.

Having reliable estimates of the conditional class probabilities $P(\text{class} \mid \text{observation})$ is desirable when building an ensemble, as it reduces the problem with difference in (over-)confidence between the models. For the ensemble, we used a soft voting scheme of averaging the probabilities from the models (VGG16, ResNet50, and MobileNet), after independent calibration of each model. An ensemble of models is used in the hope of obtaining better predictive performance than any of the models by itself.
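Soft voting amounts to averaging the class-probability vectors of the individual models. The sketch below uses made-up calibrated outputs for a single case, ordered as [P(benign), P(malignant)].

```python
def soft_vote(model_probabilities):
    """Average the class-probability vectors from several models for one sample."""
    n_models = len(model_probabilities)
    n_classes = len(model_probabilities[0])
    return [sum(p[c] for p in model_probabilities) / n_models
            for c in range(n_classes)]


# Hypothetical calibrated outputs [P(benign), P(malignant)] for a single case.
calibrated = {
    "VGG16": [0.30, 0.70],
    "ResNet50": [0.55, 0.45],
    "MobileNet": [0.20, 0.80],
}
ensemble = soft_vote(list(calibrated.values()))
predicted_class = max(range(len(ensemble)), key=ensemble.__getitem__)
```

Here two models favour malignancy and one weakly favours a benign lesion; the averaged probability of malignancy is 0.65, so the ensemble predicts class 1 (malignant). Without calibration, a single overconfident model could dominate this average.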

To reduce overconfident incorrect predictions, we also experimented with the use of test-time augmentation, which is the process of creating multiple augmented copies of each image at test time and taking the average of the corresponding predictions. While it led to some improvement when predicting malignancy of tumours based on single images, it did not result in any significant improvement on a case-by-case basis.

Figure 3.2: Reliability diagram for VGG16-based model before and after calibration. The confidence is the predicted conditional probability for the most probable class. Since the classification is binary, the confidence is never lower than 0.5.
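The points of a reliability diagram can be computed by grouping predictions into confidence bins and comparing the mean confidence of each bin with its accuracy. The sketch below assumes five equal-width bins between 0.5 and 1.0 and uses made-up predictions; the number of bins used for Figure 3.2 is not stated here.

```python
def reliability_bins(confidences, correct, n_bins=5, low=0.5, high=1.0):
    """Group predictions into equal-width confidence bins.

    Returns (mean confidence, accuracy, count) for each non-empty bin.
    """
    width = (high - low) / n_bins
    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        # A confidence equal to `high` is placed in the last bin.
        idx = min(int((conf - low) / width), n_bins - 1)
        conf_sum[idx] += conf
        hits[idx] += int(ok)
        counts[idx] += 1
    return [(conf_sum[b] / counts[b], hits[b] / counts[b], counts[b])
            for b in range(n_bins) if counts[b] > 0]


# Hypothetical predictions: confidence of the predicted class and whether it was right.
confidences = [0.95, 0.92, 0.98, 0.91, 0.65, 0.62]
correct = [True, True, False, True, True, False]
bins = reliability_bins(confidences, correct)
```

In both non-empty bins the mean confidence exceeds the accuracy, which is the pattern of an overconfident model: such points fall below the diagonal of the diagram.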


4 Results

The models were assessed based on sensitivity, specificity, and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve, along with their corresponding 95% confidence intervals. Sensitivity and specificity are the proportions of correctly classified positive and negative samples⁷, respectively. The ROC curve is the result of plotting the true positive rate (TPR) against the false positive rate (FPR)⁸. The AUC, which is simply the corresponding integral, is a common statistic for model comparison, and is equivalent to the probability that the classifier will rank a randomly chosen positive sample higher than a randomly chosen negative sample [32]. The ROC curves of the trained models are shown in Figure 4.1. The confidence intervals for the sensitivity and specificity were estimated by the Jeffreys interval [33], described in Section A.3 of the Appendix, while the confidence intervals for the AUC and the proportion of excluded cases were obtained by bootstrapping [34], described in Section A.4 of the Appendix.

Figure 4.1: Receiver Operating Characteristic (ROC) curves for the trained models, when including all samples from the test set.
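The three statistics can be computed directly from labels and predicted probabilities. In the sketch below the AUC is obtained from its rank interpretation, by comparing every positive sample with every negative sample, rather than by integrating the ROC curve. The labels and scores are made up, and a decision threshold of 0.5 is assumed.

```python
def sensitivity_specificity(labels, scores, threshold=0.5):
    """Proportions of correctly classified positive (1) and negative (0) samples."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Probability that a random positive is ranked above a random negative.

    Ties count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = malignant) and predicted probabilities of malignancy.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

sens, spec = sensitivity_specificity(labels, scores)
area = auc(labels, scores)
```

With these made-up values, two of three malignant and two of three benign samples are classified correctly, and eight of the nine positive-negative pairs are ranked correctly, giving an AUC of 8/9. Unlike sensitivity and specificity, the AUC does not depend on the chosen threshold.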

The models, VGG16, ResNet50, and MobileNet, were trained individually and optimized based on the model performance on the validation set. The performance statistics on the test set for each model, and for an ensemble of the models, are shown in Table 4.1.

In Table 4.2, the performance statistics are shown for the same models, when excluding cases with a predicted probability of malignancy between 0.4 and 0.6, corresponding to high uncertainty. Model testing with exclusion of uncertain cases is motivated by triage of patients being the most likely near-future use case for these models.

| Model | AUC | Sensitivity | Specificity |
|---|---|---|---|
| VGG16 | 0.936 | 0.920 | 0.853 |
| ResNet50 | 0.933 | 0.893 | 0.840 |
| MobileNet | 0.950 | 0.947 | 0.867 |
| Ensemble | 0.950 | 0.960 | 0.867 |

Table 4.1: Performance statistics of the models on the test set.

⁷ In this context, a positive/negative sample equals a malignant/benign sample, respectively.

⁸ TPR = sensitivity, FPR = 1 − specificity

| Model | AUC | Sensitivity | Specificity | Exclusion |
|---|---|---|---|---|
| VGG16 | 0.946 | 0.971 | 0.879 | 0.107 |
| ResNet50 | 0.948 | 0.945 | 0.951 | 0.227 |
| MobileNet | 0.960 | 0.954 | 0.906 | 0.140 |
| Ensemble | 0.958 | 0.971 | 0.937 | 0.127 |

Table 4.2: Performance statistics of the models on the test set, when excluding cases with a predicted probability of malignancy between 0.4 and 0.6, corresponding to high uncertainty.
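The exclusion of uncertain cases can be sketched as a simple filter on the predicted probability of malignancy, followed by re-computation of the statistics on the remaining cases. Whether the boundaries 0.4 and 0.6 themselves count as uncertain is not specified, so treating them as excluded is an assumption; the cases below are made up.

```python
def sensitivity_specificity(cases, threshold=0.5):
    """cases: list of (label, probability of malignancy); label 1 = malignant."""
    pos = [p for y, p in cases if y == 1]
    neg = [p for y, p in cases if y == 0]
    sens = sum(p >= threshold for p in pos) / len(pos)
    spec = sum(p < threshold for p in neg) / len(neg)
    return sens, spec


def exclude_uncertain(cases, low=0.4, high=0.6):
    """Drop cases with low <= probability <= high and report the excluded fraction."""
    kept = [(y, p) for y, p in cases if p < low or p > high]
    return kept, 1.0 - len(kept) / len(cases)


# Hypothetical test cases: (true label, predicted probability of malignancy).
cases = [(1, 0.95), (1, 0.85), (1, 0.45), (0, 0.55),
         (0, 0.10), (0, 0.20), (1, 0.70), (0, 0.65)]

sens_all, spec_all = sensitivity_specificity(cases)
confident_cases, exclusion_rate = exclude_uncertain(cases)
sens_kept, spec_kept = sensitivity_specificity(confident_cases)
```

In this toy example a quarter of the cases are set aside, and both statistics improve on the remainder because the two excluded cases were misclassified. In a triage setting, the excluded cases would be the ones referred for expert assessment.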

The ensemble model was the most robust and highest-performing model on both the validation and test set. On the test set, it had a sensitivity of 96.0% (0.897–0.989), specificity of 86.7% (0.776–0.929), and AUC of 0.950 (0.906–0.985). When excluding the 12.7% (0.073–0.180) of cases most difficult to classify (based on the confidence score of the model output), it achieved a sensitivity of 97.1% (0.909–0.994), specificity of 93.7% (0.856–0.978), and AUC of 0.958 (0.911–0.993). For comparison, the subjective expert assessment had a sensitivity and specificity of 96.0% and 88.0%, respectively. McNemar's test [35] for paired nominal data showed no significant difference in sensitivity or specificity between the model and the subjective expert assessment.

4.1 Comments on Potential Sources of Bias

Caliper measurements, shown in Figure 4.2 for illustration, are used to measure the size of the tumour, and solid components within the tumour. These measurements are then used for surgical planning, risk of malignancy assessment, and to monitor tumour growth, in case of conservative management by long-term follow-up. Calipers were present in ∼80% of both benign and malignant images in our dataset. Although insertion of calipers is part of the standard examination procedure and independent of the diagnostic outcome, we took the precautionary measure to examine the impact of the calipers. This was done by evaluating the final ensemble model on disjoint subsets of images, with and without calipers. Since we were not interested in the absolute, but rather the relative performance, we used both the validation and test set to get more accurate results. Furthermore, only the grayscale ultrasound images were used, since very few Doppler ultrasound images contained calipers. The sensitivity and specificity of the ensemble model were marginally higher on the subset of images without calipers⁹. The two-sided Mann-Whitney U test [36] yielded p-values of 0.86 and 0.50, for the sensitivity and specificity respectively, meaning that neither the sensitivities nor the specificities were significantly different. This suggests that the presence of caliper measurements does not assist the model in predicting malignancy of tumours.

Figure 4.2: Example of ultrasound images with (right) and without (left) caliper measurements (yellow lines).

⁹ Sensitivities of (158/178) and (60/67), and specificities of (170/215) and (75/91), with and without calipers respectively.
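For reference, the fractions in footnote 9 can be converted to proportions with a few lines of arithmetic. The snippet only restates the counts given in the footnote; the variable names are illustrative.

```python
# Counts (correct / total) taken from footnote 9.
counts = {
    "sensitivity": {"with calipers": (158, 178), "without calipers": (60, 67)},
    "specificity": {"with calipers": (170, 215), "without calipers": (75, 91)},
}

rates = {
    metric: {group: k / n for group, (k, n) in groups.items()}
    for metric, groups in counts.items()
}

for metric, groups in rates.items():
    for group, rate in groups.items():
        print(f"{metric:12s} {group:17s} {rate:.3f}")
# sensitivity  with calipers     0.888
# sensitivity  without calipers  0.896
# specificity  with calipers     0.791
# specificity  without calipers  0.824
```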


5 Conclusions and Future Work

Our results show that convolutional neural networks can be used to predict ovarian malignancy with high diagnostic accuracy, comparable to that of human experts, and thus have potential in the triage of patients with ovarian tumours. Furthermore, our results are largely superior to those of Martínez-Más et al. [6], which suggests that convolutional neural networks learn complex visual patterns from the information in raw pixel data, some of which is lost when using manually selected features.

Future work includes training models from scratch or using domain adaptation by prior training on a larger dataset of images from a similar domain. Lastly, we want to emphasize the importance of external validation on images from other gynaecological centres, in order to evaluate the limitation in generalization and the potential for multi-site deployment.


References

[1] Christiansen F, Epstein E, Smedberg E, Åkerlund M, Smith K, Epstein E. (2020). Ultrasound image analysis using deep neural networks to discriminate benign and malignant ovarian tumours – a comparison to subjective expert assessment. Ultrasound Obstet Gynecol. DOI: 10.1002/uog.23530

[2] Webb PM, Jordan SJ. (2017). Epidemiology of epithelial ovarian cancer. Best Pract Res Clin Obstet Gynaecol 41:3-14. DOI: 10.1016/j.bpobgyn.2016.08.006

[3] Sharma A, Apostolidou S, Burnell M, Campbell S, Habib M, Gentry-Maharaj A, Amso N, Seif MW, Fletcher G, Singh N, Benjamin E, Brunell C, et al. (2012). Risk of epithelial ovarian cancer in asymptomatic women with ultrasound-detected ovarian masses: a prospective cohort study within the UK collaborative trial of ovarian cancer screening (UKCTOCS). Ultrasound Obstet Gynecol 40:338-344. DOI: 10.1002/uog.12270

[4] Fischerova D. (2011). Ultrasound scanning of the pelvis and abdomen for staging of gynecological tumors: a review. Ultrasound Obstet Gynecol 38:246-266. DOI: 10.1002/uog.10054

[5] Yazbek J, Ameye L, Testa AC, Valentin L, Timmerman D, Holland TK, Van Holsbeke C, Jurkovic D. (2010). Confidence of expert ultrasound operators in making a diagnosis of adnexal tumor: effect on diagnostic accuracy and interobserver agreement. Ultrasound Obstet Gynecol 35(1):89-93. DOI: 10.1002/uog.7335

[6] Martínez-Más J, Bueno-Crespo A, Khazendar S, Remezal-Solano M, Martínez-Cendán J-P, Jassim S, et al. (2019). Evaluation of machine learning methods with Fourier Transform features for classifying ovarian tumors based on ultrasound images. PLoS ONE 14(7):e0219388. DOI: 10.1371/journal.pone.0219388

[7] Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. CVPR’09. DOI: 10.1109/CVPR.2009.5206848

[8] Wu C, Wang Y, Wang F. (2018). Deep Learning for Ovarian Tumor Classification with Ultrasound Images. Advances in Multimedia Information Processing – PCM 2018. 395–406. DOI: 10.1007/978-3-030-00764-5_36

[9] Li X, Zhang S, Zhang Q, Wei X, Pan Y, Zhao J, et al. (2019). Diagnosis of thyroid cancer using deep convolutional neural network models applied to sonographic images: a retrospective, multicohort, diagnostic study. Lancet Oncol. 20(2):193-201. DOI: 10.1016/S1470-2045(18)30762-9


[10] Ko SY, Lee JH, Yoon JH, et al. (2019). Deep convolutional neural network for the diagnosis of thyroid nodules on ultrasound. Head Neck. 41:885-891. DOI: 10.1002/hed.25415

[11] Chougrad H, Zouaki H, Alheyane O. (2018). Deep convolutional neural networks for breast cancer screening. Comput. Methods Programs Biomed. 157:19-30. DOI: 10.1016/j.cmpb.2018.01.011

[12] Han S, Kang HK, Jeong JY, et al. (2017). A deep learning framework for supporting the classification of breast lesions in ultrasound images. Phys Med Biol. 62(19):7714–7728. DOI: 10.1155/2018/5137904

[13] McCulloch WS, Pitts W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5:115–133. DOI: 10.1007/BF02478259

[14] Stanford Vision and Learning Lab. CS231n: Convolutional Neural Networks for Visual Recognition - Course Materials. cs231n.github.io. (2015). GitHub repository. Available at: https://cs231n.github.io/assets/nn1/neural_net2.jpeg. [Accessed 20 May 2020]

[15] Zou J, Han Y, So S. (2008). Overview of artificial neural networks. Methods Mol Biol 458:15–23. Available at: https://link.springer.com/protocol/10.1007/978-1-60327-101-1_2 [Accessed 14 May 2020]

[16] Rumelhart D, Hinton G, Williams R. (1986). Learning representations by back-propagating errors. Nature 323:533–536. DOI: 10.1038/323533a0

[17] Fukushima K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybernetics 36:193–202. DOI: 10.1007/BF00344251

[18] Da Silva B. brandinho.github.io. (2018). GitHub repository. Available at: https://github.com/brandinho/brandinho.github.io/blob/master/images/ConvNet.gif. [Accessed 6 May 2020]

[19] Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv: 1207.0580

[20] Ioffe S, Szegedy C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML’15. arXiv: 1502.03167


[21] MATLAB & Simulink. Convolutional Neural Network. Available at: https://mathworks.com/solutions/deep-learning/convolutional-neural-network. [Accessed 7 May 2020]

[22] Raghu M, Zhang C, Kleinberg J, Bengio S. (2019). Transfusion: Understanding Transfer Learning for Medical Imaging. arXiv: 1902.07208

[23] Singh S, Shrivastava A. (2019). Evalnorm: Estimating batch normalization statistics for evaluation. arXiv: 1904.06031

[24] Pereyra G, Tucker G, Chorowski J, Kaiser L, Hinton G. (2017). Regularizing neural networks by penalizing confident output distributions. ICLR’17. arXiv: 1701.06548

[25] Guo C, Pleiss G, Sun Y, Weinberger KQ. (2017). On Calibration of Modern Neural Networks. arXiv: 1706.04599

[26] Simonyan K, Zisserman A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR’15. DOI: 10.1109/ACPR.2015.7486599, arXiv: 1409.1556

[27] He K, Zhang X, Ren S, Sun J. (2015). Deep Residual Learning for Image Recognition. CVPR’15. arXiv: 1512.03385

[28] Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. ICVS’17. arXiv: 1704.04861

[29] Timmerman D, Van Calster B, Testa A, Savelli L, Fischerova D, Froyman W, Wynants L, Van Holsbeke C, Epstein E, Franchi D, Kaijser J, Czekierdowski A, Guerriero S, Fruscio R, Leone FPG, Rossi A, Landolfo C, Vergote I, Bourne T, Valentin L. (2016). Predicting the risk of malignancy in adnexal masses based on the Simple Rules from the International Ovarian Tumor Analysis group. Am J Obstet Gynecol. 214(4):424-437. DOI: 10.1016/j.ajog.2016.01.007

[30] Zhang L, Wang X, Yang D, Sanford T, Harmon S, Turkbey B, Roth H, Myronenko A, Xu D, Xu Z. (2019). When Unseen Domain Generalization is Unnecessary? Rethinking Data Augmentation. arXiv: 1906.03347

[31] Nesterov Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady 27:372–376.

[32] Fawcett T. (2006). An introduction to ROC analysis. Pattern Recognition Letters 27:861–874. DOI: 10.1016/j.patrec.2005.10.010

[33] Brown LD, Cai TT, DasGupta A. (2001). Interval Estimation for a Binomial Proportion. Statistical Science 16(2):101–133.


[34] Efron B. (1979). Bootstrap Methods: Another Look at the Jackknife. Ann. Statist. 7(1):1-26. DOI: 10.1214/aos/1176344552

[35] McNemar Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika. 12(2):153–157. DOI: 10.1007/BF02295996

[36] Neuhäuser M. (2011). Wilcoxon–Mann–Whitney Test. International Encyclopedia of Statistical Science 1656–1658. DOI: 10.1007/978-3-642-04898-2_615


A Appendix

A.1 Histological Diagnoses

                                   ALL CASES        TEST SET
                                     (n=663)         (n=150)
                                    n      %        n      %
Benign                            325   51.3       75   50.0
  Endometrioma                     46    7.3       10    6.7
  Dermoid                          74   11.7       26   17.3
  Simple/Functional cyst           31    4.9        3    2.0
  Paraovarian cyst                 12    1.9
  Rare benign                       9    1.4        1    0.7
  (Hydro-)pyosalpinx               14    2.2
  Fibroma/Myoma                    25    3.9        5    3.3
  Cystadenoma/Cystadenofibroma    108   17.0       25   16.7
  Peritoneal/Inclusion cyst         6    0.9        2    1.3
Borderline                         55    8.7       15   10.0
  Serous                           35    5.5        8    5.3
  Mucinous                         20    3.2        7    4.7
Malignant                         254   40.1       60   40.0
  Epithelial ovarian cancer       169   26.7       38   25.3
  Non-epithelial ovarian cancer    28    4.4       10    6.7
  Metastatic ovarian cancer        57    9.0       12    8.0

Table A.1: Histological outcome from patients having undergone surgery, with n being the number of patients.

A.2 Data Augmentation Details

• Horizontal and vertical flips; zooming/cropping by 0-10% – using the data pre-processing API from Keras.

• Gaussian noise with a standard deviation of 2%; JPEG compression by 0-20% – using the Python library imgaug.

• Rotation by multiples of 90 degrees – using the rot90 function from NumPy.

• Shift by ± 20% in sharpness, brightness and contrast; ± 10% in colour – using the ImageEnhance Module from the Python imaging library Pillow.
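A subset of the augmentations listed above can be sketched with NumPy alone. This is only an illustration under stated assumptions, not the pipeline used in the study (which relied on Keras, imgaug and Pillow): the function name `augment`, the flip probability of 0.5, and the reading of "2%" as a fraction of the 8-bit intensity range are all assumptions.

```python
import numpy as np


def augment(image, rng):
    """Random flips, a random multiple-of-90-degree rotation, and Gaussian noise.

    image -- uint8 array of shape (height, width, channels)
    rng   -- numpy.random.Generator
    """
    out = image.astype(np.float32)
    if rng.random() < 0.5:          # horizontal flip (probability is an assumption)
        out = np.fliplr(out)
    if rng.random() < 0.5:          # vertical flip
        out = np.flipud(out)
    # Rotate in the image plane by k * 90 degrees, k drawn from {0, 1, 2, 3}.
    out = np.rot90(out, k=int(rng.integers(0, 4)))
    # Gaussian noise with a standard deviation of 2% of the 8-bit range (assumption).
    out = out + rng.normal(0.0, 0.02 * 255.0, size=out.shape)
    return np.clip(out, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in image
augmented = augment(image, rng)
print(augmented.shape, augmented.dtype)
```

For non-square images, a rotation by an odd multiple of 90 degrees swaps height and width, so images would need to be resized or cropped to the network input size afterwards.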

A.3 Confidence Interval for a Binomial Proportion

Estimating a confidence interval for a binomially distributed statistic might at first glance appear to be a trivial task. However, the underlying assumption of normality in the standard Wald interval is in many cases incorrect, and may lead to poor estimates and to the inclusion of invalid proportions (proportions above one or below zero) [33]. This is especially the case with small sample sizes and proportions near zero or one. A better alternative is the Jeffreys interval, by which the limits of a 100(1–α)% confidence interval for X ∼ B(n, p) are given by the α/2 and (1 − α/2) quantiles of Beta(x + 1/2, n − x + 1/2), where n and x are the number of trials and successes respectively [33].
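The Jeffreys interval can be sketched using only the Python standard library. The Beta quantiles are found here by bisection on the regularized incomplete beta function, evaluated with a standard continued-fraction expansion. This numerical route and all function names are illustrative choices for this sketch, not something prescribed by [33]; a library routine such as `scipy.stats.beta.ppf` could be used instead.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b), i.e. the Beta(a, b) CDF."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def beta_quantile(q, a, b, tol=1e-10):
    """Quantile of Beta(a, b) by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if reg_inc_beta(a, b, mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def jeffreys_interval(x, n, alpha=0.05):
    """Limits of the 100(1 - alpha)% Jeffreys interval for x successes in n trials."""
    a, b = x + 0.5, n - x + 0.5
    return beta_quantile(alpha / 2, a, b), beta_quantile(1 - alpha / 2, a, b)


# 72 of 75 corresponds to a proportion of 96.0%.
lower, upper = jeffreys_interval(72, 75)
print(f"{lower:.3f}-{upper:.3f}")
```

The example uses 72 successes in 75 trials, which matches a sensitivity of 96.0% on 75 malignant test cases, so its output can be compared with the interval reported for the ensemble model.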

A.4 Uncertainty Estimation by Bootstrapping

A standard technique to estimate the uncertainty of a sample statistic is bootstrapping, a method which relies on random sampling with replacement [34]. It is useful when parametric inference is impossible due to an unknown shape of the sample distribution.
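A minimal sketch of the resampling idea is given below, using only the standard library. The percentile rule used to turn the bootstrap replications into an interval, the function name, the number of replications, and the toy data are assumptions made for illustration; they are not taken from the study.

```python
import random
from statistics import mean


def bootstrap_interval(x, s, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the statistic s of the sample x.

    Draws B resamples of len(x) values with replacement, evaluates s on
    each, and returns the alpha/2 and 1 - alpha/2 empirical quantiles.
    """
    rng = random.Random(seed)
    N = len(x)
    replications = sorted(s(rng.choices(x, k=N)) for _ in range(B))
    lower = replications[int(B * alpha / 2)]
    upper = replications[int(B * (1 - alpha / 2)) - 1]
    return lower, upper


# Toy sample (hypothetical): 72 correct and 3 incorrect predictions.
sample = [1] * 72 + [0] * 3
low, high = bootstrap_interval(sample, mean)
print(low, high)
```

Any statistic that can be computed from a resampled list, such as the AUC computed from resampled (score, label) pairs, can be passed as `s`.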

As an illustration, let $x = (x_1, \ldots, x_N)$ be the original observation and let $s$ be the sample statistic of which the uncertainty is to be estimated. Generate a large set of $B$ independent bootstrap replications $\{x^*_b\}_{b=1}^{B}$, by randomly sampling $N$ values (with replacement) from $x$. Compute $s(x^*)$ for each $x^*$

∈ {x