DEGREE PROJECT IN ENGINEERING PHYSICS, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020
Predicting Ovarian Malignancy based on Transvaginal
Ultrasound Images using Deep Neural Networks
Filip Christiansen
Abstract
Ovarian cancer is the most lethal gynaecological malignancy; however, ovarian lesions are very common and only around 1% are malignant. Due to the large number of cases, patients are triaged by gynaecologists with highly variable diagnostic accuracy.
The aim of this study is to train and validate deep neural networks and, by comparison to subjective expert assessment, determine their potential in the triage of patients with ovarian tumours.
We used a transfer learning approach on pre-trained networks (VGG16, ResNet50, MobileNet), and a post-processing calibration to better align their confidence scores with the true certainty of their predictions. Our dataset contained 3077 transvaginal ultrasound images from 758 patients with ovarian tumours, where histological outcome from surgery or long-term follow-up (> 3 years) served as diagnostic ground truth. From our dataset, 150 cases (75 benign, 75 malignant), each containing 3 images, were held out for testing, while the remaining cases were used for training and model selection. The models were assessed based on sensitivity, specificity, and AUC, along with their corresponding 95% confidence intervals.
On the test set, our final model had a sensitivity of 96.0% (0.897–0.989), specificity of 86.7% (0.776–0.929), and AUC of 0.950 (0.906–0.985). When excluding the 12.7% (0.073–0.180) of cases most difficult to classify (based on the confidence score of the model output), our model had a sensitivity of 97.1% (0.909–0.994), specificity of 93.7% (0.856–0.978), and AUC of 0.958 (0.911–0.993). For comparison, the subjective expert assessment had a sensitivity and specificity of 96.0% and 88.0%, respectively.
We show that neural networks can be used to predict ovarian malignancy with high diagnostic accuracy, comparable to that of human experts, and thus have potential in the triage of patients with ovarian tumours.
KEYWORDS
Ultrasonography, Classification, Ovarian Neoplasm, Ovarian Tumours, Deep
Learning, Transfer Learning, Machine Learning, Computer-aided Diagnosis
Summary
Ovarian cancer has the highest mortality among gynaecological cancers. However, ovarian lesions are common and only around 1% are malignant. Because of the high prevalence, an initial local assessment (triage) is made of whether the patient should be referred for expert assessment, or whether follow-up at a local care facility is sufficient. The triage is performed by gynaecologists without expert competence in ovarian cancer, who therefore show large variation in diagnostic accuracy.
The aim of this study is to evaluate, by comparison with subjective expert assessment, the potential of artificial neural networks for the triage of women with ovarian tumours.
We used transfer learning of pre-trained models (VGG16, ResNet50, MobileNet) and a calibration method for better probabilistic agreement between the outputs of the models and their underlying diagnostic accuracy. Our image material consisted of 3077 transvaginal ultrasound images from 758 women with ovarian tumours. All cases had a confirmed diagnosis from surgery or long-term follow-up (> 3 years). Of this material, 150 cases (75 benign, 75 malignant) with 3 images each were set aside for final validation of the model, while the remaining cases were used for training and model selection. The models were assessed based on sensitivity, specificity, and AUC, together with their 95% confidence intervals.
In validation, our final model had a sensitivity of 96.0% (0.897–0.989), specificity of 86.7% (0.776–0.929), and AUC of 0.950 (0.906–0.985). When excluding the 12.7% (0.073–0.180) of cases that were most difficult to classify, our model had a sensitivity of 97.1% (0.909–0.994), specificity of 93.7% (0.856–0.978), and AUC of 0.958 (0.911–0.993). For comparison, the subjective expert assessment had a sensitivity and specificity of 96.0% and 88.0%, respectively.
Our study shows that artificial neural networks can be used to differentiate between benign and malignant ovarian tumours with high diagnostic accuracy, comparable to that of experienced physicians in the field. We therefore consider these models to have potential for use in the triage of women with ovarian tumours.
Title
Predicting Ovarian Malignancy based on Transvaginal Ultrasound Images using Deep Neural Networks*
Date
November 4, 2020
Author
Filip Christiansen <filipchr@kth.se>
School of Engineering Sciences, KTH Royal Institute of Technology
Examiner
Carlota Canalias
Department of Applied Physics, KTH Royal Institute of Technology
Supervisors
Kevin Smith, PhD, Associate Professor
School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology & Science for Life Laboratory
Elisabeth Epstein, MD, PhD, Associate Professor
Department of Clinical Science and Education, Karolinska Institute
Department of Obstetrics and Gynaecology, Södersjukhuset
* An altered and extended version of this report has been published in [1].
Contents
1 Introduction
2 Background
2.1 Artificial Neural Networks
2.2 Convolutional Layers
2.3 Training by Backpropagation
2.4 Dropout
2.5 Batch Normalization
2.6 Transfer Learning
2.7 Calibration: Temperature Scaling
3 Method
3.1 Dataset
3.2 Pre-processing
3.3 Data Augmentation
3.4 Training
3.5 Probabilistic Calibration and Model Ensembling
4 Results
4.1 Comments on Potential Sources of Bias
5 Conclusions and Future Work
References
A Appendix
A.1 Historical Diagnoses
A.2 Data Augmentation Details
A.3 Confidence Interval for a Binomial Proportion
A.4 Uncertainty Estimation by Bootstrapping
1 Introduction
Ovarian cancer is the most lethal gynaecological malignancy, with a 5-year survival rate below 45% [2]. However, ovarian lesions are very common and only around 1% are malignant [3]. Transvaginal ultrasound assessment is today the established procedure to differentiate between benign and malignant ovarian lesions [4]. Due to a large number of cases, a time-consuming diagnostic process, and a shortage of experienced sonographers, patients are triaged by gynaecologists who are not specialized in ovarian cancer and whose diagnostic accuracy varies considerably [5].
The large potential for improvement, and the successful use of computerized imaging tools for medical diagnostics in other areas in recent years, raise the question of whether such methods could be utilized in the triage of patients with ovarian tumours. There are several potential benefits to be gained from an improved triaging process. Today, the high diagnostic uncertainty and interobserver variability cause many unnecessary referrals to expert sonographers. This leads to long waiting times for patients and to unnecessary surgical procedures and medical costs. Benign tumours can be managed more conservatively, with ultrasound follow-up or less invasive and fertility-preserving surgery.
A recent study [6], using traditional machine learning methods (e.g. Linear Discriminants (LD), Support Vector Machines (SVM), Extreme Learning Machines (ELM)) on a set of manually selected feature descriptors, was able to automatically differentiate between benign and malignant ovarian tumours with a sensitivity and specificity of around 90% and 80%, respectively. However, the study was rather small, including only 384 images from 187 patients, which limits both the potential and the statistical significance of the results.
In this study, we try to improve upon the results from [6] by using a transfer learning approach on deep neural networks pre-trained on ImageNet [7]. To the best of our knowledge, only one paper [8] has been published that applies convolutional neural networks (CNN) to transvaginal ultrasound images for ovarian tumour classification. While it shows promising results (92.5% accuracy), the paper lacks transparency, as it gives no insight into how the model was tested, nor whether a separate test set was used. Furthermore, the case mix is not reported, and the ultrasound images shown in the paper seem to be of poor quality. However, transfer learning of pre-trained CNNs has been successfully used for classification of tumours in ultrasound images for other types of cancer, with a few well-designed studies on thyroid [9, 10] and breast [11, 12] cancer. These and other studies have shown that a transfer learning approach can yield similar or superior diagnostic accuracy, compared to senior sonographers. With that said, our main research question for this thesis follows:
Can pre-trained deep neural networks, fine-tuned on transvaginal ultrasound images, predict ovarian malignancy with a high enough diagnostic accuracy for use in the triage of patients with ovarian tumours?
2 Background
In this chapter, relevant theory and terminology are presented. First, in Section 2.1, Artificial Neural Networks (ANN) are introduced. Then, Section 2.2 explains the concept of convolutional layers, followed by Section 2.3, which covers backpropagation, the algorithm by which ANNs are trained. The chapter then continues with Section 2.4 on the regularization technique called dropout, followed by Section 2.5, in which batch normalization is described. Section 2.6 explains the concept of transfer learning and how it can be used to reduce the reliance on large datasets. Lastly, a calibration method called temperature scaling is described in Section 2.7.
2.1 Artificial Neural Networks
Artificial Neural Networks (ANN) are computational models inspired by the distributed information processing in biological systems [13]. A basic ANN consists of an input and output layer, separated by some number of hidden layers, which in turn are collections of "vertically" stacked artificial neurons (Figure 2.1).
Figure 2.1: Schematic illustration of an Artificial Neural Network. [14]
Each artificial neuron (Figure 2.2), or node, is an elementary computational unit, defined by its weights and biases, its activation function, and its connections to other nodes. Each node receives a set of inputs from its connected nodes in the previous layer. After multiplication by the corresponding weights, the inputs are summed up and the bias is added. The output is then obtained by transformation by a non-linear activation function. Given a set of inputs, the ANN makes a prediction by successively propagating the information through the network of interconnected nodes [15].
At model creation, weights and biases are randomly initialized. Therefore, the predictive performance of the model will initially be no better than simple guessing. To change this, ANNs are "trained" with a technique called backpropagation [16], explained with an example in Section 2.3.
Figure 2.2: Schematic illustration of an artificial neuron.
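As a minimal sketch of the computation performed by a node, the following Python snippet evaluates a single artificial neuron and a small fully connected layer. The weights, biases, and the choice of ReLU as activation function are illustrative assumptions, not values from this study.

```python
def relu(z):
    """Rectified linear unit, a common non-linear activation function."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def dense_layer(inputs, weight_rows, biases, activation=relu):
    """A layer is a collection of neurons that all receive the same inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]


# Hypothetical values, chosen only to make the arithmetic easy to follow.
x = [1.0, 2.0]
hidden = dense_layer(x, weight_rows=[[0.5, -1.0], [1.0, 1.0]], biases=[0.0, -1.0])
print(hidden)  # [0.0, 2.0]
```

The first node computes 0.5 · 1 − 1 · 2 + 0 = −1.5, which ReLU maps to 0, while the second computes 1 + 2 − 1 = 2, which passes through unchanged.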
2.2 Convolutional Layers
Convolutional layers consist of a set of stacked filters and are the main building blocks of Convolutional Neural Networks (CNN). Given an image volume (technically a tensor), each filter is convolved across the width and height of the volume, producing a two-dimensional activation map. This means that the filter is moved across the image, while at each position, the dot product is computed between the parameters of the filter and the values of the image at the current position (Figure 2.3). The output from the convolutional layer is the volume of stacked activation maps from all filters, which in turn might be the input to another convolutional layer. The use of filters with shared weights across the image results in a shift-invariant representation. The idea behind this is that the ability to recognize an object should be independent of its location within the image [17].
Figure 2.3: Schematic illustration of a convolutional filter. [18]
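The sliding dot product can be sketched in a few lines of Python. This is a simplified illustration for a single-channel image and a single filter, assuming stride 1 and no padding; the image and filter values are hypothetical. As in most deep learning libraries, the filter is not flipped, so strictly speaking the operation is a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and take the dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    activation_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            row.append(sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(kh)
                for j in range(kw)
            ))
        activation_map.append(row)
    return activation_map


# Hypothetical image with a vertical edge between columns 1 and 2,
# and a filter that responds to dark-to-bright transitions.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [
    [-1, 1],
    [-1, 1],
]
print(convolve2d(image, kernel))  # [[0, 2, 0], [0, 2, 0]]
```

Because the same filter weights are used at every position, the response of 2 appears wherever the edge is located, which is the shift invariance described above.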
2.3 Training by Backpropagation
Training an image classification model in a supervised manner requires a training dataset $\{x_i, y_i\}_{i=1}^{N}$, where each $x_i$ is an image and $y_i$ its corresponding class label.
Simplified, the model is trained by repeatedly giving it an image $x_i$ and letting it predict the conditional class probabilities
$$\hat{y}_i^j = P(y^j \mid x_i), \quad j = 1, \dots, C,$$
where $C$ is the number of possible classes. A confident correct classification of an image $x_i$ corresponds to $\hat{y}_i^j$ being close to the one-hot encoded label $y_i^j$, which is 1 if $x_i$ belongs to class $j$ and 0 otherwise. To achieve this, a loss function $L$ must be defined, which is typically the categorical cross-entropy loss, defined as
$$L(y_i, \hat{y}_i) = -\sum_{j=1}^{C} y_i^j \log(\hat{y}_i^j).$$
After a prediction has been made, the gradients of the loss function are computed with respect to the parameters of the model. Each parameter is then changed slightly in the direction opposite to the corresponding gradient, so that the model returns a more accurate prediction the next time it sees the same image. The updated parameters at time-step $t + 1$ are given by
$$\Theta^{(t+1)} = \Theta^{(t)} - \eta \nabla_{\Theta} L,$$
where $\eta$ is the learning rate and $\Theta^{(t)}$ is the set of parameters at the previous time-step $t$. This way, the model learns to classify images by automatically finding a suitable representation from the raw pixel data of seen images [16].
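The loss and the update rule can be illustrated with a deliberately tiny example. In the sketch below, the "model" is an assumption made for illustration: its only parameters are the two logits for a single image, so that the gradient of the cross-entropy loss with respect to the parameters has the well-known closed form (predicted probabilities minus one-hot label). The learning rate and number of steps are likewise arbitrary.

```python
import math


def softmax(z):
    """Convert logits to probabilities (shifted by the maximum for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_onehot, y_pred):
    """Categorical cross-entropy for one sample with a one-hot label."""
    return -sum(t * math.log(p) for t, p in zip(y_onehot, y_pred) if t > 0)


def sgd_step(theta, grad, lr):
    """Move each parameter against its gradient."""
    return [t - lr * g for t, g in zip(theta, grad)]


theta = [0.0, 0.0]   # toy parameters: the logits themselves (an assumption)
y = [0.0, 1.0]       # one-hot label: the image belongs to the second class
lr = 0.5             # illustrative learning rate

losses = []
for _ in range(20):
    y_hat = softmax(theta)
    losses.append(cross_entropy(y, y_hat))
    grad = [p - t for p, t in zip(y_hat, y)]  # gradient of the loss w.r.t. the logits
    theta = sgd_step(theta, grad, lr)

print(round(losses[0], 4), round(losses[-1], 4))
```

The loss starts at log 2 (a uniform guess over two classes) and decreases with every step as the predicted probability of the correct class grows.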
2.4 Dropout
Dropout is a regularization technique used between fully connected layers to improve classification performance by reducing overfitting to training data. During training, individual nodes are "dropped", by setting their outputs to zero, with a probability of 1 − p, so that each node is retained with probability p. To compensate for the resulting lower expected output to the next layer, the outputs of the retained nodes are divided by p. Randomly dropping nodes during training forces the network to find a more redundant representation, by reducing co-adaptation between nodes [19]. Co-adaptation means that groups of nodes learn to respond to the same input, leading to a correlated representation. By avoiding this, the use of dropout results in a more robust feature representation that generalizes better to new data. A schematic illustration of dropout is shown in Figure 2.4.
Figure 2.4: Schematic illustration of dropout.
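A minimal sketch of this scheme, often called inverted dropout, is given below. Here `keep_prob` denotes the probability that a node is retained, and the retained outputs are divided by that retention probability so that the expected output is unchanged. The function name, the seed, and the activation values are illustrative assumptions.

```python
import random


def dropout(activations, keep_prob, rng, training=True):
    """Zero each activation with probability 1 - keep_prob and rescale the rest."""
    if not training:
        # At test time, all nodes are used and no rescaling is needed.
        return list(activations)
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
activations = [1.0] * 100_000
dropped = dropout(activations, keep_prob=0.5, rng=rng)

# Each output is either 0 (dropped) or 1 / 0.5 = 2 (kept and rescaled),
# so the mean stays close to the original value of 1.
print(sum(dropped) / len(dropped))
```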
2.5 Batch Normalization
When training deep neural networks, the distribution of the inputs to each layer changes, as the parameters of the previous layers change. This is referred to as internal covariate shift, and increases the risk of divergence. It therefore requires careful parameter initialization and the use of lower learning rates, which in turn slows down training. A common method to reduce this problem is batch normalization [20]. During training, it normalizes the input to each layer over each training batch. This is done by re-centering and re-scaling the input by the mean and variance of the current batch. However, at test time, a batch dependence is not desirable, since we want the output to depend only on the input. Furthermore, the samples might have to be processed one at a time. Therefore, instead of using the means and variances of a batch, exponential moving averages (EMA) from training are used at test time, thereby keeping the means and the variances fixed. Simply normalizing the input to each layer, to have zero mean and unit variance over the batch, reduces the representational power of the network. To fix this, batch normalization is followed by a linear transformation, whose parameters are learned during training, along with the original parameters of the model.
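The difference between training and test behaviour can be sketched for a single scalar feature as follows. The class name, the momentum of 0.9, and the small constant `eps` are assumptions for illustration; the learnable scale and shift (`gamma`, `beta`) are kept at their initial values here rather than trained.

```python
class BatchNorm:
    """Batch normalization of a single scalar feature."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0        # learnable scale and shift (not trained here)
        self.ema_mean, self.ema_var = 0.0, 1.0  # moving averages used at test time
        self.momentum, self.eps = momentum, eps

    def __call__(self, batch, training):
        if training:
            # Normalize with the statistics of the current batch and update the EMAs.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            self.ema_mean = m * self.ema_mean + (1 - m) * mean
            self.ema_var = m * self.ema_var + (1 - m) * var
        else:
            # Normalize with the fixed moving averages, independent of the batch.
            mean, var = self.ema_mean, self.ema_var
        scale = self.gamma / (var + self.eps) ** 0.5
        return [scale * (x - mean) + self.beta for x in batch]


bn = BatchNorm()
train_out = bn([1.0, 3.0], training=True)  # batch mean 2.0, batch variance 1.0
test_out = bn([0.2], training=False)       # uses ema_mean = 0.2, ema_var = 1.0
print(train_out, test_out)
```

At test time a single sample can be processed on its own, since the normalization no longer depends on which other samples share its batch.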
2.6 Transfer Learning
The architecture of most neural networks for image classification consists of a convolutional base of multiple layers of stacked convolutional filters for representation learning, and a fully connected classifier. A schematic illustration is shown in Figure 2.5.
Transfer learning is the concept of taking a model that has been trained on a large dataset from one domain, the source domain, replacing the original classifier, and training only the parameters of the new classifier on a smaller dataset from another domain, the target domain. This is done by "freezing" the parameters of the original model, thereby keeping them unchanged. The idea behind this is that for the original model to be able to successfully classify images in the source domain, it must have learnt a good representation of important image features. While a large difference between the domains means that the learnt representation of the source domain may not be the most suitable representation for the target domain, this is often more than compensated for if there is limited data available in the target domain [22].
Figure 2.5: Schematic illustration of a Convolutional Neural Network for image classification. [21]
Domain adaptation is the task of adapting a model trained on data from a source domain to a target domain. When using transfer learning, domain adaptation is the motivation behind fine-tuning (parts of) the original model after having trained the new classifier. A common approach to fine-tuning is "unfreezing" the layers of the last convolutional block, one by one, while training with a low learning rate for a small number of epochs. While this might improve overall performance by finding a more suitable representation for the target domain, it also runs the risk of overfitting the training data, leading to poor generalization.
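The mechanics of freezing and unfreezing can be illustrated with a toy sketch in which each group of weights is represented by a single number. The group names, gradient values, and learning rates below are hypothetical stand-ins; in a real framework the same idea is expressed by marking layers as trainable or not.

```python
def sgd_update(params, grads, trainable, lr):
    """One gradient step that leaves frozen parameter groups untouched."""
    return {
        name: (value - lr * grads[name]) if trainable[name] else value
        for name, value in params.items()
    }


# Hypothetical scalar stand-ins for whole groups of weights.
params = {"conv_block_1": 0.8, "conv_block_5": -0.3, "classifier": 0.0}
grads = {"conv_block_1": 1.0, "conv_block_5": 1.0, "classifier": 1.0}

# Step 1: freeze the pre-trained base and train only the new classifier.
trainable = {"conv_block_1": False, "conv_block_5": False, "classifier": True}
params = sgd_update(params, grads, trainable, lr=0.002)
after_step_1 = dict(params)

# Step 2: unfreeze the last convolutional block and fine-tune with a lower learning rate.
trainable["conv_block_5"] = True
params = sgd_update(params, grads, trainable, lr=0.0002)

print(after_step_1)
print(params)
```

The early block never changes, the classifier is updated in both steps, and the last block starts moving only once it has been unfrozen.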
When using transfer learning on a model with batch normalization layers, one more thing has to be addressed, namely the fact that the means and variances of a batch, used for normalization during training in the target domain (transfer learning), will be significantly different from the exponential moving averages (EMA), used at test time, that were once computed during training in the source domain. To align the training and test behaviour of the model, the EMA statistics of all batch normalization layers have to be replaced with the corresponding statistics from the target domain [23]. This can be done by continually updating the EMA statistics during training in the target domain.
2.7 Calibration: Temperature Scaling
For a given image, the final class prediction from a CNN is given by the most probable class label. This is the argmax of the confidence scores of the prediction, which in turn are the estimated conditional class probabilities $P(\text{class} \mid \text{observation})$. These are obtained by taking the softmax, defined by
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$
for $i \in \{1, \dots, C\}$ and $z = (z_1, \dots, z_C)^T \in \mathbb{R}^C$, of the logits, which are the outputs from the last fully connected layer. In some cases, it is desirable to be able to predict not only the most probable class given an image, but also a reliable estimate of the conditional probability $P(\text{class} \mid \text{observation})$ of each class. Neural networks tend to be overconfident in their predictions, which is a result of minimizing the cross-entropy loss during training [24].
Temperature scaling is a post-processing technique to better align the confidence scores with the underlying class probabilities. This is done by dividing the logits $\{z_i\}_{i=1}^{C}$ by a scalar $T > 0$, referred to as the temperature, which is optimized to minimize the cross-entropy loss on the validation set [25]. For $T > 1$, the entropy is increased, which corresponds to less certainty in the predictions. As $T \to \infty$, the probabilities all approach $1/C$, i.e. a random guess and maximum entropy. An important aspect of temperature scaling is that, since dividing all logits by the same positive scalar preserves their ordering, it changes neither the predicted classes nor the accuracy of the model.
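The procedure can be sketched as follows. The validation logits and labels are hypothetical, constructed to mimic an overconfident binary classifier, and the simple grid search over candidate temperatures is an assumption made for brevity; any one-dimensional optimizer could be used instead.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of the logits after division by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Mean cross-entropy (negative log-likelihood) at a given temperature."""
    return -sum(
        math.log(softmax(z, temperature)[y]) for z, y in zip(logit_rows, labels)
    ) / len(labels)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation loss."""
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


# Hypothetical validation set: large logit gaps, but two of six predictions are wrong.
val_logits = [[4.0, 0.0], [3.0, 0.0], [0.0, 5.0], [4.0, 0.0], [0.0, 3.0], [5.0, 0.0]]
val_labels = [0, 1, 1, 1, 1, 0]

grid = [0.1 * k for k in range(1, 101)]  # candidate temperatures 0.1, 0.2, ..., 10.0
T = fit_temperature(val_logits, val_labels, grid)

print("fitted temperature:", round(T, 1))
print("confidence before:", round(max(softmax(val_logits[0])), 3))
print("confidence after: ", round(max(softmax(val_logits[0], T)), 3))
```

For this overconfident toy model the fitted temperature is larger than 1, which lowers the confidence scores while leaving every predicted class unchanged.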
3 Method
We used a transfer learning approach on the pre-trained deep learning models VGG16 [26], ResNet50 [27], and MobileNet [28], from the open-source library Keras, and used TensorFlow for fine-tuning the models on our dataset of transvaginal ultrasound images of ovarian tumours.
3.1 Dataset
Our dataset contained 3077 grayscale and power Doppler ultrasound images from 758 patients with ovarian tumours. All patients had undergone expert ultrasound assessment, using high-end ultrasound systems¹, at the departments for gynaecological ultrasound at Karolinska University Hospital and Södersjukhuset in Stockholm, Sweden. Eligibility criteria were surgery within 4 months of the ultrasound examination (n=633), or ultrasound follow-up for a minimum of 3 years (in case of a presumably benign lesion) (n=124). The histological diagnoses of patients who had undergone surgery are shown in Table A.1 in the Appendix. From our dataset, 150 cases (75 benign, 75 malignant), each containing 3 images, were left out for testing, while the remaining 607 cases were used for training and model selection (validation).
To ensure accurate results, only patients with a histological diagnosis obtained by surgery were included in the test set. It should be pointed out that all images from a given patient were included in the same set (training, validation, or test set).
Power Doppler ultrasound utilizes the Doppler effect² to estimate the movement of body fluids, in this case the blood flow. This is used to detect indicators of malignancy, such as presence of high blood flow and an abnormal structure of blood vessels [29] (Figure 3.1).
Figure 3.1: Example of grayscale and power Doppler ultrasound images.
¹ GE Voluson E8/E10 (5-9 MHz, 6-12 MHz), Philips iU22/EPIQ (3-10 MHz)
² The change in observed wavelength caused by movement of the source relative to the observer.
3.2 Pre-processing
Each image was first manually cropped to the region of interest, to exclude organs irrelevant to the tumour, such as urinary bladder, gut, uterus, and major blood vessels. The pre-trained networks assume the input to be tensors of shape (224, 224, 3), i.e. square images of 224 by 224 pixels with 3 colour channels (RGB). Therefore, the images were downsampled and resized accordingly³. Furthermore, the pixel values were standardized channel-wise to have zero mean and unit variance over the dataset. This was done by subtracting the mean pixel intensity and dividing by the standard deviation of the pixels for each channel. For consistency, the pixel means and standard deviations were calculated solely on the training dataset. These same values were then also applied to the validation and test sets.
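The channel-wise standardization can be sketched as below. The data layout (an image as a flat list of pixels, each pixel a tuple of channel values) and the pixel values themselves are simplifying assumptions for illustration. The essential point is that the statistics are computed once on the training images and then reused unchanged for validation and test images.

```python
from statistics import mean, pstdev


def channel_stats(images):
    """Per-channel mean and standard deviation over all pixels of all images."""
    n_channels = len(images[0][0])
    stats = []
    for c in range(n_channels):
        values = [pixel[c] for image in images for pixel in image]
        stats.append((mean(values), pstdev(values)))
    return stats


def standardize(image, stats):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return [
        tuple((value - m) / s for value, (m, s) in zip(pixel, stats))
        for pixel in image
    ]


# Hypothetical training set: one image with two RGB pixels.
train_images = [[(0.0, 10.0, 100.0), (2.0, 30.0, 300.0)]]
stats = channel_stats(train_images)          # computed on training data only

train_std = standardize(train_images[0], stats)
test_std = standardize([(1.0, 20.0, 200.0)], stats)  # same statistics reused
print(stats)
print(train_std, test_std)
```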
3.3 Data Augmentation
We performed data augmentation during training for model generalization, by expanding the available training data and mimicking shifts in image properties in unseen domains. The transformations used in the augmentation process can be divided into three main categories [30], based on which aspect of the image is altered. These are image quality, spatiality, and appearance. Low image quality is mainly characterised by blurriness and low resolution, caused by scanner motion, low scanner resolution, or lossy image compression. We tried to imitate this by adding Gaussian noise, JPEG compression, and shifts in sharpness. The transformations related to the spatial shape were flips, rotations, crops, and scaling. They served the purpose of simulating variability in the shapes and positions of organs, the size of patients, and cropping to the region of interest. The appearance-related transformations used were shifts in brightness, contrast, and colour, mimicking differences between ultrasound systems and settings.⁴
3.4 Training
First, the original multi-class classifier of each model was replaced by a binary classifier, consisting of a fully connected layer of 1024 hidden nodes (512 for MobileNet), followed by ReLU activation⁵, dropout of 0.5 (0.2 for VGG16), and a last fully connected layer with two nodes and softmax activation for "semi"-probabilistic⁶ output. The hyperparameters, such as the learning rate, dropout rate, and the number of hidden nodes in the fully connected layer, were set empirically to maximize the performance on the validation set.
³ Using nearest-neighbour interpolation.
⁴ For a more detailed description, see Section A.2 of the Appendix.
⁵ A common non-linear activation function.
⁶ The model's confidence in the output does not fully align with the underlying probability.
In the first training step, all weights of the original model were frozen (except for the exponential moving average (EMA) statistics of the batch normalization layers in ResNet50 and MobileNet), so that only our new classifier was trained. In this step, we used an initial learning rate of 0.002 for VGG16, and 0.02 for ResNet50 and MobileNet. Then, the layers of the last convolutional block were unfrozen and fine-tuned one by one, until no further improvement could be seen. This meant that, for VGG16, the last three convolutional layers were fine-tuned with an initial learning rate of $2 \cdot 10^{-4}$, while in ResNet50 and MobileNet only the classifier was trained and the EMA statistics of the batch normalization [20] layers were updated. Even for VGG16, fine-tuning the last convolutional layers led to only a slight improvement.
All networks were trained using backpropagation [16] by Stochastic Gradient Descent (SGD) with a Nesterov momentum (NAG) [31] of 0.9. A constant batch size of 32 samples was used throughout training, as we did not see any noticeable effect of altering the batch size. The imbalance in the number of Doppler and grayscale images, from benign and malignant cases, was addressed by training with weighted binary cross-entropy loss. We used a learning rate decay of 0.5 after every four consecutive epochs without improvement in validation accuracy. Training was stopped after 15 consecutive epochs without improvement in validation loss, which was also the metric used to store the best model weights. We also experimented with training separate models for Doppler and grayscale ultrasound images, but this idea was abandoned due to inferior performance.
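The learning rate decay and early stopping rules can be sketched as a small function that replays recorded validation metrics. This is an illustration under stated assumptions rather than the training code used in the study: the function name and the metric values in the example are hypothetical, and resetting the counter after each decay is an assumption about how "every four consecutive epochs" is counted.

```python
def run_schedule(val_accuracy, val_loss, initial_lr,
                 decay=0.5, lr_patience=4, stop_patience=15):
    """Return (learning rate per epoch, epoch of best weights, stopping epoch or None)."""
    lr = initial_lr
    best_acc, epochs_since_acc = float("-inf"), 0
    best_loss, epochs_since_loss = float("inf"), 0
    best_epoch = None
    lrs = []
    for epoch, (acc, loss) in enumerate(zip(val_accuracy, val_loss)):
        lrs.append(lr)
        # Halve the learning rate after lr_patience epochs without better accuracy.
        if acc > best_acc:
            best_acc, epochs_since_acc = acc, 0
        else:
            epochs_since_acc += 1
            if epochs_since_acc == lr_patience:
                lr *= decay
                epochs_since_acc = 0
        # Store the best weights on lower loss; stop after stop_patience epochs without it.
        if loss < best_loss:
            best_loss, best_epoch, epochs_since_loss = loss, epoch, 0
        else:
            epochs_since_loss += 1
            if epochs_since_loss == stop_patience:
                return lrs, best_epoch, epoch
    return lrs, best_epoch, None


# Hypothetical metrics: accuracy stalls for four epochs, loss keeps improving.
lrs, best_epoch, stopped_at = run_schedule(
    val_accuracy=[0.7, 0.8, 0.8, 0.8, 0.8, 0.8, 0.9],
    val_loss=[1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    initial_lr=0.002,
)
print(lrs, best_epoch, stopped_at)
```

In this example the learning rate is halved from 0.002 to 0.001 before the last epoch, the best weights come from the last epoch, and training is not stopped early.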
3.5 Probabilistic Calibration and Model Ensembling
As a post-processing procedure, the models were calibrated using temperature scaling [25], described in Section 2.7, in order for their confidence scores to be better aligned with the true underlying certainty in their predictions. The model calibration is best visualised using reliability diagrams, in which the accuracy is plotted against the model confidence. For an overconfident model, the confidence score will exceed the expected accuracy. In Figure 3.2, an example of this is shown for the VGG16-based model, where the accuracy falls below the dashed line for the uncalibrated model to the left.
Having reliable estimates of the conditional class probabilities $P(\text{class} \mid \text{observation})$ is desirable when building an ensemble, as it reduces the problem of differences in (over-)confidence between the models. For the ensemble, we used a soft voting scheme of averaging the probabilities from the models (VGG16, ResNet50, and MobileNet), after independent calibration of each model. An ensemble of models is used in the hope of obtaining better predictive performance than any of the models by itself.
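Soft voting amounts to averaging the per-case probabilities of the individual models. The sketch below uses hypothetical calibrated probabilities of malignancy for three cases; the function name and the decision threshold of 0.5 are assumptions for illustration.

```python
def soft_vote(prob_lists):
    """Average the predicted probabilities of several models, case by case."""
    n_models = len(prob_lists)
    return [sum(case_probs) / n_models for case_probs in zip(*prob_lists)]


# Hypothetical calibrated probabilities of malignancy for three cases.
vgg16 = [0.9, 0.4, 0.2]
resnet50 = [0.8, 0.7, 0.1]
mobilenet = [0.7, 0.7, 0.3]

ensemble = soft_vote([vgg16, resnet50, mobilenet])
predictions = [int(p >= 0.5) for p in ensemble]  # 1 = malignant, 0 = benign
print([round(p, 2) for p in ensemble], predictions)
```

For the second case, one model leans towards benign while the other two lean towards malignant; the averaged probability of about 0.6 reflects both the majority view and the remaining uncertainty.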
To reduce overconfident incorrect predictions, we also experimented with the use of test-time augmentation, which is the process of creating multiple augmented copies of each image at test time and taking the average of the corresponding predictions. While it led to some improvement when predicting malignancy of tumours based on single images, it did not result in any significant improvement on a case-by-case basis.
Figure 3.2: Reliability diagram for the VGG16-based model before and after calibration. The confidence is the predicted conditional probability for the most probable class. Since the classification is binary, the confidence is always at least 0.5.
4 Results
The models were assessed based on sensitivity, specificity, and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve, along with their corresponding 95% confidence intervals. Sensitivity and specificity are the proportions of correctly classified positive and negative samples⁷, respectively. The ROC curve is the result of plotting the true positive rate (TPR) against the false positive rate (FPR)⁸. The AUC, which is simply the corresponding integral, is a common statistic for model comparison, and is equivalent to the probability that the classifier will rank a randomly chosen positive sample higher than a randomly chosen negative sample [32]. The ROC curves of the trained models are shown in Figure 4.1. The confidence intervals for the sensitivity and specificity were estimated by the Jeffreys interval [33], described in Section A.3 of the Appendix, while the confidence intervals for the AUC and the proportion of excluded cases were obtained by bootstrapping [34], described in Section A.4 of the Appendix.
Figure 4.1: Receiver Operating Characteristic (ROC) curves for the trained models, when including all samples from the test set.
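These statistics can be computed with a few lines of Python. The sketch below uses hypothetical labels and scores (1 = malignant, 0 = benign), computes the AUC directly from its ranking interpretation, and estimates a confidence interval with a simple percentile bootstrap. The percentile variant, the number of resamples, and the seed are assumptions; the procedure actually used is the one described in Section A.4 of the Appendix.

```python
import random


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Proportions of correctly classified positive and negative samples."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / labels.count(1), tn / labels.count(0)


def auc(labels, scores):
    """Probability that a random positive is ranked above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            values.append(metric(ys, [scores[i] for i in idx]))
    values.sort()
    return values[int(alpha / 2 * n_boot)], values[int((1 - alpha / 2) * n_boot) - 1]


# Hypothetical test set of eight cases.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

print(sensitivity_specificity(labels, scores))  # (0.75, 0.75)
print(auc(labels, scores))                      # 0.9375
print(bootstrap_ci(labels, scores, auc))
```

Of the 16 positive-negative pairs, 15 are ranked correctly, which gives the AUC of 15/16 = 0.9375.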
The models, VGG16, ResNet50, and MobileNet, were trained individually and optimized based on their performance on the validation set. The performance statistics on the test set for each model, and for an ensemble of the models, are shown in Table 4.1.
⁷ In this context, a positive/negative sample is a malignant/benign sample, respectively.
⁸ TPR = sensitivity, FPR = 1 − specificity
Model      AUC    Sensitivity  Specificity
VGG16      0.936  0.920        0.853
ResNet50   0.933  0.893        0.840
MobileNet  0.950  0.947        0.867
Ensemble   0.950  0.960        0.867
Table 4.1: Performance statistics of the models on the test set.
In Table 4.2, the performance statistics are shown for the same models, when excluding cases with a predicted probability of malignancy between 0.4 and 0.6, corresponding to high uncertainty. Model testing with exclusion of uncertain cases is motivated by triage of patients being the most likely near-future use case for these models.
Model      AUC    Sensitivity  Specificity  Exclusion
VGG16      0.946  0.971        0.879        0.107
ResNet50   0.948  0.945        0.951        0.227
MobileNet  0.960  0.954        0.906        0.140
Ensemble   0.958  0.971        0.937        0.127
Table 4.2: Performance statistics of the models on the test set, when excluding cases with a predicted probability of malignancy between 0.4 and 0.6, corresponding to high uncertainty.
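The exclusion step can be sketched as a simple filter on the predicted probability of malignancy. The labels and probabilities below are hypothetical, and treating the interval bounds 0.4 and 0.6 as inclusive is an assumption.

```python
def exclude_uncertain(labels, probs, low=0.4, high=0.6):
    """Drop cases whose predicted probability lies in the uncertain interval."""
    kept = [(y, p) for y, p in zip(labels, probs) if not (low <= p <= high)]
    exclusion_rate = 1 - len(kept) / len(labels)
    return [y for y, _ in kept], [p for _, p in kept], exclusion_rate


# Hypothetical predictions for ten cases (1 = malignant, 0 = benign).
labels = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
probs = [0.95, 0.55, 0.80, 0.10, 0.45, 0.30, 0.35, 0.70, 0.90, 0.20]

kept_labels, kept_probs, exclusion_rate = exclude_uncertain(labels, probs)
print(kept_labels, kept_probs, round(exclusion_rate, 2))
```

Two of the ten cases fall inside the interval and are set aside, for example for referral to expert assessment, while the performance statistics are computed on the remaining eight.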
The ensemble model was the most robust and best performing model on both the validation and test set. On the test set, it had a sensitivity of 96.0% (0.897–0.989), specificity of 86.7% (0.776–0.929), and AUC of 0.950 (0.906–0.985). When excluding the 12.7% (0.073–0.180) of cases most difficult to classify (based on the confidence score of the model output), it achieved a sensitivity of 97.1% (0.909–0.994), specificity of 93.7% (0.856–0.978), and AUC of 0.958 (0.911–0.993). For comparison, the subjective expert assessment had a sensitivity and specificity of 96.0% and 88.0%, respectively. McNemar's test [35] for paired nominal data showed no significant difference in sensitivity or specificity between the model and the subjective expert assessment.
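McNemar's test compares two classifiers on the same cases by looking only at the discordant pairs, i.e. the cases that one classified correctly and the other did not. The sketch below implements the exact (binomial) two-sided variant; which variant was used in this study is not stated here, so this choice is an assumption, and the counts in the example are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant pair counts.

    b: cases the first classifier got right and the second got wrong.
    c: cases the second classifier got right and the first got wrong.
    """
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis, the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, not taken from this study.
print(mcnemar_exact(2, 8))  # 0.109375
print(mcnemar_exact(3, 3))  # 1.0
```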
4.1 Comments on Potential Sources of Bias
Caliper measurements, shown in Figure 4.2 for illustration, are used to measure the size of the tumour and of solid components within the tumour. These measurements are then used for surgical planning, risk of malignancy assessment, and to monitor tumour growth, in case of conservative management by long-term follow-up. Calipers were present in ∼80% of both benign and malignant images in our dataset. Although insertion of calipers is part of the standard examination procedure and independent of the diagnostic outcome, we took the precautionary measure of examining the impact of the calipers. This was done by evaluating the final ensemble model on disjoint subsets of images, with and without calipers. Since we were not interested in the absolute, but rather the relative performance, we used both the validation and test set to get more accurate results. Furthermore, only the grayscale ultrasound images were used, since very few Doppler ultrasound images contained calipers. The sensitivity and specificity of the ensemble model were marginally higher on the subset of images without calipers⁹. The two-sided Mann-Whitney U test [36] yielded p-values of 0.86 and 0.50 for the sensitivity and specificity, respectively, meaning that neither the sensitivities nor the specificities were significantly different. This suggests that the presence of caliper measurements does not assist the model in predicting malignancy of tumours.
Figure 4.2: Example of ultrasound images with (right) and without (left) caliper measurements (yellow lines).
9