
Semantic segmentation using convolutional neural networks to facilitate motion tracking of feet: For real-time analysis of perioperative microcirculation images in patients with critical limb threatening ischemia


Department of Biomedical Engineering, Linköping University, 2021

Semantic segmentation using convolutional neural networks to facilitate motion tracking of feet

For real-time analysis of perioperative microcirculation images in patients with critical limb threatening ischemia


Martin Hulterström & Andreas Öberg
LIU-IMT-TFK-A—21/587–SE

Supervisor: Martin Hultman
IMT, Linköpings universitet

Examiner: Ingemar Fredriksson
IMT, Linköpings universitet

Division of Medical Optics
Department of Biomedical Engineering
Linköping University
SE-581 83 Linköping, Sweden


This thesis investigates the use of Convolutional Neural Networks (CNNs) to perform semantic segmentation of feet during endovascular surgery in patients with Critical Limb Threatening Ischemia (CLTI). It is currently being investigated whether objective assessment of perfusion can aid surgeons during endovascular surgery. By segmenting feet, it is possible to perform automatic analysis of perfusion data, which could give information about the impact of the surgery in specific Regions of Interest (ROIs).

The CNN was developed in Python with a U-net architecture, which has been shown to be state of the art when it comes to medical image segmentation. An image set containing approximately 78 000 images of feet and their ground truth segmentation was manually created from 11 videos taken during surgery, and one video taken on three healthy test subjects. All videos were captured with a Multi-Exposure Laser Speckle Contrast Imaging (MELSCI) camera developed by Hultman et al. [1]. The best performing CNN was an ensemble model consisting of 10 sub-models, each trained with different sets of training data.

An ROI tracking algorithm was developed based on the U-net output, by taking advantage of the simplicity of edge detection in binary images. The algorithm converts images into point clouds and calculates a transformation between two point clouds with the use of the Iterative Closest Point (ICP) algorithm. The result is a system that performs automatic tracking of manually selected ROIs, which enables continuous measurement of perfusion in the ROIs during endovascular surgery.


First, we would like to give a special thanks to our supervisor Martin Hultman, and our examiner Ingemar Fredriksson, for their tireless support and input in our work throughout this thesis. Your enthusiasm towards this project has given us a feeling that our accomplishments actually can make a difference. A big thanks to Mark Selles for the great input regarding our implementation of the ROI tracking algorithm. We would also like to thank Jimmy Bakker and Bryan Wermelink for providing us with extra test data and allowing us to get an insight into the research currently being done at Perimed and the University of Twente. Finally, a big thanks to Anders Eklund and David Abramian. In the darkest of nights your help has served as a beacon of knowledge filled with helpful Linux commands and insightful information about U-nets. Also, thanks for letting us borrow your computers, please don't send the electric bill though.

Linköping, May 2021
Martin Hulterström & Andreas Öberg


Abbreviations

Abbreviation  Meaning
AUC           Area Under ROC Curve
ANN           Artificial Neural Network
CLTI          Critical limb threatening ischemia
CNN           Convolutional neural network
DFU           Diabetic foot ulcer
ICP           Iterative closest point
LDF           Laser Doppler flowmetry
LOOCV         Leave one out cross validation
LSCI          Laser speckle contrast imaging
MELSCI        Multi-exposure laser speckle contrast imaging
PAD           Peripheral arterial disease
ROI           Region of interest
ROC           Receiver operating characteristics


Contents

Notation
1 Introduction
  1.1 Background
  1.2 Aim
  1.3 Thesis Questions
  1.4 Limitations
2 Related work and Theory
  2.1 Critical Limb Threatening Ischemia (CLTI)
  2.2 Multi-Exposure Laser Speckle Contrast Imaging (MELSCI)
  2.3 Deep Learning
    2.3.1 Image segmentation
    2.3.2 Skip connections
    2.3.3 U-net
    2.3.4 Spatial dropout
    2.3.5 K-fold Cross Validation
    2.3.6 Model ensemble
    2.3.7 Model performance
    2.3.8 ROC analysis
    2.3.9 Data Augmentation
  2.4 ROI tracking
  2.5 Iterative Closest Point (ICP) algorithm
3 Materials and method
  3.1 Construction of the dataset
    3.1.1 Constructing Dataset 1
    3.1.2 Constructing Dataset 2
  3.2 Semantic segmentation
    3.2.1 Network architecture
    3.2.2 Pre-processing
    3.2.3 Data augmentation
    3.2.4 Training
    3.2.5 Optimization of hyperparameters
    3.2.6 ROC analysis
    3.2.7 Semantic segmentation evaluation
  3.3 ROI tracking
    3.3.1 ROI tracking algorithm evaluation
4 Results
  4.1 Optimization of hyperparameters
  4.2 Semantic segmentation
    4.2.1 Layer outputs
    4.2.2 Ensemble models evaluation
  4.3 ROC analysis results
  4.4 ROI tracking
  4.5 Real-time feasibility
5 Discussion
  5.1 Creating the dataset
  5.2 CNN performance
  5.3 ROI tracking
  5.4 Future work
6 Conclusion
  6.1 Research questions
Bibliography


1 Introduction

This thesis will cover the use of CNNs to segment feet and perform ROI tracking of real-time MELSCI images during endovascular surgery in patients with CLTI. This chapter provides a background to the problem at hand, presents a problem description and a motivation of the benefit of solving the problem, as well as the limitations.

1.1 Background

CLTI is a severe form of PAD with a high mortality rate [2], and patients with the condition often need surgical treatment. Patients with CLTI have a mortality rate of 50% at 5 years and 70% at 10 years [3]. PAD is caused by atherosclerosis (plaque buildup) that causes a decrease in blood flow to the peripheral arteries [4]. The reduced blood flow to limbs can cause necrosis and gangrene, which can lead to amputation and death if not treated [5]. The goals of the treatment of this disease are to relieve ischemic pain, heal ischemic ulcers, prevent limb loss, improve patient function and quality of life, and prolong survival. For some CLTI patients that can tolerate surgical procedures, this can be achieved by performing endovascular surgery [3]. Endovascular surgery is a procedure that is used to treat problems affecting blood vessels, such as CLTI. During the surgery, a small incision near the hip is made to access the blood vessels. A catheter is then inserted into the vessel and navigated to the affected location, where it helps treat the blocked blood vessel and increase the blood flow. [6]

Research is currently being conducted with the aim of investigating whether it is beneficial for the surgeon to receive an objective assessment of the blood flow in the affected tissue. If the blood flow is increasing in the affected area during the procedure, it could mean that the treatment is working. Hultman et al. have developed a MELSCI camera to measure these changes. This technique illuminates tissue with a laser, which gives rise to a speckle pattern that is detected by a camera. The speckle pattern gives information about the movement of red blood cells in tissue, which can be used to estimate perfusion. [1]

The MELSCI camera developed by Hultman et al. provides continuous data on the blood flow and perfusion of the tissue in real-time [1]. However, due to the way the MELSCI perfusion is calculated, the image background can falsely show high perfusion, leading to visual noise and unappealing images. More importantly, this also prevents background removal through a simple thresholding algorithm.

Deep learning has been shown to achieve promising results in semantic segmentation of medical images, especially using a CNN architecture called U-net [7]. By feeding these kinds of networks images of feet and their ground truth semantic segmentation, it should be possible to train the network to segment feet from arbitrary images taken with the MELSCI camera, such that the background is removed. Also, it is desirable to automatically track ROIs of the foot to ensure correct collection of data. It would therefore be beneficial to use the segmentation provided by the network to facilitate ROI tracking.

1.2 Aim

The aim of the thesis is to train a CNN to perform foot segmentation and ROI tracking during CLTI surgery. If the method can be successfully implemented it will contribute to the research done by Hultman et al. [1] to determine the importance of real-time feedback from the microcirculation during endovascular surgery.

1.3 Thesis Questions

This thesis will answer the following questions:

1. Can foot segmentation of MELSCI camera images be done with a CNN?
2. Can the CNN output be used to facilitate automatic ROI tracking during CLTI surgery?

1.4 Limitations

This thesis is limited to training a CNN to segment feet only. It would have been interesting to investigate whether it would be possible to segment different body parts with the same network. However, the aim of the research that is currently being conducted focuses on measuring perfusion changes in feet during endovascular surgery. It was therefore chosen to only focus on implementing a solution for this situation, and not one which could have been used on several different body parts.


2 Related work and Theory

The following chapter covers relevant theory about CLTI, endovascular surgery, MELSCI, Deep Learning and ROI tracking that is necessary to understand and review the content presented in this thesis.

2.1 Critical Limb Threatening Ischemia (CLTI)

CLTI, also known as Critical Limb Ischemia (CLI), is the end stage of PAD and is defined as limb pain which occurs at rest, or impending limb loss, due to a severe decrease in blood flow [3]. The decrease in blood flow eventually results in uncontrolled cell death, necrosis, due to insufficient nutrition and oxygen. Necrosis is the process of uncontrolled breakdown of enzymes in the cell, which results in the cell disintegrating. [8] A significant decrease in blood flow can cause a life threatening condition called gangrene, where the skin in large regions undergoes necrosis [5]. CLTI is also associated with impaired wound healing. Diabetic foot ulcer (DFU) is a complication of diabetes mellitus where wound healing is impaired, and it is viewed as a major source of morbidity in diabetes patients. It is also a leading cause of lower limb amputations and has been estimated to be one of the top ten global burdens of disease. [9, 10] The complication is related to PAD, and restoration of skin perfusion is one of five major factors in DFU treatment [10]. Endovascular surgery has gained acceptance as a method to improve ulcer healing through revascularization [11]. The surgery is carried out by making a small incision near the hip or in the groin to insert a catheter into the artery to remove blockages and increase blood flow [12]. As previously described in Section 1.1, it can be beneficial for the surgeon to get an objective assessment of the perfusion for evaluation of the improvement.


2.2 Multi-Exposure Laser Speckle Contrast Imaging (MELSCI)

MELSCI is a perfusion imaging technique developed to address shortcomings of LSCI, in which tissue is illuminated by laser light and a camera detects the resulting speckle pattern. Coherent light produces random interference patterns, known as speckle patterns, when it illuminates diffusive objects. For stationary objects, like tissue cells, the speckle pattern is stationary. If there is movement in tissue, such as red blood cells, there are fluctuations in the speckle pattern which cause blurring and a reduction in speckle contrast. The variation in the speckle pattern occurs due to Doppler frequency shifts of the light depending on the velocity of the moving particles, i.e., the red blood cells. The speckle contrast C is calculated as:

C(x, y) = σ_N / ⟨I_N⟩    (2.1)

where C(x, y) denotes the speckle contrast, σ_N the spatial standard deviation, and ⟨I_N⟩ the average intensity of the neighbourhood N around pixel P(x, y). This means that flow speeds can be coded as speckle contrast variations. [13, 14] However, problems with calculating perfusion occur due to non-linearities which are dependent on flow speed and blood concentration [13]. MELSCI solves this problem by using multiple exposure times, allowing more advanced models for estimating perfusion from the contrast. By using a high-speed camera which captures images at a short exposure time, several images can be added successively in post-processing to synthetically produce multiple exposures. The technology is able to produce 15.6 frames per second with an exposure time of 64 milliseconds. A more detailed explanation of how these images are generated can be found in the article "Real-time video-rate perfusion imaging using multi-exposure laser speckle contrast imaging and machine learning". [1] Perfusion is then calculated with the use of an Artificial Neural Network (ANN) which has been trained to map MELSCI speckle contrast to laser Doppler flowmetry (LDF) perfusion. This is based on simulated tissue models, since it is impossible to measure both MELSCI and LDF at the same place and time. The details on how this is done can be found in the article "Machine learning in multiexposure laser speckle contrast imaging can replace conventional laser Doppler flowmetry" [15].
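To make equation 2.1 concrete, the following is a minimal numpy/scipy sketch (not from the thesis) that computes the speckle contrast of a single speckle image; the neighbourhood size n is an assumed example value:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def speckle_contrast(image, n=7):
    """Equation 2.1: C(x, y) = sigma_N / <I_N> over an n x n neighbourhood N."""
    image = image.astype(np.float64)
    mean = uniform_filter(image, size=n)           # <I_N>
    mean_sq = uniform_filter(image ** 2, size=n)   # <I_N^2>
    var = np.clip(mean_sq - mean ** 2, 0.0, None)  # sigma_N^2, clipped against rounding
    return np.sqrt(var) / (mean + 1e-12)           # small eps avoids division by zero
```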

An example of an image taken by the MELSCI camera can be seen in figure 2.1, which displays an example of an intensity image and a perfusion image. Red pixels correspond to areas of high perfusion and blue pixels correspond to areas where perfusion is low. It is noticeable that the camera detects false perfusion in the background. This can be prevented by separating the foot from the background and only performing perfusion measurements on the pixels connected to the foot. However, this can be hard to achieve using conventional image processing techniques, since feet change in appearance. Deep Learning networks are instead a more robust method for image segmentation, since these networks can be trained to segment specific objects.



Figure 2.1: Images taken with the MELSCI camera developed by Hultman et al. [1] during surgery. The image to the left displays an example of an intensity image and the image to the right displays an example of a perfusion image. The intensity images were the ones used in this thesis to train a CNN to segment feet.

2.3 Deep Learning

Deep learning is a sub-area of machine learning that includes algorithms inspired by the function and structure of the human brain, called ANNs. These kinds of networks have proven very efficient in solving tasks regarding speech processing and image processing. [16]

2.3.1 Image segmentation

When dealing with images in practice, it is seldom the case that the whole image contains useful information. It is more common that only some areas of the image contain the useful information. Image segmentation is an area in image processing and computer vision where the image is divided into smaller regions, which enables the extraction of the areas of the image that are considered useful for the application at hand. [17] Semantic segmentation is a sub-area of image segmentation where each pixel is classified with a label that belongs to a certain class, where the number of labels depends on the application [18].

2.3.2 Skip connections

The backpropagation algorithm iteratively optimizes the weights in a neural network with respect to the gradient of the loss function. This is done by calculating the partial derivatives of the loss function with respect to the weights using the chain rule. [19] The loss function will gradually be minimized when repeating the backpropagation algorithm until the loss has stopped decreasing or until some user-defined criterion is met. In practice, the partial derivatives are often smaller than 1.0, which implies that the gradient gets smaller as the algorithm propagates backwards through the network, especially for deeper networks. This is the very essence of the vanishing gradient problem. If the gradient becomes very small, or even zero, the earlier layers do not get optimized at all. [20] To avoid this, one can use so-called skip connections through non-sequential layers. A skip connection connects the output of one layer to the input of a later layer [21]. This connection can be added either through addition or concatenation, each of which comes with its own benefits. The concatenated skip connection is used with the aim to reuse features, simply by concatenating them with layers deeper in the network, thus allowing a larger amount of information to be preserved from earlier layers. [22]
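As an illustration, a concatenated skip connection is a single layer choice in Keras; this is a generic sketch with arbitrary layer sizes, not the thesis network:

```python
from tensorflow.keras import layers

inputs = layers.Input(shape=(128, 128, 1))
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
skip = x                                    # output saved for the skip connection
x = layers.MaxPooling2D(2)(x)               # deeper, lower-resolution path
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(64, 3, strides=2, padding="same")(x)  # back up
x = layers.Concatenate()([x, skip])         # concatenated skip connection
```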

2.3.3 U-net

Deep learning has been shown to achieve promising results in image classification, especially using convolutional neural networks (CNNs). Even though CNNs have existed for several years, it was not until Krizhevsky et al. [23] showed their impressive results on image classification on the ImageNet dataset, with 1 million training images, that the technique became more popular. Ronneberger et al. have shown that CNNs can be used to perform semantic segmentation very efficiently on medical images as well. Their network architecture, called U-net, is based on the work of Long et al. [24] and is illustrated in figure 2.2.


Figure 2.2: An example of a U-net architecture.

The architecture consists of an encoder (left side) and a decoder (right side). The encoder is constructed like a conventional neural network: it successively aggregates the semantic information while reducing the spatial information about the features of the image [7]. Structurally, the encoder is constructed as a repeated application of two unpadded convolutions, each followed by a ReLU activation, and a down-sampling in the shape of a 2x2, stride 2, max pooling operation. The number of feature channels is doubled in every down-sampling step [25].

Spatial information in the image is recovered in the decoding part of the network, which receives its information from the part at the bottom of the network, called the bridge. Every part of the decoder consists of an up-sampling which halves the number of feature channels. The decoder parts also include a concatenation with higher-resolution feature maps through skip connections from the same depth in the encoder (illustrated as gray arrows in figure 2.2). The last part of each decoder step consists of two convolutions, each followed by a ReLU activation [7]. The final layer is a 1x1 convolution that maps the final feature vector to the desired number of classes [25].

2.3.4 Spatial dropout

In deep learning, dropout is used to reduce the risk of overfitting. It is a regularization method that randomly ignores or "drops out" some number of nodes during training. In this way, each training iteration uses a slightly different network topology, whereas during validation and testing all feature maps are used. The dropout method thereby forces the neural network to learn features of the data that are useful in conjunction with several different subsets of neurons, making the learned features more robust. [26] Since images contain strong spatial correlation, the feature maps of a CNN are also strongly correlated. This causes the regular dropout method to be ineffective when dealing with images. Unlike regular dropout, the spatial dropout method randomly ignores whole feature maps instead of just single nodes, which has proven to be very effective in the training process of CNNs. [27] An example of the dropout function can be seen in Figure 2.3.

Figure 2.3: The image to the left shows a network with two hidden layers. The image to the right shows a network produced after applying dropout.
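In Keras the difference is again a single layer choice; a minimal sketch with an assumed dropout rate of 0.2:

```python
from tensorflow.keras import layers

inputs = layers.Input(shape=(160, 128, 1))
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
# layers.Dropout(0.2) would zero individual activations, which neighbouring,
# strongly correlated pixels can compensate for; SpatialDropout2D instead
# drops entire feature maps (channels) with probability 0.2 during training.
x = layers.SpatialDropout2D(0.2)(x)
```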

2.3.5 K-fold Cross Validation

K-fold cross validation is a method that is used to reduce overfitting and help generalize the model. In k-fold cross validation, k models are each trained with a unique subset (fold) of the dataset and then validated using the complementary subset of the dataset. The principle of k-fold cross validation is illustrated in figure 2.4. A version of this technique is called Leave One Out Cross Validation (LOOCV). In this version, k models are trained in the k-fold manner on the whole dataset, but each leaves one piece of data out of the training process. For example, if one has n data points, the LOOCV method results in an n-fold cross validation where one data point is used for validation. One of the advantages of LOOCV is that all the data is used for training at some point, which results in a low bias. [28]


Figure 2.4: The principle of K-fold cross validation.
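A k-fold split can be produced in a few lines with scikit-learn (an assumed tooling choice; the thesis does not state how its folds were generated):

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(100)                 # stand-ins for the image indices
kfold = KFold(n_splits=10)
for fold, (train_idx, val_idx) in enumerate(kfold.split(data)):
    # one model per fold: train on train_idx, validate on val_idx;
    # with n_splits == len(data) this reduces to LOOCV
    print(f"fold {fold}: {len(train_idx)} train, {len(val_idx)} validation")
```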

2.3.6 Model ensemble

When training a neural network, there often exist multiple sub-optimal local minima, which in turn cause the network to acquire a sub-optimal performance. [29] An ensemble refers to a set of individually trained models or classifiers that are combined into a single output which performs better than a single network. This can be done in many ways; an effective way is simply to calculate the average of each model's output and make predictions based upon this. [23] Figure 2.5 represents the principle of the construction of an ensemble model.

Figure 2.5: The principle of an ensemble model: the outputs of n models are averaged into a single prediction.
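A minimal sketch of such an averaging ensemble, assuming Keras sub-models whose final layer is the two-channel pixel-wise softmax used later in this thesis:

```python
import numpy as np

def ensemble_predict(sub_models, images):
    """Average the softmax outputs of all sub-models, then take the
    most probable class per pixel as the ensemble prediction."""
    mean_probs = np.mean([m.predict(images) for m in sub_models], axis=0)
    return np.argmax(mean_probs, axis=-1)  # class 1 = foot, class 0 = background
```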

2.3.7 Model performance

The performance of a network is in this thesis calculated with the use of a confusion matrix. A confusion matrix is an N × N matrix, where N is the number of classes, which holds the counts of true positives, false positives, false negatives and true negatives. In this thesis only a binary classifier will be considered, with only two possible predictions (0 and 1) and two possible true classes (0 and 1), and the confusion matrix is therefore of size 2 × 2. This results in four possible outcomes, which are defined as:

• True Positive (TP): ground truth and predicted pixel are of class 1 (foot)
• True Negative (TN): ground truth and predicted pixel are of class 0 (background)
• False Positive (FP): ground truth pixel = 0 while predicted pixel = 1
• False Negative (FN): ground truth pixel = 1 while predicted pixel = 0

The performance is summarized in the confusion matrix as presented in figure 2.6. The matrix holds the count for each of the four cases, from which the Dice score is calculated.

Figure 2.6: Confusion matrix.

Dice score is defined according to equation 2.2, and Dice loss as in equation 2.3. This metric is a measure of the overlap between the predictions and the ground truth. It rewards true positives as well as penalizes false positives, which is why the Dice score is well suited for semantic segmentation. [30]

Dice = 2TP / (2TP + FP + FN)    (2.2)

Dice loss = 1 − 2TP / (2TP + FP + FN)    (2.3)
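Written out for binary masks, the two metrics are a handful of numpy lines (a sketch; the thesis does not show its own implementation):

```python
import numpy as np

def dice_score(pred, truth):
    """Equation 2.2 for binary masks: 2TP / (2TP + FP + FN)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    return 2 * tp / (2 * tp + fp + fn)

def dice_loss(pred, truth):
    """Equation 2.3."""
    return 1.0 - dice_score(pred, truth)
```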

2.3.8 ROC analysis

ROC analysis is a way to visualise and evaluate a classifier's performance through graphs and is commonly used in both medical decision making and deep learning. In the case of a binary classifier, as in this thesis, each instance I is mapped to one element of a set of positive and negative labels. Classification models can produce both continuous and discrete output. Classification models that produce continuous output (an estimate of which class an instance belongs to) utilize thresholds to predict class membership, whereas discrete classifiers produce a class label which indicates the predicted class of the instance. Basically, it means that the model outputs a probability for each class and the highest probability wins. The ROC plot shows how the true positive rate (TPR) and false positive rate (FPR) change with the threshold used for binarizing the classes.

TPR = TP / P    (2.4)

FPR = FP / N    (2.5)

P denotes the total number of positives and N denotes the total number of negatives. Continuous classifiers, like deep learning networks, yield an instance probability that represents the degree to which an instance is a member of a class. This value can be used with a threshold to produce a binary classifier, by mapping instances to classes depending on whether their respective probability is above or below the threshold. Each threshold produces a point on the ROC curve, and the threshold may conceptually be in the range (−∞, +∞). An example of this can be seen in figure 2.7. The closer a point is to the top left corner the better the classifier, where a point at (0,1) represents a perfect classifier. The challenge is to select the threshold with the optimal balance between TP and FP rates, which can be found by maximizing the Geometric Mean (G-mean) or Youden's J statistic (J-statistic) in equations 2.6 and 2.7.

G-mean = √(TPR · (1 − FPR))    (2.6)

J-statistic = TPR − FPR    (2.7)

To evaluate the performance of a classifier, and to be able to compare classifiers, one can calculate the Area Under ROC Curve (AUC) to generate a single value that represents the performance. The AUC is the probability that a randomly chosen data point from the positive class is ranked higher than a randomly selected data point from the negative class. Since the AUC is a portion of the unit square, the value will always range between 0 and 1, where an AUC value of 1 denotes a perfect classifier. [31]
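Threshold selection from a ROC curve can be sketched with scikit-learn (an assumed tooling choice; y_true and y_prob are synthetic stand-ins for flattened ground-truth labels and predicted foot probabilities):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)                   # synthetic stand-in data
y_true = rng.integers(0, 2, size=10_000)
y_prob = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, size=10_000), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)
g_mean = np.sqrt(tpr * (1 - fpr))                # equation 2.6
j_stat = tpr - fpr                               # equation 2.7
print(auc, thresholds[np.argmax(g_mean)], thresholds[np.argmax(j_stat)])
```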


Figure 2.7: An example of a ROC graph showing the performance for different thresholds for a classifier that produces continuous output.

2.3.9 Data Augmentation

Generalizability refers to the performance of a model on unseen data (test data) compared to previously seen data (training data), i.e., how well the model performs in a generalized environment. This helps to evaluate whether the model has learned to recognize general features of feet, or only the exact appearance of certain feet, namely the ones in the training set. A model that performs well on training data but poorly on test data is said to be overfitted, meaning that the model has only learned to recognize images it has already seen. The performance of the model can be evaluated over time by plotting the performance measures (accuracy, dice etc.) for each training epoch. By doing this it is possible to find a point where the test error starts increasing in contrast to the decreasing training error. An example can be seen in figure 2.8.

Figure 2.8: An example of overfitting when training a model. The plot to the left shows a point where the validation error increases as the training continues. This means that the model has been overfitted to the training data and therefore performs poorly on the validation data. The sought relationship between training and test error is one where the test error continues to decrease as the training increases, as seen in the plot to the right.

A useful Deep Learning model has a validation error which decreases as the number of training epochs increases, i.e., continues to decrease as the training error decreases. A powerful method to prevent overfitting is data augmentation. Data augmentation is a technique for increasing the amount of training data while at the same time preventing overfitting, by teaching the network varying shapes, orientations, and sizes of the same object. For example, take an image of a foot and its corresponding mask, as can be seen to the left in figure 2.9. If the model learns that all feet are facing upwards at a slight angle, it will learn that this is a relevant feature of a foot, which is not necessarily the case. To prevent this, the image to the right in figure 2.9 shows an augmentation method of the original image called shearing, ensuring that the model learns that there are more features of feet than their orientation.

Figure 2.9: Original images to the left and the augmented ones to the right.

There are other augmentation methods like rotation, flipping, brightness enhancement, adding noise and translation, which all can be useful depending on the application of the model. [32] During surgery the patient lies on their back and the toes are facing upwards, which means that it may be unnecessary to flip the image vertically. However, both left and right feet are subjects to surgery, meaning that flipping the image horizontally is beneficial. Rotations, translations, and brightness alterations are also beneficial augmentation methods, considering that there is a lot of movement during surgery and skin colour and lighting conditions may vary. An example of image rotation can be seen in figure 2.10.

Figure 2.10: Rotated image and its corresponding mask.

Augmentation can be performed either before training or during training, depending on the situation. A relatively small dataset may benefit from augmentation pre-training, also called offline augmentation, to increase the size of the dataset. Online augmentation refers to on-the-fly augmentations which are performed during the training on the original training set. The use of offline augmentation is followed by the need for increased memory storage on disk, which can be problematic depending on the inflation of the dataset. [32]

2.4 ROI tracking

ROI tracking is a field in computer science where a computer learns to track a specific region of interest in a video. The tracking problem can be divided into a classification task and an estimation task. The classification task is to get the computer to find a coarse location of the ROI in an image, whereas the estimation task has to do with predicting the location and movement of the ROI to get a correct localisation of the object in the image, often presented with a bounding box. [33, 34]

During LSCI/MELSCI measurements it is beneficial to be able to make assessments about one or several ROIs at the same time. However, without an automatic tracking solution this has to be done manually, which is time consuming and limits the clinical applicability of the technique, since these types of procedures normally have an intraoperative period of three hours or more. It is therefore beneficial to develop robust tracking algorithms that make the process automatic. [10]

2.5 Iterative Closest Point (ICP) algorithm

Registration of image points is the process of finding a transformation, i.e., rotation, translation, and shearing, of a set of data points to a set of model points that produces the best fit between the models. It can also be described as the process of fitting different data points to the same coordinate system. The ICP algorithm iteratively revises the transformation that best aligns a model shape P to fit a source model X by minimizing an error metric. [35] A complete mathematical explanation of the algorithm can be found in article [35], but it can roughly be described as:

1. Select some set of points in both P and X
2. Match the points between the models
3. Estimate a registration vector that minimizes the sum of squared errors between the transformed point set P and X
4. Transform the point set P using the estimated transformation
5. Iterate until the mean square error has been minimized

The ICP algorithm was initially created for registration of 3D shapes, but it handles 2D shapes like images as well, since this just means that the z-coordinate in an x, y, z coordinate system is set to zero [35]. The algorithm was implemented in Matlab, which already has a pre-defined function called pcregistericp for ICP calculations.
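For illustration, a minimal rigid 2D ICP (nearest-neighbour matching plus an SVD-based transform estimate) can be written in Python; this is a sketch of the general algorithm, not the Matlab implementation used in the thesis. source and target are assumed to be N×2 arrays of edge coordinates, e.g. np.argwhere applied to a binary edge mask:

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_2d(source, target, iterations=30):
    """Minimal rigid 2D ICP: repeatedly match nearest neighbours (steps 1-2)
    and solve for the best rotation R and translation t with an SVD step."""
    src = source.astype(float).copy()
    tree = cKDTree(target)
    R_total, t_total = np.eye(2), np.zeros(2)
    for _ in range(iterations):
        _, idx = tree.query(src)               # match each point to its closest
        matched = target[idx]
        mu_s, mu_t = src.mean(axis=0), matched.mean(axis=0)
        H = (src - mu_s).T @ (matched - mu_t)  # cross-covariance (step 3)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:               # guard against reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_t - R @ mu_s
        src = src @ R.T + t                    # transform the points (step 4)
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total                    # P is aligned as P @ R.T + t
```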


3 Materials and method

This chapter explains the choice of methods and materials used to solve the problems presented in the introduction. It describes how the datasets were created, the specific implementation of the U-net, as well as the details regarding the training and evaluation procedure. This chapter also describes how the ROI tracking algorithm was implemented and evaluated.

3.1 Construction of the dataset

The MELSCI camera developed by Hultman et al. outputs intensity images and perfusion images, both of which could have served as input channels in the training process of the network. An additional input channel for the network could have resulted in a better performing model. However, the perfusion images are calculated with the help of an ANN that maps MELSCI speckle contrast to LDF. Training the network with this data would most likely force it to only work on images acquired from the MELSCI camera. In an attempt to make the model more versatile and generalizable to other applications which use intensity images of feet, the perfusion image channel was chosen to be excluded.

3.1.1 Constructing Dataset 1

The initial dataset was constructed in Matlab 2020b [36] with the use of 5047 images. The images were collected with the MELSCI camera constructed by Hultman et al. [1] on three healthy test subjects. A dark background was used to facilitate the segmentation by simple thresholding and morphological image processing operations. Additionally, 990 images were captured without a beneficial background, which were only used for testing. The original images had size 320 × 256 pixels but were down-sampled to 160 × 128 pixels. The images and binary masks were concatenated into two separate 3D arrays to avoid too many separate files. This resulted in two 3D arrays with size 5048 × 160 × 128 which were saved as .mat-files. A summary of the dataset is presented in table 3.1 and a visualization of an image pair can be seen in Figure 3.1.

Dataset 1
Image 3D array      Mask 3D array
5038 × 160 × 128    5038 × 160 × 128
990 × 160 × 128     No masks

Table 3.1: Summary of Dataset 1.

Figure 3.1: Original image with mask.

The idea with Dataset 1 was to use it for pre-training and initial evaluation of the network before the bigger dataset had been created. This meant that network defects could be corrected, and optimization of network parameters could be done, at an early stage.

3.1.2 Constructing Dataset 2

The second dataset was constructed from videos taken by the MELSCI camera. The videos were, unlike the ones acquired when constructing Dataset 1, collected from real surgical procedures. The dataset was constructed using an application created in Matlab by our supervisor Martin Hultman. The shortest video contained 347 images and the longest 13940 images. The application displayed both the images and a graph of the amount of movement at different time points of the surgery, computed as the absolute difference between two consecutive frames. The time points where a lot of movement was detected were targeted, and the segmentation was performed by manually marking the area in the image which contained a foot. The output from the application was two 3D arrays, one containing the original images and one containing the corresponding ground truth segmentations. An example of a manually segmented foot can be seen in Figure 3.2.

Figure 3.2: The application in which Dataset 2 was created.

Dataset 2 consisted of 74785 image and mask pairs, which were concatenated into two separate 3D arrays of size 74758 x 320 x 256 each. Two additional 3D array pairs were also constructed of down-sampled images, one by a factor two and one by a factor four. This was done for faster calculations and to avoid memory allocation problems that often occurred when running the Python script in Jupyter Lab. The additional array pairs were of size 74758 x 160 x 128 and 74758 x 80 x 64. After evaluating the down-sampled datasets it was decided to only use the one that had been down-sampled by a factor two. The reason for discarding the factor 4 down-sampled dataset was that sufficiently good results could be achieved with the factor 2 down-sampled data, without worrying that the training process would take too long. A summary of the dataset can be seen in table 3.2.

Dataset 2
Image 3D array      Mask 3D array
74758 × 160 × 128   74758 × 160 × 128

Table 3.2: Summary of Dataset 2.

3.2 Semantic segmentation

The semantic segmentation in this thesis was implemented in Keras. Keras is an open-source library that provides an interface to the TensorFlow library.

3.2.1 Network architecture

In this thesis, the semantic segmentation is based on a U-net similar to the one described in Section 2.3.3. However, the chosen architecture has some minor modifications. The entire architecture of the implemented U-net is shown in Figure 3.3, in which the blue arrows represent a 2D convolution with a kernel size of 3x3, stride 1, followed by a batch normalization and a ReLU activation function. The orange arrows represent a 4x4 max pooling operation. The green arrows represent a 2D transpose convolution with a kernel size of 3x3, stride 4, followed by a batch normalization and a ReLU activation function. Finally, the yellow arrow represents a 2D convolution with a kernel size of 1x1, stride 1, followed by a softmax activation function. This architecture results in bridge feature maps of size 10x8 pixels.


Figure 3.3: Illustration of the network architecture.
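A Keras sketch consistent with the arrows and feature-map sizes stated above (an illustrative reconstruction, not the authors' exact code):

```python
from tensorflow.keras import layers, models

def conv_block(x, filters):
    """Blue arrow: 3x3 convolution, stride 1, batch norm, ReLU."""
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inputs = layers.Input(shape=(160, 128, 1))

# Encoder
e1 = conv_block(conv_block(inputs, 64), 64)          # 160 x 128 x 64
p1 = layers.MaxPooling2D(pool_size=4)(e1)            # 40 x 32 x 64
e2 = conv_block(conv_block(p1, 128), 128)            # 40 x 32 x 128
p2 = layers.MaxPooling2D(pool_size=4)(e2)            # 10 x 8 x 128

# Bridge
b = conv_block(conv_block(p2, 256), 256)             # 10 x 8 x 256

# Decoder (green arrows: 3x3 transpose convolution, stride 4)
d2 = layers.Conv2DTranspose(128, 3, strides=4, padding="same")(b)
d2 = layers.ReLU()(layers.BatchNormalization()(d2))  # 40 x 32 x 128
d2 = layers.Concatenate()([d2, e2])                  # 40 x 32 x 256
d1 = layers.Conv2DTranspose(64, 3, strides=4, padding="same")(d2)
d1 = layers.ReLU()(layers.BatchNormalization()(d1))  # 160 x 128 x 64
d1 = layers.Concatenate()([d1, e1])                  # 160 x 128 x 128

# Yellow arrow: 1x1 convolution with a pixel-wise softmax over 2 classes
outputs = layers.Conv2D(2, 1, activation="softmax")(d1)
model = models.Model(inputs, outputs)
```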

3.2.2 Pre-processing

The data were originally of dimensions 320x256. This caused the training process to become unreasonably slow. The data were therefore down-sampled by a factor of 2 to the size of 160x128, as described in Section 3.1.2.

3.2.3 Data augmentation

To prevent overfitting, online data augmentation was applied during the data generation. The aim of this section is to describe how the augmentation was done. The augmentation techniques used in this thesis were the following:


• Translation with a random distance between 0 and 10% of the image size along each axis.

• Shearing with a random angle between 0 and 10 degrees.

• Random change of the brightness of the image between a specified interval. • Random zoom along x- and y-axis between -20% and +20%.

• Left to right flipping of the image.

When the images in one batch are generated, they are sent through an augmentation process. In this process, there is a 75% chance for each image that it gets augmented. A list is generated that is as long as the number of augmentation types. Each place in the list contains an integer drawn from a discrete uniform distribution with values between 0 and 1. These integers work as a decision for which augmentation types are to be applied to the image. This means that each image is given a stochastic combination of augmentation types. For example, when an image I is sent through the augmentation process, the list illustrated in Figure 3.4 is generated. This means that this particular image is augmented with a zoom, left-right flipping, shearing and a translation.

Rotation  Translation  Shearing  Brightness shift  Left-Right flipping  Zoom
[   0          1           1            0                  1              1   ]

Figure 3.4: An example of an augmentation combination that can be applied to an image.

The augmentation process was only applied to the images used as training data. The testing and validation data were kept as they were, to ensure that the results from the test data reflect the results expected in real life.
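A sketch of this stochastic selection process; apply_augmentation is a hypothetical helper standing in for the actual transform code, which must apply the same geometric transform to the image and its mask:

```python
import numpy as np

AUGMENTATIONS = ["rotation", "translation", "shearing",
                 "brightness", "lr_flip", "zoom"]

def augment_batch(images, masks, rng=np.random.default_rng()):
    """Each image has a 75% chance of being augmented; if it is, a random
    0/1 vector decides which of the augmentation types are applied."""
    for i in range(len(images)):
        if rng.random() < 0.75:
            selected = rng.integers(0, 2, size=len(AUGMENTATIONS))
            for name, use in zip(AUGMENTATIONS, selected):
                if use:
                    # hypothetical helper, not defined in the thesis
                    images[i], masks[i] = apply_augmentation(
                        name, images[i], masks[i])
    return images, masks
```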

3.2.4 Training

The input for the network during the training was the images, with their corresponding masks as targets. The network was trained using the Adam optimizer, which is an adaptive learning rate method [37]. The initial learning rate for the Adam optimizer was set to 0.0001. Binary cross entropy was used as the loss function on the output from the final layer, which is a pixel-wise softmax of the final feature map.

A summary of the training set constructed from Datasets 1 and 2 can be seen in table 3.3.


Dataset 1 & 2
Foot no.   Number of images
1          13939
2          9190
3          464
4          12205
5          11258
6          6892
7          346
8          8329
9          5038
10         792
11         10902
Total      78563

Table 3.3: Summary of the datasets.

The network was trained using LOOCV as described in Section 2.3.5, with two different configurations. The first used 10 folds, with the images in Foot no. 1-10 used as training data. The images in Foot no. 11 were kept out of the training process, as these were used as test data.

The second configuration used 5 folds, with the images in Foot no. 1-5 used as training data. In this configuration the images in Foot no. 6-11 were kept out of the training process, as these were used as test data.

Each fold was trained for 25 epochs with the callback ReduceLROnPlateau activated, with the parameter patience set to 10 epochs and the parameter factor set to 0.1. Each fold from the LOOCV configurations was saved as an individual model (sub-model).

The sub-models in each LOOCV configuration could then be combined into ensemble models in different combinations, using the method described in Section 2.3.6, where the largest model, consisting of all 10 sub-models, holds 20 724 500 parameters.
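A sketch of one fold's training loop under these settings, assuming the model built in the Section 3.2.1 sketch and masks one-hot encoded over the two classes; the array and file names are placeholders:

```python
from tensorflow.keras import callbacks, optimizers

model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy")
reduce_lr = callbacks.ReduceLROnPlateau(patience=10, factor=0.1)
model.fit(train_images, train_masks,            # images from the training feet
          validation_data=(val_images, val_masks),
          epochs=25, callbacks=[reduce_lr])
model.save(f"sub_model_fold_{fold_index}.h5")   # one saved sub-model per fold
```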

3.2.5 Optimization of hyperparameters

The Weights & Biases [38] tool was used to evaluate and optimize hyperparameters. This was done by performing Sweeps, where the tool initializes a set of parameters, starts a regular network training session with the chosen parameters, saves the results once the training is completed, and then repeats the cycle with a new set of parameters. The parameter set with the lowest dice loss score was regarded as the optimal configuration.


3.2.6 ROC analysis

ROC analysis was performed on the 10-fold ensemble model since this was the one with the highest performance. The resulting classification thresholds were then evaluated by performing segmentation on test data and calculating the dice score between the segmentation and the ground truth.

3.2.7 Semantic segmentation evaluation

An average of all dice scores from the images in the test data was used as the performance metric for an ensemble model or sub-model. The evaluation process was partly designed to investigate how the ensemble model built with every sub-model from the LOOCV performed on the test data, but also to evaluate how ensemble models constructed from sub-models in different combinations performed on the test data.

Having 10 sub-models, the ways to construct ensemble models are many. One can use all sub-models individually and evaluate their performance. Moreover, one can construct an ensemble model consisting of two sub-models in 45 ways, an ensemble model consisting of three sub-models in 120 ways, etc. In fact, the total number of ways an ensemble model can be constructed in this manner is 1023 from 10 sub-models and 31 from 5 sub-models. All these combinations were evaluated.
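The count follows from enumerating every non-empty subset of the sub-models, e.g. with itertools (the sub-model names are stand-ins):

```python
from itertools import combinations

sub_models = [f"sub_model_{i}" for i in range(10)]   # stand-in identifiers

ensembles = [c for k in range(1, len(sub_models) + 1)
             for c in combinations(sub_models, k)]
print(len(ensembles))   # 1023 = 2**10 - 1 non-empty subsets; 31 for 5 sub-models
```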

3.3 ROI tracking

The ROI tracking was developed in Matlab with inspiration from the work published by Mennes et al. [10], who successfully implemented a semi-automatic tracking algorithm for ROI tracking of diabetic foot ulcers. The segmentation produced by the model was used to find foot edges with Canny edge detection, which were then used as reference points for the registration. An example of the result can be seen in Figure 3.5.


Figure 3.5: An example of the foot edge detection that is used for the registration.

ROIs are drawn manually, as seen in Figure 3.6, on the first frame, and the corresponding edges are used as reference for the registration for all the frames.

Figure 3.6: A manually placed ROI.

The algorithm then performs ICP calculations as described in Section 2.5 and the ROI is shifted to the correct position according to the transformation. An example can be seen in Figure 3.7.


Figure 3.7: A visualization of the tracking between two frames.

3.3.1 ROI tracking algorithm evaluation

The evaluation of the tracking algorithm was based on applying different kinds of transformations to a reference image and then letting the algorithm find the right transformation between the transformed image and the reference image. A reference ROI was also drawn by hand on the reference image. A dice score was then calculated between the transformed ROI in the reference image and the ROI output by the tracking algorithm itself. This can be summarized as:

• Draw ROI on an image

• Apply transformation to the image and the ROI

• Apply the ROI tracking algorithm between the initial image and the transformed image

• Calculate dice score between the transformed ROIs

The transformation types used in the evaluation were the following:
• Translation in x-direction +/- 50% of the image size along the x-axis.
• Translation in y-direction +/- 50% of the image size along the y-axis.
• Rotation of 360° starting from −180°.
• Zoom with a scaling from 0.5 to 1.5 times the size of the image.

The evaluation was partly done with one transformation type at a time, and partly with two transformation types in combination. The goal of this evaluation was to get an overview of how the algorithm performed in a broad range of situations; a sketch of the evaluation loop is shown below.
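This sketch uses an assumed scipy-based transform step; track_roi is a hypothetical stand-in for the ICP-based tracker of Section 3.3, and dice_score is the metric of Section 2.3.7:

```python
import numpy as np
from scipy.ndimage import rotate, shift

def evaluate_tracking(image, roi_mask, dx=0, dy=0, angle=0.0):
    """Transform the reference image and ROI, run the tracker between the
    original and the transformed image, and score the recovered ROI."""
    moved = shift(rotate(image, angle, reshape=False, order=1), (dy, dx), order=1)
    true_roi = shift(rotate(roi_mask, angle, reshape=False, order=0),
                     (dy, dx), order=0)
    predicted_roi = track_roi(image, moved, roi_mask)  # hypothetical tracker call
    return dice_score(predicted_roi, true_roi)
```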


4 Results

This Chapter presents the results from the development of the CNN and the ROI tracking algorithm and how these have been evaluated. This includes graphs of dice scores from the 10- and 5-fold LOOCV models, ROC analysis, hyperparameter optimization, and tracking performance for different types of translations and rotations.

4.1 Optimization of hyperparameters

This section presents the results from the Sweep that was conducted using the Wandb tool. A total of 19 different configurations were run, and the parameters that were varied were learning rate, batch size, depth, dropout rate, number of filters, and optimizer function.

The results indicate that changing these parameters only contributes to relatively small changes in performance. The best parameter configuration was the one with the lowest dice loss score, calculated according to Equation 2.3 in Section 2.3.7. The best set of parameters and the resulting dice loss score can be seen in Table 4.1.


Hyperparameter     Configuration
Batch size         29
Depth              2
Dropout rate       0.2328
Learning rate      0.0009167
Number of filters  72
Optimizer          Adam
Dice loss score    0.0364

Table 4.1: The best set of hyperparameters and the resulting dice loss score.

4.2 Semantic segmentation

This Section covers the results acquired from the network architecture illustrated in Figure 3.3 trained with the configurations described in Section 3.2.4. This resulted in 10 sub-models from the 10-fold LOOCV and 5 sub-models from the 5-fold LOOCV all of which were evaluated as described in Section 3.2.7.

4.2.1 Layer outputs

Figure 4.1 displays the output from one feature channel in each layer of one sub-model on three different feet. The Figure gives information about what happens in the model, even if it is not evident how it happens.


Figure 4.1: Outputs from one feature channel in each layer on three different feet.

4.2.2 Ensemble models evaluation

The 10-fold LOOCV resulted in 1023 combinations and the 5-fold LOOCV resulted in 31 combinations of ensemble models. The distribution of ensemble combinations based on the number of sub-models is shown in tables 4.2 and 4.3.


Number of sub-models   Number of combinations
1                      10
2                      45
3                      120
4                      210
5                      252
6                      210
7                      120
8                      45
9                      10
10                     1
Total                  1023

Table 4.2: Summary of the variation of possible ensemble models evaluated from the 10-fold LOOCV.

Number of sub-models   Number of combinations
1                      5
2                      10
3                      10
4                      5
5                      1
Total                  31

Table 4.3: Summary of the variation of possible ensemble models evaluated from the 5-fold LOOCV.

As described in Section 3.2.7, all combinations of ensemble models were evaluated on the test data. During this process, each ensemble model combination calculated a dice score between every segmentation from the test data and its ground truth. These dice scores were then summarized into a single mean value, a dice mean. The Y-axis in figures 4.2 and 4.3 represents the distribution of dice score means achieved for each ensemble model combination. The X-axis in these figures represents the number of sub-models used in the ensemble models. Figure 4.2 illustrates a violin plot of the results from the evaluation process for the 10-fold LOOCV. It is evident that the median dice score increases with the number of sub-models used in the ensemble model. However, there are still ensemble models that perform better than the ensemble model built with 10 sub-models.


Figure 4.2: Distribution of mean dice scores based on all possible combinations of ensemble models from the 10-fold LOOCV, composed with varying numbers of sub-models. The horizontal lines are the medians of the distributions and the whiskers represent the range of the data.

Figure 4.3 illustrates a violin plot of the results from the evaluation process for the 5-fold LOOCV. The median of the dice scores increases with the number of models used in the ensemble model, except for the ensemble model composed of all 5 folds, which has approximately the same median as the models composed of 2 and 3 sub-models. Also, it is evident that the ensemble models that got the highest and the lowest dice score were single sub-models.


Figure 4.3: Distribution of dice scores based on all possible combinations of ensemble models from the 5-fold LOOCV, composed with varying numbers of sub-models. The horizontal lines are the medians of the distributions and the whiskers represent the range of the data.

For every number of sub-models used in the evaluation, the very best ensemble model was extracted and evaluated separately. In this evaluation process, the ensemble models calculated a dice score between every segmentation from the test data and its ground truth. The dice scores for each model were then plotted in figures 4.4 and 4.5 as violin plots. This was done to get a sense of the robustness of the ensemble models in question.

Figure 4.4 illustrates the distribution of dice scores from the 10-fold LOOCV achieved by the best ensemble models, based on the test data. From these results, it is evident that the best model was an ensemble model constructed with 3 sub-models.


Figure 4.4: Distributions of dice scores based on the best performing ensemble models from the 10-fold LOOCV.

Figure 4.5 illustrates the distribution of dice scores achieved by the best ensemble models from the 5-fold LOOCV based on the test data. The test data for the 5-fold LOOCV consisted of 33219 images acquired from 6 different operations.

Figure 4.5: Distributions of dice scores based on the best performing ensemble models from the 5-fold LOOCV.


Figures 4.6 and 4.7 illustrate the best, median and worst segmentations done by the best ensemble model from the 10-fold LOOCV and the 5-fold LOOCV, based on the test data for each cross validation configuration.

Figure 4.6: The segmentations from the best performing ensemble model from the 10-fold LOOCV that achieved the best (0.9819), median (0.9678) and worst (0.9451) dice score on the test data. The first row represents the best, the second row the median and the last row the worst segmentation of the test data.


Figure 4.7: The segmentations from the best performing ensemble model from the 5-fold LOOCV that achieved the best (0.9826), median (0.9311) and worst (0.000) dice score on the test data. The first row represents the best, the second row the median and the last row the worst segmentation of the test data.

4.3 ROC analysis results

The results from the ROC analysis, see Figure 4.8, showed that the optimal threshold was approximately 0.19-0.20. The ROC graph shows that the model used is close to a perfect classifier, at least on the test data used in this thesis.


Figure 4.8: The results from the ROC analysis.

Figures 4.9 and 4.10 suggest that the default threshold outperforms the calculated best threshold on the test data, at least when measuring dice score.


Best dice = 0.9773, Median dice = 0.9516, Worst dice = 0.9395.

Figure 4.9: Best, median and worst predictions with threshold = 0.19. The results are displayed in ascending order from best to worst.


Best dice = 0.9848, Median dice = 0.9707, Worst dice = 0.9469.

Figure 4.10: Best, median and worst predictions with the default threshold = 0.5. The results are displayed in ascending order from best to worst.

The same trend can be seen in the graph in Figure 4.11, where the default threshold outperforms the calculated best threshold for a majority of the images.


Figure 4.11: Graph showing the resulting dice scores for all test images with two different thresholds.

4.4 ROI tracking

The ROI tracking was evaluated by calculating the dice score between two different transformations of the same image and ROI, as described in Section 3.3.1. The results can be seen in Figures 4.12-4.16. In the evaluation process a dice score of 0.9 was chosen as a threshold to highlight the cases when the tracking algorithm should be deemed less trustworthy.


Figure 4.12: Results from the evaluation process when translating the reference image in the x-direction. The first graph shows the achieved dice scores between the transformed reference image and the output from the ROI tracking algorithm. The second row represents the furthest translation in +/- x-direction used in the evaluation process. The third row presents how much the images can be translated in the x-direction before the dice score drops below 0.9.


Figure 4.13: Results from the evaluation process when translating the reference image in the y-direction. The first graph shows the achieved dice scores between the transformed reference image and the output from the ROI tracking algorithm. The second row represents the furthest translation in +/- y-direction that was used in the evaluation process. The third row presents how much the images can be translated in the y-direction before the dice score drops below 0.9.


[Figure panels: dice score vs. degrees of rotation, with the dice score = 0.9 level marked; the image rotated 0 and 180 degrees; and the images rotated -58 and 61 degrees next to the original image]

Figure 4.14: Results from the evaluation process when applying rotation to the reference image. The first graph shows the achieved dice scores between the rotated reference image and the output from the ROI tracking algorithm. The second row represents a comparison between the biggest rotation used in the evaluation process and the reference image. The third row presents how much the images can be rotated in both directions before the dice score drops below 0.9.


[Figure panels: dice score vs. resize factor, with the dice score = 0.9 level marked; the least and most zoomed images; and the images resized to fractions 0.91 and 1.11 of the original size next to the original image]

Figure 4.15: Results from the evaluation process when applying zoom to the reference image. The first graph shows the achieved dice scores between the zoomed reference image and the output from the ROI tracking algorithm. The second row represents the least and most zoomed image used in the evaluation process. The third row presents how much the images can be zoomed in both directions before the dice score drops below 0.9.

Figure 4.16 visualises the performance of the ROI tracking for different combinations of translations and rotations. Each plot consists of 1000 randomly generated translations. The algorithm performs the worst in the cases where there is a lot of translation in the negative y-direction (upwards).

Figure 4.16: Surface plots of dice score for 1000 different combinations of x-translations, y-translations and rotations. The plots to the right visualise the plots to the left seen from above, with the dice score presented as colour values.

4.5 Real-time feasibility

Figure 4.17 displays a histogram of the speed of the ROI tracking algorithm when applied to a video of 729 frames. The test was performed in Matlab on a regular laptop.


Figure 4.17: The figure displays a histogram of the time distribution when performing ROI tracking on a video consisting of 729 frames. The mean calculation time was 21 milliseconds and the standard deviation was 13 milliseconds.

Figures 4.18 and 4.19 display histograms of the prediction time for the 10-fold ensemble model and a randomly selected sub-model. The ensemble model has a mean prediction time of 0.0316 seconds and the single sub-model has a mean prediction time of 0.0171 seconds. Both histograms were plotted with 50 bins over 500 predictions. The sum of the means for CNN segmentation and ROI tracking is smaller than 64 milliseconds, which is the time it takes for the MELSCI camera to produce one image.
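Such timings can be reproduced with a straightforward measurement loop. The sketch below is a minimal Python example where segment and track are placeholder callables for the CNN prediction and the ROI tracking step; the 64 millisecond budget is the MELSCI frame time stated above.

import time
import numpy as np

def time_pipeline(frames, segment, track, budget=0.064):
    """Time segmentation + ROI tracking per frame against the 64 ms frame budget."""
    times = []
    for frame in frames:
        start = time.perf_counter()
        mask = segment(frame)   # CNN segmentation
        track(mask)             # ROI tracking on the predicted mask
        times.append(time.perf_counter() - start)
    times = np.asarray(times)
    print(f"mean {1e3 * times.mean():.1f} ms, std {1e3 * times.std():.1f} ms, "
          f"over budget: {100 * (times > budget).mean():.1f} % of frames")
    return times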


5 Discussion

This chapter presents a discussion of the methods described in Chapter 3 and the results in Chapter 4.

5.1 Creating the dataset

The method to manually create the dataset was chosen because it was the easiest way to ensure quality. However, this proved to be a time-consuming and tedious process which naturally resulted in some human errors. Some images were masked incorrectly, and it was hard to get details right, especially around the toe regions. Another method would have been to use conventional image processing to perform filtering and then remove the faulty masked images. However, this would have meant spending a lot of time developing an image processing method, which is exactly what the CNN is meant to avoid. It would have been a catch-22 situation: developing a conventional image processing method to avoid developing a conventional image processing method.

The dataset is regarded as sufficiently large with approximately 78 000 used images, but the variance could be greater. Even though it consists of images taken from 14 different subjects (11 patients and 3 healthy), which induces some variance, a lot of the data was gathered during long surgeries with relatively small movements. This means that a large portion of the images in the dataset look the same. This could have been avoided by using fewer images with greater variance, but since the CNN produced good results with the existing dataset, this was not investigated further. Another aspect is that all the images in the dataset are of subjects with a relatively light skin tone. It would have induced a lot more variance in the training set if there were greater variance in skin tone in the dataset. The augmentation that is performed during training does add some variance to the dataset, but it would still be interesting to see how the CNN performs on subjects with different skin tones.

5.2 CNN performance

The CNN model was able to provide satisfying results, as seen in Section 4.2. The main approach was to construct an ensemble model using a 5-fold and a 10-fold LOOCV, which proved to perform better than any single sub-model in both cases. When comparing the 5-fold and 10-fold LOOCV, the conclusion can be made that the 10-fold achieved an overall better performance, especially regarding the results in Figures 4.6 and 4.7.
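One common way to combine such sub-models, assumed here purely for illustration, is to average their per-pixel probability maps before thresholding. The sketch below is a minimal Python version of that idea, where models is a list of Keras-style models exposing a predict() method; it is not a verbatim excerpt from the implementation.

import numpy as np

def ensemble_predict(models, image, threshold=0.5):
    """Average per-pixel foreground probabilities over all sub-models,
    then threshold to get a binary segmentation."""
    probs = np.mean([m.predict(image[np.newaxis])[0] for m in models], axis=0)
    return probs >= threshold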

When looking at Figures 4.2 and 4.3, it is apparent that the median dice score increases with the number of sub-models used in the ensemble model. However, when looking at the dice score distributions of the best ensemble models in Figure 4.4, the median dice scores increase up to 3 sub-models and then slowly decrease. This indicates that having more than 3 sub-models in the ensemble could have a negative impact. It would therefore be beneficial to use the best ensemble model composed of 3 sub-models, since it consists of fewer sub-models and thus performs predictions faster. It is important to note that the difference between the medians of the best (dice = 0.970) and the worst (dice = 0.9675) performing ensemble models in Figure 4.4 is only 0.0025, or about 0.26 %. This shows that regardless of which ensemble model one chooses, it will still perform quite well. This is also emphasized by the ROC analysis in Section 4.3, where the ROC graph is nearly ideal, indicating that the model is a good classifier.

The test data used in the 10-fold LOOCV consisted of 776 images acquired from one foot. This implies that the variance in the test data is somewhat small, which could have induced some bias in the results from the evaluation process. The test data of the 5-fold LOOCV consisted of 33 219 images acquired from five different feet. The results in Section 4.2 from this LOOCV configuration showed that these models can generalize to the test data, even if the 5-fold LOOCV seems to have a hard time dealing with images in which the foot is absent.

As discussed in Section 5.1, the ground truth segmentations are not perfect, especially around the toe areas, as illustrated in Figures 4.6 and 4.7. This is the most likely reason why the ensemble models have trouble getting the finer details in the border around the foot: the results from the model cannot be better than the overall quality of the training data. This also means that the dice score is somewhat misleading, since a lower dice score may just be caused by the network classifying feet more accurately than what was done during the creation of the datasets. Chasing a perfect dice score can therefore be regarded as somewhat pointless with the current dataset. Also, as previously stated, the images in the datasets have only been collected from patients with a relatively light skin tone. How the model performs on people with darker skin tones is thus still unexplored. To get a more versatile model, the datasets should in the future include images of patients with all ranges of skin tones.

5.3 ROI tracking

Considering the results in Section 4.4, the algorithm is deemed trustworthy for cases that are regarded as normal during these kinds of surgeries. The cases where it fails are those that are regarded as abnormal, or at least unlikely to happen, and it is still possible to then throw a warning flag, or even place new ROIs if it fails completely. The solution is robust since it bases the transformation of the ROI on the segmentation provided by the CNN, and since the CNN segments feet well in the tests carried out in this thesis, the basis of the tracking algorithm is regarded as robust. Since the ICP calculations are done between the first frame, when the ROIs are placed, and the current frame, cumulative errors are prevented, compared to if they had been done between the current frame and the previous frame. It is also noticeable that the mean time it takes to perform segmentation and ROI tracking is shorter than 64 milliseconds, which is the time it takes for the MELSCI camera to produce one image, as described in Section 2.2. This means that it is possible to perform ROI tracking on every frame, although there are also cases where the computation time is too long.
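As a minimal Python sketch of this pipeline (the actual implementation was written in Matlab), the code below extracts edge points from the binary segmentations, runs a simple point-to-point ICP, and applies the resulting rigid transform to the ROI. The 4-neighbour edge extraction and the Kabsch solution of the rigid least-squares step are standard techniques assumed here; they are not necessarily identical to the thesis implementation.

import numpy as np
from scipy.spatial import cKDTree

def edge_points(mask):
    """Extract (x, y) edge coordinates of a binary mask via a 4-neighbour check."""
    m = mask.astype(bool)
    interior = np.zeros_like(m)
    interior[1:-1, 1:-1] = (m[1:-1, 1:-1] & m[:-2, 1:-1] & m[2:, 1:-1]
                            & m[1:-1, :-2] & m[1:-1, 2:])
    ys, xs = np.nonzero(m & ~interior)
    return np.column_stack([xs, ys]).astype(float)

def best_rigid(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst (Kabsch)."""
    sc, dc = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - sc).T @ (dst - dc))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, dc - R @ sc

def icp(src, dst, iterations=30):
    """Minimal point-to-point ICP; returns the rigid transform src -> dst."""
    tree = cKDTree(dst)
    R_tot, t_tot = np.eye(2), np.zeros(2)
    cur = src.copy()
    for _ in range(iterations):
        _, idx = tree.query(cur)          # nearest neighbours in dst
        R, t = best_rigid(cur, dst[idx])
        cur = cur @ R.T + t
        R_tot, t_tot = R @ R_tot, R @ t_tot + t
    return R_tot, t_tot

# Track the ROI from the reference frame to the current frame:
# R, t = icp(edge_points(ref_mask), edge_points(cur_mask))
# roi_now = roi_ref @ R.T + t   # roi_ref: (N, 2) array of ROI vertices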

The tracking algorithm was developed with inspiration from the semi-automatic tracking algorithm presented in [10]. This approach was chosen since it had proven to produce satisfying results in cases similar to those in this thesis. However, one major drawback with this method is that the implementation in Matlab only computes rigid transforms, i.e. translation and rotation. Affine transformations with scaling and shear are not possible with this method, and the algorithms that were tried that could perform affine transformations were unfortunately too slow. For the results in Chapter 4, the algorithm was only evaluated on one image where the foot initially was situated in the centre of the image. The results presented in Section 4.4 show that the worst cases are those where the foot has been shifted so that the toes appear out of frame. An example of this can be seen in Figure 4.16, where it is clear that the dice score drops rapidly for translations in the negative y-direction. This may be because there are a lot of details around the toes that the ICP algorithm uses when mapping the points between the foot edges.
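For comparison, the least-squares estimation of a full 2D affine transform from known correspondences is itself cheap; the costly part is finding robust correspondences inside the ICP loop. The sketch below illustrates only that least-squares step and is a hypothetical example, not something evaluated in this thesis.

import numpy as np

def best_affine(src, dst):
    """Least-squares 2D affine transform (rotation, scaling, shear and
    translation) mapping point set src onto dst, given known correspondences.
    Both inputs are (N, 2) arrays of matched points."""
    A = np.hstack([src, np.ones((len(src), 1))])   # homogeneous coordinates
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)    # (3, 2) parameter matrix
    return M

# Apply: dst_est = np.hstack([pts, np.ones((len(pts), 1))]) @ best_affine(src, dst)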

5.4 Future work

The dataset should be expanded with more varied images, preferably images of feet with different skin tones, sizes and shapes, and possibly also in different
