

Linköpings universitet

Master’s thesis, 30 ECTS | Computer science

2020 | LIU-IMT-TFK-A--20/578--SE

Automatic segmentation of articular cartilage in arthroscopic images using deep neural networks and multifractal analysis

Hampus Viken

Mikael Ångman

Supervisor: Marco Domenico Cirillo
Examiner: Anders Eklund



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Hampus Viken, Mikael Ångman

Abstract

Osteoarthritis is a large problem affecting many patients globally, and diagnosis of osteoarthritis is often done using evidence from arthroscopic surgeries. Making a correct diagnosis is hard, and takes years of experience and training on thousands of images. Therefore, developing an automatic solution to perform the diagnosis would be extremely helpful to the medical field. Since machine learning has been proven to be useful and effective at classifying and segmenting medical images, this thesis aimed at solving the problem using machine learning methods. Multifractal analysis has also been used extensively for medical image segmentation. This study proposes two methods of automatic segmentation using neural networks and multifractal analysis. The thesis was performed using real arthroscopic images from surgeries. The MultiResUNet architecture is shown to be well suited for pixel-perfect segmentation. Classification of multifractal features using neural networks is also shown to perform well when compared to related studies.

Acknowledgments

We would like to thank our supervisor Marco and examiner Anders for their work helping and guiding us during the course of this thesis.

We would also like to thank Erik Areström for introducing multifractal analysis to us. We express our profound gratitude towards Anders Tjernvik, CEO of BioOptico AB, who made this thesis possible.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Aim
1.2 Research Questions
1.3 Delimitations
1.4 Related works

2 Theory
2.1 Osteoarthritis
2.1.1 Lesions in Articular Cartilage
2.2 Arthroscopy
2.2.1 Measuring Cartilage Thickness During Arthroscopy
2.3 Artificial Neural Networks (ANNs)
2.3.1 Convolutional Neural Networks
2.3.2 Encoder-Decoder Network Architecture
2.3.3 Training Data and Data Preprocessing
2.3.4 Training the Network
2.3.5 Validation
2.3.6 Inference
2.3.7 Performance Metrics
2.4 Multifractal Analysis
2.4.1 Fractals
2.4.2 Hausdorff Dimension
2.4.3 Hölder Exponent
2.4.4 Multifractal Spectrum
2.4.5 Wavelets
2.4.6 Discrete Wavelet Coefficients
2.4.7 Wavelet Leaders
2.4.8 Wavelet Leader Multifractal Formalism (WLMF)
2.4.9 Dyadic Intervals and Squares
2.4.10 Log-Cumulants

3 Method
3.1 Data Annotation
3.2.2 Loss Function
3.2.3 Optimization and Learning Rate Scheduling Methods
3.3 Multifractal Based Segmentation
3.3.1 Feature vector
3.3.2 Classification network
3.4 Data Preprocessing
3.5 Data Augmentation
3.6 Training

4 Results
4.1 Exploration of the Sub-Images dataset
4.1.1 Principal Component Analysis
4.1.2 Dimensionality Reduction
4.1.3 Study of hmin
4.1.4 Distribution of log-cumulants
4.2 CNN Experiment Results
4.3 Multifractal Experiment Results

5 Discussion
5.1 Analysis and Discussion
5.1.1 CNN Based Approach
5.1.2 Multifractal Based Approach
5.1.3 Performance considerations
5.2 Method Discussion
5.2.1 CNN Based Approach
5.2.2 Multifractal Based Approach
5.2.3 Training Data
5.2.4 Metrics
5.2.5 Training of the neural networks
5.2.6 Source Criticism
5.3 The Work in a Wider Context
5.3.1 Environmental Impact of Deep Learning
5.3.2 Privacy Considerations
5.3.3 Automation in Healthcare

6 Conclusion
6.1 Research Questions
6.2 Summary of Suggested Future Work

List of Figures

2.1 Internal anatomy of the knee. Image source and license: https://www.physio-pedia.com/File:Knee-patella.jpg
2.2 Locations of the different knee portals
2.3 Left: Healthy hyaline cartilage. Center: Femoral condyle lesion. Right: Deep femoral condyle lesion, with the subchondral bone exposed.
2.4 The perceptron.
2.5 An example of artificial neural network architecture. Each circle denotes an artificial neuron, and a group of them constitutes a layer. The input, hidden and output layers are marked in red, blue and green respectively.
2.6 The sigmoid function
2.7 The ReLU activation function.
2.8 A basic CNN architecture. By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374
2.9 Visualization of a convolutional operation with kernel size = 2x2, stride = 1 and padding = 0, also known as valid mode padding.
2.10 Visualization of a convolutional operation with kernel size = 2x2, stride = 2 and padding = 0.
2.11 Visualization of a convolutional operation with kernel size = 2x2, stride = 1 and padding = 1, also known as same mode padding.
2.12 Visualization of max pooling.
2.13 The encoder-decoder architecture.
2.14 The U-Net architecture. Image source [34]
2.15 The MultiResUNet architecture. Image source [14]
2.16 Visualization of the atrous convolution. Image source [56]
2.17 The DeepLab architecture. Image source [56]
2.18 Visualization of gradient descent.
2.19 Visualization of how different values of the learning rate affect loss during training.
2.20 Visualization of a learning rate range test performed on MultiResUNet.
2.21 Visualization of cyclical learning rate.
2.22 An example of typical training and validation loss trends. The dashed line shows where the validation loss has its minimum.
2.23 Visualization of an underfitting example
2.24 Visualization of how a dataset is split when performing 3-fold cross validation.
2.25 The first five iterations of the fractal Sierpinski triangle set.
2.26 Estimating the Hausdorff dimension of the coast of Great Britain by covering it with balls of decreasing size. By Prokofiev - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=12042048.
2.27 Samples of cartilage tissue and their class as defined in this study in Table 3.1.
2.28 Estimated multifractal spectrums of the samples in Figure 2.27. D(h) is the Hausdorff dimension and h(q) is the Hölder exponent.
2.30 The iterative filtering of the discrete wavelet transform. The figure shows filtering up to level 3 coefficients.
2.31 Visualisation of the dyadic cube and wavelet leaders at various levels of scale. Image credit: Herwig Wendt [wendt2008contributions]
2.32 Left: A small region of an arthroscopic image featuring a condyle lesion. Right: wavelet leader plotted versus j (black) and slope (blue). hmin is stated in red.
2.33 The dyadic interval of length L = 2^3 = 8 and sub-intervals.
3.1 Example of an annotated image. Each polygon corresponds to a region given a class. Table 3.1 lists how each annotated class relates to the ICRS and CLG grades of condyle lesions.
3.2 Architecture of a neural network classifier
3.3 Examples of source images that require cropping. Left: A doubly truncated region of interest. Middle: Region of interest is whole and centered. Right: A truncated region of interest that is also offset from the center.
3.4 Relative size of each annotated class as percentages.
3.5 Relative size of the combined classes as percentages.
3.6 Left: Original image. Right: Color jitter augmentation applied.
3.7 Left: Original image. Middle: Horizontal flip applied. Right: Vertical flip applied.
3.8 Left: Original image. Right: Rotation augmentation applied.
3.9 Left: Original image. Right: Perspective augmentation applied.
3.10 Left: Original image without augmentation. Right: Elastic deformation augmentation applied.
3.11 Left: Original image without augmentation. Right: Gaussian noise augmentation applied.
3.12 Example of train and validation losses at each epoch during a training session. Note that early stopping was initiated after 120 epochs.
4.1 PCA of two datasets comprised of two different sizes of sub-images. Top: ts=64, Bottom: ts=128. Log-cumulants in red.
4.2 3D PCA embedding of multifractal feature vectors. Purple is class 0, blue is class 1, yellow is class 2 and green is class 3.
4.3 KernelPCA with RBF (radial basis function) kernel. Color to class mapping is the same as in Figure 4.2.
4.4 Visualization of t-distributed stochastic neighbour embedding (t-SNE) of the dataset. Embedding was performed with perplexity=50 and 4000 iterations.
4.5 Distribution of hmin for each class.
4.6 Distribution of the first three log-cumulants of each class.
4.7 Confusion matrices for the results of the different depths of MultiResUNet, in ascending order.
4.8 Mean confusion matrix for the 5-fold cross validation using MultiResUNet with depth 4.
4.9 An image showing a good semantic segmentation done by MultiResUNet on one of the training images, showing 3 different classes being segmented. The leftmost sub-image shows the original image, the middle shows the prediction and the rightmost shows the ground truth.
4.10 Another image showing a good semantic segmentation done by MultiResUNet on one of the training images, showing 2 different classes being segmented. The leftmost sub-image shows the original image, the middle shows the prediction and the rightmost shows the ground truth.
4.11 … The leftmost sub-image shows the original image, the middle shows the prediction and the rightmost shows the ground truth.
4.12 Accuracy variations for ts = 128.
4.13 Accuracy variations for ts = 64.
4.14 Confusion matrices of classification performance for the classifier trained with no augmentations and with color jitter.
4.15 Mean confusion matrix of the 4-fold cross validation.
4.16 Examples of inference using the multifractal based method on previously unseen images (sub-image size 128). Top row: two segmented samples (left and center-left) and their ground truths (center-right and right). Bottom row: four segmented samples without ground truth available. Class 0 is unmarked, class 1 is green, class 2 is yellow and class 3 is red.
5.1 Image noise: three sample details at the same level of zoom. The left sample has the highest level of noise and the right sample the least.
5.2 An annotated image from the training data, showing two annotated segments, where the green segment marks the meniscus and the orange segment marks a lesion. Looking at the edges of the green segment, it is clear that it does not follow the meniscus in a pixel perfect manner.

List of Tables

2.1 The ICRS classification system.
2.2 The Cartilage Lesion Grade (CLG) system.
3.1 Linking of ICRS and CLG scales to the data annotations used. The scales are defined in Table 2.1 and Table 2.2 respectively. While loose objects may be of interest, the number of samples available was deemed to be too low for accurate classification, and thus they were considered to be a part of the background instead.
3.2 The resulting feature vector of multifractal features over one scale range j.
3.3 The combined vector of features, both multifractal and global, over all used scale ranges.
4.1 Table showing the effects of different depths for MultiResUNet
4.2 Results from the hold-out validation using different data augmentations for MultiResUNet with depth 4. The best scoring metric is marked in bold.
4.3 Results from the hold-out validation using different data augmentations for DeepLabV3. The best scoring metric is marked in bold.
4.4 The results from using different batch sizes.
4.5 The results from training the two models with varying percentage of samples.
4.6 The results from 5-fold cross validation of the MultiResUNet model using no augmentations. The average is presented with the standard deviation, which is rounded to 3 decimal places.
4.7 Grid search for optimal combination of sub-image side ts = 128 (left) and ts = 64 (right), width factor F and number of hidden layers N, reported as total accuracy (Acc) and mean per-class accuracy (MPCA). Best results per F in bold, and best results overall in bold and with grey background. In the case of equal results, smaller F and N took precedence.
4.8 Classification performance of 2 vs 4 classes.
4.9 Effect of different data augmentation techniques on the metrics measured.

1 Introduction

Knee articular cartilage degeneration is a problem that in 2015 affected 2.7% of the world population [1]. Several studies [2, 3, 4] show that clinical examination alone is not a reliable basis for the diagnosis of pathological knees, and therefore knee arthroscopy is often performed to gather more evidence in the diagnostic process [4, 5]. Knee arthroscopy is an invasive procedure that uses endoscopy cameras, inserted through a number of small incisions, so that direct views of the internal surfaces, ligaments and cartilage of interest in the knee are obtained [5]. Non-invasive imaging techniques such as MRI or CT scans are sometimes used, but studies suggest that the accuracy of a clinical diagnosis is not significantly affected by the use of such scans [2, 4]. When performing arthroscopic examinations, reports [2, 3] show that even experienced surgeons may misdiagnose or miss pathological evidence.

Nickinson et al. [3] stated that the two most common diagnoses, medial meniscal tear and osteoarthritis, are misdiagnosed at rates of 18% and 10% respectively. Specifically, in the case of osteoarthritis, it was also found that 25% of diagnoses are false negatives, which may cause the patient unnecessary pain and suffering if the condition is allowed to progress even further, as is often seen. Kean et al. [6] reported that there is a poor correlation between clinical evaluation and imaging appearance, which makes treatment of osteoarthritis difficult. Surgical complications are associated with knee arthroscopy, even though the procedures performed are relatively straightforward [5]. Reigstad and Grimsgaard [7] reported the overall rate of complications after arthroscopic procedures to be 5%, which underlines the importance of correctly diagnosing the condition and of limiting the number of procedures performed. For all these reasons, better tools for osteoarthritis diagnosis are highly requested by experienced surgeons [8], which prompts research into the development of such tools.

The condyle, which is the cartilage covering the ends of load bearing bones in joints, may develop lesions that cause osteoarthritis. Condyle lesions are classified according to their appearance and thickness using standardized scales: the Outerbridge, ICRS and CLG scales. The commonly used Outerbridge scale leverages visual clues to determine the correct grade of cartilage lesion. The International Cartilage Repair Society (ICRS) and the Cartilage Lesion Grade (CLG) scales are based upon the depth of the lesion. There exist accurate methods of measuring cartilage thickness using light absorption rates [9, 10, 11] that can be useful when diagnosing osteoarthritis, but some of them require hyperspectral light and few arthroscopic cameras are capable of hyperspectral imaging. The development of methods for lesion classification that rely on existing camera imaging is therefore preferred. Other than the thickness of the cartilage, surface roughness, color and the degree of cartilage fibrillation are factors that are used to diagnose osteoarthritis. As lesions degrade, these visual properties change, allowing accurate assessment and grading of the cartilage thickness [5].

Machine learning, and more specifically deep learning, has seen huge success and popularity in biomedical engineering, especially in the realm of image processing. In particular, image segmentation and classification of pathological conditions in medical images have seen a huge influx of different solutions using deep learning [12, 13, 14, 15, 16]. This prompts the usage of deep learning in more areas of health care with the aim to aid in diagnosis, treatment and automation of time consuming or difficult tasks. Freeing up work from already overworked physicians and reducing the rate of human error are a few of the potential benefits. Deep learning has also seen a lot of research within endoscopic image processing, which is closely related to arthroscopy [17, 18, 19, 20, 21, 22].

Signals that exhibit fractal and multifractal properties may be organized and recognized by their multifractal spectrum. Specifically, gray-scale images (2D signals) may be analysed, classified and segmented by their textural appearance after estimating and analyzing their multifractal spectrum [23, 24, 25, 26]. Textures in medical imaging are a rich source of information for the clinician, who, depending on skill, might successfully leverage them for diagnosis [23]. Multifractal analysis has been performed on many domains of signals, including EEG/ECG anomaly detection in brain and heart activity [27, 28], MR brain imaging for tumor detection [29], microscopic imaging of metastatic bone disease [30], mammography of breast cancer [31] and images of retinal vessels [32]. More recently, multifractal analysis has been applied to signals sourced from knees. Fredo et al. [33] compared the multifractal spectra of vibroarthrographic signals and found that normal knees may be differentiated from knees with knee joint disorder.

To the authors' knowledge, investigating the viability of automatically classifying and segmenting articular cartilage from arthroscopic imaging using multifractal analysis or convolutional neural networks has not been done. Since there are clear textural differences between arthroscopic images of healthy and pathological cartilage, and since multifractal analysis has been applied successfully on many different domains of medical imaging, the authors hypothesise that it may also perform well classifying textures present in arthroscopic images.

1.1 Aim

In this thesis, two methods for automatically classifying and segmenting clinical conditions of joint cartilage are proposed and evaluated with suitable metrics. The methods are evaluated and statistically shown to be effective. Hence, our aim is to develop these two methods to be helpful for the diagnosis of pathological cartilage tissue, or to serve as a solid base for further work in the area. Moreover, an account of how the classification performance is affected by data augmentation, training techniques and other design considerations is presented.

1.2 Research Questions


1. How do the two methods compare with regards to relevant metrics, performance and size of data set?

Multifractal Based Method

2. How well can selected cartilage tissue be classified and segmented by their estimated multifractal features, using neural networks, as measured by relevant metrics?

3. How does the depth and width of the neural networks affect the relevant metrics?

4. What multifractal features are useful when classifying cartilage tissue?

5. How does the size of the data set affect the classification as measured by relevant metrics?

CNN Based Method

6. How well can a convolutional neural network (CNN) classify and segment selected cartilage tissue, as measured by relevant metrics?

7. How does the size of the data set affect the classification and segmentation as measured by relevant metrics?

8. What data augmentation techniques should be used when training the network to improve the classification and segmentation performance?

1.3 Delimitations

The study does not consider all possible hyaline or pathological conditions of joint cartilage due to data availability constraints. Only a selected subset of conditions are analyzed and processed depending on the availability of annotated samples.

The study is under a time constraint of 20 weeks, which limits the amount of time that can be spent on data acquisition, training of neural networks and exploration of neural network designs. Two selected neural networks will be studied thoroughly, but not exhaustively, for the CNN-based method. Several variants of the same network architecture, but not other network architectures or classification methods, will be studied for the multifractal based method. The training and validation of the models will be performed on a GPU instance provided by Amazon Web Services, and is therefore limited by the GPU's specifications.

1.4 Related works

Several works inspired the overall goals and methods of this thesis. Johansson et al. [9, 11] automatically segmented arthroscopy images based upon the light absorption rate variations between cartilage of different thickness.

Using CNNs to automatically segment medical images is a well researched area, particularly with endoscopic images [17, 18, 20]. Many of these works use traditional deep CNNs to classify the images, but there are some that explore semantic segmentation using encoder-decoder networks [15, 16, 14, 34]. These are the works that have inspired the deep learning method used in this thesis.

Inspiration for segmentation of arthroscopic images through classification of multifractal features came mainly from the works of Wendt et al. [24, 25] and Islam et al. [29]. Wendt et al. proposed a practical method for classifying textures using multifractal features, which the method of this thesis uses. Islam et al., on the other hand, segmented real MR images by classifying local multifractal features.

2 Theory

This chapter presents the theoretical background upon which this project is based. Four main topics are presented: section 2.1 describes the causes of osteoarthritis in detail; section 2.2 presents how arthroscopic procedures are performed and how condyle lesions are graded; section 2.3 presents the theoretical base for artificial neural networks, how they are constructed and how they are used; finally, section 2.4 contains the theory and practical adaptation of multifractal analysis.

2.1 Osteoarthritis

Knee osteoarthritis affected nearly 27 million people in the United States in 2007 [35]. Examination and attempted treatment were subsequently the cause of approximately 1 million arthroscopic surgeries in 2009 [5]. On a global scale, a total of 237 million people were reported to suffer from osteoarthritis and, among those, 202 million were afflicted in the knee joints, as reported by Vos et al. [1] in 2015. This constitutes 2.7% of the world's population at that time. The rate of cases increases with age, level of activity and excess body weight [36]. After cardiac diseases, osteoarthritis is the second largest reason for reduced time spent on vocational activities, which especially taxes communities whose demographics have an increased rate of pathological joints, both through loss of income and the expense of treatment. Loss of articular cartilage, in addition to inflammation and cyst formations in the bone, causes pain and loss of motion in the osteoarthritic knee. Articular cartilage, unable to regenerate itself, may be damaged by wear, trauma, genetic predisposition, and a plethora of other causes [5, 37]. As the cartilage wears, lesions are formed and stressed by load bearing, and friction is exerted upon the articular bones causing their deformation. Particles from degenerating cartilage pollute the surrounding articular fluid, potentially causing effects such as autoimmune response and inflammation. Pain and other symptoms stem from these changes, which are unpredictable and thus complicate treatment and diagnosis [5, 38].

Some controversy exists regarding the effectiveness of some, predominantly surgical, treatments of osteoarthritis. Nevertheless, such surgical treatment procedures are often used. Treatment of mild to moderate osteoarthritis focuses on preventing further degradation of the articular cartilage in the affected knee by strengthening its musculature and reducing swelling and pain with medications. More advanced cases may call for various surgical procedures that remove damaged tissues, repair lesions or even replace the entire joint with an artificial one [5, 39].

2.1.1 Lesions in Articular Cartilage

The condyle is a layer of cartilage at the end of the bone in a joint that has a high resistance to tensile forces. Its thickness varies from 2-4 millimeters in adults and it is composed mostly of fibrous collagens and water. As no vascular, neural or lymphatic supply is present in the cartilage, no regeneration of damaged cartilage occurs. The collagen forms a three-dimensional matrix where the fibres lie largely random in direction and location. This matrix secures the other underlying structural elements. One such type of element is the proteoglycan aggregate molecules that account for 4% to 7% of the cartilage by weight. These molecules attract the positive side of water molecules present in the articular fluid, which swells the cartilage, resulting in a low friction surface capable of enduring heavy loads such as body weight [5, 40]. In the study by Curl et al. [41], condyle defects were present in 63% of the 31,516 arthroscopies included. Although not all condyle defects are pathological, they may worsen into lesions that may degenerate further and cause osteoarthritis. Lesions are areas of cartilage that are abnormally thin, or even missing, exposing the underlying bone: the subchondral bone [5].

Figure 2.1: Internal anatomy of the knee. Image source and license: https://www.physio-pedia.com/File:Knee-patella.jpg

In abnormally weakened cartilage, the fibers in the matrix structure exhibit a more radial alignment, as opposed to the random alignment in healthy cartilage. This was shown by N.D. Broom [40] to be closely related to decreased functionality of the condyle, stressing the subchondral bones. Wear of the collagen fiber matrix causes it to deteriorate, and over time, the underlying cartilage elements may grind away until the bone is exposed. Gradual progression of the wear may be seen and analyzed during arthroscopic exams as the surface of the cartilage roughens and light absorption changes when the blood-containing subchondral bone is exposed.


The knee has three areas of articular cartilage, as shown in Figure 2.1: the two femoral condyle areas and the tibial plateau. The knee cartilage condition assessment may be done using formalized scales of lesions. The Outerbridge scale is the simplest, ranging from 0 to IV, where 0 stands for normal cartilage, I stands for cartilage with softening and swelling, II stands for cartilage with partial thickness defects on the surface, III stands for cartilage with fissuring to the level of subchondral bone and IV for exposed subchondral bone [42]. The Outerbridge scale is thus based on visual indications of the cartilage under investigation. The more recent ICRS and CLG scales used in this study are introduced below.

ICRS/CLG Grades I to IV

During this study, ICRS/CLG grades I to IV were of special interest, since cartilage associated with one of those grades exhibits distinctive visual properties and is found in osteoarthritic knees. A study performed using a modified Outerbridge scale concluded that 63% of all arthroscopies featured cartilage lesions, and the share of lesions for each grade was 9.7% for I, 28.1% for II, 41.0% for III and 19.2% for IV (n=31,516) [41]. While the scales used were neither ICRS nor CLG, the study shows that more advanced lesions were present in a large share of cases and that they are of interest and relevant to include in this study.

ICRS

The ICRS classification system improves upon the more subjective Outerbridge scale, taking into consideration the depth of the lesions compared with surrounding cartilage [43]. Table 2.1 lists the grades of the scale.

Table 2.1: The ICRS classification system.

Grade  Definition
0      Intact cartilage
I      Superficial (soft indentation or superficial fissures and cracks)
II     Lesion less than half the thickness of articular cartilage
III    Lesion greater than half the thickness of articular cartilage
IV     Lesion extending to subchondral bone

Cartilage Lesion Grade (CLG)

The Cartilage Lesion Grade (CLG) is based upon the ICRS scale, but instead of relying on surrounding cartilage thickness, the grades are separated on the basis of remaining cartilage thickness in a lesion [9]. This system does not require surrounding tissue to be intact. The grades are listed in Table 2.2.

Table 2.2: The Cartilage Lesion Grade (CLG) system.

Grade  Definition
0      Normal cartilage
I      Lesion with remaining cartilage thickness larger than 1.5 mm
II     Lesion with remaining cartilage thickness between 1.0 and 1.5 mm
III    Lesion with remaining cartilage thickness between 0.5 and 1.0 mm
IV     Lesion with remaining cartilage thickness less than 0.5 mm

Both grading systems require knowledge of the cartilage thickness, either relative to the surroundings or as an absolute measurement. The next section presents a method to measure the cartilage thickness using arthroscopic imaging.


2.2 Arthroscopy

As previously mentioned, arthroscopy is the most common orthopaedic procedure being performed. However, it is a relatively young type of surgery, not entering mainstream orthopaedic surgery until the 1970s [5]. Today, arthroscopic examination has become a standard procedure to perform before other orthopaedic surgeries such as meniscectomy and cartilage restoration. The procedure is typically done by creating two incisions in the knee, one for the arthroscope camera and the other for surgical instruments. These incisions can be created in different parts of the knee to view and operate on different places of the intraarticular anatomy. They are aptly named portals, and the five common ones are: the anterolateral, the anteromedial, the superomedial, the posteromedial, and finally, the posterolateral portal. Where these knee portals are located can be seen in Figure 2.2.

Figure 2.2: Locations of the different knee portals

Once the incisions have been made, the arthroscope and surgical instrument can be inserted. Using the arthroscope camera, the surgeon can examine the intraarticular anatomy of the knee and identify potentially pathological findings. The arthroscope is a type of endoscope, which is an optical and tubular camera used for looking deep into the body of a patient. Typical images produced by the arthroscope can be seen in Figure 2.3, where examples of healthy cartilage and femoral condyle lesions are shown.


Figure 2.3: Left: Healthy hyaline cartilage. Center: Femoral condyle lesion. Right: Deep femoral condyle lesion, with the subchondral bone exposed.

The endoscope typically feeds a live video stream to specialised hardware before being sent to a video monitor for the surgeon to view. This specialised hardware can apply filters to the video to enhance certain features or information of the images to aid the surgeon in diagnosis [5].

2.2.1 Measuring Cartilage Thickness During Arthroscopy

Johansson et al. [11] present a novel way of measuring cartilage thickness that leverages the different light absorption rates of cartilage and bone. Specifically, they showed that the thickness may be accurately measured from the amount of visible light absorbed. This measurement can therefore be performed using normal arthroscopy camera imaging, capturing light in the wavelength range of λ = 380 nm to λ = 780 nm. Since the measurement is based on imaging, it may be performed during the surgery or afterwards on recorded data. This couples the thickness of the remaining cartilage with direct visual clues that may be leveraged by the methods proposed in this thesis to estimate the depth of cartilage lesions.

Computer assistance is often used in conjunction with arthroscopy: gathering and storing recorded data, enhancing interesting visual aspects of the images [9], performing measurements [44] and other tasks both visual and surgical in nature [45]. Real-time algorithms can aid the surgeons performing arthroscopies in tasks that are difficult, such as classifying tissue based on its appearance. As stated in chapter 1, better tools for arthroscopies are in demand, and as shown by Johansson et al., computer vision can be used for automatic measurement of cartilage thickness.

2.3 Artificial Neural Networks (ANNs)

There exist many problems that are hard to solve using traditional programming techniques and models. For example, one such problem is trying to teach a computer to identify handwritten digits. This is something that is "easy" for a human to perform, as we do it subconsciously without even thinking about it. However, this operation in the human brain involves many visual cortices, each having millions of neurons and billions of connections between these neurons, resulting in a very complex network of neurons, or a neural network. This is what makes the problem of identifying handwritten digits harder to program than it appears at the surface. These cerebral networks inspired the invention of artificial neural networks to solve complex tasks that were previously considered only solvable by humans. Using artificial neural networks to solve problems like identifying handwritten digits has consistently been proven to be very successful, reaching accuracies close to or higher than those of humans [46, 47].

The Artificial Neuron

The basic building block of artificial neural networks is the artificial neuron, which is loosely based on the biological neuron in the human brain. The first artificial neuron was the perceptron, invented by scientist Frank Rosenblatt in the late 1950s [48]. The design of the perceptron can be seen in Figure 2.4, where x_1, x_2, x_3 are the inputs to the perceptron.

The output is calculated by Equation 2.1, where w_j is a weight variable associated with each input x_j. The output is always either 0 or 1, indicating whether the perceptron is activated or not, just like a biological neuron. The weights w_j allow different linear combinations of the inputs, meaning the perceptron can weigh the inputs depending on their importance. Thus, the perceptron can make binary decisions.

output = { 0 if Σ_j w_j x_j ≤ threshold
         { 1 if Σ_j w_j x_j > threshold,    where j indexes an input.    (2.1)

Figure 2.4: The perceptron.

However, the model of the perceptron described above lacks a key feature: a bias, which adds an invariant part to the prediction. This allows tuning how likely the perceptron is to be activated, regardless of the inputs. Therefore, Equation 2.1 is updated to Equation 2.2, where b is the said bias.

output = { 0 if Σ_j w_j x_j + b ≤ threshold
         { 1 if Σ_j w_j x_j + b > threshold    (2.2)
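As a concrete illustration of Equation 2.2, the following minimal Python sketch implements a perceptron that makes a binary decision from three inputs. The weights, bias and threshold are arbitrary example values, not values from the thesis.

```python
def perceptron(x, w, b, threshold=0.0):
    """Binary perceptron: fires (returns 1) if the weighted sum of the inputs
    plus the bias exceeds the threshold, otherwise returns 0 (Equation 2.2)."""
    weighted_sum = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if weighted_sum > threshold else 0

# Example with three inputs and hand-picked weights (illustrative only).
x = [1.0, 0.0, 1.0]      # inputs x_1, x_2, x_3
w = [0.6, -0.2, 0.4]     # weights w_1, w_2, w_3
b = -0.5                 # bias
print(perceptron(x, w, b))  # -> 1, since 0.6 + 0.4 - 0.5 = 0.5 > 0
```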

Artificial Neural Networks Architecture

So far, the architecture of ANNs has not been discussed at all, yet it is the architecture that allows the simple artificial neuron to become such a powerful tool. Arranging the artificial neurons in specific network structures allows for complex decision making with regard to some input. How to best design and implement these networks is a topic that is still being researched to this day, and is yet to become an exact science.

The most basic design principle when constructing a neural network is dividing it into layers. These layers are further divided into three categories: the input layer, the output layer and the hidden layers. The names are quite self-explanatory. The input layer consists of the neurons that receive the input and send it into the network. The output layer consists of the neurons that output the result of the network. Finally, the hidden layers consist of the layers of neurons that lie in between the input and output layers. This design principle is visualised in Figure 2.5, where there is only a single hidden layer.

The arrows in Figure 2.5 show how the output from one neuron propagates as an input to another. As illustrated, the output of one neuron connects to all the neurons of the next layer, and so on. This is commonly referred to as the layer being fully connected, meaning that a neuron in the layer bases its output on all the neurons in the previous layer. Each layer therefore abstracts on the previous layer, and it is this ability to abstract on previous layers that allows for complex decision making.

Figure 2.5: An example of artificial neural network architecture. Each circle denotes an artificial neuron, and a group of them constitutes a layer. The input, hidden and output layers are marked in red, blue and green respectively.

The number of hidden layers, as well as the number of neurons in the different layers, are important parameters to configure for the network, and they greatly influence its performance. Unfortunately, there is no known guaranteed way to optimally set these parameters for maximum performance. However, there are some empirically developed guidelines that can be followed to get good performance.

Activation Functions

Unfortunately, the perceptron has a flaw when it comes to training. Take the example of learning to identify handwritten digits, and say that our perceptron is able to identify all the digits except for the digit 1. We then update the weights and biases to make the perceptron correctly classify the digit 1. If enough updates have been done, a perceptron might flip its output, which might affect the output of other perceptrons in the network for other digits. Therefore, the sigmoid neuron was introduced, which is very similar to the perceptron. The sigmoid neuron is a perceptron with a sigmoid as activation function, where an activation function is a function applied to the output of the neuron. Since the sigmoid is non-linear, the neurons can now solve non-linear problems.


The sigmoid function takes any real value and maps it to a value between 0 and 1. It is described by Equation 2.3 and visualised in Figure 2.6.

σ(z) = 1 / (1 + e^(−z))    (2.3)

As previously stated, the sigmoid function is applied to the output of the neuron, changing the output to be defined as

output = σ(w · x + b),    (2.4)

where w are the weights, x are the inputs, b is the bias and σ is the activation function, which in this case is the sigmoid function.

Figure 2.6: The sigmoid function

There are, of course, many other activation functions currently in use. One of the more popular ones is the ReLU activation function. It is defined as

ReLU(y) = max(0, y),    (2.5)

where y is the output of the neuron. It forces negative outputs of the neuron to zero and leaves positive ones unchanged. ReLU was mainly introduced to combat the vanishing gradients problem, which is common in deeper neural networks. For the interested reader, this problem is explained further in [49, p. 289].


Figure 2.7: The ReLU activation function.

Another hugely popular activation function, for multi-class problems, is the softmax function. This function takes an input vector of scores s and returns a vector of probabilities that sum to 1, as shown in Equation 2.6.

output_i = e^(s_i) / Σ_j e^(s_j)    (2.6)
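The three activation functions above can be written in a few lines of NumPy. The snippet below is a generic sketch of Equations 2.3, 2.5 and 2.6, not code from the thesis.

```python
import numpy as np

def sigmoid(z):
    # Maps any real value into the interval (0, 1), Equation 2.3.
    return 1.0 / (1.0 + np.exp(-z))

def relu(y):
    # Clamps negative values to zero, Equation 2.5.
    return np.maximum(0.0, y)

def softmax(s):
    # Turns a vector of scores into probabilities that sum to 1, Equation 2.6.
    # Subtracting the maximum score is a standard trick for numerical stability.
    e = np.exp(s - np.max(s))
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0])
print(sigmoid(scores), relu(scores), softmax(scores))
```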

2.3.1 Convolutional Neural Networks

One of the most prominent and widely used kinds of artificial neural networks is the convolutional neural network (CNN) [50]. It is specialized for grid-like input data, such as images or time-series data. CNNs have been proven to be successful in image analysis, natural language processing and complex games like Go [51, 52, 53]. Particularly in the field of biomedical engineering, CNNs have been useful for classifying and segmenting pathological conditions in medical images. CNNs belong to an area that has seen rapid growth over recent years and continues to do so today.

A basic CNN consists of several blocks, where each block consists of a convolutional layer followed by batch normalization, an activation function and pooling layers. After each block, a feature map is produced and fed to the next block. Each feature map can be seen as an abstraction of the input to that block, where higher level features of the input are discovered. For example, the first block might find edges and lines, the second squares, and so on. After the final block, the final feature map is produced and is usually fed to two dense layers that then output a vector of class scores. This basic CNN architecture can be seen in Figure 2.8.

The following sections in this chapter describe the different layers that constitute the CNN architecture.


Figure 2.8: A basic CNN architecture. By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374

The Convolution Layer

As the name suggests, CNNs are based on the convolutional layer. A convolutional layer computes the 2D convolution between an input and a kernel, which has a much smaller size than the input (e.g. 3x3, 5x5 or 7x7). The kernel is made up of weights that are updated during training. An example of a 2D convolution can be seen in Figure 2.9, where the input size is 4x4, the kernel size is 2x2, and the convolution is done using a stride of 1 and padding of 0.

Figure 2.9: Visualization of a convolutional operation with kernel size = 2x2, stride = 1 and padding = 0 also known as valid mode padding.

The stride dictates the number of positions the kernel moves at each step. If the stride is increased, the output matrix will be smaller, which is typically desired when downsampling or increasing computational efficiency. This is visualized in Figure 2.10, where the stride is set to 2, using the same example as in the figure above.


Figure 2.10: Visualization of a convolutional operation with kernel size = 2x2, stride = 2 and padding = 0.

The padding parameter adds values, typically zeros, around the input matrix. This is primarily done to allow the output matrix to have the same size as the input matrix, in order not to lose pixels at the perimeter of the input image. The effects of padding can be seen in Figure 2.11, where both the padding and the stride are set to 1.


Figure 2.11: Visualization of a convolutional operation with kernel size = 2x2, stride = 1 and padding = 1 also known as same mode padding.

What the parameters kernel size, stride and padding have in common is that they all affect the size of the output matrix. As one would expect, there exists a formula to calculate the size of the output matrix given these three parameters. It is formulated as

Output size = (x − k + 2p) / s + 1,    (2.7)

where x is the input size, k is the kernel size, p is the padding and s is the stride.

What makes the convolutional layer so powerful is that it greatly reduces the number of parameters that have to be learned by the network. This can be illustrated by comparing the number of parameters that have to be learned by a fully connected layer, commonly found in neural networks, and by a convolutional layer for the same input matrix. As previously mentioned, a fully connected layer consists of a set number of neurons, each having a weight associated with each input to that layer. Let's say that the input is a black and white (single channel) image of size 200x200 pixels. Each neuron would then have 40 000 weights or parameters to learn. Now let's say we replace the fully connected layer with a convolutional layer found in the first layers of the U-Net architecture [34], where the kernel size is 3 and the number of kernels is 64. Given these parameters, the number of learnable parameters or weights of the convolutional layer is 3 * 3 * 64 = 576, which is vastly smaller than 40 000. This is one of the prime reasons why CNNs are so extensively used in image processing.
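A small helper, written here only for illustration, reproduces Equation 2.7 and the parameter-count comparison from the paragraph above.

```python
def conv_output_size(x, k, p, s):
    """Spatial output size of a convolution (Equation 2.7)."""
    return (x - k + 2 * p) // s + 1

# The 4x4 input with a 2x2 kernel from Figures 2.9-2.11:
print(conv_output_size(4, 2, 0, 1))   # stride 1, no padding -> 3
print(conv_output_size(4, 2, 0, 2))   # stride 2, no padding -> 2
print(conv_output_size(4, 2, 1, 1))   # stride 1, padding 1  -> 5

# Parameter comparison for a 200x200 single-channel image:
fully_connected = 200 * 200           # weights per neuron in a dense layer
convolutional = 3 * 3 * 64            # 64 kernels of size 3x3 (biases ignored)
print(fully_connected, convolutional)  # 40000 vs 576
```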


Max Pooling

The max pooling layer is used for downsampling of an input, typically hidden layer output matrices or feature maps. This is done by dividing the input into sub-regions and taking the maximum value of each of these sub-regions. The size of these sub-regions is defined by the filter size, which typically is 2x2. Whether these sub-regions overlap or not is determined by the stride. Figure 2.12 shows an example where the filter is of size 2x2 and has stride 2. There are several benefits of the downsampling performed by the max pooling layer: it both reduces computational cost and abstracts the input, which helps against overfitting.

Figure 2.12: Visualization of max pooling.
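A minimal NumPy sketch, purely illustrative, of 2x2 max pooling with stride 2 as in Figure 2.12.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D array with even dimensions."""
    h, w = x.shape
    # Reshape into non-overlapping 2x2 blocks and take the maximum of each.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]])
print(max_pool_2x2(x))  # [[6 4]
                        #  [8 9]]
```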

Batch Normalization

A problem arises when training deep neural networks. It is caused by the fact that the parameters of the layers in the network are updated during training, which results in these layers producing outputs with different distributions after each update. This complicates learning for the layers of the network, since their input is based on the previous layers' outputs and they therefore must learn the new distributions. This phenomenon is called "internal covariate shift", and to combat it, batch normalization was introduced by Sergey Ioffe and Christian Szegedy at Google [54].

The goal of batch normalization is to reduce the internal covariate shift by fixing the distribution of inputs during training. This is done by "whitening" the inputs to all layers, i.e. linearly transforming the inputs to have zero mean and unit variance. The algorithm to perform this whitening is rather involved, but the interested reader can read about it in the batch normalization paper [54].

There are two main advantages of using batch normalization. Firstly, it enables higher learning rates, as it combats vanishing and exploding gradients. Secondly, it regularizes the model, as each training example is seen in conjunction with the other examples in the mini-batch.
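The core normalisation step can be summarised in a few lines. The sketch below shows only the training-time transform; the learnable scale gamma and shift beta, and the running statistics used at inference, follow the batch normalization paper [54].

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise a mini-batch x (shape: batch x features) to zero mean and
    unit variance per feature, then apply the learnable scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```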

2.3.2 Encoder-Decoder Network Architecture

The encoder-decoder network architecture is a commonly used design for convolutional neural networks. It is relatively simple, consisting of just two main parts: the encoder and the decoder. The encoder is responsible for extracting deep features from the input, encoding them into a final state. This is typically done by using a CNN as the encoder and removing the final dense layers, which retains only the deep features produced by the final convolutional layer(s) of the CNN. From these deep features, the decoder produces the desired output. Typically, the decoder tries to recover the spatial information that is lost during the encoding steps.


A simple example of how these types of networks are used is language translation. The encoder would extract the semantic information from a sentence in one language, and the decoder would then produce the sentence's semantic equivalent in another language.

Figure 2.13: The encoder-decoder architecture.

Semantic Segmentation

A typical use case for encoder-decoder architectures is semantic segmentation tasks. Semantic segmentation involves both classifying and detecting objects in an image: for example, trying to find and identify pedestrians and cyclists in a video feed from a camera mounted on a car. CNNs without a decoder are typically not suited for these tasks, since they focus on extracting abstract features from an image for classification rather than maintaining the spatial information required for object detection. However, a CNN with both an encoder and a decoder can perform semantic segmentation tasks quite well. This design is used by many popular models designed for semantic segmentation, and some of these models are described in the following sections.

U-Net

U-Net [34] is a popular encoder-decoder network, specifically designed to be used for biomedical image segmentation. Its design allows it to reach high performance even with a small dataset, since annotated targets are scarce in medicine.

U-Net derives its name from its U-shaped structure, which can be seen in Figure 2.14. The network begins with a contracting path, the encoder, that extracts features from the input image. The encoder consists of a number of steps, where this number is called the depth. Each encoder step consists of two consecutive 2D convolutional layers with kernel size 3x3 and stride 1, followed by a max pooling layer with kernel size 2x2 and stride 2. After each convolutional layer, the ReLU activation function is applied. The first encoder step begins with 64 filters for the convolutional layers, and this number is doubled at each following step. After the final step of the encoder, the features produced so far are sent to the bottleneck. The bottleneck connects the encoder to the decoder and applies the same operations as an encoder step, but replaces the max pooling layer with an up-convolutional layer with kernel size 2x2. After the bottleneck, the network ends with an expanding path, or decoder, that generates the final output. This part consists of several steps, where at each step the feature map is upsampled by 2x2 up-convolutions and the number of channels in the feature map is halved, followed by regular 3x3 convolutions and the ReLU activation function.

At each step in the decoder path, the feature map from the corresponding level in the encoder path is cropped and concatenated with the feature map after the up-convolution. These skip connections between the encoder and decoder paths, together with the absence of dense layers, are why U-Net is referred to as a fully convolutional network.


Figure 2.14: The U-Net architecture. Image source [34]
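As a rough sketch of one encoder step as described above (two 3x3 convolutions with ReLU followed by 2x2 max pooling), written in PyTorch purely for illustration. The channel counts and the use of same-mode padding are assumptions here, not the exact configuration used in the thesis or in the original U-Net paper.

```python
import torch.nn as nn

class EncoderStep(nn.Module):
    """One U-Net contracting step: two 3x3 convolutions + ReLU, then 2x2 max pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        features = self.conv(x)           # kept for the skip connection
        downsampled = self.pool(features)  # passed on to the next, deeper step
        return features, downsampled

# First encoder step of a U-Net-like network: 3 input channels -> 64 filters.
step1 = EncoderStep(3, 64)
```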

MultiResUNet

The MultiResUNet [14] architecture is, according to its authors, a proposed successor to the popular U-Net architecture, and uses the same U-shaped structure. The architecture of MultiResUNet can be seen in Figure 2.15. The main differences in the MultiResUNet architecture are the modifications of the skip connections and of the blocks in the contracting path. In the skip connections, a residual block has been added that applies convolutional layers with residual connections to the encoder features. In the contracting path, so-called MultiRes blocks have replaced the two 3x3 convolutional layers. These MultiRes blocks consist of a succession of three 3x3 convolutional layers, whose outputs are concatenated at the end. The goal of the MultiRes block is to allow the network to look at learnt features from different scales.


DeepLabV3

The DeepLab series of models are state-of-the-art open-source semantic segmentation models designed by Google [55]. The key feature of these models is that they essentially allow other CNNs, like ResNet and VGG16, normally used for classification, to be used for semantic segmentation tasks. This is done by replacing the dense layers at the end of the DCNNs with convolutional layers producing a score map, and then applying decoder-like operations to this score map. The most important part of these decoder operations is the atrous convolution, which is a special variant of the regular convolution. Essentially, it is a convolution with up-sampled filters, which combats the spatial information lost in regular convolutions due to max pooling and stride. An up-sampled filter is basically a regular filter with zeros inserted between the filter values, depending on the stride size or the desired factor of up-sampling. A visualization of the atrous convolution can be seen in Figure 2.16 and the architecture of DeepLab can be seen in Figure 2.17.

With the later versions of the DeepLab series, the DCNN is changed to ResNet50 or ResNet101, and more elaborate ways of using the atrous convolution are introduced. Specifically, with DeepLabV3, an image pyramid, an encoder-decoder structure, more cascaded atrous convolutions and spatial pyramid pooling are introduced. These new features are aimed at the problem of objects existing at different scales, and the interested reader can read about them in the DeepLabV3 paper [55].

Figure 2.16: Visualization of the atrous convolution. Image source [56]
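In PyTorch, an atrous (dilated) convolution is exposed through the dilation argument of nn.Conv2d. The snippet below only illustrates the idea and is not taken from any DeepLab implementation; the channel counts are arbitrary.

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation 2 inserts one zero between kernel taps,
# giving an effective receptive field of 5x5 without adding parameters.
atrous = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3,
                   dilation=2, padding=2)  # padding=2 keeps the spatial size

x = torch.randn(1, 64, 32, 32)
print(atrous(x).shape)  # torch.Size([1, 64, 32, 32])
```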


2.3.3 Training Data and Data Preprocessing

One of the most important parts of working with neural networks is the training data, i.e. the data that the neural network bases its learning on. This data dictates what type of neural network to choose. For example, if the data are images, then a CNN would most likely be a good choice of network architecture.

The amount of data is also an important factor in a machine learning project, since it directly limits how much the network can learn, and therefore how accurate it can become. Generally, a model will generalize better with more data. There are, however, papers that indicate that a relatively small number of images is required to achieve above 90% accuracy for certain tasks [57]. What is consistently found, though, is that increasing the dataset allows the model to reach higher performance [58].

Data Augmentation

In order to increase the number of images in the training and validation sets, and thus increase the performance of the network, data augmentation is often applied. Data augmentation uses different augmentation techniques to artificially introduce variations in the dataset. This has the effect of regularizing the model, thus preventing overfitting. It usually means modifying the data in such a way that it is different but not unrecognizable. A good example is the horizontal flip operation used in image processing. This operation flips the image horizontally, creating a mirrored version of the original image. Both images show the same content, but from different angles. Just using the horizontal flip would result in a doubling of the original dataset size, illustrating the usefulness of data augmentation when trying to increase the dataset. There are of course many other augmentation techniques, but only the ones used in this thesis will be explained, in the method chapter.

However, data augmentation can also be used to introduce noise and other distortions to the data. This is useful, since it forces the network to train on noisy data, which in turn makes it more robust to such noise in real data. This is particularly useful in image classification, where a model should be able to correctly classify an image even if it is cropped, has a different zoom level or is captured under varying lighting conditions, as long as the object to be classified is still discernible.

2.3.4 Training the Network

How a typical neural network is built and designed has now been described, but how are the weights and biases in the neuron equation chosen? This is done mainly through two processes called forward propagation and back propagation, together with a loss function. Applying them is often called training.

Forward Propagation

In order to learn, a network has to generate some output that can be validated. This process is forward propagation and, as the name suggests, the input is propagated forward through the network. To be more precise, the input is fed to the first layer of neurons, which in turn feeds its output to the next layer, and so on until the output layer is reached and an output is produced.


To clarify, let us take as an example a neural network with two hidden layers. All the layers calculate their output according to the following formula:

\[ Z_i(x) = \gamma(W_i^T x + b_i), \tag{2.8} \]

where $Z_i(x)$ is the output matrix, $\gamma$ is an activation function, $W_i$ is the weight matrix, $x$ is the input matrix, and $b_i$ is the bias of the $i$-th layer, where $i$ is the index of the layer.

With Equation 2.8, the output of the network can be calculated according to:

\[ Z_3 = Z_3(Z_2(Z_1(x))), \tag{2.9} \]

where $Z_3$ is the output of the final layer. Equation 2.9 clearly shows why this process is called forward propagation.
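A minimal NumPy sketch of Equations 2.8 and 2.9 is shown below, assuming a ReLU activation and arbitrary layer sizes; it is only meant to show how the layer outputs are chained.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def layer(x, W, b):
    # Equation 2.8: one layer's output given input x, weights W and bias b.
    return relu(W.T @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                         # input vector
W1, b1 = rng.normal(size=(4, 8)), np.zeros((8, 1))
W2, b2 = rng.normal(size=(8, 8)), np.zeros((8, 1))
W3, b3 = rng.normal(size=(8, 2)), np.zeros((2, 1))

# Equation 2.9: the network output is the composition of the three layers.
z3 = layer(layer(layer(x, W1, b1), W2, b2), W3, b3)
```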

Loss Functions

Now that we have a way of calculating the output of the network, how is this output validated? Or rather, how incorrect was the network? In order to quantify this, loss functions are used. These functions calculate how much the output from the network differs from the correct value.

One of the most common loss functions is the mean squared error function, described by Equation 2.10, where $n$ is the number of samples, $Y_i$ is the vector of outputs from the network and $\hat{Y}_i$ is the vector of correct values. This function calculates the average squared deviation from the correct values. A useful property of this function is that large deviations have a greater effect on the error than small deviations, effectively penalizing large errors more than smaller ones.

\[ e = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \tag{2.10} \]

Another common loss function is the cross entropy loss function, which is based on the cross-entropy from information theory [59]. It is typically used for classification problems, where a model is trained to predict a label for a certain input. Cross entropy measures the dissimilarity between two probability distributions, $p$ and $q$, where $p$ is the true distribution and $q$ is the predicted distribution. This dissimilarity is quantified by the following equation

\[ H(p, q) = -\sum_{x} p(x)\log q(x), \tag{2.11} \]

where $H(p,q)$ is the cross entropy, and $p(x)$ and $q(x)$ are the probabilities of an event $x$ under the two distributions. For binary classification, Equation 2.11 can be rewritten as Equation 2.12 using the following reasoning: if the predicted probability of the positive class is $y$ and the true probability is $\hat{y}$, then the predicted and true probabilities of the negative class are $1 - y$ and $1 - \hat{y}$, respectively:

\[ H(p, q) = -\left(\hat{y}\log y + (1 - \hat{y})\log(1 - y)\right) \tag{2.12} \]
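For reference, minimal NumPy versions of the two losses (Equations 2.10 and 2.12) are sketched below; the predicted probabilities are clipped to avoid taking the logarithm of zero, which is a common implementation detail rather than part of the definitions.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Equation 2.10: average squared deviation from the correct values.
    return np.mean((y_pred - y_true) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Equation 2.12, averaged over all samples.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```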


Back Propagation

The final step of training is updating all the weights and biases of the neurons in the network, according to some optimization technique using the loss value from a chosen loss function. This step is called back propagation, as we go back through all the layers of the network and update the weights and biases of the neurons.

In more precise terms, we want to minimize a loss function with respect to the weights and biases of the network using an optimization technique. The most common optimization techniques used, particularly for neural networks, are gradient based algorithms. These algorithms calculate the derivative of the loss function and use the negative gradient to update the weights and biases towards a minimum point. One way of visualizing this is to plot the loss function against the weights of a network. Figure 2.18 shows such a plot, although it is of course a very simplified version, as most deep neural networks consist of thousands or millions of weights.

Figure 2.18: Visualization of gradient descent.

Stochastic Gradient Descent

One hugely popular gradient based optimization algorithm is stochastic gradient descent (SGD). It is called stochastic because it uses a stochastic approximation of the gradient descent optimization: only a random subset of the entire dataset is used to calculate the derivative and perform the optimization. This makes each calculation of the derivative cheaper, but requires more updates of the network.

The subset used in SGD is called a mini-batch, and its size is an important parameter to tweak, since it affects memory consumption and how well the network converges during training. A larger mini-batch results in a calculated gradient that is closer to the "true" gradient, but consumes more memory. Given a specific mini-batch size b, the number of updates n can be calculated according to

\[ n = l/b, \]

where $l$ is the number of samples in the training set.


For each mini-batch, the derivatives of all its samples are calculated and averaged, and this average is used to perform the back propagation of the network. One pass of forward and back propagation over all mini-batches is called an epoch. The back propagation update can be seen in Equation 2.13, where $w$ denotes the weights of the network, $\nabla Q(w)$ is the derivative of the loss function and $n$ is the mini-batch size. The equation also contains $\eta$, which is the learning rate or step length. This important parameter of the SGD algorithm is described in more detail in the learning rate section below.

\[ w := w - \frac{\eta}{n}\sum_{i=1}^{n} \nabla Q(w)_i \tag{2.13} \]
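A minimal sketch of one mini-batch update according to Equation 2.13 is given below; grad_loss is a hypothetical function returning the gradient of the loss for a single sample, so the sketch only illustrates the averaging and the update step.

```python
import numpy as np

def sgd_step(w, batch_x, batch_y, grad_loss, lr=0.01):
    # Average the per-sample gradients over the mini-batch, then take one
    # step of length lr in the negative gradient direction (Equation 2.13).
    grads = np.mean([grad_loss(w, x, y) for x, y in zip(batch_x, batch_y)], axis=0)
    return w - lr * grads
```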

However, SGD can be quite susceptible to oscillations during training, slowing down convergence. One way of mitigating this is to introduce a momentum term into the SGD formula, resulting in stochastic gradient descent with momentum (SGDM). The momentum is a value, often between 0.8 and 0.9, that is multiplied by the previous update. This means that the next update of SGDM is affected by the previous one, making new updates less sensitive to the current derivatives. This improved version of SGD can be seen in Equation 2.14, where $m$ is the momentum and $\nabla w$ is the previous SGDM update.

\[ w := w - \frac{\eta}{n}\sum_{i=1}^{n} \nabla Q(w)_i + m \nabla w \tag{2.14} \]

Learning Rate

The learning rate parameter of SGD is one of the most important hyperparameters to tweak when training a neural network. It greatly affects the convergence during training and which minimum point is reached. A poorly chosen learning rate can result in a network that does not learn, learns very slowly or does not converge, usually as a result of setting the learning rate too high or too low. Setting it too high results in training that is overly sensitive to the derivatives, making large updates based on them, which leads to divergence rather than convergence. Setting the learning rate too low results in training that makes only small updates, leading to slow convergence. This is illustrated in Figure 2.19, where the loss is plotted against the number of training epochs.


Figure 2.19: Visualization of how different values of the learning rate affects loss during train-ing.

Due to the large impact that the learning rate has on a network's ability to learn, finding a good learning rate is of utmost importance when training a network. It is an active research topic in machine learning, and no definitive answer exists. However, there are some generally good techniques for determining what learning rate to use. One such technique is the learning rate range test, which was introduced by Leslie Smith in his paper on cyclical learning rates [60]. Here, the network is trained with an increasing learning rate, usually incremented every mini-batch over a range from $10^{-7}$ to $10^{-1}$, and the loss is recorded for each mini-batch. When this is done, the loss is plotted against the learning rate. An example of such a learning rate range test plot can be seen in Figure 2.20. From this plot, a lower bound and an upper bound for the learning rate can be found; these two values are marked by the two orange lines in Figure 2.20. The lower bound is chosen as the learning rate where the loss starts decreasing, and the upper bound as the learning rate right before the loss starts oscillating. The reasoning for these bounds is clear: a learning rate below the lower bound results in a loss that does not decrease and remains quite stable, while a learning rate above the upper bound results in a loss that does not converge and oscillates wildly.
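A minimal sketch of the learning rate range test is given below; train_batch is a hypothetical function that performs one forward and backward pass on a mini-batch at a given learning rate and returns the loss, and the learning rate grows exponentially from $10^{-7}$ to $10^{-1}$ over the available mini-batches.

```python
import numpy as np

def lr_range_test(batches, train_batch, lr_min=1e-7, lr_max=1e-1):
    n = len(batches)
    # Exponentially increasing learning rates, one per mini-batch.
    lrs = lr_min * (lr_max / lr_min) ** (np.arange(n) / max(n - 1, 1))
    losses = [train_batch(batch, lr) for batch, lr in zip(batches, lrs)]
    return lrs, losses   # plot losses against lrs to pick the two bounds
```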


Figure 2.20: Visualization of a learning rate range test performed on MultiResUNet.

Learning Rate Scheduling and Cyclical Learning Rate

As previously mentioned, choosing the right learning rate is important if a neural network is to converge and reach a minimum. However, even with a good learning rate, the network can still have problems reaching one of its minima. There is also the possibility that the network gets stuck at a saddle point during training. Varying the learning rate during training has proven successful in avoiding such problems. This is called learning rate scheduling, and there is a whole host of different strategies for doing it. One such strategy is the cyclical learning rate scheduler [60]. As the name suggests, it cycles the learning rate between two values during training. It starts from the lower bound of the learning rate and increases to the upper bound during one step size, where the step size is typically half an epoch. During the next step, the learning rate decreases back to the lower bound. The increase and decrease of the learning rate between the two bounds can be done linearly, exponentially or using some other scheme, but is most often done linearly, which is also referred to as the triangular policy.

The cyclical learning rate scheduler has been shown to reduce the number of iterations required to reach optimal results. One intuitive explanation, given in the original paper, is that cycling to larger learning rates allows the learning process to get past saddle points.
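A minimal sketch of a triangular cyclical learning rate schedule is given below; given the current iteration, it moves the learning rate linearly between the two bounds with a period of two step sizes, which is one simple way of implementing the policy described above.

```python
def triangular_clr(iteration, lr_low, lr_high, step_size):
    # Position within the current cycle, in the range [0, 2).
    cycle_pos = (iteration % (2 * step_size)) / step_size
    # Rise linearly during the first step, fall linearly during the second.
    scale = cycle_pos if cycle_pos <= 1.0 else 2.0 - cycle_pos
    return lr_low + (lr_high - lr_low) * scale
```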


Figure 2.21: Visualization of cyclical learning rate.

2.3.5 Validation

An important step when training a network is to validate its performance on unseen data, i.e. data that has not been used to train the network. This is called validation, and it measures how well a network generalizes to its task. The results from validation also indicate whether a network is underfitting or overfitting to the training data. A model is underfitting when the distance between the validation loss and the training loss is decreasing, and overfitting when the distance is increasing, as illustrated in Figure 2.22.

In Figure 2.22, the best model is achieved at the epoch marked by the dashed line, where the validation loss has its minimum. This is also the reason why validation is performed, since it can be used to determine how many epochs the model has to train in order to reach optimal performance.


Figure 2.22: An example of typical training and validation loss trends. The dashed line shows where the validation loss has its minimum.
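As a small illustration of how validation is used to pick the number of training epochs, the sketch below simply returns the index of the epoch with the lowest recorded validation loss; val_losses is assumed to be a list with one validation loss per epoch.

```python
def best_epoch(val_losses):
    # The epoch at which the validation loss is lowest, i.e. the dashed
    # line in Figure 2.22.
    return min(range(len(val_losses)), key=lambda e: val_losses[e])
```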

However, a model can also be underfitting when the training loss increases even though the validation loss is decreasing, see Figure 2.23.

Figure 2.23: Visualization of an underfitting example.

Usually when performing validation, a smaller subset of the dataset is set aside as the validation set and is not used for training and updating the weights of the network. There are many ways of choosing the validation set and how to use it, but mainly two validation methods are used: holdout validation and k-fold cross validation.

K-fold Cross Validation

When performing k-fold cross validation, the dataset is divided into k folds. The training set contains k − 1 folds and the validation set contains the remaining fold. The model is then trained and validated on each of these splits. When this is complete, the losses achieved for the individual splits are averaged, and this average is the final loss for that particular k-fold cross validation run [49]. How the dataset is split according to k-fold cross validation can be seen in Figure 2.24.

Figure 2.24: Visualization of how a dataset is split when performing 3-fold cross validation.

One of the advantages of this validation method is that the final loss is calculated based on several different subsets of the dataset. This means that the loss value is not biased towards any particular subset of the dataset.
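A minimal sketch of k-fold cross validation is given below, using scikit-learn's KFold to generate the splits; train_and_evaluate is a hypothetical function that trains a fresh model on one split and returns its validation loss.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(data, labels, train_and_evaluate, k=3):
    losses = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True).split(data):
        # Train on k-1 folds, validate on the remaining fold.
        losses.append(train_and_evaluate(data[train_idx], labels[train_idx],
                                         data[val_idx], labels[val_idx]))
    # The final loss is the average over all k splits.
    return float(np.mean(losses))
```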

Holdout Validation

In the holdout validation method, the dataset is divided into a training set and a validation set, where the validation set typically contains 10 % or 20 % of the total number of samples. The advantage of this method is that it is simple to implement, faster than k-fold cross validation and performs reasonably well. However, the validation loss has a high variance, since it depends entirely on the samples that were put into the validation set. Therefore, this method is best suited for validation when the dataset is large.
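A minimal sketch of a holdout split is shown below, assuming scikit-learn is available; the placeholder arrays stand in for the real images and masks, and 20 % of the samples are set aside for validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

images = np.random.rand(100, 128, 128, 3)            # placeholder image data
masks = np.random.randint(0, 2, (100, 128, 128, 1))  # placeholder masks

x_train, x_val, y_train, y_val = train_test_split(
    images, masks, test_size=0.2, random_state=42)
```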

2.3.6 Inference

When training is completed and an optimal model has been produced, the final step is to actually use the model. This step is called inference and is when the model is fed input and produces its predictions based on that input. In this step there is no ground truth for the output, so the output needs to be validated by a human, if that is desired.

2.3.7 Performance Metrics

A number of metrics were calculated from the segmentation and classification results in order to accurately measure and assess the performance of the neural network [61, 62]. The metrics were selected so that comparison with related work would be possible, and to allow for a nuanced understanding of the actual performance. Common terminology used in the metric definitions is listed below.
