
Linköpings universitet, SE–581 83 Linköping

Master's thesis, 30 ECTS | Electrical Engineering
2020 | LIU-IMT-TFK-A--20/581--SE

Evaluating Segmentation of MR Volumes Using Predictive Models and Machine Learning

Simon Kantedal

Supervisor: David Abramian
Examiner: Anders Eklund


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

A reliable evaluation system is essential for every automatic process. While techniques for automatic segmentation of images have been extensively researched in recent years, evaluation of the same has not received an equal amount of attention. Amra Medical AB has developed a system for automatic segmentation of magnetic resonance (MR) images of human bodies using an atlas-based approach. Through their software, Amra is able to derive body composition measurements, such as muscle and fat volumes, from the segmented MR images. As of now, the automatic segmentations are quality controlled by clinical experts to ensure their correctness. This thesis investigates the possibilities to leverage predictive modelling to reduce the need for a manual quality control (QC) step in an otherwise automatic process.

Two different regression approaches have been implemented as a part of this study: body composition measurement prediction (BCMP) and manual correction prediction (MCP). BCMP aims at predicting the derived body composition measurements and comparing the predictions to actual measurements. The theory is that large deviations between the predictions and the measurements signify an erroneously segmented sample. MCP instead tries to directly predict the amount of manual correction needed for each sample. Several regression models have been implemented and evaluated for the two approaches.

Comparison of the regression models shows that local linear regression (LLR) is the most performant model for both BCMP and MCP. The results show that the inaccuracies in the BCMP-models, in practice, render this approach useless. MCP proved to be a far more viable approach; using MCP together with LLR achieves a high true positive rate with a reasonably low false positive rate for several body composition measurements. These results suggest that the type of system developed in this thesis has the potential to reduce the need for manual inspections of the automatic segmentation masks.


Acknowledgments

I want to begin by expressing my gratitude towards my supervisors, David Abramian and Magnus Borga. The many, and often lengthy, conversations with you have been a great asset for this project and to me personally. With your knowledge and experience, more than one hurdle has been avoided. Thanks also to Anders Eklund for being the examiner of this thesis project. I also want to direct a thank you to the DevOps team at Amra. Your help in clearing any practical and technical obstacles related to this project has been much appreciated.

A special thanks to everyone supporting me privately through this education and thesis project. With your support, the workload has been bearable. Thanks to David Abramian, Ellinor Larsson and Oscar Sjöberg for helping me proofread this report.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
Notations

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations
  1.5 Limitations
  1.6 Related work
2 Theory
  2.1 Magnetic resonance imaging
  2.2 Atlas-based segmentation
  2.3 Dataset
    2.3.1 Body composition measurements
    2.3.2 Patient data
  2.4 Dataset analysis
    2.4.1 Arithmetic mean
    2.4.2 Standard deviation
    2.4.3 Multicollinearity
    2.4.4 Principal component analysis
  2.5 Multiple regression
  2.6 k-nearest neighbours regression
    2.6.1 Feature scaling
  2.7 Local linear regression (LLR)
  2.8 Neural network regression (NNR)
    2.8.1 Activation function
    2.8.2 Network architecture
    2.8.3 Learning algorithm
  2.9 Evaluation
    2.9.1 Mean squared error (MSE)
    2.9.2 Mean absolute error (MAE)
    2.9.3 Receiver operating characteristics (ROC) curves
    2.9.4 Repeatability coefficient
    2.9.5 Bland-Altman analysis
3 Method
  3.1 Data pre-processing
  3.2 Dataset analysis
    3.2.1 Distribution analysis
    3.2.2 Deviation analysis
    3.2.3 Collinearity analysis
    3.2.4 Principal component analysis
  3.3 Dataset subdivision
  3.4 Approaches to applying predictive models
    3.4.1 Body composition measurement prediction (BCMP)
    3.4.2 Manual correction prediction (MCP)
  3.5 Multiple regression
  3.6 k-nearest neighbours regression
  3.7 General regression neural network
  3.8 Local linear regression
  3.9 Evaluation metrics
    3.9.1 ROC curves
4 Results
  4.1 Dataset analysis
    4.1.1 Distribution analysis
    4.1.2 Manual correction analysis
    4.1.3 Collinearity analysis
  4.2 Multiple regression
    4.2.1 Effect of collinearity of input features
  4.3 k-nearest neighbours regression
    4.3.1 Optimisation of k
    4.3.2 Feature scaling
  4.4 Neural network regression
    4.4.1 Parameter optimisation
    4.4.2 Principal component analysis
    4.4.3 Effects of training on reduced PCA-basis
  4.5 Local linear regression
    4.5.1 Optimisation of k
  4.6 Model comparison
    4.6.1 Body composition measurement prediction
    4.6.2 Manual correction prediction
    4.6.3 Training time
  4.7 Evaluation of results
    4.7.1 Body composition measurement prediction
    4.7.2 Manual correction prediction
  5.3 Neural network optimisation
  5.4 Effects of dimensionality reduction using PCA
  5.5 Interpretation of ROC curves
  5.6 Comparison of approaches
  5.7 Method
    5.7.1 Source criticism
6 Conclusion
  6.1 Research questions
  6.2 Possible improvements
    6.2.1 Alternative features
    6.2.2 Feature handling
    6.2.3 Neural network improvements
Bibliography
A Analysis
  A.1 Dataset analysis
  A.2 Manual correction analysis
  A.3 Feature distributions
  A.4 Manual correction distributions
B Optimization
  B.1 Neural network optimization
C Results
  C.1 Body composition measurements prediction
  C.2 Manual correction prediction
  C.3 Method comparison MCP
  C.4 ROC-plots using BCMP


List of Figures

2.1 Water-only image constructed using Dixon imaging. Image source: UK Biobank.
2.2 Fat-only image constructed using Dixon imaging. Image source: UK Biobank.
2.3 Fat- and water images with segmentation masks overlaid. Image source: Amra Medical AB.
2.4 The two first PC:s overlaid on an elliptically shaped data cluster.
2.5 Non-linear model of a neuron.
2.6 ϕ(x) = 1/(1 + e^{-x})
2.7 ϕ(x) = tanh x
2.8 ϕ(x) = max(0, x)
2.9 The logistic function ϕ(x) = 1/(1 + e^{-x}) together with its derivative ϕ'(x)
2.10 The ReLU function ϕ(x) = max(0, x) together with its derivative ϕ'(x).
2.11 General example of a neural network architecture with l inputs, m neurons in one hidden layer and n outputs.
2.12 Comparison between training and validation loss.
2.13 Illustration of TP, TN, FP and FN.
2.14 Illustration of an ROC curve.
3.1 Excerpt from the CSV-file containing all patient data.
4.1 The full distribution for VAT.
4.2 The sex-separated distribution for VAT.
4.3 The full distribution for ASAT.
4.4 The sex-separated distribution for ASAT.
4.5 Sagittal intersection of an erroneous segmentation of the left posterior thigh. The correct mask is coloured blue and the erroneous parts of the automatic proposal are coloured red.
4.6 Variance inflation factor for every pair of features.
4.7 Pair plot of right and left posterior thigh muscle fat infiltration.
4.8 MAE_ffmv and MAE_fv for the k-NN model using varying k-values
4.9 Loss functions when training a neural network for BCMP with the (12, 100, 100)-architecture and a logistic activation function.
4.10 Fraction of variance explained by the principal components.
4.11 MAE_ffmv and MAE_fv for the local linear regression model using varying k-values
4.12 Average mean absolute errors when predicting the fat-free muscle volumes. The error bars express the standard deviation of the prediction errors.
4.13 Average mean absolute errors when predicting the fat volumes. The error bars express the standard deviation of the prediction errors.
4.14 BCMP of VAT using NNR for all female subjects.
4.15 BCMP of VAT using NNR for all female subjects.
4.16 Bland-Altman plot for BCMP of VAT using NNR for all female subjects.
4.17 Bland-Altman plot for BCMP of VAT using LLR for all female subjects.

4.21 MCP of VAT for all female subjects using LLR.
4.22 Model comparison when predicting VAT for all male subjects.
4.23 Model comparison when predicting VAT for all female subjects.
4.24 ROC curves for BCMP of LATFFMV using NNR.
4.25 ROC curves for BCMP of LATFFMV using LLR.
4.26 ROC curves for MCP of LATFFMV using NNR.
4.27 ROC curves for MCP of LATFFMV using LLR.
5.1 Comparison of the validation loss function when using the SGD- and Adam-optimizers and RATFFMV manual corrections as targets.
5.2 ROC-curve of LPTFFMV using MCP on all women using the repeatability coefficient as GT threshold.
5.3 ROC-curve of LPTFFMV using MCP on all women using half the repeatability coefficient as GT threshold.
5.4 TP, TN, FN and FP plotted against decision threshold.
5.5 TP, TN, FN and FP plotted against decision threshold, truncated x-axis.
5.6 ROC-curve when using local linear regression to predict errors in visceral adipose tissue volumes for all women.
5.7 ROC-curves for BCMP using LLR to predict all parameters for all men.
5.8 ROC-curves for MCP using LLR to predict all parameters for all men.

A.1 The full distribution for LATFFMV.
A.2 The sex-separated distribution for LATFFMV.
A.3 The full distribution for LPTFFMV.
A.4 The sex-separated distribution for LPTFFMV.
A.5 The full distribution for RATFFMV.
A.6 The sex-separated distribution for RATFFMV.
A.7 The full distribution for RPTFFMV.
A.8 The sex-separated distribution for RPTFFMV.
A.9 The full distribution for LATMFI.
A.10 The sex-separated distribution for LATMFI.
A.11 The full distribution for LPTMFI.
A.12 The sex-separated distribution for LPTMFI.
A.13 The full distribution for RATMFI.
A.14 The sex-separated distribution for RATMFI.
A.15 The full distribution for RPTMFI.
A.16 The sex-separated distribution for RPTMFI.
A.17 The full distribution for manual corrections of VAT.
A.18 The sex-separated distribution for manual corrections of VAT.
A.19 The full distribution for manual corrections of ASAT.
A.20 The sex-separated distribution for manual corrections of ASAT.
A.21 The full distribution for manual corrections of LATFFMV.
A.22 The sex-separated distribution for manual corrections of LATFFMV.
A.23 The full distribution for manual corrections of LPTFFMV.
A.24 The sex-separated distribution for manual corrections of LPTFFMV.
A.25 The full distribution for manual corrections of RATFFMV.
A.26 The sex-separated distribution for manual corrections of RATFFMV.
A.27 The full distribution for manual corrections of RPTFFMV.
A.28 The sex-separated distribution for manual corrections of RPTFFMV.
A.29 The full distribution for manual corrections of LATMFI.
A.30 The sex-separated distribution for manual corrections of LATMFI.
A.31 The full distribution for manual corrections of LPTMFI.
A.32 The sex-separated distribution for manual corrections of LPTMFI.
A.33 The full distribution for manual corrections of RATMFI.
A.34 The sex-separated distribution for manual corrections of RATMFI.
A.35 The full distribution for manual corrections of RPTMFI.
A.36 The sex-separated distribution for manual corrections of RPTMFI.

C.1 NNR, VAT, men
C.2 LLR, VAT, men
C.3 NNR, ASAT, men
C.4 LLR, ASAT, men
C.5 NNR, RATFFMV, men
C.6 LLR, RATFFMV, men
C.7 NNR, LATFFMV, men
C.8 LLR, LATFFMV, men
C.9 NNR, RPTFFMV, men
C.10 LLR, RPTFFMV, men
C.11 NNR, LPTFFMV, men
C.12 LLR, LPTFFMV, men
C.13 NNR, ASAT, women
C.14 LLR, ASAT, women
C.15 NNR, RATFFMV, women
C.16 LLR, RATFFMV, women
C.17 NNR, LATFFMV, women
C.18 LLR, LATFFMV, women
C.19 NNR, RPTFFMV, women
C.20 LLR, RPTFFMV, women
C.21 NNR, LPTFFMV, women
C.22 LLR, LPTFFMV, women
C.23 NNR, VAT, men
C.24 LLR, VAT, men
C.25 NNR, ASAT, men
C.26 LLR, ASAT, men
C.27 NNR, RATFFMV, men
C.28 LLR, RATFFMV, men
C.29 NNR, LATFFMV, men
C.30 LLR, LATFFMV, men
C.31 NNR, RPTFFMV, men
C.32 LLR, RPTFFMV, men
C.33 NNR, LPTFFMV, men
C.34 LLR, LPTFFMV, men
C.35 NNR, ASAT, women
C.36 LLR, ASAT, women
C.37 NNR, RATFFMV, women
C.38 LLR, RATFFMV, women
C.39 NNR, LATFFMV, women
C.40 LLR, LATFFMV, women
C.41 NNR, RPTFFMV, women
C.42 LLR, RPTFFMV, women
C.43 NNR, LPTFFMV, women
C.44 LLR, LPTFFMV, women
C.45 Prediction comparison of ASAT on all men.
C.46 Prediction comparison of ASAT on all women.
C.47 Prediction comparison of LATFFMV on all men.
C.48 Prediction comparison of LATFFMV on all women.
C.49 Prediction comparison of LPTFFMV on all men.
C.50 Prediction comparison of LPTFFMV on all women.
C.51 Prediction comparison of RATFFMV on all men.
C.52 Prediction comparison of RATFFMV on all women.
C.53 Prediction comparison of RPTFFMV on all men.
C.54 Prediction comparison of RPTFFMV on all women.
C.55 NNR, LPTFFMV, BCMP.
C.56 LLR, LPTFFMV, BCMP.
C.57 NNR, RATFFMV, BCMP.
C.58 LLR, RATFFMV, BCMP.
C.59 NNR, RPTFFMV, BCMP.
C.60 LLR, RPTFFMV, BCMP.
C.61 NNR, ASAT, BCMP.
C.62 LLR, ASAT, BCMP.
C.63 NNR, VAT, BCMP.
C.64 LLR, VAT, BCMP.
C.65 NNR, LPTFFMV, MCP.
C.66 LLR, LPTFFMV, MCP.
C.67 NNR, RATFFMV, MCP.
C.68 LLR, RATFFMV, MCP.
C.69 NNR, RPTFFMV, MCP.
C.70 LLR, RPTFFMV, MCP.
C.71 NNR, ASAT, MCP.
C.72 LLR, ASAT, MCP.
C.73 NNR, VAT, MCP.
C.74 LLR, VAT, MCP.


List of Tables

4.1 Distribution of sexes in the UK Biobank dataset.
4.2 Statistics for all male subjects in the UK Biobank dataset.
4.3 Statistics for all female subjects in the UK Biobank dataset.
4.4 Manual corrections made in QC for all male subjects in the UK Biobank dataset.
4.5 Manual corrections made in QC for all female subjects in the UK Biobank dataset.
4.6 Resulting MAE for BCMP using multiple regression.
4.7 Effects on MAE of keeping all, discarding or combining collinear input features when fitting a multiple regression model.
4.8 Resulting MAE:s for BCMP using k-nearest neighbours regression with k = 40.
4.9 Effect of scaling input features using VIF in k-NN regression.
4.10 Resulting MAE:s for BCMP using NNR with optimal parameters.
4.11 Resulting MAE:s for MCP using NNR with optimal parameters.
4.12 MAE results from neural network optimisation using BCMP for the (12, 100, 100)-architecture
4.13 Ratio of variance explained by the principal components.
4.14 Effects on MAE_ffmv and MAE_fv of training on a subset of PCA-features.
4.15 Resulting MAE:s from BCMP using LLR with optimal parameters.
4.16 Resulting MAE:s from MCP using LLR with optimal parameters.
4.17 Mean absolute errors when predicting the measurements.
4.18 MAE_ffmv and MAE_fv for MCP.
4.19 AUC-values for BCMP using NNR and LLR.
4.20 AUC-values for MCP using NNR and LLR.
A.1 Statistics for all samples in the UK Biobank dataset.
A.2 Corrections made in QC1 for all samples in the UK Biobank dataset.
B.1 Mean absolute error results from neural network optimization for the (12, 10, 10)-architecture

Notations

For the convenience of the readers, a list of commonly used acronyms is presented below.

Acronym   Meaning

AUC       area under curve
BCMP      body composition measurement prediction
IMT       Department of Biomedical Engineering
k-NN      k-nearest neighbours
LLR       local linear regression
MAE       mean absolute error
MCP       manual correction prediction
MFI       muscle fat infiltration
MRI       magnetic resonance imaging
MSE       mean squared error
NNR       neural network regression
PCA       principal component analysis
QC        quality control
ROC       receiver operating characteristic
SGD       stochastic gradient descent
VIF       variance inflation factor
ASAT      abdominal subcutaneous adipose tissue
VAT       visceral adipose tissue
LATFFMV   left anterior thigh fat-free muscle volume
LPTFFMV   left posterior thigh fat-free muscle volume
RATFFMV   right anterior thigh fat-free muscle volume
RPTFFMV   right posterior thigh fat-free muscle volume
LATMFI    left anterior thigh muscle fat infiltration
LPTMFI    left posterior thigh muscle fat infiltration
RATMFI    right anterior thigh muscle fat infiltration
RPTMFI    right posterior thigh muscle fat infiltration


1 | Introduction

This thesis project has been performed at Amra Medical AB, hereinafter referred to as Amra.

1.1 Motivation

Identifying the muscle and fat tissues in MR volumes of a patient's body enables analysis to be performed on the health status of the patient. The process of muscle and fat tissue identification is referred to as segmentation of the MR volume. Amra has developed a tool to automate this segmentation process using an atlas-based method. As with all automatic methods, evaluation of the performance is of importance. As of now, Amra includes a manual step in the segmentation pipeline. In this step, clinical experts, hereinafter referred to as analysis engineers, can adjust the segmentation masks presented by the automatic system. This manual inspection is referred to as quality control, QC in short. Using manual inspection to validate the results instills confidence in the algorithm. However, including a manual step in an otherwise automatic process is both time-consuming and costly. Furthermore, the manual work hours constitute a bottleneck when it comes to the expansion of Amra's business. For these reasons, it is imperative to reduce the need for manual inspection.

1.2 Aim

The broader aim of this thesis is to investigate the possibilities to automatically evaluate the quality of segmentations performed by Amra's system. This task is performed using a variety of different regression methods. Through this analysis, this thesis aims to assist in further automation of the segmentation pipeline at Amra.

1.3 Research questions

To clarify the aim of this thesis, three research questions have been formulated:

1. How well can a system predict the value for one patient feature given a number of other features?

2. How well can a system predict the amount of manual correction that will need to be performed on a patient feature?

3. Can any of the above mentioned approaches be used to predict when an automatic segmentation needs to be manually corrected?


The three above-mentioned research questions are the main focus of the thesis. In addition to answering these questions, the thesis also aims at providing an investigation of the following questions:

i How is the performance of the neural network regression models affected by input feature dimensionality reduction using principal component analysis?

ii How is the performance of the multiple regression model affected by removing or combining collinear input features?

1.4 Delimitations

There are countless approaches to solving the regression problem. This project is delimited to investigating the regression models stated in Chapter 2. The results from most of these models are dependent on one or more hyperparameters. This project is delimited to evaluating a limited number of hyperparameters. This is especially relevant for the neural network regression model, which can be constructed in a myriad of ways. A limited optimisation process is performed in Section 4.4.1 to ensure the parameter setup performs in a local optimum. The neural network would, however, most likely perform better in some other setup not evaluated.

Furthermore, not all regression models have been evaluated for the manual correction prediction, MCP. Only the two models with the most promising results from the body composition measurement prediction, BCMP, have been implemented to work with the second approach.

1.5 Limitations

This project is limited by the data that has been made available by Amra, which is sourced solely from the UK Biobank. The UK Biobank dataset is composed of data on 13496 British volunteers with ages ranging between 46 and 81. The homogeneity of this group makes it difficult to draw general conclusions from the results of this project. All MR scans in the dataset have been performed using a Siemens MAGNETOM Aera 1.5T scanner [22].

1.6 Related work

Compared to image segmentation techniques, segmentation evaluation is relatively unstudied. A limited number of papers and articles discussing this issue have been found during the research for this thesis. One example is "Evaluation of Image Segmentation Quality by Adaptive Ground Truth Composition" [33], presented by Bo Peng and Lei Zhang in 2012. This paper evaluates the segmentation in the image domain and is thus not directly comparable to this thesis report. Another similar example is "Evaluating Segmentation Error without Ground Truth" [26], presented by Timo Kohlberger et al., also in 2012. This paper investigates segmentation evaluation in the medical domain. The system described in this paper fetches the features from the images. In contrast, this thesis project uses features derived from volumes to predict segmentation quality.

In summary, the investigations performed in this thesis have been performed in other applications before. The approach used in this thesis is quite specific, however. Due to this specificity, direct comparisons with literature are hard to make.


2 | Theory

The underlying theory behind this thesis project is presented in this chapter. The foundation for this project is predictive analysis, which aims to capture the relationship among input features to predict one or more targets. When designing a predictive system, a large number of models can be considered. Sections 2.5 to 2.8 present the models that were chosen for this project.

2.1 Magnetic resonance imaging

Magnetic resonance imaging (MRI) is a non-invasive 3D imaging technique which has been very successfully applied to the investigation of the internal properties of living subjects. MRI uses magnets and coils to induce a strong magnetic field over the subject. The atomic nuclei in the subject absorb the electromagnetic radiation and re-emit radiation in the radio frequency range. The emitted signals can subsequently be analysed to create high-resolution images [34]. As MRI relies on magnetic fields to produce images, patients are not exposed to any ionising radiation, as opposed to other medical imaging techniques based on x-rays, e.g. computed tomography. Imaging is most often performed by exciting hydrogen nuclei. Hydrogen exhibits good MR sensitivity and is high in concentration in biological tissue, most notably in water and fat, which are the main sources of the signal [34, pp. 29]. The details on MRI are quite involved and are not covered in this thesis.

The analysis performed at Amra requires that the MR images are separated into their water and fat components. One technique to achieve this was presented in 1984 by W. Thomas Dixon [12]. The Dixon imaging technique is based on the chemical shift difference between water and fat. One image is acquired with the water and fat signals in phase and another image is acquired with the water and fat signals 180° out of phase. Dixon showed that the fat-only image and water-only image can be acquired using summation and subtraction as displayed in Equation (2.1) and Equation (2.2) [27].

Fat-only image = in-phase image − opposed-phase image  (2.1)

Water-only image = in-phase image + opposed-phase image  (2.2)

Examples of the water and fat images are displayed using a coronal intersection in Figures 2.1 and 2.2.
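As a minimal illustration of Equations (2.1) and (2.2), assuming the in-phase and opposed-phase volumes are already available as NumPy arrays (the array names and shapes below are hypothetical placeholders, not Amra's actual pipeline):

```python
import numpy as np

# Hypothetical in-phase and opposed-phase volumes, e.g. loaded from the scanner data.
in_phase = np.random.rand(64, 64, 64)
opposed_phase = np.random.rand(64, 64, 64)

# Equation (2.1): fat-only image from the difference of the two acquisitions.
fat_only = in_phase - opposed_phase

# Equation (2.2): water-only image from the sum of the two acquisitions.
water_only = in_phase + opposed_phase
```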


Figure 2.1: Water-only image constructed using Dixon imaging. Image source: UK Biobank.

Figure 2.2: Fat-only image constructed using Dixon imaging. Image source: UK Biobank.

2.2 Atlas-based segmentation

Atlas-based segmentation is a method used in many medical applications. This method relies on clinical experts manually labelling medical images per the requirements of the application. Applying the labels onto previously unseen data is performed by extrapolating the labelled samples to new ones. Amra uses a method with multi-atlas segmentation using whole-body water-fat MRI described in [2]. The steps of this method are very briefly summarised in Algorithm 1.

Algorithm 1: Atlas-based segmentation as performed by Amra

1: for each target x do
2:   Find the seven prototypes y1–y7 most similar to the target
3:   for y ∈ [y1, y7] do
4:     Apply non-rigid transformations to make the intensity image of y most similar to x
5:     Apply the corresponding transformation on the labels of y
6:   end for
7:   Compile resulting labels from each of the prototypes y1–y7 using a voting scheme
8:   Apply threshold on probability map to create final segmentation
9:   Combine masks with image information present in the target volume to obtain muscle volume


A target in this context is the unseen sample that is to be labelled, while a prototype is an already labelled sample. A proprietary similarity measure, called the signature vector, is used to evaluate the similarity between the target and the prototypes. The signature vector is made up of 20 floating-point numbers and is constructed by performing a dimensionality reduction of the fat image. Signature vectors can be compared using the Euclidean distance as a measure of similarity. This measure is not altered in this project and is therefore not more closely examined. The labels extrapolated from the seven prototypes with signature vectors most similar to the target's signature vector each get one vote on the final labelling. The votes are added in each voxel and can be interpreted as a probability map with values in the range [0, 1] after normalisation. A value of 1 means all prototypes classified the voxel as belonging to a certain class. A value of 0 means that none of the prototypes classified the voxel as belonging to the class. A threshold is then applied to decide the class belonging. This threshold decides how many prototypes need to agree on the classification.
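The voting and thresholding steps (lines 7-8 of Algorithm 1) can be sketched as follows. This is not Amra's implementation, only a minimal illustration assuming the seven transformed prototype label masks are available as binary NumPy arrays:

```python
import numpy as np

# Hypothetical binary label masks from the seven most similar prototypes,
# already transformed into the target's coordinate system.
prototype_masks = [np.random.rand(64, 64, 64) > 0.5 for _ in range(7)]

# Each prototype casts one vote per voxel; normalising by the number of
# prototypes gives a probability map with values in [0, 1].
probability_map = np.mean(prototype_masks, axis=0)

# The threshold decides how many prototypes need to agree, e.g. 4 out of 7.
final_mask = probability_map >= 4 / 7
```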

2.3 Dataset

The data needed to perform these assessments were provided by Amra and comprises samples from the UK Biobank dataset. UK Biobank is a national and international health resource providing health data from many thousands of participants to bona fide health researchers with the aim to improve the prevention, diagnosis, and treatment of several serious diseases [39]. The dataset contains a wealth of survey information on each participant. In addition to this, the dataset also contains whole-body MR-volumes of the participants. This combination of data allows for valuable conclusions to be drawn from correlations between, for instance, disease record and health data. Body composition analysis using Amra Profiler Research [1] has been performed on 13496 samples in this dataset, providing the foundation for this project.

2.3.1 Body composition measurements

There are a number of body composition measurements accessible from each sample in the dataset, namely:

• Left anterior thigh fat-free muscle volume (LATFFMV).
• Left posterior thigh fat-free muscle volume (LPTFFMV).
• Right anterior thigh fat-free muscle volume (RATFFMV).
• Right posterior thigh fat-free muscle volume (RPTFFMV).
• Abdominal subcutaneous adipose tissue (ASAT) volume.
• Visceral adipose tissue (VAT) volume.
• Left anterior thigh muscle fat infiltration (LATMFI).
• Left posterior thigh muscle fat infiltration (LPTMFI).
• Right anterior thigh muscle fat infiltration (RATMFI).
• Right posterior thigh muscle fat infiltration (RPTMFI).
• Liver fat (LF) fraction.


These body composition measurements have all been computed from MR-volumes using Amra's software. The MR-volumes are segmented into masks as described in Section 2.2. The water-only and fat-only images, overlaid with masks, are presented in Figure 2.3. The masks, together with fat-only or water-only MR-images created as described in Section 2.1, allow for quantitative computation of bodily measurements. These measurements are readily available in the deliverable form, which means after they have gone through the quality control stage. These measurements are also needed in their uncorrected form, i.e. as the automatic system proposes them. It is possible to obtain these corresponding measurements by reversing the work done by the analysis engineers at Amra. The liver fat fraction measurement is created using a region of interest, manually defined in the first quality control stage. Since this project is performed on samples not yet manually inspected, this measurement is not used.

Figure 2.3: Fat- and water images with segmentation masks overlaid. Image source: Amra Medical AB.

The muscle fat infiltration (MFI) measurements are computed from the same masks as the fat-free muscle volume measurements. The MFI measurements differ from the fat-free muscle volume measurements in how they are computed. The muscle fat infiltration fractions are computed by comparing the fat channel to the water channel inside regions labelled as muscles in the MR-volumes. This entails a couple of consequences, one being that an erroneous segmentation of the muscle mask does not necessarily affect the MFI-value.


2.3.2 Patient data

Some patient metadata is also available for each sample in the dataset. The following data is available:

• Sex.
• Age.
• Height.
• Weight.
• Patient ID.

The patient data age, height and weight are, together with the body composition measurements defined in Section 2.3.1, hereinafter referred to as features.

2.4 Dataset analysis

An important sub-task of this project is gaining sufficient knowledge about the dataset. A solid understanding of the dataset is imperative, both to be able to design predictive models efficiently and to analyse the results from said models. This section explains the mathematical theory behind the analysis methods applied to each feature in the dataset. The following characteristics have been computed for each feature: the minimum value, the maximum value, the arithmetic mean value (µ) and the standard deviation (σ). Plots describing the distribution of these features have also been produced. The distribution of each feature in the dataset has been computed and visualised. Every sample has also been inspected for completeness, making sure that it contains every expected measurement.

2.4.1 Arithmetic mean

The arithmetic mean, µ, of each feature is defined as:

\mu = \frac{1}{n} \sum_{i=1}^{n} y_i,  (2.3)

where n is the number of samples and y_i is the value of the feature for the i:th subject.

2.4.2 Standard deviation

The standard deviation, σ, of each feature is computed as:

\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \mu)^2}  (2.4)

2.4.3 Multicollinearity

Multicollinearity refers to the linear relationship between two or more input features. A high level of multicollinearity can lead to misleading results in regression tasks and is thus a property that should be investigated for the dataset at hand. One way to measure the multicollinearity of the input features is by using the variance inflation factor (VIF). As described in [24], the VIF is determined as:

\mathrm{VIF}_k = \frac{1}{1 - R_k^2},  (2.5)

where R_k^2 is the coefficient of determination obtained by regressing the k:th predictor on the remaining predictors. In other words, a linear regressor is fitted for each pair of features. The R^2-score of each feature prediction is used to compute the VIF as defined in Equation (2.5). As a rule of thumb, VIF-values above 4 warrant further investigation, and features with VIF-values exceeding 10 should be discarded or combined with the linearly dependent feature [10]. A method to validate the results from the VIF-analysis is to plot linearly dependent features against each other in a so-called pair plot. This way, the linear relationship can be visualised and validated through visual inspection.
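A concrete sketch of this VIF computation, using scikit-learn and a placeholder feature matrix (the function name and data are illustrative, not part of the thesis code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(X):
    """Compute Equation (2.5) for every column of the feature matrix X."""
    n_features = X.shape[1]
    vifs = []
    for k in range(n_features):
        others = np.delete(X, k, axis=1)  # the remaining predictors
        # R_k^2: coefficient of determination when regressing feature k on the rest.
        r2 = LinearRegression().fit(others, X[:, k]).score(others, X[:, k])
        vifs.append(1.0 / (1.0 - r2))     # VIF_k = 1 / (1 - R_k^2)
    return np.array(vifs)

# Example with random data; values above 4 would warrant further investigation.
X = np.random.rand(100, 5)
print(variance_inflation_factors(X))
```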

2.4.4 Principal component analysis

Principal component analysis (PCA), first introduced by Karl Pearson in [15], is a method used to transform a set of possibly linearly dependent features into a set of linearly independent principal components, hereinafter referred to as PC:s. The PC:s are linear combinations of the features composed so that they express the variance of the dataset optimally. The first PC is the linear combination of the features which gives the largest variance. The first PC describes the dimension in which the samples are maximally spread out. The second PC is the dimension orthogonal to the first PC with the largest variance. The third PC is the dimension orthogonal to both the first and the second PC with the largest variance, and so forth. Each of the n PC:s is constructed in this way, where n is the number of features in the dataset. Thus, the PC:s are ordered in terms of importance, i.e. the first PC describes the most variance in the dataset, while the n:th PC describes the least variance in the dataset. As an illustrative example, the PC:s of an elliptically shaped data cluster describing the features x_1 and x_2 are displayed in Figure 2.4.

[Figure 2.4: The two first PC:s, PC1 and PC2, overlaid on an elliptically shaped data cluster with features x_1 and x_2.]


The magnitude of the principal components relates to their importance in explaining the variance in the dataset. By comparing the magnitudes, it is possible to throw away principal components accounting for little variance. This way, principal component analysis can be used for reducing the dimensionality of the input feature space [6].
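A minimal sketch of this use of PCA with scikit-learn follows. The feature matrix is a stand-in, and the standardisation step is an added assumption here rather than something stated in this section:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrix: 1000 samples, 12 features.
X = np.random.rand(1000, 12)

# Standardise the features so that no single feature dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA and inspect how much variance each principal component explains,
# ordered from most to least important.
pca = PCA().fit(X_scaled)
print(pca.explained_variance_ratio_)

# Keep only enough components to explain, say, 95 % of the variance.
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)
```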

2.5 Multiple regression

Perhaps the most intuitive approach to use a priori knowledge to predict a feature is to linearly extrapo-late the trend of the data to a further data point. The case of using one dependent feature to predict one target feature using linear extrapolation is called simple regression. This method estimates a so-called regression line from patterns in the dataset. The regression line can be expressed as:

y = a + bx,  (2.6)

which can be used to compute the target variable, y, given a measured variable, x. The slope, b, and intercept, a, of the line are estimated by minimising the mean squared error (MSE) of the line in relation to the dataset. The MSE is calculated as:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i^2,  (2.7)

where n is the number of samples and ε_i is the distance between data point i and the regression line,

known as the residual.

Simple regression suffers from the obvious drawback that it is only capable of predicting one target feature given one input feature. For the application relevant for this project, the model needs to use information from multiple input features to predict the target feature. The extension of simple regression to include multiple features is called multiple regression. This extension naturally replaces the regression line with a linear hyperplane fitted to the features. The multiple regression model is:

Y = X\beta + \varepsilon  (2.8)

In Equation (2.8), the variable Y is an n × 1 vector of observable values. These are the values of the features which are being modelled in the regression, the target features. The variable X, on the other hand, is an n × p matrix of observable values, called the design matrix. This matrix contains the values of all input features to the equation. The variable β is a p × 1 vector of parameters. It is this parameter vector, β, which is estimated from the data. The final variable, ε, is an n × 1 vector of residuals. In these matrices, n is the number of observations, and p is the number of independent variables. This predictive model functions by finding the β-vector which most accurately describes the relation between X and Y. Once this vector is estimated, the dependent variable for a new sample, Y_i, can be computed as:

Y_i = X_i \beta + \varepsilon_i,  (2.9)

where X_i are the independent values of the sample.

The parameter vector β can be estimated using, for instance, the ordinary least squares estimator (OLS). This estimator can be expressed as:

\hat{\beta} = (X^T X)^{-1} X^T Y  (2.10)

Linear regression offers a learning model which is quick and easy to estimate and apply. The cost function is convex, meaning that a global optimum exists, which is seldom the case for more advanced, non-linear, models. One major drawback with this method is that it assumes a linear relationship between the input features and the target feature [16, pp. 18–28, 41–60].
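A minimal NumPy sketch of the OLS estimator in Equation (2.10), with placeholder data; the intercept column and the use of a linear solve instead of an explicit matrix inverse are implementation choices made here, not taken from the thesis:

```python
import numpy as np

# Placeholder data: n = 200 observations, p = 4 input features.
n, p = 200, 4
X = np.random.rand(n, p)
true_beta = np.array([1.5, -2.0, 0.5, 3.0])
Y = X @ true_beta + 0.1 * np.random.randn(n)

# Add a column of ones so the fitted hyperplane includes an intercept term.
X_design = np.column_stack([np.ones(n), X])

# Equation (2.10): beta_hat = (X^T X)^{-1} X^T Y,
# solved as a linear system for numerical stability.
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ Y)

# Prediction for a new sample, Equation (2.9) without the residual term.
x_new = np.concatenate([[1.0], np.random.rand(p)])
y_pred = x_new @ beta_hat
```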


2.6 k-nearest neighbours regression

The k-nearest neighbours (k-NN) algorithm is a non-parametric method used for regression and classification. Compared to parametric methods, such as linear regression, the k-NN algorithm does not make strong assumptions about the shape of the true regression function. This algorithm requires a known set of input-output mappings, D = {(y_1; x_1), ..., (y_n; x_n)}, where the values of all target features y_1, ..., y_n have been observed. The input feature vectors x_1, ..., x_n are composed of p values each. A target observation is a tuple (y_0; x_0) where the value y_0 is missing and x_0 has been observed. The k-nearest neighbours algorithm operates by determining the k training samples with input features most similar to those of the target observation. This similarity can be quantitatively measured using, for example, the Euclidean distance:

d_E(x_i, x_0) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{0j})^2},  (2.11)

where x_i = [x_{i1} \cdots x_{ip}]^T and x_0 = [x_{01} \cdots x_{0p}]^T.

Once the k neighbours are determined, the prediction for y_0 can be calculated as:

y_0 = \sum_{i=1}^{k} w_i \cdot y_i,  (2.12)

where w_i is the weight for sample i. The weights can be determined in a couple of different ways. The most straightforward approach is perhaps to give all k samples equal weight:

w_1 = w_2 = \cdots = w_k = \frac{1}{k}  (2.13)

However, this uniform weighting of the neighbours might not be sensible. The whole point of the k-nearest neighbours algorithm is that target samples can be determined from samples in their vicinity. It might thus be sensible to include the distance to the samples in the weights. This can be realised as:

w_i = \frac{1}{d_E(x_i, x_0)}  (2.14)

The weights would then need to be normalised as:

w_i = \frac{w_i}{\sum_{j=1}^{k} w_j}  (2.15)

It is important to note that different features have very different ranges. For example, the mean right posterior muscle fat infiltration is 0.12 (fraction) while the mean weight is 75.1 (kg) in this dataset. This difference in range implies that proper normalisation of the features is imperative before computing the distances. Without normalisation, the features with large value ranges would almost wholly determine the neighbourhood [36, pp. 279–287].
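The distance-weighted prediction in Equations (2.11)-(2.15) can be sketched in plain NumPy as below; the function and variable names are illustrative, and the choice k = 40 only mirrors the value later evaluated in Chapter 4:

```python
import numpy as np

def knn_regression(X_train, y_train, x0, k=40, weighted=True):
    """Predict y0 for a query point x0 from its k nearest neighbours."""
    # Equation (2.11): Euclidean distances from x0 to every training sample.
    distances = np.sqrt(np.sum((X_train - x0) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]

    if weighted:
        # Equations (2.14)-(2.15): inverse-distance weights, normalised to sum to one.
        w = 1.0 / (distances[nearest] + 1e-12)
        w = w / w.sum()
    else:
        # Equation (2.13): uniform weights.
        w = np.full(k, 1.0 / k)

    # Equation (2.12): weighted average of the neighbours' target values.
    return np.dot(w, y_train[nearest])

# Illustrative usage with random (already normalised) data.
X_train = np.random.rand(500, 12)
y_train = np.random.rand(500)
print(knn_regression(X_train, y_train, np.random.rand(12)))
```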

2.6.1 Feature scaling

There is reason to believe that the Euclidean distance between all input features might be a flawed metric for determining the similarity of samples. One might, for instance, expect that the weight is a more significant feature than age when predicting the subcutaneous fat volume. Some features might even confuse the system more than they contribute. Including age when predicting the subcutaneous fat volume of an older person might assign the sample elderly neighbours, even if there is no correlation between age and subcutaneous fat volume. One way of determining the feature scaling of a given feature parameter is to set it proportional to how well it explains the target parameter. As described in Section 2.4, one way to quantify linear relations between samples is the variance inflation factor. It is conceivable to use this factor as a scaling factor for the different features. This feature scaling strategy is empirically tested in Section 4.3.2.

2.7 Local linear regression (LLR)

Consider again the multiple regression model Y = Xβ + ε first introduced in Equation (2.8). If the variables X and Y are input features and target features for the full dataset, this model assumes a global linear structure of the dataset. Local linear regression (LLR) leverages the idea that the data might be accurately described with a linear model in small regions of the input feature space. By using the algorithm described in Section 2.6, it is possible to find a neighbourhood of k training samples on which the regression vector β can be fitted. In other words, this regression method uses a distance metric to find which part of the training data to use and then applies multiple regression only to this specific part of the dataset, as detailed in Section 2.5.
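Combining the two previous sections, a local linear regression prediction could be sketched as below. This is only an illustration of the idea with placeholder data, not the exact implementation used in the thesis:

```python
import numpy as np

def local_linear_regression(X_train, y_train, x0, k=40):
    """Fit an OLS model on the k nearest neighbours of x0 and predict y0."""
    # Find the neighbourhood of x0 using the Euclidean distance (Section 2.6).
    distances = np.sqrt(np.sum((X_train - x0) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]

    # Local design matrix with an intercept column (Section 2.5).
    X_local = np.column_stack([np.ones(k), X_train[nearest]])
    y_local = y_train[nearest]

    # Least-squares estimate of beta on the local neighbourhood, Equation (2.10).
    beta = np.linalg.lstsq(X_local, y_local, rcond=None)[0]

    # Predict for the query point.
    return np.concatenate([[1.0], x0]) @ beta

# Illustrative usage with random data.
X_train = np.random.rand(500, 12)
y_train = np.random.rand(500)
print(local_linear_regression(X_train, y_train, np.random.rand(12)))
```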

2.8 Neural network regression (NNR)

A neural network, more accurately named an artificial neural network (ANN), is a computational paradigm loosely inspired by the function of the animal brain. As Simon Haykin describes in [21, pp. 2], a neural network is a massively parallel distributed processor made up of simple processing units. These simple processing units, called artificial neurons, loosely model the biological neurons present in the animal brain. The neurons in an ANN are arranged in layers and interlinked across layers using a network of connections with weights. Through multiplication with these weights and addition with biases, associated with the neurons, the ANN transforms one or more inputs to one or more outputs. The weights and biases are gradually adjusted by training the network using inputs with known outputs. The training process of an ANN is described in Section 2.8.3. Once trained, features from an input sample with no known output can be fed through the network to produce an output, giving a non-linear mapping between input and output.

The artificial neuron is the crucial component of a neural network. An artificial neuron is a processing unit which inputs several features and produces an output. The components of an artificial neuron are the following:

1. A number of connections or edges. Each connection is characterised by a weight, w, which the corresponding input is multiplied with. A signal from neuron j to neuron k gives the input x_j, which is multiplied by the weight w_jk.

2. A summation function, Σ, used for addition of all inputs.

3. An activation function, ϕ, which introduces a non-linearity to the system. The activation function is more thoroughly described below.

4. A bias term, b_k. This term increases or lowers the input of the activation, depending on whether it is positive or negative.



Figure 2.5: Non-linear model of a neuron.

The neuron can be described mathematically as:

y_k = \varphi(u_k + b_k),  (2.16)

where

u_k = \sum_{j=1}^{m} w_{kj} x_j  (2.17)

2.8.1 Activation function

As mentioned earlier, the primary purpose of the activation function is to introduce a non-linearity to the network. Most activation functions also limit the output of each neuron. This mathematical activation function can be composed in a multitude of different ways; some of the most common are described below.

• Linear identity function — the simplest example of an activation function is the linear identity function:

\varphi(x) = x,  (2.18)

where the activation function just passes on the output of the neuron. Calculating the derivative of ϕ(x) with respect to its input x gives:

\frac{d\varphi(x)}{dx} = 1  (2.19)

As is evident from Equation (2.19), the derivative of the linear function is independent of the input variable x. This means that any error in the prediction can not be back-propagated using the gradient. Furthermore, if a linear activation function is used in the hidden layers, every layer can be expressed as a linear combination of the previous layer. This way, no matter how many layers are used, the final layer would be a linear function of the input to the first layer.

• Logistic function — this characteristic sigmoid-shaped function is one of the most widely used activation functions today. The logistic function is a strictly increasing function balancing between linear and non-linear behaviour. The logistic function is mathematically defined as:

\varphi(x) = \frac{1}{1 + e^{-x}}  (2.20)

The logistic function, illustrated in Figure 2.6, provides some attractive properties as an activation function. Firstly, the output from this function is bounded in the range [0, 1]. This prevents the neuron's output from becoming excessively large. The logistic function is continuously differentiable, which is important for the training of neural networks. However, it also suffers from several drawbacks, the most significant being the so-called vanishing gradient problem, more thoroughly explained below [31].

• Hyperbolic tangent function — another sigmoid-shaped activation function is the hyperbolic tangent function, defined as:

\varphi(x) = \tanh x = \frac{e^x - e^{-x}}{e^x + e^{-x}}  (2.21)

This function shares many characteristics with the aforementioned logistic function. As can be seen in Figure 2.7, the hyperbolic tangent function's range is [-1, 1], compared to [0, 1] for the logistic function. Like the logistic function, the hyperbolic tangent function suffers from vanishing gradients. Compared to the logistic function, the hyperbolic tangent function has been empirically proved to provide faster convergence [23], due to it being zero-centered [21, pp. 12-15].

• Rectified linear unit (ReLU) — the ReLU activation function was first introduced by Nair and Hinton [30] in 2010 and has since become the most widely used activation function for deep neural networks. This function consists of a linear function thresholded at zero:

\varphi(x) = \max(0, x) = \begin{cases} x, & \text{if } x \geq 0 \\ 0, & \text{if } x < 0 \end{cases}  (2.22)

By being linear for all positive values, the ReLU overcomes the problems with vanishing gradients experienced with the logistic and hyperbolic tangent functions [31]. Furthermore, simple mathematical operations make the ReLU activation function less computationally expensive than its sigmoid counterparts.

[Figure 2.6: ϕ(x) = 1/(1 + e^{-x}).]
[Figure 2.7: ϕ(x) = tanh x.]
[Figure 2.8: ϕ(x) = max(0, x).]

Vanishing gradient problem

As previously mentioned, the sigmoid-shaped activation functions suffer from what is known as the vanishing gradient problem. This issue means that gradients tend to become diminishingly small for large positive and negative input values. This effect is best understood by investigating the derivative of the logistic function:

\varphi(x) = \frac{1}{1 + e^{-x}} \implies \frac{d\varphi(x)}{dx} = \frac{e^x}{(1 + e^x)^2}  (2.23)

As can be seen in Figure 2.9, there is only a narrow range of inputs, x, which result in a derivative, ϕ'(x), meaningfully larger than zero. In fact, most inputs, x, cause the gradients to vanish, hence the name vanishing gradients. This can be compared to the ReLU activation function displayed in Figure 2.10, where it can be seen that ϕ'(x) = 1 for all x > 0. The probability of the input being in the desired range decreases with the number of hidden layers, which is why this problem is most prominent for deep neural networks.

[Figure 2.9: The logistic function ϕ(x) = 1/(1 + e^{-x}) together with its derivative ϕ'(x).]

[Figure 2.10: The ReLU function ϕ(x) = max(0, x) together with its derivative ϕ'(x).]
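A small numeric illustration of the difference between the two derivatives (the input values below are arbitrary):

```python
import numpy as np

def logistic_grad(x):
    # Derivative of the logistic function, Equation (2.23).
    return np.exp(x) / (1.0 + np.exp(x)) ** 2

def relu_grad(x):
    # Derivative of the ReLU: 1 for x > 0, 0 otherwise.
    return (x > 0).astype(float)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(logistic_grad(x))  # ~0 for large |x|: the gradient vanishes
print(relu_grad(x))      # stays 1 for all positive inputs
```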

2.8.2 Network architecture

As mentioned earlier, it is the interconnection of individual artificial neurons in a specific layered structure that forms a neural network. The number of layers and the number of neurons in each layer is referred to as the architecture of the neural network. This architecture is an important factor in the learning process of the network. The architecture of a neural network can be designed in a near-infinite number of ways. Finding the architecture yielding the best results is an interesting optimisation problem in itself [18]. This is, however, not the focus of this thesis project. In this project, only single- and multilayered feed-forward networks are utilised. This means that the information flows sequentially from input to output. Furthermore, the layers are assumed to be fully connected, meaning that every neuron in a given layer is connected to every neuron in the previous and next layer. Given these limitations, two main hyperparameters may be altered in the network, namely the width and the depth. The width refers to the number of neurons in each hidden layer, and the depth refers to the number of hidden layers. Figure 2.11 depicts a neural network architecture with l inputs, m neurons in one hidden layer, and n outputs. In this project, l = 12 and n = 1, since 12 input features are used to predict one target feature.


Figure 2.11: General example of a neural network architecture with l inputs, m neurons in one hidden layer and n outputs.
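Since the thesis works with Keras (its SGD implementation is mentioned in Section 2.8.3), a fully-connected feed-forward regression network of this kind could be declared as below. The layer sizes mirror the (12, 100, 100) architecture discussed in Chapter 4, but the activation choice and the rest of the configuration are assumptions for illustration, not the exact model definition used in the thesis:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Feed-forward regression network: 12 input features,
# two fully-connected hidden layers of 100 neurons each, one output neuron.
model = keras.Sequential([
    keras.Input(shape=(12,)),
    layers.Dense(100, activation="sigmoid"),  # logistic activation in the hidden layers
    layers.Dense(100, activation="sigmoid"),
    layers.Dense(1, activation="linear"),     # linear output for regression
])

# Mean squared error loss (Equation (2.24)); the optimizer choice is discussed in Section 2.8.3.
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.summary()
```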

2.8.3 Learning algorithm

The attractive property of an artificial neural network is its ability to adapt to the environment without explicitly defining how to do this. This adaptation is performed by iteratively adjusting the synaptic weights and biases of the network to achieve better performance. This iterative process is often referred to as the network learning from its environment. The learning is performed without manual interference. Even so, the network needs a well-defined set of rules on how to perform this learning. This set of rules is often known as a learning algorithm. Learning algorithms can be divided into two main learning paradigms:

• Supervised learning — learning when there are ground truth input → output mappings available.
• Unsupervised learning — learning when there are no input → output mappings available.

Since only supervised learning is used in this project, learning hereinafter refers to supervised learning. The neural network learns by adapting its weights and biases in a way that minimises the error between the predicted output and the true output found in the ground truth. This error is computed using the loss function. The loss function takes the predicted output and the true output as parameters and produces a loss. The most commonly used loss function for regression neural networks is the mean squared error (MSE) loss function, which is also used for this project. The MSE loss is defined as:

\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (Y_i - Y_{i,\mathrm{pred}})^2,  (2.24)

where N is the total number of samples, Y_i is the correct output for sample i and Y_{i,\mathrm{pred}} is the prediction for sample i.


The back-propagation algorithm, first introduced by Rumelhart et al. in [14], propagates the information about the error backwards in the network, repeatedly adjusting the weights and biases of each neuron to minimise the loss. One pass over the total dataset is known as an epoch. By performing multiple epochs, the network is able to gradually improve its performance.

The process of adjusting the weights and biases to minimise the loss in the neural network is known as optimisation. This process is analogous to searching for the global optimum of the loss function. The naive approach to optimising this loss function is to take fixed-length steps in the direction of the negative gradient for each iteration. This approach would theoretically land the function in an optimum. Using this approach, however, the solution would almost certainly land in a local optimum of the loss function. To solve this problem, and improve the convergence of the training, the machine learning community has developed a large number of alternative optimizers. The three commonly used optimizers stochastic gradient descent (SGD), root mean square propagation (RMSProp), and Adam will be investigated and tested in this project.

Gradient descent

All three optimizers used in this project are based on gradient descent. Gradient descent is the most straightforward approach to optimise the weights in a neural network. We define the cost function J(w), where w is the vector of weights. The optimisation problem can be described as minimising the cost function J(w) with respect to the weight vector w. Optimality is reached when:

\nabla J(w) = 0,  (2.25)

where

\nabla J(w) = \left[ \frac{\partial J}{\partial w_1}, \frac{\partial J}{\partial w_2}, \cdots, \frac{\partial J}{\partial w_m} \right]^T  (2.26)

It is important to note that there are often many local optima of a loss function, but only one global optimum.

Gradient descent is an iterative method aimed at minimising the loss function. In each iteration, adjustments are applied to the weight vector in the direction of the steepest descent, i.e. in the opposite direction of the gradient vector ∇J(w). The updates to the weight vector can be described as:

\Delta w(n+1) = -\eta \nabla J(w(n)),  (2.27)

where η is known as the learning rate [21, pp. 121-124].
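Expressed in code, one gradient-descent iteration from Equation (2.27) is simply the following; the cost function and its gradient below are stand-ins chosen for illustration:

```python
import numpy as np

def gradient_descent_step(w, grad_J, eta=0.1):
    """One update of the weight vector: w <- w - eta * grad J(w), Equation (2.27)."""
    return w - eta * grad_J(w)

# Stand-in example: minimise J(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    w = gradient_descent_step(w, lambda v: 2.0 * v)
print(w)  # approaches the global optimum at the origin
```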

Stochastic gradient descent

Stochastic gradient descent, abbreviated as SGD, is a variation of the standard gradient descent optimisation method. SGD computes the error and updates the weight vector for each new sample. This means that the gradient vector is estimated from every new sample rather than calculated from the full dataset. This approach significantly reduces the computational burden of the optimisation method. An alternative to estimating the gradient vector for each single sample is to estimate it as the average over a randomly chosen subset of the samples. This special case is known as mini-batch gradient descent. However, the terms are highly interchangeable, and often when a neural network is said to be optimised using SGD, in reality, mini-batch gradient descent has been utilised. As an example, whenever the batch size is set larger than 1, the Keras implementation of SGD is, in reality, a mini-batch gradient descent implementation [4].


The SGD method can be expanded to include a momentum term. Sutskever et al. describe in [37] how this term can be used to accelerate the gradient descent. Using momentum, the updates to the weight vector can be expressed as a linear combination of the gradient and the previous update:

\[
\Delta \mathbf{w}(n+1) = \mu\, \Delta \mathbf{w}(n) - \eta \nabla J(\mathbf{w}(n)), \qquad (2.28)
\]

where $\mu \in [0, 1]$ is known as the momentum coefficient.
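A minimal NumPy sketch of the momentum update in equation 2.28 follows; grad_J is again an illustrative stand-in for the gradient of the cost function.

import numpy as np

def momentum_step(w, delta_w_prev, grad_J, learning_rate=0.01, momentum=0.9):
    # Combine the previous update with the current negative gradient (equation 2.28).
    delta_w = momentum * delta_w_prev - learning_rate * grad_J(w)
    return w + delta_w, delta_w

# Example: minimise J(w) = ||w||^2 (gradient 2w).
w, delta_w = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, delta_w = momentum_step(w, delta_w, lambda w: 2.0 * w, learning_rate=0.1)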

RMSProp

RMSProp is a neural network optimisation algorithm first introduced by Geoffrey Hinton et al. in the online course "Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization" provided by Coursera [9]. Note that this algorithm has never been published and thus there are no peer-reviewed sources available. Nevertheless, this is one of the most widely used optimisation algorithms in practice.

RMSProp builds on the idea that the learning rate should not necessarily be the same for all weights and biases, and is thus labelled an adaptive learning rate optimisation method. The algorithm achieves this by dividing each gradient by a running average of its recent magnitude. In every iteration, the weight at index i is updated as:

\[
w_i = w_{i-1} - \frac{\eta}{\sqrt{S_{dw_i}}}\, dw_i, \qquad (2.29)
\]

where $\eta$ is the learning rate and the derivative is defined as the partial derivative of the cost function, $dw_i = \frac{\partial J}{\partial w_i}$. The step length $S_{dw_i}$ is in turn computed as:

\[
S_{dw_i} = \beta S_{dw_{i-1}} + (1 - \beta)\, dw_i^2, \qquad (2.30)
\]

where the hyperparameter β decides the impact of the moving average term. This update is performed independently for all weights $w_1, w_2, \dots, w_n$, where $n$ is the total number of weights. The update scheme for the biases is equivalently:

\[
b_i = b_{i-1} - \frac{\eta}{\sqrt{S_{db_i}}}\, db_i, \qquad (2.31)
\]
\[
S_{db_i} = \beta S_{db_{i-1}} + (1 - \beta)\, db_i^2. \qquad (2.32)
\]

The biases $b_1, b_2, \dots, b_n$ are independently updated in the same way as the weights $w_1, w_2, \dots, w_n$.

RMSProp tries to achieve faster convergence of relevant weights and biases, whilst dampening oscillations in irrelevant weights and biases. In other words, it aims to provide a faster, more stable convergence of the loss function [9].
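A minimal NumPy sketch of the RMSProp update in equations 2.29 and 2.30 is given below; a small constant eps is added to the denominator for numerical stability, which is an assumption not stated in the equations above.

import numpy as np

def rmsprop_step(w, s, grad, learning_rate=0.001, beta=0.9, eps=1e-8):
    # Scale each gradient by a running average of its recent magnitude.
    s = beta * s + (1.0 - beta) * grad ** 2            # equation 2.30
    w = w - learning_rate * grad / (np.sqrt(s) + eps)  # equation 2.29
    return w, s

# Example: minimise J(w) = ||w||^2 (gradient 2w).
w, s = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    w, s = rmsprop_step(w, s, 2.0 * w, learning_rate=0.01)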

Adam

The Adam optimisation algorithm was first introduced by Diederik P. Kingma and Jimmy Lei Ba in their 2015 paper "Adam: A Method for Stochastic Optimization" [25]. The paper has since then been cited more than 40000 times, and Adam is now one of the most widely used optimisation algorithms in the machine learning community. Adam is designed to combine the advantages of the two optimization methods: "AdaGrad" [13] and "RMSProp" [9]. The Adam algorithm aims to achieve a computationally efficient optimisation invariant to diagonal rescalings of the gradients.

The algorithm for the Adam optimizer is presented using pseudo-code in Algorithm 2. Adam utilises a moving average approach for the first two moments of the gradients. The first moment is the mean, and the second moment is the uncentered variance. The decay rates for the moving average terms are controlled by the hyperparameters $\beta_1$ and $\beta_2$.

Algorithm 2: Adam optimisation algorithm, recreated from [25].

Require:
    α: Stepsize.
    β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates.
    f(θ): Stochastic objective function with parameters θ.
    θ0: Initial parameter vector.

    m0 ← 0 (initialise 1st moment vector)
    v0 ← 0 (initialise 2nd moment vector)
    t ← 0 (initialise timestep)
    while θt not converged do
        t ← t + 1
        gt ← ∇θ ft(θt−1) (get gradients w.r.t. stochastic objective at timestep t)
        mt ← β1 · mt−1 + (1 − β1) · gt (update biased first moment estimate)
        vt ← β2 · vt−1 + (1 − β2) · gt² (update biased second raw moment estimate)
        m̂t ← mt / (1 − β1^t) (compute bias-corrected first moment estimate)
        v̂t ← vt / (1 − β2^t) (compute bias-corrected second raw moment estimate)
        θt ← θt−1 − α · m̂t / (√v̂t + ε) (update parameters)
    end while
    return θt (resulting parameters)
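To make the pseudo-code concrete, a minimal NumPy sketch of one Adam update step is given below; the hyperparameter defaults follow common practice and are not necessarily the values used in this project.

import numpy as np

def adam_step(theta, m, v, t, grad, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One parameter update according to Algorithm 2.
    t += 1
    m = beta1 * m + (1.0 - beta1) * grad        # biased first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # biased second raw moment estimate
    m_hat = m / (1.0 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)              # bias-corrected second raw moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, t

# Example: minimise J(theta) = ||theta||^2 (gradient 2*theta).
theta, m, v, t = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2), 0
for _ in range(2000):
    theta, m, v, t = adam_step(theta, m, v, t, 2.0 * theta, alpha=0.01)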

2.8.4 Dataset subdivision

The dataset has to be subdivided into training, validation and testing data to make the learning robust. First, a fraction of the dataset should be reserved for testing. This testing dataset is solely used to assess the performance of the prediction methods, and is thus not used in training. Reserving a random subset of the dataset before any training is performed makes it possible to test how well the learning methods generalise. The performance of a learning algorithm generally improves with a larger training set, so one might be tempted to make the testing set as small as possible. However, to be able to draw meaningful conclusions about the performance, the test set needs to be representative of the full dataset. These requirements make the choice of test ratio a balancing act in which the size and complexity of the dataset are essential factors.
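A minimal sketch of reserving a random test subset with scikit-learn is shown below; the 20% test ratio and the synthetic arrays are placeholders, not the data or ratio used in this project.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)   # placeholder feature matrix
y = np.random.rand(1000)       # placeholder targets

# Reserve a random 20% of the samples for testing before any training is performed.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)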

When training a neural network, the remaining data should be split into training and validation data. For each iteration, the loss is computed on both the training and validation data. However, only the training data loss is used to update the weights and biases in the neural network. This way, the training loss can be compared to the validation loss as a measure to prevent overfitting to the training dataset [35].

An illustration of the loss computed from training and validation data is presented in figure 2.12. An increase in validation loss, while the training loss is constant or decreasing, is a sign of overfitting. By monitoring the validation loss, it is possible to abort the training if the loss curves start to diverge.
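A minimal Keras sketch of monitoring the validation loss and stopping the training early when it starts to diverge is given below; the network architecture, the synthetic data and the patience value are placeholders rather than the setup used in this project.

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

X = np.random.rand(1000, 10)   # placeholder features
y = np.random.rand(1000)       # placeholder targets

model = Sequential([Dense(16, activation='relu', input_shape=(10,)), Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Abort training when the validation loss has not improved for 5 consecutive epochs.
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stopping], verbose=0)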


Figure 2.12: Comparison between training and validation loss (loss plotted against training epoch).

2.9 Evaluation

To aid in the evaluation of the different prediction methods, this section presents metrics and methods to be used in the comparisons.

2.9.1 Mean squared error (MSE)

One straightforward approach to measure the performance of a regression method is to compute the mean squared errors between the predicted values and the true values, defined as:

\[
\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( Y_i - Y_{i,\text{pred}} \right)^2, \qquad (2.33)
\]

where $N$ is the total number of samples, $Y_i$ is the true value of sample $i$ and $Y_{i,\text{pred}}$ is the corresponding predicted value. Note that this is the same metric as the loss function defined in equation 2.24. Comparing the squared errors removes the risk that positive and negative errors cancel each other out. A natural consequence of using squared errors is that large errors are emphasised.

2.9.2 Mean absolute error (MAE)

Another way to quantify the prediction errors is to use the mean absolute error, defined as:

\[
\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| Y_i - Y_{i,\text{pred}} \right|, \qquad (2.34)
\]

where $N$ is the total number of samples, $Y_i$ is the true value of sample $i$ and $Y_{i,\text{pred}}$ is the corresponding predicted value, as above. This metric also removes the risk of positive and negative errors cancelling each other out. Using the absolute error, rather than the squared error, reduces the difference between large and small errors, which can be desirable when a few large errors should not dominate the metric.
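Both metrics are readily available in scikit-learn; a minimal sketch with placeholder values is given below.

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([2.1, 0.5, 3.3, 1.8])
y_pred = np.array([2.0, 0.9, 3.0, 2.2])

mse = mean_squared_error(y_true, y_pred)   # emphasises large errors
mae = mean_absolute_error(y_true, y_pred)  # treats all errors linearly
print(mse, mae)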


2.9.3 Receiver operating characteristics (ROC) curves

The problem discussed in this thesis is fundamentally about binary classification. The aim is to create a system capable of classifying segmentations as either correct or incorrect. As detailed in Section 1.2, this is performed by using regression techniques to predict segmentation errors. Thresholding transforms the floating-point predicted errors into binary classes. One intuitive way to quantify the performance of a binary classifier is to use accuracy, i.e. the proportion of samples that the classifier labels correctly. Accuracy proves to be a poor performance measure for unbalanced datasets, however. An illustrative example is a dataset where 98% of all samples are negative. On this dataset, a classifier that simply labels all samples as negative would achieve an accuracy of 98%. Clearly, we need a more sophisticated performance measure to characterise our classifiers. Before continuing, a few concepts need to be introduced:

• True positive (TP) — a true positive is a sample which has correctly been classified as incorrectly segmented.

• True negative (TN) — a true negative is a sample which has correctly been classified as correctly segmented.

• False positive (FP) — a false positive is a sample which has incorrectly been classified as incorrectly segmented.

• False negative (FN) — a false negative is a sample which has incorrectly been classified as correctly segmented.

• False positive rate (FPR) — FPR = (No. false positive decisions) / (No. actually negative cases)

• True positive rate (TPR) — TPR = (No. true positive decisions) / (No. actually positive cases)

Figure 2.13: Illustration of TP, TN, FP and FN.

An illustration of TP, TN, FP and FN is provided in Figure 2.13. Naturally, the above metrics are dependent on the chosen threshold. A low threshold generates many true positives, but also many false positives. A high threshold has the opposite effect. Which of these metrics is the most important to optimise is very case-dependent. It is easy to realise that it is of high importance to maximise the true positive rate in, for instance, medical screening for a serious disease. In other applications, it might be more important to minimise the false positive rate. The receiver operating characteristics (ROC) curve introduces a performance measure independent of the classification threshold. An ROC curve is constructed by plotting the TPR against the FPR for a wide range of decision thresholds. The area under the resulting curve (AUC) can be interpreted as a metric of the classifier's performance [29]. An illustration of an ROC curve is presented in Figure 2.14.

Figure 2.14: Illustration of an ROC curve (true positive rate plotted against false positive rate; AUC = 0.776), with the chance line shown for reference.
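A minimal scikit-learn sketch of constructing an ROC curve and computing the AUC from thresholded error predictions is given below; the labels and scores are placeholder values, not results from this project.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                    # 1 = incorrectly segmented
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7])   # predicted segmentation errors

fpr, tpr, thresholds = roc_curve(y_true, scores)  # FPR and TPR over a range of thresholds
auc = roc_auc_score(y_true, scores)
print(auc)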

2.9.4 Repeatability coefficient

The Joint Committee for Guides in Metrology defined repeatability in 2008 as "closeness of the agreement between the results of successive measurements of the same measurand carried out under the same conditions of measurement" [19]. In other words, subsequent measurements performed under the same conditions with the same equipment will likely produce an error smaller than the repeatability coefficient. The following conditions apply for the use of repeatability:

• The same measurement procedure.

• The same observer.

• The same measuring instrument, used under the same conditions.

• The same location.

• Repetition over a short period of time.

The repeatability coefficient can thus be seen as the error margin of the measurement equipment under current circumstances.

References
