Assessment of lung damages from CT images using machine learning methods.

Bedömning av lungskador från CT-bilder med maskininlärningsmetoder.

QUENTIN CHOMETON

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH


Abstract

Lung cancer is the most commonly diagnosed cancer in the world and its finding is mainly incidental. New technologies, and more specifically artificial intelligence, have lately attracted great interest in the medical field, as they can automate tasks or bring new information to the medical staff.

Much research has been done on the detection or classification of lung cancer. These works operate on local regions of interest, and only a few of them look at a full CT scan. The aim of this thesis was to assess lung damages from CT images using new machine learning methods.

First, single-factor predictors were learned by a 3D ResNet architecture for cancer, emphysema and opacities. Emphysema was learned by the network, reaching an AUC of 0.79, whereas cancer and opacity predictions were not much better than chance (AUC = 0.61 and AUC = 0.61).

Secondly, a multi-task network was used to predict all the factors together.

A training with no prior knowledge and a transfer learning approach using self-supervision were compared. The transfer learning approach showed results in the multi-task setting similar to the single-factor ones, with AUC = 0.78 for emphysema (vs 0.60 without pre-training) and AUC = 0.61 for opacities. Moreover, using the pre-training approach enabled the network to reach the same performance as each single-factor predictor but with only one multi-task network, which saves a lot of computational time.

Finally, a risk score can be derived from the training in order to use this information in a clinical context.

KEYWORDS: Deep Learning, Artificial Neural Networks, Lung damages, CT scans, Multi-task learning, Transfer learning


Acknowledgment

First and foremost, I would like to thank Carlos Arteta, my supervisor and guide during my project at Optellum Ltd. He helped me a lot with my work by bringing new ideas and always took time to answer all my questions.

I would also like to thank the whole Optellum team for their welcome and their help. I learned a lot by being part of this team and having the opportunity to discuss with them.

I would like to thank Dr. Chunliang Wang, my supervisor at KTH university and Dr. Dmitry Grishenkov for guiding us through this master thesis project.

I would also like to thank Antoine Broyelle, my co-intern at Optellum. All our discussions and his ideas helped my project move forward, and all our lunch times made the British weather bearable.

I would also like to thank Lottie Woodward, who had the incredible patience to proofread a report written in French style.

The author thanks the National Cancer Institute for access to NCI's data collected by the National Lung Screening Trial (NLST). The statements contained herein are solely those of the authors and do not represent or imply concurrence or endorsement by NCI.


Nomenclature

AUC   Area Under the Curve
CNN   Convolutional Neural Network
CT    Computed Tomography
DL    Deep Learning
HU    Hounsfield Unit
LR    Learning Rate
ML    Machine Learning
NLST  National Lung Screening Trial
ROC   Receiver Operating Characteristic


Contents

1 Introduction
2 Materials and methods
  2.1 Clinical Data
  2.2 Image Annotations
    2.2.1 Emphysema
    2.2.2 Opacities
    2.2.3 Consolidation
    2.2.4 Cancer
  2.3 Preprocessing
    2.3.1 Normalization
    2.3.2 Data Augmentation
  2.4 Evaluation metrics
    2.4.1 Loss and Accuracy
    2.4.2 ROC and confusion matrix
  2.5 Experiments
    2.5.1 Single factor prediction
      2.5.1.1 Network
      2.5.1.2 Training
    2.5.2 Combination of factors
      2.5.2.1 Network
      2.5.2.2 Full training
      2.5.2.3 Self-supervision
      2.5.2.4 Training with Pre-training
3 Results
  3.1 Single factor prediction
    3.1.1 Cancer
    3.1.2 Emphysema
    3.1.3 Opacities
  3.2 Combining factors
    3.2.1 Training with Pre-training
    3.2.2 Risk score
4 Discussion
  4.1 Main findings
  4.2 General impact
  4.3 Comparison to other methods
  4.4 Limitations
  4.5 Future work
5 Conclusion
References
A State of the Art
  A.1 Clinical Background
    A.1.1 Lung Cancer
    A.1.2 Nodules
    A.1.3 Risk factors for nodule malignancy
      A.1.3.1 Nodule size [21]
      A.1.3.2 Nodule Morphology [15]
      A.1.3.3 Nodule Location [21]
      A.1.3.4 Multiplicity [21]
      A.1.3.5 Growth Rate [21]
      A.1.3.6 Age, Sex, Race [21]
      A.1.3.7 Tobacco [21]
    A.1.4 Problem of detection
  A.2 Engineering Background
    A.2.1 Deep Learning
    A.2.2 How does a deep learning network learn?
    A.2.3 Transfer learning
      A.2.3.1 Feature Extractor
      A.2.3.2 Fine-tuning
    A.2.4 Challenges with deep learning
      A.2.4.1 Architecture selection in transfer learning
      A.2.4.2 Number and quality of data: pre-processing
      A.2.4.3 Time-processing (CPU and GPU)
      A.2.4.4 Overfitting
      A.2.4.5 Performance evaluation


1 Introduction

The incidence of new cancer cases was around 454.8 per 100,000 people per year in 2016 [14]. Additionally, lung cancer represents around 20% of deaths due to cancer. Cancer, in general, is still not fully understood, and most cases are discovered too late to be treated. The main challenge nowadays is to detect and predict cancer as early as possible, in order to treat it in the best possible way. For the past 3-4 years, artificial intelligence has been developing in the medical field in order to provide useful tools for better detection and prediction.

Lung cancer screening and incidental findings are the two main ways to detect lung cancer in a patient. Screening should normally be the main way of finding lung cancer, as it is for breast cancer, but in most countries, such as the UK, there is no national screening programme and most cases are incidental findings.

An incidental finding means that lung nodules are found during an exam that does not target the lungs; for example, nodules can be found during a heart or liver CT scan. The main problem is that radiologists are not trained to sort these nodules and determine whether they are cancerous or benign. Such patients should be referred to a pulmonologist, who will in most cases ask for a chest CT scan, which leads to more radiation exposure for the patient. In the worst case, the patient's report never reaches the pulmonologist and is lost, resulting in the cancer never being found or being found too late.

This thesis, written in association with Optellum Ltd, tries to address this issue. Optellum's vision is to redefine cancer care from early diagnosis to treatment, by enabling every clinician to choose the optimal diagnostic and treatment pathway for their patients. This is done by using machine learning on vast medical image repositories. This thesis is part of this vision and focuses on:

"the assessment of lung damages from CT images using machine learning methods." It will focus on how to assess any kind of lung damages by us- ing deep learning methods whilst looking only at a global scale: the entire CT-scan. This work does not focus on the finding of nodules (local scale) as it has be done in [36, 18] but on the global assessment of lung damages (global scale). Then it will be possible to combine these two scales of features to better predict cancer.


2 Materials and methods

2.1 Clinical Data

Machine Learning (ML), and even more so Deep Learning (DL), requires large quantities of data. The data here are medical images, more particularly CT scans from the NLST dataset. NLST is a screening trial in which around 50,000 heavy smokers aged between 55 and 74 received either a chest X-ray or a chest CT scan at three different time points when possible. This thesis extracted CT scans for 10,000 patients from this study.

Counting the different time points, a total of 16,164 images have been collected.

The CT scans come from a large number of sites as well as manufacturers. All the CT scans have dimensions of 512x512 pixels in the axial plane but vary widely in resolution (cf. part 2.3).

The entire set of clinical data is split 70:30 into training and validation sets. If a scan from a patient is assigned to one set, all the other CTs from this patient are included in the same set.
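As an illustration of this patient-level split, the following sketch (a hypothetical helper, not the thesis code, assuming a list of (patient_id, scan_path) pairs) groups scans by patient before splitting 70:30, so that all time points of a patient end up in the same set.

```python
import random
from collections import defaultdict

def split_by_patient(scans, train_fraction=0.7, seed=0):
    """Split (patient_id, scan_path) pairs 70:30 so that all scans of a
    patient fall into the same set (illustrative helper)."""
    by_patient = defaultdict(list)
    for patient_id, scan_path in scans:
        by_patient[patient_id].append(scan_path)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_train = int(train_fraction * len(patients))

    train = [(p, s) for p in patients[:n_train] for s in by_patient[p]]
    val = [(p, s) for p in patients[n_train:] for s in by_patient[p]]
    return train, val
```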

2.2 Image Annotations

Approximately 500 different metadata fields were recorded for each scan in the NLST trial. These data are divided into categories such as demographics, lung cancer, smoking, death, or follow-up. Only a few of these metadata are relevant to this master thesis. After a careful analysis of the dictionaries summarizing them, three of them were chosen, plus one created by members of the company:

2.2.1 Emphysema

Emphysema is defined as the "abnormal permanent enlargement of the airspaces distal to the terminal bronchioles accompanied by destruction of the alveolar wall and without obvious fibrosis". A patient presenting emphysema is classified as 1 (fig 1a) [23].

2.2.2 Opacities

Opacities represent the result of a decrease in the ratio of gas to soft tissue (blood, lung parenchyma and stroma) in the lung (fig 1b).

2.2.3 Consolidation

Consolidation of the lung is a solidification of lung tissue due to liquid or solid accumulation in the air spaces (fig 1c).

(11)

2.2.4 Cancer

The cancer markup was produced by trained doctors hired by the company. Thanks to their knowledge, they were able to differentiate cancerous nodules from benign nodules. A patient is marked as cancer as soon as he has at least one cancerous nodule in one of his available scans. For example, if patient X has no cancer on the CT at time point 0 but one cancerous nodule on the CT at time point 1, both images of the patient are labelled as cancer.

Figure 1: Visible diseases in the lung: (a) emphysema, (b) opacities, (c) consolidation

All the metadata except the cancer annotation were created during the NLST trial and are subject to human error. The proportion of positive patients for each factor is summarized in table 1 and fig 2. The proportion of consolidation in the dataset is too low to be used for training later. The data are unbalanced; this problem was tackled during the training phase by using the weighted sampler method in PyTorch. By giving a weight to each class, it ensures that each batch contains a 50:50 mix of images with and without the disease. As a result, each disease case is shown more often on average.

Class           Positive   Proportion
Cancer          940        5.9%
Emphysema       5043       31.2%
Opacities       3783       23.4%
Consolidation   124        0.8%

Table 1: Proportion of each factor in the NLST database
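A minimal sketch of the class balancing described above, using PyTorch's WeightedRandomSampler; the label tensor is a toy example and the data loader line is only indicated, since the actual dataset object is not shown here.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Toy labels: 0 = no disease, 1 = disease (e.g. the emphysema flag)
labels = torch.tensor([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

class_counts = torch.bincount(labels)        # [n_negative, n_positive]
class_weights = 1.0 / class_counts.float()   # inverse class frequency
sample_weights = class_weights[labels]       # one weight per sample

# Drawing with these weights yields roughly 50:50 positives/negatives per batch.
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)
# loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)  # train_dataset assumed
```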


Figure 2: Distribution of NLST data

2.3 Preprocessing

2.3.1 Normalization

As previously noted, the CT scans vary a lot in terms of resolution and intensity due to the variability of devices. In order to generalize better, a normalization step is applied to all the images before using them for training.

Resampling

The first normalization is to rescale the images to the same spatial resolution, 1x1x2 mm (x;y;z), in order to remove zoom or slice-thickness variance. Indeed, a scan may have a spatial resolution of 2x2x2.5 mm, meaning that the distance between slices is 2.5 mm. The resampling is performed using nearest-neighbour interpolation.

Standardization

CT scan intensity is measured in Hounsfield Units (HU) and represents radiodensity. The CT scans from our database range from -1024 to 1000. The interesting values for lung images are around 0, which represents water, and around -1000, which represents air. As shown in figure 3a, these values are the most common in lung CT scans. Data in machine learning and deep learning are commonly standardized, which means subtracting the mean value of the dataset and dividing by its standard deviation; values then mostly range between -1 and 1. The mean and standard deviation of the NLST dataset are given in table 2. The distribution of HU before and after standardization is shown in figures 3a and 3b:

$x_{\mathrm{standard}} = \dfrac{x - \mathrm{mean}_{\mathrm{dataset}}}{\sigma_{\mathrm{dataset}}}$   (1)


Mean (mean_dataset)                 -440 HU
Standard deviation (σ_dataset)       480 HU

Table 2: Dataset mean and standard deviation

Figure 3: HU distribution (a) before and (b) after standardization

Fixed-size

The network has to be fed with inputs of the same size; however, after resampling, the CT scans have different numbers of pixels per slice and different numbers of slices per scan, for example 280x280 or 400x400 (x;y). The adopted strategy is to use an input of 320x320x32 (x;y;z), which means 32 slices of 320x320 pixels. To achieve this, images are either cropped to 320x320 or zero-padded on the sides to reach the 320x320 input size. Using a random crop could leave a tumor outside the volume, but in our study this is not a problem: the main idea is that the network should learn global patterns inside the lung and not local information such as a tumor. Not having the tumor inside the image does not change the global pattern of the lung, and therefore should not change what the network learns from the CT scan.

In order to reach the target of 32 slices, several methods are computed and used in both the single-factor and multi-task experiments (a small sketch of the projections follows this list):

• Choice of 32 slices in the CT with a pre-defined step between two consecutive slices (typically 3)

• Random choice of 32 slices

• Minimum Intensity Projection (MinIP) over 3 slices: takes the minimum value over the z-axis for 3 consecutive slices. This method emphasizes the dark areas and helps to detect emphysema

• Maximum Intensity Projection (MaxIP) over 3 slices: takes the maximum value over the z-axis for 3 consecutive slices. This method emphasizes the bright areas and helps to detect nodules

• Average Intensity Projection (AIP) over 3 slices: takes the average value over the z-axis for 3 consecutive slices.
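The three projections can be sketched with NumPy as below; this is a simplified illustration assuming the volume is stored as a (slices, height, width) array whose number of slices is a multiple of 3.

```python
import numpy as np

def project_over_triplets(volume, mode="min"):
    """MinIP / MaxIP / AIP over groups of 3 consecutive slices along z.
    volume: array of shape (n_slices, height, width), n_slices divisible by 3."""
    grouped = volume.reshape(-1, 3, *volume.shape[1:])   # (n_slices/3, 3, H, W)
    if mode == "min":                # MinIP: emphasizes dark areas (emphysema)
        return grouped.min(axis=1)
    if mode == "max":                # MaxIP: emphasizes bright areas (nodules)
        return grouped.max(axis=1)
    return grouped.mean(axis=1)      # AIP: average intensity projection
```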

2.3.2 Data Augmentation

In order to generalize better and avoid overfitting, data augmentation is used to enlarge the dataset. The first augmentation is a random crop of the images at a fixed size. In this way, slightly different parts of the same image are shown, which helps the network learn further.

The second augmentation is a random flip over the x or y axis during training, i.e. a reflection of the slices across the middle of that axis.

The last augmentation is a random rescaling of the intensity histogram of each image. As shown in fig 3a, the normal distribution of an image in HU lies within the range [-1000:600]. To rescale the image's intensity, a random minimum between -1150 and -850 and a random maximum between 100 and 1300 are chosen, and the histogram is rescaled from the initial range to the [min:max] range. The intensity of the image therefore changes each time the network sees it. This is important because intensity mainly depends on the machine, and wrong correlations could otherwise be learned by the network. Indeed, in the NLST dataset, patients had their CTs in different hospitals, and the rate of cancer or emphysema can be very different from one hospital to another. Image intensity is an intrinsic parameter of the machine used, depending on its calibration, the protocol followed and the reconstruction filters used by the manufacturer, as shown in [20]. Without intensity augmentation, the network could thus learn a simple correlation between emphysema cases and the average intensity of the scan.
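A possible sketch of this random intensity rescaling (an illustrative helper; the ranges are the ones given above):

```python
import numpy as np

def random_intensity_rescale(volume_hu, rng=np.random):
    """Randomly remap the HU histogram from the initial [-1000, 600] range
    to a random [min, max] range, as described above (illustrative sketch)."""
    new_min = rng.uniform(-1150, -850)
    new_max = rng.uniform(100, 1300)
    old_min, old_max = -1000.0, 600.0
    scaled = (volume_hu - old_min) / (old_max - old_min)   # normalize to [0, 1]
    return scaled * (new_max - new_min) + new_min          # remap to [min, max]
```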

2.4 Evaluation metrics

During this thesis, many models are trained using different networks, training methods and hyperparameters (parameters set by hand before training, such as learning rate, momentum, etc.). In order to compare all these models, evaluation metrics must be set before starting the training. The metrics used to evaluate a model and compare two models are:

2.4.1 Loss and Accuracy

Both loss and accuracy are computed on the training and validation sets (cf. Annex A). The loss function (also called error or cost function) maps the network parameters to a scalar value which specifies how wrong this set of parameters is. The main task is then to minimize the loss function


by updating the network's parameters. The loss function is computed during the forward pass of the network. Since we work on a classification problem, the cross-entropy loss function is used in all experiments (unless otherwise noted). This loss function is a combination of the Log Soft Max function and the Negative Log Likelihood function:

$\mathrm{NLL}(x, \mathrm{class}) = -x[\mathrm{class}]$   (2)

$\mathrm{LSM}(x, \mathrm{class}) = \log\left(\dfrac{\exp(x[\mathrm{class}])}{\sum_j \exp(x[j])}\right)$   (3)
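In PyTorch this combination is exactly what nn.CrossEntropyLoss computes: it applies LogSoftmax to the raw class scores followed by the negative log-likelihood, as the quick sketch below shows (tensor shapes are illustrative).

```python
import torch
import torch.nn as nn

logits = torch.randn(16, 2)              # raw network outputs (batch of 16, 2 classes)
targets = torch.randint(0, 2, (16,))     # ground-truth class indices

ce = nn.CrossEntropyLoss()(logits, targets)

# Equivalent decomposition into LogSoftmax + NLLLoss, i.e. equations (3) and (2):
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)
assert torch.allclose(ce, nll)
```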

For the same experiment with different hyperparameters, the model reaching the lowest loss on the training and validation sets can be defined as the best model.

The accuracy is defined as the percentage of well classified elements in a classification task. The higher the accuracy on the validation set, the better the model.

2.4.2 ROC and confusion matrix

The two other metrics are the ROC (and AUC) and the confusion matrix; they are defined in Annex A. In the ROC, the closer the curve is to the top left corner, the better the model. This can also be seen by comparing the Area Under the Curve (AUC): the higher the AUC, the better the model. The confusion matrix allows us to compute different metrics (accuracy, precision, recall, etc.), which are important to understand what the network misclassifies and to compare with other models.

The decision to keep one model rather than another is made based on these evaluation metrics.

2.5 Experiments

2.5.1 Single factor prediction

The first set of experiments determines the ability of the network to learn different disease factors: cancer, emphysema and opacities. These are trained separately but use the same network.

2.5.1.1 Network

The base architecture used to train the different factors is a 3D version of ResNet18, which will be called ResNet3D (fig 5) [12]. The input is a set of 32 slices, or projections of slices, of 320x320 pixels. The input first passes through a 3D convolution with a 3x5x5 kernel and 32 channels; this large filter is used to detect larger components in the image (shapes, blobs, etc.). After a 3D batch normalization and


max pooling, the network is composed of four similar blocks (cf. fig 4). For each block, the input is used twice: first in the succession of layers (left part of fig 4) and also added to the output of this succession (right part of fig 4). The output of one block then feeds the input of the two successive layers.

The classifier part of the network, which aims to determine whether the factor is present or not, is composed of a 3D convolution with a 1x1x1 kernel and a 3D spatial average pooling. Through this convolution, the 256 channels are mapped to a 2-class output: whether the patient has the disease or not.

Figure 4: ResNet 3D block: the input of a 3D block first passes through the succession of 3D convolution, BN, ReLU activation, 3D convolution and BN (left part of the diagram) and is also added to the output of the last BN (right part).
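A minimal sketch of one such residual block in PyTorch; channel counts and stride handling are simplified compared to the network actually used, so this is an illustration of the skip-connection structure rather than the exact architecture.

```python
import torch.nn as nn

class ResNet3DBlock(nn.Module):
    """One 3D residual block: conv3d -> BN -> ReLU -> conv3d -> BN,
    with the block input added to the output (skip connection)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # residual (skip) connection
```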

2.5.1.2 Training

Implementation is done using the PyTorch framework [27] and the network is trained on an NVIDIA GPU. The training runs for 40 epochs, meaning 40 iterations over the dataset, with a batch size of 16 (the maximum fitting in memory), a momentum of 0.99, a weight decay of 1e-4 and an initial learning rate of 0.01. The learning rate decay follows the plateau method: if the validation loss does not decrease during 3 consecutive epochs, the current learning rate is divided by a factor of 10. The learning rate (LR) is defined as the amount of change applied to the model; LR decay is used in order to become more and more accurate and reach a global minimum.

Figure 5: ResNet 3D used for cancer prediction, as an example
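A sketch of this optimisation setup in PyTorch; `model`, `train_loader` and `val_loader` are assumed to exist, and `patience=3` with `factor=0.1` implements the plateau rule described above.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.99, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.1, patience=3)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(40):
    model.train()
    for volumes, targets in train_loader:               # assumed DataLoader
        optimizer.zero_grad()
        loss = criterion(model(volumes), targets)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(v), t).item() for v, t in val_loader)
    scheduler.step(val_loss)   # divides the LR by 10 after 3 epochs without improvement
```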

Cancer:

The cancer training uses the cancer metadata. The input is a volume of 32 slices of 320x320 pixels, obtained with the average intensity projection. As only 5.9% of the images are marked as cancer, the classes are balanced during training using the sampler.

All the described pre-processing and data augmentation steps are performed.

The goal is to see the ability of the network to detect cancer from a CT volume.

Emphysema

The emphysema training is similar to the cancer one, with the emphysema metadata used as output. The minimum intensity projection is used, as it reveals the darker parts of the image, which correspond to emphysema. All the described preprocessing and data augmentation steps are performed. The goal is to see the ability of the network to detect emphysema from a CT volume.

Opacities

The opacity training is the same as for emphysema and cancer but with the opacity metadata as output. The maximum intensity projection is used this time: as opacities appear as brighter areas on the CT, the MaxIP emphasizes their presence. All the described preprocessing and data augmentation steps are performed. The goal is to see the ability of the network to detect opacities from a CT volume.


2.5.2 Combination of factors

Figure 6: Multi-task Network

Once these trainings have been run and the ability of the network is understood, the goal is to combine the detection of all these factors in the same network. The next experiments use the emphysema, cancer and opacities annotations and try to predict the three of them with only one multi-task network. First, the network is trained with no prior knowledge, and then self-supervised learning is used before fine-tuning the network.

2.5.2.1 Network

For this combined prediction, the multi-task network shown in fig 6 is used. The network is separated into two parts. The first part, the feature extraction, is the ResNet 3D already used for the single-factor prediction; the number of channels is the same at each layer, and only the final classifier part is removed. The second part of the network performs the factor predictions: the output of the ResNet 3D is fed to three sub-parts, one per predictor. During training, a loss is computed for each classifier and backpropagated through its own classifier and the ResNet 3D.

In total, three losses are computed, one for each factor classifier. The backward phase (update of the weights by gradient descent) is then computed according to the three losses.
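A sketch of the multi-task head on top of the shared feature extractor; this is a simplified illustration in which `backbone` stands for the ResNet 3D without its classifier, assumed here to output a 256-dimensional feature vector per volume.

```python
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared ResNet 3D features followed by one small classifier per factor."""
    def __init__(self, backbone, feature_dim=256):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleDict({
            name: nn.Linear(feature_dim, 2)
            for name in ("cancer", "emphysema", "opacities")
        })

    def forward(self, x):
        features = self.backbone(x)
        return {name: head(features) for name, head in self.heads.items()}

def multi_task_loss(outputs, targets, criterion=nn.CrossEntropyLoss()):
    # One loss per factor; backpropagating their sum updates each head and the backbone.
    return sum(criterion(outputs[name], targets[name]) for name in outputs)
```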

2.5.2.2 Full training

The full training initializes the convolution weights with a Gaussian distribution centered at zero with a standard deviation of √(2/n), where n is the number of inputs to the neuron. The batch normalization layers are initialized with a weight of 1 while the biases are set to 0. The training runs for 50 epochs with a batch size of 16, a momentum of 0.99, a weight decay of 1e-4 and an initial learning rate of 0.01. The learning rate is lowered using the plateau method.
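This corresponds to He (Kaiming) initialisation; a small PyTorch sketch is given below, where the fan-in computation yields the √(2/n) standard deviation (the `model.apply` line assumes the multi-task network defined earlier).

```python
import math
import torch.nn as nn

def init_weights(module):
    """Gaussian init with std sqrt(2/n) for convolutions; weight 1, bias 0 for batch norm."""
    if isinstance(module, nn.Conv3d):
        n = module.in_channels * math.prod(module.kernel_size)  # inputs per neuron
        nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(2.0 / n))
    elif isinstance(module, nn.BatchNorm3d):
        nn.init.constant_(module.weight, 1.0)
        nn.init.constant_(module.bias, 0.0)

# model.apply(init_weights)   # `model` assumed to be the multi-task network above
```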

2.5.2.3 Self-supervision

Figure 7: Siamese Network

Another approach for training this complex multi-task network is to first pre-train the feature extractor on a specific task before using transfer learning and fine-tuning it on the multi-task problem. Different methods exist to pre-train a network; self-supervision is used here in order to force the network to learn global features rather than features specific only to the main task. This method uses a Siamese network and was first used for face recognition [6].

Siamese network

Here, a Siamese network is trained to distinguish between pairs of images coming from the same patient at two different time points or from two different patients. The Siamese network consists of two ResNet 3D networks in parallel sharing the same weights (see fig 7). The input is a pair of images, each of which goes through the ResNet 3D to compute a 256-dimensional vector. The vectors from the two inputs are then used to compute the contrastive loss (cf. next paragraph). The distance between two images is computed as the absolute difference between the two output vectors.

Training of Siamese network


To build the pairs of images, only patients with 2 or more CT scan time points are kept in the NLST database, which corresponds to 5,085 patients, i.e. pairs of images. Once a patient is chosen as an input, a random choice determines whether the other image should come from the same patient or from a different one. If different, a CT scan is chosen at random among all the remaining CTs. All the data augmentation is performed. The random rescaling of the image intensity is very important so that the network does not learn machine-specific characteristics when differentiating similar from different patients.

In order to improve the results of the Siamese network, the adaptive margin loss function described by Wang et al. [32] is chosen. Normally, Siamese networks are trained with the contrastive loss function [6] described in equation (4).

$(1 - \mathrm{label})\,\tfrac{1}{2}D_w^2 + \mathrm{label} \times \tfrac{1}{2}\{\max(m - D_w, 0)\}^2$   (4)

where D_w represents the Euclidean distance between the two output vectors and m a defined margin. The main issue with this method is finding the right margin, which is why the adaptive margin loss function is chosen, as it depends on the inputs. In a batch, all the distances between images of the same patient must be smaller than an adaptive up-margin M_p, and all the distances between images of different patients must be larger than an adaptive down-margin M_n, defined in equation (5).

$M_p = \dfrac{1}{\mu}\left(1 - \exp(-\mu d)\right), \qquad M_n = \dfrac{1}{\gamma}\log\left(1 + \exp(\gamma s)\right)$   (5)

where s is the mean similar distance, d the mean different distance, and µ and γ two hyperparameters set to 8 and 2, respectively.

From equation (5), M_τ and M_c are defined such that M_p = M_τ − M_c and M_n = M_τ + M_c. Finally, the loss function is defined in (6), with label ∈ {−1; 1}.

$\mathrm{Loss} = \sum_{\mathrm{batch}} \max\{M_c - \mathrm{label}\,(M_\tau - D_w), 0\}$   (6)
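The loss could be sketched in PyTorch as below, following equations (5) and (6). The label convention (+1 for same-patient pairs, −1 for different-patient pairs) and the derivation of M_τ and M_c from M_p and M_n are assumptions based on the text above, not Wang et al.'s reference implementation.

```python
import torch

def adaptive_margin_loss(distances, labels, mu=8.0, gamma=2.0):
    """distances: D_w for each pair in the batch; labels: +1 for same-patient
    pairs, -1 for different-patient pairs (assumed convention)."""
    same = labels > 0
    d = distances[~same].mean()   # mean distance over different-patient pairs
    s = distances[same].mean()    # mean distance over same-patient pairs

    m_p = (1.0 / mu) * (1.0 - torch.exp(-mu * d))              # up-margin, eq. (5)
    m_n = (1.0 / gamma) * torch.log1p(torch.exp(gamma * s))    # down-margin, eq. (5)
    m_tau = 0.5 * (m_p + m_n)     # from M_p = M_tau - M_c and M_n = M_tau + M_c
    m_c = 0.5 * (m_n - m_p)

    # eq. (6): hinge term per pair, summed over the batch
    return torch.clamp(m_c - labels * (m_tau - distances), min=0).sum()
```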

2.5.2.4 Training with Pre-training

The training is the same as in part 2.5.2.2; the only difference is the initialization of the weights. In this case of transfer learning (see A.2.3), the weights from the pre-trained network are loaded into the multi-task network before training, after which fine-tuning of the entire network is performed.


3 Results

The following sections present the results obtained for the different sets of experiments, starting with the single-factor predictors and then moving to the multi-task prediction.

3.1 Single factor prediction

3.1.1 Cancer

Figure 8 shows the loss and accuracy curves for the training and validation sets. The training loss decreases while the validation loss is first unstable before remaining close to constant (after 10 epochs). The ROC and prediction matrix in fig 9 show that the network has not learned the cancer prediction well, as the ROC is close to a random prediction (AUC = 0.51). The prediction matrix shows an inability to classify cancer cases correctly: most of the cases are classified as benign. The unbalanced validation set gives a rather high validation accuracy of approximately 75%, but this figure must be interpreted knowing that the network mainly predicts benign and that benign is the majority class.

Figure 8: Loss (a) and accuracy (b) evolution for the training and validation phases, cancer prediction

3.1.2 Emphysema

For the second experiment, the prediction of emphysema, the ROC and prediction matrix show that the network is able to learn some aspects of emphysema. Indeed, the AUC is 0.78, which is much better than random. The confusion matrix (fig 11b) shows that presence and absence of emphysema are mostly classified accurately. In this case, the validation accuracy of 72% is meaningful as both classes are mostly classified correctly. Moreover, the training and validation loss curves (fig 10a) are as expected from a network which has learned something. Emphysema can therefore be classified accurately by the network from a CT volume.


Figure 9: ROC curve (a) and confusion matrix (b) for cancer prediction

             Prediction No   Prediction Yes
Actual No    3470             983
Actual Yes    214             139


Figure 10: Loss (a) and accuracy (b) evolution for the training and validation phases, emphysema prediction

3.1.3 Opacities

Concerning the opacity prediction, the confusion matrix (fig 13b) shows that the network is reasonably accurate at recognizing opacities when they are present but has more trouble with their absence: there are many more false positives than false negatives. The number of false positives is too high to conclude that the network has truly learned this task.

3.2 Combining factors

Combining factors is first done by training the network from random initialization. The ROC curves and histogram distributions for emphysema, cancer and opacities in fig 17 and 18 (left column) show that the network does not really learn: there is no separation between the classes, and the histogram distributions look like Gaussians centered on 0.5, meaning the network is confused for most of the cases.


Figure 11: ROC curve (a) and confusion matrix (b) for emphysema prediction

             Pred No   Pred Yes
Actual No    2287      1179
Actual Yes    314      1069

Figure 12: Loss (a) and accuracy (b) evolution for the training and validation phases, opacities prediction

Figure 13: ROC curve (a) and confusion matrix (b) for opacities prediction

             Pred No   Pred Yes
Actual No    1910      1837
Actual Yes    315       787


Figure 14: ROC curve (a) and confusion matrix (b) for the self-supervision task

             Pred No   Pred Yes
Actual No     673        90
Actual Yes     67       696

For the self-supervision task, the maximum validation accuracy is 89.25% while the AUC is 0.96. In order to determine whether the network accurately classifies two patients, the distance between the two output vectors of the network is computed (fig 16a). The best threshold is found in fig 16b: 89.25% accuracy for a threshold of 0.0448. If the distance is lower than the threshold, the images are classified as a same-patient pair, otherwise as a different-patient pair. The two classes are well separated on the distance graph (fig 16a).

The filters of the first convolution, which correspond to the weights of this layer, are shown in fig 15. They show that the network has learned some patterns, as shapes in the filters are distinguishable. More importantly, the filters have changed significantly from the random initialization, so it can be concluded that the network is now able to recognize specific shapes thanks to these filters.

Figure 15: 32 filters of the first convolutional layer: (a) random initialization, (b) after training


Figure 16: Siamese network, validation set: (a) distance distribution for similar and different patients' images, (b) evolution of validation accuracy vs distance threshold

3.2.1 Training with Pre-training

When pre-training is added to the network, it can be noted that the network is better at classifying the different diseases, especially emphysema, where an AUC of 0.78 is reached (fig 17). Moreover, the histograms representing the distribution of the probability of having the disease for each class are no longer Gaussians more or less centered on 0.5; for emphysema, the two classes are now well separated (fig 18).

3.2.2 Risk score

The final task is to link all these results to a possible clinical application. From the different histogram distributions, a risk score for each disease can be created. For a given probability of having the disease, the risk score is the proportion of patients having the disease at this probability. The different risk scores are displayed in fig 19, and the formula is:

$rs(\mathrm{prob} = p) = \dfrac{\mathrm{disease}(p)}{\mathrm{total}(p)}$   (7)
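Following equation (7), the risk score can be estimated by binning the predicted probabilities and, in each bin, computing the fraction of patients that actually have the disease. The sketch below is a minimal illustration; the bin width is arbitrary and the inputs are assumed to be NumPy arrays.

```python
import numpy as np

def risk_score_curve(probabilities, has_disease, n_bins=20):
    """probabilities: predicted probability of disease per patient;
    has_disease: 0/1 ground truth. Returns (bin centers, risk score per bin)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    scores = np.full(n_bins, np.nan)
    bin_index = np.clip(np.digitize(probabilities, edges) - 1, 0, n_bins - 1)
    for b in range(n_bins):
        in_bin = bin_index == b
        if in_bin.any():
            scores[b] = has_disease[in_bin].mean()   # eq. (7): disease(p) / total(p)
    return centers, scores
```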


Figure 17: ROC curves for cancer (a, b), emphysema (c, d) and opacities (e, f); left column without pre-training, right column with pre-training


Figure 18: Distribution of the probability of having the disease for each class, for cancer (a, b), emphysema (c, d) and opacities (e, f); left column without pre-training, right column with pre-training


Figure 19: Risk score curve for each disease


4 Discussion

4.1 Main findings

Throughout this thesis it is shown, first, that there is valuable information in a whole CT scan. Indeed, the single-factor predictors show that it is possible to assess emphysema or opacities better than chance, even if the task is not fully learned. However, in the case of cancer, the network does not learn any features which enable it to predict cancer. In the case of emphysema, which has been seen to be successful, the network is able to diagnose emphysema cases correctly without any prior knowledge of what emphysema is. The task of detecting cancer by training with a full CT scan seems to be too complicated for the network. This can be explained by the fact that cancer is nowadays detected via nodules, and there is no real pattern inside the lung helping to predict it. Another reason could be that the dataset used represents a population at high risk of lung cancer, as they were all heavy smokers. The emphysema detection is much better because the task is more visual and a pattern clearly exists in the whole lung.

The second aim of this thesis is to predict the three diseases with a single network, based on the assumption that the features for detecting cancer, emphysema or opacities should be the same. When training with no prior knowledge, the results are not comparable, in terms of ROC, to those obtained with single-factor prediction. However, by using self-supervision as pre-training, the results are improved and the ROCs get closer to the ones obtained by training the factors one by one.

Two results are important here. The first is the impact of pre-training a network and using a transfer learning approach. In our case, the filters obtained with the pre-training (fig 15b) show patterns such as horizontal and vertical lines or blobs, which are well-known patterns for finding edges, shapes or texture in an image. The difference in results between the non-pre-trained and pre-trained networks is shown in fig 17 and 18: the AUCs increase for emphysema and opacities, and the histogram distributions clearly show the impact of pre-training. For emphysema, it enables a clear separation of the classes. Moreover, pre-training a network ensures that it learns visual features and not only machine-related features such as the resolution or the intensity of the CT scan.

The second result is that it is possible to use a multi-task network in order to learn multiple predictions at the same time. Indeed, when using the multi-task network, results similar to the single-factor predictions are achieved; for emphysema, the result is almost the same, with AUC = 0.79 in single prediction and AUC = 0.78 in multi-task prediction. This supports the idea that combined learning can help a network find the most interesting features and obtain better results on some tasks, whereas when learning only one prediction, the network can become too specific and learn only local features.

Finally, from these experiments, a risk score for each disease can be calculated. In order to obtain a global risk score, further clinical analysis is required; a simple way could be to average the different scores.

4.2 General impact

Nowadays, images are essential in the detection of lung cancer. However, the use of these images is still empirical, as doctors only look at them through display software. Much research exists on the automatic detection or classification of nodules, but none of it tries to assess the information contained in the whole CT scan. In this thesis it was demonstrated, despite many limitations, that it is possible to find information by looking only at an upper scale (the global CT scan) and not at a local scale, specific area or feature. This information could be of great help for doctors in the future: with further development, it might be possible to automatically give a risk score for each patient going through a CT scan exam, and thus draw doctors' attention to patients at risk.

The work done in this thesis could also be used to curate large databases. For example, if 2,000 emphysema patients must be found for a retrospective study, taking the 2,000 with the highest probability of having emphysema will save a researcher a lot of time compared to choosing them by hand.

Nowadays, pulmonologists use the Brock score to evaluate lung cancer risk [22]. This score includes risk factors such as emphysema or the presence of nodules. By combining this work with local-feature work on nodule detection, it might be possible to calculate the Brock score from the CT scan alone.

4.3 Comparison to other methods

The main use of 3D chest CT is lung nodule detection, as has been done in the Kaggle challenge with the LUNA dataset or the LUNA challenge [34, 9]. These methods reach scores of up to 0.951 (FROC), a derivative of the ROC curve, which is much better than the AUC obtained with the global feature approach. However, these methods focus on the specific task of nodule detection, using small patches (with a nodule or just background) during the training phase, while our method uses the full CT scan during training. Moreover, detection is a popular and well-known task in deep learning with many different approaches. The two methods could therefore be complementary, and their outputs combined in order to better predict lung cancer.

(31)

A study closer to what is done in this thesis is CheXNet [28]. In this study, transfer learning is used on a 2D network to predict 14 different lung diseases from 2D X-ray images. Even if some concerns have been raised about this dataset and this study [24], the presented results are good: they reach an AUC of 0.92 for emphysema and 0.78 for nodules. These results can be explained by the use of a pre-trained network, DenseNet, pre-trained on ImageNet. In this thesis, the inputs are 3D volumes and no 3D pre-trained network has been released yet.

4.4 Limitations

Several limitations apply to this project, and most of them are due to the availability and consistency of the data.

First of all, all the training has been done on the NLST dataset, which has an inclusion criterion of more than 30 pack-years of smoking history. This of course biases the training of the network and impacts its robustness. This issue raises the problem of having clinical data for healthy patients; such data would considerably improve all artificial intelligence applications in the medical field. Indeed, in our case, some patients are not diagnosed with cancer, for example, but their lungs have much damage due to their smoking history, and presenting such a patient to the network as a healthy one (absence of cancer) might confuse it.

The second limitation is the noise in the metadata. While analyzing the results, many mistakes in the markup of emphysema or opacities were found, which certainly confused the network in its learning process. On the positive side, the emphysema prediction was sometimes better than the human markup: for most of the cases with a very high predicted probability of emphysema while the human markup says no emphysema, checking the scan shows that the patient indeed has emphysema, as shown in fig 20.

A more technical limitation is GPU memory. 3D scans are heavy files and take up a lot of GPU memory. A direct consequence is the small batch size used for training: 16, compared to most other deep learning trainings, which use a batch size of 128. This could influence the training, as the backpropagation is more likely to be sensitive to sample specificities when the batch is small.

4.5 Future work

A lot of new work can derive from this promising project. First of all, working on the same project but with a different dataset would be the first thing to do to validate the work done. The validity of the metadata will have to be ensured while collecting it.


Figure 20: False Positive Emphysema

Another idea would be to work on new metadata. Many diseases or infectious patterns can appear in the lung, and training the network on all of them at the same time would give a more robust network and thus better results.

From a more technical point of view, the self-supervision training can certainly be improved by using another task or another loss function, such as the histogram loss. Testing new networks, or combining networks, would also be interesting; for example, working on a texture network could be worthwhile.

Finally, it is also possible to apply this idea of working on a full scan to other parts of the body. For example, breast cancer or liver cancer could benefit from such a damage score.

5 Conclusion

Different approaches to assess lung damages have been explored in this thesis on a large dataset of patients.

First, the single-predictor approach shows that a network is able to learn some diseases, such as emphysema or opacities, with a certain degree of confidence by only looking at a global scale.

Moreover, the work done here shows that it is possible to predict multiple diseases using only one network, and thus that the same features can be used to detect different diseases.

This thesis also emphasizes the importance of pre-training. The self-supervision method used here enabled the network to be initialized with visual features useful for the multi-task learning. The importance of transfer learning has been shown, as the results are better with pre-training and fine-tuning.

Even if the work has limitations, such as a biased database and noisy metadata, the overall results show that it is indeed possible to retrieve useful information by looking at a full CT scan, and that deep neural networks are able to learn at a large scale. The results are encouraging and this work can lead to many other experiments on different kinds or locations of images.


References

[1] GLOBOCAN 2012, IARC. Data visualization tools that present current national estimates of cancer incidence, mortality, and prevalence. 2016. URL: http://gco.iarc.fr/today/online-analysis-multi-%20bars?mode=cancer&mode_population=continents&population=900&sex=0&cancer=29&type=0&%20statistic=0&prevalence=0&color_palette=default.

[2] Y. Bengio. "Learning deep architectures for AI". In: Foundations and Trends in Machine Learning 2.1 (2009), pp. 1–127.

[3] Isabel Bush. Lung nodule detection and classification. Tech. rep. Stanford Computer Science, 2016.

[4] M. E. J. Callister et al. "British Thoracic Society guidelines for the investigation and management of pulmonary nodules: accredited by NICE". In: Thorax 70.Suppl 2 (2015), pp. ii1–ii54.

[5] Wanqing Chen et al. "Cancer statistics in China, 2015". In: CA: A Cancer Journal for Clinicians 66.2 (2016), pp. 115–132.

[6] Sumit Chopra, Raia Hadsell, and Yann LeCun. "Learning a similarity metric discriminatively, with application to face verification". In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 1. IEEE, 2005, pp. 539–546.

[7] Ciro Donalek. "Supervised and Unsupervised Learning". In: Astronomy Colloquia. USA, 2011.

[8] Jacques Ferlay et al. "Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012". In: International Journal of Cancer 136.5 (2015).

[9] grt123. URL: https://github.com/lfz/DSB2017/blob/master/solution-grt123-team.pdf.

[10] Duc M. Ha and Peter J. Mazzone. "Pulmonary Nodules". In: Age 30 (2014).

[11] Mohammad Havaei et al. "Brain tumor segmentation with deep neural networks". In: Medical Image Analysis 35 (2017), pp. 18–31.

[12] Kaiming He et al. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.

[13] R. T. Heelan et al. "Non-small-cell lung cancer: results of the New York screening program". In: Radiology 151.2 (1984), pp. 289–293.

[14] National Cancer Institute. Cancer Statistics. URL: https://www.cancer.gov/about-cancer/understanding/statistics.

[15] Shingo Iwano et al. "Computer-aided diagnosis: a shape classification of pulmonary nodules imaged by high-resolution CT". In: Computerized Medical Imaging and Graphics 29.7 (2005), pp. 565–570.

[16] Michael T. Jaklitsch et al. "The American Association for Thoracic Surgery guidelines for lung cancer screening using low-dose computed tomography scans for lung cancer survivors and other high-risk groups". In: The Journal of Thoracic and Cardiovascular Surgery 144.1 (2012), pp. 33–38.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.

[18] Devinder Kumar, Alexander Wong, and David A. Clausi. "Lung nodule classification using deep features in CT images". In: Computer and Robot Vision (CRV), 2015 12th Conference on. IEEE, 2015, pp. 133–138.

[19] Geert Litjens et al. "A survey on deep learning in medical image analysis". In: arXiv preprint arXiv:1702.05747 (2017).

[20] Dennis Mackin et al. "Measuring CT scanner variability of radiomics features". In: Investigative Radiology 50.11 (2015), p. 757.

[21] Heber MacMahon et al. "Guidelines for management of incidental pulmonary nodules detected on CT images: from the Fleischner Society 2017". In: Radiology (2017), p. 161659.

[22] Annette McWilliams et al. "Probability of cancer in pulmonary nodules detected on first screening CT". In: New England Journal of Medicine 369.10 (2013), pp. 910–919.

[23] K. M. Venkat Narayan et al. "Report of a National Heart, Lung, and Blood Institute workshop: heterogeneity in cardiometabolic risk in Asian Americans in the US". In: Journal of the American College of Cardiology 55.10 (2010), pp. 966–973.

[24] Luke Oakden-Rayner. CheXNet: an in-depth review. URL: https://lukeoakdenrayner.wordpress.com/2018/01/24/chexnet-an-in-depth-review/.

[25] Maxime Oquab et al. "Learning and transferring mid-level image representations using convolutional neural networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 1717–1724.

[26] Sinno Jialin Pan and Qiang Yang. "A survey on transfer learning". In: IEEE Transactions on Knowledge and Data Engineering 22.10 (2010), pp. 1345–1359.

[27] Adam Paszke et al. "Automatic differentiation in PyTorch". In: (2017).

[28] Pranav Rajpurkar et al. "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning". In: arXiv preprint arXiv:1711.05225 (2017).

[29] Ali Sharif Razavian et al. "CNN features off-the-shelf: an astounding baseline for recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2014, pp. 806–813.

[30] Karen Simonyan and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).

[31] Stanford University. Transfer Learning. URL: http://cs231n.github.io/transfer-learning/.

[32] Jiayun Wang et al. "Deep ranking model by large adaptive margin learning for person re-identification". In: Pattern Recognition 74 (2018), pp. 241–252.

[33] W. L. Watson and A. J. Conte. "Lung cancer and smoking". In: The American Journal of Surgery 89.2 (1955), pp. 447–456.

[34] Julian de Wit. URL: http://juliandewit.github.io/kaggle-ndsb2017/.

[35] Jason Yosinski et al. "How transferable are features in deep neural networks?" In: Advances in Neural Information Processing Systems. 2014, pp. 3320–3328.

[36] Wentao Zhu et al. "DeepLung: 3D deep convolutional nets for automated pulmonary nodule detection and classification". In: arXiv preprint arXiv:1709.05538 (2017).

A State of the Art

A.1 Clinical Background

A.1.1 Lung Cancer

According to the GLOBOCAN series published in 2012, lung cancer is the most commonly diagnosed cancer in the world, with around 1.82 million new cases each year. In 2012, 1.6 million people died from lung cancer, which represents 19.4% of cancer deaths. Lung cancer incidence is higher in developed countries in America, Europe and Asia [5, 8]. Moreover, significant upward trends are visible in these countries, especially for Asian females [5].

Lung cancer is mainly related to smoking history [33], but not exclusively, as the number of lung cancer cases in the Asian female population is increasing even though this population has very little smoking history [5].

Figure 21: Number of incident cases worldwide in 2014 [1]

As with every cancer, lung cancer appears when abnormal cells grow in one or both lungs, later forming what is called a tumor. A tumor can be benign or malignant.

Lung cancer is divided into two types:

• Primary lung cancer: cancer which originates in the lung; it is divided into two main types:
  – Non-small-cell lung cancer (NSCLC): 80% of the cases, divided into 4 sub-categories (squamous cell, adenocarcinoma, bronchioalveolar carcinoma, large-cell undifferentiated carcinoma)
  – Small-cell lung cancer (SCLC): 20% of the cases, composed of small cells which multiply quickly.


• Secondary lung cancer: cancer which starts in another part of the body and metastasizes to the lung

Lung cancer is mainly discovered through incidental findings (cardiovascular CT, liver CT) or through screening programs such as in the US [16]. CT images are a series of X-ray acquisitions taken from many different rotations, from which cross-sectional images are computed; digital geometry processing allows creating a 3D volume from a series of 2D images. It is an expensive technology which provides detailed information about body structures, and about lung structure in the case of chest CT scans.

According to the guidelines of these screening programs, a chest low-dose computed tomography (LDCT) must be performed, from which nodules, benign or malignant, can be revealed and a decision made according to the diagnosis.

A.1.2 Nodules

A pulmonary nodule is a small, round or egg-shaped lesion in the lungs which results in a radiographic opacity [10]. Nodules are considered to be less than 30 mm in size. They are differentiated into three main categories: solid nodules, part-solid nodules and pure ground-glass nodules [4, 21] (fig 22).

Depending on the category, the guidelines for the management of pulmonary nodules vary [4, 21]. Figure 23 summarises the guidelines from the Fleischner Society.

Sub-solid nodules have a higher likelihood of malignancy [22]. However, many factors increase the risk of malignancy, and they are useful to keep in mind while choosing and training the network.

Figure 22: Types of nodule: (a) solid nodule, (b) part-solid nodule, (c) ground-glass opacity nodule


Nodule Type    Characteristics                                                         Benign   Malignant
Solid          Obscures the underlying bronchovascular structure.                      98.9%    1.1%
Ground Glass   Opacification greater than that of the background, but through
               which the underlying vascular structure is visible.                     98.1%    1.9%
Part Solid     Mix of the two previous types of nodules.                               93.4%    6.6%

Table 3: Nodule type and malignancy [22]

Figure 23: Fleischner Society 2017 Guidelines for Management of Incidental Pulmonary Nodules in Adults [21]

A.1.3 Risk factors for nodule malignancy

The assessment of nodule malignancy is a true challenge nowadays for better prevention of lung cancer: the sooner a nodule is detected as malignant, the better the treatment will be. Many risk factors for malignancy have been reported in the literature, which nowadays help doctors to determine which nodule management to follow:

A.1.3.1 Nodule size [21]

The main risk factor is the size of the nodule. Nodule sizes are divided into three categories: <6 mm (<100 mm³), 6-8 mm (100-250 mm³) and >8 mm (>250 mm³). The smaller nodules are more likely to be benign and do not require any follow-up in most cases, whereas the biggest ones require a close follow-up (3 to 6 months).

A.1.3.2 Nodule Morphology [15]

Spiculated nodules have been associated with malignancy for many years [21] and are therefore classified as high-risk nodules.

Figure 24: Different nodules’ morphologies

A.1.3.3 Nodule Location [21]

Upper-lobe nodule location is a high-risk factor [22].

A.1.3.4 Multiplicity [21]

High multiplicity is a low-risk factor: the presence of 5 or more nodules likely results from an infection, and these nodules are then benign. Having between 1 and 4 nodules increases the risk of malignancy.

A.1.3.5 Growth Rate [21]

The growth rate is estimated by the Volume Doubling Time (VDT) which corresponds to the number of days in which the nodule doubles in volume.

A VDT < 400 days is a high-risk factor.

A.1.3.6 Age, Sex, Race [21]

Lung cancer is very unusual before 40 years of age; however, lung cancer incidence increases with each added decade. Women are more likely to develop lung cancer than men, and the incidence of lung cancer is much higher in the black population than in the white population.

A.1.3.7 Tobacco [21]

A smoking history increases the risk of lung cancer by 10 to 35 times compared to non-smokers.

A.1.4 Problem of detection

Pulmonary nodules are currently detected by radiologists by considering the shape, size and brightness of the unknown mass in the lung. Studies have shown that only 68% of nodules are found with this visual human detection [13]. The early classification of nodules remains a challenge in order to reduce the aggressiveness of the follow-up and treatment of patients.

Computer-aided detection (CAD) therefore has a large role to play in nodule detection and is a topic of high interest [3, 18]. In particular, the new deep learning architectures have been promising since their appearance less than 10 years ago and are the main topic of this master thesis. We will now focus on the engineering approach to the problem by first presenting briefly what deep learning is.

A.2 Engineering Background

A.2.1 Deep Learning

As described previously, new technologies have an increasingly important role to play in medicine. Machine learning, one of these new technologies, is a branch of artificial intelligence in which the system has the ability to learn and improve by itself from experience. In a simple definition, machine learning uses algorithms to parse data, learn from it and output a prediction for a particular task. Machine learning consists of many sub-categories such as decision tree learning, rule-based machine learning or deep learning.


Figure 25: Example of a neural network with one hidden layer (in the center)

Deep learning is a field of machine learning based on artificial neural networks but using deeper architectures (fig 26). A network is composed of nodes which are linked together by weights (fig 25). When sending an input, only a few nodes will fire in order to produce an output; the goal is to adapt the weights by changing their values in order to get the right nodes firing. Deep learning was first inspired by the functioning and structure of the human brain and by how information is delivered from one neuron to another. The advantage of deep learning is that each layer produces a certain representation of the input data which is also used as input for the next level of representation. It is then possible, passing through many different layers, to combine all these representations in order to perform many kinds of tasks depending on the chosen network. For example, in the medical world, deep learning architectures are now used to perform [2]: image, object or lesion classification [19], detection, segmentation [11], registration, or image generation and enhancement. Moreover, deep learning is a hot topic in the medical field, with an enormous increase in the number of papers published within the last two years [2] (fig 27).

A.2.2 How does a deep learning network learn?

A deep learning network must learn by itself, only from the input data that a human shows it. The goal is to eliminate all the biased knowledge that we normally include in the design of our algorithms. A deep learning network is thus a black box, of which we can only control the hyperparameters. The learning of a network is separated into three phases: training, validation and testing, with the images split accordingly.


Figure 26: Example of Deep Learning architecture: GoogleNet

Figure 27: a) Number of papers including machine learning techniques (CNN = convolutional neural network) b) Number of papers depending on the task

During the training phase, the weights and biases are updated at each step in order to reach the best model. The learning can be supervised or unsupervised [7]. Supervised learning includes labels, which represent the desired output for an input. Thus, every time we show a new input to the network (a CT image in our case), we also provide the output it should return (benign or malignant in our case), and then, thanks to a specific optimizer, the weights and biases are automatically updated to reach the best performance.

For example, in the case of nodule classification, every time we show a new nodule to the network, we have to provide the desired output, which here would be the type of this nodule (solid, sub-solid, ground glass). In unsupervised learning, no labels are provided, and the network makes its own decisions about how to organize the data.
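
A rough sketch of one supervised training pass is given below, assuming PyTorch; the toy data, toy model and hyperparameters are placeholders and do not correspond to the code used in this project.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Toy stand-ins for the real images, labels and network (illustrative only).
    images = torch.randn(64, 10)            # 64 samples with 10 features each
    labels = torch.randint(0, 2, (64,))     # desired outputs, e.g. benign / malignant
    train_loader = DataLoader(TensorDataset(images, labels), batch_size=8)
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

    criterion = nn.CrossEntropyLoss()                           # error w.r.t. the labels
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # the specific optimizer

    for batch_images, batch_labels in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_images)              # forward pass
        loss = criterion(outputs, batch_labels)    # compare prediction with the label
        loss.backward()                            # backpropagation of the error
        optimizer.step()                           # weights and biases are updated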

In the validation phase, the goal is to assess the performance of the network by showing it input data that it has never seen before. We can then assess how well the network behaves with new and unknown data, which allows engineers to find the best model.


The testing phase is the last phase: once the best model has been chosen through the validation phase, the test set is used to evaluate the general performance of the model.

A.2.3 Transfer learning

As mentioned above, deep learning enables us to work with features that are computed from the database rather than imposed by a model or selected by a human.

Such techniques offer the advantage of selecting optimal features for the task and enable a higher number of degrees of freedom for the classifier than a model ever would, but training such systems and managing this large number of degrees of freedom become a challenge.

Transfer learning is a method that is more and more used in the field. The main idea is that the features learned by one system are reused and adapted to another system. It enables better convergence for complex tasks, or for tasks where the amount of available data is too low [26, 25, 29, 35].

For example, it is possible to use the features of a network which has learned to classify goldfish, giant schnauzer, tiger cat, etc. (AlexNet trained on ImageNet [17]) in order to classify medical images, as many features (for instance edges) are shared by every kind of image. Using transfer learning saves a lot of computational time compared to training a network from no prior knowledge, initialized with random weights.

Several methods exist to adapt these existing models to a specific application [31].

A.2.3.1 Feature Extractor

It consists of removing the last fully-connected layer of a pre-trained network (Fig. 28). Pre-trained means that we keep the weights learned from a previous training performed on a set of general images. The remaining part of the network is then considered as a feature extractor. The last fully-connected layer is replaced by a linear classifier and trained with the specific set of images. The network learns how to classify specific images (for instance medical images) based on features learned from general images.
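
A minimal sketch of the feature-extractor approach is given below, assuming torchvision's ImageNet pre-trained ResNet-18 as an illustrative stand-in for whichever pre-trained network is available.

    import torch.nn as nn
    from torchvision import models

    # Load a network pre-trained on general images (ImageNet).
    model = models.resnet18(pretrained=True)

    # Freeze all existing weights: the network is kept as a fixed feature extractor.
    for param in model.parameters():
        param.requires_grad = False

    # Replace the last fully-connected layer by a new classifier for the specific
    # set of images (here assumed to have 2 classes, e.g. benign vs malignant).
    model.fc = nn.Linear(model.fc.in_features, 2)

    # Only the parameters of this new layer are updated during training.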

A.2.3.2 Fine-tuning

The second approach consists in replacing a larger part of the existing pre-trained network and fine-tuning the weights (Fig. 28). This is done with backpropagation over the replaced layers. It is possible to fine-tune the entire network; this takes more computational time, but remains shorter than training from no prior knowledge, as the weights are not initialized randomly. This method is based on the principle that a network becomes more and more specific with depth: if we retrain a sufficient number of layers, it is possible to erase the specificity learned from a previous dataset and adapt the network to the new dataset.
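
A corresponding fine-tuning sketch, again assuming torchvision's ResNet-18 purely as an illustration: the last block and the new classifier are retrained with a small learning rate while the earlier layers stay frozen.

    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(pretrained=True)

    # Freeze everything first...
    for param in model.parameters():
        param.requires_grad = False

    # ...then unfreeze the last residual block so its weights can be fine-tuned.
    for param in model.layer4.parameters():
        param.requires_grad = True

    # New task-specific classifier, trained from scratch.
    model.fc = nn.Linear(model.fc.in_features, 2)

    # A small learning rate avoids destroying the pre-trained weights.
    optimizer = torch.optim.Adam(
        filter(lambda p: p.requires_grad, model.parameters()), lr=1e-5)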

A.2.4 Challenges with deep learning

Deep learning is a promising and powerful tool, but its use and comprehension remain tricky. An engineer has to face many challenges in order to get results from these networks. The following are the main challenges that I will have to face during my master thesis, and that every engineer needs to think about while working with deep learning.

A.2.4.1 Architecture selection in transfer learning

Since the deep learning revolution of 2012, with the creation of the AlexNet network which won the ImageNet competition (a classification challenge over 1000 classes), numerous and varied network architectures have been created. Each model has its own specificities, and thus its own advantages and drawbacks. A good comprehension of them enables a user to choose the right one depending on the task to be achieved. Some comparisons have been made on the same dataset, as seen in Figure 29. Here are some of the well-known architectures broadly used:


Figure 28: a) AlexNet network b) AlexNet network with feature extractor c) AlexNet network with fine-tuning of the three last layers


Figure 29: Top 1 accuracy vs operations, size & parameters

AlexNet [17]

AlexNet is the network which changed the vision and use of deep neural networks by being the first network of such a large size. The network is used for image classification and, from a 256×256 image, gives a probability of belonging to one of 1000 classes. It uses large convolutions in order to extract spatial features from the images.

The main breakthrough of AlexNet was the use of GPUs for the first time to perform the training, which reduced the training time considerably.

VGG [30]

The main difference between VGG, developed in Oxford, and AlexNet is the use of a series of smaller 3×3 spatial convolutions. The number of parameters, and thus the power of the network, increases a lot, but so does the computation time.

GoogleNet

GoogleNet is a more recent network based on the concept of inception. An inception module is a parallel combination of different operations (convolutions for example) performed with a smaller number of parameters. It has been shown that parallelizing these operations leads to equivalent results with a reduced computation time.

(48)

Figure 30: Inception module
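
A simplified inception-style module is sketched below in PyTorch for illustration; the real GoogleNet module also uses 1×1 channel reductions and a pooling branch.

    import torch
    import torch.nn as nn

    class SimpleInception(nn.Module):
        """Parallel branches whose outputs are concatenated along the channel axis."""
        def __init__(self, in_channels):
            super().__init__()
            self.branch1 = nn.Conv2d(in_channels, 16, kernel_size=1)
            self.branch3 = nn.Conv2d(in_channels, 16, kernel_size=3, padding=1)
            self.branch5 = nn.Conv2d(in_channels, 16, kernel_size=5, padding=2)

        def forward(self, x):
            # The three convolutions process the same input "in parallel".
            return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)

    x = torch.randn(1, 32, 64, 64)
    print(SimpleInception(32)(x).shape)   # torch.Size([1, 48, 64, 64])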

ResNet [12]

Finally, one of the most famous networks is ResNet. Its differentiation is based on the idea that the output of a layer should feed the input of not only one but two successive layers (a skip, or residual, connection).

Figure 31: One output feeds two different inputs
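
A minimal residual block, sketched for illustration: the input bypasses the stacked convolutions and is added back to their output (the skip connection).

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            out = self.relu(self.conv1(x))
            out = self.conv2(out)
            return self.relu(out + x)   # skip connection: x is added to the block output

    x = torch.randn(1, 16, 32, 32)
    print(ResidualBlock(16)(x).shape)   # torch.Size([1, 16, 32, 32])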

Choosing the right network is therefore important and challenging, but the right network without the right data will never learn the desired output.

A.2.4.2 Number and quality of data: pre-processing

In deep learning, and more generally in machine learning, the amount and quality of the data have a strong influence on the performance of a network. Indeed, as with the human brain, the more data the network sees, the more experienced, and therefore the more accurate, it becomes on a particular task.

The amount of data is thus a key point, especially in the medical field where it is challenging to collect large datasets. Pre-processing is a key step in the success of a network on a larger scale. Indeed, CT scans are performed following different protocols depending on the machine, the hospital and also the user. All these differences result in differences in the images: resolution, centring of the region of interest, and different levels of contrast. But the network is trained on one set of images, and the goal is to use it on every CT scan in the future. Standard pre-processing steps include normalization (to show the network the same kind of images), data augmentation (rotation, rescaling, flipping, ...) and the elimination of outliers. One then feeds the network the same kind of images (same size, mean value, ...), which enables it to perform better. Data augmentation is used to increase the variability of the training images.
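
As an illustrative sketch of such pre-processing, assuming CT intensities in Hounsfield units and using NumPy; the clipping range and transforms below are examples, not the exact pipeline used in this work.

    import numpy as np

    def preprocess(volume_hu):
        """Clip and normalize a CT volume so the network always sees similar inputs."""
        clipped = np.clip(volume_hu, -1000, 400)      # keep the lung-relevant HU range
        normalized = (clipped - clipped.mean()) / (clipped.std() + 1e-8)
        return normalized.astype(np.float32)

    def augment(volume):
        """Simple data augmentation: a random flip increases training variability."""
        if np.random.rand() < 0.5:
            volume = np.flip(volume, axis=2)          # left-right flip
        return np.ascontiguousarray(volume)

    volume = np.random.randint(-1024, 1000, size=(64, 128, 128))   # fake CT volume
    network_input = augment(preprocess(volume))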

Moreover, it is important to use balanced data during training, validation and testing. In the case of classification, balanced data means having around the same amount of data for each class.

A.2.4.3 Time-processing (CPU and GPU)

Training time is a key aspect in the development of deep learning networks. A model can take more than one week to be trained, so it is necessary to use the right tools in order to accelerate this process. In order to get the best results, multiple experiments need to be done for each network.

The Central Processing Unit (CPU) is found in every computer, whereas a Graphics Processing Unit (GPU) is added in order to accelerate computation. Numerous differences exist between CPUs and GPUs but, in a nutshell, the GPU allows much faster computation while the CPU is easier to program. Nowadays, all deep learning packages include GPU integration, which enables a user to run a network efficiently on the GPU and thereby save a lot of computational time.
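
In practice, and assuming PyTorch as an example framework, moving the computation to the GPU is a one-line change per model or tensor:

    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = nn.Linear(10, 2).to(device)   # the weights live on the GPU if one is available
    x = torch.randn(8, 10).to(device)     # the data must be moved to the same device
    y = model(x)                          # the forward pass now runs on the GPU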

A.2.4.4 Overfitting

The main challenge of training a neural network with a lot of data is to deal with the bias-variance dilemma. Bias represents the error in the result due to wrong assumptions in the algorithm. Variance is the error due to fluctuations in the training set.

It is impossible to have both a low bias and a low variance, so it is necessary to find a compromise.

In order to apply the model for general use, one must avoid overfitting.

Overfitting occurs when the model learns too many particularities of the training data (for example a specific noise due to the CT acquisition at a given hospital) and thus does not perform well on unseen data from another dataset.

One easy way to track overfitting is to monitor the validation error (or validation loss): as soon as it starts increasing again, it means that what the network has just learned is specific to the training data and cannot be generalized.
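
A rough sketch of this kind of monitoring is given below; train_one_epoch, evaluate, the data loaders, model and max_epochs are hypothetical placeholders for whatever training and validation code is used.

    import torch

    best_val_loss = float("inf")
    patience, epochs_without_improvement = 5, 0

    for epoch in range(max_epochs):                  # max_epochs: placeholder
        train_one_epoch(model, train_loader)         # placeholder training step
        val_loss = evaluate(model, val_loader)       # placeholder validation step

        if val_loss < best_val_loss:                 # validation loss still improving
            best_val_loss = val_loss
            epochs_without_improvement = 0
            torch.save(model.state_dict(), "best_model.pt")   # keep the best model
        else:                                        # validation loss increased:
            epochs_without_improvement += 1          # the new learning may be overfitting
            if epochs_without_improvement >= patience:
                break                                # stop before overfitting gets worse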

Once all these challenges have been taken into account, the performance of a model has to be evaluated using evaluation metrics in order to be able to compare different models and results.

References
