DEGREE PROJECT IN MEDICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Deep Learning Method used in Skin Lesions Segmentation and Classification


Contents

1 Introduction
  1.1 Problem statement
  1.2 Aim of this project
2 Methods
  2.1 Data Pre-processing
    2.1.1 Data format
    2.1.2 Data resize
    2.1.3 Data augmentation
    2.1.4 Data classes balance
  2.2 Neural network architectures
    2.2.1 Segmentation neural network
    2.2.2 Classification neural network
3 Experiments
  3.1 Tools
  3.2 Experiment 1 - Supervised learning segmentation: U-net
  3.3 Experiment 2 - Adversarial aided supervised learning segmentation
  3.4 Experiment 3 - Supervised segmentation: Deeplab
  3.5 Experiment 4 - Supervised learning three class classification
  3.6 Experiment 5 - Combine segmentation and classification
  3.7 Experiment 6 - Supervised learning on two class classification and combine it with segmentation
  3.8 Experiment 7 - Supervised learning on two class classification (training on segmentation output) and test it on real cropped ISIC test data
4 Results
  4.1 Experiment 1
  4.2 Experiment 2
  4.3 Experiment 3
  4.4 Experiment 4
  4.5 Experiment 5
  4.6 Experiment 6
  4.7 Experiment 7
5 Discussion
  5.1 Experiment 1 - Supervised learning segmentation: U-net
  5.2 Experiment 2 - Adversarial aided supervised learning segmentation
  5.3 Experiment 3 - Supervised segmentation: Deeplab
  5.4 Experiment 4 - Supervised learning three class classification
  5.5 Experiment 5 - Combine segmentation and classification
  5.6 Experiment 6 - Supervised learning on two class classification and combine it with segmentation
  5.7 Experiment 7 - Supervised learning on two class classification (training on segmentation output) and test it on cropped ISIC test data
  5.8 Overall discussion
6 Conclusion
  6.1 Our Contribution
  6.2 Limitation and future work
7 Acknowledgement
Appendices
A State of the art review in Skin Lesions Segmentation and Classification
  A.1 Introduction
  A.2 Background on skin lesions
    A.2.1 Medical background in skin cancer
  A.3 Dataset for skin lesion classification and segmentation
  A.4 Traditional computer vision and machine learning methods on skin lesion
    A.4.1 Traditional methods on skin lesion classification
    A.4.2 Traditional methods on skin lesion segmentation
  A.5 Deep learning method
    A.5.1 Convolutional neural networks
    A.5.2 Deep learning methods used in classification and segmentation
  A.6 Data labeling tool


Deep Learning Method used in Skin Lesions Segmentation and Classification

Fengkai Wan

August 16, 2018

Abstract

Malignant melanoma (MM) is a type of skin cancer that is associated with a very poor prognosis and can often lead to death. Early detection is crucial in order to administer the right treatment successfully, but currently requires the expertise of a dermatologist. In the past years, studies have shown that automatic detection of MM is possible through computer vision and machine learning methods. Skin lesion segmentation and classification are the key methods in supporting automatic detection of different skin lesions.

Compared with traditional computer vision as well as other machine learning methods, deep neural networks currently show the greatest promise both in segmentation and classification.

In our work, we have implemented several deep neural networks to achieve the goals of skin lesion segmentation and classification, and we have applied different training schemes. Our best segmentation model achieves a pixel-wise accuracy of 0.940, a Dice index of 0.867 and a Jaccard index of 0.765 on the ISIC 2017 challenge dataset. This matches the Jaccard index of the official state of the art model (pixel-wise accuracy 0.934, Dice index 0.849, Jaccard index 0.765) and surpasses its pixel-wise accuracy and Dice index. We have also trained a segmentation model with the help of an adversarial loss, which improved the baseline model slightly. Our experiments with several neural network models for skin lesion classification achieved varying results.

We also combined both segmentation and classification in one pipeline, meaning that we were able to train the most promising classification model on pre-segmented images. This resulted in improved classification performance. The binary (melanoma or not) classification from this single model, trained without extra data or clinical information, reaches an area under the curve (AUC) of 0.684 on the official ISIC test dataset.

Our results suggest that automatic detection of skin cancers through image analysis shows significant promise in early detection of malignant melanoma.

Keywords: Deep neural network, Skin lesion, Segmentation, Classification.

1 Introduction


Clinical diagnosis is usually performed through a visual check with the aid of a dermatoscope, a simple instrument that magnifies and improves lighting for the area being examined. There are checklist rules to help the diagnosis; see the background chapter in the appendix for further information. The gold standard for diagnosis is the histopathological examination of a biopsy sampled from a suspicious lesion. Due to the limited number of experienced experts around the world, patients sometimes receive a false diagnosis [3], which can be fatal.

Computer aided automatic diagnosis (CAD) methods have been proposed to help the clinical diagnosis. CAD can be divided into two categories here:

1. Traditional computer vision and machine learning methods.
2. Deep learning methods.

Traditional computer vision and machine learning methods

The traditional computer vision and machine learning methods for classification tasks focus more on low level visual features (color, edges, texture and so on) [4]. Using such features simplifies the task for a machine, which can then use them to make further predictions. For example, one team used a support vector machine (SVM) over features extracted by a HOG feature extractor [5]. Their SVM model achieved a sensitivity of 98.21%, a specificity of 96.43% and an accuracy of 97.32% on a melanoma binary classification task. However, their model depends heavily on the HOG feature extractor, which is complicated and requires a lot of computing resources.

There has also been a lot of great work applying traditional computer vision and machine learning methods to the skin lesion segmentation task. These works focused on different ways to create threshold values for pixels or areas of the image, since the skin lesion usually looks darker than human skin and lighter than hair. This can be problematic depending on the lighting environment and camera settings, which led to an increased focus on improving the quality and consistency of the input data. Enhancing the contrast and color of the images was also tested to get better results [6]. Some also tried improving the thresholding algorithm [7]. Apart from setting thresholds and modifying image content, there is also work focusing on the great variance found around the border between the skin lesion and the surrounding normal skin. This variance creates a large gradient around the border of the lesion; Zhou et al. used this in a gradient flow based algorithm to perform the skin lesion segmentation task [8].

Deep learning methods

Deep learning methods have achieved better results in general image segmentation and classification tasks in recent years. Many researchers have applied deep learning methods to skin lesion segmentation and classification, and many have achieved much better performance than traditional methods. A detailed introduction to the deep learning methods used in this area is given in appendix A.5.


See A.5.2 for details of the different evaluation metrics; those metrics are also used in the ISIC challenges of 2016 and 2017.
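To make these metrics concrete, the following sketch (ours, not challenge code) computes the pixel-wise accuracy, sensitivity, specificity, Dice index and Jaccard index from binary numpy masks (1 = lesion, 0 = background):

import numpy as np

def segmentation_metrics(gt, pred):
    # Count the four pixel-wise outcomes between ground truth and prediction.
    tp = np.sum((gt == 1) & (pred == 1))
    tn = np.sum((gt == 0) & (pred == 0))
    fp = np.sum((gt == 0) & (pred == 1))
    fn = np.sum((gt == 1) & (pred == 0))
    return {
        'accuracy': (tp + tn) / gt.size,
        'sensitivity': tp / (tp + fn),
        'specificity': tn / (tn + fp),
        'dice': 2 * tp / (2 * tp + fp + fn),
        'jaccard': tp / (tp + fp + fn),
    }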

For the classification task in the ISIC challenge there is a large class imbalance within the original data (see 2.1.4), as well as a lack of sufficient data.

The state of the art results that we have identified usually depend on collecting or creating more data to ensure a balanced dataset, generally by collecting additional clinical diagnoses from doctors. Other common efforts include ensemble learning methods, where multiple models work together and provide predictions through a voting system [10] [11] [12].

1.1 Problem statement

Nowadays, the majority of deep learning approaches use FCN based methods for segmentation and ensemble learning methods for classification. FCN has issues with boundary precision in image segmentation, and ensemble learning for classification requires significant amounts of both time and computing resources. There is a need for a more elegant method requiring less computing resources and data. The method should be validated on the most broadly accepted public dataset: the ISIC challenge dataset.

ISIC challenge 2017 dataset

We utilized the ISIC challenge 2017 dataset, which has been used as a benchmark in several studies. This dataset has 2000 training images, 150 validation images and 600 testing images. They are all paired with binary masks which can be used as ground truth for the segmentation task, see Figure 1. Note that "binary" means that there are only two values in the mask images, 1 and 0, where 1 indicates the highlighted part of a lesion and 0 indicates the background or normal skin. The images also have category labels that indicate the skin lesion type. The three categories of skin lesion are melanoma, nevus and seborrheic keratosis, see Figure 2 for visual interpretation.

Our hypothesis

As suggested by [13], we want to propose a working pipeline that can improve the final classification performance by utilizing the segmentation model to crop the original image before performing the classification task. See the pipeline in Figure 3.

1.2 Aim of this project


Figure 1: ISIC dataset original images and their binary masks (images in the first row are original images; images in the second row are their corresponding masks).

2 Methods

We applied 7 deep neural networks for segmentation and classification tasks.

2.1 Data Pre-processing

2.1.1 Data format

The original images and masks in the ISIC challenge datasets are stored in JPEG and BMP format. It is possible to use these files directly and feed them into TensorFlow models.

However, for an improved data feeding flow we used a specific input format: TFRecord. This format let us encode all the information needed into a single file. With TFRecord, we never needed to read files from different directories or worry about inconsistent file formats.
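As a minimal illustration of this format, the following sketch packs one image/mask/label triple into a TFRecord file using the standard tf.train.Example protocol buffer; the file names and feature keys are illustrative assumptions, not the exact ones used in the thesis code:

import tensorflow as tf

def make_example(image_bytes, mask_bytes, label):
    """Encode one image/mask/label triple as a tf.train.Example."""
    def _bytes(v):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[v]))
    def _int64(v):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=[v]))
    return tf.train.Example(features=tf.train.Features(feature={
        'image': _bytes(image_bytes),  # JPEG-encoded dermoscopic image
        'mask': _bytes(mask_bytes),    # encoded binary mask
        'label': _int64(label),        # lesion class as an integer
    }))

# Hypothetical file names, for illustration only.
with tf.io.TFRecordWriter('isic_train.tfrecord') as writer:
    with open('ISIC_0000000.jpg', 'rb') as f_img, open('ISIC_0000000_mask.bmp', 'rb') as f_msk:
        example = make_example(f_img.read(), f_msk.read(), 1)
        writer.write(example.SerializeToString())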


Figure 2: ISIC dataset original images belonging to the three classes. First row: melanoma; second row: nevus; third row: seborrheic keratosis.

One potential downside is the extra storage space taken up by the TFRecord files.

In our work, we mostly used the TFRecord format because of its convenience in handling different files, and we were not limited by the potential storage problem.

2.1.2 Data resize

The image size in the ISIC challenge varies a lot. For easier implementation, we followed the input image size mentioned in the state of the art paper [9]: all images and masks were resized to 192×256. Note that dermoscopic images were resized with bilinear interpolation while binary masks were resized with the nearest neighbor method, so that the resize operation did not introduce extra noise.

There are two ways to do the resizing: either resize the images first and then store them in the TFRecord file, or generate the TFRecord first and then resize the tensor array before feeding the data into the models.
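A sketch of the interpolation choice described above, assuming TF 2.x style tf.image calls on decoded image tensors; bilinear for the photo, nearest neighbor for the mask so that no new (non-binary) values are introduced:

import tensorflow as tf

def resize_pair(image, mask, height=192, width=256):
    # Bilinear interpolation for the dermoscopic image.
    image = tf.image.resize(image, [height, width], method='bilinear')
    # Nearest neighbor for the binary mask: values stay exactly 0 or 1.
    mask = tf.image.resize(mask, [height, width], method='nearest')
    return image, mask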


Table 1: Performance of the U-net model with different sizes of TFRecord files.

Index                          Original TFRecord   Smaller TFRecord
Time for one epoch (min)       20.1                4.5
Accuracy on test set           0.831               0.831
Sensitivity on test set        0.797               0.781
Specificity on test set        0.924               0.925
Dice coefficient on test set   0.848               0.846
Jaccard index on test set      0.729               0.725

2.1.3 Data augmentation

Due to the limited number of images, data augmentation (artificially modifying the original images to increase the amount of data available) is needed for our work. With proper data augmentation we can synthesize more real world cases from the available dataset. With this additional data fed into the neural network, the model can learn important properties like location invariance, rotation invariance, contrast invariance and so on.

Based on a careful study of the datasets, we designed several augmentations for the tasks. These can be divided into two categories: mask preserving augmentation and mask non-preserving augmentation.

Mask preserving augmentation

When the data augmentation performed on the original image does not influence the mask's content, the augmentation is called mask preserving. The augmentations we applied in this category are: random contrast variance within the range of 50% ∼ 90%; random brightness variance within a delta of ±0.1; and normalization of the original images' pixel values to the range of 0 ∼ 1.

Mask non-preserving augmentation

In contrast to mask preserving augmentation, if we apply a non-preserving method to the original images we need to perform the same augmentation on the mask images.

The augmentations we applied in this situation are: random flips left and right; random flips up and down; random cropping of part of the image, with the cropped part resized to 192 × 256; random rotation of images within ±10°; and so on.

Our augmentation setting

For consistency, the data augmentation methods that we used were kept the same, as shown in Table 2.


In Table 2, random flip (left/right and up/down) means the images and corresponding masks have a 50% probability of being flipped; random contrast and brightness mean only the images (not the masks) are changed in contrast or brightness within the range of 50% ∼ 90% and ±0.1 respectively; random zoom means the images and masks are zoomed in or out within a range of 0.8 ∼ 1.2.

Table 2: Augmentation details.

Augmentation              Percentage or ratio range
Random rotation           ±10°
Random flip: left/right   50%
Random flip: up/down      50%
Random contrast           50% ∼ 90%
Random brightness         ±0.1
Random zoom               0.8 ∼ 1.2

2.1.4 Data classes balance

Class imbalance is a very serious problem for our classification task, as it can cause a lot of issues for machine learning methods [14]. The composition of the ISIC training dataset is shown in Table 3, where we can see that the numbers of images in the different classes are unbalanced. This makes it harder to learn robust features from the training set.

Table 3: Composition of the ISIC challenge training dataset.

Class                                 Images
Non-melanoma: seborrheic keratosis    254
Non-melanoma: nevus                   1372
Melanoma                              374

There are two ways to handle the class imbalance problem.

Adding more minority data by data augmentation

We can save augmented images so as to increase the volume of the minority classes to the same size as the majority class, balancing the whole training dataset; a sketch of this idea follows below.
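The sketch below illustrates the idea in plain Python, assuming per-class lists of image paths; minority-class images are re-sampled (and later augmented) until every class matches the majority class size, using the counts from Table 3:

import random

def oversample(class_files):
    """class_files: dict mapping class name -> list of image paths."""
    target = max(len(files) for files in class_files.values())
    balanced = {}
    for name, files in class_files.items():
        # Draw extra samples (with replacement) from the minority class.
        extra = [random.choice(files) for _ in range(target - len(files))]
        balanced[name] = files + extra
    return balanced

# Placeholder path lists sized after Table 3.
balanced = oversample({'nevus': ['n.jpg'] * 1372,
                       'melanoma': ['m.jpg'] * 374,
                       'seborrheic_keratosis': ['s.jpg'] * 254})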

Adding more melanoma data to keep balance

A more attractive way to solve the imbalance problem is to find additional melanoma images from other sources, creating a truly balanced training dataset.


2.2 Neural network architectures

Our work aims at both the segmentation and classification tasks. We will introduce the different kinds of neural networks that we have replicated: first the segmentation network structures and then the classification network structures.

2.2.1 Segmentation neural network

U-net structure based neural network

We have utilized two methods to perform skin lesion segmentation using the "U-net" proposed in [15] as a segmentation generator. (1) The naive supervised method is simply a U-net trained with a supervised loss function that measures the difference between the ground truth and predicted masks. Our modification is that we applied atrous convolutions in different convolutional layers so that the model gets a larger viewing area without increasing the number of parameters [16]; a sketch follows below. (2) The other method trains the neural network with the aid of an adversarial neural network. This idea was proposed by Son et al. [17], whose segmentation network achieved better results with the help of different discriminators.
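The following sketch shows an atrous (dilated) convolution of the kind used in our modified U-net; the layer sizes are illustrative assumptions, not the exact thesis configuration. A 3×3 kernel with dilation rate 2 covers a 5×5 area while keeping the 3×3 parameter count:

import tensorflow as tf

# Dilated 3x3 convolution: larger receptive field, same parameter count.
layer = tf.keras.layers.Conv2D(filters=64, kernel_size=3,
                               dilation_rate=2,  # 3x3 kernel sees a 5x5 area
                               padding='same', activation='relu')
x = tf.random.uniform([1, 192, 256, 3])  # one 192x256 RGB image
print(layer(x).shape)  # (1, 192, 256, 64): spatial size is preserved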

Deeplab structure based neural network

With the experiments on the U-net structure done, we tried another approach to segmentation: "Deeplab". The idea behind this design is to utilize multi-scale information and focus more on the boundary of the object. We implemented two Deeplab structures, Deeplab v3 [18] and Deeplab v3+ [19]. Both of our Deeplab implementations were built on a Resnet [20] with 101 convolutional layers. The differences between these two models and the earlier implementations [16] are: (1) Deeplab v3 removed the conditional random field algorithm of Deeplab v1 [16], which needed a lot of computing resources, and addressed the multi-scale feature map inconsistency problem by applying Atrous Spatial Pyramid Pooling (ASPP) on the output of Resnet. (2) Deeplab v3+ further improved the model by replacing the Resnet base with Xception [21], which achieves better results by deploying depthwise separable convolutions without adding more parameters than the Inception v4 model [22]. To further improve performance on the boundaries of the predicted mask, it also incorporates an encoder-decoder structure similar to U-net [15], achieving improved results.

2.2.2 Classification neural network

Inception v3


MobileNet

MobileNets [25] is a relatively small but effective model structure, leveraging depthwise separable convolutions to get smaller models. We implemented this model and performed the same experiments as for Inception v3 to compare their performance.

NASNet

The final neural network model we tested is NASNet [24]. This structure was not designed by humans but by a network, as described in the original paper: the network learns the best way to structure itself with the help of the massive computing resources of Google. The resulting network, NASNet, went on to beat the state of the art in large dataset classification and has also been shown to generalize to other datasets.

Our idea is to apply the self-learned NASNet model to our classification tasks and fine-tune the final layers of the model to adapt it to our specific task; a sketch of this scheme follows below.
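A minimal sketch of this fine-tuning scheme, assuming a Keras-style NASNetMobile backbone with ImageNet weights (an illustration of the idea, not the exact thesis setup): the pretrained base is frozen and only a new classification head is trained.

import tensorflow as tf

base = tf.keras.applications.NASNetMobile(include_top=False,
                                          weights='imagenet',
                                          input_shape=(224, 224, 3),
                                          pooling='avg')
base.trainable = False  # freeze the pretrained layers

# New head: three skin lesion classes (melanoma, nevus, seborrheic keratosis).
head = tf.keras.layers.Dense(3, activation='softmax')
model = tf.keras.Sequential([base, head])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])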

3 Experiments

Our experiments were performed largely without delicate fine-tuning of the hyper-parameters. The reason is that we had limited computing resources and could not run parallel experiments to compare the slight differences between parameter settings.

Note that our work was also limited by a small dataset. To compensate, we applied image augmentation tricks such as random rotation, flip, crop, and contrast and brightness variation (as described in Table 2) to all the models that we trained.

Our experiments were carried out by applying U-net based supervised learning segmentation and then comparing the performance between different network details, for example the kernel size of the convolutional layers.

We then developed an adversarial network aided supervised method and compared the results with the baseline supervised learning method. The segmentation generator was the same U-net structure as before.

With more computing resources, we developed Deeplab structure based models to further improve skin lesion segmentation. We implemented both Deeplab v3 and Deeplab v3+ and compared the results with our other segmentation methods and the state of the art.

We also applied several convolutional neural networks to perform three class or two class classification.

We compared the three class classification performance among the different models, and we also performed two class classification to simplify the problem.

Based on the results from segmentation and classification, we tried utilizing the skin lesion shape information from segmentation before doing the classification.


3.1 Tools

Hardware

Our work was mainly done on two computers: one with a GeForce GTX 1080 (8 GB memory) graphics card, and an Amazon Machine Learning Instance (AMI) server with one Tesla K80 (12 GB memory) graphics card.

Software

Our neural networks were mainly developed with TensorFlow, a deep learning framework developed by Google [26].

3.2 Experiment 1 - Supervised learning segmentation: U-net

Due to limited computing resources, both the number of parameters in the model and the training batch size were limited. The first experiment was done to find the best U-net structure with a completely supervised learning method.

As we can see in Table 4, we applied three different structures and also a hole-filling algorithm. All the models in this subsection were trained with a batch size of 16, a learning rate of 0.001 and the Adam optimizer (a gradient descent method with a self-adjusting learning rate).

We only changed the convolutional kernel size of the top two layers of each U-net. As indicated in Table 4, "Unet 53" on the third line stands for the U-net model with a 5×5 convolutional kernel in the first layer and a 3×3 kernel in the second layer. The fourth line stands for the same model structure with the final output run through a hole-filling algorithm.

3.3 Experiment 2 - Adversarial aided supervised learning segmentation

In the second experiment, following the spirit of [17], we applied an adversarial discriminator to the segmentation generator: the same vanilla U-net mentioned in 3.2. In this case the generator updates its parameters not only with respect to the supervised Dice loss but also to the loss produced by the adversarial discriminator, which is trained to classify whether the segmentation maps are produced by the generator or are ground truth. To help the discriminator distinguish the segmentation masks, we multiplied the segmentation masks element-wise with the original input images. This ensures that the input contains more information, so the discriminator can work better.


The models were trained with a batch size of 16, the same as in subsection 3.2. When updating the parameters of the generator we set the weight for the discriminator loss to 0.1 and the weight for the Dice loss to 1, so that the discriminator only had a slight influence on the generator; a sketch of this weighted objective follows below.
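The following sketch shows the weighted generator objective just described, with the Dice loss weighted 1 and the adversarial loss weighted 0.1; disc is a hypothetical discriminator applied to the element-wise product of image and predicted mask:

import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-6):
    inter = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred)
    return 1.0 - (2.0 * inter + eps) / (union + eps)

def generator_loss(y_true, y_pred, image, disc):
    masked = image * y_pred  # element-wise product of image and predicted mask
    d_logits = disc(masked)  # discriminator judges the masked image
    # Generator tries to make the discriminator output "ground truth" (1).
    adv = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(d_logits), logits=d_logits))
    return 1.0 * dice_loss(y_true, y_pred) + 0.1 * adv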

3.4 Experiment 3 - Supervised segmentation: Deeplab

To further increase the model capacity we applied the DeepLab [16] style structure to the segmentation task. Following Deeplab v3 [18] and Deeplab v3+ [19], we applied these two models to the skin lesion segmentation task.

Both of these models used a pre-trained Resnet-101 [20] model and utilized atrous spatial pyramid pooling (ASPP) to capture the multi-scale information contained in the images. Deeplab v3+ [19] also applied an encoder-decoder structure to boost the performance of the model at the fine boundaries of the skin lesions.

The experiments were carried out with a batch size of 16 and a learning rate of 0.007 with exponential decay.

3.5 Experiment 4 - Supervised learning three class classification

We have applied three kinds of classification neural networks to our classification task. The results are shown in Table 8.

The class imbalance within the 2017 ISIC challenge training dataset can mislead the model into predicting everything as the majority class. To avoid this, we combined all the available data (the training, test and validation sets), then divided all images by disease class. After that we generated more data through augmentation, with random flips, rotation, crops and non-geometrical augmentations, for the melanoma and seborrheic keratosis images to match the number of images in the nevus class. We then produced training, validation and test datasets at a ratio of 8:1:1, so that all the datasets were balanced.

We applied three different networks in this experiment. In Table 8, no aug means no data augmentation was applied during training, while with aug means augmentation was applied during training. The latter approach was very time consuming, so we chose not to apply further image augmentation while performing NASNet fine-tuning.

3.6 Experiment 5 - Combine segmentation and classification

In this experiment we tried to combine segmentation and classification together.


Figure 3: The pipeline of experiment 7.

In our experiment setting, the rescaling factor in Table 9 is multiplied with the perimeter of the original bounding box to get the enlarged side length for both the height and width of the enlarged bounding box.
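One plausible reading of this cropping step is sketched below with a binary numpy mask: the tight bounding box around the lesion is grown by a fraction of its side lengths (the enlargement rate) and clipped to the image border before cropping:

import numpy as np

def crop_with_mask(image, mask, enlarge=0.2):
    # Tight bounding box around the nonzero (lesion) pixels.
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max()
    x0, x1 = xs.min(), xs.max()
    # Grow each side by the enlargement rate, clipped to the image border.
    dy, dx = int((y1 - y0) * enlarge), int((x1 - x0) * enlarge)
    y0, x0 = max(0, y0 - dy), max(0, x0 - dx)
    y1 = min(image.shape[0] - 1, y1 + dy)
    x1 = min(image.shape[1] - 1, x1 + dx)
    return image[y0:y1 + 1, x0:x1 + 1]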

3.7 Experiment 6 - Supervised learning on two class classification and combine it with segmentation

Both the melanoma and seborrheic keratosis classes are affected by class imbalance: their numbers are much smaller than those of the nevus class. Since melanoma is the deadliest disease among the three classes, we focus on classifying melanoma versus non-melanoma images. We rearranged the dataset into melanoma and non-melanoma classes. The training strategy is the same as for Experiments 4 and 5: we only fine-tune the final two layers. We also tried to combine segmentation and classification. All the results are shown in Table 10.

3.8 Experiment 7 - Supervised learning on two class classification (training on segmentation output) and test it on real cropped ISIC test data


In this experiment the classification model was trained on images cropped by the segmentation model and tested on real cropped ISIC test data; see Figure 3. The performance is shown in Table 11.

4 Results

4.1 Experiment 1

Table 4 shows the results of Experiment 1. The best Dice score, accuracy and Jaccard index are achieved with the "Unet 53" method, the "Unet 53 + Post" method delivers the best sensitivity, and the best specificity is achieved with the "Unet 55" method.

Table 4: Experiment results of different U-net models on the ISIC challenge 2017 test dataset (Acc: accuracy; Sensi: sensitivity; Speci: specificity; JC: Jaccard index; Dice: Dice index).

Methods           Acc     Sensi   Speci   JC      Dice
Unet 55           0.853   0.777   0.922   0.706   0.836
Unet 55 + Post    0.851   0.781   0.917   0.703   0.835
Unet 53           0.864   0.817   0.913   0.729   0.849
Unet 53 + Post    0.864   0.819   0.909   0.727   0.849
Unet 33           0.861   0.807   0.911   0.722   0.843
Unet 33 + Post    0.861   0.807   0.913   0.721   0.841

4.2 Experiment 2

Table 5 shows the results of Experiment 2. The best Dice score is achieved with the "Unet 33 + dlr" method, "Unet 33 + dlr + po" delivers the best accuracy, sensitivity and Jaccard index, and the best specificity is achieved with the "Unet 33 + slr + po" method.

4.3 Experiment 3

Table 6 and Table 7 show the results of Experiment 3. In Table 6, the best Dice score, accuracy and Jaccard index are achieved with the "Deeplab v3+ (224 × 320)" method, "Deeplab v3+ (192 × 256)" delivers the best sensitivity, and the best specificity is achieved by the "Deeplab v3" method. In Table 7, the best sensitivity, Jaccard index and Dice are achieved with the "Rashika Mishra" method, while "Deeplab v3+ (224 × 320)" delivers the best accuracy and specificity.

4.4 Experiment 4


Table 5: Experiment results of different U-net models with adversarial loss on the ISIC challenge 2017 test dataset (slr: same learning rate for generator and discriminator; dlr: different learning rates for generator and discriminator; po: with hole-filling algorithm).

Methods              Acc     Sensi   Speci   JC      Dice
Unet 53 + slr        0.858   0.837   0.883   0.716   0.842
Unet 53 + slr + po   0.859   0.832   0.886   0.718   0.842
Unet 53 + dlr        0.859   0.810   0.902   0.718   0.841
Unet 53 + dlr + po   0.861   0.813   0.906   0.721   0.841
Unet 33 + slr        0.863   0.803   0.921   0.727   0.846
Unet 33 + slr + po   0.865   0.803   0.923   0.731   0.847
Unet 33 + dlr        0.868   0.839   0.898   0.736   0.856
Unet 33 + dlr + po   0.871   0.843   0.899   0.742   0.855

Table 6: Experiment results of different Deeplab models on the ISIC challenge 2017 test dataset.

Methods                       Acc     Sensi   Speci   JC      Dice
State of the art, Yuan [9]    0.934   0.825   0.975   0.765   0.849
Deeplab v3                    0.937   0.776   0.987   0.745   0.854
Deeplab v3+ (192 × 256)       0.938   0.835   0.971   0.762   0.865
Deeplab v3+ (224 × 320)       0.940   0.832   0.973   0.765   0.867

Table 7: Experiment results of different models on the ISIC challenge 2017 test+validation datasets (600+150 images).

Methods                       Acc     Sensi   Speci   JC      Dice
Rashika Mishra [28]           0.928   0.930   0.954   0.842   0.868
Unet 33 adv + dlr + po        0.857   0.779   0.928   0.713   0.846
Deeplab v3+ (192 × 256)       0.941   0.839   0.971   0.762   0.865
Deeplab v3+ (224 × 320)       0.943   0.835   0.973   0.765   0.867

Table 8 shows the results of Experiment 4, with and without training-time augmentation applied. The best accuracy is achieved by the "NASNet mobile + no aug + 811" method.

4.5 Experiment 5


Table 8: Experiment results (50000 steps) of different models on the ISIC challenge 2017 dataset (811 means an 8:1:1 ratio for training, validation and test).

Methods                          Accuracy
MobileNets + no aug + 811        0.535
MobileNets + with aug + 811      0.556
Inception v3 + no aug + 811      0.558
Inception v3 + with aug + 811    0.572
NASNet mobile + no aug + 811     0.637
NASNet large + no aug + 811      0.625

Table 9 shows the results of Experiment 5. The rows marked with Ground Truth are cropped images where the segmentation mask used for cropping is the ground truth of the test dataset. The best accuracy is achieved by the "No crop" method.

Table 9: Performance of NASNet mobile trained on fake aug data and tested on real test data from the 2017 ISIC challenge (fake aug: augmented images generated to balance the training data).

Methods                                 Accuracy
No crop                                 0.420
Crop with enlarge 2%                    0.322
Crop with enlarge 5%                    0.322
Crop with enlarge 20%                   0.370
Crop with enlarge 50%                   0.368
Crop with Ground Truth, enlarge 20%     0.347
Crop with Ground Truth, enlarge 50%     0.387
Crop with Ground Truth, enlarge 100%    0.380

4.6 Experiment 6

Table 10 shows the results of Experiment 6. In Table 10, with enlarge means the bounding box generated from the segmentation mask is enlarged by some percentage. The best accuracy is achieved by the "No crop" method.

Table 10: Performance of NASNet mobile trained on fake aug data and tested on the real test dataset from the 2017 ISIC challenge (melanoma/non-melanoma).


4.7 Experiment 7

Table 11 and Table 12 show the results of Experiment 7. The best accuracy in Table 11 is achieved by the "Sigmoid large enlarge 20% Crop ft final + cell11 + 17" method. In Table 12, the best accuracy and AUC score are achieved by the "RECOD Titans" method. In Table 11 the different experiment settings are noted by abbreviations in the method column. In Table 12 our best method is compared with the top 3 state of the art methods and the one whose performance is closest to ours, where AUC stands for area under the ROC (receiver operating characteristic) curve. AUC is the main assessment criterion of the ISIC challenge. The columns "models" and "train dataset" give the number of models in the ensemble and the total number of training images used, respectively.

Figure 4 shows the ROC curve of our best method. It was plotted by setting 100 uniformly spaced thresholds with values within 0 ∼ 1. The horizontal axis indicates the false positive rate (FPR) and the vertical axis the true positive rate (TPR). The FPR/TPR point pairs are calculated with respect to the thresholds; a sketch of this procedure follows below.
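The sketch below reproduces this procedure on illustrative arrays, sweeping 100 thresholds over [0, 1] and approximating the AUC by trapezoidal integration (our illustration, not the thesis plotting code):

import numpy as np

def roc_points(y_true, y_score, n_thresholds=100):
    pts = []
    for t in np.linspace(0.0, 1.0, n_thresholds):
        pred = y_score >= t  # classify as positive above the threshold
        tpr = np.sum(pred & (y_true == 1)) / np.sum(y_true == 1)
        fpr = np.sum(pred & (y_true == 0)) / np.sum(y_true == 0)
        pts.append((fpr, tpr))
    return sorted(pts)  # order by increasing FPR for integration

# Toy labels and scores for illustration.
pts = roc_points(np.array([0, 0, 1, 1]), np.array([0.1, 0.6, 0.35, 0.8]))
fpr, tpr = zip(*pts)
auc = np.trapz(tpr, fpr)  # area under the ROC curve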

Table 11: Performance of NASNet trained on cropped fake aug data and tested on the real (melanoma/non-melanoma) test dataset from the 2017 ISIC challenge (mobile: NASNet mobile model; large: NASNet large model; NoCrop: no crop or resize applied to the test data; enlarge 20% Crop: test data cropped and resized with a 20% enlarged bounding box; Sigmoid: model trained with sigmoid cross-entropy loss; ft final: fine-tuning the final layer of the model; ft final + cell11 + 17: fine-tuning the final layer and the last two cell layers of the model).

NASNet Methods                                          Accuracy
mobile NoCrop ft final                                  0.720
mobile enlarge 20% Crop ft final                        0.760
large NoCrop ft final                                   0.650
large enlarge 20% Crop ft final                         0.643
large NoCrop ft final + cell11 + 17                     0.762
large enlarge 20% Crop ft final + cell11 + 17           0.783
Sigmoid mobile NoCrop ft final + cell11 + 17            0.713
Sigmoid mobile enlarge 20% Crop ft final + cell11 + 17  0.772
Sigmoid large NoCrop ft final + cell11 + 17             0.758
Sigmoid large enlarge 20% Crop ft final + cell11 + 17   0.794

5 Discussion


Table 12: Comparison of our best model with the top 3 state of the art models on the real (melanoma/non-melanoma) test dataset from the 2017 ISIC challenge.

Methods                       Acc     AUC     Models   Train dataset
RECOD Titans [10]             0.872   0.874   7        7544
Lei Bi [11]                   0.858   0.870   2        3600
Kazuhisa Matsunaga [12]       0.828   0.868   1        3444
Sigmoid NASNet large (ours)   0.794   0.684   1        2000
Top 18, merphee [29]          0.760   0.684   1        2000

5.1 Experiment 1 - Supervised learning segmentation: U-net

Based on the results of Experiment 1: (1) The U-net model performs better when its top two layers are relatively smaller. This might be because with smaller convolutional kernels the model can capture finer details like the edge of the lesion. (2) A further interesting finding is that by adding one layer with a 5 × 5 kernel, the model gets a slightly better result than with 3 × 3 kernels across all layers. This could be because the top layer captures a larger field of view, which helps build the rough shape of the prediction. (3) The post-processing only improves the sensitivity of these models and causes a slight drop in accuracy, Jaccard and Dice. This might be because the model predicts more openings around the boundaries but fewer holes in the center; as the post-processing fills up the boundaries to make the prediction smoother, some fine details of the lesion boundaries are lost.

5.2 Experiment 2 - Adversarial aided supervised learning segmentation


Figure 4: The ROC curve of our best model.

5.3 Experiment 3 - Supervised segmentation: Deeplab

Based on Table 6 and Table 5, we see that the performances are much better than those of the U-net structure based models.

We can see that: (1) Both models surpassed the state of the art of the 2017 ISIC challenge in terms of both accuracy and Dice. The Deeplab v3+ model trained on images with an input size of 224 × 320 achieved the same Jaccard index as the state of the art. As our hyper-parameters were set empirically, there might be room for further improvement if they were selected more carefully. (2) Deeplab v3+ performs slightly better than Deeplab v3, which supports the claim that the encoder-decoder structure helps the model perform better at the lesion boundaries, as described by the authors [19]. (3) Deeplab v3+ performs slightly better on relatively larger input images. The reason may be that the larger number of parameters in the Deeplab model, compared to the U-net model, helps it learn more features from larger inputs. Limited by computing resources, we could not perform further experiments with even larger input images; we propose to test this in future work.


Compared with Rashika Mishra [28] in Table 7, our best model achieves a comparable Dice but a worse Jaccard index. They achieved better results with the help of a well designed neural network structure and a 10-fold cross-validation training strategy. We should learn from them and try to optimize the model in future work.

5.4 Experiment 4 - Supervised learning three class classification

In this experiment (Table 8) we can see that MobileNet performed somewhat worse than Inception v3, and that training with augmentation helps performance. Compared with the relatively small performance difference between MobileNet and Inception v3, NASNet achieved around 8% higher accuracy than the other structures.

We also notice that NASNet mobile (which has fewer parameters and is designed for mobile applications) consistently performed better than NASNet large in the three class classification task. The reason could be that it is much easier for a model with more parameters to over-fit the class imbalanced dataset. Since we mixed all the data available in the 2017 ISIC challenge and changed the ratio of training, validation and testing datasets to 8:1:1, the test data is not exactly the same as in the official ISIC challenge. After selecting the best model for the classification task, NASNet, we started to use the exact same training, validation and testing datasets as the official ISIC challenge. To address the class imbalance issue we generated more data for melanoma and seborrheic keratosis in the training dataset to balance it.

We tested the performance of the NASNet mobile model on the 600 original ISIC test images. The results are shown in the first row (method "No crop") of Table 9.

5.5 Experiment 5 - Combine segmentation and classification

In this experiment, we proposed to combine the segmentation and classification models: crop the original image with the segmentation mask and classify the cropped area with the classification model.

To find the best performance we could get with this method, we used the ground truth mask, which equals the prediction of a perfect segmentation model that generates 100% accurate segmentation masks, to crop the original image (with different enlargement rates) and predicted based on that. As shown in the last three rows of Table 9, there is no performance gain from this pipeline.

We then list, in rows 2-5 of Table 9, the classification performance obtained by applying the mask produced by our best segmentation model instead of the ground truth mask. The results were worse, but we can see that different bounding box enlargement rates give different accuracies, and better performance can be obtained at certain enlargement settings.


5.6 Experiment 6 - Supervised learning on two class classification and combine it with segmentation

As Table 10 shows, the binary classification accuracy without first applying the segmentation mask is also much better than with it. This is consistent with the results in Experiment 5.

5.7 Experiment 7 - Supervised learning on two class classification (training on segmentation output) and test it on cropped ISIC test data

In this experiment we can see that the top accuracies on both the cropped test dataset and even the non-cropped test dataset are better than those of the models trained on non-cropped images (Table 11). The performance on the cropped test dataset is better than without cropping. One thing to note is that the hyper-parameters of our NASNet models are set as for the ImageNet challenge; we could get slightly better performance with more experiments to select optimized hyper-parameters for the ISIC challenge dataset. Based on a limited number of experiments, our best model for melanoma classification is shown in the last row of Table 11. The model structure is NASNet large, trained with sigmoid cross-entropy loss instead of softmax cross-entropy loss (a minimal sketch contrasting the two follows below), and obtained by fine-tuning the parameters of the final layers, cell 7 and cell 11 of the model.
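The sketch below contrasts the two loss choices on toy logits, assuming TensorFlow; the best model used the sigmoid form with a single logit per image:

import tensorflow as tf

# Sigmoid form: one logit per image, 1 = melanoma.
logits = tf.constant([[1.2], [-0.7]])
labels = tf.constant([[1.0], [0.0]])
sigmoid_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))

# Softmax alternative: two logits per image (non-melanoma, melanoma).
logits2 = tf.constant([[0.3, 1.5], [1.0, 0.3]])
labels2 = tf.constant([1, 0])
softmax_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels2, logits=logits2))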

Based on Experiments 5, 6 and 7, we can conclude that using the segmentation model to produce cropped training data and fine-tuning the classification model on it helps the model learn features with less disturbance, and therefore achieve better accuracy.

To evaluate our best melanoma classification model, we compared it with the top 3 state of the art models of the 2017 ISIC challenge in Table 12. Without extra training data, extra diagnostic information (as mentioned in [12]) or ensemble learning, our single model achieved relatively worse performance with respect to the AUC index but roughly the same accuracy. This indicates that the model is influenced by the class imbalance in the training set (the model prefers to predict the image as the majority class). See Figure 4 for the ROC curve.

5.8 Overall discussion

According to our experiments, in the segmentation task, which is not influenced by the class imbalance problem, we can get promising results that are comparable to the state of the art.

For the classification task, however, we are still limited by the class imbalance problem, even after trying different methods to alleviate it. Our experiments nevertheless remain valuable.

As mentioned in Experiment 4, NASNet outperforms the other models by a large margin. This suggests that NASNet can perform better than other human designed classification models on our skin lesion classification task.


In Experiments 5 and 6 the classification model was trained on non-cropped images. The accuracy when cropping the image with the segmentation mask before classification is not as good as without cropping. The reason could be that the data distribution changes after cropping and resizing the image, and the already trained classification model cannot perform well on that new distribution, which leads to worse predictions.

As it is very challenging to directly apply three class classification on an imbalanced dataset, we switched our focus to melanoma classification. As shown in Experiment 6, the prediction accuracy became numerically higher, but the attempt to make predictions directly on cropped test images also failed.

Finally, we tried to train the classification model with cropped training images. Knowing that our segmentation model is good enough to generate a bounding box around the lesion, we applied segmentation to all the training images, including the generated new images. After that, we cropped the images with the bounding boxes and trained a classification model on the cropped images. Table 11 shows that the model trained on cropped images performs much better on cropped test images, and the prediction accuracy on non-cropped images does not drop much. Based on Experiment 7, we can see that cropping the training data using our segmentation model before training the classification model can improve the prediction performance. The reason could be that the crop operation makes the model focus more on the lesion itself than on the surrounding disturbance, for example the hair in Figure 3. However, as the relatively low AUC and the ROC curve in Figure 4 indicate, our model still suffers from the class imbalance problem.

As specified in the ISIC challenge 2017, the evaluation metrics for the classification task are accuracy and AUC. However, as shown in Experiments 4, 5, 6 and 7, we selected the best model based only on the accuracy index. At a very late stage of the thesis, when we wanted to compare the classification model with the state of the art, we started to apply AUC as well as accuracy to evaluate the methods.

The accuracy is calculated with a threshold of 0.5 (if the prediction probability is higher than 0.5 the prediction is classified as 1, otherwise as 0). In real world applications, however, the threshold can be set to different values according to different requirements. The AUC value is introduced to evaluate the performance of the model across different thresholds. It would be more convincing to select the model based on AUC.

6 Conclusion

In this work, we proposed several neural networks for skin lesion segmentation and classification. After comparing with the state of the art in this area, we found that our segmentation models have better pixel level segmentation capacity than others.

With respect to the classification task, we found that NASNet outperforms the other human designed neural networks, showing that NASNet is also suitable for the skin lesion classification task.


6.1 Our Contribution

Our contributions are as follows. Firstly, our best segmentation model achieves performance comparable to the state of the art model without a complicated training strategy, extra data or time consuming post-processing. Secondly, we show that the NASNet model achieves better performance in skin lesion classification than other popular models. Thirdly, our pipeline, which first uses the segmentation model to crop the images and then makes predictions on the cropped images, achieves better accuracy when the classification model is also trained on cropped training images.

6.2 Limitation and future work

Unfortunately, limited by a lack of additional skin lesion data, we could not get rid of the class imbalance problem; we could only alleviate it by generating fake data with image augmentation techniques. A lack of computing resources (GPU and CPU) limited our opportunity to perform massive hyper-parameter fine-tuning for each model and to use ensemble learning to further improve the performance of our models.

As discussed in section 5.8, we could have obtained models with better performance by selecting them based on AUC. However, we did not have time to reformulate the program for the different models.

Our future work is to collect more clinical data to improve the performance (with respect to the AUC index) of the classification network, and to obtain more computing resources to validate our ideas.

7 Acknowledgement

I am very grateful to the International Skin Imaging Collaboration (ISIC) for their public datasets. I also want to thank my supervisor at Doctrin AB, the group supervisor at my university (KTH) and my peer reviewers for their great suggestions and comments.

References

[1] Codella NC, Nguyen QB, Pankanti S, Gutman D, Helba B, Halpern A, et al. Deep learning ensembles for melanoma recognition in dermoscopy images. IBM Journal of Research and Development. 2017;61(4):5-1.

[2] Ali ARA, Deserno TM. A systematic review of automated melanoma detection in dermatoscopic images and its ground truth data. In: Medical Imaging 2012: Image Perception, Observer Performance, and Technology Assessment. vol. 8318. International Society for Optics and Photonics; 2012. p. 83181I.

[3] Mason J. Review of Australian government health workforce programs. 2013.


[5] Bakheet S. An SVM framework for malignant melanoma detection based on optimized HOG features. Computation. 2017;5(1):4.

[6] Schaefer G, Rajab MI, Celebi ME, Iyatomi H. Colour and contrast enhancement for improved skin lesion segmentation. Computerized Medical Imaging and Graphics. 2011;35(2):99-104.

[7] Sookpotharom S. Border detection of skin lesion images based on fuzzy C-means thresholding. In: Genetic and Evolutionary Computing, 2009. WGEC'09. 3rd International Conference on. IEEE; 2009. p. 777-780.

[8] Zhou H, Li X, Schaefer G, Celebi ME, Miller P. Mean shift based gradient vector flow for image segmentation. Computer Vision and Image Understanding. 2013;117(9):1004-1016.

[9] Yuan Y, Chao M, Lo YC. Automatic skin lesion segmentation using deep fully convolutional networks with Jaccard distance. IEEE Transactions on Medical Imaging. 2017;36(9):1876-1886.

[10] Menegola A, Tavares J, Fornaciali M, Li LT, Avila S, Valle E. RECOD Titans at ISIC challenge 2017. arXiv preprint arXiv:1703.04819. 2017.

[11] Bi L, Kim J, Ahn E, Feng D. Automatic skin lesion analysis using large-scale dermoscopy images and deep residual networks. arXiv preprint arXiv:1703.04197. 2017.

[12] Matsunaga K, Hamada A, Minagawa A, Koga H. Image classification of melanoma, nevus and seborrheic keratosis by deep neural network ensemble. arXiv preprint arXiv:1703.03108. 2017.

[13] Yin XX, Hadjiloucas S, Chen JH, Zhang Y, Wu JL, Su MY. Tensor based multichannel reconstruction for breast tumours identification from DCE-MRIs. PLoS ONE. 2017;12(3):e0172111.

[14] Japkowicz N, Stephen S. The class imbalance problem: A systematic study. Intelligent Data Analysis. 2002;6(5):429-449.

[15] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 3431-3440.

[16] Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915. 2016.

[17] Son J, Park SJ, Jung KH. Retinal vessel segmentation in fundoscopic images with generative adversarial networks. arXiv preprint arXiv:1706.09318. 2017.

[18] Chen LC, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. 2017.

[19] Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611. 2018.

[20] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 770-778.

[21] Chollet F. Xception: Deep learning with depthwise separable convolutions. arXiv preprint. 2016.

[22] Szegedy C, Ioffe S, Vanhoucke V, Alemi AA. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI. vol. 4; 2017. p. 12.

[23] Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the Inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 2818-2826.

[24] Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115.

[25] Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. 2017.

[26] Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, et al. TensorFlow: A system for large-scale machine learning. In: OSDI. vol. 16; 2016. p. 265-283.

[27] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. In: Advances in Neural Information Processing Systems; 2014. p. 2672-2680.

[28] Mishra R, Daescu O. Deep learning for skin lesion segmentation. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2017. p. 1189-1194.

[29] Murphree DH, Ngufor C. Transfer learning for melanoma detection: Participation in ISIC 2017 skin lesion classification challenge. arXiv preprint arXiv:1703.05235. 2017.

[30] Johr RH. Dermoscopy: alternative melanocytic algorithms: the ABCD rule of dermatoscopy, Menzies scoring method, and 7-point checklist. Clinics in Dermatology. 2002;20(3):240-247.

[31] Barata C, Celebi ME, Marques JS. Improving dermoscopy image classification using color constancy. IEEE Journal of Biomedical and Health Informatics. 2015;19(3):1146-1152.


[33] LeCun Y, Boser BE, Denker JS, Henderson D, Howard RE, Hubbard WE, et al. Handwritten digit recognition with a back-propagation network. In: Advances in Neural Information Processing Systems; 1990. p. 396-404.

[34] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research. 2014;15(1):1929-1958.

[35] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems; 2012. p. 1097-1105.

[36] Scherer D, Müller A, Behnke S. Evaluation of pooling operations in convolutional architectures for object recognition. In: International Conference on Artificial Neural Networks. Springer; 2010. p. 92-101.

[37] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.

[38] Ercal F, Chawla A, Stoecker WV, Lee HC, Moss RH. Neural network diagnosis of malignant melanoma from color images. IEEE Transactions on Biomedical Engineering. 1994;41(9):837-845.

[39] Marín C, Alférez GH, Córdova J, González V. Detection of melanoma through image recognition and artificial neural networks. In: World Congress on Medical Physics and Biomedical Engineering, June 7-12, 2015, Toronto, Canada. Springer; 2015. p. 832-835.

[40] Huang G, Li Y, Pleiss G, Liu Z, Hopcroft JE, Weinberger KQ. Snapshot ensembles: Train 1, get M for free. arXiv preprint arXiv:1704.00109. 2017.

[41] Frid-Adar M, Klang E, Amitai M, Goldberger J, Greenspan H. Synthetic data augmentation using GAN for improved liver lesion classification. arXiv preprint arXiv:1801.02385. 2018.

[42] He X, Yu Z, Wang T, Lei B. Skin lesion segmentation via deep RefineNet. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer; 2017. p. 303-311.

[43] Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. 2015.

[44] Chu C, Zhmoginov A, Sandler M. CycleGAN: a master of steganography. arXiv preprint arXiv:1712.02950. 2017.

[45] Zhang Y, Yang L, Chen J, Fredericksen M, Hughes DP, Chen DZ. Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2017. p. 408-416.


Appendices

A State of the art review in Skin Lesions Segmentation and Classification

A.1 Introduction

This appendix presents the state of the art deep learning methods used in skin lesion segmentation and classification. The content includes: first, a brief introduction to the great need for skin lesion classification, the public datasets available today, and the development of deep learning methods, especially the convolutional neural network (CNN), for classification and segmentation; second, an introduction to some state of the art methods used in skin lesion classification and segmentation; and last but not least, a discussion of what we have learned from the literature.

A.2 Background on skin lesions

A.2.1 Medical background in skin cancer

Skin cancer is a very common cancer in the USA, Australia and Europe [1]. Among the different skin lesions, melanoma is the deadliest: over 20,000 people in the EU die of melanoma per year. However, early detection of melanoma increases the five-year survival rate from around 83% to around 97% [2].

Currently, with the help of dermoscopy, well trained clinicians can reach an accuracy of 75% - 84% [2]. Specialists usually make their decisions with the help of a dermoscope: a skin surface microscope with a high quality magnifying lens and good lighting. The diagnosis usually follows the ABCD rule and the seven-point checklist [30]; these rules are complicated to follow. Histopathology analysis is the gold standard in skin cancer diagnosis: a biopsy sample is taken from the suspicious area and sent to a lab for more precise analysis.

Due to the limited number of experts around the world, people often cannot get a correct diagnosis [3]. To solve this, computer aided automatic skin cancer detection is needed.


A.4 Traditional computer vision and machine learning methods on skin lesion

Some traditional computer vision algorithms and traditional machine learning algorithms have been used in this area.

A.4.1 Traditional methods on skin lesion classification

These methods basically use low level visual features (color, edges, texture and so on) [4], or apply a machine learning method like an SVM over features extracted by a HOG feature extractor [5], achieving a sensitivity of 98.21%, specificity of 96.43% and accuracy of 97.32% on a binary classification task. Despite the great sensitivity and specificity, this SVM method requires significant computing resources for feature extraction. The parameter volume of the SVM is relatively small, which makes it hard to generalize the result to larger datasets. Some studies focus on modifying the image to improve performance. For example, Barata et al. found that changing the color constancy of dermoscopy images can improve the sensitivity and specificity of the classification algorithm by about 5% [31].

A.4.2 Traditional methods on skin lesion segmentation

Besides the classification task above, some traditional methods target skin lesion segmentation. This work focused on different ways to threshold the image, since a skin lesion usually looks darker than the surrounding skin and lighter than hair. Thresholding is problematic, however, because it depends heavily on the lighting environment and camera settings, so much of the effort went into improving the quality and consistency of the input data: some studies enhance the contrast and color of the image to get better results [6], while others improve the thresholding algorithm itself [7]. Apart from thresholding and modifying the image content, some work exploits the strong intensity variation at the border between the lesion and the surrounding skin, which produces a gradient around the border; Zhou et al. [8] used a gradient-flow-based algorithm to perform skin lesion segmentation.

A.5 Deep learning method

Deep learning methods are nowadays applied in many areas such as object detection, image classification and segmentation. Among them, the convolutional neural network (CNN) shows promising capacity in image classification and segmentation tasks.

A.5.1 convolutional neural networks

Basic concepts in convolutional neural networks

Parameters in CNN models

A deep convolutional neural network usually contains a large number of parameters, which work together to produce the final output of the model. Because there are so many of them, it is impossible to tune the parameters by hand.

Back propagation

Luckily, convolutional neural networks learn their parameters by back propagation (BP): with optimizers such as stochastic gradient descent (SGD), the network optimizes itself with respect to a user-defined loss function. Each parameter is updated according to its gradient at every update step; the gradients originate at the loss function and flow backwards through the network via the chain rule [33], as in the update rule sketched below.
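
As a concrete illustration (standard notation, not taken from this thesis), one SGD step updates the parameter vector $\theta$ with learning rate $\eta$ against the loss $L$:

    \theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)

where $\nabla_{\theta} L$ is computed by applying the chain rule backwards through the layers.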

With back propagation, a neural network can update its parameters by gradient descent. However, a single network contains so many parameters that training can be hard to converge; several techniques help the training process converge faster.

Dropout

The first technique to mention is dropout [34]. As training proceeds, every parameter updates its value according to its gradient, but these updates can be misled by random rare cases or noise that do not represent the general distribution of the dataset. If the model tries to account for every rare case, convergence becomes slow and the model can be biased. Dropout addresses this by randomly ignoring some units at every iteration, which helps the model learn the main distribution of the dataset rather than the rare cases; a minimal sketch follows.
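
A minimal dropout sketch in PyTorch; the layer sizes and dropout probability are illustrative assumptions, not the architecture used in this thesis:

    import torch
    import torch.nn as nn

    # Dropout zeroes activations with probability p during training,
    # so no single unit can dominate; at evaluation time it is a no-op.
    model = nn.Sequential(
        nn.Linear(128, 64),
        nn.ReLU(),
        nn.Dropout(p=0.5),   # half of the activations are dropped each pass
        nn.Linear(64, 2),
    )

    model.train()                 # dropout active
    out_train = model(torch.randn(8, 128))

    model.eval()                  # dropout disabled at inference
    out_eval = model(torch.randn(8, 128))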

Weight sharing of parameters

Weight sharing within a CNN layer can decrease the number of parameters dramatically. In [35], weight sharing greatly reduced the computing resources needed. The small calculation below illustrates the difference.
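
A back-of-the-envelope comparison (the image and layer sizes are assumptions chosen for illustration): a convolutional layer reuses one small kernel at every spatial position, while a fully connected layer needs a separate weight for every input-output pair.

    # Input: a 64x64 RGB image flattened to 64*64*3 = 12288 values.
    in_values = 64 * 64 * 3

    # Fully connected layer with 64 output units: one weight per input per output.
    fc_params = in_values * 64 + 64        # weights + biases -> 786,496

    # Conv layer with 64 filters of size 3x3 over 3 input channels:
    # the same 3x3x3 kernel is shared across all positions.
    conv_params = 64 * (3 * 3 * 3) + 64    # weights + biases -> 1,792

    print(fc_params, conv_params)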

Pooling

Extracting only the important information from feature maps, instead of using all of it, can save a lot of computation. Pooling layers do this work: a pooling layer selects a single value from each small region of a feature map to represent that whole region, where the selected value can be the maximum or the average of the region. In this way the model learns only the features with the most significant effect [36]; a small example follows.
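
A minimal 2x2 max-pooling example in PyTorch (the toy feature map is my own):

    import torch
    import torch.nn as nn

    pool = nn.MaxPool2d(kernel_size=2, stride=2)

    # One 4x4 single-channel feature map; each 2x2 block is replaced
    # by its maximum, halving both spatial dimensions.
    fmap = torch.tensor([[[[1., 2., 0., 1.],
                           [3., 4., 1., 0.],
                           [0., 1., 5., 6.],
                           [1., 0., 7., 8.]]]])
    print(pool(fmap))
    # tensor([[[[4., 1.],
    #           [1., 8.]]]])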

Optimization algorithms

The loss functions of a deep neural network are usually non-convex, so the SGD algorithm can get trapped in poor local minima. Modified algorithms such as momentum SGD, inspired by physical momentum, add a momentum term to the SGD updates to improve its performance. More recently, research has turned to adaptive gradient descent methods, which use the history, or a moving average of the history, of the gradients to update the parameters; Adagrad, RMSProp and Adam belong to this family, and Adam is the most stable and most widely used one nowadays [37]. The sketch below contrasts plain SGD with its momentum variant.
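
A minimal NumPy sketch of the momentum SGD update rule (the learning rate, momentum coefficient and toy quadratic loss are illustrative assumptions):

    import numpy as np

    def grad(theta):
        # Gradient of the toy loss L(theta) = 0.5 * ||theta||^2.
        return theta

    theta = np.array([2.0, -3.0])
    velocity = np.zeros_like(theta)
    lr, mu = 0.1, 0.9

    for _ in range(100):
        g = grad(theta)
        # Accumulate a decaying average of past gradients ...
        velocity = mu * velocity - lr * g
        # ... and move along the accumulated direction, not the raw gradient.
        theta = theta + velocity

    print(theta)   # approaches the minimum at the origin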

A.5.2 Deep learning methods used in classification and segmentation

Performing segmentation first and using the result for further prediction can improve performance with some tricks: for example, enlarging the border of the mask generated by the segmentation algorithm can help the classification task [13]. This is one way to improve the accuracy and robustness of a classification model, so we look into both classification and segmentation in the following sections.

Classification

Because of the great need for automatic skin lesion classification, some early research was done with deep learning methods [38, 39]. These works used artificial neural networks for melanoma classification and recognition tasks; their results were encouraging and showed the capacity of neural networks.

Great work has been done recently. The most promising is that of the Stanford team, whose model performed dermatologist-level classification using the Inception V3 architecture, a deep neural network with inception modules that apply more effective convolution and pooling to the feature maps. They trained their network on 129,450 clinical images, a dataset two orders of magnitude larger than previous ones, performed both nine-way and three-way classification, and used a partitioning algorithm to achieve better accuracy. Their model reached an accuracy of 72.1 ± 0.9% on three-way classification (dermatologists reach only around 65.8%) and 55.4 ± 1.7% on nine-way classification (dermatologists reach around 54.1%). This success came from a deep, well fine-tuned neural network and plenty of data.

For now we only have the public ISIC 2017 challenge dataset, which contains approximately 2,000 training images, 150 validation images and 600 test images. With such a limited dataset a model cannot reach a high performance level, so we need specific image preprocessing for augmentation so that the classifier we use (for example, the state-of-the-art classifier in this area) can perform better on the limited data; a sketch of such an augmentation pipeline is given below.
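
A minimal augmentation pipeline sketch using torchvision; the particular transforms and parameters are illustrative assumptions, not the exact pipeline of this thesis:

    from torchvision import transforms

    # Geometric and color augmentations expand a small dataset by showing
    # the classifier plausible variations of each dermoscopy image.
    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomVerticalFlip(),
        transforms.RandomRotation(degrees=90),
        transforms.ColorJitter(brightness=0.1, contrast=0.1),
        transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
        transforms.ToTensor(),
    ])

    # Usage: augmented = augment(pil_image)  # a new random variant per call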

By combining the predictions of several trained models, an ensemble can gain more robustness and generality, so we should also note the effectiveness of ensemble learning in the state of the art.

Adding more labeled data is also helpful. One very recent study used a generative adversarial network (GAN) to generate data that cannot be distinguished from real data in order to improve liver lesion classification [41]: adding the synthetic data increased sensitivity and specificity by more than 5%.

Segmentation

Unlike the classification task, segmentation has its own evaluation metrics to indicate model performance. As suggested in the ISIC 2016 challenge, the evaluation metrics include pixel-wise accuracy (AC), sensitivity (SE), specificity (SP), Dice coefficient (DI) and Jaccard index (JA). To give a numerical understanding, let TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives. The metrics are defined in Table 13.

Table 13: Evaluation metrics.

    Metric   Expression
    AC       (TP+TN)/(TP+FP+TN+FN)
    SE       TP/(TP+FN)
    SP       TN/(FP+TN)
    DI       2*TP/(2*TP+FP+FN)
    JA       TP/(TP+FP+FN)
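
These definitions translate directly into code; a minimal NumPy sketch (the function name and test mask are my own):

    import numpy as np

    def segmentation_metrics(pred, truth):
        """AC, SE, SP, DI and JA for binary masks containing 0/1 values."""
        pred, truth = pred.astype(bool), truth.astype(bool)
        tp = np.sum(pred & truth)
        tn = np.sum(~pred & ~truth)
        fp = np.sum(pred & ~truth)
        fn = np.sum(~pred & truth)
        return {
            "AC": (tp + tn) / (tp + fp + tn + fn),
            "SE": tp / (tp + fn),
            "SP": tn / (fp + tn),
            "DI": 2 * tp / (2 * tp + fp + fn),
            "JA": tp / (tp + fp + fn),
        }

    # A perfect prediction scores 1.0 on every metric.
    mask = np.array([[0, 1], [1, 1]])
    print(segmentation_metrics(mask, mask))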

There are two different ways to do skin lesion segmentation with deep learning methods: supervised and unsupervised.

A supervised method means the model is trained with labeled data; here the label is a binary mask image in which the high value (numerically "1") corresponds to the region of interest and the low value (numerically "0") to the background. This approach is frequently used in industry. However, labeling images requires great human effort, and for medical images it is even harder because of the ethical issues in using patient data. Supervised learning in the skin lesion area is therefore limited by the amount of labeled data.

An unsupervised method means the model needs no, or only a few, labeled images to gain good prediction ability. One such method was recently proposed by Goodfellow [27]: an adversarial scheme trains a generator and a discriminator to compete against each other. The generator tries to generate images that the discriminator cannot distinguish from real ones, while the discriminator tries to tell the generated and real images apart. With the data created by a generative adversarial network, we can train the model on more data.
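
For reference, the adversarial game of [27] can be written as the following minimax objective (standard notation: $D$ is the discriminator, $G$ the generator and $p_z$ a noise prior):

    \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]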

Supervised methods used in segmentation

In the ISIC 2017 challenge, Yuan et al. took first place using a deep fully convolutional network (FCN) trained with a Jaccard distance loss [9]. They used a 19-layer FCN with regularization methods such as batch normalization and dropout, and applied both geometric transformations (such as rotation and flipping) and non-geometric transformations (such as random contrast normalization) to the original input images for augmentation. After model prediction, they used computer vision algorithms to fill small holes and perform morphological dilation. In the challenge they achieved a Jaccard index of 0.765 and a Dice coefficient of 0.849 on the ISIC 2017 dataset. A sketch of such a Jaccard-style loss follows.
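
A minimal sketch of a soft Jaccard (IoU) loss in PyTorch, in the spirit of the loss used by Yuan et al.; the exact formulation in [9] may differ, and the smoothing constant and names are my assumptions:

    import torch

    def soft_jaccard_loss(pred, target, eps=1e-7):
        """Jaccard distance between a probability mask and a binary target."""
        # pred: probabilities in [0, 1]; target: {0, 1}; shape (N, H, W).
        intersection = (pred * target).sum(dim=(1, 2))
        union = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2)) - intersection
        jaccard = (intersection + eps) / (union + eps)
        return (1.0 - jaccard).mean()   # 0 for a perfect overlap

    pred = torch.rand(4, 128, 128)                       # dummy prediction
    target = (torch.rand(4, 128, 128) > 0.5).float()     # dummy ground truth
    print(soft_jaccard_loss(pred, target))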

Recent studies show that performance can improve with deeper networks: He et al. [20] showed that their ResNet architecture outperformed the state of the art by mitigating the exploding/vanishing gradient problem in deep networks. Inspired by this idea, He et al. [42] used a deep RefineNet to perform skin lesion segmentation, with a conditional random field as a post-processing step to improve performance. On the ISIC 2017 data they reached a Jaccard index of 0.758 and a Dice index of 0.843.

The most recent state of the art that we found is by Rashika [28], who designed a 15-layer U-net architecture with different convolutional kernel sizes to extract both coarse and detailed information from the images, and applied traditional region-filling and hole-filling algorithms to the predictions. The final performance reaches a Jaccard index of 0.842 and a Dice coefficient of 0.868.

The methods above all used segmentation models such as the fully convolutional network (FCN) and U-net, and all required labeled data and post-processing.

Unsupervised methods

No unsupervised research has been done in this area yet, but we can see applications in other areas.

There are basically two approaches we can use in our research: one is to generate data with a generative adversarial network (GAN) and perform supervised learning on it; the other is to use an adversarial network to train a model with both labeled and unlabeled data.

For data generation there is the deep convolutional generative adversarial network (DCGAN) [43], an unsupervised method that generates data from noise; the images it generates are very similar to real-world data, so it could be a way to increase the amount of segmentation data.


A.6 Data labeling tool

Even with its great generative capacity, the generative adversarial network is based on representation learning, which means it only extracts abstract and hidden relations within the original data distribution.

In a production environment the company will receive images taken by patients, and these patients can get wrong predictions because of the distribution difference between the images they send and our training dataset. The company I worked with has professional doctors who can help us label the new images, so we need to provide them with a handy labeling tool; the tool we found is described in [46].

A.7 Conclusion

Based on the literature study of the articles above, we see how important it is to develop a robust and accurate automatic skin lesion detector.

We also gained insight into how the technology has evolved over the years. There are many ways to improve classification and segmentation performance: using more data and a more suitable model improves performance, and post-processing also helps improve the predictions.

