Academic year: 2021

Share "List of Abbreviations "

Copied!
39
0
0

Loading.... (view fulltext now)

Full text

(1)
(2)

ii

Abstract

Using the computational capabilities of computers within the medical field has become increasingly popular since the emergence of computer-aided detection and diagnosis (CAD) during the middle of the twentieth century. The prevalence of skin cancer attracted research resources, and in 2017 a group of scientists from Stanford University trained a CNN that could outperform board-certified dermatologists in several skin cancer classification tests. The Stanford study gave rise to another study, conducted by Boman and Volminger, who tried to replicate the results using publicly available data. However, they did not achieve the same performance.

The initial ambition of this study was to extend the work of Boman and Volminger. However, because a large part of their training data was unavailable, comparisons were difficult to make, and the ambitions of the study shifted. The models presented in this study achieved 3-way classification accuracies of 82.2% and 87.3% for the balanced and imbalanced models, respectively. The balanced model was trained on a data set that had been randomly oversampled and downsampled to make the different classes equal in size. It showed greater average specificity and sensitivity at a relatively small loss of accuracy. Although the accuracies of these models are higher than that reported by Boman and Volminger, it is difficult to draw any conclusions, as the methodology in this study diverged from the previous work.


Sammanfattning

Since CAD was developed during the middle of the twentieth century, it has become increasingly popular to use the computational capacity of modern computers within the medical field. The prevalence of skin cancer led a group of researchers from Stanford to train, in 2017, a CNN that could outperform certified dermatologists in several skin cancer classification tests. The Stanford study gave rise to a study by Boman and Volminger, who attempted to replicate the results using publicly available data, but they did not achieve the same performance. The initial purpose of this study was to build on the work of Boman and Volminger. However, because a large part of the training data they used was unavailable, comparisons were difficult to make, and the focus therefore shifted to changing other parts of the methodology. The models in this work achieved 3-way classification accuracies of 82.2% and 87.3% for the balanced and imbalanced models, respectively. The balanced model was trained on a data set that had been randomly over- and undersampled to make the classes equal in size. This resulted in better average sensitivity and specificity at the cost of a relatively small loss in classification accuracy. Although the classification accuracy of these models was higher than that of Boman and Volminger's work, it is difficult to draw any conclusions, as the methodology in this work diverged from the previous study.

(4)

iv

List of Abbreviations

ANN  Artificial neural network

CAD  Computer-aided detection and computer-aided diagnosis

CNN  Convolutional neural network

KC   Keratinocyte carcinoma

RNN  Recurrent neural network

ROC  Receiver operating characteristic

SK   Seborrheic keratosis


Table of Contents

Abstract

Sammanfattning

List of Abbreviations

Introduction

  1.1 Problem Statement

  1.2 Scope

  1.3 Thesis Outline

Background

  2.1 Skin Cancer

  2.2 Distinguishing Between Melanoma, Seborrheic Keratosis, and Nevus

  2.3 Artificial Neural Networks

  2.4 Backpropagation

  2.5 Convolutional Neural Networks

  2.6 Transfer Learning

  2.7 Inception v3

  2.8 Data Sampling

  2.9 Related Work

Method

  3.1 Data Set

  3.2 Training Process

  3.3 Validation Process

  3.4 Test Process

Results

  4.1 Testing Results

Discussion

  5.1 Key Findings

  5.2 Limitations

  5.3 Error Analysis

  5.4 Other Aspects Than Resolution and Data Availability

  5.5 Ethicality of the Study

Conclusions

  6.1 Further Research

Bibliography


Chapter 1 Introduction

Cancer is one of the greatest health-related problems of the twenty-first century. It is the second leading cause of death worldwide, claiming 9.6 million lives in 2018.1 Cancer is a broad group of diseases that can manifest in different parts of the body.

Generally, cancer is the result of uncontrolled mitosis of abnormal cells. The uncontrolled growth leads to excess tissue forming a tumor.2 A common area for this unwanted growth is the epidermis, the outermost layer of the skin.3 The skin is the largest human organ: it acts as a physical barrier against unwanted intrusions or infections and provides vital sensory input that travels through the nervous system to the brain.4 Skin cancer may therefore carry severe consequences as these functions are impacted. It is also very common: in the United States, it affects more people than all other types of cancer combined, affecting 1 in 5 Americans by the age of 70.5

Computer-aided detection and diagnosis (CAD) was developed during the middle of the twentieth century, when computers were first used to analyze abnormalities in radiographic data. During the 1960s and 1970s, computers were used to classify lesions in radiographic images of the chest. However, these trials were severely limited by the computational power available at the time.6 Technological advancement has since made the algorithmic problems of CAD computationally feasible, leading to increased research dedicated to finding algorithms for classifying different types of lesions. In the past two decades, classifiers have achieved classification accuracies of up to 90%.7

1 WHO, "Cancer," WHO, September 12, 2018. Accessed May 27, 2020. https://www.who.int/news-room/fact-sheets/detail/cancer.

2 Cancer Treatment Centers of America, "What is Cancer?," Cancer Treatment Centers of America. Accessed May 27, 2020. https://www.cancercenter.com/what-is-cancer.

3 Skin Cancer Foundation, "Skin Cancer Facts & Statistics," Skin Cancer Foundation. Accessed May 27, 2020. https://www.skincancer.org/skin-cancer-information/skin-cancer-facts/.

4 F. John G. Ebling and William Montagna, "Human skin," Encyclopædia Britannica, June 8, 2020. https://www.britannica.com/science/human-skin.

5 Skin Cancer Foundation, Skin Cancer Facts & Statistics.

In 2017, Esteva et al. from Stanford University designed a convolutional neural network (CNN) that outperformed licensed dermatologists in classifying and diagnosing malignant lesions.8 This led to a KTH Royal Institute of Technology bachelor's thesis in which Boman and Volminger attempted to implement the methodology of Esteva et al. using publicly available skin cancer image data instead of the special-permission image data used in the original study. However, because of differences in data availability and time constraints, they were unable to replicate the strong classification accuracy of Esteva et al.9

6 Jorge Hernández Rodríguez, Francisco Javier Cabrero Fraile, María José Rodríguez Conde, and Pablo Luis Gómez Llorente, "Computer aided detection and diagnosis in medical imaging: a review of clinical and educational applications" (conference paper, the Fourth International Conference on Technological Ecosystems for Enhancing Multiculturality, Salamanca, Spain, November 16, 2020).

7 Nazia Hameed, Anglia Ruskin, Kamal Abu Hassan, and M.A. Hossain, “A comprehensive survey on image-based computer aided diagnosis systems for skin cancer,” (conference paper, 2016 10th International Conference on Software, Knowledge, Information Management & Applications (SKIMA), Chengdu, China, December 15-17, 2016).

8 Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and Sebastian Thrun, "Dermatologist-level classification of skin cancer with deep neural networks," Nature, no. 542 (2017): 115–118.

9 Joakim Boman and Alexander Volminger, “Evaluating a deep convolutional neural network for classification of skin cancer,” (bachelor’s thesis, KTH Royal Institute of Technology, 2018).


1.1 Problem Statement

This paper aims to continue the work of Boman and Volminger by adapting the methodology of Esteva et al. to account for the smaller amount of data available when using only publicly available skin cancer image data.10 Without accounting for this difference in data availability, problems inherent to training artificial neural networks (ANNs) will limit the resulting models, as can be concluded from Boman and Volminger's study.11 This work aims to overcome these problems by incorporating sampling methods into the methodology and by being more selective in the choice of data. In doing so, the hope is to clear the obstacles related to this specific problem and, by extension, to enable other researchers with limited access to data to contribute to future CAD research.

1.2 Scope

This study is restricted to evaluating and modifying the methodology developed by Esteva et al. and to building on the ideas brought forth by Boman and Volminger. Because of this, most of the limitations of those two papers inherently still apply to this paper. One such limitation arises from the data used, as any inherent bias in the available data may carry over into the results. Such biases include, for example, the predominant skin color of the patients or the position of the skin lesions on the body. The results of the study may therefore appear more general than they are in practice. Finally, it is important to point out that this study was carried out to assess and gauge the presented models' reliability as a potential supportive tool for doctors and dermatologists. The ambition of this research is therefore not to substitute human specialists, but merely to provide an objective second opinion.

10 Ibid.

11 Boman and Volminger, Evaluating a deep convolutional neural network for classification of skin cancer.


1.3 Thesis Outline

The first chapter gives a rudimentary introduction to the work presented in this paper. The second chapter delves more thoroughly into the foundations of CAD and ANNs, and includes an analysis of state-of-the-art methods developed in the research area. The third chapter explains the choice of methods used for the experiment and the metrics used to evaluate the results. In the fourth chapter, the results of the experiment are presented. In the fifth chapter, these results are discussed and critically analyzed, taking into consideration aspects such as ethics, reliability, and relevance. Finally, in chapter six, the conclusions from the study are presented and areas for improvement are discussed.


Chapter 2 Background

2.1 Skin Cancer

Skin cancer is the most prevalent type of cancer in the US. However, the most common type of skin cancer, keratinocyte carcinoma (KC), is not always reported to cancer registries, so it is hard to estimate the exact number of cases.

Melanoma is the most dangerous type of skin cancer: it causes the majority of skin cancer deaths while accounting for only approximately 1% of cases. It is most frequent among non-Hispanic whites at 25 cases per 100,000, compared to 4 per 100,000 among Hispanics and 1 per 100,000 among blacks. Melanoma is more common among women than men under the age of 50, but by age 65 men have twice the risk of women, and three times the risk by age 80. The American Cancer Society writes that "this pattern reflects the differences in occupational and recreational exposure to ultraviolet radiation by sex and age, which have changed over time".12

Early detection of skin cancer is important, as most cases of KC can be cured when detected and treated early. As for melanoma, it is also highly curable when detected early, but is more likely than KC to spread. The survival rates for people with melanoma are 92% after 5 years, and 89% after 10 years. However, the survival rate is much lower for regional and distant stages of the disease than for localized melanoma. According to the American Cancer Society, “early signs of skin cancer include changes in size, shape, or color of a mole or other skin lesion, the appearance of a new growth on the skin, or a sore that doesn’t heal”.13

12 American Cancer Society, Cancer facts & figures (Atlanta, GA, American Cancer Society, 2016), 20. https://www.cancer.org/research/cancer-facts-statistics/all- cancer-facts-figures/cancer-facts-figures-2016.html.

13 Ibid., 22.


The detection of skin cancer is primarily a visual procedure. Of the two types of KC, basal cell carcinoma "may appear as a growth that is flat, or as a small, raised pink or red translucent, shiny area that may bleed following minor injury", while the second type, squamous cell carcinoma, "may appear as a growing lump, often with a rough surface, or as a flat, reddish patch that grows slowly". Warning signs of the most common type of melanoma can be identified with the ABCDE rule: "A is for asymmetry (one half of the mole does not match the other half); B is for border irregularity (the edges are ragged, notched, or blurred); C is for color (the pigmentation is not uniform, with variable degrees of tan, brown, or black); D is for diameter greater than 6 millimeters (about the size of a pencil eraser), and E is for evolution". However, not all cases of melanoma display such signs.14

2.2 Distinguishing Between Melanoma, Seborrheic Keratosis, and Nevus

A major difficulty in identifying melanoma is that the disease visually resembles other, less harmful skin conditions (see figure 1). Furthermore, the disease can vary greatly in appearance, making it troublesome to recognize certain characteristics.15 In a study conducted in 2002, 9,204 diagnoses of seborrheic keratosis (SK) made between 1992 and 2001 were reviewed retrospectively. The study found that in several cases melanoma was present in addition to the diagnosed SK. This pattern was so prominent that the researchers concluded that melanoma not only looks like, but also mimics, SK.16 A study from 2003 found that melanoma can be mistaken for nevus, but that there is no causal link between the growth of nevi and the growth of melanoma.17

15 Allan C. Halpern, Ashfaq A. Marghoob, and Ofer Reiter, "Melanoma overview - A dangerous skin cancer," The Skin Cancer Foundation, April 2019. https://www.skincancer.org/skin-cancer-information/melanoma/.

16 Leonid Izikson, Arthur J. Sober, Martin C. Mihm Jr, and Artur Zembowicz, "Prevalence of Melanoma Clinically Resembling Seborrheic Keratosis: Archives of 9204 cases," Archives of Dermatology, no. 138 (2002): 1562–1566.

Figure 1. Pictures of skin lesions diagnosed as (a) nevus, (b) SK, and (c) melanoma. From the International Skin Imaging Collaboration (ISIC) Melanoma Project archive. Accessed June 6, 2020. https://www.isic-archive.com.

2.3 Artificial Neural Networks

ANNs were inspired by their biological counterparts and are in some ways remotely related to them. Two similarities are that the building blocks of both networks are simple computational devices and that the connections between the neurons determine the function of the network.18

The most fundamental part of an ANN is the so-called neuron, which maps an input to an output. A neuron typically has more than one input but a single output. The number of inputs depends on the specifications of the problem the ANN is attempting to solve: when predicting chess moves, each square of the board could be used as an input; similarly, in image prediction, the individual color value of each pixel could be used as an input. The output likewise depends on the nature of the problem. It could, for example, be a binary response, a probability between 0 and 1, a more general numerical response, or even a categorical response such as dog or cat.

17 Caroline Bevona, William Goggins, Timothy Quinn, Julie Fullerton, and Hensin Tsao, "Cutaneous Melanomas Associated with Nevi," Archives of Dermatology, no. 139 (2003): 1620–1624.

18 Martin T. Hagan, Howard B. Demuth, Mark H. Beale, and Orlando De Jesus, Neural Network Design (Boston: PWS, 1996), 1:9.

A neuron with R inputs works as follows. First, each input p_i is multiplied by a weight w_{1,i} from a weight matrix W. These products are then added together with a bias value b, obtaining the net input

n = p_1 w_{1,1} + p_2 w_{1,2} + ... + p_R w_{1,R} + b.    Equation 1

which in matrix form can be written as

n = Wp + b.    Equation 2

The net input is then passed through a transfer function f, producing the neuron's output

a = f(Wp + b).19    Equation 3

The weights and bias are both learnable parameters inside the network, meaning that they change during the training process in order to adjust toward the desired values and correct output. The weights determine how much influence a change in the corresponding input has on the output, while the bias represents how far off the predictions are from their intended value. Finally, transfer functions are "chosen to satisfy some specification of the problem that the neuron is attempting to solve". A simple example is the hard limit transfer function, which sets the output to 0 if the net input is negative, or to 1 if the net input is greater than or equal to 0. The output of a neuron using this transfer function can therefore be used to classify inputs into two distinct categories. In multilayer networks trained with the backpropagation algorithm, a commonly used transfer function is the log-sigmoid. One reason is that this function has the useful property of being differentiable. The log-sigmoid takes the net input and squashes the output into the range 0 to 1, according to the expression

a = 1 / (1 + e^(-n)).20    Equation 4

19 Ibid., 2:7.

20 Ibid., 2:3–5.
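As a concrete illustration, equations 1–4 can be expressed in a few lines of Python. This is a sketch written for this text; the input and weight values are illustrative and not taken from the thesis.

```python
import numpy as np

def neuron(p, W, b):
    """Single neuron: net input n = Wp + b (equations 1-2), squashed by
    the log-sigmoid transfer function (equation 4)."""
    n = W @ p + b                        # net input
    return 1.0 / (1.0 + np.exp(-n))      # log-sigmoid output in (0, 1)

# Example with R = 3 inputs (hypothetical values)
p = np.array([0.5, -1.0, 2.0])
W = np.array([0.2, 0.4, -0.1])
b = 0.1
a = neuron(p, W, b)
```

Because the log-sigmoid is differentiable, the same function can later be reused when computing gradients during backpropagation.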


Typically, a single neuron is not sufficient to solve most problems. Instead, multiple neurons are used, operating in parallel, in what are referred to as layers. In a network with multiple layers, each layer has its own weight matrix, bias vector, net-input vector, and output vector. In general, multilayer networks are more powerful than single-layer networks, as they can be trained to approximate more complex functions. In single-layer networks, the number of inputs and neurons is entirely determined by the problem specifications, but in networks with more than two layers, the problem specifications do not directly decide the number of neurons in the layers between the input and output layers. There are few problems for which the optimal number of neurons can be predicted, and the design of ANN architectures for different types of problems is therefore an active area of research.21

2.4 Backpropagation

To optimize an ANN, a performance index is first chosen to measure the performance of the network. In the optimization process, an algorithm searches the parameter space, by adjusting the network weights and biases, to reduce this performance index. One such algorithm is gradient-descent backpropagation, which can optimize multilayer ANNs. The algorithm uses the mean square error between the network's output and a provided target output as the performance index. It takes an input vector p and a target output vector t and then minimizes an approximated form of the mean square error function

F(x) = E[e^2] = E[(t - a)^2].    Equation 5

where x is the vector of network weights and biases and a is the output of the network, obtained through recursion using the previously established notation:

a^0 = p,    Equation 6

a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1})  for m = 0, 1, ..., M - 1,    Equation 7

a = a^M.    Equation 8

where M is the number of layers in the network.22

21 Ibid., 2:9–12.

After the inputs have been propagated forward using the recursive formula above (see equations 6–8), the algorithm calculates each layer's sensitivity, which can be defined as the sensitivity of the approximated performance index to changes in the ith element of that layer's net input. The sensitivities are calculated propagating backwards, which explains the name of the algorithm. With the sensitivities calculated, the final part of the algorithm updates the weights and biases of the network using gradient descent. The algorithm iteratively steps in the direction of the negative gradient of the approximated performance index until it finds parameter values that approach a local minimum of the performance index.23
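The forward recursion and gradient-descent updates described above can be sketched in a minimal, illustrative implementation. The network size (1-2-1), the toy target t = p^2, the learning rate, and the iteration count are assumptions made for this example, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
sig = lambda n: 1.0 / (1.0 + np.exp(-n))   # log-sigmoid transfer function

# Toy 1-2-1 network: one input, two log-sigmoid hidden neurons, linear output
W1 = rng.normal(size=(2, 1)); b1 = np.zeros((2, 1))
W2 = rng.normal(size=(1, 2)); b2 = np.zeros((1, 1))

P = np.linspace(-1.0, 1.0, 20).reshape(1, -1)   # input vectors
T = P ** 2                                      # target outputs
m = P.shape[1]
lr = 0.1

mse = lambda: float(((W2 @ sig(W1 @ P + b1) + b2 - T) ** 2).mean())
mse_before = mse()
for _ in range(3000):
    # forward propagation (equations 6-8)
    a1 = sig(W1 @ P + b1)
    a2 = W2 @ a1 + b2
    # backward propagation of the sensitivities
    s2 = 2.0 * (a2 - T)                   # output layer (linear transfer)
    s1 = (W2.T @ s2) * a1 * (1.0 - a1)    # hidden layer (log-sigmoid derivative)
    # gradient-descent updates, averaged over the batch
    W2 -= lr * (s2 @ a1.T) / m; b2 -= lr * s2.mean(axis=1, keepdims=True)
    W1 -= lr * (s1 @ P.T) / m;  b1 -= lr * s1.mean(axis=1, keepdims=True)
mse_after = mse()
```

Each iteration steps in the direction of the negative gradient of the approximated mean square error, so the loss after training should be lower than at initialization.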

2.5 Convolutional Neural Networks

Advancements in computer vision with deep learning have been driven primarily by CNNs. The CNN architecture assumes hierarchical patterns in the analyzed data, which translates well into image recognition. Distinct smaller features of an image, such as edges, are combined into more complex features, such as a mouth, a nose, and a pair of eyes; these are in turn combined into even more complex features, such as a face, and so on. To achieve this, the convolutional neurons use dot-product operations to aggregate the input from neurons with limited receptive fields, so that a larger visual field is covered in aggregate. This enables the network to identify patterns in larger areas of an image using significantly fewer parameters than with fully connected layers, which makes the network easier to train.24

22 Ibid., 11:13.

23 Ibid., 11:11–13.

Pooling layers are common alongside the convolutional layers in CNN architectures. Max pooling and average pooling layers, which make up part of the Inception v3 architecture, aggregate information similarly to convolutional layers, but instead of a dot-product aggregate, the max or average value of the input is returned. This can reduce the spatial size of a feature, extract the most important information, and filter out noise. Pooling layers also further decrease the computational power required to train ANNs.25
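The pooling operation can be sketched as follows, assuming non-overlapping 2x2 windows on a single-channel input; the input values are illustrative.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling: each output value aggregates a
    size x size window of the input (max or average)."""
    h, w = x.shape
    x = x[: h - h % size, : w - w % size]          # trim to a multiple of size
    blocks = x.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 4., 1., 1.]])
pooled_max = pool2d(x)               # keeps the strongest activation per window
pooled_avg = pool2d(x, mode="avg")   # smooths each window instead
```

Note how the 4x4 input is reduced to a 2x2 output, illustrating the reduction in spatial size and therefore in downstream computation.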

2.6 Transfer Learning

Transfer learning is the concept of transferring experience from one task to another, similar one. An analogy is that it is supposedly easier to learn Spanish if you already know Italian. This very human behavior has inspired the use of the technique within machine learning.26 The Inception v3 model utilizes transfer learning by being pretrained on general images and later allowing the final layer, right before the output, to be retrained on a new task.
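The idea of retraining only the final layer can be sketched as follows, with a fixed random projection standing in for the pretrained convolutional layers. Everything here (feature dimension, data, target) is an illustrative assumption, not the pipeline used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "pretrained" feature extractor, a stand-in for Inception v3's
# convolutional layers (the random projection is purely illustrative).
W_frozen = rng.normal(size=(8, 4))
features = lambda x: np.maximum(0.0, x @ W_frozen.T)   # fixed, never trained

# Only the final layer (a logistic-regression "head") is trained on the
# new task, mirroring how Inception v3's final layer is retrained.
X = rng.normal(size=(200, 4))
F = features(X)
y = (F[:, 0] > np.median(F[:, 0])).astype(float)   # a toy, separable target

w = np.zeros(8); b = 0.0
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))   # predicted probabilities
    w -= 0.5 * F.T @ (p - y) / len(y)        # logistic-loss gradient step
    b -= 0.5 * (p - y).mean()

accuracy = float(((1.0 / (1.0 + np.exp(-(F @ w + b))) > 0.5) == y).mean())
```

Because only the small final layer is trained, far fewer parameters need to be fitted on the new task than if the whole network were trained from scratch.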

2.7 Inception v3

In 2014, a trend toward deeper CNNs resulted in new high-performing models. One of these was the GoogLeNet architecture, later known as the first iteration of the Inception CNN, i.e. Inception v1. An important contribution of the Inception architecture is the notion of designing wider instead of deeper networks. This is achieved with so-called Inception modules within the architecture, whose main idea is to use local sparse models that can approximate dense components.27 The Inception module utilizes convolutional layers with 1x1, 3x3, and 5x5 kernels respectively, as well as pooling layers, and then concatenates the respective outputs into a single output vector. This enables the layers within the architecture to choose which filter size is most important for learning the information. Different kernel sizes can be advantageous for different types of images: more globally distributed relevant data in the image calls for a larger kernel, whereas a more local distribution favors a smaller kernel.28

24 Sumit Saha, "A Comprehensive Guide to Convolutional Neural Networks – the ELI5 way," Towards Data Science, December 15, 2018. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53.

25 Ibid.

26 Lisa Torrey and Jude Shavlik, "Transfer Learning," in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, ed. E. S. Olivas, J. D. Guerrero, M. Martinez-Sober, J. R. Magdalena-Benedito, and A. J. Serrano López (2010), 242–264.
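The wider-instead-of-deeper idea can be sketched by running parallel branches with different kernel sizes over the same input and concatenating their outputs, as the Inception module does. The kernels here are simple averaging filters chosen for illustration; a real Inception module uses learned filters and 1x1 bottleneck convolutions.

```python
import numpy as np

def conv_same(x, k):
    """'Same'-padded 2D cross-correlation so every branch keeps the
    input's spatial size and the outputs can be concatenated."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * k).sum()
    return out

x = np.arange(36, dtype=float).reshape(6, 6)    # a toy single-channel "image"
branches = [
    conv_same(x, np.ones((1, 1))),              # 1x1 branch
    conv_same(x, np.ones((3, 3)) / 9),          # 3x3 branch
    conv_same(x, np.ones((5, 5)) / 25),         # 5x5 branch
]
# Concatenate the parallel branch outputs along a channel axis,
# as the Inception module concatenates its filter outputs.
module_out = np.stack(branches, axis=-1)        # shape (6, 6, 3)
```

Because every branch preserves the spatial size, the module's output simply gains one channel per branch, letting later layers weight whichever kernel size proves most informative.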

The original Inception architecture was later refined and has been continuously iterated upon, leading up to the third iteration, Inception v3. The main improvement compared to Inception v1 was achieved by factoring the computationally expensive larger convolutions into several less computationally expensive smaller ones.29

2.8 Data Sampling

A data set is imbalanced if its classes are not roughly equally represented. Machine learning techniques such as ANNs are often applied to problems characterized by imbalanced data. Furthermore, the distribution among classes may differ between the testing and training data, so that the true misclassification costs are unknown during learning.

27 Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, "Going Deeper with Convolutions," arXiv 1409, no. 4842 (2014). https://arxiv.org/abs/1409.4842.

28 Bharath Raj, "A Simple Guide to the Versions of the Inception Network," Towards Data Science, May 29, 2018. https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202.

29 Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna, "Rethinking the Inception Architecture for Computer Vision," arXiv 1512, no. 00567 (2015). https://arxiv.org/abs/1512.00567.


Because of this, predictive accuracy (see equation 9), which is a common choice for evaluating classifiers, may not be the most suitable measure when the data is imbalanced or when the costs of different errors vary significantly. This indicates that the optimal training data distribution for a learning algorithm depends on the problem being solved.30
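A small numerical example of why accuracy alone can mislead on imbalanced data: a classifier that always predicts the majority class scores high accuracy while detecting no positives at all. The class counts are hypothetical.

```python
# Hypothetical class split: 950 benign lesions, 50 melanomas (95% / 5%)
n_benign, n_melanoma = 950, 50

# A degenerate classifier that always predicts "benign"
correct = n_benign                   # every benign case is "predicted" right
total = n_benign + n_melanoma
accuracy = correct / total           # high accuracy, yet useless clinically

sensitivity = 0 / n_melanoma         # true positives / actual positives = 0
```

Here the accuracy is 95% even though sensitivity, the fraction of actual melanomas detected, is zero, which is why measures such as sensitivity, specificity, and AUC are reported alongside accuracy for imbalanced problems.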

In a study analyzing the effect of class distributions on classifier learning, it was demonstrated that the naturally occurring class distribution does not always produce the best-performing classifier. Using AUC as the performance metric, which the authors considered more appropriate than error rate, they found that the training set should in general be formed from equal numbers of examples of each class if the true misclassification costs are unknown and the optimal distribution is not to be determined by experimentation.31

Another study comparing the performance of different sampling techniques found that standard multilayer neural networks are not sensitive to class imbalance when applied to linearly separable domains, but that “its sensitivity increases with the complexity of the domain”. The paper showed that both oversampling the minority class and downsampling the majority class were effective methods in such cases. Furthermore, it found that more sophisticated methods than a uniformly random approach had no benefit in the case of the domains studied.32
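Uniformly random oversampling and downsampling, the two techniques found effective in the study above, can be sketched as follows; the class names and counts are hypothetical.

```python
import random

def rebalance(by_class, target, seed=0):
    """Uniformly random resampling: classes above `target` are downsampled
    without replacement, classes below it are oversampled with replacement,
    so every class ends up with exactly `target` examples."""
    rng = random.Random(seed)
    balanced = {}
    for label, items in by_class.items():
        if len(items) >= target:
            balanced[label] = rng.sample(items, target)                   # downsample
        else:
            balanced[label] = [rng.choice(items) for _ in range(target)]  # oversample
    return balanced

# Hypothetical imbalanced data set (items stand in for image identifiers)
data = {"nevus": list(range(1000)), "melanoma": list(range(150)), "sk": list(range(300))}
balanced = rebalance(data, target=300)
```

Oversampling necessarily duplicates minority-class examples, which is why it is applied only to the training split, never to validation or test data.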

30 Nitesh V. Chawla, "Data mining for imbalanced datasets: An overview," in Data Mining and Knowledge Discovery Handbook, 875–886. Springer, Boston, MA, 2009.

31 Gary M. Weiss and Foster Provost, "Learning when training data are costly: The effect of class distribution on tree induction," Journal of Artificial Intelligence Research 19 (2003): 315–354.

32 Nathalie Japkowicz, "The class imbalance problem: Significance and strategies," in Proc. of the Int'l Conf. on Artificial Intelligence, 2000.

Accuracy = correct predictions / total predictions.    Equation 9


2.9 Related Work

The Stanford study conducted in 2017 by Esteva et al. was the most ambitious skin cancer CAD project to date. In the study, the Inception v3 CNN was trained using transfer learning on skin cancer images and the corresponding disease-diagnosis labels. Compared to previous studies, the model was trained with a much larger data set: 129,450 images from the ISIC Dermoscopic Archive, the Edinburgh Dermofit Library, and the Stanford Hospital, two orders of magnitude larger than the data sets used previously.33

The authors also implemented new methods using a tree-structured taxonomy of diseases in order to "take advantage of fine-grained information contained within the taxonomy structure". Each class with more than 1,000 images was recursively divided into smaller classes, with the average generated training class size being slightly less than 1,000, yielding 757 classes in total. During testing, the probabilities of the child classes were summed into probabilities for their parent classes, which were the classes actually tested.

In addition to the large training set and improved methodology, the team compared the performance of their model against at least 21 board-certified dermatologists in two tests of binary classification. The result of binary classification between melanoma and nevus can be seen in the receiver operating characteristic (ROC) curve diagram in figure 3. The red dots show the results of individual dermatologists, with the green dot showing their average (the green bars denoting one standard deviation). The CNN outperformed the average dermatologist in this test and in a second test between KC and SK.

33 Esteva et al., Dermatologist-level classification of skin cancer with deep neural networks.


Figure 3. The ROC-curve for the binary classification between melanoma and nevus with the individual and average dermatologists plotted as red dots and green dot respectively. The green bars emerging from the average point represent one standard deviation. From Esteva et al., Dermatologist-level classification of skin cancer with deep neural networks.

The results of Esteva et al. caught the interest of Boman and Volminger, who, in their bachelor's thesis from 2018, attempted to replicate the methodology on a smaller data set comprised of publicly available data. Two of the data sources used by Esteva et al., the Edinburgh Dermofit Library and the Stanford Hospital, were not publicly available, so the authors attempted to compensate by combining other publicly available data sources with the ISIC data. Although following the same methodology, Boman and Volminger were not able to replicate the strong results of Esteva et al. (see figure 4). Among the reasons stated was the smaller available data set, which they partitioned into 16 classes.34

34 Boman and Volminger, Evaluating a deep convolutional neural network for classification of skin cancer.


Figure 4. A ROC-curve from the KTH study replicating the binary classification test between melanoma and nevus from Esteva et al., the results of which are displayed in figure 3. From Boman and Volminger, Evaluating a deep convolutional neural network for classification of skin cancer.


Chapter 3 Method

In this chapter, the methods used to answer the problem statement and the data used in the process are presented. The methodology in this study aimed to build upon the previous work by Boman and Volminger, which in turn was heavily inspired by the 2017 Stanford study.35,36

3.1 Data Set

The data used in this study comes from the ISIC Dermoscopic Archive. In total, 23,906 pictures were collected from the database. Each image came with a description file containing various information about the skin lesion. The only information extracted from the description files was the diagnosis, which was used as the label for training our CNN.37

Unfortunately, the DermQuest archive used by Boman and Volminger could not be accessed: it contained 16,826 images used in their study but has, as of late 2019, been shut down until further notice. Other sources used by Boman and Volminger, such as the Dermatology Atlas and DermaAmin, were excluded as part of the data selection process.38 This was motivated by the lower quality of the images from these sources: they were often wider shots of the body, as opposed to the ISIC images, which were zoomed in on the skin lesions. The images were

35 Ibid.

36 Esteva et al., Dermatologist-level classification of skin cancer with deep neural networks.

37 The International Skin Imaging Collaboration (ISIC) “Melanoma Project archive,” ISIC. Accessed June 6, 2020. https://www.isic-archive.com.

38 Boman and Volminger, Evaluating a deep convolutional neural network for classification of skin cancer.


excluded to avoid having the network learn these distracting features, given that the amount of data available for each class was small.

A large number of images were also discarded from the ISIC data set.

Among these were images belonging to classes with fewer than 200 images, following Boman and Volminger, in order to ensure that the network has sufficient data for training, validation, and testing.39 This differs from Esteva et al., who do not state a minimum class size; their approach was feasible because of their larger data set and because they possessed the medical knowledge required to arrange the data by clinical and visual similarity into the taxonomy tree structure described in chapter 2.8, making up for the lack of data in any individual class.40

Other images that did not pass the data selection process were skin lesions diagnosed as “unknown” or “other”, because they consisted of different unidentified diseases that could contaminate the pools of similar-looking, identified diseases. The final images and their labels can be found in table 1. Unfortunately, only three classes passed the data selection process, diverging from the methodology of Boman and Volminger, who had 16 classes, and Esteva et al., who had 757.41 42 However, the three remaining classes, nevus, melanoma, and SK, were still deemed significant enough to study due to their visual similarity, as discussed in chapter 2.2. Of the data that made it through the selection process, 10% was randomly selected into the validation set and another 10% into the testing set.
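The selection and splitting procedure described above can be sketched in Python as follows (a minimal illustration; function and variable names are not from the study's code, and the exact split mechanics are assumptions):

```python
import random
from collections import defaultdict

def select_and_split(images, min_class_size=200, val_frac=0.1, test_frac=0.1,
                     seed=0):
    """Drop excluded diagnoses and undersized classes, then split the rest
    into training, validation, and test sets (roughly 80/10/10)."""
    by_label = defaultdict(list)
    for path, label in images:
        if label not in ("unknown", "other"):  # excluded diagnoses
            by_label[label].append(path)
    # Keep only classes with at least `min_class_size` images.
    by_label = {lb: ps for lb, ps in by_label.items()
                if len(ps) >= min_class_size}

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, paths in by_label.items():
        rng.shuffle(paths)
        n_val = int(len(paths) * val_frac)
        n_test = int(len(paths) * test_frac)
        splits["val"] += [(p, label) for p in paths[:n_val]]
        splits["test"] += [(p, label) for p in paths[n_val:n_val + n_test]]
        splits["train"] += [(p, label) for p in paths[n_val + n_test:]]
    return splits
```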

Drawing on the literature discussed in chapter 2.9, sampling adjustments were made to compensate for the imbalanced training data set.

This was deemed necessary as the costs of misclassification vary

39 Ibid.

40 Esteva et al., Dermatologist-level classification of skin cancer with deep neural networks.

41 Ibid.

42 Boman and Volminger, Evaluating a deep convolutional neural network for classification of skin cancer.


significantly for the different classes. However, a separate experiment was carried out without any sampling adjustments to allow a comparison. The nevus data was randomly downsampled to 5,000 images, while the melanoma and SK data were randomly oversampled to 5,000 images, yielding an equal split between minority and majority classes. The number 5,000 was chosen somewhat arbitrarily, balancing storage space and computational requirements against time limitations that made experimenting with different distributions unfeasible. This was another significant modification of the methodology compared to previous work, as Boman and Volminger only downsampled their majority classes to 1,000 images, whilst Esteva et al. did no sampling.

However, it should be noted that Boman and Volminger had a lower limit of 200 images, and Esteva et al. might have achieved a more balanced data set from their differing sources.43 44
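The balancing step can be sketched as follows (an illustrative re-implementation; the study's actual sampling code is not shown here):

```python
import random
from collections import defaultdict

def balance_classes(train, target=5000, seed=0):
    """Randomly downsample majority classes and oversample minority classes
    (by duplicating images at random) so every class has `target` images."""
    by_label = defaultdict(list)
    for item, label in train:
        by_label[label].append(item)

    rng = random.Random(seed)
    balanced = []
    for label, items in by_label.items():
        if len(items) >= target:        # downsample without replacement
            chosen = rng.sample(items, target)
        else:                           # keep all originals, then duplicate
            chosen = items + [rng.choice(items)
                              for _ in range(target - len(items))]
        balanced += [(item, label) for item in chosen]
    return balanced
```

Note that oversampling only duplicates existing images, so, as discussed later in chapter 5.3, it adds no new information about the minority classes.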

Label      Total images   Training (Oversampled)   Validation   Test
Nevus      18,570         5,000 (5,000)            1,857        1,857
Melanoma   2,604          2,170 (5,000)            217          217
SK         420            338 (5,000)              42           42

Table 1. The number of unique images for each data split and category, with the total number of images used in the training split, including oversampled data, enclosed by parentheses.

3.2 Training Process

Inception v3 was used as the base CNN for the training process. Pre-trained weights were downloaded that had already been trained to a 78.0% top-1

43 Ibid.

44 Esteva et al., Dermatologist-level classification of skin cancer with deep neural networks.


accuracy and 93.9% top-5 accuracy on the ILSVRC-2012-CLS image classification data set. The pre-trained Inception v3 model was then retrained through transfer learning, where the final classification layer's weights are removed and retrained while keeping the other layers locked. To replicate the methodology of previous studies as accurately as possible, the same parameters for the learning process were used as in Boman and Volminger (see table 2).45 46 47

Parameter       Value
Learning rate   0.001
Batch size      100
Decay           0.9
Epsilon         0.1
Momentum        0.9

Table 2. The parameters used to train the Inception v3 CNN, the same as in Boman and Volminger, Evaluating a deep convolutional neural network for classification of skin cancer.
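The retraining setup can be sketched with the Keras API (the study itself used the TF-Slim scripts; the layer wiring below and the mapping of the decay parameter to RMSProp's rho are assumptions):

```python
import tensorflow as tf

# Inception v3 without its ImageNet classification head. The study loaded
# pre-trained ILSVRC-2012 weights; `weights=None` here avoids the download,
# so this sketch only shows the wiring.
base = tf.keras.applications.InceptionV3(
    weights=None, include_top=False, pooling="avg", input_shape=(299, 299, 3))
base.trainable = False  # lock every layer except the new classifier

# New final classification layer for the three classes of this study.
outputs = tf.keras.layers.Dense(3, activation="softmax")(base.output)
model = tf.keras.Model(base.input, outputs)

# RMSProp configured with the parameters from table 2.
optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=0.001, rho=0.9, momentum=0.9, epsilon=0.1)
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds.batch(100), validation_data=val_ds.batch(100))
```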

The TensorFlow library was used to train, validate, and test the network.48 The TensorFlow Slim library was also used because it features

45 Sergio Guadarrama, “Pre-trained models,” in TensorFlow-Slim image classification model library readme, GitHub. Accessed June 6, 2020.

https://github.com/tensorflow/models/tree/master/research/slim.

46 Stanford Vision Lab, “The Large Scale Visual Recognition Challenge 2012 (ILSVRC2012),” Image Net. Accessed June 6, 2020. http://www.image-

net.org/challenges/LSVRC/2012/.

47 Boman and Volminger, Evaluating a deep convolutional neural network for classification of skin cancer.

48 Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Rafal Jozefowicz, Yangqing Jia, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Mike Schuster, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent


many built-in functions for training and evaluating CNNs on image data with transfer learning. The library also includes functions to automatically preprocess images for training with the Inception infrastructure. Using these, the images were randomly cropped, bicubically resized, and distorted with random hue, contrast, brightness, and saturation to augment the data. The images were then resized to the 299x299 pixel resolution supported by the Inception architecture. After preprocessing, the network was trained using the backpropagation algorithm.49
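The augmentation just described can be approximated with TensorFlow's image ops (the crop fraction and distortion ranges below are illustrative assumptions; the TF-Slim Inception preprocessing differs in detail):

```python
import tensorflow as tf

def preprocess_for_training(image):
    """Random crop, bicubic resize to 299x299, and random photometric
    distortions, roughly following the Inception preprocessing."""
    image = tf.image.convert_image_dtype(image, tf.float32)  # scale to [0, 1]
    # Random-position crop covering 80% of each dimension (fraction assumed).
    shape = tf.shape(image)
    crop_h = tf.cast(tf.cast(shape[0], tf.float32) * 0.8, tf.int32)
    crop_w = tf.cast(tf.cast(shape[1], tf.float32) * 0.8, tf.int32)
    image = tf.image.random_crop(image, [crop_h, crop_w, 3])
    # Bicubic resize to the input resolution of Inception v3.
    image = tf.image.resize(image, [299, 299], method="bicubic")
    # Random photometric distortions.
    image = tf.image.random_hue(image, 0.08)
    image = tf.image.random_contrast(image, 0.7, 1.3)
    image = tf.image.random_brightness(image, 0.2)
    image = tf.image.random_saturation(image, 0.7, 1.3)
    return tf.clip_by_value(image, 0.0, 1.0)
```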

3.3 Validation Process

The network was validated using the 3-way classification accuracy. The 3-way classification was initially inspired by the taxonomy approach used by Boman and Volminger and by Esteva et al., but due to the lack of available data, a regular 3-way classification was performed.50 The original 3-way classification was based on the taxonomy in Esteva et al., which divided the skin disorders into benign, malignant, or non-neoplastic. Because of data limitations and a lack of clinical knowledge, not enough data could be classified as non-neoplastic, resulting in the 3-way classification between nevus, melanoma, and SK of this study.51

The 3-way classification validation was carried out with the built-in functions of TensorFlow which runs the network on the specified validation set and then prints out the classification accuracy (see equation 9).

Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, “TensorFlow,” GitHub.

Accessed June 6, 2020. https://github.com/tensorflow.

49 Sergio Guadarrama, “TensorFlow-Slim image classification model library readme,” GitHub. Accessed June 6, 2020.

https://github.com/tensorflow/models/tree/master/research/slim.

50 Boman and Volminger, Evaluating a deep convolutional neural network for classification of skin cancer.

51 Esteva et al., Dermatologist-level classification of skin cancer with deep neural networks.


During the training process, model checkpoints with saved weights were stored every 10 minutes. After training for a substantial amount of time, all checkpoints were validated, and the checkpoint with the lowest validation error was selected as the final model to avoid overfitting.

3.4 Test Process

In the testing process, a more sophisticated method was used than in the validation process. As was discussed in chapter 2.9, predictive accuracy may not be the most suitable measure when the data is imbalanced or when the costs of different errors vary significantly.

A common ethical dilemma in medicine is the trade-off between specificity and sensitivity (see equations 10 and 11). A false positive occurs when a test gives a positive result, i.e. the subject is diagnosed as sick while in reality being healthy. A false negative is the opposite: a patient is declared healthy while having the disease tested for. A false negative is universally viewed as a bad outcome, as it excludes a person needing care from that care. A false positive raises a subtler ethical issue: a lower threshold for a positive test result leads to more people being diagnosed, which means fewer false negatives but also more false positives. The ethical question is whether this trade-off is acceptable, as it risks unnecessary stress and trauma for healthy individuals.

Furthermore, these considerations must also be balanced against the cost of unnecessarily treating healthy individuals.52

52 Pozgar, G. D., Legal and Ethical Issues for Health Professionals, Boston: Jones and Bartlett Publishers, 2005.

Specificity = true negatives / total negatives. (Equation 10)

Sensitivity = true positives / total positives. (Equation 11)


To avoid this problem, a confusion matrix was constructed (see table 3), and the sensitivity and specificity were calculated and reported alongside the classification accuracy.

                        Predicted class
                        Positive          Negative
Actual class  Positive  True positive     False negative
              Negative  False positive    True negative

Table 3. A generic binary confusion matrix.


Chapter 4 Results

In this chapter, the results of the study are presented. The test process was performed both with a model trained on data that had been balanced through oversampling and downsampling and with a model trained on imbalanced data as explained in chapter 3.1. These are referred to as the balanced model and the imbalanced model respectively.

4.1 Testing Results

The balanced model achieved a classification accuracy of 82.2% on the test data set. The imbalanced model achieved a higher classification accuracy of 87.3% on the same data set. However, the difference in behavior between the two models can be seen clearly by comparing table 4 and table 5. The imbalanced model favors nevus responses, as nevus was the majority class in the imbalanced data set. In contrast, the balanced model gives more balanced responses. This results in better average sensitivity but reduces the overall accuracy.

The balanced model achieved a sensitivity of 65.4% for melanoma, 84.5% for nevus, and 73.8% for SK, while the imbalanced model achieved a sensitivity of 18.1% for melanoma, 97.5% for nevus, and 0% for SK. As for specificity, the balanced model achieved 89.1% for melanoma, 79.9% for nevus, and 94.4% for SK, while the imbalanced model achieved 97.3% for melanoma, 17.3% for nevus, and 100% for SK. The results are summarized in table 6.


                        Predicted class
                        Melanoma   Nevus   SK
Actual class  Melanoma  142        44      31
              Nevus     203        1569    85
              SK        3          8       31

Table 4. A confusion matrix showing the 3-way classification results of the balanced model.

                        Predicted class
                        Melanoma   Nevus   SK
Actual class  Melanoma  41         176     0
              Nevus     47         1810    0
              SK        4          38      0

Table 5. A confusion matrix showing the 3-way classification results of the imbalanced model.

Parameter                      Melanoma   Nevus   SK
Balanced model sensitivity     65.4%      84.5%   73.8%
Imbalanced model sensitivity   18.1%      97.5%   0%
Balanced model specificity     89.1%      79.9%   94.4%
Imbalanced model specificity   97.3%      17.3%   100%

Table 6. Sensitivity and specificity results for the balanced and imbalanced models.
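The per-class figures in table 6 can be recomputed from the confusion matrix of the balanced model (table 4) using equations 10 and 11; a short numpy check:

```python
import numpy as np

# Confusion matrix of the balanced model (rows: actual, columns: predicted),
# classes ordered melanoma, nevus, SK (table 4).
cm = np.array([[142,   44, 31],
               [203, 1569, 85],
               [  3,    8, 31]])

# Sensitivity per class: true positives / total positives (equation 11).
sensitivity = np.diag(cm) / cm.sum(axis=1)

# Specificity per class: true negatives / total negatives (equation 10).
total = cm.sum()
specificity = np.array([
    (total - cm[i].sum() - cm[:, i].sum() + cm[i, i])  # true negatives
    / (total - cm[i].sum())                            # total negatives
    for i in range(3)])

print(np.round(sensitivity * 100, 1))  # [65.4 84.5 73.8]
print(np.round(specificity * 100, 1))  # [89.2 79.9 94.4]
```

This reproduces the balanced-model rows of table 6, except that melanoma specificity computes to 89.2% rather than the reported 89.1%, a difference attributable to rounding.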


Chapter 5 Discussion

5.1 Key Findings

The CNNs developed in this study achieved a better 3-way classification accuracy than the work of Boman and Volminger (82.2% and 87.3% for the two models, compared to 68.3%). However, one should be careful about drawing conclusions from these results due to the large divergence in methodology caused by the availability of data. The differences can largely be explained by the narrowing down from 16 to 3 classes due to the lower limit of 200 on class sizes.

Rather than the direct comparison with previous work, the most interesting results come from the sampling process implemented in the methodology. As can be seen by comparing the confusion matrices of the balanced and imbalanced models (see tables 4 and 5), significant improvements in sensitivity and better overall specificity were achieved at only a moderate loss in classification accuracy by utilizing oversampling and downsampling (see table 6).

5.2 Limitations

During this study, time was a limiting factor, which was quite detrimental to the gathering of data and, foremost, to proper testing processes.

Other than time, the available data was also a major limitation. Apart from the various small databases with a couple of hundred images or fewer each, the sudden disappearance of DermQuest resulted in a substantially smaller data set to train the network on. Having a sufficiently large training data set is crucial to make the network capable of generalizing and thereby correctly predicting future, unseen input data. A large motivation behind this


study was to further the work of Boman and Volminger by extending their testing to achieve more nuanced and hopefully higher-performing results.

However, due to the time factor and data availability already discussed, the testing phase was severely hampered, and comparisons with the previous work could not be made directly due to the large divergence in methodology from not using the disease taxonomy.53

5.3 Error Analysis

It is quite evident that the first run of validation and testing was acutely skewed by the class imbalance in the training data, since the network never attempted to predict an instance of SK. Although the CNN achieves better results with balanced data, the effects of imbalance are likely still impacting the results, because the lack of data for SK cannot be fully solved by oversampling, as no new information is created. In an ideal training setting, each class would contain equal amounts of unique images.

Furthermore, due to the size of the data set, it was not possible to ensure that no duplicate images, or images showing different angles of the same lesion, ended up in both the training set and the validation or testing set.

This could make the results appear better than they should.

5.4 Other Aspects Than Resolution and Data Availability

Boman and Volminger also found the ISIC Dermoscopic Archive to have higher-quality images than the other sources, but they still decided to use several databases which were deliberately excluded in this study.

Although the quality and resolution of some of these images were surely good, and additional training images were lost because of this, the decision can be justified from other perspectives. As was discussed in chapter 3.1, many of

53 Boman and Volminger, Evaluating a deep convolutional neural network for classification of skin cancer.


these databases provided images of skin lesions located in a specific region of the body. For instance, some images displayed a type of acne and thereby also, inevitably, included characteristics of a face. Although good efforts had been made to hide this for the sake of anonymity, these characteristics could falsely train the CNN to classify faces rather than the actual skin disease. Including such images and then testing the network on classifying such a disease could improve the achieved results, but much like the case with the imbalanced data, a model with a higher testing accuracy does not necessarily outperform other models when it comes to generalization.

5.5 Ethicality of the Study

Ethics is always a pressing matter in scientific studies, and even more so in studies involving individuals and medicine, such as this one. A clear ethical stance must be taken in this report, as this research field could potentially threaten the medical profession, and specifically dermatologists, who might fear that their jobs are at risk. Despite this study being directly inspired by a report in which a CNN outperformed human dermatologists, it is important to note that this study simply aims to give dermatologists and other stakeholders a powerful tool to help them in their work. It is not in this study's interest to replace dermatologists in any shape or form, but one must always keep this concern in mind while working on experiments such as this one.


Chapter 6 Conclusions

The results of this study demonstrate that balancing the data can be used successfully at a small cost. Although the models trained in this study outperformed the previous models developed by Boman and Volminger, the methodological differences mean that no major conclusions can be drawn from this apparent improvement; it would be scientifically disingenuous to argue that it shows anything of statistical significance. The findings on the balance of training data are more conclusive: this study found that oversampling and downsampling can increase specificity and sensitivity greatly at a low cost in accuracy.

6.1 Further Research

Despite the interesting findings made in this study, there are many areas which could be researched more extensively. The most critical improvement to be made is the size of the data sets used. Due to the limiting time factor, this study did not successfully explore other options to compensate for the sudden demise of DermQuest. Researchers wishing to continue exploring the classification of skin cancer should allocate more of their time to gathering suitable data for training the network.

Another aspect which could be improved upon is the testing process. As previously discussed, accuracy alone is rarely sufficient when evaluating complex methods and tasks, and therefore further studies should aim to produce additional metrics, which should then be analyzed and discussed from a more statistical standpoint. A good way of presenting results from binary classification is to plot a ROC-curve, as shown in earlier chapters.
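As a sketch of this suggestion, scikit-learn can compute the points of a ROC-curve from binary labels and prediction scores (the labels and scores below are synthetic, purely for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Synthetic melanoma-vs-nevus test set: 1 = melanoma, scores in [0, 1].
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
scores = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, size=200), 0.0, 1.0)

# Each threshold yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(f"AUC = {auc(fpr, tpr):.2f}")

# The curve plots sensitivity (TPR) against 1 - specificity (FPR), e.g.:
# import matplotlib.pyplot as plt
# plt.plot(fpr, tpr); plt.xlabel("1 - specificity"); plt.ylabel("sensitivity")
```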


Bibliography

Abadi, Martín; Agarwal, Ashish; Barham, Paul; Brevdo, Eugene; Chen, Zhifeng; Citro, Craig; Corrado, Greg; Davis, Andy; Dean, Jeffrey;

Devin, Matthieu; Ghemawat, Sanjay; Goodfellow, Ian; Harp, Andrew;

Irving, Geoffrey; Isard, Michael; Jozefowicz, Rafal; Jia, Yangqing;

Kaiser, Lukasz; Kudlur, Manjunath; Levenberg, Josh; Mané, Dan;

Schuster, Mike; Monga, Rajat; Moore, Sherry; Murray, Derek; Olah, Chris; Shlens, Jonathon; Steiner, Benoit; Sutskever, Ilya; Talwar, Kunal; Tucker, Paul; Vanhoucke, Vincent; Vasudevan, Vijay; Viégas, Fernanda; Vinyals, Oriol; Warden, Pete; Wattenberg, Martin; Wicke, Martin; Yu, Yuan; and Zheng, Xiaoqiang, “TensorFlow,” GitHub.

Accessed June 6, 2020. https://github.com/tensorflow.

American Cancer Society, Cancer facts & figures, Atlanta, GA: American Cancer Society, 2016. https://www.cancer.org/research/cancer-facts- statistics/all-cancer-facts-figures/cancer-facts-figures-2016.html.

Bevona, Caroline; Goggins, William; Quinn, Timothy; Fullerton, Julie; and Tsao. Hensin, “Cutaneous Melanomas Associated with Nevi,”

Archives of dermatology, no. 139 (2003): 1620–1624.

Boman, Joakim, and Volminger, Alexander, “Evaluating a deep convolutional neural network for classification of skin cancer,” bachelor’s thesis, KTH Royal Institute of Technology, 2018.

Cancer Treatment Centers of America, “What is Cancer?,” Cancer Treatment Centers of America. Accessed May 27, 2020.

https://www.cancercenter.com/what-is-cancer.

Chawla, Nitesh V, "Data mining for imbalanced datasets: An overview,"

in Data mining and knowledge discovery handbook, 875-886.

Springer, Boston, MA, 2009.

Ebling, John and Montagna, William, “Human skin,” Encyclopædia

Britannica, June 8, 2020. https://www.britannica.com/science/human- skin.

Esteva, Andre; Kuprel, Brett; Novoa, Roberto; Ko, Justin; Swetter, Susan;

Blau, Helen; and Thrun, Sebastian, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, no. 542 (2017):

115–118.

Guadarrama, Sergio, “TensorFlow-Slim image classification model library readme,” GitHub. Accessed June 6, 2020.

https://github.com/tensorflow/models/tree/master/research/slim.


Halpern, Allan; Marghoob, Ashfaq; and Reiter Ofer, “Melanoma overview - A dangerous skin cancer,” The Skin Cancer Foundation, April 2019, https://www.skincancer.org/skin-cancer-information/melanoma/.

Hameed, Nazia; Abu Hassan, Kamal; and Hossain, M. A., “A comprehensive survey on image-based computer aided diagnosis systems for skin cancer,” conference paper presented at the 2016 10th International Conference on Software, Knowledge, Information Management & Applications (SKIMA), Chengdu, China, December 15–17, 2016.

The International Skin Imaging Collaboration (ISIC), “Melanoma Project archive,” ISIC. Accessed June 6, 2020. https://www.isic-archive.com.

Izikson, Leonid; Sober, Arthur; Mihm Jr., Martin; and Zembowicz, Artur, “Prevalence of Melanoma Clinically Resembling Seborrheic Keratosis: Analysis of 9204 Cases,” Archives of dermatology, no. 138 (2002): 1562–1566.

Japkowicz, Nathalie, "The class imbalance problem: Significance and strategies," In Proc. of the Int’l Conf. on Artificial Intelligence, 2000.

Raj, Bharath, “A Simple Guide to the Versions of the Inception Network,”

Towards Data Science, May 29, 2018.

https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the- inception-network-7fc52b863202.

Rodríguez, Jorge Hernández; Fraile, Francisco Javier Cabrero; Conde, María José Rodríguez; and Llorente, Pablo Luis Gómez, “Computer aided detection and diagnosis in medical imaging: a review of clinical and educational applications,” conference paper presented at the Fourth International Conference on Technological Ecosystems for Enhancing Multiculturality, Salamanca, Spain, November 2016.

Saha, Sumit, “A Comprehensive Guide to Convolutional Neural Networks – the ELI5 way,” Towards Data Science, December 15, 2018.

https://towardsdatascience.com/a-comprehensive-guide-to- convolutional-neural-networks-the-eli5-way-3bd2b1164a53.

Skin Cancer Foundation, “Skin Cancer Facts & Statistics,” Skin Cancer Foundation. Accessed May 27, 2020. https://www.skincancer.org/skin- cancer-information/skin-cancer-facts/.

Stanford Vision Lab, “The Large Scale Visual Recognition Challenge 2012 (ILSVRC2012),” ImageNet. Accessed June 6, 2020. http://www.image-net.org/challenges/LSVRC/2012/.

Szegedy, Christian; Liu, Wei; Jia, Yangqing; Sermanet, Pierre; Reed, Scott;

Anguelov, Dragomir; Erhan, Dumitru; Vanhoucke, Vincent, and Rabinovich, Andrew, “Going Deeper with Convolutions,” arXiv 1409, no. 4842 (2014). https://arxiv.org/abs/1409.4842.

Szegedy, Christian; Vanhoucke, Vincent; Ioffe, Sergey; Shlens, Jonathon; and Wojna, Zbigniew, “Rethinking the Inception Architecture for

Computer Vision,” arXiv 1512, no. 00567 (2015).

https://arxiv.org/abs/1512.00567.

Torrey, Lisa, and Shavlik, Jude, “Transfer Learning,” in Olivas, E. S.; Guerrero, J. D.; Martinez-Sober, M.; Magdalena-Benedito, J. R.; and Serrano López, A. J. (eds.), Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, 242–264, 2010.

Weiss, Gary M., and Foster Provost, "Learning when training data are costly:

The effect of class distribution on tree induction," Journal of artificial intelligence research 19 (2003): 315-354.

WHO, “Cancer,” WHO, September 12, 2018. Accessed May 27, 2020.

https://www.who.int/news-room/fact-sheets/detail/cancer.


TRITA-EECS-EX-2020-408
