DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2019
Investigating Skin Cancer with Unsupervised Learning
KEIVAN MATINZADEH RAFAEL DOLFE
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Degree Project in Computer Science
Date: June 3, 2019
Supervisor: Pawel Herman
Examiner: Örjan Ekeberg
School of Electrical Engineering and Computer Science
Swedish title: Undersökande av hudcancer med oövervakat lärande
Abstract
Skin cancer is one of the most commonly diagnosed cancers in the world.
Diagnosis of skin cancer is commonly performed by analysing skin lesions on the patient's body. Today's medical diagnostics use an established set of labels for different types of skin lesions. Another way of categorising skin lesions could be to let a computer analyse a data set of skin lesion images without any prior knowledge of the data. The resulting categorisation could then be compared to the medical labels already assigned to each image, which could provide insight into underlying structures of skin lesion data.
To investigate this, three unsupervised learning algorithms (K-means, agglomerative clustering, and spectral clustering) were used to produce cluster partitionings of a data set of skin lesion images. We found no clear cluster partitionings and no connection to the existing medical labels. The highest-scoring partitioning was produced by spectral clustering with the number of clusters set to two. Further investigation into the structure of this partitioning revealed that one cluster contained essentially every image.
Although the score was relatively low, it indicates that the underlying structure may be best represented by a single cluster.
Sammanfattning
Skin cancer is one of the most common types of cancer in the world. The most common way to diagnose skin cancer is for a dermatologist to analyse skin lesions on a patient's body. Today's medical diagnostics use an established set of labels for different types of skin lesions. An alternative to this kind of diagnosis could be to let a computer, without prior knowledge of the data (images of skin lesions), carry out the analysis. This categorisation could then be compared with the existing medical categories assigned to each image.
To investigate this, three unsupervised learning algorithms were used to produce cluster partitionings of a data set containing images of skin lesions. These algorithms were K-means, agglomerative clustering, and spectral clustering. We found no obvious cluster partitionings and no connection to the current medical labels. The cluster partitioning that received the highest score under internal evaluation was the one generated by spectral clustering, obtained when the number of clusters was set to two. A deeper investigation into the structure of this partitioning showed that one of the clusters contained essentially every image.
Although the Silhouette score for this partitioning was low, it indicates that the underlying structure may best be represented by a single cluster.
Acknowledgements
We would first like to thank our supervisor, Associate Professor Pawel Herman
at KTH for his helpful feedback and suggestions. He helped us formulate
meaningful research questions and always pointed us in the right direction
throughout the course of the project. We would also like to thank the creators
of the HAM10000 data set. Without the data set, this project would not have
been possible.
Contents
1 Introduction
   1.1 Problem Statement & Research Questions
   1.2 Scope
2 Background
   2.1 Machine Learning - An Unsupervised Learning Approach
      2.1.1 Unsupervised Learning
      2.1.2 Clustering
   2.2 Clustering Algorithms
      2.2.1 K-Means
      2.2.2 Agglomerative Hierarchical Clustering
      2.2.3 Spectral Clustering
   2.3 Silhouette Index
   2.4 Rand Index
   2.5 Dimensionality
      2.5.1 The Curse of Dimensionality
      2.5.2 Principal Component Analysis (PCA)
   2.6 Related Work
      2.6.1 Data Clustering
      2.6.2 Image Segmentation
3 Method
   3.1 Data Set
   3.2 Implementation
      3.2.1 Approach
      3.2.2 Scikit-learn
      3.2.3 OpenCV
      3.2.4 Setup
      3.2.5 Downloading and Reading
      3.2.6 Preprocessing
      3.2.7 Clustering Algorithms
      3.2.8 Silhouette and Rand Index Scores
4 Result
   4.1 Preprocessing the Data
      4.1.1 Reading Images as Grayscale and Resizing
      4.1.2 Applying PCA
   4.2 Clustering the Data
   4.3 Computing Silhouette Scores
   4.4 Computing Rand Index Scores
   4.5 Analysing the Data
      4.5.1 Investigating the Findings of the Rand Index Score
      4.5.2 Investigating the Findings of the Silhouette Index Score
5 Discussion
   5.1 Influence of Feature Extraction
   5.2 Source of Errors
   5.3 Retrospective
6 Conclusion
Bibliography
Chapter 1 Introduction
Skin cancer is one of the most commonly diagnosed cancers in the US, and it is most common amongst non-Hispanic whites, with a rate of 27 per 100 000 [1]. It is therefore important to conduct further research in order to gain a better understanding of the disease. The most common way of diagnosis is for a dermatologist to analyse lesions on the patient's body. Dermatologists can then analyse large numbers of independent lesions to gain further understanding of the lesions themselves. Large image sets of lesions can also be used in computer-aided diagnostics (CAD), that is, having machine learning algorithms intelligently analyse image sets to, for example, gain a deeper understanding of the data.
Generally, many studies have been conducted for different types of cancer for the purpose of disease prognosis and prediction of treatment outcomes [2].
Most of these studies use machine learning algorithms belonging to a subset of machine learning called supervised learning. A supervised learning model learns to generalise from labeled input-output pairs in order to respond correctly to all possible inputs [3]. Categorising data based on prior knowledge is called classification. A different approach is to use machine learning algorithms belonging to the subset called unsupervised learning.
These types of algorithms are not trained; they do not know beforehand what to look for. Instead, the algorithms draw their own conclusions from the data and categorise it in whatever way they deem appropriate. This type of categorising, without any prior knowledge, is called clustering. "The aim of unsupervised learning is to find clusters of similar inputs in the data without being explicitly told that these datapoints belong to one class and those to a
different class. Instead, the algorithm has to discover the similarities for itself."
[3]. One of the main purposes of clustering is to gain insight about data, to find underlying structures and anomalies, and to generate hypotheses [4].
1.1 Problem Statement & Research Questions
Artificial intelligence has been used to create models capable of diagnosing skin cancer with an accuracy on par with trained dermatologists [5], and in some cases even outperforming them [6]. This demonstrates the importance and potential of using artificial intelligence in the field of skin cancer. Meanwhile, a whole subfield of machine learning seems to have been partially ignored in the case of skin cancer, namely unsupervised learning.
Some research applying unsupervised learning to other types of cancer has been conducted, e.g. computing the survival rate of lung cancer patients [7]. While studies like this are somewhat relevant because of their use of unsupervised learning, the data used is numerical. An interesting approach is instead to apply unsupervised learning techniques to images. This suits skin cancer well, since the tumours are visible on the skin as lesions and since many different types of tumours fall under the same umbrella of skin cancer. This opens up the possibility of investigating whether the different types of skin cancer have an underlying "natural" and potentially unknown structure.
In this thesis we compare the cluster partitionings of skin lesion images produced by various clustering algorithms, and we also compare these results to the true labels of the data set. The aim of the thesis is to answer the following questions:
• "Do the classes generated through the different clustering algorithms resemble the ground truth labels of the data set?"
• "Does the result of the clustering provide any insight on any underlying structure of the data?"
Our hypothesis is that K-means will perform well with regard to recovering the true structure of the data, as observed in previous research conducted by Souto et al. [8] on gene expression data.
Our research differs from the work presented in [8] in the type of data used. We use images of lesions from the Tschandl, Rosendahl, and Kittler [9] data set,
while the data used in [8] are gene expression data.
1.2 Scope
The scope is limited by the data set and by the number of algorithms and validation measures. Three algorithms were chosen so as to increase the likelihood of obtaining a meaningful result. However, many clustering algorithms are available, and research on new ones is continuously published.
To determine the quality of a cluster partitioning, a so-called internal validation index is used. In some instances, a specific index is chosen depending on which clustering algorithm is used. In this thesis only one is used: the Silhouette index. While many different indices exist, the literature supports using only the Silhouette index to validate different types of clustering algorithms, since the Silhouette index depends only on the partitioning of the data, not on the clustering algorithm used to obtain that partitioning [10]. The same reasoning motivates the choice of a single external validation index. An external validation index is used to compare two cluster partitionings with each other.
Because of these limitations, our scope is constrained to the data set described in 3.1 and the algorithms and validation methods shown in Table 3.1.
Chapter 2 Background
2.1 Machine Learning - An Unsupervised Learning Approach
Machine learning is a branch of Artificial Intelligence in which computational methods are used to improve performance or to make accurate predictions [11], and where accuracy is measured by how well the chosen actions reflect the correct ones [3].
2.1.1 Unsupervised Learning
Unsupervised learning is a subset of machine learning. Contrary to the better-known branch of machine learning, supervised learning, where the right answer is already labeled in the data, unsupervised learning uses no labels and instead tries to categorise inputs by common features. Without being told which data points belong to which class, the algorithm has to cluster the points together based on their similarities. The aim of unsupervised learning is for an algorithm to find clusters of similar inputs in the data without being explicitly told that some data points belong to different classes [3].
2.1.2 Clustering
Clustering, or cluster analysis, is the formal study of algorithms and methods for grouping objects [12]. In cluster analysis the groups are of interest in themselves, and their assessment is intrinsic. The groups do not need to reflect some reference set of classes (which is a requirement in classification). The goal of cluster analysis is thus not to establish rules for separating future data into categories, but to find a valid organisation of the data.
The clustering structure is formally represented as a set of subsets $C = \{C_1, \ldots, C_k\}$ of $S$ such that $\bigcup_{i=1}^{k} C_i = S$ and $C_i \cap C_j = \emptyset$ for $i \neq j$ [13].
To determine how similar two objects are, two main types of measures are used: distance measures and similarity measures, where the distance between two instances $x_i$ and $x_j$ is commonly denoted $d(x_i, x_j)$.
Figure 2.1 shows a set of data points before and after being clustered. This partitioning contains three different clusters (denoted by the different colours).
Figure 2.1: A before and after image of data points that have been clustered by the K-means algorithm [14].
2.2 Clustering Algorithms
This section outlines the theory behind the three clustering algorithms that will
be used.
2.2.1 K-Means
The K-means algorithm is a popular partitioning algorithm that was first proposed over 60 years ago, in 1955 [4]. Given a set of data points $x_1, \ldots, x_n$ and a set of $k$ cluster centres $c_1, \ldots, c_k$, the algorithm tries to position the cluster centres in the input space such that there is one cluster centre in the middle of each cluster of data points.
Initially, the cluster centres are placed randomly in the input space. Then, for each data point $x_i$, the distance $D(x_i, c_j)$ (e.g. the Euclidean distance) between $x_i$ and each cluster centre $c_j$ is calculated. The data point is assigned to whichever cluster centre $c_j$ minimises the distance function $D(x_i, c_j)$. When every data point has been assigned to a cluster centre, the algorithm calculates the mean of all data points assigned to each centre and updates the position of that centre to this mean, thus moving each cluster centre to the centre of its assigned data points. The algorithm iterates until the cluster centres stop moving [3].
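For illustration, the procedure above translates directly into a few lines of NumPy. This is a minimal sketch under our own naming and initialisation choices (picking k data points as the initial centres), not the implementation used in the thesis:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means on an (n, d) data matrix X with k cluster centres."""
    rng = np.random.default_rng(seed)
    # Initialise the centres at k randomly chosen data points.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point x_i to the centre c_j minimising D(x_i, c_j).
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its assigned points.
        new_centres = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centres[j]
                                for j in range(k)])
        if np.allclose(new_centres, centres):  # centres stopped moving
            break
        centres = new_centres
    return labels, centres
```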
2.2.2 Agglomerative Hierarchical Clustering
Agglomerative clustering is a hierarchical clustering method [13]. Hierarchical clustering is a form of cluster analysis that constructs clusters by recursively partitioning instances in either a top-down or bottom-up fashion, based on some similarity measure such as the sum of squares. The agglomerative method is a bottom-up approach: each object initially forms its own cluster, and clusters are then merged until the desired structure is obtained.
Representing the nested grouping of objects, together with the similarity levels at which groupings change, results in a dendrogram. Cutting the dendrogram at the desired similarity level yields a clustering of the data objects. This is one of the strengths of hierarchical clustering: the user has the option to choose different partitions according to the desired similarity level.
One of the weaknesses of hierarchical clustering is its time complexity, which is at least $O(n^2)$, where $n$ is the total number of instances.
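As a sketch of the bottom-up merging and the dendrogram cut described above, SciPy's hierarchy module can be used as follows; the toy data and the cut level are arbitrary choices for the example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))   # toy data standing in for image vectors

# Bottom-up merging; Ward's criterion is a sum-of-squares measure.
Z = linkage(X, method="ward")  # Z encodes the full dendrogram

# Cut the dendrogram so that at most three clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
```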
CHAPTER 2. BACKGROUND 7
2.2.3 Spectral Clustering
Spectral clustering is a method of clustering in which the dimensionality of the data is first reduced before some clustering method, e.g. a partitioning algorithm such as K-means, is applied to it [15].
The data points are interpreted as vertices, and the similarities between points are interpreted as edges. Through the spectrum (the eigenvalues) of the similarity matrix A, the dimensionality is reduced, and a classical clustering algorithm is then applied in this lower-dimensional space. This can yield better results in cases where algorithms such as K-means, which look for "round blobs" in the data, would fail [16].
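The following scikit-learn sketch illustrates the point on the classic two-moons toy data, where the K-means preference for round blobs fails; the affinity and neighbour count are illustrative parameter choices:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means looks for round blobs and splits each moon in half ...
km_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)

# ... while spectral clustering, built on a graph of pairwise
# similarities, recovers the two moons.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0)
sc_labels = sc.fit_predict(X)
```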
2.3 Silhouette Index
The Silhouette index measures how well an object fits its assigned cluster [10]. The index ranges from -1 to +1, where a low value indicates that the object would potentially fit better in another cluster, while a high value indicates that the current cluster is a good match. An index close to 0 means that the object could be assigned to another cluster without making the clustering any worse [17].
To create the silhouette, two things are needed: a set of data clustered into k clusters by some clustering technique, and the collection of proximities between all objects in the clusters. For each object, a value s(i) is then calculated and plotted in a graph.
Take any object $i$ in the data set and denote by $A$ the cluster to which it is assigned; $a(i)$ is then defined as the average dissimilarity from $i$ to all other points in $A$. The minimum of the average dissimilarities from $i$ to each cluster to which $i$ is not assigned is denoted $b(i)$. The Silhouette index $s(i)$ of object $i$ is then calculated as
$$
s(i) =
\begin{cases}
1 - a(i)/b(i), & \text{if } a(i) < b(i) \\
0, & \text{if } a(i) = b(i) \\
b(i)/a(i) - 1, & \text{if } a(i) > b(i)
\end{cases}
\qquad (2.1)
$$
The Silhouette index of a whole cluster $C$, $\bar{s}(C)$, is the mean of $s(i)$ over all objects $i$ in cluster $C$. The Silhouette index of a whole cluster partitioning is then the mean of each cluster's Silhouette index.
Figure 2.2: "An illustration of the elements involved in the computation of s(i), where the object i belongs to cluster A." [10]
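As a sketch, Equation (2.1) can be computed directly from a matrix of pairwise distances. The compact expression (b - a)/max(a, b) used below is algebraically equal to the three cases of Equation (2.1); scikit-learn's silhouette_samples computes the same quantity:

```python
import numpy as np

def silhouette_values(X, labels):
    """Compute s(i) of Eq. (2.1) for every object in X."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    # Pairwise Euclidean distances between all objects.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() <= 1:        # a lone object: s(i) is left at 0
            continue
        # a(i): average dissimilarity to the other points in its own cluster.
        a = D[i, own].sum() / (own.sum() - 1)
        # b(i): smallest average dissimilarity to any other cluster.
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)  # equal to the three-case form
    return s
```

The mean of these values over a cluster gives $\bar{s}(C)$, and the mean of the cluster scores gives the score of the whole partitioning.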
2.4 Rand Index
The Rand index is an external validation index named after William M. Rand [18]. An external validation index measures performance by matching a clustering structure to a priori information [12].
Suppose $U = \{u_1, \ldots, u_R\}$ and $V = \{v_1, \ldots, v_C\}$ represent two different partitionings of the object set $S = \{O_1, \ldots, O_n\}$, so that the classes $u_i$ and $v_j$ are subsets of $S$ and $\bigcup_{i=1}^{R} u_i = S = \bigcup_{j=1}^{C} v_j$. A contingency table is created in which $n_{ij}$ denotes the number of objects common to classes $u_i$ and $v_j$. The table represents the class overlap between the two partitions $U$ and $V$.
The R×C contingency table:

Class   v_1     v_2     ...   v_C     Sums
u_1     n_11    n_12    ...   n_1C    n_1·
u_2     n_21    n_22    ...   n_2C    n_2·
 ·       ·       ·             ·       ·
u_R     n_R1    n_R2    ...   n_RC    n_R·
Sums    n_·1    n_·2    ...   n_·C    n_·· = n
Object pairs in the table are classified as four different types:
(i) a: objects in the pair are placed in the same class in U and in the same class in V
(ii) b: objects in the pair are placed in different classes in U and in different classes in V
(iii) c: objects in the pair are placed in different classes in U and in the same class in V
(iv) d: objects in the pair are placed in the same class in U and in different classes in V
where types (i) and (ii) are interpreted as agreements in the classification of the objects from a pair while types (iii) and (iv) represent disagreements.
The Rand index is then calculated as $R = \frac{a+b}{a+b+c+d} = \frac{a+b}{\binom{n}{2}}$, since the four types together count every one of the $\binom{n}{2}$ object pairs.
The Rand index thus represents the frequency of agreements over the total number of object pairs.
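For illustration, the pair-counting definition translates directly into code; scikit-learn's adjusted_rand_score is the chance-corrected variant of the same quantity:

```python
from itertools import combinations

def rand_index(U, V):
    """Frequency of object pairs on which partitions U and V agree."""
    n_pairs, agreements = 0, 0
    for i, j in combinations(range(len(U)), 2):
        n_pairs += 1
        # Types (i) and (ii): both partitions treat the pair the same way.
        if (U[i] == U[j]) == (V[i] == V[j]):
            agreements += 1
    return agreements / n_pairs
```

For two partitionings that are identical up to relabelling, e.g. rand_index([0, 0, 1, 1], [1, 1, 0, 0]), the index is 1.0.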
2.5 Dimensionality
By taking a set of input values, machine learning algorithms produce an output for that input vector. The vector contains several real numbers, written as, e.g., $\vec{v} = (0.1, 0.2, -0.3, 0.4, -0.5)$. The size of the vector is what is called the dimensionality of the input. Plotting $\vec{v}$ would require one dimension in the data space for each element of the vector; in this example, $\vec{v}$ has five dimensions [3].
2.5.1 The Curse of Dimensionality
As the number of input dimensions gets larger, more data will be needed for the algorithm to be able to generalise in an accurate manner [3].
One can imagine covering a line with 10 points in 1D space, where the distance $d(x_i, x_j)$ between two neighbouring points is 1/10 of the length of the line. Representing the data in 2D space while keeping the same distance between points entails filling out the plane with $10^2 = 100$ points. Taking it further to 3D means having to cover a cube, resulting in $10^3$ points.
2.5.2 Principal Component Analysis (PCA)
Principal components [19] allow the summarisation of the data set with a smaller number of representative variables that collectively explain most of the variability in the original set.
Principal Component Analysis (PCA) is a technique for finding principal components: a lower-dimensional representation of the data that captures as much information as possible.
Given a vector $\vec{x}$ of $p$ random variables, we try to reduce the number of dimensions to a smaller value $q$. To do so, PCA finds linear combinations $a_1'x, a_2'x, \ldots, a_q'x$ (where $'$ denotes the transpose), which are the principal components of the data. These components have maximum variance for the data. The vectors $a_1, a_2, \ldots, a_q$ are the eigenvectors of the covariance matrix $S$ corresponding to the $q$ largest eigenvalues. The variance of each principal component is given by its respective eigenvalue. The proportion of the total variance in the original data set accounted for by the first $q$ principal components is then the ratio of the sum of the first $q$ eigenvalues to the sum of the variances of all $p$ original variables [20].
Several methods exist for choosing the appropriate number of PCs for a data set. One intuitive method is the scree test [21]. Given a scree plot (see Figure 2.3 below), a curve of the eigenvalues versus their rank, one looks for an "elbow" in the curve. This elbow corresponds to an inflexion point. It is then sufficient to determine the point where the sign of the second-order derivative changes and use that as a cut-off for choosing the number of principal components: as the sign changes, the curve flattens, meaning the following eigenvalues account for less of the variance in the data.
Figure 2.3: A scree plot [22].
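The description above maps onto a few lines of NumPy. This is a sketch of the mathematics, not the thesis implementation (which uses scikit-learn's PCA; its explained_variance_ratio_ attribute reports the same proportions):

```python
import numpy as np

def pca(X, q):
    """First q principal components of X and their share of total variance."""
    Xc = X - X.mean(axis=0)               # centre the variables
    S = np.cov(Xc, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh, since S is symmetric
    order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :q]          # data expressed in the first q PCs
    explained = eigvals[:q].sum() / eigvals.sum()
    return scores, explained
```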
2.6 Related Work
The two main approaches to using clustering in medical diagnostics have been to use it either for clustering numerical data or for image segmentation.
2.6.1 Data Clustering
As mentioned before, clustering is used to find underlying structures in data.
Souto et al. [8] conducted a study comparing different clustering methods and proximity measures on cancer gene expression data. They found that a mixture of multivariate Gaussians and K-means produced the best results in terms of the recovery of the actual structure of the data sets. Rand index was used for external validation.
Bhattacharjee et al. [23] were able to find subclasses of lung cancer by applying hierarchical and probabilistic clustering to expression data.
Most of the relevant research uses data in the form of gene expression rather than image data, which would be more relevant for this study. However, the procedure of Souto et al. [8] for evaluating different cluster partitionings generated by different algorithms is similar to our approach, explained in 3.2.1.
2.6.2 Image Segmentation
Even though our work focuses on clustering images for the sake of finding underlying structures, image segmentation is worth mentioning, as clustering is widely used in that field.
Multiple studies have explored different ways of segmenting images using clustering algorithms. An early study by Coleman et al. [24] used a K-means algorithm to segment images. They concluded that while segmentation performed by a human or a trained segmenter would produce a more satisfying result, the unsupervised approach had its own advantages: "the supervised method is incapable of satisfactory performance in situations where the statistics of the scene vary substantially". This means that clustering has an advantage in cases where the images look substantially different from one another. They also noted that no training phase is needed, omitting a tedious step from the procedure.
More recent studies have looked at using segmentation for medical images. Li et al. [25] proposed a new image segmentation algorithm based on spatial fuzzy clustering. The data used were medical images of different modalities, including a CT scan of liver tumours and an MRI slice of cerebral tissues. Ng et al. [26] proposed a new method combining K-means with the watershed algorithm for segmentation of medical images, producing segmentation maps with 92% fewer partitions than those produced by the conventional watershed algorithm.
Chapter 3 Method
This chapter details the methods used to perform the experiment. It aims to describe the implementation in enough detail that the results can be reproduced.
3.1 Data set
One data set containing 10015 images of skin lesions was used. This data set is called the HAM10000 training set, and its images were collected over a period of 20 years at the Department of Dermatology of the Medical University of Vienna and the skin cancer practice of Cliff Rosendahl in Queensland, Australia.
The images are organised into seven categories: akiec - Actinic keratoses, bcc - Basal cell carcinoma, bkl - Benign keratosis, df - Dermatofibroma, nv - Melanocytic nevi, mel - Melanoma, vasc - Vascular skin lesions. These are regarded as the ground truth labels of the data.
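For reference, the label distribution can be inspected from the metadata file distributed with the data set; we assume here the standard file name HAM10000_metadata.csv, whose dx column holds the category codes:

```python
import pandas as pd

# Assumed file name of the metadata shipped with the HAM10000 training set.
meta = pd.read_csv("HAM10000_metadata.csv")
print(meta["dx"].value_counts())  # image counts per ground-truth label
```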
3.2 Implementation
3.2.1 Approach
Each algorithm is run a total of 13 times, with the number of desired clusters K ranging from 2 to 14. K ranges between these values because of the a priori knowledge of the number of true labels in the data set: 7. For each algorithm, each cluster partitioning from the different values of K is evaluated by an internal validation index, which assesses the validity of the generated clusters (see 2.3 for more information). The best-scoring cluster partitioning for each algorithm is then chosen. Only one internal validation index is used, namely the Silhouette index.
Since the data set being used has already been labeled, the cluster partitioning that scores highest on the Silhouette index for each algorithm is compared to the true labels with the external validation method, the Rand index.
Algorithms and evaluation methods used:

Algorithms            Spectral clustering, Agglomerative clustering, K-means
Internal evaluation   Silhouette index
External evaluation   Rand index

Table 3.1: All algorithms and validation measures used in this thesis.
A comparison between the best cluster partitionings of the algorithms, based on the internal validation, is then conducted. For each algorithm, the generated partitionings are evaluated by the internal validation index; e.g. all thirteen partitionings generated by K-means are evaluated with the Silhouette index, and the one that scores highest is chosen. The highest-scoring partitioning of each algorithm is selected, and these are then compared on the same criterion as before, picking out the single partitioning with the highest Silhouette index score. That partitioning is then compared to the true labels using the external validation criterion. The result may be a cluster partitioning that differs from the original classification of the data set. This can yield information about potential underlying structures in the data that the human eye might have missed.
Figure 3.1 below is a flowchart describing the whole process, from downloading the data set, through applying the algorithms to the data and measuring the results, to analysing the data.
Figure 3.1: Flowchart of the process: download and read the data set; preprocess the data set (resize and PCA); apply the clustering algorithms to the data; compute Silhouette and Rand index scores; analyse the data.
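Put together, the flowchart corresponds to a loop of roughly the following shape. This is a hedged sketch rather than the exact implementation: the image folder, the 64x64 resize, the 50 principal components, and the use of K-means as the representative algorithm are all assumptions for the example.

```python
import glob
from pathlib import Path

import cv2
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Download and read: greyscale images flattened into row vectors.
paths = sorted(glob.glob("HAM10000_images/*.jpg"))  # assumed folder
imgs = [cv2.resize(cv2.imread(p, cv2.IMREAD_GRAYSCALE), (64, 64)) for p in paths]
X = np.stack([im.ravel() for im in imgs]).astype(np.float64)

# Preprocess: dimensionality reduction with PCA.
X_pca = PCA(n_components=50).fit_transform(X)

# Cluster for K = 2..14 and keep the partitioning with the best Silhouette score.
best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 15):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X_pca)
    score = silhouette_score(X_pca, labels)
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels

# Compare with the ground truth. scikit-learn ships the chance-corrected
# (adjusted) Rand index; the plain index of Section 2.4 can be computed
# with the rand_index() sketch given there.
meta = pd.read_csv("HAM10000_metadata.csv").set_index("image_id")
true = meta.loc[[Path(p).stem for p in paths], "dx"].to_numpy()
print(best_k, best_score, adjusted_rand_score(true, best_labels))
```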