A comparative study on a practical use case for image clustering based on common shareability and metadata

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2018


ERIK DACKANDER

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science

Date: August 15, 2018

Supervisor: Hamid Reza Faragardi

Examiner: Elena Troubitsyna

Swedish title: En jämförande studie i ett praktiskt användningsfall för bildklustring baserat på gemensamt delade bilder och dess metadata


Abstract

As the amount of data increases every year, the need to structure data effectively grows with it. This thesis investigates and compares how four different clustering algorithms perform on a practical use case for images. The four algorithms are Affinity Propagation, BIRCH, Rectifying Self-Organizing Maps (RSOM) and Deep Embedded Clustering (DEC). The algorithms are given the image metadata as well as the image content, extracted using a pre-trained deep convolutional neural network. The results demonstrate that, while there are variations in the data, Affinity Propagation and BIRCH show the most potential among the four algorithms. Furthermore, when metadata is available it improves the results of the algorithms that can process the extreme values it can cause. For Affinity Propagation the mean share score improves by 5.6 percentage points and the silhouette score by 0.044. For BIRCH the mean share score improves by 1.9 percentage points and the silhouette score by 0.051. RSOM and DEC could not process the metadata.


Sammanfattning

As the amounts of data increase with every passing year, so does the need to structure the data well. This work aims to investigate and compare how well four different clustering algorithms work for a practical use case with images. The four algorithms used are Affinity Propagation, BIRCH, Rectifying Self-Organizing Maps and Deep Embedded Clustering. For the clustering, the algorithms had access to the metadata of the images as well as their content, extracted with the help of a deep convolutional neural network.

The results show that, even though there are large variations in the outcomes, Affinity Propagation and BIRCH show the greatest potential of the four algorithms. Furthermore, the metadata, when it is available, appears to improve the results for the clustering algorithms that could handle the extreme values the metadata could give rise to. For Affinity Propagation the mean share score improved by 5.6 percentage points and its silhouette index increased by 0.044. BIRCH's mean share score increased by 1.9 percentage points and its silhouette index improved by 0.051. RSOM and DEC could not process the metadata.


Chapter 1 Introduction

Machine learning is an area that is constantly evolving, and new research pushes its boundaries further each year. It can be used in many areas where one wants to process data effectively. Many industries have been successful in using machine learning to predict and personalize user preferences for marketing, recommendations and other valuable areas that can improve the user experience as a whole. Examples of fields using machine learning today are medicine, finance and technology.

The rapid development in machine learning has led the IT industry to begin utilizing it to improve many aspects of its applications while simultaneously reducing costs. One such use is for companies with large amounts of user data to structure or present the data in a way that increases user engagement with the application. For example, a structuring that increases image sharing can attract new users to an application. This can increase user engagement within the application while also giving the developers a better understanding of their user base, which in turn can improve their revenue streams.

One way of structuring the user data automatically is to use clustering from the machine learning field. Clustering algorithms use the most important features of a data set to find correlations and patterns in the data. As the amount of stored data has grown significantly in recent years, the importance of structuring the data has increased, and the use of cluster analysis is therefore expected to grow with it [15][32].


1.1 Problem statement

There is today a lack of comparative studies of how clustering algorithms perform on real data in practical use cases, even though the usage of these algorithms is expected to grow [15]. With a growing practical use of clustering, the knowledge of the advantages and disadvantages of the clustering algorithms also needs to increase.

The main focus of this research is to compare and evaluate how different clustering algorithms perform on images combined with the date and geolocation of the images. The clustering algorithms compared in this research are Affinity Propagation, BIRCH, Rectifying Self-Organizing Maps (RSOM) and Deep Embedded Clustering (DEC). These algorithms were chosen for different reasons. Affinity Propagation is a common algorithm in the industry and has a standardized way of choosing the number of clusters. BIRCH is an older algorithm that has the benefit of being simple and running very fast, and it gives a benchmark against which the newer, more advanced algorithms can be compared. Both RSOM and DEC are recently proposed ways of clustering and give an indication of how the state of the art performs compared to its older counterparts.

The research is carried out on a practical use case of a data storage application. From the results gathered one can, as future work, draw conclusions about how a full real-time implementation should be constructed to achieve the most correct image clustering.

1.1.1 Research Questions

• "Which of the clustering algorithms Affinity Propagation, BIRCH, Rectifying Self-Organizing Maps and Deep Embedded Clustering achieves the best clustering in practice, based on common shareability, using image content and metadata?"

• "Does using image metadata improve the performance of clustering algorithms?"

1.1.2 Evaluation

In order to evaluate the algorithms, the main measurement is which images have been shared together in a single share-event within the Degoo application. By assuming that images shared together should also be clustered together, we can use this as a measurement of the accuracy that each algorithm is able to achieve. This is preferred since it does not require gathering any new data for the evaluation. To cover edge cases that this evaluation method cannot handle, such as all images being put in the same cluster, other metrics are also used to balance the performance measure. The mean elapsed time of the algorithms is also measured to give an indication of their practical usability.

1.1.3 Scope of study

The limitations of this thesis stem mostly from time constraints and the available data. Exactly what information can be gathered from the data, and how it is stored, determines many of the limitations. Laws and the end-user license agreement, which limit how the user data can be handled, are the largest factors affecting what data can be used.

The number of algorithms compared was set to four to limit the amount of computation and implementation required for the research. Furthermore, measurable results require that enough users often share several pictures together instead of separately.

1.2 Research methodology

The research starts by identifying and stating the research goals. The challenges of the research are investigated and the evaluation method is formulated. After this, a literature study is performed to find related work and different approaches. In this step, the algorithms to evaluate on the data and the metrics to evaluate them with are selected.

The next step is to implement the different algorithms to be able to run them. This also requires collecting the data needed for the evaluation. The metrics are also implemented in order to calculate the different scores. Lastly, all the algorithms are run on the data and their scores are calculated.


Figure 1.1: The research methodology as a flow chart.

1.3 Degoo Media AB

The research was carried out at the Swedish cloud storage company Degoo Media AB. Degoo works extensively with the user experience of its product, which is accessible across most of today's mainstream platforms. The company has millions of users on multiple continents, which makes data analysis an important part of its work as a way to better personalize the experience for each user. The majority of the user data stored at Degoo today is in the form of images, meaning that automatic structuring of the users' photos will save the users a lot of work and time if done with high enough correctness. Automatic structuring of the users' photos is therefore an important way to improve the user experience of the application.

1.4 Scientific contribution

This thesis aims to make a larger comparison between algorithms for clustering images. Previous similar work, such as [25][23], has evaluated single clustering algorithms on small amounts of data, comparing the output to manually structured clusters to measure how the algorithms performed. This creates a very subjective measurement combined with a very small, arbitrary data set. This thesis instead uses a large data set of real-world data and compares several algorithms, measuring how well they match the way users prefer to structure their images based on common shareability. This thesis therefore aims to contribute:

• How the image metadata affects the performance of clustering of large data sets.

• How image clustering algorithms perform in combining commonly shared images in a common cluster.


Chapter 2 Background

This chapter presents the background knowledge needed to understand what the research comprises. It starts with convolutional neural networks and the basics of cluster analysis, followed by more in-depth descriptions of the algorithms used to gather the results. It also presents related work in the field as a way to help draw conclusions from the results.

2.1 Convolutional neural network

A convolutional neural network (CNN) is a form of artificial neural network used in deep learning. A neural network [21] is a directed graph structured in layers of neurons, where each neuron in one layer is connected to all the neurons in the next layer. It aims to predict or classify the outcome for a set number of features by learning from examples. During training, for each data point for which it receives the expected outcome, it adjusts the weights of its neurons to fit the results better. A main difference with a CNN [31][18], however, is that instead of looking at each feature by itself, it also takes the nearby features into account. This is well suited for visual recognition, as it looks not only at each single pixel but also at the pixels around it.


Figure 2.1: Example of how the pixels are merged in a convolutional neural network. [7]

2.2 Cluster analysis

Cluster analysis [29][10] is a data mining task aiming to find useful patterns in large quantities of data. It specifically aims to categorize the data into multiple subsets, where the data within one subset has features that are more similar to each other than to the data found in other subsets. The goal of cluster analysis is to have data that is as homogeneous as possible in each subset. Since most clustering problems are NP-hard [32], the algorithms used are often heuristics designed to iterate towards a close-to-optimal solution.

Figure 2.2: Example of data points (left) clustered into 3 sets shown in color (right) based on their Euclidean distance to each other.


Cluster analysis is used within the field of machine learning, where it is one of the unsupervised learning tasks, as there are no predefined categories to place the data in. Its supervised learning counterpart is statistical classification, which requires the user to predefine which categories the data can be put into. Due to the many variations in clustering algorithms, the algorithms are usually categorized into different paradigms [32][12][28] based on a set of rules that each paradigm fulfills. The most common clustering paradigms are hierarchical and partitional.

Partitional clustering [12][17] consists of having a predefined set of clusters. The data can then be divided among the clusters by defining a distance function between a cluster and a new data point. A common way to achieve this is to use centroid-based clusters, where each cluster is represented by its mean feature values, and then use the Euclidean distance to assign each new data point to the closest cluster.

Hierarchical clustering [1], however, usually does not require the user to predefine the number of clusters. The idea of hierarchical clustering is to create a tree-like structure of the data, from which the clusters can be determined by selecting different branches of the tree. There are two ways of building these hierarchical clusters: an agglomerative or a divisive approach. The agglomerative approach starts by defining each data point as a cluster and then combines them into larger clusters. The divisive approach starts with all data points in one cluster and then tries to divide the cluster into smaller sub-clusters.

2.2.1 Silhouette coefficient

The silhouette coefficient [30] is a measurement for validation in cluster analysis. The basic idea is to take a clustering into k clusters and, for each data point, compare the average distance to the other data points in its own cluster with the average distance to the data points in the nearest other cluster. Let i be a data point, where a(i) is the average distance from that data point to all other data points within the same cluster and b(i) is the average distance to all data points in its nearest cluster. The distance is determined by some distance function. The silhouette coefficient of data point i is then calculated as:

$$
s(i) =
\begin{cases}
1 - a(i)/b(i) & \text{if } a(i) < b(i) \\
0 & \text{if } a(i) = b(i) \\
b(i)/a(i) - 1 & \text{if } a(i) > b(i)
\end{cases}
\tag{2.1}
$$

where −1 ≤ s(i) ≤ 1, and the total silhouette score for the full clustering is the average of all the silhouette coefficients. A score close to 1 means that the data points are on average much closer to their own cluster than to the nearest other cluster, so a high silhouette score indicates that the clustering is appropriate.
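As an illustration of the definition above, the silhouette score can be sketched in a few lines of Python. This is a minimal sketch and not the implementation used in the thesis: it assumes the Euclidean distance as distance function, at least two clusters, and a score of 0 for data points in singleton clusters. The function names are chosen for this example only.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all data points (needs >= 2 clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the distance of the point to itself contributes 0 to the sum).
        a = sum(euclidean(point, other) for other in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        if a < b:
            scores.append(1 - a / b)
        elif a > b:
            scores.append(b / a - 1)
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```

Two tight, well-separated groups give a score close to 1, while labels that mix the two groups give a negative score.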

2.2.2 Calinski-Harabasz Index

The Calinski-Harabasz index [5] is another metric for validating a clustering. While the silhouette coefficient looks at the average distances of each data point, the Calinski-Harabasz index looks at the clusters as a whole. The main idea is to take the ratio between the dispersion between the clusters and the dispersion within the clusters. This ratio is then multiplied by (N − k)/(k − 1), which normalizes for the total number of data points N and the total number of clusters k. The score is thus calculated from the between-cluster dispersion $B_k$ and the within-cluster dispersion $W_k$ for k clusters and N data points as:

$$
s(k) = \frac{B_k}{W_k} \cdot \frac{N - k}{k - 1} \tag{2.2}
$$
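The index can be sketched as follows. The text does not spell out how the two dispersions are computed, so this sketch assumes the usual definitions: the between-cluster dispersion is the sum over clusters of the cluster size times the squared distance from the cluster centroid to the overall centroid, and the within-cluster dispersion is the sum of squared distances from each data point to its own centroid. It also assumes at least two clusters and a non-zero within-cluster dispersion.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index for k >= 2 clusters (assumed dispersion definitions)."""
    n = len(points)
    dim = len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)

    def centroid(rows):
        return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    overall = centroid(points)
    # Between-cluster dispersion: size-weighted spread of the centroids.
    between = sum(
        len(members) * squared_distance(centroid(members), overall)
        for members in clusters.values()
    )
    # Within-cluster dispersion: spread of the points around their centroid.
    within = sum(
        squared_distance(point, centroid(members))
        for members in clusters.values()
        for point in members
    )
    return (between / within) * (n - k) / (k - 1)
```

For the four points (0, 0), (0, 2), (10, 0), (10, 2) split into a left and a right cluster, the between-cluster dispersion is 100 and the within-cluster dispersion is 4, which gives an index of 50.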

2.2.3 BIRCH

BIRCH [36], or Balanced Iterative Reducing and Clustering using Hierarchies, is an older clustering algorithm developed in 1996. It is a hierarchical clustering algorithm that specializes in very large data sets, and its performance scales well with both the number of clusters requested and the number of data points given. One reason for its high speed is that it does not have to consider all data points in each clustering step. Another is that it minimizes the number of I/O operations by loading the data set only once, since the algorithm reduces the size of the data by a large factor while it is in memory.

The first step of the algorithm is to build a clustering feature tree, also known as a CF-tree. A CF-tree is a height-balanced tree where each node i of the tree represents a sub-cluster. A node is defined as a tuple:

$$
CF_i = (N_i, \vec{LS}_i, SS_i) \tag{2.3}
$$

where $N$ is the total number of data points in the sub-cluster, $\vec{LS}$ is the linear sum of the data points $\vec{X}_i$ in the sub-cluster and $SS$ is their squared sum:

$$
\vec{LS} = \sum_{i=1}^{N} \vec{X}_i \tag{2.4}
$$

$$
SS = \sum_{i=1}^{N} \vec{X}_i^{\,2} \tag{2.5}
$$

The CF-tree can then be reduced in size by merging leaves that are close to each other, which reduces the number of sub-clusters, and by removing data points that are outliers. The distance between two sub-clusters is calculated as:

$$
D_{1,2} = \sqrt{\frac{N_1 \cdot SS_2 + N_2 \cdot SS_1 - 2 \cdot \vec{LS}_1 \cdot \vec{LS}_2}{N_1 \cdot N_2}} \tag{2.6}
$$

This step can be repeated if further merging is needed. The next step is to run a more traditional clustering algorithm, such as K-means, on the values stored in the leaf nodes. The clustering of the leaf nodes is run with a number of clusters requested beforehand, which generates the final clusters of the algorithm.
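The clustering feature and the distance between two sub-clusters can be illustrated with a small sketch. It only covers the bookkeeping of a single node, not the construction or rebalancing of the CF-tree, and the class and function names are chosen for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusteringFeature:
    n: int       # number of data points in the sub-cluster
    ls: tuple    # linear sum of the data points, one entry per dimension
    ss: float    # sum of the squared norms of the data points

    @classmethod
    def from_points(cls, points):
        dim = len(points[0])
        ls = tuple(sum(p[d] for p in points) for d in range(dim))
        ss = sum(x * x for p in points for x in p)
        return cls(len(points), ls, ss)

    def merge(self, other):
        """Merging two sub-clusters only requires adding their features."""
        return ClusteringFeature(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )


def cf_distance(c1, c2):
    """Distance between two sub-clusters computed from their features alone."""
    dot = sum(a * b for a, b in zip(c1.ls, c2.ls))
    value = (c1.n * c2.ss + c2.n * c1.ss - 2 * dot) / (c1.n * c2.n)
    return math.sqrt(max(value, 0.0))  # guard against rounding below zero
```

The point of the design is that neither merging nor the distance calculation needs the original data points, which is what allows BIRCH to keep only the compact summaries in memory.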

2.2.4 Affinity propagation

Affinity propagation was first proposed by a research group [11] in 2007, and many variations have since been developed from it. It has become very popular for practical usage within the industry [3]. One reason for its popularity compared to one of the most standard clustering algorithms, K-means [27][33], is that affinity propagation does not require a predefined number of clusters. Instead, the number of clusters follows from the preference for exemplar selection given to each data point, for which the median of all input similarities is a good default.

The algorithm works by first considering all data points as possible cluster centers, or "exemplars". There also has to be a similarity matrix, either precomputed or computed dynamically, of the similarities between all the data points. A common way of measuring the similarity between two data points is to use their negative Euclidean distance:

$$
s(x, y) = -\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{2.7}
$$

The original formulation [11] uses the squared distance, which is the same expression without the square root. The similarity indicates how well suited one specific data point is to be the exemplar of another data point. After this step an iterative process begins, in which the clusters start to emerge.

In this step the data points exchange two types of messages, namely responsibility and availability. The responsibility r(x, y) represents how well suited data point y is to serve as exemplar for data point x, compared to the other candidate exemplars of x. The availability a(x, y) represents how appropriate it is for data point x to choose data point y as its exemplar, taking into account the support y receives from other data points. Initially all availabilities are set to zero. The responsibilities are calculated using the following formula:

$$
r(x, y) = s(x, y) - \max_{y' \neq y} \{ a(x, y') + s(x, y') \} \tag{2.8}
$$

It is important to note that a negative responsibility of a point towards itself, r(x, x), means that the point is not considered a good candidate as exemplar for itself. The data point should therefore seek to choose another data point as its exemplar. The availability of data point y to data point x is calculated as the responsibility y has to itself plus the sum of all positive responsibilities that y receives from the other data points, capped at zero:

$$
a(x, y) = \min \Big\{ 0,\; r(y, y) + \sum_{x' \notin \{x, y\}} \max\{0, r(x', y)\} \Big\} \tag{2.9}
$$

Affinity propagation also needs a stopping criterion, for example a predefined maximum number of iterations, or running until the set of exemplars has remained unchanged for a set number of iterations. The algorithm can be terminated at any time, at which point each data point x has as its exemplar the data point y that maximizes:

$$
r(x, y) + a(x, y) \tag{2.10}
$$


If x = y then the data point is an exemplar itself.
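The message passing can be sketched in plain Python as below. This is an illustrative sketch rather than the implementation evaluated in the thesis. It follows the standard message updates of the original formulation [11] and makes a few assumptions that are not stated in the text: the messages are damped with a factor of 0.5 to avoid oscillations, the self-availability a(y, y) is the uncapped sum of positive responsibilities, the negative Euclidean distance is used as similarity, and the loop simply runs for a fixed number of iterations.

```python
import math


def affinity_propagation(points, iterations=100, damping=0.5):
    """Return, for every point, the index of the point chosen as its exemplar."""
    n = len(points)
    # Similarity s(x, y): negative Euclidean distance between the points.
    s = [[-math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
         for p in points]
    # Preference s(x, x): median of the input similarities.
    off_diagonal = sorted(s[x][y] for x in range(n) for y in range(n) if x != y)
    preference = off_diagonal[len(off_diagonal) // 2]
    for x in range(n):
        s[x][x] = preference

    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities, initially zero
    for _ in range(iterations):
        for x in range(n):
            for y in range(n):
                # Responsibility: similarity minus the strongest rival candidate.
                rival = max(a[x][k] + s[x][k] for k in range(n) if k != y)
                r[x][y] = damping * r[x][y] + (1 - damping) * (s[x][y] - rival)
        for x in range(n):
            for y in range(n):
                # Positive support that candidate y receives from other points.
                support = sum(max(0.0, r[k][y]) for k in range(n) if k not in (x, y))
                # Assumed rule: the self-availability is not capped at zero.
                new = support if x == y else min(0.0, r[y][y] + support)
                a[x][y] = damping * a[x][y] + (1 - damping) * new
    # The exemplar of x is the candidate y that maximizes r(x, y) + a(x, y).
    return [max(range(n), key=lambda y: r[x][y] + a[x][y]) for x in range(n)]
```

Each returned value is the index of an exemplar, so points with the same value belong to the same cluster. The nested loops make this sketch cubic in the number of points per iteration, which is only acceptable for small examples.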

2.2.5 Rectifying Self-Organizing Maps

Rectifying Self-Organizing Maps [13] (RSOM) is a clustering algorithm developed as a variant of the more general approach of Self-Organizing Maps [19][4] (SOM). SOM is a competitive learning algorithm where all the neurons representing the output space compete for each data point under a winner-takes-all condition. SOM aims to take continuous high-dimensional data and map it down to a discrete 2-dimensional feature space, where each coordinate is represented by one neuron.

The first step of the algorithm is to initialize all the neurons with random values for each feature in the input space. After this, each data point is assigned to the closest neuron using a distance function such as the Euclidean distance. The algorithm then has to adapt to the input data by learning. The idea is for the winning neuron to make a large update to its feature values, and to let the updates of the other neurons decay with their distance to the winning neuron. To determine how much a neuron should update its values, the algorithm first calculates the topological neighborhood of each neuron j with respect to the winning neuron I(x), where x is the data point. This is done as:

$$
T_{j,I(x)} = \exp\left( -\frac{S_{j,I(x)}^2}{2\sigma^2} \right) \tag{2.11}
$$

where $S_{j,I(x)}$ is the distance between neuron j and the winning neuron I(x), and σ is the size of the winning neuron's neighborhood. σ can be calculated using any custom function, but it needs to decay over time, as the neighborhoods should decrease in size so that the map adapts more on a local level.

The last step of the SOM algorithm is to update the values of all the neurons. The weight update for neuron j, given that neuron i = I(x) was the winner for the data point x, is calculated as:

$$
\Delta w_j(i) = \eta(t) \cdot T_{j,I(x)}(t) \cdot (x - w_j(i)) \tag{2.12}
$$

where t is the time measured in epochs and η(t) is a learning rate function whose value decreases over time.
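One training step of a SOM, covering the winner selection, the topological neighborhood and the weight update, can be sketched as follows. The sketch assumes that the distance between two neurons is the Euclidean distance between their positions on the 2-dimensional map, and that the caller supplies the current learning rate and neighborhood size, both of which should decay over the epochs. The function and parameter names are chosen for this example.

```python
import math


def som_step(weights, grid, x, learning_rate, sigma):
    """Update all neuron weights in place for one data point; return the winner.

    weights: list of weight vectors, one per neuron
    grid:    list of (row, column) map positions, one per neuron
    """
    # Winner-takes-all: the neuron whose weights are closest to the data point.
    winner = min(
        range(len(weights)),
        key=lambda j: sum((xi - wi) ** 2 for xi, wi in zip(x, weights[j])),
    )
    for j, w in enumerate(weights):
        # Squared map distance between neuron j and the winning neuron.
        s2 = (grid[j][0] - grid[winner][0]) ** 2 + (grid[j][1] - grid[winner][1]) ** 2
        # Topological neighborhood: 1 for the winner, decaying with distance.
        t = math.exp(-s2 / (2 * sigma ** 2))
        weights[j] = [wi + learning_rate * t * (xi - wi) for xi, wi in zip(x, w)]
    return winner
```

The winner moves the full `learning_rate` fraction towards the data point, while its neighbors on the map move by a smaller fraction the further away they are.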

RSOM is very similar to the SOM learning algorithm, with the exception of its ability to detect so-called outliers, data points that are far from all other data points. By creating separate clusters for all outliers, the algorithm performs much better than the general SOM algorithm and is therefore more suitable for large data sets. It does this by having a separate function for calculating a score for each neuron j:

$$
e_j^{t} = e_j^{t-1} + p^{t} (\beta_j + z_j) \tag{2.13}
$$

where p is the learning rate scalar and $z_j$ is the number of wins of neuron j in the current epoch. The value of $\beta_j$ is the total activation sum that has occurred in the current epoch for all neurons except j. We define a threshold θ ∈ [0, 1], and if $e_j < \theta$ we consider that cluster to be an outlier cluster, which does not need to be considered anymore.

2.2.6 Deep embedded clustering

Deep embedded clustering [34][14] (DEC) is a clustering algorithm proposed by a research group in 2016. The idea behind the algorithm is to train a deep neural network (DNN) to learn a mapping from the feature space to a lower-dimensional space. After the mapping has been completed, one can use a simple clustering algorithm to separate the lower-dimensional feature space into clusters.

The algorithm begins by initializing the DNN and the cluster centroids. This is done by using a stacked autoencoder (SAE) and training it greedily, layer by layer. Each layer in an SAE is a denoising autoencoder, which is a small two-layer neural network consisting of an encoder and a decoder layer. For each layer in the SAE, we take the output of the previous layer and add noise to it. The new layer then aims to reconstruct the noise-free output of the previous layer by minimizing the mean squared error:

$$
MSE = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2 \tag{2.14}
$$

We can then concatenate all the encoder layers, followed by all the decoder layers in reverse order, to form a deep autoencoder. The data is then run through the encoder layers of the deep autoencoder to map it to a lower-dimensional space. In this lower-dimensional space, the data is separated into the initial clusters using a simple clustering algorithm such as K-means.

The next step is to optimize the mapping between the feature spaces and thereby improve the cluster centroids. This is done by first making a soft assignment of each embedded data point $z_i$ by computing its similarity to each cluster centroid $\mu_j$:

$$
q_{ij} = \frac{(1 + \|z_i - \mu_j\|^2)^{-1}}{\sum_{j'} (1 + \|z_i - \mu_{j'}\|^2)^{-1}} \tag{2.15}
$$

where a higher value represents a higher probability of $z_i$ being assigned to centroid $\mu_j$. To determine how the parameters should be tuned in order to optimize the clustering, we need a target distribution P to match the soft assignment against. Several ways exist for choosing the target distribution, but the original paper sets it as:

$$
p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} (q_{ij'}^2 / f_{j'})} \tag{2.16}
$$

$$
f_j = \sum_{i} q_{ij} \tag{2.17}
$$

The cluster centroids and DNN parameters are then tuned to better match the target distribution using stochastic gradient descent (SGD). The gradients are propagated through the DNN using standard backpropagation. This procedure is repeated for as long as more than a set portion of the data points change cluster assignment between two consecutive iterations.
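The soft assignment and the target distribution of DEC can be computed as in the following sketch. It only covers these two quantities and leaves out the autoencoder and the gradient updates. The embedded points and the centroids are assumed to be given as plain lists of coordinates, and the function names are chosen for this example.

```python
def soft_assignment(embedded, centroids):
    """q[i][j]: how strongly embedded point i is assigned to centroid j."""
    q = []
    for z in embedded:
        kernel = [
            1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(z, mu)))
            for mu in centroids
        ]
        total = sum(kernel)
        q.append([value / total for value in kernel])
    return q


def target_distribution(q):
    """p[i][j]: sharpened version of q, normalized by the soft cluster sizes."""
    k = len(q[0])
    # f[j]: soft cluster frequency, the column sums of q.
    f = [sum(row[j] for row in q) for j in range(k)]
    p = []
    for row in q:
        weighted = [row[j] ** 2 / f[j] for j in range(k)]
        total = sum(weighted)
        p.append([value / total for value in weighted])
    return p
```

Squaring the soft assignments puts more weight on confident assignments, so the target distribution is sharper than the soft assignment it is derived from, which is what drives the centroids and the mapping towards clearer clusters.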

2.3 Related work

• In 2000, Platt [25] wrote a paper that aimed to cluster images into albums automatically. It mainly used the image content, with the option to also look at the time stamp of the photo. The data was gathered by manually sorting 1320 photos into albums, which were then compared to how the algorithm clustered the photos. The algorithm used was a probabilistic clustering using a hidden Markov model to classify each image into a cluster based on the already clustered images.

• In 2007, Dueck and Frey [8] implemented the affinity propagation algorithm, which was newly proposed at the time, in order to categorize images into clusters. They used the highly popular k-means clustering algorithm for comparison and showed that their implementation performed better at categorizing the images correctly for all the test sets they used.


• In 2007, Bekkerman and Jeon [2] proposed a new way of using unsupervised learning to categorize multimedia collections called Comraf*, which is an extension of an older algorithm called Comraf. The algorithm had the advantage of being able to use multi-modal information, such as color histograms and image captions, simultaneously.

• In 2014, Rahmani et al. [26] presented a comparison of image clustering using the K-means and the Fuzzy K-means algorithms. The paper only looks at the image content in terms of continuous-valued features such as colors, textures, shapes and similar, instead of any classification of the content. This indirectly leads to clusters of images that should have similar characteristics. The difference between K-means and Fuzzy K-means is that the latter allows images to be present in more than one cluster simultaneously with some form of probability. The paper concludes that Fuzzy K-means is better by most factors.

• In 2016, Qin-Hu Zhang et al. [35] made a comparative study of a recently proposed clustering algorithm, called the multivariant optimization algorithm, against more common clustering algorithms such as K-means. It used real-life data sets to measure the performance and found that the new algorithm performed very well.

• In 2016, Liang et al. [23] proposed a semi-supervised image clustering algorithm which combined the Laplacian Regularized Metric Learning algorithm (LRML) with other metrics, such as color and form, gathered through feature extraction. This was used to create a new form of semi-supervised clustering algorithm for images which are mostly unlabeled.


Chapter 3 Method

This chapter presents the methods used in this research and how the results are obtained. It also discusses what data is used and how the data is processed by the different clustering algorithms.

3.1 Data

The data used consists of images from the users and events gathered from user actions within the Android application. All data sets must have the user's consent for machine learning usage. These two conditions, platform and consent, restrict the amount of available data. The set of users is furthermore limited to those who use the share function within the application, as these are the only users for whom the performance of the algorithms can be evaluated. Users with fewer than 10 uploaded images are also removed as noise, since clustering so few images gives too high a variation.

The images are the users' private photos, which they have given consent for us to use. All images are scaled down to 224x224 pixels while keeping their metadata. This is needed to run the feature extraction, as the network trained on ImageNet requires a fixed input size. There are, however, no limitations on the age or content of the images used.

The events are filtered so that only the share-events remain out of all the users' events. From the share-events it is possible to gather who shared which images and when. A share-event also has to contain at least two images shared simultaneously. However, to reduce noise and handle some special cases, share-events of single images within one minute of each other are merged into one share-event. This is because some users share images one at a time, which results in their events having just a few seconds between them.
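The filtering and merging of share-events can be sketched as below. The event representation, a tuple of user id, timestamp in seconds and list of image ids, is hypothetical and chosen for this example. The sketch also assumes that the one-minute window is measured between consecutive single-image events of the same user, so that a chain of quick single shares ends up in one share-event.

```python
def merge_share_events(events, window_seconds=60):
    """Merge quick single-image shares and keep events with at least two images.

    events: iterable of (user_id, timestamp_seconds, image_ids) tuples
    returns: list of (user_id, image_ids) tuples
    """
    groups = []
    for user_id, timestamp, image_ids in sorted(events, key=lambda e: (e[0], e[1])):
        single = len(image_ids) == 1
        last = groups[-1] if groups else None
        if (single and last is not None and last["single_run"]
                and last["user_id"] == user_id
                and timestamp - last["last_seen"] <= window_seconds):
            # Continue the current run of single-image shares.
            last["image_ids"].extend(image_ids)
            last["last_seen"] = timestamp
        else:
            groups.append({
                "user_id": user_id,
                "last_seen": timestamp,
                "image_ids": list(image_ids),
                "single_run": single,
            })
    # A share-event is only usable for evaluation with two or more images.
    return [(g["user_id"], g["image_ids"]) for g in groups if len(g["image_ids"]) >= 2]
```

In this sketch two single shares 20 seconds apart become one share-event with two images, while an isolated single share is dropped.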

3.2 Image feature extraction

To be able to run the cluster analysis with good performance and within a reasonable time frame, the algorithms should not be run directly on the images. Instead, one should extract certain features from the images that represent what the algorithms should cluster the images on. This technique is called feature extraction.

3.2.1 Content extraction

For the content extraction of the images, a convolutional neural network (CNN) is used. The CNN used is MobileNet [18], trained on the large standardized data set ImageNet [6]. MobileNet is faster than the other pre-trained computer vision DNNs available in Keras, at a very small cost in precision. It is therefore preferred for the large amount of data that the network needs to process.

To retrieve the features of the images and not just a direct prediction of the content, the values are instead gathered from the feature map in the DNN. This way one obtains the numerical values that are used to classify the image into the 1000 classes existing in ImageNet. This is preferable to the probability of the image belonging to a specific class, as it gives a much more general representation of the content features.

3.2.2 Metadata

The image metadata used for clustering is the date of the image and its geolocation. For the date of the image, which variable is taken from the metadata depends on which variables are available, as not all images have all variables set. To solve this, the variable representing the most relevant timestamp is taken from the available variables. The variable DateTime has the highest priority, followed by the variable DateTimeOriginal, and lastly, if none of the previous variables exists, DateTimeDigitized is used. For some images it is also possible to extract the date of the image, given in whole days, from the image name, as it is common for cameras to automatically name images by the date they were taken. The timestamp of an image is taken as seconds since January 1, 1970. As a way to scale down the date values, the timestamps are divided into 8-hour intervals instead of seconds when clustering.
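The timestamp selection and the 8-hour scaling can be sketched as below. This is a hedged illustration rather than the code used in the study: the function names are hypothetical, the metadata is assumed to be available as a plain mapping from tag name to the usual EXIF string format `YYYY:MM:DD HH:MM:SS`, and the timestamps are interpreted as UTC because EXIF date tags carry no time zone. The fallback to the file name is left out.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional

# Priority order as described in the text.
TAG_PRIORITY = ("DateTime", "DateTimeOriginal", "DateTimeDigitized")
EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"   # common EXIF date string layout
INTERVAL_SECONDS = 8 * 60 * 60      # 8-hour intervals


def pick_timestamp(exif: Mapping[str, str]) -> Optional[int]:
    """Return seconds since January 1, 1970 from the first available tag."""
    for tag in TAG_PRIORITY:
        value = exif.get(tag)
        if value:
            parsed = datetime.strptime(value, EXIF_FORMAT)
            # Assumption: treat the naive EXIF time as UTC.
            return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    return None  # no date tag set; the image needs an approximated date


def to_interval(timestamp_seconds: int) -> int:
    """Scale a timestamp down to a count of whole 8-hour intervals."""
    return timestamp_seconds // INTERVAL_SECONDS
```

With this scaling, two photos taken within the same 8-hour window get the same date feature value.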

For the geolocation, the latitude and longitude variables are extracted from the metadata. To be able to represent the latitude and longitude as numerical values, their south and west values respectively are represented as negative values. To prevent the geolocation parameters from becoming irrelevant because of their small numerical values, and to have them weighted more equally with the rest of the image features, the values of latitude and longitude are upscaled by a large factor.
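A minimal sketch of this sign convention and scaling is given below. The function names are hypothetical, and the scale factor is a placeholder: the text only says "a large factor", so the value here is an assumption chosen purely for illustration.

```python
from typing import Tuple

GEO_SCALE = 1000.0  # hypothetical value; the thesis does not state the factor


def signed_coordinate(value: float, ref: str) -> float:
    """Turn a coordinate plus hemisphere reference into a signed number."""
    ref = ref.upper()
    if ref not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere reference: {ref!r}")
    # South and west are represented as negative values.
    return -abs(value) if ref in ("S", "W") else abs(value)


def geo_features(lat: float, lat_ref: str, lon: float, lon_ref: str,
                 scale: float = GEO_SCALE) -> Tuple[float, float]:
    """Signed and upscaled latitude/longitude for use as clustering features."""
    return (signed_coordinate(lat, lat_ref) * scale,
            signed_coordinate(lon, lon_ref) * scale)
```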

Insufficient metadata

For images that do not contain both timestamp and geolocation in the metadata, some approximations had to be made. While the conditions differ, the method for approximating the metadata is influenced by the research of Hirota et al.[16]. In their research they approximate the missing metadata of images in web search results. The metadata they approximated was the social tags and content of the images, such as angles and light. They did the approximation by taking images containing metadata in the search results returned by the same queries and applying the k-nearest-neighbor algorithm to find similar images. If the images in the cluster met a threshold distance, an algorithm called Maxmin[22] was used to calculate the visually closest image. The metadata was then copied from the visually closest image over to the image whose metadata was being approximated.

In this thesis, to approximate a missing geolocation in the metadata, a threshold portion of the user's images must have their geolocation set, to avoid too much uncertainty. The approximation looks at the images that have their geolocation properly set. From these images, one takes the image whose date parameter is closest to that of the image being approximated. The geolocation of this image is then copied as the approximated latitude and longitude. The reason behind this approximation is the assumption that people tend to take many photos at one single location before moving on to the next.
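The geolocation approximation can be sketched as a nearest-date lookup. The representation of an image as a `(timestamp, location)` pair, the function name and the threshold value are all assumptions made for this example; the thesis does not state the actual threshold.

```python
from typing import List, Optional, Tuple

Location = Tuple[float, float]               # (latitude, longitude)
ImageRecord = Tuple[int, Optional[Location]]  # (timestamp, location or None)

MIN_GEO_FRACTION = 0.5  # hypothetical threshold, not taken from the thesis


def approximate_location(target_timestamp: int,
                         images: List[ImageRecord],
                         min_fraction: float = MIN_GEO_FRACTION
                         ) -> Optional[Location]:
    """Copy the location of the geotagged image closest in time."""
    located = [(ts, loc) for ts, loc in images if loc is not None]
    # Too few geotagged images: give up rather than guess.
    if not images or len(located) / len(images) < min_fraction:
        return None
    _, nearest_location = min(
        located, key=lambda item: abs(item[0] - target_timestamp)
    )
    return nearest_location
```

Returning `None` corresponds to the case where the geolocation is not used at all for that user.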

For the date parameter it is harder to make an approximation, as the geolocation data cannot be used as a good measure: the person might have taken many photos at the same location over a longer time. In this case the date of these images is instead set to the earliest date of any image that the user has stored, minus the time between that image and the average date of their images, given as:

$$d_i = \max\left(0,\; d_0 - \left(\frac{\sum_{i=0}^{N} d_i}{N} - d_0\right)\right) \qquad (3.1)$$

where $d_0$ is the earliest set date parameter.
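As a worked illustration of Equation 3.1, the sketch below computes the approximated date from the dates that are set. The function name is hypothetical, and the average is assumed to be the plain arithmetic mean over the images that have a date.

```python
from typing import Sequence


def approximate_date(known_dates: Sequence[float]) -> float:
    """Earliest date minus the gap between the mean date and the earliest date."""
    if not known_dates:
        raise ValueError("at least one image must have its date set")
    earliest = min(known_dates)                    # d_0 in Equation 3.1
    mean_date = sum(known_dates) / len(known_dates)
    # Clamp at zero so the approximation never precedes the epoch.
    return max(0.0, earliest - (mean_date - earliest))
```

For example, with dates 1000, 1100 and 1200 the mean is 1100, the gap to the earliest date is 100, and the approximated date becomes 900. Images without a date are therefore placed before all dated images of the same user.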

To make sure that the approximation of missing metadata does not get too noisy, some thresholds are used. If the user does not have a certain metadata tag for a large enough portion of their images, the tag is not used and the clustering is limited to the other available features. This threshold is higher if the metadata tag is harder to approximate.

3.3 Clustering

The clustering algorithms compared in this research are Affinity Propagation, BIRCH, Rectifying Self-Organizing Maps and Deep Embedded Clustering. All four algorithms are run using already existing libraries, thus the implementation needed for each algorithm consists of tuning its parameters. For Self-Organizing Maps and Deep Embedded Clustering the implementations used are the same as those the original papers used for their benchmarks. Affinity Propagation and BIRCH use the default Scikit-learn library[24] implementation.

The data built using the feature extraction is run through the four algorithms sequentially. For each user, the scores for each algorithm are saved, then aggregated into a total score and presented after all the data is processed. For all algorithms except BIRCH there is a limit on the number of iterations they are allowed to run. Since BIRCH will always converge when it merges the leaves, no limit needs to be set. To not affect the result, however, the other three algorithms are allowed to run enough iterations to converge as long as it is within a reasonable time frame. The time limit is set as a maximum number of iterations, and therefore the exact time limit can differ. Let N be the number of data points. Affinity Propagation sets its maximum number of iterations to 200 + 5 ∗ N. For Rectifying Self-Organizing Maps and Deep Embedded Clustering the maximum number of iterations is set as the number of epochs they run. RSOM uses 500 ∗ N, which is suggested as a "rule of thumb" in the original Self-Organizing Maps paper[20], and for DEC it is set to 100 ∗ N.
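The iteration budgets above are simple functions of the number of data points N and can be written out as follows. The function and key names are hypothetical; only the formulas come from the text.

```python
from typing import Dict, Optional


def iteration_budgets(n_points: int) -> Dict[str, Optional[int]]:
    """Maximum iterations (or epochs) per algorithm for N data points."""
    if n_points < 0:
        raise ValueError("number of data points must be non-negative")
    return {
        "affinity_propagation": 200 + 5 * n_points,  # iterations
        "birch": None,                                # no limit is set
        "rsom": 500 * n_points,                       # epochs, rule of thumb
        "dec": 100 * n_points,                        # epochs
    }
```

For a user with 40 images this gives 400 iterations for Affinity Propagation, 20000 epochs for RSOM and 4000 epochs for DEC.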

Due to the differences in how the algorithms determine how many clusters they want to create, and to improve the comparability of the results, all the algorithms use the same number of clusters for each user, even though the distribution of images among the clusters is free for each algorithm to choose. The number of clusters is determined by running the data through the Affinity Propagation algorithm with the preference set to the mean of its input similarities. This number is then passed on to the other three clustering algorithms. The reason for choosing Affinity Propagation to determine the number of clusters is that it has a standardized way of determining the separation of clusters, derived from research[11], whereas the other three algorithms require preferences set from a more practical approach that is less derived from theoretical research.
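The preference value described above can be sketched as the mean of the pairwise similarities. This is an illustration under stated assumptions: similarity is taken to be the negative squared Euclidean distance, the mean is taken over distinct pairs only, and the function name is hypothetical. The thesis does not spell out these details.

```python
from itertools import combinations
from typing import Sequence


def mean_similarity_preference(points: Sequence[Sequence[float]]) -> float:
    """Mean pairwise similarity, where similarity = -(squared Euclidean distance)."""
    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("need at least two data points")
    similarities = [
        -sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in pairs
    ]
    return sum(similarities) / len(similarities)
```

In Scikit-learn, a value like this could be supplied through the `preference` argument of `AffinityPropagation`, whose default is instead the median of the input similarities, and the resulting cluster count could be supplied to `Birch` through its `n_clusters` argument.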

To give the reader an indication of the algorithms' performance in practice, the mean time elapsed per user is also measured for each algorithm. This gives the reader the opportunity to weigh the correctness of the clusters against the time elapsed for practical use cases.
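Measuring the mean elapsed time per user could look like the sketch below. It is an assumption-laden illustration: `cluster_fn` stands for any callable that clusters one user's data, and wall-clock time from `time.perf_counter` is used, which may differ from how the study timed its runs.

```python
import time
from statistics import mean
from typing import Callable, Iterable


def mean_elapsed_seconds(cluster_fn: Callable[[object], object],
                         per_user_data: Iterable[object]) -> float:
    """Run cluster_fn once per user and return the mean wall-clock time."""
    durations = []
    for data in per_user_data:
        start = time.perf_counter()
        cluster_fn(data)
        durations.append(time.perf_counter() - start)
    return mean(durations)  # raises StatisticsError if there are no users
```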

3.4 Score calculation

The score mainly used for evaluation is the correctness of clustering images that were shared together into the same cluster. This score is calculated as follows: for each time a user shared multiple images, the event gets a score of one if all the shared images are put in the same cluster, otherwise zero. The user's share score is then calculated as the mean of the scores gathered for all of the user's share events. To complement this score and guard against odd cases, such as all images being put in a disproportionately small number of clusters and thus reaching a 100% score, two other measurements are used. These are the silhouette coefficient and the Calinski-Harabasz index. Both these measurements are able to give an indication of how homogeneous the data within one cluster is. A problem that could occur is that a clustering algorithm decides not to split the data in any way, thus putting all the data in the same cluster. This leads to problems in calculating the silhouette coefficient and the Calinski-Harabasz index; if this happens, both scores are given their worst possible value to indicate that it was a bad clustering. While creating only a single cluster could be the correct way to cluster the images, it was determined empirically that this is rarely the case. Lastly, for each algorithm the mean of each of these three scores is calculated and presented for comparison. To give the reader an indication of how the metadata affected the final results, the scores are calculated both with and without using metadata.
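The share score and the single-cluster fallback can be sketched as follows. The data layout (share events as lists of image identifiers, cluster labels as a dictionary) and the function names are assumptions for this example. The silhouette computation itself is passed in as a callable, since any implementation could be used.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def share_score(share_events: Sequence[Sequence[Hashable]],
                labels: Dict[Hashable, int]) -> float:
    """Fraction of share events whose images all ended up in one cluster."""
    if not share_events:
        raise ValueError("the user has no share events")
    hits: List[int] = []
    for event in share_events:
        clusters = {labels[image] for image in event}
        hits.append(1 if len(clusters) == 1 else 0)
    return sum(hits) / len(hits)


def silhouette_or_worst(labels: Dict[Hashable, int],
                        compute_silhouette: Callable[[Dict[Hashable, int]], float]
                        ) -> float:
    """Give the worst silhouette value (-1) when only one cluster was created."""
    if len(set(labels.values())) < 2:
        return -1.0
    return compute_silhouette(labels)
```

The first function shows why the share score alone is not enough: putting every image in one cluster always yields a score of 1.0, which the second function penalizes.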


Chapter 4 Results

4.1 Results overall

                                Affinity Propagation    BIRCH      RSOM      DEC
Mean share score                0.577                   0.641      0.578     0.771
Mean silhouette coefficient     0.132                   0.146      -0.158    -0.388
Mean Calinski-Harabasz index    51.105                  52.191     40.887    1.520
Mean time elapsed (seconds)     0.015                   0.009      8.400     78.815

Table 4.1: Results including all users, both with and without metadata.

For the overall results including both with and without metadata, the Deep Embedded Clustering algorithm has the best mean share score. We can see however that DEC has a low silhouette coefficient as well as a low Calinski-Harabasz index. The other three algorithms have more balanced scores across the metrics. We can also see that the older algorithms Affinity Propagation and BIRCH have a lower computational cost than RSOM and DEC when comparing the mean elapsed time.


4.2 Results without metadata

                                Affinity Propagation    BIRCH      RSOM      DEC
Mean share score                0.598                   0.632      0.355     0.650
Mean silhouette coefficient     0.111                   0.122      0.026     0.059
Mean Calinski-Harabasz index    3.019                   3.295      2.110     2.779
Mean time elapsed (seconds)     0.005                   0.005      2.617     73.109

Table 4.2: Results including only users having metadata.

As seen in Table 4.2, DEC achieves the highest mean share score when applied to the image content without any metadata. The closest competitor to DEC is BIRCH, which has similar scores but also a far smaller computational cost.

4.3 Results only when metadata available

                                Affinity Propagation    BIRCH      RSOM      DEC
Mean share score                0.654                   0.651      0.823     0.903
Mean silhouette coefficient     0.155                   0.173      -0.360    -0.878
Mean Calinski-Harabasz index    103.844                 105.820    83.418    0.139
Mean time elapsed (seconds)     0.026                   0.014      14.742    85.074

Table 4.3: Results including only users not having metadata.

When the metadata is applied, all share scores increase, especially for the newer clustering algorithms RSOM and DEC. However, the other metrics get worse for RSOM and DEC, which can indicate problems with separating the data into multiple clusters instead of just creating one big cluster. This is most evident for DEC, as its silhouette score is close to the lowest possible value of negative one, the value that indicates that all data was put into a single cluster.

4.4 Metadata impact

[Figure 4.1: mean share score (axis from 0.4 to 1) for AP, BIRCH, RSOM and DEC, with metadata excluded and metadata included.]

Figure 4.1: Metadata impact for the share score.


[Figure 4.2: mean silhouette score (axis from −1 to 0) for AP, BIRCH, RSOM and DEC, with metadata excluded and metadata included.]

Figure 4.2: Metadata impact for the silhouette score.

From these diagrams we can see that while all four algorithms improve in terms of share score, only Affinity Propagation and BIRCH benefit from using the metadata in terms of silhouette score. Both RSOM and DEC have problems handling the metadata and separating the data into multiple clusters.
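The size of the metadata impact can be recomputed from the values reported in Tables 4.2 and 4.3. The sketch below assumes, following the section headings, that Table 4.2 holds the results without metadata and Table 4.3 the results with metadata; the variable and function names are hypothetical.

```python
from typing import Dict, Tuple

# (mean share score, mean silhouette coefficient) per algorithm.
WITHOUT_METADATA: Dict[str, Tuple[float, float]] = {
    "AP": (0.598, 0.111), "BIRCH": (0.632, 0.122),
    "RSOM": (0.355, 0.026), "DEC": (0.650, 0.059),
}
WITH_METADATA: Dict[str, Tuple[float, float]] = {
    "AP": (0.654, 0.155), "BIRCH": (0.651, 0.173),
    "RSOM": (0.823, -0.360), "DEC": (0.903, -0.878),
}


def metadata_impact(algorithm: str) -> Tuple[float, float]:
    """Change in share score (percentage points) and in silhouette score."""
    share_before, silhouette_before = WITHOUT_METADATA[algorithm]
    share_after, silhouette_after = WITH_METADATA[algorithm]
    share_delta = round((share_after - share_before) * 100, 1)
    silhouette_delta = round(silhouette_after - silhouette_before, 3)
    return share_delta, silhouette_delta
```

For Affinity Propagation this gives 5.6 percentage points and 0.044, and for BIRCH 1.9 percentage points and 0.051, which matches the figures quoted in the abstract. For RSOM and DEC the share score rises while the silhouette difference is negative.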

Due to how the Calinski-Harabasz index works, the scores cannot be compared in a meaningful way, since the distances between data points increase when metadata is used. Therefore the comparison for this index is not presented.


Chapter 5 Discussion

In this chapter, the overall findings of this research are discussed. The results are investigated, as well as the factors that may have altered the results and what impact they may have had. The practical usage of these results versus the theoretical findings is also examined.

5.1 The overall results

From the results reported in Chapter 4 we can see that all algorithms have their advantages and disadvantages. According to the results of this study, both RSOM and DEC are uneven in their performance, while Affinity Propagation and BIRCH perform more evenly independent of the data. When only using the contents of the images and no metadata, the algorithms are close to equal in their performance, depending on which parameters one values the most. However, only Affinity Propagation and BIRCH benefit from using the metadata. This could be caused by the other two algorithms having problems handling the extreme values that the date and location parameters have compared to the 1024-length feature vector representing the content. There is also a large difference in the mean elapsed time, as the newer, more advanced algorithms take a longer time to run their clustering even for small numbers of iterations.

It is worth noting that the weighting of the different parameters had to be made manually, and the scores could be improved if further tweaking of these parameters is made.


5.2 Validity of the results

There are a lot of factors that could have affected the validity of the results. However, most of the factors that could have had an impact on the scores should have impacted all four algorithms in a similar way and thus not change the overall outcome of how the algorithms compared to each other.

5.2.1 Internal validity

For the internal validity, one has to look at what data was actually used. The data was limited to users that had shared at least two images combined and had also given consent that their data would be used for data analysis. Furthermore, it was also limited to users who had more than 10 images available to cluster on. From these restrictions it can be concluded that the users involved were with high probability the more active users, and thus had more recent data available for the clustering. It was possible to see empirically that an increasing portion of images are taken with phones and also that the inclusion of metadata has become more common. This is especially prevalent with the photo location, as it often requires the phone to have some form of network or GPS connection while the photos are being taken. As we could see from the results, the general trend is that more metadata improves the clustering, as it gives us more parameters to analyze. Thus the restrictions imposed on which data was used could have improved the results, even though it is the author's opinion that the impact should be rather small. This difference could, however, be offset by the fact that some images could not be parsed correctly and were therefore rejected, thus evening out the spread of images.

5.2.2 Construct validity

A factor that likely had a large impact on the results was how the Degoo application is built for presenting the photos to the users. The default way of presenting the images for most users is to list them ordered by their date, with a small error rate for the images whose date cannot be determined from either the metadata or their name. This should have impacted the results regarding metadata versus non-metadata, as users would be biased to share images that appear closer to each other in the listing. When using the date parameter for clustering on this kind of data, the results have to be carefully compared against not using the date parameter. It is also worth mentioning that there is a problem with measuring the distance between the locations of images taken close to the international date line. This is because the longitude values go from -180 degrees to +180 degrees, meaning that two images could have vastly different location parameters even though they are taken relatively close to each other. However, since very few users are located at the date line, this impact should be negligible when looking at the overall result.
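The date line problem can be illustrated with a small example. The function below is not something the study used; it is a hypothetical helper that shows how far apart the raw longitude values are compared to the actual angular gap.

```python
def longitude_gap(lon_a: float, lon_b: float) -> float:
    """Smallest angular difference in degrees, wrapping around at +/-180."""
    diff = abs(lon_a - lon_b) % 360.0
    return min(diff, 360.0 - diff)


# Two photos taken on either side of the date line.
raw_difference = abs(179.9 - (-179.9))         # what a plain distance sees
wrapped_difference = longitude_gap(179.9, -179.9)
```

Here the raw difference is close to 360 degrees while the wrapped difference is about 0.2 degrees. Using such a wrapped measure would require a clustering setup that accepts a custom distance, which is outside what was evaluated here.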

It is worth discussing the impact of choosing Affinity Propagation as the algorithm to define the number of clusters. The reasoning was that all algorithms should use the same number of clusters to make the preconditions for their performance as equal as possible. Affinity Propagation was chosen as it had the most standardized way of defining the number of clusters without any arbitrary parameters. While this method should help make the preconditions better for the share score, it could impact the results, as some algorithms may gain an advantage if they are worse than the other algorithms at determining the number of clusters themselves.

5.2.3 Conclusion validity

The conclusion validity refers to how certain we can be that the conclusions can be correctly drawn from our results. As this thesis only deals with one single application, the validity of our conclusions can be questioned. The difference in scores between the algorithms would need to be larger to draw any general conclusions for practical usage in image sharing.

5.3 Alternative solutions

There are a few alternative ways of accomplishing similar or better results, as well as possible improvements. First of all, from the results gathered it seems that a lot of users have much noise in their data, either in the form of outlying images or simply images taken under different conditions, such as a screenshot. The RSOM algorithm should often be able to handle noise, but the default implementation used in this research had problems handling the extreme values that came from the metadata. However, there are simpler algorithms such as DBSCAN[9] which are supposed to be able to handle such oddities, even though they might not perform as well as RSOM on non-metadata. DBSCAN also requires the user to define arbitrary parameters for how it should determine what is considered to be noise, and can therefore be problematic for the general use case.

Another way of determining the number of clusters used for each algorithm is to run the data with multiple different values set for the number of clusters and determine which one gives the fewest errors for that particular algorithm. This way one can find the optimal number of clusters for different amounts of data. There are, however, several problems with this approach, as it would require a very large amount of computational time and would be a static approach to a dynamic problem. The images used can vary a lot in their content and metadata, and thus even though a static number of clusters for each interval of data might give a good average score, the practical usage would likely be bad.

5.4 Practical usage

For a practical usage of these algorithms there are some factors to weigh in. First of all, the ability of an algorithm to determine the number of clusters automatically will be important. Letting users determine the number of clusters themselves could work for small amounts of data, but for broader usage it should be fully automated. Depending on how much variation the data has, this will benefit Affinity Propagation the most, as it has a well-established way of determining the number of clusters. The other algorithms would need some manual tuning to get the right number of clusters.

There is also the important factor of computational cost. While the weight of this factor depends on the amount of data and the time frame it needs to work within, faster algorithms will always have an advantage. In terms of computational time Affinity Propagation and BIRCH are by far the best. These two algorithms are both fast and seem to scale well, thus the computational cost should rarely be a problem for them. If one wants to use either RSOM or DEC, the data and conditions have to be evaluated first.


Chapter 6 Conclusion

This research set out to answer the following research questions:

• "Which of the clustering algorithms Affinity Propagation, BIRCH, Rectifying Self-Organizing Maps and Deep Embedded Clustering achieve the best clustering in practice based on common shareability using image content and metadata?"

• "Does using image metadata improve the performance of clustering algorithms?"

For the first question, it can be concluded that Affinity Propagation and BIRCH perform similarly on the images that had metadata. While RSOM and DEC had higher mean share scores, their low mean silhouette scores show their problem with separating the data. In order to use RSOM and DEC for this kind of problem, their default implementations would need some modifications.

The second question can be answered with "yes": the metadata improves the share score for all four algorithms, although only Affinity Propagation and BIRCH also improve in silhouette score.

6.1 Future research

For future research there would be a need to make a comparison where the RSOM and DEC algorithms are modified to work better with the metadata, or where the metadata is presented in another way. As this approach is focused on practical usage, more use cases for all the algorithms, with comparisons, would be beneficial to prove the general case. It would also be good to include an algorithm such as DBSCAN[9] to reduce the noise in the clustering.

Focus should also be on finding a more equal way of comparing the clustering algorithms instead of letting one algorithm decide the number of clusters. The results could also be split into more subsets according to how the different features are weighted against each other, to give readers a better understanding of what the optimal preconditions are.


Bibliography

[1] Charu C Aggarwal and Chandan K Reddy. Data clustering: algorithms and applications. CRC Press, 2013.

[2] Ron Bekkerman and Jiwoon Jeon. “Multi-modal clustering for multimedia collections”. In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE. 2007, pp. 1–8.

[3] Michael J Brusco and Douglas Steinley. “Affinity propagation and uncapacitated facility location problems”. In: Journal of Classification 32.3 (2015), pp. 443–480.

[4] John A Bullinaria. “Self organizing maps: fundamentals”. In: Introduction to Neural (2004).

[5] Tadeusz Caliński and Jerzy Harabasz. “A dendrite method for cluster analysis”. In: Communications in Statistics-theory and Methods 3.1 (1974), pp. 1–27.

[6] Jia Deng et al. “Imagenet: A large-scale hierarchical image database”. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE. 2009, pp. 248–255.

[7] Adit Deshpande. A Beginner’s Guide To Understanding Convolutional Neural Networks Part 2. [Online; accessed June 21, 2018]. 2016. URL: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks-Part-2/.

[8] Delbert Dueck and Brendan J Frey. “Non-metric affinity propagation for unsupervised image categorization”. In: Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE. 2007, pp. 1–8.

[9] Martin Ester et al. “A density-based algorithm for discovering clusters in large spatial databases with noise.” In: Kdd. Vol. 96. 34. 1996, pp. 226–231.


[10] Brian S Everitt et al. In: (2011).

[11] Brendan J Frey and Delbert Dueck. “Clustering by passing messages between data points”. In: Science 315.5814 (2007), pp. 972–976.

[12] Mohammed Ghesmoune, Mustapha Lebbah, and Hanene Azzag. “State-of-the-art on clustering data streams”. In: Big Data Analytics 1.1 (2016), p. 13.

[13] Eren Golge and Pinar Duygulu. “Rectifying Self Organizing Maps for Automatic Concept Learning from Web Images”. In: (Dec. 2013).

[14] Xifeng Guo et al. “Improved deep embedded clustering with local structure preservation”. In:

[15] Christian Hennig et al. Handbook of cluster analysis. CRC Press, 2015.

[16] Masaharu Hirota et al. “A robust clustering method for missing metadata in image search results”. In: Journal of information processing 20.3 (2012), pp. 537–547.

[17] Holmes. Data Mining: Foundations and Intelligent Paradigms. Springer Science & Business Media, 2012.

[18] Andrew G Howard et al. “Mobilenets: Efficient convolutional neural networks for mobile vision applications”. In: arXiv preprint arXiv:1704.04861 (2017).

[19] Teuvo Kohonen. Self-organizing maps. 3. ed. Springer series in information sciences, 30. Springer, 2001.

[20] Teuvo Kohonen. “The self-organizing map”. In: Neurocomputing 21.1-3 (1998), pp. 1–6.

[21] P Kshirsagar and N Rathod. “Artificial neural network”. In: International Journal of Computer Applications (2012).

[22] Reinier H van Leuken et al. “Visual diversification of image search results”. In: Proceedings of the 18th international conference on World wide web. ACM. 2009, pp. 341–350.

[23] Jianqing Liang, Yahong Han, and Qinghua Hu. “Semi-supervised image clustering with multi-modal information”. In: Multimedia Systems 22.2 (2016), pp. 149–160.


[24] Fabian Pedregosa et al. “Scikit-learn: Machine learning in Python”. In: Journal of machine learning research 12.Oct (2011), pp. 2825–2830.

[25] John C Platt. “AutoAlbum: Clustering digital photographs using probabilistic model merging”. In: cbaivl. IEEE. 2000, p. 96.

[26] Md Khalid Imam Rahmani, Naina Pal, and Kamiya Arora. “Clustering of image data using K-means and fuzzy K-means”. In: International Journal of Advanced Computer Science and Applications 5.7 (2014), pp. 160–163.

[27] Yordan P Raykov et al. “What to do when K-means clustering fails: a simple yet principled alternative algorithm”. In: PloS one 11.9 (2016), e0162259.

[28] Alan P Reynolds et al. “Clustering rules: a comparison of partitioning and hierarchical clustering algorithms”. In: Journal of Mathematical Modelling and Algorithms 5.4 (2006), pp. 475–504.

[29] Charles Romesburg. Cluster analysis for researchers. Lulu.com, 2004.

[30] Peter J Rousseeuw. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis”. In: Journal of computational and applied mathematics 20 (1987), pp. 53–65.

[31] Tara N Sainath et al. “Deep convolutional neural networks for LVCSR”. In: Acoustics, speech and signal processing (ICASSP), 2013 IEEE international conference on. IEEE. 2013, pp. 8614–8618.

[32] Ka-Chun Wong. “A short survey on data clustering algorithms”. In: Soft Computing and Machine Intelligence (ISCMI), 2015 Second International Conference on. IEEE. 2015, pp. 64–68.

[33] Xindong Wu et al. “Top 10 algorithms in data mining”. In: Knowledge and information systems 14.1 (2008), pp. 1–37.

[34] Junyuan Xie, Ross Girshick, and Ali Farhadi. “Unsupervised deep embedding for clustering analysis”. In: International conference on machine learning. 2016, pp. 478–487.

[35] Qin-Hu Zhang et al. “Data clustering using multivariant optimization algorithm”. In: International Journal of Machine Learning and Cybernetics 7.5 (2016), pp. 773–782.


[36] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. “BIRCH: an efficient data clustering method for very large databases”. In: ACM Sigmod Record. Vol. 25. 2. ACM. 1996, pp. 103–114.


Sammanfattning

The results show that although there are large variations in the outcomes, Affinity Propagation and BIRCH show the greatest potential of the four algorithms. Furthermore, the metadata, when available, appears to improve the results for the clustering algorithms that could handle the extreme values that the metadata could give rise to. For Affinity Propagation the mean share score improved by 5.6 percentage points and its silhouette index increased by 0.044. BIRCH's mean share score increased by 1.9 percentage points and its silhouette index improved by 0.051. RSOM and DEC could not process the metadata.

Contents

1 Introduction
1.1 Problem statement
1.1.1 Research Questions
1.1.2 Evaluation
1.1.3 Scope of study
1.2 Research methodology
1.3 Degoo Media AB
1.4 Scientific contribution

2 Background
2.1 Convolutional neural network
2.2 Cluster analysis
2.2.1 Silhouette coefficient
2.2.2 Calinski-Harabasz Index
2.2.3 BIRCH
2.2.4 Affinity propagation
2.2.5 Rectifying Self-Organizing Maps
2.2.6 Deep embedded clustering
2.3 Related work

3 Method
3.1 Data
3.2 Image feature extraction
3.2.1 Content extraction
3.2.2 Metadata
3.3 Clustering
3.4 Score calculation

4 Results
4.1 Results overall
4.2 Results without metadata
4.3 Results only when metadata available
4.4 Metadata impact

5 Discussion
5.1 The overall results
5.2 Validity of the results
5.2.1 Internal validity
5.2.2 Construct validity
5.2.3 Conclusion validity
5.3 Alternative solutions
5.4 Practical usage

6 Conclusion
6.1 Future research

Bibliography

Chapter 1 Introduction