Clustering users based on the user’s photo library


DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2018


MARCUS BERGHOLM

Master in Computer Science

Date: June 18, 2018

Supervisor: Jens Edlund

Examiner: Olof Bälter

Principal: Carl Hesselskog

Swedish title: Gruppering av användare baserat på användarens fotobibliotek


Abstract

For any user-adaptive system, the most important task is to provide users with what they want and need without them asking for it explicitly. This process can be called personalisation and is done by tailoring the service or product to individual users or user groups. In this thesis, we explore the possibility of building a model that clusters users based on their photo libraries. The purpose was to create a better personalised experience within a service called Degoo. The model used to perform the clustering is called Deep Embedding Clustering; it was evaluated on several internal indices, alongside an automated categorization model that gave an indication of what type of images each cluster contained. The user clustering was then evaluated on split-tests running within the Degoo service. The results show that four out of five clusters had some general indication of type, such as vacation photos, clothes, text, and people. The evaluation of the clustering’s impact on the split-tests shows patterns that indicate optimal attribute values for certain user clusters.


Summary (Sammanfattning)

The ultimate goal of any user-adaptive system is to give users what they need without them asking for it explicitly. This process can be called personalisation and is done by tailoring the service or product to individual users or user groups. In this thesis, we investigate the possibility of building a model that groups users based on their photo data. The motivation was to create a better personalised experience within a service called Degoo. The model used to perform the grouping is called Deep Embedding Clustering; it was evaluated on several internal indices, together with an automated categorization model that gave an indication of what type of images the groups contained. The user grouping was then evaluated on several split-tests running within the Degoo service. The results show that four out of five groups had a general indication of type, such as vacation photos, clothes, text, and people. The evaluation of the grouping’s effect on the split-tests shows patterns that indicate optimal attribute values for certain groups.


2.6 Image recognition
2.7 Bag of words
3 Method
3.1 Environment
3.1.1 NumPy
3.1.2 Sklearn
3.1.3 Google cloud
3.1.4 Tensorflow
3.1.5 Matplotlib
3.1.6 Hardware
3.2 Image clustering
3.2.1 Data
3.2.2 Feature extraction
3.2.3 Model
3.3 Evaluation
3.3.1 Internal cluster evaluation
3.3.2 Cluster categorization
3.3.3 User categorization
3.3.4 Split-test evaluation
4 Result
4.1 Internal cluster evaluation
4.2 Cluster categorization
4.3 User categorization
4.4 Split-test impact
5 Discussion
5.1 Conclusion
5.2 Future work
6 References


Chapter 1

Introduction

The following chapter intends to introduce the reader to the problem and the research questions of the master thesis. It also intends to give information about the overall aim and the motivation behind it.

For any user-adaptive system, the most important task is to provide users with what they want and need without them asking for it explicitly. This process can be called personalisation and is done by tailoring the service or product to individual users or user groups. The personalisation is usually done by delivering dynamic content to the users, such as product recommendations, advertisements, links, etc. (Mobasher 2007).

Today, services generate a lot of user data but may not be using it to gain insight into their users. Such insight can be used to create a better personalised experience within the service. Creating a model for each individual user, however, is often too expensive and too hard to achieve. It is therefore of great interest to cluster users in some way, so that users within the same cluster can be given the same sort of settings.

Degoo provides a cloud backup storage service in the form of both a desktop application and a smartphone application. The smartphone application is available for both iOS and Android and allows users to back up their images in the cloud. This service generates a lot of data, in particular images from users. Degoo is today running 168 different split-tests within their application to personalise it. A split-test is a multivariate test where different versions of a web page or application are tested to determine which performs better. For example, one important question is: what price maximizes the revenue? Today this is tested by creating a split-test with several different prices, after which Degoo examines how many purchases are made at each price. The problem with this approach is that different users have very different price sensitivities. If we were able to group these users into clusters, we could easily determine whether clustering users helps raise the revenue or not.

Since Degoo has a lot of user data containing images, we opted to investigate the possibility of using that data to cluster the users. In this thesis, we will therefore explore the possibility of building a model that clusters users given the image library that each user has backed up to the service. The result of the user clustering would then be that every user receives a categorization based on the cluster assigned to them, and that we gain more insight into the user base. The overall expectation of the project is that the user clustering has some impact on the split-tests and that the results can provide input when selecting parameters for users.

1.1 Aim

The overall aim of the thesis is to cluster users using only images and evaluate the impact of the clustering on split-tests running in the Degoo service. This is done by implementing a model that clusters images uploaded to the service and later assigning each user to a cluster. The result is then evaluated on the split-tests running in the Degoo application today.

1.2 Problem description and research questions

The method that will be used is to implement a model to find patterns/similarities in user images in such a way that we can cluster these images and then categorize users based on these clusters. The result of the clustering will be evaluated by using it in the already running split-tests within the Degoo application. By doing this we can establish if the clustering had any impact on the split-tests or not.

Research Questions

• Is it possible to utilise image clustering and automated image categorization to obtain a categorization or type for each cluster?

• Using the image clusters, can we give a category or several categories to a user?

• Will the user clustering have any impact on the split-tests running in the application?

1.3 Delimitations

Since Degoo has a lot of user data and we want to use image clustering to obtain user categorizations and investigate if this would have any impact on the split-tests in the application, we limit ourselves to only look at the clustering and categorization of users using images.


1.4 Ethical aspects

As the research clearly enters the area of big data and big data analysis, we need to consider the users and their privacy, especially since we will use images that belong to users, which can be sensitive information, to find new patterns. Handling this much data carries great risks and can give rise to concerns for the users’ privacy. Tene and Polonetsky (2011) write in their paper “Privacy in the age of big data: A time for big decisions” that:

“The harvesting of large data sets and the use of analytics clearly implicate privacy concerns. The tasks of ensuring data security and protecting privacy become harder as information is multiplied and shared ever more widely around the world. Information regarding individuals’ health, location, electricity use, and online activity is exposed to scrutiny, raising concerns about profiling, discrimination, exclusion, and loss of control.”

We will of course make all the data used for the project anonymous and make the transfer of data as secure as possible. The data is today stored on Google’s servers by Degoo and will only be extracted when needed. During the image extraction, images were directly transformed into feature vectors and no images were stored locally. The users were also given the opportunity to opt in to or out of our use of their data in the analysis.

It is worth noting that anonymizing data is hard. There is always the possibility that identifying information about a user can leak even if the dataset is anonymized. An example of this is the Netflix Prize. The Netflix Prize was an open competition arranged by Netflix in which the competitors were to predict user ratings for movies using a collaborative filtering algorithm. Although Netflix had anonymized the dataset used for the competition, two researchers were able to identify individual users by matching the dataset with movie ratings in the Internet Movie Database (IMDb) (Narayanan and Shmatikov 2006). This then led to a class action lawsuit from the affected Netflix users (Netflix 2009).

Also noteworthy is the new General Data Protection Regulation (GDPR), adopted by the EU, which became enforceable on 25 May 2018. The GDPR is a regulation that concerns all individuals within the European Union and covers data protection and privacy. The GDPR addresses both the control over personal data and the export of personal data outside the EU. The aim is that users should have control over their personal data. Another important part of the GDPR is that international businesses are subject to the same regulation throughout the EU.

There are three main areas where the GDPR will affect the use of data analytics. The first is called Data processing and profiling, where the GDPR requires more control over data processing and profiling. Profiling is defined as:


“Any form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person, in particular, to analyse or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behaviour, location or movements.”

The second is called The right to an explanation, which gives consumers the right not to be subject to a decision that is based solely on automated processing. The third and last is called Bias and Discrimination and applies to companies that use automated decision-making, which have to prevent discriminatory effects (Council of European Union 2016).

As stated earlier, we only used data from users who had given us their consent, which is required for data processing and profiling. In addition, images were directly transformed into feature vectors and were not stored locally, to protect user privacy. Since the system is only used to group users in order to find optimal parameters in split-tests, we can easily explain why and how a decision was made.


Chapter 2

Background

The following chapter presents the underlying theories and background needed for developing the model that may answer the research questions in section 1.2.

2.1 Degoo

Degoo provides a cloud backup storage service in the form of both a desktop application and a smartphone application. The smartphone application is available for both iOS and Android and allows users to back up their images in the cloud.

The background to the thesis is that Degoo wants to give its users a better personalised experience within the application. A few examples of personalisation in the application are:

• Customized recommendations of photos in the user feed

• Better targeted ads

• Gaining insight into what users like or dislike by using split-tests in the application

Degoo is today running about 200 different split-tests in the application in order to personalise it better. For example, one important question is: what price maximizes the revenue? Today this is tested by creating a split-test with several different prices, after which the number of purchases made at each price is examined. The problem with this approach is that different users have very different price sensitivities. If we were able to group these users into clusters, we could easily determine whether clustering users helps raise the revenue or not.
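The price question can be made concrete with a small sketch. The observations below are invented purely for illustration (they are not Degoo data), and the function names are hypothetical. The idea is simply to compare the average revenue per exposed user for each price, separately for each user cluster:

```python
from collections import defaultdict

# Hypothetical split-test log: (user cluster, price shown, purchased?).
# These rows are made up for illustration only.
observations = [
    (0, 2.99, True), (0, 2.99, False), (0, 4.99, False), (0, 4.99, False),
    (1, 2.99, True), (1, 2.99, False), (1, 4.99, True), (1, 4.99, True),
]


def revenue_per_user(rows):
    """Average revenue per exposed user for each (cluster, price) pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cluster, price, purchased in rows:
        key = (cluster, price)
        counts[key] += 1
        if purchased:
            totals[key] += price
    return {key: totals[key] / counts[key] for key in counts}


def best_price_per_cluster(rows):
    """Price with the highest average revenue per user, for each cluster."""
    best = {}
    for (cluster, price), value in revenue_per_user(rows).items():
        if cluster not in best or value > best[cluster][1]:
            best[cluster] = (price, value)
    return {cluster: price for cluster, (price, _) in best.items()}


best_prices = best_price_per_cluster(observations)
```

With this toy data the two clusters end up with different best prices, which is the kind of pattern the split-test evaluation looks for. A real evaluation would also need to account for sample size and statistical significance.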

The provided data includes user data such as nationality, language, whether the user is a paying customer, OS, etc., and data based on what the user does within the application (events), e.g. sharing a photo or opening the application. There is also the backup data provided by the users, which consists of images.


2.2 Personalisation

The ultimate goal of any user-adaptive system is to personalise the system or service without the users asking for it explicitly (Mulvenna, Anand, and Büchner 2000). This process can be called personalisation and is done by tailoring the service or product to individual users or user groups. One example is web personalisation, that is, tailoring the product to the needs or interests of a specific user or a group of users. This can, for example, be done by delivering dynamic content, which can be textual elements, links, advertisements, product recommendations, etc. (Mobasher 2007). Personalised advertisement is particularly interesting, since it is usually a key component of the business model of many online services (O’Donnell and Cramer 2015).

2.2.1 Process

Performing the process of personalisation in a service requires at least four steps:

• Collection of data

• Modeling and categorization of the collected data

• Analysis of the collected data

• Taking actions based on the modeling and analysis of the data

2.2.2 Modeling and analysis

The modeling and analysis of the data can be done in several ways. According to Eirinaki and Vazirgiannis (2003) popular alternatives are content-based filtering, collaborative filtering and rule-based filtering. These are described below:

• Content-based filtering: These systems recommend items or products that are similar to the ones that a user liked in the past. This means that the system bases its recommendations only on the individual user’s preferences.

• Collaborative-based filtering: The collaborative-based filtering methods collect user ratings of objects and use this information about behavior and ratings to give recommendations that are of interest to the user. The assumption here is that users with similar behavior have similar interests.

• Rule-based filtering: In rule-based filtering the users are asked a set of questions. Based on the answers, the system gives a result tailored to the user’s needs. An example of a result would be a list of products.

To make the service user-adaptive and to be able to distinguish between users or user groups, however, user profiles are needed.


2.3 User profiling

In order to personalise a service, the system should be able to distinguish between different users or groups of users (Eirinaki and Vazirgiannis 2003). This can be done by a method called user profiling. User profiling is the process of creating a profile for a user or a user group. The profile can be constructed from a variety of data sources, both hardware and software, such as RFID-tags, biometrics, sensors, data cleansing, data aggregation and data mining. The constructed profiles contain information about the user, such as preferences, characteristics and activities performed by the user. This information is then used as the basis for making decisions. These decisions can even be made without human interaction, which is called automatic profiling (Profiling the European Citizen: Cross-Disciplinary Perspectives 2008).

2.3.1 Static and dynamic profile

There are two types of profiles: static and dynamic. In a static profile, the information within the profile is rarely or never altered. Such information can be the demographic information or sex of the user. In a dynamic profile, the user’s information is updated regularly or often. This kind of data can be collected through many different channels, for example by recording the navigational behavior of the user (Eirinaki and Vazirgiannis 2003).

2.3.2 Practices: supervised vs unsupervised

Profiles can be constructed using either supervised or unsupervised methods. Supervised methods usually involve testing the profile against a hypothesis, similar to the methodology of traditional scientific research, where one usually starts with a hypothesis and tries to test its validity. Unsupervised methods construct profiles by exploring a database using some data mining process. In this way it is possible to find patterns in the data that were not previously known, that is, similarities that were unexpected or had not been thought of (Zarsky 2002).

2.3.3 Individual and group profiles

There are two different ways to classify what a profile refers to. One is called individual profiling, where the profile is constructed using the data from a single user. This type of profiling can be used to discover particular individual characteristics. The other is called group profiling and can be used to give groups of users a certain type. This can be based on the fact that a constructed individual profile is similar to other, already constructed profiles. A group profile can refer either to results of data mining methods that find existing communities, such as region, politics or universities, or to unknown categories of people that do not form a community (Custers 2004).

2.4 Data clustering

Cluster analysis techniques are mostly used to explore datasets to get an understanding of whether or not they can be represented in any meaningful way. This can be described as using a relatively small number of groups of objects that are similar to each other and dissimilar in some respects from objects in other clusters (Everitt 2010).

Clustering is a general task and the most common form of unsupervised learning (Manning 2008). Many different algorithms have been proposed, each excelling at different tasks. Data clustering is usually used in exploratory data mining and in statistical data analysis to discover patterns in datasets. The task is to group a set of objects in such a way that the objects within the same group, also called a cluster, are similar to each other and at the same time dissimilar from objects in other clusters. The resulting clusters can give information about the underlying data distribution.

The difference between classification and clustering might not seem great at first, but they are very different. Classification is a supervised learning problem, where the task is to replicate a categorical distinction that a human supervisor has decided on, whereas clustering is an unsupervised learning problem, which means that there is no human supervisor to guide the process (Manning 2008).

2.4.1 Proximity measure

The most central part about clustering is how to measure the closeness between objects. This is usually referred to as dissimilarity, distance or similarity and the more general term proximity. When the dissimilarity or distance is small or when the similarity is large between two individual objects they are said to be close (Everitt 2010). There are many different closeness measures, for example the Euclidean distance.
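As an illustration of a proximity measure, the following minimal sketch computes the Euclidean distance between two feature vectors. It also includes the cosine similarity, which appears later in the discussion of Deep Adaptive Clustering. The function names are chosen here for illustration only.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
similarity = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

Note the difference in interpretation: a small Euclidean distance means the objects are close, while for cosine similarity a large value means they are close.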

2.4.2 Hard and soft clustering

Another important part in clustering is the distinction between hard and soft clustering. Hard clustering is when an object is given a hard assignment, that is the object belongs to exactly one cluster. Soft clustering is when an object is given a soft assignment, which is when an object belongs, by some percentage, to all clusters (Manning 2008).
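The distinction can be sketched in a few lines. The soft assignment below uses a softmax over negative squared distances to the cluster centroids, which is only one of many possible choices and is an assumption made for this example. The hard assignment simply picks the cluster with the largest membership.

```python
import math


def soft_assignment(point, centroids):
    """Membership weights over all clusters; the weights sum to 1."""
    scores = [-sum((p - c) ** 2 for p, c in zip(point, centroid))
              for centroid in centroids]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(score - top) for score in scores]
    total = sum(weights)
    return [weight / total for weight in weights]


def hard_assignment(point, centroids):
    """Index of the single cluster the point belongs to."""
    memberships = soft_assignment(point, centroids)
    return memberships.index(max(memberships))


centroids = [[0.0, 0.0], [4.0, 0.0]]
soft = soft_assignment([1.0, 0.0], centroids)
hard = hard_assignment([1.0, 0.0], centroids)
```

A point that lies exactly between the two centroids receives equal membership in both clusters under the soft assignment, whereas a hard assignment has to pick one of them.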


2.4.3 Clustering evaluation techniques

Clustering techniques are generally classified as partitional clustering or hierarchical clustering, based on the properties of the generated clusters (Xu 2008). In hierarchical clustering there are two ways to do the clustering: the agglomerative method and the divisive method. In the agglomerative method, each individual object starts as its own cluster and the two most similar clusters are merged with each other. This is done iteratively until a single large cluster has been formed or the process is stopped for some reason (clusters not similar enough, human decision, number of clusters) (Rahmani, Pal, and Arora 2014). The divisive method starts with just one cluster and then splits it iteratively into smaller clusters.

Partitional clustering assigns a set of data points to k clusters without any hierarchical structure while satisfying some distance criterion (Xu 2008). While the hierarchical methods never revisit the clusters after construction, the partitional methods iteratively revisit the clusters to improve them. With correct use of the data and method, this can result in high-quality clusters (Berkhin 2006).

The general procedure of cluster analysis can be described in four basic steps. The first is feature selection and feature extraction, where feature selection is the process of choosing distinguishing features within the dataset and feature extraction is a type of dimensionality reduction that uses some transformation to produce useful and novel features. In Computer Vision, feature extraction is usually used to efficiently represent the interesting parts of an image. Good feature selection can greatly reduce the workload, as fewer features need to be considered when clustering.

Clustering algorithm design and selection is the second step. In this step one usually decides on a proximity measure and uses it to build a criterion function for the clustering algorithm. Since there is no single solution to clustering, one has to choose the algorithm carefully based on the problem to be solved.

The third step concerns cluster validation. The problem with clustering is that a clustering algorithm can always divide the dataset into clusters, whether there exists any structure or not. It is therefore important to validate the clusters in some way. There are three different categories of evaluation: internal evaluation, external evaluation and relative indices. When the result of the clustering is evaluated on the data used for the clustering, we call it internal evaluation; this usually involves testing how dense and well-separated the clusters are. External evaluation measures the result against some already known prior information about the data and is used to validate the cluster structure. The fourth and last step is results interpretation, where one tries to make use of the clustering result, for example for personalisation of web applications or targeted ads (Xu and Wunsch 2005).


Internal cluster evaluation

Internal cluster evaluation is when the data used for the clustering is also used to validate the clusters (Rendón et al. 2011). There are several indices developed for internal cluster evaluation, and Rendón et al. (2011) performed a comparison between external and internal indexes where six internal indexes were used. In this study the Davies-Bouldin index, Dunn index and Calinski-Harabasz index performed well. The Davies-Bouldin (DB) index aims to find sets of clusters that are compact and well separated:

$$\text{Davies-Bouldin} = \frac{1}{c} \sum_{i=1}^{c} \max_{i \neq j} \left\{ \frac{d(X_i) + d(X_j)}{d(c_i, c_j)} \right\}, \tag{2.1}$$

where $c$ is the number of clusters, $i$ and $j$ are cluster labels, $d(X_i)$ and $d(X_j)$ are the average distances of all the samples in clusters $i$ and $j$, and $d(c_i, c_j)$ is the distance between the two centroids. A smaller DB value yields “better” clusters (Rendón et al. 2011).
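The Davies-Bouldin index can be sketched directly from its definition. The example below is a plain-Python illustration, not the implementation used in the thesis. It assumes that the average distance of a cluster is measured from each sample to the cluster centroid and that all distances are Euclidean.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    centers = [centroid(points) for points in clusters]
    # Assumption: d(X_i) is the mean distance from each sample to its centroid.
    scatter = [sum(euclidean(p, center) for p in points) / len(points)
               for points, center in zip(clusters, centers)]
    total = 0.0
    for i in range(len(clusters)):
        # Worst (largest) similarity ratio between cluster i and any other cluster.
        total += max((scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
                     for j in range(len(clusters)) if j != i)
    return total / len(clusters)


well_separated = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
close_together = [[[0.0, 0.0], [0.0, 2.0]], [[4.0, 0.0], [4.0, 2.0]]]
db_separated = davies_bouldin(well_separated)
db_close = davies_bouldin(close_together)
```

In this toy example the well-separated clustering gets the smaller value, in line with the interpretation that smaller is better. The sketch requires at least two clusters with distinct centroids.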

The Dunn index also aims to identify dense and well-separated clusters:

$$\text{Dunn} = \min_{1 \le i \le c} \left\{ \min \left\{ \frac{d(c_i, c_j)}{\max_{1 \le k \le c} d(X_k)} \right\} \right\}, \tag{2.2}$$

where $c$ is the number of clusters, $i$ and $j$ are cluster labels (the inner minimum is taken over the clusters $j \neq i$), $d(X_k)$ is the intracluster distance of cluster $X_k$ and $d(c_i, c_j)$ represents the distance between clusters $i$ and $j$. A larger Dunn value represents better clusters (Rendón et al. 2011).
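A minimal plain-Python sketch of the Dunn index follows. Two choices here are assumptions made for illustration, since several variants exist: the intracluster distance is taken to be the cluster diameter (the largest pairwise distance inside the cluster), and the distance between two clusters is taken to be the distance between their centroids.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def dunn_index(clusters):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    centers = [centroid(points) for points in clusters]
    # Assumption: d(X_k) is the diameter of cluster k (0 for a single point).
    largest_diameter = max(
        max((euclidean(a, b) for a, b in combinations(points, 2)), default=0.0)
        for points in clusters
    )
    # Assumption: d(c_i, c_j) is the distance between the two centroids.
    smallest_gap = min(euclidean(ci, cj) for ci, cj in combinations(centers, 2))
    return smallest_gap / largest_diameter


well_separated = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
close_together = [[[0.0, 0.0], [0.0, 2.0]], [[4.0, 0.0], [4.0, 2.0]]]
dunn_separated = dunn_index(well_separated)
dunn_close = dunn_index(close_together)
```

Because the denominator is the same for every pair of clusters, the double minimum in the definition reduces to the smallest between-cluster distance. The sketch needs at least two clusters and at least one cluster with more than one point, otherwise the diameter is zero.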

The Calinski-Harabasz index evaluates the clusters based on the average between-cluster and within-cluster sums of squares:

$$CH = \frac{\sum_{i} n_i \, d^2(c_i, c) / (NC - 1)}{\sum_{i} \sum_{x \in C_i} d^2(x, c_i) / (n - NC)}, \tag{2.3}$$

where $n$ is the number of points in the dataset, $n_i$ is the number of points in cluster $i$, $NC$ is the number of clusters, $C_i$ is the $i$th cluster, $c_i$ is the center of $C_i$, $c$ is the center of the whole dataset and $d(x, y)$ is the distance between $x$ and $y$ (Rendón et al. 2011).
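The Calinski-Harabasz index can likewise be written out in a few lines. The sketch below assumes squared Euclidean distances and takes the overall center to be the mean of all points. It is meant as an illustration of the formula rather than a reference implementation.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def calinski_harabasz(clusters):
    """Ratio of between-cluster to within-cluster dispersion."""
    all_points = [p for points in clusters for p in points]
    n, nc = len(all_points), len(clusters)
    overall_center = centroid(all_points)
    centers = [centroid(points) for points in clusters]
    between = sum(len(points) * squared_distance(center, overall_center)
                  for points, center in zip(clusters, centers))
    within = sum(squared_distance(p, center)
                 for points, center in zip(clusters, centers) for p in points)
    return (between / (nc - 1)) / (within / (n - nc))


well_separated = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
close_together = [[[0.0, 0.0], [0.0, 2.0]], [[4.0, 0.0], [4.0, 2.0]]]
ch_separated = calinski_harabasz(well_separated)
ch_close = calinski_harabasz(close_together)
```

A larger value indicates clusters that are dense relative to how far apart they are. The sketch requires at least two clusters, more points than clusters, and a non-zero within-cluster dispersion.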

External cluster evaluation

As mentioned before, external evaluation measures the result against some already known prior information about the data and is used to validate the cluster structure. There are many known external indices, and one of them is Clustering Accuracy (ACC). ACC can be used when the ground-truth categories of the images are known. This is usually done by setting the number of clusters to the number of ground-truth categories. ACC is then calculated by:

$$\text{ACC} = \max_{m} \frac{\sum_{i=1}^{n} \mathbf{1}\{l_i = m(c_i)\}}{n}, \tag{2.4}$$

where $n$ is the number of data points, $c_i$ is the cluster assignment of point $i$, $l_i$ is its ground-truth label and $m$ ranges over all possible one-to-one mappings between clusters and labels (Yang et al. 2010).
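For a small number of clusters, ACC can be computed by simply trying every one-to-one mapping from clusters to labels. The sketch below does exactly that and assumes that there are no more clusters than ground-truth categories. The function name and the toy labels are illustrative.

```python
from itertools import permutations


def clustering_accuracy(labels, assignments):
    """Best fraction of points whose mapped cluster equals the true label."""
    label_ids = sorted(set(labels))
    cluster_ids = sorted(set(assignments))
    best_hits = 0
    # Try every one-to-one mapping m from cluster ids to label ids.
    for candidate in permutations(label_ids, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, candidate))
        hits = sum(1 for label, cluster in zip(labels, assignments)
                   if mapping[cluster] == label)
        best_hits = max(best_hits, hits)
    return best_hits / len(labels)


true_labels = ["cat", "cat", "dog", "dog", "dog"]
cluster_assignments = [1, 1, 0, 0, 1]
acc = clustering_accuracy(true_labels, cluster_assignments)
```

The number of mappings grows factorially with the number of clusters, so for larger problems an assignment solver such as the Hungarian algorithm is commonly used instead of brute force.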

2.4.4 K-means

The k-means algorithm might be the most well-known clustering algorithm. It is a centroid-based clustering algorithm that iterates over two steps. One possible initialization of k-means is to randomly scatter k points, which become the centroids of the k clusters. The two steps are then 1) the clustering step, where the algorithm goes through each data point and assigns it to the closest cluster centroid, and 2) the moving of the centroids, where the algorithm moves each centroid to the average of the points in its cluster. These two steps repeat until there is little to no change in the centroids’ locations (Everitt 2010).
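The two steps can be sketched in plain Python as follows. This is an illustrative implementation, not the one used in the thesis. As an assumption, the initial centroids are k randomly chosen data points (a common variant of the random initialization), and iteration stops when the centroids no longer change.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def k_means(points, k, max_iterations=100, seed=0):
    """Return (centroids, assignments) after running k-means on the points."""
    rng = random.Random(seed)
    # Assumption: initialize the centroids with k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = []
    for _ in range(max_iterations):
        # Step 1: assign every point to its closest centroid.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 2: move every centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            new_centroids.append(mean(members) if members else centroids[j])
        if new_centroids == centroids:  # no change, so the algorithm has converged
            break
        centroids = new_centroids
    return centroids, assignments


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, assignments = k_means(points, k=2)
```

On this toy data the two groups of three points end up in separate clusters. In general the result depends on the initialization, which is why k-means is often run several times with different seeds.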

2.5 Image clustering

To perform clustering on images we need to represent the images in a compressed and meaningful way. The problem of using machines to analyse images compared to humans is the semantic gap. The semantic gap is when a machine learning algorithm might be able to retrieve low-level features of images, such as common patterns or colors, but might not be able to have a higher level of understanding of the image. The gap can be reduced using techniques from the research areas of Neural Networks, Genetic Algorithms and Clustering research (Sethi, Coman, and Stan 2001).

2.5.1 Image feature extraction

For large datasets or data such as images, we can use feature extraction to reduce the number of variables required to describe the data. Feature extraction is related to dimensionality reduction, where the input data is transformed into a set of features or feature vectors. Features are distinctive properties of inputs that are used to differentiate between categories of input patterns.

Image features can be split into two different types: local features and global features. Global features are generally used in image retrieval, object detection and classification, while local features are used for object recognition/identification. Global features describe the image as a whole to generalize the entire object, whereas local features describe the image patches (key points in the image) of an object. Global features can be categorized as contour representations, shape descriptors, and texture features, while local features are the texture in an image patch (Lisin et al. 2005).

With recent developments within Artificial Neural Networks and Deep Learning, the feature learning capabilities of these networks have proven to be good. Such networks have been used to generate image features that are later used as clustering features (Xie, Girshick, and Farhadi 2016).

2.5.2 Clustering algorithms

Xie, Girshick, and Farhadi (2016) describe a method called Deep Embedding Clustering (DEC) that simultaneously learns feature representations and cluster assignments using a Deep Autoencoder. Chang et al. (2017) describe a method called Deep Adaptive Clustering (DAC) that recasts the clustering problem into a binary pairwise-classification framework to judge whether pairs of images belong to the same cluster. In DAC, label features of images are generated by a Deep Convolutional Network (ConvNet) and are then used to calculate the cosine distance, which is used as the similarity measure.

Another paper identifies user groups using image clustering on a service called Instagram. In this paper, the authors used Scale-Invariant Feature Transform (SIFT) to detect and extract local discriminative features of images, then applied a standard vector quantization approach to obtain codebook vectors for each image, and finally used k-means to cluster the images based on the Euclidean distance (Hu et al. 2014).

Scale-Invariant Feature Transform and k-means

The Scale-Invariant Feature Transform is a computer vision algorithm used to detect and extract local features from images. Lowe (2004) describes SIFT as an algorithm that transforms image data into scale-invariant coordinates relative to local features and says that there are four major stages in computing the image features. These stages are called (1) Scale-space extrema detection, (2) Key point localization, (3) Orientation assignment and (4) Key point descriptor. The process for each stage is described by Lowe (2004).

These SIFT keys can then be used, as in Hu et al. (2014), to cluster the images based on the content by utilizing the k-means algorithm.

Deep Embedding Clustering

As mentioned before, DEC is a method that simultaneously learns feature representations and cluster assignments using Deep Neural Networks. To explain the DEC algorithm, let us consider the problem of clustering a set of n points $\{x_i \in X\}_{1}^{n}$ into k clusters. Instead of directly clustering in the data space X, DEC transforms the data with a non-linear mapping $f_\theta : X \to Z$, where $\theta$ are learnable parameters and Z is the latent feature space. Since this functions like dimensionality reduction, that is, Z has a much smaller dimensionality than X, DEC avoids some of the problems of the curse of dimensionality. To parameterize $f_\theta$, Deep Neural Networks are used due to their theoretical function approximation properties and their demonstrated feature learning capabilities (Xie, Girshick, and Farhadi 2016). This means that the features of images are generated by a Deep Neural Network by taking the output of the layer before the classification is done.

DEC is built in two phases. The first phase is parameter initialization with the Deep Autoencoder and the second phase is parameter optimization, that is clustering, where DEC iterates between computing an auxiliary target distribution and minimizing the Kullback-Leibler (KL) divergence to the distribution (Xie, Girshick, and Farhadi 2016).
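The quantities involved in the second phase can be illustrated with a small numerical sketch. It follows the formulation given by Xie, Girshick, and Farhadi (2016): a soft assignment based on a Student’s t kernel between embedded points and cluster centroids, an auxiliary target distribution obtained by squaring the assignments and normalizing by soft cluster frequency, and the KL divergence between the two. The embedded points and centroids below are made-up one-dimensional values, and the gradient step that updates the network and the centroids is left out.

```python
import math


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_assignments(embedded, centroids, alpha=1.0):
    """q[i][j]: probability of assigning embedded point i to cluster j."""
    q = []
    for z in embedded:
        row = [(1.0 + squared_distance(z, mu) / alpha) ** (-(alpha + 1.0) / 2.0)
               for mu in centroids]
        total = sum(row)
        q.append([value / total for value in row])
    return q


def target_distribution(q):
    """p[i][j]: sharpened version of q that emphasizes confident assignments."""
    frequencies = [sum(column) for column in zip(*q)]  # soft cluster sizes
    p = []
    for row in q:
        weighted = [value * value / freq for value, freq in zip(row, frequencies)]
        total = sum(weighted)
        p.append([w / total for w in weighted])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all points and clusters."""
    return sum(p_ij * math.log(p_ij / q_ij)
               for p_row, q_row in zip(p, q)
               for p_ij, q_ij in zip(p_row, q_row) if p_ij > 0)


# Hypothetical embedded points and centroids, for illustration only.
embedded = [[0.0], [0.2], [3.0], [3.3]]
centroids = [[0.0], [3.0]]
q = soft_assignments(embedded, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)
```

In the full method this loss would be minimized with respect to both the encoder parameters and the centroids, with the target distribution recomputed at intervals. Here it is only evaluated once to show how the pieces fit together.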

Deep Adaptive Clustering

Deep Adaptive Clustering (DAC) is a clustering algorithm that uses a single-stage ConvNet-based method to cluster images. DAC does this by using pairwise classification to decide whether two images belong to the same cluster. The images are represented by the label features generated by the ConvNet, and the similarity between two images is calculated as the cosine distance between their label features (Chang et al. 2017).

To get Deep Adaptive Clustering to work Chang et al. (2017) developed an algorithm called Adaptive learning, which is an iterative algorithm used to optimise the DAC model. At each iteration, pairwise images with the estimated similarities are selected. The next step is to train the ConvNet using the selected labeled samples from last step in a supervised way. When all the samples are used for training and the function of the binary pairwise classification can not be improved further the algorithm has converged. When the algorithm has converged, the images can be clustered by finding the largest response of label features (Chang et al. 2017).
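As an illustration of the pairwise comparison described above, the following minimal NumPy sketch computes the cosine similarity between label features, marks confident pairs and assigns each image to the cluster with the largest response. The feature values, the function names and the two thresholds are assumptions chosen for the example and are not taken from Chang et al. (2017).

```python
import numpy as np


def pairwise_cosine(label_features):
    """Cosine similarity between every pair of rows."""
    norms = np.linalg.norm(label_features, axis=1, keepdims=True)
    unit = label_features / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def select_pairs(similarity, upper=0.95, lower=0.45):
    """Mark confident pairs: 1 = same cluster, 0 = different, -1 = not selected.

    The thresholds are hypothetical values used only for this example.
    """
    labels = np.full(similarity.shape, -1, dtype=int)
    labels[similarity >= upper] = 1
    labels[similarity <= lower] = 0
    return labels


# Hypothetical label features for four images and three clusters.
features = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.4, 0.5, 0.1],
])
similarity = pairwise_cosine(features)
pair_labels = select_pairs(similarity)
clusters = features.argmax(axis=1)  # largest response of the label features
```

In this toy example the first two images form a confident "same cluster" pair, the first and third a confident "different cluster" pair, and the fourth image is too ambiguous to be selected for training.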

2.6 Image recognition

In recent years, the fields of Machine Learning and Deep Learning have made great progress on many problems, but the one that has received the most attention is image recognition. The main reason behind the progress is a type of model called the Deep Convolutional Neural Network (ConvNet), or very Deep Convolutional Neural Network (Szegedy et al. 2016).

Using these Deep Learning models for image classification has yielded great results in recent years. ImageNet is an image recognition competition using machine learning. During the ImageNet competition in 2012, a ConvNet called AlexNet achieved a top-5 error rate of 15.3% on the validation data set. Later, Inception (GoogLeNet) achieved 6.67%, BN-Inception-v2 achieved 4.9% and Inception-v3 reached 3.46% on the same validation data set (TensorFlow 2018).

2.7 Bag of words

In the area of Computer Vision, the Bag of Words model (BoW) can be used to classify images by treating the image features as words. The BoW method comes from document classification, where a document is represented by a sparse vector of occurrence counts of words. When the BoW model is used in Computer Vision, it is often described as a bag of visual words, where the bag is a vector that describes the number of occurrences of each local image feature (Yang and Newsam 2010).
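A bag of visual words can be sketched in a few lines of NumPy. The example below assumes that a vocabulary of visual words already exists (for instance cluster centers of local descriptors) and counts, for one image, how many descriptors fall closest to each word. The function name and the toy two-dimensional descriptors are hypothetical; real SIFT descriptors have 128 dimensions.

```python
import numpy as np


def bag_of_visual_words(descriptors, vocabulary):
    """Count how often each visual word is the nearest word to a descriptor."""
    # Squared Euclidean distance from every descriptor to every visual word.
    diffs = descriptors[:, None, :] - vocabulary[None, :, :]
    distances = (diffs ** 2).sum(axis=2)
    nearest = distances.argmin(axis=1)
    return np.bincount(nearest, minlength=len(vocabulary))


# Toy data: three visual words and four local descriptors from one image.
vocabulary = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.7, 5.2]])
histogram = bag_of_visual_words(descriptors, vocabulary)
```

The resulting vector has one entry per visual word, which is the occurrence-count representation described above.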

Chapter 3

Method

The following chapter presents the underlying methods and evaluation techniques used to develop the model to answer the research questions in section 1.2.

3.1 Environment

In this section we describe the software and hardware used during the model creation and user categorization. The programming language used for all development was Python 3.6.4, together with libraries compatible with that version.

3.1.1 NumPy

NumPy is a Python library mostly used for scientific analysis projects and for efficient array operations. The library provides a high-performance multidimensional array object and tools for operations on these arrays, such as functions to find the maximum value, the minimum value or the unique values in an array. Many of the libraries used in this thesis also depend on NumPy.

3.1.2 Sklearn

Sklearn, also called scikit-learn, is built upon NumPy and SciPy. SciPy is a library used for scientific and technical computing. It contains many modules for scientific computing, such as numerical algorithms, optimization and linear algebra. The Sklearn library expands on SciPy and provides tools for data mining and data analysis.

3.1.3 Google cloud

The Google Cloud module provides operations such as setting up data buckets, fetching files within a data bucket and other similar features available on Google Cloud. In this thesis Google Cloud was only used for fetching images and split-test data stored within Google’s BigQuery, a web service for analysis of very large datasets. BigQuery accepts queries expressed in a standard SQL dialect.

3.1.4 Tensorflow

TensorFlow is an open source software library for numerical computation using data flow graphs. It can be used for many tasks but is mostly used for machine learning applications such as neural networks. The graphs are built from nodes and edges, where the nodes represent mathematical operations while the edges represent data arrays called tensors. The tensors flow between the nodes to perform the numerical computations. Since the architecture is general and flexible, it is easy to deploy computations on one or more CPUs or GPUs.

The TensorFlow back-end was used to build and run the computations of the model for the image clustering.

Keras and Deep Embedding Clustering

Keras is a Deep Learning library for Python. It provides a high-level neural network API and is able to run on top of TensorFlow, CNTK or Theano. Keras allows for easy and fast setup of neural networks and runs on both CPU and GPU. The core data structure of Keras is the model, a way to organize layers. In this thesis Keras is used to build and train the DEC model. A Keras module that implements the DEC model is used for the thesis and can be found on GitHub.

3.1.5 Matplotlib

Matplotlib is a plotting library from the SciPy ecosystem and provides easy-to-use functions to create graphs and figures within Python. The provided functionality includes plots, histograms, power spectra, bar charts, error charts, scatter plots, etc. The Pyplot module within Matplotlib gives a MATLAB-like interface with full control of line styles, font properties and axes properties. Matplotlib is used to visualize the results for the different parts of the thesis.

3.1.6 Hardware

A MacBook Pro laptop with an Intel Core i5 processor with a clock rate of 2.9 GHz and 16 GB of memory was used during development and testing. The operating system used was macOS. For the model training, a more powerful computer with a GeForce GTX 970 graphics processing unit was used.

3.2 Image clustering

3.2.1 Data

The data used to build the model for the experiment were images that users had uploaded to the Degoo service. Only images from users who had agreed to have their images included in the machine learning experiment were used. The data itself consists of regular images backed up by the user to the Degoo servers, so the content can be anything from vacation photos to screenshots from the phone. All the data was fetched using Google BigQuery and transformed into image features. The image features were then saved in NumPy array files along with a prediction for the image and the user the image belonged to. A total of 81857 images from 4676 users were used to train and test the clustering model. A user was given a category and included in the split-test analysis only if the user had 15 or more images. This was to eliminate users with only a few images (many had only one image), who could skew the results.

3.2.2 Feature extraction

To extract features from the images, a Deep Neural Network pre-trained on ImageNet was used. The network used is called VGG16 and is part of the Keras library. To use the VGG16 network, the images were transformed into 224x224x3 arrays using the img_to_array and preprocess_input functions provided by Keras. The transformed image was then put through the VGG16 network, but instead of taking the classification result from the last layer, the output of the fc1 layer (the first fully connected layer) was used. These features were then used in the pre-training and training of the DEC model.

3.2.3 Model

As mentioned earlier, the model used in the thesis is the Deep Embedding Clustering (DEC) model. We opted to use DEC since it is more established and more readily available than DAC, although the DAC model has performed better on some datasets. DEC is built in two phases: (1) parameter initialization using a Deep Autoencoder and (2) parameter optimization, which is the clustering.

In the first phase the Deep Autoencoder is constructed using greedy layer-wise training. The layers are created with the Keras model. After the greedy layer-wise training, all encoder layers followed by all decoder layers are concatenated, in reverse layer-wise training order. The fully constructed Autoencoder is then fine-tuned by minimizing the reconstruction loss. The result is a Deep Autoencoder, and before moving to the next step the decoder part of the Autoencoder is discarded. The remaining encoder is then the initial mapping between the data space and the feature space, shown in figure 3.1.

Figure 3.1: Deep auto-encoder structure (Xie, Girshick, and Farhadi 2016)

The data points can then be passed through the Deep Autoencoder to get the initial deep embedded data points. These points are then used to initialize the k clusters using the standard k-means algorithm.

The second phase iterates between computing an auxiliary target distribution and minimizing the Kullback-Leibler (KL) divergence to that distribution. This is done by first computing a soft assignment between the embedded points and the cluster centers. Then the deep mapping is updated and the cluster centers are refined by learning from the current high-confidence assignments using the auxiliary target distribution. This part can be seen as a form of self-training, where we have an initial classifier and an unlabeled dataset. The dataset is labeled using the classifier, and the classifier is then trained on its own high-confidence predictions. This process is repeated until some convergence criterion is met.
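The quantities involved in the second phase can be sketched with NumPy. The example follows the formulas given by Xie, Girshick, and Farhadi (2016), using a Student's t kernel with one degree of freedom for the soft assignment. It only evaluates the soft assignment, the target distribution and the KL loss once; the gradient updates of the mapping and the centers are left out. The embedded points, the cluster centers and the function names are toy assumptions, not the thesis implementation.

```python
import numpy as np


def soft_assignment(z, centers, alpha=1.0):
    """Student's t similarity between embedded points and cluster centers."""
    sq_dist = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)


def target_distribution(q):
    """Sharpen q: square it and normalise by the soft cluster frequencies."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)


def kl_divergence(p, q):
    """KL(P || Q), the loss that is minimised during clustering."""
    return float(np.sum(p * np.log(p / q)))


# Toy embedded points and cluster centers (hypothetical values).
z = np.array([[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]])
centers = np.array([[0.0, 0.0], [3.0, 3.0]])

q = soft_assignment(z, centers)
p = target_distribution(q)
loss = kl_divergence(p, q)
```

Because the target distribution puts more weight on assignments that are already confident, each row of p is at least as peaked as the corresponding row of q in this example, which is what drives the self-training behaviour described above.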

Number of clusters

To establish the number of clusters to use for the model, several internal cluster evaluation indices were used alongside an automated image categorization. The automated image categorization used a bag-of-words-like method where each image in a cluster was given a category by a DNN pre-trained on the ImageNet dataset. The network used was the VGG16 network provided by Keras. The category of each image was then saved in a bag belonging to the image's cluster. In the end, each cluster had a bag full of image categories, which was used to give an overview of the types of images in each cluster.

3.3 Evaluation

3.3.1 Internal cluster evaluation

We use several internal indices to evaluate how many clusters should be used while clustering the images. Below we describe the indices used in the thesis.

Davies-Bouldin index

As described in section 2.4.3, the DB index measures how tight and well separated the produced clusters are. The DB index was implemented to fit the DEC model. This was done by first adding the average internal distances of two clusters and dividing the sum by the distance between the two cluster centroids; let us call this value R. Then the maximum R value for each cluster is summed, and the sum is divided by the number of clusters. The formula for the DB index is given in equation 2.1. A lower DB value indicates better clusters.
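The computation described above can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the thesis; it assumes that the internal distance of a cluster is the average Euclidean distance from its points to its centroid, and the function name and toy data are hypothetical.

```python
import numpy as np


def davies_bouldin(points, labels):
    """Average, over clusters, of the worst (largest) R value of each cluster."""
    ids = np.unique(labels)
    centroids = np.array([points[labels == c].mean(axis=0) for c in ids])
    # Average distance from each cluster's points to its own centroid.
    scatter = np.array([
        np.linalg.norm(points[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    worst = []
    for i in range(len(ids)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(ids)) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))


# Two tight clusters that are far apart give a low (good) value.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
db = davies_bouldin(points, labels)
```

For the toy data each cluster has an average internal distance of 1 and the centroids are 10 apart, so R is (1 + 1) / 10 for both clusters.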

Dunn index

The Dunn index also identifies tight and well separated clusters, but in this case we measure the ratio between the minimal distance between two clusters and the maximal internal cluster distance. The formula for the Dunn index is given in equation 2.2. The index was implemented to fit the DEC model. Higher Dunn values indicate better clusters.
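A minimal NumPy sketch of this ratio is shown below. It assumes that the distance between two clusters is the smallest distance between any two of their points and that the internal cluster distance is the largest distance between two points in the same cluster; other definitions exist, and this is not the implementation used in the thesis. The function name and the toy data are hypothetical.

```python
import numpy as np


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    clusters = [points[labels == c] for c in np.unique(labels)]

    def pairwise(a, b):
        # Euclidean distance between every point in a and every point in b.
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    max_diameter = max(pairwise(c, c).max() for c in clusters)
    min_separation = min(
        pairwise(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return float(min_separation / max_diameter)


points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
dunn = dunn_index(points, labels)
```

In the toy data the clusters are 10 apart and each has a diameter of 2, which gives a ratio of 5. The sketch assumes that at least one cluster contains two distinct points, since the diameter would otherwise be zero.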

Calinski-Harabasz index

The Calinski-Harabasz (CH) index is described in section 2.4.3 and evaluates the clusters based on the average between-cluster and within-cluster sums of squares. The Sklearn implementation was used to calculate the CH index, and the formula is given in equation 2.3. A large CH value indicates good clusters.
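To make the ratio concrete, the following sketch computes the index directly with NumPy, as the between-cluster sum of squares divided by k - 1 over the within-cluster sum of squares divided by n - k. It is a stand-in written for illustration, not the Sklearn implementation used in the thesis, and the function name and toy data are hypothetical.

```python
import numpy as np


def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster dispersion."""
    ids = np.unique(labels)
    n, k = len(points), len(ids)
    overall_mean = points.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in ids:
        members = points[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return float((between / (k - 1)) / (within / (n - k)))


points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
ch = calinski_harabasz(points, labels)
```

For the toy data the between-cluster sum of squares is 100 and the within-cluster sum of squares is 4, so the index is (100 / 1) / (4 / 2) = 50.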

3.3.2 Cluster categorization

To categorize the clusters we used an automated image categorization model. This means that each image used to create the clusters was also put through a DNN pre-trained on the ImageNet dataset. As mentioned above, the DNN used for the classification was the VGG16 network provided by the Keras library. When the DEC model had finished training, each image was associated with the cluster assigned by the DEC model, and the predicted category of the image was put into a bag of words for that cluster. This generated one bag of words per cluster, which could then be analyzed to give an overview of the image types in each cluster.
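The bookkeeping behind the bags can be sketched with the Python standard library. The example assumes that the cluster id and the predicted top 1 category are already available for each image; the function name and the sample values are hypothetical.

```python
from collections import Counter, defaultdict


def build_cluster_bags(cluster_ids, categories):
    """Collect the predicted category of every image in its cluster's bag."""
    bags = defaultdict(Counter)
    for cluster_id, category in zip(cluster_ids, categories):
        bags[cluster_id][category] += 1
    return bags


# Hypothetical cluster assignments and top 1 predictions for five images.
cluster_ids = [0, 0, 1, 0, 1]
categories = ["sandbar", "lakeside", "web site", "sandbar", "menu"]

bags = build_cluster_bags(cluster_ids, categories)
top_for_cluster_0 = bags[0].most_common(10)  # the ten most frequent categories
```

Calling most_common(10) on a bag gives the kind of top 10 list that a per-cluster histogram is drawn from.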

3.3.3 User categorization

The user categorization was created using a method similar to a soft assignment. All images of a user were put through the DEC model to get a cluster for each image. After getting the cluster assignments, the images in each cluster were counted and the counts were divided by the total number of images of the user. This resulted in an array describing how much the user belonged to each cluster.
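As a small sketch of this step, the function below turns the cluster ids of a user's images into the fractions described above. The function name and the example values are hypothetical, and the sketch assumes that the user has at least one image.

```python
import numpy as np


def user_soft_assignment(image_clusters, n_clusters):
    """Fraction of a user's images that fall in each cluster."""
    counts = np.bincount(image_clusters, minlength=n_clusters)
    return counts / counts.sum()


# Hypothetical user with six images and five possible clusters.
membership = user_soft_assignment(np.array([0, 0, 2, 4, 0, 2]), n_clusters=5)
top_cluster = int(membership.argmax())  # the top 1 soft assignment
```

Here half of the images fall in cluster 0, a third in cluster 2 and a sixth in cluster 4, so the user's top 1 assignment is cluster 0.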

3.3.4 Split-test evaluation

Degoo performs several split-tests at the same time within the application, which means that different users might experience different versions of the application. Degoo also tracks certain events that happen within the application or that the user triggers; one such event is the removal of the application by the user. To get insight into these tests, Degoo groups the split-test data on the events, which makes it possible to draw some conclusions from the data. One example is whether removal of the application increased or decreased with a particular split-test attribute value.

The idea is then to also group on the user clusters obtained from the image clustering, in addition to grouping on events. This is to see if it is possible to find optimal parameters for certain clusters, or if any cluster is over-represented or under-represented in some events.

To do this, a tool built by Degoo was used to extract the split-test data, first without clustering and then with clustering. The two data sets were then compared to see if any pattern emerged from the clustering.

Out of 168 tests, the 5 that showed the most variance between attribute values were chosen. This means that there was no clear best attribute value for these split-tests. For each of these five tests, statistics for four events were fetched. The events were called app remove, user engagement, feed interval cards shown and user is hooked. The app remove event triggers if a user has removed the application. User engagement is an event that occurs every x seconds while the application is open, where x is set by Google. The feed interval cards shown event occurs every time the user has scrolled past five cards (photos, images, ads) and can be seen as a measure of how much users scroll in the feed. Lastly, the user is hooked event triggers when a user has had the application installed for more than three days and has stored at least 0.5 GB of data.

Below we describe the five split-tests selected.

Content blocker bytes per pixel threshold 6

This test removes images from the feed that are potentially “boring”. One way this is done is by checking the number of bytes per pixel, which makes it easy to see if an image is blurry. JPEG transforms the image to the frequency domain (with a discrete cosine transform, which is closely related to the Fourier transform) and removes the highest frequencies if they have low amplitude. More frequencies can therefore be removed from blurry images, which makes blurry images smaller in bytes. The attribute value for this test is the limit on the number of bytes per pixel for an image to appear in the feed.
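The check can be illustrated with a short sketch. The function names, the example file sizes and the choice to keep images at or above the limit are assumptions made for the example; the source only states that the attribute value is the bytes-per-pixel limit.

```python
def bytes_per_pixel(file_size_bytes, width, height):
    """Compressed file size divided by the number of pixels."""
    return file_size_bytes / (width * height)


def passes_content_blocker(file_size_bytes, width, height, threshold):
    """Assumed rule: keep the image if it has at least `threshold` bytes per pixel."""
    return bytes_per_pixel(file_size_bytes, width, height) >= threshold


# Two hypothetical 1920x1080 JPEG files checked against a limit of 0.07.
sharp_is_kept = passes_content_blocker(300_000, 1920, 1080, threshold=0.07)
blurry_is_kept = passes_content_blocker(90_000, 1920, 1080, threshold=0.07)
```

The first file has roughly 0.14 bytes per pixel and is kept, while the second has roughly 0.04 bytes per pixel and would be removed from the feed under this assumed rule.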

Content blocker dark and light sum threshold

This test removes images from the feed that are considered over- or under-exposed. It checks the sum of the darkest and brightest pixels to see if they dominate the image. The attribute value for this test is the limit for the maximum percentage of pixels that are either very dark or very bright.
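A minimal sketch of such a check on a grayscale image is given below. The intensity cut-offs for "very dark" and "very bright", the function name and the tiny example image are assumptions; the source does not state how Degoo defines these levels.

```python
import numpy as np


def dark_light_fraction(gray, dark_level=16, light_level=239):
    """Fraction of pixels that are very dark or very bright.

    The two cut-off levels are hypothetical values for 8-bit images.
    """
    extreme = (gray <= dark_level) | (gray >= light_level)
    return float(extreme.mean())


# Tiny hypothetical 8-bit grayscale image with mostly extreme pixels.
gray = np.array([[0, 5, 250, 255], [10, 128, 240, 3]], dtype=np.uint8)
fraction = dark_light_fraction(gray)
is_blocked = fraction > 0.85  # 0.85 is one of the tested attribute values
```

Seven of the eight pixels are either very dark or very bright, so the fraction is 0.875 and the image would be blocked with a limit of 0.85.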

FDB download max groups out of memory

This test tries to determine the maximum size per metadata batch that the client downloads when the memory is starting to run out. If the batches are big, the overhead is small but the memory usage is high, and if the batches are small, the overhead is big while the memory usage is small. The attribute value for this test is the maximum number of metadata groups to download in each batch when fetching metadata (filename, size, etc.).

Small file threshold ratio 7

When an image is small enough, it is stored directly in the metadata instead of in Google Cloud Storage. This reduces the I/O costs and gives a quicker backup. The disadvantage is that the metadata database can grow too large. The test therefore tries to find a good balance for when to do this. The attribute value for this test is the limit that determines whether a file is small enough to be uploaded using only the metadata.

Top size remote videos supported

This test tries to determine the maximum size of a video in the feed that is not stored locally. Videos that are too long use too much bandwidth for both Degoo and the user, but at the same time it is important to have a lot of material to show in the feed. The attribute value for this test is the maximum video size in bytes.

Chapter 4

Result

The following chapter presents the results obtained by implementing the model described in chapter 3 and comparing different values of k, where k is the number of clusters used when training. The results obtained from the automated image categorization of clusters are also presented, alongside the impact of the user clustering on the split-tests.

4.1 Internal cluster evaluation

As mentioned in section 3.3.1, several internal cluster evaluation indices were used to find the optimal number of clusters. The indices were calculated as described in section 3.3.1 and section 2.4.3. Since the indices are calculated in very different ways, it can be hard to compare them against each other. We have therefore split the indices into three different figures displaying the measurements when varying the number of clusters from 4 to 14, see figures 4.1, 4.2 and 4.3.

For the Davies-Bouldin index lower values are better, while for the other two higher values are better. Combining the results for all three indices, we see that a k around 4-5 produces good clusters. But since the measurements are not based on any ground truth (as explained in section 2.4.3), the results are only indicative and should not be treated as precise.

Figure 4.1: Davies–Bouldin index over a varying number of clusters (lower score is better)

Figure 4.2: Dunn index over a varying number of clusters (higher score is better)

Figure 4.3: Calinski-Harabasz index over a varying number of clusters (higher score is better)

4.2 Cluster categorization

After the number of clusters had been set, all the clusters were put through the automated image categorization model described in section 3.3.2, which was trained on the ImageNet dataset. Each image in a cluster was given a category, and the category was associated with the cluster. The result for each cluster is displayed in the histograms below. Each histogram shows the top 10 categories associated with the cluster on the y-axis and the number of times each category occurred on the x-axis. The raw data for each cluster can be found in appendix A.2.

In figure 4.4, describing cluster 0, we can see that a large number of images fall under the categories sandbar, lakeside, valley or sunglasses. In cluster 1, shown in figure 4.5, the dominant categories are web site, comic book, toyshop and restaurant.

In cluster 2, shown in figure 4.6, we see that the big categories are menu, web site, envelope and packet. In cluster 3, shown in figure 4.7, the dominant categories are pajama, seat belt, diaper and wig. Lastly, in cluster 4, shown in figure 4.8, the big categories are pajama, groom, miniskirt and restaurant. Pajama is thus the dominant category for two of the clusters, cluster 3 and cluster 4, which can possibly be explained by several factors. One is that we only took the top 1 prediction for each image and discarded the rest, which could yield some information loss. Another is that the model was more confident in predicting, in this case, pajama than other key features in the image. We also have to consider that the model is not trained to describe an image, just to find key features in it.

Figure 4.4: Top 10 categories for cluster 0 (x-axis: number of images; y-axis: categories). Categories shown: swimming trunks, maze, sarong, breakwater, seashore, swing, sunglasses, valley, lakeside, sandbar.

Figure 4.5: Top 10 categories for cluster 1 (x-axis: number of images; y-axis: categories). Categories shown: fountain, altar, pot, sliding door, book jacket, stage, restaurant, toyshop, comic book, web site.

Figure 4.6: Top 10 categories for cluster 2 (x-axis: number of images; y-axis: categories). Categories shown: binder, monitor, crossword puzzle, brass, slide rule, book jacket, packet, envelope, web site, menu.

Figure 4.7: Top 10 categories for cluster 3 (x-axis: number of images; y-axis: categories). Categories shown: brassiere, bikini, bonnet, shower cap, lab coat, bow tie, wig, diaper, seat belt, pajama.

Figure 4.8: Top 10 categories for cluster 4 (x-axis: number of images; y-axis: categories). Categories shown: diaper, suit, candle, lab coat, kimono, sarong, restaurant, miniskirt, groom, pajama.

In figure 4.9 the spread of images over the clusters is shown, where the cluster id is on the x-axis while the number of images in the cluster is displayed on the y-axis. We can see that the spread is somewhat even over the clusters with just one of them standing out, that is cluster 1.

Figure 4.9: Number of images for each cluster, that is the spread of images over the clusters. The dashed line represents the average of images in a cluster.

4.3 User categorization

Each user was given a soft assignment to each cluster based on how many images the user had in each cluster. Figure 4.10 shows the spread of users over the clusters when only taking the top 1 soft assignment for each user. Here we can see an even spread of users over the clusters, with the exception of cluster 1, which stands out from the rest.

Figure 4.10: Spread of users over the clusters by only looking at top 1 soft assignment. The dashed line represents the average of users in a cluster.

4.4 Split-test impact

Here we present the results gained by using the clustering for the split-tests within the application. The raw split-test data collected can be found in appendix A.1. Only users with more than 15 images were used to extract the data for the clustered split-tests, as described in section 3.2.1. This resulted in a total of 1166 unique users. Five split-tests were identified based on the statistics collected from the split-tests without the clustering of users. These split-tests were selected because they showed the most variance over the different attribute values. After the five split-tests had been selected, each of them was grouped on each cluster of users to get the event data. The event data is the number of times each event occurred. The five split-tests are listed below, and for each split-test four bar graphs are shown. Each bar graph represents one of the four events described in section 3.3.4. On the x-axis of the bar graph we have all the attribute values for the split-test, and on the y-axis we show how many times the event occurred for that attribute value relative to the other attribute values, in percent. For example, if a split-test has two attribute values, with 3 event counts for attribute value 1 and 1 event count for attribute value 2, we have a total of 4 event counts, which yields 75% for attribute value 1 and 25% for attribute value 2.
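The percentage calculation in the example above can be written as a short function. The function name and the dictionary keys are hypothetical; the counts are the ones used in the example.

```python
def event_share(event_counts):
    """Convert raw event counts per attribute value into percentages."""
    total = sum(event_counts.values())
    return {value: 100.0 * count / total for value, count in event_counts.items()}


shares = event_share({"attribute value 1": 3, "attribute value 2": 1})
```

With 3 and 1 event counts this yields 75% and 25%, matching the example in the text.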

The split-test results are compared using the non-clustered data and the clustered data. We try to find clusters that uninstall the application more frequently than the non-clustered data, as well as to find optimal attribute values for the clusters. We say that an attribute value is optimal when the cluster has a low uninstall event count and high cards shown, engagement and user hooked event counts.

Content blocker bytes per pixel threshold 6

The Content blocker bytes per pixel threshold 6 test is described in section 3.3.4. In figure 4.11 we can see that cluster 0 stands out for attribute value 0.07, as do clusters 1, 3 and 4 for attribute value 0.08, when compared to the non-clustered data. In figures 4.12 and 4.13 we see a more uniform spread. Lastly, in figure 4.14 we see that clusters 1 and 2 are lower for attribute value 0.05, as are cluster 0 for attribute value 0.06 and cluster 3 for attribute value 0.07, all compared to the non-clustered data. We can also see that all clusters perform better for attribute value 0.09.

Figure 4.11: App remove event for Content blocker bytes per pixel threshold 6

Figure 4.12: Feed interval cards shown event for Content blocker bytes per pixel threshold 6

Figure 4.13: User engagement event for Content blocker bytes per pixel threshold 6

Figure 4.14: User is hooked event for Content blocker bytes per pixel threshold 6

Content blocker dark and light sum threshold

The Content blocker dark and light sum threshold test is described in section 3.3.4, and we can see a lot of variance over all attribute values for most events. This can possibly be seen as random noise. We can see some variance in the app remove event shown in figure 4.15. For attribute value 0.85, cluster 1, cluster 3 and cluster 4 stand out, while the non-clustered result is more spread over all attribute values. In figure 4.16 all clusters show a high event count for attribute value 0.7, while in figure 4.17 many of the clusters show a high event count for attribute value 0.75. The same holds for the user is hooked event shown in figure 4.18.

Figure 4.15: App remove event for Content blocker dark and light sum threshold

Figure 4.16: Feed interval cards shown event for Content blocker dark and light sum threshold

Figure 4.17: User engagement event for Content blocker dark and light sum threshold

Figure 4.18: User is hooked event for Content blocker dark and light sum threshold

FDB download max groups out of memory

The FDB download max groups out of memory test is described in section 3.3.4, and most of the clusters show results similar to the non-clustered result. However, there are some outliers in figure 4.19, where cluster 0 shows more uninstalls for attribute value 0.1 and cluster 2 for attribute value 0.05. In figure 4.21 cluster 1 shows more engagement for attribute value 0.2.

Figure 4.19: App remove event for FDB download max groups out of memory

Figure 4.20: Feed interval cards shown event for FDB download max groups out of memory

Figure 4.21: User engagement event for FDB download max groups out of memory

Figure 4.22: User is hooked event for FDB download max groups out of memory

Small file threshold ratio 7

The Small file threshold ratio 7 test is described in section 3.3.4. In figure 4.23, clusters 1 and 2 show many uninstalls for attribute value 0.5, while cluster 4 and cluster 0 show more for attribute values 0.6 and 0.7 respectively. For the cards in feed event, shown in figure 4.24, the distribution is uniform, but cluster 2 stands out for attribute value 0.9 while cluster 0 and cluster 1 stand out more for attribute value 0.5. In figure 4.25 all the clusters have a high engagement for attribute value 0.5, while the data without the clustering is more evenly spread. Lastly, in the user is hooked event, shown in figure 4.26, all clusters show a high event count for attribute value 0.9.

Figure 4.23: App remove event for Small file threshold ratio 7

Figure 4.24: Feed interval cards shown event for Small file threshold ratio 7

Figure 4.25: User engagement event for Small file threshold ratio 7

Figure 4.26: User is hooked event for Small file threshold ratio 7

Top size remote videos supported

The Top size remote videos supported test is described in section 3.3.4 and shows a similar distribution for the clustered data compared to the non-clustered data over most events. The biggest difference is shown in figure 4.30, where many of the clusters show high engagement for attribute value 3145728 while the non-clustered result does not.

Figure 4.27: App remove event for Top size remote videos supported

Figure 4.28: Feed interval cards shown event for Top size remote videos supported

Figure 4.29: User engagement event for Top size remote videos supported

Figure 4.30: User is hooked event for Top size remote videos supported

Chapter 5

Discussion

The purpose of this study was to explore the possibilities to build a model that clusters users based on the user’s photo library. This was done to see if clustering of users would improve the personalisation of the Degoo application. To evaluate the user clustering, split-test data from the Degoo application was used. This was done by comparing non-clustered split-test data to clustered split-test data, and the result of this is presented in section 4.4.

The number of image clusters was determined using the results from the internal cluster evaluation, where the Davies-Bouldin index, Dunn index and Calinski-Harabasz index were used. However, while the internal indices give us an indication of how many clusters we should use, the decision to use only five clusters was also motivated by the fact that we did not want to split the users into too many groups, and by the wish to avoid having too many users in the same cluster. If more clusters were used, the userbase would either be too spread out or many of the users would end up in the same cluster.

Considering only the internal indices used in the study, we can see a clear indication that increasing k above 6 does not yield better clusters. However, we cannot say anything about k above 14, but in this case having that many clusters would probably split the userbase too much.

Based on the category histograms in section 4.2, there is an indication of a type for each cluster, some clearer than others. For the top 10 categories in cluster 0, shown in figure 4.4, many of the categories can be associated with either vacation images or beach images. Cluster 2, shown in figure 4.6, contains mostly images with text, which are probably screenshots of a mobile or computer screen, or similar images. Clusters 3 and 4, shown in figure 4.7 and figure 4.8 respectively, show similar categories, consisting mostly of different types of clothes. This indicates that one of the clusters probably contains images of people, such as selfies, while the other contains images of clothes. This is probably a result of the image categorization, where the model was more confident in categorizing clothes than people. As mentioned in section 1.4, we could not check this due to GDPR. Lastly, cluster 1 is the least clear of the five clusters. It gives an indication of outdoor images, with categories such as altar, fountain and stage, but at the same time it contains images of web sites and comic books, which are most likely text. However, we also have to consider the reliability of the automated categorization of images. While it can categorize many of the images correctly, there are many categories it cannot categorize yet. Since it is only used to give an indication, and not an exact answer, of what images each cluster contains, it is a good alternative. Theoretically, we could categorize the images manually, but this would result in an entirely different thesis.

We only used one model in this study to cluster the users and images. It would be interesting to use other models for the image clustering, such as DAC, described in section 2.5.2, or any other clustering algorithm developed for images. In section 1.3 we limited the study to only consider images when performing the clustering. Since Degoo collects much more data about its users, there is good potential to extend the clustering and gain an even better understanding of the userbase, which would be an interesting extension of this work.

The results of the split-tests presented in section 4.4 are interesting. For some attribute values in the five tests, some of the clusters are clearly overrepresented compared to the non-clustered results, such as clusters 1 and 2 for attribute value 0.5 and cluster 4 for attribute value 0.6 in the Small file threshold ratio 7 test, shown in figure 4.23. In the Content blocker dark and light sum threshold test, shown in figure 4.15, clusters 1 and 4 are both overrepresented for attribute value 0.85. By overrepresented we mean that a cluster shows a much higher event count than the non-clustered data.
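The notion of overrepresentation can be illustrated with a small sketch. It assumes that the event count percentage is the share of a group's events that falls on each attribute value, and that the non-clustered baseline is obtained by pooling the counts of all clusters; both are assumptions about how the figures were produced. The function names, the margin of 10 percentage points and the toy counts are hypothetical and not taken from the study. Every group is assumed to have at least one event.

```python
def event_share(counts):
    """Convert {attribute_value: event_count} into percentages of the total."""
    total = sum(counts.values())
    return {value: 100.0 * n / total for value, n in counts.items()}


def overrepresented(cluster_counts, margin=10.0):
    """Return (cluster, attribute_value) pairs whose share exceeds the baseline.

    cluster_counts: {cluster_id: {attribute_value: event_count}}
    margin: hypothetical threshold in percentage points (not from the thesis).
    """
    # Baseline: pool the counts of all clusters ("without clustering").
    pooled = {}
    for counts in cluster_counts.values():
        for value, n in counts.items():
            pooled[value] = pooled.get(value, 0) + n
    baseline = event_share(pooled)

    flagged = []
    for cluster, counts in cluster_counts.items():
        for value, share in event_share(counts).items():
            # Flag the pair when the cluster's share is well above the baseline.
            if share - baseline[value] > margin:
                flagged.append((cluster, value))
    return flagged


# Toy counts (hypothetical): app-remove events per cluster and attribute value.
counts = {
    0: {0.5: 10, 0.9: 10},
    1: {0.5: 18, 0.9: 2},
}
print(overrepresented(counts))
```

A fixed margin is the simplest possible criterion; with small clusters a statistical test on the counts would be a more careful choice.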

Looking only at the Small file threshold ratio 7 test, we can identify that for cluster 2 and 0 the optimal attribute value is 0.9, since there are no removals at all for cluster 2 and below 10% for cluster 1, while engagement stays high for the rest of the events. Considering the Top size remote videos supported test, there is a high "user is hooked" percentage for attribute value 3145728 for clusters 0, 1 and 2, and at the same time a low removal percentage for clusters 0 and 2. Combined with a relatively high engagement, this indicates a good attribute value. In the Content blocker dark and light sum threshold test, attribute values 0.7 and 0.95 show good values for cluster 4.

Looking more generally at all five split-tests, there is an indication that high engagement in the "feed interval cards shown", "user engagement" and "user is hooked" events also yields a higher rate of the "app remove" event. A couple of tests also show little variance compared to the non-clustered results, namely Content blocker bytes per pixel threshold 6 and FDB download max groups out of memory.

5.1 Conclusion

This study presents and implements a method for clustering users based on the user’s photo library. It utilises a clustering model called Deep Embedding Clustering, which builds on a Deep Autoencoder. The clusters were evaluated using several internal indices and an automated image categorization model, to give an indication of what images were associated with each cluster. Four out of five clusters showed indications of different types, such as vacation photos, clothes, text and people. The user clustering was done by counting the number of images a user had in each cluster and assigning the user to the cluster with the highest number of images. To evaluate whether the clustering had any impact on the split-tests running in the application, five tests were selected to compare the non-clustered data with data grouped by the generated clusters. This showed that some clusters were overrepresented for some attribute values in the event of removing the application, and we could see some patterns indicating optimal attribute values for some clusters. This information can then be used by Degoo to personalise the application for the users.
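The user assignment rule described above, where each user goes to the cluster holding most of their images, can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the function name, the input format of (user, cluster) pairs and the tie-breaking rule (lowest cluster id wins) are all hypothetical.

```python
from collections import Counter


def assign_users(image_clusters):
    """Assign each user to the cluster that holds most of their images.

    image_clusters: iterable of (user_id, cluster_id) pairs, one per image.
    Ties go to the lowest cluster id; this tie-break is an assumption.
    """
    per_user = {}
    for user_id, cluster_id in image_clusters:
        # Count how many of this user's images fall in each cluster.
        per_user.setdefault(user_id, Counter())[cluster_id] += 1
    return {
        user_id: min(counts, key=lambda c: (-counts[c], c))
        for user_id, counts in per_user.items()
    }


# Toy input (hypothetical): user "a" has three images, user "b" has two.
images = [("a", 0), ("a", 0), ("a", 3), ("b", 2), ("b", 1)]
print(assign_users(images))
```

A majority rule of this kind discards how mixed a user's library is; keeping the per-cluster proportions instead would preserve that information at the cost of a less simple grouping.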

5.2 Future work

One thing mentioned in the discussion that would be interesting to look into further is including more data about the users when performing the clustering. This means not only looking at the images that the users upload to the service, but also including sex, age, location, type of system (computer/smartphone, operating system) etc. Another thing worth investigating is the use of different models to cluster the users, and comparing them against each other.