Identifying Patterns in User Behavior in a Music Streaming Service: A Cluster Analysis Approach

Fredrik Göthner

Degree project in Computer Science, second cycle
Degree project at CSC, KTH (Examensarbete vid CSC, KTH)
E-mail address at KTH: fgothner@kth.se
Degree project in: Computer Science
Supervisor: Herman, Pawel
Examiner: Lansner, Anders
Commissioned by: Spotify AB
Date: 2013-06-03
Abstract
Logged user data has become a highly valued asset to many Internet-based services with large user bases. Being able to draw insight from this data is considered a key to gaining competitive advantages for the companies behind the services. This study aims to identify patterns in the behavior of users when interacting with Spotify, a music streaming service, by studying automatically logged data. In the study, we examine several methods to perform such analyses using machine learning techniques. We identify six different types of behavior through k-means cluster analysis, each representing between 51.4% and 0.5% of all user sessions. We also identify five factors partly explaining the differences in behavior between different sessions. These are found through factor analysis and account for 39% of the variance in the data. Finally, we demonstrate how factors and clusters can be translated from numeric representations to linguistic interpretations.
Identifying Patterns in User Behavior for a Music Streaming Service: A Cluster Analysis

Sammanfattning

Logged user data has become a highly valued asset for many Internet-based services with a large number of users. Finding insights in this data is considered a key to gaining competitive advantages for the companies behind the services. This study aims to identify patterns in the behavior of users of Spotify, a music streaming service, by studying logged data. The study examines several methods for performing this type of analysis using machine learning techniques. We identify six different types of behavior through k-means cluster analysis, each representing the behavior in between 51.4% and 0.5% of all sessions. We also identify five factors that partly explain the differences in behavior between users' sessions. These are found through factor analysis and together account for 39% of the variance in the study's data. Finally, we describe how clusters and factors can be translated from numeric representations to semantic interpretations.
I would like to thank my supervisor at KTH, Pawel Herman, for his solid support and engagement in this study. His support has truly contributed to the quality of the study and to the enjoyment of the work.
I would also like to thank my supervisor at Spotify, Henrik Landgren, for his support and for the opportunity to do my degree project for, in my opinion, one of the most interesting Internet services in the world. Henrik has been a constant source of inspiration and feedback.
Finally, I would like to thank the members of the Spotify Analytics Insights team who have provided constant support with technical issues as well as feedback regarding all conceivable aspects of the study.
1 Introduction
1.1 Background
1.2 Related Work
1.3 Problem Statement
1.4 Scope of Study
1.5 Potential Approaches
1.5.1 Factor Analysis
1.5.2 Principal Component Analysis
1.5.3 Cluster Analysis
1.5.4 Association Analysis
1.6 Report Outline
2 Method
2.1 Overview & Experimental Setting
2.2 Dataset
2.2.1 Sessions
2.2.2 Data Attributes
2.2.3 Population
2.3 Data Preprocessing
2.3.1 Outlier Removal
2.3.2 Session Filtering
2.3.3 Ceiling Values
2.3.4 Normalization
2.4 Exploratory Factor Analysis
2.5 Cluster Analysis
2.5.1 Algorithms
2.5.2 Model Selection
2.5.3 Model Evaluation
3 Results
3.1 Factor Analysis
3.1.1 Factor Analysis Model Selection
3.1.2 Factor Loadings
3.1.3 Interpretations
3.2 Cluster Analysis
3.2.3 Interpretation
3.2.4 Evaluation of Generalization and Robustness
4 Discussion & Conclusions
4.1 Conclusions
4.1.1 Identifying Behavior Patterns
4.1.2 Applicability of Machine Learning
4.2 Discussion of Results
4.2.1 Model Validation
4.2.2 Distribution of Use Cases
4.2.3 Uncertainty of Classification
4.3 Challenges and Emerging Issues
4.3.1 Data Collection and Preprocessing
4.3.2 Attribute Selection
4.3.3 Factor Analysis
4.3.4 Cluster Analysis
4.4 Future Work
4.4.1 Other Learning Methods
4.4.2 Analyzing Users by Behavior
4.4.3 Expanding Behavior
References
Appendix A
Appendix B
1 Introduction
1.1 Background
Spotify is an Internet-based music streaming service, offering music on several stationary and mobile platforms. The service is currently offered through a free, ad-supported subscription and two premium subscriptions based on monthly payments. In Q1 of 2013, Spotify reported 24 million active users globally, and 6 million paying subscribers.
Behavior related to music consumption changes with technological development. During the last century, technological innovations like vinyl record players, portable tape recorders, CD players and digital distribution of music have changed the processes by which we select music and control playback as well as the situations in which we consume music (Guberman, 2011). It is also possible to distinguish a shift in priorities amongst music consumers from fidelity to convenience (Guberman, 2011). The emergence of music streaming services like Spotify during the last decade is likely to impact the way users consume and listen to music. At the same time, these services provide great opportunities to study user behavior through logged user data. In this study, behavior refers to how the user acts in order to select music, control playback and discover new music, rather than what music the user is listening to and what he or she is doing while listening.
The study of user behavior through logged data is a well-known research topic, for example within the fields of web search ranking (Agichtein et al., 2006), user-adaptive systems (Frias-Martinez et al., 2005) and telecommunication networks (Zhu et al., 2011).
For a music streaming service, insight into user behavior is useful for many purposes, for example product design and development, optimizing in-service advertising, and tracking growth and market penetration. The ability to segment the user population based on behavior rather than traditional means such as gender, age and socioeconomic status could provide valuable consumer insight. Additionally, the ability to obtain this information through analysis of logged data, in addition to user surveys, focus groups or other traditional methods, could yield practical advantages.
The size of the data collected by Spotify on a daily basis can be considered very large. Several of the company’s data sources have dimensionalities in the hundreds and the size of the data collected is in the magnitude of Terabytes.
Being able to automate analytic processes is vital to leverage this amount of information.
The main goal of this study is to conceive a structured method to analyze user behavior through logged data. The method should be possible to apply to other, similar applications. A natural first question in a data driven study of user behavior is simply if there are any patterns or stereotypes in the way users behave when they use the service, and if so, what they are. The analytic goal of this study will be focused on identifying potential patterns in user behavior.
1.2 Related Work
Many studies have performed quantitative analysis of people based on their behavior in a certain domain. Behavior in this sense can have various meanings, but is typically connected to analyzing the decision-‐making processes of individuals or groups of people in certain situations. During the last decade, this type of problem has been approached with various statistical methods and machine learning techniques.
Within the field of behavior related to music consumption, Boer et al. (2012) assessed how the use of music is “underpinned by psychological processes”.
The study identified ten different psychological functions of music, found through principal component analysis of survey data and demonstrated through multi-dimensional scaling. The study also examined systematic differences in the importance of these functions across different gender groups or cultures.
In another study, more focused on exploring differences between different groups of people, Chamorro-Premuzic, Swami & Cermakova (2010) performed statistical hypothesis testing regarding the associations between individuals' motives for listening to music, music consumption habits and personal attributes such as demographics, "Big Five" personality traits and emotional intelligence. The study concluded that age and motives for listening better explained differences in music consumption than personality traits.
Several studies have worked with identifying groups of individuals, based on habits or behavior in isolated situations, using various cluster analysis approaches. For instance, Jiang, Ferreira & Gonzales (2012) studied daily activity patterns of urban inhabitants through principal component analysis and cluster analysis. Brandtzaeg et al. (2010) performed a typology study of Internet usage and identified five types of European Internet users using k-means cluster analysis of survey data. Primack et al. (2012) used a two-step cluster analysis approach based on k-means and hierarchical agglomerative cluster analysis to segment U.S. university students based on substance abuse habits.
Data related to behavior has also been used to improve the performance of algorithms working mainly on heuristics and other types of data. Agichtein et al. (2006) showed how user behavior can be used as implicit feedback to improve the performance of web-ranking algorithms based on artificial neural networks.
Generally, machine learning offers several attractive advantages for this type of study: it can be applied to very large data sources, its methods can help distinguish patterns that are too complex to identify manually, and it allows us to rely on computers for heavy computations (Marsland, 2009).
1.3 Problem Statement
The goal of this study can be expressed in two separate parts:
1. Identifying the most common patterns in user behavior.
These patterns will be referred to as use cases. This objective can be expressed as two sub-‐objectives:
a. Obtain numeric representations of use cases through quantitative analysis of automatically logged user data.
b. Conceive a method to translate the numeric representations to comprehensible, semantic descriptions.
2. Evaluating the usefulness of machine learning methods for this type of analysis. More specifically, comparing the applicability of various specific techniques and demonstrating the results of one or several of them.
Using the taxonomic description suggested by Frias-Martinez et al. (2005) and considering the stated purpose of the study, the task at hand can be described as collaborative modeling (modeling a population of many users) for a classification purpose. However, it should be noted that although the ultimate purpose is to be able to classify a user or an instance of usage by the type of behavior the user exhibits, this task is of an unsupervised nature, meaning that we do not have any data conveying the "ground truth" behavior type (Marsland, 2009).
Taking this into account, the following description of the task is suggested:
1. Collaborative modeling. Finding patterns in user behavior.
2. Interpretation. Translating these patterns into use cases by explaining what they represent.
3. Classification. Using the resulting model and interpretation to classify and describe previously unseen data.
The study examines behavior in discrete instances of use, rather than the aggregated behavior of each user. This is further discussed in the Method chapter.
1.4 Scope of Study
As the purpose of the study is to identify patterns in user behavior, it is necessary to decide on a level of detail of the behavior that is studied. The aim of the study is to identify the most common behavior of the users, rather than to identify every type of behavior that a user can exhibit.
Considering the purpose of the study, the data available and certain practical implications, we define a set of criteria for the data attributes that are used to describe user behavior:
• Discriminability. The attribute should provide information about the user’s behavior that can help distinguish one type of behavior from another.
• Platform invariance. The attribute should be as robust as possible with respect to what type of platform is used to access the service.
• Robustness over time. The meaning and values of the attribute should be robust to changes in the service and software updates in the clients.
Based on these criteria, the definition of user behavior for the purpose of this study is limited to data concerning the following:
• The actions taken by the user to manipulate the sequence of tracks played,
• The distinct parts of the client, or views, visited by the user when navigating the client, or
• The creation and acquisition of playlists.
In addition to the data related to behavior that is used for modeling, other user-‐related data is used to assist the interpretation and analysis of the resulting patterns.
1.5 Potential Approaches
The following section covers state-of-the-art machine learning techniques that have been used for similar applications.
1.5.1 Factor Analysis
Factor analysis is a method of determining whether covariance among data attributes can be expressed as the result of common, underlying factors, and how those factors relate to the observed attributes. Factor analysis is commonly used to explore or confirm underlying structures of data sets within a number of different fields, for example psychology (Lecavalier & Norris, 2010; Mavor & Louis, 2010) or athletics (Ertel, 2011).
For the purpose of this study, factor analysis can be used to reveal structure in the data and to better understand tendencies in user behavior.
1.5.2 Principal Component Analysis
Principal component analysis (PCA) refers to projecting the observed data onto a different basis, such that the basis vectors are orthogonal and represent the directions in which the observed data has the greatest amounts of variance. This representation can be found by extracting the eigenvectors of the covariance matrix of the observed data. Ranking the eigenvectors by the size of their eigenvalues ranks the components by the amount of variance each accounts for (Marsland, 2009). Through PCA, a representation of the data can often be obtained that captures most of the variance in the data with fewer dimensions (Marsland, 2009). For this reason, PCA is often used for dimensionality reduction (Lewis-Beck, 1994).
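The eigendecomposition described above can be sketched in a few lines of NumPy; the function name and interface here are ours, for illustration only:

```python
import numpy as np

def pca_eig(X, n_components):
    """PCA via eigendecomposition of the covariance matrix.

    Returns the data projected onto the top `n_components`
    principal directions and the fraction of total variance
    each retained component accounts for.
    """
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending order
    order = np.argsort(eigvals)[::-1]        # re-rank descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :n_components]            # principal directions
    explained = eigvals[:n_components] / eigvals.sum()
    return Xc @ W, explained
```

The eigenvalue ranking makes the variance ordering explicit: the first entry of `explained` is always the largest share.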
Dimensionality reduction does not have an obvious advantage for this study, since each attribute conveys interpretative value. Merging several attributes into one component risks losing some of that value, so although PCA has proven useful in many other studies (Lewis-Beck, 1994), it is not used in this study.
1.5.3 Cluster Analysis
Cluster analysis has been used in several previous studies to identify groups of individuals who are similar in some aspect (Jiang, Ferreira & Gonzales, 2012; Brandtzaeg et al., 2010; Primack et al., 2012). Cluster analysis refers to partitioning the data set: finding groups, or clusters, of samples in the observed data such that samples within the same cluster are similar in some sense, and samples from different clusters are dissimilar (Marsland, 2009). Once such groups are found, they can be used to classify new data by assigning new samples to the most appropriate group (Marsland, 2009).
Finding such clusters in a dataset representing separate instances of listening could reveal patterns in user behavior, if we can argue that each cluster corresponds to one behavior type.
There are a number of known algorithms that perform clustering, working in different ways. Several taxonomies categorizing clustering algorithms have been suggested (Tran et al., 2013; Cios et al., 1998). In this study, we categorize algorithms into three main types:
• Objective function-based optimization methods, in which the algorithm tries to find a partition of the samples that optimizes the value of an objective function (Cios et al., 1998).
• Hierarchical methods, where the samples are aggregated or divided into different groups (Cios et al., 1998).
• Density-based methods, where a cluster is defined as a region of high density (Tran et al., 2013).
In this study, several algorithms from each of these categories are considered and described by two characteristics, namely similarity metric and model arguments. Similarity metric refers to how similarity (or dissimilarity) between two samples is determined; common choices base it on a distance norm or on probability. Model arguments refers to parameters that need to be specified before using the algorithm, and for which a good setting has to be found for each application. An overview of the algorithm types and a few examples are presented in Table 1.
Table 1. Overview of clustering algorithms.

Objective function-based optimization:
• K-means. Most important model arguments: number of clusters (k), initial assignments, distance norm. Similarity metric: distance (typically Euclidean norm).
• Fuzzy c-means. Most important model arguments: number of clusters, initial assignments, distance norm, fuzzification parameter. Similarity metric: distance.
• K-medoids. Most important model arguments: number of clusters (k), distance norm. Similarity metric: distance.
• Gaussian Mixture Models. Most important model arguments: number of clusters, covariance shape. Similarity metric: probability.

Hierarchical:
• Agglomerative clustering (e.g. AGNES). Most important model arguments: linkage, distance norm. Similarity metric: distance.

Density-based:
• DBSCAN. Most important model arguments: neighborhood size, core point limit. Similarity metric: distance.
In order to evaluate the usefulness of different clustering algorithms for this study, the following criteria are used:
1. The algorithm should yield a result that can easily be interpreted and used to explain user behavior.
2. The algorithm should have a feasible computation time (<1h) for large data sets (>100,000 samples and >20 attributes) on conventional personal computers.
3. The algorithm should yield a model that can be used to classify samples that were not used for training.
The algorithms’ relations to these criteria are listed in Table 2.
Table 2. Review of six different clustering algorithms against criteria 1–3 above.

• K-means: criterion 1 Yes, criterion 2 Yes, criterion 3 Yes. Reference: Marsland (2009).
• Fuzzy c-means: criterion 1 Medium, criterion 2 Yes, criterion 3 Yes. Reference: Meyer et al. (2013).
• K-medoids: criterion 1 Medium, criterion 2 No, criterion 3 Yes. Reference: Maechler et al. (2013).
• Agglomerative clustering: criterion 1 Medium, criterion 2 No, criterion 3 Yes. Reference: Maechler et al. (2013).
• Gaussian Mixture Models: criterion 1 Medium, criterion 2 Yes, criterion 3 Yes. Reference: Marsland (2009).
• DBSCAN: criterion 1 No, criterion 2 No, criterion 3 No. Reference: Tran et al. (2013).
K-means clustering is a commonly used clustering technique. It computes a measure of distance from each sample to a number (k) of prototypes (means), and each sample is then assigned to its closest prototype (Zhu et al., 2011; Jiang, Ferreira & Gonzales, 2012). K-means meets all of the above criteria.
Fuzzy c-means clustering offers an interesting feature in that it assigns a degree of membership between each sample and each cluster, rather than strictly assigning each sample to one cluster. However, it requires specifying a fuzzification value, which controls the degree to which samples affect far-away cluster centers (Meyer et al., 2013). This parameter needs to be tuned in parallel with the number of cluster centers, and its impact on the end result is not obvious before training the model.
K-medoids is computationally heavier than k-means, although implementations exist that can handle large datasets (Maechler et al., 2013). In k-medoids, each cluster is represented by a prototype sample, whereas in k-means each cluster is represented by the mean vector, or centroid. This makes k-medoids less sensitive to outliers (Marsland, 2009). However, it also means that k-medoids has trouble representing binary attributes well, since the prototypes cannot take any intermediate value.
Gaussian Mixture Models (GMM) use a procedure similar to that of k-means to fit a statistical model to the data through Expectation Maximization (Marsland, 2009). The method offers several mathematical advantages over k-means and k-medoids, as it is generally more flexible with regard to the shape and size of the clusters (Marsland, 2009). Consequently, the resulting model is more complex, including the means, prior probabilities and covariance matrices of the obtained mixture model (Marsland, 2009). Overall, it meets the criteria and is considered an appropriate candidate for this study.
Based on the above criteria (interpretability, computational feasibility, predictive possibility), k-means and GMM are selected as appropriate methods for cluster analysis.
1.5.4 Association Analysis
Association analysis refers to extracting association rules between items in a transactional data set. These frequent item sets can be used to predict items in future transactions (Frank & Witten, 2000). Association analysis has been used to identify behavioral patterns in previous studies (Ros, Delgado & Amparo Vila, 2009; 2011). However, the resulting association rules are typically used for prediction purposes and do not offer explanatory value equivalent to the results of the other methods discussed. Association analysis is therefore not covered further in this study.
1.6 Report Outline
The second chapter of this report, Method, covers details about data gathering and preprocessing. It also covers theoretical descriptions of the machine learning techniques used to find patterns in user behavior, as well as descriptions of how these methods are used in this study.
The third chapter, Results, covers the measurements made for model selection, numeric descriptions of the behavior patterns found and an explanation of how these numeric descriptions can be used to interpret user
behavior. It also contains a brief analysis of the generalizing ability of the final model and its robustness.
The fourth chapter, Discussion & Conclusions, concludes the findings of the study and discusses the challenges and issues which emerged throughout the work. It also covers suggestions for future work.
2 Method
2.1 Overview & Experimental Setting
The data used for the study is collected from Spotify's logged and aggregated data sources through Apache Hadoop, a distributed computing and data storage framework. The datasets used for this study are obtained by running scripts conforming to the MapReduce programming paradigm (Dean & Ghemawat, 2004).
The data is preprocessed using the NumPy (NumPy, 2013) and Scikit-Learn (Scikit-Learn, 2013) Python libraries, which are used to plot the initial distributions of each attribute, remove outliers, filter the dataset and normalize the data.
The first analytic step involves performing Exploratory Factor Analysis to obtain an understanding of the underlying structure of the data. This is done using the fa() function from the psych R library.
Lastly, the data is clustered using two different algorithms, k-means clustering and Gaussian Mixture Models. Both implementations are from the Scikit-Learn Python library.
2.2 Dataset
2.2.1 Sessions
To isolate instances of use, we introduce the concept of sessions. A session is defined as a user continuously using the service (by playing songs or browsing the client) with at most 15 minutes of inactivity and using one platform only. Any activity after more than 15 minutes of inactivity or a platform change is assigned to a different session than any previous activity.
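As a sketch, this session definition can be implemented as a single pass over one user's chronologically sorted events; the (timestamp, platform) tuple format is an assumption for illustration, not the actual Spotify log schema:

```python
from datetime import timedelta

SESSION_GAP = timedelta(minutes=15)

def split_sessions(events):
    """Split one user's chronologically sorted (timestamp, platform)
    events into sessions: a new session starts after more than 15
    minutes of inactivity or whenever the platform changes."""
    sessions = []
    for ts, platform in events:
        if sessions:
            last_ts, last_platform = sessions[-1][-1]
            if ts - last_ts <= SESSION_GAP and platform == last_platform:
                sessions[-1].append((ts, platform))  # same session
                continue
        sessions.append([(ts, platform)])            # new session
    return sessions
```

Note that a platform change always opens a new session, even if the gap is short, matching the one-platform-per-session rule above.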
2.2.2 Data Attributes
For each session, 21 data attributes are collected:
• 3 attributes related to the timing of the session.
• 10 attributes related to how the user browses the client.
• 6 attributes related to music selection.
• 2 attributes related to playlist maintenance.
In this study, a stream is defined as a user playing a track for at least 30 seconds. A playlist is a collection of songs compiled by a user.
The attributes are evaluated based on the criteria stated in the Introduction chapter (discriminability, platform invariance and robustness over time) and through discussions with internal Spotify staff. The following data attributes were considered but rejected due to limited interpretability or poor performance in pilot tests:
• 2 attributes related to the state of the client during playback.
• 2 attributes related to the time of the session.
• 1 attribute related to the platform used.
• 14 attributes related to the source of the streams.
2.2.3 Population
The study is limited to users from Sweden. The user base is sampled to obtain a dataset of feasible size. The dataset used for training of the models consists of 179,748 sessions. The data is collected for the period January 28th – February 24th 2013.
2.3 Data Preprocessing
2.3.1 Outlier Removal
Many machine learning applications, including k-means clustering, are sensitive to outliers (Marsland, 2009). Outliers are samples taking unusual values for one or more attributes.
In this study, we define an outlier as a sample for which one attribute takes a value that falls outside the 99th percentile of the distribution for that attribute. Any sample defined as an outlier is removed. In other words, each attribute (excluding those covered in section 2.3.3) is limited to the range of values containing 99% of the data. The rationale behind this procedure is that any data outside the 99th percentile does not fit the definition of “the most common behavior” stated in section 1.4, and is thus outside the scope of the study.
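A sketch of this rule with NumPy follows; the function name is ours, and the `exclude` argument stands in for the ceiling-value attributes of section 2.3.3, which are handled separately:

```python
import numpy as np

def remove_outliers(X, pct=99.0, exclude=()):
    """Drop every sample whose value on any non-excluded attribute
    exceeds that attribute's `pct`-th percentile."""
    limits = np.percentile(X, pct, axis=0)
    cols = [j for j in range(X.shape[1]) if j not in exclude]
    keep = np.all(X[:, cols] <= limits[cols], axis=1)
    return X[keep]
```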
2.3.2 Session Filtering
Sessions containing fewer than 2 streams or lasting less than 90 seconds are filtered from the dataset. The rationale is that these sessions are too short to be used to analyze behavior. After filtering and outlier removal, 138,920 samples remain.
2.3.3 Ceiling Values
Some attributes, conveying information about events that only occur in a small subset of samples, have very sparse distributions. These attributes have zero values for the vast majority of the samples, but typically have highly variable values among the non-zero samples. Instead of removing samples falling outside the 99th percentile, a threshold is defined for each such attribute, and values over this threshold are simply truncated to the threshold.
This allows us to retain the information conveyed by the attribute while avoiding outliers caused by certain high attribute values.
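This truncation amounts to an elementwise minimum per attribute. In the sketch below, the threshold mapping is hypothetical, as the thesis does not list the per-attribute ceilings:

```python
import numpy as np

def apply_ceiling(X, thresholds):
    """Truncate sparse attributes at per-attribute ceilings.

    `thresholds` maps a column index to its ceiling value; values
    above the ceiling are clipped down to it, all others are kept.
    """
    X = X.copy()
    for col, ceiling in thresholds.items():
        X[:, col] = np.minimum(X[:, col], ceiling)
    return X
```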
2.3.4 Normalization
For the methods used in the study (discussed in sections 2.4 and 2.5), it is important that each attribute is weighted equally, regardless of the magnitude of its values (Marsland, 2009). The dataset is normalized by subtracting the mean value and dividing by the standard deviation of each attribute. This procedure is known as z-‐score normalization.
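A minimal NumPy sketch of this z-score step (scikit-learn's StandardScaler performs the same transformation); the guard against zero-variance attributes is our addition:

```python
import numpy as np

def zscore(X):
    """Z-score normalization: zero mean and unit variance per attribute."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant attributes unscaled
    return (X - mu) / sigma
```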
2.4 Exploratory Factor Analysis
Exploratory factor analysis is a method for revealing the structure of data, working under the assumption that the covariance among attributes is explained by common underlying factors, or latent variables (Lewis-Beck, 1994).
Figure 1. Factor analysis model.
Figure 1 demonstrates how each observed data attribute, xi, is modeled as a random variable, for which each observation is a weighted sum of observations from a lower number of latent, random variables and one unknown random variable ui:
x_i = \sum_{j=1}^{k} b_{ij} F_j + d_i u_i

where b_{ij} is the factor loading between latent variable F_j and attribute x_i, and d_i is the noise component of x_i.
An observation of the data attributes is usually expressed in vector form, x, and the factor loadings in matrix form, B.
In this study, an initial solution to the factor loadings is found by obtaining the minimum residual solution through the Ordinary Least Squares method (Revelle, 2013). The initial solution is then rotated using the Varimax criterion, which maximizes the variance in the factor loadings for each factor (Lewis-Beck, 1994). In practice, this means that each factor will typically have loadings with high absolute values for a few attributes, and close-to-zero loadings for the rest.
The number of latent variables to use is decided by studying the amount of variance explained with different numbers of latent variables. This approach is discussed by Lewis-Beck (1994). In addition, a scree plot is used, in which the eigenvalues of the covariance matrix are used to estimate the number of latent variables in the data. In a scree plot, the eigenvalues are ranked in descending order, and the number of variables is estimated at the first point (if one can be found) where the descent between two eigenvalues can be considered small (Lewis-Beck, 1994; Ertel, 2005). This point is referred to as an "elbow point". A generic example is shown in Figure 2.
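The eigenvalue ranking behind the scree plot can be computed directly with NumPy. This is only a sketch of the selection step (the factor analysis itself is run with the R psych library, and the helper name here is ours):

```python
import numpy as np

def scree_eigenvalues(X):
    """Eigenvalues of the attributes' correlation matrix, ranked in
    descending order; plotting them against their rank gives the
    scree plot used to look for an elbow point."""
    R = np.corrcoef(X, rowvar=False)
    return np.sort(np.linalg.eigvalsh(R))[::-1]
```

Since the eigenvalues of a correlation matrix sum to the number of attributes, each value can also be read as that many "attributes' worth" of variance.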
Figure 2. Demonstration of an "elbow point". After the 4th value on the x-axis, the metric does not decrease much.
2.5 Cluster Analysis
2.5.1 Algorithms
Two clustering algorithms are used to perform cluster analysis: k-means and Gaussian Mixture Models (GMM). These are selected based on the criteria discussed in the Introduction chapter: interpretability, computational feasibility and predictive possibility.
K-means
K-means is a prototype-based model that seeks to minimize the sum of squared distances from each sample to its respective cluster center (Marsland, 2009). The approach is discussed further below. The algorithm works in three major steps (Zhu et al., 2011):
1. It first initializes a user-specified number of cluster centers (k) by assigning samples to centers (randomly or systematically).
2. It then updates the values of each cluster center to the mean value of its corresponding samples.
3. Next, each sample is reassigned to its closest cluster center.
Steps 2 and 3 are repeated until no sample changes assignment or until a fixed number of maximum iterations is reached.
The result is k cluster centers, with values in the sample space, and a label for each sample used in training. K-means is susceptible to local minima (Marsland, 2009). To mitigate this problem, the algorithm is run 30 times for each k with different initial assignments, and the best-scoring solution is selected for each k.
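A minimal sketch of this restart scheme with scikit-learn, the library used in the study (the wrapper function is ours): passing n_init=30 makes the library run 30 random initializations and keep the solution with the lowest sum of squared distances, exposed as inertia_.

```python
from sklearn.cluster import KMeans

def fit_kmeans(X, k, n_restarts=30, seed=0):
    """Fit k-means with `n_restarts` random initializations and keep
    the best-scoring (lowest-inertia) solution."""
    km = KMeans(n_clusters=k, n_init=n_restarts, random_state=seed)
    labels = km.fit_predict(X)
    return km.cluster_centers_, labels, km.inertia_
```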
GMM
GMM is a parametric method that estimates the parameters of a user-defined number (M) of normally distributed multivariate random variables under the assumption that each sample is an observation from one of the variables. The objective is to maximize the log-likelihood of the data under the model. GMM estimates three parameters for each variable (Marsland, 2009):
• μm, which represents the mean of the mth variable,
• Σm, which represents the covariance matrix of the mth variable, and
• αm, which is the weight or prior probability of the mth variable.
The parameters are estimated through Expectation Maximization which, similarly to k-means, involves three steps (Marsland, 2009):
1. Initialize the random variables, for example by setting μ_m to the values of randomly selected data points, setting Σ_m to the covariance matrix of the entire data set, and setting α_m to 1/M.
2. Calculate the posterior probabilities Z_{i,m} for each sample x_i under each random variable. This is called the E-step.

Z_{i,m} = p(m \mid x_i) = \frac{\alpha_m \mathcal{N}(x_i \mid \mu_m, \Sigma_m)}{\sum_{k=1}^{M} \alpha_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
3. Update μm, Σm and αm using Zi,m as a weight between each sample and variable. This is called the M-step.
a. $\boldsymbol{\mu}_m = \dfrac{\sum_i Z_{i,m}\, x_i}{\sum_i Z_{i,m}}$

b. $\boldsymbol{\Sigma}_m = \dfrac{\sum_i Z_{i,m}\,(x_i - \boldsymbol{\mu}_m)(x_i - \boldsymbol{\mu}_m)^{t}}{\sum_i Z_{i,m}}$

c. $\alpha_m = \dfrac{1}{N}\sum_i Z_{i,m}$
Steps 2 and 3 are repeated until the parameters no longer change or until a fixed number of iterations is reached. To avoid local maxima, the algorithm is run 5 times for each value of M, and the model with the highest log-likelihood score is selected for that value of M.
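The EM loop described above can be sketched as follows. This is a bare-bones NumPy version written for illustration (the helper names are ours, and a small regularization term is added to keep covariances invertible); in practice a library implementation would normally be preferred.

```python
import numpy as np

def gaussian_pdf(X, mu, cov):
    """Multivariate normal density evaluated for every row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def em_gmm(X, M, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, M, replace=False)]            # means: random data points
    cov = np.stack([np.cov(X.T) for _ in range(M)])    # initial shared covariance
    alpha = np.full(M, 1.0 / M)                        # uniform prior weights
    for _ in range(n_iter):
        # E-step: responsibility Z[i, m] of component m for sample i
        dens = np.stack([alpha[m] * gaussian_pdf(X, mu[m], cov[m])
                         for m in range(M)], axis=1)
        Z = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters with Z as sample weights
        Nm = Z.sum(axis=0)
        mu = (Z.T @ X) / Nm[:, None]
        for m in range(M):
            diff = X - mu[m]
            cov[m] = (Z[:, m, None] * diff).T @ diff / Nm[m] + 1e-6 * np.eye(d)
        alpha = Nm / n
    loglik = float(np.log(dens.sum(axis=1)).sum())
    return mu, cov, alpha, loglik
```

Running this several times with different seeds and keeping the highest log-likelihood mirrors the restart strategy described above.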
2.5.2 Model Selection
The algorithms above have certain parameters that need to be specified. The most central one is the number of clusters; selecting the number of clusters is a common problem in cluster analysis (Jiang, Ferreira & Gonzales, 2012; Zhu et al., 2011).
To determine a suitable number of clusters, the algorithms are run with different settings, and the resulting models are evaluated first against a set of numeric measures and then against a set of heuristic criteria.
The numeric measures used are the silhouette coefficient for both algorithms, the sum of squared error for k-means, and the Bayesian Information Criterion (BIC) for GMM. They are described below.
Silhouette Coefficient
The silhouette coefficient is a measure of proximity of a sample to other samples in the same cluster, compared to the proximity of the sample to other samples in the closest neighboring cluster.
It is defined for one sample $x_i$ as
$$s_i := \frac{b_i - a_i}{\max(a_i, b_i)}$$
where $a_i$ is the average distance between the ith sample and the other samples in the same cluster, and $b_i$ is the lowest average distance from the sample to the samples of any other cluster. The silhouette coefficient ranges between -1 and 1. A value close to 1 indicates that the sample is much closer to samples within its own cluster than to samples in other clusters; a value of 0 indicates that the sample lies right between two clusters; and a negative value means that the sample is closer to samples in a different cluster than to those in its own (Maechler, 2013).
To evaluate the models, the average silhouette coefficient for 3,000 randomly selected samples is used.
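As a concrete illustration of the definition, a naive O(n²) NumPy sketch (our own, not the implementation used in the study) could look as follows:

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average silhouette coefficient, computed naively from pairwise distances."""
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(axis=-1))   # pairwise distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                                        # exclude the sample itself
        a = D[i, same].mean()                                  # avg within-cluster distance
        b = min(D[i, labels == c].mean()                       # closest other cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return float(s.mean())
```

The quadratic cost of the pairwise-distance matrix is the reason the study evaluates the coefficient on a random subsample rather than the full dataset.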
Sum of Squared Error (SSE)
The objective function for k-means is defined as
$$SSE = \sum_{j=1}^{k} \sum_{x_i \in S_j} \left| x_i - \boldsymbol{\mu}_j \right|^2$$
where k is the number of clusters, N the sample size, $S_j$ denotes the jth cluster, $x_i$ the ith sample and $\boldsymbol{\mu}_j$ the cluster mean of cluster $S_j$. A low value of SSE indicates that all samples are close to their respective cluster centers. However, the measure does not account for the number of clusters: SSE reaches zero when k = N and each sample forms its own cluster. Thus, the objective should be to find a value of k beyond which increasing k does not lead to a large decrease in SSE (Kile & Uhlen, 2012). This is done by plotting SSE for different values of k and locating an elbow point, that is, a value of k after which SSE only decreases marginally (demonstrated in Figure 2).
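The elbow heuristic can be illustrated with a toy sketch (our own, using a plain k-means for brevity): SSE is computed for a range of k on data with three obvious clusters, and the per-step decrease in SSE collapses once k passes the true number of clusters.

```python
import numpy as np

def kmeans_sse(X, k, iters=50, seed=0):
    """Plain k-means initialized on random data points; returns the final SSE."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return float(((X - centers[labels]) ** 2).sum())

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 5, 10)])  # 3 clear clusters
sse = {k: min(kmeans_sse(X, k, seed=s) for s in range(10)) for k in range(1, 7)}
# elbow heuristic: beyond the true k, adding clusters barely reduces SSE
drops = {k: sse[k - 1] - sse[k] for k in range(2, 7)}
```

Plotting `sse` against k would show the sharp bend at k = 3 that the elbow method looks for.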
Bayesian Information Criterion (BIC)
Evaluating statistical models like GMM based only on the log-likelihood of the data under the model often favors the model with the highest dimensionality (Schwarz, 1978), with over-fitting and low generalizing ability as a result. Schwarz (1978) suggested the Bayesian Information Criterion, a measure for model evaluation that penalizes models with higher dimensionality. For GMM, dimensionality translates to means, covariance matrices and weights for each variable (3*M). It can be expressed as follows (in this form, it is optimized towards a low value):
$$BIC = -2 \ln L + k \ln N$$
where L is the likelihood of the data under the model, k is the dimensionality of the model and N is the sample size. The log-likelihood ln L is the sum, over the dataset, of the logarithm of the mixture model's probability density function.
Heuristic Criteria
In addition to the numeric measures, a number of heuristic criteria are established to assess the validity of a model. These are meant to reflect the overall objectives of the study:
1. Each of the resulting clusters should represent a significant portion of the samples (cardinality) and a significant portion of the users. Any cluster representing only a few sessions or a few users arguably falls outside the scope of identifying the most common behavior.
2. The number of clusters should not be too large to overview and grasp. The results of the study should be meaningful also to persons not involved in the study, and should not require excessive studying to overview.
3. The number of clusters should not be too small to have analytic value. The study should ideally provide a partition that is meaningful and interesting.
4. Each cluster should have “outstanding” (higher or lower than average) values for several attributes, since different types of behavior are expected to affect several attributes. A partition that captures extreme values for one attribute per cluster and shows average values for the other attributes is not considered meaningful.
5. Each cluster or prototype should differ from all other clusters in several attributes. A partition in which only one or two attributes differ between clusters is arguably too granular.
2.5.3 Model Evaluation
Generalizing Ability
In order to evaluate the extent to which the resulting models can be applied to different data, the models are used to classify different datasets. Measures similar to those of the model selection procedure are monitored to assess the generalizing ability of the models.
For k-means, the average SSE per sample and the average silhouette coefficient are measured. For GMM, the average log-likelihood per sample and the average silhouette coefficient are measured.
The models are evaluated using four separate datasets:
1. Different user group. The dataset is from the same time period and country, but with a different set of users.
2. Different user group and different time period. The data is collected during the four weeks prior to those of the training data: December 31st (2012) – January 27th (2013).
3. Different time period and country. The data is collected for the same time period as 2. but with users from Great Britain instead of Sweden.
4. Random data. The data is randomly generated. Each attribute is uniformly distributed in [0, μ + 2*σ], where μ is the mean of the attribute and σ is the standard deviation.
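The random dataset described in item 4 can be generated as sketched below (the helper name is ours):

```python
import numpy as np

def random_baseline(X, seed=0):
    """Uniform random data in [0, mu + 2*sigma] per attribute, mirroring item 4."""
    rng = np.random.default_rng(seed)
    upper = X.mean(axis=0) + 2 * X.std(axis=0)   # per-attribute upper bound
    return rng.uniform(0.0, upper, size=X.shape)
```

Such a structure-free dataset gives a floor for the evaluation measures: a model that scores similarly on real and random data has arguably not found meaningful clusters.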
Robustness
In order to test the robustness of the models, the training data is classified several times with different amounts of noise added. The noise is meant to represent variation that may occur naturally between different observations of the same use case. The same measures are used as in the generalization tests.
The added noise is normally distributed around zero, with varying variance. For each trial, the relative portion of samples assigned the same label as in the original data is measured.
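The robustness check can be sketched as follows (our own function name; `centers` stands in for a previously fitted k-means model):

```python
import numpy as np

def label_agreement(X, centers, sigma, trials=20, seed=0):
    """Fraction of samples keeping their nearest-center label after adding
    zero-mean Gaussian noise with standard deviation `sigma`."""
    rng = np.random.default_rng(seed)
    base = np.argmin(((X[:, None] - centers) ** 2).sum(axis=-1), axis=1)
    agree = []
    for _ in range(trials):
        noisy = X + rng.normal(0.0, sigma, size=X.shape)
        new = np.argmin(((noisy[:, None] - centers) ** 2).sum(axis=-1), axis=1)
        agree.append((new == base).mean())
    return float(np.mean(agree))
```

As the noise grows relative to the separation between cluster centers, the agreement drops, which is exactly the degradation the robustness test monitors.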
3 Results
3.1 Factor Analysis
This section covers the results of the factor analysis.
3.1.1 Factor Analysis Model Selection
To select an appropriate number of latent variables, several factor analysis models are trained and evaluated with a scree plot and by studying the amount of variance explained.
The scree plot in Figure 3 is derived from the Pearson correlation coefficient matrix of the normalized data set of 21 attributes and 138,920 samples. It shows the eigenvalues of the correlation coefficient matrix in descending order. The sizes of the eigenvalues indicate that 4 or 5 components would be a suitable number of latent variables. Figure 4 also shows that the amount of variance explained only increases marginally with 6 or more components. Based on the scree plot and the increase in explained variance between 4 and 5 components, the factor analysis model with 5 components is selected.
Figure 3. Scree plot. Eigenvalues of the correlation coefficient matrix of the data in decreasing order.