
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Customer segmentation of retail chain customers using cluster analysis

SEBASTIAN BERGSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES



Degree Projects in Mathematical Statistics (30 ECTS credits)
Master's Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology year 2019

Supervisor at Advectas AB: Pehr Wessmark
Supervisor at KTH: Tatjana Pavlenko
Examiner at KTH: Tatjana Pavlenko


TRITA-SCI-GRU 2019:092 MAT-E 2019:48

KTH Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

In this thesis, cluster analysis was applied to data comprising customer spending habits at a retail chain in order to perform customer segmentation. The method used was a two-step cluster procedure in which the first step consisted of feature engineering, a square root transformation of the data in order to handle big spenders in the data set and finally principal component analysis in order to reduce the dimensionality of the data set. This was done to reduce the effects of high dimensionality. The second step consisted of applying clustering algorithms to the transformed data. The methods used were K-means clustering, Gaussian mixture models in the MCLUST family, t-distributed mixture models in the tEIGEN family and non-negative matrix factorization (NMF). For the NMF clustering a slightly different data pre-processing step was taken; specifically, no PCA was performed. Clustering partitions were compared on the basis of the Silhouette index, the Davies-Bouldin index and subject matter knowledge, which revealed that K-means clustering with K = 3 produces the most reasonable clusters. This algorithm was able to separate the customers into different segments depending on how many purchases they made overall, and within these clusters some minor differences in spending habits are also evident. In other words, there is some support for the claim that the customer segments have some variation in their spending habits.

Keywords: Cluster analysis, customer segmentation, tEIGEN, MCLUST, K-means, NMF, Silhouette, Davies-Bouldin, big spenders


Swedish title: Kundsegmentering av detaljhandelskunder med klusteranalys

Sammanfattning (Swedish abstract)

In this thesis, cluster analysis has been applied to data consisting of customers' consumption habits at a retail chain in order to perform customer segmentation. The method used consisted of a two-step clustering procedure, where the first step consisted of creating variables, applying a square root transformation of the data in order to handle customers who spend far more than the average, and finally principal component analysis in order to reduce the dimension of the data. This was done to mitigate the effects of using a high-dimensional data set. The second step consisted of applying clustering algorithms to the transformed data. The methods used were K-means clustering, Gaussian mixture models in the MCLUST family, t-distributed mixture models from the tEIGEN family and non-negative matrix factorization (NMF). For clustering with NMF a different pre-processing of the data was used; more specifically, no PCA was performed. Cluster partitions were compared based on silhouette values, the Davies-Bouldin index and subject matter knowledge, which revealed that K-means clustering with K = 3 produces the most reasonable results. This algorithm managed to separate the customers into different segments depending on how many purchases they have made overall, and within these segments there are some differences in consumption habits.

In other words, there is some support for the claim that the customer segments have some variation in their consumption habits.

Keywords (Swedish): Klusteranalys, kundsegmentering, tEIGEN, MCLUST, K-means, NMF, Silhouette, Davies-Bouldin, big spenders


Acknowledgements

I would first and foremost like to thank my supervisors Pehr Wessmark at Advectas and Tatjana Pavlenko at KTH for all the help during the time working on this thesis. The discussions and feedback were immensely helpful and I could not have completed the thesis without them. I also want to thank Christopher Madsen for his peer review of my project, which helped shape the report into its current form.


Contents

1 Introduction and motivation 1

1.1 Customer segmentation and cluster analysis . . . 1

1.2 Problem formulation and data description . . . 1

1.3 Previous research . . . 1

2 Theory 2

2.1 Clustering algorithms used in this thesis . . . 2

2.1.1 K-means clustering . . . 2

2.1.2 Mixture models for clustering . . . 3

2.1.3 Gaussian mixture models (MCLUST) . . . 3

2.1.4 t-distributed mixture models (tEIGEN) . . . 5

2.1.5 Non-negative matrix factorization (NMF) . . . 7

2.2 Cluster validation indices . . . 9

2.2.1 Choice of cluster validity index (CVI) . . . 9

2.2.2 The silhouette index . . . 10

2.2.3 The Davies-Bouldin index . . . 11

2.3 Principal Component Analysis (PCA) . . . 12

2.4 High-dimensional data and dimensionality reduction . . . 14

2.5 t-distributed Stochastic Neighbor Embedding (t-SNE) . . . 16

3 Case study and methods 18

3.1 Outline of method . . . 18

3.2 Exploratory data analysis and feature engineering . . . 18

3.2.1 Feature engineering . . . 18

3.2.2 Exploratory data analysis . . . 18

3.3 Data pre-processing and dimensionality reduction . . . 21

3.4 Visualization . . . 22

3.5 Optimizing individual algorithms . . . 25

3.6 Software and hardware used . . . 25

3.6.1 Hardware used . . . 25

3.6.2 Software used . . . 25

4 Results 26

4.1 K-means . . . 26

4.2 MCLUST . . . 26

4.3 tEIGEN . . . 28

4.4 NMF . . . 29

4.5 Choice of algorithm . . . 29

5 Analysis 32

5.1 Analysis of distributions in clusters . . . 32

6 Discussion 36

6.1 Summary . . . 36

6.2 Reflection . . . 36

6.3 Future work . . . 36

A Silhouette plots 38

A.1 K-means clustering . . . 38

A.2 MCLUST models . . . 38

A.3 tEIGEN models . . . 39

A.4 NMF . . . 41


List of Figures

2.1 Illustration of MCLUST models in 2D . . . 5

3.1 Illustration of the sparsity in the data set . . . 19

3.2 Correlations between the product groups . . . 20

3.3 Histograms showing the distribution of products in a given level 1 product group . . . 20

3.4 Histograms showing the distribution of products in a given level 1 product group . . . 21

3.5 Box plots that illustrate the presence of big spenders . . . 21

3.6 Illustration of silhouette values in the presence of outliers for K-means clustering with K = 2 . . . 22

3.7 Result of running t-SNE with standard settings . . . 23

3.8 Result of running t-SNE with standard settings, now colored according to x3 . . . 23

3.9 Result of running t-SNE with standard settings, now colored according to the total amount of purchases made. 3.9a) Result of one run. 3.9b) Result of another run. . . . 24

3.10 Result of running t-SNE with standard settings, colored according to the total amount of purchases made in x3 . . . 24

3.11 Results of running t-SNE with Linderman and Steinerberger's settings, colored according to the total amount of purchases. 3.11a) Result of one run. 3.11b) Result of another run. . . . 25

4.1 Silhouette plots for K = 3 and K = 7 in K-means. 4.1a) K = 3. 4.1b) K = 7 . . . 26

4.2 Silhouette plots for different number of components K of the best MCLUST model EII. 4.2a) K = 2. 4.2b) K = 3. 4.2c) K = 4. . . . 27

4.3 Silhouette plots for different number of components K of the best tEIGEN model. 4.3a) K = 2. 4.3b) K = 3. 4.3c) K = 4. . . 28

4.4 Silhouette plot obtained when using 3 clusters in NMF . . . 29

4.5 Density plots of the principal components in cluster 0 . . . 30

4.6 Density plots of the principal components in cluster 1 . . . 31

4.7 Density plots of the principal components in cluster 2 . . . 31

5.1 Density plots of x3 and x4 in the different clusters. 5.1a) Density function of x3 within the three different clusters. 5.1b) Density function of x4 within the three different clusters . . . 32

5.2 Boxplots of x3 and x4 in the different clusters. 5.2a) Boxplot of x4 within the three different clusters. 5.2b) Boxplot of x3 within the three different clusters . . . 32

5.3 Radar charts showing the mean percentages spent within each product category for the three clusters. 5.3a) Cluster 0. 5.3b) Cluster 1. 5.3c) Cluster 2. . . 33

5.4 Radar chart showing the mean percentage spent within each product category (excluding x3 for the three clusters) . . . 33

5.5 Variances of the percentages in cluster 0 . . . 34

5.6 Variances of the percentages in cluster 1 . . . 34

5.7 Variances of the percentages in cluster 2 . . . 34

5.8 Density plots of percentages spent in x3, x4, x5 and x8 in the three clusters. 5.8a) x3 percentages. 5.8b) x4 percentages. 5.8c) x5 percentages. 5.8d) x8 percentages. . . 35

A.1 Silhouette plots for K-means clustering with different K. A.1a) K = 2. A.1b) K = 4. A.1c) K = 5. A.1d) K = 6. . . 38

A.2 Silhouette plots for different number of components K for the model EEE. A.2a) K = 3. A.2b) K = 4. . . 38

A.3 Silhouette plots for different number of components K for the model EEI. A.3a) K = 3. A.3b) K = 4. . . 38

A.4 Silhouette plots for different number of components K of the model EEV. A.4a) K = 2. A.4b) K = 3. A.4c) K = 4. . . 39

A.5 Silhouette plots for different number of components K of the model VEI. A.5a) K = 2. A.5b) K = 3. A.5c) K = 4. . . . 39

A.6 Silhouette plots for different number of components K of the model VII. A.6a) K = 2. A.6b) K = 3. A.6c) K = 4. . . 39

A.7 Silhouette plots for different number of components K for the model CICC. A.7a) K = 2. A.7b) K = 3. A.7c) K = 4. . . 39

A.8 Silhouette plots for different number of components K for the model CIIC. A.8a) K = 2. A.8b) K = 3. A.8c) K = 4. . . 40

A.9 Silhouette plots for different number of components K for the model CIIU. A.9a) K = 2. A.9b) K = 3. A.9c) K = 4. . . 40

A.10 Silhouette plots for different number of components K for the model UIIC. A.10a) K = 2. A.10b) K = 3. A.10c) K = 4. . . 40

A.11 Silhouette plots for different number of components K for the model UIIU. A.11a) K = 2. A.11b) K = 3. A.11c) K = 4. . . 40


A.12 Silhouette plots for different number of components K for the model UUCU. A.12a) K = 2. A.12b) K = 3. . . . 41

A.13 Silhouette plot obtained when using 2 clusters in NMF . . . 41

A.14 Silhouette plot obtained when using 4 clusters in NMF . . . 41

A.15 Silhouette plot obtained when using 5 clusters in NMF . . . 41

A.16 Silhouette plot obtained when using 6 clusters in NMF . . . 41

A.17 Silhouette plot obtained when using 7 clusters in NMF . . . 42


List of Tables

2.1 Models available in Mclust . . . 5

2.2 Models in the tEIGEN-family . . . 6

3.1 Class assignments based on x3 . . . 23

3.2 Class assignments based on the total number of purchases . . . 24

4.1 Average silhouette scores for different K in K-means clustering . . . 26

4.2 Davies-Bouldin scores for different K in K-means clustering. . . 26

4.3 MCLUST models that converged . . . 27

4.4 Average silhouette values for the different MCLUST models with varying number of components . . . 27

4.5 Davies-Bouldin scores for the different MCLUST models with varying number of components . . . 27

4.6 Models in the tEIGEN family that converged . . . 28

4.7 Average silhouette values for the different tEIGEN models with varying number of components . . . 28

4.8 Davies-Bouldin scores for the different tEIGEN models with varying number of components . . . 28

4.9 Average silhouette values when clustering using NMF . . . 29

4.10 Davies-Bouldin values when clustering using NMF . . . 29


1 Introduction and motivation

1.1 Customer segmentation and cluster analysis

Customer segmentation is the process of a company dividing its customer base into different groups based on some shared characteristics. This is done so the different groups can be analyzed and marketing efforts can be tailored toward these groups based on their preferences in order to improve sales and customer relations.

Cluster analysis is a branch of mathematics whose purpose is to group similar data points into subsets called clusters. The partition of data points should be done such that data points in the same cluster are more similar to one another than ones in different clusters [25, p.501]. A central aspect of cluster analysis is the concept of proximity between data points, which is usually described using an N × N (with N being the number of data points) dissimilarity matrix D whose elements d_{ij} contain the distance between the ith and jth observation [25, p.503]. Assuming we have access to a data matrix

X = \{x_{ij}\}_{1 \le i \le n, \, 1 \le j \le p}   (1.1)

where n is the number of data points and p is the dimension of each point, we define a dissimilarity d_j(x_{ij}, x_{i'j}) between the values of the jth feature in our data [25, p.503]. With the help of this we nearly always define a dissimilarity between two data points x_i, x_{i'} (i.e. rows in X) as

D(x_i, x_{i'}) = \sum_{j=1}^{p} w_j \, d_j(x_{ij}, x_{i'j}), \qquad \sum_{j=1}^{p} w_j = 1, \quad [25, p.505]   (1.2)

A common choice is d_j(x_{ij}, x_{i'j}) = (x_{ij} - x_{i'j})^2, which is suitable for numeric data [25, p.503]. These dissimilarities will be used in the clustering algorithms to produce clusters of similar data points. Details for different clustering algorithms will be presented in other sections of this report.
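As a small illustration (a sketch written for this report, not code taken from the thesis), the weighted dissimilarity in Equation 1.2 with the squared-difference choice of d_j can be computed as follows; the equal default weights are an assumption made here.

```python
# Minimal sketch of the weighted dissimilarity in Equation 1.2 with
# d_j(x_ij, x_i'j) = (x_ij - x_i'j)^2 and weights that sum to one.
import numpy as np

def dissimilarity(x_i, x_iprime, weights=None):
    x_i = np.asarray(x_i, dtype=float)
    x_iprime = np.asarray(x_iprime, dtype=float)
    p = x_i.shape[0]
    weights = np.full(p, 1.0 / p) if weights is None else np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0)            # the constraint in Equation 1.2
    return float(np.sum(weights * (x_i - x_iprime) ** 2))

print(dissimilarity([1.0, 2.0, 0.0], [0.0, 2.0, 3.0]))  # (1 + 0 + 9) / 3 = 3.33...
```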

Cluster analysis can be applied to customer segmentation since customers can be clustered based on the patterns in the data. In other words, customer segmentation can be done on the basis of collected data rather than preconceptions about what customer segments exist and how they differ from one another.

1.2 Problem formulation and data description

The data used in this thesis has been provided by a retail client of Advectas AB. The original data consists of transactions that can be linked to individual customers and must be aggregated for each customer before clustering can be done. This part of the project entails feature engineering; the total number of purchases in different product groups will be used to profile the customers. Due to the proprietary nature of the data, the client name will not be revealed and variable names are anonymized. A subset of the customers (the most interesting ones from the client's perspective) was used in this analysis. In total 74 836 customers were segmented based on purchases during a set time period.

1.3 Previous research

Cluster analysis has been used extensively for customer segmentation. Examples include Zakrzewska and Murlewski, who used K-means clustering and DBSCAN for customer segmentation in banking [53], the usage of mixture models for shopping market customers by Gilbert et al. [22], and other examples [46]. While K-means clustering and hierarchical clustering have been used extensively in customer segmentation, other methods are less common; one such example is the tEIGEN family of mixture models for clustering. Due to the scarcity of research articles which utilize this algorithm for customer segmentation, it may be of interest to examine it and compare it to more commonly used algorithms.


2 Theory

2.1 Clustering algorithms used in this thesis

2.1.1 K-means clustering

K-means clustering is a clustering algorithm in which the data points are segmented into K ∈ Z non-overlapping clusters where K is specified by the user [25, p.507] and all the columns in the data matrix X (i.e. the variables) are quantitative. Strategies for choosing a suitable K will be discussed in Section 2.2. In this case the squared Euclidean distance is the dissimilarity measure, i.e.

d(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = \|x_i - x_{i'}\|^2, [25, p.509].   (2.1)

The goal of K-means clustering is to minimize the function W(C), defined in Equation 2.2. Here, C is a many-to-one mapping that assigns observations to clusters, i.e. the clustering result of some algorithm, \bar{x}_j = (\bar{x}_{1j}, ..., \bar{x}_{pj}) is the mean vector of cluster j, and N_j = \sum_{i=1}^{N} I\{C(i) = j\}, where I\{C(i) = j\} is the indicator function. In other words, W(C) is minimized by assigning the observations to clusters such that the average dissimilarity from the cluster mean within each cluster is minimized [25, p.509].

W(C) = \frac{1}{2} \sum_{j=1}^{K} \sum_{C(i)=j} \sum_{C(i')=j} d(x_i, x_{i'}) = \sum_{j=1}^{K} N_j \sum_{C(i)=j} \|x_i - \bar{x}_j\|^2   (2.2)

W(C) can be called the 'within cluster' point scatter. Minimizing it is a combinatorial optimization problem which is computationally intractable for all but very small data sets. It can be shown that the number of different possible cluster assignments of N data points into K clusters is given by \frac{1}{K!} \sum_{j=1}^{K} (-1)^{K-j} \binom{K}{j} j^N, which grows rapidly in its arguments [25, p.508]. For this reason, only a small subset of different cluster assignments is examined in K-means clustering. These strategies are built on iterative greedy descent. In this thesis a combination of Lloyd's algorithm (which is commonly referred to as K-means clustering) and K-means++ initialization has been used. K-means++ was proposed by Arthur and Vassilvitskii as a way to select the initial cluster centres in K-means clustering [8] and is presented in Algorithm 1. First we explain the notation used: χ ⊂ R^d is the data set, which consists of n data points, K is the number of clusters, C = (c_1, ..., c_K) is the set of cluster centers, and D(x) is the shortest distance from a data point x to the closest center already chosen.

Algorithm 1: Lloyd's algorithm with K-means++ initialization
Input: Number of clusters K, data set χ
Output: Partition of the data set into K clusters

K-means++ initialization
1. Randomly choose a center c_1 from χ.
2. Take a new center c_i, choosing data point x ∈ χ with probability D(x)^2 / \sum_{x ∈ χ} D(x)^2.
3. Repeat the previous step until K centers have been chosen.

Lloyd's algorithm
4. For each i ∈ {1, ..., K}, set cluster C_i to be the set of points in χ that are closer to c_i than to c_j for all j ≠ i, i.e. C_i = {x ∈ χ : ||x − c_i||^2 ≤ ||x − c_j||^2 ∀ j ≠ i}.
5. For each i ∈ {1, ..., K}, set c_i = (1/|C_i|) \sum_{x ∈ C_i} x, where |C_i| is the number of elements in cluster C_i.
6. Repeat steps 4 and 5 until C no longer changes.

One problem to consider is that although the algorithm converges to a minimum, it may be a suboptimal local minimum. The algorithm may also be sensitive to the initial cluster assignments, so it is recommended to run the algorithm multiple times with different initial assignments and choose the solution which gives the best results [25, p.510]. This is what we refer to as the "hard" K-means algorithm, which gives a definitive assignment for each data point. We could instead use a "soft" clustering algorithm; this concept is introduced in Section 2.1.2.
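As a concrete sketch (an assumption made for illustration; the exact software setup of the thesis is not reproduced here), scikit-learn's KMeans combines the k-means++ seeding of Algorithm 1 with Lloyd iterations and reruns the algorithm from several initializations:

```python
# Sketch of K-means with k-means++ seeding and multiple restarts (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))      # placeholder for the pre-processed customer data

kmeans = KMeans(n_clusters=3,       # K = 3 was the value ultimately chosen in the thesis
                init="k-means++",   # seeding step of Algorithm 1
                n_init=10,          # rerun with different initializations, keep the best
                random_state=0)
labels = kmeans.fit_predict(X)      # hard cluster assignment for each data point
print(kmeans.inertia_)              # within-cluster sum of squared distances to centroids
```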


2.1.2 Mixture models for clustering

Clustering can also be done using a statistical model for the population we sample from. The model assumes the population consists of sub-populations, i.e. clusters, in which our data set's features follow different multivariate probability distributions. The population as a whole will then follow a so-called mixture density distribution.

When using a mixture model, clustering becomes a question of estimating the mixture’s parameters, calculating posterior cluster probabilities using these parameters and assigning cluster membership based on these [20, p.143]. We may express this mathematically as follows. Finite mixture models have probability densities of the form

f(x; p, θ) = \sum_{j=1}^{K} p_j g_j(x; θ_j)   (2.3)

Here x is a p-dimensional random variable, p^T = (p_1, ..., p_{K-1}) is the vector of mixing proportions, where \sum_{j=1}^{K} p_j = 1, K is the assumed number of clusters and the g_j are the component densities parameterized by their respective θ_j. Assuming we have estimated the mixing distribution's parameters, probabilities for cluster membership can be estimated using the posterior probability in Equation 2.4.

P(\text{cluster } j \mid x_i) = \frac{\hat{p}_j g_j(x_i; \hat{θ}_j)}{f(x_i; \hat{p}, \hat{θ})}, \quad j = 1, 2, ..., K   (2.4)

Parameters are estimated by maximum-likelihood estimation [20, p.144-145]; specifically, the EM-algorithm is used in the case of unobserved cluster labels. The description of the algorithm is largely based on the one provided by McLachlan and Krishnan [39, p.19-20], albeit with slightly changed notation to be consistent with the notation in Everitt [20].

Imagine we can observe realizations of the random variable X with the density in Equation 2.3 but not the corresponding cluster labels Y, and let Z = (X, Y) be the complete data vector. Letting g_c(z; Θ) denote the density of Z, the complete-data log-likelihood function is given by log(L_c(Θ)) = log(g_c(z; Θ)). The algorithm proceeds as in Algorithm 2.

Algorithm 2: EM-algorithm
Input: Starting values for the parameter estimates Θ^(0)
Output: Parameter estimates Θ^(k)

1. Initialize Θ = Θ^(0).
2. For k = 1, 2, ...:
(a) E-step: Compute Q(Θ; Θ^(k−1)) = E_{Θ^(k−1)}[log(L_c(Θ)) | X]. Here Θ^(k−1) denotes the estimate of Θ in the (k−1)th iteration.
(b) M-step: Choose Θ^(k) as any θ such that Q(θ; Θ^(k−1)) is maximized, i.e. Q(Θ^(k); Θ^(k−1)) ≥ Q(θ; Θ^(k−1)) for all θ.

The E- and M-steps are repeated until the log-likelihood function converges, i.e. until L(Θ^(k)) − L(Θ^(k−1)) changes by an arbitrarily small amount.

2.1.3 Gaussian mixture models (MCLUST)

Mathematically, a Gaussian mixture model (GMM) can be described as a mixture model in which the components follow Gaussian distributions. The model assumed is given by Equation 2.5, in which {p_k} are the mixing proportions, K is the assumed number of components and Φ(x | μ, Σ) is the density of a multivariate normal random variable following a N(μ, Σ)-distribution [17].

f(x) = \sum_{k=1}^{K} p_k Φ(x | μ_k, Σ_k)   (2.5)

The GMMs used in this thesis come from the MCLUST family, in which the number of clusters and covariance structures can vary [45]. In the MCLUST setting the covariance matrix Σ_k can be parameterized in terms of its eigenvalue decomposition as


Σ_k = D_k Λ_k D_k^T = λ_k D_k A_k D_k^T   (2.6)

In this decomposition D_k is the matrix of eigenvectors of Σ_k and Λ_k is a diagonal matrix with Σ_k's eigenvalues on the diagonal. Λ_k can be expressed as Λ_k = λ_k A_k, where λ_k is the first eigenvalue of Σ_k and A_k = diag(α_{1k}, ..., α_{pk}) with 1 = α_{1k} ≥ α_{2k} ≥ ... ≥ α_{pk} > 0. We may interpret the decomposition as follows: D_k determines the orientation of cluster k, λ_k determines the volume of the cluster and A_k determines its shape; see Figure 2.1 for examples in 2D. By placing different constraints on Σ_k we can obtain different models; these are listed in Table 2.1.

In mclust, clustering is done slightly differently compared to the method described in Section 2.1.2. Clustering is done by the classification maximum likelihood procedure, which aims to find θ, γ such that L(θ, γ) is maximized, where

L(θ, γ) = \prod_{i=1}^{n} φ_{γ_i}(x_i; θ_{γ_i}).

In this case, γ^T = (γ_1, ..., γ_n) where γ_i = j if x_i comes from cluster j, i.e. γ is the vector containing the cluster memberships and is treated as an unknown parameter [20]. Also, φ_j refers to the density function of the jth component of the data; in this case it refers to the density function of a N(μ_j, Σ_j)-distribution.

Specifics for parameter estimation by the EM-algorithm are presented by Celeux and Govaert [17]. The relevant log-likelihood function is given in Equation 2.7, in which Φ(x | μ, Σ) denotes the density of a Gaussian N(μ, Σ)-distribution and P_k denotes the set of observations assigned to cluster k.

L(θ, z_1, ..., z_n | x_1, ..., x_n) = \sum_{k=1}^{K} \sum_{x_i ∈ P_k} \ln(p_k Φ(x_i | μ_k, Σ_k))   (2.7)

The variant of the EM-algorithm used is called the CEM algorithm and can briefly be described as follows. An initial partition of the data is made, then the conditional probabilities t_k(x_i) (see Equation 2.9) are computed (E-step). These are used to assign each x_i to the cluster with the largest current t_k(x_i) (C-step). Then the parameters \hat{p}_k, \hat{μ}_k and \hat{Σ}_k are updated in the M-step, which consists of maximizing F in Equation 2.8. In this equation we need the notion of a classification matrix c = {c_{ik}}_{1 ≤ i ≤ n, 1 ≤ k ≤ K} where c_{ik} ∈ {0, 1}; c defines a partition in our case.

F(θ | x_1, ..., x_n, c) = \sum_{k=1}^{K} \sum_{i=1}^{n} c_{ik} \ln(p_k Φ(x_i | μ_k, Σ_k))   (2.8)

t_k(x_i) = \frac{\hat{p}_k Φ(x_i | \hat{μ}_k, \hat{Σ}_k)}{\sum_{l=1}^{K} \hat{p}_l Φ(x_i | \hat{μ}_l, \hat{Σ}_l)}   (2.9)

Using n_k = \sum_{i=1}^{n} c_{ik}, the updating formulas for the mixing proportions and mean vectors are given in Equations 2.10 and 2.11, respectively.

\hat{p}_k = \frac{n_k}{n}   (2.10)

\hat{μ}_k = \bar{x}_k = \frac{\sum_{i=1}^{n} c_{ik} x_i}{n_k}   (2.11)

The updating formula for the covariance matrix varies from model to model. In some cases a closed-form solution exists, while in other cases an iterative procedure must be used. Presenting details on how all of these are updated is outside the scope of this thesis, and the interested reader is referred to Celeux and Govaert [17]. Browne and McNicholas present updates for the specific models EVE and VVE [14]. Unless explicitly stated otherwise, any descriptions of the parameter estimation procedure in this thesis come from Celeux and Govaert, according to the authors of the mclust software. The tolerance for likelihood convergence in the mclust software was 10^{-5}.


Model   Σ_k                 Distribution   Volume     Shape      Orientation
EII     λI                  Spherical      Equal      Equal      -
VII     λ_k I               Spherical      Variable   Equal      -
EEI     λA                  Diagonal       Equal      Equal      Coordinate axes
VEI     λ_k A               Diagonal       Variable   Equal      Coordinate axes
EVI     λA_k                Diagonal       Equal      Variable   Coordinate axes
VVI     λ_k A_k             Diagonal       Variable   Variable   Coordinate axes
EEE     λDAD^T              Ellipsoidal    Equal      Equal      Equal
EVE     λDA_k D^T           Ellipsoidal    Equal      Variable   Equal
VEE     λ_k DAD^T           Ellipsoidal    Variable   Equal      Equal
VVE     λ_k DA_k D^T        Ellipsoidal    Variable   Variable   Equal
EEV     λD_k AD_k^T         Ellipsoidal    Equal      Equal      Variable
VEV     λ_k D_k AD_k^T      Ellipsoidal    Variable   Equal      Variable
EVV     λD_k A_k D_k^T      Ellipsoidal    Equal      Variable   Variable
VVV     λ_k D_k A_k D_k^T   Ellipsoidal    Variable   Variable   Variable

Table 2.1: Models available in Mclust

Figure 2.1: Illustration of MCLUST models in 2D

Some special notice has to be taken of potential pitfalls when using the EM-algorithm for clustering, specifically convergence problems and singularities in the likelihood function. The EM-algorithm may get stuck in local maxima rather than global ones; to combat this, the algorithm is often run repeatedly using different starting values. In mclust and teigen these values are based on solutions from K-means clustering by default, which has been done in this project. In some cases the likelihood function may become infinite, which is due to the number of parameters to estimate being large in comparison to the number of data points. This may happen in models with unconstrained variances or many components. Two methods are possible for combating this problem: constraining the variances or attempting a Bayesian approach [20]. In this thesis the covariance matrices were constrained.
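To make the model family concrete, the sketch below fits Gaussian mixtures with scikit-learn; this is an assumption made for illustration (the thesis itself relies on the R package mclust), and scikit-learn's covariance_type options only correspond roughly to a subset of the MCLUST models. Model selection here uses BIC for brevity, whereas the thesis compares partitions with the Silhouette and Davies-Bouldin indices.

```python
# Hypothetical sketch: Gaussian mixture clustering with scikit-learn.
# covariance_type is only a rough analogue of the MCLUST constraints:
# 'spherical' ~ VII, 'diag' ~ VVI, 'tied' ~ EEE, 'full' ~ VVV.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))              # placeholder for the transformed customer data

best = None
for cov_type in ["spherical", "diag", "tied", "full"]:
    for k in range(2, 5):
        gmm = GaussianMixture(n_components=k, covariance_type=cov_type,
                              n_init=5, random_state=0).fit(X)
        bic = gmm.bic(X)                   # one possible model-selection criterion
        if best is None or bic < best[0]:
            best = (bic, cov_type, k, gmm)

_, cov_type, k, gmm = best
labels = gmm.predict(X)                    # hard assignment by maximum posterior probability
posteriors = gmm.predict_proba(X)          # posterior cluster probabilities, cf. Equation 2.4
```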

2.1.4 t-distributed mixture models (tEIGEN)

An alternative to Gaussian mixtures is to use mixture models in which the components follow a multivariate t-distribution. Different models and software packages are available for this; this thesis has used the tEIGEN family of models, which is described in Equation 2.12. The different mixture models in the family take the form

g(x | ϑ) = \sum_{k=1}^{K} π_k f_t(x | μ_k, Σ_k, ν_k)   (2.12)

The π_k are the mixing proportions, K is the number of components and f_t(x | μ_k, Σ_k, ν_k) is the density function of a random variable following a p-dimensional t-distribution with expected value vector μ_k, covariance matrix Σ_k and degrees of freedom ν_k. In the tEIGEN model family, Σ_k is decomposed using the eigen-decomposition as in MCLUST, i.e. Σ_k = λ_k D_k A_k D_k^T. Here λ_k, D_k and A_k have the same meaning as in MCLUST. The same constraints on Σ_k can be considered as for the MCLUST models, and an additional one is that the degrees of freedom across groups can be constrained [6]. For a complete list of the models available in the software package teigen, see Table 2.2. In the column for ν_k, the entries may be either "C" (for "constrained") or "U" (for "unconstrained"). If ν_k is constrained, then it will be equal for all the clusters. If it is not constrained, it may vary between clusters.

Model   Σ_k                 ν_k
CIIC    λI                  C
CIIU    λI                  U
UIIC    λ_k I               C
UIIU    λ_k I               U
CICC    λA                  C
CICU    λA                  U
UICC    λ_k A               C
UICU    λ_k A               U
CIUC    λA_k                C
CIUU    λA_k                U
UIUC    λ_k A_k             C
UIUU    λ_k A_k             U
CCCC    λDAD^T              C
CCCU    λDAD^T              U
UCCC    λ_k DAD^T           C
UCCU    λ_k DAD^T           U
CUCC    λD_k AD_k^T         C
CUCU    λD_k AD_k^T         U
UUCC    λ_k D_k AD_k^T      C
UUCU    λ_k D_k AD_k^T      U
CCUC    λDA_k D^T           C
CCUU    λDA_k D^T           U
CUUC    λD_k A_k D_k^T      C
CUUU    λD_k A_k D_k^T      U
UCUC    λ_k DA_k D^T        C
UCUU    λ_k DA_k D^T        U
UUUC    λ_k D_k A_k D_k^T   C
UUUU    λ_k D_k A_k D_k^T   U

Table 2.2: Models in the tEIGEN family

In order to assign cluster labels to data points the model parameters must be estimated, which is done using a variant of the EM-algorithm called the multicycle ECM algorithm. The procedure is similar to the one described by Celeux and Govaert [17]. In the words of the authors, parameter estimation is done as follows.

For the general tEIGEN model the complete-data log-likelihood is given by Equation 2.13, in which z_{ik} = 1 if x_i belongs to cluster k and 0 otherwise.

l_c(ϑ) = \sum_{k=1}^{K} \sum_{i=1}^{n} z_{ik} \log[π_k \, γ(u_{ik} | ν_k/2, ν_k/2) \, Φ(x_i | μ_k, Σ_k/u_{ik})]   (2.13)

Here, γ(y | α, β) = \frac{β^α y^{α−1}}{Γ(α)} \exp(−βy) I\{y > 0\}, where I is the indicator function, for some α, β > 0. Φ is the density function of a multivariate Gaussian with mean μ and covariance matrix Σ. In each E-step of the algorithm the z_{ik} (cluster membership indicators) and u_{ik} (characteristic weights) are updated by their conditional expected values

\hat{z}_{ik} = \frac{π_k f_t(x_i | μ_k, Σ_k, ν_k)}{\sum_{h=1}^{K} π_h f_t(x_i | μ_h, Σ_h, ν_h)}   (2.14)

\hat{u}_{ik} = \frac{ν_k + p}{ν_k + σ(x_i, μ_k | Σ_k)}   (2.15)

In Equation 2.15, σ(x_i, μ_k | Σ_k) is the squared Mahalanobis distance between x_i and μ_k. The CM step consists of two steps, one for updating mixing proportions, means and degrees of freedom and another for updating covariance matrices. Mixing proportions and means are updated as described by Andrews and McNicholas [6].

\hat{π}_k = \frac{n_k}{n}, \qquad \hat{μ}_k = \frac{\sum_{i=1}^{n} \hat{z}_{ik} \hat{u}_{ik} x_i}{\sum_{i=1}^{n} \hat{z}_{ik} \hat{u}_{ik}}, \qquad n_k = \sum_{i=1}^{n} \hat{z}_{ik}   (2.16)

Updating the degrees of freedom deserves some special attention. Depending on whether they are constrained or not, different approximations are used for the update [5]. In the case of constrained degrees of freedom, let

k = −1 − \frac{1}{n} \sum_{g=1}^{G} \sum_{i=1}^{n} \hat{z}_{ig}\left(\log(\hat{w}_{ig}) − \hat{w}_{ig}\right) − φ\left(\frac{\hat{ν}_{old} + p}{2}\right) + \log\left(\frac{\hat{ν}_{old} + p}{2}\right).

An approximation is then provided by

\hat{ν} ≈ \frac{−\exp(k) + 2\exp(k)\left(\exp\left(φ\left(\frac{\hat{ν}_{old}}{2}\right)\right) − \left(\frac{\hat{ν}_{old}}{2} − \frac{1}{2}\right)\right)}{1 − \exp(k)}   (2.17)

The same approximation holds for the unconstrained degrees of freedom by using an alternative definition of k, namely

k = −1 − \frac{1}{n_k} \sum_{i=1}^{n} \hat{z}_{ik}\left(\log(\hat{w}_{ik}) − \hat{w}_{ik}\right) − φ\left(\frac{\hat{ν}_{old} + p}{2}\right) + \log\left(\frac{\hat{ν}_{old} + p}{2}\right)   (2.18)

Covariance matrices are updated in the second CM-step. In the words of the original authors [6], these are updated similarly to their Gaussian counterparts in Celeux and Govaert [17].

Finally, the aspects of initialization and convergence criteria must be addressed. The software package teigen provides some alternatives for these aspects. The z_{ik} must be initialized, and in this thesis this was achieved by using a K-means initialization with 50 random starting points. The initial value of the degrees of freedom is 50 by default [5]. Aitken's acceleration is used to determine convergence [5]. In iteration t, Aitken's acceleration is given by

a^{(t)} = \frac{l^{(t+1)} − l^{(t)}}{l^{(t)} − l^{(t−1)}}   (2.19)

In this case, l^{(i)} refers to the value of the log-likelihood function at iteration i. It is used to compute an asymptotic estimate of the log-likelihood by

l_∞^{(t+1)} = l^{(t)} + \frac{1}{1 − a^{(t)}}\left(l^{(t+1)} − l^{(t)}\right)   (2.20)

The stopping criterion is l_∞^{(t+1)} − l^{(t+1)} < ε, where ε = 0.1 is the default value used in the teigen software.

When the algorithm has converged, cluster memberships for the data points are determined by the maximum posterior probabilities, i.e. if \hat{z}_{ig} is maximized for component g, then data point i is assigned to cluster g.
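As a small illustration (a sketch under the assumptions above, not code from the teigen package), the Aitken-based stopping rule of Equations 2.19-2.20 can be checked as follows:

```python
# Sketch of the Aitken acceleration convergence check (Equations 2.19-2.20).
# loglik holds the log-likelihood values l^(t-1), l^(t), l^(t+1) from three
# consecutive EM/ECM iterations; eps mirrors the default tolerance of 0.1.
def aitken_converged(loglik, eps=0.1):
    l_prev, l_curr, l_next = loglik[-3], loglik[-2], loglik[-1]
    a = (l_next - l_curr) / (l_curr - l_prev)          # Equation 2.19
    l_inf = l_curr + (l_next - l_curr) / (1.0 - a)     # Equation 2.20, asymptotic estimate
    return abs(l_inf - l_next) < eps

# Example usage with toy log-likelihood traces:
print(aitken_converged([-1050.0, -1020.0, -1010.0]))        # False: still improving quickly
print(aitken_converged([-1000.30, -1000.20, -1000.15]))     # True: likelihood has flattened out
```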

2.1.5 Non-negative matrix factorization (NMF)

An entirely different approach to clustering is non-negative matrix factorization (NMF). It is a matrix approximation method which factorizes a non-negative matrix into two non-negative matrices which have lower rank than the original. NMF has been used for clustering in different contexts, e.g. document clustering [51] and cancer gene clustering [16]. There also exists theoretical work regarding the clustering aspect of NMF, done by Mirzal and Furukawa [40]. A general explanation of the problem, which is inspired by the one provided by Brunet et al. and Xu et al. [16], is as follows. Suppose we have access to a data matrix X ∈ R^{m×n} in which all elements are non-negative, m denotes the number of features and n denotes the number of observations. We wish to find an approximate non-negative factorization of this matrix, i.e. find non-negative matrices U ∈ R^{m×k}, V^T ∈ R^{k×n} such that X ≈ UV^T. This is the general outline of the problem, which may be tackled in different ways.


Xu et al. assume that the columns in X are normalized to unit Euclidean length, while no such assumption is described by Brunet et al. Moreover, finding U and V can be done by solving different optimization problems and using different optimization procedures. Brunet et al. seek to minimize a functional related to the Poisson likelihood, while Xu et al. seek to minimize J = \frac{1}{2}\|X − UV^T\|, where \|\cdot\| is the squared sum of all matrix elements. In other words they wish to minimize half the Euclidean distance between X and UV^T, which of course is equivalent to minimizing the Euclidean distance.

The steps taken in this thesis are as follows: the data is not normalized prior to clustering since all features have the same units, and due to the constraint of non-negativity on the matrix elements no PCA is performed. In other words, the only pre-processing done is the square-root transformation. The transpose of this data matrix (denoted X′) will then act as input to the NMF algorithm. We will seek to solve the optimization problem in Equation 2.21.

\min_{U,V} \frac{1}{2}\|X′ − UV^T\|   (2.21a)
subject to U ≥ 0, V ≥ 0   (2.21b)

The optimization problem may be solved by using the updating formulas presented in Equation 2.22. These multiplicative updates were introduced by Lee and Seung [35]; here we use the same notation as in [51]. These are iterative updates, so an initialization is required. Initialization is done using the method Nonnegative Double Singular Value Decomposition (NNDSVD), which is described by Boutsidis and Gallopoulos [13]. In this thesis, the tolerance for the stopping condition has been 10^{-4} and the maximum number of iterations is set to 200.

u_{ij} ← u_{ij} \frac{(XV)_{ij}}{(UV^TV)_{ij}}, \qquad v_{ij} ← v_{ij} \frac{(X^TU)_{ij}}{(VU^TU)_{ij}}   (2.22)

It should be noted that the solution to minimizing the Euclidean distance is not unique. Say U and V are solutions to minimizing J and that D is some positive diagonal matrix. Then UD and VD^{−1} will also be solutions to the optimization problem. To ensure the solution's uniqueness, U is normalized by setting

v_{ij} ← v_{ij} \sqrt{\sum_i u_{ij}^2}, \qquad u_{ij} ← \frac{u_{ij}}{\sqrt{\sum_i u_{ij}^2}}   (2.23)

Once this has been done, the actual clustering can be performed. This is done by assigning data point x_i to cluster j^* = \arg\max_j v_{ij}. In other words, for each row i in V we care about the column index of the greatest element in that row; this column index is the cluster assignment for the data point x_i [51]. It is also worth discussing how NMF relates to K-means clustering to further illustrate the algorithm's use in clustering, as shown by Kim and Park [32]. It is helpful to view K-means clustering as a lower-rank matrix factorization with special constraints. As previously discussed, the objective function in K-means can be written as in Equation 2.24, in which A = [a_1, ..., a_n] ∈ R^{m×n}, K is some integer, C = [c_1, ..., c_K] ∈ R^{m×K} is the centroid matrix, c_j is the cluster centroid of cluster j and B ∈ R^{n×K} is the clustering assignment.

J_K = \sum_{j=1}^{K} \sum_{a_i ∈ C_j} \|a_i − c_j\|^2 = \|A − CB^T\|_F^2   (2.24)

J_K can be rewritten. Letting |C_j| denote the number of data points in cluster j and

D^{−1} = diag\left(\frac{1}{|C_1|}, \frac{1}{|C_2|}, ..., \frac{1}{|C_K|}\right) ∈ R^{K×K}, \qquad C = ABD^{−1}   (2.25)

we may rewrite J_K as

J_K = \|A − ABD^{−1}B^T\|_F^2   (2.26)

Using this formulation, K-means may be viewed as the task of finding B such that J_K is minimized. In this setting, B has one non-zero element per row, this non-zero element being 1. We consider a factorization of D^{−1} using two diagonal matrices D_1 and D_2 which satisfy D^{−1} = D_1 D_2, and let F = BD_1 and H = BD_2. Then J_K can be written as

J_K = \|A − AFH^T\|_F^2   (2.27)

and the optimization problem becomes

\min_{F,H} J_K = \|A − AFH^T\|_F^2   (2.28a)

in which F and H have one non-zero and positive element per row. If we set U = AF we see that the objective function is similar to the one in NMF. In K-means, U is the centroid matrix with rescaled columns and H has exactly one non-zero element per row; in other words each row represents a hard clustering of each data point. The difference compared to NMF is that NMF does not have these constraints, so the basis vectors (i.e. the columns of U) in NMF are not necessarily the cluster centroids, and it does not force a hard clustering upon the data points.
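As a concrete sketch (assuming scikit-learn's NMF, which offers NNDSVD initialization, a stopping tolerance of 10^{-4} and an iteration cap of 200 as described above; the exact implementation used in the thesis is not spelled out here), the factor-to-cluster assignment could look as follows. With X arranged as customers × features, the returned factor W plays the role of V above, and the argmax over each of its rows gives the cluster label.

```python
# Hypothetical sketch of NMF-based clustering with scikit-learn.
# Note: scikit-learn's default solver is coordinate descent rather than the
# multiplicative updates of Equation 2.22, and the column normalization of
# Equation 2.23 is omitted here; this is an illustration, not the thesis code.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(1000, 12)).astype(float)  # placeholder purchase counts
X = np.sqrt(X)                                       # square-root transformation

nmf = NMF(n_components=3, init="nndsvd", tol=1e-4, max_iter=200)
W = nmf.fit_transform(X)        # shape (n_customers, 3); plays the role of V
H = nmf.components_             # shape (3, n_features); plays the role of U^T

labels = W.argmax(axis=1)       # cluster j* = argmax_j of each customer's row
```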

2.2 Cluster validation indices

After running a clustering algorithm and obtaining a partition we may naturally wish to evaluate this solution.

Is it a good one, is it close to being good or is it worthless? Maybe we want to compare different kinds of algorithms or maybe we want to vary the parameters in one type of algorithm and choose the best parameter values, but how do we determine the best result from all these different methods? Several validation indices exist and there is some confusion regarding naming convention as well as which one should be chosen.

According to Halkidi et al [23] cluster validation indices can be divided into three different types: external criteria, internal criteria and relative criteria. The usage of external criteria requires some prior knowledge of cluster structures inherent to the input data, in our case customer segment labels. In this thesis no such prior information is available so we exclude this type of criteria and turn our attention to the two remaining ones.

Defining internal and relative criteria proves to be somewhat challenging as the terms are used interchangeably.

We begin by providing the explanations by Halkidi et al [23] and then proceed to explain the naming conundrum.

Internal criteria evaluate the clustering results in terms of the data itself, e.g. a proximity matrix. According to Halkidi et al the goal is to determine how much the result of a clustering algorithm agrees with the proximity matrix, for which purpose the authors propose using Hubert's Γ-statistic.

Relative criteria are used when we compare different clustering schemes with respect to some predetermined criterion. These criteria do not depend on any statistical tests like the external and internal criteria do. Now we turn our attention to a potential source of confusion. In the work by Halkidi et al, examples of relative criteria include the Davies-Bouldin index, Dunn index and Dunn-like indices. These are well-known indices that have been studied in other works. The crux, however, is that in other works these are usually called internal indices, see [27, 10, 37, 48]. The very same articles refer to the Silhouette index as an internal validity measure although it fits the definition of a relative criterion. To make things even more confusing, other articles refer to the very same indices as relative criteria, see [50, 29].

This confusion about naming convention is important to note since cluster evaluation and comparing clustering solutions is a central part of this thesis. No prior customer segment memberships are known for the data points in this thesis, meaning external validation is out of the question. The indices used will be the Silhouette index and the Davies-Bouldin index. Whether to call them internal or relative criteria is left as an exercise to the reader; even though it has no direct impact on the work in this thesis, it is a confusing topic that needs to be addressed.

2.2.1 Choice of cluster validity index (CVI)

In the absence of class labels for our data we have to use internal validation indices to evaluate the clustering partitions obtained. There are a multitude of different indices one can use, and in order to reduce the arbitrariness of the choice a few criteria have been used in this thesis: level of empirical support, computational complexity and availability of software. These three aspects will be the basis for the choice of CVI. The rationale is that since no single CVI dominates all data sets and we cannot analytically prove that a single CVI is suitable, we have to use empirical results available in the research literature. Also, we must take into account that due to the amount of data available some CVIs may be computationally unfeasible. Lastly the question of software availability


must be considered. It has been shown that different software implementations of the same algorithm can vary greatly in quality [30]. If we are to be certain about our results we need to trust the software used, so this is a factor we must consider.

Several works on comparing CVIs have been published. The most comprehensive one was by Arbelaitz et al [7], in which multiple CVIs were evaluated on different data sets. The CVIs were evaluated as follows. A clustering algorithm was applied to the data set with different values of K, the hypothesized number of true clusters present. This yields a set of different partitions of the data. The CVI is computed for all such partitions and the best partition according to the CVI will be the "predicted" partition by the CVI. The goal is to predict the partition that is most similar to the true one. This similarity was measured using the Adjusted Rand index, the Jaccard index and Variation of Information. The results show that no single CVI dominates the others, but overall the Silhouette and Davies-Bouldin indices seem to be good choices. It should however be noted that they are not significantly different from the Calinski-Harabasz and Dunn indices. In another study it was shown that the Silhouette index was very competitive [30]. In that study the Silhouette was actually not the single best performing CVI, but since the overall better performing CVI has less support in the literature we refrain from using it. Also, the Davies-Bouldin index did not perform well in that particular study. In another article, by Liu et al [37], we can see that the Silhouette index and Davies-Bouldin index are among the top performers. In that article neither was the best choice of CVI (rather, S_Dbw was), but it should also be noted that S_Dbw does not perform well in general [30]. All in all, there exists work which shows that the Silhouette and Davies-Bouldin indices are good choices even if they do not always find the correct number of clusters.

Other arguments for using the Silhouette and Davies-Bouldin indices are related to their mathematical properties. The Silhouette differs from the other well-performing indices (Dunn, Davies-Bouldin and Calinski-Harabasz) since it can compute a score for each data point while the others only compute an overall score for the partition. Thus, we can evaluate a distribution of silhouette scores instead of just the arithmetic mean, which is desirable since distributions are more informative than summary statistics. Secondly, the Silhouette is bounded. Since we know that a Silhouette value must lie in [−1, 1] we have a slightly more absolute measure of cluster quality. Even though we do not know the true labels of each customer, we at least know that if a clustering algorithm gives very high silhouette values then it produces tight-knit clusters. This argument also applies to the Davies-Bouldin index since we know that it is theoretically bounded from below by 0 (this follows from Definitions 1 and 2), so for very small values of \bar{R} we at least know that tight-knit clusters are formed.

2.2.2 The silhouette index

The silhouette index is an internal validation index which provides a graphical way of assessing the quality of clustering results. Unless stated otherwise, all results in this section come directly from the original article by Rousseeuw [44]. In the author's own words it is useful for evaluating clusters when we wish to create clusters that are compact and clearly separated. The construction of a silhouette plot requires only the partition which results from applying a clustering algorithm to some data set and the distances between the data points.

Before defining the silhouette and discussing some properties we introduce some notation. Throughout this section x denotes some data point in our data set X. We let A be the cluster x is assigned to by some algorithm.

When A contains more points than just x we can compute a(x) as the average dissimilarity of x to all other points in A. If we let C be a different cluster than A we can compute d(x, C) as the average dissimilarity of x to all points in C. Further, let b(x) = \min_{C \ne A} d(x, C). This minimum is attained for some cluster B, which will be called the neighbor of x. We can think of it as the "second best choice" for cluster assignment of x, i.e. the closest cluster we would have used if A was not available. We assume that the number of clusters K is strictly greater than 1. We then define the silhouette score of x as

s(x) = 1 − a(x)/b(x)   if a(x) < b(x)
s(x) = 0               if a(x) = b(x)
s(x) = b(x)/a(x) − 1   if a(x) > b(x)

Equivalently, s(x) = \frac{b(x) − a(x)}{\max(a(x), b(x))}.

The definition implies that −1 ≤ s(x) ≤ 1. Rousseeuw sets s(x) = 0 if A = {x} which in his own words is arbitrary, but the most neutral in his opinion. This is done since it is unclear how to define a(x) if A = {x}.

s(x) is used to measure how well x matches its cluster assignment. Below we describe how to interpret it.

In the case when s(x) ≈ 1 we say that x is well-clustered since s(x) ≈ 1 =⇒ a(x) << b(x), i.e. it is likely that it is in the correct cluster since it is much closer to A than to B. When s(x) ≈ 0 it is unclear if x should be put in A or B. Finally we consider the case when s(x) ≈ −1, which is the worst case. s(x) ≈ −1 =⇒ a(x) >> b(x), i.e. x is on average closer to elements in another cluster than the one it has been assigned to. We can almost certainly conclude that x has been put in the wrong cluster. Using these definitions and computed s(x) for


different x we can construct a graphical display. For some cluster A, its silhouette is the plot of all s(x) (in decreasing order) where x ∈ A. The s(x) are represented by horizontal bars whose lengths and directions are proportional to the respective s(x).

Silhouettes are useful since they are independent of the clustering algorithm used; only the actual cluster partitions are used. This means we can use silhouettes to compare the quality of outputs from different algorithms applied to the same data set. For example, if using K-means clustering we can select an appropriate K based on the silhouettes. More generally, we can use silhouettes to determine a suitable number of clusters K. This is explained in greater detail below.

Assume our data consists of clusters that are far apart but that we have specified K to be less than the true number of clusters. Some clustering algorithms (e.g. K-means) will join some of the true clusters so we end up with exactly K clusters. This will result in large a(x) and subsequently small s(x), which will be visible in the graphical display. In the reverse situation, i.e. when we have set K too high, some of the true clusters will be divided so that we form exactly K clusters. Such artificial divisions will also give small s(x) since the b(x) will become small. We can compare cluster quality between the different partitions that arise. For each cluster resulting from a partition we can compare the average s(x) from different clusters. This will distinguish weak and clear clusters. When comparing plots we can compute the overall average s(x) of the entire data set. Different K will in general give different average values of s(x), which we denote by \bar{s}(K). We can find a suitable number of clusters K by K^* = \arg\max_K \bar{s}(K).
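A short sketch (assuming scikit-learn, which exposes both the per-point silhouette values and their average; this is an illustration rather than the exact setup of the thesis) of choosing K by maximizing the mean silhouette \bar{s}(K):

```python
# Sketch: choose the number of clusters K by maximizing the average silhouette,
# and keep the per-point values for silhouette plots. Assumes scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                        # placeholder data

scores = {}
for k in range(2, 8):                                # K must be at least 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)          # mean s(x) over all points
    s_per_point = silhouette_samples(X, labels)      # values for a silhouette plot

best_k = max(scores, key=scores.get)                 # K* = argmax_K s_bar(K)
print(best_k, scores[best_k])
```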

When choosing the number of clusters or cluster algorithm by maximizing the Silhouette coefficient, there are potential pitfalls one has to be aware of. High values of the average silhouette coefficient may indicate that a strong clustering structure has been found, but this might be misleading if there are outliers in the data.

These outliers may be so far from the other data points that the other points seem like a tightly knit cluster in comparison to the outliers. In this case a clustering algorithm with 2 clusters may give a very high average silhouette index even if the cluster assignment is very rough since the outlier(s) end up in one cluster and all the other data points end up in another. An example of this is provided by Rousseeuw to illustrate the idea.

This phenomenon is relevant for this thesis since the data is very skewed, which is explored in more detail in Section 3.3.

2.2.3 The Davies-Bouldin index

Another cluster validity index is the Davies-Bouldin index, introduced in 1979 [19]. Unless explicitly stated otherwise, all information in this section was obtained from the original article. The Davies-Bouldin index indicates the similarity of clusters and can be used to determine how appropriate a partition of the data is. The DB index is independent of the number of clusters as well as of the method used for clustering, making it very attractive for comparing clustering algorithms that are very different in nature. Some notation needs to be defined. The original authors define the concept of a dispersion function as follows.

Definition 1. A real-valued function S is said to be a dispersion measure if the following properties hold: Let cluster C have members X1,...,Xm.

1) S(X1, ..., Xm) ≥ 0

2) S(X1, ..., Xm) = 0 ⇐⇒ Xi = Xj, ∀Xi, Xj ∈ C

The authors set out to define a cluster separation measure R(Si, Sj, Mij) which computes the average similarity of each cluster with its most similar cluster. They propose a definition of a cluster similarity measure as in Definition 2.

Definition 2. Let M_ij denote the distance between vectors that are characteristic of clusters i and j, and let S_i and S_j denote the dispersions of clusters i and j, respectively. A real-valued function R is a cluster similarity measure if the following properties hold:

1) R(S_i, S_j, M_ij) ≥ 0
2) R(S_i, S_j, M_ij) = R(S_j, S_i, M_ji)
3) R(S_i, S_j, M_ij) = 0 ⇐⇒ S_i = S_j = 0
4) S_j = S_k and M_ij < M_ik =⇒ R(S_i, S_j, M_ij) > R(S_i, S_k, M_ik)
5) M_ij = M_ik and S_j > S_k =⇒ R(S_i, S_j, M_ij) > R(S_i, S_k, M_ik)

Definition 2 implies some heuristically meaningful limitations on R. Specifically, these are:

1. R is nonnegative
2. R is symmetric
3. The similarity between two clusters is zero only if their dispersion functions vanish
4. If the distance between clusters increases while their dispersions remain constant, the similarity of the clusters decreases
5. If the distance between clusters remains constant while the dispersions increase, the similarity increases

The authors propose a function which satisfies the criteria in Definitions 1 and 2. Specifically, using the same S_i, S_j and M_ij as in Definition 2 and letting N denote the number of clusters, they set

R_{ij} = \frac{S_i + S_j}{M_{ij}}, \qquad \bar{R} = \frac{1}{N} \sum_{i=1}^{N} R_i, \qquad R_i := \max_{j \ne i} R_{ij}   (2.29)

\bar{R} is used to select an appropriate clustering solution. One can think of it as the average value of the similarity measures of each cluster with its most similar cluster. When comparing multiple clustering solutions to one another, the solution which minimizes \bar{R} is the best one with respect to this measure. In the original article, and in this thesis, the following choices of distance function, dispersion measure and characteristic vector were made:

S_i = \left(\frac{1}{T_i} \sum_{j=1}^{T_i} |X_j − A_i|^q\right)^{1/q}, \qquad M_{ij} = \left(\sum_{k=1}^{N} |a_{ki} − a_{kj}|^p\right)^{1/p}   (2.30)

In Equation 2.30, T_i is the number of observations in cluster i, A_i is the centroid of cluster i and a_{ki} is the kth component of the centroid of cluster i. In this thesis, M_{ij} has been chosen as the Euclidean distance between centroids, i.e. p = 2 and q = 2, meaning S_i is the standard deviation of the distance between samples and their respective cluster centres.

There are additional aspects that must be mentioned. First of all, the partition of the data set must yield at least two clusters for \bar{R} to make any sense. This is because the distance measure M_{ij} must be non-zero for \bar{R} to be defined. Moreover, the presence of clusters with only a single member limits the use of \bar{R}, since such clusters have zero dispersion (see Definition 1, property 2).
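A sketch of comparing partitions with the Davies-Bouldin index, where lower is better (assuming scikit-learn; its implementation uses the mean distance to the centroid as the dispersion S_i, which may differ slightly from the q = 2 choice described above):

```python
# Sketch: compare K-means partitions with the Davies-Bouldin index (lower is better).
# Assumes scikit-learn; davies_bouldin_score uses Euclidean centroid distances for
# M_ij and the mean distance to the centroid as the dispersion S_i.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                          # placeholder data

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, davies_bouldin_score(X, labels))          # R_bar for this partition
```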

2.3 Principal Component Analysis (PCA)

Principal component analysis (PCA) is a linear dimension reduction technique often used as a data pre-processing step prior to clustering. Unless explicitly stated otherwise, the information about PCA was taken from Izenmann [28]. Consider the case of a random vector X = (X_1, ..., X_r)^T which consists of r random variables. We assume X has mean vector μ_X and the r × r covariance matrix Σ_{XX}. The goal of PCA is to find a set of t (where t << r) ordered and uncorrelated linear projections of the input variables that can replace the original variables with minimal loss of information. In this case, "information" refers to the total variation of the input variables, defined in Equation 2.32. These projections are the so-called principal components; they will be denoted ξ_j, 1 ≤ j ≤ t, and are of the form shown in Equation 2.31.

ξ_j = b_j^T X = b_{j1} X_1 + ... + b_{jr} X_r, \quad j = 1, 2, ..., t   (2.31)

\sum_{i=1}^{r} \mathrm{var}(X_i) = \mathrm{tr}(Σ_{XX})   (2.32)

To find these principal components we make use of the spectral decomposition of Σ_{XX}, see Equation 2.33. Λ is a diagonal matrix whose elements are the eigenvalues of Σ_{XX} (denoted {λ_i}) and U is a matrix whose columns are the eigenvectors of Σ_{XX}.

Σ_{XX} = UΛU^T, \qquad U^TU = I   (2.33)

We thus see that tr(Σ_{XX}) = tr(Λ) = \sum_{i=1}^{r} λ_i. Here b_j = (b_{j1}, ..., b_{jr})^T is the jth coefficient vector and is chosen so that the following properties hold:


1. The first t principal components ξ_1, ..., ξ_t are ranked in decreasing order of their variances {var(ξ_j)}. In other words, var(ξ_1) ≥ var(ξ_2) ≥ ... ≥ var(ξ_t).

2. ξ_j is uncorrelated with all ξ_k, k < j.

In the case of dealing with observed data, we have to estimate the principal components. Specifically, we estimate Σ_{XX} by the sample covariance matrix \hat{Σ}_{XX} = S/n = X_C X_C^T / n, where n is the number of observations available and X_C is the centered data matrix (i.e. each column is centered). The ordered eigenvalues of \hat{Σ}_{XX} are denoted \hat{λ}_1 ≥ \hat{λ}_2 ≥ ... ≥ 0. The eigenvector associated with the jth greatest eigenvalue \hat{λ}_j is the jth eigenvector and is denoted \hat{v}_j, j = 1, 2, ..., r. The jth sample PC score of X is given by

\hat{ξ}_j = \hat{v}_j^T (X − \bar{X})   (2.34)

and the variance of the jth principal component is estimated by \hat{λ}_j. So we clearly see that in order to compute the principal components we need the eigenvalues and eigenvectors of X^TX.

We now turn our focus to how the principal components are computed. For this purpose we need to consider the singular value decomposition of a matrix. Unless explicitly stated otherwise, the information about SVD and its relationship to PCA is taken from Jolliffe [31]. For an arbitrary matrix X of size n × p, where we assume that n is the number of data points and p is the dimension of each data point, we may write

X = ULA^T   (2.35)

In Equation 2.35 we used the following notation

1. U, A are matrices of sizes (n × r) and (p × r), respectively. They are also orthonormal.

2. L is an (r × r) diagonal matrix.

3. r is the rank of X.

To see why these statements are true, consider the spectral decomposition of X^TX. Let the eigenvalues of X^TX be denoted by {l_k} for k = 1, 2, ..., p and let S be the sample covariance matrix. Then we can write

(n − 1)S = X^TX = l_1 a_1 a_1^T + ... + l_r a_r a_r^T   (2.36)

A is defined as the (p × r) matrix whose kth column is a_k, U is the (n × r) matrix whose kth column is

u_k = l_k^{−1/2} X a_k, \quad k = 1, 2, ..., r   (2.37)

and finally L is the (r × r) diagonal matrix whose kth diagonal element is l_k^{1/2}. Thus, conditions 1 and 2 are satisfied and it can be shown that X = ULA^T. This equation can also be expressed as

ULA^T = \sum_{k=1}^{p} X a_k a_k^T = X   (2.38)

in which X a_k is a vector which contains the scores on the kth PC. The SVD is important for the computation of PCA since, if we can find U, L and A such that ULA^T = X, then A and L give the eigenvectors and the square roots of the eigenvalues of X^TX. This means that we in turn get the coefficients and standard deviations of the principal components of S. Also, U will give scaled versions of the principal component scores. Having seen why the SVD is important for PCA, we now turn our focus to how it is actually computed in practice.

The software implementation used in this thesis relies on an approximation of the SVD presented by Halko et al [24], in which a randomized low-rank approximation is used. The general idea behind their algorithm is to first perform a low-rank approximation of X and then perform the SVD on this approximation. A brief summary of their goal is that they wish to construct a matrix Q with k orthonormal columns which satisfies

X ≈ QQ^TX   (2.39)

k is to be kept as low as possible and is usually chosen in advance. The SVD of X is then performed with the help of Q.
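As a minimal sketch (assuming scikit-learn, whose PCA exposes the randomized SVD of Halko et al. through svd_solver='randomized'; the exact software configuration of the thesis is not reproduced here):

```python
# Sketch: PCA via a randomized SVD, keeping enough components to explain
# most of the total variation. Assumes scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                 # placeholder for the feature-engineered data

pca = PCA(n_components=5, svd_solver="randomized", random_state=0)
scores = pca.fit_transform(X)                   # principal component scores
print(pca.explained_variance_ratio_.cumsum())   # share of total variation retained
```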

References
