DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2021

Categorization of songs using spectral clustering

LINUS BELOW BLOMKVIST

FELIX DARKE

Abstract

A direct consequence of the world becoming more digital is that the amount of available data grows, which presents great opportunities for organizations, researchers and institutions alike. However, this places a huge demand on efficient and understandable algorithms for analyzing vast datasets.

This project is centered around using one of these algorithms for identifying groups of songs in a public dataset released by Spotify in 2018. This problem is part of a larger problem class, where one wishes to assign data points to groups without preexisting knowledge of what makes the different groups special, or of how many different groups there are. This is typically solved using unsupervised machine learning.

The overall goal of this project was to use spectral clustering (a specific algorithm in the unsupervised machine learning family) to assign 50 704 songs from the dataset into different categories, where each category would be made up of similar songs. The algorithm rests upon graph theory, and a large emphasis was placed on understanding the mathematical foundation and motivation behind the method before the actual implementation, which is reflected in the report.

The results achieved through applying spectral clustering were one large group consisting of 40 718 songs in combination with 22 smaller groups, all larger than 100 songs, with an average size of 430 songs. The groups found were not examined in depth, but the analysis done hints that certain groups were clearly different from the data as a whole in terms of their musical features. For instance, one group was deemed to be 54% more likely to be acoustic than the dataset as a whole.

In conclusion, the largest cluster was deemed to be an artefact of the fact that a sample of songs listened to on Spotify is likely to consist mainly of popular songs. This would explain the homogeneity that caused most songs to be assigned to the same group, which also explains the limited success of spectral clustering for this specific project.


Sammanfattning

A consequence of an increasingly digital world is a growth in available data, which creates potential opportunities for companies, researchers, and institutions. However, large datasets require efficient and easily understood algorithms that can analyze and process the data.

This project is centered around using one of these algorithms to identify groups of songs in the dataset that Spotify released publicly in 2018. This problem belongs to a broader problem class, where the goal is to group a dataset without any prior knowledge of what distinguishes the groups or how many there actually are. A common method for solving this is unsupervised machine learning.

The goal of this project was to use the spectral clustering algorithm (an algorithm within unsupervised machine learning) to categorize 50 704 songs, where each category then comprises songs of similar character. The algorithm builds on graph theory, and a large part of the work was spent on understanding the mathematical foundations and concepts behind the method before it was implemented. This is reflected in the report.

The results from applying spectral clustering were one large group consisting of 40 718 songs combined with 22 smaller groups, each comprising 430 songs on average. No group had fewer than 100 songs. No deeper analysis was made of the groups found, but the analysis that was done showed that certain groups had distinctive musical characteristics compared with the data as a whole. For example, one group was 54% more acoustic than the dataset as a whole.

In summary, the assessment was that the large group was an artefact of the fact that when a sample of songs is taken from Spotify, there is a high probability that the majority of these songs are popular songs. This could explain the homogeneity that led to the majority of the songs ending up in the same group, which in turn meant that the spectral clustering algorithm had limited success in this project.


Contents

1 Introduction
  1.1 The Spotify Dataset
  1.2 Problem formulation
  1.3 Hypothesis

2 Method
  2.1 Different types of machine learning
  2.2 An introduction to clustering
    2.2.1 The k-means clustering algorithm
    2.2.2 The spectral clustering algorithm
  2.3 Comments about choice of method

3 Theory
  3.1 From data to graph
  3.2 Mathematical representation of the graph
  3.3 Deriving the graph Laplacian
  3.4 Finding clusters from the graph Laplacian

4 Implementation
  4.1 Programming the method
    4.1.1 Performing clustering
    4.1.2 Evaluation of the results
  4.2 Defining similarity
    4.2.1 Transformation from similarity to adjacency matrix
    4.2.2 Hierarchical spectral clustering
  4.3 Large scale eigenvalue problems
    4.3.1 Power method
    4.3.2 A general introduction to Krylov methods
    4.3.3 Rayleigh–Ritz method
    4.3.4 The Lanczos method

5 Results

6 Discussion

Chapter 1 Introduction

In 2018, Spotify released a dataset, with the goal of spurring research into the topic of which songs are likely to be skipped. The purpose of this project is to use machine learning to categorize songs, which may be useful when trying to predict skips [1].

In this chapter, Section 1.1 gives the reader a broad introduction to the dataset. Section 1.2 then introduces the problem we are trying to solve by categorizing songs, and motivates its value. Finally, a clear hypothesis is formulated in Section 1.3.

1.1 The Spotify Dataset

The dataset consists of approximately 3.7 million unique songs. The songs are represented as 30-dimensional vectors, where each dimension represents different characteristics of the acoustic signature of the song, or metadata.

A small sample of these variables is shown in Table 1.1. Note that the first song was more popular than the following song, and that the second song scores higher on both danceability and acousticness.

Table 1.1:
track_id      us_popularity_estimate   danceability   acousticness
t_ed6c098c    99.55                    0.5623         0.1107
t_a75a857b    93.27                    0.6119         0.3373

Apart from this, Spotify also gives access to user-specific information regarding the playback of a given song. This means that certain songs may occur multiple times in the dataset, since they may have been listened to several times by different users. This augments the dataset with 20 more dimensions, all representing different actions by the user. To paint a more vivid picture for the reader, Table 1.2 displays a few of these.

Table 1.2:
track_id      session_position   context_type   not_skipped
t_ed6c098c    9                  radio          True
t_a75a857b    10                 radio          False

This shows that the user was in radio mode for this part of the session, and that the track at position 9 was listened to in its entirety while the following track was skipped.

To summarize, the Spotify dataset in its most raw form can be seen as a set of 200 million vectors in a 50-dimensional space. 30 of these dimensions are unique to the track itself, while the remaining 20 are unique to a specific playback of the song. If two users both were to listen to ’Who Let the Dogs Out’ by Baha Men, the 30 dimensions that are unique to the song would be identical, but the other 20 would most likely differ.

$$\overbrace{\,[\;\underbrace{x_1 \dots x_{20}}_{\text{Unique to the user}}\;\;\underbrace{x_{21} \dots x_{50}}_{\text{Unique to the song}}\;]\,}^{\text{Playback of 'Who Let the Dogs Out'}}$$

For the interested reader, a full description of the dataset is given by [1].

In the upcoming section, certain characteristics of the dataset will be highlighted, which will lead into the problem formulation of categorizing songs.


1.2 Problem formulation

The overall goal stated by Spotify for releasing this data was to spur research into the topic of predicting skips. Consider two songs played in succession by the same user, where one wishes to predict whether the second song will be skipped.

It is intuitive that if there is a big difference between the songs, the likelihood of a skip could change (this is of course dependent on other factors; for instance, the variable 'context switch' states whether the user actively switched playlist or playback mode, in which case the likelihood may not change).

This observation motivates a need to represent the songs as something simpler than a 30-dimensional vector. One is typically used to seeing a label, such as the genre, which gives a general idea about what type of song this is.

It would not be feasible to have a human go through every single song in the Spotify dataset and label them. Instead, machine learning could be used to categorize songs, which would be equivalent to projecting the 30-dimensional vector representing the song onto a one-dimensional label.

$$\overbrace{\,[\;\underbrace{x_1 \dots x_{20}}_{\text{Unique to the user}}\;\;\underbrace{x_{21} \dots x_{50}}_{\text{Unique to the song}}\;]\,}^{\text{Playback of a song}} \;\longmapsto\; \overbrace{\,[\;\underbrace{x_1 \dots x_{20}}_{\text{Unique to the user}}\;\;\underbrace{\text{Category}}_{\text{Shared with similar songs}}\;]\,}^{\text{Playback of a song}}$$

It is important to point out that this project is not necessarily about using machine learning to decide which genre a certain song belongs to. Instead, as will be shown in Chapter 3, the method for categorization will be firmly grounded in mathematical similarity, and this may lead to groups in the data that differ from our preconceived notion of genres.

With that, a clear problem formulation has been established. From this, a hypothesis can be formulated.


1.3 Hypothesis

The overall hypothesis is that there are different categories of songs. This is actually a trivial statement, since the existence of genre is already known.

To make the formulation a bit more specific, the hypothesis is that the 30-dimensional vector representing a song can be projected onto a one-dimensional label, where different labels correspond to clearly different songs. This formulation makes the hypothesis less sharp, but it is still testable by comparing members of different groups to each other after the categorization is done.


Chapter 2 Method

This chapter will start off by exploring the two types of machine learning, supervised and unsupervised, in Section 2.1, and conclude that unsupervised learning is better suited for the categorization of songs. Section 2.2 will introduce the specific type of unsupervised learning that will be used, and compare two different algorithms within this family, together with visual examples.

Finally, Section 2.3 contains some comments about the choice of method.

2.1 Different types of machine learning

It is likely that there are groups in the data, based on the simple notion that there exist wildly different kinds of songs. However, it is not known how many, and even if one were to sample the data and listen to songs, it is likely that different people would assign the same song to different groups. With this in mind, unsupervised and supervised learning will be introduced, with the purpose of motivating which one is the most suitable for categorizing songs.


The difference is best explained with an example: imagine that one wants to train a model to categorize images of apples and bananas. If one were to use supervised learning, it would translate to the following:

1. Categorize some images by hand, i.e. label them as apples or bananas.

2. Show these images to the model, and it will (hopefully) learn what characterizes an apple and a banana.

Note that this requires the preexisting knowledge of apples and bananas, and the knowledge that the dataset only consists of these two categories.

Unsupervised learning circumvents this problem by a more general approach:

1. Define similarity

2. Assign objects to groups based on their similarity to other objects.

For the particular example of categorizing fruits, one could measure similarity as geometric shape. By applying this metric, one group of curved objects and one group of round objects would be found. Note that although this requires some preexisting knowledge of the fact that objects in the images can vary by shape, it requires far less than the supervised case.

As previously stated, it is not known how many categories there are in the Spotify dataset. In addition, the labels are not clear cut, in contrast to the example of apples and bananas. Hence, unsupervised learning is the overall machine learning strategy that will be used for classification of the songs. In the next section, the specific class of algorithms used for this project will be introduced.


2.2 An introduction to clustering

For a given set of datapoints, a potential strategy for assigning them into groups would be to measure how similar they are to each other, and assign sufficiently similar points into the same group. This strategy is known as clustering. To introduce this in more detail, consider Figure 2.1 and Figure 2.2, which show two synthetic datasets generated with the Python library Scikit-learn [3], each with two arbitrary features.

Figure 2.1: Concentric circles Figure 2.2: Blobs

The intuition here is that each of the two rings makes up a cluster in Figure 2.1, and that each of the three blobs makes up a cluster in Figure 2.2.

However, Subsection 2.2.1 and Subsection 2.2.2 will showcase that different algorithms (i.e. different mathematical strategies for finding the clusters) may not give results that are in line with this intuition.
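For reference, the two synthetic datasets can be generated with a few lines of Scikit-learn. The sample counts and noise levels used for the figures are not stated in the report, so the values below are illustrative assumptions.

```python
# Illustrative sketch: generate the "concentric circles" and "blobs" datasets
# with Scikit-learn. Sample sizes and noise levels are assumptions, not the
# exact values used for the figures in this report.
from sklearn.datasets import make_circles, make_blobs

# Figure 2.1: two concentric rings, two arbitrary features
X_circles, y_circles = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

# Figure 2.2: three well-separated blobs
X_blobs, y_blobs = make_blobs(n_samples=500, centers=3, random_state=0)

print(X_circles.shape, X_blobs.shape)  # (500, 2) (500, 2)
```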


2.2.1 The k-means clustering algorithm

Here, a very brief introduction to the k-means clustering algorithm will be provided. The purpose of this subsection is simply to showcase how it performs on the sample datasets in Figure 2.1 and Figure 2.2, and to present the general idea behind it. For a more thorough explanation of k-means, the reader is referred to [5].

The general idea is that, given the number of clusters $k$, one introduces $k$ centroids $R = \{r_1, ..., r_k\}$. A point is said to belong to a specific cluster $c_i \in \{c_1, ..., c_k\}$ if the point is closest to the corresponding centroid. The k-means algorithm is simply a systematic way of choosing these centroids given a good enough initial guess. A good analogy is that k-means can be thought of as finding $k$ centers of mass. In Figure 2.3 and Figure 2.4, the initial step of the algorithm is shown.

Figure 2.3: Concentric circles at initialization of k-means. Figure 2.4: Blobs at initialization of k-means.

The state after 1 and 100 iterations is seen in Figure 2.5, Figure 2.6, Figure 2.7 and Figure 2.8, for the respective datasets. This showcases how the clusters evolve as the centroids are updated in line with the algorithm.


Figure 2.5: Concentric circles after one iteration of k-means. Figure 2.6: Blobs after one iteration of k-means.

Figure 2.7: Concentric circles after 100 iterations of k-means. Figure 2.8: Blobs after 100 iterations of k-means.

The clusters found in Figure 2.8 are in line with expectations, but the clusters in Figure 2.7 are obviously not correct. However, going back to the center-of-mass analogy for k-means, this makes sense. It is worth noting that k-means can successfully cluster the concentric circles, but doing so builds upon preexisting knowledge of the circular geometry. The fact that k-means requires one to state the number of clusters is another weakness, since it was concluded in Section 2.1 that the number of categories in the Spotify data is unknown.

Subsection 2.2.2 will introduce a clustering algorithm that handles these issues.


2.2.2 The spectral clustering algorithm

Spectral clustering rests upon linear algebra and graph theory. The full theory will be reviewed in Chapter 3. For now, a very basic intuition will be provided.

Consider the concentric circles, as seen in Figure 2.1. There is a clear separation between the circles, and for now, the reader is asked to accept that there could be another geometry where this separation can be made visible in a way that allows k-means to correctly cluster the data. This is showcased in Figure 2.11, but the theory that validates this claim will be explored in Chapter 3. For now, the results achieved with spectral clustering, as seen in Figure 2.9 and Figure 2.10, will have to speak for themselves.

Figure 2.9: Concentric circles, spectral clustering. Figure 2.10: Blobs, spectral clustering.

Figure 2.11: Concentric circles in spectral domain.


It is obvious that spectral clustering was able to capture the more complex geometry. Before giving the promised theoretical background, some brief comments on the choice of method will be provided in Section 2.3.
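The qualitative behaviour described above can be reproduced directly with Scikit-learn. This is a minimal sketch; the parameters (number of clusters, affinity, neighbour count) are assumptions rather than the settings used for the figures.

```python
# Minimal comparison of k-means and spectral clustering on the circles dataset.
# Parameters are illustrative assumptions.
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles

X, y_true = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

# k-means splits the plane into two half-planes and mixes the rings (cf. Figure 2.7)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering on a nearest-neighbour graph recovers the two rings (cf. Figure 2.9)
sc_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", n_neighbors=10, random_state=0
).fit_predict(X)
```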

2.3 Comments about choice of method

Spectral clustering is one of the most popular modern clustering algorithms [12], and the high dimensionality of the Spotify dataset makes it appealing to use a method that rests upon a solid theoretical foundation. However, it is important to point out that although the examples highlighted throughout this chapter all make the case for unsupervised learning in general and spectral clustering in particular, there is no guarantee that this is the approach that would perform best in comparison with all other available alternatives.

With that said, the support for spectral clustering in the literature is strong [12], and the purpose of this project is not strictly to find the best performing algorithm for this problem. At the same time, it is a feasible way of tackling the problem of categorizing songs. The upcoming chapter will explore spectral clustering in more theoretical detail.


Chapter 3 Theory

Spectral clustering rests upon linear algebra in general and graph theory in particular. This implies that the datapoints one wishes to cluster have to be represented as a graph. How one goes about doing this will be introduced in Section 3.1, and in Section 3.2, the mathematical objects representing the graph will be introduced. From this, a discrete version of the Laplacian operator will be derived in Section 3.3. Finally, Section 3.4 will show that the eigenvalue problem for the Laplacian operator is in fact more or less equivalent to the clustering problem, and hence, it can be used to solve the problem of categorizing songs.

To avoid making this chapter too dense, non-essential theorems and proofs have been omitted. Instead, the goal is to provide the reader with an intuition about why spectral clustering works. To be able to provide this intuition, the work in [12], [5], [11] and [10] has been immensely helpful, and the theory is presented along the lines of these publications.


3.1 From data to graph

Starting with the basics, a graph G is a set of vertices and edges, visualized in Figure 3.1

Figure 3.1: Vertices and edges

The vertices are the nodes in the network, and the edges are the connections between the nodes. The graph $G = (V, E)$ is nothing but the collection of all vertices $V$ and edges $E$. A set of datapoints $X = \{x_1, x_2, ..., x_m\}$ will be mapped to the corresponding nodes in $V$, i.e. $x_i \mapsto v_i$, $1 \le i \le m$. Edges are drawn between vertices that represent datapoints that are sufficiently similar.

In Section 3.2, the mathematical representation of the graph G will be explored.

3.2 Mathematical representation of the graph

As hinted at in the previous section, similarity plays a key role in the graph representation of data, since edges will be drawn between vertices that represent sufficiently similar points. With that in mind, consider the following definition.

Definition 3.2.1 (Similarity metric)

For a set of datapoints $X = \{x_1, x_2, ..., x_m\} \in \mathbb{R}^n$, a similarity metric $s$ is a mapping from $\mathbb{R}^n \times \mathbb{R}^n$ to $\mathbb{R}$ fulfilling

$$s(x_i, x_j) = s_{i,j} = s_{j,i}$$

$$C \ge s_{i,j} \ge 0, \qquad s_{i,j} = C \iff i = j$$


This translates to the similarity metric being a positive semi-definite, symmetric operator with an upper bound $C$. Note that the more similar two points are, the closer their similarity will be to $C$. By measuring the similarity between the $m$ vectors in $X$, the similarity matrix $S$ can be defined, where element $S_{i,j}$ is the similarity between datapoints $x_i$ and $x_j$.

Definition 3.2.2 (Similarity matrix)

For a set of datapoints $X = \{x_1, x_2, ..., x_m\} \in \mathbb{R}^n$ and a similarity metric $s$ as defined in 3.2.1, the similarity matrix $S$ is given by

$$S := \begin{bmatrix} s_{1,1} & s_{1,2} & \dots & s_{1,m} \\ s_{2,1} & s_{2,2} & \dots & s_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ s_{m,1} & s_{m,2} & \dots & s_{m,m} \end{bmatrix} = \begin{bmatrix} C & s_{1,2} & \dots & s_{1,m} \\ s_{1,2} & C & \dots & s_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ s_{1,m} & s_{2,m} & \dots & C \end{bmatrix}$$

Note that the symmetry of $s_{i,j}$ has been exploited to assert the symmetry of the matrix, in combination with the fact that $s_{i,j} = C \iff i = j$. As previously stated, the similarity matrix gives information about the similarity between all datapoints. Since the goal is to connect sufficiently similar datapoints with an edge, this is required information.

There are many different similarity measures. The ones used in this study are outlined in Chapter 4. If one were to apply a function to all elements of $S$ that maps an element to zero if the similarity is not deemed sufficient, and leaves the element intact otherwise, all information required to draw the graph would be available. In addition to this, it is desired that vertices do not connect to themselves. This can be formalized with a definition.

Definition 3.2.3 (Adjacency matrix)

Given a similarity matrix $S$ with elements $s_{i,j}$ as defined in 3.2.2, choose a mapping $\delta$ such that

$$\delta(s_{i,j}) = \begin{cases} s_{i,j}, & \text{if } x_i \text{ and } x_j \text{ are similar enough} \\ 0, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases}$$

$$A := \begin{bmatrix} \delta(s_{1,1}) & \delta(s_{1,2}) & \dots & \delta(s_{1,m}) \\ \delta(s_{2,1}) & \delta(s_{2,2}) & \dots & \delta(s_{2,m}) \\ \vdots & \vdots & \ddots & \vdots \\ \delta(s_{m,1}) & \delta(s_{m,2}) & \dots & \delta(s_{m,m}) \end{bmatrix} = \begin{bmatrix} 0 & \delta(s_{1,2}) & \dots & \delta(s_{1,m}) \\ \delta(s_{1,2}) & 0 & \dots & \delta(s_{2,m}) \\ \vdots & \vdots & \ddots & \vdots \\ \delta(s_{1,m}) & \delta(s_{2,m}) & \dots & 0 \end{bmatrix}$$


Note the preserved symmetry, and that $A$ is simply a sparser version of $S$. Since edges are only drawn between vertices with a corresponding nonzero element in $A$, choosing $\delta$ in Definition 3.2.3 is equivalent to defining the criterion for drawing an edge. The edge between $v_i$ and $v_j$ will be weighted with the corresponding element $a_{i,j}$.

Finally, recognize that for a given vertex $v_i$, the row $A_{i,1}, ..., A_{i,m}$ holds all the edge weights leading into that particular vertex. By row-wise summation of $A$, the total edge weight for each vertex can be gathered, which defines the degree matrix.

Definition 3.2.4 (Degree matrix)

Given the adjacency matrix $A$ with elements $a_{i,j}$ as defined in Definition 3.2.3, the degree matrix $D$ is given by

$$d_{i,j} = \begin{cases} \sum_{k=1}^{m} a_{i,k}, & i = j \\ 0, & i \ne j \end{cases}$$

Note that this is a strictly positive, diagonal matrix.

To summarize this section, consider the steps required to go from data to graph:

• Define a similarity metric $s(x_i, x_j)$ (3.2.1)

• From this, calculate the similarity matrix $S$ (3.2.2)

• Choose a criterion for when to draw an edge, i.e. the mapping $\delta$

• Calculate the adjacency matrix $A$ (3.2.3)

• Draw edges between vertices with corresponding nonzero elements in $A$, and weight the edges by the similarity

It is worth noting that the degree matrix $D$ is not required for drawing the graph. However, it is required to derive the graph Laplacian, which will be the focus of Section 3.3. Before that, an example of the steps above will be provided.


Example 3.2.1 (Mathematical representation of the graph)

Consider $X = \{x_1, \dots, x_{12}\} \in \mathbb{R}^2$ as seen in Figure 3.2. The similarity matrix is computed with a similarity function with $C = 1$, which gives the symmetric $12 \times 12$ similarity matrix.

Figure 3.2

$$S = \begin{bmatrix} 1 & s_{1,2} & \dots & s_{1,12} \\ s_{1,2} & 1 & \dots & s_{2,12} \\ \vdots & \vdots & \ddots & \vdots \\ s_{1,12} & s_{2,12} & \dots & 1 \end{bmatrix}$$

All elements of $S$ will fulfill $1 \ge s_{i,j} \ge 0$.

Note that the similarities between points that are close to each other should be higher. To make sure that edges are only drawn between points in the same cluster, a threshold for the similarity could be set. This translates to defining $\delta$ (as introduced in 3.2.3) and calculating $A$.

$$\delta(s_{i,j}) = \begin{cases} s_{i,j}, & \text{if } s_{i,j} \ge \varepsilon \\ 0, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases}$$

$$A = \begin{bmatrix} 0 & \delta(s_{1,2}) & \dots & \delta(s_{1,12}) \\ \delta(s_{1,2}) & 0 & \dots & \delta(s_{2,12}) \\ \vdots & \vdots & \ddots & \vdots \\ \delta(s_{1,12}) & \delta(s_{2,12}) & \dots & 0 \end{bmatrix}$$

Depending on how $\varepsilon$ is chosen, the final graph will look different.

Figure 3.3: Small $\varepsilon$. Figure 3.4: Large $\varepsilon$.
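The steps in Example 3.2.1 can be written out directly in NumPy. The sketch below uses the Gaussian similarity function that is introduced later in Section 4.2 together with a hypothetical threshold $\varepsilon$; it is only meant to make the definitions concrete.

```python
# Sketch of the data-to-graph pipeline for a small set of 2-D points.
# Gaussian similarity (Section 4.2) and the threshold eps are assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))          # 12 datapoints in R^2, cf. Example 3.2.1
sigma, eps = 1.0, 0.5

# Similarity matrix S (Definition 3.2.2): s_ij = exp(-|x_i - x_j|^2 / sigma^2)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
S = np.exp(-d2 / sigma**2)

# Adjacency matrix A (Definition 3.2.3): keep s_ij above the threshold, zero diagonal
A = np.where(S >= eps, S, 0.0)
np.fill_diagonal(A, 0.0)

# Degree matrix D (Definition 3.2.4) and graph Laplacian L = D - A (Section 3.3)
D = np.diag(A.sum(axis=1))
L = D - A

print(np.allclose(L, L.T), np.allclose(L.sum(axis=1), 0.0))  # symmetric, zero row sums
```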


3.3 Deriving the graph Laplacian

The Laplacian operator shows up in almost every domain in mathematics and physics. Here, it will be shown that the matrix formed by subtracting the adjacency matrix $A$ (3.2.3) from the degree matrix $D$ (3.2.4) can be interpreted as a discrete version of the Laplacian operator. To get to that point, a slight detour into the realm of heat transfer is required. The following derivation will not fully prove the interpretation of $L = D - A$ as a discrete Laplacian operator, but the hope is that this example gives the intuition that it is reasonable. A more detailed presentation can be found in [11].

Consider 3 rigid bodies separated by vacuum as seen in Figure 3.5, where the components of $\phi = [\phi_0\ \phi_1\ \phi_2]^T$ represent the temperature in each of these bodies.

The diffusion equation states that

$$\partial_t \phi = K \Delta \phi,$$

where $K$ is the heat conductivity. Since the heat conductivity of vacuum is zero, $\partial_t \phi = 0$. Also note that this applies to all 3 bodies separately, since the Laplace model of heat transfer states that no heat will leak into vacuum.

$$\partial_t \begin{bmatrix} \phi_0 \\ 0 \\ 0 \end{bmatrix} = \partial_t \begin{bmatrix} 0 \\ \phi_1 \\ 0 \end{bmatrix} = \partial_t \begin{bmatrix} 0 \\ 0 \\ \phi_2 \end{bmatrix} = 0.$$

Now, consider a finite discretization of each of these rigid bodies into four parts, as seen in Figure 3.6 (the argument can be expanded to an arbitrary discretization or to infinity),


Figure 3.5 Figure 3.6

which in turn expands $\phi$ into $\hat{\phi} = [\hat{\phi}_0\ \hat{\phi}_1 \dots \hat{\phi}_{11}]^T$, where $\phi_0 = \hat{\phi}_0 + \hat{\phi}_1 + \hat{\phi}_2 + \hat{\phi}_3$ (with the same logic for $\phi_1$ and $\phi_2$). Heat will flow between the discrete regions of each body. However, no diffusion will occur to discrete regions of other bodies. Now, consider the diffusion equation for component 0 of $\hat{\phi}$. By Newton's law of cooling we have the following relation:

$$\partial_t \hat{\phi}_0 = K \Delta \hat{\phi}_0 = K\left[(\hat{\phi}_0 - \hat{\phi}_1) + (\hat{\phi}_0 - \hat{\phi}_2) + (\hat{\phi}_0 - \hat{\phi}_3)\right] = K\sum_{j=0}^{3}(\hat{\phi}_0 - \hat{\phi}_j)$$

This gives an explicit formula for the heat flux of a given component. Now, it is time to exemplify how this is equivalent to defining the Laplacian operator as $L = D - A$.

$$\frac{d\hat{\phi}_0}{dt} = K\big((D - A)\hat{\phi}\big)_0 = K\left(\Big(\sum_{j=2}^{12} a_{1,j}\Big)\hat{\phi}_0 - \sum_{j=2}^{12} a_{1,j}\,\hat{\phi}_{j-1}\right) = K\sum_{j=2}^{12} a_{1,j}(\hat{\phi}_0 - \hat{\phi}_{j-1})$$


For this to be consistent with the model of heat transfer, it is required that the components $a_{i,j}$ of the adjacency matrix $A$ are equal to 1 if $i, j$ belong to the same discretization of a rigid body, and zero otherwise (since heat will only transfer within the body). Hence, the expression simplifies to

$$K\sum_{j=0}^{3}(\hat{\phi}_0 - \hat{\phi}_j)$$

which is equal to the previous expression found from Newton's law of cooling. By the same reasoning, this expression can be shown to hold for an arbitrary component of $\hat{\phi}$.

Now that it has been concluded that it is reasonable to believe that $D - A$ can be viewed as the Laplacian operator in the context of a graph, it is time to proceed to how the eigenvalue problem for this operator can be equivalent to solving the problem of categorizing songs.

3.4 Finding clusters from the graph Laplacian

In this section, the main theoretical result, which shows how the eigenspace of the Laplacian is related to the clusters, will be presented. In order to get to that point, some of the fundamental properties of the Laplacian will be required.

Theorem 3.4.1 The Laplacian matrix $L = D - A$ has the following properties:

1. For all vectors $v \in \mathbb{R}^m$,
$$v^T L v = \sum_{i,j=1}^{m} a_{i,j}(v_i - v_j)^2$$

2. $L$ is symmetric positive semi-definite.

3. The smallest eigenvalue of $L$, in a fully connected graph, is 0, with a corresponding eigenspace spanned by the eigenvector $\mathbb{1} = [1, 1, ..., 1]^T$.


Proof sketch 3.4.1 The exact proofs are omitted, and the reader is referred to the resources mentioned in the introduction to this chapter for these. However, some brief comments will be provided.

The first property is shown by evaluating the expression on the left-hand side and systematically using the definitions of $D$ and $A$. The second property follows directly from the fact that $A$ is symmetric, and $D$ is a positive diagonal matrix whose diagonal elements are the row sums of $A$. Finally, the third property follows directly from the first property. The third property is perhaps the most important, and it goes in line with the intuition developed in Section 3.3, i.e. that the flux measured across a fully connected graph will be zero. Since spectral clustering is an algorithm for finding components of a graph with a low degree of connectivity to other parts of the graph, this result could potentially be helpful. To be more specific, if it were possible to rearrange the Laplacian of the entire graph in such a way that it was composed of blocks that themselves are Laplacian matrices for the fully connected components of the graph, these eigenvectors could be used to identify these blocks.

Thankfully, this exact property is asserted by Theorem 3.4.2.

Theorem 3.4.2 Any adjacency matrix $A$ as defined in 3.2.3 can be written as a block diagonal matrix, which implies that the graph Laplacian $L = D - A$ can be written as a block diagonal matrix as well.

Proof sketch 3.4.2 First, note that the adjacency matrix representing a graph with no connections is strictly diagonal, which by definition also is a block diagonal matrix. Conversely, the adjacency matrix of a completely connected graph will be nonzero everywhere (except on the diagonal), which by definition is a block diagonal matrix consisting of one block.

For a partially connected graph with completely connected components, the rows can be rearranged without affecting the solution space of the operator, given that a corresponding column rearrangement is made to preserve symmetry. A given element $a_{i,j}$ will be nonzero only if vertices $i$ and $j$ belong to the same connected component. Hence, the matrix can be manipulated such that the rows representing elements belonging to the same component are placed adjacent to each other, which will result in a block diagonal matrix.

By definition, $L = D - A$. By rearranging the rows in $D$ to match the rearrangement in $A$, the elements of $D$ are the degrees of the vertices in the corresponding blocks, and hence, the block diagonal structure will not be broken by this arithmetic operation.

Once again, the reader is referred to [11] for a more detailed proof.

Theorem 3.4.2 simply states that an arbitrary Laplacian matrix can be rewritten in block diagonal form, where every block is itself a Laplacian matrix. To exemplify this, consider our example from earlier.

$$L = \begin{bmatrix} L_1 & 0 & 0 \\ 0 & L_2 & 0 \\ 0 & 0 & L_3 \end{bmatrix}$$

Figure 3.7

To summarize, the following theoretical results have been established so far:

• For a fully connected graph, the Laplacian matrix has eigenvalue zero, with a corresponding eigenspace spanned by the vector $\mathbb{1}$.

• A Laplacian matrix representing a graph with several distinct connected components can be rewritten as a block diagonal matrix, where every block is the Laplacian matrix for the corresponding connected component.

Putting these results together, every Laplacian block $L_i$ will have a corresponding zero eigenspace spanned by $\mathbb{1}_i$, and hence, it would be reasonable to believe that the algebraic multiplicity of the zero eigenvalue would be equal to the number of Laplacian blocks, and conversely equal to the number of clusters. This is affirmed by Theorem 3.4.3.


Theorem 3.4.3 The algebraic multiplicity $k$ of the zero eigenvalue for the graph Laplacian is equal to the number of connected components.

Proof sketch 3.4.3 Theorem 3.4.1 asserts that every Laplacian matrix has the eigenvalue zero, and Theorem 3.4.2 states that a graph with k connected components has a graph Laplacian that can be rewritten as a block diagonal matrix with k blocks, where every block itself is a Laplacian matrix. Hence, it follows that the algebraic multiplicity of the zero eigenvalue is at least k.

This proof comes down to showing that it is exactly k.

For a block diagonal matrix, the characteristic polynomial can be written as $\det(\lambda I - L) = \det(\lambda I - L_1)\cdots\det(\lambda I - L_k)$. If each of these factors has only one zero root, the theorem follows. This will be shown by contradiction.

If the eigenvalue 0 for graph $G_i$ (corresponding to the $i$-th connected component) has an algebraic multiplicity greater than 1, the fact that $L$ is symmetric asserts that there exist at least two eigenvectors $v_1, v_2$ with eigenvalue 0 that are linearly independent, since the algebraic multiplicity equals the geometric multiplicity for symmetric matrices.

Then $v_1^T L v_1 = 0$ and $v_2^T L v_2 = 0$, which by Theorem 3.4.1 translates to

$$\sum_{i,j} a_{i,j}(v_{1,i} - v_{1,j})^2 = 0$$

$$\sum_{i,j} a_{i,j}(v_{2,i} - v_{2,j})^2 = 0$$

Since both the squared differences and the weights in these sums are non-negative, each sum can only be zero when $v_{1,i} = v_{1,j}$ and $v_{2,i} = v_{2,j}$ for all $i, j$ such that $a_{i,j} \ne 0$. Because of this, $v_1$ and $v_2$ are constant vectors and scalar multiples of each other. But in the beginning, the eigenvectors $v_1$ and $v_2$ were set to be linearly independent, and hence, there is a contradiction.

The above shows that there is exactly one zero root of $\det(\lambda I - L_i)$, which translates to exactly $k$ zero roots of $\det(\lambda I - L) = \det(\lambda I - L_1)\cdots\det(\lambda I - L_k)$, and hence, the theorem follows.

This result is the centerpiece of the theoretical background on this project, since it states that the problem of categorizing songs is equivalent to finding the zero eigenspace for the Laplacian operator.

Finally, the second property of Theorem 3.4.1 states that the graph Laplacian is symmetric. By the spectral theorem, this implies that the $k$ eigenvectors corresponding to the $k$ zero eigenvalues form an orthogonal basis. Combining this with the fact that eigenvector $i$ corresponds to the Laplacian block $L_i$, and with the intuition from the heat transfer example introduced in Section 3.3, it can be shown that the space constructed from these eigenvectors can be used to indicate which sub-graph (and thereby which cluster) a given datapoint belongs to.
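Theorem 3.4.3 is easy to sanity-check numerically on a toy graph; the block sizes below are arbitrary assumptions.

```python
# Numerical sanity check of Theorem 3.4.3: the multiplicity of the zero
# eigenvalue of L = D - A equals the number of connected components.
# The toy graph (three fully connected components) is an arbitrary assumption.
import numpy as np
from scipy.linalg import block_diag

def complete_graph_adjacency(n):
    return np.ones((n, n)) - np.eye(n)   # complete graph on n vertices

A = block_diag(*[complete_graph_adjacency(n) for n in (4, 3, 5)])  # 3 components
L = np.diag(A.sum(axis=1)) - A

eigvals = np.linalg.eigvalsh(L)
n_zero = int(np.sum(np.abs(eigvals) < 1e-10))
print(n_zero)  # 3, one zero eigenvalue per connected component
```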

This concludes the theoretical part of this report. In the next chapter, the implementation of spectral clustering specific for this project will be presented.


Chapter 4

Implementation

With the theoretical results acquired, spectral clustering can be implemented as follows:

1. Compute the $n \times n$ Laplacian matrix $L$.

2. Find the eigenvalues and eigenvectors of $L$.

3. Construct the embedded $n \times k$ space from the eigenvectors corresponding to the $k$ (sufficiently close to) zero eigenvalues.

4. Apply k-means to the $n \times k$ space to classify the points into $k$ clusters.

The reason for allowing eigenvalues sufficiently close to zero is twofold. Primarily, a small perturbation in the data (and hence, in the matrices used to define the Laplacian) will result in a small perturbation of the eigenvalues. Secondly, the sheer size of the Spotify dataset calls for approximate methods of finding eigenvalues and eigenvectors.
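A compact sketch of the four steps, using the sparse eigensolver from SciPy and k-means from Scikit-learn as described in Section 4.1.1. The tolerance for "sufficiently close to zero" and the maximum number of eigenpairs are assumptions, and the Laplacian is assumed to have been built already (step 1, Section 4.2).

```python
# Sketch of the four implementation steps. The sparse Laplacian L is assumed
# to have been built already (Section 4.2); n_eig_max and zero_tol are
# illustrative assumptions.
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_clustering(L, n_eig_max=50, zero_tol=1e-6, random_state=0):
    # Step 2: a few of the smallest eigenpairs of the sparse, symmetric Laplacian
    eigvals, eigvecs = eigsh(L, k=n_eig_max, which="SM")
    order = np.argsort(eigvals)
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 3: k = number of (near-)zero eigenvalues; the embedded n x k space
    # is spanned by the corresponding eigenvectors
    k = int(np.sum(eigvals < zero_tol))
    U = eigvecs[:, :k]

    # Step 4: k-means in the embedded space assigns every point to a cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(U)
    return k, labels
```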


4.1 Programming the method

The overall implementation was done in Python, and can roughly be broken down into three parts:

1. Preprocessing of the data
2. Performing clustering
3. Evaluation of the results

A full description of this will not be made in this report, but some short comments will be provided below.

4.1.1 Performing clustering

To find the eigenvalues and eigenvectors close to zero, the SciPy sparse library was used. The application of k-means to the space constructed from the eigenvectors was done using Scikit-learn [3].

4.1.2 Evaluation of the results

It has been concluded that the number of clusters is equal to the number of eigenvalues sufficiently close to zero. In the initial hypothesis, it was stated that songs mapped to different clusters should be "clearly different". Hence, the code returns the number of eigenvalues close to zero (i.e. the number of clusters), and offers the ability to sample songs from the different clusters, in combination with metadata. This allows one to evaluate the hypothesis.

4.2 Defining similarity

In Chapter 3, it was established that in order to construct the Laplacian matrix, a similarity metric is needed. Now, it is time to contemplate the choice of such a function $s$. To reiterate, it has to be symmetric, positive semi-definite and have an upper bound $C$, where $s_{i,j} = C \iff i = j$. A common choice is the Gaussian similarity function

$$s(x_i, x_j) = \exp\left(-|x_i - x_j|^2/\sigma^2\right) \tag{4.1}$$


Note that the Gaussian similarity function fulfills the stated criteria, with the upper bound $C = 1$. One also has to set the parameter $\sigma$. In [13], the authors suggest that $\sigma$ is chosen in accordance with the local variance of the points $x_i$ and $x_j$. An intuition about how the choice of this parameter affects the similarity function is obtained from Figure 4.1.

Figure 4.1: $\exp(-|x_i - x_j|^2/\sigma^2)$ as a function of $|x_i - x_j|$, for different $\sigma$.

The main conclusion to be drawn from Figure 4.1 is that the suggestion in [13] to locally scale $\sigma$ can be motivated by the fact that $\sigma$ needs to be tuned for the Gaussian similarity function to "detect" the local variations one can expect between points in a particular dataset. The smaller $\sigma$ is, the quicker the Gaussian similarity function tends to zero.

A weakness of local scaling as suggested in [13] is the computational cost of having to estimate $\sigma$ for every point.
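One common way of realizing such local scaling is to set $\sigma_i$ to the distance from $x_i$ to its $k$-th nearest neighbour and use $\sigma_i \sigma_j$ in the denominator. The sketch below assumes this variant; the choice $k = 7$ is an arbitrary assumption, not necessarily the one used in [13] or in this project.

```python
# Sketch of a locally scaled Gaussian similarity matrix.
# Using the distance to the 7th nearest neighbour as sigma_i is an assumption.
import numpy as np
from scipy.spatial.distance import cdist

def local_scale_similarity(X, k=7):
    d = cdist(X, X)                      # pairwise Euclidean distances
    sigma = np.sort(d, axis=1)[:, k]     # sigma_i = distance to k-th neighbour
    return np.exp(-d**2 / np.outer(sigma, sigma))
```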

4.2.1 Transformation from similarity to adjacency matrix

The similarity matrix is computed by pairwise measuring the Gaussian similarity between all points, where element $s_{i,j}$ corresponds to the similarity between datapoints $i$ and $j$. One does not wish to connect all vertices. Instead, a cut-off is imposed, which was referred to as the mapping $\delta$ in the theory section.

$$S = \begin{bmatrix} C & s_{1,2} & s_{1,3} & \dots & s_{1,m} \\ s_{1,2} & C & s_{2,3} & \dots & s_{2,m} \\ s_{1,3} & s_{2,3} & C & \dots & s_{3,m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ s_{1,m} & s_{2,m} & s_{3,m} & \dots & C \end{bmatrix} \;\longmapsto\; \begin{bmatrix} 0 & a_{1,2} & a_{1,3} & \dots & a_{1,m} \\ a_{1,2} & 0 & a_{2,3} & \dots & a_{2,m} \\ a_{1,3} & a_{2,3} & 0 & \dots & a_{3,m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{1,m} & a_{2,m} & a_{3,m} & \dots & 0 \end{bmatrix}$$

The ambition is that $A$ should be sparse, which translates to only connecting vertices with a sufficiently high degree of similarity.

In this project, two ways of achieving this have been explored.

ε-threshold

Set a threshold $\varepsilon$, and map $S$ to $A$ as follows:

$$\delta(s_{i,j}) = \begin{cases} s_{i,j}, & \text{if } s_{i,j} \ge \varepsilon \\ 0, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases}$$

m nearest neighbors

Set a value $m$. For the vertex $v_i$, $a_{i,j} = s_{i,j}$ if $v_j$ is among the $m$ nearest neighbors of $v_i$, and zero otherwise.

Note that this relationship may not necessarily be symmetric. However, since there are advantages in working with a symmetric Laplacian, the adjacency matrix is made symmetric by setting $A = \frac{1}{2}(A + A^T)$ in the code.
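A sketch of the m-nearest-neighbours construction together with the symmetrization described above; the value $m = 10$ is an illustrative assumption, and $S$ is assumed to be a precomputed (dense) similarity matrix.

```python
# Sketch of the m-nearest-neighbour adjacency matrix with symmetrization.
# S is a precomputed similarity matrix; m = 10 is an illustrative assumption.
import numpy as np

def knn_adjacency(S, m=10):
    n = S.shape[0]
    A = np.zeros_like(S)
    for i in range(n):
        # indices of the m most similar vertices to v_i (excluding v_i itself)
        order = np.argsort(S[i])[::-1]
        neighbours = [j for j in order if j != i][:m]
        A[i, neighbours] = S[i, neighbours]
    return 0.5 * (A + A.T)   # enforce symmetry as in the text
```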

4.2.2 Hierarchical spectral clustering

Even though spectral clustering is able to detect clusters where other algorithms tend to fail, a common problem in real-world applications is that a large portion of the data can appear as one large cluster. For instance, using the example of fruits, most fruits are round. The precision required to distinguish the different round fruits from each other is higher than the precision required to distinguish the round fruits from the bananas. The approach would then be to first separate the bananas from the round fruits, and then proceed to cluster the round fruits, using higher precision.

In the literature, this strategy is known as hierarchical clustering [5]. In the implementation for this project, it has translated to using a stricter threshold when trying to separate the largest cluster.
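In code, this hierarchical step amounts to re-running the clustering on the members of the largest cluster with a stricter threshold. The sketch below assumes a hypothetical helper build_laplacian (applying the ε-threshold of Section 4.2.1) and the spectral_clustering sketch shown earlier in this chapter; it is not the exact implementation used in this project.

```python
# Sketch of the hierarchical step: re-cluster the largest cluster with a
# stricter threshold. build_laplacian is a hypothetical helper applying the
# epsilon-threshold; spectral_clustering is the earlier sketch.
import numpy as np

def hierarchical_step(X, labels, build_laplacian, spectral_clustering, strict_eps):
    largest = np.bincount(labels).argmax()               # index of the largest cluster
    members = np.where(labels == largest)[0]
    L_sub = build_laplacian(X[members], eps=strict_eps)  # stricter threshold
    _, sub_labels = spectral_clustering(L_sub)
    return members, sub_labels
```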

Before moving on to presenting the results, some comments on solving the eigenvalue problem for large matrices will be provided.

4.3 Large scale eigenvalue problems

One big challenge for this project is the sheer size of the dataset. Because of its size, the regular numerical methods for finding eigenvalues and eigenvectors are simply not usable. However, there is a certain family of eigensolvers, based on the usage of Krylov methods, that can find an approximate solution to the eigenproblem through matrix-vector multiplication, which is a computationally efficient procedure, especially when dealing with sparse matrices.

The aim is not to fully explain and outline the methods used in any great theoretical detail. Instead, the goal is to give the reader some intuition about how an eigenproblem of this size was handled. In the implementation, the sparse matrix eigensolver from [9] was used.

This topic is introduced by taking a look at an intuitive way of finding one eigenvalue/eigenvector pair using matrix multiplication. As it turns out, this will serve as a bridge into the more advanced Krylov methods.

4.3.1 Power method

The Power method is an algorithm for computing eigenvalues and eigenvectors. The idea is to multiply a matrix with a randomly chosen vector, then iteratively normalize and multiply the matrix with the normalized vector from the earlier step.

The purpose of this repeated multiplication is that repeated application of the matrix $A$ will amplify the component along the eigenvector corresponding to its largest eigenvalue in magnitude.

The procedure of the Power method is as follows (for an arbitrary matrix $A$), following the implementation in [7]:

1. Choose an initial vector $q_0$ such that it has norm $||q_0||_2 = 1$.

2. Create a loop with index $k$ such that:

3. $z_k = A q_{k-1}$

4. $q_k = \dfrac{z_k}{||z_k||_2}$

This loop continues until the vector $q_k$ converges. If it converges, it will converge to a unit vector that is a scalar multiple of the dominant eigenvector. From this, the eigenvector corresponding to the largest eigenvalue has been obtained. In this project, multiple eigenvalues are needed, not only the one acquired through the Power method.

However, it is possible to extract more than one eigenvector/eigenvalue pair from this procedure. How this is done will be explained in the upcoming section.
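The loop above translates directly into a few lines of NumPy; the convergence tolerance and iteration cap below are assumptions.

```python
# Sketch of the Power method as described above. Tolerance and the maximum
# number of iterations are illustrative assumptions.
import numpy as np

def power_method(A, tol=1e-10, max_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    q = rng.normal(size=A.shape[0])
    q /= np.linalg.norm(q)             # step 1: ||q||_2 = 1
    for _ in range(max_iter):          # step 2: loop over k
        z = A @ q                      # step 3: z_k = A q_{k-1}
        q_new = z / np.linalg.norm(z)  # step 4: q_k = z_k / ||z_k||_2
        if np.linalg.norm(q_new - q) < tol:
            q = q_new
            break
        q = q_new
    eigval = q @ A @ q                 # Rayleigh quotient estimate of the eigenvalue
    return eigval, q
```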

4.3.2 A general introduction to Krylov methods

Under the assumption that there exists an eigenvector basis for the matrix $A$, an arbitrary vector $b$ can be written as a linear combination of this basis, in combination with a residual that is outside the span of the eigenvector basis. As showcased in the previous section, multiplying this arbitrary vector with $A$ many times will magnify the component along the eigenvector corresponding to the largest eigenvalue. The idea behind Krylov methods is that since $b$ is a linear combination of the entire eigenvector basis, it should be possible to extract other eigenpairs by repeated multiplications. This motivates the construction of the Krylov subspace sequence as follows:

Start with a random vector $b$ as the first member, then construct the second member by multiplying with $A$ to get $Ab$. For the third member, the multiplication with $A$ is repeated, and the third member becomes $A^2 b$. This process is repeated $i - 1$ times:

$$\mathcal{K}_i(A, b) = \mathrm{span}\{b, Ab, A^2b, ..., A^{i-1}b\}$$

The dimension of the Krylov subspace cannot grow indefinitely, since $A$ and $b$ are of finite dimension. If $A$ is an $N \times N$ matrix, then the Krylov subspace will have at most $N$ dimensions. From this embedded space, it is possible to extract not only the largest eigenpair, but several [4]. How this is actually achieved will be the topic of the next section.
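To make the idea concrete, a Krylov basis can be built by repeated multiplication and orthonormalized, after which eigenvalue estimates are obtained from the small projected matrix (the Rayleigh–Ritz step of Subsection 4.3.3). This is a naive sketch for a symmetric matrix, not how the solver in [9] is actually implemented, and the subspace size is an assumption.

```python
# Naive sketch of a Krylov subspace with a Rayleigh-Ritz step for a symmetric
# matrix A. Real solvers (e.g. the Lanczos-based solver referred to as [9])
# are far more careful about orthogonality and restarts; i = 20 is an assumption.
import numpy as np

def krylov_ritz(A, i=20, seed=0):
    rng = np.random.default_rng(seed)
    b = rng.normal(size=A.shape[0])
    # Build K_i(A, b) = span{b, Ab, ..., A^(i-1) b}
    K = np.empty((A.shape[0], i))
    K[:, 0] = b / np.linalg.norm(b)
    for j in range(1, i):
        K[:, j] = A @ K[:, j - 1]
    Q, _ = np.linalg.qr(K)        # orthonormal basis for the Krylov subspace
    # Rayleigh-Ritz: eigenvalues of the small projected matrix approximate
    # several eigenvalues of A, not only the largest one
    T = Q.T @ A @ Q
    return np.linalg.eigvalsh(T)  # Ritz values
```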

As stated, we are using [9] to solve the eigenvalue problem. This solver is using the Lanczos algorithm, which is a procedure for finding the eigenpairs of large, sparse, symmetric matrices.

References

In this thesis we investigated the Internet and social media usage for the truck drivers and owners in Bulgaria, Romania, Turkey and Ukraine, with a special focus on