
Working with emotions

Recommending subjective labels to music tracks using machine learning

JOHAN BRODIN

JBRODI@KTH.SE

Master’s Thesis in Computer Science

School of Computer Science and Communication (CSC)
Royal Institute of Technology, Stockholm

Supervisor: Erik Isaksson
Examiner: Viggo Kann

Project provider: Soundtrack Your Brand
Supervisor at Soundtrack Your Brand: Carl Almgren

December 26, 2016

Abstract

Curated music collections are a growing field as a result of the freedom and supply that streaming music services like Spotify provide us with.

To be able to categorize music tracks based on subjective core values in a scalable manner, this thesis has explored whether recommending such labels is possible through machine learning.

When analysing 2464 tracks, each labeled with one or more of 22 different core values, a profile was built up for each track from features in three different categories: editorial, cultural and acoustic. When classifying the tracks into core values, different methods of multi-label classification were explored. By combining five different transformation approaches with three base classifiers and using two algorithm adaptations, a total of 17 different configurations were constructed. The different configurations were evaluated with multiple measurements including (but not limited to) Hamming Loss, Ranking Loss, One error, F1 score, exact match and both training and testing time.

The results showed that the problem transformation algorithm Label Powerset together with Sequential minimal optimization outperformed the other configurations. We also found promising results for neural networks, something that should be investigated further in the future.


Referat (Swedish abstract)

Arbeta med känslor: Rekommendation av subjektiva etiketter till musikspår med hjälp av maskininlärning (Working with emotions: Recommending subjective labels to music tracks using machine learning)

Curated music collections are a growing field as a direct consequence of the freedom that streaming music services like Spotify give us. To be able to categorize tracks based on subjective values in a scalable way, this thesis has investigated whether recommending such labels is possible through machine learning.

When 2464 tracks with one or more of 22 different core values were analysed, a profile was built up for each track from attributes in three different categories: editorial, cultural and acoustic. When classifying the tracks, several different methods for multi-label classification were examined. By combining five different transformation methods with three base classifiers and using two algorithm adaptations, a total of 17 different configurations were constructed. The different configurations were evaluated with several different metrics, including (but not limited to) Hamming Loss, Ranking Loss, One error, F1 score, exact match and both training time and testing time.

The results showed that the transformation algorithm Label Powerset together with Sequential Minimal Optimization outperformed the other configurations. We also found promising results for artificial neural networks, something that should be investigated further in the future.

Contents

1 Introduction
1.1 Background and purpose
1.2 Research question and area of study
1.3 Ethical and various sustainability perspectives

2 Previous work
2.1 Multi-label classification with machine learning
2.2 Problem transformation
2.2.1 Copy and select transformations
2.2.2 Label Powerset
2.2.3 Binary Relevance
2.2.4 Ranking by pairwise comparison
2.3 Algorithm adaptation
2.3.1 K-nearest neighbors
2.3.2 Decision trees
2.3.3 Neural networks
2.4 Working with music and its metadata
2.4.1 Editorial metadata
2.4.2 Cultural metadata
2.4.3 Acoustic metadata
2.4.4 Feature selection
2.5 Evaluations and measurements

3 Method
3.1 Data set
3.2 Features from Spotify
3.2.1 Audio features
3.2.2 Editorial and cultural metadata
3.3 Acoustic features via Essentia
3.3.1 Low level features
3.3.2 Rhythmic features
3.4 Algorithm choice and setup
3.5 Evaluation and metrics

4 Results
4.1 Table over metrics per configuration
4.2 Table over training and testing time
4.3 Table over average per problem transformation
4.4 Table over average per base classifier

5 Discussion
5.1 Evaluation of algorithms
5.2 Data set and features

6 Conclusion
6.1 Future work and improvements

Bibliography

Chapter 1

Introduction

“Music is the shorthand of emotion.”

Leo Tolstoy

The streaming music services of today make music more accessible for customers than ever before, but with all these possibilities a great deal of responsibility is left to the end user. A classic way of consuming music is listening to the radio or putting a CD, cassette or vinyl record in your player – all curated by someone: radio hosts, artists or label companies. Therefore one might not be surprised that curated collection of music is a growing field and that proper tools for curating are of interest.1,2 Music expresses emotions and emotions can in turn be caused by music – the connection is undeniable. Taste and preferences aside – what makes a certain track inflict a certain emotion or set a specific mood? Why does, for instance, the sound of ocean waves make people calm, while distinct rhythmic drums make other people's hips move?

1.1 Background and purpose

Soundtrack Your Brand is a company that provides a unique streaming music service which enables other companies (or brands) to play music in their establishments.

They have hardware players, a web interface and mobile applications which allow the users to control and schedule the music. Aside from the actual playing of music and managing different locations, they help customers find their own brand sound, their "soundtrack". The music curators are a group of talented musicians, DJs and music experts who identify music that expresses a certain company's core values, e.g. mature, inclusive or energetic. Company core values are often very subjective in nature and harder to measure than common music attributes, e.g. length or beats per minute (BPM). The company envisions a platform where their curators can collaborate in organizing music and creating soundtracks for future brands. This is the entry point for this thesis project. The goal for Soundtrack Your Brand is to have a prototype of the curation tool in place that will assist the curators in their everyday work. The thesis will, however, focus mainly on enhancing the process of assigning core values to a music track by presenting the curator with suggestions based on machine learning. The idea is that this will improve speed, accuracy and consistency when curators organize the music. If this thesis can help improve a tool for curating, the news value for companies like Spotify3 (that also work with curated music) might be significant.

1http://www.npr.org/sections/therecord/2015/06/01/411119372/how-streaming-is-changing-music (December 26, 2016)

2https://digit.hbs.org/submission/the-currency-of-spotify-how-it-changed-the-way-we-discover-music/ (December 26, 2016)

1.2 Research question and area of study

The area of research this thesis goes into aims to answer the following question:

What is an efficient and accurate way of recommending subjective core values to a music track, through machine learning?

This entails finding a supervised machine learning approach that, based on the features of a given music track, gives accurate recommendations of which core values should be associated with it. Since a track is not restricted to just one core value, a learning algorithm that supports multi-labeling must be found – separating this from the more classic binary classification problem. Furthermore, extracting features from a music track can be done in multiple ways. One could look at:

• Facts published by the label company (track name, lyrics, etc.)

• Measurable values (BPM, length, etc.)

• Calculated metrics and acoustic fingerprints of the track (danceability, spectral energy, etc.)

• Crowdsourced data (reviews, genre, etc.)

3https://www.spotify.com/ (December 26, 2016)


By evaluating several multi-label machine learning algorithms the goal is to find an approach that supplies meaningful recommendations to the curators. Rather than finding a silver bullet that outperforms any other approach for a multi-label problem, this thesis aims to find something that works for music and subjective labels. By using tracks already in use by Soundtrack Your Brand together with their associated core values (supplied by music curators), the models can be trained and later evaluated. The hypothesis is that by supplying recommendations of core values, the speed and accuracy of the curator's task of categorizing tracks will be improved. When a tool is available in which the recommendations are presented, it will also serve as a source of training data for future improvements of the model.

1.3 Ethical and various sustainability perspectives

Regarding pure economic sustainability, the results of this thesis might lead to a more efficient and scalable workflow when curating music collections. This is of great interest to other companies that work with curated music (e.g. Spotify or Apple Music4). Working with subjective values is not unique to music and can be applied in multiple other domains (art, photos and movies), so any findings that are not feature specific can be generalized.

The ecological sustainability perspective is of little importance for this thesis – mainly because of the limited effects on the surrounding environment. One could argue that finding efficient algorithms would decrease training and testing time and by extension shrink the ecological footprint. However, the scope of the problem makes this negligible. There is little room for ethical issues in this thesis since all the data is anonymous. The only discriminating outcome of the project would be if some artists were constantly favoured over others, since this could in the long run lead to skewed play counts and by extension a skewed income distribution for the artists based on the algorithm's decisions.



Chapter 2

Previous work

Previous work is discussed in this chapter as a foundation for the thesis. The chapter will cover an introduction to multi-label classification with machine learning, previously used methods, an overview of the different metadata available about a certain music track and the measurements used to evaluate the algorithms' performance.

2.1 Multi-label classification with machine learning

Within the area of machine learning, single-label classification is a learning problem where the task is to learn to classify previously unseen instances based on a set of training instances. Each instance is associated with a unique class (or label) from a set of labels L. Depending on the number of classes in L, the task is either called binary classification (when |L| = 2) or multi-class classification (when |L| > 2) [23]. A classic example of binary classification is to identify spam mail in an email inbox. An example of multi-class classification would be to determine which handwritten number (0–9) an image represents. However, since a given music track may be associated with multiple core values at the same time, a binary or multi-class classification approach will not suffice. Multi-label classification allows the instances to be associated with more than one class. In other words, a multi-label classifier will learn from a set of instances where each instance might belong to one or more classes in L.

Multi-label classification algorithms are often grouped into two categories: problem transformation and algorithm adaptation. Problem transformation transposes the problem to multiple single-label classification tasks and by training a set of any single-label (or base) classifiers the problem can be solved. Algorithm adaptation is based on single-label classification approaches that are adapted to handle multi-label data directly and thereby address the full problem [16].

The following sections will give an overview of the two groups of multi-label classification algorithms. For the formal description of these methods the following notation will be used:

L = {λj : j = 1, ..., q} to denote the finite set of labels in a multi-label learning task.

D = {(xi, Yi) : i = 1, ..., m} to denote a set of multi-label training examples, where xi is the feature vector and Yi ⊆ L the set of labels of the i-th example.

The description is borrowed from the Data Mining and Knowledge Discovery Handbook.

2.2 Problem transformation

There exist multiple quite trivial problem transformations that are used to convert the multi-label data set to a single-labeled one while using the same set of labels.

Using a single-label classifier that can output a probability distribution over all the labels, the ranking between them can be learned. Using this probability and some threshold, the label set for a new instance can be determined. This section will first introduce six simple and easy-to-understand methods, and then continue with three more elaborate transformation methods: Label Powerset, Binary Relevance and Ranking by Pairwise Comparison.

2.2.1 Copy and select transformations

Copy transformation entails that each instance (xi, Yi) in the training set D is replaced with |Yi| examples (xi, λj), one for each λj ∈ Yi. In other words, each label of a given instance is separated and treated as an independent label in a single-label classification task. An extension of this is the copy-weight transformation, which associates a weight of 1/|Yi| with each newly copied label/attribute combination. This is used to focus on classifying the single-labeled instances correctly. Another family of simpler transformation methods is the select family, which simply replaces Yi with just one of its members, thus reducing it to a single-label classification task. There exist three common strategies for choosing which label to pick: the most (select-max) or least (select-min) frequent label in the training set, or a randomly selected one (select-random). The last, and most drastic, approach is to simply ignore all multi-labeled instances and keep only the single-label instances [26].
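To make these transformations concrete, here is a minimal sketch (not code from the thesis) of the copy, copy-weight and select-max transformations applied to a toy multi-label data set; the example labels and feature values are made up.

```python
from collections import Counter

# Toy multi-label training set: (feature_vector, label_set)
D = [([0.1, 0.3], {"Calm", "Mature"}),
     ([0.9, 0.2], {"Energetic"}),
     ([0.4, 0.8], {"Calm", "Dreamy", "Elegant"})]

def copy_transform(D):
    # One single-label example per (instance, label) pair.
    return [(x, label) for x, Y in D for label in Y]

def copy_weight_transform(D):
    # Same as copy, but each copy carries weight 1/|Y_i|.
    return [(x, label, 1.0 / len(Y)) for x, Y in D for label in Y]

def select_max_transform(D):
    # Keep only the label that is most frequent in the whole training set.
    freq = Counter(label for _, Y in D for label in Y)
    return [(x, max(Y, key=lambda l: freq[l])) for x, Y in D]

print(copy_transform(D))
print(copy_weight_transform(D))
print(select_max_transform(D))
```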

(12)

CHAPTER 2. PREVIOUS WORK

2.2.2 Label Powerset

Label Powerset (LP) is a pragmatic problem transformation that considers each unique set of labels that exists in the multi-label training set as one class in a new single-label classification space. In other words, given a new instance, the single-label classifier will output the most probable class, which in reality represents a set of labels in the multi-label classification task. If the classifier can determine the probability distribution over all the classes, one could use this to calculate a weighted probability for each original label, as in the following example:

There are four labels A, B, C and D which are transformed into three classes with label sets X(AB), Y (CD) and Z(AC). If the probability distribution for a given instance is X(0.7), Y (0) and Z(0.3) the weighted probability for the labels will be A(1.0), B(0.7), C(0.3) and D(0.0).
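As a minimal illustration (not code from the thesis), the weighted label probabilities in the example above can be computed by summing, for each original label, the probabilities of the LP classes whose label sets contain it:

```python
# LP classes mapped to their original label sets, and a predicted
# probability distribution over those classes for one instance.
classes = {"X": {"A", "B"}, "Y": {"C", "D"}, "Z": {"A", "C"}}
probs = {"X": 0.7, "Y": 0.0, "Z": 0.3}

def weighted_label_probs(classes, probs):
    labels = set().union(*classes.values())
    return {l: sum(p for c, p in probs.items() if l in classes[c])
            for l in labels}

print(weighted_label_probs(classes, probs))
# {'A': 1.0, 'B': 0.7, 'C': 0.3, 'D': 0.0} (up to dict ordering)
```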

One should be aware that this makes the computational complexity of LP dependent on the number of classes, which is equal to the number of unique label sets in the training set. The upper bound is determined either by the total number of instances in the training set (m), if each has a unique label set, or by 2^q, where q is the number of unique labels. This is rarely an issue, but it can cause poor performance for larger values of m and q. The single-label classification task can experience problems with the large number of classes that occur and the tendency for few examples to be associated with each of them. Two known extensions are the Pruned Problem Transformation (PPT) and the Random k-labelset (RAkEL) method [26, 27].

2.2.3 Binary Relevance

Binary Relevance (BR), also known as "one vs rest", is a straightforward and popular problem transformation method. Because of its simplicity and the fact that it has historically proven its efficiency, it is often used as a baseline when comparing with new multi-label methods [15]. BR decomposes the problem into a set of q binary classification tasks, one for each different label in L. This is done by transforming the original training set into q data sets where each set contains all the instances from the original set, labeled either positive or negative based on whether the label is present or not. When classifying a new instance, BR will output the union of the labels that are predicted by the q classifiers [26].

The main disadvantage of BR is that it assumes that the labels are independent and it might therefore fail to learn and predict a label combination if such a dependency exists. On the other hand, BR has at least two advantages: the complexity is linear in the number of labels and the work of the classifiers is easy to parallelize [15].
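Since BR is equivalent to "one vs rest", the idea can be sketched with scikit-learn's OneVsRestClassifier over a binary label-indicator matrix. This is only an illustration of the transformation, not the MEKA BR configuration evaluated later in the thesis, and the toy data is made up.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Toy data: 4 instances with 3 features, 3 labels (binary indicator matrix).
X = np.array([[0.1, 0.3, 0.5],
              [0.9, 0.2, 0.1],
              [0.4, 0.8, 0.3],
              [0.7, 0.6, 0.9]])
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 1]])

# One binary classifier per label; the prediction for a new instance is the
# union of the labels whose classifier outputs "positive".
br = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(br.predict(X))
```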

2.2.4 Ranking by pairwise comparison

The key idea behind Ranking by Pairwise Comparison (RPC) is to transform the multi-label data set into q(q − 1)/2 binary label data sets, in other words, one binary data set for each pair of labels (λi, λj), 1 ≤ i < j ≤ q. Each of these data sets contains the instances from the training data D that have either the label λi or λj, but not both, in their label set. For each data set a binary classifier, Mij, is trained and thereby learns to separate the objects with the label λi from those having the label λj. Given a new instance, a query is submitted to all models Mij and their combined predictions result in the final output. In the simplest version of RPC each prediction of a model Mij is interpreted as a vote for either λi or λj and the ranking is obtained by counting the votes for each label [14]. RPC has proven to perform better than BR, since it is believed to profit from the simpler decision boundaries in the subproblems [7, 12].
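A minimal sketch (not thesis code) of the voting step: given hypothetical predictions of the q(q − 1)/2 pairwise models for one instance, the ranking is obtained by counting votes per label. The label names and votes below are made up.

```python
from collections import Counter

labels = ["Calm", "Energetic", "Mature", "Youthful"]

# Hypothetical pairwise predictions M_ij for one instance: each entry maps
# a label pair to the label the corresponding binary model voted for.
pairwise_votes = {("Calm", "Energetic"): "Calm",
                  ("Calm", "Mature"): "Mature",
                  ("Calm", "Youthful"): "Calm",
                  ("Energetic", "Mature"): "Mature",
                  ("Energetic", "Youthful"): "Youthful",
                  ("Mature", "Youthful"): "Mature"}

votes = Counter(pairwise_votes.values())
ranking = sorted(labels, key=lambda l: votes[l], reverse=True)
print(ranking)  # ['Mature', 'Calm', 'Youthful', 'Energetic']
```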

Calibrated Ranking by Pairwise Comparison (or Calibrated Label Ranking (CLR)) extends RPC by introducing an additional virtual label. The additional label in each example will act as a natural breaking point between the relevant and the irrelevant label sets [8].

2.3 Algorithm adaptation

In this section three of the classic single-label machine learning algorithms and their adaptations will be presented.

2.3.1 K-nearest neighbors

There are a number of adaptations based on the traditional k-nearest neighbors (kNN) learning algorithm. All adaptations use the pattern of retrieving the k nearest instances for a new instance and are distinguished by how they evaluate the label sets of these neighbors [16]. One example is the multi-label k-nearest neighbor (ML-kNN) algorithm, which utilizes the maximum a posteriori principle based on statistical information (prior and posterior probabilities) for the frequency of each label within the k nearest neighbors of the test instance [31].

(14)

CHAPTER 2. PREVIOUS WORK

2.3.2 Decision trees

One of the original and (what is considered) simple decision tree algorithms is the ID3 algorithm. ID3 uses the classic information gain when branching the tree, and the growing continues until all instances share a single value of the target feature or until the best information gain is less than or equal to zero. ID3 only supports Boolean attributes, cannot handle missing values and does not apply pruning in any part of the learning process [20]. A natural evolution of ID3 is the C4.5 algorithm, which rather than information gain uses gain ratio as its splitting criterion. The branch stopping criterion is based on the number of instances to be split and not on the ratio itself. When the growing phase is complete, a pass of error-based pruning is run. Numeric attributes are handled by being transformed into logical statements, and missing values in the training set are handled by using a corrected gain ratio [21]. Multi-Label C4.5 (ML-C4.5) is in turn an adaptation of the C4.5 algorithm to support multiple labels in the leaves of the tree. This was accomplished by modifying the formula for the calculation of entropy, summing the entropies for each individual class label in a computationally efficient way [5].
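For reference, the multi-label entropy used by ML-C4.5 is usually stated as a sum of the binary entropies of the individual labels. The formula below follows the common formulation attributed to Clare and King [5], stated here from memory, so the exact notation may differ slightly from the original paper:

```latex
\mathrm{entropy}(D) = -\sum_{j=1}^{q} \Big( p(\lambda_j)\log p(\lambda_j) + \big(1 - p(\lambda_j)\big)\log\big(1 - p(\lambda_j)\big) \Big)
```

Here p(λj) is the relative frequency of label λj among the instances in the set D being split.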

2.3.3 Neural networks

The name "neural networks" originates from biological nervous systems, and a neural network is intended to simulate how objects of the real world interact with such a system. One popular adaptation is the feed-forward neural network, which has neurons arranged in layers. The first layer accepts the input, while the last layer produces the output. The middle layers have no connection with the real world – and are therefore often called hidden layers. The layers are arranged so that each neuron in one layer is connected to neurons in the next layer, but there is no connection between neurons in the same layer. The name feed-forward describes how information constantly flows in one direction – from one layer to the next. The different parameters and weights of such a network are learned by minimizing an error function over the training instances. The most popular approach (and the reason for neural networks becoming viable and popular) to minimizing the error function is the backpropagation algorithm, which uses gradient descent to update the parameters of the network by propagating the errors of the output layer successively back to the hidden layers. Backpropagation for Multi-label Learning (BP-MLL) was the first multi-label neural network algorithm and, as the name implies, is derived from the single-label backpropagation algorithm by replacing its error function. The proposed error function intends to capture the characteristics of multi-label learning, i.e. labels belonging to an instance should be ranked higher than those not belonging to that instance [30].
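For completeness, the pairwise ranking-based error that BP-MLL minimizes is commonly written as below. This follows the usual formulation attributed to [30] and is stated from memory, so consult the original paper for the exact normalization; c_k^i denotes the network output for label k on instance i and Ȳ_i is the complement of Y_i:

```latex
E = \sum_{i=1}^{m} \frac{1}{|Y_i|\,|\bar{Y}_i|} \sum_{(k,l)\in Y_i \times \bar{Y}_i} \exp\!\big(-(c_k^i - c_l^i)\big)
```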


2.4 Working with music and its metadata

Classification in machine learning utilizes a data set which is built up of features that describe an object and labels that define it. By learning from these examples, a new instance can be broken down into its describing features and later be classified. In order to successfully identify the (often emotional) core values used by Soundtrack Your Brand in music, one must be able to identify which attributes (or features) of a track make it belong to a certain set of labels.

According to Pachet [19] the sound of music itself is probably not a form of knowledge, but knowledge about music (also called metadata) could be used. The author tried to categorize the metadata into three areas, and even though he himself realized that the border between them is fuzzy, it makes a natural starting point. The three areas are editorial, cultural and acoustic metadata.

2.4.1 Editorial metadata

Editorial metadata is information related to a track that is released by the official editor, artists or label company. This includes, but is not limited to: track title, name of artists, album name, date of recording, artist biography, lyrics and genre information. This form of metadata is far from objective, since it is often released by the label company when promoting itself. Furthermore, any genre classification or artist biography will often be influenced by the cultural background of the editor [19].

Baumann and Halloran [1] use a straightforward method called Term Frequency–Inverse Document Frequency (TF-IDF) to turn this information into learnable features. TF-IDF is often used in information retrieval to reflect how important a word is to a document in a collection. Their goal is to find similar tracks, so they simply calculate the TF-IDF values for each word in the track's lyrics, build a vector and then compute the cosine measure (or angle) between two of those vectors. A shorter distance between two vectors then indicates two more similar tracks.
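A minimal sketch of this lyrics-similarity idea using scikit-learn (not the setup from [1]; the lyrics strings are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder lyrics for three tracks.
lyrics = ["love me tender love me true",
          "i love you and you love me",
          "smoke on the water fire in the sky"]

# One TF-IDF vector per track; cosine similarity close to 1 means similar lyrics.
tfidf = TfidfVectorizer().fit_transform(lyrics)
print(cosine_similarity(tfidf))
```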

Thanh and Shirai [24] try to determine the mood of a track by using mainly editorial metadata – lyrics, title and artist. They identified many key features, but in general they used TF-IDF to determine the importance of a word. One other method they used was to identify sentimental words like love in "I love you", but also negations as in "I don't love you". They also tried to find modifiers like "I love you very much" by utilizing a dictionary that returns the level of positivity/negativity for each word. Furthermore, they realized that the different parts of a track (title, introduction, verse, chorus, bridge and outro) carry different amounts of importance and gave them different weights. They also identified that each band/artist often produces tracks in the same mood category and therefore weighted it quite high.

The conclusion was that only using editorial metadata is not good enough for mood classification in a real music search engine system. The two main reasons stated were that mood is subjective and that lyrics are both short and contain many metaphors whose meaning is hard to grasp. However, it showed that using the artist name and sentiment words and putting more weight on words in the chorus and title improved the performance of mood classification [24].

Meyers [17] also utilizes lyrics to classify mood in music and concludes that the combination of audio and lyrical features is a crucial factor in a mood classification system. According to him, previous attempts on similar problems have relied solely on acoustic metadata and thereby lost contextual information – information which, according to the author, enables the system to achieve higher accuracy when classifying mood. Hu et al. [13] also showed that the combination of audio and lyrics features improved performance in many mood categories, but not all of them. Finally, Howard et al. [11] raise serious concerns regarding the significant challenges in preprocessing multilingual text. According to them, regular techniques like stemming and stopword removal can do more harm than good when treating such data.

2.4.2 Cultural metadata

Cultural knowledge (or metadata) is defined as something that is produced by the community, environment or culture around an object. Contrary to editorial metadata, this information is not explicitly edited in a system by experts, but is rather information that implicitly grows around an object. The idea is to look at the surrounding content of an object – rather than the object itself – and by using different distance measurements one can determine which objects are related. One must often process enormous amounts of data; therefore sophisticated information retrieval techniques are essential [19].

Baumann and Halloran [1] computed the cultural similarity between two tracks by performing a Google search for the musical work. The first 50 search results were downloaded, parsed and used to build a TF-IDF-weighted vector space model. The model will contain the most relevant terms that describe the track and by using the cosine similarity method one can compute the similarity between two entities.

Unfortunately, Google closed down the API1 that made it possible to do full web searches after September 29, 2014, but the techniques are transferable to other search engines.

Baumann and Hummel [2] later did something similar in their experiment, but had a more advanced preprocessing step where they tried to remove "non-review" parts of the documents, e.g. advertisements. They also enriched the query by adding terms like "music" or "review" to yield more relevant results. Furthermore, they used part-of-speech taggers to remove language noise. As the name indicates, the technique tries to assign a part-of-speech tag to every word in the text, separating the verbs from the nouns and so on. Whitman [28] did something similar, but also applied much more advanced natural language processing (NLP), which improved their results.

1https://developers.google.com/web-search/docs/ (December 26, 2016)

Berenzweig et al. [3] used another source of cultural metadata: human-authored music collections. The first kind of collection was playlists, and the crude but useful assumption was that playlists contain similar music. Using online collections of playlists, they managed to convert the data into a similarity matrix where a cell represents the joint probability that two artists occur in the same playlist. The second kind of collection was entire music collections that were gathered from popular music sharing services, and the assumption and method were similar to those for playlists.

They concluded that the most useful ground truth they produced was from analyzing music collections, and not from acoustic features, surveys, expert opinions or web text analytics. They did, however, mention that combining information from different sources was promising but needed more investigation. Whitman [28] used a similar approach in his paper, but used a much more sophisticated method to deal with very popular artists that skewed the similarity matrix. Both of these experiments were done at the artist level, and the authors conclude that this could be limiting when a band releases a track that deviates from its normal sound.

2.4.3 Acoustic metadata

Acoustic metadata is defined as information gained through analyzing the audio signal of a track. By ignoring editorial and cultural aspects of the music piece, the idea is that the information will be purely objective when describing the characteristics of the sound of the track [19].

There exist multiple tools to extract these kinds of features from a given music file, and one of them, the Marsyas tool, was used by Trohidis et al. [25] to extract 72 different acoustic features. These were divided into two categories: rhythmic and timbre features. For the rhythmic features the paper used a beat histogram to calculate multiple different BPM values, for instance the two highest peaks in BPM, but also the histogram bins between 40-90, 90-140 and 140-250 BPM, respectively. The timbre features were more complex in nature. For instance, they used Mel Frequency Cepstral Coefficients (MFCC), which are commonly used for speech recognition and music modeling. Furthermore, they used three other features that were extracted from the Short-Time Fourier Transform (STFT): spectral centroid, spectral rolloff and spectral flux. These are calculated per frame, where a frame is a time slice of the signal (often no longer than one second each). For all the timbre features the mean, standard deviation, mean standard deviation and standard deviation of standard deviation were calculated over all frames and used as features.

Berenzweig et al. [3] mention that features derived from MFCCs have shown very promising performance for multiple audio classification tasks and are often favored by groups working with audio similarity. However, since MFCCs are a purely local feature calculated over a small window in time, they cannot capture information about melody or long-term song structure. The authors had some solutions to this using clustering, but they did not yield any significant improvements. They also presented another interesting idea, namely to use features in an "anchor space" derived from the MFCC features. They built a 12-class neural network to discriminate between 12 genres and two 2-class neural networks to recognize Male/Female (gender of the vocalist) and Lo/Hi fidelity (quality of sound). The outputs from these networks were then used as features for the similarity measurement.

Meyers [17] uses five main musical features: mode, harmony, tempo, rhythm and loudness. These were chosen for two reasons: they have been shown to convey the emotional meaning in music [4] and they can be extracted with relative ease. Han et al. [10] used a similar set of features: scale (key, mode and tonality), average energy (a measurement of loudness), rhythm (average and standard deviation of a BPM interval) and harmonics. Wieczorkowska et al. [29] used 29 different acoustic features: the dominating fundamental frequency, the maximal level of sound, tristimulus (three coefficients), the contents of even and odd harmonics, the brightness of sound, the irregularity of the spectrum, the 10 most prominent frequency peaks and the amplitude measured in decibels for each of those peaks.

2.4.4 Feature selection

Feature selection is when one selects only a subset of relevant features to use in the model and training of the machine learning algorithm. There are three main reasons to use feature selection: to simplify the interpretation of the model, to shorten training time and to reduce overfitting for improved generalization. Fiebrink and Fujinaga [6] have shown that feature selection using Principal Components Analysis (PCA) does not always improve accuracy, but might significantly reduce training and testing time.

2.5 Evaluations and measurements

When evaluating machine learning classifiers it is important to use multiple and meaningful metrics to ensure an overall good performance. Trohidis et al. [25] use an extensive number of measurements in their evaluation of different multi-label algorithms, eight of which are listed below.

• Hamming Loss measures accuracy by calculating the fraction of wrong labels to the total number of labels. Since this is a loss function, the optimal value is zero [31] (see the sketch after this list).

• F1 score is the weighted average of precision and recall, which reaches its best value at one and its worst at zero. Two versions of the F1 score are Micro F1 and Macro F1, where Micro F1 is a global average (counting the total true positives, false negatives and false positives) and Macro F1 calculates the value for each label and then takes their unweighted mean [27].

• Area Under ROC (Receiver Operating Characteristic) Curve (AUROC) shows the trade-off between the true positive rate and the false positive rate. A higher value (close to one) is preferred, since then mostly true positives are obtained. Just like the F1 score, this can be calculated either as Micro AUC or Macro AUC [23].

• One-error evaluates how often the highest-ranking label is not in the set of relevant labels. The performance is perfect when the score is 0, i.e. the smaller the value, the better the performance [31].

• Coverage evaluates how deep one must, on average, go down the ranked list of labels in order to find all the relevant labels for an instance [26].

• Ranking loss describes, on average, the number of times that irrelevant labels are ranked higher than relevant ones. Since this is a loss function, the optimal value is zero [26].

• Average precision computes, for each relevant label, the proportion of relevant labels that are ranked before it, averaged over the number of relevant labels. A higher value means better performance and a value of 1 means perfect performance [23].

• 0/1 loss (or exact match) is one of the strictest metrics since it completely ignores the fact that a multi-label prediction has a notion of being partially correct and counts only exact matches (all labels in the set for an instance are correctly predicted). 0/1 loss has its optimal value at 0 and, since exact match is simply its inverse, exact match has its optimal value at 1 [23].
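To make a few of these metrics concrete, here is a small sketch (not from the thesis) computing Hamming Loss, exact match and one-error on a toy prediction; the true labels, predictions and scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import hamming_loss, accuracy_score

# Toy ground truth and predictions for 3 instances and 4 labels.
Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])
# Real-valued label scores, used for the ranking-based metrics.
scores = np.array([[0.9, 0.2, 0.4, 0.1],
                   [0.3, 0.8, 0.1, 0.2],
                   [0.7, 0.4, 0.6, 0.8]])

print("Hamming loss:", hamming_loss(Y_true, Y_pred))   # fraction of wrong labels
print("Exact match:", accuracy_score(Y_true, Y_pred))  # subset accuracy

# One-error: how often the top-ranked label is not a relevant one.
top = scores.argmax(axis=1)
one_error = np.mean([Y_true[i, top[i]] == 0 for i in range(len(Y_true))])
print("One-error:", one_error)
```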

Furthermore, the authors evaluated CPU time consumed during the training, parameter selection and testing phases. This can be used to find trade-offs between different algorithms – maybe a longer training time will lead to a shorter testing time?

Lastly, they calculated the accuracy for each label, as if the labels were independently predicted, to identify which labels were generally harder to predict and which algorithms performed well for those labels.


Chapter 3

Method

This chapter contains the methodology used in the project to answer the research question at hand. By extracting features from a track and running it through different types of multi-label classifiers for evaluation, this thesis aims to find an efficient and accurate way of recommending its core values. The chapter is organized according to the classic machine learning approach: data set retrieval, feature extraction, preprocessing, training of the classifier and evaluation of the result.

All necessary code is written in Python using the Jupyter Notebook1 – which is ideal for interactive, exploratory and iterative research tasks. Python is chosen as a programming language based on its wide range of available scientific libraries.

Scikit-learn2, NumPy3, Pandas4 and Essentia5 are just a few of the open source libraries utilized in this project. However, when it comes to multi-label classification algorithms, scikit-learn implements only a small number compared to Mulan6 and MEKA7 (a Multi-label Extension to WEKA8, the Waikato Environment for Knowledge Analysis) – two popular Java libraries specifically for multi-label classification. Since MEKA both contains a more extensive library and provides a wrapper for Mulan, it is the natural choice for this project, which aims to compare several multi-label classification algorithms.

1http://jupyter.org/ (December 26, 2016)

2http://scikit-learn.org/ (December 26, 2016)

3http://www.numpy.org/ (December 26, 2016)

4http://pandas.pydata.org/ (December 26, 2016)

5http://essentia.upf.edu/documentation/ (December 26, 2016)

6http://mulan.sourceforge.net/ (December 26, 2016)

7http://meka.sourceforge.net/ (December 26, 2016)

8http://www.cs.waikato.ac.nz/ml/weka (December 26, 2016)


3.1 Data set

The data set is based on a number of Spotify playlists – one for each of the core values that the algorithm is about to learn. The playlists are constructed by the curators in the content team at Soundtrack Your Brand as samples for a specific core value. Some of the tracks exist on multiple playlists and thereby represent more than one core value. However, the core values come in pairs that are mutually exclusive, e.g. Calm and Energetic or Inclusive and Exclusive. Two of the core values ('International' and 'National') had to be discarded from the project since they had only a handful (less than 5) of tracks available in their respective playlists – which would have led to serious overfitting. In total the data set consists of 2464 tracks and each track is associated with one or more of the 22 core values (or labels). The distribution among the labels can be viewed in Table 3.1.

Label Count Label Count Label Count

Calm 125 Elegant 125 Serious 125

Energetic 124 Rugged 124 Easy-going 125

Youthful 125 Conventional 124 Discreet 125

Mature 124 Forward-thinking 124 Expressive 125

Modern 123 Down-to-earth 125 Human 124

Traditional 125 Dreamy 125 Technological 125

Inclusive 125 Careful 125

Exclusive 125 Provocative 125

Table 3.1. Table showing the distribution of tracks among the different core values/labels.

The tracks in the playlists are obtained through the "Spotify Web API"9, which is a REST API that returns metadata in JSON format regarding artists, albums and tracks directly from the Spotify catalogue. The API is accessed through the open source Python library "Spotipy"10. The initial data set is created with each track's unique Spotify identifier (track id), a URL to a 30-second preview music file (preview URL) and its corresponding core values (labels). All feature extraction is done in Python, mainly using scikit-learn, and the result is exported to a file format (.arff) readable by MEKA, where the algorithms are trained and evaluated. The following subsections describe how the different features for each track were extracted.

9 https://developer.spotify.com/web-api/ (December 26, 2016)

10https://spotipy.readthedocs.io (December 26, 2016)
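A sketch of this retrieval step is shown below. It assumes the Spotipy client-credentials flow and a placeholder playlist id; the exact method names (playlist_items, audio_features) can differ slightly between Spotipy versions, so treat this as illustrative rather than the actual script used in the thesis.

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Placeholder credentials and a placeholder playlist id for one core value.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))
PLAYLIST_ID = "placeholder_playlist_id_for_calm"

# Collect (track id, preview URL) pairs and the core value label per track.
rows = []
results = sp.playlist_items(PLAYLIST_ID)
for item in results["items"]:
    track = item["track"]
    rows.append({"track_id": track["id"],
                 "preview_url": track["preview_url"],
                 "labels": {"Calm"}})

# Audio features (tempo, key, valence, ...) can be fetched in batches of ids.
features = sp.audio_features([r["track_id"] for r in rows])
```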


3.2 Features from Spotify

Through their API Spotify offers a wide range of metadata about a given track. A total of 16 different values were used and with preprocessing (binary categories and TF-IDF) they were turned into 130 features.

3.2.1 Audio features

For each track a set of audio features can be obtained through the API – these are a mixture of high and low level features, measured values and calculated metrics or plain information reported by the label company. The features added to the data set are listed with their explanations below.

• The track’s duration measured in milliseconds.

• The averaged estimated tempo of the track measured in BPM.

• Which key the track is in and the keys are represented by an integer between 0 and 11 (using the standard Pitch Class notation11).

• The time signature, or meter, is the number of beats in each bar of the melody. It is a notational convention used when describing music.

• The mode indicates whether the melodic content of a track is derived from the major or the minor modality scale. The mode is represented by 1 for major and 0 for minor.

• The valence is a measurement, ranging from 0.0 to 1.0, that describes the overall positiveness the track conveys. A higher value indicates a more positive track, e.g. happy, while tracks with a lower score are experienced as more negative, e.g. sad.

• Danceability describes how suitable a track is for dancing based on a few key musical elements. It ranges from 0.0 to 1.0, where 1.0 means the track is highly danceable.

• Energy is a measure from 0.0 to 1.0 that tries to capture a perceptual measure of intensity and activity. A higher value is given to, for example, death metal, while a Bach prelude would get a much lower score.

• Acousticness is a confidence measure ranging from 0.0 to 1.0 of whether the track is acoustic. A higher value represents a higher confidence that the track is acoustic.

11https://en.wikipedia.org/wiki/Pitch_class (December 26, 2016)


• Liveness is a probability measurement between 0.0 and 1.0 that tries to detect the presence of an audience in the recording. A higher value represents an increased probability that the track was performed live.

• The average loudness across the entire track measured in decibels (dB), where values typically range between −60 and 0 dB.

• Speechiness is a measurement of the presence of spoken words in the track. The range is between 0.0 and 1.0, where a value close to the maximum indicates a speech-like recording, e.g. an audio book, and a lower value (below 0.33) is given to most kinds of music.

Most of the provided features are floats or integers that are well suited for most machine learning algorithms. Three of the features (time signature, key and mode) are, however, categorical. This means, for instance, that the distance between two tracks' key values has no numerical meaning when using a linear machine learning model. So rather than just using the raw numbers as features, they are encoded with scikit-learn's OneHotEncoder into binary features.
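A minimal sketch of this encoding step (toy values, not the thesis notebook):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categorical Spotify features for three tracks: time signature, key, mode.
categorical = np.array([[4, 0, 1],
                        [3, 7, 0],
                        [4, 2, 1]])

# Each distinct category value becomes its own binary column.
encoder = OneHotEncoder(handle_unknown="ignore")
binary_features = encoder.fit_transform(categorical).toarray()
print(binary_features.shape)  # (3, number of distinct category values)
```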

3.2.2 Editorial and cultural metadata

For each track there is a set of editorial metadata provided. This includes, but is not limited to: track name, information about the associated artists, which album the track is associated with, which track number it has on the album, popularity, and whether the track has explicit lyrics. Each album and artist is in turn associated with more metadata, and one item that is particularly interesting is genres. Unfortunately this is not available on a track level, but using the genres of the album and of the different artists associated with the track, a set of intermediary genres can be obtained. Both the artists and albums are matched with unique identifiers, so no mix-up with lookups of similar names can be made.

A total of 568 unique genres were encountered for the tracks in the data set. Each label was associated with on average 147 different genres. The Provocative label had the highest number of unique genres with 190 and the Forward-thinking label had the fewest with 93. When set in relation to the number of tracks per label (on average 125), one realizes that almost every track in each label has a unique genre.

A closer inspection of the different genres offers a good explanation for this, since in addition to the classic genres, e.g. "rap", "singer-songwriter" and "jazz", there are also extremely specific ones, e.g. "deep swedish indie pop", "progressive trance house" or "math pop" (a full list can be found through the "Every Noise at Once"12 service).

12http://everynoise.com/engenremap.html (December 26, 2016)


Using the genres as categorical features would add 568 new, very sparse, binary features to the data set and would probably cause poor generalisation (due to 'the curse of dimensionality'). Therefore, a decision was made to treat the genres like text. If a track has the genres "indie christmas", "indie folk", "folk-pop", "folk christmas" and "indie pop", then the genre string would be "indie christmas indie folk folk-pop folk christmas indie pop". The genre string is then preprocessed by removing any special characters and transforming it into lower case. The sentences are then split into words (or tokens) and the Porter stemming algorithm13 is applied to each of the tokens. Stemming is used since the original words are not as important as the fact that words with the same meaning are treated as the same, e.g. "poptimism" and "pop". The stemmed words are inserted into scikit-learn's TfidfVectorizer feature extractor to create a TF-IDF matrix. This is used to properly represent repeated use of major genres when such specific genres are listed for each track. In our example above there are, for instance, three mentions of "indie" and "folk" and two mentions of "christmas" and "pop". This yields a matrix with 407 columns, and as a last step scikit-learn's TruncatedSVD decomposer is used to do a latent semantic analysis14 on the matrix to reduce the number of new features in the data set to 100.
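The genre pipeline described above can be sketched roughly as follows; the genre strings and the number of SVD components are illustrative, and the Porter stemming step is omitted for brevity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# One space-joined genre string per track, built from album and artist genres.
genre_strings = ["indie christmas indie folk folk-pop folk christmas indie pop",
                 "deep swedish indie pop math pop",
                 "jazz vocal jazz swing"]

# TF-IDF over genre tokens, then LSA via TruncatedSVD to compress the sparse
# genre matrix into a small number of dense features per track.
tfidf_matrix = TfidfVectorizer(lowercase=True).fit_transform(genre_strings)
svd = TruncatedSVD(n_components=2)  # the thesis reduced to 100 components
genre_features = svd.fit_transform(tfidf_matrix)
print(genre_features.shape)  # (3, 2)
```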

By decorating the tracks with genres from their artists, 82% of the data set were assigned one or multiple genres. The connection between some of the core values and genres is undeniable – classical music tends to be more "Mature" and R&B (Rhythm and blues) tends to be more "Youthful". Spotify has one of the most extensive sets of genres out there – at the time of this report they use over 1480 different genres15. Other papers use a much more restricted genre set and can then use the genres as categorical features (Trohidis et al. [25] used 7 and Berenzweig et al. [3] used 12 different genres). The idea of using TF-IDF to reduce the dimensionality of a track's genre profile is, to the author's knowledge, untested in similar problems.

A similar text feature extraction approach was evaluated for the artist name, following the premise that different tracks from the same artist often are associated with the same core values (according to Thanh and Shirai [24] an artist was often associated with the same mood). The preliminary results were poor, which is probably rooted in two main areas. First, the data set does not contain that many recurrent artists – 71% of the tracks had a unique set of artists and only around 5% of the artists had more than two tracks within the same core value. Second, it is often only an exact artist name match that is interesting, and even that might not be enough.

Even if the distance between the TF-IDF vectors for "The Sound" and "The Sounds" is short, they are not nearly the same band and do not play the same type of music. Furthermore, even an exact name match gives no certainty, since there are multiple bands that share the same name, e.g. "Ain Soph" is both an ambient band from Italy and a progressive rock band from Japan16.

Popularity and explicitness were two other features from the metadata that were added to the data set, but due to the nature of the values (integer and Boolean) no further preprocessing was necessary. The feature "explicit" is set to "true" in very few of the total number of examples (around 3.5%), but half of those examples are marked with the label "Provocative" and the rest are either "Youthful" or "Expressive". This shows that a very sparse feature can still isolate a core value very well. One reason for the low frequency of tracks marked as explicit might be that curators actively avoid these types of tracks when building soundtracks for brands.

13https://tartarus.org/martin/PorterStemmer/ (December 26, 2016)

14https://en.wikipedia.org/wiki/Latent_semantic_analysis (December 26, 2016)

15http://everynoise.com/everynoise1d.cgi?scope=all (December 26, 2016)

By looking up the album associated with a track, a release date could be extracted. Using that date, a 'freshness' feature was calculated as the squared number of months elapsed up to the current date – thereby severely punishing the 'freshness' of a track older than just a couple of months. A potential problem with this approach is that if an old track is re-released on a new album, the release date is misleading and gives a poor "freshness". One could obtain the original release date from other sources17; however, Spotify was chosen for its convenience.
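A small sketch of the freshness computation under these assumptions (month counting simplified to whole calendar months; not the exact code used):

```python
from datetime import date

def freshness(release_date: date, today: date) -> int:
    # Months elapsed since release, squared: grows quickly for older tracks.
    months = (today.year - release_date.year) * 12 + (today.month - release_date.month)
    return max(months, 0) ** 2

print(freshness(date(2016, 6, 1), date(2016, 12, 26)))  # 36
print(freshness(date(2010, 1, 1), date(2016, 12, 26)))  # 6889
```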

3.3 Acoustic features via Essentia

When analyzing the acoustic features of a track, only 30 seconds of the original track are used, and this might affect the quality of the analysis. The reason is limitations in the Spotify API, which only provides a URL to a preview of the track in MP3 format. However, previous work by Trohidis et al. [25] also used 30 seconds of the track for analysis, but used an arbitrary "period of 30 seconds after the initial 30 seconds", whereas Spotify's preview is promised to contain a relevant segment of the track. The reasoning of the mentioned work for only using 30 seconds is unknown – but it is surely rooted in a similar limitation of resources or computing time.

The track is downloaded and cached locally for analysis using Essentia. Essentia is an open-source C++ library for audio analysis and audio-based music information retrieval developed by the Music Technology Group in Barcelona (for more details see the documentation18). The MP3 file is loaded with the "EasyLoader", which turns it into an audio signal by downmixing the stereo signal from Spotify to mono, resampling it (if needed) to the default 44100 Hz sampling rate and normalizing it via ReplayGain. Inputting the audio signal into the standard "Extractor", configured to extract low-level and rhythmic features, returns a pool of acoustic features.

16http://rateyourmusic.com/list/noname219/popular_artists_with_the_exact_same_name (December 26, 2016)

17http://www.songfacts.com/ (December 26, 2016)

18http://essentia.upf.edu/documentation (December 26, 2016)

Some of them are single values, e.g. BPM, and others are calculated frame-wise and are therefore represented by an array of values, e.g. MFCC. In total 49 acoustic features (averages of multiple values, combined values or just the pure values) were added to the data set, all described in the following two sections.
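A rough sketch of this extraction step with Essentia's Python bindings is shown below. The exact parameters of the Extractor algorithm and the descriptor keys in the returned pool vary between Essentia versions, so treat this as an assumption-laden illustration rather than the exact setup used:

```python
from essentia.standard import EasyLoader, Extractor

# EasyLoader downmixes to mono, resamples to 44100 Hz and applies ReplayGain.
audio = EasyLoader(filename="preview_30s.mp3")()

# The standard Extractor returns a Pool of low-level, rhythmic and other
# descriptors; some are single values, others are per-frame arrays.
pool = Extractor()(audio)
for name in pool.descriptorNames():
    print(name)
```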

3.3.1 Low level features

Only a handful of the features extracted via the low-level extractor are added to the data set:

• The first 13 Mel Frequency Cepstral Coefficients.

• Spectral energy, centroid, rolloff and flux.

• Inharmonicity and three tristimulus values.

• The ratio between a signal’s odd and even harmonic energy given the signal’s harmonic peaks.

The 13 MFCC features, the spectral energy and the three tristimulus values are calculated frame-wise and therefore each holds an array with one value per frame. By aggregating these using both the mean and the standard deviation, they could easily be added as features.

3.3.2 Rhythmic features

From the extracted rhythmic features, the estimated average BPM as well as the highest and second highest peaks in BPM were used as features. Furthermore, using a BPM histogram (estimated BPM for each second in the track), bins for BPM values between 40-90, 90-140 and 140-250 were created – much like in [25]. The standard deviation of the BPM histogram was also used as a feature.
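The binning of per-second BPM estimates can be sketched as follows (toy values, not the thesis code):

```python
import numpy as np

# Hypothetical per-second BPM estimates for a 30-second preview.
bpm_per_second = np.array([118, 120, 121, 119, 122, 60, 61, 118, 120, 119,
                           121, 120, 118, 119, 122, 120, 121, 119, 118, 120,
                           119, 121, 120, 118, 122, 119, 120, 121, 118, 120])

# Fraction of seconds falling into the 40-90, 90-140 and 140-250 BPM ranges.
counts, _ = np.histogram(bpm_per_second, bins=[40, 90, 140, 250])
bins = counts / len(bpm_per_second)
features = {"bpm_40_90": bins[0], "bpm_90_140": bins[1], "bpm_140_250": bins[2],
            "bpm_std": bpm_per_second.std()}
print(features)
```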

3.4 Algorithm choice and setup

Using the graphical interface in MEKA, an experiment with 17 configurations of different multi-label classifiers was conducted. A detailed review of all the algorithms and their configurations follows (for more technical details see the documentation for MEKA19 and WEKA20).

Several problem transformation approaches were selected for evaluation: Binary Relevance, Label Powerset and Ranking by Pairwise Comparison. The chosen algorithms are presented in Table 3.2 below and were selected for their mention and performance in previous work.

Family Algorithm Abbrev MEKA

BR Binary Relevance BR BR

LP Pruned Problem Transformation PPT PSt

LP Label Powerset LP LC

LP Random k-label Sets RAkEL MULAN -S RAkEL1

RPC Calibrated Label Ranking CLR MULAN -S CLR

Table 3.2. Table describing the five problem transformations: which family they belong to, what abbreviation will be used, as well as the MEKA command they correspond to.

Each of the problem transformations requires a base (single-label) classifier to function. In this experiment Naive Bayes, Sequential minimal optimization and ID3 / C4.5 will be used. A detailed overview can be found in Table 3.3. The three were chosen because they represent probabilistic classifiers (NB), support vector machines (SMO) and decision trees (C45) – common subjects in machine learning.

Classifier Abbrev WEKA

Naive Bayes NB NaiveBayes

Sequential minimal optimization SMO SMO

ID3 / C4.5 C45 J48

Table 3.3. Table describing the three base classifiers: what abbreviation will be used, as well as the WEKA command they correspond to.

For a given combination of a problem transformation and a classifier the pattern (PT).(C) will be used, e.g. Label Powerset with Naive Bayes would be LP.NB.

Furthermore, two algorithm adaptations of single-label classifiers will be evaluated, details in Table 3.4 below.

MLkNN is a classic machine learning approach that is easy to grasp and has some unique qualities (fast training time, but relatively slow testing time). BPMLL symbolizes what is expected to be the future of machine learning – neural networks. However, a concern is that the number of training examples may not be enough.

19http://meka.sourceforge.net/api-1.9 (December 26, 2016)

20http://weka.sourceforge.net/doc.dev/ (December 26, 2016)


Classifier Abbrev MEKA

Multi-Label k-Nearest Neighbor MLkNN MULAN -S MLkNN
Backpropagation for Multi-label Learning BPMLL MULAN -S BPMLL

Table 3.4. Table describing the two algorithm adaptations: what abbreviation will be used, as well as the MEKA command they correspond to.

MLkNN was run with 10 neighbors and the smoothing parameter set to one. To avoid getting different results between runs of randomized algorithms like RAkEL, the random seed was always fixed. Other than that, the algorithms were run with their default configurations (consult the documentation for more details).
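The experiment itself was run through MEKA's graphical interface, but an analogous setup can be sketched in Python with the scikit-multilearn library. This is an assumption-level illustration, not the configuration used in the thesis; the class and parameter names are scikit-multilearn's, not MEKA's:

```python
from skmultilearn.problem_transform import LabelPowerset, BinaryRelevance
from skmultilearn.adapt import MLkNN
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Roughly corresponds to the LP.SMO, BR.NB and MLkNN configurations.
lp_smo = LabelPowerset(classifier=SVC())
br_nb = BinaryRelevance(classifier=GaussianNB())
ml_knn = MLkNN(k=10, s=1.0)

# Each model exposes fit(X, Y) / predict(X) on a feature matrix X and a
# binary label-indicator matrix Y, mirroring the MEKA workflow.
```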

3.5 Evaluation and metrics

In order to answer the research question concerning an efficient and accurate way of recommending core values to music tracks using machine learning, one must first determine if this is feasible. This means finding metrics that can be used to evaluate and compare different approaches and, in addition, discussing what minimum requirements the model needs to meet in order to be usable. By doing this one can compare multiple algorithms, choose the top performer and evaluate whether the approach is good enough to help the music curators in their day-to-day work of assigning core values to tracks.

The first step in the evaluation is to split the entire data set into two subsets: training data and testing data. As the names suggest, the training data will be used exclusively for training the classifiers and the testing data will be stored away and only used to verify the classifiers afterwards. The split is done via random sampling of the data and is an 80/20 split – 80% of the data for training and 20% for testing. By splitting the data into two sets, the hope is to verify that the algorithms capture the general patterns of the labels and do not overfit the instances supplied in the data set. Most studies [13, 25] used 10-fold cross validation rather than splitting the data set into test and training data. One of the studies [29] that used an explicit training set (with a similar data set size) also chose the "rule-of-thumb" [18] 80/20 split used in this experiment.

Most of the evaluation metrics used by Trohidis et al. [25] are supported by MEKA: Hamming Loss, Ranking Loss, One Error, F1 micro/macro score, AUROC macro score, exact match and Average Precision. Each machine learning algorithm is trained and evaluated 10 times and the average of each of these metrics is calculated. In addition, the average time required for building and for testing is recorded.
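Several of these metrics have well-known scikit-learn counterparts, so a hedged sketch of the evaluation step could look as follows; One Error has no built-in equivalent and is computed by hand, and y_true, y_score and y_pred are assumed inputs (the thesis itself relied on MEKA's built-in evaluation).

```python
# Sketch of the evaluation metrics using scikit-learn counterparts. y_true is
# the binary label matrix, y_score the per-label confidence scores and y_pred
# the thresholded predictions.
import numpy as np
from sklearn.metrics import (
    accuracy_score, f1_score, hamming_loss,
    label_ranking_average_precision_score, label_ranking_loss, roc_auc_score,
)

def one_error(y_true, y_score):
    # Fraction of examples whose top-ranked label is not among the true labels.
    top = np.argmax(y_score, axis=1)
    return float(np.mean([y_true[i, top[i]] == 0 for i in range(len(top))]))

def evaluate(y_true, y_score, y_pred):
    return {
        "Hamming Loss":      hamming_loss(y_true, y_pred),
        "Ranking Loss":      label_ranking_loss(y_true, y_score),
        "One Error":         one_error(y_true, y_score),
        "F1 micro":          f1_score(y_true, y_pred, average="micro"),
        "F1 macro":          f1_score(y_true, y_pred, average="macro"),
        "AUROC macro":       roc_auc_score(y_true, y_score, average="macro"),
        "Exact match":       accuracy_score(y_true, y_pred),  # subset accuracy
        "Average Precision": label_ranking_average_precision_score(y_true, y_score),
    }
```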


All tests were performed on the same computer, with a 4-core Intel Core i5 and 16 gigabytes of RAM, most of which was dedicated to the Java Virtual Machine on which MEKA was executed.


Chapter 4

Results

This chapter summarizes the results from the experiments conducted following the methodology in the previous chapter.

4.1 Table over metrics per configuration

Each configuration was trained 10 times with 80% of the data set as training data, and the results presented in Table 4.1 are averages of those 10 runs. Each metric is ranked (1-17) for each configuration based on how well it performed relative to the others; e.g. BR.SMO had the lowest Hamming Loss and is therefore ranked number one (1) in that metric. All the ranks are then summed and presented in the right-hand column “Ranking”, by which the table is also sorted. The idea of this column is to give a rough estimate of overall performance.
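The exact procedure behind the Ranking column is not spelled out beyond the description above, but a plausible reconstruction is sketched below in Python: each metric column is ranked (with the caller specifying which metrics are “lower is better”), and the per-metric ranks are summed per configuration.

```python
# Sketch of a plausible reconstruction of the "Ranking" column: rank every
# metric column (1 = best) and sum the ranks per configuration. The exact
# tie-handling used in the thesis is not documented, so this is an assumption.
import pandas as pd

def overall_ranking(df: pd.DataFrame, lower_is_better: set) -> pd.Series:
    # df: one row per configuration, one column per metric (e.g. "HL", "OE", ...).
    ranks = pd.DataFrame(index=df.index)
    for col in df.columns:
        # Loss-type metrics rank ascending (small is good), score-type descending.
        ranks[col] = df[col].rank(ascending=(col in lower_is_better), method="min")
    return ranks.sum(axis=1).sort_values()  # smaller rank sum = better overall

# Usage (hypothetical): overall_ranking(metrics_df, lower_is_better={"HL", "OE", "RL"})
```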

4.2 Table over training and testing time

Each configuration is trained and evaluated (or tested) 10 times, and the average CPU time for these two phases, measured in seconds, is documented in Table 4.2 below.


4.3 Table over average per problem transformation

Each of the five evaluated problem transformation approaches (BR, LP, PPT, CLR and RAkEL) was configured with three base classifiers (NB, SMO and C45). In order to compare the overall performance of the different problem transformations, their average ranking, training time and testing time are computed and presented in Table 4.3.

4.4 Table over average per base classifier

The three base classifiers (NB, SMO and C45) were each used in five different problem transformation approaches (BR, LP, PPT, CLR and RAkEL). In order to compare how they performed overall, their average ranking, training time and testing time are computed and presented in Table 4.4.



Classifier   HL     OE     RL     F1.mi   F1.ma   AUROC   EM     Avg.P   Ranking
LP.SMO       0.07   0.73   0.38   0.26    0.25    0.61    0.23   0.151   49
BPMLL        0.08   0.75   0.19   0.25    0.23    0.81    0.07   0.147   51
CLR.SMO      0.08   0.90   0.19   0.28    0.28    0.81    0.10   0.151   51
BR.NB        0.08   0.75   0.22   0.24    0.21    0.77    0.09   0.156   55
RAkEL.NB     0.08   0.84   0.30   0.24    0.23    0.69    0.16   0.155   59
PPT.NB       0.08   0.83   0.23   0.23    0.22    0.74    0.17   0.154   60
CLR.NB       0.10   0.75   0.22   0.26    0.26    0.78    0.03   0.155   61
LP.NB        0.08   0.77   0.39   0.23    0.22    0.60    0.19   0.161   62
RAkEL.SMO    0.06   0.86   0.40   0.25    0.21    0.59    0.14   0.163   62
CLR.C45      0.09   0.93   0.20   0.28    0.28    0.80    0.07   0.145   72
BR.C45       0.08   0.82   0.41   0.20    0.20    0.57    0.08   0.177   77
MLkNN        0.08   0.85   0.34   0.18    0.15    0.65    0.08   0.161   82
RAkEL.C45    0.09   0.91   0.34   0.20    0.19    0.65    0.09   0.160   89
LP.C45       0.08   0.85   0.45   0.15    0.15    0.55    0.12   0.162   96
BR.SMO       0.05   0.88   0.47   0.11    0.08    0.53    0.06   0.172   97
PPT.C45      0.09   0.89   0.43   0.15    0.15    0.57    0.11   0.161   97
PPT.SMO      0.09   0.97   0.44   0.10    0.03    0.74    0.03   0.174   107

Table 4.1. The table shows the evaluated metrics (Hamming loss (HL), One error (OE), Rank loss (RL), F1 micro/macro (F1.mi/F1.ma), AUROC macro (AUROC), Exact match (EM), Avg precision (Avg.P)) and the additional Ranking column for each of the 17 configurations. The table is sorted by Ranking and each column is colored from green (best) to red (worst) to illustrate their relative performance.


Classifier   Training Time   Test Time
BR.NB         1.42            0.82
BR.SMO       29.72            0.13
BR.C45       11.16            0.09
LP.NB         0.09            5.09
LP.SMO       19.81            1.65
LP.C45        1.77            0.01
BPMLL         8.17            0.06
CLR.NB        2.38            9.34
CLR.SMO      32.29            1.00
CLR.C45      15.34            0.13
MLkNN         4.65            1.04
RAkEL.NB      1.00           13.26
RAkEL.SMO    45.09            1.70
RAkEL.C45    14.87            0.06
PPT.NB        0.09            5.22
PPT.SMO      19.22            1.75
PPT.C45       1.85            0.03

Table 4.2. Training and testing time in seconds for each configuration averaged over 10 runs.

PT      Avg. Ranking   Avg. Train Time   Avg. Test Time
BR      76.33          14.10             0.35
LP      68.67           7.23             2.25
PPT     87.33           7.05             2.33
CLR     60.67          16.67             3.49
RAkEL   69.67          20.32             5.01

Table 4.3. Average ranking, training time and testing time over the three base classifiers (NB, SMO and C45) for each problem transformation (PT).

Base classifier   Avg. Ranking   Avg. Train Time   Avg. Test Time
SMO               73.00          29.22             1.24
NB                59.20           0.99             6.75
C45               85.40           9.00             0.07

Table 4.4. Average ranking, training time and testing time for the three base classifiers.


Chapter 5

Discussion

Results presented in the previous chapter, as well as parts of the methodology, are analyzed and discussed in this chapter. The discussion contains an attempted explanation of why the results turned out as they did and how they can be interpreted.

5.1 Evaluation of algorithms

The ranking measurement (see Table 4.1) was an attempt to give a rough estimate of the overall performance of the different algorithms. When analyzing multiple metrics simultaneously it can be hard to pick a clear winner, and this was meant to assist with that. However, it is not clear that every metric should be treated equally – one metric could be more important than others for the overall performance. One idea is to weigh the individual ranks differently, but an active choice was made to keep the model simple and instead discuss the more important metrics separately.

Looking at Table 4.3, the naïve implementation, LP, seems to have outperformed both of its extensions RAkEL and PPT. The computational complexity of LP is, as mentioned before, dependent on the number of unique label sets in the training set. The data set in this experiment had 115 unique label sets, which evidently was not enough to cause poor performance (consult Table 4.3). However, when the data grows, both in number of examples and in number of unique label sets, revisiting RAkEL or PPT might be wise.

BR was one of the problem transformations that showed the worst performance (see Table 4.3), and the explanation might lie in its core concept. Since BR assumes that the labels are independent and creates one classifier for each, it will fail to learn (and predict) any label combination where such a dependency exists. This is a major problem since, as discussed previously, there are multiple dependencies between labels (a “Calm” track can never be “Energetic” as well, a “Mature” track might often be “Traditional”, etc.). BR might perform better on a data set with fewer and more independent labels.

SMO combined with the problem transformations generally performed well (except together with BR and PPT), but had a very long build time – about three times longer than C45, which placed third overall (see Table 4.4). This is somewhat surprising, since SMO is designed to train support vector machines efficiently compared to earlier SVM training methods, but perhaps some configuration options were overlooked. The training time is still within reason, and more importantly the test time is among the better ones (see Table 4.4).

A crude way of determining an accuracy baseline is to always predict the most common label. For example, in a training set where 20% of the emails are “spam”, always answering “not-spam” would achieve an accuracy of 80%, only misclassifying the 20% of spam emails. This is superior to random selection since the distribution amongst labels is far from even. The most common label set in the entire data set is (‘Technological’) with 121 occurrences. This is around 4.9% of the total number of instances, and 15 out of the 17 tested configurations have an exact match (roughly the equivalent of single-label accuracy) that surpasses this. A random selection in this data set would have a 1/115 chance (one out of the number of unique label sets) of predicting the right label set, which is around 0.9%. This clearly shows that the machine learning approaches, and the features, used in this thesis are promising.
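Both baselines are straightforward to reproduce; the sketch below is an illustration in Python, where label_sets is a hypothetical list with one tuple of core values per track.

```python
# Sketch of the two baselines discussed above: always predicting the most
# common label set, and picking one of the unique label sets uniformly at
# random. `label_sets` is assumed to be a list with one tuple of core values
# per track, e.g. [("Technological",), ("Calm", "Mature"), ...].
from collections import Counter

def baselines(label_sets):
    counts = Counter(label_sets)
    most_common_set, occurrences = counts.most_common(1)[0]
    majority_baseline = occurrences / len(label_sets)   # e.g. 121 / 2464 ≈ 4.9 %
    random_baseline = 1 / len(counts)                   # e.g. 1 / 115 ≈ 0.9 %
    return most_common_set, majority_baseline, random_baseline
```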

The overall winner among the different algorithm combinations is LP.SMO – best ranking and acceptable build and test time (see Table 4.1 and Table 4.2). The LP transformation seems to capture the differences between the label combinations well, and SMO seems to live up to its reputation as an efficient way of training support vector machines.

Put in a broader perspective, the results give an indication that with enough data and relevant features even very subjective values become distinguishable to a computer. This suggests that similar approaches could be applied in other domains in which people have subjective opinions: art, photos or movies.

5.2 Data set and features

Apart from the evaluation of the different algorithms, the data set and its features play a crucial role in the overall result. Some aspects of the data set will be discussed in more detail in the following sections.

