
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Content-based music recommendation system:
A comparison of supervised Machine Learning models and music features

MARINE CHEMEQUE-RABEL


Content-based music recommendation system:

A comparison of supervised Machine Learning models and music features

Marine Chemeque-Rabel marinecr@kth.se

Master in Computer Science

School of Electrical Engineering and Computer Science
Supervisor: Bob Sturm
Examiner: Joakim Gustafsson
Tutor: Didier Giot, Aubay

Swedish title: Innehållsbaserat musikrekommendationssystem

Date: August 18, 2020


Abstract

As streaming platforms have become more and more popular in recent years and music consumption has increased, music recommendation has become an increasingly relevant issue. Music applications are attempting to improve their recommendation systems in order to offer their users the best possible listening experience and keep them on their platform. For this purpose, two main models have emerged: collaborative filtering and the content-based model. In the former, recommendations are based on similarity computations between users and their musical tastes. The main issue with this method is called cold start: the system will not perform well on new items, whether music or users. In the latter, information is extracted from the music itself in order to recommend similar tracks.

It is the second method that has been implemented in this thesis. The state of the art of content-based methods reveals that the features that can be extracted are numerous. Indeed, there are low-level features that can be temporal (zero-crossing rate), spectral (spectral decrease), or even perceptual (loudness) and that require knowledge of physics and signal processing. There are middle-level features that can be understood by musical experts (rhythm, pitch, ...). Finally, there are higher-level features, understandable by all (mood, danceability, ...). It should be underlined that the models identified during the literature review are also abundant.

Using the two datasets GTZAN and FMA, we will aim to first find the best model by focusing only on supervised models as well as their hyperparameters to achieve a relevant recommendation. On the other hand, it is also necessary to determine the best subset of features to characterise the music while avoiding redundant and parasitic information. One of the main challenges is to find a way to assess the performance of our system.


Sammanfattning

Med anledning till att streamingplattformar har blivit mer och mer populära under de senaste åren, och musikförbrukningen har ökat, har musikrekommendationen blivit en allt viktigare fråga. Musikapplikationer försöker förbättra sina rekommendationssystem genom att erbjuda sina användare den bästa möjliga lyssningsupplevelsen och hålla dem på sin plattform. För detta ändamål har två huvudmodeller framkommit, samarbetsfiltrering och innehållsbaserad modell. I den första är rekommendationer baserade på likhetsberäkningar mellan användare och deras smak. Huvudfrågan med denna metod kallas kallstart; den beskriver det faktum att systemet inte kommer att fungera bra på nya objekt, vare sig för musik eller användare. I den senare modellen handlar det om att extrahera information från själva musiken för att rekommendera en annan.

Det är den andra modellen som har implementerats i denna avhandling. Det senaste inom innehållsbaserade metoder avslöjar att de funktioner som kan extraheras är många. Det finns lågnivåfunktioner som kan vara temporära (nollövergångshastighet), spektrala (spektral minskning) eller till och med perceptuella (perceptuell höghet) och som kräver kunskap om fysik och signalbehandling. Det finns funktioner på medelnivå som kan förstås av musikaliska experter (rytm, tonhöjd ...). Slutligen finns det funktioner på högre nivå, förståeliga för alla (humör, dansbarhet ...). Det bör betonas att de modeller som identifierats under litteraturstudien också är rikliga.

Med hjälp av de två datamängderna GTZAN och FMA är målet för det första att hitta den bästa modellen genom att endast fokusera på övervakade modeller, liksom deras hyperparametrar, för att uppnå en relevant rekommendation. Å andra sidan är det också nödvändigt att bestämma den bästa delmängden av funktioner för att karakterisera musiken samtidigt som man undviker redundant och parasitisk information. En av utmaningarna är att hitta ett sätt att bedöma prestandan i vårt system.

Contents

1 Introduction
  1.1 Context
  1.2 Purpose and specifications
  1.3 Research question
  1.4 Overview

2 Background
  2.1 Recommendation overview
    2.1.1 Recommendation definition
    2.1.2 Music is different
    2.1.3 What is a good recommendation?
    2.1.4 Available data
  2.2 Types of recommendation systems
    2.2.1 Collaborative approach
    2.2.2 Content-based approach
    2.2.3 Context-based approach
    2.2.4 Hybrid approach
  2.3 Models for content-based recommendation
    2.3.1 Logistic Regression
    2.3.2 Decision Trees
    2.3.3 Bagging: Random Forest
    2.3.4 Boosting: Adaboost
    2.3.5 k-Nearest Neighbours
    2.3.6 Support Vector Machine
    2.3.7 Naive Bayes
    2.3.8 Linear Discriminant Analysis
    2.3.9 Neural Networks
  2.4 Features for content-based recommendation
    2.4.1 Low-level features
    2.4.2 Middle-level features
    2.4.3 High-level features
  2.5 Feature selection algorithms
    2.5.1 Filter model
    2.5.2 Wrapper model
    2.5.3 Embedded model

3 Methods
  3.1 Chosen approach
  3.2 Datasets
    3.2.1 GTZAN
    3.2.2 Free Music Archive
    3.2.3 Data augmentation
  3.3 Feature extraction
    3.3.1 Preprocessing
    3.3.2 Chosen features
    3.3.3 Wrapper model for feature selection
  3.4 Models
    3.4.1 Hyperparameter tuning
  3.5 Evaluation
    3.5.1 Evaluation of the classification using labels
    3.5.2 Evaluation of the prediction using confusion matrices
    3.5.3 Evaluation of the prediction based on human opinion

4 Results
  4.1 Preliminary results
    4.1.1 Tests on FMA
    4.1.2 Tests on GTZAN
  4.2 Dataset creation
  4.3 Hyperparameter tuning
    4.3.1 Logistic regression optimization
    4.3.2 Decision tree and random forest optimization
    4.3.3 Adaboost optimization
    4.3.4 k-nearest-neighbours optimization
    4.3.5 Support vector machine optimization
    4.3.6 Linear Discriminant Analysis
    4.3.7 Feed-Forward Neural Network
    4.3.8 Global results after tuning
  4.4 Feature selection
    4.4.1 Most important features
  4.5 Data augmentation
  4.6 Final examples of recommendations

5 Conclusions and discussions
  5.1 Discussion of the results
    5.1.1 Quantitative results
    5.1.2 Qualitative results
  5.2 Conclusion
    5.2.1 Research question
    5.2.2 Known limitations
  5.3 Future work
    5.3.1 Improvement suggestions
    5.3.2 Application development

1 Introduction

1.1 Context

In 1979, one of the first recommendation systems was born. Elaine Rich described her Grundy library system [1]: it is used to recommend books to users following a short interview in which the user is initially asked to fill in their first and last name; then, in order to identify the user's preferences and classify them into a "stereotype", Grundy asks them to describe themselves in a few keywords. Once the information has been recorded, Grundy makes an initial suggestion by displaying a summary of the book. If the suggestion does not please the user, Grundy asks questions to understand on which aspect of the book it has made a mistake and suggests a new one. However, its use remained limited and Rich faced problems of generalisation.

The recommendation systems that really emerged in the 1990s have developed strongly in recent years, especially with the introduction of Machine Learning and networks. Indeed, on the one hand, the growing use of the current digital environment, characterised by an overabundance of information, has allowed us to obtain large user databases. On the other hand, the increase in computing power made it possible to process these data, especially thanks to Machine Learning, when human capacities were no longer able to carry out an exhaustive analysis of so much information.

Unlike search engines that receive requests containing precise information from the user about what they want, a recommendation system does not receive a direct request from the user, but must offer them new possibilities by learning their preferences from their past behaviour.

E-commerce sites that aim to sell a maximum of items or services (travel, books, ...) to customers must therefore recommend suitable goods quickly. As for sites that offer streaming music and movies, their goal is to keep their users on their platform as long as possible. The common point is that it is necessary to make adequate recommendations. Recent progress in this field is considerable and these recommendations are as beneficial for companies that maximise their profits as they are for customers who are no longer overwhelmed by the number of possibilities. Decision-making is made easier and a good recommendation is therefore a significant time saver.

In 2006, Netflix, which was an online DVD rental service, launched the Netflix Challenge with $1 million to be won. The goal of the contest was to build a recommendation algorithm that could surpass the current one by 10% in tests.

The contest generated a lot of interest, both in the research community and among movie lovers. The prize was won 3 years later and highlighted several methods and research directions to solve this kind of problem. A recommendation system will be defined according to Burke's definition [2]: it is a system capable of providing personalised recommendations or guiding the user to interesting or useful resources (called items) within a large data space.

1.2 Purpose and specifications

The project, entitled Aubay Musical Playlist, was carried out in Aubay's "Innov" division. It is a brand new Research and Development project; its goal is to achieve a complete state of the art of the available methods in order to offer a functional and efficient music recommendation system. This project does not have a direct client, so the training dataset is not provided and thus needs to be determined.

In the long-term, the goal is not only to recommend existing songs but also to generate songs adapted to the musical taste of the user. During this master thesis I focused on the recommendation part while exchanging with a colleague in charge of the generation part. The future of the project will consist in gathering these two parts in order to have a fully functional recommendation system.

The aim of this thesis is to explore the different recommendation approaches, the available datasets, the ways to take into account the user's preferences, and the machine learning methods in order to build a suitable recommendation system. One important part was dedicated solely to determining how to evaluate this recommendation system. This project will be introduced to the members of the company and will take the form of an application. The user will be asked to upload a music file (mp3 or wav format) and the application will recommend songs to be listened to afterwards.

1.3 Research question

This master thesis focuses on two aspects: determining the listener's preferences and evaluating our recommendations. The main research question is the following:

How can a music listener's tastes be taken into consideration in order to automatically recommend music? How can one measure the tastes of a music listener?

Several points will, therefore, have to be addressed:

- how to classify the music’s style?

- how to take tastes into account?

- how can the performance of such a system be measured?

1.4 Overview

This report will be structured as follows: the technical background required for this project will first be described in detail. The different approaches that can be used to implement recommendation systems will be presented, and the machine learning methods that will be experimented with in this thesis will be described. The ways to evaluate our results will also be presented.

The Method section will describe the work performed. I will first introduce the chosen dataset and the reasons why it has been chosen. I will then dig deeper and explain in detail the different experiments that were carried out.

Finally, the Results section will highlight and give a visualisation of the main results obtained. Quantitative and qualitative interpretations of these results will allow us to reach a final unique model and to answer the research sub-questions.

In the "Conclusion and Discussion" section I will discuss the future of this work.


2 Background

2.1 Recommendation overview

2.1.1 Recommendation definition

In this thesis the focus will be on recommendation systems. A recommendation system is a set of techniques and services whose purpose is to propose to users articles that are likely to interest them. They are presently implemented on multimedia content distribution platforms (Netflix, Deezer, Spotify, ...), online sales platforms (Amazon, Ebay, ...), social networks (Facebook, Twitter, ...), etc. Recommendation systems are particularly useful when the number of users and articles becomes very large. That is because users are unlikely to know all the richness of the catalogue offered by the service, and it can be argued that it is almost impossible to make a personalised human prescription for all the users of a service. The purpose of the recommendation system is to lead users through the vast amount of data available, particularly on e-commerce platforms, filtering these data to automatically propose to each consumer the items that are likely to be of interest to them.

2.1.2 Music is different

Recommendation systems are more and more used in many fields: hotels, travel, products. But the musical field has some particularities to take into account. [3] The first factor to consider is the duration of a music track. As a track is short, it is less critical to make a bad recommendation than it is for a movie or a book, for example. The user can also quickly browse through the music to see if it suits their taste or not. A second specificity is the number of tracks available: the choice is very wide, and it is estimated that at least tens of millions of songs are accessible on the Internet. It is also common for repeated recommendations of the same music to be appreciated: while for trips or movies the user is looking for diversity, here the user may like to listen to the same music over and over again. Moreover, it is possible that at the first listening the user was not attentive, since listening to music is often done in parallel with another activity (sport, work, ...); attentive listening requires quality hardware, the proper mood, and exclusive attention time. Moreover, it is quite easy to extract a set of features from one piece of music. Indeed, information can be extracted through signal processing, thanks to musical knowledge, thanks to lyrics, or just using user feedback. Old music is as relevant as new music: recent music, as well as music from a few decades ago or classical music, can be just as enjoyable; it is a matter of correctly understanding the user's tastes. It must also be taken into account that music listening is often passive: the listener doesn't necessarily listen attentively to it, in shops, bars, while working... The last point that distinguishes music from other items that may be recommended is that music is often played in sequence. Indeed, as tracks are short, they are often chained together in the form of a playlist.

2.1.3 What is a good recommendation?

Taking these peculiarities into consideration, it is now necessary to make an adequate recommendation. Naturally, the main objective is to achieve a good level of accuracy, which means predicting music that the user will like and listen to. The more the user has confidence in the recommendation system and knows how it works, the more effective it will be.

A successful recommendation involves a trade-off between exploitation and exploration. [4] On the one hand, exploitation consists in playing safe music, music that the recommender knows the user likes. It is called the lean-back experience and it brings short-term rewards. On the other hand, exploration is about playing new music and making new discoveries. It is called the lean-in experience and it brings long-term rewards. If it is properly gauged, a little serendipity may please [5]. This implies that we need to find the appropriate balance between novelty and familiarity, diversity and similarity, as well as popularity and personalisation.

A relevant recommendation must also reflect the listener's context. It cannot be based only on music and listener properties; the mood, the activity, and so on also need to be taken into account. For example, someone who is working does not want to listen to the same music as when they are running.

Finally, transparency with users is a crucial point. It has been proved that explaining how the algorithm works to the user improves their confidence and therefore the time they will spend on the platform, firstly to perfect their profile and secondly because they will get better recommendations. [4]

2.1.4 Available data

The main goal of Music Information Retrieval is to extract the most relevant information from various representations of a music track (audio, lyrics, web, metadata, ...).

The features extracted can be split into four categories. The first one is the music content [6] and groups together three types of features. Signal processing techniques give us access to low-level features, which means machine-interpretable features. They can be temporal (zero-crossing rate) or spectral (spectral flux, spectral decrease, ...) features. Musical knowledge is required to extract middle-level features such as the beat, tonality, pitch, ... Finally, if the previous ones are only understandable by the machine or the experts, the high-level features are accessible to everyone, such as danceability or liveness for example.

The second category is the music context. The goal is to retrieve as much information (country, related artists, genres) as possible based on metadata thanks to web pages, blogs, lyrics, tags, ... [6]

The usable data does not only come from the music itself, but can also be focused on the user. First of all, there are the listener properties. Everyone has tastes and preferences, these can be retrieved implicitly (plays, playlists) or explicitly (thumbs, stars). [6]

Finally, the last category is the listener context. Data can be retrieved directly from the sensors of the device the listener is using. They are useful because according to the mood of the listener, their activity (sport, work, ...) the desired music varies strongly. [6]

There are different methods to get information about the user's context. [4] It can be retrieved explicitly, which means directly by asking the user, or implicitly thanks to the sensors of our devices (heart rate, light intensity, accelerometer, position, weather, and so on). Another way is to infer it: machine learning and statistical techniques can be used to draw conclusions, for example the activity can be inferred from position and movement speed.

Figure 1: Factors in Music Information Retrieval [6]

2.2 Types of recommendation systems

There are three main recommendation systems which provide the ability to create music playlists adapted to a user: collaborative filtering, content-based information retrieval techniques, and context-based recommendation. A combination of the previous techniques is possible and is called hybrid. [7]

2.2.1 Collaborative approach

This recommendation method is based on the analysis of both the behaviour of the listener and the behaviour of all other users of the platform. The fundamental assumption here is that the opinions of other users can be used to provide a reasonable prediction of another user's preferences for an item that they have not yet rated: a user is given recommendations based on users with whom they share the same tastes. Indeed, for years, in order to choose music, restaurants, movies, etc., we have been asking our friends, family, and colleagues to recommend something they liked, and it is this mechanism that is attempted to be reproduced here. Netflix was a pioneer of this method (based on stars given by other users) but it is now widely used, including for Spotify's Discover Weekly. [8]

The first family of collaborative filtering methods is called the memory-based approach. The principle is to store all data in a Users/Songs matrix. This can be done thanks to implicit or explicit feedback. In the former, if the item has been listened to at least once the value is 1, and 0 otherwise. In the latter, the value is the number of stars if available, and 0 otherwise.

We end up with a large matrix. To reduce it, Spotify tries to approximate this matrix by an inner product of two other smaller matrices. [9]


Thanks to matrix factorization, we now have two types of vectors: one user vector X for each listener and one song vector Y for each song.

\begin{pmatrix}
0 & 0 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}
=
\begin{pmatrix} \vdots \\ X \\ \vdots \end{pmatrix}
\cdot
\begin{pmatrix} \cdots & Y & \cdots \end{pmatrix}
\qquad (1)

The last step is to find similarity between vectors to be able to recommend musics to listeners, to do so there are two methods: [10]

• User-user similarity: comparing the listener vector with others user’s vectors to find those who have similar tastes.

• Item-item similarity: comparing track vectors to find which one is the closest to the currently listened track.
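To make these two steps concrete, here is a minimal, illustrative sketch (not the system built in this thesis): it factorises the small binary play matrix from equation (1) with scikit-learn's NMF and ranks songs by cosine similarity between the resulting song vectors. The number of latent factors and all values are arbitrary example choices.

import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

# Implicit-feedback matrix from equation (1): rows are listeners, columns are songs.
plays = np.array([
    [0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
])

# Approximate the matrix as a product of two smaller ones:
# X (listeners x k) and Y (k x songs), here with k = 2 latent factors.
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
X = model.fit_transform(plays)   # one row per listener
Y = model.components_            # one column per song

# Item-item similarity: compare song vectors to find the closest tracks.
song_similarity = cosine_similarity(Y.T)
listened = 2  # index of the song currently being listened to
ranking = np.argsort(song_similarity[listened])[::-1]
print("Songs most similar to song", listened, ":", [s for s in ranking if s != listened][:3])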

There is a second approach called model-based: the goal is to predict the user’s rating for missing items using machine learning models.

The key advantage of the collaborative approach is that we do not need to analyse and extract features from the raw files, so there is no need to have the audio files, nor to have an in-depth knowledge of music or physics. Moreover, it brings serendipity: the effect of surprise that the user can experience when given a relevant recommendation that they would not have found alone.

There are three major drawbacks. The first one is called cold start, and it designates two issues: the new user problem and the new item problem. [11] The former reflects the lack of user data to make a relevant recommendation, while the latter reflects the fact that we do not know who to recommend new items to. The next issue is scalability: a large number of users and items requires high computing resources. The last one is sparsity: because the amount of items is large, one user can only rate a small subset of them. [11]

2.2.2 Content-based approach

The content-based recommendation consists in the analysis of the content of the items that are candidates for recommendation. This approach aims to infer the user's preferences in order to recommend items that are similar in content to items they have previously liked. This method does not need any feedback from the listener; it is only based on sound similarity, which is deduced from the features extracted from the previously listened songs. [8] This method is based on the similarities between the different items. To estimate similarities, it is a matter of extracting features to best describe the music. The Machine Learning algorithm then recommends the closest items to those that the user already likes.

It is, therefore, necessary to create item profiles based on features extracted from the items. Moreover, this method requires user profiles based on their preferences, which can take the following form: a list of weights (which reveal the importance) corresponding to each feature we have selected.

The main advantage of this approach is that an unknown music is just as likely to be recommended as a currently popular one, or even a timeless one.

This allows new artists with few "views" to be brought up as well. Moreover, the problem of cold start, and in particular of new items, is thus avoided: when new items are introduced into the system, they can be recommended directly, without requiring integration time as is the case for recommendation systems based on a collaborative filtering approach.

The negative point is that this method limits the diversity of the recommendations; it tends to over-specialise. Moreover, the integration of a new user cannot be instantaneous: they have to listen to and evaluate a certain amount of songs before being able to receive recommendations. This is the user cold start.

2.2.3 Context-based approach

Studies [12] have shown that the mood, activity, or even the location of the person influences the music they want to listen to. We listen to music in a given moment, in a predefined emotional state, and established circumstances (party, work, ...). And these predispositions will play a decisive role in the way we feel about the music. Although there are many applications [12] of this type of recommendations such as tourist guide applications with adaptive ambient songs, there are not many concrete applications on this subject.

Many barriers still block research in this field. Indeed, the nature of the data to be taken into account is highly varied and depends on the environment (time, place, weather, culture, ...) or the user themselves (motion speed, emotions, heart rate, device luminosity, ...). An even more significant issue is the lack of data available for research purposes. In the real world it is not easy to retrieve them either, as users do not always want to transmit so much information from their mobile phone sensors.

2.2.4 Hybrid approach

It is also possible to combine the previous complementary methods to create a recommendation system called hybrid. It can also be based on other, lesser-known methods such as location-based recommendation. This approach can alleviate the problems of cold start and sparsity. Several implementations can be set up: first of all, the recommendation systems can be mixed into one. It is also possible to keep several systems separated and assign them weights, or to switch between systems at will. Finally, it is possible to extract results from one system, to then be used as an input for the next one.

2.3 Models for content-based recommendation

During the state of the art phase, the reading of numerous research papers has shown that a variety of models can be used for recommendation. These models will be tested in this thesis and therefore presented in this section.

The models chosen are supervised machine learning algorithms. Machine learning is a type of artificial intelligence where an algorithm automatically modifies its behaviour in order to improve its performance on a task based on a set of data. This process is called learning since the algorithm is optimised from a set of observations and tries to extract the statistical regularities from them. In supervised learning, the objective of the algorithm is to predict an explicit and known target t from the training data. The two most common types of targets are either continuous values where t ∈ ℝ (regression problem) or discrete classes where t ∈ {1, ..., N_C} for a problem with N_C classes (classification problem).

2.3.1 Logistic Regression

Logistic regression has proven its effectiveness in the field of music classification; although not the most efficient method [13], it has the advantage of being fast. Logistic regression is often used for multi-class classification. It is a question of finding the optimal decision boundary in order to separate the different classes. [14] The easiest case is when there are only two classes (0 and 1); in that case, as LR is a linear model, the score function can be written as follows:

S(X^{(i)}) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n \qquad (2)

with:
X^{(i)}: an observation (from the training or test set) as a vector (x_1, x_2, ..., x_n)
x_i: one of the valuable features of the predictive model
θ_0: a constant called bias
θ_i: the weights (associated with the features) that have to be computed

It can be written more compactly by noting Θ the vector containing the components θ_0, θ_1, ..., θ_n and X the vector containing x_1, x_2, ..., x_n:

S(X) = \Theta X \qquad (3)

Then the goal is to find coefficients θ_0, θ_1, ..., θ_n such that:

• S(X^{(i)}) > 0 if the sample is in the positive class (label 1)
• S(X^{(i)}) < 0 if the sample is in the negative class (label 0)

The sigmoid function (figure 2), sigmoid(x) = 1 / (1 + e^{-x}), is then applied to the score function, which allows us to obtain values between 0 and 1. The overall hypothesis function for logistic regression is therefore:

H(X) = Sigmoid(S(X)) = \frac{1}{1 + e^{-\Theta X}} \qquad (4)

Figure 2: Sigmoid function [15]

Classifying music tracks according to their genres is a case of multi-class classification, and the algorithm commonly used with this model is One-Versus-All: it consists in splitting the problem into several sub-problems of binary classification. First, class 1 is separated from all the others, then class 2 from all the others, and so on.

Figure 3: Logistic Regression for multi-class classification [16]

To prevent overfitting, l1 and l2 regularisation methods can be used to adjust the value of the weights ω_i:

• Lasso (Least Absolute Shrinkage and Selection Operator) Regression (l1): it consists of adding a regularisation term to the loss function:

L(x, y) = \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \sum_{i=1}^{n} |\omega_i| \qquad (5)

Lasso has multiple solutions and tends to shrink the less important features to zero, so it’s particularly effective when you have a large number of features and you want to select the most important ones.

• Ridge Regression (l2):

L(x, y) = \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \sum_{i=1}^{n} \omega_i^2 \qquad (6)

Ridge regression has only one solution; it does not reduce the number of features but rather the impact of each feature.
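As an illustrative sketch only (synthetic data and arbitrary hyperparameters, not the pipeline used in this thesis), a one-versus-all logistic regression with l1 or l2 regularisation can be set up with scikit-learn as follows:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a (features, genre) dataset.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One-versus-rest logistic regression; penalty="l1" (Lasso-like) shrinks
# unimportant weights to zero, penalty="l2" (Ridge-like) only reduces them.
clf = LogisticRegression(multi_class="ovr", penalty="l1", solver="liblinear", C=1.0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
print("Weights shrunk to zero per class:", (clf.coef_ == 0).sum(axis=1))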

2.3.2 Decision Trees

This paper [17] shows that decision trees can be efficient in the classification of Latin music. They will therefore be studied in this thesis, as well as ensemble methods, in order to potentially improve our models. In a decision tree [18], one can classify items by continuously separating them parameter by parameter. A decision tree is composed of three main types of elements: nodes, which are the tests performed on the attributes; edges, which are the results of the tests and connect one level of nodes to the next; and leaf nodes, which are the last nodes of the tree and represent the final classes. There are two types of decision trees, regression trees and classification trees: the former can take targets which are continuous values, for example to predict the price of a house, while the latter is composed of Yes/No questions and its targets are discrete/categorical. This is an iterative process that consists of dividing the data into partitions and then distributing them to each of the branches.

The algorithm used to train the decision trees for classification is Divide-and-Conquer; it aims to split the dataset into subsets at each node. The principle is to select a test for the first node, called the root node, which splits the set into two sub-parts by maximising the Information Gain (or, equivalently, minimising the Gini Impurity). This action then needs to be repeated recursively until there is a branch where all instances are of the same class. To avoid overfitting, a depth limit can be set.

In order to define the Gain of Information and Gini Impurity, the concept of entropy must be specified.

Entropy is the measure of impurity, disorder, or uncertainty in a set of examples. For a dataset with C classes and with p_i being the proportion of elements of class i in the dataset:

E = - \sum_{i}^{C} p_i \log_2 p_i \qquad (7)

Information Gain (IG) measures how much "information" a feature gives us about the class; it can be computed as follows:

IG = Entropy(parent) - [weighted\ average] \cdot Entropy(children) \qquad (8)

Gini Impurity is the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labelled according to the class distribution in the dataset:

G = \sum_{i}^{C} p_i (1 - p_i) \qquad (9)
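These quantities are easy to check numerically; the short sketch below (illustrative only, on a toy dataset) computes the entropy and Gini impurity of a label set and fits a depth-limited scikit-learn tree using the entropy criterion:

import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def entropy(labels):
    # E = -sum_i p_i * log2(p_i), over the classes present in `labels`.
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    # G = sum_i p_i * (1 - p_i).
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

X, y = load_iris(return_X_y=True)
print("Entropy of the full label set:", entropy(y))
print("Gini impurity of the full label set:", gini(y))

# A depth limit is one simple way to avoid overfitting a single tree.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)
print("Training accuracy with max_depth=3:", tree.score(X, y))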

2.3.3 Bagging: Random Forest

Random Forest is a method of ensemble learning. According to James, thanks to their diversity, independence, decentralization and aggregation, a combination of multiple classifiers gives a better one. For optimal results, classifiers with high variance and low bias should be grouped together in order to reduce the overall variance while maintaining a low bias. In decision trees, high variance means boundaries which are highly dependent on the training set, and low bias means boundaries that are, on average, close to the true boundary.

Random Forest is a special case of Bagging (bootstrap aggregating), the aim of such methods is to reduce the variance introduced by a single tree and thus reduce the forecasting error. To predict the result, the Random Forest algorithm averages the forecasts of several independent models (in the context of a classification it predicts the most frequent category). To build these models, several bootstrap replicates of the training set are created by sampling with replacement, thus all our models can be trained in a parallel way. This algorithm not only uses Bagging but also a randomization of the selected features at each node.
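A minimal bagging illustration with scikit-learn (synthetic data and example hyperparameters, not the thesis configuration): each tree is grown on a bootstrap sample and only considers a random subset of features at each split, and the forest predicts the most frequent class among its trees.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, n_features=30, n_informative=12,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

# bootstrap=True -> each tree sees a resampled training set;
# max_features="sqrt" -> random feature subset considered at every node.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                bootstrap=True, n_jobs=-1, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)
print("Cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))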

2.3.4 Boosting: Adaboost

Adaboost is one of the most widely used boosting algorithms and also builds on decision trees. Multiple binary classifiers are taken, each based on a feature; this gives us a set of weak classifiers. The Adaboost principle is based on the assumption that a set of weak classifiers can give a strong one (figure 4). The principle is to loop on the classifiers with weighted samples, and when a sample is incorrectly classified its weight is increased. Specifically, the steps are:

1. Initialise weights: at the beginning they are uniform.

2. Train one decision tree.

3. Compute the weighted error rate e: count how many items (taking into account their weight) are misclassified.

4. Compute the decision tree's weight depending on its error rate:

W_{tree} = learning\_rate \cdot \log\left(\frac{1 - e}{e}\right) \qquad (10)

5. Update the weights of misclassified items:

W_{item}^{new} = W_{item}^{old} \cdot \exp(W_{tree}) \qquad (11)

6. Repeat steps 2 to 5 for each tree.

7. Final decision:

Pred_{Final} = \sum_{t \in trees} W_t \cdot Pred(t) \qquad (12)

This means that each model is trained in a sequential way and learns from the mistakes made by the previous models. While Random Forest aims to decrease variance and not bias, Adaboost aims to decrease bias but not variance.


Figure 4: Adaboost classifier [19]
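A closely related multi-class variant of this loop is what scikit-learn's AdaBoostClassifier implements; the sketch below (synthetic data, example hyperparameters) shows how the number of sequential trees and the learning rate enter the model:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default weak learner is a depth-1 decision tree (a stump); trees are
# trained sequentially and misclassified samples get a higher weight each round.
boost = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
boost.fit(X_train, y_train)
print("Test accuracy:", boost.score(X_test, y_test))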

2.3.5 k-Nearest Neighbours

This method makes a prediction based on the entire dataset. When we want to predict a new value, the algorithm looks for the K instances of the set closest to it. Then, it uses the output values of the closest K neighbours to compute the value of the variable that needs to be predicted. [20]

Parameter K has to be determined: a sufficiently high value is needed to avoid overfitting, but if the value is too high there is a risk of underfitting and thus poor generalisation on unseen data. A compromise has to be found. For each new item i, the first step is to calculate its distance d to all the other values of the dataset and retain the K items for which the distance is minimal. [21]

Then the optimal K is used to make the prediction: in the case of a regression, the next step is to calculate the mean (or median) of the output values of the selected K neighbours. In the case of a classification, it is to retrieve the most represented class among the K neighbours.

For distance determination various formulas are available, as long as they satisfy the criteria of non-negativity, identity, symmetry and triangle inequality, those commonly used are:

• Hamming distance: for two equal-length strings, this is the number of positions for which the characters are different.

• Manhattan distance:

d_m(x, y) = \sum_{i=1}^{n} |x_i - y_i| \qquad (13)

• Euclidean distance:

d_e(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (14)

• Minkowski distance:

d_n(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}, \quad p \geq 1 \qquad (15)

• Tchebychev distance:

d_t(x, y) = \max_{i=1,\dots,n} |x_i - y_i| \qquad (16)

It is a method that has the advantage of being simple, transparent and intuitive while giving reliable results, but it is sensitive to redundant or useless features. [20] k-NN seems to give good results (an accuracy of 91%) when it comes to music classification, even when using MFCC features (defined in section 2.4). [22]
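As a small illustration (synthetic data, an arbitrary K of 7), scikit-learn's KNeighborsClassifier exposes the distance family above directly: p=1 gives the Manhattan distance, p=2 the Euclidean distance, and metric="chebyshev" the Tchebychev distance.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=15, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

for name, kwargs in [("manhattan", dict(metric="minkowski", p=1)),
                     ("euclidean", dict(metric="minkowski", p=2)),
                     ("chebyshev", dict(metric="chebyshev"))]:
    # Feature scaling matters for distance-based methods.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7, **kwargs))
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(name, round(score, 3))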

2.3.6 Support Vector Machine

Support Vector Machines are widely used for music classification and give good results, but the features used are often restricted to MFCCs and some rhythmic features. This method uses several successive SVMs to improve the results. [23] The basic principle is to separate two groups of data while maximising the margin around the border (that is, the distance between the two classes). It is based on the idea that almost everything becomes linearly separable when represented in high-dimensional spaces. So the two steps are: transforming the input into a suitable high-dimensional space, and then finding the hyperplane that separates the data while maximising margins. In practice, kernel functions are used to reap the benefits of a high-dimensional space without actually representing anything in it; indeed the only operation done in high-dimensional space is the computation of the scalar products between pairs of items. The commonly used kernels are:

• Linear:

K(\vec{x}, \vec{y}) = \vec{x} \cdot \vec{y} \qquad (17)

• Polynomial:

K(\vec{x}, \vec{y}) = (\vec{x}^{\,t} \vec{y} + 1)^p \qquad (18)

• Radial basis:

K(\vec{x}, \vec{y}) = e^{-\frac{1}{2\rho^2} ||\vec{x} - \vec{y}||^2} \qquad (19)

In some cases, it is possible to accept some outliers that will be in the margin in order to be able to separate the data. Although SVMs were created to deal with binary problems, there are two ways to adapt them to multi-class problems:

• One-versus-all: it consists in transforming a C-class classification problem into C binary classification problems, each with a unique separator. The ranking is given by the classifier which fits best.

• One-versus-one: k(k-1)/2 binary classifiers are taken this time; the idea is that each class C_i will be compared to each other class C_j ≠ C_i. The final ranking will be given by majority vote.
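A minimal sketch (synthetic data, default-like hyperparameters): scikit-learn's SVC supports the linear, polynomial and RBF kernels listed above and internally uses a one-versus-one decomposition for multi-class data, while OneVsRestClassifier wraps it for a one-versus-all scheme.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    # SVC handles multi-class problems with one-versus-one internally.
    ovo = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    # One-versus-all: one binary SVM per class, the best-scoring class wins.
    ovr = make_pipeline(StandardScaler(), OneVsRestClassifier(SVC(kernel=kernel, C=1.0)))
    print(kernel,
          "OvO %.3f" % cross_val_score(ovo, X, y, cv=5).mean(),
          "OvR %.3f" % cross_val_score(ovr, X, y, cv=5).mean())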


2.3.7 Naive Bayes

Bayes classifiers are based on a probabilistic approach employing Bayes' theorem. It gives the probability for an item to be in the class C_i knowing that the item has a set of features named x = (x_1, ..., x_F):

P(C_i | x) = \frac{P(x | C_i) P(C_i)}{P(x)} = \frac{P(x | C_i) P(C_i)}{\sum_j P(x | C_j) P(C_j)} \qquad (20)

P(C_i) is called the prior, P(C_i | x) is called the posterior, P(x | C_i) is called the likelihood, and P(x) is called the evidence.

The result to be computed is often based on several variables. Since the computation is complex, one type of classifier often used is the Naive Bayes Classifier [24]: it is assumed that these variables are independent. This is a strong assumption, which is why the word "naive" is used.
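With continuous audio features, a common choice is the Gaussian variant, which models each feature with a per-class normal distribution under the independence assumption; a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=15, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each likelihood P(x_f | C_i) is modelled as a Gaussian, and the features are
# assumed independent given the class (the "naive" assumption).
nb = GaussianNB()
nb.fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))
print("Posterior P(C_i | x) for the first test item:", nb.predict_proba(X_test[:1]).round(3))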

2.3.8 Linear Discriminant Analysis

This method is used to predict which predefined class an item belongs to, based on its characteristics measured using predictive variables. It was first introduced by Fisher in [25]. It achieved 71% accuracy on GTZAN [26]. Linear Discriminant Analysis is a dimensionality reduction technique, which means that it aims to reduce the number of dimensions (i.e. features) in the dataset while keeping as much relevant information as possible. It uses information from every feature in order to create a new axis and projects the data on this axis while maximising the distance between classes. For that purpose, the initial step is to compute the between-class variance, which is the level of separability between classes. Then, the distance between the mean of each class and its samples must be computed; this metric is called the within-class variance. Finally, the last stage is to construct the lower-dimensional space which maximises the between-class variance while minimising the within-class variance.
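A minimal sketch (synthetic data) of LDA used both as a classifier and as a supervised projection onto at most C - 1 axes:

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_components is bounded by (number of classes - 1); the projection axes
# maximise between-class variance while minimising within-class variance.
lda = LinearDiscriminantAnalysis(n_components=3)
lda.fit(X_train, y_train)
print("Test accuracy:", lda.score(X_test, y_test))
print("Projected shape:", lda.transform(X_test).shape)  # (n_samples, 3)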

2.3.9 Neural Networks

Neural networks of all types are also widely used, including feed-forward neural networks. Results based only on high-level features give an overall accuracy of 85%. [27] A neural network (see figure 5) is a system whose architecture is inspired by the functioning of biological neurons; however, it is nowadays getting closer and closer to mathematical and statistical methods.


Figure 5: Illustration of an artificial neural network [28]

A formal neuron is the elementary unit of an artificial neural network. When receiving signals from other neurons in the network, a formal neuron responds by producing an output signal which is transmitted to other neurons in the network.

The signal received is a weighted sum of signals from different neurons. The final output signal is a function of this weighted sum:

y_j = f\left( \sum_{i=1}^{N} w_{i,j} x_i \right) \qquad (21)

where:
y_j is the output of the formal neuron j,
x_i for i ∈ {1, ..., N} are the signals received by the neuron j from neurons i,
w_{i,j} are the weights of the interconnections between neurons i and j,
f, called the activation function, gives the output value. Usually we use the identity, sigmoid, or hyperbolic tangent functions.

For multi-class classification purposes, the simplest neural network is the Multi-Layer Perceptron (MLP). [29] It is a network that contains several layers (each with several units), all fully connected. The training method is called gradient backpropagation; it is used to find the weight values for each neuron that are most relevant for further classification. There are three types of neurons: the input cells are associated with the data (one for each input feature), the output neurons are each associated with a class, and the hidden neurons are in the intermediate layers.

For neural networks that are too deep, several problems appear. The first one concerns time and computing power, which quickly become overwhelming. The training algorithm also has trouble working correctly; indeed, it often faces exploding or vanishing gradient issues.

There are different ways, called regularisation methods, to deal with overfitting:

• Dropout: it consists in deactivating a percentage of units of a particular layer during training; more precisely, for each stage of training, neurons are either kept with probability p or dropped out with probability 1-p. This improves generalisation since it forces the layer to learn the same concept but with different neurons. This method is commonly applied to fully connected layers.


Figure 6: Without/with dropout [30]

• Early stopping: the idea is to stop the training when the system starts to overfit, i.e. when the test accuracy starts to decrease (figure 7). In order to achieve this, a validation set must be created; it allows testing the model at each epoch and thus stopping the training as soon as the validation accuracy decreases and overfitting appears.

Figure 7: Overfitting [31]
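A minimal sketch (synthetic data, arbitrary layer sizes) of a feed-forward network trained by gradient backpropagation with early stopping on an internal validation split; dropout is not shown because scikit-learn's MLP does not implement it:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=30, n_informative=15,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers; training stops once the validation score stops improving.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), activation="tanh",
                  early_stopping=True, validation_fraction=0.15,
                  n_iter_no_change=10, max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))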

2.4 Features for content-based recommendation

The features mentioned in the papers read to carry out this thesis that can be extracted from the music are numerous and can be classified according to their level (low, middle, high). There are several representations of music. From a physical point of view, sound is a wave, i.e. an oscillation of pressure, which is generally transmitted through the ambient air. Sound is therefore a superposition of sound waves of different frequencies with different characteristics such as amplitude and phase. It is mainly from this representation that the features can be extracted. While some features use the signal in the time domain, others focus on its frequency shape. In fact, the discrete Fourier transform (DFT) can be used to decompose a digital time signal into its sinusoidal components, and thus pass it into the frequency domain.


2.4.1 Low-level features

Low-level features are those that can be computed immediately from the raw audio file using statistical, signal processing and mathematical methods. Low- level features can be grouped according to their nature: temporal, spectral, energetic, or perceptual. An audio signal is constantly changing, that’s why the first step is to split the signal into short frames. It enables us to make the hypothesis that the signal is statistically stationary on this frame. Usually the signal is framed into 20-40ms spans, if shorter we would not have a sufficiently long segment to give a reliable result, and if longer, the signal changes too much.

Initially, the focus is on temporal features:

• Zero-crossing rate:
It corresponds to the number of crossings between the signal and the zero axis for a given time frame (a short period of time) of the signal. A high value is characteristic of a noisy sound while a low value indicates a periodic signal. [32] It is mainly used in music information retrieval to catch noise and percussive sounds. [7] As this value tends to be higher for percussive sounds, it helps us to differentiate, for example, rock from metal. It can be computed for the frame t, K being the frame size, as follows: [7]

ZCR_t = \frac{1}{2} \sum_{k = t \cdot K}^{(t+1) \cdot K - 1} | sign(s(k)) - sign(s(k+1)) | \qquad (22)

where sign(s(k)) = 1 if s(k) ≥ 0 and -1 if s(k) < 0.

• Amplitude envelope:
It computes the maximum amplitude among all samples of a frame t: [7]

AE_t = \max_{k = t \cdot K, \dots, (t+1) \cdot K - 1} s(k) \qquad (23)

• Root-mean-square energy:
This feature is correlated to the perception of sound intensity, so it can be used to evaluate loudness. A low energy is particularly representative of classical music. [7] On one frame t:

RMS_t = \sqrt{\frac{1}{K} \cdot \sum_{k = t \cdot K}^{(t+1) \cdot K - 1} s(k)^2} \qquad (24)
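As an illustration of this frame-wise computation, librosa provides the zero-crossing rate and RMS energy directly, and an amplitude envelope can be taken as the per-frame maximum; the file path is a placeholder and the 2048/512 frame and hop sizes are just common example values, not the settings used in this work.

import numpy as np
import librosa

# Placeholder path; librosa resamples to 22,050 Hz mono by default.
y, sr = librosa.load("some_track.wav", duration=30.0)

frame_length, hop_length = 2048, 512  # example frame and hop sizes (in samples)

zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length, hop_length=hop_length)
rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)

# Amplitude envelope: maximum absolute sample value inside each frame.
frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
amplitude_envelope = np.abs(frames).max(axis=0)

# Frame-wise curves are usually summarised (e.g. mean and standard deviation)
# to obtain one feature vector per track.
for name, values in [("zcr", zcr), ("rms", rms), ("amplitude_envelope", amplitude_envelope)]:
    print(name, float(np.mean(values)), float(np.std(values)))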

Spectral features are defined as follows:

• Spectral centroid:
It indicates the location of the centre of mass (barycentre) of the spectrum; it represents the band where most of the energy is. [33] It is calculated as the weighted mean of the frequencies in the sound. A low Spectral Centroid usually corresponds to classical music, especially pieces with only piano; other music tends to have Spectral Centroids that vary much more. [7] In order to compute it, the spectrum is considered as a distribution: the values are the frequencies and the probabilities to observe them are normalised in amplitude. [32]

\mu = \int x \cdot p(x) \, dx \qquad (25)

where x are the observed data, x = freq_s(x), and p(x) is the probability to observe x:

p(x) = \frac{ampl_s(x)}{\sum_x ampl_s(x)}

It is also possible to compute it in the following way:

SC_t = \frac{\sum_{n=1}^{N} m_t(n) \cdot n}{\sum_{n=1}^{N} m_t(n)}

• Spectral spread - Bandwidth:
The spread (= bandwidth) can be defined as the variance of the distribution; it indicates how spread out the spectrum is around its mean value. [32]

\sigma^2 = \int (x - \mu)^2 \cdot p(x) \, dx \qquad (26)

It is also possible to compute it in the following way:

SS_t = \frac{\sum_{n=1}^{N} m_t(n) \cdot | n - SC_t |}{\sum_{n=1}^{N} m_t(n)}

• Spectral skewness:
It shows how asymmetric a distribution is around its mean value. [32] A value of 0 characterises a symmetric distribution; a higher value denotes a concentration of energy on the left, while a lower value denotes more energy on the right.

\gamma_1 = \frac{m_3}{\sigma^3} \qquad (27)

where

m_3 = \int (x - \mu)^3 \cdot p(x) \, dx

• Spectral Kurtosis:
It reveals how flat the distribution is around its mean value. [32] The value for a normal distribution is 3; higher means more peaked, lower means flatter.

\gamma_2 = \frac{m_4}{\sigma^4} \qquad (28)

where

m_4 = \int (x - \mu)^4 \cdot p(x) \, dx

• Spectral roll-off frequency:
It corresponds to the frequency value f_c such that a percentage (e.g. 95%) of the signal energy is contained below this value. [32] By noting sr/2 the Nyquist frequency:

\sum_{0}^{f_c} a^2(f) = 0.95 \sum_{0}^{sr/2} a^2(f) \qquad (29)

• Band energy ratio:
This indicator measures the extent to which low frequencies dominate high frequencies. It is calculated by selecting a limit value called the split frequency band F. [7]

BER_t = \frac{\sum_{n=1}^{F-1} m_t(n)^2}{\sum_{n=F}^{N} m_t(n)^2} \qquad (30)

• Spectral Flux:
It depicts the power change between two consecutive frames.

F_t = \sum_{n=1}^{N} (D_t(n) - D_{t-1}(n))^2 \qquad (31)

• Spectral Slope:
This indicator quantifies the amplitude decay of the spectrum; it is calculated by linear regression and is thus of the following form: [32]

\hat{a} = slope \cdot f + const \qquad (32)

• Spectral Decrease:
It also quantifies the amplitude decay of the spectrum, but the method of calculation is based more on the perceptual part: [32]

decrease = \frac{1}{\sum_{k=2}^{K} s(k)} \sum_{k=2}^{K} \frac{s(k) - s(1)}{k - 1} \qquad (33)

• Spectral Flatness:
The flatness reveals how close a sound is to white noise. A flat power spectrum (high flatness value) corresponds to white noise. It is expressed as the ratio of the geometric mean to the arithmetic mean of a power spectrum: [32]

flatness = \frac{\left( \prod_{n \in num\_band} m_t(n) \right)^{1/K}}{\frac{1}{K} \sum_{n \in num\_band} m_t(n)} \qquad (34)
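Several of these spectral descriptors have direct librosa equivalents; a short, illustrative sketch (placeholder file path, default analysis parameters) that extracts them frame-wise from a magnitude spectrogram, with the spectral flux computed manually as in equation (31):

import numpy as np
import librosa

y, sr = librosa.load("some_track.wav", duration=30.0)  # placeholder path
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))  # magnitude spectrogram

features = {
    "spectral_centroid": librosa.feature.spectral_centroid(S=S, sr=sr),
    "spectral_bandwidth": librosa.feature.spectral_bandwidth(S=S, sr=sr),  # spread
    "spectral_rolloff": librosa.feature.spectral_rolloff(S=S, sr=sr, roll_percent=0.95),
    "spectral_flatness": librosa.feature.spectral_flatness(S=S),
    "spectral_flux": np.sum(np.diff(S, axis=1) ** 2, axis=0),  # equation (31)
}

# Each entry is a frame-wise curve; a track-level descriptor is typically a
# summary statistic such as the mean and standard deviation.
for name, values in features.items():
    print(name, float(np.mean(values)), float(np.std(values)))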

The signal energy can also be taken into account:

• Global energy:
It gives an estimate of the signal power at a given time. [32]

• Harmonic energy:
It gives an estimate of the power of the harmonic part at a given time. [32]

• Noise energy:
It gives an estimate of the power of the noise part at a given time. [32]

This last part defines the psycho-acoustic features. The purpose of these features is to characterise and model the hearing system. They allow both to evaluate the values as perceived by humans and to predict the discomfort or annoyance that may be caused by certain sounds.

• MFCC:

The Mel Frequency Cepstral Coefficients were introduced for speech and speaker recognition and were found to be powerful for describing the power spectrum of an audio signal. Human-made sounds are filtered by the shape of the vocal apparatus (mouth, tongue, teeth, ...). It is therefore a question of determining the shape of the sound with precision, which should give us an exact representation of the phenomenon produced and therefore the way it is perceived. Different works have proved that MFCCs can also be useful in the field of music similarity. [34]

Figure 8: Process to compute MFCC [35]

We then apply Hamming windowing to each frame in order to reduce the edge effects. [36]

The next step is simply to convert the signal into the frequency domain. For this, we use the Fast Fourier Transform to obtain the desired periodogram. This is motivated by the functioning of the human cochlea, which has the particularity of vibrating according to the frequency of the sound heard. More precisely, depending on the exact location of the vibrating cochlea (detected by small hairs), the nerves transmit to the brain which frequencies are present.

As the cochlea cannot discern differences between close frequencies, segments are summed to determine the amount of energy present in each frequency region. For this we use the Mel filterbank: the first filter is very thin and indicates the energy concentrated close to 0 Hz. As the frequencies increase, the filters become wider since the variations matter less. A Discrete Cosine Transform is then applied to the logarithm of the energies of the bank of filters. The filters are all overlapping, so the goal of this step is to decorrelate the energies from each other.

Finally, as humans can discern small variations in pitch more easily at low frequencies than at high frequencies, the Mel scale is more suitable. Usually 13 coefficients are finally kept for each frame. Frequencies are converted to Mel as follows: [7]

M(f) = 1125 \ln(1 + f/700) \qquad (35)
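For reference, the whole chain (framing, windowing, FFT, Mel filterbank, log, DCT) is available in librosa as a single call; the sketch below (placeholder file path, example parameters) keeps the usual 13 coefficients and evaluates equation (35) directly:

import numpy as np
import librosa

y, sr = librosa.load("some_track.wav", duration=30.0)  # placeholder path

# 13 MFCCs per frame: framing + windowing + FFT + Mel filterbank + log + DCT.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=2048, hop_length=512)
print("MFCC matrix shape (coefficients x frames):", mfcc.shape)

# Track-level summary commonly fed to the classifiers described above.
mfcc_mean, mfcc_std = mfcc.mean(axis=1), mfcc.std(axis=1)
print("Mean of the first coefficients:", mfcc_mean[:4].round(2))

# Equation (35): frequency (Hz) to Mel scale.
def hz_to_mel(f):
    return 1125.0 * np.log(1.0 + f / 700.0)

print("1 kHz is about", round(hz_to_mel(1000.0)), "mel")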

• Loudness:

This is the first of the three sound quality descriptors; these are perceptual and subjective metrics that are often used to assess the noise nuisance caused by products or worksites. [37] Its value is non-linear and represents the sound volume as perceived by the human ear; it is an intensity sensation. [33]

Generally, the human ear focuses on frequencies between 2000 and 5000 Hz, but this varies with age, population, culture, ... This is why songs that have the same sound pressure, physically measured in decibels (dB), but whose frequencies are not in this range, are perceived as softer by the human ear. [6]

Loudness N computation is achieved thanks to the Zwicker and Stevens model: [37]

N = \int_{0 \, Bark}^{24 \, Bark} N'(z) \, dz \qquad (36)

N' is the specific loudness: it is the loudness density according to the critical band rate, and is measured in sone/Bark. So N'(z) is the loudness in the z-th Bark band. N, the loudness, is a value in sone and corresponds to the sound volume. One sone represents the perception of a sound volume equivalent to that of a pure 1 kHz sound at a pressure level of 40 dB. Thus two sones correspond to a sound twice as intense as one sone for the average listener.

• Perceptual Sharpness:

It is an indicator of the perception of a noise as high-pitched. [33] It is the equivalent of the spectral centroid at the perceptual level; it is therefore calculated from the loudness. [32] Low acuity corresponds to "dull sounds" while high acuity corresponds to "screeching sounds". Generally, listeners prefer dull sounds, but an extremely low value can also be annoying. One of the possible models is called Aures and is computed as follows: [37]

S = c \cdot \int_{0 \, Bark}^{24 \, Bark} \frac{N'(z) \, g_s(z)}{\ln\left(\frac{N + 20}{20}\right)} \, dz \qquad (37)

where c is a correction factor, g_s(z) is the weighting function for sharpness, and S is measured in acum.


• Perceptual Spread:

It is a measure of the distance between the largest specific loudness and the total loudness: [32]

Sd = \left( \frac{N - \max_z N'(z)}{N} \right)^2 \qquad (38)

• Perceptual Roughness:

It evaluates the perception of time envelope modulations for frequencies between 20 and 150 Hz, with a maximum at 70 Hz (low and middle frequency variations). It allows quantifying the rapid variations that can be perceived as dissonant by the listener. [33] As for the loudness: [38]

R = \int_{0 \, Bark}^{24 \, Bark} R'(z) \, dz \qquad (39)

where R' is the specific roughness.

2.4.2 Middle-level features

Middle-level features focus on aspects that are meaningful musically and are understandable by a music expert. The first ones are focused on the harmony and melody of the music. Harmony is defined as the combined use of different pitch values and chords in music, it is called the vertical part of music. Melody is the horizontal part, it describes a sequence of pitched events that are perceived as a whole. [39] Different features allow you to extract information on harmony and melody:

• Pitch:

The pitch is related to the fundamental frequency, i.e. the frequency whose integer multiples best fit the spectral content of a signal. [40] It is used to qualify sounds as "high" or "low" in the sense associated with musical melodies. To estimate the pitch, we often estimate the so-called tuning system, which defines the tones (the choice of number and spacing of frequency values) used in the music.
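As a hedged example, a frame-wise fundamental frequency estimate can be obtained with librosa's implementation of the pYIN algorithm (one pitch tracker among many); the filename is a placeholder:

import librosa
import numpy as np

y, sr = librosa.load("track.wav", sr=22050)  # placeholder filename

# Frame-wise fundamental frequency (f0) estimation with pYIN
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Median f0 over voiced frames as a crude overall pitch descriptor
print("Median f0 (Hz):", np.nanmedian(f0[voiced_flag]))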

• Tonality / Modality:

It outlines the relationship between simultaneous and consecutive tones.

[40] It indicates whether the mode of the track is major or minor.

The following features focus on the temporal and rhythmic properties of music:

• Duration of the track:

The duration of a given track is a simple feature to extract that can help classify music.

• Onset events:

Onset detection is about finding the temporal position of all sonic events in a piece of music.
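A minimal sketch of onset detection with librosa (the filename is a placeholder):

import librosa

y, sr = librosa.load("track.wav", sr=22050)  # placeholder filename

# Frame indices of detected sonic events, converted to seconds
onset_frames = librosa.onset.onset_detect(y=y, sr=sr)
onset_times = librosa.frames_to_time(onset_frames, sr=sr)
print("Number of onsets:", len(onset_times))
print("First onsets (s):", onset_times[:5])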

• Metrical levels:

Several metrical levels are present in a piece of music; generally, higher metrical levels are multiples of lower ones. [40]

The lowest level, called the tatum, corresponds to the shortest durational values. The one that the listener would describe as "most important" is called the tactus; it corresponds to foot tapping, or what is commonly called the beat. The tactus enables us to define the tempo, which is the rate of the tactus pulse. [41]

• Beat:

The beat is the fundamental unit of time. The tempo usually lies between 40 and 200 beats per minute. [40]
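Beat positions and the global tempo can be estimated, for instance, with librosa's beat tracker; again a minimal sketch with a placeholder filename:

import librosa

y, sr = librosa.load("track.wav", sr=22050)  # placeholder filename

# Global tempo (in BPM) and beat positions (frame indices converted to seconds)
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print("Estimated tempo (BPM):", tempo)
print("Beats detected:", len(beat_times))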

• Rhythm:

Rhythm also describes the repetition of a pattern in time, over longer periods than that of the beat. [40]

2.4.3 High-level features

High-level features are the ones that can be understood by any listener; they describe music as it is perceived by humans. As they require interpretation, they may seem intuitive, but they are complex to extract reliably. Care must be taken with these features, as they are not always relevant. Moreover, most of them are classified as "trade secrets" and are held by The Echo Nest (owned by Spotify), among others.

• Danceability:

This parameter estimates the ability of music to make people dance. It usually takes values between 0 and 3; the higher the value, the more danceable the music is. [42]

One way to calculate it could be based on the velocity v at each sample time t and the tempo of the music: [43]

D = \sum_t \text{tempo} \cdot v(t)    (40)
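A hedged sketch of this idea, using librosa's onset strength envelope as a stand-in for the velocity v(t); this substitution and the normalisation by the number of frames are assumptions, not the exact formulation of [43]:

import numpy as np
import librosa

y, sr = librosa.load("track.wav", sr=22050)  # placeholder filename

# Onset strength envelope used here as a proxy for the velocity v(t)
v = librosa.onset.onset_strength(y=y, sr=sr)
tempo, _ = librosa.beat.beat_track(onset_envelope=v, sr=sr)

# Crude danceability-like score following D = sum_t tempo * v(t),
# normalised by the number of frames so that track length does not dominate
D = float(np.sum(tempo * v)) / len(v)
print("Danceability-like score:", D)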

• Liveness:

It consists in determining whether or not an audience is present while recording.

• Speechiness:

The predominance of voices in a track makes it possible to differentiate, for example, slam or rap, which will have very high values, from jazz or classical, where the values will be very low. [43]

• Instrumentalness:

This feature contrasts with the previous one; a high instrumentalness value corresponds to a strong domination of the instruments. [43]

• Instruments and Singer:

Knowing which instruments are present, whether there is a singer, and whether the singer is a man or a woman can help to recommend the most suitable music.


• Valence:

The valence characterises the mood of a track; a high value corresponds to joyful, lively music, while a low value indicates sad, low-energy, or even depressing music.

• Lyrics:

The mood of the music can also be determined through the lyrics. Natural Language Processing (NLP) is used to extract information from the lyrics.

These methods are used to analyse texts and extract relevant information. The first step, after retrieving the lyrics (from websites like http://www.lyrics.com) in text form, is a preprocessing step in which punctuation and stopwords ('now', 'how', 'I', 'they', ...) are removed. We then vectorize the words to extract recurring topics for each genre.
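A minimal sketch of such a pipeline with scikit-learn, assuming the lyrics are already available as plain-text strings; the example lyrics are made up, and the choice of TF-IDF followed by non-negative matrix factorisation for topic extraction is an illustrative assumption rather than the method of [43]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Hypothetical lyrics already retrieved as plain text, one string per song
lyrics = [
    "love you baby love you all night long",
    "dark rain falls on the lonely empty street",
    "dance dance move your body to the beat",
]

# Vectorisation with English stopword removal; punctuation is stripped by the
# default token pattern
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(lyrics)

# Extract two "topics" with non-negative matrix factorisation
nmf = NMF(n_components=2, random_state=0)
nmf.fit(X)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(nmf.components_):
    top_terms = [terms[j] for j in topic.argsort()[-3:][::-1]]
    print("Topic", i, ":", top_terms)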

2.5 Features selection algorithms

Many features can be extracted from audio files. The task is to eliminate those that are irrelevant or less significant, as they would increase the complexity of the model and the computation time while making predictions less reliable.

The selection of features is usually defined as a process of investigation in order to find a ”relevant” subset of features. The selection algorithms used to evaluate a subset of features can be classified into three main categories: filter, wrapper and embedded.

2.5.1 Filter model

The aim is to assess the relevance of a feature based on measures that rely on the properties of the training data. It is a preprocessing step that filters the features before performing the actual classification.

Let X = \{x_k \mid x_k = (x_{k,1}, x_{k,2}, \dots, x_{k,n}),\ k = 1, 2, \dots, m\} be a set of m training values, and let Y = \{y_k,\ k = 1, 2, \dots, m\} be the labels of the training values. To determine the relevance of a feature, there are several evaluation criteria.

• Correlation criterion: it is used in the case of a binary classification; µ_i and µ_y represent respectively the mean values of feature i and of the labels: [44]

C(i) = \frac{\sum_{k=1}^{m} (x_{k,i} - \mu_i)(y_k - \mu_y)}{\sqrt{\sum_{k=1}^{m} (x_{k,i} - \mu_i)^2 \sum_{k=1}^{m} (y_k - \mu_y)^2}}    (41)

• Fisher criterion: measures the degree of separability of the classes using a given feature. n_c, µ_{ic} and σ_{ci} represent respectively the number of samples, the average and the standard deviation of the i-th feature within class c; µ_i is the overall average of the i-th feature. [45]

F(i) = \frac{\sum_{c=1}^{C} n_c (\mu_{ic} - \mu_i)^2}{\sum_{c=1}^{C} n_c (\sigma_{ci})^2}    (42)

(32)

• Mutual Information: measures the dependence between the distributions of two populations:

I(i) = \sum_{x_i} \sum_{y} P(X = x_i, Y = y) \log\!\left( \frac{P(X = x_i, Y = y)}{P(X = x_i)\, P(Y = y)} \right)    (43)

• Signal-to-Noise Ratio coefficient: similar to the Fisher criterion, it is a score that measures the discriminatory power of a feature between two classes:

SNR(i) = \frac{2 \cdot \left| \mu_i^{C_1} - \mu_i^{C_2} \right|}{\sigma_i^{C_1} + \sigma_i^{C_2}}    (44)

This filtering method is efficient and robust against overfitting. However, since it does not take into account interactions between features, it tends to select features with redundant rather than complementary information. Moreover, this method does not take into account the performance of the classification method that is used once the selection is made.
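As a hedged illustration, the correlation, Fisher, and mutual information criteria above can be computed with numpy and scikit-learn; the feature matrix X and the labels y below are synthetic placeholders:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                 # 200 samples, 5 placeholder features
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)    # binary labels tied to feature 0

# Correlation criterion (Eq. 41): Pearson correlation between each feature and the labels
corr = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])

# Fisher criterion (Eq. 42)
def fisher_score(X, y):
    scores = []
    overall_mean = X.mean(axis=0)
    for i in range(X.shape[1]):
        num, den = 0.0, 0.0
        for c in np.unique(y):
            xc = X[y == c, i]
            num += len(xc) * (xc.mean() - overall_mean[i]) ** 2
            den += len(xc) * xc.var()
        scores.append(num / den)
    return np.array(scores)

# Mutual information (Eq. 43), estimated by scikit-learn for continuous features
mi = mutual_info_classif(X, y, random_state=0)

print("Correlation :", np.round(corr, 2))
print("Fisher      :", np.round(fisher_score(X, y), 2))
print("Mutual info :", np.round(mi, 2))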

2.5.2 Wrapper model

The wrapper method was introduced by Kohavi and John [46]. In this case, the evaluation is done using a classifier that estimates the relevance of a given subset of features. This is why the subset of features selected by this method is tied to the classification algorithm used, and the subset is not necessarily valid if the classifier is changed.

Most common implementations of wrapper are:

• Forward selection: start with no features and add the most relevant one at each step:

1. Choose the significance level (e.g. 0.05)

2. Fit the model with each remaining feature and select the one with the lowest p-value

3. If the p-value < α (the significance level), add the feature to the feature set and go back to step 2; otherwise stop the process.

• Backward elimination: start with every feature, and remove the least significant one at each step:

1. Choose the significance level (e.g. 0.05)

2. Fit the model with all features in the feature set
3. Consider the feature with the highest p-value

4. If the p-value > α, remove the feature from the feature set and go back to step 2; otherwise stop the process.

• Stepwise Selection / Bidirectional elimination: similar to forward selection, a feature is added at each iteration, but the significance of the features already added is also checked, and a feature can be removed through backward elimination if needed.

1. Choose the significance level (e.g. 0.05)
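As a hedged sketch of the wrapper idea, scikit-learn's SequentialFeatureSelector implements forward and backward selection around a chosen classifier; note that it uses cross-validated model performance as the stopping criterion instead of the p-value procedure described above, and the data below are synthetic placeholders:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic placeholder data: 10 features, only a few of them informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

# Forward selection wrapped around a logistic regression classifier,
# keeping 4 features based on cross-validated accuracy
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="forward",   # use "backward" for backward elimination
    cv=5)
selector.fit(X, y)
print("Selected feature indices:", np.where(selector.get_support())[0])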
