Classification of explicit music content using lyrics and music metadata

LINN BERGELID

Degree Project in Computer Science and Engineering, Second Cycle, 30 Credits
KTH Royal Institute of Technology
Stockholm, Sweden 2018

Master in Computer Science
Date: June 28, 2018
Supervisor: Iman Sayyaddelshad
Examiner: Örjan Ekeberg
Host company: Petter Machado, Soundtrack Your Brand
Swedish title: Klassificering av stötande innehåll i musik med hjälp av låttexter och musik-metadata


Abstract

In a world where online information is growing rapidly, the need for more efficient methods to search for and create music collections is larger than ever. Looking at the most recent trends, the application of machine learning to automate different categorization problems such as genre and mood classification has shown promising results.

In this thesis we investigate the problem of classifying explicit music content using machine learning. Different data sets containing lyrics and music metadata, vectorization methods and algorithms including Support Vector Machine, Random Forest, k-Nearest Neighbor and Multinomial Naive Bayes are combined to create 32 different configurations. The configurations are then evaluated using precision-recall curves.

The investigation shows that the configuration with the lyric data set together with TF-IDF vectorization and Random Forest as algorithm outperforms all other configurations.


Sammanfattning

In a world where online information is growing rapidly, the need for more efficient methods to search in and create music collections increases. The most recent trends show that the use of machine learning to automate different categorization problems, such as classification of genre and mood, has yielded promising results.

In this report we investigate the problem of classifying explicit content in music using machine learning. By combining different data sets containing lyrics and music metadata, vectorization methods and algorithms such as Support Vector Machine, Random Forest, k-Nearest Neighbor and Multinomial Naive Bayes, 32 different configurations are created, which are trained and evaluated using precision-recall curves.

The results show that the configuration with the data set containing only lyrics, together with TF-IDF vectorization and the Random Forest algorithm, performs better than all other configurations.


Contents

1 Introduction
  1.1 Background
  1.2 Aim and Objective
  1.3 Problem definition and statement
  1.4 Ethical considerations
  1.5 Thesis outline

2 Theory
  2.1 Word embeddings
    2.1.1 Frequency based embedding
    2.1.2 Prediction based embedding
  2.2 Supervised learning
    2.2.1 Support Vector Machine
    2.2.2 Naive Bayes
    2.2.3 k-Nearest Neighbors
    2.2.4 Random Forest
  2.3 Evaluation
    2.3.1 Methodology
    2.3.2 Metrics
  2.4 Imbalanced classification
    2.4.1 Oversampling
    2.4.2 Undersampling
  2.5 Related work
    2.5.1 Music Information Retrieval
    2.5.2 Mood classification
    2.5.3 Genre classification
    2.5.4 Topic classification

3 Method
  3.1 Work flow
  3.2 Data set
    3.2.1 Lyrics and explicit tags
    3.2.2 Music metadata
  3.3 Software
  3.4 Pre-processing
    3.4.1 Data cleaning
    3.4.2 Feature selection and transformation
    3.4.3 Feature extraction
    3.4.4 Sampling
  3.5 Model selection
  3.6 Evaluation
    3.6.1 Parameter tuning
    3.6.2 Classification

4 Results
  4.1 Results per classifier
    4.1.1 Linear Support Vector Machine
    4.1.2 Multinomial Naive Bayes
    4.1.3 k-Nearest Neighbor
    4.1.4 Random Forest
  4.2 Results per data set and sampling
    4.2.1 Lyrics
    4.2.2 Lyrics + Music Metadata

5 Discussion
  5.1 Model evaluation
  5.2 Sampling vs. No Sampling
  5.3 TF-IDF vs. Doc2Vec Vectorization
  5.4 Data set evaluation

6 Conclusion and Future work

Bibliography


List of Acronyms

BOW - Bag of Words
CBOW - Continuous Bag of Words
CNN - Convolutional Neural Network
D2V - Doc2Vec
HAN - Hierarchical Attention Network
ISRC - International Standard Recording Code
KNN - k-Nearest Neighbor
MIR - Music Information Retrieval
MNB - Multinomial Naive Bayes
NMF - Non-Negative Matrix Factorization
RF - Random Forest
SVM - Support Vector Machine
TDM - Term-Document Matrix
TF-IDF - Term Frequency - Inverse Document Frequency


List of Tables

2.1 Example of similarity measures used for k-Nearest Neighbor.
3.1 Summary of the data structure of the lyrics and explicit data set.
3.2 The parameter setting used for tuning the TF-IDF vectorization model.
3.3 The parameter setting used for the Doc2Vec vectorization model.
3.4 Summary of the models used and their corresponding function in Scikit Learn.
4.1 Abbreviations for the classifiers used when presenting the results.
A.1 The results of the classification for each configuration.


List of Figures

2.1 Example of a term-document matrix.
2.2 Example of a term-document matrix with TF-IDF weights.
2.3 Example of a context window where the words in green are the context of the word in yellow.
2.4 Example of networks for the CBOW model (left) and Skip-Gram model (right) [2].
2.5 Example of a SVM with linear kernel. Two classes are presented, one with blue circles and one with red squares. The black line corresponds to the hyperplane and the filled square and circle are the support vectors.
2.6 Example of a k-Nearest Neighbor classifier with k = 6. The blue circles and red squares correspond to data points of two different classes. The green circle would be classified as red since 5 out of 6 of the closest neighbors are red.
2.7 Example of a decision tree that decides whether or not to play tennis based on the weather.
2.8 A confusion matrix for a binary classifier.
2.9 Example of a precision-recall curve. Algorithm 2 performs slightly better than Algorithm 1 since the area is larger under the former one [10].
3.1 A summary of the work flow of this thesis.
3.2 A summary of the distribution of the classes in the data set.
3.3 A summary of the combinations of data, pre-processing and classifiers that have been evaluated.
4.1 Precision-Recall Curves per classifier.
4.2 Precision-Recall Curves per data set and sampling.


Chapter 1

Introduction

This chapter introduces the objective and aim of this degree project together with the problem definition. It ends with a brief discussion of the ethical considerations of the project, followed by a thesis outline.

Over the last couple of years, music has become one of the largest types of online information. The increasing size of digital music collections has created a major challenge and a need for more efficient methods for searching and organizing music collections [9]. This growth has extended music search methods from traditional ones, such as artist and album names, to more advanced properties including mood, genre or similar artists based on previous listening. One property that has been requested repeatedly in recent years is the ability to filter out explicit content [7, 47].

1.1 Background

The research for this thesis has been conducted at the company Soundtrack Your Brand (https://www.soundtrackyourbrand.com/). Soundtrack Your Brand provides a music streaming service for companies to play music in their public areas. The customers are able to control and schedule music using hardware players, mobile applications or a web interface. In addition, the company has an expert music team that helps each customer find their own soundtrack matching their core values.


The company works with many markets, including the US market. A key feature for this market is to filter out explicit content such as sex, drugs, alcohol and violence, as some customers do not identify with such content for their core values. Manually annotating songs has multiple drawbacks: it is time consuming and thus costly, error prone and partly a subjective task, which creates a need for an automated solution.

The idea is that an automated solution will enhance both speed and accuracy. An automated system would not only be useful for Soundtrack Your Brand and other music streaming services (such as Spotify, https://www.spotify.com/, and Apple Music), but may also be applied to other kinds of media where explicitness occurs, such as movies or books. To the best of the author's knowledge, there exists no automated technology that performs the whole process of explicit filtering, and thus this is the entry point of this thesis.

A small but growing research field called Music Information Retrieval (MIR) focuses on research and development of computational systems to retrieve music information. Currently the focus within this area has been on mood or genre classification of songs [17, 24, 30, 33, 49] or on approaches for automatically generating playlists that fit a specific user [12]. The presented models are mostly built using lyrical or audio features, or a combination of both. What seems to be missing is work on how to automatically filter out songs with explicit content.

1.2 Aim and Objective

The main aim of the thesis is an automated tool which provides a practical feature for filtering explicit music. On the way to achieving this aim, the following objectives play a significant role:

• Research of text classification in order to select relevant algorithms and pre-processing methods.



• Extracting and pre-processing of data to be used for the classification.

• Implementing the selected algorithms and evaluating if they are useful for classifying explicit content.

1.3 Problem definition and statement

The project entails finding a machine learning algorithm that, based on text features, is able to classify song tracks with explicit content. The main feature used for this purpose will be the lyrics of the songs, but since lyrics differ a lot from ordinary text such as news, in that they use rhyme and are presented in poetic form, using lyrics alone can be difficult for natural language processing. In [4] and [27], tags and musical reviews have been used as a complement to obtain a better classification. Therefore this thesis will investigate music metadata as an additional feature to enrich the lyrics. Music metadata may include everything from information produced by the community, such as user annotations of what a song is about, to information about the artist and album or acoustic features of the track.

The question this thesis aims to answer is:

What is an efficient method for classifying explicit music using machine learning?

1.4 Ethical considerations

As the area of machine learning is growing more than ever, it is important to discuss different ethical aspects in relation to it. Creating automated systems could possibly remove jobs from humans who used to perform a task manually, or it could help them focus on more advanced tasks where they add more value.

Another aspect is that of privacy. The data used in this thesis contains no personal information, and thus anonymization or removal of sensitive information has not been a concern.


Regarding explicit content, there are many views on what should be called explicit or not. This could potentially be a problem within, for example, religious or political music, where some people would find this music explicit whereas others would feel their views were discriminated against if such music were filtered out. One way to solve that would be a multi-label classification based on different categories within explicit content, leaving the choice to the customer of what is considered explicit or not.

1.5 Thesis outline

The thesis is structured as follows. Chapter 2 covers relevant theory and background of the thesis. It ends with a section about previous work within MIR where machine learning algorithms and textual features have been used. Chapter 3 describes the methods used in this thesis. In chapter 4 the results of the classification are presented. Chapter 5 covers an analysis of the results. The thesis ends with a conclusion and ideas for future work in chapter 6.


Chapter 2

Theory

This chapter starts with a presentation of how to pre-process text before it can be used as a feature. It then describes the theory behind the relevant classification algorithms used in this degree project, followed by a short description of different evaluation methods. It ends with a summary of previous work.

2.1 Word embeddings

This section describes how the data is processed before it can be used as a feature in a classifier. Most classifiers are not able to use raw text strings as input, and thus the text has to be converted into a suitable format before it can be used. Several methods exist, and these can generally be divided into two categories, frequency based embedding and prediction based embedding, which are elaborated in the following sections.

2.1.1 Frequency based embedding

Bag of Words

One of the simplest vector representations of a text is based on the frequency of all words. The frequency of each word in a document is calculated and the ordering between the words is discarded. The result is stored in a term-document matrix (TDM) where each row corresponds to a term and each column corresponds to a document, see Figure 2.1. This model is referred to as bag of words (BOW) [1].


Figure 2.1: Example of a term-document matrix.

Since documents may contain millions of unique words, the matrix is sparse and contains a lot of zeros. To reduce the size of the matrix and to make the computation more efficient some steps can be performed before vectorizing the document:

1. Stop word removal: Stop words are defined as frequently used words in a language. These are often common in a majority of the documents and thus do not contribute to the actual content of a document. Common stop words are often articles, prepositions and conjunctions, and by removing these, noise is avoided. Several stop word lists exist for multiple languages.

2. Stemming: Stemming is used to reduce a word to its original root or word stem. For example, went would be reduced to go and struggling would be reduced to struggle. One of the most common methods is the Porter stemming algorithm [50].

3. Removal of punctuation and numbers: Punctuation marks, numbers and hyphens are removed as they do not carry semantic relevance.
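To make these pre-processing steps concrete, the following is a minimal sketch (not code from the thesis) that applies stop word removal, Porter stemming and punctuation/number removal with NLTK before building a term-document matrix with scikit-learn; the sample documents are made up:

```python
import re

from nltk.corpus import stopwords            # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The moon shines bright, and the stars are shining too!",
    "We are struggling through 99 problems...",
]

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Step 3: remove punctuation and numbers, keeping only letters.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Steps 1 and 2: drop stop words and stem the remaining tokens.
    return " ".join(stemmer.stem(t) for t in text.split() if t not in stop_words)

# Bag of words: note that scikit-learn stores documents as rows and terms
# as columns, i.e. the transpose of the TDM orientation in Figure 2.1.
vectorizer = CountVectorizer()
tdm = vectorizer.fit_transform(preprocess(d) for d in documents)
print(vectorizer.get_feature_names_out())
print(tdm.toarray())
```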

TF-IDF Vectorization

Term frequency-inverse document frequency (TF-IDF) is a more advanced measure used to calculate the importance of a word based on its appearance in all documents. The more documents a word appears in, the less important that word becomes, since it does not identify a specific document. The measure also addresses the problem where two documents are of different length and terms might appear more times in a long document than in a short one [1]. This value, which is calculated in two steps, is then used instead of the raw frequency in the term-document matrix to provide a more balanced view of the frequencies.

The first step, term frequency, measures how frequent a term is in a document and is defined as follows:

\[ \mathrm{TF}_{t,d} = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \tag{2.1} \]

where \(t\) is the term, \(d\) is the document and \(f_{t,d}\) the raw frequency of the selected word. The second part, inverse document frequency, measures how important a term is and is defined as follows:

\[ \mathrm{IDF}_{t,D} = \log \frac{N}{|\{d \in D : t \in d\}|} \tag{2.2} \]

where \(D\) is the set of all documents and \(N = |D|\). Thus, if a term occurs in many documents, the inverse document frequency will be small. These numbers are combined into TF-IDF by taking their product:

\[ \text{TF-IDF}_{t,d,D} = \mathrm{TF}_{t,d} \times \mathrm{IDF}_{t,D}. \tag{2.3} \]

Figure 2.2 illustrates what the first term-document matrix would look like with TF-IDF frequencies instead of raw ones.
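To make equations (2.1)-(2.3) concrete, here is a minimal sketch (not from the thesis) that computes TF-IDF weights for a toy corpus directly from the definitions:

```python
import math
from collections import Counter

documents = [
    "the moon shines bright".split(),
    "the sun shines".split(),
]

def tf(term, doc):
    # Equation (2.1): raw count of the term divided by the document length.
    counts = Counter(doc)
    return counts[term] / sum(counts.values())

def idf(term, docs):
    # Equation (2.2): log of total documents over documents containing the term.
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    # Equation (2.3): product of term frequency and inverse document frequency.
    return tf(term, doc) * idf(term, docs)

# "shines" appears in both documents, so its IDF (and TF-IDF) is 0;
# "moon" appears in only one, so it gets a positive weight there.
print(tf_idf("shines", documents[0], documents))  # 0.0
print(tf_idf("moon", documents[0], documents))    # 0.25 * log(2) ≈ 0.173
```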


N-grams

N-grams are sequences of N words occurring together in a text. For example, given the sentence "The moon shines bright." and N = 2, the 2-grams (also known as bigrams) of the sentence would be "The moon", "moon shines" and "shines bright". Instead of counting each word on its own, N-grams can be used in the TDM in frequency based embedding. The original bag of words approach is basically N-grams with N = 1 (also known as unigrams). N-grams with different values of N can be combined in the same TDM, helping an algorithm understand the context while still preserving the information given by the terms on their own.

2.1.2 Prediction based embedding

Prediction based embedding is more complex than frequency based embedding as it takes the relationship between words into consideration. The most famous prediction based algorithm is Word2Vec, which creates a vector for each word based on the semantic relationship between the words in a text [35].

Word2Vec

Word2Vec is based on a fully connected feed-forward neural network, and its goal is to map words with similar meanings close to each other [35]. The main concept consists of words and their context in a sentence. A context window around a word w with size c is defined as the c words before and after w in a sentence. Figure 2.3 illustrates an example using the sentence "Both apples and pears are fruits." with w = pears and c = 2.

Figure 2.3: Example of a context window where the words in green are the context of the word in yellow.

Word2Vec is based on either one of two models, continuous bag of words (CBOW) or Skip-Gram. The CBOW model works by predicting the probability of a word given a context (of a specified size), whereas Skip-Gram is the inverse of CBOW, where a context is predicted given a word [2].

In the CBOW model, the training data consists of all context words, \(w_1, \ldots, w_{2c}\). Representing a target word with \(w\), the goal is to calculate the probability \(P(w \mid w_1, \ldots, w_{2c})\). Thus, the network consists of \(2c\) input layers, represented as one-hot-encoded vectors, and one output layer where the target word is predicted, all of them of size \(d\), the size of the dictionary [2].

In the Skip-Gram model, the training data instead consists of the target word \(w\), with the goal of calculating \(P(w_1, \ldots, w_{2c} \mid w)\). The network has a single input layer which is fed by a one-hot-encoded target word vector, and \(2c\) output layers, one for each context word [2]. Figure 2.4 illustrates an example of the networks for both CBOW and Skip-Gram.

Figure 2.4: Example of networks for the CBOW model (left) and Skip-Gram model (right) [2].

The size of the hidden layer vector is \(p\) (with \(p < d\)); thus, using it as the word embedding gives a less sparse representation than the one-hot-encoded vector of a word.
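As an illustration (not part of the thesis experiments, and assuming the Gensim 4 API), training both Word2Vec variants on a toy corpus looks roughly like this; the sg flag switches between CBOW and Skip-Gram:

```python
from gensim.models import Word2Vec

# A tiny tokenized corpus; a real corpus would contain thousands of sentences.
sentences = [
    ["both", "apples", "and", "pears", "are", "fruits"],
    ["apples", "and", "pears", "grow", "on", "trees"],
]

# sg=0 selects the CBOW model, sg=1 selects Skip-Gram; window is the
# context size c and vector_size is the embedding dimension p.
cbow = Word2Vec(sentences, sg=0, window=2, vector_size=50, min_count=1)
skipgram = Word2Vec(sentences, sg=1, window=2, vector_size=50, min_count=1)

# Each word now maps to a dense 50-dimensional vector.
print(cbow.wv["pears"].shape)             # (50,)
print(skipgram.wv.most_similar("pears"))  # nearest words in embedding space
```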


Doc2Vec

Doc2Vec is an extension of Word2Vec which, instead of creating a vector for each word, creates a vector representing an entire document [25]. It works by adding an additional vector that identifies a specific document. When the word vectors are trained, the document vector is trained at the same time and can then be used as a document embedding.

2.2 Supervised learning

Machine learning is divided into two main approaches, supervised learning and unsupervised learning. Supervised learning uses labeled data, that is, a set of features and their corresponding labels, as input to fit a model. The model can then be used to classify unlabeled data. If no such labeled data exists, unsupervised learning can be used instead. Unsupervised learning is used to understand the relationships within unlabeled data, for example by clustering [19].

Supervised learning is further divided into classification and regression, depending on whether the labels are numerical or categorical. If the labels have numerical values, such as a person's height or weight, it is a regression problem, and if they have categorical values, such as a person's gender, it is a classification problem [19].

In the following subsections, common classification algorithms within text categorization are presented. The idea is to provide a general description, describe in what ways they differ, as well as their advantages and disadvantages when used for text analysis.

2.2.1 Support Vector Machine

Support Vector Machine (SVM) is a model that produces hyperplanes to separate the data into classes. It aims to find the hyperplane which maximizes the margin, i.e. the distance between the hyperplane and the closest training instances. The instances closest to the hyperplane are called support vectors. After the model has been trained and the hyperplanes are created, it can easily be used to classify new instances.


Some data sets may not be linearly separable, and since the original algorithm only included a linear classifier, a non-linear solution was introduced using kernel functions to describe non-linear hyperplanes [6]. The most common kernel functions, besides the linear one, include the polynomial kernel, the Gaussian radial basis kernel and the sigmoid kernel. Figure 2.5 shows an example of a SVM with a linear kernel.

Figure 2.5: Example of a SVM with linear kernel. Two classes are presented, one with blue circles and one with red squares. The black line corresponds to the hyperplane and the filled square and circle are the support vectors.

One big advantage of SVM is that it has proven to perform well in many text categorization problems, specifically in terms of accuracy [3, 20, 37]. It also handles overfitting well, since the model's complexity is not dependent on the number of features [41]. One disadvantage is that the implementation of the model scales badly with the number of documents [37].

2.2.2 Naive Bayes

Naive Bayes is a group of probabilistic classifiers based on Bayes' theorem combined with a strong independence assumption between the features. Bayes' theorem states the probability of an event given conditions related to the event and is defined as


\[ P(C_k \mid F_1, \ldots, F_n) = \frac{P(C_k) \times P(F_1, \ldots, F_n \mid C_k)}{P(F_1, \ldots, F_n)} \tag{2.4} \]

where \(C_k\) is a class and \(F_1, \ldots, F_n\) is a set of features. \(P(C_k \mid F_1, \ldots, F_n)\) denotes the probability of the set of features with values \(F_1, \ldots, F_n\) belonging to class \(C_k\). Adding the independence assumption means assuming that all features are conditionally independent from each other, and the probability model can instead be formulated as

\[ P(C_k \mid F_1, \ldots, F_n) = \frac{P(C_k) \times \prod_{i=1}^{n} P(F_i \mid C_k)}{P(F_1, \ldots, F_n)}. \tag{2.5} \]

To estimate \(P(F_i \mid C_k)\), different event models are used that describe the distribution of the features. The two most common models used for text classification are the multivariate Bernoulli model and the multinomial model [21]. In the Bernoulli event model, each document is represented using a binary vector which indicates whether or not a word occurs in the document, and thus ignores the frequency. In the multinomial event model, the document is represented using a vector of word occurrences. The order of the words is ignored in both event models [31]. The model is then combined with a decision rule to build the classifier. The most common rule is called maximum a posteriori, which means choosing the most probable class as the prediction.
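A small worked sketch of the decision rule (toy numbers, not from the thesis): since the denominator of equation (2.5) is the same for every class, the maximum a posteriori class can be picked by comparing the numerators only.

```python
import numpy as np

# Toy model: two classes with priors P(C_k) and per-class feature
# likelihoods P(F_i | C_k) for three observed features.
priors = np.array([0.7, 0.3])        # P(C_0), P(C_1)
likelihoods = np.array([
    [0.10, 0.20, 0.30],              # P(F_i | C_0)
    [0.40, 0.50, 0.05],              # P(F_i | C_1)
])

# Numerator of equation (2.5): prior times the product of feature likelihoods.
scores = priors * likelihoods.prod(axis=1)
print(scores)           # [0.0042 0.003 ]
print(scores.argmax())  # 0 -> the maximum a posteriori class is C_0
```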

The benefits of using Naive Bayes are that it is computationally fast, easy to implement and performs well in high dimensions [44]. According to [31], the multinomial model is best for large vocabulary sizes. The disadvantage is the strong assumption of independence between the features [37].

2.2.3 k-Nearest Neighbors

k-Nearest Neighbor (KNN) is an instance-based and non-parametric method that is based on feature similarity. Instance-based means that there is no explicit training phase to configure a model before the classification [1]. Non-parametric means that all data points are needed in order to predict a test point and that no assumptions are made on the underlying data distribution, which makes it good to use when there is little or no prior knowledge of the distribution of the data.


The KNN classifier is one of the simplest classifiers: for a given instance, the k closest training samples are determined, and the dominant class among the k samples is predicted by the classifier, see Figure 2.6. To calculate the distance between the samples, a similarity measure has to be used. Some of the existing measures can be seen in Table 2.1 below, where the Euclidean distance is the most common.

Table 2.1: Example of similarity measures used for k-Nearest Neighbor.

Measure     Formula
Euclidean   \( D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \)
Manhattan   \( D(x, y) = \sum_{i=1}^{n} |x_i - y_i| \)
Minkowski   \( D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}, \; p \geq 1 \)

where \( x = (x_1, \ldots, x_n) \) and \( y = (y_1, \ldots, y_n) \) are feature vectors of two samples.

Figure 2.6: Example of a k-Nearest Neighbor classifier with k = 6. The blue circles and red squares correspond to data points of two different classes. The green circle would be classified as red since 5 out of 6 of the closest neighbors are red.

One of the major disadvantages of the KNN model is that it suffers from high storage requirements, since all training data is needed during the test phase. However, it benefits from its simplicity, noise robustness and the fact that no assumptions have to be made on the data [37, 42].
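As a sketch (not from the thesis), the three measures in Table 2.1 can be written directly in NumPy; the Euclidean and Manhattan distances are just the Minkowski distance with p = 2 and p = 1:

```python
import numpy as np

def minkowski(x, y, p):
    # General Minkowski distance; requires p >= 1.
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 4 + 0) ≈ 3.606
print(minkowski(x, y, 1))  # Manhattan: 3 + 2 + 0 = 5.0
```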


2.2.4 Random Forest

In order to understand how a Random Forest (RF) classifier works, the concepts of decision trees and ensemble methods first have to be introduced.

Decision Trees

A decision tree is a rule-based classifier which recursively splits data into smaller sets based on predefined rules in a tree-like structure. At each node, a decision is made based on a split criterion and the data is split into two or more subsets. The leaf nodes are labelled with a class, and after the tree has been constructed that label is used when classifying new instances [1]. Figure 2.7 illustrates a tree that helps decide whether or not to play tennis based on the weather. For example, if the outlook is sunny and the humidity is normal, one may play tennis since the class at the leaf node is Yes.

Figure 2.7: Example of a decision tree that decides whether or not to play tennis based on the weather.

Ensemble methods

An ensemble method consists of multiple classifiers whose predictions are combined in order to improve the performance and obtain better accuracy than the single classifiers would have on their own. There are two main techniques within ensemble learning, called bagging and boosting. Bagging (bootstrap aggregating) randomly divides the training set into partitions with replacement and uses one set for each classifier. In boosting, each set is chosen based on the performance of the previous classifier. Data that was incorrectly classified by a previous classifier is prioritized in the next run, in order to improve the prediction of data that gave poor performance in an earlier run [45].

Random Forest

Random Forest combines the above concepts by using a large set of decision trees and outputs the resulting class based on a majority vote between the results of the different trees [45]. The benefits of using RF classifiers include simplicity and good performance for data sets with high dimensions. They rarely overfit, since the multiple trees help to reduce the variance. One disadvantage is that using a large number of trees can be computationally heavy and inefficient [52].

2.3 Evaluation

After a model has been created, its performance has to be evaluated. In this section multiple evaluation methods and metrics are presented.

2.3.1 Methodology

The goal for all classifiers is to predict the output as accurately as possible. Labeled data called ground truth is often used to compare with the classifier output. It is important that this data has not been used for training, since that would lead to an overestimation of the accuracy due to overfitting. Therefore, all labeled data needs to be divided into a training set and a test set before it can be used. How the data is divided may affect the results. Small labeled data sets are extra sensitive, and if the test data is not representative it will lead to poor results. Multiple methods exist to prevent this, which are presented below.


Holdout

The holdout method partitions the labeled data into two sets for training and testing. The model is built using the training data, and the accuracy of the model is calculated using the test data. This ensures that the model is not overestimated due to overfitting. However, it might instead lead to an underestimation of the accuracy if the class distributions in the test and training data are not equal [1].

Cross Validation

The cross validation method partitions the data into k parts of equal size. One part is used for testing and the rest (k − 1 parts) are used as training data. This is then repeated k times, using each of the k parts as test data exactly once. The final accuracy is usually an average of all k results.
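A minimal sketch of both strategies with scikit-learn (toy data, not the thesis data set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Holdout: a single split into training and test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="linear").fit(X_train, y_train)
print(clf.score(X_test, y_test))

# Cross validation: k = 10 folds, each used as test data once.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
print(scores.mean())
```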

2.3.2 Metrics

Several metrics exist to evaluate the performance of a classifier. This section lists the most common ones, which are all based on the terminology of true positives, false positives, true negatives and false negatives. The terminology is described using an example of a binary classification with two classes, A and B.

• True Positive (TP) - True positives are items of class A that are correctly predicted as items of class A.

• False Positive (FP) - False positives are items of class B that are incorrectly predicted as items of class A.

• True Negative (TN) - True negatives are items of class B that are correctly predicted as items of class B.

• False Negative (FN) - False negatives are items of class A that are incorrectly predicted as items of class B.

Accuracy

Accuracy is used as an overall measure of a classifier and is defined as the ratio between all correctly predicted items and all items.


\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2.6} \]

Precision

Precision is a measure of what percentage of the items predicted to belong to class A actually belong to class A.

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2.7} \]

Recall

Recall is a measure of what percentage of the items belonging to class A were predicted correctly. This is also known as sensitivity or true positive rate.

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2.8} \]

F1 score

The F1-score is defined as the harmonic mean of precision and recall.

\[ F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2.9} \]
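A quick sanity check of equations (2.6)-(2.9) on made-up confusion-matrix counts (not results from this thesis):

```python
# Made-up counts for a binary classifier.
tp, fp, tn, fn = 40, 10, 45, 5

accuracy = (tp + tn) / (tp + fp + tn + fn)           # (2.6)
precision = tp / (tp + fp)                           # (2.7)
recall = tp / (tp + fn)                              # (2.8)
f1 = 2 * precision * recall / (precision + recall)   # (2.9)

print(accuracy, precision, recall, f1)  # 0.85 0.8 0.888... 0.842...
```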

Confusion Matrix

A confusion matrix can be used to summarize and illustrate the performance of a classifier. Each row represents an actual class, whereas each column represents a prediction of the classifier. Figure 2.8 illustrates an example with a binary classifier with a positive and a negative class.


Figure 2.8: A confusion matrix for a binary classifier.

Precision-Recall Curve

Precision-recall curves are used to demonstrate the relationship between precision and recall along different thresholds. Precision and recall are inversely related, so when precision increases, recall decreases and vice versa. The X axis shows the recall while the Y axis shows the precision. A large area under the curve means both high precision and high recall. Given the plot, an appropriate threshold can be chosen depending on whether a high precision with a lower recall, or a high recall at the cost of precision, is preferred. Figure 2.9 illustrates an example.

Figure 2.9: Example of a precision-recall curve. Algorithm 2 performs slightly better than Algorithm 1 since the area is larger under the former one [10].
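A sketch (synthetic data, not the thesis data set) of how such a curve can be produced with scikit-learn from a classifier's probability scores:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 80% of samples in the negative class.
X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# Precision and recall at every score threshold, plus the area under the curve.
precision, recall, _ = precision_recall_curve(y_test, scores)
print("AUC-PR:", auc(recall, precision))

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```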


2.4 Imbalanced classification

Imbalanced classification arises when the classes in a data set are unequally distributed. Training an algorithm with imbalanced data often leads to reduced accuracy. This is due to the imbalanced distribution of the dependent variable, which causes the classifier to become biased towards the larger class [11, 14]. Classifier algorithms assume that the different class errors are equal and aim to minimize the total error, which will be dominated by the larger class, and thus the error of the smaller class will not be seen. Multiple methods exist to deal with the imbalance issue by altering the size of the data set to create subsets of data of each class of more equal sizes. This group of methods is known as sampling methods. They are further divided into oversampling and undersampling.

2.4.1 Oversampling

Oversampling works by replicating samples of the smaller class. An advantage of this type of sampling is that no information is lost [28]. Several techniques exist, with different approaches to replication.

2.4.2 Undersampling

Undersampling works by reducing the size of the larger class to make the data set more balanced. Since data is removed, this method works well for large data sets. Several undersampling techniques exist, differing in how the samples to remove are selected [28].

Random Undersampling

Random undersampling is the simplest technique, which balances the data by randomly selecting a subset equal in size to the underrepresented class. The parameter setting allows the user to bootstrap the data, if wanted, by selecting samples with replacement [28].

NearMiss

NearMiss works by applying one of three heuristic rules to select samples. The first rule selects samples of the larger class with the smallest average distance to the smaller class. The second rule is the same as the first rule, except that it selects samples with the largest average distance instead. The third rule consists of two steps. First, the M nearest neighbors of each sample in the smaller class are kept. Then, from the rest of the samples in the larger class, the ones whose average distance to their N nearest neighbors is the smallest are kept [18].

TomekLinks

A Tomek’s Link between two samples of different classes means that the samples are the nearest neighbors of each other. That is, (a, b) is a Tomek’s Link if there exist no sample c such that d(a, c) < d(a, b) or d(b, c) < d(b, a)where a and b belong to different classes and d() is the distance between two samples. Depending on the parameter setting of the method either the sample of the larger class or both samples in a Tomek’s Link will be removed [14].

Edited Nearest Neighbor

Edited Nearest Neighbor uses a nearest neighbor algorithm to remove samples that are not close enough to other samples of the same class. Two settings of the method exist: one where a majority of the N closest neighbors have to belong to the same class, and one where all of the N closest neighbors have to belong to the same class [18].
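A minimal sketch (toy data, not the thesis pipeline) of applying these undersampling techniques with the Imbalanced-learn library:

```python
from collections import Counter

from imblearn.under_sampling import (
    EditedNearestNeighbours,
    NearMiss,
    TomekLinks,
)
from sklearn.datasets import make_classification

# A roughly 9:1 imbalanced toy data set.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
print(Counter(y))

# Each sampler removes majority-class samples according to its own rule.
for sampler in (NearMiss(version=1), TomekLinks(), EditedNearestNeighbours()):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```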

2.5 Related work

This section covers earlier research within text classification where the focus has been on lyrics as data.

2.5.1 Music Information Retrieval

Studies within Music Information Retrieval (MIR) have focused on different forms of classification, including genre and mood, with artist similarity and music annotation being more recently studied subjects within this area. The primary focus has been on audio features, but gradually, text features have become more popular. To the best of the author's knowledge, no scientific studies on explicit filtering have been found. The most relevant research studies are work on genre and topic classification with lyrical features as the main features. Below follows a summary of the work within MIR over the last decade. The work is divided into sections based on the classification subject.

2.5.2 Mood classification

Hui et al. [17] were some of the first to use lyrics for mood classification. They used the n-gram model and part-of-speech tagging to tackle the difficulties of working with lyrics as text, since lyrics often lack emotion words. For feature selection and weighting they tried boolean values, absolute term frequency and TF-IDF. They used Naïve Bayes, Maximum Entropy and SVM algorithms and evaluated the problem with all combinations of algorithms and pre-processing. The results demonstrated that Maximum Entropy and SVM performed better than Naïve Bayes and that TF-IDF was better than the other feature weighting methods.

In 2009, Hu et al. [16] tried to build a ground truth data set of songs with mood by comparing different experiments using only lyrical features, only audio features and a hybrid of both. The experiment evaluated three different text processing methods: bag of words (with and without stemming), part-of-speech tagging and removal of stop words. The classifier used was a SVM, as it had shown strong performance in earlier reports on text classification within MIR. The results showed that bag of words with TF-IDF weighting performs best for lyric analysis in terms of accuracy, and that stemming did not play a meaningful role.

An extensive study using audio, lyrics and social tags as features has been reported in [24]. Expectation Maximization was used for clustering of the tags. For lyrical and audio features, eight different classifiers were evaluated, where SVM with a polynomial kernel performed best. The results showed that audio was stronger than lyrics as a feature, but that both features together were complementary. Other work using tags as features includes [27], where emotion and genre tags from allmusic.com (https://www.allmusic.com/) were used to build an emotion classifier, and [8], which uses last.fm (https://www.last.fm/) tags to create two data sets, one containing a categorization of songs into 4 different emotions and the other discriminating between positive and negative songs.

Contributing to the studies on mood classification using lyrics alone, [49] presented two different feature categories: one using the lyrics as a whole, and another considering character count, word count and line count. TF-IDF was used on the lyrics to measure word relevance, and the results showed that the former feature category outperformed the latter word-based features.

2.5.3 Genre classification

Neumeyer et al. [36] used a combination of audio and lyrical features to perform genre classification. The lyrics were pre-processed using the bag of words model and weighted with TF-IDF. Feature selection removed terms that were either very frequent or hardly occurring at all. The classifier used was a SVM, and the results showed that a combination of lyrics and audio gave the best results compared to using them separately.

In [32] the tool jMIR was used to extract audio, symbolic and cultural features from SLAC, a data set containing MP3 recordings, metadata and lyrics for each recording. Creating an entire set of 173 features, they divided these into a total of 15 groups to be used for classification. The results showed that combining features improved the performance, given that cultural features were available. Overall, lyrical features performed poorly in comparison to other features.

Mayer et al. [30] did a more extensive evaluation of genre classification by combining different styles of textual features. Rhyme features, part-of-speech features and statistical features (such as the number of words per line or the number of characters in a word) were compared against the more classical bag of words approach. Classification was done with Naïve Bayes, KNN with different values of k, SVM with linear and polynomial kernels, and Decision Trees. Before starting the classification, the lyrics were manually pre-processed and cleaned. Stemming was applied and yielded better results for some of the classifiers, but not all. For all classifiers, the results showed that the three newly proposed textual features performed the best and that a combination of these features with the classical bag of words approach outperformed using only bag of words.

In 2016, Oramas et al. [38] tried to add sentiment features by using a combination of customer reviews from Amazon (https://www.amazon.com/), metadata from MusicBrainz (https://musicbrainz.org/) and audio descriptors from AcousticBrainz (https://acousticbrainz.org/). Text processing was performed by doing sentiment analysis followed by entity linking. A bag of words model with TF-IDF was used and stop words were removed. The vectors were enriched with information from the entity linking step. During the sentiment analysis, a sentiment score was assigned, creating a group of sentiment features. The model was evaluated using Naïve Bayes and SVM with different combinations of features. The results showed that adding sentiment features outperforms using purely text-based features.

During the last year, two different papers used deep networks for genre classification. In [39], audio, text and images were used as features and input for a convolutional neural network (CNN). To create the text features, all reviews of an album were concatenated into one text and then truncated to the same size. A vector space model with TF-IDF was applied to create a feature vector for each album. To enrich the text, a tool called Babelfy (http://babelfy.org/) was used to map words to Wikipedia categories. The results showed that text-based features outperformed audio and image features, and that the enriched version of the text was superior to the other one. It also showed that using neural networks outperforms more traditional approaches. The other paper [48] used hierarchical attention networks (HAN) with lyrical features for genre classification and compared them with non-neural approaches. The performance of HAN was compared to four other baseline models, including a majority classifier, logistic regression, long short-term memory and hierarchical networks. The results showed that HAN performs better than all earlier attempts at classifying genre using lyrical features.


2.5.4 Topic classification

Mahedero et al. [29] have explored how natural language processing tools can be applied to lyrics in order to perform language identification, structure extraction, categorization and artist similarity searches. For the thematic categorization, the goal was to build a classifier that could recognize five categories: love, violent, protest, christian and drugs. The algorithm used for this purpose was Naïve Bayes, which yielded promising results. Overall, the report concluded that lyrics can be a good complement to audio and cultural metadata features.

Kleedorfer et al. [22] created a vector space model out of lyrics and used non-negative matrix factorization (NMF) to identify clusters of topics, which were then labeled manually. The lyrical pre-processing was done by tokenizing the lyrics and creating a term-document matrix. Stop words and terms with very high or very low document frequency were removed, and term weighting was done using TF-IDF. The results showed that a reasonable portion of the clusters described distinguishable topics and that they were reliably tagged.

In [5], topic analysis was performed using data in the form of audio features and social annotations from last.fm. The authors found that SVM performed best on audio features whereas Naïve Bayes performed best on tag features. The results showed that these two types of features are complementary and should be used together for best results.


Chapter 3

Method

This chapter describes the work flow of the thesis and the methods used. This includes a description of the data sets, the chosen algorithms and how the evaluation was carried out.

3.1 Work flow

After defining the problem, the necessary data is extracted from different sources. The data is pre-processed and vectorized for the text classifiers and then split into training and test data. The training set is then used to train each classifier, and each classifier is evaluated by using the test data to plot a precision-recall curve. Figure 3.1 presents a summary from data collection to result.

3.2 Data set

In this degree project two different data sets are used: one containing the lyrics of the songs and the explicit tags that are used as ground truth, and the other containing additional data about each song to help enrich the lyrics.

3.2.1 Lyrics and explicit tags

Lyrics are obtained through the LyricFind API (http://lyricfind.com/). In addition to the lyrics themselves, the API provides the song name, artist name, lyric language and a song identification number called ISRC (http://isrc.ifpi.org/en/), which is used internationally to identify recordings.

Figure 3.1: A summary of the work flow of this thesis.

The lyric data is joined with the songs manually screened by Soundtrack Your Brand, based on the ISRC number. A summary of the result can be seen in Table 3.1. This results in a total of 27378 tracks.

Table 3.1: Summary of the data structure of the lyrics and explicit data set.

Field      Definition
ISRC       International identification number of a recording.
Artist     The name of the artist.
Title      The title of the song.
Explicit   The explicit tag in boolean form.
Language   The language of the song.
Lyric      The lyric of the song.

3.2.2 Music metadata

Previous work has shown that enriching the lyrics with other data can improve the results for different text classification tasks within music information retrieval. For topic classification, user-annotated data has been used, whereas for genre classification customer reviews and tags have been tried [8, 24, 27, 38].

Lyrics are often in poetic form, using rhymes, and their meaning can sometimes be conveyed indirectly via context, which makes it difficult for an algorithm to grasp what a song is about. To emphasize the meaning of a song, a first idea was to use user-annotated data describing what a song is about. Multiple databases containing this type of information exist, but due to difficulties obtaining publishing rights and the time limitation of this thesis, another direction was chosen.

Spotify offers some metadata for each song via their web API. The metadata includes, but is not limited to, editorial information such as artist name, album and year, and acoustic features such as duration, tempo, valence, energy and mode. Some of these features are chosen for enrichment of the lyrics to see if they improve the results. The extracted features are:

• Artist name - The artist of a song.

• Release year - The year a song was released.

• Energy level - A measure from 0.0 to 1.0 that describes the intensity of a song.

• Valence - A measure from 0.0 to 1.0 that describes the positiveness of a song. A higher value means positive and happy, whereas a low value corresponds to a negative and sad song.

3.3 Software

All code in this thesis is written in Python (version 3.6). The classifiers are built using the Python module Scikit Learn, which integrates several machine learning algorithms for both supervised and unsupervised problems [40]. The Doc2Vec vectorization is done using the library Gensim [43]. An API called Imbalanced-learn is used to handle the imbalance issues with sampling methods [26]. For hyperparameter optimization, a library called Scikit-Optimize is used [13].

3.4 Pre-processing

Pre-processing is used to extract the feature vectors for building and evaluating the model.

3.4.1 Data cleaning

The first step consists of cleaning the lyric data set. The lyrics provided by LyricFind come in multiple languages, and thus all non-English songs have to be removed. The provided data set has a language field, but many entries are missing, and thus the language is detected manually for these records.

Records with lyrics shorter than 100 characters are removed, as they do not contain the correct or complete lyrics of the song. This includes, for example, all records with an empty lyric field or instrumental songs which have the word "Instrumental" in the lyric field. After these steps, 25488 songs are left in the data set.

3.4.2 Feature selection and transformation

The classifiers use two different data sets: one containing only lyrics, and the other containing lyrics, artist name, energy, valence and year. Both data sets use the explicit tag as label.

To combine multiple features as one input, Scikit Learn's FeatureUnion is used, which provides the ability to combine different feature extraction methods for different features; the features used are a combination of text and numbers with different pre-processing requirements. Since the artist feature consists of a string, each artist is mapped to a unique numerical id, which is used instead.
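A minimal sketch of this combination step; the record layout and selector helpers are hypothetical, not the thesis code:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy records: (lyric, artist_id, energy, valence, year).
records = [
    ("clean lyrics about sunshine", 0, 0.4, 0.9, 1998),
    ("explicit lyrics about violence", 1, 0.8, 0.2, 2015),
]

def select_lyrics(X):
    # Pick out the lyric string from each record.
    return [row[0] for row in X]

def select_numeric(X):
    # Pick out the numeric metadata columns as a 2-D array.
    return np.array([row[1:] for row in X], dtype=float)

features = FeatureUnion([
    ("lyrics", Pipeline([
        ("select", FunctionTransformer(select_lyrics)),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("metadata", FunctionTransformer(select_numeric)),
])

X = features.fit_transform(records)
print(X.shape)  # TF-IDF vocabulary size + 4 metadata columns
```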

3.4.3 Feature extraction

To transform the lyrics into a suitable format for the classifiers, two different approaches are used: TF-IDF vectorization and Doc2Vec vectorization. As explained in chapter 2, the first approach is based on word frequency, whereas the second is based on neural networks that capture the relationships between the words. Since they are very different, it is interesting to use both to see which one is best suited for explicit classification.

Both approaches come with different sets of parameters. The parameters of the TF-IDF approach are tuned together with the parameters of the classifier using a parameter grid search. The details of this step are explained in section 3.6. The parameters of the Doc2Vec approach are based on the results of [23], who have done an empirical evaluation of Doc2Vec and provided recommendations on parameter settings for semantic textual similarity tasks. The recommended settings are used for the Doc2Vec model in this degree project. The reason for not tuning the parameters of the Doc2Vec approach is that building the Doc2Vec model takes time, and thus rebuilding the model with new parameters in each iteration of a grid search would not be feasible within the time limitation of this degree project.

(42)

30 CHAPTER 3. METHOD

TF-IDF Vectorization

TF-IDF vectorization is done using Scikit Learn's TfidfVectorizer. For each lyric, a vector is created with each element containing a word frequency. The length of the vector is equal to the size of the vocabulary. The parameters of the vectorizer that are tuned are summarized in Table 3.2. The stop word list used is the built-in one provided by Scikit Learn. Setting use_idf to False means that normalization using TF-IDF is turned off and the bag of words model is used. The resulting vector is then used as input to the classifier.

Table 3.2: The parameter setting used for tuning the TF-IDF vectorization model.

Parameter     Type      Interval
stop_words    string    'english', None
use_idf       boolean   True, False
ngram_range   tuple     (1,1), (1,2), (1,3), (2,2), (2,3), (3,3)

Doc2Vec Vectorization

Doc2Vec is implemented using the Gensim library. For each lyric, a TaggedDocument object is created and then used as input to create the model. The model is initialized by defining the parameters summarized in Table 3.3. After the model is created, the infer_vector() function is used to create the feature vector to use as input to the classifier.

Table 3.3: The parameter setting used for the Doc2Vec vectorization model.

Parameter   Value
window      15
size        300
min_count   1
sample      1e-5
alpha       0.025
min_alpha   0.0001
negative    5
epoch       400
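A sketch of this setup on a toy corpus (assuming the Gensim 4 API, where size and epoch from Table 3.3 are called vector_size and epochs):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

lyrics = [
    "clean lyrics about sunshine".split(),
    "explicit lyrics about violence".split(),
]

# One TaggedDocument per lyric; the tag identifies the document vector.
documents = [TaggedDocument(words, [i]) for i, words in enumerate(lyrics)]

# Parameter values from Table 3.3 (Gensim 4 names).
model = Doc2Vec(
    documents,
    window=15,
    vector_size=300,
    min_count=1,
    sample=1e-5,
    alpha=0.025,
    min_alpha=0.0001,
    negative=5,
    epochs=400,
)

# infer_vector() turns a new (or training) lyric into a feature vector.
vector = model.infer_vector("lyrics about sunshine".split())
print(vector.shape)  # (300,)
```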


3.4.4 Sampling

As the acquired data set is imbalanced (illustrated in Figure 3.2), sampling is investigated to make it more balanced. One drawback of sampling the data is that a majority of it is lost; therefore, the classification is evaluated using both sampled and non-sampled data.

Figure 3.2: A summary of the distribution of the classes in the data set.

The method used is RandomUnderSampler from the library Imbalanced-learn. It reduces the size of the majority class by removing samples randomly until it is the same size as the minority class.
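A minimal sketch of this step (toy data, not the thesis data set):

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Imbalanced toy data: roughly 90% negative, 10% positive.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
print("before:", Counter(y))

# Randomly drop majority-class samples until the classes are equal in size.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```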

3.5 Model selection

Given previous research, SVM is the classifier most frequently used in the literature [5, 16, 17, 24, 30, 36, 38]. Therefore it seems to be a wise classifier to implement. A linear kernel is chosen, as many problems within text categorization have been shown to be linearly separable [20]. Other advantages include its speed compared to other kernels, since only one parameter has to be tuned, which speeds up the grid search a lot. Moreover, KNN and MNB are used to provide a comparison, since they have been used before as well [5, 17, 29, 30, 38]. RF is chosen as it has shown good performance in other text classification problems [51, 52].

A summary of selected classifiers and their corresponding function in Scikit Learn is presented in Table 3.4.

Table 3.4: Summary of the models used and their corresponding function in Scikit Learn.

Classifier Scikit Learn

Linear Support Vector Machine SVC(kernel=’linear’)

Multinomial Naive Bayes MultinomialNB()

k-Nearest Neighbors KNeighborsClassifier()

Random Forest RandomForestClassifier()
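A sketch (default hyperparameters and a made-up two-song corpus, not the tuned thesis setup) of wiring the Table 3.4 models behind a TF-IDF vectorizer in a single scikit-learn pipeline:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

classifiers = {
    "SVM": SVC(kernel="linear"),
    "MNB": MultinomialNB(),
    # n_neighbors=1 only because the toy corpus has two samples.
    "KNN": KNeighborsClassifier(n_neighbors=1),
    "RF": RandomForestClassifier(),
}

lyrics = ["clean lyrics about sunshine", "explicit lyrics about violence"]
labels = [0, 1]  # the explicit tag as label

for name, clf in classifiers.items():
    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    pipeline.fit(lyrics, labels)
    print(name, pipeline.predict(["more lyrics about violence"]))
```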

Each data set is tested with and without sampling, using both kinds of vectorization with all classifiers. This gives a total of 32 configurations, which is summarized in Figure 3.3.

Figure 3.3: A summary of the combinations of data, pre-processing and classifiers that have been evaluated.

3.6 Evaluation

The evaluation is conducted in two steps. The first step consists of finding the optimal parameter settings for all classifiers, and in the second step the classifiers are evaluated and compared to each other using their optimal parameter settings.


3.6.1 Parameter tuning

To tune the parameters of each classifier, Scikit Learn provides a function called GridSearchCV, which takes a set of values for each parameter and then performs an exhaustive search over all combinations of the parameter values. However, the complexity grows exponentially, and for a classifier with parameters with a large set of values (e.g. parameters with values within a continuous range) it is not very scalable.

Instead, a library called scikit-optimize is used for parameter tuning. It provides a function called BayesSearchCV which replaces Scikit Learn's GridSearchCV. It uses Bayesian optimization, where the search space is modelled as a function with the aim of finding the optimal parameter values in as few iterations as possible. It works by constructing a posterior distribution of functions that describes the function to be optimized. For each iteration, the model becomes more certain of which parameter ranges are stronger and which are not. Thus, it does not have to test all combinations in order to find the most optimal one. All parameters are evaluated using 10-fold stratified cross-validation.
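A sketch (illustrative search space, not the thesis settings) of BayesSearchCV as a drop-in replacement for GridSearchCV:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from skopt import BayesSearchCV
from skopt.space import Real

X, y = make_classification(n_samples=200, random_state=0)

# Bayesian optimization over a continuous range for C; an integer cv
# gives stratified k-fold cross-validation for classifiers.
search = BayesSearchCV(
    SVC(kernel="linear"),
    {"C": Real(1e-3, 1e3, prior="log-uniform")},
    n_iter=20,
    cv=10,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```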

3.6.2 Classification

After the tuning, all models are trained and evaluated using the optimized parameters. The metric used for evaluation is the precision-recall curve, as it works well with imbalanced data [10].


Chapter 4

Results

This chapter presents the results of the classification. The results are presented using precision-recall curves and divided into sections per algorithm and per data set.

As mentioned in section 3.5, each data set is evaluated with all classifiers with all combinations of sampling and vectorization, giving a total of 32 configurations. To simplify the presentation of the results, abbreviations are introduced for each classifier, data set and pre-processing method. Table 4.1 presents the abbreviations for each classifier.

Table 4.1: Abbreviations for the classifiers used when presenting the results.

Classifier Abbreviation

Linear Support Vector Machine SVM

Multinomial Naive Bayes MNB

k-Nearest Neighbor KNN

Random Forest RF

Further, the data sets are denoted by 'L' for lyrics only and 'LM' for lyrics combined with music metadata, sampling is denoted by 'S', and vectorization is denoted by 'TF' for TF-IDF and 'D2V' for Doc2Vec. Each configuration is named by concatenating the abbreviations with dots in between; e.g. the lyric data set with sampling, TF-IDF and the classifier RF would be 'L.S.TF.RF', and the lyric data set without sampling, with Doc2Vec and the classifier SVM would be 'L.D2V.SVM'.
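
As a small illustration of the scheme (not part of the original pipeline), the 32 configuration names can be enumerated as a Cartesian product of the abbreviations:

    from itertools import product

    datasets = ['L', 'LM']
    samplings = ['S', None]              # with or without sampling
    vectorizers = ['TF', 'D2V']
    classifiers = ['SVM', 'MNB', 'KNN', 'RF']

    # Concatenate the abbreviations with dots, dropping the sampling
    # marker when no sampling is used.
    names = ['.'.join(part for part in combo if part)
             for combo in product(datasets, samplings, vectorizers, classifiers)]
    print(len(names))    # 32
    print(names[0])      # L.S.TF.SVM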

The following sections present the results using precision-recall curves. For numerical results with accuracy, precision, recall and F1-score, see Appendix A.

4.1 Results per classifier

This section provides precision-recall curves for each classifier.

4.1.1 Linear Support Vector Machine

The precision-recall curves for all configurations using SVM are presented in Figure 4.1a. Of the presented configurations, 'L.S.TF.SVM' has the best performance and 'LM.TF.SVM' the worst.

4.1.2 Multinomial Naive Bayes

The precision-recall curves for all configurations with MNB are presented in Figure 4.1b. As for SVM, the configuration with the sampled lyric data set and TF-IDF vectorization performs best. 'LM.D2V.MNB' has the worst performance.

4.1.3 k-Nearest Neighbor

Figure 4.1c presents the precision-recall curves for the KNN classifier. ’L.S.TF.KNN’ and ’LM.S.TF.KNN’ perform best and ’L.D2V.KNN’ has the worst performance.

4.1.4 Random Forest

Figure 4.1d presents the precision-recall curves for all configurations using the RF classifier. As for KNN, 'L.S.TF.RF' and 'LM.S.TF.RF' perform best and 'L.D2V.RF' has the worst performance.

4.2 Results per data set and sampling

This section presents the results per data set, with and without sampling.


4.2.1 Lyrics

With sampling

For the configurations that use the sampled lyric data set, RF performs best and KNN performs worst, see Figure 4.2a. Looking at the vectorization, configurations using TF-IDF perform better than those using D2V, except for 'L.S.TF.KNN', which performs worse than three of the configurations using D2V.

Without sampling

Figure 4.2b presents the precision-recall curves for the configurations that use the lyric data set without sampling. As for the lyric data set with sampling, RF together with TF-IDF vectorization has the best performance and KNN with D2V vectorization has the worst performance.

4.2.2 Lyrics + Music Metadata

With sampling

Figure 4.2c presents the precision-recall curves for the configurations using the lyric and metadata data set with sampling. RF with TF-IDF vectorization performs best, while MNB with D2V vectorization performs worst.

Without sampling

Precision-recall curves for the lyric and metadata data set without sampling are presented in Figure 4.2d. As for the data set with sampling, RF with TF-IDF vectorization performs best. 'LM.TF.SVM' has the worst performance.


Figure 4.1: Precision-Recall Curves per classifier. (a) Support Vector Machine; (b) Multinomial Naive Bayes; (c) k-Nearest Neighbor; (d) Random Forest.

Figure 4.2: Precision-Recall Curves per data set and sampling. (a) Lyric data set with sampling; (b) Lyric data set without sampling; (c) Lyric and Metadata data set with sampling; (d) Lyric and Metadata data set without sampling.


Chapter 5

Discussion

In this chapter the results of the degree project are discussed and the research question is answered.

The goal of this thesis was to evaluate the possibilities of using machine learning for explicit classification of song lyrics. To the best of the author's knowledge, no previous research on explicit text classification using lyrics exists. The closest research areas found are mood and genre classification. Most of the research within those areas is based either on audio only or on a combination of audio and lyrics, where audio or the combination has been stronger than lyrics alone [5, 15, 16, 24, 34, 36, 46]. Since explicit classification mostly concerns the textual content, audio was not reasonable to use in this study.

However, research within mood and genre classification made it clear that lyrics can sometimes be hard to interpret due to their poetic style, use of slang and rhyme. [30, 32, 38] all used external features to enrich the lyrics, with good results, which motivated their use in this study as well.

Two of the most recent reports within these areas were [39] and [48], who both used neural networks, which made it interesting to consider them in this project as well. The reasons for not doing so were partly that the data sets were too small, but also the time limitation of the degree project.


5.1 Model evaluation

As illustrated by the precision-recall curves in Figure 4.2, it is clear that the RF classifier has the best performance, performing best in all configurations regardless of data set and of whether sampling is used. The top performing configurations are 'LM.S.TF.RF' and 'L.S.TF.RF'. The worst performing algorithms are KNN on the lyric data set and MNB on the lyric and metadata data set. RF performing well was expected, since that classifier is known to perform well on high-dimensional data such as text. The reason why MNB performed so poorly when combining different kinds of features might be that the classifier is sensitive to feature selection and that not all features fit the multinomial distribution.

5.2 Sampling vs. No Sampling

The biggest difference between configurations using sampled and non-sampled data is that sampling reduces accuracy but increases precision and recall (Table A.1). The most likely reason is that a classifier favors the majority class when the classes are imbalanced, which leads to a high number of correct classifications on the majority class but few on the minority class. This in turn means a high accuracy, since the majority class makes up almost all data records. The curves in Figure 4.1 indicate this as well, where most of the configurations using sampled data perform better than those without sampling.
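
A toy example (not thesis data) makes the effect concrete: a classifier that always predicts the majority "clean" class reaches high accuracy while finding none of the explicit tracks. The zero_division argument assumes a reasonably recent Scikit Learn version.

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_true = [0] * 95 + [1] * 5   # 95 clean tracks, 5 explicit tracks
    y_pred = [0] * 100            # always predict the majority class

    print(accuracy_score(y_true, y_pred))                    # 0.95
    print(recall_score(y_true, y_pred, zero_division=0))     # 0.0
    print(precision_score(y_true, y_pred, zero_division=0))  # 0.0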

5.3 TF-IDF vs. Doc2Vec Vectorization

The results in Figure 4.2 show that TF-IDF vectorization outperforms Doc2Vec vectorization in almost all configurations. One possible reason is that the Doc2Vec model was not trained with enough data, since it is built on a neural network, which is sensitive to small data sets. One way to address this would be to use a larger data set for building the Doc2Vec models.
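
A minimal sketch of that remedy, assuming gensim and a tokenized external corpus (large_lyrics_corpus is a hypothetical name), would train the Doc2Vec model on the larger corpus and then infer vectors for the labelled lyrics:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Train on a larger external corpus of tokenized lyrics ...
    documents = [TaggedDocument(words=tokens, tags=[i])
                 for i, tokens in enumerate(large_lyrics_corpus)]
    model = Doc2Vec(documents, vector_size=100, min_count=2, epochs=20)

    # ... then infer a vector for each lyric in the labelled data set.
    vector = model.infer_vector(['first', 'lyric', 'tokens'])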


5.4 Data set evaluation

The biggest difference between the two data sets was that the LM data set included categorical features, whereas the lyric data set only included numerical data. Machine learning models handle categorical data with varying success; KNN and SVM are based on Euclidean distance and thus should not be expected to perform well with categorical data.

Figures 4.1a and 4.2d show that 'LM.TF.SVM' performed the worst of all configurations. All other configurations using SVM and the LM data set performed worse than the SVM configurations using the lyric data set, except for 'LM.S.D2V.SVM', which performed second best (Figure 4.1a). MNB had the worst performance on the LM data set (Figure 4.1b), where all configurations using the lyric data set performed better than those using the LM data set.

Oddly, the configurations using the LM data set performed well with KNN and RF, as can be seen in Figures 4.1c and 4.1d, where the curves for the configurations using the LM data set are almost the same as those using the lyric data set. Thus, the conclusion that can be drawn is that the choice of data set is very dependent on the algorithm chosen. One way to get around categorical data is to use a one-hot encoder for the artists. This means creating one feature per artist and setting it to 1 for the correct artist and 0 for all other artist features. However, since there were over 7000 different artists, this would have increased the dimensionality so much that it would be very impractical to use. In addition, one-hot encoded data would not perform well with MNB, since the encoded features would depend on each other.
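
A minimal sketch of the rejected one-hot approach, using pandas on a few hypothetical artist names, shows how each distinct artist becomes its own column; with over 7000 artists the frame would gain over 7000 columns.

    import pandas as pd

    # Each distinct artist becomes a separate 0/1 indicator column.
    artists = pd.Series(['Artist A', 'Artist B', 'Artist A'])
    one_hot = pd.get_dummies(artists).astype(int)
    print(one_hot)
    #    Artist A  Artist B
    # 0         1         0
    # 1         0         1
    # 2         1         0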


Chapter 6

Conclusion and Future work

In this thesis, 32 configurations using different data sets, vectorization methods and algorithms have been implemented and evaluated with the purpose of classifying explicit music content. The results show that the best configurations in terms of precision-recall curves are 'LM.S.TF.RF' and 'L.S.TF.RF'. Overall, the two data sets gave mixed performance depending on the algorithm, the sampled data sets outperformed the ones without sampling, and TF-IDF vectorization outperformed D2V vectorization.

For future work, one should first focus on collecting more data. One way to expand the data set is to join the LyricFind database with the explicit database using other fields than ISRC, as ISRC is unique for each recording and the same song can have different recordings while still having the same lyrics. This means that, at present, the same song with different recordings would not match.

Another way to expand the data is to include more or other features. One original idea for this project was to make use of user-annotated data on what a song is about. The idea is that this would help in understanding the context of a song, since lyrics are poetic, with a meaning that is sometimes expressed indirectly.

Some lyrics contained words that were not part of the actual song text, e.g. "Chorus x2" and "Intro". These were not excluded in this project, as that would have required manually going through all lyrics, which was not feasible due to the time limitations of this project. To manually clean the songs and perform stemming would be interesting to look at, to see if the results could be improved.

It would also be interesting to explore a multi-label version of the problem, where different kinds of explicit language could be detected. However, this requires creating ground truth for each explicit class before a classifier can be trained and evaluated, something that was not available for this project.

References
