
Identification model of musical works using record linkage

PIERRE COURNUT

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Identification model of musical works using record linkage

PIERRE COURNUT

Master in Machine Learning
Date: March 7, 2019

Email: cournut@kth.se
Supervisor: Bob Sturm
Examiner: Sten Ternström
Principal: Marc Legroux
Host company: IBM France

School of Electrical Engineering and Computer Science


Abstract

This thesis is based on a project that is part of IBM's collaboration with a Collecting Right Organization that collects and distributes payments of authors' rights. The project aimed at helping this organization identify right beneficiaries for musical tracks listened to on online streaming platforms. Given as an input a list of tracks composed of metadata such as artist names, titles and listening statistics, the goal was to match each line with its corresponding element in this organization's documentation. Since each broadcaster has its own catalogue of music, it can sometimes be hard to find the correct match for each song. In practice, this organization has a dedicated team that manually handles some of the non-trivial cases. Although their identification process focuses on the resources that contribute 90% of the revenue of each listening report, it achieves an identification rate of only around 70% of the declared resources, which leaves a substantial amount of unprocessed tracks aside.

In this thesis, we investigate the possibility of outperforming the current solution and design a new identification model that combines concepts and technologies from various fields including search engines, string metrics and machine learning. First, the identification process used by the organization was reproduced and refined to quickly process the most trivial cases. On top of this, an identification algorithm that relies on a machine learning framework was built to process non-trivial cases. This method showed very promising results since it achieves an identification rate and a false discovery rate on the order of those of the current solution without the use of a dedicated team of experts.


Sammanfattning

This degree project contributes to a collaboration between IBM and a collecting rights organization that collects and distributes royalties to authors. The project aimed at helping this organization identify rights holders for musical works played on streaming platforms. Given a list of works with metadata such as artist names, titles and listening statistics, the goal was to match each line with its corresponding element in the organization's documentation.

Since every music distributor has its own music catalogue, it can be hard to find the correct rights holder for a given work. In practice, this organization has a team that handles the non-trivial cases manually. This identification work focuses on resources that contribute 90% of the revenue of each listening report, and achieves an identification rate of around 70%. A substantial amount of unprocessed listening reports is thus left aside, which leads to losses for the rights holders.

In the present work, the possibility of outperforming the current solution was investigated. A new identification model was designed that combines concepts and technology from various fields, including search engines, string metrics and machine learning. First, the identification process used by the organization was reproduced and refined in order to quickly process the most trivial cases. On top of this, an identification algorithm based on machine learning was added to process non-trivial cases. The method showed very promising results; it achieves an identification rate and an error rate of the same order of magnitude as the current solution, without the use of human experts.


Contents

1 Introduction
1.1 Context
1.2 Problem at hand
1.3 Motivation

2 Background
2.1 Previous related work in record linkage
2.1.1 Blocking strategies
2.1.2 Pairing strategies
2.2 Overview of information retrieval
2.2.1 Ranked search
2.2.2 Inverted index
2.2.3 tf-idf score
2.3 Overview of string metrics
2.3.1 Edit-based methods
2.3.2 Token-based methods
2.3.3 Phonetics-based methods
2.4 Overview of machine learning
2.4.1 The task of supervised classification
2.4.2 Naive Bayes classifiers
2.4.3 Support Vector Machines
2.4.4 Random Forests

3 Methods
3.1 Model overview
3.2 Documentation storage
3.3 Preprocessing steps
3.4 Search strategies
3.4.1 Exact search
3.4.2 Fuzzy search
3.5 Feature extraction
3.6 Classifier
3.6.1 Model training
3.6.2 Model prediction
3.7 Postprocessing steps

4 Experimental details
4.1 Evaluation
4.2 Datasets
4.3 Direct processing
4.4 Feature selection
4.4.1 Feature ranking
4.4.2 Wrapper methods

5 Results
5.1 Naive Bayes classifier
5.2 Support Vector Machine
5.3 Random Forests

6 Discussion
6.1 Nature of selected features
6.2 Trade off between identification and false discovery
6.3 Model selection
6.4 Extrapolation

7 Conclusion and future work
7.1 Conclusion
7.2 Future work

Bibliography


1 Introduction

1.1 Context

The Collecting Right Organization that we worked with collects payments of authors' rights and distributes the rights to the original songwriters, composers and music publishers. With the surge of the internet, more and more users have been sharing and listening to music online. As a result, this organization is facing difficulties handling this huge amount of data, and royalties are not always well distributed among rights holders.

The aim of the collaboration between IBM and the Collecting Right Organization is the design of an online platform which handles the processing of all this data and proceeds to the payment of rights holders. This all-in-one solution is made of different processing steps including integration of reports, identification of tracks, pricing, annotation and finally the billing and distribution among contributors. I worked on the identification part.

My project aimed at helping the Collecting Right Organization identify music tracks for a special type of clients, the Digital Service Providers (DSP), which include Spotify and Apple Music for instance. Those DSPs hand in Digital Sales Reports (DSR) weekly or monthly. DSRs state all the tracks that the DSPs have played together with frequencies, dates and geographies. Since each DSP has its own catalogue of music, it can sometimes be hard to find the correct match for each song in the organization's documentation. The main issues that the identification process has to overcome are missing fields, misspellings ("Helo" instead of "Hello" for example), additions ("official music" for instance), the use of different artist names for the same person ("Taylor Swift", "Taylor Alison" and "Taylor Alison Swift" are the same person), or other inaccuracies in the data. In practice, the organization has a dedicated team of experts that manually handles some of the non-trivial cases.

Whereas the Collecting Right Organization's identification process focuses on resources that contribute 90% of the revenue of each listening report, it achieves an identification rate of only around 70% of the declared resources, which leaves a substantial amount of unprocessed tracks aside. Not all those tracks could generate additional revenue for rights holders, because a fairly large number of streams is required before generating the minimum amount eligible for distribution. Nevertheless, a higher identification rate would improve the traceability of the distribution process and in some cases avoid possible losses for authors. My project was to investigate different types of identification models, combining concepts and technologies from various fields including search engines, string metrics and machine learning.

1.2 Problem at hand

Input: a Digital Sales Report (DSR) containing L lines, each of which consists of the following selected fields:

• artist;

• title;

• International Standard Recording Code (ISRC): enables recordings to be uniquely and permanently identified.

Master catalogue: the Collecting Right Organization's documentation, containing 120 million track lines and consisting of the same fields as above plus an identification code, specific to the organization, called COCV.


Goal: find a unique corresponding element in the documentation for each input line. This problem can be associated with that of record linkage where one wishes to determine if two database records refer to the same entity.

Objectives: we are aiming for a model that achieves a better identification rate than the current solution used by the organization and a limited number of incorrect identifications.

Figure 1.1: Representation of the purpose of the identification model. Each input DSR line (blue) has to be associated with a unique corresponding element in the documentation (yellow).

1.3 Motivation

The identification model currently used by the Collecting Right Organization consists of successive parts. First, report lines that contain an International Standard Recording Code (ISRC) are processed. On the sample of reports that we worked with, this represents around 45% of the total number of lines. Then, around 15% of lines can be processed using strict string comparisons on title and artist field values. At this point, around 60% of the total number of lines have been processed, which constitute what we call trivial cases. Finally, the organization has a dedicated team that manually handles some of the remaining unprocessed lines.


Whenever a musical work has been identified by the team, a new association is stored so that the same line will lead to an automatic identification next time. This system allows the organization to enhance its identification rate gradually but is not robust to new musical works and new inconsistencies. On the sample of reports that we worked with, the overall process achieves an identification rate of around 70% and a very low false discovery rate of the order of 1%.

Our purpose is to replace the manual work carried out by the dedicated team with automated processing through the use of machine learning. To do so, we can get inspiration from the way they proceed with manual identification. We, as humans, would probably check in the organization's documentation for several candidate matches, and then evaluate which one is the best depending on several characteristics such as the presence of similar words or patterns. This possible approach to tackle the problem could motivate the use of a search engine combined with a machine learning classifier that draws conclusions based on similarity measures, as we will describe below.


Figure 1.2: Diagram representation of our objective. The current solution used by the Collecting Right Organization (left) processes around 60% of input DSR lines through automatic identification and then a team of experts manually handles around 10% additional lines, thus reaching an overall 70% identification rate. The solution we are aiming for (right) should automatically process more than 70% of input lines with the help of machine learning (ML).


2 Background

Our problem of finding a unique corresponding element in the documentation for each line of a report can be associated with that of record linkage, where one wishes to determine if two database records refer to the same entity. The problem of record linkage first occurred in the public health sector in 1959, when files of individual patients were brought together using names, birth dates and other information [1]. Since then, it has been addressed using different approaches and referred to as "record linkage" [2, 3, 4] but also "duplicate detection" [5, 6, 7, 8, 9] or "approximate string matching" [10, 11, 12, 13, 14, 15].

This problem of matching an element with a corresponding entity has various applications in computational biology [16, 17, 18] (e.g. finding DNA subsequences after possible mutations), in signal processing [19, 20, 21] (e.g. retrieving musical passages similar to a sample) and in information retrieval [22, 23, 24] (e.g. data cleaning).

In this chapter, we first present the problem of record linkage and previous related work on the matter, and confront each approach with the needs of our specific context. We then provide technical details for selected methods in the following sections.


2.1 Previous related work in record linkage

Record linkage is the methodology of bringing together corresponding records from two or more files or finding duplicates within files [4]. One of the main problems of record linkage is the scalability of the different approaches. With d databases of n records each, brute-force approaches, using all-to-all comparisons, require $\binom{n}{d}$ comparisons, which can be quickly prohibitive [25]. One way to drastically reduce the number of comparisons made, without compromising linkage accuracy, is to use "blocking". Blocking is about finding a subset of pairs of records which are likely to be matched together. Then, for each pair, we apply a pairing function which decides if the records should be matched or not. The problem of record linkage can thus be subdivided into two steps: blocking and pairing. We now present major blocking and pairing techniques.

2.1.1 Blocking strategies

Blocking divides records into mutually exclusive and jointly exhaustive "blocks", allowing the linkage to be performed within each block [25]. More specifically, blocking is about restricting comparisons to just those records for which one or more discriminating identifiers agree. One example in the case of medical patient records would be to only consider pairs of individuals that have the same birth month.

In our specific case, where fields contain inconsistencies, corresponding but non-equal values and missing values, it is necessary to allow for flexibility at this step. This flexibility can be obtained using techniques developed for problems such as approximate string matching [10, 11, 12, 13, 14, 15] and all pairs similarity search [26, 27], where one wishes to find a pattern in a text where one or both have suffered some kind of undesirable corruption. Methods to tackle these problems are usually categorized between on-line searching and index-based searching, depending on whether the text can be processed to build an index on it or not.

On-line searching methods [10, 12, 15] are most often based on hand-coded functions that involve string metrics and fine-tuned thresholds. Although very fast on-line algorithms exist, many applications handle such large texts that no on-line algorithm can provide acceptable performance.

An alternative approach, when the text is large and searched frequently, is to build an index on the text beforehand and use it to speed up searches [6, 13, 14, 26, 27]. In our specific context, the size of the documentation is an incentive to use such index-based searching methods. Moreover, the preprocessing time necessary to build an index is not an issue since it only happens once, upstream of the identification pipeline. We will present notions of information retrieval that are relevant to our context in section 2.2.

2.1.2 Pairing strategies

Pairing is about deciding if a pair of records refers to the same entity, and thus whether they should be matched together or not. Most techniques include the extraction of features derived from field comparisons followed by the application of a pairing function.

Following the blocking operations, we obtain pairs of potential matching candidates. For each such pair of records, we must extract features that contain information on whether those records should be matched or not. The intuitive way of doing so is to measure some kind of distance between each pair of corresponding fields. Several types of distance have been explored:

• edit-based distances: drawn from the number of operations required to transform one string into the other [10], which is appropriate when a small number of differences is to be expected, as in the case of typographical errors;

• token-based distances: drawn from set operations on the sets formed by groups of characters that constitute each string [13, 14], which is adapted to detecting similar patterns in different strings without taking positional information into account;

• phonetic-based distances: strings are compared in terms of sound [11], which is particularly relevant in the case of name comparisons for example;

• learned distances: rather than hand-tuning a distance metric for each field, we can use trainable similarity metrics, learned from a corpus of labeled examples. The explored approaches include an extended variant of the edit distance [7, 28] and a vector-space-based measure that employs a Support Vector Machine for training [7].

We will present string metrics that are relevant to our context in section 2.3. Once the features that represent each pair of records have been extracted, a pairing function must be applied so that pairs are classified into match and mismatch classes. Most existing solutions rely on hand-coded functions that combine those features by using thresholds and Boolean conditions. However, rules and thresholds are domain dependent in the sense that they have to be assessed for each new dataset. One way of reducing the tedium of hand-coding such functions is to relegate the task of distinguishing between duplicates and non-duplicates to a machine learning algorithm [7, 8]. We will present the machine learning framework and models adapted to our context in section 2.4.

2.2 Overview of information retrieval

Information Retrieval (IR) is the task of finding, in large unstructured sets of documents, relevant information that satisfies an information need of a user. An information need is the topic about which the user desires to know more.

2.2.1 Ranked search

An information retrieval process begins when a user enters a query into the system. A query is what the user communicates to the computer in an attempt to convey the information need. However, a query may not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy. A document is said to be relevant if it contains information of value with respect to the information need of the user [29]. IR systems usually compute a numeric score that reflects the extent to which each object in the database matches the query. Objects are then ranked according to this value and returned to the user. Most IR systems use the "term frequency - inverse document frequency" (tf-idf) weighting scheme to design a scoring function [30]. We will describe it in further detail after presenting one of the major concepts of IR: the inverted index.

2.2.2 Inverted index

An inverted index consists of a list of all unique words that appear in any document and, for each word, a list of the documents in which it appears. Words are conventionally referred to as terms or tokens, and the associated lists of appearance are called "postings lists" [29].

For example, let us say that we have two text documents:

1. The quick brown fox jumped over the lazy dog
2. Quick brown foxes leap over lazy dogs in summer

In this case, the inverted index would look something like this:

Figure 2.1: An inverted index, as represented in the documentation of Elasticsearch [31], a real-time distributed search and analytics engine used to store and explore data in this project.

The purpose of an inverted index is to allow fast full text searches, at the cost of increased processing when a document is added to the database [32]. Retrieving all documents that contain a sequence of words gathered into a query reduces to intersecting the postings lists of each word. However, whenever a new document is added to the database, we need to index it, which involves updating the postings lists of each word that appears in that document.
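To make the idea concrete, here is a minimal sketch in Python (not from the thesis; the whitespace tokenization and integer document identifiers are simplified assumptions) of building an inverted index and answering a multi-word query by intersecting postings lists:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of document ids in which it appears."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every token of the query."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)
print(search(index, "quick brown"))  # {1, 2}
print(search(index, "lazy dog"))     # {1}
```

Indexing every new document keeps the postings lists up to date, which is exactly the extra write-time cost mentioned above.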

2.2.3 tf-idf score

A scoring function assesses the relevance of a document d with regard to a query. The tf-idf score of a document d is the sum of the tf-idf weights associated with each query term t with regard to document d. The tf-idf weighting scheme is the combination of two weights, the Term Frequency and the Inverse Document Frequency [29].

First off, a document that mentions a query term t more often has more to do with that query and therefore should receive a higher score [33]. Toward this end, each document is assigned a weight for that term that depends on the number of occurrences of the term in the document, called the Term Frequency $\mathrm{tf}_{t,d}$.

However, term frequency alone is not sufficient to assess the relevance of a document with regard to a query, since all terms in the query are considered equally important. Words that appear most often have no discriminating power in determining relevance and should thus be weighted less. Hence the introduction of the concept of Inverse Document Frequency ($\mathrm{idf}_t$) [34], which diminishes the weight of terms that occur very frequently in the set of documents and increases the weight of terms that occur rarely. Denoting the total number of documents by N, the Inverse Document Frequency of a term t is derived from the Document Frequency $\mathrm{df}_t$, which is the number of documents in the collection that contain the term t, as follows:

$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$

Lastly, the tf-idf weighting scheme assigns to term t a weight in document d given by:

$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$


2.3 Overview of string metrics

In our specific case, string fields are quite dissimilar from one another, so we will try to combine string metrics of different kinds. A string metric is a metric that measures the distance between two text strings. There are several types of string metrics, and we will here present those that are based on edit operations, token division and phonetics.

2.3.1 Edit-based methods

Edit distances quantify how close two strings are to one another by counting the minimum number of operations required to transform one string into the other. The distance d(x, y) between two strings x and y is the minimal cost of a sequence of operations that transform x into y. Each operation δ is associated with a cost c. Denoting ε the empty string, operations are usually limited to:

• Insertion δ(ε, a): inserting the letter a;

• Deletion δ(a, ε): deleting the letter a;

• Substitution δ(a, b): substituting a by b, for a ≠ b;

• Transposition δ(ab, ba): swapping the adjacent letters a and b, for a ≠ b.

Based on those operations, the most commonly used distance functions are:

• Levenshtein distance [35]: allows for insertions, deletions and substitutions. All operations cost 1 in the simplified definition;

• Hamming distance [36]: only allows substitutions, which cost 1.
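As a concrete illustration, a minimal dynamic-programming sketch of the unit-cost Levenshtein distance, together with a normalization into a [0, 1] similarity of the kind described in section 3.5 (the normalization by the longer string length is our assumption):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized similarity: 1.0 for equal strings, 0.0 when every character differs."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("Helo", "Hello"))                              # 1
print(round(levenshtein_similarity("Taylor Swift", "Taylor Alison Swift"), 3))
```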

2.3.2 Token-based methods

A string can be considered as a set of words (or tokens), and each word can itself be considered as a set of characters or as a set of n-grams (contiguous sequences of n characters). Set operations on a pair of such sets are the foundation of several token-based distances between two strings. Denoting A and B the sets of words of two strings s and t, we consider the following similarity functions (inverse of distance):

• Jaccard index: $\frac{|A \cap B|}{|A \cup B|}$;

• Sørensen-Dice coefficient: $\frac{2|A \cap B|}{|A| + |B|}$;

• Overlap coefficient: $\frac{|A \cap B|}{\min(|A|, |B|)}$.

We also considered the Ratcliff/Obershelp similarity, which is the number of matching characters divided by the total number of characters in the two strings. Matching characters are those in the longest common subsequence plus, recursively, matching characters in the unmatched region on either side of the longest common subsequence.
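A small sketch of these token-based similarities (Python's difflib.SequenceMatcher.ratio is used as a stand-in for the Ratcliff/Obershelp similarity, which the difflib documentation describes as a close variant of that algorithm):

```python
from difflib import SequenceMatcher

def token_set(s: str) -> set:
    return set(s.lower().split())

def jaccard(s: str, t: str) -> float:
    A, B = token_set(s), token_set(t)
    return len(A & B) / len(A | B) if A | B else 1.0

def sorensen_dice(s: str, t: str) -> float:
    A, B = token_set(s), token_set(t)
    return 2 * len(A & B) / (len(A) + len(B)) if A or B else 1.0

def overlap(s: str, t: str) -> float:
    A, B = token_set(s), token_set(t)
    return len(A & B) / min(len(A), len(B)) if A and B else 0.0

def ratcliff(s: str, t: str) -> float:
    # difflib's ratio() follows a Ratcliff/Obershelp-style matching.
    return SequenceMatcher(None, s.lower(), t.lower()).ratio()

a, b = "Taylor Swift", "Taylor Alison Swift"
print(jaccard(a, b), sorensen_dice(a, b), overlap(a, b), round(ratcliff(a, b), 3))
```

Note that the overlap coefficient reaches 1.0 here because every token of the shorter string appears in the longer one, which is exactly the positional-information-free behaviour mentioned in section 2.1.2.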

2.3.3 Phonetics-based methods

Phonetic string metrics measure the distance between two strings in terms of sound and are used in applications such as name retrieval [11]. The most common phonetic matching techniques include:

• Soundex: a phonetic algorithm patented in 1918 [37] that creates the sound footprint of a string of characters. It uses codes based on the sound of each letter to translate a string into a canonical form of at most four characters, preserving the first letter;

• NYSIIS (New York State Identification and Intelligence System): a phonetic algorithm that was developed to describe names [38] and that involves successive rules of letter-group transformations;

• Metaphone: a phonetic algorithm developed in 1990 [39] as a response to the impairments of the Soundex algorithm. It includes basic rules of English pronunciation to make the algorithm more robust. Its creator designed a second version of the algorithm in 2000 [40] that takes into account the transcription of some other languages into Latin characters.
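For illustration, a minimal sketch of a simplified American Soundex (h and w are treated like vowels here, which differs slightly from the archival variant of the algorithm):

```python
def soundex(name: str) -> str:
    """Simplified American Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    digits = [codes.get(c, "") for c in name]
    # Collapse adjacent identical codes; vowels (empty codes) keep repeats apart.
    collapsed = []
    for d in digits:
        if not collapsed or d != collapsed[-1]:
            collapsed.append(d)
    out = name[0].upper() + "".join(d for d in collapsed[1:] if d)
    return (out + "000")[:4]

print(soundex("Hello"), soundex("Helo"))    # H400 H400: the misspelling maps to the same code
print(soundex("Robert"), soundex("Rupert")) # R163 R163
```

Comparing such codes for equality is what yields the phonetic features used later in the feature extraction step.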


2.4 Overview of machine learning

In our specific case, the current identification model used by the Collecting Right Organization relies on hand-coded functions for pairing. However, this method has shown its limitations. The main difficulty lies in the fact that new music tracks are uploaded to streaming platforms every day, leading to new kinds of inconsistencies in the data that is to be matched with the organization's documentation. The algorithm used today was designed about two decades ago, leaving plenty of new kinds of dissimilarities unprocessed. Machine learning models used as pairing functions, through training on a corpus of examples, could resolve the tedium of hand-coding new rules.

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed [41]. We can broadly define it as computational methods that use experience obtained through training on previous examples to make predictions or to improve performance. Given the nature of our problem and data, we will here solely focus on classification problems solved in a supervised learning framework.

2.4.1 The task of supervised classification

Supervised classification is about identifying to which of a set of classes a new instance belongs, based on information extracted from a training set of data that contains instances of known classes. Examples include the classification of emails into "spam" and "non-spam" or the medical diagnosis of a patient based on vital signs (blood pressure, age, sex).

Binary classification is a particular case of classification tasks where instances are to be divided into two classes, usually denoted 0 and 1. A classifier appropriate for our task should be able to learn well from a limited training set and must also produce meaningful confidence estimates that correspond to relative likelihoods of each class label. Based on those requirements, we will compare three types of classifiers: naive Bayes classifiers, Support Vector Machines and Random Forests. In the following subsections, we will provide a general description of those models, describe in what way they differ, as well as their advantages and disadvantages.


2.4.2 Naive Bayes classifiers

Naive Bayes methods are a set of probabilistic classifiers based on Bayes' theorem combined with a strong independence assumption between the predictor variables [42]. Let us say that we want to design a classifier that models the decision to play tennis Y (True or False) based on weather attributes X (outlook, wind, rain, temperature, etc.). Given a new observation of weather attributes X, we want to assign the class label Y such that $P(Y \mid X)$ is maximal. Bayes' theorem allows us to express this conditional probability as:

$P(Y \mid X) = \frac{P(X \mid Y) \times P(Y)}{P(X)}$

$P(X)$ is not class dependent and $P(Y)$ is the a priori probability of each class among all samples. However, $P(X \mid Y)$ is sometimes more difficult to compute, which is why we introduce the naive assumption of attribute independence. This allows us to decompose this conditional probability as follows:

$P(x_1, \dots, x_k \mid Y) = \prod_{i=1}^{k} P(x_i \mid Y)$

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $P(x_i \mid Y)$. For model selection (chapter 5), we consider several distributions. Fitting the model involves deriving the probabilities $P(x_i \mid Y)$ for each attribute i based on the relative frequency of each class among samples for each value of $x_i$. At classification time, the model is combined with a decision rule. The most common rule, called maximum a posteriori, is to choose the most probable class as the prediction.

Naive Bayes classifiers are simple to train, as training only involves estimating conditional probabilities for each feature-class pair. They work surprisingly well in practice and are often compared with much more sophisticated techniques [43]. However, even though the naive assumption makes computations possible, it can be inefficient in practice as attributes are often correlated [44].
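As a toy illustration (not from the thesis; the feature values are invented and scikit-learn's Gaussian variant is used here, whereas chapter 5 compares several distributions), fitting a naive Bayes classifier and reading out the class-probability estimates that later serve as confidence scores:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy feature vectors for (report line, documentation line) pairs:
# columns could be, e.g., a title similarity and an artist similarity.
X_train = np.array([[0.95, 0.90], [0.88, 0.75], [0.20, 0.30],
                    [0.10, 0.60], [0.92, 0.85], [0.35, 0.25]])
y_train = np.array([1, 1, 0, 0, 1, 0])  # 1 = match, 0 = mismatch

clf = GaussianNB().fit(X_train, y_train)

X_new = np.array([[0.90, 0.80], [0.15, 0.40]])
print(clf.predict(X_new))        # predicted labels
print(clf.predict_proba(X_new))  # per-class probability estimates
```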


2.4.3 Support Vector Machines

The Support Vector Machine (SVM) is a generalization of a simple and intuitive classifier known as the maximum margin classifier, where classification is done by separating data points with a hyperplane that is as far as possible from the observations. In some cases, there exists no hyperplane that separates the data linearly, and SVMs are designed to tackle this specific setting. The way to address non-linearity in SVMs is to enlarge the feature space by adding features that are created as functions of the original variables (quadratic or cubic, for example). Although the decision boundary is linear in the higher-dimensional feature space obtained through a kernel [45], it is non-linear in the original feature space, as seen in figure 2.2.

Figure 2.2: Example of two SVMs: one with a polynomial kernel of degree 3 (left) and the other with a radial kernel (right). Both SVMs capture the decision boundary which could not be found linearly in the original space. Reprinted from [45].

One big advantage of the SVM is that it has proven to perform well in many text classification problems, especially in terms of accuracy [46, 47, 44]. One disadvantage is that the implementation of the model scales badly with the number of documents [44].
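To illustrate the effect of the kernel, a small sketch (assuming scikit-learn; the XOR-style data is invented) comparing a linear and an RBF kernel on data that no single hyperplane can separate in the original space:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: the two classes sit on opposite diagonals.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [0.1, 0.1], [0.9, 0.9], [0.1, 0.9], [0.9, 0.1]])
y = np.array([0, 1, 1, 0, 0, 0, 1, 1])

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=5.0, C=10.0).fit(X, y)

print("linear training accuracy:", linear.score(X, y))  # typically around chance level
print("rbf training accuracy:   ", rbf.score(X, y))     # typically fits the training set
```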


2.4.4 Random Forests

In order to understand the concept of a Random Forest, the concept of decision trees first has to be introduced. Decision trees are tools that model decision-making by representing a set of decisions and consequences as a tree-like structure [48]. In the case of classification trees, where the set of final values is finite, leaves represent class labels and branches represent conjunctions of features that lead to those class labels.

Assuming a set of data and a set of labels, the objective is to find a partition of the data based on the distinct values of the features that results in clusters of homogeneous class labels. This notion of homogeneity is formalized using some impurity measure, which we can here reduce to the number of misclassified data points for simplicity. The construction of the tree starts with a single node called the root. Then, recursively, each node is split based on some question such that the node impurity is maximally decreased. Figure 2.3 represents a decision tree that formalizes the decision of whether someone would like to play tennis.

Although very understandable and visual, decision trees suffer from high variance that usually leads to overfitting of the training set. To overcome this problem, the predictions of several randomized trees can be combined into a single model using ensemble learning methods. The collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual and can be harnessed by voting [49]. Ensemble learning methods take advantage of this postulate by aggregating high-variance, low-bias classifiers (called weak learners) to reduce the variance of the ensemble classifier. We will here present a particular ensemble learning method applied to decision trees called the Random Forest (RF).

Random Forest is an ensemble learning method where decision trees are combined according to the Bagging method [50]. Bootstrap Aggregating (Bagging) is about using bootstrap replicates of the training set, obtained by sampling with replacement. On each bootstrap replicate, one decision tree is trained, and all of those classifiers are combined using majority voting. This process induces less correlation between weak learners, thus leading to lower overall variance for the combined model.


Figure 2.3: Decision tree representation of a decision to play tennis. The decision-making process starts at the root node (top of the tree) and each weather attribute configuration defines a way to browse the different branches to a leaf node that constitutes a final decision.


Figure 2.4: A Random Forest, from training to prediction. Training data is split into random subsets by sampling with replacement. On each of those subsets, a decision tree is trained following the methodology previously described. At prediction time, each decision tree is fed with the input features and the different outputs are combined using majority voting to produce the final predicted class.


3 Methods

3.1 Model overview

In this chapter, we describe the model that we designed and provide details on each important step in the following sections. The identification pipeline, represented in figure 3.1, contains the following steps:

• preprocessing operations on raw input data to extract relevant fields and remove recurrent useless patterns (3.3)

• direct processing of trivial lines through exact search (3.4.1)

• blocking through fuzzy search (3.4.2)

• pairing through feature extraction (3.5) and model prediction (3.6.2)

• postprocessing operations to select the dominant match for each input line (3.7)

Documentation storage (3.2) as well as model training (3.6.1) only have to be processed once beforehand and thus do not appear in the identification pipeline.


Figure 3.1: Identification pipeline


3.2 Documentation storage

For performance reasons, both in terms of accuracy (identification rate) and efficiency (computing time), we want to keep only a few candidate lines from the documentation for each report line before applying the model.

To access the documentation lines that are closest to the report line being considered, we use an inverted index that we query using direct or fuzzy search. The purpose of an inverted index is to allow fast full text searches, at the cost of increased processing when a document is added to the database (2.2.2). Since the documentation is only indexed once, upstream, we do not care about the additional processing time of that indexing. However, the processing time gains provided by the inverted index at query time are much appreciated.

3.3 Preprocessing steps

Once the Digital Sales Report (DSR) is loaded and the relevant fields have been extracted, we need to perform some preprocessing steps to clean the data.

Among other things, those operations include the removal of several useless patterns, punctuation and signs that do not help identification and that could blind the model. Some words and signs indeed do not carry any relevant information, and the model needlessly focusing on those would inevitably mean a drop in performance. Those words include for instance "performed by", "conducted by" or "vocals by" in the artist field and "official", "explicit" or "featuring" in the title field.
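For illustration, a minimal sketch of such a cleaning step (the pattern list is a small hypothetical sample, not the actual list used in the project):

```python
import re

# Hypothetical sample of patterns that carry no identification value.
USELESS_PATTERNS = [r"\bperformed by\b", r"\bconducted by\b", r"\bvocals by\b",
                    r"\bofficial\b", r"\bexplicit\b", r"\bfeaturing\b"]

def clean_field(value: str) -> str:
    """Lowercase, strip useless patterns and punctuation, squeeze whitespace."""
    value = value.lower()
    for pattern in USELESS_PATTERNS:
        value = re.sub(pattern, " ", value)
    value = re.sub(r"[^\w\s]", " ", value)   # drop punctuation and signs
    return re.sub(r"\s+", " ", value).strip()

print(clean_field("Hello (Official Music Video) [Explicit]"))  # "hello music video"
```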


3.4 Search strategies

3.4.1 Exact search

Some of the DSR lines can be processed easily because the relevant fields are well filled and complete. The processing of those lines is what the Collecting Right Organization's model is basically doing. Since this process allows for a fast and accurate matching of trivial lines, we included this step in our model.

Firstly, DSR lines that contain an International Standard Recording Code (ISRC) which appears in the organization's documentation are processed. Then, some of the DSR lines can also be processed by performing an exact search on the artist and the title. Each of those lines has a corresponding line in the organization's documentation with the exact same title and artist. Those lines correspond to the trivial cases where relevant fields are complete and accurate. The following parts of the model are designed to tackle the remaining non-trivial lines.

Figure 3.2: Flowchart representation of the direct processing of trivial lines through exact search.


3.4.2 Fuzzy search

For each non-trivial DSR line, we perform a fuzzy search on the inverted index using the title and the artist combined into a single boolean query. More precisely, each element in the documentation gets assigned a score that assesses its relevance with regard to this query. This score for each document is based on the tf-idf weighting scheme described in section 2.2.3. Candidate lines in the documentation are then ranked according to that score.

We only keep the top k candidate documentation lines for each input DSR line, k being a fixed number. As k increases, so does the probability of retrieving the matching element in the subset of candidate lines, at the cost of higher overall computation time (feature extraction is the most computationally expensive part of the pipeline and its computation time is linear in k). After investigating this trade-off, we decided to set k to 5. Each candidate documentation line is then merged with its associated DSR line to form a pair on which features are then computed.
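As an illustration of what such a query could look like, here is a sketch assuming a recent elasticsearch Python client; the index name, field names and fuzziness setting are hypothetical and not the exact query used in the project:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def fuzzy_candidates(title: str, artist: str, k: int = 5):
    """Retrieve the top-k candidate documentation lines for one DSR line."""
    query = {
        "bool": {
            "should": [
                {"match": {"title":  {"query": title,  "fuzziness": "AUTO"}}},
                {"match": {"artist": {"query": artist, "fuzziness": "AUTO"}}},
            ]
        }
    }
    resp = es.search(index="documentation", query=query, size=k)
    return [(hit["_source"], hit["_score"]) for hit in resp["hits"]["hits"]]

candidates = fuzzy_candidates("Helo", "Adele")
```

Each returned hit carries the relevance score from which the score, rank and weighted score features of section 3.5 are derived.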

Figure 3.3: Flowchart representation of the blocking step. Each remaining input DSR line is merged with some candidate lines from the Collecting Right Organization's documentation retrieved using fuzzy search.


3.5 Feature extraction

For each pair of elements (report line, documentation line), we first compute similarity measures on the text fields: artists and titles. Those similarity measures indicate how similar the DSR line and the documentation line are with respect to each field. For example, the feature levenshtein_artist = levenshtein(dsr_artist, doc_artist) compares how close the DSR line and the documentation line are with respect to artists. For the Levenshtein similarity presented here as an example, the notion of closeness reflects the number of single-character edits necessary to change the artist of the DSR line into the artist of the documentation line. If the two strings are equal, no edit needs to be performed and the similarity is 1. Conversely, if all characters need to be edited, the similarity is 0. Other notions of closeness have been explored and are presented in section 2.3.

Then, other features are computed based on the ISRC. The ISRC (International Standard Recording Code) is a 12-character unique identifier for recordings and it has the following structure: country code (2 letters) - registrant code (3 characters) - year of reference (2 digits) - designation code (5 digits). For the country and the registrant codes, we considered Boolean features that assess the equality of those codes in the ISRC of the DSR line and in that of its candidate documentation line. For the year and the recording designation code, we considered features derived from standardized absolute differences.

Finally, we considered three features extracted from the score assigned to each candidate line during the fuzzy search:

• the score: each retrieved documentation element is assigned a score for each queried field based on the tf-idf weighting scheme (2.2.3). Those scores are then multiplied to produce the final score of each document with regard to the query.

• the rank: the score divided by the maximum score among the candidate lines associated with the considered input DSR line.

• the weighted score: the score multiplied by the rank.

In the end, all of those features form a feature vector for each pair of elements (report line, documentation line). The relevance of each feature was assessed and the results are presented in section 4.4.
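A simplified sketch of assembling such a feature vector (field names, ISRC strings and the subset of features are illustrative; the full model also uses the Levenshtein, Hamming and phonetic features of section 2.3):

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(a: str, b: str) -> float:
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B) if A | B else 1.0

def extract_features(dsr: dict, doc: dict, search_score: float, max_score: float) -> dict:
    """One feature vector for a (report line, documentation line) pair."""
    rank = search_score / max_score if max_score else 0.0
    return {
        "ratcliff_title":        ratio(dsr["title"], doc["title"]),
        "ratcliff_artist":       ratio(dsr["artist"], doc["artist"]),
        "jaccard_title":         jaccard(dsr["title"], doc["title"]),
        "same_country":          float(dsr["isrc"][:2] == doc["isrc"][:2]),
        "same_registrant":       float(dsr["isrc"][2:5] == doc["isrc"][2:5]),
        "search_score":          search_score,
        "search_rank":           rank,
        "search_weighted_score": search_score * rank,
    }

features = extract_features(
    {"artist": "Adele", "title": "Helo",  "isrc": "FRXXX1800001"},
    {"artist": "Adele", "title": "Hello", "isrc": "FRXXX1800042"},
    search_score=12.3, max_score=12.3,
)
```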


3.6 Classifier

We designed a pairing function that takes as input the feature vector associated with a pair of elements (report line, documentation line) and decides whether those two elements should match (True) or not (False).

We considered three common machine learning classifiers: naive Bayes classifiers (2.4.2), Support Vector Machines (2.4.3) and Random Forests (2.4.4).

3.6.1 Model training

To train our model in a supervised manner, we used DSR lines that have been previously identified by the Collecting Right Organization's model. Each identified line has exactly the same fields as a regular DSR line, plus an identification code specific to the organization, called COCV, which refers to a unique documentation line. Given a pair of elements (report line, documentation line), we can thus determine the target label (True or False) by checking whether the COCV is the same in both lines or not.

3.6.2 Model prediction

At identification time, features for each pair of elements (report line, documentation line) are fed to the trained classifier which decides if the pair should match or not. Each pair is associated with a predicted label and a probability that assesses the certainty of the prediction. This probability is a confidence estimate that corresponds to the relative likelihood of each class label.

3.7 Postprocessing steps

Once each pair of elements (report line, documentation line) has been labeled with a prediction by the model, we still have some manipulations to perform in order to obtain the final output, which is a unique corresponding element in the documentation for each input DSR line.


At this step, we still have k candidate lines for each DSR line, and there may well be zero or several of those lines that have been assigned a predicted label of True.

On the one hand, lines that do not have any candidate line with a True prediction are left unprocessed. On the other hand, some of the lines have several candidate lines with a True prediction. For those specific lines, we select the dominant matching line using the certainty probability that the model outputs with the prediction.

Finally, given an input line which has been assigned a dominant matching line with a certainty probability of p, we decide to leave this input line unprocessed if p is lower than a certain threshold. Discussion on the choice of this threshold is provided in section 6.2.
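A minimal sketch of this postprocessing step (the data layout and the threshold value are assumptions for illustration):

```python
def select_matches(pairs, threshold=0.8):
    """Pick at most one documentation line per DSR line.

    `pairs` is a list of (dsr_id, doc_id, predicted_label, probability) tuples;
    the threshold is a placeholder, its choice is discussed in section 6.2.
    """
    best = {}
    for dsr_id, doc_id, label, proba in pairs:
        if not label:
            continue  # only candidates predicted as True are considered
        if dsr_id not in best or proba > best[dsr_id][1]:
            best[dsr_id] = (doc_id, proba)
    # Leave lines unprocessed when the dominant match is not confident enough.
    return {dsr: doc for dsr, (doc, p) in best.items() if p >= threshold}
```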


4 Experimental details

4.1 Evaluation

The problem at hand can be thought of as a binary classification problem. First, some of the report lines are associated with candidate documentation lines and form pairs of elements of the form (report line, documentation line). For each such pair, we say that there is a match if and only if those lines refer to the same original musical work, and we label it as True. Conversely, a pair that does not correspond to a match is labeled as False. Then, the model we propose outputs a predicted label for each pair of elements (report line, documentation line). Finally, we apply postprocessing steps that ensure the quality and uniqueness of the matching.

Along that process, some report lines are left unprocessed, either because they are not associated with any candidate line, because the model does not predict a True label for any of their candidate lines, or because none of the candidate lines that have been labeled as True by the model fulfill the postprocessing constraints.

We can now better formalize the objectives described in the introduction by computing summary statistics of the following basic measures:

• True Positives (TP): elements correctly classified as positive (label = prediction = True);

• True Negatives (TN): elements correctly classified as negative (label = prediction = False);

• False Positives (FP): elements falsely classified as positive (label = False and prediction = True);

• False Negatives (FN): elements falsely classified as negative (label = True and prediction = False).

These four measures can be arranged into a 2 × 2 confusion matrix, conventionally with the test result (prediction) on the vertical axis and the actual condition (target) on the horizontal axis:

$\begin{pmatrix} TP & FP \\ FN & TN \end{pmatrix}$

Conceptually, we are trying to maximize the number of true positives (TP). However, we also want to minimize the number of false positives (FP), which corresponds to incorrectly processed tracks. We prefer to leave tracks unprocessed rather than having them incorrectly processed, leading to the wrong author being paid.

Using this terminology and denoting L the number of lines in the input report, we define our evaluators as follows:

• the Identification Rate (IR) is the rate of correct matches among all input lines:

$IR = \frac{TP}{L}$

• the False Discovery Rate (FDR) is the rate of incorrect matches among processed lines:

$FDR = \frac{FP}{TP + FP}$

Note that the False Discovery Rate is equal to $1 - PPV$, where PPV is the Positive Predictive Value, also known as the precision:

$PPV = \frac{TP}{TP + FP}$

Our objectives are to:

P P V = T P T P + F P Our objectives are to:

• maximize the Identification Rate;

• minimize the False Discovery Rate.
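A small sketch of these evaluators (the counts are invented, chosen only to mirror the orders of magnitude mentioned in chapter 1):

```python
def identification_metrics(tp: int, fp: int, total_lines: int):
    """Identification Rate and False Discovery Rate as defined in section 4.1."""
    ir = tp / total_lines
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return ir, fdr

# Hypothetical counts, for illustration only.
ir, fdr = identification_metrics(tp=700, fp=10, total_lines=1000)
print(f"IR = {ir:.1%}, FDR = {fdr:.1%}")  # IR = 70.0%, FDR = 1.4%
```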


4.2 Datasets

For training, we had access to a fraction of the overall documentation of the Collecting Right Organization. We then gathered DSR lines identified by the organization which IBM had stored over the years. We made sure to select only identified DSR lines whose assigned COCVs (identification codes specific to the organization) were among the fraction of the documentation available. As a result, we trained and evaluated our model using the following numbers of elements:

• identified resources: 88874 lines;

• documentation: 21874 lines.

For testing, and with the idea of providing relevant results on generalization performance, we started with another distinct identified DSR and gathered the documentation lines associated with identified resources. For each identified resource, we added a "tag" field that states whether it has been processed automatically or manually by the Collecting Right Organization. As a result, we tested our model performance using the following numbers of elements:

• resources: 61684 lines;

• identified resources: 43578 lines (70.6%), of which:

– 34638 (56.1%) have been processed automatically by the Collecting Right Organization;

– 8940 (14.5%) have been processed manually by the Collecting Right Organization;

• documentation: 36592 lines.


4.3 Direct processing

As developed in section 3.4.1, some input DSR lines can be directly processed using exact search on the ISRC or on the artist and the title. Proportions of lines processed using exact search are presented in tables 4.1 and 4.2.

Fields queried      Processed lines   Proportion of lines processed   FDR
ISRC                39960             45%                             0.5%
ARTIST and TITLE    16105             18%                             0.7%

Table 4.1: Statistics on lines processed using exact search in the dataset used for training. FDR is the False Discovery Rate as defined in section 4.1. 45% of input lines are processed using exact search on ISRC with 0.5% of false positives among processed lines. 18% of input lines are processed using exact search on artist and title with 0.7% of false positives among processed lines.

Fields queried      Processed lines   Proportion of lines processed   FDR
ISRC                22555             51.8%                           0.6%
ARTIST and TITLE    6533              15.0%                           0.2%

Table 4.2: Statistics on lines processed using exact search in the dataset used for testing. FDR is the False Discovery Rate as defined in section 4.1. 51.8% of input lines are processed using exact search on ISRC with 0.6% of false positives among processed lines. 15% of input lines are processed using exact search on artist and title with 0.2% of false positives among processed lines.

Following direct processing, 63% of lines in the dataset used for training have been processed (table 4.1). The training dataset was constructed on the remaining 37% of lines. Each DSR line was associated with at most 5 candidate lines in the documentation to form pairs of elements. As a result, training was conducted on 158410 pairs, of which 30900 have a True target label (19.5%).


4.4 Feature selection

As detailed in sections 2.3 and 3.5, we considered 25 different features computed on each pair of an input DSR line and a candidate documentation line. As a reminder, those features can be grouped into:

• 18 features based on similarities between artists and titles:

– edit-based methods: Levenshtein and Hamming;

– token-based methods: Jaccard, Sørensen-Dice, overlap and Ratcliff/Obershelp;

– phonetic-based methods: Soundex, NYSIIS, Metaphone.

• 4 features based on ISRC comparisons:

– equality assessments of country codes and registrant codes;

– similarities between years of reference and designation codes.

• 3 features extracted from the score assigned to each candidate line during the fuzzy search:

– the score;

– the rank;

– the weighted score.

As the computation of those features is the most expensive part of the overall identification pipeline, we performed feature selection to reduce the number of kept variables. Additional gains from feature selection are the improvement of prediction performance, faster and more cost-effective predictors and a better understanding of the underlying process that generated the data [51]. We first present some methods to rank features according to their relevance with regard to different notions in the next subsection. Then we show how we exploited those rankings to perform feature selection using wrapper methods.


4.4.1 Feature ranking

We want to assign a score to each variable based on some statistical measures. Features are then ranked according to this score and either selected to be kept or removed from the dataset. We considered three different ranking methods:

• Univariate selection: each feature is assigned a score that assesses the strength of the relationship between the feature and the output variable. Using the χ² statistical test for non-negative features, our 20 best features are presented in table 4.3;

Rank   Training feature              χ² score
1      search_weighted_score         142809.81
2      search_score                  72139.30
3      hamming_title                 38154.54
4      metaphone_title               32319.29
5      same_registrant               16770.47
6      search_rank                   14324.75
7      jaccard_title                 12102.15
8      nysiis_title                  11235.07
9      leven_title                   9166.72
10     same_country                  7558.27
11     soundex_title                 7471.08
12     ratcliff_title                6558.43
13     overlap_title                 4590.12
14     sorensen_title                4015.87
15     designation_code_closeness    23.62
16     nysiis_artist                 15.26
17     soundex_artist                9.01
18     jaccard_artist                6.60
19     sorensen_artist               3.80
20     metaphone_artist              3.66

Table 4.3: Top 20 features using the χ² statistical test evaluated on the training set. χ² is a measure of how much expected counts (here the target label) and observed counts (here training features) deviate from each other. A high value thus indicates that the hypothesis of independence is incorrect [29].


• Correlation matrix: represents the relationships between variables according to some notion of correlation. We used the Pearson coefficient, which measures the linear correlation between two variables and corresponds to the covariance of the two variables divided by the product of their standard deviations. The correlation matrix between the 20 best features selected with univariate selection is presented in figure 4.1. We want to keep features that are highly correlated to the target label (referred to as "label" in figure 4.1), which gives us a new ranking of features to consider;

Figure 4.1: Correlation matrix between the top 20 features selected using univariate selection and the target label, evaluated on the training set.


• Feature importance: tree-based models can assign a score to each feature based on the way they operate. Each tree contains nodes, and each node represents a subset of features on which a threshold decision is performed. Features that appear more frequently in those subsets are assigned a higher score than those that appear less. Our 20 best features based on the feature importance of a Random Forest are presented in figure 4.2.

Figure 4.2: Bar diagram representing the top 20 features and their scores using the feature importance of a Random Forest evaluated on the training set.


4.4.2 Wrapper methods

Based on the three feature rankings presented above, we used wrapper methods to select a final subset of features. In wrapper methods, we consider successive subsets of features on which we train and evaluate our model. Wrapper methods are greedy search algorithms: they evaluate successive combinations of features and select the combination that entails the best performance. We considered two different approaches:

• Forward search: the subset of selected features is constructed from scratch and at each step the best feature with respect to model performance is added;

• Recursive feature elimination: this search algorithm begins with the subset of selected features being equal to all features and at each step the worst performing feature is eliminated.
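A minimal sketch of the forward-search variant, assuming scikit-learn and features stored in a pandas DataFrame X (the cross-validation setup is an assumption, the thesis does not specify one):

```python
from sklearn.model_selection import cross_val_score

def forward_search(model, X, y, candidate_features, max_features=10):
    """Greedy forward selection: at each step, add the feature that most improves the CV score."""
    selected, best_score = [], -float("inf")
    while candidate_features and len(selected) < max_features:
        scores = {}
        for f in candidate_features:
            cols = selected + [f]
            scores[f] = cross_val_score(model, X[cols], y, cv=5).mean()
        best_feature = max(scores, key=scores.get)
        if scores[best_feature] <= best_score:
            break  # no remaining feature improves performance
        best_score = scores[best_feature]
        selected.append(best_feature)
        candidate_features = [f for f in candidate_features if f != best_feature]
    return selected
```

Recursive feature elimination works symmetrically, starting from all features and removing the worst-performing one at each step.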

For computation time reasons, at each step of each considered approach, we reduced the search to only the best and worst features identified in the feature rankings presented above. In the end, our best performing subset contains 10 features, and it produced better model performance than any other selected subset, including the one that contains all features.

Those features are:

• rank

• weighted_score

• sorensen_title

• leven_title

• hamming_title

• ratcliff_title

• soundex_title

• ratcliff_artist

• leven_artist

• same_registrant


5 Results

We here present the results obtained with the three machine learning classifiers that we considered: naive Bayes classifiers, Support Vector Machines and Random Forests. Models were fine tuned before evaluation on the testing set, and the minimization of the false discovery rate was favored over the maximization of the identification rate, as explained in further detail in section 6.2. Fine tuning was done either by trying different model hypotheses, in the case of the naive Bayes classifier and the Support Vector Machine, or by using grid search to identify the best performing model parameters, in the case of the Random Forest.

Once each pair of an input line with a candidate documentation line has been labeled with a prediction, we select a dominant matching line among the k candidate documentation lines using the certainty probability p that the model outputs with the prediction. Then we leave the input line unprocessed if p is lower than a certain threshold (3.7). The results presented below consist of identification rates and false discovery rates for different values of p and were evaluated on the testing set. Results are distinguished between data processed automatically and manually by the Collecting Right Organization and are discussed in section 6.3.


5.1 Naive Bayes classifier

Fine tuning of a naive Bayes classifier involves choosing the distribution of $P(x_i \mid Y)$ that produces the best overall performance, where $P(x_i \mid Y)$ is the probability of attribute i given the class (2.4.2). We tried the Gaussian, the Multinomial and the Complement distributions, and the one that yielded the best results was the multinomial distribution:

$P(x_i \mid y) = \frac{N_{yi} + \alpha}{N_y + \alpha n}$

where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature i appears in a sample of class y in the training set T and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class y. Results for our best naive Bayes classifier are presented in figure 5.1.

5.2 Support Vector Machine

Fine-tuning of a Support Vector Machine (SVM) involves choosing the kernel that best fits the data. As explained in section 2.4.3, the kernel function implicitly defines the higher-dimensional feature space in which hyperplane separation is then performed. The feature space is enlarged using functions of the original variables such as linear combinations, polynomial combinations, radial basis functions (RBFs) or sigmoid functions. Among all these kinds of functions, the ones that performed best in our case were RBFs. Results for our best SVM are presented in figure 5.2.
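A minimal sketch of the corresponding scikit-learn model; probability=True is what makes the certainty probabilities available for thresholding, and the data and regularization values shown are illustrative, not our tuned settings.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for our training pairs
rng = np.random.default_rng(0)
X_train, y_train = rng.random((300, 10)), rng.integers(0, 2, 300)

# RBF kernel: k(x, x') = exp(-gamma * ||x - x'||^2). probability=True
# enables predict_proba (via Platt scaling) for thresholding on p.
svm = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, gamma="scale", probability=True))
svm.fit(X_train, y_train)
print(svm.predict_proba(X_train[:5]))
```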

5.3 Random Forests

Fine-tuning of our Random Forest classifier involved choosing the following model parameters through grid search:

• the number of trees: 14;

• the maximum depth of each tree: 11;

• the class weights: 1 to class True and 2 to class False;


• the minimum number of samples required to split an internal node: 4;

• whether samples are drawn with replacement (bootstrap): False.

Other model parameters either had no relevant impact or performed best when left at their default values (using sklearn.ensemble.RandomForestClassifier).
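A minimal sketch of that grid search, restricted to the parameters listed above; the value ranges, scoring choice and data are illustrative, not the exact grids we searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in for our training pairs, with boolean match labels
rng = np.random.default_rng(0)
X_train = rng.random((300, 10))
y_train = rng.random(300) > 0.5

param_grid = {
    "n_estimators": [10, 14, 20],            # number of trees
    "max_depth": [7, 11, 15],                # maximum depth of each tree
    "min_samples_split": [2, 4, 8],          # min samples to split a node
    "bootstrap": [True, False],              # draw samples with replacement?
    "class_weight": [{True: 1, False: 2}, "balanced"],
}

# Precision is maximized here because minimizing the false discovery rate
# was favoured, as discussed in section 6.2.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="precision", cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
```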

Results for our best Random Forest are presented in figure 5.3.


Figure 5.1: Graph performance of our naive Bayes classifier with Multinomial a posteriori probabilities. Identification rates (red curves and left scale) and false discovery rates (blue curves and right scale) are displayed as functions of the certainty probability of the classifier. Data processed automatically (dotted lines) by the Collecting Right Organization is distinguished from data processed manually (solid lines).


Figure 5.2: Graph performance of our Support Vector Machine classifier with a RBF kernel.

Figure 5.3: Graph performance of our best Random Forest classifier selected through grid search.


Discussion

6.1 Nature of selected features

As a reminder, the 10 selected features are: rank, weighted_score, sorensen_title, leven_title, hamming_title, ratcliff_title, soundex_title, ratcliff_artist, leven_artist and same_registrant.

For each kind of feature explored, here is the number of features selected in the best-performing subset:

• edit-based similarities: 3

• token-based similarities: 3

• phonetic-based similarities: 1

• ISRCs comparisons: 1

• fuzzy search score: 2

First off, it is interesting to notice that all kinds of features explored are represented in this best-performing subset. As the most represented features in this subset, edit-based and token-based similarities between artist and title fields are certainly the features containing the most relevant information when it comes to assessing resemblance between lines. Features derived from fuzzy search scores also play a key role in model performance, since they are among the first features selected by the wrapper methods and are the features most correlated with the target variable.



6.2 Trade-off between identification and false discovery

Classifier evaluation (5) highlighted the fact that there is a trade-off between the maximization of the identification rate and the minimization of the false discovery rate. The threshold value on the certainty probability p that classifiers output with each prediction acts as a control parameter over this trade-off: increasing the threshold value entails both a decrease of the identification rate and a decrease of the false discovery rate. For model selection and fine-tuning, we favoured the minimization of the false discovery rate over the maximization of the identification rate. This stems from a specificity of the problem at hand, where a false positive corresponds to the wrong artist being paid, which is highly unwanted.
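This control-parameter behaviour can be traced directly by recomputing both rates on a labelled set for a range of thresholds. Below is a minimal sketch, assuming we already have, per input line, the dominant candidate proposed by the model, its certainty p and the correct documentation line (all argument names are illustrative).

```python
import numpy as np

def rates_vs_threshold(p, predicted_id, true_id, thresholds):
    """Identification rate and false discovery rate per threshold value.

    p:            certainty probability of the dominant candidate per line
    predicted_id: documentation line proposed by the model per input line
    true_id:      correct documentation line per input line
    """
    curves = []
    for t in thresholds:
        accepted = p >= t                      # lines we dare to identify
        n_accepted = accepted.sum()
        identification_rate = n_accepted / len(p)
        wrong = accepted & (predicted_id != true_id)
        fdr = wrong.sum() / n_accepted if n_accepted else 0.0
        curves.append((t, identification_rate, fdr))
    return curves
```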

6.3 Model selection

For each fine-tuned model, we selected a threshold value on the certainty probability that allowed the false discovery rate to stay under 1.5%, in order to compare with the current solution used by the Collecting Right Organization. Results for those particular thresholds are presented in table 6.1.

Classifier        Threshold   Identification Rate      False Discovery Rate
                              Automatic    Manual      Automatic    Manual
Multinomial NB    p > 0.85    23.5%        23.4%       1.3%         1.5%
SVM               p > 0.82    45.4%        58.3%       1.5%         1.4%
Random Forest     p > 0.89    42.3%        54.0%       1.4%         1.1%

Table 6.1: Performance comparison of our three classifiers. Identification rates and false discovery rates are displayed for a threshold value on the certainty probability of the classifier that was selected to keep the false discovery rate low enough (< 1.5%). Data processed automatically by the Collecting Right Organization is distinguished from data processed manually.


First off, even though the multinomial naive Bayes classifier obtains decent results relative to the simplicity of the model, it is not competitive compared to the other two models. The SVM and the Random Forest perform almost equally well and manage to combine very low false discovery rates with interesting identification rates of around 50%.

6.4 Extrapolation

To summarize our results on an entire Digital Sales Report (DSR), we selected our best-performing model, which is a Support Vector Machine with a RBF kernel. We set the threshold value on the certainty probability to 0.82 in order to compete with the current solution used by the Collecting Right Organization in terms of false discovery rate, as seen above. Our model performs as follows:

Identification Rate         False Discovery Rate
Automatic    Manual         Automatic    Manual
45.4%        58.3%          1.5%         1.4%

Table 6.2: Performance of the selected SVM with a RBF kernel and a threshold of 0.82 on the certainty probability, evaluated on the testing set. Data processed automatically by the Collecting Right Organization is distinguished from data processed manually.

Considering the entirety of our identification pipeline, our model per- forms as follows on the DSR that we considered for testing:

• on lines processed automatically by the Collecting Right Organization (56.1%):

60.5% are directly processed using the ISRC;

9.7% are directly processed using the artist and the title;

45.4% of the remaining 29.8% of lines (13.5% of the total) are processed using fuzzy search and machine learning;

⇒ 83.7% of those lines are processed using our model.


• on lines processed manually by the Collecting Right Organization (14.5%):

17.9% are directly processed using the ISRC;

35.7% are directly processed using the artist and the title;

58.3% of the remaining 46.4% of lines (27.0% of the total) are processed using fuzzy search and machine learning;

⇒ 80.6% of those lines are processed using our model.
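The figures above combine as follows; a quick check of the arithmetic, with the percentages expressed as fractions:

```python
# Lines processed automatically by the organization:
# ISRC matches + exact artist/title matches + model hits on the remainder
auto = 0.605 + 0.097 + 0.454 * 0.298
# Lines processed manually by the organization, same decomposition
manual = 0.179 + 0.357 + 0.583 * 0.464
print(auto, manual)  # roughly 0.84 and 0.81, matching 83.7% and 80.6% up to rounding
```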

As we did not have access to the entirety of the Collecting Right Organization's documentation, we could not test our model on lines left unprocessed by this organization. Since lines are left unprocessed only for reasons of limited time and manpower, our model should in theory perform as well on them as on lines processed manually by the Collecting Right Organization (non-trivial lines). Since our model identifies more than 80% of lines processed manually, we extrapolated that at least 50% of the lines left unprocessed could be processed by our model. These summarized results and the extrapolation are presented in figure 6.1.


Figure 6.1: Bar diagram that compares identification rates of the Collecting Right Organization's current solution with those of our solution. The organization's current solution achieves an identification rate of 70.6% through manual and automatic identification. Our solution achieves an identification rate of 58.7% among those 70.6% of lines processed by the Collecting Right Organization. The extrapolation that our model would identify 50% of the lines left unprocessed by the Collecting Right Organization would allow for an identification rate of 87.9% if combined with manual identification and 73.4% without it.

