
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2020

Musical Instrument Recognition using the Scattering Transform

Laura Cros Vila


Authors

Laura Cros Vila <lcros@kth.se>

Master in Machine Learning, KTH Royal Institute of Technology

Place for Project

Stockholm, Sweden

Examiner

Anders Friberg

KTH Royal Institute of Technology

Supervisor

Bob Sturm

KTH Royal Institute of Technology


Acknowledgements

First of all I want to thank those who followed me closely, patiently and wisely advising me on the right moves during these long and difficult months: Bob, Carl and Daniel. I want to thank all the amazing people I have met at Epidemic Sound, who feel not just like coworkers but also like family. Especially Carl, who helped and encouraged me since day one, and who told me so. I thank all the classmates who have accompanied me over these years, without whom I would not have learned half of what I have. I thank my whole family, who are always there for me, for anything. I thank with all my heart my friends, who have incredibly put up with me for years and whom I would not trade for anything in the world.


Abstract

Thanks to advances in networking and signal processing, we can access a large amount of musical content. In order for users to search among these vast catalogs, they need access to music-related information beyond the pure digital music file. Manual annotation of music is too expensive, therefore automated annotation would be of great use. A meaningful description of a musical piece requires incorporating information about the instruments present in it.

In this work, we present an approach for musical instrument recognition using the scattering transform, a transformation that yields a translation-invariant representation, is stable to deformations, and preserves high-frequency information for classification. We study recognition in both single-instrument and multiple-instrument contexts. We compare the performance of models using the scattering transform to that of models using other standard features.

We also examine the impact of the amount of training data.

The experiments carried out do not show a clearly superior performance for either feature representation. Still, the scattering transform is worth taking into account when choosing a way to extract features if we want to be able to characterize non-stationary signal structures.

Keywords

Audio classification, music information retrieval, musical instrument recognition, scattering transform


Abstract

Tack vare den tekniska utvecklingen i nätverk och signalbehandling kan vi få tillgång till en stor mängd musikaliskt innehåll. För att användare ska söka bland dessa stora kataloger måste de ha tillgång till musikrelaterad information utöver den rena digitala musikfilen. Eftersom den manuella annotationsprocessen skulle vara för dyr måste den automatiseras. En meningsfull beskrivning av musikstyckena kräver införlivande av information om instrumenten som finns i dem.

I det här arbetet presenterar vi en metod för igenkänning av musikinstrument med hjälp av den scattering transform, som är en transformation som ger en översättnings-invariant representation, som är stabil för deformationer och bevarar högfrekvensinformation för klassificering. Vi studerar igenkännande i både enskilda instrument- och flera instrumentförhållanden. Vi jämför modellerna med den scattering transforms prestanda med de som använder andra standardfunktioner. Vi undersöker också effekterna av mängden träningsdata.

Experimenten som utförs visar inte en tydlig överlägsen prestanda för någon av representationsföreställningarna jämfört med den andra. Fortfarande är den scattering transform värd att ta hänsyn till när man väljer ett sätt att extrahera funktioner om vi vill kunna karakterisera icke-stationära signalstrukturer.

Nyckelord

Ljudklassificering, hämtning av musikinformation, igenkänning av musikinstrument, scattering transform


Contents

1 Introduction
  1.1 Benefits, ethics and sustainability
2 Overview on musical instrument recognition
  2.1 Recognizing musical instruments
    2.1.1 Perceptual dimensions of sound
    2.1.2 Differences between musical instruments
  2.2 Literature survey
    2.2.1 Single instrument recognition
    2.2.2 Multiple instrument recognition
  2.3 Representations and features
    2.3.1 Mel-frequency spectrum
    2.3.2 Mel-Frequency Cepstral Coefficients (MFCC)
  2.4 Machine learning methods
    2.4.1 Naive Bayes
    2.4.2 Nearest Neighbour (NN)
    2.4.3 Linear Discriminant Analysis (LDA)
    2.4.4 Support Vector Machine (SVM)
    2.4.5 Artificial Neural Networks (ANN)
  2.5 Datasets
  2.6 Evaluation metrics
3 Applying the scattering transform to instrument recognition
  3.1 Representations and features
    3.1.1 Continuous Wavelet Transform (CWT)
    3.1.2 The scattering transform
  3.2 Scattering transform for musical instrument recognition
4 Methodology
  4.1 Data
    4.1.1 Essid
    4.1.2 MedleyDB
    4.1.3 Modified labels
  4.2 Stem mixing
  4.3 CNNs for scattering transform
  4.4 Data split
  4.5 Pre-processing
  4.6 Cross validation
  4.7 Model details
5 Results
  5.1 Proof of concept
  5.2 Reduced training data
6 Discussion
  6.1 Single instrument experiment
  6.2 Multiple instrument experiment
  6.3 Problems encountered
7 Conclusions and future work
A Instruments' fundamental frequencies ranges
B Feature representations examples
C Learned kernels

List of Figures

1 Comparison of the time-frequency resolution between STFT and CWT [50].
2 Multiple inputs configuration for the scattering coefficients.
3 Normalized confusion matrix of the Log-mel model.
4 Normalized confusion matrix of the first-order coefficients model.
5 Normalized confusion matrix of the second-order coefficients model.
6 On the left, training and validation loss curves over 100 epochs of the CNN two model without dropout. On the right, training and validation loss curves over 100 epochs of the CNN two model with dropout.
7 Box plot of the test losses of 10 trials for the Log-mel and scattering models with different amounts of training data. The data used were the originally mixed tracks found in the MedleyDB dataset.
8 Box plot of the test losses of 10 trials for the Log-mel and scattering models with different amounts of training data. The data used were the newly mixed tracks.
9 Frequency ranges of musical instruments and the human voice [36].
10 Log-mel features for an observation with a guitar (Gt).
11 First-order scattering coefficients, logarithmically scaled, for an observation with a guitar (Gt).
12 Second-order scattering coefficients, logarithmically scaled, for an observation with a guitar (Gt).
13 Log-mel features for an observation with a cello (Co).
14 First-order scattering coefficients, logarithmically scaled, for an observation with a cello (Co).
15 Second-order scattering coefficients, logarithmically scaled, for an observation with a cello (Co).
16 Learned kernels for the Log-mel features.
17 Learned kernels for the first-order scattering coefficients, logarithmically scaled.
18 Learned kernels for the second-order scattering coefficients, logarithmically scaled.

List of Tables

1 A qualitative comparison of different existing datasets for instrument identification.
2 Summary of the number of sources and samples for each of the instruments in the Essid database.
3 Summary of the number of sources and samples for each of the instruments in the MedleyDB database.
4 Correspondence between the new labels and the MedleyDB labels.
5 Number of samples per label in the different subsets.
6 Number of samples per label for the final split in the different subsets.
7 Results of the randomized hyper-parameter search.
8 Different networks used for the experiments, CNN6 and CNN two. The size of the kernel of each convolutional layer is specified before the @ symbol. The number after the @ symbol denotes the number of feature maps used in each layer. BN stands for Batch Normalization and FC denotes a fully connected layer.
9 Model performances on the test set.
10 Instrument-wise F1 score on the test set.
11 Instrument-wise F1 score measured on the validation set, together with the number of samples present in the training and validation sets.
12 Instrument-wise F1 score on the test set of the two best performing models. Both of them were trained on 100% of the training set.


1 Introduction

Musical instrument recognition is the engineering of automated methods that infer the presence of musical instruments playing in audio recordings [38, 13, 23, 26]. The need for a system that can automatically tag musical instruments becomes obvious when one considers how labor-intensive it is to manually annotate when instruments are playing in mixes. The instrumentation metadata can be used for many purposes, such as allowing users to explore libraries in a new dimension and creating new ways to search for music. Moreover, adding the musical instruments as prior knowledge could boost the performance of source separation and automatic music transcription models.

Recognizing instruments, either solo or mixed, requires a lot of specialized training; for machine learning systems to be able to do it will likewise require a lot of data expressed in sufficiently rich ways. By extracting relevant features from the data, the dimensionality of the problem can be reduced so that a machine is able to interpret the data. Typical approaches use time-frequency features that capture identifiable timbral characteristics of instruments. One such recently proposed feature, the scattering transform [5], seems to have great potential for music informatics, but has yet to be applied to musical instrument recognition.

The scattering transform was first introduced by Andén and Mallat [5] as a way to incorporate information about large-scale temporal structure present in audio signals. The scattering transform extends the commonly used Mel-frequency cepstral coefficients (MFCCs) representation through a cascade of wavelet transformations and modulus operators, capturing structures at many time scales. Andén and Mallat's work [5] reports high classification accuracies when using the scattering transform, although it has been shown that one of the datasets used in the experiments has several faults [64].

To the best of our knowledge, the scattering transform has never been used for instrument recognition. How will it perform for instrument recognition? How will it compare against state-of-the-art systems? What are the benefits? The goal of this project is to study the effectiveness of the scattering transform coefficients as features for instrument recognition in data-poor regimes. By combining the scattering transform with deep models such as Convolutional Neural Networks (CNNs), we explore the capability of our models to properly detect instruments in single-instrument and multi-instrument settings.

1.1 Benefits, ethics and sustainability

Music is a cultural experience and commercial product, and automated tools

that extract information from audio data can be useful to a wide variety of

users. Musical instrument recognition can help automatically tag tracks by

creating instrumentation metadata so that users can explore libraries in a new

dimension, as well as boost the performance of recommendation systems.


2 Overview on musical instrument recognition

In this section we describe what musical instrument recognition is, giving an insight into the physical and perceptual notions related to music, audio and instruments. We then survey the literature on both single and multiple instrument recognition, and describe the representations and features used for the task, as well as the datasets and the methods used to perform the classification. Finally, we present the evaluation techniques used to assess the performance of musical instrument recognition systems.

2.1 Recognizing musical instruments

Sound is a longitudinal pressure wave in a medium. Humans perceive sound via their auditory system. The medium of interest here is of course air, which is compressed and rarefied in a periodic way. The pattern of sound pressure vibration created by this phenomenon is known as the waveform.

2.1.1 Perceptual dimensions of sound

We review three of the primary perceptual attributes of sound, namely loudness, pitch, and timbre.

Loudness is related to the amplitude of a sound waveform. It depends on the generated air pressure and the acoustic energy at the receiver's position, the duration of the sound and its spectral content. The decibel (abbreviated dB) is the unit used to measure the intensity of a sound.

Pitch is related to the frequency at which the particles of the medium through which the sound moves vibrate. It is the perceptual property that relates sounds to notes and melodies.

Timbre is the quality and tone of a sound that makes it unique. Also referred to as "color", it is considered to be directly affected by the shape of an instrument and the envelope of an instrument's sound. It is also affected by many other factors, such as the posture of the person playing the instrument or small differences in the waveform frequencies. By definition, timbre is the perceptual attribute that distinguishes two sounds that have the same pitch, loudness, and duration [1]. The ambiguity of this definition is emphasized in Albert Bregman's book [12]: "We do not know how to define timbre, but it is not loudness and it is not pitch." The qualities of sound we call timbre are harder to describe and to manipulate than pitch and loudness, and the methods may vary from instrument to instrument. The names of the instruments themselves can be understood as labels of timbre. By calling a sound a "piano" sound, we describe the general impression of it, clearly distinguishing it from sounds produced by other sources. Still, timbre is a perceptual quality of sound that is defined very ambiguously due to its subjectivity.


2.1.2 Differences between musical instruments

Each instrument has different acoustic and physical properties that define its own particular timbre. Humans can reliably identify the sources of sounds by listening, which must be due to some properties of the sound that can be related to timbre. Some similarities in timbre can be found in instruments that belong to the same family. Musical instruments can be divided into three families [57]: string, wind and percussion instruments. String instruments produce sound by making one or multiple strings vibrate. Wind instruments produce sound by making a column of air vibrate. Percussion instruments produce sound by vibrating when struck.

Figure 9 in Appendix A shows a chart of the overlapping fundamental frequency ranges of musical instruments.

2.2 Literature survey

We will now survey work in the area of musical instrument recognition.

2.2.1 Single instrument recognition

Early work on musical instrument recognition focused on single-instrument recordings. The classification task consisted of identifying a single class of instrument for each track, where the audio clips contained notes coming from one source.

One of the earliest works on monophonic instrument recognition, i.e. where audio signals contain single notes from isolated instruments, is that of Kaminsky and Materka [38]. In their work, they extracted features using a short-term Root-Mean-Square (RMS) energy envelope and used an artificial neural network and a nearest neighbour classifier to classify instrument tones over a one-octave band. De Poli and Prandoni [18] were among the first to use Mel-Frequency Cepstrum Coefficients (MFCCs). Those features were extracted from isolated tones and later fed into a Kohonen self-organizing map in order to describe the different musical timbres to be classified.

In [52], Martin computed MFCC features and built a hierarchically structured model using univariate Gaussian distributions that performed maximum likelihood classification for each of the 15 orchestral instruments tested. The audio clips consisted of single isolated tones. In order to further study the feature space, Eronen [22] compared the MFCC features with spectral and temporal features such as the amplitude envelope and spectral centroids. In those experiments, the Karhunen-Loève transform was used to decorrelate the features, which were then used to classify single instruments by means of k-nearest neighbors (k-NN) classifiers. The results showed that the MFCC features outperformed the rest.

Agostini et al. [2] used the mean and standard deviation of 9 features as descriptors for each isolated tone. Their experiments showed that SVM and Quadratic Discriminant Analysis (QDA) outperformed k-NN and canonical discriminant analysis. In their work, they remark that the choice of features is more critical than the choice of classification method.

Krishna and Sreenivas [43] proposed the so-called Line Spectral Features (LSF) with a Gaussian mixture model (GMM) to classify solo phrases rather than isolated notes. They evaluated their model for instrument family classification and instrument classification over 14 classes. Essid et al. [26] performed musical instrument recognition on solo performances, considering both GMM and SVM methods. In their work, they take into account the classification efficiency of the features rather than their perceptual meaning, addressing the task of instrument recognition using a pairwise (one-vs-one) classification strategy.

Joder et al. [37] make use of a Hidden Markov Model (HMM) classifier, which performs feature integration by fitting a generative model to the temporal evolution of the features. Their HMMs performed better than GMMs, which suggests that temporal integration can significantly improve the performance of a classification system.

Yu et al. [69] applied sparse coding on logarithm cepstra and power-scale cepstra for predominant instrument recognition, using an SVM to perform the classification on solo recordings. They applied the model to polytimbral audio, but it greatly underperformed. More work on sparse coding was done by Han et al. [32], who use sparse coding to learn features from mel-spectrograms. An SVM was trained to classify 24 instruments from single-note audio clips using the learned features.

In [49], Lostanlen et al. used deep learning to perform musical instrument recognition on single-instrument audio recordings. The deep convolutional network relied on different weight-sharing strategies in the time-frequency domain: temporal kernels, time-frequency kernels, and a linear combination of time-frequency kernels. In later work [48], Lostanlen et al. make use of scattering coefficients as features, as opposed to MFCCs, to train a large-margin nearest neighbors (LMNN) algorithm.

2.2.2 Multiple Instrument Recognition

For the past few years, research has focused on polytimbral audio, since it is closer to how music is found in real life. Building classifiers capable of fully automating the task of labeling instruments in real-life settings is an essential problem for the Music Information Retrieval (MIR) community.

The way the different sources blend with each other in a polytimbral context is what makes instrument recognition so challenging, as the audio signals contain a lot of overlap in the time-frequency domain. A number of systems are based on separating the sources from a mixture and then classifying them separately.

Heittola et al. [34] employed a source-filter model to obtain the signals of individual instruments from a mixture. In their work, they use Non-Negative Matrix Factorization (NMF) for the source-filter model, together with a multipitch estimator, to then estimate the MFCC features of each separate source's signal and perform the classification with a GMM classifier. The benefit of this approach is that the previously mentioned monotimbral techniques can be used, since the classification is performed on isolated instruments, assuming that the source separation step is successful.

Essid et al. [25] proposed a new approach which did not require prior musical source separation. They proposed an automatic method to recognize combinations of instruments present in polytimbral music, using hierarchical clustering to represent every possible combination of instruments that is likely to be played simultaneously in jazz music. Alternatively, one can simply detect the presence of individual instruments rather than combinations of instruments. Little and Pardo [46] did that by training binary classifiers using weakly labeled mixtures. Here, weakly labeled means that the exact times of activation of each instrument were unknown; the only information available was the presence or absence of the target instrument.

Eggink and Brown [20] proposed a system based on a missing-feature approach to estimate a binary mask that defines time-frequency regions based on the estimated fundamental frequencies. The binary mask indicates the spectral features which should be employed by a GMM classifier. In later work, they proposed a system to identify a solo instrument in the presence of musical accompaniment by extracting the most prominent fundamental frequency and the corresponding harmonic overtone series [21].

Kitahara et al. [40] proposed a weighting method to deal with source interference, where a louder instrument masks other instruments when they are played simultaneously. Linear Discriminant Analysis (LDA) was used to minimise within-class variance and maximize between-class variance, enhancing the features that best discriminate the different instruments. Barbedo and Tzanetakis [7] used pitch and onset detection to identify individual partials, i.e. individual sine waves that form a timbre. They reduce the ambiguity caused by source interference by extensively using voting and majority rules on the retrieved partials.

Cont et al. [16] used Non-Negative Matrix Factorization (NMF) decomposition to learn one modulation spectrum template for each note of each instrument. Those templates were constructed in the training process as single basis functions in the resulting classification matrix. Predictions were obtained by matching an unknown input to the classification matrix.

In recent years, research has been influenced by the popularity of deep learning methods, especially with the introduction of Convolutional Neural Networks (CNNs). In [45], Li et al. used CNNs to train a multi-label classification model using raw audio as input. That way, the system relied less on domain knowledge, since the deep architecture learned both the feature extraction and the semantic interpretation from the data. However, using raw audio signals as input shows a slightly lower performance than using a spectral input such as the Mel-spectrogram. By adding a pre-processing step to extract the Mel-spectrogram as input to the CNN model, Han et al. [31] achieved state-of-the-art performance in predominant instrument identification in real-world polyphonic music.

2.3 Representations and features

2.3.1 Mel-frequency spectrum

A standard step in extracting features for audio classification is performing the short-time Fourier transform (STFT) [50]:

F x(t, \omega) = \int_{-\infty}^{+\infty} x(u)\, g(u - t)\, e^{-i\omega u}\, du \qquad (1)

where x(t) is the input signal and g(t) is a window function of unit norm \|g\| = 1 centered at t = 0. This transformation is locally invariant to time-shifting: let x_c(t) = x(t - c) and let g(t) be a window of fixed size T; then for |c| \ll T we have |F x_c(t, \omega)| \approx |F x(t, \omega)|. However, it is not stable to time warping. Let x_\tau(t) = x(t - \tau(t)) with \tau(t) = \epsilon t; then \big| |F x_\tau(t, \omega)| - |F x(t, \omega)| \big| is large at high \omega even for small \epsilon. A solution is to take as representation the power spectrogram averaged in frequency, which gives the mel-frequency spectrogram:

M x(t, \lambda) = \frac{1}{2\pi} \int |F x(t, \omega)|^2 \, |\Psi_\lambda(\omega)|^2 \, d\omega \qquad (2)

where \{\Psi_\lambda(\omega)\}_\lambda is a mel-frequency filterbank, which is constant-Q for high \lambda, ensuring that we compensate for the increased movement of the high-frequency components. As a result, \|M x_\tau(t, \lambda) - M x(t, \lambda)\| is of the order of \Delta\tau(t) = \epsilon [51]. We use the mel-frequency spectrogram to extract representative features in this work.

2.3.2 Mel-Frequency Cepstral Coefficients (MFCC)

The components of the mel-frequency spectrogram are highly correlated [54]. These components can be decorrelated by performing the Discrete Cosine Transform (DCT) on the logarithmic mel-spectral vectors, resulting in the so-called cepstral features. The DCT here is an approximation to the Karhunen-Loève (KL) transform (or, equivalently, Principal Component Analysis (PCA)), whose first principal components describe the majority of the variance in the data [47].
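As an illustration of Sections 2.3.1 and 2.3.2, the sketch below computes a log-mel spectrogram and cepstral coefficients via a DCT. It assumes the librosa and scipy packages and placeholder parameter values; it is not the exact feature-extraction code used in this work (the settings actually used are given in Section 4.5).

```python
import numpy as np
import librosa
from scipy.fftpack import dct

sr = 22050
# A synthetic test tone stands in for a real recording.
x = librosa.tone(440, sr=sr, duration=2.0)

# Mel-frequency spectrogram: |STFT|^2 averaged through a mel filterbank.
mel = librosa.feature.melspectrogram(
    y=x, sr=sr, n_fft=400, hop_length=160, n_mels=64, fmin=50, fmax=sr / 2
)

# Log compression, then a DCT along the mel axis decorrelates the
# components and yields cepstral (MFCC-like) features.
log_mel = np.log(mel + 1e-10)
mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:13]  # keep the first 13 coefficients

print(log_mel.shape, mfcc.shape)
```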

2.4 Machine learning methods

In this section we mention some of the machine learning approaches used to

perform musical instrument recognition.


2.4.1 Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the assumption of conditional independence between every pair of features given the value of the class variable [70]. A Naive Bayes classifier estimates the probability distribution over the different combinations of feature values and then uses Maximum a Posteriori (MAP) estimation to estimate the class of a given sample. The assumption that all features are independent speeds up the classification process at the cost of accuracy.

2.4.2 Nearest Neighbour (NN)

The basic idea behind the k-NN algorithm [17] is to exploit the closeness of the training examples in the feature space. The algorithm labels an observation according to the labels of its k nearest neighbors, where the "nearest neighbor" is the training example with the smallest distance to the example being analyzed. The most commonly used types of distance are geometric distances such as the Euclidean distance.

k-NN is a completely non-parametric approach, i.e. no assumptions are made about the shape of the decision boundary. A disadvantage of this method is that it is computationally expensive, since classifying an object requires calculating its distance to all the objects in the learning set.

2.4.3 Linear Discriminant Analysis (LDA)

LDA fits a Gaussian distribution conditioned on each class, assuming a common covariance matrix [33]. The class posteriors can then be derived from Bayes' theorem. The equal covariance matrices cause the normalization factors, as well as the quadratic terms in the exponents, to cancel, resulting in a linear classification boundary between any pair of classes. However, logistic regression can outperform LDA if these Gaussian assumptions are not met.

In real-world problems the boundary that separates two classes is rarely linear. Quadratic Discriminant Analysis (QDA) allows for more complex decision boundaries than LDA.

2.4.4 Support Vector Machine (SVM)

An SVM performs binary classification. In a multi-class setting, several SVM classifiers are combined using the one-versus-the-rest method [67] or the one-versus-one method [8].

In this project, we use an SVM classifier with the one-versus-one method for the single-instrument experiments.


2.4.5 Artificial Neural Networks (ANN)

Inspired by the structure of the human brain, neural networks are made up of single nodes called neurons, each with an input and an output, organized in array-like structures called layers. These layers are then stacked together to create the network.

The introduction of Convolutional Neural Networks (CNNs) improved results in many MIR tasks, such as onset detection [59, 60], audio structure analysis [66], automatic tagging [19], source separation [15] and musical instrument recognition [49, 45, 31]. A CNN is characterized by convolution operations that are applied with filters of a given size. This filter convolution can be thought of as inspecting the input at a specific point and its surroundings.

In this project, we use a CNN as the classifier for the multiple-instrument experiments.

2.5 Datasets

We now summarize some of the public datasets that have been published and used for musical instrument recognition (see Table 1).

There exist three databases commonly used for single tone classification: the McGill University Master Samples (MUMS) database [53], the University of Iowa Musical Instrument Samples (MIS) database [44] and the RWC Music database [29]. Another dataset is the Good sounds dataset [56], which contains recordings of single notes and scales. Finally, the Essid dataset [24] contains excerpts of seven isolated instruments.

The IRMAS Dataset [11] contains annotations for the automatic recognition of predominant instruments in musical audio. As for databases with multi-label annotations, the most commonly used are: the MedleyDB database [9, 10], the MusicNet database [65] and, more recently, the OpenMIC-2018 database [35].

                       Database            #Examples  #Instruments  Duration
single instrument      MUMS [53]                 667       18        song
                       MIS [44]                2,182       20        scale
                       RWC [29]                3,544       50        scale
                       Good sounds [56]        6,548       12        note
                       Essid [24]              2,755        7        5s
multiple instruments   IRMAS [11]              6,705       11        3s
                       MedleyDB [9, 10]          196       80        song
                       MusicNet [65]             330       11        song
                       OpenMIC-2018 [35]      20,000       20        10s

Table 1: A qualitative comparison of different existing datasets for instrument identification.


2.6 Evaluation metrics

We now define some basic terms before explaining the metrics used in musical instrument recognition. A result is a true positive (TP) if a binary classifier predicts "yes" and the actual value is "yes". Conversely, a result is a true negative (TN) if a binary classifier predicts "no" and the actual value is "no". A result is a false positive (FP) if a binary classifier predicts "yes" but the actual value is "no". Finally, a result is a false negative (FN) if a binary classifier predicts "no" but the actual value is "yes". In musical instrument recognition, the classifier predicting "yes" for a specific instrument means that it predicts that the instrument is present in the sample. In the multi-instrument experiments we say that an instrument is present in the sample when the predicted score is higher than a threshold, usually set to 0.5.

Accuracy is the correct prediction rate ("true" predictions) of the classifier:

\text{Accuracy} = \frac{TP + TN}{\text{total}} \qquad (3)

Precision is the rate of correct predictions when the prediction is "yes":

\text{Precision} = \frac{TP}{TP + FP} \qquad (4)

Recall, also known as sensitivity, expresses how often we have a correct prediction when the true value is "yes":

\text{Recall} = \frac{TP}{TP + FN} \qquad (5)

The F1 score is widely used for classification tasks as it combines the precision and recall scores as their harmonic mean, giving a single number that characterizes the performance of the model:

F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times TP}{2 \times TP + FP + FN} \qquad (6)

A confusion matrix summarizes the rates at which each predicted label corresponds to each true label.

All of the above scores are used in multi-class classification problems, where each sample belongs to one of C classes. However, for a multi-label classification problem, where each sample can belong to more than one class, it is not that simple. Still, there are many practical metrics that can be used [68].

One way to assess the performance of a classifier in a multi-instrument classification task is to check the F1 score for each instrument individually. That way we can evaluate how well the model predicts each label. However, since this yields a different value for each instrument, it does not give a single value that represents the overall performance of the model. To obtain one, we can use the weighted F1 score, which computes the F1 measure for each label and then averages the values, weighted by the number of true instances of each label. This measure accounts for label imbalance.
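A minimal sketch of these metrics with scikit-learn, using made-up labels and scores for five instruments; the 0.5 threshold and the per-label and weighted F1 computations follow the description above.

```python
import numpy as np
from sklearn.metrics import f1_score

instruments = ["drums", "bass", "guitar", "voice", "piano"]

# Made-up multi-label ground truth and predicted scores for four samples.
y_true = np.array([[1, 1, 0, 0, 1],
                   [0, 1, 1, 1, 0],
                   [1, 0, 0, 1, 1],
                   [0, 0, 1, 0, 0]])
scores = np.array([[0.9, 0.8, 0.2, 0.1, 0.7],
                   [0.3, 0.6, 0.7, 0.9, 0.2],
                   [0.8, 0.4, 0.1, 0.6, 0.4],
                   [0.2, 0.1, 0.9, 0.3, 0.1]])

# An instrument is considered present when its score exceeds 0.5.
y_pred = (scores > 0.5).astype(int)

# F1 per instrument, and a single weighted F1 that accounts for label imbalance.
per_label = f1_score(y_true, y_pred, average=None)
weighted = f1_score(y_true, y_pred, average="weighted")
for name, score in zip(instruments, per_label):
    print(f"{name}: {score:.2f}")
print(f"weighted F1: {weighted:.2f}")
```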

Confusion matrices can be computed for the case where we have a single instrument. However, in a multi-label classification task every instance has multiple possible predictions and ground-truth labels, so it is not possible to compute a traditional confusion matrix. Nevertheless, alternative forms of confusion visualization have been proposed [30].

3 Applying the scattering transform to instrument recognition

In this section we present the theory behind the scattering transform. We also discuss why it is a good way to extract useful representations for musical instrument recognition.

3.1 Representations and features

To help a system learn a classification task accurately and quickly, it is desirable to have robust and discriminative features to learn from. The features should be compact in size while still describing a signal completely and accurately. There are some properties that we would like these representations to have, such as time-shift invariance and stability to time-warping.

3.1.1 Continuous Wavelet Transform (CWT)

One of the problems with the STFT is that it has the same resolution across the whole time-frequency plane. When we use the STFT we assume that some portion (defined by the width and shift of the window) of a non-stationary signal is stationary. The Heisenberg uncertainty principle states that as one increases time resolution, resolution in frequency decreases [27]. What one can know are the time intervals in which a certain band of frequencies exists, which is a resolution problem. With a narrower STFT window we get better time resolution, but the frequency resolution is poorer. In order to analyse signal structures of very different sizes, it is necessary to use time-frequency atoms with different time supports [5]. Figure 1 compares the time-frequency resolution of the STFT and the CWT graphically.

The wavelet transform is composed of multiple dilations of a mother wavelet ψ(t) with zero average:

\int_{-\infty}^{+\infty} \psi(t)\, dt = 0 \qquad (7)

The transformed signal is a function of two variables, τ and s, the translation and scale parameters, respectively:

CWT_x^{\psi}(\tau, s) = \langle x, \psi_{\tau s} \rangle = \int_{-\infty}^{+\infty} x(t)\, \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{t - \tau}{s}\right) dt \qquad (8)

The translation term is related to the location of the window, as the window is shifted through the signal. The scaling parameter either dilates or compresses the signal: when the scale decreases, the time support is reduced but the frequency spread increases and covers an interval shifted towards higher frequencies.

Figure 1: Comparison of the time-frequency resolution between STFT and CWT [50]. (a) Heisenberg time-frequency boxes of two windowed Fourier transforms. (b) Heisenberg time-frequency boxes of two wavelets.
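For illustration only, the following sketch computes a CWT of a toy signal with the PyWavelets package; the package, wavelet choice and scales are assumptions and not something prescribed by this work.

```python
import numpy as np
import pywt

sr = 22050
t = np.arange(0, 1.0, 1 / sr)
# A toy chirp-like signal whose frequency content changes over time.
x = np.sin(2 * np.pi * (200 + 300 * t) * t)

# Dyadically spaced scales: small scales -> high frequencies, large scales -> low.
scales = 2 ** np.linspace(1, 8, 64)
coeffs, freqs = pywt.cwt(x, scales, "morl", sampling_period=1 / sr)

print(coeffs.shape)  # (number of scales, number of samples)
print(freqs[:3])     # approximate centre frequencies of the smallest scales
```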

3.1.2 The scattering transform

A scattering transform is a cascade of wavelet modulus operators and a time-averaging operator. Let W x denote the following wavelet transform [5]:

W x = \left( x \star \phi(t),\; x \star \psi_\lambda(t) \right)_{t \in \mathbb{R},\, \lambda \in \Lambda} \qquad (9)

\psi_\lambda(t) = \lambda\, \psi(\lambda t) \qquad (10)

with ψ(t) the mother wavelet, where φ(t) is a low-pass filter of frequency bandwidth 2π/T and Λ contains all wavelet center frequencies λ. For λ ≥ 2πQ/T the wavelets ψ_λ are defined by (10). For λ < 2πQ/T the filters ψ_λ are about Q − 1 equally-spaced filters with constant frequency bandwidth 2π/T. In this definition, T corresponds to the maximum scaling factor and Q denotes the number of wavelets per octave.

Let U x represent the following wavelet modulus operator:

U x = \left( x \star \phi(t),\; |x \star \psi_\lambda(t)| \right)_{t \in \mathbb{R},\, \lambda \in \Lambda} \qquad (11)

The modulus is applied in order to introduce a non-linearity. This operator provides stability to deformations and, in addition, is contracting, since ||a| − |b|| ≤ |a − b| for any a, b ∈ ℂ. However, the transform is not invariant to translation. To make it locally translation invariant we average it over time. This means that some information is lost, particularly the high frequencies. This lost information is, however, captured in the wavelet coefficients, which are rectified by taking the modulus so that the next invariant can be computed. In this way, we can define the scattering coefficients of increasing order as [5]:

S_0 x(t) = x \star \phi(t) \qquad (12)

S_1 x(t, \lambda_1) = |x \star \psi_{\lambda_1}(t)| \star \phi(t) \qquad (13)

S_2 x(t, \lambda_1, \lambda_2) = \big|\, |x \star \psi_{\lambda_1}(t)| \star \psi_{\lambda_2}(t) \,\big| \star \phi(t) \qquad (14)

S_m x(t, \lambda_1, \ldots, \lambda_m) = \big|\, |x \star \psi_{\lambda_1}(t)| \star \cdots \star \psi_{\lambda_m}(t) \,\big| \star \phi(t) \qquad (15)

This cascade of non-linearities and convolutions can be seen as a convolutional network, although in this case the filters are not trained because they are predefined wavelets.

3.2 Scattering transform for musical instrument recognition

The scattering transform was proposed as a way to build particular invariances into a representation. In fact, Mallat [51] points out the resemblance of the first-order scattering coefficients to the outputs of the first layers of convolutional deep neural networks trained with audio signals as inputs. The scattering transform has been used for musical genre classification [4, 5], audio texture synthesis [14] and environmental sound classification [58]. To the best of our knowledge, however, the scattering coefficients have not been used for the task of musical instrument recognition.

The first layer of scattering coefficients is comparable to MFCCs, which are the most commonly used features for musical instrument recognition. However, the further layers of scattering coefficients capture temporal details that are not considered in MFCCs, such as spectro-temporal modulations.

There are many properties that make the scattering transform well-suited for representing structured signals such as audio recordings. One of them is that the wavelet transform is contractive, and so is the complex modulus, so the whole scattering transform is contractive, which means that the information is compressed into fewer parameters. The resulting transformation is invariant to time-shifting and stable to deformations of the original signal. These properties make the scattering transform a good feature extraction mechanism for the task of musical instrument recognition.

4 Methodology

This section describes our methodology for exploring the effectiveness of scattering transform features for musical instrument recognition.

4.1 Data

4.1.1 Essid

For the single-instrument experiments we use the Essid dataset [24]. It consists of single-instrument recordings extracted from real music performance contexts. It collects various studio recordings with the aim of obtaining, for each instrument, a maximum number of different recording conditions, performers and types of music. Extracts were thus obtained from digital recordings (CDs) of classical music, jazz, or sound carriers used for teaching music, as well as from studio recordings. Essid proposed this dataset, as part of his PhD thesis, to help improve the automatic identification of instruments in realistic contexts.

Our experiments consider seven instruments: guitar (Gt), piano (Pn), cello (Co), clarinet (Cl), oboe (Ob), trumpet (Tr) and violin (Vl). For each of these instruments, we have several 5-second snippets from recordings with five different recording conditions per class. A summary of the database can be found in Table 2.

Instrument Code Sources Samples/source Total samples

Guitar Gt 5 74 370

Piano Pn 5 77 385

Cello Co 5 116 580

Clarinet Cl 5 56 280

Oboe Ob 5 90 450

Trumpet Tr 5 49 245

Violin Vl 5 89 445

Total 2,755

Table 2: Summary of the number of sources and samples for each of the instruments in the Essid database.

4.1.2 MedleyDB

For the multi-instrument experiments we use the MedleyDB dataset [9, 10], which contains high-quality melody annotations for 195 tracks of different lengths that are varied in musical style. This dataset has annotations that were generated in a semi-automated fashion by using monophonic pitch tracking on selected stems to compute the instrument activations on the final mixed track. Due to the small size of this dataset, we limit the experiments to cover only 5 instruments: drums (Dr), bass (Ba), guitar (Gt), voice (Vo) and piano (Pn). This is explained in more detail in Section 4.1.3. The songs in MedleyDB were obtained from many different sources (https://medleydb.weebly.com/acknowledgements.html), the majority of which were recorded in professional studios and mixed by experienced engineers.

Instrument Code Sources Total samples

Drums Dr 64 1,716

Bass Ba 78 2,284

Guitar Gt 63 1,871

Voice Vo 63 1,570

Piano Pn 49 1,445

Total 8,886

Table 3: Summary of the number of sources and samples for each of the instruments in the MedleyDB database.

4.1.3 Modified labels

The MedleyDB dataset offers high-level annotations for the tracks it contains; however, there are not many tracks to work with. As the amount of available data is small, we decided to group some classes into broader parent classes in order to have more samples per label. Table 4 shows the correspondence between the new labels and the original MedleyDB labels.


New label          MedleyDB labels
drums              drum set
bass               electric bass, double bass
guitar             distorted electric guitar, clean electric guitar, acoustic guitar
voice              male singer, female singer, male speaker, female speaker, male rapper, female rapper, beatboxing, vocalists, choir, male screamer, female screamer
piano              piano, tack piano, electric piano
synthesizer        synthesizer
cello              cello, cello section
clarinet           clarinet, clarinet section, bass clarinet
cymbals            cymbal
flute              flute, dizi, flute section, piccolo, bamboo flute, panpipes, recorder
mallet percussion  xylophone, vibraphone, glockenspiel, marimba
mandolin           mandolin
saxophone          alto saxophone, baritone saxophone, tenor saxophone, soprano saxophone
trombone           trombone, trombone section
trumpet            trumpet, trumpet section
violin             violin, violin section

Table 4: Correspondence between the new labels and the MedleyDB labels.

4.2 Stem mixing

Since we choose to use only 5 labels and disregard the rest, the unlabeled instruments become more common as more songs are added, which leads to more ambiguous labels for similar feature patterns. This might result in worse performance for a bigger training set, since the more training data we use, the more dispersed the distribution becomes. To deal with this problem, we mix single-instrument stems ourselves, as MedleyDB provides them. The final tracks are the result of linearly averaging the stems that contain the 5 instruments under consideration, disregarding the rest of the stems; a sketch is given below. There are other ways of mixing the stems that might provide more realistic tracks, for example by using a digital audio workstation (DAW), but that is left for future work.
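As referenced above, a minimal sketch of the linear stem averaging, assuming the soundfile package, hypothetical stem file names, and that all stems share the same sample rate:

```python
import numpy as np
import soundfile as sf

# Hypothetical stem files for the five instruments kept in this work.
stem_paths = ["drums.wav", "bass.wav", "guitar.wav", "voice.wav", "piano.wav"]

stems = []
for path in stem_paths:
    audio, sr = sf.read(path)
    if audio.ndim > 1:          # fold stereo stems down to mono
        audio = audio.mean(axis=1)
    stems.append(audio)

# Trim to the shortest stem and average linearly to form the new mix.
length = min(len(s) for s in stems)
mix = np.mean([s[:length] for s in stems], axis=0)

sf.write("remixed_track.wav", mix, sr)
```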

4.3 CNNs for scattering transform

Previous work applying the scattering transform to audio classification has used a multiclass SVM [4, 5, 3]. In addition, this project makes use of CNNs, since state-of-the-art results have been obtained with them [31, 62]. One of the issues with this approach is that the kernels might enhance undesired edges between the different blocks of second-order coefficients, since these blocks derive from different first-order coefficients. The group of second-order coefficients that corresponds to the i-th first-order coefficient is separated from the groups derived from the (i−1)-th and (i+1)-th first-order coefficients by a "hard" edge. One way to make sure that the extracted feature representations make sense is to have multiple inputs to the network: the first-order coefficients, plus each group of second-order coefficients separated according to the first-order coefficient it corresponds to. Looking at Figure 2, we would have as many inputs to the network as there are C_i matrices.

Figure 2: Multiple inputs configuration for the scattering coefficients.

Although this might seem a good way of adapting a CNN to the scattering transform, it would require too many input layers, which is not necessarily needed. In order to prevent the network from becoming too complex, a different approach was taken. Instead of having as many inputs as C_i matrices, we take two inputs consisting of two matrices containing the first- and second-order coefficients respectively. To prevent the kernels from detecting undesirable edges in the second-order coefficients, we use a one-dimensional kernel that convolves only through time instead of a two-dimensional kernel.
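The following is a simplified sketch of this two-input idea in PyTorch, not the exact CNN two architecture of Table 8: 5×5 kernels on the first-order input, 5×1 kernels (convolving only through time) on the second-order input, and a shared multi-label head. Layer counts and widths are placeholders.

```python
import torch
import torch.nn as nn

class TwoBranchScatteringCNN(nn.Module):
    """Simplified two-input CNN: 5x5 kernels for first-order coefficients,
    5x1 (time-only) kernels for second-order coefficients."""

    def __init__(self, n_classes=5):
        super().__init__()
        # Branch for first-order coefficients, input (batch, 1, time, 62).
        self.branch1 = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(5, 5), padding=2), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AvgPool2d(2),
            nn.Conv2d(64, 128, kernel_size=(5, 5), padding=2), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Branch for second-order coefficients, input (batch, 1, time, 237).
        # The (5, 1) kernels only look along the time axis, so the "hard"
        # edges between blocks of second-order coefficients are not mixed.
        self.branch2 = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(5, 1), padding=(2, 0)), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AvgPool2d((2, 1)),
            nn.Conv2d(64, 128, kernel_size=(5, 1), padding=(2, 0)), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_classes), nn.Sigmoid(),
        )

    def forward(self, s1, s2):
        f1 = self.branch1(s1).flatten(1)
        f2 = self.branch2(s2).flatten(1)
        return self.head(torch.cat([f1, f2], dim=1))

model = TwoBranchScatteringCNN()
out = model(torch.randn(2, 1, 259, 62), torch.randn(2, 1, 259, 237))
print(out.shape)  # (2, 5)
```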


4.4 Data split

For the Essid dataset, we split the data into training and test sets, with four different sources in the training set and a different one in the test set. This way, we get 80% of the samples in the training set and the remaining 20% in the test set. This split also takes into account the number of samples per instrument, maintaining the same label distribution across the two subsets.

For the MedleyDB dataset, the data are split into train, validation and test subsets. Initially the split was done song-wise, making sure that no chunks of the same song would end up in different subsets. This is extremely important in order to reasonably draw the conclusion that the classifier can generalize to new tracks. However, as the dataset is quite small, this kind of split did not give well-balanced subsets. Some label combinations were not present at all in the training set, which would result in a poor performance of the model on the other subsets. In order to ensure both a song-wise split and that all the label combinations appear in the different subsets, multiple random splits were performed until we obtained one that had all the combinations in the training set and most of the combinations in the validation and test sets while still being song-wise (a sketch of this procedure is given at the end of this section). After splitting the dataset in this way, the number of samples of each label in each subset is the following:

Instrument Training Validation Test

drums 1,382 191 143

bass 1,901 195 188

guitar 1,562 130 179

voice 1,305 104 161

piano 1,200 54 191

synthesizer 889 41 58

cello 273 4 80

clarinet 143 16 47

cymbals 84 3 40

flute 312 11 0

mallet percussion 161 0 21

mandolin 465 0 29

saxophone 54 24 0

trombone 22 0 0

trumpet 79 20 0

violin 411 6 111

Table 5: Number of samples per label in the different subsets.

We can see from Table 5 that several instruments, such as the mandolin and the trombone, have no samples in the validation and/or test subsets. This happens when there are only a few tracks featuring that instrument. When we prioritize having all instruments present in the training set, if an instrument only appears in one or two tracks then all of its samples go to the training set, since we have to maintain a song-wise split. Consequently, the validation and test sets do not contain any instance of that instrument. After seeing this distribution of the samples, we choose to work only with the top 5 instruments: drums, bass, guitar, voice and piano. The final split is shown in Table 6, with a split ratio of approximately 80%-10%-10% for the train, validation and test subsets.

Instrument Training Validation Test

drums 1,382 191 143

bass 1,901 195 188

guitar 1,562 130 179

voice 1,305 104 161

piano 1,200 54 191

Table 6: Number of samples per label for the final split in the different subsets.
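A sketch of the repeated song-wise splitting procedure described above, with a hypothetical mapping from songs to the label combinations of their snippets:

```python
import random

def song_wise_split(song_labels, train_frac=0.8, val_frac=0.1, max_tries=1000, seed=0):
    """Randomly split songs into train/validation/test subsets until every
    label combination occurring in the data is present in the training set.

    song_labels maps a song id to the set of label combinations found in
    its snippets (a hypothetical pre-computed structure).
    """
    rng = random.Random(seed)
    songs = list(song_labels)
    all_combos = set().union(*song_labels.values())

    for _ in range(max_tries):
        rng.shuffle(songs)
        n_train = int(train_frac * len(songs))
        n_val = int(val_frac * len(songs))
        train = songs[:n_train]
        val = songs[n_train:n_train + n_val]
        test = songs[n_train + n_val:]

        train_combos = set().union(*(song_labels[s] for s in train))
        if train_combos == all_combos:   # every combination seen during training
            return train, val, test
    raise RuntimeError("No valid song-wise split found")

# Toy example: three songs with their label combinations.
example = {
    "songA": {frozenset({"drums", "bass"})},
    "songB": {frozenset({"guitar", "voice"}), frozenset({"piano"})},
    "songC": {frozenset({"drums", "bass"}), frozenset({"piano"})},
}
print(song_wise_split(example, train_frac=0.67, val_frac=0.33))
```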

4.5 Pre-processing

For the Essid dataset, the tracks are downsampled to a sampling rate of 22.05 kHz. When extracting the Log-mel coefficients, the spectrogram is calculated by computing discrete Fourier transforms (DFT) over short overlapping Hann windows of size 400 samples and a hop size of 160 samples. Next, we calculate the Log-mel spectrogram with 64 mel bins, a minimum frequency of 50 Hz and a maximum frequency of 11.025 kHz. The first- and second-order scattering coefficients are extracted using Kymatio [6] Scattering1D. The averaging scale is specified as a power of two, 2^J. Here, we set J = 6 to get an averaging, or maximum, scattering scale of 2^6 = 64 samples (a window size of about 3 ms). The number of wavelets per octave is 8 for the first-order coefficients and 1 for the second order, as this has been empirically shown to give sparse representations of audio signals [61]. The resulting dimension of the samples is 64×827, and all the features are standardized by removing the mean and scaling to unit variance with estimates from the training data.

For the MedleyDB dataset, the tracks are downsampled to a sampling rate of 22.05 kHz, normalized and later split into non-silent intervals to get 6-second snippets. We choose a large value for the length of the snippet since there are long structures in this dataset that we would like to be able to characterize. The labels of the fragments are assigned according to the start and end times of each of the five selected instruments, following the Source ID annotations of the MedleyDB dataset. Before extracting the features, we generate the new tracks by mixing the single instruments' stems. The Log-mel coefficients are extracted in the same way as for the Essid dataset. The first- and second-order scattering coefficients are extracted using Kymatio [6] Scattering1D with a scaling factor of J = 9 (a window of 23 ms) and 8 and 1 wavelets per octave for the first- and second-order coefficients respectively. The resulting dimensions of the samples are two matrices of sizes 62×259 for the first-order coefficients and 237×259 for the second-order coefficients. All the features are standardized by removing the mean and scaling to unit variance with estimates from the training data.

4.6 Cross validation

In order to tune an estimator's hyperparameters, a part of the dataset is commonly held out as a validation set. However, when the number of observations in a dataset is small, holding out part of the available data as a validation set drastically reduces the amount of data which can be used for learning the model. A solution to this problem is to use so-called k-fold cross validation. In this approach, the dataset is divided into training and test sets, and the training set is further split into k subsets. The model is then trained using k − 1 of the training subsets, and the remaining subset is used to evaluate the model. The process is repeated for all the subsets and the final validation score is the average of the calculated evaluation scores. We can select the best performing parameters using the validation score, and perform a final evaluation on the test set.

4.7 Model details

For the experiment where we have one instrument per track we use a Support Vector Machine (SVM) with a radial basis function (RBF) kernel. The "one-against-one" approach [41] is used to perform the multi-class classification task. This way, if K is the number of classes, then K(K − 1)/2 binary classifiers are constructed, one for every possible pair of classes. Each sample is then classified according to a majority vote amongst the classifiers. In our experiment, the different features that we test are averaged through time and then input to the SVM. These inputs are vectors of length 64 (the number of mel bins) in the Log-mel case, of length 38 when using only the first-order coefficients, and of length 125 when using both first- and second-order coefficients. The parameters of the RBF kernel, C and γ, are selected using the RandomizedSearchCV implementation provided by Scikit-learn [55]. This way, we can sample the hyperparameters from a uniform distribution; the limits of that distribution can be seen in Table 7. Through cross validation, the cross-entropy loss of the estimator is calculated. The parameters finally chosen are those that give the lowest cross-entropy loss over 100 trials.

We evaluate the performance of the SVM on three feature settings: the commonly used Log-mel spectrogram, the first-order scattering transform coefficients, and the first- and second-order scattering transform coefficients. To perform instrument identification we train each SVM using five-fold cross validation. We then use the held-out testing data to evaluate the final model. We never include samples from the same source in both the training and testing data, in order to avoid biasing the measurements of performance due to unrealistic testing scenarios.

Model                                Parameter  Lower boundary  Upper boundary  Best
Log-mel                              C          1,000           10,000          2,799.38
                                     γ          1e-5            1e-3            8.16e-4
scattering first-order               C          1,000           10,000          9,531.04
                                     γ          1e-5            1e-3            9.36e-4
scattering first- and second-order   C          1,000           10,000          9,945.76
                                     γ          1e-5            1e-3            9.43e-4

Table 7: Results of the randomized hyper-parameter search.

For the experiment where we have multiple instruments per track, we use two different networks. The first one takes the Log-mel coefficients as input. For this setting we use an adapted version of the CNN6 architecture from the Pretrained Audio Neural Networks (PANNs) [42]. For our experiments, we did not use the proposed spectral augmentation, since it scarcely improved the performance.

As for the scattering network, CNN two, the first- and second-order scattering coefficients are treated as two separate inputs of the network. The main difference with respect to the CNN6 architecture is that the filters convolved with the second-order coefficients are one-dimensional. The reasoning behind this is that undesired features might be extracted from the second-order coefficients, because along the spectral dimension there are blocks of second-order coefficients that correspond to different first-order coefficients, and the boundaries between those blocks might influence the learned features too much.

In order to be able to make a fair comparison between the CNN6 and the CNN two models, we kept roughly the same number of trainable parameters in both, and selected the number of layers and the dimensions of the CNN two architecture with that in mind. The resulting architectures have about 4,500,000 trainable parameters. Table 8 summarizes the CNN architectures used. All convolutional layers have a stride of 1 and a padding of 2. Dropout [63] is applied after downsampling with the average pooling layers to prevent the systems from overfitting. The dropout of the CNN6 network is 20% after the 2×2 pooling layers and 50% after the global pooling layer. The dropout of the CNN two network is 50% after all the pooling layers.

CNN6                            CNN two
Log-mel spectrogram             First-order coefficients           Second-order coefficients
827 time steps × 64 mel bins    529 time steps × 62 coefficients   529 time steps × 237 coefficients
5×5 @ 64, BN, ReLU              5×5 @ 64, BN, ReLU                 5×1 @ 64, BN, ReLU
Pooling 2×2                     Pooling 2×2                        Pooling 2×2
5×5 @ 128, BN, ReLU             5×5 @ 128, BN, ReLU                5×1 @ 128, BN, ReLU
Pooling 2×2                     Pooling 2×2                        Pooling 2×2
5×5 @ 256, BN, ReLU             5×5 @ 256, BN, ReLU                5×1 @ 256, BN, ReLU
Pooling 2×2                     Global pooling                     Pooling 2×2
5×5 @ 512, BN, ReLU                                                5×1 @ 512, BN, ReLU
Global pooling                                                     Global pooling
FC 512, ReLU                    Concatenate
FC 5, Sigmoid                   FC 256, ReLU
                                FC 5, Sigmoid

Table 8: Different networks used for the experiments, CNN6 and CNN two. The size of the kernel of each convolutional layer is specified before the @ symbol. The number after the @ symbol denotes the number of feature maps used in each layer. BN stands for Batch Normalization and FC denotes a fully connected layer.

5 Results

In this section we present the results of our experiments.

5.1 Proof of concept

We explore how the different sets of features perform with a dataset based on single instrument tracks.

Table 9 shows the loss, accuracy, weighted F1-score, weighted precision and weighted recall of the different models, using the parameters selected over 100 trials shown in Table 7. The scattering first- and second-order features perform similarly to the Log-mel features, while performance is poor when using only the first-order scattering coefficients. Table 10 shows the instrument-wise F1 score. For the instruments Gt, Co, Cl and Ob, the scattering transform with first- and second-order coefficients outperforms the Log-mel features; for the remaining three instruments, the Log-mel features give the best performance.

                                     Loss    Accuracy   F1      Precision   Recall
Log-mel                              1.372   0.793      0.804   0.845       0.793
Scattering first-order               1.430   0.735      0.746   0.788       0.735
Scattering first- and second-order   1.374   0.791      0.802   0.839       0.791

Table 9: Model performances on the test set.

Code   Log-mel   Scatter first-order   Scatter first- and second-order
Gt     0.788     0.759                 0.828
Pn     0.752     0.701                 0.714
Co     0.857     0.783                 0.862
Cl     0.370     0.261                 0.400
Ob     0.966     0.954                 0.978
Tr     0.721     0.701                 0.719
Vl     0.810     0.657                 0.766

Table 10: Instrument-wise F1 score on the test set.
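For reference, the reported quantities (weighted scores as in Table 9, instrument-wise F1 as in Table 10, and row-normalized confusion matrices as in the figures below) can be computed with Scikit-learn roughly as follows; the label arrays are placeholders, not the thesis data.

```python
# Sketch: weighted and per-class metrics plus a row-normalized confusion matrix.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix

y_true = np.array([0, 1, 2, 3, 4, 5, 6, 0, 1, 2])   # placeholder true labels
y_pred = np.array([0, 1, 2, 0, 4, 5, 6, 0, 5, 2])   # placeholder predictions

print(f1_score(y_true, y_pred, average="weighted", zero_division=0))
print(precision_score(y_true, y_pred, average="weighted", zero_division=0))
print(recall_score(y_true, y_pred, average="weighted", zero_division=0))
print(f1_score(y_true, y_pred, average=None, zero_division=0))   # instrument-wise F1

cm = confusion_matrix(y_true, y_pred, normalize="true")  # each row sums to 1
```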

Figure 3: Normalized confusion matrix of the Log-mel model (rows: true label, columns: predicted label).

        Gt     Pn     Co     Cl     Ob     Tr     Vl
Gt     0.93   0.03   0.04   0.00   0.00   0.00   0.00
Pn     0.06   0.65   0.00   0.10   0.01   0.16   0.01
Co     0.04   0.00   0.93   0.01   0.00   0.00   0.02
Cl     0.39   0.07   0.04   0.27   0.00   0.23   0.00
Ob     0.00   0.00   0.00   0.01   0.94   0.04   0.00
Tr     0.00   0.00   0.00   0.00   0.00   0.90   0.10
Vl     0.00   0.00   0.26   0.00   0.00   0.00   0.74


Figure 4: Normalized confusion matrix of the first-order coefficients model (rows: true label, columns: predicted label).

        Gt     Pn     Co     Cl     Ob     Tr     Vl
Gt     0.86   0.11   0.03   0.00   0.00   0.00   0.00
Pn     0.03   0.60   0.01   0.27   0.00   0.08   0.01
Co     0.03   0.00   0.94   0.01   0.00   0.01   0.01
Cl     0.38   0.04   0.12   0.27   0.00   0.20   0.00
Ob     0.00   0.00   0.00   0.01   0.92   0.07   0.00
Tr     0.00   0.00   0.00   0.04   0.02   0.84   0.10
Vl     0.00   0.00   0.46   0.00   0.00   0.01   0.53

Figure 5: Normalized confusion matrix of the first- and second-order coefficients model (rows: true label, columns: predicted label).

        Gt     Pn     Co     Cl     Ob     Tr     Vl
Gt     0.95   0.03   0.03   0.00   0.00   0.00   0.00
Pn     0.01   0.58   0.00   0.25   0.00   0.16   0.00
Co     0.03   0.00   0.97   0.00   0.00   0.00   0.00
Cl     0.38   0.04   0.02   0.34   0.02   0.20   0.02
Ob     0.00   0.00   0.00   0.00   0.99   0.01   0.00
Tr     0.00   0.00   0.00   0.02   0.04   0.84   0.10
Vl     0.00   0.00   0.34   0.00   0.00   0.00   0.66


5.2 Reduced training data

We examine the effect of reducing the amount of data used to train the different CNN models mentioned above. To do so, we train the two models specified in Table 8 using 10%, 20% and 100% of the total available training data. In these experiments, the learning rate is set to 1e−03 and we use the Adam optimizer [39]. The parameters of the Adam optimizer are set to the default values proposed in the paper: 0.9 for β1, 0.999 for β2, and 1e−08 for ε. The batch size is fixed to 64. As for the weight initialization, we use the Xavier initialization [28] for the convolutional layers' weights; the batch normalization layers' weights are initialized to one and all biases are initialized to zero. The loss function we use is the binary cross-entropy loss.
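A sketch of this training configuration in PyTorch is shown below; the tiny placeholder network and the commented training loop stand in for the actual models and data loaders.

```python
# Sketch: training configuration (Adam, Xavier init, binary cross-entropy).
import torch
import torch.nn as nn

def init_weights(m):
    # Xavier initialization for convolutional weights; BN weights to one, biases to zero
    if isinstance(m, nn.Conv2d):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.ones_(m.weight)
        nn.init.zeros_(m.bias)

model = nn.Sequential(nn.Conv2d(1, 8, 5, padding=2), nn.BatchNorm2d(8),
                      nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(8, 5), nn.Sigmoid())      # placeholder network
model.apply(init_weights)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)
criterion = nn.BCELoss()          # binary cross-entropy on sigmoid outputs

# for features, targets in train_loader:   # batch size 64
#     optimizer.zero_grad()
#     loss = criterion(model(features), targets)
#     loss.backward()
#     optimizer.step()
```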

While training the CNN two model, we saw from the training and validation loss curves that it was overfitting (Figure 6). Some of the most common ways of handling overfitting are reducing the complexity of the network, adding weight regularization, or using dropout layers. With dropout at a rate of 0.5, the validation loss curves show no signs of overfitting.

Figure 6: On the left, training and validation loss curves over 100 epochs of the CNN two model without dropout. On the right, training and validation loss curves over 100 epochs of the CNN two model with dropout.

Many of the MedleyDB instruments we initially considered gave a very low F1 score. Further analysis showed that the instruments with an F1 score of 0.0 were those with the fewest samples, as seen in Table 11. That is why, as further explained in Section 4.1.3, we decided to select the five instruments with the most examples and disregard the rest.

Instrument          F1 score   Training samples   Validation samples
drums               0.946      1,382              191
bass                0.955      1,901              195
guitar              0.782      1,562              130
voice               0.731      1,305              104
piano               0.654      1,200              54
synthesizer         0.653      889                41
cello               0.000      273                4
clarinet            0.000      143                16
cymbals             0.000      84                 3
flute               0.667      312                11
mallet percussion   0.000      161                0
mandolin            0.000      465                0
saxophone           0.512      54                 24
trombone            0.000      22                 0
trumpet             0.000      79                 20
violin              0.286      411                6

Table 11: Instrument-wise F1 score measured on the validation set, together with the number of samples present in the training and validation sets.

While having to recognise five instruments instead of sixteen made the multi-label classification problem much easier, another problem arose: when disregarding the unlabelled instruments, the labels became more ambiguous for similar feature patterns. The test loss obtained after training the models on all of the training set was higher than the one obtained when training on a reduced training set (Figure 7). Because of that, we created new mixes from single-instrument stems, as explained in Section 4.2.
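As an illustration of this remixing idea, a toy sketch is given below; the gain range, the normalization and the label encoding are assumptions for illustration only, not the procedure of Section 4.2.

```python
# Sketch: building a multi-label training example from single-instrument stems.
import numpy as np

def mix_stems(stems, labels, n_classes=5, seed=0):
    """stems: list of equal-length mono waveforms; labels: class index per stem."""
    rng = np.random.default_rng(seed)
    gains = rng.uniform(0.5, 1.0, size=len(stems))     # random per-stem gains
    mix = sum(g * s for g, s in zip(gains, stems))
    mix = mix / (np.max(np.abs(mix)) + 1e-9)           # peak-normalize the mixture
    target = np.zeros(n_classes, dtype=np.float32)
    target[sorted(set(labels))] = 1.0                  # multi-hot instrument target
    return mix, target

# mix, target = mix_stems([drums, bass, vocals], labels=[0, 1, 3])
```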

Figure 7: Box plot of the test losses of 10 trials for the Log-mel and scattering models with different amounts of training data (10%, 20% and 100%). The data used were the originally mixed tracks found in the MedleyDB data set.

We train each model for 100 epochs, select the model that gives the smallest validation loss, and calculate the final test loss. We then examine the distribution of the test loss for each of those settings over 10 runs. The distributions are shown in Figure 8. Even though there is a lot of variability in the loss, the weighted F1 score does not differ much across runs.

Figure 8: Box plot of the test losses of 10 trials for the Log-mel and scattering models with different amounts of training data (10%, 20% and 100%). The data used were the newly mixed tracks.

We can see in Table 12 the instrument-wise F1 score on the test set for each of the models that gave the smallest validation loss for the different input features.

The scattering transform features outperform the Log-mel ones for the Dr, Ba and Pn instruments. For the Gt and Vo instruments, the Log-mel features are better than the scattering ones.

Code   Log-mel   Scatter
Dr     0.968     0.980
Ba     0.751     0.857
Gt     0.803     0.789
Vo     0.730     0.676
Pn     0.837     0.894

Table 12: Instrument-wise F1 score on the test set of the two best performing models. Both of them were trained on 100% of the training set.


6 Discussion

We now discuss the results of our experiments and reflect on our research ques- tions.

6.1 Single instrument experiment

Judging from the results obtained, the model has particular difficulty in detecting the clarinet. In Table 10 we see that the clarinet F1 score is very low. Looking at the normalized confusion matrices in Figures 3, 4 and 5, we see that the clarinet is mostly confused with the guitar or the trumpet. The model using the first- and second-order scattering coefficients classifies it best. The scattering first-order model shows the worst performance of the three tested models. The Log-mel model performs similarly to the scattering first- and second-order model (see Table 9), with a slight advantage for the Log-mel model. However, the confusion matrices show that the troublesome clarinet is better recognised when using the first- and second-order scattering coefficients. It seems that, in order to better model challenging instruments such as the clarinet, other instruments are confused more often. That might indicate that the extracted features are not representative enough of the data, that the features were extracted erroneously, or that there is a problem with the data itself. There are not many clarinet observations in the training set, but the same holds for the trumpet, and the latter is barely confused with any other instrument. This might be because the features extracted from trumpet samples are very distinct from those of the other instruments.

The piano seems to be confused with the clarinet and the trumpet, which, subjectively speaking, do not sound similar. Another misclassified instrument is the violin, which is often confused with the cello, although the cello is not confused with the violin. The reason might be the imbalance of the data we used: there are many more samples of cello than of violin, so the cello is chosen over the violin more often. Still, those instruments sound similar, so it is not odd to see confusion between them.

6.2 Multiple instrument experiment

The range of test-loss values obtained in the different runs is smaller for the scattering models. Judging from the results shown in Figure 8, we conclude that the scattering transform provides features that make the model more robust to the stochasticity of its initialization.

Some instrument activations are better reproduced with the Log-mel coefficients, others with the scattering coefficients. In Table 12 we see that the biggest difference in performance is for the bass.

In Appendix C we summarize the learned kernels of these convolutions applied to the different features. In Figure 17 we can see clear horizontal

References
