
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Evaluation of Audio Feature Extraction Techniques to Classify Synthesizer Sounds

FANNY ROCHE

Abstract

After many years focused on speech signal processing, research in audio processing started to investigate the field of music processing. Music Information Retrieval is a very new topic that has been growing steadily for a few years, as music is more and more a part of our daily life, particularly thanks to new technologies like mp3 players and smartphones.

Moreover, with the development of electronic music and the huge improvements in computational power, new instruments have appeared, such as virtual instruments, bringing with them new needs concerning the availability of sounds.

One main necessity which came with these novel technologies is a user-friendly system that makes it easy for users to access the whole range of sounds the device can offer.

The purpose of this thesis is to implement a smart automatic classification of synthesizer sounds based on audio descriptors, without any human influence.

Hence, the study first focuses on what a musical sound is and on the main characteristics of synthesizer sounds that need to be captured using wisely chosen audio descriptors. The interest then moves to a classifier system based on the Self-Organizing Map model, using unsupervised learning to match the main purpose of avoiding any human bias and using only objective parameters for the classification of the sounds. Finally, the evaluation of the system is presented, showing that it gives good results both in terms of accuracy and time efficiency.

Keywords: Music Information Retrieval, synthesizer sounds, audio descriptor extraction, neural networks, unsupervised learning, Self-Organizing Map model.


Sammanfattning

After many years in which signal processing concentrated on speech recognition, audio research began to take an interest in music processing. Music Information Retrieval is a new and steadily growing field, as music has become part of everyday life in recent years thanks to the latest technology in mp3 players and smartphones.

Moreover, with the great development of electronic music and the improvement of computational power, new musical instruments such as virtual instruments have appeared, and with them new needs for access to sounds. One necessity that came with these new technologies is a user-friendly system that makes it easy for users to access all the sounds the system can offer.

The goal of this degree project is to implement a smart and automatic classification of synthesizer sounds, based on audio descriptors and without human influence.

The first part of the study focuses on what music is and on the characteristics of synthesizer sounds that must be extracted with well-chosen audio descriptors. A classification system based on a model called the Self-Organizing Map, which uses unsupervised learning to avoid human influence and rely only on objective parameters, is then used. Finally, the system is evaluated, yielding good results in terms of both accuracy and time efficiency.

Keywords: Music Information Retrieval, synthesizer sounds, audio descriptor extraction, neural networks, unsupervised learning, Self-Organizing Map model.


Acknowledgement

The work presented in this thesis was carried out at a company named Arturia, from September 2015 to February 2016.

First, and most importantly, I would like to thank Samuel, my supervisor at the company, for his endless help, support and trust. Nothing would have been possible without him, and I would particularly like to thank him for his patience in reading all my writings and his deep commitment to my current and future projects.

Particular thanks go to Kévin and Adrien for their help and the interest they expressed in the project I presented to them for the future. I would also like to thank all the people at Arturia, and more particularly the dev team, for their kindness and friendliness during this thesis, making me feel like I belonged there. I really hope that the adventure will continue with you guys.

Then I thank Saikat Chatterjee, my supervisor at KTH, for sharing with me his knowledge and his interest in research. Thanks to him I discovered the field of research in audio processing, and I am very grateful for that.

I also thank my examiner Mikael Skoglund for reading this thesis, and Tiphanie for acting as my opponent for this project.

Special thanks go to my family for their love and unconditional support during my studies, and also for the trust they place in me. I would also like to thank Camille for all his love, support and patience, no matter the distance between us.

Finally, last but not least, special thanks to my friends for being there for me regardless of how far away they are.


Contents

1 Introduction
  1.1 Motivation
    1.1.1 Music
    1.1.2 Synthesizers
    1.1.3 Context of the Project
  1.2 Outline

2 Music Information Retrieval
  2.1 Music Signals and Synthesizer Sounds
    2.1.1 Musical Sounds
    2.1.2 Synthesizer Sounds
  2.2 Feature Extraction
    2.2.1 Interest and principle
    2.2.2 Choice of the descriptors
  2.3 Classification Methods - Machine Learning
    2.3.1 Principle
    2.3.2 Different Methods of Classification
    2.3.3 Introduction to Artificial Neural Networks

3 Audio Descriptors and Feature Extraction
  3.1 Global Descriptors
    3.1.1 Power Envelope Extraction
    3.1.2 Log-Attack Time
    3.1.3 Effective Duration
  3.2 Temporal Descriptors
    3.2.1 Zero-Crossing Rate (ZCR)
    3.2.2 Total Energy
  3.3 Spectral Descriptors
    3.3.1 Spectral Centroid
    3.3.2 Spectral Spread
    3.3.3 Spectral Skewness
    3.3.4 Spectral Kurtosis
    3.3.5 Spectral Slope
    3.3.6 Spectral Roll-Off
  3.4 Harmonic Descriptors
    3.4.1 Fundamental Frequency
    3.4.2 Inharmonicity
    3.4.3 Tristimulus
    3.4.4 Odd to Even Harmonic Ratio (OEHR)
    3.4.5 Harmonic Spectral Deviation
    3.4.6 Noisiness
  3.5 Perceptual Descriptors
    3.5.1 Mel-Frequency Cepstral Coefficients (MFCC)
    3.5.2 Total Loudness
    3.5.3 Relative Specific Loudness
    3.5.4 Sharpness
    3.5.5 Spread
  3.6 Other Descriptors
    3.6.1 Spectral Flux
  3.7 Summary of all the Chosen Descriptors

4 Classification of Synthesizer Sounds
  4.1 Choice of the Classifier: Self-Organizing Feature Map
  4.2 SOM Architecture
  4.3 Learning Algorithm and Mapping Principle
    4.3.1 Learning Algorithm Overview
    4.3.2 Initialization of the Weights
    4.3.3 Research of the Best Matching Unit
    4.3.4 Computation of the Neighborhood Radius
    4.3.5 Adjustment of the Weights
    4.3.6 Final Map
  4.4 Conclusion

5 Implementation and Results
  5.1 Creation of a Database of Audio Samples
  5.2 Audio Descriptors Extraction
    5.2.1 Choice of a Library of Audio Descriptors
    5.2.2 Adaptation of the Library and Implementation of the Extractor
    5.2.3 Normalization of the Descriptors
  5.3 Implementation of the Classifier
    5.3.1 Adaptation of the SOM to our Project
    5.3.2 Macro-Integration of the System
  5.4 Results of the Classification
    5.4.1 Examples of Obtained Results
    5.4.2 Accuracy
    5.4.3 Time Efficiency

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work

A Analog Lab Software
B Class Diagram for the Extraction of Descriptors
C Class Diagram of the Self-Organizing Map Classifier
D Class Diagram of the Synthesizer Sounds Classifier

Chapter 1

Introduction

1.1 Motivation

1.1.1 Music

Music is more and more a part of our daily life. When you enter a shop or a bus, or when you switch on your television, you can hear music. And nowadays, thanks to new technologies such as mp3 players and smartphones, everybody can listen to their own music wherever and whenever they want.

Hence, after many years focused only on speech processing, research on audio signals started to take an interest in music and its processing. A new field of science then emerged, called Music Information Retrieval.

Due to this new research interest, many music-related applications have been developed, Spotify or Shazam for instance. In this thesis, we will focus on musical sound classification, which is one of the main branches of Music Information Retrieval.

1.1.2 Synthesizers

The first synthesizer was invented in 1920 by Léon Theremin (who gave his name to the instrument). But we had to wait until the late 1960s and early 1970s to really start hearing synthesizer music, with all these new sounds that "classical" instruments (e.g. the piano or the clarinet) could not produce. Today, synthesizers are everywhere: almost every time you listen to a song, a synthesizer is involved.

With the huge improvements in the computational power of computers, virtual synthesizers have been developed, allowing musicians to have thousands of different sounds available in a single software application.

Hence, it is easy to understand the necessity for such systems to have a clear and ergonomic classification of sounds, so that users can easily find the sounds they seek.


1.1.3 Context of the Project

This project was carried out at a company named Arturia, which produces analog synthesizers and virtual-synthesizer software applications (VST plug-ins).

Until now, in order to find the sounds they want in these software applications, users have relied on a classification based on tags which are manually assigned by the sound designers. This is very laborious and totally subjective: for example, a sound will be tagged as a "bass" by one designer whereas another would have considered it a "lead" sound. The current classification is thus sound-designer-dependent.

The purpose of the project was to implement an automatic classification of the available synthesizer sounds based only on audio characteristics, which are objective.

1.2 Outline

In this thesis, we will begin by explaining in more detail what a musical sound is and what the particularities of synthesizer sounds are, since these are the kinds of musical sounds we dealt with during this project. The two main stages of the classification process will also be introduced: feature extraction and the choice of a classifier system.

Then, the study will focus more deeply on the feature extraction process by presenting the audio descriptors useful for characterizing our sounds in order to classify them.

There exist many different models of classification systems. In a third part, the chosen model and its operating principle will be discussed.

Finally, in a last section, the implementation of the classifier, the choices we made to develop a working system, and the obtained results will be presented.


Chapter 2

Music Information Retrieval

After many years focused on speech signal processing, research on audio signals started to investigate the field of music and has been growing steadily over the past decade.

Music Information Retrieval (MIR) is a highly interdisciplinary field of research whose purpose is to extend the understanding and usefulness of music data. It connects many different domains such as audio signal processing, pattern recognition, software design, machine learning, musicology or even psychology, to develop many real-world applications.

Nowadays, the main applications using music processing are music genre classification, instrument recognition and automatic music transcription (see [43], [10], [5], [15], or [8] for example). Even if these applications are slightly different from our project, many fundamental concepts are identical, such as feature extraction or the use of classification methods.

But first, before entering some complex scientific concepts, we will focus on our ground material, which is music, and more particularly synthesizer sounds.

2.1 Music Signals and Synthesizer Sounds

Music signals are very complex audio signals which may seem quite close to speech signals in the way that they are organized, unlike noise. However, music signals present very specific acoustic and structural characteristics which differentiate them clearly from other audio signals ([24]).

2.1.1 Musical Sounds

Concerning speech sounds, the interest lies in determining whether a sound represents a vowel or a consonant. For music, the interest is not at all the same; hence, the characteristics of the sound we will focus on are different.

The three main characteristics of music are pitch, loudness and timbre ([35], [25], [8]).

Pitch: this is the first characteristic of a musical sound. It is related to the tone and depends on the frequency of vibration. The pitch is a perceptual property of the sound which allows the listener to judge a note as "low" or "high" in the sense of musical melodies. For instance, a note played by a contrabass will be "low-pitched" while a note played by a piccolo will be "high-pitched". This property of the sound is directly linked to the representation of the music (see Figure 2.1).

Figure 2.1: Music representation

This characteristic is very important because it is related to the human perception of musical intervals. This perception is logarithmic with respect to frequency: we perceive the intervals [220 Hz - 440 Hz] and [440 Hz - 880 Hz] as the same, and such an interval is called an octave. In western music, a pitch space has been created where each octave has a size of 12, and the interval between two consecutive pitches is called a semitone.

The pitch is most of the time represented by letters:

A, A♯/B♭, B/C♭, B♯/C, C♯/D♭, D, D♯/E♭, E/F♭, E♯/F, F♯/G♭, G, G♯/A♭.

Loudness: this quantifies the subjective effect of the intensity of a sound received by the listener's ear. It depends on the intensity of the sound, which is objective in nature, and on the sensitivity of the ear, which is subjective. The measurement scale of loudness is logarithmic, and its unit is the decibel (dB):

$$L = 10 \log_{10}\left(\frac{I}{I_0}\right)$$

where $I$ is the intensity of the sound and $I_0$ is the threshold of hearing, i.e. the lowest sound intensity that can be detected by our ear within the range of audibility (which depends on the frequency of the signal, see Figure 2.2). This quantity makes it possible to determine whether a sound is loud or quiet.

Figure 2.2: Equal loudness contours (picture taken from http://www.offbeatband.com/wp-content/uploads/2009/08/equal-loudness-contour.jpg)

Timbre: this characteristic makes it possible to distinguish two notes of the same pitch and intensity, played for example by different instruments. It is sometimes also called the "color" or tonal quality of a sound. The timbre measures the complexity of a sound, i.e. the number and the intensity of the frequencies that compose it. It is mainly characterized by:

Attack, sustain and decay: the attack of the sound is the initial action given to the instrument to produce the sound, plucking the string of a guitar for example. The decay is the way the amplitude decreases with time after the attack. The sustain of the sound is the period of time during which it remains audible before fading away (e.g. the time during which a musician blows into his or her wind instrument).

Harmonic content: the components of the spectrum left after the fundamental frequency has been removed.

Vibrato/Tremolo: small periodic changes in the pitch/intensity of a tone that add richness to a sound. They can be seen as, respectively, a frequency modulation (FM) and an amplitude modulation (AM) of the tone.

We can easily distinguish between different instruments playing exactly the same note with the same loudness (see Figure 2.3). The tuning fork is the only instrument whose sound is composed of exactly one frequency (one pure sine wave). The harmonic content of a clarinet sound is mostly composed of odd harmonics (odd multiples of the fundamental frequency), whereas that of the guitar usually contains all harmonics, with decreasing amplitude.

2.1.2 Synthesizer Sounds

An analog synthesizer is an electronic musical instrument that generates electric signals which are converted to sound using amplifiers, loudspeakers or headphones.

Synthesizers are really particular instruments because they can imitate existing acoustic instruments such as the guitar or the piano, but also create new electronic timbres. Such instruments use various methods to generate sounds. The most common methods of waveform synthesis are subtractive synthesis, additive synthesis and frequency modulation synthesis, but these methods will not be developed here, as they are out of the scope of the project (see [45]).


Figure 2.3: Comparison of the timbre of different instruments: (a) tuning fork, (b) guitar, (c) clarinet

Figure 2.4: Picture of a synthesizer, the Arturia MATRIXBRUTE (picture taken from https://www.arturia.com/images/products/matrixbrute/05.jpg)

Synthesizers, just like the piano or the guitar for instance, can produce polyphonic sounds. This means that the instrument is able to play different notes at the same time, even if only one key is pressed. Hence, we will have to deal with multiple fundamental frequencies inside the same spectrum.

As already mentioned, synthesizers can imitate existing acoustic instruments, and this also includes percussive instruments (e.g. drums or cymbals). They can also imitate the sounds of daily life or of nature, called special-effects sounds, such as thunder or a breeze for instance. This means that unpitched sounds can also be encountered, i.e. sounds without a fundamental frequency, or at least without a well-defined pitch.

Another specificity of such instruments is that they are totally configurable through a number of parameters which varies from one synthesizer to another (and can be really large). A particular configuration of these parameters (leading to one particular sound) can be stored in order to be re-used: this is called a preset.

Finally, one last difficulty with synthesizers is that some presets use an arpeggiator or a sequencer (tools creating an audio sequence instead of one single note).

2.2 Feature Extraction

Feature extraction is a fundamental process whose purpose is to gather interesting and relevant data before implementing any kind of system.


A lot of research has been done on this topic, including in Music Information Retrieval, in order to choose wisely the relevant information to extract, with as little redundancy as possible, while remaining as representative as possible of the initial data.

In this section we will explain the interest of such a process and how we used it for our project.

2.2.1 Interest and principle

Signals in general contain a huge amount of data, and most of the time the raw signal is not directly useful for a particular application: it contains a lot of data that is irrelevant to it.

The idea behind feature extraction is to process the raw input signal in order to gather the most meaningful feature vectors for our application (here, as in most cases, the classification of our sounds), as explained in [18]. In some way, feature extraction can be seen as taking the information relevant to our application from the raw data, organizing it, and concentrating it into a convenient object (generally a vector or a matrix). This object is then fed to the classifier system instead of the raw signal.

The main principle of feature extraction methods is that extracting only the most relevant pieces of information from the raw signal works significantly better than simply giving all the data to the classifier system without any processing.

Independent aspects of the data are separated into different components which are then concatenated to obtain the final representation object. Those independent components are called descriptors. The most important stage of feature extraction is to choose them wisely, in order to represent the raw signal as well as possible.

2.2.2 Choice of the descriptors

Feature extraction is a fundamental process on which the behavior of our classifier system depends. Hence, it is really important to choose well the descriptors that will be used to represent our raw signal.

For audio signals, and more precisely musical signals, there exist different categories of descriptors which represent different aspects of the signal and need different processing methods to be extracted ([32], [34], [30], [10]).

Global descriptors: they represent some characteristics of the entire raw signal. They are computed for the whole signal. Examples of such descriptors are the attack or the duration (see section 3.1).

Instantaneous descriptors: they represent aspects of the signal which are time-varying. To calculate them, the signal needs to be separated into time frames (i.e. short time segments of the signal, see Figure 2.5). The purpose is to decompose a highly non-stationary signal into portions where the signal is stationary. To do so, and in order not to lose too much information, we usually choose to decompose the signal into overlapping frames (50% overlap). In our case, frames of 512 samples are extracted (with a sampling frequency of 44.1 kHz, this represents about 12 ms).


Figure 2.5: Signal decomposition into overlapping frames

Examples of descriptors of this kind are the temporal centroid or the zero-crossing rate (ZCR), see section 3.2.

Some other (also instantaneous) descriptors need more pre-processing to be computed, such as the Short-Term Fourier Transform or perceptual models.

Short-Term Fourier Transform:

In practice, the STFT of a given time frame is computed using the Fast Fourier Transform (FFT) of the corresponding time frame. To do so, we first apply a Hamming window to the raw signal (see Figure 2.6). After zero-padding the windowed signal in order to increase its spectral resolution, its 1024-sample FFT is computed (see [6]):

$$X[k] = \sum_{n=0}^{N-1} x[n]\, w[n]\, e^{-j \frac{2\pi k n}{N}} \quad \text{for } k \in [0, N-1]$$

where $X$ is the spectrum of the original signal $x$, $N$ is the size of the FFT, and $w$ is the window function. The Hamming window function is defined as:

$$w[n] = \alpha - (1 - \alpha) \cos\left(\frac{2\pi n}{N-1}\right) \quad \text{where } \alpha = 0.54$$

Figure 2.6: Hamming window and its transform

Each time you have to choose a window function in order to reduce the artifacts introduced by discontinuities in the waveform, you have to find the best trade-off between time resolution and frequency resolution with respect to the intended application.

This window function was chosen because it is optimized to minimize the maximum side lobe (about -43 dB), even if its main lobe is wider than that of other windows, thus resulting in a good compromise. Indeed, by having very low side lobes, the spectral leakage is reduced, and the selectivity in the frequency domain is improved by preventing the side lobes of a strong peak from covering up a nearby weak peak.

This kind of window is the one most frequently used in music processing ([30]).
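To make this concrete, here is a minimal NumPy sketch of the framing, windowing and zero-padded FFT stage described above. This is our illustration, not code from the thesis (whose extractor is built on an existing descriptor library, see chapter 5); the function name and parameters are ours.

```python
import numpy as np

def frame_spectra(signal, frame_size=512, fft_size=1024, hop=256):
    """Magnitude spectra of 512-sample frames with 50% overlap,
    Hamming-windowed and zero-padded to a 1024-point FFT."""
    window = np.hamming(frame_size)  # 0.54 - 0.46*cos(2*pi*n/(N-1))
    n_frames = 1 + (len(signal) - frame_size) // hop
    spectra = np.empty((n_frames, fft_size // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_size] * window
        # np.fft.rfft zero-pads the frame up to fft_size samples
        spectra[i] = np.abs(np.fft.rfft(frame, n=fft_size))
    return spectra
```

With a 44.1 kHz sampling rate, each 512-sample frame covers about 12 ms, matching the figures quoted above.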

Perceptual Models:

Since we deal with music signals and want to classify them for human purposes, descriptors related to the way we perceive sounds, called perceptual descriptors, need to be extracted. In order to compute them, several processing steps are applied to the spectrum.

Mel Scale: The human auditory system does not interpret pitch in a linear manner but rather in a logarithmic one, increasing with the frequency. Hence, to represent the human auditory system in a linear manner, the mel scale has been developed experimentally. A common formula to convert the frequency of a note into mel is [32]:

$$m = \begin{cases} f_c \left(1 + \log_{10} \frac{f}{f_c}\right) & \text{for } f > 1\,\text{kHz} \\ f & \text{for } f \leq 1\,\text{kHz} \end{cases}$$

where $f$ is the frequency in Hz, $m$ is the frequency expressed in mel, and $f_c = 1000$ Hz. Figure 2.7 is a plot of the relationship between the mel scale and the Hertz scale.

Figure 2.7: Mel scale versus Hertz scale (picture taken from https://upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Mel-Hz_plot.svg/512px-Mel-Hz_plot.svg.png)

It has been found that the human ear acts differently depending on the frequency, and that the frequency range can be divided into critical bands accordingly. The ear's behavior in each critical band can be compared to a triangular band-pass filter. The distribution of those critical bands leads to 24 equally spaced bands (in the mel scale), see Figure 2.8.

Figure 2.8: Mel bands (picture taken from [32])

Bark Scale: Another perceptual model is the Bark scale. This scale is more accurate for describing the human auditory system in terms of loudness perception, while the mel scale is adapted to a subjective measure of pitch. A common formula to convert a frequency into the Bark scale is [32]:

$$B = 13 \arctan\left(\frac{f}{1315.8}\right) + 3.5 \arctan\left(\left(\frac{f}{7518}\right)^2\right)$$

where $B$ is the frequency expressed in Bark and $f$ is the frequency in Hz. Here again, we decompose this Bark scale into 24 equally spaced bands, see Figure 2.9.

Figure 2.9: Bark bands (picture taken from [32])
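The two conversions transcribe directly into code; a small sketch (the vectorized form is our own):

```python
import numpy as np

def hz_to_mel(f):
    """Mel conversion from [32]: linear below 1 kHz, logarithmic above."""
    f = np.asarray(f, dtype=float)
    fc = 1000.0
    return np.where(f > fc, fc * (1.0 + np.log10(f / fc)), f)

def hz_to_bark(f):
    """Bark conversion from [32]."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(f / 1315.8) + 3.5 * np.arctan((f / 7518.0) ** 2)

print(hz_to_mel(440.0))    # below 1 kHz: unchanged (440.0)
print(hz_to_bark(1000.0))  # about 8.5 Bark
```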

Summary:

To describe our musical sounds, we will need to extract different kinds of descriptors:

Global descriptors: they represent some characteristics of the entire raw signal.

Temporal descriptors: they represent the temporal evolution of the signal.

Spectral descriptors: they represent the shape of the signal's spectrum.

Harmonic descriptors: they represent the harmonic and musical content of the signal.

Perceptual descriptors: they represent the perception we have of the signal.

The main stages of the feature extraction process are illustrated in Figure 2.10.

Figure 2.10: Block diagram of the feature extraction process


2.3 Classification Methods - Machine Learning

Once the relevant features are extracted, we will use them as input to feed a classifier system.

2.3.1 Principle

The main purpose of a classifier system is to classify samples into a given set of categories.

To implement a machine learning classification algorithm, there are five main steps to go through [18]:

1. Create a representative database.

2. Design a feature extractor.

3. Choose a classier model.

4. Use the data to estimate the parameters of the chosen model.

5. Evaluate the resulting classier.

Figure 2.11: The different stages of the implementation of a classifier

The first stage consists in gathering samples of the objects we want to classify, covering the whole range of existing objects for our application. For example, in our case, we have to select a database of sounds which are representative of the whole range of sounds that we have in our synthesizers (see subsection 2.1.2). This database has to contain samples that will be used in step 4 to estimate the parameters of the model, and other samples which will be needed to test and evaluate the system. Hence, the database has to be divided smartly into two parts: the learning database and the testing database. Usually, a size ratio of at least 10 between the two is used.

The second step has already been discussed in the previous section (see section 2.2).

The three last steps are strongly correlated. Indeed, depending on the model you choose for your classifier, the way to estimate its parameters and then to test it will be different. The step during which we estimate the parameters of the model is called the training, or learning, of the system. During this phase, the system is trained to recognize an object and classify it into the category it belongs to. To do so, we use the objects of our database, from which we extract the features that are fed to the system.


There exist different kinds of classification methods (covering the three last steps), which we will develop in the next section.

2.3.2 Different Methods of Classification

The different methods of classification may differ in the model they are based on, in terms of the probabilistic description of the features, or in the type of learning algorithm they use to estimate the different parameters. Even though these two characteristics of the system are intrinsically correlated, because both directly affect the learning stage of the system, they will be developed separately for more clarity.

Different Models

There are three main types of models that can be used in machine learning: generative, discriminative and geometric ([18], [26], [38]).

Generative and discriminative classifiers are both probabilistic models based on the statistical properties of the different classes. The class probabilities $p_C(j)$ and the class-conditional feature distributions $p_{X|C}(x|j)$ have to be determined during the learning phase.

The difference between them is that the generative model learns a model of the joint probability $p_{X,C}(x, j)$, where $x$ is the input and $j$ the class, whereas the discriminative model learns the conditional probability $p_{C|X}(j|x)$.

The third model is not based on statistical properties but on distances in the feature space (usually in two or three dimensions). To classify, we calculate the distance $d(x, j)$ between the observed feature vector and the target classes and select the closest one. To do so, different distances can be used, for example:

Minkowski distance: $d(x, y) = \sqrt[p]{\sum_{i=0}^{N-1} (x_i - y_i)^p}$ where $p \in \mathbb{N}$.

Mahalanobis distance [22]: $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$ where $S$ is the covariance matrix.

The most commonly used distance is the Euclidean distance (Minkowski distance for $p = 2$).
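A minimal sketch of these two distances (the absolute value in the Minkowski sum is implied by the definition and is needed for odd p):

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance; p = 2 gives the Euclidean distance."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mahalanobis(x, y, S):
    """Mahalanobis distance with covariance matrix S [22]."""
    d = x - y
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))
```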

The probabilistic models have an advantage over the geometric model: we can get an insight into the uncertainty of the classification. But to use such models, we have to initialize the probabilities correctly in order to match the real probabilities as well as possible at the end of the learning phase, which is impossible when we do not know the classes in advance.

Different Learning Approaches

There are two main ways to deal with the learning phase of a system, and the choice depends substantially on the kind of data we deal with ([40]).

The first one is supervised learning. It consists in training the system in such a way that each input is labeled according to the class it belongs to, i.e. the output value. During this step, the parameters of the system are updated so that, at each iteration, the system knows exactly what to update. We can think of this approach as a student learning a course with a teacher. In these scenarios, the goal of the system is to learn to produce the correct output (return the correct class) for a new input, using previous experience. This method requires the user to know exactly what the output classes are and how to classify the objects into them.

The other way is called unsupervised learning ([11]). In contrast to the first one, the data are not labeled, so we do not know in advance which class an input object belongs to. The system thus updates its parameters by "deciding by itself" what to do at each iteration. The purpose of this method is to find patterns which describe the hidden structure of the data, and then to build a model that can be used for decision making or for predicting future inputs using these patterns. For this type of learning, we often use the term clustering instead of classification, because there are no predefined classes but rather subsets whose structures are close to those of the objects of the database.

There is also an intermediate learning approach called reinforcement learning. It consists of a machine which interacts with the environment by producing actions and receiving rewards for them. The purpose of such a system is to learn how to act in a particular situation in order to maximize the future rewards it will receive.

2.3.3 Introduction to Artificial Neural Networks

In this section, we will introduce a kind of machine learning algorithm which will be useful for our application (see section 4.1): Artificial Neural Networks (ANN). These models are inspired by the biological neural structure of the brain (see [37], [40]). The idea is that a system could carry out complex computations in a way similar to that of the human brain, i.e. using many computational elements joined together by communication links. Those elements are called neurons, and we will describe their structure in the next section.

Structure of the Neurons

The structure of artificial neurons, also called perceptrons ([37], [14], [27]), is based on a simplified version of biological neurons (see Figure 2.12).

Figure 2.12: Structure of an artificial neuron (picture taken from [14])

As we can see in Figure 2.12, a neuron is connected to a layer containing many inputs, $(x_1, x_2, ..., x_n)$, and contains a vector of weights, $(w_1, w_2, ..., w_n)$, which are real numbers expressing the relevance of each input.

As we said before, a neuron is a computational element, and it contains two different functions. The first function, denoted as $\Sigma$ in Figure 2.12, basically represents the integration of the transmitted information at the neuron. It usually consists in adding the different weighted signals: $\sum_{i=1}^{n} w_i x_i$.

The second function (denoted as $f$ in Figure 2.12) is called the activation function or primitive function. It transforms the transmitted information into a unique, precisely defined output. Hence, a neural network can be seen as a network of primitive functions that takes decisions from inputs.

Classical functions used as primitive functions are:

Sigmoid function: $f(z) = \frac{1}{1 + e^{-z}}$

Hyperbolic tangent function: $f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$

Heaviside function, also called unit step function: $f(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases}$

Figure 2.13: The different primitive functions presented
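As an illustration, the forward pass of one such neuron is a few lines of NumPy. The bias term b is a common addition that the structure above omits; treat it as our assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b=0.0, activation=sigmoid):
    """Single artificial neuron: weighted sum of the inputs (the
    Sigma block) followed by a primitive function f."""
    return activation(np.dot(w, x) + b)

x = np.array([0.2, 0.7, 0.1])   # inputs x_i
w = np.array([0.4, -1.0, 2.0])  # weights w_i
print(neuron_output(x, w))      # sigmoid(-0.42), about 0.397
```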

Architecture of the Network

As already explained, an artificial neural network is made of organized layers of connected neurons. The different layers of the network can be decomposed into three distinct categories:

the input layer (which is unique)

hidden layers (as many as needed; the more numerous, the more complex the network)

the output layer (which is also unique)

Hence, the inputs enter the system through the input layer and are then transmitted to the neurons, to finally return a unique output which corresponds, in our case, to the matching class.


Figure 2.14: Architecture of an ANN (picture taken from [27])

Learning of the Model

As with every machine learning algorithm, the system needs to be trained, as we saw in subsection 2.3.1.

Neural networks can be trained using supervised, unsupervised or reinforcement learning.

The main principle of learning for such systems is to update the weights of the different neurons based on the input samples that are given to the network. Hence, after the learning phase, it is able to make the right decision (according to the main application) when a new input comes in.

Usually, the weights are randomly initialized at the beginning. Then, the model is trained using a particular algorithm that depends on the kind of neural network chosen for the application. We will describe the one we chose for our application in more detail in section 4.1.


Chapter 3

Audio Descriptors and Feature Extraction

In the context of the project, our final purpose is to cluster synthesizer sounds in a way that is based only on objective parameters.

As we have seen in chapter 2, the fact that we deal with musical sounds, and more particularly synthesizer sounds, is really important and strongly influences the types of descriptors necessary for this application. We have already discussed, in section 2.2, the different types of descriptors we need to extract and the different pre-processing stages we need to apply to our raw signal in order to extract them.

In Music Information Retrieval, a lot of research has been done on the different audio descriptors which can be useful for various applications: [32], [34], [33], [30], [23], [10], [5], [43]. These papers are the basis from which we chose the different audio descriptors.

In this chapter, we will see in detail the descriptors we chose to extract in order to classify our synthesizer sounds.

3.1 Global Descriptors

The first descriptors that will be extracted are the global ones, as they do not need any particular pre-processing (as seen in section 2.2).

Figure 3.1: Extraction of the global descriptors

In the project, we focused on two different global descriptors: the log-attack time and the effective duration of the signal. Both are scalar values.

3.1.1 Power Envelope Extraction

To extract those two descriptors, we first need to compute the power envelope of the signal.

The envelope of an oscillating signal is a smooth curve which "draws" the contour of its extrema; see Figure 3.2 for an example.

Figure 3.2: Example of a signal envelope (picture taken from https://upload.wikimedia.org/wikipedia/commons/3/31/Signal_envelopes.png)

Hence, we first compute the instantaneous power of the signal $s$: $p[n] = s^2[n]$. Then the envelope of $p$ is extracted using an envelope detector. Many different envelope detectors exist; we will develop the method we chose in section 5.2.

3.1.2 Log-Attack Time

Most of the time, the power envelope of a musical signal can be sketched as in Figure 3.3.

Figure 3.3: Sketch of the power envelope of an audio signal

The attack is the beginning of an audio signal. It is the time slot during which the instantaneous power of the signal goes from zero to its first local maximum. It is principally described by its duration.

It is really hard to determine exactly the duration of the attack of a sound, but it can be estimated using thresholds (see Figure 3.4). The thresholds are set at about 20% of the local maximum value of the power envelope for the start of the attack, and around 90% for its end. Those are empirically chosen values and can be adapted to the type of sounds, but they seem to be the best values to ignore parasitic noise at the beginning of the sound (which could be interpreted as the start of the attack) and to handle the case where the first maximum of the power envelope is reached during the sustain part rather than during the attack.

Figure 3.4: Attack of an audio signal (picture taken from [32])

Once the start and the end of the attack are known, it is easy to calculate the log-attack time of the signal:

$$logAttackTime = \log_{10}(t_{stop\_attack} - t_{start\_attack})$$

3.1.3 Effective Duration

The effective duration of a signal is the measure of the time during which the signal is considered "meaningful", i.e. the duration during which the power of the signal is considered high enough.

The calculation of this descriptor is again done using a threshold on the power envelope of the signal (see Figure 3.5).

Figure 3.5: Effective duration of a signal (picture taken from [32])

The threshold is set at 40% of the maximum value of the power envelope of the signal. The effective duration is thus the length of the period during which the power envelope is above this threshold.
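A hedged sketch of both estimators, assuming the power envelope has already been computed by an envelope detector (section 5.2); it uses the global maximum as a stand-in for the first local maximum, and omits the guards a production extractor would need for degenerate envelopes:

```python
import numpy as np

def log_attack_time(envelope, fs, start_ratio=0.2, stop_ratio=0.9):
    """Log-attack time from a power envelope, using the 20%/90%
    thresholds described above."""
    peak = envelope.max()
    start = np.argmax(envelope >= start_ratio * peak)  # first 20% crossing
    stop = np.argmax(envelope >= stop_ratio * peak)    # first 90% crossing
    return np.log10((stop - start) / fs)

def effective_duration(envelope, fs, ratio=0.4):
    """Time during which the power envelope exceeds 40% of its maximum."""
    return np.count_nonzero(envelope > ratio * envelope.max()) / fs
```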


3.2 Temporal Descriptors

As we already discussed in section 2.2, the temporal descriptors are extracted using a frame decomposition of the original signal. Hence, the extracted descriptors are vectors whose length is the number of frames.

Figure 3.6: Extraction of the temporal descriptors

Here are the temporal descriptors we chose to extract. For the rest of the report, we will consider the signal $x$ to be one frame of the original audio signal $s$.

3.2.1 Zero-Crossing Rate (ZCR)

The zero-crossing rate measures the number of times the signal crosses the x-axis. This descriptor indicates whether the signal is periodic or rather noisy: the higher the ZCR, the noisier the signal.

3.2.2 Total Energy

This descriptor measures the energy of the signal for each frame:

$$totalEnergy = \frac{1}{N} \sum_{n=0}^{N-1} x^2[n]$$

where $N$ is the length of the frame and $x$ is the frame signal.

The total energy descriptor indicates whether the frame has a high or a low energy. This way, we can determine the evolution of the original signal's energy over time.
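Both temporal descriptors are one-liners per frame; a sketch (the ZCR is expressed here as a fraction of sample pairs, an equivalent convention to a raw count per frame):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    return np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:]))

def total_energy(frame):
    """Mean squared amplitude of the frame, as in the formula above."""
    return np.mean(frame ** 2)
```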

3.3 Spectral Descriptors

To extract the spectral descriptors, we first need to compute the spectrum of each signal frame, as explained before (see section 2.2).

We chose to extract several spectral descriptors to describe different characteristics of the spectra.


Figure 3.7: Extraction of the spectral descriptors

3.3.1 Spectral Centroid

The spectral centroid represents the barycenter, or "center of gravity", of the signal's frequency components. This descriptor can be considered a measure of the brightness of the sound.

It is defined as the sum of the frequencies weighted by their amplitudes, divided by the sum of all the amplitudes of the spectrum:

$$spectralCentroid = \frac{\sum_{k=0}^{N-1} f[k]\, X[k]}{\sum_{k=0}^{N-1} X[k]}$$

where $N$ is the size of the spectrum, $k$ is the frequency bin, $f[k] = \frac{k f_s}{N}$ is the center frequency of the bin (with $f_s$ the sampling frequency), and $X[k]$ is the amplitude of the bin. This descriptor is expressed in Hz.

Generally, a sound which is considered as having "dark" qualities tends to have more low frequencies, whereas a "brighter" sound contains more high frequencies. Hence, using the spectral centroid descriptor, we are able to know which kinds of frequencies are predominant in the sound, and thus to have a measure of its brightness.

3.3.2 Spectral Spread

The spectral spread is defined as the spread of the spectrum around its mean value, i.e. around the spectral centroid (see the previous subsection). Hence, we can see this quantity as the variance of the spectrum of the frame. This descriptor relates to the richness of a sound in terms of frequency content:

$$spectralSpread = \frac{\sum_{k=0}^{N-1} (f[k] - spectralCentroid)^2\, X[k]}{\sum_{k=0}^{N-1} X[k]}$$

3.3.3 Spectral Skewness

The spectral skewness is a measure of the degree of asymmetry of the spectrum of the frame around its "center of gravity". Once again, this descriptor uses the spectral centroid:

$$spectralSkewness = \frac{m_3}{\sigma^3}$$

with $m_3 = \frac{\sum_{k=0}^{N-1} (f[k] - spectralCentroid)^3\, X[k]}{\sum_{k=0}^{N-1} X[k]}$ and $\sigma^3 = spectralSpread^{3/2}$.

If $spectralSkewness = 0$, the spectrum is symmetric around the spectral centroid.

If $spectralSkewness < 0$, there is more energy to the right of the spectral centroid, i.e. more energy in the higher frequencies.

If $spectralSkewness > 0$, there is more energy to the left of the spectral centroid, i.e. more energy in the lower frequencies.

Figure 3.8: Examples of distributions with different spectral skewness values: (a) spectralSkewness = 0, (b) spectralSkewness < 0, (c) spectralSkewness > 0 (pictures taken from [32])

3.3.4 Spectral Kurtosis

The spectral kurtosis is a measure of the flatness of the spectrum of the frame around the spectral centroid, i.e. around its barycenter. It is very similar to the spectral skewness but uses the central moment of order four:

$$spectralKurtosis = \frac{m_4}{\sigma^4}$$

with $m_4 = \frac{\sum_{k=0}^{N-1} (f[k] - spectralCentroid)^4\, X[k]}{\sum_{k=0}^{N-1} X[k]}$ and $\sigma^4 = spectralSpread^2$.

If $spectralKurtosis = 3$, the spectrum has a normal distribution.

If $spectralKurtosis < 3$, the spectrum has a flat distribution.

If $spectralKurtosis > 3$, the spectrum has a peaky distribution.

Figure 3.9: Examples of distributions with different spectral kurtosis values: (a) spectralKurtosis = 3, (b) spectralKurtosis < 3, (c) spectralKurtosis > 3 (pictures taken from [32])
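The four spectral moments share the same weighted-average machinery, so a single sketch covers sections 3.3.1-3.3.4 (here X is taken to be the N amplitude bins of one frame's spectrum, following the notation above):

```python
import numpy as np

def spectral_moments(X, fs):
    """Spectral centroid, spread, skewness and kurtosis of an
    N-bin amplitude spectrum X, following the formulas above."""
    N = len(X)
    f = np.arange(N) * fs / N            # bin center frequencies f[k]
    total = np.sum(X)
    centroid = np.sum(f * X) / total
    spread = np.sum((f - centroid) ** 2 * X) / total
    m3 = np.sum((f - centroid) ** 3 * X) / total
    m4 = np.sum((f - centroid) ** 4 * X) / total
    skewness = m3 / spread ** 1.5        # m3 / sigma^3
    kurtosis = m4 / spread ** 2          # m4 / sigma^4
    return centroid, spread, skewness, kurtosis
```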

3.3.5 Spectral Slope

The spectral slope characterizes how quickly the spectrum decreases towards the high frequencies.

This descriptor is calculated using a linear regression on the spectral amplitudes. The spectrum is then roughly approximated by $\hat{X}$:

$$\hat{X}[k] = a\, f[k] + b$$

where $b$ is a constant and

$$a = \frac{1}{\sum_{k=0}^{N-1} X[k]} \cdot \frac{N \sum_{k=0}^{N-1} f[k]\, X[k] - \left(\sum_{k=0}^{N-1} f[k]\right)\left(\sum_{k=0}^{N-1} X[k]\right)}{N \sum_{k=0}^{N-1} f^2[k] - \left(\sum_{k=0}^{N-1} f[k]\right)^2}$$

Figure 3.10: Example of spectral slope estimation

3.3.6 Spectral Roll-Off

The last spectral descriptor is the spectral roll-off. It measures the frequency below which 95% of the power of the spectrum is contained. It is frequently used to differentiate percussive or noisy sounds from more constant sounds, such as notes played by a violin for instance. It is defined implicitly by the following formula:

$$\sum_{k=0}^{SpectralRolloff} X^2[k] = 0.95 \sum_{k=0}^{N-1} X^2[k]$$

This descriptor is somehow related to the harmonic/noise cut-off frequency.
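Sketches of these last two spectral descriptors, again over one frame's amplitude spectrum X with bin frequencies f:

```python
import numpy as np

def spectral_slope(X, f):
    """Amplitude-normalized linear-regression slope of X over f,
    matching the closed-form expression above."""
    N = len(X)
    num = N * np.sum(f * X) - np.sum(f) * np.sum(X)
    den = N * np.sum(f ** 2) - np.sum(f) ** 2
    return num / (den * np.sum(X))

def spectral_rolloff(X, f, fraction=0.95):
    """Frequency below which `fraction` of the spectral power lies."""
    cumulative = np.cumsum(X ** 2)
    k = np.searchsorted(cumulative, fraction * cumulative[-1])
    return f[k]
```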

3.4 Harmonic Descriptors

The harmonic descriptors are extracted from the spectra of the signal frames.

As explained in section 2.2, these descriptors represent the harmonic and musical content of the signal. They are thus quite important, as they make it possible to determine two main characteristics of musical sounds: the pitch and the timbre of the sound (see subsection 2.1.1). Of course, as they characterize the harmonic content of the signal, they are not relevant for unpitched sounds (see subsection 2.1.2) and thus will not be extracted for them.

Figure 3.11: Extraction of the harmonic descriptors

Here are the different harmonic descriptors we chose to extract.

3.4.1 Fundamental Frequency

The fundamental frequency is the lowest frequency such that its integer multiples explain well the content of the spectrum of the frame signal. This descriptor has a real meaning only for periodic or nearly periodic sounds. For other audio signals such as unpitched sounds, it is neither well-defined nor relevant, and thus will not be extracted (as mentioned earlier).

It is one of the most important descriptors, firstly because it is directly linked to the pitch of the audio signal, and also because it is used in the computation of many other descriptors, such as the inharmonicity (see the next subsection).

There exist several ways to extract the fundamental frequency from an audio signal; we will discuss the chosen one in section 5.2.

3.4.2 Inharmonicity

The inharmonicity of a signal can be described as the divergence between a purely harmonic signal (containing only integer multiples of the fundamental frequency) and the real spectrum, see Figure 3.12.

Figure 3.12: Example of the inharmonicity of a signal

This descriptor is computed using the following formula:

$$inharmonicity = \frac{2}{f_0} \cdot \frac{\sum_h |f[h] - h f_0|\, X^2[h]}{\sum_h X^2[h]}$$

where the $h$ are the partials of the spectrum (i.e. the multiples of the fundamental frequency, not necessarily integer multiples), $f[h]$ is the frequency of the observed $h$-th partial, $f_0$ is the fundamental frequency and $X$ is the spectrum of the frame signal.

Inharmonicity is principally observed with string instruments such as the piano or the guitar, because of the stiffness of the strings. This phenomenon is particularly present in the higher harmonics of the signal.

This descriptor ranges from 0 to 1, where 0 stands for a purely harmonic signal and 1 for a purely inharmonic signal.
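Assuming the partial frequencies and amplitudes have already been picked from the spectrum (the thesis delegates this to its descriptor library, see chapter 5), the formula transcribes as:

```python
import numpy as np

def inharmonicity(partial_freqs, partial_amps, f0):
    """Energy-weighted deviation of the measured partials from the
    ideal harmonic series h*f0, scaled by 2/f0 as above."""
    h = np.arange(1, len(partial_freqs) + 1)   # ideal partial numbers
    power = partial_amps ** 2
    deviation = np.abs(partial_freqs - h * f0)
    return (2.0 / f0) * np.sum(deviation * power) / np.sum(power)
```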

3.4.3 Tristimulus

In image processing, almost any color can be described using a combination of three primary colors (red, green and blue, giving the RGB scale). In music processing, the same concept is used to describe the timbre of a sound: a three-dimensional space, called the tristimulus, represents its harmonic content: $(T_1, T_2, T_3)$.

$T_1$ represents the relative weight of the first harmonic (the fundamental frequency), $T_2$ the relative weight of the second, third and fourth harmonics taken together, and $T_3$ the relative weight of all the remaining harmonics, see Figure 3.13.

Figure 3.13: Example of harmonic distribution for tristimulus estimation

$$T_1 = \frac{X[1]}{\sum_{h=1}^{H} X[h]} \qquad T_2 = \frac{X[2] + X[3] + X[4]}{\sum_{h=1}^{H} X[h]} \qquad T_3 = \frac{\sum_{h=5}^{H} X[h]}{\sum_{h=1}^{H} X[h]}$$

where the $h$ are the harmonics, $X[h]$ is the amplitude of the $h$-th harmonic of the spectrum and $H$ is the total number of harmonics.

3.4.4 Odd to Even Harmonic Ratio (OEHR)

The odd to even harmonic ratio is a measure of the repartition of the energy between odd and even harmonics. This descriptor is also directly linked to the timbre through its harmonic content. This ratio makes it possible, for example, to distinguish between a note played by a clarinet (containing almost only odd harmonics) and a note played by a trumpet (containing almost every harmonic up to the 10th).

Figure 3.14: Example of the repartition of odd and even harmonics

It is computed as the ratio between the power of the odd harmonics and the power of the even harmonics:

$$OEHR = \frac{\sum_{h=0}^{H/2} X^2[2h+1]}{\sum_{h=1}^{H/2} X^2[2h]}$$
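Given the harmonic amplitudes X_h (with X_h[0] the fundamental, i.e. harmonic number 1), both descriptors reduce to simple slicing; a sketch:

```python
import numpy as np

def tristimulus(X_h):
    """(T1, T2, T3): relative weights of harmonic 1, harmonics 2-4,
    and harmonics 5 and above."""
    total = np.sum(X_h)
    return X_h[0] / total, np.sum(X_h[1:4]) / total, np.sum(X_h[4:]) / total

def odd_to_even_ratio(X_h):
    """Power of odd-numbered harmonics (1, 3, 5, ...) over power of
    even-numbered ones (2, 4, 6, ...)."""
    power = X_h ** 2
    return np.sum(power[0::2]) / np.sum(power[1::2])
```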

3.4.5 Harmonic Spectral Deviation

The harmonic spectral deviation is the measure of the difference between the harmonic peaks and a global spectral envelope.

The spectral envelope is a curve which determines the global shape of the spectrum, i.e. the location of the dominant peaks on the frequency scale. Many different methods exist to estimate this envelope; we will explain the chosen one in section 5.2.

Figure 3.15: Example of the harmonics of a signal and its spectral envelope

$$HSD = \frac{1}{H} \sum_{h=1}^{H} (X[h] - SE[h])$$

where the $h$ are the harmonics of the signal, $X[h]$ is the amplitude of the $h$-th harmonic, $SE[h]$ is the value of the spectral envelope at that harmonic, and $H$ is the total number of harmonics.

3.4.6 Noisiness

The last harmonic descriptor we extract is the noisiness. It is the ratio between the energy of the noise (i.e. the non-harmonic part) and the total energy of the spectrum. The noise energy is obtained by subtracting the energy of the harmonics from the total energy. Hence, the noisiness is calculated with the following formula:

$$noisiness = \frac{noise\_energy}{total\_energy} = 1 - \frac{harmonic\_energy}{total\_energy} = 1 - \frac{\sum_{h=1}^{H} X^2[h]}{\sum_{k=0}^{N-1} X^2[k]}$$

where the $h$ are the harmonics of the signal, $H$ is the total number of harmonics, the $k$ are the frequency bins and $X[k]$ is the spectral amplitude of the $k$-th bin.

This descriptor ranges from 0 to 1, where 0 stands for a purely harmonic signal and 1 for pure noise.
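Both of these descriptors follow directly once the harmonic amplitudes X_h, the spectral envelope SE_h sampled at the same harmonics, and the full spectrum X are available; a sketch:

```python
import numpy as np

def harmonic_spectral_deviation(X_h, SE_h):
    """Mean difference between the harmonic amplitudes and the
    spectral envelope sampled at the same harmonics."""
    return np.mean(X_h - SE_h)

def noisiness(X_h, X):
    """1 minus the share of the total spectral energy carried
    by the harmonics; 0 = purely harmonic, 1 = pure noise."""
    return 1.0 - np.sum(X_h ** 2) / np.sum(X ** 2)
```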

(38)

3.5 Perceptual Descriptors

As already seen before, the perceptual descriptors are linked to our perception of the signal and its characteristics (subsection 2.2.2).

Figure 3.16: Extraction of the perceptual descriptors

We chose to extract descriptors using both the mel scale and the Bark scale presented in subsection 2.2.2.

3.5.1 Mel-Frequency Cepstral Coefficients (MFCC)

The first perceptual descriptor to be extracted is the MFCC (Mel-Frequency Cepstral Coefficients). As the name suggests, this descriptor is based on the mel scale.

The MFCC represent the shape of the spectrum mapped onto mel bands, using very few coefficients.

The computation of the MFCC can be decomposed into five steps [20]:

1. computing the spectrum;
2. mapping the power spectrum onto mel bands;
3. taking the logarithm of all the mel-band energies;
4. taking the DCT (Discrete Cosine Transform) of the log mel-band energies;
5. keeping the first 12 coefficients (excluding the very first one).

Figure 3.17: Block diagram of the computation of MFCC

Spectrum computation: this process has already been discussed before (see section 2.2).
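A hedged end-to-end sketch of the remaining steps, assuming a power spectrum computed as above. The 24 triangular mel filters, the inverse mel conversion and the unnormalized DCT-II are our assumptions; the thesis computes the MFCC through its descriptor library (see chapter 5).

```python
import numpy as np

def hz_to_mel(f):
    """Mel conversion from section 2.2.2."""
    fc = 1000.0
    f = np.asarray(f, dtype=float)
    return np.where(f > fc, fc * (1.0 + np.log10(f / fc)), f)

def mel_to_hz(m):
    """Inverse of hz_to_mel (our derivation)."""
    mc = 1000.0
    m = np.asarray(m, dtype=float)
    return np.where(m > mc, mc * 10.0 ** (m / mc - 1.0), m)

def mfcc(power_spectrum, fs, n_bands=24, n_coeffs=12):
    """Steps 2-5 above: mel-band energies through triangular filters,
    log, DCT, keep 12 coefficients (dropping the 0th)."""
    n_bins = len(power_spectrum)
    freqs = np.arange(n_bins) * fs / (2.0 * (n_bins - 1))  # rfft bin freqs
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_bands + 2))
    energies = np.empty(n_bands)
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        weights = np.clip(np.minimum(rising, falling), 0.0, None)
        energies[b] = np.sum(weights * power_spectrum)
    log_e = np.log(energies + 1e-12)          # avoid log(0) on silent bands
    n = np.arange(n_bands)
    dct_basis = np.cos(np.pi / n_bands * np.outer(n, n + 0.5))  # DCT-II
    return (dct_basis @ log_e)[1 : n_coeffs + 1]
```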
