Representing Voices Using Convolutional Neural Network Embeddings

NIKLAS EMBRETSÉN

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Machine Learning

Date: July 4, 2019

Supervisor: Mats Nordahl

Supervisor at Storytel: Sathvik Katam

Examiner: Olov Engwall

School of Electrical Engineering and Computer Science

Host company: Storytel

Swedish title: Representation av röster med hjälp av inbäddningar från faltningsnätverk

Abstract

In today’s society, services centered around voices are gaining popularity. Being able to provide users with voices they like, in order to obtain and sustain their attention, is important for enhancing the overall experience of the service. Finding an efficient way of representing voices such that similarity comparisons can be performed is therefore of great use.

In the field of Natural Language Processing, great progress has been made using embeddings from Deep Learning models to represent words in an unsupervised fashion. These representations have been shown to capture the semantics of the words.

This thesis sets out to explore whether such embeddings, capturing similarities between different voices, can also be found for audio data, more specifically voices of audiobook narrators. For this purpose, two different Convolutional Neural Networks are developed and evaluated, both trained on spectrogram representations of the voices. One performs regular classification, while the other uses pairwise relationships and a loss function based on the Kullback–Leibler divergence, in an attempt to minimize the difference in output between similar pairs of samples and maximize it between dissimilar pairs. From these models, the embeddings used to represent each sample are extracted from the different layers of the fully connected part of the network during the evaluation.

Both an objective and a subjective evaluation are performed. During the objective evaluation of the models, it is first investigated whether the embeddings found are distinct for the different narrators, and whether the embeddings encode information about gender. The regular classification model is then further evaluated through a user test, as it achieved results an order of magnitude better during the objective evaluation. The user test sets out to evaluate whether the embeddings capture information about perceived similarity.

It is concluded that the proposed approach has the potential to be used for representing voices in such a way that similarity is encoded, although more extensive testing, research and evaluation are needed to confirm this. For future work it is proposed to perform more sophisticated pre-processing of the data and to collect and include data about relationships between voices during the training of the models.

Summary

In today’s society, the popularity of voice-based services is increasing. Being able to provide users with voices they like, in order to capture and hold their attention, is therefore important for improving the user experience. Finding an efficient way of representing voices, so that similarities between them can be compared, is therefore of great use.

Within the field of natural language processing in machine learning, great progress has been made by creating representations of words from the inner layers of neural networks, so-called neural network embeddings. These representations have been shown to contain the semantics of the words.

This thesis aims to investigate whether similar representations, in which similarity between voices is captured, can be found for audio data in the form of narrator voices from audiobooks. To investigate this, two convolutional neural networks that use spectrogram representations of voice data were developed and evaluated. One model is constructed as a regular classification model, trained to distinguish between narrators in the dataset. The other model uses pairwise relationships between the data points and an optimization function based on the Kullback–Leibler divergence, with the aim of minimizing the difference between similar pairs of data points and maximizing it between dissimilar pairs. From these models, representations from the different layers of the network are used to represent each data point during the evaluation.

Both an objective and a subjective evaluation method are used. During the objective evaluation, it is first investigated whether the representations found are distinct for different narrators, and then whether they capture information about the narrator’s gender. The regular classification model is also evaluated through a user test, since that model achieved results an order of magnitude better during the objective evaluation. The purpose of the user test was to investigate whether the representations found contain information about the perceived similarity between the voices.

The conclusion is that the proposed approach has the potential to be used for representing voices so that information about similarity is captured, but that more extensive testing, investigation and evaluation are required. For future studies, more sophisticated pre-processing of the data is proposed, as well as collecting and using data about the relationships between voices during the training of the models.

Acknowledgements

First of all I would like to thank Storytel and all my colleagues for being an awesome host company and making me feel part of the team right away. A special thanks goes to Sathvik Katam, my supervisor at Storytel, for providing me with all the data and information I needed as well as being a great sounding board throughout the work with this thesis.

I would also like to thank my academic supervisor, Mats Nordahl, for helping to shape this thesis and giving constructive feedback throughout this work.

Lastly, I would like to thank my examiner, Olov Engwall, for providing me with the final feedback and examining this thesis.

1 Introduction
1.1 Purpose
1.2 Problem Description
1.3 Research Question
1.4 Ethics, Sustainability and Social Aspects
1.5 Disposition

2 Background
2.1 Signal Processing
2.1.1 Sound
2.1.2 Speech
2.1.3 Audio Data
2.1.4 Features of Audio Data
2.2 Machine Learning
2.2.1 Supervised and Unsupervised Learning
2.2.2 Artificial Neural Nets
2.2.3 Neural Network Embeddings
2.3 Related Work

3 Methods
3.1 Project Overview
3.2 Resources
3.3 Data
3.3.1 Preprocessing of Data
3.4 Models
3.4.1 Network Architecture
3.4.2 Regular Classification Model
3.4.3 Pairwise Constraint Model
3.4.4 Hyperparameter settings
3.5.1 Evaluation

4 Results
4.1 Neighborhood Evaluation
4.1.1 Regular Model
4.1.2 Pairwise Constraint Model
4.2 User Test Evaluation

5 Discussion
5.1 Analysis of Results
5.2 Discussion of Methods

6 Conclusion
6.1 Conclusion
6.2 Future Work

Bibliography

A Raw Data

B Hyperparameter Grids

C Survey Results

1 Introduction

Audio- and voice-based services are gaining popularity in many fields of technology, and customer satisfaction is critical to remain competitive on the market. Captivating voices have the potential to make almost anything interesting and are important in contexts where it is desirable to obtain and sustain people’s attention.

Traditional approaches to audio recognition and audio classification, such as GMM-HMM models (see, e.g., [1]), have involved a lot of manual feature engineering, which requires considerable domain knowledge from the people developing the models. In recent years, however, Deep Learning (DL) (see, e.g., [2]) has been introduced to the problem area, which allows for sophisticated analysis of audio signals without requiring deep domain knowledge of signal processing. Instead of performing complex transformations of the audio signals to acquire different features, more general representations of the signals, such as spectrograms, can be used as a basis. This approach leaves the task of determining what characterizes a signal to the model itself. Not only does this make the area more accessible to people who lack deeper knowledge of signal processing, but it also reduces the risk of introducing human bias into the model, as manual feature engineering is usually driven by intuition. This methodology has been embraced especially in the area of computer vision, specifically in image recognition, following the introduction of Convolutional Neural Networks in the ImageNet competition in 2012 [3]. A convolutional neural network extracts increasingly high-level features as the image passes through the layers of the network, starting with straight and curved lines and moving on to more complex patterns (see, e.g., [2]).

This thesis will focus on the area of audio recognition and classification, more precisely on the human voice. However, instead of identifying a certain speaker, the aim is to identify groups of people who exhibit a similar way of speaking. This question is interesting to many fields providing audio and voice services, since the right voice has the potential to enhance the whole experience, whether it be audiobooks, presentations or tutorials.

1.1 Purpose

The purpose of this study is to investigate the possibility of automatically finding representations of audio recordings of voices from audiobooks that capture their characteristics. This question is interesting for actors on the voice-service market, as it would allow them to find good voices for their service more efficiently and/or to give good recommendations of new voices the users might enjoy, thus enhancing the overall experience. To achieve this, different ways of representing audio data will be explored, such that samples with similar attributes are close to each other while being far away from samples with dissimilar attributes. Finding such a representation would allow for efficient and simple analysis of the audio signals.

1.2 Problem Description

The problem investigated in this thesis is the task of grouping narrators together based on similar attributes. Manual classification and grouping of audio is not just time-consuming but also introduces human bias into the process, since the way different people perceive sounds and voices is subjective. The more traditional models for this purpose require manual feature engineering, which makes them somewhat limited. Manual feature engineering (see, e.g., [4]) can be a limitation in two ways: 1) it requires specific domain knowledge, and 2) it once again introduces human bias into the equation, since the choice of features can affect the end results.

Therefore, the aim of this thesis is to develop models that can automatically extract features and find representations of the audio data in an objective manner. The models will be developed and trained using the large amount of voice recordings from audiobooks available at the host company, Storytel, which provides a subscription-based streaming service for audiobooks. The representations found are evaluated by exploring the new space spanned by the representations.

To find a representation of the data that is suitable for this evaluation, training a regular classification model might not be optimal. The goal of a classifi[…] classes into account, which is a measurement that will be used during the evaluation. Thus, an approach using pairwise constraints between data points together with the Kullback–Leibler divergence [5] as an optimization criterion will be implemented and compared to a regular classification model.

1.3 Research Question

Given the problem description presented in Section 1.2, the question to be examined in this thesis is:

To what extent can Convolutional Neural Network embeddings be used to group voices based on similarity?

1.4 Ethics, Sustainability and Social Aspects

There are some ethical aspects of this work that need to be considered. Firstly, the model could be biased towards a particular type of voice, by only taking certain characteristics into account or by being fed a non-representative data set during training, and could therefore discriminate against some types of voices. This problem has been encountered in several fields of machine learning, such as image recognition [6] and systems for predicting criminality [7]. This also poses the risk of confining users to only certain types of speakers, and potentially certain types of content. In the case of fictional content this might not matter much, but for more informative content it could contribute to homogeneous information sharing instead of giving a diverse picture. This potential favouring of certain types of speakers could also contribute to bias in the hiring process for roles where the voice is part of the process.

Some positive effects this work could have are that it might reduce the repetitive work of manually labeling audio files with simple distinctions, such as male/female voices. It also has the potential to enable more sophisticated labeling of the audio data, which can be used to enhance the customer experience and make it more enjoyable. This is good from the perspective of both the service provider and the consumers.

An increase in digital book consumption could also reduce the number of physical paper books that need to be produced, which could save trees and slow down deforestation around the globe. Digital versions are also accessible anywhere there is an internet connection, which could potentially contribute to reducing the emission of greenhouse gases, as books do not need to be distributed across the globe to as great an extent. In this setting it is also worth mentioning the environmental aspects of deep learning (DL), which is usually a very resource-heavy procedure, requiring much energy to perform the computations. This issue was brought up in [8], where the authors explored different state-of-the-art models in Natural Language Processing (NLP) and approximated the resource usage for training and tuning these models. The authors showed that the training of Google’s BERT [9] was roughly equivalent to a trans-American flight for one person with respect to CO₂ emissions, which, when put into perspective, is a great deal considering all the research and development going on within DL.

With audiobooks becoming more popular, there is also a risk of people moving away from reading regular books due to the convenience of listening to them instead. This might be a problem especially for children, as they have not fully developed their reading skills, which could lead to problems later in life. On the other hand, improvements in this field could also lead to more books being consumed by more people, which in turn could move society forward through more information being shared.

1.5 Disposition

The rest of this thesis will be structured as follows. In Chapter 2, the background information necessary to understand the presented work will be introduced. Chapter 3 will present the resources used, followed by a description of the developed models, the conducted experiments and how the work was carried out. In Chapter 4, the results from the conducted experiments will be presented. Chapter 5 will discuss the methodology used and the presented results. Lastly, in Chapter 6, the conclusions drawn from the results will be presented, followed by proposed future research.

2 Background

This section will present the background theory necessary to understand this thesis. First, theory regarding audio and speech will be introduced followed by its digital representation and transformation of this data, which enables a more thorough analysis. This will be followed up by an introduction to machine learning, focusing on artificial neural nets and neural network embeddings, giving a foundation for understanding the presented work.

2.1 Signal Processing

Signal processing concerns the analysis of different physical phenomena, such as sound and different measurements (see, e.g., [10]). This section will focus on signals corresponding to sound and their digital representation, followed by different transformations of this data that enable interesting analysis of the signals.

2.1.1 Sound

Sound consists of pressure waves of air molecules that arise from forces compressing the air into more and less tightly packed regions. These fluctuations in air pressure cause our eardrums to vibrate. The vibrations are transmitted further through bones to the inner ear (cochlea), which is a spiral tube directly connected to the auditory nerve and can roughly be regarded as a filter bank for different frequencies. The cochlea is filled with liquid and has bundles of hair cells inside it, which respond to different frequencies depending on their position. The bundles of hair cells are responsible for turning the movements from the vibrations into electrical signals that are carried through the auditory nerve to the brain, where they are interpreted as sound (see, e.g., [10], [11]). As a sound is a fluctuation of a quantity (air pressure) in time, it can be described mathematically by its waveform.

2.1.2 Speech

For humans, speech is a way of communicating and involves the transmission of information between two actors, a speaker (transmitter) and one or more listeners (receivers) (see, e.g., [10]). This communication begins with the speaker forming some thoughts to express, which activates muscular movements in the speaker’s vocal tract that produce the sounds of speech as air is pushed through. The listeners receive these sound waves in the auditory system, which works as a decoder from the physical sound wave into neurological signals to the brain, where the received information is processed.

The speech apparatus can be divided into two components: phonatory organs and articulatory organs (see, e.g., [12]). The phonatory organs consist of the lungs and the larynx, whose responsibility is to create the voice source sounds. The sounds are produced by pushing air from the lungs through the larynx, which causes the vocal cords to vibrate. The phonatory organs control the pitch and loudness, as well as other prosodic patterns of the speech, by adjusting the tightness of the vocal cords and the flow of air through the larynx.

The articulatory organs are made up of different parts of the mouth, such as the tongue, lips, jaws and teeth. They are responsible for manipulating the sounds from the phonatory organs to generate additional sounds of the speech.

2.1.3 Audio Data

A convenient way of representing an analog audio signal is as a continuous function of time, $t$. A speech signal can therefore be represented by the function $x(t)$, whose variations over time correspond to the amplitude of the signal at different points in time. By sampling the signal $x(t)$ with a sampling period of $T$, a discrete representation of the signal can be obtained as $x[n] = x(nT)$, where the sampling rate (samples taken per second from the signal), $F_s$, of the digital signal is defined as $1/T$ (see, e.g., [10]).

A digital representation of the analog signal describing the sound is thus a sequence of numbers representing the amplitude of the signal at different discrete time steps, $nT$. In Figure 2.1 the waveform of a speech signal is presented.
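As a minimal sketch of the sampling relation $x[n] = x(nT)$, the following Python snippet samples a pure tone. The 5 Hz tone, the 100 Hz sampling rate and the function names are illustrative assumptions and not values used in this thesis.

```python
import math

def analog_signal(t, freq_hz=5.0):
    """Stand-in for the analog signal x(t): a pure tone (assumed for illustration)."""
    return math.sin(2.0 * math.pi * freq_hz * t)

def sample(signal, sampling_rate_hz, duration_s):
    """Return the discrete sequence x[n] = x(nT) with T = 1 / F_s."""
    period = 1.0 / sampling_rate_hz                      # sampling period T
    num_samples = int(round(duration_s * sampling_rate_hz))
    return [signal(n * period) for n in range(num_samples)]

x = sample(analog_signal, sampling_rate_hz=100.0, duration_s=1.0)
print(len(x))    # number of samples taken in one second
print(x[5])      # x(5T) = sin(2 * pi * 5 * 0.05) = sin(pi / 2)
```

The resulting list is exactly the "sequence of numbers" described above: one amplitude value per discrete time step.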

Figure 2.1: The waveform representation of a speech signal (left) and a snippet from the signal (right) to better illustrate its waveform.

It is hard to derive properties other than loudness, duration and periodicity of the sound from the wave shape presented in Figure 2.1. By analyzing the wave it is clear that the sound is made up of three equally spaced bursts of some sort, but what caused them cannot be deduced from this: was it three words or just a siren going off three times? To be able to analyze an audio signal further, more high-level features are needed to describe the signal, and for that the signal needs to be transformed. Through different transformations, different aspects of the audio data can be analyzed. Transformations mainly bring the analysis into two domains: the temporal and/or the frequency domain (see, e.g., [13]).

These kinds of analyses enable high-level features to be extracted from the audio signal, which in turn can be used for more sophisticated analysis of the sound. Features and transformations of audio signals are described more thoroughly below.

Nyquist-Shannon Sampling Theorem

When constructing a digital signal from an analog signal, the sampling rate (the rate at which samples are taken from the analog signal), $F_s$, and the bit depth (how many bits are used to represent each sample), which determines the resolution of the signal, are of great importance. To be able to capture all information of the original signal in its digitized version, the sampling rate has to be sufficiently high. The Nyquist-Shannon sampling theorem [14] states that the sampling rate has to be at least double the maximum frequency of the analog signal when digitizing it. In other words:

$$ F_s \geq 2 \times F_{\max} $$

to guarantee proper reconstruction of the original signal. As the hearing range of a human extends to roughly 20 kHz, the standard sampling frequency used when digitizing audio signals, to meet this criterion, is 44.1 kHz (see, e.g., [15]).
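What happens when the criterion is violated can be illustrated with a small sketch. The frequencies below (a 7 Hz and a 3 Hz cosine sampled at 10 Hz) and the helper names are assumptions chosen only to keep the numbers small.

```python
import math

def sample_cosine(freq_hz, sampling_rate_hz, num_samples):
    """Sample cos(2*pi*f*t) at t = n / F_s for n = 0, ..., num_samples - 1."""
    return [math.cos(2.0 * math.pi * freq_hz * n / sampling_rate_hz)
            for n in range(num_samples)]

def satisfies_nyquist(max_freq_hz, sampling_rate_hz):
    """Check the criterion F_s >= 2 * F_max."""
    return sampling_rate_hz >= 2.0 * max_freq_hz

fs = 10.0
tone_7hz = sample_cosine(7.0, fs, 20)   # violates the criterion (would need F_s >= 14 Hz)
tone_3hz = sample_cosine(3.0, fs, 20)   # satisfies the criterion

# The undersampled 7 Hz tone yields the same samples as a 3 Hz tone (aliasing),
# so the two signals cannot be told apart after digitization.
max_diff = max(abs(a - b) for a, b in zip(tone_7hz, tone_3hz))
print(satisfies_nyquist(7.0, fs), satisfies_nyquist(3.0, fs), max_diff < 1e-9)
print(satisfies_nyquist(20000.0, 44100.0))   # 44.1 kHz covers the 20 kHz hearing range
```

The sampled sequences coincide because 7 Hz lies as far above half the sampling rate (5 Hz) as 3 Hz lies below it.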

2.1.4 Features of Audio Data

Features of audio are different descriptive properties of an audio signal and include both physical (low-level) and psychoacoustic (high-level) attributes. All audio signals can be described in terms of:

• Duration - the time between the start and end of a signal,

• Loudness - the size of the changes in sound pressure levels (related to the energy of the signal),

• Pitch - the frequency of the signal, and

• Timbre - related to perceived sound quality and the most complex attribute of an audio signal. For example, it reflects the difference in signal between two instruments playing the same note, as well as the ability to distinguish between different categories of instruments.

(see, e.g., [15])

These attributes of audio data can be represented in digital format through different types of transformations of the audio signal, which will be described below. In general, these digital representations of the attributes of an audio signal are called audio features.

Transformations

Transformations of audio data are functions that, when applied to a signal, map the data from one domain into another domain. Perhaps the most popular transformation in signal processing is the Fourier Transform (FT) (see, e.g., [10]). The FT takes a signal and decomposes it into its frequency components, making it possible to analyze the signal’s components individually.

The FT is a transformation from the time domain into the frequency domain and can be interpreted as sliding a frequency window across the signal, measuring how much each frequency contributes to the signal. The FT of a signal $x(t)$, in the continuous case, is defined as:

$$ X(\omega) = \int_{-\infty}^{\infty} x(t) e^{-j\omega t} \, dt \tag{2.1} $$

where the signal, $x(t)$, gets transformed to a function of frequency, $X(\omega)$, by applying $e^{-j\omega t}$ to the signal. The exponential term, $e^{-j\omega t}$, can through Euler’s formula be written as:

$$ e^{-j\omega t} = \cos(\omega t) - j \sin(\omega t) \tag{2.2} $$

and describes a periodic motion through time $t$ with frequency $\omega$. Hence, as Equation 2.1 is integrated over time, the signal, $x(t)$, and $e^{-j\omega t}$ will tend to align if the frequency $\omega$ is present in the signal, yielding a higher value for the integral $X(\omega)$ and indicating the presence of $\omega$ in $x(t)$. The exponential term can thus be viewed as a frequency filter.

In the case of digital signals, however, the data is no longer continuous, as the digital signal is a finite set of numbers acquired through sampling of the original signal. In this setting the Discrete Fourier Transform (DFT) (see, e.g., [10]) can be used. The formula for the DFT is similar to that of the FT and is defined as:

$$ X(k) = \frac{1}{N} \sum_{n=0}^{N-1} x[n] e^{-j \frac{2\pi}{N} k n} \tag{2.3} $$

where $k$ represents the frequency bin number, $n$ the sample number and $N$ the total number of samples from the signal $x(t)$.

The frequency filters applied during the transformation are independent and together cover the full frequency spectrum of the signal. These filters can therefore capture the different frequency components of the signal, their amplitudes and their phases. This information can, for example, be used to construct spectrograms, a topic covered later in Section 2.1.4, as well as to reconstruct the signal through the inverse FT.

The Fast Fourier Transform (FFT) is an optimized algorithm for computing the DFT and is the implementation of the transform used in computers, reducing the time complexity from $O(N^2)$ to $O(N \log N)$ [10].
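To make the DFT concrete, the sketch below evaluates Equation 2.3 directly, including its $1/N$ normalisation. It is a naive implementation with the $O(N^2)$ cost that the FFT avoids, and the eight-sample test signal is an assumption chosen so that the result is easy to read.

```python
import cmath
import math

def dft(samples):
    """Naive O(N^2) DFT with the 1/N normalisation of Equation 2.3."""
    n_total = len(samples)
    spectrum = []
    for k in range(n_total):                       # frequency bin k
        acc = 0j
        for n, x_n in enumerate(samples):          # sample index n
            acc += x_n * cmath.exp(-2j * cmath.pi * k * n / n_total)
        spectrum.append(acc / n_total)
    return spectrum

# Eight samples of a cosine that completes exactly two cycles in the window.
N = 8
signal = [math.cos(2.0 * math.pi * 2 * n / N) for n in range(N)]
magnitudes = [abs(c) for c in dft(signal)]
print([round(m, 3) for m in magnitudes])
```

The magnitude is concentrated in bin 2 and its mirror bin $N - 2 = 6$, each with value 0.5, while all other bins are numerically zero. Note that library routines such as `numpy.fft.fft` omit the $1/N$ factor in the forward transform by default, so their output would differ from this sketch by a factor of $N$.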

The Mel Scale

The Mel scale is a logarithmic transformation of the linear frequency range. The mel values are obtained by applying the following function to the original frequency:

$$ m = 1125 \times \ln\left(1 + \frac{f}{700}\right) $$

where $m$ is the mel scale value and $f$ the original frequency. In Figure 2.2 mel is plotted as a function of frequency (Hz) and of $\log_2(\text{frequency})$. The Mel scale was invented as an attempt to mimic the way humans perceive sounds [16]. The cochlea in the inner ear acts as a frequency filter, and its complex mechanism results in the perception of sounds at different frequencies not being linear, hence the logarithmic nature of the curves. When looking at the base-two logarithm of the frequency, one step on the x-axis corresponds to doubling the frequency; this interval is referred to as an octave. In a musical setting, moving one octave results in the same letter note and therefore also the same pitch class (see, e.g., [15]). By looking at the right plot in Figure 2.2 it is quite easy to get an intuition of what the mel scale does: it makes a greater distinction between higher octaves. This is also how human hearing works: humans are better at distinguishing between higher pitches than lower pitches.

Figure 2.2: The mel scale plotted as a function of frequency (Hz) (left) and as a function of $\log_2(\text{frequency})$ (right).
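The mel formula translates directly into code. The sketch below implements it together with its inverse (obtained by solving the formula for $f$) and measures how wide successive octaves are on the mel scale. The function names and the chosen octave edges are illustrative assumptions.

```python
import math

def hz_to_mel(freq_hz):
    """Mel value for a frequency in Hz: m = 1125 * ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, obtained by solving the formula above for f."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)

# Width in mel of successive octaves (each step doubles the frequency).
octave_edges_hz = [110.0, 220.0, 440.0, 880.0, 1760.0, 3520.0]
octave_widths_mel = [hz_to_mel(hi) - hz_to_mel(lo)
                     for lo, hi in zip(octave_edges_hz, octave_edges_hz[1:])]
print([round(w, 1) for w in octave_widths_mel])
```

Each octave spans more mel than the one below it, which is the behaviour visible in the right-hand plot of Figure 2.2.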

Spectrograms

Spectrograms illustrate the energy distribution (amplitude) over the frequency spectrum of a signal over time and are therefore a suitable representation for performing time-frequency analysis of the signal (see, e.g., [10]). A spectrogram of the speech signal from Figure 2.1 is presented in Figure 2.3. In this diagram, brighter colors illustrate higher energy density.

Figure 2.3: The spectrogram representation of the speech signal from Figure 2.1.

A spectrogram is computed by performing a Short-Time Fourier Transform (STFT) on the signal. The idea behind the STFT is to compute the FT (or rather an FFT) over very short periods of time of the signal, also known as frames. By […] (frame) independently and stacking the result from each frame together results in a spectrogram. Spectrograms can also be constructed for scales other than frequency, for example the mel scale, by mapping the frequency bins onto this scale instead.

A spectrogram is more informative than the regular waveform as it still displays the temporal characteristics of the signal but also gives information about the contributing frequency components of the signal over time.
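The framing-and-stacking idea can be sketched in a few lines. The code below is an illustration under stated assumptions and not the preprocessing used in this thesis: it uses a naive DFT instead of an FFT, a Hann window, non-overlapping frames of 32 samples and a synthetic two-tone test signal.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 of one frame (naive DFT, for clarity)."""
    n_total = len(frame)
    mags = []
    for k in range(n_total // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * n / n_total)
                  for n, x in enumerate(frame))
        mags.append(abs(acc) / n_total)
    return mags

def spectrogram(samples, frame_length, hop_length):
    """Split the signal into frames, apply a Hann window to each frame
    and stack the per-frame magnitude spectra."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (frame_length - 1))
              for n in range(frame_length)]
    frames = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frame = [samples[start + n] * window[n] for n in range(frame_length)]
        frames.append(dft_magnitudes(frame))
    return frames            # frames[t][k]: time frame t, frequency bin k

# Test signal: a tone in bin 4 for the first half, bin 10 for the second half.
N_FRAME = 32
first = [math.sin(2.0 * math.pi * 4 * n / N_FRAME) for n in range(128)]
second = [math.sin(2.0 * math.pi * 10 * n / N_FRAME) for n in range(128)]
spec = spectrogram(first + second, frame_length=N_FRAME, hop_length=N_FRAME)
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spec]
print(peak_bins)
```

The strongest bin moves from 4 to 10 halfway through the signal, which is exactly the kind of change over time that is invisible in a single DFT of the whole signal but visible in a spectrogram.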

2.2 Machine Learning

Machine Learning (ML) (see, e.g., [2], [17]) is the technique of using statistical models to let computers learn certain tasks without explicitly giving the computer instructions on how to perform the task. This is usually done by exposing the model to data during training, from which the model infers statistical properties of the data. When the model is later exposed to unseen data, it can use the properties it learnt during training to make a decision about this new data.

This section first presents the two main types of tasks machine learning models are exposed to, supervised and unsupervised learning. Secondly, Artificial Neural Networks and a special kind of network, Convolutional Neural Networks, are introduced. Finally, neural network embeddings, a way of representing the data as vectors, are presented.

2.2.1 Supervised and Unsupervised Learning

The models in ML can be divided into two different groups: supervised and unsupervised models. Supervised models perform a task that is controlled by the user. A supervised model is exposed to labeled data, for example images, from which it learns how to distinguish between the different labels, for example cats and dogs.

Unsupervised learning algorithms process unlabeled data and find similarities based on statistical properties of the data. This method is very powerful for exploratory data analysis, as it can find hidden structures in the data without requiring any sort of guidance.

Figure 2.4: The structure of a feed-forward neural network, with one input layer, a number of hidden layers and one output layer. The black arrows in the figure represent the connections between units and the weights (parameters) of the network.

2.2.2 Artificial Neural Nets

Artificial Neural Nets (ANNs) are a programming model that tries to mimic the behavior of the human brain and, in broad terms, consist of many interconnected neurons that compute responses based on inputs. The term neural network originates from as far back as the 1940s and was a first attempt to describe the human brain in a mathematical way [18][19].

The first attempt to mimic this architecture in a computer was the development of the perceptron in 1958, a single-layer network capable of binary classification [20]. However, a single-layer perceptron is only capable of finding linear decision boundaries, which is a huge limitation for more complex tasks where data needs to be separated in a non-linear manner [21]. This gave rise to more sophisticated models where several layers were connected, introducing the Multi-Layer Perceptron (see, e.g., [22]), the predecessor of the feed-forward networks that are used today.

Feed-forward Neural Networks

Feed-forward neural networks are a special type of ANN where the connected neurons do not form loops and the data is therefore fed forward through the network.

The structure of a regular feed-forward neural network is presented in Figure 2.4. The network consists of one input layer, one output layer and zero or more hidden layers. The introduction of multiple hidden layers, making the previously shallow networks deeper, gave rise to the term Deep Learning.

A feed-forward neural net can be seen as multiple linear combinations of the input and the weights (neurons in the different layers), combined with non-linear activation functions. Mathematically, the forward pass for a feed-forward network with $k$ hidden layers can be expressed as:

$$ \vec{y} = \sigma(W_k \cdot \sigma(W_{k-1} \cdot \sigma(\dots \sigma(W_1 \cdot \vec{x} + \vec{b}_1)) + \vec{b}_{k-1}) + \vec{b}_k) \tag{2.4} $$

where $\vec{x}$ is the input vector, $\vec{y}$ the output vector, $\sigma(\cdot)$ a non-linear activation function and $W_i$ and $\vec{b}_i$ the weight matrix and bias vector for hidden layer $i$ in the network.

Activation Function

Activation functions are functions that are applied to the output of each neuron in the network to break the linearity, enabling the network to create complex decision surfaces. Today the default recommendation for the activation function is the rectified linear unit (ReLU) (see, e.g., [2]), defined simply as:

$$ \text{ReLU}(x) = \max(0, x) \tag{2.5} $$

One great property of the ReLU activation function is that it yields sparse activations in the network, as all values less than zero will output zero when ReLU is applied (see, e.g., [23]).
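Combining the forward pass of Equation 2.4 with the ReLU activation of Equation 2.5, a feed-forward network can be sketched as follows. The 2-3-2 layer sizes and the hand-picked weights are hypothetical and serve only to make the arithmetic easy to follow.

```python
def relu(vector):
    """Element-wise rectified linear unit: max(0, x) for every entry."""
    return [max(0.0, v) for v in vector]

def affine(weights, bias, vector):
    """Compute W . x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def forward(layers, x, activation=relu):
    """Apply activation(W_i . h + b_i) layer by layer, as in Equation 2.4."""
    h = x
    for weights, bias in layers:
        h = activation(affine(weights, bias, h))
    return h

# Hypothetical 2-3-2 network with hand-picked parameters.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 1.0, -1.0]),   # W_1, b_1
    ([[1.0, 0.0, 1.0], [0.0, -1.0, 1.0]], [0.5, 0.0]),            # W_2, b_2
]
y = forward(layers, [2.0, 1.0])
print(y)
```

For the input (2, 1) the hidden layer evaluates to (1, 2.5, 0) and the output to (1.5, 0): in both layers one unit is clipped to zero by ReLU, a small instance of the sparse activations mentioned above.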

Training of the Network

The task of a feed-forward network is to approximate some arbitrary function:

$$y = f(x; \theta) \qquad (2.6)$$

where $x$ is the input data, $y$ the output (labels) and $\theta$ represents the parameters of the network, i.e. the weights and biases of the connections between the nodes in the network (e.g, [2]).

The values of the network's weights, $\theta$, are approximated through two operations: 1) the forward pass, where the input is sent through the network to generate an output (equation 2.4), and 2) the backward pass, where the learning happens. The backward pass is also known as backpropagation [22][24] and begins by computing the error of the output, by comparing it to the true value (label) using a pre-determined cost function. After this, gradient descent is performed, which computes the gradient of the cost function with respect to each parameter $w_i, b_i \in \theta$ in the network and updates the value of each connection in the direction opposite to its gradient, in order to minimize the error.

The cost function of an ANN is what determines the behaviour of the network and it needs to be selected to suit the task the network is expected to perform. The most commonly used cost function for classification is the cross-entropy loss function:

$$\text{cross\_entropy}(y, \hat{y}) = H(y, \hat{y}) = -\sum_{i=1}^{C} y_i \times \log \hat{y}_i$$

where $C$ is the number of classes, $y$ is the true class and $\hat{y}$ the prediction from the model. For classification $y$ is a binary $C$-dimensional vector with element $i$ set to 1 if the sample comes from class $i$ and all other entries set to 0. The prediction, $\hat{y}$, is also a $C$-dimensional vector and can be viewed as a discrete probability distribution over all classes. The probability distribution is usually obtained by applying the Softmax function to the network's output, defined as:

$$\text{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{C} \exp(x_j)}$$

which normalizes the output vector to create a valid probability distribution. It is also a continuous and differentiable function, allowing for backpropagation to be performed (e.g, [2]).
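The two definitions above can be combined in a few lines. The following is a sketch in plain Python with made-up logits for a three-class problem; subtracting the maximum before exponentiating is a common numerical safeguard added here, not something stated in the text.

```python
import math


def softmax(logits):
    """Map raw network outputs to a probability distribution."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred):
    """H(y, y_hat) = -sum_i y_i * log(y_hat_i); terms with y_i = 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


# Hypothetical 3-class example: the sample belongs to class 0.
logits = [2.0, 1.0, 0.1]
y_true = [1.0, 0.0, 0.0]
y_pred = softmax(logits)
loss = cross_entropy(y_true, y_pred)
```

Because the label vector is one-hot, the loss reduces to the negative log-probability that the model assigns to the correct class.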

Cross-entropy is closely related to Entropy, which is a measurement of uncertainty, defined as:

$$\text{Entropy}(P) = H(P) = -\sum_{i} P(i) \log P(i)$$

which gives the theoretical minimum average encoding size for the events of a probability distribution P . The expression for entropy looks identical to that of cross-entropy except that cross-entropy has two different distributions. To better illustrate the difference they can be expressed as expectations:

$$H(P) = \mathbb{E}_{x \sim P}[-\log P(x)] \quad \text{and} \quad H(y, \hat{y}) = \mathbb{E}_{x \sim y}[-\log \hat{y}(x)]$$

The cross-entropy thus gives the realized average encoding size since it is a weighted average over the true distribution of events ($y$) using the estimated distribution (prediction) $\hat{y}$ for the encoding size. However, if the two distributions are the same it is easy to see that the cross-entropy is equal to the entropy. Since the entropy gives the theoretical minimum average encoding size, the minimum of the cross-entropy is when the two distributions are the same. This means that by optimizing the network to minimize the cross-entropy function the network minimizes the difference between its predictions and the true labels.
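The relationship between entropy and cross-entropy can be checked numerically. The sketch below uses two hypothetical distributions over three events; the values are chosen only for illustration.

```python
import math


def entropy(p):
    """H(P) = -sum_i P(i) log P(i), skipping zero-probability events."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(P, Q) = -sum_i P(i) log Q(i): events follow P, code lengths follow Q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over three events.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

gap = cross_entropy(p, q) - entropy(p)   # extra cost of encoding with the wrong distribution
same = cross_entropy(p, p) - entropy(p)  # no extra cost when the distributions match
```

Here `gap` is strictly positive while `same` is zero, which is the property that makes the cross-entropy usable as a loss.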

One problem with deep feed-forward networks is their computational complexity. In a regular feed-forward network each input unit is connected to each output unit which means that the number of parameters (connections between units) in a network drastically increases as more layers and/or more nodes are added. To reduce the number of parameters the model has to learn there are several methods, one of which is a special kind of feed-forward network, called a Convolutional Neural Network, which is described below.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a special kind of feed-forward network with an architecture that makes them very suitable for processing data that has a natural grid-like topology, such as time-series data and images (see, e.g, [2]). The network first appeared in [25][26], as an approach to improve handwritten digit recognition. In the last decade it has entered various fields of ML with great success, starting with computer vision where it outperformed state-of-the-art techniques in the ImageNet challenge 2012 [3] on classifying images.

Figure 2.5: An example of the general structure of a CNN.

A general architecture of a CNN is presented in Figure 2.5 and consists of two different parts: 1) the convolutional part, where the network learns the features of the input data, and 2) the fully connected part, a regular feed-forward network, that performs the classification.

Figure 2.6: Two steps of the convolutional operation of a CNN. The red grid represents the feature map, which is created from sliding the kernel (shaded area) over the input (green grid). Picture adapted from [27].

The convolutional part of a CNN usually consists of two different types of layers: a convolutional layer and a pooling layer, which will be described below.

The "convolutional" in the name comes from the fact that the network applies the mathematical operation convolution on the data. In general a convolution is an operation on two functions that creates a third function and is defined as the following integral:

$$s(t) = (x * w)(t) = \int x(a)\, w(t - a)\, da \qquad (2.7)$$

for continuous functions (e.g. [2]). To use terminology that is consistent with what is used in ML, $s(t)$ is called a feature map, the function $x$ the input and the function $w$ the kernel. In ML the input and kernels are usually multidimensional arrays, which means that the convolution can be performed over more than one axis at a time. In a CNN the integral is changed to a summation as the data is no longer continuous. The convolution operation used in a CNN can thus be expressed as:

$$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n) \qquad (2.8)$$

where the input $I$ for example could be a 2D image, the kernel $K$ a 2D array and the feature map $S$ the 2D array resulting from applying the kernel on the image as a sliding window. This process is illustrated in Figure 2.6. As the filters of a CNN slide over the input they are able to capture structures and patterns in the input. For an image this could for example be edges, lines and curves which can be thought of as the features of the image. These features become more complex in the deeper layers of the CNN as more low-level features get combined to form the high-level features of the input.
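Equation 2.8 translates almost directly into code. The following plain-Python sketch applies a kernel without padding and with a stride of one; the toy image and the edge-detecting kernel are hypothetical and only serve to show how a feature map responds to a pattern in the input.

```python
def conv2d(image, kernel):
    """Equation 2.8 without padding: S(i, j) = sum_m sum_n I(i + m, j + n) K(m, n)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n] for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 3x4 input with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A 2x2 kernel that responds to a dark-to-bright transition from left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, kernel)
```

The feature map is non-zero only in the column where the window straddles the edge, which is the sense in which a kernel "detects" a feature.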

Pooling layers are used for downsampling the output after a convolution has been applied on the data. The downsampling works in a similar manner as the convolution operation, where a filter is applied as a sliding window over the output matrix, as is illustrated in Figure 2.6, and then an operation on the elements covered by the filter is performed. The most common pooling operation is the so called max-pooling, which only extracts the maximum value in the current window and discards the rest, thus reducing the size of the data before propagating it further through the network (see, e.g, [28]).
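Max-pooling can be sketched in the same sliding-window style. The function below is an illustration in plain Python with a hypothetical 4x4 feature map; the window size and stride are parameters so that other configurations can be tried.

```python
def max_pool2d(feature_map, size, stride):
    """Slide a size x size window over the map and keep only the maximum of each window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 4x4 feature map, pooled with a 2x2 window and stride 2.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool2d(feature_map, size=2, stride=2)
```

With a stride equal to the window size the windows do not overlap, so each dimension of the output is half that of the input.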

As was mentioned in Section 2.2.2, the computational complexity of deep neural nets can be problematic. This is something the structure of a CNN overcomes by introducing sparse interactions and parameter sharing. Whereas a regular ANN needs separate parameters to describe the interactions between all the input and output units, a CNN only needs parameters for the interaction of the kernels and the input. As the kernels are usually one or several orders of magnitude smaller than the input this means that the total number of parameters in the network is greatly reduced (see, e.g, [2]).

2.2.3 Neural Network Embeddings

Neural network embeddings are vectors extracted from one layer of an ANN, which have gained a lot of popularity in the area of Natural Language Processing (NLP) with the introduction of Word2Vec, by Mikolov and others [29]. Embeddings are a continuous vector representation of a discrete variable, which can be achieved by training a deep learning model to perform some sort of classification and then extracting the produced output from the neurons at any layer in the network. This new vector can be used to represent the original data point in a space spanned by the embeddings. Since the network is optimized to map samples from the same class to the same output (label), these representations tend to be close in the space spanned by the embeddings for similar data samples, which opens up for interesting analysis. In NLP the embeddings have been able to capture the semantics of the language and through linear algebra operations on the vector representations of words it has been possible to derive other words. Examples of such relationships are:

$$\text{queen} = (\text{king} - \text{man}) + \text{woman}$$

where you get the embedding for queen by performing vector arithmetic on the embeddings for king, man and woman. Other such properties are: connecting capitals to their countries, famous people to their occupation and more (see, e.g, [30]).

The vector representation of the data can be viewed as data points in a d-dimensional space, and thus a common use of the embeddings is to look at their nearest neighbors when analyzing the results (see, e.g, [31]). In order to find the nearest neighbors an evaluation metric of the distance between embeddings must be decided. In NLP, where documents can be represented as a vector with each dimension representing the frequency of a word within the document, it makes sense to use the cosine distance between documents as it measures the angle between the vectors. In this setting the angle between two vectors would indicate if the proportion between the words in the documents is similar or not, and the magnitude of the vector is therefore not important. In other contexts, such as cluster analysis, where the magnitude of the vector is also important the euclidean distance between the vectors can be used.
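The difference between the two metrics is easiest to see on vectors that point in the same direction but differ in length. The sketch below uses two hypothetical word-count vectors; the numbers are invented for illustration.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors; sensitive to magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cos(angle between u and v); ignores the magnitude of the vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical word-count vectors: doc_b uses the same words as doc_a, ten times as often.
doc_a = [1.0, 2.0, 3.0]
doc_b = [10.0, 20.0, 30.0]

d_cos = cosine_distance(doc_a, doc_b)
d_euc = euclidean_distance(doc_a, doc_b)
```

The cosine distance is (numerically) zero because the word proportions are identical, while the euclidean distance is large because the longer document has much higher counts.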

2.3 Related Work

This section gives an overview of related work that has been done within the field of audio processing, with a focus on clustering of audio signals based on speech characteristics and deep learning approaches to audio classification. The presented work will give a foundation for the methodology and conducted experiments that will be introduced in the following chapter.

Székely et al. [32] present a novel approach for clustering voices together based on the voice quality parameters of the Liljencrants-Fant acoustic model of the glottal source [33]. From these parameters the mean and variance over short utterances are used as feature vectors. The data used in this study was a 50 minute long recording of an audiobook with one single narrator. The recording was divided up into smaller segments based on pause detection, with an average length of 1.6 seconds. For each of these segments the features described above were calculated for every 10 ms window and the segments were clustered based on these feature vectors, with the goal to identify and cluster different expressive speech styles of the narrator. To assess the performance of the clustering an A/B-test was conducted where the participants were presented a reference audio sample together with two more samples, one from the same cluster and one from another cluster. The participants were then asked to pick the sample which they perceived to be most similar to the reference audio sample. The results showed that the presented approach was successful in separating sentences associated with different styles of speech. This work presents an interesting approach to how the evaluation of the clustering results,

Lukic et al. [34] use a CNN for speaker classification and clustering. The CNN is trained on the surrogate task of classifying different speakers from spectrogram representations of speech from the TIMIT data set [35], using both the full data set (670 different speakers) and a subset of the data (100 speakers). After acquiring 97% accuracy on classification of the speakers, clustering of the speakers is performed. By using the activation layers in the fully connected part of the CNN architecture as feature vectors they manage to reach the same level of performance as that of traditional models using manually engineered features (see, e.g, [36]) in the task of clustering voices together, without the need for handcrafting the features.

In [37] the same authors improved their approach and managed to tie with the previous model while using only a fraction of the data for training. Instead of training the CNN to identify all the different speakers in the corpus, a novel training approach is presented where a binary pairwise constraint (same or different speaker) is used to teach the network to further separate dissimilar voices. To increase the distance between samples from different speakers while decreasing the distance between samples from the same speaker, each embedding is treated as a probability distribution and Kullback-Leibler (KL) divergence [5] is used to calculate the loss.

This approach was inspired by the work of Hsu et al. [38], where the authors used it to construct an ANN as an end-to-end unsupervised clustering model. In regular clustering methods there are usually two different steps required: 1) feature extraction (construction of feature space) and 2) clustering based on similarity/distance in this constructed feature space. Both of these steps open up for the introduction of human bias in the model, a problem that is removed in their proposed approach.

These papers, in combination with the recent use of embeddings in NLP (e.g, [29], [30]), illustrate the potential of ANNs to find informative representations (embeddings) of the data in an unsupervised fashion, requiring no domain knowledge. The use of CNNs on spectrogram representations of the audio data has the potential to extract features that are unknown to humans. By using embeddings from the trained network to represent each audio sample, representations of samples with similar attributes could potentially be found, similar to how word embeddings contained the semantics of the words. As evaluating similarity between voices is a subjective task, standard cluster analysis would not be able to determine the success of the models. Therefore the A/B-test performed in [32] can be adopted during the evaluation of the work presented in this report, enabling us to answer the stated research question.

Methods

This chapter gives a detailed description of the experiments conducted in this study. First an overview of how the experiments were carried out is presented.

Secondly the resources used for the experiments are described, followed by a presentation of the data set and the preprocessing steps of said data. This is followed by a description of the models derived, their architecture and the settings of the hyper-parameters. Finally the method used for evaluating the embeddings, created by the models, is presented.

3.1 Project Overview

In Figure 3.1 the approach taken in this study is presented. The practical work consisted of 4 main parts, which will be briefly described below and then presented more thoroughly throughout this chapter.

1. Train Models

The first step was to develop and train two CNNs, one performing regular classification of audio samples using the cross-entropy loss function. The other model used pairwise constraints between data points and a Kullback–Leibler divergence (KL) based loss function during training.

2. Create Embeddings

Once the models were trained, embeddings from different depths of the fully connected part of the networks were extracted for each sample, which were used to represent the audio sample in a high dimensional space.

3. Evaluate Embeddings

The evaluation of the embeddings was performed in two different ways: an objective evaluation and a subjective evaluation. The objective evaluation was performed through examining the neighborhood of samples in the embedding spaces. The subjective evaluation was performed through a user test, where participants were asked to select the narrator they perceived to be most similar to a reference narrator out of two choices: the narrator that was closest to the reference narrator in the embedding space and the one furthest away.

Figure 3.1: An illustration of how the work was carried out.

3.2 Resources

For the experiments, training of the models was performed on Google Cloud Platform (GCP) by deploying the models for training on the ai-platform (https://cloud.google.com/ai-platform/). The training was done using the STANDARD_1 scale tier on the platform, giving access to one master instance and four worker instances, each with 8 virtual CPUs and 7.20 GB of RAM.

Once training was completed the embedding extraction and evaluation were run locally on a Macbook Pro using Python. All the details about the machine and the packages used are presented in Table 3.1.

Table 3.1: Hardware and software resources used to run the experiments

Computer               A Macbook Pro (15-inch, 2018)
Operating System       MacOS Mojave version 10.14.2
Processor              2.2 GHz Intel Core i7
Memory                 16 GB 2400 MHz DDR4
Programming Language   Python 3.7.1
Python packages        tensorflow 1.13.1
                       matplotlib 3.0.2
                       scikit-learn 0.20.1
                       numpy 1.15.4

3.3 Data

The data used in the experiments are snippets from audiobooks provided by the host company. Snippets from a total of 100 audiobooks were used for training the models, featuring 100 different narrators, equally distributed between males and females. The snippets are consecutive two-second sequences from a three-minute sample of each audiobook, resulting in 90 snippets per book, which gave a total of $100 \times 90 = 9000$ data points. The audio files are single channel (mono) MP3 files with a sampling frequency of 44.1 kHz and a bitrate of 64 kbit/s.

3.3.1 Preprocessing of Data

Each snippet of audio was first run through a pre-emphasis step, which amplifies high frequency components with respect to low frequency components in order to reduce noise in the audio snippets. Pre-emphasis of the signal was calculated as follows:

$$y[n] = x[n] - 0.97 \cdot x[n - 1]$$

where $y$ represents the pre-emphasized signal and $x$ the original signal. As low frequency components of the signal tend to change more slowly than high frequency components this procedure can be interpreted as removing this slow change from adjacent signal components, thus enhancing the more rapidly changing high frequency components.

Figure 3.2: An illustration of STFT on a signal and the meaning of the parameters n_hop and n_FFT.
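The effect of the pre-emphasis filter can be demonstrated on two toy signals. This is a plain-Python sketch; leaving the first sample unchanged is an assumption made here, since the formula does not define a value before the start of the signal.

```python
def pre_emphasis(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n - 1]; the first sample is kept unchanged (an assumption)."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


# Hypothetical signals: a constant (slowly changing) one and a rapidly alternating one.
slow = [1.0, 1.0, 1.0, 1.0]
fast = [1.0, -1.0, 1.0, -1.0]

slow_out = pre_emphasis(slow)  # after the first sample the values shrink to about 0.03
fast_out = pre_emphasis(fast)  # after the first sample the magnitudes grow to about 1.97
```

The constant signal is almost removed while the alternating one is nearly doubled, which matches the interpretation of the filter as suppressing slow change between adjacent samples.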

After pre-emphasis the resulting signals were put through a Short-Time Fourier Transform (STFT) to construct the mel-spectrogram representation of the signals. For the STFT the settings presented in Table 3.2 were used, inspired by the work in [34].

Table 3.2: The settings used for the STFT

Parameter   Value
F_s         44.1 kHz
n_FFT       1024
n_hop       160

The n_FFT parameter dictates how many samples to use for each window when computing the Fast Fourier Transform (FFT), n_hop determines how many samples the window is moved between consecutive frames (and thereby the amount of overlap between consecutive windows) and $F_s$ is the sampling rate used when sampling the signal. An illustration of this process and the meaning of the parameters is presented in Figure 3.2. These settings yielded a $128 \times 552$ spectrogram representation of each two second snippet, which can be derived as follows:

$$\frac{2 \times F_s}{\text{n\_hop}} = \frac{2 \times 44100}{160} \approx 552$$

as for a 2 second snippet there will be $2 \times F_s$ samples and the window will be moved forward 160 (n_hop) samples at a time. The height of 128 comes from the fact that 128 bins were used for mapping the frequency components from the STFT onto the mel scale.
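The division above does not give an integer, so the exact width depends on how the STFT implementation frames the signal. The arithmetic below shows one framing convention that yields 552; that convention (a frame at every hop position, including position 0, with the signal padded at the edges) is an assumption and is not stated in the text.

```python
sample_rate = 44100   # F_s in Hz
duration = 2          # seconds per snippet
n_hop = 160           # samples between the starts of consecutive windows

n_samples = duration * sample_rate   # 88200
ratio = n_samples / n_hop            # 551.25
# Assumption: the signal is padded so that one frame is centred on every hop
# position, including position 0, which gives 1 + floor(ratio) frames.
n_frames = 1 + n_samples // n_hop    # 552
```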


After the spectrograms were constructed dynamic range compression was applied on the resulting matrices by applying the element-wise function:

$$f(x) = \log(1 + x \cdot 10^{4})$$

which reduces loud sounds and amplifies low sounds of the signal, as is suggested in [34]. Lastly the matrix is normalized column wise, to get zero mean and a standard deviation of one for each time step in the spectrogram. This procedure of normalizing data is performed to speed up the convergence of the models (see, e.g, [39]).
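The two post-processing steps can be sketched as follows. The code is plain Python and operates on a tiny hypothetical matrix; using the population standard deviation and guarding against constant columns are assumptions, as the text does not specify these details.

```python
import math


def compress(spectrogram, c=1e4):
    """Element-wise dynamic range compression f(x) = log(1 + c * x)."""
    return [[math.log(1.0 + c * x) for x in row] for row in spectrogram]


def normalize_columns(matrix):
    """Give every column (time step) zero mean and unit standard deviation."""
    columns = list(zip(*matrix))
    normalized = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        std = std if std > 0 else 1.0  # guard for constant columns (an assumption)
        normalized.append([(x - mean) / std for x in col])
    return [list(row) for row in zip(*normalized)]


# Hypothetical 3 x 2 "spectrogram": rows are frequency bins, columns are time steps.
spec = [
    [0.0001, 0.5],
    [0.0100, 0.1],
    [1.0000, 0.0],
]
out = normalize_columns(compress(spec))
```

After compression the four orders of magnitude in the first column are reduced to a much narrower range, and after normalization every time step has comparable scale.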

3.4 Models

The models were built using the machine learning library TensorFlow [40] developed at Google. The models are built as identical CNNs where only the cost function used differs.

3.4.1 Network Architecture

The network architecture used for both models is presented in Table 3.3. The inputs to this network were the $128 \times 552$ matrix representations of the spectrograms, which were described in Section 3.3.1. The network had 2 convolutional layers consisting of 32 and 64 kernels of size $4 \times 4$ respectively, all using the ReLU activation function. The kernels used a stride of 1, meaning that they were slid one step at a time across the data grid. After each convolutional layer max-pooling was performed, also using a kernel of size $4 \times 4$ but this time with a stride of 2, reducing the size of each dimension of the data by approximately half. After the second max-pooling was performed the data was flattened to a $1 \times 261120$ vector, which was then fed to the fully connected part, consisting of a 3-layer feed-forward network where the activation function ReLU was applied to the output from the two hidden layers FC1 and FC2. Finally the Softmax function was applied in the 100-dimensional output layer, yielding a probability distribution across the 100 different narrators in the data set for each sample.
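The size of the flattened vector can be reproduced with a short calculation. The padding scheme of the layers is not stated in the text; the sketch below assumes that the convolutions preserve the spatial size ("same" padding) and that the pooling windows are applied without padding ("valid"), which is one combination that yields exactly 261120.

```python
import math


def conv_same(size, stride=1):
    """Output length of a convolution with 'same' padding."""
    return math.ceil(size / stride)


def pool_valid(size, window=4, stride=2):
    """Output length of a pooling window applied without padding."""
    return (size - window) // stride + 1


height, width = 128, 552              # input spectrogram
for _ in range(2):                    # Conv1/Pool1, then Conv2/Pool2
    height, width = conv_same(height), conv_same(width)
    height, width = pool_valid(height), pool_valid(width)

channels = 64                         # number of kernels in Conv2
flattened = height * width * channels
```

Under these assumptions the spatial size shrinks from 128 x 552 to 63 x 275 and then to 30 x 136, and 30 x 136 x 64 gives the 261120 inputs of FC1.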

3.4.2 Regular Classification Model

The regular classification model had the structure presented in Table 3.3 and used the cross-entropy loss function, defined in Chapter 2, and the Adam [41] optimizer. This model was thus trained on classifying each snippet to the corresponding narrator, yielding a classifier with 100 classes. The output of this network, after the Softmax function has been applied, can be interpreted as a 100 dimensional probability vector where each entry represents the probability of a sample belonging to a certain narrator.

Table 3.3: The network architecture used for both models.

Layer   N    Kernel Size     Stride   Activation Function
Conv1   32   4 × 4           1        ReLU
Pool1        4 × 4           2
Conv2   64   4 × 4           1        ReLU
Pool2        4 × 4           2
FC1          261120 × 200             ReLU
FC2          200 × 150                ReLU
Out          150 × 100                Softmax

3.4.3 Pairwise Constraint Model

The pairwise constraint (PWC) model used the same network structure as the regular model (presented in Table 3.3) but was optimized on a different cost function, still using the Adam optimizer. This model used pairwise constraints between all data points instead of calculating the cross-entropy loss used for classification. For each batch, $\binom{n}{2} = \frac{n(n-1)}{2}$ pairs, where $n$ is the batch size, were constructed as $(i, j, r)$ tuples, where $i$ and $j$ represent the indices of the two data points in the batch and $r$ is an integer describing the relationship between the data points, where:

$$r = \begin{cases} 0 & \text{if } i \text{ and } j \text{ are from different narrators} \\ 1 & \text{if } i \text{ and } j \text{ are from the same narrator} \end{cases}$$

Between all the $\binom{n}{2}$ pairs in a batch a Kullback-Leibler (KL) divergence based loss function is used, which uses the pairwise relationships between the data points. In this setting the embeddings of pairs of points are treated as two discrete probability distributions, $\mathbf{P}$ and $\mathbf{Q}$. The task of the loss function is, through the use of KL-divergence, to minimize the difference of the probability distributions (embeddings) between similar points and maximize it between dissimilar points. The loss function, introduced in [38], is defined as follows:

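$$L(\mathbf{P}, \mathbf{Q}) = l(\mathbf{P} \| \mathbf{Q}) + l(\mathbf{Q} \| \mathbf{P})$$

where:

$$l(\mathbf{P} \| \mathbf{Q}) = I_s \cdot KL(\mathbf{P} \| \mathbf{Q}) + I_{ds} \cdot \max(0, \text{margin} - KL(\mathbf{P} \| \mathbf{Q}))$$

and:

$$I_s = \begin{cases} 1 & \text{if } r = 1 \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad I_{ds} = \begin{cases} 1 & \text{if } r = 0 \\ 0 & \text{otherwise} \end{cases}$$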

The KL-divergence, $KL(\mathbf{P} \| \mathbf{Q})$, is a measure of how well the distribution $\mathbf{Q}$ approximates the true distribution, $\mathbf{P}$, and is defined as:

$$KL(\mathbf{P} \| \mathbf{Q}) = \sum_{i} p_i \cdot \log\left(\frac{p_i}{q_i}\right), \qquad p_i \in \mathbf{P},\; q_i \in \mathbf{Q}$$

which is very similar to the cross-entropy, defined in Chapter 2, with the only difference being the term in the logarithm of the function. The function can be rewritten as:

$$KL(\mathbf{P} \| \mathbf{Q}) = \sum_{i} p_i \cdot \log\left(\frac{p_i}{q_i}\right) = -\sum_{i} p_i \log q_i + \sum_{i} p_i \cdot \log p_i$$

which is the cross-entropy minus the entropy, i.e. how well $\mathbf{Q}$ approximates $\mathbf{P}$ as mentioned above. Further, the KL-divergence is asymmetric ($KL(\mathbf{P} \| \mathbf{Q}) \neq KL(\mathbf{Q} \| \mathbf{P})$), which is the reason the loss function is formulated the way it is, calculating the loss both with respect to $\mathbf{P}$ and $\mathbf{Q}$. The second term in $l(\mathbf{P} \| \mathbf{Q})$ is the hinge loss where margin sets an upper limit on the maximum difference between $\mathbf{P}$ and $\mathbf{Q}$ to use in the loss.

As the relationship is determined between each pair of points this would result in each data point being fed forward $n - 1$ times per batch. To avoid this, and thus reduce the computational complexity, the enumeration of all pairs occurs only in the cost layer of the model, which enables the cost to be properly calculated while each data point is only fed forward once. This process, adopted from [38], is illustrated in Figure 3.3.
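The loss and the pair enumeration can be sketched together in plain Python. This is an illustration rather than the TensorFlow implementation used in this work: the network outputs are hypothetical, they are assumed to be strictly positive distributions, and summing (rather than averaging) over the pairs is an assumption.

```python
import math
from itertools import combinations


def kl(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def directed_loss(p, q, same, margin=2.0):
    """l(P || Q): plain KL for similar pairs, hinge on the margin for dissimilar pairs."""
    divergence = kl(p, q)
    return divergence if same else max(0.0, margin - divergence)


def batch_loss(outputs, narrators, margin=2.0):
    """Sum L(P, Q) = l(P || Q) + l(Q || P) over all n(n - 1)/2 pairs in a batch.

    `outputs` holds one distribution per sample, produced by a single forward pass.
    """
    total = 0.0
    for i, j in combinations(range(len(outputs)), 2):
        same = narrators[i] == narrators[j]  # r = 1 if same narrator, else r = 0
        total += directed_loss(outputs[i], outputs[j], same, margin)
        total += directed_loss(outputs[j], outputs[i], same, margin)
    return total


# Hypothetical batch of three network outputs; samples 0 and 1 share a narrator.
outputs = [
    [0.8, 0.1, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]
narrators = ["a", "a", "b"]
loss = batch_loss(outputs, narrators)
```

The pairs are formed from outputs that already exist, so no sample needs to pass through the network more than once per batch. In this example the similar pair contributes nothing, while the dissimilar pairs are still penalized because their divergence is below the margin.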

3.4.4 Hyperparameter settings

In this section the hyperparameters of the final models are presented. The hyperparameters that were modified were the learning rate and batch size. For the model using pairwise constraints the margin hyperparameter was set to 2, as was suggested in [37]. Further, both of the models used the Xavier normal initialization [42] for the weights and biases of all layers in the network.

To find the optimal values an initial grid search was performed where the models were trained on just a subset of data, to speed up the process. The values used for the parameters during the grid search are presented in Tables 3.4 and 3.5. Once the hyperparameter space had been reduced the models were trained on the full data set for 800 epochs. In Appendix B the achieved loss for the different combinations of hyperparameters is presented. Missing values indicate failure to complete training (due to vanishing/exploding gradients) or that the combination of parameters was excluded after the initial search had been performed on a subset of the data set.

Figure 3.3: As one batch is fed forward through the network to calculate the pairwise constraint loss, each sample only has to be propagated through once as the pairwise relationships are computed only in the cost function.

Table 3.4: The values used for the learning rate during the grid search.

lr   $1.0 \cdot 10^{-5}$   $0.5 \cdot 10^{-5}$   $1.0 \cdot 10^{-6}$   $0.5 \cdot 10^{-6}$   $1.0 \cdot 10^{-7}$

Table 3.5: The values used for the batch size during the grid search.

batch size 32 64 128

The hyperparameters that achieved the best performance when training the networks are presented in Tables 3.6 and 3.7.

Table 3.6: The hyperparameter settings used when training the regular model.

Hyperparameter   Value
lr               $0.5 \cdot 10^{-5}$
batch size       64

Table 3.7: The hyperparameter settings used when training the PWC model.

Hyperparameter   Value
lr               $0.5 \cdot 10^{-6}$
batch size       32
margin           2

3.5 Embeddings

The embeddings of the data were extracted from the fully connected (FC1 and FC2) layers of the networks. The vector representations found by the model were treated as points in an n-dimensional space, where n represented the size of the layer from which the embedding was extracted. By using the euclidean distance between the data points and looking for the nearest neighbours of individual points in this space the newly found representations of the spectrograms could be evaluated.

The output layer (Out) is not used as an embedding due to the Softmax function being applied, which normalizes the vectors. The normalization removes the aspect of magnitude from the vectors, meaning that, for example, the vectors:

$$\vec{u} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \quad \text{and} \quad \vec{v} = \begin{bmatrix} 5 \\ 5 \\ 5 \end{bmatrix}$$

would after Softmax has been applied result in:

$$\text{Softmax}(\vec{u}) = \text{Softmax}(\vec{v}) = \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}$$

which would result in a (euclidean) distance of 0 between the points instead of the actual distance of:

$$\text{dist}(\vec{u}, \vec{v}) = \sqrt{(5 - 1)^2 + (5 - 1)^2 + (5 - 1)^2} = \sqrt{48} \approx 7$$
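The example can be verified with a few lines of plain Python, using the vectors (1, 1, 1) and (5, 5, 5) from the text.

```python
import math


def softmax(v):
    """Exponentiate (shifted by the maximum for stability) and normalize to sum to one."""
    exps = [math.exp(x - max(v)) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


u = [1.0, 1.0, 1.0]
v = [5.0, 5.0, 5.0]

dist_raw = euclidean_distance(u, v)                        # sqrt(48), roughly 6.93
dist_softmax = euclidean_distance(softmax(u), softmax(v))  # both map to (1/3, 1/3, 1/3)
```

Any two vectors that differ only by a constant added to every entry collapse to the same point after Softmax, which is why the distance information is lost.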

3.5.1 Evaluation

To evaluate the results two different methods were used. One objectively evaluated the models' performance by exploring the positions of the new representations in the embedding space. The other method was performed through a user test on the perceived similarity between narrators and was thus of a subjective nature. These two methods will be more thoroughly described below.

Objective Evaluation

The evaluation of the representations found was performed by splitting the data set, $D$, into two sets, $D'$ and $D_{samples}$, such that $D_{samples} \cup D' = D$ and $D_{samples} \cap D' = \emptyset$. By looking at the proportion of samples from the same narrator in $D'$ for the samples in $D_{samples}$, for neighborhoods of different sizes, the different representations could be evaluated. Since the data set was made up of 90 samples per narrator the sizes of the neighborhoods were chosen to be 9, 45 and 90, representing 10%, 50% and 100% of the total number of samples per narrator respectively. This approach led to three different scores for each embedding and model.

When looking at the neighborhoods of a sample the percentage of neighbors within this neighborhood that had the same gender as the sample was also computed. This was done to explore if this information seemed to be encoded in the found embeddings, although this information had not been specified during training.

Both of the evaluation methods described above were performed 10 times, constructing different random splits of the data at each iteration, and then averaged to get a final measurement.
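The neighborhood score for a single query sample can be sketched as follows. The code is plain Python with hypothetical two-dimensional embeddings; the function name `same_label_fraction` is introduced only for this example, and the same function applies to gender labels by passing those instead of narrator labels.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def same_label_fraction(query, query_label, reference, reference_labels, k):
    """Fraction of the k nearest reference embeddings that share the query's label."""
    order = sorted(
        range(len(reference)),
        key=lambda i: euclidean_distance(query, reference[i]),
    )
    nearest = order[:k]
    return sum(reference_labels[i] == query_label for i in nearest) / k


# Hypothetical 2-D embeddings for two narrators (the real embeddings come from FC1 or FC2).
reference = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0], [4.0, 4.0]]
reference_labels = ["a", "a", "a", "b", "b", "b"]

score_k2 = same_label_fraction([0.05, 0.05], "a", reference, reference_labels, k=2)
score_k4 = same_label_fraction([0.05, 0.05], "a", reference, reference_labels, k=4)
```

In the evaluation described above, the query would be a sample from $D_{samples}$, the reference set would be $D'$, and the score would be averaged over all queries and over the 10 random splits.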

Subjective Evaluation

To evaluate whether the embeddings found captured similarities between voices a user test was performed using the embeddings that achieved the best performance during the objective evaluation. The test was carried out by giving the participants 10 different reference samples, 5 males and 5 females, chosen at random from the data set. Each of these samples was then combined with two other samples: one from the narrator that was on average the closest to the reference narrator in the embedding space and one from the narrator furthest away on average, using the euclidean distance. The users were then asked to select the sample they perceived to be most similar to the reference sample.

The test was carried out through a survey where the participants were given 10 questions, each consisting of 10 second samples of the three types presented above (one reference as well as the one closest to and furthest away on average). The results from the survey could thus be compiled as individual percentages of the samples chosen for each question for all participants. In total 27 people participated in the survey and the participants were both employees from the company as well as acquaintances of the author of this thesis.

Results

This chapter presents the results from the experiments conducted for the objective and subjective evaluation, presented in the previous chapter.

4.1 Neighborhood Evaluation

In this section the results from the objective evaluation of analyzing the distribution of narrators and genders within the neighborhood of samples for the two derived models are presented. The raw data used to create the figures can be found in Appendix A.

4.1.1 Regular Model

In Figure 4.1 the average proportion of embeddings from the same narrator within neighborhoods of different sizes is presented. These results show that the lowest embedding layer (FC1) achieved significantly better performance for all sizes of neighborhoods than FC2.

Figure 4.2 presents the average percentage of embeddings having the same gender within neighborhoods of different sizes. Once again the embeddings from FC1 achieved better performance, although not by as much.

When comparing the results presented in Figures 4.1 and 4.2 it is worth noting that the percentage of embeddings with the same gender is considerably higher than the percentage of embeddings from the same narrator for all combinations of neighborhood sizes and embedding layers.


Figure 4.1: The average proportion of embeddings from the same narrator within different neighborhoods for the regular model. The proportion was obtained by randomly sampling embeddings from all narrators in the data set and looking at the distribution of narrators of the embeddings in their 9-, 45- and 90-neighborhood, averaging the results over 10 different runs.

Figure 4.2: The average percentage of embeddings with the same gender within different neighborhoods for the regular model. The percentage was obtained by randomly sampling embeddings from all narrators in the data set and looking at the distribution of gender of the embeddings in their 9-, 45- and 90-neighborhood, averaging the results over 10 different runs.


Figure 4.3: The average proportion of embeddings from the same narrator within different neighborhoods for the pairwise constraint model. The proportion was obtained by randomly sampling embeddings from all narrators in the data set and looking at the distribution of narrators of the embeddings in their 9-, 45- and 90-neighborhood, averaging the results over 10 different runs.

4.1.2 Pairwise Constraint Model

In Figure 4.3 the average proportion of embeddings from the same narrator within neighborhoods of different sizes for the pairwise constraint model is presented. Note that the y-axis only extends to 10% in this figure, as opposed to 100% in Figure 4.1. Once again the embeddings from FC1 achieve better performance than those from FC2, although the FC1 results are an order of magnitude lower than those of the regular model.

The average percentage of embeddings with the same gender within neighborhoods of different sizes for the pairwise constraint model is presented in Figure 4.4. Here, the results are centered around 50% for all combinations of neighborhood sizes and embedding layers, indicating that the separation between genders is no better than chance.
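The 50% chance level can be made explicit with a short derivation. It rests on assumptions not stated in the text: that embeddings carrying no gender information behave like random draws from the remaining samples, that query samples are drawn uniformly, and that the data set is roughly gender-balanced. The symbols $N$ (total number of samples) and $N_g$ (number of samples of gender $g$) are introduced here for illustration only.

```latex
% Expected share of same-gender neighbors when neighbors are drawn at random
% from the N - 1 samples other than the query:
\mathbb{E}[p_{\text{same}}]
  = \sum_{g} \frac{N_g}{N} \cdot \frac{N_g - 1}{N - 1}
  \approx \sum_{g} \left( \frac{N_g}{N} \right)^{2}
% For two equally sized groups, N_g = N / 2:
\mathbb{E}[p_{\text{same}}] \approx 2 \cdot \left( \tfrac{1}{2} \right)^{2} = \tfrac{1}{2}
```

Under these assumptions a value near 50% is what an uninformative embedding would produce, which is consistent with the interpretation given above.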

4.2 User Test Evaluation

For the user test the FC1 embeddings from the regular model were used, as these achieved the best performance during the objective evaluation presented above.

The results from the 27 participants of the user test are presented in Figure 4.5. From these results it is clear that in 8 out of the 10 questions the closest embedding from the model is also what a large majority of the participants found to be the most similar to the reference sample. For questions 3 and 10