Representing Voices Using Convolutional Neural Network Embeddings
NIKLAS EMBRETSÉN
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master in Machine Learning
Date: July 4, 2019
Supervisor: Mats Nordahl
Supervisor at Storytel: Sathvik Katam
Examiner: Olov Engwall
School of Electrical Engineering and Computer Science
Host company: Storytel
Swedish title: Representation av röster med hjälp av inbäddningar från faltningsnätverk
Abstract
In today's society, services centered around voices are gaining popularity. Being able to provide users with voices they like, in order to obtain and sustain their attention, is important for enhancing the overall experience of the service.
Finding an efficient way of representing voices such that similarity comparisons can be performed is therefore of great use.
In the field of Natural Language Processing, great progress has been made using embeddings from Deep Learning models to represent words in an unsupervised fashion. These representations managed to capture the semantics of the words.
This thesis sets out to explore whether such embeddings, capturing similarities between different voices, can also be found for audio data, more specifically voices of audiobook narrators. For this purpose, two different Convolutional Neural Networks, trained on spectrogram representations of the voices, are developed and evaluated. One performs regular classification, while the other uses pairwise relationships and a Kullback–Leibler divergence based loss function in an attempt to minimize the difference between the outputs for similar pairs of samples and maximize it for dissimilar pairs. From these models, the embeddings used to represent each sample are extracted from the different layers of the fully connected part of the network during the evaluation.
Both an objective and a subjective evaluation are performed. During the objective evaluation of the models, it is first investigated whether the found embeddings are distinct for the different narrators, as well as whether the embeddings encode information about gender. The regular classification model is then further evaluated through a user test, as it achieved an order of magnitude better results during the objective evaluation. The user test sets out to evaluate whether the found embeddings capture information based on perceived similarity.
It is concluded that the proposed approach has the potential to be used for representing voices in such a way that similarity is encoded, although more extensive testing, research and evaluation have to be performed to know for sure.
For future work it is proposed to perform more sophisticated pre-processing of the data and also to collect and include data about relationships between voices during the training of the models.
Sammanfattning (Summary)
Voice-based services are growing in popularity in today's society. Being able to provide users with voices they like, in order to capture and keep their attention, is therefore important for improving the user experience. Finding an efficient way of representing voices, so that similarities between them can be compared, is therefore of great use.
Within the field of natural language processing in machine learning, great progress has been made by creating representations of words from the inner layers of neural networks, so-called neural network embeddings. These representations have been shown to contain the semantics of the words.
This thesis sets out to investigate whether similar representations can be found for audio data in the form of narrator voices from audiobooks, such that similarity between voices is captured. To investigate this, two convolutional neural networks that used spectrogram representations of voice data were developed and evaluated.
One model is constructed as a regular classification model, trained to distinguish between the narrators in the dataset. The other model uses pairwise relationships between the data points and a Kullback–Leibler divergence based optimization function, with the aim of minimizing and maximizing the difference between similar and dissimilar pairs of data points. From these models, representations from the different layers of the network are used to represent each data point during the evaluation.
Both an objective and a subjective evaluation method are used. During the objective evaluation, it is first investigated whether the found representations are distinct for different narrators, and then whether they capture information about the gender of the narrator. The regular classification model is also evaluated through a user test, since that model achieved an order of magnitude better results during the objective evaluation. The purpose of the user test was to investigate whether the found representations contain information about the perceived similarity between the voices.
The conclusion is that the proposed approach has the potential to be used for representing voices such that information about similarity is captured, but that more extensive testing, investigation and evaluation are required. For future studies, more sophisticated preprocessing of the data is proposed, as well as collecting and using data about the relationships between voices during the training of the models.
Acknowledgements
First of all, I would like to thank Storytel and all my colleagues for being an awesome host company and making me feel part of the team right away. A special thanks goes to Sathvik Katam, my supervisor at Storytel, for providing me with all the data and information I needed as well as being a great sounding board throughout the work with this thesis.
I would also like to thank my academic supervisor, Mats Nordahl, for helping to shape this thesis and giving constructive feedback throughout this work.
Lastly, I would like to thank my examiner, Olov Engwall, for providing me with the final feedback and examining this thesis.
1 Introduction
1.1 Purpose
1.2 Problem Description
1.3 Research Question
1.4 Ethics, Sustainability and Social Aspects
1.5 Disposition
2 Background
2.1 Signal Processing
2.1.1 Sound
2.1.2 Speech
2.1.3 Audio Data
2.1.4 Features of Audio Data
2.2 Machine Learning
2.2.1 Supervised and Unsupervised Learning
2.2.2 Artificial Neural Nets
2.2.3 Neural Network Embeddings
2.3 Related Work
3 Methods
3.1 Project Overview
3.2 Resources
3.3 Data
3.3.1 Preprocessing of Data
3.4 Models
3.4.1 Network Architecture
3.4.2 Regular Classification Model
3.4.3 Pairwise Constraint Model
3.4.4 Hyperparameter settings
3.5.1 Evaluation
4 Results
4.1 Neighborhood Evaluation
4.1.1 Regular Model
4.1.2 Pairwise Constraint Model
4.2 User Test Evaluation
5 Discussion
5.1 Analysis of Results
5.2 Discussion of Methods
6 Conclusion
6.1 Conclusion
6.2 Future Work
Bibliography
A Raw Data
B Hyperparameter Grids
C Survey Results
Introduction
Audio- and voice-based services are gaining popularity in many fields of technology, and customer satisfaction is critical to remain competitive on the market. Captivating voices have the potential of making almost anything interesting and are of importance in contexts where it is desirable to obtain and sustain people's attention.
Traditional approaches to audio recognition and audio classification, such as GMM-HMM models (see, e.g., [1]), have involved a lot of manual feature engineering, which requires considerable domain knowledge from the people developing the models. In recent years, however, Deep Learning (DL) (see, e.g., [2]) has been introduced to the problem area, which allows for sophisticated analysis of audio signals without requiring deep domain knowledge of signal processing. Instead of performing complex transformations of the audio signals to acquire different features, more general representations of the signals, such as spectrograms, can be used as a basis. This approach leaves the task of determining what characterizes a signal to the model itself. Not only does this make the area more accessible to people who lack deeper knowledge of signal processing, but it also reduces the risk of introducing human bias into the model, as manual feature engineering is usually driven by intuition. This methodology has been embraced especially in the area of computer vision, specifically in image recognition, with the introduction of Convolutional Neural Networks in the ImageNet competition in 2012 [3]. A convolutional neural network extracts increasingly high-level features, starting with straight and curved lines and moving to more complex patterns, as the image goes through the layers of the network (see, e.g., [2]).
This thesis will focus on the area of audio recognition and classification, more precisely on the human voice. However, instead of identifying a certain speaker, the aim is to identify groups of people who exhibit a similar way of speaking. This question is interesting to many fields providing audio and voice services, since the right voice has the potential to enhance the whole experience, whether it be audiobooks, presentations or tutorials.
1.1 Purpose
The purpose of this study is to investigate the possibility of finding, by automatic means, representations for audio recordings of voices from audiobooks that capture their characteristics. This question is interesting for actors on the voice-service market, as it would allow them to more efficiently find good voices for their service and/or to give good recommendations of new voices the users might enjoy, thus enhancing the overall experience. To achieve this, different ways of representing audio data will be explored, such that samples with similar attributes are close to each other while being far away from voices with dissimilar attributes. Finding such a representation would allow for efficient and simple analysis of the audio signals.
1.2 Problem Description
The problem investigated in this thesis is the task of grouping narrators together based on similar attributes. Manual classification and grouping of audio is not just time consuming but also introduces human bias into the process, since the way different people perceive sounds and voices is subjective. The more traditional models for this purpose require manual feature engineering, which makes them somewhat limited. Manual feature engineering (see, e.g., [4]) can be a limitation in two ways: 1) it requires specific domain knowledge, and 2) it once again introduces human bias into the equation, since the choice of features can affect the end results.
Therefore the aim of this thesis is to develop models that, by automatic means, can extract features and find representations of the audio data in an objective manner. The models will be developed and trained using the large amount of voice recordings from audiobooks present at the host company, Storytel, which provides a subscription-based streaming service of audiobooks.
The representations found are evaluated by exploring the new space spanned by the representations.
To find a representation of the data that is suitable for this evaluation, training a regular classification model might not be optimal. The goal of a classification [...] classes into account, which is a measurement that will be used during the evaluation. Thus, an approach using pairwise constraints between data points together with the Kullback–Leibler divergence [5] as an optimization criterion will be implemented and compared to a regular classification model.
1.3 Research Question
Given the problem description presented in Section 1.2, the question to be examined in this thesis is:
To what extent can Convolutional Neural Network embeddings be used to group voices based on similarity?
1.4 Ethics, Sustainability and Social Aspects
There are some ethical aspects with this work that need to be considered.
Firstly, the model could be biased towards a particular type of voice, by only taking certain characteristics into account or being fed a non-representative data set during training, and therefore discriminate against some types of voices. This problem has been encountered in several fields of machine learning, such as image recognition [6] and systems for predicting criminality [7]. This also poses the risk of isolating users to only certain types of speakers, and potentially certain types of content. In the case of fictional content this might not matter that much, but for more informative content it could contribute to homogeneous information sharing instead of giving a diverse picture. This potential favouring of certain types of speakers could also contribute to bias in the hiring process for roles where the voice is part of the process.
Some positive effects this work could have are that it might reduce the repetitive work of manually labeling the audio files with simple distinctions, such as male/female voices. It also has the potential to enable more sophisticated labeling of the audio data, which can be used to enhance the customer experience and make it more enjoyable. This is good both from the perspective of the service provider and for the consumers.
An increase in digital book consumption could also reduce the number of physical paper books that need to be produced, which could save trees and slow down deforestation around the globe. Digital versions are also accessible anywhere there is an internet connection, which could potentially contribute to reducing the emission of greenhouse gases, as books do not need to be distributed across the globe to as great an extent. In this setting it is also worth mentioning the environmental aspects of deep learning (DL), which is usually a very resource-heavy procedure, requiring much energy to perform the computations. This issue was brought up in [8], where the authors explored different state-of-the-art models in Natural Language Processing (NLP) and approximated the resource usage for training and tuning these models. The authors showed that the training of Google's BERT [9] was roughly equivalent to a trans-American flight for one person with respect to CO₂ emissions, which, when put into perspective, is a great deal considering all the research and development going on within DL.
With the audiobook becoming more popular, there is also a risk of people moving away from reading regular books due to the convenience of listening to them instead. Especially for kids this might be a problem, as they have not fully developed their reading skills, which could lead to problems later in life. On the other hand, improvements in this field could also lead to more books being consumed by more people, which in turn could move society forward through more information being shared.
1.5 Disposition
The rest of this thesis will be structured as follows. In Chapter 2, the background information necessary to understand the presented work will be introduced. Chapter 3 will present the resources used, followed by a description of the developed models, the conducted experiments and how the work was carried out. In Chapter 4 the results from the conducted experiments will be presented. Chapter 5 will discuss the methodology used and the presented results. Lastly, in Chapter 6, the conclusions drawn from the results will be presented, followed by proposed future research.
Background
This section will present the background theory necessary to understand this thesis. First, theory regarding audio and speech will be introduced followed by its digital representation and transformation of this data, which enables a more thorough analysis. This will be followed up by an introduction to machine learning, focusing on artificial neural nets and neural network embeddings, giving a foundation for understanding the presented work.
2.1 Signal Processing
Signal processing concerns the analysis of different physical phenomena, such as sound and different measurements (see, e.g., [10]). This section will focus on signals corresponding to sound and their digital representation, followed by different transformations of this data that enable further analysis of the signals.
2.1.1 Sound
Sound is pressure waves of air molecules that arise from forces compressing the air molecules into more and less tightly packed areas. These fluctuations in air pressure cause our eardrums to vibrate. These vibrations are further transmitted through bones to the inner ear (cochlea), which is a spiral tube directly connected to the auditory nerve and can roughly be regarded as a filter bank for different frequencies. The cochlea is filled with liquid and has bundles of hair cells inside it, which respond to different frequencies depending on their position. The bundles of hair cells are responsible for turning the movements from the vibrations into electrical signals that are carried through the auditory nerve to the brain, where they are interpreted as sound (see, e.g., [10], [11]). As sound is a fluctuation of a quantity (air pressure) over time, it can be described mathematically by its waveform.
2.1.2 Speech
For humans, speech is a way of communicating and involves the transmission of information between two actors: a speaker (transmitter) and one or more listeners (receivers) (see, e.g., [10]). This communication begins with a speaker forming some thoughts to express, which activates muscular movements in the vocal tract of the speaker that produce the sounds of speech as air gets pushed through. The listeners receive these sound waves in the auditory system, which works as a decoder from the physical sound wave into neurological signals to the brain, where the received information is processed.
The speech apparatus can be divided into two components: phonatory organs and articulatory organs (see, e.g., [12]). The phonatory organs consist of the lungs and the larynx, whose responsibility is to create the voice source sounds. The sounds are produced by pushing air from the lungs through the larynx, which causes the vocal cords to vibrate. The phonatory organs control the pitch and loudness, as well as other prosodic patterns of the speech, by adjusting the tightness of the vocal cords and the flow of air through the larynx.
The articulatory organs are made up of different parts of the mouth, such as the tongue, lips, jaws and teeth. They are responsible for manipulating the sounds from the phonatory organs to generate additional sounds of the speech.
2.1.3 Audio Data
A convenient way of representing an analog audio signal is through a continuous function of time, $t$. A speech signal can therefore be represented by the function $x(t)$, whose variation over time corresponds to the amplitude of the signal at different points in time. By sampling the signal $x(t)$ with a sampling period of $T$, a discrete representation of the signal can be obtained as $x[n] = x(nT)$, where the sampling rate (samples taken per second from the signal), $F_s$, of the digital signal is defined as $1/T$ (see, e.g., [10]).
A digital representation of the analog signal describing the sound is thus a sequence of numbers representing the amplitude of the signal at different discrete time steps, $nT$. In Figure 2.1 the waveform of a speech signal is presented.
Figure 2.1: The waveform representation of a speech signal (left) and a snippet from the signal (right) to better illustrate its waveform.
It is hard to derive properties other than loudness, duration and periodicity of the sound from the waveform presented in Figure 2.1. By analyzing the wave it is clear that the sound is made up of three equally spaced bursts of some sort, but what caused them cannot be deduced from this: was it three words or just a siren going off three times? To be able to further analyze an audio signal, more high-level features are needed to describe the signal, and for that the signal needs to be transformed. Through different transformations, different aspects of the audio data can be analyzed. Transformations mainly bring the analysis into two domains: the temporal and/or the frequency domain (see, e.g., [13]).
These kinds of analysis enable high-level features to be extracted from the audio signal, which in turn can be used for more sophisticated analysis of the sound. Features and transformations of audio signals will be more thoroughly described below.
Nyquist-Shannon Sampling Theorem
When constructing a digital signal from an analog signal, the sampling rate (the rate at which samples are taken from the analog signal), $F_s$, and the bit depth (how many bits are used to represent each sample), which determines the resolution of the signal, are of great importance. To be able to capture all information of the original signal in its digitized version, the sampling rate has to be sufficiently high. The Nyquist-Shannon sampling theorem [14] states that, when digitizing an analog signal, the sampling rate has to be at least double the maximum frequency of the analog signal. In other words:
$$F_s \geq 2 \times F_{max}$$
to guarantee proper reconstruction of the original signal. As the hearing range of a human roughly extends to 20 kHz, the standard sampling frequency used when digitizing audio signals, to meet this criterion, is 44.1 kHz (see, e.g., [15]).
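The sampling relation $x[n] = x(nT)$ and the consequence of violating the sampling criterion can be illustrated with a small sketch. The example below is not part of the thesis implementation: it assumes NumPy, and the helper names `sample_sine` and `dominant_frequency` as well as the chosen frequencies are hypothetical and picked only for illustration.

```python
import numpy as np

def sample_sine(freq_hz, sample_rate_hz, duration_s=1.0):
    """Sample x(t) = sin(2*pi*f*t) at times t = n*T, where T = 1/F_s."""
    n = np.arange(int(round(duration_s * sample_rate_hz)))  # sample indices
    T = 1.0 / sample_rate_hz                                 # sampling period
    return np.sin(2 * np.pi * freq_hz * n * T)

def dominant_frequency(x, sample_rate_hz):
    """Return the frequency (Hz) of the strongest bin in the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate_hz)
    return freqs[np.argmax(spectrum)]

# A 300 Hz tone sampled at 1000 Hz satisfies F_s >= 2 * F_max.
ok = dominant_frequency(sample_sine(300, 1000), 1000)

# A 700 Hz tone sampled at 1000 Hz violates the criterion; its samples are
# indistinguishable from those of a 300 Hz tone (1000 - 700 = 300).
aliased = dominant_frequency(sample_sine(700, 1000), 1000)
```

In this sketch both `ok` and `aliased` come out at 300 Hz, which is the aliasing effect that a sufficiently high sampling rate is meant to avoid.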
2.1.4 Features of Audio Data
Features of audio are properties that describe an audio signal and include both physical (low-level) and psychoacoustic (high-level) attributes. All audio signals can be described in terms of:
• Duration - the time between the start and end of a signal,
• Loudness - the size of the changes in sound pressure level (related to the energy of the signal),
• Pitch - the frequency of the signal, and
• Timbre - related to the perceived sound quality and the most complex attribute of an audio signal. For example, it reflects the difference in signals between two instruments playing the same note, as well as the ability to distinguish between different categories of instruments
(see, e.g., [15]).
These attributes of audio data can be represented in digital format through different types of transformations of the audio signal, which will be described below. In general, these digital representations of the attributes of an audio signal are called audio features.
Transformations
Transformations of audio data are functions that, when applied to a signal, map the data from one domain into another. Perhaps the most popular transformation in signal processing is the Fourier Transform (FT) (see, e.g., [10]). The FT takes a signal and decomposes it into its frequency components, making it possible to analyze the signal's components individually.
The FT is a transformation from the time domain into the frequency domain and can be interpreted as sliding a frequency window across the signal, measuring how much each frequency contributes to the signal. The FT of a signal $x(t)$, in the continuous case, is defined as:
$$X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt \tag{2.1}$$
where the signal, $x(t)$, gets transformed to a function of frequency, $X(\omega)$, by applying $e^{-j\omega t}$ to the signal. The exponential term, $e^{-j\omega t}$, can through Euler's formula be written as:
$$e^{-j\omega t} = \cos(\omega t) - j \sin(\omega t) \tag{2.2}$$
and describes a periodic motion through time $t$ with frequency $\omega$. Hence, when integrating Equation 2.1 over time, the signal, $x(t)$, and $e^{-j\omega t}$ will tend to align if the frequency $\omega$ is present in the signal, yielding a higher magnitude for the integral $X(\omega)$ and thereby indicating the presence of $\omega$ in $x(t)$. The exponential term can thus be viewed as a frequency filter.
In the case of digital signals, however, the data is no longer continuous, as the digital signal is a finite set of numbers acquired through sampling from the original signal. In this setting the Discrete Fourier Transform (DFT) (see, e.g., [10]) can be used. The formula for the DFT is similar to that of the FT and is defined as:
$$X(k) = \frac{1}{N} \sum_{n=0}^{N-1} x[n]\, e^{-j\frac{2\pi}{N}kn} \tag{2.3}$$
where $k$ represents the frequency bin number, $n$ the sample number and $N$ the total number of samples from the signal $x(t)$.
The frequency filters applied during the transformation are independent and together cover the full frequency spectrum of the signal. These filters can therefore capture the different frequency components of the signal, their amplitudes and their phases. This information can, for example, be used to construct spectrograms, a topic that will be covered in Section 2.1.4, and to reconstruct the signal through the inverse FT.
The Fast Fourier Transform (FFT) is an efficient algorithm for computing the DFT and is the implementation of the transform used in computers, reducing the time complexity from $O(N^2)$ to $O(N \log N)$ [10].
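To make the relation between Equation 2.3 and the FFT concrete, the DFT can be sketched directly from its definition and compared with a library FFT. This is a minimal sketch assuming NumPy; the function name `dft` and the random test signal are hypothetical. Note that `np.fft.fft` does not apply the $1/N$ factor by default, so it is divided out here to match the normalization used in Equation 2.3.

```python
import numpy as np

def dft(x):
    """Naive O(N^2) DFT with the 1/N normalization used in Equation 2.3."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)                         # sample numbers
    k = n.reshape(-1, 1)                     # one row per frequency bin k
    basis = np.exp(-2j * np.pi * k * n / N)  # e^{-j 2 pi k n / N}
    return basis @ x / N

rng = np.random.default_rng(0)
signal = rng.standard_normal(64)

X_naive = dft(signal)
X_fft = np.fft.fft(signal) / len(signal)     # FFT result, rescaled by 1/N
```

Both computations give the same coefficients up to floating-point error; the difference lies only in the number of operations required.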
The Mel Scale
The Mel scale is a logarithmic transformation of the linear frequency range. The mel values are obtained by applying the following function to the original frequency:
$$m = 1125 \times \ln\left(1 + \frac{f}{700}\right)$$
where $m$ is the mel scale value and $f$ the original frequency. In Figure 2.2 mel is plotted as a function of frequency (Hz) and of $\log_2(\text{frequency})$. The Mel scale was invented as an attempt to mimic the way humans perceive sounds [16]. The cochlea in the inner ear acts as a frequency filter, and its complex mechanism results in the perception of sounds at different frequencies not being linear, hence the logarithmic nature of the curves. When looking at the base-two logarithm of the frequency, one step on the x-axis corresponds to doubling the frequency; this interval is referred to as an octave. In a musical setting, moving one octave results in the same letter note and therefore also the same pitch class (see, e.g., [15]). By looking at the right plot in Figure 2.2 it is quite easy to get an intuition of what the mel scale does: it makes a greater distinction between higher octaves. This is also how human hearing works; humans are better at distinguishing between higher pitches than lower pitches.
Figure 2.2: The mel scale plotted as a function of frequency (Hz) (left) and as a function of $\log_2(\text{frequency})$ (right).
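The mel formula above can be expressed as a short function. The sketch below assumes NumPy; the names `hz_to_mel` and `mel_to_hz` are hypothetical, the inverse is obtained simply by solving the formula for $f$, and the octave edges starting at 110 Hz are an arbitrary choice for illustration.

```python
import numpy as np

def hz_to_mel(f_hz):
    """m = 1125 * ln(1 + f / 700), with f in Hz."""
    return 1125.0 * np.log(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (np.exp(np.asarray(m, dtype=float) / 1125.0) - 1.0)

# Width in mels of five successive octaves: 110-220 Hz, 220-440 Hz, ...
octave_edges_hz = 110.0 * 2.0 ** np.arange(6)
octave_widths_mel = np.diff(hz_to_mel(octave_edges_hz))
```

In this sketch each successive octave spans more mels than the previous one, which corresponds to the shape of the right plot in Figure 2.2.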
Spectrograms
Spectrograms illustrate the energy distribution (amplitude) over the frequency spectrum of a signal over time and are therefore a suitable representation for performing time-frequency analysis of the signal (see, e.g., [10]). A spectrogram of the speech signal from Figure 2.1 is presented in Figure 2.3. In this diagram, brighter colors illustrate higher energy density.
Figure 2.3: The spectrogram representation of the speech signal from Figure 2.1.
A spectrogram is computed by performing a Short-Time Fourier Transform (STFT) on the signal. The idea behind the STFT is to compute the FT (or rather an FFT) over very short periods of time of the signal, also known as frames. By transforming each period (frame) independently and stacking the results from each frame together, a spectrogram is obtained. Spectrograms can also be constructed for other scales than frequency, for example the mel scale, by mapping the frequency bins onto this scale instead.
A spectrogram is more informative than the regular waveform as it still displays the temporal characteristics of the signal but also gives information about the contributing frequency components of the signal over time.
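The frame-by-frame procedure described above can be sketched as follows. This is an illustrative sketch assuming NumPy and is not the preprocessing used in the thesis: the function name `spectrogram`, the frame length, the hop length, the Hann window and the synthetic two-tone test signal are all assumptions made for the example.

```python
import numpy as np

def spectrogram(x, frame_length=256, hop_length=128):
    """Magnitude spectrogram: one FFT per windowed frame, stacked over time."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_length)  # assumed window, tapers the frame edges
    n_frames = 1 + (len(x) - frame_length) // hop_length
    frames = np.stack([
        x[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    # Rows are frequency bins, columns are frames (time).
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Synthetic signal: a 500 Hz tone that switches to 2000 Hz halfway through.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
x = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 2000 * t))

S = spectrogram(x)
freqs = np.fft.rfftfreq(256, d=1.0 / sample_rate)
peak_first_frame = freqs[np.argmax(S[:, 0])]
peak_last_frame = freqs[np.argmax(S[:, -1])]
```

The strongest frequency bin moves from 500 Hz in the first frame to 2000 Hz in the last, a change over time that the plain waveform does not make explicit.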
2.2 Machine Learning
Machine Learning (ML) (see, e.g., [2], [17]) is the technique of using statistical models for computers to learn certain tasks without explicitly giving the computer instructions on how to perform the task. This is usually done by exposing the model to data during training, where the model infers statistical properties of the data. When the model is later exposed to unseen data, it can use the properties it learned during training to make a decision about this new data.
This section first presents the two main types of tasks machine learning models are exposed to: supervised and unsupervised learning. Secondly, Artificial Neural Networks and a special kind of network, Convolutional Neural Networks, are introduced. Finally, neural network embeddings, a way of representing the data as vectors, are presented.
2.2.1 Supervised and Unsupervised Learning
The models in ML can be divided into two different groups: supervised and unsupervised models. Supervised models perform a task that is controlled by the user. A supervised model is exposed to labeled data, for example images, from which it learns how to distinguish between the different labels, for example cats and dogs.
Unsupervised learning algorithms process unlabeled data and find similarities based on statistical properties of the data. This method is very powerful for exploratory data analysis, as it can find hidden structures in the data without requiring any sort of guidance.
Figure 2.4: The structure of a feed-forward neural network, with one input layer, a number of hidden layers and one output layer. The black arrows in the figure represent the connections between units and the weights (parameters) of the network.
2.2.2 Artificial Neural Nets
Artificial Neural Nets (ANNs) are a programming model that tries to mimic the behavior of the human brain and, in broad terms, consists of many interconnected neurons that compute responses based on inputs. The term neural network originates from as far back as the 1940s and was a first attempt to describe the human brain in a mathematical way [18][19].
The first attempt to mimic this architecture in a computer was the development of the perceptron in 1958, a single-layer network capable of binary classification [20]. However, a single-layer perceptron is only capable of finding linear decision boundaries, which is a huge limitation for more complex tasks where data needs to be separated in a non-linear manner [21]. This gave rise to more sophisticated models where several layers were connected, introducing the Multi-Layer Perceptron (see, e.g., [22]), the predecessor of the feed-forward networks that are used today.
Feed-forward Neural Networks
Feed-forward neural networks are a special type of ANN in which the connected neurons do not form loops, and the data is therefore fed forward through the network.
The structure of a regular feed-forward neural network is presented in Figure 2.4. The network consists of one input layer, one output layer and zero or more hidden layers. The introduction of multiple hidden layers, making the previous shallow networks deeper, gave rise to the term Deep Learning.
A feed-forward neural net can be seen as multiple linear combinations of the input and the weights (neurons in the different layers), combined with non-linear activation functions. Mathematically, the forward pass for a feed-forward network with $k$ hidden layers can be expressed as:
$$\vec{y} = \sigma(W_k \cdot \sigma(W_{k-1} \cdot \sigma(\dots \sigma(W_1 \cdot \vec{x} + \vec{b}_1) \dots) + \vec{b}_{k-1}) + \vec{b}_k) \tag{2.4}$$
where $\vec{x}$ is the input vector, $\vec{y}$ the output vector, $\sigma(\cdot)$ a non-linear activation function and $W_i$ and $\vec{b}_i$ the weight matrix and bias vector for hidden layer $i$ in the network.
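The nested expression in Equation 2.4 amounts to a simple loop over the layers. The sketch below assumes NumPy; the function name `forward`, the layer sizes, the random weights and the choice of `np.tanh` as the non-linear activation are hypothetical and serve only to illustrate the structure of the computation.

```python
import numpy as np

def forward(x, weights, biases, activation=np.tanh):
    """Apply the activation to W_i @ h + b_i for each layer in turn."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        h = activation(W @ h + b)  # linear combination followed by non-linearity
    return h

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 3]  # input size, two intermediate sizes, output size
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

y = forward(rng.standard_normal(4), weights, biases)
```

Each pass through the loop corresponds to one pair of parentheses in Equation 2.4, with the innermost layer evaluated first.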
Activation Function
Activation functions are functions that are applied to the output of each neuron in the network to break the linearity, enabling the network to create complex decision surfaces. Today the default recommendation for the activation function is the rectified linear unit (ReLU) (see, e.g., [2]), which is simply defined as:
ReLU(x) = max(0, x) (2.5)
One great property of the ReLU activation function is that it yields sparse activations in the network as all values less than zero will output zero when ReLU is applied (see, e.g, [23]).
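The sparsity property can be observed directly in a small sketch. The example assumes NumPy, and the standard normal pre-activations are a hypothetical stand-in for the values a layer would produce before the activation is applied.

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
pre_activations = rng.standard_normal(10_000)  # assumed zero-centered inputs
activations = relu(pre_activations)

# Fraction of units whose output is exactly zero after ReLU.
sparsity = np.mean(activations == 0.0)
```

With zero-centered inputs, as assumed here, roughly half of the activations become exactly zero.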
Training of the Network
The task of a feed-forward network is to approximate some arbitrary function:
$$y = f(x; \theta) \tag{2.6}$$
where $x$ is the input data, $y$ the output (labels) and $\theta$ represents the parameters of the network, i.e., the weights and biases of the connections between the nodes in the network (see, e.g., [2]).
The values of the network's weights, $\theta$, are approximated through two operations: 1) the forward pass, where the input is sent through the network to generate an output (Equation 2.4), and 2) the backward pass, where the learning happens. The backward pass is also known as back propagation [22][24] and begins by computing the error of the output, by comparing it to the true value (label) using a pre-determined cost function. After this, gradient descent is performed: the gradient of the cost function is computed with respect to each parameter $w_i, b_i \in \theta$ in the network, and the value of each parameter is updated in the direction opposite to its gradient in order to minimize the error.
The cost function of an ANN determines the behaviour of the network and needs to be selected to suit the task the network is expected to perform. The most commonly used cost function for classification is the cross-entropy loss function:
$$\text{cross\_entropy}(y, \hat{y}) = H(y, \hat{y}) = -\sum_{i=1}^{C} y_i \times \log \hat{y}_i$$
where C is the number of classes, y is the true class and $\hat{y}$ the prediction from the model. For classification y is a binary C-dimensional vector with element i set to 1 if the sample comes from class i and all other entries set to 0. The prediction, $\hat{y}$, is also a C-dimensional vector and can be viewed as a discrete probability distribution over all classes. The probability distribution is usually obtained by applying the Softmax function to the network's output, defined as:
$$\text{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{C} \exp(x_j)}$$
which normalizes the output vector to create a valid probability distribution.
It is also a continuous and differentiable function, allowing for backprop to be performed (e.g, [2]).
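The Softmax function and the cross-entropy loss can be sketched in a few lines of plain Python. The example logits and the small constant `eps` (added to avoid taking the logarithm of zero) are assumptions made for illustration and are not taken from the models in this work.

```python
import math


def softmax(logits):
    """Map raw network outputs to a valid probability distribution."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred, eps=1e-12):
    """H(y, y_hat) = -sum_i y_i * log(y_hat_i) for a one-hot y_true."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))


logits = [2.0, 1.0, 0.1]          # hypothetical output of the last layer
probs = softmax(logits)           # sums to one
label = [1, 0, 0]                 # one-hot vector: the sample is from class 0
loss = cross_entropy(label, probs)
```

Because the label is one-hot, the loss reduces to the negative log-probability that the model assigns to the true class.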
Cross-entropy is closely related to Entropy, which is a measurement of uncertainty, defined as:
$$\text{Entropy}(P) = H(P) = -\sum_i P(i) \log P(i)$$
which gives the theoretical minimum average encoding size for the events of a probability distribution P . The expression for entropy looks identical to that of cross-entropy except that cross-entropy has two different distributions. To better illustrate the difference they can be expressed as expectations:
$$H(P) = \mathbb{E}_{x \sim P}[-\log P(x)] \quad \text{and} \quad H(y, \hat{y}) = \mathbb{E}_{x \sim y}[-\log \hat{y}(x)]$$
The cross-entropy thus gives the realized average encoding size since it is a weighted average over the true distribution of events (y) using the estimated distribution (prediction) $\hat{y}$ for the encoding size. However, if the two distributions are the same it is easy to see that the cross-entropy is equal to the entropy. As the entropy is the theoretical minimum average encoding size, the minimum of the cross-entropy is when the two distributions are the same. This means that by optimizing the network to minimize the cross-entropy function the network minimizes the difference between its predictions and the true labels.
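The relationship between entropy and cross-entropy can be checked numerically. The sketch below uses two made-up distributions `p` and `q` (assumptions for illustration only) and shows that the cross-entropy equals the entropy when the distributions coincide and is larger otherwise.

```python
import math


def entropy(p):
    """H(P) = -sum_i P(i) * log P(i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(P, Q) = -sum_i P(i) * log Q(i)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.7, 0.2, 0.1]  # hypothetical true distribution
q = [0.5, 0.3, 0.2]  # hypothetical estimated distribution

same = cross_entropy(p, p)       # equals entropy(p)
different = cross_entropy(p, q)  # strictly larger than entropy(p)
```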
One problem with deep feed-forward networks is their computational complexity. In a regular feed-forward network each input unit is connected to each output unit which means that the number of parameters (connections between units) in a network drastically increases as more layers and/or more nodes are added. To reduce the number of parameters the model has to learn there are several methods, one of which is a special kind of feed-forward network, called a Convolutional Neural Network, which is described below.
Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are a special kind of feed-forward network with an architecture that makes them very suitable for processing data that has a natural grid-like topology, such as time-series data and images (see, e.g., [2]). The network first appeared in [25][26], as an approach to improve handwritten digit recognition. In the last decade it has entered various fields of ML with great success, starting with computer vision where it outperformed state-of-the-art techniques in the ImageNet challenge 2012 [3] on classifying images.
Figure 2.5: An example of the general structure of a CNN.
A general architecture of a CNN is presented in Figure 2.5 and consists of two different parts: 1) the convolutional part, where the network learns the features of the input data, and 2) the fully connected part, a regular feed-forward network, that performs the classification.
Figure 2.6: Two steps of the convolutional operation of a CNN. The red grid represents the feature map, which is created by sliding the kernel (shaded area) over the input (green grid). Picture adapted from [27].
The convolutional part of a CNN usually consists of two different types of layers: a convolutional layer and a pooling layer, which will be described below.
The "convolutional" in the name comes from the fact that the network applies the mathematical operation convolution on the data. In general a convolution is an operation on two functions that creates a third function and is defined as the following integral:
$$s(t) = (x * w)(t) = \int x(a)\, w(t - a)\, da \tag{2.7}$$
for continuous functions (e.g. [2]). To use terminology that is consistent with what is used in ML, s(t) is called a feature map, the function x the input and the function w the kernel. In ML the input and kernels are usually multidimensional arrays, which means that the convolution can be performed over more than one axis at a time. In a CNN the integral is changed to a summation as the data is no longer continuous. The convolution operation used in a CNN can thus be expressed as:
$$S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i + m, j + n)\, K(m, n) \tag{2.8}$$
where the input I for example could be a 2D image, the kernel K a 2D array and the feature map S the 2D array resulting from applying the kernel on the image as a sliding window. This process is illustrated in Figure 2.6. As the filters of a CNN slide over the input they are able to capture structures and patterns in the input. For an image this could for example be edges, lines and curves which can be thought of as the features of the image. These features become more complex in the deeper layers of the CNN as more low-level features get combined to form the high-level features of the input.
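Equation 2.8 can be sketched directly as a sliding window over a small 2D input. The sketch assumes that the kernel is only applied where it fits completely inside the input (no padding) and uses a stride of 1; the toy image and the vertical-edge kernel are made up for illustration.

```python
def conv2d(image, kernel):
    """Apply equation 2.8 with stride 1 and no padding (no kernel flipping)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy image with a vertical edge between the second and third column.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical kernel that responds to a dark-to-bright transition.
kernel = [[-1, 1],
          [-1, 1]]
feature_map = conv2d(image, kernel)
```

The feature map is non-zero only at the positions where the kernel covers the edge, which is the sense in which a kernel detects a feature of the input.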
Pooling layers are used for downsampling the output after a convolution has been applied on the data. The downsampling works in a similar manner as the convolution operation, where a filter is applied as a sliding window over the output matrix, as is illustrated in Figure 2.6, and then an operation on the elements covered by the filter is performed. The most common pooling operation is the so-called max-pooling, which only extracts the maximum value in the current window and discards the rest, thus reducing the size of the data before propagating it further through the network (see, e.g., [28]).
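Max-pooling can be sketched in the same sliding-window style. The window size of 2 and stride of 2 used here are illustrative assumptions, as is the toy feature map.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the maximum value inside each size x size window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[i * stride + m][j * stride + n]
                for m in range(size) for n in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


fm = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]
pooled = max_pool2d(fm)  # a 4 x 4 map is reduced to 2 x 2
```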
As was mentioned in Section 2.2.2, the computational complexity of deep neural nets can be problematic. This is something the structure of a CNN overcomes by introducing sparse interactions and parameter sharing. Whereas a regular ANN needs separate parameters to describe the interactions between all the input and output units, a CNN only needs parameters for the interaction of the kernels and the input. As the kernels are usually one or several orders of magnitude smaller than the input, this means that the total number of parameters in the network is greatly reduced (see, e.g., [2]).
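The effect of parameter sharing can be illustrated by counting parameters. The sketch below compares a fully connected layer with a convolutional layer applied to the same input; the layer sizes are illustrative assumptions and the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """One weight per input-output connection plus one bias per output unit."""
    return n_in * n_out + n_out


def conv_params(n_kernels, kernel_h, kernel_w, in_channels):
    """One weight per kernel element plus one bias per kernel."""
    return n_kernels * (kernel_h * kernel_w * in_channels + 1)


# Assumed example: a 128 x 552 single-channel input.
n_inputs = 128 * 552
dense = dense_params(n_inputs, 200)  # fully connected layer with 200 units
conv = conv_params(32, 4, 4, 1)      # 32 kernels of size 4 x 4
```

Under these assumptions the fully connected layer needs about 14 million parameters while the convolutional layer needs 544, since the same kernels are reused at every position of the input.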
2.2.3 Neural Network Embeddings
Neural network embeddings are vectors extracted from one layer of an ANN, and have gained a lot of popularity in the area of Natural Language Processing (NLP) with the introduction of Word2Vec by Mikolov et al. [29]. Embeddings are continuous vector representations of a discrete variable, which can be obtained by training a deep learning model to perform some sort of classification and then extracting the output produced by the neurons at any layer in the network. This new vector can be used to represent the original data point in a space spanned by the embeddings. Since the network is optimized to map samples from the same class to the same output (label), these representations tend to be close in the space spanned by the embeddings for similar data samples, which opens up for interesting analysis. In NLP the embeddings have been able to capture the semantics of the language, and through linear algebra operations on the vector representations of words it has been possible to derive other words. An example of such a relationship is:
queen = (king − man) + woman
where you get the embedding for queen by performing vector arithmetic on the embeddings for king, man and woman. Other such properties include connecting capitals to their countries, famous people to their occupations and more (see, e.g., [30]).
The vector representation of the data can be viewed as data points in a d-dimensional space, and thus a common use of the embeddings is to look at their nearest neighbors when analyzing the results (see, e.g., [31]). In order to find the nearest neighbors a distance metric between embeddings must be decided. In NLP, where documents can be represented as a vector with each dimension representing the frequency of a word within the document, it makes sense to use the cosine distance between documents as it measures the angle between the vectors. In this setting the angle between two vectors would indicate if the proportion between the words in the documents is similar or not, and the magnitude of the vector is therefore not important. In other contexts, such as cluster analysis, where the magnitude of the vector is also important, the Euclidean distance between the vectors can be used.
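The difference between the two distance metrics can be made concrete with a small sketch. The two-dimensional embeddings below are made up for illustration: one points in the same direction as the query but has a much larger magnitude, the other is close in magnitude but points in a different direction.

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle between u and v); ignores the magnitude of the vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance; sensitive to the magnitude of the vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbor(query, embeddings, distance):
    """Return the key of the embedding that is closest to the query."""
    return min(embeddings, key=lambda name: distance(query, embeddings[name]))


# Hypothetical embeddings.
embeddings = {"a": [10.0, 10.0], "b": [1.0, 0.0]}
query = [1.0, 1.0]

by_angle = nearest_neighbor(query, embeddings, cosine_distance)
by_length = nearest_neighbor(query, embeddings, euclidean_distance)
```

With the cosine distance the nearest neighbor is "a" (same direction), while with the Euclidean distance it is "b" (similar magnitude), so the choice of metric changes which samples are considered similar.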
2.3 Related Work
This section gives an overview of related work that has been done within the field of audio processing, with a focus on clustering of audio signals based on speech characteristics and deep learning approaches to audio classification.
The presented work will give a foundation for the methodology and conducted experiments that will be introduced in the following chapter.
Székely et al. [32] present a novel approach for clustering voices together
based on the voice quality parameters of the Liljencrants-Fant acoustic model
of the glottal source [33]. From these parameters the mean and variance over
short utterances are used as feature vectors. The data used in this study was
a 50 minutes long recording of an audio book with one single narrator. The
recording was divided up in smaller segments based on pause detection, with
an average length of 1.6 seconds. For each of these segments the features de-
scribed above were calculated for every 10ms window and the segments were
clustered based on these feature vectors, with the goal to identify and cluster
different expressive speech styles of the narrator. To assess the performance
of the clustering an A/B-test was conducted where the participants were pre-
sented a reference audio sample together with two more samples, one from the
same cluster and one from another cluster. The participants were then asked
to pick the sample which they perceived to be most similar to the reference
audio sample. The results showed that the presented approach was successful
in separating sentences associated with different styles of speech. This work
presents an interesting approach to how the evaluation of the clustering results,
Lukic et al. [34] use a CNN for speaker classification and clustering. The CNN is trained on the surrogate task of classifying different speakers from spectrogram representations of speech from the TIMIT data set [35], using both the full data set (670 different speakers) and a subset of the data (100 speakers). After acquiring 97% accuracy on classification of the speakers, clustering of the speakers is performed. By using the activation layers in the fully connected part of the CNN architecture as feature vectors they manage to reach the same level of performance as that of traditional models using manually engineered features (see, e.g., [36]) in the task of clustering voices together, without the need for handcrafting the features.
In [37] the same authors improved their approach and managed to tie with the previous model while using only a fraction of the data for training. Instead of training the CNN to identify all the different speakers in the corpus, a novel training approach is presented where binary pairwise constraints (same or different speaker) are used to teach the network to further separate dissimilar voices. To increase the distance between samples from different speakers while decreasing the distance between samples from the same speaker, each embedding is treated as a probability distribution and Kullback-Leibler (KL) divergence [5] is used to calculate the loss.
This approach was inspired by the work of Hsu et al. [38], where the authors used it to construct an ANN as an end-to-end unsupervised clustering model. In regular clustering methods there are usually two different steps required: 1) feature extraction (construction of feature space) and 2) clustering based on similarity/distance in this constructed feature space. Both of these steps open up for the introduction of human bias in the model, a problem that is removed in their proposed approach.
These papers, in combination with the recent use of embeddings in NLP
(e.g., [29], [30]), illustrate the potential of ANNs to find informative
representations (embeddings) of the data in an unsupervised fashion, requiring no
domain knowledge. The use of CNNs on spectrogram representations of the
audio data has the potential to extract features that are unknown to humans. By
using embeddings from the trained network to represent each audio sample,
representations of samples with similar attributes could potentially be found,
similar to how word embeddings contained the semantics of the words. As
evaluating similarity between voices is a subjective task standard cluster anal-
ysis would not be able to determine the success of the models. Therefore the
A/B-test performed in [32] can be adopted during the evaluation of the work
presented in this report, enabling us to answer the stated research question.
Methods
This chapter gives a detailed description of the experiments conducted in this study. First an overview of how the experiments were carried out is presented.
Secondly the resources used for the experiments are described, followed by a presentation of the data set and the preprocessing steps of said data. This is followed by a description of the models derived, their architecture and the settings of the hyper-parameters. Finally the method used for evaluating the embeddings, created by the models, is presented.
3.1 Project Overview
In Figure 3.1 the approach taken in this study is presented. The practical work consisted of 4 main parts, which will be briefly described below and then presented more thoroughly throughout this chapter.
1. Train Models
The first step was to develop and train two CNNs, one performing regular classification of audio samples using the cross-entropy loss function.
The other model used pairwise constraints between data points and a Kullback–Leibler divergence (KL) based loss function during training.
2. Create Embeddings
Once the models were trained, embeddings from different depths of the fully connected part of the networks were extracted for each sample, which were used to represent the audio sample in a high dimensional space.
3. Evaluate Embeddings
The evaluation of the embeddings was performed in two different ways: an objective evaluation and a subjective evaluation. The objective evaluation was performed through examining the neighborhood of samples in the embedding spaces. The subjective evaluation was performed through a user test, where participants were asked to select the narrator they perceived to be most similar to a reference narrator out of two choices: the narrator that was closest to the reference narrator in the embedding space and the one furthest away.
Figure 3.1: An illustration of how the work was carried out.
3.2 Resources
For the experiments, training of the models was performed on Google Cloud Platform (GCP) by deploying the models for training on the ai-platform (https://cloud.google.com/ai-platform/). The training was done using the STANDARD_1 scale tier on the platform, giving access to one master instance and four worker instances, each with 8 virtual CPUs and 7.20 GB of RAM.
Once training was completed the embedding extraction and evaluation were run locally on a Macbook Pro using Python. All the details about the machine and the packages used are presented in Table 3.1.
Table 3.1: Hardware and software resources used to run the experiments
Computer               Macbook Pro (15-inch, 2018)
Operating System       MacOS Mojave version 10.14.2
Processor              2.2 GHz Intel Core i7
Memory                 16 GB 2400 MHz DDR4
Programming Language   Python 3.7.1
Python packages        tensorflow 1.13.1, matplotlib 3.0.2, scikit-learn 0.20.1, numpy 1.15.4
3.3 Data
The data used in the experiments are snippets from audiobooks provided by the host company. Snippets from a total of 100 audiobooks were used for training the models, featuring 100 different narrators, equally distributed between males and females. The snippets are consecutive two-second sequences from a three-minute sample of each audiobook, resulting in 90 snippets per book, which gave a total of 100 × 90 = 9000 data points. The audio files are single channel (mono) MP3 files with a sampling frequency of 44.1 kHz and a bitrate of 64 kbit/s.
3.3.1 Preprocessing of Data
Each snippet of audio was first run through a pre-emphasis step, which amplifies high frequency components with respect to low frequency components in order to reduce noise in the audio snippets. Pre-emphasis of the signal was calculated as follows:
$$y[n] = x[n] - 0.97 \cdot x[n - 1]$$
where y represents the pre-emphasized signal and x the original signal. As low frequency components of the signal tend to change more slowly than high frequency components this procedure can be interpreted as removing this slow change from adjacent signal components, thus enhancing the more rapidly changing high frequency components.
Figure 3.2: An illustration of STFT on a signal and the meaning of the parameters n_hop and n_FFT.
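The pre-emphasis step can be sketched as a first-order difference filter. How the first sample is handled is not specified above; keeping it unchanged is an assumption made here, and the two toy signals are made up for illustration.

```python
def pre_emphasis(signal, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1]; the first sample is kept unchanged (assumption)."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]


slow = [1.0, 1.0, 1.0, 1.0]    # constant signal (lowest possible frequency)
fast = [1.0, -1.0, 1.0, -1.0]  # alternating signal (highest possible frequency)

slow_out = pre_emphasis(slow)
fast_out = pre_emphasis(fast)
```

The slowly changing signal is almost cancelled (values close to 0.03 after the first sample) while the rapidly changing signal is amplified (magnitudes close to 1.97), which matches the interpretation given above.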
After pre-emphasis the resulting signals were put through a Short-Time Fourier Transform (STFT) to construct the mel-spectrogram representation of the signals. For the STFT the settings presented in Table 3.2 were used, inspired by the work in [34].
Table 3.2: The settings used for the STFT
Parameter   Value
F_s         44.1 kHz
n_FFT       1024
n_hop       160
The n_FFT parameter dictates how many samples to use for each window when computing the Fast Fourier Transform (FFT), n_hop determines how many samples the window is moved between consecutive windows, and thereby the amount of overlap, and $F_s$ is the sampling rate used when sampling the signal. An illustration of this process and the meaning of the parameters is presented in Figure 3.2. These settings yielded a 128 × 552 spectrogram representation of each two second snippet, which can be derived as follows:
$$\frac{2 \times F_s}{\text{n\_hop}} = \frac{2 \times 44100}{160} \approx 552$$
since for a 2 second snippet there will be 2 × $F_s$ samples and the window will be moved forward 160 (n_hop) samples at a time. The height of 128 comes from the fact that 128 bins were used for mapping the frequency components from the STFT onto the mel scale.
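The width of the spectrogram can be checked with a short calculation. The exact division gives 551.25; the sketch below assumes that the signal is padded so that the first window is centred on the first sample, which gives one frame per hop plus one. This padding convention is an assumption and is not stated in the text.

```python
def n_frames(duration_s, sample_rate, n_hop):
    """Number of STFT frames, assuming centred (padded) windows."""
    n_samples = int(duration_s * sample_rate)
    return 1 + n_samples // n_hop


frames = n_frames(duration_s=2, sample_rate=44100, n_hop=160)
```

Under this assumption the calculation yields 552 frames, consistent with the 128 × 552 spectrograms described above.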
After the spectrograms were constructed dynamic range compression was applied on the resulting matrices by applying the element-wise function:
$$f(x) = \log(1 + x \cdot 10^4)$$
which reduces loud sounds and amplifies low sounds of the signal, as is suggested in [34]. Lastly the matrix is normalized column-wise, to get zero mean and a standard deviation of one for each time step in the spectrogram. This procedure of normalizing data is performed to speed up the convergence of the models (see, e.g., [39]).
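The two final preprocessing steps can be sketched as follows. The sketch assumes the natural logarithm, adds a small constant `eps` to avoid division by zero for constant columns, and uses a tiny made-up matrix in place of a real 128 × 552 spectrogram; none of these details are taken from the original implementation.

```python
import math


def compress(spec, c=1e4):
    """Element-wise dynamic range compression f(x) = log(1 + x * 10^4)."""
    return [[math.log(1.0 + x * c) for x in row] for row in spec]


def normalize_columns(spec, eps=1e-8):
    """Give every column (time step) zero mean and unit standard deviation."""
    n_rows = len(spec)
    out = [row[:] for row in spec]
    for j in range(len(spec[0])):
        col = [spec[i][j] for i in range(n_rows)]
        mean = sum(col) / n_rows
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n_rows)
        for i in range(n_rows):
            out[i][j] = (spec[i][j] - mean) / (std + eps)
    return out


# Hypothetical 3 x 2 "spectrogram": rows are frequency bins, columns time steps.
spec = [
    [0.0, 1.0],
    [0.001, 4.0],
    [0.1, 9.0],
]
z = normalize_columns(compress(spec))
```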
3.4 Models
The models were built using the machine learning library TensorFlow [40] developed at Google. The models are built as identical CNNs where only the cost function used differs.
3.4.1 Network Architecture
The network architecture used for both models is presented in Table 3.3. The inputs to this network were the 128 × 552 matrix representations of the spectrograms, which were described in Section 3.3.1. The network had 2 convolutional layers consisting of 32 and 64 4 × 4 kernels respectively, all using the ReLU activation function. The kernels used a stride of 1, meaning that they were slid one step at a time across the data grid. After each convolutional layer max-pooling was performed, also using a kernel of size 4 × 4 but this time with a stride of 2, halving the size of each dimension of the data. After the second max-pooling was performed the data was flattened to a 1 × 261120 vector, which was then fed to the fully connected part, consisting of a 3-layer feed-forward network where the activation function ReLU was applied to the output from the two hidden layers FC1 and FC2. Finally the Softmax function was applied in the 100-dimensional output layer, yielding a probability distribution across the 100 different narrators in the data set for each sample.
3.4.2 Regular Classification Model
The regular classification model had the structure presented in Table 3.3 and used the cross-entropy loss function, defined in Chapter 2, and the Adam [41] optimizer. This model was thus trained on classifying each snippet to the corresponding narrator, yielding a classifier with 100 classes. The output of this network, after the Softmax function has been applied, can be interpreted as a 100-dimensional probability vector where each entry represents the probability of a sample belonging to a certain narrator.

Table 3.3: The network architecture used for both models
Layer   N    Kernel Size    Stride   Activation Function
Conv1   32   4 × 4          1        ReLU
Pool1        4 × 4          2
Conv2   64   4 × 4          1        ReLU
Pool2        4 × 4          2
FC1          261120 × 200            ReLU
FC2          200 × 150               ReLU
Out          150 × 100               Softmax
3.4.3 Pairwise Constraint Model
The pairwise constraint (PWC) model used the same network structure as the regular model (presented in Table 3.3) but was optimized on a different cost function, still using the Adam optimizer. This model used pairwise constraints between all data points instead of calculating the cross-entropy loss used for classification. For each batch, $\binom{n}{2} = \frac{n(n-1)}{2}$ pairs, where n is the batch size, were constructed as (i, j, r) tuples, where i and j represent the indices of the two data points in the batch and r is an integer describing the relationship between the data points, where:
$$r = \begin{cases} 0 & \text{if } i \text{ and } j \text{ are from different narrators} \\ 1 & \text{if } i \text{ and } j \text{ are from the same narrator} \end{cases}$$
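The construction of the (i, j, r) tuples for one batch can be sketched with the standard library. The function name and the example narrator labels are assumptions made for illustration.

```python
from itertools import combinations


def build_pairs(labels):
    """Return all (i, j, r) tuples for a batch, with r = 1 for the same narrator."""
    return [
        (i, j, int(labels[i] == labels[j]))
        for i, j in combinations(range(len(labels)), 2)
    ]


# Hypothetical batch of 4 samples with narrator ids 3, 7, 3 and 5.
batch_labels = [3, 7, 3, 5]
pairs = build_pairs(batch_labels)
```

For a batch of size n this yields n(n − 1)/2 tuples, here 6, of which only the pair (0, 2) comes from the same narrator.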
Between all the $\binom{n}{2}$ pairs in a batch a Kullback-Leibler (KL) divergence based loss function is used, which uses the pairwise relationships between the data points. In this setting the embeddings of a pair of points are treated as two discrete probability distributions, $\mathbf{P}$ and $\mathbf{Q}$. The task of the loss function is, through the use of KL-divergence, to minimize the difference between the probability distributions (embeddings) of similar points and maximize it between dissimilar points. The loss function, introduced in [38], is defined as follows:
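$$L(\mathbf{P}, \mathbf{Q}) = l(\mathbf{P} \,\|\, \mathbf{Q}) + l(\mathbf{Q} \,\|\, \mathbf{P})$$
where:
$$l(\mathbf{P} \,\|\, \mathbf{Q}) = I_s \cdot KL(\mathbf{P} \,\|\, \mathbf{Q}) + I_{ds} \cdot \max(0, \text{margin} - KL(\mathbf{P} \,\|\, \mathbf{Q}))$$
and:
$$I_s = \begin{cases} 1 & \text{if } r = 1 \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad I_{ds} = \begin{cases} 1 & \text{if } r = 0 \\ 0 & \text{otherwise} \end{cases}$$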
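A minimal sketch of this loss for a single pair is given below, using the standard definition of the KL-divergence between two discrete distributions. The margin value of 2.0, the small constant `eps` and the example distributions are assumptions made for illustration and are not the settings used in the experiments.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def one_sided_loss(p, q, r, margin):
    """l(P || Q): plain KL for similar pairs, hinge on the margin otherwise."""
    kl = kl_divergence(p, q)
    return kl if r == 1 else max(0.0, margin - kl)


def pairwise_loss(p, q, r, margin=2.0):
    """L(P, Q) = l(P || Q) + l(Q || P); the margin value is an assumption."""
    return one_sided_loss(p, q, r, margin) + one_sided_loss(q, p, r, margin)


p = [0.9, 0.1]  # hypothetical Softmax output for sample i
q = [0.1, 0.9]  # hypothetical Softmax output for sample j

same_close = pairwise_loss(p, p, r=1)  # identical outputs, same narrator
same_far = pairwise_loss(p, q, r=1)    # different outputs, same narrator
diff_close = pairwise_loss(p, p, r=0)  # identical outputs, different narrators
diff_far = pairwise_loss(p, q, r=0)    # different outputs, different narrators
```

The loss is low when similar pairs have similar outputs or dissimilar pairs have outputs that differ by at least the margin, and high in the two opposite cases.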
The KL-divergence, $KL(\mathbf{P} \,\|\, \mathbf{Q})$, is a measure of how well the distribution $\mathbf{Q}$ approximates the true distribution, $\mathbf{P}$, and is defined as:
$$KL(\mathbf{P} \,\|\, \mathbf{Q}) = \sum_i p_i \cdot \log\left(\frac{p_i}{q_i}\right), \quad p_i \in \mathbf{P},\ q_i \in \mathbf{Q}$$
which is very similar to the cross-entropy, defined in Chapter 2, with the only difference being the term in the logarithm of the function. The function can be rewritten as:
KL(P P P ||Q Q Q) = X
i
p
i· log( p
iq
i) = X
i
p
ilog q
iX
i