
Generating Training Data for Keyword Spotting given Few Samples

PIUS FRIESCH

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Generating Training Data for Keyword Spotting given Few Samples

PIUS FRIESCH (friesch@kth.se)

Master in Machine Learning
Date: June 4, 2019
Supervisor: Jonas Beskow
Examiner: Sten Ternström
School of Electrical Engineering and Computer Science
Host company: snips.ai

Swedish title: Generering av träningsdata för nyckelordsigenkänning utifrån ett fåtal exempel


Abstract

Speech recognition systems generally need a large quantity of speech data covering highly variable voices and recording conditions in order to produce robust results. In the specific case of keyword spotting, where only short commands are recognized instead of large vocabularies, the resource-intensive task of data acquisition has to be repeated for each keyword individually. Over the past few years, neural methods in speech synthesis and voice conversion have made tremendous progress and generate samples that are realistic to the human ear. In this work, we explore the feasibility of using such methods to generate training data for keyword spotting. In detail, we want to evaluate whether the generated samples are indeed realistic or only sound so, and whether a model trained on these generated samples can generalize to real samples. We evaluated three neural network speech synthesis and voice conversion techniques: (1) Speaker Adaptive VoiceLoop, (2) Factorized Hierarchical Variational Autoencoder (FHVAE), (3) Vector Quantised-Variational AutoEncoder (VQVAE).

These three methods are evaluated as data augmentation or data generation techniques on a keyword spotting task. The performance of the models is compared to a baseline of changing the pitch, tempo, and speed of the original samples. The experiments show that the neural network techniques can provide up to a 20% relative accuracy improvement on the validation set. The baseline augmentation technique performs at least twice as well. This seems to indicate that using multi-speaker speech synthesis or voice conversion naively does not yield varied or realistic enough samples.


Sammanfattning

Speech recognition systems generally need a large amount of training data with varying voice and recording conditions in order to produce robust results. In the specific case of keyword spotting, where only short commands are recognized instead of large vocabularies, resource-intensive data collection must be carried out for each keyword individually. In recent years, neural methods in speech synthesis and voice conversion have made great progress and generate speech that is realistic to the human ear. In this work we investigate the possibility of using such methods to generate training data for keyword spotting. In detail, we want to evaluate whether the generated training data is indeed realistic or only sounds so, and whether a model trained on these generated examples generalizes well to real speech. We evaluated three methods for neural speech synthesis and voice conversion: (1) Speaker Adaptive VoiceLoop, (2) Factorized Hierarchical Variational Autoencoder (FHVAE), (3) Vector Quantised-Variational AutoEncoder (VQVAE).

These three methods are used either to generate training data from text (speech synthesis) or to enrich an existing dataset by simulating several different speakers using voice conversion, and they are evaluated in a keyword spotting system. The performance of the models is compared to a baseline based on traditional signal processing, where the pitch and tempo of the original training data are varied. The experiments show that the neural network methods can provide up to a 20% relative accuracy improvement on the validation set compared to the original training data. The signal-processing baseline performs at least twice as well. This seems to indicate that using multi-speaker speech synthesis or voice conversion does not provide sufficiently varied or representative training data.


Contents

1 Introduction
2 Background
   2.1 Data Augmentation
   2.2 Keyword Spotting
   2.3 Generative Models
      2.3.1 Autoencoders
      2.3.2 Variational Autoencoders
      2.3.3 Generative Adversarial Nets
      2.3.4 Text to Speech Models
   2.4 Generative Models for Audio
      2.4.1 VoiceLoop
      2.4.2 Factorized Hierarchical Variational Autoencoder
      2.4.3 Vector Quantised-Variational AutoEncoder
   2.5 Feature extraction
3 Method
   3.1 Speaker Adaptive VoiceLoop
   3.2 Factorized Hierarchical Variational Autoencoder
   3.3 Vector Quantised-Variational AutoEncoder
4 Experiments
   4.0.1 Modified VQVAE architecture
   4.1 Evaluation Setup
      4.1.1 Qualitative Evaluation
      4.1.2 Quantitative Evaluation
   4.2 Experiments
      4.2.1 Experiments: VoiceLoop
      4.2.2 Experiments: Scaleable FHVAE
      4.2.3 Experiments: VQVAE
5 Results
   5.0.1 Speaker VoiceLoop
   5.0.2 Voice Conversion Results
   5.1 Qualitative Evaluation
      5.1.1 Speaker VoiceLoop
      5.1.2 FHVAE
      5.1.3 VQVAE
6 Discussion
7 Conclusion
Bibliography


1 Introduction

Recent advances in automatic speech recognition (ASR) [1, 2, 3, 4, 5, 6] systems using neural network based approaches have brought the state of the art (SOTA) to a human or even superhuman level on constrained tasks. A common requirement of deep learning approaches is the need for large amounts of training data to reach this level. For common languages like Chinese, English, etc., these large amounts of transcribed speech are usually readily available, in contrast to languages with relatively few speakers, where it is considerably more difficult to acquire the amount of transcribed data needed for well-performing models.

ASR mainly considers the case of large vocabulary recognition or phoneme recognition in combination with a language model. For online ASR it is usually not feasible to run the full ASR system all the time. For this reason, smaller models are run in practice to listen for a wake word that starts the general ASR. In environments where the hardware resources are heavily constrained, general ASR models might not be used at all; especially when only small amounts of memory are available, bigger ASR models are unsuitable. Wake word or keyword spotting models are very small networks highly fitted to a single keyword or a few keywords. Given the requirement for small resource usage, deep learning wake word models are not trained from a general large vocabulary speech corpus but need a specialized training set containing only the particular commands. Therefore, adding a keyword entails a substantial effort to acquire training data. This provides a strong incentive to find methods that help to lower the amount of data needed.

One approach could be to synthesize training data using a neural network learned on a large vocabulary speech corpus and then generate samples of the given keyword which then are used to train the smaller keyword spotting model.

Given that the learned keyword spotting model should generalize well to many speakers, the generated samples should cover a broad range of different voices.

Another method could be to follow an approach which has proven to be effective in the field of image recognition, namely, augmenting the training data. Data augmentation does not change the corresponding label and can help to teach the model invariances in the data. Similar methods that force an implicit bias for certain invariances are widely used in speech recognition [7, 8, 9, 10]. These digital signal processing (DSP) methods help neural networks to generalize better in most cases. These proposed methods only cover variances in time, pitch or vocal tract length. However, speakers are furthermore distinguished by their exact physical build, accent, prosody or other attributes. Given that these attributes are hard to quantify and thus not easily varied using DSP methods, we turn to approximate neural network methods in this work.

The task of turning a sample of one speaker into another sample with the same content but said by a different speaker is commonly referred to as voice conversion. The straightforward supervised method is to create a parallel dataset with time annotations of multiple speakers saying the same thing and then learn how to do the transformation. However, this kind of data is hard to acquire and usually only available in small quantities. Unsupervised methods that discover an alignment would not have the issue of costly data. Similar to classical methods for voice conversion that rely on small parallel datasets to convert speech to a limited number of speakers, the classical methods for TTS rely on selecting short samples of recorded speech and stitching these together to form new outputs. Yet, in recent literature more parametric neural network methods have been proposed and shown to result in large improvements in naturalness and scalability compared to classical methods by utilizing large amounts of data. Given such progress in recent years, we want to evaluate whether samples produced by these methods can be used as training data for the specific task of keyword spotting. We chose to evaluate the three following methods:

1. Speaker Adaptive VoiceLoop augments the basic VoiceLoop text-to-speech model with a speaker encoding prediction network. Thus, it enables fast adaptation to new speakers and therefore more variability in the network output compared to static speaker embeddings learned during training.

2. Factorized Hierarchical Variational Autoencoder (FHVAE) is a take on disentangled Variational Autoencoders (VAE) which promises to find hidden local and global features of audio samples in an unsupervised manner. The local features can be recombined with the hidden global features, which in our case represent speaker information, of a different audio sequence to synthesize the audio in the given speaker's style.

3. Vector Quantised-Variational AutoEncoder (VQVAE) finds discrete speaker-independent local codes from speech samples in an unsupervised manner. The found discrete codes are combined with different speaker encodings to reconstruct a given audio sample in the voice of a different speaker. The synthesizing part of the originally proposed network is replaced by a Long short-term memory (LSTM) based architecture instead of a WaveNet based architecture in order to allow for faster generation of a large number of samples.

First, we train these methods on multi-speaker datasets in order to create trained models which can produce samples with a variety of speakers. These trained networks can then be used for data generation. The produced samples are tested on a keyword spotting task in order to test the augmentation performance. In order to generate training data, a phoneme transcription of the different keywords is given to the Speaker Adaptive VoiceLoop model to generate keyword audio samples. The trained FHVAE and VQVAE models augment different, smaller amounts of training samples by reconstructing them with different global styles. The different numbers of provided samples correlate with the effort needed to acquire data for a new keyword. Finally, the accuracy of predicting the different keyword classes is reported. As the baseline augmentation technique, random pitch and speed adjustments of the given audio samples were chosen.

The neural network techniques produced subjectively good and varied generated samples. Applying them naively to the keyword spotting task improved the performance only when few base samples are given; with larger amounts of base samples, which already contain more variety, no improvement is seen. Furthermore, the baseline augmentation technique performs at least twice as well. Finally, we discuss possible reasons for the difficulty of training the keyword spotting task on the generated samples.

The remainder of the thesis is organized as follows. Chapter 2 describes related approaches to data augmentation in automatic speech recognition and keyword spotting models, as well as generative models for audio; an overview of the three methods used in this work is also given. In Chapter 3 we describe how the neural network methods are trained and used to generate the training data and for keyword spotting. This is followed by Section 4.1, where we describe how we evaluate the chosen methods. How the three data generation methods are trained is described in Section 4.2. The performance on the keyword spotting task is reported in Chapter 5, followed by a qualitative evaluation in Section 5.1.

Finally we discuss the results in Chapter 6 and conclude the thesis in Chapter 7.


2 Background

2.1 Data Augmentation

A common practice in machine learning is the use of data augmentation. In a vision task, for example, one could use mirroring, rotation or zooming, which do not influence the resulting label of the image. Similarly, audio data can be augmented without changing the resulting label. The factors to which the model should be invariant change from application to application. Commonly expected invariances of models in ASR are the quality of the recording, the different attributes of the speaker (e.g. pitch, prosody, accent, ...), environmental noise, the distance to the microphone, or reverberations of the sound in closed rooms. This invariance is often correlated with how well a model generalizes to new environments and scenarios the model has not been exposed to during training time.

One of the simplest methods is adding a noise track to the audio. This can come from varying recorded environments like a noisy room or a factory setting. Hannun et al. [2] report that this helped the performance of their ASR system, especially on their noisy test set. They also stated that the length of the recorded noise sample should not be too small, since this could lead to the network overfitting to or remembering the specific added noise. However, just adding noise fails to capture the Lombard effect [11], which describes the tendency of speakers to increase their pitch and intonation to overcome a loud and noisy environment. The authors of that work recorded special data for this use case, where speakers were exposed to a noisy environment through headphones.

Another simple method is resampling of the audio signal, thereby speeding up or slowing down the signal [7]. This affects the pitch or fundamental frequency of the audio, so methods which keep either the speed or the pitch constant [12] can also be used to modify the audio. Collobert, Puhrsch, and Synnaeve [6] found that stretching helps with small datasets, but the effect vanishes for bigger datasets.
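To make these simple manipulations concrete, the sketch below shows additive noise at a chosen SNR, plain resampling (which changes tempo and pitch together), and independent tempo/pitch changes. It is only an illustration, assuming numpy and librosa >= 0.10; the parameter ranges are arbitrary example values, not the configurations of the cited papers or of this work.

```python
# Minimal sketch of the simple DSP augmentations discussed above.
import numpy as np
import librosa


def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise track into the speech signal at a given SNR (in dB)."""
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]          # loop/trim to cover the utterance
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def random_speed(speech: np.ndarray, sr: int, low: float = 0.9, high: float = 1.1) -> np.ndarray:
    """Resample the signal, changing tempo and pitch together."""
    factor = np.random.uniform(low, high)
    return librosa.resample(speech, orig_sr=sr, target_sr=int(sr * factor))


def random_tempo_pitch(speech: np.ndarray, sr: int) -> np.ndarray:
    """Change tempo and pitch independently of each other."""
    stretched = librosa.effects.time_stretch(speech, rate=np.random.uniform(0.9, 1.1))
    return librosa.effects.pitch_shift(stretched, sr=sr, n_steps=np.random.uniform(-2, 2))
```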

Another proposed approach is vocal tract length perturbation (VTLP) [8], which takes the inverse approach of vocal tract length normalization (VTLN). VTLN helps to learn better speaker invariance with respect to vocal tract length by normalizing the input data of different speakers to a common mean, using speaker-dependent warp factors based on the physical structure of the speaker's vocal tract. VTLP inverts this by using randomly generated warp factors to artificially generate more varied samples.

Frequency-axis random distortion, as introduced in Kanda, Takeda, and Obuchi [13], adds uniform noise to the spectrogram, which is then transformed based on local averaging in small patches. This adds a distortion which changes the input spectrogram in local patches instead of shifting it based on a global value.

Another approach to augment training data is to apply the impulse response of a room to the audio signal. This simulates the case of far-field recordings and helps in these cases, but might hurt close-field performance according to Arik et al. [14]. Impulse responses can also be synthesized without the need for physical measurements [15].

2.2 Keyword Spotting

The task of spotting only specific keywords is an active area of research next to general ASR. These networks are used exactly in those cases where general ASR systems are not viable, mostly as small-footprint networks that can be run very resource-efficiently at inference time. Given that the number of keywords is highly limited, the commands can be predicted directly from the whole sequence instead of predicting sequences. Before the recent surge of interest in neural networks, HMMs with sequence search algorithms were commonly used; however, even a simple 3-layer DNN outperformed these systems in Chen, Parada, and Heigold [16]. Using more complex architectures like CNNs [17] or combining convolutional with recurrent elements [14] showed to be similarly resource efficient while improving the performance. One drawback of implementing RNNs for an online task is that one has to address the problem of a diverging hidden state: training is done on short sequences, but at inference the network is run online without resetting the hidden state. To reproduce the same setting as in training, one would have to clear the hidden state of the network regularly to prevent a diverging state. Alternatively, one could choose to train the network emulating an online setting, as seen in Hwang, Lee, and Sung [18]. For this work, the choice fell on a simple convolutional architecture similar to LeCun et al. [19], given that convolutional networks appear to still be competitive while having low complexity.
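As an illustration of this kind of small-footprint convolutional keyword classifier, the following PyTorch sketch maps a fixed-size log-Mel spectrogram to one of a handful of keyword classes. The layer sizes, input dimensions and number of classes are arbitrary assumptions, not the exact architecture used in this work.

```python
# A small convolutional keyword-spotting classifier on fixed-size log-Mel inputs (sketch).
import torch
import torch.nn as nn


class KeywordCNN(nn.Module):
    def __init__(self, n_classes: int = 10, n_mels: int = 40, n_frames: int = 100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * (n_mels // 4) * (n_frames // 4), n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, n_frames) -> logits over the keyword classes
        h = self.features(x)
        return self.classifier(h.flatten(start_dim=1))


logits = KeywordCNN()(torch.randn(8, 1, 40, 100))   # shape (8, 10)
```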

2.3 Generative Models

2.3.1 Autoencoders

Variational autoencoders take their name from basic autoencoders, which can be seen as learned compression networks. Common uses for autoencoders are de-noising and feature or representation learning. Given an input datapoint which is fed through a network with an information bottleneck, the model has to reconstruct the original input. The network is optimized to reconstruct the input; however, it cannot simply copy the input because of the bottleneck. This forces the network to find an information-rich representation. Such a dense representation is also interesting for use in discriminative models, since it is learned from the data in a completely unsupervised way, without the need for manual feature engineering. Yet, classical autoencoders produce representations as point estimates in a hidden space. Thus, manipulating this hidden space to generate new samples is quite limited.
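To make the bottleneck idea concrete, a minimal fully connected autoencoder might look as follows (a sketch only; the dimensions are arbitrary).

```python
# Minimal autoencoder sketch: the 16-dimensional bottleneck forces the model to
# learn a compact representation instead of copying its input.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))
decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.randn(32, 784)                  # a batch of flattened inputs
x_hat = decoder(encoder(x))               # reconstruction from the bottleneck
loss = nn.functional.mse_loss(x_hat, x)   # reconstruction objective
```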

2.3.2 Variational Autoencoders

In order to generate new samples, one could choose to estimate a probabilistic distribution over the hidden variables, $p(z)$, that represents the underlying structure of the data, from the given data x via $p(z|x)$. Given a function $p(x|z)$, this structure can then be used to generate new and realistic samples by sampling from $p(z)$. However, it is not clear what this underlying distribution looks like given the data. We could instead define an approximate inference model $q(z|x)$, for which we can choose the distribution, and then choose an optimization strategy to reduce its divergence from the underlying distribution. Yet, this seems like circular reasoning, since we do not know the underlying distribution. To work around this, Kingma and Welling [20] introduced the variational lower bound:

$$\mathcal{L}(\theta, \phi; x^{(i)}) = -D_{KL}\big(q_\phi(z|x^{(i)}) \,\|\, p(z)\big) + \mathbb{E}_{q_\phi(z|x^{(i)})}\big[\log p_\theta(x^{(i)}|z)\big] \tag{2.1}$$

where $q_\phi(z|x^{(i)})$ is the inference network that predicts the variational parameters of the distribution over z, $p(z)$ is the Gaussian prior over the hidden variables, and $\mathbb{E}_{q_\phi(z|x^{(i)})}\big[\log p_\theta(x^{(i)}|z)\big]$ represents the reconstruction error.

Figure 2.1: Left: Sampling operation from a Gaussian distribution. Right: Sampling operation from a reparameterized Gaussian distribution. The sampling operation shown in red symbolizes the non-differentiable sampling operation in the computational graph.

Furthermore, the model infers the parameters of the hidden distribution and then samples from it. This sampling operation is not differentiable. To alleviate this issue, Kingma and Welling [20] proposed a reparameterization trick. In the case of a univariate Gaussian, the hidden variable z is sampled from the distribution $p(z|x) = \mathcal{N}(\mu, \sigma^2)$, where $\mu$ and $\sigma$ are predicted by the inference network. Thus, z can be reparameterized as $z = \mu + \sigma\epsilon$, where $\epsilon$ is an auxiliary noise variable $\epsilon \sim \mathcal{N}(0, 1)$. As seen in Figure 2.1, when the reparameterization trick is used, the sampling operation is moved out of the computational graph, which makes it possible to propagate the gradient around the sampling operation.
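A minimal sketch of the reparameterized encoder and the lower bound of eq. 2.1 (standard Gaussian prior, diagonal Gaussian posterior) could look as follows; the network sizes are arbitrary and a squared-error reconstruction term stands in for the likelihood.

```python
# Reparameterization trick: sample z = mu + sigma * eps with eps ~ N(0, I),
# so the stochastic node sits outside the differentiable path.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianEncoder(nn.Module):
    def __init__(self, x_dim: int = 784, z_dim: int = 16, h_dim: int = 128):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.hidden(x)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)               # auxiliary noise, N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps   # differentiable w.r.t. mu and logvar
        return z, mu, logvar


def elbo(x, x_hat, mu, logvar):
    # Closed-form KL(q(z|x) || N(0, I)) plus a squared-error reconstruction term.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    rec = F.mse_loss(x_hat, x, reduction="none").sum(dim=1)
    return -(rec + kl).mean()    # maximize the ELBO <=> minimize (rec + kl)
```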

A recent extension of the standard Variational Autoencoder is the β-Variational Autoencoder [21], which adds a tighter bottleneck on the hidden Gaussian distributions by adding a weight on the prior term and therefore forces the hidden distributions to be more disentangled.

One idea that is very close to the standard VAE is proposed in Hsu et al. [22], where a speaker representation is concatenated to the hidden variable z. Thus, the local or phonetic information gets encoded in z while the speaker information is given externally. At inference time, the speaker information is changed in order to reconstruct the sample in a different voice.


2.3.3 Generative Adversarial Nets

The basic idea of Generative Adversarial Nets (GAN) is that a generative network provides samples which are rated by an adversarial network that discriminates whether a given sample is real or generated. This makes it possible to model only the generation process from the random variable z; thus, in the standard GAN no approximation is necessary for the inference of z, in contrast to a Variational Autoencoder. The generator network tries to fool the discriminator network by producing realistic samples. This leads to a minimax game, where either the generator network succeeds by producing realistic samples or the discriminator succeeds by exposing fake samples. The model converges when the discriminator cannot distinguish between real and fake samples and randomly guesses. However, practice has shown that GANs are hard to train, given the missing convergence guarantee [23].

One example of how this could be used for voice conversion is presented in Hsu et al. [24], where the idea is to add a GAN objective to the VAE framework to model voice conversion more directly using a Wasserstein objective.

In a more pure GAN setting, Kaneko and Kameoka [25] propose to use a CycleGAN-style network for voice conversion. The main idea of a CycleGAN is that when a sample is converted from one style to another and then back to the original one, the original sample should be reconstructed. This can be done from the direction of both domains and thus be learned without supervised parallel data.

2.3.4 Text to Speech Models

As an alternative to detecting local phonetic features in an unsupervised fashion, one could also use text or phoneme labels that have been annotated manually. This could free the model from trying to approximate a good local encoding, so that its modeling capacity can focus on modeling the speaker information better. Text-to-speech models have been shown to produce the best-quality samples as rated by human judges. Recent lines of popular research are Deep Voice 1, 2 and 3 [26, 27, 28]. Deep Voice 3, the most recent of these, follows a sequence-to-sequence TTS architecture. The model first encodes the given character or phoneme sequence into a learned encoding. An autoregressive convolutional decoder is then used to predict the next frame or frames of the Mel spectrogram output. Given the autoregressive nature, new frames depend on previously predicted frames. However, during training, the autoregressive decoder network is trained using teacher forcing, feeding the network ground truth frames instead of the previously predicted frames.


The encoded input is attended to by an attention module. In Deep Voice 3 an unconstrained attention mechanism is used, which can lead to skipping or reverting in time. The authors reported a heuristic during inference that forces the attention to be monotonic and found that this gave better performance in their experiments compared to purely monotonic attention. Furthermore, several decoding strategies for raw audio were evaluated; more on this in Section 2.5.

Similarly, the Tacotron 1 and 2 models [29, 30] rely on a sequence-to-sequence architecture which encodes the input text into a fixed encoding that is then attended over by an attention module queried by the decoder. The main differences to Deep Voice 3 are that the network mostly uses recurrent neural networks in its encoder and decoder architecture, as well as an attention module that is encouraged to be monotonic. Furthermore, the decoder is also learned using teacher forcing, which on its own would lead the network to only learn how to predict the next frame and not how to model the sequence. Like in Deep Voice 3, the previous output, or here the ground truth frame, is therefore compressed using a shallow two-layer module followed by dropout [31], labeled as the PreNet. Moreover, as seen in Figure 2.2, the architecture is extended by a PostNet in order to improve the quality of the predicted spectrogram; the output of this residual layer is added to the output of the decoder to form the final predicted spectrogram. A completely different approach is the recently proposed VoiceLoop architecture [32]. Instead of using common recurrent architectures, the authors introduced a novel method based on a shifting memory. They use a similar teacher forcing method, but instead of dropping out neurons, they add noise to prevent the network from learning only to predict the next frame.

Figure 2.2: Schema of the Tacotron 2 architecture: a bi-directional linguistic encoder, location-sensitive attention, a 2-layer pre-net, 2 LSTM decoder layers with a linear projection to the mel spectrogram, a 5-convolution-layer post-net, and a WaveNet MoL vocoder producing the waveform samples.

Given that these models can produce samples that come close to being indistinguishable from a real human voice, the recent research focus has shifted to modeling the expressiveness and variability of the output. For example, Skerry-Ryan et al. [33] add a prosody embedding to their encoding process. This is predicted from a sample sequence and then put through a strong bottleneck which reduces the sequence to a fixed-length encoding. The input of the decoding network is then composed of this prosody embedding, the speaker embedding and the local encoding predicted from the text transcription. Similar approaches that model these global properties of an utterance in an unsupervised fashion achieve a similar goal while differing in the details; see Hsu et al. [34], Henter, Wang, and Yamagishi [35], and Wang et al. [36].

2.4 Generative Models for Audio

In this section, the general concepts of the three generative neural networks used in the experiments are presented. All of these models have in common that they can model different factors of the audio, for instance the hidden global information of the speaker and the local information of the content. Each model uses a different degree of supervision: while the VoiceLoop model requires phoneme transcriptions and speaker IDs at training time, VQVAE only requires the speaker ID in addition to the audio, and FHVAE is completely unsupervised.

2.4.1 VoiceLoop

The VoiceLoop [32] architecture is a text-to-speech model which takes phonemes as input and predicts vocoder features, which in turn can be synthesized to raw audio. When a speaker identity is given, the model can be trained to model the speech of different speakers.

With the VoiceLoop neural network, a unique recurrent network architecture to model the sequence was proposed. The main difference to classical recurrent neural networks (RNNs) is the use of a stacked memory in the recurrent architecture instead of a single hidden state. A number of hidden states from previous timesteps are saved in a FIFO-queue style, so there are always S hidden states saved. As seen in Figure 2.3, at each iteration a new state is added to the beginning and the oldest one is discarded. Therefore, the time context is implicit in the model, instead of the model having to learn to compress information over time into a single hidden state.

Two inputs are given to the network. First, phonemes with force-aligned silence information are fed into the network. This is followed by an embedding of the speaker ID.

Figure 2.3: Schema of the VoiceLoop architecture.

In order for the network to focus on a specific part of the given phoneme sequence, an attention module $\alpha_t = N_a(S_{t-1})$ is added. Because of this, the decoding network never sees the full phoneme sequence, but rather a weighted accumulation over the phoneme sequence as a context vector $c_t = \alpha_t \cdot x_{\text{phn}}$. To compute $\alpha_t$, a monotonic attention mechanism similar to Graves [37], in the form of a mixture of Gaussians, is used. Each mean of the Gaussian distributions is advanced monotonically based on the current memory state. This means that the model can never go back to focus on already seen phonemes.

A shallow network of two fully connected layers is used to predict the next memory frame based on the weighted phoneme input and the speaker information, $u_t = N_u(z, S_{t-1}, \alpha_t \cdot x_{\text{phn}})$. The new memory frame is added to the buffer, while the oldest frame is discarded. The final output of the network, $o_t$, is computed by another shallow two-layer network $N_o$ which takes the current complete buffer state $S_t$.
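The buffer mechanism can be sketched as follows: a fixed number of memory frames is kept, a new frame $u_t$ is computed from the speaker vector, the attention context and the current buffer, pushed to the front, and the oldest frame is dropped. The simple MLPs and all sizes below are placeholders standing in for $N_u$ and $N_o$, not a faithful reimplementation of VoiceLoop.

```python
# Sketch of VoiceLoop's shifting memory buffer (FIFO over S memory frames).
import torch
import torch.nn as nn

S, d_mem, d_spk, d_ctx, d_out = 20, 256, 64, 64, 63

N_u = nn.Sequential(nn.Linear(d_spk + d_ctx + S * d_mem, 256), nn.ReLU(),
                    nn.Linear(256, d_mem))           # predicts the new memory frame
N_o = nn.Sequential(nn.Linear(d_spk + S * d_mem, 256), nn.ReLU(),
                    nn.Linear(256, d_out))           # predicts vocoder features

buffer = torch.zeros(S, d_mem)                       # S_{t-1}, newest frame first
z, c_t = torch.randn(d_spk), torch.randn(d_ctx)      # speaker vector, attention context

u_t = N_u(torch.cat([z, c_t, buffer.flatten()]))     # new memory frame u_t
buffer = torch.cat([u_t.unsqueeze(0), buffer[:-1]])  # push front, drop oldest -> S_t
o_t = N_o(torch.cat([z, buffer.flatten()]))          # output vocoder features o_t
```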

Given that speech is not deterministic when derived from a sequence of phonemes, it is very unlikely for the network to predict exactly the same sequence as the ground truth. However, when predicting only the next frame, the range of likely outputs is more limited. Thus, in order to guide the network during training to predict the same output as the ground truth, the real sequence of vocoder features is fed into the network: instead of using the previously generated frames, the ground truth frames are used. This technique is usually referred to as teacher forcing. Normally distributed noise is added to the ground truth to prevent the network from just predicting the next frame from the previous output frames while ignoring the phoneme and speaker encoding inputs.

The output is learned via a mean squared error (MSE) loss comparing the predicted vocoder features with the ground truth features.
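As a small illustration of this training step, the decoder receives the noisy ground-truth frame of the previous timestep instead of its own prediction; the noise scale and the trivial stand-in model below are arbitrary placeholders, not values from this work.

```python
# Teacher forcing with additive noise (sketch).
import torch
import torch.nn.functional as F


def teacher_forced_loss(model_step, targets: torch.Tensor, noise_std: float = 2.0):
    """targets: (T, d) ground-truth vocoder frames; model_step maps a frame to the next."""
    loss = 0.0
    for t in range(1, targets.shape[0]):
        noisy_prev = targets[t - 1] + noise_std * torch.randn_like(targets[t - 1])
        loss = loss + F.mse_loss(model_step(noisy_prev), targets[t])
    return loss / (targets.shape[0] - 1)


frames = torch.randn(100, 63)
print(teacher_forced_loss(lambda prev: prev, frames))   # identity model as a stand-in
```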

Speaker Adaptive VoiceLoop

Figure 2.4: Schema of the Speaker Adaptive VoiceLoop architecture.

The VoiceLoop architecture is extended in Nachmani et al. [38] by adding a speaker extraction network rather than providing the speaker ID to the model as supervision. The speaker ID is not given but instead extracted by the model from a given audio sample into a fixed-size vector.

In order to predict a fixed-length speaker encoding, the ground truth vocoder features $x_{\text{vocoder}} = x_1, \dots, x_T$ are fed into the speaker network $N_s$ as additional input. This encoding $z$ is then used instead of the speaker ID to predict the next memory frame. The speaker encoding is also fed to the output network $N_o$ in addition to the current buffer. Besides the main MSE loss for predicting the vocoder features, two other losses are added to improve the learned speaker encoding.

First, a triplet loss is used, which takes the current sample plus another sample from the same speaker and a sample from a different speaker as inputs. The optimizer then reduces the MSE between the samples of the same speaker and increases the MSE between the sample and the other speaker up to a given margin. In addition, as seen in Figure 2.4, a cycle loss is employed which reduces the MSE between the speaker vector derived from the output of the network and the speaker vector derived from the ground truth.

Given that the speaker network has to encode the complex information of a whole voice, a shallow two-layer linear network is not used here, contrary to the theme of the proposed method. Instead, the speaker recognition network consists of several convolutional layers followed by an average pooling layer over time to reduce the output to a single speaker encoding.
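A sketch of such a convolutional speaker encoder (temporal convolutions followed by average pooling over time), together with the margin-based triplet objective described above, is shown below. The feature dimensions, layer sizes and margin are placeholder values, not the ones used by Nachmani et al. or in this work.

```python
# Sketch: convolutional speaker encoder with temporal average pooling, plus an
# MSE-based triplet objective with a margin.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerEncoder(nn.Module):
    def __init__(self, n_feats: int = 63, d_spk: int = 64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_feats, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, d_spk, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_feats, T) vocoder features -> (batch, d_spk) speaker vector
        return self.convs(x).mean(dim=-1)    # average pooling over time


def speaker_triplet_loss(anchor, positive, negative, margin: float = 0.4):
    # Pull same-speaker encodings together, push different-speaker ones apart.
    pos = F.mse_loss(anchor, positive)
    neg = F.mse_loss(anchor, negative)
    return pos + F.relu(margin - neg)


enc = SpeakerEncoder()
a, p, n = (enc(torch.randn(4, 63, 120)) for _ in range(3))
print(speaker_triplet_loss(a, p, n))
```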

Priming

A common theme in recurrent neural networks is the presence of a memory which, at the beginning of a sequence, is set to its initial value, independent of the sequence to be generated. This means that no contextual information is saved yet. The hidden state can also be initialized using the hidden state of a previous run on a different sample. This technique is labeled priming. Given that voice, prosody or emotion mostly stay consistent during a given sample, the intuition is that RNNs memorize this in the hidden state, so a particular style could be enforced by giving an example sequence. Our experiments during the development process did not show a noticeable subjective difference in the resulting output when using other, very similar samples to prime the memory. However, the original authors show that priming can lead to largely variable output. Thus it is left to further investigation how different the priming samples need to be to produce a noticeable difference.

2.4.2 Factorized Hierarchical Variational Autoencoder

The proposed Factorized Hierarchical Variational Autoencoder (FHVAE) [39, 40] architecture takes inspiration from the VAE model as described in Section 2.3.2. Contrary to the standard VAE, the FHVAE model has two disentangled hidden variables with different semantics.


FHVAEs are based on the assumption of the multi-scale nature of speech data. In detail, phonemes would represent localized information, while the speaker or the noise environment would correspond to global information. This hierarchical nature can be modeled in an unsupervised manner by applying different bottlenecks per time scale. The goal of the model is thus to learn to disentangle the hidden distributions for a local time scale and a globally consistent time scale. To this end, two random hidden variables are introduced instead of only one as in VAEs.

First, the random variable $z_2$ represents global information like the speaker or recording condition and will later be used to condition the network. The second random variable $z_1$ represents the remaining local information. Both are parameterized by a Gaussian distribution. The authors of Hsu and Glass [39] chose a single-layer LSTM [41, 42] network followed by a linear transformation to parameterize the Gaussian distributions for $z_1$, $z_2$ and the output distribution. The full architecture can be seen in Figure 2.5.

Figure 2.5: The FHVAE inference architecture consists of two encoders (a global encoder for $z_2$ and a local encoder for $z_1$) and one decoder. $x = [x_1, \cdots, x_{20}]$ is a segment of 20 frames. Dotted lines in the encoders denote sampling from normal distributions.

The main point is that the hidden variable $z_1$ is generated based on a global prior similar to the standard VAE, whereas the second hidden variable $z_2$ depends on a sequence-dependent prior conditioned on a sequence-level hidden variable $\mu_2$. Following the VAE framework, the model is built as a sequence-to-sequence model with an inference or encoder network which infers the hidden variables and a decoder or generation network which reproduces the given samples from the hidden variables. The model takes only a log- or Mel-spectrogram as input. In order to introduce a notion of local and global, the utterances, here labeled as sequences $X^{(i)}$, are split into several segments $x^{(i,n)}$. A single $z_1$ and a single $z_2$ are sampled per segment. During training, a hidden distribution for $z_2$ is learned which is consistent within segments of the same sequence. This is followed by the encoder network for $z_1$, which is given $z_2$ as input and is therefore encouraged to capture the remaining factors that change between segments.

In the variational autoencoder framework, the inference of the exact posterior of the hidden variables and network parameters is intractable. Therefore an inference model, $q_\phi(Z_1^{(i)}, Z_2^{(i)}, \mu_2^{(i)} \mid X^{(i)})$, that approximates the true posterior, $p(Z_1^{(i)}, Z_2^{(i)}, \mu_2^{(i)} \mid X^{(i)})$, is introduced for variational inference [20]. This approximate model is constructed so that the evidence lower bound can be computed, which in turn is maximized to approximate the true posterior as closely as possible. The complete derivation of the lower bound can be found in Hsu, Zhang, and Glass [40]. The eventual lower bound in equation 2.2 does not depend on the whole sequence, given that the estimation of the second hidden variable's mean $\mu_2$ has been replaced with a cache $h_{\mu_2}^{(i)}$, where $i$ indexes the training sequences. The inference model for $\mu_2$ then becomes $q(\mu_2 \mid X^{(i)}) = \mathcal{N}(h_{\mu_2}^{(i)}, I)$ [39]. During training, the whole sequence is never seen by the model at once; only the hidden variable $\mu_2$ encodes sequence information.

$$\mathcal{L}(p, q; x^{(i,n)}) = \mathcal{L}\big(p, q; x^{(i,n)} \mid h_{\mu_2}^{(i)}\big) + \frac{1}{N^{(i)}} \log p\big(h_{\mu_2}^{(i)}\big) \tag{2.2}$$

Given this model, we can now optimize the different distributions for the different variables. For a forward pass through this model, the outputs of these variables are sampled from the corresponding distributions. This operation is not differentiable. However, we can choose to set all distributions to be Gaussian, which allows us to use the reparametrization trick [20] and train the whole model using gradient descent.

Yet, no part of the network forces the different means $\mu_2$ in the lookup table to be different for different sequences, and the prior is maximized with zero mean. However, the goal for $z_2$ is to contain sequence-level information which differentiates itself from other sequences. Therefore a discriminative objective, given in eq. 2.3, is introduced that favors $z_2$ from the same sequence being close to the corresponding $\mu_2$ but also encourages it to be far from the $\mu_2$ of all other sequences [39].

$$\log p\big(i \mid \bar{z}_2^{(i,n)}\big) := \log \frac{p\big(\bar{z}_2^{(i,n)} \mid \bar{\mu}_2^{(i)}\big)}{\sum_{j=1}^{M} p\big(\bar{z}_2^{(i,n)} \mid \bar{\mu}_2^{(j)}\big)} \tag{2.3}$$

To arrive at the final objective function, the lower bound is combined with the discriminative objective through a weighting parameter $\alpha$, and the result is maximized by the FHVAE model:

$$\mathcal{L}^{\text{total}}(p, q; x^{(i,n)}) = \mathcal{L}(p, q; x^{(i,n)}) + \alpha \log p\big(i \mid \bar{z}_2^{(i,n)}\big) \tag{2.4}$$

However, the denominator of the discriminative objective depends on the number of training sequences. This influences the weighting between the variational lower bound and the discriminative objective, so $\alpha$ has to be adjusted for each dataset. Furthermore, for every sample in the training batch, the gradient depends on every entry in the embedding table for $\mu_2$. To be able to train the FHVAE on a larger training set, Hsu and Glass [39] introduced a hierarchical sampling approach. In this approach, only a limited number of sequences, defined by a hyperparameter K, are kept in the lookup table. When these K sequences are drawn from the training set, the corresponding $\mu_2$ are estimated using the current model. Then the normal optimization steps are run for a number of steps defined by another hyperparameter N. Thus the model can be scaled to a larger amount of training data.
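The discriminative term of eq. 2.3 is essentially a softmax over the $\mu_2$ lookup table. A sketch of how it could be computed for a batch of segment-level $z_2$ vectors follows, assuming fixed-variance Gaussians so that the log-density reduces to a negative squared distance up to a constant; the variance value and dimensions are arbitrary.

```python
# Sketch of the FHVAE discriminative objective (eq. 2.3): log p(i | z2) as a
# log-softmax over negative squared distances to every mu2 in the lookup table.
import torch
import torch.nn.functional as F


def discriminative_objective(z2: torch.Tensor, mu2_table: torch.Tensor,
                             seq_idx: torch.Tensor, var: float = 0.5):
    """z2: (B, d) segment encodings, mu2_table: (M, d), seq_idx: (B,) sequence ids."""
    sq_dist = torch.cdist(z2, mu2_table) ** 2           # (B, M) squared distances
    log_probs = F.log_softmax(-sq_dist / (2 * var), dim=1)
    return log_probs[torch.arange(z2.shape[0]), seq_idx].mean()


mu2_table = torch.randn(50, 32)                         # one mu2 per training sequence
z2 = torch.randn(8, 32)
seq_idx = torch.randint(0, 50, (8,))
print(discriminative_objective(z2, mu2_table, seq_idx))
```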

2.4.3 Vector Quantised-Variational AutoEncoder

Another ongoing line of research in unsupervised learning is finding discrete latent codes rather than the continuous representations common in a VAE setup. One proposed model is the Vector Quantised-Variational AutoEncoder (VQ-VAE) [43]¹, which uses vector quantization (VQ) to learn a discrete latent representation. Using discrete latent codes allows the model to circumvent a common issue called 'posterior collapse' that occurs in VAEs when paired with a powerful autoregressive decoder. In this phenomenon, the variational posterior of the hidden variables collapses to the prior and the generative model ignores the latent variables.

The idea behind this method is relatively simple. An encoder produces a continuous representation $z_e(x)$ of the input speech. That representation is quantized, and a decoder is trained to reconstruct the original input from the quantized embedding. The quantization is achieved by clustering and saving the centers of the clusters as embeddings e in a codebook. In the quantization step, the representation $z_e(x)$ is mapped to the nearest element $e_k$ in the codebook, as given by eq. 2.5.

$$z_q(x) = e_k, \quad \text{where } k = \arg\min_j \|z_e(x) - e_j\|_2 \tag{2.5}$$

¹ Even though the name contains "Variational", no parameterized distributions are used in this approach.


During the forward pass, the raw signal is first compressed into a smaller hidden encoding by the encoder. This encoding is then mapped to the nearest centroid. These centroids are part of the model, are learned during training, and constitute the discrete codes. A powerful decoder is then used to reconstruct the original input.

In the case of an audio signal, the proposed model uses several layers of 1-D convolutions with d filters, each followed by a ReLU(x) = max(0, x) activation function. Each convolution has a kernel size of 4 and a stride of 2, so every layer halves the time resolution of the signal. In the original paper either 6 or 7 convolutions are used, which leads to an encoding of the signal into d-dimensional vectors that is 64 or 128 times shorter in time. For every timestep, the closest embedding by l2-norm is selected. This is followed by the decoder reconstructing the signal from the sequence of discrete codes. The original authors use a WaveNet [44] architecture as the powerful autoregressive model, which has been shown to model raw speech very well.
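A sketch of this strided 1-D convolutional encoder follows: six convolutions with kernel size 4 and stride 2 reduce the time resolution by a factor of 2^6 = 64. The filter count d and input length are assumptions for illustration.

```python
# Strided 1-D convolutional encoder (sketch): each Conv1d with stride 2 halves
# the time resolution, so 6 layers give a 64x shorter sequence of d-dim vectors.
import torch
import torch.nn as nn

d = 128
layers, in_ch = [], 1
for _ in range(6):
    layers += [nn.Conv1d(in_ch, d, kernel_size=4, stride=2, padding=1), nn.ReLU()]
    in_ch = d
encoder = nn.Sequential(*layers)

waveform = torch.randn(1, 1, 16000)     # 1 second of 16 kHz audio
z_e = encoder(waveform)                 # (1, d, 250): 16000 / 64 = 250 frames
print(z_e.shape)
```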

However, this mapping operation onto the codebook is not differentiable; neither the embedding nor the encoder could be optimized with gradient descent this way. Therefore, Oord, Vinyals, and Kavukcuoglu [43] propose a 'trick' to make the model trainable. To optimize the encoder, the gradient from the decoder is copied straight to the encoder. Since the encoder output and the discrete code share the same dimensionality, the assumption is that this should work; especially given that they are close in space, the gradient at the cluster still carries useful information for the encoder output. Yet, this mapping operation does not modify the codebook. Thus, in order to learn the codebook, the selected cluster centers are moved closer to the encoder outputs $z_e(x)$ by the optimizer, using the $l_2$ distance as a loss.

This technique resembles learning the embedding clusters with a nearest neighbor approach, which suggests that other clustering algorithms could be used as well. For example, there is recent research on using the expectation-maximization framework [45] to cluster in the latent space. The original authors mention that an exponential moving average [43] variant, similar to a batched k-means algorithm, could help the model learn good embeddings faster and make the optimization less volatile.
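The quantization step, the gradient-copy 'trick' and the codebook learning can be sketched as follows: the nearest codebook entry is selected by $\ell_2$ distance (eq. 2.5), the decoder gradient is passed straight through to the encoder output, and the codebook entries are pulled toward the encoder outputs with an $\ell_2$ loss. The commitment term and its weight are taken from the original paper, but the codebook size, dimensionality and weight value here are assumptions.

```python
# Vector quantization with the straight-through gradient copy (sketch).
import torch
import torch.nn.functional as F

codebook = torch.nn.Parameter(torch.randn(512, 128))   # K=512 codes of dimension 128


def quantize(z_e: torch.Tensor, beta: float = 0.25):
    """z_e: (batch, T, 128) encoder outputs -> quantized codes plus the VQ losses."""
    dists = torch.cdist(z_e.reshape(-1, 128), codebook)         # (batch*T, K)
    idx = dists.argmin(dim=1)
    z_q = codebook[idx].reshape(z_e.shape)
    # Straight-through estimator: forward uses z_q, backward copies the decoder
    # gradient from z_q directly to z_e.
    z_q_st = z_e + (z_q - z_e).detach()
    codebook_loss = F.mse_loss(z_q, z_e.detach())            # move codes toward encoder outputs
    commitment_loss = beta * F.mse_loss(z_e, z_q.detach())   # keep encoder close to its code
    return z_q_st, codebook_loss + commitment_loss


z_q, vq_loss = quantize(torch.randn(4, 250, 128, requires_grad=True))
```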


Figure 2.6: Left: A figure describing the VQ-VAE. Right: Visualization of the embedding space. The output of the encoder $z_e(x)$ is mapped to the nearest point $e_2$. The gradient $\nabla_z L$ (in red) will push the encoder to change its output, which could alter the configuration in the next forward pass.

The point of the model is to learn local discrete features in the given audio signal without providing similar supervised features like phonemes. However, this disregards global factors which are constant over a given sequence. In the case of multiple speakers, the decoder would benefit from information about the speaker to be synthesized. In a fully unsupervised setup, a speaker recognition network that produces a speaker encoding could be added. In practice, however, speaker labels are easy to acquire and contain little noise, so a speaker recognition network can be replaced by either a one-hot encoding of the different speakers or, better, a learned embedding. This not only adds the possibility to add unknown speakers but can also be used to investigate how the model uses the speaker information. Additionally, a learned speaker embedding can be used to generate 'fake' speakers by sampling from the embedding space.

In the model, the speaker encoding is concatenated to the output of the quantization step. This helps the model to focus on the residual local factors independently of the speaker. Furthermore, since the encoder only encodes local features, one can take the local encodings of one utterance and combine them with the speaker encoding of a different speaker to resynthesize the original utterance in that speaker's voice.
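This conditioning amounts to concatenating a learned speaker embedding to every quantized frame before decoding; swapping the speaker index at inference time turns reconstruction into voice conversion. A sketch follows, with an LSTM as a stand-in decoder (roughly in the spirit of the modification mentioned in the introduction) and all sizes chosen arbitrarily.

```python
# Sketch: concatenate a learned speaker embedding to the quantized codes, then
# decode. Swapping speaker_id at inference time performs voice conversion.
import torch
import torch.nn as nn

n_speakers, d_code, d_spk = 100, 128, 64
speaker_table = nn.Embedding(n_speakers, d_spk)
decoder = nn.LSTM(d_code + d_spk, 256, batch_first=True)   # stand-in decoder
to_frames = nn.Linear(256, 80)                             # e.g. 80 Mel bins per frame

z_q = torch.randn(1, 250, d_code)                          # quantized local codes
speaker_id = torch.tensor([7])                             # target speaker to imitate
spk = speaker_table(speaker_id).unsqueeze(1).expand(-1, z_q.shape[1], -1)
hidden, _ = decoder(torch.cat([z_q, spk], dim=-1))
mel_out = to_frames(hidden)                                # (1, 250, 80)
```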

2.5 Feature extraction

With the popularity of deep learning, more and more approaches that learn directly from raw pulse-code modulation (PCM) audio signals have gained popularity [46, 47, 48]. However, these models usually need considerably more resources than comparable methods using extracted features, given the high time resolution of audio signals.
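Where raw PCM is too expensive, a typical alternative is a log-Mel spectrogram computed with a short-time Fourier transform. The sketch below uses librosa with common parameter values (25 ms windows, 10 ms hop, 40 Mel bands); these are assumptions for illustration, not necessarily the configuration used later in this work.

```python
# Log-Mel feature extraction sketch (common parameter values, librosa >= 0.10).
import numpy as np
import librosa

y = np.random.randn(16000).astype(np.float32)   # 1 s of noise as a stand-in signal
sr = 16000
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)   # 25 ms windows, 10 ms hop
log_mel = librosa.power_to_db(mel)                      # (40, n_frames) feature matrix
print(log_mel.shape)
```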
