Department of Science and Technology
Institutionen för teknik och naturvetenskap

LIU-ITN-TEK-A-019/048--SE

Generating Facial Animation With Emotions In A Neural Text-To-Speech Pipeline

Viktor Igeland



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

Linköping University | Department of Science and Technology
Master’s thesis, 30 ECTS | Media Technology and Engineering
2019 | LIU-ITN/LITH-EX-A--2019/001--SE

Generating Facial Animation With Emotions In A Neural Text-To-Speech Pipeline

Generering av ansiktsanimering med känslor i en neural text-till-tal-pipeline

Viktor Igeland

Supervisor: Gabriel Eilertsen
Examiner: Jonas Unger



Abstract

This thesis presents the work of incorporating facial animation with emotions into a neural text-to-speech pipeline. The project aims to allow for a digital human to utter sentences given only text, removing the need for video input.

Our solution consists of a neural network able to generate blend shape weights from speech, which is placed in a neural text-to-speech pipeline. We build on ideas from previous work and implement a recurrent neural network using four LSTM layers, and later extend this implementation by incorporating emotions. The emotions are learned by the network itself via the emotion layer and are used at inference to produce the desired emotion.

While using LSTMs for speech-driven facial animation is not a new idea, it has not yet been combined with the idea of using emotional states that are learned by the network itself. Previous approaches are either only two-dimensional, of complicated design, or require manual labeling of the emotional states. Thus, we implement a network of simple design, taking advantage of the sequence processing ability of LSTMs and combining it with the idea of emotional states.

We trained several variations of the network on data captured using a head mounted camera, and the results of the best performing model were used in a subjective evaluation. During the evaluation the participants were presented with several videos and asked to rate the naturalness of the face uttering the sentence. The results showed that the naturalness of the face greatly depends on which emotion vector was used, as some vectors limited the mobility of the face. However, our best-performing emotion vector was rated at the same level of naturalness as the ground truth, proving our method successful.

The purpose of the thesis was fulfilled as our implementation demonstrates one possibility of incorporating facial animation into a text-to-speech pipeline.


Acknowledgments

First of all I would like to thank Digital Domain for making this thesis possible by allowing me to come to Vancouver. I would like to give a special thanks to my supervisor at Digital Domain: Doug Roble, for all the great ideas and insights, as well as Mark Williams for his ideas and the technical help he provided me in Vancouver. Furthermore, I would also like to give a big thanks to the rest of the Digital Human Group. You are doing some cool stuff!

Secondly, I would like to thank my supervisor Gabriel Eilertsen for both the technical and theoretical advice regarding the thesis, as well as a big thanks to my examiner Jonas Unger.

Finally, I would like to give a huge thanks to my partner Sofia for always being there for me, as well as my family and everyone else who supported me. Thank you!

Linköping, August 21, 2019
Viktor Igeland


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
   1.1 Motivation
   1.2 Aim
   1.3 Research questions
   1.4 Delimitations

2 Background
   2.1 Text-to-speech
       2.1.1 Concatenative
       2.1.2 Formant
       2.1.3 Articulatory
       2.1.4 Statistical parametric
   2.2 Neural text-to-speech
       2.2.1 Feature prediction
       2.2.2 Vocoder
   2.3 Speech-driven facial animation

3 Theory
   3.1 Neural networks
       3.1.1 Convolutional neural networks
       3.1.2 Recurrent neural networks
   3.2 Mel-spectrogram
   3.3 Facial animation

4 Method
   4.1 Overview
   4.2 Data capturing
   4.3 Speech-to-Animation
       4.3.1 Data pre-processing
       4.3.2 Network architecture
       4.3.3 Emotional states
       4.3.4 Loss function
       4.3.5 Training
       4.3.6 Inference
   4.4 Speech synthesis
       4.4.1 Data pre-processing
       4.4.2 Training

5 Results
   5.1 Implementation
   5.2 Evaluation

6 Discussion
   6.1 Results
       6.1.1 Implementation
       6.1.2 Evaluation
   6.2 Method
       6.2.1 Data capturing
       6.2.2 Speech-to-Animation
       6.2.3 Speech synthesis
   6.3 Source criticism
   6.4 The work in a wider context

7 Conclusion
   7.1 Research questions
   7.2 Future work


List of Figures

2.1 Network architecture Eskimez et al.
2.2 Network architecture Karras et al.
4.1 An overview of the pipeline at inference.
4.2 Proposed network architecture.
5.1 Ground truth uttering sentence.
5.2 Inference with neutral emotion vector uttering sentence.
5.3 Inference with the different emotion vectors used.
5.4 Inference with sentence created by Tacotron.
5.5 Mean opinion score with standard deviation for each clip.


List of Tables

4.1 Layer-by-layer description of implemented network.
4.2 Values of parameters used.
5.1 Training and validation loss (MSE) for some networks.


1 Introduction

1.1 Motivation

Speech, our most natural way to communicate, plays an essential part in our lives; talking and interacting with other people is something we do on a daily basis. However, the sound itself is not the only thing that helps us understand what is being said: visual cues from the face and mouth are constantly being evaluated to assist our brain in determining what is being said, as well as what emotional state the other person is in. This means that the ability to see whom you are talking to would not only contribute to fewer misunderstandings but also make it easier to relate to and understand their feelings.

Apart from the typical face-to-face interaction, communication now exists on a completely different level thanks to recent technology. Not only do we interact with other people online, but we also interact with machines; talking with virtual assistants to schedule a meeting, get recommendations or just ask what the weather is like. And even though they reply with a human-like voice, there is currently one stimulus that communication with these forms of artificial intelligence is missing: the visual cues. Being able to read the words on the lips of the machine and see its emotion could not only simplify the communication, but also make these assistants more relatable.

However, creating a realistic looking digital human is challenging. We spend all our lives looking at faces, making us experts at detecting the slightest facial cues and movements. For visual effects companies, achieving a near-realistic resemblance that does not fall into the uncanny valley has proven difficult, as the slightest unnatural motion or appearance will cause an unsettling feeling of eeriness.

Digital Domain1, one of the world’s leading visual effects studios, has a designated Digital Human Group which is pushing the limits of what is possible. Already able to control a photorealistic digital human in real time using a head mounted camera and a motion capture suit, one might wonder: what is the next step? To give technology a face, as said by Doug Roble, Director of software R&D at Digital Domain, in his TED Talk2. Soon your virtual assistant will be more than just a voice.

1 https://www.digitaldomain.com/

1.2 Aim

Through several facial scans and large amounts of captured data the Digital Human Group has created a digital human called DigiDoug, based on their Director of software R&D Doug Roble. To control the face of DigiDoug one must wear a head mounted camera which sends each frame to a convolutional neural network (CNN). The CNN looks at each frame and predicts the blend shape weights of DigiDoug, trying to match the captured face as closely as possible. By using machine learning, the process of animating a face has become almost instant, alleviating the need for manual animation.

However, the Digital Human Group wants to take this technique one step further and give technology a face. To achieve this, the need for a head mounted camera to control the character must be removed. While several techniques exist, the most common problem with text-to-facial-animation models is that the result is two-dimensional, which does not satisfy their needs. Furthermore, when a model is three-dimensional, the network is often complicated or the method requires manual adjustments to achieve good results. Thus, the purpose of this master’s thesis is to incorporate facial animation into a neural text-to-speech pipeline in a simple, yet efficient way.

1.3 Research questions

1. How can video data be incorporated with a text-to-speech network so that facial expressions can be correlated with the text?

2. Video and audio are sampled at different frequencies. What processing is needed to keep them synchronized throughout the network?

3. How should the loss function be designed to correlate audio with animation?

1.4 Delimitations

Training text-to-speech models is time-consuming, even with access to multiple graphics cards. Since this project is carried out with limited time and resources, no text-to-speech model will be trained from scratch; instead, publicly available pre-trained models will be used. Furthermore, the priority does not lie with synthesizing the best sounding speech, but instead with producing realistic facial animations. Finally, since the eyes rotate independently of the facial expression and may need to be controlled with a higher level of detail, they are not handled in the work carried out in this thesis.


2 Background

The challenge of creating a realistic looking human face has been researched for years, meaning a variety of different approaches exist. For this project two approaches are particularly interesting: visual text-to-speech and speech-driven facial animation. Visual text-to-speech typically relies on a text-to-speech model and the ability to represent the text as visemes that are used to animate the face. Speech-driven methods, on the other hand, create a direct mapping from the input audio to the facial features. Furthermore, these approaches can be divided into 2D image-based and 3D mesh-based methods.

The two-dimensional method MikeTalk [14] is a text-to-audiovisual speech synthesizer utilizing optical flow methods to concatenate visemes acquired from a visual corpus. As a result of concatenating recorded sequences the output is realistic; however, incorporating new expressions or emotions requires additional data to be recorded, which limits the use of the work.

In order to increase flexibility, statistical models such as hidden Markov models (HMMs) were applied in work by Wang et al. [47], as well as Anderson et al. [1]. The latter combined HMMs with active appearance models (AAM) [8], resulting in near-videorealistic results. Subsequent work was performed by Parker et al. [36], proposing the switch from HMMs to deep neural networks (DNNs). Their experiments showed a clear preference for the result obtained by the DNN, which was preferred in more than 50% of comparisons over the HMM model. However, while producing state-of-the-art results, the method is only two-dimensional and thus does not meet our project’s requirements.

Looking at speech-driven methods, Voice Puppetry [5] introduced a purely data-driven method predicting full facial animation from the information obtained in an audio track using an HMM. By maximizing the HMM’s compactness, sparsity, its capacity to store contextual information and the specificity of the states, the HMM learns the mapping from sound to facial poses. Cao et al. [6] used a radial basis function that, in addition to producing facial animation, also had the ability to perform different emotions. This was achieved by training their model on a database containing speech and facial motion labeled with emotions. However, labeling data is a time-consuming and difficult task, as it can be unclear what emotion is being expressed. Taking a completely different approach, Sifakis et al. [42] built an anatomically accurate model of a face and then learned to simulate muscle interaction from audio. The approach, while being accurate, is time-consuming to develop and required manual adjustments to fine-tune the speech results.

Eventually came the rise of neural networks, leading to many new approaches. Pham et al. [38] introduced an approach using long short-term memory (LSTM) [21] units to estimate both facial animation and head rotation, where the emotion of the speaker was recreated using expression weights. Using a sequence of 4 LSTMs, Eskimez et al. [13] generated facial landmarks of a talking face. Furthermore, Nvidia [23] developed a method based on convolutional neural networks instead of recurrent networks. What set their approach apart from others was that the network itself could learn a representation of emotions without any labelling. Speech-driven techniques are explained further in Section 2.3.

2.1 Text-to-speech

Throughout the years there has been a wide variety of different approaches to how to design a text-to-speech system. Before today’s trend of using neural networks, speech synthesis could be divided into unit-based and model-based approaches. Unit-based approaches worked by concatenating short utterances into the desired sentence, while model-based approaches synthesized speech using a mathematical model. The following sections explain the most common non-neural techniques used to develop a text-to-speech system.

2.1.1 Concatenative

Concatenative speech synthesis can be considered one of the simplest methods for synthesizing speech, as the synthesis does not rely on any mathematical model. Instead, the synthesis is performed by concatenating prerecorded utterances, known as units, which are stored in a large database. These units can come in many forms ranging from phones to sentences, and depending on the type of unit being used, concatenative speech synthesis can be categorized into three different categories: diphone based, corpus based and hybrid [24]. While the quality of the produced speech often is high and can be considered natural, the use is limited due to the large database requirement. Using large units like words or sentences will produce the most natural sounding speech, due to the fact that they are read by a human. However, this would mean that every single word would have to be recorded before it could be produced through synthesis, which reduces the flexibility of the system and increases the required size of the database. On the other hand, using a smaller unit like phones would provide greater flexibility but reduce the quality of the speech [45]. The difficulty of using a concatenative approach is to decide what unit to use considering their advantages and disadvantages, as well as developing the unit-selection algorithm that chooses which units to select from the database.

Diphone based

A diphone spans two phones, starting at the middle of the first and ending at the middle of the following phone. Using the diphone as the basic unit, this approach creates a database of minimal size, with each diphone occurring only once. Although this reduces the footprint of the database, it also limits the number of units available, creating the need for various types of signal processing; thus, the synthesized audio loses some of its naturalness.

In 1998 Beutnagel et al. [4] conducted a set of experiments where 44 listeners rated a set of utterances synthesized using diphones as the basic unit, as well as phones. Evaluating the results, they came to the conclusion that the audio synthesized using the diphone as the basic unit was preferred over using the phone. However, using diphones has some drawbacks when it comes to synthesizing speech. This is because the linear predictive coding method, often used for concatenating diphones, is not able to represent all required speech parameters [30], leading to robotic-sounding speech.

Corpus based

As opposed to the relatively small database used by diphone based methods, corpus based methods use large databases of speech data. The use of a large database allows corpus based synthesis to produce high quality natural speech, but it also comes with a drawback. The main issue with corpus based approaches is that the data has to be annotated [30], requiring a large amount of human effort before the corpus can be used.

Hybrid

The hybrid approach combines the strengths of two different approaches, taking the naturalness of the concatenative method and the ability to create smooth transitions between the units from the statistical parametric approach, which is explained further in Section 2.1.4. Tiomkin et al. [46] proposed such a hybrid system in 2011, which interwove natural and statistically generated segments with the aim of including as many natural segments as possible and covering the discontinuities with the generated ones. Such a method is able to maintain almost the same amount of naturalness as a fully concatenative approach, while reducing the size of the database.

2.1.2 Formant

Formant speech synthesis, or rule-based synthesis, is an approach that creates rules to describe the resonant frequencies occurring in the vocal tract [30]. A waveform is synthesized by varying speech parameters such as the fundamental frequency over time. The speech produced via formant synthesis is of high quality, but not very human-like; however, because the method has full control over all parameters producing the speech, a wide variety of intonations can be produced. This, in combination with the fact that formant synthesis does not require a large database, allowing for fast synthesis with a small footprint, has led to the method being adopted by screen readers such as eSpeak [11].

2.1.3 Articulatory

Articulatory speech synthesis is an anatomical approach to speech synthesis. The idea is to look at the way humans generate sound: with a vocal tract. An articulatory model tries to represent the vocal tract as a mathematical function of the different articulatory organs, which include the lips, jaw, tongue and velum [30]. The audio is generated by simulating air flowing through the mathematical function of the vocal tract; however, obtaining an accurate model is difficult due to the complexity of the human organs [30].

2.1.4 Statistical parametric

Statistical parametric methods differ from the previous methods by learning a speech representation from the data itself. Zen et al. gave a simplified explanation in [50], describing a statistical parametric method as generating the average of a set of similarly sounding speech segments.

One of the most well-known techniques used for speech synthesis is the hidden Markov model (HMM). By first using a speech database to extract a parametric representation of the speech, the HMM can be trained to create a representation of the speech. The parametric representation consists of the mel-cepstral coefficients [49], as well as the logarithmic fundamental frequency [30]. After the HMM has been trained it can, given text, generate a set of speech parameters that when synthesized should utter the written text.

Although requiring many hours of training, statistical parametric models became popular due to their flexibility and small footprint once trained; the biggest drawback, however, is the low quality of the generated speech [30].

2.2 Neural text-to-speech

Even though machine learning has been around for decades, it was long not adopted in most fields due to the large amounts of data and processing power required. However, in recent years things have changed: thanks to the improvements of graphics processing units (GPUs) and the possibility to easily access and create data, machine learning has become one of the most popular tools used today. The use of machine learning has increased to the point that most people are affected by decisions made by a computer on a daily basis. Whether you are scrolling through your news feed on social media [15], talking with your virtual assistant [27] or even asking for customer support on a website [9], everything is controlled by machines, and the field of text-to-speech is no different.

Typically, neural text-to-speech is performed in two steps: feature prediction and synthesis. The feature prediction transforms the text into time-aligned features that contain information and characteristics on how the text is uttered. The second step, the synthesis, performed by a model often called a vocoder, takes the predicted features and uses them to synthesize speech. In recent years many attempts have been made to develop the best technique for these steps. The following sections present some of the most notable recent work.

2.2.1 Feature prediction

Previously, features and parameters of the audio had to be hard-coded before human-like audio could be synthesized. However, with the use of neural networks these features can be predicted, removing the need for manually extracting and coding them.

There is no single answer to how such a neural network should be designed, and depending on what approach is taken the network structure will vary. The following subsections present a few feature prediction networks from recent state-of-the-art text-to-speech systems.

Char2Wav

Char2Wav by Sotelo et al. [43] is an end-to-end text-to-speech model consisting of two components: a reader and a neural vocoder. In this section the focus lies on the reader, which performs the feature prediction; the vocoder is explained further in Section 2.2.2.

The reader consists of an encoder and a decoder, where the encoder is a bidirectional recurrent neural network and the decoder an attention-based recurrent sequence generator (ARSG) [7]. Using the encoder network, a text or sequence of phonemes is preprocessed and sent as input to the ARSG, which generates a set of acoustic features conditioned on the input sequence.

Deep Voice

Deep Voice by Baidu, Arik et al. [2], is a complete text-to-speech system based on neural networks that performs its feature prediction using the following components:

• Grapheme-to-phoneme model
• Segmentation model
• Phoneme duration model
• Fundamental frequency model

Based on an encoder-decoder architecture, the grapheme-to-phoneme model converts text into phonemes by encoding the characters into the phonetic alphabet.

After the text has been converted into phonemes, the next step is to align these phonemes to the spoken utterances. This is handled by the segmentation model, which is based on a technique proven successful in speech recognition. The technique, Connectionist Temporal Classification (CTC), is often used in speech recognition problems since it allows an RNN to handle problems where the alignment of the input and output is unknown [17]. By modifying a state-of-the-art speech recognition system and training it using the CTC approach, the authors obtained a network that, with a few optimizations to the training, can roughly align the phonemes to the audio.

The final two models, the phoneme duration and fundamental frequency models, are predicted using a single architecture. Given a sequence of phonemes, the network predicts the phoneme duration, the probability that the phoneme is voiced, and the fundamental frequency for each phoneme in the sequence.

Since the release of Deep Voice the authors have continued their development and presented two new versions: Deep Voice 2 [3] and Deep Voice 3 [39]. While Deep Voice 2 is built on the same architecture as the original, it extends its predecessor with the functionality of handling multiple speakers. Deep Voice 3, on the other hand, features a completely redesigned architecture with a focus on being able to train on huge amounts of data.

Tacotron

Google’s Tacotron, by Wang et al. [48], removed the need for multiple components to predict features, as required by previous approaches. Instead, Tacotron learns to predict mel-spectrograms directly from the text, requiring only one network to be trained.

Tacotron uses a sequence-to-sequence model with attention, consisting of an encoder, a decoder and a post-processing network. One important building block of their model is the CBHG module [48], which is used to extract representations from sequences. The encoder utilizes the CBHG module when transforming the representation of the text into a representation suitable for the attention module. Consisting of a stack of gated recurrent units (GRUs), the decoder predicts the output using a mel-spectrogram as the target. The post-processing network then uses this output to predict the parameters needed to synthesize speech using Griffin-Lim [19].

Later an improved version, Tacotron 2 [41], was released, which simplified the architecture by replacing the CBHG blocks with LSTM cells. The biggest difference, however, was the synthesis method, where they switched from the Griffin-Lim algorithm to WaveNet (Section 2.2.2), allowing for higher quality audio.

2.2.2 Vocoder

The purpose of the vocoder is to take the features provided by the feature prediction and use them to condition the generation of the desired audio. As with the feature prediction, there is no single optimal network architecture, but instead several different designs that produce fairly similar results.

WaveNet

WaveNet [33] by van den Oord et al. is a neural vocoder that generates raw audio, sample by sample. It works in an autoregressive generative fashion, meaning that each individual sample generated is conditioned on all of the previous samples. Building on the idea of PixelCNN [34], WaveNet processes the input through multiple convolutional layers and uses dilated causal convolutions to constrain the model to only depend on previous timesteps. The benefit of using stacked dilated convolutions instead of regular convolutions is that it allows the network to gain a significantly larger receptive field while consisting of only a few layers [33], as the receptive field grows exponentially with depth.

WaveNet is not conditioned by default; in other words, it does not produce any meaningful audio by itself. To generate human speech, a second input is required, containing for example linguistic features extracted from the text (Section 2.2.1), which is used to structure the synthesized audio.

Although providing good results, WaveNet’s drawback lies in its slow generation speed. Synthesizing audio sample by sample is time consuming, and considering that a fairly low quality audio file contains 16000 samples per second, this becomes an expensive computation requiring minutes to generate seconds of audio. However, van den Oord et al. acknowledged the drawback in their paper Parallel WaveNet [35], where they present a solution using distillation, i.e. transferring the knowledge from a large network into a smaller one, allowing WaveNet to run in real time.

SampleRNN

Unlike WaveNet’s use of convolutional layers, Mehri et al. proposed SampleRNN, which instead takes advantage of recurrent layers [28]. According to the authors it was possible to implement their model using any recurrent unit, but gated recurrent units (GRUs) seemed to work best.

The layers of SampleRNN are arranged in a hierarchical structure of tiers, where each tier works at a different clock rate, allowing for different levels of abstraction and resulting in faster generation of audio. The output of each tier conditions the tiers below it, and each timestep is also conditioned on the previous timestep in the same tier. An autoregressive multi-layer perceptron, conditioned on the higher tiers, is used as the lowest layer in the network.

WaveGlow

WaveGlow [40], by Nvidia, combined the ideas from WaveNet [33] with the generative flow using 1x1 convolutions presented in Glow [25]. By using a flow network, which allows for an invertible mapping, they were able to train by minimizing the negative log-likelihood of the data.

WaveGlow is conditioned on mel-spectrograms, and audio is generated by sending samples drawn from a random distribution through the network, transforming them into samples from the desired distribution. Being a non-autoregressive network, it produces audio samples at a speed of around 500 kHz, which allows for real-time synthesis.

WaveGlow was designed to allow for simple training [40], without the need for distillation and autoregression, leading to fast generation of samples while maintaining the quality of WaveNet.

2.3 Speech-driven facial animation

Figure 2.1: Network architecture proposed by Eskimez et al. Generates facial landmarks given the first and second order temporal difference of the mel-spectrogram.

Eskimez et al. generated facial landmarks from speech using the first and second order temporal difference of the mel-spectrogram as input, as well as contextual information in the form of N previous frames. The network consisted of 4 LSTM layers (Figure 2.1) and was trained to minimize the mean squared error between the ground truth and predicted landmarks on an audio-visual dataset containing 28 hours of data [13]. Furthermore, a delay was added to the facial movement, allowing the network to prepare for future movements, as the audio was heard slightly before they occurred.

Karras et al. proposed a different network architecture using convolutional neural networks, see Figure 2.2. A formant analysis network extracts the audio features through an autocorrelation layer and refines them through 5 convolutional layers.


Figure 2.2: Network architecture proposed by Karras et al. Audio and an emotion vector are used as input to the network, which predicts the 3D vertex positions.

The audio representation is later fed to an articulation network, also consisting of 5 convolutional layers, which analyzes the features and determines a vector representing the facial pose [23]. Concurrently, in the articulation network, a second input representing the emotional state is concatenated to the output of each layer, allowing the network to disambiguate between facial expressions [23]. Finally, two fully connected layers transform the representation into the vertex positions of the mesh.


3 Theory

This chapter gives an introduction to some of the concepts needed to understand the rest of the thesis.

3.1 Neural networks

Neural networks are an attempt to simulate how the human brain works. The networks make use of several different layers, each performing some sort of mathematical processing in an attempt to learn from the input data. A neural network consists of many small units, called neurons, which are arranged into a series of layers. The neurons from one layer interact with the neurons of another layer via weighted connections. The weight of each connection decides how much of an impact the value from the current neuron will have on the next one, and it is by optimizing these weights that neural networks are able to learn and become exceptionally good at certain tasks.

3.1.1 Convolutional neural networks

Convolutional neural networks (CNNs) are a type of neural network that has shown remarkable success in the fields of image and video recognition. What separates CNNs from other neural networks is that they operate over a volume, rather than on a single vector [22].

The success of CNNs comes from the idea of having specialized components looking for certain characteristics in the image. In 1962 Hubel and Wiesel [10] demonstrated that an individual neuronal cell in the brain’s visual cortex could respond to the presence of edges with a certain orientation. The CNN adopts this idea by using weighted filters that look for certain characteristics. These filters are used in the convolutional layer, where they slide, or convolve, over the pixels of the image while multiplying the filter values with the values of the image. The sum of these values represents the filter response at the current position in the image, and the process is repeated until the whole input volume is covered. What we end up with is a new set of numbers, called a feature map, that the network can use to find similarities between images.
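A minimal sketch of the sliding-filter computation described above, written in plain NumPy (no padding, stride 1); the toy image and the edge-detecting kernel are only illustrations, not part of the thesis.

import numpy as np

def conv2d_valid(image, kernel):
    """Slide a filter over an image and sum the elementwise products (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

image = np.random.rand(5, 5)                     # toy grayscale "image"
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])        # filter that responds to vertical edges
print(conv2d_valid(image, vertical_edge).shape)  # (3, 3) feature map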


3.1.2 Recurrent neural networks

Recurrent neural networks (RNNs) are a different type of neural network that can put the current input into context by remembering previous information. This has led RNNs to be successful in areas such as speech recognition [18] and music composition [12].

Long short-term memory

A special type of RNN is the long short-term memory network, or LSTM. While normal RNNs are capable of remembering previous information, they struggle with long-term dependencies, and this is where the LSTM shines.

What differentiates the LSTM from other RNNs is its repeating module. Simplified, the repeating module of the LSTM consists of four different parts: the cell state and three gates [32]. The key to the LSTM is the cell state, which allows for a flow of information; what information flows through the cell state is controlled by the gates.

The first gate, called the forget gate, consists of a single sigmoid layer and is used to decide what information to remove from the cell state. The second gate selects what new information to pass to the cell state. This gate consists of a sigmoid and a tanh layer, where the sigmoid layer decides which information to update and the tanh layer finds the candidates containing new information [32]. Finally, the last gate decides what information to send as output from the module; this will be a filtered version of the cell state.
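A sketch of a single LSTM timestep in NumPy, following the gate description above; the stacked weight layout (forget, input, candidate, output) is one common convention and an assumption here, not the thesis' implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W: (4H, D), U: (4H, H), b: (4H,) hold the stacked weights for the
    # forget (f), input (i), candidate (g) and output (o) parts.
    H = h_prev.size
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:H])        # forget gate: what to drop from the cell state
    i = sigmoid(z[H:2*H])      # input gate: what new information to write
    g = np.tanh(z[2*H:3*H])    # candidate values
    o = sigmoid(z[3*H:4*H])    # output gate: what to expose
    c = f * c_prev + i * g     # updated cell state
    h = o * np.tanh(c)         # filtered version of the cell state
    return h, c

H, D = 4, 3
rng = np.random.default_rng(0)
W, U, b = rng.standard_normal((4*H, D)), rng.standard_normal((4*H, H)), np.zeros(4*H)
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, U, b)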

3.2 Mel-spectrogram

The mel-spectrogram, or mel-frequency cepstrum [49], is a type of spectrogram often featured in speech recognition applications. It is based on the mel scale [44], a scale constructed so that sounds of equal distance from each other on the scale also sound equally distant to one another for a human listener.

By computing the Fourier transform of a signal and mapping the powers of the spectrum onto the mel scale, followed by logarithmic and discrete cosine transforms, the mel-spectrogram is obtained. The benefit of using the mel-spectrogram is that it is a better approximation of the response of the human auditory system, thus allowing for a better representation of audio.
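A small illustration of the mel scale and of computing a log mel-spectrogram with librosa; the file name and parameters are placeholders and not the ones used in Chapter 4.

import numpy as np
import librosa

# Equal steps on the mel scale correspond to roughly equal perceived pitch steps:
print(librosa.hz_to_mel(np.array([250, 500, 1000, 2000, 4000, 8000])))

# From waveform to a log mel-spectrogram ("speech.wav" is a placeholder file name):
y, sr = librosa.load("speech.wav", sr=None)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
log_S = librosa.power_to_db(S)   # logarithmic compression of the mel-filtered power spectrum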

3.3 Facial animation

Facial animation is an area of computer graphics covering several methods for the generation and animation of a character’s face. One of the techniques used for recording animation is known as motion capture. Motion capture can be performed either with or without markers [31]; a marker-based approach requires drawing several markers on the actor’s face, and the performance is captured using one or several cameras. A markerless approach, on the other hand, does as the name states not need any markers on the actor. Instead it often relies on several other sensors or algorithms to achieve robust facial tracking [31].

After the data has been captured it needs to be applied to the character. A common technique used in movies is called morph targets, more commonly referred to as blend shapes. The method requires the modelling of several facial meshes that together cover the desired range of facial movements. New expressions can then be generated by blending portions of the different meshes together [26].
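A sketch of how blend shape weights combine morph targets into a new expression; the array shapes and the 160 targets are chosen to match the rest of the thesis, but the function itself is only illustrative.

import numpy as np

def apply_blend_shapes(neutral, targets, weights):
    # neutral: (V, 3) vertices of the neutral mesh, targets: (K, V, 3) morph targets,
    # weights: (K,) blend shape weights; the new expression is the neutral face plus a
    # weighted sum of per-target displacements.
    offsets = targets - neutral
    return neutral + np.tensordot(weights, offsets, axes=1)

V, K = 1000, 160                                   # 160 matches the number of weights used later
neutral = np.random.rand(V, 3)
targets = neutral + 0.01 * np.random.rand(K, V, 3)
expression = apply_blend_shapes(neutral, targets, np.random.rand(K))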


4 Method

After researching the problem we realized that modifying the existing pre-trained models to incorporate video was harder than expected and would not be feasible in the time available. Thus, we instead aimed to create a separate network based on speech-driven facial animation and place it in the text-to-speech pipeline.

The project’s development can be divided into two parts: the main one being the development of our network producing facial animation from speech, and a secondary part performing transfer learning on Tacotron. All work was carried out using Python 3.6, with the main work done in Tensorflow 1.13 on a single Nvidia Quadro P5000. Pytorch 1.0 was used to run the pre-trained text-to-speech models.

4.1 Overview

Figure 4.1 illustrates the complete pipeline used in our project, which consists of three main parts: feature prediction, speech synthesis and speech-to-animation. The input to the pipeline was text, which Tacotron, the feature predictor, converted into time-aligned features in the form of a mel-spectrogram. WaveGlow then took the mel-spectrogram and used it to synthesize audio, which finally was used as input to our proposed network, generating a blend shape weight sequence containing the final animation.

Figure 4.1: An overview of the pipeline at inference.
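A high-level sketch of the pipeline in Figure 4.1; synthesize_mel, vocode and predict_blend_shapes are hypothetical wrapper functions around the pre-trained Tacotron 2 and WaveGlow models and our speech-to-animation network, not actual APIs.

def text_to_animation(text, emotion_vector):
    mel = synthesize_mel(text)                              # Tacotron 2: text -> mel-spectrogram
    audio = vocode(mel)                                     # WaveGlow: mel-spectrogram -> waveform
    weights = predict_blend_shapes(audio, emotion_vector)   # our network: audio -> blend shape weights
    return audio, weights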

4.2 Data capturing

Neural networks require large amounts of data for training; in this case the data consisted of audio, video and text. Since the project required correlated audio and video data, as well as a transcript of what was being said, the decision was made to read a publicly available e-book. Reading a book would alleviate the need to spend time developing utterances, while also simplifying the transcription of the audio, as the text was available online in several formats.

Layer type         Output                  Activation
Emotion            128 × 180 × E           -
Input + concat     128 × 180 × (2080+E)    -
LSTM + concat      128 × 180 × (256+E)     Tanh
LSTM + concat      128 × 180 × (256+E)     Tanh
LSTM + concat      128 × 180 × (256+E)     Tanh
LSTM               128 × 180 × 256         Tanh
Fully connected    128 × 180 × 160         Linear

Table 4.1: Layer-by-layer description of the implemented network.

The data was captured during several sessions where the subject wore a head mounted camera rig with a microphone attached to it. The data accumulated over the sessions totalled about 1 hour and 45 minutes. While video was captured via the head mounted camera system, audio and several other features were recorded in Unreal Engine using a tool developed by Digital Domain. This allowed for direct recording of the blend shape weights, reducing the time needed for pre-processing images into weights.

4.3 Speech-to-Animation

This section covers the development of our proposed network that generates facial animation from speech. It will cover the data pre-processing and network architecture, as well as explain the emotional states and loss function used, and finally explain how new animation was generated.

4.3.1 Data pre-processing

With blend shape weights recorded at 60 fps and audio captured at 48 kHz, a correspondence between each frame of video and the audio needed to be created. This was achieved by computing the 80-bin mel-spectrogram of the audio using a Hanning window of about 17 ms, or more precisely 800 samples. By doing so we obtained a more compact representation of the audio, where each audio segment was paired with a frame containing the weights. Following the conclusion by [13] that using the first and second order temporal difference of the mel-spectrogram as input yielded better results, we proceeded to compute and use them as well.

The data was then further divided, as the network required input of a fixed size. Furthermore, contextual information was concatenated to the input, resulting in each frame of the input having a length of

L = 2 \cdot m_{bin} \cdot (c_{win} + 1)

where m_{bin} is the number of bins in the mel-spectrogram and c_{win} the number of frames in the context window. See Section 4.3.5 for all parameter values.
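A sketch of the pre-processing described above, using librosa. The hop length of 800 samples (one spectrogram frame per 60 fps video frame), the file name and the use of power_to_db are assumptions; with m_bin = 80 and c_win = 12 the resulting frame length is L = 2 · 80 · 13 = 2080, matching Table 4.1.

import numpy as np
import librosa

SR, HOP, N_MELS, CTX = 48000, 800, 80, 12           # 800 samples at 48 kHz ~ 16.7 ms = one 60 fps frame

audio, _ = librosa.load("session.wav", sr=SR)        # placeholder file name
mel = librosa.feature.melspectrogram(y=audio, sr=SR, n_fft=HOP, hop_length=HOP,
                                     window="hann", n_mels=N_MELS)
mel = librosa.power_to_db(mel)                       # (80, n_frames)

d1 = librosa.feature.delta(mel, order=1)             # first order temporal difference
d2 = librosa.feature.delta(mel, order=2)             # second order temporal difference
feat = np.concatenate([d1, d2], axis=0).T            # (n_frames, 2 * 80)

# stack the current frame with the CTX previous frames: 2 * 80 * (12 + 1) = 2080 values
frames = [np.concatenate([feat[max(i - c, 0)] for c in range(CTX, -1, -1)])
          for i in range(len(feat))]
X = np.stack(frames)                                 # (n_frames, 2080), paired with the 60 fps weights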

4.3.2 Network architecture

The network architecture presented in this thesis combines the idea of using emotion vectors to handle ambiguity [23] with an LSTM network similar to [13]. The proposed network, illustrated in Figure 4.2 and Table 4.1, consists of one emotion layer, four LSTM layers and one fully connected layer. Our emotion layer is of simple design and only functions as a placeholder for the emotion vectors.


Figure 4.2: Our proposed network architecture. The network takes audio as input and outputs the predicted blend shape weights. A secondary input representing the emotional state is concatenated to the output of some layers.

The weights of the emotion layer are updated at each iteration to the emotion vectors of the current samples. The layer returns the emotion vectors by taking an input tensor of ones and multiplying it with the weights. The purpose of having such a layer is to allow the emotion vectors to become trainable parameters, meaning they will be updated by the network during backpropagation.

Each LSTM layer contained 256 hidden units, H, and used the hyperbolic tangent as activation function. To allow for better generalization, the layers also made use of regular dropout between each layer, as well as recurrent dropout masking the connections between recurrent units. Finally, the output features of the last LSTM layer were fed into a fully connected layer predicting the final 160 blend shape weights.

At each timestep, the first and second order temporal difference of the mel-spectrogram of the current and N previous frames were sent as input to the network, see Section 4.3.1. By providing a sequence of previous frames, the network obtained a contextual window containing short-term information about past inputs, allowing for temporal stability. Furthermore, following [13] we also introduced a delay to the weights with respect to the audio. The reason for introducing the delay was that it allowed the network to be prepared for upcoming mouth movements, as it heard the sound before any movement occurred.

A secondary input, the emotion vector obtained from the emotion layer, represents a learned encoding of the ambiguous information in the data. The emotion vector was represented as an E-dimensional vector and was concatenated onto the output of all layers with the exception of the final LSTM layer, allowing all layers apart from the fully connected layer to adjust the values of the emotion vector. Attaching the emotion vector to the fully connected layer resulted in more temporal noise, visible in the eyebrows, which is why that layer does not get access to the emotion vectors.
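The thesis implements the emotion layer as a trainable per-sample weight matrix inside the network; the sketch below simplifies this by treating the emotion vector as an ordinary second input repeated over every frame, which matches the inference-time setup in Section 4.3.6. It mirrors Table 4.1 and the dropout values of Table 4.2 using tf.keras; everything else (layer names, exact placement) is an assumption, not the thesis' code.

import tensorflow as tf
from tensorflow.keras import layers, Model

T, F, E, H, N_WEIGHTS = 180, 2080, 20, 256, 160      # frames, input length, emotion size, hidden units, weights

audio_in = layers.Input(shape=(T, F))                # mel-spectrogram deltas + context window
emotion_in = layers.Input(shape=(T, E))              # emotion vector repeated over the frames

x = layers.Concatenate()([audio_in, emotion_in])
for _ in range(3):                                   # three LSTM layers, each followed by an emotion concat
    x = layers.LSTM(H, return_sequences=True, dropout=0.5, recurrent_dropout=0.2)(x)
    x = layers.Concatenate()([x, emotion_in])
x = layers.LSTM(H, return_sequences=True, dropout=0.5, recurrent_dropout=0.2)(x)
out = layers.TimeDistributed(layers.Dense(N_WEIGHTS))(x)   # linear layer predicting the blend shape weights

model = Model([audio_in, emotion_in], out)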

4.3.3 Emotional states

Trying to use sound to infer facial movement is not a trivial task considering that the same utterance can be made with different facial expressions. Therefore, we take advantage of emotional states [23].

The purpose of attaching an emotional state to each training sample is to let the network infer the output without taking the ambiguous data into consideration. As explained by Karras et al., the goal of the additional data is, ideally, to store relevant information about the facial movements that cannot be inferred from the audio itself. Apart from being useful during training, Karras et al. also found it to provide great use during inference, as it allows for fine control over the final animation.

As previously mentioned, the emotional state was represented by an E-dimensional vector. We stored one emotion vector for every training sample and, in accordance with Karras et al., we initialized each vector with values drawn from a Gaussian distribution. The vectors themselves were stored in a matrix referred to as the emotion database, from where we could access and update all vectors. During forward propagation the emotion vectors corresponding to the training samples are collected from the emotion database and set as the weights of the emotion layer. At backpropagation the values of each emotion vector are updated by each layer and stored back in the emotion database.

In the end we were left with an emotion database containing thousands of vectors representing ambiguous information. However, the information stored in the vectors is without semantic meaning ("happy", "sad", etc.) [23], and no two vectors that yield a "happy" result will contain the same values. What, ideally, should be similar for an emotion is the gradient between the values of the vector. The selection of emotion vectors is discussed in Section 4.3.6.

4.3.4 Loss function

In accordance with [23], a specialized loss function was created to deal with the ambiguity of the data. We used two of their three presented terms, as the motion term was not of interest to us since we dealt with blend shape weights and not vertex positions. The primary loss term was created as the mean squared error between predicted and ground-truth blend shape weights. The goal of this term is to ensure that the predicted and ground truth blend shape weights are roughly the same. With N being the total number of blend shape weights, y being the ground truth, i.e. the desired output, and \hat{y} being the network's prediction, we defined the term for sample x as:

MSE(x) = \frac{1}{N} \sum_{i=1}^{N} \left( y^{(i)}(x) - \hat{y}^{(i)}(x) \right)^2    (4.1)

Since the purpose of the emotion vectors was to store ambiguous information, i.e. information that was not correlated with the audio, we had to make sure this was handled by the loss function. Considering that the mouth moves quickly during speech, it was assumed that such short-term effects are correlated with the audio, while other, more slowly moving long-term effects should be handled by the emotion database [23]. This was achieved using the regularization term, as used by Karras et al., which discourages emotion vectors from containing short-term information. Following the notation of [23], we define the operator m[\cdot] as the finite difference between two neighbouring frames, and describe the regularization term as follows:

R'(x) = \frac{2}{E} \sum_{i=1}^{E} m\left[ e^{(i)}(x) \right]^2    (4.2)

with E denoting the size of the emotion vector and e^{(i)}(x) the i:th component stored in the database for sample x. As mentioned by Karras et al. [23], the regularization term does not forbid short-term information from existing in the emotion database but merely discourages it, as the possibility exists that the emotional state occasionally changes quickly.

Furthermore, Karras et al. note that Eq. 4.2 can be minimized by simply reducing all values of the emotion vector to zero. We adopted their proposed solution, inspired by batch normalization, to normalize R'(x) with respect to the observed magnitude of e(x):

R(x) = R'(x) \Big/ \left( \frac{1}{EB} \sum_{i=1}^{E} \sum_{j=1}^{B} e^{(i)}(x_j)^2 \right)    (4.3)

where B is the number of samples in the training batch.
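A minimal sketch of how Eqs. 4.1-4.3 could be combined, assuming the emotion vectors e are stored per frame of each sample and that the two terms are simply summed (the thesis does not state the relative weighting); tensor shapes and names are illustrative, not the actual implementation.

import tensorflow as tf

def total_loss(y_true, y_pred, e):
    # y_true, y_pred: (B, T, N) blend shape weights, e: (B, T, E) emotion vectors.
    mse = tf.reduce_mean(tf.square(y_true - y_pred))          # Eq. 4.1
    m = e[:, 1:, :] - e[:, :-1, :]                            # finite difference between neighbouring frames
    r_prime = 2.0 * tf.reduce_mean(tf.square(m))              # Eq. 4.2 (averaged over E, frames and batch)
    r = r_prime / (tf.reduce_mean(tf.square(e)) + 1e-8)       # Eq. 4.3: normalize by the observed magnitude
    return mse + r                                            # relative weighting assumed to be 1:1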

4.3.5 Training

This section explains the parameters used, as well as how we selected our model. The network was trained for about 450 epochs using the Adam optimizer.

Parameter            Value
Hidden dimension     256
Dropout              0.5
Recurrent dropout    0.2
Context window       12
Delay                5
Frames               180
Batch size           128
Learning rate        0.001

Table 4.2: Values of parameters used.

Parameters

Table 4.2 lists the parameter values used in the final model. These values were selected through a manual hyperparameter search, using related work as a starting point. The parameters were tuned until they achieved an optimal value; see the model evaluation below. Context window refers to the number of frames added to each input as context, see Section 4.3.1. Delay is the number of frames of delay that was added to the blend shape weights, and Frames is the number of frames in each training sample. For the emotion vectors we used a size of 20 and a standard deviation of 0.01.

Model evaluation

To obtain the model best suited for the task, there had to be a form of evaluation to decide which model performed better. Therefore, we randomly selected 10 percent of the training data to use for validation. By modifying one parameter at a time and comparing the resulting loss values obtained when running the validation data through the different models, we were able to see how each parameter affected their performance. In the end, the model with the lowest validation loss was chosen.

4.3.6 Inference

After a network had been trained, new facial animation could be generated by running inference on an audio file of arbitrary length. For inference to work, the model had to be provided with a single emotion vector, or a sequence of emotion vectors if the desired result is to vary the emotion over time. To achieve this we first needed to extract vectors from the emotion database. Karras et al. [23] called this process mining, and mentioned that they manually mined for emotion vectors using a three-step process. We approached the mining in a similar fashion; however, since our database contained more than three hundred thousand vectors, we opted to randomly select vectors, creating a subset. We then, as in [23], proceeded to manually perform inference with novel audio and selected the vectors that displayed the desired behaviour.

Moreover, during inference the structure of the network was changed, as the emotion vectors no longer needed to be collected from the emotion database but instead were provided as an input. Therefore, the emotion layer was removed and the secondary input of ones was replaced with the emotion vector. Finally, as during the training phase, the emotion vector was concatenated directly to the input.
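A sketch of the random mining and inference step described above; emotion_db, model, audio_features and n_frames are assumed to exist (for instance the database matrix and the model from the earlier sketches), and the subset size of 50 is arbitrary.

import numpy as np

rng = np.random.default_rng(0)
candidates = emotion_db[rng.choice(len(emotion_db), size=50, replace=False)]

for e in candidates:
    e_seq = np.tile(e, (n_frames, 1))                          # one vector repeated over all frames
    weights = model.predict([audio_features[None], e_seq[None]])
    # render the predicted blend shape weights and keep the vectors showing the desired behaviour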

Rendering

The sequence of blend shape weights produced during inference was converted into vertices, which allowed us to render the face in Maya. Rendering was done in black and white, without eyes and hair, as realistic rendering was not a priority.

4.4 Speech synthesis

This section describes the process of synthesizing speech from text. By using Nvidia’s publicly available pre-trained models1,2 we were able to put a bigger focus on producing good facial animation. The pre-trained models, Tacotron 2 and WaveGlow, explained further in Sections 2.2.1 and 2.2.2, were used without any modification to their network architectures. However, to obtain the desired voice we performed transfer learning on the Tacotron model using a small dataset of our speaker.

4.4.1 Data pre-processing

Tacotron required an audio file as well as the transcription of the audio. Furthermore, it also required the length of the audio to be no more than about 10 seconds; thus, it was necessary to split our data into shorter segments suitable as input. To handle this transcription and splitting of the audio, two methods were implemented due to different requests: one requiring some manual work and one completely automatic.

Forced alignment

Forced alignment synchronizes audio with text by converting segments of the transcription into audio using text-to-speech and computing the mel-spectrogram. The text is then aligned to the audio by comparing short windows of spectrograms, and we split the files after each sentence. This was achieved using aeneas [37], a Python library developed for automatic synchronization of text and audio.
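A minimal sketch of sentence-level alignment with aeneas, following its task interface; the file paths and configuration values are placeholders, not the ones used in the project.

from aeneas.executetask import ExecuteTask
from aeneas.task import Task

config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = "/data/chapter01.wav"         # placeholder paths
task.text_file_path_absolute = "/data/chapter01.txt"          # one sentence per line
task.sync_map_file_path_absolute = "/data/chapter01_map.json"

ExecuteTask(task).execute()     # align the text fragments to the audio
task.output_sync_map_file()     # start/end times used to split the audio after each sentence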

Speech recognition

Automatic transcription of audio was achieved using a pre-trained model of Mozilla’s DeepSpeech [29], an implementation of Baidu’s neural speech recognizer [20]. However, the model only worked on files with a length of around 5-10 seconds, meaning that the audio files had to be split beforehand. By analyzing each audio clip and splitting it at silences longer than 800 ms with a threshold of -50 dB, we were able to obtain files of the correct length.
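A sketch of the silence-based splitting using pydub with the thresholds stated above; the file names are placeholders and the DeepSpeech transcription step is omitted, as its API depends on the installed version.

from pydub import AudioSegment
from pydub.silence import split_on_silence

recording = AudioSegment.from_wav("session01.wav")            # placeholder file name
chunks = split_on_silence(recording,
                          min_silence_len=800,                # split at silences longer than 800 ms
                          silence_thresh=-50)                 # silence threshold of -50 dB
for i, chunk in enumerate(chunks):
    chunk.export("clip_{:03d}.wav".format(i), format="wav")   # 5-10 s clips then go to DeepSpeech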

4.4.2 Training

Tacotron was trained using transfer learning on the pre-trained model with a small dataset of our speaker. We used a dropout of 0.12 and allowed the network to train for 2000 iterations, as training longer made the model lose the learned alignment. No training was performed on WaveGlow.

1 https://github.com/NVIDIA/tacotron2
2 https://github.com/NVIDIA/WaveGlow


5 Results

This chapter presents the results of the implemented method, as well as the results of the evaluation.

5.1 Implementation

In Table 5.1 we report the mean squared error (MSE) between the ground truth and predicted blend shape weights. The results are presented in the form of train loss, the MSE computed on data seen during training, and validation loss, the MSE on unseen data. These results were used as part of the model selection, where in the end CTX12-B128-F180 was selected.

Table 5.2 presents the result of the network selected in Table 5.1 after the implementation of emotions. In this case the validation loss is not shown, since it depends on the emotion vector specified and we do not know which vectors provide good results during training.

Network            Train Loss    Validation Loss
CTX6-B128-F180     0.182         0.355
CTX12-B128-F180    0.161         0.350
CTX12-B256-F150    0.159         0.362
CTX12-B256-F180    0.160         0.352
CTX30-B128-F180    0.158         0.379

Table 5.1: Training and validation loss for some networks. CTX is the size of the context window, B the batch size and F the number of frames in each sample. The remaining parameters were set as described by Table 4.2.

Network      Train Loss
E20-S0.01    0.051
E18-S0.01    0.052

Table 5.2: Training loss for the network with emotions added. E is the size of the emotion vector and S the standard deviation.


Moreover, Figure 5.1 contains a sequence of images illustrating the captured ground truth data uttering a sentence. The same sentence, not seen during training, is uttered by our implemented network in Figure 5.2.

Figure 5.1: Ground truth uttering sentence: "Left his shop."


To illustrate the different emotion vectors used, a single sentence has been inferred with four different emotional states, see Figure 5.3. The emotion vectors were selected manually: by randomly choosing a set of emotion vectors and looking at their inferred results, we subjectively categorized them according to the perceived emotion. We then compared the results from each category and selected the one that, to us, visually looked the best.

Finally, to show the results obtained when using text-to-speech, Figure 5.4 displays a sequence of images where the implemented network uses the audio produced by Tacotron.

Figure 5.3: Inference with the different emotion vectors used. Left to right: happy, neutral, surprised (eyebrows) and surprised (mouth).

Figure 5.4: Inference with sentence created by Tacotron: "Words not seen before".

5.2 Evaluation

To determine how realistic our results were, we performed a subjective test with 30 volunteers, where each participant was shown 25 short clips. The set of clips was made up of five different utterances, each generated with four different emotions as well as one ground truth, meaning that each utterance occurred five times in the set. After each clip the participants were asked to rate the naturalness of the face when talking on a scale from 1 to 5, where 1 was unnatural and 5 completely natural. The evaluation did not contain any utterances produced by Tacotron.

The results of our subjective evaluation are presented in Figure 5.5 and Figure 5.6, where we present the mean opinion score (MOS) for each clip and the averaged MOS per emotion, respectively.
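
For clarity, the aggregation behind these two figures amounts to the following; the ratings matrix and the clip-to-condition labelling below are made up for illustration, only the participant count, clip count and rating scale come from the evaluation setup.

import numpy as np

# 30 participants rated 25 clips on a 1-5 naturalness scale (values made up here).
ratings = np.random.randint(1, 6, size=(30, 25))

mos_per_clip = ratings.mean(axis=0)            # Figure 5.5: MOS per clip
std_per_clip = ratings.std(axis=0, ddof=1)     # and its standard deviation

# Figure 5.6: average MOS per condition. The clip ordering is assumed here:
# five utterances, each shown as four emotions plus the ground truth.
labels = np.tile(["happy", "neutral", "surprised (eyebrows)",
                  "surprised (mouth)", "ground truth"], 5)
mos_per_emotion = {label: mos_per_clip[labels == label].mean()
                   for label in np.unique(labels)}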

Figure 5.5: Mean opinion score with standard deviation for each clip.


6 Discussion

Incorporating video directly into a text-to-speech system turned out to be harder than expected; thus, the project took a different approach from what was first planned. Developing a separate network producing facial animation from audio turned out to be simpler and still produced good results.

This chapter will continue by first discussing the obtained results, the method and the evaluation process. Furthermore, there will be a discussion about source criticism and finally the work in a wider context.

6.1 Results

This section will first discuss the objective results obtained from our implemented method, and then continue by discussing the subjective results obtained from our evaluation.

6.1.1 Implementation

By looking at the generated output in combination with the results in Table 5.1 we were able to make the following observations. Reducing the batch size yielded more noise but at the same time lowered the loss value; we explain this observation by considering how a neural network operates. With a smaller batch size the network is able to better optimize its weights, since the variety of the data within each batch is smaller. However, fewer samples per batch also lead to less generalized training: as less data has been seen at each update, a small change can have a bigger impact, hence the increase in noise.

Moreover, while the variation of the delay is not displayed in Table 5.1, we did a few experiments concerning it and quickly found that using a delay generated better results. However, as one might imagine, adding delay also rendered the audio out of sync. Therefore, we chose a delay that provided the best prediction without being noticeably out of sync.

Another important factor was the size of the context window: as shown in Table 5.1, too small or too large a value negatively affects the result. Visually inspecting the resulting predictions, we could see that too large a context window caused overfitting, while too small a window caused jittering. Our reasoning behind why this occurred was that a large context window would cause the network to rely too heavily on that information, while a small context window would provide too little information, resulting in jittering as the temporal stability from previous frames was lost.

The model selected from Table 5.1 was chosen due to its lower validation loss, meaning it should be better at generalizing; in other words, produce better results for unseen data. However, comparing the different loss values we can see that there is not much variation between them, and since parameters were modified manually there could possibly exist a better model that we did not find. Furthermore, comparing the results from Table 5.1 with the results from Table 5.2, we observe that there is a big difference in the loss values obtained. This indicates that the usage of emotions allows the network to attribute non-speech-related information to the emotion vector, thus being able to better optimize its weights for speech-related information.

With the main objective being to produce realistic facial animation, the speech synthesis did not get as much attention, resulting in the transfer learning of Tacotron not producing good results. When we attempted to use a larger dataset, Tacotron was unable to learn the alignment between audio and text. Possibly there was too much variation in the audio, as it had been recorded over several sessions.

6.1.2 Evaluation

Analyzing the results of the evaluation in Figure 5.5, we observe that certain emotions provide a relatively high MOS for all of the clips, while some lag behind. This was expected, as the happy and surprised (mouth) emotions have trouble closing the mouth completely. Furthermore, we observe that the neutral emotion got a higher score than the ground truth in three out of the five clips in Figure 5.5, meaning the participants seemed to prefer it over the ground truth. Averaging the MOS for each emotion, Figure 5.6, we observe that the neutral and surprised (eyebrows) emotions obtained results similar to the ground truth, meaning the subjects rated them at the same level of naturalness.

Returning to Figure 5.5, we can see that there is a standard deviation of about ±1 for each clip, meaning that there was a big variation in what the participants perceived as natural or not. With the evaluation performed in an uncontrolled environment, each subject was exposed to different conditions that could have affected the results. Another possible source of error is that none of the participating subjects were native English speakers; however, as the goal was not to judge the audio but how the face moves, we believe the impact of this is minor. Furthermore, even though the goal was not to judge the audio, we decided not to use any clips generated by Tacotron in the evaluation, as we believed the difference in audio quality (the robotic voice) would affect the subjects' opinion of naturalness.

6.2 Method

Our implemented method produced promising results. However, after finishing the project there are several steps we would have done differently to possibly improve performance. These steps are discussed in the following sections.

6.2.1 Data capturing

As the data was captured over several sessions, we cannot be completely sure that the camera and microphone were positioned in the same way for all sessions. There was a noticeable difference in the audio between the sessions, which could be the reason for some ambiguity in the data, as similar utterances could have slightly different sound. Furthermore, the total amount of data captured had a running time of 1 hour and 45 minutes. We did not do any tests on how the amount of data impacted the result, meaning we could possibly get similar results with far less data, or even better performance with far more.


Finally, the reading of the book was not done in different emotions, as emotions were not a part of the project at that stage. Thus, for better results with emotions, new captures should be made with the reader acting in different emotional states.

6.2.2 Speech-to-Animation

The combination of our animation network with the text-to-speech pipeline could have been more efficient if we had trained our network on the same type of mel-spectrograms as Tacotron generates. Doing so would allow us to place our network in parallel with the speech synthesis, possibly reducing the total time required by the pipeline.
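
As an illustration of what "the same type of mel-spectrograms" would mean in practice, the sketch below uses librosa with the default analysis parameters of NVIDIA's Tacotron 2 implementation (22 050 Hz, 1024-sample FFT, 256-sample hop, 80 mel bands, 0-8000 Hz). Note that the pre-trained models apply their own STFT and dynamic range compression, so this is only an approximation, not the exact pre-processing.

import numpy as np
import librosa

def tacotron_style_mel(path):
    """Approximate a Tacotron 2 style mel-spectrogram using librosa."""
    y, sr = librosa.load(path, sr=22050)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=1024, hop_length=256, win_length=1024,
        n_mels=80, fmin=0.0, fmax=8000.0, power=1.0,
    )
    # Log compression roughly matching the dynamic range compression used
    # in the reference implementation.
    return np.log(np.clip(mel, 1e-5, None))

mel = tacotron_style_mel("utterance.wav")   # hypothetical input file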

Network architecture

Designing and training a neural network involves endless possibilities: architectural decisions, which hyperparameters to use, and so on. Our proposed architecture was based on a fairly simple LSTM architecture [13], chosen due to its simplicity and ability to produce good results. The addition of a fully connected layer was made as this allowed for greater flexibility, as opposed to scaling the output of the last LSTM. As the idea of adding emotions appeared later in the project, no early design choices were made to ease such an addition. Nevertheless, emotions were successfully implemented based on the work by Karras et al., although in our own way, as it was not completely clear how they did it. Because of this, we cannot be completely sure that we are achieving the same results; however, the results we obtained seemed promising.

Loss function

An important part of designing a successful neural network is the loss function; in this case two different terms were used. As explained by Karras et al., balancing multiple loss functions can be tricky, so they came up with a way to determine the weight of each loss term, similar in spirit to Adam optimization. Modifying the weights of the loss terms was not performed in our project, as we did not have the time to implement it. Possibly this had an impact on the result, as the loss function greatly affects how the network learns.
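
We did not implement this balancing, but conceptually it can be sketched as keeping an Adam-like running estimate of each term's magnitude and dividing by it. The sketch below is our own rough interpretation of that idea, not the scheme of Karras et al. verbatim.

class LossBalancer:
    """Normalize loss terms by a running estimate of their magnitude."""

    def __init__(self, n_terms, beta=0.99, eps=1e-8):
        self.v = [0.0] * n_terms   # running average of squared loss values
        self.t = 0                 # step counter for bias correction
        self.beta = beta
        self.eps = eps

    def combine(self, losses):
        """Return the sum of the loss terms, each scaled to a comparable size."""
        self.t += 1
        total = 0.0
        for i, loss in enumerate(losses):
            value = float(loss)
            self.v[i] = self.beta * self.v[i] + (1.0 - self.beta) * value ** 2
            v_hat = self.v[i] / (1.0 - self.beta ** self.t)
            total = total + loss / (v_hat ** 0.5 + self.eps)
        return total

balancer = LossBalancer(n_terms=2)
combined = balancer.combine([0.5, 0.02])   # two hypothetical loss values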

Training

Finding the best hyperparameters is a difficult task, as they vary with each network and the task at hand. We chose to perform a manual hyperparameter search as this, at the time, seemed to be the best option with the resources available. By performing the search manually, we could incorporate the knowledge obtained from observing the effects of previous changes; however, we have to take the human factor into consideration. The person manually modifying the parameters could have misunderstood the previous results and thus changed the wrong parameter. With more resources, automated hyperparameter optimization would have been preferred.
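
A minimal sketch of what such automated search could look like, as random search over the settings listed in Table 5.1; train_and_validate is a hypothetical function that trains a model with the given configuration and returns its validation loss.

import random

SEARCH_SPACE = {
    "context_window": [6, 12, 30],
    "batch_size": [128, 256],
    "frames_per_sample": [150, 180],
}

def random_search(train_and_validate, n_trials=20, seed=0):
    """Try random configurations and keep the one with the lowest validation loss."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        loss = train_and_validate(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss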

With the objective evaluation of the model relying heavily on the data selected as validation data, there is a concern that the variation of the data could be heavily unbalanced in favor of one type of utterance. If that is the case, the model evaluation process is unreliable. However, as none of the utterances are the same, this should not be a concern. Moreover, evaluating the network with emotions turned out to be difficult, as we do not know which vectors produce good results during training; thus the result of each model had to be evaluated visually.

Inference

Inference with emotions brought many new challenges, most of which we were unable to solve as the addition of emotions came very late in the project. There is currently

References
