
UPTEC IT 18 012

Examensarbete 30 hp Juli 2018

Speech Reading with Deep Neural Networks

Linnar Billman and Johan Hullberg

Institutionen för informationsteknologi / Department of Information Technology




Abstract

Speech Reading with Deep Neural Networks

Linnar Billman and Johan Hullberg

Recent growth in computational power and available data has increased the popularity of and progress in machine learning techniques. Machine learning methods are used for automatic speech recognition in order to let humans transfer information to computers simply by speaking. In the present work we are interested in doing this for general contexts, e.g. speakers talking on TV or newsreaders recorded in a studio. Automatic speech recognition systems are often based solely on acoustic data. By introducing visual data such as lip movements, the robustness of such systems can be increased.

This thesis instead investigates how well machine learning techniques can learn the art of lip reading as the sole source for automatic speech recognition. The key idea is to feed the system a sequence of 24 lip coordinates, rather than learning directly from the raw video frames.

This thesis designs a solution around this principle, empowered by state-of-the-art machine learning techniques such as recurrent neural networks and making use of GPUs. We find that this design reduces computational requirements by more than a factor of 25 compared to a state-of-the-art machine learning solution called LipNet.

This, however, also reduces performance to an accuracy of about 80% of what LipNet achieves, while still reaching roughly 1.5 times the accuracy of human lip readers. The reported accuracies are based on processing previously unseen speakers.

This text presents the architecture: it details its design, reports its results, and compares its performance to an existing solution. Based on this, it indicates how the results can be further refined.

Printed by: Reprocentralen ITC, UPTEC IT 18 012

Examiner: Lars-Åke Nordén

Subject reader: Kristiaan Pelckmans

Supervisor: Mikael Axelsson


Populärvetenskaplig Sammanfattning (Popular Science Summary)

The large growth in computational power, together with the increasing flow of available data, has led to a growing interest in and development of machine learning, above all the part of machine learning called deep learning.

Deep learning has led to improvements in many applications and areas. One of these areas is automatic speech recognition: the technique of training a computer to recognize spoken commands.

Speech recognition is used in many applications, such as personal assistants, voice-controlled systems in vehicles, educational applications and many more.

Speech recognition systems usually rely only on acoustic data, i.e. sound. Such a system is highly dependent on the quality of the sound, since insufficient quality can make it difficult to distinguish words. One solution to this problem is to introduce visual data, such as video of the speaker's lip movements, alongside the sound. By introducing visual data, the system has a chance of distinguishing words even when the audio fails.

This report aims to train a lip reading system with the help of deep learning and then compare it with the current leading system for lip reading: LipNet.


Acknowledgements

First of all we would like to thank our supervisor Mikael Axelsson for supporting us through this project and providing helpful ideas and a positive atmosphere.

We would like to thank Consid AB for providing us with a wonderful office to work in, with pleasant colleagues and great coffee.

Thank you Kristiaan Pelckmans, our reviewer at Uppsala University, for taking the time to provide us with helpful feedback throughout the project.

A special thanks to the team behind LipNet for providing us with a bible in the form of your report and source code, which we could refer to when in doubt.


“Never half-ass two things. Whole-ass one thing.”

Ron Swanson, Parks and Recreation, 2012


Contents

1 Introduction
  1.1 Background
  1.2 Purpose
    1.2.1 Problem Statement
  1.3 Delimitations

2 Related Work
  2.1 LipNet: End-to-End Sentence-level Lipreading
  2.2 Various Works Related to Lip Reading

3 Theory
  3.1 Machine Learning
    3.1.1 Supervised Learning
    3.1.2 Logistic Regression
    3.1.3 Optimization
    3.1.4 Artificial Neural Networks
    3.1.5 Deep Learning
    3.1.6 Recurrent Neural Networks
    3.1.7 Convolutional Neural Network
    3.1.8 Object Detection
    3.1.9 Object Recognition
  3.2 Speech Recognition
    3.2.1 Lip Reading
    3.2.2 Automatic Speech Recognition
    3.2.3 Moving Average Filter

4 LipNet
  4.1 Overview
  4.2 Model
  4.3 Preprocessing
  4.4 Training
  4.5 Prediction and Decoding
  4.6 Evaluation

5 Methods
  5.1 Dataset
    5.1.1 Subset
  5.2 Preprocessing
    5.2.1 Mouth Tracking
    5.2.2 Moving Average Filtering
  5.3 Clustering
  5.4 Logistic Regression
  5.5 Deep Neural Network Models
    5.5.1 Model 1
    5.5.2 Model 2
    5.5.3 Model 3
    5.5.4 Model 4
    5.5.5 Model 5
  5.6 Language Model
  5.7 Results on Real-World Appliance
  5.8 Memory Usage and Training Time

6 Discussion
  6.1 Models
  6.2 Adam parameters
  6.3 Dataset
  6.4 Smoothing, Noise and Normalization
  6.5 Facial Features
  6.6 Performance Comparison

7 Conclusion
  7.1 Reading Lips with Facial Landmarks
  7.2 Future Work

Bibliography


List of Figures

3.1 The structure of an Artificial Neuron
3.2 The structure of a shallow Artificial Neural Network
3.3 The structure of a Recurrent Neural Network
3.4 Gates of a Long Short-Term Memory unit
3.5 Convolutional Neural Network applied on an RGB image
3.6 A max-pooling layer retrieves the maximum value in a region and reduces the dimensionality
3.7 Classification of a feature vector
3.8 Probabilistic Neural Network for classification
3.9 Salient Point Detection detecting a triangle
3.10 Moving average filter. Original signal (blue) and filtered signal (green).
4.1 Graph representation of the architecture of LipNet
5.1 Pictures of two speakers from the GRID corpus
5.2 The 68 facial landmarks identified with dlib's facial recognition
5.3 Vector representation of extracted mouth coordinates from two frames
5.4 Plot of a sequence of one (y) coordinate over 75 frames. One line is the original coordinate (blue) and the other is a smoothed version (green).
5.5 KNN WER for different K-values
5.6 Graph representation of the architecture of model 1
5.7 Graph representation of the architecture of model 2
5.8 Graph representation of the architecture of model 3
5.9 Graph representation of the architecture of model 4
5.10 Graph representation of the architecture of model 5
5.11 Comparison on real-world appliances between Model 5 and LipNet


List of Tables

5.1 Results for logistic regression
5.2 WER of model 1
5.3 WER of model 2 (1)
5.4 WER of model 2 (2)
5.5 WER of model 3
5.6 WER of model 4
5.7 WER of model 5 (1)
5.8 WER of model 5 (2)


Acronyms

AI Artificial Intelligence
AN Artificial Neuron
ANN Artificial Neural Network
ASR Automatic Speech Recognition
AVR Audio-Visual Recognition
BiGRU Bidirectional GRU
BiLSTM Bidirectional LSTM
BiRNN Bidirectional RNN
BN Biological Neuron
CER Character Error Rate
CNN Convolutional Neural Network
CTC Connectionist Temporal Classification
DL Deep Learning
DNN Deep Neural Network
GD Gradient Descent
GRU Gated Recurrent Unit
HMM Hidden Markov Model
KNN K-Nearest Neighbors
LM Language Model
LR Logistic Regression
LSTM Long Short-Term Memory
MA Moving Average
ML Machine Learning
NLP Natural Language Processing
NN Neural Network
PNN Probabilistic Neural Network
PR Pattern Recognition
RGB Red-Green-Blue
RNN Recurrent Neural Network
SL Supervised Learning
SPD Salient Point Detector
SR Speech Recognition
WER Word Error Rate


Chapter 1

Introduction

In this chapter the background and motivation behind this thesis are presented. It briefly mentions some of the history behind machine learning and explains the purpose and goal of the thesis.


1.1 Background

With the recent popularity of Machine Learning (ML), it is easy to believe it to be a brand new concept. This is, however, far from the truth. Ever since the dawn of computers, the idea of a machine being able to emulate human thinking and to learn has existed. Alan Turing, famously known as one of the biggest contributors to computer science and Artificial Intelligence (AI) [1], created a test named the Turing Test [2]. This test was meant to determine whether a machine could be seen as intelligent by displaying behavior indistinguishable from that of a human.

In the 1950s Arthur Samuel created the first computer learning program, which was designed to play checkers [3]. Around the same time, Frank Rosenblatt invented the perceptron [4], laying the grounds for the Artificial Neural Network (ANN). In 1967 the nearest neighbor algorithm was presented [5], allowing computers to perform basic pattern recognition. In 1997, IBM's computer Deep Blue beat the world chess champion Garry Kasparov [6][7]. In 2011, IBM's Watson defeated two champions in the game show Jeopardy! [8].

One of the reasons behind the recent surge in popularity and advancement of ML is the increase in available computational power, along with the ever-increasing amount of available data. This advancement has resulted in more advanced problems being solved with ML than ever before. One of the biggest advancements is the growing use of Deep Learning (DL).

DL is a subset of ML that uses computational models built up of numerous processing layers, allowing data to be learned at several levels of abstraction. DL has drastically improved the state of the art in automatic speech recognition (ASR), object recognition, object detection and many other fields [9][10].

The improvements in ASR have led to many applications, e.g. personal assistants such as Apple's Siri [11] or Microsoft's Cortana [12], system controls in vehicles [13], assistance for people with disabilities [14], and many more. Even though ASR has come a long way, it still has many challenges left to overcome [15].

1.2 Purpose

ASR using only acoustic data depends heavily on the quality of the sound. If the data is polluted with noise or other disturbances, an ASR system has a much more difficult task than if the data were in perfect condition. As ASR systems become increasingly popular in mobile devices and home entertainment systems, the demand increases for ASR to be robust to real-world noise and acoustic disturbances [16].

One technique for making ASR more robust is to use lip reading when possible [17]. By training a model to look not only at the acoustic data of the speaker but also at the visual features, the visual data can help the model predict the correct answer even when the acoustic data is less than ideal [18]. An extreme case of acoustic disturbance is when there is no usable acoustic data at all. This requires the ASR system to go by visual data alone, leading to a more complex problem: pure lip reading.

The current state-of-the-art lip reading system is LipNet [19]. LipNet was trained on an Nvidia DGX-1 [20]: absolute state-of-the-art hardware for AI and DL, not accessible to everyone.

The heavy training of LipNet is not feasible, or at least not practical, on all hardware and might therefore be inaccessible to many.

This thesis aims to explore the possibility of reproducing the accuracies shown by LipNet with a model that can be adequately trained using the limited hardware available for this project: one Nvidia GTX 1060 GPU with 6 GB of memory. This is attempted by extracting visual features, i.e. facial landmarks, from the videos before feeding said features to the model. The thesis also aims to evaluate the differences in accuracy and computational requirements between the created model and LipNet.

1.2.1 Problem Statement

• Can facial landmarks be used as efficient data for lip reading?

• What are the most important visual features when performing lip reading with DL?

• Does using facial landmarks require less computation than using raw images?

• Is it possible to train a model with DL on facial landmarks using an Nvidia GTX 1060 GPU or even more limited hardware?

• Is it possible to replicate the results of LipNet with this model?

1.3 Delimitations

In order to have a consistent dataset of reasonable size available for training and testing the models, the first delimitation is to include only English. The dataset itself has a limited vocabulary consisting of certain words in a specific pattern, explained further in section 5.1.

The approach of using facial landmarks implicitly sets a delimitation of the project. Feeding the model raw video frames would also include the entire region around the speaker's mouth, which may have an impact on lip reading performance. This project simply uses the 24 coordinates, gathered by an existing library, that correspond to different parts of the speaker's lips; no other algorithm for tracking facial features is included.


Chapter 2

Related Work

This chapter presents previous solutions and research related to the subject of this thesis. It describes several methods for solving problems related to lip reading with Artificial Neural Networks, with a focus on LipNet as mentioned in the previous chapter.


2.1 LipNet: End-to-End Sentence-level Lipreading

Y. Assael et al. [19] introduce LipNet, a Neural Network (NN) architecture that maps sequences of video frames to text, making predictions at the sentence level instead of word-wise. LipNet uses spatiotemporal convolutions, a Recurrent Neural Network (RNN) and the Connectionist Temporal Classification (CTC) loss, trained entirely end-to-end on variable-length videos. On the GRID corpus [21] it achieves a 4.8% Word Error Rate (WER) using an overlapped speaker split, classifying full sentences instead of words, phonemes or visemes. The overlapped speaker split uses videos from all speakers for training but withholds a few videos from each speaker for validation purposes.

LipNet uses a couple of libraries to extract a small section from each frame containing a centered image of the speaker's mouth, which is sent as input to the model. As part of the evaluation of LipNet, three hearing-impaired members of the Oxford Students' Disability Community were introduced to the GRID corpus and shown 300 random videos to measure their ability to read the lips of the speakers. On average they achieved a WER of 47.7% on videos of unseen speakers.

LipNet will be used as a comparative tool throughout the project and is described further in chapter 4.

2.2 Various Works Related to Lip Reading

Audio-Visual Recognition (AVR) is a solution to the Speech Recognition (SR) problem where the audio is corrupted. The goal of AVR is to use the information from one modality to complement the information in the other. The problem, however, is to find the correspondence between the audio and visual streams. A. Torfi et al. [22] use a coupled 3D Convolutional Neural Network (CNN) to find the correlation between the visual and audio information.

H. Akbari et al. [23] use a combination of CNNs, Long Short-Term Memory (LSTM) networks and fully connected layers to reconstruct the original auditory spectrogram, with a 98% correlation, from silent lip movement videos.

N. Rathee [24] proposes an algorithm consisting of feature extraction and classification for word recognition, where the word prediction is done by a Learning Vector Quantization neural network. The algorithm is applied to recognizing ten Hindi words and achieves an accuracy of 97%.

A. Garg et al. [25] propose various solutions based on CNNs and LSTMs for lip reading. The best performing model, using the concatenated sequence of all frames of the speaker's face, achieves a validation accuracy of 76%.

Gregory J. Wolff et al. [26] propose visual preprocessing algorithms for extracting relevant features from the frames of a grayscale video to use as input to a lip reading system. They also propose a hybrid speech-reading system with two time-delayed NNs, one for images and one for acoustics, integrating their responses by independent opinion pooling. The hybrid system has a 25% lower error rate than the acoustic system alone, indicating that lip reading can improve SR.


Chapter 3

Theory

This chapter presents the theoretical cornerstones of this project: ML, DL and the underlying network layers in the models, object detection and recognition, some linguistics and the ability to read lips, as well as SR.


3.1 Machine Learning

Broadly defined, ML is a collection of computational methods or algorithms that make accurate predictions based on collected data [27]. The learning algorithm improves by using its experience of previous data to provide more accurate predictions on new data by finding patterns. The accuracy and success rate of a learning algorithm depend greatly on the data used to train it. The sample complexity and size must be sufficiently large to allow the algorithm to analyze the data and find these patterns.

3.1.1 Supervised Learning

Supervised Learning (SL) tries to infer a function from labeled data [28]. A SL algorithm uses the labeled training data to try to determine a function that maps the input to the desired output.

This function is then used to determine the output of new examples. More formally, given a set of $N$ training examples $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i$ is the feature vector of the $i$-th input object in the dataset and $y_i$ is its corresponding output target, the learning algorithm tries to determine a function $g : X \to Y$, where $X$ is the input space and $Y$ is the output space.

There are several steps that must be taken to solve a supervised learning problem:

1. Deciding what the training examples should be. If training a speech recognition system, for example, these could be single letters, words or whole sentences.

2. Collecting a dataset. Either build one or find an existing one relevant to the task at hand. This dataset should consist of input objects and corresponding outputs.

3. Choosing the input feature representation. The learning success of the algorithm can depend highly on how the input object is represented.

4. Choosing a learning algorithm, e.g. support vector machines, decision trees or NNs.

5. Running the training algorithm on the dataset.

6. Evaluating the learned function.

3.1.2 Logistic Regression

Logistic Regression (LR) [29] is a statistical method for analyzing a dataset. It is used for binary classification problems, i.e. problems with only two output classes. The goal of LR is to find the coefficients that best describe the relationship between the input and the output classes.

In the case of more than two output classes, multinomial logistic regression can be used.

Another solution is to divide the problem into smaller binary problems, where a prediction is made for each output class and then compared to the rest. This is called One-vs-Rest or One-vs-All.


3.1.3 Optimization

Gradient Descent

Gradient Descent (GD) [30] is an optimization algorithm widely used in SL. It is used to find the parameters of a function that minimize a cost function. For a neuron, it calculates the error of its approximation to the target. The most commonly used error is the sum of squared errors

$$\varepsilon = \sum_{p=1}^{P_T} (t_p - o_p)^2$$

where $t_p$ is the target output and $o_p$ is the actual output for the $p$-th pattern, and $P_T$ is the total number of patterns in the training set.

The goal is to minimize $\varepsilon$. To do this, the gradient of $\varepsilon$ is calculated in weight space, and the weight values are then moved along the negative gradient. Given a training pattern, the weights are updated with

$$v_i(t) = v_i(t-1) + \Delta v_i(t), \quad \text{with} \quad \Delta v_i(t) = \eta \left(-\frac{\partial \varepsilon}{\partial v_i}\right), \quad \text{where} \quad \frac{\partial \varepsilon}{\partial v_i} = -2(t_p - o_p)\,\frac{\partial f}{\partial \mathrm{net}_p}\, z_{i,p}$$

and $\eta > 0$ is the learning rate, i.e. the size of the steps taken when changing the weights, $\mathrm{net}_p$ is the net input for pattern $p$, and $z_{i,p}$ is the $i$-th input signal corresponding to pattern $p$.
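As a concrete illustration, a minimal NumPy sketch of gradient descent on the sum-of-squared-errors of a single linear neuron is given below; the toy data and learning rate are invented for the example and are not taken from the thesis.

    import numpy as np

    # Toy data: four patterns with three input features each, and their targets.
    X = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.0],
                  [2.0, 1.0, 0.0],
                  [1.0, 1.0, 1.0]])
    t = np.array([1.0, 0.5, 0.2, 0.8])

    w = np.zeros(3)    # the weights v_i of the neuron
    eta = 0.05         # learning rate

    for epoch in range(200):
        o = X @ w                     # linear neuron output o_p for every pattern
        grad = -2.0 * (t - o) @ X     # gradient of epsilon = sum_p (t_p - o_p)^2 w.r.t. w
        w -= eta * grad               # step along the negative gradient

    print("learned weights:", w)
    print("final error:", np.sum((t - X @ w) ** 2))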

Adam

Adam [31] is an optimization algorithm that can be used instead of the classical GD algorithm. It is a gradient-based optimization algorithm that computes adaptive learning rates for each parameter. It updates exponential moving averages of the gradient ($m_t$) and of the squared gradient ($v_t$). The exponential decay rates of these averages are controlled by the hyper-parameters $\beta_1$ and $\beta_2$. The following are the configuration parameters of Adam:

• $\alpha$ is the learning rate.

• $\beta_1$ is the exponential decay rate of the first moment estimate. 0.9 is a recommended value for this parameter.

• $\beta_2$ is the exponential decay rate of the second moment estimate. 0.999 is a recommended value for this parameter.

• $\varepsilon$ is a small number to prevent division by zero. The recommended value is $10^{-8}$.

Pseudocode for the Adam optimization algorithm, requiring the above parameters:

    m_0 ← 0  (initialize first moment vector)
    v_0 ← 0  (initialize second moment vector)
    t ← 0  (initialize timestep)
    while θ_t not converged do
        t ← t + 1
        g_t ← ∇_θ f_t(θ_{t−1})  (get gradients w.r.t. stochastic objective at timestep t)
        m_t ← β_1 · m_{t−1} + (1 − β_1) · g_t  (update biased first moment estimate)
        v_t ← β_2 · v_{t−1} + (1 − β_2) · g_t^2  (update biased second raw moment estimate)
        m̂_t ← m_t / (1 − β_1^t)  (compute bias-corrected first moment estimate)
        v̂_t ← v_t / (1 − β_2^t)  (compute bias-corrected second raw moment estimate)
        θ_t ← θ_{t−1} − α · m̂_t / (√(v̂_t) + ε)  (update parameters)
    end while
    return θ_t  (resulting parameters)

In his paper [32], S. Ruder compares several gradient descent optimization algorithms and concludes by recommending Adam as the best overall choice.
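For readers who prefer executable code, the update above can be written as a small NumPy routine. This is a sketch of the algorithm, not code from the thesis, and the quadratic objective at the end is invented purely to show usage.

    import numpy as np

    def adam(grad_fn, theta, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
        """Minimize an objective, given its gradient function, with the Adam update rule."""
        m = np.zeros_like(theta)   # first moment vector
        v = np.zeros_like(theta)   # second raw moment vector
        for t in range(1, steps + 1):
            g = grad_fn(theta)                        # gradient at the current parameters
            m = beta1 * m + (1 - beta1) * g           # biased first moment estimate
            v = beta2 * v + (1 - beta2) * g ** 2      # biased second raw moment estimate
            m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
            v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
            theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return theta

    # Example: minimize f(theta) = ||theta - c||^2 for a made-up target c.
    c = np.array([1.0, -2.0, 0.5])
    print(adam(lambda th: 2 * (th - c), np.zeros(3), alpha=0.05))  # approaches c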

3.1.4 Artificial Neural Networks

The human brain is an extraordinarily complex computer. It has the incredible ability to memorize and learn, and it completes complex tasks such as Pattern Recognition (PR) much faster than any computer. The brain is built up of large networks of simple Biological Neurons (BNs). Signals propagate through these large networks, where neurons are connected via synapses. If the input signal to a neuron surpasses a certain threshold, the neuron transmits a signal to its connected neurons [33].

An Artificial Neuron (AN) is modeled on a BN. It receives input from other connected ANs, where each input signal is weighted with a numerical weight associated with the connection. The excited signal from the AN is controlled by a function called the activation function. When the AN receives inputs, it uses the sum of the weighted signals as input to the activation function, which calculates the output of the AN, as seen in figure 3.1 [30].

Figure 3.1: The structure of an Artificial Neuron
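As a tiny illustration of the computation in figure 3.1 (not part of the thesis), the weighted sum followed by an activation function can be written in a few lines of NumPy; the weights and inputs below are made up.

    import numpy as np

    def sigmoid(z):
        # A common choice of activation function.
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.0, 2.0])   # inputs x_1..x_3
    w = np.array([0.8, 0.2, -0.5])   # connection weights w_1..w_3

    y = sigmoid(np.dot(w, x))        # weighted sum fed through the activation function
    print(y)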

An ANN consists of layered networks of ANs. The first layer is called the input layer, the last is called the output layer and all layers in between are called hidden layers.

Figure 3.2: The structure of a shallow Artificial Neural Network

ANNs are used in many different types of applications today such as speech recognition, image processing, pattern recognition and classification. These are however only a small part of the applications using ANNs.

3.1.5 Deep Learning

The performance of machine learning algorithms depends highly on the representation of the data given, i.e. which features are included. Manually choosing which features in the data are important and which are not can be quite difficult. What a human considers an important feature might not be considered important by a computer. Allowing the computer not only to map the feature representation to the output, but also to map the data to the feature representation, often results in better performance than manually designed feature representations [10].

DL is a subfield of ML which utilizes deeper network architectures to enable the computer to build complex concepts out of simpler ones [34]. This enables the network to find, by itself, lower-level representations of higher-level features, allowing it to represent functions of higher complexity [10].

DL algorithms have led to many state-of-the-art results in several areas, among them ASR [35][36][37].

3.1.6 Recurrent Neural Networks

The idea behind RNNs is derived from the human ability to understand by sequence. Many tasks, such as understanding speech or remembering the alphabet, are based on sequences [38]. Traditional feed-forward NNs do not have the ability to remember sequences.

To address this issue, RNNs introduce loops. The output of the network in one time step is fed back as input in the next time step [39]. Given a sequence $x = (x_1, x_2, \ldots, x_T)$, the hidden state $h_t$ is updated by

$$h_t = \begin{cases} 0, & t = 0 \\ \phi(h_{t-1}, x_t), & \text{otherwise,} \end{cases}$$

where $\phi$ is a nonlinear function, such as a composition of a logistic sigmoid with an affine transformation [40]. This results in an internal memory, allowing the network to process sequences. The update of the hidden state $h_t$ is implemented as

$$h_t = g(W x_t + U h_{t-1}),$$

where $g$ is a smooth bounded function such as a logistic sigmoid, and $W$ and $U$ are weight matrices.

Figure 3.3: The structure of a Recurrent Neural Network

Even with the ability to remember sequences, RNNs suffer from the problem of long-term dependencies [41]. If two pieces of information are separated by too large a gap in time steps, the RNN will have difficulty connecting them.


Long Short-Term Memory

LSTM is a variant of the RNN architecture which improves the handling of long-term dependencies [42]. While an RNN only has the simple structure of one NN, simply passing the output of one time step as input to the next, an LSTM has a more complex structure of four NNs, called gates. These gates determine what information should be kept in the system and what should be removed, preventing the decay of important information.

Figure 3.4: Gates of a Long Short-Term Memory unit

Walking through figure 3.4, the line running along the top of the unit is called the cell state [39]. The cell state is a flow of information passed from one time step to the next. The information in the cell state can be altered by the gates. The leftmost gate on the bottom row decides what information should be removed from the cell state and is called the forget gate layer ($f$). It looks at the information passed as input $x_t$ as well as the output of the previous time step $h_{t-1}$, and then outputs a number between 0 and 1, where 1 means the information should be completely kept in the cell state and 0 means it should be completely removed:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \qquad W \in \mathbb{R}^{h \times d},\ U \in \mathbb{R}^{h \times h},\ b \in \mathbb{R}^{h}.$$

The forget gate is represented by $f$, and the formula contains three different weights, where $d$ and $h$ refer to the number of input features and hidden units respectively.

The next gate decides what information should be added to the cell state. This is divided into two parts. First the input gate layer ($i$) decides what information should be updated, then a tanh layer creates a vector of candidate values that could be added. These are then combined to update the cell state:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \qquad \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C).$$

Adding the removal of information $f_t \cdot C_{t-1}$ and the addition of information $i_t \cdot \tilde{C}_t$ gives the new cell state:

$$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t.$$


Lastly, a filtered version of the cell state is multiplied with the final sigmoid layer to produce the output ($o$):

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \qquad h_t = o_t \cdot \tanh(C_t).$$

Gated Recurrent Unit

The Gated Recurrent Unit (GRU) is another variant of the RNN architecture. Similar to the LSTM, the GRU has gate units controlling the information flow within the unit.

The update gate ($z_t$) decides how much the unit updates its content, computed as

$$z_t = \sigma(W_z x_t + U_z h_{t-1}).$$

The forget (reset) gate ($r_t$) allows the unit to forget the previously computed state, given as

$$r_t = \sigma(W_r x_t + U_r h_{t-1}).$$

The candidate activation ($\bar{h}_t$) is given as

$$\bar{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1})),$$

where $\odot$ is an element-wise multiplication. Finally, the activation ($h_t$) of the GRU is computed by

$$h_t = (1 - z_t) h_{t-1} + z_t \bar{h}_t.$$

Bidirectional RNN

It can often be useful to analyze both the future and the past at a given point in a sequence. RNNs, however, are designed to analyze a sequence in one direction [43]. A solution to this is the bidirectional RNN (BiRNN), where the input is presented forwards and backwards to two separate RNNs that are connected to the same output layer. Thus, given a sequence of inputs $(x_1, \ldots, x_T)$, one RNN maps $(x_1, \ldots, x_T) \to (\overrightarrow{h}_1, \ldots, \overrightarrow{h}_T)$ and the other maps $(x_T, \ldots, x_1) \to (\overleftarrow{h}_1, \ldots, \overleftarrow{h}_T)$, and then $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$. BiRNNs have shown improved results in sequence learning compared to RNNs, notably in speech processing [44][45].
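To make the recurrent building blocks concrete, here is a minimal Keras sketch (not taken from the thesis) that stacks a bidirectional GRU on a sequence input; the layer sizes and output width are arbitrary, and the input shape merely mirrors the 75-frame sequences of lip coordinates used later in this work.

    from tensorflow import keras
    from tensorflow.keras import layers

    # A toy sequence model: 75 time steps, each a 48-value vector (24 (x, y) lip points flattened).
    model = keras.Sequential([
        keras.Input(shape=(75, 48)),
        layers.Bidirectional(layers.GRU(64, return_sequences=True)),  # forward and backward pass over the sequence
        layers.Dense(10, activation="softmax"),                       # per-time-step class scores
    ])
    model.summary()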

3.1.7 Convolutional Neural Network

The main issue with a linear classifier is that accuracy decreases if the classes are not separable by a hyperplane. This may require a transformation of the data into a new space where the classes are linearly separable. The transformation becomes more difficult as the dimensionality of the input vector increases. Fully connected feed-forward networks are commonly used to learn to classify the data; however, the number of neurons can become very large when applied to high-dimensional data, such as images. A CNN reduces the number of parameters of the feature transformation function, allowing networks to be deeper with fewer parameters [46].

To reduce the number of parameters, the neurons are rearranged into blocks, allowing them to share the same weights within a block; this is called weight sharing. A pixel in an image is highly correlated with its close neighbors, and neurons with the same coordinates in each block can be used to extract information from a region of pixels in the image. By convolving each filter's weights over the input image, the result is a series of images representing the output of the convolutional layer. With an image of dimensions W × H and a convolutional layer with P filters of size M × N, the output of the convolutional layer is P feature maps with dimensions (W − M + 1) × (H − N + 1). An activation function is then applied to each of these feature maps separately in an element-wise fashion, as shown in figure 3.5 [46].

Using convolutional filters for image processing works well because the filter dimensions can be kept the same regardless of whether the image is grayscale, RGB or any other format. The filters are applied separately to each color channel. Convolving a filter with a multichannel input results in a single channel as output. In image processing, the filters are three-dimensional arrays whose dimensions are the width, the height and the number of channels of the image. If another convolutional layer is added with the first layer as its input, the third dimension is the same size as the number of images produced by the first layer. The output of a convolutional layer is called feature maps, where each map is the result of convolving a filter with the input of the layer [46].

Figure 3.5: Convolutional Neural Network applied on an RGB image

In figure 3.5 a convolutional layer (L1) is applied to the Red-Green-Blue (RGB) channels, resulting in feature maps (F1). An activation function (G) applied to the feature maps results in an output of 6 multichannel images. The dimensionality of the feature maps can get quite large when the number of input channels and other dimensions grow, and may need downsampling (pooling) to be reduced to a more reasonable size. The stride ($s$) of the pooling decides which elements in a vector to skip. However, important information could get discarded this way. Therefore the size of the local neighborhood ($d$) is introduced. This is used in max pooling, where the maximum value in the region is used. Pooling the feature maps of a convolutional layer divides the image into a $d \times d$ region every $s$ pixels, row-wise and column-wise. It is applied to each feature map separately, meaning the number of channels remains the same but the other dimensions are reduced by a factor equal to the stride.


Figure 3.6: A max-pooling layer retrieves the maximum value in a region and reduces the dimensionality
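As an illustration (not from the thesis), the following Keras snippet builds the kind of convolution-plus-max-pooling stack described above and prints the resulting feature-map shapes; the image size, filter count and kernel size are arbitrary example values.

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(32, 32, 3)),                            # a 32x32 RGB image
        layers.Conv2D(6, kernel_size=(5, 5), activation="relu"),   # P = 6 filters of size 5x5 -> 28x28x6 feature maps
        layers.MaxPooling2D(pool_size=(2, 2), strides=2),          # d = 2, s = 2 -> 14x14x6
    ])
    model.summary()  # prints the shape of each feature map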

3.1.8 Object Detection

Detection and recognition in images are closely related; however, there is a distinction between the two. The goal of object detection is to determine whether a given type of object is present in an image. For instance, a detection algorithm could conclude that there is a car in the image, while a recognition algorithm could tell us which brand it is [47].

A single image contains multiple colors, often depicted in the RGB color model, where each color can be represented by a dimension. Local structures in images can be detected with structural tensors describing the objects. Putting multiple images or frames together into a video adds another dimension to the tensor. Structures found in images are encoded by their color, where the orientation of areas is represented by different colors. Strong signal variations, such as edges, affect the saturation value. This encoding gives colorful edges with high saturation, creating high contrast against the dark, weak structures, which proves useful for detecting objects in images [47].

Classification

A discriminant function can tell whether a component of the feature vector corresponds to the particular class that the discriminant function is searching for. The higher the count of components matching this function, the more likely it is that the algorithm will choose that class. The classifier looks at the input feature vector, sends the components to the discriminant functions, and then chooses the class depending on which discriminant function has the most matches [47].


Figure 3.7: Classification of a feature vector

Probabilistic Neural Networks

A Probabilistic Neural Network (PNN) follows a combination of two classification methods: the Bayes maximum a posteriori classification scheme and the Parzen method for nonparametric density estimation.

$$\omega_{c_0} = \arg\max_{1 \le c \le C} \{P(x \mid \omega_c)\, P(\omega_c)\}$$

Bayes maximum a posteriori classification finds the class with the highest probability given the input $x$. By taking the maximum argument of the discrimination functions, the classifier returns the class with the maximal response:

$$\omega_{c_0} = \arg\max_{1 \le c \le C} \{g_c(x)\}$$

The Parzen method:

$$p(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h_i^L}\, K\!\left(\frac{x - x_i}{h_i}\right)$$

where $K$ is a function with a highly localized response, called a kernel function, that takes the difference of two vectors ($x$ and $x_i$) that lie within a hypercube of radius $h_i$; $N$ is the total number of points, and $h_i^L$ is the volume of the hypercube in $L$-dimensional space. The formula is slightly altered to allow the neurons to compute a kernel function of the reference patterns and the present input:

$$W_{cn}(x) = K\!\left(\frac{x - x_{cn}}{h_c}\right), \qquad \text{for } 1 \le c \le C \text{ and } 1 \le n \le N_c.$$

The neurons $W_{cn}$, each belonging to only one class, compute the kernel function of the input $x$ and the $n$-th prototype from class $c$, $x_{cn}$, divided by the kernel width for that class. The number of available class prototypes $N_c$ defines the width.

The output from each neuron is fed from the pattern layer to the summation layer, leading to the discrimination functions with a scaling parameter $\alpha_c$:

$$g_c(x) = \alpha_c \sum_{n=1}^{N_c} W_{cn}(x)$$

Input pattern vectors of dimension $L$ are fed to the input layer, where each vector can belong to one of the determined classes. The pattern layer contains the weights that store the components of the reference patterns. The resulting architecture is shown in figure 3.8 [47].


Figure 3.8: Probabilistic Neural Network for classification

The input features are sent to each pattern's weights and summarized in the discrimination functions to find the most probable class for the input, given the classification scheme.

Detection

To classify objects one must be able to characterize their traits, e.g. color, shape or texture. Direct pixel classification can often distinguish between objects and the background of an image. One way to classify images is by the color of objects, as colors can reveal a lot of information about the contents of an environment. Object classification requires a set or range of colors that is descriptive of the particular object type; traffic signs, for example, have certain specific colors that can be distinguished from the background environment [47].

Clustering is a way to improve the accuracy of classifying more complicated data sets, where the input data is divided into partitions with similar properties that are sufficiently separated from other partitions. This relies heavily on similarities within the data set, together with a representative set of features. The K-Nearest Neighbors (KNN) algorithm classifies data points into clusters, and new data is assigned a class given the closest points around it. The numerical parameter K determines how many of the closest points are considered, and the new point is classified into the same category as the majority of its K nearest neighbors.

A fundamental low-level task in computer vision is the ability to detect basic shapes, such as lines, circles and ellipses, described by a certain mathematical model. Structural tensors are useful for detecting basic shapes. For each point the tensor is computed and provides information on whether the point belongs to an edge, what the local phase is, and what type of local structure it belongs to. The remaining parameter of the tensor is the distance between the coordinate system's origin and the tensor. By analyzing the local phase and the coherence of a tensor, the computation becomes quick and accurate [47].

Regular shapes can easily be detected as they have characteristic (salient) points, such as corners or edges. A Salient Point Detector (SPD) has some predefined rules about how these common shapes are constructed, e.g. a triangle has three corners with lines between them. The neighborhood of each pixel is divided into a number of regions, and each region is analyzed to see if it contains certain selected features. The regions can be compared with each other, or with a predefined model of the shape. By segmenting an image into a binary space and looking at the distribution of selected features in each of the regions, the image can be compared to predefined models for these shapes. If a match is found then the pixel can be classified as a salient point of that given type. SPD is therefore efficient in detecting triangles, rectangles and diamonds [47].

Figure 3.9: Salient Point Detection detecting a triangle

Shape deformations and noise may occur in images, in which case the SPD adapts and returns a number of points that fulfill a predefined rule instead of a single location. By processing these groups of points and determining their center of gravity, the cluster can be replaced with a single location corresponding to the center of gravity of the cluster [47].

The SPD technique can only be used to find basic shapes with a few characteristic points, as more complex shapes may need more features to be defined. A technique called the adaptive window growing method aims to address this issue by finding dense clusters in the image that are detected based on other characteristic features of the object, i.e. color or texture. The method looks at a dense cluster, which represents a high probability of the object residing in that part of the image. A rectangular window is inserted at the location of the cluster and expands in all directions until the borders of the object are detected or until the stop criterion, the expansion threshold, for said direction is reached. If the window grows one pixel each step, the neighbor-connected shapes can be detected; if the steps are larger, a sparser version is obtained [47].


In order to verify whether separated salient points belong to the same shape or figure, the detection algorithm must subsequently check all possible configurations of the points, such as their size, or whether the shape is occluded or rotated. This gives the flexibility of a simple formulation of the shape combined with dynamic processing of the rules of the shape [47].

3.1.9 Object Recognition

An object can be viewed from different angles and with different scales, rotations and lighting conditions. Trying to figure out which object it is can be challenging for many other reasons as well. Prototype templates may provide a certain test pattern for the algorithm, but the dimensionality of the template grows with the different possibilities of how an object can match the test pattern. It is challenging to tell whether a pattern is present in an image and how reliable the result is. The methods mentioned in the previous subsection are some of the many methods used to address these issues [47].

One of the main problems in object recognition is how objects change in the observed scene, so that the template may not always match the object very well. To adjust for this, one can instead find characteristic features of the object that are as invariant to object transformations as possible. Features such as geometric properties that remain the same, e.g. the length of a line segment regardless of the rotation of the object, are useful for this [47].

Distance transformations of images and templates can help detect features by creating a binary image of the template and the image. A distance map is constructed from the binary image, while the template is processed into several maps, each a combination of horizontal and vertical shifts as well as rotations and scale changes. Each of the template's maps is compared to the distance map of the image to find features that are close to the template. The smaller the distance between the image's distance map and the template, the more similar they are. The method used to measure the distance can give different results, as some methods have greater resistance to missing features, for example due to occlusions [47].

Combining multiple classifiers can improve accuracy and is often used in facial detection, where a cascade of weak classifiers can be configured to work as serial operators on the images. The training procedure of this ensemble of classifiers should be organized in a way that increases the diversity of the classifiers given a training set, so that they react unanimously when observing known data but are as different as possible when making errors [47].

3.2 Speech Recognition

3.2.1 Lip Reading

Lip reading, also called speech-reading, is a technique for interpreting the movements of a person's lips, mouth and face in order to understand speech. Facial speech gestures can aid in understanding what a person is saying when there is a lot of noise in the environment. However, even skilled speech-readers are seldom able to interpret sentences perfectly, and it is even more difficult to understand unrelated sentences, where performance rarely exceeds 10-30% accuracy [48].

The ability to accurately read lips relies heavily on the psychological and cognitive processes used to comprehend language, on the message having a clear context, and even on the use of guesses [49]. The ability to decode information and the processing speed are two general factors that affect the performance of speech reading, and plain guessing is important in situations where the contextual support is low [48].

In a study by Bernstein et al. [50] involving formal and informal communication, the most proficient test subjects were able to get approximately 80% of the words in the sentences correct [49].

Speech gestures, i.e. movements of the face, mouth and lips, together with body language are primarily used as visual cues for reading lips, with complementary help from the communicative context. Studies have shown that the message content is more informative for speech-readers than facial expressions and body language, and the hypothesis that emotional expression could improve speech-readers' ability to understand the content has been disproved [48].

A phoneme is a unit of sound that can distinguish words in a particular language; phonemes are the characters of phonetic text. Phonemes that are stressed in speech, in order to articulate more clearly, are easier to discriminate in noise but much harder when it comes to speech-reading. The ability to identify phonemes correctly, depending on the phonetic context, has been shown to be below 50%, as many phonemes are hard to distinguish by sight alone and are therefore easy to confuse [49].

3.2.2 Automatic Speech Recognition

When solving an SR problem the algorithm must first be able to decode the message, usually by converting it into a series of numeric values representing the characteristic vocal sounds, or movements, and speech patterns. The numerical representation can then be mapped to a lexicon or dictionary. The mapped words can then be passed to a Language Model (LM), which follows certain rules of the particular language about the structure of sentences. The grammatical rules can improve accuracy, as they may eliminate some possibilities when the algorithm is trying to determine the words in a sentence [51].

Visemes are the visual units of speech and are claimed to be the visual equivalent of phonemes. To relate the two, one of them needs to be mapped onto the other. It is common to map phonemes to visemes, as many phonemes cannot be distinguished visually, but how the mapping should be performed for optimal results is still under investigation. Studies [52][53] have, however, shown that visemes are suboptimal recognition units compared to phonemes [54].

Hidden Markov Models are efficient at capturing the temporal behavior of visual speech, where the duration can vary for each utterance of the same word, and one model can be trained for each word to be detected. This, however, requires a significant amount of task-specific knowledge to design the state models [55].


Connectionist Temporal Classification

As RNNs require presegmented training data, and the network outputs require postprocessing to give a final label sequence, it is hard to apply them directly to sequence labelling. CTC attempts to label unsegmented data sequences with a modified RNN, using the training set to train a temporal classifier that classifies previously unseen input in a way that minimises the rate of transcription mistakes. The most probable labelling for an input sequence is

$$h(x) = \arg\max_{l \in L^{\le T}} p(l \mid x)$$

where $L^{\le T}$ is the set of possible labellings over the alphabet used, obtained through the many-to-one mapping $B : L'^{T} \mapsto L^{\le T}$ from possible paths to possible labellings, and $T$ is the length of the input sequence $x$. Finding this label is referred to as decoding [56].

A CTC network can be trained with a forward-backward algorithm, using an iterative sum over the paths corresponding to prefixes of a labelling with recursive forward and backward variables, similar to the algorithm for Hidden Markov Models (HMMs) [57]. This reduces the number of computations required to calculate the sum over all paths corresponding to a label, as this number can grow quite large. The forward-backward algorithm allows the network to be trained with a maximum likelihood algorithm that maximises the probability of all correct classifications in the training set [56].

Accuracy in Speech Recognition

There are different approaches to measuring accuracy when an algorithm performs ASR, the most commonly used being WER [58]. By comparing the reference sentence to a hypothesis sentence, the WER can be calculated as

$$WER = \frac{S + D + I}{N}, \qquad \text{Accuracy} = 1 - WER.$$

This can be done for each character in a word, or for each complete word in a sentence. $S$ is the number of substitutions, where the algorithm predicts the wrong word; $D$ is the number of deletions, where a word is missing; $I$ is the number of insertions, where a word is added; and $N$ is the number of words in the reference sentence.

• Reference: This is a sample sentence

• S-Hypothesis: This id a sample sentence (WER=1/5)

• D-Hypothesis: This is sample sentence (WER=1/5)

• I-Hypothesis: This is a nice sample sentence (WER=1/5)


The WER of two strings ref and hyp is given by the Levenshtein distance $\mathrm{lev}_{ref,hyp}(|ref|, |hyp|)$, where

$$\mathrm{lev}_{ref,hyp}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0 \\ \min \begin{cases} \mathrm{lev}_{ref,hyp}(i-1, j) + 1 & \text{(deletion)} \\ \mathrm{lev}_{ref,hyp}(i, j-1) + 1 & \text{(insertion)} \\ \mathrm{lev}_{ref,hyp}(i-1, j-1) + f_{ref_i \neq hyp_j} & \text{(substitution/correct)} \end{cases} & \text{otherwise} \end{cases}$$

The indicator function $f_{ref_i \neq hyp_j}$ adds 1 if the words do not match (S) and 0 if the words match (C). The same algorithm can be used on individual words to calculate the Character Error Rate (CER), which looks at the Levenshtein distance between two words instead of two sentences. CER is another possible measurement of accuracy.
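To make the metric concrete, here is a small Python sketch (not from the thesis) that computes the word-level Levenshtein distance and the WER for one of the example pairs above.

    def levenshtein(ref, hyp):
        """Word-level Levenshtein distance between two token lists."""
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                        # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                        # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + sub)   # substitution or match
        return d[len(ref)][len(hyp)]

    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        return levenshtein(ref, hyp) / len(ref)

    print(wer("This is a sample sentence", "This id a sample sentence"))  # 0.2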

Language Model

An N-gram language model can be generated by looking at all the possible labels and calculating the probability of one word following another. The N stands for how long these known sequences are; e.g. a 3-gram model knows the probability of sequences of three words, given the dataset's dictionary.
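As a tiny illustration (not from the thesis), a bigram (2-gram) model can be estimated simply by counting adjacent word pairs; the two sentences below are made up in the GRID sentence pattern.

    from collections import Counter, defaultdict

    sentences = ["bin white at f zero again", "place blue by g nine now"]

    pair_counts = defaultdict(Counter)
    for s in sentences:
        words = s.split()
        for w1, w2 in zip(words, words[1:]):
            pair_counts[w1][w2] += 1          # count how often w2 follows w1

    def bigram_prob(prev, nxt):
        """P(next word | previous word), estimated from the counts."""
        total = sum(pair_counts[prev].values())
        return pair_counts[prev][nxt] / total if total else 0.0

    print(bigram_prob("bin", "white"))        # 1.0 in this toy corpus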

3.2.3 Moving Average Filter

A Moving Average (MA) filter [59] is used to smooth a time series of data. Smoothing is applied to remove noise from a sequential dataset while still capturing the important features and patterns in the data. MA filters are a simple and common type of smoothing. An MA filter calculates the average value of the data over a window of time steps to create a smooth approximation of the original sequence.

Figure 3.10: Moving average filter. Original signal (blue) and filtered signal (green).


Centered Moving Average

The smoothed value $y$ at time $T$ over a window of $N$ time steps is calculated with $T$ as the center, such that

$$y = \frac{x(T - \tfrac{N}{2}) + \ldots + x(T-1) + x(T) + x(T+1) + \ldots + x(T + \tfrac{N}{2})}{N},$$

where $x(t)$ is the value at time $t$. This approach is only possible if future values are known, and it is therefore used when the full dataset is known.

Trailing Moving Average

The smoothed value $y$ at time $T$ over a window of $N$ time steps is calculated with $T$ as the leading time step, such that

$$y = \frac{x(T - N) + \ldots + x(T-2) + x(T-1) + x(T)}{N},$$

where $x(t)$ is the value at time $t$. This approach is possible even if future values are unknown, and it is therefore used for time series forecasting.
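As a hedged NumPy sketch (not from the thesis), both variants can be written in a few lines; the window sizes match the small values later chosen in the preprocessing (3 centered, 2 trailing), and the signal is invented.

    import numpy as np

    def centered_ma(x, n=3):
        """Centered moving average: each output averages a symmetric window around t."""
        kernel = np.ones(n) / n
        return np.convolve(x, kernel, mode="same")   # edges are zero-padded by np.convolve

    def trailing_ma(x, n=2):
        """Trailing moving average: each output only uses the current and past values."""
        out = np.empty(len(x))
        for t in range(len(x)):
            lo = max(0, t - n + 1)
            out[t] = x[lo:t + 1].mean()
        return out

    signal = np.array([3.0, 3.2, 2.8, 3.1, 5.0, 3.0, 2.9])   # a noisy coordinate sequence
    print(centered_ma(signal))
    print(trailing_ma(signal))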


Chapter 4

LipNet

In this chapter LipNet is explained thoroughly. The network model, preprocessing, training parameters and output decoding are described to give clarity to the system.


4.1 Overview

LipNet is a state-of-the-art lip reading network that maps video sequences to text, achieving a sentence-level accuracy of 95.2% (4.8% WER) when an overlapped speaker split is used for training and validation. LipNet is able to handle unseen speakers (cross-validation) from the GRID corpus with an accuracy of 88.6% (11.4% WER). It is open source, allowing anyone to use it and read the source code. It is trained and tested on the GRID corpus, an open dataset easily available for download. These properties make LipNet a reasonable model to use as a comparative tool for this project. While this project might not reach the impressive accuracy displayed by LipNet, it might at least offer a less computationally heavy model that can be used on a personal computer.

4.2 Model

LipNet consists of three layers of 3D CNNs with normalization and max pooling. These layers are followed by a pair of bidirectional GRUs (BiGRU), ending with a dense layer, an activation layer and finally CTC.


Figure 4.1: Graph representation of the architecture of LipNet (5D vector input, followed by three blocks of Conv3D, BatchNormalization, ReLU and MaxPooling3D, then two Bidirectional GRU layers, a Dense layer, Softmax and CTC)
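For orientation, a rough Keras sketch of an architecture with this shape is given below. It is an approximation for illustration only, not LipNet's actual code: the filter counts, kernel sizes and unit counts are placeholders, and the CTC loss and decoding are omitted.

    from tensorflow.keras import layers, models

    # Placeholder dimensions: 75 frames of 50x100-pixel mouth crops with 3 color channels,
    # and an output alphabet of 28 symbols.
    inputs = layers.Input(shape=(75, 50, 100, 3))
    x = inputs
    for filters in (32, 64, 96):                          # three convolutional blocks, as described above
        x = layers.Conv3D(filters, (3, 5, 5), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)   # downsample spatially, keep the time axis
    x = layers.TimeDistributed(layers.Flatten())(x)       # one feature vector per frame
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
    outputs = layers.Dense(28, activation="softmax")(x)   # per-frame character distribution for CTC
    model = models.Model(inputs, outputs)
    model.summary()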


4.3 Preprocessing

The network requires as input a sequence of 75 images of 100x50 pixels of the speaker's mouth. The frames of the videos in the GRID corpus are not of these dimensions, and the videos also include more than just the speaker's mouth. Therefore some preprocessing must be performed before feeding the data to the network. The align files, containing the words spoken in the videos, are coded into series of numbers used as labels for the network. The labels are also padded to ensure that all are of equal length, which is required by the network.

The preprocessing script splits each video into 75 frames. In each frame, the mouth of the speaker is located and the frame is cropped to a rectangle of 100x50 pixels around the mouth. The 75 cropped frames are then saved as a sequence and can be used as input to the network.

4.4 Training

LipNet is set to train for 5000 epochs with the following Adam parameters: $lr = 0.0001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$.
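In Keras these settings would correspond to an optimizer along the lines of the following sketch (not LipNet's actual source code):

    from tensorflow.keras.optimizers import Adam

    optimizer = Adam(learning_rate=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)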

4.5 Prediction and Decoding

When the network has been trained, the trained weights can be loaded into the model to perform predictions on new videos. A video can then be used as input to predict the sentence spoken. The video undergoes the same preprocessing as the training videos before being fed into the network. The output is then decoded using the Keras CTC decoder. Finally, the labels are translated back from numbers to text. The spelling of the resulting text is corrected using some static rules and is finally presented.
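A hedged sketch of what such decoding can look like with the Keras backend's CTC decoder is shown below; the model, the label-to-character mapping and the greedy decoding choice are assumptions made for the example, not details taken from LipNet's source.

    import numpy as np
    from tensorflow.keras import backend as K

    alphabet = " abcdefghijklmnopqrstuvwxyz"         # assumed label coding: index 0 is space

    def predict_sentences(model, video_batch):
        """Greedy CTC decoding of the network's softmax outputs into text."""
        y_pred = model.predict(video_batch)          # shape: (batch, time steps, characters)
        input_len = np.full(y_pred.shape[0], y_pred.shape[1])
        decoded, _ = K.ctc_decode(y_pred, input_length=input_len, greedy=True)
        labels = K.get_value(decoded[0])             # integer label sequences, padded with -1
        return ["".join(alphabet[i] for i in seq if i >= 0) for seq in labels]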

4.6 Evaluation

LipNet uses two measurements for its accuracy: overlapped speakers and unseen speakers. When the algorithm is trained with overlapped speakers, all speakers are present in the training, but some of each speaker's videos are withheld from training to be used as test data. The other measurement uses test data consisting of speakers that are entirely withheld from the training process, i.e. speakers the model has never seen before.


Chapter 5

Methods

In this chapter the dataset used in the project is described, as well as the software developed during the project and the methods used in said software.


5.1 Dataset

The GRID corpus [21] was used as the dataset. It contains videos of 1000 sentences spoken by each of 34 speakers (18 male, 16 female). The sentences follow a fixed pattern: command + color + preposition + letter + digit + adverb. The commands are {bin, lay, place, set}, the colors are {blue, green, red, white}, the prepositions are {at, by, in, with}, and all Latin letters except W are used. The digits are between zero and nine, and the adverbs are {again, now, please, soon}. One example sentence from the dataset is: 'bin white at f zero again'. For each video there is a transcription containing the words spoken as well as information about when each word is spoken. The videos of speaker 21 are however not available, leaving 33 speakers available for the dataset.

Figure 5.1: Pictures of two speakers from the GRID corpus

5.1.1 Subset

For the evaluation of the proposed models, a subset of 14000 videos (14 different speakers) from the GRID corpus was used: 6000 for the training set, 6000 for the validation set and 2000 for testing. The reason for using a subset of all the videos was to limit the time spent training the various models. The test set consists of videos of speakers not included in the training or validation set.

When the most suitable models had been established, the full dataset of 31000 videos (31 speakers) was used to train said models, with the remaining two unseen speakers (2000 videos) used for testing.

5.2 Preprocessing

This section describes the preprocessing performed on the data prior to training and evaluation.


5.2.1 Mouth Tracking

To prepare the data for training, each video in the dataset is split into 75 frames. Each frame is then analyzed using the Face Recognition API [60] for Python [61], which is built on dlib's [62] deep learning based face recognition. It extracts the (x, y) coordinates of 68 facial landmarks, including the eyes, nose, mouth and chin, as shown in figure 5.2.

Figure 5.2: The 68 facial landmarks identified with dlib’s facial recognition

The coordinates corresponding to the speaker's mouth are saved while the rest are discarded. They are then normalized such that the mouth originates from the smallest possible (x, y) coordinates.

Figure 5.3: Vector representation of extracted mouth coordinates from two frames
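A hedged sketch of this extraction step with the face_recognition library is shown below; the file name is a placeholder, and the normalization simply shifts the lip points so that the smallest x and y values become zero, which is one reading of the description above.

    import numpy as np
    import face_recognition

    frame = face_recognition.load_image_file("frame_000.png")    # placeholder file name
    landmarks = face_recognition.face_landmarks(frame)[0]        # 68-point landmark dict for the first face

    # 12 top-lip points and 12 bottom-lip points give the 24 mouth coordinates used as features.
    mouth = np.array(landmarks["top_lip"] + landmarks["bottom_lip"], dtype=float)

    # Normalize so that the mouth originates from the smallest possible (x, y) coordinates.
    mouth -= mouth.min(axis=0)
    print(mouth.shape)                                           # (24, 2)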


5.2.2 Moving Average Filtering

The coordinates described in 5.2.1 fluctuated between frames even when the speaker's lips did not move. To counter these fluctuations, two versions of an MA filter were developed, one centered and one trailing, to smooth each coordinate over the sequence of frames in a video. The window sizes for these filters were chosen to be the smallest possible, 3 for the centered and 2 for the trailing filter, so as to minimize the removal of important features or patterns from the data.

Figure 5.4: Plot of a sequence of one (y) coordinate over 75 frames. One line is the original coordinate (blue) and the other is a smoothed version (green).

5.3 Clustering

KNN was implemented as a naive solution in this project. It was built with scikit-learn [63]. For each video sample, each word was extracted with the corresponding coordinates for those frames and used to train a KNN algorithm. It was trained and tested with different numbers of neighbors, where the distance measurement was optimized based on the training data using the sklearn neighbors library [64]. The best accuracy recorded was a WER of 69.91%, obtained when the algorithm used the single closest neighbor to classify new data.


Figure 5.5: KNN WER for different K-values
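A minimal scikit-learn sketch of this kind of word-level baseline might look as follows; the random arrays are stand-ins for the real coordinate data, and the shapes and labels are invented for the example.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Dummy stand-ins: each row is a flattened lip-coordinate sequence for one spoken word,
    # and y holds the word label. Real data would come from the preprocessing in section 5.2.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 480))
    y_train = rng.choice(["bin", "lay", "place", "set"], size=100)
    X_test = rng.normal(size=(5, 480))

    knn = KNeighborsClassifier(n_neighbors=1)    # K = 1 gave the best WER in these experiments
    knn.fit(X_train, y_train)
    print(knn.predict(X_test))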

5.4 Logistic Regression

A simple LR model was built and trained using scikit-learn as a second naive solution to the problem. The dataset was reshaped in the same fashion as in section 5.3. Each unique word in the dataset was encoded as a number, and the coordinates in each sample were then mapped to a specific encoded word. The tflearn framework was used for this model [65].

The LR model was tested with several training parameter combinations on a small subset of the dataset to find the optimal combination. The dataset was the same as that used in section 5.3. The combination leading to the highest accuracy is described in table 5.1. The reported accuracy was based on training on the full dataset.

Penalty  C    Dual   Fit intercept  Intercept  Solver     Warm start  WER
l1       0.1  false  true           1          liblinear  false       76.2%

Table 5.1: Results for logistic regression
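A hedged scikit-learn sketch using the parameter combination from table 5.1 is shown below; the random arrays stand in for the flattened coordinate vectors and encoded word labels described above.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Dummy stand-ins for the flattened lip-coordinate vectors and the encoded word labels.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 480))
    y_train = rng.integers(0, 51, size=200)      # 51 distinct words in the GRID vocabulary

    clf = LogisticRegression(penalty="l1", C=0.1, dual=False, fit_intercept=True,
                             intercept_scaling=1, solver="liblinear", warm_start=False)
    clf.fit(X_train, y_train)
    print(clf.predict(X_train[:3]))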

5.5 Deep Neural Network Models

In this section all tested models are presented, with motivation for each particular model as well as the best results obtained. All models were built and trained using Keras [66] with TensorFlow [67]. Each model was trained on the subset of 14000 videos mentioned in section 5.1.1. Models 2 and 5 were also trained and tested on the full dataset of 33000 videos, as they achieved the best results.

References
