UPTEC IT 18 012
Master's Thesis, 30 credits, July 2018
Speech Reading with Deep Neural Networks
Linnar Billman and Johan Hullberg
Department of Information Technology
Abstract
Speech Reading with Deep Neural Networks
Linnar Billman and Johan Hullberg
Recent growth in computational power and available data has increased the popularity of and progress in machine learning techniques. Methods of machine learning are used for automatic speech recognition to allow humans to transfer information to computers simply by speech. In the present work, we are interested in doing this for general contexts, e.g. speakers talking on TV or newsreaders recorded in a studio. Automatic speech recognition systems are often based solely on acoustic data. By introducing visual data, such as lip movements, the robustness of such a system can be increased.
This thesis instead investigates how well machine learning techniques can learn the art of lip reading as the sole source for automatic speech recognition. The key idea is to feed the system a sequence of 24 lip coordinates, rather than learning directly from the raw video frames.
This thesis designs a solution around this principle, empowered by state-of-the-art machine learning techniques such as recurrent neural networks and making use of GPUs. We find that this design reduces computational requirements by more than a factor of 25 compared to a state-of-the-art machine learning solution called LipNet.
However, this also scales down performance: the model reaches about 80% of the accuracy that LipNet achieves, while still attaining roughly 1.5 times the accuracy of human lip readers. The accuracies are measured on previously unseen speakers.
This text presents the architecture: it details its design, reports its results, and compares its performance to an existing solution. Based on this, it indicates how the result can be further refined.
Printed by: Reprocentralen ITC, UPTEC IT 18 012
Examiner: Lars-Åke Nordén
Subject reviewer: Kristiaan Pelckmans
Supervisor: Mikael Axelsson
Populärvetenskaplig Sammanfattning (Popular Science Summary)
The rapid growth in computational power, together with the increasing flow of available data, has led to greater interest in and development of machine learning, above all the branch of machine learning called deep learning.
Deep learning has improved many applications and fields. One of these fields is automatic speech recognition: the technique of training a computer to recognize spoken commands.
Speech recognition is used in many applications such as personal assistants, voice-controlled systems in vehicles, educational applications, and many more.
Speech recognition systems usually rely only on acoustic data, i.e. sound. Such a system is highly dependent on the quality of the audio, as insufficient quality can make it difficult to distinguish words. One solution to this problem is to introduce visual data, such as video of the speaker's lip movements, alongside the audio. With visual data, the system has a chance to distinguish words even when the audio fails.
This report aims to train a lip-reading system with the help of deep learning and then compare it with the current leading lip-reading system: LipNet.
Acknowledgements
First of all we would like to thank our supervisor Mikael Axelsson for supporting us through this project and providing helpful ideas and a positive atmosphere.
We would like to thank Consid AB for providing us with a wonderful office to work at, with pleasant colleagues and great coffee.
Thank you Kristiaan Pelckmans, our reviewer at Uppsala University, for taking your time to provide us with helpful feedback throughout the project.
A special thanks to the team behind LipNet for providing us with a bible in the form of their report and source code, which we could refer to whenever in doubt.
“Never half-ass two things. Whole-ass one thing.”
Ron Swanson, Parks and Recreation, 2012
Contents
1 Introduction
   1.1 Background
   1.2 Purpose
      1.2.1 Problem Statement
   1.3 Delimitations

2 Related Work
   2.1 LipNet: End-to-End Sentence-level Lipreading
   2.2 Various Works Related to Lip Reading

3 Theory
   3.1 Machine Learning
      3.1.1 Supervised Learning
      3.1.2 Logistic Regression
      3.1.3 Optimization
      3.1.4 Artificial Neural Networks
      3.1.5 Deep Learning
      3.1.6 Recurrent Neural Networks
      3.1.7 Convolutional Neural Network
      3.1.8 Object Detection
      3.1.9 Object Recognition
   3.2 Speech Recognition
      3.2.1 Lip Reading
      3.2.2 Automatic Speech Recognition
      3.2.3 Moving Average Filter

4 LipNet
   4.1 Overview
   4.2 Model
   4.3 Preprocessing
   4.4 Training
   4.5 Prediction and Decoding
   4.6 Evaluation

5 Methods
   5.1 Dataset
      5.1.1 Subset
   5.2 Preprocessing
      5.2.1 Mouth Tracking
      5.2.2 Moving Average Filtering
   5.3 Clustering
   5.4 Logistic Regression
   5.5 Deep Neural Network Models
      5.5.1 Model 1
      5.5.2 Model 2
      5.5.3 Model 3
      5.5.4 Model 4
      5.5.5 Model 5
   5.6 Language Model
   5.7 Results on Real-World Appliance
   5.8 Memory Usage and Training Time

6 Discussion
   6.1 Models
   6.2 Adam parameters
   6.3 Dataset
   6.4 Smoothing, Noise and Normalization
   6.5 Facial Features
   6.6 Performance Comparison

7 Conclusion
   7.1 Reading Lips with Facial Landmarks
   7.2 Future Work

Bibliography
List of Figures
3.1 The structure of an Artificial Neuron
3.2 The structure of a shallow Artificial Neural Network
3.3 The structure of a Recurrent Neural Network
3.4 Gates of a Long-Short Term Memory unit
3.5 Convolutional Neural Network applied on an RGB image
3.6 A max-pooling layer retrieves the maximum value in a region and reduces the dimensionality
3.7 Classification of a feature vector
3.8 Probabilistic Neural Network for classification
3.9 Salient Point Detection detecting a triangle
3.10 Moving average filter. Original signal (blue) and filtered signal (green).
4.1 Graph representation of the architecture of LipNet
5.1 Pictures of two speakers from the GRID corpus
5.2 The 68 facial landmarks identified with dlib's facial recognition
5.3 Vector representation of extracted mouth coordinates from two frames
5.4 Plot of a sequence of one (y) coordinate over 75 frames. One line is the original coordinate (blue) and the other is a smoothed version (green).
5.5 KNN WER for different K-values
5.6 Graph representation of the architecture of model 1
5.7 Graph representation of the architecture of model 2
5.8 Graph representation of the architecture of model 3
5.9 Graph representation of the architecture of model 4
5.10 Graph representation of the architecture of model 5
5.11 Comparison on real-world appliances between Model 5 and LipNet
List of Tables
5.1 Results for logistic regression
5.2 WER of model 1
5.3 WER of model 2 (1)
5.4 WER of model 2 (2)
5.5 WER of model 3
5.6 WER of model 4
5.7 WER of model 5 (1)
5.8 WER of model 5 (2)
Acronyms
AI Artificial Intelligence
AN Artificial Neuron
ANN Artificial Neural Network
ASR Automatic Speech Recognition
AVR Audio-Visual Recognition
BiGRU Bidirectional GRU
BiLSTM Bidirectional LSTM
BiRNN Bidirectional RNN
BN Biological Neuron
CER Character Error Rate
CNN Convolutional Neural Network
CTC Connectionist Temporal Classification
DL Deep Learning
DNN Deep Neural Network
GD Gradient Descent
GRU Gated Recurrent Unit
HMM Hidden Markov Model
KNN K-Nearest Neighbors
LM Language Model
LR Logistic Regression
LSTM Long Short-Term Memory
MA Moving Average
ML Machine Learning
NLP Natural Language Processing
NN Neural Network
PNN Probabilistic Neural Network
PR Pattern Recognition
RGB Red-Green-Blue
RNN Recurrent Neural Network
SL Supervised Learning
SPD Salient Point Detector
SR Speech Recognition
WER Word Error Rate
Chapter 1
Introduction
This chapter presents the background and motivation behind this thesis, briefly touches on some history of machine learning, and explains the purpose and goal of the thesis.
1.1 Background
With the recent popularity of Machine Learning (ML), it is easy to believe it to be a brand-new concept. This is, however, far from the truth. Ever since the dawn of computers, the idea of a machine being able to emulate human thinking and to learn has existed. Alan Turing, famous as one of the biggest contributors to computer science and Artificial Intelligence (AI) [1], created a test named the Turing Test [2]. This test was meant to determine whether a machine could be seen as intelligent by displaying behavior indistinguishable from that of a human.
In the 1950s Arthur Samuel created the first computer learning program, which was designed to play checkers [3]. Around the same time, Frank Rosenblatt invented the perceptron [4], laying the groundwork for the Artificial Neural Network (ANN). In 1967, the nearest neighbor algorithm was presented [5], allowing computers to perform basic pattern recognition. In 1997, IBM's computer Deep Blue beat the world chess champion, Garry Kasparov [6][7]. In 2011, IBM's Watson was able to defeat two champions on the game show Jeopardy! [8].
One of the reasons behind the recent spurt in popularity and advancement in ML is the increase of available computational power along with the ever increasing amount of available data. This advancement has resulted in more advanced problems being solved with ML than ever before. One of the biggest advancements is the growing use of Deep Learning (DL).
DL is a subset of ML that uses computational models built up of numerous processing layers, allowing data to be learned at several levels of abstraction. DL has drastically improved the state-of-the-art in automatic speech recognition (ASR), object recognition, object detection and many other fields [9][10].
The improvement in ASR has led to many applications e.g. personal assistants, such as Apple’s Siri [11] or Microsoft’s Cortana [12], system controls in vehicles [13], assistance for people with disabilities [14], and many more. Even though ASR has come a long way, it still has many challenges left to overcome [15].
1.2 Purpose
ASR using only acoustic data depends heavily on the quality of the sound. If the data is polluted with noise or other disturbances, an ASR system has a much more difficult task than if the data were in perfect condition. As ASR systems are becoming increasingly popular in mobile devices and home entertainment systems, the demand increases for ASR to be robust to real-world noise and acoustic disturbances [16].
One technique for making ASR more robust is to use lip reading when possible [17]. By training a model to look not only at the acoustic data but also at the speaker's visual features, the visual data can help the model predict the correct answer even when the acoustic data is less than ideal [18]. An extreme case of acoustic disturbance is when there is no usable acoustic data at all. This requires the ASR system to go by visual data alone, leading to a more complex problem: pure lip reading.
The current state-of-the-art lip reading system is LipNet [19]. LipNet was trained on an Nvidia DGX-1 [20]: absolute state-of-the-art hardware for AI and DL that is not accessible to everyone. The heavy training of LipNet is therefore impractical, if possible at all, on much commonly available hardware.
This thesis aims to explore the possibility of reproducing the accuracies shown by LipNet with a model that can adequately be trained on the limited hardware available for this project: one Nvidia GTX 1060 GPU with 6 GB of memory. This will be attempted by extracting visual features, i.e. facial landmarks, from the videos before feeding these features to the model. This thesis also aims to evaluate the differences in accuracy and computational requirements between the created model and LipNet.
1.2.1 Problem Statement
• Can facial landmarks be used as efficient data for lip reading?
• What are the most important visual features when performing lip reading with DL?
• Does it require less computation when using facial landmarks compared to using images?
• Is it possible to train a model with DL on facial landmarks using an Nvidia GTX 1060 GPU or even more limited hardware?
• Is it possible to replicate the results of LipNet with this model?
1.3 Delimitations
In order to have a consistent dataset of reasonable size available for training and testing the models, the first delimitation is to consider only English. The dataset itself has a limited vocabulary consisting of certain words in a specific pattern, explained further in section 5.1.
The approach of using facial landmarks implicitly sets a delimitation of the project. Feeding each raw video frame to the model would also include the entire region around the speaker's mouth, context which the landmark representation discards and which may have an impact on lip-reading performance. This project will simply use the 24 coordinates gathered by an existing library that correspond to different parts of the speaker's lips; no other facial-feature tracking algorithm will be included.
Chapter 2
Related Work
This chapter presents previous solutions and research related to the subject of this thesis. It describes several methods for solving problems related to lip reading with Artificial Neural Networks, with a focus on LipNet as mentioned in the previous chapter.
2.1 LipNet: End-to-End Sentence-level Lipreading
Y. Assael et al. [19] introduce LipNet, a Neural Network (NN) architecture that maps sequences of video frames to text, predicting at the sentence level rather than word by word. LipNet uses spatiotemporal convolutions, a Recurrent Neural Network (RNN) and the Connectionist Temporal Classification (CTC) loss, trained entirely end-to-end on variable-length videos. On the GRID corpus [21] it achieves a 4.8% Word Error Rate (WER) using an overlapped speaker split, classifying full sentences instead of words, phonemes or visemes. The overlapped speaker split uses videos from all speakers for training but withholds a few videos from each speaker for validation purposes.
LipNet uses a couple of libraries to extract a small section from each frame containing a centered image of the speaker's mouth, which is sent as input to the model. As part of the evaluation of LipNet, three hearing-impaired members of Oxford Students' Disability Community were introduced to the GRID corpus and shown 300 random videos to measure their ability to read the lips of the speakers. On average they achieved a WER of 47.7% on videos of unseen speakers.
LipNet will be used as a comparative tool throughout the project and is described further in chapter 4.
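Since WER is the yardstick used throughout these comparisons, it helps to make it concrete: WER is the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch (the function name and the GRID-style example sentence are our own):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# one substituted word out of six reference words: WER = 1/6
print(round(wer("bin blue at f two now", "bin blue by f two now"), 3))
```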
2.2 Various Works Related to Lip Reading
Audio-visual recognition (AVR) is a solution to the Speech Recognition (SR) problem when the audio is corrupted. The goal of AVR is to use the information from one modality to complement the information in the other. The difficulty, however, is finding the correspondence between the audio and visual streams. A. Torfi et al. [22] use a coupled 3D Convolutional Neural Network (CNN) to find the correlation between the visual and audio information.
H. Akbari et al. [23] use a combination of CNNs, Long Short-Term Memory (LSTM) networks, and fully connected layers to reconstruct the original auditory spectrogram from silent lip-movement videos with a 98% correlation.
N. Rathee [24] proposes an algorithm consisting of feature extraction and classification for word recognition, where word prediction is done by a Learning Vector Quantization neural network. The algorithm is applied to recognizing ten Hindi words and achieves an accuracy of 97%.
A. Garg et al. [25] propose various solutions based on CNNs and LSTMs for lip reading. The best-performing model, using the concatenated sequence of all frames of the speaker's face, achieves a validation accuracy of 76%.
Gregory J. Wolff et al. [26] propose visual preprocessing algorithms for extracting relevant features from the frames of a grayscale video to use as input to a lipreading system. They also propose a hybrid speechreading system with two time-delayed NNs, one for images and one for acoustics, integrating their responses by independent opinion pooling. The hybrid system has a 25% lower error rate than the acoustic system alone, indicating that lipreading can improve SR.
Chapter 3
Theory
This chapter presents the theory behind the cornerstones of this project: ML, DL and the underlying network layers in the models, object detection and recognition, some linguistics and the ability to read lips, as well as SR.
3.1 Machine Learning
Broadly defined, ML is a collection of computational methods or algorithms that make accurate predictions based on collected data [27]. The learning algorithm improves by using its experience of previous data to provide more accurate predictions on new data by finding patterns. The accuracy and success rate of a learning algorithm depend greatly on the data used to train it: the sample size and complexity must be sufficiently large to allow the algorithm to analyze the data and find these patterns.
3.1.1 Supervised Learning
Supervised Learning (SL) tries to infer a function from labeled data [28]. An SL algorithm uses the labeled training data to determine a function that maps the input to the desired output. This function is then used to determine the output of new examples. More formally, given a set of N training examples {(x_1, y_1), ..., (x_N, y_N)}, where x_i is the feature vector of the i-th input object in the dataset and y_i is its corresponding output target, the learning algorithm tries to determine a function g : X → Y, where X is the input space and Y is the output space.
There are several steps that must be taken to solve a supervised learning problem:
1. Deciding on the training examples. If training a speech recognition system, for example, these could be single letters, words or whole sentences.
2. Collecting a dataset: either build one or find an existing one relevant to the task at hand. This dataset should consist of input objects and corresponding outputs.
3. Choosing the input feature representation. The learning success of the algorithm can depend highly on how the input object is represented.
4. Choosing a learning algorithm, e.g. support vector machines, decision trees or NNs.
5. Running the training algorithm on the dataset.
6. Evaluating the learned function.
3.1.2 Logistic Regression
Logistic Regression (LR) [29] is a statistical method for analyzing a dataset. It is used for binary classification problems, i.e. problems with only two output classes. The goal of LR is to find the coefficients that best describe the relationship between the input and the output classes.
With more than two output classes, multinomial logistic regression can be used. Another solution is to divide the problem into smaller binary problems, where a prediction is made for each output class and compared against the rest. This is called One-vs-Rest or One-vs-All.
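As a concrete illustration (the toy data and function names are our own, not from the source), a minimal binary logistic regression can be fitted with plain gradient descent:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit w, b so that p(y=1 | x) = sigmoid(w*x + b), by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x  # gradient of the log-loss w.r.t. w
            b -= lr * (p - y)      # gradient of the log-loss w.r.t. b
    return w, b

# toy 1-D problem: class 1 for large x, class 0 for small x
xs = [0.0, 0.5, 1.0, 2.5, 3.0, 3.5]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)

def predict(x: float) -> int:
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

print([predict(x) for x in xs])
```

For more than two classes, One-vs-Rest simply trains one such classifier per class and picks the class whose classifier outputs the highest probability.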
3.1.3 Optimization
Gradient Descent
Gradient Descent (GD) [30] is an optimization algorithm widely used in SL. It is used to find the parameters of a function that minimize a cost function. For a neuron, it measures how far the output is from the target. The most commonly used error is the sum of squared errors

ε = Σ_{p=1}^{P_T} (t_p − o_p)²,

where t_p is the target output, o_p is the actual output for the p-th pattern, and P_T is the total number of patterns in the training set.
The goal is to minimize ε. To do this, the gradient of ε is calculated in weight space, and the weight values are then moved along the negative gradient. Given a training pattern, the weights are updated with

v_i(t) = v_i(t − 1) + Δv_i(t),

with

Δv_i(t) = η (−∂ε/∂v_i),

where

∂ε/∂v_i = −2 (t_p − o_p) (∂f/∂net_p) z_{i,p},

and η > 0 is the learning rate, i.e. the size of the steps taken when changing the weights, net_p is the net input for pattern p, and z_{i,p} is the i-th input signal corresponding to pattern p.
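The update rule above can be exercised on a toy problem. The sketch below (our own illustration) trains a single linear neuron, so f(net) = net and ∂f/∂net_p = 1, to fit the target t = 2z:

```python
# Gradient descent on a single linear neuron o = v * z,
# minimizing the sum of squared errors over the training patterns.
patterns = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (z, t) pairs with t = 2z
v = 0.0      # the single weight
eta = 0.02   # learning rate
for _ in range(200):
    for z, t in patterns:
        o = v * z                   # linear activation: f(net) = net
        v += eta * 2 * (t - o) * z  # v <- v + eta * (-de/dv)
print(round(v, 4))  # converges to 2.0
```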
Adam
Adam [31] is an optimization algorithm that can be used instead of the classical GD algorithm. It is a gradient-based optimization algorithm that computes adaptive learning rates for each parameter. It updates exponential moving averages of the gradient (m_t) and the squared gradient (v_t). The exponential decay rates of these averages are controlled by the hyper-parameters β_1 and β_2. The configuration parameters of Adam are:

• α is the learning rate.
• β_1 is the exponential decay rate of the first moment estimate. A recommended value is 0.9.
• β_2 is the exponential decay rate of the second moment estimate. A recommended value is 0.999.
• ε is a small number to prevent division by zero. A recommended value is 10^(−8).

Pseudocode for the Adam optimization algorithm, requiring the above parameters:

m_0 ← 0 (initialize first moment vector)
v_0 ← 0 (initialize second moment vector)
t ← 0 (initialize timestep)
while θ_t not converged do
    t ← t + 1
    g_t ← ∇_θ f_t(θ_{t−1}) (get gradients w.r.t. stochastic objective at timestep t)
    m_t ← β_1 · m_{t−1} + (1 − β_1) · g_t (update biased first moment estimate)
    v_t ← β_2 · v_{t−1} + (1 − β_2) · g_t² (update biased second raw moment estimate)
    m̂_t ← m_t / (1 − β_1^t) (compute bias-corrected first moment estimate)
    v̂_t ← v_t / (1 − β_2^t) (compute bias-corrected second raw moment estimate)
    θ_t ← θ_{t−1} − α · m̂_t / (√v̂_t + ε) (update parameters)
end while
return θ_t (resulting parameters)
In his paper [32], S. Ruder compares several gradient descent optimization algorithms and concludes by recommending Adam as the best overall choice.
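The pseudocode maps directly onto a few lines of Python. The sketch below (our own, for a single scalar parameter) minimizes the toy objective f(θ) = (θ − 3)²:

```python
import math

def adam_minimize(grad, theta, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Adam, as in the pseudocode above, for a single scalar parameter."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g      # biased first moment estimate
        v = beta2 * v + (1 - beta2) * g * g  # biased second raw moment estimate
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(round(theta, 2))  # settles near the minimizer at 3
```

Note that with a constant α, Adam oscillates in a small neighborhood of the minimizer rather than converging exactly; in practice the learning rate is often decayed over time.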
3.1.4 Artificial Neural Networks
The human brain is an extraordinarily complex computer. It has the incredible ability to memorize and learn, and it completes complex tasks such as Pattern Recognition (PR) much faster than any computer. The brain is built up of large networks of simple Biological Neurons (BNs). Signals propagate through these networks, where neurons are connected via synapses. If the input signal to a neuron surpasses a certain threshold, the neuron transmits a signal to its connected neurons [33].
An Artificial Neuron (AN) is modeled on a BN. It receives input from other connected ANs, where each input signal is weighted by a numerical weight associated with its connection. The signal emitted by the AN is controlled by a function called the activation function: when the AN receives inputs, the sum of the weighted signals is used as input to the activation function, which calculates the output of the AN, as seen in the figure below [30].
[Figure 3.1: The structure of an Artificial Neuron. Inputs x_1, x_2, x_3 are weighted by w_1, w_2, w_3, summed, and passed through an activation function f to produce the output y.]
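The neuron's computation, a weighted sum passed through an activation function, is a one-liner in code. In this sketch (our own) the logistic sigmoid is used as an illustrative activation:

```python
import math

def neuron(inputs, weights, bias=0.0):
    """Artificial neuron: weighted sum of the inputs, passed through an activation."""
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-net))  # logistic sigmoid activation

# three inputs, three weights: net = 0.2 + 0.2 - 0.1 = 0.3
print(round(neuron([1.0, 0.5, -1.0], [0.2, 0.4, 0.1]), 4))
```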
An ANN consists of layered networks of ANs. The first layer is called the input layer, the last is called the output layer and all layers in between are called hidden layers.
[Figure 3.2: The structure of a shallow Artificial Neural Network. Four inputs feed a hidden layer, which feeds a single output.]
ANNs are used in many different types of applications today, such as speech recognition, image processing, pattern recognition and classification, and these are only a small sample of the applications using ANNs.
3.1.5 Deep Learning
The performance of machine learning algorithms depends highly on the representation of the given data, i.e. which features are included. Manually choosing which features in the data are important and which are not can be quite difficult: what a human considers an important feature might not be what a computer considers important. Allowing the computer not only to map the feature representation to the output, but also to map the data to the feature representation, often results in better performance than manually designed feature representations [10].
DL is a subfield of ML which utilizes deeper network architectures to enable the computer to build complex concepts out of simpler ones [34]. This enables the network to find, by itself, lower-level representations of higher-level features, allowing it to represent functions of higher complexity [10].
DL algorithms have led to many state-of-the-art results in several areas, among them ASR [35][36][37].
3.1.6 Recurrent Neural Networks
The idea behind RNNs is derived from the human ability to understand sequences. Many tasks, such as understanding speech or remembering the alphabet, are based on sequence [38]. Traditional feed-forward NNs do not have the ability to remember sequences.
To address this issue, RNNs introduce loops: the output of the network in one time step is fed back as input in the next time step [39]. Given a sequence x = (x_1, x_2, ..., x_T), the hidden state h_t is updated by

h_t = 0 if t = 0, and h_t = φ(h_{t−1}, x_t) otherwise,

where φ is a nonlinear function such as a composition of a logistic sigmoid with an affine transformation [40]. This results in an internal memory, allowing the network to process sequences. The update of the hidden state h_t is implemented as

h_t = g(W x_t + U h_{t−1}),

where g is a smooth bounded function such as a logistic sigmoid, and W and U are weight matrices.
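The recurrence h_t = g(W x_t + U h_{t−1}) can be run directly. A toy sketch (our own; scalar weights for readability, real layers use matrices) with g = tanh:

```python
import math

def rnn_step(x_t, h_prev, W, U):
    """One recurrent update: h_t = g(W*x_t + U*h_{t-1}), here with g = tanh."""
    return math.tanh(W * x_t + U * h_prev)

# run a short sequence through the recurrence, starting from h_0 = 0
W, U = 0.5, 0.8
h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = rnn_step(x, h, W, U)  # each new state depends on the whole history
print(round(h, 4))
```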
[Figure 3.3: The structure of a Recurrent Neural Network. A cell A maps input x_t to hidden state h_t; the loop is equivalent to the network unrolled over time steps x_1, x_2, x_3, ..., x_t.]
Even with the ability to remember sequences, RNNs suffer from the problem of long-term dependencies [41]: if two pieces of related information are separated by too large a gap in time steps, the RNN will have difficulty connecting them.
Long Short-Term Memory
LSTM is a variant of the RNN architecture which improves the handling of long-term dependencies [42]. While an RNN has only the simple structure of one NN, passing the output of one time step as input to the next, an LSTM has a more complex structure of four NNs, called gates. These gates determine what information should be kept in the system and what should be removed, preventing the decay of important information.
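In the standard formulation (our notation, chosen to be consistent with the RNN update above; the source describes the gates only qualitatively), the gates compute:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The forget gate f_t decides what to drop from the cell state, the input gate i_t what to add, and the output gate o_t what to expose as the hidden state h_t.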
[Figure 3.4: Gates of a Long Short-Term Memory unit. Sigmoid (σ) and tanh layers combine x_t and h_{t−1} to update the cell state and produce h_t.]