DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2018
A comparison between a conventional LSTM network and a grid LSTM network applied on speech recognition
GUSTAV EDHOLM, XUECHEN ZUO
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
www.kth.se
Abstract
In this paper, a comparison between the conventional LSTM network and the one-dimensional grid LSTM network applied on single-word speech recognition is conducted. The performance of the networks is measured in terms of accuracy and training time. The conventional LSTM model is the current state-of-the-art method to model speech recognition. However, the grid LSTM architecture has proven to be successful in solving other empirical tasks such as translation and handwriting recognition. When implementing the two networks in the same training framework with the same training data of single-word audio files, the conventional LSTM network yielded an accuracy rate of 64.8 % while the grid LSTM network yielded an accuracy rate of 65.2 %. Statistically, there was no significant difference in the accuracy rate between the models. In addition, the conventional LSTM network took 2 % longer to train. However, this difference in training time is considered to be of little significance when translating it to absolute time. Thus, it can be concluded that the one-dimensional grid LSTM model performs just as well as the conventional one.
Summary (Referat)
In this report, a comparison has been made between the conventional LSTM network and the one-dimensional grid LSTM network applied to single-word speech recognition. The performance of the models is measured in terms of the share of correct guesses and the training time. The conventional LSTM model is today the leading technique for modeling speech recognition. The grid LSTM network has, however, proven successful in modeling other empirical problems such as translation and handwriting recognition. When these two models are implemented in the same training framework with the same training data, the share of correct guesses is determined to be 64.8 % for the conventional LSTM network, while the grid LSTM network's share ends up at 65.8 %. Statistically, there is no significant difference in the share of correct guesses between the models. It was further found that the conventional LSTM model took 2 % longer to train. This time difference is, however, considered to be of negligible importance when translated to absolute time. Thus, the conclusion can be drawn that the grid LSTM network performs at least as well as the conventional LSTM network does.
Acknowledgements
We would like to thank our supervisor, Pawel Herman, for the support and interesting discussions throughout this project. We really appreciate your advice and suggestions on how to improve our work. Thank you!
We would also like to thank Google's Tensorflow team for providing their open source training framework and open source training data. This is a valuable resource when conducting research and we are grateful for being allowed to use it in this project. Thank you!
Contents
1 Introduction
1.1 Purpose and research question
1.2 Assumptions
1.3 The outline of the paper
2 Background
2.1 The classical HMM approach of speech recognition
2.2 RNN and the LSTM cell
2.3 The Grid LSTM architecture
2.4 Related work
2.5 Hypothesis statement
3 Method
3.1 Data
3.2 Training and testing
3.3 Models
3.3.1 The LSTM Model
3.3.2 The Grid LSTM Model
3.4 Evaluation metric
4 Results
4.1 Results and analysis of accuracy and training time
4.2 Results of the confusion matrices
4.3 Results on training time
5 Discussion
5.1 Discussion
5.2 Limitations
6 Conclusion
6.1 Future studies
1 Introduction
Speech recognition (SR) is the concept of a computer program decoding spoken words. The computer program takes a voice recording as input and outputs what was said as text. It is needed whenever a human interacts with a computer using only voice, often dictating what would otherwise be written in a text box, or giving commands. Many application areas for speech recognition have emerged in the past ten years, and many more are expected to emerge in the next ten. These applications include speaking to a virtual assistant and using a computer hands-free while driving. Increasing the speed and accuracy of speech recognition is essential for making these interactions more fluid and seamless. That is why speech recognition has been continuously researched and improved upon for decades.
Hidden Markov models (HMM) have been the state of the art in speech recognition for a long time [4], but new methods using recurrent neural networks (RNN), and in particular long short-term memory (LSTM) RNNs, have shown great results in recent years [4]. An LSTM network works as an ordinary RNN, but at each recursion there are different gates that regulate what to forget, what to add and what to remember. This gives the system a memory that spans many time steps backwards, which makes it useful in applications where there is a time-dependent context [8]. Speech is time dependent, and in recognizing speech it is important to look at the time context of each data point. Good results have been acquired using LSTMs for speech recognition [8].
To improve performance, several variations of LSTM architectures have been suggested. One of these is Grid LSTM, an architecture where LSTM blocks are interconnected in a multidimensional structure, rather than the traditional way of only having size in the time dimension [2]. Having a multidimensional structure capable of having size in the depth dimension opens up possibilities for deep LSTM networks. This variation of LSTM was found to work in several applications including character prediction, machine translation and digit recognition on the MNIST data set, with strong results [2]. Given the progress of LSTMs in speech recognition and the apparent strength of the grid architecture, a good next step is to try to apply the grid LSTM to speech recognition.
1.1 Purpose and research question
In this project, the answer to the following question is sought: What are the advantages and disadvantages of using a one-dimensional Grid LSTM network compared to a traditional LSTM network with regard to speech recognition? Measures of advantages and disadvantages include training time and accuracy of recognition. The comparison will be done by constructing two architectures, one traditional LSTM and one grid LSTM, and then training and testing both on the same data set. This will be done using Python and the Tensorflow library for machine learning functions.
The purpose of asking this is to investigate the applicability of the depth dimension of a Grid LSTM to SR, in order to advance speech recognition capabilities. Thus, this can be seen as a prestudy to further research on applying grid LSTM networks of higher dimension to SR. The objective of this project is to evaluate a one-dimensional grid LSTM model against the current state-of-the-art technology in the field of SR, which is the conventional LSTM model. It will be tested whether the depth dimension has any advantages or disadvantages over the best model today. Thus, the intended contributions of this research project to the research field are partly to widen the applicability of the grid LSTM architecture and partly to challenge the current state-of-the-art models of speech recognition.
1.2 Assumptions
In order to give a meaningful answer to the research question, assumptions that reflect the scope of the project are needed. To reduce the complexity of the speech data, the aim is to find a data set that makes up a one-hot problem. One-hot problems are problems where an ideal model lights up only one neuron in the output vector, with all other neurons equal to zero. The single lit-up neuron is the model's prediction. The one-hot model architecture is applicable whenever there is a limited number of prediction alternatives of which only one is correct. Letting each neuron of the output layer correspond to one alternative, only one neuron of the output layer should be lit up when a perfect prediction is made. In the field of speech recognition, one-hot data sets are data sets of labeled voice recordings where each recording consists of only one word. The number of possible labels must be limited. The assumption is made that the comparison is nevertheless meaningful for speech recognition problems with more complex data.
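The one-hot idea described above can be illustrated with a minimal Python sketch. The four-word label list and the activation values are made up for illustration and are not the label order or outputs of the models in this project.

```python
# Hypothetical label set for illustration (not the label order used in the thesis).
LABELS = ["yes", "no", "up", "down"]


def one_hot(label, labels=LABELS):
    """Return a vector with 1.0 at the label's index and 0.0 elsewhere."""
    return [1.0 if candidate == label else 0.0 for candidate in labels]


def predict(output_vector, labels=LABELS):
    """Pick the label whose output neuron has the highest activation."""
    best_index = max(range(len(output_vector)), key=lambda i: output_vector[i])
    return labels[best_index]


target = one_hot("up")                # ideal output for a recording of "up"
model_output = [0.1, 0.2, 0.6, 0.1]   # made-up activations from an imperfect model
print(target, predict(model_output))
```

A real model rarely produces an exact one-hot vector, so the prediction is taken as the neuron with the largest activation.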
In order to make a fair comparison that is based on mathematics and statistics, the difference in accuracy rates was tested through hypotheses. For these calculations, the assumption was made that the accuracy rates of both the conventional LSTM model and the grid LSTM model follow a normal distribution with the same standard deviation. The assumption of similar standard deviations is motivated by the fact that both models are trained with the exact same framework.
In contrast, the training time of each network is treated as constant rather than as a random variable, since each training step is assumed to take the same amount of time throughout the whole training process.
1.3 The outline of the paper
The outline of this paper is as follows. Chapter 2 contains a careful description, in chronological order, of conventional methods that have been used to model speech recognition. In chapter 3, the specific models and experimental setup for this particular project are described. The results produced by the two models are presented in chapter 4, and in chapter 5 they are compared and analyzed. Finally, in chapter 6, a discussion of the analysis is conducted and limitations and future studies in the field are discussed.
2 Background
Speech recognition is defined by the interaction between a computer and its surroundings through spoken or recorded words. In general, a speech recognition system takes acoustic waves as input and transforms them into time series data in either the frequency domain or the time domain. The computer then analyses the time series data in order to predict what word was spoken. Speech recognition has a wide range of applications and, because of that, the problem area of speech recognition can be divided into many subparts, such as speech-to-text, gender identification, speaker identification and acoustic modeling [4]. These applications share the same type of single-channel raw input data that is highly frequency and time dependent. Many approaches to applying deep learning to speech recognition have been tested [4]. In the following chapter, an orientation of previous work in the field of speech recognition is presented. The traditional approaches as well as state-of-the-art methods for speech recognition are described and a new possible architecture is evaluated.
2.1 The classical HMM approach of speech recognition
HMM was for decades the best performing method for speech recognition. The model consists of two probability distributions, the transition distribution and the observation distribution, that are assumed to be stationary. The transition distribution $P(y_t, y_{t-1})$ links the hidden states of the model together over time, and the observation distribution $P(x_t, y_t)$ defines how the input data $x$ is related to the hidden state at the current time step. However, a common problem is that the HMM requires a discrete state space, which often leads to unrealistic assumptions [4]. This in turn weakens the model.
For speech recognition, the discretization has traditionally been achieved through Gaussian mixture models. However, restricted Boltzmann machines have been proven more efficient for discretization. Therefore, research in the field of modeling speech recognition has focused on different types of restricted Boltzmann machines such as the mean-covariance restricted Boltzmann machine and the conditional restricted Boltzmann machine [4] [5].
However, recent work within the field of speech recognition has found an alternative to the HMM. A long short-term memory recurrent neural network (LSTM RNN) was used to achieve a classification error of 17.7 % on the TIMIT data set, which outperforms the HMM approach and challenges it as the standard way to model speech recognition tasks. The LSTM RNN can replace a combination of different techniques previously used, such as HMM combined with restricted Boltzmann machines [4].
2.2 RNN and the LSTM cell
The recurrent neural network (RNN) is one of the approaches used to model sequential data. It can be created from a feed-forward network by connecting the outputs of the neurons back to their inputs. This means that the output at the next time step depends not only on the current input data but also on previous output data [7]. A common way of training RNNs is through a method called backpropagation-through-time. When unfolded in time, RNNs can be viewed as very deep networks with shared parameters between the hidden layers. This gives rise to the problem of vanishing gradients, which means that the RNN cannot base its current output on outputs that are many time steps away, but only on the most recent output data. In other words, standard RNNs cannot possess long-term memory. This is problematic in areas where the current output depends strongly on previous outputs, such as words in a sentence, and it motivates network architectures that build more memory into RNNs [7]. One common way to do this is to use LSTM cells in an RNN. As mentioned before, this architecture has proven to be successful in achieving a low classification error for speech recognition [7].
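The vanishing gradient problem can be made concrete with a small numerical sketch. It assumes a hypothetical scalar RNN, $h_t = \tanh(w h_{t-1} + x_t)$, which is far simpler than any network used in this project, and tracks how strongly the state at step $t$ still depends on the initial state.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_t / d h_0| after each step of h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad, history = h0, 1.0, []
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
        history.append(abs(grad))
    return history


# Made-up weight and constant input, chosen only to show the trend.
history = gradient_through_time(w=0.5, inputs=[0.1] * 50)
print(history[0], history[9], history[49])
```

Because every factor in the product has magnitude below one in this setting, the influence of early time steps shrinks geometrically, which is the behavior that gated cells are designed to counteract.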
The conventional LSTM RNN consists of an RNN holding LSTM cells, also known as memory blocks, in its hidden layers [9]. Each block has a multiplicative input gate and output gate. The input gate controls which neuron activations will pass into the memory block, and the output gate controls which activations will be passed back to the RNN. Through this procedure, the LSTM RNN model can store and extract information from output data many time steps back, and hence this method has become a common way to model temporal sequences and their long-term dependencies. Furthermore, a forget gate was added to the LSTM model in order to be able to filter out data from the memory that has become irrelevant. Thus, the forget gate is used to segment the data stored in the memory and to reset parts of the memory blocks [7].
Figure 1: An illustration of the basic LSTM cell, with its forget gate, input gate and output gate, where x represents the input, y and h represent the output, and c represents the state, all with respect to t.
Mathematically, the LSTM model maps the input sequence $x_t$ to an output sequence $y_t$, where $t$ spans from 1 to $T$:

$$
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i) \\
f_t &= \sigma(W_{fx} x_t + W_{fm} m_{t-1} + W_{fc} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cm} m_{t-1} + b_c) \\
o_t &= \sigma(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_{t-1} + b_o) \\
m_t &= o_t \odot h(c_t) \\
y_t &= \phi(W_{ym} m_t + b_y)
\end{aligned}
$$

where $\odot$ denotes element-wise multiplication, $W$ represents a weight matrix, $b$ is a bias vector, $\sigma$ is the logistic sigmoid function and $\phi$ is the network output activation function. Furthermore, $i$ is the input gate, $f$ is the forget gate, $c$ is the cell activation vector, $m$ is the cell output activation vector and $o$ is the output gate, all of the same length. The function $g$ denotes the cell input activation function and the function $h$ denotes the cell output activation function [7].
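As a rough illustration of the gate equations above, the following pure-Python sketch performs LSTM time steps on small random weights. It assumes that both $g$ and $h$ are tanh, leaves out the final output layer $y_t$, and uses hypothetical helper names (`affine`, `lstm_step`, `random_params`); it is not the Tensorflow cell used in this project.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, vectors, bias):
    """Compute sum_k W_k v_k + b for matching lists of matrices and vectors."""
    out = list(bias)
    for W, v in zip(weights, vectors):
        for r, row in enumerate(W):
            out[r] += sum(w * x for w, x in zip(row, v))
    return out


def lstm_step(x, m_prev, c_prev, p):
    """One time step: returns the new cell output m_t and cell state c_t."""
    full = [x, m_prev, c_prev]
    i = [sigmoid(z) for z in affine([p["W_ix"], p["W_im"], p["W_ic"]], full, p["b_i"])]
    f = [sigmoid(z) for z in affine([p["W_fx"], p["W_fm"], p["W_fc"]], full, p["b_f"])]
    # Candidate input; g is assumed to be tanh in this sketch.
    g = [math.tanh(z) for z in affine([p["W_cx"], p["W_cm"]], [x, m_prev], p["b_c"])]
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    o = [sigmoid(z) for z in affine([p["W_ox"], p["W_om"], p["W_oc"]], full, p["b_o"])]
    # Cell output; h is assumed to be tanh in this sketch.
    m = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return m, c


def random_params(n_in, n_cell, rng):
    """Small random weights and zero biases; shapes follow from the equations."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for gate in "ifco":
        p["W_%sx" % gate] = mat(n_cell, n_in)
        p["W_%sm" % gate] = mat(n_cell, n_cell)
        p["b_%s" % gate] = [0.0] * n_cell
    for gate in "ifo":
        p["W_%sc" % gate] = mat(n_cell, n_cell)
    return p


rng = random.Random(0)
params = random_params(n_in=3, n_cell=4, rng=rng)
m, c = [0.0] * 4, [0.0] * 4
for x in [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4]]:   # made-up two-step input sequence
    m, c = lstm_step(x, m, c, params)
print(m)
```

The state pair (m, c) is carried from one step to the next, which is what gives the cell access to information from earlier time steps.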
Due to the cyclic connections of the LSTM architecture, the LSTM model has been successfully applied to many sequential problems. These include sequence labeling and sequence prediction tasks, such as language modeling, labeling of acoustic frames and handwriting recognition [8]. Previously, the method was limited to small-scale tasks such as bots that listen for keywords when answering customer service phones. However, recent work has expanded the applicability of the LSTM structure to large vocabulary speech recognition by evaluating and testing different sizes of the model's parameters [8].
2.3 The Grid LSTM architecture
With the advantages that LSTM networks bring, it has been appealing to generalize them for deep computation. In the past, LSTM cells have been applied in multidimensional arrays. In the Stacked LSTM method, LSTM layers are stacked on top of each other in the depth dimension, but are not integrated in both dimensions [2]. In the Multidimensional LSTM model, LSTM layers are stacked in an N-dimensional array. At each input $x$ the network receives N hidden vectors $h_1, ..., h_N$ and N memory vectors $m_1, ..., m_N$; it then outputs one hidden vector $h$ and one memory vector $m$. The hidden vector and the memory vector are then passed on as the new state in all N dimensions. In this method the memory vector $m$ grows combinatorially with the number of dimensions N and the size of each dimension. This problem causes instability in large networks, and is due to how the method calculates the vector $m$ [2].
To avoid this problem, the Grid LSTM architecture was introduced. In this method LSTM cells are stacked in N dimensions to form a grid structure. An N-dimensional grid LSTM network is denoted N-LSTM for short, and a 2-LSTM network is represented in figure 2. The method is similar to the Multidimensional LSTM model, but adds a mechanism that stops the memory vector $m$ from growing combinatorially. The main difference is that a grid LSTM block, instead of taking N hidden vectors and N memory vectors as input and outputting one of each, outputs N of each, denoted $h'_1, ..., h'_N$ and $m'_1, ..., m'_N$. It does so through the following calculation [2]. The N hidden vectors are concatenated into

$$
H = \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_N \end{bmatrix}
$$

and this concatenated vector is then used in the calculation of each new $(h', m')$ pair as

$$
\begin{aligned}
(h'_1, m'_1) &= \mathrm{LSTM}(H, m_1, W_1) \\
&\;\;\vdots \\
(h'_N, m'_N) &= \mathrm{LSTM}(H, m_N, W_N)
\end{aligned}
$$
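The block computation above can be sketched in a few lines of pure Python. The gate layout inside `lstm_transform` (input, forget and output gates without biases or peephole connections) is an assumption made for illustration, and all function names are hypothetical; the exact transform is defined in [2].

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_transform(H, m, W):
    """Map (H, m) to a new (h', m') pair using one dimension's weights W."""
    u = [sigmoid(z) for z in matvec(W["u"], H)]    # input gate
    f = [sigmoid(z) for z in matvec(W["f"], H)]    # forget gate
    o = [sigmoid(z) for z in matvec(W["o"], H)]    # output gate
    g = [math.tanh(z) for z in matvec(W["c"], H)]  # candidate memory content
    m_new = [ft * mt + ut * gt for ft, mt, ut, gt in zip(f, m, u, g)]
    h_new = [ot * math.tanh(mt) for ot, mt in zip(o, m_new)]
    return h_new, m_new


def grid_lstm_block(hiddens, memories, weights):
    """N incoming (h_k, m_k) pairs in, N outgoing (h'_k, m'_k) pairs out."""
    H = [value for h in hiddens for value in h]    # concatenate h_1, ..., h_N
    pairs = [lstm_transform(H, m, W) for m, W in zip(memories, weights)]
    return [h for h, _ in pairs], [m for _, m in pairs]


def random_weights(n_dims, n_units, rng):
    """One independent weight set W_k per dimension, with small random values."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(n_dims * n_units)]
                for _ in range(n_units)]
    return [{name: mat() for name in "ufoc"} for _ in range(n_dims)]


rng = random.Random(0)
N, d = 2, 3                                        # made-up sizes for a 2-LSTM block
weights = random_weights(N, d, rng)
hiddens = [[0.1] * d for _ in range(N)]
memories = [[0.0] * d for _ in range(N)]
new_hiddens, new_memories = grid_lstm_block(hiddens, memories, weights)
print(len(new_hiddens), len(new_memories))
```

Every dimension sees the same concatenated vector H but keeps its own memory vector and weights, so the number of memory vectors stays at N instead of growing with the grid.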
The Grid LSTM method has been tested on three different algorithmic tasks and three empirical tasks, showing very good results. These tasks include determining the parity of k-bit strings, adding 15-digit integers, and memorizing sequences of numbers, as well as character prediction, translation, and digit recognition [2].
In applying Grid LSTM to determining parity, a one-dimensional Grid LSTM has been shown to be successful for strings with k up to 250, which is very good compared to non-LSTM feed-forward networks, which show success for k up to 30 [2]. Adding 15-digit numbers was done with a 2-LSTM network and compared with a stacked LSTM network. The results show that after 5 million training examples the 2-LSTM network outperforms a stacked LSTM network [2]. A 2-LSTM network was also compared to a stacked LSTM network in memorization of randomly generated 20-symbol-long strings with 64-symbol vocabularies. Both networks had 100 hidden units and between 1 and 50 layers, and it was found that the 2-LSTM network outperformed the stacked LSTM network. It was further noted that the accuracy of the stacked LSTM network decreased with an increased number of layers. Above 16 layers the accuracy was consistently below 50 % [2].
In character prediction, a 2-LSTM network was tested on the Hutter challenge Wikipedia dataset. It achieved a bits-per-character score of 1.47, better than a stacked LSTM, an MRNN network, and a GFRNN network, despite having fewer parameters than both the stacked LSTM and the GFRNN network [2]. In translation, a novel approach has been introduced using a grid LSTM network [2]. Instead of using a traditional encoder-decoder method, a 3-LSTM network is applied. The method is evaluated using the IWSLT BTEC Chinese-to-English corpus, and is compared with the baseline state-of-the-art hierarchical phrase-based system CDEC. The 3-LSTM system reaches a perplexity of 4.54 on the test data, and it outperforms the CDEC baseline with scores of 51.8 (validation) and 60.2 (test), compared to CDEC's 50.1 (validation) and 58.9 (test) [2]. In digit recognition, a 3-LSTM network was applied to the MNIST data set. Two dimensions correspond to the 2-dimensional input of the pixel grid, and the third corresponds to the depth of the network. Two approaches are tested, one of which replaces the traditional LSTM transfer function with ReLU connections along the depth. The two methods perform near state-of-the-art, with test errors of 0.32 % and 0.36 % respectively [2].
In the applications tested so far, Grid LSTM networks have consistently performed better than the corresponding stacked LSTM networks on algorithmic tasks, and near state-of-the-art on empirical tasks when compared to the current standards. This shows great promise for the grid method and motivates further exploration of it.
Figure 2: An illustration of a 2D Grid LSTM architecture, with LSTM cells arranged along a time axis and a depth axis, where $x_t$ is the input, $y_t$ is the output, $c_t$ and $h_t$ are the state with respect to time $t$, and $a_l$ and $b_l$ are the state with respect to depth $l$.
2.4 Related work
Grid LSTM networks have been applied to speech recognition in the past. A prioritized Grid LSTM was proposed and trained on several different datasets, and compared with a Stacked LSTM, a Highway LSTM [11] and a non-prioritized Grid LSTM [1]. In this comparison the Stacked LSTM performed worse than the Grid LSTM, even with a small depth, and as the depth increased the Stacked LSTM performed worse while the Grid LSTM performed better [1]. The datasets used in the comparison are different collections of long sequences of speech with very long time dependencies. This makes them very different from single-word speech recognition, which has much shorter time dependencies. Therefore it is not certain that this result generalizes to the one-word case.
There have been other attempts made to model speech recognition successfully. A model that has proven to be effective both for one-hot speech recognition problems [6] and for more complex problems such as sentence classification [3] is the convolutional neural network (CNN). A one-dimensional CNN can be viewed as a series of operations between a vector of weights $m$ and a vector of inputs $s$. The vector $s$ can be interpreted as a sequence where each element $s_i$ corresponds to one data point extracted from the voice recording, and $m$ is the filter of the CNN. The idea is to take the scalar product of the two vectors in the following way

$$
c_j = m^{T} s_{j-m+1:j}
$$

where $c_j$ is the j-th element of a new sequence vector [3]. CNN models have proven to be less sensitive to speaker style and speaker variations, which makes them perform better than some DNN models [6]. More specifically, work in this field has shown that a CNN could perform better than the DNN Google uses for keyword spotting, which is a demonstration of an alternative method to model a one-hot problem.
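The scalar product above can be written out as a short Python sketch. It assumes that only windows lying fully inside the sequence are used, and the filter and sample values are made up for illustration.

```python
def conv1d(filter_weights, sequence):
    """Compute c_j as the dot product of the filter with the window ending at j."""
    width = len(filter_weights)
    outputs = []
    for j in range(width - 1, len(sequence)):
        window = sequence[j - width + 1 : j + 1]   # s_{j-m+1 : j}
        outputs.append(sum(w * s for w, s in zip(filter_weights, window)))
    return outputs


samples = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up audio samples
difference_filter = [-1.0, 1.0]       # made-up filter that measures local change
print(conv1d(difference_filter, samples))  # [1.0, 1.0, 1.0, 1.0]
```

The same filter weights are reused at every position, which is one reason such models can be robust to where in the recording a pattern occurs.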
2.5 Hypothesis statement
In light of the background and the research question, the following hypotheses can be formulated:
$H_0$: The grid LSTM network does not bring any changes to the accuracy rate compared to the conventional LSTM network.
$H_1$: The grid LSTM network does bring changes to the accuracy rate compared to the conventional LSTM network.
If the results were to show that there is a significant enough difference between the accuracies and training times, then $H_0$ can be rejected.
3 Method
Both the conventional and the Grid LSTM networks were implemented in an existing framework produced by Tensorflow under the Apache 2.0 License. The training and testing data is licensed under the Creative Commons Attribution 3.0 License, and is also a product of Tensorflow. The Tensorflow framework is designed so that different model types can be plugged into it. The framework imports and pre-processes the data and feeds it into the plugged-in model for training and testing.
3.1 Data
The data set was downloaded from the Simple Audio Recognition Tutorial at Tensorflow's website [10]. It consists of 65,000 one-second WAVE audio files of people saying one out of thirty different English words. Each audio file is labeled with one of twelve different labels. The labels are "yes", "no", "up", "down", "go", "stop", "left", "right", "on", "off", "silence" and "unknown". The labels represent the word that is spoken in the corresponding audio clip, unless the label is "silence" or "unknown". Files labeled "silence" are clips of background noise with no audible words spoken, and files labeled "unknown" contain spoken words that are not among the other labels but among the 20 other words. The reason for having the category "unknown" is that it is good to know how well the model can differentiate words it was trained to hear from words it was not.
The large size of the data set is seen as a valuable resource, although it makes the training process longer. In addition, the nature of the data creates a one-hot problem, which is relatively easy to implement. Thus, focus can be centered on creating the two models, which is essential for answering the research question.
The data was preprocessed so that the models receive the raw waveform of the audio, meaning the amplitude at each time step. The amplitude of the audio file at each time step was normalized to a number between -1 and 1. This is a very simple way to pre-process data, and was chosen for its simplicity and ease of implementation. With more time and resources, different approaches would have been tested. The data was also separated into three parts. Training data constituted 80 %, testing data 10 %, and validation data 10 %. The validation is done continuously during training in order to verify and tune parameters on data separate from the training data. Moreover, since the models train on labeled data, this constitutes supervised learning.
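The two preprocessing steps can be sketched as follows. The sketch assumes peak normalization and a random shuffle before splitting; the framework used in this project may normalize and partition the files differently, and the function names are hypothetical.

```python
import random


def normalize_amplitude(samples):
    """Scale raw amplitudes so that every value lies in [-1, 1]."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return [0.0 for _ in samples]   # pure silence: nothing to scale
    return [s / peak for s in samples]


def split_dataset(items, train=0.8, test=0.1, seed=0):
    """Shuffle and split into training, testing and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])


raw = [0, 1200, -3000, 1500]            # made-up integer sample values
normalized = normalize_amplitude(raw)
train_set, test_set, validation_set = split_dataset(range(100))
print(normalized)
print(len(train_set), len(test_set), len(validation_set))
```

Keeping the validation and testing parts disjoint from the training part is what allows them to measure performance on unseen recordings.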
3.2 Training and testing
The same implementation of Python code was used to train and test both models. This code was downloaded from Tensorflow's GitHub, where it was released as part of the Simple Audio Recognition Tutorial. This implementation of training and testing was considered appropriate mainly since it was written to fit this particular data set with its one-hot nature, but also since the code was structured in a way that made the implementation of alternative neural network models convenient. This made it possible to implement the two model functions without changing the training and testing code, which ensures that as few parameters as possible differ between the two models.
The conventional LSTM model and the Grid LSTM model were plugged into the existing framework, with the output format set to match the framework's expectations. The training was done in segments of 100 steps and the total number of training steps was 18 000.
At the end of training, the models were tested, and a confusion matrix was derived together with the overall accuracy on the testing data. A confusion matrix is an n × n matrix where n is the number of labels the WAVE audio files can be tagged with, in this case 12 labels. Each column in the matrix represents the set of samples predicted to be a given label. For example, the first column shows all words that were predicted to be "silence" and the second column shows all words that were predicted to be "unknown". Furthermore, each row represents the set of samples that were actually tagged with a given label. For example, the first row shows all words that were actually labeled "silence" and the second row shows all words that were actually labeled "unknown". Thus, a perfect neural network model would produce a confusion matrix with only zeros apart from the diagonal. This would mean that the number of predictions of each label exactly matches the number of samples that were tagged with the label. A visualization of the ideal confusion matrix is shown in table 1. The matrix is transformed into a table in order to be able to label each row and column and thereby provide a more descriptive representation of the ideal confusion matrix.
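| Label \ Guess | silence | unknown | yes | no | up | down | left | right | on | off | stop | go |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| silence | $n_{silence}$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| unknown | 0 | $n_{unknown}$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| yes | 0 | 0 | $n_{yes}$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| no | 0 | 0 | 0 | $n_{no}$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| up | 0 | 0 | 0 | 0 | $n_{up}$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| down | 0 | 0 | 0 | 0 | 0 | $n_{down}$ | 0 | 0 | 0 | 0 | 0 | 0 |
| left | 0 | 0 | 0 | 0 | 0 | 0 | $n_{left}$ | 0 | 0 | 0 | 0 | 0 |
| right | 0 | 0 | 0 | 0 | 0 | 0 | 0 | $n_{right}$ | 0 | 0 | 0 | 0 |
| on | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | $n_{on}$ | 0 | 0 | 0 |
| off | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | $n_{off}$ | 0 | 0 |
| stop | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | $n_{stop}$ | 0 |
| go | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | $n_{go}$ |

Table 1: An ideal confusion matrix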
The confusion matrix allows for a deeper understanding of the models' behavior by adding valuable information on which words the neural network most often predicted correctly. From this matrix, the accuracy for each word can be derived and analyzed. This is of interest in the comparison between the conventional and grid LSTM models, since the confusion matrix makes it possible to tell whether one model performs better than the other on particular words.
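As an illustration of how per-word accuracy follows from a confusion matrix, the sketch below builds a small matrix with rows as true labels and columns as predictions. The three-label set and the predictions are made up and do not come from the experiments in this project.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are the true labels, columns are the predicted labels."""
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(actual, predicted):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def per_label_accuracy(matrix, labels):
    """Diagonal entry divided by the row sum, for each label that has samples."""
    return {label: row[k] / sum(row)
            for k, (label, row) in enumerate(zip(labels, matrix)) if sum(row) > 0}


labels = ["yes", "no", "up"]                         # shortened label set
actual = ["yes", "yes", "no", "up", "up", "up"]      # made-up ground truth
predicted = ["yes", "no", "no", "up", "up", "yes"]   # made-up predictions
matrix = confusion_matrix(actual, predicted, labels)
accuracy = per_label_accuracy(matrix, labels)
print(matrix)     # [[1, 1, 0], [0, 1, 0], [1, 0, 2]]
print(accuracy)
```

Off-diagonal entries show which words are confused with each other, for example that one "up" recording was mistaken for "yes" in this made-up data.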
The training was done on a CPU due to the convenience and simplifications of the Tensorflow installation. All initial weights and biases were randomized.
3.3 Models
Two models were built, with help from the tutorial Deep learning with Python, Tensorflow and RNN by Sentdex. The models were written in Python and consist of one function each, in which the data constitutes the main input. These functions create the respective networks of cells using Tensorflow's built-in structures: BasicLSTMCell is used for the conventional LSTM network, and GridLSTMCell is used for the grid LSTM network. Each network is given a certain size, which is the number over which the recursion is done. The data is then put into the network, and a final output and state are obtained. The output is then passed through a matrix multiplication with a layer of weights and biases. This returns the final one-hot output that is passed back into the surrounding architecture.
3.3.1 The LSTM Model
Specific to the LSTM model is only the cell that is instantiated, a basic LSTM cell from Tensorflow's library. This cell functions as represented in figure 1, and is only extended in the time dimension as represented in figure 3, not in the depth dimension. The chosen size of the network was 3920 due to the size of the data vector. The function passes the input data through the layers of the basic LSTM network and returns the output, a one-hot vector.
Figure 3: A graphical representation of the conventional LSTM network as implemented in this project, with LSTM cells chained along the time axis and no extension along the depth axis
3.3.2 The Grid LSTM Model
The particulars of the implemented Grid LSTM are simple. The cell used is provided through the Tensorflow library and was built by Tensorflow to represent the model presented in [2]. The implementation made is a 1-D grid network, as shown in figure 4. This means that the network has a size only in the depth dimension, and none in the time dimension. This essentially strips the time dependency from the network and instead focuses on depth. The whole time series of data is given to the first node, and then propagates through grid cells in the depth dimension until the end, where the final result is returned. Because only the first cell is fed the time series, a forget gate is not used in this implementation. If a forget gate were used, too much information would be lost. The size of the network was set to 1960, half of the size of the input vector. This is due to how the Grid LSTM cell is implemented in Tensorflow.
[Figure 4 diagram: four LSTM cells connected in sequence by the signals a and b, indexed from l-3 to l+1; axes: Depth, Time.]
Figure 4: A graphical representation of the 1D Grid LSTM network implemented in this project
There are several reasons for implementing the grid method as a 1D structure. Having both tested models be 1D structures allows a more direct comparison, in which the differences in results can be attributed in large part to the choice of dimension, not the choice of number of dimensions. This gets closer to the essence of the issue. Also, implementing a 1D Grid LSTM is simpler than implementing a 2D grid, and due to resource limitations it was the more appropriate choice.
3.4 Evaluation metric
In order to measure the training time, the assumption was made that every training step takes the same amount of time. With this assumption, the first 400 training steps of each model were timed, denoted T(Grid) and T(LSTM). Because training time varies with hardware, the times were expressed relative to T(Grid), so the time metric is the ratio of a model's training time to the time it takes to train the Grid LSTM. It is assumed that the training time of each model does not vary between training sessions. Training and testing were performed on a 2.7 GHz CPU.
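The relative time metric can be sketched in a few lines of Python. The helper names `time_steps` and `relative_time` are hypothetical, `train_step` stands in for one training step of either model, and the two timings in the usage example are invented numbers rather than measurements from this project.

```python
import time


def time_steps(train_step, n_steps=400):
    """Return the wall-clock time in seconds for n_steps calls of train_step."""
    start = time.perf_counter()
    for _ in range(n_steps):
        train_step()
    return time.perf_counter() - start


def relative_time(t_model, t_grid):
    """Express a model's training time as a fraction of T(Grid)."""
    return t_model / t_grid


# Invented timings in seconds, used only to show the arithmetic.
t_grid_example = 100.0
t_lstm_example = 105.0
ratio = relative_time(t_lstm_example, t_grid_example)  # 1.05, i.e. 5 % longer than the grid model
```

Because the metric is a ratio of two times measured on the same machine, the hardware-dependent factor largely cancels out.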
The accuracy rates of the models, however, were assumed to follow a normal distribution. Therefore, each model was trained ten independent times. From this, the mean of the accuracy rate of each model was calculated. The mean of the conventional LSTM network was used as an estimate of the population mean, since the goal is to find out if the grid structure causes any difference in the mean. Furthermore, the sample standard deviation of the conventional LSTM model’s accuracy rate was calculated. This standard deviation was used as an estimate of the population standard deviation. A t-test is used to determine whether the resulting P-value is significant. The choice of test is motivated by the fact that the sample size is n = 10 + 10 = 20 < 30 and that the real population standard deviation is unknown. A two-tailed version of the t-test is used since the mean difference can potentially be either positive or negative.
Furthermore, the resulting confusion matrices of the two models were compared. In order to increase the reliability of the comparison, a statistical approach was used to find the correlation between the confusion matrices. A correlation test reveals to what extent the numbers in the two matrices vary together. In less mathematical terms, it shows whether corresponding areas of the two confusion matrices consist of elements of similar value. A number close to one would indicate that the matrices are similar, while a number close to zero would indicate that they are very different and independent of each other. The correlation was calculated in MATLAB, using MATLAB’s built-in function "r = corr2(A, B)". This function takes two 2D matrices A and B as inputs and returns the correlation r between them. This correlation is then used as the measure of how much the two matrices differ from one another.
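For readers without MATLAB, the quantity returned by corr2 (the correlation coefficient computed over all elements of two equally sized matrices) can be sketched in plain Python as follows. The function below is an illustrative re-implementation, and the two 3x3 matrices are invented examples, not the confusion matrices obtained in this project.

```python
import math


def corr2(a, b):
    """2-D correlation coefficient of two equally sized matrices (lists of rows)."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("matrices must have the same number of elements")
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    # Sum of products of deviations, normalised by both sums of squared deviations.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(flat_a, flat_b))
    var_a = sum((x - mean_a) ** 2 for x in flat_a)
    var_b = sum((y - mean_b) ** 2 for y in flat_b)
    return cov / math.sqrt(var_a * var_b)


# Invented confusion matrices for three classes (rows: true class, columns: predicted class).
conf_a = [[50, 3, 2], [4, 45, 6], [1, 5, 49]]
conf_b = [[48, 4, 3], [5, 46, 4], [2, 6, 47]]
r = corr2(conf_a, conf_b)
```

Two matrices that make the same kinds of mistakes in the same cells give a value of r near one, which is the sense in which the correlation is used as a similarity measure here.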
4 Results
4.1 Results and analysis of accuracy and training time
Both the conventional LSTM model and the grid LSTM model were trained 10 independent times. The resulting accuracy rates from each training session are shown in table 2.
             Standard LSTM   Grid LSTM
Session 1    64.0 %          66.0 %
Session 2    64.8 %          64.8 %
Session 3    64.5 %          63.9 %
Session 4    64.8 %          65.1 %
Session 5    64.8 %          64.2 %
Session 6    65.5 %          65.6 %
Session 7    65.2 %          65.8 %
Session 8    65.0 %          65.6 %
Session 9    65.3 %          65.6 %
Session 10   64.5 %          65.6 %
Table 2: The table shows the accuracy rate of the standard LSTM network and the grid LSTM network for each training session.
In order to conduct the statistical analysis, the P-value needs to be calculated. For this, the mean of each network’s accuracy rate is needed. The standard deviation of the accuracy rates, which was assumed to be the same for both models, is also needed. The means and the standard deviation are displayed in table 3.
Standard LSTM Grid LSTM
Mean µ 0.648 0.652
Std. dev. 0.0044 0.0070
Table 3: The mean and standard deviation of the accuracy rate for the conventional LSTM and the grid LSTM model respectively.
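The entries of table 3 can be recomputed from the session results in table 2. The short sketch below uses Python's standard `statistics` module; the variable names are chosen for illustration, and the sample standard deviation (dividing by n - 1) is assumed, which gives values close to those in the table.

```python
import statistics

# Accuracy rates from table 2, written as fractions.
lstm_acc = [0.640, 0.648, 0.645, 0.648, 0.648, 0.655, 0.652, 0.650, 0.653, 0.645]
grid_acc = [0.660, 0.648, 0.639, 0.651, 0.642, 0.656, 0.658, 0.656, 0.656, 0.656]

mean_lstm = statistics.mean(lstm_acc)
mean_grid = statistics.mean(grid_acc)

# statistics.stdev is the sample standard deviation (n - 1 in the denominator).
std_lstm = statistics.stdev(lstm_acc)
std_grid = statistics.stdev(grid_acc)

print(f"Standard LSTM: mean = {mean_lstm:.3f}, std. dev. = {std_lstm:.4f}")
print(f"Grid LSTM:     mean = {mean_grid:.3f}, std. dev. = {std_grid:.4f}")
```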
The mean and standard deviation of the conventional LSTM model’s accuracy rate are used as the population mean and standard deviation. This is a reasonable estimation since the goal is to compare the networks with the grid structure to those without the grid structure. Thus, the population parameters should be estimated from the networks without the grid structure. If the mean and standard deviation of the grid model’s accuracy had been included, it would instead mean that a model in the population would have a 50 % chance of having the grid structure. Then, the comparison would no longer be between models of pure grid structure and models without it.
[Plot: probability density (y-axis, 0 to 80) against accuracy (x-axis, 0.63 to 0.67); legend entry: Grid LSTM]