A comparison between a conventional LSTM network and a grid LSTM network applied on speech recognition

DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2018

GUSTAV EDHOLM XUECHEN ZUO

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


www.kth.se

Swedish title: En jämförelse mellan ett konventionellt LSTM-nätverk och ett Grid LSTM-nätverk tillämpat på taligenkänning

Abstract

In this paper, a comparison between the conventional LSTM network and the one-dimensional grid LSTM network applied on single word speech recognition is conducted. The performance of the networks is measured in terms of accuracy and training time. The conventional LSTM model is the current state-of-the-art method for modeling speech recognition. However, the grid LSTM architecture has proven to be successful in solving other empirical tasks such as translation and handwriting recognition. When implementing the two networks in the same training framework with the same training data of single word audio files, the conventional LSTM network yielded an accuracy rate of 64.8 % while the grid LSTM network yielded an accuracy rate of 65.2 %. There was no statistically significant difference in the accuracy rate between the models. In addition, the conventional LSTM network took 2 % longer to train. However, this difference in training time is considered to be of little significance when translating it to absolute time. Thus, it can be concluded that the one-dimensional grid LSTM model performs just as well as the conventional one.

Summary (Referat)

In this report, a comparison has been made between the conventional LSTM network and the one-dimensional grid LSTM network applied to single word speech recognition. The performance of the models is measured in terms of the proportion of correct guesses and the training time. The conventional LSTM model is currently the leading technique for modeling speech recognition. The grid LSTM network has, however, proven successful in modeling other empirical problems such as translation and handwriting recognition. When these two models are implemented in the same training framework with the same training data, the proportion of correct guesses of the conventional LSTM network is determined to be 64.8 %, while that of the grid LSTM network lands at 65.8 %. Statistically, there is no significant difference in the proportion of correct guesses between the models. It further turned out that the conventional LSTM model took 2 % longer to train. This time difference is, however, considered to be of negligible importance when translated to absolute time. Thus, the conclusion can be drawn that the grid LSTM network performs at least as well as the conventional LSTM network.

Acknowledgements

We would like to thank our supervisor, Pawel Herman, for the support and interesting discussions throughout this project. We really appreciate your advice and suggestions on how to improve our work. Thank you!

We would also like to thank Google's Tensorflow team for providing their open source training framework and open source training data. This is a valuable resource when conducting research and we are grateful for being allowed to use it in this project. Thank you!

1 Introduction
1.1 Purpose and research question
1.2 Assumptions
1.3 The outline of the paper

2 Background
2.1 The classical HMM approach of speech recognition
2.2 RNN and the LSTM cell
2.3 The Grid LSTM architecture
2.4 Related work
2.5 Hypothesis statement

3 Method
3.1 Data
3.2 Training and testing
3.3 Models
3.3.1 The LSTM Model
3.3.2 The Grid LSTM Model
3.4 Evaluation metric

4 Results
4.1 Results and analysis of accuracy and training time
4.2 Results of the confusion matrices
4.3 Results on training time

5 Discussion
5.1 Discussion
5.2 Limitations

6 Conclusion
6.1 Future studies

1 Introduction

Speech recognition (SR) is the task of a computer program decoding spoken words. The program takes a voice recording as input and outputs what was said as text. It is needed whenever a human interacts with a computer using only voice, often to dictate what would otherwise be typed into a text box, or to give commands. Many application areas for speech recognition have emerged in the past ten years, and many more are expected to emerge in the next ten. These applications include speaking to a virtual assistant and using a computer hands-free while driving. Increasing the speed and accuracy of speech recognition is essential for making these interactions more fluid and seamless. That is why speech recognition has been continuously researched and improved upon for decades.

Hidden Markov models (HMM) were the state of the art in speech recognition for a long time [4], but new methods using recurrent neural networks (RNN), and in particular long short-term memory (LSTM) RNNs, have shown great results in recent years [4]. An LSTM network works like an ordinary RNN, but at each recursion there are different gates that regulate what to forget, what to add and what to remember. This gives the system a memory that spans many time steps backwards, which makes it useful in applications where there is a time-dependent context [8]. Speech is time dependent, and in recognizing speech it is important to look at the time context of each data point. Good results have been achieved using LSTMs for speech recognition [8].

To improve performance, several variations of LSTM architectures have been suggested. One of these is Grid LSTM, an architecture where LSTM blocks are interconnected in a multidimensional structure, rather than extending only along the time dimension as in the traditional layout [2]. Having a multidimensional structure that can also extend along the depth dimension opens up possibilities for deep LSTM networks. This variation of LSTM was found to work in several applications including character prediction, machine translation and digit recognition on the MNIST data set, with strong results [2]. Given the progress of LSTMs in speech recognition and the apparent strength of the grid architecture, a natural next step is to apply the grid LSTM to speech recognition.

1.1 Purpose and research question

In this project, the answer to the following question is sought: What are the advantages and disadvantages of using a one-dimensional Grid LSTM network compared to a traditional LSTM with regard to speech recognition? Measures of advantages and disadvantages include training time and accuracy of recognition. The comparison will be done by constructing two architectures, one traditional LSTM and one grid LSTM, and then training and testing both on the same data set. This will be done using Python and the Tensorflow library for machine learning functions.

The purpose of asking this is to investigate the applicability of the depth dimension of a Grid LSTM to SR, in order to advance speech recognition capabilities. Thus, this can be seen as a prestudy to further research on applying grid LSTM networks of higher dimension to SR. The objective of this project is to evaluate a one-dimensional grid LSTM model against the current state-of-the-art technology in the field of SR, which is the conventional LSTM model. It will be tested whether the depth dimension has any advantages or disadvantages over the best model today. Thus, the intended contributions of this research project to the research field are partly to widen the applicability of the grid LSTM architecture and partly to challenge the current state-of-the-art models of speech recognition.

1.2 Assumptions

In order to give a meaningful answer to the research question, assumptions that reflect the scope of the project are needed. To reduce the complexity of the speech data, the aim is to find a data set that makes up a one-hot problem. One-hot problems are problems where an ideal model lights up exactly one neuron in the output vector, with all other neurons equal to zero. The single lit-up neuron is the model's prediction. The one-hot model architecture is applicable whenever there is a limited number of prediction alternatives of which only one is correct. Letting each neuron of the output layer correspond to one alternative, it follows that only one neuron of the output layer should be lit up when a perfect prediction is made. In the field of speech recognition, one-hot data sets consist of labeled voice recordings where each recording contains only one word. The number of possible labels must be limited. However, the assumption is made that the comparison is also meaningful for speech recognition problems with more complex data.
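The one-hot setup described above can be illustrated with a minimal sketch. The four-word label list and the output values are hypothetical and chosen only for illustration; they are not taken from the data set used in this project.

```python
# Hypothetical label set, for illustration only.
LABELS = ["yes", "no", "up", "down"]

def one_hot(label, labels=LABELS):
    """Return a vector with 1.0 at the index of `label` and 0.0 elsewhere."""
    vec = [0.0] * len(labels)
    vec[labels.index(label)] = 1.0
    return vec

def predict(output, labels=LABELS):
    """Pick the label whose output neuron has the highest activation."""
    best = max(range(len(output)), key=lambda i: output[i])
    return labels[best]

target = one_hot("up")                   # ideal output for a recording of "up"
model_output = [0.05, 0.10, 0.80, 0.05]  # made-up, imperfect network output
print(target, predict(model_output))
```

A real network rarely produces an exact one-hot vector, so the prediction is taken as the neuron with the largest activation.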

In order to make a fair comparison grounded in statistics, the difference in accuracy rates was tested with a hypothesis test. For these calculations, the assumption was made that the accuracy rates of both the conventional LSTM model and the grid LSTM model follow a normal distribution with the same standard deviation. The assumption of equal standard deviations is motivated by the fact that both models are trained with the exact same framework.

The training time of each network, by contrast, is treated as a constant, since each training step is assumed to take the same amount of time throughout the whole training process.

1.3 The outline of the paper

The outline of this paper is as follows. Chapter 2 contains a careful description, in chronological order, of conventional methods that have been used to model speech recognition. In chapter 3, the specific models and experimental setup for this particular project are described. Furthermore, the results produced by the two models are presented in chapter 4, and in chapter 5 they are compared and analyzed. Finally, in chapter 6, a discussion of the analysis is conducted, and limitations and future studies in the field are discussed.

2 Background

Speech recognition is defined by the interaction between a computer and its surroundings through spoken or recorded words. In general, a speech recognition system takes acoustic waves as input and transforms them into time series data in either the frequency domain or the time domain. The computer then analyses the time series data in order to predict what word was spoken. Speech recognition has a wide range of applications and, because of that, the problem area can be divided into many subparts, such as speech-to-text, gender identification, speaker identification and acoustic modeling [4]. These applications share the same type of single-channel raw input data that is highly frequency and time dependent. Many approaches to applying deep learning to speech recognition have been tested [4]. In the following chapter, an orientation of previous work in the field of speech recognition is presented. The traditional approaches as well as state-of-the-art methods for speech recognition are described, and a possible new architecture is evaluated.

2.1 The classical HMM approach of speech recognition

HMM was for decades the best performing method for speech recognition. The model consists of two probability distributions, the transition distribution and the observation distribution, that are assumed to be stationary. The transition distribution $P(y_t, y_{t-1})$ links the hidden layers in the model together, and the observation distribution $P(x_t, y_t)$ defines how the input data $x$ is related to the hidden layer at the current time step. However, a common problem is that the HMM requires a discrete state space, which often leads to unrealistic assumptions [4]. This in turn weakens the model.

For speech recognition, the discretization has traditionally been achieved through Gaussian mixture models. However, restricted Boltzmann machines have proven more efficient for discretization. Therefore, research in the field of modeling speech recognition has focused on different types of restricted Boltzmann machines, such as the mean-covariance restricted Boltzmann machine and the conditional restricted Boltzmann machine [4] [5].

However, recent work within the field of speech recognition has found an alternative to the HMM. A long short-term memory recurrent neural network (LSTM RNN) was used to achieve a classification error of 17.7 % on the TIMIT data set, which outperforms the HMM approach and challenges it as the standard way to model speech recognition tasks. The LSTM RNN can replace a combination of different techniques previously used, such as HMM combined with restricted Boltzmann machines [4].

2.2 RNN and the LSTM cell

The recurrent neural network (RNN) is one of the approaches used to model sequential data. It can be created from a feed-forward network by connecting the outputs of the neurons back to their inputs. This means that the output at the next time step depends not only on the current input data but also on previous output data [7]. A common way of training RNNs is a method called backpropagation-through-time. When unfolded in time, RNNs can be viewed as very deep networks with parameters shared between the hidden layers. This gives rise to the problem of vanishing gradients, which means that the RNN cannot base its current output on outputs that are many time steps away, but only on the most recent output data. One way to put it is that standard RNNs cannot possess long-term memory. This can be problematic in areas where the current output depends heavily on previous outputs, such as words in a sentence, and it motivates network architectures that build more memory into RNNs [7]. One common way to do this is to use LSTM cells in an RNN. As mentioned before, this architecture has proven to be successful in achieving a low classification error for speech recognition [7].
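The vanishing gradient effect can be made concrete with a toy example. The sketch below uses a scalar RNN of the form h_t = tanh(w h_{t-1} + u x_t); the weights w = 0.5 and u = 1.0 and the constant input are arbitrary assumptions made for illustration, not values from this project.

```python
import math

def rnn_forward(xs, w=0.5, u=1.0, h0=0.0):
    """Run a scalar RNN h_t = tanh(w*h_{t-1} + u*x_t); return all hidden states."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def grad_wrt_initial_state(hs, w=0.5):
    """d h_T / d h_0 by the chain rule: the product over t of w * (1 - h_t^2)."""
    grad = 1.0
    for h in hs[1:]:
        grad *= w * (1.0 - h * h)
    return grad

xs = [0.1] * 50  # made-up constant input sequence
hs = rnn_forward(xs)
for steps in (1, 10, 50):
    # Influence of the initial state on the state `steps` time steps later.
    print(steps, grad_wrt_initial_state(hs[: steps + 1]))
```

Each factor in the product is smaller than one in magnitude here, so the influence of an early state shrinks roughly geometrically with the number of time steps. This is the behavior that the gating mechanism of the LSTM cell is meant to counteract.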

The conventional LSTM RNN consists of an RNN holding LSTM cells, also known as memory blocks, in its hidden layers [9]. Each block has a multiplicative input gate and output gate. The input gate controls which neuron activations will pass into the memory block, and the output gate controls which activations will be passed back to the RNN. Through this procedure, the LSTM RNN model can store and extract information from output data many time steps back and hence, this method has become a common way to model temporal sequences and their long-term dependencies. Furthermore, a forget gate was added to the LSTM model in order to be able to filter out data from the memory that has become irrelevant. Thus, the forget gate is used to reset parts of the memory blocks [7].

Figure 1: An illustration of the basic LSTM cell where x represents the input, y and h represent the output, and c represents the state, all with respect to t.

Mathematically, the LSTM model maps the input sequence $x_t$ to an output sequence $y_t$, where $t$ spans from 1 to $T$:

$$
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i) \\
f_t &= \sigma(W_{fx} x_t + W_{fm} m_{t-1} + W_{fc} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cm} m_{t-1} + b_c) \\
o_t &= \sigma(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_{t-1} + b_o) \\
m_t &= o_t \odot h(c_t) \\
y_t &= \sigma(W_{ym} m_t + b_y)
\end{aligned}
$$

where $\odot$ denotes element-wise multiplication, $W$ represents a weight matrix, $b$ is a bias vector and $\sigma$ is the logistic sigmoid function. Furthermore, $i$ is the input gate, $f$ is the forget gate, $c$ is the cell activation vector, $m$ is the cell output activation vector and $o$ is the output gate, all of the same length. The function $g$ denotes the cell input activation function and the function $h$ denotes the cell output activation function [7].
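To make the recursion concrete, the following is a minimal sketch of one LSTM time step in plain Python, following the gate equations as written in this section (including the output gate reading the previous cell state). The parameter names, the random initialization, the choice of tanh for both g and h, and the omission of the final output layer y_t are assumptions made for illustration; this is not the Tensorflow implementation used in the project.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v):
    """Matrix-vector product W v for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(*vectors):
    """Element-wise sum of equally long vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def lstm_step(x, m_prev, c_prev, p, g=math.tanh, h=math.tanh):
    """One time step; p maps illustrative names such as 'W_ix' to parameters."""
    i = [sigmoid(z) for z in add(affine(p["W_ix"], x), affine(p["W_im"], m_prev),
                                 affine(p["W_ic"], c_prev), p["b_i"])]
    f = [sigmoid(z) for z in add(affine(p["W_fx"], x), affine(p["W_fm"], m_prev),
                                 affine(p["W_fc"], c_prev), p["b_f"])]
    cand = [g(z) for z in add(affine(p["W_cx"], x), affine(p["W_cm"], m_prev),
                              p["b_c"])]
    # New cell state: keep part of the old state, add part of the candidate.
    c = [ft * cp + it * ca for ft, cp, it, ca in zip(f, c_prev, i, cand)]
    o = [sigmoid(z) for z in add(affine(p["W_ox"], x), affine(p["W_om"], m_prev),
                                 affine(p["W_oc"], c_prev), p["b_o"])]
    m = [ot * h(ct) for ot, ct in zip(o, c)]
    return m, c

def init_params(n_in, n_cell, seed=0):
    """Small random weights and zero biases (an arbitrary illustrative choice)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for gate in "ifco":
        p[f"W_{gate}x"] = mat(n_cell, n_in)
        p[f"W_{gate}m"] = mat(n_cell, n_cell)
        p[f"b_{gate}"] = [0.0] * n_cell
    for gate in "ifo":
        p[f"W_{gate}c"] = mat(n_cell, n_cell)
    return p

n_in, n_cell = 3, 4
params = init_params(n_in, n_cell)
m, c = [0.0] * n_cell, [0.0] * n_cell
sequence = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]  # made-up input vectors
for x in sequence:
    m, c = lstm_step(x, m, c, params)
print(m, c)
```

The state pair (m, c) is carried from one call to the next, which is what gives the network its memory across time steps.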

Due to the cyclic connections of the LSTM architecture, the LSTM model has been successfully applied to many sequential problems. These include sequence labeling and sequence prediction tasks, such as language modeling, labeling of acoustic frames and handwriting recognition [8].

Previously, the method was limited to small scale tasks, such as bots that answer customer service phones and listen for keywords. However, recent work has expanded the applicability of the LSTM structure to large vocabulary speech recognition by evaluating and testing different sizes of the model's parameters [8].

2.3 The Grid LSTM architecture

With the advantages that LSTM networks bring, it has been appealing to generalize them for deep computation. In the past, LSTM cells have been applied in multidimensional arrays. In the Stacked LSTM method, LSTM layers are stacked on top of each other in the depth dimension, but are not integrated in both dimensions [2]. In the Multidimensional LSTM model, LSTM layers are stacked in an N-dimensional array. At each input $x$ the network receives $N$ hidden vectors $h_1, \ldots, h_N$ and $N$ memory vectors $m_1, \ldots, m_N$, and it then outputs one hidden vector $h$ and one memory vector $m$. The hidden vector and the memory vector are then passed on as the new state in all $N$ dimensions. In this method the memory vector $m$ grows combinatorially with the number of dimensions $N$ and the size of each dimension. This is a problem that causes instability in large networks, and is due to how the method calculates the vector $m$ [2].

To avoid this problem, the Grid LSTM architecture was introduced. In this method LSTM cells are stacked in N dimensions to form a grid structure. An N-dimensional grid LSTM network is denoted N-LSTM for short. A 2-LSTM network is represented in figure 2. This method is similar to the Multidimensional LSTM model, but differs in a mechanism that stops the memory vector $m$ from growing combinatorially. The main difference is that a grid LSTM block, instead of taking $N$ hidden vectors and memory vectors as input and outputting one of each, outputs $N$ of each, denoted $h'_1, \ldots, h'_N$ and $m'_1, \ldots, m'_N$. It does so through the following calculation [2]. The $N$ hidden vectors are concatenated into

$$
H = \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_N \end{bmatrix}
$$

and this vector is then used in the calculation of each new $(h', m')$ pair as

$$
\begin{aligned}
(h'_1, m'_1) &= \mathrm{LSTM}(H, m_1, W_1) \\
&\;\;\vdots \\
(h'_N, m'_N) &= \mathrm{LSTM}(H, m_N, W_N)
\end{aligned}
$$
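The block computation above can be sketched in a few lines of Python. The inner transform follows the gate structure of the LSTM function in [2] as far as the authors of this sketch understand it (three sigmoid gates and a tanh candidate, all computed from H); the weight layout with keys 'u', 'f', 'o' and 'c', the omission of bias terms, and the random initialization are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_transform(H, m, W):
    """LSTM(H, m, W): W holds four matrices keyed 'u', 'f', 'o', 'c' (assumed layout)."""
    g_u = [sigmoid(z) for z in matvec(W["u"], H)]    # input (update) gate
    g_f = [sigmoid(z) for z in matvec(W["f"], H)]    # forget gate
    g_o = [sigmoid(z) for z in matvec(W["o"], H)]    # output gate
    g_c = [math.tanh(z) for z in matvec(W["c"], H)]  # candidate content
    m_new = [f * mi + u * c for f, mi, u, c in zip(g_f, m, g_u, g_c)]
    h_new = [math.tanh(o * mn) for o, mn in zip(g_o, m_new)]
    return h_new, m_new

def grid_lstm_block(hs, ms, Ws):
    """Map N incoming (h_k, m_k) pairs to N outgoing (h'_k, m'_k) pairs."""
    H = [value for h in hs for value in h]  # concatenate h_1, ..., h_N
    outputs = [lstm_transform(H, m, W) for m, W in zip(ms, Ws)]
    return [h for h, _ in outputs], [m for _, m in outputs]

def random_weights(n_out, n_in, rng):
    return {key: [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(n_out)] for key in "ufoc"}

N, d = 2, 3  # two dimensions, three units per vector (arbitrary sizes)
rng = random.Random(0)
Ws = [random_weights(d, N * d, rng) for _ in range(N)]
hs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # made-up hidden vectors
ms = [[0.0] * d, [0.0] * d]              # made-up memory vectors
new_hs, new_ms = grid_lstm_block(hs, ms, Ws)
print(new_hs, new_ms)
```

Every dimension reads the same concatenated vector H but keeps its own memory vector and its own weights, which is why the block returns N pairs rather than one.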


The Grid LSTM method has been tested on three different algorithmic tasks and three empirical tasks, showing very good results. These tasks include determining the parity of k-bit strings, adding 15-digit integers, and memorizing sequences of numbers, as well as character prediction, translation, and digit recognition [2].

In applying Grid LSTM to determining parity, a one-dimensional Grid LSTM has been shown to be successful for strings with k ≤ 250, which is very good compared to non-LSTM feed-forward networks, which show success for k ≤ 30 [2]. Adding 15-digit numbers was done with a 2-LSTM network and compared with a stacked LSTM network. The results show that after 5 million training examples the 2-LSTM network outperforms a stacked LSTM network [2]. A 2-LSTM network was also compared to a stacked LSTM network in memorization of randomly generated strings of 20 symbols with 64-symbol vocabularies. Both networks had 100 hidden units and between 1 and 50 layers, and it was found that the 2-LSTM network outperformed the stacked LSTM network. It was further noted that the accuracy of the stacked LSTM network decreased with an increased number of layers. Above 16 layers the accuracy was consistently below 50 % [2].

In character prediction, a 2-LSTM network was tested on the Hutter challenge Wikipedia dataset. It achieved a bits-per-character score of 1.47, better than a stacked LSTM, an MRNN network, and a GFRNN network, despite having fewer parameters than both the stacked LSTM and the GFRNN network [2]. In translation, a novel approach has been introduced using a grid LSTM network [2]. Instead of using a traditional encoder-decoder method, a 3-LSTM network is applied. The method is evaluated using the IWSLT BTEC Chinese-to-English corpus, and is compared with the baseline state-of-the-art hierarchical phrase-based system CDEC. The 3-LSTM system reaches a perplexity of 4.54 on the test data, and it outperforms the CDEC baseline with scores of 51.8 on the validation set and 60.2 on the test set, compared to CDEC's 50.1 and 58.9 [2]. In digit recognition, a 3-LSTM network was applied to the MNIST data set. Two dimensions correspond to the 2-dimensional input of the pixel grid, and the third corresponds to the depth of the network. Two approaches are tested, one of which replaces the traditional LSTM transfer function with ReLU connections along the depth. The two methods perform near state-of-the-art, with test errors of 0.32 and 0.36 respectively [2].

In the so far tested applications, Grid LSTM networks have consistently performed better than respective stacked LSTM networks on algorithmic tasks, and near state-of-the-art on empirical tasks, when compared to the current standards. This shows great promise for the grid method, and motivates further exploration of the method.

Figure 2: An illustration of a 2D Grid LSTM architecture where $x_t$ is the input, $y_t$ is the output, $c_t$ and $h_t$ are the state with respect to time $t$, and $a_l$ and $b_l$ are the state with respect to depth $l$.

2.4 Related work

Grid LSTM networks have been applied to speech recognition in the past. A prioritized Grid LSTM was proposed and trained on several different datasets, and compared with a Stacked LSTM, a Highway LSTM [11] and a non-prioritized Grid LSTM [1]. In this comparison the Stacked LSTM performed worse than the Grid LSTM, even with a small depth, and as the depth increased the Stacked LSTM performed worse while the Grid LSTM performed better [1]. The datasets used in the comparison are different collections of long sequences of speech with very long time dependencies. This makes them very different from single word speech recognition, which has much shorter time dependencies. Therefore it is not certain that this result generalizes to the one-word case.

There have been other successful attempts to model speech recognition. A model that has proven to be effective both for one-hot speech recognition problems [6] and for more complex problems such as sentence classification [3] is the convolutional neural network (CNN). A one-dimensional CNN can be viewed as a series of operations between a vector of weights $m$ and a vector of inputs $s$. The vector $s$ can be interpreted as a sequence where each element $s_i$ corresponds to one data point extracted from the voice recording, and $m$ is the filter of the CNN. The idea is to take the scalar product of the filter with successive windows of the sequence in the following way

$$
c_j = m^T s_{j-m+1:j}
$$

where $c_j$ is the $j$:th element of a new sequence vector [3]. CNN models have proven to be less sensitive to speaker style and speaker variations, which makes them perform better than some DNN models [6]. More specifically, work in this field has shown that a CNN could perform better than the DNN Google uses for keyword spotting, which demonstrates an alternative method to model a one-hot problem.
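As a small worked example of the convolution formula, the sketch below slides a filter over a sequence and takes the scalar product with each full window. Only windows that fit entirely inside the sequence are used, which is an assumption about the boundary handling; the sample values and the filter are made up.

```python
def conv1d(s, m):
    """Narrow 1D convolution: scalar product of filter m with each full window of s."""
    k = len(m)  # filter length
    return [sum(w * x for w, x in zip(m, s[j - k + 1 : j + 1]))
            for j in range(k - 1, len(s))]

signal = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up samples s_i
filt = [0.5, 0.5]                   # made-up filter (a two-point moving average)
print(conv1d(signal, filt))         # [1.5, 2.5, 3.5, 4.5]
```

With a filter of length k and a sequence of length n, this variant produces n - k + 1 output elements.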

2.5 Hypothesis statement

In light of the background and the research question, the following hypotheses can be formulated:

$H_0$: The grid LSTM network does not bring any changes to the accuracy rate compared to the conventional LSTM network.

$H_1$: The grid LSTM network does bring changes to the accuracy rate compared to the conventional LSTM network.

If the results were to show that there is a significant enough difference between the accuracies and training time, then $H_0$ can be rejected.

3 Method

Both the conventional and the Grid LSTM networks were implemented in an existing framework produced by Tensorflow under the Apache 2.0 License. The training and testing data is released under the Creative Commons Attribution 3.0 License, and is also a product of Tensorflow. The Tensorflow framework is designed to support different model types, so that any type of model can be plugged into it. The framework imports and pre-processes the data and feeds it into the plugged-in model for training and testing.

3.1 Data

The data set was downloaded from the Simple Audio Recognition Tutorial at Tensorflow's website [10]. It consists of 65,000 one-second WAVE audio files of people saying one out of thirty different English words. Each audio file is labeled with one of twelve different labels. The labels are "yes", "no", "up", "down", "go", "stop", "left", "right", "on", "off", "silence" and "unknown". The labels represent the word that is spoken in the corresponding audio clip, unless the label is "silence" or "unknown". Files labeled "silence" are clips of background noise with no audible words spoken, and files labeled "unknown" contain spoken words that are not among the other labels, but among the 20 other words. The reason for having the category "unknown" is that it is good to know how well the model can differentiate words it was trained to hear from words it was not.

The large size of the data set is seen as a valuable resource, although it makes the training process longer. In addition, the nature of the data creates a one-hot problem, which is relatively easy to implement. Thus, focus can be centered on creating the two models, which is essential for answering the research question.

The data was preprocessed so that the input is the raw waveform of the audio, meaning the amplitude at each time step. The amplitude of the audio file at each time step was normalized to a number between -1 and 1. This is a very simple way to pre-process data, and was chosen for its simplicity and ease of implementation. With more time and resources, different approaches would have been tested. The data was also separated into three parts. Training data constituted 80 %, testing data 10 %, and validation data 10 %. The validation is done continuously during training in order to verify and tune parameters on data separate from the training data. Moreover, since the models train on labeled data, this constitutes supervised learning.
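The preprocessing described above can be sketched as follows. Both helper functions are hypothetical: the text does not state whether the amplitudes are scaled by the peak of each clip or by a fixed constant, nor how the framework assigns files to the three parts, so peak scaling and a seeded random shuffle are used here as illustrative assumptions.

```python
import random

def normalize_amplitudes(samples):
    """Scale raw samples so that every value lies in [-1, 1] (peak scaling assumed)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return [0.0 for _ in samples]  # pure silence stays at zero
    return [s / peak for s in samples]

def split_dataset(items, train=0.8, test=0.1, seed=0):
    """Shuffle and split into training, testing and validation parts (80/10/10)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_test = int(len(items) * test)
    return (items[:n_train],
            items[n_train:n_train + n_test],
            items[n_train + n_test:])

clip = [-32768, 16384, 0, 8192]  # made-up 16-bit sample values
print(normalize_amplitudes(clip))

train_set, test_set, validation_set = split_dataset(range(100))
print(len(train_set), len(test_set), len(validation_set))
```

Keeping the three parts disjoint is what allows the validation and testing accuracy to be measured on recordings the model has not been trained on.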

3.2 Training and testing

The same Python code was used to train and test both models. This code was downloaded from Tensorflow's GitHub repository, where it was released as part of the Simple Audio Recognition Tutorial. This implementation of training and testing was considered appropriate mainly because it was written to fit this particular data set with its one-hot nature, but also because the code was structured in a way that made the implementation of alternative neural network models convenient. This made it possible to implement the two model functions without changing the training and testing code, which ensures that as few parameters as possible differ between the two models.

The conventional LSTM model and the Grid LSTM model were plugged into the existing framework, with the output format set to match the framework's expectations. The training was done in segments of 100 steps and the total number of training steps was 18,000.

At the end of training, the models were tested, and a confusion matrix was derived together with the overall accuracy on the testing data. A confusion matrix is an n × n matrix where n is the number of labels the WAVE audio files can be tagged with, in this case 12 labels. Each column in the matrix represents the set of samples predicted to have a given label. For example, the first column shows all words that were predicted to be "silence" and the second column shows all words that were predicted to be "unknown". Furthermore, each row represents the set of samples that were actually tagged with a given label. For example, the first row shows all words that were actually labeled "silence" and the second row shows all words that were actually labeled "unknown". Thus, a perfect neural network model would produce a confusion matrix with only zeros apart from the diagonal. This would mean that the number of predictions of each label exactly matches the number of samples that were tagged with the label. A visualization of the ideal confusion matrix is shown in table 1. The matrix is presented as a table in order to label each row and column and thereby provide a more descriptive representation of the ideal confusion matrix.

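| Label \ Guess | silence | unknown | yes | no | up | down | left | right | on | off | stop | go |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| silence | $n_{silence}$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| unknown | 0 | $n_{unknown}$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| yes | 0 | 0 | $n_{yes}$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| no | 0 | 0 | 0 | $n_{no}$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| up | 0 | 0 | 0 | 0 | $n_{up}$ | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| down | 0 | 0 | 0 | 0 | 0 | $n_{down}$ | 0 | 0 | 0 | 0 | 0 | 0 |
| left | 0 | 0 | 0 | 0 | 0 | 0 | $n_{left}$ | 0 | 0 | 0 | 0 | 0 |
| right | 0 | 0 | 0 | 0 | 0 | 0 | 0 | $n_{right}$ | 0 | 0 | 0 | 0 |
| on | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | $n_{on}$ | 0 | 0 | 0 |
| off | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | $n_{off}$ | 0 | 0 |
| stop | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | $n_{stop}$ | 0 |
| go | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | $n_{go}$ |

Table 1: An ideal confusion matrix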

The confusion matrix allows for a deeper understanding of the models' behavior by adding valuable information on which words the neural network most often predicted correctly. From this matrix, the accuracy for each word can be derived and analyzed. This is of interest in the comparison between the conventional and grid LSTM models, since the confusion matrix makes it possible to tell whether one model performs better than the other on particular words.
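How the confusion matrix and the accuracies relate can be shown with a short sketch. The label order follows table 1, with rows as actual labels and columns as predicted labels; the function names and the four example predictions are made up for illustration and do not come from the framework.

```python
LABELS = ["silence", "unknown", "yes", "no", "up", "down",
          "left", "right", "on", "off", "stop", "go"]

def confusion_matrix(actual, predicted, labels=LABELS):
    """Rows index the actual label, columns index the predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def per_label_accuracy(matrix, labels=LABELS):
    """Fraction of each actual label that was predicted correctly (row-wise)."""
    result = {}
    for i, label in enumerate(labels):
        total = sum(matrix[i])
        if total:  # skip labels that do not occur in the sample
            result[label] = matrix[i][i] / total
    return result

def overall_accuracy(matrix):
    """Sum of the diagonal divided by the total number of samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

actual = ["yes", "yes", "no", "go"]     # made-up true labels
predicted = ["yes", "no", "no", "go"]   # made-up model guesses
cm = confusion_matrix(actual, predicted)
print(overall_accuracy(cm), per_label_accuracy(cm))
```

Off-diagonal entries show which words are confused with each other, which the overall accuracy alone cannot reveal.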

The training was done on a CPU due to the convenience and simplifications of the Tensorflow installation. All initial weights and biases were randomized.

3.3 Models

Two models were built, with help from the Deep learning with Python, Tensorflow and RNN tutorial by Sentdex. The models were written in Python, and consist of one function each, in which the data constitutes the main input. These functions create the respective networks of cells using Tensorflow's built-in structures: BasicLSTMCell is used for the conventional LSTM network, and GridLSTMCell is used for the grid LSTM network. Each network is given a certain size, which is the number over which the recursion is done. The data is then put into the network, and a final output and state are obtained. The output is then passed through a final layer, a matrix multiplication with a layer of weights plus biases. This returns the final one-hot output that is passed back into the surrounding architecture.

3.3.1 The LSTM Model

The only thing specific to the LSTM model is the cell that is instantiated, a basic LSTM cell from Tensorflow's library. This cell functions as represented in figure 1, and is only extended in the time dimension as represented in figure 3, not in the depth dimension. The chosen size of the network was 3920 due to the size of the data vector. The function passes the input data through the layers of the basic LSTM network and returns the output, a one-hot vector.

Figure 3: A graphical representation of the conventional LSTM network as implemented in this project

3.3.2 The Grid LSTM Model

The particulars of the implemented Grid LSTM are simple. The cell used is provided through the Tensorflow library and was built by Tensorflow to represent the model presented in [2]. The implementation is a 1-D grid network, as shown in figure 4. This means that the network has a size only in the depth dimension, and none in the time dimension. This essentially strips the time dependency from the network and instead focuses on depth. The whole time series of data is given to the first node, and then propagates through grid cells in the depth dimension until the end, where the final result is returned. Because only the first cell is fed the time series, a forget gate is not used in this implementation. If a forget gate were used, too much information would be lost. The size of the network was set to 1960, half of the size of the input vector. This is due to how the Grid LSTM cell is implemented in Tensorflow.

Figure 4: A graphical representation of the 1D Grid LSTM network implemented in this project. The LSTM cells are stacked along the depth axis; there is no extension along the time axis.

There are several reasons for implementing the grid method as a 1D structure. Having both tested models be 1D structures allows a more direct comparison, where the differences in results can be attributed in large part to the choice of dimension (time or depth), not to the number of dimensions. This gets closer to the essence of the issue. Also, implementing a 1D Grid LSTM is simpler than implementing a 2D grid, and due to resource limitations it was a more appropriate choice.

3.4 Evaluation metric

In order to measure the training time, the assumption was made that every training step takes the same amount of time. With this assumption, the first 400 training steps of each model were timed, and the resulting times are denoted T(Grid) and T(LSTM). Because training time varies depending on hardware, the times were calculated relative to T(Grid); the metric of time is thus standardized to a fraction of the time it takes to train the Grid LSTM. It is assumed that the training time of each model does not vary between training sessions. Training and testing were performed on a 2.7 GHz CPU.
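The timing procedure can be sketched as follows. This is a minimal illustration and not the project's measurement code: the function names are hypothetical, `train_step` stands for any callable that performs one training step, and the two timings in the usage example are made-up numbers.

```python
import time


def time_training_steps(train_step, n_steps=400):
    """Return the wall-clock seconds spent on n_steps calls of train_step."""
    start = time.perf_counter()
    for _ in range(n_steps):
        train_step()
    return time.perf_counter() - start


def relative_times(t_lstm, t_grid):
    """Express both timings as fractions of the Grid LSTM timing T(Grid)."""
    return {"LSTM": t_lstm / t_grid, "Grid": t_grid / t_grid}


# Hypothetical timings in seconds for the first 400 steps of each model.
ratios = relative_times(t_lstm=510.0, t_grid=500.0)
print(ratios)
```

Dividing by T(Grid) removes the dependence on the hardware, as long as both models are timed on the same machine.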

The accuracy rates of the models, however, were assumed to follow a normal distribution. Therefore, each model was trained ten independent times. From this, the mean of the accuracy rate of each model was calculated. The mean of the conventional LSTM network was used as an estimate of the population mean, since the goal is to find out whether the grid structure causes any difference in the mean. Furthermore, the sample standard deviation of the conventional LSTM model’s accuracy rate was calculated and used as an estimate of the population standard deviation. A t-test is used to determine whether the P-value is significant or not. The choice of test is motivated by the fact that the sample size is n = 10 + 10 = 20 < 30 and that the real population standard deviation is unknown. A two-tailed version of the t-test is used since the mean difference can potentially be either positive or negative.

Furthermore, the resulting confusion matrices of the two models were compared. In order to increase the reliability of the comparison, a statistical approach was used to find the correlation between the confusion matrices. A correlation test reveals to what extent the numbers in the two matrices vary together. In less mathematical terms, it shows whether corresponding elements of the two confusion matrices are of similar value or not. A number close to one indicates that the matrices are similar, while a number close to zero indicates that they are very different and independent of each other. The correlation was calculated in MATLAB, using MATLAB’s built-in function "r = corr2(A, B)". This function takes two 2D matrices A and B as inputs and returns the correlation r between them. This correlation is then used as the measure of how much the two matrices differ from one another.
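For readers without MATLAB, the 2-D correlation coefficient can be sketched in plain Python. The function below follows the usual definition (the sum of products of the deviations from each matrix mean, divided by the square root of the product of the summed squared deviations); it is an illustrative re-implementation, not the code used in the project, and the two small matrices are made-up examples.

```python
import math


def corr2(a, b):
    """2-D correlation coefficient between two equally sized matrices."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("matrices must have the same number of elements")
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(flat_a, flat_b))
    var_a = sum((x - mean_a) ** 2 for x in flat_a)
    var_b = sum((y - mean_b) ** 2 for y in flat_b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical 2x2 "confusion matrices" with strong diagonals.
a = [[9, 1], [2, 8]]
b = [[8, 2], [1, 9]]
r = corr2(a, b)
print(round(r, 2))  # 0.96
```

Two matrices that share the same pattern of large and small elements give a value close to one, which is the behaviour the comparison relies on.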

4 Results

4.1 Results and analysis of accuracy and training time

Both the conventional LSTM model and the grid LSTM model were trained 10 independent times. The resulting accuracy rates from each training session are shown in table 2.

Session       Standard LSTM   Grid LSTM
Session 1     64.0 %          66.0 %
Session 2     64.8 %          64.8 %
Session 3     64.5 %          63.9 %
Session 4     64.8 %          65.1 %
Session 5     64.8 %          64.2 %
Session 6     65.5 %          65.6 %
Session 7     65.2 %          65.8 %
Session 8     65.0 %          65.6 %
Session 9     65.3 %          65.6 %
Session 10    64.5 %          65.6 %

Table 2: The table shows the accuracy rate of the standard LSTM network and the grid LSTM network for each training session.

In order to conduct the statistical analysis, the P-value needs to be calculated. For this, the mean of each network’s accuracy rate is needed. The standard deviation of the accuracy rates, which was assumed to be the same for both models, is also needed. The means and the standard deviations are displayed in table 3.

             Standard LSTM   Grid LSTM
Mean µ       0.648           0.652
Std. dev.    0.0044          0.0070

Table 3: The mean and standard deviation of the accuracy rate for the conventional LSTM and the grid LSTM model respectively.
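The values in table 3 can be recomputed from the session results in table 2 with a few lines of Python. The sketch assumes that the sample standard deviation (division by n - 1) was used; with that assumption the results should agree with table 3 up to rounding in the last digit.

```python
import statistics

# Accuracy rates per training session, copied from table 2 (in percent).
lstm = [64.0, 64.8, 64.5, 64.8, 64.8, 65.5, 65.2, 65.0, 65.3, 64.5]
grid = [66.0, 64.8, 63.9, 65.1, 64.2, 65.6, 65.8, 65.6, 65.6, 65.6]

# Convert from percent to fractions, as in table 3.
mean_lstm = statistics.mean(lstm) / 100
mean_grid = statistics.mean(grid) / 100

# statistics.stdev is the sample standard deviation (divides by n - 1).
std_lstm = statistics.stdev(lstm) / 100
std_grid = statistics.stdev(grid) / 100

print(f"mean: {mean_lstm:.3f} {mean_grid:.3f}")
print(f"std:  {std_lstm:.4f} {std_grid:.4f}")
```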

The mean and standard deviation of the conventional LSTM model’s accuracy rate are used as the population mean and standard deviation. This is a reasonable estimation since the goal is to compare networks with the grid structure to those without it. Thus, the population parameters should be estimated from the networks without the grid structure. If the mean and standard deviation of the grid model’s accuracy had been included, a model in the population would instead have a 50 % chance of having the grid structure. Then, the comparison would no longer be between models of pure grid structure and models without it.

Figure 5: Bell curves representing the probability distributions of the accuracies of the two models, with accuracy on the horizontal axis and probability density on the vertical axis. The dotted line represents the bell curve of the Grid LSTM, and the solid line represents the bell curve of the conventional LSTM.

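The P-value, which serves as the test statistic here, is given by the mean difference divided by the population standard deviation. The mean difference is illustrated in figure 5, where the normal distribution of the accuracy rate of each model is plotted.

Mathematically, the P-value is given by the formula

$$P = \frac{\mu_{Grid} - \mu_{LSTM}}{\sigma}$$

where $\sigma$ is the population standard deviation estimated from the conventional LSTM model, which with numerical values yields

$$\frac{0.652 - 0.648}{0.0044} = 0.84.$$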

The significance level α = 5 % is chosen. Next, the value is compared to a t-table. 17 degrees of freedom are used, since two means and one standard deviation were calculated, which gives the degrees of freedom as df = 20 - 3 = 17. A comparison of the P-value to a t-table yielded the value 0.8. To be able to reject the null hypothesis $H_0$, this value should have been at least 0.975. This means that the null hypothesis $H_0$ cannot be rejected at a significant level, and thus that the alternative hypothesis $H_1$ is not supported. This means that statistically, it cannot be stated that one model performs better than the other in terms of accuracy.
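The table lookup can be approximated numerically. The sketch below assumes that the lookup corresponds to the cumulative probability of Student's t-distribution at the reported value 0.84 with 17 degrees of freedom. It integrates the density with Simpson's rule using only the standard library; the function names are illustrative and not part of the project code.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, n=2000):
    """Cumulative probability P(T <= t), integrated with Simpson's rule.

    The density is symmetric around zero, so the integral from 0 to |t| is
    added to (or subtracted from) 0.5. n must be even.
    """
    upper = abs(t)
    step = upper / n
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * t_pdf(k * step, df)
    half = total * step / 3
    return 0.5 + half if t >= 0 else 0.5 - half


statistic = 0.84    # value reported in the text
df = 17             # degrees of freedom used in the text
threshold = 0.975   # two-tailed test at the 5 % significance level

probability = t_cdf(statistic, df)
reject_null = probability >= threshold
print(f"cumulative probability: {probability:.2f}, reject H0: {reject_null}")
```

The computed probability lies well below 0.975, which matches the conclusion that the null hypothesis cannot be rejected.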

4.2 Results of the confusion matrices

Each model also produced a confusion matrix during testing. Because the matrices from different sessions differed only slightly, and because calculating a matrix of means was considered inefficient, the matrix whose corresponding accuracy rate was closest to the mean was chosen as the resulting confusion matrix. The matrix produced by the conventional LSTM network was translated to a table with labeled rows and columns. As a reminder, each row represents the clips that carry the corresponding label, and each column represents the times the network guessed the corresponding label. This is displayed in table 4.


Label \ Guess silence unknown     yes      no      up    down    left   right      on     off    stop      go
silence           247       1       1       0       1       0       1       2       0       1       3       0
unknown             2      77      32      26      12      31       6      23      16       9       4      19
yes                 5       6     192       5       6       4      25      13       0       0       2       4
no                  4       6      11     144       2      26       3      10       2       2       2      40
up                  2       7       3       2     177       2      13       5       2      13      24      22
down                4      15       5      28       3     169       3       1       6       0       5      14
left                3      17      47       6      14       2     143      23       0       9       2       1
right               4      13      12       2       2       1       9     200       8       6       0       2
on                  3      13       0       3       6      16       1       6     186       6       1       5
off                 4       4       1       0      13       2      14       5      20     189       4       6
stop                0       9       5       5      26       8       3       2       4       4     164      15
go                  4      13       4      48       6      31       2      11       8       1       8     115

Table 4: The resulting confusion matrix from testing the conventional LSTM model translated into a labeled table.

The resulting confusion matrix of the grid LSTM model was chosen in a similar way. The matrix with the corresponding accuracy rate closest to the mean is presented in table 5.

Label \ Guess silence unknown     yes      no      up    down    left   right      on     off    stop      go
silence           250       2       2       0       0       0       0       0       1       0       1       1
unknown             2      75      30      17       6      35       5      30      30       6       4      17
yes                 5       5     196       1       1       5      21      13       2       0       4       3
no                  4       6       9     136       2      26       3       9       1       2       2      52
up                  3       5       4       2     166       8      13       4       2      19      26      20
down                4      17       5      24       2     165       4       2       9       0       5      16
left                3      18      47       4      12       3     144      25       0       7       2       2
right               4      20       9       1       1       1       6     202       7       6       0       2
on                  3      17       0       5       5      14       1       4     182       8       1       6
off                 4       2       2       0      14       2      12       3      22     190       4       7
stop                1      11       6       5      21       6       3       1       3       5     168      19
go                  4      12       4      49       6      27       1       9       7       3       8     121

Table 5: The resulting confusion matrix from testing the Grid LSTM model translated into a labeled table.

As expected, both models produced matrices whose diagonals hold numbers of a larger magnitude than the rest of the elements. This provides further evidence that the models are working, since each increase in the diagonal represents another correct prediction. In fact, all numbers in the diagonal are one order of magnitude larger than the rest of the matrix, except for one. The exception for both matrices is the element of index (2, 2), which has the same order of magnitude as the elements outside of the diagonal. This means that both models found it hardest to predict unknown words. Furthermore, it can be said that both networks found it easiest to predict silence correctly. This can be seen in the first row of each matrix, where all elements apart from the first one are very close to zero. This means that almost every time a clip with silence was passed through the networks, both of them labeled it correctly. That both models behave similarly for single words further suggests that the null hypothesis cannot be rejected. This is in line with the statistical analysis made on the accuracy rates of the two models.
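The observation about silence and unknown words can be quantified as the share of correctly guessed clips per label, that is, the diagonal element divided by the row sum. The sketch below does this for the first two rows of table 4; the helper name `recall` is an illustrative choice and not taken from the project.

```python
LABELS = ["silence", "unknown", "yes", "no", "up", "down",
          "left", "right", "on", "off", "stop", "go"]

# First two rows of table 4 (conventional LSTM): one row per true label,
# one column per guess of the network.
ROWS = {
    "silence": [247, 1, 1, 0, 1, 0, 1, 2, 0, 1, 3, 0],
    "unknown": [2, 77, 32, 26, 12, 31, 6, 23, 16, 9, 4, 19],
}


def recall(label, row):
    """Fraction of clips with this true label that were guessed correctly."""
    return row[LABELS.index(label)] / sum(row)


for label, row in ROWS.items():
    print(f"{label}: {recall(label, row):.3f}")
# silence: 0.961
# unknown: 0.300
```

Roughly 96 % of the silent clips are labeled correctly by the conventional LSTM, against about 30 % of the unknown words.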

The correlation between the two confusion matrices was determined to be r = 0.998, which is very close to perfect correlation. This means that the two matrices vary together almost completely. This further supports the decision not to reject the null hypothesis $H_0$.


4.3 Results on training time

As described in chapter 3, the training time was normalized so that the result does not depend on the capacity of the device used for training. The choice was made to normalize the training time of the grid LSTM model to one. The results are presented in table 6.

                Standard LSTM   Grid LSTM
Training time   1.02            1

Table 6: The resulting training times of the conventional LSTM network and the grid LSTM network.

Since the training times are assumed to be constant for each model, no statistical tests are needed. It can be stated that there is a difference of 2 %, where the conventional LSTM model takes longer to train. However, the difference is small, and therefore training time can be said to be of little importance when choosing between the two models.

5 Discussion

5.1 Discussion

In this project, it cannot be statistically confirmed that there are differences between the conventional LSTM model and the one-dimensional grid LSTM model in terms of the accuracy rate. In light of the fact that speech data is highly time dependent, it is not surprising that the grid LSTM network had no advantage over the conventional LSTM network. However, what is a bit surprising is that it has no disadvantages either. This means that a grid LSTM network can perform just as well as a conventional LSTM model even without the time dimension, which in turn means that single word speech recognition can be modeled just as well with the time dimension exchanged for the depth dimension, even though the data is highly time dependent.

The correlation between the confusion matrices of the two networks further supports the conclusion that there is no significant difference between the two models. The calculated correlation of r = 0.998 indicates that the two confusion matrices correlate almost perfectly. The main difference between the conventional LSTM model and the grid LSTM model is, apart from the dimension used, that the grid LSTM lacks the forget gate. That the two models perform this similarly might be an indication that these two main differences are not of significance in single word speech recognition algorithms.

However, the training time varied slightly between the models. A small variation was expected since the conventional LSTM model and the one-dimensional grid LSTM model are similar apart from the forget gate. However, since the training could be completed overnight even with a relatively weak 2.7 GHz CPU, training time is not considered to be of significance when choosing between the models. Thus, the training times only give the grid LSTM a small advantage over the conventional one.

Although the conventional LSTM network and the grid LSTM network did not show any significant differences, it can be stated that the one-dimensional grid LSTM network is as good as the state-of-the-art LSTM model at predicting single words from audio files correctly. This makes it a challenger to the state-of-the-art approach to acoustic modeling that with further research might outperform its competition and improve the capabilities of speech recognition technology. This motivates further research on applying grid LSTM models to speech recognition problems.

5.2 Limitations

The data set and methodology of this project have caused limitations to the generality of the results that are important to address. Due to the nature of the data set used, the machine learning problem is of the one-hot type. Thus, the results are only valid for one-hot machine learning models. In addition, the data set consisted of single words spoken in isolation. This is a very different problem from those using data sets with words spoken in succession, since the order of the words is irrelevant in the first case while earlier words hold valuable information about the next word in the second case. This means that the results are only valid for data sets where words are spoken in isolation.

As mentioned before, the grid LSTM architecture can be of many dimensions. Due to comparability and resource limitations, the choice was made to only create one grid LSTM model with an architecture of one dimension. Therefore, the results cannot be extended to grid LSTM architectures in general, but only to the one-dimensional grid LSTM structure.

In the processing of the data, it was decided to use data points directly from the mechanical waves instead of transforming them into the frequency domain. Although data in the frequency domain is known to be more reliable since speech data is highly frequency dependent [4], this decision had to be taken due to time limitations. More reliable results could probably be produced with a transformation of the data to the frequency domain.

The statistical analysis of the accuracy rates showed no significant difference. Therefore, the null hypothesis could not be rejected. However, it is important to highlight that the sample size of both the conventional LSTM model and the grid LSTM model was quite small. Since the population standard deviation was estimated from the sample standard deviations, the small sample sizes decrease the reliability of the results somewhat. However, time limitations made it necessary to limit the sample sizes.

There are many possible ways to measure advantages and disadvantages when comparing machine learning models. In this project, the choice fell upon the accuracy rate and training time, since others conducting research in this field have used these variables [2], [4]. However, there are other possible ways of comparing machine learning network performance, such as evaluating the parameter sensitivity of different models. Thus, it can be said that the wider concepts of advantages and disadvantages of machine learning models have been limited to the accuracy rate and training time of the models. All other ways of quantifying performance lie beyond the scope of this work.

6 Conclusion

In light of the limitations described in section 5.2, the following conclusions were drawn. The null hypothesis cannot be rejected based on the statistical evaluation of the accuracy. Since the training time is not expected to vary stochastically, the training time of the Grid LSTM can be concluded to be slightly shorter than that of the conventional LSTM model, which constitutes a slight advantage. With regard to the parameters evaluated in this project, there is a slight advantage of using a 1D Grid LSTM over a conventional LSTM network. This advantage lies in the training time, which is 2 % longer for the LSTM network. However, measured in absolute time, a 2 % difference does not represent a significant difference, and the training times can be approximated as equal. There might be parameters not taken into account in this evaluation that show advantages of using a conventional LSTM network.

The answer to the research question is the following. The advantage of using a Grid LSTM network over a conventional LSTM network is that the grid network has a slightly shorter training time. There is no statistically verifiable advantage in accuracy. With this, the research question has been answered. However, since the difference in training time is small, it can be said that the two models perform equally. Thus, the grid LSTM model performs at least as well as the conventional LSTM model.


6.1 Future studies

In light of the limitations described above, a reasonable next step would be to use both dimensions in a 2D grid model, to gain the advantages of looking at the data as a time series as well as the advantage of the depth dimension shown in this study. A two-dimensional Grid LSTM as depicted in figure 2 could perform extremely well if these advantages compound. In particular, grid LSTM architectures with higher dimensions might turn out to be appropriate for speech data where words are not spoken in isolation. In these data sets, each word is not only time dependent in itself but also dependent on earlier spoken words. Thus, higher dimensions in a grid architecture could be a good fit.

Another suggestion for future studies is to use data that has been transformed into the frequency domain. This might improve the overall performance of both models, and one can argue that this would make the results more reliable. It would also be interesting to see a replication of this project with a bigger sample size of accuracy rates. Even though such a replication would hopefully yield the same results, the results would be more reliable. In a replication of this study, it is also suggested that a parameter sensitivity analysis of each model be conducted. Such an analysis would widen the concepts of advantages and disadvantages used to compare the models and in turn increase the understanding of the models.


References

[1] Wei Ning Hsu, Yu Zhang, and James Glass. A prioritized grid long short-term memory RNN for speech recognition. In 2016 IEEE Workshop on Spoken Language Technology, SLT 2016 - Proceedings, pages 467–473, 2016.

[2] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid Long Short-Term Memory. 7 2015.

[3] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A Convolutional Neural Network for Modelling Sentences. 4 2014.

[4] Martin Längkvist, Lars Karlsson, and Amy Loutfi. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters, 42:11–24, 6 2014.

[5] Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In Readings in Speech Recognition, pages 267–296. Elsevier, 1990.

[6] Tara N Sainath and Carolina Parada. Convolutional Neural Networks for Small-footprint Keyword Spotting. Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[7] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling. INTERSPEECH, pages 338–342, 2014.

[8] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition. 2 2014.

[9] Haşim Sak, Andrew Senior, Kanishka Rao, and Françoise Beaufays. Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition. 7 2015.

