
Master Thesis Report

Recurrent Neural Networks for End-to-End Speech Recognition

A comparison of gated units in an acoustic model

Johan Hagner 2018-01-10

Johan Hagner (c11jhr@cs.umu.se) Fall 2017

Master Thesis, 30 ECTS

Master of Science in computing science, 300 ECTS


Acknowledgments

I would like to thank associate professor and my thesis advisor Michael Minock for his encouragement, engagement and expertise, which have helped clarify this sometimes hard to understand field of research throughout my thesis work. I would also like to thank my father Olof Hagner for his support.


Abstract

End-to-end speech recognition is the problem of mapping a raw audio signal all the way to text. In doing so, the process is not explicitly divided into modules, e.g. [signal → phoneme, phoneme → word]. Recurrent neural networks equipped with specialised temporal loss functions have recently demonstrated breakthrough results for the end-to-end problem.

In this thesis we evaluate a number of neural network architectures for end-to-end learning. LSTM (Long Short Term Memory) is a specialised gated recurrent unit that preserves a signal within a neural network over a period of time. GRU (Gated Recurrent Unit) is a more recently introduced refinement of LSTM with still unknown performance characteristics. It is reported that different architectures work better or worse depending on the problem at hand. We explore these characteristics for the end-to-end speech recognition problem. Specifically, we evaluate various networks on the LibriSpeech corpus. All the audio is read in English by people from different parts of the world. The audio files are excerpts sourced from audio books.

The LibriSpeech corpus is divided into both noisy and clean audio, where the noisy audio is considered to be more challenging. The corpus is pre-segmented and thus contains ready-made subsets for testing. These include both noisy and clean audio, and we evaluate the end-to-end models on both sets.

The findings of our experiments show that GRU cannot perform at the same level as the LSTM variants, especially not when trained on noisy data, where the GRU network stops improving after only a small part of the allotted training time.

Contents

Acknowledgments
Abstract
1 Introduction
1.1 Artificial intelligence
1.2 Speech Recognition
2 Background
2.1 Neural networks
2.2 Recurrent Neural Networks
2.3 Long Short Term Memory (LSTM)
2.4 Gated Recurrent Unit (GRU)
2.5 Connectionist Temporal Classification (CTC)
2.6 Mel-frequency filter bank
2.7 Evolution of hardware
2.8 Tensorflow
3 Methods
3.1 Overview
3.2 LibriSpeech corpus
3.3 Network design
3.4 Experiments
3.5 Evaluation
4 Results
4.1 Transcriptions
4.1.1 Noisy audio
4.1.2 Clean audio
5 Discussion
5.1 Parameters
5.2 Comparison
6 Conclusions

List of Figures

2.1 Single and multi layer perceptron
2.2 Diagram of gradient descent
2.3 Recurrent neural network cells unrolling through time
2.4 Tanh and the derivative of tanh plot
2.5 Diagram of the LSTM cell internal structure
2.6 Diagram of the GRU cell internal structure
2.7 Process of condensing a CTC alignment
2.8 Mel-frequency filter bank diagram
3.1 Information pipeline diagram
4.1 Cell error rate bar chart
4.2 Word error rate bar chart
4.3 Cell error rate for GRU network in experiment 1
4.4 Cell error rate for LSTM network in experiment 1
4.5 Cell error rate for LSTM network with peephole connections in experiment 1
4.6 Cell error rate for GRU network in experiment 2
4.7 Cell error rate for LSTM network in experiment 2
4.8 Cell error rate for LSTM network with peephole connections in experiment 2

List of Tables

3.1 Settings for experiment 1
3.2 Settings for experiment 2
4.1 CER (Cell Error Rate) and WER (Word Error Rate) for experiment 1
4.2 CER (Cell Error Rate) and WER (Word Error Rate) for experiment 2


CHAPTER 1

Introduction

1.1 Artificial intelligence

Artificial intelligence is the field of man-made intelligent systems. The debate over what intelligence really means has been long and controversial. Generally it is considered to be problem solving using information processing in one way or another. The kinds of problems generally considered to require intelligence include visual and auditory cognition, decision making and natural language understanding, among others. [Gér17]

In 1950 the computing pioneer Alan Turing asked himself "can machines think?". In the now famous imitation game, also called the Turing test, he proposed that if a computer could fool a human into believing that it too was a human over text communication, the computer would be considered intelligent.

This thought experiment has received criticism over the years for many reasons. One objection is that the result would depend greatly on which questions were asked. It is also quite easy to imagine a computer that can hold a normal conversation with a stranger but would still not be able to answer contemplative questions about its own subjective inner state.

The first artificial intelligence systems that were somewhat useful were rule based systems whose behaviour followed decision trees. This approach could work well for diagnosing medical conditions or for rovers driving around indoors in small rooms. However, problems arose when the domain became increasingly complex, since more and more rules had to be implemented for each possible special case. It was also hard to determine how to weigh the different rules' importance against each other when they suggested different solutions.

An alternative to the rule based approach is neural networks, which belong to the sub field of machine learning. In contrast to rule based systems, where explicit rules have to be designed, neural networks learn from examples. There are two main variants of machine learning: supervised learning and unsupervised learning. In supervised learning the training data consists of pairs of input and output that have been prepared for the task of training. In unsupervised learning no such labelled pairs are provided and the system must find structure in the data on its own. The first neural network models that emerged were not useful for most problems, but could do well in some simple visual recognition tasks such as detecting handwritten digits and letters. Despite these initial successes, early neural networks were incapable of solving many of the problems that researchers had hoped they would tackle. The main reason the results were unsatisfying was that access to large amounts of training data, as well as computational power, was lacking.


However, in recent times access to huge amounts of training data, much faster hardware and algorithmic breakthroughs have enabled a whole range of domains to be tackled by neural networks: areas that were previously believed to be reserved for humans, such as driving cars in traffic, composing music, annotating images and summarising text. In some of these domains accuracy now exceeds that of human experts. These new technologies are generally part of the deep learning paradigm. Strictly speaking, any learning by a neural network with more than one hidden layer could be considered deep learning, but the term is generally understood to mean very many layers and vast amounts of training data. This development shows no signs of slowing down. Many prominent thinkers on this topic predict that unless there is something magical about our biological substrate, and humanity does not go extinct, sooner or later artificial intelligence will become superhuman at any imaginable task.

1.2 Speech Recognition

The problem of transcribing audio into text has traditionally been tricky, but it has many possible applications if the error rates get sufficiently low compared with humans performing the same task. Examples include auto-generating subtitles for movies and video clips with conversational speech, such as YouTube videos, interviews or news clips. Another widely used application for speech recognition is artificial intelligence assistants, such as Siri (Apple), Alexa (Amazon), Cortana (Microsoft) and Bixby (Samsung). These assistants clearly need some way of transcribing speech to text before trying to understand the request and offer an appropriate response.

One of the reasons why speech recognition has been so troublesome to implement in the past is that there are only 26 letters in the English alphabet and the letters are pronounced very differently depending on the context. There are massive geographical, cultural and individual differences in pronunciation, which further complicates things. Despite all these possible differences, the speech still gets encoded into the same vector of letters that make up a text. These factors have made this a very hard problem to solve, until now, when deep learning technology provides a new and very powerful set of tools.

To train a neural network for speech recognition, the training data consists of paired corresponding sequences: an input audio signal and the output letter sequence, i.e. the correct corresponding text.

However, when trying to construct a deep neural network for speech recognition, one immediately faces a previously crippling problem: the audio and letter sequences are of different lengths. The label sequence, i.e. the text, does not reflect how long the corresponding audio clip is. To correct for this, a large amount of pre-processing, including pre-training, has traditionally been required. This leads to an extra training step where the sequence is stretched and filled with blanks so that each letter is aligned in time with the sound in which it was heard.


CHAPTER 2

Background

2.1 Neural networks

Artificial neural networks are one of the most promising technologies aiming to bring about artificial intelligence. They are inspired by nature's own biological neural networks, such as the brains of animals. The thought is that since neural networks in nature can produce intelligent behaviour, artificial ones should be able to do so as well. However, although inspired by biological neural networks, artificial neural networks can be very different in many respects, incorporating both structural and functional differences, just like winged aircraft were inspired by birds but are very different all the same; aeroplanes do not flap their wings to produce thrust but use engines instead. As will be mentioned later in this thesis, the more complex and specialised architectures are not analogous to anything found in biology but instead drift away from their biological progenitors.

From here onward, artificial neural networks will be referred to as neural networks for simplicity's sake. A neural network is created from multiple processing units called neurons that are connected to form a network. A neuron consists of a number of input connections with associated weights (synapses) that lead to a main body. When enough of the input synapses are activated, the neuron activates and its state is sent through an output synapse to other neurons. To arrange these neurons into a network, at least two layers are constructed with any number of neurons in each layer. The behaviour of the network is determined by the ensemble of weights. The process of setting these weights is based on deriving, or learning, them from a set of corresponding input and output examples.

One of the simplest versions of a neural network is the perceptron, invented in 1957. In this model each input synapse is associated with a weight that determines how much the neuron should take that particular input into account when deciding whether to activate or not. To do this calculation, each input signal is multiplied by its associated weight and the products are summed into what is called a weighted sum.

z = w_1 x_1 + w_2 x_2 + … + w_n x_n

In the formula above z is the weighted sum, w is a weight, x is an input and n is the number of inputs. This weighted sum is then passed through an activation function. In the strictest definition of a perceptron, a binary step transfer function (also called the Heaviside activation function) is used, as described below.

Heaviside (z) = 0 if z < 0, 1 if z ≥ 0
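As a concrete illustration (not taken from the thesis), the following Python sketch computes the weighted sum and applies the Heaviside step; the AND-gate weights and bias are chosen purely for the example.

```python
import numpy as np

def perceptron(x, w, b=0.0):
    """Single perceptron: weighted sum followed by the Heaviside step."""
    z = np.dot(w, x) + b           # weighted sum z = w_1*x_1 + ... + w_n*x_n (+ bias)
    return 1 if z >= 0 else 0      # Heaviside activation

# Example: a perceptron computing the logical AND of two binary inputs
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))
```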

Perceptrons are a type of feed forward neural network because they do not contain any feedback loops. There are two basic types of perceptrons, single layer and multi layer (see Figure 2.1). In multilayer perceptrons there are one or more hidden layers, and a sigmoid type activation function is commonly used, e.g. the tanh function (see Figure 2.4).

FIGURE 2.1 The left half of the diagram shows a single layer perceptron and the right half shows a multi layer perceptron with one hidden layer

Each layer is computed once per training step and the result is fed forward to the next layer. The process of gradually adapting the network weights to improve the network's performance is called training. The discovery of the backpropagation method was a major breakthrough that enabled models such as multilayer perceptrons and deep neural networks to be trained [Gér17]. The goal is to minimise the squared error over some amount of training data, and to do this each weight is modified in proportion to how much it affects the activations of the output layer. Each weight is gradually adjusted to make the target output more likely than the actual output that was generated. Backpropagation requires an algorithm from the class of iterative optimisation algorithms, the most commonly known being gradient descent. There are variants of this algorithm that are more complex and better suited for modern tasks; one of them, Adaptive Moment Estimation (Adam), will be described later in this work.

The gradient of a function f with respect to a variable x determines how much the value of f will change for a unit of change in x. If the function has many variables, partial derivatives have to be used. Iteratively, better and better values are found and the output slowly converges towards the optimal result.

The backpropagation algorithm is divided into two steps: the forward propagation step and the backpropagation step. It starts with the forward propagation step by feeding some example training data into the network, just as if it were being used for prediction after training. Then the difference between the actual and the desired output is fed into an error function that determines how much each unit in the output layer was responsible for the measured error E.


E = (1/2) Σ_{i=1}^{k} (o_i − t_i)^2

In the formula above:

k : the number of samples being computed together
o_i : the output value for sample i
t_i : the target value for sample i

Some notation is needed to understand the backpropagation algorithm:

w^l_{jk} : the weight from the k-th unit in layer (l − 1) to the j-th unit in layer l
b^l_j : the bias associated with the j-th unit in layer l
y^l_j : the output of the j-th unit in layer l
h^l_j : the activation value (weighted input) of the j-th unit in layer l
t^l_j : the target output of the j-th unit in layer l
φ_l : the activation function for units in layer l
η : the learning rate

The updated weight is described by the following formula:

w^l_{jk} ← w^l_{jk} − η ∂E/∂w^l_{jk} = w^l_{jk} + Δw^l_{jk}

The updated value is the old value minus the learning rate multiplied by the partial derivative of the error with respect to w^l_{jk}; η and ∂E/∂w^l_{jk} are combined into the term Δw^l_{jk} as can be seen above.

The delta term derived for each weight is defined in the formula below:

Δw^l_{jk} = −η δ^l_j y^{l−1}_k

That is, the weight change is the learning rate multiplied by the output signal of the source unit and the error signal δ of the unit the weight leads into. The error signal δ^l_j is defined differently for the output layer and for hidden layers:

• For the output layer: δ^l_j = (y^l_j − t^l_j) φ′_l(h^l_j)

• For a hidden layer: δ^l_j = φ′_l(h^l_j) Σ_k w^{l+1}_{kj} δ^{l+1}_k, i.e. the error signals of the units in the next layer (the layer closer to the output layer), weighted by the connections to them.
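To make these formulas concrete, here is a minimal numerical sketch (not code from the thesis) of one forward and backward pass for a tiny two-layer tanh network; all sizes and values are invented for illustration.

```python
import numpy as np

# Minimal sketch of one backpropagation update for a tiny 2-2-1 tanh network,
# following the delta terms defined above. Not the thesis code; shapes and
# values are invented for illustration.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))   # weights into the hidden layer
W2 = rng.normal(size=(1, 2))   # weights into the output layer
eta = 0.1                      # learning rate

x = np.array([0.5, -1.0])      # one training input
t = np.array([0.3])            # its target output

# Forward pass
h1 = W1 @ x                    # weighted inputs of the hidden layer
y1 = np.tanh(h1)               # hidden layer outputs
h2 = W2 @ y1
y2 = np.tanh(h2)               # network output

# Backward pass: delta terms for the output layer and the hidden layer
d2 = (y2 - t) * (1 - y2 ** 2)        # output layer: (y - t) * phi'(h), phi = tanh
d1 = (W2.T @ d2) * (1 - y1 ** 2)     # hidden layer: weighted next-layer deltas * phi'(h)

# Gradient descent updates: delta_w = -eta * delta * source output
W2 -= eta * np.outer(d2, y1)
W1 -= eta * np.outer(d1, x)
```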

2.2 Recurrent Neural Networks

A recurrent neural network is a type of network that can predict some future element based on a sequence of previous and/or following states. This can be thought of as a kind of memory, where the network takes context into account rather than just the current state of the input signal. E.g. a well trained recurrent neural network can make a prediction about the next notes in a melody or the continuation of a sentence.

These recurrent neural networks contain some number of layers with recurrent neural network cells.

The most basic version of a recurrent neural network cell contains an extra synapse that loops back into itself.

In practice this means that for each time step it receives inputs from the preceding layer as well as its own output from the previous time step. Figure 2.3 shows the extra synapse leading back into itself, providing a form of recurrent memory.

FIGURE 2.2 This figure shows the iterative steps taken by the gradient descent algorithm

A recurrent neural network cannot be trained by the ordinary version of backpropagation; instead a specialised version called backpropagation through time is used. The recurrent layers get connected from one time step to the next, creating a chain, so that the network becomes a very long feed forward network even if the original recurrent neural network was only one layer deep. This principle is known as "unrolling the network through time" and can be seen in Figure 2.3.

In the basic case a recurrent neural network only considers previous time steps. There is however an adaptation, called bi-directional recurrent neural networks, that allows recurrent neural networks to also take future time steps into account [Gib]. This can be thought of as similar to human understanding, where cognition lags behind the sensory inputs quite significantly before combining them into a coherent world view. This will generally improve performance on tasks where context matters, but there are some considerable drawbacks. Firstly, the training time increases a lot because of the extra weights that have to be taken into account; in essence, for every recurrent unit a second unit is added for the backward pass, which effectively doubles the size of the network. Secondly, it adds a lot of latency at run time, so where latency is important bi-directional recurrent neural networks are not recommended.

Unfortunately deep neural networks, and the basic recurrent neural network in particular, i.e. network structures with many layers, suffer from a problem called the vanishing gradient. The problem arises when using gradient based methods such as backpropagation or backpropagation through time: the parameters either become unresponsive or unstable because the gradient vanishes towards zero [Hoc].

The vanishing gradient problem arises because the derivatives of traditional activation functions such as tanh and sigmoid approach 0 at both ends, as shown in Figure 2.4. When the error signal traverses the network backwards it gets successively multiplied by these small gradients for each layer to determine the output effect of a parameter, so the effect can become very small. For this reason neural networks with many layers are extra susceptible. The vanishing gradient problem affects recurrent neural networks even more severely, because a recurrent neural network is trained by unrolling the network through time, which essentially means that it becomes equivalent to one very deep neural network for a single time step.

FIGURE 2.3 The figure shows the general structure of a basic recurrent neural network cell and a way to visualise it unrolling through time, from time step t-3 to t.
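The following is a minimal sketch, under assumed shapes, of a vanilla recurrent cell being unrolled over a sequence; it is illustrative only and not the implementation used later in this thesis.

```python
import numpy as np

# Minimal sketch of a vanilla RNN cell unrolled through time: the same weights
# are applied at every time step and the hidden state h carries context forward.
def rnn_forward(xs, W_xh, W_hh, b_h):
    """xs: sequence of input vectors, shape (T, input_dim)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:                                 # unrolling over T time steps
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # recurrent update
        states.append(h)
    return np.stack(states)                        # hidden state at every step

T, input_dim, hidden_dim = 5, 3, 4                 # toy sizes, chosen for the example
rng = np.random.default_rng(1)
xs = rng.normal(size=(T, input_dim))
W_xh = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
print(rnn_forward(xs, W_xh, W_hh, b_h).shape)      # (5, 4)
```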

2.3 Long Short Term Memory (LSTM)

Long Short Term Memory (LSTM) is the most popular type of recurrent neural network cell used to bypass the vanishing gradient problem. The LSTM cell is a memory cell where memories are represented in two distinct state vectors, one for short term and one for long term memory. This allows it to recognise long term dependencies in the data. It was invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997 [HS] and has had several improvements added over the years. The typical structure of a modern LSTM cell is shown in Figure 2.5.

From the outside: x(t) is the input for the current time step from the preceding layer, h(t-1) is the short term state vector from the previous time step, c(t-1) is the long term state vector from the previous time step, h(t) and c(t) are the new state vectors that are sent to the next time step, and lastly y(t) is the output to the next layer.

FIGURE 2.4 This figure shows tanh (blue) and the derivative of tanh (red) plotted together.


FIGURE 2.5 Diagram of the LSTM cell internal structure. C(t-1) denotes the previous long term memory state, c(t) denotes the long term memory state that will be preserved to the next time step, h(t-1) denotes the previous short term memory state, h(t) denotes the short term memory state that will be preserved to the next time step, y(t) denotes the output from the cell in the current time step and x(t) denotes the input to the cell in the current time step.

Inside the cell there are four different fully connected layers. The main data path layer uses the tanh activation g(t), defined as:

tanh(x) = 2 / (1 + e^(−2x)) − 1

The three other layers, f(t), i(t) and o(t), use the logistic (sigmoid) activation, defined as:

σ(x) = 1 / (1 + e^(−x))

These three fully connected layers are called gate controllers. The first gate controller f(t) controls the forget gate which determines what part of the long term state should be forgotten. The second gate controller i(t) controls the input gate that determines what new information should be added to the long term state. Lastly o(t) determines what part of the long term state should be output to the next layer as well as the short term state to the next time step.
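A minimal sketch of one LSTM time step in the standard formulation may make the role of the three gates concrete. This is not the TensorFlow implementation used in this work, it omits peephole connections, and the weight matrices are assumed to act on the concatenation of h(t-1) and x(t).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal sketch of one LSTM step (standard formulation, no peepholes).
# W_* and b_* are assumed pre-initialised weights acting on [h_prev, x].
def lstm_step(x, h_prev, c_prev, W_f, W_i, W_o, W_g, b_f, b_i, b_o, b_g):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)   # forget gate: what to drop from the long term state
    i = sigmoid(W_i @ z + b_i)   # input gate: what new information to add
    o = sigmoid(W_o @ z + b_o)   # output gate: what to expose as output / short term state
    g = np.tanh(W_g @ z + b_g)   # candidate values on the main data path
    c = f * c_prev + i * g       # new long term state c(t)
    h = o * np.tanh(c)           # new short term state h(t), also the output y(t)
    return h, c
```

With peephole connections, the gate layers would additionally receive the long term state c as input.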

One refined variant of the LSTM, published in 2000 by Felix A. Gers and Jürgen Schmidhuber, is the addition of peephole connections [GS]. This allows the gate layers to take the long term memory state into account, not only the short term state. It adds some computation to the LSTM cell but is supposed to give additional performance on tasks where exact timing is important, according to Gers and Schmidhuber.


2.4 Gated Recurrent Unit (GRU)

Just like the LSTM, the gated recurrent unit (GRU) protects a recurrent neural network from the vanishing gradient problem. The GRU is a simplified version of the LSTM cell that has been shown to offer similar results to the LSTM in some cases [CGCB]. However, the GRU is less computationally intensive because it lacks an output gate.

Instead of having two separate memory vectors it only has one, so both short term and long term memory get stored in the same place. It also does not have separate input and forget gates, but a single update gate.

FIGURE 2.6 Diagram of the GRU cell internal structure. H(t-1) denotes the previous memory state, h(t) denotes the memory state that will be preserved to the next time step, y(t) denotes the output from the cell in the current time step and x(t) denotes the input to the cell in the current time step.
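For comparison with the LSTM sketch above, here is a minimal sketch of one GRU time step in the standard formulation (again not the TensorFlow implementation used in this work), showing the single state vector and the update and reset gates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal sketch of one GRU step. There is only a single state vector h, and
# an update gate z takes over the role of the LSTM's input and forget gates.
# W_* and b_* are assumed pre-initialised weights acting on concatenated vectors.
def gru_step(x, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    zx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ zx + b_z)        # update gate: how much of the state to replace
    r = sigmoid(W_r @ zx + b_r)        # reset gate: how much of the old state to use
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)  # candidate state
    h = (1 - z) * h_prev + z * h_cand  # blend old state and candidate state
    return h
```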


2.5 Connectionist Temporal Classification (CTC)

When training a neural network with a recurrent layer, the classical implementation requires that the input and output vectors are of the same length: a corresponding output element for each input element, i.e. I[e1, e2, e3] and O[e4, e5, e6]. The network would then learn e1 → e4, e2 → e5 and e3 → e6. In an application such as speech recognition this is problematic since it requires the data to be pre-segmented to a high level of detail. Connectionist Temporal Classification (CTC) [GFGS] is a cost function that solves this problem. It enables recurrent neural networks to be trained with input and output sequences of different lengths, e.g. corresponding sequences like I[e1, e2, e3] and O[e4, e5].

The algorithm introduces another symbol, "blank" (-), which is not to be confused with the blank space character. In the case of character level end-to-end speech recognition (as opposed to phonemes), the label space consists of the characters in the alphabet plus the blank space. The output layer of a CTC network therefore consists of one unit for each label plus the new blank symbol. The output vectors, after being normalised with the Softmax function, are considered to be the probabilities of the given labels and the new blank symbol.

Softmax is the process of forming the output layer into a probability distribution. Before this step is taken, the values in the output layer can be practically anything depending on what type of activation is active in the output layer. The goal is then to transform each output node into a probability, i.e. a value between 0 and 1. All the output values after the Softmax process should add up to 1, covering all possibilities.
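As a small illustration (not code from the thesis), a numerically stable softmax over a vector of raw output values can be computed as follows.

```python
import numpy as np

# Numerically stable softmax: turns the raw output layer values (logits) into
# a probability distribution over the labels plus the CTC blank symbol.
def softmax(logits):
    shifted = logits - np.max(logits)   # subtract the maximum for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

print(softmax(np.array([2.0, 1.0, -1.0])))  # the outputs sum to 1.0
```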

Given this generated probability distribution one can imagine a path formed by taking the most likely label or blank at every time step. This path is also called a CTC alignment. The alignment becomes a mixture of blanks and characters, e.g. [-,-,C,-,A,A,-,T,-,-]. The relevant information in this alignment is the transitions, either from a character to blank or from blank to a character. The next step is to condense this alignment by removing all blanks and duplicate letters, so the example alignment becomes [C,A,T]. One then realises that there are many un-condensed alignments that correspond to the same condensed alignment. To evaluate the true probability that the output is supposed to be [C,A,T], all un-condensed alignments that correspond to [C,A,T] need to be found and have their probabilities added together. [Han17] The process of condensing the alignments is shown in Figure 2.7.

FIGURE 2.7 This image shows the different steps from raw audio to the CTC alignments. Note that the blank symbol is denoted by ε rather than -.


Ideally one would want to examine all the possible paths, but that is computationally unfeasible if the label space is large and/or there are many time steps per sequence. Examining these paths is called decoding the network. Because a full search is unfeasible, an approximate search has to suffice. [Gib] There are a couple of different search algorithms that work well for this purpose; the one used by CTC's creator in his second paper on CTC [GJ] is the beam search algorithm. A common heuristic to determine which nodes to expand is to simply choose the nodes with the highest probability. In the end, after a subset of the paths have been explored and the probabilities have been added up, the condensed alignment with the highest probability is chosen.
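To illustrate the decoding step, the sketch below implements the simpler greedy (best path) variant: take the most likely symbol at every frame and condense the resulting alignment. A beam search, as described above, would instead keep several candidate paths. The label set and probabilities here are invented for the example.

```python
import numpy as np

# Greedy (best path) CTC decoding sketch: pick the most likely symbol per frame,
# then condense the alignment by removing repeated symbols and blanks.
BLANK = "-"
labels = [BLANK, "A", "C", "T"]

def greedy_ctc_decode(probs):
    """probs: array of shape (time_steps, len(labels)) with per-frame probabilities."""
    best = [labels[i] for i in np.argmax(probs, axis=1)]   # most likely label per frame
    collapsed, prev = [], None
    for s in best:
        if s != prev and s != BLANK:   # drop repeated symbols and blanks
            collapsed.append(s)
        prev = s
    return "".join(collapsed)

# Toy example: the alignment [-, C, C, -, A, T] condenses to "CAT"
probs = np.array([
    [0.9, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.9, 0.0],
    [0.2, 0.0, 0.8, 0.0],
    [0.8, 0.1, 0.1, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.1, 0.0, 0.0, 0.9],
])
print(greedy_ctc_decode(probs))  # CAT
```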

2.6 Mel-frequency filter bank

Even with the technological breakthrough of CTC, some amount of pre-processing of the raw audio sequence is still required. To transform the audio clips into a usable format for training, they are converted into a mel-frequency filter bank representation. In order to mimic human logarithmic audio perception, the Mel scale is applied; this logarithmic scale can be seen in the spacing of the filters in Figure 2.8. For speech recognition purposes the audio signal is generally split up into short frames of around 20 milliseconds each. For each of these frames the audio signal gets split up into values for several triangular frequency bands. The triangle represents the weighting of the different frequencies, with the centre of the triangular filter being the most important. The number of frequency bands can vary, and it follows that the size of the input feature vector containing the filters can also vary. However, it is common to use 40 filters.

Accuracy gains have been shown if the filter bank does not only include the amplitudes for these triangular windows but also the first and/or second derivatives of the amplitude over time. This is called adding the delta or delta-delta. In theory this should allow the model to get a better representation of the audio. If the first derivative terms are added, the input feature vector naturally doubles in size; if both the first and second derivative terms are added, it triples in size.
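As an illustrative sketch of this feature extraction (the thesis itself used HTK for pre-processing), the following uses librosa to build 40 log mel filter bank values plus delta and delta-delta per frame; the file name and the 25 ms window / 10 ms hop settings are assumptions, not taken from the thesis.

```python
import numpy as np
import librosa  # used here only for illustration; the thesis used HTK for this step

# 40 log mel filter bank values per frame with delta and delta-delta appended,
# giving a 120-dimensional input vector per frame (cf. "Input features 120"
# in tables 3.1 and 3.2).
y, sr = librosa.load("sample.wav", sr=16000)           # hypothetical file name
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)                     # shape: (40, frames)
delta = librosa.feature.delta(log_mel, order=1)        # first derivative over time
delta2 = librosa.feature.delta(log_mel, order=2)       # second derivative over time
features = np.concatenate([log_mel, delta, delta2]).T  # shape: (frames, 120)
```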

FIGURE 2.8 A diagram showing the triangle shaped filters for the different frequencies in the Mel-frequency filter bank. The increased spacing between the triangles is due to the Mel scale being applied.


2.7 Evolution of hardware

The computations required for neural networks and backpropagation with gradient descent lend themselves well to parallelisation. There are two main reasons for this. Firstly, the operations for each neuron in a layer are generally the same and very simple, usually only input * weight + bias. Secondly, stochastic gradient descent allows the contributions of individual training samples to be calculated before the weight update is applied, which in turn allows the training data to be split into batches that can be computed on parallel processing units. So if the problem parallelises well, the optimal choice of hardware should reflect this. For central processing units (CPUs) to perform reasonably well on such tasks the core count becomes very important. In recent years there are not really any CPUs suitable for deep learning tasks; instead graphics processing units (GPUs) have become the best suited hardware for training deep learning models. Modern GPUs offer at least 10 times the performance of CPUs.

It has been argued that the reason why GPUs have become so much more powerful and affordable over the last 10 years is the advent of widespread computer gaming, which is the biggest market for the GPU manufacturers. Another emerging market for GPU manufacturers is crypto currency mining, which is becoming a determining factor for the price. However, if a company is trying to create a top of the line deep learning model trained on huge amounts of data, a single GPU will not suffice. In this case it is more common to use clusters of GPUs in order to complete training in a number of days rather than weeks. But maintaining large GPU clusters is expensive, and this has spawned the practice of renting clusters of GPUs in the cloud to train the models, thereby only paying cloud service providers such as Amazon Web Services and Google Cloud Platform for the specific use. The latest hardware innovation for deep learning models is the tensor processing unit (TPU) developed by Google. This is very specialised hardware suited only for one specific and very narrow task, namely running deep neural networks. TPUs typically offer an increase in computing performance of an order of magnitude compared to GPUs. When the TPUs were initially released on the Google compute platform they were however only optimised to run neural networks, not to train them.

2.8 Tensorflow

The different network designs in this study are implemented using the TensorFlow open source software library released by the Google Brain Team. It is used for machine learning with neural networks. It is a relatively new entry, only released in 2015, but has quickly risen to become one of the most popular AI frameworks [Ten]. The name TensorFlow comes from tensors, multi-dimensional arrays that flow through the different mathematical computations performed by the layers in a neural network.


CHAPTER 3

Methods

3.1 Overview

In this section, a set of alternative setups of the speech recognition system is described, followed by considerations and a specification of how they were evaluated. Finally, it is described how the different models were trained and how the performance and merits of each design were evaluated. For anyone interested in the small samples of the data sets, the modified code, the complete training logs and the transcriptions generated in this work, a GitHub repository has been set up at https://github.com/johanhagner/End-To-End-Speech-Recognition

The experiments are run on an Nvidia GTX 970 GPU [Nvi]. The GTX 970 was launched in November of 2014. It features 1664 processing cores, 4 gigabytes of video random access memory, operates at a maximum of 1178 megahertz and is capable of 3 494 gigaflops. The data sets and pre-processed files are stored on a Samsung 850 EVO solid state drive for convenience.

The TensorFlow library was used to implement the neural networks. The baseline code for the project was originally sourced from hirofumi0810's GitHub project https://github.com/hirofumi0810/tensorflow_end2end_speech_recognition under the MIT licence. The TensorBoard toolkit was used to plot the progression of the network training. For pre-processing of the data sets, the Hidden Markov Model Toolkit (HTK) was used.

3.2 LibriSpeech corpus

The dataset selected for this experiment is the LibriSpeech corpus from Open Speech and Language Resources (OpenSLR) [Pov]. LibriSpeech is an automatic speech recognition corpus made up of public domain audio books from the LibriVox project. Audio books are a good source of training data for speech recognition purposes because the audio is generated from a corresponding text, so no additional transcription step is needed. The corpus contains over 1000 hours of read English speech for training and evaluation of speech recognition systems. [Pov]

However, a somewhat negative aspect of audio books for this purpose is that the audio is not conversational, not spontaneous, and that each audio book is read by a single person. The audio data has a sampling rate of 16 kHz and has been carefully segmented and aligned. In total there are 7 subsets: 100-hours-clean, 360-hours-clean, 500-hours-other, dev-clean, dev-other, test-clean and test-other.


In order to determine if the audio is clean or other the creators had trained a speech recognition model on another data set. The audio that was the hardest for this model to classify correctly was then considered to be not ”clean”, i.e. ”other”. Care has been taken to ensure that the balance of gender and speaker lengths is retained among the different subsets. Also, the validation and training sets each contain a unique set of speakers that are not included in the other sets. [Pov]

3.3 Network design

Alternative network designs using three variants of gated recurrent units were developed to find a suitable design for a speech recognition system that can be trained and run efficiently on a GPU-equipped PC. The details of the different network models are described below. There is an argument to be made for reducing the number of LSTM cells per layer compared to the number of GRU cells, given that an LSTM cell includes more weights, but for simplicity's sake the number of cells was kept the same in this study.

The following information pipeline, shown in Figure 3.1, is used for all of the models:

- A data set of paired audio and text

- Pre-processing into mel-frequency filter bank format.

- Input layer with size corresponding to the input feature vector.

- 5 layers consisting of 224 gated recurrent units each (GRU, LSTM, or LSTM with peephole connections)
- Softmax layer

- Connectionist temporal classification

The Adam optimizer was used to adapt the learning rate for each parameter, which allows the parameters to change at different rates. These adaptive learning rates are computed by estimating both the first and second moment of the gradients. [KB17]

The purpose of this work is not to evaluate and optimise different hyper-parameters, so the choice of learning rate was guided by the relevant chapters in "Hands-On Machine Learning with Scikit-Learn and TensorFlow" [Gér17]. These chapters recommend keeping the learning rate at the default 0.001 and not employing any learning rate decay, because the Adam optimiser already regulates the learning rate for every parameter.

In the TensorFlow library there are many variants of GRU and LSTM implementations. In preliminary testing the LSTM block cell was used, as it was advertised to provide the best performance. There is also a GRU block cell, but it did not seem to be based on the same theoretical background. In light of this, the basic GRU and LSTM cells were chosen because they are both based on the original theoretical papers and not restricted in the choice of options. The choice to include the LSTM cell with peephole connections in this experiment was made because this is the kind of task that could make use of them, i.e. requiring precise timings [GS].
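A simplified sketch of how such a pipeline can be wired together with the TensorFlow 1.x API of the time is shown below. It is not the project code (which was based on hirofumi0810's repository), and the number of output classes is an assumption.

```python
import tensorflow as tf  # TensorFlow 1.x-style API, matching the era of this thesis

# Sketch of the pipeline described above: 120 input features -> 5 recurrent
# layers of 224 units -> per-frame class scores -> CTC loss, trained with Adam.
num_features, num_layers, num_units = 120, 5, 224
num_classes = 29  # assumed: characters + space + apostrophe + CTC blank

inputs = tf.placeholder(tf.float32, [None, None, num_features])  # (batch, time, features)
seq_len = tf.placeholder(tf.int32, [None])                       # number of frames per utterance
labels = tf.sparse_placeholder(tf.int32)                         # target character indices

def make_cell():
    # Swap in tf.nn.rnn_cell.GRUCell(num_units) or LSTMCell(num_units, use_peepholes=True)
    cell = tf.nn.rnn_cell.LSTMCell(num_units)
    return tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=0.8)  # dropout 0.2

stack = tf.nn.rnn_cell.MultiRNNCell([make_cell() for _ in range(num_layers)])
outputs, _ = tf.nn.dynamic_rnn(stack, inputs, sequence_length=seq_len, dtype=tf.float32)

logits = tf.layers.dense(outputs, num_classes)   # per-frame scores; CTC applies softmax internally
loss = tf.reduce_mean(tf.nn.ctc_loss(labels=labels, inputs=logits,
                                     sequence_length=seq_len, time_major=False))
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)
```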


FIGURE 3.1 A diagram showing the information pipeline from audio signal to output text

3.4 Experiments

In experiment 1 the models were allowed to train on the small 100 hour clean subset for 24 hours, plus some extra time in order to finish the epoch, save the model and complete the evaluation on the development subsets. After some initial tests that crashed due to insufficient memory, a setup that allowed all three networks to complete the training was found. The choice of dropout rate should be informed by the size of the network layers as well as the size of the data set; a dropout rate of 0.2 (0.8 keep rate) was settled upon. In experiment 2 the same architectures, with a few differences, were trained on the much larger 960 hour set, which is the complete LibriSpeech corpus containing both "clean" and "other" type samples.

The first thing that was changed for experiment 2 was the addition of loss and cell error rate graphs for the dev-clean subset. Secondly, the way training time was measured was changed: a less naive approach was chosen where only the time actually spent training on batches was counted. Another adjustment was made with regard to stopping the training. Instead of finishing any epoch that had been started within the 24 hour training interval, the training was halted and the model performance was evaluated mid epoch. The highest batch size in experiment 2 that did not overrun available memory was found to be 21. Finally, the dropout rate was changed to 0.05 (from 0.2) based on the reduced risk of overfitting due to a much larger training set. An overview of the settings used in the experiments can be found in tables 3.1 and 3.2 respectively.

Model name 100h_GRU 100h_LSTM 100h_LSTM_Peep

Input features 120 120 120

RNN cell type GRU LSTM LSTM (Peephole connections)

Layer size 224 224 224

Number of layers 5 5 5

Batch size 24 24 24

Optimizer Adam Adam Adam

Learning rate 0.001 0.001 0.001

Dropout 0.2 0.2 0.2

TABLE 3.1 Settings for experiment 1

Model name 960h_GRU 960h_LSTM 960h_LSTM_Peep

Input features 120 120 120

RNN cell type GRU LSTM LSTM (Peephole connections)

Layer size 224 224 224

Number of layers 5 5 5

Batch size 21 21 21

Optimizer Adam Adam Adam

Learning rate 0.001 0.001 0.001

Dropout 0.05 0.05 0.05

TABLE 3.2 Settings for experiment 2

3.5 Evaluation

The primary performance criterion used to evaluate the models was the cell error rate on the dev-other data set, which consists of previously unseen data samples classified as being among the 25 percent most difficult samples. The cell error rate is the rate at which the models predict the wrong symbol: zero percent being completely accurate and 100 percent being completely inaccurate. Secondly, the models were also evaluated on the cell error rate of the dev-clean data set. The reason for basing the evaluation on the data set labelled other rather than the one labelled clean is that a speech recognition model that is supposed to work in any practical environment needs to perform well on challenging data. However, the audio labelled other in this corpus is in fact not that noisy when compared to other data sets, such as live recordings from noisy environments, e.g. events in front of a crowd.
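As an illustration of how such error rates can be computed (this is not the evaluation code used in the thesis), the sketch below measures the edit distance between a reference and a hypothesis and divides it by the reference length, at character level for the cell error rate and at word level for the WER.

```python
# Minimal error rate sketch: Levenshtein edit distance divided by reference length.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution (free if symbols match)
    return d[-1]

def error_rate(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

ref = "you don't mean that you thought me so silly"
hyp = "you don't mean that he thought me so sily"
print(round(error_rate(list(ref), list(hyp)), 3))      # character-level error rate
print(round(error_rate(ref.split(), hyp.split()), 3))  # word-level error rate
```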


CHAPTER 4

Results

All models improved during training, with a gradually decreasing rate of improvement; most of the gains in performance were made in the early stages of training. In experiment 1 all of the models showed an increasing difference between the error rates on samples from the training set and on samples from the dev-other subset.

After about 10 epochs (i.e. 10 rounds of training where the model has seen each of the samples in the training set once) the rate of improvement was marginal and even deteriorating in one case. In terms of training speed the basic LSTM model ran at 1 percent fewer steps per hour compared to the GRU model whereas the LSTM model with peephole connections ran at 15 percent fewer steps per hour. The full training logs as well as the transcriptions for the test sub sets (test-clean and test-other) are available at https://github.com/johanhagner/End-To-End-Speech-Recognition

In experiment 1 with the 100 hour training set of clean data:

• The lowest cell error rate after approximately 24 hours of training was achieved by the LSTM model with peephole connections enabled, which reached a cell error rate of 30.37 percent on the unseen dev-other subset. The basic LSTM model got a similar score of 31.11 percent, whereas the GRU model performed worse with a cell error rate of 33.97 percent. See Figure 4.1.

• The word error rates seem to follow the same trend as the cell error rates, keeping the same internal ranking: GRU performing worst, LSTM better and LSTM with peephole connections best, as shown in Figure 4.2.

• As expected the cell error rates for the dev-clean sub set were lower than for the dev-other sub set across the board for all of the models.

In experiment 2 with the 960 hour training set of both clean and other data:

• The lowest error rate after 24 hours of training was achieved by the basic LSTM model, producing a cell error rate of 22.84 percent on the dev-other subset. The LSTM model with peephole connections achieved a similar score of 22.97 percent, whereas the GRU model again performed significantly worse at 32.90 percent. See Figure 4.1.

• Unlike experiment 1, in experiment 2 the internal ranking for word error rates is not exactly the same as for cell error rates: LSTM with peepholes has a better word error rate than basic LSTM but a worse cell error rate, as can be seen in Figure 4.2.

• Again the cell error rates for the dev-clean sub set were lower than for the dev-other sub set for all of the models.

In experiment 1 the cell error rate of the GRU network showed a decreasing trend during most of the training session. There were however three lapses in this decline, namely epochs 8, 11 and 12, which did not improve compared to the previous epoch. It reached a cell error rate of 33.97% in the 18th and final epoch.

The LSTM network in experiment 1 reached its lowest cell error rate at epoch 16 out of 18. The network failed to improve its score at the 7th, 9th and 13th epochs.

The cell error rate of the LSTM network with peephole connections in experiment 1 had a decreasing trend until the 13th and final epoch. There were however two lapses in this trend, namely epochs 11 and 12, which did not improve the cell error rate for the dev-other subset.

The cell error rate for the GRU network in experiment 2 seemed to level out early. It had its best full evaluation at step 13 500 (epoch 1.008, the 9th evaluation out of 20), only 11 hours into training. At this point the model achieved a 32.90% cell error rate on the dev-other subset and 18.59% on the dev-clean subset.

The cell error rate of the LSTM network in experiment 2 kept going steadily down overall and had its best evaluation after step 28 500 (epoch 2.128, the 19th evaluation out of 20). There were however three lapses in this decline, namely evaluations 14, 17 and 18, which did not improve the cell error rate for the dev-other subset. It reached a cell error rate of 22.84% for the dev-other subset and 10.70% for the dev-clean subset. For the full training log and the decoded test subsets, see the appendices.

The cell error rate of the LSTM network with peephole connections in experiment 2 kept going steadily down overall and had its best evaluation after step 24 700 (epoch 1.844, the 17th evaluation out of 17). There was however one lapse in this decline, namely evaluation 14, which did not improve the cell error rate for the dev-other subset. It reached a cell error rate of 22.97% for the dev-other subset and 10.75% for the dev-clean subset. For the full training log and the decoded test subsets, see the appendices.


Model name 100h_GRU 100h_LSTM 100h_LSTM_Peep
CER dev-clean 15.74% 13.12% 12.41%
CER dev-other 33.97% 31.11% 30.37%
WER dev-clean 42.25% 35.60% 34.41%
WER dev-other 69.18% 64.74% 63.65%

TABLE 4.1 CER (Cell Error Rate) and WER (Word Error Rate) for experiment 1

Model name 960h_GRU 960h_LSTM 960h_LSTM_Peep
CER dev-clean 18.59% 10.70% 10.75%
CER dev-other 32.90% 22.84% 22.97%
WER dev-clean 49.11% 31.04% 31.19%
WER dev-other 69.00% 51.91% 51.38%

TABLE 4.2 CER (Cell Error Rate) and WER (Word Error Rate) for experiment 2


FIGURE 4.1 Bar chart describing the cell error rates (percent) across all models (GRU, LSTM, LSTM-Peep) for the clean and other development sets in both experiments.

FIGURE 4.2 Bar chart describing the word error rates (percent) across all models (GRU, LSTM, LSTM-Peep) for the clean and other development sets in both experiments.


FIGURE 4.3 Measured cell error rate for GRU network, sampled once every 200 steps from previously unseen data (Dev-other subset in the LibriSpeech corpus)

FIGURE 4.4 Measured cell error rate for LSTM network, sampled once every 200 steps from previously unseen data (Dev-other subset in the LibriSpeech corpus)


FIGURE 4.5 Measured cell error rate for LSTM network with peephole connections, sampled once every 200 steps from previously unseen data (Dev-other subset in the LibriSpeech corpus)

FIGURE 4.6 Measured cell error rate for GRU network, sampled once every 200 steps from previously unseen data (Dev-other and Dev-Clean subsets in the LibriSpeech corpus)


FIGURE 4.7 Measured cell error rate for LSTM network, sampled once every 200 steps from previously unseen data (Dev-other and Dev-Clean subsets in the LibriSpeech corpus)

FIGURE 4.8 Measured cell error rate for LSTM network with peephole connections, sampled once every 200 steps from previously unseen data (Dev-other and Dev-Clean subsets in the LibriSpeech corpus)


4.1 Transcriptions

In the following section, samples of audio transcribed by the different models are presented. The audio samples are taken from the test-clean and test-other subsets.

4.1.1 Noisy audio

These are the transcriptions from the 1688-142285-0000 wav file:

Correct:

there's iron they say in all our blood and a grain or two perhaps is good but his he makes me harshly feel has got a little too much of steel anon

Transcriptions:

GRU-100h-clean: thes in they say in oloublod and a granel two haps is god but his he makes me hoshly fel has got a litle toa much of stal anorne

LSTM-100h-clean: the zan they say in al ou lod and a grain two haps is god but his he makes me hashly fel has got alitual to much of s l anon

LSTM-Peep-100h-clean: ther’s on they say in al low blod and a gray nol two paps is god but his he makes me hashly fel has got a litale toa much of s l anone

GRU-960h-noisy: theis and they say in al ablod and a grean al two paps his god but hes he makes me hashly fel has got a litle to much of s l an on

LSTM-960h-noisy: thes ion they say in al ourblod and a greinlr to phaps his god but his he makes me hashly fel has got a litle to much of s l an on

LSTM-Peep-960h-noisy: this ion they say in al our blod and a grain l two peaps is god but his he makes me harshly fel has got a litle to much of steal anon

These are the transcriptions from the 1688-142285-0001 wav file:

Correct:

margaret said mister hale as he returned from showing his guest downstairs i could not help watching your face with some anxiety when mister thornton made his confession of having been a shop boy

Transcriptions:

GRU-100h-clean: magard said mista hal as he retend from shone his guist oused as i could ut hel boching a face witd suming zaity would mist e thorns anhn mit he’s confision wil having bein has shuok boy

LSTM-100h-clean: magred said mis ter hel as he retund from shown mes gis foundstays i could not help watchin yor face with some ingxority when mis e thornton mighty’s confision of having ben as shok bly

LSTM-Peep-100h-clean: magard said mis te hail as he retend from showing his gusts bounse s i could not help watching o face with some inxority when mistethorneton thnd maght hes confision of having ben his shok boy

GRU-960h-noisy: margared said mister hel as he returned from showin his gis conesters i could not hel pat on o face would some enxiety when mister thougten mad his confision of having ben ashop boy

LSTM-960h-noisy: margaret said mister hil as he returned from showing his gis counstars i couldnot help watching your face with some anxiety when mister thornton made his confision of having ben a shop boy


LSTM-Peep-960h-noisy: margaret said mister hale as he returned from showing is guis ounstairs i could not help watching your face with some anxiety when mister thornton made his confi on of having ben a shop boy

These are the transcriptions from the 1688-142285-0002 wav file:

Correct:

you don’t mean that you thought me so silly

Transcriptions:

GRU-100h-clean: you do it man that he thought me so sily
LSTM-100h-clean: e durnt main that you thought me so sily
LSTM-Peep-100h-clean: you don’t main that he tholat me so sily
GRU-960h-noisy: i on’ man that he thought me so cily
LSTM-960h-noisy: you don’t mean that heu thought me so sily
LSTM-Peep-960h-noisy: you don’t mean that he thought me so sily

4.1.2 Clean audio

These are the transcriptions from the 1089-134686-0000 wav file

Correct:

he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour fattened sauce

Transcriptions:

GRU-100h-clean: he hoped there would be sto for diner turnips and carits and brused patatos and fat mutan pieces to be latled out and thick pepered flower faten sauce

LSTM-100h-clean: he hoped there would be steoe for diner turnips and carats and bruised potatoes and fatn muton pieces to be ladled out ind t thick pepered flower fatend sauce

LSTM-Peep-100h-clean: he hoped there would be stew for diner turn ips and carots and bruised potatose and fat muton pieces to be latled out and thick pepered flower fatened souce

GRU-960h-noisy: he hoped there would be sto for diner turnips and carats and brosed petatoes and that mutan pieces to be ladled out and thithick pepered flower fatand sous

LSTM-960h-noisy: he hoped there would be stow for diner turnips and carats and bruised potatoes and at mu n pieces to be latled out into thick pepered flower fa nd saucs

LSTM-Peep-960h-noisy: he hoped there would be stew for diner turnips and charats and rused potatoes and fat muton pieces to be latled out in thick pepered flower faten sauce


These are the transcriptions from the 1089-134686-0001 wav file:

Correct:

stuff it into you his belly counselled him

Transcriptions:

GRU-100h-clean: stufit in to you his bely countled him
LSTM-100h-clean: stufed into you his bely countled him
LSTM-Peep-100h-clean: stuf at into you an his bely coun led him
GRU-960h-noisy: stofed in to you his bely councheled him
LSTM-960h-noisy: stufed into you his bely coun led him
LSTM-Peep-960h-noisy: stufed into you his bely counseled him

These are the transcriptions from the 1089-134686-0002 wav file:

Correct:

after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels

Transcriptions:

GRU-100h-clean: after early night fal the yelow lamps would light ap here and there the squalet quarter of the browhls
LSTM-100h-clean: after early night fal the yelow lamps would light hap here and there the squalid quarter of the brofles
LSTM-Peep-100h-clean: after early night fal the yelow lampse would lighe up here and there the squalid quarter of the brofles
GRU-960h-noisy: after early night fal the yele lamps would light op here and there the squoled quarter of the browfles
LSTM-960h-noisy: after early night fal the yelow lamps would light up here and there the squalid quarter of the brohles
LSTM-Peep-960h-noisy: after early night falw the yelow lamps would light up here and there the squalid quarter of the brofles


CHAPTER 5

Discussion

5.1 Parameters

The size of the network, i.e. the number of cells per layer times the number of layers, has to be chosen in conjunction with the intended batch size, i.e. how many examples the program will train on in parallel. This is because the limiting factor for the GPU is the amount of VRAM (Video Random Access Memory). To further complicate this decision, the VRAM requirement is also determined by the lengths of the samples that get randomly selected for each batch. Since a batch is made up of random samples, it is overwhelmingly likely that each batch is unique. If a batch gets unusually long samples the memory requirements become too great and the program will crash; this may happen in later epochs even if the first epoch manages to complete. The TensorFlow implementation of GRU seems to have larger memory requirements than the LSTM variants. This is unexpected because the GRU is theoretically a less complex design with fewer internal parts that have to be represented.

In the first experiment a naive approach to counting the time was taken. This was unsuitable for a couple of reasons. Firstly, the time spent evaluating random samples to create the plots, as well as the full evaluations at the end of every epoch, was counted. Secondly, this meant that the different models would have slightly different amounts of time to finish their last epoch. Some slight adjustments to the network settings were made for experiment 2 due to lessons learned from experiment 1, as well as the fact that 960 hours is a much larger data set than 100 hours. It turned out that the 960 hour set contained some longer samples that caused memory induced crashes at batch sizes of 24, 23 and 22.

The risk of overfitting was considered lower for experiment 2 because a quick calculation showed that only two epochs would have time to finish training in 24 hours. Some would argue that it is not worth keeping any dropout at all, but dropout also helps to generalise the solution by making no single neuron indispensable, and therefore should help the transfer to the evaluation and testing data sets.

5.2 Comparison

In experiment 1 the different models performed similarly to each other. The GRU performed worst with a cell error rate at 33.97%. The standard LSTM unit performed second best with a cell error rate of 31.11%.

The best model in experiment 1 was the LSTM with peephole connections, with a cell error rate of 30.37%.

Another notable difference between the models is that they operate at different speeds. The GRU was slightly faster than the standard LSTM, and the LSTM with peepholes was the slowest. This observation is supported by the theory of recurrent neural network cells, i.e. that GRU is a simplified version of the LSTM that only has one memory vector, and that peephole connections add extra computations. The results indicate that an LSTM network with peephole connections is able to extract more detailed information from the training data per batch than standard LSTM and GRU, thereby compensating for the lower training speed. This also agrees well with the theory, given that the LSTM variants have more degrees of freedom with which to represent the problem.

In previous work [REJ+] the researchers compared the effectiveness of GRU and LSTM for different kinds of noise, creating composite speech recognition data sets by adding different types of noise to the audio clips. They found that both GRU and LSTM were good at handling different kinds of noise. In particular, they found that GRU was better than LSTM on cyclical types of noise, whereas LSTM performed better on random noise. The sub sets with the ”other” notation do not represent any particular kind of noise, but were simply harder for another speech recognition model to transcribe. This would suggest that the noise is of a more varying or random nature, which might explain why the difference in performance between GRU and the LSTM variants increased so significantly from experiment 1 to experiment 2.

With the introduction of an almost tenfold larger data set for experiment 2, the results were much more diverse. The GRU model only had a very small improvement in cell error rate, down to 32.9% from 33.97%, whereas the two other models showed a much more substantial improvement. The standard LSTM model had the best score for experiment 2, with a cell error rate of 22.84%, down from 31.11% in experiment 1. LSTM with peepholes came very close to the standard LSTM model, with a cell error rate of 22.97%, compared to 30.37% in experiment 1. From experiment 2 we can see that the LSTM variants make better use of the larger training set. Another observation is that both LSTM models had very similar scores. Given that the peephole variant processes fewer batches in the same amount of time, this suggests that the increased complexity of LSTM with peephole connections makes it more efficient at modelling the speech, if one considers efficiency to be extracting more information from any given training example.

It is interesting to see that the GRU network failed in this fashion during experiment 2, only improving for 11 out of 24 hours and then even gradually degrading in performance on dev-clean, while the two other models improved their scores consistently. Traditional wisdom would not attribute this failure to the increase in size of the data set, but would rather point to the shift in audio quality, with 500 out of 960 hours being considered ”other”. To investigate this failure further, one could run these models on the intermediate configurations of the data set, 100-clean and 360-clean, in order to establish that it is in fact the introduction of audio considered ”other” that causes the problem. This offers a counterpoint to previous work [CGCB] by Junyoung Chung et al., which could not find any significant differences between the performance of GRU and LSTM. It is important to note that the task in that case was not exactly speech recognition but rather speech modelling, i.e. predicting audio given samples from previous time steps. However, these tasks are somewhat similar. Chung et al. also matched the models on the total number of weights rather than on the number of units when determining the size of the networks, to offer a fairer comparison. I would, however, be surprised if the extra number of weights in the LSTM accounted for the distinct failure of the GRU model in experiment 2.

When this thesis was written, both Google and Microsoft had surpassed human performance, reaching impressive word error rates of 4.9% and 5.1% respectively for transcribing English audio. This has been achieved with very specialised and massively deep recurrent neural networks, numbering hundreds of layers, trained on huge data sets that are probably one or two orders of magnitude larger than the LibriSpeech corpus used in this study. Clearly the results presented in this thesis are nowhere near those levels. Unfortunately, but not unexpectedly, the data sets used by Google and Microsoft are, to my knowledge, not public. They are probably more diverse and include much more conversational audio, with many types of noise represented. These big companies seem to be convinced that as long as it is possible to increase the size of the data set along with the network, the performance will continue to increase. And it seems to hold true, for now.


CHAPTER 6

Conclusions

This study has presented 6 variants of a speech recognition model: a recurrent neural network that uses Connectionist Temporal Classification (CTC), which allows for end-to-end learning. CTC solves a problem that has long persisted in the field of speech recognition, namely the problem of learning from sequences that are of different lengths. The main purpose of this work has been to compare the efficiency of different gated units for the speech recognition problem. The secondary purpose has been to see what levels of performance can be achieved by models trained on a desktop gaming computer for 24 hours. The library that was used is the Google-developed machine intelligence software TensorFlow [Ten]. The two main variants of gated units that were evaluated are the Long Short Term Memory (LSTM) cell and the Gated Recurrent Unit (GRU).
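As a minimal sketch of how such a model can be wired together in TensorFlow 1.x (the layer count, cell size, feature dimension and optimiser setting are illustrative assumptions rather than the exact configuration used in the experiments; substituting GRUCell, or LSTMCell with use_peepholes=True, gives the other gated units compared here):

    import tensorflow as tf

    num_features, num_classes = 40, 29   # e.g. 40 filter-bank coefficients; 26 letters, space, apostrophe and the CTC blank
    num_units, num_layers = 512, 3       # illustrative, not the thesis settings

    inputs = tf.placeholder(tf.float32, [None, None, num_features])  # [batch, time, features]
    seq_len = tf.placeholder(tf.int32, [None])                       # true length of every sample
    labels = tf.sparse_placeholder(tf.int32)                         # target transcriptions as character indices

    # Stack of gated recurrent cells; GRUCell or LSTMCell(use_peepholes=True)
    # would give the other two variants compared in this work.
    cells = [tf.nn.rnn_cell.LSTMCell(num_units) for _ in range(num_layers)]
    outputs, _ = tf.nn.dynamic_rnn(tf.nn.rnn_cell.MultiRNNCell(cells),
                                   inputs, sequence_length=seq_len, dtype=tf.float32)

    # Project every time step onto the character classes and feed the CTC loss,
    # which handles the differing lengths of the audio and the transcription.
    logits = tf.transpose(tf.layers.dense(outputs, num_classes), [1, 0, 2])  # ctc_loss expects time-major logits
    loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits, seq_len))
    train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
    decoded, _ = tf.nn.ctc_beam_search_decoder(logits, seq_len)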

If the indication in this work, that an LSTM network with peephole connections is the most efficient at extracting information from the speech audio, is correct, it would suggest that when access to training data is the bottleneck for a speech recognition model, it might have a positive effect to use the LSTM cell with peephole connections. In experiment 2 the standard LSTM got a slightly better score after 24 hours, hinting that a standard LSTM cell might be more time efficient when the training time, rather than the access to training data, is the limiting factor. However, this indication is very weak given the small difference in score. To verify that this is the case, additional trials would have to be run at different training data sizes as well as for different training times.

If one tries to create a state-of-the-art speech recognition neural network, it is unlikely that the network would consist only of recurrent layers, but rather of a combination of different layer types, such as fully connected layers and/or convolutional layers. It is possible that in such a network GRU might perform better than found in these experiments.
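Purely as an illustration of what such a mixed architecture could look like, and not something evaluated in this thesis, a convolutional front end could be placed before the recurrent layers, for example:

    import tensorflow as tf

    spectrograms = tf.placeholder(tf.float32, [None, None, 40, 1])   # [batch, time, filter banks, channel]
    seq_len = tf.placeholder(tf.int32, [None])

    # A convolutional front end summarises local time-frequency patterns
    # before the recurrent layers model the longer-range dependencies.
    conv = tf.layers.conv2d(spectrograms, filters=32, kernel_size=(11, 11),
                            padding="same", activation=tf.nn.relu)
    flat = tf.reshape(conv, [tf.shape(conv)[0], -1, 40 * 32])

    cells = [tf.nn.rnn_cell.GRUCell(256) for _ in range(2)]
    outputs, _ = tf.nn.dynamic_rnn(tf.nn.rnn_cell.MultiRNNCell(cells),
                                   flat, sequence_length=seq_len, dtype=tf.float32)

    # A fully connected layer then maps every time step to character logits,
    # which could feed the same CTC loss as before.
    logits = tf.layers.dense(outputs, 29)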


Bibliography

[CGCB] Chung, Junyoung ; Gulcehre, Caglar ; Cho, KyungHyun ; Bengio, Yoshua: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. https://arxiv.org/pdf/1412.3555.pdf

[Gér17] Géron, Aurélien: Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly, 2017

[GFGS] Graves, Alex ; Fernández, Santiago ; Gomez, Faustino ; Schmidhuber, Jürgen: Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. http://www.cs.toronto.edu/~graves/icml_2006.pdf

[Gib] Gibiansky, Andrew: Andrew Gibiansky :: Math → [Code]. http://andrew.gibiansky.com/blog/machine-learning/speech-recognition-neural-networks/

[GJ] Graves, Alex ; Jaitly, Navdeep: Towards End-to-End Speech Recognition with Recurrent Neural Networks. http://proceedings.mlr.press/v32/graves14.pdf

[GS] Gers, Felix ; Schmidhuber, Jürgen: Recurrent nets that time and count. https://www.researchgate.net/publication/3857862_Recurrent_nets_that_time_and_count

[Han17] Hannun, Awni: Sequence Modeling with CTC. https://distill.pub/2017/ctc/. Version: Dec 2017

[Hoc] Hochreiter, Sepp: The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions. http://www.worldscientific.com/doi/pdf/10.1142/S0218488598000094

[HS] Hochreiter, Sepp ; Schmidhuber, Jürgen: Long Short-term Memory. https://www.researchgate.net/publication/13853244_Long_Short-term_Memory

[KB17] Kingma, Diederik P. ; Ba, Jimmy: Adam: A Method for Stochastic Optimization. https://arxiv.org/abs/1412.6980. Version: Jan 2017

[Nvi] GTX 970. https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-970

[Pov] Povey, Dan: LibriSpeech: An ASR corpus based on public domain Audio Books. http://www.danielpovey.com/files/2015_icassp_librispeech.pdf

[REJ+] Rana, Rajib ; Epps, Julien ; Jurdak, Raja ; Li, Xue ; Goecke, Roland ; Brereton, Margot ; Soar, Jeffrey: Gated Recurrent Unit (GRU) for Emotion Classification from Noisy Speech. https://arxiv.org/pdf/1612.07778.pdf

[Ten] TensorFlow. https://www.tensorflow.org/
