
Semi-Supervised Learning with Sparse Autoencoders in Automatic Speech Recognition

AKASH KUMAR DHAKA

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Semi-Supervised Learning with Sparse Autoencoders in Automatic Speech Recognition

AKASH KUMAR DHAKA

Master in Machine Learning
Date: November 2016
Supervisor: Giampiero Salvi
Examiner: Danica Kragic

Swedish title: Semi-övervakad inlärning med glesa autoencoders i automatisk taligenkänning

School of Computer Science and Communication


This work is aimed at exploring semi-supervised learning techniques to improve the performance of Automatic Speech Recognition systems. Semi-supervised learning takes advantage of unlabeled data in order to improve the quality of the representations extracted from the data. The proposed model is a neural network in which the weights are updated by minimizing the weighted sum of a supervised and an unsupervised cost function simultaneously. These costs are evaluated on the labeled and unlabeled portions of the data set, respectively. The combined cost is optimized through mini-batch stochastic gradient descent via standard backpropagation.

The model was tested on a phone classification task on the TIMIT American English data set and on a handwritten digit classification task on the MNIST data set. Our results show that the model outperforms a network trained with standard backpropagation on the labelled material alone. The results are also in line with state-of-the-art graph-based semi-supervised training methods.


This work aims to explore semi-supervised learning techniques to improve the performance of automatic speech recognition systems. Semi-supervised machine learning uses data without class labels to improve the quality of the representation extracted from the data. The model described in this work is a neural network in which the weights are updated by simultaneously minimizing the weighted sum of a supervised and an unsupervised cost function. These cost functions are evaluated on the labeled and the unlabeled portions of the data set, respectively. The combined cost function is optimized by gradient descent using standard backpropagation.

The model has been evaluated on a phone classification task on the TIMIT American English data set, and on a digit classification task on the MNIST data set. The results show that the model performs better than a network trained with backpropagation on labeled data only. The results are also competitive with current state-of-the-art graph-based semi-supervised learning methods.


Contents

1 Introduction
1.1 The Speech Recognition Problem
1.2 Motivation For the Thesis
1.3 Research Questions
1.4 Assumptions
1.5 Report Structure

2 Relevant Theory
2.1 Automatic Speech Recognition and Phone Classification
2.2 Feature Extraction
2.3 Acoustic Modelling
2.4 MLPs and Deep Neural Networks
2.5 Autoencoders
2.5.1 Manifold Learning with Autoencoders
2.5.2 Sparse Autoencoders
2.5.3 Performance of Sparse Autoencoders
2.5.4 Applications of Autoencoders
2.6 Semi-Supervised Learning
2.7 Assumptions

3 Related Work
3.1 Deep Neural Networks in ASR
3.1.1 Deep Belief Networks
3.1.2 Recurrent Neural Networks
3.2 Examples of Semi-Supervised Learning Methods
3.2.1 Heuristic based SSL/Self-Training
3.2.2 Transductive SVMs
3.2.3 Entropy Based Semi Supervised Learning
3.2.4 Graph based SSL
3.2.5 Semi-Supervised Learning with generative models
3.3 Autoencoder Based Semi-Supervised Learning

4 Method
4.1 The Model
4.2 Evaluation
4.3 Monitoring and Debugging
4.3.1 Design Choices/Tuning Hyperparameters
4.3.4 Weight Initialization
4.3.5 Number of Hidden Units
4.3.6 Momentum
4.3.7 Activation Function
4.3.8 Training Epochs
4.3.9 Additive Noise
4.3.10 Alpha
4.3.11 Gradient Checking

5 Experiment Setup and Results
5.1 Data
5.1.1 MNIST
5.1.2 TIMIT
5.2 Experimental Setup
5.3 Practical Setup
5.4 Results

6 Discussion, Conclusion and Future Work
6.1 Hypotheses discussed
6.1.1 H.1 Do Semi-Supervised Sparse Autoencoders perform better than neural networks on phone classification?
6.1.2 H.2 Does the above result generalize to other domains?
6.1.3 H.3 Do Semi-Supervised Sparse Autoencoders perform better than GBL SSL methods on phoneme classification?
6.2 Evaluation Method
6.3 Effect of α in the model
6.4 Future Work
6.5 Society and Ethics

7 Appendix

Bibliography


1 Introduction

With the invention of computers, the question of whether machines could be made to understand human speech emerged. In more recent years, speech technology has started to change the way we live by becoming an important tool for communication and interaction with devices. The recent improvements in spoken language systems have greatly improved human-machine communication. Personal digital assistants are an example of intelligent dialogue management systems. They have become very popular with the recent launch of products like Apple Siri, Amazon Alexa and Google Allo.

Besides human-machine interaction, speech technology has also been applied in assisted human-human communication. There can be several barriers even when humans communicate with each other. One of the most prominent of those barriers occurs when the two speakers do not speak a common language. In the past, and to a great extent still today, this was solved by means of a human interpreter. Speech-to-speech translation systems are, however, reaching sufficient quality to be of help, for example, for travellers. These systems accept spoken input in one language and output a spoken translation of the input in a target language.

In all the above examples, a key component is Automatic Speech Recognition (ASR). This system has the task of translating spoken utterances into a textual transcription that can be more easily handled in the rest of the system. In dialogue systems, this textual representation is fed to a language understanding module that extracts the semantic information to be handled by a dialogue manager. The dialogue manager, in turn, can decide to formulate a spoken response by means of a language generation system and a speech synthesis system. In speech-to-speech translation, instead, the output of the ASR is fed to an automatic translation system, and the translation is then converted to speech by means of speech synthesis.

1.1 The Speech Recognition Problem

Although humans recognize speech in their mother tongue effortlessly, the problem of automatic speech recognition presents a number of challenges. One source of complexity is the large variation in speech due to region, accent, age, gender, emotions, and the physical and mental well-being of the speaker. Another complication, compared to many classification tasks, is that speech is a continuous stream of units hierarchically combined into speech sounds, syllables, words, phrases and utterances. A speech recognizer must therefore be able to handle sequences of patterns.


The speech recognition problem is approached by means of statistical methods that can incorporate the variation and model sequences of acoustic events in a robust way. The design of these models makes extensive use of domain knowledge coming from the fields of linguistics and phonetics, and incorporates this knowledge into a machine learning framework that can learn the variability of the spoken units from large collections of recordings of spoken utterances. The building blocks of the statistical models are short segments of speech that can be considered to be stationary. These segments (or the corresponding models) are then combined to form phonemic segments. A phoneme is the smallest linguistic unit that can distinguish between two words. Phonemic models are then combined into words and phrases by using lexical and grammatical rules. Although each language uses a specific set of phonemes, there is a large overlap between languages, because the number of sounds that we can produce is constrained by the physics of our speech organs. An example of phoneme classification for American English is reported in Appendix 7.1.

1.2 Motivation For the Thesis

In order to learn the associations between the constituent speech units and the corresponding sounds, most speech recognition methods require carefully annotated speech recordings. The increasing interest in speech-based applications has produced large amounts of such data for many languages with a sufficiently broad consumer base. However, these linguistic resources are extremely expensive to produce, because they require intense expert labour (phonetic transcriptions are usually created by phoneticians). One consequence of this is that most speech databases are not publicly available, and even researchers must pay royalties in order to use them. Another consequence is that speech technology, and in particular speech recognition, does not easily reach speakers of languages spoken by minorities.

This work specifically targets improvements in ASR in a semi-supervised setting, therefore reducing the need for annotated material. The existing methods for semi-supervised learning in ASR are based on graph based learning or self-training using neural networks. Graph based learning is computationally very intensive, while self-training is based on heuristics and is prone to error due to wrong predictions. Recently, learning through neural networks has been found to scale to industrial levels and has given state-of-the-art results in automatic speech recognition. Our model is a modification of a single-layer network which can be trained using unlabeled and labeled data simultaneously. The method therefore incorporates concepts of semi-supervised learning, while retaining all the advantages of a neural network.

A more robust and less resource-intensive ASR system would contribute immensely to better connectivity, better aid systems for the disabled, and better systems for low-resource languages.


1.3 Research Questions

The objective of this thesis is to investigate semi-supervised learning using sparse autoencoders and whether they can be used to improve phoneme recognition over a standard neural network when the labeled dataset is very limited. We can break down this statement into three separate hypotheses.

The first hypothesis is to determine if the model we propose here can perform better at phoneme classification than a neural network trained purely discriminatively when we vary the amount of labelled data.

The second hypothesis to be tested is whether the proposed semi-supervised method is domain independent and can produce better results in other machine learning tasks. For this purpose we focus on handwriting recognition, where the dataset comprises images of handwritten digits.

Finally, the third hypothesis is to determine whether the model we propose here can perform better at phoneme classification than different graph based semi-supervised learning algorithms under varying percentages of labeled data.

1.4 Assumptions

There is one common assumption behind all semi-supervised models, which also becomes an assumption of this work: the distribution of the data, which the unlabeled data will help us to unravel, should be relevant for the classification problem. To state it formally, the information about the distribution $p(x)$ which can be obtained from unlabeled data should also carry the information required for inference of $y$, expressed through $p(y|x)$. If this is not the case, then semi-supervised learning will not work.

1.5 Report Structure

Chapter 2 introduces the theoretical aspects that are relevant to this thesis. These include aspects of automatic speech recognition and artificial neural networks. Section 2.2 describes how to make fixed-length feature vectors out of the raw speech waveform, which makes the task of classification and recognition easier for computers. Section 2.4 gives a background starting with the basic principles of a Multi Layer Perceptron (MLP). This is then followed by a brief theoretical background on the more recent "deep" neural networks having multiple layers. We also discuss several different kinds of deep neural networks, such as RBMs and RNNs, which we do not work with but which have been extensively used in Computer Vision and Speech Recognition. This is followed by Chapter 3, which describes recent advances in semi-supervised learning (SSL), with particular focus on Graph Based Learning (GBL) methods and algorithms that currently provide state-of-the-art results in SSL. The particular SSL algorithm used in this thesis is explained in Chapter 4. Chapter 5 reports details on the experimental setup and results. The report is concluded by a discussion of the results, their implications and possibilities for future work in Chapter 6. An appendix is also given with the mapping from the standard 48 phonemes in English to the 39-phoneme set used in these experiments and by the community in general.


2 Relevant Theory

This chapter gives a brief insight into the basic concepts of ASR and neural networks. This includes a section on transforming raw speech into features that are suitable for phone classification. We give a brief overview of GMM-HMM models, which have been the state of the art in speech recognition for many years. Finally, in Section 2.4 we introduce MLPs and Deep Neural Networks (DNNs), which in recent years have outperformed GMM-HMM models.

2.1 Automatic Speech Recognition and Phone Classification

Figure 2.1 illustrates the processes involved in a typical speech recognizer. The different parts use a combination of signal processing and machine learning methods to transcribe a spoken utterance into a sequence of words. Because the problem is intrinsically affected by uncertainty, a probabilistic framework is used.

First, the raw speech waveform is converted into a sequence $X = x_1, x_2, \ldots, x_T$ of feature vectors spaced at regular time intervals. This process is called feature extraction and is based on knowledge of speech production and perception. The goal of feature extraction is to convert speech into a representation that is suitable for the classification problem. More information about feature extraction methods is given in Section 2.2.

Given the sequence of observations $X$, the objective is to predict the most likely word sequence $\hat{W} = w_1, w_2, \ldots, w_m$. In probabilistic terms this can be written as follows:

$$\hat{W} = \arg\max_W p(W|X), \qquad (2.1)$$

where $p(W|X)$ is the posterior of the sequence of words $W$ given the observation sequence $X$. According to Bayes' rule, the above expression can be written as:

$$\hat{W} = \arg\max_W \frac{p(X|W)\,p(W)}{p(X)} \qquad (2.2)$$
$$= \arg\max_W p(X|W)\,p(W) \qquad (2.3)$$

The term $p(W)$ in the equation is our prior knowledge about which word sequences are likely to occur in a language and in a specific task. This is called a language model. The term $p(X|W)$ in Eq. 2.2 is the likelihood of a sequence $X$ of acoustic features given a particular sequence of words, denoted by $W$. This is computed by acoustic and lexical models. The lexical models describe the words' pronunciations in terms of sequences of phonemes, and the acoustic models describe the likelihood of acoustic features given a certain phoneme. The decoder is a search algorithm that uses the information in the acoustic, lexical and language models to perform the maximization in Eq. 2.2.

Figure 2.1: The illustration shows all the components in an ASR system. The dotted line shows a particular instance: for example, MFCC is a particular instance of an acoustic feature; likewise, the network we will propose will be used for acoustic modelling.

The acoustic models are the main focus in this thesis. They encode knowledge about acoustics, phonetics, microphone and environment variability, and differences due to gender, accent, dialect and age of the speaker. In order to test the effects of the acoustic models alone on the speech recognition task, a slightly simplified task is considered: phone classification. In this case, instead of the optimization in Eq. 2.1, we classify each feature frame $x_n$ into one of $K$ possible phonemic classes. The assessment of this classification task is only reliable if the speech data is annotated at the phonetic level (as is the case for the TIMIT database used in this study). Acoustic models that perform better phone classification are more likely to perform better speech recognition as well. Although phone classification is only the first step in estimating the performance of a method, this evaluation is an accepted practice when new speech recognition methods are introduced.

In the following sections, we will only describe feature extraction and acoustic models, because the phone classification task considered in this thesis does not require lexical and language models. We will focus in particular on Multi Layer Perceptrons and Deep Neural Networks that have been successfully used in recent years as acoustic models for speech.

2.2 Feature Extraction

A speech waveform can be represented as a sequence of samples at a sampling rate that can typically vary between 8 and 20 kHz depending on the quality of the recording. The samples are highly correlated and contain variations that are not easily associated with phonetic classes without any preprocessing. The goal of feature extraction is to provide a representation for the speech signal that is more suitable for classification. Several methods for feature extraction have been proposed in the past. The most commonly used are based on a short-time spectral representation of the signal, as, for example, Mel Frequency Cepstral Coefficients (MFCCs) and Perceptual Linear Predictive coefficients (PLPs). In order to capture the time evolution of those feature vectors, first-order and second-order temporal differences are often appended to the original features.

MFCCs are calculated according to the procedure shown in Algorithm 1:

Algorithm 1 Procedure to calculate MFCCs
1: Divide the signal into short, possibly overlapping frames.
2: For each frame, calculate the short-time Fourier transform.
3: Apply the (logarithmically spaced) mel filterbank to the power spectrum and take the sum of the energy in each filter.
4: Calculate the logarithm of all filterbank energies.
5: Apply the Discrete Cosine Transform (DCT) to the log filterbank energies.
6: Keep DCT coefficients 1-12 and discard the rest; the energy coefficient is optional.
7: Take the ∆ and ∆∆ of the coefficients w.r.t. preceding frames and append them to the original 13 coefficients.
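As a concrete illustration of this procedure, the sketch below computes 13 MFCCs plus their ∆ and ∆∆ using the librosa library (not used in the thesis); the 25 ms frame length, 10 ms hop and sampling rate are assumed values, not parameters prescribed by this work.

```python
# Minimal MFCC + delta feature extraction sketch (assumed parameters).
import librosa
import numpy as np

def mfcc_features(wav_path, sr=16000, n_mfcc=13):
    # Load the waveform at a fixed sampling rate.
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms frames with a 10 ms hop, 13 cepstral coefficients (steps 1-6).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    # First- and second-order temporal differences (step 7).
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    # Returns one 39-dimensional feature vector per frame.
    return np.vstack([mfcc, delta, delta2]).T
```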

The first step is motivated by the assumption that, though time-varying, the speech signal is stationary for short time intervals. The length of those intervals is typically between 10 and 20 msec. The frames are usually overlapping in time so that there is a smoother transition in the information captured by two consecutive frames. The following steps are motivated by perceptual phenomena. The cochlea is known to perform a frequency analysis of the signal, and humans are known to have logarithmic resolution both in frequency and in loudness. The mel filterbank is a set of triangular filters, designed in such a way that the filters are logarithmically spaced in frequency.

The application of the Discrete Cosine Transform (DCT) is motivated by modelling constraints. When the feature vectors are modelled by Gaussian Mixture Models (GMMs), it is desirable to work with uncorrelated features. This allows the models to be greatly simplified by using diagonal covariance matrices. This requirement has become less stringent with the advent of acoustic models based on neural networks (NNs), because these models can easily cope with feature correlations. In fact, in NN acoustic modelling it is common to work with filterbank features directly, that is, to skip steps 5 and 6 in Algorithm 1 [8, 41].

The reason for truncating the MFCC vector to only 12 components is to limit the representation to a coarse description of the speech spectrum, because the details are related to the frequency of vibration of the vocal folds and, to a first approximation, are a disturbing factor in phone classification.

This approach of knowledge-driven pre-processing of the waveform has proved to be successful in discarding information that is irrelevant for discrimination. However, it is always possible that relevant information is lost in the process. In some very recent studies [1], Convolutional Neural Networks have been applied to the speech samples directly, eliminating the need for feature extraction. Although more difficult to train, these models can potentially make use of all the information contained in the signal. A similar trend can be observed in Computer Vision as well [21].

2.3 Acoustic Modelling

Hidden Markov Models (HMMs) are the most popular statistical models in speech recognition. A first-order Markov chain is a state-space model where the probability distribution of the current state depends only on the previous state. A hidden Markov model is more complex in that the state is not directly observable: we observe an output that, given the current state, is conditionally independent of previous outputs and states. The states follow a first-order Markov chain. In speech recognition, the sequences of observations correspond to the feature vectors described in the previous section. The states roughly correspond to phonetic units.

The model defines two types of probability distributions: transition probabilities, which describe which state is more likely to occur in the next step given the current state, and emission probabilities, which describe the likelihood of a specific observation given the current state. One important inference problem is to calculate the posterior probability of the phoneme states given a sequence of acoustic observations.

The emission probabilities for continuous feature vectors (e.g. MFCCs) are in ASR usually modelled by Gaussian Mixture Models. The combined model is called GMM-HMM. GMMs have been state-of-the-art emission probability models in ASR for many years, mainly due to their flexibility. Closed-form adaptation techniques such as Maximum Likelihood Linear Regression (MLLR) [29] and feature-based MLLR (fMLLR) [11] made it possible to quickly adapt those models to new speakers or environmental situations. This is a key feature for methods that need to be used in real-life conditions.

Figure 2.2: Illustration of an acoustic model based on neural networks. The input window comprises 11 frames that are stacked and input to a neural network. The output of each node $i$ in the topmost layer of the neural network is interpreted as the posterior probability of a certain phoneme given the observations: $p(ph_i|X)$.

Alternatives to GMM-HMMs based on neural networks have been studied in the past and have become more and more popular in recent years. These discriminative models can outperform generative models like GMMs, at the expense of flexibility. Most often, the output activations of the neural network are interpreted as probabilities and used as estimators for the emission probabilities in an HMM model similar to the one described above. These combinations are usually referred to as hybrid ANN-HMM systems [38, 41, 26].

Several research groups have reported that DNN-based acoustic models outperform GMM-based systems even on large vocabulary continuous speech recognition (LVCSR) tasks [7, 42]. Figure 2.2 depicts an acoustic model that takes the MFCC features of the frames of speech as input and gives the value of the posterior for phoneme $i$ given the acoustic observation, $P(ph_i|X)$. In phone classification, the maximum a posteriori class is chosen for each time step independently of the previous classifications. If we are performing automatic speech recognition, instead, the posteriors are turned into likelihoods $P(X|ph_i)$ by means of Bayes' rule. Those likelihoods are then used in the complete system described in Figure 2.1. There have also been some recent attempts at removing the HMM model and letting the neural network model all aspects of speech, including the lexical and language models [14]. The remainder of this chapter will introduce, in some detail, a number of ANN models that are relevant to this study.
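In hybrid systems, the conversion from posteriors to likelihoods mentioned above is usually done by dividing the network outputs by the class priors estimated from the training labels (the factor $p(X)$ is constant within a frame and can be dropped in the decoder). A minimal sketch, with assumed array shapes:

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, priors, eps=1e-10):
    """Turn NN phone posteriors p(ph_i | x) into scaled likelihoods.

    Bayes' rule: p(x | ph_i) = p(ph_i | x) * p(x) / p(ph_i); p(x) is the same
    for every class in a given frame, so dividing by the prior is enough.

    posteriors: (n_frames, n_phones) softmax outputs of the network.
    priors:     (n_phones,) relative frequency of each phone in the training data.
    """
    return posteriors / (priors[np.newaxis, :] + eps)
```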


2.4 MLPs and Deep Neural Networks

As our model is an autoencoder, a kind of neural network, we will first describe a simple MLP and then autoencoders more specifically. A Multi Layer Perceptron (MLP), also called a feedforward neural network, is a series of logistic regression models stacked on top of each other. In each layer, the inputs are linearly combined and the combination is passed through a non-linear function. A basic MLP is made up of three layers. The first layer is the input layer; the number of nodes in the input layer is equal to the dimension of the input. The second layer is called the "hidden layer", because it is not observed directly. The output layer is used to perform classification or regression; the dimension of the output is equal to the number of classes for classification, or to the dimensionality of the output signal for regression. There may be more than one hidden layer in an MLP.

Models with several hidden layers are often called Deep Neural Networks (DNNs).

In a fully connected network, all the nodes in one layer are connected to all the nodes of the next layer. A connection between a node in one layer and a node in the next layer is assigned a weight $w_{ij}$. The activation of a node, from the second layer to the last, is a non-linear function applied to the weighted sum of all nodes from the previous layer. This non-linear function is also called an activation function. The expression in both scalar and matrix form is given in Eqs. 2.4 and 2.5:

$$y_j = f\Big(\sum_i w_{ij} x_i + b_j\Big) \qquad (2.4)$$
$$\mathbf{y} = f(W^T \mathbf{x} + \mathbf{b}), \qquad (2.5)$$

where $\mathbf{x}$ is the activation of layer $n$ (or the input vector for the input layer), $\mathbf{y}$ is the output of layer $n+1$, $\mathbf{b}$ is the bias vector of the hidden layer and $W$ is the weight matrix between the layers.

The activation functions can vary depending on the position of the node in the network. For hidden layers, commonly used activation functions are, for example:

1. Sigmoid function:
$$y_j = \frac{1}{1 + e^{-z_j}} \qquad (2.6)$$

2. Hyperbolic tangent function:
$$y_j = \tanh(z_j) \qquad (2.7)$$

3. Rectified Linear Unit (ReLU) function [34]:
$$y_j = \max(z_j, 0) \qquad (2.8)$$

where $z_j = \sum_i w_{ij} x_i + b_j$ is the linear combination of the activations of all nodes connected to node $j$.

For the output layer, the activation function depends on the task. Linear activation is common for regression, whereas for classification it is common to use softmax activations, given by

$$y_j = \frac{\exp(z_j)}{\sum_k \exp(z_k)} \qquad (2.9)$$

Because the activations sum to 1, they can be interpreted as posterior probabilities of the different classes given the input to the network. The maximum a posteriori classifier is then implemented by selecting the class that corresponds to the maximum activation:

$$o = \arg\max_j y_j \qquad (2.10)$$
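To make Eqs. 2.4-2.10 concrete, the short numpy sketch below runs a forward pass through a single-hidden-layer MLP with a tanh hidden layer and a softmax output; the layer sizes and random weights are arbitrary assumptions used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Subtract the row-wise max for numerical stability; rows sum to 1 (Eq. 2.9).
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mlp_forward(x, W1, b1, W2, b2):
    h = np.tanh(x @ W1 + b1)        # hidden layer, Eqs. 2.5 and 2.7
    y = softmax(h @ W2 + b2)        # output posteriors, Eq. 2.9
    return y, y.argmax(axis=1)      # MAP class per input row, Eq. 2.10

# Example: 5 random 39-dimensional frames, 256 hidden units, 39 classes.
x = rng.standard_normal((5, 39))
W1, b1 = 0.01 * rng.standard_normal((39, 256)), np.zeros(256)
W2, b2 = 0.01 * rng.standard_normal((256, 39)), np.zeros(39)
posteriors, classes = mlp_forward(x, W1, b1, W2, b2)
```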

DNNs are generally trained discriminatively using the backpropagation algorithm to minimise a cost function. Backpropagation is an algorithm to train a multi-layered network so that each layer in the architecture can learn a mapping from input to output that is optimal according to an optimality criterion. The backpropagation learning algorithm requires an input and a target. First, the weights in the network are initialized to random values. In the forward pass, an observation is input to the network and activations are generated for each node in each layer based on the current values of the weights. This allows us to measure the difference between the output of the network $y$ and the desired (target) output $t$. This measurement can be a simple squared error on one single observation

$$E_E = \frac{1}{2}(t - y)^2 \qquad (2.11)$$

or a cross-entropy loss measure computed over several observations:

$$E_C = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{K}\left[t_{ij}\log y_{ij} + (1 - t_{ij})\log(1 - y_{ij})\right], \qquad (2.12)$$

where the summation over j corresponds to the nodes in the output layer, and the sum over i is an average over a number of observations.

In the backward pass, we calculate how the output error depends on each weight in the network, and update the weights in order to minimize the error. This is achieved by factorizing the partial derivatives with the help of the chain rule. This way, we can propagate the deltas back from the output layer to the input. If we denote the dependency of the error on a specific weight $w_{ij}$ by $\frac{\partial E}{\partial w_{ij}}$, and we start from the value of the weight $w_{ij}(t)$ at iteration $t$, the new value of the weight at iteration $t + 1$ is calculated by gradient descent as:

$$w_{ij}(t + 1) = w_{ij}(t) + \Delta w_{ij}(t + 1) \qquad (2.13)$$
$$= w_{ij}(t) - \eta \frac{\partial E}{\partial w_{ij}}, \qquad (2.14)$$

where $\eta$ is the learning rate.

For very large datasets, calculating the gradient for the entire dataset at once can be extremely computationally intensive. To mitigate this problem, it is more efficient to compute the derivatives on a small, random mini-batch of training points, and then modify the weights of the layer proportionally to the gradient. To reduce the effect of noisy, spurious training samples, it is common to use an additional momentum term in the training algorithm to make the training more uniform and less spiky. The weight update rule including the momentum term $\alpha$ is given by

$$\Delta w_{ij}(t) = \alpha \Delta w_{ij}(t-1) - \eta \frac{\partial E}{\partial w_{ij}}. \qquad (2.15)$$


The term $\alpha$ ensures smoother variations in the gradient values, and the term $\eta$ gives the learning rate. DNNs with many hidden layers and many units per layer are very flexible models with a very large number of parameters, which can easily overfit due to some spurious characteristic of the training set. To reduce overfitting, several techniques are used, such as L2 regularisation, which penalises the magnitude of the weights and prevents them from becoming very large, or dropout.
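The sketch below implements one mini-batch update with momentum as in Eqs. 2.13-2.15, with an optional L2 (weight-decay) term added as an assumed extension, since the text only names L2 regularisation without giving its form.

```python
import numpy as np

def sgd_momentum_step(W, dE_dW, velocity, lr=0.1, momentum=0.9, l2=0.0):
    """One gradient descent step with momentum (Eqs. 2.13-2.15).

    W:        current weights w_ij(t)
    dE_dW:    gradient dE/dw_ij computed on the mini-batch
    velocity: previous update Delta w_ij(t-1)
    l2:       optional L2 regularisation strength (assumed extension)
    """
    grad = dE_dW + l2 * W                       # add the weight-decay term
    velocity = momentum * velocity - lr * grad  # Delta w_ij(t), Eq. 2.15
    W = W + velocity                            # w_ij(t+1), Eq. 2.13
    return W, velocity
```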

2.5 Autoencoders

Multi layer perceptrons require the target value to be specified for each training example. If we do not have this information, we can still learn a representation that depends on the distribution of the data by means of an autoencoder. An autoencoder is a special kind of neural network with two components: an encoder and a decoder. The encoder takes the input $\mathbf{x}$ and maps it to a hidden representation $\mathbf{y}$, which is given by

$$\mathbf{y} = \sigma(W\mathbf{x} + \mathbf{b}_h). \qquad (2.16)$$

The latent or hidden representation $\mathbf{y}$ is mapped back to the original input using a decoder, which is given as:

$$\mathbf{z} = \sigma(W'\hat{\mathbf{y}} + \mathbf{b}_v) \qquad (2.17)$$

where $\mathbf{y}$ is the encoded value, $\hat{\mathbf{y}}$ is a possibly corrupted version of the encoded value, $\mathbf{b}_h$ and $\mathbf{b}_v$ are the bias values of the encoder and decoder respectively, $W$ is the weight matrix of the encoder, and $W'$ represents its transpose. This network tries to minimise the reconstruction error, given as:

$$L_C(\mathbf{x}, \mathbf{z}) = \|\mathbf{x} - \mathbf{z}\|^2 \qquad (2.18)$$
$$L_B(\mathbf{x}, \mathbf{z}) = -\sum_{k=1}^{d}\left[x_k \log z_k + (1 - x_k)\log(1 - z_k)\right] \qquad (2.19)$$

The first equation, 2.18, is for continuous input, while the second, 2.19, is used for classes and binary vectors; it is basically the cross-entropy error already defined above. Although we have expressed the equations with the $\sigma$ function, the activation function could be any other common activation function.

If the hidden layer of the autoencoder has a lower dimensionality than the input, the model will perform non-linear dimensionality reduction. If it is of equal or greater dimensionality, special care must be taken to avoid the model learning a trivial mapping (the identity function). For this reason, the input or the hidden representations may be corrupted during training.
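A minimal numpy sketch of Eqs. 2.16-2.18 (a single sigmoid encoder/decoder pair with tied weights and a squared-error reconstruction loss, without input corruption; the dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_forward(x, W, b_h, b_v):
    y = sigmoid(x @ W + b_h)                    # encoder, Eq. 2.16
    z = sigmoid(y @ W.T + b_v)                  # decoder with tied weights, Eq. 2.17
    loss = np.sum((x - z) ** 2, axis=1).mean()  # reconstruction error, Eq. 2.18
    return y, z, loss

# Example: 10 inputs of dimension 39, hidden dimension 100.
x = rng.random((10, 39))
W = 0.01 * rng.standard_normal((39, 100))
y, z, loss = autoencoder_forward(x, W, np.zeros(100), np.zeros(39))
```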

2.5.1 Manifold Learning with Autoencoders

One reason why autoencoders do so well is that they exploit the idea that data is generally concentrated around a manifold, or several subsets of manifolds. The theoretical understanding of how autoencoders map data manifolds is still a very active area of research. To give a brief motivation, the general principle behind all autoencoders is a trade-off between two ideas: first, to learn a representation $\mathbf{y}$ of a training example $\mathbf{x}$ such that $\mathbf{x}$ can be approximately recovered from $\mathbf{y}$ through a decoder. $\mathbf{x}$ should be drawn from the training data, because it means that the autoencoder need not successfully reconstruct inputs that are not probable under the data generating distribution. The other, complementary idea is to satisfy a generalisation or regularisation penalty; the presence of this term encourages solutions that are less sensitive to small perturbations in the data.

Figure 2.3: Illustration of a sparse autoencoder with more nodes in the hidden layer than in the input layer. Not all nodes in the hidden layer are activated; the green-coloured nodes are the only activated nodes.

2.5.2 Sparse Autoencoders

Historically, the first application of autoencoders was reducing the dimensionality of the input data, hence the name. Even when deep learning came to the forefront, the conventional architecture was to have more layers with fewer nodes than the input. Such architectures are called bottleneck networks. The idea was that more abstract features should not require as many dimensions as the input. Sparse autoencoders differ from standard approaches to compression and dimensionality reduction, where there are fewer hidden units than input dimensions, i.e. $N_W < N_D$. Sparse autoencoders instead have "overcomplete" representations, which means $N_W > N_D$. The idea is to learn a representation while at the same time imposing a "sparsity" constraint on the activation of the hidden units, so that only a small percentage of them are really active. This can be done by adding a sparsity term to the objective function [6]. We make use of this idea in our experiments by having an "overcomplete" network. The details will be presented in Chapter 4.
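The text above does not spell out the exact form of the sparsity term; one common choice, used for example in the sparse autoencoder lecture notes cited later as [35], is a KL-divergence penalty between a target sparsity ρ and the average activation of each hidden unit. The sketch below shows that formulation purely as an illustrative assumption.

```python
import numpy as np

def kl_sparsity_penalty(hidden_activations, rho=0.05, beta=3.0, eps=1e-8):
    """KL-divergence sparsity penalty added to an autoencoder objective.

    hidden_activations: (n_samples, n_hidden) activations in (0, 1).
    rho:  target average activation per hidden unit.
    beta: weight of the penalty relative to the reconstruction loss.
    """
    rho_hat = hidden_activations.mean(axis=0).clip(eps, 1 - eps)
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return beta * kl.sum()
```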

2.5.3 Performance of Sparse Autoencoders

Sparse autoencoders have generally performed well in many applications, but they have some disadvantages too. As shown in Figure 2.3, sparse autoencoders require more memory, as there are more nodes in the hidden layers (even though only a few are actually activated). Due to these extra nodes, the weight matrices have a larger rank, which slows down computation. In comparison, autoencoders with a bottleneck architecture have fewer nodes in the hidden layer and can have lower time complexity. The overall time complexity of a neural network is linear in the total number of connections, and though a sparse autoencoder may have many more connections than a bottleneck autoencoder, many of them might not even be used after the first few iterations. It is therefore more difficult to find a lower bound on the time complexity of sparse autoencoders. Sparse autoencoders can offer more flexibility and additional modelling power than a bottleneck autoencoder, although the theoretical understanding of how they work is still an active area of research.

2.5.4 Applications of Autoencoders

Autoencoders have been successfully applied to dimensionality reduction and information retrieval tasks. Dimensionality reduction was, in fact, the first application of representation learning and autoencoders. Another task that has made successful use of autoencoders is information retrieval, where the dimensionality reduction algorithm produces a low-dimensional, binary code, which can then be stored in a hash table mapping binary codes to entries for fast search. The hash table gives us the capability to perform information retrieval by returning all database entries having the same binary code. The search is then a lot faster because it is done on "hashed" binary codes.

2.6 Semi-Supervised Learning

While the sections above described one popular supervised learning model and one unsupervised learning model, this section discusses semi-supervised learning. Semi-supervised learning (SSL) is a field of study in machine learning which falls between supervised learning (a classification/regression problem with full class labels available in the training data) and unsupervised learning (modelling of input data into classes without any class labels). In most practical applications, it is hard to get fully labeled training data, whereas unlabeled training data is generally easier and in some cases much cheaper to obtain. An interesting question is then how the unlabeled data can be used in conjunction with labeled data to improve the accuracy of a system trained on just the labeled data and, if it can, what the minimum relative proportion is at which the unlabeled data loses any relevance to the problem.

We are given a set of independent and identically distributed input samples $x_1, \ldots, x_l \in X$ and their corresponding targets $y_1, \ldots, y_l \in Y$, such that $D_L = (X_L, Y_L)$. In addition we also have $u$ unlabeled input samples $x_{l+1}, x_{l+2}, \ldots, x_{l+u} \in X_U$, such that $D_U = (X_U)$; the total number of samples is then given as $n = l + u$. Supervised classification only utilizes the information from $D_L$ to learn a decision boundary, whereas in the semi-supervised learning framework the decision boundary is obtained by utilizing information from both $D_L$ and $D_U$ at the same time. Semi-supervised learning attempts to make use of the combined dataset $D = D_L \cup D_U$ to surpass the classification performance that can be obtained either by supervised learning (discarding the unlabeled data) or by unsupervised learning such as clustering (discarding the label information). Some researchers refer to SSL as "transductive learning" or "inductive learning"; the goal of transductive learning is to infer the correct labels of the unlabeled dataset $D_U$.


2.7 Assumptions

We make several assumptions in SSL. The key assumption for semi-supervised learning to work is that the information about the distribution $p(x)$ which can be obtained from unlabeled data should also carry the information required for inference of $y$, expressed through $p(y|x)$.


3 Related Work

In this chapter, we will talk about existing literature on the application of deep neural networks for ASR and on Semi-Supervised Learning techniques. One prominent such technique is Graph Based Semi-Supervised Learning (GBL SSL).

3.1 Deep Neural Networks in ASR

This section introduces the kinds of DNNs and the different architectures that have recently been used in ASR.

3.1.1 Deep Belief Networks

The first set of results that beat traditional GMM-HMMs was reported in [38], where the authors trained DBNs (Deep Belief Networks) to recognise the HMM states of phonemes, given an input of 9-11 frames of MFCC feature vectors. A DBN contains several stacked layers of RBMs (Restricted Boltzmann Machines). The RBMs are generatively trained using the contrastive divergence algorithm outlined in [38]. Initially, the results were reported on TIMIT [10]. Further attempts were made to replicate the success on TIMIT in large-scale vocabulary applications. The first such attempt was on data collected from the Bing mobile voice search application (BMVS). It had about 24 hours of training data with a high degree of variability. The DBNs trained on this dataset achieved a sentence accuracy of 69.6% on the test set, compared to just 63.8% achieved by the traditional GMM-HMM baseline. DBNs are a kind of unsupervised learning model, and they were popular as a way to pretrain a neural network. However, further research revealed that purely supervised learning of a DNN works comparably, provided a large amount of labeled data is available, the initial weights are set carefully, and the mini-batches are set properly [43]. Since then, DBNs have fallen somewhat out of favour in the speech community, as the overhead of pretraining can be replaced by careful tuning of the network.

3.1.2 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are another kind of neural network, with the ability to model temporal and sequential data. While the above-mentioned DBNs are a model for unsupervised learning, RNN training here is purely supervised. One major advantage of RNNs is that they do not require a fixed input feature length like the feed-forward networks discussed above. The experiments described in [14] demonstrate how RNNs can be made to learn multiple levels of representation, a salient feature of deep nets, and combine this with their ability to make flexible use of long-range context in ASR. The authors reported a test set WER (word error rate) of 17.7%, which beat all previous benchmarks. RNNs had earlier been used with HMMs, but this was the first instance where RNNs were used end to end, and it proved that stacking multiple recurrent layers on top of each other can give better results, just like their counterparts in deep feed-forward networks.

3.2 Examples of Semi-Supervised Learning Methods

This section introduces the different kinds of SSL techniques. The model we use in this thesis falls into the last category: semi-supervised learning with autoencoders.

3.2.1 Heuristic based SSL/Self-Training

Among all the existing and accepted techniques, the simplest algorithm for semi-supervised learning is based on a "self-training" scheme, where a model is trained with just the labeled part of the dataset plus newly labeled data obtained from its own highly confident predictions, until the confidence level of the predictions drops below a certain threshold. The training can take several iterations. To state it formally, the self-training approach starts with the labeled set $L = \{(x_i, y_i)\}_{i=1}^{l}$ and the unlabeled set $U = \{x_i\}_{i=l+1}^{n}$. An initial model $f$ is trained using only the labeled data with standard supervised learning. The resulting model is then used to make predictions on $U$, where the most confident predictions are removed from $U$ and added to $L$ together with their corresponding class predictions. In the next iteration, the model is refined with the newly augmented set $L$. A critical assumption made in this algorithm is that the predictions added to the initial labeled set are reliable enough themselves. One big advantage of this approach is that it can be used as a wrapper for any learning algorithm.

This is quite a general technique, which has been used in many different research areas such as object classification [40] and speech recognition. In [46, 15, 17], self-training was used in combination with neural networks, whereas in [24, 25, 47] it was used in combination with GMM-based acoustic models. Although showing promising results, these methods involve heuristics and can reinforce "bad" predictions. The confidence level and unit selection are very important in these models, and any mistuning of these parameters can lead to bad results in later iterations.
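A minimal sketch of the self-training wrapper described above, assuming a scikit-learn style classifier with fit and predict_proba; the choice of logistic regression and the 0.95 confidence threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
    """Iteratively move the most confident predictions from U into L."""
    model = LogisticRegression(max_iter=1000)
    X_l, y_l, X_u = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_iter):
        model.fit(X_l, y_l)                        # train on the current labeled set L
        if len(X_u) == 0:
            break
        proba = model.predict_proba(X_u)           # predictions on U
        confident = proba.max(axis=1) >= threshold
        if not confident.any():                    # confidence dropped below the threshold
            break
        pseudo = model.classes_[proba[confident].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[confident]])     # augment L with pseudo-labeled points
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~confident]                      # remove them from U
    return model
```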

3.2.2 Transductive SVMs

Transductive SVMs [18] work on the principle of avoiding decision boundaries in regions where the input is densely distributed. Putting a decision boundary in high-density regions of the input increases the chance of getting "all predictions wrong"; this idea is based on the cluster smoothness assumption discussed in the previous section. Transductive Support Vector Machines (TSVMs) are an extension of traditional SVMs to unlabeled data. They have the same objective of maximising the margin between classes, while ensuring that there are few unlabeled examples near the margin. Finding the exact solution is NP-hard. Some efficient approximate algorithms have been proposed, but they lack scalability to problems with very large datasets.


3.2.3 Entropy Based Semi Supervised Learning

Entropy-based learning jointly models the labeled and unlabeled data. The primary motivation for these methods is entropy minimization, as proposed in [9, 16]. In [16], the authors proposed to jointly model the labeled and unlabeled data in a conditional entropy minimisation framework first demonstrated in [13]. The authors added a conditional entropy regularisation term on the unlabeled data, in addition to the usual task of maximising the posteriors on the labeled data. This additional regularizer encourages the model to be as confident as possible in its label predictions on the unlabeled data. The optimisation of the framework was performed using the extended Baum-Welch algorithm. The method was evaluated on different speech recognition tasks such as phonetic classification and phonetic recognition, where some improvement was obtained on the TIMIT dataset compared to a supervised, discriminatively trained GMM model with the Maximum Mutual Information criterion [37]. These methods have mostly been applied with GMM models in the literature; their study with discriminative methods is a potential area of future work.

3.2.4 Graph based SSL

Graph based learning has lately been quite popular for acoustic modelling in ASR. One part of this thesis is to compare the results achieved by our model with the results from graph based learning for improving frame-based classification on the TIMIT dataset. One of the first works in this area was the application of the label propagation algorithm to a vowel classification task [2]. The work was evaluated on the Vocal Joystick dataset [19], an 8-vowel classification task that was used to develop voice-controlled assistive devices for patients with motor impairments. Graph based SSL methods define a graph composed of nodes that represent both labeled and unlabeled training examples, as explained in [50]. The nodes are connected to each other by weighted edges. The weight of an edge is given according to the similarity of the examples it connects. The most popular algorithm in GBL based semi-supervised learning is Label Propagation (LP) [49].

It iteratively propagates the information from the labeled data over a graph $G$. The end goal of all GBL-based algorithms is to infer the labels via an undirected weighted graph $G = (V, E, W)$, where $V$ are the vertices of the graph, corresponding to the data points in $D_L$ and $D_U$, and $E$ are the undirected edges on the graph, weighted by $w_{ij} \in W$. The label propagation algorithm minimises the following function:

$$\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,\|\hat{y}_i - \hat{y}_j\|^2 \qquad (3.1)$$

subject to $\hat{y}_i = y_i$ for the labeled points, where $\hat{y}$ is a predicted label and $y$ is a true label. Two more graph-based SSL techniques, MAD (Modified Adsorption), proposed in [45], and MP (Measure Propagation), proposed in [44], have recently been introduced. MP minimizes the following objective function:

$$\sum_{i=1}^{l} D_{KL}(r_i \,\|\, p_i) + \mu \sum_{i=1}^{n} \sum_{j \in \mathcal{N}(i)} w_{ij}\, D_{KL}(p_i \,\|\, p_j) - \nu \sum_{i=1}^{n} H(p_i) \qquad (3.2)$$

where $p_i$ is the predicted probability distribution over the classes at node $i$, $r_i$ is the true (reference) distribution, $\mathcal{N}(i)$ is the graph neighborhood of node $i$, $D_{KL}$ is the Kullback-Leibler divergence, and $H$ is the entropy. The first term in the expression ensures that the predicted probability distribution matches the true distribution over the labeled vertices as closely as possible; the second term stands for the smoothness of the label assignment enforced by the graph $G$ defined above, which essentially means that the class probability distributions on neighbouring vertices (i.e. those with a higher edge weight) should have a smaller KL divergence; the third term encourages higher entropy in the final output. A new variant of MP, called prior-regularized measure propagation (pMP), is given in [30]. The pMP algorithm minimises the following objective function:

$$F(D, G, p) = \sum_{i=1}^{l} D_{KL}(r_i \,\|\, p_i) + \mu \sum_{i=1}^{n} \sum_{j \in \mathcal{N}(i)} w_{ij}\, D_{KL}(p_i \,\|\, p_j) + \nu \sum_{i=l+1}^{l+u} D_{KL}(p_i \,\|\, \tilde{p}_i). \qquad (3.3)$$

The additional term in Equation 3.3 measures how close the predicted probability distribution is to the prior distribution over the classes.

Graph-based SSL has recently been applied in the context of ASR by [30, 31]. The authors keep a DNN as the final discriminative classifier and achieve state-of-the-art results. We will compare our results with theirs in Section 5.4. There are several problems with graph-based SSL methods. Firstly, their complexity is $O(N^3)$, because they involve inversion of an $N \times N$ matrix. They also do not give any confidence estimates. Moreover, adding new data points can be quite cumbersome, as it requires modelling the entire graph structure again.

3.2.5 Semi-Supervised Learning with generative models

These approaches treat the SSL problem as a missing-data imputation task for a supervised classification problem. They are probabilistic in nature and also give confidence or uncertainty measures for their predictions. Kingma et al. [20] presented a latent-feature discriminative model that provides an embedding or feature representation of the data, similar to an autoencoder. The deep generative model of the data provides a more robust set of latent features than autoencoders. This latent feature representation allows for clustering of related observations in latent feature space and gives quite accurate classification. The generative model used is given below:

$$p(z) = \mathcal{N}(z \,|\, 0, I), \qquad p_\theta(x|z) = f(x; z, \theta)$$

where the likelihood function $f(x; z, \theta)$ could be a Gaussian. Another model presented in the same paper describes the data as generated by a latent class variable $y$ in addition to the embedding $z$ learnt above. The generative process is given as:

$$p(y) = \mathrm{Cat}(y \,|\, \pi), \qquad p(z) = \mathcal{N}(z \,|\, 0, I), \qquad p_\theta(x|y, z) = f(x; y, z, \theta)$$

where $f(x; y, z, \theta)$ is a Gaussian likelihood function. $z_i$ is an additional independent latent variable for each $x_i$; all $z_i$ can be written as the distribution of a single latent variable $z$.


3.3 Autoencoder Based Semi-Supervised Learning

This section introduces the model we are going to present and motivates it. Deep neural networks are generally trained either from fully labeled data or from purely unlabeled data. Neither setting, used alone, is ideal. Ranzato et al. [39] explored the possibility of having a joint objective made of both a supervised and an unsupervised objective on documents, where bag-of-words representations were used as input features. The authors performed their experiments on the Reuters and Newsgroups datasets. We propose to use a similar approach for frame-based phoneme recognition in ASR. Although our objective function is the same as the one proposed in [39], our setup is different in a number of ways. Firstly, instead of the compact and lower-dimensional encoding used in [39], we employ sparse encoding. Secondly, instead of stacking a number of encoders, decoders and classifiers in a deep architecture as in [39], we use a single-layer model. This is motivated by the work in [6], where the authors analyse the effect of several model parameters in unsupervised learning of neural networks on computer vision benchmark data sets such as CIFAR-10 and NORB. They conclude that state-of-the-art results can be achieved with single-layer networks regardless of the learning method, if an optimal model setup is chosen. An introduction to sparse autoencoders can be found in the Stanford lecture notes by Andrew Ng [35].


4 Method

We present our algorithm here and motivate it from our understanding of autoencoders and the principles of supervised classification. Neural networks require proper care and attention during training, so we will describe how to initialise the network, the practical details to be kept in mind, and the hyperparameter values we used. We will also explain why we made certain choices in the network configuration, and describe gradient checking, which is an important technique for debugging backpropagation in neural networks.

4.1 The Model

Figure 4.1 shows a block diagram of the neural network model used in this study. The topmost path is equivalent to an autoencoder, consisting of an encoder $W_E$ and a decoder $W_D$. Although autoencoders usually share weights between the encoder and decoder ($W_D = W_E^T$), in our case we optimize those weights independently. The reason for this will be clear in the following.

The bottom path in the figure is a neural network classifier that uses the representation learned by the encoder as input features. The figure also shows the reconstruction error $E_R$ and the classification error $E_C$ that can be computed for the two paths. Evaluating $E_C$ requires labels for each input observation, whereas $E_R$ is computed without the need for labels. This allows us to update the model parameters $W_E$, $W_D$ and $W_C$ simultaneously on labelled and unlabelled material in the same batch of observations.

The advantage over a feed-forward network is that $W_E$ can be estimated on much larger unlabelled data sets. The advantage over unsupervised autoencoders is that $W_E$ will be continuously optimised during training for the particular classification task we are considering.

The training algorithm is given in Algorithm 2.

Algorithm 2 Algorithm to train the network
1: Transform the training samples x into codes z using the encoder part of the layer.
2: Calculate the reconstruction loss $E_R$ using the encoded input z.
3: Compute the classification error $E_C$ using again z and the known labels y.
4: The losses are then combined, and the final objective function is given as $E = E_R + \alpha E_C$.
5: The layer is trained by minimising the combined loss using SGD.
6: The encoded input z is used as input to train the next layer.
7: The procedure can be repeated for other layers.

Figure 4.1: Flow chart for the cost calculation in a single layer of the network. Three components are considered: encoder, decoder, and classifier. The loss is a weighted sum of the cross-entropy $E_C$ and the reconstruction loss $E_R$. If several layers are stacked together, only the encoder/decoder pairs are retained after training. The value $t$ represents the value of the targets from the training data and $y$ represents the output of the classifier network.

The computational cost is linear in the number of training samples, and thus the method is more efficient than graph based semi-supervised learning algorithms, which have cubic $O(N^3)$ complexity. For each layer, the cost is given by a forward and a backward pass through encoder, decoder and classifier.

In the supervised setting, the cost function $E_C$ is the cross-entropy log-loss defined as:

$$E_C = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{K}\left[t_{ij}\log y_{ij} + (1 - t_{ij})\log(1 - y_{ij})\right], \qquad (4.1)$$

where $y_{ij}$ is the activation of output node $j$ of the classification network for observation $i$, and $t_{ij}$ is the corresponding target label; the summation over $j$ corresponds to the nodes in the output layer, and the sum over $i$ is an average over a number of observations. In the unsupervised setting, instead, the objective is to find a representation that retains enough information to reconstruct the input data. The cost function $E_R$ in this case is the second-degree norm (also called the L2 norm) of the difference between the original input and the reconstructed input:

E_R = \| x - \hat{x} \|^2   (4.2)

where x denotes the original input and x̂ the reconstructed input. Similarly to the method described in [3], we add noise to the original input by a process called 'corruption', in which some dimensions of the input vector are randomly picked and set to zero. This helps the network learn a more robust representation of the input data.
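
As an illustration, a minimal sketch of this masking corruption, assuming NumPy and a hypothetical corruption level of 20% (the actual level is a tuning choice not fixed here):

    import numpy as np

    def corrupt(x, corruption_level=0.2, rng=np.random.default_rng(0)):
        """Randomly zero out a fraction of the input dimensions (masking noise)."""
        mask = rng.random(x.shape) >= corruption_level   # keep ~80% of the entries
        return x * mask

    # The corrupted input is fed to the encoder, while the reconstruction
    # error E_R is still measured against the clean input x.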

The final cost function that we want to optimise is a linear combination of E_R and E_C:

E = E_R + αE_C .   (4.3)

Optimizing the cost E with respect to the model parameters W_E, W_D and W_C is a non-convex optimisation problem, which we solve with a gradient-based algorithm. When the input data point is not accompanied by a label, the classifier part of the layer is not updated, and the loss function simply reduces to E_R.
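
A minimal sketch of how a mixed labelled/unlabelled mini-batch can be scored under this scheme, reusing the quantities of the earlier sketch; the boolean mask `has_label` (and the convention that unlabelled rows of t are simply ignored) is an assumption of this illustration:

    import numpy as np

    def combined_loss(x, x_hat, y, t, has_label, alpha):
        """Every sample contributes E_R; only labelled samples contribute E_C."""
        E_R = np.mean(np.sum((x - x_hat) ** 2, axis=1))
        if has_label.any():
            ce = -np.sum(t[has_label] * np.log(y[has_label] + 1e-12), axis=1)
            E_C = np.mean(ce)
        else:
            E_C = 0.0                       # purely unlabelled batch: E reduces to E_R
        return E_R + alpha * E_C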

The data is split into three sets. The optimization is performed on the training set, while a validation set is used to optimize the meta-parameters for each run, for example the value of α in the linear combination of Eq. 4.3. Neural networks contain many hyperparameters, and tuning them can be a challenge.

The final results are given on an independent test set.

As we will describe later in Section 5.1, we run experiments on two datasets: MNIST and TIMIT. In the case of MNIST, the input x is the raw image, represented by the pixels concatenated row-wise into a single vector, and the output y is the digit (0-9). In the case of TIMIT, the input x is the MFCC + ∆ + ∆∆ feature vector concatenated with the features of the 5 previous and 5 next frames, as in Figure 2.2; the procedure to obtain these features has already been explained in Section 2.2. The output classes y are the possible speech phonemes. Note that it has recently become more common to use senones, or the HMM states of the phonemes, as target labels, as in [38]; however, to compare with other semi-supervised learning techniques for speech [30, 2], we experimented with phonemes.
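
For illustration, a minimal sketch of building the TIMIT input vectors by stacking a context window of 5 previous and 5 next frames around each frame. The 39-dimensional MFCC+∆+∆∆ frames and the edge-replication padding at utterance boundaries are assumptions of this sketch, not a description of the exact pipeline used:

    import numpy as np

    def stack_context(feats, context=5):
        """feats: (T, 39) per-frame features -> (T, 39 * (2 * context + 1))."""
        T, d = feats.shape
        padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
        windows = [padded[i:i + T] for i in range(2 * context + 1)]
        return np.concatenate(windows, axis=1)

    x = stack_context(np.random.randn(300, 39))   # 300 frames -> shape (300, 429)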

The training of this model can be performed greedily layer by layer if we wish to use deep networks. However, in our experiments, we use only a single layer for the feature representation.

We use mini-batch SGD, as explained in the theory section, as the optimizer for this cost function. The weight matrices W_C and W_D are updated by the normal backpropagation algorithm, as shown in Equation 2.14 in the theory section. However, the update of the encoder weight matrix is not as straightforward as for the other two, and is given by:

\frac{\partial E}{\partial W_E} = \frac{\partial E_R}{\partial W_E} + \alpha \frac{\partial E_C}{\partial W_E}   (4.4)

W_E = W_E - \eta \frac{\partial E}{\partial W_E} .   (4.5)

It is important to note that the update of the encoder weights W_E depends on both W_D and W_C, and the delta propagated in the backpropagation algorithm is a linear combination of the deltas calculated in the two parts. Because of its sparse properties, we call this model Semi-Supervised Sparse Autoencoder (SSSAE). The weight update given in Equation 4.5 was implemented using the Theano library [23].
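
To make the combined update concrete, here is a minimal NumPy sketch of one SGD step for a single layer. It is not the Theano implementation used in the thesis (Theano derives these gradients automatically); it assumes tanh activations, a standard softmax cross-entropy instead of the per-node form of Eq. 4.1, a fully labelled mini-batch, and it omits biases and momentum. The point to notice is the encoder block, where the delta arriving at the code z is the linear combination of the decoder and classifier deltas, as in Eq. 4.4.

    import numpy as np

    def sgd_step(x, t, W_E, W_D, W_C, alpha=1.0, eta=0.015):
        """One combined update of W_E, W_D and W_C on a labelled mini-batch."""
        n = x.shape[0]

        # Forward pass.
        z = np.tanh(x @ W_E)                        # codes
        x_hat = np.tanh(z @ W_D)                    # reconstruction
        logits = z @ W_C
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        y = e / e.sum(axis=1, keepdims=True)        # softmax posteriors

        # Decoder path: gradient of E_R = mean_i ||x_i - x_hat_i||^2.
        d_xhat_pre = (2.0 * (x_hat - x) / n) * (1.0 - x_hat ** 2)
        grad_W_D = z.T @ d_xhat_pre
        delta_z_R = d_xhat_pre @ W_D.T              # delta reaching z from E_R

        # Classifier path: gradient of the softmax cross-entropy E_C.
        d_logits = (y - t) / n
        grad_W_C = z.T @ d_logits
        delta_z_C = d_logits @ W_C.T                # delta reaching z from E_C

        # Encoder: the two deltas are combined linearly, as in Eq. 4.4.
        delta_z = (delta_z_R + alpha * delta_z_C) * (1.0 - z ** 2)
        grad_W_E = x.T @ delta_z

        # SGD updates (Eq. 4.5); alpha scales the classifier term of E.
        W_E -= eta * grad_W_E
        W_D -= eta * grad_W_D
        W_C -= eta * alpha * grad_W_C
        return W_E, W_D, W_C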

4.2 Evaluation

The evaluation method was simply classification accuracy, given as the proportion of correctly classified examples. In machine learning experiments, the standard convention is to partition the data into three sets: a training set, a validation set, and a test set. The largest of them is the training set, which is used to train the model, that is, to fit its parameters. Given a model class and a choice of hyperparameters, the parameters are selected which give the minimum error on the training set. Given a type of model, one tunes the hyperparameters on the validation set. The test set is kept untouched during the entire training process; finding the error on the test set is the last step. We performed experiments on two different datasets. In the MNIST problem, each example corresponds uniquely to a class (written digit) and the classification accuracy is straightforward to define. In ASR, however, this is not the case. Because the speech signal is a continuous stream of linguistic units, many metrics can be defined at different levels of detail. We can compute errors at the phonetic level, at the word level, or even at the level of full sentences. We can also consider the fine alignment of the recognized linguistic units in time, or just consider the sequence of linguistic units, disregarding errors in alignment.

Which metric we use is determined by the application and by the output of our speech recognizer. The most commonly used metric is the Word Error Rate (WER). This metric is defined on sequences of words and disregards the alignment in time. The sequence of recognized words is aligned with the sequence of labels by means of dynamic programming, and the mismatch is computed in terms of the number of insertions, deletions and substitutions. In this thesis, however, we focus on phone recognition, and we therefore use a corresponding metric. The most common way to calculate accuracy at the phonetic level is to consider each speech frame as an independent classification and count the proportion of correctly classified frames. This is possible because the data set that we use (TIMIT) contains carefully annotated phonetic transcriptions. This particular evaluation method is usually considered a first step whenever a new method for ASR is introduced, because it allows possible problems to be found early, before a full large-vocabulary speech recognizer is constructed.
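
For illustration, minimal sketches of the two metrics just described: frame-level phone classification accuracy and a dynamic-programming word error rate. These are generic textbook implementations, not the scoring scripts used in the experiments:

    import numpy as np

    def frame_accuracy(predicted, reference):
        """Proportion of frames whose predicted phone matches the annotation."""
        predicted, reference = np.asarray(predicted), np.asarray(reference)
        return np.mean(predicted == reference)

    def word_error_rate(hyp, ref):
        """WER = (substitutions + deletions + insertions) / len(ref), via edit distance."""
        d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
        d[:, 0] = np.arange(len(ref) + 1)          # deleting every reference word
        d[0, :] = np.arange(len(hyp) + 1)          # inserting every hypothesis word
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
        return d[len(ref), len(hyp)] / len(ref)

    print(word_error_rate("the cat sat".split(), "the cat sat down".split()))  # 0.25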

4.3 Monitoring and Debugging

4.3.1 Design Choices/Tuning Hyperparameters

Neural networks contain many hyperparameters. A model's parameters are directly fitted by a training algorithm, whereas the hyperparameters are set by hand or through trial and error. For a neural network, the values of the weight matrices are the parameters of the model; in the model given above, W_E, W_C and W_D serve as the parameters. However, in addition to these parameters, we generally have another set of parameters that cannot be learned directly from the regular training process: they act one level above the normal training process. For example, the value of K is a hyperparameter in the K-means clustering model, and the learning rate η is a hyperparameter in the gradient descent learning procedure. Setting the hyperparameters properly plays an important role in getting the best performance from the network and in making the network converge faster. A proper configuration of the network can also reduce the amount of computation required, by reducing the number of epochs needed to reach convergence. More recently, it has been shown that the initialization procedure based on unsupervised pretraining with Deep Belief Networks can be avoided if we use better activation functions and a sufficiently large training set [43, 12].

In our model, as given by Equation 4.3, the value of the variable α cannot be determined by one iteration of the training procedure described in Algorithm 2. The usual way of determining its value is to iterate the training procedure over a set of possible values and pick the one which performs best on a validation set. Other hyperparameters in our model are: the number of nodes in the hidden layer, the learning rate η, the momentum term described in Section 2.4, and the batch size B. The procedure for optimising the hyperparameters is shown step-wise below.

Due to the limited processing power of our system, we cannot perform an exhaustive search for hyperparameter optimisation. The strategy we use is called coordinate descent: the idea is to change only one hyperparameter at a time, optimise that hyperparameter, and then use its value together with the best configuration of hyperparameters found so far. Another approach is to start the search by considering only a few hyperparameter values spread over a very large range; a more local search can then be performed in the neighbourhood of the optimal value found, in order to make finer adjustments with more iterations. For example, the learning rate could be optimised over the range (0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.5). This procedure is described below:

for p ∈ HP do
    for p ← p_i in {p_1, p_2, ..., p_n} do
        Train the model
        Estimate the model accuracy M_i on the validation set
    end for
    j ← arg max_i M_i
    p ← p_j
end for
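
A minimal Python sketch of this coordinate-descent search; the hyperparameter grids, the starting configuration and the `train_and_validate` stand-in are hypothetical placeholders rather than the actual experimental setup:

    import random

    # Hypothetical search grids; the real ranges were tuned per experiment.
    grids = {
        "alpha":         [0.1, 0.5, 1.0, 2.0],
        "learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.5],
        "hidden_nodes":  [256, 512, 1024],
    }
    current = {"alpha": 1.0, "learning_rate": 0.01, "hidden_nodes": 512}

    def train_and_validate(config):
        """Placeholder: train with `config` and return validation accuracy.
        A random stand-in is used here so that the sketch runs."""
        return random.random()

    for name, values in grids.items():              # one hyperparameter at a time
        scores = [train_and_validate(dict(current, **{name: v})) for v in values]
        current[name] = values[scores.index(max(scores))]   # keep the best value found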

Table 4.1 shows the values of the hyperparameters we used for the experiments. We followed the procedures given in [4, 5] for tuning them to get the best validation set accuracy.

4.3.2 Learning Rate

The learning rate determines how fast the weights are updated in one iteration of the training procedure, and it is probably the most important hyperparameter. If the rate is too low, it might take too many epochs to find the minimum; if it is too high, we might jump over the minimum. An effective approach to managing the learning rate is to decay it after every few iterations. The most common ways to adapt the learning rate are:

1. Exponential decay:

   \eta = \eta_0 \exp(-kt)   (4.6)

2. Inverse decay:

   \eta = \frac{\eta_0}{1 + kt}   (4.7)

where η_0 is the initial learning rate at epoch 0, t is the current epoch, and k is another hyperparameter to be tuned.

3. Step decay after delay: reduce the learning rate by some factor after every few epochs. We used a simple heuristic: observing the validation error after each epoch, we reduced the learning rate by a factor whenever the validation error stopped improving or, in some cases, became larger.
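
A minimal sketch of the three schedules; the decay constant k and the reduction factor are illustrative assumptions, while the initial rate of 0.015 is the one reported below:

    import math

    eta0 = 0.015                                   # initial learning rate

    def exponential_decay(epoch, k=0.1):
        return eta0 * math.exp(-k * epoch)         # Eq. 4.6

    def inverse_decay(epoch, k=0.1):
        return eta0 / (1.0 + k * epoch)            # Eq. 4.7

    def step_decay(eta, val_err, prev_val_err, factor=0.5):
        """Reduce the rate whenever the validation error stops improving."""
        return eta * factor if val_err >= prev_val_err else eta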

We tried experimenting with all of the above-mentioned schemes, but found the step decay scheme easier to work with for our problem, as it was easier to tune and monitor and did not involve tuning another hyperparameter k, as in the other two schemes. We took the initial learning rate η_0 equal to 0.015. The learning rate was kept constant
