
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Speech recognition for telephone conversations in Icelandic using Kaldi

ÞORSTEINN DAÐI GUNNARSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Speech recognition for telephone conversations in Icelandic using Kaldi

ÞORSTEINN DAÐI GUNNARSSON

Master Programme in Machine Learning
Date: April 10, 2019
Supervisor: Giampiero Salvi
Examiner: Olov Engwall

Swedish title: Igenkänning av tal på isländska över telefon med verktyget Kaldi

School of Electrical Engineering and Computer Science


Abstract

In this thesis we train and evaluate an Automatic Speech Recognition system for phone communication in Icelandic. We use Kaldi, an open source toolkit, to build both GMM-HMM and Neural Network based models for general speech recognition in Icelandic. A simple telephone based dialogue system is built to test the speech recognition model in a real world scenario by calling users and holding a simple back and forth dialogue between the user and the system. The resulting Speech Recognition models offer improved results compared to baseline systems in terms of Word Error Rate and are found to be successful for use in telephone communication.


Sammanfattning

In this thesis we train and evaluate an automatic speech recognition system for telephone communication in Icelandic. We use Kaldi, an open source toolkit, to train both GMM-HMM and neural network based models for general speech recognition in Icelandic. A telephone based system is built to test the models in a realistic scenario, based on a simple dialogue between the user and the system. The resulting speech recognition models prove successful for use in telephone communication.


Contents

1 Introduction
  1.1 Aim of thesis
  1.2 Social relevance

2 Background
  2.1 A brief history of ASR
  2.2 Speech Recognition for Icelandic

3 Theory
  3.1 Automatic speech recognition
    3.1.1 A word on linguistics
    3.1.2 Signal Processing
    3.1.3 Acoustic models
    3.1.4 Language models
    3.1.5 Lexicon
    3.1.6 Decoding
    3.1.7 Evaluation

4 Model Building
  4.1 The Kaldi Toolkit
    4.1.1 The Language Model
    4.1.2 The Lexicon
    4.1.3 The Context Dependency Model
    4.1.4 The Acoustic Model
  4.2 Data and data preparation
    4.2.1 Acoustic Model
    4.2.2 Lexicon
    4.2.3 Language Model
  4.3 Model Parameters
  4.4 Kaldi recipe

5 Call experiment
  5.1 Experiment dialogue
  5.2 Call system
    5.2.1 Phone service
    5.2.2 Handling script
    5.2.3 ASR service
    5.2.4 TTS service
    5.2.5 Dialogue manager
  5.3 Evaluation

6 Results and Evaluation
  6.1 ASR model
    6.1.1 Model comparison
    6.1.2 Language Models
    6.1.3 Comparison to baseline systems
  6.2 Call experiment

7 Conclusion
  7.1 Social impact and ethical implications
  7.2 Future work

Bibliography

A NN Configuration


Chapter 1 Introduction

With growing interest in language technology and more advanced technology, it becomes increasingly difficult for under-resourced languages to keep up with new developments. Advancements in speech recognition, language understanding and speech synthesis make it possible to create virtual assistants, along with various other useful products that users can communicate with just as if they were having a conversation with another person. For these systems to be efficient and offer a good user experience, it is important that the user can communicate with the system in a simple way, in a language the user feels comfortable speaking. Therefore, these products have to be tailored specifically to each language they support. Languages that do not have a large population of speakers and are under-resourced for development are often left behind and unsupported by new products.

Currently available virtual assistants are a good example. These assistants support only a few widely spoken languages. Though plans to support additional languages have been announced by most major parties, many languages will remain unsupported until further notice, to be implemented later at their leisure.

1.1 Aim of thesis

The objective of this thesis is to build an Automatic Speech Recognition (ASR) system for Icelandic. The ASR system should be capable of transcribing spoken Icelandic through a telephone. We will answer whether we are capable of creating an ASR system for telephone communication that is comparable to other available ASR systems in Icelandic, using the resources currently available for research in Icelandic Language Technology (LT). The system will be evaluated using standard evaluation methods for ASR systems (see section 3.1.7).

Figure 1.1: Demonstration of the flow of messages between the dialogue manager (DM) and the user using text-to-speech (TTS) and automatic speech recognition (ASR) software to translate back and forth between speech and text. The DM initiates the communication by saying "Hæ" (e. "Hi") to which the user replies "Halló" (e. "Hello").

Furthermore, the ASR system will be tested with a simple telephone based dialogue system. The dialogue system will use the ASR system and available text-to-speech (TTS) software to translate messages from speech to text and from text to speech respectively. As depicted in figure 1.1, users will communicate with the system via speech, whereas the dialogue manager (DM), the other participant in the conversation, will receive and output text. An experiment using the dialogue system will be performed, where real users are called and asked a couple of questions, to find out if the ASR system is capable of sufficiently translating the conversation from the user to the DM. The results from the experiment will be evaluated, by counting how often the users are asked to repeat their answers, to see how well the dialogue system can keep the conversation flowing.

1.2 Social relevance

Technology is moving more and more towards speech as the main interaction between humans and computers. This technology is becoming popular all around the world, even in countries where the systems do not support or understand the country's official language. Users need to adapt by speaking a language that is not their preferred language or mother tongue. The increase in use can have a very negative effect on the future of those unsupported languages. It is easy to imagine your mother tongue not being relevant or applicable when, for example, going on holidays or visiting foreign countries. But to imagine your mother tongue not being relevant in your own country and in your own home is a far distant reality for most, though maybe not as distant as we think. Under-resourced languages are under pressure from many different directions, only one of which is products that use a foreign language as the main interface. It is projected that only 5% of the world's languages will survive what is being called a digital language death [25].

Icelandic is one of the languages in the other 95%. Icelandic is the national language of Iceland and has fewer than 350 thousand native speakers. The status of the Icelandic language is heavily disputed, with predictions ranging from untimely extinction or vast changes to speech and vocabulary, to being safe for the foreseeable future [9]. No matter the current state of the language, it is important that it does not fall behind as new developments and new technology arrive.

Just before the millennium, research in Icelandic Language Technology was very scarce [33]. Few products were available and limited development or research was being conducted. This, however, changed in 2000 when a special LT program for Icelandic was launched by the Icelandic government. The initiative lasted for four years and started a wave of ongoing research in Icelandic LT, but still left the language far behind bigger and better resourced languages [35]. In 2005, the Icelandic Centre for Language Technology (ICLT) was established as a platform to promote Icelandic LT. The purpose of the ICLT is to promote research in Icelandic LT and help create partnerships between researchers and other organizations in matters related to Icelandic LT.

A new initiative is being launched in 2018 with the objective of making Icelandic relevant and available in all communications with technology, both written and spoken [28].

This thesis project serves to further the development of Icelandic language technology, more specifically automatic speech recognition for Icelandic.


Chapter 2 Background

Automatic Speech Recognition research has over six decades of history. The next section offers a brief recount of the history of the topic, inspired by the compilation of B.H. Juang and L.R. Rabiner [22]. Following that is an overview of ASR for Icelandic, the current state of ASR research in Iceland and the resources available for building ASR systems for Icelandic.

2.1 A brief history of ASR

In 1952 Bell Laboratories developed what is often considered the first speech recognizer, Audrey [5]. Audrey could recognize isolated digits for a single speaker with approximately 98 percent accuracy, after adjusting to the speaker. Audrey recognized digits by comparing the input, split into two frequency bands, to pre-evaluated values for each digit and nominating the best match. As research continued, speech recognizers focused mostly on digit recognition, a well defined, limited vocabulary problem. Almost 10 years after Audrey was introduced, IBM showcased the IBM Shoebox, named after its small size (especially compared to Audrey's height of almost two meters). The IBM Shoebox could recognize 16 spoken words, the digits 0 through 9 and a set of arithmetic functions, and could operate a simple calculator through voice alone. At that time, performance went down drastically as the vocabulary was extended and the number of speakers was increased.

Research continued with some backlash in 1969 after an open letter from J.R. Pierce, a renowned engineer and executive at Bell Labs. In the letter, Pierce expressed his doubts about whether general speech recognition would work and even whether there was any use for it [31].

In the 1980s speech recognition research shifted from the pattern recognition paradigm to a more rigorous statistical modeling framework as Hidden Markov Models (see section 3.1.3) became the new standard [22]. The vocabulary of speech recognition systems increased and the accuracy steadily improved.

In the late 1980s Artificial Neural Networks (ANN) (see section 3.1.3) were reintroduced as a way of building acoustic models [22]. ANNs had previously lacked the temporal awareness needed for acoustic modeling. A class of ANNs called Recurrent Neural Networks (RNN) changed that. RNNs could be used with the support of HMMs in hybrid systems [24], but more recently HMM-free RNNs with multiple layers have been used for acoustic modeling. Today's state of the art speech recognition systems have a very large vocabulary and have achieved human parity in conversational speech recognition using multilayered recurrent neural networks [43].

2.2 Speech Recognition for Icelandic

So far there is a limited number of speech recognition systems available that support Icelandic. The options available are Hjal [34], an isolated word recognition system created in 2002, Google's Speech recognition API and, most recently, two recipes for the Kaldi framework released by the University of Reykjavík (UR) in 2017 [37] and 2018 [29].

Hjal was an important stepping stone in Icelandic LT since not only was it the first system to recognize Icelandic words, but the project also included the development of the only human transcribed phonetic dictionary for Icelandic. The dictionary includes phonetic transcriptions of around 60 thousand Icelandic words in two different phonetic notations, SAMPA and IPA. The project produced the first speech recognition model for Icelandic, an isolated word recognizer with at least 97% accuracy [34]. Though the project was considered successful, the resulting model was never used as much as anticipated.

The next big step in ASR for Icelandic was the Málrómur project. The University of Reykjavík and the ICLT, in partnership with Google, oversaw the collection of an open Icelandic speech corpus from 2011 to 2012, a corpus intended for automatic speech recognition research. The project set out to record speech samples from various sources, including news stories, rare triphones (three successive phonemes, see section 3.1.1), names, places and more [14][39]. A short while later in 2012 Google launched the first continuous speech recognition system for Icelandic in Android smart phones and their search engine. Since then the recordings have been reviewed and made public. The corpus now includes over 119 000 validated recordings, or around 152 hours of speech, from 563 different speakers.


Chapter 3 Theory

3.1 Automatic speech recognition

A common way to implement ASR systems, and the one used in this thesis, is using a probabilistic model based on Bayes' theorem. As described by Daniel Jurafsky and James H. Martin in [23], the goal of a Bayesian ASR model can be described as finding the most likely sentence $\hat{W} \in L$ out of all sentences in the language $L$ given some acoustic input $O$. The acoustic input $O = o_1, o_2, o_3, \ldots, o_n$ can be treated as a sequence of observations and the sentence $\hat{W} = w_1, w_2, w_3, \ldots, w_n$ as a sequence of words. More formally it can thus be expressed as follows:

$$\hat{W} = \arg\max_{W \in L} P(W|O)$$

Using Bayes' rule we can expand it so that:

$$\hat{W} = \arg\max_{W \in L} \frac{P(O|W)\,P(W)}{P(O)}$$

Now, $P(O)$ is the same for all $W$ and since we are only looking for the highest value of the fraction it can be removed from the equation:

$$\hat{W} = \arg\max_{W \in L} P(O|W)\,P(W) \qquad (3.1)$$

The probabilities on the right hand side of the equation, $P(O|W)$, called the observation likelihood, and $P(W)$, known as the prior probability, can be estimated by an acoustic model (section 3.1.3) and a language model (section 3.1.4) respectively.


Figure 3.1: The architecture of an ASR model: feature extraction from the audio input, followed by decoding with the acoustic model, the lexicon and the language model, producing text.

As shown in figure 3.1, Bayesian models have multiple components that work in conjunction to translate from audio signals to word sequences. In the following sections each component will be discussed in more detail.

3.1.1 A word on linguistics

From the beginning, the design of speech recognition models has been influenced by many different fields of study. Linguistics, the scientific study of language, is among them. Linguists have developed a system to split words, and sentences, into a sequence of distinct sounds called phonemes.

A phoneme is the smallest unit of sound that can be used to distinguish two different words. For example, the Icelandic words "lag" (e. "song") and "lög" (e. "laws") each consist of three phonemes, and can be represented as /l a: G/ and /l œ: G/ respectively using the International Phonetic Alphabet [7]. The two words are distinguished by the substitution of the single phoneme /a:/ for another phoneme /œ:/.

Any given language has a finite number of phonemes; for example, Icelandic and English both have roughly 40 different phonemes. All words in the language can be expressed as a sequence from this set of phonemes.

This approach has proven quite useful for speech recognition modeling since it reduces the problem of finding a sequence from a set of thousands of different words to finding a sequence from a set of only a handful of different phonemes.

3.1.2 Signal Processing

Signal processing is the first step in any speech recognition system. This is where features are extracted from an audio signal, i.e. the spoken utterance, and represented in a format that our model can accept and understand. There are three main attributes desired in the extracted features. They should be perceptually meaningful, robust, and temporally correlated [30]. For phone classification this means we want the features for each phone to be as similar as possible while the features for different phones are as different as possible, no matter the speaker, such that the same phones are represented similarly even though the speakers might differ.

Feature extraction

For a probabilistic ASR model, we want to represent the speech input as a sequence of observations. Some of the most common techniques used today to represent the signal for speech recognition tasks are Mel-frequency cepstral coefficients (MFCC) [6] and Perceptual Linear Predictive (PLP) coefficients [18].

Before features are extracted, the audio signal is split up into frames. Each frame typically spans 25 ms and a new frame is taken every 10 ms, which means adjacent frames overlap. MFCC features are computed over a series of steps that are applied to each frame. First, a windowing function is applied by multiplying it with the frame. Secondly, a Fast Fourier Transform is computed. Thirdly, the energy on the mel-frequency spectrum is computed. This results in an N dimensional vector where N is the number of bins used on the mel-frequency scale. Then, the log of the energies is computed and a discrete cosine transformation is applied, keeping a specified number of coefficients, usually 13. The final result is a sequence of vectors, one for each frame, that represents the audio as a sequence of observations. More detail about each step can be found in [6].

Splicing and feature transformation

Speech is a temporal signal and relies heavily on context. To account for that, features for adjacent frames, both left and right, are often concatenated to the features of the current frame. This process is called splicing. The context window, i.e. how many frames before or after are concatenated, can differ from left to right.

Another way to add dynamic information to the MFCC features is to append the first and second derivatives of the features to the feature vector. This is commonly known as ∆ + ∆∆ features.
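As an illustration, the following numpy sketch splices a context window onto each frame and appends simple first and second order differences. It is a toy version of what the toolkit does internally (Kaldi's own splicing and delta computation differ in detail), and the window sizes are arbitrary.

```python
import numpy as np

def splice(feats, left=3, right=3):
    """Concatenate each frame with its left/right neighbours (edges are repeated)."""
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)])
    windows = [padded[i:i + len(feats)] for i in range(left + right + 1)]
    return np.concatenate(windows, axis=1)

def add_deltas(feats):
    """Append first and second order differences (a crude Delta + DeltaDelta)."""
    delta = np.gradient(feats, axis=0)
    delta_delta = np.gradient(delta, axis=0)
    return np.concatenate([feats, delta, delta_delta], axis=1)

frames = np.random.randn(100, 13)          # 100 frames of 13-dimensional MFCCs
print(splice(frames).shape)                # (100, 91): 7 frames of context
print(add_deltas(frames).shape)            # (100, 39)
```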

Adding neighboring frames to the features introduces a correlation between feature dimensions. This correlation can cause a problem in some probabilistic models where the model assumes dimensions are uncorrelated. To use splicing in these cases some kind of feature transformation is required. There are various techniques used for feature transformation, for example Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT). These transformations can be used together as detailed in [11].

To further improve the feature representation of the speech signal, the features can be appended with an i-vector [8]. Similarly to how ∆ + ∆∆ features help encode the dynamic information in speech, i-vectors encode speaker information, helping the model adapt to different speech characteristics.

3.1.3 Acoustic models

As mentioned earlier, the purpose of the Acoustic Model (AM) is to estimate the observation likelihood $P(O|W)$, that is, to find the probability of seeing the observation sequence $O$ given a sentence $W$. In more practical terms the purpose of the AM is to find the most likely sequence of phonemes in a spoken utterance.

Since speech is a temporal signal, the linguistic information is not only encoded in the signal at any given time but also in how it changes over time. On top of that, people talk with different emphasis and different tempo, and many languages have different dialects. All this can make some sounds longer or shorter depending on the speaker and the words. The model design is heavily influenced by these features.

There are two main approaches used for acoustic modeling, Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) based models and Artificial Neural Network (ANN) based models. Both of them will be discussed in more detail in this section.


GMM-HMM

The GMM-HMM architecture consists of two dependent processes, a Hidden Markov Model (HMM) and a Gaussian Mixture Model (GMM).

HMMs are a type of Markov model. Markov models are stochastic models used to model randomly changing systems. All Markov mod- els satisfy the Markov property, which states that for every state in the model the next state should only be dependant on the current state but not the previous state or any number of previous states. Figure 3.2 shows an example of a Markov Model, more precisely a Markov chain, with three states, represented by a directed graph. The edges of the graph represent the transition probability, that is, the probability of moving between states for each time step. If the model is in state a then at the next time step the model can stay in state a with the prob- ability P (a|a) or move to a new state s with probability P (s|a) where s ∈ b, c.

Figure 3.2: A directed graph representation of a Markov chain with three states ($a$, $b$ and $c$); each edge carries the transition probability $P(\cdot|\cdot)$ between the states it connects.

HMMs differ from Markov chains in that the states are hidden and only the output (i.e. the emission), which is directly dependent on the current state, can be observed. Figure 3.3 shows the general architecture of a HMM. As in figure 3.2, each circle represents a random variable but only the filled nodes are observable. The state at time step $t$, $x(t)$, is only dependent on the state at time step $t-1$, $x(t-1)$. The random variable $y(t)$ is the emission at time $t$. The probability of seeing observation $y(t)$ in state $x(t)$ is called the emission probability or observation likelihood. Together with starting probabilities, that is a probability of starting in each state, transition probabilities and a sequence of observations, it is possible to find the most likely sequence of hidden states using the Viterbi algorithm [41].
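The following is a minimal log-domain Viterbi sketch in Python, included purely to illustrate the dynamic programming step referenced above; the state space, transition and emission matrices are assumed to be given (in a real ASR decoder they come from the HMM topology and the acoustic model).

```python
import numpy as np

def viterbi(start_logp, trans_logp, emit_logp):
    """Most likely hidden state sequence given log-probabilities.

    start_logp: (S,) log P(state at t=0)
    trans_logp: (S, S) log P(next state j | current state i) as trans_logp[i, j]
    emit_logp:  (S, T) log P(observation at time t | state s)
    """
    S, T = emit_logp.shape
    score = np.full((S, T), -np.inf)        # best log-score of any path ending in (s, t)
    back = np.zeros((S, T), dtype=int)      # backpointers for path recovery
    score[:, 0] = start_logp + emit_logp[:, 0]
    for t in range(1, T):
        for s in range(S):
            cand = score[:, t - 1] + trans_logp[:, s]
            back[s, t] = int(np.argmax(cand))
            score[s, t] = cand[back[s, t]] + emit_logp[s, t]
    # trace back the best path from the best final state
    path = [int(np.argmax(score[:, -1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1]
```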

Figure 3.3: The general architecture of a Hidden Markov Model, with hidden states $x(t-1), x(t), x(t+1), \ldots$ and observable emissions $y(t-1), y(t), y(t+1), \ldots$

For speech recognition the observations are the spectral features we get from the audio input and the states, i.e. the output, are the phonemes that make up the language. Usually, each phoneme is represented by multiple states and modeled by a single HMM. That way it is possible to find the most likely sequence of phonemes from the spectral features. An important quality of the HMM is that it can stay arbitrarily long in each state and is thus able to recognize phonemes with variable lengths. Rather than training a HMM for every phoneme, a monophone model, it is possible to train one for every triphone. This increases the dimensionality of the problem significantly, by adding to the total number of states for the HMMs and thus increasing the number of emission probabilities. This can be countered by sharing the emission probabilities between similar states, in a process called state tying. Tree based state tying can be done using decision trees [44].

A common way to model the emission probabilities for a HMM is using Gaussian Mixture Models (GMM). As the name suggests, GMMs are a collection of Gaussian distributions that are used to model a probability distribution that cannot be described by a single distribution function.

Alternatively, Deep Neural Networks (DNN) or Recurrent Neural Networks (RNN) can be used for estimating the emission distribution for the HMM [19][24].


Recurrent Neural Networks

Artificial neural networks (ANN), or simply Neural Networks (NN), are inspired by the interconnected neurons in our brain. An ANN is a collection of simple processing units, or nodes, that receive and send signals between them. A Neural Network can have multiple layers and thousands of nodes in each layer. In its simplest form the output from every node in a layer is the input to each node in the next layer, and so forth. The nodes in an ANN act like Perceptrons [36]. That is, each node has a weight $w_i$ associated with each input unit $x_i$, and a bias $b$. The bias is often represented as a weight for a constant input.

The output of a Perceptron is calculated with the following function,

$$A\left(\sum_{i=0}^{n} w_i x_i + b\right)$$

where $A(x)$ is the activation function, or more simply,

$$A\left(\sum_{i=0}^{n+1} w_i x_i\right)$$

if $w_{n+1} = b$ and $x_{n+1} = 1$. If the activation function is the unit step function $f(x)$, a single Perceptron acts as a linear binary classifier.

$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \geq 0 \end{cases}$$
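As a small illustration of the formulas above, the sketch below implements a single perceptron with the unit step activation and a tiny feed-forward pass with two hidden layers; the layer sizes, random weights and the ReLU activation in the multi-layer pass are arbitrary choices that only show the structure.

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron: unit step activation applied to a weighted sum plus bias."""
    return 1.0 if np.dot(w, x) + b >= 0 else 0.0

def feed_forward(x, layers):
    """Forward pass through fully connected layers given as (weights, bias) pairs."""
    out = x
    for W, b in layers:
        out = np.maximum(W @ out + b, 0.0)   # ReLU used here just as an example activation
    return out

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((5, 4)), np.zeros(5)),   # input (4) -> hidden layer 1 (5)
          (rng.standard_normal((5, 5)), np.zeros(5)),   # hidden layer 1 -> hidden layer 2
          (rng.standard_normal((1, 5)), np.zeros(1))]   # hidden layer 2 -> single output
print(feed_forward(rng.standard_normal(4), layers))
```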

Figure 3.4 represents a Neural Network with two hidden layers. Each blue node is a Perceptron. The input and output layers serve as the input and output interface for the network.

Standard Neural Networks lack the temporal awareness of models such as HMMs, making them inadequate for acoustic modeling by themselves, even though some context can be given to the network by, for example, splicing. Another way of adding temporal context to a NN is by using a Time Delayed Neural Network (TDNN) [42]. TDNNs work similarly to feed-forward neural networks, but use a contextual window of outputs from the previous layer as input to the next one. This can be done for multiple layers in succession with varying context widths.


Figure 3.4: A Neural Network with two hidden layers. The input is a four dimensional vector, each hidden layer has 5 nodes and it returns a single output value.

This lack of temporal awareness is mended by a class of NNs called Recurrent Neural Networks, and further addressed by the invention of Long Short-Term Memory networks (LSTM). RNNs use the output from previous predictions [21] or hidden layers [10] as additional input for the network, creating a dependency link between predictions and adding an element of memory to the network. LSTMs use additional NNs to pick and choose what to keep in memory, allowing the network to remember important context further back in time [20].

For speech recognition the inputs to the network are the acoustic features and the output is the phonetic state. For a full explanation of how these networks are trained and used for speech recognition see [38] and [4].

3.1.4 Language models

The Language Model (LM) formulates the prior probability in equation 3.1, that is, $P(W)$ or the probability of the sentence $W = w_1, w_2, w_3, \ldots, w_N$. The prior probability can be expanded to:

$$P(W) = \prod_{n=1}^{N} P(w_n \mid w_{n-1}, \ldots, w_1)$$

This probability is often estimated using n-grams. A word n-gram stores the probability of seeing a word given the $n-1$ previous words. For example a 3-gram, or tri-gram, can tell us the most likely word to follow the two words "This is", and whether we are more likely to see them followed by "Marta" or "Sparta". The assumption here is that:

$$P(w_n \mid w_{n-1}, \ldots, w_1) \approx P(w_n \mid w_{n-1}, \ldots, w_{n-N+1})$$

where $N$ is the order of the n-gram. For example, it is assumed the probability of seeing the word "ball" after the sentence "The crowd cheered as he kicked the" is similar to the probability of seeing it after the sentence "he kicked the".

An n-gram language model can be created from a large corpus of text by counting how often each gram appears. More formally it can be expressed as:

$$P(w_n \mid w_1, w_2, \ldots, w_{n-1}) = \frac{C(w_1, w_2, \ldots, w_n)}{C(w_1, w_2, \ldots, w_{n-1})}$$

where $C(W)$ is the count of the word sequence $W$.
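As a toy illustration of these counts, the sketch below estimates maximum-likelihood trigram probabilities from a list of sentences; real language models (such as the ARPA files used later in this thesis) additionally apply smoothing and back-off, which are omitted here.

```python
from collections import Counter

def train_trigram(sentences):
    """Maximum-likelihood trigram estimates P(w_n | w_{n-2}, w_{n-1}) from raw counts."""
    tri, bi = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1
            bi[tuple(words[i - 2:i])] += 1
    def prob(w1, w2, w3):
        return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return prob

p = train_trigram(["this is marta", "this is sparta", "this is sparta"])
print(p("this", "is", "sparta"))   # 2/3: "sparta" followed "this is" in two of three sentences
```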

Evaluating Language models

How well a language model represents a language can be measured using its perplexity score. Perplexity measures how well a language model predicts any given language and gives an idea of how surprised the language model is when seeing sentences from the language. The perplexity score of a language model for language $L$ is calculated using a set of example sentences $W = w_1, \ldots, w_N$ where $W \in L$. Perplexity is defined as

$$b^{-\frac{1}{N}\sum_{i=1}^{N} \log_b q(w_i)}$$

where $q$ is the language model probability function so that $q(w_i)$ is the probability of seeing word $w_i$, $N$ is the total number of words in the example sentences and $b$ is a constant, usually 2 or 10. The lower the perplexity, the better the language model is at predicting the language.

The perplexity also gives an idea of how predictable the example sentences are given a language model, in other words, how hard it is to predict the sentences in the example set. Again, a lower perplexity score suggests the example set is more predictable.
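A minimal sketch of the perplexity computation follows, assuming some per-word probability function q (for example the trigram estimator above, adapted to supply the context). The base b = 10 matches the value reported with the results in chapter 6; the uniform model is made up purely for illustration.

```python
import math

def perplexity(q, words, b=10):
    """Perplexity of a list of words under a probability function q(word), base b."""
    log_sum = sum(math.log(q(w), b) for w in words)   # assumes q(w) > 0 for all words
    return b ** (-log_sum / len(words))

# Illustrative only: a uniform model over a 10 000 word vocabulary
uniform_q = lambda w: 1.0 / 10_000
print(perplexity(uniform_q, "þetta er prófun".split()))   # 10000.0 for a uniform model
```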


3.1.5 Lexicon

Since the acoustic model uses phonemes and the language model outputs words, a lexicon is needed to bridge the gap. The lexicon is used to translate words or sentences into a sequence of phonemes. The lexicon needs to be able to map each word in the language model to a sequence of phonemes. Thus, for a general speech recognition system, a well constructed lexicon can include thousands of phonetically transcribed words. The lexicon is often constructed by hand, by manually transcribing each word or parsing previously transcribed words from available dictionaries. An alternative is to use a Grapheme-to-Phoneme (G2P) converter, which can automatically transcribe the words needed for the lexicon. G2P converters can be built using hand written rules or various sequence models [1], in which case they need a similar, although smaller, lexicon to train a model on.

3.1.6 Decoding

Decoding is the process of finding the most likely word sequence given some acoustic input. At the start of section 3.1 we showed the probability formula we need to do this. Equation 3.1 shows how we find the most likely sentence in some language $L$. If the number of sentences $W \in L$ is small enough it is possible to loop over all possible sentences to find the most likely one. For a general large vocabulary ASR system this is not the case.

To find the most likely sentence efficiently, a decoding graph is constructed using the acoustic model, the language model and a lexicon. The decoding graph can be constructed as a weighted finite state machine. A detailed description of how the graph is constructed can be found in [27].

Weighted finite state machines

Weighted finite state transducers (WFST) and weighted finite state acceptors (WFSA) are types of finite-state machines (FSM) [27]. FSMs are similar to the Markov chains that were briefly introduced in section 3.1.3. An FSM is defined by a set of states $S$, an initial state $s_0 \in S$, a set of final states $F \subseteq S$ and a state transition function $\delta$ such that $\delta : S \times \Sigma \to \mathcal{P}(S)$. The transition function defines the state or set of states that the machine changes to given the current state and an input from the input alphabet $\Sigma$.

A finite-state acceptor is a FSM that produces a binary output indicating whether a sequence of input symbols is accepted or not. The sequence is only accepted if the FSM is in one of the final states when all symbols have been parsed in order. A weighted finite-state acceptor defines a weight for each transition, and thus on top of accepting or rejecting a sequence it also outputs the probability of that sequence.

A finite state transducer is similar, but instead of accepting or rejecting sequences it is used to map a sequence of input symbols to a sequence of output symbols, where the input and output symbols can come from entirely different sets. Each transition now includes an output symbol that forms the output sequence as the input is parsed. A weighted finite-state transducer defines a weight for each transition, and thus on top of the output sequence it also outputs the probability of that sequence.

Searching the decoding graph

With a decoding graph, it is possible to find the best path through it based on the acoustic input and thus the most likely sentence $W$. Finding the best path through the graph is still a time consuming task, so it is approximated using a search algorithm.

One such algorithm is beam search [13]. Beam search approximates the best path through the decoding graph by using a beam to limit the width of the search tree. The search tree is built using breadth-first search, but instead of expanding every node at each step, the nodes are ordered based on some heuristic and only a predetermined number of the best nodes are expanded. If the beam, the number of expanded nodes, is infinite, beam search works just like breadth-first search.
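The sketch below shows a generic beam search of this kind; the expand, score and is_final callbacks are placeholders for whatever graph is being searched (in an ASR decoder they would come from the decoding graph and the acoustic scores), and the beam width is arbitrary.

```python
import heapq

def beam_search(start, expand, score, is_final, beam_width=8, max_steps=100):
    """Keep only the beam_width best partial paths at every step of a breadth-first search."""
    beam = [[start]]
    for _ in range(max_steps):
        candidates = []
        for path in beam:
            if is_final(path[-1]):
                candidates.append(path)              # finished paths stay in the beam
                continue
            for nxt in expand(path[-1]):             # expand the frontier of this path
                candidates.append(path + [nxt])
        if not candidates:
            break                                    # nothing left to expand
        beam = heapq.nlargest(beam_width, candidates, key=score)
        if all(is_final(p[-1]) for p in beam):
            break
    return max(beam, key=score)
```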

3.1.7 Evaluation

A standard practice to evaluate the performance of ASR models is using the Word Error Rate (WER). The WER is computed by finding the Levenshtein distance between the model output and the target sentence. Levenshtein distance, also known as edit distance, measures the distance between two sequences of tokens. For WER we use the words in the sentence as tokens. It is computed as:


$$\mathrm{WER} = \frac{S + D + I}{N} = \frac{S + D + I}{S + D + C}$$

where $S$ is the number of substitutions, $D$ is the number of deletions, $I$ is the number of insertions, $C$ is the number of correct words and $N$ is the number of words in the target sentence. A higher WER means fewer words are correct; consequently, a lower WER describes a better result.
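A small self-contained WER computation is shown below; it mirrors the definition above using a standard dynamic-programming edit distance over word tokens. The example sentences are made up for illustration.

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("hann sparkaði í boltann", "hann sparkaði boltann"))   # 0.25: one deleted word
```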

Besides WER, a Sentence Error Rate (SER) is often measured. SER measures how many sentences were wrong. It is computed as:

$$\mathrm{SER} = \frac{E}{N}$$

where $E$ is the number of sentences containing at least one error and $N$ is the total number of sentences. The SER tends to be considerably higher than the WER since even a minor error in a sentence renders the whole sentence wrong.


Chapter 4

Model Building

To build and train an ASR model we will use the Kaldi toolkit. Kaldi is a toolkit for speech recognition research. As well as including implementations of various methods used to build speech recognition models, Kaldi also includes recipes for multiple data sets used in speech recognition research and development. The recipes include the code and setup needed to train and evaluate speech recognition models from various open source or closed corpora. Most of the recipes are available as part of the Kaldi source code [32].

4.1 The Kaldi Toolkit

Kaldi is an open-source toolkit, intended for use by speech recognition researchers, written in C++ [32]. Development of Kaldi began in 2009 and the toolkit now includes implementations of most standard ways to build acoustic models, such as GMM-HMM and various types of neural networks including LSTMs and TDNNs.

A complete speech recognition model in Kaldi is put together from four different finite-state machines (FSM), or more precisely weighted finite state transducers (WFST) and weighted finite state acceptors (WFSA) [32]. Each component is encoded as a graph that represents one of the four major components of the system. The four graphs are:

• G the grammar or the language model;

• L the lexicon;

• C the context dependency; and


• H the HMM definitions or the acoustic model.

For decoding, the graphs are combined into a decoding graph HCLG = H ◦ C ◦ L ◦ G that can be searched simply and efficiently without any extra work [27].
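To give a feel for what this composition does, the sketch below composes two toy weighted transducers represented as plain Python dictionaries. It is deliberately simplified: it assumes epsilon-free transducers and adds weights as if they were negative log probabilities, whereas OpenFst/Kaldi additionally handle epsilon transitions, determinization and minimization when building HCLG.

```python
def compose(t1, t2):
    """Compose two weighted transducers given as dicts with keys 'start', 'finals' and 'arcs',
    where 'arcs' maps a state to a list of (in_sym, out_sym, weight, next_state) tuples.
    The result maps input symbols of t1 to output symbols of t2; weights are added."""
    start = (t1['start'], t2['start'])
    arcs, finals = {}, set()
    stack, seen = [start], {start}
    while stack:
        state = stack.pop()
        s1, s2 = state
        arcs[state] = []
        if s1 in t1['finals'] and s2 in t2['finals']:
            finals.add(state)
        for in1, out1, w1, next1 in t1['arcs'].get(s1, []):
            for in2, out2, w2, next2 in t2['arcs'].get(s2, []):
                if out1 == in2:                      # t1's output must match t2's input
                    nxt = (next1, next2)
                    arcs[state].append((in1, out2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return {'start': start, 'finals': finals, 'arcs': arcs}
```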

4.1.1 The Language Model

Kaldi does not directly include code for creating n-gram language models; instead it relies on other tools to compile and supply an n-gram language model in a format Kaldi understands. Kaldi expects the n-gram model in the ARPA format, which is the standard output of most language model toolkits. Kaldi turns the n-gram language model into the weighted finite state acceptor G. The acceptor accepts a sequence of words and outputs the probability of that word sequence.

4.1.2 The Lexicon

The lexicon L is a weighted finite state transducer that maps a sequence of phones into a word. It is built from a list of words and their phonetic transcriptions. The lexicon is combined with the language model into another transducer LG = L ◦ G that maps from a sequence of phones into a sequence of words.

4.1.3 The Context Dependency Model

The context dependency model C is a WFST. Its input symbols represent context dependent phones, which it maps to a sequence of phones. It is combined with LG to make CLG = C ◦ LG. CLG is a WFST that maps from a sequence of context dependent phone representations to a sequence of words.

4.1.4 The Acoustic Model

The last graph, H, contains the HMM definitions. Its inputs are transition ids and its outputs are the context dependent phones. The transition ids encode information about the HMM definitions. H is combined with CLG to make the full decoding graph HCLG = H ◦ CLG.


Source                    %
News stories              50
Rare triphones            10
Names of streets          10
Names of people           10
Countries and capitals     5
URLs                       5
Misc.                     10

Table 4.1: Text sources for the Málrómur corpus

Environment                        Count
Indoors, quiet                    81 589
Indoors, people speaking          34 501
Indoors, TV or music playing       1 279
Indoors, restaurant or cafe        1 464
Other, quiet                         246
Car, quiet                            11

Table 4.2: Environments in the Málrómur corpus

4.2 Data and data preparation

4.2.1 Acoustic Model

Our training data for the acoustic model comes from the Málrómur corpus [14][39]. The Málrómur corpus includes 119 090 speech samples from 563 speakers, recorded on smart phones in a single channel with a 16 kHz sample rate. As mentioned in section 2.2, the corpus contains read speech from various sources and is designed to be used for speech recognition modeling.

The recordings include text from news stories, names of people, names of places, URLs and more, as shown in table 4.1, and were recorded in various environments, shown in table 4.2. The speakers are of all ages, though, as figure 4.1 shows, the vast majority are between 20 and 60 years of age.

Figure 4.1: Age group distribution in the Málrómur corpus.

Set           Recordings
Training          96 710
Evaluation        11 858

Table 4.3: Number of samples in the training and evaluation sets.

For evaluation purposes we split the data into two sets, a training set and an evaluation set, making sure the two sets did not share any speakers between them. Telephone communication is sampled at a lower frequency of 8 kHz. To adapt the data to our needs, another copy of the sets was made where the audio was downsampled to 8 kHz. By this we assume downsampling from 16 kHz to 8 kHz gives the same or similar results as if the speech had originally been recorded at 8 kHz. The training sets were used to train the ASR models while the evaluation sets were used to evaluate the performance of the models.
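A sketch of this 16 kHz to 8 kHz downsampling step using scipy is shown below; the actual conversion in the project may have been done with a different tool (for example sox or Kaldi itself), so this only illustrates the assumption described above, and the file names are hypothetical.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

def downsample_to_8k(in_path, out_path):
    """Read a 16 kHz mono wav file and write an 8 kHz copy (2:1 polyphase resampling)."""
    rate, audio = wavfile.read(in_path)
    assert rate == 16000, "expected 16 kHz input"
    audio_8k = resample_poly(audio.astype(np.float64), up=1, down=2)
    wavfile.write(out_path, 8000, audio_8k.astype(np.int16))

# downsample_to_8k("utterance_16k.wav", "utterance_8k.wav")   # hypothetical file names
```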

4.2.2 Lexicon

The lexicon used comes from the Ice-kaldi project [29]. It was created by training a grapheme-to-phoneme (G2P) conversion model using the Icelandic pronunciation dictionary from the Hjal project [34] and Sequitur [1], a G2P converter. The G2P converter was used to phonetize common words from Málrómur [39] and the Leipzig Wortschatz project [16], which were then added to the dictionary. The dictionary includes in total about 136 000 phonetically transcribed words.

4.2.3 Language Model

To create a language model we used Mörkuð íslensk málheild (e. The Tagged Icelandic Corpus), or MÍM for short [17]. MÍM is a tagged corpus of various Icelandic texts grouped into 23 different categories. The total size of the corpus is close to 25 million words. The categories and the word count for each one can be seen in table 4.4.

For best results the text used to build a language model should resemble the domain the ASR model will be used in. Since our model is intended to be used for telephone communication, we reflect that by only using categories from MÍM that resemble everyday spoken language. The categories used are blog posts, spoken language, written-to-be-read, e-mail lists and radio and TV news scripts; see table 4.5.

To create a language model from the data set, the text needed to be cleaned, split into sentences and stripped of punctuation. Cleaning of the text was kept to a minimum and left to future development. However, as 5 percent of the speech data included popular URLs, every instance of a word matching a domain name was spelled out by replacing the period with the sequence " punktur ", such that for example ja.is would be represented as ja punktur is, which would translate to ja dot is in English. The corpus had already been tagged and split into sentences, so the only preprocessing done on the corpus was to extract the sentences. A few issues were not taken into consideration that could affect the final results negatively. Namely, numbers and dates were not written out and abbreviations were not expanded. Icelandic is a highly inflected language, so numbers and some abbreviations can take one of multiple different forms based on the context, making the matter of expanding this information more than trivial.
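A sketch of the domain-name rewrite described above, using a simple regular expression, is shown below; the pattern and the helper name are illustrative assumptions, and the real preprocessing may have matched domains against an explicit list of popular URLs instead.

```python
import re

# Matches tokens that look like a bare domain name, e.g. "ja.is" or "mbl.is".
DOMAIN = re.compile(r"\b([a-záðéíóúýþæö0-9-]+)\.([a-z]{2,3})\b")

def spell_out_domains(sentence):
    """Replace the period in domain-like tokens with ' punktur ' (Icelandic for 'dot')."""
    return DOMAIN.sub(r"\1 punktur \2", sentence)

print(spell_out_domains("skoðaðu ja.is fyrir símanúmer"))
# -> "skoðaðu ja punktur is fyrir símanúmer"
```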

A second language model was created where names of streets, people, websites and phone numbers were added to the corpus from the Icelandic phone book. The phone book was normalized using a rule based normalizer. Since the text from the phone book is highly structured, it was possible to write out most numbers from the text in the proper form using a few carefully assembled rules. To get phonetic transcriptions for the proper nouns and missing words, a new lexicon was created for this language model using a grapheme-to-phoneme converter trained using Sequitur [1] and the lexicon from the Hjal project [34].

Text category                                 Word count         %
Printed books                                  5 972 893     23.89
Morgunblaðið (newspaper)                       5 019 617     20.08
Printed magazines                              2 379 848      9.52
Blog posts                                     1 976 706      7.91
University science web (visindavefur.is)       1 838 909      7.36
Text from government websites                  1 695 304      6.78
Text from websites                             1 337 764      5.35
Adjudications                                    886 240      3.54
Speeches from Alþingi                            526 444      2.11
Fréttablaðið (newspaper)                         514 189      2.06
Spoken language                                  504 318      2.02
Essays from university students                  486 677      1.95
Written-to-be-read                               432 287      1.73
Bills of law and law from Alþingi                406 002      1.62
Radio and TV news scripts                        262 219      1.05
Webmedia                                         245 703      0.98
Essays from grammar school students              179 365      0.72
Webmagazines                                     121 374      0.49
E-mail lists                                     120 312      0.48
Teletext                                          45 887      0.18
Program notes from the Icelandic Symphony         24 832      0.10
Film critiques from Morgunblaðið                  15 682      0.06
Parish newsletters                                 7 950      0.03
Total                                         25 000 522    100.00

Table 4.4: The 23 categories comprising MÍM

4.3 Model Parameters

There are two published Kaldi recipes for corpora with Icelandic speech, one of which uses the Almannarómur corpus [29][37].

Acoustic models in Kaldi are trained iteratively by progressively training more and more complex models, most often starting with a monophone model. The previous model is used to align the acoustic features, and the new alignment is used as input for the next model.


Text category                                    Word count
Blog posts                                        1 976 706
Spoken language                                     504 318
Written-to-be-read                                  432 287
Radio and TV news scripts (Icelandic Radio)         262 219
E-mail lists                                        120 312
Total                                             3 295 842

Table 4.5: The categories of text used to create a Language Model

Model                   # of leaves    # of Gaussians
Monophone               -              1000
Triphone ∆ + ∆∆ A       3200           30000
Triphone ∆ + ∆∆ B       4000           70000
Triphone LDA+MLLT       6000           140000

Table 4.6: Parameters for HMM+GMM based models

The Kaldi recipe used was heavily based on the recipe for the Switchboard corpus provided with the Kaldi source code. The Switchboard corpus includes conversational telephone speech in English [12]. Since that recipe is designed for telephone quality audio, it fits our purpose well. The recipe iteratively trains GMM-HMM models, using MFCC features from the raw audio recordings, where the previous model is used to align the input features for the next one.

First a monophone model is trained on a small section of the data using mostly short utterances. Then, two iterations of triphone models with ∆ + ∆∆ features were trained, increasing with each step the number of Gaussian distributions and the number of leaves in the decision tree for the state tying process. The final GMM-HMM based model is trained using the LDA+MLLT transformation. Table 4.6 shows the progression of models and the parameters used.

Finally, a neural network was trained. Both previous recipes for Icelandic use LSTM layers, and LSTMs have previously given good results for relatively small models [38]. The trained neural network has a combination of TDNN and LSTM layers and is also based on the Switchboard recipe provided with Kaldi. Due to hardware limitations, all node counts are reduced by half compared to the original recipe. The NN configuration is available in appendix A.

For the language models we built tri-gram models.


4.4 Kaldi recipe

The Kaldi recipe for this project is available online [15]. The recipe takes care of formatting the data for the Kaldi toolkit as well as running the training and evaluation scripts for the models. As previously mentioned, the recipe is heavily based on the Switchboard recipe included with Kaldi.


Chapter 5

Call experiment

To evaluate the performance of the speech recognition models, the WER of each model was computed using the evaluation set, as well as a perplexity score for the evaluation set based on each language model. These results are displayed in chapter 6.

To further evaluate the ASR system, and show how it can be used as part of a task based spoken dialogue system, a preliminary experiment was conducted in which real users were called and interacted with via telephone. To do this, we set up a prototype dialogue system with a dialogue manager capable of handling a simple task where user interaction is required. By doing this we will be able to see how the ASR system performs as part of a spoken dialogue system and establish a procedure to compare systems in future development.

In this chapter we will describe the experiment setup.

5.1 Experiment dialogue

The system is evaluated using a short back and forth dialogue between the user and a dialogue manager (DM).

The DM asks one or two questions and primes the user for an answer. The DM is hard-coded to only accept a limited set of words or phrases. An answer is accepted if it is a total match to the primed answer, if the primed answer is part of the answer, or if the edit distance between the answers is under a certain threshold, as sketched in the example below.
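A minimal sketch of this acceptance rule follows; the threshold value and the character-level edit distance are assumptions, not the exact logic of the experiment's dialogue manager.

```python
def edit_distance(a, b):
    """Plain character-level Levenshtein distance."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[len(b)]

def accept(answer, primed, threshold=3):
    """Accept an exact match, a containment match, or a near match by edit distance."""
    answer, primed = answer.lower().strip(), primed.lower().strip()
    return (answer == primed
            or primed in answer
            or edit_distance(answer, primed) <= threshold)

print(accept("já endilega", "endilega"))      # True: primed answer contained in the reply
print(accept("nei tak", "nei takk"))          # True: within the edit distance threshold
```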

The dialogue conducted is shown below. It was intended to inform users about a conference and give them the possibility to register for the event. This task was chosen because it is well defined, fairly simple and gave the opportunity to call actual users and get authentic responses.

1. Introduction and reason for calling.

2. Question 1: Ask the user whether they are (a) interested in hearing more, (b) not interested, or (c) want to be called later. Depending on the answer, the following actions were carried out:

(a) Continue to 3.

(b) Continue to 5.

(c) Notify that the user will be called again, continue to 5.

3. Information about event, time and location.

4. Question 2: Ask if the user wants to (a) register for the conference, or (b) not register. Depending on the answer, the following actions were carried out:

(a) Notify the user that they are now registered, continue to 5.

(b) Continue to 5.

5. Send regards and hang up.

The questions and the accepted answers are listed with translations in table 5.1. Question one has three accepted answers while question two has two accepted options.

5.2 Call system

The system used to make the calls and handle the flow of information back and forth between the user and the individual components is shown in figure 5.1.

The information flow from the user through the system and back to the user goes as follows. Asterisk, the telephone service (see 5.2.1), sends the input stream from the user to a handling script. The handling script detects when the user starts and stops speaking. When the user stops speaking, the input is sent to the ASR service for transcription. The transcription is sent to the dialogue manager through the handling script. The dialogue manager is responsible for keeping track of the state of the conversation and choosing, based on the input, what to say next. Based on the transcribed input the dialogue manager sends back the appropriate response in string format to the handling script. After receiving the response, the handling script calls a text-to-speech service to turn the string into speech. Finally, the speech from the TTS service is delivered through the handling script, back to the telephone service as an output stream and to the user.

Figure 5.1: Information flow between the user, the call system and its components (Asterisk, the handling script, the ASR service, the TTS service and the dialogue manager).

Question 1
Viltu heyra meira? Valmöguleikarnir eru, endilega, nei takk og hringdu seinna.
(Would you like to hear more? The options are absolutely, no thank you or call later.)
Accepted answers:
- Endilega (Absolutely)
- Nei takk (No thank you)
- Hringdu seinna (Call later)

Question 2
Viltu skrá þig? Segðu heldur betur til að skrá þig, en nei takk ef þú hefur ekki áhuga.
(Would you like to register? You can say totally to register, or no thank you if you are not interested.)
Accepted answers:
- Heldur betur (Totally)
- Nei takk (No thank you)

Table 5.1: Questions and accepted answers.

If the user starts speaking again before the state of the dialogue manager is updated, all further processing is halted and started again from the beginning with the new utterance once the user stops speaking.

5.2.1 Phone service

To make the calls we used Asterisk [40], a software implementation of a telephone private branch exchange. Asterisk uses software phones connected to a phone provider through the internet using the Session Initiation Protocol (SIP), so-called SIP-phones. Asterisk can call any mobile or land line telephone and works like a normal phone, but can additionally relay the input audio stream to another piece of software, as well as send a received audio stream from a piece of software through the phone line to the user.

5.2.2 Handling script

The handling script ties all the different components together and acts as the communicator between them. The script has only two functions, to relay data between components and to cut the audio stream from the phone service into utterances. The script listens to the audio input stream from the phone service and detects when the user starts and stops speaking. Because the handling script is a middleman for all the other components it can halt any current processes when a new utterance is detected.

5.2.3 ASR service

For the ASR service we built a web service around the Kaldi model using PyKaldi [3], a Python wrapper for the Kaldi framework, and Bottle, a lightweight web framework in Python.

For the experiment we used the Triphone LDA+MLLT model.
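A sketch of such a service using Bottle is shown below. The decoding function is a placeholder (PyKaldi exposes Kaldi's decoders, but the exact model-loading and decoding calls are omitted here), and the endpoint name and port are made up for illustration.

```python
from bottle import Bottle, request, run

app = Bottle()

def transcribe(wav_bytes):
    """Placeholder: the real service decodes the audio with the PyKaldi-wrapped model."""
    return "transcript goes here"   # hypothetical output

@app.post("/recognize")             # hypothetical endpoint name
def recognize():
    audio = request.body.read()     # raw audio bytes sent by the handling script
    return {"transcript": transcribe(audio)}

if __name__ == "__main__":
    run(app, host="0.0.0.0", port=8080)
```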


5.2.4 TTS service

We use Amazon Polly, Amazon's TTS HTTP API, to translate text into speech. The service accepts strings of text through HTTP and sends back an audio file with the string spoken out in a specified format.

5.2.5 Dialogue manager

The dialogue manager (DM) keeps track of the dialogue. For this experiment the bot has a list of sentences it communicates to the user. After a question has been posed the DM waits for a response and asks the user to repeat the answer until it gets an answer it accepts. The dialogue is best illustrated using a state machine as seen in figure 5.2. The dialogue goes through the graph, communicating the dialogue of each state to the user. The edges show the accepted input needed to transition to the next state. A ∗ means any input is accepted and an ε means no input is needed. When the state machine transitions to a final state, drawn double-circled, the dialogue is over and the call is terminated.

5.3 Evaluation

Evaluating a conversational system is non-trivial. There are a lot of variables that could be observed, but many have disadvantages when looked at in isolation. For example, minimizing time sounds like a promising metric for efficiency, but focusing only on time could result in the system just hanging up right away if there is a problem or it detects that there could be a problem. To evaluate how well the system works we recorded how efficiently the system could transfer from one dialogue state to another. This is done by counting the number of attempts the user needs to answer a question correctly and move from one state to the other. We also measured user interest by keeping track of the state progression and the number of calls completed.


Figure 5.2: A directed graph representation of the experiment dialogue as seen in section 5.1, with states 1-5 (including the answer branches 2a, 2b, 2c, 4a and 4b) and edges labeled with the accepted answers ("endilega", "nei takk", "hringdu seinna", "heldur betur"), ∗ or ε.


Chapter 6

Results and Evaluation

6.1 ASR model

6.1.1 Model comparison

The WER for each iteration of the ASR models we trained with the basic LM, for both 8 kHz and 16 kHz audio, is listed in table 6.1. For each model we compute the word error rate and the sentence error rate on our evaluation set (see section 4.2.1) as described in section 3.1.7. The results displayed are from a decoding graph compiled using the basic LM, that is, the model based only on texts from MÍM. The perplexity score of the language model for the test data can be seen in table 6.2.

The model for telephone quality audio, 8 kHz, reached a WER of 20.72% and a SER of 42.92%. Comparably, the same architecture reached a WER of 19.91% when trained and evaluated on 16 kHz audio. The more advanced neural network architecture performs significantly better than the GMM-HMM based models.

                         8 kHz                 16 kHz
Model                    WER (%)   SER (%)     WER (%)   SER (%)
Triphone ∆ + ∆∆ A        43.69     69.60       42.35     68.57
Triphone ∆ + ∆∆ B        41.42     67.75       41.42     67.39
Triphone LDA+MLLT        39.01     64.66       37.83     64.45
TDNN-LSTM                20.72     42.92       19.91     41.77

Table 6.1: Word error rate (WER) and sentence error rate (SER) for the ASR models on the evaluation set.


6.1.2 Language Models

Table 6.2 shows a comparison of the two language models used. The language model with the phone book added has a roughly 10% lower perplexity score and half as many out of vocabulary (OOV) words, that is, words that are missing from the language model, for the evaluation set.

A model trained using the phone book language model improves the WER by 47% relative, from 20.72% to 10.90%, for 8 kHz audio and by 48%, from 19.91% to 10.32%, for 16 kHz audio, as seen in table 6.3.

Language model     Perplexity    OOV words
basic LM            7675          5721
phone book LM       6859          2861

Table 6.2: Comparison of the two language models. Perplexity calculated on the evaluation set (b=10).

Model                         8 kHz WER (%)    16 kHz WER (%)
TDNN-LSTM basic LM            20.72            19.91
TDNN-LSTM phone book LM       10.90            10.32

Table 6.3: WER for the TDNN-LSTM models using the two different language models.

6.1.3 Comparison to baseline systems

There are two previously published attempts to build a general-purpose ASR model for Icelandic, both of which use Kaldi for training and evaluation. Ice-kaldi [29] reportedly achieved a WER of 15.72%, and a system built specifically for the Icelandic parliament, Althingi, reports a WER of 14.76% for that purpose [37]. Both systems are designed for 16 kHz audio. The literature suggests lower sample rates give less accurate results than higher sample rates for neural network architectures [45][26]. However, since these are the only published results for ASR models for Icelandic, they will serve as the baseline for this project.

An overview of the comparison can be seen in table 6.4. Our TDNN-LSTM model with the phone book language model achieves a lower WER for both 8 kHz and 16 kHz audio than comparable systems.



Model                        8 kHz WER (%)    16 kHz WER (%)
Althingi                     -                14.76
Ice-kaldi                    -                15.72
TDNN-LSTM, basic LM          20.72            19.91
TDNN-LSTM, phone book LM     10.90            10.32

Table 6.4: Word error rate (WER) compared with other results.

The TDNN-LSTM models are similar in structure to the Ice-kaldi model; though our models have more layers in the network, the node count is considerably lower.

6.2 Call experiment

In total 62 users were called. Of the 62 users, 33 hung up before the first question was asked. Seven users hung up after trying at least once to talk back or after answering the first question. The remaining 22 users went through the whole dialogue and finished the conversation.

Of the remaining 29 users who interacted with the system, an answer was accepted on the first try for 79% of users on the first question and for 42% on the second question. On average, users needed 1.4 and 1.8 tries for the system to recognize an answer to the first and second question, respectively. Furthermore, as seen in figure 6.1, most answers were accepted on the first try.



[Two bar charts: "Number of tries needed for question 1" and "Number of tries needed for question 2"; x-axis: number of tries (1 to 5+), y-axis: number of users.]

Figure 6.1: Number of users per number of tries for question 1 and question 2


Chapter 7

Conclusion

The main focus of the project, to build a continuous speech recognition system for telephone conversations in Icelandic, was successfully achieved. The experiments showed that a continuous speech recognition system can be built for telephone communication in Icelandic using currently available resources. The resulting neural network shows good results despite its low node count.

Though the resources suffice, there are a few drawbacks with the available data. The data from the Málrómur corpus are recordings of read speech. It has been shown that using spontaneous speech for training can significantly improve performance when recognizing spontaneous speech [2]. The difference is not represented in our results, since our evaluation data comes from the same source as the training data. However, it can be speculated that it had some negative effect on the call experiment. Additionally, the vocabulary of the lexicon for the basic language model was chosen from a list of the most frequently used words in written text, giving precedence to words more commonly used in writing than in speech and furthering the mismatch between the model and the intended use case.

The difference in results between the two language models we developed, the basic LM and the phone book LM, clearly shows how important the vocabulary and the content of the language model are. The basic LM was designed with spontaneous speech in mind, while the phone book LM was specifically adapted to the training data. The content of the basic LM has some negative impact on the reported WER of the speech recognizer, since the text in the evaluation data consists mostly of named entities and written text. Around 25% of the utterances in the




Málrómur corpus are names of streets, people and URLs. Adding vocabulary from the Icelandic phone directory improved the WER immensely. The difference shows the importance of having a language model that fits the context and how the proper vocabulary for the intended use case can give better results.

Finally, it was shown how the resulting speech recognizer can be used in phone communication with actual users to complete a simple conversational task.

7.1 Social impact and ethical implications

After Hjal was published in 2003, with good results, the impact of the project was less than expected and it was never widely used [34].

There are still few speech recognition systems available for Icelandic and there are few examples of them being used.

This is the first publicly available system for continuous speech recognition in Icelandic for telephone communication. This project will make it possible for organizations that communicate with people via telephone to operate more efficiently by automating telephone communication. To help increase the availability of the system, Já/Gallup is working on a platform to make it easily available for both commercial applications and public use. The platform will offer a simple way to conduct fully automated conversations over the telephone as well as a programming interface for continuous speech recognition in Icelandic.

It is difficult to predict whether the increased availability will have any impact on the usage of speech recognition in Iceland; it has failed before. Users turn more and more to the internet instead of picking up the phone. It could be too late for automatic dialogue systems to take over many telephone-based tasks, since as the users disappear so does the chance of success. That said, there are still various tasks that could gain from automatic dialogue systems, and due to the lack of availability it has never been possible to give them a chance before. Whether that is the only reason remains to be seen.

Increased availability of automatic speech recognition for Icelandic will only have a positive effect on the Icelandic language. With it comes the possibility to replace services that use foreign languages and introduce services that are missing in Iceland due to the previous lack of



support. This would help the Icelandic language stay relevant and lower the chance of the language being replaced.

The increased availability also offers the chance for more automation. Looking at the industrial revolutions, it is hard to deny that new technology and more automation are advantageous for society in general. However, they can have negative implications for the individuals being replaced. Automating telephone-based tasks is no different. It is therefore important to keep in mind that, though the overall gain may be positive in the end, it can affect people and organizations differently.

7.2 Future work

To improve the performance of the overall system and the ASR model, further work should be put into the language model. Better parsing of the text, for example expanding numbers and abbreviations, would be advantageous. It could also be advantageous to increase the vocabulary to minimize the number of OOV words. A language model that can form compound words would help lower the number of OOV words while still keeping the size of the vocabulary manageable.
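As a rough illustration of the kind of text normalization meant here, the sketch below expands a couple of abbreviations and spells out digits before language model training; the abbreviation table and digit names are small illustrative placeholders and far from a complete Icelandic normalizer.

```python
import re

# Illustrative placeholders only; a real normalizer needs a full Icelandic
# number grammar (inflection, compounds) and a much larger abbreviation table.
ABBREVIATIONS = {"t.d.": "til dæmis", "o.s.frv.": "og svo framvegis"}
DIGITS = {"0": "núll", "1": "einn", "2": "tveir", "3": "þrír", "4": "fjórir",
          "5": "fimm", "6": "sex", "7": "sjö", "8": "átta", "9": "níu"}

def normalize(text):
    for abbrev, expansion in ABBREVIATIONS.items():
        text = text.replace(abbrev, expansion)
    # Spell out each digit separately; a real system would read whole numbers.
    text = re.sub(r"\d", lambda m: " " + DIGITS[m.group()] + " ", text)
    return " ".join(text.split())

print(normalize("hringdu t.d. í 112"))
# -> hringdu til dæmis í einn einn tveir
```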

The development of a spontaneous speech corpus, of similar structure and quality to the Málrómur corpus, to use in addition to the available data would be valuable for further improvements to the acoustic model.


Bibliography

[1] Maximilian Bisani and Hermann Ney. “Joint-sequence models for grapheme-to-phoneme conversion”. In: Speech communication 50.5 (2008), pp. 434–451.

[2] John Butzberger et al. “Spontaneous speech effects in large vocabulary speech recognition applications”. In: Proceedings of the workshop on Speech and Natural Language. Association for Computational Linguistics. 1992, pp. 339–343.

[3] Dogan Can et al. “Pykaldi: A python wrapper for kaldi”. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2018, pp. 5889–5893.

[4] Gaofeng Cheng et al. “An exploration of dropout with LSTMs”. In: Proc. Interspeech. 2017.

[5] KH Davis, R Biddulph, and Stephen Balashek. “Automatic recognition of spoken digits”. In: The Journal of the Acoustical Society of America 24.6 (1952), pp. 637–642.

[6] Steven B Davis and Paul Mermelstein. “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences”. In: Readings in speech recognition. Elsevier, 1990, pp. 65–74.

[7] Donald M Decker et al. Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press, 1999.

[8] Najim Dehak et al. “Front-end factor analysis for speaker verification”. In: IEEE Transactions on Audio, Speech, and Language Processing 19.4 (2011), pp. 788–798.




[9] Sigríður Sigurjónsdóttir and Eiríkur Rögnvaldsson. “Íslenska á tölvuöld. Kynning á verkefninu: Greining á málfræðilegum afleiðingum stafræns málsambýlis”. Presented at Frændafundur 9 at the University of Iceland, 2016. URL: https://notendur.hi.is/eirikur/ff.pdf.

[10] Jeffrey L Elman. “Finding structure in time”. In: Cognitive science 14.2 (1990), pp. 179–211.

[11] Mark JF Gales. “Semi-tied covariance matrices for hidden Markov models”. In: IEEE transactions on speech and audio processing 7.3 (1999), pp. 272–281.

[12] John J Godfrey, Edward C Holliman, and Jane McDaniel. “SWITCHBOARD: Telephone speech corpus for research and development”. In: ICASSP. IEEE. 1992, pp. 517–520.

[13] Alex Graves. “Sequence transduction with recurrent neural networks”. In: arXiv preprint arXiv:1211.3711 (2012).

[14] Jón Guðnason et al. “Almannaromur: An open icelandic speech corpus”. In: Spoken Language Technologies for Under-Resourced Languages. 2012.

[15] Þorsteinn Daði Gunnarsson. Ja-ASR. https://github.com/JaGallup/kaldi/tree/master/egs/jais/s5. 2019.

[16] Erla Hallsteinsdóttir et al. “Íslenskur orðasjóður–Building a Large Icelandic Corpus”. In: Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007). 2007, pp. 288–291.

[17] Sigrún Helgadóttir et al. “The tagged Icelandic corpus (MÍM)”. In: Language Technology for Normalisation of Less-Resourced Languages (2012), p. 67.

[18] Hynek Hermansky. “Perceptual linear predictive (PLP) analysis of speech”. In: the Journal of the Acoustical Society of America 87.4 (1990), pp. 1738–1752.

[19] Geoffrey Hinton et al. “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups”. In: IEEE Signal processing magazine 29.6 (2012), pp. 82–97.

[20] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural computation 9.8 (1997), pp. 1735–1780.
