
A Keyword Based Interactive Speech Recognition System for Embedded Applications

Master’s Thesis

by

Iván Francisco Castro Cerón

Andrea Graciela García Badillo

June, 2011

School of Innovation, Design and Engineering, Mälardalen University

Västerås, Sweden

Supervisor: Patrik Björkman


Abstract

Speech recognition has been an important area of research during the past decades. The usage of automatic speech recognition systems is rapidly increasing in different areas, such as mobile telephony, automotive, healthcare, robotics and more. However, despite the existence of many speech recognition systems, most of them use platform-specific and non-publicly available software. Nevertheless, it is possible to develop speech recognition systems using already existing open source technology.

The aim of this master's thesis is to develop an interactive and speaker-independent speech recognition system. The system shall be able to identify predetermined keywords from incoming live speech and, in response, play audio files with related information. Moreover, the system shall be able to provide a response even if no keyword was identified. For this project, the system was implemented using PocketSphinx, a speech recognition library that is part of the open source Sphinx technology developed at Carnegie Mellon University.

During the implementation of this project, the automation of different steps of the process was a key factor for a successful completion. This automation consisted of the development of different tools for the creation of the language model and the dictionary, two important components of the system. Similarly, the creation of the audio files to be played after identifying a keyword, as well as the evaluation of the system's performance, were fully automated.

The tests run show encouraging results and demonstrate that the system is a feasible solution that could be implemented and tested in a real embedded application. Despite the good results, possible improvements can be implemented, such as the creation of a different phonetic dictionary to support different languages.


Acknowledgements

This master's thesis represents the completion of a journey we decided to start two years ago. Studying in Sweden has been quite an experience and for that, we would like to thank MDH for giving us the opportunity to study here. We would also like to express our gratitude to our Professor Lars Asplund for his guidance throughout the development of this project. Moreover, we would like to thank the Consejo Nacional de Ciencia y Tecnología (CONACYT) in México for the financial support provided through the scholarships we were both granted.

We would also like to thank our families for their immense support and encouragement during all this time we have been away. Finally, we are grateful for having had the great opportunity of being able to accomplish one more goal together.


Contents

List of Figures iv
List of Tables vi
1 Introduction 1
1.1 Outline . . . 2
2 Background 3
2.1 History . . . 3

2.2 Main challenges and Current Research Areas . . . 6

3 Automatic Speech Recognition 8
3.1 Speech Variability . . . 8

3.1.1 Gender and Age . . . 8

3.1.2 Speaker’s Origin . . . 9
3.1.3 Speaker’s style . . . 9
3.1.4 Rate of Speech . . . 9
3.1.5 Environment . . . 9
3.2 ASR Components . . . 10
3.2.1 Front End . . . 10
3.2.2 Linguistic Models . . . 12
3.2.3 Decoder . . . 15

3.3 Training for Hidden Markov Models . . . 23

3.3.1 Expectation-Maximization Algorithm . . . 23

3.3.2 Forward-Backward Algorithm . . . 23

3.4 Performance . . . 25

3.4.1 Word Error Rate . . . 25

4 Design and Implementation 27
4.1 The CMU Sphinx Technology . . . 27

4.1.1 Sphinx . . . 27

4.1.2 PocketSphinx . . . 28


4.3 Keyword List . . . 30

4.4 Audio Files . . . 30

4.4.1 Creating the audio files . . . 30

4.4.2 Playing the audio files . . . 33

4.5 Custom Language Model and Dictionary . . . 33

4.5.1 Keyword-Based Language Model Generation . . . 34

4.5.2 Custom Dictionary Generation . . . 36

4.5.3 Support for other languages . . . 38

4.6 Decoding System . . . 40

4.6.1 Inputs . . . 40

4.6.2 Outputs . . . 40

4.6.3 Algorithm . . . 41

4.6.4 System Components . . . 43

5 Tests and Results 44
5.1 Main ASR application . . . 44

5.1.1 Performance Evaluation and KDA . . . 44

5.1.2 Test environment and setup . . . 46

5.2 Auxiliary Java Tools . . . 49

5.2.1 TextExtractor Tool . . . 49
5.2.2 DictGenerator Tool . . . 50
5.2.3 RespGenerator Tool . . . 51
5.2.4 rEvaluator Tool . . . 51
6 Summary 52
6.1 Conclusions . . . 52
6.2 Future Work . . . 53
Bibliography 55
A Guidelines 57
A.1 Installation of the speech recognizer project . . . 57

A.2 Running the speech recognition application . . . 58

A.3 Generating a new language model . . . 59

A.4 Generating a new dictionary . . . 61

A.5 Generating a new set of predefined responses . . . 61

A.6 Running the speech recognizer using the newly created files . . . 63

A.7 Evaluating the performance and measuring the KDA . . . 64

B Source Code 66
B.1 Recognizer . . . 67

B.2 TextExtractor . . . 76

B.3 DictGenerator . . . 80


List of Figures

2.1 DARPA Speech Recognition Benchmark Tests. . . 4

2.2 Milestones and Evolution of the Technology. . . 5

3.1 Signal processing to obtain the MFCC vectors . . . 11

3.2 A typical triphone structure. . . 14

3.3 Three state HMM . . . 16

3.4 Example of phone and word HMMs representation . . . 18

3.5 A detailed representation of a Multi-layer Perceptron. . . 19

3.6 Tree Search Network using the A* algorithm. . . 21

3.7 Sphinx training procedure . . . 25

4.1 Keyword File Example . . . 30

4.2 Sentences for Audio Files Example . . . 31

4.3 Conversion of Sentences into Audio Files . . . 32

4.4 Statistical Language Model Toolkit . . . 34

4.5 TextExtractor and CMUCLMTK . . . 35

4.6 Dictionary Generator Tool . . . 37

4.7 SphinxTrain Tool . . . 39

4.8 Decoding Algorithm . . . 42

4.9 Component Diagram . . . 43

5.1 Typical Test Setup . . . 45

5.2 KDA Test Setup . . . 45

5.3 Automated Test Process . . . 45

5.4 Test Sentences . . . 46

5.5 rEvaluator Tool . . . 47

5.6 TextExtractor Operation Mode 1 . . . 49

5.7 TextExtractor Operation Mode 2 . . . 50

5.8 DictGenerator Operation Modes 1 and 2 . . . 50

5.9 Generated Dictionary Files . . . 51

5.10 RespGenerator: Execution Messages . . . 51


A.1 ASR Project Folder Structure . . . 57

A.2 Message . . . 59

A.3 Decoding Example . . . 59

A.4 Decoding Example . . . 59

A.5 Keywords and URLs . . . 60

A.6 Example of a Set of Predefined Responses . . . 62

A.7 Responses Structure . . . 62

A.8 OGG Files . . . 63


List of Tables

3.1 Example of the 3 first words in an ASR dictionary . . . 15

4.1 Dictionary Words Not Found: Causes and Solutions . . . 37

4.2 Phonetical Dictionary . . . 39

5.1 Keyword List . . . 46

5.2 KDA Report Format . . . 47

5.3 Overall Test Results . . . 48

5.4 Computer Specifications . . . 48


Chapter 1

Introduction

Automatic speech recognition (ASR) systems have caught the attention of researchers since the middle of the 20th century. From the initial attempts to identify isolated words using a very limited vocabulary, to the latest advancements in processing continuous speech composed of thousands of words, ASR technology has grown progressively.

The rise of robust speech recognition systems during the last two decades has triggered a number of potential applications for the technology. Existing human-machine interfaces, such as keyboards, can be enhanced or even replaced by speech recognizers that interpret voice commands to complete a task. This kind of application is particularly important because speech is the main form of communication among humans; for this reason, it is much simpler and faster to complete a task by providing vocal instructions to a machine rather than typing commands on a keyboard [21].

The convergence of several research disciplines such as digital signal processing, machine learning and language modeling has allowed the ASR technology to mature and currently be used in commercial applications. Current speech recognition systems are capable of identifying and processing commands with many words. However, they are still unable to fully handle and understand a typical human conversation [9].

The performance and accuracy of existing systems allow them to be used in simple tasks like telephone-based applications (call centers, user authentication, etc.). Nevertheless, as the accuracy and the vocabulary size are increased, the computational resources needed to implement a typical ASR system grow as well. The amount of computational power required to execute a fully-functional speech recognizer can be easily supplied by a general-purpose personal computer, but it might be difficult to execute the same system on a portable device [21].

Mobile applications represent a very important area where ASR systems can be used, for example in GPS navigation systems for cars, where the user can provide navigation commands by voice, or a speech-based song selector for a music player. However, a large number of existing portable devices lack the processing power needed to execute a high-end speech recognizer. For this reason, there is a large interest in designing and developing flexible ASR systems that can run on both powerful and resource-constrained devices [8].

Typically, the speech recognition systems designed for embedded devices use restrictive grammars and do not support large vocabularies. Additionally, the accuracy achieved by those systems tends to be lower when compared to a high-end ASR. This compromise between processing power and accuracy is acceptable for simple applications; however, as the ASR technology becomes more popular, the expectations regarding accuracy and vocabulary increase [9].

One particular area of interest in the research community is to optimize the performance of existing speech recognizers by implementing faster, lighter and smarter algorithms. For example, PocketSphinx is an open source, continuous speech recognition system for handheld devices developed by the Carnegie Mellon University (CMU). As its name suggests, this system is a heavily simplified and optimized version of the existing tool Sphinx, developed by the same university. There are other existing speech recognizers developed for embedded devices, but most of these systems are not free and their source code is not publicly available. For this reason it is difficult to use them for experimentation and research purposes [8].

1.1 Outline

This master's thesis report is organized as follows:

Chapter 2 reviews part of the history of speech recognition, the main challenges faced during the development of ASR systems, as well as some of the current areas of research. Chapter 3 presents some of the sources of variability within human speech and their effects on speech recognition. Also described are the main components of an ASR system and how they are commonly trained and evaluated.

Chapter 4 describes in detail the design and implementation of the Keyword Based Interactive Speech Recognition System. This chapter also discusses the reasons for selecting PocketSphinx as the main decoder system for this project.

Chapter 5 presents the tests and results obtained after evaluating the implemented Keyword Based Interactive Speech Recognition System.

Chapter 6 presents the summary and conclusions of the paper and discusses possibilities of future work.


Chapter 2

Background

This chapter provides an overview of the history of speech recognition and ASR systems, as well as some of the main challenges regarding ASR systems and the current areas of research.

2.1 History

The idea of building machines capable of communicating with humans using speech emerged during the last couple of decades of the 18th century. However, these initial trials were more focused on building machines able to speak, rather than listening and understanding human commands. For example, using the existing knowledge about the human vocal tract, Wolfgang von Kempelen constructed an “Acoustic-Mechanical Speech Machine” with the intention of replicating speech-like sounds [10].

During the first half of the 20th century, one of the main research fields related to language recognition was the spectral analysis of speech and its perception by a human listener. Some of the most influential documents were published by Bell Laboratories and in 1952 they built a system able to identify isolated digits for a single speaker. The system had 10 speaker-dependent patterns, one associated to each digit, which represented the first two vowel formants for each digit. Although this scheme was rather rudimentary, the accuracy levels achieved were quite remarkable, reaching around 95 to 97 % [11].

The final part of the 1960s saw the introduction of feature extraction algorithms, such as the Fast Fourier Transform (FFT) developed by Cooley and Tukey, the Linear Predictive Coding (LPC) developed by Atal and Hanauer, and the Cepstral Processing of speech introduced by Oppenheim in 1968. Warping, also known as non-uniform time scaling, was another technique presented to handle differences in the speaking rate and segment length of the input signals. This was accomplished by shrinking and stretching the input signals in order to match stored patterns.

The introduction of Hidden Markov Models (HMMs) brought significant improvements to the existing technology. Based on the work published by Baum and others, James Baker applied the HMM framework to speech recognition in 1975 as part of his graduate work at CMU. In the same year, Jelinek, Bahl and Mercer applied HMMs to speech recognition while they were working for IBM. The main difference was the type of decoding used by the two systems; Baker’s system used Viterbi decoding while IBM’s system used stack decoding (also known as A* decoding) [10].

The use of statistical methods and HMMs heavily influenced the design of the next generation of speech recognizers. In fact, the use of HMMs started to spread in the 1980s and has become by far the most common framework used by speech recognizers. During the late 1980s, a technology involving artificial neural networks (ANNs) was introduced into speech recognition research; the main idea behind this technology was to identify phonemes or even complete words using multi-layer perceptrons (MLPs). However, more modern research has tried to apply MLPs as a complementary tool for HMMs; in other words, some modern ASRs use hybrid MLP/HMM systems in order to improve their accuracy.

The Defense Advanced Research Projects Agency (DARPA) in the US played a major role in the funding, creation and improvement of speech recognition systems during the last two decades of the 20th century. DARPA funded and evaluated many systems by measuring their accuracy and, most importantly, their word error rate (WER), as illustrated by Figure 2.1.


Many different tasks were created with different difficulty levels. For example, some tasks involved continuous speech recognition using structured grammar such as in military commands, some other tasks involved recognition of conversational speech using a very large vocabulary with more than 20 thousand words. The DARPA program also helped to create a number of speech databases used to train and evaluate ASR systems. Some of these databases are the Wall Street Journal, the Switchboard and the CALLHOME databases; all of them are publicly available.

Driven by the DARPA initiatives, the ASR technology evolved and several research programs were created in different parts of the world. The speech recognition systems eventually became very sophisticated and capable of supporting large vocabularies with thousands of words and phonemes. However, the WER for conversational speech can still be considered high, with values near 40% [9].

During the first decade of the 21st century the research community has focused on the use of machine learning in order to not only recognize words, but to interpret and understand human speech. In this regard, text-to-speech (TTS) synthesis systems have become popular in order to develop machines able to speak to humans. Nonetheless, designing and building machines able to mimic a person seems to be a challenge for future work [10]. Figure 2.2 depicts important milestones of speech recognition.


2.2 Main challenges and Current Research Areas

During the past five decades, the ASR technology has faced many challenges. Some of these challenges have been solved; some others are still present even in the most sophisticated systems. One of the most important roadblocks of the technology is the high WER for large-vocabulary continuous speech recognition systems. Without knowing the word boundaries in advance, the whole process of recognizing words and sentences becomes much harder. Moreover, failing to correctly identify word boundaries will certainly produce incorrect word hypotheses and incorrect sentences. This means that sophisticated language models are needed in order to discard incorrect word hypotheses [21].

Conversational and natural speech also contains co-articulatory effects; in other words, every sound is heavily influenced by the preceding and following sounds. In order to discriminate and correctly determine the phones associated with each sound, the ASR requires complex and detailed acoustic models. However, the use of large language and acoustic models typically increases the processing power needed to run the system. During the 1990s, even some common workstations did not have enough processing power to run a large vocabulary ASR system [21].

It is well known that increasing the accuracy of a speech recognition system increases its processing power and memory requirements. Similarly, these requirements can be relaxed by sacrificing some accuracy in certain applications. Nevertheless, it is much more difficult to improve accuracy while decreasing computational requirements. Speech variability in all of its possible categories represents one of the most difficult challenges for speech recognition technology. Speech variability can be traced to many different sources such as the environment, the speaker or the input equipment. For this reason, it is critical to create robust systems that are able to overcome speech variability regardless of its source [3].

Another important challenge for the technology is the improvement of the training process. For example, it is important to use training data that is similar to the type of speech used in a real application. It is also valuable to use training data that helps the system to discard incorrect patterns found during the process of recognizing speech. Finally, it is desired to use algorithms and training sets that can adapt in order to discriminate incorrect patterns.

As can be seen, there are many areas of opportunity and ideas for improving the existing ASR technology. Nevertheless, we can organize these areas into more specific groups, for example:

• It is important to improve the ease of use of the existing systems; in other words, the user needs to be able to use the technology in order to find more applications.


• The ASR systems should be able to adapt and learn automatically. For instance, they should be able to learn new words and sounds.

• The systems should be able to minimize and tolerate errors. This can be achieved by designing robust systems that can be used in real life applications.

• Recognition of emotions could play a very important role in order to improve speech recognition as it is one of the main causes of variability [3].


Chapter 3

Automatic Speech Recognition

This chapter discusses the main sources of speech variability and their effects on the accuracy of speech recognition. Additionally, it describes the major components of a typical ASR system, presents some of the algorithms used during the training phase and their common method of evaluation.

3.1 Speech Variability

One major challenge for speech recognition is speech variability. Due to human nature, a person is capable of emitting and producing a vast variety of sounds. Since each person has a different vocal tract configuration, shape and length (articulatory variations), it is impossible for two persons to speak alike; not even the same person can reproduce the same waveform when repeating the same word. However, it is not only the vocal tract configuration but also other factors that can create different effects on the resulting speech signal.

3.1.1 Gender and Age

The speaker's gender is one of the main sources of speech variability. It makes a difference in the produced fundamental frequencies, as men and women have different vocal tract sizes. Similarly, age contributes to speech variability: ASR for children becomes particularly difficult since their vocal tracts and folds are smaller compared to adults.

This has a direct impact on the fundamental frequency, as it becomes higher than adult frequencies. Also, according to [1], it has been shown that children under ten years old increase the duration of the vowels, resulting in variations of the formant locations and fundamental frequencies. In addition, children might also lack correct pronunciation and vocabulary, or even produce spontaneous speech that is grammatically incorrect.


3.1.2 Speaker’s Origin

Variations exist when recognizing native and non-native speech. Speech recognition among native speakers does not present significant acoustic differences and therefore does not have a big impact on the ASR system's performance. However, this might not be the case when recognizing a foreign speaker. Factors such as the level of knowledge of the non-native speaker, vocabulary, accent and pronunciation represent variations of the speech signal that could impact the system's performance. Moreover, if the speech models used only consider native speech data, the system may not behave correctly.

3.1.3 Speaker’s style

Apart from the origin of the speaker, speech also varies depending on the speaking style. A speaker might reduce the pronunciation of some phonemes or syllables during casual speech. On the other hand, portions of speech containing complex syntax and semantics tend to be articulated more carefully by speakers. Moreover, a phrase can be emphasized and the pronunciation can vary due to the speaker's mood. Thus, the context also determines how a speech signal is produced. Additionally, the speaker can introduce word repetitions or expressions that denote hesitation or uncertainty.

3.1.4 Rate of Speech

Another important source of variability is the rate of speech (ROS), since it increases the complexity of mapping the acoustic signal to the phonetic categories within the recognition process. Therefore, timing plays an important role, as an ASR system can have a higher error rate due to a higher speaking rate if the ROS is not properly taken into account. In contrast, a lower speaking rate may or may not affect the ASR system's performance, depending on factors such as whether the speaker over-articulates or introduces pauses within syllables.

3.1.5 Environment

Other sources of speech variability reside in the transmission channel; for example, distortions to the speech signal can be introduced by the microphone arrangement. Also, the background environment can produce noise that is introduced into the speech signal, or even the room acoustics can modify the speech signal received by the ASR system.

From physiological to environmental factors, different variations exist when speaking and between speakers, such as gender or age. Therefore, speech recognition deals with properly overcoming all these types of variations and their effects during the recognition process.

3.2 ASR Components

An ASR system comprises several components and each of them should be carefully designed in order to have a robust and well implemented system. This section presents the theory behind each of these components in order to better understand the entire process of the design and development of an ASR system.

3.2.1 Front End

The front end is the first component of almost every ASR system, as it is the first to see the speech signal as it comes into the system. This component is in charge of performing the signal processing on the received speech signal. By the time the input signal arrives at the front end, it has already passed through an acoustic environment, where the signal might have suffered from diverse effects, such as additive noise or room reverberation [15]. Thus, proper signal processing needs to be performed in order to enhance the signal, suppressing possible sources of variation, and to extract the features to be used by the decoder.

Feature Extraction

The speech signal is usually translated into a spectral representation comprising acoustic features. This is done by compressing the signal into a smaller set of N spectral features, where the size of N most likely depends on the duration of the signal. Therefore, in order to make a proper feature extraction, the signal needs to be properly sampled and treated prior to the extraction.

The feature extraction process should be able to extract the features that are critical to the recognition of the message within the speech signal. The speech waveform comprises several features, but the most important feature dimension is the so-called spectral envelope. This envelope contains the main features of the articulatory apparatus and is considered the core of speech analysis for speech recognition [9].

The features within the spectral envelope are obtained from a Fourier Transform, a Linear Predictive Coding (LPC) analysis or from a bank of bandpass filters. The most commonly used ASR features are the Mel-Frequency Cepstral Coefficients (MFCCs); however, there also exist LPC coefficients, Line-Spectral Frequencies (LSFs) and many others. Despite the extensive number of feature sets, they all intend to capture enough spectral information to recognize the spoken phonemes [16].


MFCC vectors

MFCCs can be seen as a way to translate an analog signal into digital feature vectors of typically 39 numbers. However, this process requires several steps in order to obtain these vectors. Figure 3.1 depicts the series of steps required to get the feature vectors from the input speech signal.

Figure 3.1: Signal processing to obtain the MFCC vectors

Due to the nature of human speech, the speech signal is attenuated as the frequencies increase. Also, the speech signal is subject to a falloff of about -6 dB per octave when passing through the vocal tract [16]. Therefore, the signal needs to be preemphasized, which means that a preemphasis filter, which is a high-pass filter, is applied to the signal. This increases the amplitude of the signal for the high frequencies while decreasing the components of lower frequencies. Then, having the original speech signal x, a new preemphasized sample at time n is given by the difference

x'_n = x_n - a \, x_{n-1} \qquad (3.1)

where a is a factor that is typically set to a value near 0.9 [16].

Once the signal has been preemphasized, it is partitioned into smaller frames of about 20 to 30 milliseconds, taken every 10 milliseconds. The higher the frame sampling rate, the better fast speech changes can be modeled [16]. The frames are overlapped in order to avoid missing information that could lie between the limits of each frame. This process is called windowing, and it is used in order to minimize the effects of partitioning the signal into small windows. The most used window function in ASR is the Hamming window [13]. This function is described by

w_n = \alpha - (1 - \alpha) \cos\left(\frac{2\pi n}{N-1}\right) \qquad (3.2)

where w is a window of size N and α has a value of 0.54 [18].
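To make the two preprocessing steps above concrete, the short C sketch below applies the preemphasis filter of Equation 3.1 and the Hamming window of Equation 3.2 to a frame of samples. It is only a minimal illustration under the constants given in the text (a near 0.9, α = 0.54); it is not the PocketSphinx front-end code.

```c
#include <math.h>
#include <stddef.h>

#define PI 3.14159265358979323846

/* Pre-emphasis (Eq. 3.1): x'[n] = x[n] - a * x[n-1], with a near 0.9. */
static void preemphasize(float *x, size_t n, float a)
{
    for (size_t i = n; i-- > 1; )      /* iterate backwards so x[i-1] is still the original sample */
        x[i] -= a * x[i - 1];
}

/* Hamming window (Eq. 3.2): w[n] = 0.54 - 0.46 * cos(2*pi*n / (N-1)). */
static void hamming_window(float *frame, size_t N)
{
    for (size_t n = 0; n < N; n++)
        frame[n] *= 0.54f - 0.46f * cosf(2.0f * (float)PI * (float)n / (float)(N - 1));
}
```

In a full front end these two functions would be applied to every overlapping frame before the FFT and mel-filtering steps described next.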

The next step is to compute the power spectrum by means of the Discrete Fourier Transform (DFT), using the Fast Fourier Transform (FFT) algorithm to minimize the required computation. Then, using a mel-filter bank, the power spectrum is mapped onto the mel-scale in order to obtain a mel-weighted spectrum. The reason for using this scale is that it is a non-linear scale that approximates the non-uniform human auditory system [9]. A mel-filter bank consists of a number of overlapped triangular bandpass filters whose center frequencies are equidistant on the mel-scale. This scale is linear up to 1000 hertz and logarithmic thereafter [18].

Next, a logarithmic compression is applied to the data obtained from the mel-filter bank, resulting in log-energy coefficients. These coefficients are then orthogonalized by a Discrete Cosine Transform (DCT) in order to compress the spectral data into a set of low order coefficients, also known as the mel-cepstrum or cepstral vector [18]. This is described by the following

C_i = \sum_{k=1}^{M} X_k \cos\left[i\left(k - \frac{1}{2}\right)\frac{\pi}{M}\right], \quad i = 1, 2, \ldots, M \qquad (3.3)

where C_i is the ith MFCC, M is the number of cepstrum coefficients and X_k represents the log-energy coefficient of the kth mel-filter [2]. This vector is later normalized in order to account for the distortions that may have occurred due to the transmission channel.

Additionally, it is common to compute the first and second order differentials of the cepstrum sequence to obtain the delta-cepstrum and the delta-delta cepstrum. Furthermore, the delta-energy and delta-delta energy parameters, which are the first and second order differentials of the power spectrum, are also added to the feature vector [9]. Thus, an ASR feature vector typically consists of 13 cepstral coefficients, 13 delta values and 13 delta-delta values, which gives a total of 39 features per vector [16].

3.2.2 Linguistic Models

Language Models and N-grams

After processing the input sound waves in the front end, the ASR generates a series of symbols representing the possible phonemes in a piece of speech. In order to form words using those phonemes, the speech recognition system uses language modeling. This modeling is characterized by a set of rules regarding how each word is related to other words. For example, a group of words cannot be put together arbitrarily; they have to follow a set of grammatical and syntactical rules of a language. This part of speech processing is necessary in order to determine the meaning of a spoken message during the further stages of speech understanding [9].

Most ASR systems use a probabilistic framework in order to find out how words are related to each other. In a very general sense, the system tries to determine what word is the latest one received based on a group of words previously received. For instance, what would be the next word in the following sentence?


I would like to make a collect. . .

Some of the possible words would be "call", "phone-call" or "international", among others. In order to determine the most probable word in the sentence, the ASR needs to use probability density functions, P(W), where W represents a sequence of words w_1, w_2, ..., w_n. The density function assigns a probability value to a word sequence depending on how likely it is to appear in a speech corpus [11]. Using the same example as before, the probability of the word "call" occurring after "I would like to make a collect" would be given by:

P(W) = P("I", "would", "like", "to", "make", "a", "collect", "call") \qquad (3.4)

If we substitute the words by mathematical expressions we have:

P(W) = P(w_1, w_2, w_3, \ldots, w_{n-1}, w_n) \qquad (3.5)

P(W) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_2, w_1) \ldots P(w_n \mid w_{n-1}, \ldots, w_1) \qquad (3.6)

As can be seen, the probability function requires using all of the words in a given sentence. This might represent a big problem when using large vocabularies, especially when dealing with very long sentences. A large speech corpus would be required in order to compute the probability of each word occurring after any combination of other words. In order to minimize the need for a large corpus, the N-gram model is used. This can be achieved by approximating the probability function of a given word using a predefined number of previous words (N). For example, using a bigram model, the probability function is approximated using just the previous word. Similarly, the trigram model uses the previous two words, and so on [11].

Using a bigram model it is easier to approximate the probability of occurrence for the word “call“, given the previous words:

P(W) = P("call" \mid "collect") P("collect" \mid "a") P("a" \mid "make") \ldots P("would" \mid "I") \qquad (3.7)

The trigram model looks slightly more complicated as it uses the previous two words to compute each conditional probability:

P(W) = P("call" \mid "collect", "a") P("collect" \mid "a", "make") \ldots P("like" \mid "would", "I") \qquad (3.8)

The N-gram model can be generalized using the following form [9]:

P(W) = \prod_{k=1}^{n} P(w_k \mid w_{k-1}, w_{k-2}, \ldots, w_{k-N+1}) \qquad (3.9)
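To make the bigram case concrete, the toy C sketch below estimates P(w2 | w1) with the maximum-likelihood ratio count(w1, w2) / count(w1) over a tiny, invented table of corpus counts. It only illustrates the principle; real language modeling toolkits additionally apply discounting and back-off, which are omitted here.

```c
#include <stdio.h>
#include <string.h>

/* Toy maximum-likelihood bigram estimate: P(w2 | w1) = count(w1 w2) / count(w1).
 * The corpus counts below are invented purely for illustration. */
struct bigram { const char *w1, *w2; int count; };

static const struct bigram counts[] = {
    { "collect", "call",    8 },
    { "collect", "phone",   1 },
    { "a",       "collect", 3 },
};
static const int n_bigrams = (int)(sizeof(counts) / sizeof(counts[0]));

static double bigram_prob(const char *w1, const char *w2)
{
    int pair = 0, unigram = 0;
    for (int i = 0; i < n_bigrams; i++) {
        if (strcmp(counts[i].w1, w1) == 0) {
            unigram += counts[i].count;          /* count(w1 *) */
            if (strcmp(counts[i].w2, w2) == 0)
                pair = counts[i].count;          /* count(w1 w2) */
        }
    }
    return unigram ? (double)pair / (double)unigram : 0.0;
}

int main(void)
{
    printf("P(call | collect) = %.3f\n", bigram_prob("collect", "call"));
    return 0;
}
```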


Acoustic Model

The main purpose of an acoustic model is to map sounds into phonemes and words. This can be achieved by using a large speech corpus and generating statistical representations of every phoneme in a language. The English language is composed of around 40 different phonemes [6]. However, due to co-articulation effects, one phoneme is affected by the preceding and succeeding phonemes depending on the context. The most common approach to overcome this problem is to use triphones; in other words, each phone is modeled along with its preceding and succeeding phones [9]. For example, the phoneme representation for the word HOUSEBOAT is:

[ hh aw s b ow t ]

In this context, in order to generate a statistical representation for the phoneme [aw], the phonemes [hh] and [s] need to be included as well. This triphone strategy can be implemented using Hidden Markov models, where the triphone is represented using three main states (one for each phoneme), plus one initial and one end state. An example of a triphone can be seen in Figure 3.2.

Figure 3.2: A typical triphone structure.

The purpose of the Hidden Markov model is to estimate the probability that an acoustic feature vector corresponds to a certain triphone. The computation of this probability can be obtained by using what is called the evaluation problem of HMMs. There is more than one way to solve the evaluation problem; however the most used algorithms are the forward-backward algorithm, the Viterbi algorithm and the stack decoding algorithm [16].

Dictionary

The ASR dictionary complements the language model and the acoustic model by mapping written words into phoneme sequences. It is important to notice that the size of the dictionary should correspond to the size of the vocabulary used by the ASR system. In other words, it should contain every single word used by the system and its phonetic representation, as can be seen in Table 3.1.


Word       Phoneme representation
aancor     AA N K AO R
aardema    AA R D EH M AH
aardvark   AA R D V AA R K

Table 3.1: Example of the first 3 words in an ASR dictionary

The representation of these phonemes must correspond to the representation used by the acoustic model in order to identify words and sentences correctly. Similarly, the words included in the dictionary must be the same as the words used by the language model. Furthermore, the dictionary should be as detailed as possible in order to improve the accuracy of the speech recognition system. For example, the dictionary used by the CMU-Sphinx system is composed of more than 100,000 words.

3.2.3 Decoder

Hidden Markov Models

ASR systems typically make use of finite-state machines (FSMs) to overcome the variations found within speech by means of stochastic modeling. Hidden Markov Models (HMMs) are one of the most common FSMs used within ASR and were first used for speech processing in the 1970s by Baker at CMU and Jelinek at IBM [16]. This type of model comprises a set of observations and hidden states with self or forward transitions and probabilistic transitions between them. Most HMMs have a left-to-right topology, which in the case of ASR allows modeling the sequential nature of speech [15].

In a Markov model, the probability or likelihood of being in a given state depends only on the immediately prior state, leaving earlier states out of consideration. Unlike a Markov model, in which each state corresponds to an observable event, in an HMM the state is hidden; only the output is visible.

Then, an HMM can be described as having states, observations and probabilities [20]:

• States. N hidden interconnected states, with state q_t at time t,

S = S_1, S_2, \ldots, S_N \qquad (3.10)

• Observations.

Symbols. M observation symbols per state,

V = v_1, v_2, \ldots, v_M \qquad (3.11)

Sequences. T observations in the sequence,

O = O_1, O_2, \ldots, O_T \qquad (3.12)

• Probabilities.

The state transition probability distribution A = \{a_{ij}\}, where

a_{ij} = P[q_{t+1} = S_j \mid q_t = S_i], \quad 1 \le i, j \le N \qquad (3.13)

the observation symbol probability distribution B = \{b_j(k)\} at state S_j, where

b_j(k) = P[v_k \text{ at } t \mid q_t = S_j], \quad 1 \le j \le N, \; 1 \le k \le M \qquad (3.14)

and the initial state distribution \pi = \{\pi_i\}, where

\pi_i = P[q_1 = S_i], \quad 1 \le i \le N \qquad (3.15)

Therefore, an HMM model can be denoted as:

λ = (A, B, π) (3.16)

An example of an HMM is depicted in Figure 3.3. The model has three states with self and forward transitions between them. State q_1 has two observation symbols, state q_2 has three observation symbols and state q_3 has four observation symbols. Each symbol has its own observation probability distribution per state.

Figure 3.3: Three state HMM
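For reference, the compact definition λ = (A, B, π) above maps naturally onto a data structure. The C sketch below shows one possible, purely illustrative representation of a discrete HMM; it is not the structure used internally by Sphinx or PocketSphinx.

```c
/* One possible in-memory representation of a discrete HMM,
 * lambda = (A, B, pi), following Eqs. 3.10-3.16 (illustrative only). */
typedef struct {
    int N;          /* number of hidden states S_1 .. S_N              */
    int M;          /* number of observation symbols v_1 .. v_M        */
    double **A;     /* A[i][j] = P(q_{t+1} = S_j | q_t = S_i)          */
    double **B;     /* B[j][k] = P(v_k observed | q_t = S_j)           */
    double *pi;     /* pi[i]   = P(q_1 = S_i)                          */
} hmm_t;
```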

An HMM shall address three basic problems in order to be useful in real-world applications such as an ASR system [20]. The first problem is selecting a proper method to determine the probability P(O|λ) that the observation sequence O = O_1, O_2, ..., O_T is produced by the model λ = (A, B, π). The typical approach used to solve this problem is the forward-backward procedure, the forward step being the most important at this stage. This algorithm is presented in Section 3.3.

The second problem is finding the best state sequence Q = q_1, q_2, ..., q_T that produced the observation sequence O = O_1, O_2, ..., O_T. For this, it is necessary to use an optimality criterion and learn about the model's structure. One possibility is to select the states q_t that are individually most likely for each t. However, this approach is not efficient as it only provides individual results; that is why the common approach to finding the best state sequence is to use the Viterbi algorithm, which is presented later in this section.

Finally, the third problem deals with the adjustment of the model parameters λ = (A, B, π) in order to maximize the probability P(O|λ) and better determine the origin of the observation O = O_1, O_2, ..., O_T. This is considered to be an optimization problem where the output is the decision criterion. In addition, this is the point where training takes an important role in ASR, as it allows adapting the model parameters to observed training data. Therefore, a common approach to solve this is by means of the maximum likelihood optimization procedure. This procedure uses separate training observation sequences O^v to obtain the model parameters of each model λ_v:

P_v^* = \max_{\lambda_v} P(O^v \mid \lambda_v) \qquad (3.17)

Typical usage of Hidden Markov Models

In ASR, various models can be made based on HMMs, the most common being models for phonemes and words. An example of these types of models is depicted in Figure 3.4, where various phone HMM models are used to construct the phonetic representation of the word one, and the concatenation of these models is used to construct word HMM models.

Phoneme models are constrained by pronunciations from a dictionary, while word models are constrained by a grammar. Phone or phoneme models are usually made up of one or more HMM states. On the other hand, word models are made up of the concatenation of phoneme models, which in turn help in the construction of sentence models [15].


Figure 3.4: Example of phone and word HMMs representation

Acoustic Probabilities and Neural Networks / MLPs

The front end phase of any speech recognition system is in charge of converting the input sound waves into feature vectors. Nonetheless, these vectors need to be converted into observation probabilities in order to decode the most probable sequence of phonemes and words in the input speech. Two of the most common methods to find the observation probabilities are: the Gaussian probability-density functions (PDFs) and most recently, the use of neural networks (also called multi-layer perceptrons, MLPs).

The Gaussian observation-probability method converts an observation feature vector o_t into a probability function b_j(o_t) using a Gaussian curve with a mean value µ and a covariance matrix Σ. In the simplest version of this method, each state in the hidden Markov framework has one PDF. However, most ASR systems use multiple PDFs per state; for this reason, the overall probability function b_j(o_t) is computed using Gaussian Mixtures. The forward-backward algorithm is commonly used to compute these probability functions and is also used to train the whole hidden Markov model framework [11].

Having the mean value and the covariance matrix, the probability function can be computed using the following equation:

b_j(o_t) = \frac{1}{\sqrt{\lvert 2\pi\Sigma_j \rvert}} \exp\left[-\frac{1}{2}(o_t - \mu_j)' \, \Sigma_j^{-1} \, (o_t - \mu_j)\right] \qquad (3.18)
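In practice this computation is carried out in the log domain, and many ASR systems restrict themselves to diagonal covariance matrices. The C sketch below evaluates the logarithm of Equation 3.18 for one diagonal-covariance Gaussian; it is a simplified illustration, not the routine used by PocketSphinx.

```c
#include <math.h>

#define PI 3.14159265358979323846

/* Log-likelihood of feature vector o under a single Gaussian with mean
 * `mean` and *diagonal* covariance `var` (Eq. 3.18 in the log domain).
 * A sketch only; real decoders precompute the constant term. */
static double log_gaussian(const float *o, const float *mean,
                           const float *var, int dim)
{
    double logdet = 0.0, dist = 0.0;
    for (int d = 0; d < dim; d++) {
        double diff = o[d] - mean[d];
        logdet += log(2.0 * PI * var[d]);   /* accumulates log|2*pi*Sigma| */
        dist   += diff * diff / var[d];     /* Mahalanobis distance term   */
    }
    return -0.5 * (logdet + dist);
}
```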

Artificial neural networks are one of the main alternatives to the Gaussian estimator in order to compute observation probabilities. The key characteristic of neural networks is that they can arbitrarily map inputs to outputs as long as they have enough hidden layers. For example, having the feature vectors (o_t) as inputs and the corresponding phone labels as outputs, a neural network can be trained to get the observation probability functions b_j(o_t) [10].

Typically, the ASR systems that use hidden Markov models and MLPs are classified as hybrid (HMM-MLP) speech recognition systems. The inputs to the neural network are various frames containing spectral features and the network has one output for each phone in the language. In this case, the most common method to train the neural network is the back-propagation algorithm [15].

Another advantage of using MLPs is that, given a large set of phone labels and their corresponding set of observations, the back-propagation algorithm iteratively adjusts the weights in the MLP until the errors are minimized. Figure 3.5 presents an example of a representation of an MLP with three layers.

Figure 3.5: A detailed representation of a Multi-layer Perceptron.

The two methods to calculate the acoustic probabilities (Gaussians and MLPs) have roughly the same performance. However, MLPs take longer to train but use fewer parameters. For this reason, the neural network method seems to be more suited to ASR applications where processing power and memory are major concerns.


Viterbi Algorithm and Beam Search

One of the most complex tasks faced by speech recognition systems is the identification of word boundaries. Due to co-articulation and fast speech rate, it is often difficult to identify the place where one word ends and the next one starts. This is called the segmentation problem of speech and it is typically solved using N-gram models and the Viterbi algorithm. Given a set of observed phones o = (o_1, o_2, ..., o_t), the purpose of the decoding algorithm is to find the most probable sequence of states q* = (q_1, q_2, ..., q_n), and subsequently, the most probable sequence of words in a speech [11].

The Viterbi search is implemented using a matrix where each cell contains the best path after the first t observations. Additionally, each cell contains a pointer to the last state i in the path. For example:

viterbi[t, i] = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1 q_2 \ldots q_{t-1}, q_t = i, o_1, o_2 \ldots o_t) \qquad (3.19)

In order to compute each cell in the matrix, the Viterbi algorithm uses what is called the dynamic programming invariant [9]. In other words, the algorithm assumes that the overall best path for the entire observation goes through state i, but sometimes that assumption might lead to incorrect results. For example, if the best path looks bad initially, the algorithm would discard it and select a different path with a better probability up to state i. However, the dynamic programming invariant is often used in order to simplify the decoding process using a recurrence rule in which each cell in the Viterbi matrix can be computed using information related to the previous cell [11].

viterbi[t, j] = \max_{i} \left( viterbi[t-1, i] \, a_{ij} \right) b_j(o_t) \qquad (3.20)

As can be seen, the cell [t, j] is computed using the previous cell [t − 1, i], one emission probability b_j and one transition probability a_{ij}. It is important to emphasize that this is a simplified model of the Viterbi algorithm; for example, a real speech recognizer that uses HMMs would receive acoustic feature vectors instead of phones. Furthermore, the likelihood probabilities b_j(o_t) would be calculated using Gaussian probability functions or multi-layer perceptrons (MLPs). Additionally, the hidden Markov models are typically divided into triphones rather than single phones. This characteristic of HMMs provides a direct segmentation of the speech utterance [16].

The large number of possible triphones and the use of a large vocabulary make speech decoding a computationally expensive task. For this reason, it is necessary to implement a method to discard low probability paths and focus on the best ones. This process is called pruning and is usually implemented using a beam search algorithm. The main purpose of this algorithm is to speed up the execution of the search and to use a lower amount of computational resources. However, the main drawback of the algorithm is the degradation of the decoding performance [11].
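The C sketch below shows the log-domain form of the recurrence in Equation 3.20 combined with a simple beam threshold, assuming the emission scores log_b[j][t] have already been computed for every frame. It is a didactic illustration of Viterbi decoding with pruning, not the search used inside PocketSphinx, and the sizes are placeholders.

```c
#include <float.h>

#define N_STATES 5        /* illustrative sizes only */
#define T_MAX    200

/* Log-domain Viterbi recursion (Eq. 3.20) with beam pruning.
 * log_a[i][j]: log transition probabilities; log_b[j][t]: log emission
 * scores per state and frame; T: number of frames; beam: pruning width. */
static double viterbi_decode(const double log_a[N_STATES][N_STATES],
                             const double log_b[N_STATES][T_MAX],
                             int T, double beam)
{
    static double v[T_MAX][N_STATES];
    static int    backptr[T_MAX][N_STATES];

    for (int j = 0; j < N_STATES; j++)        /* initialisation (uniform start assumed) */
        v[0][j] = log_b[j][0];

    for (int t = 1; t < T; t++) {
        double best_in_frame = -DBL_MAX;
        for (int j = 0; j < N_STATES; j++) {
            double best = -DBL_MAX;
            for (int i = 0; i < N_STATES; i++) {
                double score = v[t - 1][i] + log_a[i][j];
                if (score > best) { best = score; backptr[t][j] = i; }
            }
            v[t][j] = best + log_b[j][t];     /* Eq. 3.20 in the log domain */
            if (v[t][j] > best_in_frame) best_in_frame = v[t][j];
        }
        for (int j = 0; j < N_STATES; j++)    /* beam pruning: drop weak paths */
            if (v[t][j] < best_in_frame - beam) v[t][j] = -DBL_MAX;
    }

    double best_final = -DBL_MAX;             /* best complete path score; backptr  */
    for (int j = 0; j < N_STATES; j++)        /* would be followed backwards to      */
        if (v[T - 1][j] > best_final)         /* recover the state sequence q*       */
            best_final = v[T - 1][j];
    return best_final;
}
```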


A* decoding algorithm

The A* decoding algorithm (also known as stack decoding) can be used to overcome the limitations of the Viterbi algorithm. The most important limitation is related to the use of the dynamic programming invariant, because of which the Viterbi algorithm cannot be used with some language models, such as trigrams [11]. The stack decoding algorithm solves the same problem as the Viterbi algorithm, which is finding the most likely word sequence W given a sequence of observations O [17]. For this reason, it is often used as the best alternative to substitute the Viterbi method.

In this case, the speech recognition problem can be seen as a tree network search problem in which the branches leaving each junction represent words. As the tree network is processed, more words are appended to the current string of words in order to form the most likely path [17]. An example of this tree is illustrated by Figure 3.6.


The stack decoder computes the path with the highest probability of occurrence from the start to the last leaf in the sequence. This is achieved by storing a priority stack (or queue) with a list of partial sentences with a score, based on their probability of occurrence. The basic operation of the A* algorithm can be simplified using the following steps [17]:

1. Initialize the stack

2. Pop the best (high scoring) candidate off the stack

3. If the end-of-sentence is reached, output the sentence and terminate

4. Perform acoustic and language model fast matches to obtain new candidates

5. For each word on the candidate list:

(a) Perform acoustic and language-model detailed matches to compute the new theory output likelihood.

i. If the end-of-sentence is not reached, insert the candidate into the stack

ii. If the end-of-sentence is reached, insert it into the stack with an end-of-sentence flag.

6. Go to step 2

The stack decoding algorithm is based on a criterion that computes the estimated score f*(t) of a sequence of words up to time t. The score is computed by using known probabilities of previous words g_i(t) and a heuristic function to predict the remaining words in the sentence. For example:

f∗(t) = gi(t) + h∗(t) (3.21)

Alternatively, these functions can be expressed in terms of a partial path p instead of time t. In this case, f*(p) is the score of the best complete path which starts with the partial path p. Similarly, g_i(p) represents the score from the beginning of the utterance up to the partial path p. Lastly, h*(p) estimates the best extension from the partial path p towards the end of the sentence [11].

Finding an efficient and accurate heuristic function might represent a complex task. Fast matches are heuristic functions that are computationally cheap and are used to reduce the number of next possible word candidates. Nonetheless, these fast match functions must be checked by more accurate detailed match functions [17].
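A minimal skeleton of the stack decoding loop listed earlier is sketched below in C. The fast match, detailed match and end-of-sentence tests are reduced to placeholder stubs, since their real counterparts depend on the acoustic and language models; only the stack handling and the step numbering are meant to be representative.

```c
#define MAX_HYPS  1024
#define MAX_WORDS 32
#define MAX_CAND  16

typedef struct {
    int    words[MAX_WORDS];   /* partial word sequence (word IDs)     */
    int    n_words;
    int    complete;           /* end-of-sentence flag                 */
    double f;                  /* f*(p) = g(p) + h*(p), Eq. 3.21       */
} hyp_t;

static hyp_t stack_[MAX_HYPS];
static int   n_hyps;

static void push(const hyp_t *h) { if (n_hyps < MAX_HYPS) stack_[n_hyps++] = *h; }

static hyp_t pop_best(void)                   /* step 2: best-scoring hypothesis */
{
    int best = 0;
    for (int i = 1; i < n_hyps; i++)
        if (stack_[i].f > stack_[best].f) best = i;
    hyp_t h = stack_[best];
    stack_[best] = stack_[--n_hyps];
    return h;
}

/* Placeholder hooks -- a real decoder would query the acoustic and
 * language models here. */
static int    fast_match(const hyp_t *h, int *cand) { (void)h; cand[0] = 1; cand[1] = 2; return 2; }
static double detailed_match(const hyp_t *h, int w) { return h->f - (double)w; }
static int    end_of_sentence(const hyp_t *h)       { return h->n_words >= 3; }

hyp_t stack_decode(void)
{
    hyp_t init = { .n_words = 0, .complete = 0, .f = 0.0 };
    n_hyps = 0;
    push(&init);                              /* step 1: initialise the stack */

    for (;;) {
        hyp_t best = pop_best();              /* step 2 */
        if (best.complete)                    /* step 3: output and terminate */
            return best;

        int cand[MAX_CAND];
        int n = fast_match(&best, cand);      /* step 4: fast matches */
        for (int c = 0; c < n; c++) {         /* step 5: expand each candidate */
            hyp_t ext = best;
            if (ext.n_words < MAX_WORDS)
                ext.words[ext.n_words++] = cand[c];
            ext.f = detailed_match(&best, cand[c]);
            ext.complete = end_of_sentence(&ext);
            push(&ext);                       /* 5(a) i-ii: reinsert into stack */
        }
    }                                         /* step 6: loop back to step 2 */
}
```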


3.3

Training for Hidden Markov Models

One of the most challenging phases of developing an automatic speech recognition system is finding a suitable method to train and evaluate its hidden Markov models. This section provides an overview of the Expectation-Maximization algorithm and presents the Forward-Backward algorithm that is used for the training of an ASR system.

3.3.1 Expectation-Maximization Algorithm

Typically, the methods used to complete the training task of an ASR are variations of the Expectation-Maximization algorithm (EM), presented by Dempster in 1977. The goal of this algorithm is to approximate the transition (a_{ij}) and emission (b_i(o_t)) probabilities of the HMM using large sets of observation sequences O [11].

The initial values of the transition and emission probabilities can be estimated; as the training of the HMM progresses, those probabilities are re-estimated until their values converge into a good model. The expectation-maximization algorithm uses the steepest gradient method, also known as hill climbing. This means that the convergence of the algorithm is determined by local optimality [16].

3.3.2 Forward-Backward Algorithm

Due to the large number of parameters involved in the training of HMMs, it is preferable to use simple methods. Some of these methods can feature adaptation in order to add robustness to the system by using noisy or contaminated training data. In fact, this has been a major research area in the field of speech recognition during the last couple of decades [14].

Two of the most popular algorithms used to train HMMs are the Viterbi algorithm and the Forward-Backward (FB) algorithm, also known as the Baum-Welch algorithm. The latter computes the maximum likelihood estimates and the posterior modes of the HMM parameters (the transition and emission probabilities) using emission observations as training data. The approximation of the HMM probabilities can be achieved by combining two quantities: the forward probability (alpha) and the backward probability (beta) [11].

The forward probability is defined as the probability of being in state i after the first t observations (o_1, o_2, o_3, ..., o_t):

\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = i) \qquad (3.22)

Similarly, the backward probability is defined as the probability of observing the observations from time t + 1 to the end (T) when the state is j at time t.


\beta_t(j) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = j) \qquad (3.23)

In both cases, the probabilities are calculated using an initial estimate and then the remaining values are approximated using an induction step. For example:

\alpha_1(j) = a_{1j} \, b_j(o_1) \qquad (3.24)

And then the other values are recursively calculated:

\alpha_t(j) = \left[\sum_{i=2}^{N-1} \alpha_{t-1}(i) \, a_{ij}\right] b_j(o_t) \qquad (3.25)

As it can be seen, the forward probability in any given state can be computed using the product of the observation likelihood bj and the forward probabilities from time t − 1.

This characteristic allows this algorithm to work efficiently without drastically increasing the number of computations as N grows [16].

A similar approach is used to calculate the backward probabilities using an iterative formula:

\beta_t(i) = \sum_{j=2}^{N-1} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j) \qquad (3.26)

Once the forward and backward probabilities have been calculated, both the alpha and beta factors are normalized and combined in order to approximate the new values for the emission and transition probabilities.
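A compact C version of the forward recursion (Equations 3.24 and 3.25) is shown below for a small left-to-right model whose first and last states are non-emitting entry and exit states, which is the convention implied by the summation limits above. It is a toy illustration, not part of SphinxTrain.

```c
#define N 5   /* total states; states 0 and N-1 are non-emitting entry/exit */
#define T 4   /* number of observations in the sequence */

/* Forward pass: a[i][j] are transition probabilities, b[j][t] holds the
 * precomputed emission probability b_j(o_t) for each emitting state.
 * Returns P(O | lambda). */
static double forward(const double a[N][N], const double b[N][T])
{
    double alpha[T][N] = { { 0.0 } };

    for (int j = 1; j < N - 1; j++)          /* initialisation, Eq. 3.24 */
        alpha[0][j] = a[0][j] * b[j][0];

    for (int t = 1; t < T; t++)              /* induction, Eq. 3.25 */
        for (int j = 1; j < N - 1; j++) {
            double sum = 0.0;
            for (int i = 1; i < N - 1; i++)
                sum += alpha[t - 1][i] * a[i][j];
            alpha[t][j] = sum * b[j][t];
        }

    double p = 0.0;                          /* termination: exit transitions */
    for (int i = 1; i < N - 1; i++)
        p += alpha[T - 1][i] * a[i][N - 1];
    return p;
}
```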

Although the Baum-Welch algorithm has proven to be an efficient way to train HMMs, the training data needs to be sufficient to avoid having parameters with probabilities equal to zero. Furthermore, using the forward-backward algorithm helps to train parameters for an existing HMM, but the structure of the HMM needs to be generated manually. This could represent a major concern as finding a good method to generate the structure for an HMM could be a difficult task [16].

An example of a training procedure is depicted in Figure 3.7, which shows how the forward-backward algorithm is used by Sphinx.


Figure 3.7: Sphinx training procedure

3.4 Performance

The performance of an ASR system can be measured in terms of its recognition error probability, which is why specific metrics such as the word-error rate are used to evaluate this type of system. This section describes the word-error rate, which is the most common measurement used to evaluate the performance of an ASR system.

3.4.1 Word Error Rate

The word-error rate (WER) has become the standard measurement scheme to evaluate the performance of speech recognition systems [11]. This metric allows calculating the total number of incorrect words in a recognition task. Similar approaches use syllables or phonemes to calculate error rates; however, the most common measurement units are words. The WER is calculated by measuring the number of inserted, deleted or substituted words of a hypothesized speech string with respect to a correct transcript [1].

WER = \frac{\text{Insertions} + \text{Substitutions} + \text{Deletions}}{\text{Number of words in the correct transcript}}

As can be seen, the number of word insertions is included in the numerator of the expression; for this reason, the WER can have values above 100 percent. Typically, a WER lower than 10 percent is acceptable for most ASR systems [10]. Furthermore, the WER is the most common metric used to benchmark and evaluate improvements to existing automatic speech recognition systems, for example when introducing improved or new algorithms [21].
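The WER definition above reduces to a minimum edit (Levenshtein) distance between the reference and the hypothesis at the word level. The C sketch below computes it for two sequences of word IDs; it is a direct implementation of the definition, not the rEvaluator tool described in Chapter 5.

```c
#define MAX_WER_LEN 256

/* Word error rate between a reference transcript `ref` (n_ref words) and a
 * hypothesis `hyp` (n_hyp words), both given as arrays of word IDs.
 * d[i][j] is the minimum number of edits between the first i reference
 * words and the first j hypothesis words. */
static double word_error_rate(const int *ref, int n_ref, const int *hyp, int n_hyp)
{
    static int d[MAX_WER_LEN + 1][MAX_WER_LEN + 1];

    for (int i = 0; i <= n_ref; i++) d[i][0] = i;    /* all deletions  */
    for (int j = 0; j <= n_hyp; j++) d[0][j] = j;    /* all insertions */

    for (int i = 1; i <= n_ref; i++)
        for (int j = 1; j <= n_hyp; j++) {
            int sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]);
            int del = d[i - 1][j] + 1;
            int ins = d[i][j - 1] + 1;
            int min = sub < del ? sub : del;
            d[i][j] = min < ins ? min : ins;
        }

    /* (Substitutions + Deletions + Insertions) / words in the reference */
    return (double)d[n_ref][n_hyp] / (double)n_ref;
}
```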

Although the WER is the most used metric to evaluate the performance of ASRs, it does not provide further insight into the factors that generate the recognition errors. Some other methods have been proposed in order to measure and classify the most common speech recognition errors. For example, the analysis of variance (ANOVA) method allows the quantification of multiple sources of errors acting in the variability of speech signals [1].

During the last decade, researchers have tried to predict speech recognition errors instead of just measuring them in order to evaluate the performance of a system. Furthermore, the predicted error rates can be used to carefully select speech data in order to train the ASR systems more effectively.


Chapter 4

Design and Implementation

This chapter describes in detail the design and implementation of a Keyword Based Interactive Speech Recognition System using PocketSphinx. A brief description of the CMU Sphinx technology and the reasons for selecting PocketSphinx are provided in this chapter.

Also introduced is the creation process of both the language model and the dictionary. Furthermore, several tools were created in order to ease the system's development, and they are introduced in this chapter. More information on these tools and their usage can be found in Appendix A.

4.1 The CMU Sphinx Technology

This section describes the Sphinx speech recognition system that is part of the CMU Sphinx technology. Also described is the PocketSphinx system which is the ASR decoder library selected for this project.

4.1.1 Sphinx

Sphinx is a continuous-speech and speaker-independent recognition system developed in 1988 at Carnegie Mellon University (CMU) in order to try to overcome some of the greatest challenges within speech recognition: speaker independence, continuous speech and large vocabularies [7, 13]. This system is part of the open source CMU Sphinx technology, which provides a set of speech recognizers and tools that allow the development of speech recognition systems. This technology has been used to ease the development of speech recognition systems, as it saves developers the need to start from scratch.


The first Sphinx system has been improved over the years and three other versions have been developed: Sphinx2, 3 and 4. Sphinx2 is a semi-continuous HMM based system, while Sphinx3 is a continuous HMM based speech recognition system; both are written in the C programming language [19]. On the other hand, Sphinx4 is a system that provides a more flexible and modular framework written in the Java programming language [23]. Currently, only Sphinx3 and Sphinx4 are still under development. Overall, the Sphinx systems comprise typical ASR components; for example, Sphinx4 has a Front End, a Decoder and a Linguist, as well as a set of algorithms that can be used and configured depending on the needs of the project. They provide part of the needed technology that, given an acoustic signal and a properly created acoustic model, language model and dictionary, decodes the spoken word sequence.

4.1.2 PocketSphinx

PocketSphinx is a large vocabulary, semi-continuous speech recognition library based on CMU’s Sphinx2. PocketSphinx was implemented with the objective of creating a speech recognition system for resource-constrained devices, such as hand-held computers [8]. The entire system is written in C with the aim of enabling fast-response and light-weight applications. For this reason, PocketSphinx can be used in live applications, such as dictation.

One of the main advantages of PocketSphinx over other ASR systems is that it has been ported and executed successfully on different types of processors, most notably the x86 family and several ARM processors [8]. Similarly, this ASR has been used on different operating systems such as Microsoft’s Windows CE, Apple’s iOS and Google’s Android [24]. Additionally, the source code of PocketSphinx has been published by Carnegie Mellon University under a BSD style license. The latest code can be retrieved from SourceForge1.

In order to execute PocketSphinx, typically three input files need to be specified: the language model, the acoustic model and the dictionary. For more information about these three components please refer to Section 3.2.2. By default, the PocketSphinx toolkit includes at least one language model (wsj0vp.5000), one acoustic model (hub4wsj sc 8k) and one dictionary (cmu07a.dic). Nonetheless, these files are intended to support very large vocabularies, for example, the dictionary2 includes around 125,000 words.
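
As an illustration of how these three files are passed to the decoder, the following minimal sketch initializes PocketSphinx with the default models and decodes a pre-recorded utterance from a raw audio file. The file paths are placeholders, and the exact function signatures may differ slightly between PocketSphinx versions (older releases take an additional utterance-id argument in some calls):

#include <pocketsphinx.h>
#include <stdio.h>

int main(void)
{
    cmd_ln_t *config;
    ps_decoder_t *ps;
    FILE *fh;
    char const *hyp;
    int32 score;

    /* Point the decoder at an acoustic model, a language model and a
       dictionary (placeholder paths for the default files named above). */
    config = cmd_ln_init(NULL, ps_args(), TRUE,
                         "-hmm",  "model/hub4wsj_sc_8k",
                         "-lm",   "model/wsj0vp.5000.DMP",
                         "-dict", "model/cmu07a.dic",
                         NULL);
    ps = ps_init(config);

    /* Decode a headerless raw PCM file recorded at the sampling rate
       expected by the acoustic model, then print the best hypothesis. */
    fh = fopen("utterance.raw", "rb");
    ps_decode_raw(ps, fh, -1);
    hyp = ps_get_hyp(ps, &score);
    printf("Decoded: %s\n", hyp);

    fclose(fh);
    ps_free(ps);
    return 0;
}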

Although there are not many scientific publications describing how the PocketSphinx library works, the official web page3 contains documentation that explains how to download and compile the source code, as well as how to create example applications.

1 CMU Sphinx repository: http://sourceforge.net/projects/cmusphinx
2 CMU dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Similarly, there is a discussion forum where new developers are encouraged to ask questions. The main drawback of the web page is that some sections do not seem to be updated regularly and the organization is not intuitive. In this regard, it can be cumbersome to navigate through the web page, as some documentation applies to both PocketSphinx and Sphinx4, while other parts apply only to Sphinx4. Ultimately, we chose to use PocketSphinx because it provides a development framework that can be adapted to fit our requirements; for example, PocketSphinx has already been tested on several embedded devices running different operating systems. For this reason, we consider it more feasible to adapt PocketSphinx than to create an ASR from scratch.

This way, we can focus on creating a keyword-based language model and a dictionary that maximize the performance of our system. Similarly, we chose to create an application with which we can measure the performance of the ASR and interact with a computer using speech; the following section describes this application in more detail.

4.2

System Description

The development of this project was done considering that the system could be used by a robot located at a museum, interacting with visitors. The robot shall be able to listen and talk to a person based on keywords identified in the speaker’s utterances. However, the robot shall never remain quiet; this means that if no keyword is identified, it shall still be able to play additional information about the museum. For this project, the robot was assumed to be located at the Swedish Tekniska Museet4.

Therefore, the system needs to be able to identify previously defined keywords from a spoken sentence and it should be speaker independent. Moreover, the system needs to be interactive, in the sense that it shall be able to react based on an identified keyword. In this case, an audio file with information related to the keyword is played in response. Otherwise, if the system does not identify any keyword in the decoded speech, it shall play one of the default audio files.
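
The response-selection behaviour described above can be summarized with the following sketch. It is only an illustration, not the actual implementation: the keyword table, the play_audio_file() helper and the folder layout are hypothetical, and only the overall flow (scan the decoded hypothesis for a keyword, then pick one of its audio files at random or fall back to a default response) reflects the behaviour just described:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical keyword table: keyword name and number of audio files. */
typedef struct { const char *word; int num_files; } keyword_t;

static const keyword_t keywords[] = { { "ROBOT", 5 }, { "ENGINE", 5 } };
static const int num_keywords = (int)(sizeof(keywords) / sizeof(keywords[0]));
static const int num_default_files = 10;

/* Placeholder for the routine that streams an OGG file (see Section 4.4.2). */
static void play_audio_file(const char *path) { printf("Playing %s\n", path); }

/* Select a response for the hypothesis returned by the decoder. The
   hypothesis is assumed to use the same upper-case spelling as the keywords. */
static void respond(const char *hypothesis)
{
    char path[256];
    int i;

    for (i = 0; i < num_keywords; i++) {
        if (strstr(hypothesis, keywords[i].word) != NULL) {
            /* Keyword found: play one of its audio files at random. */
            snprintf(path, sizeof(path), "%s/%d.ogg",
                     keywords[i].word, rand() % keywords[i].num_files + 1);
            play_audio_file(path);
            return;
        }
    }
    /* No keyword identified: fall back to a default audio file. */
    snprintf(path, sizeof(path), "DEFAULT/%d.ogg",
             rand() % num_default_files + 1);
    play_audio_file(path);
}

int main(void)
{
    srand((unsigned) time(NULL));
    respond("TELL ME ABOUT THE ROBOT EXHIBITION");
    respond("WHAT TIME DOES THE MUSEUM CLOSE");
    return 0;
}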

Furthermore, the system needs to be portable, as its main goal is to be used in an embedded application. Therefore, as explained in Section 4.1.2, PocketSphinx was selected from the CMU Sphinx family of decoders, as it provides the technology required to develop this type of system.

4.3

Keyword List

The keyword list comprises the words of relevance that are selected to be identified by the system. This list shall contain the words that a speaker is most likely to pronounce; therefore, they shall be selected according to the area of interest. For this project, the keywords were selected according to the current exhibitions at the museum.

For the definition of the keywords, a keyword file was created. This file contains the list of all the words that shall be identified from the incoming speech signal. Each line of the file shall contain a keyword, followed by the number of audio files available for playing whenever that keyword is identified by the system.

An extra line is also added at the end of this file, stating the number of available audio files that are to be played whenever the system is not able to identify a keyword; the word DEFAULT preceded by a hash symbol is used for this purpose. An example of the format of this file can be seen in Figure 4.1. This example has Keywords 1 to n, each with five available audio files, as well as ten default audio files.

Figure 4.1: Keyword File Example
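
In plain text, a keyword file following this format could look as shown below. The keyword names are hypothetical and the fields are assumed to be separated by whitespace:

ROBOT 5
ENGINE 5
AIRPLANE 5
#DEFAULT 10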

4.4

Audio Files

4.4.1 Creating the audio files

The information that is played whenever a keyword is identified by the system comes from selected sentences converted from text to speech. Thus, to convert the sentences into speech, a Text to Speech (TTS) system is required. There are different TTS systems available, such as eSpeak [4] or FreeTTS [5]. However, for this project we selected Google’s TTS, which is accessed via Google’s Translate [22] service, as it is easy to use and, more importantly, because of its audio quality.

Therefore, in order to ease the creation of the audio files containing the responses to be given by the system, the RespGenerator tool was developed in Java. Prior to using this tool, the desired sentences per keyword, as well as the default ones, shall be written in a text file. The file should first contain a line with the keyword preceded by a hash symbol, followed by the lines with its associated sentences. An example of the format of this file, containing Keywords 1 to n and their sentences, can be seen in Figure 4.2.

Figure 4.2: Sentences for Audio Files Example
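
In plain text, such a sentences file could look as shown below. The keyword and the sentences are hypothetical, and the use of #DEFAULT for the default responses is assumed here, following the convention of the keyword file:

#ROBOT
The robot exhibition is located on the second floor.
Industrial robots have been used in manufacturing since the 1960s.
#DEFAULT
The museum is open every day from ten in the morning to five in the afternoon.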

Once the file containing the sentences is ready, the tool reads them from the file, connects to Google’s TTS and retrieves the generated audio files using the GNU wget5 free software package. The generated audio files are placed under a folder with the name of the keyword they belong to, and they are numerically named from 1 to n, where n is the number of available audio files for that keyword. Figure 4.3 illustrates the process of converting the sentences into audio files.

Figure 4.3: Conversion of Sentences into Audio Files
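
Internally, each response is fetched with a single wget call. The command below is only an illustration: the exact Google Translate TTS URL, its parameters and the returned audio format are assumptions about how the service was commonly accessed at the time, not a documented interface:

wget -q -U "Mozilla/5.0" -O ROBOT/1.mp3 "http://translate.google.com/translate_tts?tl=en&q=The+robot+exhibition+is+located+on+the+second+floor"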

The RespGenerator tool has only one mode of operation, in which the user must specify the path and name of the file containing the sentences to be used as answers, as well as the desired path for the output audio files.

Regarding the wget tool, it needs to be placed in the same folder as the RespGenerator tool in order to generate the audio files correctly. On Linux operating systems, wget is installed by default; on Windows operating systems, however, the user needs to download the executable and place it in the appropriate folder.

The following command provides an example of how to execute the RespGenerator tool:

java -jar RespGenerator.jar ..\SentencesFile.txt ..\OutputPath\

As can be seen, the first parameter corresponds to the path and name of the file containing the sentences, and the second parameter is the path for the output audio files. In case the user specifies fewer than two parameters, the tool will display a message indicating that there was an error. For example:

Error: less than 2 arguments were specified

On the other hand, when the two inputs are correctly specified, the tool will generate the output audio files and display a message indicating that it was executed correctly.

Additionally, the RespGenerator tool creates a text file containing a list of keywords and the corresponding number of audio files created per keyword. This file is used by the ASR application in order to know how many possible responses are available per keyword as well as how many default responses can be used.

4.4.2 Playing the audio files

Initially, the ASR system was designed to play MP3 files, but it was later changed to play OGG files, as OGG is a completely open and free format6. In order to play the audio files, it is necessary to use an external library; for this purpose, the BASS audio library7 was used.

The BASS audio library allows the streaming of different audio file types, including OGG. Moreover, the library is free for non-commercial use and it is available for different programming languages, including C. Furthermore, one of the main advantages of this library is that everything is contained within a small dll file of only about 100 KB.
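
As a minimal sketch of how an OGG response can be played with BASS, the code below initializes the default output device, creates a stream from a file and plays it until it finishes. The file name is a placeholder and error handling is mostly omitted; the calls shown belong to the standard BASS C API, but their exact usage should be checked against the library documentation:

#include <stdio.h>
#include "bass.h"

int main(void)
{
    HSTREAM stream;

    /* Initialize the default output device at 44.1 kHz. */
    if (!BASS_Init(-1, 44100, 0, 0, NULL)) {
        printf("Could not initialize the BASS library\n");
        return 1;
    }

    /* Create a stream from an OGG file (placeholder path) and play it. */
    stream = BASS_StreamCreateFile(FALSE, "ROBOT/1.ogg", 0, 0, 0);
    BASS_ChannelPlay(stream, FALSE);

    /* Wait until playback finishes; a real application would sleep or
       poll less aggressively instead of busy-waiting. */
    while (BASS_ChannelIsActive(stream) == BASS_ACTIVE_PLAYING)
        ;

    BASS_StreamFree(stream);
    BASS_Free();
    return 0;
}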

4.5

Custom Language Model and Dictionary

Although PocketSphinx can be used in applications supporting vocabularies composed of several thousand words, the best performance in terms of accuracy and execution time is obtained using small vocabularies [12]. After downloading and compiling the source code of PocketSphinx, we proceeded to run a test program using the default language model, acoustic model and dictionary. Nonetheless, the word accuracy of the application tended to be very low; in other words, most of the speech sentences used as inputs were not recognized correctly.

After researching the PocketSphinx documentation, we found that it is recommended to use a reduced language model and a custom dictionary with support for a small vocabulary. The smaller the vocabulary, the faster PocketSphinx will decode the input sentences, as the search space of the algorithms used by PocketSphinx gets smaller. Similarly, the accuracy of the ASR becomes higher when the vocabulary is small. For example, there is an example application included with PocketSphinx that uses a vocabulary containing only the numbers from 0 to 9; in this case, the overall accuracy is in the range of 90% to 98%.

On the other hand, it is recommended to use one of the default acoustic models included with PocketSphinx. The main reason is that the default acoustic models have been created using huge amounts of acoustic data containing speech from many different speakers. In other words, the default acoustic models have been carefully tuned to be speaker independent.

6 OGG: http://www.vorbis.com/

If for some reason the user creates a new acoustic model, or adapts an existing one, the acoustic training data needs to be carefully selected in order to avoid speaker dependence.

The CMUSphinx toolkit provides methods to adapt an existing acoustic model or even create a new one from scratch. However, in order to create a new acoustic model, it is required to have large quantities of speech data from different speakers and the corresponding transcript for each training sentence. For dictation applications it is recommended to have at least 50 hours of recordings from 200 speakers8.

4.5.1 Keyword-Based Language Model Generation

The CMUSphinx toolkit provides a set of applications aimed at helping the developer during the creation of new language models; this group of applications is called the Statistical Language Model Toolkit (CMUCLMTK). The purpose of this toolkit is to take a text corpus as input and generate a corresponding language model. The text corpus is a group of delimited sentences that is used to determine the structure of the language and how each word is related to the others. For more information related to language models, please refer to Section 3.2.2. Figure 4.4 illustrates the basic usage of the CMUCLMTK.

Figure 4.4: Statistical Language Model Toolkit
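
In practice, this corresponds to a chain of CMUCLMTK command line tools similar to the one below. The file names are placeholders and the exact options may vary between versions of the toolkit, so the commands should be read as a sketch of the typical workflow rather than the exact invocation used in this project:

text2wfreq < corpus.txt > corpus.wfreq
wfreq2vocab < corpus.wfreq > corpus.vocab
text2idngram -vocab corpus.vocab -idngram corpus.idngram < corpus.txt
idngram2lm -vocab_type 0 -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.arpa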

Alternatively, there is an online version of the CMUCLMTK toolkit, called lmtool. In order to execute this tool, the user uploads a text corpus to a server using a web browser and the tool generates a language model. However, this tool is intended to be used with small text corpora containing at most a couple of hundred sentences. This limitation became a problem for us, as most of the time our web browser timed out before the tool was able to generate a language model. For this reason, we decided to use the CMUCLMTK toolkit instead.

Besides generating a language model from a text corpus, the CMUCLMTK toolkit generates several other useful files, such as the .vocab file, which lists all the words found in the vocabulary, and the .wfreq file, which lists all the words in the vocabulary together with the number of times they appear in the text corpus. Finally, the toolkit generates the language model in two possible formats: the ARPA format (.arpa), which is a text file, or a binary format.
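
For reference, an ARPA language model file is organized as a header with the n-gram counts followed by one section per n-gram order, where each line holds a log probability, the word sequence and, optionally, a back-off weight. The schematic fragment below uses made-up words and probabilities and is only meant to illustrate the structure:

\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.4771 <s> -0.3010
-0.4771 ROBOT -0.3010
-0.4771 </s> 0.0000

\2-grams:
-0.3010 <s> ROBOT
-0.3010 ROBOT </s>

\end\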
