
DEGREE PROJECT IN COMPUTER SCIENCE, SECOND LEVEL
STOCKHOLM, SWEDEN 2015

Search system for an audio archive

ANDRZEJ DUDZIEC

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION (CSC)


Search system for an audio archive

Söksystem för ett ljudarkiv

ANDRZEJ DUDZIEC

Master’s Thesis at CSC
Supervisor: Jussi Karlgren
Examiner: Jens Lagergren


Abstract

Speed and accuracy of information retrieval are of utmost importance in the contemporary world. Multimedia data is usually indexed not according to its content but according to a handful of keywords that only approximate the content.

The purpose of the study is to explore the possibility of using speech recognition algorithms to improve the quality of retrieval of audio files containing human speech, or of live media content analysis. The project focuses on the ability of phonetic algorithms to compensate for imperfections of speech recognition.

The project has examined several speech recognition tools, language models and phonetic matching algorithms.

The results can be used for further research and for developing or improving commercial products.


Referat

Snabbhet och noggrannhet i informationssökning är avgörande i dagens värld. Multimedial data indexeras oftast inte på innehåll utan med beskrivande sökord.

Syftet med studien är att undersöka möjligheten att använda taligenkänningsalgoritmer för att förbättra kvaliteten på sökning av data med mänskligt tal eller innehållsanalys av media. Projektet fokuserar på fonetiska algoritmer med förmåga att kompensera för brister i taligenkänning.

Projektet har undersökt flera taligenkänningsverktyg, språkmodeller och fonetiska matchningsalgoritmer. Resultaten kan användas för vidare forskning, skapa eller förbättra kommersiella produkter.


Contents

1 Introduction

2 Automatic speech recognition
2.1 Acoustic model
2.2 Pronunciation model
2.3 Language model
2.4 Speech recognition issues

3 Apache Solr
3.1 Platform
3.2 Phonetic filters

4 Implementation
4.1 ASR engine choice
4.2 The system
4.3 Additional tools
4.4 Application

5 Evaluation
5.1 Test data
5.2 Experiments
5.3 Use Cases evaluation

6 Results
6.1 Models
6.2 Filters
6.3 Quality tests
6.4 Quantity tests

7 Conclusions

8 Future work

References

A User guide


Chapter 1

Introduction

Most modern search engines handle text documents (web pages, PDFs etc.) or file formats from which text is easily extractable (like the PowerPoint format). Nowadays it is rather easy to provide a high-accuracy search system for this kind of data. A much more difficult task is to perform proper indexing of multimedia files such as videos, audio files or streams. In those cases it is hard to obtain information about the content (what does the video present? what are people talking about?), so keywords, titles or tags are used to determine the substance.

The goal of the project proposed by Findwise AB was to build a system that is able to index audio files containing human speech and provide text search functionality. This means that the system is supposed to use speech-to-text libraries and index text data corresponding to the content of the audio files. Although speech recognition is not an issue for human beings, machines have to struggle with the challenge. Modern algorithms are very sensitive to environmental noise and require high-quality input. Because speech-to-text conversion is not 100% accurate, it is impossible to obtain satisfactory search results without additional processing.

The second part of the project is to test the potential of using phonetic algorithms to compensate for some of these errors and improve search quality. These filters should match similarly spoken (but not necessarily similarly spelled) words in case they were misunderstood by the speech-to-text engine. Some errors at this level are acceptable, since the purpose of filtering is to support the mechanisms implemented in the search engine (stemming, synonym detection etc.) rather than replace them. Moreover, it was strongly recommended to use open-source tools and libraries in the project development phase and to test them with publicly available data.

Use Cases

The number of applications that can be based on speech recognition and phonetic filtering is limited only by human imagination and creativity. The two dominant groups of ideas related to searching are as follows:

• looking up broadcasts - listening to live information such as podcasts, radio, television etc. and extracting information meaningful for the project, e.g.:

– how am I presented in the media (celebrity or politician)? or what is the image of my products in the media (entrepreneur)?

– which medium provides more information regarding the USA/NATO or Russia in the context of the Ukrainian conflict? What is their general approach to the issue?

– collecting information about traffic jams and road accidents to find an optimal route that avoids the affected areas.

• a typical archive that stores huge amounts of recordings and requires a fast and accurate search tool, e.g.:

– a radio or television reporter would find a search system useful to prepare material that, for example, compares a currently recorded statement of a politician with one made two years ago to demonstrate potential changes that happened during this time.

There are also many other applications, not related to searching, that exploit the advantages of speech recognition and phonetic filtering:

• dictation

– short notes can be taken by dictating the words to the computer, e.g. "remember to buy milk, bread and eggs"

– long stories, or more likely longer ideas for a plot, that can be used by professional writers.

• voice control

– many everyday devices can be combined with simple voice control: intelligent houses, washing machines, coffee makers, e.g. "prepare me a coffee for tomorrow at 7 AM".

– speech conversation robots designed to cooperate with the user during a conversation and assist them. Real artificial intelligence devices will soon catch up with those shown in movies.

• voice mail messages can be automatically transcribed into text and sent as a short message to the user’s device.


Chapter 2

Automatic speech recognition

Automatic speech recognition (ASR) is a pattern recognition problem that analyzes human speech and tries to transcribe it into written words or commands [1, 2]. ASR systems can be used in a variety of applications like in-car systems, home automation, robotics, video games or dictation.

Small speech decoding systems can be trained to understand entire words and combine them with a simple grammar. A good example of this kind of system is an early car phone, which should “understand” only numbers and some specific keywords (for example “ ‘call’ - ‘nine’ - ‘eleven’ ”). More general-purpose ASR systems detect phonemes and use them as input to build words.

A phoneme is the smallest linguistic unit that can be combined with other phonemes to create a sound a human would call a word. Words are made of letters, whereas the sound we hear when pronouncing a word is made of phonemes. For example, the sequence of letters “o”, “n”, “e” results in the written word “one”, while the phonemes “W”, “AH”, “N” make the sound “one” (phonemes are written using Arpabet, a phonetic transcription code developed by ARPA). A sample sonogram of the word ‘zero’ with marked boundaries for its four phonemes is presented in Figure 2.1.

2.1 Acoustic model

Each phoneme has its representation (or multiple representations) in the frequency domain, as illustrated in Figure 2.2. Some phonemes put more stress on low frequencies, some prefer high frequencies. These characteristic sets of frequencies are called “formants” in acoustics.

During the recognition process the speech signal is analyzed and the most probable distribution of formants is retrieved. This task is usually solved by the use of hidden Markov models. These models (called acoustic models) are crucial to the speech recognition process, so they have to be trained with a huge amount of data. Every ASR system requires a version of this model.

Figure 2.1. Phonemes ‘Z’, ‘IH’, ‘R’, ‘OW’ combined together form the word ‘zero’. [3]

Figure 2.2. Sample phonemes and their representation in the frequency domain. [4]

2.2 Pronunciation model

A pronunciation model, also known as a dictionary, is a set of words with their composition of phonemes. When the most probable sequence of phonemes has been determined by the recognition engine, the system has to find a word corresponding to the sound. General-purpose ASR systems may cover over 100 000 words, whereas the set of most commonly used words (for English) is only a couple of thousand. Table 2.1 shows a fragment of a sample dictionary defining the pronunciation of digits.

The construction of a good dictionary is a non-trivial task, especially in the case of English. A well-known example that illustrates irregularities in spelling is the word “ghoti”, which can be pronounced as “fish” when applying some of the pronunciation rules:


• pronounce “gh” as “f” in “rouGH”

• pronounce “o” as “i” in “wOmen”

• pronounce “ti” as “sh” in “naTIon”

What is more, in English some words that are spelled similarly, such as “through”, “though” and “rough”, do not sound similar, whereas other words like “pony” and “bologna” may.

digit   phonemes
zero    Z IH R OW
one     W AH N
two     T UW
three   TH R IY
four    F AO R
five    F AY V
six     S IH K S
seven   S EH V AH N
eight   EY T
nine    N AY N

Table 2.1. Sample pronunciation model for digits.

2.3 Language model

For many reasons, which will be pointed out later in the study, recognition of isolated words may have a high error rate. Even though a language model (or grammar) is not required to perform speech recognition, it can significantly improve the quality of the decoder. For small-vocabulary systems, a grammar should be enough.

In the “car phone” example, a simple grammar would be defined as a “start word” (e.g. “call”) followed by a “sequence of numbers”. This approach works fine for a limited number of possible words, but in bigger applications it may be impossible to define all possible sequences of words together with the probability of their occurrence.

For this reason an N-gram model should be used. This kind of model tells the system how probable it is to see a sequence of N words in the language: for example the probability of a single word, of a pair of words being next to each other, of a triplet of words, and so on. Obviously, the higher N is, the more accurate (and slower) the system becomes. Usually combining models up to level three gives acceptable results. This kind of model can easily be trained using commonly available data (like the Wikipedia corpus).
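A toy sketch of the idea is given below: it estimates unsmoothed bigram probabilities by counting word pairs in a tiny hand-made word list. The corpus and function names are purely illustrative; real models are trained on far larger data and use smoothing.

```python
from collections import Counter

# Illustrative "corpus" only; a real language model is trained on millions of words.
corpus = "call nine one one call nine eleven".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_probability(w1, w2):
    """P(w2 | w1) estimated by maximum likelihood, without smoothing."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(bigram_probability("call", "nine"))  # 1.0 - "call" is always followed by "nine" here
print(bigram_probability("nine", "one"))   # 0.5 - "nine" is followed by "one" half the time
```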


2.4 Speech recognition issues

Automatic speech recognition is the pattern recognition problem of transcribing spoken words into text. Audio input is matched against training samples to find the most similar one. Unfortunately, human speech tends to vary according to the person’s mood, emotional state, origin, accent, sex, dialect and many other factors [1]. The quality of the recording - background noise, type of microphone etc. - may also affect the error rate of the ASR results. Moreover, some words may have several acceptable pronunciations (like read - phonemes “R IY D” or “R EH D” - present and past tense). Another important problem is the lack of sharp boundaries between words: especially when somebody is speaking fast, it is hard to determine whether they said “ice cream” or “I scream”. The human brain can conduct context analysis to decide which words were more plausible. Automatic speech recognition systems of course try to perform such an operation as well, but on a limited scale.

Figure 2.3 illustrates how the speaker’s mood can influence the representation of a word in the frequency domain. The sad version of ‘hello’ (left) is concentrated in the low-frequency part, whereas the ‘happy hello’ (right) spreads its energy over a much wider range. Despite this, in both cases clear patterns are visible and recognition is possible.

Figure 2.3. Example of how mood influences the sonogram. The word ‘hello’ spoken sadly (left) and with joy (right). [5]


Chapter 3

Apache Solr

3.1 Platform

Solr is a powerful open-source platform for indexing and searching text documents. It makes it easy to use advanced features such as hit word highlighting or faceted search (arranging search results into groups by important fields such as author or date). Solr analyzes each phrase at index and query time. This process makes it possible to apply additional filters to the data: stemming, stopword elimination, synonym filtering and many others, one of them being the phonetic filter factory.

3.2 Phonetic filters

Apache Solr defines six phonetic filters. Each of them, given a sequence of characters (a word), converts it into a code according to its pronunciation. These codes can be used to determine whether two words sound alike. Most phonetic matching algorithms were designed for English in the twentieth century. The filters supported by Solr were built on the basis of the three most common solutions: Soundex, Metaphone and Caverphone:

• Soundex [6] was developed in the early twentieth century for indexing names. It creates a code that preserves the first letter of the word and follows it with three digits. It is commonly used in genealogical research.

• Refined Soundex is the original Soundex algorithm with several improvements. Letters are divided into more groups and the length of the code is not fixed, but varies from code to code.

• Metaphone [7] is an improvement of Soundex, published in 1990. It covers many inconsistencies of the language and works well for words in general, not only names.


• Double Metaphone improves the original Metaphone algorithm by taking into account many irregularities of the English language. In addition, it returns the two most probable codes instead of one; however, by default the Solr filter uses only the first one.

• Double Metaphone 2 is a different filter factory implemented in Solr that considers both Double Metaphone codes.

• Caverphone [8] was designed in 2002 at the University of Otago for name matching purposes, especially in 19th and 20th century electoral rolls. The algorithm has been optimized for New Zealand accents.

Solr supports one more type of phonetic filter, called Kölner Phonetik (Cologne Phonetic), but it is not used in this project, as the algorithm was designed for the German language and is of no use for English.

Phonetic filtering helps to match documents whose content was misunderstood by the speech recognition algorithms.

Example:

A quote “The 2014 Pacific hurricane season was the most active one on record since 1982” could be transcribed as “The 2014 Pacific hurricane Susan was the most active one on record since 1982”, since the words ‘season’ and ‘Susan’ sound very similar. In such a case, a user searching for “hurricane season” would not find this relevant document. Applying phonetic filters in both the indexing and the querying procedure converts the words “season” and “Susan” to the “S250” (Soundex) or “SSN” (Metaphone) codes.
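As a small illustration of this encoding step, the sketch below computes the codes with the third-party jellyfish library; using jellyfish is an assumption made for this example only, while the project itself relies on Solr's built-in phonetic filter factories.

```python
import jellyfish  # third-party phonetic library, used here purely for illustration

for word in ("season", "susan"):
    # Both words should collapse to the same codes - S250 (Soundex) and SSN (Metaphone) -
    # which is exactly why a query for "season" can still hit a transcription containing "Susan".
    print(word, jellyfish.soundex(word), jellyfish.metaphone(word))
```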

Phonetic algorithms match similar words (like “season” and “Susan” or “sixty” and “sixteen”). Yet many of the mistakes made are caused by the similarity of phrases, not of isolated words. Utilizing the properties of the filters mentioned earlier, we can concatenate adjacent words and apply the same algorithms, since they analyze the sequence of characters and do not perform a dictionary lookup.

Example:

The phrase “ice cream” could easily be confused with “I scream” and thus fail to match. This can be avoided by converting both phrases to the same codes, “ISKR” (Metaphone) or “ASKRM1” (Caverphone).

In some cases the codes produced by phonetic filters do not match perfectly but differ by one character only. The situation is especially common with Soundex, which leaves the first letter of the original word unchanged. This fact may be used to improve search results by applying fuzzy search with an edit distance of 1. The “fuzzy” option enables Solr to search for similar terms. Originally used to correct typos, fuzzy search can also be used to find very similar codes. Unfortunately, this function is very time-consuming, as a much larger number of indexed tokens has to be checked. Fuzzy search should improve recall (documents that were transcribed incorrectly would normally not be retrieved) but decrease precision (irrelevant documents are retrieved because of incidental similarity between words).

Example:

Soundex generates the code “L200” from the word “loose” and “U200” from “use”. The codes differ only in the first letter, which suggests that the two words sound similar.
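A minimal sketch of how such a fuzzy phonetic match could be requested from Solr over HTTP is shown below. The field name, core name and port are hypothetical; only the `~1` fuzzy operator and the standard select handler are Solr features, and the requests package is assumed to be available.

```python
import requests

# Hypothetical field and core names. The '~1' suffix asks Solr/Lucene for terms within
# edit distance 1 of L200, so the code U200 stored for "use" would also match.
params = {
    "q": "phonetic_soundex:L200~1",
    "fl": "id,score",
    "wt": "json",
}
response = requests.get("http://localhost:8983/solr/audio/select", params=params)
print(response.json()["response"]["numFound"])
```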


Chapter 4

Implementation

4.1 ASR engine choice

There are several speech recognition solutions available on the market, each of them having its drawbacks and benefits. This section presents the pros and cons of the selected ones.

Google web speech

It is a well-known fact that the more training data is available, the better the results the system obtains and the fewer errors are present. This is why the first step was to check the solution proposed by the company with the largest amount of training data available. Google web speech was introduced in Chrome version 25. Unfortunately, the speech API has not been released officially, though some reverse engineering solutions have been proposed by the community. They are based on sending short audio snippets in FLAC format to Google and receiving back a JavaScript object with the transcribed text and several parameters, such as confidence (no additional models were needed).

Unfortunately, these methods do not seem to work with the new version of the API, which changes very frequently. Add to that the fact that Google web speech requires sharing private recordings with a third party, and this approach cannot be favoured.

Sphinx 4

Sphinx is a speech recognizer developed at Carnegie Mellon University. The fourth version of the system has been written entirely in the Java programming language. It is user-friendly and offers great possibilities for adjustment, manageability and flexibility. First tests with small and simple models gave promising and accurate results, but the computation time was extremely long (it took several minutes to decode a 10-second audio snippet). One may suspect the CMU team is aware of the issue, since the project wiki contains an article titled “How to tune the decoder to be fast (or rather, not horribly slow)”. Computation time is not the most important aspect of an audio archive project, but a somewhat faster solution would be welcome - further research is in progress.

Julius

Julius is speech recognition software that has been developed in the C language in Japan by several projects and consortia since 1997. The project has focused on the Japanese language, thus it is available only with Japanese acoustic and linguistic models. Voxforge works on developing Julius models for English. Unfortunately, the Julius-compatible format, HTK, has distribution restrictions. Because we were obliged to use fully open libraries, and because of some issues with dictionary configuration, we kept looking for an optimal ASR engine.

Pocketsphinx

According to CMU, Pocketsphinx is a lightweight version of their speech recognition engine, written in C, that gives results as accurate as Sphinx 4 but runs faster at the expense of flexibility. First tests gave reasonable results in an acceptable time, so this library was taken into further consideration. Moreover, the CMU developers provide many models and dictionaries compatible with their engine. The basic Pocketsphinx “small” model contains ~5 000 words, but much more complex models for various languages are available for download from their website. Two models were chosen for the results comparison, since it was not obvious whether a larger number of words would increase or decrease decoding accuracy (on the one hand the ASR recognizes more terms, but on the other hand there are more similarly sounding words that would generate errors).

4.2 The system

Input file preprocessing

All input files have to be properly prepared before they can be analyzed by the system. This is due to the constraints of the ASR engine. The most important restrictions are the file format, the sampling frequency and the duration. Therefore each file has to be converted to .wav format beforehand (Pocketsphinx cannot handle the very popular audiobook format, mp3) and the recording length needs to be limited.

A very important stage of the entire preprocessing operation is the division into sub-files, since the original audio file may be too long for the ASR to handle as a whole. The original input file cannot be cut at any random place; the cut points have to be chosen carefully, because partitioning may affect the data. If a word is ripped into pieces, it is no longer recognizable. This problem can be avoided in several ways. The first approach is to prepare snippets that overlap. This way no content is lost, but “partial” words are still present, which may falsify the output. A second approach, which is a bit more complicated, is to detect silence and cut the file at a moment when most probably nobody is talking. This can be done using the audio editor “Sound eXchange” (SoX), which implements a silence detection feature that can be utilized to determine the optimal division. As an alternative, the speech recognition engine Julius provides a toolkit with a program called “adintool”, which can perform a similar silence-detection splitting operation. The results of using both tools are alike.
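As an illustration of the silence-based splitting, the sketch below drives SoX from Python using one common form of its silence effect. The thresholds (0.3 s of audio below 1% volume) and the file names are illustrative assumptions, not the values used in the project.

```python
import subprocess

# Split chapter01.wav into chapter01_part001.wav, chapter01_part002.wav, ... at silences.
# The first 'silence 1 0.3 1%' triple trims leading silence, the second one stops output
# at the next stretch of silence; 'newfile' and 'restart' begin a new output file and
# repeat the process until the input is exhausted.
subprocess.run(
    ["sox", "chapter01.wav", "chapter01_part.wav",
     "silence", "1", "0.3", "1%", "1", "0.3", "1%",
     ":", "newfile", ":", "restart"],
    check=True,
)
```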

After the preprocessing stage is finished, the original file has been split into a number of sub-files, each several seconds long, which are ready to be processed by the ASR system.

Metadata from the main file is used to fill in the XML fields for title, author, album and date. These fields are used during the search process and for faceting.

Main processing

When the main input file has passed the preprocessing stage, all the sub-files are decoded by the Pocketsphinx ASR. The transcribed output and the file’s metadata are sent to the generator, where they fill in a template of an XML file defined in the Solr schema. In addition, some extra information, such as the file name and processing time, is stored.

At this stage, fields are also generated for the phrase similarity search. Many speech recognition errors are made due to the misunderstanding of phrases rather than the misunderstanding of isolated words. Storing the data as a series of concatenated neighboring words may help to compensate for recognition inaccuracy. This means that the phrase "this is a sample" will generate:

• one-grams: ‘this’, ‘is’, ‘a’, ‘sample’

• two-grams: ‘thisis’, ‘isa’, ‘asample’

• three-grams: ‘thisisa’, ‘isasample’

• four-grams: ‘thisisasample’

The higher the N, the lower the boost the field should be given in the Solr search, as it is less probable that an error spans four words than two.
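A minimal sketch of this concatenation step is shown below; the exact field generation in the project's XML generator may differ.

```python
def concatenated_ngrams(text, max_n=4):
    """Concatenate neighboring words into 1- to max_n-grams for the phonetic phrase fields."""
    words = text.split()
    return {
        n: ["".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        for n in range(1, max_n + 1)
    }

print(concatenated_ngrams("this is a sample"))
# {1: ['this', 'is', 'a', 'sample'], 2: ['thisis', 'isa', 'asample'],
#  3: ['thisisa', 'isasample'], 4: ['thisisasample']}
```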

XML files prepared this way are moved to the Solr directory, ready to be posted to the search engine, whereas the audio sub-files are moved to the “media” directory so that the user can listen to them and verify the correctness of the search while using the demo.


Figure 4.1. Processing schema.

4.3 Additional tools

Several additional tools had to be implemented for the project. They are not needed in the main processing stream, but they were of great use during the implementation and evaluation phases. Most of them are Python scripts that make use of some additional libraries.

Edit distance

Given the transcribed and the original texts, there is a need to measure the accuracy of the output. Edit distance is a way of calculating the (dis)similarity between two strings. It counts the minimal number of operations needed to transform the first string into the second. The most popular version of this approach, the Levenshtein distance, allows three operations: insertion, deletion and substitution of a character. Originally the algorithm was designed to compute the similarity between two words, but it can easily be changed to process sequences of words.
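A compact sketch of the word-wise variant is given below; the project's own implementation may differ in detail.

```python
def levenshtein(source, target):
    """Word-wise Levenshtein distance: minimal insertions, deletions and substitutions."""
    previous = list(range(len(target) + 1))
    for i, src in enumerate(source, start=1):
        current = [i]
        for j, tgt in enumerate(target, start=1):
            cost = 0 if src == tgt else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("the hurricane season was active".split(),
                  "the hurricane susan was active".split()))  # 1
```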

Text difference

Measuring the distance between the transcribed and original texts is not enough to determine how the decoder performs. The crucial knowledge is which particular words and phrases were misunderstood by the speech-to-text engine. To address this issue, a difference matching script is needed. Just as in the edit distance problem, most of the natural language toolkits available in Python, which is the language most used in the project, were designed to work on single characters rather than whole words. Character-wise matching cannot provide information about errors made in the speech analysis phase. Fortunately, an on-line tool [9] does the string comparison just as it is needed for the project. The library used by the app returns tags indicating which words were added or removed from the original input to obtain the transcribed output.

Combining this information can reveal misunderstood phrases, as shown in Figure 4.2.

Figure 4.2. Sample output from the text difference tool. Colors indicate errors: green is the original data and red is the decoded output. Some errors are reported because of inaccurate input (no spaces between words in the original e-book).

Demo

Findwise AB shared with me a skeleton of a web application that they have developed for demonstration purposes. It combines the Solr search engine and the Tomcat web server with a simple graphical user interface. With this demo the user can query a phrase and receive a number of snippets (see Figure 4.3). Each result contains a part of the original audio file to listen to and a piece of transcribed text. The user can follow the hyperlink to view the full text or to listen to the full recording (see Figure 4.4). Results can be browsed in a faceted form - they are organized in groups by author, album and date. Album usually refers to several chapters of the same audiobook. When a result is hit by an original query word, without the need for the phonetic phrase search, the hit word is highlighted.


Figure 4.3. Demo: results list, faceting and highlighted perfect hit term.


Figure 4.4. Demo: snippet view (top) and full text view (bottom) with playable audio elements.

Moreover, the demo implements additional features such as query suggestions (see Figure 4.5) and number-to-string translation (the term ‘16’ would normally not match ‘60’ at all, yet ‘sixteen’ and ‘sixty’ sound alike and such near-matches may be meaningful for retrieving relevant documents, so the terms are treated as ‘sixteen’ and ‘sixty’). The demo does not remove stopwords from the search corpus, since those terms may be part of a misunderstood phrase. This feature, however, should be added to a finally released product.
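A toy sketch of the digit-to-word normalisation idea follows; a released product would use a complete number-to-words routine instead of this tiny lookup table, which is purely illustrative.

```python
# Illustrative only: map a few digit strings to the spelled-out forms
# that the phonetic filters can actually work with.
DIGIT_WORDS = {"7": "seven", "16": "sixteen", "60": "sixty"}

def normalise_numbers(query):
    """Replace known digit tokens with their spelled-out forms before phonetic filtering."""
    return " ".join(DIGIT_WORDS.get(token, token) for token in query.split())

print(normalise_numbers("16 candles"))  # "sixteen candles"
```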

4.4 Application

Figure 4.5. Demo: query suggestions based on the indexed data.

The result of this project is an application, written mostly in the Python programming language. It is able to transcribe an audio file, post the text output to the Solr search engine and perform queries on the data corpus with the usage of phonetic algorithms. The most important classes are:

• Main initializes all the required tools, including the decoder, processes the audio files and saves the results in the appropriate directories (audio snippets in tomcat/webapps/media and generated XML files in solr/exampledocs). XML documents are not posted to the Solr system automatically, for two reasons: it is unknown whether the user wishes to post the file at that moment, and it is unknown whether Solr is running.

• Decoder initializes the speech recognition library and is responsible for transcription.

• XML prepares the data and saves the result in an XML format suitable for the schema defined in the Solr platform.

Some additional Python classes were prepared for testing purposes. They are not required for the system to work properly, but they are very useful for evaluation:

• Difference is used to match and differentiate two sequences of terms. Results are colored red and green to make the differences between the two versions easier to analyze, as seen in Figure 4.2.

• Levenstein computes the edit distance between two text files in two possible ways, word-wise and character-wise, as described in section 6.1.


Apart from the Python files described above, the project directory contains a folder with scripts required for preprocessing input files and using external tools. The Solr directory holds a ready-to-use search platform implementing the schema with the phonetic algorithms used at indexing and querying time. The Tomcat directory contains the audio snippets and the web apps used by the web server to display the demo.


Chapter 5

Evaluation

5.1 Test data

Evaluation of a project like this one requires audio files containing human speech and text files with the true output for comparison. The simplest way to obtain this kind of data is to combine audiobooks with the corresponding e-books. Being aware of the copyright infringement risk, we decided to use publicly available data. “LibriVox” and “Project Gutenberg” meet all the requirements.

LibriVox

“LibriVox” is a digital library of audiobooks that are published in the public domain. The project started in 2005 and contains over 17 000 recordings in English (as of 2014); other languages are also available. Moreover, all the recordings are read by volunteers with different voices, reading speeds etc., which means the system could not be tuned to any particular person or voice. Five audiobooks were used in the project. They contained a total of 48 different files (chapters) and resulted in 3 966 audio snippets (indexed documents) after the preprocessing.

Project Gutenberg

“Project Gutenberg” is also a volunteer digital library, founded in 1971. It contains a huge number of e-books (over 47 000 in 2014) and on average fifty more are added every week. All the material available from Gutenberg is out of copyright and can be downloaded in multiple formats, including plain text.

Drawbacks

Unfortunately, the “LibriVox” and “Gutenberg” files are not an exact match. Each audio file starts and ends with short information about the book, author, reader and the LibriVox project, surrounding the actual content. The “Gutenberg” files contain additional terms and conditions text that should be removed. For the purpose of this project’s evaluation, the text files were therefore edited, while the audio files remained unchanged. Necessary as these edits are, they are difficult to make, due to the differing length and content of the introductions and endings, and may result in some unavoidable errors.

5.2 Experiments

Models comparison

Choosing the models wisely is crucial for every speech recognition system. This is due to the large number of words in English, which is estimated at over a million [10]. In everyday life only a small part of them is used. It is impossible to decode all of the possible words, but a proper approximation for a given case may be useful.

The question is: what size of language model should one use to efficiently cover the transcription problem? “As big as possible” seems to be a naive answer. On the other hand, the more uncommon words are added to the model, the higher the possibility of them being misunderstood by the ASR engine. Rare words are the most informative but are hardly ever seen. I decided to compare two models provided by CMU - one containing ~5 000 words and the second ~130 000. Both were tested for accuracy and computation time.

Quantity and quality tests

The system has to be tested in several ways. First of all, it should be checked whether phonetic filters are able to compensate for errors made during the speech-to-text conversion. Secondly, it has to be discovered how each type of filter influences search results, in both the true positive and the false positive aspects. Similar work needs to be done for each level of N-grams (concatenated words for the phonetic phrase search). Finally, as for every information retrieval system, the capability of meeting particular information needs has to be tested.

Phonetic filter and N-gram testing, called quantity testing because it investigates the problem via statistical and numerical information, uses two data corpora. The first of them is based on the audiobooks transcribed by the ASR system and the second is made of the original e-books. The system will receive multiple queries and the results from both corpora will be compared. It will also be noted how each phonetic filter and N-gram influences the retrieved top ten. In order to avoid selection bias, all queries will be isolated words from the “small” CMU model (~5 000 words).

It is much more important to test whether phonetic filters are capable of compensating for the errors made by the ASR. For this purpose, several phrases that were misunderstood were selected. These can be identified with the text difference tool described in the implementation section. The system will be queried with the original words. A deep analysis of the results will be performed in order to determine why the documents were retrieved. This consists of rating whether the system was able to retrieve the document despite the wrong phrases, and which filters and codes made it possible to match the proper document.

The last stage of testing will be to prepare several information needs according to the data, that is, information that is known to be present in the audiobooks. The next step is to formulate several queries for those needs and classify the results as relevant or not. This part is fully subjective with respect to the information need, the query choice and the document relevancy classification.

5.3 Use Cases evaluation

Each “family” of the use cases described in the introduction imposes different requirements on the system. In the first example - broadcast analysis - the system has to be fast enough to keep up with the information stream, i.e. speech recognition cannot cause a delay (because the stream never ends) and must leave some spare time for text analysis. Also, in this case there is no need for high accuracy. All that is important is to signal the presence of an interesting phrase and, for example, the sentiment of the surrounding terms. Average speech recognition quality is acceptable, since the phonetic filters should be able to point out matching words.

The second problem, an archive of multimedia files, has no constraints regarding computation speed - search accuracy is more important in this case. Files are added to the archive infrequently and no continuous information flow is present. This means that more time can be spent on analysis and on improving the system’s outcome. Long computation time is not as harmful as irrelevant search results.


Chapter 6

Results

6.1 Models

Two packages of models were compared. One of them is the default model distributed with the Pocketsphinx package and the second is a generic US English model. The main differences between them are the number of words defined in the dictionary (~5 000 and ~130 000) and the size of the acoustic model (the amount of data used for HMM training). For this reason the two models will be referred to as “small” and “big”.

The naive answer to the question “which of the models should be used?” is the “big” one. But the more words are defined in the dictionary, the more probable it is for the ASR engine to confuse similar terms. On the other hand, words that are rarely used carry more information than common expressions.

The comparison of the models has been performed on a set of twelve files, each one corresponding to a chapter of the book “Fundamentals of Prosperity” by Roger Babson. Each file has been processed twice, once per model. Two factors were analyzed: computation time and accuracy of the output.

Time comparison

The tests revealed that for each file the small model was 9 to 10 times faster than the big one. This is probably due to the huge number of words defined in the second model. The chart in Figure 6.1 illustrates the distribution of computation times for each model, with the duration of the original file for comparison. It is interesting how stable the ratio of computation time to input file duration is: on average 0.14 for the small model and 1.4 for the big one, with standard deviations of 0.005 and 0.05 respectively. In conclusion, it can be said that the computation time depends mostly on the size of the dictionary and is proportional to the duration of the audio file.


Figure 6.1. Computation time for each model and original file duration for comparison.

Accuracy comparison

In the speech recognition case it is important to obtain a result that is as similar to the “ground truth” as possible. A good measure of the difference is the so-called “edit distance”. It calculates the minimum number of operations required to transform a source string into a target one. Originally designed to compare words, it may be extended to sequences of terms. In this measure identical words (sequences) have distance 0, whereas different words (sequences) have a positive score (the higher the score, the more unalike the words are). In this project the Levenshtein distance was used as the metric. The operations allowed by this algorithm are insertion, deletion and substitution.

Chart 6.2 illustrates the edit distance divided by the number of words in the original e-book file. This may be treated as an error rate only when taking into account the fact that the audio file does not match the e-book perfectly (LibriVox audiobooks contain additional prefaces and endings that were not removed). What can be concluded from the graph is that the big model performed much better than the small one. This means that the much larger number of words defined in the dictionary does not confuse the ASR engine enough to outweigh the benefits of the higher number of recognizable words. On average the small model requires 5 operations per 10 words and the big one 2.7.

Figure 6.2. Comparison of accuracy for both tested models. Edit distance divided by the number of words in each chapter.

The word-wise comparison presented in chart 6.2 is a very strict measure. For some very similar pairs of words the error rate is the same as in the case of extremely different terms - a boolean comparison, equal or not. This means that the pair ‘help’ and ‘helped’ is treated the same as ‘hell’ and ‘heaven’. A weighted measurement would be useful for ranking errors. The basic Levenshtein distance algorithm (character-wise) proves useful here. The edit distance between ‘help’ and ‘helped’ is two (two insertion operations), whereas ‘hell’ and ‘heaven’ require at least four operations (two substitutions and two insertions). This approach may be too “soft” for comparing the audiobook transcription with the e-book original, but it is a good counterweight to the previous, very strict word-wise algorithm. Chart 6.3 illustrates the character-wise edit distance for both models, taking into account the length of the e-book. The general picture is similar to the previous case (chart 6.2), but the values are smaller. For the big model the average error rate is 17%, whereas the previous approach indicated 27% (for the small model, an improvement from 50% to 30%).


Figure 6.3. Comparison of accuracy for both tested models. Edit distance divided by the number of characters in each chapter.

6.2 Filters

Phonetic filters transform a word into a more general sequence of characters called a “phonetic code”. Each code defines the phonetic characteristics of a given word as pronounced in English. The key feature of this matching, which we would like to take advantage of, is that two similarly sounding words should be mapped to the same (or at least a similar) phonetic code.

In this section I investigate which filters are able to match similar words to the same code. Table 6.1 illustrates several examples that were chosen on the basis of actual misunderstandings made by the ASR. In this test, for practical reasons, only isolated words are taken into account. Phrases, not shown in the table, were tested and gave similar results. The chosen words belong to two groups: the first consists of the same words with different suffixes (e.g. sixty/sixteen and promise/promised) and the second of words from different “families” that sound or are spelled similarly (e.g. march/much and three/tree).


filter \ text        march      much    promise    promised   three    tree    sixty    sixteen
Soundex              M620       M200    P652       P652       T600     T600    S230     S235
Refined Soundex      M80930     M8030   P19080306  P1908030   T6090    T690    S30560   S305608
Metaphone            MRX        MX      PRMS       PRMS       0R       TR      SKST     SKST
Double Metaphone     MRX        MK      PRMS       PRMS       0R       TR      SKST     SKST
Double Metaphone 2   MRX, MRK   MK      PRMS       PRMS       0R, TR   TR      SKST     SKST
Caverphone           MK1111     MK1111  PRMST1     PRMS11     TRA111   TRA111  SKTA11   SKTN11

Table 6.1. Phonetic filters comparison.

Each row describes one phonetic filter, whereas the columns refer to the pairs of analyzed words. The values in the cells are the codes produced by a given filter for a given word. Green highlighting indicates a “perfect match” between codes, yellow an “almost match” (when the codes are no more than one character operation - insertion, deletion or substitution - from each other) and red “no match”.

A brief analysis of the table reveals that some filters are able to produce a “perfect match” for given pairs, but not for all of them. At the same time, for each of the analyzed pairs there is at least one filter that can handle the problem properly. What is more, many filters produce very similar codes (yellow) for similar words. This means that all of the filters should be taken into consideration and none of them should be favored. The important conclusion is that march-much is a difficult pair for which to obtain similar codes, even though the words are spelled similarly (edit distance 2). It is much easier in the case of sixty-sixteen, although the spelling difference is bigger (edit distance 3).

This project explores the ability of phonetic filtering to improve search results by compensating for speech recognition errors. This is why both the original and the transcribed texts were converted to the phonetic code domain and compared again with the Levenshtein edit distance measure, as was done in the previous section. The Double Metaphone algorithm was used for this comparison. The results are shown in Figure 6.4. For both the big and the small model a huge improvement can be seen. On average the ratio of edit distance to file length is 11% smaller for the small model and 18% smaller for the big one. Obviously, the results after applying the phonetic algorithm cannot be worse than without pronunciation encoding. Such a big difference suggests, however, that this approach should improve the recall of the system (and decrease its precision, as the two are opposed).
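The sketch below illustrates the idea of the comparison: both word sequences are mapped to phonetic codes before the distance is computed. The jellyfish library and its single-code Metaphone are assumptions standing in for the Double Metaphone filter actually used in the thesis.

```python
import jellyfish  # assumed third-party library; primary Metaphone stands in for Double Metaphone

original = "the hurricane season was active".split()
decoded = "the hurricane susan was active".split()

# Word-wise the sentences differ in one position, but in the phonetic-code domain
# 'season' and 'susan' collapse to the same code, so the edit distance drops to zero.
print([jellyfish.metaphone(w) for w in original] ==
      [jellyfish.metaphone(w) for w in decoded])  # True
```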


Figure 6.4. Comparison of accuracy for both tested models with and without usage of the phonetic codes.

6.3 Quality tests

Filters

In this paragraph several queries are tested. Each of them is a phrase from the original e-book that was misunderstood by the speech recognition engine. The goal of this test is to check whether the phonetic filters implemented in Solr are capable of matching the original document, even though the transcribed text is faulty and lacks the queried phrase. The ASR output and the original (queried) data are given below:

• “doctor andrew” <-> “daughter and who” – the original file is listed far down the search results (on the second page of pagination), while documents that contain the word ‘daughter’ are ranked higher. It is important to point out that the Soundex filter applied to the two-gram field converts ‘doctorandrew’ to the code ‘D236’ - the same as the queried phrase - and gives a match.

• “war might be grip” <-> “warmly he gripped” – In this case the original document has not been retrieved. This is due to the fact that the phrases are not similar enough for the system. The words ‘grip’ and ‘gripped’, even though they sound similar, differ for each implemented filter by at least one character (two characters in the case of Refined Soundex). This means that there is no match between them. ‘Be’ and ‘he’ give short codes, but different ones in each case. The codes for the phrases ‘warmight’ and ‘warmly’ also differ in two positions and cannot be matched. As shown above, the individual similarities in sound were not enough for the phonetic filters to mark the phrases as a hit and retrieve the original document.

• “be pro to pleasant” <-> “eat protoplasm” – This query returned the relevant document at the tenth position on the result list. The three-gram ‘protopleasant’ has been matched with ‘protoplasm’ by multiple filters (all the Metaphone-based ones and simple Soundex). This example proves that N-grams are needed to match misunderstood phrases, not only isolated words.

• “noticed that her and let us” <-> “notice the bitter in it lettuce” – The relevant document has been returned as the fifth result. The three-grams ‘noticedthather’ and ‘noticethebitter’ match precisely under the Metaphone and Soundex filters. The same happened with ‘letus’ and ‘lettuce’. Matching all the terms in the query gave a higher rank than in the previous example, but many false positives were still present.

In general, the main problem is the high number of false positive results. Highlighting the hit results revealed that in many cases a match is present between words that seem random (the phrases ‘itwilldiethink’ and ‘itlettuce’ both match under the Metaphone filters, as they produce the code ‘ATLT’, even though they seem very different - edit distance 9). This is probably because the N-grams concatenate short words. In many cases the filters match ‘words’ that look completely different. On the other hand, the N-grams help to match misunderstood phrases like ‘let us’ and ‘lettuce’, as covered in the examples.

Information need

In this paragraph two information needs are queried. Each document on the list of top ten results is marked as ‘relevant’, ‘semi-relevant’ or ‘not relevant’. The needs were chosen according to the actual content of the recordings.

How to make a sandwich?

• “how to make a sandwich” – The first query is the full information need. With its many stop words (‘how’, ‘to’, ‘a’), it retrieves only non-relevant documents.

• “sandwich recipe” – The second query is limited to the meaningful words. The system retrieves two relevant documents, at positions 2 and 4.


• “food recipes” – Generalizing the query brought more semi-relevant documents (ranked 1, 2, 9 and 10) but none that are fully relevant. The documents cover the topic of food in general rather than recipes.

How to train my brain?

• “how to train my brain” – As in the previous example, no relevant documents were retrieved by the query with many stopwords.

• “brain exercise” – The second query concentrates more on finding concrete exercises, but in this case too all documents were marked as ‘not relevant’.

• “mind exercise” – Replacing the word ‘brain’ with ‘mind’ gave one relevant document, second on the list.

Stopwords, which are essential for the phonetic filters to work properly, influence the results enough to make relevant documents impossible to retrieve. Simply trimming the query may improve the results, but many false positives remain. An additional observation is that the documents are difficult to read and understand, because many words were misunderstood during the speech-to-text conversion. They also lack punctuation and grammatically correct sentences. It is much easier to listen to the audio snippet.

6.4 Quantity tests

True Positives

The quantitative tests in the true positive section revealed, as expected, that one-grams contributed the most to the search results (48%) - see Figure 6.5. This means that single-word matching was the most effective. Yet the higher N-grams were very active as well - 21% for pairs of words indicates that misunderstandings at the level of sequences can be covered this way.

Analysis of the second graph (Figure 6.5) - the contribution of each filter individually - shows that the Metaphone-based filters and Soundex play an equal role in indicating the correct document. Caverphone and Refined Soundex highlight the hit half as frequently as the other algorithms. Table 6.2 describes the contribution of each filter given the N-gram level (the number of words concatenated). A brief analysis reveals that those two filters are equally effective for words without joining (one-grams) but give almost no hits for higher N. This means that Refined Soundex and Caverphone should be used with “real” words rather than sequences of letters that are not present in the dictionary.


Figure 6.5. True positives contribution


Figure 6.6. False positives contribution


False Positives

False positives are reported mainly by filters working on single words. This time their contribution is smaller than in the true positives case, which is good (see Figure 6.6). The graph of overall false positives per filter is similar to the true positives case. The most interesting conclusion can be found in the detailed Table 6.3. Two filters - Refined Soundex and Caverphone - produce fewer false positive results than the other algorithms. This is especially visible in the single-word context (one-grams). On the other hand, their false positive contribution in two-grams has increased (in comparison with the true positives). This information can be used to fine-tune each N-gram and filter in the Solr search engine, as all fields can be parametrized with a ‘boosting’ factor.

filter \ n-gram       one-grams  two-grams  three-grams  four-grams  sum
double metaphone      0.081      0.048      0.041        0.038       0.208
metaphone             0.08       0.046      0.04         0.038       0.204
soundex               0.081      0.05       0.042        0.038       0.211
refined soundex       0.076      0.004      0            0           0.08
caverphone            0.077      0.008      0.001        0           0.086
double metaphone 2    0.081      0.05       0.042        0.039       0.212
sum                   0.476      0.206      0.166        0.153

blue [0; 0.02), green [0.02; 0.04), yellow [0.04; 0.06), orange [0.06; 1]

Table 6.2. True positives contribution matrix.

filter \ n-gram       one-grams  two-grams  three-grams  four-grams  sum
double metaphone      0.068      0.062      0.046        0.038       0.214
metaphone             0.063      0.049      0.038        0.033       0.183
soundex               0.072      0.071      0.053        0.045       0.241
refined soundex       0.039      0.011      0.001        0           0.051
caverphone            0.049      0.024      0.004        0           0.077
double metaphone 2    0.07       0.068      0.05         0.042       0.23
sum                   0.361      0.285      0.192        0.158

blue [0; 0.02), green [0.02; 0.04), yellow [0.04; 0.06), orange [0.06; 1]

Table 6.3. False positives contribution matrix.


Chapter 7

Conclusions

This project has tested the possibility of using speech recognition and phonetic algorithms combined to perform (or improve) searching of multimedia content. During the project, tests were run on several filters and word combination approaches as well as speech recognition models. The following conclusions were drawn:

• Phonetic filters indicate many words or phrases as relevant hits even though they are not. During the testing stage all possible algorithms and N-grams were combined, but in real systems only the relevant algorithms and models should be implemented. On the other hand, thanks to the large number of filters, some speech recognition errors could be corrected, which is the reason for using them.

• Boost tuning is suggested for each filter according to the actual data the particular system will be used on. This can be done at the Solr search level, following the suggestions from Tables 6.2 and 6.3, with special attention to the Caverphone and Refined Soundex algorithms, which proved to perform well on the one-gram true positive rate and produce relatively few false positives; a sketch of such a boost configuration is given after this list.

• Even though a simple version of the audio search system gives positive results, it should be combined with a full search engine that includes stemming, synonyms etc. Errors made by the speech-to-text or phonetic algorithms should hardly be a concern, since text search is very powerful on its own and the solutions proposed here are only meant to support it.

• It is important to have a well-trained model. This applies to the acoustic and language models as well as to the dictionary. High-accuracy phoneme recognition can be achieved only with an acoustic model trained on thousands of hours of human speech from many different speakers. General-purpose systems should be able to recognize as large a number of words as possible; rare terms may carry high information content that is crucial for the data. Possible misunderstanding of phrases should not be treated as a serious problem, since a large language model should accept them and the phonetic algorithms support searching for them. In some cases new words, such as brand names, should be added to the model.

• As with most systems, audio search requires well-prepared data. Recordings should be of high quality, without unnecessary noise or multiple speakers. This may help to avoid some errors.
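Following the boost-tuning suggestion above, the sketch below shows how per-field weights could be expressed as Solr eDisMax query parameters. The field names and weights are hypothetical, chosen only to illustrate favouring the filters with good one-gram true positive rates; they are not taken from the project's schema.

```python
# Hypothetical eDisMax parameters for a boosted phonetic search; field names and
# weights are illustrative only. They would be sent to Solr's select handler.
boosted_query = {
    "defType": "edismax",
    "q": "hurricane season",
    "qf": ("onegram_caverphone^2.0 onegram_refined_soundex^2.0 "
           "onegram_soundex^1.0 onegram_metaphone^1.0 "
           "twogram_metaphone^0.5 threegram_metaphone^0.3 fourgram_metaphone^0.2"),
}
print(boosted_query["qf"])
```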

Use cases

Since stream data analysis requires speed but not necessarily high recognition accuracy, the size of the input model (the dictionary) should be limited to a small number of words. This should reduce the speech recognition time significantly and allow “live” stream handling. In some cases (e.g. person/product sentiment analysis) there might be a need to expand the dictionary with extra terms such as brand or person names. Moreover, some additional tools have to be added to fulfill the requirements of such a use case, which increases the processing time of the whole system. It is also very important to provide high-accuracy input; therefore the implementation of noise reduction algorithms may be needed.

An archive of multimedia files requires a big model that can recognize a huge number of words. Computation time is not as important as in the previous case. Apart from the phonetic filters, well-known approaches for text searching, e.g. stemming or synonym detection, should be used to increase the quality of the results.


Chapter 8

Future work

The system can be improved in several ways:

• The accuracy of the speech recognition system is essential for the quality of such a project. There is a high probability that replacing the current speech recognition system (engine, acoustic and linguistic models), which is an open-source project, with a commercial one (for example Nuance Dragon) would improve output quality. This is due to the much larger amount of data used in the training process.

• In most cases speech recognition systems are able to detect particular words but not sentences. Therefore it may be inconvenient for the user to read the transcribed text. Completing the output with full stops, question marks, capital letters, commas etc. should improve the user experience, though it may be difficult because of the many errors made during the speech-to-text conversion.

• High quality of the input data is crucial for every system. An additional noise reduction step could improve the ASR process in some cases. This can be done with the “SoX” audio editor, which implements noise reduction functionality. It needs a “noise sample” for each file, so it cannot be done fully automatically.

• Most probably recall could be improved and more ASR errors could be compensated for by turning on the “fuzzy search” option in Solr, which enables approximate matching. In such a case many of the “almost match” codes in the phonetic filter table 6.1 may give a hit. The drawback of this solution is that the “fuzzy search” option slows down the query time massively, especially in the case of this project, in which many filters and fields have to be analyzed. Moreover, it may significantly increase the number of false positive results.


Help checklist

There is no single universal answer to every problem regarding information retrieval within spoken data using speech recognition software. Yet there are some decisions that have to be made at the beginning of the design process. The questions listed below may help with some of them:

• Make a decision regarding the model. Which is more important, speed or accuracy?

• Where can one find a model for a given language? What about preparing one's own model (or at least part of it)?

• How many recognized words or instructions are needed? Do new terms need to be defined?

• Is there a need to create N-grams for misunderstood sequences of words?

• What kind of additional tools, apart from phonetic filters, will be needed in the project (e.g. sentiment analysis)?

• Is there a need to add a preprocessing step for the input data (e.g. noise reduction)?

• What kind of tests are needed to fulfill the requirements of the project?



Appendix A

User guide

The project has been developed on the Debian operating system and can be run on any Unix-like OS that supports basic command-line scripts.

Prerequisites

To run the project, several tools and libraries need to be installed on the computer:

• Tomcat and Solr: the Tomcat web server is required to run the demo and Solr to index and retrieve the data. Both tools are already configured and included in the project files.

• Pocketsphinx: a lightweight speech recognition engine, available for download on the project web page.

• Lame - a program used for preprocessing the input audio files.

• SoX - an audio manipulation program used for silence detection.

• Hachoir-metadata - a Python tool for extracting metadata from multimedia files.

Apart from the programs listed above, a text difference library is required to run the Difference tool for word-wise matching of two text files:

• Horde Text Diff

Running the decoder

To run the decoder, the Main.py file has to be run with optional parameters (an example invocation is shown after the list):


• -b - by default the decoder makes use of the small and fast (~5 000 words) language and acoustic models (as well as the dictionary). This option forces the decoder to use the big and slow (~130 000 words) model. To use one's own model, the appropriate paths have to be changed in the Decoder.py file.

• -m - used to change the default media location. Every audio snippet produced by splitting the original audio file is moved there. By default set to webapps/media in the tomcat directory.

• -d - used to change the default location of the XML files containing media file metadata, transcribed text, processing time and other data used by the schema defined in Solr. By default set to the solr/exampledocs directory.
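As an illustration, a run that uses the big model and explicit output directories could look as follows; the directory paths here are only placeholders mirroring the defaults described above, and the audio files themselves are selected interactively after start-up:

python Main.py -b -m tomcat/webapps/media -d solr/exampledocs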

Figure A.1. Transcribing progress in the console.

Figure A.2. Console preview of the output with processing time.

After running the decoder the user is asked to choose a file (or multiple files) in an audio format (.mp3 or .wav) to decode. Work progress is visualized in the console, as illustrated in Figure A.1. The whole process may take a long time, depending on the model used by the decoder and the duration of the file. When the process is finished, all the output files are moved to the defined locations and the transcribed text is printed in the console for a fast preview (see Figure A.2). By default the XML files are not posted to the index, but this can easily be done, once Solr has been started, by entering the following command in the solr directory:

java -jar post.jar <XML file path>



The following information is printed after running the program with the help option:

$ python Main.py -h
Main.py [OPTIONS]
Main.py -m <mediaDir> -d <docDir>

OPTIONS:
  -h, --help
      print this message
  -m, --mediaDir
      directory where audio snippets are moved to
  -d, --docDir
      directory where XML output files are moved to
  -b, --big
      tells to use 'big' model for speech-to-text recognition

Running the demo

To run the demo described in the report, two tools have to be started:

• Tomcat web server - started via the startup script in the tomcat/bin directory - to handle the demo interface.

• Solr - ‘java -jar start.jar’ in the solr directory - to run the search platform and give access to the indexed data.

When this is done, Solr management is available over HTTP through a web browser at ‘http://localhost:8983/solr’ and the demo GUI at ‘http://localhost:8080/gui’.

Running the text difference tool

Difference.py is a simple console program for illustrating the difference between two text files. The two input files are specified as command line arguments (example invocations are shown after the list):

• -s or --source <file path> to define the first of the files for comparison

• -d or --destination <file path> to define the second of the files for comparison

The program matches the files and displays the difference in two possible ways:

• option -c prints the whole text with the differences colored red (deletion) or green (insertion). A substitution is shown as an insertion and a deletion next to each other, as illustrated in Figure 4.2

• option -p is used to list just the differentiating phrases as pairs in the format ‘phrase one <-> phrase two’
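A typical invocation, with placeholder file names, could look as follows: the first command prints the colored full-text view of the comparison, the second only the differing phrase pairs:

python Difference.py -s reference.txt -d transcript.txt -c
python Difference.py -s reference.txt -d transcript.txt -p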

