
2006:113 CIV

MASTER'S THESIS

Feasibility Study on a Text-To-Speech Synthesizer for Embedded Systems

Linnea Hammarstedt

Luleå University of Technology
MSc Programmes in Engineering

Electrical Engineering

Department of Computer Science and Electrical Engineering
Division of Signal Processing

2006:113 CIV - ISSN: 1402-1617 - ISRN: LTU-EX--06/113--SE


Preface

This is a master's degree project commissioned by and performed at Teleca Systems GmbH in Nürnberg at the department of Speech Technology. Teleca is an IT services company focused on developing and integrating advanced software and information technology solutions.

Today Teleca possesses a speech recognition system including a grapheme-to-phoneme module, i.e., an algorithm converting text into phonetic notation. Their future objective is to develop a Text-To-Speech system including this module. The purpose of this work, from Teleca's point of view, is to investigate a possible solution for converting phonetic notation into speech that suits an embedded implementation platform.

I would like to thank Dr. Andreas Kiessling at Teleca for his support and patient discussions during this work, and Dr. Stefan Dobler, the head of the department, for giving me the opportunity to experience this interesting field of speech technology. Finally, I wish to thank all the other personnel of the department for their consistent practical support.


Abstract

A system converting textual information into speech is usually denoted a TTS (Text-To-Speech) system. The design of such a system varies depending on its purpose and platform requirements. In this thesis a TTS synthesizer designed for an embedded system operating on an arbitrary vocabulary has been evaluated and partially implemented in Matlab, constituting a base for further development. The focus of this thesis is on the speech generation part, which involves the conversion from phonetic notation into synthetic speech.

The chosen TTS system is the so-called Time Domain-PSOLA, which suits the implementation and platform requirements well. It concatenates segments of recorded speech and changes their prosodic characteristics with the Pitch Synchronous Overlap and Add (PSOLA) technique. Each segment extends from the midpoint of one phone to the midpoint of the next, and is referred to as a diphone.

The quality of the generated synthesized speech is quite satisfactory for the test sentences applied. Some disturbances still occur as a consequence of mismatches, such as differing spectral properties of the segments and pitch detection errors, but further development can reduce these.


Contents

1 Introduction
  1.1 Introduction to TTS Systems
  1.2 Linguistic Analysis Module
  1.3 Speech Generation Module
    1.3.1 Rule-Based Synthesis
    1.3.2 Concatenative-Based Synthesis
  1.4 Project Focus

2 Theory
  2.1 Segment Data Preparation
    2.1.1 Segment Format and Speech Corpus Selection
    2.1.2 Preparation Process
    2.1.3 Segment Representation
  2.2 Speech Synthesis
    2.2.1 Synthesizing Process
    2.2.2 Prosodic Information
  2.3 PSOLA Method
    2.3.1 PSOLA Operation Process
    2.3.2 Modification of Prosody
    2.3.3 TD-PSOLA as Speech Synthesizer
  2.4 Extension of TD-PSOLA into MBR-PSOLA
    2.4.1 Re-synthesis Process
    2.4.2 Spectral Envelope Interpolation
    2.4.3 Multi-Band Excitation Model
    2.4.4 Benefits of the respective PSOLA Methods
  2.5 Utilized Data from External TTS Projects
    2.5.1 Festival and FestVox
    2.5.2 MBROLA

3 Implementation
  3.1 Segment Data Preparation
    3.1.1 Segment Information Modification
    3.1.2 Pitch Marks Modification
    3.1.3 Speech Corpus Modification
    3.1.4 Additive Modifications
  3.2 Speech Synthesis
    3.2.1 Input Format
    3.2.2 Segment List Generator
    3.2.3 Prosody Modification
    3.2.4 Segment Concatenation

4 Evaluation
  4.1 Analysis of the Segment Database and Input Data
    4.1.1 Pitch Marks
    4.1.2 Spectral Mismatch
    4.1.3 Fundamental Frequencies
  4.2 Solution Analysis
    4.2.1 Window Choice for the ST-signal Extraction
    4.2.2 Frequency Modification
    4.2.3 Duration Modification
    4.2.4 Word Border Information

5 Discussion
  5.1 Conclusions
    5.1.1 Comparison of TD- and MBR-PSOLA
  5.2 Further Work
    5.2.1 Proceedings for Teleca
    5.2.2 Possible Quality Improvements

A SAMPA Notation for British English

B MRPA - SAMPA Lexicon for British English

C Licence for CSTR's British Diphone Database

(9)

List of abbreviations

IPA International Phonetic Alphabet

MBE Multi-Band Excitation

MBR-PSOLA Multi-Band Re-synthesis PSOLA

MBROLA short for MBR-PSOLA

MOS Mean Opinion Score

MRPA Machine Readable Phonetic Alphabet

OLA Overlap and Add

PSOLA Pitch-Synchronous Overlap and Add

SAMPA Speech Assessment Methods Phonetic Alphabet

TD-PSOLA Time Domain PSOLA

TTS Text-To-Speech

V/UV Voiced/Unvoiced


Chapter 1 Introduction

The possibility of producing synthesized speech from plain textual information, with so-called Text-To-Speech (TTS) systems, has attracted extensive interest in many technical areas. Different methods with varying quality and properties exist, and development is still continuing.

The purpose of this thesis is to define and evaluate a TTS synthesizer suitable for embedded systems. The work is performed at Teleca Systems GmbH in Nürnberg, and its focus is set by the requirements of the company. Today, Teleca holds a module able to transform text into phonetic notation, originally developed for another speech application. It is assumed that this module can also be used in a TTS system, and the starting point for this project is hence phonetic notation. The developed system is restricted to British English, but the theoretical descriptions are valid for an arbitrary language.

Since the starting level is phonetic notation, it is not strictly correct to consider the investigated system a Text-To-Speech system. However, for simplicity, and since the conversion from phonetic notation to speech is a major part of a TTS system, the term TTS is nevertheless used in this thesis to describe the evaluated overall process.

In this study a TD-PSOLA (Time Domain-Pitch Synchronous Overlap and Add) [Dut97] synthesizer is investigated and implemented in Matlab. The result is evaluated, and suggestions for further work toward completing the system are given. A possible extension of this method with the Multi-Band Excitation (MBE) [GL88] model is presented theoretically, together with its benefits and disadvantages.

The following sections briefly describe the main principles of a general TTS system as well as some existing classifications and groupings. The ambition of the latter description is to show which choices have been made and to explain why.

1.1 Introduction to TTS Systems

A TTS synthesizer is a computer-based system that takes a text string as input and converts it into synthetic speech waveforms. The methods and equipment needed for this process vary depending on the physical restrictions imposed by the implementation platform and on development costs. Two main hardware restrictions are the storage properties, such as capacity and memory type, and the clock rate of the processor.


For all methods, the synthesis process can be divided into the two main modules presented in Figure 1.1. The first step transcribes the input text into a linguistic format, usually expressed as phonetic notations (phones) together with additional information about their prosody. The term prosody refers to properties of each phone such as duration and pitch. The outputs are then used in the second block to construct the final synthetic speech waveform.

[Block diagram: Text → Linguistic Analysis → Phonemes & Prosody Info → Speech Generation → Speech]

Figure 1.1: Division of a general TTS system into two main modules.

1.2 Linguistic Analysis Module

In almost all languages the textual representation of a word does not directly correspond to its pronunciation. The position of letters within a word and the word's appearance within the sentence affect the pronunciation considerably, as do additional characters such as punctuation marks and the content of the sentence. An alternative symbolic representation is therefore needed to capture this hidden information. Usually a language can be described by 20 to 60 different phonetic characters [Lem99], excluding information about its melody. To also describe the pitch characteristics, additional prosodic information is needed.

Converting text into a linguistic representation requires a large set of different rules and exceptions depending on the language. This process can be described through three main parts [Dut97]:

1. Text analysis

2. Automatic phonetization

3. Prosody generation

The first part functions as a pre-processing phase. It identifies special characters and notations, such as numbers and abbreviations, and converts them into full text where needed. Many words have different pronunciations depending on their meaning, and hence a contextual analysis is performed to categorize the words. The last step of the text analysis part is to find the structure of the text and to organize the utterance into clauses and phrases.

After the text analysis phase an automatic phonetization is performed, focusing on single words. The letter representation is automatically transcribed into a phonetic format using a dictionary-based or rule-based strategy, or a combination of both.


The former strategy divides the words into morphemes1 and then converts them using a morpheme-to-phoneme dictionary. For the dictionary to function in a general environment a large database is required, together with additional transcription rules for expressing unmatched morphemes. In the case of a general rule-based text conversion, a mixture of the two strategies is likewise present: an exception dictionary is needed for the words that do not follow the defined pronunciation rules. A minimal sketch of the combined strategy is given below.
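The following Python sketch illustrates the dictionary-plus-rules idea. The thesis itself contains no such code; the morpheme dictionary, the fallback rule and all names here are invented for illustration only:

    MORPHEME_DICT = {
        "un": ["V", "n"],                      # SAMPA-style entries, illustrative only
        "believe": ["b", "I", "l", "i:", "v"],
        "able": ["@", "b", "l"],
    }

    def phonetize(morphemes):
        """Convert a list of morphemes to phonemes: dictionary lookup where
        possible, a crude one-letter-per-phone rule as fallback."""
        phonemes = []
        for m in morphemes:
            if m in MORPHEME_DICT:
                phonemes.extend(MORPHEME_DICT[m])
            else:
                phonemes.extend(list(m))       # stand-in for real letter-to-sound rules
        return phonemes

    print(phonetize(["un", "believe", "able"]))

In a real system the fallback branch would apply language-dependent letter-to-sound rules rather than passing letters through unchanged.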

The last part of the linguistic transcription process is to add the prosodic information. This is applied as additional information, so the phonemes themselves are not changed further. Prosodic features are created by grouping syllables and words into larger segments. A hierarchic classification of these groups then leads to the resulting prosody description, usually presented as pitch definitions and phonetic durations.

1.3 Speech Generation Module

The best synthetic result for generating speech would be achieved by having recordings of all existing words stored in a huge database. The input generated by the linguistic analysis module could then simply be used to find and return the desired words. However, a TTS system able to work on arbitrary text input would in this case require an almost infinite number of recorded words. A more efficient, and thereby also more complex, speech generation system is therefore needed.

Several methods exist for generating unrestricted speech in an implementation-realistic manner, i.e., with a limited amount of storage and a limited number of operations. The synthesis can be done either explicitly, using models of the vocal tract, or implicitly, based on pre-recorded sounds [Dut97]. Implementation of these approaches results in two different classifications:

1. Rule-based synthesis for explicit operations.

2. Concatenative-based synthesis for implicit operations.

1.3.1 Rule-Based Synthesis

Creating rule-based synthesizers requires a careful study of how the different sounds of the human voice are produced. The modelling is then usually represented either by articulatory parameters or by formants2 [SO95]. In the former case several parameters hold information about, for example, shapes and movements of the lips and tongue, glottal aperture, cord tension and lung pressure [Lem99]. In the latter, a set of rules is used to determine the formant parameters necessary to synthesize a desired utterance.

These rule-based synthesizers require a considerably higher computational load compared to other common methods. Moreover, the synthesized speech sounds very unnatural, due to the complicated modelling and the fact that speech cannot be modelled with full accuracy. On the other hand, they are space-efficient, since no speech segments need to be stored, and can in principle easily be adjusted to a new speaker with a different voice and dialect.

1A morpheme is the smallest language unit that carries a semantic interpretation. For example, the word 'unbelievable' can be divided into the three morphemes un-believe-able.

2A formant is a peak in an acoustic frequency spectrum.

1.3.2 Concatenative-Based Synthesis

In concatenative-based synthesizers, segments of pre-recorded speech are connected (concatenated) to produce the desired utterances. The longer the segments used for synthesis, the better the quality achieved. On the other hand, using segments each consisting of several phonemes is, as mentioned previously for whole words, not realistic for an unrestricted domain because of the database size. Most often so-called diphones (see subsection 2.1.3) are therefore used, consisting of two phonemes and the transition in between.

The method principally used for concatenating speech segments is the PSOLA (Pitch-Synchronous Overlap and Add) technique, or simply OLA when the pitch-synchronizing stage is excluded; it is described more closely in section 2.3. Several methods build on these concatenation operations, with varying pre-processing of the database and different ways of applying the desired prosody. The most common are TD-PSOLA (Time Domain-PSOLA) and MBR-PSOLA (Multi-Band Re-synthesis PSOLA), both described in the following chapter.

An evaluation and comparison of four classical concatenative-based syntheses is described in [Dut94], involving the TD- and MBR-PSOLA methods, an LPC (Linear Predictive Coding) synthesizer and a synthesizer based on the Hybrid H/S (Harmonic/Stochastic) method3. The study implies that the TD- and MBR-PSOLA methods are very run-time efficient: they are estimated to require an operational load of 7 operations per sample, while at least ten times more is needed for the other two. Additionally, the two PSOLA-based methods have better intelligibility and naturalness; only in the case of fluidity is the Hybrid H/S model ranked higher than TD-PSOLA.

The main difference between TD- and MBR-PSOLA appears in the pre-processing stage. In the latter synthesizer the segments are further normalized and equalized, which is beneficial for data compression and speech fluidity but comes at the cost of naturalness. The implementation of this method is also more complex, as can be seen in the next chapter where a more theoretical description of the two synthesizers is given.

A drawback of concatenative-based synthesizers is the large memory space needed for the stored segments. Additionally, the synthesizer cannot change speaker characteristics as rule-based systems can. Some important benefits, though, are the simplicity of implementation, the natural sound and the few real-time operations needed [SO95].

1.4 Project Focus

The speech recognition system existing at Teleca today includes a grapheme-to-phoneme module. It is assumed that this module can be used as the linguistic part of a TTS system, converting text into phonetic notation. However, this is only an assumption and a future implementation task, and it is therefore not put into practice in this project4. The starting level for the TTS system described in this thesis is hence phonetic notation combined with prosodic information, assumed to be presented in a defined format suiting the chosen synthesis model.

Since the future implementation platform for the desired TTS system is an embedded device working in real time, a method with a low operational load and small data storage is required. According to the previous section, this means that a concatenative synthesis method using the run-time-efficient PSOLA technique is preferable. To minimize the data storage for a TTS system with an arbitrary vocabulary, the segment database would ideally consist of pure phoneme recordings. However, since the transition between two phonemes is known to be more important for the understanding of speech than the stable state itself [Dut97], a segment database consisting of diphones (with one transition point present in each segment) is preferable. Naturally, the more transitions each segment includes the better the speech is understood, but this would also require a larger database: the memory load grows roughly exponentially with each additional transition included per segment, and an unrealistic number of segments is soon reached. Therefore, to limit the data storage size, this TTS system is based on diphone segments.

The most common PSOLA-based synthesizers are TD- and MBR-PSOLA. The former is more widely used and requires a somewhat simpler implementation. For this project the TD-PSOLA method is chosen, essentially because MBR-PSOLA is more or less an extension of TD-PSOLA, so the implemented system retains the possibility of further development. It would have been interesting to implement an MBR-PSOLA synthesizer as well and compare it with TD-PSOLA, but because of the time restrictions of the project this is not done.

For the implementation of the TTS system on an external device it is preferable to have the algorithm expressed in C or in device-dependent assembler code. In this thesis the program is implemented entirely in Matlab because of its good analysis capabilities and tools. Once the optimal solution is found, the code can relatively easily be translated into C.

4The reason for an assumption rather than an implementation is discussed in subsection 5.2.1.


Chapter 2 Theory

The main operations needed for generating speech from a phonetic description with a concatenative-based synthesizer can be divided into two main processes: a segment data preparation process and a speech synthesis process. The former creates the data underlying the synthesis and is performed once. It operates on collected speech data and restructures it into a format suitable for the synthesizer. Information useful for the future synthesis process is calculated and applied either as additional data or as a recalculation of the collected data. The speech synthesis process consists of the functions operating on the phonetic input together with the generated data, and produces the resulting speech. It is independent of the language of the stored segments; only the defined segment format, i.e., the length of each speech unit, is required.

2.1 Segment Data Preparation

2.1.1 Segment Format and Speech Corpus Selection

The initial two steps in building a concatenative-based synthesizer are to determine the segment format and to collect a speech corpus. The term segment format refers to the number of phonetic notes present in each speech unit, together with information about where in a phonetic note the segment starts and ends. Usually the number of notes is fixed to a certain value, as in the cases described in subsection 2.1.3 below, but segments of varying length, such as words, are also possible. The selection of the segment format is a trade-off between operational load and complexity, storage requirements, and speech quality. Longer segments result in fewer concatenation points, and hence a simpler TTS system and better preserved naturalness. On the other hand, a segment format of several phonetic notes requires a large number of recorded speech segments: for each additional phonetic part included, an almost exponential increase1 of the memory size is required as a consequence of the mounting number of possible combinations.

1According to combinatorial theory, the permutation

P(n, k) = \frac{n!}{(n-k)!} = n(n-1)(n-2) \cdots (n-(k-1)) \approx n^k (for small k and large n),

where n denotes the total number of phonetic notes and k the number of notes per segment.


The segment data preparation process is based on a recorded speech corpus, and the number of segments to be included in the corpus is derived from the chosen segment format. It is preferable to record several versions of each segment and later choose the most appropriate recording in the preparation process. The resulting quality of the TTS system depends to a large extent on the quality of the speech corpus. The recorded data should be noiseless and read by one single person, and to facilitate the future segment concatenation it should be spoken as monotonously and with as stable an energy as possible.

2.1.2 Preparation Process

Figure 2.1 displays the operations involved in the segment data preparation process of a general concatenative-based TTS system. It starts at speech corpus level and a description of each block is presented below.

[Block diagram: the Speech Corpus feeds a Selective Segmentation block, followed by Speech Analysis and Equalization; the outputs are the Segment Information and Synthesis Segments databases.]

Figure 2.1: Block scheme for the segment data preparation process of a general concatenative-based speech generator.

Selective Segmentation

The recorded speech stored in the Speech Corpus database usually consists of complete words intended to be divided into the defined segment format. This segment extraction is performed either by marking the segment end points or by cutting out and storing the desired parts. Finding the optimal cutting points is a time-consuming process, since an automatic segmentation function is hard to develop, and it therefore needs to be done more or less by hand. Then, if several recordings per segment exist, the most appropriate speech segment is chosen.

Information about the segments is calculated and stored in the Segment Information database, for example the length of each segment and, when using segments consisting of at least two phonemes2, the mid position of the transition appearing between the phonemes. Finally, the extracted speech segments are transmitted to the next function.

2Defined as the mental abstraction of a phonetic note, see next subsection.


Speech Analysis

The operations involved in the Speech Analysis part depend mainly on the chosen synthesis method of the TTS system. In some cases the speech segments are recalculated to better resemble each other, as in the case of normalization. For some methods, additional information about each segment, for instance pitch marks, is stored in the Synthesis Segment database. Later in this chapter the pre-processing calculations needed for a TTS synthesizer are described for two different PSOLA-based systems.

Equalization

One operation concerning all concatenative-based methods is energy equalization. When speech is recorded from a human it is never spoken with constant volume. This energy variation can lead to clearly audible mismatches when concatenating the different segments, so before the speech segments are finally stored in the Synthesis Segment database this equalization is applied.

It is found [Dut97] that the amplitude energy of each phoneme differs according to the type of sound and where in the mouth it is produced. To preserve this natural energy variation, the equalization process is applied within each group of equal phones. Note that the term equalization is used here in contrast to the term normalization: an energy normalization process would set the energy of all segments to an averaged value, and the natural energy variation would then be lost.

2.1.3 Segment Representation

A phoneme is the linguistic representation of a phonetic event and is thereby defined as the smallest unit an utterance can be divided into. The phoneme is often incorrectly identified with the phone, but the latter describes a pronounced phonetic note while the phoneme corresponds to its mental abstraction. In other words, a phoneme can be defined as the categorization of a group of related phones [Dut97]. To represent all phones in a language, a set of about 40 to 50 basic phonemes is needed, depending on the language and the desired transcription accuracy [Lem99].

A diphone describes the transition between two phonemes. It starts in the middle of the steady region of one phoneme and ends in the middle of the steady region of the next. The concatenation point will then always appear at the most steady state of a phone, and compared to using phoneme segments, the spectral mismatch at the concatenation point is decreased. As an example, the word 'mean' with the phonetic description [m, i:, n] (in SAMPA notation, see further in this section) corresponds to the diphone description [#-m, m-i:, i:-n, n-#], where # denotes a short silence. Figure 2.2 displays the signal representation of the two phones m and i: spoken consecutively, together with a classification of the different regions. The number of diphones needed to represent a language is basically the square of the number of phonemes, disregarding some non-existing phoneme combinations. As a result, the English language comprises approximately 1500 to 2000 diphones [HAH+98].
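The phoneme-to-diphone conversion can be illustrated with a small Python sketch (the thesis implementation is in Matlab; this function and its name are ours):

    def to_diphones(phonemes):
        """Convert a phoneme sequence to diphone names, padding with '#'
        (short silence) at both ends, as in the 'mean' example above."""
        padded = ["#"] + phonemes + ["#"]
        return [a + "-" + b for a, b in zip(padded, padded[1:])]

    print(to_diphones(["m", "i:", "n"]))   # ['#-m', 'm-i:', 'i:-n', 'n-#']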

[Waveform plot: the phones m and i: in sequence, with the steady regions, the transition region, the cutting points and the diphone span marked.]

Figure 2.2: Signal representation of the two phones m and i: spoken consecutively, with the defined concepts marked.

The most general alphabet for presenting phonemes is the International Phonetic Alphabet (IPA). It has the capability to express all spoken languages of the world with one standard notation. This notation is composed of a large set of different characters, which are mostly not represented in ASCII. In computer systems it is therefore preferable to use an alphabet composed of a restricted number of combinations of ASCII characters. The SAMPA (Speech Assessment Methods Phonetic Alphabet) is one of the most popular machine-readable phonetic alphabets used today. It has a simple structure of ASCII characters, with at most two characters combined per symbol. Appendix A lists the SAMPA notation for British English, together with descriptions of pronunciation and a classification according to how the phones are produced.

Another, less widely used, phonetic alphabet is the MRPA (Machine-Readable Phonetic Alphabet). It was developed by the CSTR (the Centre for Speech Technology Research) at the University of Edinburgh in a project called Festival, further described in subsection 2.5.1 below. This alphabet considerably resembles SAMPA, but uses only non-capital letters together with the character @. MRPA can be mapped directly onto SAMPA notation, as shown in the appended MRPA-SAMPA lexicon in Appendix B.

Recorded speech phones can be classified according to their waveform as voiced or unvoiced signals, usually denoted V and UV, respectively. A voiced signal contains a clearly identifiable fundamental frequency with clear harmonics, while an unvoiced signal has a frequency spectrum resembling noise. Many speech phones, however, consist of a mixture of these two classes, with varying V and UV proportions in different frequency regions. A ratio V/UV is therefore introduced, where 1 corresponds to a purely voiced signal and 0 to a purely unvoiced one.

2.2 Speech Synthesis

2.2.1 Synthesizing Process

A model of the speech synthesis process is shown in Figure 2.3, with a description of each block below. The figure describes a general concatenative-based TTS system starting at the phonetic notation level.

[Block diagram: Phonemes & Prosody Info → Segment List Generator → Segment File Collector → Prosody Modification → Segment Concatenation → Waveform Generator → Speech, with the Segment Information and Synthesis Segments databases supplying the first two blocks.]

Figure 2.3: Block scheme for the run-time operating functions of a general concatenative-based speech synthesizer.

Segment List Generator

In this block the phonetic input notation is transformed into the pre-defined segment format of the synthesizer, and the structure of the prosodic information is changed so that it corresponds to the defined format. This operation requires information about the stored segments, which is found in the Segment Information database together with data needed for further operations, such as the segment file addresses.

The functions following this block operate on one segment at a time; the segment list generator therefore transmits the segment transcriptions with the corresponding information one by one.

Segment File Collector

The Segment File Collector reads the current speech segment file from the Synthesis Segment database, according to the file address received from the Segment List Generator, and passes the file on.


Prosody Modification

In this block the desired prosodic properties are applied to the speech segment. These properties usually refer to pitch and time duration (see subsection 2.2.2 below) and are applied with a method that depends on the synthesis algorithm. This process is described for PSOLA-based systems in section 2.3.

Segment Concatenation

The method for concatenating two segments is independent of the chosen segment format. For a good concatenation result, the two segments should have prosodic characteristics as similar as possible in their concatenation parts, i.e., at the trailing and leading ends, respectively. At these points the segments are assumed to have equal fundamental frequencies, and the cutting points are assumed to appear at the same position within their period times. Concatenation using the PSOLA technique is described in section 2.3.

Before the concatenation of the segments, a possible smoothing of discontinuities is performed. The concatenation process itself usually smooths the end parts of the segments, but in, for instance, the method described in section 2.4, a spectral envelope smoothing is performed by linear interpolation (also described in that section). Since the concatenation (and/or smoothing) of one segment depends on the shape of the next one, a one-segment delay in the concatenation block is required. This delay forces concatenative-based synthesizers to be partly non-causal. However, the smoothing of one segment never depends on any non-adjacent segment [Dut97], so the non-causality of the system is clearly restricted.

Waveform Generator

In some cases, for instance in rule-based TTS systems, the sound segment is parametrically stored or described by certain rules, and a Waveform Generator is then needed to decode the sound into a perceptible format, i.e., by generating sound waves.

2.2.2 Prosodic Information

In linguistics, the term prosody refers to certain properties of a speech signal, usually audible changes in pitch, loudness and syllable length. Other properties related to prosody are, for example, speech rate and rhythm, though these are not as commonly used in TTS systems.

The pitch information is one of the most essential prosodic properties. It describes the 'melody' of the utterance and thereby prevents the output of a TTS system from sounding monotonous. Additionally, a stressed syllable can be symbolized by a fast and large pitch change, as well as by an increase in its duration. This syllabic length also varies with position within a word and is hence another important parameter for speech synthesis. The desired pitch can be denoted as a sequence of frequency labels, each consisting of a time of appearance and a value. The last common prosodic property mentioned is loudness, which can also be defined as energy intensity. This parameter is, however, only of interest when producing emotional speech, since it is approximately constant in normal, non-emotional speech [Dut97].


2.3 PSOLA Method

The purpose of the PSOLA (Pitch-Synchronous Overlap and Add) technique is to change pitch and duration of an audible signal without performing any operations in the frequency domain. This process can be divided into two main steps:

1. decomposition of the signal into separate but overlapping parts, and

2. recombination of the parts by means of overlap-adding (OLA) with desired pitch and duration modification considered.

The operations involved in these two steps are closely described in the following subsection.

2.3.1 PSOLA Operation Process

A signal3 s(t) is decomposed into several short-time signals (ST-signals) s_i(t) by windows generated as time-shifted versions of a window w(t). This window is centred around each pitch mark pm_i of the original signal. A pitch mark is defined as the time position of the maximum of the signal within one period T_{0,i} of the instantaneous fundamental frequency F_0, according to Figure 2.4. If the signal is cut so that pm_0 = 0, each pitch mark can be described as

pm_i = \sum_{n=1}^{i} T_{0,n}, \quad i \in \mathbb{N}.   (2.1)

As described, these pitch marks correspond to the time shifts of the windows, and the extraction of a general ST-signal can thus be expressed as

s_i(t) = s(t)\, w(t - pm_i).   (2.2)

Note that the variable t is used as time index, although it usually denotes continuous time. Time indexing in discrete time, as in this case, is most often denoted by n, but for better understandability4 the variable t is used.

If the ST-signals are added together again but with another time shift, i.e., with a new pitch mark-vector pm', the reconstructed signal has changed fundamental frequencies and is generated by

s'(t) = \sum_i s_i(t - pm'_i).   (2.3)

If the original signal is strictly periodic, the periods T_{0,i} are equal for all i and the pitch mark-vector in (2.1) simplifies to pm_i = iT_0. The decomposition and recombination described in equations (2.2) and (2.3), respectively, can thus for periodic signals be redefined as

s_{i,per}(t) = s_{per}(t)\, w(t - iT_0),   (2.4)

s'_{per}(t) = \sum_i s_{i,per}(t - iT'_0).   (2.5)

3For a TTS system, indicating one speech segment.

4According to the author.


Figure 2.4: Signal representation of the phone i:. (a) Original signal s(t) with a window w(t) centred around pm_i. (b) Extracted ST-signal s_i(t).

Theoretically, the reconstruction comprising a pitch change as described in (2.5) can be performed perfectly. This means that s'_{per}(t) has the same spectral properties as s_{per}(t), with only a constant change of its fundamental frequency and harmonics. This statement can be proved by the Poisson formula [DL93]: if a signal

f(t) \xleftrightarrow{F} F(\omega),

then

\sum_{n=-\infty}^{+\infty} f(t - nT_0) \;\xleftrightarrow{F}\; \frac{2\pi}{T_0} \sum_{n=-\infty}^{+\infty} F\!\left(\frac{2\pi n}{T_0}\right) \delta\!\left(\omega - \frac{2\pi n}{T_0}\right).   (2.6)

In words, the formula implies that summing an infinite number of shifted versions of a given signal f(t) results in sampling its Fourier transform with a sampling period equal to the inverse of the time shift T_0. The spectral envelope is hence preserved while the new harmonics are evenly spread, which confirms the statement of a theoretically perfect pitch shift of a periodic signal.

As previously described, the windows used for creating the ST-signals s_i(t) are separated by the length of the local period T_0. If a window size much larger than this is used, spectral lines5 will appear in the spectrum of the ST-signal [Dut97]. These spectral lines can prevent s'(t) from being harmonized, since the sampling of the frequency domain (cf. 2.6) can yield frequency values at spectral dips. On the other hand, a too narrow window produces a very rough harmonization with approximated frequency values. Choosing an intermediate window size between these two cases gives an optimum of about twice the period time. With a window size of exactly 2T_0, the windows used for the ST-signal extraction in (2.2) overlap by one period time T_0.

5A spectral line is a dominant absence or presence of a certain frequency (a spectral dip or peak).

The Poisson formula presumes an infinite number of equal ST-signals as input, see (2.6), which is only achieved for a stationary signal. Speech, however, is a non-periodic signal, but with a relatively slowly varying frequency spectrum. This property of quasi-stationarity permits the use of equation (2.6) for speech, but with restricted summation boundaries and hence a somewhat less perfect result. Furthermore, the described windowing requires a slow variation of the fundamental frequency, since the window size is defined as approximately 2T_0 and this condition only holds if T_{0,i} ≈ T_{0,i+1}. Consequently, the requirement of a quasi-stationary signal also arises from the windowing step.

2.3.2 Modification of Prosody

The pitch of the signal is changed by applying the new pitch mark-vector pm' as in equation (2.3). Each new pitch mark pm'_j corresponds to a point in the original pm-vector, as shown in Figure 2.5, but with changed distances in between. If a pitch change by a factor k is desired, the new instantaneous period is T'_0 = T_0/k. The factor k can take a different value at each pm'_j, but only with relatively small variations, to preserve the quasi-stationarity assumption. A consequence of this pitch shift method is that the duration of the resulting signal is changed inversely proportionally to k. To obtain the desired signal duration, the number of ST-signals must then be changed, which is done by either duplicating or removing some ST-signals before recombination. The resulting expression for pm' with prosody applied is therefore

pm'_j = \sum_{n=1}^{j} \frac{T_{0,a(n)}}{k_{a(n)}},   (2.7)

where a(j) consists of indices of the original ST-signals, i.e., the vector a indicates which ST-signals are used. Thus a acts as a transfer function from pm to pm'.

Figure 2.5: Schematic example of a transfer function for the pitch mark-vector regarding pitch and duration modification. Here a = {1, 2, 2, 3, 4, 4, 5, 6, 6}, mapping the original marks pm_1, ..., pm_6 onto the new marks pm'_1, ..., pm'_9.

An expression for the recombination of the extracted ST-signals, including both pitch and duration modification, can now be obtained by combining (2.3) and (2.7) into

s'(t) = \sum_j s_{a(j)}\!\left(t - \sum_{n=1}^{j} \frac{T_{0,a(n)}}{k_{a(n)}}\right).   (2.8)
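As an illustration of equations (2.1)-(2.8), the following NumPy sketch performs the complete decompose/overlap-add cycle. It is not the thesis's Matlab code; the function name, the nearest-mark rule used for the mapping a, and the Hann window choice are our own assumptions:

    import numpy as np

    def td_psola(signal, pitch_marks, k):
        """Change the pitch of `signal` by the factor k (k > 1 raises it)
        while roughly preserving the duration. `pitch_marks` are sample
        indices of the detected pitch marks."""
        pm = np.asarray(pitch_marks, dtype=int)
        # Decomposition, eq. (2.2): two-period Hann windows centred on
        # each interior pitch mark.
        st, centers = [], []
        for i in range(1, len(pm) - 1):
            lo, hi = pm[i - 1], pm[i + 1]            # window size ~ 2*T0
            st.append(signal[lo:hi] * np.hanning(hi - lo))
            centers.append(pm[i] - lo)               # mark offset inside window
        out = np.zeros(len(signal))
        t = float(pm[1])                             # first new pitch mark
        while t < pm[-2]:
            # a(j): re-use the ST-signal whose original mark is nearest,
            # which duplicates or drops ST-signals as needed.
            i = int(np.argmin(np.abs(pm[1:-1] - t)))
            start = int(round(t)) - centers[i]
            lo, hi = max(start, 0), min(start + len(st[i]), len(out))
            out[lo:hi] += st[i][lo - start:hi - start]   # recombination, eq. (2.3)
            T0 = (pm[i + 2] - pm[i]) / 2.0           # local period around the mark
            t += T0 / k                              # new spacing T0/k, eq. (2.7)
        return out

Because the new marks sweep the original time range while ST-signals are re-selected by nearest original mark, the duration stays approximately unchanged while the fundamental frequency is scaled by k.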

2.3.3 TD-PSOLA as Speech Synthesizer

The PSOLA operation presented above constitutes the prosody modification part of a TTS system. An expansion of this technique is TD-PSOLA (Time Domain-PSOLA), which functions as a complete speech synthesizer while keeping all its operations in the time domain. With this method it is possible to change both pitch and time duration by a factor in the range 0.5 to 2 without any notable change in the position and bandwidth of the formants [Dut97].

The PSOLA operation used in this TTS system requires information about the pitch mark locations of each segment. This is usually generated in the speech analysis step of the segment data preparation part (see Figure 2.1) by a pitch detection algorithm. However, detecting the fundamental frequency in a signal with a low V/UV value is difficult and sometimes even impossible (when V/UV ≈ 0), so the pitch marking must often be done partly by hand. For purely unvoiced signals no fundamental frequency exists, and the pitch marks are then spaced at a fixed distance, corresponding approximately to the average pitch of the speech segments.

At the point of concatenation, three different types of mismatch can appear as a consequence of varying segment characteristics: pitch, harmonic phase and spectral envelope mismatches. All of these can degrade the quality. In the case of differing pitches, the PSOLA process can eliminate the mismatch by placing the windows in the recombination phase equally for both segments. However, since PSOLA is an approximate method, the process changes the spectral properties of the segments; if the pitch is to be changed strongly, as for a relatively large pitch difference, a spectral mismatch will appear. A second case of audible mismatch occurs when two voiced signals have harmonics with differing phases. The phases of the fundamental frequency, though, are implicitly equalized through the pitch marking process, since the mark is always placed at the highest peak of the period.

The spectral mismatches described require operations in the frequency domain. Since the current synthesizer operates in the time domain, a compensation of these mismatches, for example by smoothing, cannot be done. However, in the special case of ST-signals of equal length (which occurs when the original pitch is constant), a spectral envelope smoothing in the time domain can be performed. This is further described in the next section.

2.4 Extension of TD-PSOLA into MBR-PSOLA

An extension of the TD-PSOLA synthesizer has been developed by Dutoit and Leich [DL93], involving a re-synthesis of the segment database. The purpose of this extended TTS method, called MBR-PSOLA (Multi-Band Re-synthesis PSOLA), is to perform a more specific normalization in the pre-processing stage than in the case of TD-PSOLA, and to avoid the need for additional pitch mark files. The resulting segments are stored, and the same synthesis method can be used as before, together with a quality-improving interpolation block. Figure 2.6 displays this extension from the original TD-PSOLA synthesizer (gray blocks) to an MBR-PSOLA TTS system.

[Block diagram: the TD-PSOLA segment database passes through an MBR-PSOLA re-synthesis block to form the MBR-PSOLA segment database; the speech synthesis chain, fed by phonemes & prosody info and the segment information database, is extended with a linear interpolation block.]

Figure 2.6: Extension of TD-PSOLA into MBR-PSOLA. The white blocks refer to the added operations.

2.4.1 Re-synthesis Process

The segment re-synthesis process consists of two normalization steps. First, the speech segments are recalculated to a constant pitch throughout the entire database. As a consequence, the window positioning later performed in the PSOLA process can be given one fixed value for all segments, relative to the start of the constant pitch period, so no additional pitch mark information is needed. The second re-synthesis operation comprises harmonic phase normalization of voiced signals. These phases are set to fixed values, valid for all segments. The choice of these phase values affects the sound quality considerably: constant or linearly distributed harmonic phases lead to a rather metallic sound, while a better quality is achieved by giving the phases randomly distributed values. Additionally, tests performed in [DL93] have shown that keeping the phases of the high-frequency harmonics at their original values actually improves the quality. An upper limit of about 2000 Hz for which harmonics are normalized proved best: a higher value brought no further enhancement, while a too low value resulted in worse quality.
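A minimal NumPy sketch of this phase normalization is given below. It works per FFT bin rather than per harmonic band, the function and its parameters are our own illustration rather than the MBROLA implementation, and it assumes the ST-signals already have equal length (constant pitch):

    import numpy as np

    def normalize_phases(st_signal, fs, fixed_phases, cutoff=2000.0):
        """Replace the spectral phases below `cutoff` Hz with one fixed,
        randomly drawn phase set shared by all ST-signals; keep the
        original phases above the cutoff, as [DL93] found beneficial.
        Assumes len(fixed_phases) >= number of FFT bins."""
        spec = np.fft.rfft(st_signal)
        freqs = np.fft.rfftfreq(len(st_signal), 1.0 / fs)
        mag, phase = np.abs(spec), np.angle(spec)
        low = freqs < cutoff
        phase[low] = fixed_phases[:len(spec)][low]
        return np.fft.irfft(mag * np.exp(1j * phase), n=len(st_signal))

    # One phase set, drawn once and reused for the whole database:
    rng = np.random.default_rng(0)
    fixed_phases = rng.uniform(-np.pi, np.pi, size=4096)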

The method for re-synthesizing the TD-PSOLA segment database using the MBE operations described in the next subsection is shown in Figure 2.7. First, each segment is windowed into ST-signals according to equation (2.2). This requires known pitch marks, however, and in this case no such information is available. Instead, the window size and position are calculated using a constant F_0 for the whole database, estimated as a rough average (1/T_{0,av}) of the overall pitch of the complete corpus. The pitch mark pm_i can hence be replaced by iT_{0,av}. Each ST-signal is then parameterized according to the MBE model (described in the next subsection) into


• harmonic amplitudes (sampled spectral envelope),

• harmonic phases, and

• narrowband noise variances.

In the calculation process of these parameters, a voiced/unvoiced classification of the signal is included. This information controls whether the signal is modified (if voiced) or returned unchanged (if unvoiced). Before the final storing of the segments, the normalized ST-signals are concatenated with the OLA (Overlap and Add) method.

[Block diagram: the TD-PSOLA segments are windowed with w(t) and passed through MBE parametrization; a V/UV decision selects between parametric normalization and pass-through, and the results are re-synthesized and concatenated by OLA into the MBR-PSOLA segments.]

Figure 2.7: Segment re-synthesis process using the MBE model.

2.4.2 Spectral Envelope Interpolation

Another benefit of having constant pitch and identical harmonic phases is that spectral matching at the concatenation points of the synthesizer can be performed by linear interpolation in the time domain between the ST-signals. The normalization implies that this so-called direct temporal interpolation, described below, is equivalent to an interpolation of the spectral envelope [DL93], which is what is wanted. Furthermore, the constant length of the segments simplifies the interpolation, allowing a direct position mapping between the samples or parameters.

Suppose the segments s_L and s_R (the left and right segments) are to be concatenated, and denote the two overlapping ST-signals at the joint s_{L,0} and s_{R,0}, respectively. Each ST-signal s_{X,n} is described by the speech sample or parameter set p_{X,n}, where X refers to the segment (L or R) and n to its window or ST-signal, counted from the joint. The vector p_{X,n} is of constant length, which is required for the vector operations. Suppose the difference |p_{L,0} - p_{R,0}| is to be divided over N_L windows on the left segment and N_R on the right; the spectral smoothing can then be expressed as

p'_{L,i} = p_{L,i} + (p_{R,0} - p_{L,0}) \frac{N_L - i}{2 N_L},   i = 0, 1, ..., N_L - 1,   (2.9)

p'_{R,j} = p_{R,j} + (p_{L,0} - p_{R,0}) \frac{N_R - j}{2 N_R},   j = 0, 1, ..., N_R - 1,   (2.10)

where p'_{L,i} and p'_{R,j} denote the new interpolated values of the samples or parameters describing the ST-signals s_{L,i} and s_{R,j}, respectively.

The optimum number of ST-signals to use for the interpolation, i.e., N_L and N_R, varies between segments. It is preferable to keep ST-signals from the transition part out of the spectral smoothing, and since the length of a segment and the position of its transition vary, a segment-dependent selection of the number of smoothed windows is optimal. Additionally, spectral smoothing is only applied to voiced signals, and the selection of which segments to smooth can be achieved by the same segment classification. This selection of the number of windows, i.e., the segment classification, is based on the V/UV information calculated by the MBE analysis process described in the following subsection.

2.4.3 Multi-Band Excitation Model

The Multi-Band Excitation (MBE) model was originally designed for speech storage compression in voice codecs [GL88]. It is based on a parameterization of the frequency domain of a speech signal, and since it includes information about the harmonic frequencies, it is well suited for pitch and phase normalization. Below follows a description of the MBE parameterization of an arbitrary short-time speech signal.

Suppose a voiced ST-signal s_w(t) has the Fourier transform S_w(\omega) according to Figure 2.8(a). This frequency-domain signal can be modelled as the product of its spectral envelope H_w(\omega) (with phase included) and an excitation spectrum |E_w(\omega)| [GL88],

\hat{S}_w(\omega) = H_w(\omega)\, |E_w(\omega)|.   (2.11)

If the fundamental frequency \omega_0 of the signal is known, the excitation spectrum can be expressed as a combination of a periodic spectrum |P_w(\omega)| based on \omega_0 and a random noise spectrum |U_w(\omega)| with variance \sigma^2. The periodic spectrum consists of peaks of equal amplitude appearing at the fundamental frequency and its harmonics, as shown in Figure 2.8(c). A frequency band with a width equal to the distance between two harmonic peaks, centred on a harmonic, is defined as a harmonic band. A V/UV analysis is performed on S_w(\omega) for each harmonic band and expressed in binary form using a threshold value, see Figure 2.8(d). The two spectra are combined using the V/UV information to generate |E_w(\omega)| by

|E_w(\omega)| = V/UV(\omega) \cdot |P_w(\omega)| + (1 - V/UV(\omega)) \cdot |U_w(\omega)|,   (2.12)

and these different spectrum parts can be seen in Figure 2.8(c-f). Figure 2.8(b) displays the spectral envelope |H_w(\omega)|, which is usually represented by one sample value per harmonic in both voiced and unvoiced regions to reduce the number of parameters. Finally, the resulting synthetic signal spectrum \hat{S}_w(\omega), calculated as described above, can be seen in Figure 2.8(g).

Figure 2.8: Example of an MBE-modelled signal. (a) Original spectrum, (b) Spectral envelope, (c) Periodic spectrum, (d) V/UV information, (e) Noise spectrum, (f) Excitation spectrum, (g) Synthetic spectrum.

The estimation of the parameters in this method is based on the least-squares error between the synthesized spectrum |\hat{S}_w(\omega)| and the original spectrum |S_w(\omega)|; this approach is usually termed analysis-by-synthesis. First, the spectral envelope and the periodic spectrum are estimated in the least-squares sense. Then the V/UV decisions are made by comparing the resulting spectrum to the original for each harmonic band, using a threshold on the error to determine whether the band is voiced or unvoiced.
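The band-wise combination in equation (2.12) is straightforward to express in code. A sketch under assumed data layouts (the bin-to-band mapping and all names are ours, not the thesis's or [GL88]'s):

    import numpy as np

    def excitation_spectrum(periodic, noise, vuv_band, band_of_bin):
        """Combine the periodic and noise magnitude spectra into |Ew|
        per eq. (2.12). `vuv_band[b]` is the binary V/UV decision for
        harmonic band b, and `band_of_bin[i]` maps FFT bin i to its band."""
        vuv = vuv_band[band_of_bin]          # per-bin 1 (voiced) / 0 (unvoiced)
        return vuv * np.abs(periodic) + (1 - vuv) * np.abs(noise)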

2.4.4 Benefits of the respective PSOLA Methods

TD-PSOLA

• High naturalness of the synthesized speech because of 'untouched' segments.

• Less sensitive to analysis errors regarding V/UV classification [Dut94].

• Simpler data preparation step.

MBR-PSOLA

• No mismatch in harmonic phase and pitch.

• No external pitch marks needed; implicitly calculated.

• Simple spectral smoothing possible.

• Good database compression potential.

2.5 Utilized Data from External TTS Projects

Numerous companies and universities research and offer products in the area of TTS systems. The availability of their results varies, but in most cases the solutions are proprietary. The research projects mentioned in this section all, to some extent, underlie the TTS system investigated in this thesis.

2.5.1 Festival and FestVox

The CSTR (Centre for Speech Technology Research) is an interdisciplinary research centre at the University of Edinburgh. One of their project products, the Festival Speech Synthesis System, contains a full concatenative-based TTS system with different synthesis methods implemented. Except for the PSOLA-based synthesizer, the software for the various TTS systems is distributed under a free license [Fesa]. The latest version is Festival 2.0, which has been developed for a number of languages: British and American English, Spanish and Welsh.

A further improvement of the Festival TTS system is developed at Carnegie Mellon University (CMU) through their project FestVox, whose aim is to make the building of new synthetic voices more systematic and better documented. FestVox 2.0 is the latest version, released in January 2003 [Fesb], with software free to use without restrictions for both commercial and non-commercial purposes. The databases involved are presented by FestVox on their homepage and contain, among other things, two voice databases consisting of all possible diphones for American and British English. They were developed by the CMU and the CSTR, respectively, and include waveforms, laryngograph (EGG) files, hand-corrected labels of start, stop and transition points, and extracted pitch marks. The pitch marks are not hand-corrected and thus not completely reliable.


The data used in this thesis are extracted from the British database called CSTR UK RAB Diphone. (A detailed description of the licence is attached in Appendix C.) The data are as follows:

1. Recorded speech corpus spoken by one British male speaker covering all possible diphones for British English. The data is stored as wave-files comprising a set of 2001 nonsense words with a sampling rate of 16 kHz and a precision of 16 bits.

2. A list of all diphones in MRPA notation, comprising 2005 items. Each diphone is complemented with information about the corresponding sound segment, consisting of the name of the wave file, labels with the start and stop positions of the diphone segment, and a label for its transition point. The position labels are hand-corrected and expressed in seconds to three decimal places.

3. Data files with extracted pitch marks. Each pitch mark file corresponds to one wave file, so 2001 such files exist. The pitch positions are expressed in seconds to seven decimal places.

The notation of the diphones used in the segment information list (point 2 above) follows the MRPA scheme described in Appendix B. Additional information has been included in the form of the three symbols #, _ and $. The first corresponds to a short silence, while the second indicates a consonant cluster, i.e., two consonants appearing within a word instead of between two words. This can be exemplified by the cluster notation for t - r, meaning the /tr/ as in 'true' and not as in 'fat rat'. The last character, $, is investigated in subsection 3.2.2 and found to symbolize a word border between a plosive and a vowel.

2.5.2 MBROLA

Another partly freely available TTS system is presented by the MBROLA project, initiated by the TCTS Lab of the Faculté Polytechnique de Mons in Belgium [MBR].

A product with the same name as the project, MBROLA, has been developed, consisting of a speech synthesizer based on the MBR-PSOLA technique. It takes a list of phonemes as input, together with prosodic information consisting of phoneme durations and a pitch description. The design of the system requires a diphone database, but apart from this it can operate on any voice and language, assuming the defined input format. Since the starting level is phonetic notation, MBROLA is a phoneme-to-speech system rather than a complete TTS system.

The MBROLA synthesizer is provided free of charge only for non-commercial and non-military applications. Originally it came with one single segment database, a French-speaking male voice. Today the system is available for several languages and voices through cooperation with other research labs and companies contributing their diphone databases. The download package from MBROLA includes example sentences to use as input and an executable file. No source code is available, since the algorithm is protected. The example sentences are stored in the format

phoneme  duration  pitch-position pitch  pitch-position pitch  ...


with SAMPA notation for the phonemes and the duration expressed in milliseconds. The position of each optional pitch definition [Hz] is given as a percentage of the specified phoneme duration. Several pitch definitions can appear, in which case a linear pitch interpolation between the pitch positions is intended. In total there are three different test files, together including almost 30 seconds of speech.
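A line of this input format is easy to parse. The following Python sketch (our own, with invented example values) reads one line into a phoneme, a duration and a list of (position %, pitch Hz) targets:

    def parse_pho_line(line):
        """Parse 'phoneme duration [position pitch]*' as described above."""
        fields = line.split()
        phoneme, duration = fields[0], float(fields[1])
        targets = list(zip((float(p) for p in fields[2::2]),
                           (float(f) for f in fields[3::2])))
        return phoneme, duration, targets

    print(parse_pho_line("i: 110 25 140 75 130"))
    # -> ('i:', 110.0, [(25.0, 140.0), (75.0, 130.0)])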


Chapter 3 Implementation

As implementation tool for the TTS system designed in this thesis, Matlab is used throughout (version 7.0, running under Linux). The synthesizer follows the TD-PSOLA model presented in the previous chapter, with diphones as the segment format, and combines the Festival diphone databases with the input format of MBROLA using SAMPA notation. The original corpus is stored as wave files with a sampling frequency of 16 kHz. This frequency is kept during the synthesis process, and the output is stored as a wave file.

In this chapter the implementation is described in detail, step by step, divided into a Segment Data Preparation part and a Speech Synthesis part. When a phoneme is given by its phonetic description, the MRPA alphabet is intended unless another phonetic alphabet is explicitly stated.

3.1 Segment Data Preparation

The segment data preparation process operates on the three Festival databases listed in subsection 2.5.1. It creates the Segment Information database and the Synthesis Segment database, the latter consisting of diphone segments and pitch mark vectors. In principle, the operation model follows the structure displayed in Figure 2.1. The difference in this case is that the pitch mark vectors already exist and are used to define the cutting points in the speech corpus, and that information about where each diphone appears, as well as its transition point, is already known. The process is therefore better described by Figure 3.1, where the pitch mark operation block corresponds to the Speech Analysis block in Figure 2.1, and the segment information operations and the diphone extraction block are part of the Selective Segmentation block. The databases denoted Pitch Marks and Diphone Segments together represent the Synthesis Segment database block in Figure 2.1. Below follows a description of the modifications performed on the three original databases.

3.1.1 Segment Information Modification

The segment information file available from Festival contains data about each diphone in the format

diphone   file name   start point   transition point   end point

[Figure 3.1 shows a block scheme: the original databases (Speech Corpus, Segment Information, Pitch Marks) pass through the blocks Diphone Info Recalculation, Exclusion of unneeded Diphones, Valid Pitch Mark Extraction, Diphone Extraction and Energy Equalization, producing the modified databases (Diphone Segments, Segment Information, Pitch Marks).]

Figure 3.1: Block scheme of the segment data preparation process performed on Festival's databases.

where the diphones are noted in MRPA with the character - as separator between the two phones. The given points are expressed in seconds and refer to where in the given file the diphone appears.
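As a purely illustrative example (hypothetical file name and label values), one entry could read

    d-i:   kdt_025   0.385   0.421   0.472

i.e., the diphone d-i: is found in the file kdt_025 between 0.385 s and 0.472 s, with the transition from /d/ to /i:/ at 0.421 s.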

As will be described in subsection 3.2.2, the character _ is used for denoting different versions of the recorded segments. These are only added to phones with a length of one character, and in almost all cases only on one side. The exceptions are the notations K_ - _R, where K ∈ [k,p,t] and R ∈ [r,w,y,l]. To simplify the TTS system these diphones are removed, so that each phone part is restricted to at most two characters. This simplification does not affect the result considerably, since the first phone part of these segments is hardly audible and was therefore probably intended only for special cases. Before the removal, though, the segment addresses for the diphones K - R are overwritten, i.e., the sound files for K_ - _R are used for K - R.

When the diphones are extracted from the speech corpus and stored in new files (see subsection 3.1.3), the start point is no longer needed in the segment information file, and the file addresses are changed. Instead of the start point, the position of the first pitch mark is included. This information is used in the synthesis process when the time duration of the diphone is calculated: since one period time T0x is overlapped in the OLA process, this time loss has to be taken into account to obtain the correct output length.

3.1.2 Pitch Marks Modification

If the diphones are cut at the positions of pitch marks, the phase of the fundamental frequency is the same at the start point of all segments. The defined cutting points are therefore moved to the position of the closest pitch mark, which is not where they lie in the original information file.

The pitch mark database contains pitch marks for the whole speech corpus, i.e., also for the parts outside a diphone region. The pitch marks valid for each diphone are extracted and stored in the modified pitch mark database.
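A minimal Matlab sketch of these two steps, with hypothetical variable names (a sketch, not the exact implementation), could look as follows:

    % Snap the labelled cutting points of one diphone to the closest
    % pitch marks and keep only the pitch marks inside that region.
    % pm       - pitch marks of the whole wave file [s]
    % startLbl - labelled start point of the diphone [s]
    % endLbl   - labelled end point of the diphone [s]
    [dummy, iStart] = min(abs(pm - startLbl));
    [dummy, iEnd]   = min(abs(pm - endLbl));
    startCut = pm(iStart);                 % new cutting point [s]
    endCut   = pm(iEnd);
    pmSeg    = pm(iStart:iEnd) - startCut; % marks relative to segment start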

3.1.3 Speech Corpus Modification

The modified start and end points, which now lie at pitch mark positions, are used for the extraction of the diphones. As can be seen in Figure 3.1, an energy equalization is performed before the final storing. This equalization focuses on each group of related phones by setting the energy of the pitch periods to be concatenated, i.e., the first or last period of the diphone, to the average value for each phoneme. For some phones, as in the case of plosives, there is a short silence before and after the pronounced part of the phone. An equalization of these phones is therefore not realistic, and after a subjective energy analysis the two fricatives th and dh are also excluded, since they show the same signal characteristics as plosives. The different versions of each phoneme, i.e., phonemes on the format x, _x and x_ where x is an arbitrary phoneme, are classified as the same phoneme in the equalization calculations. The equalization is linearly distributed at sample level over each diphone. As an example, if the diphone m-i: of sample length L is to be equalized with a factor a for m and b for i:, each sample value s(n) is multiplied by the factor a + ((b - a)/L) · n.
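The linear distribution of the equalization factors can be sketched in Matlab as follows (hypothetical variable names):

    % Linear energy equalization of one diphone signal s (column
    % vector) of length L, with factor a for the first phone and
    % factor b for the second phone.
    L = length(s);
    n = (0:L-1)';              % sample indices
    g = a + (b - a) / L * n;   % linearly interpolated gain factor
    sEq = s .* g;              % equalized diphone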

3.1.4 Additive Modifications

When the pitch marks are displayed together with their corresponding diphone segments, several mismatches are found in the form of missing or considerably misplaced pitch marks. About ten percent of the segments contain two consecutive pitch periods whose lengths differ by a factor of 2 or more, i.e., one of the pitch periods is at least twice as long as the other. The pitch marks of these segments are considered unreliable and are therefore adjusted by hand. It is also found that the second phone part of the diphone m-p does not comprise a whole pitch period, and hence it is lengthened by one period time. During the quality evaluation the diphone au-@@ was found to be incorrectly pitch marked (see subsection 4.1.1) and was therefore also corrected by hand.
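The reliability criterion used to single out these segments can be expressed as a short Matlab check (hypothetical variable names):

    % Flag a segment whose pitch marks contain two consecutive
    % periods differing in length by a factor of 2 or more.
    % pmSeg - pitch mark positions within one diphone segment [s]
    T = diff(pmSeg);                     % pitch period lengths [s]
    r = T(2:end) ./ T(1:end-1);          % ratios of consecutive periods
    suspicious = any(r >= 2 | r <= 0.5); % true: adjust marks by hand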

3.2 Speech Synthesis

The Speech Synthesis process is divided into the four main blocks presented in Figure 2.3.

In this thesis there is no coding of the signals into parametric form, and hence the Waveform Generator block can be excluded. The Segment File Collector merely reads the stored data and involves no further operations, and is therefore not described in a subsection below.
