
A Prototype Text Analyzer for Mandarin Chinese TTS System

Chiao-ting Fang

Uppsala University

Department of Linguistics and Philology
Master's Programme in Language Technology
Master's Thesis in Language Technology
June 11, 2017

Supervisors:

Yan Shao, Uppsala University


Abstract

This project presents a prototype of a text analyzer for a rule-based Mandarin Chinese TTS voice, including components for text normalization, lexicons, phonetic analysis, prosodic analysis, and a phone set. Our implementation shows that despite the linguistic differences, it is feasible to build a Chinese voice with the TTS framework for European languages used at ReadSpeaker AB. A number of challenges in disambiguation and tone sandhi were identified during the implementation, which we discuss in detail. A comparison of existing voices is designed, based on these cases, to better understand the performance level of commercial TTS systems. The results verify our conjecture about the difficult cases and also show that there is considerable disagreement on tone sandhi patterns among the voices. Further research on these topics will contribute to the development of future TTS voices.


Contents

Acknowledgements

1 Introduction
   1.1 Purpose
   1.2 Limitations
   1.3 Outline

2 Text-to-Speech Systems
   2.1 Architecture
   2.2 Components of Unit Selection TTS system
      2.2.1 Text Normalization
      2.2.2 Phonetic Analysis
      2.2.3 Prosodic Analysis
      2.2.4 Internal Representation
      2.2.5 Waveform Synthesis
   2.3 Other approaches to synthesize speech
      2.3.1 Articulatory Synthesis
      2.3.2 Formant Synthesis
      2.3.3 Diphone Synthesis
      2.3.4 Hidden Markov Model-based Synthesis
      2.3.5 New Experimental Approaches

3 An Overview of the Chinese Language
   3.1 Introduction: Chinese or Mandarin?
   3.2 Phonology
   3.3 Phonetic Representation and Romanization
   3.4 Morphology: What is a Word?
   3.5 Writing Systems

4 Implementation
   4.1 Text Normalization
      4.1.1 Tokenization: ZPar
      4.1.2 Normalization
   4.2 Phonetic Analysis
      4.2.1 Lexicons
      4.2.2 Out-of-Vocabulary Words
      4.2.3 Disambiguation
   4.3 Internal Representation: the Phone Set
   4.4 Prosodic Analysis
      4.4.1 Prosody Beyond Tones
      4.4.2 Third Tone Sandhi
      4.4.3 Yi- and Bu-Tone Sandhi
   4.5 Waveform Synthesis
      4.5.1 Speech Database
      4.5.2 Segmentation and Generating the Output

5 Evaluation
   5.1 Evaluation Methods
      5.1.1 Intelligibility
      5.1.2 Naturalness
   5.2 Existing Chinese TTS voices
   5.3 Comparing the Voices

6 Conclusion
   6.1 Summary
   6.2 Future Work

Bibliography

A Complete List of Normalization Tasks
B Pinyin to R-sampa Mapping Chart
C Test Cases and Results


Acknowledgements

I would like to express my gratitude to my supervisor Yan for his constant help, devotion, and encouragement during the entire project. His guidance and feedback were invaluable. I am most grateful to Andreas and Kåre, my supervisors at ReadSpeaker AB, for giving me the opportunity to work on this exciting project. Their patience and immense knowledge of TTS answered every question I had. I also wish to thank Erik for his help with the implementation and Filip for checking the phone set. My time at ReadSpeaker has been both enjoyable and productive thanks to all the colleagues, especially the TTS team. I consider myself very lucky to be part of the group.

I am indebted to Hoa and Caroline, who have tirelessly gone through my writing to improve it. This work would not have been possible without the support of my friends and, most of all, my family. I am sincerely thankful for their love and company whenever I need them the most.


1 Introduction

A text-to-speech (TTS) system takes text as input and attempts to generate audio output of the text as it would be read by a human. As speech is the most fundamental form of human language, a wide variety of applications are now equipped with some form of artificial voice, not only for users who have difficulty reading or understanding written text, but also as an aid for the general public. Synthesized speech is also used by people who are unable to talk with their own voice. Speech synthesis has evolved over the years: early attempts include talking machines that imitate articulatory movements, dating back to the 18th century (Jurafsky and Martin, 2009). Researchers in formant synthesis from the 1950s onward successfully produced understandable speech by using varied signals to create speech waveforms, but the quality of the voice is far from natural (Black, 2000). Commercial approaches today are mainly based on the concatenation of recorded speech, made possible by more powerful computers and larger storage space.

The architecture of a modern concatenative TTS system generally includes two parts: text analysis and waveform synthesis (Taylor, 2009). In text analysis, tokenization of words and sentences may first be required, depending on the language. Then the non-standard words in the input text, such as numbers, symbols, and abbreviations, are converted to their written-out forms. The text is then turned into a phonetic transcription, usually by lexicon lookup or by grapheme-to-phoneme rules. Suprasegmental features are also encoded for natural-sounding prosody. The waveform synthesizer then generates the speech by selecting appropriate segments from the speech database according to the transcription and prosodic markings. The output of the system is the artificial speech made by joining the chosen units.

With better computing power and recording devices available, TTS system development is no longer limited to research institutes and laboratories. Many voices on the market are of good quality, covering a wide range of languages. TTS is also a field of research and development for companies that wish to incorporate synthesized speech in their products. Although synthesized voices are now generally comprehensible, the models are continuously improved to handle tricky natural language cases as well as to capture and recreate the correct prosody.


1.1 Purpose

The goal of this thesis project is to explore the development of a text analyzer for a Mandarin Chinese TTS voice under the models described by Taylor (2009) and Jurafsky and Martin (2009). Mandarin Chinese is known for its logographic script and tones, which require NLP approaches different from those used for European languages. The result of the project is a prototype capable of handling many common text analysis tasks in a Chinese TTS system. The project is carried out in collaboration with ReadSpeaker AB, a TTS company in Uppsala, Sweden, which provides TTS solutions for digital texts and applications. By working with a non-alphabetic, tonal language like Mandarin Chinese, we also hope to improve the robustness of the text analyzers used at ReadSpeaker in general.

1.2 Limitations

Although Latin letters and foreign words may occur in Chinese text, we process only Chinese characters and speech sounds in this project; non-character words are neither analyzed nor read in the output. Our focus is on improving the rule-based text analysis rather than fixing individual word errors, so the output audio is only a demonstration of how the rules work, not a sample voice of the product.

1.3 Outline

Chapter 2 provides an overview of the major text-to-speech approaches, with a focus on concatenative synthesis and unit selection. Chapter 3 provides the linguistic background of Chinese relevant to our TTS system. The implementation is described in Chapter 4. The common criteria for TTS evaluation are presented in Chapter 5, along with a survey of some existing Chinese TTS services and a comparison of their performance. Chapter 6 sums up the project and discusses possible future work.


2 Text-to-Speech Systems

Modern TTS systems are computer programs that convert digital text into equivalent audio output. The conversion mainly comprises two processes: text analysis and waveform synthesis. This chapter provides an overview of the major framework and examines the components of these processes.

2.1 Architecture

Figure 2.1 shows the common form model of a TTS system proposed by Taylor (2009), widely adopted as the basis of concatenative TTS systems. The model is divided into two layers: the spoken and written signals of natural language, and their components, graphemes and phonemes. As both text and speech can be ambiguous in natural languages, Taylor introduces the idea of definite underlying forms (represented by words in the figure) that allow a one-to-one mapping of graphemes to phonemes. In this model, text analysis and waveform synthesis are viewed as decoding and encoding processes whose goals are to recover the forms and generate the speech accordingly. However, as forms do not exist in reality, the input is decoded into graphemes that serve as hints for the forms, which are later turned into perceivable phonemes. Taylor's model provides a slightly abstract but general architecture of TTS systems: the main task is to find out what the text refers to and return the correct phones.

Figure 2.1: Common form model. Note that "words" is an underlying level that serves as the transition between graphemes and phonemes.


An adapted version of Taylor's common form model, shown in Figure 2.2, is described by Jurafsky and Martin (2009), complete with more detailed steps. Here the input text goes through a number of processes in text analysis before being converted to a phonemic representation of the language. In waveform synthesis, matching speech units are chosen from the speech database and joined together to create the speech output. At every stage, the output is passed on to the next component as its input, creating an hourglass-shaped model. Except for prosodic analysis, most components are essentially similar to those in Taylor's common form model. Text decoding deals with both normalization and phonetic analysis, the latter playing an important role in the grapheme-to-phoneme transition. This model of unit selection and the terms used here will be adopted throughout our discussion.

Figure 2.2: Hourglass model for unit selection architecture.
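The data flow of this hourglass model can be sketched as a chain of functions, each stage consuming the previous stage's output. Everything inside the stages below (the toy lexicon, the trailing pause marker, the pass-through unit selection) is a hypothetical placeholder; only the overall pipeline shape follows the model.

```python
# Sketch of the hourglass pipeline: each stage consumes the previous
# stage's output. The stage bodies are illustrative placeholders, not
# any real TTS implementation.

def text_normalization(text):
    # expand non-standard tokens; here reduced to whitespace tokenization
    return text.split()

def phonetic_analysis(tokens):
    # map tokens to phone strings via a toy lexicon (hypothetical entries)
    lexicon = {"hello": "HH AH L OW", "world": "W ER L D"}
    return [lexicon.get(t.lower(), "<oov>") for t in tokens]

def prosodic_analysis(phones):
    # attach a trivial prosodic marking: a pause at the end of the utterance
    return phones + ["<pause>"]

def unit_selection(internal_rep):
    # stand-in for the database search: return the representation unchanged
    return internal_rep

def tts(text):
    return unit_selection(prosodic_analysis(phonetic_analysis(text_normalization(text))))

print(tts("hello world"))
```

The point of the shape is that each stage only needs to understand the representation produced by the stage before it, which is what lets the components be developed and replaced independently.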

2.2 Components of Unit Selection TTS system

This section introduces the components of unit selection shown in Figure 2.2, with a focus on text analysis. Unit selection is a type of concatenative synthesis that joins the longest possible matching utterances from a large database, thus preserving the naturalness of the speech segments. We also incorporate the list of text analysis processes in Taylor (2009) to give a more general view of the whole system. The framework is language independent and most examples are in English, but approaches specifically required for Chinese are given as well.


2.2.1 Text Normalization

The goal of text normalization is to convert non-standard tokens in the written text into their spelled-out forms. Common normalization handles numbers, acronyms, abbreviations, symbols, and so on. The pronunciation of numbers usually depends on the context: 1685 can be read as a year or as a cardinal number, and 10 can be "October" or "tenth" in 2017/10/10. Some acronyms are spoken letter by letter (BMW), some as normal words (UNESCO). Abbreviations need to be expanded, and sometimes disambiguation is required to find the correct form (Dr. can be "drive" or "doctor"). Another issue with abbreviations is the punctuation that comes with them. Some punctuation marks are helpful hints for determining sentence boundaries, where we can later insert pauses or adjust the intonation accordingly, but the exceptions caused by abbreviations must be handled first. Sentence splitting, or sentence tokenization, is thus an important task for such languages.

Besides the sentence-final position, pauses may occur in the middle of an utterance. Word boundaries can be used to locate their possible occurrences, but for languages like Chinese that do not separate words with spaces, tokenization is necessary at this stage to ensure that no pauses will be inserted incorrectly later. After this stage, the text should contain only graphemes of the language, possibly with sentence and word boundaries marked up for prosodic analysis. Taylor (2009) mentions some other issues, such as the encoding of the input and multilingual text, which may also be dealt with at this stage.
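The classify-then-expand idea behind number normalization can be sketched as follows. The token classes, the date pattern, and the small word tables are illustrative assumptions for this one example, not the rules of any particular system.

```python
import re

# Minimal sketch of context-dependent normalization: a digit string is
# classified first, and the class decides how it is spelled out. The
# word tables cover only this toy example; real systems use full
# number-expansion grammars.

PAIR = {"10": "ten", "16": "sixteen", "17": "seventeen", "20": "twenty", "85": "eighty-five"}
MONTH = {"10": "October"}
ORDINAL = {"10": "tenth"}

def classify(token):
    if re.fullmatch(r"\d{4}/\d{2}/\d{2}", token):
        return "date"
    if re.fullmatch(r"1[5-9]\d\d|20\d\d", token):
        return "year"  # heuristic: a bare four-digit number in this range reads as a year
    return "word"

def expand(token):
    kind = classify(token)
    if kind == "year":
        # years are read in two-digit pairs: 1685 -> "sixteen eighty-five"
        return f"{PAIR[token[:2]]} {PAIR[token[2:]]}"
    if kind == "date":
        year, month, day = token.split("/")
        return f"{MONTH[month]} {ORDINAL[day]}, {expand(year)}"
    return token

print(expand("1685"))        # sixteen eighty-five
print(expand("2017/10/10"))  # October tenth, twenty seventeen
```

Note that the same substring "10" is expanded three different ways depending on its position in the date, which is exactly the context dependence described above.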

2.2.2 Phonetic Analysis

The next step is to find the corresponding phonetic transcription for the normalized text. Pronunciation dictionaries are often used to look up the words, but there are also many out-of-vocabulary (OOV) words that require special techniques. A common type of OOV word is the proper noun: while a dictionary may include the most frequent ones, it is not possible to list them all. Concatenation of existing entries is practical for many languages. For example, apelsin and juice can be combined to produce the pronunciation of apelsinjuice ('orange juice') in Swedish. Although this example is not an OOV word, the method works well for Chinese, as we can guess the pronunciation of any word as long as all the characters used are in the lexicon. For alphabet-based languages, grapheme-to-phoneme conversion is possible when the mapping between letters and sounds is systematic. This is relatively straightforward for languages like Finnish, but English may require additional rules for irregular cases. Spiegel (2003) seeks to identify the possible source language of names by looking at letter sequences before applying language-specific grapheme-to-phoneme rules. A grapheme-to-phoneme approach in Chinese relies on known characters. A human may be able to guess the pronunciation of an unknown character from its components, but a computer processing an encoding like Unicode has no means to handle characters with unknown or no encoding. To our knowledge, Chinese grapheme-to-phoneme rules do not deal with unknown characters. As the input of TTS is always in digital format, connecting the system to a larger dictionary will probably give a much better result than teaching computers to recognize the components of a character.

The first stage of text analysis in Taylor (2009) is viewed as a categorization task. The class of the input token is first identified: it can be a natural language token, an abbreviation, an email address, a time expression, and so on. The decoding is then based on this class to find the underlying form. During this stage, the token may be mapped to multiple possibilities, and disambiguation is used to determine the correct form. Approaches include using part-of-speech (POS) tags (record as noun or verb), looking at the context (bass occurring close to music probably means the instrument, not the fish), or machine learning models such as Bayesian classification and decision trees (Yarowsky, 1997). Disambiguation is an important part of phonetic analysis, but it is not limited to words consisting only of graphemes; the earlier date and Dr. examples show cases where disambiguation is needed in normalization. Depending on the design of a system, the steps in text normalization and phonetic analysis may be carried out in a different order.
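The dictionary lookup with character-level back-off for Chinese OOV words can be sketched as below. The lexicon entries (Pinyin with tone numbers) are a toy illustration, not our actual lexicon format, and the word/character split is a deliberate simplification.

```python
# Sketch of lexicon lookup with character-level back-off for OOV words:
# if a multi-character word is missing from the word lexicon, its
# pronunciation is concatenated from the per-character lexicon. The
# entries (Pinyin with tone numbers) are a toy illustration.

WORD_LEXICON = {"北京": "bei3 jing1"}
CHAR_LEXICON = {"北": "bei3", "京": "jing1", "烤": "kao3", "鸭": "ya1"}

def transcribe(word):
    if word in WORD_LEXICON:
        return WORD_LEXICON[word]
    # OOV back-off: look each character up individually
    try:
        return " ".join(CHAR_LEXICON[ch] for ch in word)
    except KeyError:
        # unknown character: no grapheme-to-phoneme fallback exists (see text)
        return "<unknown>"

print(transcribe("北京"))  # bei3 jing1 (found as a whole word)
print(transcribe("烤鸭"))  # kao3 ya1 (concatenated from characters)
print(transcribe("烤X"))   # <unknown>
```

The last case shows the limitation discussed above: when a character is missing from the lexicon entirely, the back-off has nothing to concatenate from.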

2.2.3 Prosodic Analysis

The last part of text analysis in Jurafsky and Martin (2009) is prosodic analysis, which Taylor (2009) also names as a separate stage in his overview of processes. Prosody is defined as the non-segmental features that generally stretch over more than one sound (segment) in spoken language, hence the term suprasegmental (Cruttenden, 1997). The tricky aspect of prosody is that current annotation conventions and models do not seem able to capture all the features required for reproducing the speech, which is why synthesized speech is often described as monotonous. The TTS input also offers few clues on how the text should be read. One type of prosodic marking from text we have seen is the sentence and (for Chinese) word boundaries. Pauses may also occur sentence-internally, and their prediction generally relies on classification based on annotated training data.

At the word level, a token can be stressed for syntactic or pragmatic reasons. English examples of the former include the content/function word distinction and compound noun stress, which can be covered by rules. Pragmatic prominence is harder to predict, as the distinction relies on semantic or discourse information. Intonation is determined by pitch at the sentence level; in unit selection, it is mostly based on properties that come with the data rather than generated by models. Taylor (2009) regards Chinese tones as suprasegmental features in his model, but linguistically speaking they behave more like phonemes in that the pitch change affects the meaning of a word (Ladefoged and Johnson, 2014). This is the case for our prototype, as the ReadSpeaker framework is designed for non-tonal European languages. To sum up, the implicitness of prosody in written text and the absence of effective internal prosodic models make reproducing natural, human-like prosody difficult.

2.2.4 Internal Representation

The processed input text is then turned into an internal phonemic representation, which generally consists of phones and prosodic information based on the previous analysis. For unit selection, the prosodic markings may be rather simple, with only stress (for prominence), pauses, and some indication of intonation from the punctuation (such as question marks). Other synthesis methods that require further signal processing to modify the prosody will also need information on the fundamental frequency and duration of the phones. This is the final stage of text analysis/decoding, and the output is then passed on to the waveform synthesis module.

2.2.5 Waveform Synthesis

In this section, we briefly introduce the Hunt and Black algorithm for finding and joining the best segments of utterances in unit selection (Hunt and Black, 1996). The length of a single unit varies across systems (ranging from half-phones to syllables), and the choice affects the size of the database. The preference can also be language specific: a syllable-based unit is preferable for Chinese, as a single grapheme (character) almost always represents a syllable (Taylor, 2009). Our prototype, however, is half-phone based, meaning the unit is half of a phone. Phones and their variants are most common for European languages, as the large number of possible syllables renders syllable-based systems impractical: Barker lists 15,831 syllables in English, while only around 1,600 are reported for Chinese (Duanmu, 2007).

Hunt and Black's algorithm determines the best utterance sequence by looking at the target cost and the join cost. The target cost is calculated from the distance between the desired unit and a candidate. Ideally, the chosen unit and the target should have identical qualities (phone, stress, position in the word/sentence, and adjacent phones), which is possible with databases of large coverage. The less alike the two units are, the larger the target cost. The join cost computes how natural the concatenation would be by checking the acoustic similarity of the units at their boundaries. If two units co-occur in the database, the cost is zero, as the join is considered completely natural. Again, we want the join cost to be as low as possible, so existing words or phrases in the database are favored over concatenated sequences. The combination of the two costs ensures that the generated speech is as accurate and natural as possible. Modifications such as weights are used to optimize results for different systems, but the concept remains the same.
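The two-cost search can be sketched in a few lines. The unit features and cost values below are toy assumptions; real systems use weighted acoustic and linguistic features and a Viterbi search rather than the brute-force enumeration shown here.

```python
import itertools

# Minimal sketch of the Hunt and Black unit selection search: choose one
# candidate unit per target position so that the total of target costs
# and join costs is minimized. All features and costs are toy values.

def target_cost(target, cand):
    # 0 if the candidate matches the desired context, else a penalty
    return 0 if cand["context"] == target["context"] else 1

def join_cost(left, right):
    # units adjacent in the database join with zero cost (fully natural)
    if left["db_pos"] + 1 == right["db_pos"]:
        return 0
    return abs(left["pitch_end"] - right["pitch_start"])

def best_sequence(targets, candidates):
    best, best_cost = None, float("inf")
    for seq in itertools.product(*candidates):
        cost = sum(target_cost(t, u) for t, u in zip(targets, seq))
        cost += sum(join_cost(a, b) for a, b in zip(seq, seq[1:]))
        if cost < best_cost:
            best, best_cost = seq, cost
    return best, best_cost

targets = [{"context": "a"}, {"context": "b"}]
candidates = [
    [{"context": "a", "db_pos": 0, "pitch_end": 100, "pitch_start": 100},
     {"context": "a", "db_pos": 5, "pitch_end": 120, "pitch_start": 120}],
    [{"context": "b", "db_pos": 1, "pitch_end": 100, "pitch_start": 100},
     {"context": "b", "db_pos": 9, "pitch_end": 130, "pitch_start": 130}],
]
seq, cost = best_sequence(targets, candidates)
print([u["db_pos"] for u in seq], cost)  # the database-adjacent pair (0, 1) wins with cost 0
```

The example reproduces the preference described above: the pair of units that were recorded next to each other in the database joins for free and is chosen over acoustically mismatched alternatives.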

2.3 Other approaches to synthesize speech

This section provides a short introduction to other speech synthesis approaches. Articulatory synthesis and formant synthesis are among the earliest attempts, but these techniques are rarely used now. Diphone synthesis is a type of concatenative method that predates unit selection, with the diphone as the base unit (Jurafsky and Martin, 2009). The HMM (Hidden Markov Model) is applied in many areas of language technology, and its potential in speech synthesis has been verified (Tokuda et al., 2013). In recent years, there has been great interest in combining state-of-the-art machine learning models, such as deep learning and neural networks, with the existing pipelines. Although unit selection remains the most popular model for commercial TTS voices, these frameworks are not mutually exclusive, and progressive integration is expected in the near future.

2.3.1 Articulatory Synthesis

Articulatory synthesis generates speech by imitating human articulatory movements mechanically, one of the earliest examples being the talking machine invented by Wolfgang von Kempelen in the 18th century (Jurafsky and Martin, 2009). The input of the model is a sequence of parameters that specifies how the machine should be tuned. The model usually involves passing airflow through adjustable tubes with constrictions that mimic the vocal tract. Such a model requires extensive knowledge of the articulatory movements, but the data is often obtained by invasive or arguably harmful means (such as MRI or X-ray). It is also immensely difficult to produce voices of good quality with this method within the confines of a practical model. The intrusive nature and complexity make this approach mostly obsolete now, but it is still studied in related fields such as audio-visual synthesis.


Figure 2.3: A complex periodic wave consisting of a 100 Hz and a 1000 Hz simple wave (left), and an aperiodic wave with an irregular pattern (right) (Johnson, 2003)

2.3.2 Formant Synthesis

In acoustic phonetics, sound waves are either periodic or aperiodic: the former repeat in a regular pattern, while the latter lack any repeating pattern (Johnson, 2003). For voiced sounds such as vowels, the waveforms are complex periodic waves consisting of multiple simple waves (see Figure 2.3), each vowel having its distinctive simple-wave components. As vowels are characterized by their formants, studying and modifying the composed waves can change the pitch and other acoustic qualities. This feature of sound gives rise to the approach of formant synthesis: a sound source is passed through formant filters that block certain frequencies, producing sounds with the required formants. For unvoiced sounds with aperiodic waves, white noise usually serves as the source to create the friction or obstruction in the consonants. Although this approach is more manageable and produces better results than articulatory synthesis, the output quality is still far from natural. The Klatt synthesizer (Klatt, 1980), one of the most complex formant synthesizers built in the 1980s, is described as intelligible¹, but the limitations of the rule-based model make it hard to capture the dynamics of natural speech. For this reason, this approach is no longer popular.

2.3.3 Diphone Synthesis

As the predecessor of unit selection, diphone synthesis follows a general outline very similar to what we introduced in the previous sections. The main differences are the size of the speech database and the signal processing procedure (Jurafsky and Martin, 2009). A diphone is a unit that stretches from the middle of one phone to the middle of the next. The idea is that by concatenating these units, we can avoid the gaps between phones, as they are naturally joined.

A diphone database stores one copy of each diphone, so if there are N phones in the language, there will be at most N² diphones (some combinations might be impossible). The diphones are recorded in context, clipped out from the recordings, and labeled to create the database. To generate the speech, the diphones are chosen according to the internal representation. As the units may have different pitch, prosodic adjustment is necessary before concatenation. One common technique is the TD-PSOLA (Time-Domain Pitch-Synchronous Overlap-and-Add) algorithm. The units are lengthened or shortened by repeating or clipping the existing waveform. To modify the fundamental frequency, waves are framed according to their pitch; the frequency is increased by overlapping the frames, bringing the waves closer together. Compared to unit selection, diphone synthesis requires more prosodic information, such as duration and fundamental frequency. The approach is still common as its footprint is relatively small and it requires less computational power, but naturalness is unavoidably affected by any signal alteration (Jurafsky and Martin, 2009).

¹ You can hear it at http://www.cs.indiana.edu/rhythmsp/ASA/partD.html, along with the output of other earlier synthesizers (Klatt, 1987).
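The repeat-or-drop idea behind duration modification can be illustrated with a few lines of code. This shows only the frame-duplication principle on a plain sample list; real TD-PSOLA frames are pitch-synchronous and are windowed and overlap-added, which is omitted here.

```python
# Toy illustration of duration modification: speech is cut into short
# frames, and frames are repeated or dropped to stretch or compress the
# unit. Real PSOLA frames are pitch-synchronous and overlap-added with
# a window; this sketch only shows the repeat/drop principle.

def change_duration(samples, frame_len, factor):
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    out = []
    acc = 0.0
    for frame in frames:
        acc += factor
        while acc >= 1.0:  # emit (repeat) the frame as many times as needed
            out.extend(frame)
            acc -= 1.0
    return out

signal = list(range(8))                 # stand-in for audio samples
print(change_duration(signal, 2, 2.0))  # each 2-sample frame emitted twice
print(change_duration(signal, 2, 0.5))  # half of the frames dropped
```

Pitch modification works analogously but shifts the frames in time before overlap-adding them, which is why PSOLA can change duration and fundamental frequency independently.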

2.3.4 Hidden Markov Model-based Synthesis

The Hidden Markov Model (HMM) is a dominant model in the field of automatic speech recognition (ASR), but its generative nature can also be used to synthesize speech. The basic concept is to swap the input and output of ASR in model training and synthesis; instead of having speech as the direct output of the HMM, parameters for a source-filter system (similar to formant synthesis) are generated. The parameters are then used to produce speech signals. Unlike the rule-based model in formant synthesis, an HMM learns from the data automatically and is therefore better at handling the dynamic nature of speech. The training creates multiple HMMs by finding the most probable speech parameters given the phonemes and the linguistic representation. In synthesis, the most likely models are chosen according to the specification of the input text and concatenated to form a sentence-level HMM that generates the parameters for the synthesizer. The details of HMM synthesis are described by Tokuda et al. (2013), who developed many important techniques for smoothing the discrepancies between the HMMs and processing the acoustic representations.

HMM synthesis has a number of advantages over other approaches. While it requires much less memory, the intelligibility and naturalness are reported to be on the same level as unit selection (Tokuda et al., 2013). The training and decoding processes are largely automated and only a small part of the model is language dependent, so it is relatively easy to develop a new voice. This makes HMM synthesis one of the most researched methods in recent years, and there have been a number of approaches combining it with unit selection (Tokuda et al., 2013).


2.3.5 New Experimental Approaches

The rapid progress in machine learning has given rise to many novel methods. Two of the most popular terms are neural networks and deep learning. A neural network is a machine learning model that uses nodes (known as neurons) to accumulatively process the input. The input is often passed through a number of nodes before finally reaching the output stage. The connections between multiple neurons form a net that can have many layers, hence the term deep learning. With neural networks, it is possible to tackle more complex and dynamic problems that are difficult to solve with traditional mathematical, rule-based models. The following paragraphs briefly introduce how neural networks have been used in the development of TTS in the past few years.

A number of components in text analysis have been modeled with neural networks, including grapheme-to-phoneme conversion (Rao et al., 2015) and intonation prediction (Ronanki et al., 2016). For waveform synthesis, WaveNet² uses neural networks to model the waveforms directly (van den Oord et al., 2016). Char2Wav³ (Sotelo et al., 2017) and Tacotron (Wang et al., 2017) are two recently developed end-to-end synthesis systems, meaning that no text preprocessing is required. While Char2Wav outputs parameters for waveform generation, Tacotron is reported to synthesize spectrograms directly. Arik et al. (2017) introduce Deep Voice, a synthesizer that models all the components with neural networks and requires no prior TTS framework. These experimental approaches have achieved impressive results with relatively small amounts of data (at the cost of intensive computation). Deep Voice, for example, was trained on a database with around 20 hours of speech, compared to hundreds of hours for a unit selection database. With these new approaches and techniques, we can expect significant changes and progress in TTS development within a short period of time.

² https://deepmind.com/blog/wavenet-generative-model-raw-audio/
³ http://www.josesotelo.com/speechsynthesis/


3 An Overview of the Chinese Language

Text analysis is mostly language dependent, except for the new approaches mentioned in the previous section. In our case, many rules and processes in the framework are adapted to the linguistic features of Chinese. We therefore provide an introduction to those aspects of the language relevant to our implementation.

3.1 Introduction: Chinese or Mandarin?

When referring to the standard language in China, Chinese and Mandarin seem to be used interchangeably among the general public. For cultural and historical reasons, Chinese is often considered by its speakers to be one language with many varieties (dialects)¹, but the differences between these dialects can be immense. Chen (1999) names seven major dialect groups, of which Mandarin (Beifanghua, literally 'northern speech') is the largest. According to the Oxford English Dictionary, the name Mandarin originally referred to the officials in imperial China. European missionaries visiting China in the 16th century recorded a universal version of the language used by the officials (Guanhua, 'official speech'), hence the name Mandarin (Coblin, 2000). Standard Chinese is based on the Beijing dialect, which is a member of the Mandarin group (Chen, 1999). For this reason, both Chinese and Mandarin are used as names of the language. Chinese speakers often regard Beijing dialect or Mandarin as synonyms of the standard language, although some distinctions do exist (Chen, 1999).

Ethnologue counts more than 1 billion Mandarin speakers: 70% of the 1.3 billion people in China and 19 million people in Taiwan speak some form of Mandarin dialect, making Mandarin the language with the most native speakers. Mandarin speakers can also be found in Singapore and in Chinese immigrant communities in many countries. These variants, along with the standard form of the language, do not differ from each other very much compared to the other Chinese dialect groups (Ramsey, 1987). This mutual intelligibility means that the more refined classification of Mandarin is usually limited to dialectological research. It is common to simply use Mandarin for the standard Chinese language, as opposed to tongues in other major dialect groups such as Wu, Min, and Yue (Cantonese). All these groups belong to the Sino-Tibetan language family.

1 The definition of dialect varies considerably among linguists. Here we adopt the view that all tongues are dialects; a language is either a hypernym of these related tongues, or one of the dialects with more predominance and a standardized form that represents the whole group.

In this project, the term Mandarin Chinese is used in the title and introduction for clarity, but from now on we will simply refer to the common/official form of the language as Chinese.

3.2 Phonology

We previously mentioned that syllable-based units are well suited for Chinese, as the number of syllables is manageable. A single Chinese character can almost always be mapped to a syllable. Since a character is the smallest writing unit (grapheme) of the TTS input, it is reasonable to use the syllable or its variants as the base units, and many such systems exist (Shih and Sproat, 1996). For our framework, which uses half-phones, it is however easier to divide the syllables into a handful of phones instead of specifying all possible syllables as units. It is nevertheless important to know the syllable structure of Chinese, as tones are syllable-based.

A Chinese syllable can be transcribed as CGVX, where C is an initial consonant, G a glide, V a vowel, and X either a final consonant or a glide (Duanmu, 2007). To reduce the number of phonemes in the phone set, we have included the glide with the vowel and regard the combinations as diphthongs. The syllable structure can thus be rewritten as CVC, where both consonants are optional. While linguists disagree somewhat on the number of vowels and initial consonants, it is mostly agreed that the only possible final consonants are [n] and [ŋ]. Instead of the usual onset-vowel-coda analysis, it is more common in Chinese phonology to group the final consonant with the vowel and refer to the combination as the final; the initial consonant is accordingly known as the initial. Tone, the use of pitch to distinguish meaning, is superimposed on the syllable. There are four tones in Chinese, usually transcribed phonetically with the Chao tone letters (Chao, 1930), which express pitch on a scale from 1 (lowest) to 5 (highest). The four tones can be represented as 55, 35, 214, and 51, or with the four diacritics (as in ā, á, ǎ, à) on the vowel in Pinyin; they are conventionally referred to as tones 1 to 4. If no tone is present, the syllable is usually regarded as unstressed and has a shorter duration, which is common among a number of function words. More details will be presented in the implementation chapter where we introduce our phone set.

2Chinese does not have official status in Taiwan despite being the “default” language.


3.3 Phonetic Representation and Romanization

The transcription of Chinese characters using the Latin alphabet dates back to the 19th century and there have been many attempts to create a practical phonetic representation for Chinese since then. The most widely used one is Hànyǔ Pīnyīn, usually referred to as Pinyin (‘spelled-out sound’) for short.

Pinyin has been the official romanization system in China since its creation nearly 70 years ago (Chen, 1999). From dictionary entries to input methods, it is now the default phonetic representation for many native as well as non-native speakers of Chinese. The tones are marked with diacritics on the vowel of the syllable, but numbers are also commonly used for convenience (as in Han4yu3 Pin1yin1).

A very important feature of Pinyin is that it is phonemic rather than phonetic: allophones in complementary distribution may be represented by the same symbol. It is nevertheless a good basis for our phone set, as our algorithm will select the most appropriate context from the database. We will say more about the mapping between Pinyin and our phone set in the next chapter.
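The correspondence between the diacritic and number notations can be illustrated with a small Python sketch (a hypothetical helper, not part of our system): after Unicode decomposition, the tone mark is one of four combining characters, while other marks, such as the umlaut in ü, are kept.

```python
import unicodedata

# Pinyin tone diacritics (as combining characters) -> tone numbers
TONE_MARKS = {'\u0304': '1', '\u0301': '2', '\u030c': '3', '\u0300': '4'}

def diacritics_to_numbers(syllable):
    """Convert one Pinyin syllable from diacritic to number notation,
    e.g. 'hàn' -> 'han4'. Toneless (neutral-tone) syllables pass through."""
    tone, letters = '', []
    for ch in unicodedata.normalize('NFD', syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)  # keeps non-tonal marks such as the umlaut in ü
    return unicodedata.normalize('NFC', ''.join(letters)) + tone

print(diacritics_to_numbers('hàn'), diacritics_to_numbers('yǔ'))  # han4 yu3
```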

3.4 Morphology: What is a Word?

The concept of word is relatively straightforward for speakers of European languages, but the notion is not directly applicable to Chinese. Packard (2000) illustrates the difference using the term sociological word, proposed by the renowned Chinese linguist Yuen Ren Chao: "the unit between a phoneme and a sentence that the general public is aware of". In English this unit is the word, but its Chinese counterpart is the zì (字), a character or spoken syllable. A single character can be a word that behaves much as in English, but it is more common to have words consisting of two or more characters, known as cí (詞). For example, 手 is the character for hand, but 手機 (literally 'hand' and 'machine') means cellphone. New words are created by combining existing characters. The entries of a zì lexicon remain largely unchanged, as new characters are rarely created nowadays, whereas a cí lexicon listing new combinations of characters is expanded all the time. Some combinations express more complex ideas and are more or less the equivalent of English fixed expressions or phrases, but they are sometimes included in a cí lexicon as well. In short, a character is a morpheme in Chinese. Many characters have their own meanings in isolation and can be used to build words or larger units such as phrases.

3 It should be noted that Pinyin was only introduced in Taiwan recently. A number of romanization systems have been adopted for names and road signs over the years, causing some inconsistency in transliteration. My name, for example, would be Qiaoting in Pinyin. The most popular phonetic representation in Taiwan is Zhuyin, 'sound-annotating (symbols)', with 37 symbols derived from Chinese characters. Zhuyin remains dominant in Taiwan as it is mandatory at school. When Latin transliteration is required, Pinyin is often encouraged. However, as most people in Taiwan are unaware of the differences between the romanization systems, it is not uncommon that an old or even mixed system is used.

Knowing the definition of a word is important for our TTS system, as pauses occur between word groups. It is possible to create a Chinese TTS voice that only considers individual characters, but the result is often poor: pauses may be inserted incorrectly within a word (such as 手 | 機). It is also hard to predict the duration of a syllable when characters are taken out of their context (the combination), making the voice sound choppy or overlapping. Some characters can be pronounced in more than one way; the combination usually makes the reading clear, but when characters are processed separately we have no means to determine the correct pronunciation. It is therefore common to use a tokenizer for Chinese NLP; the process of finding word boundaries is also known as segmentation. It should however be noted that there are no objective criteria for segmentation: Shih and Sproat (1996) report 75% agreement between native speakers. For our TTS system, the segmentation should at least mark up one acceptable version of the word boundaries so that no incorrect pauses are inserted.

3.5 Writing Systems

Chinese is written in Chinese characters. As in many other written languages, variations existed before a standardized version was established. Among the variations are simplified characters with fewer strokes. Simplified characters remained largely informal until the late 1950s, when the Chinese government introduced a list of simplified characters as the standard character set. The list includes some existing simplified forms as well as some newly created characters based on the cursive script; the scheme also replaced some heterographs with a single character. Meanwhile, the unsimplified, traditional script continues to be used in Taiwan, Hong Kong, and Macao. While many characters are written differently, literate Chinese speakers have no difficulty understanding both scripts, although they mostly choose to write in the system they are familiar with.

The logographic nature of Chinese characters means that variants could easily be created by simply adding or removing strokes, especially before standardization. Large dictionaries may count tens of thousands of characters, but up to 40% of them may be variant characters (Chen, 1999). With computers, however, it is no longer possible to create new variants by modifying the strokes: characters that are not encoded cannot be input or processed, which makes the character set even more standardized as the existing characters continue to be used for digital texts. Our TTS system should accept input in both traditional and simplified characters. As long as the text is digitally encoded, our system should have no problem recognizing the characters. Encoding used to be an issue, but as more and more systems support Unicode, we ignore other less common encodings and assume that the input is in UTF-8.


4 Implementation

In this chapter, we present our implementation of the text analyzer components in the order shown in the hourglass model in Figure 2.2, as proposed by Jurafsky and Martin (2009).

4.1 Text Normalization

In text normalization, our goal is to convert the input text into words that can later be processed to find their pronunciation. Two tasks are involved in this step: tokenization locates the word boundaries, and the normalizer turns non-characters into their spelled-out forms.

4.1.1 Tokenization: ZPar

Tokenization, also known as segmentation, is required for Chinese for both phonetic and prosodic reasons. Without word boundaries, pauses might be inserted incorrectly within a word, and the combination of characters helps us determine the pronunciation. We use ZPar (Zhang and Clark, 2011), a statistical parser with segmentation and part-of-speech tagging components, to tokenize the input. After compiling, ZPar creates a model by training repeatedly on a segmented or tagged text; the model is then used for POS tagging and segmentation.

The model for simplified Chinese is trained on Chinese Penn Treebank 5.0 (CTB5). Another model, for traditional Chinese, is created on the training set of the CKIP corpus provided by Academia Sinica in Taiwan for the Second International Chinese Word Segmentation Bakeoff in 2005 (Emerson, 2005). Unlike the Penn Treebank, the CKIP dataset does not come with POS tags, so only segmentation is possible. Once the model is loaded, ZPar is reported to process around 50 sentences per second with satisfactory accuracy (Zhang and Clark, 2011).

This is a huge advantage over other dictionary-based tokenizers that we tested.

1 ZPar can be downloaded at https://github.com/frcchang/zpar/releases. An online manual can be found at http://people.sutd.edu.sg/~yue_zhang/doc/index.html

2 Thanks to Yan Shao for providing the model. To use this model, replace ctb5 with penn on line 16 of zpar-0.7.5/src/chinese/tag.h.


A drawback of ZPar is that the model is difficult to adjust once trained. The perceptron-based model is opaque, making it impossible to modify directly. An alternative is to add new words to the training set. Our experiment shows that ZPar is capable of learning a name correctly after four iterations of training on a small paragraph of around 150 characters. It is however impractical to retrain the model every time we want to expand the vocabulary, given the size of our training data, and it is unclear how many occurrences of a word the model needs in order to learn the pattern. Furthermore, modifying the training set affects the overall performance of ZPar, so the evaluation has to be repeated for every retraining to make sure that we have the best model available. Despite this issue, ZPar's performance on the CKIP test set, with 4.3% OOV, is a competitive 95% (Zhang and Clark, 2011), suggesting that ZPar's algorithm predicts the boundaries of unknown words better than other bakeoff participants. For our tokenization task, we will therefore continue to use the same models. Further discussion of OOVs follows in the next section.

An issue with the ZPar segmentation module is that it sometimes separates an English word into several parts. Although we do not plan to process Latin letters in this project, it is important to keep them intact. Our guess is that the model fails to learn the English words because they are rare in the training data. The word-based perceptron algorithm of ZPar also takes the position of a character within a word into consideration (Zhang and Clark, 2011), which means that English words with more letters are likely to be divided under the Chinese model. The issue persisted after we removed all Latin letters from the CKIP training data. Our workaround is a small script that takes the unsegmented and segmented texts as input and replaces the falsely segmented letters with the ones from the original text. For some reason, this problem only occurs with traditional Chinese texts; our test shows that simplified Chinese is not affected by the number of English words in the training set.

To sum up, we have two ZPar models, one for traditional and one for simplified Chinese. As the simplified training set contains POS tags, we are able to tokenize simplified Chinese with the POS tagging module. Traditional Chinese uses the segmentation module, which occasionally breaks English words; a script is written to repair the separated letters.
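The repair script can be sketched as follows (a minimal reconstruction under our own naming; it assumes the original and segmented texts contain the same Latin-letter runs in the same order):

```python
import re

def fix_latin_runs(original, segmented):
    """Repair Latin words that the segmenter broke apart: each maximal
    Latin run in the original replaces the corresponding, possibly
    space-broken, run in the segmented output."""
    runs = iter(re.findall(r'[A-Za-z]+', original))
    # In the segmented output a word may appear as letter chunks separated
    # by single spaces, so match such sequences as one unit.
    return re.sub(r'[A-Za-z]+(?: [A-Za-z]+)*', lambda m: next(runs), segmented)

print(fix_latin_runs('我用iPhone打字', '我 用 iPho ne 打字'))  # 我 用 iPhone 打字
```

Note that the pattern would also glue two genuinely separate English words that happen to be adjacent; the actual script may handle this differently.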

4.1.2 Normalization

Normalization is the process of converting non-characters such as numbers and symbols into their spelled-out forms. In a rule-based analyzer, they are usually identified by regular expressions, as their patterns are relatively distinctive. Our tokenized text, however, contains spaces and POS tags, which makes the construction of normalization rules very unintuitive. For example, 2017/09/19 is a common way of writing a date, but ZPar's POS tagging module will label it as 2017_CD /_PU 09_CD /_PU 19_CD. Despite ZPar's high accuracy, it is impractical to account for the tags while writing the rules. Our solution for now is to perform normalization before tokenization. Our test shows that ZPar has no problem tokenizing a sequence of Chinese characters embedded in the text. The accuracy may drop slightly, as the training data contains few spelled-out forms, but normalized texts do not seem to pose any other difficulty to the tokenizer.

A common restriction in the normalization rules is that the string we are looking for may not be preceded or followed by any character other than whitespace. This way, 09 in 2017/09/19 will not be extracted from its context and interpreted wrongly. The restriction is however not directly applicable to our untokenized Chinese text, as numbers are not delimited by spaces either. Luckily, the restriction in our existing normalizer only specifies numbers, symbols, and a range of Latin letters, so 2017年 ('year 2017') is still captured by the current rules. This is a fortunate coincidence for Chinese, as we have decided to normalize the text first; the alternative would be to add all Chinese characters to the restriction list and write the rules with the spaces in mind. Our solution can still be problematic in rare cases like 碳-14 ('carbon-14'), where a character is part of the normalization pattern. While the English equivalent is found successfully with the pattern [letters]-[numbers], 碳 is not a valid letter, so the general normalizer that handles all languages reads the hyphen as minus. Exception rules have to be made for all similar cases. Solving the problem at its root would slow down the whole normalization process, as the restriction list would grow many thousand times larger; the text would also have to be tokenized first and the rules adjusted accordingly. We therefore take advantage of this workaround and leave the restriction unchanged for now. This issue shows that the robustness of a TTS framework can be challenged by a typologically different language like Chinese.

Unlike Latin letters, Chinese characters take up double width when displayed on computers. For aesthetic reasons, fullwidth Arabic numerals, letters, and punctuation marks also exist so that they line up with the characters (compare ABC and ＡＢＣ). Although it is not mandatory to use fullwidth numbers in Chinese text, they are not uncommon and must be taken into consideration. A small dictionary is created to map the fullwidth items to their halfwidth counterparts, and fullwidth characters are corrected in the general normalizer. The conversion also covers all punctuation marks used in Taiwan and China, including some special ones for vertical writing.

3 There is no space between 2017 and 年. Chinese characters are displayed in fullwidth; see the next paragraph.

4 Thanks to Andreas Björk for adding the function.
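For the fullwidth ASCII variants, the mapping does not even need a dictionary, since Unicode places them at a fixed offset from their halfwidth counterparts. A sketch (function name is ours):

```python
def to_halfwidth(text):
    """Map fullwidth ASCII variants (U+FF01-U+FF5E) to halfwidth by the
    fixed Unicode offset 0xFEE0; the ideographic space becomes a space."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        elif code == 0x3000:  # ideographic (fullwidth) space
            out.append(' ')
        else:
            out.append(ch)    # Chinese characters etc. pass through
    return ''.join(out)

print(to_halfwidth('ＡＢＣ１２３'))  # ABC123
```

The language-specific punctuation marks, including those for vertical writing, still require an explicit mapping table, as they have no offset relation to ASCII.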

The Chinese normalizer is based on the Swedish normalization rules, written in C++. Chinese normalization lacks some of the most common tasks in Western languages, such as acronyms, abbreviations, and ordinal numbers, but there are also a number of language-specific cases, like classifiers, to handle. Acronyms and abbreviations are represented with selected characters and can be treated as OOVs; for example, 臺大 (Táidà) is the abbreviation of 臺灣大學 (Táiwān dàxué, 'National Taiwan University'). Ordinal numbers are created simply by adding the prefix 第 dì.

Table 4.1 shows a list of common normalization techniques with examples. Most patterns can be captured with regular expressions. Conversion to a unified format is required when a concept can be written in many different ways. For ambiguous cases, adjacent words are used to determine the correct category: 12345678, for example, can be an ill-formatted phone number or a cardinal number, and we can tell how to read it by looking for keywords like contact or call around the number. Small dictionaries store pairs of symbols and their spelled-out forms. Although we do not handle English here, the English abbreviations for days of the week and months are kept, as they are very common in news articles and blog posts online. Sometimes the input must be reordered for grammatical reasons: 12% is read percent twelve in Chinese, even though the symbol comes after the number. It is common to combine more than one approach from the list in a single rule. The complete list of normalization functions can be found in Appendix A.

Pattern Matching

• (0?[0-9]|1[0-9]|2[0-4]):[0-5][0-9] finds time notations like 23:59

• (0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[0-2]) ([12]\d{3}) for 09/19 2017

Format Conversion

• 1,000 and 1 000 are converted to 1000

• 19/09-2017 is replaced by 19/09 2017

Keyword Checking

• A long number string close to words like 聯絡 ('contact') → phone number

• 3/10 next to time-related words like 星期 ('week') → a date instead of a fraction

Dictionary Lookup

• Math and currency symbols: =, <, %, $, €, ...

• Common English abbreviations: Tel, Mon, Oct, ...

Reordering

• 12% → 百分之 ('percent') 十二 ('twelve')

• 25°C → 攝氏 ('Celsius') 二十五 ('twenty-five') 度 ('degree')

Table 4.1: Common normalization techniques and examples
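The two pattern-matching rules from Table 4.1 can be tried out directly in Python (the spacing inside the printed date pattern is typesetting residue and is removed here). Note that the date pattern has the day field first, so a string like 19/09 2017 matches:

```python
import re

# Time notation, e.g. 23:59
TIME = re.compile(r'(?:0?[0-9]|1[0-9]|2[0-4]):[0-5][0-9]')
# Day/month year notation, e.g. 19/09 2017
DATE = re.compile(r'(?:0?[1-9]|[12][0-9]|3[01])/(?:0?[1-9]|1[0-2]) [12]\d{3}')

print(bool(TIME.fullmatch('23:59')), bool(DATE.fullmatch('19/09 2017')))
# True True
```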

An interesting language-specific rule is the distinction between 二 (èr) and 兩 (liǎng) in Chinese. Both mean two, but they are used in different circumstances. Èr is used for counting and ordinal numbers, as in 第二 ('second') and 一二三 ('one two three'). When followed by a classifier, liǎng is chosen instead (兩本書 'two books', where 本 is the classifier for books). A function is written to check the POS tag of the following character and replace èr with liǎng when it is a classifier. Number names like hundred (百), thousand (千), and ten thousand (萬) are also treated as classifiers, so the spell-out function for numbers must be changed accordingly. The function is further modified to accommodate the Chinese number notation system: instead of breaking numbers into groups of three digits, Chinese groups four digits at a time, based on ten thousand. Numbers bounded by a hyphen or a tilde, commonly used to represent a range of dates, times, or amounts, are handled by rules that convert the symbol to the correct spelled-out form.
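The four-digit grouping can be sketched as follows. This is our own simplified illustration, not the actual spell-out function: it collapses internal zeros within a group but omits the zero inserted between groups (as in 250002) and does not drop the leading 一 in 一十.

```python
DIGITS = '零一二三四五六七八九'
UNITS = ['', '十', '百', '千']
GROUPS = ['', '萬', '億']  # group values 10^0, 10^4, 10^8

def spell_group(n):
    """Spell out 0-9999 in Chinese characters (simplified sketch)."""
    out, zero_pending = '', False
    for i in range(3, -1, -1):
        d, n = divmod(n, 10 ** i)
        if d == 0:
            zero_pending = bool(out)   # remember a gap inside the group
        else:
            if zero_pending:
                out += '零'
                zero_pending = False
            out += DIGITS[d] + UNITS[i]
    return out

def spell_number(n):
    """Group digits by ten thousand rather than by thousand."""
    if n == 0:
        return '零'
    parts, g = [], 0
    while n:
        n, rest = divmod(n, 10000)
        if rest:
            parts.append(spell_group(rest) + GROUPS[g])
        g += 1
    return ''.join(reversed(parts))

print(spell_number(2500), spell_number(250000))  # 二千五百 二十五萬
```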

Besides language-specific rules, local knowledge is sometimes required for patterns like phone numbers. Apart from the names of months and days, many English abbreviations are not translated, particularly scientific units: people say GB rather than the translated Chinese term for gigabytes. Many modern TTS systems are bilingual to some degree to handle cases like this, as English loanwords are ubiquitous, especially on the Internet. Moreover, these terms are likely to have different translations in the two Chinese standards, as they are newly coined (十億位元組 versus 吉字节 for GB). Creating two voices for the two varieties of Chinese is a good way of resolving the problem, much as Z is read as [zi] by an American English voice but [z E d] by a British one. For our system, we have chosen to translate some of the most common terms and left those with diverging translations untouched. The normalized output is in traditional Chinese, but the normalizer also recognizes context-related keywords in both scripts. Simplified text may thus contain a number of traditional characters from the normalizer; these are added manually to the lexicon for later lookup.

4.2 Phonetic Analysis

At this stage, normalized text is converted to phonetic transcription. Most words can be looked up in a pronunciation lexicon incorporated in the TTS system. For our implementation, we create two lexicons as text files: one in traditional and one in simplified Chinese. From now on, we refer to the internal pronunciation lexicon of our TTS system as the lexicon, while an external source used for building the lexicon is called a dictionary. OOVs are broken down into individual characters to be checked against the lexicon. For languages like English, heteronyms can often be distinguished by categories like noun or verb. In Chinese, ambiguity is mostly resolved by the context, and telling the remaining heteronyms apart requires more syntactic or semantic information. The phonetic representation is the r-sampa phone set, a variant of SAMPA created by ReadSpeaker, which is introduced in more detail in Section 4.3.

4.2.1 Lexicons

Traditional Chinese Lexicon

The base of the traditional Chinese lexicon is the Revised Chinese Dictionary (重編國語辭典修訂本) by the Ministry of Education in Taiwan. A Python script is written to process the entries, and the pronunciations are created by a function mapping Pinyin to r-sampa symbols. There are two types of entries: zì entries with a single character and cí entries containing multiple characters. POS tags are available for most zì entries, but there is usually more than one per character. For example, the character 手 is hand, but in 手機 ('cellphone') it is regarded as an adjective meaning handy or small. Cí entries do not have any POS tags.

The dictionary contains 166,120 entries, 11,933 of which are individual characters. All variant characters are removed, as many of them are unencoded and their pronunciations would have to be checked manually as well. Fortunately, variant characters are very rare with digital input methods, as users can only choose from the encoded word list. After removing invalid entries, our traditional Chinese lexicon has 164,225 entries in total, a reasonable size compared to our lexicons for other languages. However, the dictionary rarely adds new entries despite its authoritativeness: only a dozen new words made their way into the latest update. Many entries are also outdated, but removing them arbitrarily is not a good idea either. Many new words will thus be treated as OOVs, but as our lexicon covers a good number of characters, we should be able to generate the pronunciation of OOVs by breaking the words down into characters. We can always expand the lexicon by adding new words.

Pinyin is relatively phonemic, so the conversion to phonetic representation is straightforward: the function takes the initial and the final of the Pinyin syllable and maps them to r-sampa. There is however the special case of the rhotic coda. In Beijing Mandarin, the diminutive form is created by adding an r to the word; the suffix is written as the character 兒 (ér), for example in 哪兒 nǎr ('where'). The rhotic coda is not as common in Taiwan Chinese, where the suffix is often omitted or read as a separate syllable. The r remains as a coda in Pinyin, but we have marked it as a separate syllable ér in the transcription, as our speech database does not realize it as a rhotic coda.
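The initial/final split that precedes the r-sampa mapping can be sketched as follows (our own illustration; the actual mapping tables to r-sampa are internal and not shown). The only subtlety is trying multi-letter initials like zh before single letters:

```python
# Pinyin initials, longest first so that 'zh' is tried before 'z'
INITIALS = ['zh', 'ch', 'sh', 'b', 'p', 'm', 'f', 'd', 't', 'n', 'l',
            'g', 'k', 'h', 'j', 'q', 'x', 'r', 'z', 'c', 's']

def split_syllable(syl):
    """Split a numbered Pinyin syllable into (initial, final, tone),
    e.g. 'xing1' -> ('x', 'ing', '1'). The initial may be empty."""
    tone = syl[-1] if syl[-1].isdigit() else ''
    body = syl[:-1] if tone else syl
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return '', body, tone  # syllables like 'er' have no initial

print(split_syllable('xing1'), split_syllable('er2'))
# ('x', 'ing', '1') ('', 'er', '2')
```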

There are eleven POS tags in the dictionary: noun, pronoun, verb, adjective, adverb, preposition, conjunction, interjection, Chinese particle, onomatopoeia, and affix. The first eight are also found in our other lexicons and are converted accordingly. Except for onomatopoeia, which is merged with interjection as in English, new tags are created for the new POS categories. The complete POS list and the lexicon format can be found in Table 4.2. As we do not plan to make much use of the POS tags, they are mostly for reference. We do however need the classifier tag for the classifier rule in the normalizer. Classifiers are labeled as Chinese particles in the dictionary; we have selected the 70 or so most common classifiers and appended the tag to those entries for further reference.

5 The exe files of the dictionary are available at http://resources.publicense.moe.edu.tw/

6 Chinese makes use of a number of syntactically independent grammatical elements known as particles to indicate mood, aspect, or pragmatic differences (Huang et al., 2014).

Simplified Chinese Lexicon

The source of the simplified Chinese lexicon is a text file of the Contemporary Chinese Dictionary (现代汉语词典 Xiàndài hànyǔ cídiǎn). The file is not very well formatted: the Pinyin is written together with spaces, which causes ambiguity for some transcriptions, and the file contains non-standard letters and symbols that make importing the transcriptions extremely difficult. We therefore use pypinyin, a Python package that converts words into Pinyin, and then turn the Pinyin transcriptions into r-sampa. As in the traditional Chinese lexicon, all variant characters that cannot be displayed are removed. The lexicon has 64,849 entries, 10,442 of which are single characters. The smaller coverage is due to the fact that the traditional lexicon contains many old-fashioned words and idioms that are very unlikely to occur in our system's input. On the other hand, the number of individual characters is very close to that of the traditional lexicon, which means the system should have no trouble creating the pronunciation of OOVs from these characters.

Although pypinyin incorporates a dictionary and is capable of providing accurate transcriptions for most words, it does not work well on single-character entries with multiple pronunciations. Without context, it is impossible to tell how such a character should be read, so pypinyin simply chooses the most common reading. By listing the duplicated lines, we have found around 900 entries that may have been assigned the wrong pronunciation. The actual number may be smaller, as identical entries do not necessarily mean the readings differ: two entries may share the same pronunciation but have unrelated meanings and therefore be listed as separate words. Correcting these entries can only be done manually, and going through all of them is impractical due to time constraints. Most disyllabic words should be found in the lexicon with the right transcription, while the monosyllabic ones have the most common pronunciation given by pypinyin.
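The duplicate listing mentioned above amounts to a few lines of code. This is our own reconstruction, assuming the lexicon dump is a list of (word, transcription) pairs:

```python
from collections import Counter

def duplicated_words(entries):
    """Return words that occur more than once in the lexicon dump; these
    are candidates whose pypinyin reading may need manual checking."""
    counts = Counter(word for word, _ in entries)
    return sorted(word for word, n in counts.items() if n > 1)

entries = [('行', 'xing2'), ('行', 'hang2'), ('好', 'hao3')]
print(duplicated_words(entries))  # ['行']
```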

7 The file is available on GitHub.

8 The documentation for pypinyin is available at https://pypi.python.org/pypi/pypinyin


About 80% of the entries have at least one POS tag. Besides the tags given in the traditional lexicon, the dictionary has two extra labels: numeral and classifier. The POS labels are retrieved by looking at the beginning of each definition, as this is where the tags are found in a well-organized entry. Since the tag characters are also common characters, occurrences outside this position are ignored to avoid extracting wrong labels. As we only use the classifier tag in our analysis, the other POS tags are less crucial.

Formatting the Lexicons

Figure 4.1 shows a number of sample entries from both lexicons. An entry contains the word, the transcription, and optional POS tags (translations and line numbers are only provided here for reference); the columns are separated by tabs. The first four lines are extracted from the traditional lexicon, while the last two are from the simplified one. Single-character lines usually have a number of POS tags, as their combination with other characters can change the POS category. Multi-character words do not come with POS tags in the traditional Chinese lexicon, but they can be tagged in the simplified one; lines 3 and 5 in the figure illustrate the difference.

1 星 [ 1S i N ] /NN/JJ/AB/ ('star')
2 xing1 [ 1S i N ]
3 星表 [ 1S i N . 3p iau ] ('star chart')
4 xing1_biao3 [ 1S i N . 3p iau ]
5 星座 [ 1S i N . 4ts uO ] /NN/ ('constellation')
6 xing1_zuo4 [ 1S i N . 4ts uO ] /NN/

Figure 4.1: A sample of entries from the traditional and simplified Chinese lexicons

The main difference from our lexicons for other languages is that the Pinyin transcription is included, so that people who cannot read or type Chinese can still work with the lexicon. As our system does not allow extra columns, the Pinyin transcriptions are listed as individual entries, as seen in lines 2, 4, and 6 in Figure 4.1. The POS tags of monosyllabic Pinyin entries are removed, since many characters share the same transcription; for other entries, the POS labels are preserved. Items with the same transcription but different POS tags are imported as separate entries. Underscores are inserted because spaces are not allowed within a word in other languages. Table 4.2 shows the complete POS tag list for both lexicons.

/NN/: noun      /AB/: adverb        /JJ/: adjective           /CL/: classifier
/PN/: pronoun   /PP/: preposition   /CP/: Chinese particle    /FX/: affix
/VB/: verb      /KN/: conjunction   /IN/: interjection        /RG/: numeral

Table 4.2: Part-of-speech tags used in the lexicons.

The Speech Assessment Methods Phonetic Alphabet (SAMPA) is a phonetic transcription system based on the IPA. SAMPA maps IPA symbols to characters available on the keyboard, so that transcriptions can be input and processed more easily by computers. SAMPA only covers a number of languages, but the X-SAMPA extension includes the whole IPA chart. R-sampa is a variation of the SAMPA systems created by ReadSpeaker for internal phonetic transcription. As with IPA, r-sampa symbols are surrounded by square brackets. The complete Chinese phone set can be found in Table 4.3.

4.2.2 Out-of-Vocabulary Words

Out-of-vocabulary words are usually handled by grapheme-to-phoneme rules in alphabetic writing systems. In Chinese, such rules are largely replaced by looking up the individual characters of a word, as the number of graphemes is too large. The pronunciations of the individual characters are then put together to generate the transcription of the word. This approach is also used in other languages, for example for compound words (sammansättningsord) in Swedish and other Germanic languages. We therefore reuse the existing script for generating OOV pronunciations in Swedish. The OOV word is divided into a number of possible combinations, and each part is assigned a cost; the function then seeks the smallest possible total cost for the word. For example, if a word contains three characters, a combination of lengths 2 and 1 has a lower cost than three 1s added together. The idea is to preserve the largest possible chunks found in the lexicon, as they are more likely to reflect the actual pronunciation of the word. This is particularly important in Chinese, as neighboring characters often reveal the pronunciation of heteronyms. This method should predict most OOVs correctly, although characters with multiple pronunciations can be problematic in isolation; disambiguation is then required to determine the correct pronunciation. To solve the problem at its root, frequent OOVs can be added to the lexicon.
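The minimal-cost split can be found with dynamic programming. The sketch below is our own reconstruction under a deliberately simple cost model (every chunk costs 1, so fewer, longer chunks win; the actual script may weight costs differently). Single characters are assumed to always be available as a fallback:

```python
def best_segmentation(word, lexicon):
    """Split an OOV word into the fewest chunks found in the lexicon
    (hypothetical uniform cost model; single characters always allowed)."""
    n = len(word)
    cost = [0] + [float('inf')] * n   # cost[i]: chunks covering word[:i]
    back = [0] * (n + 1)              # backpointer to the chunk start
    for i in range(1, n + 1):
        for j in range(i):
            chunk = word[j:i]
            if (chunk in lexicon or len(chunk) == 1) and cost[j] + 1 < cost[i]:
                cost[i], back[i] = cost[j] + 1, j
    # Recover the chunks from the backpointers
    chunks, i = [], n
    while i > 0:
        chunks.append(word[back[i]:i])
        i = back[i]
    return list(reversed(chunks))

print(best_segmentation('手機星座', {'星座', '手機', '星'}))  # ['手機', '星座']
```

With the chunks in hand, the transcription of the OOV is the concatenation of the chunks' lexicon transcriptions.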

4.2.3 Disambiguation

Disambiguation is the task of determining the correct pronunciation of homographs in a language. Homographs share the same written form but differ in meaning and sometimes in pronunciation; words of the latter kind are known as heteronyms and are a main concern of speech synthesis. In our rule-based systems for other languages, disambiguation usually relies on language-specific information

9X-SAMPA home page: http://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm

10Thanks to Erik Margaronis for providing the explanation to the script.


like POS tags or the context. The same approaches are not directly applicable to Chinese, but they give a general idea of what a disambiguation script for Chinese should look like.

Fortunately, a rather simple way of dealing with heteronym characters in Chinese is to add the related words to the lexicon. For example, 樂 can be read as lè (as in 快樂, ‘happy’) or yuè (as in 音樂, ‘music’), but as long as the words are listed in the lexicon, the system has no difficulty providing the correct pronunciation. In most cases, a multi-character word needs no disambiguation even though it may contain heteronym characters; true heteronym words (cí) are extremely rare¹¹. The tricky part of disambiguation lies in OOVs with guessed pronunciations and in individual characters.
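The point above can be shown with a minimal lexicon lookup: whole-word entries carry the correct reading, so 樂 needs no special handling inside a listed word. The pinyin tone-number format and lexicon layout are illustrative assumptions.

```python
# Minimal sketch of lexicon-based heteronym resolution: the word-level
# entry already encodes the right reading of 樂 in each context.
LEXICON = {
    "快樂": "kuai4 le4",   # 'happy' → 樂 read lè
    "音樂": "yin1 yue4",   # 'music' → 樂 read yuè
}

def transcribe(word):
    # None signals an OOV, which would fall back to per-character lookup.
    return LEXICON.get(word)

print(transcribe("音樂"))  # → yin1 yue4
print(transcribe("快樂"))  # → kuai4 le4
```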

Before tackling Chinese disambiguation, we will look at how other languages deal with heteronyms. POS tags work relatively well for English. For instance, a word preceded by the is likely to be an adjective or a noun, so project in the project should be read with stress on the first rather than the second syllable. Beyond project, English has a large inventory of heteronyms that can be distinguished by checking POS tags. We do not have a tagger in our system, but the surrounding words are usually enough to guess the POS label of the word. Another approach is to look at the context, much like telling telephone numbers apart from cardinal numbers by keywords. Bass can be the instrument or the fish, and both are nouns, but if words like music and play occur in the same sentence, it probably refers to the instrument. Heteronyms are ordered by frequency and listed in a separate lexicon; when the context or tags provide no information, the more common reading is chosen.
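The keyword heuristic and frequency fallback just described can be sketched as follows. The cue words, readings, and lexicon structure are invented for illustration, not taken from the thesis's system.

```python
# Hypothetical keyword-based heteronym disambiguation in the spirit of the
# bass example above: scan the sentence for cue words; if none match, fall
# back to the most frequent reading (listed as "default").
HETERONYMS = {
    "bass": {
        "default": "/beIs/",                       # the instrument (more common)
        "cues": {
            ("fish", "lake", "caught"): "/b{s/",   # the fish
            ("music", "play", "guitar"): "/beIs/",
        },
    }
}

def read_heteronym(word, sentence_words):
    entry = HETERONYMS[word]
    for cues, reading in entry["cues"].items():
        if any(cue in sentence_words for cue in cues):
            return reading
    return entry["default"]

print(read_heteronym("bass", {"we", "caught", "a", "bass"}))  # → /b{s/
print(read_heteronym("bass", {"loud"}))                       # → /beIs/
```

The transcriptions use SAMPA-style notation to match the R-SAMPA convention mentioned earlier, though the exact symbols here are only placeholders.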

The above approaches are less feasible in Chinese, especially for OOV words that are broken down into characters. Chinese lacks distinctive POS groups that correlate with pronunciation, and individual characters usually carry too many possible POS tags anyway. This does not mean that such words do not exist¹², but we have no means of knowing the POS tags simply by looking at adjacent words.

Moreover, the POS tags provided by our dictionaries are incomplete. There is, however, one category that can be targeted in Chinese: many Chinese characters are pronounced differently when used as family names. A possible solution is to create two lists, one containing family names and the other common titles. If a character from the first list is followed by an item from the second, it should be read as a family name¹³. This rule does not work for a surname followed by a first

11Around a dozen examples are given in an article on a language forum, but we only agree with one of them. Another example we can think of is 老子, which can be Lǎozǐ (the Chinese philosopher) or lǎozi (an old man, or a haughty way of referring to oneself).

12Tones are sometimes used to distinguish POS. An example is 鑽 in 鑽洞 (zuān dòng, ‘drill a hole’) and 電鑽 (diànzuàn, ‘electric drill’).

13A Chinese surname comes before the title (for example, 曾教授, literally ‘Zeng Professor’). A typical
