
Automatically creating multilingual lexical resources

by

Khang Nhut Lam

MSc., Ewha Womans University, Seoul, Korea, 2009

A dissertation submitted to the Graduate Faculty of the

University of Colorado at Colorado Springs

in partial fulfillment of the

requirements for the degree of

Doctor of Philosophy

Department of Computer Science

© Copyright by Khang Nhut Lam 2015. All Rights Reserved.

This thesis for the Doctor of Philosophy degree in Computer Science by

Khang Nhut Lam

has been approved for the

Department of Computer Science

by

Dr. Jugal Kalita, Chair

Dr. Edward Chow

Dr. Rory Lewis

Dr. Martha Palmer

Dr. Jia Rao


Khang Nhut Lam, Ph.D., Computer Science

Title: Automatically creating multilingual lexical resources

Supervisor: Dr. Jugal Kalita

Bilingual dictionaries and WordNets are important resources for natural language processing tasks such as information retrieval and machine translation. However, lexical resources are usually available only for resource-rich languages, e.g., English, Spanish and French. Resource-poor languages, e.g., Cherokee, Dimasa and Karbi, have very few resources with limited numbers of entries. Current approaches for creating new lexical resources work with languages that already have good-quality resources available in sufficient quantities. This thesis proposes novel approaches to generate bilingual dictionaries, translate phrases and construct WordNets for several natural languages, including some languages on the UNESCO Endangered Languages List (viz., Cherokee, Cheyenne, Dimasa and Karbi), by bootstrapping from just a few existing resources and publicly available resources in resource-rich languages, such as the Princeton WordNet, the Japanese WordNet and the Microsoft Translator. This thesis not only constructs new lexical resources but also supports communities using these languages.

Dedication

I would like to express deep love to my parents. Without your love and your support, I could not have brought this dissertation this far. Thank you for everything you have done for me. Even though I have been half a world away from you, I have never felt lonely, because you are always with me. I would like to thank the Le family, and my best friends, Vicky Collier and Janet Gardner, who are always on my side, take care of me as their daughter, and have given me a real family during the time I have been in the United States.

Acknowledgments

I would like to take this opportunity to express my warm thanks to my advisor, Dr. Jugal Kalita, who has supported and guided me with patience and encouragement, and has provided me with a professional environment for studying and doing research since my first day in the PhD program at UCCS. I also owe my gratitude to my dissertation committee members: Dr. Edward Chow, Dr. Jia Rao, Dr. Martha Palmer and Dr. Rory Lewis, for their enthusiasm, insightful comments, constructive suggestions and critical evaluations of my research.

Special thanks are due to Feras Al Tarouti, my lab mate and co-author, for his stimulating contributions and discussions, his help in programming and evaluating results, and his excellent company during stressful days when we worked together to meet crucial paper deadlines. Many thanks to all of my lab mates for their help, questions, suggestions and all the fun we have had in our lab.

Many thanks to Dubari Borah, Francisco Torres Reyes, Conner Clark, Tri Doan,

Morningkeey Phangcho, Dharamsing Teron, Navanath Saharia, Arnab Phonglosa, Faris

Kateb, Abhijit Bendale, Lalit Prithviraj Jain and Svati Dhamija for helping me evaluate

lexical resources. I also thank all my friends in the Xobdo, Microsoft and PanLex projects

who provided me with dictionaries and translations.

This research was supported by Vietnam International Education Development, Ministry of Education and Training of Vietnam (VIED). I gratefully acknowledge VIED's financial support. I also thank the Graduate School at UCCS for fellowships, and the Computer Science Department.

TABLE OF CONTENTS

1 Introduction
   1.1 Overview
   1.2 Types of lexical resources
   1.3 Research focus and contribution
   1.4 Intellectual and scientific merit
   1.5 Broader impact
   1.6 Organization of the dissertation

2 Related work
   2.1 Introduction
   2.2 Structure of lexical resources
      2.2.1 Structure of a bilingual dictionary
      2.2.2 Structure of the Princeton WordNet
   2.3 Language codes
   2.4 Creating new bilingual dictionaries
      2.4.1 Generating bilingual dictionaries using one intermediate language
      2.4.2 Generating bilingual dictionaries using many intermediate languages
      2.4.3 Extracting bilingual dictionaries from corpora
      2.4.4 Generating dictionaries from multiple linguistic resources
   2.5 Generating translations for phrases
   2.6 Constructing WordNets
      2.6.1 Constructing WordNets using the merge approach
   2.7 Chapter summary

3 Input resources and evaluation methods
   3.1 Introduction
   3.2 Input bilingual dictionaries
   3.3 Input WordNets
   3.4 Evaluation method
   3.5 Chapter summary

4 Creating reverse bilingual dictionaries
   4.1 Introduction
   4.2 Related work
   4.3 Proposed approaches
      4.3.1 Direct reversal (DR)
      4.3.2 Direct reversal with distance (DRwD)
      4.3.3 Direct reversal with similarity (DRwS)
      4.3.4 Direct reversal with similarity and distance (DRwSD)
   4.4 Experimental results
      4.4.1 Preprocessing entries in the existing dictionaries
      4.4.2 Results
   4.5 Future work
   4.6 Chapter summary

5 Creating new bilingual dictionaries
   5.1 Introduction
   5.3 Proposed approaches
      5.3.1 Direct translation approach (DT)
      5.3.2 Using publicly available WordNets as intermediate resources (IW)
   5.4 Experimental results
      5.4.1 Results and human evaluation
      5.4.2 Comparing with existing approaches
      5.4.3 Comparing with Google Translator
   5.5 Future work
   5.6 Chapter summary

6 Creating WordNets
   6.1 Introduction
   6.2 Related work
   6.3 Proposed approaches
      6.3.1 Generating synset candidates
         6.3.1.1 The direct translation (DT) approach
         6.3.1.2 Approach using intermediate WordNets (IW)
         6.3.1.3 Approach using intermediate WordNets and a dictionary (IWND)
      6.3.2 Ranking method
      6.3.3 Selecting candidates based on ranks
   6.4 Experiments
   6.5 Future work

7 Generating translations for phrases using a bilingual dictionary and n-gram data
   7.1 Introduction
   7.2 Vietnamese morphology
   7.3 Related work
   7.4 Proposed approach
      7.4.1 Segmenting Vietnamese words
      7.4.2 Filtering segmentations
      7.4.3 Generating ad hoc translations
      7.4.4 Selecting the best ad hoc translation
      7.4.5 Finding and ranking translation candidates
   7.5 Experiments
   7.6 Future work
   7.7 Conclusion

8 Conclusions

References

Appendix A: Reverse dictionaries generated
Appendix B: New bilingual dictionaries created

TABLES

2.1 Languages mentioned and their ISO 639-3 codes
3.1 The number of entries in the input dictionaries
3.2 The number of synsets in WordNets
3.3 The average scores of entries in the input dictionaries
4.1 Words related to the word “south”, obtained from the Princeton WordNet
4.2 Reverse dictionaries created using the DR and DRwD approaches
4.3 Reverse dictionaries created using the DRwS approach
4.4 Reverse dictionaries created using the DRwSD approach
4.5 Examples of unknown words from the source dictionaries
4.6 Examples of bad translations from the source dictionaries
4.7 Reverse of reverse dictionaries generated
4.8 Some new entries, evaluated as excellent or good, in the reverse of reverse dictionaries
5.1 The average score and the number of lexical entries in the dictionaries created using the DT approach
5.2 The average score of lexical entries in the dictionaries we create using the IW approach
5.3 The number of lexical entries in the dictionaries we create using the IW approach
5.4 The average score of entries and the number of lexical entries in some other bilingual dictionaries constructed using 4 WordNets: PWN, FWN, JWN and WWN
5.5 Examples of entries, evaluated as excellent, in the new bilingual dictionaries we created
5.6 The number of lexical entries in some other dictionaries we create using the best approach
5.7 Examples of entries, not yet evaluated, in the new bilingual dictionaries we create
5.8 Some “unmatched” lexical entries
6.1 Different senses of the word “chair”
6.2 Synsets obtained from different WordNets and their translations in Vietnamese
6.3 Example of calculating the ranks of candidates in Arabic
6.4 Example of Case 2 to select candidates
6.6 The number of WordNet synsets we create using the IW approach
6.7 The number of WordNet synsets we create using the IWND approach
6.8 The number and the average score of WordNet synsets we create
7.1 Some examples of Vietnamese phrases and their translations
7.2 Some translations we create are correct but do not match translations by the Google Translator
1 Sample entries in the English-Assamese reverse dictionary
2 Sample entries in the English-Vietnamese reverse dictionary
3 Sample entries in the English-Dimasa reverse dictionary
4 Sample entries in the English-Karbi reverse dictionary
5 Sample entries in the Assamese-Vietnamese and Assamese-Arabic dictionaries
6 Sample entries in the Assamese-German and Assamese-Spanish dictionaries
7 Sample entries in the Arabic-German and Arabic-Spanish dictionaries
8 Sample entries in the Vietnamese-German and Vietnamese-Spanish dictionaries
9 Sample entries in the Assamese WordNet synsets

FIGURES

1.1 “A new Vietnamese-English dictionary” compiled by William Peter Hyde [41]
2.1 A general method to create a new bilingual dictionary
2.2 An example of the lexical triangulated translation method
4.1 The idea behind the DR algorithm
4.2 The drawback of the DR algorithm
4.3 The idea behind the DRwD algorithm
4.4 The drawback of the DRwD algorithm
4.5 The idea behind the DRwS algorithm
4.6 The idea behind the DRwSD algorithm
5.1 An example of generating an entry for a Dimasa-Vietnamese dictionary using the DT approach
5.2 The IW approach for creating a new bilingual dictionary
5.3 Example of generating lexical entries for a Dimasa-Arabic dictionary using the IW approach
6.1 The DT approach to construct WordNet synsets in a target language T
6.2 The IW approach to construct WordNet synsets in a target language T
6.3 The IWND approach to construct WordNet synsets

CHAPTER 1 INTRODUCTION

1.1 Overview

The Ethnologue organization (http://www.ethnologue.com/), which compiles the most comprehensive catalogue of the languages of the world, lists 7,106 living languages. Half the world's population speaks the 13 most populous languages; the other half speaks the rest (http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers). Eighty languages, 1.2% of all languages, are spoken by 79.5% of the world's population, and 305 languages (5.5%) are spoken by 94.2% (http://www.ethnologue.com/statistics/size). One hundred languages are spoken by at least 7.4 million people each, the rest by fewer. 81.3% of the world's languages are spoken by fewer than a million people each. Many languages spoken by even tens of millions of people do not have official status, or have only (low) regional status, even within their own countries (http://en.wikipedia.org/wiki/List_of_languages_without_official_status). With so many languages spoken by so few, many languages do not have high political or economic status. In addition to the many that are isolated by inhospitable geography, most languages lack the resources needed to survive and thrive. These resources include books for infants and children, books for adults of various kinds, newspapers, magazines, monolingual dictionaries, bilingual dictionaries, thesauri, and, these days, electronic versions of these same resources. In contrast to resource-poor languages, resource-rich languages have better access to resources like dictionaries, thesauri and ontologies, and possibly have plentiful text corpora as well. In truth, no language can be considered truly resource-rich in absolute terms, but we may consider a few languages (e.g., English, Spanish and Japanese) to be resource-rich in relative terms; researchers have

created many resources to facilitate various aspects of computational processing for such

languages. There are a few other languages that have a limited number of resources, but

can benefit from additional resources (e.g., Arabic and Vietnamese). Other languages have

very few resources, if any. Many other languages are becoming endangered, a state which is likely to lead to their extinction without determined intervention. Some endangered languages are Chrau and Tai Daeng in Vietnam, Karbi and Dimasa in India, and Cherokee and Cheyenne in America.

We construct lexical resources necessary for the computational processing of natural languages in areas such as information retrieval, automatic word-sense disambiguation, computing document similarity, machine learning, and machine translation. Consider bilingual dictionaries, an essential tool for human language learners. Most existing (print or online) bilingual dictionaries are between two resource-rich languages (e.g., English-Spanish, Japanese-Chinese and French-German dictionaries), or between a resource-rich language and a resource-poor language (e.g., English-Assamese and English-Cherokee dictionaries). The powerful online machine translators (MT) developed by Google (https://translate.google.com/) and Bing provide pairwise translations for 80 and 50 languages, respectively. These systems provide translations for single words and phrases as well. In spite of so much information for some “privileged” language pairs, there are many languages for which we are lucky to find a single bilingual dictionary online or in print. For example, we can find an online Karbi-English dictionary and an English-Vietnamese dictionary, but we cannot find a Karbi-Vietnamese dictionary. Another important resource that is very helpful in computational processing and in human language learning is a thesaurus, providing synonyms and antonyms of words. An enriched

thesaurus that provides additional relations among words in the computational context is

called a WordNet. An English version of such a WordNet has been produced over several

decades at Princeton University, and similar complete WordNets have also been produced

for a small number of additional languages (e.g., French, Hindi and Japanese WordNets).

Most such resources do not really exist for resource-poor and endangered languages.

This dissertation focuses on developing new techniques that leverage existing resources

for resource-rich languages to build bilingual dictionaries and WordNets for languages,

especially languages having very few resources. In addition, a phrase translation model using

a bilingual dictionary augmented by n-gram data is also proposed to obtain translations

for phrases that occur within these resources or even outside. We believe using approaches

that are not language-specific to create computational lexical resources, some of which may

be adapted to produce printed resources as well, may work in concert with other similar

efforts to invigorate speakers, learners and users of these languages.

1.2 Types of lexical resources

According to Landau [58], a dictionary or a lexicon consists of a list of entries sorted

by the lexical unit. Each entry usually contains a lexical unit, the definition associated with

it, part-of-speech (POS), pronunciation, examples showing the uses of words, and possibly

additional information. The lexical unit is usually a single word, whereas its definition is a

single word, a multiword expression, or a phrase. A monolingual dictionary contains only

one language, such as the Oxford English Dictionary. A bilingual dictionary consists of translations of words between two languages, such as “A Dictionary in Assamese and English” [18]. The monolingual dictionary is mainly used by the native speaker for reading and understanding texts. The bilingual dictionary is used to understand the words in the

source language [58], or to translate [84]. A bilingual dictionary can be unidirectional or bidirectional. A unidirectional dictionary contains translations from the source language to the target language, but the reverse translations are not provided. In contrast, a bidirectional dictionary consists of translations from the source language to the target language, and from the target language to the source language. Besides the obvious bilingual dictionaries that cover all the words used generally in a language, one finds specific dictionaries such as a synonym dictionary (e.g., Merriam-Webster's Dictionary of Synonyms [73]), a dictionary focused on proper names (e.g., A Dictionary of Surnames [36]), or one focused on a narrow and specific area (e.g., Black's Law Dictionary [30], and Stedman's Medical Dictionary [113]). Figure 1.1 is an example of a Vietnamese-English paper bilingual dictionary [41].

Figure 1.1: “A new Vietnamese-English dictionary” compiled by William Peter Hyde [41].

Kilgarriff [47] defines a thesaurus as a resource that groups words not alphabetically as they “are in a dictionary, but according to the ideas which they express”. In particular, according to Soergel [111], a thesaurus contains a set of descriptors, an indexing language, a classification scheme, or a system vocabulary. A thesaurus also consists of relationships among descriptors. Each descriptor is a term, a notation, or another string of symbols used to designate the concept. Examples of thesauri are Roget's International Thesaurus [98], Open Thesaurus (http://www.openthesaurus.de/), and a large online English thesaurus simply called thesaurus.com.

Miller [75] introduces WordNet, a large lexical database where nouns, verbs, adjectives, and adverbs are grouped into unordered sets of cognitive synonyms, the so-called synsets. Each synset expresses a distinct concept. The WordNet is both an enriched dictionary and a thesaurus. Given a lexical unit, the general dictionary and WordNet return definitions, POSes and examples. For the lexical unit, the dictionary mainly contains single words, while the WordNet can include short phrases such as “tabular array”, “scholarly person”, and “grape vine”. Given a concept, the WordNet and thesaurus return terms which fit the concept. The words in WordNet synsets are disambiguated in terms of senses. The relationships between words (such as hypernymy or generalization, hyponymy or particularization, and meronymy or part-whole relationships) in the WordNet are labeled. Currently, the biggest WordNet is the Princeton WordNet (http://wordnet.princeton.edu/) version 3.0, which has 117,659 synsets, including 82,115 noun synsets, 13,767 verb synsets, 18,156 adjective synsets, and 3,621 adverb synsets. Some other WordNets are the FinnWordNet [66], the Japanese WordNet [43], and the EuroWordNet [122]. The AsianWordNet (AWN) provides a platform for building and sharing WordNets among Asian languages (viz., Bengali, Hindi, Indonesian, Japanese, Korean, Lao, Mongolian, Burmese, Nepali, Sinhala, Sundanese, Thai, and Vietnamese).


Unfortunately, the progress of the WordNets in AWN is extremely slow, and they are far

from being finished.

Schmidt and Wörner [105] define parallel corpora as “collections of written texts and their translations into one or more languages, edited and aligned for the purpose of linguistic analysis”. Zanettin [125] introduces the comparable corpus, consisting of “texts in the languages involved, which share similar criteria of composition, genre and topic”. A corpus containing only one language is called a monolingual corpus, such as the British National Corpus (http://www.natcorp.ox.ac.uk/) and the Brown Corpus (http://clu.uni.no/icame/brown/bcm.html). A bilingual corpus involves two languages, such as the English-Vietnamese Bilingual Corpus (EVBcorpus) [83], while a multilingual corpus consists of three or more languages, such as the International Cambridge Language Survey.

1.3 Research focus and contribution

The dissertation concentrates on automatically constructing multilingual lexical resources, especially bilingual dictionaries and WordNets, for several natural languages. We also introduce a novel method to translate a given phrase in a source language into a target language. The languages we focus on are the following:

- Languages that are widely spoken but have limited computational resources, such as Arabic and Vietnamese.

- A language that is spoken by tens of millions of people in northeast India, but has almost no resources, such as Assamese.


- Languages that are on the UNESCO Endangered Languages list (http://www.unesco.org/new/en/culture/themes/endangered-languages/), such as Cherokee, Cheyenne, Dimasa and Karbi.

We note that Cherokee (http://en.wikipedia.org/wiki/Cherokee_language) is an Iroquoian language spoken by 13,500 Cherokee people in Oklahoma and North Carolina. Cheyenne (http://en.wikipedia.org/wiki/Cheyenne_language) is a Native American language spoken by 2,100 Cheyenne people in Montana and Oklahoma. Dimasa (http://en.wikipedia.org/wiki/Dimasa_language) and Karbi (http://en.wikipedia.org/wiki/Karbi_language) are spoken by 110,000 and 420,000 people, respectively, in India. Assamese (http://en.wikipedia.org/wiki/Assamese_language) is an Indo-European language spoken by about 16 million people and is resource-poor. Vietnamese (http://en.wikipedia.org/wiki/Vietnamese_language) is an Austroasiatic language spoken by 75 million people in Vietnam and the Vietnamese diaspora, whereas Arabic (http://en.wikipedia.org/wiki/Arabic_language) is an Afro-Asiatic language spoken by 290 million people in countries of the Arab League.

First, we focus on creating reverse bilingual dictionaries. Published methods for automatically creating new dictionaries from existing dictionaries use intermediate dictionaries.

Unfortunately, we are lucky to find a single bilingual dictionary online or in software form

for many resource-poor languages. So, our first effort, to increase lexical resources for a

language under consideration, is to investigate the creation of a reverse dictionary from

an existing dictionary, if we can find one. To remove ambiguous entries and increase the

number of entries in created dictionaries, WordNets of resource-rich languages will be used

to compute similarities between words or phrases. Of course, a new reverse dictionary is

associated with the same two languages as the original dictionary that is reversed.


Our next effort at increasing lexical resources will be to create bilingual dictionaries

for language pairs for which such dictionaries do not exist. We will create dictionaries

from resource-poor languages to several other languages by exploiting publicly available

WordNets, bilingual dictionaries, and the dictionaries we create in the first task.

Resource-rich languages will provide the pivots for such translations. In general, if a word b (which

may be polysemous) in language B is translated into a word a in language A and a word

c in language C, we cannot necessarily conclude that a is a translation of c because of

their association with b. Hence, statistical techniques and WordNets are used to remove

ambiguous entries.

WordNets are among the most heavily used lexical resources. We develop algorithms

and models to automatically build WordNets for languages using available resources, but

also by bootstrapping with resources we create ourselves. If we can create a number of

WordNets of acceptable quality, we believe it will contribute significantly to the repository

of resources for languages that lack them.

A problem we have encountered in our previous tasks is that quite often a dictionary

entry has a sense that is given in terms of a sequence of words or a phrase. When we

reverse a bilingual dictionary or create bilingual dictionaries for new language pairs, so far

we have ignored such sense entries since we do not know how to translate a phrase into

the target language. Jackendoff [44, page 156] estimated that the number of multiword

expressions or phrases in a person’s vocabulary is of the same order as the number of single

words. In addition, Sag et al. [100] found that 41% of the words in WordNet 1.7 are multiword expressions. In the last research task, we develop a model to translate phrases in a given source language into a target language using a dictionary-based approach and n-gram data, generating translations for phrases occurring both outside and inside bilingual dictionaries using the information from existing bilingual dictionaries.

1.4 Intellectual and scientific merit

This dissertation will present several novel approaches from simple to complex for

automatically generating bilingual dictionaries and WordNets. We will also compare our

proposed methods against existing methods to find positive and negative points of difference,

and the reasons for the drawbacks. In addition, most existing research works with languages

that have some available lexical resources, each of which is expensive to construct. Using

many intermediate lexical resources for creating a new one may cause ambiguity in the

lexical resource created. The approaches we propose will have the potential not only to

create new lexical resources using just a few existing lexical resources which can reduce cost

and time consumed, but also can improve the quality of lexical resources we create.

Briefly, to be able to automatically create many lexical resources for languages,

espe-cially resource-poor and endangered, we need processes that do not require many resources

to begin with, presenting challenging problems for the computational linguist. Our research

will make substantial progress on these problems by bootstrapping and leveraging WordNets

and dictionaries for resource-rich languages.

1.5 Broader impact

The goal of this dissertation is to study the feasibility of creating multilingual lexical

resources for languages by bootstrapping from a few existing resources. Our research has the potential not only to construct new lexical resources, but also to support communities using these languages.

1.6 Organization of the dissertation

The thesis is organized as follows. Existing approaches for constructing new bilingual dictionaries and WordNets for languages, and for generating phrase translations, are presented

in Chapter 2. Chapter 3 introduces notations, input resources used and the methods to

evaluate resources we create. Chapter 4 and Chapter 5 propose methods to create reverse

bilingual dictionaries and new bilingual dictionaries, respectively. Approaches to construct

WordNet synsets for many languages are proposed in Chapter 6. In Chapter 7, we present

algorithms to generate translations for phrases, with a case study on translating from Vietnamese to English. Future work is discussed at the end of each chapter. Chapter 8 concludes

the thesis.

Acknowledgment

A synopsis of this dissertation is presented in the paper “Automatically creating multilingual lexical resources” in the Proceedings of the Doctoral Consortium at the 28th AAAI Conference on Artificial Intelligence.

CHAPTER 2 RELATED WORK

2.1 Introduction

Understanding existing approaches to create new bilingual dictionaries, to generate translations for phrases, and to construct WordNets provides us the background knowledge required to develop techniques to solve the problems discussed in this dissertation. In this chapter, we summarize and discuss related work on building the relevant lexical resources. The remainder of this chapter is organized as follows. In Section 2.2, we describe the structure of lexical resources. Section 2.3 gives the ISO 639-3 codes of the languages mentioned in this dissertation. Specific approaches to generate dictionaries, translations for phrases, and WordNets from different linguistic resources are presented in Section 2.4, Section 2.5 and Section 2.6, respectively. Section 2.7 summarizes the chapter.

2.2 Structure of lexical resources

This thesis proposes approaches to automatically construct bilingual dictionaries and WordNets. Therefore, this section presents the structure of bilingual dictionaries and WordNets, focusing on the Princeton WordNet.

2.2.1 Structure of a bilingual dictionary

For notational purposes, we assume that a bilingual dictionary Dict(A,B) contains entries of word or phrase translations from the source language A to the target language B, whereas Dict(B,A) translates words or phrases in language B to words or phrases in language A. In particular, Dict(A,B) contains entries (a,b), whereas Dict(B,A) contains entries (b,a).

A dictionary entry, called a LexicalEntry, is a 2-tuple <LexicalUnit, Definition>. Here, the LexicalUnit is the word or phrase being defined, also called the definiendum more formally, based on Aristotle's analysis [58]. Usually, a LexicalUnit is lemmatized (i.e., reduced to a representative or citation form, such as the infinitive for verbs), but not always. A list of entries sorted by the LexicalUnit is called a lexicon or a dictionary. Given a LexicalUnit, the Definition associated with it usually contains its class (e.g., part-of-speech (POS)) and pronunciation, its meaning, and possibly additional information, including usage. The meaning associated with it can have several Senses. A Sense is a discrete representation of a single aspect of the meaning of a word. Thus, a dictionary entry is of the form <LexicalUnit, Sense1, Sense2, ...>.
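As a concrete sketch, such an entry might be represented as follows in Python; the class and field names are illustrative assumptions, not structures defined in this thesis:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Sense:
        # one discrete aspect of the meaning of a lexical unit
        pos: str                                   # e.g., "noun"
        meaning: str                               # the definition text
        examples: List[str] = field(default_factory=list)

    @dataclass
    class LexicalEntry:
        # a dictionary entry: <LexicalUnit, Sense1, Sense2, ...>
        lexical_unit: str                          # the (usually lemmatized) definiendum
        senses: List[Sense] = field(default_factory=list)

    # A Dict(A,B) is then a list of LexicalEntry objects sorted by lexical_unit.
    entry = LexicalEntry("dog", [Sense("noun", "a domesticated canid", ["The dog barked."])])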

2.2.2 Structure of the Princeton WordNet

The main relation between words in a WordNet is synonymy. A synset contains one or many words. A polysemous word is assigned to many synsets. Each synset has one gloss, which is a brief definition of the concept, along with sentences showing the use of the words in the synset. The WordNet 2.1 overview by Marin Dantchev [26] says that each synset is linked to other synsets by numerous conceptual relations. The rest of this section discusses the synsets from the four syntactic categories: nouns, adjectives, adverbs and verbs.

The Princeton WordNet version 3.0 has 117,798 nouns in 82,115 synsets. The noun synsets are organized into hierarchies. WordNet distinguishes types and instances in noun synsets [29]. Types contain common nouns such as “location”, “president” and “car”, while instances are always leaves of trees, or terminal nodes, in the hierarchy. The relations among noun synsets are super-subordinate relations (viz., hypernymy and hyponymy), part-whole relations (viz., meronymy and holonymy), and antonymy.

- Hypernymy is a semantic relation that links a more general word to a more specific

word. For example, the hypernym set of the word “dog” is {canine, canid}.

- Hyponymy links a more specific word to a general word. The hyponym set of the

word “canid” is {bitch, dog, wolf, jackal, hyena, hyaena, fox}. Hyponymy is transitive.

For example, the word “dog” represents a kind of the word “canine”, which represents

a kind of the word “carnivore”; so “dog” represents a kind of “carnivore”.

- Meronymy links synsets denoting parts to synsets denoting the whole. In particular,

if a word a is a meronym of a word b, a is one part of b. For example, the words

{back, backrest, leg} are meronyms of the word “chair”. The inverse of meronymy is

holonymy. Therefore, the word “chair” is the holonym of {back, backrest, leg}.

- Antonymy expresses the relation between two opposite nouns. For instance, the word

“woman” is an antonym of the word “man”.
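These noun relations can be explored programmatically; the following sketch uses NLTK's interface to the Princeton WordNet (assuming the nltk package and its wordnet data are installed):

    from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

    dog = wn.synset('dog.n.01')
    print(dog.hypernyms())                       # more general synsets, e.g., canine.n.02
    print(wn.synset('canine.n.02').hyponyms())   # more specific synsets

    chair = wn.synset('chair.n.01')
    print(chair.part_meronyms())                 # parts of a chair, e.g., back, leg

    # antonymy is defined between lemmas rather than whole synsets
    print(wn.lemma('man.n.01.man').antonyms())   # [Lemma('woman.n.01.woman')]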

The current WordNet contains 21,479 adjectives organized into 18,156 synsets. Adjective synsets are classified into two categories: descriptive adjectives and relational adjectives. The main relation among descriptive adjectives is antonymy, e.g., the antonym of the word “short” is {long}. Adjective synsets are organized into bipolar clusters, where words similar to one adjective are grouped with all the words similar to its antonym [26]. The relation for relational adjectives is pertainymy, which points to the nouns they are derived from.

There are 3,748 adverbs in 733 synsets. Adverbs in WordNet are usually derived from adjectives via morphological affixation, such as “strongly”, “shortly” and “rarely”. The relations among adverb synsets are synonymy and, sometimes, antonymy.

WordNet contains 6,277 verbs with 5,252 synsets. Verb synsets are also organized

into hierarchies. The common relations between verb synsets are troponymy, entailment,

and the cause relation.

- Troponymy is when the activity of one verb is doing the activity of another verb in some manner. For example, the verb “run” is a troponym of the verb “walk”.

- Entailment occurs when one verb logically occurs after another. For instance, the verb “divorce” entails the verb “marry”.

- The cause relation relates one verb, which is causative, and another, which is resultative. For example, the verb “show” and the verb “see” have a cause relation between them.

Another widely used term is Common Base Concepts, first introduced in building the EuroWordNet [96]. A concept is important if it is widely used. In the EuroWordNet, the Common Base Concepts are classified using a Top Ontology. The Top Ontology is divided into three categories, named 1stOrderEntities, 2ndOrderEntities, and 3rdOrderEntities.

- The 1stOrderEntities contain concrete synsets which are specified for four roles, viz., “origin”, “form”, “composition” and “function”. For example, vehicle is classified as Artifact (Origin) + Object (Form) + Vehicle (Function). The 1stOrderEntities are always nouns.

- The 2ndOrderEntities include synsets which are located in time, occurring or taking place rather than existing, e.g., “continue”, “occur” and “play”. The 2ndOrderEntities can be nouns, verbs and non-dynamic adjectives.

- The 3rdOrderEntities consist of synsets which exist independently of time and space. They can be true or false rather than real, e.g., “idea”, “thought”, “information” and “plan”. The 3rdOrderEntities are always nouns.

2.3 Language codes

In this thesis, we use the names of languages and their ISO 639-3 codes interchangeably. The ISO 639-3 codes of the languages mentioned, including in the discussion of related work and our experiments, are presented in Table 2.1.

Table 2.1: Languages mentioned and their ISO 639-3 codes

Language     Code   Language     Code   Language    Code   Language    Code
Arabic       arb    Assamese     asm    Bengali     ben    Cherokee    chr
Cheyenne     chy    Chinese      cht    Croatian    hrv    Dimasa      dis
Dutch        nld    English      eng    French      fra    Finnish     fin
Galician     glg    German       deu    Hindi       hin    Hungarian   hun
Indonesian   ind    Japanese     jpn    Karbi       ajz    Korean      kor
Italian      ita    Lithuanian   lit    Malay       zlm    Thai        tha


2.4 Creating new bilingual dictionaries

To construct a new bilingual dictionary, we may use diverse available resources, such as existing dictionaries, thesauri, corpora or WordNets. Whatever resources are used, there are two main steps to create a new bilingual dictionary. First, translation candidates are extracted from the resources used (e.g., dictionaries, thesauri or corpora). Second, heuristic algorithms or statistical information are used to disambiguate and to rank the translation candidates. The general method for constructing a new bilingual dictionary is presented in Figure 2.1. The approaches we discuss in the next subsections all fit within this general architecture.

Figure 2.1: A general method to create a new bilingual dictionary.
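The two steps can be sketched as a generic pipeline; extract_candidates and score_candidate below are hypothetical stand-ins for whatever resource-specific extraction and disambiguation method a given approach uses:

    def build_dictionary(source_words, extract_candidates, score_candidate, threshold=0.5):
        # generic two-step pipeline: extract candidates, then rank and filter
        new_dict = {}
        for w in source_words:
            candidates = extract_candidates(w)               # step 1: extraction
            scored = sorted(((c, score_candidate(w, c)) for c in candidates),
                            key=lambda x: -x[1])
            kept = [c for c, s in scored if s >= threshold]  # step 2: disambiguation/ranking
            if kept:
                new_dict[w] = kept
        return new_dict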

Human evaluation is the first choice in evaluating the quality of a new dictionary. However, it is really hard to find volunteers familiar with the languages in a dictionary Dict(A,B) we may create, such as Assamese-Vietnamese or Cherokee-Karbi. Researchers have evaluated their approaches by generating a dictionary for another language pair Dict(C,D) such that there exists at least one published good-quality dictionary Dict*(C,D), which is used as a gold standard to compute precision, recall, or F-score for Dict(C,D). The precision value is the percentage of entries in the new dictionary Dict(C,D) that match entries in the existing dictionary Dict*(C,D). The recall is the percentage of entries in Dict*(C,D) that also exist in Dict(C,D). We consider the terms accuracy and precision of a dictionary to be synonymous.
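Treating each dictionary as a set of (source word, translation) pairs, this evaluation can be sketched as follows; the set-based matching is an assumption made for illustration:

    def evaluate(new_dict, gold_dict):
        # new_dict, gold_dict: non-empty sets of (source_word, translation) pairs
        matched = new_dict & gold_dict
        precision = len(matched) / len(new_dict)   # fraction of new entries confirmed
        recall = len(matched) / len(gold_dict)     # fraction of gold entries recovered
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        return precision, recall, f_score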

2.4.1 Generating bilingual dictionaries using one intermediate language

A basic approach to create a new dictionary and handle ambiguities is a pivot-based method that uses inverse consultation, introduced by Tanaka and Umemura [115]. They generate a Japanese-French dictionary Dict(jpn,fra) and a French-Japanese dictionary Dict(fra,jpn) from a Japanese-English harmonized dictionary, Dicthm(jpn,eng), and an English-French harmonized dictionary, Dicthm(eng,fra). A harmonized dictionary Dicthm(A,B) is a symmetrical dictionary created by integrating two unidirectional dictionaries Dict(A,B) and Dict(B,A). In the one-time inverse consultation method, for each given word in the source language, Japanese, they find a translation chain jpn → eng1 → fra → eng2, and then count the number of matches between eng1 and eng2, where eng1 and eng2 are two sets of words obtained by translation as shown by the arrows. The greater the number of matches, the better the translation candidate. Similarly, in two-time inverse consultation, for each given Japanese word jpn1, they experiment with the translation chain jpn1 → eng → fra → eng → jpn2, and then count the number of matches between the input Japanese word and the returned Japanese words. For evaluation, Tanaka and Umemura [115] randomly select 100 entries from each of the dictionaries they create, Dict(jpn,fra) and Dict(fra,jpn), and evaluate them manually and by matching against existing dictionaries; the matching fraction for manual evaluation and the matching percentage are 56% and 58%, respectively.

Shirai et al. [109], and Shirai and Yamamoto [108], conclude that the inverse consultation approach does not resolve the WSD problem well. In addition, differences in the linguistic natures of languages, such as Japanese and English, affect the content of the harmonized dictionaries. The authors introduce methods to improve the quality of dictionaries created using inverse consultation. Shirai and Yamamoto [108] generate translation candidates from Korean to Japanese using one-time inverse consultation from two dictionaries: Korean-English and Japanese-English. Then, the degree of similarity between words is used to select correct translations. Given a word in the source language (Korean) w_K, and a word in the target language (Japanese) w_J, the degree of similarity between w_K and w_J is based on the number of common translations of these words in the intermediate language (English):

\mathrm{degree\ of\ similarity}(w_K, w_J) = \frac{2\,|common(E_{w_K}, E_{w_J})|}{|E_{w_K}| + |E_{w_J}|}, \qquad (2.1)

where E_{w_K} and E_{w_J} are the sets of translations in English of w_K and w_J, respectively. For evaluation, they randomly select 1,000 Korean words from a published Korean-Japanese dictionary, and then create the Japanese translations for these Korean words using their approach. They evaluate their translations against the translations in a published dictionary. The accuracy of their translations is 72% when the degree of similarity is equal to or greater than 0.8.
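Equation 2.1 translates directly into code; the sketch below assumes each argument is the set of English translations of the Korean or Japanese word:

    def degree_of_similarity(eng_of_wk, eng_of_wj):
        # Dice-style overlap of the two English translation sets (Eq. 2.1)
        common = eng_of_wk & eng_of_wj
        return 2 * len(common) / (len(eng_of_wk) + len(eng_of_wj))

    # e.g., degree_of_similarity({'south'}, {'south', 'southern'}) == 2/3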

Zhang et al. [126] create a Japanese-Chinese dictionary from Japanese-English and English-Chinese dictionaries using one-time inverse consultation. To rank candidates and handle ambiguity, they compute a penalty for each candidate pair; the smaller the penalty value, the better the translation:

\mathrm{penalty}(w_J, w_C) = k_1 F_1(w_J, w_C) - k_2 F_2(w_J, w_C), \qquad (2.2)

where k_1 and k_2 are weights, set based on preliminary experiments, F_1 is the similarity value in POS between a Japanese word w_J and a Chinese word w_C, and F_2 is the one-time inverse consultation score of that pair. 172 Japanese words were randomly selected for human evaluation, to be marked either “correct” or “wrong”. The accuracy of their best dictionary is 70.12%.

According to Shirai et al. [109], selecting correct translations among the many translation candidates produced using two-time inverse consultation is a challenge. Starting with a Korean-English dictionary and an English-Japanese dictionary, Shirai et al. [109] use the two-time inverse consultation method to generate Korean-Japanese candidates; then, they look for overlaps to limit the number of translation candidates. They evaluate their translations by comparing with a published Korean-Japanese dictionary. The precision of their dictionary is 85.7%, while the recall is 35%.

Paik et al. [92] experiment with different input bilingual dictionaries and take directionality into account in creating new Korean-Japanese dictionaries with different accuracies. First, given a Korean-English dictionary Dict(kor,eng) and a Japanese-English dictionary Dict(jpn,eng), the one-time inverse consultation method is used. According to their experiment, the more similar the source and target languages are, the more correct the translations are. The same approach with several pivot languages is also used by Paik et al. [91]. Their second experiment computes the overlapping constraints of translation candidates created from Dict(kor,eng) and Dict(eng,jpn). A candidate with a high overlap similarity score is likely to be a correct translation:

\mathrm{overlap\ similarity\ score}(w_J, w_K) = |w_J|, \quad w_J \in J(E(w_K)), \qquad (2.3)

where E(w_K) is the set of translations in English of a Korean word w_K, and J(E) is the set of translations in Japanese of the English words in E. This method can increase the number of entries in the new dictionaries significantly. However, many ambiguous entries are created in the new dictionaries due to the presence of polysemous words in the pivot language. Finally, a new dictionary is created from Dict(eng,kor) and Dict(eng,jpn). The candidates whose similarity scores are greater than a threshold are added to the new dictionary. The similarity score for w_J and w_K is computed as below:

\mathrm{similarity\ score}(w_J, w_K) = \frac{|K(E(w_K) \cap E(w_J))| + |J(E(w_K) \cap E(w_J))|}{|E(w_K) \cap E(w_J)|}. \qquad (2.4)

Paik et al. [92] claim that it is appropriate to construct a new dictionary Dict(A,C) using the two bilingual dictionaries Dict(A,B) and Dict(C,B) when A and C are very similar.
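A sketch of the similarity score of Equation 2.4; E, K, and J are hypothetical helpers returning the English, Korean, and Japanese translation sets, respectively:

    def similarity_score(w_j, w_k, E, K, J):
        # Eq. 2.4: translations (back into Korean and Japanese) of the shared
        # English translations, normalized by the size of the shared set
        shared_eng = E(w_k) & E(w_j)
        if not shared_eng:
            return 0.0
        return (len(K(shared_eng)) + len(J(shared_eng))) / len(shared_eng)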

The pivot-based method is also used by Sjöbergh [110] to create a new Japanese-Swedish dictionary Dict(jpn,swe) from a Japanese-English dictionary Dict(jpn,eng) and a Swedish-English dictionary Dict(swe,eng). After removing English stop words from the existing dictionaries, each English word w_E is assigned a weight, calculated by an idf-like measure:

\mathrm{weight}(w_E) = \log\left(\frac{|Dict(swe,eng)| + |Dict(jpn,eng)|}{|Dict(swe,eng)_{w_E}| + |Dict(jpn,eng)_{w_E}|}\right), \qquad (2.5)

where |Dict(A,B)| is the number of entries in the dictionary, and |Dict(A,B)_{w_E}| is the number of descriptions in the dictionary containing the word w_E. Then, they match English words in the two existing dictionaries and score the matches as follows:

\mathrm{score} = \frac{2 \sum_{a} \mathrm{weight}(w_E)}{\sum_{w_{E1}} \mathrm{weight}(w_{E1}) + \sum_{w_{E2}} \mathrm{weight}(w_{E2})}, \qquad (2.6)

where a ∈ Dict(swe,eng) ∩ Dict(jpn,eng), w_{E1} ∈ Dict(swe,eng), and w_{E2} ∈ Dict(jpn,eng). A better translation has a higher score. For multiword expressions that have no translation in the target language, concatenations of the translations of the single words in the target language are accepted as correct translations. Volunteers were asked to evaluate 300 words using a 5-point scale: all correct, majority correct, some correct, similar (which means the translation is not correct, but close to being correct), and wrong. The accuracies of their translations are 75% all correct with a score greater than 0.9, and 89% all correct with a score equal to 1.0.
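A sketch of the idf-like weighting and match scoring of Equations 2.5 and 2.6, assuming each dictionary is given as a list of English description word sets, one per entry:

    import math

    def idf_weight(w, swe_defs, jpn_defs):
        # Eq. 2.5: rare English words receive higher weights
        total = len(swe_defs) + len(jpn_defs)
        containing = sum(w in d for d in swe_defs) + sum(w in d for d in jpn_defs)
        return math.log(total / containing)   # containing >= 1 for words seen in entries

    def match_score(eng1, eng2, weight):
        # Eq. 2.6: weighted Dice overlap of two English description word sets
        shared = sum(weight(w) for w in eng1 & eng2)
        return 2 * shared / (sum(weight(w) for w in eng1) + sum(weight(w) for w in eng2))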

2.4.2 Generating bilingual dictionaries using many intermediate languages

To increase the precision of new dictionaries, one can construct new bilingual dictionaries using transitivity with two or more pivot languages. Gollins and Sanderson [32] introduce a triangulated translation method for improving cross-language information retrieval. To create a translation of a word a in the source language A into the target language B, they translate a into two intermediate languages C and D to generate words c and d, respectively. Then, they translate c and d into the target language B and merge the results in different ways. Adding one more intermediate language to the triangulated translation method produces “three-way” triangulated translation. Their experiments are with European languages that are covered by the EuroWordNet [122]. They select words in a source language, create translations in a target language, and evaluate by comparing their translations with the translations obtained from the EuroWordNet. According to Gollins and Sanderson, triangulated translation outperforms the transitive method by over 55% when the accuracy metric is used, because it helps reduce ambiguous senses of words in the triangulated scheme. The addition of pseudo-relevance feedback [6] as pre-translation to

triangulation translation improves the precision of translations. An example of the triangulated translation method applied to a non-European language, with English and French as pivots, to create entries for a new dictionary is shown in Figure 2.2. The Hindi word “vasant” is translated into English and French. Then, the resulting words in the intermediate languages are translated into Vietnamese in order to generate translation candidate sets. The correct translations of this Hindi word in Vietnamese are the words that survive after applying different merge strategies to the translation candidate sets. As a result, the translation of “vasant” is “mùa xuân”.

Figure 2.2: An example of the lexical triangulated translation method
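A minimal sketch of the triangulation idea with an intersection merge (one of several possible merge strategies); the dictionaries are assumed to map words to sets of translations:

    def triangulate(word, src_to_p1, p1_to_tgt, src_to_p2, p2_to_tgt):
        # translate via two pivot languages and keep candidates produced by both
        via_p1 = {t for p in src_to_p1.get(word, set()) for t in p1_to_tgt.get(p, set())}
        via_p2 = {t for p in src_to_p2.get(word, set()) for t in p2_to_tgt.get(p, set())}
        return via_p1 & via_p2

    # e.g., triangulate("vasant", hin_eng, eng_vie, hin_fra, fra_vie) would keep
    # only candidates such as "mùa xuân" that survive both pivot paths.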

Bond et al. [14], and Bond and Ogura [13], create new dictionaries via one or more pivots. Created entries are ranked in different ways, such as using the one-time inverse consultation score introduced by Tanaka and Umemura [115], or a semantic matching score, which is the number of times the semantic classes of a_i and c_j match, mainly focusing on nouns. Samples of random words in the source language and their translations in the target language are selected for evaluation by lexicographers. They evaluate entries in the Japanese-Malay dictionary they created from a Japanese-English dictionary and an English-Malay dictionary. To handle homonyms, they use two intermediate languages: English and Chinese. Using the two intermediate languages, 97% of entries in the new dictionary become acceptable, but the number of entries decreases significantly, from 75,872 to 5,238.

A link structure, introduced by Ahn and Frampton [2], is also used to handle ambiguous translations. The central idea is that if (i) a word a in a source language A is translated to a word b in an intermediate language B, which is translated to a word c in a target language C, and (ii) conversely, the word c is translated to the word b, which is translated back to the word a, then the word c is a correct translation of the word a. The problem with this method is the presence of polysemous words in the intermediate languages. Ahn and Frampton ameliorate the effect of polysemous words in the following manner. They find all words b_k which are translations of each word c_i; then, they find all translations a_j of each word b_k. The words a_j which are the same as the source word a are selected. Finally, they retrace the path to get the words c_i, which are correct translations of the word a. The newly created dictionary, a Spanish-German dictionary, covers 78.4% of the entries in an existing dictionary that was created manually. Issues affecting their results include the observation that the manually created dictionary does not contain many entries created using their approach, the limited sizes of the input dictionaries, and different font encodings in the input dictionaries that mess up their results.
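The round-trip check can be sketched as follows, with the four dictionaries given as mappings from a word to its set of translations (names are illustrative):

    def link_structure_translations(a, a_to_b, b_to_c, c_to_b, b_to_a):
        # keep target words c from which some path c -> b' -> a returns to the source word
        correct = set()
        for b in a_to_b.get(a, set()):
            for c in b_to_c.get(b, set()):
                if any(a in b_to_a.get(b2, set()) for b2 in c_to_b.get(c, set())):
                    correct.add(c)
        return correct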

A well-known effort to construct many new bilingual dictionaries is by Mausam et al. [69]. They report several algorithms for creating dictionaries using probabilistic inference. They extract entries from multiple dictionaries of multiple language pairs using the concept of a translation graph, in which each vertex represents a word in a language and an edge connecting two vertices represents a belief that the two vertices share a sense. The Transgraph algorithm computes an equivalence score that two words in a translation graph share the same sense. If this score is greater than a threshold, the two words in two distinct languages are accepted as sharing the same sense. The main idea behind the Unpruned SenseUniformPaths (uSP) algorithm is that two vertices share the same sense if there exists at least one translation circuit, found by using a random walk and choosing random edges without having duplicate vertices in the path from the source word to the target word. However, the uSP algorithm faces errors that occur in processing source dictionaries to generate the translation graph, as well as correlated sense shifts in translation circuits. The SenseUniformPaths (SP) algorithm solves uSP's problems by pruning paths whose vertices enter an ambiguity set twice. An ambiguity set is a set of nodes sharing more than one sense. Their best algorithm is the SP algorithm at precision 0.90, producing 4.5 times as many translations as the dictionaries supported by the Wiktionary, and 73% more translations than the other source dictionaries provide.

2.4.3 Extracting bilingual dictionaries from corpora

If languages A and C have substantial corpora of documents that are readily available, researchers have attempted to derive translations between A and C using several methods. This subsection presents a variety of approaches for extracting translations from parallel corpora, bi-texts, comparable corpora and monolingual corpora.

Brown [19] derives bilingual lexicons from a Spanish-English parallel corpus containing 685,000 sentence pairs. They construct a correspondence table based on symmetric and asymmetric co-occurrence ratios among words to show the existence of word or phrase translations within sentence pairs. Two thresholds, one symmetric and one asymmetric, are set through experiments to handle ambiguous candidates and coincidental co-occurrences. The value of each cell in the table is from 0.0 to 1.0. Elements in the table with values greater than 0.0 are added to the new bilingual dictionary. The best dictionary they extracted used a fixed threshold of 1.0 and consisted of 14,446 entries (covering 15% of the vocabulary in the corpus) with the lowest error rate, at 29%.

If a language pair does not have a parallel corpus, but there are some directly translated texts from one language to the other, or texts translated into both languages from an intermediate language, researchers may be able to construct a parallel corpus using the intermediate language as a pivot. Then, a new dictionary can be extracted from the generated parallel corpora. For example, Héja [38] collects texts translated into the source languages (Lithuanian and Slovenian) and the target language (Hungarian) from an intermediate language (English) to construct parallel corpora (Lithuanian-Hungarian and Slovenian-Hungarian). In the corpora he creates, sentences in one language might be combined or split into many sentences in the other language, because they are not perfect direct translations. Hence, translation units, instead of sentences, are used to measure the sizes of these corpora. The Lithuanian-Hungarian corpus contains 147,158 translation units, whereas the Slovenian-Hungarian corpus consists of 38,574 translation units. Then, GIZA++ [86] is used to compute translation probabilities for every translation candidate and perform word alignment. Héja also calculates frequencies of words in the source and target languages. A translation candidate is added to the new bilingual dictionary if its translation probability, its frequency in the source language, and its frequency in the target language are higher than some thresholds. From experiments, he finds that a candidate with a low translation probability but high frequency is a good translation, whereas a candidate with a high translation probability but low frequency is not necessarily so. He also finds that the number of entries in the newly created dictionaries strongly depends on the size of the corpora. He derives approximately 5,000 and 4,000 translation candidates that satisfy all three threshold requirements from the Slovenian-Hungarian and Lithuanian-Hungarian corpora, respectively. 863 extracted translations were evaluated manually. The highest proportion of “useful” translation pairs in the new dictionaries is 97.2%, for translation probabilities from 0.7 to 1.0.

If a language pair (A,C) has a very small amount of bi-texts, but there exists a third language B such that B is related to A or C and has a large parallel corpus or bi-texts with A or C, researchers might be able to construct bilingual lexicons for A and C from the available resources based on transliterations and cognates. The CLDR project defines transliteration as “the general process of converting characters from one script to another, where the result is roughly phonetic for languages in the target script”. For example, “Niu Di-lân” is a transliteration of “New Zealand” in Vietnamese. According to Molina [76], “cognates are words descended from a common ancestor; that is, words having the same linguistic family or derivation”. Some examples of cognates in English and Spanish are “family” - “familia”, “elephant” - “elefante”, and “gorilla” - “gorila”. Nakov and Ng [80] concatenate the two bi-texts, align words, and then extract cognates. One of their main experiments is to extract translations from Spanish to English from the bi-texts of Portuguese-English and Spanish-English; they consider Portuguese a language closely related to Spanish. They extract cognates based on the translation probabilities of words from Portuguese to Spanish using English as a pivot, and on orthographic similarity using the longest common subsequence ratio (LCSR) [71], calculated by dividing the length of the longest common subsequence by the length of the longer word. The LCSR threshold is set to be equal to or greater than 0.58.
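The LCSR is easy to compute with the standard longest-common-subsequence dynamic program; a sketch:

    def lcsr(w1, w2):
        # longest common subsequence ratio: LCS length / length of the longer word
        m, n = len(w1), len(w2)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m):
            for j in range(n):
                dp[i + 1][j + 1] = (dp[i][j] + 1 if w1[i] == w2[j]
                                    else max(dp[i][j + 1], dp[i + 1][j]))
        return dp[m][n] / max(m, n)

    # e.g., lcsr("website", "webseite") == 7/8 = 0.875 >= 0.58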

Then, they estimate the translations using the competitive linking algorithm [72]. Cognates are extracted from a training dataset, and then used to train, on the same training dataset, a transformation of words in Portuguese into Spanish. The BLEU score of their translations is 3.37. In addition, they claim that their approach achieves better results than methods using parallel corpora and pivot languages.

Ljubeˇsi´c and Fiˇser [67] extract a Croatian-Slovene dictionary from a comparable

cor-pus of news articles. Initially, a seed dictionary, with 33,495 entries, is created by detecting

words that are identically spelled in both languages and also have the same POS in both

languages. The similarity between the two languages is high since the average cosine

dis-tance between corresponding 3-grams picked from corpus is 74%.5 The average precision

of their seed dictionary is 72% as computed by manual evaluation. The first dictionary is

created by expanding the seed dictionary with cognates found by using a modified LCSR algorithm named BI-SIM [55]. The second dictionary is generated by adding to the seed dictionary the first translation candidates with a frequency of at least 200. They

evaluate the dictionaries they create by comparing against a hand-created gold standard

with 500 entries. Their first dictionary consists of 34,823 entries with a precision of 68.5%

whereas the second dictionary has 34,817 entries with a precision of 71.4%. According to

Ljubešić and Fišer, simply considering the first translation candidates as correct translations

is very effective.

Given a Chinese-English dictionary Dict(cht,eng), Shao and Ng [106] extract new

translations from a Chinese-English comparable corpus using both context and transliteration information. The existing Chinese-English dictionary they use has about 10,000

entries. The size of the English corpus is 730M bytes, and the size of the Chinese corpus is


120M bytes. They divide the corpora into time periods, perform segmentation, and determine unknown Chinese and English words appearing in each period. Next, they estimate

the translation probability for each translation candidate based on the context:

P(C(c) | C(e)) = \prod_{t_c \in C(c)} P(t_c | T_c(C(e)))^{q(t_c)},    (2.7)

where q(t_c) is the number of occurrences of a Chinese word t_c in the context C(c), C(e) is the context of the English word e, and T_c(C(e)) is the bag of Chinese words created by translating the English words in C(e) using the bilingual dictionary. Then, a probability of translation for each candidate based on transliteration is obtained as follows:

P(e | c) = P(e | pinyin) = \sum_a \prod_i P(l_{a_i} | p_i),    (2.8)

where p_i is the ith syllable of the Pinyin (the official romanization used in China) created by converting each character in the Chinese word c, and l_{a_i} is the English letter sequence that the ith Pinyin syllable maps to under a particular alignment a. Finally, they rank candidates based

on the probabilities of translation. The number of new Chinese source words and English

translations found are 4,499 and 192,521, respectively. The precision of the newly found translations is 78.2%, as evaluated by humans.
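A compact Python sketch of these two scores follows; the probability tables, contexts, and alignments are illustrative stand-ins (Shao and Ng estimate them from the corpora), and the smoothing constant is an assumption:

from collections import Counter

def context_score(chinese_context, english_context, dict_e2c):
    """Eq. 2.7: product over Chinese context words t_c of
    P(t_c | T_c(C(e)))^q(t_c). Here P(t_c | bag) is approximated by the
    relative frequency of t_c in the translated bag (an assumption)."""
    # T_c(C(e)): bag of Chinese words obtained by translating the English context.
    bag = Counter(c for e in english_context for c in dict_e2c.get(e, []))
    total = sum(bag.values()) or 1
    score = 1.0
    for t_c, q in Counter(chinese_context).items():
        p = bag[t_c] / total
        if p == 0.0:
            p = 1e-9  # smoothing for unseen words (assumption)
        score *= p ** q
    return score

def transliteration_score(pinyin, alignments, letter_probs):
    """Eq. 2.8: sum over candidate alignments a of the product over syllables of
    P(l_{a_i} | p_i). Each alignment is given as a list of English letter chunks,
    one per Pinyin syllable."""
    total = 0.0
    for chunks in alignments:
        prod = 1.0
        for p_i, l_ai in zip(pinyin, chunks):
            prod *= letter_probs.get((l_ai, p_i), 0.0)
        total += prod
    return total

# Toy example: scoring the English word "Beijing" against Pinyin ["bei", "jing"].
letter_probs = {("bei", "bei"): 0.5, ("jing", "jing"): 0.4}
print(transliteration_score(["bei", "jing"], [["bei", "jing"]], letter_probs))  # 0.2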

Researchers have derived bilingual lexicons even for language pairs that have neither a parallel corpus nor a comparable corpus. Koehn and Knight [53] aim to derive one-to-one bilingual noun translations from German to English using two such disparate corpora: an English corpus and a German corpus that differ in time period and orientation. They find translation candidates based on (i)

identical words adopted from other languages (e.g., “email” and “internet”), (ii) words with

similar spelling due to cognate origin (e.g., “website” in English and “webseite” in German),

(iii) words appearing in similar contexts, (iv) words whose similarity to other words is preserved across languages (e.g., the word "dog" is similar to the word "cat") and (v) frequencies of words.

Starting with just the identical words, they extract 1,339 bilingual noun translations, which can be considered to constitute a seed lexicon, with an accuracy of 89%. According to

Koehn and Knight, finding identical words, words with similar spelling, and words in similar

context help find significantly more new bilingual translations. The authors report that

the translations they extract cover 39% of the translations extracted at word-level from a

German-English parallel corpus.

2.4.4 Generating dictionaries from multiple linguistic resources

To improve the quality and the quantity of entries in the newly created dictionaries,

researchers extract translation candidates from available bilingual dictionaries, as we have discussed in prior sections, but extend the process by using resources such as thesauri, corpora, and WordNets to identify senses of words and to remove irrelevant candidates. Sanfilippo and

Steinberger [102] enrich a bilingual dictionary Dict(A,B) by linking its senses to senses in a

thesaurus of A. The enriched dictionary can be used to distinguish translation candidates

of a word in a given context. In the thesaurus, each word a_i has one or many senses and corresponding synonyms for each sense. Each sense has an identification number sense_ij:

a_i: sense_i1: a_i11, a_i12, a_i13, ...
     sense_i2: a_i21, a_i22, ...
     ...
     sense_ij: ...

Given a word a_i in language A in the dictionary, they obtain all words belonging to each sense of this word from the thesaurus, translate them into the target language B, and rank the translation candidates based on their occurrence counts. Finally, they match the translation candidates

against the thesaurus senses; each candidate is linked to a sense of the source word or is discarded. As a result, translations b of the source word a are grouped based on senses of a:

a_i: sense_i1: b_i11, b_i12, ...
     sense_i2: b_i21, ...
     ...
     sense_ij: ...

The precision and recall of linking senses are 86% and 97%, respectively, whereas

those of ranking translations are 87% and 92%, respectively. The approach of Sanfilippo

and Steinberger [102] can be used to create a new dictionary Dict(B,C) from the given

dictionaries Dict(A,B) and Dict(A,C), and a thesaurus in language A. They link senses in

each dictionary to senses in the thesaurus, generate translations between B and C using A

as a pivot, and align translations using the unique sense numbers of the pivot word a_i in A.
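A minimal Python sketch of this sense-based alignment follows; the data layout (dictionaries keyed by pivot word and sense number) is an illustrative assumption rather than the authors' actual format:

def align_via_pivot_senses(senses_ab, senses_ac):
    """Given sense-grouped dictionaries Dict(A,B) and Dict(A,C), each mapping
    (pivot_word, sense_id) to a set of translations, pair up B and C words that
    share the same pivot word and sense number."""
    new_dict = {}
    for key, b_words in senses_ab.items():
        c_words = senses_ac.get(key, set())
        for b in b_words:
            new_dict.setdefault(b, set()).update(c_words)
    return new_dict

# Toy example with a single pivot word "bank" and two senses.
senses_ab = {("bank", 1): {"Bank"}, ("bank", 2): {"Ufer"}}    # English-German
senses_ac = {("bank", 1): {"banque"}, ("bank", 2): {"rive"}}  # English-French
print(align_via_pivot_senses(senses_ab, senses_ac))  # German-French pairs per sense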

Goh et al. [31] construct a new Japanese-Chinese dictionary from Japanese-English

and Chinese-English dictionaries using the pivot-based method through English, relying on the one-time inverse consultation method. Samples of 200 randomly selected words of

each category (nouns, verbal nouns, and verbs) are evaluated manually using a 4-point scale

{correct, not-first, acceptable, wrong}. Their dictionary has 20,554 entries with an average

accuracy of 77%. Because many Japanese words are combinations of Kanji characters,

which are similar to Hanzi in Chinese, they find 7,941 new translations with accuracy of

97% for nouns and 97.5% for verbal nouns by converting Kanji to Hanzi.

Nerima and Wehrli [81] create a new bilingual dictionary Dict(A, C) from two input

bilingual dictionaries Dict(A, B) and Dict(B, C) using the transitive method. The translation candidates are validated by checking their appearance in an A-C parallel corpus. An example of their experiments is to construct an English-German dictionary from English-French and German-French dictionaries consisting of 76,311 and 45,492 entries, respectively.

Their new English-German dictionary has 21,600 entries, of which 26% are found in the parallel corpus. The remaining entries, which cannot be validated using the corpus, are evaluated manually. The authors do not report a precision value for their dictionary, but they claim that the translations they create are very good.
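The transitive composition plus corpus validation can be sketched in a few lines of Python; the token-level co-occurrence test below is a rough stand-in for the authors' validation procedure, which the text does not specify in detail:

def transitive_dictionary(dict_ab, dict_bc, parallel_pairs):
    """Compose Dict(A,B) with Dict(B,C) through the shared language B, then keep
    only (a, c) pairs that co-occur in aligned sentence pairs of an A-C parallel
    corpus."""
    candidates = set()
    for a, b_words in dict_ab.items():
        for b in b_words:
            for c in dict_bc.get(b, []):
                candidates.add((a, c))
    validated = set()
    for sent_a, sent_c in parallel_pairs:  # each sentence is a list of tokens
        tokens_a, tokens_c = set(sent_a), set(sent_c)
        validated |= {(a, c) for (a, c) in candidates if a in tokens_a and c in tokens_c}
    return validated

# Toy example: English-French and French-German dictionaries, one aligned pair.
dict_ab = {"house": ["maison"]}
dict_bc = {"maison": ["Haus"]}
pairs = [(["the", "house"], ["das", "Haus"])]
print(transitive_dictionary(dict_ab, dict_bc, pairs))  # {("house", "Haus")}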

A comparable corpus has also been used to validate translation candidates. Otero

and Campos [90] create a new dictionary Dict(A, C) from Dict(A, B) and Dict(B, C) using

transitivity; then, they remove ambiguous entries in the created dictionary using an A-C comparable corpus. They split Dict(A, C) into two subsets: Dict(A, C)_amb, containing ambiguous entries, and Dict(A, C)_unamb, consisting of unambiguous entries. To remove ambiguous entries, they generate a temporary dictionary Dict(A, C)_corpus from the comparable corpus

such that every word in A is translated into the top-N best translations in C and every word

in C is also translated into the top-N best translations in A. The final bilingual dictionary

Dict(A, C) is created using the following formula:

Dict(A, C) = (Dict(A, C)_amb ∩ Dict(A, C)_corpus) ∪ Dict(A, C)_unamb.    (2.9)
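Since the three components are simply sets of candidate pairs, equation 2.9 reduces to two set operations; a Python sketch with toy data (the word pairs are illustrative):

def combine(amb, unamb, corpus):
    """Eq. 2.9: keep an ambiguous entry only if the comparable corpus confirms
    it, and keep every unambiguous entry."""
    return (amb & corpus) | unamb

amb = {("car", "coche"), ("car", "carro")}
unamb = {("dog", "can")}
corpus = {("car", "coche")}
print(combine(amb, unamb, corpus))  # {("car", "coche"), ("dog", "can")}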

They create an English-Galician dictionary from the English-Spanish and Spanish-Galician

dictionaries, and a comparable corpus of English and Galician. The dictionary created

contains 12,064 entries and 22% of the entries are found in the comparable corpus. Similar

to Nerima and Wehrli [81], Otero and Campos claim that there is no need to manually evaluate the entries they generate because their quality is the same as that of entries created by lexicographers, although they do not explain how this comparison with a lexicographer-created resource was made.

In addition to parallel or comparable corpora, researchers have also used monolingual corpora to validate translation candidates. Kaji et al. [45] create a Japanese-Chinese

dictionary from Japanese-English and Chinese-English dictionaries using the pivot-based

method. A correlation matrix of associated words versus translations, obtained from two monolingual corpora, is used to select among ambiguous translation candidates. To construct a correlation matrix, they first extract word

associations from the corpora, align the extracted Japanese word associations with the

extracted Chinese word associations using the dictionary created by the pivot-based method,

and iteratively compute the correlations between associated words and translations. The

correlation matrix is converted to a binary matrix such that the highest value in each row

of the matrix is converted to 1.0 whereas the remaining values are converted to 0.0. Finally,

the support for each translation is obtained by dividing the number of times 1.0 occurs in its

column by the number of rows in the matrix. The translations with support values greater

than a threshold are accepted as the correct translations. For evaluation, 384 Japanese

entries of nouns and their translations are manually validated. The evaluation produced a precision of 64.9% and a recall of 15.8%.
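The binarization and support computation can be illustrated with a short Python sketch; the tie-breaking rule and the acceptance threshold are assumptions, as the text does not specify them:

def support_scores(matrix):
    """Convert a correlation matrix (rows: associated words, columns: translation
    candidates) to a binary matrix by marking the maximum of each row, then score
    each column by the fraction of rows in which it wins."""
    n_rows = len(matrix)
    wins = [0] * len(matrix[0])
    for row in matrix:
        best = row.index(max(row))  # ties resolved by first occurrence (assumption)
        wins[best] += 1
    return [w / n_rows for w in wins]

# Each row holds the correlations of one associated word with three candidates.
matrix = [[0.2, 0.7, 0.1], [0.4, 0.5, 0.1], [0.6, 0.3, 0.1]]
print(support_scores(matrix))  # candidates with support above a threshold are kept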

WordNets have been used to remove irrelevant translation candidates. Varga and

Yokoyama [119, 120] generate a Japanese-Hungarian dictionary from Japanese-English and

Hungarian-English dictionaries using the pivot-based method. A translation candidate is

considered unambiguous if there exists only one translation from the source language

to the pivot language, which in turn has only one translation to the target language. To

handle ambiguities, they compute scores using information obtained from a WordNet of the pivot language, the English WordNet, as below:

score_B(w_J, w_H) = max_{i_0} |sns(w_J → i_0) ∩ sns(w_H → i_0)| / |sns(w_J → i_0) ∪ sns(w_H → i_0)|,    (2.10)

score_{C,D,E}(w_J, w_H) = |ext(w_J → w_E) ∩ ext(w_H → w_E)| / |ext(w_J → w_E) ∪ ext(w_H → w_E)|,    (2.11)

score_F(w_J, w_H) = \prod_{rel} ((c_1 + max(score_rel(w_J, w_H))) · (c_2 + c_3 · mfactor_rel(w_J, w_H))),    (2.12)

where i_0 ∈ (w_J → w_E) ∩ (w_H → w_E); sns(w) is the set of senses of the word w; and ext(w) is the set of senses of w extended with related senses from the English WordNet.
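Scores (2.10) and (2.11) are Jaccard coefficients over sense sets; a small Python sketch, with the sense-set representation assumed purely for illustration:

def jaccard(s1, s2):
    """Jaccard coefficient used by scores (2.10) and (2.11)."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def score_b(senses_j, senses_h):
    """Eq. 2.10: maximum Jaccard overlap of sense sets over shared pivot words.
    senses_j and senses_h map an English pivot word to the sense set it yields
    for the Japanese word w_J and the Hungarian word w_H, respectively."""
    shared = set(senses_j) & set(senses_h)
    return max((jaccard(senses_j[i], senses_h[i]) for i in shared), default=0.0)

# Toy example: one shared pivot word with partially overlapping sense sets.
print(score_b({"bank": {1, 2}}, {"bank": {2, 3}}))  # 1/3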
