Automatically creating multilingual lexical resources
by
Khang Nhut Lam
MSc., Ewha Womans University, Seoul, Korea, 2009
A dissertation submitted to the Graduate Faculty of the
University of Colorado at Colorado Springs
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Department of Computer Science
© Copyright by Khang Nhut Lam 2015. All Rights Reserved.
This thesis for the Ph.D. in Computer Science degree by
Khang Nhut Lam
has been approved for the
Department of Computer Science
by
Dr. Jugal Kalita, Chair
Dr. Edward Chow
Dr. Rory Lewis
Dr. Martha Palmer
Dr. Jia Rao
Khang Nhut Lam, Ph.D., Computer Science
Title: Automatically creating multilingual lexical resources
Supervisor: Dr. Jugal Kalita
Bilingual dictionaries and WordNets are important resources for natural language
processing tasks such as information retrieval and machine translation. However, lexical
resources are usually available only for resource-rich languages, e.g., English, Spanish and
French. Resource-poor languages, e.g., Cherokee, Dimasa and Karbi, have very few resources, with limited numbers of entries. Current approaches for creating new lexical resources work with languages that already have good-quality resources available in sufficient quantities. This thesis proposes novel approaches to generate bilingual dictionaries, translate phrases and construct WordNets for several natural languages, including some languages in the UNESCO Endangered Languages List (viz., Cherokee, Cheyenne, Dimasa and Karbi), by bootstrapping from just a few existing resources and publicly available resources in resource-rich languages such as the Princeton WordNet, the Japanese WordNet and the Microsoft Translator. This thesis not only constructs new lexical resources but also supports communities using these languages.
Dedication
I would like to express deep love to my parents. Without your love and your support, I could not have taken this dissertation this far. Thank you for everything you have done for me. Even though I have been half a world away from you, I have never felt lonely, because you are always with me. I would like to thank the Le family, and my best friends, Vicky Collier and Janet Gardner, who are always on my side, have taken care of me as their daughter, and have given me a real family during the time I have been in the United States.
Acknowledgments
I would like to take this opportunity to express my warm thanks to my advisor, Dr.
Jugal Kalita, who has supported and guided me with patience and encouragement, and
has provided me with a professional environment for studying and doing research since my first
day in the PhD Program at UCCS. I also owe my gratitude to my dissertation committee
members: Dr. Edward Chow, Dr. Jia Rao, Dr. Martha Palmer and Dr. Rory Lewis, for
their enthusiasm, insightful comments, constructive suggestions and critical evaluations of
my research.
A special thanks is due to Feras Al Tarouti, my lab mate and my co-author, for his stimulating contributions and discussions, his help in programming and evaluating results, and his excellent company during stressful days when we worked together to meet crucial paper deadlines. Many thanks to all of my lab mates for their help, questions, suggestions and
all the fun we have had in our lab.
Many thanks to Dubari Borah, Francisco Torres Reyes, Conner Clark, Tri Doan,
Morningkeey Phangcho, Dharamsing Teron, Navanath Saharia, Arnab Phonglosa, Faris
Kateb, Abhijit Bendale, Lalit Prithviraj Jain and Svati Dhamija for helping me evaluate
lexical resources. I also thank all my friends in the Xobdo, Microsoft and PanLex projects
who provided me with dictionaries and translations.
This research was supported by Vietnam International Education Development, Ministry of Education and Training of Vietnam (VIED). I gratefully acknowledge VIED's financial support. I also thank the Graduate School at UCCS for fellowships and the Computer Science Department for its support.
TABLE OF CONTENTS
1 Introduction
1.1 Overview
1.2 Types of lexical resources
1.3 Research focus and contribution
1.4 Intellectual and scientific merit
1.5 Broader impact
1.6 Organization of the dissertation
2 Related work
2.1 Introduction
2.2 Structure of lexical resources
2.2.1 Structure of a bilingual dictionary
2.2.2 Structure of the Princeton WordNet
2.3 Language codes
2.4 Creating new bilingual dictionaries
2.4.1 Generating bilingual dictionaries using one intermediate language
2.4.2 Generating bilingual dictionaries using many intermediate languages
2.4.3 Extracting bilingual dictionaries from corpora
2.4.4 Generating dictionaries from multiple linguistic resources
2.5 Generating translations for phrases
2.6 Constructing WordNets
2.6.1 Constructing WordNets using the merge approach
2.7 Chapter summary
3 Input resources and evaluation methods
3.1 Introduction
3.2 Input bilingual dictionaries
3.3 Input WordNets
3.4 Evaluation method
3.5 Chapter summary
4 Creating reverse bilingual dictionaries
4.1 Introduction
4.2 Related work
4.3 Proposed approaches
4.3.1 Direct reversal (DR)
4.3.2 Direct reversal with distance (DRwD)
4.3.3 Direct reversal with similarity (DRwS)
4.3.4 Direct reversal with similarity and distance (DRwSD)
4.4 Experimental results
4.4.1 Preprocessing entries in the existing dictionaries
4.4.2 Results
4.5 Future work
4.6 Chapter summary
5 Creating new bilingual dictionaries
5.1 Introduction
5.3 Proposed approaches
5.3.1 Direct translation approach (DT)
5.3.2 Using publicly available WordNets as intermediate resources (IW)
5.4 Experimental results
5.4.1 Results and human evaluation
5.4.2 Comparing with existing approaches
5.4.3 Comparing with Google Translator
5.5 Future work
5.6 Chapter summary
6 Creating WordNets
6.1 Introduction
6.2 Related work
6.3 Proposed approaches
6.3.1 Generating synset candidates
6.3.1.1 The direct translation (DT) approach
6.3.1.2 Approach using intermediate WordNets (IW)
6.3.1.3 Approach using intermediate WordNets and a dictionary (IWND)
6.3.2 Ranking method
6.3.3 Selecting candidates based on ranks
6.4 Experiments
6.5 Future work
7 Generating translations for phrases using a bilingual dictionary and n-gram data
7.1 Introduction
7.2 Vietnamese morphology
7.3 Related work
7.4 Proposed approach
7.4.1 Segmenting Vietnamese words
7.4.2 Filtering segmentations
7.4.3 Generating ad hoc translations
7.4.4 Selecting the best ad hoc translation
7.4.5 Finding and ranking translation candidates
7.5 Experiments
7.6 Future work
7.7 Conclusion
8 Conclusions
References
Appendix A: Reverse dictionaries generated
Appendix B: New bilingual dictionaries created
TABLES
2.1 Languages mentioned and their ISO 639-3 codes
3.1 The number of entries in the input dictionaries
3.2 The number of synsets in WordNets
3.3 The average scores of entries in the input dictionaries
4.1 Words related to the word “south”, obtained from the Princeton WordNet
4.2 Reverse dictionaries created using the DR and DRwD approaches
4.3 Reverse dictionaries created using the DRwS approach
4.4 Reverse dictionaries created using the DRwSD approach
4.5 Examples of unknown words from the source dictionaries
4.6 Examples of bad translations from the source dictionaries
4.7 Reverse of reverse dictionaries generated
4.8 Some new entries, evaluated as excellent or good, in the reverse of reverse dictionaries
5.1 The average score and the number of lexical entries in the dictionaries created using the DT approach
5.2 The average score of lexical entries in the dictionaries we create using the IW approach
5.3 The number of lexical entries in the dictionaries we create using the IW approach
5.4 The average score of entries and the number of lexical entries in some other bilingual dictionaries constructed using 4 WordNets: PWN, FWN, JWN and WWN
5.5 Examples of entries, evaluated as excellent, in the new bilingual dictionaries we created
5.6 The number of lexical entries in some other dictionaries we create using the best approach
5.7 Examples of entries, not yet evaluated, in the new bilingual dictionaries we create
5.8 Some “unmatched” lexical entries
6.1 Different senses of the word “chair”
6.2 Synsets obtained from different WordNets and their translations in Vietnamese
6.3 Example of calculating the ranks of candidates in Arabic
6.4 Example of Case 2 to select candidates
6.6 The number of WordNet synsets we create using the IW approach
6.7 The number of WordNet synsets we create using the IWND approach
6.8 The number and the average score of WordNet synsets we create
7.1 Some examples of Vietnamese phrases and their translations
7.2 Some translations we create are correct but do not match with translations by the Google Translator
1 Sample entries in the English-Assamese reverse dictionary
2 Sample entries in the English-Vietnamese reverse dictionary
3 Sample entries in the English-Dimasa reverse dictionary
4 Sample entries in the English-Karbi reverse dictionary
5 Sample entries in the Assamese-Vietnamese and Assamese-Arabic dictionaries
6 Sample entries in the Assamese-German and Assamese-Spanish dictionaries
7 Sample entries in the Arabic-German and Arabic-Spanish dictionaries
8 Sample entries in the Vietnamese-German and Vietnamese-Spanish dictionaries
9 Sample entries in the Assamese WordNet synsets
FIGURES
1.1 “A new Vietnamese-English dictionary” compiled by William Peter Hyde [41]
2.1 A general method to create a new bilingual dictionary
2.2 An example of the lexical triangulated translation method
4.1 The idea behind the DR algorithm
4.2 The drawback of the DR algorithm
4.3 The idea behind the DRwD algorithm
4.4 The drawback of the DRwD algorithm
4.5 The idea behind the DRwS algorithm
4.6 The idea behind the DRwSD algorithm
5.1 An example of generating an entry for a Dimasa-Vietnamese dictionary using the DT approach
5.2 The IW approach for creating a new bilingual dictionary
5.3 Example of generating lexical entries for a Dimasa-Arabic dictionary using the IW approach
6.1 The DT approach to construct WordNet synsets in a target language T
6.2 The IW approach to construct WordNet synsets in a target language T
6.3 The IWND approach to construct WordNet synsets
CHAPTER 1 INTRODUCTION
1.1 Overview
The Ethnologue organization (http://www.ethnologue.com/), which compiles the most comprehensive catalogue of the languages of the world, lists 7,106 living languages. Half the world's population speaks the 13 most populous languages; the other half speaks the rest (http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers). Eighty languages, 1.2% of all languages, are spoken by 79.5% of the world's population, and 305 languages (5.5%) are spoken by 94.2% (http://www.ethnologue.com/statistics/size). One hundred languages are spoken by at least 7.4 million people each, the rest by fewer. 81.3% of the world's languages are spoken by fewer than a million people each. Many languages spoken by even tens of millions of people do not have official status, or have only (low) regional status, even within their own countries (http://en.wikipedia.org/wiki/List_of_languages_without_official_status). With so many languages spoken by so few, many languages do not have high political or economic status. In addition to the many that are isolated by inhospitable geography, most languages lack the resources needed to survive and thrive. These resources include books for infants and children, books for adults of various kinds, newspapers, magazines, monolingual dictionaries, bilingual dictionaries, thesauri, and, these days, electronic versions of these same resources. In contrast to resource-poor languages, resource-rich languages have better access to resources like dictionaries, thesauri, and ontologies, and possibly have plentiful text corpora as well. In truth, no language can be considered truly resource-rich in absolute terms, but we may consider a few languages (e.g., English, Spanish and Japanese) to be resource-rich in relative terms; researchers have
created many resources to facilitate various aspects of computational processing for such
languages. There are a few other languages that have a limited number of resources, but
can benefit from additional resources (e.g., Arabic and Vietnamese). Other languages have
very few resources, if any. Many other languages are becoming endangered, a state which
is likely to lead to their extinction, without determined intervention. Some endangered
languages are Chrau and Tai Daeng in Vietnam, Karbi and Dimasa in India, and Cherokee and Cheyenne in the United States.
We construct lexical resources necessary for the computational processing of natural languages in areas such as information retrieval, automatic word-sense disambiguation, computing document similarity, machine learning, and machine translation. Consider bilingual dictionaries, an essential tool for human language learners. Most existing (print or online) bilingual dictionaries are between two resource-rich languages (e.g., English-Spanish, Japanese-Chinese and French-German dictionaries), or between a resource-rich language and a resource-poor language (e.g., English-Assamese and English-Cherokee dictionaries). The powerful online machine translators (MT) developed by Google (https://translate.google.com/) and Bing provide pairwise translations for 80 and 50 languages, respectively. These systems also provide translations for single words and phrases. In spite of so much information for some “privileged” language pairs, there are many languages for which we are lucky to find even a single bilingual dictionary online or in print. For example, we can find an online Karbi-English dictionary and an English-Vietnamese dictionary, but we cannot find a Karbi-Vietnamese dictionary.
Another important resource that is very helpful in computational processing and in human
language learning is a thesaurus providing synonyms and antonyms of words. An enriched
thesaurus that provides additional relations among words in the computational context is
called a WordNet. An English version of such a WordNet has been produced over several
decades at Princeton University, and similar complete WordNets have also been produced
for a small number of additional languages (e.g., French, Hindi and Japanese WordNets).
Most such resources do not really exist for resource-poor and endangered languages.
This dissertation focuses on developing new techniques that leverage existing resources for resource-rich languages to build bilingual dictionaries and WordNets for languages, especially languages having very few resources. In addition, a phrase translation model using a bilingual dictionary augmented by n-gram data is also proposed to obtain translations for phrases that occur within these resources or even outside them. We believe that using approaches that are not language-specific to create computational lexical resources, some of which may be adapted to produce printed resources as well, may work in concert with other similar efforts to invigorate speakers, learners and users of these languages.
1.2 Types of lexical resources
According to Landau [58], a dictionary or a lexicon consists of a list of entries sorted by the lexical unit. Each entry usually contains a lexical unit, the definition associated with it, part-of-speech (POS), pronunciation, examples showing the uses of words, and possibly additional information. The lexical unit is usually a single word, whereas its definition is a single word, a multiword expression, or a phrase. A monolingual dictionary contains only one language, such as the Oxford English Dictionary. A bilingual dictionary consists of translations of words between two languages, such as “A Dictionary in Assamese and English” [18]. The monolingual dictionary is mainly used by native speakers for reading and understanding texts. The bilingual dictionary is used to understand the words in the source language [58], or to translate [84]. A bilingual dictionary can be unidirectional or bidirectional. A unidirectional dictionary contains translations from the source language to the target language, but the reverse translations are not provided. In contrast, a bidirectional dictionary consists of translations from the source language to the target language, and from the target language to the source language. Besides the obvious bilingual dictionaries that cover all words used generally in a language, one finds specific dictionaries such as a synonym dictionary (e.g., Merriam-Webster's Dictionary of Synonyms [73]), a dictionary focused on proper names (e.g., A Dictionary of Surnames [36]), or a dictionary focused on a narrow and specific area (e.g., Black's Law Dictionary [30], and Stedman's Medical Dictionary [113]). Figure 1.1 shows an example of a printed Vietnamese-English bilingual dictionary [41].
Figure 1.1: “A new Vietnamese-English dictionary” compiled by William Peter Hyde [41].
Kilgarriff [47] defines a thesaurus as a resource that groups words not “in alphabetical order as they are in a dictionary, but according to the ideas which they express”. In particular, according to Soergel [111], a thesaurus contains a set of descriptors, an indexing language, a classification scheme, or a system vocabulary. A thesaurus also consists of relationships among descriptors. Each descriptor is a term, a notation, or another string of symbols used to designate the concept. Examples of thesauri are Roget's International Thesaurus [98], OpenThesaurus (http://www.openthesaurus.de/), and a large online English thesaurus simply called thesaurus.com.
Miller [75] introduces WordNet, which is a large lexical database where nouns, verbs, adjectives, and adverbs are grouped into unordered sets of cognitive synonyms, the so-called synsets. Each synset expresses a distinct concept. A WordNet is both an enriched dictionary and a thesaurus. Given a lexical unit, a general dictionary and a WordNet both return definitions, POSes and examples. For the lexical unit, the dictionary mainly contains single words, while the WordNet can also contain short phrases such as “tabular array”, “scholarly person”, and “grape vine”. Given a concept, the WordNet and the thesaurus return terms which fit the concept. The words in WordNet synsets are disambiguated in terms of senses. The relationships between words (such as hypernymy or generalization, hyponymy or particularization, and meronymy or part-whole relationships) in the WordNet are labeled. Currently, the biggest WordNet is the Princeton WordNet (http://wordnet.princeton.edu/) version 3.0, which has 117,659 synsets, including 82,115 noun synsets, 13,767 verb synsets, 18,156 adjective synsets, and 3,621 adverb synsets. Some other WordNets are the FinnWordNet [66], the Japanese WordNet [43], and the EuroWordNet [122]. The AsianWordNet (AWN) provides a platform for building and sharing WordNets among Asian languages (viz., Bengali, Hindi, Indonesian, Japanese, Korean, Lao, Mongolian, Burmese, Nepali, Sinhala, Sundanese, Thai, and Vietnamese).
Unfortunately, the progress of the WordNets in AWN is extremely slow, and they are far from being finished.
Schmidt and Wörner [105] define parallel corpora as “collections of written texts and their translations into one or more languages, edited and aligned for the purpose of linguistic analysis”. Zanettin [125] describes a comparable corpus as consisting of “texts in the languages involved, which share similar criteria of composition, genre and topic”. A corpus containing only one language is called a monolingual corpus, such as the British National Corpus (http://www.natcorp.ox.ac.uk/) and the Brown Corpus (http://clu.uni.no/icame/brown/bcm.html). A bilingual corpus involves two languages, such as the English-Vietnamese Bilingual Corpus (EVbcorpus) [83], while a multilingual corpus consists of three or more languages, such as the International Cambridge Language Survey.
1.3 Research focus and contribution
The dissertation concentrates on automatically constructing multilingual lexical resources, especially bilingual dictionaries and WordNets, for several natural languages. We also introduce a novel method to translate a given phrase in a source language to a target language. The languages we focus on are the following.
- Languages that are widely spoken but have limited computational resources, such as Arabic and Vietnamese.
- A language, Assamese, that is spoken by tens of millions of people in northeast India but has almost no resources.
- Languages that are in the UNESCO Endangered Languages list (http://www.unesco.org/new/en/culture/themes/endangered-languages/), such as Cherokee, Cheyenne, Dimasa and Karbi.
We note that Cherokee (http://en.wikipedia.org/wiki/Cherokee_language) is an Iroquoian language spoken by 13,500 Cherokee people in Oklahoma and North Carolina. Cheyenne (http://en.wikipedia.org/wiki/Cheyenne_language) is a Native American language spoken by 2,100 Cheyenne people in Montana and Oklahoma. Dimasa (http://en.wikipedia.org/wiki/Dimasa_language) and Karbi (http://en.wikipedia.org/wiki/Karbi_language) are spoken by 110,000 and 420,000 people, respectively, in India. Assamese (http://en.wikipedia.org/wiki/Assamese_language) is an Indo-European language spoken by about 16 million people and is resource-poor. Vietnamese (http://en.wikipedia.org/wiki/Vietnamese_language) is an Austroasiatic language spoken by 75 million people in Vietnam and the Vietnamese diaspora, whereas Arabic (http://en.wikipedia.org/wiki/Arabic_language) is an Afro-Asiatic language spoken by 290 million people in the countries of the Arab League.
First, we focus on creating reverse bilingual dictionaries. Published methods for automatically creating new dictionaries from existing dictionaries use intermediate dictionaries. Unfortunately, for many resource-poor languages we are lucky to find even a single bilingual dictionary online or in software form. So, our first effort to increase the lexical resources for a language under consideration is to investigate the creation of a reverse dictionary from an existing dictionary, if we can find one. To remove ambiguous entries and increase the number of entries in the created dictionaries, WordNets of resource-rich languages will be used to compute similarities between words or phrases. Of course, a new reverse dictionary is associated with the same two languages as the original dictionary that is reversed.
Our next effort at increasing lexical resources will be to create bilingual dictionaries for language pairs for which such dictionaries do not exist. We will create dictionaries from resource-poor languages to several other languages by exploiting publicly available WordNets, bilingual dictionaries, and the dictionaries we create in the first task. Resource-rich languages will provide the pivots for such translations. In general, if a word b (which may be polysemous) in language B is translated into a word a in language A and a word c in language C, we cannot necessarily conclude that a is a translation of c because of their association with b. Hence, statistical techniques and WordNets are used to remove ambiguous entries.
WordNets are among the most heavily used lexical resources. We develop algorithms
and models to automatically build WordNets for languages using available resources, but
also by bootstrapping with resources we create ourselves. If we can create a number of
WordNets of acceptable quality, we believe it will contribute significantly to the repository
of resources for languages that lack them.
A problem we have encountered in our previous tasks is that quite often a dictionary entry has a sense that is given in terms of a sequence of words or a phrase. When we reverse a bilingual dictionary or create bilingual dictionaries for new language pairs, so far we have ignored such sense entries, since we do not know how to translate a phrase into the target language. Jackendoff [44, page 156] estimated that the number of multiword expressions or phrases in a person's vocabulary is of the same order as the number of single words. In addition, Sag et al. [100] found that 41% of the words in WordNet 1.7 are multiword expressions. In the last research task, we develop a model that translates phrases in a given source language to a target language using a dictionary-based approach and n-gram data, generating translations for phrases occurring both inside and outside existing bilingual dictionaries.
1.4 Intellectual and scientific merit
This dissertation will present several novel approaches, from simple to complex, for automatically generating bilingual dictionaries and WordNets. We will also compare our proposed methods against existing methods to find positive and negative points of difference, and the reasons for the drawbacks. In addition, most existing research works with languages that have some available lexical resources, each of which is expensive to construct. Using many intermediate lexical resources for creating a new one may cause ambiguity in the lexical resource created. The approaches we propose have the potential not only to create new lexical resources from just a few existing ones, which reduces the cost and time consumed, but also to improve the quality of the lexical resources we create.
Briefly, to be able to automatically create many lexical resources for languages, especially resource-poor and endangered ones, we need processes that do not require many resources to begin with, presenting challenging problems for the computational linguist. Our research will make substantial progress on these problems by bootstrapping and leveraging WordNets and dictionaries for resource-rich languages.
1.5 Broader impact
The goal of this dissertation is to study the feasibility of creating multilingual lexical resources for languages by bootstrapping from a few existing resources. Our research has the potential not only to construct new lexical resources, but also to support communities using these languages.
1.6 Organization of the dissertation
The thesis is organized as follows. Existing approaches for constructing new bilingual dictionaries and WordNets for languages, and for generating phrase translations, are presented in Chapter 2. Chapter 3 introduces the notations, the input resources used, and the methods to evaluate the resources we create. Chapter 4 and Chapter 5 propose methods to create reverse bilingual dictionaries and new bilingual dictionaries, respectively. Approaches to construct WordNet synsets for many languages are proposed in Chapter 6. In Chapter 7, we present algorithms to generate translations for phrases, with a case study on translating from Vietnamese to English. Future work is discussed at the end of each chapter. Chapter 8 concludes the thesis.
Acknowledgment
A synopsis of this dissertation is presented in the paper “Automatically creating multilingual lexical resources” in the Proceedings of the Doctoral Consortium at the 28th
CHAPTER 2 RELATED WORK
2.1 Introduction
Understanding existing approaches to create new bilingual dictionaries, to generate translations for phrases, and to construct WordNets provides the background knowledge required to develop techniques to solve the problems discussed in this dissertation. In this chapter, we summarize and discuss related work on building the relevant lexical resources. The remainder of this chapter is organized as follows. In Section 2.2, we describe the structure of lexical resources. Section 2.3 gives the ISO 639-3 codes of the languages mentioned in this dissertation. Specific approaches to generate dictionaries, translations for phrases, and WordNets from different linguistic resources are presented in Section 2.4, Section 2.5 and Section 2.6, respectively. Section 2.7 summarizes the chapter.
2.2 Structure of lexical resources
This thesis proposes approaches to automatically construct bilingual dictionaries and
WordNets. Therefore, this section presents the structure of bilingual dictionaries and
WordNets, focusing on the Princeton WordNet.
2.2.1 Structure of a bilingual dictionary
For notational purposes, we assume that a bilingual dictionary Dict(A,B) contains entries of word or phrase translations from the source language A to the target language B, whereas a dictionary Dict(B,A) translates words or phrases in language B to words or phrases in language A. In particular, Dict(A,B) contains entries (a,b) whereas Dict(B,A) contains entries (b,a).
A dictionary entry, called a LexicalEntry, is a 2-tuple <LexicalUnit, Definition>. Here, LexicalUnit is the word or phrase being defined, also called the definiendum more formally, based on Aristotle's analysis [58]. Usually, a LexicalUnit is lemmatized (i.e., reduced to a representative or citation form, such as infinitives for verbs), but not always. A list of entries sorted by the LexicalUnit is called a lexicon or a dictionary. Given a LexicalUnit, the Definition associated with it usually contains its class (e.g., part-of-speech (POS)) and pronunciation, its meaning, and possibly additional information, including usage. The meaning associated with it can have several Senses. A Sense is a discrete representation of a single aspect of the meaning of a word. Thus, a dictionary entry is of the form <LexicalUnit, Sense1, Sense2, · · · >.
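For concreteness, the entry structure just described can be modeled as a small data type. The sketch below is ours, not part of the dissertation's implementation; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sense:
    pos: str                          # word class, e.g., POS "noun"
    meaning: str                      # one discrete aspect of the meaning
    examples: List[str] = field(default_factory=list)

@dataclass
class LexicalEntry:
    lexical_unit: str                 # the word or phrase being defined
    pronunciation: str = ""
    senses: List[Sense] = field(default_factory=list)

# A polysemous LexicalUnit carries several Senses:
entry = LexicalEntry(
    lexical_unit="bank",
    senses=[Sense("noun", "financial institution"),
            Sense("noun", "sloping land beside a river")])
```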
2.2.2 Structure of the Princeton WordNet
The main relation between words in a WordNet is synonymy. A synset contains one
or many words. A polysemous word is assigned to many synsets. Each synset has one gloss,
which is a brief definition of the concept, along with sentences showing the use of words
in the synset. The WordNet 2.1 overview by Marin Dantchev [26] says that each synset
is linked to other synsets by numerous conceptual relations. The rest of this section will
discuss the synsets from the four syntactic categories: nouns, adjectives, adverbs and verbs.
The Princeton WordNet version 3.0 has 117,798 nouns in 82,115 synsets. The noun synsets are organized into hierarchies. WordNet distinguishes types and instances in noun synsets [29]. Types contain common nouns such as “location”, “president” and “car”, whereas instances are proper nouns denoting unique entities; instances are always leaves of trees, or terminal nodes, in the hierarchy. The relations among noun synsets are super-subordinate relations (viz., hypernymy and hyponymy), part-whole relations (viz., meronymy and holonymy) and antonymy.
- Hypernymy is a semantic relation that links a more specific word to a more general word. For example, the hypernym set of the word “dog” is {canine, canid}.
- Hyponymy links a more general word to more specific words. The hyponym set of the word “canid” is {bitch, dog, wolf, jackal, hyena, hyaena, fox}. Hyponymy is transitive. For example, the word “dog” represents a kind of “canine”, which represents a kind of “carnivore”; so “dog” represents a kind of “carnivore”.
- Meronymy links synsets denoting parts to synsets denoting the whole. In particular,
if a word a is a meronym of a word b, a is one part of b. For example, the words
{back, backrest, leg} are meronyms of the word “chair”. The inverse of meronymy is
holonymy. Therefore, the word “chair” is the holonym of {back, backrest, leg}.
- Antonymy expresses the relation between two opposite nouns. For instance, the word
“woman” is an antonym of the word “man”.
The current WordNet contains 21,479 adjectives organized into 18,156 synsets. Adjective synsets are classified into two categories: descriptive adjectives and relational adjectives. The main relation among descriptive adjectives is antonymy, e.g., the antonym of the word “short” is {long}. Adjective synsets are organized into bipolar clusters where words similar to one adjective are grouped with all words similar to its antonym [26]. The relation in relational adjectives is pertainym, which points to the nouns they are derived from, e.g., the adjective “criminal” pertains to the noun “crime”.
There are 3,748 adverbs in 733 synsets. Adverbs in WordNet are usually derived from adjectives via morphological affixation, such as “strongly”, “shortly” and “rarely”. The relations among adverb synsets are synonymy and, sometimes, antonymy.
WordNet contains 6,277 verbs with 5,252 synsets. Verb synsets are also organized
into hierarchies. The common relations between verb synsets are troponymy, entailment,
and the cause relation.
- Troponymy holds when the activity of one verb is doing the activity of another verb in some particular manner. For example, the verb “run” is a troponym of the verb “walk”.
- Entailment occurs when the event of one verb logically requires the event of another verb. For instance, the verb “divorce” entails the verb “marry”.
- The cause relation relates one verb, which is causative, to another, which is resultative. For example, the verb “show” and the verb “see” have a cause relation between them.
Another widely used term is Common Base Concepts, first introduced in building the EuroWordNet [96]. A concept is important if it is widely used. In the EuroWordNet, the Common Base Concepts are classified using a Top Ontology. The Top Ontology is divided into three categories, named 1stOrderEntities, 2ndOrderEntities, and 3rdOrderEntities.
- The 1stOrderEntities contain concrete synsets which are specified for four roles, viz., “origin”, “form”, “composition” and “function”. For example, vehicle is classified as Artifact (Origin) + Object (Form) + Vehicle (Function). The 1stOrderEntities are always nouns.
- The 2ndOrderEntities include synsets which are located in time, and occur or take place rather than exist, e.g., “continue”, “occur” and “play”. The 2ndOrderEntities can be nouns, verbs and non-dynamic adjectives.
- The 3rdOrderEntities consist of synsets which exist independently of time and space. They can be true or false rather than real, e.g., “idea”, “thought”, “information” and “plan”. The 3rdOrderEntities are always nouns.
2.3 Language codes
In this thesis, we use the names of languages and their ISO 639-3 codes interchangeably. The ISO 639-3 codes of the languages mentioned, including in the discussion of related work and in our experiments, are presented in Table 2.1.
Table 2.1: Languages mentioned and their ISO 639-3 codes
Language Code Language Code Language Code Language Code
Arabic arb Assamese asm Bengali ben Cherokee chr
Cheyenne chy Chinese cht Croatian hrv Dimasa dis
Dutch nld English eng French fra Finnish fin
Galician glg German deu Hindi hin Hungarian hun
Indonesian ind Japanese jpn Karbi ajz Korean kor
Italian ita Lithuanian lit Malay zlm Thai tha
2.4 Creating new bilingual dictionaries
To construct a new bilingual dictionary, we may use diverse available resources such as existing dictionaries, thesauri, corpora or WordNets. Whatever resources are used, there are two main steps to create a new bilingual dictionary. First, translation candidates are extracted from the resources used (e.g., dictionaries, thesauri or corpora). Second, heuristic algorithms or statistical information is used to disambiguate and to rank the translation candidates. The general method for constructing a new bilingual dictionary is presented in Figure 2.1. The approaches we discuss in the next subsections all fit within this general architecture.
Figure 2.1: A general method to create a new bilingual dictionary.
Human evaluation is the first choice for evaluating the quality of a new dictionary. However, it is really hard to find volunteers familiar with both languages of a dictionary Dict(A,B) we may create, such as Assamese-Vietnamese or Cherokee-Karbi. Researchers have therefore evaluated their approaches by generating a dictionary for another language pair Dict(C,D) such that there exists at least one published good-quality dictionary Dict*(C,D), which is used as a gold standard to compute precision, recall, or F-score for Dict(C,D). The precision value is the percentage of entries in the new dictionary Dict(C,D) that match entries in the existing dictionary Dict*(C,D). The recall is the percentage of entries in Dict*(C,D) that also exist in Dict(C,D). We consider the terms accuracy and precision of a dictionary to be synonymous.
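Under this scheme, precision and recall against the gold-standard dictionary Dict*(C,D) reduce to set operations. A minimal sketch, under our own assumption that each dictionary is represented as a set of (source word, translation) pairs:

```python
def precision_recall(created, gold):
    """created, gold: sets of (source_word, translation) pairs."""
    matched = created & gold
    precision = len(matched) / len(created) if created else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    return precision, recall

# Toy example with hypothetical Spanish-English pairs:
p, r = precision_recall({("casa", "house"), ("perro", "cat")},
                        {("casa", "house"), ("perro", "dog")})
print(p, r)  # 0.5 0.5
```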
2.4.1 Generating bilingual dictionaries using one intermediate language
A basic approach to create a new dictionary and handle ambiguities is a pivot-based method that uses inverse consultation, introduced by Tanaka and Umemura [115]. They generate a Japanese-French dictionary Dict(jpn,fra) and a French-Japanese dictionary Dict(fra,jpn) from a Japanese-English harmonized dictionary, Dict_hm(jpn,eng), and an English-French harmonized dictionary, Dict_hm(eng,fra). A harmonized dictionary Dict_hm(A,B) is a symmetrical dictionary created by integrating two unidirectional dictionaries Dict(A,B) and Dict(B,A). In the one-time inverse consultation method, for each given word in the source language, Japanese, they find a translation chain jpn → eng1 → fra → eng2, and then count the number of matches between eng1 and eng2, where eng1 and eng2 are the two sets of words obtained by translation as shown by the arrows. The greater the number of matches, the better the translation candidate. Similarly, in two-time inverse consultation, for each given Japanese word jpn1, they experiment with the translation chain jpn1 → eng → fra → eng → jpn2, and then count the number of matches between the input Japanese word and the returned Japanese words. For evaluation, Tanaka and Umemura [115] randomly select 100 entries from each of the dictionaries they create, Dict(jpn,fra) and Dict(fra,jpn), and evaluate them both manually and by matching against existing dictionaries; the accuracies for manual evaluation and automatic matching are 56% and 58%, respectively.
Shirai et al. [109], and Shirai and Yamamoto [108] conclude that the inverse consultation approach does not resolve the word-sense disambiguation (WSD) problem well. In addition, differences in the linguistic natures of languages, such as Japanese and English, affect the content of the harmonized dictionaries. The authors introduce methods to improve the quality of dictionaries created using inverse consultation. Shirai and Yamamoto [108] generate translation candidates from Korean to Japanese using one-time inverse consultation from two dictionaries: Korean-English and Japanese-English. Then, the degree of similarity between words is used to select correct translations. Given a word wK in the source language (Korean) and a word wJ in the target language (Japanese), the degree of similarity between wK and wJ is the normalized number of common translations of these words in the intermediate language (English):

degree\_of\_similarity(w_K, w_J) = \frac{2 \cdot |common(E_{w_K}, E_{w_J})|}{|E_{w_K}| + |E_{w_J}|},  (2.1)

where E_{w_K} and E_{w_J} are the sets of English translations of wK and wJ, respectively. For evaluation, they randomly select 1,000 Korean words from a published Korean-Japanese dictionary, and then create Japanese translations for these Korean words using their approach. They evaluate their translations against the translations in the published dictionary. The accuracy of their translations is 72% when the degree of similarity is equal to or greater than 0.8.
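Equation 2.1 is a Dice-style coefficient over the two English translation sets, transcribed directly below (the 0.8 acceptance threshold is the one reported above; the example sets are hypothetical):

```python
def degree_of_similarity(e_wk, e_wj):
    """Eq. 2.1: Dice coefficient of the English translation sets of wK and wJ."""
    if not e_wk and not e_wj:
        return 0.0
    return 2 * len(e_wk & e_wj) / (len(e_wk) + len(e_wj))

# 2 * 2 / (2 + 3) = 0.8, which meets the acceptance threshold.
print(degree_of_similarity({"school", "academy"},
                           {"school", "academy", "institute"}))
```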
Zhang et al. [126] create a Japanese-Chinese dictionary from Japanese-English and English-Chinese dictionaries using one-time inverse consultation. To rank candidates, they compute a penalty value for each candidate pair; the smaller the penalty value, the better the translation:

penalty(w_J, w_C) = k_1 \cdot F_1(w_J, w_C) - k_2 \cdot F_2(w_J, w_C),  (2.2)

where k_1 and k_2 are weights set based on preliminary experiments, F_1 is the similarity in POS between a Japanese word wJ and a Chinese word wC, and F_2 is the one-time inverse consultation score of that pair. 172 Japanese words were randomly selected for human evaluation, to be marked either “correct” or “wrong”. The accuracy of their best dictionary is 70.12%.
According to Shirai et al. [109], selecting correct translations among many translation candidates produced using two-time inverse consultation is a challenge. Starting with a Korean-English dictionary and an English-Japanese dictionary, Shirai et al. [109] use the two-time inverse consultation method to generate Korean-Japanese candidates; then, they look for overlaps to limit the number of translation candidates. They evaluate their translations by comparing with a published Korean-Japanese dictionary. The precision of their dictionary is 85.7%, while the recall is 35%.
Paik et al. [92] experiment with different input bilingual dictionaries and take directionality into account in creating new Korean-Japanese dictionaries with different accuracies. First, given a Korean-English dictionary Dict(kor,eng) and a Japanese-English dictionary Dict(jpn,eng), the one-time inverse consultation method is used. According to their experiments, the more similar the source and target languages are, the more correct the translations are. The same approach with several pivot languages is also used by Paik et al. [91]. Their second experiment computes the overlapping constraints of translation candidates created from Dict(kor,eng) and Dict(eng,jpn). A candidate with a high overlap similarity score is likely to be a correct translation:

overlap\_similarity\_score(w_J, w_K) = |w_J|, \quad w_J \in J(E(w_K)),  (2.3)

where E(w_K) is the set of English translations of a Korean word wK, and J(E) is the set of Japanese translations of the English words in E. This method can increase the number of entries in the newly created dictionaries significantly. However, many ambiguous entries are created in the new dictionaries due to the presence of polysemous words in the pivot language. Finally, a new dictionary is created from Dict(eng,kor) and Dict(eng,jpn). The candidates whose similarity scores are greater than a threshold are added to the new dictionary. The similarity score for wJ and wK is computed as below:

similarity\_score(w_J, w_K) = \frac{|K(E(w_K) \cap E(w_J))| + |J(E(w_K) \cap E(w_J))|}{|E(w_K) \cap E(w_J)|}.  (2.4)

Paik et al. [92] claim that it is appropriate to construct a new dictionary Dict(A,C) using two bilingual dictionaries Dict(A,B) and Dict(C,B) when A and C are very similar.
The pivot-based method is also used by Sjöbergh [110] to create a new Japanese-Swedish dictionary Dict(jpn,swe) from a Japanese-English dictionary Dict(jpn,eng) and a Swedish-English dictionary Dict(swe,eng). After removing English stop words from the existing dictionaries, each English word wE is assigned a weight, calculated by an idf-like measure:

weight(w_E) = \log \frac{|Dict(swe,eng)| + |Dict(jpn,eng)|}{|Dict(swe,eng)_{w_E}| + |Dict(jpn,eng)_{w_E}|},  (2.5)

where |Dict(A,B)| is the number of entries in the dictionary, and |Dict(A,B)_{w_E}| is the number of descriptions in the dictionary containing the word wE. Then, they match English words in the two existing dictionaries and score the matches as follows:

score = \frac{2 \sum_{a} weight(w_E)}{\sum_{w_{E1}} weight(w_{E1}) + \sum_{w_{E2}} weight(w_{E2})},  (2.6)

where a ∈ Dict(swe,eng) ∩ Dict(jpn,eng), w_{E1} ∈ Dict(swe,eng), and w_{E2} ∈ Dict(jpn,eng). A better translation has a higher score. For multiword expressions that have no translation in the target language, the concatenations of the translations of their single words in the target language are accepted as correct translations. Volunteers were asked to evaluate 300 words using a 5-point scale: all correct, majority correct, some correct, similar (which means the translation is not correct, but close to being correct), and wrong. The accuracies of the translations are 75% all correct for scores greater than 0.9, and 89% all correct for scores equal to 1.0.
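A sketch of this weighting and matching (Eqs. 2.5 and 2.6) under our own representation: each dictionary maps a headword to its English definition, given as a set of content words, and `weight` is the idf-like weight computed over both dictionaries.

```python
import math

def idf_weight(w, defs_swe, defs_jpn):
    """Eq. 2.5: idf-like weight of an English word over both dictionaries."""
    n_total = len(defs_swe) + len(defs_jpn)
    n_with_w = (sum(1 for d in defs_swe.values() if w in d)
                + sum(1 for d in defs_jpn.values() if w in d))
    return math.log(n_total / n_with_w) if n_with_w else 0.0

def match_score(def_swe, def_jpn, weight):
    """Eq. 2.6: weighted overlap of two English definitions (higher is better)."""
    denom = (sum(weight(w) for w in def_swe)
             + sum(weight(w) for w in def_jpn))
    shared = def_swe & def_jpn
    return 2 * sum(weight(w) for w in shared) / denom if denom else 0.0

# usage: match_score(d1, d2, lambda w: idf_weight(w, defs_swe, defs_jpn))
```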
2.4.2 Generating bilingual dictionaries using many intermediate languages
To increase the precision of new dictionaries, one can construct new bilingual dictionaries using transitivity with two or more pivot languages. Gollins and Sanderson [32] introduce a triangulated translation method for improving cross-language information retrieval. To create a translation in the target language B of a word a in the source language A, they translate a into two intermediate languages C and D to generate words c and d, respectively. Then, they translate c and d into the target language B and merge the results in different ways. Adding one more intermediate language to the triangulated translation method produces “three-way” triangulated translation. Their experiments are with European languages that are covered by the EuroWordNet [122]. They select words in a source language, create translations in a target language, and evaluate by comparing their translations with the translations obtained from the EuroWordNet. According to Gollins and Sanderson, triangulated translation outperforms the transitive method by over 55% on the accuracy metric, because it helps reduce the ambiguity introduced by polysemous words in the triangulated scheme. The addition of pseudo-relevance feedback [6] as pre-translation to triangulated translation improves the precision of translations. An example of the triangulated translation method applied to a non-European language, with English and French as pivots, to create entries for a new dictionary is shown in Figure 2.2. The Hindi word “vasant” is translated into English and French. Then, the resulting words in the intermediate languages are translated into Vietnamese in order to generate translation candidate sets. The correct translations of this Hindi word in Vietnamese are the words that survive after applying different merge strategies to the translation candidate sets. As a result, the translation of “vasant” is “mùa xuân”.
Figure 2.2: An example of the lexical triangulated translation method
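The merge step in Figure 2.2 can be as simple as intersecting the candidate sets obtained through the two pivots; stricter or looser merge strategies vary this set operation. A minimal sketch with hypothetical dictionary inputs:

```python
def triangulate(source_word, src_to_pivot1, src_to_pivot2,
                pivot1_to_tgt, pivot2_to_tgt):
    """Keep only target candidates reachable through BOTH pivot languages."""
    via1 = {t for p in src_to_pivot1.get(source_word, set())
              for t in pivot1_to_tgt.get(p, set())}
    via2 = {t for p in src_to_pivot2.get(source_word, set())
              for t in pivot2_to_tgt.get(p, set())}
    return via1 & via2

# e.g., Hindi "vasant" via English and French pivots -> {"mùa xuân"}
```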
Bond et al. [14], and Bond and Ogura [13] create new dictionaries via one or more pivots. Created entries are ranked in different ways, such as using the one-time inverse consultation score introduced by Tanaka and Umemura [115], or a semantic matching score, which is the number of times the semantic classes of a_i and c_j match, mainly focusing on nouns. Samples of random words in the source language and their translations in the target language are selected for evaluation by lexicographers. The evaluation of entries in the Japanese-Malay dictionary they created from a Japanese-English dictionary and an English-Malay dictionary revealed problems caused by homonyms; to handle homonyms, they use two intermediate languages: English and Chinese. Using the two intermediate languages, 97% of the entries in the new dictionary become acceptable, but the number of entries decreases significantly, from 75,872 to 5,238.
A link structure, introduced by Ahn and Frampton [2], is also used to handle ambiguous translations. The central idea is that if (i) a word a in a source language A is translated to a word b in an intermediate language B, which is translated to a word c in a target language C, and (ii) conversely, the word c is translated to the word b, which is translated back to the word a, then the word c is a correct translation of the word a. The problem with this method is the presence of polysemous words in the intermediate languages. Ahn and Frampton ameliorate the effect of polysemous words in the following manner. They find all words b_k which are translations of each word c_i; then, they find all translations a_j of each word b_k. The words a_j which are the same as the source word a are selected. Finally, they retrace the path to get the words c_i, which are correct translations of the word a. The newly created dictionary, a Spanish-German dictionary, covers 78.4% of the entries in an existing dictionary that was created manually. Issues affecting their results include the observation that the manually generated dictionary does not contain many entries created using their approach, that the sizes of the input dictionaries are limited, and that different font encodings in the input dictionaries mess up their results.
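The cycle check at the core of this link structure, as a sketch under our own dictionary representation: a target word c is accepted as a translation of a only if some pivot word b links a to c in both directions.

```python
def linked_translations(a, ab, bc, cb, ba):
    """Accept c when a -> b -> c holds and c -> b -> a retraces the path."""
    accepted = set()
    for b in ab.get(a, set()):      # forward: a -> b
        for c in bc.get(b, set()):  # forward: b -> c
            if b in cb.get(c, set()) and a in ba.get(b, set()):
                accepted.add(c)     # backward path c -> b -> a exists
    return accepted
```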
A well-known effort to construct many new bilingual dictionaries is by Mausam et al. [69]. They report several algorithms for creating dictionaries using probabilistic inference. They extract entries from multiple dictionaries of multiple language pairs using the concept of a translation graph, in which each vertex represents a word in a language and an edge connecting two vertices represents a belief that the two vertices share a sense. The Transgraph algorithm computes an equivalence score that two words in a translation graph share the same sense. If this score is greater than a threshold, the two words in the two distinct languages are accepted as sharing the same sense. The main idea behind the Unpruned SenseUniformPaths (uSP) algorithm is that two vertices share the same sense if there exists at least one translation circuit, found by using a random walk and choosing random edges without duplicating vertices on the path from the source word to the target word. However, the uSP algorithm suffers from errors that occur in processing the source dictionaries to generate the translation graph, and from correlated sense shifts in translation circuits. The SenseUniformPaths (SP) algorithm solves uSP's problems by pruning paths whose vertices enter an ambiguity set twice. An ambiguity set is a set of nodes sharing more than one sense. Their best algorithm is the SP algorithm at precision 0.90, producing 4.5 times as many translations as the dictionaries supported by the Wiktionary, and 73% more translations than the other source dictionary translations.
2.4.3 Extracting bilingual dictionaries from corpora
If languages A and C have substantial corpora of documents that are readily available, researchers have attempted to derive translations between A and C using several methods. This subsection presents a variety of approaches for extracting translations from parallel corpora, bi-texts, comparable corpora and monolingual corpora.
Brown [19] derives bilingual lexicons from a Spanish-English parallel corpus containing 685,000 sentence pairs. They construct a correspondence table based on symmetric co-occurrence ratios and asymmetric co-occurrence ratios among words to show the existence of word or phrase translations within sentence pairs. Two thresholds, one symmetric and one asymmetric, are set up through experiments to handle ambiguous candidates and coincidental co-occurrences. The value of each cell in the table is between 0.0 and 1.0. Elements in the table with values greater than 0.0 are added to the new bilingual dictionary. The best dictionary they extracted used a fixed threshold of 1.0 and consisted of 14,446 entries (covering 15% of the vocabulary in the corpus), with the lowest error rate at 29%.
If a language pair does not have a parallel corpus, but there are some directly translated texts from one language to the other, or texts translated into both languages from an intermediate language, researchers may be able to construct a parallel corpus using the intermediate language as a pivot. Then, a new dictionary can be extracted from the generated parallel corpora. For example, Héja [38] collects texts translated into the source languages (Lithuanian and Slovenian) and the target language (Hungarian) from an intermediate language (English) to construct parallel corpora (Lithuanian-Hungarian and Slovenian-Hungarian). In the corpora he creates, sentences in one language might be combined or split into many sentences in another language because they are not perfect direct translations. Hence, translation units, instead of sentences, are used to measure the sizes of these corpora. The Lithuanian-Hungarian corpus contains 147,158 translation units, whereas the Slovenian-Hungarian corpus consists of 38,574 translation units. Then, GIZA++ [86] is used to compute translation probabilities for every translation candidate and perform word alignment. Héja also calculates the frequencies of words in the source and target languages. A translation candidate is added to the new bilingual dictionary if its translation probability, its frequency in the source language, and its frequency in the target language are all higher than certain thresholds. From experiments, he finds that a candidate with a low translation probability but high frequency can be a good translation, and that the number of entries in the newly created dictionaries strongly depends on the size of the corpora. He derives approximately 5,000 and 4,000 translation candidates that satisfy all three threshold requirements from the Slovenian-Hungarian and Lithuanian-Hungarian corpora, respectively. 863 extracted translations were evaluated manually. The highest proportion of “useful” translation pairs in the new dictionaries is 97.2%, for translation probabilities from 0.7 to 1.0.
If a language pair (A,C) has only a very small amount of bi-texts, but there exists a third language B such that B is related to, and has a large parallel corpus or bi-texts with, A or C, researchers might be able to construct bilingual lexicons for A and C from the available resources based on transliterations and cognates. The CLDR project defines transliteration as “the general process of converting characters from one script to another, where the result is roughly phonetic for languages in the target script”. For example, “Niu Di-lân” is a transliteration of “New Zealand” in Vietnamese. According to Molina [76], “cognates are words descended from a common ancestor; that is, words having the same linguistic family or derivation”. Some examples of cognates in English and Spanish are “family” - “familia”, “elephant” - “elefante”, and “gorilla” - “gorila”. Nakov and Ng [80] concatenate the two bi-texts, align words, and then extract cognates. One of their main experiments is to extract translations from Spanish to English from the bi-texts of Portuguese-English and Spanish-English, where they consider Portuguese a language closely related to Spanish. They extract cognates based on the translation probabilities of words from Portuguese to Spanish using English as a pivot, and on orthographic similarities using the longest common subsequence ratio (LCSR) [71], calculated by dividing the length of the longest common subsequence by the length of the longer word. The LCSR threshold is set to greater than or equal to 0.58. Then, they estimate the translations using the competitive linking algorithm [72]. Cognates are extracted from a training dataset, then used in training on the same dataset to transform words from Portuguese to Spanish. The Bleu score of their translations is 3.37. In addition, they claim that their approach achieves better results than methods using parallel corpora and pivot languages.
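The longest common subsequence ratio used for cognate extraction, as a self-contained sketch (the 0.58 cutoff is the threshold reported above):

```python
def lcsr(w1, w2):
    """LCSR: longest common subsequence length / length of the longer word."""
    m, n = len(w1), len(w2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # classic LCS dynamic program
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if w1[i] == w2[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n)

print(lcsr("elephant", "elefante"))  # 0.75, above the 0.58 cognate cutoff
```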
Ljubešić and Fišer [67] extract a Croatian-Slovene dictionary from a comparable corpus of news articles. Initially, a seed dictionary, with 33,495 entries, is created by detecting words that are identically spelled in both languages and also have the same POS in both languages. The similarity between the two languages is high, since the average cosine similarity between corresponding 3-grams picked from the corpus is 74%. The average precision of their seed dictionary is 72%, as computed by manual evaluation. The first dictionary is created by expanding the seed dictionary with cognates found using a modified LCSR algorithm named BI-SIM [55]. The second dictionary is generated by adding to the seed dictionary the first set of translation candidates with a frequency of at least 200. They evaluate the dictionaries they create by comparing against a hand-created gold standard with 500 entries. Their first dictionary consists of 34,823 entries with a precision of 68.5%, whereas the second dictionary has 34,817 entries with a precision of 71.4%. According to Ljubešić and Fišer, simply considering the first translation candidates as correct translations is very effective.
Given a Chinese-English dictionary Dict(cht,eng), Shao and Ng [106] extract new
translations from a Chinese-English comparable corpus using both context and
translit-eration information. The existing Chinese-English dictionary they use has about 10,000
entries. The size of the English corpus is 730M bytes, and the size of the Chinese corpus is
120M bytes. They divide the corpus into time periods, perform segmentation, and
deter-mine unknown Chinese and English words appearing in each period. Next, they estimate
the translation probability for each translation candidate based on the context:
$$P(C(c) \mid C(e)) = \prod_{t_c \in C(c)} P(t_c \mid T_c(C(e)))^{q(t_c)}, \tag{2.7}$$

where $q(t_c)$ is the number of occurrences of a Chinese word $t_c$ in the context $C(c)$, $C(e)$
is the context of the English word $e$, and $T_c(C(e))$ is a bag of Chinese words created by
translating the English words in $C(e)$ using a bilingual dictionary. Then, a probability of
translation for each candidate based on transliteration is obtained as follows:

$$P(e \mid c) = P(e \mid \text{pinyin}) = \sum_{a} \prod_{i} P(l_{a_i} \mid p_i), \tag{2.8}$$

where $p_i$ is the $i$th syllable of the Pinyin (the official romanization used in China) created
by converting each character in a Chinese word $c$, and $l_{a_i}$ is the English letter sequence
that the $i$th Pinyin syllable maps to under a particular alignment $a$. Finally, they rank
candidates based
on the probabilities of translation. The number of new Chinese source words and English
translations found are 4,499 and 192,521, respectively. The precision of newly found correct
translations is 78.2% as evaluated by humans.
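The context-based score of Equation 2.7 can be sketched as follows; the toy dictionary, the add-epsilon smoothing, and the relative-frequency estimate of P(t_c | T_c(C(e))) are simplifying assumptions, not Shao and Ng's exact estimation procedure.

```python
from collections import Counter
from math import prod

def context_score(chinese_context, english_context, eng_to_zh, eps=1e-6):
    """Toy version of Equation 2.7: a product over Chinese context words t_c
    of P(t_c | T_c(C(e))) raised to the occurrence count q(t_c)."""
    # T_c(C(e)): bag of Chinese words from dictionary-translating the English context.
    bag = Counter()
    for e in english_context:
        bag.update(eng_to_zh.get(e, []))
    total = sum(bag.values()) or 1
    q = Counter(chinese_context)  # q(t_c): occurrences of t_c in C(c)
    return prod((bag[tc] / total + eps) ** n for tc, n in q.items())
```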
Researchers have derived bilingual lexicons even for language pairs that have neither
a parallel corpus nor a comparable corpus. Koehn and Knight [53] work with an English
corpus and a German corpus that differ in time period and orientation; their goal is to
derive one-to-one bilingual noun translations from German to English using these disparate
corpora. They find translation candidates based on (i) identical words adopted from other
languages (e.g., “email” and “internet”), (ii) words with similar spelling due to cognate
origin (e.g., “website” in English and “webseite” in German), (iii) words occurring in similar
contexts, (iv) words whose similarity relations are preserved across languages (e.g., the word
“dog” is similar to the word “cat”, and their translations should be similar as well), and (v)
frequencies of words.
They extract 1,339 bilingual noun translations, which can be considered to constitute a
seed lexicon, with accuracy of 89% starting with just the identical words. According to
Koehn and Knight, finding identical words, words with similar spelling, and words in similar
context help find significantly more new bilingual translations. The authors report that
the translations they extract cover 39% of the translations extracted at word-level from a
German-English parallel corpus.
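For example, the identical-word clue alone reduces to a vocabulary intersection; the toy vocabularies below are hypothetical.

```python
# Hypothetical vocabularies extracted from the two disparate monolingual corpora.
german_vocab = {"internet", "email", "webseite", "hund"}
english_vocab = {"internet", "email", "website", "dog"}

# Clue (i): identically spelled words form the initial seed lexicon.
seed = sorted(german_vocab & english_vocab)
print(seed)  # ['email', 'internet']
```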
2.4.4 Generating dictionaries from multiple linguistic resources
To improve the quality and the quantity of entries in the newly created dictionaries,
researchers extract translation candidates from available bilingual dictionaries as discussed
in prior sections, but extend the process by using resources such as thesauri, corpora, and
WordNets to identify senses of words and to remove irrelevant candidates. Sanfilippo and
Steinberger [102] enrich a bilingual dictionary Dict(A,B) by linking its senses to senses in a
thesaurus of A. The enriched dictionary can be used to distinguish translation candidates
of a word in a given context. In the thesaurus, each word ai has one or many senses and
corresponding synonyms for each sense. Each sense has a unique identification number
sense_ij:

    a_i: sense_i1: a_i11, a_i12, a_i13, ...
         sense_i2: a_i21, a_i22, ...
         ...
         sense_ij: ...
Given a word a_i in language A in the dictionary, they obtain all words belonging to each
sense of this word from the thesaurus, translate them into the target language B, and rank
the translation candidates based on their occurrence counts. Finally, each translation
candidate is either matched to a sense of the source word or discarded. As a result, translations
b of the source word a are grouped based on the senses of a:
    a_i: sense_i1: b_i11, b_i12, ...
         sense_i2: b_i21, ...
         ...
         sense_ij: ...
The precision and recall of linking senses are 86% and 97%, respectively, whereas
those of ranking translations are 87% and 92%, respectively. The approach of Sanfilippo
and Steinberger [102] can be used to create a new dictionary Dict(B,C) from the given
dictionaries Dict(A,B) and Dict(A,C), and a thesaurus in language A. They link senses in
each dictionary to senses in the thesaurus, generate translations between B and C using A
as a pivot, and align translations using the unique sense numbers of the pivot word ai in A.
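Under simplified assumptions (each entry already linked to a thesaurus sense identifier), this pivot construction can be sketched as follows; the dictionaries are toy stand-ins:

```python
# Hypothetical sense-linked dictionaries: (pivot word in A, sense id) -> translations.
dict_ab = {("bank", 1): ["Ufer"], ("bank", 2): ["Bank"]}     # A -> B
dict_ac = {("bank", 1): ["berge"], ("bank", 2): ["banque"]}  # A -> C

# Generate a B-C pair only when both words translate the same sense of the
# same pivot word, so the sense number aligns the two dictionaries.
dict_bc = {(b, c)
           for key, bs in dict_ab.items()
           for b in bs
           for c in dict_ac.get(key, [])}
print(dict_bc)  # {('Ufer', 'berge'), ('Bank', 'banque')} (set order may vary)
```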
Goh et al. [31] construct a new Japanese-Chinese dictionary from Japanese-English
and Chinese-English dictionaries using the pivot-based method with English as the pivot,
relying on the one-time inverse consultation method. Samples of 200 randomly selected words of
each category (nouns, verbal nouns, and verbs) are evaluated manually using a 4-point scale
{correct, not-first, acceptable, wrong}. Their dictionary has 20,554 entries with an average
accuracy of 77%. Because many Japanese words are combinations of Kanji characters,
which are similar to Hanzi in Chinese, they find 7,941 new translations with accuracy of
97% for nouns and 97.5% for verbal nouns by converting Kanji to Hanzi.
Nerima and Wehrli [81] create a new bilingual dictionary Dict(A, C) from two input
bilingual dictionaries Dict(A, B) and Dict(B, C) using the transitive method. The
translation candidates are validated by checking their appearance in an A-C parallel corpus. An
example of their experiments is to construct an English-German dictionary from English-French
and German-French dictionaries consisting of 76,311 and 45,492 entries, respectively.
Their new English-German dictionary has 21,600 entries, of which 26% are found
using the corpus and evaluated manually. The authors do not report a precision value for
their dictionary, but they claim that the translations they create are very good.
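A minimal sketch of the transitive method with corpus validation; the representation of the parallel corpus as aligned token sequences is an assumption for illustration:

```python
def compose(dict_ab, dict_bc):
    """Transitive method: link a -> c whenever some pivot word b translates both."""
    dict_ac = {}
    for a, bs in dict_ab.items():
        for b in bs:
            for c in dict_bc.get(b, []):
                dict_ac.setdefault(a, set()).add(c)
    return dict_ac

def validate(dict_ac, parallel_pairs):
    """Keep (a, c) pairs co-occurring in at least one aligned sentence pair,
    where parallel_pairs is a list of (source tokens, target tokens)."""
    return {(a, c)
            for a, cs in dict_ac.items()
            for c in cs
            if any(a in src and c in tgt for src, tgt in parallel_pairs)}
```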
A comparable corpus has also been used to validate translation candidates. Otero
and Campos [90] create a new dictionary Dict(A, C) from Dict(A, B) and Dict(B, C) using
transitivity; then, they remove ambiguous entries in the created dictionary using an A-C
comparable corpus. They split Dict(A, C) into two subsets: Dict(A, C)_amb, containing
ambiguous entries, and Dict(A, C)_unamb, consisting of unambiguous entries. To remove
ambiguous entries, they generate a temporary dictionary Dict(A, C)_corpus from the
comparable corpus such that every word in A is translated into the top-N best translations in C
and every word in C is also translated into the top-N best translations in A. The final
bilingual dictionary Dict(A, C) is created using the following formula:

$$Dict(A, C) = (Dict(A, C)_{amb} \cap Dict(A, C)_{corpus}) \cup Dict(A, C)_{unamb}. \tag{2.9}$$
They create an English-Galician dictionary from the English-Spanish and Spanish-Galician
dictionaries, and a comparable corpus of English and Galician. The dictionary created
contains 12,064 entries, and 22% of the entries are found in the comparable corpus. Similar
to Nerima and Wehrli [81], Otero and Campos claim that there is no need to manually
evaluate the entries they generate because their quality matches that of entries created by
lexicographers, although they do not explain how this comparison against a
lexicographer-created resource was carried out.
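Since the three dictionaries can be represented as sets of (source, target) pairs, Equation 2.9 reduces to two set operations, as in this minimal sketch with toy entries:

```python
def final_dictionary(amb, unamb, corpus):
    """Equation 2.9: keep ambiguous entries only when the comparable corpus
    confirms them; keep all unambiguous entries unconditionally."""
    return (amb & corpus) | unamb

amb = {("coche", "car"), ("coche", "pram")}   # ambiguous entries
corpus = {("coche", "car")}                   # top-N translations from the corpus
unamb = {("perro", "dog")}                    # unambiguous entries
print(final_dictionary(amb, unamb, corpus))   # {('coche', 'car'), ('perro', 'dog')}
```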
In addition to parallel or comparable corpora, researchers have also used
monolingual corpora to validate translation candidates. Kaji et al. [45] create a Japanese-Chinese
dictionary from Japanese-English and Chinese-English dictionaries using the pivot-based
method. A correlation matrix of associated words versus translations, obtained from two
monolingual corpora, is used to disambiguate ambiguous translation candidates. To
construct a correlation matrix, they first extract word
associations from the corpora, align the extracted Japanese word associations with the
extracted Chinese word associations using the dictionary created by the pivot-based method,
and iteratively compute the correlations between associated words and translations. The
correlation matrix is converted to a binary matrix such that the highest value in each row
of the matrix is converted to 1.0 whereas the remaining values are converted to 0.0. Finally,
the support for each translation is obtained by dividing the number of times 1.0 occurs in its
column by the number of rows in the matrix. The translations with support values greater
than a threshold are accepted as correct translations. For evaluation, 384 Japanese
noun entries and their translations are manually validated. The evaluation produced a
precision of 64.9% and a recall of 15.8%.
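The binarization and support computation can be sketched with a small NumPy matrix standing in for the correlation matrix Kaji et al. build from word associations; the values are invented for illustration:

```python
import numpy as np

# Rows: associated words; columns: translation candidates of one ambiguous word.
corr = np.array([[0.9, 0.1],
                 [0.3, 0.7],
                 [0.6, 0.4]])

# Binarize: the highest value in each row becomes 1.0, the rest 0.0.
binary = (corr == corr.max(axis=1, keepdims=True)).astype(float)

# Support of each candidate: share of rows in which it has the highest correlation.
support = binary.sum(axis=0) / binary.shape[0]
print(support)  # [0.667 0.333] (approximately); accept candidates above a threshold
```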
WordNets have been used to remove irrelevant translation candidates. Varga and
Yokoyama [119, 120] generate a Japanese-Hungarian dictionary from Japanese-English and
Hungarian-English dictionaries using the pivot-based method. A translation candidate is
considered unambiguous if there exists only one translation from the source language
to the pivot language, which in turn has only one translation to the target language. To
handle ambiguities, they compute scores using information obtained from a WordNet of the
pivot language, the English WordNet, as below:

$$score_B(w_J, w_H) = \max_{i'} \frac{|sns(w_J \to i') \cap sns(w_H \to i')|}{|sns(w_J \to i') \cup sns(w_H \to i')|}, \tag{2.10}$$

$$score_{C,D,E}(w_J, w_H) = \frac{|ext(w_J \to w_E) \cap ext(w_H \to w_E)|}{|ext(w_J \to w_E) \cup ext(w_H \to w_E)|}, \tag{2.11}$$

$$score_F(w_J, w_H) = \prod_{rel} \left( (c_1 + \max(score_{rel}(w_J, w_H))) \cdot (c_2 + c_3 \cdot mfactor_{rel}(w_J, w_H)) \right), \tag{2.12}$$

where $i' \in (w_J \to w_E) \cap (w_H \to w_E)$; $sns(w)$ is the set of senses of word $w$; $ext(w)$ is