
Word Alignment by Re-using Parallel Phrases

by

Maria Holmqvist

Submitted to Linköping Institute of Technology at Linköping University in partial fulfilment of the requirements for the degree of Licentiate of Philosophy

Department of Computer and Information Science
Linköpings universitet
SE-581 83 Linköping, Sweden


Maria Holmqvist, December 2008
ISBN 978-91-7393-728-3

Linköping Studies in Science and Technology, Thesis No. 1392
ISSN 0280-7971
LiU-Tek-Lic-2008:50

ABSTRACT

In this thesis we present the idea of using parallel phrases for word alignment. Each parallel phrase is extracted from a set of manual word alignments and contains a number of source and target words and their corresponding alignments. If a parallel phrase matches a new sentence pair, its word alignments can be applied to the new sentence. There are several advantages of using phrases for word alignment. First, longer text segments include more context and will be more likely to produce correct word alignments than shorter segments or single words. More importantly, the use of longer phrases makes it possible to generalize words in the phrase by replacing words by parts-of-speech or other grammatical information. In this way, the number of words covered by the extracted phrases can go beyond the words and phrases that were present in the original set of manually aligned sentences. We present experiments with phrase-based word alignment on three types of English–Swedish parallel corpora: a software manual, a novel and proceedings of the European Parliament. In order to find a balance between improved coverage and high alignment accuracy, we investigated different properties of generalised phrases to identify which types of phrases are likely to produce accurate alignments on new data. Finally, we have compared phrase-based word alignments to state-of-the-art statistical alignment with encouraging results. We show that phrase-based word alignments can be used to enhance statistical word alignment. To evaluate word alignments, an English–Swedish reference set for the Europarl corpus was constructed. The guidelines for producing this reference alignment are presented in the thesis.

This work has been supported by the Swedish National Graduate School of Language Technology (GSLT) and Santa Anna IT Research Institute.

Department of Computer and Information Science
Linköpings universitet


I would like to take this opportunity to thank the people who have supported me during my work with this thesis. First I want to thank my supervisor Lars Ahrenberg who introduced me to research in parallel text processing and machine translation. I am grateful for your insights and for always having the time to discuss my work. I also want to thank my secondary supervisor Magnus Merkel.

Thanks also to all members of NLPLab, past and present, for interesting discussions and good “fika”: Sara Stymne, Arne Jönsson, Lars Ahrenberg, Magnus Merkel, Jalal Maleki, Annika Silvervarg, Nils Dahlbäck, Jody Foo, Lars Degerstedt, Håkan Sundblad, Mustapha Skhiri, Bertil Lyberg, Sonia Sangari, Pontus Wärnestål and Genevieve Gorrell.

To other friends and colleagues, thanks for making my PhD studies at IDA so enjoyable. Thank you Susanna Nilsson, Rebecka Edqvist, Robert Scholz, Fabian Segelström, Per Sökjer, Magnus Ingmarsson, Ola Leifler, Anders Larsson, Björn Johansson, Hanjing Li and Jiri Trnka. I especially want to thank Sara Stymne for inspiring scientific co-operation, fun travel and for being a good friend. Thank you for your patience in proof-reading this thesis and for your contributions to this work. You always have a solution to any problem.

I also want to thank GSLT and Sture Hägglund at Santa Anna IT Research Institute for financial support and Lise-Lott Andersson and Lillemor Wallgren for their help with administrative tasks.

I want to thank my family and friends for being there for me. Thank you Björn. You make me happy.


Contents

1 Introduction
1.1 Contributions
1.2 Outline

2 Background
2.1 Sentence alignment
2.2 Statistical word alignment
2.2.1 Statistical translation models
2.2.2 Statistical alignment models
2.2.3 Parameter estimation
2.2.4 Giza++
2.2.5 Symmetrization heuristics
2.3 Heuristic word alignment
2.3.1 Combining knowledge sources
2.4 Statistical vs. heuristic word alignment
2.5 Improving statistical word alignment
2.5.1 Pre-processing
2.5.2 Adding parallel data
2.5.3 Constraining EM
2.5.4 Discriminative alignment
2.6 Applications
2.6.1 Statistical machine translation
2.7 Evaluation
2.7.1 Formats
2.7.2 Intrinsic measures
2.7.3 Extrinsic measures
2.7.4 Intrinsic vs. extrinsic measures
2.7.5 Conclusions

3 Phrase-based word alignment
3.1 Phrases for word alignment
3.1.1 Motivation
3.1.2 Related work
3.2.1 Experiment I - Aligning with phrases
3.2.2 Experiment II - Aligning with generalised phrases
3.3 Conclusions
3.4 Research questions

4 Corpora and guidelines
4.1 Experiment corpus
4.1.1 Corpus preparation
4.1.2 Training and test data
4.2 Linguistic annotation
4.3 Guidelines for word alignment
4.3.1 Manual word alignment
4.3.2 Training data alignment
4.3.3 Test data alignment
4.3.4 Alignment software

5 Experiments with Europarl
5.1 Experiments
5.1.1 Phrase extraction
5.1.2 Phrase length
5.1.3 Generalised phrases
5.1.4 Phrase selection
5.2 Evaluation
5.2.1 Theoretical coverage
5.2.2 Giza baseline
5.2.3 Test results
5.2.4 Combining with Giza alignments
5.3 Conclusions

6 Conclusion
6.1 Future work

A Word alignment guidelines
A.1 Guidelines for reference alignment
A.1.1 General guidelines
A.1.2 Null links
A.1.3 Phrase alignments
A.1.4 Noun phrases
A.1.5 Named entities
A.1.6 Category shifts
A.1.7 Verb phrases
A.1.8 Date and time expressions
A.1.9 Translation errors
A.2 Guidelines for training data alignment


1 Introduction

Translated texts are rich sources of information about language, language differences and translation. A text and its translation in another language are called a parallel text. The goal of research in parallel text processing is to extract translation information from the large amounts of translations that are produced and made available in electronic form.

A fundamental step in extracting translation information from parallel text is to determine which words in the source text correspond to which words in the target text, a process known as word alignment. A parallel text is often aligned at the sentence level, which means that word alignment is performed on a pair of corresponding source and target sentences. Word alignments can then be used to extract information from the texts including bilingual dictionaries, transfer rules and information about word order differences between languages. It is also possible to train a completely data-driven statistical machine translation system based on a word aligned corpus, provided that the corpus is large enough. It is not an easy task to decide which source and target words correspond in a parallel text and manual word alignment can be very time-consuming. Today, there are effective techniques for automatic word alignment of parallel text. However, the quality of automatic word alignment is far from perfect, especially if parallel data is scarce, which is the case for most language pairs unless one of the languages is English.

This thesis concerns the problem of word alignment and suggests a new method to create high quality word alignments using phrases from a set of manually word aligned sentences. Phrases are defined as pairs of aligned sentence segments of arbitrary length. The use of phrases has several advantages over words for word alignment. Longer parallel phrases provide more context to the word links which makes them more reliable. In addition, the use of longer phrases makes it possible to generalize words in the phrase by replacing words by parts-of-speech or other grammatical information. In this way, the number of words covered by the extracted phrases can go beyond the words and phrases that were present in the original set of manually aligned sentences. More specifically, the thesis will address the following questions. What is the highest recall phrase-based word alignment can produce with high precision? How can phrases be generalised to increase recall without losing precision? We also address the issue of how the quality of phrase-based word alignment compares to the quality of state-of-the-art statistical word alignment and whether word alignments from these systems can be combined to produce better alignments.

1.1 Contributions

This thesis presents the idea of using phrases for word alignment and shows how phrases can be applied to create high quality alignments. The main contributions of this thesis are:

• A framework for phrase-based word alignment.

• A study of how to select reliable parallel phrases for word alignment.

• A comparison of phrase-based word alignment and state-of-the-art statistical word alignment.

• An English–Swedish word aligned reference set from the Europarl corpus.

• A set of reference alignment guidelines for English–Swedish word alignment.

1.2 Outline

After this short introduction to word alignment, Chapter 2 will present current state-of-the-art methods of automatic word alignment, as well as various approaches to improve them. The chapter discusses methods and measures for evaluating word alignment systems and reviews some applications of word aligned parallel texts with a focus on statistical machine translation. Research on the relationship between word alignment quality and machine translation quality is also reviewed here.

Chapter 3 describes the novel phrase-based approach to word alignment, which aligns new text based on parallel phrases from manually aligned text. We explain the ideas and motivations behind the approach and outline the research questions for the experiments in this thesis.

Chapter 4 presents the experiment corpus and describes the creation of the manual word alignments that were used in this study. Two sets of word alignments were created, a training set and a test set. The training set was used to extract parallel phrases and the test set was used as reference data for word alignment evaluation. The creation of manual word alignments included constructing suitable word alignment guidelines, manual annotation and finally, merging the annotations from two human annotators into one reference alignment.

Chapter 5 describes the experiments that were performed to evaluate the phrase-based word alignment algorithm on the Europarl corpus as well as various methods to improve its accuracy and recall. The results were evaluated using standard measures of alignment quality and were compared to the quality of statistical word alignment.

Chapter 6, finally, contains a summary of the findings and outlines directions for future research.


2 Background

Word alignment is the task of finding correspondences between words and phrases in parallel texts. Aligning words in translated texts is difficult for many reasons. Language differences make word-by-word translations very uncommon in real translations and human translators may use paraphrases, omissions, additions and reorderings to express the original content in a new language.

In this chapter I will present methods for aligning a parallel text on the sentence and word levels. Sentence correspondences are usually identified in a separate process before word alignment to make the word alignment problem easier. Sentence alignment is described in Section 2.1. The two main approaches to automatic word alignment, statistical word alignment and heuristic word alignment, are described in Section 2.2 and Section 2.3. In Section 2.4 I discuss advantages and disadvantages of the heuristic and statistical approaches. In Section 2.5 I will present research on improving statistical word alignment with syntactic and morphological information. A number of applications of word alignment are described in Section 2.6, with a focus on statistical machine translation. Finally, Section 2.7 explains how to evaluate word alignment systems – a very important aspect of word alignment research.

2.1 Sentence alignment

Most word alignment algorithms require a parallel text that has been aligned at the sentence level. Sentence and paragraph alignment can be performed with relatively high precision using language independent methods such as Gale and Church's algorithm (Gale and Church, 1991). Several simplifying assumptions can be made about sentence and paragraph alignment which make this an easier problem than word alignment. First of all, sentences and paragraphs are generally translated in the order they appear in the original text and we can therefore generally assume that the alignment is monotone. Second, there is a correlation between the sentence lengths (in characters) in many language pairs. The Gale and Church algorithm depends on these two assumptions and also assumes that sentences correspond either 1-to-1, 1-to-2, 2-to-1, 1-to-0, 0-to-1 or 2-to-2 since these correspondences are the most common in text. In their algorithm, each candidate sentence pair is assigned a probabilistic score based on the ratio of lengths of the two sentences and the variance of this ratio. Dynamic programming is then used to find the most probable alignment for all sentences in the document. The accuracy of this method varies with the language pair and characteristics of the text. Gale and Church, for example, report a 4% error rate on a sample of the French–English Canadian parliamentary proceedings, known as the Hansards. Other sentence alignment algorithms use similar dynamic programming techniques but also use information about cognates and word correspondences to find anchor points in the parallel text (Melamed, 1999; Moore, 2002; Wu, 1994).
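To make the length-based approach concrete, the sketch below implements a heavily simplified version of the Gale and Church procedure in Python. The match priors and the variance constant follow the published description, but the match cost is only an approximation of the original score, so this should be read as an illustration of the dynamic programming idea rather than as a faithful reimplementation.

import math

# Allowed alignment patterns and their prior probabilities
# (values as reported by Gale and Church, 1991).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def match_cost(len_s, len_t, mean=1.0, variance=6.8):
    """Negative log probability of a source/target length pair."""
    if len_s == 0 and len_t == 0:
        return 0.0
    delta = (len_t - len_s * mean) / math.sqrt(max(len_s, 1) * variance)
    # Two-tailed probability of a deviation at least this large under N(0, 1).
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(delta) / math.sqrt(2))))
    return -math.log(max(p, 1e-12))

def align(src_lens, tgt_lens):
    """Dynamic programming over the allowed patterns; the inputs are
    the character lengths of the source and target sentences."""
    INF = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for (ds, dt), prior in PRIORS.items():
                if i + ds > n or j + dt > m:
                    continue
                c = (cost[i][j] - math.log(prior)
                     + match_cost(sum(src_lens[i:i + ds]),
                                  sum(tgt_lens[j:j + dt])))
                if c < cost[i + ds][j + dt]:
                    cost[i + ds][j + dt] = c
                    back[i + ds][j + dt] = (i, j)
    beads, i, j = [], n, m                 # trace back the best path
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return list(reversed(beads))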

2.2 Statistical word alignment

Statistical methods have become the dominating approach to word alignment in recent years due to the availability of parallel data and the success of statistical machine translation. In the statistical approach to word alignment, a statistical translation model is estimated directly from parallel texts. A translation model models the conditional probability of a source string given a target string (e.g., Pr(I|jag) = 0.9 and Pr(me|jag) = 0.1). These probabilities are estimated from corpora via an alignment that connects words in a source sentence with words in a target sentence. Consequently, alignments are produced as a by-product of estimating the translation model.

2.2.1 Statistical translation models

Translation models are used in statistical machine translation to find the most probable translation t = [t_1, ..., t_I] of a source string s = [s_1, ..., s_J]. Using Bayes' theorem, we can write the probability Pr(t|s), the probability of translation t given the source string s, as in (2.1).

Pr(t|s) = Pr(s|t) Pr(t) / Pr(s)    (2.1)

The goal of finding the most probable target t can be restated as finding the target language string t that maximizes the function in (2.2). In this way, the translation probability only depends on the reversed translation model, Pr(s|t), and a target language model, Pr(t).

t̂ = argmax_t Pr(s|t) Pr(t)    (2.2)


This model of translation assumes that the source string is just a noisy version of the target string. Using the example of translation from Swedish to English, the model assumes that when a native Swedish speaker generates a string s in Swedish, she actually had an English string t in mind. The goal is to find out which English string the speaker had in mind when she said s, based on the probability of the English string, Pr(t), and the probability of s being a translation of t, Pr(s|t).

The target language model Pr(t) is estimated from a target language corpus and the translation model Pr(s|t) is estimated from parallel corpora.

2.2.2 Statistical alignment models

Statistical alignment models introduce a hidden alignment a, which connects words in the source string to words in the target string. The set of alignments a is defined as the set of all possible connections between each word position j in the source string to exactly one word position i in the target string. The translation probability Pr(s|t) can be calculated as the sum of Pr(s, a|t) over all possible alignments, where Pr(s, a|t) is the joint probability of the source string s and an alignment a given the target string t (2.3).

Pr(s|t) = Σ_a Pr(s, a|t)    (2.3)

The joint probability Pr(s, a|t) is not estimated directly from a parallel corpus. Instead, the process of generating the source string s from the target string t is broken down into smaller steps and the probability of each step is estimated from the corpus. The IBM models 1-5 (Brown et al., 1993) and the HMM model (Vogel et al., 1996) decompose the alignment model into a set of parameters that describe this generative process. Models 1 and 2 tell a simplified story of how the source string was generated from the target string, while the more complex Models 3, 4 and 5 account for more relevant factors that affect alignment probability.

In the simplest of the IBM models, Model 1, Pr(s, a|t) only depends on one parameter, the translation probability t(s_j | t_{a_j}), which is the probability that the source word s_j is a translation of the target word t_{a_j} it is aligned to.

Model 2 includes a parameter for alignment positions a(i | j, I, J) where the position of the target word i depends on the position of the source word j, the length of the target sentence I and the length of the source sentence J. In this model, the alignment depends on the source and target words as well as the absolute position of the source word.

Model 3 adds several new parameters to the alignment model. In this model, each target word can give rise to several source words, as in anyhow – hur som helst. The fertility parameter f(n | t) models the probability that a target word generates n source words. Model 3 also assumes that source words can be generated from a NULL word token at each position in the target sentence. The probability of generating such a NULL word is also used as a parameter in this model. Finally, a reversed position model a(j | i, J, I) is used that models the probability of the source word position j based on the target word position i.

Model 4 adds two new parameters, a relative word order model and a first-order dependence on word classes. The word order model acknowledges the fact that words tend to move around in groups. This is modeled by having two reversed alignment models, one for the first word of a group, d_1(Δj | A(t_{i−1}), B(s_j)), and a second model for the relative positions of the following words, d_{>1}(Δj | B(s_j)). Δj is the relative position of the source word being placed and A(t) and B(s) are the word classes of a target word and a source word respectively. Consequently, in Model 4, the placement of the first word of a word group depends on the word class of the previous aligned target word and the word class of the source word being placed. The placement of the other words in a group depends only on the word class of the source word. Word classes are automatically induced from data (Och and Ney, 2003).

Model 5 differs from Model 4 only by the fact that it is not deficient. The deficiency of Model 4 means that the model wastes probability on events that can not be source strings. The problem is the relative word order model, which lets source positions be chosen more than once and assigns probabilities to alignments outside of the sentence boundaries.

The Hidden Markov Model alignment presented in Vogel et al. (1996) is an alternative to IBM Model 2. Unlike Model 2, the HMM model includes a first-order dependence on the previously aligned word position, a(a_j | a_{j−1}, I). This model produces better results than Model 2.

2.2.3 Parameter estimation

The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is used to iteratively estimate alignment model probabilities according to the likelihood of the model on a parallel corpus. In the Expectation step, alignment probabilities are computed from the model parameters and in the Maximization step, parameter values are re-estimated based on the alignment probabilities and the corpus. The iterative process is started by initializing parameter values with uniform probabilities for IBM Model 1. The EM algorithm is only guaranteed to find a local maximum which makes the result depend on the starting point of the estimation process. Therefore, the results of simpler models are used as initial guesses to bootstrap more complex models.
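To make the E- and M-steps concrete, the following sketch trains IBM Model 1 translation probabilities t(s|t) with EM. It assumes the corpus is a list of (source_tokens, target_tokens) pairs with a NULL token already prepended to every target sentence; it is a minimal textbook version, not the Giza++ implementation.

from collections import defaultdict

def train_model1(corpus, iterations=5):
    # Initialise t(s|t) uniformly over the source vocabulary.
    src_vocab = {s for src, _ in corpus for s in src}
    uniform = 1.0 / len(src_vocab)
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count = defaultdict(float)     # expected counts c(s, t)
        total = defaultdict(float)     # expected counts c(t)
        # E-step: distribute each source token over the target tokens
        # in proportion to the current translation probabilities.
        for src, tgt in corpus:
            for s in src:
                norm = sum(t[(s, w)] for w in tgt)
                for w in tgt:
                    delta = t[(s, w)] / norm
                    count[(s, w)] += delta
                    total[w] += delta
        # M-step: re-estimate t(s|t) from the expected counts.
        for (s, w), c in count.items():
            t[(s, w)] = c / total[w]
    return t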

2.2.4 Giza++

Giza++ is a free statistical word alignment system that implements IBM Models 1-5, HMM alignment, and parameter smoothing (Och and Ney, 2003). The Giza++ default alignment model is estimated using 5 iterations each of Model 1, HMM, Model 3 and Model 4. Giza++ also finds the most probable alignment for each sentence pair based on the estimated parameter values. This alignment is called the Viterbi alignment.

Giza++ produces word alignments in the format shown in Figure 2.1 where each aligned sentence pair is represented by three lines of text. The first line contains the sentence number, lengths of source and target sentences and an alignment score. The second line contains the target sentence. The third line, finally, contains the source sentence with links to the corresponding target words. The first source token is a null token so that target words can be linked to null if there is no other corresponding token in the source.

1: # Sentence pair (1) source length 3 target length 4 alignment score : 0.00459675

2: resumption of the session

3: NULL ({ }) återupptagande ({ 1 }) av ({ 2 }) sessionen ({ 3 4 })

Figure 2.1: Giza++ alignment format
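A small reader for the three-line format in Figure 2.1 might look as follows; the exact header layout can vary between Giza++ versions, so treat this as an approximate sketch rather than a complete parser.

import re

def read_giza(path):
    """Yield (source_tokens, target_tokens, links) triples, where links
    are (source_position, target_position) pairs; position 0 on the
    source side is the NULL token and target positions start at 1."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    for k in range(0, len(lines) - 2, 3):
        header, target, source = lines[k], lines[k + 1], lines[k + 2]
        tgt_tokens = target.split()
        src_tokens, links = [], []
        # Each source token is followed by "({ j1 j2 ... })".
        for pos, match in enumerate(
                re.finditer(r"(\S+)\s+\(\{([^}]*)\}\)", source)):
            word, targets = match.group(1), match.group(2).split()
            if pos > 0:
                src_tokens.append(word)
            links.extend((pos, int(j)) for j in targets)
        yield src_tokens, tgt_tokens, links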

2.2.5 Symmetrization heuristics

Because the statistical models require that each source word is aligned to exactly one target word, Viterbi alignment results in an asymmetric alignment containing one-to-many alignments but not many-to-one or many-to-many alignments. This is a problem since many-to-many alignments are quite common. The solution is to produce asymmetric alignments in both translation directions and then combine the alignments into one symmetric alignment. The simplest symmetrization algorithms take the intersection or the union of the word links from the two asymmetric alignments. More complex heuristics start from the intersection of word links and then add additional alignment points from the union based on particular criteria.

The following heuristics and combinations of heuristics in Table 2.1 are implemented in the Moses toolkit for statistical MT (Koehn et al., 2007). Starting with the intersection, the alignment is expanded by adding alignment points from the union of alignments. Points are added if they are adjacent to previously aligned words (grow), that is, if they are above, below, to the right or left of an alignment point in the alignment matrix. In addition, alignment points may be diagonally adjacent in order to be added to the alignment (diagonal). In a final step, remaining unaligned words can be aligned although they are not adjacent to other alignment points (final). An additional constraint can be added to the final step that both words in the alignment point must be unaligned (and). The heuristics in Table 2.1 were used to produce symmetric Giza++ alignments for the project Europarl corpus. These alignments were then used as a state-of-the-art baseline. An evaluation of the heuristics on this corpus is presented in Section 5.2.2.


intersection          A1 ∩ A2
union                 A1 ∪ A2
grow                  add adjacent links (block distance)
grow-diag             add diagonally adjacent links
grow-diag-final       add non-adjacent links in a final step
grow-diag-final-and   add links for unaligned words in the final step
grow-final            grow + add non-adjacent links in a final step

Table 2.1: Symmetrization heuristics
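The sketch below illustrates the grow-diag-final family of heuristics, following the description above. Iteration order and the handling of border cases differ slightly from the Moses implementation, so it is a schematic version only.

def grow_diag_final(s2t, t2s, diag=True, final_and=False):
    """s2t and t2s are the two directed alignments, given as sets of
    (source_index, target_index) pairs."""
    union = s2t | t2s
    alignment = set(s2t & t2s)              # start from the intersection

    neighbours = [(-1, 0), (0, -1), (1, 0), (0, 1)]
    if diag:
        neighbours += [(-1, -1), (-1, 1), (1, -1), (1, 1)]

    def src_aligned(i): return any(a[0] == i for a in alignment)
    def tgt_aligned(j): return any(a[1] == j for a in alignment)

    # grow(-diag): repeatedly add union points that are adjacent to the
    # current alignment and touch at least one still-unaligned word.
    added = True
    while added:
        added = False
        for (i, j) in sorted(union - alignment):
            adjacent = any((i + di, j + dj) in alignment
                           for di, dj in neighbours)
            if adjacent and (not src_aligned(i) or not tgt_aligned(j)):
                alignment.add((i, j))
                added = True

    # final(-and): add remaining union points for unaligned words.
    for (i, j) in sorted(union - alignment):
        if final_and:
            ok = not src_aligned(i) and not tgt_aligned(j)
        else:
            ok = not src_aligned(i) or not tgt_aligned(j)
        if ok:
            alignment.add((i, j))
    return alignment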

2.3 Heuristic word alignment

Word alignment methods that align words and phrases based on other knowledge sources than pure statistical estimation are called heuristic or associative word alignment. Heuristic systems base their alignment decisions on many knowledge sources, usually a combination of co-occurrence metrics, string similarity metrics, machine readable dictionaries and part-of-speech patterns or other types of information that can be manually defined or learned from word aligned data.

Co-occurrence metrics measure the association between a source and target word by measuring the number of times the source and target words co-occur in a sentence pair compared to the number of times they would co-occur by chance. Common measures of co-occurrence are the Dice score (2.4) and pointwise mutual information (2.5).

Dice(s, t) = 2 c(s, t) / (c(s) + c(t))    (2.4)

MI(s, t) = log_2 ( p(s, t) / (p(s) p(t)) )    (2.5)
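For illustration, both measures can be computed from sentence-level co-occurrence counts roughly as follows, assuming the corpus is a list of (source_tokens, target_tokens) pairs; this is a sketch, not tuned for efficiency.

import math

def association_scores(corpus, s, t):
    """Return the Dice score (2.4) and pointwise mutual information
    (2.5) for the word pair (s, t)."""
    c_s = sum(s in src for src, _ in corpus)       # sentences containing s
    c_t = sum(t in tgt for _, tgt in corpus)       # sentences containing t
    c_st = sum(s in src and t in tgt for src, tgt in corpus)
    n = len(corpus)
    dice = 2 * c_st / (c_s + c_t) if c_s + c_t else 0.0
    p_s, p_t, p_st = c_s / n, c_t / n, c_st / n
    mi = math.log2(p_st / (p_s * p_t)) if p_st else float("-inf")
    return dice, mi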

Cognate words, words that are similar across languages, include proper names, loan words and other words with common origin. Cognates such as international - internationell and Volvo - Volvo can be used as high-confidence alignments in a sentence pair. String similarity measures are used to determine the likelihood that two words are cognates. A common measure of string similarity is the co-occurrence of character bigrams calculated using Dice's coefficient (2.6).

Dice_character-bigrams(s, t) = 2 |bigrams(s) ∩ bigrams(t)| / (|bigrams(s)| + |bigrams(t)|)    (2.6)

Another measure for finding possible cognates is the Longest Common Subsequence Ratio, LCSR, which measures the ratio between the longest common subsequence of characters in source and target words, and the length of the longest word (2.7). Example (2.8) shows LCSR for the word pair farmer – farmare.

LCSR(s, t) = |longest common subsequence| / max(|s|, |t|)    (2.7)

LCSR(farmer, farmare) = |farmr| / max(|farmer|, |farmare|) = 5/7 ≈ 0.71    (2.8)

Cognates can be a good indicator of translation correspondence for closely related languages, but they may also give false indications of correspondence in the case of “false friends” such as eng. eventually – swe. eventuellt and eng. fart – swe. fart. The risk of false cognates is greater for shorter words and to minimise this risk a length threshold can be set for accepting potential cognate pairs.
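A direct implementation of LCSR, using a standard dynamic programming routine for the longest common subsequence, could look like this sketch:

def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ch_a == ch_b
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcsr(s, t):
    return lcs_length(s, t) / max(len(s), len(t))

# lcsr("farmer", "farmare") == 5 / 7, as in example (2.8).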

Associations between words can also be identified through the use of ordinary bilingual dictionaries or lists of corresponding parts-of-speech patterns, phrase type correspondence, relative positions and so on.

2.3.1 Combining knowledge sources

Co-occurrence metrics like the Dice coefficient produce an association score for pairs of lexical items in a sentence pair. Heuristic systems need a way to find the best alignment of a parallel corpus given these scores. The competitive linking algorithm (Melamed, 2000) is a greedy algorithm that produces a word alignment from association scores. First, word pairs are ranked according to their association scores assoc(s, t). Beginning from the top of the ranked list, word pairs are linked in the parallel corpus. Because the algorithm is restricted to one-to-one alignments, linked words are removed from the search space. New alignments are then added iteratively by linking word pairs with the next highest association score.
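A schematic version of competitive linking, assuming a precomputed table of association scores for word pairs, is sketched below:

def competitive_linking(src_tokens, tgt_tokens, scores):
    """scores maps (source_word, target_word) to an association score;
    links are returned as (source_index, target_index) pairs."""
    candidates = sorted(
        ((scores.get((s, t), 0.0), i, j)
         for i, s in enumerate(src_tokens)
         for j, t in enumerate(tgt_tokens)),
        reverse=True)
    links, used_src, used_tgt = [], set(), set()
    for score, i, j in candidates:
        if score <= 0.0:
            break                        # no association evidence left
        if i not in used_src and j not in used_tgt:
            links.append((i, j))         # one-to-one restriction
            used_src.add(i)
            used_tgt.add(j)
    return links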

Heuristic systems that use other knowledge sources than co-occurrence scores must have a method of combining the different knowledge sources into an accurate alignment. Tiedemann (2003; 2005) suggests a way to combine different types of information that indicate association between lexical items in a parallel text. He calls information that indicates associations between lexical items word alignment clues. Each clue C_i(s, t) indicates a value between 0 and 1 for the association between two lexical items s and t as defined by a weighted association measure w_i A_i(s, t). The association score can be calculated for any feature of the lexical items (e.g., word form, lemma, part-of-speech, phrase-type), using any measure of association (e.g., Dice, string similarity). Different clues are then combined for each word pair from the union of all available clues as defined in (2.9).

C_all(s, t) = C_1(s, t) ∪ C_2(s, t) = C_1(s, t) + C_2(s, t) − C_1(s, t) C_2(s, t)    (2.9)

After filling a word alignment matrix with the value of the combined clues for each pair of lexical items, links are found by looking at the cells in the matrix with the highest values and accepting those links that do not violate a constraint, for example, by overlapping with more than one accepted link cluster. Unlike Melamed's competitive linking algorithm, the clue-based algorithm is not restricted to one-to-one alignments. If a clue indicates association between MWUs (as in hand luggage – handbagage) the clue's score will be added to each cell of the MWU (i.e., hand – handbagage and luggage – handbagage in Table 2.2). Then, in the alignment step, alignments of MWUs are created by adding new links to existing link clusters in the matrix. The only constraint is that an MWU must be a contiguous sequence of words. The search for links stops when no more links are found or when the values in the remaining cells reach a certain threshold.

            sedan   öppnas   handbagaget
then          0       0        0
hand          0       0        0.30
luggage       0       0        0.30
is            0       0        0
opened        0       0        0

Table 2.2: A word alignment matrix (Tiedemann, 2003)
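The combination rule in (2.9) is a probabilistic union of clue values; generalised to any number of clues it can be written as the following small helper:

def combine_clues(clue_values):
    """Combine clue values in [0, 1] with c1 + c2 - c1*c2, applied
    left to right over the whole list."""
    combined = 0.0
    for c in clue_values:
        combined = combined + c - combined * c
    return combined

# combine_clues([0.3, 0.2]) == 0.44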

Clues and association scores can also be learnt from word aligned data. In experiments with clue-based alignment, Tiedemann (2003) used the alignments found by combining two simple clues, LCSR and Dice score, to estimate new clues for the association between parts-of-speech, phrase types and word positions in source and target. These “bootstrapped clues” improved performance in spite of the fact that they were derived from a less than perfect word alignment. Tiedemann also used bilingual lexical translation probabilities from Giza++ as a knowledge source among others in the clue-based system. To choose a good combination of all available clues and find good weights for each clue, Tiedemann (2005) used a genetic algorithm to optimize Swedish–English word alignment.

2.4 Statistical vs. heuristic word alignment

The previous sections described the statistical and heuristic approaches to word alignment. In this section we will look at their respective advantages and disadvantages.

Statistical methods, as defined by IBM Model 4 and implemented in the Giza++ system, are currently state-of-the-art in word alignment and outperform most heuristic systems. This is evident from looking at recent word alignment competitions called Shared Tasks (Mihalcea and Pedersen, 2003; Martin et al., 2005), where the majority of the competing systems were based on IBM models improved in various ways.


An advantage of statistical word alignment is that it is language independent. No language dependent resources like part-of-speech taggers, parsers and dictionaries are required to create a word alignment provided that a sufficient amount of parallel training data is available. As a result, statistical alignment can be applied instantly to texts from any language pair. Language independence can be desirable in some cases. However, one can also argue that for languages with plenty of available linguistic resources it would be a waste not to use these resources to improve alignment quality, especially since statistical word alignment has several well-known weaknesses where linguistically informed methods might be better at finding the correct alignment.

Heuristic methods, on the other hand, are well suited for linguistic augmentation because they typically rely on a number of information sources already and adding another one fits well into the framework. Adding syntactic and gold-standard information to statistical word alignment systems is a harder problem. There is no obvious way to incorporate linguistic information into the standard statistical framework although some attempts are described in Section 2.5. Other weaknesses of statistical word alignment include:

Hapax legomena and data sparseness. For a word type that only occurs once in the text, called a singleton or hapax legomenon, there will not be enough data to reliably estimate its alignment parameters. Data sparseness is a problem for all kinds of words and phrases with only few occurrences in the data. Rare words in the source language also have a tendency to act as “garbage collectors” during alignment, aligning to too many target words (Moore, 2004; Liang et al., 2006).

The risk of running into sparse data problems increases with vocabulary size. Larger source and target vocabularies mean an increase in the number of parameters to estimate in the statistical alignment model. Consequently, more data is required in order to produce good alignments in texts with large vocabularies. Talbot (2005) compared word alignment on three different text domains and showed empirically that alignment of texts with a high vocabulary growth rate improved much more slowly with additional amounts of data than texts with a lower vocabulary growth rate.

Long sentences. Sentence length is another factor that influences statistical word alignment quality. High average sentence lengths increase the number of co-occurring source and target words and consequently increase the number of translation probabilities t(f|e) that must be estimated. For practical reasons, sentences longer than 40 words are usually removed from corpus data before word alignment with IBM Model 4 to speed up alignment and avoid running out of memory. In addition, longer sentences provide more alignment alternatives which will also increase the chance of producing alignment errors.


Many-to-many alignments. Statistical word alignment is based on word-to-word correspondences. Words in multi-word units, phrases and idiomatic expressions correspond many-to-many. Many-to-many alignments are necessary for all languages, but the lack of them is a more serious problem for very different languages like English and Chinese. The fertility parameter introduced in IBM Model 3 improves the modelling of target words that are aligned to more than one source word, thereby creating one-to-many alignments. To align words many-to-many, symmetrization heuristics must be applied to the resulting Viterbi alignments.

Null links. Another weakness of statistical word alignment is the identification of null links. This issue is largely ignored since most of the accuracy scores reported in the literature are based on evaluations ignoring null links in submitted and reference alignments (i.e., no-null evaluation (Mihalcea and Pedersen, 2003)). Generally, word alignment accuracy decreases if null alignments are included in the evaluation. Although words can be null linked in the statistical paradigm, when symmetrizing two directed statistical alignments using standard heuristics, any word that remains unaligned will also be treated as null linked. The use of null links as a default alignment increases the risk of committing errors.

2.5 Improving statistical word alignment

As previously stated, pure statistical word alignment as formulated in Brown et al. (1990, 1993) only uses word strings in parallel text to estimate word alignments. Recently, researchers have started to improve statistical word alignment by incorporating external knowledge and linguistic resources in the alignment process. In the rest of this section I will describe approaches to improving state-of-the-art statistical word alignment with linguistic information to address some of the weaknesses listed in the previous section.

2.5.1 Pre-processing

A simple but effective way to incorporate linguistic knowledge in statistical word alignment is by pre-processing the parallel texts before alignment using information from POS-taggers, morphological analyzers, text chunkers and other tools for linguistic analysis. The goal of pre-processing parallel text is to remove unnecessary variation and make the two texts more similar concerning the number of word types, morphological variation, tokenization and word order. Pre-processing is a language dependent process, because the processing required to harmonise two languages is different for each language pair.

Lemmatization and stemming. Data sparseness can be alleviated by replacing words with their lemma or stem in order to reduce the number of unique word types (Goldwater and McClosky, 2005; de Gispert et al., 2006). This process is meant to remove unnecessary morphological variation from the texts. For example, Swedish adjectives have different inflections depending on the noun they modify (röd, rött, röda). The corresponding word in English would be red. In this case, alignment probabilities would be estimated for each inflected form separately. By lemmatizing the text before alignment, all instances of röd can be used to estimate the alignment probability to red.

By replacing words with their stem or lemma, there is also a risk of losing valuable morphological information. Goldwater and McClosky (2005) tested two ways of adding important morphological information to the lemmas when pre-processing Czech, a highly inflected language, before word alignment to English. In their first approach, important morphological features were added to the end of the lemmas (e.g., byt+PERS 3). In their second approach, important morphological features were added as separate pseudo-words in the Czech text (e.g., byt PERS 3). This second method made the Czech text more similar to English, at least in cases where a Czech word inflection was expressed as an independent function word in English. Both approaches improved word alignments; however, these alignments were not evaluated directly on alignment quality, but on their effect on statistical machine translation.

Segmentation. Tiedemann (2003) pointed out the importance of segmentation during alignment, suggesting that segmentation and alignment should be done in parallel. By modifying the segmentation of words, it is possible to avoid many-to-many correspondences and thus simplify alignment. One way to change the segmentation of Swedish to be more similar to English is to split compounds. Noun compounds in languages like German and Swedish are productive and can be formed from almost any two nouns. Many compounds such as plommonträd (plum tree) may be rare in corpora, and chances are that plommon - plum and träd - tree occur more frequently. Thus, compound splitting can be a way to reduce the number of rare words in a text and at the same time increase the number of one-to-one correspondences (Niessen and Ney, 2000).

The opposite approach of adding words together rather than splitting them was taken by Ma et al. (2007) in alignment of English and Chinese texts. Ma et al. improved word alignment accuracy by creating new word tokens for commonly co-occurring multi-word units in both languages, a process they termed word packing. Standard statistical alignment was then applied to the modified parallel data.

Word order changes. Word order changes have also been shown to improve word alignment accuracy (de Gispert et al., 2006; Popović and Ney, 2006). In these experiments, word reordering based on POS tags was performed before alignment to make word order more similar in the source and target texts. Different language pairs need different kinds of reorderings to become more similar. For example, Spanish nouns and adjectives were swapped before alignment from Spanish to English: bruja/N verde/A – green/A witch/N. The quality of German–English alignment improved after changing the position of English verbs to match German verbs.

Since many of the problems of statistical alignment are due to data sparseness, adding more parallel data will generally result in better alignments. For the same reason, larger parallel training corpora will also reduce the positive effects of pre-processing (Niessen and Ney, 2004).

2.5.2 Adding parallel data

As noted in the previous section, adding more parallel data generally has a positive effect on alignment. However, parallel data is a limited resource and we are often forced to make the best use of the data we have. One way to fully exploit parallel data is to extract correspondences from it before the actual word alignment. Kondrak et al. (2003) extracted a list of automatically induced cognate pairs from their data. The cognate pairs were added to the end of the training data and Giza++ alignment was then performed on the extended data set. By adding the cognate list multiple times to the data, they amplified its positive influence on alignment.

Ittycheriah and Roukos (2005) used the same method to simply add gold standard word alignments to the end of their training data. Although this improved a baseline IBM Model 4 alignment, Fraser and Marcu (2006b) report that they had to add word alignments 10,000 times to their corpus to keep the word correspondence information from being ”washed out” by the rest of the corpus.

Callison-Burch et al. (2004) added word aligned sentences to the sentence aligned training data, thus estimating new word alignments from both unlabeled (sentence aligned) and labeled (word aligned) data. First, word aligned data was automatically produced by running Giza++ using standard settings on sentence aligned corpora. In a second pass, Giza++ was run on both sentence-aligned and word-aligned data using a modified EM algorithm that calculated the probability of every permissible word alignment differently for the sentence and word aligned data. The word aligned data only had one permissible alignment for each aligned word pair which simplified the estimation considerably. Weights were also added to control the relative contributions of sentence aligned and word aligned data.

2.5.3 Constraining EM

Another way to use external knowledge sources to improve statistical word alignment is to let them guide the alignment process in the direction of linguistically plausible alignments, for example, by constraining parameter estimation during the Expectation-Maximization (EM) process. Och and Ney (2003) used a bilingual dictionary to weight the translation probability p(f|e) higher for word pairs that were in the dictionary.

Talbot (2005) also used external knowledge sources to constrain parameter estimation during the EM process. External knowledge sources were used to propose accurate word pairings. These word pairs were then used as constraints to help decide which word pairings were permissible during parameter estimation. Constraints were formulated as either anchor constraints or cohesion constraints. Anchor constraints are exclusive constraints that force an alignment between two words. Alignment between words in an anchor point was forced by setting alignment probabilities to zero at that position for all other source tokens in the E-step. Similarly to Callison-Burch et al. (2004), anchor points were given a higher weight compared to other alignments when reestimating the model parameters. The other type of constraints, cohesion constraints, were used to decouple regions of a sentence by preventing alignments between source and target words in those regions. These constraints were derived from anchor points and syntactic chunks in the text. Given an anchor point between words in a target and source chunk, remaining words in these chunks were prevented from aligning with any other chunk in the sentence.

As an alternative to merging two independently trained alignment models to produce a symmetric alignment, Liang et al. (2006) performed joint training of two models, letting the models constrain each other. By joint training of two directed HMM models, Liang et al. outperformed a standard symmetrized IBM Model 4, despite the fact that HMM models are very simple compared to IBM Model 4.

2.5.4 Discriminative alignment

Recently, several research groups have improved word alignment quality by replacing or enhancing the generative paradigm of the IBM models with discriminative models by posing word alignment as a discriminative learning problem (e.g., Ittycheriah and Roukos, 2005; Moore, 2005; Taskar et al., 2005; Cherry and Lin, 2006; Fraser and Marcu, 2006b). In discriminative learning, alignments are not treated as a hidden variable; instead, model parameters are estimated directly from annotated data.

For example, in the discriminative framework of Moore (2005) the best alignment, a, is the one that maximizes a weighted linear combination of feature values, f (2.10). Moore (2005) used features resembling IBM model parameters such as a word association feature indicating the likelihood of a link between the two words s and t and a reordering feature indicating how much reordering is required in the alignment.

â = argmax_a Σ_i λ_i f_i(s, a, t)    (2.10)


The feature weights λ were optimized on gold-standard word alignments. Moore (2005) optimized the feature weights on 200 gold standard sentences using perceptron learning. Perceptron learning is just one machine learning technique used for this purpose. Other successful methods include maximum entropy (Ittycheriah and Roukos, 2005), minimum error rate training (Fraser and Marcu, 2006b) and large-margin training (Taskar et al., 2005).

The IBM and HMM models described in Section 2.2 are unsupervised models which find word alignments based on unannotated parallel text. Discriminative methods, however, can also use word aligned data to estimate feature weights. This data can be hand aligned, produced using unsupervised methods (IBM models) or both. Ittycheriah and Roukos (2005) used 5000 sentences of gold standard data to train their alignment parameters. Other discriminative methods use unannotated data as well as annotated data. Such semi-supervised algorithms only require about 100 sentences of gold standard data to outperform the quality of symmetrized IBM Model 4 alignments.

The greatest advantage of discriminative methods is the ease with which new features of the words and their context can be added to the alignment process. Discriminative methods share this advantage with heuristic word alignment methods (e.g., Tiedemann, 2005). The choice of features is essential for high alignment quality and while some authors like Fraser and Marcu (2006b) and Moore (2005) strive for language independence using features based on fertilities, positions and association scores, others use carefully engineered linguistic features for a specific language pair. Ittycheriah and Roukos (2005), for example, used Arabic morphological features and English Wordnet synonym sets (Miller, 1995) for Arabic–English alignment.

2.6 Applications

Word alignments have many uses in natural language processing. One of the most popular uses of word alignments is in statistical machine translation, but other data-driven approaches to machine translation use word alignments as well, to extract bilingual dictionaries and learn translation rules (Brown et al., 2003; Sågvall Hein et al., 2003).

Furthermore, word alignments can be used for lexicography (Smadja et al., 1996) and term extraction (Merkel and Foo, 2007).

Word alignments can also be a useful source of information in word sense disambiguation, which is the task of determining the sense of an ambiguous word in context. Because of inherent differences between languages, knowing the translation of a word in another language can often help disambiguate its meaning (e.g., Ng et al., 2003).

A fourth application of word alignments is bilingual annotation projection. With the help of word alignments, linguistic analysis of texts in one language can be projected to texts in another language. In this way, neither manual annotation nor an automatic system for annotation are required for the second language. The applications of bilingual projection include projection of coreference resolution (Postolache et al., 2006), syntactic parsing (Hwa et al., 2005) and POS-tagging (Yarowsky and Ngai, 2001).

Different applications make different requirements on the alignments. For instance, lexicon and term extraction require correct alignment of content words like nouns or verbs and correct identification of multi-word units. Statistical MT, on the other hand, requires full-text alignment.

2.6.1 Statistical machine translation

Machine translation – translating sentences from a source language to a target language – is a large and very active research topic. The vast majority of research on word alignment conducted today focuses on producing word alignments for statistical machine translation. In this section I will describe statistical machine translation through a state-of-the-art system, Moses (Koehn et al., 2007), available at http://www.statmt.org/moses.

Current state-of-the-art machine translation systems are based on phrase-based statistical MT (PBSMT) (Och and Ney, 2004; Koehn and Knight, 2003). Instead of translating sentences word by word, a source sentence s is divided into a sequence of I phrases. Each source phrase s_i is translated into a target phrase t_i. Phrase translation takes the words' context into account, which results in more accurate and coherent translations compared to word-based systems. However, phrases in this translation framework are not actually syntactic phrases but rather sequences of consecutive words.

Returning to the formula for statistical machine translation from Section 2.2, we said that statistical MT aims to find the translation that maximizes the probability of the source given the translation and the probability of the translation occurring in a monolingual text (2.11). The former is our translation model p(s|t) and the latter our language model p(t).

t̂ = argmax_t p(t) p(s|t)    (2.11)

In Moses and most other phrase-based translation systems, the translation problem is posed in a more general and flexible log-linear framework. Each component that affects translation probability, such as a translation model or a language model, defines a feature function h_i, and these are combined in a log-linear model (2.12).

This flexible framework makes it easier to integrate more features into the translation process. In addition, each feature function has an associated weight λi. By changing these weights we can vary the relative importance of, for instance, the language model.

p(t|s) = (1/Z) exp( Σ_{i=1}^{n} λ_i h_i(t, s) )    (2.12)
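As a toy illustration of (2.12), the function below turns weighted feature sums into normalised probabilities over a set of candidate translations. The feature functions and weights are placeholders, not the actual Moses feature set.

import math

def log_linear_probs(candidates, features, weights):
    """candidates: list of (t, s) pairs; features: list of functions
    h_i(t, s); weights: list of lambda_i."""
    scores = [math.exp(sum(w * h(t, s) for w, h in zip(weights, features)))
              for t, s in candidates]
    z = sum(scores)                      # normalisation constant Z
    return [score / z for score in scores]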


Training

The translation model in a baseline PBSMT system consists of source and target phrase pairs and their probabilities derived from a parallel corpus. In the training phase, phrase translations are extracted from symmetrized Giza++ word alignments and stored in a large table along with their probabilities. All phrases up to 7 words that are consistent with the word alignment are extracted.
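A condensed sketch of this consistency-based phrase extraction is given below; real implementations are more efficient and also extend phrases over unaligned boundary words, which is omitted here.

def extract_phrases(src, tgt, alignment, max_len=7):
    """src and tgt are token lists, alignment is a set of
    (source_index, target_index) links; returns phrase pairs."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            tgt_positions = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_positions:
                continue
            j1, j2 = min(tgt_positions), max(tgt_positions)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no link may connect the target span to a
            # source word outside the source span.
            if any(j1 <= j <= j2 and not i1 <= i <= i2
                   for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs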

The following probabilities are then estimated from the bilingual phrases: (1) source-to-target conditional phrase translation probabilities p(t̄_i | s̄_i), (2) target-to-source phrase translation probabilities p(s̄_i | t̄_i), (3) source-to-target lexical translation probabilities (the probability that each word in the source phrase is translated with the words in the target phrase), (4) target-to-source lexical translation probabilities, and (5) probabilities for a reordering model. Consequently, word alignment quality will affect which phrase pairs are extracted, as well as the probabilities associated with each phrase pair.

The language model is a 5-gram model containing the conditional probabilities for a target word t at position i given the previous 4 target words, p(t_i | t_{i−4} t_{i−3} t_{i−2} t_{i−1}). With a language model it is possible to say that after the words I play guitar in a the next word is a lot likelier to be translated as band than ribbon. The language model is estimated from monolingual texts in the target language.

Decoding

In a baseline setting of Moses, the probability of a phrase translation is a log-linear combination of phrase probabilities in both directions, lexical translation probabilities in both directions, a word brevity penalty, a phrase brevity penalty, a reordering model and a language model.

The Moses decoder searches for the most probable translation according to the available phrases and features, starting with an empty translation hypothesis and adding probable phrase translations of the source sentence from left to right. The decoder implements a beam search algorithm for searching among alternative translation hypotheses. The size of the beam determines how many translation hypotheses are kept in memory and expanded upon. Only the most probable hypotheses are kept; the others fall out of the beam.

Tuning

The weights of the parameters in the log-linear model are trained in a tuning phase, where the weights are optimized according to the translation quality on a held-out development set using Minimum Error Rate Training (MERT) (Och, 2003). In this process, the development set is translated using various default parameter weights, the result is evaluated and the weights producing the best translation are chosen. These weights are then varied some more, and the process is iterated until translation quality stops improving.

For a description of MT evaluation see Section 2.7.3.

2.7 Evaluation

This section presents current word alignment evaluation techniques and quality metrics. Evaluation of a word alignment system can be performed using intrinsic or extrinsic criteria (Galliers and Sparck-Jones, 1993). In intrinsic evaluation, the accuracy of word alignments is measured, for example by comparing them to a manually prepared reference alignment, a gold-standard. Extrinsic evaluation, on the other hand, evaluates the usefulness of the word alignments in an application. For example, an extrinsic criterion of word alignment quality is the effect of using the alignments in a machine translation application. The usefulness of the word alignments can be measured by evaluating translation output quality with standard MT evaluation measures.

2.7.1 Formats

Before we can compare word alignments, we must represent the parallel texts and word alignments in a format. The choice of storage format is important when evaluating alignments against a gold-standard because the format will put restrictions on the alignment strategy. Below, we describe three formats, NAACL, symmetrized Giza and Link, to illustrate the differences.

NAACL

This format was used for word alignment evaluation in the shared task at the Workshop on Building and Using Parallel Texts (Mihalcea and Pedersen, 2003). A NAACL alignment project contains a source text file and a target text file where each sentence is tagged with a sentence number: <s snum=0008> ... </s>. In the alignment file, each line contains a word link between a source word and a target word. The link is represented by sentence number, source word position, target word position and an optional confidence label Sure (S) or Possible (P). Word positions start at 1 and position 0 is used for null links.

008 1 2 S
008 1 3 P
...
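A minimal reader for this link format might look as follows; it is an illustrative sketch in which the optional confidence label defaults to S when absent.

def read_naacl(path):
    """Return (sentence, source_pos, target_pos, label) tuples, with
    positions starting at 1 and 0 denoting a null link."""
    links = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            sent, src, tgt = int(parts[0]), int(parts[1]), int(parts[2])
            label = parts[3] if len(parts) > 3 else "S"
            links.append((sent, src, tgt, label))
    return links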


Symmetrized Giza

The symmetrized Giza format is used to store symmetric alignments produced from merging two uni-directional Giza alignments (Och and Ney, 2003). The phrase-based MT systems Pharaoh and Moses (Koehn et al., 2003, 2007) extract phrase alignments from word alignments stored in this format. This simple format contains three files. The source and target files contain one sentence per line, and the alignment file contains the alignments between the source and target sentences on each line as pairs of source and target word positions, with word numbers starting at 0:

resumption of the session
återupptagande av sessionen
0-0 1-1 2-2 3-2

Null links are not included in this format. Consequently, the format lacks a distinction between null-aligned words and unaligned words.

Link format

The Link format is used by the I*Link and I*Trix alignment tools (Ahrenberg et al., 2002; Foo and Merkel, 2006). Each link in this format connects two segments and is represented by five numbers (|s1|s2|t1|t2|n): the start and end positions of the source segment (s1 and s2), the start and end positions of the target segment (t1 and t2), and a number (n) representing the method of alignment, i.e., whether the link was produced manually or automatically. Deleted or added segments are aligned to word position -1 to create null-links.

1#(1|1|2|3|5)#(-1|-1|4|4|5)#!

Note that aligned multi-word units (MWUs) are represented differently in the Link format than in most other formats. Since the alignments are segment-based rather than word-based, MWUs are represented by a single link between segments instead of several many-to-many word links. Word positions can not belong to more than one segment and all words in a segment must be adjacent. As a result, the Link format prohibits discontinuous alignments. An example of a discontinuous phrase that can not be aligned is the correspondence låser upp – unlock in a phrase where the verb and particle are separated by another word: låser du upp – you unlock. This is not a problem for alignment formats that use many-to-many word links to align multi-word units. In word-based formats this discontinuous alignment would be expressed with two word links: 0-0 and 2-0.
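To relate the two representations, a segment link can be expanded into the equivalent many-to-many word links. The sketch below assumes inclusive segment positions and maps the -1 null position onto the NAACL-style position 0; both choices are assumptions for illustration:

```python
def segment_link_to_word_links(s1, s2, t1, t2):
    """Expand one Link-format segment link (s1|s2|t1|t2) into word links.
    A null segment (position -1) becomes a link to position 0."""
    source_positions = [0] if s1 == -1 else list(range(s1, s2 + 1))
    target_positions = [0] if t1 == -1 else list(range(t1, t2 + 1))
    return [(s, t) for s in source_positions for t in target_positions]

# (1|1|2|3|5): one source word aligned to a two-word target MWU.
segment_link_to_word_links(1, 1, 2, 3)    # [(1, 2), (1, 3)]
# (-1|-1|4|4|5): an added target word, i.e. a null link.
segment_link_to_word_links(-1, -1, 4, 4)  # [(0, 4)]
```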

Segment-based alignment is primarily used for term extraction applications where getting the whole term segment right is more important than measuring precision and recall of the partial word links. Measures of alignment precision and recall for term extraction are proposed in (Ahrenberg et al., 2000).


Alignment formats and evaluation

Differences in the annotation of untranslated segments and discontinuous translations will remain after word alignments in one format have been converted into another format, and will prevent fair comparisons between alignments. For example, the I*Link format can not store discontinuous alignments. Therefore, evaluating alignments produced by, e.g., Giza++ against a gold-standard alignment originally produced and stored with I*Link would punish the (perhaps correct) discontinuous alignments made by Giza++.

Another difference is the annotation of untranslated segments. In most systems, words that have been added to or omitted from the translation are represented by a link to a null token. Formats that do not have an explicit representation of null links can not make the distinction between unaligned segments and null-aligned segments. In the shared task of the NAACL Workshop on Building and Using Parallel Texts (Mihalcea and Pedersen, 2003), evaluation was carried out in two modes, null-align and no-null-align. In null-align mode, each word in the submitted alignments must belong to a link, otherwise it is assigned a probable link to null. In no-null-align mode null alignments are removed from both submitted and reference alignments. Error rates increase in the null-align mode because incorrect null-alignments are penalised twice: first, the null-alignment is penalised for being wrong, and second, the link to the correct word will be missing.

2.7.2 Intrinsic measures

The most resource-efficient way to evaluate the accuracy of computed word alignments is to compare them to a manually prepared gold-standard word alignment that can be reused in future evaluations. The standard measures for evaluating computed word alignments against a gold-standard reference alignment are precision, recall, F-measure, and Alignment Error Rate (AER) (Mihalcea and Pedersen, 2003). Precision measures the proportion of correct links in the computed alignments, where correct links are the intersection between the set of computed alignments (A) and the set of gold-standard alignments (G) (2.13). Recall measures the proportion of correctly computed links in the set of gold-standard links (2.14). The precision and recall metrics complement each other by measuring different aspects of alignment quality. One of the measures can be very high at the cost of the other; for example, high precision can be achieved at the cost of recall by aligning just a few ”easy” words in each sentence. What we need is the F-measure (2.15), a measure that balances precision and recall and gives a single score that combines both aspects of alignment accuracy.

Precision(A, G) = |G ∩ A| / |A|                     (2.13)

Recall(A, G) = |G ∩ A| / |G|                        (2.14)

F-measure(P, R) = 2PR / (P + R)                     (2.15)
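With links represented as sets of (source position, target position) pairs, the three measures can be computed directly from the definitions above (a small sketch; the example sets are invented for illustration):

```python
def precision_recall_f(computed, gold):
    """Precision, recall and F-measure of the computed links A against the
    gold-standard links G, following (2.13)-(2.15)."""
    correct = len(computed & gold)
    precision = correct / len(computed) if computed else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Two of the three computed links are in the gold standard.
A = {(1, 1), (2, 2), (3, 4)}
G = {(1, 1), (2, 2), (3, 3), (4, 4)}
precision_recall_f(A, G)   # (0.67, 0.50, 0.57), rounded
```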

Word alignment can be difficult even for human annotators and we must therefore be careful when we construct the gold-standard alignments. Chapter 4 contains a more detailed discussion of the problem of constructing a fair gold-standard alignment. Och and Ney (2003) introduced sure and possible links in the reference alignment to make up for the fact that human annotators disagree on gold-standard word alignments. Every word link in their reference alignment received a confidence label, S (sure) if all annotators agreed on the link and P (possible) otherwise.

By definition, sure links are also possible, so the set of sure links is a subset of the set of possible links. Based on these two sets of gold-standard alignments Och and Ney (2003) defined two modified measures of precision (2.16) and recall (2.17), where recall-errors only occur with sure links and precision-errors only if the computed link is not even a possible link in the gold-standard. The modified precision and recall measures are balanced using Alignment Error Rate (2.18), a measure that Och and Ney derived from the F-measure.

Precision(A, P) = |P ∩ A| / |A|                      (2.16)

Recall(A, S) = |S ∩ A| / |S|                         (2.17)

AER(A, P, S) = 1 − (|S ∩ A| + |P ∩ A|) / (|S| + |A|)   (2.18)
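The corresponding computation with sure (S) and possible (P) link sets, again over sets of position pairs, is sketched below, following the definitions above:

```python
def aer(computed, sure, possible):
    """Precision against P, recall against S, and Alignment Error Rate,
    following (2.16)-(2.18). S is assumed to be a subset of P."""
    precision = len(computed & possible) / len(computed) if computed else 0.0
    recall = len(computed & sure) / len(sure) if sure else 0.0
    denominator = len(sure) + len(computed)
    error_rate = (1 - (len(computed & sure) + len(computed & possible)) / denominator
                  if denominator else 0.0)
    return precision, recall, error_rate
```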

2.7.3 Extrinsic measures

Word alignment quality can also be measured by the impact the word alignments have on other natural language processing applications. Researchers in statistical machine translation try to improve phrases and probabilities in the translation tables by improving word alignment. The impact of a new word alignment method on machine translated output can be measured by evaluating the translation with standard automatic MT measures like Bleu (Papineni et al., 2001).

Translations are often judged on two complementary aspects of translation quality: adequacy – whether the translation conveys the same information as the original text – and fluency – whether the translation is a well-formed sentence in the target language. The Bleu metric combines the aspects of fluency and adequacy by comparing the translation with one or more reference translations that get to represent the ”ideal” translation. Translation quality is measured by the overlap between different size n-grams in the candidate translation and the reference. The overlap is measured using a measure of n-gram precision (2.19) which is slightly modified so that a candidate n-gram can at most be counted as many times as the n-gram occurs in the reference translation (see Table 2.3).

Precision = (candidate n-grams present in reference) / (all candidate n-grams)    (2.19)

Candidate:                    the the the the the the the
Reference:                    the cat is on the mat
Unigram precision:            7/7
Modified unigram precision:   2/7

Table 2.3: Modified n-gram precision

While unigram precision has been shown to account for adequacy of translation (the right words are used), precision of longer n-grams accounts for fluency. The standard Bleu measure combines modified n-gram precision for different lengths of n up to 4, which has been shown to correlate well with human judgements of translation quality.

The Bleu metric is based on precision, but what about recall? A partial but correct translation should not get the same score as a longer, complete and correct translation. The solution to the recall problem is to penalize candidate sentences shorter than the reference translations with a brevity penalty (BP), defined as (2.20)

BP = 1             if c > r
BP = e^(1−r/c)     if c ≤ r                          (2.20)

where c is the number of words in the candidate sentence and r the number of words in the reference translation. The Bleu score combines the brevity penalty with a weighted geometric mean of the modified n-gram precisions (2.21), which results in a score between 0 and 1.

Bleu = BP · exp( Σ_{n=1}^{N} w_n log p_n )           (2.21)

The Bleu metric, although very useful, is associated with many weaknesses. Some of these weaknesses are pointed out by Banerjee and Lavie (2005) and Callison-Burch et al. (2006). Alternative measures of translation quality which address one or more of the shortcomings of Bleu have been proposed (e.g., NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005) and Precision/Recall (Melamed et al., 2003)), but none of them has yet completely replaced the Bleu metric as the standard for reporting machine translation quality.
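To make formulas (2.19)–(2.21) concrete, a sentence-level sketch with a single reference is given below (real Bleu is computed over a whole corpus; the functions are illustrative only, with uniform weights w_n = 1/N):

```python
from collections import Counter
import math

def modified_precision(candidate, reference, n):
    """Modified n-gram precision as in (2.19) and Table 2.3: each candidate
    n-gram is counted at most as many times as it occurs in the reference."""
    cand_ngrams = [tuple(candidate[i:i + n])
                   for i in range(len(candidate) - n + 1)]
    ref_counts = Counter(tuple(reference[i:i + n])
                         for i in range(len(reference) - n + 1))
    if not cand_ngrams:
        return 0.0
    clipped = sum(min(count, ref_counts[ngram])
                  for ngram, count in Counter(cand_ngrams).items())
    return clipped / len(cand_ngrams)

def sentence_bleu(candidate, reference, max_n=4):
    """Brevity penalty (2.20) times the geometric mean of modified n-gram
    precisions (2.21)."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0            # the log of a zero precision is undefined
    return brevity_penalty * math.exp(
        sum(math.log(p) for p in precisions) / max_n)

# The example from Table 2.3: modified unigram precision is 2/7.
modified_precision("the the the the the the the".split(),
                   "the cat is on the mat".split(), 1)
```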

Another way to evaluate word alignments in the context of an application is by producing a bilingual dictionary from the alignments and evaluating the entries against an existing dictionary or with manual evaluation (Smadja et al., 1996; Schrader, 2006).


2.7.4 Intrinsic vs. extrinsic measures

Recently, the most common measure used to report alignment quality, AER, has been criticized by researchers in statistical MT for not correlating well with the quality of MT output (as measured by Bleu). Intuitively, better alignments should lead to improved translation tables and better translations, but several studies show that lower AER does not necessarily lead to better MT output (Ayan and Dorr, 2006; Fraser and Marcu, 2006a). Fraser and Marcu (2006a) observe that although AER was modeled after the well-known F-measure, unlike the original F-measure it does not appropriately penalise unbalanced precision and recall, nor does it allow weighting the importance of precision against recall. The measure suggested by Fraser and Marcu (2006a) is a straightforward F-measure with the constant α empirically weighted in favor of precision.
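Written out, such an α-weighted F-measure is the weighted harmonic mean of precision and recall; in this form, values of α above 0.5 weight precision more heavily than recall:

F(A, G, α) = 1 / ( α / Precision(A, G) + (1 − α) / Recall(A, G) )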

Ayan and Dorr (2006) take the adaptation to machine translation a step further and propose a measure of word alignment quality that is closely connected to the phrase-based statistical MT framework, which they also use to compare their alignment scores with translation quality. They call their measure consistent phrase error rate (CPER). Instead of evaluating word links in the alignment, CPER evaluates the quality of the consistent phrases that can be extracted from the word links in an alignment.

Lambert et al. (2005) point out that the proportion of S and P links in the gold-standard reference will have a great effect on AER. According to the definition of AER, including P links in the gold standard can only lower the error rate. Their experiments show that reference alignments with a large proportion of P links will favour high-precision computed alignments over alignments with high recall. The reference alignment used by Och and Ney, for example, contained 77% P links. Furthermore, a large proportion of P links in the gold standard will limit the discriminative power of AER, as it allows different computed alignments that may vary in quality to get the same AER score. Similarly, Fraser and Marcu (2006a) report a better correlation between word alignment and MT quality when using a gold-standard alignment set annotated with only sure links.

2.7.5 Conclusions

This section has discussed different aspects of evaluating word alignments. Based on this research, a couple of decisions were made concerning word alignment evaluation in the phrase-based word alignment project. In this project, the standard measures of precision, recall and AER were used to report word alignment quality compared to a reference alignment. A lot of effort was put into the construction of the reference alignment. Two annotators were given the task of word aligning the test data with sure and possible alignments and their results were combined. Guidelines for manual word alignment were written to help the annotators decide what alignment strategy to use and when to use S links and P links. The guidelines promote more S links than P links in the reference, as suggested by several researchers, to increase the discriminative power of AER. For more details on the construction of the reference alignment, see Chapter 4.


3 Phrase-based word alignment

This chapter presents the phrase-based approach to word alignment. Phrase-based word alignment is a method that uses phrases from manually created word alignments to align words in new sentence pairs. The phrase-based approach is explained in Section 3.1. In the experiments presented in Section 3.2, phrase-based word alignment was applied to two types of text. The goal of these initial studies was to see how well the method performs in terms of precision and recall as well as to test its applicability on different texts. The chapter concludes with the research questions guiding the more elaborate experiments with phrase-based word alignment described in Chapter 5.

3.1 Phrases for word alignment

Phrase-based word alignment is a way to re-use word alignments by extracting parallel segments, or phrases, of different lengths from manually word-aligned sentence pairs.¹ Each extracted parallel phrase contains consecutive source and target words and their corresponding alignments.

To align the words in a new sentence pair the sentence is matched against the database of aligned parallel segments. If there is a match between an aligned parallel segment and the new sentence pair, the word alignments within the parallel segment can be applied to the new sentence.

The links in the database of phrases are very reliable as we are re-using previous manual work. However, longer phrases are more likely to produce accurate alignments than shorter phrases since they contain more context.

¹ Note that the term phrase is used to mean a number of consecutive words that do not have to constitute a syntactic phrase. The same meaning of phrase is used in phrase-based statistical machine translation.


Source words        Target words      Links
in the union        i unionen         0-0 1-1 2-1
P the union         P unionen         0-0 1-1 2-1
P DET N             P N               0-0 1-1 2-1
in this N , i V     i det N V jag     0-0 1-1 2-2 5-3 4-4

Table 3.1: Parallel phrases

By preferring matches of long phrases, alignment precision increases at the expense of recall.

To increase the number of words and links that the phrases cover, phrases can be generalised in different ways, using more general categories such as base form, part-of-speech, dependency labels or other syntactic and morphological information. Such generalised phrases will match more sentence pairs and thus improve word alignment recall. In the experiments presented in this thesis, phrases were generalised using part-of-speech categories. A sample of parallel phrases is shown in Table 3.1. Figure 3.1 shows an example of a generalised phrase that matches a new sentence pair. The phrase in this example produces correct links but it only finds one of the links in the verb phrase am in agreement – instämmer. However, the complete many-to-one alignment can be found by other (possibly overlapping) parallel phrases.

[Figure 3.1 shows the generalised phrase "in this N , i V – i det N V jag" (links 0-0 1-1 2-2 5-3 4-4) matched against the source sentence "In this sense , I am in agreement with Mr Sakellariou ' s proposals ." and the target sentence "I det avseendet instämmer jag i Sakellarious förslag ."]

Figure 3.1: Matching a parallel phrase to a new sentence pair.
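A minimal sketch of this matching step is given below. It assumes that both the stored phrase and the new sentence pair are token lists in which the generalised words have already been replaced by their part-of-speech tags, and it ignores the problem of choosing between multiple matches; the example data reproduce the phrase and sentence pair from Figure 3.1.

```python
def match_and_project(phrase_src, phrase_tgt, phrase_links,
                      sentence_src, sentence_tgt):
    """Find the parallel phrase in the new sentence pair and project its
    internal word links to absolute sentence positions. If the phrase
    matches at several positions, all combinations are returned; a real
    system would need some way of choosing between them."""
    def occurrences(phrase, sentence):
        return [i for i in range(len(sentence) - len(phrase) + 1)
                if sentence[i:i + len(phrase)] == phrase]

    projected = set()
    for s_offset in occurrences(phrase_src, sentence_src):
        for t_offset in occurrences(phrase_tgt, sentence_tgt):
            for s, t in phrase_links:
                projected.add((s_offset + s, t_offset + t))
    return projected

# The phrase from Figure 3.1, matching at the beginning of both sentences.
match_and_project(
    ["in", "this", "N", ",", "i", "V"],
    ["i", "det", "N", "V", "jag"],
    [(0, 0), (1, 1), (2, 2), (5, 3), (4, 4)],
    ["in", "this", "N", ",", "i", "V", "in", "N", "with", "N", "'", "s", "N", "."],
    ["i", "det", "N", "V", "jag", "i", "N", "N", "."])
# -> {(0, 0), (1, 1), (2, 2), (5, 3), (4, 4)}
```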

3.1.1 Motivation

In this thesis I investigate the use of phrases to create high precision word alignments. Such high quality word alignments could be used to simplify manual alignment of parallel treebanks and gold standards by automatically aligning words in a first pass, and manually aligning remaining words in a second pass.

The potential advantages of using phrases for word alignment were mentioned in the beginning of this chapter. First, longer text segments include more context and are therefore more likely to produce correct word alignments than shorter segments.
