
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2016

IBM Model 4 Alignment Comparison

An evaluation of how the size of the training data affects the interpretation accuracy and training time for two alignment models that translate natural language

TOR ARVIDSON

MARIA SIEBECKE


Degree Project in Computer Science, DD143X
Supervisor: Michael Minock

Examiner: Örjan Ekeberg

Stockholm 2016


Abstract

In modern society the amount of information processed by computers is increasing every day. Computer translation has the potential to speed up communication between humans as well as human-computer interaction.

For Statistical Machine Translation, word alignment is key. How large does a corpus need to be to align natural language sentences with a simple, unambiguous language? We investigate this by running a simple algorithm and comparing it to the results of an industry equivalent. The results show that the corpus needs to be larger for the simplified model when there are more words per sentence. The IBM Model 4, conversely, shows that more words per sentence decrease the corpus size necessary for good predictions. We thus conclude that the required corpus size depends on the number of terms per sentence for both models.


Sammanfattning (Swedish abstract, translated)

In our modern society, more information is processed every day. Computerized translation has the potential to speed up communication between people as well as human-computer interaction.

Word alignment is a major part of Statistical Machine Translation. How large must a corpus be to correctly translate, with high probability, sentences from a natural language into a simple, unambiguous language? We test this by comparing a simple algorithm with an algorithm used in industry. The results show that the more words there are in the sentence to be translated, the larger the corpus must be. With IBM Model 4 we see that results improve with more words per sentence, so the corpus size can be reduced. Our conclusion is that the corpus size depends on the number of arithmetic terms for both models.



IBM Model 4 Alignment Comparison

Tor Arvidsson Maria Siebecke

May 27, 2016

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Scope of Study
  1.3 Disposition of the Report

2 Background
  2.1 Semantics
  2.2 Natural Languages
  2.3 Corpus
  2.4 Context Free Grammar
  2.5 NLTK
  2.6 IBM Alignment Models
    2.6.1 IBM Model 1
    2.6.2 IBM Model 2
    2.6.3 IBM Model 3
    2.6.4 IBM Model 4

3 Method
  3.1 Context Free Grammar
  3.2 Generating Training Data
  3.3 Semantic Parse
  3.4 Algorithm for Word Alignment
    3.4.1 Simplified Model
    3.4.2 Implementation of IBM Model 4
  3.5 Testing Process

4 Results
  4.1 Expected Results
  4.2 Empirical Results
    4.2.1 Simplified Model Accuracy
    4.2.2 IBM Model 4 Accuracy
    4.2.3 Simplified Model Running Time
    4.2.4 IBM Model 4 Running Time

5 Discussion
  5.1 Analysis of the different models' results
  5.2 Difficulties and Improvements
    5.2.1 Model
    5.2.2 More Iterations for IBM Model 4
    5.2.3 Testing System
    5.2.4 Rule Extraction

6 Conclusion

7 Appendix
  7.1 Statistic Accuracy
  7.2 Statistic Running Times
  7.3 Corpus of fifty sentences with 5 words per sentence



1 Introduction

Translation between languages is important in international communication, in the appreciation of media and in the spreading of ideas. Information is streamed through the Internet at rates previously unheard of, and the need for translators has increased as a result. Never before has it been so easy to communicate with people from different parts of the world. Translation is a job that has historically been allocated to humans, as machine translation has faced certain problems, for example differing sentence structures and grammar between languages. To eliminate the language barriers that exist between people today, we need to make computers able to translate between multiple languages. The scientific field of Statistical Machine Translation is the key to this. Computers can speed up the process to the point where a conversation could be carried out between two people using a translating gadget, such as the ili [1].

It can also help human-machine interaction. Today we communicate with our computers via commands written in programming languages, since the computer can only read code. This limits what users can do to the programming languages they know. If we could instead teach the computer a natural language, all speakers of that language would have a better opportunity to make their computers do what they want, lowering the barrier of programming knowledge significantly. One of the first steps to getting a computer to statistically learn a new language is to align the words of the two languages. To do this we need a bilingual training set, also called a corpus, pairing the language we wish it to learn with the language the computer already knows [2].

Our goal is to compare the training-set size necessary to get an accurate word alignment, between a simplified model and an industry standard, when aligning the language of arithmetic expressions with the natural language of English. The simplified model employs a Context Free Grammar (CFG) that creates a parse tree of a mathematical expression, then builds word alignments with a straightforward algorithm based on a process of elimination and occurrence counting over a corpus. The industry equivalent we compare against is the IBM Model 4.


1.1 Problem Statement

How large do these training sets need to be? For a complex language such as English, which contains double meanings and ambiguous sentences, the amount of training data needed may be huge. Our aim is to test how much training data is required for a simple, unambiguous language, arithmetic expressions, by using a simple translating algorithm and then comparing the results to an industry equivalent, IBM Model 4.

1.2 Scope of Study

We have chosen to experiment on a mathematical language, arithmetic expressions, due to this study's limited amount of time. We will be using two different models, our own simplified model and IBM Model 4, to compare word alignment, since this is their chief difference. We chose these specific models because we wanted to compare a straightforward model with an industry standard. The corpora will vary in maximum/minimum sentence length and in size (number of lines). The tests will be iterated multiple times to ensure that representative results are attained.

1.3 Disposition of the Report

The rest of this report is structured as follows: Section 2 contains background information on the techniques and models used, as well as terminology used in the report. Section 3 contains the specifics of how we constructed the two models and how we test them. Section 4 displays our findings, and in Section 5 we discuss the results as well as possible improvements to the tests. Section 6 concludes the report, which ends with the bibliography and an appendix containing sample data.



2 Background

2.1 Semantics

Semantics defines the relationship between words, phrases, signs and symbols and the logic they stand for. There are different fields within semantics, for example linguistic, lexical, conceptual and computational semantics. In this report we focus on computational semantics, i.e., the meaning representation (hereafter called MR) behind a chosen word [3].

2.2 Natural Languages

By Natural Languages (NL) we mean languages that are spoken and written by the human population, for example English, Swedish and German.

2.3 Corpus

A “Text Corpus” or “Corpus” in linguistics is a large amount of structured text, a text database [4]. Depending on the task, it can be used to compare a sentence in NL with the official language, both to interpret the sentence and to find spelling errors. For this project, the corpus will contain training data used to build the link between Natural Language and Meaning Representation for interpreting arithmetic expressions. The corpus needs to be bilingual, i.e., contain the same sentences in two languages.

2.4 Context Free Grammar

Any language can be broken up into parts. English, for example, is built on adjectives, verbs, nouns and so forth, and these parts of speech function under certain rules: an adjective ties to a noun, etc. A Context Free Grammar (CFG) is a set of terminal symbols, non-terminal symbols, productions and a start symbol that together break down and classify a string into parts:

• Terminal symbol: a symbol which corresponds to a character or string.

• Start symbol: a non-terminal symbol; the CFG always starts from this symbol.

• Non-terminal symbols: symbols which act as placeholders for terminal symbols or productions.

• Productions: rules which convert non-terminal symbols into other non-terminal symbols and/or terminal symbols. [5]

The formalism of context-free grammars was developed in Noam Chomsky's 1956 report "Three models for the description of language". [6]


2.5 NLTK

The Natural Language Toolkit (NLTK) is a Python library module used for processing natural language.

The NLTK was originally created by Steven Bird and Edward Loper as a software infrastructure for teaching Natural Language Processing (NLP) at the University of Pennsylvania in 2001. Since then the NLTK has evolved, and the book "Natural Language Processing with Python" was written about it by Bird, Loper and Ewan Klein. The book was originally released in 2009, with a second edition released in 2016 by the same authors. [7]

The NLTK contains a plethora of tools for text processing, including catego- rization and word tagging, text classification, information extraction, sentence structure analysis, parsing, word alignment as well as containing a library of corpora. [5]

We will utilize the NLTK for multiple tasks. The first is to load a pre-written CFG from a file and then use an NLTK semantic parser to create a semantic parse tree. The second is to carry out word alignment from the productions in the CFG to the natural language provided in the corpus using the IBM Model 4, so that we can compare our simple word alignment algorithm with an industry equivalent.



2.6 IBM Alignment models

The IBM alignment models are a collection of notable models in the field of statistical machine translation, having underpinned the majority of statistical machine translation systems for almost twenty years [8]. They can be described as a sequence of increasingly complex models used to train a translation model and an alignment model, starting with lexical translation probabilities and moving on to reordering and word duplication [2].

The original work on statistical machine translation at IBM proposed five models [2]. The five models can be summarized as:

• Model 1 - lexical translation
• Model 2 - additional absolute alignment model
• Model 3 - extra fertility model
• Model 4 - added relative alignment model
• Model 5 - fixed deficiency problem

Each model breaks the translation process into smaller steps, and every IBM model includes all former IBM models while adding another level of complexity. In this report we focus on IBM Model 4. Therefore, to understand IBM Model 4, we also need to understand the models it is built upon: IBM Models 1, 2 and 3.

2.6.1 IBM Model 1

The IBM Model 1 only uses lexical translation [9]. Consider translating a foreign sentence f = (f_1, ..., f_{l_f}) of length l_f into an English sentence e = (e_1, ..., e_{l_e}) of length l_e. The translation is an alignment of each English word e_j to a foreign word f_i according to the alignment function a : j → i. The translation probability is

$$p(e, a \mid f) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)})$$

where the parameter ε is a normalization constant.

Normally the word order differs after translation, but IBM Model 1 is weak in that it treats all kinds of reordering as equally probable. IBM Model 1 does not rearrange, add or drop words from a sentence.

Another problem with this alignment model is that it does not consider so-called fertility: the notion that an input word produces a specific number of output words after translation. In most cases one input word is translated into a single word, but some words produce multiple words or get dropped (produce no words at all). In the latter cases the IBM Model 1 will fail to align correctly [10]. For example, using only IBM Model 1 the translation probabilities for these translations would be the same:

Figure 1: The first step, lexical translation. Picture borrowed from [10].
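The lexical translation probabilities t(e|f) of IBM Model 1 are estimated with expectation maximization (EM). As an illustration only (this is not the report's code), a minimal pure-Python EM for Model 1 can be sketched as:

```python
from collections import defaultdict

def ibm1_train(bitext, iterations=10):
    """Estimate IBM Model 1 lexical translation probabilities t(e|f) with EM.

    bitext: list of (foreign_tokens, english_tokens) sentence pairs.
    Returns a dict mapping (e, f) -> t(e|f).
    """
    e_vocab = {e for _, es in bitext for e in es}
    t = defaultdict(lambda: 1.0 / len(e_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(e, f)
        total = defaultdict(float)   # normalisation per foreign word f
        for fs, es in bitext:
            for e in es:
                z = sum(t[(e, f)] for f in fs)  # normalise over this sentence
                for f in fs:
                    c = t[(e, f)] / z           # E-step: expected alignment count
                    count[(e, f)] += c
                    total[f] += c
        for (e, f), c in count.items():         # M-step: renormalise
            t[(e, f)] = c / total[f]
    return t

# Classic toy corpus: "das" co-occurs with "the" twice, which lets EM
# separate "haus" -> "house" from "haus" -> "the".
bitext = [("das haus".split(), "the house".split()),
          ("das buch".split(), "the book".split()),
          ("ein buch".split(), "a book".split())]
t = ibm1_train(bitext)
```

After a few iterations the probability mass concentrates on the correct word pairs, even though every pairing starts out equally likely.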


2.6.2 IBM Model 2

The second model has an additional model for alignment that is not present in the former model [9]. While IBM Model 1 only uses lexical translation, IBM Model 2 adds an extra step, alignment. An example:

Figure 2: The lexical translation step followed by the alignment step, for "natürlich ist das haus klein" / "of course the house is small". Picture borrowed from [9].

To address the rearranging, IBM Model 2 uses an alignment probability distribution for translating a foreign word at position i to an English word at position j:

$$a(i \mid j, l_e, l_f)$$

Putting everything together:

$$p(e, a \mid f) = \epsilon \prod_{j=1}^{l_e} t(e_j \mid f_{a(j)}) \; a(a(j) \mid j, l_e, l_f)$$

IBM Model 2 still does not address the fertility problem, i.e., the number of words generated by a foreign word. Nor does it address the fact that words do not move independently of each other. Since words often move in groups, and word movement is conditioned on the previous word, there are several reasons to refine the model further.



2.6.3 IBM Model 3

The IBM Model 3 adds a model of fertility to the lexical translation and insertion steps of the former models. In the sequence of steps, fertility is applied before insertion and lexical translation [10]. Here is an example:

Figure 3: Picture borrowed from [9].

The fertility is modeled by the distribution function displayed below [11]:

$$P(S \mid E, A) = \prod_{i=1}^{I} \phi_i! \, n(\phi_i \mid e_i) \times \prod_{j=1}^{J} t(f_j \mid e_{a_j}) \times \prod_{j:\, a(j) \neq 0} d(j \mid a_j, I, J) \times \binom{J - \phi_0}{\phi_0} p_0^{J - 2\phi_0} p_1^{\phi_0} \quad (1)$$

where φ_i represents the fertility of e_i, each source word is assigned a fertility distribution n, and I and J refer to the lengths of the target and source sentences respectively. The IBM Model 3 still does not consider the problem that words usually move in groups; therefore the model still required improvements.


2.6.4 IBM Model 4

In IBM Model 4, each word is dependent on the previously aligned word and on the word classes of the surrounding words. This means that some words trigger reordering and create a condition for how the reordering should be made [9].

Also, some words tend to get reordered during translation more than others, for example:

• adjective-noun inversion when translating Polish to English

• adjectives often getting moved in front of the noun they modify

To understand how IBM Model 4 works we need to define cepts: foreign words with non-zero fertility form cepts. For examples, see the picture and table below.

[Figure: cepts formed when aligning "ich gehe ja nicht zum haus" with "I do not go to the house"; "ja" has zero fertility and forms no cept. Picture borrowed from [9].]

cept π_i               π_1    π_2    π_3     π_4      π_5
foreign position [i]   1      2      4       5        6
foreign word f[i]      ich    gehe   nicht   zum      haus
English words {e_j}    I      go     not     to,the   house
English positions {j}  1      4      3       5,6      7
center of cept ⊙_i     1      4      3       6        7

The word classes introduced in Model 4 address the reordering problem caused by the previously aligned word by conditioning the distortion probability distributions on these classes; the result is a lexicalized model. The distortion functions for IBM Model 4 are defined below [9].

For the initial word in a cept:

$$d_1(j - \odot_{[i-1]} \mid A(f_{[i-1]}), B(e_j))$$

For additional words:

$$d_{>1}(j - \pi_{i,k-1} \mid B(e_j))$$

where A(f) and B(e) map words to their word classes, d_1 and d_{>1} are the distortion probability distributions, and ⊙_{[i-1]} is the center of the previous cept. A cept is formed by aligning each input word f_i to at least one output word.
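The table above lets us read off a concrete distortion term. For the head of the cept of "nicht" (English "not" at position j = 3), the previous cept is that of "gehe", whose center is 4, so:

```latex
% Distortion of the head of cept \pi_3 ("nicht" -> "not"):
% English position j = 3, center of previous cept \odot_2 = 4.
d_1\bigl(3 - \odot_2 \mid A(\text{gehe}), B(\text{not})\bigr)
  = d_1\bigl(-1 \mid A(\text{gehe}), B(\text{not})\bigr)
```

A negative offset means the word moved to the left of the previous cept's center.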

Both Model 3 and Model 4 ignore whether an input position was predefined and whether probability mass was reserved for input positions outside the sentence boundaries [12]. Model 4 is nevertheless the model we chose to compare our simplified model against, for accuracy and time depending on the size of the corpus.



3 Method

Our algorithm utilizes a CFG, the NLTK library for parsing, a self-generated corpus and a simple statistical implementation for disambiguating word alignment.

3.1 Context Free Grammar

The CFG used covers the simple arithmetic operations of multiplication, division, addition and subtraction, as well as the numbers one through nine. The reason for choosing this particular CFG is that it contains a limited number of operations, which are not interchangeable. This decreases the ambiguity of any statement to the point where there is only one true meaning. The grammar is unambiguous but capable of creating sentences of unbounded length without compromising the grammar. This allows us to test whether sentence length affects the number of lines necessary for the algorithm to produce unambiguous answers.

Figure 4: The used CFG file
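The actual grammar file is the one shown in Figure 4. As a hedged sketch (the production names and word choices here are our illustration, not the authors' file), an arithmetic CFG of this kind can be written in NLTK's grammar notation; right recursion keeps it unambiguous:

```python
import nltk

# Hypothetical reconstruction of an arithmetic CFG; the real file is shown
# in Figure 4 and may differ. With S -> NUM OP S, "six minus five times two"
# parses only as the right-nested (6 - (5 * 2)).
grammar = nltk.CFG.fromstring("""
S -> NUM | NUM OP S
OP -> 'plus' | 'minus' | 'times' | 'over'
NUM -> 'one' | 'two' | 'three' | 'four' | 'five' | 'six' | 'seven' | 'eight' | 'nine'
""")
parser = nltk.ChartParser(grammar)
trees = list(parser.parse("six minus five times two".split()))
# exactly one parse tree, since the grammar is unambiguous
```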

3.2 Generating Training Data

To generate a corpus of substantial size, a self-made Python script is used. It creates random mathematical terms and operations, in this report jointly called words, and their natural language (NL) interpretations. The corpus contains one set of training data per line, in our case a mathematical expression, which we define as a sentence. To control the size of the corpus and the length of the sentences, the script takes parameters defining the corpus size and the minimum and maximum sentence length. The length of a sentence is defined by the number of arithmetic operations and terms in the mathematical expression.

A set has the format "NL expression : mathematical expression". To avoid ambiguity in the mathematical term (operators of equal precedence would cause multiple interpretations per expression) we define the order of operations explicitly by adding parentheses around each operation. This is reflected in the natural language by the operations applying in the order the words occur from left to right. In other words, "six minus five times two" will equal (6 - (5 * 2)) in the corpus but not ((6 - 5) * 2). An example corpus could look like this:

Figure 5: Example of corpus
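The generation script itself is not reproduced in the report; a minimal sketch of such a generator (word lists and formatting are illustrative assumptions, not the authors' code) could look like:

```python
import random

NUMS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
        "six": "6", "seven": "7", "eight": "8", "nine": "9"}
OPS = {"plus": "+", "minus": "-", "times": "*", "over": "/"}

def make_pair(n_ops):
    """One training pair 'NL expression : mathematical expression'.

    Parentheses are added around each operation so the expression is
    unambiguous, nesting to the right as described in the text:
    six minus five times two -> (6 - (5 * 2)).
    """
    words = [random.choice(list(NUMS))]
    for _ in range(n_ops):
        words += [random.choice(list(OPS)), random.choice(list(NUMS))]
    mr = NUMS[words[-1]]
    for i in range(len(words) - 2, 0, -2):      # fold right-to-left
        mr = "(%s %s %s)" % (NUMS[words[i - 1]], OPS[words[i]], mr)
    return " ".join(words) + " : " + mr

def make_corpus(lines, min_ops, max_ops):
    """A corpus of `lines` randomly sized sentences, one pair per line."""
    return [make_pair(random.randint(min_ops, max_ops)) for _ in range(lines)]
```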

3.3 Semantic Parse

The tokenizing and parsing is done with the help of the NLTK. The data sets are read from the corpus, with unnecessary white-space and other formatting characters filtered away. The NL part is then parsed with the NLTK ChartParser using the CFG defined above. The expression "six times one times six": (6 * (1 * 6)) is parsed into a tree that looks like this:

Figure 6: Parse Tree

After parsing we have a list of meaning representations (MR) that are linked to the arithmetic expressions. Our next step is to link the natural language words of the corpus to these tokens.

3.4 Algorithm for Word Alignment

3.4.1 Simplified Model

To connect the MR of the NL words with the tokens created by the parse tree, we created a dictionary. Each word begins by containing every token associated with the sentence of that word's first occurrence. As the algorithm progresses through the corpus, impossible rules are eliminated. For example:

The word "one" appears in both the sentence "one plus two plus three" and the sentence "one plus three plus four". Since the first occurrence of the word is in sentence one, "one" is initially associated with rules [1, 2, 10]. Since rule 2 does not recur in sentence two, it is removed from the NL word "one" when the algorithm reaches sentence two.

Figure 7: Dictionary

Our simple statistical addition was to count the number of times an NL word was associated with a specific rule. At the end of the run, if an NL word was still associated with multiple rules, the algorithm chose the most commonly occurring rule. This was done to reduce ambiguous results caused by a randomly generated corpus in which expressions using the same operators are repeated.


In Figure 8 below you can see the pseudocode of the simplified model. Note that only the alignment model is described in greater detail; the delimitation of the corpus text and the parsing using NLTK are not covered.

Figure 8: PseudoCode
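The pseudocode above can be sketched in Python as follows (a hedged reconstruction; the data representation and names are ours, not the authors'):

```python
from collections import Counter

def simple_align(corpus):
    """Hedged reconstruction of the simplified model, not the report's code.

    corpus: list of (nl_tokens, rules) pairs, where rules are the MR rule
    ids of that sentence. Each NL word starts with every rule of the
    sentence of its first occurrence; rules that do not recur alongside the
    word are eliminated, and co-occurrence counts break remaining ties.
    """
    candidates, counts = {}, {}
    for nl_tokens, rules in corpus:
        rule_set = set(rules)
        for w in set(nl_tokens):
            if w not in candidates:
                candidates[w] = set(rule_set)   # first occurrence: all rules
                counts[w] = Counter()
            else:
                candidates[w] &= rule_set       # eliminate impossible rules
            counts[w].update(r for r in rules if r in candidates[w])
    # pick the most frequently co-occurring surviving rule for each word
    return {w: max(cands, key=lambda r: counts[w][r])
            for w, cands in candidates.items() if cands}
```

On a toy corpus, "one" starts out associated with every rule of its first sentence and is narrowed down to its own rule once it appears in a sentence with a different operator.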

3.4.2 Implementation of IBM Model 4

The IBM Model 4 works as described in Section 2. We read in the corpus and break it down into its natural language and arithmetic expression components, then call the IBM Model 4 alignment method on the two parts. To call the alignment method we need to classify the target and source words; we chose to classify operations as 1 and numbers as 0. Worth noting is that the IBM Model 4 thereby requires more information than our simplified model.
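The NLTK call can be sketched as follows (a hedged example on a toy corpus; the report's actual corpus handling is not reproduced). NLTK's IBMModel4 takes the aligned corpus, an iteration count, and a word-class mapping for every word on each side; as in the report, operations are class 1 and numbers class 0:

```python
from nltk.translate import AlignedSent, IBMModel4

# Toy NL / MR sentence pairs (illustrative, not the generated corpus).
pairs = [("six plus two".split(), "6 + 2".split()),
         ("six minus two".split(), "6 - 2".split()),
         ("one plus six".split(), "1 + 6".split())]

# AlignedSent(words, mots): `words` is the target side, `mots` the source.
bitext = [AlignedSent(nl, mr) for nl, mr in pairs]

def word_class(tok):
    # operations -> class 1, numbers -> class 0, as in the report
    return 1 if tok in {"+", "-", "plus", "minus"} else 0

src_classes = {tok: word_class(tok) for _, mr in pairs for tok in mr}
trg_classes = {tok: word_class(tok) for nl, _ in pairs for tok in nl}

# Model 4 training internally runs Models 1-3 first.
ibm4 = IBMModel4(bitext, 5, src_classes, trg_classes)
prob = ibm4.translation_table["six"]["6"]  # t(six | 6)
```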

3.5 Testing process

The testing code defines some global settings: the length of the generated sentences, the maximal corpus size and the number of iterations per test case. The code then iterates over a list of test cases. For each test case it generates corpora with sizes between 10 and 50 in increments of 10. The main processing is done in run_case(). This function creates a corpus of the desired size for each iteration and saves the current time. For each corpus the function then runs the code for the simplified model or the IBM Model 4, records the minimum and maximum accuracy percentages and, after processing all corpora, calculates an average percentage. The results are then returned.
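As a hedged sketch of that loop (run_case is named in the report, but its body and helper names here are our reconstruction):

```python
import statistics
import time

def run_case(corpus_size, iterations, make_corpus, align):
    """Build a fresh corpus per iteration, run the aligner on it, and
    report (min, mean, max) accuracy plus the elapsed wall-clock time.

    make_corpus and align stand in for the report's corpus generator and
    the simplified model / IBM Model 4 runner; both names are assumptions.
    """
    accuracies = []
    start = time.perf_counter()
    for _ in range(iterations):
        corpus = make_corpus(corpus_size)
        accuracies.append(align(corpus))    # fraction of correct alignments
    elapsed = time.perf_counter() - start
    return min(accuracies), statistics.mean(accuracies), max(accuracies), elapsed
```

The CSV output step described in the Results section would simply write these tuples out, one row per corpus size.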

Figure 9: stat-program-code


4 Results

In this section we begin by explaining what we expect the results of the tests to be, followed by graphs showing the results we obtained.

When we test our alignment models we are running tests on accuracy and running time and comparing them. The system we ran our tests on has the following specification:

• Processor: Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz

• Random Access Memory: 16.0 GB

• Operating system: 64-bit Windows 10

Something worth mentioning is that our statistics program runs on only one core of the processor. We chose this approach so that the experiment can be applied and recreated on more computers.

The training set is given to our algorithm together with a predefined context free grammar, and an alignment is created to translate the natural language arithmetic expressions into MR.

In analyzing the tests, the results for accuracy and training time differed for different corpus sizes. The accuracy of an alignment system is calculated as the number of terms and operations correctly translated to rules, divided by the total number of terms and operations, averaged over runs. The running time is simply the time the alignment program runs for the different corpora.

It is important to note that while good values for all measures are desirable, a bad value for accuracy is worse, as it can lead to incorrect answers that make the output data useless. What we define as an acceptable result is therefore close to 100% accuracy.

To improve the statistical confidence of our results we have done 100 iterations of the simplified model and taken the mean as the result. With the industry-standard IBM Model 4 we only iterated a handful of times because of its extensive run time.

Each test case runs with a corpus size increasing from 5 to 50 in intervals of 5 per run. This creates data points showing the increase in the percentage of matches against the used CFG. The generated data is written to a CSV file for easy import into other software.

4.1 Expected Results

Our expectation is that with an increasingly complex arithmetic expression, the percentage of correct matches will decrease for our simple model (assuming the same corpus size). We expect the IBM Model 4 to perform better (by at least 10 percentage points) on medium-sized corpora (20-30 lines) compared to the simple model.



4.2 Empirical Results

4.2.1 Simplified Model Accuracy

The tests as described in Section 3.5 were run for term lengths between 2 and 9.

Figure 10: Percent of accuracy for the different amount of operations - Simplified Model

For sentence lengths below 7 operations, the accuracy curve resembles a logarithmic function and all results end up in the high 90% range. From a sentence length of 7 operations, the curves become more irregular and rather linear.

It is clear that the size of the corpus strongly affects the accuracy of our simplified model. For a two-operation sentence, a corpus size of 25 gives an almost 100% accurate result, but in the case of 9 operations a corpus size of 50 is barely sufficient. The longest sentence for which a corpus size of 50 yields around 100% accuracy with our algorithm is 6 operations.


4.2.2 IBM Model 4 Accuracy

Figure 11: Percent of accuracy for the different amount of operations - IBM Model 4

Here we see that the IBM Model 4 has very accurate results even for small corpora.

We also see that the IBM Model 4 reacts to corpus size, as in the dip between 10 and 20 for length 9. The simple algorithm has problems with small corpora, whereas the IBM Model 4 handles them decently. It is, however, apparent from the data that a smaller number of terms gives lower accuracy.



4.2.3 Simplified Model Running Time

Figure 12: Running time for the different amount of operations - Simplified Model

As the graph shows, the running time of the simplified model is almost linear. It is also relatively fast, with a running time of 24 seconds. We can also be confident in the timing results, since the running time of the simplified model is calculated as the mean of 100 iterations.


4.2.4 IBM Model 4 Running Time

Figure 13: Running time for the different amount of operations - IBM Model 4

For the IBM Model 4 the running time is not linear. We have not fitted a function to determine whether it is polynomial or exponential, but the time for a corpus of size 50 is more than 20 minutes. As such, we were unable to satisfactorily test corpora of size 40 and larger with the IBM Model 4. In this case we cannot be sure the result is reliable, since only one iteration was run per corpus size.



5 Discussion

This thesis set out to compare the results of two alignment models and did so within reasonable error limits. Our results for the simplified model differ considerably from those of the industry alignment model IBM Model 4 on small corpora, below size 50. This means that our algorithm needs bigger training data, with a minimum corpus size close to 50. Perhaps some small tweaks would improve our results; however, there are limitations, which are described below. We can see that the IBM Model 4 is very capable of creating a good word alignment from a small corpus, but the model has a very long run time. This means that the model is suitable for tasks that allow pre-processed data but unfit for word alignment tasks that need quick results.

We discuss improvements to our model and to how the tests are made in the later section Difficulties and Improvements.

5.1 Analysis of the different models' results

For the simple model, sentence lengths below 7 words show an accuracy curve resembling a logarithmic function, with all results ending up in the high 90% range. From a sentence length of 7, the curves become more irregular and rather linear. Comparatively, the IBM Model 4 has quite accurate results even for small corpora. While the simplified algorithm has problems with small corpora, the IBM Model 4 handles them with precision. The odd thing we see is that its accuracy increases with the number of words in a sentence. This could be a result of more terms creating greater distinctness between the terms. The simplified model's accuracy instead decreases with a greater number of words. As such, the two models might be too different to compare to each other and obtain meaningful results.


5.2 Difficulties and Improvements

5.2.1 Model

The models in their present condition give results that are too different to draw any conclusion except that corpus size depends on the number of words in a sentence. The simplified model needs to be adjusted and improved so that more accurate comparisons can be made.

5.2.2 More iterations for IBM Model 4

Since we did only a few iterations for each corpus size with IBM Model 4, we cannot be sure that either the running time or the accuracy is reliable. We could improve the reliability of our results by doing more iterations, for example 100 iterations as with the simplified model. We could also have fitted a function to the results to examine whether the running time is polynomial or exponential.

Domain

The arithmetic domain we chose to test the word alignment models on is quite simple. When aligning words of two natural languages there will inevitably be gap words: words that do not exist in, or do not translate into, the opposite language. In arithmetic these do not exist, except in the form of parentheses.

As our parser ignored the parentheses, they were not factored into the alignment process. Since we set out to test the word alignment process on a simple language, we accomplished our goal by doing so. It is, however, recommended to run the tests on two languages that differ greatly in sentence structure and contain multiple gap words to test this aspect.

As we only test the mathematical language against its English equivalent, we narrow the alignment process down. This is good for time-management purposes, but a comprehensive test should compare multiple languages, and not just multiple languages but different language types.

5.2.3 Testing System

Currently we have only tested the implementations on two computers, one of which was incapable of completing the IBM Model 4 algorithm due to lack of memory. The results, especially the running times, can vary a lot between computers, and as we have seen, there is a lower limit to which computers can run the IBM Model 4. Primarily it would be interesting to test on a diverse set of computers with different system specifications. Another idea for improving the statistics program would be to run it on several processor cores. This would hopefully decrease the run time of the tests so that we could take averages over a larger set of data.

5.2.4 Rule extraction

During the work on this thesis, some additional ideas for experiments were sparked. One idea we had in mind, but no time for, was rule extraction.

Translation of natural languages is somewhat of a chicken-and-egg problem [9]:

• if we had the alignments → we could estimate the parameters (CFG) of our generative model


• if we had the parameters (CFG) → we could estimate the alignments

This experiment is one step closer to a computer that can freely translate between languages. Here, word alignment is half the battle, the second half being the creation of a functional grammar. This would also be a natural extension of the report. According to our results, the IBM Model 4 has proven fitting as a pre-step to a rule-extraction algorithm meant to extract the grammar of the spoken language.
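The interplay between the two bullet points above is exactly what EM training exploits: start with uniform parameters, estimate expected alignments, re-estimate parameters, and repeat. A minimal IBM Model 1-style sketch (a toy corpus and our own variable names, not the thesis implementation; the NULL word is omitted for brevity):

```python
from collections import defaultdict

# Toy parallel corpus in the same arithmetic domain (our own example,
# not the corpus used in the experiments).
corpus = [
    ("two plus three".split(), "2 + 3".split()),
    ("two times three".split(), "2 * 3".split()),
    ("three plus two".split(), "3 + 2".split()),
]

f_vocab = {f for _, f_sent in corpus for f in f_sent}
t = defaultdict(lambda: 1.0 / len(f_vocab))  # t(e | f), uniform start

for _ in range(20):
    count = defaultdict(float)   # expected co-occurrence counts
    total = defaultdict(float)
    for e_sent, f_sent in corpus:
        for e in e_sent:
            # E-step: expected alignments given the current parameters.
            z = sum(t[(e, f)] for f in f_sent)
            for f in f_sent:
                c = t[(e, f)] / z
                count[(e, f)] += c
                total[f] += c
    # M-step: re-estimate the parameters given the expected alignments.
    for (e, f), c in count.items():
        t[(e, f)] = c / total[f]

print(max(f_vocab, key=lambda f: t[("plus", f)]))  # prints "+"
```

Even on this tiny corpus, the loop settles on the intuitive lexicon ("plus" ↔ "+", "times" ↔ "*"), illustrating how alignments and parameters bootstrap each other.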

6 Conclusion

It is clear that the size of the corpus strongly affects the accuracy of both our simplified model and the IBM Model 4. We see that for IBM Model 4 the number of distinct words per natural-language sentence increases the accuracy up until we reach 7 words per sentence; from that point on, the results get less stable. We believe this instability is a result of our randomly generated corpora. For the simplified model, a corpus size of 25 gives an almost 100% accurate result for a two-word sentence, but in the case of 9 words, a corpus size of 50 is barely sufficient. The maximum number of words for which a corpus size of 50 is enough to reach 100% accuracy is, for our simple algorithm, 6 words.

The results show that the size of the corpus needs to be larger for the simplified model when there are more words per sentence. The IBM Model 4, conversely, shows that more words per sentence decrease the corpus size necessary to make good predictions.

As such we can conclude that the necessary corpus size for both alignment models is dependent on the number of words in a sentence. With our simplified model, more words in a sentence require a larger corpus, but with the IBM Model 4 the reverse is true.


References

[1] Logbar Inc. "I am ili." URL: http://iamili.com/ Accessed: 2016-05-05.

[2] Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, Robert L. Mercer, and Paul S. Roossin. A statistical approach to language translation. In Proceedings of the International Conference on Computational Linguistics (COLING), 1988.

[3] Yuk Wah Wong and Raymond Mooney. Learning synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 960–967, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

[4] "Svenskt centrum för dokumentation och information om språkteknologi." URL: http://sprakteknologi.se/vad-aer-sprakteknologi/lexikon/korpusar Accessed: 2016-05-10.

[5] Wiebke Wagner. Review of Steven Bird, Ewan Klein and Edward Loper: Natural Language Processing with Python, Analyzing Text with the Natural Language Toolkit. Language Resources and Evaluation, 44(4), December 2010.

[6] Noam Chomsky. Three models for the description of language. IRE Trans- actions on Information Theory, 2:113–124, 1956.

[7] Kunter Gero.

[8] Yarin Gal and Phil Blunsom. A systematic Bayesian treatment of the IBM alignment models. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2013.

[9] Philipp Koehn. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition, 2010.

[10] Krzysztof Wołk and Krzysztof Marasek. New Perspectives in Information Systems and Technologies, Volume 1, chapter Real-Time Statistical Speech Translation, pages 107–113. Springer International Publishing, Cham, 2014.

[11] P.M. Fernández. Improving Word-to-word Alignments Using Morphological Information. San Diego State University, 2008.

[12] Thomas Schoenemann. Computing optimal alignments for the IBM-3 translation model. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL 2010, Uppsala, Sweden, July 15-16, 2010, pages 98–106, 2010.


7 Appendix

7.1 Statistic Accuracy

Simplified model

  corpus size |      2       3       4       5       6       7       8       9   (# operations)
  10          |  69.70%  61.13%  55.35%  41.46%  30.00%  21.00%  12.15%   3.92%
  20          |  94.15%  92.92%  89.85%  81.62%  72.38%  56.46%  36.85%  31.69%
  30          |  98.54%  99.15%  97.23%  95.69%  92.23%  79.54%  74.46%  51.54%
  40          | 100.00%  99.77%  99.69%  98.31%  96.85%  91.54%  83.69%  62.92%
  50          | 100.00% 100.00% 100.00%  99.62%  98.69%  95.62%  92.08%  81.08%

IBM Model 4

  corpus size |      2       3       4       5       6       7       8       9   (# operations)
  10          |  97.07%  73.34%  77.69%  81.26%  84.47%  86.45%  82.27%  82.82%
  20          |  98.62%  75.46%  78.53%  82.08%  84.57%  86.65%  88.20%  80.47%
  30          |  99.31%  74.20%  78.34%  81.63%  84.59%  86.55%  88.13%  89.38%
  40          |  99.54%  74.38%  78.17%  81.86%  84.51%  86.56%  88.14%  89.36%
  50          |  99.50%  74.82%  78.32%  81.78%  84.60%  86.51%  88.10%     n/a

Figure 14: Statistic Accuracy


7.2 Statistic Running Times

Simplified model

  corpus size |     2      3       4       5       6       7       8       9   (# operations)
  10          |  1.29   1.87    2.50    3.28    3.80    4.42    5.19    5.90
  20          |  2.40   3.60    4.77    5.92    6.94    8.18    9.51   10.65
  30          |  3.48   5.13    6.85    8.75   10.01   11.90   13.64   15.39
  40          |  4.59   6.92    9.09   11.43   13.26   15.69   18.07   20.02
  50          |  5.83   8.29   11.40   14.09   16.23   19.32   22.18   24.67

IBM Model 4

  corpus size |      2       3        4        5         6         7         8         9   (# operations)
  10          |   3.66   12.73    42.83   120.56    288.91    580.50    945.40  2,082.18
  20          |   7.81   26.16    92.00   242.54    579.44  1,156.20  1,955.33  3,883.19
  30          |  10.62   38.69   133.37   370.21    860.23  1,676.90  2,931.16  5,556.67
  40          |  13.91   51.67   175.75   495.14  1,081.99  2,318.27  4,202.31  7,501.80
  50          |  16.72   64.65   217.86   600.56  1,344.51  2,578.82  5,950.03       n/a

Figure 15: Statistic Running Times


7.3 Corpus of fifty sentences with 5 words per sentence

"four plus two plus seven plus eight minus four": ((4 + (2 + (7 + 8))) - 4)
"seven plus five minus five plus three plus nine": (7 + (5 - ((5 + 3) + 9)))
"two times nine times seven divide four divide one": (2 * ((9 * (7 / 4)) / 1))
"six divide four times seven plus eight divide three": ((6 / ((4 * 7) + 8)) / 3)
"one minus four minus nine plus nine times six": (1 - (((4 - 9) + 9) * 6))
"eight minus six plus two divide nine minus eight": (8 - ((6 + (2 / 9)) - 8))
"four times nine minus four times four divide three": ((4 * (9 - (4 * 4))) / 3)
"three divide nine minus eight plus nine times one": (3 / (9 - ((8 + 9) * 1)))
"four plus six plus four times two plus nine": ((((4 + 6) + 4) * 2) + 9)
"six plus four plus one times three times three": (6 + (((4 + 1) * 3) * 3))
"six minus five times eight divide two plus three": ((((6 - 5) * 8) / 2) + 3)
"five minus six minus five minus six divide nine": ((((5 - 6) - 5) - 6) / 9)
"eight times two plus two divide seven times six": (((8 * (2 + 2)) / 7) * 6)
"eight divide nine minus two plus two divide two": ((8 / (9 - (2 + 2))) / 2)
"two divide four plus one plus seven plus four": (((2 / (4 + 1)) + 7) + 4)
"three plus two times eight times six times two": (3 + (2 * ((8 * 6) * 2)))
"three plus one minus eight times six minus six": (3 + (1 - (8 * (6 - 6))))
"five divide six times three minus nine minus two": ((5 / ((6 * 3) - 9)) - 2)
"seven times nine divide five plus seven divide three": ((7 * (9 / (5 + 7))) / 3)
"six plus two times two times two divide seven": (6 + (((2 * 2) * 2) / 7))
"six minus eight times seven times six times two": ((6 - (8 * (7 * 6))) * 2)
"two divide two divide two divide eight times one": (2 / (2 / (2 / (8 * 1))))
"five minus three times one plus eight times nine": (5 - ((3 * (1 + 8)) * 9))
"seven plus one minus eight divide five times seven": (7 + (1 - (8 / (5 * 7))))
"two times one minus seven plus five plus five": ((2 * ((1 - 7) + 5)) + 5)
"one minus nine plus five plus five minus five": ((((1 - 9) + 5) + 5) - 5)
"one divide seven minus three times six divide four": ((1 / ((7 - 3) * 6)) / 4)
"nine divide seven plus nine divide one divide seven": (9 / ((7 + (9 / 1)) / 7))
"six plus four times nine times one divide nine": (6 + (4 * (9 * (1 / 9))))
"two times four divide two minus five plus eight": (((2 * (4 / 2)) - 5) + 8)
"two plus two plus five times eight plus eight": (2 + (((2 + 5) * 8) + 8))
"four minus one plus seven divide four divide five": (4 - (1 + (7 / (4 / 5))))
"five minus six divide five times five minus two": (((5 - (6 / 5)) * 5) - 2)
"one plus nine minus six plus seven minus four": ((1 + ((9 - 6) + 7)) - 4)
"four divide three minus eight times eight plus eight": (4 / ((3 - (8 * 8)) + 8))
"five times five divide two divide four plus four": ((((5 * 5) / 2) / 4) + 4)
"seven minus seven minus four divide one divide nine": ((((7 - 7) - 4) / 1) / 9)
"eight times five times seven times one plus five": (8 * (((5 * 7) * 1) + 5))
"five plus one plus four minus six times eight": (5 + ((1 + (4 - 6)) * 8))
"five plus one minus eight plus four divide one": (5 + (1 - (8 + (4 / 1))))
"two times six minus eight minus four divide one": ((((2 * 6) - 8) - 4) / 1)
"nine times one times eight divide two plus four": ((9 * ((1 * 8) / 2)) + 4)
"nine minus nine divide seven minus nine times two": ((((9 - 9) / 7) - 9) * 2)
"five times six divide three plus seven divide one": (5 * (6 / (3 + (7 / 1))))
"nine minus nine plus two times six divide four": (((9 - (9 + 2)) * 6) / 4)
"five minus two minus one divide six minus seven": (5 - (2 - ((1 / 6) - 7)))
"five plus five minus eight minus nine minus five": (5 + (((5 - 8) - 9) - 5))
"nine minus nine plus five plus two times two": (9 - (9 + ((5 + 2) * 2)))
"four divide seven minus three times five divide eight": (4 / (7 - (3 * (5 / 8))))
"one plus five times seven plus six plus two": (1 + ((5 * (7 + 6)) + 2))
"four times five divide five minus seven minus seven": ((4 * (5 / (5 - 7))) - 7)
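Each corpus entry pairs a word sentence with a fully parenthesized formula; "words per sentence" appears to count operands (a 5-operand sentence has 9 tokens). A hypothetical generator reproducing this format (all names are our own, not the thesis code) could look like:

```python
import random

NUM_WORDS = {str(d): w for d, w in zip(range(1, 10),
             "one two three four five six seven eight nine".split())}
OP_WORDS = {"+": "plus", "-": "minus", "*": "times", "/": "divide"}

def gen_pair(n_terms, rng):
    """Return (word sentence, fully parenthesized formula) for a random
    expression with n_terms operands, shaped by a random binary split."""
    if n_terms == 1:
        digit = str(rng.randint(1, 9))
        return NUM_WORDS[digit], digit
    left = rng.randint(1, n_terms - 1)          # operands in the left subtree
    lw, lf = gen_pair(left, rng)
    rw, rf = gen_pair(n_terms - left, rng)
    op = rng.choice(sorted(OP_WORDS))
    return f"{lw} {OP_WORDS[op]} {rw}", f"({lf} {op} {rf})"

rng = random.Random(0)
words, formula = gen_pair(5, rng)
```

With 5 operands the sentence always has 9 tokens and the formula 4 parenthesis pairs, matching the listing above.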
