
Translation Memory System Optimization

HOW TO EFFECTIVELY IMPLEMENT TRANSLATION MEMORY SYSTEM OPTIMIZATION

TING-HEY CHAU

KTH ROYAL INSTITUTE OF TECHNOLOGY


DEGREE PROJECT AT CSC, KTH

Optimering av översättningsminnessystem

Hur man effektivt implementerar en optimering i översättningsminnessystem

Translation Memory System Optimization

How to effectively implement translation memory system optimization

Chau, Ting-Hey

E-mail address at KTH: tich02@kth.se
Degree project in: Computer Science

Supervisor: Arnborg, Stefan
Examiner: Arnborg, Stefan
Client: Excosoft AB
Date: 2015-06-11


Abstract

Translation of technical manuals is expensive, especially when a larger company needs to publish manuals for its whole product range in over 20 different languages. When a text segment (i.e. a phrase, sentence or paragraph) is manually translated, we would like to reuse these translated segments in future translation tasks. A translated segment is stored together with its corresponding source-language segment, often called a language pair, in a Translation Memory System. A language pair in a Translation Memory constitutes a Translation Entry, also known as a Translation Unit.

During a translation, when a text segment in a source document matches a segment in the Translation Memory, the available target languages in a Translation Unit will not require a human translation. The previously translated segment can be inserted into the target document.

Such functionality is provided in the single-source publishing software Skribenta, developed by Excosoft.

Skribenta requires text segments in source documents to find an exact or a full match in the Translation Memory in order to apply a translation to a target language. A full match can only be achieved if a source segment is stored in a standardized form, which requires manual tagging of entities and of frequently recurring words such as model names and product numbers.

This thesis investigates different ways to improve and optimize a Translation Memory System. One way was to aid users with the work of manually tagging entities, by developing Heuristic algorithms that approach the problem of Named Entity Recognition (NER).

The evaluation results from the developed Heuristic algorithms were compared with the results from an off-the-shelf NER tool developed by Stanford. The results show that the developed Heuristic algorithms are able to achieve a higher F-Measure compared to the Stanford NER, and may be a great initial step to aid Excosoft's users in improving their Translation Memories.


Optimization of Translation Memory Systems

How to effectively implement an optimization in translation memory systems

Translation of technical manuals is very costly, especially when larger organizations need to publish product manuals for their entire range in over 20 different languages. When a text (e.g. a phrase, sentence or paragraph) has been translated, we want to be able to reuse the translated text in future translation projects and documents. The translated texts are stored in a Translation Memory. Each text is stored in its source language together with its translation into another language, the so-called target language. Together these constitute a language pair in a Translation Memory System. A language pair stored in a Translation Memory constitutes a Translation Entry, also called a Translation Unit.

If a match is found when searching the Translation Memory for a given source-language text string, translations into all available target languages for that string are retrieved. These can in turn be inserted into the target document. Such functionality is offered in the publishing software Skribenta, developed by Excosoft.

To perform a translation into a target language, Skribenta requires that text in the source language finds an exact match or a so-called full match in the Translation Memory. A full match can only be achieved if a text is stored in a standardized form. This requires manual tagging of entities and of frequently occurring words such as model names and product numbers.

In this thesis I investigate how to effectively implement an optimization in a translation memory system, in part by facilitating the manual tagging of entities. This has been done with different Heuristics that approach the problem of Named Entity Recognition (NER).

Results from the developed Heuristics have been compared with the result from the NER tool developed by Stanford. The results show that the Heuristics I developed achieve a higher F-Measure compared to the Stanford NER, and may therefore be a good initial step in helping Excosoft's users improve their Translation Memories.


Contents

1 Introduction 1

2 Background 3
  2.1 Excosoft 4
  2.2 Research question 4
  2.3 Objective 5
  2.4 Delimitations 6
  2.5 Limitations 6

3 Theory 7
  3.1 Translation Memory System 7
    3.1.1 How Translation Memory Systems Work 8
    3.1.2 Different Matches 9
  3.2 Translation Memory Optimizations 10
    3.2.1 Generalization 10
    3.2.2 Translation 12
    3.2.3 Translation Memory Database 12
  3.3 Named Entity Recognition 13
  3.4 Concept of Evaluation 13

4 Related Work 15
  4.1 Controlled Language 15
  4.2 Similar Segments 15
  4.3 Regular Expression In Translation Memory 16
  4.4 Machine Translation 16

5 Methods 19
  5.1 First step: White-space removal 19
    5.1.1 Identifying unnecessary characters 19
    5.1.2 White-space extraction 20
  5.2 Identifying Named Entities 20
    5.2.1 Intra-Heuristic 21
    5.2.2 Inter-Heuristic 22
    5.2.3 Stanford NER 22

  6.2 Implementation of quality filter 26
  6.3 Software Used 27

7 Results & Analysis 29
  7.1 First Step: White-spaces 29
    7.1.1 Multiple White-space and Invisible Separators 29
    7.1.2 White-space Extraction 31
  7.2 Heuristic NER 33
    7.2.1 Intra-Heuristic 34
    7.2.2 Inter-Heuristic 35
    7.2.3 Combined Inter-Intra-Heuristic 36
  7.3 Stanford NER 37

8 Conclusions 39

Bibliography 43

Appendices 45

A Appendix 47
  A.1 Excosoft Embedded Tags 47
  A.2 Quality Assessment Prototype 48
  A.3 First Step 49
  A.4 Heuristic NER & Stanford NER 49


Chapter 1

Introduction

Hiring professional translators to translate technical documents is often very expensive and time consuming. That is why, over the last couple of decades, computer scientists and human translators have been working together to develop different tools and methods to minimize the use of human translation. There are two branches in this area: one is Computer Aided Translation (CAT) tools, which include Translation Memory Systems (TMSs)1, while the other is Machine Translation (MT), aiming at general translation of text. Both branches are part of the computer science field Natural Language Processing (NLP). It is important to distinguish between them, since the two have different goals and purposes. The purpose of using MT tools is to completely eliminate the use of human translators, while CAT tools were developed for human translators, to make their work more effective by eliminating repetitive work. As of today, TMSs have become the standard tool used by all major translation agencies [1].

Companies grow with the help of the Internet and require more multilingual support in order to compete globally, which is why it is important to find a good tool to manage translations. Technical documents tend to be repetitive; translators who use TMSs are able to reduce the cost by 15% to 30% and at the same time improve their productivity by 30% or even 50%, according to Esselink [2].

Many people have probably used a translation engine, e.g. Yahoo! Babel Fish (previously owned by AltaVista) or Google Translate, to translate a website in a foreign language, and as many might have noticed, the results vary, ranging from acceptable to grammatically incorrect, the latter usually caused by word-for-word translation. But even with a grammatically incorrect translation, no one can deny that these translation tools provide us with an understanding of the website's information, rather than leaving us unable to understand the information at all.

TMS and MT systems are no doubt great tools for minimizing the work of human translators [1, 3].

1Throughout the report we distinguish the system from its core component, the database storing the translations, often called the “memory”. We therefore refer to the systems as TMSs, and to the database itself as the Translation Memory (TM).


Chapter 2

Background

Back in the beginning of the 1980s, P.J. Arthern of the European Council Secretariat had the idea of a computer application that would utilize the computational power of the computer by letting it process natural-language translations. He soon realized that this was a very complex task. Instead of a machine translation system, it would be useful to have a word processing system which could remember whether a new text typed into the system had already been translated, fetch the translation that had previously been made, and then show it on the display or have it printed out automatically, as he reported [4].

Arthern described a solution to a day-to-day task that translators had been dealing with for many decades. Many translators had developed different strategies to deal with this problem: card indexes, cut-and-paste, etc. The solution would eliminate the need for translators to translate repetitive texts and documents containing commonly used texts. The solution Arthern [4] mentioned became available to translators under the generic term Translation Memory (TM), a Computer Aided Translation tool, also called Computer Assisted Translation (CAT), which includes the following three categories: translation memory tools, terminology tools and software localization tools. Usually the translation memory and terminology tools are combined in one tool-set for translation of documentation.

It is a known fact that Machine Translation (MT) tools are hard to develop, especially a generalized MT tool capable of translating many different types of documents. Using MT systems for the wrong types of documents makes for a costly, inefficient and time-consuming process, according to Esselink [2]. Since MT systems require a much larger initial investment than TMSs, it is recommended to adapt the MT system to the intended type of source documents in order to achieve a return on investment. This can be done by identifying frequently used terminology and by using Controlled Language (CL), which minimizes the number of ambiguities in source documents, see Section 4.1 on page 15.

Many companies working with translation use some kind of TMS. Each TMS has its own additional features, but all TMSs have one thing in common: the quality of a TMS is affected by its users. Different technical writers have different styles of writing. This affects the matching quality of a TMS, making it harder to re-use previously translated segments, due to the tendency of different writers to use different texts for the same meaning. A TMS is usually a system combining three basic functions: a translator module, an editor module and a database.

A list of the different modules in a TMS:

Translator module: This module processes the documents that are sent for publication; all formatting information is removed, and every source-language segment is replaced with a target-language segment in a new document.

If no such translation exists, the TU is sent for human translation; once this has been done, the translated segment is inserted into the TU and can then replace the given segment. (The comparisons are done once all source segments are generalized, i.e. put on a standardized form permitting more exact matches to be made.)

Editor module: This module displays documents in the source language and sometimes also in a target language, allowing human translators to view both source and target language if such a translation exists in the TM. This module can either be an add-on tool for an existing word processor, e.g. Microsoft Word, or a separate word processor, such as the one shipped with Skribenta 4, named XML Editor.

Database: This module is the core of a TMS; it is the database that contains all the TUs and is often referred to as the Translation Memory (TM). It manages all load and save operations.

See Example 2.1 for two basic TUs with segments in five different languages.
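To make the database module concrete, the sketch below models a TM as a minimal in-memory mapping in Python, mirroring Example 2.1. The schema and function names are illustrative assumptions, not Skribenta's actual data model.

```python
# Hypothetical in-memory TM: each TU maps language codes to text segments.
tm = {
    198220: {
        "en": "Select Super Rinse by pressing the button under the symbol.",
        "sv": "Välj Supersköljning genom att trycka på knappen under symbolen.",
    },
}

def lookup(source_lang: str, segment: str, target_lang: str):
    """Translator-module lookup: return the target-language segment of a TU
    whose source-language segment matches exactly, or None if no match."""
    for tu in tm.values():
        if tu.get(source_lang) == segment and target_lang in tu:
            return tu[target_lang]
    return None  # no match: the segment would be sent for human translation
```

With this layout, adding a new language to a TU is simply adding one more key, and an exact match in any stored language can serve as the source for a cross-translation.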

2.1 Excosoft

The case study for this project was done for the company Excosoft AB. Excosoft is a Swedish company founded in 1986; it provides single-source publishing software. The software features a translation memory, which offers writers potential re-use of previously translated documents, and version control. The software developed by Excosoft goes by the name Skribenta and is used to manage technical documentation, where there are often repetitions.

2.2 Research question

With the knowledge and basic understanding of how a TMS operates, experienced technical writers are able to accomplish some TM improvements on their own, and maximize the potential future re-use of existing TUs. Such knowledge minimizes



TU Id: 198220
  ar: Select Super Rinse by pressing the button under the symbol.
  fr-ca: Sélectionner le « Super rinçage » à l’aide de la touche située sous le symbole.
  da: Vælg Superskyl ved at trykke på knappen under symbolet
  en: Select Super Rinse by pressing the button under the symbol.
  sv: Välj Supersköljning genom att trycka på knappen under symbolen.

TU Id: 198591
  ar: Move the plastic plugs on the upper and lower edges of the door. Use a flat screwdriver to remove the plugs.
  fr-ca: Déplacer les bouchons en plastique situés sur les bords supérieur et inférieur de la porte. Utiliser un tournevis à lame plate pour retirer les bouchons.
  da: Byt om på de plastpropper, der sidder i lugens over- og underkant. Brug en flad skruetrækker til at løsne propperne.
  en: Move the plastic plugs on the upper and lower edges of the door. Use a flat-bladed screwdriver to remove the plugs.
  sv: Byt plats på de plastpluggar som finns på luckans över- respektive underkant. Använd en flat skruvmejsel för att lossa pluggarna.

Example 2.1. Basic TU examples with segments in five different languages

the risk of creating similar and unnecessary TUs with the same semantic meaning.

This has led us to the following research question:

How to effectively implement translation memory system optimization

2.3 Objective

The purpose of this master's thesis is to research how to improve and optimize a TMS. Our objective is to identify different ways to improve pattern matching in the translation phase. Algorithms will be developed to identify TUs that can be generalized in different ways. Users should be able to access the algorithms through a graphical user interface and choose an appropriate correction for selected TUs that match a chosen algorithm. The graphical user interface will be referred to as the quality assessment prototype (see Chapter 6 on page 25).


2.4 Delimitations

Many difficulties need to be solved in order to create a fully optimized TMS, which is why some limitations are needed. The focus of this project was to find ways to optimize a TM and its existing TUs.

To simplify the optimization task, some assumptions were made:

• Generalization rules follow the same structure as the algorithms in the Skribenta system.

• Changes are only made on existing TUs and will therefore not affect existing documents.

• Developed algorithms should work for all source and target languages based on the Latin alphabet1. The reason an algorithm needs to work with all target-language text segments is that otherwise we would not be able to go the other way around: translating a document from a former target language back to the former source language should work at all times.

The coverage of this study was limited to developing a quality assessment tool for Excosoft's software Skribenta 4, the current version at the time of implementation. The existing translation memory infrastructure was not within the scope of the project; it was provided by Excosoft AB.

2.5 Limitations

Limitations encountered during this project were that some algorithms only work for languages based on the Latin alphabet; one of them is date and time identification.

In order to create an algorithm able to identify and tag dates and times in all possible languages, some knowledge of every language is required. Unfortunately that is not feasible; developing such generalized algorithms would require hiring linguists, which would be too time consuming for this project.

A limitation in the evaluation process is caused by the optimizations being applied only to the TM. We will not be able to measure the pattern-matching improvements achieved by the different algorithms, as that would require implementing improvements in the other two parts of the TMS as well.

1http://en.wikipedia.org/wiki/Latin_alphabets


Chapter 3

Theory

The aim of this chapter is to introduce the theory behind TMSs and Named Entity Recognition, and the underlying theory used in the developed algorithms.

3.1 Translation Memory System

The term Translation Memory System (TMS) usually refers to a software tool that contains a database of translated texts. A source-language text segment associated with one or more target-translation segments is called a language pair. A language pair stored within a TMS is called a Translation Unit (TU), sometimes a translation entry. TMSs are developed solely to assist human translators in their daily translation work.

TMSs are very popular and often used by companies that need to publish new manuals and technical documents, with a short life cycle, in many different languages. When a new product is developed or new features have been added to an existing computer application, a new manual is required. The use of TMSs enables companies to re-use translations from previous versions, reducing the time to market and the amount of human translation.

Esselink [2] states that the best results with TMSs may be achieved when the source documents are created in a structured way, avoiding wordiness, ambiguities and synonyms (see Section 4.1 on page 15). Those considering using TMSs may refer to the report written by Webb [5].

Esselink [2] suggests that the use of TMSs in translation projects may reduce the total translation cost by 15% to 30%, while O’Brien [6] suggests a productivity increase between 10% and 70%, depending on the content stored in the TM. Although this is a very broad range, it is hard to calculate a generalized result that does not depend on the type of the translated texts. According to Somers [7], a 60% productivity increase may be possible, while a more reasonable average productivity gain of around 30% may be expected when the software is used.

A survey done by Lagoudaki [1] in 2006 shows that 82.5% of all professional translators who responded used TMSs, while 17.5% did not use any TMS at all. The survey showed that many translators were able to save time (86%), improve the terminology and consistency of translations (83%), improve the quality of the translation output (70%) and achieve cost savings (34%), while 31% thought TMSs were the best way to exchange resources, such as glossaries and TMs.

3.1.1 How Translation Memory Systems Work

TMSs work at sentence level, meaning that source documents are broken down into smaller components such as sentences or segments. The term segment is often used because in some cases a chunk of text may not be a complete sentence, as in the case of headings or lists. A segment is the smallest unit of text that may be reused when working with TMSs.

It is important to remember that smaller units of text, such as individual words, are not used, since they may occur in different contexts and would therefore require a context-dependent translation. That is why word-for-word translations normally do not produce usable results; such translations are too literal, as is well described by Arnold [8].

When a new document in a source language is saved, the document is sent to the translation module. The document is separated into segments and processed by a set of generalization rules, and then compared with the TM. If a segment already exists in the TM it will not be added, otherwise a new TU is created in the TM.
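This save path can be sketched as follows, assuming naive sentence-level segmentation and a set-based TM, and leaving out the generalization step:

```python
import re

def segment(document: str) -> list:
    # Naive sentence-level segmentation: split after ., ! or ?.
    # Real TMSs use language-aware rules for headings, lists and abbreviations.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def store_new_segments(document: str, tm: set) -> list:
    # Add each unseen segment to the TM as a new TU; return the new ones.
    new_tus = []
    for seg in segment(document):
        if seg not in tm:
            tm.add(seg)
            new_tus.append(seg)
    return new_tus
```

Note that a repeated segment in the same document only produces one TU, which is exactly the "repetition" match described below in Subsection 3.1.2.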

Available target-language segments in a newly created TU are empty until it is sent for translation. When the translator module finds an exact or a full match in the TM, all previously translated target languages in the given TU will be available for translation. Please refer to Subsection 3.1.2 on the next page for a short description of the different matches used in a TM.

In theory, a TMS used for a long period of time will become better and better as more TUs with translated target languages populate the TM, and after a while it will reach a point where practically all future documents find perfect matches in the TMS and require no human translation.

But we know that this is not true. It has to do with the complexity of written language: ambiguities, wordiness and language innovation are hard to deal with; see why in Section 4.1 on page 15.

If previous translations exist but no TMS has been used previously, the translated documents need to be aligned; such an operation is called a translation alignment1. Translation alignment matches source-language segments to target-language segments; this creates a new language pair between source and target language for each aligned segment, which can later be used in a TMS. According to Esselink [2] it is not uncommon that manual alignment is required; it all depends on how the previous documents were produced.

1Parallel text - http://en.wikipedia.org/wiki/Parallel_text



3.1.2 Different Matches

Esselink [2] states that there are four different types of matches which can be found when a newly written document is compared with a TM: repetitions, full matches, fuzzy matches and no matches.

Repetition, also referred to as an internal match, means that multiple occurrences of the same segment exist in a document.

Full match is sometimes also referred to as an exact match or perfect match, meaning that no character in a segment may differ from an existing TU, not even a white-space or punctuation mark. According to Bowker [9], however, a full match means that matching segments differ only in terms of variables or other pre-tagged entities. Bowker's definition of full match will be used in the rest of the report.

Fuzzy match means that one or more characters may differ; the difference is often computed by a string edit distance, see Section 4.2 on page 15 for a more detailed description.

A list of different matches within a TM

Exact match: An exact match, also called a perfect match, is found when a segment in the new document is exactly the same as a segment stored in the TM. No character may differ; this definition is well described by Bowker [9].

Full match: A full match is a match where a segment differs from a stored segment in the TM only by pre-tagged terms, which can be variable elements, named entities, numbers, dates, times, currencies, measurements and sometimes proper names; this is well documented by Bowker [9].

Fuzzy match: With a match between 75% and 94%, already translated segments may largely be re-used by translators [10, 11]. Translators may edit the suggested text and adapt it to the new content, which is why many translation agencies usually charge less for fuzzy matches [12, 13, 14]; the rates vary depending on the level of the fuzzy match.

Repetitions: The same segment occurs several times in a document. This segment only needs to be translated once. When the segment is repeated, the TMS is able to automatically supply a translation.

No match: No match is found, or only a match lower than 75% is found. If an MT system is used, the result must still be checked and usually edited by a human.
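The match categories above can be summarized as a small classifier over a similarity level. The thresholds follow the definitions in this section; the flag for entity-only differences is an illustrative assumption about how a generalized comparison would report its result:

```python
def classify_match(level: float, differs_only_in_tagged_entities: bool = False) -> str:
    # level: similarity between the new segment and a stored TU, in [0, 1].
    if level == 1.0:
        return "exact match"   # character-identical (Bowker's definition)
    if differs_only_in_tagged_entities:
        return "full match"    # differs only in pre-tagged entities
    if level >= 0.75:
        return "fuzzy match"   # 75-94%: largely re-usable after editing
    return "no match"          # below 75%: human or MT translation needed
```

Repetitions are not classified here, since they are a property of the document (the same segment occurring twice) rather than of the comparison against the TM.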


3.2 Translation Memory Optimizations

In order to take full advantage of a TMS it is necessary to keep the TUs as consistent as possible. It is therefore necessary to regularly reorganize TUs with similar or identical content. According to Iverson [15], time invested in rewriting sections and removing unnecessary words not relevant to a product in a document will yield great results in TMSs. He also recommends that rewritten sections be compared with other similar documents, to ensure consistency between the documents. What he is referring to is often called Controlled Language (CL), see Section 4.1 on page 15.

Many companies who offer TMSs have realized that a sentence-based segmentation is preferable to a paragraph-based segmentation. Keeping shorter segments along with the target language in a TMS results in a higher number of both full and exact matches. The downside of using sentence-based segmentation, however, is that many similar TUs are stored in the TM. Some sentences can be interpreted in many different ways, depending on the context where the sentence is used. Such sentences are context dependent, which is why matches provided by a sentence-based TMS need to be verified by a translator. Esselink [2] states that fewer exact matches are found by a paragraph-based TMS compared to a sentence-based TMS, but when an exact match is found in a paragraph-based TMS, no additional review or proofreading is usually required; the translated paragraph in the target language may easily be re-used.

Iverson [15] also mentions the important tradeoff between the length of the TUs and storing many similar TUs. The chance of finding an exact match gets lower as the text segments in the TUs get longer, while short segments increase the probability of creating ambiguities (e.g. multiple or conflicting matches), which results in a lower quality of the intended translation.

In short, the benefits of using TMSs are reduced translation cost, improved turnaround times (less time to produce), and increased translation consistency.

3.2.1 Generalization

Newly written or modified segments in the editor module are passed on to the translator module. The translator module processes the segments by going through all existing generalization rules in a TMS; this process is called generalization. A generalization makes segments match if they differ only in named entities, dates, etc. Generalized segments are compared with existing TUs in a TM, with the goal of finding a full match. No human translation is necessary if an exact or full match can be found. In case no full match is found, a new TU is created. The TU can be used as soon as the segment has been translated into the target language(s). Future segments that match the previously stored segment will then no longer require a human translation. Entities such as dates, names, serial numbers etc. should be generalized with different XML tags, see Example 3.1 for the desired tagging.



English: MPU Flasher version 1.04

should yield the following result:

English: MPU Flasher version <value>1.04</value>
TU: MPU Flasher version <value/>

Example 3.1. A desired segment generalization

During comparison with previous segments, the content within tagged entities is ignored; in this case “1.04”.

In cases where a user fails to tag segments with certain entities, the non-generalized segments will not be able to find a match, creating new, very similar TUs; see Example 3.2, where similar TUs have been created.

TU ID en

574062 Adjust the back check (illustration 3).

574066 Adjust the back check (illustration 4).

574070 Adjust the back check (illustration 6).

574074 Adjust the back check (illustration 8).

Example 3.2. Basic example of a non-generalized TM

A new segment differing only in the numerical parameter value will find a full match. But this requires that the compared segment is generalized and contains a <value/> tag; in this case the tag is a version number. Cases where writers have missed selecting the numerical value will unfortunately not find a match. That is why it is interesting to explore a way to automatically identify and tag these named entities, as this is a frequently recurring problem. This task belongs to a well-known field within computer science, Named Entity Recognition (NER), see Section 3.3 on page 13.

For a desired generalization, see Example 3.3.

Input: “MPU Flasher version 1.05”

Required input: “MPU Flasher version <value/>”

TU: “MPU Flasher version <value/>”

Example 3.3. Desired matching procedure.

Unless the numerical version number is identified and pre-tagged before the comparison is done, the new segment “MPU Flasher version 1.05” will not find a match; instead a new, unnecessary TU with similar content is created.
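Skribenta's actual generalization rules are not reproduced here, but a single rule of this kind can be sketched with a regular expression over numeric values (the pattern is an illustrative assumption):

```python
import re

# Hypothetical rule: tag any integer or dotted version number as <value/>.
VALUE = re.compile(r"\b\d+(?:\.\d+)*\b")

def generalize(segment: str) -> str:
    # Replace each numeric value with an empty tag, as in Example 3.1.
    return VALUE.sub("<value/>", segment)

print(generalize("MPU Flasher version 1.05"))
# MPU Flasher version <value/>
```

Applied to the segments of Example 3.2, the same rule would collapse all four TUs into the single generalized segment “Adjust the back check (illustration <value/>).”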


3.2.2 Translation

Documents sent for publication are passed on to the translator module; this process is called a translation. If a TU is missing a target-language translation, that TU is sent for translation, either to an in-house translator or to a translation agency.

When the translated target language segments are inserted back into the TM, we can publish the given document in the required target languages.

Every segment sent for publication needs to be compared with all TUs in the TM. The comparison ensures that no previous translation in the TM is overlooked. It is hard to find exact matches in a TM if new segments are not generalized before they are stored. Given the segment “MPU Flasher version 1.06”, no exact match can be found, but a fuzzy match can; depending on how the calculations are done, the computed difference between the two similar segments may vary. A character-by-character comparison using the Levenshtein algorithm requires only one substitution operation. The level of the fuzzy match can then be calculated as (20+3)/(20+4) = 0.958. A 96% fuzzy match means that these segments are very similar.
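The fuzzy level in the example above can be reproduced with a standard Levenshtein implementation; this is a sketch, and a production TMS would use an optimized library:

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_level(new_seg: str, stored_seg: str) -> float:
    # Similarity = 1 - distance / length; with a shared 20-character prefix
    # and one substitution in the 4-character tail this gives (20+3)/(20+4).
    dist = levenshtein(new_seg, stored_seg)
    return 1 - dist / max(len(new_seg), len(stored_seg))

print(round(fuzzy_level("MPU Flasher version 1.06",
                        "MPU Flasher version 1.05"), 3))
# 0.958
```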

The best way to present fuzzy matches would probably be to notify the user of how similar the new segment is to an already existing TU in the TMS (for matches ranging between 75% and 94%), letting the user decide whether the new segment can be replaced by an already existing one, in which case it will not require a human translation.

3.2.3 Translation Memory Database

This case study focuses on optimizations in the TM: by tagging and generalizing different parts of a TU's text segments, we will be able to reduce the number of similar TUs in the TM. In Example 3.4, a writer has failed to tag a text segment.

”040209 11:43:44”

”040209 11:43:18”

”040209 11:43:19”

Example 3.4. Similar and unnecessary TUs in the TM.

Instead of storing these very similar text segments as separate TUs, it would be preferable to replace all of the above TUs with one generalized TU, where the date and time are replaced with parameters. This would enable future generalized TUs to find a match and be translated automatically. The date “040209” should be replaced with <date/>, and the time, where the seconds digits differ, should be replaced with <time/>.
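The generalization described above can be sketched as follows. The regular expressions assume the exact date (YYMMDD) and time (HH:MM:SS) formats of Example 3.4; the placeholder tags are the ones named in the text.

```python
import re

# Illustrative generalization: replace the date/time formats from
# Example 3.4 (YYMMDD and HH:MM:SS) with placeholder tags before a TU
# is stored.
DATE_RE = re.compile(r"\b\d{6}\b")
TIME_RE = re.compile(r"\b\d{2}:\d{2}:\d{2}\b")

def generalize(segment):
    segment = DATE_RE.sub("<date/>", segment)
    return TIME_RE.sub("<time/>", segment)

for s in ["040209 11:43:44", "040209 11:43:18", "040209 11:43:19"]:
    print(generalize(s))  # each prints "<date/> <time/>"
```

All three similar TUs from Example 3.4 collapse into the same generalized segment, so only one TU needs to be stored and translated.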

Unfortunately, tagging date and time only in the source language TUs of the TM will not yield the desired result. A problem likely to occur in a cross-translation is that one or more parameters are missing in one of the segments. This problem can be avoided by generalizing all target language segments in a TU, not only its source language. Changes applied to a segment's source language should also be applied


in target language segments. This allows a proper cross-translation between all languages, and not only from one source language to its target languages. For example, if the first document's source language is English and its target language Swedish, a future document written in the source language Swedish and translated to the target language English should be able to re-use the previously translated TUs.

3.3 Named Entity Recognition

Named Entity Recognition (NER) is a task in which documents, paragraphs or sentences are broken down into tokens, where each token is evaluated and classified into predefined categories, such as locations, names of persons, organizations, quantities, monetary values, percentages, etc. To illustrate the entity classification, a basic example containing different entities is provided below:

“Google launched their cloud service Google Drive on April 24, 2012, which offers online storage and backup. It allows users to store 15 GB free of charge, while 1 TB costs US$9.99 per month.”

We would like to divide the example into five different entity classes: “organization”, “name”, “date”, “unit” and “currency”. The generated output from a NER system could look like this:

“<ORGANIZATION>Google</ORGANIZATION> launched their cloud service <NAME>Google Drive</NAME> on <DATE>April 24, 2012</DATE>, which offers online storage and backup. It allows users to store <UNIT>15 GB</UNIT> free of charge, while <UNIT>1 TB</UNIT> costs <CURRENCY>US$9.99</CURRENCY> per month.”

The output from NER systems is generally not meant to be read by humans; it is often used in information extraction and text categorization.

There are three different methods by which NER systems learn to identify named entities: supervised learning, semi-supervised learning and unsupervised learning. The different methods are well described in a recent report [16].

3.4 Concept of Evaluation

Evaluation of NER systems is usually done with three well-known metrics: precision, recall and F-score, initially developed for Information Retrieval2. To illustrate how the different metrics are calculated, we may use the information retrieval problem itself.

2http://en.wikipedia.org/wiki/Information_retrieval


              | Relevant              | Non-relevant
Retrieved     | true positives (TP)   | false positives (FP)
Not retrieved | false negatives (FN)  | true negatives (TN)

Example 3.5. Different evaluation outcome classes

Precision (P) is the fraction of the retrieved documents that are relevant.

Precision = |{Relevant documents} ∩ {Retrieved documents}| / |{Retrieved documents}|

Recall (R) is the fraction of relevant documents that are retrieved.

Recall = |{Relevant documents} ∩ {Retrieved documents}| / |{Relevant documents}|

During evaluation, each document is assigned to one of four classes: false positive, false negative, true positive and true negative, see Example 3.5. A true positive occurs when a word is correctly identified as an entity. A true negative is a word that is correctly ignored, i.e. not an entity. A false positive occurs when a word is incorrectly identified as an entity. A false negative occurs when a word is incorrectly ignored.

F-Measure (F1) is the weighted harmonic mean of precision and recall, also known as the traditional or balanced F-score.

F1 = (2 × Precision × Recall) / (Precision + Recall)

The balanced F-Measure above weights precision and recall equally, which is why a different formula may be used when the two metrics are of unequal importance.

Fβ = ((1 + β²) × Precision × Recall) / ((β² × Precision) + Recall)

A value of β > 1 assigns a higher importance to recall. With β = 2, F2 weights recall twice as much as precision, while F0.5 weights precision twice as much as recall.
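The metrics above can be sketched in a few lines of Python; the evaluation counts used here (8 true positives, 2 false positives, 4 false negatives) are hypothetical.

```python
# Minimal sketch of precision, recall and the F-measure family.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    # The balanced F1 score is the special case beta = 1.
    return (1 + beta ** 2) * p * r / ((beta ** 2 * p) + r)

p = precision(8, 2)            # 0.8
r = recall(8, 4)               # 0.666...
print(round(f_beta(p, r), 3))  # 0.727
```

Evaluating f_beta with beta = 2 on the same counts gives a lower score than F1, since F2 is pulled toward the (lower) recall value.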


Chapter 4

Related Work

The aim of this chapter is to examine related work in generalizing TMs. It describes different approaches that have proven useful with TMSs.

4.1 Controlled Language

Written language in technical documents should be as clear and concise as possible to avoid misinterpretations. This is why companies that regularly produce technical documents restrict their writers' grammar, vocabulary, style and semantics; these restrictions are often called a Controlled Language (CL). The goal of a CL is to get rid of unclear writing, such as ambiguous words, complex grammar, incomplete sentences and vernacular. This allows writers to produce sentences that are less likely to be misinterpreted [17], which makes it easier for a translator to create a more consistent translation that is easier to understand [18].

Previous research [19] shows that CL may be used to improve translation quality.

Mitamura [18] observed that 95.6% of all sentences could be assigned a single meaning representation, while [20] found that around 33% of duplicate sentences could be removed.

With these kinds of improvements, one might wonder why CL is not used by all technical writers and translators. The reason is that learning a CL is difficult and time-consuming [21].

4.2 Similar Segments

Similar segment matching for TM is often referred to by the generic term fuzzy match.

Similar segments can be identified by using an algorithm that solves the well-known problem Minimum Edit Distance. The minimum edit distance between two strings calculates the minimum number of edit operations (usually insertion, deletion and substitution of single characters) required to transform one string into another [22].

Given the words computer and commuter, which have the same character length, in Example 4.1, one operation is required to transform the word computer into commuter,


by substituting the letter “P” with the letter “M”. If we assign a particular cost or weight to each edit operation, we obtain the Levenshtein distance between two sequences. Giving each of the three operations a cost of 1 (assuming that substituting a letter for itself has zero cost), the Levenshtein distance between computer and commuter is 1.

C O M P U T E R
C O M M U T E R
      S

Example 4.1. Minimum edit distance operation.

This method is often used in spell-checkers to identify possible corrections for misspelled words. Words with a low minimum edit distance are often presented as possible corrections. While many different string distance algorithms exist, Levenshtein [23] was the first to report this method.

Companies providing a TMS without a fuzzy matcher should consider implementing one, to give translators the ability to view similar text segments from previous translations. This improves the ability to generate consistent translations, especially if the similarity threshold of the fuzzy matcher can be changed [24], giving users the ability to find the best balance between precision and recall. Previous research [25] shows that fuzzy matches over 70% may be of use.
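The minimum edit distance described in this section can be sketched with the standard dynamic-programming solution. This is an illustration of the algorithm, not the matcher of any particular TMS.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions (each with cost 1) needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("computer", "commuter"))  # 1 (substitute "P" with "M")
```

Only two rows of the dynamic-programming table are kept at a time, so the memory use is linear in the length of the shorter string rather than quadratic.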

4.3 Regular Expression In Translation Memory

Previous research [26] suggests that full matching in TMSs may be improved with the use of regular expressions. Each rule consists of three different search patterns: one matching the input segment; one matching a TU containing both the source language and the desired target language; and an additional regular expression that replaces the parts that do not match the source language segment of the TU, applies the corresponding replacement to the target language segment of the TU, and thereby produces the translation. The last regular expression is called a transfer rule, since no translation is needed: the matched parts are simply transferred into a different form.

Jassem and Gintrowicz [26] were able to develop transfer rules that allowed automatic translation of these specific entities: various formats of date and time; currency expressions; metric expressions; numbers; and e-mail addresses.
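A transfer rule of the kind described above can be sketched as follows. The English pattern, the Swedish target and the date format are invented for illustration; they are not taken from [26].

```python
import re

# Hypothetical transfer rule: an English TU "Delivered on <date>" with
# the Swedish target "Levererad den <date>". The date itself needs no
# translation and is simply transferred into the target form.
SOURCE = re.compile(r"Delivered on (\d{4}-\d{2}-\d{2})")
TARGET = "Levererad den {0}"

def apply_rule(segment):
    """Return the transferred target segment, or None if the rule fails."""
    m = SOURCE.fullmatch(segment)
    return TARGET.format(m.group(1)) if m else None

print(apply_rule("Delivered on 2015-06-11"))  # Levererad den 2015-06-11
```

Any segment that differs from the stored TU only in the date still yields a full translation, which is exactly the improvement over plain exact matching that the rule is meant to provide.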

4.4 Machine Translation

Today, many companies have developed usable software that combines the benefits of TM with the advantages of MT.

In order to translate one language into another, one needs to understand the grammar of both languages, including morphology (the grammar of word forms) and


syntax (the grammar of sentence structure). But in order to understand syntax, one also has to understand the semantics and the lexicon (or “vocabulary”), and even something of the pragmatics of language use.

The requirement of understanding the grammar of both languages makes developing an MT system more complex than developing a TMS. This is one of the main reasons why CLs were developed, see Section 4.1 Controlled Language on page 15. According to [22], some impressive results have been achieved when combining CL with MT systems.

A wide variety of MT systems exists today. The most commonly used are Rule-Based Machine Translation (RBMT), Example-Based Machine Translation (EBMT) and Statistical Machine Translation (SMT). Previous research by Biçici and Dymetman [27] shows improved NIST1 and BLEU2 scores for a system combining a TMS with a phrase-based SMT trained on the same domain as an existing TM, compared to the stand-alone SMT and the TMS.

One of the better known translation systems, Google translate, uses an extremely large multilingual corpus.

1NIST is a method to evaluate the quality of text translated by an MT system.

2BLEU is a method used to evaluate the quality of text translated by an MT system. This metric is reported to have a high correlation with human judgments of quality.


Chapter 5

Methods

The aim of this chapter is to introduce the methods used in the algorithms developed, with the goal of achieving a more generalized TM system.

5.1 First step: White-space removal

Since no fuzzy matcher exists in Excosoft's TMS, the first step was to identify and remove unnecessary characters in existing TUs. The company's previous attempt to reduce unnecessary characters was to add a “Smart Space” catcher in their editor module, alerting a user if more than two white-spaces were typed. The problems with this feature are that it can be disabled and that previous TUs were not corrected in the TM. It was therefore a crucial step for the company to identify unnecessary characters, especially since they had embedded XML code in their TUs.

Removing these unnecessary characters could achieve a more generalized TM, which could result in fewer similar TUs.

5.1.1 Identifying unnecessary characters

The most common unnecessary characters used were different types of white-spaces.

The content of a segment stored in a TU is not affected if one or several misplaced white-spaces are moved from one XML tag to a parent XML tag; see Example 5.1 on the following page, where the white-space within the <i> tag should be moved to its parent tag, in this case the <b> tag.

Performing such corrections can be done in two ways: string manipulation with regular expressions, or DOM parsers1. DOM parsers are often used to read XML documents and validate the node tree. They also feature data extraction, easing the task of XML manipulation. Qureshi [28] states that DOM parsers are slow and that the complexity of parsing an XML document depends on the following factors: the height of the tree; the total number of elements; the total number of distinct elements; and the size of the XML document.

1A DOM parser is a standard way to process and read XML documents.


With that in mind, the preferred method in this case was string manipulation with a DOM-like approach, identifying all opening, closing and self-closing tags.

Operation            String
Input                <b>White-spaces_and<i>_HTML</i></b>
Generalized output   <b>White-spaces_and_<i>HTML</i></b>

Example 5.1. Moving white-space to achieve a generalized segment does not affect the segment's content.

5.1.2 White-space extraction

Changes applied to the input string will change the string length. When we find another misplaced white-space, we need to know where previous changes have occurred, in order to insert a white-space at the correct position.

That is why I implemented an integer array of the same size as the length of the input string; this array holds the offset information for each index position of the input string. If a character at a specific index has been removed, that index is marked as removed and the offsets of all following characters to the right of the index are decremented by one, since the input string now contains one character less, the removed white-space character. An algorithm then checks the parent XML tag to the right or to the left of the index, depending on whether we are currently checking for a white-space at the beginning or at the end of an XML tag.

If a white-space or any other text already exists there, no white-space needs to be inserted again. Otherwise the currently checked parent might be “empty”; by empty, we mean that the XML tag does not contain any characters or white-space.

The best approach is to check for multiple white-spaces before a white-space extraction is executed. This eliminates the need for multiple white-space filtering after the extraction is done. Processing unwanted white-spaces in this order eliminates cases that the reverse order, white-space extraction followed by multiple white-space removal, would miss. See Example 5.2, where some misplaced white-spaces are not removed and require another iteration of white-space extraction.
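The transformation in Example 5.1, together with the multiple white-space filtering, can be sketched with regular expressions. This is an illustration only: as described above, the actual implementation works on an offset array rather than regex substitution, and the fixed-point loop below simply re-runs until nothing changes, mirroring the re-run required in Example 5.2.

```python
import re

# Illustrative sketch: white-space just inside an inline tag is moved
# to the outside of that tag, and runs of white-space are collapsed,
# until a fixed point is reached.
LEADING = re.compile(r"(<\w+[^>]*>)(\s+)")   # space right after an opening tag
TRAILING = re.compile(r"(\s+)(</\w+>)")      # space right before a closing tag

def extract_whitespace(xml):
    prev = None
    while prev != xml:
        prev = xml
        xml = LEADING.sub(r"\2\1", xml)      # move the space before the tag
        xml = TRAILING.sub(r"\2\1", xml)     # move the space after the tag
        xml = re.sub(r"\s{2,}", " ", xml)    # collapse multiple white-spaces
    return xml

print(extract_whitespace("<b>White-spaces and<i> HTML</i></b>"))
# <b>White-spaces and <i>HTML</i></b>
```

Because a moved space can end up just inside the parent tag, a single pass is not enough in general, which is exactly why the loop repeats until the string stops changing.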

Excosoft provides version control in their TMS, which means that each TU can contain different text segment versions. The latest version of each TU was used during the evaluation of the algorithms.

5.2 Identifying Named Entities

The second step was to identify entities in existing TUs and tag them with relevant tags. This procedure is currently done by the users. Users who have done the


Operation                      String                         Status
Input                          <p><b><i>___Loop</i></b></p>
Output                         <p><b><i>__Loop</i></b></p>    One white-space is removed.
Input (multiple white-space)   <p><b><i>__Loop</i></b></p>
Output (multiple white-space)  <p><b><i>_Loop</i></b></p>     Multiple white-spaces are replaced by one.
Action                         White-space is extracted outside of the current XML tag, causing an un-allowed white-space case. This forces us to re-run the algorithm.

Example 5.2. Improper white-space elimination order.

most work tagging entities in a TM will achieve the highest re-use, allowing future documents to find more full matches.

During the development of the first step, I noticed that many entities often occurred in almost every language text segment within a TU. This led me to believe that an entity probably occurs in the same form in every language text segment; e.g. “D7000”, a camera model manufactured by Nikon, is probably spelled the same way in all languages.

To investigate whether this assumption was accurate, two approaches to identifying entities within different TMs were examined. One approach involved an off-the-shelf NER tagger developed by Stanford2, while the other involved an implementation of heuristics based on different assumptions.

5.2.1 Intra-Heuristic

My initial approach was to divide the source language text segment into words (often called tokenization3) and compare each word with all words available in all languages within a TU, which is why this heuristic will from here on be referred to as the Intra-Heuristic. If a word occurs in all available language text segments, that specific word has a high probability of being an entity.

A problem I encountered with this approach was that entities adjacent to characters without any contextual meaning in the TUs were limiting the recall of the entity recognition. In order to increase recall, I had to find a way to improve the tokenization. See Example 5.3 for a TU where undesired characters caused lower precision for the Intra-Heuristic, due to the false positive words “(1” and “=”, which occurred in all language segments. To avoid these false positives, an exclusion filter

2Stanford CoreNLP http://nlp.stanford.edu/software/corenlp.shtml

3Tokenizer - A tokenizer divides text into a sequence of tokens, which roughly correspond to "words". http://nlp.stanford.edu/software/tokenizer.shtml


was added, excluding the characters “.([,=:)]” at the beginning or at the end of each word, before every word comparison.

See the improved results in Tables 7.5 and 7.6 on pages 34 and 35.

TU Id    en               sv                  no            ge
147612   (1 = Lights on)  (1 = Ljuset är på)  (1 = Lys på)  (1 = Beleuchtung ein)

Example 5.3. A TU with undesired characters, causing false positive matches.
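The Intra-Heuristic and its exclusion filter can be sketched as follows, using the TU from Example 5.3. This is a minimal illustration, not the thesis implementation.

```python
# Minimal sketch of the Intra-Heuristic: a word that survives the
# exclusion filter and occurs in every language segment of a TU is
# treated as a likely entity.
STRIP = ".([,=:)]"   # characters excluded at the start/end of each word

def intra_heuristic(segments):
    tokenized = [{word.strip(STRIP) for word in seg.split()} - {""}
                 for seg in segments]
    return set.intersection(*tokenized)

tu = ["(1 = Lights on)", "(1 = Ljuset är på)",
      "(1 = Lys på)", "(1 = Beleuchtung ein)"]
print(sorted(intra_heuristic(tu)))  # ['1']; "(1" and "=" are filtered out
```

Without the exclusion filter the false positives "(1" and "=" would appear in every segment and be flagged as entities; with it, only the digit survives.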

5.2.2 Inter-Heuristic

Another idea came to mind during the evaluation of the Intra-Heuristic. Given a TM without a fully optimized generalization engine, it is very likely that similar non-generalized TUs are created over time. These similar TUs may differ in a single word, which might be a potential entity that also needs to be tagged in order to generalize the TU.

A possible way to identify these entities is to compare all source language words in a TU with the source language text segments of all existing TUs with the same word length. The word length can be calculated by tokenizing a text segment into “words”.

See the previously mentioned Example 3.2 on page 11, where the Inter-Heuristic algorithm is able to list the digits 3, 4, 6 and 8 as potential entities that could be tagged. In the given example, the entities could also be matched with a simple regular expression matching digits only. The Inter-Heuristic is, however, also able to identify other kinds of words; see Example 5.4 on this page.

TU ID    en
574212   ABB has a huge database.
574219   Casco has a huge database.
574226   Raysearch has a huge database.

Example 5.4. Basic example of a non-generalized TM

See the result in Table 7.7 on page 35.
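The Inter-Heuristic can be sketched in the same way, using the TM from Example 5.4; again, this is a minimal illustration rather than the actual implementation.

```python
from itertools import combinations

# Minimal sketch of the Inter-Heuristic: source segments with the same
# word length are compared pairwise; when exactly one word position
# differs, the differing words are potential entities.
def inter_heuristic(segments):
    candidates = set()
    for a, b in combinations([s.split() for s in segments], 2):
        if len(a) != len(b):
            continue
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:
            candidates.update((a[diffs[0]], b[diffs[0]]))
    return candidates

tm = ["ABB has a huge database.",
      "Casco has a huge database.",
      "Raysearch has a huge database."]
print(sorted(inter_heuristic(tm)))  # ['ABB', 'Casco', 'Raysearch']
```

Unlike a digits-only regular expression, this comparison also surfaces company names and other arbitrary words, which is the point made about Example 5.4 above.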

5.2.3 Stanford NER

To compare the results achieved with the different heuristics, the same datasets were evaluated with Stanford NER. Stanford provides six different models trained on a mixture of different domains, such as ACE, MUC-6, MUC-7, CoNLL, Wikiner, Ontonotes and English Extra. The provided models have the


ability to identify different entity classes4 listed below.

3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Time, Location, Organization, Person, Money, Percent, Date

Using an English caseless 3 class model, I was able to achieve high precision but a lower recall than desired. To see if more entities could be identified, the following Stanford NER models were used to evaluate the datasets: english.all.3class.caseless, english.all.3class.distsim, english.nowiki.3class.caseless, english.conll.4class.caseless, english.muc.7class.caseless and english.muc.7class.distsim. See the results in Table 7.9.

4Stanford NER - http://nlp.stanford.edu/software/CRF-NER.shtml


Chapter 6

Design and Implementation

This chapter describes the implementation of the prototype developed for the company.

The goal of the prototype was to offer a quality assessment tool for the existing application developed by the company. Two desired requirements for the prototype were support for loading quality filters at runtime and ease of creating new filters. This was solved by developing the prototype as an Eclipse plug-in, extending a predefined extension point.

6.1 Implementation of Eclipse Plug-in

Eclipse applications are based on a core called the workbench. The workbench can be extended with a set of views, perspectives, menu contributions, key bindings, etc. through an extension point. These extensions can either be part of the main project or developed as separate Eclipse plug-in projects. Eclipse plug-in projects can provide contributions (extensions) to pre-defined extension points. During the implementation phase the company was migrating their application to Eclipse 4, which was one of the main reasons I chose to develop the prototype as an Eclipse plug-in. The Eclipse workbench manages the attached extensions. An extension point in the main application was created to enable extensions. Extended functionality is displayed depending on whether any extensions are attached to the predefined extension point. I chose to keep the prototype separate from the main application to ease the company's future migration to Eclipse 4. Existing functions of the graphical user interface were refactored: the table for displaying matching TUs; the search filter; the language filter; the project filter; and the status filter.

The Glazed Lists library [ca.odell.glazedlists]1 was used to provide filtering of TUs from a source EventList. All TUs in an existing TM were loaded into the EventList. The library provides a thread-safe EventList which can be used without calling Java lock() and unlock() synchronization methods.

A FilterList is generated depending on the filters selected. A MatcherEditor is used to enable dynamic filtering of the elements in the table. When a filter is

1http://www.glazedlists.com/
