
Estimating Post-Editing Effort with Translation Quality Features

Oscar Sagemo

Uppsala University

Department of Linguistics and Philology
Språkteknologiprogrammet (Language Technology Programme)

Bachelor’s Thesis in Language Technology
June 15, 2016

Supervisors:

Sara Stymne, Uppsala University


Abstract

The field of Quality Estimation aims to predict translation quality without reference translations, through machine learning. This thesis investigates using Quality Estimation (QE) to predict post-editing effort at the sentence level by focusing on the impact of features. Regression models are trained on two datasets emulating real-world scenarios, separated by the availability of post-edited training data. One dataset consists of English-German translations and post-editions, annotated with HTER; the other dataset consists of Swedish-English translations and reference translations, annotated with TER.

A total of 16 features, including novel features measuring SMT reordering and English-German noun translation errors, are proposed and individually evaluated for the two scenarios. The best performing feature vectors for each scenario are both able to surpass a commonly used baseline set, with the biggest impact observed on the Swedish-English dataset.

The possibility of estimating post-editing effort without utilising post-edited data is also explored. A correlation between the two is indicated by comparing human perceptions of post-editing effort with gold-standard and predicted TER scores. Even though the gold-standard values come close, the predicted scores still need improvement to depict post-editing effort.


Contents

Acknowledgements

1 Introduction
   1.1 Purpose
   1.2 Outline

2 Background
   2.1 Quality Estimation
   2.2 Conference on Statistical Machine Translation: Shared Task

3 Method
   3.1 Framework
   3.2 Baseline system
   3.3 Feature extraction
       3.3.1 Tools
   3.4 Machine Learning
   3.5 Evaluation
       3.5.1 Post-editing effort
   3.6 Data
       3.6.1 WMT
       3.6.2 SQE
       3.6.3 Pre-processing

4 Proposed features
   4.1 Reordering measures
   4.2 Grammatical correspondence
   4.3 Structural integrity
   4.4 English-German Noun Translation Errors

5 Results and discussion
   5.1 Preliminary experiments
       5.1.1 Machine Learning
       5.1.2 Parameterising ratio features
   5.2 Feature performance
   5.3 WMT Scenario
       5.3.1 Shared task results
   5.4 SQE Scenario
       5.4.1 Human annotation

6 Conclusion and future work

A Features

Bibliography


Acknowledgements

Firstly, I would like to thank my supervisor Sara Stymne, for her invaluable knowledge, advice and guidance throughout the thesis work, as well as a treasured cooperation with the shared task. I would also like to thank Nils-Erik Lindström and the TTS team at Semantix, for providing me with a bearing, material to work with as well as first-class encouragement from day one, and Marie Widell at Semantix for lending me your talent through the annotations.

Thank you to the organisers of the 2016 WMT shared task in quality estimation for material and your commitment. Thank you to the developers of QuEst++ (Lucia Specia, Gustavo Henrique Paetzold and Carolina Scarton) for opening the doors to quality estimation for the community. Thank you Jakob for constantly being on my back and thank you to everyone else who helped me along the way.


1 Introduction

Machine Translation (MT) usage is becoming more widespread every day, with readily available online systems for fast look-ups as well as increasingly convenient frameworks allowing users to train their own engine with nothing but a set of translations to train it with. Despite the surge in popularity, MT has not yet been able to consistently deliver fully automatic high quality translations.

A human in the loop is therefore needed to either cognitively process the output in order to make sense of it, or manually correct it in order to reach publishable quality. Rather than competing with manual translations as supplied by language service providers, MT has thus been adopted as a beneficial tool in the arsenal of professional translators.

In the translation industry, Computer-Aided Translation (CAT) tools have grown to be the standard practice, where translators routinely make use of Translation Memories (TM), to store past translations and present matching entries when found in new input text, effectively speeding up the workflow.

Whenever a matching entry is found it is commonly presented to the translator accompanied by a match score, conveying its correspondence with the input.

Incorporating MT into this framework proved intuitive: an efficient approach was to complement TM matches with MT suggestions, leaving the translator with the task of post-editing the translation. Whether or not post-editing MT takes less effort than translating from scratch depends on the quality of the translation, which at present must be assessed by the translator.

Assessing Machine Translation (MT) quality is a challenging task, whether attempted by human or machine. The underlying problem posing this challenge is the multivalence of natural language: there is no single correct translation output for any given input. Determining the quality of a translation is therefore commonly reduced either to subjective human judgement based on pre-decided metrics like fluency, adequacy or ranking, or to calculating a score from an automatic comparison with a provided reference translation, as with BLEU (Papineni et al., 2002) or NIST (Doddington, 2002). The primary goal of this type of evaluation is to compare an MT system with earlier versions of itself or with other similar systems; coupled with the fact that providing human evaluators or reference translations requires resources, such evaluation is thus mainly done by developers of MT systems.

As machine translation usage has seen a vast increase in popularity while the quality still varies, the need for quality assessment grows for the end user, who normally has very limited resources. This poses the problem of assessing quality without reference translations. Quality Estimation aims to solve this as a prediction task, where feature vectors of translation quality indicators (features) are associated with quality labels through machine learning, in order to be able to estimate quality scores of machine translations at run-time. Different applications for this estimated score include determining if a translation is sufficient for gisting, ranking several translations in order to select the best one, or deciding whether or not a sentence is worth post-editing. Ideally, the quality label is tailored to the intended application.

This thesis focuses on the post-editing application of quality estimation, where most modern approaches set the task as a regression problem: attempting to accurately predict an edit distance-based quality label by representing translations with feature vectors. The performance of such approaches relies on determining and extracting features that correlate strongly with the proposed quality label.

1.1 Purpose

This thesis investigates sentence-level quality estimation for post-editing machine translations by exploring the impact of a range of features aiming to convey translation quality. Both pre-existing and original features will be investigated on two different datasets representing different real-world scenarios where quality estimation for post-editing could be utilised, separated by language pair and the availability of post-edited data.

The first scenario, with post-edited data (henceforth, WMT), is based on data provided by the Conference (previously Workshop) on Statistical Machine Translation for the shared task in QE to which a submission will be made. The second scenario, without post-edited data (henceforth, SQE), is based on data provided by Semantix, a real language service provider in the early stages of employing post-editing MT as a translation method.

Through feature-focused experiments, I hope to make contributions by presenting novel features measuring reordering and language-specific noun translation errors, as well as by investigating the possibility of estimating post-editing effort without using post-edited data.

1.2 Outline

The remainder of this thesis is structured as follows: Chapter 2 presents the relevant previous research in Quality Estimation and introduces the Conference on Statistical Machine Translation. Chapter 3 describes the framework adopted to explore feature impact, and introduces and motivates the learning methods used as well as the data. Chapter 4 presents and motivates the features proposed for the two scenarios, both as adopted from previous quality estimation experiments and as originally crafted for this thesis. The results of the features as adapted for both scenarios are displayed and discussed in Chapter 5. Lastly, conclusions and suggestions for future work are presented in Chapter 6.


2 Background

In this chapter, the requisite background for this thesis is presented as well as an introduction to the Conference on Statistical Machine Translation.

2.1 Quality Estimation

Early work in Quality Estimation (QE) was originally focused on assessing the confidence in any MT output for a given MT system. The benefits of estimating Confidence Measures for Speech Recognition had been studied extensively (Jiang, 2005), but this was not as common in other NLP fields. Motivated by the possibility of reference-free MT evaluation and the usefulness of confidence measures in Speech Recognition, Blatz et al. (2004) conducted the first comprehensive experimental study of possible methods for Confidence Estimation (CE) for Machine Translation. They explored different techniques of associating feature vectors representing translations with quality labels based on automatic MT evaluation metrics at the sentence and word level.

A total of 91 features were proposed for the sentence-level CE experiments, a large portion of which were derived from information extracted from the MT system used to produce the translations; such features later came to be referred to as confidence features. They set the objective as a binary classification problem of correct/incorrect translations, obtained by thresholding the scores from the machine learning (ML) algorithms Naive Bayes and Multi-layer Perceptron.

The response variables, or quality labels, proposed were NIST (Doddington, 2002) and a modified version of WER, described and motivated in Blatz et al. (2003).

This methodology was expanded by Quirk (2004), who used a similar set of features to train a variety of classifiers on manually annotated data and found that models trained on small human-tagged datasets outperformed those trained on large automatically-tagged datasets.

Specia et al. (2009) explored the use of Partial Least Squares to estimate continuous scores, as opposed to binary classes, over both automatically and manually tagged datasets. Furthermore, they proposed a wide array of new relevant features, separated by their requirement for system-dependent information, as well as a method to identify relevant features in a systematic way. The term Quality Estimation then arose as a means to incorporate system-independent (black-box) and system-dependent (glass-box) features under the same name (Felice and Specia, 2012).

Specia and Farzindar (2010) set the task of using QE to filter out translations unfit for post-editing and explored the usage of Translation Edit Rate (TER) and Human-targeted Translation Edit Rate (HTER) (Snover et al., 2006) as quality labels. They were able to obtain good performance by computing HTER over a small set of machine translations and their post-editions.

TER and HTER are error metrics that aim to calculate the amount of editing a human would need to perform on a machine translation for it to match a reference translation. They are derived from the same formula, presented in Equation 1, but differ in the type of reference translation used: HTER uses a post-edited version of the machine translation, making it a favourable metric for measuring post-editing effort, while TER uses an arbitrary reference translation, thus depicting general MT quality.

TER = \frac{\text{number of edits}}{\text{average number of reference words}}    (1)

where possible edits include insertions, deletions, substitutions and shifts, and all operations have equal cost.
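To make the computation concrete, the following is a minimal sketch of a TER-style edit rate in Python. It only counts insertions, deletions and substitutions via word-level edit distance; the shift operation and the heuristics of the real TER toolkit are omitted, so the function illustrates Equation 1 rather than reproducing the official score.

```python
def edit_rate(hypothesis, reference):
    """TER-style edit rate: word-level edit distance divided by the reference length.

    Simplified illustration of Equation 1: shifts are not modelled,
    unlike in the actual TER toolkit.
    """
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = minimum number of edits to turn hyp[:i] into ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(hyp)][len(ref)] / len(ref) if ref else 0.0

print(edit_rate("the cat sat", "the black cat sat"))  # 1 edit / 4 reference words = 0.25
```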

2.2 Conference on Statistical Machine Translation:

Shared Task

The conference on Statistical Machine Translation (WMT) is a yearly event, initially held in conjunction with the annual meeting of the Association for Computational Linguistics (ACL)1. Prior to each workshop, shared tasks in different fields of MT are conducted with the end goal of advancing the field.

Tasks in quality estimation have been included since 2012 (Callison-Burch et al., 2012), and the shared task has been the main forum for quality estimation development ever since.

The shared tasks in QE are commonly divided into sub-tasks, focusing on various units of translations. The different units considered for the 2016 shared task are documents, words, phrases and sentences. For the purpose of this thesis, the focus is on the sentence-level task, to which a submission was made based on the insights and experiments presented in this thesis.

The quality label proposed for this year’s task is HTER, which sets the focus on predicting the post-editing effort needed to correct a translation. Quality labels for past years include post-editing time and different types of human annotations.

1http://www.aclweb.org/


3 Method

This chapter describes the method used to explore the impact of features for the two separate scenarios, by presenting a baseline, describing and motivating metrics chosen for the feature extraction and machine learning implementations as well as presenting the data used.

3.1 Framework

The first scenario considered (WMT) emulates QE as approached with post-edited data at hand, by utilising the translations provided for the shared task, annotated with HTER scores. Participation in the WMT shared task involves submitting predictions from one final system, which imposes a performance-focused methodology. In a feature-driven approach, this translates to finding and extracting features that correlate strongly with the proposed quality label, as established by performance metrics. This approach was transferred to the second scenario, QE as approached without previous post-edited data (SQE), with the added step of constructing a dataset of translations annotated with TER.

A total of 16 features aiming to convey translation quality were proposed and tested for inclusion, of which two were specifically designed for the English-German translation direction and thus only tested on the WMT dataset. The features are described and motivated in Chapter 4.

Initial tests consisted of measuring the features’ individual performance in combination with a baseline set, in order to sort out the features with an overall negative impact on each dataset. The features with a positive impact were then concatenated and measured, one by one, to form the feature vector resulting in the best performance.

3.2 Baseline system

For the WMT shared task, in order to establish a common ground for measurement, a baseline system trained with 17 features is provided for the participants.

These 17 baseline features (b17) are used as the foundation for all feature sets in this thesis and performance is measured in relation to the baseline system.

The same features have been used for all five years of shared tasks in QE; the organisers note:

“... although the system is referred to as ’baseline’, it is in fact a strong system. It has proved robust across a range of language pairs, MT systems, and text domains” (Bojar et al., 2015, p. 14).

The features quantify the complexity of the source sentence and the fluency of the target sentence, by utilising corpus frequencies, Language Model (LM) probabilities and token counts. A full list of these features is included in Appendix A.

3.3 Feature extraction

All instances in the training data are converted to feature vectors, by computing the various measurements as specified in the feature sets, and storing them in a predefined order.

For instance, let the following three features define a feature set:

• number of tokens in the source sentence

• average source token length

• LM log probability of target sentence

The features would then be extracted from a translation and stored in a feature vector in that order. Example (2) shows an English-German translation from the WMT16 dataset and its feature representation as defined by the feature set above.

(2) Click Repair All . = Klicken Sie auf " Alles reparieren . "

[4.0, 3.75, -10.551]
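For illustration, a minimal sketch of this extraction step is given below; the `lm_log_prob` argument is a hypothetical stand-in for a trained language model scorer (a role played by QuEst++'s LM resources in practice), while the first two features are plain counts.

```python
def extract_features(source, target, lm_log_prob):
    """Build the three-feature vector of the example feature set, in a fixed order."""
    src_tokens = source.split()
    num_src_tokens = float(len(src_tokens))
    avg_src_token_length = sum(len(t) for t in src_tokens) / num_src_tokens
    target_lm_log_prob = lm_log_prob(target)  # hypothetical LM scorer
    return [num_src_tokens, avg_src_token_length, target_lm_log_prob]

# Toy usage with a dummy LM scorer; a real system would query a trained LM.
vector = extract_features("Click Repair All .",
                          'Klicken Sie auf " Alles reparieren . "',
                          lm_log_prob=lambda sentence: -10.551)
print(vector)  # [4.0, 3.75, -10.551], matching Example (2)
```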

In order to apply a machine learning model to a new data instance, the instance must first be converted into the same feature-vector representation used to train the model. Therefore it is important to be able to extract features consistently and automatically. To this end, the open-source QuEst++ (Specia et al., 2015) toolkit is employed in this thesis. The toolkit incorporates feature extraction and machine learning and provides both pre-defined feature extraction modules and interfaces allowing convenient implementation of new modules.

3.3.1 Tools

A majority of the proposed features require different linguistic analyses of the source and target segments. To this end, several well-known NLP tools were utilised and merged with the QuEst++ pipeline, through wrappers, where possible. The following tools were used to extract the features:

• A modified version of the QuEst++ framework, with processors and features added and modified where needed, used to extract a majority of the features.

• Fast Align (Dyer et al., 2013) was used to generate word alignment files.


• Berkeley Parser (Petrov et al., 2006), trained with the included grammars for English and German and the Talbanken part of SUC (Nivre and Megyesi, 2007), to extract the phrase structure-based features.

• SRILM (Stolcke, 2002) to train a Part-Of-Speech (POS) Language Model (LM) over the training dataset as well as to compute all LM-based segment probabilities and perplexities.

• TreeTagger (Schmid, 1994), trained with the included models for English and German, and HunPos (Halácsy et al., 2007), trained on SUC (Megyesi, 2009), to obtain all POS-related features.

3.4 Machine Learning

The most prevalent algorithm in regression-based QE in the literature is Support Vector Machine (SVM) regression, which has also been employed for the baseline systems of the shared tasks. After brief initial experiments, described in Section 5.1.1, it was adopted as the main ML algorithm used in this thesis, as implemented by LibSVM (Chang and Lin, 2011) in QuEst++. A Radial Basis Function (RBF) kernel is used, together with grid-search optimisation of the cost, epsilon and gamma parameters based on 5-fold cross-validation over the training set.
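The setup can be approximated with scikit-learn, whose SVR implementation is also backed by LibSVM. The sketch below is only an assumption-laden illustration of the RBF kernel and 5-fold grid search; the parameter grids and data are placeholders, not the values used in the thesis.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# X: feature vectors (n_samples x n_features), y: HTER/TER quality labels.
# Random data stands in for the real training set here.
rng = np.random.RandomState(0)
X, y = rng.rand(200, 17), rng.rand(200)

# Grid-search cost (C), epsilon and gamma with 5-fold cross-validation,
# mirroring the RBF-kernel SVM regression described above (grids are illustrative).
param_grid = {"C": [1, 10, 100], "epsilon": [0.01, 0.1, 0.5], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)
print(search.best_params_, search.predict(X[:3]))
```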

3.5 Evaluation

As per QuEst++ methodology, ML performance was measured in terms of Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), defined in Equations 3 and 4, where y_i are the predicted values and ŷ_i are the gold-standard values.

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|    (3)

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (4)

Critique of evaluation metrics

The evaluation of systems submitted to the WMT shared tasks for the first four years (2012–2015) relied on measuring MAE and RMSE as described above. In response to this, Graham (2015) conducted an analysis of the measures and pointed out possibilities of wrongly perceived performance gains from optimising predictions with respect to the error measures, for example by minimising the variance in the distribution of prediction scores. As an alternative, she proposed using the unit-free Pearson correlation r, which has a tradition in the assessment of MT evaluation metrics.

The metric measures the linear association between two variables, which for QE purposes are the predicted and gold-standard scores, and is defined as shown in Equation 5, where y_i are the predicted values and ŷ_i are the gold-standard values.


This year’s shared task used Pearson correlation as the main metric of evaluation for the first time, followed by MAE and RMSE.

r = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2} \, \sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}}    (5)
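A minimal sketch of the three evaluation measures of Equations 3-5, computed with NumPy over vectors of predicted and gold-standard scores:

```python
import numpy as np

def evaluate(predicted, gold):
    """Return MAE, RMSE and Pearson's r between predicted and gold-standard scores."""
    y, y_hat = np.asarray(predicted, dtype=float), np.asarray(gold, dtype=float)
    mae = np.mean(np.abs(y - y_hat))              # Equation 3
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))     # Equation 4
    r = np.corrcoef(y, y_hat)[0, 1]               # Equation 5
    return mae, rmse, r

print(evaluate([0.20, 0.40, 0.10, 0.55], [0.25, 0.35, 0.20, 0.50]))
```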

3.5.1 Post-editing effort

The performance, as measured by the evaluation metrics, has different implications for the two separate scenarios, as the quality labels differ.

In their comparative study, Snover et al. (2006) show that HTER, as an evaluation metric, has a high correlation with human judgement, and that TER, while giving an over-estimate of edit rates, correlates reasonably well with both human judgement and HTER.

As HTER uses post-edited reference translations, it is reasonable to assume that the score correlates well with post-editing effort. TER scores, however, use arbitrary reference translations, which results in a lower correlation with post-editing effort. Therefore, a well-predicted TER score conveys general translation quality more than post-editing effort.

In order to validate the TER predictions in terms of post-editing effort, a professional post-editor was tasked with annotating a part of the test set with a quality score indicating the amount of editing needed to correct each segment. The quality score is defined on a scale from 1 to 5, in accordance with the proposed quality labels of the 2012 shared task in QE (Callison-Burch et al., 2012). Koponen (2012) also used the same scale in her study comparing human perceptions of post-editing effort with post-editing operations. The scores were defined as follows:

• 1: The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited, needs to be translated from scratch.

• 2: 50%–70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.

• 3: 25%–50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.

• 4: 10-25% of the MT output needs to be edited. It is generally clear and intelligible.

• 5: MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.

The annotations were then treated as classes and compared with a sorted list of the predicted TER scores by utilising a common Information Retrieval metric, Average Precision (AP) (Zhu, 2004). It is defined as shown in Equation 6, where k is the rank in the sequence of ordered TER scores, P(k) is the precision of the sublist at rank k, and rel(k) is a function returning 1 if the class at rank k matches the relevant score and 0 otherwise.


By measuring the average precision, inferences can be made about the correlation between the model’s predictions and post-editing effort. The most interesting cases are average precision for the top and bottom classes, as they convey prediction performance where it matters most.

Furthermore, the standard deviation and average score were computed over each class in order to investigate the distribution of predicted TER scores in relation to post-editing effort.

Lastly, the same metrics were applied to the gold-standard TER scores, in order to test the overall correlation between TER scores and post-editing effort.

AP = \frac{\sum_{k=1}^{n} P(k) \, rel(k)}{\text{number of instances of the class}}    (6)
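A minimal sketch of this evaluation step is shown below. Segments are ranked by predicted TER (ascending for the best class, so that low predicted edit rates come first, and descending for the worst classes), and Equation 6 is then applied to the annotated classes in that order; the ranking convention is my assumption about how the sorted list was built.

```python
def average_precision(ranked_classes, relevant_class):
    """Equation 6: average of precision-at-k over the ranks where the relevant class occurs."""
    hits, precision_sum = 0, 0.0
    for k, cls in enumerate(ranked_classes, start=1):
        if cls == relevant_class:
            hits += 1
            precision_sum += hits / k
    return precision_sum / hits if hits else 0.0

# Toy usage: (predicted TER, annotated class) pairs, ranked by ascending TER for class 5.
scored = [(32.1, 5), (40.8, 5), (45.0, 4), (51.2, 5), (70.3, 2)]
ranked = [cls for _, cls in sorted(scored)]          # best predictions first
print(average_precision(ranked, relevant_class=5))   # AP for the top class
```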

3.6 Data

The translations and their origins are presented below, separated by scenario.

3.6.1 WMT

The dataset used for the WMT scenario is provided by the organisers of the 2016 WMT shared task in Quality Estimation.

The organisers provided two different datasets that were divided between the different sub-tasks. For the purpose of this thesis, only the dataset intended as input data for the sentence-level task was utilised. The dataset spans a total of 15,000 English-German translations from the IT domain, provided by unspecified industry partners. Each entry consists of a source segment, its machine translation, a post-edition of the translation and an edit distance score (HTER) derived from the post-edited version. The dataset was split into separate training and development sets, with gold-standard scores, and a test set without, as shown in Table 3.1.

The translations were produced by a single in-house MT system, regarded as a “black-box” system since no system-dependent information was provided.

These translations were then post-edited by professional translators. To capture the post-editing effort, HTER scores were computed between each MT translation and the corresponding post-edited version using the TER toolkit (Snover et al., 2009).

In addition to the translations, participants were also provided with language models, word-based translation models and raw n-gram frequency counts, collected from an IT domain-specific English-German parallel corpus which itself was not provided. These complementary files were provided to aid in extracting the baseline features (see Appendix A).

3.6.2 SQE

The dataset for the Semantix QE scenario (SQE) consists of 28,398 Swedish-English translations from the public sector domain, provided by Semantix.

Each entry consists of a source segment, its machine translation and an edit distance score (TER) derived from a reference translation. The dataset is split into separate training and testing sets, as shown in Table 3.1.


The translations were produced by a domain-specific MT system trained with in-house human translations using Microsoft Translator Hub (https://hub.microsofttranslator.com). No system-dependent information is accessible through the Microsoft Translator Hub framework.

Dataset      Segments
WMT-train    12,000
WMT-dev      1,000
WMT-test     2,000
SQE-train    21,000
SQE-test     7,100

Table 3.1: Size and division of the datasets

Constructing a dataset

The data used to construct the SQE dataset consists of reversed English-to-Swedish human translations, since all available in-domain Swedish-English translations had already been used to train the MT system, making them unsuitable candidates. After collecting the translations, the Swedish source side was re-translated with the in-house MT system, resulting in a set of Swedish source segments, their English machine translations and English reference translations. A TER score was computed between the machine-translated and reference segments, using the TER toolkit, to be used as quality labels.

Ideally, reference translations used for MT evaluation should consist of human translations. Studies have shown (Lembersky et al., 2012; Kurokawa et al., 2009) that the translation direction of training data for SMT systems impacts the performance, due to unique characteristics of translated language.

It is plausible to assume that reversing the translation direction also has an impact when used as training data for QE systems, as the reference translations are consequently derived from source texts and vice versa.

A better solution could have been to re-train the MT system after removing a part of the training data to be used as data for the QE system. However, this was not attempted due to time constraints.

In addition to the translations, I also used language models, word-based translation models and n-gram frequency counts, computed from the Swedish-English MT training data.

3.6.3 Pre-processing

As the SQE dataset consisted of unfiltered human-translated segments of varying lengths, the following cleaning steps were taken:

• Deleted segments with two or fewer source or target words

• Deleted segments with more than 80 source or target words


• Deleted segments containing links or phone numbers

• Deleted segments consisting of 50% or more numbers

• Deleted segments containing markup tags

• Swapped any series of two or more whitespace characters with one

• Randomized the order of the segments and split into 75% training and 25% testing

Before feature extraction, both datasets were tokenised using the tokenisation scripts included in the open-source SMT toolkit Moses (Koehn et al., 2007). Additionally, the same toolkit was used to truecase the SQE dataset, returning the initial words in each segment to their most probable casing, as this improved the performance of the baseline system. Truecasing was not applied to the WMT dataset, as the truecasing script utilises a corpus to compute case probabilities, and no corpora matching the domain-specific content were provided or found elsewhere.
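For concreteness, a minimal sketch of the cleaning rules listed above is given here; the regular expressions for links, phone numbers and markup are my own approximations, as the thesis does not specify the exact patterns used.

```python
import re

def keep_segment(source, target, min_tokens=3, max_tokens=80):
    """Apply the SQE cleaning rules to one source/target segment pair."""
    for segment in (source, target):
        tokens = segment.split()
        if len(tokens) < min_tokens or len(tokens) > max_tokens:
            return False
        if re.search(r"https?://|www\.", segment):                 # links (approximate)
            return False
        if re.search(r"\+?\d[\d\s()/-]{6,}\d", segment):           # phone numbers (approximate)
            return False
        if re.search(r"<[^>]+>", segment):                         # markup tags
            return False
        if sum(t.isdigit() for t in tokens) / len(tokens) >= 0.5:  # 50% or more numbers
            return False
    return True

def normalise_whitespace(segment):
    """Collapse any run of two or more whitespace characters into a single space."""
    return re.sub(r"\s{2,}", " ", segment).strip()
```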


4 Proposed features

This chapter lists and motivates the features used in this thesis, separated by category of information conveyed.

A total of 14 features were proposed for both scenarios; they were selected to capture both the sources and the results of difficulties for SMT systems, by quantifying reordering, grammatical correspondence and structural integrity. Additionally, two language-specific features were proposed for the WMT scenario, quantifying noun translation errors from English to German.

4.1 Reordering measures

Reordering is problematic for MT in general, and especially so when the placement of verbs differs between languages. English, German and Swedish all follow an SVO pattern in simple sentences, but differ in verb placement in e.g. subordinate clauses and questions.

Three metrics that measure the amount of reordering done by the MT system were explored, to investigate a correlation between SMT reordering and quality labels. All metrics are based on alignments between individual words.

• Crossing score: the number of crossings in alignments between source and target

• Kendall Tau distance between alignments in source and target

• Squared Kendall Tau distance between alignments in source and target

The crossing score was suggested by Genzel (2010) for SMT reordering, and Tau was suggested by Birch and Osborne (2011) for use in a standard metric with a reference translation. To my knowledge, this thesis presents the first usage of these measures for quality estimation.

The features are computed by counting crossing link pairs in a word alignment file, where the number of crossing links considers crossings of all lengths.

The Squared Kendall Tau Distance (SKTD) is defined as shown in Eq. 7.

SKTD = 1 - \sqrt{\frac{|\text{crossing link pairs}|}{|\text{link pairs}|}}    (7)
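A minimal sketch of the three reordering measures over a word alignment, represented as (source index, target index) link pairs, is given below; normalising the Kendall Tau distance by the total number of link pairs is my reading of the definitions above, so the exact normalisation may differ from the implementation used in the thesis.

```python
from itertools import combinations
from math import sqrt

def reordering_measures(alignment):
    """Crossing score, Kendall Tau distance and SKTD over 1-1 alignment links."""
    links = sorted(alignment)                        # (source_index, target_index) pairs
    link_pairs = list(combinations(links, 2))
    crossings = sum(1 for (s1, t1), (s2, t2) in link_pairs
                    if (s1 - s2) * (t1 - t2) < 0)    # link pairs that cross
    total = len(link_pairs)
    tau_distance = crossings / total if total else 0.0
    sktd = 1 - sqrt(tau_distance)                    # Equation 7
    return crossings, tau_distance, sktd

# Toy usage: the last two source words are swapped in the target.
print(reordering_measures([(0, 0), (1, 2), (2, 1)]))  # (1, 0.333..., 0.422...)
```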

The amount of reordering, as measured by these features, can suffice to indicate irregularities in reordering through the learning methods. However, since the features rely simply on counting crossings in 1-1 alignments, they may also introduce noise. All the reordering measures only capture the difference in word order, in a language-independent way. For a specific language pair like English–German or Swedish–English, it would be useful to be able to measure known word-order divergences, like verb placement, through more carefully designed and targeted measures. A better solution could be to adapt the feature to the expected reordering for specific translation directions and to quantify it based on infringements of word-order expectations.

4.2 Grammatical correspondence

Features measuring the relationship between different constituents in source and target are useful for measuring translation adequacy, i.e. whether or not certain elements of structure and meaning were conveyed in the translation.

Several features quantifying grammatical discrepancy are explored, mainly measured in terms of occurrences of syntactic phrases or part-of-speech (POS) tags.

In modeling syntactic structure for this thesis, constituency parsers based on Probabilistic Context-Free Grammars (PCFGs) were employed. PCFGs are sets of language-specific grammar rules as learned from syntactically annotated corpora accompanied by probabilities of production.

• Ratio of percentage of verb phrases between source and target

• Ratio of percentage of noun phrases between source and target

• Ratio of percentage of nouns between source and target

• Ratio of percentage of pronouns between source and target

• Ratio of percentage of verbs between source and target

• Ratio of percentage of tokens consisting of alphabetic symbols between source and target

The relationship of token types, e.g. parts of speech, is commonly parameterised as a ratio of percentages (Specia et al., 2011; Felice and Specia, 2012), which normalises the token counts by sentence length. However, normalising syntactic phrase counts by the total number of phrases is not as intuitive, as syntactic constructions vary between languages and phrase structure rules vary between different PCFGs. Therefore, different means of parameterising the relationship between syntactic constituents were briefly explored; they are presented in Section 5.1.2.

4.3 Structural integrity

Measuring the structural integrity of the source segment is intended to convey translatability, based on the assumption that ill-formed sentences are more difficult to translate. The structural integrity of the translated target segment conveys output fluency, i.e. how well-formed and fluent the sentence is in the target language.


Features measuring well-formedness as conveyed by syntactic parse trees were explored for both source and target. Additionally, POS language models were utilised for the target segment.

• Source PCFG average confidence of all possible parses in the parser n-best list

• Source PCFG log probability

• Target PCFG log probability

• LM log perplexity of POS of the target

• LM log probability of POS of the target

Avramidis et al. (2011) proposed utilising PCFG probabilities and confidences as features, and POS language models showed promising results in the work of Tezcan et al. (2015). Small n-best list sizes (1–3) were used for the confidence feature due to difficulties in generating more parse trees for several of the input segments in the WMT dataset. The POS language models were trained with an order of 4, over the target side of the training data for each scenario separately.
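To illustrate the POSLM features, the toy model below trains an add-one-smoothed bigram language model over POS tag sequences and scores a target tag sequence; the thesis uses a 4-gram SRILM model instead, so this sketch only shows where the log probability and perplexity values come from.

```python
from collections import Counter
from math import log10

class BigramPosLM:
    """Toy add-one-smoothed bigram LM over POS tags (the thesis uses SRILM, order 4)."""

    def __init__(self, tagged_corpus):
        self.bigrams, self.contexts, self.vocab = Counter(), Counter(), set()
        for tags in tagged_corpus:
            padded = ["<s>"] + tags + ["</s>"]
            self.vocab.update(padded)
            self.contexts.update(padded[:-1])
            self.bigrams.update(zip(padded, padded[1:]))

    def log_prob(self, tags):
        padded = ["<s>"] + tags + ["</s>"]
        return sum(log10((self.bigrams[(prev, tag)] + 1) /
                         (self.contexts[prev] + len(self.vocab)))
                   for prev, tag in zip(padded, padded[1:]))

    def perplexity(self, tags):
        return 10 ** (-self.log_prob(tags) / (len(tags) + 1))

lm = BigramPosLM([["DT", "NN", "VBZ"], ["DT", "JJ", "NN", "VBZ"]])
print(lm.log_prob(["DT", "NN", "VBZ"]), lm.perplexity(["DT", "NN", "VBZ"]))
```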

4.4 English-German Noun Translation Errors

Capturing common translation errors is intended as a direct measure of results of MT difficulties. However, such features need to be defined with individual language pairs in mind, and are therefore expensive to craft.

Two novel features attempting to capture such errors in the direction English-German are explored for the WMT dataset.

• Ratio of Noun groups between source and target

• Ratio of Genitive constructions between source and target

In previous work on English–German SMT (Stymne et al., 2013), it is noted that the translation of noun compounds is problematic. It is common for English compounds, which are written as separate words, to be rendered as separate words or genitive constructions in German, instead of the idiomatic closed compound.

Compounds tend to be common in technical domains, such as IT. The following is an example of a mistranslated compound noun flagged by this feature, from the WMT data:

English–German: footnote structure → Fußnote Struktur (idiomatic compound: Fußnotenstruktur)

Since split compound nouns are a common translation error in German machine translations, a feature looking for sequences of nouns in the target text was implemented. The feature looks for noun groups in both source and target and is computed as the ratio of noun groups, where a noun group is defined as an occurrence of a sequence of two or more nouns.


Another common rendering of compounds is the genitive construction, which can be over-produced in German. A feature that looks for possible genitive constructions in source and target was designed; it is computed as the ratio of genitive constructions, which are defined as follows:

German: Any noun or proper noun preceded by a noun and the genitive article des/der.

English: Any noun or proper noun preceded by a noun and the possessive clitic ’s or the possessive preposition of.

Note that these patterns could also match other constructions since “of” can have other uses and “der” is also used for masculine nominative and feminine dative.
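A minimal sketch of the noun group count underlying the first of these ratio features is shown below, over POS-tagged tokens; the tag predicates are simplifications (the real features rely on TreeTagger and HunPos tags), and a genitive-construction counter would follow the same pattern with a different matcher.

```python
def count_noun_groups(tags, is_noun):
    """Number of maximal runs of two or more consecutive noun tags."""
    groups, run = 0, 0
    for tag in tags + [None]:                       # None flushes the final run
        if tag is not None and is_noun(tag):
            run += 1
        else:
            if run >= 2:
                groups += 1
            run = 0
    return groups

def ratio(source_count, target_count):
    return source_count / target_count if target_count else 0.0

# Toy usage with Penn-Treebank-style English tags and STTS-style German tags (assumed tagsets).
en_tags = ["VB", "NN", "NN", "."]                   # e.g. "Click footnote structure ."
de_tags = ["VVFIN", "PPER", "NN", "NN", "$."]       # e.g. "... Fußnote Struktur ."
print(ratio(count_noun_groups(en_tags, lambda t: t.startswith("NN")),
            count_noun_groups(de_tags, lambda t: t in {"NN", "NE"})))  # 1.0
```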


5 Results and discussion

This chapter presents the results of the preliminary experiments, regarding ML algorithms and ratio parameterisation, the impact of the proposed features for both scenarios separately, as well as additional evaluation for both scenarios.

5.1 Preliminary experiments

These brief experiments were carried out in order to make initial decisions on which to base all further experimentation.

5.1.1 Machine Learning

The Orange toolkit (Demšar et al., 2004) was used to compare six ML algorithms for the baseline features on the development set from the WMT scenario; the results are presented in Table 5.1. The algorithms were chosen based on the implementations available in the Orange toolkit.

ML Algorithm                                     MAE      RMSE
SVM Regression                                   13.942   19.814
Random Forest (RF) Regression                    15.527   20.159
Univariate Regression                            14.089   19.324
Stochastic Gradient Descent (SGD) Regression     22.012   29.876
Regression Tree                                  20.485   27.028
Linear Regression                                14.089   19.324

Table 5.1: A comparison of the baseline performance of six ML algorithms

The SGD regression and regression tree algorithms performed considerably worse than the other four algorithms. Linear and univariate regression, while not commonly employed for tasks of this type, showed surprisingly good performance in this brief experiment. While RF regression has been used in past QE studies (Rubino et al., 2012), the performance difference from SVM regression was deemed large enough not to experiment with it further.

Based on these results, coupled with the fact that it was the only algorithm implemented in the QuEst++ toolkit, SVM regression was chosen as the ML algorithm for the feature tests.


5.1.2 Parameterising ratio features

Three different means of quantifying the same relationship between constituents in source and target were explored when implementing the verb phrase ratio feature. They are defined below, where VP_side is the number of verb phrases on the respective side and P_side is the total number of phrases.

Absolute difference:
|VP_{source} - VP_{target}|    (8)

Ratio:
\frac{VP_{source}}{VP_{target}}    (9)

Ratio of percentage:
\frac{VP_{source} / P_{source}}{VP_{target} / P_{target}}    (10)

To test the performance of the different measures, each measure was implemented as a feature and concatenated with the baseline features for the WMT system; the results are shown in Table 5.2.

Measure                  MAE      RMSE
Absolute difference      13.864   19.553
Ratio                    13.842   19.527
Ratio of percentage      13.834   19.515

Table 5.2: Performance in terms of MAE and RMSE for the different ratio implementations in the WMT scenario

Results point towards the ratio of percentage, which went against the initial intuition that normalising with the total number of phrases would introduce noise, due to the non-linear relationship in phrase constructions between the languages involved. This metric was also applied to the phrase ratios in the second scenario.
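For completeness, the three parameterisations of Equations 8-10 can be computed as in the following small sketch; the zero-count guards are my addition.

```python
def vp_ratio_features(vp_source, phrases_source, vp_target, phrases_target):
    """The three parameterisations compared in Table 5.2 (Equations 8-10)."""
    absolute_difference = abs(vp_source - vp_target)
    ratio = vp_source / vp_target if vp_target else 0.0
    pct_source = vp_source / phrases_source if phrases_source else 0.0
    pct_target = vp_target / phrases_target if phrases_target else 0.0
    ratio_of_percentage = pct_source / pct_target if pct_target else 0.0
    return absolute_difference, ratio, ratio_of_percentage

print(vp_ratio_features(vp_source=2, phrases_source=9, vp_target=3, phrases_target=10))
```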

5.2 Feature performance

The results of the initial tests, measuring the features’ individual performance in combination with the baseline set, are presented in Table 5.3 in terms of MAE and RMSE, and a comparison chart of the impact on both scenarios is shown in Figure 5.1. The impact is defined as the MAE difference in relation to the corresponding baseline, normalised by the total MAE.

The features with an overall negative impact on a scenario were rejected from the feature combination tests for that scenario, while the features with a positive impact were concatenated and measured, one by one, to form the feature vector resulting in the best performance for each scenario.


Figure 5.1: A comparison of the normalised impact in MAE of the 14 language-independent features, as well as the WMT-specific noun group ratio for comparison.

The individual feature impact varied considerably between the two scenarios; however, some similarities can be observed. The noun, pronoun and noun phrase ratios had a comparable negative impact on both scenarios, while the verb ratio had a similar positive impact. This suggests that inserted or deleted verbs have a higher correlation with edit operations than nouns and pronouns. It is worth noting that both Swedish and German have closed compound nouns where English has multi-noun chains, e.g. “baseballmatch” and “Baseballspiel” vis-à-vis “baseball game”. This might affect the performance of the noun ratio feature as well as explain the positive impact of the noun group ratio feature, which was constructed for this very reason.

Furthermore, the source PCFG average confidence in a 3-best list had a negative impact on both scenarios, while the source and target PCFG probabilities had a positive impact; the size of the latter features’ impact, however, differs considerably between the two scenarios despite being positive in both.

The three reordering measures all showed different relations between the scenarios, which is surprising as they are all based on the same number of crossings. The Kendall Tau distance (Tau) was the only reordering measure with a positive impact on the WMT scenario, while both the squared version (SKTD) and Tau had a positive impact on the SQE scenario, with the Tau feature being one of the best-performing features overall.

Another noteworthy observation is that no feature with a negative impact on the SQE scenario performed well on the WMT scenario, while there were many cases the other way around. The biggest differences observed are for the verb phrase ratio and target POSLM features, which all have a significantly more positive impact on the Swedish-English SQE scenario.


Feature                                               WMT               SQE
                                                      MAE      RMSE     MAE      RMSE
Baseline (b17)                                        13.826   19.507   20.751   27.230
b17 + Crossings                                       13.834   19.480   20.789   27.350
b17 + Tau                                             13.801   19.460   20.616   27.112
b17 + SKTD                                            13.836   19.468   20.718   27.206
b17 + verb phrase ratio                               13.834   19.515   20.607   27.100
b17 + ratio of noun phrases                           13.846   19.523   20.786   27.146
b17 + noun ratio                                      13.842   19.466   20.787   27.217
b17 + pronoun ratio                                   13.827   19.510   20.776   27.137
b17 + verb ratio                                      13.799   19.604   20.685   27.134
b17 + a-z token ratio                                 13.848   19.488   20.693   27.174
b17 + Source PCFG average confidence in 3-best list   13.859   19.551   20.792   27.493
b17 + POSLM target log perplexity                     13.859   19.465   20.641   27.108
b17 + POSLM target log probability                    13.851   19.522   20.613   27.138
b17 + Source tree PCFG                                13.812   19.515   20.682   27.176
b17 + Target tree PCFG                                13.819   19.534   20.654   27.096
b17 + Noun Group Ratio                                13.759   19.503   NA       NA
b17 + Genitive constructions                          13.840   19.539   NA       NA

Table 5.3: Performance in terms of MAE and RMSE for all individual features

5.3 WMT Scenario

A majority of the proposed features proved to have a negative impact on the performance metrics in individual testing, leaving only 5 of the 16 features with a positive impact:

• Noun group ratio

• Kendall Tau distance of alignments

• Source PCFG log probability

• Target PCFG log probability

• Ratio of percentage of verbs

The surprisingly small number of positive features may be a result of a mismatch between the proposed features and the data. The features mainly rely on linguistic analyses, while the data, being exclusively from the IT domain, is inherently irregular. POSLM and syntactic phrase features appear to be particularly unreliable, which may be due to the nature of the domain, where series of constituents of uncommon character are frequent, as demonstrated in the English-German example from the WMT dataset below:

Source: Choose File > Save As , and choose Photoshop DCS 1.0 or Photoshop DCS 2.0 from the Format menu .

Target: Wählen Sie " Bearbeiten " " Voreinstellungen " ( Windows ) bzw. " Bridge CS4 " > " Voreinstellungen " ( Mac OS ) und klicken Sie auf " Miniaturen . "


This appears to affect syntactic parsers trained on out-of-domain PCFGs, as the parser often had difficulties generating more than 3 parse trees per sentence. While the probabilities of the parse trees for both source and target slightly increased the performance of the model, they had a much higher impact on the SQE scenario. The difference in performance of the POSLM-based features is even larger, even though the language models were trained on in-domain data, suggesting that text domain may affect feature performance.

Of the novel features proposed in this thesis, the noun group ratio and Kendall Tau distance showed promising results both individually and in combination with the other features; the noun group ratio feature had the highest impact of all the proposed features.

Furthermore, out of all the features with an individual positive impact on MAE, only the noun group ratio and Tau perform well on RMSE. This carries over to the performance when combined as well. The impact of the combined features is presented in Table 5.4, with the addition of the Pearson correlation metric, as motivated in Section 3.5.

Feature combinations     MAE      RMSE     r
baseline                 13.826   19.507   0.381
+ Source PCFG            13.812   19.515   0.382
+ Target PCFG            13.805   19.560   0.383
+ Verb ratio             13.795   19.627   0.383
+ Tau                    13.757   19.522   0.384
+ Noun Group Ratio       13.723   19.552   0.386

Table 5.4: Performance in terms of MAE and RMSE for the combined features resulting in the best performing feature set for WMT

Based on the results of the feature combinations, MAE seems to correlate more closely with the Pearson correlation than RMSE does, as there is a linear relationship between the decrease in MAE and the increase in r, despite the slight increases in RMSE.

5.3.1 Shared task results

A submission to the 2016 shared task in sentence-level QE was made based on the best performing feature set, presented in Table 5.4. The submission surpassed the baseline and ranked 9th out of 13, with a Pearson correlation of 0.363 on the test set, which is separate from the one used for the evaluation performed in this thesis (see Table 3.1).

Only the Pearson correlation is known at the time of writing, as the organisers are experiencing some issues with the MAE and RMSE metrics.

5.4 SQE Scenario

The following features had an individual positive impact on the performance metrics for the SQE scenario:


• Source PCFG log probability

• Target PCFG log probability

• Kendall Tau distance of alignments

• Ratio of percentage of verbs

• Target POSLM perplexity

• Target POSLM log probability

• Ratio of tokens consisting of alphabetic symbols (a-z, including the Swedish letters ’å’, ’ä’ and ’ö’) between source and target

• Squared Kendall Tau Distance of alignments

• Ratio of percentage of verb phrases in source and target

The individual performance tests suggest that the proposed features were better suited to the SQE scenario, as 9 of the 14 performed well and the well-performing features had a significantly larger average impact than the well-performing features in the WMT scenario.

The structural integrity of source and target, as measured by the PCFG and POSLM features, showed promising results, except for the PCFG average confidence feature, which may be hindered by the small list size of 3 for the number of trees considered, as imposed by the WMT scenario. Considering that texts from the public sector domain need to be comprehensible to a large number of people, it seems reasonable that violations of conventional structure carry a heavier weight. This supports the hypothesis of text domain influence on feature performance.

Feature combination      MAE      RMSE     r
baseline                 20.751   27.230   0.501
+ source pcfg            20.682   27.176   0.503
+ target pcfg            20.581   27.055   0.510
+ Tau                    20.446   26.917   0.518
+ verb ratio             20.403   26.841   0.521
+ poslm prob             20.301   26.744   0.526
+ poslm perp             20.291   26.714   0.527
+ a-z ratio              20.230   26.629   0.531
+ sktd                   20.212   26.635   0.531
+ vp ratio               20.221   26.638   0.531

Table 5.5: Performance in terms of MAE and RMSE for the combined features resulting in the best performing feature set for SQE

As with the WMT combinations, there is a linear relationship between the decrease in MAE and the increase in r, with performance gains for each feature addition.


However, there is an exception with the addition of the last feature, the verb phrase ratio. This is surprising, as the feature showed a strong positive impact when tested individually, and individual impact differences otherwise more or less correspond to the differences in performance when combined.

It is possible that the decrease in performance is a result of a large number of features causing noise. However, there appears to be no correlation between performance gain and number of features based on the previous additions.

Another interesting observation is that the Pearson correlation is significantly higher for the SQE scenario than for the WMT scenario, while the MAE and RMSE reflect the opposite. A possible explanation for the high error metrics is a high variation in the distribution of quality scores. In her study of evaluation metrics for QE, Graham (2015) shows that a low variance in quality scores results in lower MAE and RMSE scores, and vice versa, as they are based on the absolute differences between all predictions and gold-standard labels.

To briefly test this hypothesis, the standard deviation was computed for the gold-standard quality scores in the test data from both scenarios:

Scenario    Std. dev. of gold-standard scores
SQE         20.628
WMT         31.424

Indeed, the variation is slightly higher in the distribution of scores for the WMT test set which could affect the reliability of the error metrics, but the difference is not significant enough for any conclusions.

5.4.1 Human annotation

In order to validate the predictions for the SQE scenario, they were compared with effort scores provided by one professional post-editor, as motivated and described in Section 3.5.1. Ideally, in order to avoid biased results, annotation is performed by more than one annotator and the score is averaged over all annotators. Furthermore, only 240 test segments, about 3.4% of the test set, were annotated, and the class distribution was imbalanced. Due to this, the insights provided by the comparisons are not conclusive on their own, but they may provide some level of evaluation.

The average TER for each annotated class is shown in Table 5.6; both the predicted TER values and the gold-standard values are presented along with their standard deviations. The average precision for the best class (5), the top two classes, as well as the two worst classes, is shown in Table 5.7. The performance for class 1 alone was excluded due to its low number of occurrences.

The difference in deviation and average scores for the lower classes between the predicted and gold-standard scores is due to an overall higher deviation in the distribution of gold-standard scores.

The average scores at each class indicate the relationship between TER scores and post-editing effort as perceived by the annotator. There is a linear relationship between higher perceived post-editing effort and TER score for the Gold-standard distribution. The predicted values show a similar trend, but with a lower TER average for the third class.

Based on this small sample size, the results are somewhat encouraging. The gold-standard distribution indicates that a well-predicted TER score does in fact correlate with post-editing effort.

Class   Count   Gold-standard Avg. TER (Std-dev)   Predicted Avg. TER (Std-dev)
5       101     40.205 (23.166)                    47.834 (10.769)
4       95      51.307 (22.151)                    50.273 (8.773)
3       30      64.390 (28.753)                    47.854 (7.759)
2       11      93.673 (17.076)                    66.570 (9.236)
1       3       119.667 (15.035)                   72.545 (2.800)

Table 5.6: The average TER for each of the annotated classes, from the predicted and gold-standard scores

Furthermore, as the two worst classes (1 and 2) together constitute only 6% of the annotated segments, it is noteworthy that the TER scores matched them so well. This trend is observable in the average precision as well, as the average precision for the predicted values reaches 59%, despite the small number of occurrences.

The high precision for the combined top two classes is misleading, as together they constitute 81% of the annotated segments.

Classes   Gold standard   Predicted
5         63%             53%
5,4       93%             86%
1,2       47%             59%

Table 5.7: The average precision for the top and bottom annotated classes


6 Conclusion and future work

This thesis has investigated the impact of 16 features in two separate scenarios of quality estimation for post-editing. Among these, novel features modelling noun translation errors and SMT reordering were proposed. A majority of the proposed features (11 of 16) had a negative impact in the WMT scenario, while a majority (9 of 14) had a positive impact in the SQE scenario. The reason for this discrepancy in feature performance is believed to lie in the textual domain, as many features rely on linguistic analyses, which are here found to perform poorly in the IT-domain-based WMT scenario. The relationship between text domain and QE features remains an interesting research topic.

Of the novel features proposed, the noun group ratio and Kendall Tau distance features showed particularly promising results. In the future I would like to investigate an expanded set of translation errors for more language pairs, as well as adapt the reordering measures as features to take the expected reordering of specific translation directions into account.

The possibility of estimating post-editing effort without using post-edited data was explored by comparing predicted and gold-standard TER scores with human annotations. Due to the small number of annotated segments and an imbalanced annotation distribution, the results were inconclusive, but they indicated a correlation between the scores and perceived post-editing effort. The correlation between the annotations and the gold-standard scores was stronger than that with the predicted scores. This indicates that, even if the system presented in this thesis did not accurately predict post-editing effort, an improved model may come close.


A Features

17 Baseline Features

• number of tokens in the source sentence

• number of tokens in the target sentence

• average source token length

• LM probability of source sentence

• LM probability of target sentence

• number of occurrences of the target word within the target hypothesis (averaged for all words in the hypothesis - type/token ratio)

• average number of translations per source word in the sentence (as given by IBM 1 table thresholded such that prob(t|s) > 0.2)

• average number of translations per source word in the sentence (as given by IBM 1 table thresholded such that prob(t|s) > 0.01) weighted by the inverse frequency of each word in the source corpus

• percentage of unigrams in quartile 1 of frequency (lower frequency words) in a corpus of the source language (SMT training corpus)

• percentage of unigrams in quartile 4 of frequency (higher frequency words) in a corpus of the source language

• percentage of bigrams in quartile 1 of frequency of source words in a corpus of the source language

• percentage of bigrams in quartile 4 of frequency of source words in a corpus of the source language

• percentage of trigrams in quartile 1 of frequency of source words in a corpus of the source language

• percentage of trigrams in quartile 4 of frequency of source words in a corpus of the source language

• percentage of unigrams in the source sentence seen in a corpus (SMT training corpus)

• number of punctuation marks in the source sentence

• number of punctuation marks in the target sentence


Bibliography

Eleftherios Avramidis, Maja Popovic, David Vilar, and Aljoscha Burchardt. Evaluate with confidence estimation: Machine ranking of translation outputs using grammatical features. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 65–70. Association for Computational Linguistics, 2011.

Alexandra Birch and Miles Osborne. Reordering metrics for MT. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 1027–1035. Association for Computational Linguistics, 2011.

John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. Confidence estimation for machine translation. Final report, JHU/CLSP Summer Workshop, 2003.

John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. Confidence estimation for machine translation. In Proceedings of the 20th International Conference on Computational Linguistics, page 315. Association for Computational Linguistics, 2004.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal, September 2015. Association for Computational Linguistics. URL http://aclweb.org/anthology/W15-3001.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia, editors. Proceedings of the Seventh Workshop on Statistical Machine Translation. Association for Computational Linguistics, Montréal, Canada, June 2012. URL http://www.aclweb.org/anthology/W12-31.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

Janez Demšar, Blaž Zupan, Gregor Leban, and Tomaž Curk. Orange: From experimental machine learning to interactive data mining. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 537–539. Springer, September 2004.


George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second international conference on Human Language Technology Research, pages 138–145. Morgan Kaufmann Publishers Inc., 2002.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–649. Association for Computational Linguistics, 2013.

Mariano Felice and Lucia Specia. Linguistic features for quality estimation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 96–103. Association for Computational Linguistics, 2012.

Dmitriy Genzel. Automatically learning source-side reordering rules for large scale machine translation. In Proceedings of the 23rd international conference on computational linguistics, pages 376–384. Association for Computational Linguistics, 2010.

Yvette Graham. Improving evaluation of machine translation quality estimation. In 53rd Annual Meeting of the Association for Computational Linguistics and Seventh International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pages 1804–1813, 2015.

Péter Halácsy, András Kornai, and Csaba Oravecz. HunPos: An open source trigram tagger. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 209–212. Association for Computational Linguistics, 2007.

Hui Jiang. Confidence measures for speech recognition: A survey. Speech communication, 45(4):455–470, 2005.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics, 2007.

Maarit Koponen. Comparing human perceptions of post-editing effort with post-editing operations. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 181–190. Association for Computational Linguistics, 2012.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT Summit XII, the Twelfth Machine Translation Summit, International Association for Machine Translation hosted by the Association for Machine Translation in the Americas, 2009.


Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 255–265, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=2380816.2380850.

Beata Megyesi. The open source tagger HunPoS for Swedish. In Proceedings of the 17th Nordic Conference on Computational Linguistics (NODALIDA), 2009.

Joakim Nivre and Beata Megyesi. Bootstrapping a Swedish treebank using cross-corpus harmonization and annotation projection. In Proceedings of the 6th International Workshop on Treebanks and Linguistic Theories, pages 97–102, 2007.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 433–440. Association for Computational Linguistics, 2006.

Christopher Quirk. Training a sentence-level machine translation confidence measure. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), pages 825–828, Lisbon, Portugal, 2004.

Raphael Rubino, Jennifer Foster, Joachim Wagner, Johann Roturier, Rasul Samad Zadeh Kaljahi, and Fred Hollowood. DCU-Symantec submission for the WMT 2012 quality estimation task. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 138–144. Association for Computational Linguistics, 2012.

Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK, 1994.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231, 2006.

Matthew G. Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. TER-Plus: Paraphrase, semantic, and alignment enhancements to translation edit rate. Machine Translation, 23(2-3):117–127, 2009.

Lucia Specia and Atefeh Farzindar. Estimating machine translation post-editing effort with HTER. In Proceedings of the Second Joint EM+/CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry (JEC 10), pages 33–41, 2010.


Lucia Specia, Marco Turchi, Nicola Cancedda, Marc Dymetman, and Nello Cristianini. Estimating the sentence-level quality of machine translation systems. In 13th Conference of the European Association for Machine Translation, pages 28–37, 2009.

Lucia Specia, Najeh Hajlaoui, Catalina Hallett, and Wilker Aziz. Predicting machine translation adequacy. In Machine Translation Summit, volume 13, pages 19–23, 2011.

Lucia Specia, Gustavo Paetzold, and Carolina Scarton. Multi-level translation quality prediction with QuEst++. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 115–120, Beijing, China, July 2015. Association for Computational Linguistics and The Asian Federation of Natural Language Processing. URL http://www.aclweb.org/anthology/P15-4020.

Andreas Stolcke. SRILM - an extensible language modeling toolkit. In Pro- ceedings of the Seventh International Conference on Spoken Language Processing, Denver, Colorado, USA, 2002.

Sara Stymne, Nicola Cancedda, and Lars Ahrenberg. Generation of compound words in statistical machine translation into compounding languages. Computational Linguistics, 39(4):1067–1108, 2013.

Arda Tezcan, Véronique Hoste, Bart Desmet, and Lieve Macken. UGENT-LT3 SCATE system for machine translation quality estimation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, 2015.

Mu Zhu. Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, 2, 2004.
