Depending on VR : Rule-based Text Simplification Based on Dependency Relations


Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer Science

Bachelor thesis, 18 ECTS | Cognitive Science

2017 | LIU-IDA/KOGVET-G--17/014--SE

Depending on VR

Rule-based Text Simplification Based on Dependency

Relations

Vida Johansson

Supervisor : Arne Jönsson

Assistant supervisor : Evelina Rennes
Examiner : Henrik Danielsson



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.



Abstract

The amount of text that is written and made available increases all the time. However, it is not readily accessible to everyone. The goal of the research presented in this thesis was to develop a system for automatic text simplification based on dependency relations, develop a set of simplification rules for the system, and evaluate the performance of the system. The system was built on a previous tool, and developments were made to ensure that the system could perform the operations necessary for the rules included in the rule set. The rule set was developed by manual adaptation of the rules to a set of training texts. The evaluation method used was a classification task with both objective measures (precision and recall) and a subjective measure (correctness). The performance of the system was compared to that of a system based on constituency relations. The results showed that the current system scored higher on both precision (96% compared to 82%) and recall (86% compared to 53%), indicating that the syntactic information dependency relations provide is sufficient to perform text simplification. Further evaluation should account for how helpful the text simplification produced by the current system is for target readers.


Acknowledgments

To my supervisor, Arne Jönsson, for all his support and enthusiasm and for pushing me further than I believe I can go. To my assistant supervisor and dear friend, Evelina Rennes, for sharing her knowledge, performing proofreading, and being my role model. To Johanna Ledin, Isabella Konikowski, Jasmina Jahic, and Elin Sjöström, for your support and friendship. To Rickard Paulsén, for being there and being you.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Aim
  1.2 Delimitations
  1.3 Structure

2 Theory
  2.1 Plain Language
  2.2 Automatic Text Simplification
    2.2.1 Previous Work on Syntactic Simplification
    2.2.2 Evaluation of ATS
  2.3 TeCST - A Text Complexity and Simplification Toolkit
  2.4 dep_tregex - A Dependency Tree Tool
    2.4.1 Conditions
    2.4.2 Actions
    2.4.3 Syntax
    2.4.4 Rule Application

3 Method
  3.1 Further Developments
    3.1.1 Add
    3.1.2 Split
    3.1.3 Conj
  3.2 Text Selection
  3.3 Rule Selection and Production
  3.4 Evaluation

4 Results
  4.1 Results for VR
  4.2 VR compared to R2

5 Discussion

6 Conclusion

References

A Dependency Labels
B Stagger POS-tags


List of Figures


List of Tables

2.1 Operators in StilLett

2.2 Node conditions in StilLett

2.3 Arguments available in dep_tregex

2.4 Node conditions in dep_tregex

2.5 Neighborhood conditions

2.6 Actions in dep_tregex

3.1 Size of the training and test material

3.2 Rules included in the rule set


1 Introduction

In order to increase digital inclusion, for example by providing information that more people can access and use, there is a need for effective ways of simplifying texts. Previous work on automatic text simplification (ATS) for Swedish has been aimed either towards lexical simplification or towards constituency-based syntactic simplification. However, dependency parsing has been shown to be more economical than constituency parsing, due to its lower algorithmic time complexity (Cer, De Marneffe, Jurafsky, & Manning, 2010).

There are many target groups for ATS, for example people with dyslexia and second language learners, and their needs differ. The most obvious goal of ATS is to support these groups by providing texts in plain language1. However, the ATS system described in this thesis is valuable both for readers with a need for simplified texts and for authors of plain language texts. It will be used in TeCST (Falkenjack, Rennes, Fahlborg, Johansson, & Jönsson, 2017), a tool that provides support for web editors of plain language texts. It will also be available externally through SAPIS (Fahlborg & Rennes, 2016), the StilLett SCREAM API Service. However, in order for the system described in this thesis to be a successful replacement for the system currently implemented in TeCST, it is important to know that it performs better than the currently used system.

1.1 Aim

The aim of this thesis was to develop a system for rule-based ATS based on dependency relations, develop a set of simplification rules, and compare the performance of the system to that of the constituency-based ATS system currently implemented in TeCST.

1.2 Delimitations

Text can be simplified in other ways, for instance using methods for lexical simplification. Such simplification methods are not considered in the work presented in this thesis. The focus was instead on developing and evaluating a system that aids syntactic simplification by providing tools that ease the creation of transformation rules and the application of rules to texts.

1.3 Structure

This thesis is divided into six chapters. Chapter 2 provides an overview of previous research on plain language and ATS. Chapter 3 describes the method and evaluation process. Chapter 4 presents the results of the evaluation and a comparison with a similar system. In Chapter 5, the method and results are discussed. Finally, in Chapter 6, the thesis is concluded and suggestions for future research are proposed.

1 In this thesis, plain language refers to communication that the intended audience can understand as quickly,


2 Theory

This chapter is divided into four sections. The first two sections provide an overview of plain language and previous work on ATS. The last two sections describe two toolkits: TeCST, a text complexity and simplification toolkit, and dep_tregex, a dependency tree reordering tool. TeCST is the toolkit the system described in this thesis will be included in, and dep_tregex is the tool the system was built on.

2.1 Plain Language

Around the world, several countries have recognized the importance of making information easy to read and understand. The United Nations' Convention on the Rights of Persons with Disabilities and its Optional Protocol (A/RES/61/106) states that persons with disabilities have the right to accessible information (United Nations, 2006).

In the US, federal plain language guidelines have been provided by the Plain Language Action and Information Network (PLAIN, 2011). The guidelines include information on how to write plain text, divided into guidelines for words, sentences, paragraphs, and other aids that can help clarify text. On the word level, the guidelines recommend the use of precise and concise words. On the sentence level, short sentences with carefully selected words and a carefully selected word order are recommended. On the paragraph level, short paragraphs with only one topic per paragraph are recommended. Other means to simplify text include the use of examples, lists, tables, and illustrations, a minimal use of cross-references, and a document design that increases readability.

In Europe, people from eight countries gathered to write standards for making information easy to read and understand, as part of a larger project (Inclusion Europe, 2014). The standards included recommendations on how to write plain language texts on a word level, a sentence level, and a textual level.

In Sweden, a public inquiry concerning the area of easy-to-read identified goals concerning readability within several political areas in Sweden (Lättlästutredningen, 2013). The goals included that it should be possible for everyone to participate in cultural life as well as society at large and that living conditions should be equal for everyone. The inquiry also identified several target groups with reduced literacy and different needs for plain language material: persons with intellectual disabilities and other disabilities with an effect on reading ability (e.g. motor or perceptual difficulties), persons suffering from dementia diseases, persons with reading and writing disabilities, second language (L2) learners, and persons who are very unused to reading.

2.2 Automatic Text Simplification

Text simplification is aimed at making information accessible to people with reduced literacy. Making texts easier to comprehend by automatic means has many uses and can be done in several ways. ATS can, for example, be used to make texts easier to read and comprehend, as preprocessing before machine translation or parsing, and as an aid for writers of plain language texts. Shardlow (2014) provided a survey of research on text simplification, in which the following approaches to the simplification task were included:

• Lexical simplification

• Syntactic simplification

• Statistical machine translation (SMT)

• Explanation generation

• Hybrid techniques

Lexical simplification identifies complex words and replaces them with more comprehensible words, while still retaining the original meaning and information content. Lexical simplification does not aim to simplify the grammar of texts, unlike syntactic simplification. Syntactic simplification denotes the process of identifying complex grammatical structures and rewriting them into simpler structures. Syntactic simplification is usually done in three steps: analysis of text, transformation of complex structures, and regeneration of text. SMT approaches, often phrase-based statistical techniques, learn translations from large aligned bilingual corpora and apply the knowledge of valid translations to other texts. Explanation generation identifies complex concepts in text and augments them with additional information, in order to provide the readers with more context and increase their understanding of the text. Hybrid simplification techniques combine several approaches to the simplification task.
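The three steps of syntactic simplification described above can be illustrated with a deliberately toy sketch. The sentence, the "and"-splitting rule, and all function names here are invented for illustration; a real system operates on parse trees rather than token lists.

```python
# Toy illustration of the three steps of syntactic simplification:
# analysis of text, transformation of complex structures, and
# regeneration of text. Here the "complex structure" is a coordinated
# clause, which is split into two sentences.

def analyze(sentence):
    """Analysis: a stand-in for parsing; here, just tokenization."""
    return sentence.rstrip(".").split()

def transform(tokens):
    """Transformation: split one coordination at "and" into two clauses."""
    if "and" in tokens:
        i = tokens.index("and")
        return [tokens[:i], tokens[i + 1:]]
    return [tokens]

def regenerate(clauses):
    """Regeneration: restore capitalization and punctuation."""
    sentences = []
    for clause in clauses:
        text = " ".join(clause)
        sentences.append(text[0].upper() + text[1:] + ".")
    return " ".join(sentences)

print(regenerate(transform(analyze("The dog barked and the cat ran."))))
# The dog barked. The cat ran.
```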

2.2.1 Previous Work on Syntactic Simplification

Angrosh, Nomoto, and Siddharthan (2014) described two text simplification systems using dependency relations: one that performs lexical and syntactic simplification and one that performs sentence compression. To show how automatically harvested rules can be generalized, Angrosh and Siddharthan (2014) described a hybrid text simplification system, which combines handwritten transformation rules with automatically acquired rules. A synchronized grammar was created by aligning dependency-parsed sentences in English and plain English. This made it possible to identify differences between the aligned sentences, on which the transformation rules are based.

The PorSimples project (Aluísio & Gasperin, 2010) aimed to develop text adaptation tools for Brazilian Portuguese and to supply such tools both to readers with a need for simplified texts and to writers who want to produce texts for readers with reduced literacy. Within the project, three systems have been developed: an authoring system (SIMPLIFICA), an assistive technology system (FACILITA), and a web content adaptation tool (Educational FACILITA). All three are hybrid systems that include lexical and syntactic simplification and explanation generation. The syntactic simplification is rule-based and applied sentence-wise.

Specia (2010) approached the simplification of Portuguese sentences as a translation task and used SMT to translate complex sentences into sentences with a simpler grammatical structure. Although the results showed that the method achieved high precision, it requires very large aligned corpora to achieve high recall.

A web-based demonstration of a text simplification system, which performs both lexical and syntactic simplification for English, was presented by Ferrés, Marimon, and Saggion (2015). The system makes use of typed dependencies and transformation rules, complemented with text generation techniques. The syntactic simplification component of the system worked in two phases: analysis and generation. In the analysis phase, the syntactic structures that were to be simplified were identified. In the generation phase, the correct simplified structures were produced. Sentences were recursively simplified until no more syntactic structures could be further simplified. A Java API was used to convert words to the inflectional forms appropriate to their context and part-of-speech tag (POS-tag).

A method for syntactic sentence simplification for French was presented by Brouwers, Bernhard, and François (2014). The simplification was performed in two steps: overgeneration and optimization. First, all possible simplifications were generated for each sentence. This was done by applying 19 rules recursively to each sentence, which often resulted in more than one simplified variant per sentence. The rules were applied using Tregex and Tsurgeon (Levy & Andrew, 2006). Second, the subset of simplified sentences that maximized readability was selected. This was done by analyzing the simplified sentences with respect to a number of text complexity metrics and treating the selection as an optimization problem, which could be solved with an integer linear programming approach.

Suter, Ebling, and Volk (2016) presented a rule-based text simplification system for German, which performed both syntactic and lexical simplification. The architecture of the system was divided into three levels: character and word level, sentence level, and textual level and layout. The transformation rules used in the system were based on typed dependency relations and applied sentence-wise. For words that were considered difficult but remained after all transformations were made, an explanation was added. Also, long compound words were separated by a symbol called Mediopunkt, in order to improve readability. All abbreviations were replaced with the words they represent. The transformation rules used in the system were derived from four sets of guidelines for plain language texts, among them the European standards for making texts easy to read and understand, produced by Inclusion Europe (2014).

StilLett, a rule-based text simplification system that performs syntactic simplification of Swedish texts, based on constituency relations, was presented by Rennes (2015). The rules that were implemented in the system were chosen based on a literature review and refined using manual training, during which the rules were manually adapted to a set of training texts. For more information on StilLett, see Section 2.3.

Siddharthan (2010) presented a framework for automatic complex lexico-syntactic reformulation of sentences. Four discourse markers for causation were considered, and a corpus of sentences containing such markers, aligned with reformulations of them, was used. Siddharthan (2010) experimented with three representations: phrasal parse trees, Minimal Recursion Semantics (MRS), and typed dependencies, in order to find the most suitable representation for the task. The first representation, phrasal parse trees, was shown to be too dependent on the grammar rules used by the parser. This was especially true for longer sentences, where strings in aligned sentences were parsed differently. It was concluded that substitution grammars for phrasal parse trees, which have been shown to be successful in sentence compression tasks, are not as useful for complex lexico-syntactic tasks. The second representation used was MRS, where a bi-directional grammar was used and transformations were performed at a semantic level. It was shown that it was easy and intuitive to write rules in MRS and that the bi-directional grammar ensured the generation of grammatical sentences. However, the bi-directional grammar failed to parse ill-formed input and to analyze well-formed inputs containing unusual constructions. The method could also become slow and memory intensive. Finally, typed dependencies were employed. The flat structure of dependency trees allowed transformation rules to be written in a more compact form. It was shown that typed dependency structures were the most suitable representation for the task and that the handcrafted rules often generalized well to the unseen sentences in the test corpus.

2.2.2 Evaluation of ATS

Consensus on how ATS systems should be evaluated has not yet been reached. Also, systems differ in the type of text simplification they aim to perform and thus require different evaluation methods. Previously employed evaluation methods include comparison against human-generated gold standards, the use of text complexity metrics, human layman knowledge, and online and offline techniques with roots in the psycholinguistic literature. Few text simplification systems have been evaluated with target populations as participants.

Brouwers et al. (2014) performed both a qualitative and a quantitative evaluation of their sentence simplification system. The manual (qualitative) evaluation was aimed at checking for possible errors that the application of rules had resulted in. This was done in order to make sure that the rules did not cause unintelligible or ungrammatical sentences. The quantitative evaluation built on the manual evaluation and yielded results for correctness and the sources of errors. The error sources were preprocessing errors and simplification errors. Simplification errors were further divided into syntactic errors and semantic errors.

A qualitative method was also used by Suter et al. (2016) to evaluate a text simplification system that performed both lexical and syntactic simplification. The evaluation consisted of a comparison of the output from the system to a text that had been simplified manually by an experienced author of plain German texts. However, the manually simplified text used to evaluate the system was the same text that inspired the implementation of some of the rules included in the system. That is, the system may partly have been trained on the data used to evaluate it. In addition to the qualitative method, the LIX readability scores (Björnsson, 1968) of the original text, the manually simplified text, and the automatically simplified text were compared.

Angrosh and Siddharthan (2014) performed human evaluation with participants recruited through Amazon Mechanical Turk. The participants were presented with ten sets of ten sentences alongside five simplified sentences. They were asked to rate each of the simplified sentences with regard to fluency and simplicity. They also rated how well the simplified sentences preserved the meaning of the original sentences. Human evaluation has also been used by Ferrés et al. (2015) to evaluate the lexical and syntactic simplification their system performed. Eight human judges were asked to rate the fluency, adequacy, and simplicity of 25 randomly selected sentences generated by the system.

The rule set implemented in the text simplification system StilLett (Rennes, 2015) was evaluated with a set of texts as test data, according to a predefined annotation process. For each rule, two participants individually annotated all sentences containing relevant structures as true positives. During the evaluation, the output of the system was compared to the annotated sentences and the objective measures recall and precision were calculated. In addition, the subjective measure correctness was calculated by letting the same two participants rate each modified sentence as linguistically correct or incorrect.
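The objective measures used in this evaluation scheme can be stated compactly as set operations over sentence IDs. The IDs and counts in the sketch below are invented for illustration; they are not the thesis's data.

```python
# Precision: of the sentences the system modified, the share that were
# annotated as true positives. Recall: of the annotated true positives,
# the share the system actually modified.

def precision_recall(annotated, modified):
    """annotated, modified: sets of sentence IDs."""
    true_pos = len(annotated & modified)
    precision = true_pos / len(modified) if modified else 0.0
    recall = true_pos / len(annotated) if annotated else 0.0
    return precision, recall

annotated = {1, 2, 3, 4, 5}   # sentences annotated as containing a relevant structure
modified = {1, 2, 3, 6}       # sentences the system actually rewrote
print(precision_recall(annotated, modified))  # (0.75, 0.6)
```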

Specia (2010) treated and evaluated the text simplification task as a translation task. The sentences that had been "translated" to simplified ones were evaluated using standard SMT metrics, BLEU and NIST, and manual inspection.

To evaluate a framework for ATS, Siddharthan (2010) looked at how many rules were required for the lexico-syntactic reformulations and how robust the transformations were to parsing errors. The hand-crafted transformation rules used for the evaluation were developed by looking at one-third of the aligned corpus, yielding 48 sentences, and tested on the remaining two-thirds, yielding 96 sentences. Siddharthan pointed out that the evaluation method is merely aimed at evaluating the framework and not how useful the transformation rules are with respect to text simplification.

Siddharthan (2014) has raised several important issues with methods used to evaluate ATS. One of them is that few text simplification methods have been evaluated with target reader populations. As a result, it is difficult to know how effective ATS is. A second issue concerns the evaluation of texts produced with ATS using text complexity metrics: it assumes that the simplified texts are error-free, even though ATS struggles to produce error-free texts. Also, when ATS is evaluated on a sentence level, the discourse and coherence implications of ATS are ignored.


2.3 TeCST - A Text Complexity and Simplification Toolkit

Research on digital inclusion at Linköping University has resulted in several tools, including an automatic text summarizer (Smith & Jönsson, 2011a, 2011b), methods for lexical simplification (Johansson & Rennes, 2016; Keskisärkkä & Jönsson, 2013), a system for ATS (Rennes & Jönsson, 2015), and text complexity metrics (Falkenjack, Heimann Mühlenbock, & Jönsson, 2013; Falkenjack & Jönsson, 2014). In TeCST (Falkenjack et al., 2017), or Text Complexity and Simplification Toolkit, these tools were integrated and made available to text producers who want to make texts easier to read and to people with a need for simplified texts.

TeCST consists of three parts: text complexity analysis, text simplification, and text summarization. The text complexity analysis consists of a subset of the text complexity metrics available through SCREAM. Text summarization is performed by FriendlyReader (Smith & Jönsson, 2011a, 2011b) and text simplification by the text simplifier StilLett.

StilLett (Rennes & Jönsson, 2015) is a rule-based text simplification tool for Swedish, partly built on CogFLUX (Rybing, Smith, & Silvervarg, 2010). CogFLUX performed syntactic analysis of texts and supported two text rewriting operations: replacement and deletion of words. A subset of rewriting rules retrieved from a literature study (Decker, 2003) was used. A prototype of StilLett was then developed, which included additional operators and node conditions. Rennes (2015) created a first set of rules (R1), based on a thorough literature review. R1 was refined by manual training, resulting in a new rule set (R2). An evaluation of the performance of the two rule sets showed that the manual training significantly improved the performance of the rules; R2 scored higher than R1 on precision, recall, and correctness. StilLett was also further extended with yet more rewriting operators and node conditions. After the extension of StilLett, the possible operations were the ones shown in Table 2.1 and the node conditions those shown in Table 2.2. The rules that have been implemented in StilLett and evaluated, regarding precision, recall, and correctness, are shown in Table 3.2.

Table 2.1: Operators in StilLett.

REPL
    Replaces a target phrase structure with a replacement structure.
DEL
    Removes the target phrase.
SHIFT
    Shifts the order of given nodes.
SPLIT
    Splits a tree, at a given node, into two new trees.
PLANTTREE
    A second split action, which moves a target phrase structure to a new tree.
PLANTBRANCH
    A third split action, which moves a child node of a given node to a new tree.
DROP
    Removes nodes from a matched node.
ADD
    Adds a node to a tree.

SAPIS (Fahlborg & Rennes, 2016), or StilLett SCREAM API Service, is an API that provides the possibility to calculate the complexity of texts and perform text simplification on a remote server. SAPIS also provides a preprocessing module, which consists of two POS-taggers, two versions of a parser, and a format converter. The implemented POS-taggers are Stagger (Östling, 2013) and the OpenNLP part-of-speech tagger (Morton, Kottmann, Baldridge, & Bierner, 2005). Further syntactic analysis is performed by MaltParser (Nivre, Hall, & Nilsson, 2006). Two different versions of MaltParser are implemented in the system: MaltParser 1.2 and MaltParser 1.7.2. The latter version was implemented to make it easier to use dependency parsing in the future, although the current system is constituency-based. The format converting module is used to convert the CoNLL-formatted output from Stagger to the NEGRA format, supported by MaltParser. There is also a post-processing module, which restores text to a readable format by removing syntax tags and ensuring correct placement of capital letters.

Table 2.2: Node conditions in StilLett.

DEPENDENCY MATCH
    Requires a node to have a given dependency tag.
UNKNOWN PARENT (?)
    The question mark can represent none or any node.
INDEX
    Requires a node to have a given index.
CHILDREN MATCH
    Requires a node to have certain children.
COND
    Requires the children of a given node to fulfill a given condition.
NOT
    Excludes nodes containing matching nodes.

2.4 dep_tregex - A Dependency Tree Tool

dep_tregex is a dependency tree reordering tool that was developed as part of the techniques used for a contribution to the WMT16 shared task Machine Translation of News (Dvorkovich, Gubanov, & Galinskaya, 2016). The module implemented a Stanford Tregex-inspired language for rule-based dependency-tree manipulation. Tregex (Levy & Andrew, 2006) is a Java module for matching patterns in phrasal parse trees.

Table 2.3: Arguments available in dep_tregex.

words
    Extracts and prints all words.
wc
    Counts and prints the number of trees.
nth
    Extracts and prints the nth tree.
head
    Extracts and prints the N first trees.
tail
    Extracts and prints the N last trees.
shuf
    Extracts and prints the trees in a shuffled order.
grep
    Extracts and prints all trees that match a given pattern.
sed
    Applies rules contained in a script file to trees and prints the resulting trees.
html
    Views trees in a web browser; can take optional arguments, for example --postag, which affect the amount of information shown for the trees.
gdb
    Views the application of rule scripts to a tree step-by-step.


Table 2.4: Node conditions in dep_tregex.

ATTR STR_COND
    Checks whether a node has a given attribute matched by the string condition (a string or regular expression). Example: x form /dog/i
is_top
    Checks whether a node's parent is the root node. Example: x is_top
is_leaf
    True if a node x does not have any children. Example: x is_leaf
can_head ID
    Checks that a tree will remain valid (not cyclic or disconnected) if x were to head y. Example: x can_head y
can_be_headed_by ID
    True for y can_be_headed_by x whenever x can_head y is. Example: y can_be_headed_by x
== ID
    Checks whether a node x is the same node as y. Example: x == y

dep_tregex can handle several arguments, given a file with trees in CoNLL format and, in some cases, a text file containing rule scripts. The arguments that can be given are shown in Table 2.3. There are also several node conditions, actions, and operators available in dep_tregex. The conditions include node conditions, string conditions, and neighborhood conditions. The possible actions are copy, move, delete, set, set_head, try_set_head, and group. Finally, there are three operators: not, and, and or.
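To make the CoNLL input concrete, the sketch below reads a minimal tree and checks the kind of validity that the can_head condition guards: following head links from any node must reach the root without cycles. The four-column rows and the dependency labels are simplified illustrations, not dep_tregex's actual reader or the full CoNLL column set.

```python
# Minimal CoNLL-style dependency tree: each row is "index form head deprel",
# where head 0 marks the root. A tree is valid if the head relation is
# connected and acyclic, which is what can_head checks before an edit.

def parse_tree(rows):
    """Parse rows into a dict: index -> (form, head, deprel)."""
    tree = {}
    for row in rows:
        idx, form, head, deprel = row.split()
        tree[int(idx)] = (form, int(head), deprel)
    return tree

def is_valid(tree):
    """True if every node reaches the root (index 0) without revisiting a node."""
    for start in tree:
        seen = set()
        idx = start
        while idx != 0:
            if idx in seen:
                return False  # cycle detected
            seen.add(idx)
            idx = tree[idx][1]  # step to the node's head
    return True

# "The brown dog is happy .", with hypothetical dependency labels.
rows = [
    "1 The   3 det",
    "2 brown 3 amod",
    "3 dog   5 nsubj",
    "4 is    5 cop",
    "5 happy 0 root",
    "6 .     5 punct",
]
print(is_valid(parse_tree(rows)))  # True
```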

2.4.1 Conditions

The string conditions in dep_tregex can be written either as strings, with single or double quotation marks, or as regular expressions. The node conditions are used to specify what kind of node a certain script should be applied to. The conditions available in dep_tregex are shown in Table 2.4. dep_tregex also includes a more specific type of condition: neighborhood conditions. Such conditions are used to query which relations nodes have to other nodes in the same tree. The neighborhood conditions are shown in Table 2.5. For example, the condition "x has y as its adjacent left neighbor" would be written as "x $- y".

2.4.2 Actions

The actions available in dep_tregex are used to modify trees. Table 2.6 shows the actions available in dep_tregex alongside their respective syntax and definition. All actions require that nodes have first been defined and given an ID through conditions.


Table 2.5: Neighborhood conditions.

.<--   Has left child
-->.   Has right child
<--.   Has right head
.-->   Has left head
.<-    Has adjacent left child
->.    Has adjacent right child
<-.    Has adjacent right head
.->    Has adjacent left head
>      Has child
>>     Has predecessor
<      Has head
<<     Has successor
$--    Has left neighbor
$++    Has right neighbor
$-     Has adjacent left neighbor
$+     Has adjacent right neighbor

2.4.3

Syntax

In dep_tregex, rules are tokenized by a lexer built with the lex module of PLY (Python Lex-Yacc) and then parsed by a parser built with the yacc module of PLY. The syntax for a rule is: { condition(s) :: action(s); }

In the condition(s), variables referring to individual nodes or groups of nodes are defined. In the action(s), the actions that should modify the previously defined variables are defined. For example:

{
    n form "dog" and postag "NN" and $- (a postag "JJ") ::
    delete node a;
}

The script would match the sentence "The brown dog is happy." and transform it to "The dog is happy.". Meanings of POS-tags can be found in Appendix B.
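The effect of this rule can be illustrated with plain Python data structures. The sketch below is not dep_tregex's internal API; tokens are modeled as simple dicts and the matching logic of the rule is written out by hand:

```python
# Illustrative sketch (not dep_tregex's internal representation): the rule
# 'n form "dog" and postag "NN" and $- (a postag "JJ") :: delete node a;'
# applied to a flat token list.

tokens = [
    {"form": "The",   "postag": "DT"},
    {"form": "brown", "postag": "JJ"},
    {"form": "dog",   "postag": "NN"},
    {"form": "is",    "postag": "VBZ"},
    {"form": "happy", "postag": "JJ"},
    {"form": ".",     "postag": "."},
]

def apply_rule(tokens):
    """Delete an adjective (a) that is the adjacent left neighbor of the noun 'dog' (n)."""
    for i, tok in enumerate(tokens):
        if tok["form"] == "dog" and tok["postag"] == "NN":
            if i > 0 and tokens[i - 1]["postag"] == "JJ":
                return tokens[:i - 1] + tokens[i:]  # delete node a
    return tokens

result = apply_rule(tokens)
print(" ".join(t["form"] for t in result))  # The dog is happy .
```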



Table 2.6: Actions in dep_tregex.

Action        Syntax                             Definition
copy          copy (group|node) x                Copies x and appends the copy of x
              before (group|node) y              before y.
move          move (group|node) x                Moves x and places it after y.
              after (group|node) y
delete        delete (group|node) x              Removes node x from the tree.
set           set postag x "NN"                  Changes the POS-tag of x to NN.
set_head      set_head x                         Sets node x as head of y or vice versa.
              (heads|headed_by) y
try_set_head  try_set_head x                     Sets the head of a node, but does not
              (heads|headed_by) y                fail if the tree becomes cyclic or
                                                 disconnected, unlike set_head.
group         group x y                          Creates a virtual arc between x and y,
                                                 which is then considered in copy,
                                                 move, and delete actions.

2.4.4

Rule Application

Rules contained in text files are applied to trees. Each rule file can contain one or many rules, and each rule can contain one or many conditions and actions. Rules are parsed and applied to one tree at a time. If the conditions of a rule are fulfilled, the functions triggered by the keywords in the actions are run and the trees are modified according to the actions defined in the rule. If not, the next rule contained in the text file is applied to the trees. Several files of rules can be run on the same trees. However, only one file can be run at a time. As different rules may modify the same trees, the order in which files are run may affect the output trees.
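The application loop can be sketched roughly as follows. The rule representation here, plain (condition, action) callables, is a simplification and stands in for dep_tregex's actual parsed rule objects:

```python
# Hedged sketch of sequential rule application; `rules` stands in for one
# parsed rule file, `trees` for the CoNLL input.

def apply_rule_file(trees, rules):
    """Apply every rule in a file to every tree, in file order."""
    for tree in trees:
        for condition, action in rules:
            match = condition(tree)      # conditions bind a node on success
            if match is not None:
                action(tree, match)      # actions modify the tree in place

# Toy tree: tokens as (form, head_index) pairs.
# (Head-index bookkeeping after deletion is omitted in this toy example.)
trees = [[("Hunden", 2), ("bruna", 1), ("sover", 0)]]

# One rule: delete the node whose form is "bruna".
rules = [
    (lambda t: next((i for i, (form, _) in enumerate(t) if form == "bruna"), None),
     lambda t, i: t.pop(i)),
]

apply_rule_file(trees, rules)
print(trees[0])  # [('Hunden', 2), ('sover', 0)]
```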


3

Method

The system described in this thesis was built on the publicly available tool dep_tregex. This chapter describes the further developments of dep_tregex that have been made within the work on this thesis. Moreover, the creation and manual training of a rule set is described, as well as the method used to evaluate the performance of the rules.

dep_tregex is written in Python and has much of the functionality needed to perform text simplification, for example the possibility to remove and reorder words. It was considered a good basis for a text simplification system and was, thus, chosen to be used and further developed within the work on this thesis.

3.1

Further Developments

dep_tregex was created to support the rearrangement of words in dependency trees (Dvorkovich et al., 2016). Since ATS requires more than just the rearrangement of words, for example the possibility to split a sentence into two, changes were made to dep_tregex within the work on this thesis. The changes can be divided into two categories: correction of errors in the publicly available code and the addition of functionality that was not in the publicly available dep_tregex system. All additions were made to ensure that the current system could perform the same rewriting operations as StilLett (Rennes & Jönsson, 2015). The added functionalities were: add, split, and conj.

3.1.1

Add

The action Add made it possible to add new nodes to trees, which was needed in order to perform several of the transformations. For example, the addition of the indefinite pronoun man (eng: one) was needed for some of the passive sentences that were to be simplified by the rule passive-to-active. The syntax for the action script was "add X (before|after) (node|group) ID", where X refers to the string that is to be added.

Example of usage: add "man" before node n.

3.1.2

Split

This action made it possible to split a tree into two new tree structures. The action was needed in order to handle transformation rules where a new tree is created out of parts of another tree. The syntax for the action script was "split (before|after) (node|group) ID" or "split (before|after) (node|group) ID and (before|after) (node|group) ID". The first splits the tree at a given point, while the latter allows a part in the middle of a tree to be moved to a new tree structure.
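The two split variants can be illustrated on a flat token list rather than a real dependency tree. The helper names below are hypothetical, chosen for the sketch:

```python
# Illustrative sketch of the two split variants described above; a real
# implementation would also divide the dependency arcs between the new trees.

def split_at(tokens, index):
    """split before node `index`: the tree becomes two new structures."""
    return tokens[:index], tokens[index:]

def split_out(tokens, start, end):
    """split before `start` and after `end`: a middle part moves to a new tree."""
    middle = tokens[start:end + 1]
    rest = tokens[:start] + tokens[end + 1:]
    return rest, middle

first, second = split_at(["Huset", "byggdes", "och", "Pelle", "sov"], 2)
print(first, second)   # ['Huset', 'byggdes'] ['och', 'Pelle', 'sov']

rest, middle = split_out(["Pelle", "som", "sover", "ler"], 1, 2)
print(rest, middle)    # ['Pelle', 'ler'] ['som', 'sover']
```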



3.1.3

Conj

The possibility to conjugate verbs was crucial for some of the transformation rules, for example when changing a sentence from passive to active voice. Consider the passive sentence "Huset byggdes av Pelle" (eng: "The house was built by Pelle"), which needs to be transformed into the active sentence "Pelle byggde huset" (eng: "Pelle built the house"). A new module for conjugation of verbs was added to the framework. The module made use of a list of all verbs and their corresponding conjugation paradigms, gathered from SALDO (Borin, Forsberg, & Lönngren, 2008). In addition, a list of irregular verbs was gathered and used to conjugate verbs that consisted of or ended with an irregular verb. An action for conjugation of verbs was added to the framework as well. The following verb forms were included in the module: infinitive (inf), present (pres), preterite (pret), and supine (sup). All verb forms can be found in Appendix C. The syntax for the action script was "conj ID (inf|pres|pret|sup)".

Example of usage: conj v pret.
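A minimal sketch of such a conjugation lookup is shown below. The paradigm entries here are a tiny hand-written sample, not the SALDO data the module actually uses, and the suffix-matching fallback only approximates the module's handling of verbs ending with an irregular verb:

```python
# Tiny hand-written paradigm sample (NOT SALDO data) mapping an infinitive
# to its forms.
PARADIGMS = {
    "bygga": {"inf": "bygga", "pres": "bygger", "pret": "byggde", "sup": "byggt"},
    "gå":    {"inf": "gå",    "pres": "går",    "pret": "gick",   "sup": "gått"},
}

def conj(verb, form):
    """Conjugate `verb`; fall back to the longest known verb it ends with
    (approximates handling of compounds of irregular verbs, e.g. avgå -> avgick)."""
    if verb in PARADIGMS:
        return PARADIGMS[verb][form]
    for known in sorted(PARADIGMS, key=len, reverse=True):
        if verb.endswith(known):
            prefix = verb[: -len(known)]
            return prefix + PARADIGMS[known][form]
    raise KeyError(verb)

print(conj("bygga", "pret"))  # byggde
print(conj("avgå", "pret"))   # avgick
```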

3.2

Text Selection

Manual training of transformation rules, that is, manually fitting a rule set to the texts contained in the training data, has been shown to be a successful method to improve the performance of a rule set (Rennes, 2015). In order to be able to compare the performance of the current system to that of the system presented by Rennes and Jönsson (2015), StilLett, the manual training process used to improve the rule set implemented in StilLett was replicated. Thus, the same texts that were used to train and evaluate the rule set implemented in StilLett were chosen as training and test data for this thesis. The texts were SweSAT texts, which are designed to measure Swedish reading comprehension. The topics and genres of the texts vary, but the variation in difficulty level between the texts can be assumed to be small. A total of 48 texts were used. Six of them were used as training data and the remaining texts as test data. Information on the size of the training and test data is shown in Table 3.1. Rennes (2015) pointed out that a larger training material would likely have increased the performance of StilLett. However, including a larger number of texts in the training data used in this thesis would rule out the possibility to compare the two systems. Thus, the same texts were chosen, despite the disadvantages of the size of the training material. All texts were preprocessed using SAPIS (Fahlborg & Rennes, 2016).

Table 3.1: Size of the training and test material.

Data      Number of sentences
Training  129
Test      1208
Total     1337

3.3

Rule Selection and Production

The rules chosen for the evaluation are shown in Table 3.2. The rules were the same as the ones included in the rule set developed by Rennes (2015) (R2), in order to make it possible to compare the performance of the two systems.



First, a prototype rule set was created with the aim to perform the same sentence transformations as R2, without looking at the training data. This was done to gain a set of rules to train. Manual training of the rules in the prototype rule set, following Rennes (2015), was performed on the texts contained in the training data. For each rule, all target tree structures in the training data were identified manually and the rules were adjusted to make sure that they modified the structures according to the desired output, resulting in a refined rule set.

Table 3.2: Rules included in the rule set.

Proximization (Prox.): The rule aims to change the text to make it psychologically closer to the reader. This can be done by directly addressing the reader. This was done by changing the indefinite pronoun man (eng: one) to du (eng: you). Also, the correct form of the object corresponding to the pronoun was set, if needed.

Passive-to-active (P2A): The rule aims to rewrite sentences in passive form to active form. The rule is triggered by a verb with the feature SFO, indicating a verb in passive voice.

Quotation inversion (QI): The rule aims to swap the places of a quotation and the person expressing it. The rule is triggered by sentences of quotation-like form, that is, a quotation followed by a comma, a verb, and a pronoun or a noun. A quotation either starts with a dash or has a quotation mark before and after the quote.

Straight word order (SWO): The rule aims to rearrange the words in a sentence to achieve straight word order, that is, first a subject, then a verb, and then an object.

SPLIT-k: Sentence splits aim to divide long and/or complex sentences into new, simpler sentences. SPLIT-k performs splitting at subordinating and coordinating conjunctions. The rule was triggered by a comma followed by a word with POS-tag SN or KN.

SPLIT-r: A second split rule, which performs splitting at relative clauses. The rule was triggered by a relative pronoun (POS-tag HP) in a nominal phrase.

SPLIT-a: A third split rule, which performs splitting at appositions. The rule was triggered by an apposition (dependency label AN) within commas.

3.4

Evaluation

As a first step in the evaluation of the manually trained rule set, sentences that were relevant for each rule were identified using the grep argument in dep_tregex. One condition script was created for each rule and sentences that were matched by the conditions were extracted. The conditions were the same as the ones used by Rennes (2015). Sentences that were erroneously matched by the broad conditions were ignored. Whether sentences were correctly matched or not was decided individually by two undergraduate students with Swedish as their native language. If their opinions on a sentence differed, the most generous approach



was followed and the sentence was deemed correctly matched. The correctly matched sentences are henceforth referred to as true positives.

For proximization, all sentences containing the indefinite pronoun man (eng: one) or the word sig (eng: oneself) with the feature OBJ were matched. For passive-to-active, sentences with verbs tagged with the morphological feature SFO were matched. For quotation inversion, sentences containing a quotation mark or a dash were matched. For straight word order, sentences indicating reversed word order were matched. That is, sentences where the first word is an adverb and the second word is a verb. The three split rules all have different conditions. Candidates for sentence splitting at subordinating and coordinating conjunctions were extracted by searching for the POS-tags KN and SN. Candidates for sentence splitting at relative clauses were extracted by searching for sentences containing relative pronouns (POS-tag HP). Finally, candidates for splitting at appositions were extracted by searching for sentences containing words with the dependency label AN. Meanings of dependency labels, POS-tags and morphological features can be found in Appendix A-C.

For the evaluation, the performance of each rule was evaluated by applying it to all its sentence candidates. Sentence candidates contain both true positives and true negatives. Precision and recall were measured, based on the previously extracted true positives. Precision is given by the true positives divided by the sum of the true positives and the false positives, and recall is the true positives divided by the sum of the true positives and the false negatives. In addition, correctness was measured. The same two undergraduate students that performed the selection of true positives rated all sentences the rules were applied to. Sentences that were linguistically correct were marked with correct and sentences that were incorrect were marked with incorrect. Correctness is the number of correct sentences divided by the sum of the true positives and the false positives.
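These three measures can be stated directly in code. The counts below are made-up illustrative numbers, not the thesis's actual evaluation results:

```python
# Worked example of the measures defined above, using made-up counts.

def precision(tp, fp):
    # share of modified sentences that were relevant
    return tp / (tp + fp)

def recall(tp, fn):
    # share of relevant sentences that were modified
    return tp / (tp + fn)

def correctness(correct, tp, fp):
    # share of modified sentences (TP + FP) that are linguistically correct
    return correct / (tp + fp)

tp, fp, fn, n_correct = 80, 5, 15, 60
print(round(precision(tp, fp), 3))             # 0.941
print(round(recall(tp, fn), 3))                # 0.842
print(round(correctness(n_correct, tp, fp), 3))  # 0.706
```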

Each rule was evaluated separately from all the other rules in the rule set, as some sentences can trigger several rules and, thus, have an effect on the results.


4

Results

In the following section, the results for the rule set created within the work on this thesis, VR (short for Vida Rules), are presented and compared to the results of the rule set created by Rennes (2015) (R2). VR rules were trained and evaluated using the same method and data as R2 rules. The training data included target tree structures for all rules.

4.1

Results for VR

The results of the evaluation of VR are presented in Table 4.1. The row labeled Correct shows the ratio of linguistically correctly transformed sentences. That is, the number of correct sentences divided by the sum of the true positives and the false positives for each rule. Overall, the rule set scored both a high precision and a high recall (precision: 96%, recall: 85%), which is also shown in Figure 4.1.

4.2

VR compared to R2

Compared to R2 (Rennes, 2015), VR performed better in almost all conditions regarding precision and recall, with an increase in total precision from 82% to 96% and an increase in total recall from 53% to 85%, as shown in Figure 4.1. Total precision was calculated by dividing the sum of the true positives for all rules by the sum of the true positives and the false positives

Figure 4.1: Total precision and recall for R2 (precision 0.82, recall 0.53) and VR (precision 0.96, recall 0.85).

(24)

4.2. VR compared to R2

for all rules. Total recall was calculated by dividing the sum of the true positives for all rules by the sum of the true positives and the false negatives for all rules. A more detailed comparison of the results is shown in Table 4.1. The only exception is Quotation Inversion (QI), where the VR rule for QI scored a lower precision (84% compared to 100%). However, VR scored a higher recall for QI (84% compared to 65%). When comparing the scores on correctness for the two rule sets, VR scored a higher correctness on proximization (Prox.), straight word order (SWO), SPLIT-k, and SPLIT-r. R2 scored a higher correctness on passive-to-active (P2A) and QI. Both rule sets scored 0% on correctness for SPLIT-a. Total correctness was 48% for VR, but is not available for R2. Total correctness was calculated by dividing the sum of all modified, linguistically correct sentences by the sum of all true positives and false positives.

Table 4.1: Precision, recall, and correctness for all conditions for the two rule sets. Grey cells contain values that are higher than the corresponding value in the opposite table.

Rule set: R2

           Prox.  P2A    QI     SWO    SPLIT-k  SPLIT-r  SPLIT-a  SPLIT-tot.  Tot.
Precision  0.979  0.890  1      0.848  0.537    0.732    0.111    0.732       0.821
Recall     0.960  0.564  0.650  0.683  0.254    0.449    0.026    0.314       0.530
Correct    62%    77%    85%    63%    35%      41%      0%       37%         -

Rule set: VR

           Prox.  P2A    QI     SWO    SPLIT-k  SPLIT-r  SPLIT-a  SPLIT-tot.  Tot.
Precision  1      0.978  0.842  1      0.981    0.957    0.286    0.933       0.957
Recall     1      1      0.842  0.859  0.922    0.658    0.857    0.741       0.850


5

Discussion

The results showed that the current system in general scored higher than StilLett (Rennes & Jönsson, 2015) on all three conditions (precision, recall, and correctness). The higher precision is explained by the fact that the sentences matched by VR rules were relevant to a larger extent than the sentences matched by R2 rules. The higher recall shows that VR rules left fewer relevant sentences untreated than R2 rules. The higher correctness shows that the sentences that were modified by VR rules were linguistically correct to a larger extent than those modified by R2 rules.

While the precision and recall were high for the VR rules, the correctness was for some rules lower than the precision and recall for the same rules. Many of the mistakes made by the system, resulting in lowered performance, were due to parser errors. For example, sentences with periods, exclamation marks, or question marks in the middle of the sentence were parsed as two sentences, giving rise to linguistically incorrect output. Another example is cases where a noun was erroneously tagged as a cardinal number. Erroneously tagged words will not be matched correctly, yielding lower results.

The P2A rule in the VR rule set scored 98% on precision and 100% on recall, but only 23% on correctness. In this case, errors were mainly due to an awkward word order. This problem could be fixed by making the rule less general. Another source of error was the conjugation module, which was mainly used for the P2A rule. In most cases, conjugated words were correct, but not always. In order to reach a higher score for correctness, the module needs improvements.

The sentence split for appositions scored 0% correctness for both VR and R2. As mentioned by Rennes (2015), target structures rarely occurred in the texts and the training data were not sufficient to capture all target structures in the test data. However, it is worth noting that the VR split for appositions scored 86% on recall compared to the 11% recall for the R2 rule. This shows that the VR rule found more of the relevant structures. The precision was still low (27%), which shows that further training is needed and that the rules need to be made less general in order to be able to differentiate between relevant structures and false positives.

The differences in performance between StilLett (Rennes & Jönsson, 2015) and the current system could be due to several reasons. One possible reason is that the manual training processes, that is, the adaptation of rules to the texts included in the training data, might have differed. Although the same training and evaluation method were used, with the same data, the training could have resulted in very different rules. For example, a more general rule might work almost as well as a less general rule on the training data, but give rise to different results when applied to the test data. The VR rules were created with the aim to perform the same transformations as the R2 rules, using information from a dependency parser instead of a constituency parser. That is, VR is not a "translation" of R2 and it is not clear how similar their respective rules are.


A second possible reason why the results for the systems differ is that the rules were evaluated by different participants. Correctness is a subjective measure and it is, thus, possible that there is a difference in what the participants consider a correct sentence to be. This could give rise to misleading results, which is why a larger number of participants would be useful. It would also be useful to let the participants rate the text simplification regarding, for example, fluency, simplicity, and flow. By doing so, a more reliable and nuanced measure of the correctness could be reached. Although the subjective measure correctness did not have an effect on the objective measures in this evaluation, true positives were also selected by humans. This has some disadvantages. For example, the selection process might have been more subjective than it would have been using automatic methods. Also, the results might be less uniform. On the other hand, humans outperform natural language processing systems on many tasks. To achieve a more trustworthy selection of true positives, a larger number of participants could be asked to perform the classification.

A third possible reason for the difference in performance is that the systems have access to different syntactic information. While the current system has access to information about how words are related to each other and their functions, StilLett has access to information about how words or groups of words function as units in a hierarchical structure. Although the results showed that the current system performed better than StilLett, they provided no clear evidence that the difference in syntactic information had an effect on the performance of the systems. The suitability of typed dependencies and phrasal parse trees for complex lexico-syntactic tasks has, however, been examined by Siddharthan (2010). It was concluded that typed dependency structures were the most suitable representation for such tasks and that handcrafted dependency-based simplification rules often generalized well to sentences not included in the training data. These findings are consistent with the results for R2 and VR and could explain the differences in performance.

It is important to note that the R2 rules made use of information acquired through dependency parsing (for example, to identify the subject of a sentence in the P2A rule) in addition to the information acquired through constituency parsing. While StilLett needs such information, the current system does not need information provided by constituency parsing. This shows that the current, dependency-based system is sufficient to achieve better results than StilLett. As dependency parsing is faster than constituency parsing (Cer et al., 2010), simplified texts can be produced more efficiently if only dependency parsing is needed.

As previously mentioned, the conjugation module used by the current system was not perfect. The other additions made to dep_tregex, the functionality to add words and split sentences, did not give rise to erroneous output. However, each rule was only applied once to the "original" nodes of a tree and not to the nodes created by the application of the same rule. Due to this, some sentences were not modified enough, resulting in output sentences that had been "half-way" simplified. To further improve the current system, it needs to be made possible to apply a rule to the nodes created by it.

Manual training has been shown to be a successful method to improve text simplification (Rennes, 2015). However, it is time-consuming to identify target tree structures and adjust rules according to them. Also, manually trained rules might not generalize well to texts that are very different from the training data. It would be useful to increase the efficiency of the rule production and, thus, make it possible to develop different rule sets for different target groups, with different desires, and for different genres. One way to increase the efficiency of the rule production is to create a parallel corpus, containing sentences aligned with plain language sentences, and automatically harvest rules from patterns in the corpus (Angrosh & Siddharthan, 2014). For the current system, this would require a method to format the harvested rules into rules that can be parsed by the system.

The evaluation of the current system consisted of a comparison of the performance of the implemented rules to that of the rules implemented in StilLett (Rennes, 2015). While this method indicates how the systems differ, it has some limitations. For example, it does not show how well the systems would perform on other rules or how time-consuming it is to create rules for the different systems. Also, the only subjective measure was correctness, which concerns the linguistic correctness of modified sentences. It does not supply information on how comprehensible the modified sentences are. The objective measures used, precision and recall, provide a picture of how well relevant tree structures are captured by the rules. Although these measures are a good first step to measure the performance of a system with an implemented set of rules, other methods might capture the comprehensibility of the output better. Human evaluation can be used to measure, for example, the fluency and simplicity of simplified sentences (Angrosh & Siddharthan, 2014; Ferrés et al., 2015). To give an accurate picture of the extent to which the simplification aids the intended audience, participants should belong to the target group(s). In addition to human evaluation, text complexity metrics (Falkenjack et al., 2013; Falkenjack & Jönsson, 2014) could be calculated for input and output text. This would give a deeper understanding of how the simplification process affects the complexity of text.


6

Conclusion

The aim of this thesis was to develop a system for automatic text simplification based on dependency relations, develop a rule set, and compare the performance of the system to that of a constituency-based system. The system was built on dep_tregex, a dependency tree reordering tool. Additional functionality was added to ensure that the system could perform all operations needed for the text simplification task.

In order to be able to compare the performance of the current system to that of a similar, constituency-based system, the method that was used to train and evaluate the rules for the constituency-based system was replicated. A first rule set was developed, with the goal to perform the same textual operations as the constituency-based rules, using information from a dependency parser. The rule set was then refined by manually adapting the rules to a set of training texts. The evaluation of the results showed that the current system scored higher on both precision and recall. The precision was 96%, compared to 82%, and the recall was 85%, compared to 53%. While the constituency-based system made use of information acquired through both constituency and dependency parsing, the current system performed better without information from a constituency parser. This indicates that dependency parsing is sufficient to provide the information needed to perform ATS.

In this thesis, the evaluation of text simplification consisted of a classification task. This enabled the comparison of the current system and the constituency-based system. However, the evaluation only examines one part of the text simplification task. The second part, to what extent the text simplification is helpful for target readers, remains to be examined. Moreover, different rules can be created and combined in many ways. It would be valuable to find out which types of rules, and which combinations of them, are the most helpful for target readers.

Manual training has been shown to be a successful method to improve text simplification. However, the training data need to include more texts containing relevant structures in order to adapt the rules to more structures. It would also be interesting to train the rules on texts of other genres, to examine the role the genre plays in the shaping of rules. In order for this to be done in an efficient manner, a parallel corpus of sentences aligned with plain language sentences is needed. Within the work on TeCST, a method for alignment of regular and simplified sentences is currently being developed. The goal is to automatically harvest simplification rules from the aligned sentences and implement them in the current system.


References

Aluísio, S. M., & Gasperin, C. (2010). Fostering Digital Inclusion and Accessibility: The PorSimples project for Simplification of Portuguese Texts. In Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas (pp. 46–53). Los Angeles, California, USA.

Angrosh, M., Nomoto, T., & Siddharthan, A. (2014). Lexico-syntactic text simplification and compression with typed dependencies. In COLING (pp. 1996–2006). Dublin, Ireland.

Angrosh, M., & Siddharthan, A. (2014). Text simplification using synchronous dependency grammars: Generalising automatically harvested rules. In Proc. of the 8th International Natural Language Generation Conference. Philadelphia, Pennsylvania, USA.

Björnsson, C.-H. (1968). Läsbarhet. Stockholm, Sweden: Liber.

Borin, L., Forsberg, M., & Lönngren, L. (2008). SALDO 1.0 (Svenskt associationslexikon version 2). Språkbanken, Göteborgs universitet.

Brouwers, L., Bernhard, D., & François, T. (2014). Syntactic Sentence Simplification for French. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR) @ EACL 2014 (pp. 47–56). Gothenburg, Sweden: Association for Computational Linguistics.

Cer, D., De Marneffe, M.-C., Jurafsky, D., & Manning, C. D. (2010). Parsing to Stanford Dependencies: Trade-offs between speed and accuracy. In LREC 2010, Seventh International Conference on Language Resources and Evaluation. Valletta, Malta.

Decker, A. (2003). Towards automatic grammatical simplification of Swedish text (Master’s thesis). Stockholm University.

Dvorkovich, A., Gubanov, S., & Galinskaya, I. (2016). Yandex School of Data Analysis approach to English-Turkish translation at WMT16 News Translation Task. In Proceedings of the First Conference on Machine Translation (Vol. 2, pp. 281–288). Berlin, Germany.

Fahlborg, D., & Rennes, E. (2016). Introducing SAPIS - an API service for text analysis and simplification. In The second national Swe-Clarin workshop: Research collaborations for the digital age, Umeå, Sweden.

Falkenjack, J., Heimann Mühlenbock, K., & Jönsson, A. (2013). Features indicating readability in Swedish text. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NoDaLiDa-2013), Oslo, Norway.

Falkenjack, J., & Jönsson, A. (2014). Classifying easy-to-read texts without parsing. In The 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR 2014), Göteborg, Sweden.

Falkenjack, J., Rennes, E., Fahlborg, D., Johansson, V., & Jönsson, A. (2017). Services for text simplification and analysis. In Proceedings of the 21st Nordic Conference on Computational Linguistics, Gothenburg, Sweden.

Ferrés, D., Marimon, M., & Saggion, H. (2015). A Web-based Text Simplification System for English. Journal of Sociedad Española para el Procesamiento del Lenguaje Natural, 55.

Inclusion Europe. (2014). Information for all - European standards for making information easy to read and understand.

Johansson, V., & Rennes, E. (2016). Automatic extraction of synonyms from an easy-to-read corpus. In Proceedings of the Sixth Swedish Language Technology Conference (SLTC-16), Umeå, Sweden.



Keskisärkkä, R., & Jönsson, A. (2013). Investigations of Synonym Replacement for Swedish. Northern European Journal of Language Technology, 3(3), 41–59.

Lättlästutredningen. (2013). Lättläst (SOU 2013:5 ed.). Stockholm, Sweden: Kulturdepartementet.

Levy, R., & Andrew, G. (2006). Tregex and Tsurgeon: tools for querying and manipulating tree data structures. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC).

Morton, T., Kottmann, J., Baldridge, J., & Bierner, G. (2005). OpenNLP: A Java-based NLP toolkit.

Nivre, J., Hall, J., & Nilsson, J. (2006, May). MaltParser: A Data-Driven Parser-Generator for Dependency Parsing. In Proceedings of the fifth international conference on Language Resources and Evaluation (LREC2006) (pp. 2216–2219).

Östling, R. (2013). Stagger: an open-source part of speech tagger for Swedish. Northern European Journal of Language Technology, 3.

PLAIN. (2011). Federal Plain Language Guidelines.

Rennes, E. (2015). Improved Automatic Text Simplification by Manual Training (Master’s thesis). Linköping University.

Rennes, E., & Jönsson, A. (2015). A tool for automatic simplification of Swedish texts. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NoDaLiDa-2015), Vilnius, Lithuania.

Rybing, J., Smith, C., & Silvervarg, A. (2010). Towards a Rule Based System for Automatic Simplification of Texts. In Swedish Language Technology Conference, SLTC, Linköping, Sweden.

Shardlow, M. (2014). A Survey of Automated Text Simplification. (IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Natural Language Processing, 58–70.

Siddharthan, A. (2010). Complex lexico-syntactic reformulation of sentences using typed dependency representations. In Sixth International Natural Language Generation Conference (INLG 2010) (pp. 125–133). Dublin, Ireland.

Siddharthan, A. (2014). A survey of research on text simplification. International Journal of Applied Linguistics, 165(2), 259–298.

Smith, C., & Jönsson, A. (2011a). Automatic Summarization As Means Of Simplifying Texts, An Evaluation For Swedish. In Proceedings of the 18th Nordic Conference of Computational Linguistics (NoDaLiDa-2011), Riga, Latvia.

Smith, C., & Jönsson, A. (2011b). Enhancing extraction based summarization with outside word space. In Proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand.

Specia, L. (2010). Translating from Complex to Simplified Sentences. In PROPOR (pp. 30–39).

Suter, J., Ebling, S., & Volk, M. (2016). Rule-based Automatic Text Simplification for German. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016). Bochum, Germany.

United Nations. (2006). Convention on the Rights of Persons with Disabilities and Optional Protocol. New York City, New York, USA: United Nations.


A

Dependency Labels

MAMBA Categories (http://stp.lingfil.uu.se/~nivre/swedish_treebank/dep.html)

Tag Meaning
++ Coordinating conjunction
+A Conjunctional adverbial
+F Coordination at main clause level
AA Other adverbial
AG Agent
AN Apposition
AT Nominal (adjectival) pre-modifier
CA Contrastive adverbial
DB Doubled function
DT Determiner
EF Relative clause in cleft
EO Logical object
ES Logical subject
ET Other nominal post-modifier
FO Dummy object
FP Free subjective predicative complement
FS Dummy subject
FV Finite predicate verb
I? Question mark
IC Quotation mark
IG Other punctuation mark
IK Comma
IM Infinitive marker
IO Indirect object
IP Period
IQ Colon
IR Parenthesis
IS Semicolon
IT Dash
IU Exclamation mark
IV Nonfinite verb
JC Second quotation mark
JG Second (other) punctuation mark
JR Second parenthesis
JT Second dash
KA Comparative adverbial
MA Attitude adverbial
MS Macrosyntagm
NA Negation adverbial
OA Object adverbial
OO Direct object
OP Object predicative
PL Verb particle
PR Preposition
PT Predicative attribute
RA Place adverbial
SP Subjective predicative complement
SS Other subject
TA Time adverbial
TT Address phrase
UK Subordinating conjunction
VA Notifying adverbial
VO Infinitive object complement
VS Infinitive subject complement
XA Expressions like "så att säga" (so to speak)
XF Fundament phrase
XT Expressions like "så kallad" (so called)
XX Unclassifiable grammatical function
YY Interjection phrase
CJ Conjunct (in coordinate structure)
HD Head
IF Infinitive verb phrase minus infinitive marker
PA Complement of preposition
UA Subordinate clause minus subordinating conjunction
VG Verb group


B

Stagger POS-tags

Tag Meaning

HS possessive relative pronoun
PS possessive pronoun
VB verb
IE infinitive marker
PP preposition
PN pronoun
NN noun
DT determiner
KN conjunction
RG cardinal number
SN subordinating conjunction
HD relative determiner
JJ adjective
HP relative pronoun
AB adverb
PM proper noun
RO ordinal number
IN interjection
HA relative adverb
PC participle
PL verb particle
UO foreign word


C

SUC Morphological Features

Tag Value Feature

AKT Active Voice

DEF Definite Definiteness

GEN Genitive Case

IND Indefinite Definiteness

INF Infinitive Verb form

IMP Imperative Verb form

KOM Comparative Degree

KON Subjunctive Mood

NEU Neuter Gender

NOM Nominative Case

MAS Masculine Gender

OBJ Object Pronoun form

PLU Plural Number

POS Positive Degree

PRF Perfect Verb form

PRT Preterite Verb form

PRS Present Verb form

SFO S-form Voice

SIN Singular Number

SMS Compound Case

SUB Subject Pronoun form

SUP Supine Verb form

SUV Superlative Degree

UTR Common Gender
