
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Bachelor thesis, 18 ECTS | Cognitive science

2018 | LIU-IDA/KOGVET-G--18/029--SE

Is Simple Wikipedia simple?

A study of readability and guidelines

Fabian Isaksson

Supervisor: Arne Jönsson
Examiner: Henrik Danielsson

(2)

Upphovsrätt (Copyright)

This document is made available on the Internet (or its possible future replacement) for a period of 25 years from the date of publication, barring exceptional circumstances. Access to the document implies permission for anyone to read, download and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. Transfer of the copyright at a later date cannot revoke this permission. All other use of the document requires the consent of the copyright holder. To guarantee authenticity, security and accessibility, solutions of a technical and administrative nature are in place. The moral rights of the author include the right to be mentioned as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or distinctiveness. For additional information about Linköping University Electronic Press, see the publisher's website: http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Fabian Isaksson

(3)

Abstract

Creating easy-to-read text is an issue that has traditionally been solved with manual work. But with advancing research in natural language processing, automatic systems for text simplification are being developed. These systems often need training data that is parallel aligned. For several years, simple Wikipedia has been the main source for this data. In the current study, several readability measures have been tested on a popular simplification corpus. A selection of guidelines from simple Wikipedia has also been operationalized and tested. The results imply that adherence to the guidelines is not greater in simple Wikipedia than in standard Wikipedia. There are, however, differences in the readability measures. The syntactic structures of simple Wikipedia seem to be less complex than those of standard Wikipedia. A continuation of this study would be to examine other readability measures and evaluate the guidelines not covered within the current work.

(4)

Acknowledgments

There are no words that can describe my gratitude for the endless support from my best friend and partner Nora, and for the selfless dedication to my happiness from my mother Elisabeth. Without you this paper would not exist.

(5)

Contents

Abstract

Acknowledgments

Contents

List of Tables

1 Introduction
  1.1 Aim
  1.2 Research questions
  1.3 Delimitations
  1.4 Structure
2 Theory
  2.1 Concerning the authorship of simple Wikipedia
  2.2 The basis for text simplification
    2.2.1 Measuring Readability
  2.3 Material
    2.3.1 Dataset
    2.3.2 Preprocessing tool
    2.3.3 The architecture of spaCy
3 Method
  3.1 Surface features
  3.2 Dependency based Features
  3.3 Operationalization of Simple Wikipedia Guidelines
  3.4 Statistical analysis
4 Results
  4.1 Surface features and word list comparison
  4.2 Dependency based Features
  4.3 Features from Simple Wikipedia Guidelines
5 Discussion
  5.1 Results
    5.1.1 Sentence complexity
    5.1.2 Were the guidelines followed?
  5.2 Method
    5.2.1 Parallelism matters
    5.2.2 Concerning the parser
    5.2.3 Studying the guidelines
6 Conclusion


List of Tables

4.1 Surface features and Word list comparison
4.2 Dependency based features
4.3 Statistics pertaining to simple wiki Guidelines

(8)

1 Introduction

Linguists have been working on ways to concretely assess the readability of text since before the dawn of computer science. Early approaches were aimed at determining the years of formal education needed to understand a text, with the practical motivation of providing children with reading material appropriate to their language abilities. In contemporary readability research, both the motivations and ambitions have expanded. Natural language processing has allowed for more sophisticated ways of measuring text complexity. With statistical machine learning we can now create tools that automatically distinguish easy-to-read language from other language. Using that knowledge to let a computer simplify texts is known as automatic text simplification. Several groups of our population are in need of easy-to-read material, including second language learners, aphasic patients and children. Since manual simplification is a tedious and time consuming task, the interest in automating the process is growing. With data driven machine learning, new systems can automatically learn how to simplify texts. But in order to develop such systems, a large amount of simple language data is needed.

Simple Wikipedia is a large collection of encyclopaedic articles aimed at people learning English. There is research suggesting that these articles are indeed easier to read (Yasseri, Kornai, and Kertész 2012) (Kauchak 2013) (Hwang, Hajishirzi, Ostendorf, and Wu 2015). For several years the Simple English Wikipedia corpus (Kauchak and Coster 2011) has been a primary source of data in automatic text simplification (ATS) (Napoles and Dredze 2010). It has been used as a resource for creating language models, developing classifiers and studying readability. The corpus consists of parallel aligned sentences between the standard part of Wikipedia (standard Wikipedia) and Simple English Wikipedia (simple Wikipedia). In creating parallel aligned corpora from unaligned texts, a lot of text from the source material is left out. The results of readability studies on the whole of standard and simple Wikipedia do not necessarily apply to any subsection of the data. It would be useful to see if the aligned corpus follows the same patterns as have been found in the whole of simple Wikipedia. At this point in time there is also a lack of descriptive information about the readability differences in the corpus. And no studies have been made about whether authors follow the simple Wikipedia guidelines for writing. These points of concern are what will be addressed in this study.



1.1 Aim

The aim of this study is to evaluate the language complexity differences between simple and standard Wikipedia. This study also aims to test the extent to which the guidelines for Simple English Wikipedia have been followed. The analysis will specifically pertain to the corpus prepared by Kauchak (Kauchak and Coster 2011). This means that only a subset of all available articles, the ones that have been sentence aligned with standard Wikipedia, will be the subject of this study. The results found here will hopefully provide insights that can be used in developing new text simplification systems based on the corpus.

1.2 Research questions

There are two research questions that I intend to answer in this thesis:

1. Are there surface level or dependency based features that indicate differences in readability between parallel aligned sentences from simple Wikipedia and standard Wikipedia?

2. Do these parallel aligned sentences follow the guidelines for simple English Wikipedia to the same extent?

1.3 Delimitations

Not all Simple Wikipedia guidelines have been operationalized and tested. Constituency based readability features have not been used.

1.4 Structure

This thesis consists of six chapters. Chapter 2 gives an account of previous research on text simplification and presents the guidelines, how the corpus was created and the NLP tools used in this project. Chapter 3 contains motivations and descriptions of the readability measures and guidelines that have been used to analyse the corpus. In chapter 4, the results of the measures are presented. Chapter 5 is a discussion on how to interpret the results and methodological alternatives. Chapter 6 presents the conclusions drawn from the study and propositions for future research.


2 Theory

Simple English Wikipedia is a collection of simplified encyclopaedic articles available for free online. The database contains 133,641 articles at this time (Wikipedia 2018a), most of which have corresponding standard articles on the same subjects. This is a resource created to share knowledge, and specifically to provide an easy-to-read alternative to standard Wikipedia for children and adults who are learning English. Even though this resource is unique in many ways, addressing the need for simple language is not a novel concept. Before the website Wikipedia launched, guidelines for writing plain English were being developed. These guidelines specify instructions for how to limit a vocabulary, change the syntactic structures of sentences, structure a text in logical order and prioritize the core information needed for comprehension (PLAIN 2011) [1]. Simple English Wikipedia has its own set of guidelines,

many of which are adopted from "A Plain English Handbook" [2]. They describe methods for simplifying text that prompt authors to:

• Use words from the Basic English word list (BE850) (Ogden 1933), or the Voice of America (VOA1500) word list.

• Use active voice instead of passive voice

• Avoid contractions and instead use long forms (for example using "I have" instead of "I’ve")

• Explain difficult words in parentheses
• Use simple sentence structures

• Use a straight word order (subject-verb-object)
• Split longer sentences into multiple sentences

The above list is a compressed version of the actual guidelines. There are more lengthy explanations of these concepts on the website proper, mainly in the form of examples. The ones included here are the ones that are related to plain language and readability. Since this study is focused on text complexity and readability, a number of guidelines will be disregarded completely. For example, how to use links, references and pictures, and how to structure articles, is not discussed within this work. Such guidelines have to do with looks and typography, are specific to Wikipedia, and are not related to plain language.

[1] http://easy-to-read.eu/european-standards/
[2] https://www.sec.gov/pdf/handbook.pdf

2.1 Concerning the authorship of simple Wikipedia

Anyone is allowed to create an account, and anyone with an account may contribute to Simple Wikipedia. In some cases, most often in standard Wikipedia, an account is not required to make changes. For simple Wikipedia, however, a minuscule level of commitment and of conformity with automated filters is required. To prevent vandalism or other unwanted behaviour, there is a threshold to pass before you can make any major changes to a simple Wikipedia article. First, users are required to make 10 minor changes, such as correcting citations or grammar. At that point the user is automatically validated, granted that their account is at least four days old. This validation allows users to move pages and change their content. This includes writing articles from scratch or providing changes and additions. There are also several types of administrative privileges granted to productive and well behaved users. These administrators are chosen by the community.

What this essentially boils down to is: anyone and everyone can change the contents of simple Wikipedia articles. It is therefore reasonable to presume that these articles are written by amateurs with no training in or experience of simplifying texts. In using this data for text simplification research, this is a point that is seldom brought up, probably because there have been few alternative sources of data of similar quality or quantity. Only recently has there been a proposal for a better alternative, the Newsela corpus (Xu, Callison-Burch, and Napoles 2015), containing professionally simplified parallel articles. Still, simple Wikipedia is widely used as a resource for text simplification research and development (Hwang, Hajishirzi, Ostendorf, and Wu 2015).

It should be said that amateur simplifications are not necessarily a bad thing. There is a value in having text data that reflects how people generally tend to simplify texts. Wikipedia Simple and Main are uncorrected (natural) output of the human language generation ability (Yasseri, Kornai, and Kertész 2012). Theoretically, any personal biases and tonalities should be eliminated when looking at the sample as a whole. And any biases that, against this belief, do emerge would be an interesting subject in terms of human language phenomena.

2.2 The basis for text simplification

The topic of text simplification is a research area in computational linguistics. Computational linguistics is an interdisciplinary field that draws from computer science, artificial intelligence, linguistics, cognitive science, statistics and anthropology (Computational Linguistics 2005). The overall aim of the field is to create computational models for linguistic phenomena. This broad definition entails many areas of research. Some computational linguists are concerned with scientifically understanding the construction and usage of natural language, both written and spoken. (Natural language refers to language used by humans as opposed to computerized forms of communication.) However, many of the studies that are discussed in this particular thesis primarily serve the purpose of furthering technological advancements, specifically creating more efficient and high performing systems for text simplification. Text simplification (TS) is a process in which modifications are made to a text in order to make it more accessible to a reader. These modifications are not intended to take away from the most important information and content being relayed in the text. What separates automatic text simplification from regular TS is that a computer is doing the simplifications automatically instead of a person doing it manually.



Generally there are two types of modifications that are made in ATS, syntactic and lexical, although they are not mutually exclusive and often overlap with each other (Siddharthan 2014). In syntactic simplification the grammatical structure of a text is altered somehow. Typical alterations include changing the word order of a sentence, removing certain phrases or words and splitting long sentences into multiple shorter sentences (Rennes 2015). In lexical simplification, semantic content is the subject of interest. Uncommon or ambiguous expressions are transformed into more easily decoded and recognizable terms. Statistical machine translation (SMT) is another approach to TS. It can involve both alterations of syntax and lexical terms at the same time. In this approach simplification is considered a monolingual translation task: sentences are "translated" into simple language. This type of approach usually requires large amounts of parallel data to train a system. This can be seen as an inductive process, where more examples lead to better tuned hypotheses. The alternative to this is using manually created rules for simplification and letting a computer search for and apply transformations. It is an approach that does not require as much data and has been shown to be very effective (Rennes 2015). There are also methods that involve adding explanations, essentially adding more information to support the reader's comprehension.

These systems have several practical virtues relating to the earlier mentioned need for easy-to-read text. In 2010, Aluísio & Gasperin presented a tool for identifying complex text features and helping authors simplify texts in Brazilian Portuguese. The project was motivated by the high amount of illiteracy in Latin American countries. In 2007, a corpus analysis was performed by Petersen & Ostendorf prior to the development of a tool used to help teachers simplify texts (Petersen and Ostendorf 2009). The tool was aimed toward second language learners and was meant to help teachers simplify their writings. A Swedish system based on manually defined rules has been developed to automatically perform simplifications and support writers (Falkenjack, Rennes, Fahlborg, Johansson, and Jönsson 2017). Belder & Moens developed an aid for children learning how to read (Belder and Moens 2010). Automatic text simplification has also been used to help patients with aphasia (Devlin and Tait 1998) and deaf readers (Inui, Fujita, Takahashi, and Iida 2003).

2.2.1 Measuring Readability

Since the dawn of the research field, linguists have been studying how to gauge the readability of text. But with the advent of computers in the mid twentieth century, the interest grew in defining formal metrics for determining text difficulty. The purpose behind such metrics is to describe the amount of cognitive workload, time or level of education that is needed for comprehending the contents of a text. Traditional metrics such as the Gunning Fog Index (Gunning 1952) and the Flesch Reading Ease test (Flesch 1948) are based on counts of sentences and words. The hypothesis is that text containing many long words in a sentence necessitates a higher language proficiency in the reader. These measures are still widely used today but have also received some criticism. Applying text simplifications based on these rules has been shown to produce underwhelming results (Davison and Kantor 1982) (Duffy and Kabance 1982).

From the development of automatic text simplification systems, several new measures have emerged. Liu (2008) proposed calculating the average dependency distance in a sentence, where a higher distance indicates a more complex sentence structure. Falkenjack and Jönsson showed that sentence depth and the amount of verbal roots also could be used as features in a classification task (Falkenjack, Mühlenbock, and Jönsson 2013).



2.3 Material

2.3.1 Dataset

Coster and Kauchak (2011) provide a sentence aligned parallel corpus that is larger and more accurate than many other simple Wikipedia corpora available. Mainly, this dataset is an improvement over its predecessor, the Parallel Wikipedia Simplification (PWKP) corpus prepared by Zhu et al. (2010). It has for many years been a benchmark dataset in text simplification. It has also been used as a baseline in evaluating new sentence alignment systems. In this section, I describe the process of aligning the sentences of this corpus.

First, article pairs were extracted based on title, where each pair consisted of one standard and one simple article. Then the articles were segmented into paragraphs using formatting data available in the webpage HTML documents. After that, each paragraph pair from simple and standard scoring a TF-IDF cosine similarity above 0.5 was aligned. Lastly, sentences from the paragraphs were aligned in a dynamic programming approach. For each sentence in the paragraph, the local alignment score for two sentences i and j was computed as the maximum over a set of potential alignment operations.

$$a(i,j) = \max \begin{cases} a(i, j-1) - \text{skip penalty} \\ a(i-1, j) - \text{skip penalty} \\ a(i-1, j-1) + \text{sim}(i, j) \\ a(i-1, j-2) + \text{sim}(i, j) + \text{sim}(i, j-1) \\ a(i-2, j-1) + \text{sim}(i, j) + \text{sim}(i-1, j) \\ a(i-2, j-2) + \text{sim}(i, j-1) + \text{sim}(i-1, j) \end{cases}$$

In the above set, sim(i, j) is the similarity between the ith standard sentence and the jth simple sentence. The similarity was calculated as cosine similarity on TF-IDF weights (TF-IDF is used to normalize term frequencies). The skip penalty is applied when a sentence is skipped rather than aligned, and aligned pairs with too high a similarity are discarded. The thought behind this is to not fill the corpus with pairs of identical sentences or ones that hardly differ from one another. In text simplification, we want aligned data that is similar in meaning yet different in form. Using that, we can measure the disparity between simple and standard language, and create our models based on those disparities.
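The following is a minimal sketch of the alignment recurrence above, assuming TF-IDF cosine similarity from scikit-learn. The function name, the default skip penalty value and the omission of backtracking are illustrative choices made here, not details taken from Coster and Kauchak's implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def align_paragraph(standard_sents, simple_sents, skip_penalty=0.0001):
    """Dynamic-programming alignment scores following the recurrence above.

    Returns the score matrix a and the similarity matrix sim; backtracking
    (not shown) would recover the actual alignment pairs.
    """
    # TF-IDF cosine similarity between every standard/simple sentence pair.
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(standard_sents + simple_sents)
    n, m = len(standard_sents), len(simple_sents)
    sim = cosine_similarity(tfidf[:n], tfidf[n:])

    a = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = lambda x, y: sim[x - 1, y - 1]  # 1-based helper
            candidates = [
                a[i, j - 1] - skip_penalty,        # skip a simple sentence
                a[i - 1, j] - skip_penalty,        # skip a standard sentence
                a[i - 1, j - 1] + s(i, j),         # one-to-one alignment
            ]
            if j >= 2:
                candidates.append(a[i - 1, j - 2] + s(i, j) + s(i, j - 1))      # one-to-two
            if i >= 2:
                candidates.append(a[i - 2, j - 1] + s(i, j) + s(i - 1, j))      # two-to-one
            if i >= 2 and j >= 2:
                candidates.append(a[i - 2, j - 2] + s(i, j - 1) + s(i - 1, j))  # two-to-two
            a[i, j] = max(candidates)
    return a, sim
```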

Having completed the corpus, they did a small trial in which they treated simplification as an English-to-English machine translation problem. In this approach each standard sentence is viewed as having an optimal translation into the target language, simple. In their test they trained the translation system Moses on their data. Moses is a phrase-based statistical machine translation system, which means it operates on phrases rather than on syntactic dependencies. By comparing the output of their system to a system with an oracle (an imaginary system that uses perfect inputs) they found that future implementations would benefit from better parameter estimation.

Hwang et al. (2015) developed another version of a sentence aligned parallel corpus based on simple Wikipedia, using a different technique for aligning sentences. This corpus has been compared to Kauchak's corpus, and the semantic similarity scores on both word and sentence level proved significantly better for Hwang's corpus. The corpus combines automatic and manual alignment. After automatic alignment, a manual evaluation was made splitting the corpus into levels of human perceived "goodness" of simplifications.

Instead of parsing individual paragraphs at a time, they used a greedy search algorithm over whole articles. By removing aligned sentences they could avoid many-to-one alignments. They were also able to capture alignments across paragraphs. Another reason for the success of this corpus comes from using more parameters than just phrase structures. Hwang used word-level semantic similarity scores which also accounted for syntactic dependencies.



2.3.2 Preprocessing tool

SpaCy is an API for Python containing several state-of-the-art NLP tools. The following features are available at this time in spaCy: neural network models, integrated word vectors, tokenization, part-of-speech tagging, sentence segmentation, dependency parsing and entity recognition. The accuracy of spaCy is on par with most state-of-the-art dependency parsers and it is faster than any other available for free online (Choi, Tetreault, and Stent 2015). The overall accuracy of spaCy has been measured in a study where training and evaluation were made on English text data from OntoNotes 5. The features for the dependency parsing included part-of-speech tags extracted using spaCy's own tagger. This experiment showed an unlabelled attachment score of 91.85 and a labelled attachment score of 89.91 (Honnibal and Johnson 2015).

SpaCy utilizes a deep learning neural network architecture called thinc. Thinc is a multi-task convolutional neural network (CNN). This neural network is developed specifically for use in the spaCy library. In order to identify syntactic features in a text, the network first needs to be trained. In the English language model that spaCy provides, there is a thinc network that is trained on data from OntoNotes (Consortium 2013). This dataset contains annotated text from several different genres (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows). The English part of the OntoNotes 5 corpus consists of a total of 1.4 million words. Using the trained thinc network, the spaCy language model is able to assign POS tags, produce dependency parses and identify named entities.

The thinc network is also used for creating the word vectors applied in spaCy's named entity recognition, lemmatization and word similarity measures. The word vectors in the spaCy language model are GloVe vectors (Pennington, Socher, and Manning 2014) that have been trained on data from Common Crawl, which is a large repository of texts extracted from webpages (Crawl 2018).

2.3.3 The architecture of spaCy

All of the language models for spaCy are available for free on their website. Users can also build their own models by using spaCy itself. The specifics of constructing a spaCy model are not relevant to this project since all text processing has been done using spaCy's ready-made one.

A language model for English is loaded. This essentially means spaCy loading the trained CNN and the set of GloVe word vectors. From the word vectors, spaCy extracts a vocabulary. The vocabulary consists of a string storage, where string representations of words are stored. It also has a lexeme storage, and a store of other morphological information, both general and specific to individual words.

When text input, in the form of a unicode string, is provided to the spaCy pipeline, it is tokenized. Words and symbols are segmented into separate units in a sequence. In the tokenization, spaCy predicts where new sentences begin by identifying punctuation. It also extracts morphological information for each word (such as suffixes, prefixes and infixes). Lexical information, such as language specific expressions ("one hundred" and "twenty" are words for counting, called numbers), is also extracted. The above mentioned information is not derived from the language model itself, but rather from spaCy's so called "base data": data that is contained within the base spaCy library.

In parallel with the tokenization, the text can be passed to one or several of spaCy's other processing modules. In this project the Lemmatizer, the DependencyParser and the Tagger have been used.

PosTagger The POS tagger assigns part-of-speech tags to tokens. It makes predictions based on input from the vocabulary (provided by the language model), such as lemmas and GloVe vectors, and information extracted in the tokenization process, for example word order and suffixes. The input is fed through the language model CNN, which returns a probability distribution over which POS tags are likely to correspond to the word. The most likely tag is chosen and assigned to the word. This includes both a simple and a detailed POS tag; for example the word "is" will be assigned the simple tag "VERB" and the detailed tag "VBZ", since it is a verb in third person singular.

Lemmatizer After all tokens have been processed by the POS tagger they can be fed to the Lemmatizer module. In this module, words are assigned their respective lemma. Based on the language model, spaCy constructs a look-up table for lemmas. In this look-up table each individual word is mapped to a single lemma. There are words that correspond to multiple lexemes, such as "padding", which can be both a verb and a noun. Because of this the look-up table is relative to the POS tag of a word. In the lemmatizer module the POS tag for the token is used along with the token's string representation to identify the correct entry in the look-up table. From that entry the token is assigned its lemma.

Here follows a short explanation of what lexemes and lemmas are. A lexeme is a collection of word forms that correspond to a single lexical concept, i.e. drive, drove, driving, driven. There is a subset of lexemes called lemmas. Lemmas are single words that correspond to a certain form of a lexeme. The lemmas used are chosen by convention to represent a word. For the lexeme [drive, drove, driving, driven] the lemma is [drive]. Using lemmas we can extract from a text how many times a lexical concept is used, regardless of what circumstantial morphological features are present. Let's say we want to count the number of times "driving" is used in a text as opposed to "traveling". We wouldn't necessarily care if the word is in past or present tense. So we look up the word's lemma and add that to the count instead.

Dependency Parser In order to assign syntactic relationships between different tokens, a dependency parser is used. It outputs a dependency tree for each sentence in the data. In a dependency tree each token is assigned a dependency relation to another token. Such a relation consists of a dependent word, a head word and a dependency type which describes the syntactic connection between the words. For example, in the sentence "I eat cake" there are three dependency relations: "I" is the nominal subject and is a dependent of "eat", "eat" is the predicate verb and the root word in the sentence, and "cake" is the direct object in the phrase and is a dependent of "eat".

[Dependency diagram for "I eat cake": "eat" is the root, with "I" attached as nsubj and "cake" attached as dobj.]

In spaCy, the dependency parser works much in the same way as the POS tagger does. They both operate on the thinc CNN. All parameters used in the POS tagger are also used in the dependency parser. Since thinc is a multi-task CNN it is able to output predictions both for dependencies and for POS tags. Important to note here is that the dependency parser also uses POS tags as input if they are available. Using POS tags as well as morphological information and word vectors greatly increases the accuracy of the dependency parser (Honnibal and Johnson 2015).
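As an illustration of the pipeline described in this section, the snippet below parses the example sentence with spaCy and prints the attributes discussed above. The model package name en_core_web_sm is an assumption; the thesis does not state which packaged model was used.

```python
import spacy

# Load a pretrained English pipeline (tokenizer, tagger, parser, NER).
nlp = spacy.load("en_core_web_sm")

doc = nlp("I eat cake")
for token in doc:
    print(
        token.text,       # surface form
        token.lemma_,     # lemma from the look-up table
        token.pos_,       # coarse POS tag, e.g. VERB
        token.tag_,       # detailed POS tag, e.g. VBZ
        token.dep_,       # dependency label, e.g. nsubj, ROOT, dobj
        token.head.text,  # the governor of this token
    )
```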


3 Method

This chapter contains descriptions of the measures used to quantify text complexity and adherence to the guidelines. The data was used to analyse the differences and similarities between the aligned standard and simple sentences.

All measures were implemented in Python by the author. Tokenization, lemmatization, part-of-speech tagging and dependency parsing were done using the spaCy Python API, which is a multi purpose language processing tool. SpaCy provides an English language model trained on a large collection of text from different genres. More in-depth explanations of the simple Wikipedia corpus and the spaCy API can be found in the Theory chapter. Since the data was already split into sentences, instead of letting spaCy infer the beginnings and ends of sentences, this was done manually: each token was tagged as either the beginning of a sentence or not. This proved a good approach since spaCy otherwise predicted incorrect sentence boundaries. Each sentence was parsed individually.
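A sketch of how such manual sentence boundaries could be enforced, using the spaCy 2.x-style custom-component API that matches the time frame of this project. The component and model names are illustrative; the thesis does not show its actual implementation.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def fixed_sentence_boundaries(doc):
    # The corpus is already one sentence per line, so only the first
    # token may start a sentence; the parser must not add new breaks.
    for i, token in enumerate(doc):
        token.is_sent_start = (i == 0)
    return doc

# Run before the dependency parser so the parser respects these boundaries.
nlp.add_pipe(fixed_sentence_boundaries, before="parser")

doc = nlp("The cat sat on the mat.")
assert len(list(doc.sents)) == 1
```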

The study of the readability and following of guidelines in simple Wikipedia articles was done on the data prepared by Coster and Kauchak (Coster and Kauchak 2011). This data contains 167,688 parallel aligned sentences between standard and simple Wikipedia. This data is publicly available (Kauchak 2011).

In the first two sections there are descriptions of the measures used to examine the readability of simple Wikipedia. The third section is a presentation of how a selection of guidelines for simple Wikipedia were operationalized and tested. In the last section, the statistical analysis method is presented.

3.1 Surface features

Surface features are features that appear on the surface of a text, such as characters, words, sentences and syllables. These are often synonymously called shallow features. They are shallow in comparison to other, more sophisticated features such as part-of-speech tags. Most traditional measures of text readability are based on these components alone (Falkenjack, Mühlenbock, and Jönsson 2013). To extract these features, a text is segmented into its components; the process is called tokenization. Tokenizing a text is often a necessary first step in doing any data driven analysis and can yield some interesting results on its own.

In this study, three types of measures derived from surface features are used. These features are also related to the Wikipedia guidelines. In each section below I explain these measures, describe how they are calculated and justify their usage.

Word and sentence averages

Average sentence length and average word length have been used as features in several studies evaluating text complexity (Feng 2010) (Aluísio and Gasperin 2010) (Falkenjack, Mühlenbock, and Jönsson 2013). Longer sentences and words usually indicate a higher complexity in a text. In a statistical analysis of all simple Wikipedia articles and their standard counterparts, the usage of fewer complex words and shorter sentences was found to be the main reason for improved readability (Yasseri, Kornai, and Kertész 2012). Since the current study is done on Kauchak's corpus specifically, I wanted to see if these findings would hold for the aligned sentence dataset.

Type-token ratios

The type-token ratio is a measure that describes the diversity of word usage in a text, also known as the lexical richness. The measure was first proposed in 1957 by Mildred C. Templin (1957). In a text, the lexical types are all the unique words used. In other words, the vocabulary of a text is the collection of all lexical types. The type-token ratio, in its simplest form, is calculated by dividing the number of types, unique words, by the number of tokens, all words (Torruella and Capsadab 2013). In certain genres of text this measure can indicate the level of text complexity. A higher type-token ratio indicates a higher lexical richness. In a study in which classifiers were trained to distinguish simple Wikipedia articles from standard Wikipedia articles, the type-token ratio was used as a parameter in training (Napoles and Dredze 2010).

Another, more uncommon measure is the lemma-token ratio. Instead of using types, which include morphological variations of a single word sense, we can use lemmas. In finding out the lexical richness of a text, we are not always interested in how the word is contextually used. Texts that use multiple word senses to describe a similar activity, for example drive/transport/relocate/move, are arguably more lexically rich than texts using several variations of the same word: drive/drove/driving (McCarthy and Schmitt 1990).

The Gunning Fog Index

The Gunning Fog Index is a traditional readability measure. The measure is calculated based on the number of words, complex words (words that contain more than three syllables), and sentences in a text. Robert Gunning, who introduced this measure, was researching children's reading skills. The purpose of the measure was originally to gauge how many years of formal education were required in order to comprehend a text without difficulty (Gunning 1952). It has since been widely used as a general measure of readability and a baseline for developing new readability measures.

$$\text{GunningFogIndex}(\text{text}) = 0.4 \left( \frac{\#\text{words}}{\#\text{sentences}} + 100 \cdot \frac{\#\text{complex words}}{\#\text{words}} \right)$$
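A small sketch of the surface measures described in this section. The syllable counter is a naive vowel-group heuristic added here for illustration and is not claimed to be the method used in the thesis.

```python
import re

def count_syllables(word):
    # Naive heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def surface_features(sentences):
    """sentences: list of lists of word tokens (one inner list per sentence)."""
    words = [w for sent in sentences for w in sent if w.isalpha()]
    complex_words = [w for w in words if count_syllables(w) > 3]
    types = set(w.lower() for w in words)
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(types) / len(words),
        "gunning_fog": 0.4 * (len(words) / len(sentences)
                              + 100 * len(complex_words) / len(words)),
    }

print(surface_features([["The", "cat", "sat", "on", "the", "mat"],
                        ["Incomprehensibility", "notwithstanding", "it", "slept"]]))
```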

3.2 Dependency based Features

While the surface features can tell us about the lexical properties of a text, they do not provide any information about syntactic structures. The lengths of words and sentences are important for measuring readability, but the relationships between these words and the sentence structure as a whole give us a deeper understanding of what makes language comprehensible. In a lot of ways, meaning and information are conveyed not through individual words, but through their interaction with other words. These interactions are in linguistics referred to as syntax. In natural language processing, dependency based features help us understand the syntactic complexity of a text. In order to compare the syntactic complexity of sentences between datasets, I have used three measures which are described below. These measures are also related to the Wikipedia guidelines.



Average dependency distance

A dependency distance is the distance between a dependent word and its governor (Liu 2008). It is the absolute value obtained from subtracting the dependent's numerical position in a sentence from the governor's numerical position. The average of this value in a dataset is the sum of all dependency distances divided by the number of sentences. A longer average dependency distance can be indicative of a complex sentence structure (Liu 2008) (Falkenjack, Mühlenbock, and Jönsson 2013).

Average sentence depth

Higher sentence depth has been found to indicate more complex sentences (Liu 2008). Since I have not been able to find an explanation of how to calculate this measure in the literature, I have constructed what I believe to be a passable definition of the depth of a sentence. I have then written an algorithm that extracts sentence depths based on that definition. More or less, this is just a conventional way of defining the average depth of a tree (Knuth 1968).

The depth of a sentence dependency tree is calculated as the sum of all individual token depths divided by the number of tokens in the tree; in other words, it is the average token depth. The depth of a token is the number of ancestors it has on the path up to and including the root. If a token's only ancestor is the root, the token depth is 1. The root token is considered to have a depth of zero and is left out in calculating the tree depth. The average depth of a sentence is thus the sum of all token depths within the sentence (except the root) divided by the number of tokens in the sentence (except the root).

The average sentence depth is the sum of all sentence depths divided by the number of sentences.

Ratio verbal roots

The ratio of verbal roots is the number of sentences in which the root word is a verb, out of all sentences. This measure has been used to train readability classifiers (Dell'Orletta, Montemagni, and Venturi 2011) (Falkenjack, Mühlenbock, and Jönsson 2013), the implication being that sentences with verbal roots are easier to read.
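The three dependency based measures could be computed with spaCy roughly as sketched below. Averaging per dependency within each sentence and then over sentences is my reading of the definitions above (it matches the magnitudes reported in table 4.2); the thesis's own implementation is not shown.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def dependency_features(sentences):
    """sentences: list of raw sentence strings (already split, one per item)."""
    distances, depths, verbal_roots = [], [], 0
    for sent in sentences:
        doc = nlp(sent)
        non_root = [t for t in doc if t.head != t]
        # Dependency distance: |position of dependent - position of governor|.
        distances.append(sum(abs(t.i - t.head.i) for t in non_root) / max(1, len(non_root)))
        # Token depth: number of ancestors up to and including the root (root = 0).
        depths.append(sum(len(list(t.ancestors)) for t in non_root) / max(1, len(non_root)))
        # Verbal root: the sentence root is tagged as a verb.
        root = [t for t in doc if t.head == t][0]
        verbal_roots += root.pos_ == "VERB"
    return {
        "avg_dependency_distance": sum(distances) / len(distances),
        "avg_sentence_depth": sum(depths) / len(depths),
        "ratio_verbal_roots": verbal_roots / len(sentences),
    }

print(dependency_features(["I eat cake", "The cat on the mat slept soundly."]))
```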

3.3 Operationalization of Simple Wikipedia Guidelines

Until now there have been no efforts to research whether the guidelines for simple Wikipedia are followed. In this study, I have operationalized a selection of the guidelines. While a lot of them intersect with usual readability measures, others do not. Writers are encouraged to use short words and sentences and to limit their vocabulary. They are also prompted to simplify their sentence structures. These guidelines are covered by the surface features and dependency based features described above.

The question at hand is: are the guidelines followed to the same extent between sentences in simple and standard Wikipedia? To answer this question, the following measures have been used to compare the datasets.

Contractions

In the guidelines for simple Wikipedia, it is stated that the usage of contractions is disapproved of: "(Do not) use contractions (such as I've, can't, hasn't). Instead, do use long forms as this allows learners to recognize familiar grammatical patterns." To measure the usage of contractions, a list of common English contractions and their respective long forms was collected from Wikipedia itself (Wikipedia 2018b). Words matching either contractions or long forms were counted.



Verb tenses

The preferred verb forms for simple Wikipedia are (simple) present, past and future: "(Use) verb in past, present or future only". The occurrences of these verb forms were counted using POS tags. A verb in simple future consists of a base form verb following the auxiliary verb "will" or "shall".
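A sketch of the tense counting, assuming Penn Treebank tags as produced by spaCy. The mapping from tags to "simple" tenses is an illustrative interpretation of the guideline rather than the thesis's exact rule set.

```python
def count_simple_tenses(doc):
    """Count verbs in simple past, simple present and simple future.

    doc: a parsed spaCy Doc.
    """
    counts = {"past": 0, "present": 0, "future": 0, "other": 0}
    for token in doc:
        if token.pos_ != "VERB":
            continue
        prev = doc[token.i - 1] if token.i > 0 else None
        if token.tag_ == "VB" and prev is not None and prev.lower_ in ("will", "shall"):
            counts["future"] += 1        # "will/shall" + base form
        elif token.tag_ == "VBD":
            counts["past"] += 1          # simple past
        elif token.tag_ in ("VBP", "VBZ"):
            counts["present"] += 1       # simple present
        else:
            counts["other"] += 1
    return counts
```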

Parentheses

Writers of simple Wikipedia articles are encouraged to explain words they believe the reader might not understand in parentheses. I counted both the number of parentheses used and a category of parentheses here called explanatory parentheses. The criteria for an explanatory parenthesis are as follows:
1. It must contain a lemma mentioned earlier in the sentence outside the parentheses.
2. The lemma must be a content word, being either a noun, verb, adjective or adverb.

The explanatory parentheses as described here is a new measure, and will be discussed later in the thesis.
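A sketch of the two criteria, assuming a parsed spaCy sentence. The content-word tag set and the handling of nested parentheses are simplifications made here for illustration.

```python
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def has_explanatory_parenthesis(doc):
    """True if the sentence contains a parenthesis holding a content-word
    lemma that also occurs earlier in the sentence, outside the parentheses."""
    open_idx = None
    for token in doc:
        if token.text == "(":
            open_idx = token.i
        elif token.text == ")" and open_idx is not None:
            outside = {t.lemma_ for t in doc[:open_idx] if t.pos_ in CONTENT_POS}
            inside = {t.lemma_ for t in doc[open_idx + 1:token.i] if t.pos_ in CONTENT_POS}
            if outside & inside:
                return True
            open_idx = None
    return False
```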

Passive phrases

A phrase is in passive voice if the subject is being acted upon by an agent. In active voice, the subject itself is the actor. For simple Wikipedia, the use of passive phrases is discouraged; instead, writers are prompted to use active voice. This was measured by counting phrases containing passive nouns. In the guidelines, authors are explicitly asked to take phrases in passive voice from a standard sentence and turn them into active voice. Therefore, the number of phrases changed to active voice was also counted. A passive phrase in a standard sentence was considered to be changed if it had a matching phrase in active voice in the aligned simple sentence. Phrases were matched if the subject had the same lemma between sentences.
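The thesis counts "phrases containing passive nouns"; the sketch below approximates this with spaCy's passive-subject dependency labels (nsubjpass, csubjpass), which is my reading of that description rather than a confirmed reproduction of the method.

```python
def passive_subject_lemmas(doc):
    """Lemmas of passive subjects in a parsed spaCy sentence."""
    return {t.lemma_ for t in doc if t.dep_ in ("nsubjpass", "csubjpass")}

def changed_to_active(standard_doc, simple_doc):
    """A standard passive phrase counts as changed if a subject with the
    same lemma appears as an active subject in the aligned simple sentence."""
    passive = passive_subject_lemmas(standard_doc)
    active = {t.lemma_ for t in simple_doc if t.dep_ == "nsubj"}
    return bool(passive & active)
```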

Word list usage

The most mentioned and most explicit instruction in the guidelines is the usage of simple words. Simple words are words that are considered to be core terms within the English language. The first incarnation of a simple English vocabulary was created by Charles K. Ogden (Ogden 1933). He proposed a set of 850 words and some strict grammatical rules for their usage. According to Ogden, the words in his very limited vocabulary could be used to express almost any concept of standard English through extensive paraphrasing. This word list will be referred to as BE850.

Voice of America is the U.S. government's official international radio broadcasting service. Their broadcasts contain domestic news, intended for listeners outside of the United States. In order to promote comprehension, they have constructed a vocabulary of 1500 words. This vocabulary is part of their internal guidelines for writing manuscripts for broadcasts. This word list will be referred to as VOA1500.

The simple Wikipedia guidelines state that authors should, to the best of their ability, limit their vocabulary to that of BE850 and VOA1500. In order to measure the following of this guideline, I conducted a search for these words in the datasets. To capture all word sense occurrences, lemmatized tokens were used.
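A sketch of the word list comparison. The word lists are assumed to be available as plain text files of lower-case lemmas; the file name below is a placeholder.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def word_list_coverage(docs, word_list):
    """Share of tokens whose lemma occurs in a word list (e.g. BE850 or VOA1500)."""
    tokens = [t for doc in docs for t in doc if t.is_alpha]
    hits = sum(1 for t in tokens if t.lemma_.lower() in word_list)
    return hits / len(tokens)

# "be850.txt" is a placeholder name for a file with one lemma per line.
with open("be850.txt") as f:
    be850 = {line.strip().lower() for line in f}

docs = [nlp("He drove the car to the shop.")]
print(word_list_coverage(docs, be850))
```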



3.4 Statistical analysis

To compare the samples (simple and standard Wikipedia), Wilcoxon signed rank tests were used. This was done to show whether the differences between samples, in the values for readability measures and guideline measures, represented random variation in a single population or actual differences between populations, the populations in this case being texts from the respective parts of the corpus. The null hypothesis is that there are no differences, and the alternative that there are.

Assumptions for the Wilcoxon signed rank test were checked before testing. First, variables need to be on a continuous level. Some of the variables (measures) were binomially distributed, and therefore a continuity correction was applied in those cases. None of the measures were normally distributed, but this does not violate the assumptions of the test. In the test, a normal distribution is artificially approximated. However, this still requires variables to be symmetrically distributed. It was found that all variables had a symmetric distribution between samples (except for the usage of simple verb tenses). This was checked by first extracting the difference between all individual datapoints and then, using boxplot diagrams, identifying whether interquartile ranges and whiskers were similarly distributed around a centred median.

Lastly, variables need to be dependent. This cannot be fully guaranteed as true within this study. We don't know the circumstances under which the simple Wikipedia articles were written. Did the authors use standard articles as precursors and apply modifications? Or did they write them without looking at the corresponding standard article? The simple Wikipedia guidelines seem to suggest that using a precursor is a conventional method for writing (Wikipedia 2018a). Therefore this assumption was made. Accordingly, the variables (measures and guidelines) are treated in the statistical tests as dependent variables, having a pre-condition (standard) and post-condition (simple) relationship.
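A minimal sketch of the test with SciPy. The numbers below are made up, and correction=True mirrors the continuity correction mentioned above.

```python
from scipy.stats import wilcoxon

# Paired values of one measure (e.g. sentence length) for aligned
# standard/simple sentence pairs; the numbers here are illustrative only.
standard = [26, 31, 18, 22, 40, 15, 27]
simple   = [21, 25, 18, 19, 33, 16, 22]

# Wilcoxon signed rank test on the paired differences.
stat, p_value = wilcoxon(standard, simple, correction=True)
print(stat, p_value)
```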


4 Results

In this chapter, the results from the comparative evaluation of readability and adherence to guidelines between the aligned standard and simple sentences are presented. Table 4.1 contains information on surface features available after tokenization, along with word list comparisons. Table 4.2 presents three readability measures based on syntactic dependencies. In table 4.3, the results for features based on Wikipedia guidelines are listed.

4.1 Surface features and word list comparison

A Wilcoxon signed rank test indicated that sentence length was significantly higher in standard sentences (Mdn=23) than in simple sentences (Mdn=21), Z=55.69, p<0.001, r=0.14. Also, words were found to be significantly longer on average in standard sentences (Mdn=4.36) compared to simple sentences (Mdn=4.24), Z=58.0, p<0.001, r=0.14.

VOA1500 words occurred far more frequently than BE850 words overall (table 4.1). There were also far more types from VOA1500 in the data than there were BE850 types (table 4.1). For both simple and standard, approximately half of all tokens were words from the VOA1500 word list (table 4.1). Determiners, connectives, particles and auxiliary verbs, such as ‘the’, ‘as’, ‘a’ and ‘be’, accounted for a vast majority of the matching tokens. The words in the word lists are in lemma form. Therefore, in measuring the occurrences of word list words, all tokens were lemmatized. The words ‘and’ and ‘as’, accounted for approximately 70% of all the tokens matching with the BE850 in both standard and simple.

The difference in occurrences of VOA1500 words between simple and standard sentences was indicated to be random in a Wilcoxon signed rank test, Z=12.56, p>0.9 (Mdn=0 in both samples). The differences in occurrences of BE850 words were also indicated to be insignificant, Z=5.61, p>0.9 (Mdn=0 in both samples).

The data sets differed in the number of tokens and the number of lemmas, with the simple data set having fewer in both regards (table 4.1). However, the type-token and lemma-token ratios were higher in simple than in standard (table 4.1).

The Gunning Fog Index was higher for the standard sentences, which indicates a more complex language usage (table 4.1).



Table 4.1: Surface features and Word list comparison

Feature                                   Standard             Simple
Tokens                                    4,381,873            3,915,071
Types                                     130,381              118,609
Lemmas                                    120,753              109,943
Lemma-token ratio                         0.027                0.028
Type-token ratio                          0.029                0.030
Sentences                                 167,688              167,688
Average sentence length                   26 tokens            23 tokens
Average word length                       4.81                 4.68
BE850 types                               73                   73
BE850 tokens (percent of all tokens)      184,726 (4.2%)       155,929 (3.9%)
VOA1500 types                             1500                 1499
VOA1500 tokens (percent of all tokens)    2,153,160 (49.1%)    1,981,632 (50.6%)
Gunning fog index                         12.92                11.44

4.2 Dependency based Features

A Wilcoxon signed rank test indicated that the average dependency distance was significantly higher in standard sentences (Mdn=3) than in simple sentences (Mdn=2), Z=34.48, p<0.001, r=0.08. Average sentence depth was also significantly higher in standard sentences (Mdn=2.83) than in simple sentences (Mdn=2.64), Z=59.93, p<0.001, r=0.15. This can be seen in table 4.2. There was a higher percentage of sentences with verbal roots in the simple dataset (Mdn=1) than in the standard dataset (Mdn=1) (table 4.2). However, a Wilcoxon signed rank test indicated that this difference was insignificant, Z=-1.68, p>0.09.

Table 4.2: Dependency based features

Feature                                   Standard    Simple
Average dependency distance               3.57        3.43
Average sentence depth                    2.97        2.78
Percent of sentences with verbal roots



4.3 Features from Simple Wikipedia Guidelines

There was no notable difference between the datasets in the usage of contractions (table 4.3). However, the long forms of commonly contracted word pairs (such as "I have" in place of the contracted "I've") were more commonly used in the simple dataset.

A Wilcoxon signed rank test indicated that there was no significant difference in the number of explanatory parenthesis occurrences between standard sentences (Mdn=0) and simple sentences (Mdn=0), Z=3.6, p>0.9. The numbers of parentheses and explanatory parentheses are displayed in table 4.3.

There were more phrases in passive voice in the standard sentences (Mdn=0) than in the simple sentences (Mdn=0) (table 4.3). However, a Wilcoxon signed rank test indicated that this difference was non-significant, Z=4.86, p>0.9. Out of all passive voice phrases in the standard sentences, 32% were changed to active voice in the aligned simple sentences (table 4.3).

The usage of verb tenses in simple past, present and future was slightly higher in the simple dataset than in the standard dataset (table 4.3). Since this data was not symmetrical in its distribution, no statistical test was performed.

Table 4.3: Statistics pertaining to simple wiki Guidelines

Feature                                                       Standard          Simple
Contractions                                                  14                13
Long forms                                                    27,244            34,078
All parentheses                                               33,103            30,116
Explanatory parentheses                                       5,704             4,497
Regular-to-explanatory parentheses ratio                      0.17              0.15
Passive phrases                                               53,039            51,883
Percent of sentences containing a passive phrase              26.8%             20.0%
Percent of phrases changed from passive voice in
  standard to active voice in simple                          NA                32.4%
Simple verb tense percentages (out of all verbs)              future: 5.2%      future: 5.4%
                                                              present: 11.1%    present: 11.6%
                                                              past: 24.7%       past: 26.0%
                                                              other: 59.0%      other: 57.0%


5 Discussion

This chapter contains a discussion on the results of this study and the limitations of the methods used.

5.1 Results

Looking at the results, we can confirm the findings of Yasseri et al. (2012) (Yasseri, Kornai, and Kertész 2012). In that study a larger amount of data from Wikipedia was used, though not all of it was aligned. Yasseri found that the main reason simple Wikipedia was easier to read was shorter sentence length and fewer complex words. This was reflected by the Gunning Fog Index. In this study we can see the same tendency in a parallel sentence aligned corpus: the Gunning Fog Index was lower for the simple data than for the standard data.

It should however be said that this measure has received some criticism. Gunning himself stated that there are problems with the index (Gunning 1952). He points out that the Gunning Fog Index should be interpreted as an indication of a sentence being "needlessly complex", and not as an absolute truth. In other words, the idea of having an index that shows how many years of formal education are needed to comprehend a text has since been retracted. Simplifying text by shortening sentences has also been objected to (Davison and Kantor 1982), the main argument being that it does not promote comprehension in a meaningful way.

Still, since there were significant differences between the datasets in regard to surface features, I suggest such features would be useful in training an ATS system on the corpus prepared by Kauchak (Kauchak and Coster 2011).

5.1.1 Sentence complexity

I found that the average dependency distance and sentence depth were significantly lower for simple sentences than for standard sentences. It is hard to tell whether the difference between the two datasets should be considered large or small, because there is a lack of comparison data from other corpora, and a non-parametric test was applied that shows no effect size. But this result indicates a difference in language complexity between the standard and simple sentences. This measure could be a useful parameter to use in training an ATS system on the corpus examined here. There are a few studies that have used average dependency distance for monolingual comparison (Falkenjack, Mühlenbock, and Jönsson 2013) (Liu 2008) (Chandrasekar, Doran, and Srinivas 1996). In these studies, the feature has been used as an input parameter in an automatic text simplifier or classifier. In these cases, the actual differences in dependency distances are not presented. Liu (2008) calculated the average sentence length and average dependency distance in an English language dataset based on news texts. They found that the average dependency distance was 2.54. My results show that simple Wikipedia is well above that value, at 3.43. Does this mean that the sentences of simple Wikipedia are unusually complex? To answer this we need to take into consideration the theoretical underpinnings of the measure. The distance between dependents is relevant in terms of the cognitive workload needed to temporarily store information in memory. In order to comprehend a text, understanding the relationships between words is paramount. Here, the assumption is made that workload increases with dependency distance, which makes interpreting word relationships more cognitively demanding. So, in order to draw the conclusion that sentences in simple Wikipedia are hard to read, we would really need empirical evidence from human perception. What is shown in this thesis is that simple Wikipedia has a higher average sentence complexity than what Liu (2008) found. One possible explanation for this result is that it has to do with genre. Encyclopaedic articles usually aim to describe factual events, phenomena, historical characters, scientific discoveries as well as common concepts. Intuitively, these can be considered more complex subjects than the average news story would tackle.

Something that can be noted with respect to genre is the internal validity of average dependency distance as a measure. Regardless of language and genre, dependency distance describes the same relationship between words (Liu 2008). This consistency in what the measure describes is arguably a sign of internal validity.

5.1.2 Were the guidelines followed?

Part of the aim of this study was to examine whether the Simple Wikipedia guidelines were followed to the same extent in both datasets. I found that there were no significant differences between the datasets in regard to any of the operationalized guidelines. Below is an analysis of the results, along with some discussion about the usefulness of the examined guidelines with respect to readability and potential applications in text simplification systems.

The simple Wikipedia guidelines insist upon authors using mainly words from BE850 and VOA1500. In terms of text complexity, the guideline corresponds to a type of lexical simplification. The intention is to limit the vocabulary to a few words that are easily comprehended regardless of reading level. There is some research on this method's practical effect on reader text comprehension (Duffy and Kabance 1982). It was found that word list simplification only produces noticeable reading improvements for low level readers. But this is enough to motivate the inclusion of this guideline, since low level readers are the target group of simple Wikipedia. The occurrences of these word-list words were checked using a lemmatized vocabulary and compared between the datasets. The hypothesis was that word list words would occur more often in the simple dataset, which would indicate that the guideline had been followed. The results showed that BE850 words stood for only 3.9% of all tokens in the simple data. This ratio was actually higher in the standard data, at 4.2%, although the difference was statistically insignificant. Out of the 850 word types in BE850, only 73 were found (the same 73 types in both datasets). Furthermore, the majority of these types belong to a category of BE called "operations", which includes function words rather than content words. In other words, the findings on BE850 words within the corpus seem to contradict what is said in the guidelines.

The word types in the VOA1500 word list accounted for approximately half of all tokens in both datasets. While the most frequently occurring word types were functional in nature, every single word type could be found in the standard set (and 1499 in the simple set). But since there was no significant difference in word-list occurrences between the datasets, we can conclude that this guideline seems to have been disregarded by the authors. Authors of standard Wikipedia articles use word-list words to roughly the same extent as authors of simple Wikipedia articles.

I did not find any differences between the datasets when it came to contractions; contractions generally did not occur. This can be interpreted as a genre-specific tendency: authors might automatically recognise that contractions are not fitting when writing encyclopaedic articles. The reasoning behind this guideline is that readers would be more familiar with the long forms than with the contractions. There are no studies showing whether this is the case or not, and my results do not argue for or against it. The interpretation of this result is that the guideline concerning contractions is largely redundant, being followed to the same extent in both datasets.

There were no significant differences in the use of explanatory parentheses between standard and simple sentences. This contradicts the guidelines, although it is hard to say whether it has any impact on readability. Even though a reader can benefit from having a word or concept explained to them, this can also be done in other ways; writing an explanation of a word in a separate sentence before using it elsewhere could be a more fitting option. We can say that this specific guideline on using parentheses was not followed in simple Wikipedia, but we do not know what other types of explanations were made.
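
One simple way to operationalize this guideline is to count parenthesised spans per sentence, as sketched below. This is only a rough proxy, since not every parenthesis contains an explanation, and it is an assumption rather than the exact rule applied in this study.

    import re

    # A rough proxy for explanatory parentheses: count parenthesised spans.
    # This heuristic is an assumption, not the exact rule used in the thesis.
    PAREN = re.compile(r"\([^()]*\)")

    def parenthesis_count(sentence):
        return len(PAREN.findall(sentence))

    print(parenthesis_count("NLP (natural language processing) is a field of study."))  # 1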

No significant differences in the use of passive phrases were found between the datasets. This implies that the corresponding guideline was not followed to a greater extent in either of them.

The rationale behind this guideline comes from "A Plain English Handbook"1, where it is suggested that passive phrases take longer for readers to process cognitively than active phrases, the idea being that active voice is more in line with how human beings think and process information. This might seem a bit far-fetched, but it is perhaps a common notion. My findings suggest that authors of simple Wikipedia seem to prefer using active voice. This does not necessarily mean that active phrases are actually easier to read, but perhaps that we believe them to be when purposefully writing simple language. There are previous automatic simplification systems in which rules based on passive phrases have produced good results (Rennes 2015). It does, however, not seem that this could be used as a feature in training a system on the corpus used in this study.
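
A common way to detect passive phrases with a dependency parser, sketched below, is to look for the passive-specific labels that spaCy's English models emit (nsubjpass, auxpass). Whether this matches the exact extraction rule used in this study is an assumption.

    import spacy

    # A hedged sketch: a sentence is flagged as containing a passive phrase
    # if the parser assigns a passive nominal subject (nsubjpass) or a
    # passive auxiliary (auxpass) to any token.
    nlp = spacy.load("en_core_web_sm")

    def has_passive(sentence):
        doc = nlp(sentence)
        return any(tok.dep_ in ("nsubjpass", "auxpass") for tok in doc)

    print(has_passive("The dog was washed yesterday."))      # True
    print(has_passive("Someone washed the dog yesterday."))  # False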

In my findings I discovered that 32% of the active voice phrases from the standard sentences were kept but changed into passive voice in the simple sentences. This suggests that people writing simple Wikipedia are using existing standard articles and translating them; if people were writing from scratch, the number of such kept phrases would probably be lower. In the simple sentences, simple verb forms were slightly more common overall compared to the standard data. But since no significance test could be used to check this hypothesis, it should not be interpreted as a necessarily meaningful difference.

5.2 Method

There are two main ways in which this study could have been done differently. Firstly, a classifier could have been trained on the simple Wikipedia data using the readability measures I extracted as parameters. The operationalized guidelines could also have been used as input to the classifier. Through different configurations of features used in training, I could have compared the performance of different versions of the classifier. The version that performed best in accuracy, precision and recall could have been interpreted as having the best training features. This approach would have been an improvement over mine: not only would we learn the differences between the datasets, but also the relevance of these differences in a classification task. In text simplification, development and testing of new systems is more or less the main goal for researchers. However, I did not go that far in this project. The conclusion that can be drawn from this project as it stands is that using the readability measures I tested would be beneficial in such a system.

Secondly, I could have done an experiment using human participants. The aim of this study was to examine differences in text complexity, but the text complexity measures I used do not provide any information about how readable a text actually is to a human. A more interesting study would contain a comparison between the readability metrics and human readability perception. Such a thesis would try to answer the question of the external validity of the formal measures. For example, I could have tested whether there is a correlation between the average dependency distance of a sentence and a participant's experience or behaviour when reading it. Human readability perception has been tested both through eye-tracking methods and through interviews or questionnaires. This would answer whether the measure really does reflect cognitive workload. Still, that would serve a different purpose than the task presented within this work. The focus here lies on the language complexity of simple Wikipedia, which has been successfully studied, although in a limited manner.

A limitation in the current study is the inability to use parametric tests. The values of the measures were binomially distributed in many cases, and the measures that were continuous had skewed distributions (although they were symmetrical). The data therefore fails to meet the assumptions for parametric tests. The main drawback is that no effect sizes could be computed with the non-parametric tests used. I found certain differences between samples, but the magnitude of these differences could not be analysed as part of testing the hypotheses, so there is no telling here how big the difference in dependency distances was, for example. This limits the granularity of the results presented and would be a good opportunity for further study. Although the statistical analysis performed is limited in scope and detail, the findings of this study should not be discarded or dismissed. There is still value for future researchers in the descriptives I have provided here. And since my findings overlap with what other researchers have found, that could be said to strengthen the claims made in the results discussion and in my conclusions.
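
For illustration, a rank-based test such as the Mann-Whitney U test can compare two samples without assuming normality. The sketch below uses hypothetical toy values for per-sentence average dependency distance; whether this particular test matches the one applied in the thesis is an assumption.

    from scipy.stats import mannwhitneyu

    # Hypothetical per-sentence average dependency distances, for illustration only.
    simple_dists = [2.1, 2.8, 3.0, 3.4, 2.6, 2.9]
    standard_dists = [3.2, 3.9, 3.5, 4.1, 3.0, 3.7]

    # A two-sided rank-based comparison that does not assume normal distributions.
    stat, p = mannwhitneyu(simple_dists, standard_dists, alternative="two-sided")
    print(stat, p)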

5.2.1 Parallelism matters

There are some inherent weaknesses in the aligned dataset prepared by Kauchak and Coster (2011). In their own tests, they found that 27% of the data was identical, meaning that standard sentences had been aligned with identical simple sentences. Is this necessarily a bad thing? No; any system trained for text simplification should be able to handle cases where simplification is not needed. But having that type of data take up just short of a third of the whole corpus is probably not representative. This fault comes from using an alignment method that is too generous towards highly similar sentences. Fine-tuning the penalties for high similarity in the alignment would perhaps be a way of avoiding this generosity. With respect to the current study, the identical sentences can be interpreted as noise. They do not add anything to the analysis, but neither do they take away from it, at least not in terms of the comparison between the datasets. The differences between simple and standard sentences as calculated in this study would be the same if all identical sentences were removed. They have however been kept in, since this seems to be the conventional way to treat them in other studies. And for purposes of generalisation and outside comparison, such conventions should be followed, as long as doing so introduces no error.
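
The share of identical pairs is easy to quantify. The sketch below loads the two aligned sides and reports the proportion of identical pairs, under the assumption that the corpus is stored as two parallel files with one sentence per line in matching order; the file names are hypothetical.

    # A hedged sketch; "normal.aligned" and "simple.aligned" are hypothetical
    # file names for the standard and simple sides of the sentence-aligned corpus.
    def identical_ratio(standard_path, simple_path):
        with open(standard_path, encoding="utf-8") as f_std, \
             open(simple_path, encoding="utf-8") as f_sim:
            pairs = [(s.strip(), t.strip()) for s, t in zip(f_std, f_sim)]
        identical = sum(1 for s, t in pairs if s == t)
        return identical / len(pairs)

    # Example: print(f"{identical_ratio('normal.aligned', 'simple.aligned'):.1%} identical")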

Why use sentence-aligned data, and why not use the entirety of simple Wikipedia and standard Wikipedia? There are multiple reasons that motivate the choice of data. Primarily, the underlying idea in this study is to provide descriptives that could be used in text simplification research, and most of this research is done on sentence-aligned data (Napoles and Dredze 2010). In order for the present work to be relevant in that context, the data from Kauchak and Coster (2011) was used. There are however some downsides to this approach. A lot of potentially available data is omitted from my tests: while 167,000 parallel sentences were used, there is a total of 133,000 articles in the whole of simple Wikipedia (Wikipedia 2018a). So in determining the readability of simple Wikipedia, one could argue that the largest amount of available data should be strived for. But taking the full text of all articles and aligning on document level would change the overall reliability and validity of this study. These articles are subject to constant change, and in one year's time it is hard to say whether the results here would be replicable in any way. By using a state-of-the-art simplification corpus, the results remain relevant and reliable over time. As far as validity is concerned, the problem is size and content. Even though a lot of simple Wikipedia articles seem to echo their standard counterparts in content, this is not a robust claim; other researchers have found simple Wikipedia articles to be shorter and to contain less information overall (Napoles and Dredze 2010; Yasseri, Kornai, and Kertész 2012). Especially when studying syntactic differences and training syntax-based parsers, the data needs to be parallel in the truest sense. We usually want to make the assumption that the content is more or less identical between samples, so that what is modelled is syntax as opposed to textual information. Having data that is parallel on sentence level lets us make that assumption. Otherwise we cannot be sure that syntactic features, or sometimes lexical features, are valid in what they tell us.

5.2.2 Concerning the parser

It is possible that the results could be improved upon by using a better parser. One of the reasons spaCy was chosen is that it is, to my knowledge, the fastest available parser, and it is cheap on resources. Instead of using, for example, the Stanford CoreNLP parser, which outperforms spaCy in accuracy, I opted for the fast yet reliable library, since there were constraints in terms of the speed of the hardware available for this study. In a study comparing spaCy to a number of other parsers, its performance was at a state-of-the-art level, although lower than that of many other parsers (Choi, Tetreault, and Stent 2015). All results within this project should nevertheless be replicable.

5.2.3 Studying the guidelines

There are some alternative ways in which the following of the guidelines could have been measured. The way in which passive phrases were counted did not account for reduced passive constructions. Consider the following two sentences. The first one is an ordinary passive construction, while the second one is reduced. The reduction is done by removing the auxiliary verb was belonging to the passive verb washed (Igo 2007).

• 1: The dog that was washed yesterday lives next door.
• 2: The dog washed yesterday lives next door.

It is uncertain whether the spaCy parser would identify this structure of passives correctly or not. I manually inspected a sample of the extracted passive phrases. The sample consisted of 200 randomly selected sentences (100 from the standard data and 100 from the simple data). In that inspection, no reduced passive phrases were found. This could mean that some passive phrases went unidentified in the results of this study.
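
One way to probe whether the parser can surface reduced passives is sketched below: look for a past participle attached to a noun as a clausal modifier (acl) with no passive auxiliary of its own. This heuristic is an assumption for illustration, not the detection rule used in the thesis.

    import spacy

    # A hedged heuristic for candidate reduced passives: a past participle (VBN)
    # modifying a noun as a clausal modifier (acl) without an auxpass child.
    nlp = spacy.load("en_core_web_sm")

    def reduced_passive_candidates(sentence):
        doc = nlp(sentence)
        return [tok.text for tok in doc
                if tok.tag_ == "VBN" and tok.dep_ == "acl"
                and not any(child.dep_ == "auxpass" for child in tok.children)]

    print(reduced_passive_candidates("The dog washed yesterday lives next door."))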

A number of guidelines were excluded from the study, most of them because they do not directly relate to readability. There were, however, two exclusions that did not necessarily have to be made. The first one was the splitting of sentences. The guidelines give numerous tips on how to shorten sentences by splitting them in two. The reason this was not measured in the present study is a limitation of the data: the corpus that was used contains only one-to-one alignments, and if splitting of sentences were to be traced, another corpus would be needed. The corpus of Hwang, Hajishirzi, Ostendorf, and Wu (2015) contains one-to-many alignments and would be more fitting for the task. Still, tracing sentence splits is quite a complex task in itself, essentially reverse engineering the writing process, and it would require thorough study.


One guideline that would have been very appropriate to measure was straight word order, that is, variations of subject-verb-object relationships with no intervening words. Due to hardware limitations this task was cut, as it would have taken a disproportionate amount of time relative to the value it could have provided to the study.
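
For reference, a straight word-order check could be operationalized roughly as sketched below: the nominal subject must directly precede its verb and the direct object directly follow it. Since no such measure was actually run in this study, both the adjacency criterion and the choice of dependency labels are assumptions.

    import spacy

    # A hedged sketch of a straight (S-V-O, no intervening words) order check;
    # this measure was not run in the thesis, so the operationalization is an assumption.
    nlp = spacy.load("en_core_web_sm")

    def is_straight_svo(sentence):
        doc = nlp(sentence)
        for tok in doc:
            if tok.dep_ == "nsubj" and tok.head.pos_ == "VERB":
                verb = tok.head
                objs = [c for c in verb.children if c.dep_ == "dobj"]
                if objs and tok.i == verb.i - 1 and objs[0].i == verb.i + 1:
                    return True
        return False

    print(is_straight_svo("Dogs chase cats."))          # True
    print(is_straight_svo("Cats are chased by dogs."))  # False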

5.2.4 Social aspects

The social implications of this study are quite limited. What can be said is that my findings further the knowledge on text simplification resources. Research into text simplification and readability can help promote digital inclusion (Alusio, Specia, Gasperin, and Scarton 2010), both from the perspective of individuals and from that of government institutions or, for example, news outlets. People with reading disabilities, language learners (Petersen 2007) and people at different reading levels (Belder and Moens 2010) benefit from simplified text. Text simplification has the potential of making information accessible to more people. In the long term, this can be seen as a democratization of knowledge.


6 Conclusion

There were two aims in this study. The first was to evaluate the language complexity differences between simple and standard Wikipedia in the corpus prepared by Kauchak and Coster (2011). This was done by measuring surface features such as the Gunning Fog index, and dependency-based features such as average dependency distance. It was found that sentences from simple Wikipedia were less complex on average than those in standard Wikipedia. The lexical richness, as measured by the type-token ratio, was also slightly lower in the simple Wikipedia sentences. I propose that average sentence length, word length, dependency distance and sentence depth would be relevant parameters to use when training an automatic text simplification system on simple Wikipedia data.

The second aim of this study was to test the extent to which the guidelines for simple English Wikipedia have been followed. A number of guidelines were operationalized and tested, and the results were compared between the standard and simple data. None of the operationalized guidelines showed a significant difference between the datasets. This indicates that the guidelines for simple Wikipedia are not followed to a greater extent within the simple Wikipedia articles themselves than in standard Wikipedia articles.

Although the results were limited in scale and level of detail, the descriptives given in this thesis will be useful for future projects in ATS.

An interesting continuation of this project would be to operationalize and test the simple Wikipedia guidelines left out of this study, including word ordering and the splitting of sentences. Plain English guidelines from other sources could also be included. Training a classification system based on the guideline measures and evaluating its performance would tell us more about both how closely the guidelines are followed and how useful they are. Letting the system classify unseen data, and training on different configurations of parameters, should show which guidelines are useful in a text simplification task.

References
