

Linköping University

Department of Computer and Information Science

Final Thesis

Can a graded reader of authentic material be generated?

by

Kent Danielsson

LIU-IDA/LITH-EX-A--13/050--SE

2013-10-28

Supervisor: Arne Jönsson

Examiner: Lars Ahrenberg


Abstract

This thesis investigates whether a graded reader for English, leveled to the CEFR levels using the English Vocabulary Profile (EVP) dictionary, can be generated from a corpus of authentic material. The approach was tested on Wikipedia and the ukWaC corpus. There were some problems in correctly matching the words in the EVP word lists against the tagged words of the corpora. The results show that it might be possible to find enough suitable texts to generate a graded reader for at least the higher CEFR levels if only lemmas are considered. If the POS tags also have to match between the word list and the corpora, the errors were too large to give a conclusive answer.


Contents

1 Introduction
1.1 Background
1.2 Goal
1.3 Purpose
1.4 Limitations
1.5 Thesis Outline

2 Background to Language Learning through Reading
2.1 Graded Readers
2.2 Extensive Reading
2.3 Reading Comprehension

3 Background to Word lists in Language Learning
3.1 Overview
3.2 BNL2709
3.3 Project KELLY
3.4 English Vocabulary Profile (EVP)

4 Method
4.1 Word list - EVP
4.2 Corpora
4.2.1 ukWaC
4.2.2 English Wikipedia
4.2.3 Simple English Wikipedia
4.3 Text Analysis
4.3.1 Term Definition
4.3.2 Multiwords
4.3.3 Text Coverage
4.4 Tag sets
4.5 Corpora Analysis
4.6 Document Analysis
4.6.1 Term Categorization
4.6.2 Candidate Filtering Strategies

5 Implementation
5.1 Tools
5.1.1 Programming Language
5.1.2 Database
5.1.3 POS Tagging and Lemmatization
5.2 Preprocessing
5.2.1 Word lists
5.2.2 Corpora
5.3 Strategies

6 Results and Discussion
6.1 Corpora Analysis
6.1.1 Corpora Comparison
6.1.2 Corpora Size Comparison
6.1.3 Document Length Comparison
6.1.4 Tagger Comparison
6.2 Document Analysis
6.2.1 Document Length Comparison
6.2.2 Corpora Comparison
6.2.3 Analysis Per Strategy
6.3 Ignoring POS tags
6.3.1 Term Coverage
6.3.2 Document Analysis

7 Conclusion
7.1 Conclusions
7.1.1 Corpora Analysis
7.1.2 Document Analysis
7.2 Future Work
7.2.1 Term Definition
7.2.2 Input word list lemmatized
7.2.3 Incorrect Tagging and Lemmatization of Texts
7.2.4 Investigate Special Category
7.2.5 Add Grammar Check to the Strategies

Appendices
A Common European Framework of References (CEFR)
B Tagset Mappings
B.1 Mutual tagset
B.2 Freeling Tag Set
B.3 Treetagger Tag Set
B.4 EVP Tag Set
C Words Not Covered
C.1 With POS tags


List of Figures

1.1 System Overview
3.1 English Profile Program research
4.1 Text Example POS-tagged
4.2 Text Example Lemmatized
4.3 Ideal Strategy
4.4 Perfect Strategy
4.5 Strict Strategy
4.6 Soft Strategy
4.7 Strict Constrained Strategy
4.8 Soft Constrained Strategy
5.1 First ten entries in the csv-file for level A1 with POS tags
5.2 First ten entries in the csv-file for level A1 without POS tags
6.1 Corpus Type Comparison of Term Coverage, size=10M terms, tagger=Freeling
6.2 Corpus Size Comparison of Term Coverage, corpus=ukwac, tagger=Treetagger
6.3 Document Length Comparison of Term Coverage, corpus=ukWaC, size=2,4M terms, tagger=Treetagger
6.4 Tagger Comparison of Term Coverage, corpus=ukWaC, size=10M terms
6.5 Freeling & Treetagger Difference CEFR A1 level
6.6 Term Definition Comparison of Term Coverage, corpora=ukWaC, size=40M terms, tagger=Treetagger


List of Tables

3.1 Project Kelly English size for CEFR
3.2 The word jump in the EVP dictionary
4.1 EVP size for CEFR
4.2 ukWaC Length Corpora Sizes
4.3 Text Example Term Occurrences
4.4 Text Example Coverage for jump/NOUN
4.5 Term Categories
4.6 All Candidate Filtering Strategies
6.1 Document Length Comparison of Number of Candidate Texts per Level and Strategy, corpus=ukwac, size=2,4M terms
6.2 Text Characteristics of Candidate Texts per Level and Strategy, corpus=ukwac, size=2,4M terms, length=150-250 terms
6.3 Text Characteristics of Candidate Texts per Level and Strategy, corpus=ukwac, size=2,4M terms, length=450-550 terms
6.4 Text Characteristics of Candidate Texts per Level and Strategy, corpus=ukwac, size=2,4M terms, length=950-1050 terms
6.5 Corpora Type Comparison of Number of Candidate Texts per Level and Strategy, size=10M terms, grouped by level
6.6 Corpora Type Comparison of Number of Candidate Texts per Level and Strategy, size=10M terms, grouped by strategy
6.7 Text Characteristics of Candidate Texts per Level and Strategy, corpus=ukwac, size=10M terms
6.8 Text Characteristics of Candidate Texts per Level and Strategy, corpus=wikipedia, size=10M terms
6.9 Text Characteristics of Candidate Texts per Level and Strategy, corpus=simple wikipedia, size=10M terms
6.10 Number of Candidate Texts per Level and Strategy, size=10M terms, corpora=ukWaC, term def=lemma, grouped by level
6.11 Number of Candidate Texts per Level and Strategy, size=10M terms, corpora=ukWaC, term def=lemma, grouped by strategy
6.12 Text Characteristics of Candidate Texts per Level and Strategy, corpus=ukWaC, size=10M terms, term def=lemma, grouped by strategy
6.13 Text Characteristics of Candidate Texts per Level and Strategy, corpus=ukWaC, size=10M terms, term def=lemma, grouped by level
A.1 ALTE Skill Level Reading Can Do Statements
B.1 Tagset Mapping
B.2 Freeling Tagset Mapping
B.3 Treetagger Tagset Mapping


Chapter 1

Introduction

Graded readers are books aimed at learners of foreign languages, intended to teach vocabulary through reading, as defined by Nation and Wang in [13]. They are graded, or leveled, in series that normally consist of 4-6 different reading levels. Each level within a series is graded in such a way that Level 1 could for example be restricted to 500 headwords (a headword, or lemma, is the word you look up in the dictionary; the term lemma will mainly be used in this thesis), Level 2 to 600 headwords, and so on. This allows progress over time as the reader works through the levels. The content can be written specifically for each level or it can be adapted from literary classics, movies etc.

In order for a learner to comprehend a text and be able to guess the meaning of unknown words from context, research shows that 95% of the words need to be known [11]. In other words, no more than every 20th word can be unknown. Other research puts the limit even higher, at 98%, for adequate comprehension [8].

For a language learner to actually learn a word and store it in memory by reading it in context, the word needs to be encountered several times. 10-12 times is a number mentioned in studies [19].

In other words, to actually learn a previously unknown word by reading it, it needs to appear in a context where you already know at least 95% of the surrounding words, and you need to encounter it at least 10 times.
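To make these two conditions concrete, here is a minimal Python sketch (the function and variable names are hypothetical, not part of the thesis implementation) that checks whether a tokenized text satisfies them for one target word:

from __future__ import division

def meets_learning_conditions(tokens, known_vocab, target,
                              min_known=0.95, min_occurrences=10):
    # tokens: list of lower-cased words in the text
    # known_vocab: set of words assumed already known by the learner
    # target: the unknown word we want the learner to acquire
    known = sum(1 for t in tokens if t in known_vocab)
    coverage = known / len(tokens) if tokens else 0.0
    occurrences = tokens.count(target)
    return coverage >= min_known and occurrences >= min_occurrences

# Toy example with a lowered occurrence threshold.
text = "the cat sat on the mat the cat saw the mat".split()
print(meets_learning_conditions(text, {"the", "cat", "sat", "on", "saw"},
                                "mat", min_occurrences=2))
# -> False: only about 82% of the tokens are known, below the 95% limit.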

This thesis will explore the possibility of generating a graded reader from authentic texts which meet these needs of the language learner. Authentic texts in this case means texts not written and adapted specifically for a language learner. The levels will be adapted to the Common European Framework of References (CEFR), which defines six levels at which proficiency in a foreign language is measured.

A study by Pellicer-Sánchez and Schmitt [19] says that graded readers seldom pay attention to the mid range frequency words, defined as the 3,000 to 9,000 most common words. This means that it is problematic for teachers to explicitly cover them. These mid range frequency words map to the higher levels of CEFR. If it is possible to generate graded readers which target exactly these words and levels, it could simplify the job for teachers of finding suitable texts for their students.




1.1

Background

In a research project by Eldridge and Neufeld [6], they revisited the concept of the graded reader and developed a series of what they called eReaders. The content of these eReaders was based on Wikipedia articles edited to a target length of 1000 words. During the process the articles were also continuously profiled against the Billuroğlu-Neufeld list of the 2709 (BNL2709) most frequent words in English. See Section 3.2 for an explanation of BNL2709. They ensured complete coverage of the word list and that each item was repeated at least 12 times.

1.2

Goal

The goal of this thesis is to examine if the concept of the eReader by Eldridge and Neufeld can be taken one step further. For their eReader the texts were edited and examined manually with some help from analysis tools. The goal here is to investigate if it is possible to generate a minimal graded reader from a corpus, leveled against a word list, without any supervision beyond the input parameters for the graded reader generator. It should be a minimal graded reader in the sense that it contains as few words as possible, but it should still help the language learner to learn the words in the EVP (English Vocabulary Profile) word list. The EVP word list defines at which CEFR level each word belongs. The texts used for the graded reader should not be simplified and adapted for language learners; they should be what I call authentic texts.

As can be seen in the system overview in Figure 1.1, the idea is to first take a corpus and apply a strategy that filters it into a smaller set of candidate documents that are a good fit for the minimal graded reader. The second step is to select, based on the target words in the EVP word list, an even smaller optimized set of documents from the candidate set, which together form the minimal graded reader. This second step of actually generating a graded reader is not done in this thesis. The goal is to answer whether it is even possible.



1.3

Purpose

The purpose of this thesis is first to do an analysis of corpora consisting of authentic texts, to investigate whether any kind of corpus of authentic texts contains all words in the English Vocabulary Profile (EVP) word list. A second analysis is done to see if enough texts in the corpora are fit for the generation of a graded reader.

More specifically the following set of questions will be answered.

• Does the source/type of documents in the corpus affect the coverage of the CEFR levels?

• How big does the corpus need to be to cover the CEFR levels?

• Does the length of the documents in the corpus affect the coverage of the CEFR levels?

• Does the tagger which tags the texts play a major role?

• Does the length of the documents affect how many suitable documents are found?

• Does the length of the documents affect what kind of documents are found?

• Can a graded reader with authentic texts be automatically generated?

If the last question is positively answered, the results of this thesis could be used to develop tools that meet a need for language learners and their teachers, especially covering the mid range frequency words, the higher levels of CEFR, which according to Pellicer-Sánchez and Schmitt [19] can be a bit problematic.

1.4

Limitations

Only the English language will be evaluated. The initial idea was to use data which enabled a comparison between different languages, but the data proved to be unreliable, so that idea had to be rejected. However, with the right kind of data the analysis done in this thesis could be applied to any language.

The idea is to find good study material for language learners at different levels, but the focus will only be on vocabulary. A text which is found to be at a lower level when only looking at the vocabulary used in it could actually be hard to comprehend for language learners at that level, because of higher level grammar structures the learner has not yet learned. This fact will however be ignored and left for future work.

A word can have many meanings, or senses. In this thesis the different senses of a word will not be taken into consideration. However, the part of speech of a word will be used to distinguish between words that have several parts of speech. For example, the word jump can be both a noun and a verb, but the different senses of the verb form will be ignored. See Table 3.2 for jump in the EVP dictionary.

The corpora used consisted of at most 40 million terms. The limit was set to 40 million terms to limit execution time. Some analyses took a little bit over 2 hours to run.



1.5

Thesis Outline

Chapter 1: Introduction

The Introduction chapter elaborates on the goal and purpose of this thesis and introduces its approach and limitations.

Chapter 2: Background to Language Learning through Reading

This chapter gives the background to research in language learning through reading.

Chapter 3: Background to Word lists in Language Learning

This chapter gives an overview of some word lists that have been used in language learning.

Chapter 4: Method

The method used in this thesis is presented in this chapter.

Chapter 5: Implementation

This chapter explains how things are implemented.

Chapter 6: Results

This chapter presents the results from the analysis as well as some analysis and discussion of the results.

Chapter 7: Conclusion

The thesis is summarized in this chapter together with conclusions made. Some future work is also suggested.

Appendices

The appendices contain a description of CEFR (Appendix A), the tag sets used (Appendix B) and a list of words in EVP not covered by the corpora (Appendix C).


Chapter 2

Background to Language Learning through Reading

This chapter gives the background to research in language learning through reading.

2.1

Graded Readers

Graded readers are books which are specially written or adapted for second language learners. Quite often they consist of simplified texts. The goals of these graded readers can differ: they include gaining skill and fluency in reading, reinforcing previously learned vocabulary and grammar, learning new vocabulary and grammar, and gaining pleasure from reading [13].

They are targeted at second language learners by restricting the vocabulary and the grammar structures that occur in the texts. A beginner would start at a lower level and continue reading texts on that level until reading is comfortable, then move up to the next level.

Graded reader series often reach a total of about 3,000 headwords covered across all their texts. For example, Cambridge English Readers has 3,800 headwords, Penguin Readers 3,000 headwords and Oxford Bookworms 2,500 headwords, as explained by Pellicer-Sánchez and Schmitt in [19].

In the research for the eReader in [6], Eldridge and Neufeld found that in a corpus of 741,504 words, consisting of 56 graded readers from stages 1 to 6 of the Oxford Bookworms series, the recycling of words seemed to be more by accident than by design. They also noted that the strong point of graded readers, that they contain homogeneous texts, is also their weakness: since the texts are homogeneous, most words only show up in a few senses. See Section 4.3.1 for more about this.

Apart from the study of graded readers, they also upgraded the concept and created an eReader. Tailored after the word list BNL2709, see Section 3.2, they made sure that each word at each level occurred at least 12 times. As a basis they used texts from Wikipedia which they edited to shorten sentence lengths and simplify the clause structures. It was assumed that the language learner had some knowledge of English to begin with. How this eReader fared against traditional graded readers was not tested.



2.2

Extensive Reading

The goal of extensive reading is to expose language learners to ”large quantities of material within their linguistic competence” [20]. Whether this approach is a good and effective way to learn vocabulary is debated. Regarding the effectiveness with respect to time and sufficient exposure to words, there has been a debate between McQuillan and Krashen on one side and Cobb on the other. Krashen is a researcher on bilingual education and language acquisition, and also a promoter of free voluntary reading.

McQuillan and Krashen in [12] claim that a reader who reads 20 minutes per day (at a reading speed of 100 words per minute) would encounter 1,460,000 words over a two-year period. They claim that this would be enough to reach a 5000 word level with a minimum exposure to each word of at least 10 times.

Cobb’s response in [3] is that there is no smooth rise for 3,000+ word families; reading material at this level usually contains too high a percentage of unknown words for the reader to be able to comprehend the text and learn the unknown words.

Why extensive reading at an appropriate level is important for a language learner has been studied by Claridge in [2]. By investigating the reading perceptions of readers, the conclusion was that those who read above their level do not show any increase in positive attitude towards reading and do not derive any pleasure. Claridge then claims that it is unlikely that fluency develops this way.

In a study on incidental learning by Pellicer-Sánchez and Schmitt in [19], with the goal of reading a text for pleasure and at the same time learning new vocabulary, they let the language learners read the unmodified authentic novel Things Fall Apart. The result was that after more than 10 exposures the meaning of previously unknown words could be recognized for 84% of the words, and the spelling for 76% of the words.

2.3

Reading Comprehension

A commonly cited study by Hu and Nation [8] finds that 98% of the words in a text need to be known for comprehension. The study was performed by exchanging low frequency words for nonsense words, so for a coverage of 98% known words every 50th word was exchanged for a nonsense word. To test comprehension, the test persons were afterwards given multiple choice questions to answer and asked to give a written recall of the text. An interesting note is that even at 100% text coverage not everyone gained adequate comprehension. Building upon these findings, there has been research investigating how big a vocabulary is needed to comprehend written and spoken texts in general. A study by Nation [16] found that if 98% coverage of a text is desired for comprehension, knowledge of 8,000 to 9,000 word families is needed to reach that percentage for written texts. For spoken language it is a bit lower, 6,000 to 7,000 word families.

An example of a word family used in [16] is the word family abbreviate, to which the words abbreviate, abbreviates, abbreviated, abbreviating, abbreviation and abbreviations belong. In this word family we find the two lemmas abbreviate and abbreviation.


Chapter 3

Background to Word lists in Language Learning

First an overview of word lists is given. Then three word lists are presented which were considered for this thesis. BNL2709 was considered on the basis that it is the word list used for the eReaders in [6]. The other two, Project Kelly and the English Vocabulary Profile (EVP), were considered because they are adapted to the CEFR framework. See Appendix A for an overview of the CEFR framework. The one picked in the end for this thesis was EVP.

3.1

Overview

Compiling word frequencies and selecting the words of high frequency from one or more corpora is a common way of creating word lists containing the most useful words for a language learner to learn in order to be able to comprehend general English texts. Two commonly used word lists are the General Service List (GSL) and the Academic Word List (AWL), according to Neufeld and Billuroğlu in the research for their own word list BNL2709 explained in [18].

GSL is a list consisting of 2000 headwords. It was developed as early as the 1940s. Although old, it is still considered to be one of the best word lists for this purpose, as stated by Nation and Waring in [15]. It is based on a 5 million word written corpus.

The more specialized AWL contains 570 word families taken from a corpus of 3.5 million words of written academic texts. The words in AWL make up about 10% of the words in this corpus, compared with 1% of an equally sized corpus of fiction.

For a language learner, these words of high frequency are crucial and when time is limited one should focus on these according to Nation in [14].

3.2

BNL2709

BNL2709 is a word list consisting of 2709 words, based on the GSL and AWL, as described in [18] by Neufeld and Billuroğlu. The reason they created a new word list was that they observed some missing words in the GSL regarding the difference between American and British spelling. Moreover, GSL is from the 1940s; words fall out of fashion and new ones are created, so they felt it needed a face lift. The GSL and AWL as a pair could be seen as if the GSL were a base list for lower levels and AWL an advanced list for higher levels. They considered this to be a problem and thus created BNL2709 with words from the AWL distributed even at the lower levels.

3.3

Project KELLY

Project Kelly (KEywords for Language Learning for Young and adults alike) was an EU project granted to ten partner organizations for the period 2009-2012 [10]. The aim of this project was to classify words into the different CEFR levels for nine languages (Swedish, Norwegian, English, Italian, Polish, Greek, Russian, Arabic and Chinese). Out of these vocabulary lists they created language-training products such as Keewords (http://www.keewords.com/), consisting of word cards for 72 language pairs.

The approach to creating these word lists was to base the lists on word frequencies in corpora and to order the words in such a way that the more common words were taught first. In some situations, however, the corpus frequencies were overruled by expert judgments: a word with a low frequency could still be placed at a lower level if the experts thought it would aid language learners to learn it earlier.

The list also contains POS tags. However, the POS tags for the English word list are incorrect in many cases. The first 1944 words seem to be correct, but in the remaining part of the list the POS tags are scrambled between the words with no visible pattern. Examples are ”an administration VERB” and ”at first NOUN” (obtained from http://www.kellyproject.eu/?page_id=690 in January 2013; as of 20 October 2013 the list is not available, but there is a notice saying ”Will be updated”). This makes the word list less useful if one wants to go deeper than just using the lemma. See Section 4.3.1 for a discussion of this issue. The size of the Project Kelly word list for English can be seen in Table 3.1.

Table 3.1: Project Kelly English size for CEFR

Level        A1    A2    B1    B2    C1    C2
Terms        1150  1119  1112  1093  1131  1151
Accumulated  1150  2269  3381  4474  5605  6756

3.4

English Vocabulary Profile (EVP)

The English Vocabulary Profile (EVP) is an online dictionary which is part of the English Profile Program, a collaborative program endorsed by the Council of Europe. It is designed to create a profile of what learners can do in English at the different CEFR levels.




Figure 3.1: English Profile Program research

The EVP word list described in [22] is substantially but not exclusively corpus-informed. As a basis for the EVP, the Cambridge English Corpus and the Cambridge Learner Corpus have been used. The Cambridge English Corpus is a 1.2 billion word corpus of written and spoken English. The Cambridge Learner Corpus, on the other hand, consists of 45 million words of texts written by students at all six CEFR levels. See Figure 3.1 (image source: http://www.englishprofile.org/images/pdf/ep-roger-hawkey.pdf) for the process regarding the usage of the Cambridge Learner Corpus.

Together with these two corpora, word lists from leading coursebooks, readers’ word lists and vocabulary skill books have been used. Furthermore, vocabulary lists for Cambridge ESOL’s Key English Test (KET) and Preliminary English Test (PET) examinations have been taken into account.

The EVP dictionary contains words, phrases, phrasal verbs and idioms. It includes parts of speech as well as the senses of the words. As can be seen in Table 3.2, the word jump as a verb is found at no fewer than four CEFR levels: A2, B1, B2 and C2. This means that a language learner at level B1 should have learned at least three ways of using the verb jump: the two senses at A2 level plus one sense at level B1.



Table 3.2: The word jump in the EVP dictionary

POS, Level, Sense & Explanation & Example

verb, A2, into air: [I] to push your body up and away from the ground using your feet and legs. ”The children were jumping up and down with excitement.”

verb, A2, go over: [I or T] to go over something by moving up into the air. ”Can you jump over/across this stream?”

verb, B1, jump in/into/up, etc.: to move or act suddenly or quickly. ”She jumped in/into a taxi and rushed to the station.”

verb, B2, fear: [I] to make a sudden movement because you are frightened or surprised. ”Her scream made me jump.”

verb, C2, jump to conclusions: to guess the facts about a situation without having enough information. ”He saw them talking together and jumped to conclusions.”

noun, B1: a sudden movement off the ground or off a high place. ”He won with a jump of 8.5 meters.”


Chapter 4

Method

This thesis is based on my own experiments on how a graded reader could be automatically generated. Doing these experiments I came up with the questions presented in Section 1.3. I discovered that the characteristics of the corpus used, which these questions cover, may affect the results.

As can be seen in the presentation of the corpora in Section 4.2, I use different kinds of corpora to analyze these characteristics and their effect. I analyze the characteristics by changing one at a time. For example, I have three corpora containing texts of different lengths, run the analysis, and then compare the results. If there is any difference between the corpora I try to come up with an explanation for why that is.

The filtering strategies described in Section 4.6.2 which are used for finding suitable texts for language learning are based on literature studies on graded readers and the concept of learning through reading.

In the end, combining all these results, I will answer the big question of this thesis: Can a graded reader with authentic texts be automatically generated?

4.1

Word list - EVP

The word list selected for this thesis is the EVP word list. The reason it is selected is that I wanted to explore the possibility of creating a graded reader for the CEFR levels. EVP is selected over the Kelly Project word list because the latter had some errors. Moreover, the EVP was actually recommended by Adam Kilgarriff, one of the co-authors of the Kelly Project (personal correspondence when I investigated the errors I found in the Kelly Project English word list).

As shown in Table 3.2, the term jump/VERB exists at four CEFR levels. The fact that a word with the same part of speech can belong to several CEFR levels creates an interesting challenge. To be able to use the information about different senses belonging to different levels, one must have a sense-tagged corpus. The corpus ukWaC is not sense tagged; see Section 4.2.1 for a description of the corpus. The Freeling tagger used as part of the implementation could also be used to do word sense tagging, but then comes the challenge of matching the sense of a word tagged by the Freeling tagger with the senses in the EVP word list. Therefore, within this thesis the different senses will be ignored.

When a lemma/pos pair is found at two or more CEFR levels, like for example jump/VERB shown in Table 3.2, it will be assigned to the lowest level it is found at. That means jump/VERB will be assigned level A2.
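As an illustration of this lowest-level assignment, the following sketch (the in-memory layout of the word list is an assumption made for the example) keeps only the first CEFR level at which each lemma/pos pair occurs:

CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

def assign_lowest_level(entries):
    # entries: iterable of (lemma, pos, level) tuples, possibly listing the
    # same lemma/pos pair at several levels.
    assigned = {}
    for lemma, pos, level in entries:
        key = (lemma, pos)
        if key not in assigned or CEFR_ORDER.index(level) < CEFR_ORDER.index(assigned[key]):
            assigned[key] = level
    return assigned

# jump/VERB is listed at A2, B1, B2 and C2; it ends up assigned to A2.
evp = [("jump", "VERB", "A2"), ("jump", "VERB", "B1"), ("jump", "VERB", "B2"),
       ("jump", "VERB", "C2"), ("jump", "NOUN", "B1")]
print(assign_lowest_level(evp))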

This method creates the distribution of terms seen in Table 4.1. The column Listed gives the number of terms in the original list; the column Remain gives the number of terms after duplicates, and lemma/pos pairs at higher levels that already occur at lower levels, have been removed. The same is done for the term defined as lemma, which is later used for comparison.

Table 4.1: EVP size for CEFR

Level   Listed   lemma/pos            lemma
                 Remain   Accum.      Remain   Accum.
A1      633      633      633         587      587
A2      1196     1031     1664        890      1477
B1      2008     1647     3311        1420     2897
B2      2620     1995     5306        1745     4642
C1      1497     1106     6412        1005     5647
C2      1826     1178     7590        1009     6656

Comparing with the word lists from the Kelly Project, Table 3.1, the EVP word list contains almost 1000 more terms than the Kelly Project word list.

As mentioned earlier, it has been found that in order to be able to read unsimplified texts with a coverage of 98% known words, a vocabulary of around 8,000-9,000 words is needed. Looking at Table 4.1 we can spot a potential problem: a vocabulary of 8,000-9,000 words is not reached with the EVP word list. Therefore it could be hard to find texts with enough terms that the language learner already comprehends.

4.2

Corpora

Three corpora will be used for the analysis: ukWaC, English Wikipedia and Simple English Wikipedia. Graded readers often contain simplified reading material. Nation and Wang write in [13] that some consider that this simplification results in a language that is not suitable for language learners because it is somewhat distorted and not authentic. Therefore the aim for this thesis is to use corpora of non-simplified authentic texts.

4.2.1

ukWaC

The aim of the ukWaC corpus, described in [1] by Baroni, Bernardini, Ferraresi and Zanchetta, is to act as a general-purpose resource. To the knowledge of the authors of that paper it is the only English web-crawled resource with linguistic annotation. The total size is 2 billion words. For this research 40 million words of ukWaC are considered. The reason the limit is set to 40 million words is that at that point the gains from a bigger corpus started to diminish, as can be seen in Figure 6.2.

Description

To ensure the quality of the corpus, over 96% of the documents crawled were removed. The post-crawl removal methods were to first keep only documents of mime type text/html and between 5 and 200KB in size. The reasoning behind this is that small documents usually contain little text because of the HTML code overhead, while large documents tend to be lists of various sorts. Perfect duplicates were also removed completely, because they are often warning messages, copyright statements etcetera, and therefore of no real linguistic interest. Then code was removed from the documents as well as boilerplate text. Boilerplate text is for example headers, navigation etc., which again bear little interest and value for the intended usage of language learning.

After that the documents were filtered based on a list of English function words. This is a way to remove texts containing little connected text, since research shows that connected texts contain a high proportion of function words.

Finally, the corpus was made kid friendly by identifying and removing pornographic pages.

The corpus available for download is already tagged using Treetagger, see Section B.3 and reference [21].

Sub Corpora

From the corpus described above some sub corpora with different characteristics were created.

Corpus Size To test how many words from the EVP word list are covered at different sizes of the corpora several subcorpora were created, ranging from 1 million to 40 million words. The results are seen in Figure 6.2. To create the 1 million corpus the texts were added to the corpus in the same order they were read from the complete ukWaC corpus until 1 million words had been reached. The same process was used for the other sizes.

Text Length Early testing for this thesis gave the impression that it was easier to find suitable texts for the creation of a graded reader among shorter texts than longer texts. Therefore, three more sub corpora of the ukWaC corpus were created: one with shorter texts of between 150 and 250 terms, a middle length corpus with texts of between 450 and 550 terms, and one with longer texts of between 950 and 1050 terms. The one with short texts was created due to the observation above; the one with texts of around 1000 terms was created because that is the length of the texts used in the creation of the eReader. All three corpora contain 2.4 million terms. The number of documents in each corpus can be seen in Table 4.2.



Table 4.2: ukWaC Length Corpora Sizes

Lengths    Documents   Average Length
150-250    12191       197
450-550    4824        498
950-1050   2400        1000

Tagger The corpus is already tagged using Treetagger. To analyze whether the tagger used to tag the corpus greatly affects the result, a 10 million term corpus was retagged using the Freeling tagger.

4.2.2

English Wikipedia

Since the eReader used Wikipedia as the corpus this one will be analyzed in this thesis as well.

A dump was taken on the 15th of February 2013. The texts from the dump were extracted and cleaned with Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_extractor), which extracts the whole dump into smaller files.

Texts with a title starting with ”List of” were directly discarded, since they basically only contain one or more lists, which does not make a good text for language learning. Very short texts of less than 10 terms were also ignored.

To create the 10 million term corpus, the files into which the dump had been extracted were read in alphabetical order until a total of 10 million terms was reached. The texts were then tagged with the Freeling tagger.
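A sketch of this filtering and collection step (the article format and helper names are assumptions, not the actual extraction script):

def collect_wikipedia_texts(articles, max_terms=10 * 1000 * 1000, min_length=10):
    # articles: iterable of (title, text) pairs read in alphabetical file order.
    selected, total = [], 0
    for title, text in articles:
        if title.startswith("List of"):
            continue                      # list articles are poor reading material
        terms = text.split()
        if len(terms) < min_length:
            continue                      # ignore very short texts
        selected.append((title, text))
        total += len(terms)
        if total >= max_terms:
            break                         # stop once 10 million terms are collected
    return selected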

4.2.3

Simple English Wikipedia

Simple English Wikipedia is just like Wikipedia; the difference is that an easier language is used. The description of the wiki on the website reads as follows:

Articles in the Simple English Wikipedia use fewer words and easier grammar than the ordinary English Wikipedia. The Simple English Wikipedia is also for people with different needs, such as students, children, adults with learning difficulties, and people who are trying to learn English.

This corpus can not be called authentic as authentic is defined for this thesis in Section 1.2. The aim of these texts, as described in the quote above, is to be simplified. This goes against the view that texts should not be simplified in order to be useful for a language learner, expressed for example in the work of Nation and Wang [13].

It is still selected for comparison with Wikipedia, with the hypothesis that it will do better than Wikipedia at least at the lower CEFR levels.

The preparation of the corpus followed the same scheme as the one for Wikipedia.



4.3

Text Analysis

Text coverage has been mentioned a few times. Before going into text coverage, one must first define what a term is since terms are used to determine the text coverage.

4.3.1

Term Definition

To illustrate the different ways of approaching the problem of defining a term, let us use the following nonsense text as an example: ”I jump. I performed a great jump. I jumped high.”

If we analyze and add each word’s part of speech (POS tagging) we get the analysis depicted in Figure 4.1.

I/N jump/V.

I/N performed/V a/DET great/ADJ jump/N. I/N jumped/V high/ADV.

Figure 4.1: Text Example POS-tagged

Furthermore a text can be lemmatized and the lemma of each word is found, see Figure 4.2.

I/I jump/jump.

I/I performed/perform a/a great/great jump/jump. I/I jumped/jump high/high.

Figure 4.2: Text Example Lemmatized

As can be seen in Figure 4.2 above all but two words, performed and jumped, are in the lemma form already in the original text.

Let us use three different ways of defining a term:

original word - the exact form found in the text
lemma - the lemma form of the word
lemma/pos - the pair of the lemma form and POS tag

With these three different ways of defining a term we get the term distribution shown in Table 4.3.



Table 4.3: Text Example Term Occurrences

original word        lemma               lemma/pos
token      count     token     count     token       count
I          3         I         3         I           3
jump       2         jump      3         jump/V      2
jumped     1         perform   1         jump/N      1
performed  1         a         1         perform/V   1
a          1         great     1         a/DET       1
great      1         high      1         great/ADJ   1
high       1                             high/ADV    1

Given the text analyzed above, and if we want to calculate the coverage of the word jump and more specifically jump as a noun, we get quite different results depending on which definition of a term we use as can be seen in Table 4.4.

Table 4.4: Text Example Coverage for jump/NOUN

Categorization   Count   Coverage
word             2       20%
lemma            3       30%
lemma/pos        1       10%

From an implementation point of view, the word categorization scheme would be the easiest to implement, since only tokenization of the text is needed. For the lemma and lemma/pos categorizations, further analysis of the text is needed in the form of POS tagging and lemmatization.
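A small sketch of how coverage under the three definitions could be computed from a tagged and lemmatized text (the token format is an assumption; the counts reproduce Tables 4.3 and 4.4):

from collections import Counter

# Each token is a (word, lemma, pos) triple from the example text.
tokens = [("I", "I", "N"), ("jump", "jump", "V"),
          ("I", "I", "N"), ("performed", "perform", "V"), ("a", "a", "DET"),
          ("great", "great", "ADJ"), ("jump", "jump", "N"),
          ("I", "I", "N"), ("jumped", "jump", "V"), ("high", "high", "ADV")]

by_word = Counter(w for w, l, p in tokens)
by_lemma = Counter(l for w, l, p in tokens)
by_lemma_pos = Counter((l, p) for w, l, p in tokens)

total = len(tokens)
print(by_word["jump"] / float(total))               # 0.2, term = original word
print(by_lemma["jump"] / float(total))              # 0.3, term = lemma
print(by_lemma_pos[("jump", "N")] / float(total))   # 0.1, term = lemma/pos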

From a language learning perspective one might want to go even deeper and differentiate between different senses. The EVP word list being used in this thesis has that kind of depth so regarding the word list input it would be possible. As seen in Table 3.2 with the example of the word jump a language learner could at a quite early level comprehend one sense of the word, and will later on at higher levels learn more senses of the same lemma/pos pair.

Word sense disambiguation (WSD), as the problem of determining the sense of a word is called, is not yet completely solved [17].

An analysis of a graded reader corpus done for the creation of the eReader, comparing the texts with a sample corpus of Wikipedia, shows that for example the word draw exists in the graded reader corpus almost only in the sense of drawing a picture. In the Wikipedia sample corpus, on the other hand, draw exists in a much richer set of senses. Although it is only incidental, a graded reader created from Wikipedia texts may in this case expose the language learner to more senses of each word [6].



4.3.2

Multiwords

As seen in Section 4.3.1, how we define a term greatly influences the text coverage result we get. As can be seen in Section 4.2.1, in the corpus tagged with Treetagger, see Appendix B.3, the words of a multiword are lemmatized and POS-tagged separately, whereas Freeling, see Appendix B.2, has a multiword database one can turn on and off. Furthermore, Freeling also directly combines consecutive proper nouns into one proper noun. See the examples below.

Treetagger

He/PP ate/VVD at/IN the/DT Great/NP Monster/NP Lobster/NP Cage/NP Coverage of he is 12.5%

Freeling

He/PRP ate/VBD at/IN the/DT Great Monster Lobster Cage/NP. Coverage of he is 20%

Since the EVP word list contains multiwords, I opted for using multiwords. A comparison is done between the two taggers to see if there is any difference; see Figure 6.4 for the result of that analysis.
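The effect of combining consecutive proper nouns into one term, as Freeling does, can be sketched as follows (a simplification for illustration; the real taggers do considerably more):

def merge_proper_nouns(tagged):
    # tagged: list of (word, pos) pairs; consecutive NP tokens become one token.
    merged = []
    for word, pos in tagged:
        if pos == "NP" and merged and merged[-1][1] == "NP":
            merged[-1] = (merged[-1][0] + " " + word, "NP")
        else:
            merged.append((word, pos))
    return merged

sentence = [("He", "PRP"), ("ate", "VBD"), ("at", "IN"), ("the", "DT"),
            ("Great", "NP"), ("Monster", "NP"), ("Lobster", "NP"), ("Cage", "NP")]
merged = merge_proper_nouns(sentence)
print(len(sentence), len(merged))   # 8 tokens before, 5 tokens after
print(1.0 / len(merged))            # coverage of "he" rises from 12.5% to 20%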

4.3.3

Text Coverage

Two ways to calculate text coverage have been found. One way is to ignore proper nouns, acronyms and abbreviations, as done by Neufeld and Billuroğlu in [18]. In this case, if we reuse the example He ate at the Great Monster Lobster Cage, the word he would have a text coverage of 25%, since Great Monster Lobster Cage would be ignored being a proper noun (or four separate proper nouns if multiwords are not used). In a study on graded readers by Hu and Nation, they on the other hand treat the proper nouns as a separate category which should be comprehended by the reader [13].

As seen in Table 4.5 I go for the latter alternative, not ignoring proper nouns but putting them in a special category named special.

4.4

Tag sets

As explained in Section 4.3.1, we approach the matching of a word list and a corpus by computing the coverage of terms in a document, where a term is defined as a lemma and POS tag pair. A challenge is that different lemmatizers and the selected word list, in this case EVP described in Section 3.4, use different sets of tags. One common mutual tag set is needed to be able to translate and match between the different corpora and word lists. To overcome this problem the lowest common denominator is used. The English Vocabulary Profile word list described in Section 3.4 uses a fairly simplified set of tags; therefore the POS tags used in the mutual tag set are basically the same as those of the EVP word list. See Table B.1 in Appendix B.

The taggers Treetagger and Freeling both use tag sets based on the Penn Treebank tag set. The mappings can be seen in Appendix B.



4.5

Corpora Analysis

To analyze whether the whole EVP word list can be found in the corpus, and how big a corpus is needed to cover the whole word list, a corpora analysis is done. In the corpora analysis a corpus is treated as a bag of words, one can think of it as one giant text, and each occurrence of a term is counted. A term from the EVP word list is here defined as covered if it occurs 10 times or more in the corpus as a whole. The number 10 is selected because of research saying that 10 is the minimum number of times one needs to encounter a new word in order to have a chance to learn it [19].
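A sketch of this bag-of-words check, assuming the corpus has already been reduced to a flat list of lemma/pos terms:

from collections import Counter

def covered_fraction(corpus_terms, level_terms, min_count=10):
    # corpus_terms: iterable of (lemma, pos) terms for the whole corpus.
    # level_terms: set of terms from one CEFR level of the word list.
    # A term is covered if it occurs at least min_count times in the corpus.
    counts = Counter(corpus_terms)
    covered = sum(1 for term in level_terms if counts[term] >= min_count)
    return covered / float(len(level_terms)) if level_terms else 0.0

# Toy example: one of two word list terms occurs often enough.
corpus = [("cat", "NOUN")] * 12 + [("dog", "NOUN")] * 3
print(covered_fraction(corpus, {("cat", "NOUN"), ("dog", "NOUN")}))  # 0.5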

4.6

Document Analysis

In the corpora analysis a corpus was seen as a bag of words; in the document analysis we go down one level and analyze each text in the corpus separately to understand what characteristics it has. This is done to be able to answer the question of whether there are enough texts which match the requirements of a text suitable for a graded reader.

4.6.1

Term Categorization

Each term of a document will be categorized into one of the four categories seen in Table 4.5.

Table 4.5: Term Categories

Category     Description
known        Words from a level below the current level; the reader should therefore know them already.
target       Words at the current level the reader is studying.
special      Terms which the reader should be able to comprehend although not belonging to any level, for example proper nouns or numerals.
unwanted     Words which are at a higher level than the current level and thus should be incomprehensible to the reader.
comprehend   known + special

In the study by Nation and Wang, where graded readers and their characteristics were analyzed, the categories used were previous levels (relates to known), current level (target), proper nouns (special) and other (unwanted) [13].

Which category a term is assigned to depends on which CEFR level we are doing the analysis for. For example, if we are doing an analysis for CEFR level B2, all terms at the B2 level will be categorized as target. Terms at the levels below (A1, A2 and B1) will be categorized as known. Terms at the higher levels, C1 and C2, will be in the unwanted category. Terms which do not belong to any CEFR level will always be unwanted, independent of level. The special category is also independent of the level being analyzed.
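A sketch of this categorization for a single term, given the level currently being analyzed (the level lookup table and the test for special terms are assumptions about the data layout):

CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

def categorize(term, current_level, level_of, special_pos=("OTHER", "NUM")):
    # term: (lemma, pos) pair; level_of: dict mapping terms to their CEFR level.
    lemma, pos = term
    if pos in special_pos:                  # proper nouns, numerals, ...
        return "special"
    level = level_of.get(term)
    if level is None:
        return "unwanted"                   # not in the word list at all
    if level == current_level:
        return "target"
    if CEFR_ORDER.index(level) < CEFR_ORDER.index(current_level):
        return "known"
    return "unwanted"                       # above the current level

level_of = {("jump", "VERB"): "A2", ("abbreviate", "VERB"): "C1"}
print(categorize(("jump", "VERB"), "B2", level_of))        # known
print(categorize(("abbreviate", "VERB"), "B2", level_of))  # unwanted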

4.6.2

Candidate Filtering Strategies

Several filtering strategies are created which filter all the texts of a corpus into a subcorpus, based on each text's coverage of the different term categories presented in Table 4.5.

These strategies will be applied to each level to investigate if there is any difference between the levels.

CEFR level A1 will be ignored though, since the known category for that level is empty.

Table 4.6: All Candidate Filtering Strategies

Strategy        Target    Compre.    Known      Special    Unwant.
Ideal           >= 4%                >= 90.5%              <= 1%
Perfect         > 0%                 >= 95%
Strict Constr.  > 0%      >= 95%                <= 4.5%
Strict          > 0%      >= 95%
Soft Constr.    > 0%      >= 85%                <= 4.5%
Soft            > 0%      >= 85%

target >= 4%, known >= 90.5%, unwanted <= 1%

Figure 4.3: Ideal Strategy

The Ideal strategy, see Figure 4.3, is based on the research on graded readers by Nation and Wang in [13]. In that study the ideal scheme, as it is called there, is previous levels = 90.5%, proper nouns = 4.5%, current level = 4.0% and others = 1.0%. I can not find any explanation for why the coverage of proper nouns should be 4.5%. The only thing mentioned is that the previous levels and proper nouns together in this way make up a total of 95% comprehensible text. In the study [11] by Laufer this percentage is found to be needed to comprehend the remaining 5% of the text from context, in other words a bit lower than the other percentage of 98% mentioned earlier. Also, 4.5% seems to be the average text coverage of proper nouns in the graded readers studied by Nation and Wang.

target > 0%, known >= 95%

Figure 4.4: Perfect Strategy

In the Perfect strategy, see Figure 4.4, we leave room for some unwanted terms but we still keep the known category within the ideal limit. This means we are satisfied if we find just one target term.

target > 0%, comprehend >= 95%

Figure 4.5: Strict Strategy

The Strict strategy, see Figure 4.5, leaves bigger room for special terms than the Perfect strategy by switching from a constraint on known to one on comprehend.

target > 0%, comprehend >= 85%

Figure 4.6: Soft Strategy

The Soft strategy, see Figure 4.6, relaxes the constraint on the comprehend category by lowering it to 85%. This leaves room for more unwanted terms. The reasoning behind this is that although a language learner is at, for example, level B2, he/she has most likely encountered and learned more words than only the words at levels A1 to B1. Still, it is up to luck whether these unwanted terms are known to the language learner or not, which means the text can not be guaranteed to be suitable for the language learner.

target > 0%, comprehend >= 95%, special <= 4.5%

Figure 4.7: Strict Constrained Strategy

target > 0%, comprehend >= 85%, special <= 4.5%

Figure 4.8: Soft Constrained Strategy

Both the Strict and the Soft strategy, see Figure 4.5 and Figure 4.6, also exist in a more constrained version, Strict Constrained and Soft Constrained, see Figure 4.7 and Figure 4.8, which sets a limit of 4.5% on the special terms. This is done to increase the number of terms carrying meaning, which enhances comprehension of other words. The percentage of 4.5% for special is based on the ideal strategy, although I can not find any good explanation for the specific number selected by Nation and Wang in [13].
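To show how such a strategy acts as a filter, here is a minimal sketch that checks one document's category coverages against thresholds of the kind listed in Table 4.6 (the data layout is an assumption, not the thesis implementation):

import operator

OPS = {">": operator.gt, ">=": operator.ge, "<=": operator.le}

def passes_strategy(coverage, constraints):
    # coverage: dict mapping category -> fraction of the document's terms.
    # constraints: list of (category, op, threshold) tuples.
    return all(OPS[op](coverage.get(cat, 0.0), thr) for cat, op, thr in constraints)

soft_constrained = [("target", ">", 0.0), ("comprehend", ">=", 0.85), ("special", "<=", 0.045)]
doc = {"target": 0.03, "known": 0.82, "special": 0.04, "unwanted": 0.11}
doc["comprehend"] = doc["known"] + doc["special"]
print(passes_strategy(doc, soft_constrained))  # True: 86% comprehended, 4% special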


Chapter 5

Implementation

This chapter will explain how the corpora analysis system is implemented.

5.1

Tools

5.1.1

Programming Language

The programming language used for the analysis is Python 2.7. The reason Python was used is that it is a programming language I am fluent in; thus I did not need to spend any time learning or improving my knowledge of a programming language. Another argument for using Python is that the initial studies were done on a trial and error basis, and with Python one can fairly quickly create prototypes. An argument against using Python is that there are other languages which are faster in terms of execution time. But as the system is implemented, Python is mainly used as glue between other tools, and the main part of the execution time was spent in these other tools. Therefore choosing a faster language would not have affected the total execution time much.

5.1.2

Database

Postgresql 9.2 is used for storage and indexing of documents, term vectors and computed values. Another storage system considered was ElasticSearch. As with the programming language, I wanted to use a storage system I had earlier experience with. I selected Postgresql in favor of ElasticSearch because I found it easier to implement the candidate filtering strategies for Postgresql. This is an area in which big improvements in execution time might be possible, since a lot of the analysis time went into executing the SQL queries.

5.1.3

POS Tagging and Lemmatization

A third party tool, Freeling, is used for POS tagging and lemmatization. I have earlier experience using Freeling and could quickly create scripts for analyzing the texts with it. Therefore I selected Freeling as the POS tagging and lemmatization tool, to cut down on the time spent programming. There may exist better tools which could have been used, but due to time limits this was not investigated. A comparison between Freeling and Treetagger is done though; see the results in Section 6.1.4.

5.2

Preprocessing

Both the EVP word list and the corpora need to be preprocessed in order to be able to analyze and match them through the strategies.

5.2.1

Word lists

For the EVP word list described in Section 3.4, it is made sure that the POS tags follow the mutual tag set described in Appendix B.1. Entries in the EVP dictionary without a POS tag, usually phrases, are ignored and removed from the word list.

Furthermore, since the EVP word list also takes word senses into account, the word list has the interesting feature that a term can be found at several of the CEFR levels; see Table 3.2 for an example. What the word list preprocessor does is remove the term from the higher levels and assign it to the level of its first encounter.

After all filtering and mapping is done a word list for each level is stored as a comma separated file. These lists are used for the target term category.

See Figure 5.1 for the first ten entries in the file for level A1, and Figure 5.2 for the first ten entries in the file for level A1 when the individual POS tags have been removed and replaced with the match-all *.

a.m.,ADV
about,ADV
about,PREP
above,ADV
above,PREP
address,NOUN
adult,NOUN
afternoon,NOUN
after,PREP
again,ADV

Figure 5.1: First ten entries in the csv-file for level A1 with POS tags.

a.m.,*
about,*
above,*
address,*
adult,*
afternoon,*
after,*
again,*
age,*
all right,*

Figure 5.2: First ten entries in the csv-file for level A1 without POS tags.

The accumulation of terms from level A1 to each higher level is also saved as comma separated files. For example, the file for level A1 to level B2 contains all terms at levels A1, A2, B1 and B2. The content of these files is then used as the known term category.
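A sketch of this preprocessing step, writing one csv file per level and one accumulated csv file per level (the in-memory word list representation and the file naming are assumptions):

CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

def write_level_files(assigned, out_dir="."):
    # assigned: dict mapping (lemma, pos) -> lowest CEFR level, i.e. the
    # deduplicated word list produced by the preprocessor.
    per_level = {lvl: sorted(k for k, v in assigned.items() if v == lvl)
                 for lvl in CEFR_ORDER}
    accumulated = []
    for lvl in CEFR_ORDER:
        accumulated.extend(per_level[lvl])
        with open("%s/%s.csv" % (out_dir, lvl), "w") as f:      # target terms for lvl
            f.writelines("%s,%s\n" % (lemma, pos) for lemma, pos in per_level[lvl])
        with open("%s/A1-%s.csv" % (out_dir, lvl), "w") as f:   # known terms up to lvl
            f.writelines("%s,%s\n" % (lemma, pos) for lemma, pos in sorted(accumulated))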

5.2.2

Corpora

Term extraction

The corpus ukWaC is already lemmatized and POS tagged, as described in Section 4.2.1. A Python script was created to extract documents from the XML files in which the corpus is stored.

As discussed in Section 4.3.2, I opted for using multiwords. Proper nouns consisting of several words are not tagged as one word in the ukWaC corpus. To mimic the behavior of Freeling, which tags consecutive proper nouns as one proper noun, they are combined into one proper noun by the script used for extraction.

For each of these documents a term vector is then built which holds the count of each term in the document. The POS tags are then mapped to the mutual tag set, see Table B.3. From this term vector the coverage of terms from EVP can easily be computed.
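The term vector construction and the coverage computation can be sketched as follows (the tagged-token format and the tag mapping dictionary are assumptions about the data):

from collections import Counter

def build_term_vector(tagged_tokens, tag_map):
    # tagged_tokens: list of (word, lemma, pos) triples from the tagged corpus.
    # tag_map: dict translating tagger-specific POS tags to the mutual tag set.
    vector = Counter((lemma, tag_map.get(pos, "OTHER"))
                     for word, lemma, pos in tagged_tokens)
    return vector, sum(vector.values())

def coverage(vector, total, terms):
    # Fraction of the document's tokens whose (lemma, pos) pair is in terms.
    return sum(c for t, c in vector.items() if t in terms) / float(total) if total else 0.0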

Tagging

The tool Freeling was used for tagging the texts of wikipedia and simple wikipedia, as well as the subcorpus of ukWaC used for the comparison between the Freeling tagger and Treetagger. Freeling was selected for the comparison because of previous experience using it, so not much time would be needed to integrate it. Just like with the extraction of terms from ukWaC described above, the POS tags from Freeling were mapped to the mutual tag set and then stored together with the document.

Computed values

For filtering purposes used by the strategies, see Section 4.6.2, text coverages of different kinds were computed at index time. For each CEFR level A1 to C2 the text coverage of the different term categories described in Table 4.5 was computed. The category special is only computed once for each document, since it is independent of level.

5.3

Strategies

The strategies are implemented as different filtering schemes on the computed values.

For example the Ideal strategy, see Table 4.6, is given in the format of:

strategies = {
    'ideal': {
        'TARGET_gte': 0.04,
        'KNOWN_gte': 0.905,
        'COMPREHEND_gte': 0.95,
        'UNWANTED_lte': 0.01,
    },
}

At runtime the strategy is then translated for the current level being analyzed and converted to an SQL query. The SQL query is then run in Postgresql to filter out the texts that are suitable for the strategy.
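A sketch of how such a strategy dictionary could be turned into an SQL filter over precomputed coverage columns (the table and column names are assumptions, not the actual database schema):

def strategy_to_sql(strategy, level, table="documents"):
    # strategy: dict like {'TARGET_gte': 0.04, 'UNWANTED_lte': 0.01, ...}.
    # Columns such as b2_target are assumed to hold per-level coverages.
    ops = {"gte": ">=", "lte": "<="}
    clauses = []
    for key, threshold in sorted(strategy.items()):
        category, op = key.rsplit("_", 1)
        column = "%s_%s" % (level.lower(), category.lower())
        clauses.append("%s %s %s" % (column, ops[op], threshold))
    return "SELECT id FROM %s WHERE %s;" % (table, " AND ".join(clauses))

print(strategy_to_sql({"TARGET_gte": 0.04, "KNOWN_gte": 0.905, "UNWANTED_lte": 0.01}, "B2"))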


Chapter 6

Results and Discussion

This chapter presents the results from the evaluations as well as analysis and discussion of those results. The size of a corpus is often referred to as, for example, 10M terms, as in Figure 6.1. This should be read as 10 million terms.

6.1

Corpora Analysis

This section presents the results from the different corpora analyses. What is analyzed is what percentage of the terms in the EVP word list is covered, i.e. found at least 10 times, at each CEFR level in the different corpora.

6.1.1

Corpora Comparison

Figure 6.1 shows the percentage of words at each CEFR level covered in each of the three kinds of corpus.



Figure 6.1: Corpus Type Comparison of Term Coverage, size=10M terms, tagger=Freeling

On levels A1 to B2 there is no big difference between the different corpora. A tendency can be seen in B1 and B2 that ukWaC covers a bit more terms but the difference is not that big.

At levels C1 and C2 they can be ranked: ukWaC is best, wikipedia second, and simple wikipedia last.

I think a likely explanation for the fact that ukWaC performs better than wikipedia is that ukWaC has a much more diverse set of texts and therefore manages to cover more words. The fact that wikipedia has much longer texts than ukWaC could be a negative factor.

Comparing wikipedia and simple wikipedia, the reason for the greater coverage by wikipedia is the text lengths. Looking at the average text lengths of these two corpora, wikipedia has an average text length of 2308 terms, while the average text length in simple wikipedia is much lower at 140 terms. See further discussion on text length in Section 6.1.3.

Compared with ukWaC, simple wikipedia performs worse because of both factors given above: ukWaC has a more diverse set of texts than simple wikipedia, and the average text length of ukWaC is 675 terms, almost 5 times that of the average simple wikipedia text.

The conclusions are that on level A1 to B2 the corpus does not matter that much, but for the C1 and C2 level one should select a corpus with longer texts and a more diverse set of texts.

6.1.2 Corpora Size Comparison

Figure 6.2 shows how the term coverage increases as the size of the corpus ukWaC increases.


Figure 6.2: Corpus Size Comparison of Term Coverage, corpus=ukWaC, tagger=Treetagger

Here we have some rather interesting results. I had expected that the A levels, and maybe also the B levels, would be completely covered, since these should contain the most common words and ukWaC consists of many different kinds of texts. If we investigate which words are not covered we may find some explanations. All words not covered by the ukWaC corpus of 40 million terms tagged with Treetagger are listed in Appendix C.

On every CEFR level at least one multiword is missing, for example credit card/NOUN at A1 and text message/NOUN at A2. Multiwords are not tagged in the ukWaC corpus, which makes them impossible for the corpus to cover.

Another type of word not covered is some/PRON, which we find on level A1 but which is not covered by ukWaC. An example sentence given in EVP for some/PRON is "If you need more paper then just take some." Having Treetagger tag this sentence (using the online demo at http://web4u.setsunan.ac.jp/Website/TreeOnline.htm), some is tagged as DT, which is wrong according to the EVP example. Freeling recognizes that the word some in the example sentence could be a pronoun, but gives that choice a probability of only 0.001, while the choice of determiner gets 0.99089, and therefore Freeling tags it as some/DT (incorrectly according to EVP).

Another example of incorrect tagging by Treetagger is the word text/VERB, found at level A2. In the sentence "He likes to text her." the word text gets tagged as a noun. The Freeling tagger does not perform better and makes the same mistake. The ukWaC corpus is already tagged, but for Wikipedia and Simple Wikipedia, which are tagged with Freeling, a potential solution could be to update the dictionary used for the tagging, for example by adding the missing entry text/VERB. Manually going through the uncovered words to identify all of those that are missed only because of a mismatch between the EVP tag set and the tagger tag set is, however, out of scope of this thesis, so these errors will remain throughout.

A word not covered which is a bit more surprising is jeans/NOUN on level A1. Searching the ukWaC corpus for only the lemma jeans, it is found 22 times, tagged with the Treetagger tag NP, proper noun singular, which maps to OTHER, see Table B.1. Searching through the corpus there are texts which contain the word jeans in sentences where it should be tagged as NOUN. The mystery was solved when I realized that the word jeans is actually lemmatized as jean. This is correct in the sense that the word does exist as jean in singular form, but it is most commonly used in its plural form and is therefore listed as such in EVP. A way to solve this kind of problem is to make sure the EVP word list contains only true lemmas.
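One possible way to find such entries, sketched here with NLTK's WordNet lemmatizer (which is not a tool used in this thesis, so its output would have to be checked against EVP), is to lemmatize every word-list entry and flag those whose form changes:

    from nltk.stem import WordNetLemmatizer

    # Map the mutual tag set to WordNet POS codes; entries with other
    # tags are left unchanged in this sketch.
    WORDNET_POS = {'NOUN': 'n', 'VERB': 'v', 'ADJ': 'a', 'ADV': 'r'}

    def flag_non_lemmas(wordlist):
        """Return word-list entries whose lemmatized form differs from
        the listed form, e.g. ('jeans', 'NOUN') if it lemmatizes to
        'jean'."""
        wnl = WordNetLemmatizer()
        flagged = []
        for word, pos in wordlist:
            wn_pos = WORDNET_POS.get(pos)
            lemma = wnl.lemmatize(word, wn_pos) if wn_pos else word
            if lemma != word:
                flagged.append((word, pos, lemma))
        return flagged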

I think the reason the higher levels have better coverage is the nature of the terms found on these levels: they are more often unambiguous, with a clear POS tag.

The conclusion I draw here is that the mismatch between the word list tags and the tagging of the corpora is a challenge not easily overcome without some additional manual work.

6.1.3 Document Length Comparison

Figure 6.3 displays how the term coverage changes in corpora consisting of texts of a specific length.

Figure 6.3: Document Length Comparison of Term Coverage, corpus=ukWaC, size=2.4M terms, tagger=Treetagger

As can be seen in Figure 6.3, on the A1 to B2 levels the difference between the corpora of different document lengths is not that big and shows no clear pattern. On the C1 and C2 levels, though, interesting things start to happen: the shorter the average document length in the corpus, the lower the term coverage. The bigger the vocabulary size, the harder it is to find the terms. Intuitively, the more advanced and less used terms are found in texts with a somewhat more developed discourse, thus the longer the text the higher the coverage.


The conclusion is that to cover the higher levels like C1 and C2 one needs somewhat longer texts.

6.1.4 Tagger Comparison

ukWaC has been tagged with two different taggers, Freeling and Treetagger, and how that affects the term coverage can be seen in Figure 6.4.

Figure 6.4: Tagger Comparison of Term Coverage, corpus=ukWaC, size=10M terms

Even though the Freeling tagger in this study has the advantage over Treetagger of being able to tag multiwords, of which there are some in the EVP list, Treetagger still performs much better.

first/ADJ, seventeen/NUM, today/ADV, yes/ADV, cross/NOUN, two/NUM, p.m./ADV, why/ADV, hard/ADV, about/ADV, tonight/ADV, second/ADJ, nine/NUM, where/ADV, ok/EX, closed/ADJ, bored/ADJ, six/NUM, an/DET, below/ADV, eighteen/NUM, orange/ADJ, there/ADV, twelve/NUM, what/DET, five/NUM, four/NUM, exciting/ADJ, above/ADV, fifteen/NUM, ten/NUM, as/ADV, no/ADV, eight/NUM, outside/ADV, that/DET, how/ADV, past/PREP, boring/ADJ, dancing/NOUN, tired/ADJ, early/ADV, right/ADJ, interesting/ADJ, seven/NUM, three/NUM, clean/ADJ, dvd/NOUN, first/ADV, stop/NOUN, thirteen/NUM, well/EX, only/ADV, excited/ADJ, please/EX, best/ADV, sixteen/NUM, grey/ADJ, watch/NOUN, nineteen/NUM, many/DET, one/NUM, twenty/NUM, yesterday/ADV

Figure 6.5: Freeling & Treetagger Difference CEFR A1 level

Looking at Figure 6.5, which displays the words at EVP CEFR level A1 that are not covered when ukWaC is tagged by Freeling but are covered when it is tagged with Treetagger, we can see that one kind of word Freeling misses is numbers.


Other odd errors are words like excited/ADJ and clean/ADJ, which Freeling seems to miss because it POS tags them as verbs although they obviously are adjectives. For example, in the sentence "He is very clean." the word clean is POS-tagged as a verb by Freeling.

The conclusion from this is that Treetagger should be used rather than Freeling. But there are many problematic terms with Treetagger too, and I would say that neither tagger performs as well as I had expected.

6.2 Document Analysis

Let us now apply the different document candidate filtering strategies described in Section 4.6.2 to the different corpora described in Section 4.2, with the terms from the different CEFR levels (Appendix A) as given in the EVP (Section 3.4) acting as word lists, and see what kind of results we get. This is the Candidate Filtering part as seen in Figure 1.1.

6.2.1 Document Length Comparison

Table 6.1 contains the percentage of texts in a corpus that are suitable for a strategy at that specific level. The number within parentheses is the number of texts matching. See Table 4.2 for the number of texts in each corpus.

Table 6.2, Table 6.3 and Table 6.4 are breakdowns of how the terms are distributed over the term categories in the documents of each corpus that pass the constraints of each strategy. The percentages are averages over the texts matching the strategy. As we can see in Table 6.1 there is exactly one text matching the strategy Strict for level A2 with a length between 150 and 250 terms, which means that the corresponding numbers in Table 6.2 are for that specific document. For level A2 and strategy Soft, on the other hand, there are 10 matching texts, so the percentages for the different categories in Table 6.2 are averages over these 10 documents.


Table 6.1: Document Length Comparison of Number of Candidate Texts per Level and Strategy, corpus=ukWaC, size=2.4M terms

Level  Strategy     150-250        450-550        950-1050
A2     Ideal        0.0% (0)       0.0% (0)       0.0% (0)
A2     Perfect      0.0% (0)       0.0% (0)       0.0% (0)
A2     Strict Con.  0.0% (0)       0.0% (0)       0.0% (0)
A2     Strict       0.1% (1)       0.0% (0)       0.0% (0)
A2     Soft Con.    0.0% (0)       0.0% (0)       0.0% (0)
A2     Soft         0.1% (10)      0.1% (2)       0.0% (0)
B1     Ideal        0.0% (0)       0.0% (0)       0.0% (0)
B1     Perfect      0.0% (0)       0.0% (0)       0.0% (0)
B1     Strict Con.  0.0% (0)       0.0% (0)       0.0% (0)
B1     Strict       0.1% (4)       0.0% (0)       0.0% (0)
B1     Soft Con.    0.1% (2)       0.0% (0)       0.0% (0)
B1     Soft         0.5% (61)      0.1% (7)       0.1% (3)
B2     Ideal        0.0% (0)       0.0% (0)       0.0% (0)
B2     Perfect      0.0% (0)       0.0% (0)       0.0% (0)
B2     Strict Con.  0.0% (0)       0.0% (0)       0.0% (0)
B2     Strict       0.1% (5)       0.0% (0)       0.0% (0)
B2     Soft Con.    0.2% (26)      0.1% (4)       0.1% (2)
B2     Soft         6.6% (806)     3.0% (145)     2.1% (50)
C1     Ideal        0.0% (0)       0.0% (0)       0.0% (0)
C1     Perfect      0.0% (0)       0.0% (0)       0.0% (0)
C1     Strict Con.  0.0% (0)       0.0% (0)       0.0% (0)
C1     Strict       0.1% (7)       0.1% (1)       0.0% (0)
C1     Soft Con.    1.7% (212)     1.4% (69)      1.2% (30)
C1     Soft         27.2% (3320)   23.3% (1122)   16.2% (388)
C2     Ideal        0.0% (0)       0.0% (0)       0.0% (0)
C2     Perfect      0.0% (0)       0.0% (0)       0.0% (0)
C2     Strict Con.  0.0% (0)       0.0% (0)       0.0% (0)
C2     Strict       0.1% (13)      0.0% (0)       0.0% (0)
C2     Soft Con.    2.9% (357)     3.2% (156)     3.2% (76)
C2     Soft         33.9% (4137)   40.1% (1936)   32.2% (772)

Looking at Table 6.1 one can see that the corpus with short texts, between 150 and 250 terms, does better on levels A2 to C1. On the C2 level (a bit surprisingly for me) the middle-length texts of between 450 and 550 terms do best. Short and long texts perform about the same on the C2 level.


With the results from Figure 6.3 in mind, these results go somewhat against my first intuition. Because the corpus of longer texts has a greater total coverage of the terms at the C1 and C2 levels, I had thought that it would be easier to find suitable documents if the text length was increased.

Noteworthy is that with this small corpus of 2.4 million terms it is almost impossible to find suitable documents for levels A2 and B1 with any strategy.


Table 6.2: Text Characteristics of Candidate Texts per Level and Strategy, corpus=ukWaC, size=2.4M terms, length=150-250 terms

Strategy     Level  Target  Compre.  Known   Special  Unwant.
Ideal        A2     —       —        —       —        —
Ideal        B1     —       —        —       —        —
Ideal        B2     —       —        —       —        —
Ideal        C1     —       —        —       —        —
Ideal        C2     —       —        —       —        —
Perfect      A2     —       —        —       —        —
Perfect      B1     —       —        —       —        —
Perfect      B2     —       —        —       —        —
Perfect      C1     —       —        —       —        —
Perfect      C2     —       —        —       —        —
Strict Con.  A2     —       —        —       —        —
Strict Con.  B1     —       —        —       —        —
Strict Con.  B2     —       —        —       —        —
Strict Con.  C1     —       —        —       —        —
Strict Con.  C2     —       —        —       —        —
Strict       A2     1.92%   95.19%   26.92%  68.27%   2.88%
Strict       B1     1.48%   95.97%   36.58%  59.39%   2.55%
Strict       B2     1.20%   96.51%   51.45%  45.06%   2.29%
Strict       C1     0.51%   96.06%   60.81%  35.24%   3.43%
Strict       C2     0.66%   95.87%   64.94%  30.93%   3.47%
Soft Con.    A2     —       —        —       —        —
Soft Con.    B1     1.98%   86.77%   84.41%  2.36%    11.26%
Soft Con.    B2     2.24%   86.51%   84.59%  1.93%    11.24%
Soft Con.    C1     1.42%   86.78%   83.86%  2.92%    11.80%
Soft Con.    C2     1.10%   87.08%   84.23%  2.85%    11.82%
Soft         A2     3.07%   89.05%   33.22%  55.83%   7.88%
Soft         B1     3.81%   87.74%   51.82%  35.92%   8.45%
Soft         B2     2.91%   87.30%   66.67%  20.64%   9.79%
Soft         C1     1.49%   87.55%   72.22%  15.33%   10.96%
Soft         C2     1.09%   87.87%   74.04%  13.83%   11.04%


Table 6.3: Text Characteristics of Candidate Texts per Level and Strategy, corpus=ukWaC, size=2.4M terms, length=450-550 terms

Strategy     Level  Target  Compre.  Known   Special  Unwant.
Ideal        A2     —       —        —       —        —
Ideal        B1     —       —        —       —        —
Ideal        B2     —       —        —       —        —
Ideal        C1     —       —        —       —        —
Ideal        C2     —       —        —       —        —
Perfect      A2     —       —        —       —        —
Perfect      B1     —       —        —       —        —
Perfect      B2     —       —        —       —        —
Perfect      C1     —       —        —       —        —
Perfect      C2     —       —        —       —        —
Strict Con.  A2     —       —        —       —        —
Strict Con.  B1     —       —        —       —        —
Strict Con.  B2     —       —        —       —        —
Strict Con.  C1     —       —        —       —        —
Strict Con.  C2     —       —        —       —        —
Strict       A2     —       —        —       —        —
Strict       B1     —       —        —       —        —
Strict       B2     —       —        —       —        —
Strict       C1     0.44%   95.57%   34.15%  61.42%   3.99%
Strict       C2     —       —        —       —        —
Soft Con.    A2     —       —        —       —        —
Soft Con.    B1     —       —        —       —        —
Soft Con.    B2     2.03%   86.43%   83.78%  2.65%    11.54%
Soft Con.    C1     1.18%   86.33%   83.30%  3.02%    12.49%
Soft Con.    C2     0.92%   86.54%   83.49%  3.05%    12.54%
Soft         A2     4.03%   85.91%   29.59%  56.32%   10.06%
Soft         B1     2.40%   87.41%   40.67%  46.74%   10.18%
Soft         B2     2.96%   86.67%   64.21%  22.46%   10.37%
Soft         C1     1.32%   86.96%   71.82%  15.14%   11.71%
Soft         C2     0.90%   87.27%   73.98%  13.29%   11.83%


Table 6.4: Text Characteristics of Candidate Texts per Level and Strategy, corpus=ukWaC, size=2.4M terms, length=950-1050 terms

Strategy     Level  Target  Compre.  Known   Special  Unwant.
Ideal        A2     —       —        —       —        —
Ideal        B1     —       —        —       —        —
Ideal        B2     —       —        —       —        —
Ideal        C1     —       —        —       —        —
Ideal        C2     —       —        —       —        —
Perfect      A2     —       —        —       —        —
Perfect      B1     —       —        —       —        —
Perfect      B2     —       —        —       —        —
Perfect      C1     —       —        —       —        —
Perfect      C2     —       —        —       —        —
Strict Con.  A2     —       —        —       —        —
Strict Con.  B1     —       —        —       —        —
Strict Con.  B2     —       —        —       —        —
Strict Con.  C1     —       —        —       —        —
Strict Con.  C2     —       —        —       —        —
Strict       A2     —       —        —       —        —
Strict       B1     —       —        —       —        —
Strict       B2     —       —        —       —        —
Strict       C1     —       —        —       —        —
Strict       C2     —       —        —       —        —
Soft Con.    A2     —       —        —       —        —
Soft Con.    B1     —       —        —       —        —
Soft Con.    B2     1.96%   85.86%   82.69%  3.17%    12.18%
Soft Con.    C1     0.85%   86.30%   83.01%  3.30%    12.85%
Soft Con.    C2     0.85%   86.50%   83.32%  3.18%    12.65%
Soft         A2     —       —        —       —        —
Soft         B1     3.80%   86.17%   40.54%  45.63%   10.03%
Soft         B2     2.63%   86.64%   64.73%  21.91%   10.73%
Soft         C1     1.27%   86.77%   72.11%  14.66%   11.96%
Soft         C2     0.85%   87.01%   74.54%  12.48%   12.14%

Looking at the strategies Strict and Soft, we see that not constraining the special terms, other than through comprehend, gives a pretty high percentage of special terms. It is lower at the C1 and C2 levels but still above 10%. Comparing Table 6.2, Table 6.3 and Table 6.4, that is the corpora of different average text length, I see no clear difference in how the text length affects the term distribution.
