Pronunciation and Spelling: the Case of Misspellings in Swedish L2 Written Essays

Gintarė GRIGONYTĖ a,1 and Björn HAMMARBERG a

a Department of Linguistics, Stockholm University, Sweden
1 Corresponding Author: Gintarė Grigonytė, Department of Linguistics, Stockholm University, SE-10691 Stockholm, Sweden; E-mail: gintare@ling.su.se

Abstract. This research presents an investigation performed on the ASU corpus. We analyse to what extent the pronunciation of intended words is reflected in the spelling errors made by learners of L2 Swedish. We also propose a method that helps to automatically distinguish misspellings affected by pronunciation from other types of misspellings.

Keywords. L2, misspellings, ASU corpus, written text, Swedish

Introduction

Misspellings normally occur due to typing errors or lack of knowledge of the correct spelling. Presumably, when uncertain and trying to reproduce the correct spelling of a word, second language learners might also rely on the pronunciation of that word, among all the sources of information relevant to deciding how the word is spelled.

In general, a misspelled word results in a string of characters that represents either: a) an out-of-dictionary word, i.e. an incorrectly spelled word (like "erata" instead of "errata"); or b) a correct spelling of another word, a heterograph (like "too" instead of "two"). In this paper we focus on the former type of misspelling.

The misspellings of L2 learners have received a lot of attention in the L1 identification task. [1] suggest that the spelling of L1 transfers to L2 and thus that spelling errors in L2 can be helpful for automatically identifying the writer's native language. Based on the assumption that spelling errors are related to pronunciation in L1, the authors singled out eight types of spelling errors and used them as features for L1 identification. In the Natural Language Identification task 2013, [2] showed that cognate and misspelling features are among the most important ones when identifying the writer's L1. In their study, [2] relied on the orthography of misspelled words and mapped them to the intended word in L2 and to L1 cognates.

[3] demonstrated 66% accuracy in a language classification task by relying only on the 200 most frequent character bigrams, which was interpreted as a correlation between L1 and the choice of words in L2 texts.


Related to this study, a negative result has been reported by [4], showing that the majority of indicative bigrams are independent of L1.

Another research area related to this paper is L2 error tagging. The study in [9] describes automatic error tagging of spelling mistakes in learner corpora, and [8] presents how spelling errors can be used for assessing L2 writing.

1. Data and Method

This study proposes an L1-independent methodology for analysing L2 misspellings and detecting the effect of pronunciation on spelling. We use the Swedish spell-checker system described in [5] and a large Swedish dictionary with pronunciation information [6].

First, we detect out-of-dictionary words in L2 written texts and suggest the most likely correct spelling candidate. Second, we perform an automatic phonologically-based mapping between the original L2 word and the pronunciation of the intended spelling. Finally, we measure the distances between the two mappings: a) the original spelling and the pronunciation of the intended word, b) the spelling of the intended word and the pronunciation of the intended word; and rank misspellings according to the phonological spelling bias.
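To make the three steps concrete, the following minimal sketch shows how the two similarity values and the resulting ranking score could be put together. The helper functions suggest_correction (Section 1.1) and simplified_pronunciation (Section 1.2), as well as the use of difflib's matching ratio as the string-similarity measure, are illustrative assumptions rather than the exact implementation used in this study.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalised string similarity in [0, 1] (illustrative choice of measure)."""
    return SequenceMatcher(None, a, b).ratio()

def phonological_bias(original: str, suggest_correction, simplified_pronunciation):
    """Compute SimOP, SimIP and their difference for one out-of-dictionary token.

    suggest_correction:       original spelling -> intended spelling (Section 1.1)
    simplified_pronunciation: intended spelling -> word-like transcription (Section 1.2)
    """
    intended = suggest_correction(original)
    pron = simplified_pronunciation(intended)
    sim_op = similarity(original, pron)   # original spelling vs. pronunciation
    sim_ip = similarity(intended, pron)   # intended spelling vs. pronunciation
    return intended, sim_op, sim_ip, sim_op - sim_ip  # last value = ranking score
```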

We use the written essays in Swedish produced by adult learners in the ASU corpus [11]. The ASU corpus is a longitudinal corpus of transcribed audio-recorded conversations and written texts collected from adult learners of Swedish, supplemented by comparable material from native Swedes.

The learners' written part of the ASU corpus comprises 220 text units (10 persons × 11 sessions × 2 texts), ca 50,000 word tokens. The data ranges from the beginner stage up to a level where the learners are studying in Swedish at university.

1.1. Detection of Out-of-dictionary Words and Suggestion of Intended Spelling

Initially we use the SALDO dictionary [10] to perform a dictionary check-up for detecting possible spelling errors. The approach to Swedish spelling correction is orthographic and is based on the phonetic similarity key method combined with a method to measure proximity between strings [5]. We use the edit distance algorithm to measure the proximity of orthographically possible candidates and the Soundex algorithm to shortlist the spelling candidates which are phonologically closest to the misspelled word. Further, the spelling correction candidates are analysed in context by using the SLME n-gram model.
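The spell-checker internals are those of [5]; purely as an illustration of the two filtering criteria named above, the sketch below combines a standard Levenshtein edit distance with a simplified English-alphabet Soundex key (a stand-in for the Swedish-adapted phonetic similarity key) to shortlist candidate corrections.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def soundex(word: str) -> str:
    """Simplified 4-character Soundex key (English letter classes as a stand-in)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    key, last = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            key += code
        last = code
    return (key + "000")[:4]

def shortlist(misspelled: str, candidates, max_dist: int = 2):
    """Keep candidate corrections that share a Soundex key and are within max_dist edits."""
    target = soundex(misspelled)
    return [w for w in candidates
            if soundex(w) == target and edit_distance(misspelled, w) <= max_dist]
```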

The SLME employs the Swedish part of the Google Web 1T 5-gram, 10 European Languages, Version 1 dataset, which is the largest publicly available Swedish data resource. The SLME is a simple n-gram language model based on the Stupid Backoff model [7]. The n-gram language model calculates the probability of a word in a given context, Eq. (1). The maximum-likelihood probability estimates for the n-grams are calculated from their relative frequencies, Eq. (2).

P(w_1^L) = \prod_{i=1}^{L} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{L} \hat{P}(w_i \mid w_{i-n+1}^{i-1})   (1)


r(w_i \mid w_{i-n+1}^{i-1}) = \frac{f(w_{i-n+1}^{i})}{f(w_{i-n+1}^{i-1})}   (2)

The SLME n-gram model calculates the probability of a word in a given context:

p(word|context) (Table 1). The highest probability determines the spelling correction.
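As an illustration of how such in-context ranking can work, the sketch below scores candidates with a Stupid Backoff style relative-frequency score in the spirit of Eqs. (1)-(2). The back-off weight 0.4 follows the value suggested by Brants et al. (2007); the count-table representation and the toy frequencies are assumptions, not the actual SLME interface or Web 1T data.

```python
ALPHA = 0.4  # back-off weight suggested by Brants et al. (2007); assumed here

def stupid_backoff(counts: dict, words: tuple, i: int, n: int = 5) -> float:
    """Relative-frequency score of words[i] given up to n-1 preceding words (Eqs. 1-2).

    counts maps word tuples (n-grams of any order, including unigrams) to frequencies.
    """
    for order in range(n, 0, -1):
        context = words[max(0, i - order + 1):i]
        numerator = counts.get(context + (words[i],), 0)
        denominator = (counts.get(context, 0) if context
                       else sum(c for g, c in counts.items() if len(g) == 1))
        if numerator > 0 and denominator > 0:
            return (ALPHA ** (n - order)) * numerator / denominator
    return 0.0

def best_candidate(counts: dict, left_context: tuple, candidates):
    """Pick the spelling candidate with the highest score in its left context."""
    def score(cand):
        words = left_context + (cand,)
        return stupid_backoff(counts, words, len(left_context))
    return max(candidates, key=score)

# Example in the spirit of Table 1 (toy counts, not real Web 1T frequencies):
counts = {("hela", "meningen"): 12, ("hela", "mynningen"): 1,
          ("hela",): 40, ("meningen",): 30, ("mynningen",): 5}
print(best_candidate(counts, ("försöka", "att", "förstå", "hela"),
                     ["meningen", "mynningen"]))   # -> "meningen"
```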

Table 1. An example case of the spelling correction for the word menningen.

Original word                             Intended words            Probability
försöka att förstå hela menningen         meningen (En. sentence)   5.06e-05
(En. try to understand all senntence)     mynningen (En. mouth)     4.71e-09

1.2. Phonological Comparison of Original Spelling and Intended Spelling

In order to detect whether phonology has an effect on spelling mistakes, we use the pronunciation information from the Swedish dictionary [6]. The pronunciation data, encoded in the SAMPA format, is simplified by discarding markers of syllable boundaries and special punctuation characters (e.g. vet [ve:t] - [vet]), disregarding upper/lowercase distinctions (e.g. rätt [rEt] - [ret]), and replacing numeric codes of allophonic variants with the closest single or double alphabetic characters (e.g. för [f9:r] - [fior]). This robust conversion is not ideal and loses subtle phonological information in the process; however, it provides simple, word-like orthographic forms of the pronunciations of the intended spellings. The string similarity is then measured between this simplified transcription of the intended spelling and, respectively, the misspelled word and the intended word (Table 2).
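A minimal sketch of this simplification step is given below. Only the operations that can be read off the examples above (dropping syllable boundaries and length marks, lowercasing, and the numeric code 9 -> io) are included; any further replacement rules would be assumptions and are left out.

```python
import re

# Only the "9" -> "io" replacement can be read off the för [f9:r] -> [fior] example;
# additional numeric-code mappings would be assumptions.
NUMERIC_CODES = {"9": "io"}

def simplify_sampa(transcription: str) -> str:
    """Reduce a SAMPA transcription to a rough, word-like orthographic form."""
    s = transcription.strip("[]")
    s = re.sub(r"[.\s]", "", s)          # discard syllable boundaries
    s = s.replace(":", "")               # discard length marks / special punctuation
    s = s.lower()                        # disregard upper/lowercase distinctions
    for code, letters in NUMERIC_CODES.items():
        s = s.replace(code, letters)     # numeric allophone codes -> nearby letters
    return s

# Examples from the text:
# simplify_sampa("[ve:t]") -> "vet";  simplify_sampa("[rEt]") -> "ret";  simplify_sampa("[f9:r]") -> "fior"
```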

Table 2. Phonological string similarity for the misspelled and the intended words.

Original word     Intended word     Pronunciation     SimOP     SimIP
snab              snabb             [snab]            1         0.889
ihog              ihåg              [ihog]            1         0.750
respektuos        respektlös        [respektlios]     0.857     0.857
interesserad      intresserad       [intreserad]      0.909     0.952
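The text does not name the string-similarity measure behind SimOP and SimIP, but the values in Table 2 are consistent with a normalised matching-character ratio such as difflib's SequenceMatcher.ratio(). The snippet below reproduces the first two rows under that assumption; it is a sketch, not the actual implementation.

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Twice the number of matching characters divided by the total length of both strings."""
    return round(SequenceMatcher(None, a, b).ratio(), 3)

# First two rows of Table 2:
print(sim("snab", "snab"), sim("snabb", "snab"))   # SimOP = 1.0, SimIP = 0.889
print(sim("ihog", "ihog"), sim("ihåg", "ihog"))    # SimOP = 1.0, SimIP = 0.75
```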

2. Analysis of the Results

We ran the automatic procedure described in Section 1.1 on the ASU corpus and detected 1677 out-of-dictionary words. The list was pruned by removing personal and geographical names, as well as code-switching insertions (e.g. beef, buy, body, class, clean, etc.). After this step we were left with 1141 misspelled words with spelling corrections (the density of spelling errors in the ASU corpus is 0.023%).

For the phonological comparison described in Section 1.2 we used 725 unique pairs of misspellings and their spelling corrections (source: 1411 misspelled words). We measured similarities between a) the original spelling and the pronunciation of the intended spelling (SimOP) and b) the intended spelling and the pronunciation of the intended spelling (SimIP).

All 725 spelling corrections were manually checked by a Swedish native speaker.

Misspellings were annotated as phonologically (pronunciation) influenced (e.g. snell - snäll), orthographically influenced (e.g. tva - två), influenced by cognate words from languages other than Swedish (e.g. discotek - diskotek), or random spelling errors with no obvious motivation (e.g. upplyserning - upplysning).


The annotations also included learners' own (invented) lexical and grammatical solutions which result in target-deviant word forms and hence are not found in the dictionary, provided that they are correctly spelled according to Swedish orthography (e.g. motsidig - ömsesidig, sittade - satt).

The annotations were used as the gold standard to check whether it is possible to automatically differentiate between misspellings influenced by the pronunciation of the intended Swedish word and other types of misspellings. We used the arithmetic difference between SimOP and SimIP as a binary classifier: if the difference was positive, the misspelling was categorized as phonological. The evaluation against the 158 manually annotated pronunciation-related errors showed 0.75 precision and 0.57 recall.
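A sketch of this difference-based classification and of the precision/recall computation is shown below; the triple-based input format is illustrative, assuming the gold labels come from the manual annotation described above.

```python
def is_phonological(sim_op: float, sim_ip: float) -> bool:
    """Classify a misspelling as pronunciation-influenced when SimOP exceeds SimIP."""
    return (sim_op - sim_ip) > 0

def precision_recall(items):
    """items: iterable of (sim_op, sim_ip, gold_is_phonological) triples."""
    tp = fp = fn = 0
    for sim_op, sim_ip, gold in items:
        predicted = is_phonological(sim_op, sim_ip)
        tp += predicted and gold
        fp += predicted and not gold
        fn += (not predicted) and gold
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```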

3. Conclusions and Future Work

The ASU corpus investigation has shown that around 21% of the spelling errors are influenced by the pronunciation of the intended Swedish words. We have attempted to model this phenomenon and proposed an L1-independent method that combines Swedish pronunciation data and a statistical language model. The evaluation of the binary classification between phonologically affected and other spelling errors has shown an encouraging result; however, the large proportion of false negative classifications suggests that the intended word and its pronunciation alone are not entirely sufficient for making such a distinction. Future work could include distinguishing and separately modelling a) misspellings due to discrepancies between spelling and pronunciation in the target language (i.e. standard written and spoken Swedish), and b) misspellings due to differences between the target phonology and the learner's interpretation of the pronunciation.

References

[1] M. Koppel, J. Schler, and K. Zigdon, Determining an author's native language by mining a text for errors. In Proceedings of the 11th ACM SIGKDD International Conference, (2005), 624–628.

[2] G. Nicolai, B. Hauer, M. Salameh, L. Yao, G. Kondrak, Cognate and Misspelling Features for Natural Language Identification. In Proceedings of NAACL-BEA8, (2013), 140–145.

[3] O. Tsur and A. Rappoport, Using classifier features for studying the effect of native language on the choice of written second language words. In Proceedings of ACL-CACLA, (2007), 9–16.

[4] G. Nicolai and G. Kondrak, Does the Phonology of L1 Show Up in L2 Texts? In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, (2014), 854–859.

[5] G. Grigonyte, M. Kvist, S. Velupillai and M. Wirén, Improving Readability of Swedish Electronic Health Records through Lexical Simplification: First Results. In EACL-PITR, (2014), 74–83.

[6] Swedish lexical database by Nordisk språkteknologi holding: http://www.nb.no/Tilbud/Forske/Spraakbanken/Tilgjengelege-ressursar/Leksikalske-ressursar

[7] R. Östling, SLME tool inspired by the Stupid Backoff Model (Brants et al., 2007), (2012), http://www.ling.su.se/english/nlp/tools/slme

[8] Y. Bestgen and S. Granger, Categorising spelling errors to assess L2 writing. International Journal of Continuing Engineering Education and Life-Long Learning, 21(2/3), (2011), 235–252.

[9] P. Rayson and A. Baron, Automatic error tagging of spelling mistakes in learner corpora. In Meunier, F., De Cock, S., Gilquin, G., Paquot, M. (Eds.), A Taste for Corpora. John Benjamins, (2011), 109–126.

[10] L. Borin, M. Forsberg and L. Lönngren, The hunting of the BLARK – SALDO, a freely available lexical database for Swedish language technology. Studia Linguistica Upsaliensia, (2008), 21–32.

[11] B. Hammarberg, Introduction to the ASU Corpus, (2010), available from: http://spraakbanken.gu.se/swe/itg
