
A Rule-Based Normalization System for Greek Noisy User-Generated Text

Marsida Toska

Uppsala University

Department of Linguistics and Philology
Master Programme in Language Technology

Master’s Thesis in Language Technology, 30 ECTS credits
November 9, 2020


Abstract


Contents

Acknowledgements
1 Introduction
  1.1 Purpose
  1.2 Outline
2 Background
  2.1 Text Normalization
    2.1.1 Normalization of Historical Texts
    2.1.2 Normalization of Texts for Text-to-Speech Systems
  2.2 Characteristics of Noisy User-Generated Texts
  2.3 Methods for the Normalization of Noisy User-Generated Texts
    2.3.1 The Rule-Based Approach
    2.3.2 Levenshtein Distance
    2.3.3 Comparison of Phonetic Similarity
    2.3.4 Statistical Machine Translation and Neural Methods
  2.4 Greek Language and Greek Tweets
    2.4.1 Overview of Greek Spelling
    2.4.2 Linguistic Phenomena in Greek Tweets
  2.5 Part-of-Speech Tagging
3 Data & Resources
  3.1 The Dataset
  3.2 Resources
    3.2.1 Hunspell
    3.2.2 PANACEA n-gram Corpus
    3.2.3 UD Pipe
4 Preprocessing
  4.1 Cleaning the Dataset
  4.2 Test Set
  4.3 Systematic Analysis of Greek Tweets
    4.3.1 Sentence Structure
    4.3.2 Lower/Upper Case
    4.3.3 Neologisms
    4.3.4 Greeklish and Engreek
    4.3.5 Stress, Contractions, Elongations, Space
    4.3.6 Non-Standard Abbreviations
    4.3.7 Misspellings
  4.4 Scope Definition
5 System Architecture
  5.1 Module 1: Regular Expressions
    5.1.1 Non-Standard Abbreviations
    5.1.2 Elongations
    5.1.4 Misjoined Words
    5.1.5 Truecasing
  5.2 Module 2: Rule about Stress Restoration
    5.2.1 Rule Overview
    5.2.2 Rule Analysis
    5.2.3 Handling of Special Cases
  5.3 Module 3: Edit Distance
    5.3.1 Extraction of IV Subset
    5.3.2 Extraction of Candidates
    5.3.3 Final Selection of the IV Counterpart
6 Evaluation and Results
  6.1 Performance of the Rule-Based System
  6.2 Error Analysis
  6.3 Effect of Normalization on Tagging
7 Discussion


Acknowledgements


1 Introduction

User-generated texts, such as social media texts (e.g. tweets), constitute a vast source of information for opinion and event extraction. However, most of this information is composed in a language that is notorious for its high variability in sentence structure, the extensive usage of non-standard word forms and the presence of ungrammatical linguistic units (e.g. misspelled words), resulting in noise (Sikdar and Gambäck, 2016). Since most Natural Language Processing (NLP) tools are trained on formal texts, such as news text, it has been observed that the performance of such tools declines when run over informal text (Gimpel et al., 2011; O’Connor et al., 2010). Good tagging and parsing results are essential for applications such as opinion mining, information retrieval etc. (Kong et al., 2014). Therefore, there is a need to preprocess noisy user-generated texts (NUGT) through normalization, so that the performance of preprocessing NLP tasks, such as Part-of-Speech (POS) tagging and parsing, and of subsequent ones, is not compromised significantly, if at all (Clark and Araki, 2011).

1.1 Purpose

Normalization could concisely be defined as the task of converting word tokens into their standard form (Han et al., 2013). Depending on the text genre (formal vs. informal) and its purpose (preprocessing text for Text-to-Speech (TTS) systems, information retrieval or other NLP tasks such as tagging and parsing), it poses different challenges and consequently requires a different approach as well. The purpose of the present work is to explore the extent to which the rule-based normalization of Greek social media texts, specifically Greek tweets (tweets written in the Modern Greek language), can lead to more accurate tagging results. We opted for the rule-based approach because it has proven to deliver optimal results in other languages, at least when straightforward mappings suffice (Ruiz et al., 2014; Sidarenka et al., 2013). Additionally, including the Levenshtein-distance algorithm allows us to deal with any type of spelling deviation, a phenomenon which we expect to be abundant in user-generated texts such as tweets. Therefore, the research questions which we will attempt to answer in this project could be formulated as follows:

1) What are the core categories of Greek tweets that could potentially be in need of normalization?

2) How well does a rule-based system combined with Levenshtein distance perform in normalizing Greek tweets?

3) What is the effect, if any, of normalization on the tagging results of Greek tweets?

1.2 Outline

The chapters are organized as follows:


Chapter 3 introduces the dataset along with the resources that were used for the purposes of this project. Part of the data was used for the analysis of the phenomena occurring in Greek tweets and for the creation of an annotated test set, which was later used for the evaluation of the system.

Chapter 4 contains information about the preprocessing steps, such as cleaning the data, performing an analysis and categorization of non-standard word forms and annotating the test set.

Chapter 5 describes the actual implementation of the system by giving an overview of the architecture and illustrating the approaches through examples.

Chapter 6 gives an overview of the evaluation results, both for the system itself and with regard to POS tagging, and summarizes the answers to the research questions as they emerged from the work.


2 Background

2.1 Text Normalization

Text normalization (or canonicalization or standardization) is a prevalent task in the NLP pipeline. It includes normalization subtasks such as tokenization, lemmatization, stemming or sentence segmentation, but it can also be encountered in a more complex form in situations where, for instance, out-of-vocabulary (OOV) words must be addressed by converting them into a standard (lexicon-approved) form. Lexical normalization, where the focus of this work lies, consists in preprocessing a text at the word level and transforming it into a form that can be easily analyzed and processed by other tasks in the NLP pipeline (e.g. taggers) or downstream applications (e.g. machine translation systems) for these to produce consistent results (Han et al., 2013; Jurafsky and Martin, 2009). As one may already suspect, there is no single correct way to normalize a text. On the contrary, text normalization is a domain-dependent and purpose-specific task.

In the subsections below, we list some of the areas of application of text normalization, such as that of historical texts and Text-to-Speech, illustrating how the difference in domain and purpose poses different challenges for normalization and even gives it a different meaning.

2.1.1 Normalization of Historical Texts

Historical documents are a significant source of information regarding the past and as such a significant resource for researchers in digital humanities. With time, many of them have been digitized, making historical texts easily accessible and suitable for instant searching (Pettersson, 2016). Nevertheless, they still cannot be fully studied, analyzed or exploited computationally and remain challenging within NLP due to the high degree of variability they exhibit in terms of spelling, syntax, morphology, vocabulary and semantics (Bollmann et al., 2011; Pettersson, 2016).

In an attempt to provide a solution to this problem, and taking into consideration that most NLP tools are trained on contemporary texts, researchers have proposed the normalization of historical texts to modern linguistic standards, so that they can be processed by the same tools and rendered suitable for e.g. information extraction (Piotrowski, 2012). In her study about historical spelling normalization, Pettersson (2016) supports this approach as well, as opposed to adapting NLP tools to the historical domain instead. An example of a sentence found in a historical text, before and after normalization, can be seen in Example 1 below from the Innsbruck Letter Corpus¹:

(1) þe  quene was ryght gretly  displisyd  with us both
    the queen was right greatly displeased with us both

In this sentence, the normalization consists mainly, if not entirely, in substituting certain non-standard word forms with their modern equivalents, for instance "ryght" with "right", "displisyd" with "displeased" etc., without performing changes in the ordering of the words or other kinds of modifications. This is anticipated, considering that spelling stands out as the most systematic and striking difference between old and modern texts (Pettersson, 2016; Bollmann, 2018; Piotrowski, 2012).

However, normalizing historical texts regarding spelling is more challenging than one would expect due to its variability between different genres, periods or different authors within the same language. Inconsistent spelling has also been observed within the same document suggesting that in the past not all words were standardized in a single form (Piotrowski, 2012). Consequently, creating a correspondence between different lexical forms that appear in historical texts to present-day spelling is not as straightforward and adds to the normalization challenge.

In her study on the normalization of historical texts, Pettersson (2016) suggests the following methods for spelling normalization:

a) Rule-based Normalization
b) Levenshtein-based Normalization
c) Memory-based Normalization
d) SMT-based Normalization

Experiments suggest that the statistical machine translation (SMT) model, which sees normalization as a translation task, achieves better results compared to the other approaches. The proposed pipeline is applicable to several European languages and serves as a good preprocessing step for historical texts, before feeding them to taggers or parsers or using them to extract information (information retrieval). Other researchers (Bollmann and Søgaard, 2016; Korchagina, 2017a; Pettersson, 2016; Tang et al., 2018) have shown that neural machine translation (NMT) can surpass SMT in the normalization of historical texts, provided that large data sizes are available.

2.1.2 Normalization of Texts for Text-to-Speech Systems

Contrary to the normalization of historical texts, where the focus mainly lies in mapping or converting word forms to present-day standard word forms, the normalization of texts destined for Text-to-Speech (TTS) systems has a quite different configuration. In this case, the aim of the normalization process is to prepare the text in such a way that it can be passed on to speech synthesizers and be read aloud (Zhang et al., 2019). The normalization component is usually among the first steps in the TTS pipeline and indispensable, since its absence can compromise the meaning and the comprehension of the spoken message and easily lead to a poorly performing TTS system.

Category             Before Normalization   After Normalization
acronyms             WHO                    World Health Organization
abbreviations        Mr                     Mister
numbers (ordinals)   May 7                  May seventh
dates                in 1994                in nineteen ninety-four
money                £12,000                twelve thousand pounds

Table 2.1: Examples of TTS phenomena where normalization is needed.


as digits in a certain context, i.e. 5 as "five", 2 as "two" and 3 as "three" when in "the pin is 523", but this does not suffice in other cases where numerical values have to be read out as cardinals, as in the expression "523 people", where the appropriate expansion would be "five hundred twenty-three people" instead of the digit-by-digit spell-out "five two three people".

The degree of normalization complexity is language-dependent, considering that in highly inflected languages such as Greek, certain cardinals are subject to case, number and gender inflectional endings, as they act as adjectival modifiers when preceding a noun. Consequently, while the cardinal 523 on its own would be pronounced as "πεντακόσια είκοσι τρία" in Greek, when context is available some of its endings would have to be adjusted accordingly, as illustrated in Example 2.

(2) πεντακόσιοι        είκοσι τρείς      άνθρωποι
    five-hundred.M.PL  twenty three.M.PL people.M.PL
    ’Five hundred twenty-three people’

Initially, most text normalization components were implemented through a rule-based approach, where hand-written rules would specify the right expansion based on linguistic knowledge, while finite-state transducers (Ebden and Sproat, 2014) have also been deployed as a method. Recently, there have been several attempts to approach text normalization for TTS systems from a data-driven perspective by using machine learning methods, e.g. deep neural networks. The main challenge faced by the latter is the lack of directly available parallel corpora, as there are no texts where instances of the semiotic class and abbreviations occur naturally in a verbalized form (Zhang et al., 2019).
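As an illustration of such hand-written expansion rules, the sketch below (our own, not taken from any cited system) expands a number either digit by digit, as appropriate for a PIN, or as an English cardinal, as appropriate for counts such as "523 people":

```python
# Illustrative sketch of rule-based number expansion for TTS.
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]

def spell_digits(n: int) -> str:
    """Read a number out digit by digit, as in 'the pin is 523'."""
    return " ".join(DIGITS[int(d)] for d in str(n))

def cardinal(n: int) -> str:
    """Read a number (0-999) as a cardinal, as in '523 people'."""
    if n < 10:
        return DIGITS[n]
    if n < 20:
        return TEENS[n - 10]
    if n < 100:
        word = TENS[n // 10]
        return word + ("-" + DIGITS[n % 10] if n % 10 else "")
    word = DIGITS[n // 100] + " hundred"
    return word + (" " + cardinal(n % 100) if n % 100 else "")

print(spell_digits(523))  # five two three
print(cardinal(523))      # five hundred twenty-three
```

A full TTS front end would additionally pick between the two readings from context (e.g. "pin is" vs. a following noun) and, for a language like Greek, inflect the cardinal for case, number and gender.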

2.2 Characteristics of Noisy User-Generated Texts

User-generated texts posted on social media platforms provide users with the possibility to share information, to informally express an opinion on current affairs, exchange ideas, comment, review or recommend products and services (Steinskog et al., 2017). Twitter posts, otherwise known as tweets, are known for their brevity, as each post is subject to a character limit, currently amounting to 240 characters per tweet (Alegria et al., 2015). In view of this restriction, and given the informal context of text generation in social media (since posts are usually generated by and intended for individuals), they are characterized by certain idiosyncratic features, such as free or elliptical sentence structure, use of non-standard words and abbreviations, ungrammaticalities (intended or not), unreliable capitalization and code-switching, which often account for the noise found in such texts (Clark and Araki, 2011; Han et al., 2013).

(3) sry,   dont  know but will DM             u   2mr      about it
    sorry, don’t know but will direct message you tomorrow about it
    ’I am sorry, I don’t know, but I will let you know through a private message tomorrow about it’

In the above example there appear non-standard words, such as "sry", "u" and "2mr", that are not included in any English lexicon and thus qualify as OOV words. Apart from this lexical-level idiosyncrasy, punctuation marks such as periods and apostrophes are also among the features that are used irregularly in social media texts and can likewise impact text processing negatively.


URLs, hashtags and mentions (Wikström, 2017). These can appear either as a syntactic part of a sentence or independently of the sentence structure.

(4) a) its getting outta hand. Students MUST be given masks for #exams.
    b) staying home and chilling. #luvmylife.

In the example of tweet a, the hashtag "#exams" is an integral part of the sentence: removing it would make the sentence ungrammatical. On the contrary, the hashtag "#luvmylife", which summarizes the mood of the statement and follows the period, could easily be omitted from a grammatical point of view. The deviation of such texts from newswire or similar texts in terms of governing rules accounts for the fact that social media posts are often referred to as noisy user-generated texts.

2.3 Methods for the Normalization of Noisy User-Generated Texts

The normalization of user-generated texts resembles to some extent that of historical texts, since the main focus in both cases lies in the conversion of certain words into canonical forms, so that these do not count towards OOV words and lead in turn to a drop in the performance of NLP tools run over such texts. However, historical texts fall into the category of formal texts written in non-modern language, while social media texts are modern texts which incorporate numerous features associated with informal writing. Hence, tweets need to be normalized with regard to informal language use in order to approach, to some extent, the formal writing of the texts which NLP tools are mainly trained on (Chrupała, 2014; Han et al., 2013; Wang et al., 2017).

A significant part of normalization consists in restoring the canonical form of words, which in practice resembles techniques used in traditional spelling correction. Initially, the correction of spelling was viewed as a noisy channel model, where it was assumed that the user knew what they wanted to type but accidentally introduced an error when typing (Church and Gale, 1991), an approach dating back to Damerau (1964). Consequently, the aim of the spelling task was to recover the intended character(s). The model was formalized as follows by Church and Gale (1991):

Pr(t | c)

The above formula expresses the probability (Pr) of the typo t occurring given that c was the intended character. Along the same lines, Brill and Moore (2000) proposed an improved error model where positional information, that is, information about the position of the error within a word, was also deployed. Later, Li et al. (2006) utilized distributional similarity to perform spelling corrections. Focusing on the normalization of text messages, Choudhury et al. (2007) used a hidden Markov model to model the transformations and emissions of characters.

The use of finite-state transducers (FST) has also been a popular method (Hulden and Francom, 2013; Porta and Sancho-Gómez, 2013), while the rule-based approach, comparison of phonetic similarity and machine translation methods have been implemented as well and will be presented in more detail in the following subsections.

2.3.1 The Rule-Based Approach


historical texts to their present-day counterparts. In the domain of user-generated texts, Sidarenka et al. (2013) proposed a normalization method for German that consists in the analysis of unknown tokens and the development of rules targeting specific cases. For Twitter-specific phenomena (e.g. mentions, urls etc.), they introduced the use of artificial tokens, replacing the unknown word with "%" and its category, for instance replacing "https://www.example.com" with "%Link", while under certain conditions these were removed in initial and final sentence positions. For the major class of unknown tokens, those with spelling deviations, they created transformation rules, while also incorporating statistical information for those generating errors. Specifically, they set as a prerequisite that the sum of the log probabilities of the current token (i.e. w_i) and its immediate context (w_{i-1} and w_{i+1}) be lower than that of the proposed in-vocabulary word form (i.e. w*_i) for a replacement to take place. This condition is formally illustrated by the authors (Sidarenka et al., 2013) through the following inequality:

log(P(w_{i-1}, w_i)) + log(P(w_i)) + log(P(w_i, w_{i+1}))
    < log(P(w_{i-1}, w*_i)) + log(P(w*_i)) + log(P(w*_i, w_{i+1}))    (2.1)

Ruiz et al. (2014) rely on rule-based methods to normalize Spanish microtext, focusing on Twitter data. By making use of regular expressions, they create three different sets of rules for rule-based preprocessing that respectively handle a) abbreviations, b) tokens lacking whitespace as a delimiter and c) emoticons and OOV words with character elongations. For unresolved cases, the authors deploy minimum edit distance (Damerau, 1964), where frequently occurring "errors" are assigned a lower weight than rare ones.
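The replacement condition of inequality (2.1) can be sketched as follows; the function name and the toy unigram/bigram probabilities are invented for illustration and are not taken from Sidarenka et al. (2013):

```python
import math

def log_p(prob):
    # Floor zero probabilities so unseen events do not yield -inf.
    return math.log(prob) if prob > 0 else math.log(1e-12)

def should_replace(p_uni, p_bi, prev, w, w_star, nxt):
    """Replace w by w_star only if the summed log probabilities of the
    candidate in its bigram context exceed those of the current token,
    as in inequality (2.1)."""
    lhs = (log_p(p_bi.get((prev, w), 0)) + log_p(p_uni.get(w, 0))
           + log_p(p_bi.get((w, nxt), 0)))
    rhs = (log_p(p_bi.get((prev, w_star), 0)) + log_p(p_uni.get(w_star, 0))
           + log_p(p_bi.get((w_star, nxt), 0)))
    return lhs < rhs

# Invented unigram/bigram probabilities, for illustration only.
p_uni = {"helo": 1e-7, "hello": 1e-4, "there": 1e-3}
p_bi = {("say", "hello"): 1e-5, ("hello", "there"): 1e-4}
print(should_replace(p_uni, p_bi, "say", "helo", "hello", "there"))  # True
```

The floor on zero probabilities stands in for the smoothing that a real implementation over n-gram counts would apply.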

Beckley (2015) introduced a normalization approach in which, among other components (a substitution list and a sentence-level ranker), a rule-based module was also included. More specifically, they created the "ing" and the "coool" rules to handle frequent categories of OOV tokens that require normalization in terms of non-standard endings and elongated characters respectively. Rule-based implementations were deployed by further researchers (Pathak and Joshi, 2019; Samsudin et al., 2013) for normalization purposes.
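An elongation rule in the spirit of Ruiz et al. (2014) and Beckley's "coool" rule could be sketched as follows (a simplified illustration of ours; the rules in those systems differ in detail):

```python
import re

# Runs of three or more identical characters are squashed, first to
# two and then to one, producing candidates to check against a lexicon.
ELONGATION = re.compile(r"(.)\1{2,}")

def elongation_candidates(token):
    return [ELONGATION.sub(r"\1\1", token),  # cap repeats at two
            ELONGATION.sub(r"\1", token)]    # cap repeats at one

def normalize_elongation(token, lexicon):
    for cand in elongation_candidates(token):
        if cand in lexicon:
            return cand
    return token  # leave unresolved tokens untouched

lexicon = {"cool", "so"}  # toy lexicon standing in for a dictionary
print(normalize_elongation("coool", lexicon))  # cool
print(normalize_elongation("soooo", lexicon))  # so
```

Trying the two-character cap before the one-character cap prefers forms like "cool" over over-aggressive reductions.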

2.3.2 Levenshtein Distance

The Levenshtein distance (Levenshtein, 1966), also simply referred to as edit distance (Navarro, 1999), is an algorithm that measures the similarity between two strings a and b based on the number of character edits (i.e. insertions, deletions, substitutions) required to transform a into b. Although in its original formulation by V. Levenshtein (1966) all edit operations were assigned the weight 1, today it is encountered in multiple variations and adjustments and is widely used in spell-checking programs for the correction of typographical errors. A formal representation of it can be seen in Figure 2.1.


Figure 2.1: The Levenshtein distance algorithm summary by Pettersson (2016).
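A minimal dynamic-programming implementation of the algorithm, with all edit operations weighted 1 as in the original formulation, might look as follows:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to transform string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Weighted variants, such as the lower costs for frequent errors used by Ruiz et al. (2014), replace the constant 1 with operation-specific weights.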

2.3.3 Comparison of Phonetic Similarity

Given the increasing and irregular variance in spelling, especially in user-generated texts, phonetic algorithms are deemed advantageous due to the possibility of directly mapping multiple spelling variations of the same word to their in-vocabulary (IV) counterpart through their similar or common pronunciation as shown in Figure 2.2.

Figure 2.2: OOV spelling variances of the word “together” from the Edinburgh Twitter Corpus by Liu et al. (2011).

For instance, taking "2getha", one of the most frequent non-standard spelling variants of the word "together" as shown in Figure 2.2, its edit distance from "together" would be 4 (2 substitutions and 2 insertions) when using edit-distance algorithms. On the contrary, deriving the pronunciation [tuːˈɡeðə(r)] for "2getha", its similarity to [təˈɡeðə(r)], the IV word form, is more apparent and can easily lead to the normalization "together".

In social media text normalization, Han et al. (2013) make use of a phonetic algorithm (Philips, 2000) to obtain the phonetic encoding of an OOV word and search for its IV candidate based on it. Ahmed (2014), on the other hand, proposes combining Soundex, refined Soundex or other phonetic algorithms with edit-distance ones to maximize results in the lexical normalization of tweets.

Most phonetic algorithms that exist today are based on Soundex, which was designed for the normalization of names by producing a phonetic representation consisting of the initial letter of an OOV word and three digits. Assuming the aim would be to normalize Erique to Eric, the following Soundex steps would be followed (Knuth, 1998):

a) Identify and retain the initial letter of the token, here “E”

b) Drop all vowels and the semivowels “h”, “y” and “w”, therefore “Erq”

c) Substitute the consonants, except for the initial letter, with numbers based on the defined correspondence (see Table 2.2), therefore “E62” which is identical to the code for “Eric”.


Letters to be substituted   Substitution number
b, f, p, v                  1
c, g, j, k, q, s, x, z      2
d, t                        3
l                           4
m, n                        5
r                           6

Table 2.2: Soundex substitution of consonant classes with numbers by Knuth (1998).

would not be useful in the normalization of the example word “2getha” provided above.
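The three steps above can be sketched as a simplified Soundex (real Soundex additionally collapses adjacent identical codes and pads the code with zeros to a fixed length):

```python
# Consonant-to-digit mapping from Table 2.2.
CODES = {**dict.fromkeys("bfpv", "1"),
         **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"),
         "l": "4",
         **dict.fromkeys("mn", "5"),
         "r": "6"}

def soundex(name: str) -> str:
    """Simplified Soundex: keep the initial letter, drop vowels and
    the semivowels h, y, w, and map the remaining consonants to digits."""
    head, rest = name[0].upper(), name[1:].lower()
    digits = [CODES[c] for c in rest if c in CODES]  # skips a,e,i,o,u,h,y,w
    return head + "".join(digits)

print(soundex("Erique"), soundex("Eric"))  # E62 E62
```

Since the code starts from the initial letter, a token beginning with a digit, such as "2getha", receives no useful encoding, which is exactly the limitation noted above.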

2.3.4 Statistical Machine Translation and Neural Methods

Statistical machine translation (SMT) systems rely on statistical information derived from bilingual corpora to render a source text into the target language. In text normalization, the SMT approach sees normalization as a translation task where the non-standard word form is the source and the normalized version the target text. Aw et al. (2006) were among the first to deploy this method for the normalization of noisy user-generated texts, such as SMS messages, achieving good results. Later, focusing on the Twitter domain, Kaufmann (2010) introduced a model consisting in the preprocessing of tweets and the training of an SMT system.

The main challenge faced in the implementation of SMT systems is the lack of available parallel corpora, i.e. user-generated texts along with their normalized version, to train the systems (Han and Baldwin, 2011). A partial alleviation to this has been provided by the proposal of character-based SMT which requires considerably fewer annotated resources for training as the statistical model is based on characters and the character-based context, which automatically results in more data and fewer possible combinations to learn – since letters are limited, while words are not – as opposed to word-based and phrase-based SMT. This approach has found application mainly in the normalization of historical texts as in Scherrer and Erjavec (2013), Pettersson (2016) and Korchagina (2017b).

Neural approaches have also been deployed for lexical normalization, delivering promising results. Bollmann and Søgaard (2016) introduce a character-level bi-LSTM neural network model combined with multi-task learning for the lexical normalization of Early New High German texts. On Twitter data, there have been attempts to exploit neural approaches as well, combining them with existing ones (Goker and Can, 2018; Leeman-Munk, 2016).

2.4 Greek Language and Greek Tweets


in terms of normalization has not attracted much attention, despite its importance for subsequent NLP tasks.

2.4.1 Overview of Greek Spelling

The Modern Greek alphabet consists of 24 letters, out of which 5 are vowels. Three of these vowels have more than one graphic representation (Daniels and Bright, 2010; Protopapas et al., 2012); e.g. /i/ can be spelled as ι, η, ει, οι or υ with no phonological difference, the choice of which is prescribed by grammatical rules or is predetermined by etymology and historical orthography (Pittas and Nunes, 2014). All vowels can also appear with diacritics, either the accent mark (´), e.g. ά, or the diaeresis (¨), which can only be used with ι, i.e. ϊ; the omission or misuse of these can result in misspelled and consequently non-standard word forms or lead to changes in meaning, e.g. /'nomos/ (νόμος, law) vs. /no'mos/ (νομός, county). Moreover, the accent mark is obligatory – as well as pronunciation-motivated and therefore predictable – for all lowercased polysyllabic word forms (with few exceptions), e.g. /kan/ (καν, even) but /'kano/ (κάνω, to do) (Ktori et al., 2008).

2.4.2 Linguistic Phenomena in Greek Tweets

Similar to tweets in English and most other languages, tweets in Greek also exhibit a frequent use of non-standard word forms (e.g. ευχαριστώωω instead of ευχαριστώ, thanks) and abbreviations (τλκ instead of τελικά, finally), irregular use of punctuation (e.g. Κρίμα μόλις έφυγε!!! instead of Κρίμα, μόλις έφυγε., pity, he/she just left) and capitalization (Δεν ΞΕΡΩ instead of Δεν ξέρω, I don’t know), loose sentence structure and so forth. However, in view of the language-specific characteristics, there are also phenomena that are specific to Greek. A notorious example is Greeklish, a term that describes Greek transliterated with Latin characters through a correspondence based on visual similarity (e.g. substituting χ with x, δ with d, etc.). This is also referred to as Romanization (Chalamandaris et al., 2006).

(5) Greek:     Δεν νομίζω  να έρθω.
    Greeklish: Den nomizw na erthw.
    ’I don’t think I will come.’

Besides Greeklish, code-switching is also a common phenomenon in computer-mediated communication (Georgakopoulou, 2007), where Greek Twitter users mainly alternate between English and Greek. They often switch between the two languages either intentionally or because certain words are not as widespread in Greek (Balamoti, 2010). Such sentences may get even more complex to approach computationally when neologisms are also included.

(6) Σκρολάρω             και κάνω  share    διάφορα posts
    scrolling.NEOLOGISM  and doing share.EN various posts.EN
    ’I am scrolling and sharing various posts.’

The opposite phenomenon has also been observed, where English is transliterated into Greek letters, spanning either part of or the whole sentence.


Another commonly observed feature of Greek tweets (and of social media texts in Greek generally) is the frequent omission of the accent mark, resulting in OOV forms. For instance, /simfo'no/ (συμφωνώ, to agree) could appear as /simfono/ (συμφωνω), where the final vowel ω lacks its accent (ώ). As such, the word counts towards the non-standard word forms requiring normalization.
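A sketch of how such unaccented forms can be detected (our own illustration of the phenomenon, not the stress-restoration rule described later in this thesis) is to compare a token against the accent-stripped forms of the lexicon:

```python
import unicodedata

def strip_accents(word: str) -> str:
    """Remove combining marks (e.g. the Greek accent) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed
                       if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", stripped)

def missing_accent(token: str, lexicon: set) -> bool:
    """True if the token is unaccented but an accented IV form exists."""
    if token not in lexicon and token == strip_accents(token):
        return any(strip_accents(w) == token for w in lexicon)
    return False

lexicon = {"συμφωνώ"}  # toy in-vocabulary lexicon
print(missing_accent("συμφωνω", lexicon))  # True
```

Unicode normalization (NFD) separates the base vowel from its combining accent, so stripping marks of category "Mn" maps ώ to ω.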

2.5 Part-of-Speech Tagging

Part-of-Speech (POS) tagging is the process of assigning to each word in a sentence the corresponding part-of-speech label (Jurafsky and Martin, 2009). Labeling words with their POS tags is not a straightforward process, as the same word may fall into different word categories depending on the context it appears in. Therefore, efficient taggers use context information to resolve ambiguous cases. For example, in “Answer to me” vs. “I gave an answer”, the word “answer” fulfills a different syntactic role in each case and requires the tag VERB in the first and NOUN in the second expression. POS tagging is a key step in the NLP pipeline as it facilitates other NLP tasks and applications, such as word sense disambiguation, lemmatization, statistical machine translation, language modelling (Jurafsky and Martin, 2009) etc.

There are various methods that are employed for tagging. Some of them are:

• rule-based methods that assign a POS tag based on rules that exploit linguistic knowledge (Brill, 1992)

• probabilistic methods using CRF (Conditional Random Field) or HMM (Hidden Markov Model) that are based on the probability of a sequence of tags occurring in order (Kupiec, 1992)

• lexical-based methods that rely on the frequency of a word and its tag in the training corpus (Hall, 2003)

There are also approaches that make use of neural networks for the tag prediction (Schmid, 1994).


3 Data & Resources

3.1 The Dataset

The dataset¹ used in this project is a collection of Greek tweets that was retrieved – originally for a project concerning sentiment analysis of Greek tweets – and made publicly available for research² purposes by a team of researchers at the Democritus University of Thrace in Greece, and was presented in Kalamatianos et al. (2015). It was extracted through the Twitter API using the Python programming language. An overview of the dataset statistics is provided in Table 3.1.

Dataset Size              832.1 MB
Number of Tweets          4,373,197
Number of Users           30,778
Number of Hashtags        54,354
Hashtags (>1000 tweets)   41
Time Span                 2008–2014

Table 3.1: Statistics of the collection of tweets by Kalamatianos et al. (2015) used as the dataset in the present work.

Although the tweets span from 2008 to 2014, we did not consider it necessary to work with a more recent dataset, as the phenomena we were expecting to encounter partly pertain to the nature of the Greek language itself (e.g. omitting the stress mark) and partly to practices associated with online text generation (e.g. abbreviating words due to the character limit that Twitter imposes, faster typing etc.).

3.2 Resources

3.2.1 Hunspell

Hunspell is a spell-checking library that supports several languages that have a lexical writing system. It is suitable even for languages with rich morphology, complex compounding and Unicode characters. Besides spell checking through dictionary lookups, it also provides the possibility to perform tokenization, stemming, morphological analysis and more. It is freely available for download and licensed under the LGPL/GPL/MPL tri-license (Németh, 2008). In the present project, it was only used as a spell checker, in order to exclude from normalization words that were IV. The Hunspell dictionary for Greek has 828,805 entries. Besides common words, it also includes a wide array of proper names (mainly Greek ones) and acronyms, rendering it suitable for classifying even words from these categories as in-vocabulary, which along with its size justifies our choice.
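The IV-filtering step can be illustrated as follows; a plain Python set stands in for Hunspell's dictionary lookup, so the sketch is runnable without the Greek dictionary files (with a Hunspell binding, the membership test would be a spell-check call instead):

```python
def split_iv_oov(tokens, dictionary):
    """Partition tokens into in-vocabulary (IV) words, which are
    excluded from normalization, and OOV words, which are passed on."""
    iv, oov = [], []
    for tok in tokens:
        (iv if tok in dictionary else oov).append(tok)
    return iv, oov

dictionary = {"δεν", "ξέρω"}  # toy stand-in for the Greek dictionary
iv, oov = split_iv_oov(["δεν", "ξερω"], dictionary)
print(iv, oov)  # ['δεν'] ['ξερω']
```

Here the unaccented "ξερω" falls out as OOV, exactly the kind of token the normalization modules target.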

1Link to the dataset used in this project: http://hashtag.nonrelevant.net/downloads.html

2The tweet collection was the work of the undergraduate students Kalamatianos Georgios and Mallis


3.2.2 PANACEA n-gram Corpus

The PANACEA n-gram corpus for Greek consists of Greek n-grams and word/tag/lemma n-grams along with their corpus frequencies. Specifically, the n-grams range from unigrams to 5-grams. The data comes from the domain “environment” and was collected through a web crawl within the framework of the EU project PANACEA.3 The corpus is available under the CC-BY-SA licence. Despite the different domain, the unigram and bigram frequencies were utilized in the project for the calculation of the context probabilities at the stage of candidate selection, which will be described thoroughly in chapter 5. An overview of the specifics of this corpus is shown in Table 3.2.

Tokens     31.71 million
Sentences  1,185,312
Unigrams   435,189
Bigrams    3,860,716
Trigrams   9,767,383
4-grams    13,683,940
5-grams    14,954,020

Table 3.2: Statistics of the PANACEA n-gram Corpus.

In this project, we made use only of the unigram and bigram frequencies, as required in our calculations. Despite the difference in domain (i.e. environment vs. tweets), we decided to rely on this n-gram corpus due to the high number of the frequencies provided and the assumption that these would remedy the difference in terms of domain.

3.2.3 UD Pipe

Today, there are many algorithms and tools readily available for tagging texts in several languages; one of them is UD Pipe. It is a web-based tool which can be used to tag, lemmatize and parse text, but also to train taggers, lemmatizers and dependency parsers on new corpora, while it also allows choosing the exact model, all of which are based on the Universal Dependencies treebanks. It comprises the 17 core tags as defined within the Universal Dependencies framework (Straka and Straková, 2017). This tool will be used in this project for the extrinsic evaluation of our rule-based normalization system. Our choice was motivated by the diversity of texts used for training UD Pipe for Greek, as opposed to a single Twitter-irrelevant genre (the treebank, UD Greek GDT, includes news text, Wikipedia entries and spoken language), and by the fact that the majority of the POS tags occur in it (16 out of 17). In the project, a test text will be tagged before and after normalization in order to evaluate the effect of normalization of social media texts on the tagging accuracy.


4 Preprocessing

4.1 Cleaning the Dataset

The dataset originally contained a high number of retweets, which were all removed so as to deal exclusively with unique content. As a result, the size of the dataset shrank from 4,373,197 tweets to 1,918,105. Following this, we removed all instances of links (https://example.com), as well as hashtags (#example) and mentions (@example) occurring at the very beginning or end of a tweet. While links were removed invariably, hashtags and mentions were left untouched when appearing amid a sentence, so as to eliminate the risk of removing elements that are part of the sentence structure. Emojis and emoticons were removed regardless of their position in the tweet, as their presence is irrelevant to lexical normalization. Repeated punctuation marks (e.g. !!!!) were reduced to one, even in the case of ellipsis. Finally, a final full stop was appended to tweets missing one, in order to prevent potential errors in their segmentation. Figure 4.1 provides an overview of these steps.
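The cleaning steps above can be sketched with a few regular expressions. This is a minimal illustration rather than the exact implementation; the patterns and the function name are ours:

```python
import re

def clean_tweet(tweet: str) -> str:
    """A minimal sketch of the cleaning steps (patterns are illustrative)."""
    # Remove links anywhere in the tweet.
    tweet = re.sub(r"https?://\S+", "", tweet)
    # Remove hashtags/mentions only at the very beginning or end of the tweet.
    tweet = re.sub(r"^(?:[#@]\w+\s*)+", "", tweet)
    tweet = re.sub(r"(?:\s*[#@]\w+)+\s*$", "", tweet)
    # Reduce repeated punctuation marks (incl. ellipsis) to a single one.
    tweet = re.sub(r"([!?.;,])\1+", r"\1", tweet)
    # Tidy up whitespace left behind by the removals.
    tweet = re.sub(r"\s{2,}", " ", tweet).strip()
    # Append a final full stop if sentence-final punctuation is missing.
    if tweet and tweet[-1] not in ".!?;":
        tweet += "."
    return tweet
```

Note that Python's `\w` is Unicode-aware by default, so `[#@]\w+` also matches hashtags written in Greek letters.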

Figure 4.1: Preprocessing Steps.


4.2 Test Set

A random sample of 500 tweets was extracted from the dataset and reserved for testing purposes. Its annotation was conducted at the very end, after the completion of the implementation. Each tweet was manually normalized with regard to the normalization scope of the present work. Consequently, out-of-scope OOV words and potentially problematic features were not corrected, so that the efficiency of the system could be measured by a consistent standard. Our annotated test set is therefore not a universally gold one, as its degree of normalization is constrained by the scope definition of the project.

4.3 Systematic Analysis of Greek Tweets

In order to gain an overview of the specifics of Greek tweets and to outline the OOV word groups that would be targeted by the normalization system, we performed a manual analysis of 1,000 randomly selected tweets from the dataset (disjoint from the test set set aside earlier).

4.3.1 Sentence Structure

The vast majority of tweets exhibit a relatively loose sentence structure: the verb is often elided, interjections are frequently used and interposed, and sentence boundaries are sometimes not detectable due to the lack of periods. Example 1 below illustrates this.

(1) Original tweet: χαχαχ δε ξερω γιατι παει το χασα τωρα την επομενη παλι!

Potential boundaries: χαχαχ |δε ξερω γιατι |παει το χασα τωρα |την επομενη παλι!

English: hahaha |don’t know why |now it’s gone though |next time again!

4.3.2 Lower/Upper Case

As with punctuation, inconsistent capitalization was another common phenomenon. For instance, words were not always capitalized at the beginning of a sentence or after a full stop. More rarely, this practice was also observed with proper names. Furthermore, there were lowercase instances of acronyms, while on the contrary there were posts consisting of common words fully capitalized. Table 4.1 provides examples of unconventional character cases.

4.3.3 Neologisms


Category        Example                              Standard Form
Initial letter  χθες έδειξαν την ταινια              Χθες...
                (yesterday they showed the movie)
Proper names    δεν έχω δει την μαρια                ...Μαρία
                (I haven't seen mary)
Acronyms        περασα στο εκπα ευτυχως              ...ΕΚΠΑ...
                (I got admitted to ekpa thankfully)
Capitalization  ΠΟΛΛΑ ΜΠΡΑΒΟ                         πολλά μπράβο
                (MANY CONGRATS)

Table 4.1: Instances of non-standard case (upper vs. lower).

4.3.4 Greeklish and Engreek

As a general phenomenon of Greek online texts, Greeklish – the Romanization of Greek words – is present in Twitter texts as well. Usually, as revealed by the analysis of example tweets, users opt to encode the entire Greek tweet with Latin characters instead of just part of it. Some such posts may also contain English words or exhibit further features of non-standard writing, such as character elongations, e.g. klaiwwww (“cryinggg”, usually standing for “I am laughing out loud”).

Not as common, but nevertheless sporadically present, are English words or expressions transcribed with Greek characters. Most often, this is the case with single words which are syntactically fully integrated into the otherwise Greek sentence.

(2) με    έκανε        φόλοου
    me    do.3SG.PST   follow

    'He followed me (on social media)'

In the above example, the word φόλοου is automatically an OOV word: although seemingly Greek, it is a foreign word (follow) and will therefore be treated as an unknown token by the system.

4.3.5 Stress, Contractions, Elongations, Space

No stress mark: By far the largest group of detected non-standard tokens in Greek tweets are those lacking the stress mark (e.g. ειμαι – I am – as opposed to είμαι), which is an integral part of the correct spelling of a word. The high frequency of such stress-lacking forms suggests that they are intentional (probably due to the typing brevity of omitting the extra keystroke for the stress). As a result, the number of non-standard word forms increases, as there is no direct match with dictionary entries.

Restoring the stress is not always straightforward, as depending on its placement – above one of the vowels – it can result in more than one valid word, as explained in chapter 2.4.1. Such instances become more challenging when (usually unintended) spelling errors are also present, e.g. διαλειμα (break), where the conversion into a standard form would mean converting α to ά and adding an extra μ, therefore: διάλειμμα.

Elongated forms may likewise require stress restoration: for a word with an elongated ε, the repetitions of ε would need to be reduced, while the remaining one would need to bear a stress mark and consequently be έ. Table 4.2 provides some more examples of this and other categories.

Category                        Examples                       Standard Forms
No stress mark                  χρονια (a. years / b. year)    χρόνια (a) or χρονιά (b)
(disambiguation may be needed)  φορα (a. time / b. momentum)   φορά (a) or φόρα (b)
                                ειναι (it is)                  είναι
Elongations (any position)      οοοοολα (everything)           όλα
                                ναιιιιιιι (yes)                ναι
                                βεεεεεβαια (of course)         βέβαια
Contractions                    θα χε (must have had)          θα είχε
(lacking apostrophe)            το ξερα (I knew it)            το ήξερα
                                τα γραψα (I wrote them)        τα έγραψα
No space                        δενξέρω (don't know)           δεν ξέρω
(accidentally joined words)     νακαναμε (to do)               να κάναμε
                                πεςτης (tell her)              πες της

Table 4.2: Categories of non-standard word forms.

Contractions: Another common phenomenon is word contractions appearing without the obligatory apostrophe denoting their form. This concerns mainly verbs in the past tense whose past form starts with a vowel. This vowel is elided because the preceding particle or weak personal pronoun ends in a vowel as well (e.g. το χα instead of το είχα, θα φυγε instead of θα έφυγε etc.).

No space delimiter: Probably owing to the hastiness that characterizes the composition of most online posts, two or more words often appear without a space as a delimiter. As a result, they count as a single word with a non-standard form, for instance πεςμου (tellme) instead of πες μου (tell me). As with other phenomena, the new token becomes more challenging when one of the two forms, if not both, also lacks the obligatory stress, e.g. δενειπα (instead of δεν είπα), in which case it is not straightforward for the system to identify the individual lexical units.

4.3.6 Non-Standard Abbreviations

Except for the official abbreviations that exist in a language, Twitter users seem to like using non-standard word abbreviations as well. Just like several of the characteristics that account for the non-standard nature of OOV tokens, informally established abbreviations are quicker to type, which may explain their popularity. Based on the sample tweets, non-standard abbreviations can be distinguished into two groups: a) initialisms and b) disemvoweled cases.

Examples of initialisms in Greek include οτκ (ό,τι καλύτερο, simply the best), where the original expression has been replaced with the initial letters of each of its words. Similar examples in English are lmk (let me know) and lol (laughing out loud). Contrary to English users, who are prolific with initialisms, Greek users do not seem to use many. However, some widespread English initialisms, such as lol, are used by Greek users as λολ, or with varying elongation λοοολ, i.e. only graphemically localized.


No vowels  Transc.  Variants    Expansion  Translation
κ          /k/                  και        and
σμρ        /smr/    σμρα        σήμερα     today
τλκ        /tlk/    τλκα        τελικά     eventually/finally
τπτ        /tpt/    τπτα        τίποτα     nothing
κλμρ       /klmr/   κλμ, κλμρα  καλημέρα   good morning
τν         /tn/                 την/τον    the: def. article, accusative, Fem/Masc
π          /p/                  που        where

Table 4.3: Non-standard abbreviations: (semi)disemvowelment examples.

4.3.7 Misspellings

Misspelled words are another example of non-standard tokens. Misspellings can be either intended or – most of the time – unintended, and the resulting forms can be classified as a) valid dictionary entries or b) random misspellings. For instance, in the tweet ποιο συχνά (which often), although both words are IV entries, the first one, ποιο (which), is, as the context suggests, not intended and counts rather as a misspelling of the presumably intended word πιο (more). An equivalent example in English would be writing their is instead of there is. Random misspellings are those occurring either due to a typo or due to lack of knowledge of the spelling rules or of the specific spelling of a word. In Greek, most random misspellings concern the vowels /i/, /e/ and /o/, which despite identical pronunciation can be transcribed in more than one way, e.g. επιρεάζω instead of επηρεάζω /epireazo/, where /i/ has been misspelled as iota (ι) instead of eta (η).

4.4 Scope Definition

In view of the identified OOV word categories, only certain of these have been included in the normalization scope. The main criteria for selection were their relevance to the normalization task and their frequency of occurrence in the sample, along with the assumption that their non-standard form could lead to incorrect assignment of the POS tag. For instance, while code switching was a quite common phenomenon, foreign words were not selected for normalization, as rendering them into a “standard” form would most likely mean translating them, which is by itself a different task. Similarly, instances of Greeklish and Engreek were left untouched, while neologisms were tackled only partially, by manually enriching the dictionary with a few more entries, due to the lack of machine-readable resources on neologisms. As for misspellings, those that accidentally result in IV forms are not handled by the system either (as opposed to those resulting in OOV word forms). Last, loose sentence structure is not dealt with, since the focus of the present work is normalization on the lexical level. Therefore, the performance of the system will not be tested with regard to the aforementioned categories, which consequently have been left as is in the annotated test set as well.

In summary, the below categories are in scope for normalization:

a) Words lacking the stress mark

b) Elongations

c) Contractions

d) Non-standard abbreviations

e) True-casing (to some extent)

f) Misjoined words (to some extent)

g) Misspellings (only when OOV)


5 System Architecture

Our system consists of a series of rule-based components which hierarchically attempt to convert an OOV word into its standard form. This choice is motivated by the lack of annotated (i.e. normalized) Greek user-generated data, which would be needed to consider other approaches such as neural machine translation, but also by our assumption that linguistically motivated rules and methods can efficiently tackle OOV tokens bearing commonly used non-standard features. To this end, we make use of regular expressions, self-created dictionary mappings and rules which, among others, exploit linguistic information. The very last rule, which calculates the edit distance of an OOV token from certain dictionary entries (depending on the defined conditions), is only activated when none of the preceding components has produced a normalized variant. As can be seen in Figure 5.1, which provides an overview of the system's main components, the OOV token is handled step by step and either returned normalized, if converted into an IV word, or passed on further when no match could be produced.

Figure 5.1: Overview of the System Architecture.


The system can be divided into three main modules.

5.1 Module 1: Regular Expressions

In this module, regular expressions scan the tweets to detect patterns and perform substitutions, expansions and other modifications. At this stage, the tweets are not tokenized but handled as plain text.

5.1.1 Non-Standard Abbreviations

The first step for the system is to examine if the OOV words in the tweet are listed in the dictionary with non-standard abbreviations. This dictionary contains 60 entries which we collected through the analysis of 1,000 random tweets. For instance, if the OOV word is τλκ and found in the aforementioned dictionary, it will be replaced with the corresponding standard form τελικά (finally).
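As a minimal sketch of this lookup-and-replace step, the mapping below contains a handful of entries from the self-compiled resource (the full dictionary has 60 mappings; the variable and function names are ours):

```python
import re

# A handful of entries from the self-compiled abbreviation dictionary
# (the full resource contains 60 mappings collected from sample tweets).
ABBREVIATIONS = {
    "τλκ": "τελικά",    # finally
    "τπτ": "τίποτα",    # nothing
    "σμρ": "σήμερα",    # today
    "κλμρ": "καλημέρα", # good morning
}

# Match any listed abbreviation only as a whole word.
ABBR_RE = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")

def expand_abbreviations(tweet: str) -> str:
    return ABBR_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], tweet)
```

Since Python's `\b` is Unicode-aware, the word-boundary check works for Greek tokens as well, so substrings of longer words are not replaced.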

5.1.2 Elongations

Next in the process is the handling of character elongations. Generally, it is legitimate to encounter up to two repetitions of the same character in Greek. However, different conditions apply to each letter and this is why we distinguish between four different categories:

a) A small group of characters that can never appear more than once in sequence, regardless of their position in a word, i.e. [δζθξφχψΔΖΘΞΦΧΨ]. These are always truncated to one if elongated. For instance, the OOV όχχι would be normalized as όχι (no), but the IV entry γράμμα (letter) would, as intended, be left as is, since μ is excluded from this group.

b) A group of all Greek characters at word-initial position. In this case, repetitions are always truncated to one. For instance, δδδεν would be captured by this rule and rendered as δεν (not).

c) A group of all Greek characters at word-final position. Here, sequentially repeated characters are truncated to one, with a few exceptions when the word ends in α, ε or ο: for instance, αέναα (continuous) will remain as is, while ναιιι will be normalized to ναι (yes).

d) A group of all Greek characters in word-medial position, truncated to two instances when more than two occur in sequence, based on the idea that usually at most two repetitions can be legitimate in Greek. This will not always produce a normalized variant, but in most cases it will reduce the number of superfluous characters to a minimum, which at a later stage may facilitate the match to an IV entry. For instance, the OOV word καλλλλό will be modified to καλλό, which is only one edit away from the standard counterpart καλό (good).
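The four categories above could be approximated with regular expressions as follows. This is a simplified sketch: the actual rule's exceptions for word-final α, ε and ο (as in αέναα) are omitted here.

```python
import re

NEVER_REPEATED = "δζθξφχψΔΖΘΞΦΧΨ"
GR = "α-ωΑ-Ωάέήίόύώϊϋΐΰ"  # Greek letters incl. accented vowels

def reduce_elongations(word: str) -> str:
    # (a) letters that never appear doubled in Greek: truncate to one
    word = re.sub(rf"([{NEVER_REPEATED}])\1+", r"\1", word)
    # (b) repeated letters at word-initial position: truncate to one
    word = re.sub(rf"^([{GR}])\1+", r"\1", word)
    # (c) repeated letters at word-final position: truncate to one
    #     (simplified: the real rule exempts some endings in α, ε, ο)
    word = re.sub(rf"([{GR}])\1+$", r"\1", word)
    # (d) elsewhere, runs longer than two are truncated to two
    word = re.sub(rf"([{GR}])\1{{2,}}", r"\1\1", word)
    return word
```

Note how καλλλλό is only reduced to καλλό, leaving the final match to καλό for the edit-distance module, exactly as described for category d).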

5.1.3 Contractions

Word contractions are expanded into their full forms regardless of the presence of the apostrophe, under the assumption that the full verbal form would later on enhance POS-tagging. In order to capture only the forms in scope, we define the context to the left by listing the relevant pronominal clitics and particles as in: ([σΣ]?[τΤ][οα]|θα|[μΜσΣτΤ]ου). For instance, το χω or το’χω (I’ve it, in the sense of I got it) gets expanded into το έχω (I have it).
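A simplified sketch of this expansion is shown below, using the left-context pattern quoted above. The mini-lexicon of elided verb forms is hypothetical and stands in for the system's actual restoration of the elided vowel against the dictionary:

```python
import re

# Left-context pattern from the thesis: clitic pronouns and particles.
CLITICS = r"([σΣ]?[τΤ][οα]|θα|[μΜσΣτΤ]ου)"

# Hypothetical mini-lexicon of elided verb forms; the actual system
# restores the elided initial vowel against the full dictionary instead.
ELIDED = {"χω": "έχω", "χα": "είχα", "χε": "είχε", "ξερα": "ήξερα"}

CONTRACTION_RE = re.compile(
    CLITICS + r"[ ’']\s*(" + "|".join(ELIDED) + r")\b"
)

def expand_contractions(tweet: str) -> str:
    # Rewrite "το χω" / "το’χω" as "το έχω", etc.
    return CONTRACTION_RE.sub(
        lambda m: f"{m.group(1)} {ELIDED[m.group(2)]}", tweet
    )
```

The trailing `\b` ensures that longer verb forms such as χασα are not mistaken for the contraction χα.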

5.1.4 Misjoined words

Incorrectly joined words, i.e. words lacking a space delimiter, are tackled by the system only to some extent. At this stage the tweet is tokenized, and a rule checks whether the OOV word in question is made up of two valid (IV) sequential substrings; if yes, it inserts a space between them. For instance, the accidentally joined words διαφορετικέςκαι are returned as διαφορετικές και (different and). However, should the string bear additional irregularities, e.g. a missing stress mark or a misspelling, as in διαφορετικεςκαι, the rule will not deal with it, as διαφορετικες is not an IV entry, although και is.
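The splitting check can be sketched as follows, with `vocabulary` standing in for the Hunspell lookup:

```python
def split_misjoined(token: str, vocabulary: set) -> str:
    """Insert a space if the OOV token is exactly two valid words joined."""
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in vocabulary and right in vocabulary:
            return f"{left} {right}"
    return token  # no valid two-word split found: leave unchanged
```

As described above, a token like διαφορετικεςκαι is returned unchanged, because no split yields two IV substrings.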

5.1.5 Truecasing

Truecasing checks if an OOV word is in the dictionary in a different case than the one it appears in. For instance, acronyms that incorrectly appear lowercased are successfully matched against the IV entry when tested in their uppercase variant and returned as such (ικα changed into ΙΚΑ). Furthermore, proper names that do not start with an uppercase letter are similarly matched to their standard case form (e.g. άννα into Άννα). Finally, common words appearing in uppercase are converted into their lowercase form (e.g. ΝΑΙ into ναι). However, simply lowercasing uppercase occurrences of common words and rendering them IV that way is possible only to a limited extent due to the missing stress mark (obligatory for almost all polysyllabic lowercase words in Greek, while superfluous in uppercase variants): ΒΙΒΛΙΟ is not automatically IV through βιβλιο, but rather through the stress-bearing βιβλίο, a step included in the next module.
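A sketch of the casing checks, with dictionary membership again standing in for the Hunspell lookup; as noted above, the lowercase variant often fails for stress reasons and is then handled by Module 2:

```python
def truecase(token: str, vocabulary: set) -> str:
    """Try alternative casings of an OOV token against the dictionary."""
    if token in vocabulary:
        return token
    for variant in (token.upper(),        # ικα  -> ΙΚΑ
                    token.capitalize(),   # άννα -> Άννα
                    token.lower()):       # ΝΑΙ  -> ναι
        if variant in vocabulary:
            return variant
    return token  # e.g. ΒΙΒΛΙΟ: βιβλιο is not IV, Module 2 restores stress
```

Python's case methods are Unicode-aware, so `"ικα".upper()` correctly yields `"ΙΚΑ"`.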

5.2 Module 2: Rule about Stress Restoration

This module deals exclusively with OOV words that lack the stress mark. Such words can either be uppercase, where the stress mark is by default not part of the correct spelling (but has to be restored, since uppercase words are returned lowercase based on our assumption about POS-tag assignment), or lowercase instances (for the reasons explained in the background chapter).

5.2.1 Rule Overview

Here as well, the input is a tokenized tweet, and only tokens consisting of Greek characters are examined. The rule first identifies and “remembers” the original case of the OOV token (i.e. whether fully lowercase, e.g. xxx, or with an uppercase initial letter and lowercase rest, e.g. Xxx) before lowercasing it, so that the token can be returned in its original form in the end. Similarly, even in cases where a proper name lacks both the stress and the capitalization of the initial letter, the stress restoration rule will still match it to its IV counterpart. In summary, the following cases are generally covered:

a) Uppercase common words are converted into their lowercase stress-bearing variant e.g. ΣΠΙΤΙ as σπίτι (house)

b) Proper names with an uppercase initial letter but lacking the stress mark are returned with the stress restored, e.g. Αννα as Άννα (Anna)

c) Proper names in full lowercase and lacking stress are returned with both the stress and the case restored, e.g. αννα as Άννα (Anna)

d) Common words with an uppercase initial letter (due to their position in the sentence) are returned with the stress restored and the case untouched, e.g. Σημερα as Σήμερα (today)

5.2.2 Rule Analysis

In practice, a word (or at least a seemingly Greek word, consisting of Greek characters) that is not in vocabulary, lacks the stress mark and has at least one vowel will be handled by the rule. It is split into its individual letters, and if the current letter is a vowel, it is replaced with its stressed counterpart (e.g. ε with έ). The system then checks if this modification makes the word a valid dictionary entry. If yes, the specific stressed variant of the word is stored in a list of possible replacements. At the same time, the vowel in question is reverted to its original (unstressed) state and the testing continues with the rest of the vowels, so that potentially further valid stressed variants are considered as well, as shown in Figure 5.2.

for token in tweet do
    if token is Greek and OOV and has at least 1 vowel and is unstressed then
        List_of_possible_replacements = empty list
        for character in token do
            if character is a vowel then
                replace vowel with its stressed counterpart; update token
                if token is IV then
                    append token to List_of_possible_replacements
                end if
                revert changes on token and continue
            end if
        end for
    end if
end for

Figure 5.2: Snippet of pseudocode from the stress restoration rule.
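The pseudocode in Figure 5.2 translates roughly into the following Python sketch, with `vocabulary` standing in for the Hunspell lookup:

```python
# Mapping from each unstressed Greek vowel to its stressed counterpart.
VOWEL_TO_STRESSED = {"α": "ά", "ε": "έ", "η": "ή", "ι": "ί",
                     "ο": "ό", "υ": "ύ", "ω": "ώ"}

def stressed_candidates(token: str, vocabulary: set) -> list:
    """Stress each vowel in turn and collect the variants that are IV."""
    candidates = []
    for i, ch in enumerate(token):
        if ch in VOWEL_TO_STRESSED:
            # Build the variant with this single vowel stressed; the
            # original token is left untouched ("revert and continue").
            variant = token[:i] + VOWEL_TO_STRESSED[ch] + token[i + 1:]
            if variant in vocabulary:
                candidates.append(variant)
    return candidates
```

Ambiguous tokens such as χρονια yield more than one candidate (χρόνια, χρονιά), which is exactly why the candidate-selection step described next is needed.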

For the conditional probability of a stressed candidate given the previous token, we base our calculation on the formula below:

P(candidate | previous) = c(previous, candidate) / c(previous)

For unigram probabilities, we follow the same principle by relying on the formula below:

P(candidate) = c(candidate) / c(vocabulary_entries)

5.2.3 Handling of Special Cases

Although not as common, tokens are still possible that, aside from lacking the stress mark, also bear a misspelling. In this case, most of the time, the system will forward the token to the next module, as expected. For instance, the word μυνημα will not be restored by the stress restoration rule, as no stress-bearing alternative of the token is included in the vocabulary. In other words, the system will check whether μύνημα or μυνήμα exists (trying out stress on each of the vowels in turn) and, after getting no match, will pass it over to the edit distance rule, which ideally should return the IV counterpart μήνυμα.

However, such tokens (lacking the stress mark and containing a misspelling) can in certain cases be incorrectly converted into an IV entry, when one is available. For example, in the expression δεν ξερο (I don't know), the word ξερο will by default be handled by the stress restoration rule (because it lacks the stress mark) and a match with the IV entry ξερό (dry) will be identified, while the actual standardization – based on the left context δεν (here: don't) – would be to correct the misspelling (by replacing omikron with omega) and restore stress on epsilon, so that ξέρω (know) is returned. In order to limit the chances of this happening, we do not automatically accept a candidate proposed as the single option by the stress restoration rule; instead, we first check whether it appears at all in the immediate context (with the token before or after) and only then return it. Otherwise, we pass the OOV word to the third and final module, which in the present case would be expected to return the correct normalization ξέρω. When no context is available, we base the selection on the unigram probabilities of the proposed candidate and the OOV token itself and keep the one with the higher probability.

5.3 Module 3: Edit distance

All OOV words that were not normalized by the previous modules end up in this final module. For the normalization of an OOV word, we compute its edit distance from certain vocabulary entries. Based on this distance, a number of candidates is selected, for which we retrieve the logarithmic probabilities. The candidate with the best probability is then chosen as the normalized version.

5.3.1 Extraction of IV subset

Computing the edit distance from every one of the 828,805 Hunspell entries would be wasteful: the system would be considering as a candidate for the normalization of a four-letter-long OOV token even dictionary entries that have double its length. As a result, computing the edit distance only from the entries of a subset of the Hunspell dictionary is both quicker and more effective. The size of this subset is determined each time by the length of the OOV word, as well as its starting and final characters.

We distinguished between four conditions, summarized below. In every condition, a dictionary entry qualifies as a candidate if its length equals len(OOV) or len(OOV) ± 1, AND it either starts or ends with the same characters as the OOV token:

len(OOV)  Shared prefix  Shared suffix
> 8       first 4 chars  last 3 chars
6–8       first 3 chars  last 2 chars
4–5       first 2 chars  last 2 chars
< 4       first 1 char   last 1 char
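The subset extraction described by these conditions can be sketched as follows (function names are ours):

```python
def prefix_suffix_lengths(n: int):
    """Shared prefix/suffix lengths per OOV-token length (conditions 1-4)."""
    if n > 8:
        return 4, 3
    if 6 <= n <= 8:
        return 3, 2
    if 4 <= n <= 5:
        return 2, 2
    return 1, 1

def extract_subset(oov: str, dictionary) -> list:
    """Keep entries of similar length sharing a prefix OR a suffix."""
    n = len(oov)
    p, s = prefix_suffix_lengths(n)
    return [w for w in dictionary
            if abs(len(w) - n) <= 1
            and (w[:p] == oov[:p] or w[-s:] == oov[-s:])]
```

For the six-letter token μύνημα this keeps, among others, μήνυμα and μνήμα (shared suffix -μα, length within ±1), while pruning the vast majority of the dictionary.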

5.3.2 Extraction of Candidates

After the Hunspell subset has been created, we compute the edit distance of the OOV word in question from all the entries of the subset. We use a partially weighted implementation1 of the Levenshtein distance where certain character substitutions are assigned customized weights, while all other operations, including substitutions of the majority of the characters, are assigned a common weight. Tables 5.1 and 5.2 show all the customized weights for vowel substitutions.


Substitution (unstressed → stressed vowel)         Cost
α→ά  ε→έ  η→ή  ι→ί  ϊ→ΐ  υ→ύ  ϋ→ΰ  ο→ό  ω→ώ       0.25 each

Table 5.1: Substitution weight for restoring stress on the same vowel.

Substitution (commonly confused vowels)                                Cost
ο→ω  ό→ώ  η→ι  ή→ί  η→υ  ή→ύ  ι→η  ί→ή  ι→υ  ί→ύ  υ→η  ύ→ή  υ→ι  ύ→ί  0.5 each

Table 5.2: Substitution weight for vowels commonly mixed up.

The default weight assigned to an edit operation is 1. However, for the substitutions of certain vowels, we have manually assigned lower costs of 0.25 or 0.5, as shown in Tables 5.1 and 5.2. The customization of the substitution weight is based on the idea that specific characters are more likely to be mistyped, intentionally or due to uncertainty, as certain other characters. For instance, substituting Greek vowels that lack the stress mark with their stressed counterparts should be a low-cost substitution since, as explained in the Background chapter, omitting the stress mark in online content creation is a widespread phenomenon.
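Such a partially weighted Levenshtein distance can be sketched as a standard dynamic-programming implementation with a substitution-cost table (transpositions are not modeled; the cost values passed in are examples from the tables above):

```python
def weighted_levenshtein(a: str, b: str, sub_costs=None) -> float:
    """Levenshtein distance with customizable substitution costs.

    sub_costs maps ordered character pairs to costs, e.g. ('ι', 'ί'): 0.25;
    every other operation (insert, delete, substitute) costs 1.
    """
    sub_costs = sub_costs or {}
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            else:
                sub = sub_costs.get((a[i - 1], b[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # (weighted) substitution
    return d[m][n]
```

With the 0.25 weight for stress restoration, ειμαι ends up only 0.25 away from είμαι instead of a full edit of 1.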

Once the edit distance from all Hunspell subset entries has been computed, the system excludes from selection all entries with an edit distance above 3. This upper limit was defined heuristically, after experimenting with different values and concluding that only extremely rarely was the gold normalized counterpart excluded from consideration. As a result, the IV entry is selected from only a short list of normalization candidates.

5.3.3 Final Selection of the IV Counterpart

The final selection of the IV entry that will replace the OOV token is performed by initially considering IV words with an edit distance of at most 1.5. Only when no IV word falls within this threshold are the other candidates considered. In either case, should there be more than one option, the IV entry with the highest sum of log probabilities with the immediate context is chosen, if context is available; otherwise, only the unigram probability is taken into consideration. For example, assuming that the OOV token μύνημα is about to be processed, appearing in the context έστειλα μύνημα χθες, it will be handled as follows:
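The two-tier selection could be sketched as follows, where `context_logprob` stands in for the log-probability computation over the immediate context:

```python
def select_best(candidates_with_dist: dict, context_logprob) -> str:
    """Prefer candidates within edit distance 1.5; otherwise fall back to
    the full candidate list. Ties are broken by the log probability of the
    candidate in its immediate context (or its unigram probability)."""
    close = {c: d for c, d in candidates_with_dist.items() if d <= 1.5}
    pool = close or candidates_with_dist
    return max(pool, key=context_logprob)
```

On the worked example below, the pool shrinks to μήνυμα, μνήμα and μόνιμα, and the context log probabilities decide among them.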

a) The Hunspell subset of IV candidates is created; the system will compute the edit distance from its entries. Based on the length and the starting and ending characters of the word μύνημα, this subset contains 871 entries.

b) The edit distance of the OOV token from each entry of this subset is computed, and all entries with a distance above the threshold of 3 are excluded. In this example, this leaves the following 33 candidates:

Candidate extraction with edit distance below 3: {’εύσημα’: 2.2, ’μέλημα’: 2.2, ’εύχυμα’: 2.45, ’κούνημα’: 2.1, ’μύθευμα’: 2.35, ’μύρωμα’: 2.2, ’μύξωμα’: 2.2, ’μάθημα’: 2.2, ’μήνυμα’: 1.35, ’μάσημα’: 2.2, ’κύημα’: 2.1, ’φώνημα’: 2.2, ’μάδημα’: 2.2, ’πόνημα’: 2.2, ’κίνημα’: 2.2, ’ομώνυμα’: 2.35, ’γόνιμα’: 2.45, ’μύωμα’: 2.1, ’μάνιωμα’: 2.35, ’ρίνημα’: 2.2, ’μίλημα’: 2.2, ’μύρισμα’: 2.35, ’μνήμα’: 1.25, ’φύσημα’: 2.2, ’εύρημα’: 2.2, ’μόνιμα’: 1.35, ’ούρημα’: 2.2, ’μάχιμα’: 2.45, ’ξύπνημα’: 2.1, ’σύνθημα’: 2.1, ’εύθυμα’: 2.45, ’κύλημα’: 2.2, ’εύφημα’: 2.2}

c) If there are IV entries with an edit distance below or equal to 1.5, a subgroup is created with these. The result would be:

Candidates with edit distance below or equal to 1.5: {’μήνυμα’: 1.35, ’μνήμα’: 1.25, ’μόνιμα’: 1.35}

d) The logarithmic sum of probabilities of each IV candidate of the above subgroup with the immediate context in the tweet is calculated. Finally, the candidate with the highest probability is returned.

Final candidates with log sums: {(’μήνυμα’, 1.931853990), (’μνήμα’, -0.20860722411955), (’μόνιμα’, 1.2660461067248)}


6 Evaluation and Results

6.1 Performance of the Rule-Based System

The performance of the rule-based system was tested in terms of accuracy, measured by comparing each token of the normalized test data with the equivalent gold one. As a baseline, we considered the deviation of the unnormalized (i.e. merely cleaned) test data from the gold dataset. The results, shown in Table 6.1, indicate that the system reaches an accuracy of 95.4%, which corresponds to a drop in error rate from 19.0% to 4.5%.

                      Word Accuracy %   Error %
Baseline              80.9              19.0
Normalization System  95.4              4.5

Table 6.1: Performance of the normalization system.

Given that the normalization of certain OOV tokens can in some cases result in more than one word (thereby increasing the total number of tokens in the normalized dataset), for the purpose of performance testing we joined such outputs with an underscore, in order to ensure alignment and comparability between the different versions of the test set.


set, with capitalized tokens, misspellings, abbreviations and other groups following in smaller percentages.

Finally, we provide further insight into the normalization performance by distinguishing between the percentage of tokens that were handled correctly by the system (either by being normalized into the expected form or by being correctly left as is) and the percentage that were handled incorrectly.

               Normalized %   Not Normalized %
Correctly         83.44            98.20
Incorrectly       16.55             1.79

Table 6.2: Percentage of tokens that were handled correctly and incorrectly by the system, through modification or not.

The figures in Table 6.2 show that when modifying a token, the system did so correctly 83% of the time and incorrectly 17%, while when leaving a token unmodified, it was successful in 98% of cases and unsuccessful in 2%.
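The two-way breakdown behind Table 6.2 can be computed as in the following sketch, which compares each original, system, and gold token triple; the function name `handling_breakdown` is an illustrative assumption.

```python
def handling_breakdown(original, system, gold):
    """Among tokens the system modified vs left untouched, count how
    many match the gold token (correct) and how many do not."""
    counts = {'modified': [0, 0], 'unmodified': [0, 0]}  # [correct, incorrect]
    for o, s, g in zip(original, system, gold):
        bucket = 'modified' if s != o else 'unmodified'
        counts[bucket][0 if s == g else 1] += 1
    return {k: tuple(v) for k, v in counts.items()}
```

Normalizing each column of the returned counts to percentages yields figures of the kind reported in Table 6.2.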

6.2 Error Analysis

Misidentification as OOV: Despite the restrictions placed on what the system identifies as OOV, there were still cases of words sent for normalization despite being in a perfectly valid form. This mostly concerned neologisms and newly coined acronyms not included in any of the external or self-created resources.

Failure to split: Incorrectly joined words (i.e. lacking a space between word boundaries) were only restored when a) consisting of at most two words¹ and b) bearing no misspellings besides the missing space. When at least one of these conditions did not hold, the OOV token was sent for normalization to the edit distance rule. However, since that rule only generates a single token, the result was by default an incorrect normalization.
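A two-word split restorer of the kind described can be sketched as follows; `try_split` and its lexicon argument are illustrative assumptions, with the lexicon standing in for the Hunspell dictionary.

```python
def try_split(token, lexicon):
    """Attempt to restore a single missing space: accept a split only
    if both halves are in-vocabulary, i.e. exactly two correctly
    spelled words were joined."""
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in lexicon and right in lexicon:
            return left + " " + right
    return None  # fall through to the edit distance rule
```

As the error analysis notes, a token joining three or more words, or carrying an additional misspelling, never satisfies this check and is passed on to the single-token edit distance rule.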

Misrestoring stress: Although we tried to limit this to some extent when developing the system, incorrect stress restoration is in some cases still inevitable. Words lacking stress can sometimes also bear a misspelling, in which case the Levenshtein distance from valid dictionary entries should be computed. However, since the first step is to check whether stress restoration can convert an OOV token into an IV one, some OOV words get converted through stress restoration into an IV token that was unintended by the user, as in δεν ξερο, where ξερο gets converted to ξερό (dry, adjective) instead of ξέρω (I know, verb), despite the preceding δεν (don't). Whether this can be tackled successfully depends on how informative the context is each time (if there is any at all) and on the quality of the n-gram corpus used for this purpose.
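The ambiguity can be illustrated with a minimal stress-restoration sketch that places an acute accent on each vowel in turn and keeps the in-vocabulary results; `stress_variants` and the lexicon argument are assumptions for illustration.

```python
# Map each unaccented Greek lowercase vowel to its acute-accented form.
ACUTE = {'α': 'ά', 'ε': 'έ', 'η': 'ή', 'ι': 'ί', 'ο': 'ό', 'υ': 'ύ', 'ω': 'ώ'}

def stress_variants(word, lexicon):
    """Generate every single-accent placement for an unstressed word
    and keep only those that are in-vocabulary."""
    variants = [word[:i] + ACUTE[c] + word[i + 1:]
                for i, c in enumerate(word) if c in ACUTE]
    return [v for v in variants if v in lexicon]
```

For ξερο, only ξερό can be produced this way: the intended ξέρω also differs in its final vowel, so stress restoration alone converts the token into an unintended but valid word, exactly the failure mode described above.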

Wrong candidate selected: When more than one candidate was available to replace an OOV token, higher context probability was in some cases assigned to an incorrect one. This happened mostly either when the n-gram corpus contained overly genre-specific context frequencies or when the context was, once again, not particularly informative, because it consisted of function words or of words that were themselves in a non-standard form.

Correct candidate excluded: Although not as common, but still an issue, the correct candidate was in some cases excluded from consideration when creating the Hunspell subset for which the Levenshtein distance would be computed. This was the case for IV entries that neither met the length restrictions with regard to the OOV word nor had the same starting or ending characters.

¹Splitting any number of potentially misjoined words requires a more complex solution which takes

6.3 Effect of Normalization on Tagging

We used accuracy to measure the effect that lexical normalization has on tagging. We first tagged the clean test data and then, after running the data through the normalization system, we retagged it. Each of these versions (i.e. the clean tagged data and the normalized tagged data) was compared to the tagging of the gold test set. However, since the gold test set was tagged automatically using UD Pipe (and not manually), it constitutes an upper bound rather than ground truth.

                    Tagging accuracy %   Error %
Baseline                  88.33           11.67
Normalized Text           97.43            2.56

Table 6.3: Results of tagging clean vs normalized data.
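A minimal sketch of this comparison, assuming the tag sequences for the clean, normalized, and gold versions have already been produced by the tagger and are token-aligned; `tagging_accuracy` is an illustrative name, not part of UD Pipe.

```python
def tagging_accuracy(pred_tags, gold_tags):
    """Percentage of tokens whose predicted POS tag agrees with the
    tag assigned to the (automatically tagged) gold test set."""
    agree = sum(p == g for p, g in zip(pred_tags, gold_tags))
    return 100 * agree / len(gold_tags)
```

Running this once on the clean-data tags and once on the normalized-data tags, each against the gold tags, yields the two rows of Table 6.3.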
