
Automatic morphological analysis of L-verbs in Palula

Emma Wallerö

Department of Linguistics
Bachelor’s Thesis, 15 ECTS credits
Linguistics: Computational Linguistics – Bachelor’s Thesis, LIN621
Spring semester 2020

Main supervisor: Robert Östling. Co-supervisors: Henrik Liljegren, Mats Wirén


Abstract

This study explores the possibilities of automatic morphological analysis of L-verbs in the Palula language with the help of finite-state technology and two-level morphology, along with supervised machine learning. The type of machine learning used is neural Sequence to Sequence models. A morphological transducer covering the L-verbs of the Palula language is built with the Helsinki Finite-State Transducer Technology (HFST) toolkit. Several Sequence to Sequence models are trained on sets of L-verbs along with morphological tagging annotation. One model is trained on a small amount of manually annotated data, and four models are trained on different amounts of training examples generated by the Finite-State Transducer. The efficiency and accuracy of these methods are investigated. The Sequence to Sequence model trained on solely manually annotated data did not perform as well as the other models. A Sequence to Sequence model trained with training examples generated by the transducer achieved the best recall, accuracy and F1-score, while the Finite-State Transducer achieved the best precision score.

Keywords

Automatic morphological analysis of L-verbs in Palula

Emma Wallerö

Summary

This study investigates the possibilities of automatic morphological analysis of L-verbs in the Palula language with the help of finite-state technology and two-level morphology, along with supervised machine learning. The type of machine learning used in the study is neural Sequence to Sequence models. A morphological transducer covering the L-verbs of Palula is created with the Helsinki Finite-State Transducer Technology (HFST) toolkit. Several Sequence to Sequence models are trained on sets of L-verbs with morphological tagging annotation. One model is trained on a small set of manually annotated data and four models are trained on different amounts of training data generated by the finite-state transducer. The efficiency and accuracy of these models are investigated. The Sequence to Sequence model trained on only manually annotated data did not perform as well as the other models in the study. A Sequence to Sequence model trained on data generated by the transducer gave the best recall, accuracy and F1-score, while the finite-state transducer gave the best precision.

Keywords

Contents

1. Introduction
2. Background
   2.1 The Palula language
      2.1.1 Consonant-ending L-verbs
      2.1.2 a-ending L-verbs
      2.1.3 e-ending L-verbs
      2.1.4 Other L-verbs
   2.2 Natural Language Processing (NLP)
      2.2.1 Finite-State Transducer and Two-level Morphology
      2.2.2 Machine learning
   2.3 Previous research of morphological analysis of low resource languages
3 Purpose and research questions
   3.1 The purpose of the study
   3.2 Research questions
4 Material and methods
   4.1 Data
   4.2 Methods
      4.2.1 The morphological transducer
      4.2.2 Machine learning
   4.3 Evaluation
5 Results
6 Discussion
   6.1 Discussion of each research question
   6.2 Discussion of methods
   6.3 Discussion of results
7 Conclusions
References
Appendices
   Appendix A: Data overview
   Appendix B: Transducer model
      Lexicon


1. Introduction

There are over 7,000 languages in the world, but only a small part of these have, or have had, the resources for producing significant amounts of linguistically annotated language samples (Ethnologue, 2020). Is there an effective way of automatically analysing these languages even when no large annotated data sets are available? This is a big question, and the answers may depend largely on the languages, methods, and data studied. Because of all these factors, there is much to explore in the narrow corners of this big question. This study intends to explore the possibilities of automatic annotation of one low-resource language, and within this language one word class.

The language concerned is Palula, an Indo-Aryan Dardic language spoken in northern Pakistan by roughly 10,000 people (Liljegren & Haider, 2015:xix). Palula fits this study in the sense that it has a limited set of annotated data, and no automatic analysis tools like those intended for this research are in use. The word class to be studied is L-verbs, verbs ending in “-l” in the perfective form. An example of a Palula L-verb is the perfective verb stem “phedíl-”, meaning “arrive” or “reach”; the imperfective stem of the same verb is “phed-” (Liljegren, 2016:210). Perfective and imperfective are different aspects and have different qualities. Perfective aspect is usually associated with change of state, while imperfective aspect can be described as the time of an assertion being made falling “entirely under the time of the situation” (Gvozdanović, 2012:6).


2. Background

2.1 The Palula language

Palula is an Indo-Aryan language spoken by 10,000 people in northern Pakistan (Liljegren & Haider, 2015:xix). The language is, along with some other varieties, subsumed under the heading Shina (Liljegren, 2016:1). Palula is, along with other Hindukush languages, often referred to as a “Dardic” language, a classification mainly based on geographic assumptions (Liljegren, 2016:13). Palula has a Subject Object Verb (SOV) syntactic structure, which is typical for Indo-European languages, and it has a two-gender system (Liljegren, 2016:404, 46). The morphology of the language is suffixing with “a moderately high degree of synthesis”, and the formatives are usually concatenative (Liljegren, 2016:45).

There are two main dialects of Palula: Ashret and Biori. The focus dialect of this study is Ashret, which is spoken by about 5,900 people (Liljegren, 2016:7). The focus of this study lies on the structure of L-verbs in Palula. The majority of verbs in Palula are mono- or di-syllabic. Monosyllabic verbs are the most common, and these usually have a Consonant Vowel Consonant (CVC) structure, although other structures do occur (Liljegren, 2016:201). L-verbs are named after their perfective ending (-íl, -óol, etc.) (Liljegren, 2016:201). The majority of the verbs of Palula are categorized as this type, and the class is also the most productive one (Liljegren, 2016:209). This particular class was chosen as the focus of this study because of its frequency and productivity. The L-verbs are classified by Liljegren into four subtypes: consonant-ending, a-ending, e-ending and other L-verbs. These behave somewhat differently from each other when it comes to morphophonemic alternations (Liljegren, 2016:209). For an overview of the different types of Palula L-verbs used in this study, see table 1.

Table 1: An overview of the characteristics of the three different types of L-verbs discussed in this study.

Type of L-verb | Characteristics
Consonant-ending | A majority of the verbs are intransitive. The singular imperative is usually identical to the stem itself. Some of these verbs have morphophonemic alternations in the stem.
a-ending | A majority of the verbs are transitive. Unique to this class are the perfective segments óol and éel. Usually not affected by morphophonemic alternations in the stem.
e-ending | A rather even distribution of transitive and intransitive verbs. Some of the verbs have a long accented vowel in the perfective form. Some of these verbs have morphophonemic alternations in the stem.

2.1.1 Consonant-ending L-verbs

Most of these verbs are intransitive, but not all. For perfective stems, the accent occurs on the final “í”, in the íl-segment (Liljegren, 2016:210). This is however the only feature that distinguishes the


strengthening, and umlaut (Liljegren, 2016:210). Umlaut means anticipatory fronting: “aa” changes to “ee” when followed by an “i” in the next syllable (Liljegren, 2016:81). Only long “a”s seem to be affected by the umlaut, which is expressed both inside stems and in suffixes. Based on findings in the online dictionary, this is the type of L-verb with the most stem changes in inflections. Example 1 shows an instance of the consonant-ending L-verb “khóṇḍa”. The surface word “khoṇḍíia” is analysed as containing the segments “khoṇḍ”, a stem carrying the meaning “talk”, and the inflectional suffix “-íia”, glossed as carrying a first-person plural meaning.

(1)

Palula (N.1168) A:MAR015-6 (Liljegren 2019, “khóṇḍa”)

be teeṇíi méeǰi ǰargá thíia, khoṇḍíia.

be teeṇíi méeǰi ǰargá the -íia khoṇḍ -íia

1PL.NOM REFL among council do -1PL talk -1PL
‘Let's consult with one another and discuss.’

2.1.2 a-ending L-verbs

The majority of these verbs are transitive. The final vowel in these verbs has “coalesced” with the first vowel of the suffix, and in some cases this has resulted in umlaut (Liljegren, 2016:210). Unique to this L-verb class are the perfective segments óol and éel (singular and plural) (Liljegren, 2016:210). This type of L-verb does not seem to be affected by umlaut inside the stem very often, except sometimes in the very last vowel.

2.1.3 e-ending L-verbs

This verb class includes both transitive and intransitive verbs (Liljegren, 2016:211). Some of the verb forms have a long accented vowel in the perfective form while others have not (íil versus íl) (Liljegren, 2016:212). The stems of e-ending L-verbs are less affected by umlaut than consonant-ending L-verbs, but more affected by it than a-ending L-verbs.

2.1.4 Other L-verbs

This class includes L-verbs that do not seem to fit into the earlier mentioned categories. Two frequently occurring verbs of this kind are yhe- “come” and ru- “cry” (Liljegren, 2016:213). These L-verbs will not be a part of this study, since they are not described thoroughly in the verb paradigms (Liljegren & Haider, 2011:188-189).

2.2 Natural Language Processing (NLP)

Natural Language Processing (NLP) is a wide term covering basically any type of computer manipulation of natural language (Bird, Klein & Loper, 2010:ix). A natural language is a language like English or Palula, used and evolved by generations of humans, as opposed to artificial languages such as programming languages (Bird, Klein & Loper, 2010:ix). NLP can involve anything from morphological parsing to machine translation (Jurafsky & Martin, 2000:799).

2.2.1 Finite-State Transducer and Two-level Morphology

In the context of Natural Language Processing, parsing refers to the process of analysing the possible underlying syntactic structures of, usually, a sequence of words (Nederhof & Satta, 2010:105).


FST stands for Finite-State Transducer, which is based on finite-state technology. Finite-state technology is based on finite-state automata with nodes that are linked by arcs (Beesley & Karttunen, 2003:4-5). A finite-state automaton (FSA) is described by Beesley and Karttunen as an abstract machine that accepts only the strings of a language. A Finite-State Transducer, on the other hand, maps or encodes a relation between the strings of two languages (Beesley & Karttunen, 2003:261).

A visualization and example of a finite-state network is shown in figure 1. The figure shows an automaton with four states, or nodes, and four arcs. This automaton generates the words “hat” and “has”. “q3” is the end state and is marked by a double line. If several arcs point from the same state, there are several possible generations; in this case both “t” and “s” can be generated after state “q2”, but both arcs lead to the same state (q3).

Figure 1: A finite-state automaton that can generate the words “hat” and “has”.
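The automaton in figure 1 can be written down directly as a transition table. The following minimal Python sketch (an illustration for this section, not part of the thesis) accepts exactly the strings “hat” and “has”:

# Transition table of the automaton in figure 1: four states q0..q3, four arcs.
TRANSITIONS = {
    ("q0", "h"): "q1",
    ("q1", "a"): "q2",
    ("q2", "t"): "q3",  # two arcs leave q2 ...
    ("q2", "s"): "q3",  # ... but both lead to the same end state q3
}
END_STATES = {"q3"}

def accepts(word):
    state = "q0"
    for symbol in word:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no arc for this symbol: the string is rejected
            return False
    return state in END_STATES  # accept only if we stop in an end state

assert accepts("hat") and accepts("has") and not accepts("ham")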

Now the first part of “finite-state transducer” has been covered, but what is a transducer? A transducer is, in common terms, a device that converts energy from one form to another and could be described as a generator. The Finite-State Transducer converts one string of symbols into another string of symbols (Beesley & Karttunen, 2003:13). Two-level morphology is often used as a tool or method within finite-state technology and compiled into a Finite-State Transducer (Koskenniemi, 2013:164). Koskenniemi designed the two-level formalism in 1983 (Karttunen & Beesley, 1992). In two-level morphology there are two morphological levels: the lexical level and the surface level of a word (Gao et al., 2017:64). In the 1960s, Chomsky and Halle introduced a system for describing morphological alternations; these rules are called rewrite rules and are ordered sequentially (Karttunen & Beesley, 1992). Rewrite rules and two-level rules are similar, but in the latter the rules and constraints are applied simultaneously. According to Koskenniemi, there are advantages to choosing a bidirectional model like two-level rules over unidirectional ones like rewrite rules. He describes these advantages in his 1983 dissertation, where he presents a computational model for two-level morphology. When using unidirectional models it is easy to describe the production of words from lexical form to surface form, but not the process in the other direction, since rewriting grammars have rules that cannot be reversed (Koskenniemi, 1983:10). With a bidirectional model like the two-level rule formalism, both directions, generation and analysis, can be studied with equal simplicity, since the formalism provides correspondences and relations between lexical and surface representations rather than “segments being transformed to other segments” (Koskenniemi, 1983:10). The two-level rule formalism thus unifies the processing and study of both word production and word recognition (Koskenniemi, 1983:10).


Figure 2: An example of two-level constraints. The higher level is the lexical level and the lower is the surface level (Figure 3 in Beesley & Karttunen, 2001).

One problem, or advantage, of a finite-state transducer with two-level rules is that the transducer only analyses surface forms that actually exist in the defined language. Unlike statistical methods, the transducer cannot guess any possible classification when given a new word that is not part of the lexicon (Beesley & Karttunen, 2001).

2.2.2 Machine learning

Mohammed describes machine learning as a branch of AI (Artificial Intelligence) that aims to perform tasks by using intelligent software based on statistical learning methods (Mohammed, 2017:4). The machine needs data in order to learn, and supervised machine learning needs labelled training data in order to function. Neural Sequence to Sequence models are supervised machine learning models that map an input sequence to an output sequence (Sutskever et al., 2014:1-2). The Recurrent Neural Network (RNN) is the form of Sequence to Sequence model on which the translator of OpenNMT used in this study is based; for more information on OpenNMT, see chapter 4.2.2. Rezk et al. (2020) explain Recurrent Neural Networks in an intuitive way: the intelligence of humans and most animals depends on having a memory of the past (Rezk et al., 2020:57970).

Memory can be short-term, as when combining segments to make words, or long-term, as when using “he” to refer to someone mentioned by name many, perhaps hundreds of, words earlier (Rezk et al., 2020:57970). RNNs provide this for neural networks. The idea of RNNs is that they add feedback, so that earlier outputs are available while processing the current step (Rezk et al., 2020:57970). The memory cells are supposed to function similarly to human short-term and long-term memory (Rezk et al., 2020:57970). Simply put, RNNs can map sequences to sequences (Sutskever et al., 2014:3). Sequence to Sequence models like RNNs can be used successfully for machine translation of long sentences (Bahdanau et al., 2014). However, Sequence to Sequence models are versatile and can be used for morphological analysis too, as done by Premjith et al. (2018); see chapter 2.3. In this study these machine learning models will be called Sequence to Sequence models or Seq2Seq models.

2.3 Previous research of morphological analysis of low resource languages

In a recent study from late 2019, Uddin & Uddin study the possibilities of creating a well-functioning morphological transducer based on finite-state technology for the language Torwali, which is spoken by roughly 100,000 people in northern Pakistan and is, like Palula, an Indo-Aryan language (Uddin & Uddin, 2019:6). Uddin & Uddin used the Helsinki Finite-State Transducer in their study. Their transducer handles surface and lexical forms written in the Torwali alphabet. Torwali is a low resource language without “robust” morphology sources, and therefore NLP tools for this language are scarce (Uddin & Uddin, 2019:9). However, they faced problems concerning random changes in noun stems, tone variations in nouns, and the behaviour of vowel-ending verbs and nouns (Uddin & Uddin, 2019:9). According to the authors, this was the first attempt to implement Torwali morphology using FST.

Washington et al. (2014) created and studied FST transducers for three different languages and compared these transducers as part of the Apertium project, which aims at creating rule-based machine translation for “lesser resourced languages” (Washington et al., 2014:3378). They also used the HFST toolkit for the design of the transducers. The languages concerned were three closely related Kypchak (or Kipchak, as Glottolog.org spells it) languages: Kazakh, Tatar and Kumyk, with 8 million, 5.4 million and circa 500,000 speakers respectively (Washington et al., 2014:3378-3379). The study shows that each subsequent transducer took less development time than the previous one, since the researchers were able to reuse much of the morphotactic description from the transducers already developed in their project. Also, all the transducers had a coverage of around 90 %, which the authors of the study deem reasonable (Washington et al., 2014:3378).


3 Purpose and research questions

3.1 The purpose of the study

The purpose of the study is to investigate the application of two-level rule morphology and supervised machine learning to morphological analysis of the low resource language Palula. The two-level rule morphology is implemented with a Finite-State Transducer, which is compared to supervised machine learning in the form of neural Sequence to Sequence models. A combination of the two methods is also executed and compared with the other methods: Sequence to Sequence models trained with different amounts of training data generated by the Finite-State Transducer.

3.2 Research questions

The research questions are:

1. What level of accuracy can be achieved with a Finite-State Transducer with two-level morphology-based morphological analysis of L-verbs in Palula, given the data and time resources available? Is it accurate enough to be useful to a linguist?

2. How does the Finite-State Transducer method compare to supervised machine learning?

3. How do the Sequence to Sequence machine learning models perform with training data generated by the Finite-State Transducer?

4. How much does the amount of training data affect the performance of the machine learning models?

4 Material and methods

4.1 Data

The data used for the transducer were verb stems from Liljegren’s online Palula dictionary (2019), along with the inflectional and derivational morphemes collected from paradigms for L-verbs in Liljegren and Haider’s Palula vocabulary (Liljegren & Haider, 2011:188-189). The dictionary provided information about what type of L-verb each verb is, and some inflectional information is usually, but not always, included, along with glossed examples and morpheme segmentation in most entries. Information about possible changes in the verb stem was collected from the dictionary entries as well. One glossing example provided in the online dictionary is glossing example 2 below (Liljegren, 2019). In row 1 of the example you can find the surface word “c̣iinkíla”. The verb is an e-ending L-verb and is analysed as consisting of the morpheme “c̣iinké” as the stem and the suffixes “-íl” and “-a”, inflecting the word as perfective and masculine plural.

(2)

Palula (N.507) A:DLX065 (Liljegren 2019, “c̣iinkíi”)

(1) čalúṭii ooṛá ǰhanduraá paší ba c̣iinkíla.
(2) čalúṭi -ii oóṛ -a ǰhanduraá paš -í ba c̣iinké -íl -a
(3) sparrow -GEN young.one -PL snake see -CV TOP squawk -PFV -MPL
(4) ‘The young birds squawked when they saw a snake.’

The collection of verb stems from entries in the online dictionary was done manually by the author of this study. Only after this was done did the online dictionary become available to the author in text form. However, collecting these verb stems and their inflectional qualities at a slow pace instead of automatically was not all in vain, since it was a good opportunity to get to know the verbs and their properties better.

The rules of the Finite-State Transducer were based on the glossed entries of the online Palula dictionary. The manually annotated data set of Palula L-verbs was collected from data of the online dictionary glossed by Liljegren and Haider. The data from the dictionary consist of glossed examples of sentences containing L-verbs. However, instead of using the English translation of the verb stem along with the glossing tags, e.g. “squawk -PFV -MPL” (see row 3 in glossing example 2), the English translation was replaced by the corresponding Palula verb stem found in row 2. Furthermore, hyphens and punctuation were ignored. This was initially motivated by there being different interpretations of perfective verbs in the manually annotated data available earlier in the process of this study: a perfective verb could either be segmented as an imperfective verb stem with a perfective suffix (c̣iinké-íl) or simply as a perfective verb stem (c̣iinkíl). However, in the annotated data used for the final versions of the methods, the perfective “íl”, “óol” and “éel” are always glossed and segmented as suffixes, not as part of the verb stem. So, the final target example for the training data would be formed as “c̣iinké PFV MPL”, and its corresponding source example would be “c̣iinkíla”. In this context, a source example is the surface form of the verb and a target example is the gloss of the surface form, see table 2.

Table 2: An example of a training data pair for the Sequence to Sequence models of the study.

Source: c̣iinkíla
Target: c̣iinké PFV MPL

Evaluation data was formed in the same way.

Source and target examples were separated into different files, with the line number linking them together. Most of the transcribed data was recorded between 1999 and 2004 (Liljegren & Haider, 2015:xvi). The manually annotated data consist of 583 verb examples, all extracted from a large text file containing all glossing examples and morpheme segmentations from the Palula online dictionary (Liljegren, 2019). The Sequence to Sequence networks were trained with different kinds and amounts of data: apart from the manually glossed data, the networks were also trained on annotated verbs generated by the transducer. More specific information about what type and amount of data was used in which version of the Sequence to Sequence models can be found in chapter 4.2.2.
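To make the file layout concrete, the following Python sketch (the file names are illustrative assumptions, not the actual files of the thesis) writes source and target examples to two parallel files in which line n of one corresponds to line n of the other:

# The pair below is the example from table 2.
pairs = [
    ("c̣iinkíla", "c̣iinké PFV MPL"),  # surface form and its stem plus gloss tags
]

with open("src-train.txt", "w", encoding="utf-8") as src_file, \
     open("tgt-train.txt", "w", encoding="utf-8") as tgt_file:
    for source, target in pairs:
        src_file.write(source + "\n")  # line n of the source file ...
        tgt_file.write(target + "\n")  # ... pairs with line n of the target file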

Two gold standards were used in this study. The gold standards consist of different amounts of the manually annotated verb data set, and were used as evaluation data for the models in this study, containing correct analyses of surface forms of verbs.


All models except Seq2Seq model XXS manual were evaluated with both gold standards. There are two reasons for using two gold standards like these: (1) having the ability to compare the Sequence to Sequence model trained on manually annotated data (Seq2Seq model XXS manual) with the other models and methods based on the same manually annotated evaluation data; (2) being able to compare all the other models by using a gold standard that contains a larger amount of data, i.e. the Large gold standard, which enables a more relevant evaluation.

For the Small gold standard and the training data for Seq2Seq model XXS manual, the total of the manually annotated data was randomly sorted into these two sets. So, as can be seen in figure 3, the Small gold standard and the manually annotated training set of 437 examples do not overlap, and together they make up the Large gold standard, which contains the full set of 583 manually annotated examples.

Figure 3: The manually annotated data is distributed into the data sets used for the training and evaluation of the models of this study: the Small gold standard of 146 forms (25 %) and the manually annotated training set of 437 forms (75 %).

For Sequence to Sequence models L, M, S and XS, explained in further detail in the following chapter 4.2.2, the training data consisted of generations produced by the Finite-State Transducer. The models were trained on 6443, 3222, 1610 and 805 forms respectively; 6443 forms, paired with analyses of these forms, is the maximum that the transducer can produce. As can be seen in the Venn diagram of figure 4, the three smaller data sets consisted of 50 %, 25 % and 12.5 % of the largest data set. The content of the large data set was mixed randomly, so no consideration was given to the smaller sets containing certain amounts of certain forms; a sketch of such a split is shown after figure 4.



Figure 4: The Venn diagram shows how the total of all verb forms generated by the Finite-State Transducer was distributed into the data sets used for training some of the Sequence to Sequence models of this study.
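A minimal sketch of how such random splits can be drawn (the thesis does not publish its splitting script; the seed and the exact way the subsets are drawn are assumptions based on figures 3 and 4):

import random

random.seed(0)  # arbitrary seed; the thesis does not state one

# 583 manually annotated examples -> 437 training / 146 Small gold standard;
# the full list constitutes the Large gold standard (figure 3).
manual = ["manual_%d" % i for i in range(583)]  # placeholder items
random.shuffle(manual)
train_manual, small_gold = manual[:437], manual[437:]
large_gold = manual

# 6443 transducer generations, randomly mixed; the smaller training sets of
# 50 %, 25 % and 12.5 % are drawn from the shuffled full set (figure 4).
generated = ["generated_%d" % i for i in range(6443)]  # placeholder items
random.shuffle(generated)
subsets = {size: generated[:size] for size in (6443, 3222, 1610, 805)}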

Some verb stems were removed from the transducer: L-verbs that do not belong to the three basic groups (e-ending, a-ending and consonant-ending), along with passive and causative verbs. These were removed because they do not occur in the verb paradigms used for the formation of the transducer. However, the number of excluded verb stems was proportionately small.

For an overview of all data used in this study, see Appendix A.


4.2 Methods

4.2.1 The morphological transducer

Machine learning methods always require a training data set, but many languages do not have much annotated data with which to train a machine learning model as a morphological tagger. Using a Finite-State Transducer for morphological analysis of a low resource language like Palula could therefore be a good idea, because of its low data requirements: no training data is needed, only grammatical and orthographical information. For Palula, a grammatical description of the language is available, but there is no excess of annotated data, so these conditions fit the Finite-State Transducer well. The technology used in this study was the Helsinki Finite-State Transducer Technology, HFST. HFST is written in C++, but apart from the HFST command line interface a Python API is available. The notation used in the HFST documentation is based on the Xerox transducer notation (Lindén et al., 2009). In order to make a morphological transducer with two-level rule morphology in the HFST framework, two main files are needed: a lexicon (LEXC) and two-level rules (TWOLC) (Uddin & Uddin, 2019:6). The basic structure of these will be presented later in this chapter.

A transducer like this can be operated in two directions. Surface forms of words can be generated based on all the possibilities the transducer provides, or surface forms can be “looked up” to see how they are analysed by the transducer, and whether they exist in the transducer system at all. The lexicon generates all the underlying forms of the words of the language; it can concatenate all lexical representations of legal words based on the content of the different sub-lexicons. The two-level rule compiler contains rules making it possible to generate surface forms from the underlying forms of the lexicon, or vice versa (Karttunen & Beesley, 1992).
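As an illustration of this workflow, the sketch below compiles the two files and looks up a surface form with the HFST Python API. The file names are assumptions, and the calls follow the examples in the HFST Python documentation rather than the thesis itself:

import hfst

# Compile the two-level rules and the lexicon (assumed file names).
hfst.compile_twolc_file("palula.twolc", "palula.rules.hfst")
transducer = hfst.compile_lexc_file("palula.lexc")

# Read the compiled rules back in and intersect-compose them with the lexicon.
stream = hfst.HfstInputStream("palula.rules.hfst")
rules = []
while not stream.is_eof():
    rules.append(stream.read())
stream.close()
transducer.compose_intersect(rules)

# The composed transducer maps lexical strings to surface strings; invert it
# so that surface forms can be looked up and analysed.
transducer.invert()
transducer.lookup_optimize()
print(transducer.lookup("khoṇḍíia"))  # analyses of a surface form, if any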

Lexicon

The lexicon compiler contains morphemes and their morphotactic combinations, which together make up the information needed for a Finite-State Transducer (Uddin & Uddin, 2019:7). The lexicon can on its own produce all combinations of morphemes that are allowed according to the “rules of succession” stated in the file. Here is an example of an output that could be produced by a lexicon compiler:

“čakaté-prs-msg:čakat{E}-áan-u”

As you can see, morpheme boundaries are visible, and there is a strange “{E}” in the middle of the generated string. This “{E}” is an archiphoneme, which is very useful when writing the two-level rules. Both the archiphonemes and the morpheme boundaries are dealt with when the two-level rule file is applied to the transducer. When there was not enough information about inflection for certain stems, assumptions were made based on verbs with similar qualities. The lack of inflectional information can perhaps be blamed on the low frequency of those stems in the manually annotated Palula data, so some possible errors made in the transducer will pass unnoticed. The verb paradigms used for the formation of the transducer are, as stated by Liljegren and Haider themselves, very general and cannot be used for every L-verb. Because of the generality of the verb paradigms and the small amount of data, the transducer might not produce spotless generations. To see an example of how a lexicon file can be formed, see Appendix B, where the lexicon used for the Finite-State Transducer in this study can be found.


Two-level Rules

So, the lexicon is a text file containing morphemes and morphotactics; the latter includes information about which glossing tags these morphemes should be interpreted as belonging to. The two-level rules have to be formulated as a two-level grammar in order to be acceptable to the Finite-State Transducer system (Karttunen & Beesley, 1992). The two-level grammar begins with an Alphabet and ends with Rules. In the Alphabet, all symbols and possible correspondences between lexical and surface level symbols of the transducer must be specified (Karttunen & Beesley, 1992). When we know what morphemes a word contains, we have to add phonological and orthographical rules in order to be able to generate the surface level of the word. To do this, we use two-level rules. For instance, two-level rules can be used for generating the words “buc̣húṇa” and “buc̣huṇí”, L-verbs built on the same root. These are strings that our lexicon can generate:

“buc̣húṇ-3sg:buc̣h{U2}ṇ-a”
“buc̣húṇ-cv:buc̣h{U2}ṇ-í”

To generate the words we want, we have to give the transducer two rules: one that changes the “ú” in the verb stem to “u”, and one that removes the morpheme boundaries. As can be seen in the example, the “ú” is replaced with “{U2}”, an archiphoneme. An archiphoneme can be used when the realization of a phoneme differs in different contexts. It is also useful in the sense that it makes it easy for the user not to write rules that might accidentally affect more contexts than the ones wanted; for instance, “ú” might occur in other contexts where we want it always to be realized as “ú”. In the two-level rule file, all the possible realizations of the archiphoneme must be specified. In this case, {U2} can be realized either as “u” (as in “buc̣huṇí”) or as “ú” (as in “buc̣húṇa”). Then, rules are given for when each realization applies. The rules are structured in the following way: 1. Corresponding parts, 2. Operator, 3. Context. A rule for the “u”-realization of “{U2}” could be explained in words in the following way: 1. “{U2}” is realized as “u”. 2. In this context only, and always in this context. 3. When followed by a consonant, followed by a morpheme boundary, followed by an “í”. When this rule is given, the archiphoneme “{U2}” will in all other contexts be realized as “ú”. In the HFST formalism this rule can be expressed as follows:

"Accent change in converb"

%{U2%}:u <=> _ (Vow:) Cons: %-: í: ;

Firstly, the correspondence between the lexical level and the surface level is stated: “%{U2%}:u”; then the operator: “<=>”; and lastly, the context: “(Vow:) Cons: %-: í: ;”. In order to use variables like “Vow” for vowels and “Cons” for consonants, the characters these refer to can be stated at the beginning of the two-level rule file under the headline “Sets”, right after the Alphabet. A headline for the rule should be stated above the rule itself. For more detailed examples of the two-level rule formalism of HFST, see Appendix B. If a correspondence is expressed in the Alphabet of the two-level rule file and no specific rule is expressed, the transducer will assume that the correspondence can be realized in any context. A rule to remove the morpheme boundaries would look like this in the HFST formalism:

"Remove morpheme boundary" %-:0 <=> _ ;

When we combine the lexicon and the rules, the transducer can generate these strings:

“buc̣húṇ-3sg:buc̣húṇa”
“buc̣húṇ-cv:buc̣huṇí”


Table 3: A simplified overview of the different steps in the process of creating the desired generations with the help of two-level rules.

Step in process | Output or rule
1. Generations made by the lexicon compiler | “buc̣húṇ-3sg:buc̣h{U2}ṇ-a”, “buc̣húṇ-cv:buc̣h{U2}ṇ-í”
2. Two-level rule | "Accent change in converb" %{U2%}:u <=> _ (Vow:) Cons: %-: í: ;
3. Combining lexicon and two-level rule | “buc̣húṇ-3sg:buc̣húṇ-a”, “buc̣húṇ-cv:buc̣huṇ-í”
4. Two-level rule | "Remove morpheme boundary" %-:0 <=> _ ;
5. Combining lexicon and two-level rules | “buc̣húṇ-3sg:buc̣húṇa”, “buc̣húṇ-cv:buc̣huṇí”
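The effect of the two rules can also be imitated with plain string operations. The following Python sketch is not HFST, only an illustration of the two steps of table 3 applied to the surface side of the lexicon output, with a crude stand-in for the real rule context:

def realize(surface):
    # Step 1, "Accent change in converb": {U2} is realized as "u" when a
    # consonant, a morpheme boundary and an "í" follow; otherwise as "ú".
    if surface.endswith("-í"):  # crude stand-in for the rule context
        surface = surface.replace("{U2}", "u")
    else:
        surface = surface.replace("{U2}", "ú")
    # Step 2, "Remove morpheme boundary": "-" is realized as nothing.
    return surface.replace("-", "")

print(realize("buc̣h{U2}ṇ-a"))  # -> buc̣húṇa
print(realize("buc̣h{U2}ṇ-í"))  # -> buc̣huṇí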

4.2.2 Machine learning

For the machine learning, this study relies on OpenNMT, an open source neural machine translation framework. Even though OpenNMT is a translation system, it can be used as a tagging guesser if we provide training data for this purpose. Neural Machine Translation (NMT) is a methodology for machine translation widely used for different types of NLP tasks (Klein et al., 2018). The OpenNMT toolkit is a collection of implementations of NMT; the implementation used for this study is OpenNMT-py, and it is the machine translation in OpenNMT-py that is useful here. The Sequence to Sequence model is based on the work of Bahdanau et al. (2014). One of the models in the present study was trained on 437 manually annotated inflected verbs, and four models were trained with generated training data from the transducer, with 6443, 3222, 1610 and 805 verb examples respectively; 6443 unique examples was the maximum the transducer could produce. The default settings were used for training, apart from the word vector size, which is 100 for both encoder and decoder. The encoder extracts a representation from an input string and the decoder generates a translation of this representation (Cho et al., 2014:1). Word vector size is the same as word embedding size. The models require validation data, so the Small gold standard of 146 examples was used for this purpose for Seq2Seq model XXS manual, and the Large gold standard of 583 examples was used for the other models. A specification of the different models follows, and after it a sketch of a typical training invocation.

The Sequence to Sequence models were trained with the following data:

Seq2Seq model XXS manual: manually annotated training data set consisting of 437 examples.
Seq2Seq model XS: transducer-generated training data set consisting of 805 examples.
Seq2Seq model S: transducer-generated training data set consisting of 1610 examples.
Seq2Seq model M: transducer-generated training data set consisting of 3222 examples.
Seq2Seq model L: transducer-generated training data set consisting of 6443 examples.
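A sketch of a typical training invocation, assuming OpenNMT-py 1.x and the parallel source/target files described in chapter 4.1 (file and model names are illustrative, not the actual setup of the thesis):

import subprocess

# Build the OpenNMT-py data shards from the parallel files; the gold standard
# files serve as validation data.
subprocess.run(["onmt_preprocess",
                "-train_src", "src-train.txt", "-train_tgt", "tgt-train.txt",
                "-valid_src", "src-val.txt", "-valid_tgt", "tgt-val.txt",
                "-save_data", "data/palula"], check=True)

# Train with default settings except the word vector (embedding) size, set to
# 100 for both encoder and decoder, as described above.
subprocess.run(["onmt_train",
                "-data", "data/palula", "-save_model", "models/palula",
                "-word_vec_size", "100"], check=True)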


Seq2Seq model L was thus trained on the full set of transducer generations, model M on circa 50 %, model S on circa 25 %, and model XS was trained on circa 12.5 %. The training data for Seq2Seq models M, S and XS was randomly selected from the complete generations of the transducer.

4.3 Evaluation

The output of the methods was evaluated based on exact match tagging accuracy. The performance of all models is measured in Recall, Precision and Accuracy, and all methods and models also receive an F1-score. Why these different measures are used, and how they are calculated, is explained in the following paragraphs. The quantity of evaluation data is limited, which is the general situation for low-resource languages: in this case, the Small gold standard consists of just 146 verbs and the Large gold standard of 583. All methods and models were compared to both gold standards, except for Seq2Seq model XXS manual, which was only evaluated with the Small gold standard. See table 4 for the calculations of Precision, Recall, Accuracy and F1-score. These calculations are based on four metrics (Lisbach & Meyer, 2013:8,158):

• True Positives (TP): correct forms analysed correctly by the models/methods
• False Negatives (FN): correct forms overlooked by the models/methods
• True Negatives (TN): irrelevant forms that are not returned by the models/methods
• False Positives (FP): forms analysed incorrectly by the models/methods

An F1-score, or harmonic mean, is a measurement of accuracy based on Precision and Recall and is computed for each method or model. However, it is only for the results of the transducer that Precision and Accuracy will not be the same. Accuracy is presented as a percentage, while Recall, Precision and F1-score are presented as ratios. Accuracy is calculated as the number of correct predictions divided by the total number of predictions made (Pedregosa et al., 2011). If two analyses are made for the same word, they are in this context interpreted as one prediction; this is only relevant for the Finite-State Transducer. The calculation of Accuracy can also be found in table 4.

The advantage of using these different kinds of measurements is the possibility of reviewing several aspects of the performance of the methods used. Also, because of syncretism, a surface form having several possible correct gloss analyses, two variants of Recall, Precision, Accuracy and F1-score are computed for the transducer. These two variants will be called Best case and Worst case. The Best case interpretation is considerate of syncretism in the sense that if the transducer produces several analyses of one surface form and one of these is correct, the generation is counted as correct, a True Positive. In the Worst case interpretation, all cases where more than one analysis is retrieved from the transducer are counted as incorrect, False Positives. The transducer, however, never produces more than two analyses for one surface form.

Table 4: The calculations of Recall, Precision, F1-score and Accuracy used for the evaluation in this study (Lisbach & Meyer, 2013:158).

Recall = TP / (TP + FN) = relevant forms identified / all relevant forms
Precision = TP / (TP + FP) = relevant forms identified / all forms identified
Accuracy = (TP + TN) / total number of predictions made
F1-score = 2 * ((Precision * Recall) / (Precision + Recall))
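For concreteness, the calculations in table 4 can be written as a small Python function. The counts in the usage example are made-up numbers, not results from the study:

def evaluation_scores(tp, fp, fn, tn=0):
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    # Accuracy: correct predictions divided by all predictions made.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {"Recall": round(recall, 2), "Precision": round(precision, 2),
            "Accuracy [%]": round(100 * accuracy), "F1-score": round(f1, 2)}

# Made-up counts: 100 evaluation forms, of which 66 were analysed correctly,
# 6 incorrectly, and 28 received no analysis at all.
print(evaluation_scores(tp=66, fp=6, fn=28))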


5 Results

Firstly, an overview of the Recall, Precision, Accuracy and F1-score of the Seq2Seq models and the Finite-State Transducer is presented in tables 5 and 6. Table 5 shows the results for evaluations with the Large gold standard, and table 6 the results for evaluations with the Small gold standard. In table 7, basic information concerning the models is restated, and below table 7 basic information concerning the Finite-State Transducer is presented. Thereafter, each research question is answered concisely in one section each.

Table 5: Recall, Precision, Accuracy and F1-score of the models and methods used in this study when evaluated against the Large gold standard.

Analyzer | Recall [ratio] | Precision [ratio] | Accuracy [%] | F1-score [ratio]
Seq2Seq model XXS manual | - | - | - | -
Seq2Seq model XS | 1.00 | 0.68 | 68 | 0.81
Seq2Seq model S | 1.00 | 0.77 | 77 | 0.87
Seq2Seq model M | 1.00 | 0.78 | 78 | 0.88
Seq2Seq model L | 1.00 | 0.81 | 81 | 0.90
Finite-State Transducer, Best case | 0.70 | 0.92 | 66 | 0.79
Finite-State Transducer, Worst case | 0.64 | 0.85 | 57 | 0.73

Table 6: Recall, Precision, Accuracy and F1-score of the models and methods used in this study when evaluated against the Small gold standard.

Analyzer | Recall [ratio] | Precision [ratio] | Accuracy [%] | F1-score [ratio]
Seq2Seq model XXS manual | 1.00 | 0.63 | 63 | 0.77
Seq2Seq model XS | 1.00 | 0.77 | 77 | 0.87
Seq2Seq model S | 1.00 | 0.84 | 84 | 0.91
Seq2Seq model M | 1.00 | 0.84 | 84 | 0.91
Seq2Seq model L | 1.00 | 0.88 | 88 | 0.93
Finite-State Transducer, Best case | 0.76 | 0.99 | 76 | 0.86
Finite-State Transducer, Worst case | 0.71 | 0.92 | 66 | 0.80

Table 7: The basic properties of the models used in this study are presented in this table.

Finite-State Transducer

The transducer was formed based on Palula verb paradigms from Liljegren and Haider’s 2011 Palula vocabulary and other inflectional information from the Palula online dictionary (Liljegren & Haider, 2011; Liljegren, 2019).

1. What level of accuracy can be achieved with a Finite-State Transducer with two-level morphology-based morphological analysis of L-verbs in Palula, given the data and time resources available? Is it accurate enough to be useful to a linguist?

The results for the Finite-State Transducer evaluated with the Large gold standard in the Best case are a Recall of 0.70, a Precision of 0.92, an Accuracy of 66 % and an F1-score of 0.79, as can be seen in table 5. The results in the Worst case are a Recall of 0.64, a Precision of 0.85, an Accuracy of 57 % and an F1-score of 0.73. With a Best case Precision of 0.92, the transducer can probably be useful for automatic analysis of surface forms of Palula L-verbs. The Best case Precision counts a surface form as a True Positive if at least one of the generated analyses for it is correct. The transducer took about three months of part-time work to complete; this work included getting to know the Palula L-verbs as well as the concepts and design of the Finite-State Transducer and two-level morphology.

2. How does the Finite-State Transducer method compare to supervised machine learning?

This is a comparison between the Finite-State Transducer and Seq2Seq model XXS manual. To answer this question, the results of the evaluation with the Small gold standard are used (see table 6), so that the two methods can be assessed on equal grounds. The Finite-State Transducer seems to be the better option based on F1-score compared to Seq2Seq model XXS manual, which is trained solely on manual annotation: the transducer has higher scores than Seq2Seq model XXS manual in Precision, Accuracy and F1-score in both Best and Worst case. However, like all the Seq2Seq models of this study, Seq2Seq model XXS manual received a Recall score of 1.00, while the transducer received a Recall score of only 0.76 in Best case with the Small gold standard.

3. How do the Sequence to Sequence machine learning models perform with training data generated by the Finite-State Transducer?

Seq2Seq model L achieved the highest scores in Recall (1.00), Accuracy (81 %) and F1-score (0.90) in the evaluation made with the Large gold standard, see table 5. Based on these three measurements, this combination of methods was the most successful one tested in this study. However, Seq2Seq model L received a Precision score of 0.81, which is only the second best Precision score among the models and methods used in this study; the Finite-State Transducer Best case achieved the highest Precision score of 0.92.

4. How much does the amount of training data affect the performance of the machine learning models?


6 Discussion

6.1 Discussion of each research question

Research question 1: Level of accuracy of the Finite-State Transducer.

The transducer could be useful to a linguist to some extent, since the Precision is 0.92 in the Best case with the Large gold standard, see table 5. However, the Recall of the same method is 0.70, the Accuracy 66 % and the F1-score 0.79. The analysis is usually right when present, since the Precision score is quite high, even though there are many surface forms that the transducer cannot analyze. This means that even though a linguist would have to gloss 30 % of the surface forms themselves, the part of the data that the transducer has analyzed is not in need of excessive manual rechecking.

Research question 2: The transducer compared to the supervised machine learning.

Since the Small gold standard is so small, the accuracy of Seq2Seq model XXS manual cannot be considered as valid as the results based on the Large gold standard. Comparing Seq2Seq model XXS manual and the Finite-State Transducer, the latter must be considered the more appropriate one, because of the uncertainty in the evaluation of the supervised machine learning model. Considering syncretism, the transducer has the upper hand on the Seq2Seq models, since the latter only generate one analysis for each surface form. The transducer could have been improved, but given the time limit some improvements have not been incorporated. The downside of the transducer is that the Recall rate is only 0.70 in the Best case evaluated with the Large gold standard, see table 5.

Research question 3: Performance of the supervised machine learning models trained on data generated by the Finite-State Transducer.

The Seq2Seq model L trained with data generated by the transducer was the method with the highest F1-score of 0.90. This is based on the evaluation made with the Large gold standard, see table 5.

Research question 4: Importance and impact of the amount of training data.

The amount of training data is important. The difference in Accuracy and Precision is 3 percentage points when comparing Seq2Seq models L and M, which have training sets of 6443 and 3222 forms respectively. The difference in performance between models M and S is even smaller, 1 percentage point in Accuracy. The models with fewer than 1610 training examples performed slightly worse. The amount of training data matters, but the difference between a model trained on 1610 forms and one trained on 6443 forms is no larger than 4 percentage points in Accuracy. This answer is based on the evaluation made with the Large gold standard, see table 5.

6.2 Discussion of methods

There are aspects of the methods of this study that can be discussed critically. Firstly, having access to data in an appropriate format is essential; otherwise time-consuming manual work is required. However, as stated earlier in the study, the manual collection of verb stems from the Palula online dictionary added to the author’s much needed familiarity with the language (Liljegren, 2019). Secondly, the inexperience of the researcher of this study regarding the studied language negatively affected the amount of time spent on error checking the generations made by the transducer.


when formulating the transducer would also be in favour of the result. If one or both of these aspects were satisfied, a better Recall score would probably be obtained for the transducer, and thereby for the Seq2Seq models trained with data generated by it. It can be stated that a thorough understanding of the language material is needed in order to produce a well-functioning transducer.

Another reason for a closer collaboration with a linguist scholar of the language is the tagging. When using data from field linguists, a completely standardized tagging format of the data is often not the case. Of course, glossing should not be done in order to make things easy for a hypothetical program or a computational linguist. Deviations from tagging standards are most likely not a devastating problem, but can be slightly time-consuming to detect.

The lexicon and rules for the transducer were based on verb paradigms from a book along with information from the online dictionary. However, too little consideration was given to deviations from these paradigms. In order to make the transducer work properly on deviating verbs, a closer consideration of glossing examples and information found in the Palula grammar and dictionary should be made. According to Liljegren, the 20 most common verbs in Palula constitute 78 % of the total verb usage (Liljegren, 2015:205). So, because of data sparsity, authentic examples of verbs inflected with different suffixes are not available in excess in the glossed data. This shortcoming can be addressed by simply spending more time on background knowledge and the design of the transducer. Since the combination of the generations from the Finite-State Transducer and the Sequence to Sequence model was the most successful method tested in this study regarding Recall, Accuracy and F1-score, generating training data for a supervised machine learning model seems promising, especially because the small resources of annotated data can then be used in their entirety as evaluation data, and the methods can be validated with greater certainty.

6.3 Discussion of results

Regarding the level of accuracy of the transducer, and whether it could be useful to a linguist or not, many aspects need to be considered. Firstly, the Best case Precision of 0.92 and Recall of 0.70 are probably the most accurate numbers to consider, since syncretism must be taken into account; see chapter 4.3 for the definitions of Best case and Worst case. A Precision of 0.92 can probably be counted as a good result, while a Recall of 0.70 is not ideal and a higher Recall would be desirable. Concerning the Recall of 0.70, there seem to be several reasons for the missing analyses. Errors concerning stem changes and vowel lengthening in suffixes in certain contexts were found in the output of the transducer when error checking the instances where the transducer did not provide an analysis. A few errors were also found to be caused by the transducer not being able to produce analyses with a certain tag; this tag was, however, not represented in the verb paradigm used for the formation of the transducer of this study (Liljegren & Haider, 2011:188).

The transducer would be improved if the problem with syncretism were solved. One action could be to use weights in the transducer, so that the likelihood of each possible analysis would be presented and the analyses sorted by likelihood. However, this might not be easily calculated if the available language resources are very small and not representative. Weights like these should perhaps be formed by, or with, someone highly proficient in the language if the resources available are limited.

The performance of the transducer and the Seq2Seq model with manually annotated training data are not easily compared, for several reasons. They have different characteristics. The Seq2Seq models have a guaranteed spotless Recall score, since no False Negatives are possible given the design of the gold standards in combination with the function of the models: a Seq2Seq model always provides an analysis of the surface forms in the evaluation data. The transducer does not always provide analyses, and so its Recall score is not spotless. Also, the Finite-State Transducer takes syncretism into account, while the Seq2Seq models only generate one possible analysis.


surpasses the teacher, the Finite-State Transducer. Seq2Seq model L had a Recall of 1.00, a Precision of 0.81, an Accuracy of 81 % and an F1-score of 0.90 in the evaluations based on the Large gold standard. However, this model did not achieve the best Precision score in the study; the Finite-State Transducer achieved a Precision of 0.92 in the Best case (see definition in chapter 4.3). Concerning the importance and impact of the amount of training data, the small difference in Accuracy and Precision of 3 percentage points between models L and M is positive, in the sense that less training data can provide results almost as good as a model trained on twice the amount. The amount of training data certainly has an impact on the results, but perhaps a smaller amount of training data is needed for this type of project, since the usage is very specific in comparison to a Seq2Seq model used for translation.

How do the results of this study compare to earlier studies? Washington et al. (2014) generally achieved better Precision scores than the methods and models of this study of Palula L-verbs; the lowest Precision score of the models in their study was 0.95 (Washington et al., 2014:3384). When it comes to Recall, the models of the Palula study all performed better than the models in the Washington et al. study, which contains three transducer models with Recall scores of 0.58, 0.69 and 0.86 (Washington et al., 2014:3384). However, these studies are quite different, in the sense that Washington et al. (2014) cover larger parts of the languages studied, and their gold standard used for evaluation was 1000 words, i.e. bigger than the ones used in this study. Also, the Recall scores for the Seq2Seq models of this Palula study are always 1.00, since there is no possibility of False Negatives for these models. The Seq2Seq models used in Premjith et al. (2018) received an F1-score of circa 0.98 each, compared to the highest F1-score in this study of 0.90 for Seq2Seq model L (Premjith et al., 2018:52). The study of the Torwali language by Uddin & Uddin (2019) does not present any calculations of accuracy, but concludes that the Helsinki Finite-State Transducer Technology was found to be a good option for the first ever implementation of Torwali morphology, although there is still work to be done in order to have a more complete morphological analyser of Torwali (Uddin & Uddin, 2019:9). Just like Uddin & Uddin (2019) the Finite-State


7 Conclusions

The task of automatic morphological analysis of Palula L-verbs using Finite-State Transducer technology with two-level rules, in combination with Sequence to Sequence machine learning models, has proved to be possible and successful.

Evaluated against the Large gold standard, the transducer achieved a Recall score of 0.70 and a Precision score of 0.92 in the Best case, and a Recall of 0.64 and a Precision of 0.85 in the Worst case.

The Best case results should be considered the most valid, because they take syncretism, surface forms having several possible correct analyses, into account.

Finite-State Transducers are perhaps best suited for a linguist with great knowledge of the language, or should be designed in close collaboration with someone proficient in the language. Sequence to Sequence models on their own do not seem to be the best choice when the accessible training data set is very small. However, combining the Finite-State Transducer and Sequence to Sequence machine learning, by using 6443 generations of the transducer as training data for a Seq2Seq model, proved to be the most successful method tested: it achieved an F1-score of 0.90 when reviewing the results based on the Large gold standard, Best case.

The amount of training data provided for the Seq2Seq models did not seem to affect the results hugely when comparing a model trained on 6443 examples, one trained on 3222 examples, and one trained on 1610 examples. However, the Seq2Seq models trained with 805 forms or fewer performed slightly worse.

When reviewing the results on the Small gold standard, the Seq2Seq model trained on only 437 manually annotated verbs received an Accuracy of 63 % and an F1-score of 0.77. The relevance of these results can, however, be questioned, since the gold standard used here contained only 146 forms. Thus, the amount of training data seems to have the largest effect on performance when the number of training examples falls somewhere below 1610.

In conclusion, a combination of the two methods seems to be the most promising method tested in this study, based on the measurements of Recall, Accuracy and F1-score. The Finite-State Transducer, however, received an F1-score only slightly worse than that of the Seq2Seq model trained on 6443 transducer-generated forms.

The Finite-State Transducer received the highest Precision score of all, 0.92, for the Large gold standard in the Best case.

The idea of generating correct forms in order to obtain a larger training data set seems well suited to the situation of having small annotated data resources. Using generated data as training data for such a model leaves all, or at least a larger part, of the small amount of manually annotated data available for use as a gold standard. Even though the Finite-State Transducer does not reach as good a Recall score as the Seq2Seq models, it provides several analyses when the same surface form of a verb has several possible correct analyses (syncretism).

The problem of syncretism could be further dealt with by using weights for the analyses of the surface forms.
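A possible sketch of such a setup, assuming the transducer has been compiled with weights so that a lower weight marks a more plausible analysis, and assuming the HFST Python bindings allow lookup on the inverted transducer (the file name, surface form and weights below are hypothetical):

import hfst

# Load the transducer and invert it, so that it maps surface forms to analyses.
stream = hfst.HfstInputStream('palula_lverbs.hfst')
analyzer = stream.read()
stream.close()
analyzer.invert()

# With a weighted transducer, lookup returns (analysis, weight) pairs,
# e.g. (('ṭok-prs-3sg', 0.5), ('ṭok-imp-sg', 2.0)); under the tropical
# semiring the lowest weight marks the preferred analysis.
results = analyzer.lookup('ṭokáanu')  # hypothetical surface form
if results:
    best_analysis, best_weight = min(results, key=lambda pair: pair[1])
    print(best_analysis)

The other, lower-ranked analyses need not be discarded; they could be kept as alternatives, so that syncretic forms still receive all their possible readings in ranked order.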

The transducer and Seq2Seq model L, trained on transducer-generated data sets, can in their current state be useful for automatic morphological analysis of Palula L-verbs. With time and effort the transducer can be improved without great difficulty, and the data generated by such an improved transducer will in turn result in a better performing Seq2Seq model.


References

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Beesley, K. R., & Karttunen, L. (2001). A short history of two-level morphology. Collected 10/3 2020 from: http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/

Beesley, K. R., & Karttunen, L. (2003). Finite State Morphology. CSLI Publications.

Bird, S., Klein, E., & Loper, E. (2010). Natural Language Processing with Python (First Edition). CA: O’Reilly Media, Inc.

Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Eberhard, D. M., Simons, G. F., & Fennig, C. D. (eds.). (2020). Ethnologue: Languages of the World. Twenty-third edition. Dallas, Texas: SIL International. Collected 10/3 2020 from online version: http://www.ethnologue.com

D, U., Gao, G., I, B., & B, N. (2017). Using The Two-level Morphology on Modern Mongolian Linguistics. Proceedings of the Mongolian Academy of Sciences, 52(4), 63-69.

Gvozdanović, J. (2012). Perfective and Imperfective Aspect. In Binnick, R. (Ed.), The Oxford Handbook of Tense and Aspect. Oxford University Press.

Helsinki Finite-State Transducer Technology. (2018). Collected 27/3 2020 from: https://github.com/hfst/python/wiki

Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing. Prentice Hall, New Jersey.

Karttunen, L., & Beesley, K. R. (1992). Two-level rule compiler. Xerox Corporation, Palo Alto Research Center. Collected 11/3 2020 from: http://staff.um.edu.mt/mros1/nlp/fsa/twolc92.html

Koskenniemi, K. (1983). Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Diss. Helsingfors: Univ. Helsinki.

Koskenniemi, K. (2013). An informal discovery procedure for two-level rules. Journal of Language Modelling, 1(1), 155-188.

Klein, G., Kim, Y., Deng, Y., Nguyen, V., Senellart, J., & Rush, A. M. (2018). OpenNMT: Neural Machine Translation Toolkit. Collected 27/3 2020 from: https://arxiv.org/pdf/1805.11462.pdf

Liljegren, H. (2019). Palula dictionary. Dictionaria 3. 1-2700. DOI: 10.5281/zenodo.3066952. Collected 2/4 2020 from: https://dictionaria.clld.org/contributions/palula

Liljegren, H. (2016). A grammar of Palula (Studies in Diversity Linguistics 8). Berlin: Language Science Press.

Liljegren, H., & Haider, N. (2011). Palula Vocabulary. Islamabad: Forum for Language Initiatives.

Liljegren, H., & Haider, N. (2015). Palula texts. In FLI Language and Culture Series. Islamabad: Forum for Language Initiatives.

Lindén, K., Silfverberg, M., & Pirinen, T. (2009). HFST Tools for Morphology – An Efficient Open-Source Package for Construction of Morphological Analyzers. In State of the Art in Computational Morphology: Workshop on Systems and Frameworks for Computational Morphology, SFCM 2009, Zurich, Switzerland, September 4, 2009. Proceedings, 28-47. DOI: 10.1007/978-3-642-04131-0_3


Mohammed, M., Khan, M. B., & Bashier, E. B. M. (2017). Machine learning algorithms and applications. Boca Raton: CRC Press.

Nederhof, M., & Satta, G. (2010). Theory of Parsing. In Clark, Fox & Lappin (Eds.), The Handbook of Computational Linguistics and Natural Language Processing (pp. 105-130). West Sussex: Blackwell Publishing Ltd.

OpenNMT-py. (2017). Collected 2/4 2020 from: https://opennmt.net/OpenNMT-py/main.html

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825-2830.

Premjith, B., Soman, K. P., & Kumar, M. A. (2018). A deep learning approach for Malayalam morphological analysis at character level. Procedia Computer Science, 132, 47-54.

Rezk, N., Purnaprajna, M., Nordström, T., & Ul-Abdin, Z. (2020). Recurrent Neural Networks: An Embedded Computing Perspective. IEEE Access, 8, 57967-57996.

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 3104-3112.

Uddin, N., & Uddin, J. (2019). A step towards Torwali machine translation: an analysis of morphosyntactic challenges in a low-resource language. In Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, 6-10.

Washington, J. N., Salimzyanov, I., & Tyers, F. M. (2014). Finite-state morphological transducers for three Kypchak languages. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), 3378-3385.


Appendices

Appendix A: Data overview

An overview of the data collected for the methods of this study can be found in table 8.

Table 8: An overview of the type of data used in the study and for which applications.

Application: Verb stems for the Finite-State Transducer
Data: Verb stems from Liljegren’s online Palula dictionary (2019)

Application: Inflectional information of most verbs for the Finite-State Transducer
Data: Inflectional information from Liljegren’s online Palula dictionary (2019)

Application: Inflectional and derivational morphemes for the Finite-State Transducer
Data: Paradigms for L-verbs from Liljegren and Haider’s Palula vocabulary (Liljegren & Haider, 2011:188-189)

Application: Data for gold standards and training data for Seq2Seq model XXS manual
Data: Manually annotated data set of Palula L-verbs collected from the glossed example sentences from entries in Liljegren’s online Palula dictionary (2019). Total of manually annotated data set/Large gold standard: 583 forms. Manually annotated training set: 437 forms. Small gold standard: 146 forms.

Application: Training data for Seq2Seq models XS to L
Data: Annotated data sets generated by the Finite-State Transducer. Training set for Seq2Seq L: 6443 forms. Training set for Seq2Seq M: 3222 forms. Training set for Seq2Seq S: 1610 forms. Training set for Seq2Seq XS: 805 forms.

Appendix B: Transducer model

The Lexicon and Two-level rules used for the transducer made for this study are presented below. In both these files, comments can be made following an exclamation mark (“!”).

Lexicon

Multichar_Symbols


%-f           ! fem singular
%-fpl         ! fem plural
%-prs         ! present
%-1sg
%-2sg
%-3sg
%-1pl
%-2pl
%-3pl
%-imp         ! imperative
%-inf         ! infinitive
%-cv          ! converb
%-oblg        ! obligative
%-cprd        ! copredicative participle
%-vn          ! verbal-noun
%-ag          ! agentative verbal-noun
%{A%}         ! archiphoneme-tag
%{A2%}        ! archiphoneme-tag
%{E%}         ! archiphoneme-tag
%{E2%}        ! archiphoneme-tag
%{I%}         ! archiphoneme-tag
%{I2%}        ! archiphoneme-tag
%{U2%}        ! archiphoneme-tag
%{O2%}        ! archiphoneme-tag
%{AE%}        ! archiphoneme-tag, L:e either aa or ee
%{AE2%}       ! archiphoneme-tag, L:cons, a, á, e, o
%{AE3%}       ! archiphoneme-tag, L:e aa or ee (also in prs)
%{AAcons%}    ! archiphoneme-tag, either á or áa (imp)
%{AAcons2%}   ! archiphoneme-tag, either a or áa (imp)
%{OEcons%}    ! archiphoneme-tag, either óo or ee (imp)
%{OUcons%}    ! archiphoneme-tag, óo --> úu (imp)
%{EIcons%}    ! archiphoneme-tag, ée --> íi (imp)
%{CORR%}      ! archiphoneme-tag, corr


LEXICON IMPERF1PL
%-1pl : # ;

LEXICON IMPERF2SG
%-2sg : # ;

LEXICON IMPERF2PL
%-2pl : # ;

LEXICON IMPERF3SG
%-3sg : # ;

LEXICON IMPERF3PL
%-3pl : # ;

LEXICON IMP
 : # ;
%-imp%-pl : %-óoi # ;

LEXICON IMPCONS
 : # ;
%-imp%-pl : %-ooi # ;

LEXICON INF
%-inf : %-áa%{I%} # ;

LEXICON INFA
%-inf : %-áai # ;

LEXICON CONVCONSE
%-cv : %-í # ;

LEXICON CONVA
%-cv : %-aá # ;

LEXICON OBL
%-oblg : %-eeṇḍeéu # ;

LEXICON CPRD
%-cprd : %-íim # ;

LEXICON VN
%-vn : %-ainií # ;

LEXICON AGVN


silaawé:silaaw%{E%} STAME ;
ṭoké:ṭok%{E%} STAME ;
ẓoké:ẓok%{E%} STAME ;
ṣóoṣé:ṣóoṣ%{E%} STAME ;
ǰoṣé:ǰoṣ%{E%} STAME ;
ṣuunké:ṣuunk%{E%} STAME ;

Two-level rules

Alphabet
a á e é o ó y ý i í u ú w r t p s d l k j h g f z x c v b n m c
A Á E É O Ó Y Ý I Í U Ú W R T P S D L K J H G F Z X C V B N M C
ṭ č ḍ č c ̣č ṣ š ẓ ǰ ṇ ṛ

%{A%}:0 %{A%}:á %{A%}:a %{E%}:0 %{E%}:á %{I%}:i %{I%}:0 %{AE%}:a %{AE%}:e %{AE%}:á %{AE3%}:a %{AE3%}:e %{A2%}:a %{A2%}:á %{U2%}:u %{U2%}:ú %{E2%}:e %{E2%}:é %{O2%}:o %{O2%}:ó

%{AE2%}:a %{AE2%}:e %{AE2%}:ó %{AE2%}:á %{AAcons2%}:á %{AAcons2%}:a

%{I2%}:i %{I2%}:í %{CORR%}:Vy

%{CORR2%}:a %{CORR2%}:0

%{OUcons%}:ó %{OUcons%}:ú %{OUcons%}:o %{EIcons%}:í %{EIcons%}:é %{EIcons%}:e ;

Sets

Vow = a á e é o ó y ý i í u ú ;

Cons = w r t p s d l k j h g f z x c v b n m c y ý ṭ č ḍ č c ̣č ṣ š ẓ ǰ ṇ ṛ ;

Rules

"Remove morpheme boundary" %-:0 <=> _ ;

" á is realised in imperative" %{A%}:á <=> _ # ;

" é is á in imperative" %{E%}:á <=> _ # ;

" á is a in coprec and agentative" %{A%}:a <=> _ %-: í: i: m: # ;

" stemchange 1,umlaut, in L:e,cons-verbs "

%{AE%}:e <=> _ %{CORR%}:e Cons: (%{E%}:0)* %-: í: l: ; " stemch umlaut1"

%{AE%}:a /<= _ %{CORR%}:a Cons: (%{E%}:0)* %-: í: l: ; " stemchange 3 in imp á - a sen corr"

%{AAcons2%}:á => _ %{CORR2%}:a Cons:+ ( %-: ó: o: i:)* # ; "stemch2 a: o in imperative"

(36)

32 %{AE2%}:á => _ (%{CORR%}:a) Cons:+ %-: [ u: m: | a: ṛ: | a: t: | a: n: | a: ] # ;

"stemch 2.2"

%{AE2%}:e => _ %{CORR%}:e Cons:+ %-: [ í: # | í: l:] ; "stemch 2.3"

%{AE2%}:a /<= _ [%{CORR%}:a Cons:+ ( %-: o: o: i:)* # | (%{CORR%}:a) Cons:+ %-: [ u: m: | a: ṛ: | a: t: | a: n: | a: ] # | %{CORR%}:a Cons:+ %-: [ í: # | í: l:]] ;

" stemch 3:1"

%{AAcons2%}:a /<= _ %{CORR2%}:a Cons:+ ( %-: ó: o: i:)* # ; " stemchange 4 imperative ó : ú"

%{OUcons%}:ú <=> _ %{CORR%}:u Cons:+ ( %-: o: o: i:) # ; " stemch 4"

%{OUcons%}:o /<= _ %{CORR%}:o Cons:+ ( %-: o: o: i:)* # ; "stemch 4.2"

%{OUcons%}:ó <=> _ (%{CORR%}:o) Cons:+ %-: [ u: m: | a: ṛ: | a: t: | a: n: | a: ] ; " stemchange 5 imperative é : í"

%{EIcons%}:í <=> _ %{CORR%}:i Cons:+ ( %-: ó: o: i:)* # ; " stemch 5"

%{EIcons%}:e /<= _ %{CORR%}:e Cons:+ ( %-: ó: o: i:)* # ; " stemch 5.2"

%{EIcons%}:é <=> _ (%{CORR%}:e) Cons:+ %-: [ u: m: | a: ṛ: | a: t: | a: n: | a: ] ; "stemchange 6"

%{AE%}:a /<= _ %{CORR%}:a Cons:+ ( %-: [ó: | o:] o: i:) # ; "stemchange 6.3 imperfective à :a "

%{AE%}:á <=> _ (%{CORR%}:a) Cons:+ %-: [ u: m: | a: ṛ: | a: t: | a: n: | a: ] ; "stemchange 6.4"

%{AE%}:á /<= _ %{CORR%}:o Cons:+ ( %-: o: o: i:) # ; "stemchange 7 in cv á : e"

%{AE%}:e <=> _ %{CORR%}:e Cons:+ (%{E%}:0) %-: í: # ; "stemchange 7"

%{AE%}:a /<= _ %{CORR%}:a Cons:+ (%{E%}:0) %-: í: # ; "stemchange 8"

%{AE3%}:e <=> _ %{CORR%}:e Cons:+ (%{E%}:0) %-: [ í: l:| í: #| á: a: n: | é: e: n:] ; " stemch 8.1"
