
Neural Network Based Automatic Essay Scoring

for Swedish

Rex Dajun Ruan

Uppsala University

Department of Linguistics and Philology

Master Programme in Language Technology

Master's Thesis in Language Technology, 30 ECTS credits

September 26, 2020

Supervisor: Beáta Megyesi


Abstract

This master's thesis presents a novel method of automatic essay scoring for Swedish national tests written by upper secondary school students, deploying neural network architectures and linguistic feature extraction in the framework of Swegram. Four sorts of linguistic aspects are involved in our feature extraction: count-based, lexical, morphological and syntactic. One of three variants of recurrent networks, vanilla RNN, GRU or LSTM, together with a specific model parameter setting, is implemented in the Automatic Essay Scoring (AES) model, with extracted features measuring linguistic complexity as text representation. The AES model is evaluated through interrater agreement with the human-assigned grade as target label, in terms of quadratic weighted kappa (QWK) and exact percent agreement. Our best observed averaged QWK and averaged exact percent agreement are 0.50 and 52% over 10 folds among all experimented models.


Contents

Acknowledgement 5

1. Introduction 6

2. Background 8

2.1. Existing Automatic Essay Scoring Systems . . . 8

2.1.1. Project Essay Grade . . . 8

2.1.2. Electronic Essay Rater . . . 8

2.1.3. Intelligent Essay Assessor . . . 9

2.2. Automatic Essay Scoring Using Neural Networks . . . 9

2.3. Progress in Swedish AES . . . 10

2.4. Measurement for Model Evaluation . . . 11

2.4.1. Percent Agreement . . . 11

2.4.2. Quadratic Weighted Kappa . . . 11

2.4.3. Precision, Recall and F-measure . . . 12

3. Feature Extraction 13

3.1. Annotation . . . 13

3.2. Linguistic Features . . . 15

3.2.1. Count Based Features . . . 16

3.2.2. Morphological Features . . . 16

3.2.3. Syntactical Features . . . 18

3.2.4. Lexical Features . . . 21

3.3. Scaling and Feature Storage . . . 23

3.4. An Extracted Example Essay with Linguistic Features . . . 24

4. Neural Network 28

4.1. Artificial Neural Network . . . 28

4.2. Feedforward Neural Network . . . 29

4.3. Recurrent Neural Networks . . . 30

4.3.1. Long Short Term Memory . . . 31

4.3.2. Gated Recurrent Unit . . . 32

4.4. Hyperparameters . . . 32

4.4.1. Loss Function and Optimization . . . 33

4.4.2. Dropout . . . 34

4.4.3. Activation Function . . . 34

4.4.4. Other Involved Hyperparameters . . . 34

5. Experimental Setup 35

5.1. Data . . . 35

5.2. Cross-validation . . . 36

5.3. AES Model Architecture . . . 36

5.4. Setup . . . 39

5.4.1. Baseline, Normalization, Augmentation of Text-level Features and Data Size . . . 39

5.4.2. Model Complexity . . . 39


5.4.3. Feature Selection . . . 41

6. Results 42

6.1. Baseline, Normalization, Data Size and Feature Level . . . 42

6.2. Model Complexity . . . 45

6.3. Feature Selection . . . 47

7. Discussion 48

7.1. Data Analysis . . . 48

7.2. Method Analysis . . . 51

7.2.1. KELLY-list . . . 51

7.2.2. Annotation and Text Representation . . . 51

7.2.3. Scaling Approach . . . 52

7.2.4. Cross Validation . . . 52

7.2.5. Evaluation Metrics . . . 52

8. Conclusion 54

A. Morphosyntactic Annotation for the Example Essay 55

B. All Dependency Paths through more than Five Arcs in the Sentence "Trots allt detta finns det vuxna individer som Bert Karlsson som tar sig rätten att attackera oss unga med falska påståenden om hur vi är." 57

C. The Used Syntactical Relations with Examples 58


Acknowledgement

This thesis work was supervised by Beáta Megyesi. I really appreciate her kindness, encouragement, comments and instructions throughout the entire thesis work. I want to thank Ildikó Pilán for the inspiration for the feature implementation, and Anne Palmér and Robert Östling, who provided the data. I am also very grateful for the support from Ali Basirat and Gongbo Tang.

I want to thank all my friends and family, who have always trusted and supported me during the past two years. None of my achievements would have come true without their support.


1. Introduction

The task of automatic essay scoring (AES) is to automatically assign a given essay an expected grade according to grading criteria and text quality. A satisfactory AES system decreases the time and cost of text assessment, avoids the subjectivity arising from teachers' personal preferences, and creates a nondiscriminatory environment with consistent assessment criteria. It can also provide detailed feedback on the aspects in which a text is strong or weak (Page, 1968).

AES systems for English have been widely researched since the first trial (Page, 1968), and some variants (Burstein, Kukich, et al., 2001) have been successfully deployed as commercial tools in the assessment of standardized English writing proficiency tests, including the Test of English as a Foreign Language (TOEFL) and the Graduate Record Examination (GRE). In contrast to this success for English, research on AES systems for Swedish is rarely addressed. The long-term goal of developing Swedish AES, to which this thesis work contributes, is to support school teachers with a suggested holistic textual assessment along with detailed comments about the linguistic characteristics of the text.

The purpose of this thesis work is to contribute to the development of a Swedish AES system by measuring linguistic complexity in the framework of neural networks.

Specifically, the goal of the thesis is to detect which linguistic features, given a set of linguistic features, are relevant for the prediction of essay scores, and in which way these linguistic features are best represented in AES systems. As shown in many studies (Crossley, 2020b; Dasgupta et al., 2018), the linguistic complexity of a text correlates with text quality. Since we cannot detect on our own which features correlate highly with grade prediction for learner texts, the linguistic features used in this thesis are based on the work of Pilán (2018), which presents a set of features to measure the linguistic complexity of Swedish texts. Our hypothesis is thus that the expected grade of a given text can be automatically predicted by measuring the text's linguistic complexity. The main research questions that we ask in this thesis are as follows:

• Can neural networks be used for Swedish AES by including linguistic characteristics of the text, given a rather small amount of training data? If so, can the AES model be optimized by tuning the neural network?

• Are linguistic features useful when modelling Swedish AES? If so, can the AES model be optimized by feature scaling?

• Does the training data size have an influence on the AES system performance?

• The features to be implemented can be classified into different blocks according to linguistic aspect; in our case, the linguistic features can be count-based, lexical, morphological or syntactic. What kind of linguistic features has more impact on modelling performance: a specific feature block or a combination of blocks?

Since human raters may assign different grades to the same texts, it is difficult to assess which grades are more reasonable for the texts. In our case, we leave the question of whether the assigned grades are biased to future work and assume that the human-assigned grades are the gold standard.

Another constraint worth pointing out is that the data we have access to for AES modelling and evaluation consists of two samples of student-written essays for the national test in Swedish, graded by the students' own teachers. Each of the two samples comprises around one million tokens covering a variety of text genres and topics.

In Chapter 2, we describe the background of AES, comprising three well-developed English AES systems and their core ideas, a brief description of AES systems using neural networks, the progress in Swedish AES, and the evaluation metrics used to assess our system's performance. Chapter 3 gives a detailed description of the features used to represent the text in the neural networks for prediction, as well as the procedure of feature extraction from raw text to feature scalars. Chapter 4 introduces neural networks, the two types of neural networks used in the AES modelling, feedforward and recurrent networks, and a set of hyperparameters. Chapter 5 describes the data, the cross-validation, the experimental procedure and the tested variables. Chapter 6 presents the results based on the experimental setup. Chapter 7 compares the results with previous studies, discusses the limitations of the experiments and proposes possible future work. Chapter 8 draws a brief conclusion.


2. Background

The possibility of research on AES was first proposed in the 1960s by Page (1968), who defined the problem of AES, from a philosophical viewpoint, as a transformation from a sequential input string in the form of text into output symbols. The output symbols can be an overall grade, comments or a diagnosis of the input text. He argues that computer-friendly quantitative measurements allow objective and reasonable assessment of text writing. In the following, we provide a background to AES systems.

2.1. Existing Automatic Essay Scoring Systems

Depending on which aspects of the essays are graded, AES systems are classified into two categories: content-based and style-based. Content-based AES systems look into features involving claims, arguments and theses for grade prediction, while style-based systems focus on linguistic features measuring lexical sophistication, syntactic complexity and text cohesion. Project Essay Grade (Page, 1968) is a style-based AES system, and Intelligent Essay Assessor (Crossley, 2020a) is a content-based one.

Here we present three AES systems which all are based on NLP techniques.

2.1.1. Project Essay Grade

Project Essay Grade (PEG) was proposed by Page (1968) as the first trial of AES, based on student essays written in English in grades eight to twelve. PEG analyses text style with the help of two concepts, "prox" and "trin": a "prox" is an indirect variable, measurable by computer, that approximately simulates a "trin", an intrinsic variable predicting the essay grade. Prox variables consist of a set of surface linguistic features indicating text quality, including essay length and the distribution of parts of speech in the essay. A set of essays with human grades is used in a standard multiple regression (predicting the text grade from multiple features extracted from the text) to estimate the correlations between proxes and the overall text grade. The estimated correlations and the measured prox variables then predict grades for new essays.

Hearst (2000) pointed out that the early versions of PEG (Page, 1968, 1994) had a limitation in their use of indirect variables. On the one hand, the prox variables are not reliable, since an essay writer could artificially produce a longer essay without improving text quality. On the other hand, PEG does not take essay content into account for grade prediction, which further implies that PEG cannot provide any feedback on an essay's content.

2.1.2. Electronic Essay Rater

Electronic Essay Rater (Burstein, Leacock, et al., 2001), known as E-rater, was developed by the Educational Testing Service (ETS) and is widely used in high-stakes assessment of English writing skills.

E-rater consists of five independent sub-modules using NLP tools and statistical measurements. Three of the five sub-modules tackle the analysis of syntactic variety, textual structure and vocabulary use. Part-of-speech tagging, chunking techniques for phrase detection and the categorization of clause types are used to reflect syntactic variation. Cue words and terms, for instance "however" and "on one hand", along with the syntactic structures, demonstrate the textual structure of the essay, dividing it into several sections where each section has its own topic. A bag-of-words representation based on the vector space model illustrates the vocabulary use in each partitioned section. The fourth module selects and weights predictive features, and the fifth module performs the final computation and generates the final score.

The discrepancy rate between the score graded by E-rater and a human reader on a GMAT essay is reported to be less than three percent, which is comparable to that between two human readers (Burstein, Leacock, et al., 2001).

2.1.3. Intelligent Essay Assessor

Intelligent Essay Assessor (IEA) (Miller, 2003) is a mainly content-based automatic essay scoring system that exploits the Latent Semantic Analysis technique. To grade an essay, IEA compares its semantic similarity, based on word use without regard to word order, against other essay samples or a standard essay. Essay style (redundant sentences and structural issues) and essay mechanics (misspelled words and grammatical errors) are left to two assistant models.

The idea of grading from the semantic aspect is motivated by the observation that human graders give more weight to content than to style when grading student essays. The content of a student essay usually derives from a specific source, which makes comparison with example essays possible. To detect the semantic representation of the observed words, a semantic space is set up in the form of a matrix representing word-context co-occurrence, where each cell records the frequency of a word occurring in a context. This matrix is weighted with Term Frequency-Inverse Document Frequency (Ramos et al., 2003) and then factorized into three component matrices by Singular Value Decomposition (Golub and Reinsch, 1971), creating latent (hidden) dimensions that represent semantic meaning.

IEA achieves interrater agreement with human graders of between 85% and 91% on GMAT essays (Valenti et al., 2003).

2.2. Automatic Essay Scoring Using Neural Networks

Deep learning with neural networks has brought significant improvements in applications involving image processing and text processing, and AES is one of the fields benefiting from it (Alikaniotis et al., 2016; Dasgupta et al., 2018; Hussein et al., 2019; Schmidhuber, 2015).

On the one hand, many studies on neural network based AES (Alikaniotis et al., 2016; Taghipour and Ng, 2016) claim that the use of neural networks makes manual feature engineering unnecessary, because the learning process of a neural network is able to capture data patterns that correlate with grade prediction.

Alikaniotis et al. (2016) proposed a neural network based AES system with Score-Specific Word Embeddings, which capture both the local contextual linguistic information of a word in the linear sequence and the correlation between the word and the text's overall score. Experiments on the Kaggle Automated Student Assessment Prize (ASAP) dataset¹ show that the model built on Score-Specific Word Embeddings with a two-layer bidirectional LSTM performed best. The best observed correlations between predicted and expert grades reached 0.91 for Spearman's 𝜌, 0.96 for Pearson's 𝑟 and 0.96 for Cohen's 𝜅.

¹ https://www.kaggle.com/c/asap-aes. Available on Sep 14, 2020.

Taghipour and Ng (2016) explored a variety of neural network models without manually engineered features. The model architecture includes five layers: a look-up table layer that converts each word into a vector by word embedding, a convolutional layer that extracts local information at the sentence level, a recurrent layer of one of three variants (basic recurrent units, gated recurrent units or long short-term memory units), a mean-over-time layer that aggregates a variable number of inputs into a fixed-length vector, and a linear layer with sigmoid activation that transforms the hidden state into an output vector representing the grades. The averaged Quadratic Weighted Kappa (QWK) over an eight-fold cross-validation test on the Kaggle dataset reached 0.761 for the LSTM model, an increase of 5.6% QWK, and significantly different, compared with the AES system Enhanced AI Scoring Engine².

On the other hand, linguistic factors, for instance lexical sophistication, syntactic complexity and text cohesion (Crossley, 2020b), have a profound influence on which grade an essay should be assigned, and capturing these factors is an appropriate way to better represent the essay for grading. In a study by Dasgupta et al. (2018) on the Kaggle dataset, the input layer of the neural network consists of both word embeddings and enhanced linguistic features. The results show that augmenting the models with qualitatively enhanced features such as lexical diversity significantly improves the performance of text quality assessment. The best observed model achieved an averaged QWK of 0.786 over 8-fold cross-validation, with correlations in terms of Pearson's and Spearman's coefficients of up to 0.94 and 0.97, respectively. This implies that manual feature engineering can also be an important factor in improving the performance of AES within a neural network architecture.

2.3. Progress in Swedish AES

To the best of our knowledge, there are several studies on the development of AES systems for Swedish applying statistical information and NLP techniques.

The first published paper on Swedish AES, by Östling et al. (2013) and based on a bachelor thesis by Smolentzov (2013), utilized a Linear Discriminant Analysis (LDA) classifier to predict grades on a four-step scale. The corpus consisted of essays collected from upper secondary students' examinations in spring 2005 and fall 2006. Each student essay was graded by the student's own teacher and by another, blind grader.

Each essay to be graded was represented as a feature vector where each dimension represented a manually engineered feature and the scalar in each dimension was the correlation between the feature and the grade. The results showed that the LDA classifier reached its highest performance, 62% interrater agreement (𝜅 = 0.399), when measured against the averaged grades of teacher and blind grader. When using the grades from the teachers only, the agreement between predicted and expected grade assignments dropped to 53.6% (𝜅 = 0.345).

Lilja (2018) tried a new approach based on deep learning with a Long Short-Term Memory (LSTM) network, using the same data as Östling et al. (2013), on the assumption that AES should benefit from the power of deep learning. Instead of using a set of linguistic features, the text was represented as a sequence of word embeddings. Lilja (2018) experimented with different settings, including bidirectional LSTMs and varying the number of hidden layers. The best AES model in the study achieved a percent agreement of 52.41% and a highest QWK value of 0.36.

² https://github.com/edx/ease. Available on Sep 14, 2020.

2.4. Measurement for Model Evaluation

Interrater agreement (IRA) indices relate to the extent to which different raters assign the same value to each item being rated (Gisev et al., 2013). In the context of AES evaluation, interrater agreement indicates the degree to which the AES model predicts the same grades for the essays as the human graders. Although we are aware that essay assessment is inconsistent among human raters, we take the grades given by human raters as gold-standard labels to evaluate the performance of the AES models on held-out test datasets. Two measurements of interrater agreement are exploited: percent agreement and quadratic weighted kappa.

2.4.1. Percent Agreement

Percent agreement refers to the ratio of the number of essays to which the AES model assigns the same grade as the human grader to the total number of graded essays (see Equation 1). We use accuracy interchangeably with percent agreement to describe model performance.

Acc(AES) = (Number of concordant grade assignments / Number of grade assignments) × 100% (1)

2.4.2. Quadratic Weighted Kappa

The Quadratic Weighted Kappa (QWK) error metric is seen as the standard measurement for assessing model performance in the Kaggle competition on AES modelling of student-written essays³. To compute QWK for student essays rated by a human grader and the AES system, we initially build a weight matrix with respect to the size of the grade scale, according to the following formula:

𝑊𝑖,𝑗 = (𝑖 − 𝑗)² / (1 − 𝑁)², 𝑖, 𝑗 ∈ {1, 2, ..., 𝑁} (2)

where 𝑖 refers to the reference score assigned by the human grader, 𝑗 to the system score predicted by the AES system, and 𝑁 is the size of the grade scale. An observed matrix 𝑂, in which 𝑂𝑖,𝑗 denotes the number of student essays assigned score 𝑖 by the human grader and 𝑗 by the system, is calculated. An expected matrix 𝐸 is computed as the outer product of the reference and system score histograms, and is then normalized so that its total sum equals that of the observed matrix. The final QWK value is computed with the formula:

𝑘 = 1 − (Σ𝑖,𝑗 𝑊𝑖,𝑗 𝑂𝑖,𝑗) / (Σ𝑖,𝑗 𝑊𝑖,𝑗 𝐸𝑖,𝑗) (3)

We adopt a minimum agreement of 0.70, the threshold defined by Ramineni et al. (2012) for quadratic weighted kappa, to judge whether the implemented AES system is good.
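The two agreement measures above can be sketched in plain Python. This is a minimal illustration assuming grades are integers on a scale 1..N, not the thesis's actual implementation:

```python
from collections import Counter

def percent_agreement(human, system):
    """Share of essays where the AES system assigns the same grade as the human grader."""
    same = sum(1 for h, s in zip(human, system) if h == s)
    return 100.0 * same / len(human)

def quadratic_weighted_kappa(human, system, n_grades):
    """QWK between human and system grades on an integer scale 1..n_grades (n_grades >= 2)."""
    # Weight matrix: squared grade distance, normalized by the squared scale span.
    w = [[(i - j) ** 2 / (n_grades - 1) ** 2 for j in range(1, n_grades + 1)]
         for i in range(1, n_grades + 1)]
    # Observed matrix: counts of (human grade, system grade) pairs.
    o = [[0] * n_grades for _ in range(n_grades)]
    for h, s in zip(human, system):
        o[h - 1][s - 1] += 1
    # Expected matrix: outer product of the two marginal histograms,
    # normalized so that it sums to the number of essays (same total as O).
    hist_h, hist_s, n = Counter(human), Counter(system), len(human)
    e = [[hist_h[i + 1] * hist_s[j + 1] / n for j in range(n_grades)]
         for i in range(n_grades)]
    num = sum(w[i][j] * o[i][j] for i in range(n_grades) for j in range(n_grades))
    den = sum(w[i][j] * e[i][j] for i in range(n_grades) for j in range(n_grades))
    return 1.0 - num / den
```

For example, human grades [1, 2, 3, 4] against system grades [1, 2, 3, 3] on a four-step scale give a percent agreement of 75% and a QWK of 0.875; perfect agreement gives a QWK of exactly 1.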

³ https://www.kaggle.com/c/asap-aes/overview/evaluation. Available on Sep 14, 2020.


2.4.3. Precision, Recall and F-measure

In order to analyse the performance of the AES model for specific grade categories, we also employ precision, recall and F-measure (see Equation 4). Precision is the proportion of essays predicted with a given grade that actually have that grade; recall is the proportion of essays with a given grade that the system predicts correctly; F-measure combines the given precision and recall. In our case, we set 𝛽 to 1 when computing the F-measure.

Precision = Number of essays correctly graded as grade i / Number of essays predicted with grade i by the AES system (4a)

Recall = Number of essays correctly graded as grade i / Number of essays assigned grade i by the human rater (4b)

F-measure = ((𝛽² + 1) × Precision × Recall) / (𝛽² × Precision + Recall) (4c)
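Equations 4a-4c can be sketched per grade category as follows; a minimal illustration, not the thesis's evaluation code:

```python
def precision_recall_f(human, system, grade, beta=1.0):
    """Precision, recall and F-measure for one grade category (Equations 4a-4c)."""
    tp = sum(1 for h, s in zip(human, system) if h == grade and s == grade)
    predicted = sum(1 for s in system if s == grade)  # essays the AES system graded as `grade`
    actual = sum(1 for h in human if h == grade)      # essays the human rater graded as `grade`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f
```

For instance, with human grades [1, 1, 2, 2] and predictions [1, 2, 2, 2], grade 2 gets precision 2/3, recall 1.0 and F1 0.8.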


3. Feature Extraction

On the way to creating an AES system for student-written Swedish essays, we start with our very first question: how to represent the essays in the AES system. Inspired by the exploration of automatic approaches for modelling Swedish linguistic complexity by Pilán (2018), we exploit linguistic features to represent the essays.

The initial goal of the template of linguistic features described in Pilán (2018) was to automatically predict the proficiency level of learners of Swedish as a second language by measuring linguistic complexity. This differs from the application of our AES system in that the essays to be graded can be written by students whose first language is Swedish, and language proficiency level is not always a determinative factor for the grade assigned to an essay. In spite of these differences, the linguistic features of Pilán (2018), illustrated in Table 3.1, are a good starting point for representing the student-written essays in the AES system, since the feature extraction is computationally cheap and convenient to implement. In this chapter, we introduce the annotation approaches used to preprocess the data in Section 3.1. In Section 3.2, we present the features selected from Pilán (2018) with respect to their linguistic aspects. Since the boundedness of the feature values differs from feature to feature, we employ scaling approaches for normalization in Section 3.3. At the end of the chapter, an example essay with extracted linguistic features is given to provide a picture of the features.

3.1. Annotation

In order to extract the linguistic features used to represent the essays, the essays, stored as sequences of alphanumeric characters in plain text or Word format, first need to be annotated. The annotation is based on the Swegram pipeline (Megyesi et al., 2019) and incorporates tokenization, spelling correction, part-of-speech tagging, lemmatization and syntactic parsing. The annotation of a given essay goes through the steps shown in Figure 3.1 in sequence.

Figure 3.1.: Annotation procedure given an essay.

Sentence segmentation and tokenization are based on the work of Cap et al. (2016).

The spelling checker goes through all tokens output by the tokenizer to detect and correct spelling errors with the help of HistNorm (Pettersson et al., 2013). Part-of-speech tagging and lemmatisation are performed with Efselab¹ (Östling, 2018), and the syntactic analysis is carried out by MaltParser (Nivre et al., 2006).

The preprocessed data is encoded in an extended version of the Computational Natural Language Learning format² (CoNLL-U format). The CoNLL-U format is used within Universal Dependencies, a framework aiming at consistent linguistic annotation across languages. In this extended CoNLL-U format, every word is lowercased and stands on its own line, and one extra column is appended to store the original word form in the word line. Each word line consists of the following fields: ID, FORM, NORM (spelling checked), LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, and MISC.

¹ https://github.com/robertostling/efselab. Available on Sep 14, 2020.
² https://universaldependencies.org/format.html. Available on Sep 14, 2020.

Name Type Name Type
Sentence length COUNT Modal V to V MORPH
Avg token length COUNT Particle INCSC MORPH
Extra-long token COUNT 3SG pronoun INCSC MORPH
Nr characters COUNT Punctuation INCSC MORPH
LIX COUNT Subjunction INCSC MORPH
Bilog TTR COUNT PR to N MORPH
Square root TTR COUNT PR to PP MORPH
Avg KELLY log freq LEXICAL S-V INCSC MORPH
A1 lemma INCSC LEXICAL S-V to V MORPH
A2 lemma INCSC LEXICAL ADJ INCSC MORPH
B1 lemma INCSC LEXICAL ADJ variation MORPH
B2 lemma INCSC LEXICAL ADV INCSC MORPH
C1 lemma INCSC LEXICAL ADV variation MORPH
C2 lemma INCSC LEXICAL N INCSC MORPH
Difficult W INCSC LEXICAL N variation MORPH
Difficult N&V INCSC LEXICAL V INCSC MORPH
OOV INCSC LEXICAL V variation MORPH
No lemma INCSC LEXICAL Function W INCSC MORPH
Avg. DepArc length SYNTACTIC Neuter N INCSC MORPH
DepArc Len >5 SYNTACTIC CJ + SJ INCSC MORPH
Max length DepArc SYNTACTIC Past PC to V MORPH
Right DepArc Ratio SYNTACTIC Present PC to V MORPH
Left DepArc Ratio SYNTACTIC Past V to V MORPH
Modifier variation SYNTACTIC Supine V to V MORPH
Pre-modifier INCSC SYNTACTIC Present V to V MORPH
Post-modifier INCSC SYNTACTIC Nominal ratio MORPH
Subordinate INCSC SYNTACTIC N to V MORPH
Relative clause INCSC SYNTACTIC Lex T to non-lex T MORPH
PP complement INCSC SYNTACTIC Lex T to Nr T MORPH
Avg senses per token SEMANTIC Relative structure INCSC MORPH
N senses per N SEMANTIC

Table 3.1.: The feature set proposed for linguistic complexity analysis of L2 Swedish from Pilán (2018).

In each word line, ID denotes the index of the word in the sentence; FORM is the word form or punctuation symbol as it occurs in the original text; NORM is the spell-checked word; LEMMA is the lemma or stem of the word form; UPOS and XPOS give the part-of-speech tag from the Universal and SUC tagsets (Gustafson-Capková and Hartmann, 2006), respectively; FEATS lists morphological features, such as tense when the word is a verb, some of which are language specific, for instance neuter gender in Swedish; HEAD is zero if the current word's head is the root, and otherwise the ID of the current word's head; DEPREL indicates the universal dependency relation to the HEAD; DEPS gives the enhanced dependency graph in the form of a list of head-deprel pairs; MISC takes any other annotation. An underscore is used when no annotation is present in a field.
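The layout of these word lines can be illustrated with a small parser. The field order follows the description above; the sample annotation is a constructed example, not taken from the thesis data:

```python
# Field names of the extended CoNLL-U format described above (NORM added after FORM).
FIELDS = ["ID", "FORM", "NORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def parse_word_line(line):
    """Parse one tab-separated word line into a field dictionary ('_' means no annotation)."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

# A constructed example: "hunden" (the dog) as nominal subject of the second word.
sample = "1\thunden\thunden\thund\tNOUN\tNN\tDefinite=Def|Gender=Com|Number=Sing\t2\tnsubj\t_\t_"
word = parse_word_line(sample)
```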

3.2. Linguistic Features

Inspired by the work of Pilán (2018), a set of linguistic features, presented in Table 3.2, is extracted to represent the essays in the AES modelling. The collection of linguistic features is divided into four blocks with respect to linguistic aspect: count-based textual features, morphological features, syntactic features and lexical features. Each feature block is explained in a subsequent section.

Count-based features Lexical Token to Non-lexical Token Syntactical features
NO. Character in Token Lexical Token to Token NO. DepArc Length 5+
NO. Character in Type Modal VERB to VERB Max DepArc Length
NO. Token S-VERB to VERB Avg. DepArc Length
NO. Type NOUN to VERB Right DepArc Ratio
Avg. Token Length Past Participle to VERB Left DepArc Ratio
Avg. Type Length Present Participle to VERB Modifier Variation
NO. Long Token Past VERB to VERB Pre-modifier INCSC
NO. Long Type Present VERB to VERB Post-modifier INCSC
NO. Misspelling Supine VERB to VERB Subordinate INCSC
NO. Sentence NOUN Variation Rel Clause INCSC
NO. Paragraph VERB Variation PREP Comp INCSC
Lix ADJ Variation Lexical features
Bi-logarithm TTR ADV Variation A1 Lemma INCSC
Square Root TTR 3SG PRON INCSC A2 Lemma INCSC
Morphological features Nominal Ratio B1 Lemma INCSC
PRON to NOUN PART INCSC B2 Lemma INCSC
PRON to PREP PUNCT INCSC C1 Lemma INCSC
S-VERB INCSC SCONJ INCSC C2 Lemma INCSC
Neuter Gender NOUN INCSC ADJ INCSC Difficult Word INCSC
Functional Token INCSC ADV INCSC Difficult NOUN & VERB INCSC
CCONJ & SCONJ INCSC NOUN INCSC OOV INCSC
Rel INCSC VERB INCSC Avg. KELLY Log Frequency

Table 3.2.: A feature set with respect to the linguistic aspects.

Which linguistic unit is suitable to apply depends on the feature. The smallest linguistic unit for most of the features in Table 3.2 is the sentence. In two cases, however, features are extracted on the text level instead: NO. Sentence and NO. Paragraph, denoting the number of sentences and paragraphs, respectively. The features that are available on the sentence level can also be applied on the text level.

Based on the whole text, a feature can be extracted directly according to its definition. Alternatively, to represent a feature on the text level, we can summarize the distribution of its sentence-level values over the segmented sentences of the text. Concretely, the mean, median and standard deviation of the feature values are considered for text-level representation. To be robust against extreme values in the distribution, we incorporate the median; since both mean and median estimate a central value, we represent the feature on the text level by the average of the mean and the median of the feature value distribution. For instance, for the feature NO. Token, denoting the number of tokens, on a text with six sentences of 30, 11, 21, 8, 12 and 8 tokens respectively, three values represent NO. Token on the text level. The first value is 90, the total number of tokens in the text. The second is 13.25, the average of the mean and the median of the sentence-level values, which are 15 and 11.5, respectively. The third value is the standard deviation of the sentence-level values, which is about 8.76.
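The worked example above can be reproduced with a short sketch (assuming the reported 8.76 is the sample standard deviation):

```python
import statistics

def text_level_values(sentence_values):
    """Three text-level representations of a sentence-level feature:
    total, average of mean and median, and sample standard deviation."""
    total = sum(sentence_values)
    central = (statistics.mean(sentence_values) + statistics.median(sentence_values)) / 2
    spread = statistics.stdev(sentence_values)
    return total, central, spread

# The NO. Token example from the text: six sentences of 30, 11, 21, 8, 12 and 8 tokens.
total, central, spread = text_level_values([30, 11, 21, 8, 12, 8])
```

This yields 90 tokens in total, a mean/median average of 13.25 and a standard deviation of about 8.76, matching the example.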

3.2.1. Count Based Features

Count based features comprise counts of characters, types, tokens, misspellings, long tokens, long types, sentences and paragraphs. A token or a type whose length is over 13 characters is considered long. Misspelling refers to the misspellings detected by HistNorm (Pettersson et al., 2013). The counts of sentences and paragraphs are only available on the text level.

NO. Character in Token and NO. Character in Type indicate the total number of characters in all tokens and types, respectively. The features NO. Token, NO. Type, NO. Long Token, NO. Long Type, NO. Misspelling, NO. Sentence, NO. Paragraph, Avg. Token Length and Avg. Type Length are straightforward and self-explanatory. The features Lix, Bi-logarithm TTR and Square Root TTR are explained below.

Lix (Swedish: läsbarhetsindex) is a feature to measure readability for Swedish (Björnsson, 1968). The formula for computing Lix is:

𝑓(Lix) = Counts(tokens) / Counts(sentences) + 100 × Counts(long tokens) / Counts(tokens)   (1)

where long tokens in this case refer to those whose lengths are over 6 characters.

Type-token ratio (TTR) is a feature to measure lexical diversity by comparing unique types against total tokens. Instead of using a simple ratio of the counts of types and tokens, bi-logarithm and square root TTR (see Equations 2) are implemented.

𝑓(Bi-logarithm TTR) = ln(Counts(type)) / ln(Counts(token))   (2a)

𝑓(Square Root TTR) = Counts(type) / √Counts(token)   (2b)
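These three count-based measures can be sketched directly from token lists; the helper names are ours, and the long-token threshold (more than 6 characters) follows the standard Lix definition cited above:

```python
import math

def lix(tokens, n_sentences):
    """Lix readability: average sentence length plus the percentage
    of long tokens (longer than 6 characters)."""
    long_tokens = sum(1 for t in tokens if len(t) > 6)
    return len(tokens) / n_sentences + 100 * long_tokens / len(tokens)

def bilog_ttr(tokens):
    """Bi-logarithm type-token ratio (Equation 2a)."""
    return math.log(len(set(tokens))) / math.log(len(tokens))

def sqrt_ttr(tokens):
    """Square root type-token ratio (Equation 2b)."""
    return len(set(tokens)) / math.sqrt(len(tokens))

# A toy one-sentence text: 5 tokens, 4 types, 1 long token (eftersom).
toks = ["hunden", "jagar", "katten", "eftersom", "katten"]
print(round(lix(toks, 1), 2), round(sqrt_ttr(toks), 2))  # 25.0 1.79
```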

3.2.2. Morphological Features

Morphological features indicate part-of-speech distribution through computing an incidence score (INCSC) (Pilán, 2018). The INCSC computation is defined as

INCSC = (1000 / 𝑁𝑡) × 𝑁𝑐   (3)

where 𝑁𝑡 and 𝑁𝑐 denote the counts of two categories of tokens.

In most cases, 𝑁𝑐 is a subset of 𝑁𝑡. For instance, the feature Present VERB to VERB involves two categories, present verbs (𝑁𝑐) and verbs (𝑁𝑡), and the set of present verbs is a subset of the verbs. However, there are also cases where 𝑁𝑡 and 𝑁𝑐 are disjoint. For instance, a token whose part of speech is a preposition cannot be a pronoun at the same time. When extracting the feature PRON to PREP, some sentences may contain no prepositions at all, which makes the denominator in Equation 3 zero. In that case, we set the feature value to 1000 to handle the division-by-zero error.
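Equation 3 and the division-by-zero convention can be sketched as follows (an illustrative transcription, not the Swegram code):

```python
def incsc(n_c, n_t):
    """Incidence score (Equation 3): (1000 / N_t) * N_c. When the
    reference category N_t is empty, the feature value is set to 1000,
    as described above."""
    if n_t == 0:
        return 1000.0
    return 1000.0 / n_t * n_c

# Present VERB to VERB for a sentence with 3 present verbs out of 4 verbs:
print(incsc(3, 4))  # 750.0
# PRON to PREP in a sentence without any prepositions:
print(incsc(2, 0))  # 1000.0
```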

The 30 morphological features are divided into six subgroups according to how parts of speech are used in the INCSC computation. Since every morphological feature is extracted through computing an INCSC value, we define each morphological feature by presenting its 𝑁𝑐 and 𝑁𝑡. Both SUC tags and UD tags are used in the thesis. Most part-of-speech tags in the following features are UD tags; the only SUC tag mentioned in the thesis is PC, used for the feature Nominal Ratio.

Verb Form INCSC 𝑁𝑡 in the group of Verb Form INCSC refers to the category of tokens whose parts of speech are either VERB or AUX, and 𝑁𝑐 is a subset of the verbs, clustered by tense, aspect or modality. 𝑁𝑐 is one of the following verb categories, each forming an individual INCSC feature: modal verbs, S-VERBs, present participles, past participles, present verbs, past verbs and supine verbs. An S-VERB is a Swedish verb that ends with the character s, due to passive verb construction or because it is a reciprocal or deponent verb.

One PoS to One PoS In this group, we extract features based on a pair of two different parts of speech. The part-of-speech pairs are noun-verb, pronoun-noun and pronoun-preposition.

One Subpos to All PoS There are three INCSC features where 𝑁𝑐 refers to the tokens whose part of speech belongs to a subset of a particular part of speech in the tagset, and 𝑁𝑡 includes all tokens. The tokens included in the 𝑁𝑐 category in this group are either S-VERBs, third person pronouns or neuter gender nouns.

One PoS to All PoS The 𝑁𝑐 category in this group is a collection of tokens with an identical part of speech, belonging to one of the following classes: adjective, adverb, noun, particle, punctuation, subjunction and verb. 𝑁𝑡 includes all tokens.

Lexical Variation There are four INCSC variation features where 𝑁𝑐 refers to one of the four categories adjective, adverb, noun and verb, and 𝑁𝑡 includes the tokens whose part of speech belongs to any of the four mentioned classes.

Multiple PoS vs Multiple PoS Either 𝑁𝑐 or 𝑁𝑡 consists of tokens covering more than one part of speech. A part of speech in the tagset is classified as either content or functional. The content parts of speech in the UD tagset are adjective (ADJ), adverb (ADV), interjection (INTJ), noun (NOUN), proper noun (PROPN) and verb (VERB), while the functional parts of speech include adposition (ADP), auxiliary (AUX), coordinating conjunction (CCONJ), determiner (DET), numeral (NUM), particle (PART), pronoun (PRON), subordinating conjunction (SCONJ), punctuation (PUNCT), symbol (SYM) and other (X). We define the remaining morphological features in Table 3.3 by denoting the parts of speech for 𝑁𝑐 and 𝑁𝑡.


Feature | 𝑁𝑐 | 𝑁𝑡

CCONJ & SCONJ INCSC | CCONJ, SCONJ | All PoS
Functional Token INCSC | Functional PoS | All PoS
Lexical Token to Non-lexical Tokens | Content PoS | Functional PoS
Lexical Token to Token | Content PoS | All PoS
Nominal Ratio | NOUN, ADP, PC | PRON, ADV, VERB
Rel INCSC | Int, Rel | All PoS

Table 3.3.: Specification of the categories 𝑁𝑐 and 𝑁𝑡 for each morphological feature.

Int and Rel refer to tokens that have the property of being interrogative and relative, respectively, and All PoS refers to all kinds of parts of speech.

3.2.3. Syntactical Features

The syntactical feature extraction is based on the syntactical annotation given by Efselab (Östling, 2018) within the framework of Universal Dependencies. To explain the mechanism of syntactical feature extraction, we recall several key terms with a parsed example sentence in the form of the graphical representation in Figure 3.2. Syntactic annotation indicates the syntactical structure of the sentence by detecting dependent-head pairs with assigned labels denoting the kinds of syntactical relationship.

As shown in Figure 3.2, every arc links a pair of words that have a dependent-head relationship, and the arc direction goes from head to dependent. Each word in the sentence has one and only one head. A dummy ROOT node is added as the head of the one word that has no head within the sentence. If the dependent occurs to the left of its head, the arc is defined as a left arc; a right arc is defined in the same way. As illustrated in Figure 3.2, we place the ROOT node in the leftmost position of the sentence so that the arc involving the ROOT node is always a right arc. By passing through arcs, a dependent is able to reach other heads apart from its current one. We define the dependency arc length (DepArc Length) of a dependent given a target head as the number of arcs through which the dependent node reaches the target head node.

Figure 3.2.: A graphical representation of the annotated sentence Trots allt detta finns det vuxna individer som Bert Karlsson som tar sig rätten att attackera oss unga med falska påståenden om hur vi är. (See translation in Section 3.4).

3.2.3.1. Syntactical Features with the Count of DepArcs

Five out of eleven syntactical features are extracted based on statistics over dependency arcs to indicate syntactic complexity: Avg. DepArc Length, Max DepArc Length, NO. DepArc Length 5+, Right DepArc Ratio and Left DepArc Ratio. Given the description of dependency arcs and DepArc Length, we illustrate in Figure 3.3 the concrete walk paths through dependency arcs for the words in the parsed sentence shown in Figure 3.2, in word order.

Avg. DepArc Length refers to the average dependency arc length between each word node in the sentence and the ROOT node. For the example in Figure 3.3, which demonstrates the walk path from each word to ROOT, Avg. DepArc Length is rounded to 5.12. The maximal dependency arc length is 9, which is indicated


Figure 3.3.: A graphical representation of walk paths from each word to 𝑅𝑂𝑂𝑇 in word order given the sentence.


through the feature Max DepArc Length. To further illustrate how syntactically complicated the sentence is, we also track the number of paths whose dependency arc lengths are longer than (or equal to) five. In the case of the sentence presented in Figure 3.3, there are 14 word nodes whose dependency arc length to the ROOT node is longer than (or equal to) five. However, the end node of such a path does not necessarily have to be the ROOT node. For instance, in Figure 3.4, we can detect four other paths whose lengths are more than five within a path that has seven dependency arcs. In this way, we count all paths whose lengths are over five; the total number of detected paths for the example sentence is 44 (see Appendix B for details of the detected paths).

Figure 3.4.: An example of detecting walk path over 5 arcs from a given walk path.

According to the description of arcs in terms of direction, the arc of the dependent-head pair (𝑣𝑢𝑥𝑛𝑎, 𝑖𝑛𝑑𝑖𝑣𝑖𝑑𝑒𝑟), where 𝑖𝑛𝑑𝑖𝑣𝑖𝑑𝑒𝑟 is the head, is a left arc, while 𝐴𝑟𝑐(𝑡𝑎𝑟, 𝑠𝑖𝑔), where 𝑡𝑎𝑟 is the head, is a right arc. Furthermore, Right DepArc Ratio and Left DepArc Ratio are computed as the ratio (see Equation 4) between the count of dependency arcs with the given direction and the total number of arcs. In our example, we have 12 left arcs (46%) and 14 right arcs (54%) out of 26 arcs in total in the example sentence.

Right DepArc Ratio = Count(Right arcs) / Count(Total arcs)   (4a)

Left DepArc Ratio = Count(Left arcs) / Count(Total arcs)   (4b)
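The arc statistics above can be sketched from a head array, where heads[i] gives the 1-based index of the head of word i+1 and 0 denotes ROOT; this encoding is our illustration, not Efselab's output format:

```python
def deparc_lengths(heads):
    """Number of arcs from each word up to the ROOT node.
    heads[i] is the head of word i+1 (1-based); 0 denotes ROOT."""
    lengths = []
    for word in range(1, len(heads) + 1):
        node, steps = word, 0
        while node != 0:
            node = heads[node - 1]
            steps += 1
        lengths.append(steps)
    return lengths

def arc_ratios(heads):
    """Right and left DepArc ratios (Equations 4a-4b). With ROOT placed
    leftmost, an arc whose head index is smaller than its dependent's
    index is a right arc."""
    right = sum(1 for dep, head in enumerate(heads, start=1) if head < dep)
    return right / len(heads), (len(heads) - right) / len(heads)

# Toy tree: word 2 is the root; words 1 and 3 depend on it.
print(deparc_lengths([2, 0, 2]))  # [2, 1, 2]
print(arc_ratios([2, 0, 2]))      # right 2/3, left 1/3
```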

3.2.3.2. Syntactic Features Based on Syntactical Relations

The remaining six syntactical features, Pre-modifier INCSC, Post-modifier INCSC, Modifier Variation, Subordinate INCSC, Rel Clause INCSC and PREP Comp INCSC, are based on detecting specific syntactical relations and on the incidence score computation. Since 𝑁𝑡 always refers to the number of tokens that build up the context for the syntactic annotation, we only specify 𝑁𝑐 for the incidence score computation of each of the six features.


Pre-modifier INCSC, Post-modifier INCSC, Modifier Variation A modifier is any dependent whose dependency relation with its head is annotated with nmod (see Appendix C for its definition and an example), appos, nummod, advmod, discourse or amod. A modifier that occurs before its head is registered as a pre-modifier, while a post-modifier comes after the occurrence of its head. 𝑁𝑐 for Modifier Variation is the total number of modifiers. For Pre-modifier INCSC and Post-modifier INCSC, 𝑁𝑐 is the number of detected modifiers with the respective order of modifier and head.

Subordinate INCSC A subordinate comprises a token and all of that token's dependents if the dependency relation (DEPREL) between the token and any of its direct dependents is annotated with one of the following labels: csubj, xcomp, ccomp, advcl, acl or acl:relcl. 𝑁𝑐 for the feature Subordinate INCSC is the number of subordinates.

Rel Clause INCSC A relative clause is introduced by any dependent whose FEATS annotation contains PronType=Rel, indicating that the pronominal type is a relative pronoun, determiner, numeral or adverb. 𝑁𝑐 for the feature Rel Clause INCSC is the number of tokens in the relative clauses.

PREP Comp INCSC A prepositional complement is a dependent whose dependency relation with its head is annotated with case. 𝑁𝑐 for the feature PREP Comp INCSC is the number of prepositional complements plus the other dependents that are subordinated under the prepositional complements' heads.

3.2.4. Lexical Features

Apart from the morphosyntactic features, lexical features are also considered. To measure lexical diversity and proficiency, we consult the Swedish KELLY-list3 (Volodina and Kokkinakis, 2012), produced in the project KEywords for Language Learning for Young and adults alike (KELLY). The KELLY-list is a vocabulary sample representing vocabulary usage, with each selected word or phrase classified into one of the six Common European Framework of Reference for Languages (CEFR) levels (Pilán and Volodina, 2016), in ascending order of difficulty: A1, A2, B1, B2, C1 and C2. The Swedish KELLY-list contains 8409 items generated from a corpus of web texts and is illustrated in Table 3.4 with two concrete entries for each CEFR level.

Each word entry is represented by its lemma and is accompanied by information on part of speech, the relative word frequency given in words per million (WPM), and an English translation.

As seen in Table 3.4, the Swedish KELLY-list treats the different word senses of a homonym like för as distinct entries. Multiple word senses of a polysemous word like avgå are stored in the same entry. Since the distinct entries of the same homonym can be classified into different CEFR levels, we use the word's part-of-speech property to tell which sense of the homonym is used in the context. The tagset applied by the KELLY-list differs from the one UD employs. Therefore, we perform the tagset conversion in Table 3.5 to facilitate the use of the KELLY-list in our lexical feature extraction. The entries in the KELLY-list comprise both words and phrases. We only make use of word entries that do not contain any space in our feature extraction.

3https://spraakbanken.gu.se/en/projects/kelly. Available on Sep 14, 2020.


CEFR | Lemma | PoS | WPM | English translation
A1 | all | pron | 2975.47 | all
A1 | för | adverb | 421.08 | too
A2 | påstående | noun-ett | 63.24 | statement
A2 | för | conj | 44.36 | because
B1 | avgå | verb | 23.46 | 1, resign; 2, depart
B1 | stabilitet | noun-en | 23.32 | stability
B2 | ödmjuk | adjective | 11.86 | humble
B2 | i natt | adverb | 11.85 | last night, tonight
C1 | foster | noun-ett | 7.06 | paternal aunt
C1 | enastående | adjective | 7.05 | outstanding
C2 | allergisk | adjective | 3.40 | allergic
C2 | eskalera | verb | 3.40 | escalate

Table 3.4.: Example entries from the Swedish KELLY-list.

KELLY | UD
adjective | ADJ
adverb | ADV
aux verb | AUX
conj | CCONJ
det | DET
interj | INTJ
noun | NOUN
noun-en | NOUN
noun-ett | NOUN
noun-en/-ett | NOUN
numeral | NUM
particip | ADJ
particle | PART
prep | ADP
pronoun | PRON
proper name | PROPN

Table 3.5.: The part-of-speech tagset conversion between the KELLY-list and UD.


As indicated in Table 3.2, ten lexical features are extracted on the basis of the incidence score computation and the use of the KELLY-list. 𝑁𝑡 for all lexical features except Avg. KELLY Log Frequency refers to the number of tokens. For each of the features A1 Lemma INCSC, A2 Lemma INCSC, B1 Lemma INCSC, B2 Lemma INCSC, C1 Lemma INCSC and C2 Lemma INCSC, 𝑁𝑐 refers to the number of detected tokens with the given CEFR level. Difficult Word refers to any token whose CEFR level is equal to or above B1, and Difficult NOUN&VERB refers to any token that is a noun or a verb and has a CEFR level equal to or above B1. Since the KELLY-list is a vocabulary sample, 𝑁𝑐 for OOV INCSC refers to the number of tokens that are not registered in the KELLY-list.

By looking up the WPM of a word entry, we get the relative frequency of the word. Since WPM values can be very large, we use the natural logarithm to normalize them. We extract the feature Avg. KELLY Log Frequency by taking the mean of the log-normalized WPM values for the tokens whose lemmas are stored in the KELLY-list.
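The KELLY-based features can be sketched as follows; the three-entry dictionary is a toy stand-in for the real 8409-item list, and the function name is ours:

```python
import math

# A toy stand-in for the Swedish KELLY-list: lemma -> (CEFR level, WPM).
KELLY = {"all": ("A1", 2975.47), "avgå": ("B1", 23.46),
         "eskalera": ("C2", 3.40)}
DIFFICULT_LEVELS = {"B1", "B2", "C1", "C2"}

def lexical_features(lemmas):
    """INCSC-style lexical features plus the mean log WPM, computed
    over the lemmas of a sentence or text."""
    found = [KELLY[lemma] for lemma in lemmas if lemma in KELLY]
    n = len(lemmas)
    oov = n - len(found)
    difficult = sum(1 for level, _ in found if level in DIFFICULT_LEVELS)
    avg_log_wpm = (sum(math.log(wpm) for _, wpm in found) / len(found)
                   if found else 0.0)
    return {"OOV INCSC": 1000 * oov / n,
            "Difficult Word INCSC": 1000 * difficult / n,
            "Avg. KELLY Log Frequency": avg_log_wpm}

feats = lexical_features(["all", "avgå", "eskalera", "okändord"])
print(feats["OOV INCSC"], feats["Difficult Word INCSC"])  # 250.0 500.0
```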

3.3. Scaling and Feature Storage

As shown in Table 3.2, there are in total 65 features that can be used to represent an essay in the AES modelling. To simplify the use of these features for grade prediction by the AES system, we assume that the correlation between each feature and the grade prediction is uniform. However, according to the description of the features in Section 3.2, the range of one feature can differ from another. This range difference has the effect that some features would be overestimated and others overlooked in the grade prediction. Therefore, the ranges of the features are rescaled.

The scaling procedure for each feature goes through the following three steps: 1) all scalars of the feature are extracted from the whole dataset; 2) each scalar 𝑥 is standardized to x̂ by Equation 5a, where x̄ and 𝜎 denote the mean and standard deviation of the scalar list, respectively; 3) each standardized scalar x̂ is projected into the range between 0 and 1 by the sigmoid function in Equation 5b.

x̂ = (𝑥 − x̄) / 𝜎   (5a)

𝑦 = 1 / (1 + 𝑒^(−x̂))   (5b)

Through standardization, we alleviate the range problem within the same feature space. By employing the sigmoid function as a normalization, we mitigate the range variation across features.
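The two-step scaling of Equations 5a-5b can be sketched as follows; whether the thesis uses the population or sample standard deviation is not stated, so the population variant is assumed here:

```python
import math
import statistics

def scale_feature(values):
    """Standardize each scalar over the whole dataset (Equation 5a),
    then squash it into (0, 1) with the sigmoid function (Equation 5b)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero variance
    return [1 / (1 + math.exp(-(x - mean) / std)) for x in values]

scaled = scale_feature([0.0, 5.0, 10.0])
print([round(y, 3) for y in scaled])  # [0.227, 0.5, 0.773]
```

The mean value always maps to exactly 0.5, and values symmetric around the mean map to outputs symmetric around 0.5.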

The feature extraction is independent of the dataset splitting and of the choice of AES model. Hence, the text representation through extracted features can be computed in advance. Since most features are available on both the sentence level and the text level, we build two matrices to store the feature scalars, one per level.

After sentence segmentation, all distinct sentences in the entire dataset are collected. To store 𝑚 distinct sentences given 𝑛 features per sentence, we build a matrix of size 𝑚 × 𝑛, where 𝑚 denotes the sentence index and 𝑛 the feature index. By looking up the sentence index 𝑚𝑖 and the feature index 𝑛𝑖, the scalar of that sentence for that feature is easily located. In the same way, we create another matrix to store the feature scalars of 𝑗 student essays given 𝑘 linguistic features on the text level.
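The storage scheme can be sketched with plain nested lists (a numerical library would work equally well; the sizes here are illustrative):

```python
# Two lookup matrices for precomputed feature scalars.
m, n = 5, 63    # distinct sentences x sentence-level features
j, k = 2, 191   # essays x text-level features

sentence_matrix = [[0.0] * n for _ in range(m)]
text_matrix = [[0.0] * k for _ in range(j)]

# Store the scalar of feature 10 for sentence 2, then look it up by
# (sentence index, feature index).
sentence_matrix[2][10] = 0.73
print(sentence_matrix[2][10])  # 0.73
```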


3.4. An Extracted Example Essay with Linguistic Features

Given the essays, we can deploy the features to represent individual sentences as well as whole texts. 63 out of the 65 features in Table 3.2 are available to represent a sentence, and two features are only available to represent the text. Following the description of text-level features in Section 3.2, each feature that can represent a sentence is expanded into three distinct sub-features to represent the whole text. The number of features is given in Table 3.6 with respect to linguistic aspect and linguistic unit.

Feature block | Sentence-level | Text-level
Count-based features | 12 | 38
Morphological features | 30 | 90
Syntactical features | 11 | 33
Lexical features | 10 | 30
Total | 63 | 191

Table 3.6.: The number of features in terms of feature block.

To take a closer look at the linguistic features, we extract two sentences from the data described in Section 5.1 as a mini example essay to be represented with feature scalars.

Trots allt detta finns det vuxna individer som Bert Karlsson som tar sig rätten att at- tackera oss unga med falska påståenden om hur vi är. Men jag väljer, som smart och driftig ungdom, att inte låta Bert sätta en stämpel i min panna.

The translation of the example essay is given below.

In spite of all this, there are adult individuals like Bert Karlsson who take the right to attack us young people with false claims about who we are. But I choose, as a smart and energetic youth, not to let Bert put a stamp in my forehead.

Given the example essay, we denote the two sentences by 𝑆1 and 𝑆2, and the whole text by 𝑇. 𝑇𝑓 denotes the text-based feature scalar according to the feature definition. Given the feature scalars of the individual sentences, 𝑇𝑎𝑣𝑔 and 𝑇𝑠𝑡𝑑 denote the average and standard deviation of those scalars. We present the feature scalars with both original and rescaled values (the latter in parentheses): Table 3.7 for count-based features, Table 3.8 for morphological features, Table 3.9 for syntactical features and Table 3.10 for lexical features.


Count-based features | 𝑆1 | 𝑆2 | 𝑇𝑓 | 𝑇𝑎𝑣𝑔 | 𝑇𝑠𝑡𝑑
NO. Character in Token | 113.00 (0.69) | 74.00 (0.48) | 187.00 (0.10) | 93.50 (0.71) | 27.58 (0.28)
NO. Character in Type | 110.00 (0.72) | 74.00 (0.51) | 174.00 (0.09) | 92.00 (0.77) | 25.46 (0.27)
NO. Token | 25.00 (0.69) | 18.00 (0.52) | 43.00 (0.10) | 21.50 (0.75) | 4.95 (0.23)
NO. Type | 24.00 (0.74) | 18.00 (0.58) | 39.00 (0.07) | 21.00 (0.84) | 4.24 (0.19)
Avg. Token Length | 4.52 (0.48) | 4.11 (0.37) | 4.35 (0.36) | 4.32 (0.32) | 0.29 (0.19)
Avg. Type Length | 4.58 (0.47) | 4.11 (0.35) | 4.46 (0.06) | 4.35 (0.27) | 0.33 (0.20)
NO. Long Token | 0.00 (0.41) | 0.00 (0.41) | 0.00 (0.25) | 0.00 (0.29) | 0.00 (0.13)
NO. Long Type | 0.00 (0.41) | 0.00 (0.41) | 0.00 (0.24) | 0.00 (0.29) | 0.00 (0.13)
NO. Misspelling | 0.00 (0.42) | 0.00 (0.42) | 0.00 (0.27) | 0.00 (0.40) | 0.00 (0.21)
Lix | 41.00 (0.59) | 29.11 (0.40) | 35.45 (0.52) | 35.06 (0.46) | 8.41 (0.13)
Bi-logarithm TTR | 0.99 (0.59) | 1.00 (0.68) | 0.97 (1.00) | 0.99 (0.76) | 0.01 (0.18)
Square Root TTR | 4.80 (0.78) | 4.24 (0.65) | 5.95 (0.04) | 4.52 (0.91) | 0.39 (0.06)
NO. Sentence | - | - | 2.00 (0.11) | - | -
NO. Paragraph | - | - | 1.00 (0.27) | - | -

Table 3.7.: Count-based feature scalars given the example essay.


Morphological features | 𝑆1 | 𝑆2 | 𝑇𝑓 | 𝑇𝑎𝑣𝑔 | 𝑇𝑠𝑡𝑑
Modal VERB to VERB | 0.00 (0.26) | 0.00 (0.26) | 0.00 (0.01) | 0.00 (0.04) | 0.00 (0.00)
S-VERB to VERB | 250.00 (0.68) | 0.00 (0.40) | 142.86 (0.76) | 125.00 (0.95) | 176.78 (0.45)
NOUN to VERB | 750.00 (0.37) | 1000.00 (0.42) | 857.14 (0.23) | 875.00 (0.23) | 176.78 (0.08)
Past Participle to VERB | 0.00 (0.43) | 0.00 (0.43) | 0.00 (0.19) | 0.00 (0.23) | 0.00 (0.18)
Present Participle to VERB | 0.00 (0.50) | 0.00 (0.50) | 0.00 (0.50) | 0.00 (0.50) | 0.00 (0.50)
Past VERB to VERB | 750.00 (0.62) | 333.33 (0.32) | 571.43 (0.51) | 541.66 (0.43) | 294.63 (0.36)
Present VERB to VERB | 0.00 (0.40) | 0.00 (0.40) | 0.00 (0.30) | 0.00 (0.37) | 0.00 (0.16)
Supine VERB to VERB | 0.00 (0.41) | 0.00 (0.41) | 0.00 (0.18) | 0.00 (0.21) | 0.00 (0.11)
PRON to NOUN | 2000.00 (0.73) | 333.33 (0.35) | 1166.67 (0.77) | 1166.66 (0.76) | 1178.51 (0.66)
PRON to Prep | 2000.00 (0.66) | 1000.00 (0.45) | 1750.00 (0.69) | 1500.00 (0.80) | 707.11 (0.24)
3SG PRON INCSC | 76.92 (0.58) | 0.00 (0.29) | 42.55 (0.32) | 38.46 (0.39) | 54.39 (0.42)
S-VERB INCSC | 38.46 (0.72) | 0.00 (0.40) | 21.28 (0.85) | 19.23 (0.97) | 27.20 (0.51)
Neuter Gender NOUN INCSC | 38.46 (0.47) | 0.00 (0.31) | 21.28 (0.17) | 19.23 (0.29) | 27.20 (0.21)
PART INCSC | 38.46 (0.67) | 47.62 (0.73) | 42.55 (0.96) | 43.04 (1.00) | 6.48 (0.09)
PUNCT INCSC | 38.46 (0.27) | 142.86 (0.63) | 85.11 (0.39) | 90.66 (0.42) | 73.82 (0.57)
SCONJ INCSC | 0.00 (0.35) | 0.00 (0.35) | 0.00 (0.08) | 0.00 (0.21) | 0.00 (0.02)
ADJ INCSC | 115.38 (0.66) | 95.24 (0.59) | 106.38 (0.90) | 105.31 (0.88) | 14.24 (0.04)
ADV INCSC | 38.46 (0.31) | 47.62 (0.34) | 42.55 (0.06) | 43.04 (0.09) | 6.48 (0.01)
NOUN INCSC | 115.38 (0.38) | 142.86 (0.45) | 127.66 (0.25) | 129.12 (0.25) | 19.43 (0.06)
VERB INCSC | 153.85 (0.59) | 142.86 (0.55) | 148.94 (0.77) | 148.36 (0.78) | 7.77 (0.01)
Functional Token INCSC | 500.00 (0.53) | 476.19 (0.47) | 489.36 (0.52) | 488.10 (0.48) | 16.84 (0.02)
Lexical Token to Non-lexical Token | 1000.00 (0.43) | 1100.00 (0.48) | 1043.48 (0.47) | 1050.00 (0.42) | 70.71 (0.09)
Lexical Token to Token | 500.00 (0.47) | 523.81 (0.53) | 510.64 (0.48) | 511.90 (0.52) | 16.84 (0.02)
CCONJ & SCONJ INCSC | 38.46 (0.38) | 142.86 (0.79) | 85.11 (0.70) | 90.66 (0.81) | 73.82 (0.82)
Rel INCSC | 76.92 (0.77) | 0.00 (0.34) | 42.55 (0.83) | 38.46 (0.89) | 54.39 (0.79)
ADJ Variation | 272.73 (0.70) | 222.22 (0.62) | 250.00 (0.96) | 247.48 (0.94) | 35.72 (0.05)
ADV Variation | 90.91 (0.32) | 111.11 (0.34) | 100.00 (0.06) | 101.01 (0.11) | 14.28 (0.01)
NOUN Variation | 272.73 (0.41) | 333.33 (0.49) | 300.00 (0.32) | 303.03 (0.33) | 42.85 (0.02)
VERB Variation | 363.64 (0.63) | 333.33 (0.58) | 350.00 (0.88) | 348.49 (0.86) | 21.43 (0.02)
Nominal Ratio | 666.67 (0.44) | 800.00 (0.47) | 714.29 (0.51) | 733.34 (0.44) | 94.28 (0.14)

Table 3.8.: Morphological feature scalars given the example essay.


Syntactical features | 𝑆1 | 𝑆2 | 𝑇𝑓 | 𝑇𝑎𝑣𝑔 | 𝑇𝑠𝑡𝑑
NO. DepArc Length 5+ | 44.00 (0.87) | 4.00 (0.45) | 48.00 (0.25) | 24.00 (0.96) | 28.28 (0.70)
Max DepArc Length | 9.00 (0.90) | 5.00 (0.49) | 9.00 (0.45) | 7.00 (0.96) | 2.83 (0.93)
Avg. DepArc Length | 5.12 (0.92) | 3.14 (0.52) | 4.23 (0.90) | 4.12 (0.93) | 2.27 (0.91)
Right DepArc Ratio | 0.46 (0.37) | 0.52 (0.50) | 0.49 (0.12) | 0.49 (0.22) | 0.04 (0.07)
Left DepArc Ratio | 0.54 (0.63) | 0.48 (0.50) | 0.51 (0.88) | 0.51 (0.78) | 0.04 (0.07)
Modifier Variation | 115.38 (0.38) | 95.24 (0.33) | 106.38 (0.08) | 105.31 (0.10) | 14.25 (0.01)
Pre-modifier INCSC | 115.38 (0.52) | 95.24 (0.46) | 106.38 (0.43) | 105.31 (0.49) | 14.25 (0.01)
Post-modifier INCSC | 0.00 (0.30) | 0.00 (0.30) | 0.00 (0.01) | 0.00 (0.07) | 0.00 (0.02)
Subordinate INCSC | 653.85 (0.74) | 428.57 (0.57) | 1807.69 (0.25) | 541.21 (0.85) | 159.29 (0.01)
Rel Clause INCSC | 576.92 (0.91) | 0.00 (0.38) | 3133.33 (0.30) | 288.46 (1.00) | 407.95 (0.99)
PREP Comp INCSC | 384.62 (0.64) | 142.86 (0.40) | 3615.38 (0.42) | 263.74 (0.65) | 170.95 (0.17)

Table 3.9.: Syntactical feature scalars given the example essay.

Lexical features | 𝑆1 | 𝑆2 | 𝑇𝑓 | 𝑇𝑎𝑣𝑔 | 𝑇𝑠𝑡𝑑
A1 Lemma INCSC | 538.46 (0.48) | 428.57 (0.30) | 489.36 (0.15) | 483.51 (0.17) | 77.70 (0.10)
A2 Lemma INCSC | 115.38 (0.76) | 47.62 (0.50) | 85.11 (0.94) | 81.50 (0.90) | 47.91 (0.39)
B1 Lemma INCSC | 0.00 (0.37) | 47.62 (0.64) | 21.28 (0.47) | 23.81 (0.65) | 33.67 (0.41)
B2 Lemma INCSC | 0.00 (0.41) | 0.00 (0.41) | 0.00 (0.17) | 0.00 (0.22) | 0.00 (0.15)
C1 Lemma INCSC | 0.00 (0.43) | 47.62 (0.86) | 21.28 (0.96) | 23.81 (1.00) | 33.67 (0.72)
C2 Lemma INCSC | 0.00 (0.45) | 0.00 (0.45) | 0.00 (0.31) | 0.00 (0.34) | 0.00 (0.28)
Difficult Word INCSC | 0.00 (0.33) | 95.24 (0.71) | 42.55 (0.51) | 47.62 (0.67) | 67.34 (0.59)
Difficult NOUN&VERB INCSC | 0.00 (0.35) | 95.24 (0.78) | 42.55 (0.70) | 47.62 (0.83) | 67.34 (0.69)
OOV INCSC | 346.15 (0.47) | 428.57 (0.62) | 382.98 (0.71) | 387.36 (0.69) | 58.28 (0.09)
Avg. KELLY Log Frequency | 6.68 (0.47) | 6.38 (0.39) | 6.56 (0.16) | 6.53 (0.22) | 0.21 (0.14)

Table 3.10.: Lexical feature scalars given the example essay.


4. Neural Network

To take full advantage of the automatically extracted linguistic features, we deploy a neural network (NN) as the approach for AES modelling, motivated by the fact that NNs, as a data-driven approach, have achieved significant results in AES-related experiments (Alikaniotis et al., 2016; Dasgupta et al., 2018; Hussein et al., 2019; Schmidhuber, 2015).

In this chapter, we introduce the mechanism of neural networks with respect to our AES modelling. Section 4.1 gives an introduction to artificial neural networks, with focus on two types: the feedforward neural network in Section 4.2 and the recurrent neural network in Section 4.3. Two variants of recurrent neural networks, long short-term memory and gated recurrent units, are also specified. In the last section of this chapter, we introduce the hyperparameters involved in our neural networks.

4.1. Artificial Neural Network

An Artificial Neural Network (ANN) (Gurney, 1997) is a simulation of the functionality of interconnected neurons in the human brain. The element of the ANN corresponding to the neuron is called a node or unit. In general, each node takes input signals from upstream nodes and sends output signals to downstream nodes according to the workflow of the network architecture. The content and quantity of the signal passing through the current node depend on the incoming signals from the upstream nodes, the function stored in the node, and the weights describing the connection strength between the node and the nodes connected to it. Given a training pattern, an ANN can be adapted to the training data by adjusting these weights. The process of modifying the weights to make the ANN model fit the data is called learning. If the true label is given as a mentor to monitor the model training, this form of learning is called supervised machine learning.

In the case of AES systems, the objective of the neural network is a function 𝑓(·) that predicts the grade ŷ of a student essay 𝑥: ŷ = 𝑓(𝑥). During the training phase, the model learns to generate the grade ŷ of the essay by minimizing the difference between the predicted grade ŷ and the human-assigned grade 𝑦. On the one hand, the weight update is a result of learning the data patterns exposed in the training dataset. On the other hand, the characteristics of the neural network architecture as well as the set of hyperparameters have a profound influence on how the weights are updated. Below, we present the choice of neural network architectures employed for AES systems and state the setting of hyperparameters (see Section 4.4).
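The supervised objective ŷ = 𝑓(𝑥) can be illustrated with a minimal sketch: a linear model trained by gradient descent to minimize the squared difference between predicted and human-assigned grades. This illustrates the learning loop only and is not the thesis architecture:

```python
def train(data, epochs=200, lr=0.1):
    """Fit y_hat = w*x + b by stochastic gradient descent on the
    squared error (y_hat - y)^2."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y     # y_hat - y
            w -= lr * 2 * err * x     # gradient of the squared error w.r.t. w
            b -= lr * 2 * err         # ... and w.r.t. b
    return w, b

# Toy data: one feature scalar per essay -> a numeric grade.
data = [(0.2, 1.0), (0.5, 2.0), (0.8, 3.0)]
w, b = train(data)
```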

In order for a trained AES model to be able to predict the grade of an unseen essay that does not occur in the training data, the model needs to capture the features that correlate with grade prediction. In other words, the model has to learn enough characteristics of the data patterns to generate a grade, so that it does not underfit the data. At the same time, the model should avoid overfitting, namely learning idiosyncrasies of the training data that do not carry over to grade prediction. Therefore, we separate the dataset into a training set, used to train the model, and a validation set, which serves for model estimation. We output the model when it has converged to the extent that it neither underfits nor overfits the dataset.

References
