
AUTHORSHIP CLASSIFICATION USING THE VECTOR SPACE MODEL AND KERNEL METHODS

Submitted by

Emil Westin

A thesis submitted to the Department of Statistics in partial fulfillment of the requirements for a two-year Master of Arts degree in Statistics in the Faculty of Social Sciences

Supervisor

Rauf Ahmad


ABSTRACT

Authorship identification is the field of classifying a given text by its author, based on the assumption that authors exhibit unique writing styles. This thesis investigates the semantic shortcomings of the vector space model by constructing a semantic kernel created from WordNet, which is evaluated on the problem of authorship attribution. A multiclass SVM classifier is constructed using the one-versus-all strategy and evaluated in terms of precision, recall, accuracy and F1 scores. Results show that the use of the semantic scores from WordNet degrades the performance compared to using a linear kernel. Experiments are run to identify the best feature engineering configurations, showing that removing stopwords has a positive effect on the financial Reuters dataset, while the Kaggle dataset, consisting of short extracts of horror stories, benefits from keeping the stopwords.


Contents

1 Introduction
2 Background
  2.1 Basic concepts
  2.2 Vector Space Model
  2.3 Kernel Methods
  2.4 Previous research
3 Data
  3.1 Preprocessing
  3.2 Document Term Matrices
  3.3 Standardization and weighting
4 Method
  4.1 SVM
  4.2 Kernels
  4.3 Implementation
  4.4 Feature Engineering
  4.5 Evaluation
5 Results
  5.1 Experimental results
  5.2 Incorporating semantic similarities and fine tuning
6 Discussion
A Semantic Matrix Python Code


1 Introduction

A common way to represent natural language text data in numerical form is to use the Vector Space Model (VSM), also commonly known as the Bag-of-Words (BoW) model. Under the VSM, a document is represented by the frequencies of the words it contains, ignoring the word order. For example, in this representation the following two sentences are equal: “John greeted Mary” and “Mary greeted John”, since both sentences have exactly the same words and the same word frequencies. Clearly, this model has flaws, since we lose important information such as the syntax (word order) and semantic information (i.e. how the words relate to each other).

The VSM can be extended in many ways. In fact, it can be shown that the VSM can be used to create kernels (Shawe-Taylor and Cristianini 2004). This opens up the possibility to do document classification using Support Vector Machines (SVM).

The goal of this thesis is to evaluate the performance of semantic kernels on the problem of authorship attribution or identification. It is based on the idea that different authors exhibit certain peculiarities in their writings. In linguistics, this is called stylometric analysis. We construct an SVM model with the goal of classifying an unlabeled text to an author. We use the BoW model to represent our data in the form of a document-term matrix to be used as training data for the SVM model. We incorporate into a kernel a semantic matrix which contains weights that measure how similar two words are. Our objective is to get a clear picture of the classification performance of these different kernels and to discuss the results.

The main research questions of this thesis are:

• Does the semantic matrix improve the classification accuracy?

• How do the models perform on different data sets with variation in how long the documents are?

• How does the feature engineering affect the classification results in terms of stopwords, weighting and pre-processing?


2 Background

Representing textual data in such a way that a model can 'learn' useful representations is a central problem in the field of natural language processing (NLP). This field deals with giving computers the ability to process human language (Jurafsky and Martin 2009), for example through automatic analysis or classification of text documents, of which authorship attribution is a subfield.

2.1 Basic concepts

This section provides a very brief introduction to important language-specific terms and notations. For details, see Jurafsky and Martin (2009) and Manning et al. (2008).

A word, in the English language, is a sequence of letters from the alphabet, separated by spaces or punctuation symbols (commas, full stops, etc.). A document is a unit containing textual information that we are interested in, for example a news article, a single book paragraph, etc. A collection of computer-readable documents is called a corpus (plural: corpora), which represents the full set of documents we are interested in.

Tokenization is the process of automatically dividing a text sequence into meaningful units, such as words or characters, which we call tokens. At the same time certain characters may be discarded, such as punctuation symbols, since they don't contribute any information in the bag-of-words model.

An N-gram is an N-token sequence of words. For example, a 1-gram or unigram sequence is: "please", "call", "the", "cops". A 2-gram or bigram sequence is: "please call", "call the", "the cops". The advantage of considering N-grams with N > 1 is increased context.

The main distinction is that when we use N-grams, the resulting model is called a bag-of-N-grams, where each term represents an N-gram.
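As a small illustration, a sketch in Python using scikit-learn's CountVectorizer (the thesis pipeline itself builds these matrices with text2vec in R, so this is purely illustrative) shows how the same sentence yields different features for different N:

from sklearn.feature_extraction.text import CountVectorizer

text = ["please call the cops"]

# unigrams: each feature is a single token
uni = CountVectorizer(ngram_range=(1, 1)).fit(text)
print(uni.get_feature_names_out())  # ['call' 'cops' 'please' 'the']

# bigrams: each feature is a pair of consecutive tokens
bi = CountVectorizer(ngram_range=(2, 2)).fit(text)
print(bi.get_feature_names_out())  # ['call the' 'please call' 'the cops']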

2.2 Vector Space Model

In the Vector Space Model (VSM) or Bag-of-Words (BoW) model the main idea is to represent a text document, or a collection of documents, as a set (bag) of words. The assumption of this model is that the original word order in the documents is not taken into account. Instead, each document is mapped to an N-dimensional space after the words have been tokenized and often preprocessed in some manner, such as removing stop words. Each element in this vector represents how many times a term appears in a given document, which is called the term frequency. Formally, the bag can be represented as a row vector in an N-dimensional space, where N is the total number of terms in the dictionary (Shawe-Taylor and Cristianini 2004):

φ : d → φ(d) = (tf(t_1, d), tf(t_2, d), ..., tf(t_N, d)) ∈ F = R^N    (1)

where tf(t_i, d) refers to the frequency of the i-th term t_i appearing in document d, and F represents the feature space in which each feature is a term, so each term corresponds to a dimension. Typically the vocabulary size N is very large, exceeding the number of training examples.

The advantage of the BoW model is its simplicity. Additionally, the preprocessing reduces the dimensionality by removing words that don’t contribute much information.

Some obvious drawbacks of this model are that grammatical information is lost (no word order) and that no semantic information is encoded. Furthermore, polysemy (words with many meanings) and synonymy (words with similar or same meaning) are ignored. In other words, if two similar documents contain similar text but use different synonyms, they will be considered different in the BoW model (Hussain 2019). Another drawback is the need for preprocessing, which can be time consuming.

Let us consider an example of two short documents, d_1 and d_2. Let d_1 = “The cat sat in the hat.” and d_2 = “The dog chased the cat.” and let the corpus be d = (d_1, d_2)'. Tokenization is performed on the documents and the punctuation symbol (full stop) is removed. In the preprocessing we remove the stopwords “the” and “in”. The resulting dictionary of unique terms is t = (t_1 = cat, t_2 = sat, t_3 = hat, t_4 = dog, t_5 = chased)' with vocabulary size N = 5.


The document-term matrix D collects the feature vector of each document:

D = ( φ(d_1) )  =  ( tf(t_1, d_1)  tf(t_2, d_1)  tf(t_3, d_1)  tf(t_4, d_1)  tf(t_5, d_1) )    (2)
    ( φ(d_2) )     ( tf(t_1, d_2)  tf(t_2, d_2)  tf(t_3, d_2)  tf(t_4, d_2)  tf(t_5, d_2) )

  = ( tf(cat, d_1)  tf(sat, d_1)  tf(hat, d_1)  tf(dog, d_1)  tf(chased, d_1) )    (3)
    ( tf(cat, d_2)  tf(sat, d_2)  tf(hat, d_2)  tf(dog, d_2)  tf(chased, d_2) )

  = ( 1 1 1 0 0 )  (2×5)    (4)
    ( 1 0 0 1 1 )

where tf() refers to the term frequency. In practice, the columns are sorted alphanumerically. A large corpus will result in a document-term matrix that is very sparse (containing many zeroes), which leads to the curse of dimensionality (Hussain 2019; Zervas and Ruger 1999).
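This toy document-term matrix can be reproduced with a short Python sketch (again with scikit-learn rather than the text2vec pipeline; note that CountVectorizer orders the columns alphabetically, so they appear as cat, chased, dog, hat, sat rather than in the order of Eq. (4)):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat in the hat.", "The dog chased the cat."]

# tokenize, lowercase and remove the stopwords "the" and "in"
vec = CountVectorizer(stop_words=["the", "in"])
D = vec.fit_transform(docs).toarray()

print(vec.get_feature_names_out())  # ['cat' 'chased' 'dog' 'hat' 'sat']
print(D)
# [[1 0 0 1 1]
#  [1 1 1 0 0]]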

2.3 Kernel Methods

A kernel is a function that computes the inner product between two feature maps (Shawe-Taylor and Cristianini 2004):

κ(x, z) = ⟨φ(x), φ(z)⟩ = φ(x)φ(z)^T    (5)

for all x, z ∈ X, where X is the input space and φ is a mapping from X to a feature space of dimension N as seen in Eq. (1). Note that φ(x) and φ(z) are row vectors.


For all ℓ documents we can create a kernel matrix as

K = D D^T = ( φ(d_1) )  ( φ(d_1)^T  φ(d_2)^T  ...  φ(d_ℓ)^T )    (7)
            ( φ(d_2) )
            (  ...   )
            ( φ(d_ℓ) )

  = ( κ(d_1, d_1)  κ(d_1, d_2)  ...  κ(d_1, d_ℓ) )    (8)
    ( κ(d_2, d_1)  κ(d_2, d_2)  ...  κ(d_2, d_ℓ) )
    (     ...          ...      ...      ...     )
    ( κ(d_ℓ, d_1)  κ(d_ℓ, d_2)  ...  κ(d_ℓ, d_ℓ) )

which is also known as the Gram matrix, a symmetric ℓ × ℓ square matrix. It is a valid kernel matrix under the constraint that it is positive semi-definite (Shawe-Taylor and Cristianini 2004). This matrix is important because it is the central data type for all kernel-based methods. In this bag-of-words context, we can interpret each element of the kernel matrix as a measure of similarity between two documents based on the terms that they have in common.

A kernel method consists of two parts: a mapping of the data into the feature space, and a learning algorithm that can discover linear patterns in that space. The feature space is where the method looks for linear relations. One advantage of using a kernel is that the coordinates of the bag-of-words vectors (feature vectors) are not needed, only their pairwise inner products. This can act as a dimensionality reduction technique when N > ℓ, since D (ℓ × N) reduces to K (ℓ × ℓ). At the same time, this implies there is some information loss in the kernel matrix when compared to the original set of vectors.


Figure 1: Eigenvalues of the kernel matrix for the Reuters data (dimension 2500 × 2500). All eigenvalues are greater than or equal to zero, illustrating that the matrix is positive semi-definite.
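A minimal numpy sketch of the check shown in Figure 1, here using the toy matrix from Eq. (4) instead of the Reuters data: form K = D D^T and verify that all eigenvalues are non-negative.

import numpy as np

# toy document-term matrix from Eq. (4), shape (l x N) = (2 x 5)
D = np.array([[1, 1, 1, 0, 0],
              [1, 0, 0, 1, 1]], dtype=float)

# Gram (kernel) matrix: each entry is the inner product of two documents
K = D @ D.T
print(K)                      # [[3. 1.]
                              #  [1. 3.]]

# a valid kernel matrix must be positive semi-definite:
# all eigenvalues >= 0 (eigvalsh is appropriate since K is symmetric)
print(np.linalg.eigvalsh(K))  # [2. 4.]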


2.4 Previous research

A semantic kernel can be used as a way to improve the BoW model by incorporating semantic information (Shawe-Taylor and Cristianini 2004). Wang and Domeniconi (2008) explored the semantic shortcomings of the VSM by constructing a semantic kernel built from a taxonomy created from Wikipedia articles. Recently, Hussain (2019) proposed a new semantic kernel based on co-clustering, showing that the semantic issues of the VSM are still interesting to investigate.

Another well-known issue of the VSM is the sparsity of the document-term matrices. Cristianini et al. (2002) introduced the so-called Latent Semantic Kernel, which essentially consists of doing PCA on the kernel matrix in order to find latent “topics”; however, this kernel is mainly applied to document clustering and topic classification.

Lodhi et al. (2002) proposed the use of string kernels, which showed “positive results on moderately sized datasets”. String kernels differ from the Vector Space Model by using character sequences as features instead of terms.

Houvardas and Stamatatos (2006) compared feature-selection techniques for authorship attribution on the Reuters 50-50 corpus, using variable-length instead of fixed-length n-gram features, which showed results similar to the information gain technique.


3 Data

We will consider two data sets, as shown in Table 1. The Reuters_50_50 data set (https://archive.ics.uci.edu/ml/datasets/Reuter_50_50) is a subset of the Reuters Corpus Volume 1 (RCV1) containing longer news articles in the financial domain written by journalists. This is a balanced corpus of 5000 documents, of which 50% is dedicated to training and 50% to testing. Each set contains 50 authors with 50 texts per author.

The other data set comes from Kaggle and is called “Spooky Author Identification”, containing short text extracts from fictional works by three authors: Edgar Allan Poe, HP Lovecraft and Mary Shelley. For simplicity, we will refer to this data set as the “Kaggle” data set. In the original version, the data is already split into a training and a test set. However, the test set does not have any labels for the authors, so we have downloaded the training set containing 19,579 documents, which in turn has been randomly split into a training (70%) and a test set (30%).

Dataset                          | Training set size | Test set size | Number of authors | Average terms/document
Reuters 50 50                    | 2500 (50%)        | 2500 (50%)    | 50                | 505
Kaggle Authorship Identification | 13705 (70%)       | 5874 (30%)    | 3                 | 26

Table 1: Descriptive statistics of the datasets

3.1 Preprocessing

As discussed in the background section, it is common to do preprocessing on textual data in order to reduce the dimensionality of the document-term matrix. Figure 3 depicts the most frequent terms in different document-term matrices. In Figure 3a we can see that the word 'the' has over 70,000 counts. The following five terms (to, of, a, in, and) are also stopwords. When we remove the stopwords from the pre-defined stopword list, the resulting frequencies are shown in Figure 3b. Here we observe that 'said' is the most frequent term, which was the seventh most frequent term in Figure 3a. It is roughly four times more common than the next four terms. For the bigrams in Figure 3c and the trigrams in Figure 3d it is easy to see the context of the terms. The most frequent term in Figure 3d is a phone number. It can therefore be of interest to investigate whether removing numbers from the dictionary has a positive effect.


# Columns DTM      | Reuters | Kaggle
1-gram + stopwords | 33934   | 22275
1-gram             | 33771   | 22109
2-gram             | 421545  | 149646
3-gram             | 597023  | 149969

Table 2: Dimensions of the document-term matrices (DTM). The number of rows for the Reuters part is 2500 in both the training and test set. The number of rows for the Kaggle part is 13705 (train) and 5874 (test). Each column is a term.

3.2 Document Term Matrices

It is important to note that there may be terms in the test-set DTM that are not present in the training set, and vice versa. The standard approach is to only consider the dictionary from the training set, i.e. all words in the test set that are not in the training set are ignored. This ensures that the number of columns of the DTM in the training and test sets match. In Table 2 we see that the DTMs have widely varying numbers of terms.
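A sketch of this standard approach with scikit-learn (illustrative; the thesis builds its DTMs with text2vec in R): the dictionary is built from the training corpus only, and transform() silently ignores test-set terms outside that dictionary.

from sklearn.feature_extraction.text import CountVectorizer

train = ["the cat sat", "the dog ran"]
test = ["the cat barked"]            # "barked" is not in the training set

vec = CountVectorizer()
D_train = vec.fit_transform(train)   # dictionary built from training set only
D_test = vec.transform(test)         # the unseen term "barked" is dropped

print(D_train.shape, D_test.shape)   # (2, 5) (1, 5): column counts match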

3.3 Standardization and weighting

Figure 3: The 10 most frequent terms in each document-term matrix dictionary for the Reuters data. (a) Unigram + stopwords; (b) Unigram + no stopwords; (c) Bigram + no stopwords; (d) Trigram + no stopwords.

The idea behind inverse document frequency (idf) weighting is that very common words carry little discriminating information, since they occur in basically every document, while rare words may get a higher discriminating power. To weight the document-term matrix containing term frequencies, we construct the diagonal matrix R (N × N) containing the idf weights for each term in the dictionary. In order to avoid bias towards longer documents, it is common practice to first standardize the document-term matrix to L1-norm such that the row sums are equal to one.

D_tf-idf,train = D_train(ℓ×N) R_train(N×N)    (10)

  = ( tf(t_1, d_1)  ...  tf(t_N, d_1) )  ( idf(t_1)               )    (11)
    (     ...       ...      ...      )  (           ...          )
    ( tf(t_1, d_ℓ)  ...  tf(t_N, d_ℓ) )  (               idf(t_N) )

where ℓ is the number of documents in the corpus and N is the number of terms in the dictionary. The matrix R created from the training set is also used to weight the test set, since the test data should be treated as unknown. To weight the document-term matrix from the test set, we therefore multiply it by the same matrix R obtained from the training set.
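A numpy sketch of Eqs. (10)-(11), assuming the common idf variant idf(t) = log(ℓ / df(t)) (the thesis delegates the weighting to text2vec, whose exact idf formula may differ): rows are first L1-normalized, then multiplied by the diagonal matrix R built from the training set, and the same R weights the test set.

import numpy as np

D_train = np.array([[1, 1, 1, 0, 0],
                    [1, 0, 0, 1, 1]], dtype=float)  # (l x N)

l = D_train.shape[0]
df = (D_train > 0).sum(axis=0)       # document frequency of each term
idf = np.log(l / df)                 # assumed idf variant; others exist
R = np.diag(idf)                     # diagonal (N x N) matrix of Eq. (10)

# L1-normalize rows so that each row sums to one
D_l1 = D_train / D_train.sum(axis=1, keepdims=True)
D_tfidf_train = D_l1 @ R             # Eqs. (10)-(11)

# the test set is weighted with the SAME R from the training set
D_test = np.array([[2, 0, 1, 0, 0]], dtype=float)
D_tfidf_test = (D_test / D_test.sum(axis=1, keepdims=True)) @ R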


4 Method

This section is devoted to discussing the SVM method as well as the different kernels that we will use later.

4.1 SVM

Figure 4: The optimal hyperplane separates two classes with maximal margin. Source: Vapnik (2000)

Consider the training set of instance-label pairs (x_1, y_1), ..., (x_ℓ, y_ℓ),

(x_i, y_i), i = 1, ..., ℓ,  x_i ∈ R^n,  y ∈ {−1, 1}^ℓ    (12)

where y_i ∈ {−1, 1} denotes the two class labels and ℓ is the number of observations. In the bag-of-words representation we have the training set corpus

(φ(x_i), y_i), i = 1, ..., ℓ,  φ(x_i) ∈ R^N,  y ∈ {−1, 1}^ℓ    (13)

φ(x_i) = (tf(t_1, x_i), tf(t_2, x_i), ..., tf(t_N, x_i))    (14)

where each φ(x_i) is a text document represented by an N-dimensional row vector of term frequencies and N is the size of the dictionary. Recall the 2-dimensional linear regression

y_i = β_0 + β_1 x_i1 + β_2 x_i2,  i = 1, ..., ℓ    (15)

By setting Eq. (15) to zero we get the hyperplane equation, which in two dimensions is a straight line that separates two classes. To be consistent with the literature, we will refer to the coefficient vector as the weight vector w and to the intercept as the bias b.


To classify an observation in the case where the observations are linearly separable, the simplest approach is to observe whether the observation lies above or below the hyperplane:

w^T x_i + b ≥ 1   if y_i = 1    (16)
w^T x_i + b ≤ −1  if y_i = −1    (17)

More compactly,

y_i [w^T x_i + b] ≥ 1,  i = 1, ..., ℓ    (18)

The vectors that satisfy equality in (18) are called the support vectors. According to Vapnik (2000), the goal of SVM is “to find the optimal hyperplane, or the maximal margin hyperplane, such that the set of vectors in the training set is separated without error and the distance between the closest vector to the hyperplane is maximal”. The optimal hyperplane is found by solving the following optimization problem

min_{w,b} (1/2)||w||^2  subject to  y_i [w^T x_i + b] ≥ 1,  i = 1, ..., ℓ    (19)

The optimization can be expressed in another form by solving the Lagrangian. For details and a step-by-step tutorial, see Smith (2004).

In general the data is often not perfectly separable. In this case, the goal is to solve the optimization problem for the so-called soft-margin SVM:

min_{w,b} (1/2)||w||^2 + C Σ_{i=1}^ℓ ξ_i  subject to  y_i [w^T x_i + b] ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., ℓ    (20)

Here C is a regularisation parameter which controls the cost of misclassification: a low value of C yields a larger-margin hyperplane at the cost of allowing more margin violations, while a larger C penalizes misclassified observations more heavily.

The decision function is

f(x) = sgn( Σ_{i=1}^ℓ y_i α_i K(x_i, x) + b )    (21)


For multinomial or multi-class classification, i.e. k > 2 categories to classify, the one-versus-all (OvA) or one-versus-one (OvO) SVM strategies are used. The latter constructs one classifier per pair of classes. For k classes it is necessary to train (k choose 2) = k(k − 1)/2 classifiers. All k(k − 1)/2 classifiers are used to classify a test observation, and by counting the frequency of each class assignment the final classification is decided by the most frequently assigned class (James et al. 2013).

The strategy in one-versus-all classification is to fit f_1, ..., f_k classifiers, where each classifier is trained to separate one class from the rest (Schölkopf et al. 2002). The test observation is assigned to the class for which Eq. (21) is largest before applying the sgn function, i.e.

argmax_{j=1,...,k} g_j(x),  where  g_j(x) = Σ_{i=1}^ℓ y_i α_{i,j} K(x_i, x) + b_j    (22)

For example, for k = 3 classifiers, let z_1 be the first observation in the test set, and suppose we get the following decision function values for each class: ĝ_1(z_1) = −1.5, ĝ_2(z_1) = −0.5, ĝ_3(z_1) = 1.5. The third classifier has the largest value, indicating that z_1 lies above its fitted hyperplane, so z_1 is classified as k = 3. This is done for all observations in the test set. The advantages of this approach are faster training time and the possibility to construct one ROC curve for each classifier by comparing different decision thresholds. Since the classifiers are trained on different binary classification problems, the main disadvantage is that the values of g_j(x) are not guaranteed to be on comparable scales according to Schölkopf et al. (2002). One alternative is to transform g_j(x) to class probabilities; however, this can be computationally expensive and the argmax of the probabilities may not be the argmax of the scores in Eq. (22). For details, see Wu et al. (2004) and Platt (1999).
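A minimal sketch of the one-versus-all decision rule in scikit-learn, on synthetic stand-in data: decision_function() returns one score g_j(x) per class, and the prediction is the argmax over these scores, matching Eq. (22).

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(60, 5)                  # stand-in for the feature vectors
y = rng.randint(0, 3, size=60)        # k = 3 classes

# LinearSVC fits one binary classifier per class (one-versus-rest)
clf = LinearSVC(multi_class="ovr", C=1.0, max_iter=10000).fit(X, y)

scores = clf.decision_function(X[:1])  # shape (1, 3): g_1, g_2, g_3
print(scores)
print(scores.argmax(axis=1) == clf.predict(X[:1]))  # argmax equals predict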

4.2 Kernels

BoW: κ_BoW(x_i, x_j) = φ(x_i) φ(x_j)^T    (23)

The kernel in Eq. (23) is a linear kernel in φ(x). The linear kernel shows good performance on text classification tasks and is fast to train; on the other hand, it cannot deal with non-linear data (Hussain 2019).

Moving past these default kernels, we define the semantic kernel

κ_sem(x_i, x_j) = φ(x_i) S S^T φ(x_j)^T    (24)

As pointed out by Hussain (2019), traditional kernels (linear, RBF, etc.) compute the similarity between documents based on the terms that the documents have in common. However, they do not consider the semantic similarities between words. One way to tackle this issue is to introduce the semantic matrix S, an N × N matrix containing word weights that measure the similarity between two words. S is defined as

S(N×N) = R(N×N) P(N×N)    (25)

where R is a diagonal matrix containing for example idf weights as in Eq. (10), and P is the proximity matrix that contains semantic similarity scores extracted from WordNet, a lexical database in which one can find the relations among words using synonymy. Synonyms are words denoting the same concept while being interchangeable in many contexts. In WordNet, the synonyms are grouped into unordered sets ('synsets') in a hierarchical fashion. More general terms occur higher in the tree, like “feline”, and link to more specific terms such as “cat”, “domestic cat”, “kitty” in increasing specificity. One way to design the semantic matrix is based on a distance metric (Rada et al. 1989), by setting the similarity between terms i and j to the inverse of the length of the shortest path connecting them in the tree. The shorter the path between the terms i and j, the more semantically similar they are. WordNet refers to the terms as “concepts” but for simplicity we will not make that distinction.

Sim(t_1, t_2) = 1 / (distance(t_1, t_2) + 1),  0 ≤ Sim(t_1, t_2) ≤ 1    (26)

Sim(t_1, t_2) = 1  if  t_1 = t_2    (27)

In Figure 5 we can observe the hypernyms of the words cat (Fig. 5a) and dog (Fig. 5b). Notice that cat and dog have a node in common, which is carnivore. The number of edges between cat and carnivore is 2, and the number of edges between dog and carnivore is also 2. Therefore, the shortest distance between cat and dog in this structure is 2 + 2 = 4. The similarity score between cat and dog is therefore Sim(cat, dog) = 1/(4 + 1) = 0.2. To implement this using WordNet, we will use the nltk library in Python.
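A sketch of the cat/dog computation with nltk, taking the first synset of each word as in the Appendix A code:

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

cat = wn.synsets('cat')[0]   # Synset('cat.n.01')
dog = wn.synsets('dog')[0]   # Synset('dog.n.01')

# path similarity = 1 / (shortest distance + 1), see Eq. (26)
print(cat.path_similarity(dog))  # 0.2 = 1 / (4 + 1)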


(a) The word cat has one hypernym: feline. (b) The word dog has two hypernyms: canine and domestic animal.

Figure 5: Illustration of the hierarchical tree structure in WordNet.

The main algorithm used to create the semantic matrix is outlined in Appendix A.
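Given the proximity matrix P from Appendix A, the semantic matrix and the kernel matrix follow directly from Eqs. (24)-(25); a numpy sketch with small toy inputs (in practice D is the weighted document-term matrix, R the idf matrix and P the WordNet proximity matrix):

import numpy as np

D = np.array([[1., 1., 0.],           # toy (l x N) document-term matrix
              [0., 1., 1.]])
R = np.diag([0.5, 1.0, 0.5])          # toy diagonal idf matrix
P = np.array([[1.0, 0.2, 0.0],        # toy symmetric proximity matrix
              [0.2, 1.0, 0.5],
              [0.0, 0.5, 1.0]])

S = R @ P                             # semantic matrix, Eq. (25)
K_sem = D @ S @ S.T @ D.T             # semantic kernel matrix, Eq. (24)
print(K_sem.shape)                    # (2, 2): one entry per document pair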

An alternative to the path similarity is the Wu-Palmer similarity (Wu and Palmer 1994):

Sim_Wup(t_1, t_2) = 2 · depth(lcs(t_1, t_2)) / (depth(t_1) + depth(t_2))    (28)

where lcs(t_1, t_2) denotes the least common subsumer of the two terms t_1, t_2, i.e. the lowest node in the hierarchy that t_1 and t_2 share as a hypernym, and the depth function refers to the distance between a node and the root node.
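The corresponding nltk call, again taking the first synset of each word (the exact value depends on the installed WordNet version):

from nltk.corpus import wordnet as wn

cat = wn.synsets('cat')[0]
dog = wn.synsets('dog')[0]

# Wu-Palmer similarity, Eq. (28)
print(cat.wup_similarity(dog))  # approximately 0.857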

4.3 Implementation

We have mainly used R to prepare the data, with the following libraries: text2vec (v. 0.6) for creating document-term matrices in the BoW representation, tf-idf weighting, etc. For SVM classification we have implemented the code in Python (v. 3.7.4) using the sklearn library (v. 0.21.3), with the main advantage that custom kernels can be implemented, either by specifying a precomputed kernel matrix or by calling a custom function.


We have transformed our data to a document-term matrix which has been normalized to L1-norm. This document-term matrix can then be used directly as input to the SVM model, or transformed into a kernel matrix. We have considered the linear kernel first. For cross validation, we use 5-fold cross validation grid search implemented in GridSearchCV from sklearn. A practical approach to finding good hyperparameters, according to Hsu et al. (2003), is to specify exponentially growing sequences of the hyperparameters, for example C = 2^−4, 2^−3, ..., 2^4.
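A sketch of this grid search on synthetic stand-in data (the thesis pipeline feeds in the weighted document-term matrices instead):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(100, 20)            # stand-in for the (l x N) tf-idf matrix
y = rng.randint(0, 3, size=100)

# exponentially growing C values, as suggested by Hsu et al. (2003)
param_grid = {"C": [2.0 ** k for k in range(-4, 5)]}

search = GridSearchCV(LinearSVC(max_iter=10000), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)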

When using the tf-idf weighting, Eq. (10) corresponds to fit_transform(), and weighting the test data is done with transform() from the text2vec library, which is a computationally faster implementation than doing the matrix multiplications directly.

The strategy used for SVM multi-class classification is the one-versus-all approach, since it allows for the construction of ROC curves and gives slightly faster training times; this is also the default strategy in sklearn since v. 0.19.

4.4 Feature Engineering

First we want to find a baseline model by testing different configurations. We begin with the linear kernel and 1-gram, 2-gram and 3-gram combinations with stopword removal, observing how the accuracy changes if we remove features. At first we run with the default hyperparameter C = 1 while trying different configurations to find out which preprocessing steps are effective. We will consider the following options:

• Digits removal (true/false)
• Stopwords removal (true/false)
• Stemming (true/false)
• Tf-idf weighting (true/false)
• N-grams (1, 2, 3)
• Pruning: (none, 1e-03, 2e-03, 3e-03, 4e-03, 5e-03, 6e-03, 7e-03, 8e-03) for 1- and 2-grams, and (none, 1e-03, 2e-03, 3e-03, 5e-04) for 3-grams

A pruning value of 3e-03, for example, means that a term should be found in at least 0.3% of all documents. Trigram sequences are less likely to be found in a larger proportion of documents, which is why we consider only a few values for trigrams, for which the dimensionality is reduced.

When we have found some good configurations for feature engineering with a default value of C = 1, we will consider some of these best models by performing 5-fold cross validation with grid search to find the best hyperparameter C for each different model.

For the Reuters data set we train (16 · 9) + (16 · 9) + (16 · 5) = 368 different configurations.

4.5 Evaluation

For classification, we can construct the evaluation metrics from a confusion matrix as in Table 3, where TP = true positive, TN = true negative, FN = false negative, FP = false positive. We will use the following measures to evaluate the classification:

Accuracy = (TP + TN) / (TP + TN + FN + FP)    (29)
Precision = TP / (TP + FP)    (30)
Recall = TP / (TP + FN)    (31)
F1 = 2 · precision · recall / (precision + recall)    (32)

For each class we predict, we get precision, recall and F1 metrics. One way to get an overall metric for all classes is to take the average over all classes, which is called the macro precision, macro recall and macro F1.

                | Predicted Positive | Predicted Negative
Actual Positive | TP                 | FN
Actual Negative | FP                 | TN

Table 3: Confusion matrix
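A sketch of computing these metrics with scikit-learn; average="macro" takes the unweighted mean of the per-class scores, as described above:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = [0, 0, 1, 1, 2, 2]          # toy labels for three classes
y_pred = [0, 1, 1, 1, 2, 0]

print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))              # Eq. (29)
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                              average="macro")
print(p, r, f1)                                    # Eqs. (30)-(32), macro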


5 Results

5.1 Experimental results

For the Kaggle data set, in Table 4 we observe the best configuration with a default value of C = 1, reaching an accuracy of 0.82 with a macro F1 and precision of 0.82. This is the 1-gram model without stopword removal, with tf-idf weighting, no digits removal, no stemming and a pruning of 1e-04. For the Kaggle data set, the documents are of shorter length and of literary nature (i.e. not news articles), which can be the reason why it is necessary to keep the stopwords in this data set. Pruning the data set seems to have a positive effect, while at the same time speeding up the training of the SVM model.

The best Reuters configuration in Table 4 achieves an accuracy of around 0.655 with precision 0.67, which is in accordance with similar studies on the Reuters 50 50 dataset. For this configuration, stopwords were removed, the 1-gram model was used, the vocabulary was pruned by a proportion of 7e-03, with no digits removal, no stemming and, surprisingly, no tf-idf weighting.


Remove stopwords | Ngram | Pruning | Tf-idf | # Features | To lower | Remove digits | Stemming | Accuracy | F1 macro | Precision macro | Recall macro

Kaggle
FALSE | 1 | 1e-04 | TRUE  | 13341  | TRUE | FALSE | FALSE | 0.8214 | 0.8215 | 0.8232 | 0.8200
FALSE | 1 | 1e-04 | TRUE  | 13341  | TRUE | TRUE  | FALSE | 0.8214 | 0.8215 | 0.8232 | 0.8200
FALSE | 1 | 5e-05 | TRUE  | 22275  | TRUE | FALSE | FALSE | 0.8212 | 0.8214 | 0.8231 | 0.8199
FALSE | 1 | 5e-05 | TRUE  | 22275  | TRUE | TRUE  | FALSE | 0.8212 | 0.8214 | 0.8231 | 0.8199
FALSE | 1 | NA    | TRUE  | 22275  | TRUE | FALSE | FALSE | 0.8212 | 0.8214 | 0.8231 | 0.8199
FALSE | 2 | NA    | TRUE  | 153437 | TRUE | FALSE | TRUE  | 0.7932 | 0.7922 | 0.7941 | 0.7906
FALSE | 2 | NA    | TRUE  | 153437 | TRUE | TRUE  | TRUE  | 0.7932 | 0.7922 | 0.7941 | 0.7906
FALSE | 2 | NA    | TRUE  | 167831 | TRUE | FALSE | FALSE | 0.7852 | 0.7848 | 0.7874 | 0.7828
FALSE | 2 | NA    | TRUE  | 167831 | TRUE | TRUE  | FALSE | 0.7852 | 0.7848 | 0.7874 | 0.7828
FALSE | 2 | NA    | FALSE | 153437 | TRUE | FALSE | TRUE  | 0.7727 | 0.7718 | 0.7764 | 0.7684
FALSE | 3 | NA    | TRUE  | 285294 | TRUE | FALSE | TRUE  | 0.6454 | 0.6399 | 0.6450 | 0.6370
FALSE | 3 | NA    | TRUE  | 285294 | TRUE | TRUE  | TRUE  | 0.6454 | 0.6399 | 0.6450 | 0.6370
FALSE | 3 | NA    | FALSE | 285294 | TRUE | FALSE | TRUE  | 0.6431 | 0.6344 | 0.6458 | 0.6298
FALSE | 3 | NA    | FALSE | 285294 | TRUE | TRUE  | TRUE  | 0.6432 | 0.6344 | 0.6458 | 0.6298
FALSE | 3 | NA    | FALSE | 289278 | TRUE | FALSE | FALSE | 0.6342 | 0.6245 | 0.6370 | 0.6197

Reuters
TRUE  | 1 | 7e-03 | FALSE | 4659   | TRUE | FALSE | FALSE | 0.6556 | 0.6551 | 0.6752 | 0.6556
TRUE  | 1 | 8e-03 | FALSE | 4301   | TRUE | FALSE | FALSE | 0.6536 | 0.6524 | 0.6714 | 0.6536
TRUE  | 1 | 6e-03 | FALSE | 5363   | TRUE | FALSE | FALSE | 0.6512 | 0.6504 | 0.6731 | 0.6512
TRUE  | 1 | 1e-03 | FALSE | 10834  | TRUE | FALSE | TRUE  | 0.6516 | 0.6503 | 0.6686 | 0.6516
TRUE  | 1 | 5e-03 | FALSE | 5896   | TRUE | FALSE | FALSE | 0.6516 | 0.6501 | 0.6711 | 0.6516
FALSE | 2 | 2e-03 | FALSE | 39261  | TRUE | FALSE | TRUE  | 0.6324 | 0.6335 | 0.6843 | 0.6324
FALSE | 2 | 2e-03 | FALSE | 37906  | TRUE | TRUE  | TRUE  | 0.6304 | 0.6313 | 0.6804 | 0.6304
FALSE | 2 | 1e-03 | FALSE | 76850  | TRUE | FALSE | TRUE  | 0.6300 | 0.6310 | 0.6854 | 0.6300
FALSE | 2 | 1e-03 | FALSE | 73516  | TRUE | TRUE  | TRUE  | 0.6280 | 0.6283 | 0.6851 | 0.6280
FALSE | 2 | 3e-03 | FALSE | 21741  | TRUE | FALSE | TRUE  | 0.6268 | 0.6273 | 0.6768 | 0.6268
TRUE  | 3 | 5e-04 | TRUE  | 124597 | TRUE | FALSE | TRUE  | 0.6332 | 0.6372 | 0.6863 | 0.6332
TRUE  | 3 | 5e-04 | TRUE  | 122896 | TRUE | FALSE | FALSE | 0.6288 | 0.6321 | 0.6799 | 0.6288
TRUE  | 3 | 1e-03 | TRUE  | 28081  | TRUE | FALSE | TRUE  | 0.6220 | 0.6212 | 0.6518 | 0.6220
TRUE  | 3 | 5e-04 | TRUE  | 118791 | TRUE | TRUE  | TRUE  | 0.6168 | 0.6156 | 0.6537 | 0.6168
TRUE  | 3 | 5e-04 | TRUE  | 117043 | TRUE | TRUE  | FALSE | 0.6140 | 0.6136 | 0.6513 | 0.6140

Table 4: Classification performance for different feature engineering configurations (C = 1) on the Kaggle and Reuters data sets


Figure 6: The effect on classification performance for different configurations: macro precision, recall and F1 scores as a function of the number of features, for the Kaggle and Reuters data, with and without stopwords, and for 1-, 2- and 3-grams. Note that accuracy is excluded since it is equal to the recall.

5.2 Incorporating semantic similarities and fine tuning

Remove stopwords | Ngram | Pruning | Tf-idf | # Features | To lower | Remove digits | Stemming | Kernel | Accuracy | F1 macro | Precision macro | Recall macro | C

Reuters
TRUE  | 1+2 | 0.007 | TRUE | 6544  | TRUE | FALSE | FALSE | Linear             | 0.6764 | 0.6796 | 0.7068 | 0.6764 | 100
TRUE  | 1+2 | 0.007 | TRUE | 6544  | TRUE | FALSE | FALSE | Semantic Path      | 0.5268 | 0.5290 | 0.5525 | 0.5268 | 20
TRUE  | 1+2 | 0.007 | TRUE | 6544  | TRUE | FALSE | FALSE | Semantic Wu-Palmer | 0.3808 | 0.3837 | 0.4043 | 0.3808 | 10

Kaggle
FALSE | 1   | 1e-04 | TRUE | 13341 | TRUE | FALSE | FALSE | Linear             | 0.8214 | 0.8215 | 0.8232 | 0.8200 | 1
FALSE | 1   | 1e-04 | TRUE | 13341 | TRUE | FALSE | FALSE | Semantic Path      | 0.7802 | 0.7797 | 0.7811 | 0.7785 | 2
FALSE | 1   | 1e-04 | TRUE | 13341 | TRUE | FALSE | FALSE | Semantic Wu-Palmer | 0.7596 | 0.7584 | 0.7610 | 0.7563 | 2

Table 5: Performance for different configurations for the Kaggle and Reuters data sets

For the Reuters data set, the best model in Table 5 is the linear kernel with an optimal hyperparameter of C = 100. This larger value indicates that a smaller-margin hyperplane was found to be optimal for classifying more observations correctly.

The best model on the Kaggle data set in Table 5 achieved an accuracy of 0.8214, macro precision 0.8232 and macro F1 0.8215 with C = 1. This smaller value indicates that a hyperplane with a larger margin was found to classify more observations correctly at the cost of some misclassifications. For this model, 1-grams were chosen without stopword removal, with a low pruning value of 1e-04, which reduced the number of features to 13341 and in turn made the matrix less sparse.

In Fig. 7 we have plotted the ROC curves from the models in order to get a better understanding of the model performance for different threshold values. The ROC curves in Figs. 7a, 7c and 7e for the Kaggle data show rather stable curves with AUC values around 0.9, which indicates good discriminative power. Out of the three classes, Edgar Allan Poe (EAP) shows the lowest AUC, which suggests it may be harder to classify.

In Figs. 7b, 7d and 7f the ROC curves for the Reuters data are shown. We decided not to show the labels for each curve since there are 50 categories. Here it is clear that the linear kernel in Fig. 7b shows good discriminating performance with AUC values over 0.9. In Figs. 7d and 7f, the curves reach towards the diagonal, indicating worse performance.

Figure 7: ROC curves for each model and data set. (a) Linear kernel, Kaggle data (AUC per class: 0.91, 0.94, 0.94). (b) Linear kernel, Reuters data. (c) Semantic kernel with path similarity, Kaggle data (AUC per class: 0.89, 0.92, 0.92). (d) Semantic kernel with path similarity, Reuters data. (e) Semantic kernel with Wu-Palmer similarity, Kaggle data (AUC per class: 0.87, 0.90, 0.90). (f) Semantic kernel with Wu-Palmer similarity, Reuters data.


6 Discussion

In this thesis we have explored the semantic issues of the vector space model. The results showed that the semantic kernels with path and Wu-Palmer similarities degraded the model performance on two data sets. We also showed the effects of various feature engineering configurations: for example, stopword removal had a negative effect on classification performance on the fictional short text extracts of the Kaggle dataset, while it improved the results on the longer-length journalistic Reuters dataset.

One potential drawback of constructing the proximity matrix with similarity scores from WordNet is that many words are unknown to WordNet and hence assigned a similarity score of zero. On one hand, the WordNet scores can be considered reliable in the sense that the taxonomy is created by experienced linguists. On the other hand, language constantly evolves, and the semantic kernel may perform worse on data sets with, for example, many slang words or words that have recently been introduced into the language.

We expected the semantic kernel to improve accuracy, which it did not. This leads us to believe that there are different approaches worth exploring in further research:

• Apart from designing the proximity matrix from WordNet, it could be of interest to create a new similarity measure based on co-occurrence information which could be used as a complement for words that are not found in WordNet. For this research area, it would be interesting to find a similarity score such that the scores from WordNet and the scores from co-occurrence information can be interpreted in the same way.


References

A. Basirat. Principal Word Vectors. PhD thesis, Acta Universitatis Upsaliensis, 2018.

N. Cristianini, J. Shawe-Taylor, and H. Lodhi. Latent semantic kernels. Journal of Intelligent Information Systems, 18(2-3):127–152, 2002.

J. Houvardas and E. Stamatatos. N-gram feature selection for authorship identification. In International conference on artificial intelligence: Methodology, systems, and applications, pages 77–86. Springer, 2006.

C.-W. Hsu, C.-C. Chang, C.-J. Lin, et al. A practical guide to support vector classification, 2003.

S. F. Hussain. A novel robust kernel for classifying high-dimensional data using support vector machines. Expert Systems with Applications, 131:116–131, 2019. ISSN 0957-4174. URL http://www.sciencedirect.com/science/article/pii/S0957417419302696.

G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. Springer New York, 2013. ISBN 978-1-4614-7137-0.

D. Jurafsky and J. Martin. Speech and language processing. Pearson International Edition, 2009. ISBN 9780135041963.

H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2(Feb):419–444, 2002.

C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, USA, 2008. ISBN 0521865719.

J. Platt. Probabilistic outputs for SVMs and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 1999.


B. Schölkopf, A. J. Smola, F. Bach, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge Univer-sity Press, USA, 2004. ISBN 0521813972.

B. T. Smith. Lagrange multipliers tutorial in the context of support vector machines. Memorial University of Newfoundland, St. John's, Newfoundland, Canada, page 17, 2004.

V. N. Vapnik. The Nature of Statistical Learning Theory, Second Edition. Statistics for Engi-neering and Information Science. Springer, 2000. ISBN 978-0-387-98780-4.

P. Wang and C. Domeniconi. Building semantic kernels for text classification using Wikipedia. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 713–721, 2008. doi: 10.1145/1401890.1401976.

T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5(Aug):975–1005, 2004.

Z. Wu and M. Palmer. Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pages 133–138. Association for Computational Linguistics, 1994.


A Semantic Matrix Python Code

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
import numpy as np

# initiate the proximity matrix P (N x N) as the identity; len(vocab) = N
P = np.identity(len(vocab))
# iterate only over the upper triangle since the matrix is symmetric
for i in range(0, len(vocab)):
    for j in range(i + 1, len(vocab)):
        try:
            a = wn.synsets(vocab[i])[0]
            b = wn.synsets(vocab[j])[0]
            # path similarity = 1 / (distance(a, b) + 1)
            pathsim = a.path_similarity(b)
            if pathsim is None:
                pathsim = 0
            P[i, j] = pathsim
            P[j, i] = pathsim
        except IndexError:
            # if the word is not in WordNet, the similarity stays 0
            pass


B Confusion matrices

(a) Linear kernel on Kaggle data:

         | Predicted EAP | Predicted HPL | Predicted MWS
True EAP | 2012          | 190           | 216
True HPL | 219           | 1349          | 86
True MWS | 252           | 86            | 1464

(b) Semantic kernel with path similarity on Kaggle data:

         | Predicted EAP | Predicted HPL | Predicted MWS
True EAP | 1910          | 233           | 275
True HPL | 277           | 1254          | 123
True MWS | 277           | 106           | 1419

Confusion matrices for the Reuters data over the 50 authors (true label versus predicted label).
