
Mälardalen University, Västerås, Sweden

Thesis for the Degree of Bachelor of Science in Computer Science

USING ARTIFICIAL INTELLIGENCE

TO VERIFY AUTHORSHIP OF

ANONYMOUS SOCIAL MEDIA POSTS

Filip Lagerholm

flm14001@student.mdh.se

+46760460195

Examiner: Baran Çürüklü

Supervisor: Miguel León Ortiz


Abstract

The widespread use of social media, along with the possibility to conceal one's identity in the proliferation of ubiquitous technology, combined with crime and terrorism becoming digitized, has increased the need for ways to find out who hides behind an anonymous alias. This report deals with authorship verification of posts written on Twitter, with the purpose of investigating whether it is possible to develop an auxiliary tool that can be used in crime investigation activities. The main research question in this report is whether a set of tweets written by an anonymous user can be matched to another set of tweets written by a known user, and whether, based on their linguistic styles, it is possible to calculate a probability that the authors are the same. The report also examines how linguistic styles can be extracted for use in artificially intelligent classification, and how much data is needed to get adequate results. The subject matter is interesting as the work described in this report concerns a potential future scenario where digital crimes are difficult to investigate with traditional network-based tracking techniques. The approach to the problem is to evaluate traditional methods of feature extraction in natural language processing, and to classify the extracted features using a type of recurrent neural network called Long Short-Term Memory. While the best result in the experiment that was carried out achieved an accuracy of 93.32%, the overall results showed that the choice of representation, and the amount of data used, are crucial. This thesis complements existing knowledge in that very short texts, in the form of social media posts, are in focus.


Contents

1 Introduction
2 Background
   2.1 Authorship analysis
   2.2 Natural language processing (NLP)
      2.2.1 Bag-of-words representation
      2.2.2 N-gram representation
      2.2.3 Term frequency-inverse document frequency (TF-IDF)
   2.3 Supervised learning
   2.4 Artificial neural networks (ANNs)
      2.4.1 Feedforward neural networks (FNNs)
      2.4.2 Recurrent neural networks (RNNs)
      2.4.3 Long Short-Term Memory networks (LSTMs)
3 Related work
   3.1 Authorship analysis and feature representations
   3.2 Recurrent neural networks for authorship analysis
4 Problem formulation
   4.1 Assumptions and limitations
5 Method
   5.1 Scientific approach
   5.2 Data and pre-processing
   5.3 Classifying data
6 Ethical and societal considerations
7 Implementation
   7.1 Evaluation
      7.1.1 Adjustable parameters
   7.2 Data pre-processing
      7.2.1 Feature representation parameters
   7.3 Neural network classifier (LSTM)
      7.3.1 Building the LSTM model
      7.3.2 Training and validation
      7.3.3 Testing
8 Results
   8.1 Parameter configuration
   8.2 Training and validation results
   8.3 Final test accuracies
9 Discussion
   9.1 Interpretation of results
   9.2 Thesis goals and research questions
   9.3 Relation to previous research
   9.4 Generalization and possible consequences
10 Conclusions
11 Future work


1 Introduction

The Internet has become increasingly important. It affects and facilitates nearly every aspect of modern life. While one of these aspects is our ability to communicate globally with the touch of a button, the widespread use of social media has become established within criminal networks and terrorist organizations [1,2,3]. At the same time, it is easier than ever to be anonymous on the Internet. Vicious people use this opportunity to spread hateful messages, communicate their plans, and to recruit members to their organizations [4]. Therefore, if we have the opportunity, it is important to use modern technology in order to facilitate crime prevention activities.

The purpose of this work is to investigate whether it is possible to develop a system that can function as an auxiliary tool in the context of criminal investigations and crime prevention. Specifically, the problem that is studied is whether it is possible to match posts written by an anonymous user on social media to posts written by a non-anonymous user, and whether it is possible to calculate how likely it is that the users are the same person. The term for this is authorship verification (in contrast to attribution and profiling) [5,6].

In this thesis, the approach to the problem is to extract the linguistic styles of tweets (Twitter messages) by converting them into numerical feature vectors, using existing methods in natural language processing (NLP). The feature representations that are evaluated are bag-of-words, n-grams (bigrams and trigrams), term frequency-inverse document frequency (TF-IDF), and, in addition, some combinations of them. Feature vectors created by these methods can be used as input for training an artificial neural network (ANN), and the trained network model can in turn be used as a classifier, which was done in this thesis. The type of neural network that was used is a type of recurrent neural network (RNN) called Long Short-Term Memory (LSTM). This method is suitable in this context as RNNs are preferable for information that is naturally sequential [7], such as text, and the LSTM variant of the RNN allows modeling of even more complex information than standard neural networks [8].

Similar work has been carried out, although most of it focuses on longer text documents than this thesis does, or does not use LSTMs specifically. Combining stylometry, a science more than a hundred years old, with technology supported by today's computational capacity, and applying systematic methods to assist important societal functions, makes this thesis both interesting and important.

In an experiment carried out in this thesis, one of the evaluated representations (bigrams at character level) showed better results than the others, with a best accuracy of 93.32% when over 17,000 tweets were used. The experiment also showed that when the amount of data is very limited, the classifier could not successfully distinguish between the stylistic properties of text written by a known author (the positive class) and text not written by that author (the negative class). A consequence of these results is that the approach can be generalized and used for authorship verification in many areas where short texts occur.

This report is written for readers who have a basic knowledge of linear algebra, i.e. knowledge of the vector space model, matrix arithmetic, etc. In addition to this introduction, which serves the purpose of giving an increased understanding of the entire work, the fundamental theoretical background is described in Section 2, while related work and the state of the art within authorship analysis, and the use of recurrent neural networks in this context, are presented in Section 3. The research questions along with their limitations are described in Section 4, and the scientific and practical methods used to answer them are described in Section 5. The important topics of anonymity and personal integrity are discussed in Section 6. Section 7 describes what has been done during the work, i.e. as a result of the defined methods. Section 8 describes the setup and results of the experiment, while Section 9 goes in depth with interpretations of the results. Finally, the report concludes with a summary of the conclusions and a final analysis in Section 10, followed by suggestions of what can be done to extend the work in Section 11.


2 Background

As the main objective of this thesis was to explore the possibilities for text classification of short social media posts, appropriate methods, used in relevant and contemporary research, have been implemented and evaluated. This section outlines the fundamental knowledge and theory needed to understand the contribution of this report. Subsection 2.1 introduces important terms in authorship analysis, while Subsection 2.2 covers the basics of natural language processing (NLP) and common data representations in text classification. Finally, Subsections 2.3 and 2.4 seek to build an understanding of the method used for classification.

2.1 Authorship analysis

Authorship analysis, also called stylometry, is the study of linguistic style with the aim of determining who is the original author of a text. Originally, its methods were used for analysis of letters, or literary works such as those of Shakespeare [9]. In today's applications, there are three general perspectives, referred to as authorship attribution, authorship profiling, and authorship verification [5,6]. Authorship attribution involves finding a probable author within a multitude of several other authors. Profiling refers to finding attributes that are likely to reveal an anonymous author's age, gender, origin, etc. Authorship verification focuses on whether an author's linguistic style matches another author's linguistic style. The purpose may be to increase the degree of suspicion that they are actually the same author. This thesis focuses on the latter perspective, i.e. authorship verification.

A common method in stylometry is to analyze an author's vocabulary and the use-frequency of the words in it, and then compare it with another author's vocabulary. It is also possible to specifically analyze the use-frequency of function words, i.e. numerals, pronouns, prepositions, conjunctions, auxiliary verbs, etc. Other methods used in stylometry to compare texts are to analyze their average sentence length, or the use of very unusual words.

The concept of authorship analysis can be abstracted to a collection of methods for classification, which opens the possibility of using systematic methods. To automate the classification of data, artificial neural networks can be used [10].

2.2 Natural language processing (NLP)

Natural languages come in many forms such as Mandarin, Spanish, English, Hindi, etc. They are the languages spoken by humans, and they have evolved naturally throughout history [11]. Using characters and symbols, they can be expressed in writing. Since most algorithms require some type of numerical values for calculations, these characters need to be processed and converted from raw data into suitable representations. Suitable representations for machine learning purposes are commonly numerical feature vectors of fixed length [7].

The field of NLP is very broad and can be applied for many purposes. As for this thesis, the related NLP tasks are referred to as information extraction, and document classification [12]. Information extraction is about extracting features from text, and document classification concerns labeling of text into classes. The former task can also include activities such as modifying or cleaning text from, e.g., unnecessary information.

The general process of converting text into numerical feature vectors may be referred to as vectorization. Given several feature vectors, an entire dataset can be represented using a matrix [7]. Three of the most commonly used representations [13], bag-of-words, n-grams, and term frequency-inverse document frequency (TF-IDF), are described in Subsections 2.2.1 to 2.2.3. It should be mentioned that these representations often occur with variations or in combination with each other.

2.2.1 Bag-of-words representation

The bag-of-words representation is constructed in two steps: tokenization, and counting token occurrences. Tokenization means breaking a stream of text into smaller parts, called tokens. Tokens are usually at word level, but may also be more fine-grained, such as at character level or even bit level. By counting token occurrences, a representation of word use-frequency is obtained [14]. For convenience, consider the tweets "I just visited a lab that is working on a miracle" and "We just visited a factory that is producing new types of fuel". Fig. 1 illustrates the steps of creating a bag-of-words representation of these. While this model is efficient for many classification applications, in some cases it is not sufficient to use it alone. The reason is that the order of the words is lost, and it does not take multi-word expressions into account [14].

['a', 'factory', 'fuel', 'I', 'is', 'just', 'lab', 'miracle', 'new', 'of', 'on', 'producing', 'that', 'types', 'visited', 'We', 'working']

(a) List of tokens

[2, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1]

(b) First tweet

[1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]

(c) Second tweet

Figure 1: When applying tokenization, a list of strings, such as a), will be constructed. By counting the occurrences, the term frequency feature representation (bag-of-words) is produced for each tweet, as shown in b) and c).
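For reference, the construction in Fig. 1 can be reproduced with scikit-learn's CountVectorizer (one of the libraries used in this thesis, see Section 5.2). The following is a minimal sketch, not the thesis's actual pre-processing code; the token_pattern and lowercase arguments are chosen here so that single-character tokens and casing match Fig. 1, since scikit-learn's defaults would otherwise lowercase the text and drop one-character tokens.

from sklearn.feature_extraction.text import CountVectorizer

tweets = ["I just visited a lab that is working on a miracle",
          "We just visited a factory that is producing new types of fuel"]

# Keep single-character tokens such as 'a' and 'I', and preserve casing
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\S+")
counts = vectorizer.fit_transform(tweets)   # sparse matrix, one row per tweet

print(vectorizer.get_feature_names_out())   # the list of tokens, as in Fig. 1 a)
print(counts.toarray())                     # term frequencies, as in Fig. 1 b) and c)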

2.2.2 N-gram representation

The n-gram representation reduces the problems of lost word order and the lack of consideration for multi-word expressions in the bag-of-words representation [15]. While a bag-of-words representation is in fact a unigram, n-grams refer to wider representations such as bigrams, trigrams, etc. Consider another example of tweets, "Looking forward to receiving a specific email" and "Bob fired Jeff". As shown in Fig. 2, the multi-word expression "Looking forward" has lost its meaning, and we do not know if Bob fired Jeff, or the other way around. Using a simple n-gram, where n = 2, these potential issues are reduced as mentioned. Even though the global order still may be lost (although it is kept in the example for easier visualization), a certain local order is kept among the word pairs.

['email', 'forward', 'a', 'to', 'specific', 'Looking', 'receiving']

(a) Unigram

['Looking forward', 'forward to', 'to receiving', 'receiving a', 'a specific', 'specific email']

(b) Bigram of first tweet

['Bob fired', 'fired Jeff']

(c) Bigram of second tweet

Figure 2: a) shows a unigram representation, while b) and c) are bigrams.
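The same vectorizer can produce the n-grams of Fig. 2. A hedged sketch: ngram_range selects the value(s) of n, and analyzer switches between word-level and character-level grams (the latter being the variant that performed best in the experiment, see Section 8).

from sklearn.feature_extraction.text import CountVectorizer

tweet = ["Looking forward to receiving a specific email"]

# Word-level bigrams, as in Fig. 2 b)
word_bigrams = CountVectorizer(ngram_range=(2, 2), lowercase=False,
                               token_pattern=r"\S+")
word_bigrams.fit(tweet)
print(word_bigrams.get_feature_names_out())
# ['Looking forward' 'a specific' 'forward to' 'receiving a'
#  'specific email' 'to receiving']

# Character-level bigrams of the same tweet
char_bigrams = CountVectorizer(analyzer="char", ngram_range=(2, 2))
char_bigrams.fit(tweet)
print(char_bigrams.get_feature_names_out())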

2.2.3 Term frequency-inverse document frequency (TF-IDF)

Another potential flaw with the bag-of-words representation, when used alone, is that for some applications, very common words such as "a", "in", "to", or "the" are likely to represent the highest use-frequency while not being very important. A way to address this problem is to, along with the term frequency, also calculate the "inverse document frequency" (IDF), and then combine them by a simple multiplication [16,17,18]. A document in this context is synonymous with, for example, a single tweet. The intuition for IDF is that it reflects how much information a certain word provides with respect to all available documents/tweets. In other words, terms occurring in, e.g., all tweets are assumed to be rather unimportant. Using a raw count for the term frequency and a logarithmically scaled inverse fraction for the inverse document frequency, the basic equation for TF-IDF is as follows:

\mathrm{tfidf}(t, d, D) = f_{t,d} \cdot \log \frac{n}{|\{d \in D : t \in d\}|},    (1)

where, continuing the Twitter example, f_{t,d} is the raw count of the number of occurrences of term t in tweet d, d is the document (a tweet), n is the total number of tweets, and |{d ∈ D : t ∈ d}| is the number of tweets in which the term t (a word in a tweet) appears. Fig. 3 shows how the tweets "My name is Jeff", "Use the force, Luke", and "Plato is my friend" are represented after applying TF-IDF.

['force', 'friend', 'is', 'Jeff', 'Luke', 'my', 'name', 'Plato', 'the', 'use']

(a) List of tokens

[0.000000, 0.000000, 0.428046, 0.562829, 0.000000, 0.428046, 0.562829, 0.000000, 0.000000, 0.000000]

(b) First tweet

[0.500000, 0.000000, 0.000000, 0.000000, 0.500000, 0.000000, 0.000000, 0.000000, 0.500000, 0.500000]

(c) Second tweet

[0.000000, 0.562829, 0.428046, 0.000000, 0.000000, 0.428046, 0.000000, 0.562829, 0.000000, 0.000000]

(d) Third tweet

Figure 3: An example of how three tweets can be represented by TF-IDF.
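The values in Fig. 3 appear to match the defaults of scikit-learn's TfidfVectorizer, which uses a smoothed variant of Eq. 1, idf(t) = ln((1 + n)/(1 + |{d ∈ D : t ∈ d}|)) + 1, and normalizes each row to unit length. A sketch under that assumption:

from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["My name is Jeff", "Use the force, Luke", "Plato is my friend"]

# Default settings: lowercasing, smoothed IDF, and L2 row normalization.
# With these, "is" in the first tweet receives the weight 0.428046 and
# "Jeff" receives 0.562829, reproducing Fig. 3 b).
vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(tweets)

print(vectorizer.get_feature_names_out())
print(weights.toarray().round(6))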

2.3 Supervised learning

Supervised learning refers to the process of learning from a set of examples of input and labeled output pairs. The goal is to find a function that can successfully make output predictions given new and unseen inputs [7]. Motivations for such systems are that future outputs are often not known in advance when certain events occur, and that when designing, e.g., software, the actual solutions may not be possible for humans to create themselves. As an example, humans may be very good at detecting tangible objects in real life, while they are very bad at detecting cell patterns of medical conditions.

This type of learning is one of three main learning techniques in artificial intelligence, the other two being unsupervised learning and reinforcement learning. The differences between these and supervised learning are that in unsupervised learning, the output labels are not known in advance, and in reinforcement learning, "rewards" or "punishments" are used to find a desired behavior. However, these other two types of learning techniques are not considered in this thesis, since I use data which is labeled to belong to a positive class (written by a known author) or a negative class (not written by the known author).

As an example of how data can be represented in supervised learning, a set of n social media posts with corresponding author labels can be represented as (p_1, a_1), (p_2, a_2), ..., (p_n, a_n), where p_i is a social media post generated by an unknown function y = f(x) (the function of an author writing the actual text), and a_i is the label that identifies the author. With this representation, the goal is to find a function h that successfully predicts f. Since the output labels in my case consist of a finite set of values (positive (1) or negative (0)), the problem of finding this function is called a classification problem.

The function that is found based on any set of input-output data, which is called training data, is typically just an approximation. In order to evaluate its final accuracy, another set, referred to as testing data, is needed. In a testing procedure, the function h is given new input from this separate dataset. By allowing the function to predict the output for this input, the predicted label can be compared with the actual output label. In this way, the overall accuracy can be calculated. Sometimes a third set of data, called validation data, is used to validate the accuracy during training.
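As an illustration of this three-way division, and assuming the 60/20/20 proportions used later in this thesis (Section 5.3), the splitting can be sketched with scikit-learn on placeholder data:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 10)            # 100 dummy examples with 10 features each
y = np.random.randint(0, 2, size=100)  # binary labels: 1 = positive class

# Split off 20% for testing, then 25% of the remaining 80% for
# validation, which gives 60% training, 20% validation, 20% testing.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25)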

2.4 Artificial neural networks (ANNs)

Artificial neural networks (ANNs) are artificially intelligent models that can be used for solving classification problems [19, 20, 21]. Inspired by the human brain's neural network [20], they typically consist of a collection of neural units. These units are connected to each other via weighted edges, analogous to how a biological brain's dendrites, axons, and synapses are connected. The basic concept is that, given input data, the network learns by iteratively calculating error rates. Based on these, it adjusts the weights to reduce the error rates in subsequent iterations. When enough iterations have been performed, the network represents a model that can be used to classify unseen data.

Usually, a network of neurons is structured into several layers. These layers are referred to as the input layer, hidden layer(s), and output layer. Each layer has an arbitrary number of nodes with edges connecting them [8]. As ANN is the collective name for several types of neural networks, the subsections below describe the fundamentals of the types that are relevant in this thesis. These are feedforward (FNN), recurrent (RNN), and Long Short-Term Memory (LSTM) neural networks. As RNNs are an extension of FNNs, and LSTMs are an extension of RNNs, the fundamental theories of the different types are described in chronological order in the remaining subsections.

2.4.1 Feedforward neural networks (FNNs)

The simplest form of a neural network is called a feedforward neural network (FNN). An FNN has one input layer and one output layer, but it can have one or several hidden layers in between. A network with a single hidden layer is referred to as a "single-layer perceptron", while one with several hidden layers may be called a "multi-layer perceptron" [8]. These can be seen in Figs. 4 and 5. Input data, in the context of supervised machine learning, consists of a set of vectors with associated labels. The vectors contain parameters, each of which is fed to a separate node in the input layer. The labels represent the class of each vector representation. A vector-label pair is usually referred to as a training example, or training case.

Figure 4: A single-layer perceptron. The input layer has i nodes, the hidden layer has j nodes, and the output layer has k nodes.


Figure 5: A multi-layer perceptron, here with three hidden layers. The dashed arrows indicate that an arbitrary number of hidden layers can be used.

As the name “feedforward” implies, data is being passed forward through the network, from layer to layer, in a step known as forward pass. Fig. 6 illustrates a simplistic mathematical model of a single neuron, and how data from preceding neurons is being processed and passed forward. The output of the unit is calculated using the equation:

y = f\left(\sum_{i=1}^{n} x_i w_i + w_0\right),    (2)

where x_i is either the input data or the output of unit i, and w_i is the current weight from unit i. The bias weight w_0 allows the output to be shifted to fit the output prediction better. f is a so-called activation function, which may vary depending on the desired behavior. Commonly used activation functions are shown in Fig. 7. The purpose of an activation function is to return the probability of whether an input belongs to a specific class.

Figure 6: A simplistic mathematical model of a single neuron: the inputs x_1, x_2, x_3 are multiplied by the weights w_1, w_2, w_3, summed together with the bias w_0, and passed through the activation function f to produce the output y.


Figure 7: Three commonly used activation functions: sigmoid(t) = 1/(1 + e^{-t}) [22, 23], tangens hyperbolicus (TanH), tanh(t) = 2/(1 + e^{-2t}) - 1 [23], and the rectified linear unit (ReLU), relu(t) = 0 if t < 0, else t [24].
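The three functions in Fig. 7 are straightforward to express numerically; a minimal NumPy sketch:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def tanh(t):
    return 2.0 / (1.0 + np.exp(-2.0 * t)) - 1.0  # equivalent to np.tanh(t)

def relu(t):
    return np.maximum(0.0, t)                    # 0 if t < 0, else t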

After having presented data to the input layer, and propagated it through all hidden layers towards the output layer in the forward pass step, the goal is to update all the weights in the network according to how wrong the outputs are. This is typically done by using a loss function, which first calculates the error term in the output layer. This can be done using the squared error function:

\delta = \frac{1}{2}(t - y)^2,    (3)

where δ is the squared error (error term), t is the target (the label of each vector passed in the forward pass step), and y is the actual output value in the current node. Next, the error term needs to be calculated for each node in the hidden layer(s), in what is called the backward pass. For each j:th hidden node, this can be done using the following equation:

\delta_j = o_j(1 - o_j) \sum_{k=1}^{K} \delta_k w_{kj},    (4)

where δ_j is the error term at a hidden layer node, and o_j is the current output of that node. The sum that is calculated is the sum of the products of the error terms and the weights connecting the current hidden node to each of the K connected nodes of the following layer, i.e. the layer whose error terms were computed previously in the backward pass.

Finally, when error terms have been calculated in the output layer and the hidden layer(s), the weights connecting the nodes can be adjusted. For a weight from node i to node j, the update rule is defined as:

\Delta w_{ji} = \eta \, \delta_j \, x_{ji},    (5)

where Δw_{ji} represents how much the weight should be changed, η is a constant value referred to as the learning rate, and x_{ji} is either the input parameter or the output value of the previous layer. When these values have been calculated, the state of the network is updated by adding them to the current weights.

The steps forward pass, backward pass, and updating of weights are together referred to as the training phase, and are the building blocks of the so-called backpropagation algorithm [25,26] (shown in Algorithm 1).


Input: Training and validation set
Output: Model with adjusted weights
1 Initialize weights;
2 while overall error rate is not acceptably low, or within max number of epochs do
3     Forward pass: present training examples to the network and compute outputs;
4     Compute error term(s) in the output layer;
5     Backward pass: compute error terms in the hidden layer(s);
6     Update weights;
7 end

Algorithm 1: Backpropagation algorithm
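To make Algorithm 1 concrete, the following is a minimal NumPy sketch of the training phase for a single-hidden-layer network with sigmoid activations, following Eqs. 2-5. It is illustrative only: the data, network size, learning rate, and number of epochs are arbitrary placeholders, and the error terms are applied with a minus sign so that the update moves against the gradient.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
X = rng.random((8, 3))             # 8 training examples with 3 input features
t = rng.integers(0, 2, (8, 1))     # binary target labels
W1 = rng.normal(0, 0.5, (3, 4))    # input -> hidden weights (4 hidden nodes)
b1 = np.zeros(4)                   # hidden bias weights (w0 in Eq. 2)
W2 = rng.normal(0, 0.5, (4, 1))    # hidden -> output weights
b2 = np.zeros(1)
eta = 0.5                          # learning rate (Eq. 5)

for epoch in range(100):
    # Forward pass (Eq. 2, applied layer by layer)
    o_hidden = sigmoid(X @ W1 + b1)
    y = sigmoid(o_hidden @ W2 + b2)
    # Error term in the output layer (derivative of the squared error, Eq. 3)
    delta_out = (y - t) * y * (1 - y)
    # Backward pass: error terms in the hidden layer (Eq. 4)
    delta_hidden = o_hidden * (1 - o_hidden) * (delta_out @ W2.T)
    # Update weights (Eq. 5)
    W2 -= eta * o_hidden.T @ delta_out
    b2 -= eta * delta_out.sum(axis=0)
    W1 -= eta * X.T @ delta_hidden
    b1 -= eta * delta_hidden.sum(axis=0)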

When initializing a network, weights are usually randomized. Because of this, the accuracy is typically very bad at the beginning of the training phase. While repeating the steps in the training phase for all training examples, during a number of "epochs", the weights adapt and the network becomes better and better. The training ends when the error term in the output layer is acceptably low, or when the network has been trained for a certain number of epochs. As described in Subsection 2.3, the data used in supervised learning is divided into different sets. The training phase described in this section uses a training set, whereupon a separate testing set is used to calculate the final accuracy of the trained FNN.

As previously shown in Fig. 5, a neural network can be of arbitrary size. Since neural networks are made for generalization of new and unseen data, having a network that is too big may lead to a problem referred to as overfitting [7]. If a network is very large, e.g. having as many hidden nodes as training examples, it would eventually memorize all the training examples, or a substantial portion of them. Fig. 8 tries to give an intuition for this, while Fig. 9 shows that the accuracy of an overfitted network may be adversely affected.

Figure 8: The red and blue dots represent two classes. The green line represents an overfitted model, while the black line represents a model that more likely will result in lower error rates [27].


Figure 9: Overfitting during training (error versus epochs for the training and testing sets). The diagram shows that the testing error rate starts increasing after 50 epochs. This means that the overall accuracy will be worse if training "too much".

Ways of avoiding this issue are early stopping [7], dropout regularization [28], or using a smaller network. The early stopping method is a way of determining how many epochs can be run before the model becomes overfitted. Usually this is done by monitoring the loss function of, e.g., the validation set during the training phase. Dropout is a method where nodes, along with their weights, are "dropped" randomly during training to prevent the network from adapting too much.
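In Keras, the API used later in this thesis (Section 5.3), both countermeasures are directly available. A hedged sketch, with layer sizes and patience chosen arbitrarily for illustration:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Dense(64, activation="relu", input_shape=(100,)),
    Dropout(0.5),                    # randomly drop half of the nodes during training
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping: end training when the validation loss has not improved
# for 5 epochs, and restore the weights from the best epoch.
stopper = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=200, callbacks=[stopper])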

2.4.2 Recurrent neural networks (RNNs)

Even though a representation such as the n-gram may be used to keep a certain local order within pairs, or triples, of words, contexts consisting of longer dependencies are still lost. Recurrent neural networks (RNNs) solve this through their cyclic behavior, in which output data is fed back and used as input data in the subsequent iteration [8]. In theory, an RNN has no limit on how much history it can map. Thus, RNNs are good at incorporating sequences. Since there are an infinite number of ways of designing the recurrence of an RNN, Fig. 10 illustrates a simplistic RNN structure with one hidden layer that is fully connected to itself. The recurrent connections in an RNN act as a "memory", resembling how the human brain works even more than regular FNNs do.

Figure 10: A type of recurrent neural network. The output of the hidden layer is fed back and aggregated along with the input in the subsequent iteration.


As with FNNs, RNNs also have a forward pass step where data is presented to the network. The key difference is that in RNNs, there is an extra dimension of time due to the recursive nature of the network structure. This means that the outputs at the recurrent nodes (the nodes in the hidden layer in Fig. 10) depend not only on the current input, but also on the output of the hidden layer at the previous timestep. The output of hidden nodes is calculated using the following equation:

y_h^t = f_h\left(\sum_{i=1}^{I} w_{ih} x_i^t + \sum_{h'=1}^{H} w_{h'h} b_{h'}^{t-1}\right),    (6)

where, for the first summation, I is the number of input nodes, w_{ih} is the weight of an incoming connection to hidden node h, and x_i^t is the value itself of input i at the current timestep t. For the second summation, H is the number of hidden nodes, w_{h'h} is the weight of the recurrent connection, and b_{h'}^{t-1} is the key part: the actual output of hidden node h' from the preceding timestep t - 1. The function f_h is the activation function, just as in FNNs [8].

The final output at the output layer follows the equation:

y_o^t = f_o\left(\sum_{h=1}^{H} w_{ho} y_h^t\right),    (7)

where o is an output node, w_{ho} is the weight connecting each hidden node to the output node, y_h^t is the input value to the output node at the current timestep t, and f_o is the activation function.
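A minimal NumPy sketch of the forward pass in Eqs. 6 and 7 for the network in Fig. 10; the layer sizes are arbitrary, and tanh and the logistic sigmoid are assumed here as the activation functions f_h and f_o:

import numpy as np

I, H, O = 3, 5, 1                  # input, hidden, and output layer sizes
rng = np.random.default_rng(0)
W_ih = rng.normal(0, 0.5, (I, H))  # input -> hidden weights (w_ih)
W_hh = rng.normal(0, 0.5, (H, H))  # recurrent hidden -> hidden weights (w_h'h)
W_ho = rng.normal(0, 0.5, (H, O))  # hidden -> output weights (w_ho)

xs = rng.random((4, I))            # a sequence of four input vectors
b_prev = np.zeros(H)               # hidden output of the previous timestep

for x_t in xs:
    # Eq. 6: current input plus the fed-back hidden output of timestep t-1
    y_h = np.tanh(x_t @ W_ih + b_prev @ W_hh)
    # Eq. 7: logistic sigmoid output, as is common for classification
    y_o = 1.0 / (1.0 + np.exp(-(y_h @ W_ho)))
    b_prev = y_h                   # becomes b^{t-1} in the next iteration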

When using RNNs for classification, logistic sigmoid [8] or softmax [29] functions are commonly used in the output layer. The loss function for calculating the error terms in an RNN follows the same equations as for FNNs, described in Subsection 2.4.1. In order to update the weights, an algorithm called backpropagation through time (BPTT) is used [30, 26]. This algorithm is just like regular backpropagation, although instead of just backpropagating and updating the weights in a single current timestep, the network is backpropagated through several timesteps as well. As Fig. 11 illustrates, BPTT treats the RNN just as a regular FNN by looking at it from an unfolded perspective.

Figure 11: The same RNN as in Fig. 10, but unfolded in three timesteps.

A potential issue when using BPTT for RNNs is the so-called vanishing gradient problem. BPTT is a gradient learning algorithm, meaning that the amount of change applied when updating the weights in the network corresponds to a gradient descent. When gradients become too small in a network, it cannot learn efficiently. The intuitive effect of the vanishing gradient problem is that the network will not be able to remember early inputs, as they are neglected, or forgotten, over time. This phenomenon only relates to certain activation functions: sigmoid and TanH may suffer from it, while ReLU does not have the problem [31].


2.4.3 Long Short-Term Memory networks (LSTMs)

Long Short-Term Memory (LSTM) networks were introduced by Hochreiter and Schmidhuber in 1997 [32]. Their efforts succeeded in tackling the vanishing gradient problem [31] of standard RNNs, meaning that the proposed network performed better in terms of remembering information over time. Similar to RNNs, LSTMs use recurrent connections. The key difference is that, instead of recurrently connecting regular nodes, the LSTM consists of a network of recurrently connected subnetworks. The subnetworks are referred to as memory blocks [8,33]. Further, a memory block contains three gates, called the input gate, output gate, and forget gate. The effect of these gates is to enhance the control of the network state by allowing the network to decide what information to add or to forget (delete). In some cases, gates can also have so-called peephole connections directly to the memory cell [34]. In a further effort to mimic the human brain, the peephole mechanism makes the network even more sensitive to "spikes", which in physiology can be described as sudden, short-lasting events entailing rapid changes of signals. Using the peephole connections, an LSTM network can learn fine-grained information based on precise timing. Fig. 12 shows the elements of a memory block and how it is structured, while Fig. 13 shows an example of one possible LSTM network design.

Figure 12: An LSTM memory block. The memory cell corresponds to the regular node in, e.g., an FNN. Inputs and outputs are passed through the activation functions f_g and f_h, and the gates, which may have different activation functions, allow for a more dynamic behavior by controlling what data is to be passed through them or not. The red circles are ordinary arithmetic multiplication operators. The dashed arrows show the peephole connections.

As the memory block and the LSTM network shown in Figs. 12 and 13 are only abstractions and ways of illustrating complex mathematical models, their equations for the forward pass and backward pass are even more complex than for FNNs or RNNs. Based on Graves [8], the rest of this section explains the notation of the equations used in an LSTM hidden layer.

As in Subsection 2.4.2, timesteps are denoted with a raised t, where, e.g., t-1 is the previous timestep and t+1 is the next timestep. The variables φ, ι, and ω represent the forget gates, input gates, and output gates respectively. As the equations are presented, variables that have not previously been explained are described as they appear.


Figure 13: An LSTM neural network with two nodes in the input layer, one hidden layer with two recurrently connected memory blocks, and two nodes in the output layer. The memory blocks are the same as in Fig. 12, although the upper block is mirrored to simplify drawing of connections. This is one of many ways of structuring an LSTM network [33], and hence, it is only an attempt to give a sense of the complexity of them.

Forward pass

Input gates:

y_\iota^t = f\left(\sum_{i=1}^{I} w_{i\iota} x_i^t + \sum_{h=1}^{H} w_{h\iota} y_h^{t-1} + \sum_{c=1}^{C} w_{c\iota} s_c^{t-1}\right),    (8)

where y_ι^t is the output value at the input gate in timestep t, and f is the activation function processing the input at the input gate. In the first summation, I is the number of inputs, w_{iι} is the weight between an input and the input gate, and x_i^t is the input parameter. In the second summation, H is the number of memory cells in the hidden layer, w_{hι} is the weight of the recurrent connection from a cell output of another memory block in the hidden layer to the input gate, and y_h^{t-1} is the actual output parameter from that other memory block at the previous timestep. As the third summation is used to add the peephole weights, the index c refers to any of the C memory cells in the network. Then, w_{cι} is the weight of the peephole connection between memory cells and input gates, and s_c^{t-1} refers to the state of cell c at the previous timestep.

Forget gates:

y_\phi^t = f\left(\sum_{i=1}^{I} w_{i\phi} x_i^t + \sum_{h=1}^{H} w_{h\phi} y_h^{t-1} + \sum_{c=1}^{C} w_{c\phi} s_c^{t-1}\right),    (9)

where y_φ^t is the output of the forget gate at the current timestep and, similar to the input gates, w_{iφ} and x_i^t are the input weights and values, w_{hφ} is the weight of the recurrent connection from another memory block in the hidden layer to the forget gate, while w_{cφ} is the weight of the peephole connection between the memory cell and the forget gate.

Cells:

s_c^t = y_\phi^t s_c^{t-1} + y_\iota^t f_g\left(\sum_{i=1}^{I} w_{ic} x_i^t + \sum_{h=1}^{H} w_{hc} y_h^{t-1}\right),    (10)

where s_c^t is the state of cell c, i.e. the actual value of the cell at the current timestep t. f_g is the input activation function. w_{ic} and x_i^t are the weights and values between the input node and the cell at the current timestep, while w_{hc} and y_h^{t-1} are the weights and values of a recurrent connection between a hidden memory block output of the previous timestep and cell c.

Output gates:

y_\omega^t = f\left(\sum_{i=1}^{I} w_{i\omega} x_i^t + \sum_{h=1}^{H} w_{h\omega} y_h^{t-1} + \sum_{c=1}^{C} w_{c\omega} s_c^t\right),    (11)

where y_ω^t is the output of the output gate at the current timestep, and the corresponding weights and values are denoted w and x but with the subscript ω.

Cell outputs:

y_c^t = y_\omega^t f_h(s_c^t),    (12)

where y_c^t, the output of the cell, is the product of the output at the output gate and the current cell state s_c^t passed through the output activation function f_h.

Backward pass

Cell outputs:

\epsilon_c^t = \sum_{k=1}^{K} w_{ck} \delta_k^t + \sum_{g=1}^{G} w_{cg} \delta_g^{t+1},    (13)

where ε is used to differentiate the outputs in the backward pass from y in the forward pass, K is the number of outputs, and G is the total number of inputs in the hidden layer (where the index g is the index of any of those inputs). As with regular FNNs, the error term is denoted δ, and is calculated and used to update weights in the same way as described in Subsection 2.4.1. Hence, δ_k^t is the error term at output node k at the current timestep. The weight w_{cg} is the weight between the cell and input g, and δ_g^{t+1} is the error term for that input at the next timestep.

Output gates:

\delta_\omega^t = y_\omega^t \sum_{c=1}^{C} f_h(s_c^t) \, \epsilon_c^t,    (14)

where δ_ω^t is the error term in the output gate at the current timestep, and f_h is the activation function that processes the current state of the cell.

States:

\epsilon_s^t = y_\omega^t f_h(s_c^t) \, \epsilon_c^t + y_\phi^{t+1} \epsilon_s^{t+1} + w_{c\iota} \delta_\iota^{t+1} + w_{c\phi} \delta_\phi^{t+1} + w_{c\omega} \delta_\omega^t,    (15)

where, putting the pieces together to form the state error ε_s^t of a cell s at the current timestep, y_φ^{t+1} is the output of the forget gate at the next timestep, and ε_s^{t+1} is the state error at the next timestep. w_{cι}δ_ι^{t+1} and w_{cφ}δ_φ^{t+1} are the weights and error terms from the cell to the input gate and forget gate at the next timestep respectively, while w_{cω}δ_ω^t is the weight and error term between the cell and the output gate at the current timestep.

Cells:

\delta_c^t = y_\iota^t f_g\left(\sum_{i=1}^{I} w_{ic} x_i^t + \sum_{h=1}^{H} w_{hc} y_h^{t-1}\right) \epsilon_s^t,    (16)

where the sum passed to the activation function f_g is the same sum that is calculated in the cells (Eq. 10).


Forget gates:

\delta_\phi^t = f\left(\sum_{i=1}^{I} w_{i\phi} x_i^t + \sum_{h=1}^{H} w_{h\phi} y_h^{t-1} + \sum_{c=1}^{C} w_{c\phi} s_c^{t-1}\right) \sum_{c=1}^{C} s_c^{t-1} \epsilon_s^t,    (17)

where the argument passed to the activation function is the same as in Eq. 9, and s_c^{t-1} is the value of the cell state at the previous timestep t - 1.

Input gates:

\delta_\iota^t = f\left(\sum_{i=1}^{I} w_{i\iota} x_i^t + \sum_{h=1}^{H} w_{h\iota} y_h^{t-1} + \sum_{c=1}^{C} w_{c\iota} s_c^{t-1}\right) \sum_{c=1}^{C} f_g\left(\sum_{i=1}^{I} w_{ic} x_i^t + \sum_{h=1}^{H} w_{hc} y_h^{t-1}\right) \epsilon_s^t,    (18)

where the argument passed to the activation function f is the same as in Eq. 8, and the argument passed to the input gate activation function f_g is the same as in Eq. 10.

The forward pass is executed through a series of timesteps, typically from t = 1 to T , while the BPTT backward pass starts at the last timestep T , in order to derive error terms and update weights, all the way back to the first timestep.
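For intuition, the forward pass of a single LSTM timestep (Eqs. 8-12) can be sketched in NumPy. For brevity, this sketch omits the peephole terms (the third summations) and bias weights, and assumes sigmoid gate activations with tanh for f_g and f_h; it is an illustration of the equations, not the implementation used in this thesis.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

I, H = 3, 4                        # number of inputs and of memory cells
rng = np.random.default_rng(0)
# One weight matrix per gate plus one for the cell input, each (I + H) x H,
# covering both the input connections and the recurrent connections.
W_iota, W_phi, W_omega, W_c = (rng.normal(0, 0.5, (I + H, H)) for _ in range(4))

x_t = rng.random(I)                # input at the current timestep
y_prev = np.zeros(H)               # block outputs y^{t-1}
s_prev = np.zeros(H)               # cell states s^{t-1}

z = np.concatenate([x_t, y_prev])  # inputs and recurrent values together
y_iota = sigmoid(z @ W_iota)                      # input gate (Eq. 8)
y_phi = sigmoid(z @ W_phi)                        # forget gate (Eq. 9)
s_t = y_phi * s_prev + y_iota * np.tanh(z @ W_c)  # cell state (Eq. 10)
y_omega = sigmoid(z @ W_omega)                    # output gate (Eq. 11)
y_c = y_omega * np.tanh(s_t)                      # cell output (Eq. 12)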

3 Related work

The approaches in previous work related to authorship analysis vary greatly, regarding everything from the type of text that is analyzed, to how stylometric features are represented, how they are classified, and what kind of classification is in focus. What seems to be common to many previous works is that longer texts, such as email conversations, news articles, blog posts, etc., are used.

This section explores previous works in authorship analysis, including works focusing on feature representation/extraction in text classification, and the use of artificial intelligence (specifically recurrent neural networks).

3.1 Authorship analysis and feature representations

In 2013, Brocardo et al. presented a method for verifying the authorship of short messages [5] using stylometric features. Their experiment was conducted based on a dataset consisting of e-mail conversations, or “online documents”, which are considerably longer than what this thesis is focusing on. Another difference is that they do not use neural networks to classify the texts. Several other similar works have been carried out where they focus on longer texts [35,36,37,38], and where they use the same dataset as [5].

Rocha et al. describe a need for further research in the field of authorship analysis and social media forensics [6]. In their paper, they review different methods that are used to find attributes and to classify the authorship of text. They point out the difficulties when texts do not exceed a certain length, using tweets as an example. Among the discussions about the different methods, they argue that neural networks may not be the most efficient method to tackle the problem. The reason is that they are looking at the problem from a 1:N classification point of view, where the network model predicts an output that belongs to one of a large number of classes. As a high-dimensional feature space is needed to get acceptable classifications, they argue that training a neural network would, in the case of multiclass classification, be too computationally expensive. As previously mentioned, this thesis focuses on authorship verification (not attribution). This is a 1:1 perspective, where we always have a predefined label of whether a social media post is written by, e.g., an anonymous user, or whether it is written by a known user.

In 2014, Koppel and Winter described a method for solving the authorship attribution problem using an unsupervised learning technique [39]. In their experiment, they measure "document similarity" with the goal of determining whether a pair of blog-post authors are the same or not. The main differences between their work and this thesis are that I am using a supervised learning technique, and, as for the previous examples, that the texts used in this thesis are shorter.

Bhargava et al. evaluate classification of stylometric features for use in online forensics [40]. Although they specifically target tweets as the data to be used in their proposed method, the main differences between their work and this thesis are that they, just as [6], focus on authorship attribution (1:N), and that they are using a support vector machine (SVM) for classification instead of a neural network.

Stamatatos presented a study within authorship analysis using n-grams at character level [41]. His experiment was carried out under cross-genre and cross-topic conditions, as he argues that a majority of previous studies about text feature extraction have used data from within the same genre, area, or topic. The results showed that character-level n-grams were better at extracting stylistic features than representations based on the use-frequency of words.

In 2016, Peng et al. took the n-gram concept one step further [42]. Their article describes using bit-level n-grams for forensic authorship analysis on social media. While they highlight the challenges of performing authorship analysis on social media posts due to variance in length and topics, they also argue that a possible lack of attention to grammatical rules may be a big flaw of using higher-level attributes such as word-level n-grams.

3.2 Recurrent neural networks for authorship analysis

Artificial intelligence, along with statistical techniques, has been used for classification of stylometric characteristics or attributes in longer texts for various purposes [43,44].

In 2015, Bagnall proposed a method for authorship identification [45]. His solution uses a special type of RNN that he calls a multi-headed RNN. He argues that the scope of a standard language model, such as n-grams used in FNNs, is limited, while an RNN is more adaptable to the context. The multi-headed aspect of his RNN model consists of several independent RNNs which share a common recurrent state. The independent RNNs, which all have individual softmax outputs, are mainly trained on one author's corpus each, while the common recurrent state approximates the linguistic style of both authors. The purpose of this structure was to prevent overfitting. Although Bagnall's solution showed promising results, he remarks that there are more sophisticated and more modern methods based on the RNN concept. He concludes that LSTM networks can handle larger contexts and are capable of capturing longer-range and more complex dependencies. Whether they are effective in the context of authorship verification, he leaves as an open question and recommends further research on the issue.

In 2016, Lee and Dernoncourt presented a model based on a combination of an RNN (more specifically an LSTM network) and a convolutional neural network (CNN) for short-text classification [46]. They have a general approach, targeting several applications such as sentiment analysis, question answering, and dialogue management. They emphasize that short texts often appear in contexts in which a part of the text has a certain dependence on a preceding part of the text. While they argue that it is particularly necessary to take this into account, they mention that many ANN systems used for classification of texts today analyze texts separately, without this consideration. Therefore, they argue that RNNs/LSTMs are superior in this regard. The strength of their work is that the results are based on an evaluation of three different datasets, where the content represents texts written in different settings (three different platforms). While their LSTM solution shows good performance for general applications, the work described in this thesis extends the existing knowledge as it specifically targets social media posts.

In another article from 2016, Rao and Spasojevic [47] applied word embeddings (vector representations of text features) with LSTM for classifying social media posts as democratic/republican, or actionable/non-actionable. Although their work is not about identifying authorship, the similarity to this thesis is that they analyze short texts. Their model shows an accuracy of around 85%, and the solution is currently used by companies in order to facilitate prioritization of, e.g., customer support messages.


4 Problem formulation

The problem investigated in this thesis relates to the processing of natural language text, and the interpretation of the data by means of artificial intelligence. Specifically, the main research question of this thesis is defined as follows:

• "Is it possible to, for a set of anonymous social media posts, use a neural network to calculate a probability of whether it matches a set of posts written by a non-anonymous author?",

and two subproblems are:

• "What representations, i.e. linguistic/stylistic features, are appropriate to extract and use for authorship verification in the context of very short texts (at most 140 characters)?"

• "How does the lack of accessible data affect the reliability, and the ability to classify an author's stylistic peculiarities?"

The purpose of this work was to statistically show that it is possible to obtain a sufficiently convincing probability of whether two sets of messages written on social media belong to the same author or not (see Subsection 4.1 for further explanation). Further, the motivation and purpose is also that the answer to the problem definition could be the basis for the development of auxiliary tools operating in crime prevention/investigation activities. Regardless of whether the problem has been examined previously or not, it is important to continuously explore the possibilities of authorship verification under different conditions or circumstances. The motivation for this is that the world is changing. People change their habits and ways of communicating depending on whom they communicate with, or the platform they use.

In the experiment carried out in this thesis, the goals were to, for each author analyzed, extract their linguistic style, and classify it using a neural network. The intention was to allow for the calculation of accuracy values that could show how well a classifier was able to distinguish between stylistic characteristics. The information obtained through this has made it possible to draw conclusions about the research question. A discussion based on the results, how they can be linked to the purpose of this work, and how they can be transferred to practical applications, is found in Section 9.

4.1 Assumptions and limitations

As mentioned in the beginning of this section, the desired result of the experiment was a sufficiently convincing probability. What defines this depends on the context, and what the requirements are. Since it is possible to calculate a probability of more or less anything, what is meant by a sufficiently convincing probability is that it must be relevant, reasonable, and adequate for conclusions to be drawn. Initially, my assumption was that a probability of 70% would meet the general requirements. The choice of 70% was based on an arbitrary guess. However, this choice was not irreversible, and it is discussed in the evaluation of the results.

I assume that one of the biggest challenges in the classification of very short text messages relates to whether the texts that are analyzed contain enough peculiarities to allow for any conclusions to be drawn. However, the motivation that it should work is that the information can be found in the "depth" instead of the "breadth". Some users of social media today share a considerably large amount of text, although in a concise format. The total amount of text in many short social media posts together can still be the same as the amount of text in a few documents such as, e.g., e-mails. In addition, the linguistic style of text written on social media should reasonably reveal more idiosyncrasies than text that is "filtered" through various taught grammatical rules. These were the underlying assumptions that together formed the hypothesis of my work.

Since classic stylometry proposes a variety (hundreds) of methods for analysis of text [13], and because natural languages exist in many forms, the focus was on a few selected stylistic characteristics/representations (presented in Section 5), and some variations of them. Even though there might exist more advanced feature representations than what I use in this thesis, my initial assumption was that the chosen representations would still give sufficiently clear results, allowing the research question to be answered. This means that a proof of concept was in focus, while optimization of performance in terms of accuracy was a secondary priority, although not completely disregarded. Further, the work was restricted to the English language and the Latin alphabet, and I deliberately did not take into account that users over time might adapt their linguistic style as they, e.g., become older, or that they might consciously change it in order to conceal their identity.

5 Method

This section intends to explain and motivate the scientific methods used, and the various choices made during the practical work. The tasks that needed to be completed in order to achieve the defined goals were divided into a two-stage process. The first stage involved processing, and conversion, of text into data that could be interpreted by a classifier, while the second stage involved executing the actual classification. Common to both stages is that Python was used as the main programming language, due to the extensive amount of available documentation regarding both machine learning and NLP.

Subsection 5.1 explains the scientific method and approach of this thesis, while Subsections 5.2 and 5.3 describe how the work was carried out to reach the defined goals.

5.1 Scientific approach

In this thesis work, the concept of falsification, as described by Zobel in the book “Writing for Computer Science” [48], was followed. Essentially, this means that a hypothesis based on the problem formulation was defined in the initial work, after which I tried to find counterexamples in order to falsify it during the rest of the work. More specifically, the method and steps taken to answer the research question followed a hypothetico-deductive model. This is an iterative process where, based on the problem definition, the steps can be described as:

• Hypothetico-deduction step - A speculative solution to the problem is predicted

• Deductive step - A logical reasoning on how a test of the hypothesis can be executed is defined

• Experiment - The hypothesis is tested in an experiment

• Conclusion - The hypothesis is validated or rejected

The first step is where the hypothesis is defined. Then, predictions are deduced from the defined hypothesis. These predictions are the basis of the experiment. Usually when this model is followed, evidence or counterexamples that disprove the hypothesis are sought during the experiment. In the case when a hypothesis is disproved (rejected) in step 4, the process must be repeated with a new hypothesis. However, in this thesis, only one iteration was needed to execute an experiment that allowed the problem formulation to be answered.

As Zobel mentions, falsification will never alone be able to completely verify a hypothesis, but it can effectively be used together with, e.g., the concept of confirmation, which was done in this thesis as well. Confirming a hypothesis in the context of science does not mean that a theory is proved, but rather that we can strengthen our belief in it.

The experiment carried out in this thesis work was in the form of a quantitative survey. Using this approach, I was able to obtain significant results. I also followed a statistical approach, since, according to Zobel, our intuition is not sufficient in some situations [48]. This means that it is not enough to evaluate the results given by an individual case separately and that measurements must be compiled in order to draw significant conclusions.

Using the methods described in Subsections 5.2 and 5.3, the goals defined in Section 4 have been achieved. Thus, the research question and its subproblems could be answered through analysis and discussion of the compiled experiment results.


5.2 Data and pre-processing

The data used in the experiment, obtained from Kaggle^1, consisted of tweets collected from Twitter^2. The reason for using tweets is that Twitter currently has over 300 million active users every month, making it the most used social media platform next to Facebook [49]. The advantage of using tweets instead of Facebook posts is Twitter's explicit 140-character limit on message length, while Facebook's limit is thousands of characters. Using short text messages means a greater challenge, but it also provides an opportunity to address the hypothesis. In the experiment, tens of thousands of tweets composed by eight different authors were used. The choice of using a relatively small number of authors with many tweets each, rather than a large number of authors with fewer messages, is a deliberate choice which aligns with the hypothesis of this work. The reason for not using many authors with many tweets each is the computational time of training neural networks. By comparing both authors who figure in the same area, and those who work in completely different genres, the risk of biased results was reduced [41].

The experiment was divided into eight subtests where one user at a time was chosen to represent the "positive" class (the known author), while data from the other users represented the "negative" class (the anonymous author). In each subtest, the data was composed of 50% from the positive class and 50% from the negative class (the latter half distributed evenly across the remaining seven authors). This means that the task of the classifier was to predict whether the inputs of new and unseen data belonged to the positive class or the negative class, which is a 1:1 classification. For each test, i.e. for each user who simulated the known author, the amount of available data was decreased. This made it possible to evaluate how a limited amount of data affects accuracy. The tweets in each subtest were loaded into a raw data matrix, after which they could be processed into a feature matrix.

For data processing and feature extraction, the most commonly used representations in text classification were implemented and evaluated. These are the bag-of-words, n-gram, and TF-IDF models as described in Subsections 2.2.1 to 2.2.3. The n-grams were implemented as bigrams, both at word- and character level, and trigrams at word level. In addition to these, a combination of bigrams and TF-IDF was also included in the experiment. After processing the raw data matrix into a feature representation, the data of the authors was divided into separate parts. For each of the subtests in the experiment (i.e., for each author in focus), there was one feature matrix and one author label vector. The feature matrix consisted of feature parameters where each row corresponded to the features of a tweet written by an author. The author label vector, which had as many elements as number of rows in the feature matrix, only contained either a 1, or a 0 (1 representing the positive class, and 0 representing the negative).

The libraries used to perform the tasks described in this section were NumPy³, Pandas⁴, and Scikit-learn⁵. NumPy was used for, e.g., shuffling of data, Pandas was used to manage data structures, and Scikit-learn was used for tokenization of the tweets and for counting the use-frequency of tokens.

5.3 Classifying data

The classifier used in the experiment was an LSTM network, implemented with the TensorFlow⁶ library. As described in Subsection 2.4.3, LSTMs are much better than regular FNNs or RNNs at handling sequential data with long dependencies. There are several reasons for using TensorFlow. First, it allows for easy visualization of results. Second, RNN/LSTM networks are known to be costly in the training phase [8, 33], especially if the input consists of many parameters (many input nodes), and since TensorFlow enables GPU acceleration through CUDA without further effort, it was a natural choice. Above all, it allows for fast development of various kinds of neural network models, which facilitated answering the research question. In addition to TensorFlow, the Keras API⁷ was used to speed up the modeling of the neural network design and the execution of the experiment.

¹ https://www.kaggle.com/
² https://about.twitter.com/company
³ http://www.numpy.org/
⁴ http://pandas.pydata.org/
⁵ http://scikit-learn.org/
⁶ https://www.tensorflow.org/
⁷ https://keras.io/


After the tweets had been converted into data that could be understood by and fed to a classifier (as described in Subsection 5.2), the data was further divided into one training set, one validation set, and one testing set. Following existing research on how to determine these fractions [50], the proportions were initially chosen as 60%, 20%, and 20%, respectively.

The training data was fed to an instance of an LSTM model, and the validation data was used to measure the accuracy during training. After training the network sufficiently, the overall accuracy was measured using the testing data.
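As a rough illustration of this train/validate/test flow (not the exact model from Section 7.3), the following Keras sketch builds a small LSTM, monitors validation accuracy during training, and finally evaluates on the held-out test set. The feature matrices and label vectors are stand-ins for the real pre-processed data (dense arrays assumed; sparse vectorizer output would first need .toarray()), and treating each tweet as a sequence of length one is an assumption about how the vectorized tweets are fed to the network.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Illustrative stand-ins for the real feature matrices and label vectors.
num_features = 100
x_train, y_train = np.random.rand(600, num_features), np.random.randint(0, 2, 600)
x_val,   y_val   = np.random.rand(200, num_features), np.random.randint(0, 2, 200)
x_test,  y_test  = np.random.rand(200, num_features), np.random.randint(0, 2, 200)

# Keras LSTMs expect (samples, timesteps, features); here each tweet is
# a single timestep holding its whole feature vector (an assumption).
to3d = lambda x: x.reshape(-1, 1, num_features)

model = Sequential([
    LSTM(64, activation="tanh", input_shape=(1, num_features)),
    Dense(1, activation="sigmoid"),  # binary output: positive vs. negative class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# The validation data measures accuracy during training (the 60/20/20 split).
model.fit(to3d(x_train), y_train,
          validation_data=(to3d(x_val), y_val),
          epochs=5, batch_size=32)

# The overall accuracy is measured on the held-out test set.
loss, accuracy = model.evaluate(to3d(x_test), y_test)
```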

Based on existing research on how to select the number of hidden layers and the number of memory cells, a technique called pruning was used [51]. In the context of neural networks, pruning means starting with a large number of nodes and then decreasing the size of the network, re-training it each time, until the performance is significantly reduced. In this way, a threshold defining how large the network must be in order to be effective could be found. The reason for wanting to decrease the size is that the larger the network, the costlier it is to train.
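A minimal sketch of this procedure follows, assuming a hypothetical build_and_train() helper that trains a model with a given hidden-layer size and returns its validation accuracy; the halving schedule and 2-percentage-point tolerance are illustrative choices, not values from the thesis.

```python
# Start large and halve the hidden layer until accuracy drops noticeably;
# the previous size is then taken as the threshold for an effective network.
baseline = None
for hidden_size in (256, 128, 64, 32, 16, 8):
    accuracy = build_and_train(hidden_size)  # hypothetical helper
    if baseline is None:
        baseline = accuracy
    elif baseline - accuracy > 0.02:  # performance significantly reduced
        print(f"Effective size threshold: {hidden_size * 2} memory blocks")
        break
```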

6 Ethical and societal considerations

As this thesis touches on one of the fundamental human rights of modern societies, the right to personal integrity, it is important to highlight the related ethical aspects. This section describes how data has been processed during this thesis work to avoid improper use and discrimination.

In Sweden, where this thesis has been carried out, there is a law concerning the management of personal data (the Data Protection Act⁸). It defines what rights individuals have regarding privacy, and its purpose is to protect people against privacy violations. The Data Protection Act describes how personal data may be processed, that is, all types of actions that may be taken upon personal data, such as collection, registration, or storage in databases. A fundamental part of the law concerns consent: if data is to be used without censorship, consent must be given by the person concerned. Hence, to avoid any possible issues of unlawful handling of data, information that ties the social media posts to the authorship of real people has been removed and not stored. This includes information such as name, username, social security number, email address, location information, etc. Instead of using real usernames as identifiers, the posts have been relabeled with anonymous identification numbers.

7 Implementation

This section describes the building blocks of the implementation, which is the basis of the experiment. The code is structured into the following Python scripts:

• evaluate.py - Contains test parameters and function calls to other scripts.
• pre_process.py - Loads data into memory and transforms it into feature matrices.
• build_model.py - Creates the structure of an LSTM network model.
• classify.py - Trains and tests accuracy of the model using pre-processed data.
• log_results.py - Stores history of measured results and generates graphs.

7.1 Evaluation

The evaluation script (evaluate.py) is a simple Python script containing all the parameters and logic necessary to perform the entire experiment and all its subtests. Given a parameter setup, the evaluation script runs as shown in Algorithm 2.


Input: Parameter configuration
Output: Graphs and test accuracies for each subtest

1   foreach subtest (author) do
2       Load data;
3       Build LSTM model;
4       foreach representation do
5           Generate representation;
6           Train (and validate) model;
7           Plot training history to graph;
8           Test model accuracy;
9           Save accuracy to file;
10      end
11  end

Algorithm 2: The overall logic of the evaluation script.
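In Python, the algorithm could look roughly as follows. The helpers load_data(), build_lstm_model(), generate_representation(), train_and_validate(), plot_history(), test_model(), and save_accuracy() are illustrative stand-ins for functions in pre_process.py, build_model.py, classify.py, and log_results.py, not names taken from the source.

```python
NUM_AUTHORS = 8          # author IDs 0-7, one subtest each
NUM_REPRESENTATIONS = 6  # 0 = bag-of-words ... 5 = character-level bigram

for author in range(NUM_AUTHORS):                # foreach subtest (author)
    texts, labels = load_data(author)            # load data
    model = build_lstm_model(num_features=5000)  # build LSTM model
    for rep in range(NUM_REPRESENTATIONS):       # foreach representation
        x = generate_representation(texts, rep)  # generate representation
        history = train_and_validate(model, x, labels)
        plot_history(history, author, rep)       # plot training history to graph
        accuracy = test_model(model, x, labels)  # test model accuracy
        save_accuracy(author, rep, accuracy)     # save accuracy to file
```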

7.1.1 Adjustable parameters

• author - Integer value used to select which author to train the classifier for. Available author IDs are 0-7.

• feature_representation - Integer value used to select which feature representation to use. The implemented representations currently available are: 0 (bag-of-words), 1 (bigram), 2 (trigram), 3 (tf-idf), 4 (combination of bigram and tf-idf), and 5 (character-level bigram).

• num_features - Integer value used to set a maximum limit on how many features to keep for each tweet. E.g., two sentences of 5 words each, where all words are different, would with a bag-of-words representation build a vocabulary of 10 words. If this parameter is set to 8, the representation will only consider the 8 most frequently used words.

• input_size - Integer value used to set the number of input nodes in the neural network. However, this parameter is hardcoded to be equal to num_features.

• num_hidden - Integer value used to set the number of hidden layers.

• hidden_size - Integer value used to set the size of each hidden layer (i.e., the number of memory blocks).

• num_epochs - Integer value used to set the number of epochs. It is recommended to set this to a high number if early stopping is used.

• batch_size - Integer value used to set the number of training examples (tweets) that are propagated per forward/backward pass. It is mainly used to reduce memory usage, as the total amount of data is divided into smaller batches.

• train_data_percentage - Float value used to set the percentage of all data to be used for training (0.0-1.0).

• validation_data_percentage - Float value used to set the percentage of all data to be used for validation during training (0.0-1.0).

• test_data_percentage - Float value used to set the percentage of all data to be used for testing. However, the test data percentage is hardcoded to the remaining share after the training and validation data has been selected.

• lstm_activation - String value used to set the activation function to be used in an LSTM layer. Available: 'sigmoid', 'tanh', 'relu', and 'softmax'.

• dense_activation - String value used to set the activation function to be used in a regular node layer. The available functions are the same as for LSTM.

• learning_rate - Float value used to set the speed of learning by affecting the updating of the network weights.

• lstm_dropout - Float value used to set the fraction of dropped values (0.0-1.0).

• stop_monitor - String value used to set the value to be monitored for early stopping. Available: 'val_acc' or 'val_loss'.

• stop_patience - Integer value used to set the number of epochs with no improvement before early stopping is executed.
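For reference, a full parameter setup could look like the following sketch. The values are illustrative, not configurations actually evaluated in the experiment; input_size and test_data_percentage are omitted since, as noted above, they are hardcoded.

```python
params = {
    "author": 0,                       # focus author for this subtest
    "feature_representation": 1,       # word-level bigram
    "num_features": 5000,              # keep the 5000 most frequent features
    "num_hidden": 1,                   # one hidden LSTM layer
    "hidden_size": 64,                 # memory blocks per hidden layer
    "num_epochs": 200,                 # high, since early stopping is used
    "batch_size": 32,
    "train_data_percentage": 0.6,
    "validation_data_percentage": 0.2,
    "lstm_activation": "tanh",
    "dense_activation": "sigmoid",
    "learning_rate": 0.001,
    "lstm_dropout": 0.2,
    "stop_monitor": "val_loss",
    "stop_patience": 5,
}
```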

7.2 Data pre-processing

When downloaded from the dataset source⁹, the tweets are originally stored as one comma-separated file per author, with the columns listed below. However, using the pre-processing script (pre_process.py), only the "Text" and "Author" fields are kept, and all usernames are replaced with the labels author 0-7.

• Row ID - Incrementing from 0 to n, where n is the number of tweets.
• Date - Date of posting.
• ID - Unique post ID.
• Link - URL to the tweet.
• Re-tweet - Boolean value (however, no re-tweets are included in the datasets).
• Text - The actual tweet text written by the author.
• Author - The username of the author.

The following subparagraphs describe the overall flow of the data pre-processing stage.

Loading data
Using the functions read_csv() and concat() from the Pandas library, the needed files are loaded and concatenated into a single DataFrame object.
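A minimal sketch of this step, assuming one CSV file per author under hypothetical paths:

```python
import pandas as pd

frames = []
for i in range(8):
    df = pd.read_csv(f"data/author_{i}.csv")  # hypothetical file path
    df = df[["Text", "Author"]]               # keep only the two needed fields
    df["Author"] = f"author_{i}"              # anonymous relabeling
    frames.append(df)
data = pd.concat(frames, ignore_index=True)   # single DataFrame of all tweets
```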

Shuffling
After loading the data, it is randomly shuffled using the random() and permutation() functions from NumPy, of course retaining the author labels so that each text keeps its original author. Shuffling is a necessary step when training neural networks (and other supervised learning models), as the division of data into training, validation, and test sets is done on a percentage basis. However, the randomness is kept predictable by using a constant value as seed, to allow for reproducibility of the experiment.
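A sketch of reproducible shuffling, continuing from the loading sketch above (the seed value is illustrative):

```python
import numpy as np

np.random.seed(42)                       # constant seed for reproducibility
perm = np.random.permutation(len(data))  # random row order
data = data.iloc[perm].reset_index(drop=True)  # texts and labels move together
```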

Converting into lists and re-labeling
When the data has been loaded into memory and shuffled, the text in the DataFrame is converted into a list of strings (which will later be converted into a feature matrix), while the labels (author 0-7) are converted into a single list of integers. The author in focus in a subtest is assigned the integer 1, representing the positive class, while the "non-focus" authors are assigned the integer 0, representing the negative class (as described in Subsection 5.2).

Generating feature representations
To generate a numerical feature representation, the list of strings is passed to the fit_transform() function of an instance of either the CountVectorizer or the TfidfVectorizer class from Scikit-learn's feature extraction library. The fit_transform() function learns the entire vocabulary of the text passed to it, and then transforms the data into a numerical feature matrix, whose contents vary depending on how the vectorizer object was configured. A description of how the objects were configured to generate the representations used in this thesis can be found in Subsection 7.2.1.


Dividing data into separate sets
The feature matrix is divided into separate training, validation, and testing sets. Given the chosen parameters for percentages, this is simply done by calculating how many of the total number of tweets each set should have, such as:

    train_portion = int(num_tweets * train_data_percentage)    (19)
    validation_portion = int(num_tweets * validation_data_percentage)    (20)
    test_portion = num_tweets - train_portion - validation_portion,    (21)

and then creating new feature and label sets with re-assigned indices, such as:

    x_train = tweet_features[: train_portion]    (22)
    x_val = tweet_features[train_portion : train_portion + validation_portion]    (23)
    x_test = tweet_features[train_portion + validation_portion : num_tweets],    (24)

where x represents the tweet features. The division of labels is done in exactly the same way.

7.2.1 Feature representation parameters

As described in Subsection 7.2, instances of the classes CountVectorizer and TfidfVectorizer from Scikit-learn are used to generate feature representations from text. The classes are fundamentally the same, although TfidfVectorizer normalizes the results by applying the equation described in Subsection 2.2.3. The following subparagraphs describe which parameters are common to all representations, and what makes each individual representation unique.

All representations

• max_features - Corresponds directly to the variable num_features, as described in Subsection 7.1, which is the limit on how many features to keep in each row (i.e., the number of columns in the feature matrix).

• token_pattern - A regular expression describing what a complete token is. In my implementation, the regular expression r"(?u)\b\w+\b" is used to declare one-character words, such as "I" and "a", as allowed tokens. Without this, they would not be included in the representation.

The bag-of-words and TF-IDF representations have no additional parameters. However, for TF-IDF, the vectorizer is, as mentioned, an instance of the TfidfVectorizer class instead of CountVectorizer.

Bigram
• ngram_range - As this parameter sets the bounds (min/max) of the n-gram range, it is set to (2,2) for bigrams.
• analyzer - Set to the string "word" to use a word-level bigram representation.

Trigram
• ngram_range - Set to (3,3).
• analyzer - Set to "word".

Bigram combined with TF-IDF
• ngram_range - Set to (2,2).
• analyzer - Set to "word".
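Putting these parameters together, the vectorizers could be configured as in the following sketch. Here tweet_texts is an assumed list of tweet strings, and the character-level bigram variant from Subsection 7.1.1 is included for completeness, configured with analyzer set to "char".

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Parameters common to the representations; max_features corresponds to
# num_features, and the token pattern admits one-character words.
common = dict(max_features=5000, token_pattern=r"(?u)\b\w+\b")

bag_of_words = CountVectorizer(**common)
bigram       = CountVectorizer(ngram_range=(2, 2), analyzer="word", **common)
trigram      = CountVectorizer(ngram_range=(3, 3), analyzer="word", **common)
tfidf        = TfidfVectorizer(**common)
bigram_tfidf = TfidfVectorizer(ngram_range=(2, 2), analyzer="word", **common)
char_bigram  = CountVectorizer(ngram_range=(2, 2), analyzer="char",
                               max_features=5000)  # token_pattern unused here

tweet_texts = ["a short example tweet", "another example tweet"]  # assumed input
x = bigram.fit_transform(tweet_texts)  # learns vocabulary, returns feature matrix
```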

