
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2019

Machine Learning Based Sentiment Classification of Text, with Application to Equity Research Reports

OSCAR BLOMKVIST

KTH ROYAL INSTITUTE OF TECHNOLOGY


Machine Learning Based Sentiment Classification of Text, with Application to Equity Research Reports

OSCAR BLOMKVIST

Degree Project in Mathematical Statistics (30 ECTS credits)

Degree Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, 2019

Supervisor at Skandinaviska Enskilda Banken: Andreas Johansson
Supervisor at KTH: Tatjana Pavlenko

Examiner at KTH: Tatjana Pavlenko


TRITA-SCI-GRU 2019:318
MAT-E 2019:75

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

In this thesis, we analyse the sentiment in equity research reports written by analysts at Skandinaviska Enskilda Banken (SEB). We provide a description of established statistical and machine learning methods for classifying the sentiment in text documents as positive or negative. Specifically, a form of recurrent neural network known as long short-term memory (LSTM) is of interest. We investigate two different labelling regimes for generating training data from the reports. Benchmark classification accuracies are obtained using logistic regression models. Finally, two different word embedding models and bidirectional LSTMs of varying network size are implemented and compared to the benchmark results. We find that the logistic regression works well for one of the labelling approaches, and that the best LSTM models outperform it slightly.

Keywords: equity research, NLP, sentiment classification, logistic regression, word2vec, LSTM


Sammanfattning

Machine learning based sentiment classification of text, with application to equity research reports

In this report, we analyse the sentiment, or attitude, in equity research reports written by analysts at Skandinaviska Enskilda Banken (SEB). Established statistical and machine learning methods for classifying the sentiment in text documents as either positive or negative are presented. We are particularly interested in a type of recurrent neural network known as long short-term memory (LSTM). Furthermore, we investigate two different schemes for labelling the training data generated from the reports. Benchmark classification accuracies are obtained using logistic regression. Finally, two different word representation models and bidirectional LSTMs of varying network size are implemented and compared with the benchmarks. We find that logistic regression performs well for one of the labelling schemes, and that the LSTM performs slightly better.

Keywords: equity research, NLP, sentiment classification, logistic regression, word2vec, LSTM


Acknowledgements

Firstly, I would like to express my gratitude to Andreas Johansson, Head of Quantitative Analysis at SEB Solutions, for offering the opportunity to conduct this thesis project at SEB, and for providing valuable advice and insight into the world of equity research.

I would also like to thank Jörgen Alamanos at SEB Equity Research, for providing the data and answering any and all questions about its content and format.

Furthermore, I would like to thank my supervisor Tatjana Pavlenko, Associate Professor at the Department of Mathematics at KTH, for the guidance and feedback during the work behind this thesis.

In addition, I would like to thank Olof Löfving, who worked on a similar project, for valuable and insightful discussions on the subject at hand.

Finally, I want to thank Lisa Wiklund, for putting up with me during the weeks when this thesis occupied nearly all my waking hours.


Contents

1 Introduction
  1.1 Background
    1.1.1 Sentiment Analysis
    1.1.2 Equity Research
  1.2 Objectives

2 Theory
  2.1 Notation and Terminology
  2.2 Classification Problems
  2.3 Logistic Regression
    2.3.1 Bag of Words
    2.3.2 Term Frequency-Inverse Document Frequency
  2.4 Artificial Neural Networks
    2.4.1 Loss Function
    2.4.2 Stochastic Gradient Descent
    2.4.3 Dropout
  2.5 Word Embedding
  2.6 Recurrent Neural Networks
    2.6.1 Long Short-Term Memory
  2.7 Performance evaluation

3 Application
  3.1 Preprocessing
    3.1.1 Data Collection
    3.1.2 Data Cleaning
    3.1.3 Feature Engineering
    3.1.4 Text Filtering
  3.2 Logistic Regression
  3.3 LSTM
    3.3.1 Word Embeddings
    3.3.2 Model Definitions

4 Results
  4.1 Logistic Regression Benchmarks
    4.1.1 Recommendation Data Set
  4.2 Word embeddings
  4.3 LSTM
    4.3.1 Prediction accuracy
    4.3.2 Training progress

5 Discussion
  5.1 Results Analysis
  5.2 Further Studies
  5.3 Conclusions

Appendices

A Training Progress Plots


List of Figures

2.1 Illustration of an artificial neuron with m inputs and activation function ϕ, from (Ahire, 2018) [15].
2.2 Illustration of a fully connected multi-layer feedforward neural network, adapted from (Ahire, 2018) [16].
2.3 Illustration of the Skip-gram model architecture, from (McCormick, 2016) [26].
2.4 The basic RNN architecture, in "loop form" (left) and unrolled as a chain (right), from (Olah, 2015) [29].
2.5 Detailed overview of a vanilla LSTM block, from (Greff et al., 2017) [32].
3.1 The distribution of document lengths in terms of number of words.
3.2 The distribution of the price ratio difference for the 32233 documents where it was defined.
4.1 Two examples of the training progress of the own 3 model.
4.2 Two examples of the training progress of the own 10 model.
4.3 Two examples of the training progress of the own 30 model.
4.4 Two examples of the training progress of the google 3 model.
4.5 Two examples of the training progress of the google 10 model.
4.6 Two examples of the training progress of the google 30 model.
A.1 The training progress of the own 3 model for the six cross-validation folds.
A.2 The training progress of the own 10 model for the six cross-validation folds.
A.3 The training progress of the own 30 model for the six cross-validation folds.
A.4 The training progress of the google 3 model for the six cross-validation folds.
A.5 The training progress of the google 10 model for the six cross-validation folds.
A.6 The training progress of the google 30 model for the six cross-validation folds.


List of Tables

3.1 Description of the data contained in table Report TB.
3.2 Description of the data contained in table TargetPrice TB.
4.1 The results from the logistic regression classifier on the recommendation data set, for different feature matrices and different sizes of n-grams. The highest cross-validated accuracy is presented in bold face.
4.2 The results from the logistic regression classifier on the target price data set, for different feature matrices and different sizes of n-grams. The highest cross-validated accuracy is presented in bold face.
4.3 Closest neighbours for our own word2vec model.
4.4 Closest neighbours for the Google word2vec model.
4.5 The prediction accuracies of all trained models for the six cross-validation folds, along with the cross-validated accuracy. The highest cross-validated accuracy, indicating the best model, is presented in bold face.


Chapter 1 Introduction

It has been estimated that approximately two and a half exabytes (2.5 × 10^18 bytes) of data are generated every day. This vast amount of data has allowed the market for machine learning to flourish; it is expected to reach $15.3 billion in 2019 [1]. The majority of this data is unstructured data, i.e. not stored in well-structured databases, but rather made up of text files, mobile data, media files etc. Not only does unstructured data make up the majority of all available data, but the ratio of unstructured to structured data is increasing by the year. This brings us to natural language processing (NLP).

NLP is a subfield of computer science and artificial intelligence which deals with utilising unstructured data, in the form of text, to extract information in an automated fashion. [2]

1.1 Background

1.1.1 Sentiment Analysis

One area of NLP which has gained a lot of interest in recent years is sentiment analysis, the automated process of understanding opinions expressed in a piece of text. Sentiment analysis can take different forms, but one common objective is to classify a text as either positive or negative. This is sometimes referred to as polarity classification [3]. Historically, sentiment analysis was often performed using lexicon-based approaches, where lexicons of words and their pre-defined polarity were used. The total polarity of a document could then be computed by comparing the number of positively and negatively annotated words in the text. In later years, however, statistical and machine learning methods have been far more popular. One common setting is to formulate the task as a statistical classification problem, and use documents with pre-defined polarity as training data. An example of a data set that has been frequently used to evaluate sentiment classification models is a collection of movie reviews from the Internet Movie Database, where each text is associated with a rating on a scale of 1 to 10. This rating can be used to label the texts as either positive or negative. [4]

Sentiment analysis has, among many other areas, been applied to finance. The sentiment in a company's annual reports has for example been shown to correlate with future performance of the company [2]. One area of finance where sentiment analysis studies are lacking is equity research: reports from independent analysts, which can be assumed to be more objective than the companies' own statements.

1.1.2 Equity Research

This thesis is conducted in collaboration with Skandinaviska Enskilda Banken (SEB). One of their business areas is equity research, which is a type of fundamental financial analysis performed by many banks and other financial institutions. Equity research analysts have in-depth knowledge about specific companies, and their analyses convey their view on the companies’ outlook.

More specifically, the analysts make informed predictions on key figures such as earnings and sales, based on financial reports and news regarding the company. These figures form the basis for a target price, at which the analyst believes the company’s stock should be trading. Based on the target price compared to the actual price of the stock, a recommendation is provided on whether to buy or sell the stock. The recommendations and the estimates behind them are presented in periodic reports where the analyst motivates their conclusions with (usually) a few pages of text. These recommendations, especially when updated from one level to another, have been shown to have value as an investment signal [5], as they influence the choices of investors and thus (indirectly) affect the market. If we could predict these recommendation updates based on a prior shift in the sentiment of the reports, this could potentially be used as the basis for a profitable investment strategy.

At SEB, there are over 65 equity research analysts, covering more than 300 companies in the Nordics. Currently, they provide recommendations on a three-degree scale: 1 (buy), 2 (hold), and 3 (sell). Historically, this recommendation scale has changed, at times comprising four degrees on the recommendation scale. However, studying the definitions of the different scales, it becomes clear that the current class 3 (sell) is equivalent to the union of former classes 3 and 4 (tentative sell and definite sell, respectively, if you will). [6, 7]

1.2 Objectives

The purpose of this thesis is to provide exploratory research on the sentiment of equity research reports, specifically those written by analysts at SEB.

Firstly, we want to establish whether there is positive and negative sentiment to be detected in the reports. Secondly, we want to investigate with what accuracy we can classify a given report as positive or negative using long short-term memory (LSTM), a form of artificial neural network. Another purpose is to provide a foundation on which future research can build, where the long-term goal is to find a relationship between changes in sentiment and future recommendation updates. If such a connection can be established, it could possibly be utilised for creating an investment strategy based on predicted changes to analysts' recommendations.


Chapter 2 Theory

In this chapter, we present the mathematical concepts and machine learning theory that motivate our subsequent choice of methods. We start in section 2.1 by introducing some mathematical notation that will be used throughout the thesis. We go on in sections 2.2-2.3 by describing the concept of classification problems, explaining how to evaluate a classifier's performance, and presenting a classical method for modelling a classifier. Finally, in sections 2.4-2.6, we discuss artificial neural networks and how they are trained, explain how they can be used to generate vector representations of words, and introduce the long short-term memory network, which will be the main model evaluated in this thesis work.

2.1 Notation and Terminology

There are several notational and terminological conventions that will be used throughout this thesis. These are summarised below as a reference for the reader.

• Vectors will be denoted by lower case boldface letters, e.g. x.

• Matrices will be denoted by upper case boldface letters, e.g. W .

• The transpose of vectors and matrices will be denoted by a raised ⊤, e.g. x⊤.

• Unless otherwise stated, all vectors will be assumed to be column vectors; row vectors will thus be represented as the transpose of a vector.


• The scalar product of two vectors x, y ∈ Rn will be denoted by x · y.

• Element-wise multiplication of two vectors of the same dimension will be denoted by ⊙, e.g. z = x ⊙ y.

• The Euclidean norm of a vector x = (x1, . . . , xm) will be denoted by ||x||, i.e. ||x|| = √(x1^2 + · · · + xm^2).

• The concatenation of two vectors x = (x1, . . . , xm) and y = (y1, . . . , ym) will be referred to as the concat operator, i.e. concat(x, y) = (x1, . . . , xm, y1, . . . , ym).

• Functions which take vector arguments will be denoted by boldface letters, e.g. f(x).

• Scalar functions (non-boldface letters) applied to vector arguments will be used to denote element-wise application of the function to all elements in the vector, i.e. g(x) = (g(x1), . . . , g(xm)), x ∈ Rm.

• When considered as an activation function in the neural network setting, the logistic function will be referred to simply as the sigmoid function, and will be denoted by a lower case sigma: σ(x) = 1 / (1 + e^(−x)).

• The vector of length d where the k-th element is 1 and all other elements are 0, will be referred to as the one-hot vector of k with size d.

2.2 Classification Problems

In statistics and machine learning, a classification problem is the task of creating a model that assigns a given observation to one of several predefined classes, based on some input features of the observation. It is a form of what is known in machine learning terminology as supervised learning, meaning that the parameters of the model are decided based on a set of training data consisting of observations of known classes. The class of a given training example is sometimes referred to as the label of that observation. As described in section 1.1.1, analysis of sentiment polarity can be formulated as a classification problem, where the label can be "positive" or "negative". A model that is built to solve a classification problem is known as a classifier. [8]

There are several metrics available for evaluating the performance of a classifier. Perhaps the most intuitive measure is prediction accuracy, i.e. the proportion of correctly classified observations on some data set. For accuracy to be a valid performance metric, however, it is important to be aware of the distribution of the labels in the data. Consider for example a small data set consisting of 9 positive documents and 1 negative document. A classifier that ignores all input features and "blindly" assigns every observation to the positive class would have a seemingly impressive accuracy of 90% on this data set, even though it really has no predictive power at all. On the other hand, a classifier which correctly detects the negative document but misclassifies two of the positive documents has a lower accuracy of 80%, but in reality provides more predictive information. This is an example of the accuracy paradox, where (somewhat unintuitively) the model with the lower accuracy is the better one. The accuracy paradox comes into play when the classes are unbalanced, as in the example above, and in such cases, metrics other than accuracy must be used when evaluating the model. If, however, the data is evenly distributed over all classes, we say that the classes are balanced, and prediction accuracy is a valid metric. [9]

2.3 Logistic Regression

Logistic regression is a regression model with a binary response variable that can belong to either of two classes. In other words, it is a model designed for solving binary classification problems. In logistic regression terminology, we often label the two classes as 1 and 0, which could represent success/failure, positive/negative or any other pair of classes. The model assumes that the i-th observation of the stochastic response variable, Yi, i = 1, . . . , n, is a Bernoulli distributed random variable, independent of all other observations, with probability distribution

P(Yi = 1) = πi,
P(Yi = 0) = 1 − πi,    (2.1)


where πi depends on the covariates xi ∈ Rm, i = 1, . . . , n. The expected value of yi is then πi, which in logistic regression is modelled by the logistic function:

E[yi] = πi = exp(xi · β) / (1 + exp(xi · β)) = 1 / (1 + exp(−xi · β)),    (2.2)

where the parameters β ∈ Rm are unknown. The goal of logistic regression is to estimate them in a way that maximises the likelihood function of the training data. Since each observation yi takes on the value 0 or 1, we can use equation 2.1 to rewrite the probability distribution of each observation as

fi(yi) = πi^yi (1 − πi)^(1−yi),   i = 1, 2, . . . , n.    (2.3)

The (assumed) independence of the observations gives us that the likelihood function of the training data set, y = (y1, y2, . . . , yn), is

L(y, β) = ∏_{i=1}^{n} fi(yi) = ∏_{i=1}^{n} πi^yi (1 − πi)^(1−yi) = ∏_{i=1}^{n} [πi / (1 − πi)]^yi (1 − πi).    (2.4)

The natural logarithm is a strictly increasing function, so maximising the likelihood is equivalent to maximising the log-likelihood, which is more con- venient since it converts products into sums. We get

ln L(y, β) = ln [ ∏_{i=1}^{n} [πi / (1 − πi)]^yi (1 − πi) ] = Σ_{i=1}^{n} yi ln[πi / (1 − πi)] + Σ_{i=1}^{n} ln(1 − πi).    (2.5)

It is easily derived from equation 2.2 that 1 − πi = [1 + exp(xi · β)]^(−1), and that ln[πi / (1 − πi)] = xi · β, and thus the log-likelihood can be written as

ln L(y, β) = Σ_{i=1}^{n} yi xi · β − Σ_{i=1}^{n} ln[1 + exp(xi · β)].    (2.6)


This log-likelihood can, for example, be maximised by differentiating it with respect to β and approximating the roots using the Newton-Raphson method. For more details on the optimisation algorithm and logistic regression in general, see (Montgomery et al., 2012) [10]. Built-in functions for fitting logistic regression models are available in a lot of computational software, for example in the Python module Scikit Learn [11]. Once we have estimates for the parameters, β̂, we can estimate the response variable for unseen data x as

ŷ = π̂ = exp(x · β̂) / (1 + exp(x · β̂)) = 1 / (1 + exp(−x · β̂)).    (2.7)

2.3.1 Bag of Words

Bag of words (BoW) is a simple model for representing documents as vectors of integers. The BoW representation of a document is simply a collection of the unique words in the document, along with a count of how many times each word appears in the document. In other words, the BoW model keeps track of the multiplicity of each word, but disregards grammar and word order entirely. For a corpus consisting of n documents D1, . . . , Dn, let |Di| denote the length of the i-th document. The BoW representation of the corpus can be generated using algorithm 2.1.

Algorithm 2.1: Bag of words

1 sort all unique words in the corpus alphabetically, and let the number of unique words be d;

2 for i ← 1 to n do

3 for j ← 1 to |Di| do

4 assign to the j-th word in Di an integer index Ii,j corresponding to its place in the sorted word list;

5 compute the one-hot vector of Ii,j with size d;

6 end

7 sum all one-hot vectors from Di, and denote it by Di;

8 end

9 define the BoW matrix as the matrix whose rows are the Di;


Thus, the entire corpus can be represented as an n × d matrix, where each row is the BoW representation of one document. We will refer to this matrix as the BoW matrix.

The BoW concept can be extended to a more general case of bag of n-grams (BoN). An n-gram is defined as a sequence of n consecutive words in a given text. Consider, for example, the phrase ”today is a good day”. The n-grams for n = 2 (also known as bigrams) of this phrase are made up of the set

{“today is”, “is a”, “a good”, “good day”} . (2.8) The BoN representation of a corpus can also be computed using algorithm 2.1, but considering all n-grams instead of all words. Thus, it keeps a lot of the simplicity of the BoW model, but adds some information on word order. It is obvious that BoW is the special case of BoN with n = 1. The BoN matrix is a feature matrix of the corpus and can thus be used as the covariate matrix in a logistic regression. In general, the number of unique n-grams in a corpus grows quickly with increasing n, which causes a high degree of sparsity in the BoN matrix. Because of this, it is rare to work with n-grams for any n > 5. [12]

2.3.2 Term Frequency-Inverse Document Frequency

One advantage of the BoN model is its simplicity; it is easily computed and reflects the term frequency of each unique term (word or n-gram) in each document in the corpus. One drawback of this simplicity is that it considers all terms as equally important, as there is no weighting scheme for the different terms. One popular weighting scheme for compensating for this is term frequency-inverse document frequency (tf-idf). It builds on the assumption that a given term in a specific document has more importance the more often it appears in the document, but less importance the more documents in the corpus it appears in. In other words, common words like "this" and "a", which often appear in most documents in a corpus, are considered less important than rarer words that only appear in a few of the documents. There are several different variants of tf-idf, with different definitions of term frequency and inverse document frequency, of which we will focus on one.

Consider a specific term t in a document d belonging to a corpus. We define the term frequency (tf) and the inverse document frequency (idf) as follows:

tf = (# of occurrences of t in d) / (total # of terms in d),    (2.9)

idf = log2( (total # of documents in the corpus) / (# of documents in the corpus where t appears) ).    (2.10)

The tf-idf of the given term in the specified document is then computed as tf · idf [13]. The tf-idf representation of a corpus can easily be computed from the BoN matrix. Let B be the n × d BoN matrix. The elements of the tf-idf matrix, which we will denote by T , are then given by

Ti,j = [ Bi,j / Σ_{j′=1}^{d} Bi,j′ ] · log2( n / Σ_{i′=1}^{n} 1(Bi′,j > 0) ),    (2.11)

where 1 is the indicator function,

1(x > x0) = 1 if x > x0,  and  0 if x ≤ x0.    (2.12)

The tf-idf matrix of a corpus is a matrix of the same shape as the BoN matrix which can also be used as the covariate matrix of a logistic regression.
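A small sketch of computing the tf-idf matrix T from a BoN count matrix B according to equation 2.11 follows. Note that library implementations such as Scikit Learn's TfidfVectorizer use slightly different variants (natural logarithm and smoothing), so this direct translation of the formula is for illustration only.

import numpy as np

def tfidf_matrix(B):
    """Compute the tf-idf matrix T from an n x d BoN count matrix B,
    following equations 2.9-2.11 (base-2 logarithm, no smoothing)."""
    B = np.asarray(B, dtype=float)
    tf = B / B.sum(axis=1, keepdims=True)      # term frequencies per document
    df = np.count_nonzero(B > 0, axis=0)       # document frequency per term
    idf = np.log2(B.shape[0] / df)             # inverse document frequencies
    return tf * idf                            # element-wise product

B = np.array([[2, 1, 0],
              [1, 0, 3]])
print(tfidf_matrix(B))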

2.4 Artificial Neural Networks

Artificial neural networks (ANN), commonly referred to simply as neural networks (NN), are a class of machine learning models designed to mimic the way the human brain learns. An NN consists of interconnected simple computing cells, referred to as neurons, that make up a more or less complex nonlinear network. The neurons themselves may be linear or non-linear.

Although the actual similarity to the human brain has been debated, there is at least principle resemblance in two key features of NNs [14]:

• The network acquires knowledge through a training process based on many examples; it learns.

• The acquired knowledge is stored using connections between the neu- rons, somewhat analogous to the brain’s synapses.


The most common type of neuron takes a number of inputs, each with a corresponding weight, computes the weighted sum of these, adds a bias term, and applies what is referred to as an activation function to the result. The inputs to the neuron can either be actual inputs to the model as a whole, or outputs from other neurons in the network. Similarly, the output of the activation function can be part of the model’s final output, or be passed on to one or more other network neurons. This is illustrated in figure 2.1.

The activation function is often, but not always, a non-linear function which squishes its input to a bounded range; common activation functions include the sigmoid function and the hyperbolic tangent. [14]

Figure 2.1: Illustration of an artificial neuron with m inputs and activation function ϕ, from (Ahire, 2018) [15].

The neuron depicted in figure 2.1 can be described mathematically as

y = ϕ(x · ω + b),    (2.13)

where x ∈ Rm is the vector of inputs, ω ∈ Rm is its corresponding vector of weights, b is its bias term, and ϕ is some scalar activation function.

A common NN architecture is the multi-layer feedforward network, illustrated in figure 2.2. This model structure consists of an input layer, one or more hidden layers, and an output layer. The hidden layers are so called since they are not “seen” from either the input or the output of the network. These hidden layers make the NN operate somewhat as a “black box”, where the end user doesn’t necessarily have any knowledge of the internal structure.

This property makes NNs rather difficult to interpret, but it is also this internal complexity which enables them to solve more intricate problems. If all neurons in a layer have connections to all neurons in the previous layer and in the next layer, we say that the layer is fully connected. Furthermore, if all layers in a network are fully connected, we say that the network is fully connected [14].

Figure 2.2: Illustration of a fully connected multi-layer feedforward neural network, adapted from (Ahire, 2018) [16].

The fully connected NN depicted in figure 2.2 can be described mathematically as

y⊤ = fo( f2( f1( x⊤ W1 + b1⊤ ) W2 + b2⊤ ) Wo + bo⊤ ),    (2.14)

where

x ∈ R4 is the input vector,    (2.15)
y ∈ R3 is the output vector,    (2.16)
W1 ∈ R4×5 is the weight matrix of hidden layer 1,    (2.17)
W2 ∈ R5×7 is the weight matrix of hidden layer 2,    (2.18)
Wo ∈ R7×3 is the weight matrix of the output layer,    (2.19)
b1 ∈ R5 is the bias vector of hidden layer 1 (not depicted),    (2.20)
b2 ∈ R7 is the bias vector of hidden layer 2 (not depicted),    (2.21)
bo ∈ R3 is the bias vector of the output layer (not depicted), and    (2.22)
f1, f2, fo are the activation functions of the corresponding layers.    (2.23)

We see that the resulting formula in equation 2.14 comprises nested applications of matrix multiplications, additions and element-wise activation functions. The weight matrices and the bias vectors are the parameters to be tuned during model training. The dimensions given in equations 2.15-2.23 are specific to the network depicted in figure 2.2 and can be modified in any way that permits the matrix operations in equation 2.14. The number of layers could of course also be modified, which would change the formula accordingly, still following the same nested structure.

For the output layer, the choice of activation function depends on the problem at hand. For a multi-class classification problem with K classes, we want the output to be a vector of length K, so we let the output layer consist of K neurons. Let z = (z1, . . . , zK) ∈ RK be the vector describing the state in the output layer before application of the output activation function. A convenient choice for the activation function is the softmax function,

softmax(z) = ( e^(z1) / v, e^(z2) / v, . . . , e^(zK) / v ),    (2.24)

where the normalisation factor v is defined by

v = Σ_{j=1}^{K} e^(zj).    (2.25)

Using softmax as the output activation function forces all outputs to be positive and their sum to equal 1. Paired with an appropriate loss function (further discussed in section 2.4.1), the output can be interpreted as the probability distribution over all K classes. [14]
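As a concrete illustration, the network in figure 2.2 could be written in Keras roughly as follows; the hyperbolic tangent activations in the hidden layers are an assumption, since the figure does not fix them.

from tensorflow import keras
from tensorflow.keras import layers

# A sketch of the fully connected network in figure 2.2: a 4-dimensional
# input, hidden layers of 5 and 7 neurons, and a 3-class softmax output.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(5, activation="tanh"),    # hidden layer 1 (W1, b1)
    layers.Dense(7, activation="tanh"),    # hidden layer 2 (W2, b2)
    layers.Dense(3, activation="softmax"), # output layer (Wo, bo)
])
model.summary()  # lists the weight and bias parameters per layer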

2.4.1 Loss Function

When training an NN, we first define the model architecture (type of network, number of neurons in each layer, etc.). The parameters we want to optimise are the weights of all neuron connections in the network. To do this we need a differentiable loss function; a scalar metric quantifying the loss suffered when our model predicts the outcome ŷ while the true label is y. Let f denote the function describing a neural network, and let Θ denote the set of trainable parameters of the model. For the example network in section 2.4, we would have Θ = {W1, W2, Wo, b1, b2, bo}, and f would be given by the expression in equation 2.14. We then write the loss function as L(ŷ, y). For a labelled training set (x1:n, y1:n), we define the total loss with respect to this data as the average of the loss function over all training examples:

L(Θ) = (1/n) Σ_{i=1}^{n} L(f(xi; Θ), yi).    (2.26)

The goal of the training algorithm is to estimate the parameters such that they minimise the total loss over the training set:

Θ̂ = arg min_Θ L(Θ) = arg min_Θ (1/n) Σ_{i=1}^{n} L(f(xi; Θ), yi).    (2.27)

For a multi-class classification problem, the standard choice of loss function is categorical cross-entropy. Given the true label y ∈ RK and the predicted output vector ŷ ∈ RK, categorical cross-entropy is defined as

CE = − Σ_{i=1}^{K} yi ln(ŷi).    (2.28)

We see that the cross-entropy penalises low predicted values for correct classes. For classification problems where each training example has only one correct class assignment, the true label y of an observation is the one-hot vector of the correct class. Thus, it singles out the element of ŷ that corresponds to the correct class. It is readily shown that, in this case, the cross-entropy reduces to

CE = − ln(y · ŷ),    (2.29)

which we use as the loss function:

L(ŷ, y) = − ln(y · ŷ).    (2.30)

This loss function, paired with a softmax activation function in the output layer, allows us to interpret the output vector as the probability distribution over all available classes. [12]
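A worked example of equation 2.29: for a one-hot label, the loss reduces to the negative log of the probability assigned to the correct class. A minimal sketch with illustrative numbers:

import numpy as np

def cross_entropy(y_true, y_pred):
    """Categorical cross-entropy for a one-hot label (equation 2.29)."""
    return -np.log(np.dot(y_true, y_pred))

y_true = np.array([0.0, 1.0, 0.0])  # one-hot label: class 2 is correct
y_pred = np.array([0.1, 0.7, 0.2])  # softmax output of the network

print(cross_entropy(y_true, y_pred))  # -ln(0.7) ≈ 0.357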

2.4.2 Stochastic Gradient Descent

Stochastic gradient descent (SGD), or some variant of it, is the most common algorithm for optimising the parameters of an NN according to equation 2.27. In its standard form, the SGD algorithm samples one training example at a time, computes the gradient (with respect to the parameters) of the loss function for this example, and updates the parameters according to the direction of steepest loss function descent. This process is iterated until some stopping criterion is met. This procedure, however, in each step uses the loss function of only one training example to approximate the total loss, which we want to minimise. This rough estimation yields a lot of noise which hampers the gradient calculations. Therefore, a variant known as minibatch stochastic gradient descent (mSGD) is often used. Given some initial values of the parameters Θ and using the same notation as in section 2.4.1, mSGD is described in algorithm 2.2. The initial values of Θ are typically set by drawing each element independently from a uniform distribution on the interval [−limit, limit], where limit is a small real number, usually around the order of magnitude of 0.1. [12]

Thus, the mSGD algorithm uses B training examples, rather than just one, to estimate the total loss before updating the parameters. We refer to each iteration in the while loop as an epoch. In each epoch, the algorithm passes through each example in the training data exactly once, but in a different order each time and with a different batch split. The learning rate parameter ηt governs the size of the parameter update step in the t-th epoch, and can be constant throughout the procedure or decay as a function of t.


Algorithm 2.2: Minibatch stochastic gradient descent

1 while stopping criterion not met do
2   randomly split the data into subsets (batches), each of size (approximately) B;
3   set counter t ← 1;
4   foreach batch (x1:B, y1:B) do
5     ĝ ← 0;
6     for i ← 1 to B do
7       compute the loss L(f(xi; Θ), yi);
8       ĝ ← ĝ + gradient of (1/B) L(f(xi; Θ), yi) w.r.t. Θ;
9     end
10    Θ ← Θ − ηt ĝ;
11    increase counter t ← t + 1;
12  end
13 end

The loss function is generally non-convex, and we are not guaranteed to find the global minimum, but mSGD has been shown to perform well in practice. [12]

The stopping criterion referenced in algorithm 2.2 can take different forms. Some of the most common variants include: a predefined number of epochs being reached; the loss function or prediction accuracy reaching some predefined value; or the loss function or prediction accuracy fulfilling some convergence condition. Setting a fixed number of epochs beforehand requires some knowledge about how quickly the model converges; too few epochs and the model will not be able to fit the data properly, and too many epochs can lead to a waste of computational time on a model that has already converged. Due to the stochastic nature of the optimisation, the appropriate number of epochs might also differ from time to time on the same model. Setting a fixed value for one of the performance metrics to reach is also treacherous, as it requires quite a lot of test runs of the model training to learn appropriate thresholds; too lax a threshold will generate a subpar model, while too strict a threshold might cause the training to run indefinitely. A convergence condition on one of the performance metrics is a more flexible approach, which allows training to stop once the model has stopped improving, regardless of the number of epochs it takes or the value of the metric at that point. As the optimisation can temporarily get stuck in local minima, it is common to introduce a "patience" of a (small) number of epochs, say Spat. If the monitored performance metric hasn't improved in Spat consecutive epochs, training is stopped and the parameters from the best epoch so far are retained.


For SGD and mSGD, the value or decay function of the learning rate needs to be decided on before training the model, and the choices made can greatly influence the effectiveness of the algorithm. Several more advanced algorithms, with the same basic idea as SGD, have been developed which use an adaptive learning rate. One such optimisation algorithm is Adam, introduced in 2014 by Kingma and Ba. Adam uses individual learning rates for each trainable parameter, and these rates are adaptive over time based on moving averages of the gradient and its square. This changes the learning rate parameter to be a vector ηt at each time t, whose length is the number of trainable parameters, and modifies the parameter update step in algorithm 2.2 to: Θ ← Θ − ηt ⊙ ĝ. There are some input parameters to the Adam optimiser that influence how the step size is calculated, but the authors recommend the default settings for most applications. Adam has shown state-of-the-art performance as a machine learning optimiser [18]. The SGD, mSGD and Adam algorithms are all available in the NN application programming interface (API) Keras [19].
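A sketch of how this training setup might look in Keras, combining the Adam optimiser (with its recommended default settings) and a convergence-based stopping criterion with patience; the model and training arrays, as well as the batch size, patience and epoch limit, are assumptions for illustration.

from tensorflow import keras

# `model`, `x_train` and `y_train` are assumed to be defined elsewhere.
model.compile(optimizer=keras.optimizers.Adam(),  # default settings
              loss="categorical_crossentropy",
              metrics=["accuracy"])

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",  # monitored performance metric
    patience=5)          # S_pat: stop after 5 epochs without improvement

model.fit(x_train, y_train,
          batch_size=32,           # minibatch size B
          epochs=100,              # upper bound on the number of epochs
          validation_split=0.1,    # held-out data for the stopping criterion
          callbacks=[early_stopping])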

The key to the parameter optimisation, regardless of which algorithm is used, is computing the gradient of the loss function with respect to the parameters, as it tells us how to update the parameters in order to decrease the loss and thereby improve our model. In most NN applications, this is achieved by an algorithm known as back-propagation, first introduced by Rumelhart et al. in 1986 [20]. In short, the algorithm starts at the output layer and propagates backwards through the network, at each step computing the gradient with respect to the parameters in the layer using repeated application of the chain rule for derivatives. For a detailed explanation of the back-propagation algorithm, see for example (Nielsen, 2015) [21]. Back-propagation is implemented in the TensorFlow API [22], which can, for example, be used as a back-end for the Keras API in Python.

2.4.3 Dropout

The complex structure and large number of parameters in NNs make them flexible enough to learn difficult tasks. However, this flexibility also makes them prone to overfitting, meaning that they can yield near enough perfect performance on the training data, but generalise poorly to unseen data. One approach for combatting overfitting in NNs is to use dropout training. Dropout training consists of (temporarily) setting a given proportion of the model weights to zero when computing the loss and its gradient. A different set of weights will be "dropped" for each training example, but the proportion stays constant throughout the training. Dropout prevents single weights from having too much influence over the model, and has been shown to greatly improve generalisation of the model. [12]
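In Keras, dropout is available as a layer that randomly zeroes a fixed proportion of the previous layer's outputs during training; a minimal sketch (the 20% rate and layer sizes are illustrative):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(5, activation="tanh"),
    layers.Dropout(0.2),  # zero 20% of the previous layer's outputs per example
    layers.Dense(3, activation="softmax"),
])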

2.5 Word Embedding

Word embedding is a collective term for a number of methods for generating vector representations of words. Each word in a corpus is mapped to a continuous vector space of some predefined dimension d. The purpose of using word embeddings, rather than one-hot encodings of each word, is mainly two-fold: it reduces the dimension of the word representations from the number of unique words in the corpus to the chosen dimensionality d; and it generates similar vectors for similar words, instead of treating different words as completely separate classes.

One example of a word embedding method is word2vec, which has gained a lot of popularity over the last few years. It was developed in 2013 by Mikolov et al. at Google [23], and has shown state-of-the-art performance [24]. The word2vec method builds on the assumption that words that frequently appear in similar contexts have similar syntactic and semantic meaning. The word2vec method is not one specific algorithm, but rather a collection of closely related algorithms using slightly different approaches to achieve the same goal. One of these algorithms is known as the Skip-gram, which is formulated as a classification problem with the objective of predicting the words surrounding a given word in a document. Given a corpus of text to base the word representations on, containing W unique words as a sequence w1, . . . , wT of T total words, the Skip-gram aims to maximise the following average log probability:

(1/T) Σ_{i=1}^{T} Σ_{j∈C(wi)} ln P(wi+j | wi),    (2.31)

where the context C(wi) is made up of surrounding words, i.e. C(wi) = {±1, ±2, . . . , ±c(wi)}, and the size c(wi) of the context varies for different words wi.

We assign an integer index w ∈ {1, . . . , W} to each unique word in the corpus, and represent each word with the one-hot vector of its index. The set of input-output training pairs (the target word and its context word) for the classification problem is then defined by algorithm 2.3, where R is a predefined upper bound on the context size.

Algorithm 2.3: Generation of training examples for the Skip-gram

1 initiate an empty training data input set: TI ← {};
2 initiate an empty training data output set: TO ← {};
3 for i ← 1 to T do
4   randomly draw an integer k from the uniform distribution over 1, . . . , R;
5   define the context: Ci ← {i ± 1, i ± 2, . . . , i ± k};
6   foreach element c in Ci do
7     append the input set: TI ← TI ∪ {wi};
8     append the output set: TO ← TO ∪ {wc};
9   end
10 end
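A minimal sketch of algorithm 2.3 in Python, generating (target, context) training pairs with a dynamic context size; it works directly on word strings rather than one-hot vectors for readability.

import random

def skipgram_pairs(words, R):
    """Generate (input, output) training pairs for the Skip-gram model,
    following algorithm 2.3 with maximum context size R."""
    pairs = []
    for i, target in enumerate(words):
        k = random.randint(1, R)  # dynamic context size, uniform over 1..R
        for j in range(i - k, i + k + 1):
            if j != i and 0 <= j < len(words):
                pairs.append((target, words[j]))
    return pairs

print(skipgram_pairs("today is a good day".split(), R=2))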

Thus, each word in the corpus will appear several times as the input to the model training, with its context words as the respective outputs. This dynamic context size weights closer words as more important, since more training examples will result from the closest context words compared to the farther context words. The one-hot vectors of the training inputs and outputs are then passed to a fully connected NN with one hidden layer and softmax output activation, as depicted in figure 2.3. As discussed previously, this allows us to interpret the output as the probability of the context word given the input, and we can optimise the parameters of the network to maximise the average log probability in equation 2.31. [23, 25]

The weight matrix of the hidden layer has dimensions W × M, where M is the predefined number of neurons in the network. Now, a key insight into the word2vec model is that since the inputs are all one-hot vectors, the hidden layer singles out one row of the weight matrix for each unique word. Thus, when the training of the model is complete, the corresponding weight matrix row of each unique word can be interpreted as an M-dimensional vector representation of the word. It is these vector representations that form the final output of the word2vec Skip-gram model. There are some subtleties in the Skip-gram algorithm for speeding up the training which are left out here. For more details, see (Mikolov et al., 2013) [23, 25].

Once the vector representations of the words in the corpus are generated, we want to be able to quantify the similarity between two words. For word embeddings, this is usually done with cosine similarity, which is simply the cosine of the angle between the words' vector representations.


Figure 2.3: Illustration of the Skip-gram model architecture, from (McCormick, 2016) [26].

Given two vectors w1 and w2, we can compute the similarity as

similarity(w1, w2) = cos(θ) = (w1 · w2) / (||w1|| ||w2||),    (2.32)

where θ is the angle between the two vectors. Thus, the similarity focuses solely on the orientation of the vectors and ignores any magnitude difference, and is bound to the interval [−1, 1]. [23]

A Skip-gram word2vec model developed by Mikolov et al., trained on a corpus of news articles totalling approximately 100 billion words, has been made available for download by Google. It contains 300-dimensional vector representations for 3 million unique terms, using a maximum window size of R = 10 [27]. Furthermore, the Gensim API in Python provides functionality for training word2vec models on your own corpus. [28]
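A sketch of training a Skip-gram model with Gensim and querying word similarities (equation 2.32); the toy corpus and hyperparameter values are illustrative, and the keyword names follow recent Gensim versions.

from gensim.models import Word2Vec

sentences = [["equity", "research", "report"],
             ["analyst", "report", "sentiment"]]

model = Word2Vec(sentences,
                 vector_size=100,  # embedding dimension d
                 window=10,        # maximum context size R
                 sg=1,             # 1 = Skip-gram (0 = CBOW)
                 min_count=1)

# Cosine similarity between two words, and the closest neighbours
# of a given word under that similarity measure.
print(model.wv.similarity("report", "analyst"))
print(model.wv.most_similar("report"))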


Figure 2.4: The basic RNN architecture, in “loop form” (left) and unrolled as a chain (right), from (Olah, 2015) [29].

2.6 Recurrent Neural Networks

Recurrent neural networks (RNN) are a class of NNs designed to deal with sequential input data. Put simply, RNNs are NNs which contain loops, where a neuron passes its output as input to itself along with the input for the next step. We often refer to the step in the sequence of inputs as the time, even in cases where the sequence is actually spatial. The idea behind RNNs is that for input data that is inherently sequential, such as the words in a document or the phonemes that make up a spoken phrase, we want to utilise that structure and not treat all inputs independently. It is reasonable to assume that, at each point in time, the state of the sequence depends on all previous inputs as well as the current one. See figure 2.4 for an illustration of the RNN architecture. [12]

As shown in figure 2.4, we can unroll the loop element and view the RNN as a (very) deep non-looped network, where each layer takes a new input as well as the output from the previous layer. Since the layers in the unrolled network representation are really several passes through the same RNN unit, the parameters associated with them are the same in each layer. In theory, RNNs should be able to learn arbitrarily long dependencies in time. In practice, however, this is not the case. Due to the multiplicative nature of the chain rule for derivatives, and the large number of time steps often involved, RNNs suffer from the vanishing gradient problem. This means that for time lags of more than a few steps, the gradients go to zero, and model training takes an infeasibly long time or learning stops altogether. Thus, the standard RNN model isn't able to learn very long-range dependencies at all. [30]


2.6.1 Long Short-Term Memory

Long short-term memory (LSTM) is an RNN architecture that was first introduced by Hochreiter and Schmidhuber in 1997 [31]. LSTM was developed with the ambition of augmenting the original RNN structure in a way that avoids the vanishing gradient problem. It does this by adding "gates", which at several stages in the LSTM cell dictate what information to let through to the current state. Since its inception in 1997, several improving elements have been introduced by various contributors, and different variants of the architecture incorporate different sets of these elements. The most common variant, which we will refer to as "vanilla LSTM", includes an input gate, a forget gate, an output gate, and peephole connections. It can be shown that this gated structure yields gradients that do not vanish, and the model is able to learn arbitrarily long-range dependencies. The vanilla LSTM block is illustrated in figure 2.5. [32]

Figure 2.5: Detailed overview of a vanilla LSTM block, from (Greff et al., 2017) [32].

Consider a document consisting of a sequence of T words. The input to the LSTM network at "time" t ∈ {1, . . . , T} will be the vector representation of the t-th word in the document. Let us denote this input vector by xt. Furthermore, let M be the length of xt, i.e. the dimension of the word embedding vectors, and N the number of LSTM blocks in the layer. With this setup, we have the following trainable parameters:

• Wz, Wi, Wf, Wo ∈ RN×M (input weights)
• Rz, Ri, Rf, Ro ∈ RN×N (recurrent weights)
• pi, pf, po ∈ RN (peephole weights)
• bz, bi, bf, bo ∈ RN (bias weights)    (2.33)

The different states in the LSTM block depicted in figure 2.5 can be described mathematically as

zt = g(Wz xt + Rz yt−1 + bz)    (block input)    (2.34)
it = σ(Wi xt + Ri yt−1 + pi ⊙ ct−1 + bi)    (input gate)    (2.35)
ft = σ(Wf xt + Rf yt−1 + pf ⊙ ct−1 + bf)    (forget gate)    (2.36)
ct = zt ⊙ it + ct−1 ⊙ ft    (cell state)    (2.37)
ot = σ(Wo xt + Ro yt−1 + po ⊙ ct + bo)    (output gate)    (2.38)
yt = h(ct) ⊙ ot    (block output)    (2.39)

where y0 = c0 = 0, and the activation functions g and h are the hyperbolic tangent function [32]. All activation functions (g, h and σ) are applied element-wise to their respective vectors. Greff et al. presented a comprehensive evaluation of several variants of the LSTM architecture in 2017. It was concluded that the vanilla LSTM outperformed most of them, but that excluding the peepholes did not affect performance negatively. Excluding the peepholes reduces the total number of parameters and thus speeds up model training. Excluding peepholes from the model simply equates to setting pi, pf, and po to zero in equations 2.34-2.39. [32]
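To make equations 2.34-2.39 concrete, the following sketch implements a single forward step of a vanilla LSTM block in NumPy, with randomly initialised weights of toy dimensions.

import numpy as np

def lstm_step(x_t, y_prev, c_prev, W, R, p, b):
    """One forward step of a vanilla LSTM block (equations 2.34-2.39).
    W, R, p, b are dicts holding the weights for keys 'z', 'i', 'f', 'o'."""
    sigma = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = np.tanh(W['z'] @ x_t + R['z'] @ y_prev + b['z'])                  # block input
    i = sigma(W['i'] @ x_t + R['i'] @ y_prev + p['i'] * c_prev + b['i'])  # input gate
    f = sigma(W['f'] @ x_t + R['f'] @ y_prev + p['f'] * c_prev + b['f'])  # forget gate
    c = z * i + c_prev * f                                                # cell state
    o = sigma(W['o'] @ x_t + R['o'] @ y_prev + p['o'] * c + b['o'])       # output gate
    y = np.tanh(c) * o                                                    # block output
    return y, c

# Toy dimensions: M = 3 (embedding size), N = 2 (LSTM blocks).
M, N = 3, 2
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(N, M)) for k in 'zifo'}
R = {k: rng.normal(size=(N, N)) for k in 'zifo'}
p = {k: rng.normal(size=N) for k in 'ifo'}
b = {k: np.zeros(N) for k in 'zifo'}

y, c = lstm_step(rng.normal(size=M), np.zeros(N), np.zeros(N), W, R, p, b)
print(y, c)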

Depending on the application, we can use the output from every time step, or only the output from the last time step. For the task of classifying the document, we are only interested in the output after processing the entire word sequence, so we can pass the N-dimensional output vector from the last time step to a fully connected layer of the desired dimension (the number of classes) with softmax activation. [32] When classifying an entire sequence, it can be beneficial to study the sequence in both directions. We achieve this by letting an additional LSTM layer ingest the input data in reverse order, in parallel with the standard LSTM layer. The outputs from the two LSTM layers at time T are then concatenated to form a 2N-dimensional vector before being passed to any additional layers. This variant is known as bidirectional LSTM (BLSTM). [33]

Let yT and ỹT be the N-dimensional output vectors for a document from two LSTM networks according to equations 2.34-2.39, reading the document forwards and backwards respectively. Furthermore, we introduce a fully connected output layer where

Wout ∈ R2N×K is its weight matrix, and bout ∈ RK is its bias vector,    (2.40)

and K is the number of possible classes. Then, the output of the entire BLSTM classifier with softmax activation is

y = softmax( concat(yT, ỹT)⊤ Wout + bout⊤ ).    (2.41)

Adding up the parameters introduced in equations 2.33 and 2.40 (excluding the peephole connections), we see that this BLSTM classifier has

2(4NM + 4N² + 4N) + 2NK + K    (2.42)

unknown parameters to optimise with gradient descent, where the factor 2 in the first term stems from the two LSTMs reading the text in opposite order.

Before training a BLSTM model, some hyperparameters need to be decided on. These include the learning rate decay scheme (or adaptive learning rate scheme), the number of LSTM units N, the dropout proportion, and the stopping criterion. For the dropout in LSTMs, an analysis by (Zaremba et al., 2015) [34] recommended a proportion of 20%.
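A sketch of how such a BLSTM classifier might be defined in Keras; the sequence length T, embedding dimension M, number of LSTM blocks N, number of classes K and the placement of the 20% dropout are illustrative assumptions. Keras's LSTM layer omits peephole connections, so the parameter count reported by model.summary() follows equation 2.42.

from tensorflow import keras
from tensorflow.keras import layers

T, M, N, K = 400, 300, 10, 2  # illustrative dimensions

model = keras.Sequential([
    keras.Input(shape=(T, M)),                           # sequence of word vectors
    layers.Bidirectional(layers.LSTM(N, dropout=0.2)),   # outputs a 2N-vector
    layers.Dense(K, activation="softmax"),               # fully connected output layer
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()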

2.7 Performance evaluation

When fitting a machine learning model, we fit the model using some set of training data. This means that the parameters estimated by the model are especially tuned to the particular training data, and the model's performance on this data set does not fairly reflect the performance it would have on new, unseen data. A common method for making a better assessment of the expected performance is the test set approach. Prior to fitting our model, we randomly split our data into two sets, the training set and the test set (not necessarily of the same size). The model is fitted using only the training set, and the performance metric of interest is computed for the test set, thereby avoiding the bias of evaluating the model on the same data it was fitted on. [35]

The test set approach has two main drawbacks. Firstly, which observations end up in which set at the random split will affect the model and the test set metric. Secondly, only a subset of the total amount of data is used to train the model, which could lead to an underestimation of the model’s performance.

Both these issues become more prominent the fewer data points are in the data set to begin with. An approach for combatting these issues is to use k-fold cross-validation. The k-fold cross-validation procedure is described in algorithm 2.4.

Algorithm 2.4: k-fold cross-validation

1 randomly split the data set into k subsets S1, . . . , Sk of (approximately) equal size;
2 for i ← 1 to k do
3   fit the model on all data except for Si;
4   compute the performance metric of interest τi(Si);
5 end
6 compute the average performance metric τ̄ = (1/k) Σ_{i=1}^{k} τi;

The cross-validated performance metric τ̄ is a better estimate of the actual performance of the model. Common choices for the performance metric include the prediction accuracy discussed in section 2.2 and the value of the loss function discussed in section 2.4.1. The main drawback of cross-validation is the computational cost, which increases with the value of k, since we have to fit the model k times. This can be especially problematic for complex models such as NNs, which often limits us to a fairly small value of k. [35]
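A sketch of algorithm 2.4 using Scikit Learn, with k = 6 folds (the value used for the models in this thesis) and prediction accuracy as the performance metric; the feature matrix X and label vector y are assumed to be NumPy arrays defined elsewhere.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

kf = KFold(n_splits=6, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in kf.split(X):
    # Fit on all data except the held-out fold, then evaluate on it.
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))  # tau_i

print(np.mean(accuracies))  # cross-validated accuracy, tau bar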


Chapter 3 Application

In this chapter, the procedure of the thesis work, using methods introduced in chapter 2, is described. Section 3.1 describes the data handling: collection, cleaning, feature extraction, and further filtering, which was done in the statistical software R [36]. In section 3.2, we present our benchmarking method, which was implemented using the Scikit Learn API in Python [11].

In section 3.3 we explain how neural networks were used on the classification problem, implemented using the Keras API [19] in Python, with TensorFlow [22] as the back-end.

3.1 Preprocessing

3.1.1 Data Collection

All data was collected from SEB's internal databases, and could be divided into two main categories: equity research data and price data. The equity research data was extracted using simple SQL queries and consisted of two data tables: Report TB and TargetPrice TB. These data tables are described in tables 3.1-3.2. The price data was extracted using ClariFI, a financial research platform provided by S&P Global [37]. This data contained historical daily prices in USD for all stocks represented in TargetPrice TB and historical daily exchange rates from all currencies represented in TargetPrice TB to USD.

In addition to the equity research data and the price data, the reports themselves were also available in PDF files.


Table 3.1: Description of the data contained in table Report TB.

Table Name: Report TB

  PublishedDate: The date the report was released.
  CompanyId: An identifier for which company the report regarded.
  RecommendationId: The recommendation ID for the company at the time of the report.
  FrontPageBullets: The text on the first page of the report, containing a summary of the entire report.
  DocumentType: The report category (equity research, credit research, sector analysis etc.).

Table 3.2: Description of the data contained in table TargetPrice TB.

Table Name: TargetPrice TB

  Date: The date the target price in question was set.
  CompanyId: An identifier for which company the target price regarded.
  TargetPrice: The target price for the main stock of the company.
  ISIN: The ISIN code identifier for the main stock of the company.
  Currency: The currency of the reported target price.


However, the full reports varied widely in length, ranging from 1 to 20 pages. They would also require large amounts of wrangling to filter out table contents, page headers and other parts that were not part of the actual text. Thus, the front page summaries were used as the report texts, as they were more consistent in length and should contain the key points of the report.

3.1.2 Data Cleaning

Upon inspection of the Report TB data, it was discovered that some of the front page summaries did not correspond to textual reports, but were notes from conference calls. Others consisted of HTML tags of unclear origin. These two types of unwanted "texts" had structures that made them easy to identify programmatically, and all reports containing them were removed.

Additionally, reports were removed if there was no rating, if the text field was empty, or if the document type was not company specific equity research.

Furthermore, the lengths of the documents were studied. It was concluded that the vast majority of the documents contained at most 400 words. See figure 3.1 for the full distribution. To achieve a narrower range of document lengths, the few documents containing more than 400 words were excluded from the data set.

Figure 3.1: The distribution of document lengths in terms of number of words.


As described in section 1.1.2, the recommendation scale at SEB has changed a few times historically; specifically, the number of different recommendation levels has varied between 3 and 4. To acquire a comparable scale for all reports, and since the current definition of level 3 covers the former definitions of both levels 3 and 4, all level 4 recommendations in the data were redefined as level 3 recommendations. This left us with 67079 reports rated 1 (buy), 2 (hold), or 3 (sell).

3.1.3 Feature Engineering

To train supervised machine learning models on the sentiment of the reports, we needed to define which reports to classify as positive and negative. One plausible indicator to use for the sentiment was the recommendation itself. However, upon manual inspection of some of the texts, it was discovered that, while buy recommendations were in general perceived as more positive than sell recommendations, far more distinct sentiment was perceived in reports where the recommendation was up- or downgraded from the previous report for the same company. Thus, it was decided, for our main approach, to define positive and negative reports as reports with upgraded and downgraded recommendations respectively.

As this information was not explicitly available in Report TB, some data engineering was required. We defined a new column in Report TB which could take the values "Upgrade", "Downgrade", and "No Update". Filling this column with the appropriate values was achieved by grouping the Report TB data by company, sorting it by date, and comparing each recommendation with the one from the previous report for the same company. The first report for each company was set to "No Update". Additionally, reports where the previous report for the same company was released more than one year earlier were also set to "No Update", since some random samples showed that this was usually because coverage of that company had been suspended for some time. This procedure left us with 2665 upgraded reports, 2776 downgraded reports, and 61638 reports with unchanged recommendation.
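The thesis performed this step in R; for concreteness, an equivalent sketch in Python/pandas is shown below. The DataFrame `reports` and the exact column handling are assumptions based on table 3.1, and PublishedDate is assumed to be a datetime column.

import pandas as pd

reports = reports.sort_values(["CompanyId", "PublishedDate"])
grouped = reports.groupby("CompanyId")
prev_rec = grouped["RecommendationId"].shift(1)
gap = reports["PublishedDate"] - grouped["PublishedDate"].shift(1)

# A lower RecommendationId means a better rating (1 = buy, 3 = sell).
reports["Update"] = "No Update"
reports.loc[reports["RecommendationId"] < prev_rec, "Update"] = "Upgrade"
reports.loc[reports["RecommendationId"] > prev_rec, "Update"] = "Downgrade"

# First report per company (prev_rec is NA), and gaps of more than one
# year, revert to "No Update".
reports.loc[prev_rec.isna() | (gap > pd.Timedelta(days=365)), "Update"] = "No Update"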

As is obvious from the previous paragraph, recommendation updates are rare. There was clearly a trade-off between the distinctness of the sentiment and the size of the data set. Partly to combat this, and partly to try and pick up on more nuanced sentiment indicators, a second approach was considered. This approach focused on the ratio between the target price and the actual stock price, which as described in section 1.1.2 forms the basis for the recommendation. We will henceforth refer to this ratio as the price ratio.


As with the main approach, we focused on changes to the price ratio rather than the price ratio itself. A small price ratio change might just as well result from fluctuations in the stock price as from a revised outlook for the company, so we needed a threshold for what was considered a large enough price ratio change. To ensure a large data set, and since there was no obvious threshold for what constituted a large price ratio change, we defined positive and negative reports as the reports with the 33% largest price ratio increases and decreases respectively. This decision was also influenced by the fact that the recommendation upgrades and downgrades were so well balanced, so we felt comfortable assigning an equal number of positive and negative labels.

To extract this data, we started by defining a new column in Report TB, containing the target price at the time of the report, i.e. the latest updated target price for the company in question prior to the report date. This target price column was then converted to USD using the currency identifier from TargetPrice TB and the exchange rates from the price data. A stock price column was also defined, containing the stock price at the beginning of the day of the report. Then the target price to stock price ratio was computed. At this stage, a number of noticeable outliers were detected, where the target price was several times larger or smaller than the stock price. Some random samples indicated that this was due to incorrect currencies reported in TargetPrice TB. To avoid these outliers, all price ratios outside the range [0.5, 2] were set to NA. The reports were grouped by company, sorted by date, and the difference from the previous price ratio was calculated for each applicable report. With the same reasoning as for the first approach, the price ratio differences for reports after time gaps of more than one year were set to NA. Finally, the 33rd and the 67th percentiles of the price ratio differences were computed, and the top and bottom thirds were labelled as positive and negative reports respectively. In total, a price ratio difference was computed for 32233 reports, and by construction this yielded 10637 positive and 10637 negative reports. The distribution of the price ratio difference for all 32233 reports is presented in figure 3.2; it appears to be approximately normally distributed around zero.
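A sketch of the percentile-based labelling, again in pandas (the thesis itself used R); `diff` is assumed to hold the price ratio differences, with NA where undefined.

import pandas as pd

# `diff` is assumed to be a pandas Series of price ratio differences.
lower = diff.quantile(1 / 3)  # 33rd percentile
upper = diff.quantile(2 / 3)  # 67th percentile

labels = pd.Series(pd.NA, index=diff.index)
labels[diff <= lower] = "negative"  # bottom third: largest decreases
labels[diff >= upper] = "positive"  # top third: largest increases
# The middle third, and undefined differences, stay unlabelled and are
# excluded from the target price data set.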

To extract this data, we started by defining a new column in Report TB, containing the target price at the time of the report, i.e. the latest updated target price for the company in question prior to the report date. This target price column was then converted to USD using the currency identifier from TargetPrice TB and the exchange rates from the price data. A stock price column was also defined containing the stock price at the beginning of the day of the report. Then the target price to stock price ratio was computed. At this stage a number of noticeable outliers were detected, where the target price was several times larger or smaller than the stock price. Some random samples indicated that this was due to incorrect currencies reported in TargetPrice TB. To avoid these outliers, all price ratios outside the range [0.5, 2] were set to N A. The reports were grouped by company, sorted by date, and the difference from the previous price ratio was calculated for each applicable report. With the same reasoning as for the first approach, the price ratio difference for reports after time gaps of more than one year were set to N A. Finally, the 33rd and the 67th percentiles of the price ratio differences were computed, and the top and bottom thirds were labelled as positive and negative reports respectively. In total, a price ratio difference was computed for 32233 reports, and by construction this yielded 10637 positive and 10637 negative reports. The distribution of the price ratio difference for all 32233 reports is presented in 3.2. We see that it appears to be normally distributed around zero.

To summarise, we were left with a corpus containing 67079 documents. From these documents, two data sets were created, which we will refer to as the recommendation data set and the target price data set. The recommen- dation data set contained 5441 documents, and the target price data set contained 21274 documents. All documents had a label of either positive or negative.

In practice, all reports remaining after the data cleaning were stored in the
