
APPLICATIONS OF DEEP LEARNING IN TEXT CLASSIFICATION FOR HIGHLY MULTICLASS DATA

Submitted by Adam Grünwald

A thesis submitted to the Department of Statistics in partial fulfillment of the requirements for the Master's degree in Statistics in the Faculty of Social Sciences

Supervisor Rauf Ahmad

Spring, 2019


ABSTRACT

Text classification using deep learning is rarely applied to tasks with more than ten target classes. This thesis investigates whether deep learning can be successfully applied to a task with over 1000 target classes. A pretrained Long Short-Term Memory language model is fine-tuned and used as a base for the classifier. After five days of training, the deep learning model achieves 80.5% accuracy on a publicly available dataset, a 9.3% relative improvement over Naive Bayes. With five guesses, the model predicts the correct class 92.2% of the time.


Contents

1 Introduction
2 Related Work
  2.1 Machine Learning
  2.2 Deep learning
3 Data
4 Method
  4.1 Neural Networks
  4.2 LSTM Recurrent Neural Network
  4.3 Training a Neural Network
  4.4 Optimization and Regularization
    4.4.1 Dropout
    4.4.2 Batch Normalization
    4.4.3 Weight decay
  4.5 ULMFiT
  4.6 Evaluation metric
5 Experiments
  5.1 Benchmark
  5.2 ULMFiT implementation
6 Results
7 Conclusion


1 Introduction

Labelling different types of text documents is important and useful in many situations. Automating such tasks is therefore of great value, provided the automated system performs on a par with, or better than, humans. Labelling essays with grades is an example of a task that is time consuming but important, which is why a lot of research has been put into Automated Essay Scoring (for research on Swedish essays, see Östling 2013). Removing posts from social media platforms which violate the terms of use or are illegal (e.g. hate speech or threats of physical violence) is another case where automation, if done right, would be beneficial.

Automated labelling of texts can also be useful in other cases. Classifying e-mails as spam, classifying reviews as either positive or negative and assigning topics to Wikipedia articles (see Zhang, Zhao, and Lecun 2016) are further examples of useful applications. When assigning topics to Wikipedia articles, the number of target classes is larger. The literature distinguishes between the case where the target classes are binary (e.g. "Spam"/"Not Spam", "Positive"/"Negative") and the case where there are several possible target classes (e.g. different topics), the latter problem being the more complex one.

A different text classification task is when a document can be labelled with several labels. This is referred to as a multi-label classification task. If the label space is very large, the task becomes an extreme multi-label text classification task (XMTC). An example of such a problem is the challenge proposed by BioASQ in 2013 (BioASQ 2013), where the objective was to assign several Medical Subject Headings, also known as MeSH (Wikipedia 2019a), to new documents from PubMed, a large database of medical articles.

Text classification tasks can be summarized as four different types: one out of two classes, one out of multiple classes, several labels out of a limited number of labels and several labels out of extremely many labels (see Gargiulo, Silvestri, and Ciampi 2018 for a similar summary).

To our knowledge, no previous paper has dealt with more than about 50 classes in a deep learning framework. The data used in this thesis has over 1000 target classes, which makes this type of classification problem uncharted territory.

The objective of this thesis is to investigate the performance of a neural network transfer learning technique, known as ULMFiT (Howard and Ruder 2018), on a task similar to the second type: determining which one of multiple classes a text belongs to. This can be thought of as a fifth category of text classification task: a highly multiclass classification task.

The next section gives an overview of related work. Then, the data and method are introduced. A benchmark classifier is constructed for comparison purposes. Lastly, an implementation of ULMFiT on a highly multiclass classification task is presented along with results and a conclusion.

2 Related Work

2.1 Machine Learning

In text classification, a simple approach is to consider the text as a bag-of-words. In this approach, a sentence or a document is an observation and the variables are all the words which occur across the observations. The value of each variable is the number of times the word occurs in that particular observation. It is common to put an upper bound on how large the vocabulary can be, in which case the least common (and often also the most common) words are omitted. It is also common to use bigrams, two-word sequences in a text. The sentence I love you has the unigrams I, love and you and the bigrams I love and love you.

In bag-of-words representations, raw counts of words are usually not the best option. It is often better to use TF-IDF (Term Frequency-Inverse Document Frequency) instead. In simple terms, TF-IDF weights a term by how frequently it occurs in a given document and by how rare it is across all documents in the data. These features can then be used to train classifiers such as Naive Bayes, which will be used as a benchmark for comparison in this thesis.

Other popular representations include embedding words into vectors, done in word2vec (Mikolov et al. 2013) and fastText (Joulin et al. 2016). These embeddings are used to capture similarities between words and can be used to train a classifier that achieves good performance in a very short time.

2.2 Deep learning

A popular way to approach different tasks in NLP is to use a Long Short-Term Memory Recurrent Neural Network (LSTM RNN) (Hochreiter and Schmidhuber 1997). The strength of the LSTM is that it can capture information in any part of the document. It also allows the model to account for the specific order of words, as shown in a paper by Sutskever et al. (2014).

Another interesting approach is to use a Convolutional Neural Network (CNN) on a character level (see Zhang and Lecun 2016 and Zhang, Zhao, and Lecun 2016). Since CNNs are the current state-of-the-art in image recognition, it has been suggested and shown that they can be successful in various NLP tasks. Conneau et al. (2017) extended the idea of character-level CNNs by using up to 29 convolutional layers, with promising results. CNNs have also been used for the multi-label problem by Liu et al. (2017) and by Kim (2014). A combination of RNN and CNN has been tried and shown to work well for text classification with few classes (Lai et al. 2015).

The idea of using inductive transfer learning in NLP was introduced by Dai and Le (2015) and later improved by Howard and Ruder (2018). They use a pre-trained language model from Merity, Keskar, and Socher (2017) and show that it can be fine-tuned with small amounts of data to perform well on a range of different tasks. This is the technique that will be used in this thesis, since it has been shown to be very successful on other text classification tasks.

3 Data

Training and validation data The data used for experiments is publicly available on Kaggle, an online community for data scientists, and a link to the data can be found in the references (Kaggle 2018). It consists of forum posts made on Reddit, "a social news aggregation, web content rating and discussion website" (Wikipedia 2019b). The purpose of such posts is usually to start a discussion. Fig. 1 shows an example of such a post which was posted in the subreddit Movies.


Figure 1: Self-post from the subreddit /r/Movies

The data has 1013 classes and 1000 posts per class resulting in over a million observations.

All observations are labelled with their respective class automatically when a user decides to make their post in a certain subreddit (in Fig. 1, the user has decided to make the post in the subreddit /r/Movies, and the post is thus labelled with the class Movies). There are many more than 1013 classes on Reddit, but the creators of the dataset have tried to clean the classes such that the overlap between them is as small as possible. The creators also mention that they believe the highest possible accuracy on this dataset is around 96%, because some texts do not contain any useful information at all. Fig. 2 shows a small subset of the data. We concatenate the title and selftext and use the result to predict the subreddit.


Figure 2: A subsample illustrating the structure of the data

Test data The creators of the Kaggle dataset kindly provided us with their code for downloading and cleaning the data. We used this to download some data of our own in order to test the model's performance on new data. However, the new test data contain some classes with only one observation, whereas in the validation data no class has fewer than 70 observations. This might affect the performance. See table 1 for the differences in observations per class between the validation and test data.

Data         Minimum   25th percentile   50th percentile   75th percentile   Maximum
Test               1                83               141               212       260
Validation        70                94               100               106       136

Table 1: Minimum values, maximum values and percentiles of observations per class for the test and validation data.

4 Method

4.1 Neural Networks

A simplified structure of the two-layer, feed-forward network can be seen in Fig. 3. It takes $x = [x_1 \; x_2 \; \dots \; x_p]^T$ as input vector and produces $z = [z_1 \; z_2 \; \dots \; z_K]^T$ as output vector, where p is the number of variables and K the number of classes. In this network, the input x is transformed into z through one layer of hidden units and activation functions.

Figure 3: A feed-forward neural network with three input variables, four hidden units and two output variables. The intercept is represented by $x_0 = h_0^{(1)} = 1$. The arrows between the nodes are the weights; there are also activation functions between the layers, but they are not visible in this simplified illustration.

In the following equations, the Rectified Linear Unit (ReLU) (Nair and Hinton 2010) activation function is used. It is defined as $\sigma(x) = \max(0, x)$ and introduces non-linearity into the network. The hidden units are then defined as

$$h_i = \sigma\big(\beta_{i0}^{(1)} + \beta_{i1}^{(1)} x_1 + \beta_{i2}^{(1)} x_2 + \cdots + \beta_{ip}^{(1)} x_p\big), \quad i = 1, 2, \dots, M, \qquad (1)$$

which is a linear function of the input variables put through the ReLU activation function. M is the number of hidden units in this layer. The hidden units can also be written in matrix notation as

$$h = \sigma\big(b^{(1)} + \beta^{(1)} x\big) \qquad (2)$$

where $h = [h_1 \; h_2 \; \dots \; h_M]^T$ is the vector of hidden units and

$$\beta^{(1)} = \begin{bmatrix} \beta_{11}^{(1)} & \beta_{12}^{(1)} & \dots & \beta_{1p}^{(1)} \\ \beta_{21}^{(1)} & \beta_{22}^{(1)} & \dots & \beta_{2p}^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{M1}^{(1)} & \beta_{M2}^{(1)} & \dots & \beta_{Mp}^{(1)} \end{bmatrix}, \qquad b^{(1)} = \begin{bmatrix} \beta_{10}^{(1)} \\ \beta_{20}^{(1)} \\ \vdots \\ \beta_{M0}^{(1)} \end{bmatrix}$$

are the weight matrix and the vector of intercepts used to transform the input into hidden units. The superscript indicates that this weight matrix and intercept vector correspond to the first layer of the network.

In the two-layer example, the output is expressed as a function of the hidden units as

$$z_i = \beta_{i0}^{(2)} + \beta_{i1}^{(2)} h_1 + \beta_{i2}^{(2)} h_2 + \cdots + \beta_{iM}^{(2)} h_M, \quad i = 1, 2, \dots, K. \qquad (3)$$

In matrix notation,

$$z = b^{(2)} + \beta^{(2)} h \qquad (4)$$

where

$$\beta^{(2)} = \begin{bmatrix} \beta_{11}^{(2)} & \beta_{12}^{(2)} & \dots & \beta_{1M}^{(2)} \\ \beta_{21}^{(2)} & \beta_{22}^{(2)} & \dots & \beta_{2M}^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{K1}^{(2)} & \beta_{K2}^{(2)} & \dots & \beta_{KM}^{(2)} \end{bmatrix}, \qquad b^{(2)} = \begin{bmatrix} \beta_{10}^{(2)} \\ \beta_{20}^{(2)} \\ \vdots \\ \beta_{K0}^{(2)} \end{bmatrix}$$

and M and K are the number of hidden units and output classes respectively.

To extend this two-layer network into a deep neural network with L layers, it can be represented in matrix notation as

$$h^{(1)} = \sigma\big(b^{(1)} + \beta^{(1)} x\big)$$
$$h^{(2)} = \sigma\big(b^{(2)} + \beta^{(2)} h^{(1)}\big)$$
$$\vdots$$
$$h^{(L-1)} = \sigma\big(b^{(L-1)} + \beta^{(L-1)} h^{(L-2)}\big)$$
$$z = b^{(L)} + \beta^{(L)} h^{(L-1)} \qquad (5)$$

Note that the ReLU is not used when generating z; instead, if the network is trained for classification, z is put through a softmax activation to convert the values into probabilities. The softmax is defined as

$$\mathrm{softmax}(z) = \frac{1}{\sum_{j=1}^{K} e^{z_j}} \, [e^{z_1} \; e^{z_2} \; \dots \; e^{z_K}]^T \qquad (6)$$

The softmax function will assign values close to one for the largest $z_i$ and close to zero for all others, unless the largest and second largest are very close.
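To make the notation concrete, here is a minimal NumPy sketch of the forward pass of the two-layer network described by Eqns. (2), (4) and (6); the dimensions, random weights and input values are purely illustrative.

```python
import numpy as np

def relu(x):
    # ReLU activation, sigma(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def softmax(z):
    # Softmax as in Eqn. (6); subtracting the maximum improves numerical stability
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, b1, B1, b2, B2):
    """Forward pass of the two-layer network: Eqn. (2), Eqn. (4), then the softmax of Eqn. (6)."""
    h = relu(b1 + B1 @ x)   # hidden units
    z = b2 + B2 @ h         # output scores
    return softmax(z)       # class probabilities

# Toy example with p = 3 inputs, M = 4 hidden units and K = 2 classes
rng = np.random.default_rng(0)
b1, B1 = rng.normal(size=4), rng.normal(size=(4, 3))
b2, B2 = rng.normal(size=2), rng.normal(size=(2, 4))
print(forward(np.array([1.0, -0.5, 2.0]), b1, B1, b2, B2))  # two probabilities summing to one
```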

4.2 LSTM Recurrent Neural Network

Another type of neural network, suited for sequential data, is the Recurrent Neural Network (RNN). An RNN takes an input vector and a vector of hidden states and produces a new vector of hidden states and an output. Let $x_t = [x_1 \; x_2 \; \dots \; x_t]^T$ be the ordered sequence of input words and let $h_t = [h_1 \; h_2 \; \dots \; h_t]^T$ and $h_{t-1} = [h_1 \; h_2 \; \dots \; h_{t-1}]^T$ be the hidden states corresponding to the inputs. Also let $\beta_{hx}$, $\beta_{hh}$ and $\beta_{oh}$ be weight matrices and let $o_t$ be the output. Then the state of the network at time t is

$$h_t = \sigma\big(\beta_{hx} x_t + \beta_{hh} h_{t-1} + b_h\big), \qquad o_t = \mathrm{softmax}\big(\beta_{oh} h_t + b_o\big) \qquad (7)$$

where $\sigma(x)$ is usually the tanh or ReLU nonlinearity and $b_h$, $b_o$ are the respective bias terms.

The LSTM (briefly discussed in section 2.2) was introduced to improve the RNN's ability to remember more than the last parts of a document (Hochreiter and Schmidhuber 1997). The LSTM cell takes as input the feature vector, $x_t$, the previous cell state vector (defined in Eqn. (8)), $c_{t-1}$, and the previous hidden state vector, $h_{t-1}$. This input then flows through three different "gates", generally referred to as the forget gate, input gate and output gate. They consist of different activation functions deciding what to forget from previous states, what to use as input and what to output. The LSTM cell then outputs a new hidden state, $h_t$, and a new cell state, $c_t$, to be used in the next cell. The flow through the LSTM cell can be seen in Fig. 4 and is defined in equation form as

$$f_t = \sigma\big(\beta_{fx} x_t + \beta_{fh} h_{t-1} + b_f\big)$$
$$i_t = \sigma\big(\beta_{ix} x_t + \beta_{ih} h_{t-1} + b_i\big)$$
$$o_t = \sigma\big(\beta_{ox} x_t + \beta_{oh} h_{t-1} + b_o\big)$$
$$\tilde{c}_t = \tanh\big(\beta_{cx} x_t + \beta_{ch} h_{t-1} + b_c\big)$$
$$c_t = f_t \times c_{t-1} + i_t \times \tilde{c}_t$$
$$h_t = o_t \times \tanh(c_t) \qquad (8)$$


where $\times$ denotes the Hadamard product, the gate function is the logistic function $\sigma(x) = \frac{1}{1 + e^{-x}}$, and $f_t$, $i_t$, $o_t$ are the outputs of the forget gate, input gate and output gate respectively. The different $\beta$ matrices contain the weights for the gates and the new candidate state, and the bias/intercept terms are represented by the different $b$ vectors. The candidate state, $\tilde{c}_t$, is combined with the previous cell state to form the new cell state, $c_t$. Lastly, the new hidden state, $h_t$, is formed as a function of the output gate $o_t$ and the cell state.
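As an illustration of Eqn. (8), the NumPy sketch below performs one step through an LSTM cell; the dictionary keys, dimensions and random weights are illustrative choices and not part of any library API.

```python
import numpy as np

def sigmoid(x):
    # The logistic gate function, sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eqn. (8). W holds the beta matrices, b the bias vectors."""
    f = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + b["f"])        # forget gate
    i = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + b["i"])        # input gate
    o = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + b["o"])        # output gate
    c_tilde = np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev + b["c"])  # candidate state
    c = f * c_prev + i * c_tilde   # new cell state (Hadamard products)
    h = o * np.tanh(c)             # new hidden state
    return h, c

# Toy dimensions: input size p = 3, hidden size d = 2
rng = np.random.default_rng(1)
p, d = 3, 2
W = {k: rng.normal(size=(d, p)) for k in ("fx", "ix", "ox", "cx")}
W.update({k: rng.normal(size=(d, d)) for k in ("fh", "ih", "oh", "ch")})
b = {k: np.zeros(d) for k in ("f", "i", "o", "c")}
h, c = lstm_cell(rng.normal(size=p), np.zeros(d), np.zeros(d), W, b)
print(h, c)
```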

Figure 4: Visualization of the LSTM cell structure, with inputs $x_t$, $h_{t-1}$ and $c_{t-1}$ and outputs $h_t$ and $c_t$. σ represents the logistic function, × is the Hadamard product and + is regular addition of the incoming terms.

The LSTM cells (shown in Fig. 4) are the core part of a language model. A language model first uses a layer that learns the relations between different words, known as an embedding layer. Then some layers of LSTM cells are applied (three in our case), followed by a linear layer that takes the hidden state from the LSTM as input and propagates it forward through a softmax activation (see Eqn. (6)) in order to predict the next word in a sequence of words. The prediction is defined as

$$\hat{y}_t = \mathrm{softmax}\big(V h_t + b_y\big) \qquad (9)$$

where $V$ is a weight matrix (with dimensions vocabulary size × number of hidden states) and $b_y$ is the vector of bias/intercept terms.


4.3 Training a Neural Network

When training a machine learning model one wants to find the parameter values which minimize a loss function. Given the model

$$y = X\beta + \epsilon \qquad (10)$$

a closed-form solution that minimizes the MSE loss function can be calculated directly as

$$\hat{\beta} = (X^T X)^{-1} X^T y \qquad (11)$$

If the number of parameters is extremely large, one can instead use an algorithm called Gradient Descent. The gradient descent method, when applied to linear regression, tries to minimize an appropriate loss function, e.g. the MSE. To make the partial derivatives look nicer we define a slightly modified MSE as

$$L(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p) = \frac{1}{2n} \sum_{i=1}^{n} \big(\hat{y}^{(i)} - y^{(i)}\big)^2 \qquad (12)$$

Gradient descent then takes the partial derivatives in each iteration with respect to all $\hat{\beta}_j$ as

$$\frac{\partial L(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p)}{\partial \hat{\beta}_j} = \frac{1}{n} \sum_{i=1}^{n} \big(\hat{y}^{(i)} - y^{(i)}\big) x_j^{(i)}, \quad j = 1, 2, \dots, p. \qquad (13)$$

It then updates all parameters simultaneously with learning rate $\gamma > 0$ such that

$$\hat{\beta}_j^{(new)} = \hat{\beta}_j^{(old)} - \gamma \frac{1}{n} \sum_{i=1}^{n} \big(\hat{y}^{(i)} - y^{(i)}\big) x_j^{(i)}, \quad j = 1, 2, \dots, p. \qquad (14)$$

This update scheme is repeated until the difference in the loss function between iterations is sufficiently small. Then we can say that the loss function, which in this case is convex, has converged to its global minimum (see Goodfellow, Bengio, and Courville 2016 for gradient-based optimization).
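A minimal sketch of this procedure for linear regression is given below; the learning rate, tolerance and simulated data are illustrative choices rather than values used in the thesis.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iter=1000, tol=1e-8):
    """Minimize the modified MSE in Eqn. (12) with the updates in Eqn. (14).

    X is assumed to already contain a leading column of ones for the intercept.
    """
    n, p = X.shape
    beta = np.zeros(p)
    prev_loss = np.inf
    for _ in range(n_iter):
        residual = X @ beta - y                 # (y_hat - y) for every observation
        grad = (X.T @ residual) / n             # Eqn. (13) for all j at once
        beta -= lr * grad                       # simultaneous update, Eqn. (14)
        loss = (residual @ residual) / (2 * n)  # Eqn. (12)
        if abs(prev_loss - loss) < tol:         # stop once the loss has converged
            break
        prev_loss = loss
    return beta

# Small illustration: recover beta = [1, 2, -3] from simulated data
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=200)
print(gradient_descent(X, y, lr=0.1))
```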

Deep neural networks often have millions of parameters and sometimes billions. The gradient can therefore not be calculated for all parameters and observations every time. Luckily, gradients between subsets of the data are often similar (Bottou 2018). Therefore, it is possible to split the dataset into mini-batches and then calculate the gradient on each mini-batch. The size of the mini-batches is determined by the memory of the computer's GPU and is often somewhere between 32 and 256 (Goodfellow, Bengio, and Courville 2016). The calculation of gradients is done with back-propagation (Rumelhart, Hinton, and Williams 1986).

Optimizing the parameter values by calculating gradients on mini-batches is known as Stochastic Gradient Descent (SGD). The parameters are updated in a similar manner to Eqn. (14), but n is replaced by the mini-batch size. When applying SGD with an adaptive learning rate scheme and adaptive momentum (Hinton 1977), the optimization algorithm must be able to handle this. The Adam optimizer (Kingma and Ba 2014) is a common choice under these circumstances since it is fast and can handle the adaptive learning rate and momentum scheme. Therefore, the Adam optimizer will be used in this thesis.

When training a neural network for classification, the MSE loss function, described in Eqn. (12), is replaced by another loss function known as the cross-entropy loss. It is defined as

$$L(x_i, y_i, \theta) = -\sum_{k=1}^{K} y_{ik} \log\big(p(k \mid x_i; \theta)\big) = -y_i^T \log\big(\mathrm{softmax}(z_i)\big) \qquad (15)$$

with $x_i$ being the predictors of observation $i$, $y_i$ a one-hot encoded vector where the correct label of observation $i$ is coded as 1 and the rest are 0, and $\theta$ the current parameters of the model. The loss reduces to the negative logarithm of the probability assigned to the correct class by the softmax function. Thus, it penalizes the model for assigning high probabilities to incorrect classes. Correct guesses, especially when assigned probabilities close to 1, yield a low loss. The task of optimizing the classifier can then be described in equation form as

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, \theta) \qquad (16)$$
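The small sketch below evaluates Eqn. (15) for a single observation and shows how the loss behaves for correct and incorrect predictions; the scores are made up for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y_onehot):
    """Cross-entropy loss of Eqn. (15) for one observation with scores z and one-hot label."""
    return -float(y_onehot @ np.log(softmax(z)))

z = np.array([4.0, 1.0, 0.5])                       # the model strongly favours class 0
print(cross_entropy(z, np.array([1.0, 0.0, 0.0])))  # low loss: high probability on the correct class
print(cross_entropy(z, np.array([0.0, 0.0, 1.0])))  # high loss: the correct class got a low probability
```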

4.4 Optimization and Regularization

4.4.1 Dropout

A common way to prevent overfitting of a neural network is to implement dropout (Srivastava et al. 2014). With dropout, each time the gradient is calculated, each unit and its connections have a probability of being excluded from that particular calculation and weight update. This has the effect that units in the network will not co-adapt to any greater degree, meaning that a unit cannot rely exclusively on the input of any other unit, since there is a chance that this unit will not be present during training. At test time, all units are included and weighted based on their probability of inclusion during training.


This method of preventing overfitting works well in practice. It is recommended that the dropout probability be high if the amount of training data is small and vice versa. It makes intuitive sense that it is easier for a model to memorize a small training dataset and thus overfit, which makes the need for regularization greater. In the implementation of ULMFiT in this thesis, different dropout probabilities will be used for different layers in the model. The relative sizes of the dropout probabilities are difficult to motivate theoretically.

4.4.2 Batch Normalization

When training a neural network with some kind of gradient descent algorithm, one faces a problem referred to as internal covariate shift. It is defined as a change in the distribution of the activations of a layer in the network due to changes in the parameters of earlier layers. A change in distribution during training slows the training significantly, since it increases the risk that the optimizer gets stuck due to vanishing gradients. Batch Normalization (Ioffe and Szegedy 2015) remedies this problem by normalizing activations, while still allowing the normalized values to take on the same values as the original ones if this would be the optimal solution. Two layers of Batch Normalization, one before each linear layer, are used in the classifier part of the model in this thesis.

4.4.3 Weight decay

Weight decay is another way to reduce overfitting. It is commonly done in the form of L2 regularization (Ridge regression), which adds a penalty for big weights to the cost function of the network. Smith (2018) shows that the weight decay should be chosen larger for smaller learning rates and vice versa. He also suggests trying weight decay values of anything between $10^{-2}$ and $10^{-6}$, depending on the dataset size and how other regularization techniques are implemented.
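As a sketch of how these three regularization techniques are typically combined in practice, the PyTorch snippet below builds a small classifier head with batch normalization and dropout and passes a weight decay value to the Adam optimizer; the layer sizes and probabilities are illustrative stand-ins, not the exact values used later in the thesis.

```python
import torch.nn as nn
import torch.optim as optim

# A small classifier head combining the techniques of section 4.4: batch normalization
# before each linear layer, dropout, and weight decay handled by the optimizer.
head = nn.Sequential(
    nn.BatchNorm1d(1200),
    nn.Dropout(p=0.05),
    nn.Linear(1200, 50),
    nn.ReLU(),
    nn.BatchNorm1d(50),
    nn.Dropout(p=0.1),
    nn.Linear(50, 1013),   # one output unit per class
)

# Weight decay (L2 regularization) is a parameter of the optimizer
optimizer = optim.Adam(head.parameters(), lr=1e-3, betas=(0.9, 0.99), weight_decay=0.01)

head.train()  # during training, dropout randomly excludes units
head.eval()   # at test time dropout is turned off and all units are used
```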

4.5 ULMFiT

Universal Language Model Fine-Tuning (ULMFiT) is a transfer learning technique in the Natural Language Processing domain introduced by Howard and Ruder (2018). Training a classifier with this technique involves three different steps. First, a language model is trained on a preferably very large corpus of documents; the more variety in the language of this corpus, the better. Training the language model on only medical documents, for example, would result in a model which understands medical terms very well but does not generalize as well to other domains. The current state-of-the-art language model seems to be GPT-2, presented by OpenAI in a very recent paper (Radford et al. 2019). They have not released their pre-trained model to the public, so the AWD-LSTM (Merity, Keskar, and Socher 2017) is used instead in our implementation of ULMFiT.

The second step is to fine-tune the language model to the corpus which is specific to the task. This is done using discriminative fine-tuning and slanted triangular learning rates (STLR). Discriminative fine-tuning means that different learning rates are used for different layers when updating the weights of the model. The reasoning behind this kind of fine-tuning is that the first layers have been found to contain more general information and the last layers more task-specific information (Yosinski et al. 2014).

STLR builds on the idea of the triangular learning rate schedule proposed by Smith (2017). The learning rate is triangular when it linearly increases and then decreases between a minimum and a maximum value, cyclically over a certain number of training iterations called a cycle length. The motivation for such a learning rate structure is to escape saddle points (where the gradient is close to zero but far from the global minimum) in the loss function more rapidly and also to speed up training (Smith 2017). The STLR which Howard and Ruder (2018) propose is a slightly modified version of Smith's triangular learning rate: the increase to the maximum value happens in fewer iterations and the decreasing period is longer. They suggest that this works better in practice.

Momentum (Hinton 1977) is another way to speed up training and quickly escape saddle points. Smith (2018) shows that the learning rate and momentum go hand in hand: if you change one, you must change the other to get optimal performance. He introduces cyclical momentum, which Howard and Ruder then use in combination with STLR in order to achieve good performance.

The third step, after the language model has been fine-tuned to a specific corpus, is to add the classifier on top of the language model. It takes a concatenated pooling of the last hidden states from the language model as input. The concatenated pooling, $h_c$, is defined as

$$h_c = [h_T, \mathrm{maxpool}(H), \mathrm{meanpool}(H)] \qquad (17)$$

where $[\cdot]$ is a concatenation and $H = [h_1, h_2, \dots, h_T]$. The maxpool operation takes the largest values (or most important features) from the hidden states in $H$ and the meanpool operation takes the average of the hidden states in $H$. This input is then fed into two linear layers which use the dropout and batch normalization described in section 4.4. The first of these layers uses the ReLU activation function described in section 4.1 and the second layer is propagated forward into the softmax function (Eqn. (6)) in order to assign the class probabilities.
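A minimal PyTorch sketch of the concatenated pooling in Eqn. (17) and the two linear layers on top of it is shown below; the tensor shapes and layer sizes are illustrative, and the batch normalization and dropout of section 4.4 are omitted for brevity.

```python
import torch
import torch.nn as nn

def concat_pooling(H):
    """Concatenated pooling of Eqn. (17). H has shape (batch, T, d) with hidden states h_1, ..., h_T."""
    last = H[:, -1, :]             # h_T, the last hidden state
    maxpool = H.max(dim=1).values  # largest value in each hidden dimension
    meanpool = H.mean(dim=1)       # average over the hidden states
    return torch.cat([last, maxpool, meanpool], dim=1)  # shape (batch, 3 * d)

d, n_classes = 400, 1013
head = nn.Sequential(
    nn.Linear(3 * d, 50),
    nn.ReLU(),
    nn.Linear(50, n_classes),  # the softmax of Eqn. (6) is applied to these scores
)

H = torch.randn(8, 70, d)                        # a batch of 8 documents with 70 hidden states each
probs = head(concat_pooling(H)).softmax(dim=1)   # class probabilities, shape (8, 1013)
```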

4.6 Evaluation metric

The metric used to evaluate the performance of the classifier is Accuracy@K. If K = 3, the classifier gets three guesses for each document, namely the three classes assigned the highest probabilities by the softmax function. Accuracy@K is then defined as

$$\mathrm{Accuracy@K} = \frac{\text{Number of correct guesses}}{\text{Number of documents}}.$$

Since only one guess can be correct per document, the metric is bounded between 0 and 1. In our experiments we use K = 1, 3 and 5.
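Accuracy@K is straightforward to compute from the predicted class probabilities; the NumPy sketch below uses made-up probabilities and labels purely for illustration.

```python
import numpy as np

def accuracy_at_k(probs, labels, k):
    """Fraction of documents whose true class is among the k highest-probability guesses."""
    topk = np.argsort(probs, axis=1)[:, -k:]   # indices of the k largest probabilities per document
    hits = [label in row for row, label in zip(topk, labels)]
    return float(np.mean(hits))

# Tiny example with 3 documents and 4 classes
probs = np.array([[0.70, 0.15, 0.10, 0.05],
                  [0.20, 0.50, 0.25, 0.05],
                  [0.40, 0.30, 0.20, 0.10]])
labels = np.array([0, 2, 3])
print(accuracy_at_k(probs, labels, 1))  # 1/3: only the first document is correct with one guess
print(accuracy_at_k(probs, labels, 3))  # 2/3: the second document is also correct with three guesses
```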

5 Experiments

5.1 Benchmark

First, we trained a Naive Bayes classifier to use as a reference point for our neural network model, since it can be trained relatively quickly. Unigrams and bigrams with a TF-IDF representation were used as features. We removed words which appeared in more than half of the documents and, using chi-squared feature selection (Manning, Raghavan, and Schütze 2008), only included the top 60000 features. The performance of this benchmark is shown in table 2.

Model Accuracy@1 Accuracy@3 Accuracy@5

Naive Bayes 73.63% 85.65% 88.97%

Table 2: Naive Bayes performance given one, three and five guesses.
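A sketch of such a benchmark pipeline in scikit-learn is shown below. The TF-IDF settings (unigrams and bigrams, terms dropped if they appear in more than half of the documents) and the chi-squared selection of the top 60000 features follow the description above, while the variables holding the texts and labels are hypothetical placeholders.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

# Naive Bayes benchmark: TF-IDF on unigrams and bigrams, chi-squared feature selection
benchmark = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_df=0.5)),
    ("chi2", SelectKBest(chi2, k=60000)),
    ("nb", MultinomialNB()),
])

# texts_train, y_train, texts_valid, y_valid are assumed to hold the concatenated
# title + selftext strings and the corresponding subreddit labels
benchmark.fit(texts_train, y_train)
print(benchmark.score(texts_valid, y_valid))  # Accuracy@1 on the validation data
```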

5.2 ULMFiT implementation

Language model The first thing we did when creating the language model was to build a vocabulary. In order to do so, the text is pre-processed: for example, a word like don't is divided into two tokens, do and n't. Special tokens indicating important things happening in the text were also used. For example, a token indicating that the following word is in all upper case letters was used, since the semantic meanings of, for instance, STOP and stop might be very different and thus carry a lot of information. There is also a token indicating where a new text starts, a token indicating a capitalized letter and a token for words that are not in the vocabulary. In these experiments, a vocabulary consisting of the 60000 most common tokens in the dataset has been used. The tokenized data are then numericalized: the most common token gets the value 0, the second most common gets 1, and so on. One could limit the size of the vocabulary to a lower number to save computation time, but since the data contains so many different classes we believe that the language can be quite diverse and that a large vocabulary is needed to successfully distinguish between classes.
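A minimal sketch of this kind of vocabulary building and numericalization is shown below; the whitespace tokenizer and the special token names are simplified, hypothetical stand-ins for the actual pre-processing described above.

```python
from collections import Counter

UNK = "xxunk"   # hypothetical special token for out-of-vocabulary words
BOS = "xxbos"   # hypothetical special token marking the start of a text

def tokenize(text):
    # Simplified tokenizer: lower-case and split on whitespace
    return [BOS] + text.lower().split()

def build_vocab(texts, max_size=60000):
    # Most common token gets index 0, the second most common index 1, and so on
    counts = Counter(tok for text in texts for tok in tokenize(text))
    itos = [tok for tok, _ in counts.most_common(max_size)]
    if UNK not in itos:
        itos.append(UNK)
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(text, stoi):
    return [stoi.get(tok, stoi[UNK]) for tok in tokenize(text)]

texts = ["What movie should I watch tonight?", "My router keeps dropping the connection."]
stoi = build_vocab(texts)
print(numericalize("What movie is this?", stoi))
```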

The structure of the neural network used in the experiments is a version of the ULMFiT model, tuned in different ways to be more suited for the highly multiclass classification task. First, a language model is trained with the purpose that the model should get familiar with the English language in general and the specific language of the data in particular. The language model (seen in a down-scaled version in Fig. 5) consists of an embedding layer with 400 units and three LSTM layers with 1150 hidden units in each layer, where the output of the last layer is 400 units, the same as the number of embedding units.


Figure 5: Language model structure (Input 60000 units → Embedding 400 units → LSTM 1150 units → LSTM 1150 units → LSTM 1150 units → Output 400 units). It takes the numericalized vocabulary as input, which is fed into an embedding layer used to learn the relations between words. This is then propagated forward through three layers of LSTM cells, which output hidden states of the same size as the embedding layer.

This is quite a large network which takes a long time to train. A model with pre-trained weights is used to initialize the weights in our training, as suggested by Howard and Ruder (2018). Starting with these weights, the language model is trained on the Reddit data in order to learn the language used in this realm. For the purpose of tracking the model's performance across training, 10% of the data is kept for validation. With more data, the language model can learn more, which is why the validation set is somewhat small. It is also worth noting that the language model does not get to see the class labels associated with each observation, which means that a validation set for the classifier does not need to be set aside in this part of training.

The dropout values seen in table 3 are used during training. The values are set low since the training dataset is large.


                      Input layer   Embedding layer   Hidden layers   Weights
Dropout probability         0.072             0.012           0.024     0.060

Table 3: Dropout probabilities for the different layers and the weight matrices in the language model.

A short test is run for a few iterations where the learning rate versus loss is plotted, seen in Fig. 6. A rule of thumb is to use a learning rate somewhere in the steepest descent in the plot. We want to find a point in the plot where the learning rate is high while the loss is low. This leads us to initialize training with a learning rate of 0.04.

Figure 6: A smoothed plot of learning rate versus loss for the language model run on a few minibatches until the loss started increasing.

We also plotted the learning rate versus loss (see Fig. 7) after every second epoch in order to make reasonable adjustments to the learning rate during training.


Figure 7: Smoothed plots of learning rate versus loss constructed after the second epoch (a) and the fourth epoch (b).

Epoch   Train loss   Valid loss   Accuracy   Time       LR
1       4.83         4.78         0.231      9:36:57    0.04
2       4.23         4.14         0.277      9:37:02    0.04
3       3.78         3.74         0.321      9:26:58    3e-04
4       3.74         3.69         0.328      9:27:29    3e-04
5       3.71         3.67         0.329      10:10:43   1e-04
6       3.70         3.66         0.331      10:11:01   1e-04

Table 4: Training schedule for the first language model. Cyclical momentum was also used and set to vary between 0.8 and 0.7 for all epochs. The Adam optimizer was used with β1 = 0.9 and β2 = 0.99. Weight decay was set to 0.01.

Table 4 details the training progression for the language model fine-tuned on the Reddit data. The total training took almost 60 hours of GPU time (on an Nvidia Tesla K80 GPU), and after the sixth epoch the model is able to predict the next word in the validation part of the data with 33.1% accuracy. In hindsight, an inspection of Fig. 7a and 7b leads us to believe that the learning rates of epochs three to six could have been set a little more aggressively, possibly leading to faster convergence and improved performance.
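The thesis does not name the software used, but the workflow above (pre-trained AWD-LSTM weights, a learning rate finder, one-cycle training with cyclical momentum) matches the fastai v1 ULMFiT API, so a hedged sketch along those lines might look as follows; the file name, column names, sampling seed and epoch grouping are illustrative assumptions rather than a record of the actual runs.

```python
import pandas as pd
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

df = pd.read_csv("rspct.tsv", sep="\t")          # hypothetical path to the Kaggle data
df["text"] = df["title"] + " " + df["selftext"]  # concatenate title and selftext

# 10% of the data held out to track language model performance
train_df = df.sample(frac=0.9, random_state=42)
valid_df = df.drop(train_df.index)
data_lm = TextLMDataBunch.from_df(".", train_df=train_df, valid_df=valid_df,
                                  text_cols="text", max_vocab=60000)

# AWD-LSTM language model initialized with pre-trained weights
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

learn.lr_find()        # short run plotting learning rate versus loss (cf. Fig. 6)
learn.recorder.plot()

learn.unfreeze()
learn.fit_one_cycle(2, 4e-2, moms=(0.8, 0.7), wd=0.01)  # first epochs at a higher learning rate
learn.fit_one_cycle(2, 3e-4, moms=(0.8, 0.7), wd=0.01)  # later epochs at lower learning rates
learn.fit_one_cycle(2, 1e-4, moms=(0.8, 0.7), wd=0.01)
learn.save_encoder("ft_enc")                            # keep the encoder for the classifier
```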

Classifier After the language model had been trained we trained the classifier. It sits on top of the language model, as explained in section 4.5. Fig. 8 shows a simplified structure of the classifier. The number of units in the output layer corresponds to the number of classes in the data.


Figure 8: Classifier structure (LM output → concatenated pooling → Linear 1200 units → ReLU activation → Linear 50 units → Output 1013 units). The leftmost layer is the output of the language model, which is fed into a pooling layer. The pooled output then goes into a large linear layer which is propagated forward through a ReLU activation. The activations are then fed into another, smaller, linear layer whose units are used to calculate the output layer.

Batch normalization is used between layers, and the dropout probability is set to 0.048 for the first layer and 0.1 for the second layer of the classifier. 15% of the data is used for validation and a batch size of 32 is used during training (the largest that fits in memory). Training was initialized with a learning rate of 0.04.

The classifier training is detailed in table 5. The one cycle policy (Smith 2018) and discriminative fine-tuning (Howard and Ruder 2018) are used to train the model. The values in the last eight rows of the LR column in table 5 refer to the minimum and maximum learning rates used during the cycle.

Gradual unfreezing was also used during training of the classifier in order to remedy the problem of catastrophic forgetting (that the model forgets the general language contained in the language model). The last column of table 5 specifies which layers were trained when.

Epoch   Train loss   Valid loss   Accuracy   Time      LR                    Layers
1       3.70         3.23         0.323      4:22:30   0.04                  All
2       2.69         2.21         0.538      4:03:48   0.04                  All
3       2.92         2.43         0.470      5:13:17   2e-2/2.6^4 to 2e-2    Last two
4       1.84         1.50         0.679      5:07:30   2e-2/2.6^4 to 2e-2    Last two
5       1.64         1.28         0.724      6:58:24   3e-3/2.6^4 to 3e-3    Last three
6       1.22         0.98         0.790      7:36:26   3e-3/2.6^4 to 3e-3    Last three
7       1.23         1.05         0.773      7:17:33   1e-3/2.6^4 to 1e-3    Last three
8       1.10         0.93         0.800      7:12:42   1e-3/2.6^4 to 1e-3    Last three
9       1.02         0.92         0.801      7:28:19   1e-3/2.6^4 to 1e-3    Last three
10      1.00         0.91         0.805      8:45:06   1e-4/2.6^4 to 1e-4    All

Table 5: Training schedule for the classifier. Cyclical momentum was also used and set to vary between 0.8 and 0.7 for all epochs. The Adam optimizer was used with β1 = 0.9 and β2 = 0.99. Weight decay was set to 0.01.

The learning rate becomes smaller as training progresses and is chosen by constructing plots like the one in Fig. 6. Training was quite slow in the beginning, which could indicate that the learning rate should have been initialized at a larger value. The training started by fine-tuning the whole model for two epochs, then used gradual unfreezing where just the last two and then the last three layers were fine-tuned for seven epochs. In the last epoch, the whole model was fine-tuned again. Training time was 64 hours. After five epochs, the accuracy was almost on par with the Naive Bayes benchmark, and after ten epochs the classifier could correctly classify a text with 80.5% accuracy, given one guess.
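Continuing the hedged fastai v1 sketch from the language model stage, the gradual unfreezing and discriminative fine-tuning schedule above roughly corresponds to the following; slice(lr / 2.6**4, lr) spreads the learning rate across the layer groups as in table 5, and the data objects and epoch grouping are again illustrative assumptions.

```python
from fastai.text import TextClasDataBunch, text_classifier_learner, AWD_LSTM

# Reuse the language model vocabulary; the subreddit column is the label, batch size 32
data_clas = TextClasDataBunch.from_df(".", train_df=train_df, valid_df=valid_df,
                                      text_cols="text", label_cols="subreddit",
                                      vocab=data_lm.train_ds.vocab, bs=32)

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder("ft_enc")  # start from the fine-tuned language model encoder

learn.unfreeze()              # epochs 1-2 in table 5 train all layers
learn.fit_one_cycle(2, 4e-2, moms=(0.8, 0.7), wd=0.01)
learn.freeze_to(-2)           # gradual unfreezing: only the last two layer groups
learn.fit_one_cycle(2, slice(2e-2 / 2.6**4, 2e-2), moms=(0.8, 0.7), wd=0.01)
learn.freeze_to(-3)           # then the last three layer groups
learn.fit_one_cycle(2, slice(3e-3 / 2.6**4, 3e-3), moms=(0.8, 0.7), wd=0.01)
learn.fit_one_cycle(3, slice(1e-3 / 2.6**4, 1e-3), moms=(0.8, 0.7), wd=0.01)
learn.unfreeze()              # finally the whole model again
learn.fit_one_cycle(1, slice(1e-4 / 2.6**4, 1e-4), moms=(0.8, 0.7), wd=0.01)
```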

6 Results

Table 6 shows the Accuracy@K metric for our classifier compared to the benchmark classifier on the validation and test data. Our neural network based model outperforms the benchmark, Naive Bayes, by a wide margin on all metrics and on both datasets. The largest difference is observed when the model is only given one guess.

Fig. 9 shows the distribution of error rates across classes for the validation data and test data respectively. As seen in both panels, very few classes have an error rate higher than 40%.


Model         Dataset      Accuracy@1   Accuracy@3   Accuracy@5
Naive Bayes   Validation       73.63%       85.65%       88.97%
ULMFiT        Validation       80.49%       89.78%       92.20%
ULMFiT        Test             78.57%       88.64%       91.32%

Table 6: Performance measured as Accuracy@K of the benchmark model and our classifier on both validation and test data.

Figure 9: Error rate distribution across classes for the validation data (a) and the test data (b).

Table 7 shows the classes with the lowest and highest error rates in the validation data, along with the number of observations from each class and its error rate.

Table 8 is the same as table 7 but for the test data. As mentioned, this data is more unbalanced than the validation data, with some classes having fewer than ten observations.

It can be seen in tables 7 and 8 that most of the classes that the model fails to predict accurately seem very general. For example, topics such as canada, unitedkingdom and networking seem very broad and could contain almost any type of discussion. Only around 20 classes have an error rate higher than 50% on both validation and test data, and there are many classes with almost no errors.

Fig. 10 shows some examples which the model failed to classify correctly (with one guess). The first example seems like a decent guess. The second example could probably fit into either the predicted or the true class. In the third example, the text seems very hard to categorize without context. However, if one knew that the label Invisalign refers to a kind of dental treatment similar to braces, maybe it would be possible to make that prediction.


Lowest errors Highest errors

Class Observations Error rate Class Observations Error rate

KeybaseProofs 101 0 Construction 91 0.495

incest 110 0 seduction 95 0.495

ACL 105 0.01 southafrica 107 0.495

Stormlight_Archive 96 0.01 cscareerquestions 98 0.5

SkincareAddiction 91 0.011 Psychic 88 0.5

Kava 84 0.012 privacy 96 0.5

snapchat 95 0.021 bladeandsoul 97 0.505

ShingekiNoKyojin 117 0.026 linuxquestions 105 0.514

Mattress 116 0.026 AvPD 87 0.517

mead 114 0.026 socialism 110 0.518

reloading 108 0.028 dndnext 113 0.522

swoleacceptance 104 0.029 asktrp 105 0.524

WritingPrompts 101 0.03 canada 99 0.525

asmr 100 0.03 personalfinance 91 0.538

vikingstv 99 0.03 networking 103 0.544

sharditkeepit 97 0.031 Anarchism 104 0.567

Snus 93 0.032 hacking 108 0.574

Chromecast 92 0.033 actuallesbians 89 0.618

Geosim 112 0.036 techsupport 102 0.647

hookah 83 0.036 unitedkingdom 87 0.655

Table 7: Table of classes with lowest and highest error rates, number of observations in each class and error rate of each class for the validation data.

The predicted label, wls, refers to discussions about "weight loss surgery", which seems like a good guess given the available information. In the last example, the model predicts russian while the true label is russia. Perhaps one of these classes should not have been in the data to begin with, since there is very likely a big overlap between them.


Lowest errors Highest errors

Class Observations Error rate Class Observations Error rate

garlicoin 5 0 PoloniexForum 6 0.5

FidgetSpinners 4 0 hitmobile 2 0.5

netneutrality 12 0 seduction 234 0.504

vergecurrency 4 0 datascience 196 0.51

lightsabers 54 0 hacking 78 0.513

KeybaseProofs 217 0.005 Lineage2Revolution 31 0.516

DestructiveReaders 90 0.011 DFO 207 0.517

SkincareAddiction 209 0.019 privacy 206 0.519

snapchat 187 0.027 networking 234 0.526

TOR 74 0.027 FORTnITE 236 0.542

Porsche 72 0.028 personalfinance 203 0.552

incest 143 0.028 Psychic 215 0.558

malehairadvice 208 0.029 canada 224 0.562

emojipasta 133 0.03 schizophrenia 221 0.57

OneNote 59 0.034 actuallesbians 223 0.583

WritingPrompts 176 0.034 techsupport 213 0.601

puppy101 228 0.035 asktrp 206 0.607

tarantulas 83 0.036 StateOfDecay 64 0.609

sharditkeepit 52 0.038 bladeandsoul 216 0.62

SampleSize 225 0.04 unitedkingdom 206 0.743

Table 8: Table of classes with lowest and highest error rates, number of observations in each class and error rate of each class for the test data.

Figure 10: Some examples of texts that the model could not correctly predict. The text is displayed in tokenized form along with predicted and true labels.


7 Conclusion

This thesis aimed to investigate whether a neural network could perform well on a highly multiclass classification task. A publicly available dataset was used, consisting of 1013 classes with 1000 observations per class. We found that an LSTM-based model which used a transfer learning technique known as ULMFiT could successfully be trained to perform well on this data. This model beat our benchmark model by a wide margin and could predict the correct class with 80.5% accuracy (given one guess), 89.8% accuracy (given three guesses) and 92.2% accuracy (given five guesses).

The model's ability to correctly classify a text seemed to depend on how broad or narrow a class was. Classes which could contain almost any type of discussion were harder to classify, whereas narrower classes were easier. There was also some overlap between some of the classes, which contributed to the model's inability to classify some texts (especially when given only one guess). The creators of the dataset believed that the maximum possible accuracy on this data was approximately 96%, because some texts seemed to contain no useful information at all. In light of this, 92.2% Accuracy@5 must be considered a good result.

The accuracy could likely be increased even more with better choices of learning rate in each epoch. Increased training time would also be a way of boosting performance, although expensive. Another expensive way to boost performance would be to train several different models and ensemble their predictions.

Some interesting topics for further research would, for example, be to investigate whether CNN-based language models and classifiers could achieve good performance in a similar setting. Another thing to investigate would be how the performance of the language model affects the performance of the classifier. Would using GPT-2 instead of AWD-LSTM as the pre-trained model result in a large boost in performance? Roughly one thousand observations per class were used in this thesis, and it would be of interest to find out how much training data is required to achieve good results.


References

BioASQ (2013). The Challenge. [Accessed 2019-03-26]. URL: http://bioasq.org/participate/challenges_year_1.

Bottou, L. (2018). "Online Learning and Stochastic Approximations (revised 5/2018)". Online Learning in Neural Networks, 1–35.

Conneau, A. et al. (2017). "Very Deep Convolutional Networks for Text Classification". arXiv: 1901.09821.

Dai, A. M. and Q. V. Le (2015). "Semi-supervised Sequence Learning", 1–10. arXiv: 1511.01432.

Gargiulo, F., S. Silvestri, and M. Ciampi (2018). "Deep Convolution Neural Network for Extreme Multi-label Text Classification". Healthinf, 641–650.

Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep Learning. http://www.deeplearningbook.org. MIT Press.

Hinton, G. E. (1977). "Relaxation and its role in vision". Ph.D. Thesis, University of Edinburgh.

Hochreiter, S. and J. Schmidhuber (1997). "Long Short-Term Memory". Neural Computation 9.8, 1735–1780.

Howard, J. and S. Ruder (2018). "Universal Language Model Fine-tuning for Text Classification". arXiv: 1801.06146.

Ioffe, S. and C. Szegedy (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". arXiv: 1502.03167.

Joulin, A. et al. (2016). "Bag of Tricks for Efficient Text Classification". arXiv: 1607.01759.

Kaggle (2018). The reddit self-post classification task. [Accessed 2019-03-27]. URL: https://www.kaggle.com/mswarbrickjones/reddit-selfposts.

Kim, Y. (2014). "Convolutional Neural Networks for Sentence Classification". Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751.

Kingma, D. P. and J. Ba (2014). "Adam: A Method for Stochastic Optimization". arXiv: 1412.6980.

Lai, S. et al. (2015). "Recurrent Convolutional Neural Networks for Text Classification". AAAI'15, 2267–2273.

Liu, J. et al. (2017). "Deep Learning for Extreme Multi-label Text Classification". Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 115–124.

Manning, C. D., P. Raghavan, and H. Schütze (2008). Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press. ISBN: 0521865719, 9780521865715.

Merity, S., N. S. Keskar, and R. Socher (2017). "Regularizing and optimizing LSTM language models". arXiv: 1708.02182v1.

Mikolov, T. et al. (2013). "Efficient Estimation of Word Representations in Vector Space". arXiv: 1301.3781.

Nair, V. and G. E. Hinton (2010). "Rectified Linear Units Improve Restricted Boltzmann Machines". ICML'10 Proceedings of the 27th International Conference on Machine Learning, 807–814.

Östling, R. (2013). "Automated Essay Scoring for Swedish". Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 42–47.

Radford, A. et al. (2019). "Language Models are Unsupervised Multitask Learners". URL: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). "Learning representations by back-propagating errors". Nature 323, 533–536.

Smith, L. N. (2017). "Cyclical learning rates for training neural networks". Proceedings - 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, 464–472. arXiv: 1506.01186.

— (2018). "A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay", 1–21. arXiv: 1803.09820.

Srivastava, N. et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". Journal of Machine Learning Research 15, 1929–1958.

Sutskever, I., O. Vinyals, and Q. V. Le (2014). "Sequence to Sequence Learning with Neural Networks". arXiv: 1409.3215.

Wikipedia (2019a). Medical Subject Headings. [Accessed 2019-03-26]. URL: https://en.wikipedia.org/wiki/Medical_Subject_Headings.

— (2019b). Reddit. [Accessed 2019-03-27]. URL: https://en.wikipedia.org/wiki/Reddit.

Yosinski, J. et al. (2014). "How transferable are features in deep neural networks?" arXiv: 1411.1792.

Zhang, X. and Y. Lecun (2016). "Text Understanding from Scratch". arXiv: 1502.01710v5.

Zhang, X., J. Zhao, and Y. Lecun (2016). "Character-level Convolutional Networks for Text Classification". arXiv: 1509.01626v3.
