
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2020 | LIU-IDA/LITH-EX-A--20/048--SE

Evaluation of text classification techniques for log file classification

Utvärdering av textklassificeringstekniker för klassificering av loggfiler

Per Olin

Supervisor: George Osipov
Examiner: Cyrille Berger



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

System log files are filled with logged events, status codes, and other messages. By analyzing the log files, the system's current state can be determined, and it is possible to find out whether something went wrong during its execution. Log file analysis has been studied for some time, and recent studies have shown state-of-the-art performance using machine learning techniques.

In this thesis, document classification solutions were tested on log files in order to classify regular system runs versus abnormal system runs. To solve this task, supervised and unsupervised learning methods were combined. Doc2Vec was used to extract document features, and Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) based architectures were used on the classification task. With the use of the machine learning models and preprocessing techniques, the tested models yielded an F1-score and accuracy above 95% when classifying log files.


Acknowledgments

First, I would like to thank my supervisor at CommScope, Magnus Ekhall, for the opportunity to work on an interesting project and for providing a great dataset for my research. I had a lot of fun and felt welcome at CommScope during my thesis work.

I would like to thank George Osipov for the great help and support throughout the thesis. You have given me a lot of great feedback and many helpful discussions along the way.

I would like to thank my examiner Cyrille Berger for the help with getting started with the thesis and for providing us with additional opportunities to get our theses reviewed.

Finally, I would like to thank my opponent Mathias Nilsson for the great feedback he has given me during the thesis.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research questions
1.4 Delimitations
2 Theory
2.1 Natural language processing
2.2 Text representation
2.3 Preprocessing
2.4 Deep learning
2.5 Unbalanced dataset
2.6 Evaluation
2.7 Related work
3 Method
3.1 Framework and environment
3.2 Dataset
3.3 Construction of dataset
3.4 Creation of balanced dataset
3.5 Dataset experiments
3.6 Preprocessing
3.7 Document segmentation
3.8 Segmentation experiments
3.9 Feature extraction
3.10 CNN implementation
3.11 LSTM implementation
3.12 LSTM-CNN implementation
3.13 Architecture experiments
4 Results
4.1 Base model implementation
4.2 Unbalanced dataset
4.3 Document segmentation
4.4 Architecture comparison
5 Discussion
5.1 Results
5.2 Method
5.3 The work in a wider context
6 Conclusion
6.1 Future Work
Bibliography
A Appendix
A.1 Dataset histograms
A.2 Architecture images
A.3 Results with more metrics


List of Figures

2.1 Doc2Vec PV-DM model. Paragraph id D is used as a word to train the model.
2.2 Doc2Vec PV-DBOW model. Paragraph id D is used to predict a random sequence of words from the document.
2.3 A common structure of a CNN, illustrating the convolution layer and pooling layer with respective kernel and filter. The figure is taken from [2].
2.4 RNN in its unfolded form.
2.5 Structure of an LSTM node, illustrating the input, forget, and output gates. The figure is taken from [7].
3.1 Diagram showing the workflow of the thesis.
3.2 A segment from an STB log file. It shows how the file looks before any preprocessing is applied.
3.3 Histogram of the word count of each log file. The x-axis shows the word count, while the y-axis shows the number of files with the same word count.
3.4 Pseudocode for the creation of a balanced dataset.
3.5 Stopword removal, using NLTK's stopword library.
3.6 Lemmatization code in Python using the TextBlob library.
3.7 Expression for non-alphanumeric removal.
3.8 The architecture of the CNN model used in the thesis (values shown in the figure come from the final tuning of the model).
3.9 The architecture of the BI-LSTM model used in the thesis (values shown in the figure come from the final tuning of the model).
3.10 Architecture of the BI-LSTM-CNN model used in the thesis (values shown in the figure come from the final tuning of the model).
4.4 The results of the CNN, LSTM, and LSTM-CNN models as the base, segmented, and fine-tuned versions.
A.1 Histogram of the word count of each log file. The x-axis shows the word count, while the y-axis shows the number of files with the same word count.
A.2 Histogram showing a more detailed view of the word count of each log file by limiting the x-axis to 30000. The x-axis shows the word count, while the y-axis shows the number of files with the same word count.
A.3 Histogram showing a more detailed view of the word count of each log file by limiting the x-axis to 5000. The x-axis shows the word count, while the y-axis shows the number of files with the same word count.
A.4 Architecture of the CNN model used in the thesis.


List of Tables

2.1 A vocabulary of five words and the one-hot-encoding of each word. Each row of the table is the one-hot-encoding of the corresponding word.
2.2 Confusion matrix of a binary class problem.
3.1 Overview of datasets used in the experiments.
3.2 Hyper-parameters for the PV-DBOW model.
3.3 Hyper-parameters for the CNN model.
3.4 Hyper-parameters for the LSTM-CNN model.
3.5 The hyper-parameter space for each model.
4.1 Number of hidden nodes hyper-parameter tuning results (highest scores are marked in bold).
4.2 Dropout value hyper-parameter tuning results (highest scores are marked in bold).
4.3 The resulting hyper-parameters for each model.
A.1 Table showing the accuracy, precision, recall, and F1-score of the three models on balanced datasets.
A.2 Table showing the accuracy, precision, recall, and F1-score of the three models on unbalanced datasets.
A.3 Table showing the results of the segmentation experiment. The table shows how each model performed on 15 different segment values (highest scores for each column are marked in bold).


1 Introduction

System log files contain information about events executed by a system during its runtime. By analyzing such a file, it is possible to determine the system's current state and find out if something went wrong. In the beginning, log file analysis was done by manual inspection, which is a time-consuming process since the generated files can be very large. Classifying a large file could take days, and it is easy to miss something important while reading every line.

With progress in the machine learning field, new ways to solve classification problems have become available. One common approach for finding anomalies in a log file is to read sequences of the log file and learn what is common in order to find deviations in the sequences [23, 6].

In this thesis, identifying abnormalities in log files is approached as a text classification problem. Studies have shown great results in the area of text classification, which is the reason for this approach [24, 14, 41]. One thing that makes the classification of a log file harder compared to a short story or a sentence is the text size. Methods to classify large texts have been found, such as document classification using the Doc2Vec [18] framework. In this thesis, document segmentation is applied to each log file, in combination with Doc2Vec, to extract good features. The extracted features are then passed to a number of text classification models to evaluate how different models fare on log file classification.

1.1 Motivation

This thesis was carried out at CommScope's Linköping office, which develops Set-Top Boxes (STB). CommScope has a large test system that tests STBs in a variety of situations to ensure the quality of the product. Test cases only test certain functionality and specified requirements. With log file analysis it is possible to detect errors outside of the specified test cases, which can help the developers find new errors and in turn improve the quality of the product.

1.2 Aim

The aim of this thesis is to examine how text classification models can be applied to the task of log file classification. The models will be compared against each other to find their strengths and weaknesses.


The final goal of this thesis is to provide a pipeline that can be used to classify log files in a development environment.

1.3 Research questions

The following research questions are the focus of this thesis:

1. What machine learning models can be used to analyze log files in order to detect abnormal system runs?

2. How do different Long Short-Term Memory and Convolutional Neural Network-based approaches compare in F1-score on the task described above?

1.4 Delimitations

This thesis aims to investigate how well text classification models can be used to classify log files. The log files that will be used in this thesis are provided by CommScope. Within the scope of this thesis, no other log files will be used.

In this thesis, only three models will be evaluated. There are a lot more models in the text classification family, but those are outside the scope of this thesis.


2 Theory

This chapter covers the general theory underlying the thesis. A general understanding of neural networks is required for this theory chapter. The main part of the theory lies in Natural Language Processing (NLP) and the concepts within it, such as the general approach, the different models, and the metrics used for evaluation of NLP models.

2.1 Natural language processing

NLP is a subfield of computer science and Artificial Intelligence (AI) that studies how a computer can understand natural languages. The meaning of words is not something that a computer can comprehend by nature, but with rules and models, a computer can get better at interpreting words.

In the early days of NLP, models were created with the use of linguistic rules and theory. Later, statistical approaches were used, which led to most of the machine learning algorithms that we see today [13].

Some areas of NLP are sentiment analysis, document retrieval, text summarization, and text classification. The focus of this study is on text classification.

2.2 Text representation

Text classification and other NLP problems take arbitrary text, a sequence of words, as input, on which machine learning models perform calculations to achieve the goal of the given task. One common way to represent text in a way that a computer can understand is to convert it into numbers. The task of converting text into a format that a computer can understand is called feature engineering of text data [16].

One-Hot-Encoding

The most common word representation is called a one-hot-encoding, also called a one-hot vector. In one-hot-encoding, each word is represented as a vector with the dimension of the vocabulary, which is the collection of all unique words in the dataset [16]. All elements of the vector are zero except one, which is set to one at the index corresponding to the given word in the vocabulary, see Table 2.1. A one-hot-encoding can represent a sentence as well as a single word.


           nlp   is   great   machine   learning
nlp         1    0      0        0         0
is          0    1      0        0         0
great       0    0      1        0         0
machine     0    0      0        1         0
learning    0    0      0        0         1

Table 2.1: A vocabulary of five words and the one-hot-encoding of each word. Each row of the table is the one-hot-encoding of the corresponding word.

If a word is present in the sentence, a one is placed at the corresponding index in the vector.

This representation works, but it has some limitations. Given a dataset with a lot of unique entries, the word representation requires a lot of memory and the time to train the model increases [42]. Another flaw is that the one-hot-encoding only stores the information that a word appeared, but nothing about a word's relation to other words. With all words represented by zeros and ones, any two word vectors are orthogonal.
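As a concrete illustration, the encoding in Table 2.1 can be produced with a few lines of Python (a minimal sketch; the vocabulary order simply follows first appearance):

def one_hot_encode(text):
    # Build the vocabulary: all unique words, in order of first appearance.
    vocab = list(dict.fromkeys(text.split()))
    # Each word maps to a vector of zeros with a single one at its own index.
    return {word: [1 if i == j else 0 for j in range(len(vocab))]
            for i, word in enumerate(vocab)}

encodings = one_hot_encode("nlp is great machine learning")
print(encodings["great"])  # [0, 0, 1, 0, 0]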

Word embedding

A solution to the dimensionality problem and the lack of information can be a word embedding. Word embeddings are a vector representation of words from the vocabulary similar to one-hot-encoding, but instead of being represented by zeros and ones, the words are represented as vectors of real numbers. A word embedding is a continuous vector with a much lower dimension compared to a one-hot-encoding.

Word embeddings can be generated by many methods, such as dimensionality reduction [32], co-occurrence matrices [19], and neural networks [26]. The neural network solution proposed by Mikolov et al. [26] is a framework called Word2Vec. Word2Vec consists of mainly two types of models, the Skip-Gram model and Continuous Bag Of Words (CBOW). Both models are shallow networks with only two layers, and both have an embedding dimension parameter and a window size parameter. The embedding dimension corresponds to the size of the learned word vectors. The window size corresponds to the number of context words considered during the training of the models. During training, one word is selected as the target word, and the words within window size steps of the target word are the context words. Depending on which model is used, you either try to predict the target word from the context words or to predict the context words from the target word. The result is vectors that reflect the similarity of words: in the dense vector space, words of similar meaning are placed close to one another.
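As a small illustration of this setup, the Gensim library (used later in this thesis for Doc2Vec) exposes both Word2Vec models. The toy corpus and parameter values below are placeholders, and the parameter names follow recent Gensim versions:

from gensim.models import Word2Vec

sentences = [["nlp", "is", "great"], ["machine", "learning", "is", "great"]]
# sg=0 selects CBOW, sg=1 selects Skip-Gram; vector_size is the embedding
# dimension and window is the number of context words on each side.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
print(model.wv["great"])               # the learned 100-dimensional vector
print(model.wv.most_similar("great"))  # nearest words in the embedding space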

Document embedding

A document embedding framework called Doc2Vec is based on the Word2Vec framework mentioned in the previous section. The embedding is called a paragraph embedding in the paper [18] to emphasize that it can be used on text of varying size. Doc2Vec has two models, just like Word2Vec. One model is called the Distributed Memory Model of Paragraph Vectors (PV-DM). Doc2Vec is an unsupervised learning model trained on a large text corpus with the objective of predicting the next word in a sequence. The PV-DM model is inspired by the CBOW Word2Vec model; both predict a word based on the surrounding context. In PV-DM, a paragraph identifier is added to the context when the model is predicting the target word (see Figure 2.1). The model has two matrices, W and D. W is the matrix of word vectors, where each column is a word vector; it has dimension M × q, where M is the number of words in the vocabulary and q is the dimension of each word vector. D is the paragraph matrix, where each column is a paragraph vector; its dimension is N × p, where N is the number of paragraphs in the corpus and p is the dimension of the paragraph vectors.


Figure 2.1: Doc2Vec PV-DM model. Paragraph id D is used as a word to train the model.

The matrices are updated using backpropagation and stochastic gradient descent. At each iteration, a fixed-length context is sampled from a random paragraph, the paragraph identifier is added to the context, and an error is computed to update the weights. As a result, the matrices W and D are tuned to capture similarity between words and between paragraphs, respectively.

For new documents that the model was not trained on, a column is added to the paragraph matrix and computed by gradient descent on D.

The second Doc2Vec model is called the Distributed Bag of Words version of Paragraph Vectors (PV-DBOW). PV-DBOW does not use context words to train the model. Instead, it forces the paragraph vector to predict a random word sequence in the document, similar to how the Skip-Gram Word2Vec model works. The PV-DBOW model can be seen in Figure 2.2. At each iteration of the gradient descent, a random text sample is taken from the paragraphs; from that sample one word is selected, and a classification task is performed given the paragraph vector.

Since the model does not use the context words, the word vectors are not stored, which means that the PV-DBOW model requires less memory than the PV-DM model.


Figure 2.2: Doc2Vec PV-DBOW model. Paragraph id D is used to predict a random sequence of words from the document.

2.3 Preprocessing

Data in its raw form is commonly unstructured and noisy. The process of cleaning the data is called preprocessing. It is done to help improve the results of a model. Another benefit of preprocessing is the reduction of the dataset size, which in turn reduces the time needed to train the model. Models trained on noisy data are likely to produce degraded results [30], unless special modifications are made to the model to combat the noise.

Preprocessing steps can improve results on a given problem, but they can also have a negative impact [34]. When selecting which preprocessing steps to apply, it is good to experiment with them to see whether they give the desired result [16].

Lowercasing

When the vocabulary is created from the dataset it is case sensitive. The same word such as “House” and “house” would be seen as different words, which would yield duplicate entries of the same word in the vocabulary. Duplicate entries in the vocabulary result in longer training times of the model and less accurate classification.

The solution to this problem is to lowercase all words before adding them to the vocabulary. This results in a vocabulary where the same word never appears twice [16].


Removing stop words

Stop words are the less important, more frequently appearing words in a text. When stop words are removed from a text, the text consists of more keywords: words that appear less frequently and carry more meaning in that given text. In the sentence "I would like to learn about NLP", examples of stop words are "I", "would", "to", and "about", and the keywords are "learn" and "NLP". In this example, about half of the sentence was removed. Depending on which words you choose to remove from the vocabulary, its size may be greatly reduced. [16]

Lemmatizing

Lemmatization is the process of turning words into their respective base form, also called the lemma or dictionary form. Words can appear in different forms such as "walk", "walking", and "walked"; without lemmatization, they would produce separate entries in the vocabulary, in turn making the vocabulary larger. [16]

Tokenization

Tokenization is the process of turning text into smaller tokens. A token can be a sentence from a large document or a word, depending on the problem at hand [16]. By turning a large text into tokens it becomes easier to process and to feed into a model. Since a large text is considered a single text object, it would be processed all at once; by dividing it into multiple text objects, each token is processed separately instead. The process of tokenization is not as straightforward as splitting a string by each whitespace or period, since there are cases where that does not work. The example sentence "Dr.Strange can't help you right now" shows a case where it would fail: if we split the sentence by whitespace, "can't" is not tokenized properly since it is a contraction of "can" and "not", and with periods as a delimiter the name Dr.Strange would be split into two tokens.
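A small sketch of the difference, using NLTK's word tokenizer (assuming NLTK and its tokenizer data are installed; the exact token boundaries depend on the tokenizer version):

from nltk.tokenize import word_tokenize

text = "Dr.Strange can't help you right now"
print(text.split())
# ['Dr.Strange', "can't", 'help', 'you', 'right', 'now']
print(word_tokenize(text))
# roughly ['Dr.Strange', 'ca', "n't", 'help', 'you', 'right', 'now']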

2.4 Deep learning

This section covers the deep learning architectures discussed in this thesis.

Convolutional Neural Network (CNN)

CNNs have been used in many different areas in recent years, but appear most commonly in the area of computer vision. CNN models have a grid-like architecture, processing data in, for example, 1D such as a sentence or 2D such as an image. The CNN architecture has two key building blocks, the convolution layer and the pooling layer. A common structure of a CNN is shown in Figure 2.3 [11].

The convolution layer has a kernel of a given size, either 1D or 2D according to the input. The kernel is applied to the output of the previous layer, performing the convolution, and is moved across the layer with a given stride. The output of the convolution layer is often called a feature map, which is fed into the next layer. A CNN is said to have sparse connectivity, which is achieved when the kernel is smaller than the input. In a fully connected NN with five input nodes and five output nodes, there is a total of 25 connections between the two layers; a CNN with a kernel size of 3, a stride of 1, and five input and output nodes has around 15 connections, depending on which padding is used [11].

The pooling layer applies a filter to the previous layer, similar to the kernel in the convolution layer, but instead of a convolution operator it applies an operator on the pooling area; the most common operators are max, min, and average pooling. The parameters of the pooling layer are the filter size and the stride. This results in a downsampling of the previous layer.


Figure 2.3: A common structure of a CNN, illustrating the convolution layer, and pooling layer with respective kernel and filter. The figure is taken from [2].

Figure 2.4: RNN in its unfolded form.

As an example, if the filter size is set to 3 and the stride is set to 2, the previous layer is downsampled to half its size [11].

As a result, a CNN has fewer weights compared to a regular neural network, which reduces the training time and memory used, and a CNN is translation invariant [11]. The convolution layer extracts new features, which are sometimes called hidden features. One example is a feature that can detect an edge in an area of an image [15].

One downside of using a CNN for processing 1D sequential data is that the features extracted from the convolution layer do not take the whole input sequence into account, since the kernel size is usually much smaller than the input size. As a result, a CNN could miss important information from the input sequence. One solution to this problem is the Recurrent Neural Network (RNN).

Recurrent Neural Network (RNN)

RNN is an architecture that specializes in the processing of sequential data. The architecture is called recurrent since it processes the input in a recurrent manner. Given an input sequence x^{(t)} and the model state h^{(t)}, where t represents a time step or index, the current state is calculated according to equation 2.1,

h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)    (2.1)

where f maps the state from one time step to the next and \theta parameterizes f. Equation 2.1 shows how the previous state h^{(t-1)} is used to calculate the current state h^{(t)}. The result of passing the previous state to the next state is a model that looks at the whole sequence when performing its task, instead of looking at each state independently. In turn, the extracted features depend on the whole input sequence.


Since the states of the RNN are calculated sequentially over the time steps, the way the weights are updated is a little different from feedforward networks. RNNs use Back-Propagation Through Time (BPTT) to update the weights. BPTT works backward from the final time step, calculating the gradients using the generalized back-propagation algorithm; the gradients are then iteratively propagated back through time, updating the weights of the RNN [11].

One drawback of RNNs is the vanishing and exploding gradient problem. In BPTT the gradients are calculated at the last time step and then propagated backward. If the gradient is large or small at the last time step, it will either become too large (explode) or too small (vanish) by the time it reaches the first time step. With a gradient either vanishing or exploding, the information it provides will be of no value.

Equations 2.2 and 2.3 show the expanded error derivative of the gradient over the time steps up to time step k, and the equation for the RNN state. To combine the two equations, the derivative \partial c_t / \partial c_{t-1} has to be calculated. In this example, \sigma is the tanh activation function. The derivative of tanh produces a value smaller than 1, which is what causes the vanishing gradient. A solution to this problem is a model that can control the flow of its gradients through time [11].

\frac{\partial E_k}{\partial W} = \frac{\partial E_k}{\partial h_k} \frac{\partial h_k}{\partial c_k} \left( \prod_{t=2}^{k} \frac{\partial c_t}{\partial c_{t-1}} \right) \frac{\partial c_1}{\partial W}    (2.2)

c_t = \sigma(W \cdot c_{t-1} + U \cdot x_t)    (2.3)

Long Short-Term Memory (LSTM)

LSTM is based on the gated RNN architecture [12]. Gated units within the model are used to control the flow of information inside a node. The LSTM has three gated units: the input gate, the forget gate, and the output gate. Each of the gated units has a sigmoid function [11] applied, yielding a real value between 0 and 1 (see the structure in Figure 2.5). With the use of these three gates, the model learns how to regulate the flow through the gates, as shown in Figure 2.5. This means that the model can regulate how much of the previous state it takes into account in the calculation of the current state. The cell takes x_t and h_{t-1} as input, which are passed through the input gate. The input gate is a sigmoid function that determines how much of the input will pass through it. The values from the input gate are passed to the forget gate to determine if any of the inputs should be forgotten and not taken into account in the calculations. Then the internal state of the model is updated by summing the values output by the input and forget gates. After the current state is calculated, the values are passed to the output gate. The output gate is activated by the input to the LSTM cell. The final output of the cell is the summation of the internal state and the output gate values. The output is then passed to the next cell, where the same process is applied.
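For reference, a common formulation of the LSTM gates (notation varies slightly between sources; i, f, and o denote the input, forget, and output gates, c the cell state, h the hidden state, and \odot element-wise multiplication):

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)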

2.5 Unbalanced dataset

Unbalanced datasets are common. One example where they occur is in the area of sentiment analysis, where you would like to classify the sentiment of reviews [28]. Depending on the topic of the reviews, the majority class can be either positive or negative. With an unbalanced dataset, the minority class sometimes becomes ignored by the classifier [40], which can still give a good result if accuracy is the metric used (see section 2.6).

Ways to combat an unbalanced dataset have been explored in [38, 28, 5]. They can be divided into two areas: an algorithmic approach and a dataset adjustment approach. The algorithmic approach can assign a cost to misclassified cases, which adjusts the learning of the model.


Figure 2.5: Structure of an LSTM node, illustrating the input, forget, and output gates. The figure is taken from [7].

                               Actual
                    Positive class    Negative class
Predicted
  Positive class         TP                FP
  Negative class         FN                TN

Table 2.2: Confusion matrix of a binary class problem.

Dataset adjustment is usually done by either oversampling the minority class or undersampling the majority class.

In a study by Mountassir et al. [28], different undersampling methods are evaluated on sentiment analysis datasets. They gradually reduce the majority class of the dataset until the two classes are evenly distributed. From their results, it is easy to see that an even distribution of the classes in a dataset produces a better result in the majority of cases.

2.6 Evaluation

Evaluation of a classification task has a few common metrics: accuracy, precision, recall, and F1-score. One common way to visualize the results of a model is with a confusion matrix, which shows how well a model performed on its task (see Table 2.2). TP means true positive, which is all cases that were positive and correctly classified; TN is all cases that were negative and correctly classified. FP and FN mean false positive and false negative, which together cover all the misclassified cases.

Accuracy

Accuracy is a measurement of how many cases were correctly classified divided by all the classified cases. The equation for accuracy can be seen in equation 2.4.


Accuracy has a range from 0 to 1, where 1 means that all cases were correctly classified and 0 means that none were correctly classified.

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}    (2.4)

Accuracy is a measurement that is simple and effective. One downside of accuracy is that it does not take the class distribution into account. If the dataset is imbalanced, and the model classifies everything as one class, then it may achieve good accuracy.

Precision

Precision is a metric that measures the percentage of the cases classified as positive that actually are positive.

\text{Precision} = \frac{TP}{TP + FP}    (2.5)

The precision metric will show how the FP rate of the classification looks. This is good if the model is sensitive to false positives and you would like to avoid them.

Recall

Recall is a metric that measures the percentage of all actual positive cases that were correctly classified.

\text{Recall} = \frac{TP}{TP + FN}    (2.6)

Recall shows how the FN rate of the classification looks. This is good if false negatives should be avoided.

F1-score

The F1-score is a combination of both precision and recall; it is the harmonic mean of precision and recall.

\text{F}_1\text{-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}    (2.7)

If the dataset has an uneven distribution and both recall and precision are wanted, then the F1-score is a good metric. If the distribution is even, then accuracy and F1-score will be quite close.
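The four metrics follow directly from the confusion matrix counts; a minimal sketch in Python (the counts below are placeholder values):

def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example: 90 TP, 850 TN, 10 FP, 50 FN.
print(classification_metrics(tp=90, tn=850, fp=10, fn=50))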

Cross-validation

When evaluating a model it is important to split the dataset into different parts, one for training and one for testing. The purpose of this is to have a set of data that the model has not seen before; that way it is possible to see how well the model handles new data and to find out if it has become overfitted.

Cross-validation is an evaluation method that is commonly used to solve this problem. In this thesis, k-fold cross-validation is used with k = 5. K-fold cross-validation trains the same model on k different subsets from the same distribution. With k = 5, the training uses 4/5 of the dataset and 1/5 of the dataset is used for testing. After each model is trained, its metrics are calculated, and then the average metrics are computed. The averaged metrics are used to evaluate the models.

Another use of cross-validation is to get a stable evaluation of a model with a lot of randomness. One example is neural networks, as they are initialized with random weights and their optimization algorithms use randomness as well. This produces different results between runs of the same model. With cross-validation, the results are averaged, which can be used to evaluate the model's performance.
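A small sketch of 5-fold cross-validation using scikit-learn's KFold (the use of scikit-learn and an estimator with fit/score methods are assumptions made for illustration):

import numpy as np
from sklearn.model_selection import KFold

def cross_validate(features, labels, build_model, k=5):
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True).split(features):
        model = build_model()  # fresh model (and fresh random weights) per fold
        model.fit(features[train_idx], labels[train_idx])
        scores.append(model.score(features[test_idx], labels[test_idx]))
    return np.mean(scores)    # average metric over the k folds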


2.7 Related work

In this section, work related to the thesis is presented. The areas most closely related to the thesis are text classification and anomaly detection.

Anomaly detection

Anomaly detection is an area that has been researched for many years, with methods varying from simple keyword matching to more modern approaches with machine learning models. One approach that has shown great results is the use of a log parser, where each log message in the log file is converted to a log key, and the resulting log file is represented as a sequence of log keys [9, 23, 6]. Models are then trained to find anomalies in the log key sequence by learning how normal log sequences look compared to sequences with anomalies. Many of the earlier models for anomaly detection come from the RNN family, such as [9, 36, 37], but models using a CNN [23, 6] have also shown great results; they still use a log parser that converts the logs to log keys, which are then fed to a CNN.

Text classification on short texts

Text classification on short texts has been researched extensively, and has shown great results [24, 41, 14, 10]. It can be implemented in many ways, but most of them use word embedding. The word embedding enables the model to extract features that hold a lot of information, which the model can use to get a better result.

Other than the word embedding, there are two model families that have been used a lot in recent years: CNN and LSTM based models [4, 20, 39, 27, 14]. In [14], Kim shows how a simple CNN model used with Word2Vec can produce a competitive result in the domain of text classification.

More recently, new models have appeared. These are hybrids of the commonly used models, such as CNN-LSTM [24] and LSTM-CNN [41]. The hybrid models can take advantage of the CNN's ability to extract local features, and with the LSTM they can extract longer-term features.

In a paper written by Sajeevan et al. [33], they compare an LSTM-CNN model and a CNN-LSTM model on a sentiment analysis task. The LSTM-CNN model trained faster and achieved a better result compared to the CNN-LSTM model.

Classification of large text files

As mentioned in the previous section, text classification is common in short texts. Short text classification has shown great results, but they have a limited input size. Wan et al. [35] use Doc2Vec, mentioned in section 2.2 and evaluate how well it classifies large legal texts. In the paper, they propose to divide the large text files into smaller chunks according to a theory from audio segmentation. They show that models with segmented input can outperform a model that has the whole text file as input.

In an article by Han Lau et al. [17], the two Doc2Vec models, PV-DM and PV-DBOW, are compared against each other and against other document representations. The PV-DBOW Doc2Vec model provides the best results and requires a shorter training period to converge. In the same article, the models are compared on a dataset with long documents and on datasets with short documents. On long documents, Doc2Vec outperforms almost all other models; on short documents, the performance gap decreases.


3 Method

In this chapter, the main resources used during the thesis are described, such as computing resources and libraries. The implementation of the classification models is shown and the evaluation of the experiments is described.

The chapter begins by describing how the dataset was constructed, as it is the foundation of the thesis. It then continues with a description of how the dataset was processed to remove noise and produce better data for the learning models. After that, each learning model and experiment is described. The workflow used in the thesis is shown in Figure 3.1; each node in the workflow diagram is described in the coming sections.

3.1 Framework and environment

All of the implementations are written in Python 3.6, using a combination of different libraries. The machine learning models were implemented in Keras 2.3.0 [8] using Tensorflow 2.1.0 [25] as a backend (see a more complete list of libraries in appendix section A.4). The training was done on an AWS EC2 cloud compute service [3]. The selected instance type, P3.2xlarge, provides the following:

• GPU: Tesla V100
• GPU memory: 16 GB
• CPU: 8 cores
• Memory: 61 GB


6491 05:01:10.663 rte(939) Note: SystemWrapper: GetUdpFilterDataFlowGap udpfilter has found max no data time: 50 msec
6492 05:01:10.673 kernel(0) Debug: udpfilter_kreatel_ts_check 224.13.18.3:11111 longest period receiving no data: 50 ms
KATT Test END (PASSED) (2020-01-09 05:01:11) platform/media/katt-tests/tc_mft_multicast_mpeg2_ts_h264_aac.py KATT_ANCHOR_10_1_ANCHOR_KATT
KATT Test START (2020-01-09 05:01:12) platform/media/katt-tests/tc_statistics.py
6493 05:01:13.458 rte(939) Note: Getting IPC proxy
6494 05:01:13.465 rte(939) Note: Starting KATT test __main__.TestStatistics.test_multicast_h264

Figure 3.2: A segment from an STB log file. It shows how the file looks before any preprocessing is applied.

3.2 Dataset

The dataset used in this thesis was created from log files provided by CommScope. The log files were generated by their test system for set-top boxes (STB). Each STB is tested on multiple test cases, which generates a log file covering all the tests. The log file was split into parts where each part contains the log of one test case. An example of a log can be seen in Figure 3.2. It shows how the file is divided into the logs of each test by the START and END markers.

The dataset has 1050960 observations, and each log contains from 4 up to 5698945 words. The dataset has two classes, normal and abnormal. All normal log files are tests that have passed successfully, whereas the abnormal ones have not. Out of all observations, 26137 belong to the abnormal class, which is approximately 2.5% of the dataset. A histogram of the word count for the log files is shown in Figure 3.3. The figure shows how many log files there are of the same size, with a cut-off size of 200000 words. See Figure A.2 and Figure A.3 in the appendix for a more detailed histogram.

3.3 Construction of dataset

As mentioned in the previous section, the dataset was compiled from raw text files that contain multiple logs generated from STB tests. Each log in the file is marked with a KATT Test START and a KATT Test END line, which was used to extract each individual test case from the log files. The dataset before any preprocessing takes 132 GB. It was stored as a comma-separated values (CSV) file with the following columns: Text, WordCount, Label, and TestName. In this thesis, the important columns are Text and Label; the others were not used in the evaluation and training of the models. The Text column holds the raw text log, and the Label column holds the class of that log. The class was extracted from the last line of each test case: the test system outputs whether a test PASSED or FAILED, which was translated to normal and abnormal respectively and used in the classification task.
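A minimal sketch of this extraction step is shown below. The START/END markers and the CSV columns follow the description above, but the exact marker format and the helper names are assumptions, and the TestName column is omitted for brevity:

import csv
import re

def extract_test_cases(raw_text):
    # Each test case runs from a "KATT Test START" line to a "KATT Test END" line;
    # the END line states PASSED or FAILED, which becomes the class label.
    pattern = re.compile(r"KATT Test START.*?KATT Test END \((PASSED|FAILED)\)[^\n]*", re.DOTALL)
    for match in pattern.finditer(raw_text):
        label = "normal" if match.group(1) == "PASSED" else "abnormal"
        yield match.group(0), label

def write_dataset(raw_text, path):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Text", "WordCount", "Label"])
        for text, label in extract_test_cases(raw_text):
            writer.writerow([text, len(text.split()), label])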


Figure 3.3: Histogram of the word count of each log file. The x-axis shows the word count, while the y-axis shows the number of files with the same word count.

def createBalancedDataset(dataset, n):
    normalCount = 0
    abnormalCount = 0
    balancedDataset = []
    for text in dataset:
        # Take the first n normal and the first n abnormal observations
        # (isNormal is a stand-in for checking the observation's label).
        if isNormal(text) and normalCount < n:
            balancedDataset.append(text)
            normalCount += 1
        elif not isNormal(text) and abnormalCount < n:
            balancedDataset.append(text)
            abnormalCount += 1
        if normalCount == n and abnormalCount == n:
            break
    return balancedDataset

Figure 3.4: Pseudocode for the creation of a balanced dataset.

3.4 Creation of balanced dataset

The dataset constructed in the previous section has a heavy class imbalance. To create a balanced dataset, undersampling was used. The study [28] found that the four undersampling methods remove similar, remove farthest, remove by clustering, and random removal all performed well on the given task, with no clear winner among the four. Since the evenness of the dataset seems to be more important than the undersampling method, a simple undersampling method was selected for the thesis. In this thesis, the balanced dataset was created by selecting the first n observations of each class, creating a dataset of size 2n (see the pseudocode in Figure 3.4).


Name    Size     #normal   #abnormal
DS1     2000      1000       1000
DS2     4000      2000       2000
DS3     8000      4000       4000
DS4     16000     8000       8000
DS5     32000     16000      16000
DS6     2000      1968       32
DS7     4000      3921       79
DS8     8000      7799       201
DS9     16000     15618      382
DS10    32000     31304      696

Table 3.1: Overview of datasets used in the experiments.

text = " ".join(x for x in text.split() if x not in stopWords) Figure 3.5: Stopword removal, using NLTK’s stopword library.

3.5 Dataset experiments

To evaluate how a balanced versus an unbalanced dataset affects model performance, a CNN, an LSTM, and an LSTM-CNN model (described in later sections) were trained and tested on both balanced and unbalanced datasets. The dataset size started at 2000 observations and was doubled for each subsequent dataset, up to a size of 32000 observations. An overview of the different datasets is shown in Table 3.1. During the dataset experiments, the number of segments was set to 1. The hyper-parameters that each model uses are described in sections 3.10, 3.11, and 3.12.

Based on the result of these experiments, the dataset used in the rest of the thesis will either be balanced or unbalanced.

3.6 Preprocessing

The preprocessing steps that were implemented are the ones mentioned in section 2.3, with the addition of non-alphabetic character removal. The libraries used for preprocessing are the Natural Language ToolKit (NLTK) [21] and TextBlob [22]. NLTK's stopword library was used to remove stopwords from the corpus, and NLTK was also used to tokenize each document. For each word in the dataset, the word is removed if it exists in the stopword library (see code example 3.5). In anomaly detection it is common to preprocess the logs to extract log keys [23, 6]. Log keys are the constant part of each log message.

In this thesis, log keys are extracted by removing the variable part of the message in the preprocessing steps.

Tokenization was done using NLTK since it will divide a text into linguistic tokens, whereas a string split function will only split by a delimiter.

Lemmatization was done in a similar fashion to stopword removal. Each word’s correct lemma was looked up and then replaced (see code example 3.6).

Non-alphanumeric removal was done using regular expressions in Python. With regular expressions, you can search and replace, remove characters, and so on. To remove all non-alphanumeric characters, a regular expression was created to find them and remove them (see expression 3.7).


text = " ".join(Word(x).lemmatize() for x in text.split()) Figure 3.6: Lemmatization code in Python using TextBlob library.

text = re.sub(r’[^\w]’, ’ ’, text)

Figure 3.7: Expression for non-alphanumeric removal.

Finally, numbers were removed from each document. All numbers were removed to decrease the vocabulary size and to remove the variable parts of the logs. Each line in the log has a line number and a timestamp, which alone would increase the vocabulary by at least the number of lines in the log file. The numbers were removed using Python's built-in method isdigit(): each word was checked, and if it consisted of digits it was removed from the document.

3.7 Document segmentation

After all the documents were preprocessed, the tokenized documents were segmented. The segmentation was done with the help of NumPy [29], a Python package for scientific computing that provides tools for vectors, matrices, and array operations. NumPy has an array operation called array_split, which takes an array and a number of splits n as input and outputs n sub-arrays of equal size where possible. Since the documents were tokenized during preprocessing and are already in array format, this operation worked well for segmenting the documents.
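A minimal sketch of this step (the token list below is a placeholder):

import numpy as np

tokens = ["rte", "note", "getting", "ipc", "proxy", "starting", "katt", "test"]
# Split the tokenized document into 3 segments of (near) equal length.
segments = np.array_split(tokens, 3)
for segment in segments:
    print(list(segment))
# ['rte', 'note', 'getting']
# ['ipc', 'proxy', 'starting']
# ['katt', 'test']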

3.8 Segmentation experiments

In order to determine how segmentation of the documents affects model performance, different segment counts were tested. The number of segments started at 1 and was incremented by one up to 15 segments.

The segmentation experiments were run on the DS3 dataset. The three models (CNN, LSTM, and LSTM-CNN) were trained using the different numbers of segments and then evaluated using 5-fold cross-validation. The number of segments that shows the best result is used in the rest of the thesis.

3.9 Feature extraction

Feature extraction from large documents and paragraphs can be done with Doc2Vec, as mentioned in section 2.2. In this thesis, a library called Gensim [31] was used to create and train the document embedding. The Gensim library implements both of the Doc2Vec models from [18]. The model used in this thesis is the PV-DBOW model, since it has proven to give better results, as mentioned in section 2.7.

The hyper-parameters for the PV-DBOW model come from [17], where different pairs of hyper-parameters were compared to find values that proved optimal for their datasets (see Table 3.2).

Gensim makes the creation of the Doc2Vec model easy: one call creates the model, one builds the vocabulary, and one trains the model. With the model created, a new document vector is obtained by calling the infer_vector function.
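A minimal sketch of these calls, assuming `corpus` is a list of token lists and `newTokens` is a tokenized document (parameter names follow recent Gensim versions; dm=0 selects PV-DBOW, and the values follow Table 3.2):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(corpus)]
model = Doc2Vec(dm=0, vector_size=300, window=15, min_count=5,
                sample=1e-5, negative=5, epochs=20)
model.build_vocab(documents)              # one call to build the vocabulary
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
vector = model.infer_vector(newTokens)    # 300-dimensional document vector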


Vector Size   Window Size   Min Count   Sub-Sampling   Negative Sample   Epochs
300           15            5           10^-5          5                 20

Table 3.2: Hyper-parameters for the PV-DBOW model.

Kernel Sizes   Filter Size   Dropout   L2-norm
3, 4, 5        100           0.5       0.001

Table 3.3: Hyper-parameters for the CNN model.

With the Doc2Vec model created and trained, the features for all documents are generated by inferring each document on the Doc2Vec model. This results in one document vector per document, ready for the classifier to use.

3.10 CNN implementation

The CNN architecture implemented in this thesis was inspired by Kim [14], who used Word2Vec embeddings with a CNN model. The model is simple but produced good results on sentiment analysis. Since the model in the paper is designed for a word sequence, some modifications had to be made. The model input in this thesis is the features generated from the Doc2Vec model. The architecture in this thesis replaced the first layer with an input layer with the input dimension (numSegments, embeddingSize). numSegments was set according to the results of the segmentation experiment, and embeddingSize is set to 300, as mentioned in section 3.9.

The CNN model architecture is shown in Figure 3.8 (a larger figure is shown in the appendix, see Figure A.4). From the top, there is an input layer with the dimensions (numSegments, embeddingSize), which is passed to three convolutional layers with kernel sizes of 3, 4, and 5 respectively. The padding for the convolutional layers is set to same padding. With other padding options, the model would not work on small segment counts, since the kernel would not fit on top of the input segment. With same padding enabled, n zeros are added at the beginning and at the end of the input segment, where n is chosen so that the kernel fits on top of the whole input sequence.

The output of each convolutional layer has the shape (numSegments, filterSize) and is passed to an activation layer. The activation layer uses the Rectified Linear Unit (ReLU).

The activation is passed to a global-max-pooling layer. The global-max-pooling layer acts as a max-pooling layer, but the pool size is set to the input size by default, which is visible in the architecture.

The max-pooling layers are concatenated and then connected to a fully connected layer with ReLU activation, to which L2 regularization is applied. The activation values from the fully connected layer are then passed to a dropout layer with a dropout value of 0.5. The result of the dropout layer is passed to the last layer, which is fully connected and has a sigmoid activation function. The output of the sigmoid is the predicted class.

The hyper-parameters of this model are the convolutional kernel sizes, the filter size, the dropout value, and the L2-norm value. The values used for the hyper-parameters come from the tuning done by Kim. The hyper-parameters the CNN model used are shown in Table 3.3 and are used for the dataset and segmentation experiments mentioned before.
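A hedged sketch of this architecture in Keras is shown below. The kernel sizes, filter size, dropout, and L2 value follow Table 3.3; the width of the fully connected layer, the Adam optimizer, and the binary cross-entropy loss are assumptions not stated above.

from tensorflow.keras import Input, Model, regularizers
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, Concatenate, Dense, Dropout

def build_cnn(num_segments, embedding_size=300, filter_size=100,
              kernel_sizes=(3, 4, 5), dropout=0.5, l2_value=0.001):
    inputs = Input(shape=(num_segments, embedding_size))
    pooled = []
    for k in kernel_sizes:
        # "same" padding so the kernel also fits small segment counts.
        conv = Conv1D(filter_size, k, padding="same", activation="relu")(inputs)
        pooled.append(GlobalMaxPooling1D()(conv))
    merged = Concatenate()(pooled)
    dense = Dense(100, activation="relu",
                  kernel_regularizer=regularizers.l2(l2_value))(merged)
    outputs = Dense(1, activation="sigmoid")(Dropout(dropout)(dense))
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model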


Figure 3.8: The architecture of the CNN model used in the thesis (values shown in the figure come from the final tuning of the model).

3.11 LSTM implementation

The LSTM model used in this thesis was a bidirectional LSTM model. A bidirectional network combines the results of moving forwards and backward in time, as opposed to a regular network that only moves forward in time.

The implemented architecture used in this thesis is shown in Figure 3.9. The first layer is the input layer, as in the CNN model (but with a different number of segments). The input is passed to the second layer, which is a bidirectional LSTM layer with 64 hidden units.

The result from the bidirectional LSTM layer is passed to a dropout layer, which in turn is connected to a fully connected layer with a sigmoid activation, which is the final layer of the model.

The hyper-parameter tuning for the LSTM model was done by a manual search of the parameter space. Each value tested in the search space was evaluated using 5-fold cross-validation, and each hyper-parameter was tested with all other parameters set to a constant value.

Two hyper-parameters of the LSTM model were tuned: the number of hidden nodes and the dropout value. The tuned LSTM model was used in the dataset experiment and in the segmentation experiment. In the manual search, 10 values were tested for each hyper-parameter. The tested values for the number of hidden nodes and the dropout value are [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] and [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] respectively.

Figure 3.9: The architecture of the BI-LSTM model used in the thesis (values shown in the figure come from the final tuning of the model).
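A hedged sketch of this model in Keras; the number of hidden units and the dropout value follow the tuning above, while the optimizer and loss are assumptions.

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout

def build_bilstm(num_segments, embedding_size=300, num_hidden=64, dropout=0.3):
    inputs = Input(shape=(num_segments, embedding_size))
    lstm = Bidirectional(LSTM(num_hidden))(inputs)  # forward and backward pass over the segments
    outputs = Dense(1, activation="sigmoid")(Dropout(dropout)(lstm))
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model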

3.12 LSTM-CNN implementation

The LSTM-CNN model implemented in this thesis was based on the model from [41]. The model from the paper has been modified for document classification in the same way as the previously implemented models (see the architecture in Figure 3.10; a larger version of the figure is shown in the appendix, see Figure A.5).

The input layer is like that of the other models implemented in this thesis (but with a different number of segments). The rest of the architecture is a combination of the LSTM model and the CNN model described in the earlier sections, which is visible in the LSTM-CNN architecture figure (Figure 3.10). The top two layers are the same as in the LSTM model, while the rest of the model is almost the same as the CNN model; the only difference is the input to the convolutional layers.

The hyper-parameters for the LSTM-CNN model are the combination of those of the CNN and LSTM models, and their values are the same as well (see Table 3.4).
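A hedged sketch of the combined model in Keras. The hyper-parameter values follow Table 3.4; the use of return_sequences=True (so the convolutional branch receives one LSTM output per segment), the width of the fully connected layer, and the optimizer/loss are assumptions.

from tensorflow.keras import Input, Model, regularizers
from tensorflow.keras.layers import (LSTM, Bidirectional, Conv1D, GlobalMaxPooling1D,
                                     Concatenate, Dense, Dropout)

def build_bilstm_cnn(num_segments, embedding_size=300, num_hidden=64, lstm_dropout=0.3,
                     filter_size=100, kernel_sizes=(3, 4, 5), cnn_dropout=0.5, l2_value=0.001):
    inputs = Input(shape=(num_segments, embedding_size))
    # return_sequences=True keeps one output vector per segment,
    # so the convolutional kernels can slide over the LSTM outputs.
    lstm = Bidirectional(LSTM(num_hidden, return_sequences=True))(inputs)
    lstm = Dropout(lstm_dropout)(lstm)
    pooled = [GlobalMaxPooling1D()(Conv1D(filter_size, k, padding="same", activation="relu")(lstm))
              for k in kernel_sizes]
    dense = Dense(100, activation="relu",
                  kernel_regularizer=regularizers.l2(l2_value))(Concatenate()(pooled))
    outputs = Dense(1, activation="sigmoid")(Dropout(cnn_dropout)(dense))
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model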


Figure 3.10: Architecture of the BI-LSTM-CNN model used in the thesis (values shown in the figure come from the final tuning of the model).

Kernel Sizes   Filter Size   CNN-Dropout   LSTM-Dropout   L2-norm   NumHidden
3, 4, 5        100           0.5           0.3            0.001     64

Table 3.4: Hyper-parameters for the LSTM-CNN model.

3.13 Architecture experiments

In this section, the experiments comparing all models described in the previous sections against each other are described. The CNN, LSTM, and LSTM-CNN models were tested on the balanced DS5 dataset. The number of segments was set to the value found in the segmentation experiment, which was unique for each model. Each model's hyper-parameters were tuned on the dataset. The training and tuning of the models used 80% of the dataset, and the remaining 20% was used to evaluate the tuned models.

The hyper-parameter tuning of each model was done using a random search of the hyper-parameter space. The random search was done using the Optuna framework [1], which provides state-of-the-art optimization algorithms to make the tuning faster. Each model had its search space defined and was then tuned for 100 iterations. The search space for each model is shown in Table 3.5.
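A hedged sketch of such a search for the CNN model with Optuna (the search ranges follow Table 3.5; the objective body is a placeholder, and the log-scale sampling of l2 is an assumption):

import optuna

def train_and_evaluate(params):
    # Placeholder: in the thesis this would train the Keras model with `params`
    # and return the F1-score on held-out data.
    return 0.0

def objective(trial):
    params = {
        "epochs": trial.suggest_int("epochs", 5, 30),
        "filter_size": trial.suggest_int("filter_size", 32, 512),
        "kernel1": trial.suggest_int("kernel1", 1, 10),
        "kernel2": trial.suggest_int("kernel2", 1, 10),
        "kernel3": trial.suggest_int("kernel3", 1, 10),
        "cnn_dropout": trial.suggest_float("cnn_dropout", 0.1, 0.9),
        "l2": trial.suggest_float("l2", 1e-5, 1e-1, log=True),
    }
    return train_and_evaluate(params)

study = optuna.create_study(direction="maximize")   # maximize F1-score
study.optimize(objective, n_trials=100)
print(study.best_params)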


               CNN            LSTM          LSTM-CNN
Epochs         [5, 30]        [5, 30]       [5, 30]
FilterSize     [32, 512]      -             [32, 512]
kernel1        [1, 10]        -             [1, 10]
kernel2        [1, 10]        -             [1, 10]
kernel3        [1, 10]        -             [1, 10]
CNNDropout     [0.1, 0.9]     -             [0.1, 0.9]
l2             [1e-5, 1e-1]   -             [1e-5, 1e-1]
NumHidden      -              [16, 512]     [16, 512]
LSTMDropout    -              [0.1, 0.9]    [0.1, 0.9]

Table 3.5: The hyper-parameter space for each model.


4 Results

This chapter presents the model performance in all experiments described in the method chapter. It starts by presenting the results of the base model implementation, then discusses dataset balance and size, followed by the results of the document segmentation. Finally, the results of the model architecture comparison are provided.

4.1 Base model implementation

For the first two experiments, each model was implemented with base values for the hyper-parameters. The hyper-parameters for the CNN model were the same as in Kim's paper [14], and the LSTM-CNN model's hyper-parameters were a combination of those of the CNN and LSTM models. For the LSTM base model, the hyper-parameters were tuned before the experiments on the balanced DS2 dataset, training the model for 20 epochs in each fold of the cross-validation. The results of the tuning are shown in Table 4.1 and Table 4.2.

Table 4.1 shows that the LSTM model gets the best overall score when the number of hidden nodes is set to 64, which is the value used in the first two experiments.

NumHidden   Accuracy   Precision   Recall    F1-score
1           0.85850    0.87065     0.85844   0.85709
2           0.89225    0.89757     0.89224   0.89184
4           0.90800    0.91127     0.90799   0.90781
8           0.91875    0.92196     0.91872   0.91857
16          0.92825    0.93025     0.92827   0.92815
32          0.93250    0.93458     0.93248   0.93241
64          0.93825    0.93968     0.93826   0.93820
128         0.93775    0.93955     0.93777   0.93768
256         0.93150    0.93500     0.93140   0.93134
512         0.93325    0.93509     0.93327   0.93318

Table 4.1: Number of hidden nodes hyper-parameter tuning results (Highest scores are marked in bold).


DropValue   Accuracy   Precision   Recall    F1-score
0.0         0.93950    0.94122     0.93957   0.93944
0.1         0.93475    0.93626     0.93475   0.93469
0.2         0.93900    0.94028     0.93902   0.93895
0.3         0.93950    0.94167     0.93949   0.93942
0.4         0.93550    0.93791     0.93551   0.93540
0.5         0.93725    0.93917     0.93725   0.93718
0.6         0.93475    0.93702     0.93472   0.93465
0.7         0.93225    0.93371     0.93221   0.93219
0.8         0.92575    0.92828     0.92573   0.92563
0.9         0.91625    0.91879     0.91622   0.91611

Table 4.2: Dropout value hyper-parameter tuning results (Highest scores are marked in bold).

Table 4.2 shows a slightly more split result. When the dropout value was set to 0.0, the recall and F1-score were the best, but with a dropout value of 0.3, the accuracy and precision were at their best. Looking at the results from tuning on the DS5 dataset (see Section 4.4), we can see that a dropout value greater than 0.0 yields a better result. Since dropout is commonly used as a method to prevent overfitting, and since a model with some dropout performs better on the DS5 dataset, a value of 0.3 was used in this thesis.

4.2 Unbalanced dataset

In the dataset experiment, each of the three models was tested on 10 different datasets, 5 balanced and 5 unbalanced. Figure 4.1 shows how balanced datasets of different sizes affect the three models. Since the dataset is balanced, accuracy and F1-score are almost the same. In the graph, both the accuracy and F1-score improve considerably when the dataset size increases. The largest jump in accuracy and F1-score comes from the early dataset size increments. For example, with a dataset of 2000 observations, the accuracy is around 0.92, and with a dataset of 8000 observations, the accuracy is around 0.955. After the dataset reaches a size of 8000 observations, the rate of the accuracy increase is lower. The accuracy reaches 0.963, 0.966, and 0.968 on the CNN, LSTM, and LSTM-CNN models respectively.

Figure 4.2 shows how unbalanced datasets of different sizes affect the models. In all models, the accuracy is very high, around 0.99, while the F1-score is relatively low, around 0.85. In the unbalanced case, the F1-score increases at approximately the same rate as in the balanced experiment, but it starts at a much lower value, around 0.79. The F1-score reaches a value of 0.88 for each model, while the accuracy is barely affected, ending at 0.99 for each model.

During the dataset experiment, more metrics were used to capture the performance of the models. A more detailed view of the experiment is shown in the appendix: in Table A.1 and Table A.2, the accuracy, precision, recall and F1-score are shown for all models.

4.3 Document segmentation

With the results from the dataset experiment, it was clear that a balanced dataset gave the best performance for the three models. The segmentation experiments were done using the DS3 dataset. In Figure 4.3, three graphs, one per model, show how the number of segments affects performance. The CNN model's performance seems to increase with the number of segments, peaking at 13 segments. The LSTM and LSTM-CNN models both peak when the number of segments is still low, at 4 and 3 segments respectively, after which performance decreases. Looking at Figure 4.3a, a trend is visible when following the local maximum points of the graph.



Figure 4.1: CNN, LSTM and LSTM-CNN performance on balanced datasets of different sizes. (a) CNN model accuracy and F1-score. (b) LSTM model accuracy and F1-score. (c) LSTM-CNN model accuracy and F1-score.

Figure 4.2: CNN, LSTM and LSTM-CNN performance on unbalanced datasets of different sizes. (a) CNN model accuracy and F1-score. (b) LSTM model accuracy and F1-score. (c) LSTM-CNN model accuracy and F1-score.


Figure 4.3: CNN, LSTM and LSTM-CNN performance with different numbers of segments. (a) CNN model accuracy and F1-score. (b) LSTM model accuracy and F1-score. (c) LSTM-CNN model accuracy and F1-score.

The first local maximum is at 3 segments, the second at 8 segments, the third at 11, and finally the fourth at 13 segments. Three of the four local maxima appear after the number of segments has increased by 5. This may indicate that the global maximum is at 18 segments or more.

Another trend is visible in the LSTM-CNN model: each local maximum repeats after the number of segments has increased by 3. In contrast to the CNN model, each new local maximum is lower than the previous one.

During the segmentation experiment, more metrics were used to capture the performance of the models. A more detailed view of the experiment is shown in the appendix: in Table A.3, the accuracy, precision, recall and F1-score are shown for all models.

4.4 Architecture comparison

The architecture comparison experiment was done on the DS5 dataset, in two steps. The first step was to tune the hyper-parameters for each model; the training and tuning were done using 80% of the DS5 dataset. The second step was to evaluate the models, which was done on the remaining 20% of the DS5 dataset.

The result of the hyper-parameter tuning is shown in Table 4.3. The left column of the table lists all the hyper-parameters being tuned, and the remaining columns give the hyper-parameters for each model. A couple of cells in the table contain dashes (-), which means that the model did not tune that hyper-parameter.

The results of step two, the evaluation of each model, are shown in Figure 4.4 (more metrics of the comparison are shown in the appendix, Table A.4). Each model was evaluated on three versions of the same architecture: the base version, the segmented version, and finally the fine-tuned version. In the figure, each model version is grouped by model type. The metric in this figure is the F1-score; only one metric is used here since both accuracy and F1-score show approximately the same result when used on a balanced dataset.


               CNN        LSTM       LSTM-CNN
Epochs         27         25         28
FilterSize     508        -          435
kernel1        2          -          6
kernel2        6          -          8
kernel3        5          -          4
CNNDropout     0.37465    -          0.73226
l2             0.00051    -          0.00025
NumHidden      -          438        275
LSTMDropout    -          0.59067    0.67912

Table 4.3: The resulting hyper-parameters for each model.

Figure 4.4: The results of the CNN, LSTM, and LSTM-CNN models in their base, segmented, and fine-tuned versions.

The graph shows that each model's best F1-score, in the order CNN, LSTM, and LSTM-CNN, is 0.98078, 0.98171, and 0.97874 respectively.

To give some perspective on the quality of the generated features, and to see how the different models perform, Figure 4.5 shows how the features and activations are clustered.

In order to plot the features, their dimensionality must be reduced. The Doc2Vec features have a dimension of 300, which we want to reduce to 2 in order to plot them. The reduction was done with Principal Component Analysis (PCA), which reduces the dimensionality of features while trying to keep as much relevant information as possible.

In Figure 4.5a, the evaluation part of the DS5 dataset (20% of DS5) was put through the Doc2Vec model, and the 2-dimensional features were then created using PCA. In the figure, some clusters containing only one color are visible, showing that the generated Doc2Vec features carry some similarity information about the logs. To illustrate the effect of the neural networks used in this thesis, the activations of the second-to-last layer were extracted from the CNN, LSTM, and LSTM-CNN models. As with the Doc2Vec features, PCA was used to reduce the dimensionality to 2. The reduced activations are plotted, showing how well each network managed to separate the two classes.
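A minimal sketch of this reduction-and-plotting step is shown below, assuming scikit-learn's PCA and matplotlib; the feature and label arrays are random placeholders standing in for the Doc2Vec features (or layer activations) and class labels of the evaluation split.

```python
# Sketch: reduce 300-dimensional features (Doc2Vec vectors or second-to-last
# layer activations) to 2 dimensions with PCA and plot them by class.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

features = np.random.rand(200, 300)            # placeholder for real features
labels = np.random.randint(0, 2, size=200)     # placeholder class labels

reduced = PCA(n_components=2).fit_transform(features)

plt.scatter(reduced[labels == 0, 0], reduced[labels == 0, 1], alpha=0.6, label="normal")
plt.scatter(reduced[labels == 1, 0], reduced[labels == 1, 1], alpha=0.6, label="abnormal")
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.legend()
plt.show()
```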



Figure 4.5: Extracted activations from the second-to-last layer in each model, compared to the Doc2Vec features. (a) Plot of PCA features created from the Doc2Vec features. (b) Plot of PCA features created from the CNN activations. (c) Plot of PCA features created from the LSTM activations. (d) Plot of PCA features created from the LSTM-CNN activations.


5 Discussion

In this chapter, I will start by analyzing the results of the experiments, going through what went as expected and what did not. Then I will discuss the method used during the thesis project, and finally, I will put the thesis into a wider context.

5.1 Results

In this section, the results from the experiments are analyzed and discussed.

Unbalanced dataset

The results from the dataset experiments are consistent with the theory of unbalanced datasets. In Figure 4.2 the effect of an unbalanced dataset is clear: accuracy achieves an almost perfect score, while the F1-score is much lower. The cause of this problem is found in the training step. When the model trains on the unbalanced dataset, it receives an insufficient amount of the minority class to be properly trained, which results in unreliable predictions.

Looking at Figure 4.1, we can see a substantial improvement in the metrics. The accuracy has dropped compared to the unbalanced case, but the F1-score has increased a lot. From the two figures one can see the importance of selecting the correct metrics for a given problem. If accuracy had been the only metric used during this thesis, then unbalanced datasets would have been used. The results might have looked great during training and testing, but it is very likely that most of the examples from the minority class would have been misclassified if the system were put to use.
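To make this concrete, consider a hypothetical classifier that always predicts the majority class on a 99:1 test set; the numbers are illustrative only and not taken from the experiments.

```python
# Illustration: always predicting "normal" on a 99:1 imbalanced test set gives
# 99% accuracy but an F1-score of 0 for the minority (abnormal) class.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 990 + [1] * 10)   # 990 normal runs, 10 abnormal runs
y_pred = np.zeros_like(y_true)            # classifier always predicts "normal"

print(accuracy_score(y_true, y_pred))             # 0.99
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0
```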

Looking at the F1-score graphs in Figure 4.2, a trend is showing: the F1-score increases with the dataset size, which may mean that a large enough unbalanced dataset could still yield a reasonable F1-score.

Document segmentation

The document segmentation experiment had some interesting results. The two LSTM models behaved as expected (see Figure 4.3b and Figure 4.3c). Both had an increase in accuracy with a small number of segments (3 and 4), and then the accuracy started to drop as the number of segments increased. The results of the LSTM models are consistent with what Wan et al. [35] have found.



The CNN model, however, got a higher accuracy as the number of segments increased. The cause is likely that the CNN model gets to perform more convolutions on the input sequence. When the input sequence is 3 or fewer segments, there will only be one convolution per convolutional layer. With each additional convolution, the CNN model extracts additional hidden features that may improve the model. One observation that supports this claim is the trend found in Figure 4.3a, mentioned in Section 4.3: three out of four local maxima appear with each fifth increment of the number of segments. The CNN model has three convolutional layers with kernel sizes of 3, 4, and 5 respectively, which means that after each increment of 5 segments, each convolutional layer will have extracted at least one more hidden feature.
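As a back-of-the-envelope illustration of this argument, assuming "valid" (no-padding) convolutions, the number of positions a kernel of size k can take over n segments is n - k + 1; the actual implementation may pad short sequences, so the exact counts can differ.

```python
# Number of "valid" convolution windows per kernel size as the segment count
# grows: windows = n_segments - kernel_size + 1 (never below 0).
for n_segments in (3, 8, 13, 18):
    windows = {k: max(0, n_segments - k + 1) for k in (3, 4, 5)}
    print(n_segments, windows)
# 3  -> {3: 1, 4: 0, 5: 0}
# 8  -> {3: 6, 4: 5, 5: 4}
# 13 -> {3: 11, 4: 10, 5: 9}
# 18 -> {3: 16, 4: 15, 5: 14}
```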

Architecture comparison

The results from the model comparison and the overall experiments have been better than expected. In Section 3.2, we found that the log files have a high variance in file size. With too high a variance within the same class, it can become hard for a classifier to make good predictions. But the results shown in Figure 4.4 clearly show that the three selected models perform well in all three of their versions. I believe that the reason we got a good result across all models is the Doc2Vec features. Looking at Figure 4.5a, we can see clusters forming with only the Doc2Vec features, which makes it easier for the models to separate the two classes.

One interesting result from the model comparison is that all three models lie around the 0.98 mark after their fine-tuning. Based on the theory, I expected the LSTM-CNN model to perform better than the other two models. Looking at the results of related work on text classification [41], the LSTM-CNN outperforms an LSTM and a CNN model. Since all three models' F1-scores lie around the 0.98 mark, I have more reason to believe that the Doc2Vec features provided a great foundation for the models.

Which machine learning methods can be used to analyze log files in order to detect abnormal system run-through?

In this thesis, two different types of learning algorithms have been used in combination: the base of all models comes from Doc2Vec, which is an unsupervised learning algorithm, and the model that runs on top is a supervised learning algorithm. The combination of using an unsupervised learning algorithm to structure the features passed into the supervised algorithm worked surprisingly well on this problem. Looking at the results in Figure 4.4, we can see that all three models could be used to detect abnormal system run-throughs with a very high prediction rate. There would still be cases where the models misclassify a log file, which makes this solution imperfect for some environments. If the system must be certain whether a log file is abnormal or normal, then it would not be the optimal solution, but if a high success rate with a low mean-time-between-failure is sufficient, then the three models provided would work well.
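A rough sketch of what this combined pipeline looks like at prediction time is shown below, assuming gensim's Doc2Vec for the unsupervised step and a saved Keras model for the supervised step; the file paths, the naive equal-length segmentation, and the tokenisation are placeholders, not the preprocessing actually used in the thesis.

```python
# Sketch of the combined pipeline at prediction time: unsupervised Doc2Vec
# features per segment, fed into a trained supervised classifier.
import numpy as np
from gensim.models.doc2vec import Doc2Vec
from tensorflow.keras.models import load_model

doc2vec = Doc2Vec.load("doc2vec.model")            # placeholder path
classifier = load_model("lstm_classifier.h5")      # placeholder path

def classify_log(log_text, num_segments=4):
    # Naive segmentation: split the token stream into equally sized segments.
    tokens = log_text.split()
    seg_len = max(1, len(tokens) // num_segments)
    segments = [tokens[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]

    # One 300-dimensional Doc2Vec vector per segment (unsupervised features).
    features = np.array([doc2vec.infer_vector(seg) for seg in segments])

    # The supervised model expects a batch of (num_segments, 300) inputs.
    prob_abnormal = classifier.predict(features[np.newaxis, ...])[0, 0]
    return "abnormal" if prob_abnormal > 0.5 else "normal"
```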

How do different LSTM and CNN based approaches compare in F1-score on the task described above?

From the results, we can see that there is not much difference in F1-score between the three models. The LSTM model peaks at 0.98171, the CNN comes in second at 0.98078, and lastly, the LSTM-CNN has an F1-score of 0.97874. The difference between the LSTM and the LSTM-CNN is 0.00297, which is really small, and the difference between the LSTM and the CNN is even smaller. I would have expected a greater difference between the models, but as I have mentioned before, I believe that the difference is so small because of the great foundation that the Doc2Vec features provided.

References
