Anomaly Detection in User Authentication Logs using Long Short-Term Memories and Word Embeddings

MIKAEL FORSMARK

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Anomaly Detection in User Authentication Logs using Long Short-Term Memories and Word Embeddings

MIKAEL FORSMARK

Master in Computer Science
Date: June 11, 2020

Supervisor: Cyrille Artho
Examiner: Erik Fransén

School of Electrical Engineering and Computer Science
Host company: TeamEngine Collaboration Software AB

Swedish title: Anomalidetektion i användarautentiseringsloggar med hjälp av Long Short-Term Memories och Vektorbaserade Ordinbäddningar


Abstract

As an increasing amount of sensitive data is being stored at various connected services and platforms, the need for robust user authentication mechanisms that maintain the users’ safety and personal integrity online is growing. Meanwhile, too strict and simplistic user authentication policies may result in a degraded user experience, which increases the demand for a sharp user authentication security tool that flags login events as abnormal with high accuracy.

This study investigates whether Long Short-Term Memories (LSTM) and Word Embeddings can be combined in order to detect abnormal user authentication behavior. Two anomaly detection models, where the first one focuses on detecting abnormal login events while the second one detects abnormal sequences of login events, are proposed and applied in a user authentication log context consisting of 280,063 login attempts. As there are no known anomalies in the user authentication log, the models are trained on the normal login attempt flow while two types of manipulated abnormal log events are inserted into the test data in order to verify the models’ anomaly detection performance.

By reconstructing the data containing no anomalies with the trained models and studying the resulting reconstruction errors, the reconstruction errors of the test data are used to find abnormal user authentication events. Finally, the results are compared to baseline models.

The results of the two proposed models were varying. When detecting abnormal login sequences, the proposed LSTM and Word Embeddings combination showed promising results, with notably high recall values during the detection of one of the anomaly types as a highlight. When instead attempting to detect abnormal log events, the proposed LSTM and Word Embeddings combination showed poor results, where the selection of the reconstruction error-based anomaly threshold was deemed to play a significant part. The log event attribute combination [User, Country, Login Status] turned out to be the combination that resulted in the best anomaly detection accuracies for both models.


Sammanfattning

I takt med att mängden känslig data som lagras på uppkopplade tjänster och plattformar ökar så ökar även behovet av robusta användarautentiseringsmekanismer som bibehåller användarnas säkerhet och personliga integritet på internet. Samtidigt kan för strikta och simplistiska säkerhetspolicies för användarautentisering resultera i en försämrad användarupplevelse, vilket höjer efterfrågan efter skarpa säkerhetsverktyg för användarautentisering som flaggar inloggningshändelser som onormala med hög exakthet.

Den här studien undersöker om Long Short-Term Memories (LSTM) och Vektorbaserade Ordinbäddningar (eng: Word Embeddings) kan kombineras för att upptäcka onormalt användarautentiseringsbeteende. Två anomalidetektionsmodeller, där den ena fokuserar på att upptäcka onormala inloggningshändelser medan den andra upptäcker onormala sekvenser av inloggningshändelser, är framtagna och applicerade i en användarautentiseringsloggkontext som består av 280,063 inloggningsförsök. Eftersom det inte finns några kända onormala inloggningsförsök i loggen så tränas modellerna på det normala inloggningsflödet medan två typer av manipulerade onormala inloggningshändelser är placerade i testdatan för att kunna verifiera modellernas anomalidetektionsprestanda. Genom att rekonstruera datan som inte innehåller några anomalier med de tränade modellerna och studera de uppkomna rekonstruktionsfelen så kan rekonstruktionsfelen av testdatan användas för att hitta onormala inloggningsförsök. Modellernas resultat är slutligen jämförda med naiva anomalidetektionsmodeller.

De två framtagna modellernas resultat är varierande. Vid anomalidetektion av onormala inloggningssekvenser så visar den LSTM- och Vektorbaserade Ordinbäddningskombinerade modellen lovande resultat, där de uppseendeväckande höga recallvärdena vid anomalidetektion av den ena typen av manipulerade onormala inloggningshändelser är en höjdpunkt. Vid anomalidetektion av onormala inloggningshändelser så ger den LSTM- och Vektorbaserade Ordinbäddningskombinerade modellen dock dåliga resultat där valet av det rekonstruktionsfelsbaserade anomalitröskelvärdet bedöms ha varit en betydande faktor. Analys av logghändelseattributen [Användare, Land, Inloggningsstatus] visade sig vara den kombination som resulterade i bäst anomalidetektionsprestanda för båda modellerna.


Acknowledgements

First of all, I would like to thank TeamEngine Collaboration Software AB for hosting me, for providing me with the data needed for the thesis and for the opportunity to conduct my research with them. Special thanks to my supervisor at TeamEngine, Dara Reyahi, for all the support and rewarding discussions. Lastly, I would like to extend my sincere gratitude to my supervisor at KTH, Cyrille Artho, for his invaluable feedback and guidance throughout the degree project, and my examiner Erik Fransén.


Contents

1 Introduction
1.1 Research Questions
1.2 The Principal
1.3 Limitations
1.4 Ethics and Sustainability
1.5 Outline

2 Background
2.1 Theory
2.1.1 Neural Networks
2.1.2 Word Embeddings
2.1.3 Long Short-Term Memory (LSTM)
2.2 Introduction to Research Areas
2.2.1 Anomaly Detection
2.2.2 Sequence Modeling
2.2.3 Log Files
2.3 Related Work
2.3.1 Mining-based Anomaly Detection
2.3.2 Log Analysis-based Anomaly Detection
2.3.3 Machine Learning-based Anomaly Detection
2.3.4 Neural Network-based Anomaly Detection
2.4 Summary

3 Methodology
3.1 Data
3.1.1 Scope and Structure of Data
3.1.2 Data Preprocessing
3.2 Model Implementations
3.2.1 Event-focused LSTM
3.2.2 Sequence-focused LSTM
3.3 Model Fitting
3.4 Hyperparameter Optimization
3.5 Anomaly Detection
3.5.1 Event-focused LSTM
3.5.2 Sequence-focused LSTM
3.5.3 Naive Baseline Models
3.6 Summary

4 Results
4.1 Test Setup & Result Significance
4.2 Event-focused LSTM
4.2.1 Randomized Events
4.2.2 Swapped Events
4.3 Sequence-focused LSTM
4.3.1 Randomized Events
4.3.2 Swapped Events
4.4 Summary

5 Discussion
5.1 Internal Threats to Validity
5.1.1 Multivariate vs Univariate Time Series
5.1.2 Data Simplification Limitations
5.1.3 Word2Vec
5.1.4 Network Designs and Training Options
5.1.5 Hyperparameter Optimization Evaluation
5.1.6 Reconstruction Error Metric
5.2 External Threats to Validity
5.2.1 Minimum Number of Log Events Per User
5.2.2 Data Sparseness
5.2.3 Anomaly Insertions
5.3 Result Comparisons
5.3.1 The Naive Models' Validity
5.3.2 Event-focused Anomaly Detection
5.3.3 Sequence-focused Anomaly Detection
5.4 Future Research
5.4.1 Event-focused Anomaly Detection with Alternative Threshold Selection
5.4.2 User-focused Log Anomaly Detection vs Global Log Anomaly Detection
5.4.3 Comparison with LSTM and Alternative Machine Learning Techniques

6 Conclusions

Bibliography


1 Introduction

As more and more sensitive data is being stored in various connected devices and services, safe and robust user authentication mechanisms have become a fundamental tool to ensure that sensitive data can only be accessed by authorized users. Most modern services therefore warn the user every time a login attempt is performed on a new device or in a new country, for instance. However, if a user tends to use several devices or travel a lot at work, these warnings may only disturb the user and decrease their usage of the service. If the user could instead be warned only when a user authentication attempt diverges from the user's ordinary login pattern, the accuracy of these warnings could increase and lead to less spamming of the user and greater attention from the user when an actual warning is sent.

Anomaly Detection and LSTM

Anomaly detection in general has been a popular research topic during the last decade [1], where various attempts at applying machine learning to problems related to the topic have been made because of machine learning techniques' previous contributions to pattern recognition in and classification of large amounts of data. The previous research within the area has mostly been focusing on anomaly detection in system log files, where Recurrent Neural Networks like LSTM (Long Short-Term Memory) have shown promising results in several cases [2].

LSTM is known for being powerful when it comes to sequence modeling and has therefore been seen as a kind of state-of-the-art technique within anomaly detection, but the studies regarding its suitability when it comes to detecting abnormal user authentication behavior are at the time of writing few, which reveals a subarea where its applicability is not fully verified yet. Along with the limited research within anomaly detection in the user authentication log area, the knowledge about the importance of the different information segments provided in a log event in this context is also limited. A comparison between the anomaly detection accuracies when utilizing different log event attribute combinations could therefore also bring new insights to the anomaly detection area in general.

Word Embeddings

The thriving interest in applications of sequential modeling and analysis has also taken the form of research into the ability to analyze and utilize patterns within text data. Natural language processing techniques like Word Embeddings, which translate text data to numeric form while maintaining the semantic meaning of the translated data elements, have found many areas of use, where linguistic-related tasks like speech analysis [3] have been the most prominent applications.

As Word Embeddings utilize the context of the text data element when converting to numeric form, an interesting application would be to utilize the technique to emphasize the connection between each user's log events. By translating the log events clustered by user, the connection between a user's log events would still be evident in the numeric form, and the performance of a subsequent sequential analysis of each user's login pattern could be increased even further as the log events in the user's individual login flow would consist of similar numeric representations.

Combination of Word Embeddings and LSTM

To address the identified possibilities, this thesis aims to investigate which log attributes are useful when detecting anomalies in user authentication logs, where the focus lies on utilizing the users' individual login flows to spot individual anomalies. The log events will be converted to numeric form clustered by user with the help of Word Embeddings in order to emphasize the importance of the users' individual login patterns, and two anomaly detection approaches with LSTM as the core model will be presented in order to further investigate which design choices are the most appropriate in this specific anomaly detection context.


1.1 Research Questions

• How accurate are Long Short-Term Memories (LSTM) combined with Word Embeddings when classifying user authentication events in authentication logs as authentic or fake?

• Which combinations of log event attributes are the most useful ones when detecting anomalies?

1.2 The Principal

TeamEngine1 is a software company specialized in providing products and services for board and management collaboration, insider management, due diligence and crisis management. As some features of the service include sharing and signing of important documents, the security of the service is strongly prioritized and it is crucial that the authentication mechanism works as expected and protects against any malicious login attempts. At the same time, the broad target group of the service means that there are a lot of different user types and behaviors, and it is therefore in the company's interest to develop a sharp user authentication mechanism that always warns about all suspicious and malicious login attempts without sending false warnings about authentic login attempts.

1.3 Limitations

It is important to emphasize that this thesis will not focus on building a tool performing real-time user authentication classification or a tool scanning and detecting anomalies in user authentication logs. This thesis will only focus on implementing the previously mentioned models separately and evaluating them on a snapshot of a user authentication log in order to obtain the desired academic results.

The techniques and models used in the thesis will be existing ones, as the focus is on comparing existing models in a new context and not on inventing new technical solutions. As the user authentication data also is collected from a very specific business area and the success of the models will be rated based on the flagging of manually manipulated data that is constructed based on that specific dataset, the thesis does not aim to find a universal truth regarding user authentication classification. This thesis solely focuses on investigating these models in this specific user authentication log context and in that way potentially giving additional insights about the subject in general.

1 www.teamengine.com/en/index.html

1.4 Ethics and Sustainability

This thesis proposes a solution that aims to find abnormal user authentication behavior, and as a lot of today's modern services offer the user the possibility to create an account, the result could interest and affect a broad crowd. The competition between establishing stronger security mechanisms among the services and finding new ways of exploiting the services is an ongoing race that has been affecting users worldwide for several years, and new evaluations of the security are therefore always rewarding and needed. Even though it is important to stress that the proposed solution is only tested in a very specific context, the general idea behind the solution is indeed applicable in multiple areas and could be the foundation of a complementary tool in the war against malicious hacker attacks.

The user data used in this thesis is completely pseudonymized in order to erase the connection between the user and a private individual but still maintain the connection between the user and the log events. The dilemma between the risk of violating a user's personal integrity and the possibility to increase the user's online security is an ethical aspect that has been discussed during the initial phases of the thesis project, but as the data is not made publicly available and no statistics about the individual users are presented in the report, the personal integrity of the users is deemed to be maintained.

1.5 Outline

In Chapter 2, the research areas concerned, relevant background theory and related studies are presented. The methodology, including an introduction to the data and its needed preprocessing along with the implementation details of the different components in the anomaly detection solutions, is described in Chapter 3. The results of the performed experiments are presented in Chapter 4, while an extensive analysis of the results and discussion about the results' implications is performed in Chapter 5. Finally, the findings of the thesis and drawn conclusions are presented in Chapter 6.


2 Background

This chapter presents the relevant background theory of this thesis by first presenting the theory behind the proposed solution, with an introduction to neural networks in general and the neural networks used in this thesis, Word Embeddings and Long Short-Term Memories (LSTM). Later on, an introduction to the research areas anomaly detection and sequence modeling is given along with a brief introduction to log files and their purpose. Lastly, previous research related to the topic of the thesis is presented along with a brief summary.

2.1 Theory

2.1.1 Neural Networks

Neural networks (NN) are computing systems whose design and function are inspired by the biological neural networks in human brains.

As can be seen in Figure 2.1, NNs can be seen as weighted directed graphs where neurons represent the nodes. The network consists of one or more input neurons, one output neuron, and inside the network the remaining neurons are ordered in one or more hidden layers. The neurons belonging to the input layer, the hidden layers and the output layer are connected by weighted edges, and each neuron can have multiple inputs that are utilized to compute a single output [4].


Figure 2.1: A 4-layer neural network with 6 input neurons, 2 hidden layers with 4 and 3 neurons respectively, and 1 output neuron.

The output of each neuron is decided by multiplying the input values with their corresponding weight values and summing the result with a bias value. An activation function is then applied to the result in order to ensure stability and smooth transitions of the output when small input changes occur [5]. To increase the accuracy of the model, backpropagation algorithms are often used to repeatedly adjust weights and biases during training in order to decrease the errors that have occurred during the training [6].
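As a minimal illustration of this computation, the sketch below (with hypothetical input and weight values, using NumPy) computes the output of a single neuron as the weighted sum of its inputs plus a bias, passed through a sigmoid activation function.

```python
import numpy as np

def sigmoid(z):
    # A common activation function that squashes the weighted sum into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example values: three inputs feeding a single neuron.
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.1, -0.4])
bias = 0.2

# Weighted sum of the inputs plus the bias, followed by the activation function.
output = sigmoid(np.dot(inputs, weights) + bias)
print(output)
```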

NNs are known for their universality as they have proven to be able to compute any function [5], which has resulted in multiple applications of the technique within several research areas.

2.1.2 Word Embeddings

A Word Embedding, also referred to as a Natural Embedding, is a collection of natural language processing techniques where textual words are converted to dense vectors with the use of various training methods inspired by neural network modeling [3]. By weighing in the surrounding context of a word in a given sequence, a word-embedding neural network can produce a vector representation of the word which is close to the vector representations of other words that are semantically close to that word [7].

Two significant word-embedding models, Continuous Bag-of-Words (CBOW) and Skip-Gram, were proposed by Mikolov et al. [7] in 2013 under the umbrella term Word2Vec, with the focus on achieving this functionality.

CBOW is similar to the Feedforward Neural Network architecture described in Section 2.1.1, with an input layer, a single hidden layer without an activation function and an output layer. Each input word is indicated by a one-hot encoded vector with the size of the vocabulary to be converted, and CBOW tries to predict the target word by looking at the surrounding words, also known as the context, where the learned weights are used as the word's vector representation [7].

Skip-Gram, on the other hand, is similar to CBOW, but instead of predicting the current word based on the context, Skip-Gram tries to predict the context based on the current word [7]. For each surrounding word belonging to the context of the target word, Skip-Gram outputs a probability that predicts the context word. In a succeeding paper by Mikolov et al. in 2013, Skip-Gram was extended with subsampling of frequent words and a decrease in the number of weight updates in order to speed up the model and learn more regular word representations [8]. The conceptual difference between the two models can be seen in Figure 2.2.


Figure 2.2: An illustration of the conceptual difference between CBOW and Skip-Gram. CBOW predicts the current word w(t) based on the words in the context (w(t-2), w(t-1), w(t+1), w(t+2)), while Skip-Gram predicts the words in the context based on the current word.
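As a concrete illustration of the two variants (not taken from the thesis), the sketch below shows how CBOW and Skip-Gram are selected in the Gensim library, assuming Gensim 4.x where the choice is controlled by the sg flag; the toy corpus is hypothetical.

```python
from gensim.models import Word2Vec

# Toy corpus: each "sentence" is a list of tokens.
sentences = [
    ["user_a", "sweden", "success"],
    ["user_a", "sweden", "failure"],
    ["user_b", "norway", "success"],
]

# sg=0 trains the CBOW variant (predict the current word from its context),
# sg=1 trains the Skip-Gram variant (predict the context from the current word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Every token now has a dense vector representation.
print(cbow.wv["user_a"].shape)                      # (50,)
print(skipgram.wv.most_similar("user_a", topn=2))   # semantically close tokens
```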

2.1.3 Long Short-Term Memory (LSTM)

A Recurrent Neural Network (RNN) is a type of neural network equipped with feedback connections that re-provide the previous output of a neuron back to the neuron [9].

An overview of an RNN structure can be seen in Figure 2.3. In this example, an input sequence with three elements is passed through the hidden layer element by element. When a calculation with an element has been performed in the hidden layer, the output is passed on to the hidden layer calculation of the succeeding element, ultimately resulting in a final single output o3.


Figure 2.3: Recurrent Neural Network. (a) The RNN with input i[1-3], the hidden layer and output o3. (b) The same RNN unfolded over the three timesteps, with inputs i[1], i[2], i[3] and outputs o1, o2, o3.

This means that this kind of neural network architecture is considered to have a memory, as it constantly utilizes information from the past to find new insights about the future, which differentiates it from the feedforward neural network architecture, as those networks lose the previous state between each sample [10].

A common problem that general RNNs possess occurs during the training of the network. During the backpropagation of the network, gradient descent is performed to adjust the weights of the network and in that way minimize the error. When these calculations are performed backwards through multiple previous time steps, the gradients tend to either "explode" (become very large) or "vanish" (approach zero) as the calculations become a long chain of multiplicative operations, which causes the network to stop learning [9, 10]. This problem is known as the Vanishing Gradient Problem, and it occurs more often if multiple time steps are to be crossed [10]. This is an issue that the introduction of Long Short-Term Memories originally aimed to address [9].


Figure 2.4: A memory cell of a Long Short-Term Memory, with a forget gate, an input gate and an output gate. The cell receives the memory and output from the previous timestep together with the current input, and produces the memory and output for the upcoming timestep.

A Long Short-Term Memory (LSTM) is a recurrent neural network-based architecture that introduces memory cells instead of recurrent neurons [9]. A conceptual illustration of a memory cell of an LSTM can be found in Figure 2.4. The vanishing gradient problem is solved by the introduction of an internal state that keeps track of the memory cell's memory, which means that the current weights are stored and not immediately overwritten by irrelevant values [9, 11]. The memory is kept up to date by having a forget gate that controls how much of the memory should be flushed, an input gate that controls how much of the input and the output from the previous time step should be added to the memory, and an output gate that decides how much of the memory should be passed as output of the memory cell [10].
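For reference, a common formulation of these gates (standard textbook notation, not necessarily the exact notation of [9, 10]) is:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(output)}
\end{aligned}
$$

where $x_t$ is the input, $h_t$ the output and $c_t$ the internal memory at timestep $t$, $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.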

These adjustments have made it possible to utilize the benefits of recurrent neural networks without the drawback of disturbed learning, even though the importance of sequential ordering of the operations in an LSTM has resulted in problems with parallelization [12]. LSTMs have however been considered to be a state-of-the-art technique in several sequence modeling areas, such as anomaly detection [2], and have gained different extensions throughout the years.


2.2 Introduction to Research Areas

2.2.1 Anomaly Detection

The research area of Anomaly Detection, also known as Outlier Detection, concerns the problem of finding patterns in data that do not conform to the expected behavior. The area has been of great interest to several industries, as detected anomalies often point to important and critical information in various contexts [1], and this versatility has resulted in a focus on finding appropriate anomaly detection methods and evaluating their performances in different problem sets [13]. Examples of previous usages and techniques are detection of credit card theft with neural networks using Parametric Statistical Modeling [1], detection of outliers in medical data [1] and system log analysis of avionics systems with the help of Nonparametric Bayesian models to detect security incidents [14].

The area of anomaly detection can be divided into three main subareas depending on the context and the utilized techniques [15].

1. Supervised Anomaly Detection is when the training data and test data are completely labeled as "normal" or "abnormal". A classifier can then be trained on the labeled training data and be clearly assessed by investigating if it labels the test data correctly.

2. Semi-supervised Anomaly Detection is when the training data is completely unlabeled and considered to be completely normal, which means that the model can safely learn from the training data and afterwards be tested on labeled test data.

3. Unsupervised Anomaly Detection utilizes data that is not labeled at all. There is no distinction between a training and a test dataset; the algorithm instead scores the data based on its intrinsic properties and detects outliers by measuring distances and densities.

Anomaly Detection for Discrete Sequences

A significant application of anomaly detection is within discrete sequences, which focuses on detecting abnormal patterns within discrete sequential data, also known as time series [16]. Instead of identifying separate elements that differ from the normal set of data, this orientation focuses on detecting anomalies within the sequential nature of the data. Extensions of sequence modeling techniques are often utilized in these kinds of problems, and the new contextual insights about data that these studies can give have turned the research topic into a popular one [1].

Anomaly Insertion within Unlabeled Test Data

When dealing with unlabeled data, which is the case within unsupervised and semi-supervised anomaly detection, there are some assumptions and modifications that need to be made in order to be able to evaluate the performance of the model in use.

Firstly, if a model is trained on unlabeled data, it is important to make sure that the data solely consists of data points that belong to the data class that is considered to represent the normal behavior in the specific anomaly detection context [15]. In this way, the model to be trained will be able to trust the data and build its conception of the characteristics of normal data completely based on the patterns the model detects within the training data.

Secondly, in order to be able to measure the accuracy of the model in terms of precision and recall, the data needs to be labeled in some way. This can be done by inserting artificially manipulated anomalies into the dataset that deviate from the characteristics of the training data, as the model presumably has been trained on completely normal data [14]. The underlying motivation behind the content of these insertions is often derived from knowledge about the original dataset or randomization of data attributes, as these measures surely result in outlying data points.

2.2.2 Sequence Modeling

Sequence modeling, or Sequence to Sequence learning, is a concept within deep learning regarding the mapping of input sequences to output sequences by the utilization of a model [17]. As the lengths of the input and output sequences can vary and do not necessarily need to be of the same length, the research area of sequence modeling has found several applications throughout the years [18]. Common traditional sequence modeling areas include speech recognition, where a sequence of audio clips is mapped to words, and machine translation, where a sequence of words in one language is mapped to a new sequence of words in another language, since the sequences in these natural sequence problems often are characterized by having various formats and lengths [17].

The research within sequence modeling has been strongly focused on the use of neural networks during recent years, where RNN-based Encoder-Decoder architectures gradually have proven to be one of the most prominent tools within the research area [18].

Encoder-Decoder

The encoder-decoder architecture can be divided into two main parts. The first one, the encoder part, consists of an RNN that takes an input sequence of varying length, parses the elements one by one and repeatedly updates a hidden state that at the end of the input sequence acts as a summary c of the whole sequence [19].

The decoder part of the architecture consists of another RNN that tries to generate the output sequence by predicting each sequence element y_t at a certain timestep based on the current hidden state h_t, which is derived from the summary received from the encoder, the hidden state from the previous timestep h_{t-1} and the previously predicted sequence element y_{t-1}. Finally, the predicted sequence elements are collected and returned as the output sequence [19].
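Written out, a standard formulation of this encoder-decoder along the lines of [19] (the notation here is a common convention rather than a quotation from the source) is:

Encoder: $h_t = f(h_{t-1}, x_t)$, with the summary $c$ taken as the hidden state after the last input element.

Decoder: $h_t = f(h_{t-1}, y_{t-1}, c)$ and $P(y_t \mid y_{t-1}, \ldots, y_1, c) = g(h_t, y_{t-1}, c)$,

where $f$ and $g$ denote the recurrent update and the output function of the respective RNN.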

There are many variants of the encoder-decoder architecture depending on the RNN in use and the input [18], but the general structure of the architecture is in most cases the same.

2.2.3 Log Files

Log files are text files that hold information about different kinds of events in various areas such as software, operating systems and networks [20]. As these files often contain extensive information about the exact execution flow of the applications, these files have been considered to be essential tools when monitoring and debugging various types of applications and can today be found in most sorts of software applications. Logged events in these files typically follow a coherent format with fields like timestamp, event description and severity level [21], to mention a few typical attributes, and this format coherency eases the reading and interpretation of the logs.

User Authentication Logs

A specific type of log files is user authentication logs that contain information about authentication events in systems, services and networks.

As the logs focus on authentication events, the event types in the logs typically are of the types "successful login" or "failed login", with additional information like timestamp, user, IP address and user agent [2]. The multiple log event attributes and the discrete sequential nature of these logs therefore give a clear overview of the authentication attempts of the application in question.

2.3 Related Work

During the last decades, several studies have been performed about anomaly detection within sequential data and log contexts of different kinds in order to identify possible areas of use and appropriate methods. A wide range of techniques have been used up to this date, and this section aims to present a selection of the historical usage of these techniques within anomaly detection.

2.3.1 Mining-based Anomaly Detection

Some of the early anomaly detection methods within sequential data utilized different mining methods to find outliers.

An example where data mining was used was in 2003, when Abad et al. [22] conducted a proof-of-concept study about enhancing intrusion detection within logs by combining the data mining tool RIPPER with log correlation, where the different ways an attack can be reflected depending on the log environment were utilized to increase the accuracy of the flagged intrusions, with promising results.

Later on, in 2008, Bezerra et al. utilized process mining in a study [23] where two different anomaly detection algorithms focusing on sample size and threshold size respectively were presented and tested in logs from process aware systems. During the same year, Bezerra et al. performed another study [24] where the same two algorithms along with an iterative anomalous trace classifier were tested within business process logs. In both studies, the proposed sampling algorithm showed the most accurate results.

2.3.2 Log Analysis-based Anomaly Detection

Other studies have focused on building tools that extract the actual important information from logs before performing the anomaly detection analysis of the logs.

An early example of this strategy took place in 2009, when Fu et al. [25] composed a study where the authors proposed a solution for performing anomaly detection within logs by utilizing unstructured log analysis. By developing an algorithm that could extract the "type" of a log event (also known as a log key), a Finite State Automaton (FSA) could analyze the constructed sequence of log event types and in that way work in a more simplified problem context.

The study by Fu et al. was later featured in an experience report composed by He et al. [26] in 2016. The report discussed six state-of-the-art log-based anomaly detection methods (3 supervised and 3 unsupervised methods), where these methods were tested on two publicly available production log datasets and compared to each other. In the study, the supervised methods generally proved to show better results than the unsupervised methods in terms of performance and speed. The same year, He et al. performed another study [21] focusing on log mining and its usage within anomaly detection in logs. The study evaluated four log parsers on five datasets with over ten million raw log messages and also wrapped the parsers into a toolkit for future usage.

2.3.3 Machine Learning-based Anomaly Detection

Different variants of machine learning techniques have also been applied to detect anomalies.

In 2008, Oliner et al. [27] presented an unsupervised algorithm for anomaly detection called Nodeinfo that was applied on several publicly available supercomputer system logs and published as an online version as well. The outlying log events were manually tagged as "alerts" in order to ease performance measurement, and as the algorithm was tested on several public logs, the algorithm showed robust and reproducible performance.

In another study performed by Zhang et al. [28] in 2012, a system called University Credential Abuse Auditing System (UCAAS), with a Logistic Regression Classifier as the anomaly detector, was constructed with the purpose of safeguarding user accounts by detecting compromised accounts in authentication logs. Characteristics that were significant for the system when detecting suspicious behavior were temporal-spatial violations (user activities from geographically different locations in a short period of time), IP addresses used by several accounts during a short period of time and general suspicious user behavior, like many logins during vacation periods.

Comparisons between different kinds of machine learning techniques within a log anomaly detection context have also been a popular research topic. One example of a study of that kind was performed in 2019 by Lartigau [14]. A system log analysis of avionics systems with the help of Nonparametric Bayesian models to detect different kinds of security incidents was proposed, and the performance was compared to an LSTM model and a Hidden Markov model that already were in use in the specified context. The results showed that the Nonparametric Bayesian model performed better than the LSTM model and the Hidden Markov model in almost every case when it came to detecting the different security incidents, showing that the proposed tool is a promising alternative to the more commonly used neural networks when detecting anomalies in sequential data.

2.3.4 Neural Network-based Anomaly Detection

As mentioned in the previous section, different kinds of neural networks have been a popular tool in various studies where anomaly detection within sequential data has been the focus.

In a study from 2018, Russac et al. [29] translated univariate and multivariate sequences of credit card transactions to word vectors by utilizing the neural network Word2Vec. The word vectors were later on passed to the machine learning models Logistic Regression and Random Forest, and the overall performance and memory usage during the training of the two models were compared to when passing the transaction data in the form of one-hot encoding to the two machine learning models. The results were promising and showed that the usage of Word2Vec decreased the memory usage and improved the machine learning models' learning capacity.

Other kinds of neural networks that have been used more regularly are RNNs such as LSTM, which has been tested in various ways within anomaly detection. In a study from 2016, Malhotra et al. [30] implemented an LSTM-based encoder-decoder model that learned the normal behavior of various publicly available time series datasets and then tried to reconstruct the test data with the trained model. If a predicted test data element deviated too much from the true value, i.e., the reconstruction error was too big, the test element was flagged as an anomaly. The model proved to be robust both when it came to accuracy within the different time series contexts and also in the ability to function with both short and long sequences, leaving the reconstruction error methodology as an interesting anomaly detection tool.

Extensions to LSTM have been constructed in order to boost its anomaly detection ability even more. In 2018, Brown et al. [2] proposed extensions to RNNs (with the focus on LSTMs) that at that point in time had shown state-of-the-art performance within anomaly detection in other studies, in order to increase the understanding of the decision factors inside the RNNs. Attention mechanisms were inserted in different kinds of RNNs and later on tested on the authentication log from the Los Alamos National Laboratory (LANL) cyber security dataset, with promising results.

During the last years, several other studies [14, 31] have been conducted to challenge RNNs' and specifically LSTMs' strong position within anomaly detection, and this is at the time of writing a research topic on the rise.

2.4 Summary

The area of anomaly detection has been a popular research area during the last decade, which has resulted in several areas of use and methods, where RNNs are a commonly recurring technique. Log anomaly detection can be considered to be a quite prominent subtopic, but the studies that focus on pure authentication logs are however few, which leaves a gap in the knowledge of what possibilities the topic entails and how certain techniques would perform in such an environment.


3 Methodology

This chapter presents the complete methodology used in the thesis. The original dataset and the following preprocessing of the data are presented, the architectures and fitting settings of the implemented models are described, the subsequent optimizations of the hyperparameters of the models and their findings are presented, and the ultimate anomaly detection solutions used are explained.

3.1 Data

3.1.1 Scope and Structure of Data

The data utilized in this thesis is user authentication log data provided by the software company TeamEngine.1 The data consists of a snapshot from a log with the global login attempt history of TeamEngine's collaboration software, spanning over a continuous time period of 7 months. There are 280,063 login events in the log snapshot with a total of 16,315 unique users.

The log snapshot is structured in the way that one login attempt is linked to one log event, and there is one log event per row in the log file. The log events are ordered temporally, and each log event contains the information about the different log event attributes. These attributes consist of the timestamp of the login attempt, the user that tried to log in, the login method used (password or electronic identification via BankID2), the IP address of the user, the user agent of the device the user used to try to log in, whether it was a successful/failed login attempt, and some additional system-related information about the login attempt. The usernames are pseudonymized by being hashed and salted in order to preserve the personal integrity of the users.

1 www.teamengine.com/en/index.html
2 www.bankid.com/en/

3.1.2 Data Preprocessing

Initial Preprocessing

To prepare the log data for further preprocessing and usage, the log data is first parsed into CSV (comma-separated values) format, where the attributes of each log event are separated by commas. The log events are then trimmed by removing the columns with the additional system information, which are not of interest, and removing the log events where attributes are missing or corrupt. All log events belonging to a user with fewer than ten occurrences are also removed from the data.

Data Simplification and Conversion

To simplify the forthcoming training of the neural networks, the user agents, IP addresses and timestamps are simplified or converted in order to achieve a smaller set of categorical values that these attributes can take. The user agent attributes of all log events are simplified by extracting the actual user agent identifier (i.e., "Windows", "iPhone", and so on) of the user agent attributes, and the IP addresses are converted to countries by utilizing the IP2Location database3 and IPInfo4 when IP addresses were not available in the database. The timestamps are converted to "morning", "afternoon", "evening" or "night" depending on whether the time of the timestamp is between 06-12, 12-18, 18-00 or 00-06 respectively.
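A minimal sketch of this simplification step is given below; the field formats and helper names are illustrative assumptions, and the IP-to-country lookup is only stubbed since it depends on the IP2Location/IPInfo setup.

```python
from datetime import datetime

def simplify_timestamp(ts: str) -> str:
    # Map the login time to one of four periods of the day:
    # 06-12 -> morning, 12-18 -> afternoon, 18-00 -> evening, 00-06 -> night.
    hour = datetime.fromisoformat(ts).hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"

def simplify_user_agent(user_agent: str) -> str:
    # Keep only a coarse platform identifier instead of the full user agent string.
    for platform in ("Windows", "iPhone", "Android", "Macintosh", "Linux"):
        if platform in user_agent:
            return platform
    return "Other"

def ip_to_country(ip_address: str) -> str:
    # Placeholder: in the thesis this lookup uses the IP2Location database,
    # with IPInfo as a fallback when the address is not found there.
    raise NotImplementedError
```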

User-focused Split of Data

As the focus of the thesis mainly is to detect abnormal patterns in the users' login behaviors and not the global login flow, the dataset is sorted by user and then temporally ordered in order to enhance the meaning of the users' individual login flows by feeding those as coherent sequences to the neural networks. To also be able to verify the models' ability to understand the users' individual login patterns, the data is sampled during the split into a training, validation and test dataset with the goal of having all users represented in the training and test dataset or just the validation dataset. For the event-focused LSTM, the first 80 % of each user's log events form the training dataset, where the last 10 % of the training dataset is removed from the training dataset and used as the validation dataset during training, while the last 20 % of each user's log events constitute the test dataset. For the sequence-focused LSTM, the first 70 % of each user's log events form the training dataset, while the last 30 % of each user's log events are used for further splitting into a validation and test dataset, where the first 30 % of the remaining data is used as the validation dataset and the rest is used as the test dataset. The overall dataset split information can be seen in Table 3.1.

3 www.github.com/ip2location/IP2Location-Python/blob/master/data/IP-COUNTRY.BIN
4 www.ipinfo.io/

                        Training Dataset (%)   Validation Dataset (%)   Test Dataset (%)
Event-focused LSTM               72                      8                     20
Sequence-focused LSTM            70                      9                     21

Table 3.1: The dataset split percentages of the original dataset.

Anomaly Insertion

As there are no known abnormal log events in the data, manually crafted abnormal log events need to be constructed and inserted into the test dataset in order to verify the anomaly detection algorithms' performances. As the scope of this thesis is to investigate if the anomaly detection algorithms can detect both log events in a user's login flow that are obviously abnormal and abnormal sequences where the characteristics of the log events are not abnormal individually but where the sequence of the log events results in an abnormal login pattern, two types of abnormal events are inserted into the test dataset:

• Log events where the attributes are randomized values taken from the original dataset, hereby referred to as randomized events.

• Two randomly selected log events that are swapped with each other within a user’s login flow, hereby referred to as swapped events.


The two types of anomalies are inserted in separate test runs to be able to investigate the algorithms' abilities to detect the anomalies separately and more precisely. Both anomaly types are inserted for randomly chosen users that vary for each test run, to be able to further test the robustness of the anomaly detection algorithms.
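A sketch of how the two anomaly types could be inserted is shown below; the attribute names, the per-user event lists and the labeling via an is_anomaly flag are illustrative assumptions rather than the exact insertion logic of the thesis.

```python
import random

def insert_randomized_event(user_events, all_events):
    # Build an abnormal event by sampling each attribute independently from the
    # full dataset, then place it at a random position in the user's login flow.
    fake = {attr: random.choice(all_events)[attr]
            for attr in ("country", "login_status")}
    fake["user"] = user_events[0]["user"]   # keep the event tied to the chosen user
    fake["is_anomaly"] = True
    user_events.insert(random.randrange(len(user_events) + 1), fake)

def insert_swapped_events(user_events):
    # Swap two randomly chosen events within the user's login flow, which keeps the
    # individual events normal but disturbs the sequential login pattern.
    i, j = random.sample(range(len(user_events)), 2)
    user_events[i], user_events[j] = user_events[j], user_events[i]
    user_events[i]["is_anomaly"] = True
    user_events[j]["is_anomaly"] = True
```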

Feature Combination Selection

To evaluate the difficulty of detecting anomalies when looking at different log event features, 3 feature combinations are chosen to be examined in separate test runs.

• User, time period of the day, and successful/failed login.

• User, country, and successful/failed login.

• User, country, and user agent.

The feature combinations are chosen based on the interest of the author and are limited to 3 types of combinations because of time constraints. The size of the examined feature combinations is set to 3, as this size is judged to be sufficient when it comes to giving enough information about the log events to the upcoming analysis without being too extensive. Too much information could lead to more sparse numeric representations of the log events that could aggravate the LSTMs' abilities to detect patterns in the data, which is not desirable.

Text Data Conversion to Numeric Data Using Word2Vec

The text data from the log file is converted to numeric data by using Gensim's implementation of a word-embedding neural network, Word2Vec5, whose settings can be seen in Table 3.2, where the vector size is configured as a hyperparameter and CBOW is the model used. The selected features of each log event are merged into a single string per log event and grouped by user in chronological order. As Word2Vec is used to convert text data to numeric vectors based on the surrounding context of the word in order to maintain the semantic meaning of the text data in numeric form, the selected context of the text data is an efficient way to emphasize which text data elements should be converted to numeric vectors close to each other. This characteristic of Word2Vec is utilized in order to emphasize the connection between a user's log events by feeding the users' log events in groups of five to Word2Vec. The size of the context window is set to five, which means that Word2Vec utilizes the five preceding and five succeeding log events when learning the connection between the received grouped log events, and a vocabulary where each unique log event is mapped to a numeric vector is created. Finally, the log events in the original dataset are replaced by their corresponding numeric vectors.

5 www.radimrehurek.com/gensim/models/word2vec.html

Vector Size   Window Size   Lowest Acceptable Frequency   Num Workers Used During Training
    300            5                      1                              4

Table 3.2: The settings of Word2Vec.
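A sketch of this conversion step is given below, assuming Gensim 4.x (where the vector size parameter is called vector_size; in the Gensim 3.x releases current at the time of the thesis it was called size) and hypothetical attribute names; the toy events_by_user dictionary stands in for the real per-user grouped log.

```python
from gensim.models import Word2Vec

# Toy stand-in for the real data: each user maps to a chronologically ordered
# list of simplified log events.
events_by_user = {
    "user123": [
        {"user": "user123", "country": "sweden", "login_status": "success"},
        {"user": "user123", "country": "sweden", "login_status": "failure"},
    ],
    "user456": [
        {"user": "user456", "country": "norway", "login_status": "success"},
    ],
}

def event_to_token(event):
    # Merge the selected attributes of a log event into a single string token.
    return "|".join([event["user"], event["country"], event["login_status"]])

def build_sentences(events_by_user, group_size=5):
    # Group each user's tokens into chunks of five, which become the "sentences"
    # fed to Word2Vec so that a user's events share a common context.
    sentences = []
    for user_events in events_by_user.values():
        tokens = [event_to_token(e) for e in user_events]
        sentences.extend(tokens[i:i + group_size]
                         for i in range(0, len(tokens), group_size))
    return sentences

sentences = build_sentences(events_by_user)
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4, sg=0)

# Every unique log event now maps to a 300-dimensional vector.
vectors = {token: w2v.wv[token] for token in w2v.wv.index_to_key}
```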

3.2 Model Implementations

3.2.1 Event-focused LSTM

The event-focused LSTM is implemented using the LSTM layer and Dense layer provided by the deep learning framework Keras.6 The general structure of the model follows an encoder-decoder structure, where the encoder part consists of an LSTM layer with the activation function Rectified Linear Unit and the number of neurons modeled as a hyperparameter. The decoder part of the model consists of another LSTM layer with the same setup as the encoder part and a final Dense layer whose size is controlled by the word vector size hyperparameter. The number of timesteps used is one.

Figure 3.1: Flow chart of the event-focused LSTM model: the input is passed through an encoder LSTM layer (N = 900) and a RepeatVector layer, followed by a decoder LSTM layer (N = 900) and a Dense layer (N = 300) producing the output, with t = 1. N is the number of neurons in the particular layer and t is the number of timesteps used.

6 www.keras.io
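A sketch of this architecture using the tensorflow.keras API (layer sizes follow Figure 3.1; this is a reconstruction from the description, not the thesis code) could look as follows.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, RepeatVector, Dense

timesteps = 1      # one log event per sample
vector_size = 300  # word vector size (hyperparameter)
n_units = 900      # number of LSTM neurons (hyperparameter)

event_model = Sequential([
    # Encoder: compresses the single 300-dimensional log event vector.
    LSTM(n_units, activation="relu", input_shape=(timesteps, vector_size)),
    RepeatVector(timesteps),
    # Decoder: reconstructs the log event vector.
    LSTM(n_units, activation="relu"),
    Dense(vector_size),
])
event_model.summary()
```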


3.2.2 Sequence-focused LSTM

The sequence-focused LSTM generally follows the same implementation details as the event-focused LSTM, with the Keras implementation of the LSTM layer and an encoder-decoder structure, with some minor modifications.

The number of timesteps used is five, which means that the model is trained to learn and reconstruct log event sequences with a length of five. The encoder part shares the same settings as the encoder part of the event-focused LSTM, with the number of neurons of the LSTM layer configured as a hyperparameter and the activation function Rectified Linear Unit. The LSTM layer is followed by Keras' implementation of the RepeatVector layer, which takes its input and repeats it five times to pass on the timestep-focused sequences to the decoder part. The decoder part consists of an LSTM layer with the number of neurons configured as a hyperparameter and the Rectified Linear Unit as the activation function. The LSTM layer finally returns the sequences that it received as input through Keras' layer wrapper TimeDistributed, which applies a Dense layer, with a size controlled by the word vector size hyperparameter, to the five temporal steps received in each sequence.

Figure 3.2: Flow chart of the sequence-focused LSTM model: the input is passed through an encoder LSTM layer (N = 600) and a RepeatVector layer, followed by a decoder LSTM layer (N = 600) and a TimeDistributed Dense layer (N = 300) producing the output, with t = 5. N is the number of neurons in the particular layer and t is the number of timesteps used.
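A corresponding Keras sketch of the sequence-focused model (again a reconstruction based on Figure 3.2 rather than the thesis code) is:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, RepeatVector, Dense, TimeDistributed

timesteps = 5      # sequences of five log events
vector_size = 300  # word vector size (hyperparameter)
n_units = 600      # number of LSTM neurons (hyperparameter)

sequence_model = Sequential([
    # Encoder: summarizes the whole five-event sequence into a single state.
    LSTM(n_units, activation="relu", input_shape=(timesteps, vector_size)),
    # Repeat the summary once per timestep so the decoder can unroll it.
    RepeatVector(timesteps),
    # Decoder: returns one reconstructed vector per timestep.
    LSTM(n_units, activation="relu", return_sequences=True),
    TimeDistributed(Dense(vector_size)),
])
sequence_model.summary()
```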


3.3 Model Fitting

Both models are fit on the training set using the Adam optimizer [32] and the default learning rate 0.001 used by Keras. The models are trained on the training data for 10 epochs, as this number was deemed to be enough to train the models sufficiently while saving execution time. The Mean Absolute Error (MAE) [33], which can be seen in Equation 3.1, is used as the loss function to be minimized.

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n} = \frac{\sum_{i=1}^{n} |e_i|}{n} \qquad (3.1)$$

Equation 3.1: Mean Absolute Error. $n$ is the number of timesteps, $y_i$ is the true target of each timestep and $x_i$ is the prediction for each timestep, leaving $e_i$ to be the reconstruction error.
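Assuming the model sketches above and arrays X_train and X_val of shape (samples, timesteps, 300), the fitting step could be expressed as follows; since the models are autoencoders, the input also serves as the target.

```python
from tensorflow.keras.optimizers import Adam

# timesteps is 1 for the event-focused model and 5 for the sequence-focused one,
# so each model is fit on data shaped accordingly.
model = event_model  # or sequence_model
model.compile(optimizer=Adam(learning_rate=0.001), loss="mae")
model.fit(
    X_train, X_train,                  # autoencoder setup: reconstruct the input
    validation_data=(X_val, X_val),
    epochs=10,
)
```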

3.4 Hyperparameter Optimization

Among the many potential hyperparameters that could be optimized, the vector size of the word embeddings and the number of neurons in the LSTM layers were chosen for further evaluation using a Grid Search [34].

Due to time constraints, the two models were optimized for the feature combination [User, Country, User Agent], as that feature combination was judged to be the hardest one for the models to learn since there is a lot of variation in the possible attribute value combinations.

A dataset containing the first 120,000 log events of the original dataset was used during the optimization, and all hyperparameter combinations of the two metrics were fit five times for 10 epochs on different partitions of the dataset, using a rolling window of sample size 20,000 and rolling size 10,000 during the optimization of the event-focused LSTM, whereas a rolling window of sample size 40,000 and rolling size 20,000 was used during the optimization of the sequence-focused LSTM. During the fitting of the models, the validation loss was recorded, and after each completed hyperparameter configuration test, the average validation loss for that configuration was calculated.
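A simplified sketch of this rolling-window grid search is shown below (window sizes as for the event-focused model); the grid values, the 90/10 split inside each window and the helpers embed_events and build_model are assumptions of the sketch, not values taken from the thesis.

```python
import numpy as np

vector_sizes = [100, 200, 300]           # assumed grid values
neuron_counts = [300, 600, 900, 1200]    # assumed grid values
window_size, roll = 20_000, 10_000       # event-focused LSTM window settings

results = {}
for vec_size in vector_sizes:
    for n_units in neuron_counts:
        val_losses = []
        for start in range(0, 5 * roll, roll):          # five rolling windows
            # embed_events: hypothetical helper that embeds the raw log events
            # with the given word vector size and shapes them for the model.
            window = embed_events(raw_events[start:start + window_size], vec_size)
            X_train, X_val = window[:-2_000], window[-2_000:]
            model = build_model(vec_size, n_units)       # hypothetical model factory
            history = model.fit(X_train, X_train, epochs=10,
                                validation_data=(X_val, X_val), verbose=0)
            val_losses.append(history.history["val_loss"][-1])
        results[(vec_size, n_units)] = np.mean(val_losses)

best_vector_size, best_n_units = min(results, key=results.get)
```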

The results of the two hyperparameter optimizations can be seen in Figure 3.3 and Figure 3.4 below.


Figure 3.3: Validation loss during the different hyperparameter configurations of the event-focused LSTM.

Figure 3.4: Validation loss during the different hyperparameter configurations of the sequence-focused LSTM.


As can be seen in Figure 3.3, the validation loss during the fitting of the event-focused LSTM reaches a minimum when using the word vector size 300 and setting the number of neurons to 900, and these values were chosen for the implementation of the event-focused LSTM.

When examining Figure 3.4, the validation loss during the fitting of the sequence-focused LSTM starts to level out when using the word vector size 300 and setting the number of neurons to 600. Although the validation loss decreases slightly when increasing the number of neurons even further, these settings are selected for further evaluation anyway, as the time per test run decreases drastically with a lower number of neurons.

3.5 Anomaly Detection

The anomaly detection method used by the two LSTM-based models focuses on analyzing the reconstruction error, a method that also was used by Malhotra et al. in 2016 [30]. The reconstruction metric used in that study was the Mean Squared Error (MSE), which is a commonly used metric. However, the fact that the data used in this thesis is sparse, combined with the knowledge that MSE penalizes small errors during the reconstruction of data, resulted in a concern that the small natural deviations that will occur during the reconstruction of the data could be flagged as anomalies, and in that way result in a higher number of false positives. The reconstruction error metric used in this thesis is therefore the Mean Absolute Error (MAE), which also is used by the models as the loss function.

The basic idea behind the concept is to train the models on data that represents normal login behavior, let the trained models reconstruct data that also represents normal login behavior, save the MAE values occurring for each sample during the reconstruction process of the normal data, and then finally compare the normal MAE values with the MAE values that occur when letting the models reconstruct the test data, which contains both normal and abnormal log events. As the sizes of the calculated MAE values imply how hard it was for the model to reconstruct that particular sample, the metric can give an idea about how much the input data deviates from the reality the models are trained to know. The MAE values that occur when letting the models reconstruct the normal data can therefore be seen as normal MAE values and in that way be used to find a threshold that can be compared with when letting the models reconstruct the test data. If an MAE value during the reconstruction of a test data sample exceeds this threshold, this means that the test sample deviates significantly from the behavior that the model considers to be normal, and a suspicious anomaly is detected.

Although the idea behind the anomaly detection method is the same for both models, there are some significant differences in the implementation of the concept that are described separately below in Section 3.5.1 and Section 3.5.2.

3.5.1 Event-focused LSTM

After having trained the event-focused LSTM on the training data, the model reconstructs the training data and test data separately and saves the MAE values for each sample in the list corresponding to the dataset the sample belongs to. As one of the major goals of the thesis is to find anomalies in the users' individual login patterns, the highest MAE value that occurred per user during the reconstruction of the training data is stored. When later investigating the MAE values that occurred when reconstructing the test data, the MAE values are compared to the highest MAE value that occurred for that specific user during the reconstruction of the training data. If the MAE value for that test sample exceeds the user's threshold, the log event is considered to be an anomaly in that user's login behavior and is flagged. The use of a global anomaly threshold was evaluated initially, but as it turned out to be hard to find a global threshold that would catch all anomalies without resulting in a lot of false positives, the user-focused threshold calculation was kept.
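A sketch of the per-user thresholding is given below; the per-sample MAE helper and the train_users/test_users arrays (the user id of each sample) are assumptions of the sketch.

```python
import numpy as np

def mae_per_sample(model, X):
    # One reconstruction error per sample: mean absolute difference between the
    # input vectors and their reconstruction.
    reconstruction = model.predict(X, verbose=0)
    return np.mean(np.abs(X.reshape(len(X), -1) - reconstruction.reshape(len(X), -1)),
                   axis=1)

# Per-user threshold: the highest MAE seen for that user on the training data.
train_errors = mae_per_sample(event_model, X_train)
thresholds = {}
for user, error in zip(train_users, train_errors):
    thresholds[user] = max(thresholds.get(user, 0.0), error)

# Flag test events whose reconstruction error exceeds their user's threshold.
test_errors = mae_per_sample(event_model, X_test)
flags = [error > thresholds.get(user, float("inf"))
         for user, error in zip(test_users, test_errors)]
```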

Finally, the precision, recall and F1 score are calculated to measure the anomaly detection method's performance. The precision is calculated by looking at how many of the flagged events actually were abnormal manipulated events, whereas the recall is calculated by looking at how many of the manipulated events actually were flagged.

The F1 score constitutes the harmonic mean of the precision and recall and is given in Equation 3.2.

F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (3.2)

3.5.2 Sequence-focused LSTM

Compared to the event-focused LSTM, the sequence-focused LSTM differs in the way the threshold is calculated and in its definition of precision and recall.

After the sequence-focused LSTM has been trained on the training data, the model reconstructs the validation data, which contains a minor set of the users with sequences containing both normal and abnormal log events. As the true class of the log events in the validation data is known beforehand, scikit-learn's built-in method precision_recall_curve7 can be used to calculate the model's precision and recall on the validation data, where the MAE values of the log events serve as scores indicating whether a log event should be considered an anomaly or not. With the precision and recall curves calculated, the threshold value that results in the highest F1 score is selected as the global threshold to use in the upcoming anomaly detection.
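A minimal sketch of how such a threshold selection could look with scikit-learn, assuming the validation labels and per-event MAE values are available as arrays (the function name and the handling of ties are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def select_global_threshold(y_true, mae_scores):
    """Select the MAE threshold that maximizes F1 on the validation data.
    y_true: 1 for log events known to be abnormal, 0 otherwise.
    mae_scores: per-event reconstruction errors used as anomaly scores."""
    precision, recall, thresholds = precision_recall_curve(y_true, mae_scores)
    # precision and recall contain one trailing element more than thresholds.
    f1 = (2 * precision[:-1] * recall[:-1]
          / np.clip(precision[:-1] + recall[:-1], 1e-12, None))
    best = int(np.argmax(f1))
    return thresholds[best], f1[best]
```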

Later on, the test data is reconstructed by the model and the computed MAE values per log event sequence are compared to the global threshold to determine whether the sequence contains a potential anomaly or not. Finally, the precision, recall and F1 score are calculated, where a log event sequence is considered to be successfully flagged as an anomaly if the sequence contains one or more of the inserted abnormal events.

3.5.3 Naive Baseline Models

As there were no existing benchmark models at TeamEngine, two naive models are created for this task to serve as benchmarks when measuring the two LSTM-based models' performances. The naive models are designed to reflect the login security mechanisms observed in various services and function as follows (a sketch of both variants is given after the list):

• Naive Event Anomaly Detection. Parses the log events in the test data and, if the naive model has not seen a log event with the examined attribute values before (i.e., the log event does not occur in the training data), flags the log event as an anomaly.

• Naive Sequence Anomaly Detection. Parses the log events in the log event sequences and, if the naive model has not seen a log event with the examined attribute values before, flags the log event sequence that contains the log event as an anomaly.
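A minimal sketch of how the two naive checks could be implemented, assuming each log event is represented as a tuple of the examined attribute values (names are illustrative):

```python
def naive_event_anomalies(train_events, test_events):
    """Flag a test log event if its exact attribute-value combination,
    e.g. (user, country, login_status), never occurs in the training data."""
    seen = set(train_events)
    return [event not in seen for event in test_events]

def naive_sequence_anomalies(train_events, test_sequences):
    """Flag a log event sequence if it contains at least one previously
    unseen log event."""
    seen = set(train_events)
    return [any(event not in seen for event in sequence)
            for sequence in test_sequences]
```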

Finally, the precision, recall and F1 scores are calculated in the same way as for the naive models' corresponding LSTM-based anomaly detection method.

7 www.scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html


3.6 Summary

The two anomaly detection methods proposed in this chapter approach the anomaly detection problem from different angles, which makes it possible to further evaluate the suitability of the chosen neural networks and the corresponding design choices for this particular kind of anomaly detection. Combined with the possibility to evaluate the importance of the different log attribute combinations, this results in multiple ways to evaluate the proposed methodologies, ultimately providing good conditions for gaining a deeper understanding of which characteristics of the login flow are important to examine when detecting anomalies.


Results

This chapter presents the results achieved when using the two proposed anomaly detection models in the described user authentication log context. Firstly, a description of the performed tests and the result presentation is given. Later on, the results of the event-focused LSTM-based anomaly detection method along with the results of the corresponding naive model are presented, while the results of the sequence-focused LSTM-based anomaly detection method and its corresponding naive model can be found in the second part of the chapter.

4.1 Test Setup & Result Significance

The two LSTM-based models are tested on the user authentication data, where the event-focused LSTM-based model focuses on detecting abnormal log events while the sequence-focused LSTM-based model focuses on detecting sequences that contain abnormal log events.

Both LSTM-based anomaly detection methods consist of two stacked LSTM layers followed by a final Dense layer and operate on numeric vector representations of the log events, where LSTM layer sizes of 900 and 600 for the event-focused and sequence-focused LSTM respectively, together with a word vector size of 300, were chosen after a hyperparameter optimization. Both models use the reconstruction error for anomaly detection, with the Mean Absolute Error (MAE) as the reconstruction error metric. The event-focused LSTM compares the reconstruction errors of the test data with the normal reconstruction errors per user, while the sequence-focused LSTM compares the reconstruction errors of the test data with a global reconstruction error threshold.
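For illustration, a minimal Keras sketch of a stacked-LSTM reconstruction model of this shape is given below; the sequence length, the choice of optimizer and the exact way the input is framed for training are assumptions that are not specified in this chapter:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

EMBED_DIM = 300   # word-vector size reported above
HIDDEN = 600      # layer size used for the sequence-focused model
SEQ_LEN = 10      # sequence length: an assumption, not stated here

# Two stacked LSTM layers followed by a Dense layer that maps each
# timestep back to the embedding space, so the model can be trained
# to reconstruct its own input with MAE as the loss.
model = Sequential([
    LSTM(HIDDEN, return_sequences=True, input_shape=(SEQ_LEN, EMBED_DIM)),
    LSTM(HIDDEN, return_sequences=True),
    Dense(EMBED_DIM),
])
model.compile(optimizer="adam", loss="mae")
```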

The event-focused LSTM and sequence-focused LSTM were evaluated using an 11-fold Rolling-Window Analysis with a sample size of 50,000 log events and a rolling size of 20,000 log events. The results are divided into the examined log feature combinations and are presented as averages in Tables 4.1, 4.2, 4.3 and 4.4. The box plots of the F1 scores obtained during the examined tests can be viewed in Figures 4.1, 4.2, 4.3 and 4.4.
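A minimal sketch of how such rolling windows could be generated over the chronologically ordered log is given below; how each window is further split into training, validation and test data is not restated here and is left out of the sketch:

```python
def rolling_windows(n_events, window_size=50_000, step=20_000):
    """Yield (start, end) index pairs over a chronologically ordered log,
    each covering window_size events and advancing by step events."""
    start = 0
    while start + window_size <= n_events:
        yield start, start + window_size
        start += step
```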

The statistical significance of the differences between the results of the LSTM-based models and their corresponding naive models in the various feature combinations has been confirmed at a level of p < 0.05 with the Wilcoxon Signed-Rank Test [35] in almost all cases. The only feature combination where the statistical significance cannot be confirmed is [User, Time, Login Status] when performing anomaly detection of swapped events using the sequence-focused LSTM.
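A minimal example of how this paired comparison could be carried out with SciPy, assuming the per-fold F1 scores of the two models are stored in two equal-length lists (function and variable names are illustrative):

```python
from scipy.stats import wilcoxon

def compare_models(f1_lstm, f1_naive, alpha=0.05):
    """Paired Wilcoxon Signed-Rank Test on the per-fold F1 scores of an
    LSTM-based model and its corresponding naive model."""
    statistic, p_value = wilcoxon(f1_lstm, f1_naive)
    return p_value, p_value < alpha
```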

4.2 Event-focused LSTM

4.2.1 Randomized Events

As can be seen in Table 4.1, the naive model performs significantly better when trying to detect randomized event anomalies, no matter which metric or feature combination is examined; the low recall values of the event-focused LSTM-based anomaly detector lead to low F1 scores as well.

When comparing the results between the different feature combinations, the event-focused LSTM scores the best precision when examining the [User, Time, Login Status] combination, while its most robust result, including its best F1 score, is found when examining the [User, Country, Login Status] combination. Meanwhile, it can be noted that the naive model produces the most robust results when examining the [User, Time, Login Status] and [User, Country, User Agent] combinations.


Feature Combination           Model   Precision (%)   Recall (%)   F1 (%)
User, Time, Login Status      LSTM    70.4            0.2          0.3
User, Time, Login Status      Naive   94.2            63.1         75.6
User, Country, Login Status   LSTM    47.9            1.5          2.9
User, Country, Login Status   Naive   70.2            24.0         35.7
User, Country, User Agent     LSTM    25.1            0.1          0.2
User, Country, User Agent     Naive   87.2            65.2         74.6

Table 4.1: The average metrics of the event-focused anomaly detection of randomized events after an 11-fold Rolling-Window Analysis.

[Figure 4.1: Box plot of the F1 scores (y-axis, 0–1) per feature combination and model (x-axis) obtained during an 11-fold rolling-window analysis of event-focused anomaly detection of randomized events. "UCS" represents the [User, Country, Login Status] feature combination, "UTS" represents the [User, Time, Login Status] feature combination and "UCUA" represents the [User, Country, User Agent] feature combination.]


4.2.2 Swapped Events

As can be seen in Table 4.2, the naive model is still significantly better at detecting anomalies no matter which metric or feature combination is examined, even though the gap between the results has decreased considerably. The increased difficulty of detecting swapped events compared to randomized events is clearly manifested in the results in general, as all metrics are significantly worse than for the anomaly detection of randomized events.

When comparing the results between the different feature combinations, the event-focused LSTM scores its most robust result when examining the [User, Country, Login Status] combination, where its best results are found in all metrics. Also in this case, it can be noted that the naive model produces the most robust results when examining the [User, Country, User Agent] combination.

Feature Combination           Model   Precision (%)   Recall (%)   F1 (%)
User, Time, Login Status      LSTM    13.8            0.5          0.9
User, Time, Login Status      Naive   27.8            21.7         24.3
User, Country, Login Status   LSTM    14.9            3.5          5.6
User, Country, Login Status   Naive   20.1            28.5         23.6
User, Country, User Agent     LSTM    10.2            1.1          1.9
User, Country, User Agent     Naive   22.0            43.9         29.3

Table 4.2: The average metrics of the event-focused anomaly detection of swapped events after an 11-fold Rolling-Window Analysis.


[Figure 4.2: Box plot of the F1 scores (y-axis, 0–1) per feature combination and model (x-axis) obtained during an 11-fold rolling-window analysis of event-focused anomaly detection of swapped events. The same feature combination abbreviations as in Figure 4.1 are used.]

4.3 Sequence-focused LSTM

4.3.1 Randomized Events

As can be seen in Table 4.3, the naive model scores the best precision in all feature combination cases. However, the significant difference in recall between the two models, where the sequence-focused LSTM clearly outperforms the naive model, results in a significantly better F1 score for the sequence-focused LSTM, which shows the most robust performance in all feature combination cases.

When comparing the results between the different feature combinations, the results are remarkably uniform no matter which feature combination and metric are examined. The sequence-focused LSTM performs slightly better when examining the [User, Country, User Agent] feature combination, while the naive model performs slightly better when examining the [User, Time, Login Status] and [User, Country, User Agent] feature combinations, but the differences are generally small.
