
Authentication Using Deep Learning on User Generated Mouse Movement Images

Olof Enström

Computer Science and Engineering, master's level 2019

Luleå University of Technology

Department of Computer Science, Electrical and Space Engineering


Abstract

Continuous authentication using behavioral biometrics can provide an additional layer of protection against online account hijacking and fraud.

Mouse dynamics classification is the concept of determining the authenticity of a user through the use of machine learning algorithms on mouse movement data. This thesis investigates the viability of state of the art deep learning technologies in mouse dynamics classification by designing convolutional neural network classifiers taking mouse movement images as input. For purposes of comparison, classifiers using the random forest algorithm and engineered features inspired by related works are implemented and tested on the same data set as the neural network classifier.

A technique for lowering bias toward the on-screen location of mouse movement images is introduced, although its effectiveness is questionable and requires further research to investigate thoroughly. This technique was named 'centering', and is used for the deep learning-based classification methods alongside images not using the technique. The neural network classifiers yielded single action classification accuracies of 66% for centering and 78% for non-centering. The random forest classifiers achieved an average accuracy of 79% for single action classification, which is very close to the results of other studies using similar methods. In addition to single action classification, a set based classification is made. This is the method most suitable for implementation in an actual authentication system, as the accuracy is much higher.

The neural network and random forest classifiers have different strengths. The neural network is proficient at classifying mouse actions that are of similar appearance in terms of length, location, and curvature. The random forest classifiers seem to be more consistent in these regards, although the accuracy deteriorates for especially long actions. As the different classification methods in this study have different strengths and weaknesses, a composite classification experiment was made where the output was determined by the least ambiguous output of the two models. This composite classification had an accuracy of 83%, meaning it outperformed both of the individual models.


Preface

I chose to work with machine learning for my master's thesis as I wanted to learn more about the field, and BehavioSec has been a great place for that. I would like to especially thank my external supervisor Per Burström for all the constructive discussions we've had and the continuous feedback he has given me.

I would also like to thank Philip Lindblad and Tony Libell for the advice and technical support they have provided over the course of the project. Finally, I would like to thank my internal supervisor Niklas Karvonen for his great feedback on report structure.

Olof Enström


Contents

1 Introduction
  1.1 Background and Related Works
  1.2 Problem Definition
    1.2.1 Delimitations
2 Theory
  2.1 General Concepts
    2.1.1 Solving an Optimization Problem Using Machine Learning
    2.1.2 Performance Metrics
    2.1.3 Hyperparameters
    2.1.4 Bias-Variance Tradeoff
    2.1.5 Cross Validation
    2.1.6 Ensemble Learning
  2.2 Classification Algorithms
    2.2.1 Decision Trees
    2.2.2 Bagging
    2.2.3 Random Forests
  2.3 Deep Learning
    2.3.1 Training a Neural Network
    2.3.2 Convolutional Neural Networks
    2.3.3 Convolution Layer
    2.3.4 Pooling Layer
  2.4 Anti Overfitting Strategies in Neural Nets
3 Method
  3.1 The Data Set
  3.2 Segmentation
  3.3 Baseline Classification using Random Forests
    3.3.1 Feature Extraction
    3.3.2 Classification
    3.3.3 Cross Validation
    3.3.4 Feature Importance
    3.3.5 Experiment Setup
  3.4 Deep Learning Classification
    3.4.1 Image Generation
    3.4.2 Generalization Methods
    3.4.3 Neural Net Design
    3.4.4 Experiment Setup
  3.5 Composite Classification
  3.6 Exploring Performance Differences
4 Result
  4.1 Random Forest
    4.1.1 Single Action Classification
    4.1.2 Set of Actions Classification
    4.1.3 Feature Importances
  4.2 Deep Learning
    4.2.1 Single Action Classification
    4.2.2 Set Based Classification
  4.3 Differences in Classification Results
    4.3.1 Classification Accuracy and Sampling Rates
    4.3.2 Differences for Unnatural Behavior
    4.3.3 Composite Classification
5 Discussion
  5.1 The Data Set
  5.2 Random Forest
  5.3 Deep Learning
6 Conclusions
7 Future Work
Appendices
A Initial Results, Random Forest


1 Introduction

User authentication is becoming increasingly important in our new digital society. In services ranging from online banking to social media, account hijacking can have serious consequences. As user credentials can be compromised in numerous ways, additional methods are necessary for reliable authentication.

Methods using physiological biometrics such as facial recognition or fingerprint scanning can be effective but require conscious effort from the user and are often only used once per session. A method which provides continuous authentication without disrupting the user is identity verification using behavioral biometrics. It is the concept of constantly verifying or rejecting a user's authenticity based on how they interact with their device instead of granting full account access after initial authentication.

Today there are systems in place (e.g. [1]) using behavioral biometrics to prevent fraud. An example of such a biometric is keystroke dynamics, the way a user interacts with their keyboard. Using keystroke dynamics, authentication systems can be enhanced to recognize how a user writes. A downside to using only keystroke dynamics for continuous authentication would be that it requires tasks to include keyboard usage, which is something not all tasks do. A behavioral biometric that can be used together with (or instead of) keystroke dynamics is mouse dynamics, the characteristic user interaction with a mouse device.

To classify the authenticity of the user based on these biometrics requires data collection from either a script running in a browser or a data collection application running on the user's computer. Having websites or programs collect data on keyboard and mouse usage does have ethical implications, but knowing exactly what the user is doing is unnecessary for the authentication. Behavioral biometrics like these make use of various types of machine learning to classify whether the user is who they claim to be. This essentially means training a model to recognize patterns and make predictions without explicitly programming what the patterns are (a very basic description; more in-depth explanations follow in the Theory section). This means it does not matter what a user is doing, only how they are doing it. Thus, the data collection of keyboard usage (which can contain sensitive data) can be completely anonymized. This way, which keys are pressed or which links are clicked is not visible, but the key presses and mouse strokes are distinguishable and can be used to differentiate between users.

Authentication using machine learning will inevitably include some false acceptance and false rejection, and for viability in real authentication systems these rates need to be kept low. If the false acceptance rate (FAR) is too high the authentication is essentially useless, as impostors would be accepted as real users, and if the false rejection rate (FRR) is too high the usability suffers too much, as authentic users would be denied access repeatedly. This makes continuous authentication through behavioral biometrics a field with great potential but high standards.

Even though successful models for classifying users based on behavior already exist, many methods have not yet been explored. Deep learning is a subset of machine learning which uses multi-layered artificial neural networks to learn complex relationships and patterns in data. Recent years have seen a big increase in popularity for deep learning in fields including biometrics and computer vision. A 2018 survey [2] concludes that deep learning approaches in biometric classification are outperforming previous state of the art models in modalities such as face and voice recognition, partly due to the strong feature learning present in deep neural networks. This survey also notes that the field of behavioral biometrics is relatively unexplored in terms of deep learning approaches. In the field of mouse dynamics classification, the potential of state of the art deep learning algorithms has not yet been explored.

This thesis aims to change this by investigating the potential of continuous authentication through mouse dynamics classification using state of the art deep learning technologies alongside other machine learning methods which have seen previous success in the field. As a basis for this, BehavioSec has provided a data set of mouse movements from 9000 users in a live production environment.

1.1 Background and Related Works

Mouse dynamics as a behavioral biometric was first proposed in a 2004 study [3] by Gamboa et al. and showed great promise. The feature extraction used in the study influenced papers like [4], [5] as well as this thesis. Metrics like acceleration, angular velocity, and curvature were extracted from the mouse data of each action, and statistical data like min, max, mean, and standard deviation of each metric were defined as features of the mouse stroke. The data set in the study was very task specific as it was collected from a memory game that users played in their browser. A data set generated through a specific task will henceforth be referred to as a task specific data set, as opposed to a general use data set. Task specific classification may be fitting for some use cases, but for continuous authentication general use data is needed, as the purpose is to authenticate users while they are exhibiting their normal behavior through arbitrary tasks and not performing a specifically chosen task.

The first mouse dynamics studies using a general use data set were Ahmed et al.'s study in 2007 [6] and Schulz's study in 2006 [7]. Many definitions and classes made in [6] are used in several subsequent studies in the field. These classes include the mouse movement categories PC, DD, and MM (point-click, drag-drop, and mouse-move), as well as 8 direction-classes of a mouse movement.

Ahmed et al.'s method for classification was based on explicit feature extraction for training a neural net. The feature extraction in this study was made on a session basis. Features included average speed, a direction histogram, and other averaged behavior over a session. As a result of this, much of the characteristic mouse behavior from the individual mouse stroke actions never made its way into the neural net. A full session of mouse actions was then used to test the classifier. While the classification was successful (2.4% FRR and FAR for a session with 3000 actions), the accuracy was heavily dependent on the length of the session. For a session consisting of 1000 actions, the best FAR and FRR obtained were 24% and 4.6%, respectively. In 2011, Ahmed et al. published the subsequent study [8] where their results are improved due to a decision making algorithm which only classifies a session as positive or negative once the confidence level reaches a certain threshold, which can take a varying amount of actions to attain. This study also relied on a large amount of actions before an accurate classification can be made.

Published around the same time as [6], Schulz's study [7] segmented mouse data into curves which were then grouped together to form classification histograms. One of the observations made in [7] was that certain users were significantly easier to accurately classify. Histograms made from 60 mouse curves had EERs (equal error rates, see 2.1.2) ranging from 1% to 37%. The main results in this study were average EERs of 24.3% and 11.1% for histograms made from 60 and 3600 mouse curves, respectively.

While [6]–[8] were successful in terms of classification, the authentication time that comes from collecting enough data to form histograms of those sizes may be too large for many cases. More recent studies like [4], [5], [9] have indicated that an accurate classification can be made on a fraction of the test data when extracting features on an action basis as opposed to a session-histogram basis.

The success of early studies like [3], [6] suggests that mouse dynamics has great potential as a behavioral biometric. The studies were not free from critique however, as Jorgensen et al. presented the study [10] arguing that environment variables in these studies have not been controlled enough to attribute the classification results to characteristic mouse behavior alone. [10] shows that the type of device (mouse or touchpad) used has a significant impact on the result. The study also mentions other environment variables whose effect on classification results can be investigated. Examples of these include screen resolution and mouse polling rate.

A study that took into consideration the possible effect of screen resolution and mouse polling rate was [9] by Zheng et al. This study notes that metrics like velocity and acceleration are correlated to factors such as mouse sensitivity and screen resolution. The study instead focuses on feature engineering from direction and curvature metrics, and presents a result of 1.3% EER on sets of 20 mouse actions ending in clicks. Similarly low EER results were achieved in 2018 by Antal et al. [5], who were using a publicly available data set [11]. The data set contains roughly 60 000 actions over 65 sessions from 10 users and is already split up into training and test sessions, which had a major impact on the classification accuracy in [5]. The study presents results from two different scenarios, A and B. Scenario A consists of splitting up the training sessions of the data set for both training and testing of their classification. For classification based on a single action, the scenario A average EER was 21%. Scenario B consists of using the training sessions for training the classifiers, and the testing sessions for the testing of the classifiers. The corresponding single action EER for scenario B was 30%. The main result presented is the classification results for sets of 20 actions, where scenario A and B had 0.04% and 18.8% EER, respectively. This could be seen as either a better or worse result than [9] depending on whether the results of scenario A or B are used for the comparison. The data set in [9] is not comprised of different sessions for training and test data, so scenario A is likely the better comparison. Environment factors are however not considered in the main results presented in [5], which can possibly be a reason for the better results.

The method in [5] included segmenting raw mouse data into three categories, each of which defines an action. These categories are the same ones defined in [6] (MM, PC and DD). In contrast to the session-histogram based feature extraction used in [6]–[8], this study used an action based feature extraction using features similar to the ones used in [3]. These features were then used to train a random forest classifier (see section 2.2.3). Just like [9], accurate classifications were made after only 20 actions, compared to the 3000 actions in [6] and [12]. When Antal et al. published this study, they also announced that they would proceed with investigating mouse dynamics using deep learning and compare their previous result to the result of a deep learning classification [13]. That study has not yet been published.

As previously mentioned, no studies from academia in the field of mouse dynamics have explored deep learning classification. It has however been attempted in private enterprise. In 2017, Splunk [14] posted an article about an image-based deep learning classification attempt. They represented their mouse data as high-contrast colored images, where the color represented speed, direction and acceleration. The full details of the implementation were not disclosed. These images were then used in a deep convolutional neural network (see section 2.3.3) which could extract useful features and subsequently classify the users behind the images. The data set used for user identification included 180 images from a specific user 'A' (positive data), and 180 images from 'other' users (negative data). The classifier achieved an accuracy of 79%. It should be noted that the data set used here is significantly smaller than in all other studies mentioned. Furthermore, the study was a small-scale article performed in private enterprise without disclosing the methods or data set. Thus it is not equivalent to a thorough study in the field.

This article is however one of the few examples that has used deep learning for mouse dynamics classification. The studies [6], [8] use a 3-layer neural net for classification; however, the network is not deep and the input consists of session based feature engineering and not generated images.

The studies [3]–[10], [14] have all used two-class (also known as binary) classification. Multiclass classification was attempted in [4] alongside binary and resulted in faster classification at the expense of accuracy. A study that used one-class classification (OCC, a type of anomaly detection) was done by Shen et al. in 2012 [12]. This study achieved a 7.78% FAR and 9.45% FRR for 5 minute sessions using a one-class support vector machine. Shen et al. followed up this study in 2013 and 2014 with [15], [16], presenting similar classification results in much shorter authentication time (11.8 s and 6.1 s, respectively), using anomaly detection in all studies. In [5], it is stated that binary classifiers perform better than one-class provided that negative training data is available. If only some negative data is available, it is not necessarily better to use a binary classifier. Studies like [17] have shown that if the ratio between positive and negative sessions is greater than 2.5-3.5:1, the performance of binary classifiers can deteriorate to the point where an OCC is preferred. As the data set used in this study has a vast amount of users, binary classification should outperform an OCC.

In summary, there are many studies on mouse dynamics classification indicating its viability as a method for continuous authentication. There are however no published studies using state of the art deep learning for mouse dynamics classification; moreover, there are no studies comparing manual feature extraction to features generated in convolutional neural networks. Therefore, the aim of this thesis is to study how deep neural networks perform compared to other machine learning methods in the case of mouse dynamics classification.

1.2 Problem Definition

Previous works in the field have shown that accurate classifiers can be made using machine learning methods with explicit feature engineering, and that deep learning needs to be explored further in mouse dynamics. As previously mentioned, this thesis will be using an in-house data set provided by BehavioSec comprised of 9000 users and 600 000 sessions. The aim is to use this data set and contribute to the field of mouse dynamics by investigating the classification performance of deep neural networks and comparing it to results attained using methods similar to related works. The plan to achieve this includes:

• Implementing classifiers using machine learning methods with explicit feature engineering similar to previous works in the field

• Investigating different mouse action representations suitable for deep learning


• Designing a deep learning model that accurately classifies mouse movement data based on said representation

• Comparing the performance of classifiers using explicitly engineered features similar to previous works and the deep learning model

The detailed strategy for accomplishing these goals is covered in section 3, followed by the result in section 4 and further discussion in section 5.

1.2.1 Delimitations

The scope of the thesis is to investigate the potential of deep learning technologies for mouse movement-based authentication. It does not include the implementation of a system that would make use of such an authentication method; it only serves as a proof-of-concept akin to other studies within the field. There is no attempt to collect any data or use any data set other than the in-house data set provided for this thesis, which itself has to be limited, as the amount of users is far too high for performing individual binary classification.


2 Theory

This section serves to give a basic overview of concepts necessary to understand the methods used and results obtained in this thesis. It will introduce machine learning, over- and underfitting, random forests, artificial neural networks, and more. The concepts and algorithms are discussed in a general context rather than the context of mouse dynamics classification, as the application of the concepts is covered in section 3. Although many of the concepts are much broader than discussed and could be explained further, only the parts relevant to the understanding of methods used in the thesis are covered.

2.1 General Concepts

Machine learning is a research field combining state of the art computer science and statistics to create models that can be trained to understand data, find patterns, and make predictions. Machine learning comes in many different forms, but this study only includes supervised learning, meaning the predictions made by the machine learning model can be compared to the true values and metrics such as accuracy can be measured. Differentiating between users is a classification problem, meaning the label values to be predicted are discrete. Other types of problems deal with predicting continuous values and are called regression problems. To solve a classification problem, a labeled data set is needed. The labels are simply the true classes of the data points. The data set also has to be split up into a training set, a validation set, and a testing set. The training set is used to teach the classifier the correlation between the input data and the true label; this way the model can learn useful patterns for classifying data it has not already seen. The validation set is used to tune hyperparameters of the model, explained more thoroughly in section 2.1.3. The purpose of the testing set is to input previously unseen data into the classifier without showing the true labels and see if it predicts the correct label. In binary classification, only two class labels exist. In the case of authentication, the labels are either 'authentic' or 'impostor'. Data belonging to an authentic user is henceforth referred to as 'positive' data and data belonging to an impostor is referred to as 'negative' data.

2.1.1 Solving an Optimization Problem Using Machine Learning

There are many different machine learning algorithms for solving a classification problem; however, they all share the same foundation. A labelled data set contains samples (x, y) where x is a set of features describing the data sample and y is the true label of the sample. A classification algorithm will, based on the feature values of all samples in a training set, try to separate the samples containing different labels as well as possible. Binary classification can be illustrated by mapping samples (based solely on their feature values) to a 2-dimensional plane where training samples are separated by a line called a decision boundary. This analogy is visualized in figures 2, 3, and 4. All samples that are mapped on one side of the decision boundary would be classified as one class, and all samples mapped on the other side would be classified as the second class. The goal of the model is to make the decision boundary separate the mapped samples of the two labels with the largest margin possible. To accomplish this, each sample has an 'ideal mapping' corresponding to its class label. If each sample (even beyond the training set) is mapped to its ideal mapping while only considering the feature values of the sample, the classification is perfect, which is rarely the case in real classification problems. It is often possible to perfectly map each training sample using mapping functions that consider many small variations. This is however undesirable, as it often leads to overfitting as discussed in section 2.1.4. Instead, the distance between the actual mapping and the ideal mapping is measured and a cost (or loss) is calculated. The optimization problem to be solved is to minimize the cost for each sample. Solving this optimization problem is done by passing through each sample in the training set, measuring its cost, and adjusting the algorithm parameters to minimize the cost. This is known as training a classifier. After the optimization is made, subsequent mappings of samples with unknown class labels are hopefully on the correct side of the decision boundary more often than not, thus resulting in a useful classification model. In practice, machine learning algorithms may work very differently from this analogy, but the underlying concept is similar.
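To make the decision boundary analogy concrete, the minimal sketch below trains a tiny binary classifier by gradient descent on a logistic loss; the synthetic data, learning rate, and number of epochs are hypothetical choices for illustration and not part of the thesis method.

```python
import numpy as np

# Toy training set: two features per sample, binary labels (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w = np.zeros(2)   # weights, tuned during training
b = 0.0           # bias term

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.1
for epoch in range(200):
    p = sigmoid(X @ w + b)                                     # prediction confidence for class 1
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # cost to minimize
    grad_w = X.T @ (p - y) / len(y)                            # gradient of the loss w.r.t. the weights
    grad_b = np.mean(p - y)
    w -= learning_rate * grad_w                                # adjust parameters to reduce the cost
    b -= learning_rate * grad_b

# The learned (w, b) define a linear decision boundary: w @ x + b = 0.
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {accuracy:.2f}, final loss: {loss:.3f}")
```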

2.1.2 Performance Metrics

When measuring the performance of a classifier, the most intuitive metric to look at may be the accuracy. Accuracy is the fraction of the true labels correctly predicted by the classifier. There are however several other important performance metrics when it comes to evaluating a classifier. The distribution of false positives and false negatives can be crucial depending on the classifier's purpose. A commonly used tool for evaluating performance is the receiver operating characteristic curve (ROC curve). It is a plot of the true positive ratio (TPR) and false positive ratio (FPR, equivalent to FAR) at different threshold values. These metrics are given by

TPR = TP / P,    FPR = FP / N,    (1)

where P is the amount of positive data points, TP is the amount of correctly predicted positive data points, N is the amount of negative data points, and FP is the amount of negative data points falsely predicted as positive. The threshold value in this context is the value a prediction confidence needs to exceed to be classified as positive. Depending on the nature of the classification problem, the importance of false positives and false negatives may differ greatly. In the case of authentication in sensitive use cases (like transferring money between bank accounts), having an impostor occasionally verified as an authentic user is likely considered worse than an authentic user occasionally being flagged as an impostor. If the difference in severity between false positives and negatives is big, different threshold values may be used for the classification.

Figure 1: Example of ROC curve.

Consider the example where a binary classifier predicts class labels y (0 or 1) of data samples X with values between 0 and 1. The output values of the classifier describe the prediction confidence of input sample X belonging to class 1. That is, if the classifier outputs the value 0.9 it means the classifier is 90% certain that X belongs to class 1. Similarly, if the output is 0.1 it means the classifier is only 10% certain that X belongs to class 1. If the threshold is set to 0.5 in this case, outputs below 0.5 are classified as 0 and outputs above 0.5 are classified as 1. An important metric that can be taken from the ROC curve is the area under the curve (AUC). The AUC describes the likelihood of the classifier output being higher for a randomly chosen positive sample than for a randomly chosen negative sample. It is often used to measure the performance of a classifier instead of accuracy, as the AUC does not depend on the specific classification threshold chosen. Figure 1 depicts the ROC curve for a classifier with 0.95 AUC.

Another commonly used performance metric is the equal error rate (EER). It is a metric describing the false positive rate and false negative rate at the threshold value where they are equal.
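As an illustration of these metrics, the sketch below computes an ROC curve, its AUC, and an approximate EER from a handful of prediction confidences; the scores and labels are made up, and scikit-learn is assumed here purely for convenience rather than being the tooling used in the thesis.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical classifier outputs (prediction confidences) and true labels.
y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9, 0.2, 0.55])

# One (FPR, TPR) point per threshold value.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))

# EER: the error rate at the threshold where FPR and FNR (= 1 - TPR) are equal.
fnr = 1 - tpr
idx = np.argmin(np.abs(fpr - fnr))
eer = (fpr[idx] + fnr[idx]) / 2
print(f"approximate EER: {eer:.2f} at threshold {thresholds[idx]:.2f}")
```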

2.1.3 Hyperparameters

Machine learning algorithms often have parameters whose effect on the performance is different depending on the scenario. Whether it is the nature of the classification problem, the features in the data set, or the amount of data, the performance may be very different depending on the choice of parameters. These are known as hyperparameters, and are set by the designer before starting the training of the model. In order to find the best hyperparameter values for the scenario, experiments are made. Models with different hyperparameter values are trained on the training set and then have their performance tested on the validation set. A validation set is used to test the performance of a model on data the model has not trained on, and should be a good indicator of how the model will perform on the test data. As a result of this, the model will be biased not only toward the training data, but the validation data as well. The test set is not used for any adjustments to the model, as the test data should remain completely unseen by the classifier until it is time to measure the final performance. Once the best hyperparameter values are found (i.e. the best performance on the validation set was reached), the test set is used to measure the performance of the final model.
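The sketch below illustrates this train/validation/test workflow by tuning a single hyperparameter, the maximum depth of a decision tree; the synthetic data, split sizes, and candidate depths are arbitrary example values, not settings from the thesis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for feature vectors with binary labels.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Split into training, validation, and test sets (60/20/20 here).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Try candidate hyperparameter values and keep the one that does best on the validation set.
best_depth, best_score = None, -1.0
for depth in [2, 4, 8, 16, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_depth, best_score = depth, score

# The untouched test set is only used once, for the final performance estimate.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("best max_depth:", best_depth, "test accuracy:", final_model.score(X_test, y_test))
```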

2.1.4 Bias-Variance Tradeoff

When training a classification model, the classifier is initially likely to have a heavy bias, making it ignore important features and relationships. For example, a correlation between a single feature value and the true label may be high and dictate the entire classification performance, even though considering other factors too may lead to a better result. This means the model used is too simple and is not considering enough factors when making its prediction (the model is underfitted). After training on a large data set with a variety of different feature values and relations, the model can instead show high variance, where small fluctuations in feature values can lead to different outputs. This means the model is too complicated and tailored to fit the training data perfectly (the model is overfitted). Overfitted models perform much better on training data than on test data. The key to good classification is having low bias and low variance; however, this is difficult as decreasing one usually means increasing the other. This is known as the bias-variance tradeoff. Figures 2, 3, and 4 illustrate an underfitted model, an overfitted model, and a model with a good fit.

Figure 2: An underfitted model. The circles and squares represent two different classes and the dividing line represents the model's decision boundary, where all points right of the line are classified as squares and all points left of the line are classified as circles.


Figure 3: An overfitted model.

Figure 4: A well fitted model.

In the case of decision trees (see section 2.2.1), the variance would increase (as the bias decreases) together with the depth of the tree, as the tree can model more complex relationships. The best maximum depth of a tree is commonly found through experiments where trees with different depths are trained on the same data and the classification performance of the trees is then tested with the validation set.

In the case of feed-forward neural nets (see section 2.3), there are hyperparameters of the network that affect the potential of high variance, for example the number of neurons and hidden layers. Adding more hidden layers to a neural net increases the potential variance of the model, and having too few hidden layers can result in the model having too high bias. During the training of a network, the network is trying to minimize the loss of the training data. If the network is complex (it has many parameters), the parameters can be overtrained even to the point where the training data is fully 'memorized' (the loss reaches 0 and the model can predict it with 100% accuracy), but predicting test data results in a much worse accuracy (sometimes as bad as random guess accuracy). Strategies to avoid high variance in deep neural nets are covered in section 2.4.


2.1.5 Cross Validation

A natural repercussion of training a classifier is that the classifier's performance is biased toward the data it is trained on. Given a set of training data, it is possible to get completely different results when training the classifier on the first half of the data and testing on the second half, and vice versa. The model may be biased toward a particularly strong feature-label correlation from the first half, or have high variance due to the lack of such a correlation. Classification results can differ greatly depending on the permutations of the data set. A common way to reduce the effect of this is cross validation, the concept of splitting the data into several subsets and testing the performance in different scenarios. A popular version of this is K-fold cross validation. This involves splitting the data set into k parts (also known as folds), and for i = 1, 2, ..., k, training the classifier on each fold except for fold i, which the classifier is tested against. This means that k different scenarios are created, each with a different classification result. The k results can then be averaged out to form the final result of the classification.

Figure 5: 4-fold cross validation. For each iteration, fold i (gray) is used for testing and all other folds (white) are used for training.

An example of 4-fold cross validation can be seen in figure 5.
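A minimal sketch of k-fold cross validation is shown below, assuming scikit-learn and synthetic placeholder data; the choice of classifier and k = 4 are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)  # placeholder data

kfold = KFold(n_splits=4, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kfold.split(X):
    # Train on all folds except fold i, test on fold i.
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

# The k per-fold results are averaged to form the final result.
print("per-fold accuracy:", np.round(scores, 3), "mean:", np.mean(scores).round(3))
```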

2.1.6 Ensemble Learning

Ensemble learning is the concept of using several models and combining them to form a meta-model stronger than the original ones. For example, combining several weak models (weak meaning high bias) biased toward different features can result in a model with a bias lower than the sum of its parts. It is also possible to combine several high-variance models to form a model with lower variance. The models combined should in practice be fast to train and test, otherwise problems with computation time can arise.


Figure 6: Hypothetical decision tree before and after the first split. After the split, the classifier can correctly predict the label ’survived’ of the input vector.

2.2 Classification Algorithms

There are many different machine learning algorithms that can be used for classification. A small subset of these algorithms is used in this study and is described in this section.

2.2.1 Decision Trees

A decision tree T is a tree graph where the leaves represent an output. In the case of classification, the leaves represent predicted classes and in the case of regression, the leaves represent the predicted feature value. The tree T is originally a single, useless node. Upon receiving training data however, more nodes are created through a 'split', and the path from the original node depends on a threshold value from a specific feature. After the splitting/training procedure is finished, a tree T_k is a prediction function

T_k(u) = p,    (2)

where u is the input vector and p is the predicted class (or value). Note that the index k is not relevant in the case of a single decision tree, but will be relevant in the description of models using several trees. Figure 6 is an example of a very simple decision tree inspired by a Kaggle challenge [18] based on the sinking of the RMS Titanic. Real decision trees are of course more sophisticated in terms of splitting and choosing thresholds. This example tree is a classifier meant to predict whether a passenger of the Titanic survived or not by only examining 3 features. This is what a decision tree can look like after one split. As it has barely had any training, the model has high bias, as it only considers the passenger age before making a prediction on whether the passenger survived or not. Given enough data, and provided that the max depth hyperparameter is high, this decision tree can grow to have high variance instead.

2.2.2 Bagging

Bootstrap aggregating, commonly referred to as bagging, is an ensemble learning algorithm that combines the result of many models (often high-variance decision trees) trained on very similar data sets. Given a training set X, sample sets X_k, k = 1, 2, ..., N_T (where N_T is the amount of models to be trained) are created by picking random sample elements from X, where the same element can be picked multiple times. This is repeated until X_k and X are of equal size. Each sample set is then used to train a model, and the output of the meta-model is the most common (or average) output of the individual models.

2.2.3 Random Forests

Random forests, formally introduced in [19] by Leo Breiman, use decision trees and bagging with the tweak of assigning each tree a subset of features to examine. If all features were examined by each tree, the strongest predictive features would dominate and trees would be strongly correlated. Correlation between trees would mean that they often predict the same class based on similar criteria, which defeats the purpose of having several trees. Having several trees trained on slightly different training sets and examining slightly different features are the key factors behind the strength of the model. The output of a random forest can be described as

h(u) = (1/N_T) Σ_{k=1}^{N_T} T_k(u)    (3)

for regression and

h(u) = mode({T_1(u), T_2(u), ..., T_{N_T}(u)})    (4)

for classification, where mode is a function extracting the most common value in a set. The performance of the random forest is dependent on hyperparameters including but not limited to the amount of trees N_T, the max tree depth, and the amount of features for each tree to examine.
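As a concrete illustration, the sketch below fits a random forest whose hyperparameters correspond to the ones mentioned above (amount of trees N_T, maximum tree depth, and amount of features examined per split); the specific values and the synthetic data are placeholders, not the configuration used later in this thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # amount of trees N_T
    max_depth=10,         # maximum depth of each tree
    max_features="sqrt",  # amount of features each split may examine
    bootstrap=True,       # bagging: each tree sees a bootstrap sample of the training set
    random_state=0,
)
forest.fit(X_train, y_train)

# The forest's prediction is the aggregated vote of its individual trees.
print("test accuracy:", forest.score(X_test, y_test))
```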

2.3 Deep Learning

Deep learning can be seen as a set of machine learning algorithms that depend on a lot of data to learn deep patterns in the input. More specifically, deep learning models follow an architecture inspired by the neural circuits in the brain and are often referred to as artificial neural networks. Artificial neural networks consist of interconnected 'neurons', which given an input z will produce an output y. While there are several different types of artificial neural networks, this section will only deal with feed-forward neural nets. In feed-forward neural nets the neurons are organized in layers, where the outputs of a former layer serve as input to the subsequent layer. Each neuron N is defined by its activation function f_N(z, w, b) = y, which takes not only the input z into account but also the weights w and bias b. Weights and bias are parameters specific to the neuron that are tuned through the training of the network, so the output of f_N will be different for the same input when the parameters have changed. The neurons within a layer often share the same activation function although their weights and biases are individual. Activation functions are non-linear functions, which is an important detail. If the network does not contain any nonlinearity, its ability to model a non-linear relationship between the input and true class is limited. An example of an activation function is the sigmoid, defined as

sigmoid(x) = 1 / (1 + e^(−x)).    (5)

Sigmoid has the value range (0, 1), which makes it a practical activation function for the last layer of a neural network for binary classification. This is because outputs of binary classifiers often are prediction confidences of the positive class ranging between 0 and 1, as discussed in section 2.1.2.

Figure 7: Basic fully connected feed-forward neural net with 2 hidden layers.

Figure 7 depicts a very simple fully connected neural net. For simplicity, the network only shows a few neurons per layer. Fully connected layers or networks simply imply that the output of each neuron serves as the input for all neurons in the subsequent layer. Initially, inputting a labeled training set to the neural net will result in arbitrary activation across the neurons. Because the data is labelled however, the values at the output neurons can be compared to the true labels, and a loss can be computed through a loss function chosen by the network designer. For binary classification, a popular choice of loss function is binary cross-entropy, defined as

L(y) = −(y · log(p) + (1 − y) · log(1 − p)),    (6)

where y is the true label and p is the predicted probability. For example, a true label 1 and a prediction of 0.6 will result in a correct classification, but the loss will still be 0.22. When all predictions coincide perfectly with the true labels, the loss function equals 0.

2.3.1 Training a Neural Network

With the goal of minimizing the loss, an optimization of the network parameters (all the biases and weights of the neurons) is made. The optimization considers the loss as a function of all network parameters, which can often be hundreds of millions (the popular image recognition network VGG16 [20] has over 130 million parameters). A gradient is then calculated through a method called backpropagation [21], which begins by computing the loss caused by the last layer and then uses the chain rule to propagate back through the network and calculate each parameter's effect on the loss. This gradient is used to update the parameters in the right direction for minimizing the loss function. The parameters are then updated and the training process can continue with another input sample. The process of minimizing this loss through gradient computation is called gradient descent. Different input samples will give different gradients, so through the training process the weights and biases will be tuned in different directions, but eventually (if the tuning parameters have suitable values) the gradient descent has converged and the loss is minimized. In other words, the network is better at classifying the input than before the training process.

There are many algorithms for gradient based loss optimization. These are referred to as optimizers, and are chosen by the designer. An example of a popular, relatively new optimizer is Adam [22] from 2014. Optimizers often come with hyperparameters that can be tuned freely. An important hyperparameter for an optimizer is the learning rate, which sets how much each parameter is tuned in the calculated direction. The learning rate is often low, preventing the network from making too radical changes at once. Neural networks are often trained in batches, where the gradients of each input are averaged together for the batch and only one parameter update is required per batch. The batch size has an effect not only on computation time, but on the network performance as well. This is because the gradient averaging serves to tune the parameters not only in one direction, but in the average direction of the batch. Tuning the learning rate and batch size carefully is important, as mistuning of these could easily result in the loss 'jumping over' the desired minimum. Since the network training is a careful and slow process, many epochs of training are often required to reach good results. An epoch is simply a pass of the entire training set.
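The sketch below expresses the training procedure described above with Keras: a binary cross-entropy loss, the Adam optimizer with an explicit learning rate, and training in batches over several epochs. The framework, layer sizes, and hyperparameter values are illustrative assumptions; the network actually used in this thesis is described in section 3.4.3.

```python
import numpy as np
import tensorflow as tf

# Placeholder training data: 1000 samples with 32 features and binary labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 32)).astype("float32")
y_train = rng.integers(0, 2, size=(1000,)).astype("float32")

# A small fully connected feed-forward network with a sigmoid output.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# The optimizer performs the gradient-based parameter updates; the learning
# rate controls how far each update moves the weights and biases.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Gradients are averaged over each batch; one epoch is a full pass over the data.
model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.2)
```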


2.3.2 Convolutional Neural Networks

In the domain of image recognition, convolutional neural networks [23] (CNNs, or ConvNets) are the state of the art. Images are often large in terms of input elements (a 224 x 224 RGB image has over 150 000 input elements), so normal fully connected neural nets would have trouble with the amount of parameters to tune. ConvNets use a few special layers to reduce the amount of features and generalize the characteristics of images in practical ways.

2.3.3 Convolution Layer

Like the name suggests, convolutional neural networks contain convolution layers. As inputs to convolutional layers often have 3 dimensions (Width x Height x 3 for RGB images), inputs to a convolution layer are referred to as volumes. A convolution layer works by computing dot products between filters and a small part of the input volume. What this does is compute a kind of correlation score between the filter and the small part of the volume. The size and number of filters are chosen by the network's designer; common sizes are 3x3 and 5x5 for pattern recognition. The filters also have matching depth with the input volume, however for explanation purposes it is easier to consider the convolutional layer in 2 dimensions, which is why the input volume is referred to as an image grid in the forthcoming example.

A convolutional layer splits up the input grid in patches, using a sliding window approach. The amount of patches is determined by the aforementioned filter size, as well as the stride and padding chosen. The stride defines the number of steps the sliding window moves on the grid each iteration, and the padding defines how much (or if) the input should be padded at the edges. Figure 8 displays a 3x3 filter moving across a 5x5 image with a depth of 3 and stride of 1.

For each patch of the image, a dot product is computed with each of the filters (note that even though we are considering these as matrices, the dot product is still defined as the sum of the element-wise multiplication). This is then followed by passing the result to an activation function with a bias. Consider the case of the filter

F = [ −1   4   3 ]
    [ −1   5  −5 ]
    [ −2   3  −2 ]

passing through the patch

P = [ 0  1  1 ]
    [ 0  1  0 ]
    [ 1  1  0 ].

The resulting dot product is 13, which will have a bias added to it before an activation function is applied. Convolution layers often use the rectified linear unit (ReLU) activation function, which is defined as

ReLU(x) = max(x, 0).    (7)

In other words, the output of a convolution using ReLU is only non-zero if the sum of the dot product and bias is greater than zero. In this example, the positive weights in the filter highly correlate with the positive values in the patch, which is the purpose of the convolution layer. This way, the bias controls how strong the correlation between the filter and patch needs to be to have a non-zero output. Many other input patches would result in a negative dot product and, depending on the bias, a nulled output. These weights in the filter and the corresponding bias are parameters of the network that are commonly changed during the training process. After all filters have passed through the entire image, the resulting output has a width and height corresponding to the amount of iterations it took to slide across the image, which is 3x3 in the case displayed in figure 8. To preserve the width and height of the input, the input needs to be padded unless the filter size is 1x1. A padding of 2 would preserve the dimensions in the case of figure 8. The depth of the output corresponds to the amount of filters used in the layer.

Figure 8: A 3x3 filter sliding across a 5x5 RGB image with stride 1. The highlighted squares are the 'patch' which is element-wise multiplied with the filter and summed up.
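The filter and patch computation from the example above can be reproduced with a few lines of NumPy, as sketched below; the bias value is an arbitrary illustration.

```python
import numpy as np

# The 3x3 filter and image patch from the example above.
F = np.array([[-1, 4,  3],
              [-1, 5, -5],
              [-2, 3, -2]])
P = np.array([[0, 1, 1],
              [0, 1, 0],
              [1, 1, 0]])

# "Dot product" here means the sum of the element-wise multiplication.
dot = np.sum(F * P)
print(dot)  # 13

# A bias is added and the result is passed through the ReLU activation.
bias = -5.0  # hypothetical bias value
relu = max(dot + bias, 0.0)
print(relu)  # 8.0
```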

2.3.4 Pooling Layer

Convolutional neural networks often include pooling layers in their model, as pooling layers provide a manner of feature selection where the output of the layer is smaller than the input. This can both reduce the computation times and increase the performance of the network, as less important features are dropped. Pooling layers have similarities to convolution layers in the sense that there is a sliding window with adjustable stride, but instead of using trainable filters and computing dot products the pooling applies a static operation that always produces the same output given the same input. A common operation used for this is simply extracting the maximum element in the patch. This is known as max pooling.
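A small NumPy sketch of 2x2 max pooling with stride 2 is given below; the input values are arbitrary.

```python
import numpy as np

x = np.array([[1, 3, 2, 0],
              [5, 4, 1, 1],
              [0, 2, 6, 3],
              [1, 2, 2, 9]])

# 2x2 max pooling with stride 2: each output element is the maximum of one patch.
pooled = np.array([[x[i:i+2, j:j+2].max() for j in range(0, 4, 2)]
                   for i in range(0, 4, 2)])
print(pooled)
# [[5 2]
#  [2 9]]
```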

2.4 Anti Overfitting Strategies in Neural Nets

As a neural network grows to have many trainable parameters in relation to the size of the data set, it can be prone to overfitting. Network designers can however take measures to avoid this. A commonly used anti-overfitting strategy is to use dropout layers [24]. A dropout layer will, at a rate chosen by the network designer, randomly drop neurons and their connections during training. As an effect of this, training on the same sample several times yields different results, as different neurons are dropped in the dropout layer. This prevents the network from tailoring the weights to perfectly predict the sample, and can lead to a better performance.

Another strategy is weight decay [25], which prevents weights from growing too large and thus creating dominating features. This is performed through the addition of a term to the loss function. This term penalizes large weights, which can help prevent the optimization process from overfitting while trying to minimize the loss. Dropout as well as weight decay are strategies that direct the network away from overfitting. There are other methods to prevent overfitting that do not explicitly change the network, such as early stopping. Early stopping simply stops training earlier than the specified max number of epochs, as a peak in performance has already been reached. This performance is often measured in average loss on the validation set, meaning the peak performance was reached at the minimum value of the validation loss. Training for many epochs is likely going to lead to overfitting, and early stopping can be a way of stopping the training before overfitting has a chance to occur. Figures 9 and 10 show examples using dropout and early stopping, respectively.

Figure 9: Neural net training using the dropout technique.


Figure 10: Neural net training where early stopping is used to restore the weights and biases from where the measured validation loss is at its lowest point.
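The sketch below shows how the three strategies of this section can be expressed in Keras: a dropout layer, L2 weight decay on a dense layer, and an early stopping callback that restores the weights from the epoch with the lowest validation loss. The rates, layer sizes, and data are illustrative assumptions, not the settings used in the thesis.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32)).astype("float32")    # placeholder data
y = rng.integers(0, 2, size=(1000,)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    # Weight decay: an L2 penalty on the weights is added to the loss.
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    # Dropout: randomly drops 30% of the neurons' outputs during training.
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping: halt when the validation loss stops improving and restore
# the weights from the epoch where it was at its lowest.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(X, y, validation_split=0.2, epochs=100, batch_size=32,
          callbacks=[early_stop])
```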


3 Method

3.1 The Data Set

The set of mouse data provided for this thesis consists of 1.9 million mouse movement objects made by 9200 users over 600 000 sessions. The data comes from live usage of an online bank; however, the final raw data set was compiled and ingested into a database at the very beginning of this project. A session is defined as an online banking session, such as visiting a website and performing arbitrary mouse movements. The session is over when the user leaves the website. The mouse movement objects for each session come from all mouse movements made on a specific web page, i.e. the only segmentation made on data for each session is when a user navigates to a new web page. Each data point (in addition to containing some metadata about the session) contains x, y, and t values, representing the screen position and time of a mouse movement.

The amount of data differs for each user, where some users have vastly more data than others. Statistics on sessions per user and objects per session can be found in Table 1.

Statistics on Data Set

Metric               Mean  Median  Stdev  Mode  Max   n>1000
Sessions per user    65    58      44     52    935   0
Objects per session  3     2       5      2     101   0
Objects per user     204   172     158    145   2356  28

Table 1: Statistics on sessions and actions in the full raw data set, rounded to the nearest integer.

As seen in Table 1, the average user can be expected to have around 150-200 recorded mouse movement objects. This is a very small number per user, and even though segmentation increases the amount of actions, the per-user amount of data is a fraction of the data used in studies such as [5], [6], [12]. There are however 28 users with more than 1000 mouse movement objects. Classifiers corresponding to each of these users would have a moderate positive training set and a huge selection of negative training sets, as the total amount of users is far higher than in any of the aforementioned studies. These 28 users will henceforth be referred to as the 28 'high-action' users. It is necessary to narrow down the data set not only because most users have too few data points, but because of the nature of the classification task. Since authentication is a binary classification problem, each user needs its own classifier, and training 9000 classifiers is not practically viable.

An important thing to note is that this data set is not split into training and test sessions like [11], where the test sessions look very different from the training sessions. This means that training and testing of classifiers on this data set will have to originate from the same sessions. The implications of this are discussed in section 5.2.

3.2 Segmentation

Raw mouse data may need processing before important characteristics can be extracted. Averaging values over a session like in [6] may lead to successful classification, but the results in [5] suggest that segmenting the mouse movements over a session into actions leads to much quicker identification of the user. As previously mentioned, the raw data set used in this study is comprised of mouse movement objects from users over sessions. These objects are however not segmented to the point where specific mouse strokes are noticeable. To clearly capture the characteristics of each mouse stroke, the data is segmented into actions of three categories. The categories are mouse-move, point-click, and drag-drop (the same categories as in [6]). Figure 11 shows an original mouse movement object that is further segmented based on the time between the movements.

The strategy used for segmenting the mouse movements into the MM, PC, and DD labels is based on two factors: time and mouse clicks. As the raw data set is just a collection of data points, mouse strokes have to be assembled as a collection of data points. When reading the stream of mouse events (x_k, y_k, t_k), k = 1, 2, ..., N for a user, mouse movement events with a small difference in time Δt are grouped together as a sequence which ends when Δt reaches a time threshold of 5 seconds, or when a mouse click event occurs. The categorization criteria of mouse actions are defined as:

• MM: All events of the sequence are mouse movement events, meaning no mouse click events are in the sequence.

• PC: The events of the sequence are mouse movement events, and the end of the sequence consists of the mouse click events 'mouse down' immediately followed by 'mouse up'.

• DD: The events of the sequence start with a ’mouse down’, followed by mouse movement events, ending with a ’mouse up’.

This inevitably leads to a few actions which are just stagnant mouse clicks (like segment 1 in Figure 11). These are easily filtered out and are not used in any of the final results. A simplified sketch of this segmentation procedure is given below.


Figure 11: Basic plots of original and segmented mouse movements
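In the sketch below, events are grouped into sequences that end when the time gap exceeds 5 seconds or when a click completes, and each finished sequence is labeled MM, PC, or DD (with stagnant clicks filtered out). The event representation, tuples of position, time, and event kind, is a hypothetical format chosen for the example and differs from the thesis's actual data objects.

```python
TIME_THRESHOLD = 5.0  # seconds between events that ends a sequence

def label(seq):
    """Label a finished sequence of (x, y, t, kind) events as MM, PC, or DD."""
    kinds = [e[3] for e in seq]
    if kinds.count("move") == 0:
        return None                          # stagnant click, filtered out
    if kinds[0] == "down" and kinds[-1] == "up":
        return "DD"                          # drag-drop: down, movements, up
    if kinds[-2:] == ["down", "up"]:
        return "PC"                          # point-click: movements ending in down + up
    if all(k == "move" for k in kinds):
        return "MM"                          # mouse-move: movement events only
    return None

def segment(events):
    actions, current, prev_t = [], [], None
    for x, y, t, kind in events:
        if prev_t is not None and t - prev_t > TIME_THRESHOLD and current:
            actions.append((label(current), current))   # a long time gap ends the sequence
            current = []
        current.append((x, y, t, kind))
        prev_t = t
        if kind == "up":                                 # a completed click also ends the sequence
            actions.append((label(current), current))
            current = []
    if current:
        actions.append((label(current), current))
    return [(lab, seq) for lab, seq in actions if lab is not None]

# Example usage with a few made-up events:
events = [(0, 0, 0.0, "move"), (2, 1, 0.1, "move"), (3, 1, 0.2, "down"),
          (3, 1, 0.25, "up"), (9, 9, 9.0, "move"), (10, 9, 9.1, "move")]
print([lab for lab, _ in segment(events)])   # ['PC', 'MM']
```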

3.3 Baseline Classification using Random Forests

As this study aims to compare previous strategies used in mouse dynamics classification to a deep learning approach, the first classification attempt uses methods similar to [5] to establish a baseline accuracy for the data set. This means the approach uses an action based feature extraction and an implementation of random forests for classification. Contrary to [5] however, this study is wary of environment factors and attempts to limit their effect on the final results (see section 3.3.1). The methods used in [5] are the most appropriate as they perform well on small amounts of data and the authors were very thorough in describing their methods for reproducibility purposes. Furthermore, the data for each user in this study is not enough to accommodate the methods from many other works in the field (e.g. [6]–[8]). The data set used for this is comprised of all actions made by the 28 'high-action' users defined in 3.1.

3.3.1 Feature Extraction

After segmenting the data as described in section 3.2, the data is ready for feature extraction. Feature selection can be difficult without prior knowledge of characteristic features in mouse dynamics. The feature selection in this study therefore takes inspiration from related works in the field, where [3] is the main source. Each action consists of n points $\{(x_1, y_1, t_1), (x_2, y_2, t_2), \ldots, (x_n, y_n, t_n)\}$, and from these points a large amount of features can be constructed. The initial features are taken from the time series metrics found in Table 2.


Metric | Definition
Horizontal velocity $v_x$ | $v_x = \Delta x / \Delta t$, $\Delta x = x_n - x_{n-1}$, $\Delta t = t_n - t_{n-1}$
Vertical velocity $v_y$ | $v_y = \Delta y / \Delta t$, $\Delta y = y_n - y_{n-1}$
General velocity $v$ | $v = \Delta s / \Delta t$, $\Delta s = \sqrt{(\Delta x)^2 + (\Delta y)^2}$
Acceleration $a$ | $a = \Delta v / \Delta t$, $\Delta v = v_n - v_{n-1}$
Jerk $j$ | $j = \Delta a / \Delta t$, $\Delta a = a_n - a_{n-1}$
Angular velocity $\omega$ | $\omega = \Delta \theta / \Delta t$, $\Delta \theta = \operatorname{atan2}(\Delta y_n, \Delta x_n) - \operatorname{atan2}(\Delta y_{n-1}, \Delta x_{n-1})$
Curvature $c$ | $c = \Delta \theta / \Delta s$

Table 2: Time series metrics from which features are extracted. atan2(y, x) is defined as the radian angle between the positive x axis and a vector going from (0, 0) to (x, y) in the Euclidean plane.
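As a rough illustration of how these time series metrics could be computed for one action with NumPy, a sketch follows. The array names, the zero-division guards, and the exact alignment of the differenced series are choices of this example, not necessarily those of the thesis implementation.

```python
import numpy as np

def time_series_metrics(x, y, t):
    """Compute the Table 2 metrics for one action, given coordinate and time arrays."""
    dx, dy, dt = np.diff(x), np.diff(y), np.diff(t)
    dt = np.where(dt == 0, 1e-9, dt)                   # guard against zero time steps
    ds = np.sqrt(dx ** 2 + dy ** 2)

    vx, vy = dx / dt, dy / dt                          # horizontal / vertical velocity
    v = ds / dt                                        # general velocity
    a = np.diff(v) / dt[1:]                            # acceleration
    j = np.diff(a) / dt[2:]                            # jerk
    theta = np.arctan2(dy, dx)
    dtheta = np.diff(theta)
    omega = dtheta / dt[1:]                            # angular velocity
    c = dtheta / np.where(ds[1:] == 0, 1e-9, ds[1:])   # curvature

    return {"vx": vx, "vy": vy, "v": v, "a": a, "j": j, "omega": omega, "c": c}
```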

As an action contains many values for these metrics, the statistical functions mean, maximum, minimum, and standard deviation are used to extract 4 features for each of these time series metrics. Other features extracted include the distance between endpoints D,

$$D = \sqrt{(y_n - y_1)^2 + (x_n - x_1)^2}, \qquad (8)$$

the sum of lengths L (total length),

$$L = \sum_{i=1}^{n} \Delta s_i, \qquad (9)$$

the total amount of points n, and the total time T,

$$T = \sum_{i=1}^{n} \Delta t_i. \qquad (10)$$
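Continuing the hypothetical time_series_metrics() helper sketched above, the full per-action feature vector could be assembled roughly as follows. The feature ordering and the handling of very short actions are assumptions of this example.

```python
import numpy as np

def feature_vector(x, y, t):
    """Build one feature vector for an action, mirroring the features described above."""
    features = []
    for series in time_series_metrics(x, y, t).values():
        if len(series) == 0:                          # very short actions yield empty series
            features += [0.0, 0.0, 0.0, 0.0]
        else:
            features += [np.mean(series), np.max(series),
                         np.min(series), np.std(series)]

    dx, dy, dt = np.diff(x), np.diff(y), np.diff(t)
    ds = np.sqrt(dx ** 2 + dy ** 2)
    D = np.hypot(x[-1] - x[0], y[-1] - y[0])          # distance between endpoints (8)
    L = ds.sum()                                      # total path length (9)
    n = len(x)                                        # total amount of points
    T = dt.sum()                                      # total time (10)
    return np.array(features + [D, L, n, T])
```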

To prevent the extracted features from being correlated to environment factors such as mouse sampling rate and screen resolution, a few measures are taken. As the real data points from different sessions had different mouse sample rates, the time between points $\Delta t = t_i - t_{i-1}$ can differ greatly. A solution to this is to resample all data points to the same frequency. A resampling to a frequency of 1000 Hz is performed through linear interpolation, using the real data points $(x_i, y_i, t_i),\ i = 1, 2, \ldots, n$ to represent points of a continuous function $f(t) = (x, y)$. The resampled points $(x_i, y_i, t_i)$ are then created through $f(t_i),\ i = 1, 2, \ldots, n_2$, where $\Delta t = 1$ ms for all points. Figure 12 illustrates an action before and after resampling.

Figure 12: Mouse action before (left) and after (right) resampling.
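A minimal sketch of this resampling through linear interpolation, combined with the screen-resolution normalization described below, might look as follows. The function and variable names, and the assumption that t is in milliseconds, are illustrative.

```python
import numpy as np

def resample_and_normalize(x, y, t, width, height, freq_hz=1000):
    """Resample one action to a fixed rate and scale coordinates to [0, 1]."""
    x, y, t = np.asarray(x, float), np.asarray(y, float), np.asarray(t, float)
    step = 1000.0 / freq_hz                       # 1 ms between resampled points at 1000 Hz
    t_new = np.arange(t[0], t[-1] + step, step)   # uniform time grid over the action
    x_new = np.interp(t_new, t, x) / width        # linear interpolation + normalization
    y_new = np.interp(t_new, t, y) / height
    return x_new, y_new, t_new
```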

The second measure taken is to normalize all points to screen resolution, meaning all x values are divided by the screen width and all y values are divided by the screen height, confining all x and y values to be between 0 and 1. These measures may decrease classification accuracy, but they strengthen the legitimacy of the model. The purpose of this thesis is to classify on behavior alone, thus additional accuracy attained through environment factors such as screen resolution should be avoided.

After an action has gone through the feature extraction, the resulting set of features can be referred to as a feature vector. In other words, each feature vector is a set of values together representing an action.

3.3.2 Classification

Once all the feature vectors are created, binary random forest classifiers for each of the 28 users are trained on both positive and negative data. Even though there are 28 users, the sole purpose of the classification is determining whether the given feature vector comes from the 'authentic' user, or an impostor. Thus, for each user $i = 1, 2, \ldots, 28$, N positive actions (belonging to user i) are chosen, and N negative actions (belonging to any user except for i) are chosen and used to train a classifier specific to the user i. Once the classifier is trained, feature vectors never seen before by the classifier are used to test the prediction accuracy.

In addition to classifying users based on single actions, a set based classification is made. That is, giving a single prediction based on multiple inputs, given that the inputs share the same class. As seen in section 1.1, nearly all studies present accuracies based on sets of actions. When trying to jointly classify a set of actions

$$S = \{u_1, u_2, \ldots, u_n\}, \qquad (11)$$

a few different approaches can be taken. One approach is to average the values of all features in each vector $u \in S$ to create the new vector v. This would however not fit our case since there is no guarantee the vectors in S represent the same type of action, and averaging could result in a type of action never before seen by the classifier. A more appropriate strategy would be to use prediction confidence values. Many classifiers offer a function which, given a feature vector, will return a set of confidence values C which represent the predicted probabilities p of the input belonging to the class on the corresponding index. Given an input vector u, the confidence values in a random forest are given by

$$C = f(u) = \{p_1, p_2, \ldots, p_{cn}\}, \quad p_i = \frac{1}{N_T} \sum_{k=1}^{N_T} V_k(u), \quad V_k(u) = \begin{cases} 1 & T_k(u) = i \\ 0 & T_k(u) \neq i \end{cases}, \quad i = 0, 1, 2, \ldots, cn \qquad (12)$$

where cn is the amount of classes and $N_T$ is the amount of trees in the forest. $T_k(u)$ is, as before, the class prediction of the tree $T_k$ on input vector u.

This equation simply says that the confidence values in random forests are the fraction of trees that predicted the corresponding class. From this equation, we can easily formulate the classification function (the random forest classification function defined in section 2.2.3 is equivalent to this; the purpose of this definition is to illustrate the relationship between confidence values and classification output):

$$h(u) = \operatorname{argmax}(f(u)), \qquad (13)$$

where argmax is a function returning the index of the largest value. In our case, confidence values $\{p_1, p_2\}$ would be returned from f, $p_1$ denoting the confidence value for class 0 (impostor), and $p_2$ denoting the confidence value for class 1 (authentic user). For example, $p_1 = 0.1$ and $p_2 = 0.9$ means that the classifier is 90% certain that the input belongs to the authentic user. The approach would then be to classify each vector $u \in S$ separately, summing up all confidence values and choosing the class with the higher confidence value as the prediction.

The classification of a set of actions is then defined as

$$g(S) = \operatorname{argmax}(C_{sum}), \quad C_{sum} = \sum_{k=1}^{n} C_k, \quad C_k = f(u_k), \qquad (14)$$

where $n = |S|$. Using this method, a set $S'$ of 3 vectors $\{u_1, u_2, u_3\}$, giving confidence values $f(u_1) = \{0.4, 0.6\}$, $f(u_2) = \{0.3, 0.7\}$, and $f(u_3) = \{0.95, 0.05\}$, would get the result $g(S') = 0$, as the sum $C_{sum}$ would be $\{1.65, 1.35\}$. This means the entire set would be classified as belonging to class 0.

A second method $g'$ that will have similar output to g is simply choosing the most common prediction out of the individual classifications, i.e.

$$g'(S) = \operatorname{mode}(\{h(u_1), h(u_2), \ldots, h(u_n)\}). \qquad (15)$$

In the example above, the outputs $g(S')$ and $g'(S')$ will however be different, as $g(S') = 0$ and $g'(S') = 1$. Comparisons between set based classification using g and $g'$ show that g slightly outperforms $g'$ and is therefore chosen as the set based classification algorithm for the study.
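A minimal sketch of both set classification strategies, assuming a fitted scikit-learn random forest classifier clf and a 2D array S whose rows are feature vectors from the same (unknown) user:

```python
import numpy as np
from collections import Counter

def classify_set_sum(clf, S):
    """g(S): sum the per-class confidence values over the set and take the argmax."""
    confidences = clf.predict_proba(S)            # one row of class probabilities per action
    return int(np.argmax(confidences.sum(axis=0)))

def classify_set_mode(clf, S):
    """g'(S): majority vote over the individual per-action predictions."""
    predictions = clf.predict(S)
    return int(Counter(predictions).most_common(1)[0][0])
```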

3.3.3 Cross Validation

10-fold cross validation is used for all random forest based classification results in this study. This is a good way of getting more consistent and reliable results without requiring additional data. The distribution of positive and negative actions remains equal in the individual folds.
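As a rough sketch of how this could be set up with scikit-learn, stratified folds are one way to keep the positive/negative distribution equal in each fold. The placeholder data, shapes, and the reduced number of trees are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data standing in for the real feature vectors and labels
# (1 = authentic user, 0 = impostor); shapes are illustrative.
X = np.random.rand(2000, 32)
y = np.array([1] * 1000 + [0] * 1000)

# StratifiedKFold keeps the class distribution equal across the 10 folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=100)  # fewer trees than the study, just to keep the sketch fast
scores = cross_val_score(clf, X, y, cv=cv)
print(f"mean 10-fold accuracy: {scores.mean():.3f}")
```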

3.3.4 Feature Importance

Examining feature importances can be helpful for understanding the classification model better. The model for feature importances used in this study is widely used and referred to as permutation importance. Permutation importance uses an already trained classifier and randomly shuffles the values of a feature within the set used to test the importance. If the feature was important, the shuffled values should result in a decrease in accuracy compared to classification on the real values. Weights for each feature are calculated this way, where higher weights imply more important features. The ELI5 [26] implementation of permutation feature importance is used to compute weights for each feature.
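A sketch of how ELI5's permutation importance could be applied here follows. The placeholder data, feature names, and the small forest are stand-ins for the study's real held-out feature vectors and trained classifiers.

```python
import numpy as np
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestClassifier

# Placeholder data; in the study X_test/y_test would be held-out feature vectors and labels.
X_train, y_train = np.random.rand(1500, 32), np.random.randint(0, 2, 1500)
X_test, y_test = np.random.rand(500, 32), np.random.randint(0, 2, 500)
feature_names = [f"feature_{i}" for i in range(X_test.shape[1])]

clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

# Shuffle one feature at a time on the test set and measure the resulting accuracy drop.
perm = PermutationImportance(clf, random_state=0).fit(X_test, y_test)
for name, weight in sorted(zip(feature_names, perm.feature_importances_),
                           key=lambda p: -p[1])[:10]:
    print(f"{name}: {weight:.4f}")
```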

3.3.5 Experiment Setup

For all the random forest based classification, the Python library Scikit-learn [27] is used. It is a powerful machine learning toolkit containing implementations of random forest, cross validation, feature importance, and more. As described in section 3.3.1, x and y values are normalized to screen resolution, and data points are resampled to a frequency of 1000 Hz. For single action classification, binary classifiers for each of the 28 'high-action' users are created and 1000 positive + 1000 negative actions are used for training and testing. The negative actions are evenly sampled from each of the 27 remaining users. For example, a binary classifier for user 5 has actions from user 5 as positive actions, and actions from users 1-4 and 6-28 as negative actions. All these actions are taken from a subset of the data where the total length of the action has to exceed a minimum threshold (≈ 0.1 in normalized units). This subset contains approximately 80% of the total actions for the 28 high-action users. This is done to prevent the inclusion of actions containing very little movement, like segment 1 in Figure 11. The set based test setup also uses 1000 positive and 1000 negative actions, together with the set classification strategy described in section 3.3.2.

Attempts to find the best hyperparameters for each classifier are made with cross validation and holding out a part of the data set to test the tuned classifier on. The hyperparameters used by the classifiers in the presented results are: number of trees $N_T = 1000$, maximum tree depth max_depth = None, and number of features each tree can examine max_features = $\sqrt{n}$, where n is the amount of features. Other hyperparameters are the default values for random forests in Scikit-learn.
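A sketch of how such a per-user experiment could be assembled, assuming the engineered feature vectors are available as a dict mapping user id to a 2D array; the data layout, helper name, and even sampling logic are assumptions of the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_training_set(feature_vectors, target_user, n_per_class=1000, seed=0):
    """Pick ~1000 positive and ~1000 negative feature vectors for one user.

    Negatives are sampled (approximately) evenly from all remaining users,
    mirroring the setup described above.
    """
    rng = np.random.default_rng(seed)
    positives = rng.choice(feature_vectors[target_user], n_per_class, replace=False)

    others = [u for u in feature_vectors if u != target_user]
    per_user = max(1, n_per_class // len(others))
    negatives = np.vstack([rng.choice(feature_vectors[u], per_user, replace=False)
                           for u in others])

    X = np.vstack([positives, negatives])
    y = np.array([1] * len(positives) + [0] * len(negatives))
    return X, y

# Hyperparameters from the text: 1000 trees, unlimited depth, sqrt(n) features per split.
clf = RandomForestClassifier(n_estimators=1000, max_depth=None, max_features="sqrt")
```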

3.4 Deep Learning Classification

As described in section 2.3.3, convolutional neural nets are good at finding patterns in images. The deep learning approach taken is to generate images from the data set and train a ConvNet as a binary classifier. As the data set consists of mouse movement points on a screen, an image representation of the actions is trivial to create. There are however different approaches to choosing the best representation format.

3.4.1 Image Generation

Representing a mouse action in an image can be performed in many different ways. Since it is difficult to know exactly how a neural network will react to different approaches without explicitly testing them, different strategies simply have to be tried and evaluated. Intuitively, the background of the image would be a single uniform color, and the data points of the mouse action would be represented on the grid with a different color than the background. The image size 224 x 224 has been used for other popular networks (e.g. [20]) and is large enough to capture the characteristics of the mouse actions; it is therefore chosen as the image resolution. The following subsections explore a few different methods, starting simple and cumulatively growing more sophisticated.

Method 1   The first approach consists of filling the 224 x 224 grid with black pixels, represented with RGB values (0, 0, 0). All locations $(x_i, y_i)$ of mouse events in the action are then placed on the grid as white pixels (1, 1, 1). The locations have to be translated to fit the grid's size, and this translation is done as

$$x'_i = \frac{x_i}{width} \cdot 224, \qquad y'_i = \frac{y_i}{height} \cdot 224, \qquad (16)$$

where $(x'_i, y'_i)$ is the image grid location of point $(x_i, y_i)$. width and height are the screen width and height for the session the action is taken from.
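A minimal sketch of this first image generation method, with NumPy arrays standing in for the image grid (function and variable names are illustrative):

```python
import numpy as np

def action_to_image(x, y, width, height, size=224):
    """Render one mouse action as a size x size black image with white event pixels.

    x and y are raw screen coordinates of the action's events; width and height
    are the screen resolution of the session the action is taken from.
    """
    img = np.zeros((size, size, 3), dtype=np.float32)            # black background (0, 0, 0)
    cols = np.clip((np.asarray(x) / width * size).astype(int), 0, size - 1)
    rows = np.clip((np.asarray(y) / height * size).astype(int), 0, size - 1)
    img[rows, cols] = 1.0                                        # white pixel (1, 1, 1) per mouse event
    return img
```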
