
UPTEC F 18049

Degree project, 30 credits, 15 August 2018

Using XGBoost to classify the

Beihang Keystroke Dynamics Database

Johanna Blomqvist


Faculty of Science and Technology, UTH unit

Visiting address:
Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0

Postal address:
Box 536, 751 21 Uppsala

Telephone:
018 – 471 30 03

Fax:
018 – 471 30 00

Website:
http://www.teknat.uu.se/student

Abstract

Using XGBoost to classify the Beihang Keystroke Dynamics Database

Johanna Blomqvist

Keystroke Dynamics enables biometric security systems by collecting and analyzing computer keyboard usage data. There are different approaches to classifying keystroke data, and a method that has been gaining a lot of attention in the machine learning community lately is the decision tree framework XGBoost. XGBoost has won several Kaggle competitions in the last couple of years, but its capacity in the keystroke dynamics field has not yet been widely explored.

Therefore, this thesis has attempted to classify the existing Beihang Keystroke Dynamics Database using XGBoost. To do this, keystroke features such as dwell time and flight time were extracted from the dataset, which contains 47 usernames and passwords. XGBoost was then applied to a binary classification problem, where the model attempts to distinguish keystroke feature sequences of genuine users from those of 'impostors'. In this way, the ratio of inaccurately and accurately labeled password inputs can be analyzed.

The results showed that, after tuning of the hyperparameters, XGBoost yielded an Equal Error Rate (EER) of at best 11.52%, 0.31 percentage points better than the SVM used in the original study of the database, and a highest AUC of 0.9792. The scores achieved in this thesis are, however, significantly worse than many others in the same field, but so were the results in the original study. The results also varied greatly depending on the user tested. These results suggest that XGBoost may be a useful tool, that it should be tuned, but that a better dataset should be used to sufficiently benchmark the tool. Also, the quality of the model is greatly affected by variance among the users. For future research purposes, one should make sure that the database used is of good quality. To create a security system utilizing XGBoost, one should be careful about the setting and quality requirements when collecting training data.

ISSN: 1401-5757, UPTEC F 18049
Examiner: Tomas Nyberg
Subject reader: Michael Ashcroft
Supervisors: David Strömberg & Daniel Lindberg


Popular Science Summary

Today, the majority of companies and private individuals use computers and databases to protect assets and information. It is therefore more important than ever to have secure systems that correctly verify that the right people get into these systems. We are used to using, for example, physical keys and passwords, but so-called biometric solutions are becoming increasingly interesting. They are based on the fact that biological markers, such as fingerprints, are unique to each individual. One step further is behavioral biometrics, the idea that we also have unique behaviors, such as writing style and movement patterns. This study has looked at one such area, so-called Keystroke Dynamics, which builds on the fact that we all type with a different rhythm on a keyboard when we use a computer. The idea is that in order to get into a system, one should not only need access to the correct password, but also need to type it in the way that belongs to the login. To create such a system, machine learning can be used: a model is fed examples of how people type, and the idea is that the model will learn to recognize what distinguishes them.

There are many different theories that can be used to do this, and this study has used the relatively new and critically acclaimed XGBoost. XGBoost is a tool based on decision trees, where data is categorized by passing through a 'tree' of relevant questions. The dataset used in this project is the open 'Beihang Keystroke Dynamics Database'.

The study showed, somewhat disappointingly, that XGBoost was roughly as good as other machine learning models on the same dataset. The conclusion drawn was that this is probably because the dataset was too small. In the future, research should look further at XGBoost and its potential for Keystroke Dynamics, and should focus on creating a large dataset that can be used in all research.

The principle behind Keystroke Dynamics was used as early as World War II, when telegraph operators began to recognize each other by the rhythm tapped out on the telegraphs when Morse code was sent. When computers later made their entrance, attempts were made to apply this principle to keyboards, and in 1985 David Umphress and Glen Williams showed that 'keyboard profiles' are unique. Several studies on the subject have appeared since then, but a general problem in the field is that there is no recognized dataset to do research on (and thus make comparisons easy). One reason for this is that there are many variants of systems. Data collection can be done in a lab, or over the internet in the participants' homes. The text they enter as samples can be long or short. They may choose the text themselves, or everyone types the same text. Language can of course also make a difference.

Collecting data also takes time, and due to time constraints this study chose to use an existing dataset. The Beihang Keystroke Dynamics Database, Dataset A, consists of 47 participants who, 4-5 times, typed a self-chosen password tied to a unique username on keyboards in an internet cafe. The participants were also given access to other participants' passwords and provided samples of typing these, to imitate an 'attack'. This dataset was chosen because it was easily accessible, and because it was deemed interesting to study free-text databases in a commercial environment, precisely because that reflects reality best. XGBoost was deemed interesting to investigate because it has not yet been used in the Keystroke Dynamics field, and because it has achieved remarkably good results in other contexts and won industry prizes.

Once the data is available, it is reduced to a number of so-called features. The features chosen in this study are dwell time (how long a key is held down) and four variants of flight time (the time between two keystrokes). By taking the mean of these times over one password entry, feature sequences of five values are created for each entry, regardless of password length. The users are then split into two groups, one whose data is used to train the model, and one used to test it. This is so that the model is not tested on data it has already seen.

The XGBoost model trains by looking at the feature sequences and trying to set up the right rules for deciding whether an entry belongs to the genuine user or is an attack (someone who has obtained the correct username and password). It does this by looking at the difference between a pair of feature sequences, trying to determine what distinguishes one user's rhythm from another's, and what differences are required for two entries to be categorized as different. During testing, these learned rules are used to compare an entry (which is either an 'attack' or 'genuine') with an entry known to belong to the genuine user, deciding whether the difference means that the entry is genuine or an attack. The problem thus becomes a so-called binary classification: the model only needs to learn to say 'genuine' or 'impostor' when given a pair of feature sequences.

During testing, some statistics were produced. A measure called the 'false acceptance rate' had a mean of 19.75%, and the 'false rejection rate' had a mean of 19.30%. The 'equal error rate' (EER), where the false acceptance and false rejection rates are set equal, landed at 19.75%. These numbers are high for a security system, if one compares, for example, with fingerprint scanning, which has 0.02%, and with other studies within Keystroke Dynamics. However, it is important to remember that step 1 in a system like this is having access to the password (not just access to a finger). Even though the results were worse than in other studies, they were only somewhat worse than the original study on the same dataset, which had an EER of 11.83%. This led to the conclusion that the big problem was not XGBoost, but the dataset itself.

These results thus indicate that what matters most for the security of a biometric system is the dataset. The dataset must be large enough for a machine learning model to train on sufficiently many varied sequences to learn a sufficiently general set of rules. 47 users turned out to be too few. I believe that Keystroke Dynamics may become a good option to use in security contexts, preferably together with other systems, such as passwords or tokens. I also believe that the principle has great potential for use in smartphones, which already have several sensors built in.

XGBoost should definitely continue to be investigated in the future, and research should focus on creating a large dataset that can be used for benchmarking.


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Background
    1.2.1 Biometrics and Keystroke Dynamics
    1.2.2 Classification methods and XGBoost
  1.3 Scope
  1.4 Thesis Outline

2 Theory
  2.1 Authentication systems
    2.1.1 Authentication, Verification, and Identification
    2.1.2 Static vs Continuous
    2.1.3 Systems overview
  2.2 Keystroke Dynamics
    2.2.1 Keystroke Dynamics Data
    2.2.2 Features
  2.3 Supervised Learning
    2.3.1 Data Leakage and Cross-Validation
    2.3.2 Classification
    2.3.3 Training
  2.4 Decision Trees and XGBoost
    2.4.1 Decision Trees
    2.4.2 XGBoost
    2.4.3 XGBoost Hyperparameters
  2.5 Testing
    2.5.1 Feature Importance
    2.5.2 FPR and FNR (false positive/negative rate)
    2.5.3 TPR and TNR (true positive/negative rate)
    2.5.4 ROC (receiver operating characteristic)
    2.5.5 AUC (area under curve)
    2.5.6 EER (equal error rate)

3 Implementation
  3.1 Software
  3.2 Dataset
    3.2.1 Characteristics of the data
    3.2.2 Issues
    3.2.3 Motivation
  3.3 Feature Extraction
  3.4 Training
  3.5 Testing

4 Results
  4.1 Tuning of the hyperparameters
  4.2 Feature importance
  4.3 AUC-scores
  4.4 EER, FAR, and FRR
  4.5 ROC-curves

5 Discussion

6 Conclusions

7 Future Work

8 Acknowledgments

9 Appendix
  9.1 Feature importance
  9.2 AUC Scores


1 Introduction

Being able to provide and verify identity is a vital part of a functional and secure society.[29]

People are asked to identify themselves multiple times a day, when doing everything from unlocking personal phones and using public transport, to making bank transactions and gaining access to secure systems at workplaces. Naturally, this leads to both commercial and scientific interest in the field of authentication. There are numerous methods of establishing identity and rights to information. Methods are usually divided into three different areas, which can be used separately or together to complement each other: Those based on possession (keys and tokens), knowledge (passwords and codes), and lastly, biometrics.

Biometric systems are based on biological traits, which are often described as being more difficult to copy, learn, or obtain for illegitimate users than for example tokens or passwords.

There are two types of biometrics: physical biometrics, which analyzes physical characteristics, such as scanning of fingerprints and retinas, and behavioral biometrics, which analyzes traits associated with behavior.[2] Systems utilizing physical biometrics are widely in use, even in personal products such as phones. Behavioral biometrics, on the other hand, are still relatively unexplored, but technology is beginning to catch up with theories. One emerging technique is 'keystroke dynamics'.

Keystroke dynamics utilizes keyboard typing patterns, which are believed to be unique enough from person to person to form a basis for identification. The data collected can be of different types, for example pressure data or which keys are being pressed, but the most common is rhythm. A rhythm profile for an individual is constructed from time stamp data, building features from data representing the pressing and releasing of the keys. The simplest features are known as 'digraphs', time relationships between two actions on the keyboard. There are two major digraph categories: 'dwell time' and 'flight time', representing how long a key is pressed and how long it takes to find the next key, respectively.[1] Flight time can be defined in a number of different ways, and a schematic of five common features can be seen in Figure 1.

Figure 1: The five common keystroke features utilized in keystroke dynamics, demonstrated by the key stamps from pressing two consecutive keys 'J' and 'Y'. 'D' is for dwell time and represents how long a key is pressed down. 'F' is for flight time and represents four different interpretations of the time in-between pressing two keys: 1. releasing key J and pressing key Y; 2. releasing key J and releasing key Y; 3. pressing key J and pressing key Y; and 4. pressing key J and releasing key Y.[26]
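To make these feature definitions concrete, the following is a small illustrative Python sketch (not the thesis implementation); the event format, a list of (key, press time, release time) tuples, is an assumption made only for this example.

```python
# Each keystroke event is assumed to be (key, press_time, release_time) in milliseconds.
keystrokes = [("J", 0, 95), ("Y", 160, 240)]

def digraph_features(k1, k2):
    """Dwell times and the four flight-time variants of Figure 1 for two consecutive keys."""
    _, p1, r1 = k1
    _, p2, r2 = k2
    return {
        "dwell_1": r1 - p1,   # D: how long the first key is held
        "dwell_2": r2 - p2,   # D: how long the second key is held
        "flight_1": p2 - r1,  # F1: release of first key to press of second
        "flight_2": r2 - r1,  # F2: release of first key to release of second
        "flight_3": p2 - p1,  # F3: press of first key to press of second
        "flight_4": r2 - p1,  # F4: press of first key to release of second
    }

print(digraph_features(keystrokes[0], keystrokes[1]))
```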


To decide whether an individual trying to access the system is authorized to do so, the system compares the recorded rhythm with previous samples provided by the genuine user.[2]

Both Teh et al. and Ali et al. divide different approaches to this comparison into two categories: Statistical and machine learning approaches. Note that they have not used a formal definition of the terms, but rather a natural division of methods. Examples of statistical methods include common measurements such as mean, median, and standard deviation. Probabilistic modeling methods based on Gaussian distribution, such as Bayesian and Gaussian Density functions, as well as cluster methods based on homogeneous clusters, such as K-means are also categorized as statistical. The most used statistical methods are those based on pattern distance, such as Euclidean and Manhattan distance.[26][2] Examples of machine learning methods are Neural Networks, Decision Trees, Fuzzy Logic, and Support Vector Machines (SVM).[2]

An implementation of decision trees which is relatively new is XGBoost, or 'extreme gradient boosting', presented by Tianqi Chen and Carlos Guestrin in 2014.[6] XGBoost has raised a lot of interest in the general machine learning community, and has for example performed very well in recent 'Kaggle' competitions.[6]

Security systems based on keystroke dynamics are of interest not only because biometric systems in general are believed to be more secure, but also because they are cost-effective compared to other biometric systems, as the only products needed are a keyboard and software to carry out the authentication. As Ali et al. comment, it is also a noninvasive method with regard to users, as it only requires typing information (as opposed to fingerprints, eye scans, or DNA samples, which require more, and possibly more personal, information from the user).[1] Attempts to classify keystroke dynamics data using decision trees have proved successful in the past; however, no research has emerged where XGBoost is used to classify the data.[1]

Therefore, this study has attempted to classify the Beihang Keystroke Database, a commercially accessible database consisting of usernames and passwords, created by Li et al. in 2011, with XGBoost. The purpose has been to investigate whether XGBoost is a good approach to keystroke dynamics problems. Therefore, this study has not collected any data, nor created a working security application, but has classified data from an existing dataset and analyzed its performance.

1.1 Problem Statement

The purpose of this report is to examine whether the framework XGBoost can be used to classify the data provided in the Beihang Keystroke Dynamics Database according to user.

Ideally, the classification could be used for keystroke dynamics security systems, and the model should produce an accuracy higher than that provided by traditional frameworks, such as SVM and Neural Networks, results for which were published along with the database by Li et al. in 2011.

1.2 Background

This section of the report sets the study in context by reviewing previous studies in the area. First, studies about keystroke dynamics in general are reviewed, then those looking at the classification problem, and some comments are made about XGBoost. Lastly, details of the scientific scope and the outline of the report are presented.

1.2.1 Biometrics and Keystroke Dynamics

The concept of Keystroke Dynamics began in 1899: William Lowe Bryan and Noble Harter wrote an article titled 'Studies on the Telegraphic Language: The Acquisition of a Hierarchy of Habits', which was published in The Psychological Review, where they remarked on the individual and unique typing patterns of telegraphists.[19] The idea resurfaced in World War II, when telegraphists would identify the sender of a message based on punching rhythm.[16]

In 1985, David Umphress and Glen Williams showed that keystroke profiles are unique enough to provide an effective protection, especially together with other means of identification, such as a password.[28] In 1995, Shepherd authored comprehensive lecture notes where he outlined the possibilities of keyboard data, the core of keystroke dynamics. He demonstrated that keyboards enable keystroke data to be monitored and time-stamped, allowing for logging of press and release events.[24]

Important work was also done in 2000 by Fabian Monrose and Aviel Rubin when they published the paper 'Authentication via Keystroke Dynamics'. They outlined a framework for the field of keystroke dynamics. They discuss that keystroke dynamics systems can be either static or continuous. Static systems only act during a specific time interval, for example during a login phase, so that the user needs a username, a password, and the right typing rhythm in order to pass through. Continuous systems check keystroke patterns throughout the session to authenticate the user and can label it as an impostor (and take appropriate measures) at any time. Systems which lie in-between static and continuous are also possible.[20] This will be more thoroughly discussed in the theory section of this report.

In their study, Monrose and Rubin collected typing data from 63 different users over 11 months. They examined how a good system should be designed in order to detect anomalous data, and concluded that individualized thresholds, specific to the users of the system, should be implemented rather than a general one. For classification, KNN was applied to cluster similar feature sets of so-called digraphs (common pairs of letters in a text) to distinguish between users. Due to 'the superior performance of Bayesian-like classifiers', they decided to implement their own distance measure based on a Gaussian distribution (details can be studied in their paper), which achieved an identification accuracy of 92.14%. Monrose and Rubin finish by recommending working with 'structured' text (the text typed is the same for all participants in the study) over free text (the participants are free to choose what text to type, sometimes within certain restrictions). Although typing rhythms have been proven to be individual enough to provide reliable identification profiles, Monrose and Rubin suggest that security systems should use keystroke dynamics in combination with other methods (such as tokens), as 'slight changes in behavior are inevitable'. This conclusion was supported by Unar et al. in their review 'A review of biometric technology along with trends and prospects', published in 2014.

In 2002, Bergadano et al. published a paper called 'User Authentication through Keystroke Dynamics', which presented results from a study where 44 individuals provided samples from typing a structured text 682 characters long (in Italian), as well as a shorter (English) text. The data was analyzed by a distance algorithm that considered the relative distance between trigraphs (similar to digraphs, but with three letters). Bergadano et al.'s method consisted of two steps: user classification, where training samples are classified, and the user authentication step, where each user is 'attacked' by samples provided by other authenticated users of the system, as well as an additional 110 other individuals. In their testing phase, an attempt had to be sufficiently close to the behavior of the claimed user in order to be classified as such, not only far away from the other profiles. Bergadano et al. report a 'false rejection rate', or FRR, of 4% and a 'false acceptance rate', or FAR, of <0.01%.[Bergadano] (Details of FRR and FAR are explained in the 'Theory' chapter.)

The Beihang Keystroke Dynamics database was studied by its creators in the paper 'Study on the Beihang Keystroke Dynamics Database', where they applied five different SVMs, a Gaussian method, and a Neural Network classifier. Their EER results ranged from 11.8327% (made with one of the SVMs) to 20.7295% (the NN classifier).[18]

1.2.2 Classification methods and XGBoost

Which classification method is best suited for keystroke dynamics depends on several variables: for example, the amount of data, the number of users, the type of input (long, short, free, fixed), and what type of authentication one wishes to implement (static, continuous).

There have been a number of reviews that aim to give the reader an overview of the field of keystroke dynamics; examples are Karnan et al. in 2011, Teh et al. in 2013, and Ali et al. in 2017. This section will start by giving a brief summary of their conclusions regarding classification methods, followed by a brief history of XGBoost. All reviews stress that the main issue of the field is that there is as yet no benchmarking dataset for keystroke dynamics, nor a set evaluation method, which aggravates comparisons between results.[16][26][2]

Teh et al. reviewed around 70 studies, concluding that among these, statistical approaches were most commonly used (in 61% of the cases), with machine learning methods second (37%). Out of the machine learning methods, neural networks were by far the most common, followed by what they call 'generic methods', with decision trees in third place. Some of the most successful results achieved with machine learning methods were made with a support vector machine (FAR of 0.76%, Azevedo et al, 2007) and a naive Bayesian method (EER of 1.72%, Balagani et al, 2011). The most successful attempts using decision trees achieved an EER of 2% (short input, Nonaka and Kurihara, 2004) and a FAR of 0.88% (long input, Sheng and Phoba, 2005) (mentioned also by Ali et al). (EER, or 'equal error rate', is defined as the threshold where a value maps to a positive label that satisfies FAR = FRR. Details are provided in the 'Theory' chapter.)

Karnan et al. reviewed approximately 30 studies in the field. From their work, one can conclude that the most successful machine learning attempts used neural networks (FAR of 1%, Cho et al, 2000) and potential functions with Bayes decision rule (FAR of 0.7%).

The most successful attempt using random forest reached a FAR of 14% (Bartlow and Cukic, 2006).[16]

Ali et al. reviewed approximately 80 studies. Out of the machine learning methods, a study made with a random forest decision tree approach produced very promising results of a FAR of 0.03% (Maxion and Killourhy 2010). However, this was done on data from digit input (instead of text).

As stated, XGBoost is a tool that implements 'gradient boosting', first introduced by Jerome H. Friedman in 1999.[8] The idea is based on the wish to work with simple 'weak learners' (models that are only 'slightly better than guessing' (p. 337)[15]), in order to shorten computation time. Their performance alone is not enough for making good predictions, so something had to be done to optimize them. From this ambition, the idea of 'boosting' was developed. In essence, boosting is similar to the practice of 'bagging' in the sense that it combines the predictions from several of these weak learners to make a stronger one.[6] However, unlike bagging, boosting creates these learners sequentially and typically with adaptive training sets.[3] Mathematical details about XGBoost are provided in the 'Theory' chapter.

1.3 Scope

This study has only aimed to analyze user identification via typing rhythm, and nothing else, to enable a static authentication system. (Other studies aim to analyze further characteristics of the users. For example, Ebb et al. published a study in 2011 in which they try to interpret the emotional states of users.[12]) The data used is from the Beihang Keystroke Dynamics Database A. The machine learning method applied is extreme gradient boosting, through XGBoost. Other methods, such as neural networks and SVM, were decided against due to a sparse dataset (which works badly with neural networks), and because SVMs are already widely researched in this area. This project did not include collecting data or creating an application for implementing a security system.

1.4 Thesis Outline

Chapter 2 presents the theory relevant to the study, by explaining the basics of authentication and keystroke dynamics, as well as the theory behind machine learning, focusing on decision trees and XGBoost. Chapter 3 presents the implementation part of the study, detailing what software was used and giving a description and motivation of the dataset used. It also describes how features were extracted, and how the training and testing of the model were carried out. Chapter 4 presents the results of this implementation, which are discussed in Chapter 5. The report rounds off with conclusions (Chapter 6) and some reflections about possible future work in the field (Chapter 7).


2 Theory

2.1 Authentication systems

This subchapter discusses the basics of terminology used in the field of authentication and the usual building blocks used in a security system.

2.1.1 Authentication, Verification, and Identification

The terms authentication, verification, and identification are in common language often used interchangeably. Technically however, they do describe different nuances of security.

Authentication is a general term; as Ali et al. succinctly put it, 'Authentication, in short, is the process of verifying a person's legitimate right prior to the release of secure resources'.[2]

Verification is one way of establishing this right. It involves a person presenting a proof of identity, and a system verifying that the person is who they claim to be and that they have the right to access.[2] This can be done on different levels, from just making sure that an authorized username provides the right password, to making sure that a photo ID is authentic and matches the individual providing it. Verification is the most common method used in security systems.[2]

Identification differs from verification in that the system gains no knowledge of who the person claims to be. They simply provide a means of identification (not an identity), and the system needs to establish whether this identification sample (for example a fingerprint) is authorized and present in a database, and if so, which identity it belongs to. This method is more time-consuming but necessary in some fields, for example in the forensic sciences.[2]

As Ali et al. report, verification is the over-represented type of authentication in keystroke dynamics, accounting for more than 89% of the research.[2] This study has also implemented a verification method, as a username is provided, and an access attempt is only checked against previously provided samples which the system knows belong to the legitimate user.

2.1.2 Static vs Continuous

Authentication systems can be either static or continuous. Static systems claim the majority of research (83%), and are the simplest versions, where a system only checks the identity of a user once, most often as a first step to gain access (for example when logging in).[2] However, such a system does nothing to prevent already accessed information from being navigated by an intruder, if the authorized user leaves the system open for use (voluntarily or not).

Continuous systems try to battle this issue by continuously verifying that the user of the system is who they have claimed to be. Keystroke dynamics is in fact very appropriate for this type of authentication, since the user, if gaining access to a computer-based system, will probably continue to use the keyboard. If the user had provided a fingerprint, it would of course be more difficult in practice to continuously verify it.[5] Nonetheless, experiments involving continuous systems are more difficult to set up, and the application prospects may not be worth the effort (how often would a person leave sensitive information unattended?), which might be why research has focused more on static systems.[2] Therefore, this study has implemented a static verification system.

2.1.3 Systems overview

A static biometric verification system is created in two parts: first, the database is set up through a registration phase, and then the verification, the implementation of the system, can commence. A flowchart describing the parts is depicted in Figure 2. In the registration phase, authorized users are asked to provide a number of identification samples which are registered as genuine samples. The next step is to extract features (easily comparable data) from the samples, which are stored in a database with information on which user they belong to. The features are also used in the next phase, where the training occurs. A model trained to distinguish users from each other and from impostors is created using the registration features, and the system is set up.

This model is then used when verifying attempts to access the system. Users provide their means of identification, a data sample, from which the features are extracted. These features are processed by the model and compared with the stored features belonging to the claimed user, finally resulting in a decision of access made by the system.

Figure 2: A flowchart describing the setup and running of a general static biometric authentication system. The data samples provided depend on the type of identification used (behavioral, physical), and so do the features extracted. The model can be of a machine learning or statistical type and will learn how to distinguish impostors from genuine users. This is used when running the system, which decides on either granting or denying the user access.

This study has focused on the setup phase, particularly in extracting features and training a machine learning model. The next subchapter will specify what these data samples can consist of, and what features can be extracted when dealing with keystroke dynamics. The subchapter after that will go into how a machine learning model, and particularly XGBoost, is trained. Throughout this report, the word ‘authentication’ will be used, as it is deemed more general and less confusing.

2.2 Keystroke Dynamics

All methods need to structure the incoming data in some way, picking features to focus the analysis on and to store, to enable future comparisons. Some models do this as part of the learning process; for others, it is necessary to do this manually.[30] This subchapter will present different types of data and features common when dealing with keystroke dynamics.

2.2.1 Keystroke Dynamics Data

Keystroke Dynamics data is normally collected through regular QWERTY keyboards, which have receivers that register time-stamps of the keystrokes. Participants are asked to type the text multiple times. To evaluate the model created in the research, it is common practice to also collect samples where participants (either the same ones that provided the genuine samples, or a new pool of participants) have been asked to provide samples using the details of someone else (for example, using someone else's username), acting as impostors on the system; these are referred to as 'impostor samples'.[26]

For the practical data collection, researchers can choose to work in a lab setting, where the participants of the study are in a controlled environment. This often entails that the participants are using the same (or the same type of) keyboard. The experiment can also choose to restrict what type of keyboard is used, but neglect where the participants are geographically situated when carrying out the experiment. Researchers can also choose to simply not restrict the setup at all and let participants choose both setting and keyboard. The first version has the advantage of being controlled, allowing the study more rigor. However, it may not represent a real-life situation very well and may affect the participants' behavior, thus leading to biased and unrepresentative data. The second option removes the lab environment issue but raises the question of whether it is preferable that the participants are familiar with the keyboard or not. Which option leads to the most distorted data: using one's own keyboard, which one rarely does when intruding, or using an unfamiliar device, which is not very representative of, for example, a situation at a workplace? The third option simply gives the researchers no insight into how device familiarity might affect results.[26] Some studies allowed the participants to practice on provided keyboards before recording any data, to allow for familiarization.[13]

Although time-stamp data is the most common data collected, some studies have tried to research and implement receivers for registering keystroke pressure, expanding the feature base for profiling.[26] Currently, it is also very popular to work with mobile devices, which typically already contain hardware for analyzing movement, such as accelerometers and gyroscopes, which opens up possibilities for multi-featured profiles.[1][26]

The text from which the time-stamps are collected can also be of a different nature. The text can either be fixed, meaning the participants all write the same text (as in the widely used GreyC Keystroke Dataset [13]), or free, allowing the participants to freely choose their text. It can further be a short sequence of letters (such as a username, password, or short phrase), or longer, often defined as paragraphs of more than 100 words.[26][4]

2.2.2 Features

When the data has been collected, it is reduced to values that are easy to compare between samples and distinctive enough to be of use in the identification process. This 'reduced data' is normally referred to as 'features'. One or several features can be drawn from one sequence of data, which in this context is the sequence of time-stamps from one typing session of the text. Among the most common are the already discussed dwell and flight times, referred to as 'digraphs', as they represent relationships between two keystrokes ('n-graphs' are also sometimes used: timing relationships between three or more keystrokes).[26] If the text samples are fixed, or long enough, the data can be reduced further by creating subsets of digraph samples depending on what keys are being pressed, creating relationships of important keystrokes.[20]

2.3 Supervised Learning

Machine learning can be divided into two categories: Supervised and unsupervised learning.

This study is of the supervised type, since the training data is already labeled (unsupervised learning tries to find any patterns in data). The basis of any supervised learning algorithm is to present it with training data and the correct labels (called observed values). The training occurs when the algorithm attempts to map the training data to an output as close to the observed values as possible. If the process involves training several models, the best performing model can be chosen by testing them all on a subset of the training data (called 'validation') and evaluating. The model obtained can then be tested on testing data. This last step produces statistics of the accuracy of the model.[21]

2.3.1 Data Leakage and Cross-Validation

The training and testing data are often derived from the same dataset. A common issue in machine learning is data leakage, where a model produces very good results when tested because some form of information has 'leaked' between the training and testing sets. For example, a model that is tested on samples it has also been trained on, and therefore has seen before, can perform misleadingly well. There are several methods to help to avoid this. An easy strategy is to take care when splitting the data into the training set and testing set, basing the decision on some independent parameter.[23]

The concept of splitting a set can also be used within the training set, by splitting it into a training and a validation set, so that in each step of the training, the model evaluates on unseen data. This also prevents data leakage. However, ignoring parts of the data when training can lead to the opposite of over-fitting, under-fitting. To ensure that this does not occur, cross-validation can be implemented. Cross-validation splits the training set a number of times, into what are called folds, training the model the same number of times, and each time letting a new split form the validation set and the rest the training set.[14] A visualization of a cross-validation with five folds can be seen in Figure 3. The algorithm then evaluates which split scored best, and uses that model in the testing phase on the current test set. The algorithm is then evaluated by calculating the average performance over all those splits.

Figure 3: Illustration of a cross-validation, where the dataset has been split into 5 folds.
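As an illustration of the procedure in Figure 3, the following is a minimal sketch (not the thesis code) of 5-fold cross-validation with scikit-learn's KFold; the data and the choice of classifier are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X = np.random.rand(100, 5)        # placeholder feature sequences
y = np.random.randint(0, 2, 100)  # placeholder labels

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = XGBClassifier().fit(X[train_idx], y[train_idx])                  # train on 4 folds
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))     # validate on the held-out fold

print("mean validation accuracy:", np.mean(scores))
```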

2.3.2 Classification

Classification is the goal of training the model: it is where incoming samples are categorized by the model. This is in fact done both during training, where the prediction is checked and sent as feedback in the training process, and when actually running the system.

There are different types of classification methods, which depend on what type of data the tree structure receives in the training. The two main types are binary and multiclass classification. In binary classification, a data sample is to be classified as one out of two different classes, and in the latter, three or more classes are possible. Multiclass classification problems are very complex but can be transformed into binary classification problems. One way is to train one model per class, and when doing so, letting all samples not belonging to that class act as negative samples. This is called 'one vs rest'. Multiclass methods involving many classes are time-consuming and require good hardware.[21]

Keystroke dynamics classifications are often multiclass problems, since there are often more than two users authorized to use a system. There are, however, ways to train one, and only one, binary model on several classes. One such method is used in this study. The model is trained to compare two samples of feature sequences and to decide whether or not they belong to the same class (e.g. user). The data the model receives are pairs of feature sequences, and during training also a label describing whether the feature sequences belong to the same class or not.[21]
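As an illustration of how such pairs could be formed, the following sketch (not the thesis code) encodes each pair as the element-wise absolute difference between two feature sequences, labeled 1 if both sequences come from the same user and 0 otherwise; the data layout and the difference encoding are assumptions made for this example.

```python
import numpy as np
from itertools import combinations

# Placeholder: each user has a few feature sequences of five values (mean dwell + flight times).
sequences = {
    "user_a": np.random.rand(4, 5),
    "user_b": np.random.rand(4, 5),
}

# Flatten into (user, sequence) items and build all pairs.
items = [(user, seq) for user, seqs in sequences.items() for seq in seqs]

X_pairs, y_pairs = [], []
for (user_i, seq_i), (user_j, seq_j) in combinations(items, 2):
    X_pairs.append(np.abs(seq_i - seq_j))          # one possible encoding of the pair
    y_pairs.append(1 if user_i == user_j else 0)   # 1 = same user, 0 = different users

X_pairs, y_pairs = np.array(X_pairs), np.array(y_pairs)
print(X_pairs.shape, y_pairs.mean())  # e.g. (28, 5) pairs, share of same-user pairs
```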

2.3.3 Training

The training itself occurs mathematically by evaluating a model through a so-called objective function. The objective function keeps track of the error in the machine learning model, thus the goal is to minimize it. In trees in general, the objective function has a term l that represents training loss. Training loss describes how correct the model is compared to the observed data. XGBoost also implements a regularization term Ω, see Equation 1.

Regularization controls overfitting, and the two terms help balance the model so it is neither too predictive nor too simple, an important aspect for any machine learning algorithm. Objective functions help the model perform well because they give feedback on what is going on (in this case through the regularization and training loss terms).[6]

Obj(Θ) = l(Θ) + Ω(Θ) (1)

The loss function l can be calculated in several ways. Two of the most popular are the mean square error,

l = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (2)

and the logistic loss,

l = \sum_{i=1}^{n} \left[ y_i \ln(1 + e^{-\hat{y}_i}) + (1 - y_i) \ln(1 + e^{\hat{y}_i}) \right].    (3) [6]

In both cases, y_i is the observed value and ŷ_i the estimated value. For a binary classification problem using logistic loss, the observed value will be either 0 or 1, and the estimated value will be a probability between 0 and 1 of the chosen 'positive' value (for example, the probability of the real value being 1).

The regularization term Ω can also be calculated in various ways. The most common is to use some form of norm; Equation 4 and Equation 5 show what are called the L2 norm and the L1 norm, respectively.

\Omega(w) = \lambda \lVert w \rVert^2    (4)

\Omega(w) = \lambda \lVert w \rVert_1    (5)

How regularization controls overfitting can be explained through ‘Bias-Variance tradeoff’, which is a way of analyzing the expected error of the estimation the model produces. The error is made up of three terms, a bias term, a variance term, and lastly an irreducible error.

The bias term is a measure of how much the estimated value differs from the true value. The variance term is a measure of how sensitive the model is to variations in the data. The last term, as suggested, cannot be reduced by the model, but the former two can. The tradeoff lies in finding a model that is complicated enough to predict the training data well (low bias, but higher variance), yet general and uncomplicated enough to predict new data well (lower variance, at the cost of some bias). If it is too complicated, the model might overfit the training data, and if it is too uncomplicated, it might underfit.

While training loss tries to push the model to be perfect (increasing complexity), regularization balances this by reducing the complexity of the model. The complexity of the model depends on its parameters. The L1 and L2 regularization types reduce the 'freedom' of the parameters, limiting the range of their values and effectively limiting the number of different parameter variations, which reduces complexity and thus controls overfitting.[21]

Next, the method used in this study is explained, and how its objective function is evaluated through boosting.


2.4 Decision Trees and XGBoost

As previously mentioned, XGBoost is a framework which builds on gradient boosting. This subchapter will start by laying out the theory behind decision trees, introducing what sort of mathematical function they are implemented with, before moving on to explaining extreme gradient boosting, or XGBoost.

2.4.1 Decision Trees

Decision trees are used for a variety of different classification problems. They classify data by letting it traverse a tree full of dividing questions, finally reaching a 'leaf' (end node) that predicts something final about the data. Trees are popular because they are easy to visualize, scale well, are good with anomalies such as missing data and outliers, and can handle both continuous and discrete data. A disadvantage of decision trees is their sensitivity to variance in data. One method to avoid this is to let more than one tree classify the data, a so-called 'tree ensemble'.[21] This is the model that XGBoost is built upon, and it is more thoroughly discussed in the next subchapter.[6] First, a more detailed and mathematical explanation of decision trees follows.

Decision trees create the tree structure by partitioning connected nodes according to parameters which are learned through training, and assign the leaves different values that map onto a score related to the question. The data is sorted through the partitions, which ask questions about the data, and when arriving at a leaf the data point is assigned the value that represents the classification the tree has made. If these values are discrete (for example 'yes' or 'no'), the trees are called classification trees. If the values are continuous (for example a probability) they are called regression trees.[21] An example of the structure of a regression tree can be seen in Figure 4.

Figure 4: An example of a regression tree, where the data points are travelers on the Titanic and the classification is the chance of survival an individual would have. The questions asked at the partitions are what gender the individual is, what age, and in what class they were traveling. The leaves have been assigned different values learned through training. So, for a data point representing a young boy traveling in 3rd class, the model will classify it as having a survival rate of 27%.[22]

Mathematically, a decision tree is represented by an adaptive basis-function model,

f(x) = \sum_{m=1}^{M} w_m \phi(x; v_m)    (6)

where M is the number of 'distinctions' we want to make (for the Titanic example in Figure 4, M would be 5), w_m is the 'mean response', or 'leaf weight', for the class m, i.e. the response to the question 'what is the survival rate for anyone classified in leaf m' (for example 27%), \phi is the basis function for the data input x, and v_m is the 'splitting variable', which denotes the splitting question and its threshold value (in the above case, only 'yes' or 'no').[21] This is the function that a supervised learning method has to analyze in each step in order to optimize classification of the training data, in accordance with subchapter 2.3.3. In a tree ensemble, there are multiple trees. The model for such an ensemble can be expressed as

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}    (7)

where ŷ_i is the prediction for the given data x_i, K is the number of trees in one ensemble, and f_k is the function for a given tree, as in Equation 6. \mathcal{F} is the function space of all trees.[6]

2.4.2 XGBoost

XGBoost uses an ensemble of trees, ideally weak learners. On top of that, it applies boosting.

Boosting is the process of iterating through a number of steps, adding a new function in each step based on the error from the previous estimation. Normal gradient boosting uses the principle of gradient descent to find the minimum of the objective function by adding a new function in each step - boosting. XGBoost does all of this, while considering a slightly different take on the regularization term, and most importantly considers second derivatives (which leads to faster computations).[7] Now follows a mathematical description of XGBoost. It starts off with its objective being written as (analogously with Equation 1)

Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \quad f_k \in \mathcal{F}    (8)

where the base models f can be found through boosting. The boosting starts with the model formulating a prediction

\hat{y}_i^{(0)} = 0    (9)

and adds functions (boosts it):

\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)    (10)

\hat{y}_i^{(2)} = f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)    (11)

...

\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)    (12)

where ŷ_i^{(t)} is the model at step t.[7]

The next step is finding the base models f. This is done by optimizing Equation 8, which can be expressed as

Obj^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + constant,    (13)

and if the f_t that minimizes the expression is found, the f in each step of the boosting is found.

If a Taylor expansion of Equation 13 is applied up to the second order, and if the loss function l is defined as the square loss (Equation 2), Equation 13 can be approximated as

Obj^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),    (14)

where g_i and h_i are defined as the first and second order derivatives of the loss function,

g_i = \partial_{\hat{y}^{(t-1)}} \, l(y_i, \hat{y}^{(t-1)})    (15)

and

h_i = \partial^2_{\hat{y}^{(t-1)}} \, l(y_i, \hat{y}^{(t-1)}).    (16)

[7]
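For concreteness, the following small Python sketch (not part of the thesis) shows what g_i and h_i of Equations 15 and 16 become when the loss is the logistic loss of Equation 3, evaluated on raw margin scores; the function name and arguments are illustrative only.

```python
import numpy as np

def logistic_grad_hess(preds, labels):
    """Gradient and Hessian of the logistic loss (Equation 3).

    `preds` are assumed to be raw margin scores (the additive tree output) and
    `labels` are 0/1. For l = y*ln(1+e^{-s}) + (1-y)*ln(1+e^{s}) with s the raw score:
        g = sigma(s) - y            (first derivative, Equation 15)
        h = sigma(s) * (1 - sigma(s))  (second derivative, Equation 16)
    """
    p = 1.0 / (1.0 + np.exp(-preds))  # sigmoid of the raw score
    grad = p - labels
    hess = p * (1.0 - p)
    return grad, hess

print(logistic_grad_hess(np.array([0.0, 2.0, -1.0]), np.array([1, 1, 0])))
```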

The next part is to expand the regularization term to control the complexity, or the overfitting, of the model. If the set of leaves I_j is defined as

I_j = \{\, i \mid q(x_i) = j \,\},    (17)

where q(x_i) denotes the leaf to which the tree assigns data point x_i, the regularization can be expanded to

\Omega(f_t) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2    (18)

where T is the number of leaves in the tree, and λ and γ are hyperparameters which can be tuned when working with XGBoost. The larger the respective values of λ and γ, the more conservative the algorithm becomes. Equation 14 can then be written

Obj^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2    (19)

which can be reduced to

Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \tfrac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T,    (20)

where G_j and H_j are defined as

G_j = \sum_{i \in I_j} g_i    (21)

and

H_j = \sum_{i \in I_j} h_i.    (22)


The best minimization of Equation 20, or in other words, the optimal leaf weight, is computed by

w_j^{*} = - \frac{G_j}{H_j + \lambda},    (23)

giving an optimal loss value for the new model:

Obj^{*} = - \tfrac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T.    (24)

This is called the structure score, and is simply a measure of how good the tree structure is (the smaller the score, the better).[6]

The performance of a tree can thus be measured, but how is the best tree structure found? It is not realistic to create all possible structures and evaluate them. The trick is to grow the tree greedily. This means that as the tree grows, it tries to add a split at each node. The gain to the tree's objective function from a split is calculated as

Gain = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma,    (25)

where the first three terms are called the 'training loss reduction' and represent the score of the new left child, the score of the new right child, and the score if we do not split. The last term is a complexity penalty, or 'regularization', for introducing a new leaf into the structure. Now, the way forward is obvious: if the total gain is negative (so that the added scores are smaller than γ), the tree structure loses in performance by adding that leaf. If it is positive, the tree reduces its overall structure score and should perform the split. This is a variation of pruning the tree. A model can also implement so-called recursive pruning, where the tree implements all possible splits to a maximum depth, and then recursively prunes the splits that created negative gain. This is done so as not to overlook a split that looks like a loss at first, but that can lead to beneficial splits later.[6]
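To make Equations 23-25 concrete, the following is a small illustrative Python sketch (not part of the thesis) computing the optimal leaf weight and the gain of a candidate split from summed gradients and Hessians; all names and numbers are hypothetical.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight, Equation 23: w* = -G / (H + lambda)."""
    return -G / (H + lam)

def leaf_score(G, H, lam):
    """One leaf's contribution to the structure score, Equation 24 (without the gamma*T term)."""
    return G * G / (H + lam)

def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
    """Gain of splitting a node into left/right children, Equation 25."""
    return 0.5 * (leaf_score(G_left, H_left, lam)
                  + leaf_score(G_right, H_right, lam)
                  - leaf_score(G_left + G_right, H_left + H_right, lam)) - gamma

# Toy numbers: a positive gain means the split is worth the extra leaf; a negative gain means it is not.
print(split_gain(G_left=3.0, H_left=4.0, G_right=-2.0, H_right=3.0, lam=1.0, gamma=1.0))
```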

This method of boosting a model with an objective containing a regularization term, and with consideration of second derivatives, has thus been named XGBoost. Other advantages over other tree boosting tools are that it is completely scalable and, according to its creators, 'it runs more than ten times faster' than other systems.[7]

2.4.3 XGBoost Hyperparameters

The XGBoost library offers three different types of hyperparameters for tuning: general parameters, parameters for the chosen booster (in this case a tree), and learning parameters.[8] Scikit-learn provides a 'grid search' function which enables tuning by searching through ranges of different parameters, using cross-validation to find the optimal values for that particular model. This subchapter will go through the hyperparameters tuned in this study.

General parameters concern message printouts and the number of threads used to perform parallel processing. Learning parameters concern the learning procedure and the evaluation of the objective function. Neither of these parameter types has been tuned in this study.

Booster parameters concern the tree structure itself. There are several different parameters, and the following, which are the most common to tune, have been used in this study (default values are those of the Scikit API for XGBoost classification). A sketch of such a grid search is given after the parameter list below:

Max depth: Max depth sets a limit to how ‘deep’ the tree can be, how many partitions can occur. A higher value increases the complexity of the model, risking overfitting. Range:[0,∞], default:3.


Min child weight: Min child weight sets a threshold on the weight of a tree node for partitioning to continue. A higher value results in a more conservative model. Range:[0,∞], default:1.[8]

Gamma: Gamma, as seen in Equation 18, sets a threshold for the minimal loss reduction to occur in a step in order for the tree to create a further split. A higher value results in a more conservative model. Range:[0,∞], default:0.[8]

Subsample: Subsample sets the share of data selected for training, to prevent overfitting. Range:(0,1], default:1.[8]

Column sample by tree: Column sample by tree sets the share of the features to be used in each tree. Range:(0,1], default:1.[8]

Alpha: Alpha represents the L1 regularization term, as in Equation 5. The higher the value, the more conservative the model becomes. Default:0.[8]

Lambda: Lambda represents the L2 regularization term, as in Equation 4. The higher the value, the more conservative the model becomes. Default:1.[8]

Eta (learning rate): Eta, also called the learning rate, scales down the weights added after each boosting step. The lower the value, the smaller each step and the more conservative the model becomes. Range:[0,1], default:0.1.[8]
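The following is a minimal sketch of the grid search mentioned above, using scikit-learn's GridSearchCV together with XGBClassifier; the training data and the parameter ranges are placeholders, not the values used in this study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Placeholder training data: feature-sequence pairs and same/different-user labels.
X_train = np.random.rand(200, 5)
y_train = np.random.randint(0, 2, 200)

param_grid = {                      # hypothetical search ranges
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3],
    "gamma": [0, 0.1],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=param_grid,
    scoring="roc_auc",   # AUC is one of the evaluation measures used in this thesis
    cv=5,                # 5-fold cross-validation, as in Figure 3
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```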

2.5 Testing

Testing of the completed model is done to simulate a live scenario, to obtain an estimate of the model's expected performance on new data. The model is then presented with unseen data (a testing set), and given the task of classifying it. The nature of the output varies, depending on the measurement. The most common measurements when dealing with classification are presented below:

2.5.1 Feature Importance

The varying importance that the different features have held when building the model can be visualized through the plot_importance function in XGBoost. The importance is calculated by considering the number of times each variable has been used for a split, weighting this by the squared improvement gained from each split.[11]
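A minimal example of producing such a plot with the XGBoost library is sketched below; the data and the fitted model are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from xgboost import XGBClassifier, plot_importance

# Placeholder data and model; in the thesis the model is trained on keystroke feature pairs.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
model = XGBClassifier().fit(X, y)

plot_importance(model)   # bar chart of how often each feature is used for a split
plt.show()
```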

2.5.2 FPR and FNR (false positive/negative rate):

When doing a binary classification (are these sequences from the same user or not?), there are two types of errors the model can make: false positives (answering 'yes' when the answer is 'no') and false negatives (the opposite). The rates of occurrence of these errors are calculated as

FPR = \frac{FP}{FP + TN} = \frac{FP}{N^{-}}    (26)

and

FNR = \frac{FN}{TP + FN} = \frac{FN}{N^{+}}    (27)

where FP, TN, FN, and TP are the numbers of false positives, true negatives, false negatives, and true positives, respectively, and N^{-} and N^{+} are the total numbers of negative and positive samples.[21]

FPR and FNR are sometimes, in models designed for authorization, called ‘false acceptance rate’ and ‘false rejection rate’, or FAR and FRR, respectively. These terms will hereafter be used in this report when discussing these particular measurements. It can be argued that false acceptances are more dangerous for a security system, while a high FRR would probably lead to irritation and low user value.
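As an illustration, the following small sketch (not the thesis code) computes FAR and FRR from predicted and true labels; the convention that label 1 means 'genuine' is an assumption made for this example.

```python
import numpy as np

def far_frr(y_true, y_pred):
    """False acceptance rate and false rejection rate for 0/1 labels.

    Convention assumed here: 1 = genuine (positive), 0 = impostor (negative).
    FAR = impostors accepted / all impostors, FRR = genuine rejected / all genuine.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    impostors = (y_true == 0)
    genuine = (y_true == 1)
    far = np.sum(y_pred[impostors] == 1) / np.sum(impostors)
    frr = np.sum(y_pred[genuine] == 0) / np.sum(genuine)
    return far, frr

print(far_frr([0, 0, 1, 1, 1], [0, 1, 1, 0, 1]))  # -> (0.5, 0.333...)
```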

2.5.3 TPR and TNR (true positive/negative rate):

Equivalently to FAR and FRR, there are two types of correct classifications the model can give: true positives and true negatives. The rates are calculated as

TPR = \frac{TP}{TP + FN} = \frac{TP}{N^{+}}    (28)

and

TNR = \frac{TN}{FP + TN} = \frac{TN}{N^{-}}.    (29)

These measurements are also known as 'true acceptance rate' and 'true rejection rate'.[21]

2.5.4 ROC (receiver operating characteristic):

All four characteristics above are calculated at a certain threshold (what percentage of certainty of an answer is required to label a sample as 'positive' or 'negative'). To get a complete view of the model, one needs to run it at several thresholds. Plotting TAR against FAR at different thresholds results in a so-called ROC curve. In Figure 5, an example ROC curve is depicted. If the threshold is set to 1 (classifying all samples as negative), the model reaches a point in the bottom left. If the threshold is set to 0, the model ends up in the top right corner (classifying everything as positive). If the model acts on pure chance, TAR will equal FAR, and it will end up on a point on the diagonal. If the model has been trained so well that its TAR = 1 and FAR = 0 (no false positives), which is desirable in security systems, the model will end up at a point in the upper left corner. Therefore, a good result is when the ROC curve is as close to the upper left corner as possible.[21]


Figure 5: A standard ROC curve, with example results from three different models. The blue line, representing FPR = TPR (or FAR = TAR), is ‘worthless’ since it is as good as randomly guessing the answer. The model represented by the yellow line is better at classifying the data than the purple one, since it is closer to the upper left corner (where TPR = 1 and FPR = 0). Figure credit: [25]

2.5.5 AUC (area under curve):

The ROC curve, while giving a visually appealing way of comparing models, is difficult to interpret quantitatively. Therefore the ‘AUC’, or area under the ROC curve, is often calculated and presented as a measure of the quality of the model. Since the ideal ROC curve hugs the y-axis up to TAR = 1 and then extends horizontally to the right, the maximum AUC score is 1.[21]

2.5.6 EER (equal error rate):

Another quantitative measure of the ROC curve is the equal error rate, i.e. the value of FAR at the point where FRR = FAR. Visually, it can be found by drawing the line FAR = 1 − TAR through the plot and seeing where it intersects the ROC curve. The ideal EER is 0, which occurs when the curve reaches the top left corner.[21]
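A hedged sketch of how the ROC curve, AUC and EER could be computed with scikit-learn follows; y_true and y_score are placeholder arrays of true labels and predicted probabilities, and the procedure is an illustration rather than the exact evaluation code of this study:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                     # placeholder labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])  # placeholder scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under the ROC curve

# EER: the point where FAR (= fpr) equals FRR (= 1 - tpr)
fnr = 1 - tpr
eer_index = np.nanargmin(np.abs(fnr - fpr))
eer = (fpr[eer_index] + fnr[eer_index]) / 2        # approximate equal error rate

print(f"AUC = {auc:.3f}, EER = {eer:.3f}")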


3 Implementation

3.1 Software

This study built an XGBoost model using Python 3.6.4; Java or R can also be used.

Apart from packages for plotting and other visualization (for example matplotlib, pandas, scikit-learn, plotly) and other extensions that make programming more convenient, the computer needs to be equipped with the following to run XGBoost:

• GNU Compiler Collection (gcc)

• XGBoost in some way (for example a clone of the git repository[9])

Detailed environment instructions can be found on the official XGBoost website.[8] Training on keystroke dynamics data is generally not very computationally demanding, since the data consists of plain text (as opposed to images or other large files). Hence, a normal personal computer is sufficient to run these programs.
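A quick, hedged sanity check of the environment (assuming XGBoost has been installed, for example via a clone of the repository or pip) can be run from Python:

import xgboost as xgb
print(xgb.__version__)   # prints the installed XGBoost version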

3.2 Dataset

The dataset used in this study was created by Li et al. in 2011 and presented in the report ‘Study on The BeiHang Keystroke Dynamics Database’. The database contains keystroke data from a total of 117 users, collected in two different settings, in a cybercafe and online, resulting in two databases: Database A and Database B. In this study, only Database A (49 users) has been used. The users provided usernames, as well as both training and testing sets of data from typing out a personally chosen password.[18]

This subchapter describes the specific characteristics and the downsides of the data included, as well as explaining the motivation for choosing this particular dataset for the present study, including comments on other frequently used datasets.

3.2.1 Characteristics of the data

For each of the participants, the database contains a username and chosen password. For each username, there are 4-5 registration samples provided by the genuine user. There are also a varying number of attempt samples, using the correct username and password, each provided by either the genuine user or another participant posing as an ‘impostor’. The registration samples were used for training, and the attempt samples were used for testing the model.

The data for every password entry consists of two timestamps per key: press and release. This enables calculation of dwell time and different flight times. Li et al. utilized this to create a feature vector consisting of four parts: one describes dwell time, and the other three describe three different types of flight time, as depicted in Figure 1.[18]

3.2.2 Issues

Every dataset has its issues, and the most critical is inaccurate data. In the process of creating the dataset, Li et al. filtered out those data points created by obvious misuse of the system.[18] The samples of two additional users were disregarded due to errors in the data, leaving 47 users.

They provided in total 208 registration samples and 1164 attempt samples. Other inaccurate data that is more difficult to detect includes training samples that indicate behavior far from the user's norm. In research settings this might be rarer than in real-life situations, where mood, illness, day, and time will always affect typing rhythm in an unpredictable way.


3.2.3 Motivation

The decision to utilize the BeiHang database in this study was based on several factors, mainly its size, the features of the data (previously discussed), its collection method, and its accessibility. The size of the dataset with respect to users was deemed to be average, as observed by both Teh et al. and Ali et al. in their respective reviews.[26][1]

Other popular datasets include the GreyC Keystroke Dataset, which in 2009 had collected over 7500 samples from 133 users. However, in this database, the users all typed the same password (‘greyc laboratory’).[13] As discussed, fixed text for all users has its advantages, but it was decided that examining a free text database would be more realistic and academically useful. For the same reason, the CMU Keystroke Dynamics Benchmark Dataset with 51 users typing a static password, was eliminated.[17] In addition, both the GreyC and CMU databases have been thoroughly studied compared to the BeiHang (131 and 346 citations, respectively, compared to 36, according to Google Scholar).

A potential additional advantage of this dataset is the data collection. The samples were collected from a commercial system in a ‘free’ environment, meaning not a controlled laboratory. Li et al. argue that this makes the data ‘more comprehensive and more faithful to human behavior’.[18] Of course, the users were aware of their participation in a study, so the setting cannot be said to be completely identical to everyday keyboard use. There is also an advantage to letting users choose their own passwords: as Teh et al. conclude, a user is comfortable with his or her own password, resulting in a typing rhythm more similar to normal use than when copying unknown text.[26] The data in the BeiHang database might not be classified as completely ‘free text’, as there are limitations to passwords (mainly length), but it is decidedly freer than structured text.

This text setup differs from the majority of research conducted, in which passwords are not chosen by the users and collection is carried out in laboratories.[13]

The BeiHang Database was also easily accessible, with the data precisely labeled in txt-files.

Although Giot et al. argue in a review from 2015 that the most reliable approach is for researchers to ‘collect the datasets [themselves] that fits the need of their studies’, building a new dataset of good quality was deemed unachievable for this study, given the time frame. As several studies and reviews state, there is a strong need for a good-quality benchmarking dataset for keystroke dynamics.[1] There seemed to be no need to expand the sea of semi-good datasets, and the focus has instead been on expanding the research on an existing database.

3.3 Feature Extraction

After acquiring the dataset, features are extracted. As the passwords were all different from each other, direct comparison of timestamp data key by key was impossible. The features selected instead are the means of those features pictured in Figure 1: dwell time and four different types of flight time, for each sample. An example of such a data sequence can be seen in Table 1, with the resulting features below it. The usernames are included in both the training and testing sets. Labels indicating whether the data was collected from a genuine user or an impostor were also included in the data used for testing. Registration data is used for training the model, and attempt data for testing it. Tables 2 and 3 depict the data frames used for training and testing, respectively.

Table 1: Password, time stamps and resulting extracted features for an example password ‘psw’.

key          p           s           w
keystroke    down   up   down   up   down   up
timestamp    0      1    3      4    5      6

• dwell time average = 1

• flight time type 1 average = 1.5

• flight time type 2 average = 2.5

• flight time type 3 average = 2.5

• flight time type 4 average = 3.5
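To make the example concrete, the sketch below computes the five mean features from the press/release timestamps in Table 1. The flight-time definitions are written so that they reproduce the example values listed above and are assumed to match the four types depicted in Figure 1; the function name is made up for this illustration:

import numpy as np

def mean_features(downs, ups):
    # Mean dwell time and the four mean flight times for one password sample.
    # downs, ups: key-press and key-release timestamps, one pair per key.
    downs, ups = np.asarray(downs, float), np.asarray(ups, float)
    dwell = np.mean(ups - downs)             # release - press, per key
    f1 = np.mean(downs[1:] - ups[:-1])       # next press   - current release
    f2 = np.mean(ups[1:] - ups[:-1])         # next release - current release
    f3 = np.mean(downs[1:] - downs[:-1])     # next press   - current press
    f4 = np.mean(ups[1:] - downs[:-1])       # next release - current press
    return dwell, f1, f2, f3, f4

# The 'psw' example from Table 1:
print(mean_features([0, 3, 5], [1, 4, 6]))   # -> (1.0, 1.5, 2.5, 2.5, 3.5)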

Table 2: Excerpt from the registration data containing feature sequences from two samples provided by two different genuine users. The unit is ms. ‘d’ stands for ‘dwell time’ and ‘f’ for ‘flight time’ (of which there are four types).

d mean      f1 mean     f2 mean     f3 mean     f4 mean    username
162036.0    389064.0    550526.0    550225.0    711687     12345
138766.0    872.4       147306.0    137660.0    284094     304567

Table 3: Excerpt from the attempt data containing feature sequences from two samples, both claiming to be user ‘12345’ and providing the correct password; one sample is genuine (‘0’) and one is from an impostor (‘1’). The unit is ms. ‘d’ stands for ‘dwell time’ and ‘f’ for ‘flight time’ (of which there are four types).

d mean      f1 mean     f2 mean     f3 mean     f4 mean     username    impostor
136353.0    247693.0    386233.0    386832.0    525371.0    12345       0
101175.0    57725.2     160851.0    161111.0    264237.0    12345       1

3.4 Training

When the features have been extracted, the next step is to train the model. The blog post ‘Building Supervised Models for User Verification Part 1 of the Tutorial’ by Maciek Dziubinski and Bartosz Topolski was the inspiration for the structure of the training and testing algorithms in this study.[10] The first step is to split the usernames into those whose data will only be used for training and validation, and those whose data will only be used for testing (the testing dataset is covered in the next subchapter).

This is done to prevent data leakage, and it can be repeated multiple times in order to analyze potential differences between different divisions. The data used in training is, as previously mentioned, all from the registration database (where all samples come from genuine users), as seen in Table 2. The next step is to split the training set into folds for cross-validation. Within the folds, the binary classification problem is prepared: all feature sequences are paired with one another, each pair is labeled according to whether the two sequences belong to the same user or to different users, and the feature difference is calculated (simply the difference between the sequences, per feature). This is visualized in Figure 6. This, in turn, creates new features, referred to as ‘difference features’, which include labels. The model can now be trained to recognize whether two feature sequences belong to the same user or to different users.

The difference features calculated from two feature sequences from the same user, and from two different users, can be seen in Tables 5 and 6, respectively. The same pairing is done for all data in the validation set.
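A simplified sketch of this pairing step is shown below (using pandas; reg_df is a placeholder DataFrame with the columns of Table 2, the function name is made up, and labeling same-user pairs as 1 is an arbitrary choice for illustration). It is meant to convey the idea rather than reproduce the exact code of this study:

import itertools
import pandas as pd
import xgboost as xgb

FEATURES = ["d mean", "f1 mean", "f2 mean", "f3 mean", "f4 mean"]

def make_pairs(reg_df):
    # Pair every feature sequence with every other one and build 'difference
    # features', labeled by whether the two sequences come from the same user.
    rows, labels = [], []
    for i, j in itertools.combinations(reg_df.index, 2):
        diff = reg_df.loc[i, FEATURES].values - reg_df.loc[j, FEATURES].values
        rows.append(diff)
        labels.append(int(reg_df.loc[i, "username"] == reg_df.loc[j, "username"]))
    return pd.DataFrame(rows, columns=FEATURES), pd.Series(labels)

# X, y = make_pairs(reg_df)
# model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1)
# model.fit(X, y)   # learns whether two sequences belong to the same user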

References
