
UPTEC STS 20035

Degree project, 30 credits (Examensarbete 30 hp)

September 2020

Catch the fraudster

The development of a machine learning based fraud filter



Abstract

Catch the fraudster

Anton Andrée

E-commerce has seen rapid growth over the last two decades, making it easy for customers to shop wherever they are. This growth has also led to new kinds of fraudulent activities affecting customers. To make customers feel safe while shopping online, companies like Resurs Bank implement different kinds of fraud filters to freeze transactions that are thought to be fraudulent. The latest type of fraud filter is based on machine learning. While this seems to be a promising technology, data and algorithms need to be tuned properly to the task at hand.

This thesis project gives a proof of concept of realizing a machine learning based fraud filter for Resurs Bank. Based on a literature study, the available data and explainability requirements, this work opts for a supervised learning approach based on Random Forests with a sliding window to overcome concept drift. The inherent class imbalance of the setting makes the area under the receiver operating characteristic curve (AUC-ROC) a suitable metric. The approach provided promising results, indicating that a machine learning based fraud filter can add value to companies like Resurs Bank. An alternative approach for incorporating non-numerical features by using recurrent neural networks (RNN) was implemented and compared. The non-numerical feature was transformed by a pre-trained RNN model into a numerical representation that reflects the feature's suspiciousness. This new numerical feature was then included in the Random Forest model, and the result demonstrated that this approach can add valuable insight to the fraud detection field.

ISSN: 1650-8319, UPTEC STS 20035. Examiner: Elísabet Andrésdóttir. Subject reader: Kristiaan Pelckmans


Acknowledgments

I would like to thank everyone who made this thesis project possible.

My supervisors at Resurs Bank, Lisa Rodrigues and Sandrine Wallisson, gave me a warm welcome to the company and supported me during the whole process. My reviewer at Uppsala University, Kristiaan Pelckmans, guided me throughout the project. And lastly, Fabian Persson and Markus Skogsmo gave me ideas and proofread the thesis.


Popular science summary (Populärvetenskaplig sammanfattning)

The development of the internet has changed how people shop in several ways. Before the internet, one had to be physically present in a store and adapt to its opening hours. In today's modern society, one can instead buy almost anything online, and this is done to an ever greater extent. Despite the many advantages of e-commerce, there are also drawbacks. One of the largest is that e-commerce has also made life easier for fraudsters, who can now commit payment fraud remotely. This has made it increasingly difficult for the authorities to stop and identify the fraudsters. As e-commerce has grown, online payment fraud has therefore grown with it.

E-retailers often hire external actors that provide payment solutions, and it is often these external actors who are made responsible for preventing fraud. Actors that have provided payment solutions for a long time have accumulated large amounts of data on the transactions that have been carried out. These large data sets could be used to find patterns in the fraudulent transactions and thereby predict future cases of fraud. A common way to find such patterns is to use machine learning.

In this thesis project, the client's collected data has been examined, and a Random Forest model has been trained and tested to see whether implementing a machine learning based fraud filter is feasible. In addition, a new approach (in this context) for using non-numerical features has been investigated. The new approach uses a recurrent neural network (RNN) to predict how suspicious the email address given by the customer at the time of the transaction is. The suspiciousness is expressed quantitatively, which makes it possible to use it as an additional feature in the Random Forest model.

The area under the receiver operating characteristic curve (AUC-ROC) has been used to measure model performance, since this metric is particularly well suited for imbalanced data sets. The result without the RNN feature was around 0.9, which is considerably better than a "filter" that classifies all transactions as non-fraudulent (which has an AUC-ROC of 0.5). When the new approach was tested, an increase in AUC-ROC of approximately 0.9 percent was observed.


Contents

1 Introduction
  1.1 Related work
  1.2 Research definition
  1.3 Questions
2 Theory
  2.1 Machine learning
    2.1.1 Over- and underfit
    2.1.2 Decision trees
    2.1.3 Random Forest
    2.1.4 Neural Networks
    2.1.5 Recurrent Neural Networks
  2.2 Word representation
  2.3 Concept drift
  2.4 Imbalanced data sets
    2.4.1 Random under sampling
    2.4.2 Random over sampling
    2.4.3 SMOTE
  2.5 Evaluation metrics
    2.5.1 The confusion matrix
    2.5.2 Area Under the Receiver Operating Characteristics
3 Method
  3.1 Tools
  3.2 Data
    3.2.1 Problems originating from the data collection
  3.3 Classification model(s)
  3.4 Preprocessing
    3.4.1 Cyclical continuous features
    3.4.2 Qualitative features
    3.4.3 Geographical location
  3.5 Hyper-parameter tuning and evaluation metrics
  3.6 Comparison of results
4 Results
  4.1 Hyper-parameter setting
  4.2 Model and Predictions
  4.3 Use of qualitative features
  4.4 Running time
5 Discussion
  5.1 Hyper-parameters
  5.2 Model/Predictions
  5.3 Qualitative features
6 Conclusion
7 Future work


1 Introduction

Since the dawn of the internet age, e-commerce has seen rapid growth as market after market has been revolutionized by this new sales channel. E-commerce has made customers' lives easier by being constantly available regardless of time or place. Today one can shop for pretty much everything online, from clothes to food and even houses. While e-commerce has had many positive impacts on society, it has also given rise to a new kind of fraudulent activity. The strength of the internet in general, and e-commerce in particular, is that actions can be carried out from anywhere as long as you have an internet connection. The same applies to fraudulent activities carried out online, which makes it very difficult to track down and arrest the fraudster. This has led to a drastic increase in fraud attempts, which could damage the reputation of e-commerce. Digital buyers have already stated that they will not purchase goods from online stores which they believe do not provide a sufficient level of payment security [1].

To make shopping online as safe as possible, most retailers on the world wide web choose to outsource the transaction procedure to other companies that specialize in that area. These specialized companies are usually held financially accountable if they allow a fraudulent transaction to take place. It is therefore in these companies' best interest to be able to detect a fraudulent transaction quickly and with good precision. Fraud filters therefore have to keep up with ever-changing and continuously more sophisticated fraudsters.

The case company, Resurs Bank, has departments working full time on fraud prevention. The current fraud filter uses thresholds for various key values as an indication of fraud; if enough key values exceed their corresponding thresholds, the transaction is frozen and further investigation begins. These thresholds are set by experienced staff who use their field knowledge of fraud trends and educated guesses as decision tools [2]. When it comes to more qualitative, non-numerical features that are made up of text instead of numbers, threshold values cannot be used directly. A usual way to take these features into account is to use so-called grey lists. For example, if a specific product has been associated with a fraudulent transaction, it is thereafter stored in a grey list intended for suspicious products. Whenever a new transaction is being processed with this corresponding product, it is considered suspicious. A problem with this approach is that it can only detect suspicious qualitative features that have already been caught one or more times. It is therefore interesting to investigate whether qualitative features could be processed in a way that allows new transactions to be frozen as well.

The approach used by Resurs Bank to combat fraud is quite common, and many companies work with the issue in a similar manner. But setting all thresholds by hand and continuously updating and analysing new fraud trends can be both time consuming and sub-optimal [3]. Another approach is gaining traction and is portrayed as the next generation of fraud filters. This new approach is based on machine learning, where vast amounts of historical transactions can be used to statistically find optimal ways of detecting fraudulent transactions in an automated process [4]. A machine learning based fraud filter could hopefully decrease the workload and increase the performance of the fraud detection unit. A successful implementation would be beneficial both in terms of financial savings and customer satisfaction. This thesis project aims to investigate the possibilities for Resurs Bank to implement such a fraud filter and to evaluate what kind of results could be expected. It also aims to investigate how the information that can be gained from qualitative features can be used in a more sophisticated method than grey lists.

1.1 Related work

A lot of research has been done in the field of fraud detection, and many different machine learning approaches have been proposed and compared. Most research agrees that simply taking raw labelled data and creating a model will not yield optimal results because of the nature of fraud. Some key problems have been identified and summarized by Pozzolo et al. [5]. The first and most discussed problem is the imbalanced data set. Since fraudulent transactions occur in relatively small numbers compared to non-fraudulent transactions, the problem of imbalanced data sets is inevitable. Another problem is the non-static behaviour of both fraudsters and consumers. A model based on historical data will soon be outdated and its ability to make correct predictions will decrease over time [5]. Furthermore, many current fraud filters only consider isolated transactions and will therefore lose information that previous transactions from the same customer might add [6]. Different methods to address these issues have been suggested.

In an extensive study reviewing numerous articles, Priscilla and Prabha [7] concluded that ensemble methods such as Random Forest and AdaBoost perform better than single classifiers such as Support Vector Machines. The study also indicated that most research uses some sort of sampling method to overcome the issue of imbalanced data sets, which has been shown to increase the performance of the models [7].

In an article that gained a lot of traction, Whitrow et al. [6] considered using aggregated features to take customer behaviour into account. Feature aggregation gains information by considering a succession of transactions from the same customer. The aggregated features can for example be the sum of all transactions over the past week or the number of purchases from the same e-store over the past month. Whitrow et al. [6] found that the length of the aggregation period has a large impact on how effective the approach is. The main concern about using aggregated features is that it is no longer possible to determine whether a single transaction has been labelled as fraudulent; it is only possible to know that one or more transactions in a set of transactions are labelled as fraudulent. The study also concluded that the Random Forest classifier performs better than other classifiers such as SVM and logistic regression [6].

When it comes to the changing behaviour of both customers and fraudsters, Lucas et al. [8] examined historical data to analyse how shopping patterns change between different days and months. One of their findings was that the largest changes in consumer patterns were strongly correlated with calendar events; the largest observable changes were between working days, Saturdays, Sundays and school holidays. These were then used as categorical inputs into a Random Forest model, which resulted in a small increase in performance [8]. A problem with this approach is that it only takes cyclical behaviour changes into consideration. A technique that continuously adapts to behaviour changes, in contrast to what was proposed above, is incremental learning. Incremental learning interprets data as a continuous stream and updates the model in chunks of data of predetermined size [5]. Gao et al. [9] suggest that for imbalanced data sets it is beneficial to also include, for each chunk, minority data (fraudulent transactions) from previous chunks to make the training data set less imbalanced.

How to measure the results of a fraud filter is a debate with no clear answer. In the review by Priscilla et al. [7], over 12 different metrics were found in the articles. The most frequently used were True Positive Rate, Precision and F-measure (or F1-score), followed quite closely by Accuracy, Precision-Recall Area Under the Curve and Area Under the Receiver Operating Curve. Accuracy is used fairly commonly, although many researchers agree that it is not an appropriate measurement for fraud detection because of the imbalanced data set [7, 5]; the same has also been said about True Positive Rate [5].

1.2 Research definition

As mentioned earlier, machine learning is thought to play a major role in future fraud detection systems. It is therefore important for Resurs Bank to understand how, or whether, their data can be utilized for such a purpose. The aim of this thesis is to investigate what kind of machine learning model could be implemented and what kind of results can be expected. Furthermore, the thesis also investigates how non-numerical features can be utilized in a way that bridges the shortcomings of the grey-list method by using natural language processing.

1.3 Questions

• How can Resurs Bank utilize their data to construct a machine learning based fraud filter?

– What kind of results can be expected?

• Can non-numerical features, such as email addresses, be incorporated into a machine learning model in more sophisticated ways than grey lists?


2 Theory

This section explains the different techniques that have been used in this thesis project and why they were chosen. First, an overview of the concept of machine learning is given, followed by a more in-depth explanation of the models used in this project. Thereafter, some key concepts that are closely related to the problem of detecting fraud are examined. Lastly, the evaluation metrics that are used to interpret the results are presented.

2.1 Machine learning

Machine learning is currently one of the fastest growing areas of computer science. There are several different fields of machine learning with one aspect that holds them together: previously known data is used as input to a learning algorithm, which learns from the data and automatically adapts a model's characteristics to suit the user's purposes as well as possible [10]. The model can then be used for these purposes, which could for example be to predict future stock prices or to recognize faces in images. Commonly, machine learning is divided into two larger subgroups: supervised and unsupervised machine learning. In supervised machine learning, all data points have a known corresponding output target, while in unsupervised machine learning the data has no such known output target. Data with a known corresponding output target is also known as labelled data [11].

2.1.1 Over- and underfit

There is a well-known problem in machine learning that always needs to be considered. It originates from the fact that one tries to predict future data with knowledge of historical data. The historical data set could contain noise of some sort that causes some data points to be unrealistic. If a very complex model is trained on this data set until it classifies all data points correctly, these unrealistic data points might compromise the performance of the model on unseen data points. When this happens, the model is said to be overfitted [12, 13].

One way to avoid overfitting is to make the model less complex. However, if one makes the model too simple, the result can be an underfitted model, which also leads to poor performance. Ultimately, one seeks a trade-off between a too simple model and a too complex model [12].

2.1.2 Decision trees

A decision tree is a supervised machine learning model used for classification. The model is easy to understand and illustrate but can still be complex and efficient depending on the depth of the tree and how many features one chooses to include [13]. Figure 1 illustrates a simple decision tree that tries to classify whether a person is male or female. The model starts at the root of the tree, considers one feature at a time and, depending on the outcome of an if-statement, continues along a specific branch of the tree until it has reached a leaf (marked green in Figure 1), which corresponds to a classification. The first if-statement checks whether the person is taller than 185 cm or not. If the answer is yes, the person is classified as male; if not, the model continues to an if-statement regarding the person's weight. If the person's weight is over 85 kg, the person is classified as male, otherwise as female. This is a very simple model with only two features to separate the classes. The simplicity can cause the model to misclassify many persons; an obvious example is that a female taller than 185 cm is immediately misclassified as male. On the other hand, a more complex model with a great number of features is more prone to overfitting, which means that the model adapts too well to the training data and loses some of its generalizability [13]. Another tree-based machine learning model that tries to overcome these issues is Random Forest, which will be introduced further down.


Figure 1: Illustration of a simple decision tree to classify gender.


As a supervised model, the decision tree uses labelled data as a means to modify the model so that it produces optimal results. The algorithm that constructs the decision tree usually does so in a top-down manner, by iteratively splitting on the feature that best separates the different classes. There are several different methods to find the optimal feature to consider and at which value to split it; one of the most common is Gini impurity (equ. 1) [14].

Gini impurity = sum_{i=1..n} p_i · (1 − p_i)    (1)

where
n = the number of classes
p_i = the probability of an item with label i being chosen

Gini impurity represents the probability of incorrectly classifying a randomly chosen data point in a data set, given that the data point is randomly classified according to the data set's class distribution [14]. When finding the best split with Gini impurity, one provisionally splits the data set into two data sets and calculates the Gini impurity for each. As in the male/female example, the two data sets could include data points with weight over/under 85 kg. The two impurity measures are then weighted with respect to how many data points the respective data sets include. By adding these weighted measures, we get a number that indicates how good it is to split the data set at that particular place; the smaller the number, the better. If Gini impurity was used in the example above, 85 kg would have had the minimal sum of weighted Gini impurity measures and was therefore chosen as the first and best split.
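To make the split-selection procedure concrete, the snippet below is a minimal sketch (not taken from the thesis) of how a candidate split can be scored with the weighted Gini impurity of equ. 1; the male/female labels and the split on weight are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels: sum of p_i * (1 - p_i)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini_of_split(left, right):
    """Weighted Gini impurity of a candidate split; lower means a better split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical split on "weight > 85 kg" from the example above.
over_85 = ["male", "male", "male", "female"]
under_85 = ["female", "female", "male"]
print(weighted_gini_of_split(over_85, under_85))  # smaller value = better split
```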

2.1.3 Random Forest

One objective of a machine learning model is to generalize well to previously unseen data [11]. As mentioned above, decision trees are prone to overfitting, which usually causes the model to perform sub-optimally on unseen data. An approach that tackles this issue is to combine many different decision trees into an ensemble model called Random Forest and let a majority vote determine the final classification. Random Forests are able to limit overfitting without substantially increasing error due to bias, which is why they are such powerful models [14].


A Random Forest is created from multiple decision trees, where each tree is built on data that is randomly sampled, with replacement, from the original data set (this technique is known as bagging) [11]. The new random data sets have the same size as the original data set. Each tree bases each of its splits on a random subset of the features. When classifying new data, all decision trees make their own classification and a majority vote makes the final decision [15]. Figure 2 below illustrates a Random Forest with three trees (a Random Forest usually consists of many more trees). In this case, two of the trees classify the individual as female and therefore the final classification is also female. The rightmost tree in the illustration represents the tree in Figure 1; even though this tree misclassifies the female who is over 185 cm, the other trees correct that error.


Figure 2: Illustration of a Random Forest model with three trees and max depth of three.
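As an illustration of the bagging and majority-voting idea, the sketch below trains a Random Forest with scikit-learn (the library used later in this thesis) on hypothetical, already numerical data; the feature matrix, labels and parameter values are placeholders, not the thesis's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                # placeholder transaction features
y = (rng.random(1000) < 0.03).astype(int)     # ~3 % "fraudulent" labels

forest = RandomForestClassifier(
    n_estimators=100,   # number of trees, each fitted on a bootstrap sample (bagging)
    max_depth=10,       # limits how deep each tree may grow
    random_state=0,
)
forest.fit(X, y)

# The fraction of trees voting "fraud" acts as a score for each transaction.
fraud_scores = forest.predict_proba(X)[:, 1]
```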

2.1.4 Neural Networks

For a long time, conventional computers had no ability to learn and adapt to new circumstances; all software was hard coded to perform a particular task. Inspired by the learning process of the biological brain, data scientists developed Artificial Neural Networks to bridge this gap. A Neural Network consists of neurons, which can be perceived as processing units. The neurons are connected to each other through different weights. Each neuron receives weighted information via these connections from other neurons and produces an output, which is the weighted sum of the received information passed through an activation function. The activation function is usually a nonlinear function such as the Sigmoid function or similar [16].

Neurons are usually arranged in layers; these can be single-layer (see Figure 3) or multi-layer structures. The number of layers and their sizes can be seen as tuning parameters. A standard Neural Network is called a feed-forward neural network, which means that information flows in one direction, as seen in Figure 3, and no feedback is present [16].

Figure 3: A single-layer Neural Network with three neurons in the input layer, five neurons in the hidden layer and one neuron in the output layer.

The Neural Network learns through a learning algorithm, the most common being the back-propagation algorithm [16]. In essence, back-propagation is an optimization algorithm that updates the network's weights to minimize a cost function, which is high if the network's predicted class differs a lot from the true class. Since the learning algorithm needs the true class as input, back-propagation is a supervised machine learning algorithm. For each weight update, the algorithm calculates the gradient of the cost function and takes a step in the opposite direction. The updates continue until a predetermined number of updates has been performed.
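The sketch below shows, purely as an illustration, a single-hidden-layer feed-forward network trained with back-propagation in PyTorch (the neural network library used in this project); the layer sizes, learning rate and random data are assumptions for the example only.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3, 5),   # input layer (3 inputs) to hidden layer (5 neurons)
    nn.Sigmoid(),      # nonlinear activation function
    nn.Linear(5, 1),   # hidden layer to a single output neuron
)
loss_fn = nn.BCEWithLogitsLoss()                        # cost function for a binary target
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(32, 3)                                  # a batch of 32 made-up examples
y = torch.randint(0, 2, (32, 1)).float()                # their true classes

for _ in range(100):                                    # predetermined number of updates
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)                         # high when predictions differ from truth
    loss.backward()                                     # gradients via back-propagation
    optimizer.step()                                    # step against the gradient
```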


2.1.5 Recurrent Neural Networks

There are several different types of Neural Networks. In contrast to the standard feed-forward architecture presented above, there is the so-called Recurrent Neural Network (RNN), which uses feedback of information. The feedback within the network allows values in the data to be influenced by data in previous time steps. This characteristic makes RNNs able to process sequential data one element at a time and learn the sequential dependencies, which is especially useful for data whose values depend on past data, such as time series or written text [17]. An illustration of an RNN can be seen in Figure 4. The model takes the sequential data as input, one element at a time in sequential order. The state of the hidden layer is fed back as input into the succeeding network's hidden layer. The output from the last element is then used for classification or regression [18].

Figure 4: An illustration of how an RNN works. The color scheme matches that of Figure 3 to symbolise the similarities.

A drawback of RNNs is that they are unable to detect correlations between elements that are too far apart. In natural language processing (NLP), for example, it is difficult for an RNN model to learn the correlation between the first and last word in a longer sentence [19].

2.2 Word representation

Sometimes, features consist of words or text. To be able to work with these features and use them as input to a machine learning model, some transformation is needed. A common way to represent text numerically is to use one-hot encoding. This method has a predetermined set of symbols, commonly the ASCII characters and some other symbols such as dots and commas. Each letter or symbol is represented as a vector of the same length as the number of predetermined symbols. One element of the vector has the value 1 while the others have the value 0. For example, the letter "A" would be represented as [1 0 0 ... 0], "B" as [0 1 0 0 ... 0] and so on. A word is then represented as a matrix with each letter's representation as a row.

2.3 Concept drift

Concept drift occurs when the underlying data for predictive models changes over time. This can result in poor and degrading performance of the model's predictive abilities. Concept drift can be, and often is, evident in consumer behaviour and fraudulent behaviour [5, 20]. As an example, a freshly graduated student goes from having a tight budget to employment with a high salary. The increase in income will lead to more frequent purchases of goods of higher cost. These new purchases might seem odd to a model trained on historical data, and misclassification might occur more often. A technique used to take these changes in the data into consideration is incremental learning, which continuously adds new data to, and drops old data from, the training data set and then updates the model regularly. As illustrated in Figure 5, this technique adapts the model to the new circumstances [20].

Figure 5: The illustration shows how static models in a non-static environment lose quality over time. It also shows how updating the model with new data can ensure that the model stays relevant.


A systematic way to apply incremental learning is to use a sliding window approach. A sliding window is made up of chunks covering a specific time interval; the sliding window's time interval has to be a multiple of the chunks' time interval. For example, if one wants to predict the following day's weather, one can use the last three days as training data for a predictive model. Each new day, the training data drops the data from four days ago and adds data from the current day. In this example, the time interval of the sliding window is three days and the time interval of the chunks is one day. As illustrated in Figure 6, the last three chunks are used as training data for the model that predicts the fourth chunk, and so on.


Figure 6: The illustration shows how a sliding window approach works in practice. In this example, each window includes three chunks.
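The sliding-window scheme of Figure 6 can be expressed in a few lines; the sketch below assumes `chunks` is a chronologically ordered list of data chunks and is only meant to illustrate the windowing logic, not the thesis's implementation.

```python
def sliding_window_runs(chunks, window=3):
    """Yield (training_chunks, test_chunk) pairs, one per model update."""
    for i in range(window, len(chunks)):
        yield chunks[i - window:i], chunks[i]   # e.g. days 1-3 train, day 4 test

# Seven chunks and a window of three give four models, as in Figure 6.
runs = list(sliding_window_runs(["chunk%d" % i for i in range(1, 8)]))
print(len(runs))   # 4
```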

2.4 Imbalanced data sets

A common problem in machine learning classification occurs when one of the classes is overrepresented in the data set: the model's classifications will be biased towards the majority class [21]. This commonly occurs in transaction data sets, where the vast majority of transactions are non-fraudulent and only a small percentage are fraudulent [5]. If such a data set is used as training data for a model without measures being taken to resolve the issue, the model might not perform as well as desired.


A standard measure to overcome this problem and make the model more reliable is to balance the data set [5, 22]. There are several techniques for doing this; three of the most used are presented below. Another approach is Skew-Insensitive Learning, a technique that alters the training process of the model. This technique has shown promising results but is not as established as the balancing techniques [22], which is why it is not explored in this project.

2.4.1 Random under sampling

Balancing a data set by random under sampling means that one randomly excludes data points belonging to the majority class until the ratio between the classes is equal. A problem with this approach is that a lot of information can be lost, depending on how large the ratio between the majority and minority class is. Loss of information will make the model less accurate [23].

2.4.2 Random over sampling

Balancing a data set by random over sampling means that one randomly chooses data points belonging to the minority class and replicates these until the classes are equal. An advantage of this approach is that no information is lost. A drawback is that overfitting can occur for some models because of the replication of data points [23].

2.4.3 SMOTE

Synthetic Minority Over-sampling Technique (SMOTE) uses information about existing minority data points to synthetically produce new data points belonging to the minority class. This is achieved by "drawing" lines between data points of the minority class; the new data points are then uniformly sampled along these lines [23, 21]. Figure 7 below illustrates the process.

The advantages of using SMOTE are that no information is lost and that the new data points are not replications of the original data points, which can increase generalizability and lower the chances of overfitting [23, 21]. A drawback is that if the minority class is made up of separate clusters, some of the synthetically produced data points will be located in between these clusters. These data points might be unrealistic and could therefore decrease the performance of the model. However, the chances of this happening can be reduced by only drawing the "lines" to the k nearest neighbours [23].

Figure 7: The illustration shows how SMOTE produces synthetic data points using the information of existing data points.
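For illustration, the three balancing techniques above can be applied with the imbalanced-learn library; the thesis itself does not name this library, so the snippet is only a sketch of the idea on made-up data.

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 970 + [1] * 30)   # 3 % minority ("fraud") class

# sampling_strategy=0.5 requests a minority/majority ratio of 0.5 after resampling:
# the under-sampler removes majority points, the over-samplers add minority points.
X_u, y_u = RandomUnderSampler(sampling_strategy=0.5).fit_resample(X, y)
X_o, y_o = RandomOverSampler(sampling_strategy=0.5).fit_resample(X, y)
X_s, y_s = SMOTE(sampling_strategy=0.5, k_neighbors=5).fit_resample(X, y)

print(np.bincount(y_u), np.bincount(y_o), np.bincount(y_s))
```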

2.5 Evaluation metrics

Defining an evaluation metric is necessary so that comparisons with other models and fine tuning of parameters can be made. One of the most common metrics to evaluate the quality of a machine learning model is classification accuracy, which is the percentage of all input samples that were classified correctly. Using this evaluation metric for imbalanced data sets does not make sense. For example, if the data set contains 99 percent non-fraudulent transactions and 1 percent fraudulent transactions, the classification accuracy would seem very good if the model classified all transactions as non-fraudulent. Such a model would overlook all the fraudulent transactions and would therefore actually be very bad [24]. This thesis will therefore use an alternative evaluation metric (presented further down) to overcome this issue.

One aspect that is often overlooked in research about fraud detection is the economical aspect. It is common knowledge that undetected fraudulent transactions usually mean financial costs in terms of compensation to the card holder. But costs are also associated with non-fraudulent transactions that are misclassified as fraudulent, in the form of increased administration costs. There are some metrics that are able to take these associated costs into account. One example is the Fβ-score, also known as the weighted harmonic mean of precision and recall. The problem with these weighted metrics is that one has to be certain about the associated costs. To be certain, an in-depth economical analysis would have to be done, which is outside the scope of this master thesis.

2.5.1 The confusion matrix

A table that is often used in machine learning to interpret performance is the confusion matrix. The confusion matrix is in itself quite easy to understand and is made up of four elements: True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN). TPs are the number of positive data points that are correctly classified, FPs are the number of negative data points that are incorrectly classified, TNs are the number of negative data points that are correctly classified, and FNs are the number of positive data points that are incorrectly classified. Structuring these elements as a table (see Table 1) makes it easy to get an overview of the model's performance. The elements can also be used to calculate more complex evaluation metrics, as will be presented below.

Table 1: Confusion matrix. Positive and negative represent fraudulent and non-fraudulent transactions respectively.

                     Predicted positive      Predicted negative
Observed positive    True Positive (TP)      False Negative (FN)
Observed negative    False Positive (FP)     True Negative (TN)
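The four elements can be obtained directly with scikit-learn, as in the minimal sketch below with hypothetical labels (1 = fraudulent, 0 = non-fraudulent).

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 0, 1]   # observed classes
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]   # predicted classes

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")   # TP=2, FP=1, TN=4, FN=1
```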

2.5.2 Area Under the Receiver Operating Characteristics

As stated above, using accuracy as an evaluation metric for imbalanced data sets is not appropriate. A more general approach, widely used and advocated [8, 25, 26], is to use the Area Under the Receiver Operating Characteristic curve (AUC-ROC) as the evaluation metric. The ROC curve is a plot of the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis for all possible probability thresholds [24]. In the Random Forest case, a probability threshold is interpreted as the percentage of trees needed for a specific class to "win" the majority voting. Since this metric is evaluated over all thresholds, it is a good measurement of how well the model separates the classes [7].

The TPR is the percentage of all positive data points that are correctly classified and is calculated with equ. 2. The FPR is the percentage of all negative data points that are falsely classified and is calculated with equ. 3. The AUC-ROC is calculated as the area under the ROC curve, which can be seen in Figure 8. The higher the area, the better the model is at separating the different classes. As seen in Figure 8, a ROC curve starts at the origin, (0, 0), and ends at (1, 1). Depending on how good the classifier is, it takes different paths between those points. A random classifier, which is just as likely to predict a positive data point as a negative one and vice versa, would give a straight line between the starting and ending points and therefore an AUC-ROC of 0.5. A better classifier would produce an arc that extends above the diagonal, resulting in an AUC-ROC above 0.5. A perfect classifier, which classifies all data points correctly, would have an AUC-ROC of 1.0 [24].

TPR = True Positive / (True Positive + False Negative)    (2)

FPR = False Positive / (False Positive + True Negative)    (3)
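In practice the TPR, FPR and AUC-ROC are rarely computed by hand; the sketch below uses scikit-learn on made-up values, where `scores` stands in for the fraction of trees voting "fraud" for each transaction.

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1]                    # hypothetical labels
scores = [0.1, 0.2, 0.1, 0.4, 0.3, 0.8, 0.7, 0.9]    # hypothetical fraud scores

fpr, tpr, thresholds = roc_curve(y_true, scores)     # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, scores)
print(auc)   # about 0.92 for this toy example
```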

Although the AUC-ROC is an accepted evaluation metric for imbalanced data sets, there are still some problematic aspects. One issue is that both negative and positive instances contribute equally to the metric, which in some cases is not desirable [27]. Another issue is that only a few studies have examined how the AUC-ROC metric behaves in a sliding window environment where the data is prone to concept drift. According to the research currently available, substantial differences in AUC-ROC have been detected depending on whether one calculates the metric for a given chunk's predictions or for the whole stream's predictions. Calculating the AUC-ROC for a given chunk seems to give a more optimistic outcome. It has been advocated that, when comparing different models, one should concentrate on the AUC-ROC for the whole data stream for more consistent results [28].


Figure 8: The illustration shows how to interpret a ROC-curve. The more the arc rises above the diagonal, the better the model is at separating the classes.


3 Method

This section describes how the study has been carried out. The external tools and programming languages used are stated, and the data that was used, how it was obtained, and issues related to the data are described. Furthermore, some important feature preprocessing steps are explained in detail, as well as the structure of the models and how the results are evaluated.

3.1 Tools

Python is the programming language used in this thesis project, because Python is continuously ranked as the number one go-to programming language for machine learning and AI applications [29, 30]. It is a relatively simple programming language with a large community and many libraries created for machine learning.

The libraries NumPy and Pandas have been used to work with the large data sets in an efficient manner; both import and preprocessing of data have been achieved with help from these libraries. Another useful library is scikit-learn, which provides simple and efficient tools for predictive data analysis [31] and was used to implement the machine learning models. For the RNN, the library PyTorch was used. Resurs Bank stores its data in an SQL database, and therefore the query language SQL was used for data collection.

3.2 Data

In machine learning, the quantity and quality of the available data are crucial for the success of the model. The banking industry has many regulations that make it difficult to get access to large amounts of data; because of this, the transaction data sets available online lack a lot of information about their features. Therefore, this thesis only uses the data provided by Resurs Bank. The positive side of this is that Resurs Bank can, if they so wish, implement the proposed model, because they already store the required features.

The data set consists of transaction data from 2016-01-01 to 2019-12-31. A lot of information is stored, such as date, amount, customer zip code, customer-given email address, etc. The data set also indicates whether the transaction was marked as suspicious and whether the transaction was marked as fraudulent or not after a final evaluation.

The data was collected using an SQL query designed to fetch an equal amount of data from each year (15,000 transactions). To make sure that these transactions reflected the whole time period, the transactions were randomly sampled. It was not specified how many of these transactions should be fraudulent, which led to an extremely small number of fraudulent transactions being collected (e.g. for 2018 only 17 transactions were fraudulent). To overcome this issue, another SQL query was made that fetched all transactions from each of these years that were marked as fraudulent. The data set eventually became somewhat smaller because some transactions were removed during preprocessing for various reasons. After preprocessing, the final data set consisted of 56,862 transactions, of which 54,981 were non-fraudulent and 1,881 were fraudulent.

3.2.1 Problems originating from the data collection

The way the data was collected has led to three key concerns. The first is due to the fact that the transactions were randomly sampled over a specified time period. This means that it is not possible to aggregate features in the same manner as Whitrow [6], because there is no way of knowing whether all transactions for a particular customer are present in the data set. The second issue is that the data set does not truly reflect the ratio between non-fraudulent and fraudulent transactions; the collected data set has a higher proportion of fraudulent transactions. Problems that this second concern might give rise to are examined further in the discussion section.

The third concern relates to the column that indicated whether or not the transaction had been marked as fraudulent after the final manual inspection. According to people who work closely with the database, it is fully possible that this column is only updated for the transactions that have been frozen by the current fraud filter. Therefore, some of the fraudulent transactions (those not detected by the current filter) might be wrongly labelled in the data set. These are probably only a small number of transactions compared to all the fraudulent transactions in the data set, because of the complementary SQL query that fetched all transactions that were marked as fraudulent. The key issue is that a comparison with the current filter could be inequitable, because the labelling of fraudulent transactions would be biased towards the current filter. This is the main reason why no such comparison is made in this thesis.

3.3 Classification model(s)

Since a lot of comparison between different models has already been done in previous studies in the field, this is not repeated in this thesis. Instead, the focus lies on implementing the model that most of the reviewed literature found to yield the best results, which is Random Forest. The same literature has also promoted the use of over sampling to balance the data set, so this is incorporated with the model. Both SMOTE and random over sampling are tested over different over-sampling ratios to see which technique yields the best result.

Since the data set occurs on a timeline and concept drift is often evident in this kind of data [5, 20], a decision was made to use a sliding window approach. Each chunk has the size of a quarter of a year and each window consists of four chunks. This setup was chosen to make it easy to work with, but also to make sure each window had enough training data. Some previous studies have shown promising results by passing on the minority data from chunk to chunk, thus using all previous minority data in each update [5]; this balances the data set in a more natural manner. This technique is also tested to see how it performs in contrast to the base model that is illustrated in Figure 9.


Figure 9: Illustration of how data is divided into quarterly chunks and combined for different updates. It also shows how fraudulent transactions can be passed on from one chunk to the next. The blue, dotted arrow points at the chunk that the model will predict.


3.4 Preprocessing

Most features need some sort of preprocessing to make sense to a machine learning model: each feature has to be numerical and should not be missing data. Therefore, it was necessary to transform some of the features. The report will not go through how preprocessing was done for each and every feature; only the features that required extensive preprocessing are explained below.

3.4.1 Cyclical continuous features

Time-related features are usually important when it comes to fraud prediction. From the database it is possible to obtain a timestamp in the format "year-month-day hour-min-sec". A choice was made to only use the month, weekday and hour of the transaction, since minute and second information is assumed to be too detailed. A mutual problem of these time-related features is that they are cyclical, which means that the lowest and the highest values will seem to be very far apart. For example, if a customer tends to shop at night and buys a dress at 11:55 pm (represented as 23) and shoes at 00:05 am (represented as 0), the machine learning model will assume that these purchases took place 23 hours apart when in reality they are only 10 minutes apart.

The fact that the time representation seems far apart at the end/beginning of a time cycle leads to misinterpretation and can cause errors in the model, which can decrease performance. To represent time features in a cyclical manner, one can use the cyclical behaviour of the sine and cosine functions. To show exactly how this is done, the hour feature will be used as an example. In the raw data, hours are represented as a scalar from 0 to 23. In the cyclical representation, the hour feature instead consists of a 1x2 vector with the first and second elements calculated by equ. 4 and equ. 5 respectively. By using these formulas the time features get a cyclical behaviour, as seen in Figure 10. All cyclical time features can be transformed in the same way.

sin_hour = sin(2 · π · hour / hour_per_day)    (4)
cos_hour = cos(2 · π · hour / hour_per_day)    (5)


Figure 10: The illustration shows how cyclical data is transformed from a discontinuous representation (dotted line) to a continuous representation (blue line). An important remark is that the blue line is in fact three dimensional and projected into two dimensions for comparability.
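A minimal sketch of the transformation in equ. 4 and 5 is given below; it only illustrates the hour feature, but month and weekday can be handled the same way.

```python
import numpy as np

HOURS_PER_DAY = 24

def encode_hour(hour):
    """Map an hour 0-23 to a point (sin, cos) on the unit circle."""
    angle = 2 * np.pi * hour / HOURS_PER_DAY
    return np.sin(angle), np.cos(angle)

# 23:00 and 00:00 end up close together, as intended.
print(encode_hour(23))   # approximately (-0.26, 0.97)
print(encode_hour(0))    # (0.0, 1.0)
```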

3.4.2 Qualitative features

Resurs Bank's current fraud detection system uses, among other things, grey lists to identify suspicious transactions. Grey lists can for example consist of certain products, email accounts, etc. that have previously been linked to fraudulent activities. When a new transaction takes place, the data of that transaction is compared to the elements in the grey list. If a match is made, the transaction might be fraudulent and it is marked as suspicious. Grey lists can be very useful in fraud detection, but they lack the ability to identify previously unseen data as suspicious. This project introduces a, to our knowledge, new approach to handling grey lists in fraud detection systems. To explain the approach, the example of email accounts will be used. Instead of comparing the email account with each element in the email account grey list, our approach uses an RNN model to predict the level of suspiciousness of the email account. The prediction is then used as an input feature to the model for fraud classification. To make it easy to implement, each year's email data is used as training data for an RNN model, which then predicts the level of suspiciousness of the following year's email data; this prediction is in turn used as part of the Random Forest training data. This setup makes it impossible to incorporate the email suspiciousness feature into the classification of data from 2017. Therefore, this feature is only tested on transactions from 2018 and 2019. It is not certain that email accounts have characteristics that are well suited for this kind of classification, but similar approaches have shown promising results in other, related domains. For example, Bahnsen et al. [32] used an RNN with URLs as input to detect phishing threats online; in their study, the approach achieved an accuracy of 98.7%. If this approach produces good results, it will have a key advantage over grey list comparison because it will be able to classify previously unseen email addresses as suspicious.
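To make the idea more tangible, the sketch below shows one hypothetical way such an email model could look in PyTorch: a character-level RNN whose sigmoid output is read as a suspiciousness score and appended as a Random Forest feature. The architecture, alphabet and names are illustrative assumptions, not the exact model used in the thesis.

```python
import torch
import torch.nn as nn

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789@._-"   # assumed character set

def encode(email):
    """One-hot encode an email address; shape (length, batch=1, alphabet size)."""
    x = torch.zeros(len(email), 1, len(ALPHABET))
    for i, ch in enumerate(email.lower()):
        if ch in ALPHABET:
            x[i, 0, ALPHABET.index(ch)] = 1.0
    return x

class EmailRNN(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        self.rnn = nn.RNN(len(ALPHABET), hidden_size)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, x):
        _, hidden = self.rnn(x)                      # hidden state after the last character
        return torch.sigmoid(self.out(hidden[-1]))   # suspiciousness in [0, 1]

model = EmailRNN()   # would be trained on one year of labelled email data
score = model(encode("someone@example.com")).item()  # extra feature for the Random Forest
```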

3.4.3 Geographical location

Geographical location could be important to include for two reasons. Firstly, if someone purchases goods from a location that is not the address the cardholder has registered as his or her home address, that might be a sign of fraud. Secondly, certain geographical areas might have a higher degree of fraudulent activity, and it is therefore important to have a feature that carries meaningful information about how a location relates to other locations. In other words, it should be numerically apparent in the feature that two addresses in the same city are quite close and that a location in northern Sweden and a location in southern Sweden are far apart.

For the location features, there are two possible sources: zip code and postal address, which are both stored in the database. The postal address consists of word(s) and a number, for example "Bergevägen 16", and the name of the city, e.g. "Uppsala". This representation is usually quite difficult for a machine learning model to make sense of, since two streets within the same city block might have very different names. On the other hand, one can easily test whether the billing address is the same as the shipping address. This test produces "True" if the addresses match and "False" if they do not, so "Address_match" can be used as an input to the machine learning model with the values 1 (True) or 0 (False).

The "Address_match" feature does not tell where the address is located geographically. To produce such information, the zip codes are used. For each zip code it is possible to obtain corresponding longitude/latitude values, although access to the exact information requires a high annual fee [33]. To overcome this issue, two organisations were found that are trying to produce open APIs that are free for everyone to use. These databases are not very precise and lack some zip codes, but for this thesis project they are considered sufficient. The two were combined to get the best characteristics of both. The first one is provided by Geonames (www.geonames.org) and has approximately 99.3 percent of all zip codes in its database, although many zip codes share the same longitude/latitude, especially in urban areas. The second database is provided by Postnummerupproret (www.postnummerupproret.nu) and contains 34 percent of all zip codes, but on the other hand most of its zip codes have unique longitudes/latitudes. To produce a zip code-to-geoinformation lookup table, all zip codes from the second database are taken and all zip codes from the first database that were not represented are added; together, they form a merged database containing information from both sources. In the lookup table, the longitude and latitude values are normalized to be between 0 and 1, where 0 represents the southernmost and westernmost tips of Sweden respectively and 1 represents the northernmost and easternmost tips respectively. In Figure 11, information from both databases and the merged database is plotted to illustrate how the merged database results in more useful information.

(a) Geonames, 16 390 zip codes

(b) Postnummerupproret, 4 947 zip codes

(c) Merged data, 16 390 zip codes

Figure 11: Geographical information represented in two separate databases and in one database merged from the two.
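The sketch below illustrates how such a merged lookup table could be built and normalized with Pandas; the column names, example rows and the min/max coordinates of Sweden are illustrative assumptions rather than values from the thesis.

```python
import pandas as pd

# Two hypothetical zip-code databases with latitude/longitude per zip code.
geonames = pd.DataFrame({"zip": ["75121", "11122"], "lat": [59.85, 59.33], "lon": [17.63, 18.06]})
upproret = pd.DataFrame({"zip": ["75121"], "lat": [59.858], "lon": [17.645]})

# Prefer Postnummerupproret entries, fall back to Geonames for missing zip codes.
merged = pd.concat([upproret, geonames]).drop_duplicates(subset="zip", keep="first")

# Normalize so that 0 and 1 correspond to the southern/western and northern/eastern tips.
LAT_MIN, LAT_MAX = 55.3, 69.1   # approximate latitude range of Sweden
LON_MIN, LON_MAX = 10.9, 24.2   # approximate longitude range of Sweden
merged["lat_norm"] = (merged["lat"] - LAT_MIN) / (LAT_MAX - LAT_MIN)
merged["lon_norm"] = (merged["lon"] - LON_MIN) / (LON_MAX - LON_MIN)

lookup = merged.set_index("zip")[["lat_norm", "lon_norm"]].to_dict("index")
```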

3.5 Hyper-parameter tuning and evaluation metrics

An important part of the development of a machine learning model is to find the best suited hyper-parameters. A hyper-parameter is a parameter that has to be set before the training of the machine learning model. Hyper-parameters are used to configure various aspects of the learning algorithm and have a big impact on its performance. The most common way to find the optimal hyper-parameters for a given problem is to perform a grid search, which basically means repeatedly testing different settings from a predefined settings grid and then evaluating the performance of each setting [34].

There are quite a few hyper-parameters that can be set to different values for a Random Forest model, and a script doing a grid search over all of them would take too long to execute. Instead, three hyper-parameters assumed to be of greatest importance for the model's predictive abilities were selected and a grid search was performed over them. The chosen hyper-parameters are 'Maximum tree depth', which determines how deep the trees are allowed to grow, 'Number of trees', which is the number of trees in the Random Forest, and 'Over sampling ratio', which represents the ratio between the majority and minority class after over sampling has been performed.

To obtain reliable results and to avoid overfitting, it is important not to optimize the hyper-parameters on the test data. Instead, one can split the training data into a training set and an evaluation set, train the model on the training set and test its performance on the evaluation set [13]. Another technique that has become popular because of its ability to avoid overfitting is K-fold cross-validation. This technique divides the training data into K chunks with the same ratio between the classes. All chunks except one are then used to train a model, which is tested with a grid search on the chunk that was left out. This procedure is repeated until all chunks have been used as test data, and the results are averaged [13]. The hyper-parameters that yield the optimal averaged result are then chosen for the final model. Even though K-fold cross-validation is seen as the standard way of doing validation, it is not applicable in this project: in a data stream context, where data points arrive on a timeline and the behaviour might change over time, it is crucial that the training data occurs before the validation data [35].

When tuning the hyper-parameters, it was important to use an approach that would be possible in production. For this purpose, the first six quarters of data (Q1 to Q6) were used for training and evaluation. The data from Q1-Q4 was used to train model 1 and the data from Q2-Q5 was used to train model 2. Model 1 and model 2 then predicted the transactions of Q5 and Q6 respectively. As mentioned before, conducting a K-fold validation on a data stream of this sort is not suitable. By doing the grid search on both Q5 and Q6 and then averaging the results, one achieves some of the good qualities that a K-fold validation would give with respect to avoiding overfitting.

'Maximum tree depth' and 'Number of trees' are directly linked to the model itself, and it was therefore decided to perform a grid search on these two hyper-parameters alone. After those hyper-parameters were set, another search was made to determine to what extent the minority data should be over sampled. The hyper-parameter values that were tested can be seen in Table 2.

The hyper-parameters of the RNN have not been tuned in any way. Parameters were chosen according to the tutorial in [36], except for the number of iterations, which was changed from 100,000 to 30,000 to lower the computational cost and make the training run faster. Tuning the hyper-parameters might have increased the performance somewhat, but since the reason for utilizing the RNN was to investigate the usefulness of the new approach, tuning was not considered important at this stage.

Table 2: The table displays the hyper-parameter values that were tested. The first two rows were tested together in a grid search, while the over-sampling ratio was tested separately.

Max tree depth:        1, 10, 25, 50, 75, 100
Number of trees:       20, 50, 100, 200, 300, 500, 1000
Over sampling ratio:   0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
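The sketch below outlines how the grid search over the first two rows of Table 2 could be run with the two training windows described above; the per-quarter data splits are assumed to exist already, and the helper names are hypothetical rather than taken from the thesis.

```python
from itertools import product
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def score_setting(max_depth, n_trees, quarters):
    """quarters: chronologically ordered list of (X, y) tuples, at least six long."""
    aucs = []
    for start in (0, 1):   # window Q1-Q4 predicts Q5, window Q2-Q5 predicts Q6
        X_train = np.vstack([quarters[i][0] for i in range(start, start + 4)])
        y_train = np.concatenate([quarters[i][1] for i in range(start, start + 4)])
        X_test, y_test = quarters[start + 4]
        model = RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth, random_state=0)
        model.fit(X_train, y_train)
        aucs.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    return np.mean(aucs)   # average AUC-ROC over the two validation quarters

grid = product([1, 10, 25, 50, 75, 100], [20, 50, 100, 200, 300, 500, 1000])
# best_depth, best_trees = max(grid, key=lambda s: score_setting(s[0], s[1], quarters))
```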

As the literature study suggests, this thesis uses AUC-ROC as the evaluation metric. The reason for this is that AUC-ROC does not only consider the values of the confusion matrix for a single threshold, but takes all thresholds into account. This characteristic implies that the maximum AUC-ROC is achieved with the hyper-parameters that separate the classes the most [7]. However, these are not necessarily the settings that maximize the economical profit for this particular problem, since the AUC-ROC metric weighs TPR and FPR equally and can therefore yield sub-optimal results when one of the classes is more important to classify correctly. In this use case, it is more economically harmful to misclassify a fraudulent transaction than a non-fraudulent one. A choice was made not to focus on this aspect and still use AUC-ROC, since it is a generalized approach that is well motivated at this early stage.
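To make the metric concrete, the sketch below shows how AUC-ROC can be computed from predicted fraud probabilities with scikit-learn; the labels and scores are made-up illustrative values, not thesis data.

```python
# Illustrative only: computing AUC-ROC from predicted fraud probabilities.
# The labels and scores below are made-up values.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 0, 1, 0, 1, 0]                     # 1 = fraudulent, 0 = non-fraudulent
y_score = [0.1, 0.7, 0.2, 0.05, 0.8, 0.4, 0.6, 0.2]    # model's fraud probabilities

# roc_auc_score evaluates the ranking over every possible threshold at once.
print(roc_auc_score(y_true, y_score))
```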

3.6 Comparison of results

To be able to draw conclusions from the results, a comparison to some baseline has to be made. As stated earlier in this section, comparison to the current filter is not possible due to the way the database is updated. Instead, a very simplistic baseline is used: a "filter" that predicts all transactions as non-fraudulent. AUC-ROC will be used as the evaluation metric, since it takes both the majority and the minority class into equal consideration. Out of interest, the accuracy will also be displayed in the results, to illustrate that it is not a good fit for imbalanced learning.
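The sketch below illustrates why accuracy alone is misleading for this baseline; the class ratio is made up for illustration and does not reflect the actual data set.

```python
# Sketch of the all-non-fraudulent baseline: with roughly 2% fraud, accuracy
# looks excellent while AUC-ROC reveals that the "filter" has no
# discriminative power at all. The class ratio below is illustrative only.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_test = (rng.random(10_000) < 0.02).astype(int)      # roughly 2% fraudulent
baseline_score = np.zeros_like(y_test, dtype=float)    # "non-fraudulent" for every transaction

print("Accuracy:", accuracy_score(y_test, baseline_score > 0.5))  # close to 0.98
print("AUC-ROC :", roc_auc_score(y_test, baseline_score))         # 0.5, no separation
```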

Apart from the base model described in 4.3, two other characteristics will be added onto the model, tested and evaluated. The first is that minority transactions are passed on to the next chunk, so that the minority data is remembered across the stream. The second adds a new feature based on the customer email address given at the time of purchase. The feature is a quantification of the email address's suspiciousness, found using an RNN. These are compared to see whether the new characteristics add value to the model's predictive abilities.

The list below summarizes which filters/models will be compared.

• Filter that predicts all transactions to be non-fraudulent

• Random Forest (with sliding window)

• Random Forest (with sliding window) + minority pass on

• Random Forest (with sliding window) + RNN based email address feature

Since the training of a Random Forest model is a stochastic process, it can give rise to slightly different models and thus different predictions for the same data. The evaluation process is therefore run several times: more precisely, the filters/models above are tested on the same data 5 times, and the expected value and the standard deviation are shown in the result section.
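A minimal sketch of this repeated evaluation is given below, where evaluate_filter is a hypothetical helper that builds, trains and tests one of the models above and returns its AUC-ROC.

```python
# Sketch of the repeated evaluation: each model is trained and tested five
# times and the mean and standard deviation of the AUC-ROC are reported.
# evaluate_filter(seed) is a hypothetical helper, not part of the thesis code.
import numpy as np

aucs = np.array([evaluate_filter(seed=s) for s in range(5)])
print(f"AUC-ROC: {aucs.mean():.4f} ± {aucs.std():.4f}")
```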


4 Results

This section presents the results that have been obtained. First, the results from the hyper-parameter search are displayed. Then the results of the models configured according to those hyper-parameters are shown.

4.1 Hyper-parameter setting

The grid search was performed with the values given in Table 2 in the method section. Of those values, the hyper-parameter combination that yielded the best result according to the AUC-ROC metric was 1000 trees and a maximum depth of 10 levels. As seen in the heatmap (Figure 12), there is a cluster of high AUC-ROC values around the optimal hyper-parameter setting, which indicates that the chosen hyper-parameters have a correlated impact on the model's result.

Figure 12: The figure shows the result of the grid search that was performed to find the optimal hyper-parameters.

After these hyper-parameters were found, another search was made to find the optimal over sampling ratio. Both SMOTE and random over sampling were tested to see which technique yielded the best result. The over sampling ratios given in Table 2 in the method section were used, and the resulting AUC-ROC can be seen in Figure 13. It is quite clear that over sampling through SMOTE did not improve the AUC-ROC; in fact, the AUC-ROC was reduced when SMOTE was used. Random over sampling, on the other hand, seems to increase the performance of the model. Because of this result, an over sampling ratio of 1.0 with random over sampling was chosen for the final classification model.

Figure 13: The figure shows the Random Forest's results for different over sampling ratios and the two different over sampling techniques.
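For reference, the sketch below shows how the two over sampling strategies could be applied with the imbalanced-learn library; this is one possible implementation and not necessarily the one used in the project. X_train and y_train are assumed to exist, and a ratio of 1.0 means that the minority class is over sampled to the size of the majority class.

```python
# Sketch of the two over sampling strategies, applied only to the training
# data. X_train/y_train are assumed to exist; sampling_strategy=1.0 balances
# the minority class up to the size of the majority class.
from imblearn.over_sampling import SMOTE, RandomOverSampler

ratio = 1.0
X_ros, y_ros = RandomOverSampler(sampling_strategy=ratio,
                                 random_state=0).fit_resample(X_train, y_train)
X_smote, y_smote = SMOTE(sampling_strategy=ratio,
                         random_state=0).fit_resample(X_train, y_train)
```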

4.2 Model and Predictions

After the optimal hyper-parameters were found, a final model could be tested. The results can be seen in Table 3. The results are displayed per year instead of per chunk (quarter), to make them easier to comprehend. The combined results for all transactions are also displayed. As stated in the theory section, when comparing models in a streaming data environment that is prone to concept drift, one should focus on the whole stream rather than on the individual chunks [28].

In Figure 14 the ROC-curves for the different models are plotted. The model with the RNN-feature could not be tested on data from 2017, for reasons given in the method section. Therefore, to make the comparison fair, the other models' ROC-curves are also produced with data from 2018 and 2019. As seen, the model with the RNN-feature produces the highest AUC-ROC. It is also possible to see that the model performs worse when minority class transactions are passed on to the following chunks. Another thing that is clear is that all of these models perform substantially better, according to the AUC-ROC metric, than the "filter" that predicted all transactions to be non-fraudulent.
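ROC-curves like those in Figure 14 can be produced from fitted models and a common test set roughly as follows; the dictionary of models and the variable names are hypothetical.

```python
# Sketch of plotting ROC-curves for several fitted models on a common test
# set (X_test, y_test). `models` is a hypothetical dict of name -> classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

for name, model in models.items():
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, scores):.3f})")

# The all-non-fraudulent baseline corresponds to the diagonal.
plt.plot([0, 1], [0, 1], linestyle="--", label="All non-fraudulent (AUC = 0.5)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```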

To make the results easier to grasp, the different models' respective confusion matrices are shown below in Tables 4, 5 and 6. These confusion matrices are produced using data


Table 3: The table displays the results that the different models gave. The accuracy is calculated for a probability threshold of 0.5.

Filter                      Metric    2017              2018              2019              Combined          2018/2019
Random Forest model         AUC-ROC   0.8659 ± 0.0024   0.8904 ± 0.0011   0.8537 ± 0.0014   0.8744 ± 0.0012   0.8980 ± 0.0009
                            Accuracy  0.9695            0.9279            0.9643            0.9533            0.9458
Random Forest model         AUC-ROC   0.8671 ± 0.0012   0.8992 ± 0.0009   0.8659 ± 0.0024   0.8638 ± 0.0006   0.8955 ± 0.0010
 + minority pass on         Accuracy  0.9741            0.9475            0.9565            0.9591            0.9513
Random Forest model         AUC-ROC   –                 0.8966 ± 0.0011   0.8550 ± 0.0031   –                 0.9057 ± 0.0006
 + RNN based feature        Accuracy  –                 0.9214            0.9776            –                 0.9478
All non-fraudulent          AUC-ROC   0.5               0.5               0.5               0.5               0.5
                            Accuracy  0.9784            0.9719            0.9941            0.9810            0.9826


Figure 14: The figure displays the ROC-curves that the different filters produced for data from the years 2018 and 2019.

from 2018 and 2019 as testing data, for comparability. The default threshold of 0.5 was used. Out of interest, the same models were also tested with a threshold of 0.7; the resulting confusion matrices can be found in Appendix A.
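Confusion matrices like Tables 4-6 (and those in Appendix A) can be derived from the predicted fraud probabilities by fixing a threshold, for example as in the sketch below, where the label vector y_test and the probability vector scores are assumed to exist.

```python
# Sketch of deriving a confusion matrix from predicted fraud probabilities
# for a given threshold. y_test and scores are assumed to exist.
from sklearn.metrics import confusion_matrix

for threshold in (0.5, 0.7):
    predictions = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
    print(f"threshold={threshold}: TP={tp} FN={fn} FP={fp} TN={tn}")
```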

Table 4: Confusion matrix - Random Forest (with sliding window)

                         Predicted fraudulent   Predicted non-fraudulent
Actual fraudulent        262 (TP)               224 (FN)
Actual non-fraudulent    1292 (FP)              26206 (TN)

Table 5: Confusion matrix - Random Forest (with sliding window) + minority pass on

                         Predicted fraudulent   Predicted non-fraudulent
Actual fraudulent        256 (TP)               230 (FN)
Actual non-fraudulent    1132 (FP)              26366 (TN)


Table 6: Confusion matrix - Random Forest (with sliding window) + RNN based email address feature

                         Predicted fraudulent   Predicted non-fraudulent
Actual fraudulent        269 (TP)               217 (FN)
Actual non-fraudulent    1243 (FP)              26255 (TN)

To understand how necessary the sliding window approach is, the chunks have also been predicted with a static model, trained only on the first window (Q3 2016 - Q2 2017) and then used to predict all the following chunks. The AUC-ROC for the different chunks, predicted both by the static model and by the non-static model, can be seen in Figure 15. It is quite clear that the static model performs worse than the non-static model (except for one chunk), which can be seen as an indication that concept drift is present in the data set. Both the static and the non-static model seem to perform worse for Q2-Q4 of 2019. The reason for this is not clear, and there are many possible explanations. For example, the data for 2019 contained noticeably fewer fraudulent transactions than the other years, which could affect the result.
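A sketch of how the static and sliding-window evaluations could be set up is given below, assuming a time-ordered list chunks of quarterly DataFrames with a "fraud" label column and the hyper-parameters found in 4.1; all names are hypothetical.

```python
# Sketch of the static vs. sliding-window comparison. `chunks` is an assumed
# time-ordered list of quarterly DataFrames with a binary "fraud" column.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def split_xy(df):
    return df.drop(columns=["fraud"]), df["fraud"]

def fit_rf(train_df):
    X, y = split_xy(train_df)
    # Hyper-parameters from the grid search in 4.1.
    return RandomForestClassifier(n_estimators=1000, max_depth=10,
                                  random_state=0).fit(X, y)

# Static model: trained once on the first four quarters.
static_model = fit_rf(pd.concat(chunks[:4]))

for i in range(4, len(chunks)):
    X_test, y_test = split_xy(chunks[i])
    # Sliding window: retrained on the four most recent quarters.
    sliding_model = fit_rf(pd.concat(chunks[i - 4:i]))
    print(f"chunk {i}: "
          f"static AUC={roc_auc_score(y_test, static_model.predict_proba(X_test)[:, 1]):.4f}, "
          f"sliding AUC={roc_auc_score(y_test, sliding_model.predict_proba(X_test)[:, 1]):.4f}")
```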


4.3 Use of qualitative features

As mentioned above, it is possible to see an increase in performance when the RNN-feature is added to the model. When the years 2018 and 2019 are combined, the additional RNN-feature led to an increase in AUC-ROC of around 0.9 percent.

According to scikit-learn's function for evaluating feature importance, the RNN email feature was the fourth most important feature, right after "amount", "location" and "hour of the day". The RNN email feature was also considerably more important than the other techniques that were implemented to extract information from the email address. These other techniques included the length of the email address and whether the address has one of the most common domains (e.g. "gmail.com" or "outlook.com"). A bar-diagram of the feature importance can be seen in Figure 16. The grey list used in the current filter could not be implemented without having access to all transactions, because there was no way of knowing whether an email address had been seen before or not.

Figure 16: Feature importance. Some of the related features are summed up (e.g. billing address and shipping address).
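The importances in Figure 16 come from scikit-learn's impurity-based feature importance; a minimal sketch of reading them out from a fitted model is shown below, assuming a fitted classifier clf and the training DataFrame X_train.

```python
# Sketch of extracting impurity-based feature importances from a fitted
# Random Forest. clf and X_train are assumed to exist.
import pandas as pd

importances = pd.Series(clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```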

4.4 Running time

When implementing a machine learning model in production, it is important to be able to estimate the running time, both for the training of the model and for the prediction of a new transaction (also known as inference). In this work, a timer has been used to measure the time for various calculations. The average training time for the Random Forest model was 31.3 seconds with an average of 16,519 transactions. The average time it took for that model to make an inference was 0.00013 seconds. The average training time for the RNN was 261.7 seconds with an average of 14,452 transactions. The average time it took for that model to make an inference was 0.004775 seconds. The grid search mentioned in 4.1 had a running time of 647.5 seconds.

An important remark is that neither Random Forest nor the RNN has a linear time complexity. This means that the running time for a data set twice as large cannot be predicted by simply multiplying the running times given above by a factor of two. To be able to do that kind of analysis, a more in-depth investigation of the time complexity of the training algorithms must be made. In this thesis, the times serve as a way to compare the models relative to each other. The inference time, however, can be expected to stay roughly the same regardless of the size of the training data set.
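The timings above were obtained with a simple wall-clock timer; a sketch of how such measurements can be made is given below, assuming a classifier clf and data sets X_train, y_train and X_test.

```python
# Sketch of measuring training and single-transaction inference time with a
# wall-clock timer. clf, X_train, y_train and X_test are assumed to exist.
import time

start = time.perf_counter()
clf.fit(X_train, y_train)
print("Training time:", time.perf_counter() - start, "seconds")

start = time.perf_counter()
clf.predict_proba(X_test.iloc[:1])     # inference for a single transaction
print("Inference time:", time.perf_counter() - start, "seconds")
```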


5 Discussion

The discussion section follows a similar pattern as the result section. First, the results from the hyper-parameter setting are discussed. This is followed by a discussion of the predictive models and their performance. Lastly, the RNN feature's contribution is treated.

5.1 Hyper-parameters

In the same way that models need updating to stay relevant, it is possible that the hyper-parameters also need to be updated. In this thesis-project the hyper-parameters are tuned once, and only on data from early in the timeline. One could perhaps automate the tuning process and perform it every time a new model is to be trained. How this would affect the result is still unknown but could be interesting to explore further. One thing that is certain, however, is that such an approach would result in a much slower training process. If outdated hyper-parameters cause decreasing performance, that would explain the slight downward slope of the trend-line of the non-static model visible in Figure 15.

A surprising result from the tuning is that the best performance was reached when the random over sampling technique was used. This is surprising since most literature advocates the use of SMOTE for this purpose. A possible explanation could be that the data set used in this thesis is already balanced enough. This seems reasonable since the data set is constructed in such a way that it contains a higher proportion of fraudulent transactions than the "real" database. Having a larger ratio of minority data points could limit the benefit of not directly replicating existing data points. However, this explanation does not clarify why the AUC-ROC actually decreased slightly when SMOTE was used.

5.2 Model/Predictions

The sizes of the chunks and of the window in the model were chosen to fit the amount of data available. It is fully possible that another combination yields better results or is more functional in practice.


In production, the label of a transaction is usually not known right after it has been completed; this is especially true for fraudulent transactions. In this project, this fact has been disregarded because all data was already old enough for its label to be confirmed. In practice, however, one might have to exclude the most recent transactions from the training data set at each update. How this would affect the model's performance is unknown, but the impact is assumed to be minimal.

In Figure 15 one can see how important the updating mechanism is. The non-static model stays at approximately the same performance over time, while the static model's performance decreases. This can be seen as a sign that fraudsters' behaviour is prone to change over time. This is an important result for all kinds of fraud filters, not only those that are machine learning based. If one uses a fraud filter based on manually set thresholds, one must still be prepared to update these thresholds quite often for the filter not to become outdated.

One can see a decrease in performance when the fraudulent transactions are cumulatively passed on to the next window. This kind of technique has shown promising results in other articles, but in this case it was not beneficial. There are several possible reasons for this. One could be that the minority class becomes too common, so that the ratio between the minority and the majority class drifts too far from what it should represent. Another could be that the behaviour of the fraudsters changes a lot, which would also explain the worse result of the static model. A third could be that Resurs Bank has made small changes over time in how data is collected for each transaction.

In one of the models proposed in this thesis-project, the fraudulent transactions are cumulatively passed on to the next chunk to reduce the impact of the imbalanced data set. This technique works in this setting because the data set only consists of four years of transactions. This would not be the case in a real-time environment, and it is therefore important to set some kind of forgetting rate for the fraudulent transactions that are passed on to the succeeding windows. If this is not done, the fraudulent transactions will eventually become a majority class, which is the exact opposite of reality. This would probably make the model less accurate and is therefore necessary to prevent. By using a reasonable forgetting rate, fraudulent transactions that are several years old would eventually be removed from the training data.
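A minimal sketch of such a forgetting mechanism is given below; the column name "fraud", the helper function and the choice of max_age are hypothetical and only illustrate the idea.

```python
# Sketch of a forgetting mechanism: fraudulent transactions are carried over
# to later windows, but only for a limited number of windows (max_age).
import pandas as pd

def build_training_set(window_df, carried_fraud, max_age=8):
    # Age previously carried fraud cases and drop those that are too old.
    carried_fraud = [(age + 1, df) for age, df in carried_fraud if age + 1 <= max_age]
    # Training data = current window plus the carried (not yet forgotten) fraud cases.
    training_df = pd.concat([window_df] + [df for _, df in carried_fraud])
    # Remember this window's fraud cases for the following windows.
    carried_fraud.append((0, window_df[window_df["fraud"] == 1]))
    return training_df, carried_fraud
```

Each call returns the training data for the current window together with the updated carry list to pass on to the next window, so old fraud cases are gradually phased out instead of accumulating forever.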
