
Master Thesis Report

Identification of Problem Gambling via Recurrent Neural Networks

Predicting self-exclusion due to problem gambling within the remote gambling sector by means of recurrent neural networks

Måns Bermell

May 19, 2019

Supervisors:
Ivan Ukhov, Gears of Leo AB
Clayton Forssén, Umeå University

Examiner:
Ludvig Lizana, Umeå University

Student Fall 2018

Master Thesis, 30 ECTS


Abstract

In recent years the gambling industry has been moving towards offering its customers the possibility to gamble online instead of visiting a physical location. Aggressive marketing, fast growth and a multitude of actors within the market have resulted in a spike in the number of customers who have developed a gambling problem. Decision makers are trying to fight back by regulating markets in order to make the companies take responsibility and work towards preventing these problems. One method of working proactively in this regard is to identify vulnerable customers before they develop a destructive habit.

In this work a novel method of identifying customers who are at higher risk of gambling-related problems is explored. More concretely, a recurrent neural network with long short-term memory cells is created to process raw behavioural data, aggregated on a daily basis, and classify customers as high-risk or not. Supervised training is used in order to learn from historical data, where permanent self-exclusion due to gambling-related problems defines a problem gambler. The work consists of two parts: obtaining a locally optimal configuration of the network that enhances the performance in identifying problem gamblers who favour the casino section over the sports section, and analyzing the model to provide insights in the field.

This project was carried out together with the LeoVegas Mobile Gaming Group. The group offers both online casino games and sports betting in a number of countries in Europe.

This collaboration made both data and expertise within the industry accessible for performing this work. The company currently has a model in production to perform these predictions, but wants to explore other approaches.

The model that was developed showed a significant increase in performance compared to the one currently used at the company. Specifically, precision and recall, two metrics important for a two-class classification model, increased by 37% and 21% respectively. Using raw time series data instead of aggregated data increased the responsiveness to customers' changes in behaviour over time. The model also scaled better with more history than the current model, which could be a result of the recurrent nature of the network.

Keywords: LSTM, machine learning, online casino, problem gambling, recurrent neural network, time series


Preface

As enormous amounts of data are being collected in the digital world, it is essential to analyze them and learn from history. The basis for this thesis arose from my curiosity about how state-of-the-art machine learning techniques can be utilized in a real-world example, while at the same time contributing to society by assisting online casinos in preventing problem gambling. This paper can assist future studies in understanding the advantages and pitfalls of using recurrent neural networks for predicting problem gambling behaviour.

This thesis is a collaboration with Gears of Leo AB, where I have spent the last semester learning about the gambling industry and developing neural networks, but also making friends and future colleagues. I would never have made it this far without the support of family, friends and fellow students. A special thank you goes to the data science team and especially my supervisor Ivan, who was constantly available to guide me during this thesis.

May 19, 2019

Måns Bermell


Contents

Feature list
Abbreviations

1 Introduction
1.1 Motivation
1.2 Company
1.3 Objectives
1.4 Problem formulation
1.5 Previous studies
1.6 Overview

2 Theory
2.1 Machine learning
2.1.1 Perceptron
2.1.2 Artificial neural networks
2.1.3 Recurrent neural networks
2.1.4 Long short-term memory
2.1.5 Cross-entropy loss
2.1.6 Backpropagation
2.1.7 Adam optimizer
2.2 Metrics
2.2.1 Accuracy
2.2.2 Precision and recall
2.2.3 Fβ score

3 Method
3.1 Data collection
3.2 Data preprocessing
3.3 Model specific preprocessing
3.3.1 Padding and bucketing
3.3.2 Deep learning architecture
3.4 Learning process
3.4.1 Training
3.4.2 Validation
3.4.3 Testing

4 Results
4.1 Setup
4.2 Dataset
4.3 Training
4.3.1 Hyperparameter optimization
4.4 Validation
4.4.1 Testing

5 Analysis
5.1 Model specific analysis
5.1.1 Feature importance
5.1.2 Importance of the number of active days
5.1.3 Customer behaviour over time
5.2 Comparison with the company's model

6 Conclusion
6.1 Limitations
6.2 Future work

A Sampling techniques
B Odds ratio


Feature list

date Day when an activity was observed.

deposit_approved_num Number of approved deposits per active day.

deposit_approved_sum Total approved deposits in euros.

deposit_denied_num Number of denied deposits.

deposit_denied_sum Total denied deposits in euros.

exclusion_other_num Number of exclusion tools used that are not related to problem gambling.

exclusion_problem_gamble Flag indicating that the user is excluded from future gambling, either by him- or herself or by the staff.

last_seen_num Number of days since previous active day.

limit_num Number of limits set, such as limiting session time per week.

player_id Unique number for each user.

session_num Number of sessions registered.

session_sum Total session time in seconds.

slots_turnover_num Number of bets placed on casino slots.

slots_turnover_sum Total sum of bets placed on casino slots in euros.

sports_turnover_num Number of sport bets placed.

sports_turnover_sum Total sum of sport bets placed in euros.

turnover_bonus_num Number of bets placed with bonuses.

turnover_bonus_sum Total sum of bets placed with bonuses in euros.

turnover_cash_num Number of bets placed with own money.

turnover_cash_sum Total sum of bets placed with own money in euros.

turnover_midnight_num Number of bets placed between 00:00 and 04:00.

turnover_saturday_num Number of bets placed on Saturdays.

winning_bonus_num Number of bonus winnings.

winning_bonus_sum Total bonus winnings in euros.

winning_cash_num Number of winnings based on own money.

winning_cash_sum Total winnings from own money in euros.

withdrawal_approved_num Number of approved withdrawals.

withdrawal_approved_sum Total sum of withdrawals in euros.

withdrawal_canceled_num Number of canceled withdrawals.

withdrawal_canceled_sum Total sum of canceled withdrawals in euros.


Abbreviations

ANN Artificial neural network
BN Batch normalization
FP False positive
FN False negative
FC Fully connected
GGR Gross gaming revenue
LSTM Long short-term memory
ML Machine learning
PR Precision recall
PG Problem gambling
RG Responsible gambling
RNN Recurrent neural network
ROC Receiver operating characteristic
SE Self-exclusion due to problem gambling
TP True positive
TN True negative
VM Virtual machine


1 Introduction

1.1 Motivation

The majority of people living in the Western world have at least once placed a wager.

Wagering, or betting as it is more commonly called, is not a new phenomenon; it has been around for millennia. The first dice were found in ancient tombs dating back to 2800–2500 BC [1].

Today the gambling industry has turned into a multi-billion-euro industry, where the Swedish market alone generated a gross gaming revenue¹ (GGR) of 2.2 billion euros for the operators in 2017 [2]. Online gambling is the sector that has been driving the growth within the industry in recent years. In 2017, 46% of the total GGR in Sweden was obtained from the online market, an increase of 11.4% compared to 2016 [2].

This rapid growth of a product that is constantly available has led to an increase in the number of people developing problems due to gambling. According to the yearly report from the Stockholm gambling prevention organisation Stödlinjen, the number of people contacting helplines regarding gambling problems due to online casinos increased by 188% between 2013 and 2017 [3]. On the 1st of January 2018, gambling addiction was added to socialtjänstlagen (the Swedish Social Services Act), which implies that gambling addiction is considered as important as other common addictions, such as alcohol and tobacco addiction [4]. Problem gambling is an umbrella term describing the negative social, economic or health consequences that gambling with money can result in [5].

In recent years many countries have started to regulate their markets in order to control this development and force operators to comply with rules that support customers and counter money laundering [6]. Within the gambling sector, much focus has thus been placed on responsible gambling (RG), which operators must demonstrate in order to obtain and maintain their licences.

One could argue that there is a conflict of interest within the gambling industry: on the one hand, customers with gambling problems often generate large revenues; on the other hand, companies risk fines and losing their licences. The serious operators try to benefit from being responsible by creating long-term relationships with customers, which generates a more positive public attitude towards the companies. One common practice for staying responsible is to monitor customers and intervene when they seem to gamble irresponsibly, which could increase the risk of them becoming problem gamblers. Since a company's customer base can reach millions, it is not feasible to do this manually; instead, algorithms utilizing the computational power of computers are used.

The task of identifying customers who have a higher risk of becoming problem gamblers could be implemented with simple logic, for example flagging anyone who wagers more than X per week. However, since gambling behaviour varies strongly between people, this approach is not adequate.

¹ Gross gaming revenue implies revenue after winnings have been paid out to the customers.


Machine learning (ML) is a field within artificial intelligence (AI) that has been researched thoroughly over the last decade. Its popularity is due to the increase in computational power, which has made these types of algorithms available to the public. Industry has thus also adopted machine learning to boost business and take a more data-driven approach.

In section 2, the recurrent neural network (RNN), a special type of neural network used for processing sequences and time series, is explained. Based on published research, identifying problem gambling with an RNN utilizing raw time series data is a novel method. The method of configuring the network is thoroughly explained in section 3. This novel approach demonstrates good performance both overall and when only recent historical data is taken into account; these results are further discussed in sections 4 and 5 respectively.

1.2 Company

This thesis is a collaboration between the student and the company Gears of Leo AB. Gears of Leo is the technology and product company within the LeoVegas Mobile Gaming Group, an online casino and betting company. The main objective of the company is to empower the mobile gaming experience at LeoVegas by being data driven. More specifically, the thesis is a collaboration with the data science team, which performs advanced analysis and predictive modeling.

1.3 Objectives

The main goal of this thesis is to explore the possibility of using recurrent neural networks as a means of identifying customers who run the risk of becoming problem gamblers or have already developed a gambling problem. The thesis will later serve as a first study supporting the data science team at the company in understanding the pros and cons of RNNs for the task at hand. The company has been collecting data about customer activity; this data can be used to understand which behaviours lead to the development of problem gambling.

The most accurate procedure for labelling customers with respect to the risk of being or becoming problem gamblers is a thorough clinical study. This requires a lot of manual work and does not scale to the large dataset required. Another approach is to have customers fill out a form; however, since neural networks require large datasets, and in order not to interfere with the company-customer relationship, another method of labelling the customers must be used.

1.4 Problem formulation

Relevant data regarding the interaction between the customer and the casino has been provided on a daily aggregated level. The data contains, for example, information on how much each customer has spent per day. Without aggregating or transforming the data further, it is used in its raw time series format. Each customer's time series has been labelled in advance according to whether the customer has decided to self-exclude from the casino due to gambling-related problems. The problem is to use this data to train an RNN that can predict whether a customer not previously seen by the network is a problem gambler. In order to understand how well the RNN performs, it has to be compared with the model currently used for this specific task at the company.


Supervised learning techniques require data that has been labelled; this implies that the output that the model is trying to predict is known in advance. The output is often referred to as the target variable. Sufficient amounts of data are also required for all targets; in the case of this thesis, behavioural data was required from both problem gamblers and non-problem gamblers. For this specific problem, the optimal target variable could be obtained by performing clinical studies of a large set of customers in order to identify whether they are problem gamblers or not. Unfortunately, this approach is not feasible, both due to the enormous work of conducting such studies and due to the company's policies regarding contacting the customer. Moreover, supervised machine learning using deep networks, as explained in section 2.1.2, requires vast amounts of data in order to produce relevant results [7], which implies that a small clinical study would not gather enough data. A scalable alternative that has been discussed is to ask customers to fill in a form based on the problem gambling severity index (PGSI) [8]. This approach has been considered by the company but not yet implemented due to the errors it could introduce. The approach that has actually been taken is as follows.

All customers using the online casino have the possibility to exclude themselves from it for various reasons; one of these reasons is problem gambling. When customers exclude themselves due to gambling-related problems, their account is closed, all money in the account is refunded, and their personal information is deleted. The user is also blocked from using the casino for wagering in the future. This thesis uses these self-exclusions (SE) due to gambling problems as a proxy for problem gambling, which implies that we are predicting self-exclusions rather than problem gambling itself.

1.5 Previous studies

In the current literature, several different machine learning techniques are being used to predict customers with harmful gambling behaviour. Previous studies on the classification of problem gamblers have mainly focused on unsupervised clustering methods [9–11]. These methods do not require any labelled data; hence they can be trained without a definition of what problem gambling really is. On the other hand, there is supervised learning, which is based on knowing the ground truth for what the model is trying to predict; the data must thus be prelabelled. The most widely used proxy for problem gambling is self-exclusion due to gambling-related problems, where customers can exclude themselves for an indefinite time. In studies where supervised learning has been used, behavioural markers have primarily been calculated. Behavioural markers are scalars that by themselves could carry information about whether a user is exhibiting problem gambling behaviour. These markers are then used as data for training a model based on known labels.

Dragivevic et al., 2016 [12], evaluate several supervised machine learning techniques. The authors aggregated the raw customer data into 33 behavioural markers, based on the work of Braverman et al., 2012 [10] and Hayer et al., 2010 [13], using multiple algorithms such as regression models and t-tests. The models examined were logistic regression, Bayesian networks, neural networks and random forests. Since the majority of users in their dataset were not marked as problem gamblers, there was a clear imbalance between the two classes (excluder or non-excluder) in the data, and the authors observed a large performance increase when balancing the data via oversampling. The best performing model turned out to be random forest.

A similar study by Philander et al., 2014 [14], evaluated nine different supervised learning techniques. The aim was to give further studies an indication of which categories of machine learning models could be neglected and which models could be effective for this specific problem. The study concluded that supervised learning techniques can generally improve classification accuracy compared to clustering techniques. Artificial neural networks (ANNs) were found to be the most useful; however, there were still many unidentified problem gamblers as well as many false positives. The authors discuss the need to continue exploring other types of models, especially since the field of machine learning is moving at such a fast pace.

Both Dragivevic et al., 2016 [12] and Philander et al., 2014 [14] discuss the need for hold-out testing in order to obtain unbiased performance measurements. Hold-out testing is a method where models are evaluated on a dataset not used to train them. Performing hold-out testing reveals the true performance, since the model has never seen this data and cannot learn from it.

However, there is no published implementation where raw time-series data has been used for these predictions. There may, however, be research on this topic within gambling companies that has not been made public or is not yet published.

1.6 Overview

In section 2, recurrent neural networks are explained starting from the basics, along with other important aspects pertaining to machine learning. Section 3 covers the method used in this thesis, starting with how the data was extracted and finishing with the choice of the final model. Section 4 reports the results regarding the models used, as well as the reasoning behind various hyperparameters. The analysis of the final model, both in isolation and in comparison with the current model used at the company, is discussed in section 5. The thesis is concluded in section 6.


2 Theory

2.1 Machine learning

Machine learning (ML) is a branch within the broader term artificial intelligence. The general idea of ML is to use models that can learn from historical data, identify structures and patterns within the data, and make decisions or predictions. ML can be divided into two subgroups: supervised and unsupervised learning. Supervised learning is used when the data is labelled or categorized, that is, when the truth is known. Unsupervised learning is used to find structures in the data when labels are unknown, and is commonly used for clustering data, finding outliers etc. This work concerns only supervised learning techniques.

2.1.1 Perceptron

Scientists and researchers often study how evolution has solved a particular problem. Since evolution spans billions of years, the solutions being observed tend to be extremely efficient. Brains and, more generally, nervous systems are no exception: they consist of a multitude of neurons connected in a sophisticated network. In 1957 Frank Rosenblatt came up with a mathematical model of a neuron, called the perceptron [15]. A visual representation of the perceptron is shown in Fig. 2.1. The perceptron transforms inputs x_i with weights w_i and a bias b, which is independent of the inputs, via an activation function f into a scalar output ŷ, as shown in Eq. (2.1). The activation functions used in this thesis are given in Eq. (2.2) and (2.3).

ŷ = f( Σ_{i=1}^{n} w_i x_i + b )        (2.1)

Sigmoid:  f(x) = σ(x) = 1 / (1 + e^{−x})        (2.2)

Tanh:  f(x) = tanh(x) = (e^{x} − e^{−x}) / (e^{x} + e^{−x})        (2.3)

If the input data x is correctly labelled with the target value y, it is possible to alter the weights w and bias b to make better estimates ŷ. This process is referred to as backpropagation and is explained more thoroughly in section 2.1.6. A single perceptron can only learn linear behaviours.
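As a concrete sketch, the forward pass of Eq. (2.1) with the sigmoid activation of Eq. (2.2) can be written in a few lines of Python; the inputs, weights and bias below are arbitrary illustrative values, not taken from the thesis.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, Eq. (2.2)."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b, f=sigmoid):
    """Perceptron forward pass, Eq. (2.1): f(sum_i w_i x_i + b)."""
    return f(np.dot(w, x) + b)

# Arbitrary example values.
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.1
y_hat = perceptron(x, w, b)  # sigmoid(0.5 - 0.5 + 0.1) = sigmoid(0.1) ≈ 0.525
```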


Figure 2.1: Illustration of a perceptron. It performs an elementwise multiplication of inputs and weights (x and w), sums the products, adds a bias term b, and passes the result through an activation function f to compute the final output ŷ.

2.1.2 Artificial neural networks

The human nervous system does not consist of a single neuron; neurons are connected to each other via synapses. Researchers mimicked this as well: by stacking perceptrons (nodes) both in parallel and in sequence, the network of nodes can learn more advanced, non-linear behaviours. A visual representation of such a network is shown in Fig. 2.2, where each circle represents a perceptron and the lines between them represent the synapses in the brain. Information flows from the input layer to the output layer; hence it is often called a feed-forward network. The layers of an artificial neural network can be divided into three sections: the input layer, the hidden layer and the output layer. The input layer takes the input data and transmits its output forward. The hidden layer can be built up of multiple layers with a varying number of nodes in each layer; having more than one hidden layer is often referred to as a "deep" network.

Different numbers of nodes per layer and different numbers of layers can give different performance results; hence one has to find the most suitable combination for each specific task.

Figure 2.2: Simplified illustration of an artificial neural network. Information flows from the input layer to the output layer. The hidden layer can consist of multiple layers of different sizes.


As explained in section 2.1.1, the network can learn from historical data by altering the weights and biases in each neuron. A disadvantage of the ANN is that all input data must have the same shape; hence it is not suitable for handling data of varying size or length.
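To make the stacking concrete, a minimal feed-forward pass with one tanh hidden layer and a sigmoid output node can be sketched as below; the layer sizes and random weights are placeholders for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Feed-forward pass: each layer is a bank of perceptrons."""
    h = np.tanh(W1 @ x + b1)     # hidden layer (tanh activation)
    return sigmoid(W2 @ h + b2)  # output layer (sigmoid activation)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                           # 3 input features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer of 4 nodes
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # single output node
y_hat = mlp_forward(x, W1, b1, W2, b2)           # a probability in (0, 1)
```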

2.1.3 Recurrent neural networks

Recurrent neural networks (RNNs) process sequences of data, which is enabled by the hidden state h_{t−1} from the previous time step being fed to the network as input together with the input data x_t. The network thus has a notion of time. Let x be a sequence of data, x = (x_1, x_2, ..., x_T). The hidden state h_t is then updated in each node at each time step as follows:

h_t = 0 for t = 0,  and  h_t = Φ( x_t, h_{t−1} ) otherwise,        (2.4)

where Φ is a non-linear function [16]. The classic RNN update is the Elman network update:

h_t = 0 for t = 0,  and  h_t = φ_h( W_h x_t + U_h h_{t−1} + b_h ) otherwise,        (2.5)

y_t = φ_y( W_y h_t + b_y ),        (2.6)

where W_h and W_y are the input-to-hidden and hidden-to-output weight matrices respectively, U_h is the hidden-to-hidden weight matrix, b_h and b_y are the biases for the hidden state and the output respectively, and φ is a smooth, bounded activation function [16].

Depending on the application, it is possible to utilize an RNN in different ways; the configurations can be broken down as one-to-one, one-to-many, many-to-one and many-to-many, referring to the input and output of the network. A one-to-many network takes one input x_1 and outputs many predictions for future steps; similar logic applies to the other versions.

The many-to-many configuration is illustrated in Fig. 2.3, where the algorithm returns an output prediction for each input. In the many-to-one version, the algorithm returns a single output at the end of the sequence, as illustrated in Fig. 2.4.
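As a sketch, the Elman updates of Eq. (2.5) and (2.6) in a many-to-one configuration (one prediction after the whole sequence) could look as follows in NumPy; all dimensions and weights are arbitrary placeholders, not values from the thesis.

```python
import numpy as np

def elman_step(x_t, h_prev, W_h, U_h, b_h):
    """Eq. (2.5): h_t = tanh(W_h x_t + U_h h_{t-1} + b_h)."""
    return np.tanh(W_h @ x_t + U_h @ h_prev + b_h)

def elman_many_to_one(xs, W_h, U_h, b_h, W_y, b_y):
    """Run the whole sequence, read out only the final hidden state, Eq. (2.6)."""
    h = np.zeros(U_h.shape[0])                       # h_0 = 0
    for x_t in xs:
        h = elman_step(x_t, h, W_h, U_h, b_h)
    return 1.0 / (1.0 + np.exp(-(W_y @ h + b_y)))    # sigmoid readout

rng = np.random.default_rng(1)
xs = rng.normal(size=(5, 3))                         # 5 time steps, 3 features each
W_h, U_h, b_h = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
W_y, b_y = rng.normal(size=(1, 4)), np.zeros(1)
y_hat = elman_many_to_one(xs, W_h, U_h, b_h, W_y, b_y)
```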

Figure 2.3: An unfolded many-to-many recurrent neural network, where x_t is the input at each time step, h_t is the hidden state at each time step and y_t is the output at each time step. The memory weights M_t are stored within each cell.

Figure 2.4: An unfolded many-to-one recurrent neural network.

A major issue with traditional Elman RNNs is the vanishing/exploding gradient problem, a dilemma for all neural networks trained with gradient-based learning algorithms and backpropagation [16]. For these algorithms, the weights and biases are updated proportionally to the partial derivative of the user-defined loss function. The problem originates from the fact that when a number smaller or larger than one is multiplied with itself many times, the product either converges to zero or diverges towards infinity, which leads to poor updates of the weights used early in the network. In the next section the long short-term memory (LSTM) cell is introduced as a solution to this specific problem.

2.1.4 Long short-term memory

The long short-term memory is a cell that is incorporated within each node of the classical RNN. This cell can maintain memory over a longer period of time than the classical RNN [16]. It was proposed in 1997 by Hochreiter and Schmidhuber but has since been slightly modified. In this section the implementation given by A. Graves, 2013 [17] is explained.

An LSTM cell is made up of an inner memory and four one-layer ANNs with different activation functions. Below is an explanation of all variables used in the LSTM [17]; they are also illustrated in Fig. 2.5.

• Forget gate f – neural network with sigmoid activation function, σ.

• Input gate i – neural network with sigmoid activation function, σ.

• Output gate o – neural network with sigmoid activation function, σ.

• Candidate layer C̃ – neural network with tanh activation function.

• Hidden state h – vector with hidden state information.

• Memory state C – vector with memory state information.

The major difference between an LSTM unit and a simple recurrent unit is that, while the simple recurrent unit calculates a weighted sum of the input signal before applying the activation function, each LSTM unit manages a memory block within the cell. The LSTM cell uses multiple activations in order to learn which data should be kept, forgotten or updated in the memory. This built-in memory mitigates the vanishing/exploding gradient problem; hence the LSTM is better suited for discovering long-range dependencies in the data. The formulas that make up the LSTM cell are shown in Eq. (2.7)–(2.12) [17].


Figure 2.5: Illustration of an LSTM cell. Circles are simple mathematical operations such as addition and multiplication, while rounded rectangles are activation functions such as the sigmoid function and tanh.

i_t = σ( W_{xi} x_t + W_{hi} h_{t−1} + W_{ci} C_{t−1} + b_i )        (2.7)

f_t = σ( W_{xf} x_t + W_{hf} h_{t−1} + W_{cf} C_{t−1} + b_f )        (2.8)

C^f_t = f_t ⊙ C_{t−1}        (2.9)

C_t = C^f_t + i_t ⊙ tanh( W_{xc} x_t + W_{hc} h_{t−1} + b_c )        (2.10)

o_t = σ( W_{xo} x_t + W_{ho} h_{t−1} + W_{co} C_t + b_o )        (2.11)

h_t = o_t ⊙ tanh( C_t )        (2.12)

where ⊙ denotes elementwise multiplication.

Similarly to the artificial neural network, it is possible to stack multiple LSTM layers with multiple cells per layer, which can increase the performance of the network.
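A single step of Eq. (2.7)–(2.12) can be sketched as below. The peephole weights W_ci, W_cf and W_co are assumed diagonal, as in Graves' formulation, and are therefore stored as vectors applied elementwise; all dimensions and weights are illustrative placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step following Eq. (2.7)-(2.12); p holds the weights."""
    i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["wci"] * C_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["wcf"] * C_prev + p["bf"])
    C_tilde = np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    C = f * C_prev + i * C_tilde          # forget old memory, write new memory
    o = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["wco"] * C + p["bo"])
    h = o * np.tanh(C)                    # expose part of the memory as output
    return h, C

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4
p = {k: rng.normal(size=(n_hid, n_in)) for k in ("Wxi", "Wxf", "Wxc", "Wxo")}
p.update({k: rng.normal(size=(n_hid, n_hid)) for k in ("Whi", "Whf", "Whc", "Who")})
p.update({k: rng.normal(size=n_hid) for k in ("wci", "wcf", "wco")})
p.update({k: np.zeros(n_hid) for k in ("bi", "bf", "bc", "bo")})
h, C = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), p)
```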

2.1.5 Cross-entropy loss

Machine learning is an optimization problem where the weights and biases have to be tuned in order to achieve the desired results. A loss function is thus needed to tell the network how well it performs and where the largest errors originate. The cross-entropy loss is used for classification problems; it grows rapidly the further away the prediction is from the true label. The general formulation of the loss is as follows:

Loss = − Σ_{c=1}^{M} y_{o,c} log( ŷ_{o,c} ),        (2.13)

where M is the number of classes in the model, o denotes the observation and ŷ the predicted probabilities [18]. For a two-class classification model, the loss reduces to:

Loss = −( y log(ŷ) + (1 − y) log(1 − ŷ) ).        (2.14)

After the loss has been calculated, it is possible to understand which specific data the network does not perform well on. The weights and biases can then be updated so that the loss becomes smaller if the network processes the same data once again.
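A numerically safe sketch of the two-class loss of Eq. (2.14), averaged over a small batch; the labels and predictions below are made up for illustration.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Two-class cross-entropy, Eq. (2.14), averaged over a batch."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return float(np.mean(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))))

y = np.array([1.0, 0.0, 1.0])        # true labels
y_hat = np.array([0.9, 0.2, 0.6])    # predicted probabilities
loss = binary_cross_entropy(y, y_hat)  # ≈ 0.2798
```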

2.1.6 Backpropagation

The aim of backpropagation is to calculate how the weights inside the model should be changed in order to decrease the final loss/error E. The final error of the model is calculated by comparing the true values y and the predicted values ŷ via a loss function L. One calculates the derivative of the error with respect to each specific weight; for a one-layer network this derivative is ∂E/∂w_{ij}, where w_{ij} is the weight between nodes i and j. This is done by using the chain rule:

∂E/∂w_{ij} = (∂E/∂o_j) · (∂o_j/∂net_j) · (∂net_j/∂w_{ij}),        (2.15)

where o_j is the output of neuron j and net_j is the weighted sum within neuron j. For a more complex network, the same approach is used to propagate the error backwards via the chain rule. The weights can then be updated as follows:

w^{new}_{ij} = w_{ij} + Δw_{ij} = w_{ij} − η ∂E/∂w_{ij},        (2.16)

where η is a predefined learning rate which is used to take smaller or larger steps when updating the weights.
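For a single sigmoid neuron with the cross-entropy loss, the chain rule of Eq. (2.15) collapses to the well-known gradient (ŷ − y)x, and the update of Eq. (2.16) can be sketched as follows; the weights, input and learning rate are toy values, and a numerical difference serves as a sanity check on the gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    """Binary cross-entropy of a single sigmoid neuron."""
    y_hat = sigmoid(np.dot(w, x))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def grad(w, x, y):
    """Chain rule, Eq. (2.15): dE/dw = (dE/dy_hat)(dy_hat/dnet)(dnet/dw) = (y_hat - y) x."""
    return (sigmoid(np.dot(w, x)) - y) * x

w, x, y, eta = np.array([0.5, -0.3]), np.array([1.0, 2.0]), 1.0, 0.1
w_new = w - eta * grad(w, x, y)   # Eq. (2.16): step against the gradient
```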

2.1.7 Adam optimizer

Instead of using the simple update of the weights and biases mentioned above, one can use a more advanced approach: Adam. Adam is a first-order, gradient-based optimization algorithm for stochastic objective functions. The algorithm is based on adaptive moment estimation and maintains an individual adaptive learning rate for each parameter [19]. It combines the advantages of AdaGrad [20] and RMSProp [21]. The algorithm requires little memory and is thus useful when solving machine learning problems with large datasets or many features.

At each step, the algorithm updates exponential moving averages of the gradient and of the squared gradient, where two constants control the decay of these averages. The effective update steps are bounded by the predefined step size; however, the magnitude adapts according to the gradients [19].
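One Adam step, following the update rule in [19], can be sketched as below; the quadratic toy objective and the hyperparameter values are illustrative defaults, not choices made in the thesis.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and the squared gradient (v), with bias correction for small t."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)           # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy objective f(theta) = theta^2 starting from theta = 1.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    g = 2.0 * theta                     # gradient of theta^2
    theta, m, v = adam_step(theta, g, m, v, t, lr=0.05)
```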

2.2 Metrics

Measuring the loss of a model does not tell the entire story. One also needs to observe some vital performance measurements, such as accuracy, in order to understand how the model actually performs.

A binary classifier outputs either true or false; the predictions can be classified into four different kinds of results: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). The TPs are all correctly classified positives and the TNs are all correctly classified negatives; similar reasoning holds for FP and FN. Fig. 2.6 illustrates a confusion matrix for a binary classifier.

TP

FN

FP

TN

True Label

Predicted Label

True

True False

False

Figure 2.6: Confusion matrix for a binary classification model. The x-axis rep- resents a binary classifier and the y-axis represents the ground truth.

2.2.1 Accuracy

The accuracy is a standard metric to evaluate predictions; it is defined as the number of correctly classified samples divided by the total number of samples, as seen in Eq. (2.17). There is, however, a major drawback with using accuracy when a dataset is imbalanced.1 If a dataset is made up of two classes c1 and c2, where 95% of the data comes from c1, a model that predicts all data as c1 will achieve an accuracy of 95%. However, this model is arguably useless.

Accuracy = (TP + TN) / (TP + TN + FP + FN) (2.17)

Hence other metrics are better suited when working with imbalanced data, which we discuss next.
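The pitfall above can be demonstrated in a few lines. The numbers below are synthetic, chosen only to mirror the 95%/5% example; the degenerate model always predicts the majority class c1 and still scores 95% accuracy:

```python
# Synthetic imbalanced labels: 95 samples of class c1 (0), 5 of class c2 (1).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100                   # degenerate model: always predict c1

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# Eq. (2.17): high accuracy despite never detecting class c2 (tp == 0).
accuracy = (tp + tn) / (tp + tn + fp + fn)
```

Since tp is zero here, any metric built around the positive class (such as precision or recall) would immediately expose this model.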

2.2.2 Precision and recall

Precision is the fraction of correctly classified positives out of all predicted positives:

Precision = TP / (TP + FP) (2.18)

Recall, or true positive rate, is another measurement. It is the fraction of correctly classified positives out of all actual positives:

Recall = TP / (TP + FN) (2.19)

1The dataset contains a different number of instances of each class.


Precision-recall curve

A binary classifier outputs a continuous value between 0 and 1; in order to decide whether the output should be set to 0 or 1, a cut-off threshold is used. If the output is above the threshold, the label is set to 1, and vice versa. By altering this threshold it is possible to make a trade-off between precision and recall: a low threshold increases the recall and lowers the precision, while a high threshold increases the precision and lowers the recall. This trade-off is commonly illustrated via a plot of precision against recall over all thresholds that alter them; hence it is named the precision-recall (PR) curve.

2.2.3 Fβ score

If one would like to monitor both precision and recall to measure the performance of a classifier, it is not suitable to have two scalars. In order to merge precision and recall into one scalar, the Fβ score can be used. This is a single measurement that combines the values of precision and recall with different weights, as shown in Eq. (2.20). The value of β is chosen depending on which metric is more important. For instance, if β = 1 is chosen, precision and recall have the same importance, while β > 1 assigns more importance to recall.

Fβ = (1 + β²) · (precision · recall) / ((β² · precision) + recall). (2.20)


3 Method

3.1 Data collection

The company has a dedicated team of data engineers who are responsible for developing and maintaining a data warehouse. The data warehouse is based on Google BigQuery [22] and includes a wide range of data that can be used for reports and more advanced analytics in order to facilitate the work of other teams within the company. This thesis only considers customer activity data available in the data warehouse.

3.2 Data preprocessing

The company has two models for predicting problem gambling: one used for customers who primarily use sports booking and one for casino customers. This segmentation is based on the market (casino or sports) in which the largest proportion of wagers is placed. Since the casino market is substantially larger than sports betting (91% of total GGR [23]), the focus of this thesis is on customers within the casino segment. Another restriction on the data used is to only consider customers who have made at least one deposit to their gambling accounts.

All relevant behavioural data that could be used in a time series fashion was extracted from the data warehouse via SQL queries. The features that were extracted contain information discussed in the list below; the features are measured in either: occurrences, euros or seconds depending on the feature. For a specific list of all features used, see the feature list defined at the beginning of the report.

• Player ID - A unique ID that belongs to a customer.

• Date - The date where the daily activities were aggregated.

• Deposits - The amount and number of occurrences of approved or denied transfers to a customer's gambling account. Denied deposits originate only from limits previously set by the customer.

• Withdrawals - The amount and number of occurrences of approved or canceled transfers from the gambling account. Canceled withdrawals are registered when customers change their minds regarding a withdrawal.

• Turnover - Bets placed, also known as turnover or wagers; both the number of wagers and the size of the wagers are registered. Information regarding casino/sports bets, bonuses/cash spent, bets placed on Saturdays and bets placed around midnight (between 00:00–04:00) is available. Activity on Saturdays and around midnight has been observed to be a strong indicator of problem gambling behaviour [24].

• Winnings - Winnings before subtracting the turnover. Information regarding both bonus/cash winnings is available in the total amount as well as number of winnings.

• Session time - The time between login and logout is measured and referred to as a session. Sessions are registered both in seconds and as the number of sessions per day.

• Exclusion due to problem gambling - A binary variable indicating if the customer has chosen to exclude him- or herself due to gambling-related problems. Used as target variable as discussed in section 1.4.

• Other exclusions - Customers have the possibility to exclude themselves short or long-term due to other reasons than problem gambling.

• Limits - Customers can also limit themselves by, for instance, setting a maximum session time or maximum cash spent per day. Measured as the number of limits set for each day.

The features were aggregated per player and per active day (a day where any of the features is non-zero). In this manner, it is possible to represent all customers in a time series fashion, where each new entry represents a new active day. The feature date was transformed into another feature called last_seen, in order to give the network a notion of the number of days between active days.

A vital preprocessing step used was to normalize each feature so that it has zero mean and unit variance. The motivation behind this step was to not have some features dominate the results [25].

3.3 Model specific preprocessing

For classifying the customer time series, a recurrent neural network with long short-term memory cells was utilized. This model was chosen due to its high performance in other fields that are also based on sequences, such as speech recognition [26] and handwriting recognition [27].

Before any customer time series data can be processed by the model it has to be transformed into a certain format. The standard shape of a single time series with multiple features is as follows:

x = [ f_{0,0}  f_{1,0}  ...  f_{N,0}
      f_{0,1}  f_{1,1}  ...  f_{N,1}
      ...      ...      ...  ...
      f_{0,T}  f_{1,T}  ...  f_{N,T} ],

where f_{i,t} is feature i at time step t. Processing one customer at a time is a computationally inefficient approach. On the other hand, one could add a third dimension to x which represents the different time series, and then feed the model all the available data at once.

However, this generates other types of problems, such as: all time series might not be of the same length (making it impossible to create this third dimension) and the computer is required to read all data into memory. Assume first that all users have equally many active days; the full three-dimensional tensor would then still be too large to fit in memory. The most common approach to minimize the memory usage is to let subsets of the data pass through the model; these subsets are called mini-batches. Mini-batches cause a trade-off between computation time and the computer power/memory needed. Updating the model's weights, introduced in section 2.1.2, via mini-batches instead of one time series at a time also makes the updates less noisy, since the updates are more generalized towards a subset of the data.

3.3.1 Padding and bucketing

The previous reasoning was based on all customers time series having the same length; this is of course not true. To tackle this problem a method called padding is applied. Padding is performed by adding zeros at the beginning of all sequences shorter than the longest one, in order to make it possible to add this third dimension discussed previously.

However, padding generates two new problems: it increases the amount of data the model needs to process and it introduces false information (zeros). In order to reduce the amount of padding, a technique called bucketing is used. Bucketing defines multiple buckets with minimum/maximum allowed sequence lengths and sorts each time series into the correct bucket. Each bucket is then divided into mini-batches, where each mini-batch is padded according to the maximum sequence length within the bucket. An illustration is shown in Fig. 3.1. For this implementation 30 buckets were used, where the lower and upper limits were defined by percentiles based on the spread of lengths in the data.

Figure 3.1: An illustration of how different mini-batches (MB) are sorted into buckets depending on their sequence length. Note that one mini-batch in each bucket might consist of fewer sequences (illustrated by the physical size of the MBs) than the standard mini-batch size.
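The bucketing and padding steps can be sketched as follows. This is a minimal, illustrative implementation (the function names and the use of evenly spaced percentiles as bucket limits are assumptions for the example, not the thesis's exact scheme with 30 buckets):

```python
import numpy as np

def make_buckets(sequences, n_buckets=3):
    # Bucket upper limits taken as evenly spaced percentiles of the
    # length distribution, so each bucket covers a similar share of data.
    lengths = [len(s) for s in sequences]
    limits = np.percentile(lengths, np.linspace(100 / n_buckets, 100, n_buckets))
    buckets = [[] for _ in range(n_buckets)]
    for s in sequences:
        idx = next(i for i, lim in enumerate(limits) if len(s) <= lim)
        buckets[idx].append(s)
    return buckets

def pad_bucket(bucket):
    # Pre-pad with zeros only up to the longest sequence *within* the
    # bucket, which is what reduces the total amount of padding.
    max_len = max(len(s) for s in bucket)
    return [[0.0] * (max_len - len(s)) + list(s) for s in bucket]

seqs = [[1.0] * n for n in (2, 3, 5, 8, 13, 21)]
padded = [pad_bucket(b) for b in make_buckets(seqs) if b]
```

Padding each bucket separately (here to lengths 3, 8 and 21) adds far fewer zeros than padding every sequence to the global maximum of 21.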

3.3.2 Deep learning architecture

The model used is composed of several layers, each made up of multiple nodes. Normalization of the data is a vital step, as mentioned earlier; so is normalization between the layers. Since normalization between the layers cannot be computed as a preprocessing step, a so-called batch normalization (BN) layer is added between the LSTM layers, as shown in Fig. 3.2. Batch normalization is a layer within the network that enables more efficient training of deep neural networks [28]. The purpose of this layer is to normalize the output from the previous layer before passing it to the next. The BN layer normalizes the input x_mb and transforms it in the following manner:

y_mb ← γ x̂_mb + β ≡ BN_{γ,β}(x_mb),

where y_mb is the output of the BN layer, x̂_mb is the normalized input and γ, β are two trainable parameters [28]. As the final step of the network, a fully connected (FC) layer was added. This is a one-layer artificial neural network with a sigmoid activation function. The sigmoid activation returns a single value ŷ ∈ (0, 1) for each sequence x_{1...T}. An illustration of the entire network used in this thesis is shown in Fig. 3.2.

Figure 3.2: An overview of the models architecture used for classifying problem gamblers. It is unrolled in time to better understand how information flows. There are L number of layers with n number of LSTM nodes within each layer. Between each LSTM layer there is a batch normalization layer that performs batch normal- ization. The final layer of the network is a fully connected standard artificial neural network layer with a sigmoid activation function which transforms the output to values between 0 and 1.
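The batch-normalization transform can be sketched in a few lines. This is an illustrative forward pass only (the epsilon constant and variable names are assumptions; the real layer also tracks running statistics for inference, which is omitted here):

```python
import numpy as np

def batch_norm(x_mb, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch to zero mean and unit
    # variance, then scale and shift by the trainable gamma and beta:
    # y_mb = gamma * x_hat + beta.
    mean = x_mb.mean(axis=0)
    var = x_mb.var(axis=0)
    x_hat = (x_mb - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x_mb = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
y_mb = batch_norm(x_mb, gamma=np.ones(2), beta=np.zeros(2))
```

With gamma = 1 and beta = 0 the layer simply standardizes its input; during training the network is free to learn other values of gamma and beta.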

As mentioned in section 2.1.4, the number of layers and the number of nodes per layer can drastically alter the performance. These parameters, which we denote by L and n, respectively, must thus be chosen via experimental investigation. Such parameters are often referred to as hyperparameters, since they must also be optimized but are not trainable; hence, multiple networks with different configurations have to be trained.

3.4 Learning process

As mentioned in section 2.1.6, the weights and biases are updated in order to decrease the loss when the same data is fed through the model. By continuously updating the model for too long, it will primarily perform well on the data used for training. A common approach to make the model perform well on unseen data is to perform all training of the weights and biases on a subset of the data, while the data not used for training is used to evaluate the true performance of the model. Using this technique, the final performance measurement is ideally equally good for both the training and testing data. For this project, the data was split into three subsets: the train, validation and test sets. The train set was used to train the weights and biases as well as to choose the model-specific cut-off threshold discussed in section 2.2.2. The validation set was used to evaluate which combinations of layers and nodes per layer resulted in the best performance. The test set was then used for a final unbiased performance measurement. All data within each set was bucketed, padded and divided into mini-batches.

The proportion of positives (see section 2.2) within the dataset was substantially smaller than that of negatives; there were far more customers without self-exclusions than customers who had self-excluded. This can result in a model predicting everyone as a non-excluder. To overcome this, different techniques can be utilized. In the scope of this thesis, five different ideas were tested. They are as follows:

• Over-sampling - Sampling the less common label with replacement until equally many positives and negatives are obtained.

• Under-sampling - Sampling the more common label without replacement until equally many positives and negatives are obtained.

• Combined-sampling - Combining over- and under-sampling, where the less common label is over-sampled and the more common label is under-sampled until equally many positives and negatives are obtained.

• Weighting - By increasing the loss generated for predicting the less common label it is possible to provide a balance within the loss, instead of the number of samples.

• Doing nothing - As a benchmark it is a good idea to see how the model performs without any augmentation of the data.
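The weighting technique can be sketched with a weighted cross-entropy loss. The values below are synthetic and the function name is invented for the example; the idea is only that the loss term for the rare positive class is scaled up so that both classes contribute comparably to the total loss:

```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, pos_weight, eps=1e-12):
    # Binary cross-entropy where the positive-class term is scaled by
    # pos_weight; pos_weight = 1 recovers the ordinary loss.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    losses = -(pos_weight * y_true * np.log(y_pred)
               + (1 - y_true) * np.log(1 - y_pred))
    return losses.mean()

y_true = np.array([1.0, 0.0, 0.0, 0.0])      # one positive in four samples
y_pred = np.array([0.3, 0.1, 0.1, 0.1])
# With roughly 8.3% positives in the dataset, a weight of about
# (1 - 0.083) / 0.083 ~ 11 would balance the two classes in expectation.
loss = weighted_cross_entropy(y_true, y_pred, pos_weight=11.0)
```

Missing the single positive now costs far more than misclassifying a negative, which counteracts the degenerate always-negative solution.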

The mean and standard deviation of each feature were calculated based on the train set; they were then used to normalize all three subsets, resulting in zero mean and unit variance for the train set and approximately the same for the other two sets. The reason why the normalization was based on only the train set is that the other data is considered to be unknown.
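This normalization scheme can be sketched as follows (the arrays are synthetic and the variable names illustrative): the statistics are computed on the train split only and then reused, unchanged, for the held-out splits.

```python
import numpy as np

x_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
x_test = np.array([[2.0, 250.0]])

# Statistics from the training data only; the validation and test sets
# are treated as unseen, so their own means and deviations are never used.
mu = x_train.mean(axis=0)
sigma = x_train.std(axis=0)

x_train_norm = (x_train - mu) / sigma
x_test_norm = (x_test - mu) / sigma   # same mu and sigma as the train set
```

The train split ends up with exactly zero mean and unit variance, while the test split only approximately so, which is the behaviour described above.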


3.4.1 Training

The parameters within the model are updated by iterating over the training data multiple times. Processing the training data once is referred to as an epoch; each epoch processes all mini-batches one after another. The procedure of updating the trainable parameters is illustrated in the algorithm below:

Training of parameters: learns the parameters given the training set x_tr and validation set x_va.
Result: the optimized weights and biases for the model.

1.  Randomly initialize the weights and biases within the network
2.  Calculate the loss L_va based on the validation set
3.  while L_va has not converged do
4.      for x_mb in x_tr do
5.          Predict ŷ_mb based on x_mb
6.          Calculate the cross-entropy loss L(y_mb, ŷ_mb)
7.          Propagate the error to the trainable parameters
8.          Update the parameters via the Adam optimizer
9.      end
10.     Calculate the loss L_va based on the validation set x_va
11. end

The technique of observing the loss on the validation set and stopping the training when this loss has converged is referred to as early stopping. Early stopping is a method commonly used to prevent overfitting1 of a model. The number of epochs allowed without any improvement before the training stops is referred to as the patience; it is used to make sure that a minimum has been found before stopping.

When the model had found an optimal set of parameters, the Fβ score with β = 0.5 was calculated on the training set for all thresholds that would alter the outcome of the predictions. The threshold that corresponded to the highest F0.5 was chosen as the final threshold for this model. The reason why the F0.5 score is the metric of interest is a business decision at the company: the F0.5 score favours precision over recall, and a higher precision gives fewer false positives, making the positive predictions more reliable.
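The threshold selection can be sketched as a sweep over all candidate thresholds. The labels and scores below are synthetic and the function names invented; the principle is to evaluate F0.5 at every threshold that changes the predictions and keep the best one:

```python
import numpy as np

def f_beta(p, r, beta=0.5):
    if p == 0 or r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

def best_threshold(y_true, scores):
    # Only the distinct score values can alter the outcome, so it
    # suffices to try each of them as the cut-off.
    best_t, best_f = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = f_beta(p, r)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
threshold, f05 = best_threshold(y_true, scores)
```

Here the sweep settles on 0.7, sacrificing one positive (recall 0.75) for perfect precision, which is exactly the precision-favouring behaviour of F0.5.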

3.4.2 Validation

As previously mentioned, different configurations of the network can give different results. Due to this, many of the parameters that are not updated during training should be tested in order to find the optimal configuration of the network. In the scope of this thesis, the number of layers L and the number of nodes per layer n were chosen as the parameters to be more thoroughly examined.

Nine models were trained on 30% of the training data, which is a common approach to reduce the time taken by hyperparameter tuning [30]. The models were predefined as all combinations on a grid of L = [3, 5, 7] and n = [100, 150, 200]; this is commonly referred to as a grid search in hyperparameter space. Their optimal thresholds were decided, as previously mentioned, via the training set and evaluated on the validation set. The model with the highest F0.5 score with respect to the validation set was chosen as the final model.

1When a model fits exactly or nearly exactly to a specific dataset and does not generalize to other data [29].
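The grid search itself can be sketched in a few lines. In the thesis every (L, n) pair means training a full network; the stand-in scoring function below simply looks up the validation F0.5 values later reported in table 4.5, so the loop structure is real but the scores are a placeholder:

```python
import itertools

def validation_f05(layers, nodes):
    # Placeholder for "train a network with this configuration and
    # evaluate F_0.5 on the validation set"; values from table 4.5.
    scores = {(3, 100): 0.483, (3, 150): 0.523, (3, 200): 0.517,
              (5, 100): 0.466, (5, 150): 0.541, (5, 200): 0.487,
              (7, 100): 0.445, (7, 150): 0.446, (7, 200): 0.436}
    return scores[(layers, nodes)]

# All combinations on the grid L = [3, 5, 7], n = [100, 150, 200].
grid = itertools.product([3, 5, 7], [100, 150, 200])
best = max(grid, key=lambda cfg: validation_f05(*cfg))
```

With these scores the search selects (5, 150), the configuration chosen in section 4.4.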


3.4.3 Testing

The test data has not been used in any way to improve or choose the best model; the evaluation of the final model on the test set is therefore unbiased. No changes to the model based on the results or the analysis of the test set can be made, since this would create a bias towards the test set.


4 Results

4.1 Setup

For any demanding task, such as training a model or performing computationally heavy analysis, a virtual machine (VM) on the Google Cloud platform was utilized. The configuration of the VM is described in table 4.1. The large amount of main memory was essential in order to utilize a larger mini-batch size, which substantially decreased the training time.

Table 4.1: The configuration of the virtual machine used for demanding analysis and training of different models.

Operating System   Ubuntu 16.04
CPU                16 vCPUs
CPU Type           Intel(R) Xeon(R) 2.00 GHz
Memory             104 GB

4.2 Dataset

The dataset included 916,312 depositing casino customers. Within the set, 75,979 (8.3%) of the customers had self-excluded due to gambling-related problems. The majority of the customers had only a few active days, as seen in Fig. 4.1. In Fig. 4.1 one can see that the fraction of self excluders increases with the number of active days, but the vast majority of self excluders still have very few active days; the median number of active days across all customers is only 4. The dataset contains registered activities within the period 2015-06-09 to 2018-10-30. How the dataset was split into the three subsets (training, validation and testing) is shown in table 4.2.

Table 4.2: The subsets used for training, validation and testing, including some basic numbers regarding each subset.

                      Training   Validation   Testing   Total
Customers             457,824    183,582      274,906   916,312
Distribution          50%        20%          30%       100%
Fraction SE           8.29%      8.31%        8.28%     8.29%
Average active days   21.13      21.26        21.22     21.19


Figure 4.1: The distribution of the entire dataset used for this thesis: the number of customers (log scale) versus the number of active days, split into non-self excluders and self excluders, with the fraction of self excluders overlaid. The dataset is extremely skewed towards customers with few active days; the median number of active days is only 4. The fraction of self excluders is larger among users with more active days than among customers with fewer active days.

4.3 Training

The libraries used for training of the models and their versions are shown in table 4.3.

Training all nine network configurations was time-consuming, taking almost 40 hours even though only 30% of the available training data was used, as illustrated in Fig. 4.2. This section covers the training of all configurations of hyperparameters; recall section 3.3.2.

Table 4.3: Libraries used for training the networks.

TensorFlow    1.11.0   An open source machine learning framework developed by Google.
TensorBoard   1.11.0   A web-based application used to monitor TensorFlow processes.
Keras         2.2.4    A high-level neural network API which can run on top of TensorFlow, Microsoft Cognitive Toolkit or Theano.

Regarding the input to the model, the features listed in table 4.4 were used. The feature exclusion_problem_gamble was used as the target variable for the model.

4.3.1 Hyperparameter optimization

Several combinations of layers and nodes per layer were evaluated in order to choose the optimal configuration. Several other parameters and methods could also affect the performance of the model; some of them were investigated in order to either exclude or include them in a more complex grid search. Conclusions regarding these alternative parameters/methods are given in the list below.


Figure 4.2: The cross-entropy loss on the validation set during training of the 9 predefined networks on the hyperparameter grid (all combinations of 3, 5 or 7 layers and 100, 150 or 200 nodes per layer; training time on the x-axis). The more complex the model (more layers and units per layer), the longer the training time and the noisier the updates.

Table 4.4: Features used as input for the model. A general explanation is given in section 3.2, while a more specific explanation is given at the beginning of the report.

deposit_approved_num     slots_turnover_num      turnover_saturday_num
deposit_approved_sum     slots_turnover_sum      winning_bonus_num
deposit_denied_num       sports_turnover_num     winning_bonus_sum
deposit_denied_sum       sports_turnover_sum     winning_cash_num
last_seen_num            turnover_bonus_num      winning_cash_sum
limit_num                turnover_bonus_sum      withdrawal_approved_num
session_num              turnover_cash_num       withdrawal_approved_sum
session_sum              turnover_cash_sum       withdrawal_canceled_num
turnover_midnight_num                            withdrawal_canceled_sum

Learning rate The Adam optimizer used for training computes learning rates adaptively, which reduces the need for fine-tuning the initial learning rate; it was therefore not included in the grid search and was kept at its default value of 10^-3.

Batch size The batch size can be increased in order to speed up the training time. There is a trade-off between the model’s computational speed and the ability for the model to generalize to new data [31]. After experimenting with different batch sizes, a batch size of 1280 was used in order to fully take advantage of the computational power of the VM and to reduce the training time.

Sampling technique From a small experiment where different sampling techniques were compared, the conclusion was that the weighted technique was the most promising.

The results of the minor experiment can be found in appendix A.

Dropout Dropout can be used to reduce overfitting; it works by stochastically eliminating some connections within the neural network. A 20% chance of removing a connection between the LSTM layers was implemented; however, no obvious improvement was observed compared to not using dropout. Furthermore, using dropout increased the training time substantially, since it takes longer to reach a local minimum. Dropout was hence omitted and only early stopping, discussed in section 3.4.1, was used to counteract overfitting.

Layers The number of layers within the network affected the outcome significantly in minor experiments. In order to keep the training time reasonable, in combination with the number of nodes per layer, the maximum number of layers was set to 7.

Nodes per layer Similarly to the layers, the number of units per layer was also found to have a significant impact on the outcome. The maximum number of units per layer was set to 200.

The cross-entropy loss on the validation set during training is shown in Fig. 4.2. One can see that as the network grows in complexity (more layers or more units per layer), it takes more time to reach the minimum and the updates become noisier. For simpler models (for example, 3 layers and 100 nodes) one can easily see where the optimum is found, which is not the case for more complex networks.

4.4 Validation

The model with the highest F0.5 score with respect to the validation set was found to have 5 layers with 150 units per layer. As seen in table 4.5, this model also had the highest accuracy and precision but only the fifth-highest recall. Since the optimal model was not found on the perimeter of the grid, exploration outside the hyperparameter grid was not considered. Explorations in its neighbourhood were likewise omitted, since the aim of this thesis is to investigate the potential of recurrent neural networks within responsible gambling, not to find the optimal hyperparameters. From this point on, only the best-performing model is considered.

Table 4.5: The performance of the 9 models trained using only 30% of the available training data. The model with 5 layers and 150 units per layer performed best with respect to accuracy, precision and F0.5; this model was thus chosen for further optimization.

Layers   Units per Layer   Accuracy   Precision   Recall   F0.5 Score
3        100               0.922      0.551       0.324    0.483
3        150               0.926      0.594       0.352    0.523
3        200               0.926      0.595       0.340    0.517
5        100               0.918      0.510       0.347    0.466
5        150               0.929      0.635       0.340    0.541
5        200               0.922      0.556       0.326    0.487
7        100               0.914      0.480       0.344    0.445
7        150               0.914      0.481       0.346    0.446
7        200               0.913      0.466       0.346    0.436

4.4.1 Testing

The best model was retrained on the entire training set and evaluated on the test set; the training is shown in Fig. 4.3. The final performance of the model is given in table 4.6, from which one can see that the performance increased substantially compared with the previous model trained on only 30% of the available training data. The F0.5 score increased by more than 6 percentage points and precision by 8 percentage points. Recall only increased by 2.5 percentage points; however, this metric has less influence on the F0.5 score, which we are optimizing for.

It can be seen in Fig. 4.3 that regularization is required in order to reduce overfitting. After approximately four hours of training, the validation loss starts to increase while the training loss still decreases; hence, this is the point where the final parameters of the model are obtained.

Figure 4.3: The training of the final model using the entire training dataset (cross-entropy loss for the training and validation sets over roughly ten hours); the optimal weight parameters were obtained after 12 epochs, which took approximately four hours.

Figure 4.4: The precision-recall curve for the final model based on the test set. A high threshold increases the precision, while a low threshold increases the recall. The threshold used was 0.5474.


Table 4.6: Performance of the final model using 5 layers and 150 units per layer. It was trained on the entire training set and evaluated on both the validation and test sets. The test set gives the final performance of the model.

                 Accuracy   Precision   Recall   F0.5
Validation set   0.936      0.718       0.377    0.608
Test set         0.936      0.710       0.380    0.604

In Fig. 4.4 the precision-recall curve is illustrated; the threshold which maximized F0.5, found earlier to be 0.5474, is shown as the black diamond. Another common performance illustration for a classification model is the receiver operating characteristic (ROC) curve, shown in Fig. 4.5. It illustrates the combinations of recall and one minus the specificity1 for all possible thresholds, which is similar to the PR curve. The PR and ROC curves are based on the test set, and thus the threshold cannot be changed in order to move the black diamond and alter the properties of the model; that would create a bias towards the test set.

Figure 4.5: The receiver operating characteristic curve for the final model based on the test set. The threshold used for the model resulted in a rather low specificity rate. The dashed line represent a classification model that randomly assigns a class to each time series.

A confusion matrix for the classifier can be seen in Fig. 4.6, where each true label has been normalized. From the confusion matrix one can conclude that the largest error comes from false negatives, while false positives make up only around 1% of all non-excluders. This is expected, since the model has been optimized for F0.5, which favours precision.

1One minus the specificity is also known as the false positive rate.


Figure 4.6: A normalized confusion matrix for classifications on the test set given the final model: true negatives 0.99 (248,598), false positives 0.01 (3,553), false negatives 0.62 (14,103), true positives 0.38 (8,652). Since the dataset is imbalanced, a majority of the customers fall within the true negative section.
