

Master Thesis, 30 credits

Department of Mathematics and Mathematical Statistics

Spring Term 2020

CHURN PREDICTION
Predicting User Churn for a Subscription-based Service using Statistical Analysis and Machine Learning Models

Alexandra Hägg and Amanda Flöjs

CHURN PREDICTION
Predicting User Churn for a Subscription-based Service using Statistical Analysis and Machine Learning Models

Submitted in fulfilment of the requirements for the degree Master of Science in Industrial Engineering and Management

Department of Mathematics and Mathematical Statistics
Umeå University
SE-907 87 Umeå, Sweden

Supervisors:
Natalya Pya Arnqvist, Umeå University
Patrik Trelsmo, Schibsted Media Group

Examiner:
Jun Yu, Umeå University


Subscription-based services are becoming increasingly popular. Therefore, any company that engages in a subscription-based business needs to understand user behavior and minimize the number of users canceling their subscription, i.e. minimize churn. According to marketing metrics, the probability of selling to an existing user is markedly higher than the probability of selling to a brand new user. Hence, it is of great importance that focus is directed towards preventing users from leaving the service, in other words preventing user churn. To be able to prevent user churn, the company needs to identify the users in the risk zone of churning. Therefore, this thesis project treats this as a classification problem.

The objective of the thesis project was to develop a statistical model to predict churn for a subscription-based service. Various statistical methods were used to identify patterns in user behavior, using activity and engagement data including variables describing recency, frequency and volume. The best performing statistical model for predicting churn was achieved by the Random Forest algorithm. The selected model is able to separate the two classes of churning users and non-churning users with 73% probability, and has a fairly low misclassification rate of 35%. The results show that it is possible to predict user churn using statistical models. However, there are indications that it is difficult for the model to generalize a specific behavioral pattern for user churn. This is understandable, since human behavior is hard to predict.

The results show that variables describing how frequently the user interacts with the service explain the most about whether a user is likely to churn or not.

Key Words: User Churn, User Retention, Subscription-based Service, Machine Learning Models


Subscription services are becoming increasingly popular in today's society. It is therefore important for a company with a subscription-based business to have a good understanding of its users' behavioral patterns on the service, and to reduce the number of users who cancel their subscription. According to marketing statistics, the probability of selling to an already existing user is considerably higher than selling to a completely new one. For that reason, it is important to put a strong focus on preventing users from leaving the service. To prevent users from leaving, the company must identify which users are in the risk zone of leaving. This thesis project has therefore been treated as a classification problem.

The aim of the project was to develop a statistical model to predict which users are likely to leave the subscription service within the next month. Various statistical methods were tested to identify users' behavioral patterns in activity and engagement data, data that include variables describing recency, frequency and volume. The best performance for predicting whether a user will leave the service was given by the Random Forest algorithm. The selected model can separate the two classes, users who leave the service and users who stay, with 73% probability, and has a relatively low misclassification rate of 35%. The results show that it is possible to predict which users are in the risk zone of leaving the service with the help of statistical models, even though it is difficult for the model to generalize a specific behavioral pattern for the different groups.

This is understandable, however, since it is human behavior that the model is trying to predict. The results indicate that variables describing the frequency of use of the service say more about whether a user is about to leave the service than variables describing the user's activity in volume.


with the Machine Learning team and therefore making this thesis project possible. We also want to thank our fantastic supervisor at Schibsted Media Group, Patrik Trelsmo. Your advice, guidance and support have been of great importance for the success of this project. Also, a big thanks to all the members of the Machine Learning team that made us feel welcome and helped us create value for the Company as a whole.

Finally, we would like to thank our supervisor at Umeå University, Natalya Pya Arnqvist, for your helpful suggestions, thoughts and efforts!

Thank you to you all!

1 Introduction
1.1 Company Description
1.2 Background
1.3 Aim
1.3.1 Research Questions
1.4 Project Scope
1.4.1 Confidentiality
1.4.2 Limitations
1.5 Thesis Disposition
2 Theory
2.1 Logistic Regression
2.1.1 The Logistic Model
2.1.2 Estimating the Regression Coefficients
2.1.3 Making Predictions with the Logistic Model
2.2 Random Forest
2.2.1 Decision Tree
2.2.2 Bagging
2.2.3 Random Forest Algorithm
2.3 Extreme Gradient Boosting
2.3.1 Ensemble Learners and Regularization
2.3.2 Gradient Boosting
2.3.3 Attributes of Extreme Gradient Boosting
2.4 Overfitting
2.5 Imbalanced Learning
2.5.1 Under-sampling
2.5.2 Over-sampling
2.6 K-fold Cross Validation
2.7 Model Evaluation Metrics
3 Data Description
3.1 Activity Data
3.2 Engagement Data
3.3 Subscription Data
4 Method
4.1 Data Pre-processing
4.1.1 Data Setup
4.1.2 Classification Setup
4.2 Learning Algorithms
4.2.1 Parameter Tuning
4.3 Software Tools
5 Results
5.2 Random Forest
5.3 Extreme Gradient Boosting
6 Discussion
6.1 Data Setup
6.2 Limitations
6.3 Model Selection
6.3.1 Final Model
6.3.2 Tuning
6.3.3 Results
6.3.4 Continuous Change
6.4 Feature Importance
7 Conclusion
7.1 Suggestions for Future Research
7.2 Answering Research Questions
8 References


1 Introduction

This section describes the background of the thesis project. Further, the aim and the research questions of the project are defined. Lastly, the limitations of the thesis project are presented.

1.1 Company Description

Schibsted Media Group, henceforward called the Company, was established in 1839 in Oslo by Christian Schibsted, and was at that time a printing house producing magazines and textbooks. Since then the Company has developed into an international media group with several digital brands and more than 5000 employees in more than 20 countries. The main markets of the Company are Norway, Sweden, Finland, Denmark and Poland. Today the Company has digital services that empower customers, leading marketplaces and world-class media houses in Scandinavia. The Company can be divided into three business areas: Nordic Marketplaces, Next, and News Media. The Company's mission is to ”Empower people in their daily lives”.

The business area Nordic Marketplaces connects millions of buyers and sellers every month through the digital marketplaces such as Blocket.se and Finn.no.

Next includes such growth companies as Lendo, Let’s Deal and Compricer that create value for customers and users. The third business area, News Media, keeps people updated and informed on what is happening in the society through its media houses such as Svenska Dagbladet, Aftonbladet, VG and Aftenposten (Schibsted, 2019).

1.2 Background

The Company has millions of users of its services every day. The media houses of the Company publish both printed and digital newspapers that people can subscribe to. Digital users are defined as those who read news on digital devices, while print users are defined as those who read a printed newspaper.

When a user chooses to unsubscribe from a newspaper it is called churn. It is of importance that the Company retains their subscribed users and keeps the churn rate low since losing users means losing profit.

For any company that engages in a subscription-based business there is a need to understand user behavior. According to marketing metrics, the probability of selling to an existing user is markedly higher than the probability of selling to a brand-new user. Therefore it is necessary for companies to retain subscribed users by keeping the churn rate low. By definition, the churn rate is ”The annually percentage rate at which users stop subscribing to a service” (Marketing Technology Transformation, 2016).


Being one of the leading distributors of news in Scandinavia, the Company has many subscribed users of its news-distributors. To continuously improve its business and optimize user retention, the Company needs to calculate and utilize churn predictions. To be able to prevent user churn, the Company needs to identify the users in the risk zone of churning. Therefore, the thesis project treated this as a classification problem. The thesis project will only focus on one of these news-distributors and only on its digital users. The news-distributor will throughout the thesis be called ”The Swedish News-site” due to confidentiality. At the start of the thesis project the Swedish News-site had not been using any model for predicting churn; therefore there is a high interest in the development of such a model.

1.3 Aim

The thesis project aims to develop statistical models to predict future user churn among the digital subscribers of a Swedish news-site. Further, the project aims to use these models to find potential drivers of churn. The future goal is to implement the final model as a tool for the Swedish News-site, predicting on a monthly basis which users are likely to churn.

1.3.1 Research Questions

Based on the aim of the thesis project, the research questions are defined as follows:

• Is it possible to develop a statistical model to predict churn for a subscription-based service?

• To what extent is it possible to find potential drivers of churn using statistical models?

1.4 Project Scope

1.4.1 Confidentiality

Due to confidentiality, the thesis workers have signed an agreement not to reveal any confidential or secret information regarding the Company's business or user data. All information and results in the thesis have been approved by the Company before being published.

1.4.2 Limitations

Due to the restricted time frame of the project, some limitations have been set. In future work these limitations could be relaxed to attain additional insights about user retention.


• The thesis project limits the use of personal data. Personal data such as age, gender and demographics will be excluded.

• The data used for the thesis project will only include users that pay for an ongoing subscription, starting from K2, i.e. freemium users will not be included. Freemium users are the users between K1 and K2 and have a free subscription (Figure 1).

• The thesis project will not deal with observations before a user’s start date of the subscription or after the user’s potential cancel date of the subscription.

• The time frame for the considered data set used is 2019-04-01 to 2019-12-31.

Figure 1: Visualization of each step of the customer journey, starting at the first visit to the news-site, ending as a subscriber quitting the service.

1.5 Thesis Disposition

The thesis consists of seven sections. First, a short background of the host company is presented together with the project description in Section 1 above. Section 2 covers the theory used in the report. Thereafter the data provided are described in Section 3, including descriptions of the variables and their structure. The approach used for analysing the available data is described in Section 4. In Section 5 the results are presented in relation to a selected baseline model, and important performance and evaluation metrics are shown. In Section 6 the analysis and discussion are presented. Lastly, suggestions for future research are presented and summarized in Section 7.


2 Theory

The purpose of this section is to cover the theory used in the thesis project.

2.1 Logistic Regression

Logistic Regression is a regression analysis that is appropriate when the target variable is categorical, often binary. Like other regression analyses, Logistic Regression is a predictive analysis and is used to describe data by explaining the relationship between the target variable and one or more nominal, ordinal, interval or ratio-level independent variables.

Instead of modelling a target variable Y directly, as in Linear Regression, the Logistic Regression model predicts the probability that Y belongs to a particular category. In the binary case, an observation is classified into one class if the predicted probability is larger than a set threshold, and into the other class if the probability is lower than the threshold (Figure 2) (James et al., 2013, p.131-132).

2.1.1 The Logistic Model

When modelling the relationship between the probability p(X1, X2, ..., Xp) = P(Y = y | X1, X2, ..., Xp) and the predictors X = (X1, X2, ..., Xp), Logistic Regression uses the logistic function to obtain outputs between 0 and 1 for all X. The logistic function is given by

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}} \qquad (1)$$

where β0, β1, ..., βp are the regression coefficients of the model. The regression coefficients are estimated by fitting the model with the maximum-likelihood method (Subsection 2.1.2). From Function 1 it can be seen that the logistic function always produces an S-shaped curve, which means that no matter what value X takes, a reasonable prediction is obtained, since p(X) is restricted to p(X) ∈ [0, 1]. In Figure 2 the logistic function is displayed (James et al., 2013, p.131-133).


Figure 2: A figure visualizing the S-shaped curve that the Logistic Regression is fitted to. The threshold is a set value; in this figure it is 0.5 (JavaTpoint, n.d.).

The logistic function can be rewritten as

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p} \qquad (2)$$

where the fraction on the left-hand side is called the odds. The odds can assume any value between 0 and ∞, where a value close to 0 indicates a low probability and a value approaching ∞ a very high probability for the target class. Function 2 can be transformed by a logarithmic transformation,

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p \qquad (3)$$

where the left-hand side now represents the log-odds. Note that the Logistic Regression has log-odds that are linear in X (James et al., 2013, p.131-133).

2.1.2 Estimating the Regression Coefficients

The coefficients of the logistic function (Function 1) are unknown and need to be estimated using the available training data. For Logistic Regression, the maximum-likelihood method is preferred for estimating the regression coefficients.

When fitting a Logistic Regression with maximum likelihood, the aim is to find estimates β̂ such that the predicted probability in Function 1 corresponds as closely as possible to 1 for all observations whose true outcome is positive, and to 0 for all observations whose true outcome is negative. This is formalized by the likelihood function,

$$\ell(\beta_0, \beta_1, \dots, \beta_p) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \left(1 - p(x_{i'})\right) \qquad (4)$$

The estimates β̂0, β̂1, ..., β̂p are chosen by maximizing the likelihood function (Function 4) (James et al., 2013, p.133-134).

2.1.3 Making Predictions with the Logistic Model

After the regression coefficients of the Logistic Regression model have been estimated by maximizing the likelihood function (Function 4), the predicted probability can be computed by plugging the estimates into Function 1, as follows,

$$\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_p X_p}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_p X_p}} \qquad (5)$$

For simple classification the dummy variable approach is useful. Assume that the target variable Y is coded as follows,

$$Y = \begin{cases} 0 & \text{if class 1} \\ 1 & \text{if class 2} \end{cases} \qquad (6)$$

The model then predicts class 2 if p̂(Y = 1 | X) > threshold and class 1 if p̂(Y = 1 | X) ≤ threshold, where the threshold is a set value for the model (James et al., 2013, p.134-135).
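As an illustration, the prediction step (Functions 5 and 6) can be sketched in a few lines of Python; the coefficient values below are made up and purely hypothetical, not the thesis model:

```python
import math

def predict_proba(x, beta0, beta):
    """Logistic function (Function 5): estimated probability that Y = 1."""
    z = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, beta0, beta, threshold=0.5):
    """Dummy-variable rule (Function 6): class 2 (coded 1) if p > threshold."""
    return 1 if predict_proba(x, beta0, beta) > threshold else 0

# Made-up "fitted" coefficients for two predictors
beta0_hat, beta_hat = -1.5, [0.8, 0.3]
p = predict_proba([2.0, 1.0], beta0_hat, beta_hat)   # sigmoid of 0.4
label = predict_class([2.0, 1.0], beta0_hat, beta_hat)
```

Note that the threshold is a tunable value; lowering it trades a higher false positive rate for catching more of the positive (churning) class.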

2.2 Random Forest

Random Forest is a supervised learning algorithm built on an ensemble of decision trees, trained with the bagging method. Since the theories behind these two algorithms are fundamental for the random forest algorithm these will be presented first (Hastie et al., 2009, p.587).

2.2.1 Decision Tree

Decision Trees belong to the family of supervised learning algorithms and can be used both for regression and classification problems. Regression trees and classification trees are very similar, except for the type of response: classification trees predict a qualitative response and regression trees a quantitative one (James et al., 2013, p.311). This thesis project primarily uses classification trees, so the following theory will mainly focus on the basics of a classification tree. In the figure below, a simple Decision Tree is displayed.


Figure 3: An example of a Decision Tree, where X1, X2, X3 represent different features, T stands for True and F for False. The single node at the top is the root node and the two nodes in the middle are decision nodes; at the bottom of the tree there are eight leaves (Hackerearth, n.d.).

Classification trees predict that each observation belongs to the most commonly occurring class among the training observations in the region to which it belongs. The classification tree classifies observations by sorting them down the tree, starting from the root node at the top down to some leaf node, where the leaf node provides the classification of the observation (Figure 3). For growing a classification tree, recursive binary splitting is used, which is a top-down, greedy approach. This means that the splitting begins at the top of the tree, where all observations belong to a single region, and then successively splits the predictor space. The splitting is greedy since the best split is made at each step of the tree, instead of choosing a split that might lead to a better tree in some future step. As a criterion for the binary splits the classification error rate can be used, defined as the fraction of the training observations in the mth region that do not belong to the most common class,

$$E = 1 - \max_k(\hat{p}_{mk}) \qquad (7)$$

where p̂mk represents the proportion of training observations in the mth region that belong to the kth class (James et al., 2013, p.311-312).

However, the classification error rate is not sufficiently sensitive for tree-growing; therefore the Gini index is introduced,

$$G = \sum_{k=1}^{K} \hat{p}_{mk}\left(1 - \hat{p}_{mk}\right) \qquad (8)$$

The Gini index measures the total variance across the K classes. When all of the p̂mk's are close to 0 or 1 the Gini index takes a small value. Therefore the Gini index is often referred to as a measure of node purity, and a small value of the index indicates that the node contains predominantly observations from a single class. The Gini index is often used to evaluate the quality of a specific split, since it is more sensitive to node purity than the classification error rate. Both the classification error rate and the Gini index can be used for tree pruning, although the classification error rate is preferred if prediction accuracy of the final pruned tree is the goal (James et al., 2013, p.312).
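Functions 7 and 8 are simple to compute from the class proportions of a node; a small illustrative sketch (the proportions below are made up):

```python
def classification_error(proportions):
    """Function 7: E = 1 - max_k(p_mk)."""
    return 1.0 - max(proportions)

def gini_index(proportions):
    """Function 8: G = sum_k p_mk * (1 - p_mk). Small values mean a pure node."""
    return sum(p * (1.0 - p) for p in proportions)

pure = [1.0, 0.0]    # all observations in one class: G = 0
mixed = [0.5, 0.5]   # evenly split node: G = 0.5, the two-class maximum
```

This makes the sensitivity difference concrete: moving a node from proportions [0.7, 0.3] to [0.9, 0.1] changes the Gini index markedly while the error rate changes only linearly.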

Decision Trees suffer from high variance, since they are sensitive to the specific data they are trained on. Therefore, if the training data are changed, the final Decision Tree can be quite different, which can lead to different predictions (Brownlee, J. 2016). To reduce this high variance, the ensemble method bootstrap aggregation (or bagging for short) is introduced in the section below.

2.2.2 Bagging

Bagging is a powerful ensemble method that can be used to reduce the variance of high-variance statistical learning methods such as Decision Trees.

Given a set of n independent observations Z1, ..., Zn, each with variance σ², the variance of the mean Z̄ of the observations is given by σ²/n, i.e. averaging a set of observations reduces the variance. One natural approach to reduce the variance and increase the prediction accuracy of a statistical learning method is therefore to take many training sets from the population, build a prediction model for each training set, and average the resulting predictions. Let a Decision Tree built on the predictors in x be denoted f̂(x). Then f̂1(x), f̂2(x), ..., f̂B(x) can be calculated for B separate training sets, and by averaging them a single low-variance statistical learning model can be obtained (James et al., 2013, p.316-317),

$$\hat{f}_{avg}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x) \qquad (9)$$

Since having multiple training sets is unusual, the bootstrap can be used, taking repeated samples from the same single training data set. By generating B different bootstrapped training sets, one can train a model on each bootstrapped sample set to get f̂*b(x), and finally average all the predictions (James et al., 2013, p.317). Bagging can then be defined as

$$\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x) \qquad (10)$$

Bagging is applied to Decision Trees by constructing B Decision Trees on the B bootstrapped training sets and averaging the prediction results, so that the variance is reduced. For classification trees, given a test observation, the class predicted by each of the B trees is recorded and a majority vote decides which class the test observation is assigned to, i.e. the overall prediction is the most commonly occurring class among the B predictions (James et al., 2013, p.317).
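The bootstrap-and-vote procedure above can be sketched as follows; the training set and the per-tree class predictions are toy placeholders, not the thesis data:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw n observations with replacement from the training set (one bootstrap set)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Bagged classification: the most commonly occurring class among the B predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
train = list(range(10))                                      # toy training set
samples = [bootstrap_sample(train, rng) for _ in range(25)]  # B = 25 bootstrapped sets
# In a real bagging ensemble a tree would now be fit to each sample; the
# voting step is illustrated with dummy per-tree class predictions:
overall = majority_vote([1, 0, 1, 1, 0])                     # class 1 wins 3 votes to 2
```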

2.2.3 Random Forest Algorithm

A Random Forest is built on the combination of Decision Trees and bagging, but compared to bagged trees the Random Forest algorithm is improved so that it decorrelates the trees. When building the Decision Trees of the Random Forest, for each split in each tree a random sample of m predictors is chosen as split candidates from the full set of p predictors. Only one of those m predictors can be used for the split, and at each new split a new random sample of m predictors is drawn. In general m = √p, which means that at each split in the tree the algorithm is not allowed to consider a majority of the possible predictors. The reason for this is to avoid the highly correlated trees that plain bagging can produce (James et al., 2013, p.319).
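The per-split predictor sampling can be sketched as below; the predictor count p = 16 is an arbitrary example:

```python
import math
import random

def split_candidates(p, rng):
    """Draw m = sqrt(p) of the p predictors as candidates for one split."""
    m = max(1, int(math.sqrt(p)))
    return rng.sample(range(p), m)   # sample without replacement

rng = random.Random(42)
candidates = split_candidates(16, rng)   # 4 of the 16 predictors are considered
```

Because a fresh sample is drawn at every split, even a very strong predictor is unavailable at roughly (p − m)/p of the splits, which is what decorrelates the trees.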

2.3 Extreme Gradient Boosting

Extreme Gradient Boosting is a scalable end-to-end ensemble learner that implements the Gradient Boosting method together with regularization. Since the mentioned theories are fundamental for the Extreme Gradient Boosting algorithm these will be presented first (Chen and Guestrin, 2016).

2.3.1 Ensemble Learners and Regularization

Let D = {(xi, yi)} be a given data set with n examples and m features. A tree ensemble learner then uses K additive functions to predict the output ŷi,

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F} \qquad (11)$$

where F is the space of regression trees and φ represents the ensemble learner. Each fk corresponds to an independent tree structure with T leaves and leaf weights w. In comparison with the Decision Trees defined in Subsection 2.2.1, the regression trees include a continuous score in each weight w, and the final prediction of each tree is obtained by summing the scores over its T leaves. The set of functions in F is trained by minimizing the following regularized loss function for the response yi,

$$\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k) \qquad (12)$$

where Ω(fk) = γT + ½λ‖w‖². Here l is a convex loss function that measures the difference between the prediction ŷi and the target yi. The second term, Ω, penalizes the complexity of the model with the help of the regularization parameters γ and λ to reduce the risk of overfitting, see Subsection 2.4 (Chen and Guestrin, 2016).

2.3.2 Gradient Boosting

Function 12 describes the loss function for tree ensemble learners, and it cannot be optimized using traditional optimization methods in Euclidean space (Chen and Guestrin, 2016). This can be solved by training the model in an additive manner: at iteration t the prediction ŷi(t) for the i-th observation is formed by adding the function ft with the optimal loss for that iteration,

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \qquad (13)$$

In other words, the ft that most improves the model according to Function 12 is greedily added to the loss function. Further, a second-order approximation is used to quickly optimize the loss function, which then takes the form

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ l\!\left(y_i, \hat{y}^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \qquad (14)$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ are the first and second order gradient statistics of the loss function. Function 14 can be simplified at step t by removing the constant terms, as follows,

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \qquad (15)$$

Define Ij = {i | q(xi) = j} as the instance set of leaf j, where q is the independent tree structure. Then Function 15 can be rewritten by expanding Ω as follows,

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 = \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) w_j + \tfrac{1}{2} \left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T \qquad (16)$$

The optimal weight wj of leaf j for a fixed tree structure q(x) can be computed as

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} \qquad (17)$$

and the corresponding optimal value as

$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T \qquad (18)$$

To measure the quality of a tree structure q, Function 18 can be used as a scoring function. The score is comparable to the impurity score used for evaluating Decision Trees. However, it is usually impossible to enumerate all possible tree structures q. Therefore, Extreme Gradient Boosting uses a greedy algorithm that starts from a single leaf and iteratively adds branches to the tree. Let IL and IR be the instance sets of the left and right nodes after a split. Then, if I = IL ∪ IR, the loss reduction after the split is given by

$$\mathcal{L}_{split} = \frac{1}{2}\left[ \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \qquad (19)$$

Function 19 is used for evaluating the split candidates: at each node the algorithm chooses the split that maximizes this loss reduction (Chen and Guestrin, 2016).
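Functions 17 and 19 depend only on the summed gradient statistics over each instance set; a sketch with made-up g, h, λ and γ values (purely illustrative, not XGBoost's internal code):

```python
def leaf_weight(g, h, lam):
    """Function 17: optimal leaf weight w_j* = -sum(g_i) / (sum(h_i) + lambda)."""
    return -sum(g) / (sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Function 19: loss reduction obtained by splitting I into I_L and I_R."""
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Toy first- and second-order gradient statistics for four observations
g_left, h_left = [1.0, 1.0], [1.0, 1.0]
g_right, h_right = [-1.0, -1.0], [1.0, 1.0]
w_star = leaf_weight(g_left, h_left, lam=1.0)                       # -2/3
gain = split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0)
```

Note how γ acts as a minimum gain threshold: a split whose bracketed term is below 2γ yields a negative gain and is not worth making.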

2.3.3 Attributes of Extreme Gradient Boosting

Extreme Gradient Boosting is, as mentioned, built on the combination of Gradient Boosting and regularization. Further, Extreme Gradient Boosting uses techniques like feature subsampling, similar to how the Random Forest algorithm uses bagging. By subsampling features the algorithm prevents overfitting and also becomes less computationally expensive. Furthermore, the Extreme Gradient Boosting algorithm uses shrinkage to prevent overfitting, scaling the newly added weights by a factor η (the learning rate) at each step of the tree boosting. By using shrinkage the algorithm reduces the influence of each individual tree, which leaves room for future trees to improve the model (Chen and Guestrin, 2016).

2.4 Overfitting

Overfitting is a modelling error that can occur in statistical learning procedures; it happens when a model during the training phase works too hard to find patterns in the training dataset. This creates a risk that the model picks up patterns that are caused by random chance rather than true structure in the data. The model can then become too complex and lose its flexibility to fit new data. Overfitting of the training data results in a small prediction error on the training set, but a larger error on the test set (James et al., 2013, p.32).


2.5 Imbalanced learning

Most real-world classification problems face issues with class imbalance, which arises when the dataset contains an unequal distribution between its classes. Class imbalance often leads to a critically imbalanced degree of accuracy, with the majority class having an accuracy close to 100 percent and the minority class having a very low accuracy. To address these problems, over- and under-sampling methods are introduced. In over-/under-sampling the data are re-sampled by changing the number of observations in the desired class, which can be done by replicating, generating or removing observations. All theory related to imbalanced learning used in this project is based on Learning from Imbalanced Data (He and Garcia, 2009).

Consider a training dataset S with m samples, S = {(xi, yi)}, i = 1, ..., m, where xi ∈ X is an instance in the n-dimensional feature space X = {f1, f2, ..., fn}, and yi ∈ Y = {1, ..., C} is the class label associated with xi. In particular, C = 2 represents the two-class classification problem. Further, subsets of S are defined as Smin ⊂ S and Smaj ⊂ S, where Smin is the set of minority class samples and Smaj the set of majority class samples in S, so that Smin ∩ Smaj = ∅ and Smin ∪ Smaj = S. Sets generated from sampling procedures are labeled E, and the re-sampled data can be defined as

$$S_{Sampled} = S_{min} + S_{maj} + E \qquad (20)$$

2.5.1 Under-sampling

Under-sampling methods remove observations from the majority class to balance the classes of the dataset. Under-sampling can be done with or without replacement. Removing observations from the majority class can cause classification problems, since the classifier risks missing important patterns of the majority class. Under-sampling can be done in several ways, with KNN under-sampling and Near Miss 1-3 being the most frequently used algorithms; these base their sampling on distance measurements. This thesis project will only apply random under-sampling (He and Garcia, 2009).

The random under-sampling algorithm removes data from the original dataset. More specifically, the algorithm randomly selects a set of the majority class observations in Smaj and removes these observations from S. In Function 20 the sampling term E is then negative.

2.5.2 Over-sampling

Over-sampling methods generate new observations to balance the classes of a data set, over-sampling can be done with or without replacement. The problem with generating new observations for the minority class is that the risk for overfitting increases (Subsection 2.4). There are various methods for

(20)

performing over-sampling such as SMOTE, Adasyn and random over-sampling.

This thesis project will only use random over-sampling algorithm (He and Garcia, 2009).

The random over-sampling algorithm adds a set E sampled from the minority class. From the original dataset a set E is randomly selected from the minority observations in S_min, replicated, and added to the dataset S. The majority class S_maj keeps its original size while the size of S increases, as shown in Equation 20.
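As an illustration, random under-sampling can be sketched in a few lines of NumPy. The helper below is a hypothetical implementation, not the one used in the thesis: it balances the classes by dropping randomly chosen majority observations, i.e. a negative sampling term E.

```python
import numpy as np

def random_under_sample(X, y, seed=0):
    """Drop randomly chosen majority-class rows until both classes
    have equal size (the sampling term E removes majority samples)."""
    rng = np.random.default_rng(seed)
    idx_min = np.flatnonzero(y == 1)          # minority (churn) indices
    idx_maj = np.flatnonzero(y == 0)          # majority (active) indices
    keep_maj = rng.choice(idx_maj, size=idx_min.size, replace=False)
    keep = np.sort(np.concatenate([idx_min, keep_maj]))
    return X[keep], y[keep]
```

Random over-sampling would instead draw indices from the minority class with replacement and append the replicated rows to the dataset.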

2.6 K-fold Cross Validation

Cross validation is one of the most widely used methods to assess the generalization ability of a predictive model and to prevent overfitting (Subsection 2.4). K-fold cross validation uses the entire training dataset but splits it into K smaller parts of roughly equal size. Each of the K folds is used in turn to validate the model fitted on the remaining K-1 folds, and for each fold the prediction error of the fitted model is calculated. The splitting into folds can be done in different ways: randomly or deterministically, stratified or not stratified. When random folds are used the observations are selected randomly from the entire set, and when the stratified method is used the class proportions are kept intact in each fold (Hastie et al., 2009, p. 241-242).

This project will use stratified K-fold cross validation to ensure that the model is not trained on observations that should only be used for validation. Figure 4 shows a visualization of a five-fold cross validation.
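A minimal sketch with Scikit-Learn's StratifiedKFold shows how each fold preserves the class proportions; the toy data below are illustrative only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 80 non-churners (0) and 20 churners (1).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold keeps the original 80/20 class ratio:
    # 16 non-churners and 4 churners out of 20 observations.
    print(len(val_idx), int(y[val_idx].sum()))  # 20 4 on every fold
```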

Figure 4: An example of five fold cross validation (Sklearn, 2007-2009).


2.7 Model Evaluation Metrics

In this project several model evaluation metrics are used such as AUC score, False Positive and False Negative rates, Recall, Precision, F-score, and Lift Score. This section gives a brief description of these metrics.

Confusion Matrix

The confusion matrix summarizes prediction results for a classification model.

A confusion matrix for a binary classifier, used in this thesis project, is displayed in Table 1.

Table 1: A confusion matrix for a binary classifier.

                Predicted Class
                0                 1                 Total
True class  0   True Neg. (TN)    False Pos. (FP)   N
            1   False Neg. (FN)   True Pos. (TP)    P
Total           N*                P*                N_t

For this thesis project 0 represents a non-churning user, and 1 represents a churning user (Table 1). True Positives (TP) are observations predicted to churn that actually churn. True Negatives (TN) are observations predicted not to churn that actually do not churn. False Positives (FP) are observations predicted to churn that do not churn. False Negatives (FN) are observations predicted not to churn that actually churn (James et al., 2013, p. 145-149).
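These counts can be obtained directly from Scikit-Learn's confusion_matrix; the labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# 0 = non-churning user, 1 = churning user (illustrative labels only).
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

# Scikit-Learn orders the binary matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 4 1 1 2
```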

From the confusion matrix multiple standard performance metrics can be computed, which are briefly explained below.

1. False Positive Rate (FPR) is computed as

   False Positive Rate = FP / (FP + TN) = FP / N (21)

where a low FPR means that the model misclassifies a low percentage of the negative examples.

2. False Negative Rate (FNR) is computed as

   False Negative Rate = FN / (FN + TP) = FN / P (22)

where a low FNR means that the model misclassifies a low percentage of the positive examples.


3. Recall:

   Recall = TP / (TP + FN) = TP / P (23)

High values of the recall in this thesis project indicate that the model correctly recognizes churning users, i.e. the model predicts a small number of False Negatives.

4. Precision:

   Precision = TP / (TP + FP) = TP / P* (24)

High values of the precision in this thesis project indicate that a user predicted to churn is indeed churning, i.e. the model predicts a small number of False Positives.

5. F-score: The traditional F-Score is the F1 score, which is a harmonic mean of the precision and the recall measurements. The F1 score takes into account both False Positives and False Negatives. The F1 score is computed as,

F1 score = 2 * Recall * Precision / (Recall + Precision) (25)

The F-score can also be weighted with β. The general formula for the weighted F-score, where β is positive and chosen such that recall is considered β times as important as precision, is defined as

   F_β = (1 + β²) * Precision * Recall / ((β² * Precision) + Recall) (26)

6. Lift Score: From the confusion matrix in Table 1 a Lift Score can be computed as

   Lift Score = (TP / (TP + FP)) / ((TP + FN) / (TP + TN + FP + FN)) = (TP / P*) / (P / N_t) (27)

The Lift Score can be used to describe the gain of a model, by comparing the model's predictions with randomly generated predictions.

The Lift Score can take values in the range [0, ∞) and a random model has a Lift Score of 1. For example, if the model has a Lift Score of 2 while a random model predicts 10% of the classes correctly, then the model will predict 20% correctly (James et al., 2013, p. 145-149).
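The six metrics above can all be computed from the four confusion-matrix counts. The helper below is an illustrative sketch (the function and key names are our own), with β = 2 as used later in the thesis.

```python
def churn_metrics(tp, fp, fn, tn, beta=2):
    """Compute FPR, FNR, Recall, Precision, F1, F-beta and Lift Score
    from binary confusion-matrix counts."""
    p, n = tp + fn, tn + fp                    # actual positives / negatives
    total = p + n
    recall = tp / p
    precision = tp / (tp + fp)
    return {
        "fpr": fp / n,
        "fnr": fn / p,
        "recall": recall,
        "precision": precision,
        "f1": 2 * recall * precision / (recall + precision),
        "fbeta": (1 + beta**2) * precision * recall
                 / (beta**2 * precision + recall),
        # Lift: precision relative to the baseline churn rate P / N_t.
        "lift": precision / (p / total),
    }
```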


ROC-AUC

One useful graphical tool for displaying both types of errors simultaneously, for all possible thresholds, is the ROC-curve (Receiver Operating Characteristic curve).

Summarized over all possible thresholds, the overall performance of a classifier can be measured by the area under the ROC-curve (AUC). The AUC measures the two-dimensional area underneath the entire ROC-curve from (0,0) to (1,1). The ROC-curve has the True Positive Rate on its y-axis and the False Positive Rate on its x-axis. Figure 5 displays an example of a ROC-curve.

Figure 5: An example of a ROC curve (James et al., 2013, p.148).

The ideal ROC-curve should hug the top left corner, which results in a larger AUC; hence, the larger the AUC the better the classifier. The dotted line in Figure 5 represents the ROC-curve of a classifier that performs no better than chance, i.e. a classifier with AUC = 0.5 (James et al., 2013, p. 145-149). The AUC provides an aggregated measure of performance across all possible classification thresholds, and can be interpreted as the probability that the model ranks a random positive observation higher than a random negative observation (Developers, 2020).
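The ranking interpretation of the AUC can be verified numerically. In the illustrative example below, 8 of the 9 possible positive/negative pairs are ranked correctly, so the AUC is 8/9.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.2, 0.35, 0.8, 0.7]  # predicted churn probabilities

auc = roc_auc_score(y_true, y_score)       # 8 of 9 pos/neg pairs ranked right
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(round(auc, 4))  # 0.8889
```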


3 Data Description

This section presents the data used in the thesis project and describes how the data was collected and processed before the classification analysis.

The model setup and performance are then presented, followed by the procedure for identifying important variables that could drive churn.

The Company provided the thesis workers with user activity data over the period from the beginning of April 2019 to the end of December 2019. The data is located in a database with tables connected to the user activity data through a user ID. For example, these tables include data for each interaction a user had during a specific day, the type of device, and which articles were viewed. An overview of the database structure is shown as a star scheme in Figure 6.

Note! All of the data displayed in this report is fictional and manually created by the authors, due to confidentiality.

Figure 6: A star scheme of the dataset.


3.1 Activity Data

The activity data covers all the data related to a user’s interactions and activity. Each row represents one day of activity for one specific user, and the data is logged every time the user interacts with the Swedish News-site.

The data contains, for example, an anonymous user ID, the date of activity, the device type and the number of article views on a specific day. An example of this dataset is visualized in Table 2 and represented as matrix A.

Table 2: An example of the activity data, matrix A.

ID    DATE        DEVICE TYPE  · · ·  ARTICLE VIEWS
4476  2019-05-01  mobile       · · ·  1
4476  2019-05-02  mobile       · · ·  2
4476  2019-05-05  mobile       · · ·  4
5761  2019-06-23  desktop      · · ·  0
5761  2019-06-27  desktop      · · ·  6
...   ...         ...          · · ·  ...
n     2019-08-10  tablet       · · ·  3

Matrix A contains all the activity data per date, from now on called a treatment, meaning that each row in the data represents the activity for a specific user on a specific date.

3.2 Engagement Data

Similarly to the activity data, the engagement data also covers data related to a user's interactions, but in aggregated form. The engagement data is connected to the activity data through the ID key and the date.

Each row contains the engagement variables logged since the user started the subscription; there is therefore one row per treatment for the entire subscription lifetime of each user. An example of this dataset is visualized in Table 3 and is represented as matrix B.

Table 3: An example of the engagement data, matrix B.

ID DATE R7 USAGE R30 USAGE · · · ENGAGEMENT CLASS

4476 2019-04-01 7 25 · · · high

4476 2019-04-02 7 25 · · · high

4476 2019-04-03 6 24 · · · high

5761 2019-06-26 1 4 · · · mid

5761 2019-06-27 0 4 · · · low

... ... ... ... . . . ...

n 2019-08-10 7 6 · · · mid

R7 Usage and R30 Usage shown in Table 3 are aggregated variables which register the user's interactions with the Swedish News-site over the last seven and thirty days, respectively. Engagement Class is an example of how the Company labels each user's engagement into three segments: low, middle and high. These labels are calculated by the Company using the aggregated variables.

3.3 Subscription Data

The subscription dataset contains all the information about each user's subscription. For example, it includes when the subscription started, whether the user has cancelled the subscription and how many days the user has been a subscriber to the Swedish News-site. Each row holds all the information about one specific user. An example of this dataset is visualized in Table 4 and is called matrix C.

Table 4: An example of the subscription data, matrix C.

ID START DATE BILLING ENGINE LIFETIME DAYS · · · CANCEL DATE

4476 2019-10-18 Payex 96 · · · 2020-01-20

5761 2018-01-14 Klarna 1185 · · · NULL

... ... ... ... . .. ...

n 2019-08-10 Payex 130 · · · 2019-12-15

Billing Engine in Table 4 is the payment method used by the subscriber. There are only two payment options. Users with Payex as billing engine are charged by invoice, while users with Klarna are charged through their account with the service. Lifetime Days is the number of days the subscription was/has been active, whereas Cancel Date is the date the subscription was cancelled by the user. If Cancel Date is NULL, there is no cancel date and the user still has an active subscription.


4 Method

To complete the project within the given timeline, the work was planned and divided into eight phases (Figure 7). Initially the purpose and the scope of the project were defined, as described in the Introduction, Section 1. The available data was collected (Section 3) and then pre-processed. Furthermore, several features were extracted from the raw data in order to create a structured dataset for modelling. This was a rather laborious and time-consuming process that covered almost 80% of the total project time. Moreover, several learning algorithms were applied and evaluated through an iterative process. To investigate whether the working model could be improved, deep learning models were also tested. The results were analyzed and evaluated so that the best model could be selected.

Figure 7: An executive summary of the process of the project together with the phases of the project.

4.1 Data pre-processing

This section describes the data pre-processing performed before applying the classification algorithms. The steps include handling non-available data, feature engineering, removing observations and handling imbalanced data.

It is also described how the data was split into training and testing subsets.

Data is available from 2019-04-01 to 2019-12-31. The data available from the Company included about 75 000 users after removing users with double subscriptions, family subscriptions and print-only subscriptions. Due to high computational cost, a smaller random sample of 45 000 users was extracted from the dataset. The random sample contains both active and unsubscribed users. For the users that are still active, Cancel Date contains NULL values. These NULL values are filled with a date far from the current date, in this case 2200-01-01. This is done since there are a few observations before the start date and after the cancel date that need to be removed from the data. Moreover, the other variables describing user behaviour, such as Article Views, are filled with 0's for the days without any activity.

Further, censoring of the churn event needed to be handled. A user in the training data could churn just after the date the experiment ends (2019-12-31), which causes a problem since the model aims to predict churn 30 days ahead. Figure 8 visualizes this problem. If an observation is created for a user that might churn just after the experiment ends, it does not reflect reality, since the model cannot know the truth about the users that are still active at the end of the experiment. Therefore, the last 30 days of data, counting back from 2019-12-31, are removed for the users that are still active at the end of the experiment.

Figure 8: Visualization of an example of censored churn that could occur in the data used in the project.

Moreover, since there is a large number of active users and the event of churn happens infrequently, the response vector Y contains many zeros and few ones. The response is heavily weighted towards one class, more specifically the zeros, and the training data therefore needs to be re-sampled to create a more stable model. If the data is not re-sampled when dealing with an imbalanced dataset, there is a risk of optimizing for the trivial solution of always predicting a single class and only performing well on the majority class. The models would then likely predict only zeros, i.e. no churns at all, and the aim of the project, to predict which users are likely to churn on a monthly basis, would not be met.


It is therefore of high importance to re-sample the data to obtain the desired outcome. The two approaches used during the project are random over- and under-sampling, described in Subsection 2.5. The majority class of active users was down-sampled so that equal numbers of active and churned users were obtained. Under-sampling was only applied to the training set; the imbalanced class distribution was kept in the test set to obtain more realistic values of the evaluation metrics. Both over- and under-sampling were applied, but under-sampling proved to be the better approach for handling the imbalanced data and meeting the project requirements.

4.1.1 Data setup

The aim of the project is to model the response (churn) using explanatory data consisting of the activity, engagement and subscription data presented in Section 3. The thesis project had access to the users' personal data; however, it was excluded at the Company's request, since the Company wanted variables that it can actually affect in order to prevent churn. Further, given the modelling objective, the target variable was created to indicate churn 30 days ahead (Figure 9). For each date and each observation in the dataset, the target variable indicates whether the user churned within a month or not.

Figure 9: Visualization of how the target variable was created - churned within a month calculated for each date and for each observation.

The target variable is represented in Table 5 with binary classification values of 0 and 1, where 0 corresponds to no churn and 1 corresponds to churn meaning that the user has churned within a specific month.

Table 5: Example of the response vector Y.

Y

ID DATE CHURN

4456  2019-08-01  0
4456  2019-08-20  1
...   ...         ...
n     2019-11-30  1
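A pandas sketch of this labelling could look as follows. The column names follow Tables 2 and 4, but the frames and the exact 30-day rule implementation are our own illustration, not the thesis code.

```python
import pandas as pd

activity = pd.DataFrame({
    "ID": [4476, 4476, 5761],
    "DATE": pd.to_datetime(["2019-08-01", "2019-12-25", "2019-08-20"]),
})
subscription = pd.DataFrame({
    "ID": [4476, 5761],
    # NULL cancel dates are filled with a far-future placeholder date.
    "CANCEL_DATE": pd.to_datetime(["2020-01-20", "2200-01-01"]),
})

df = activity.merge(subscription, on="ID")
days_to_cancel = (df["CANCEL_DATE"] - df["DATE"]).dt.days
# CHURN = 1 if the user cancels within 30 days of the observation date.
df["CHURN"] = ((days_to_cancel >= 0) & (days_to_cancel <= 30)).astype(int)
print(df["CHURN"].tolist())  # [0, 1, 0]
```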

Using the raw data consisting of the activity matrix A and the engagement matrix B, features that could be of interest were extracted and collected in matrix Z.


The additional extracted features could have a possible effect on the response variable, churn (Table 6). The variable most common engagement class was later converted into dummy variables, since it is a categorical variable in the raw data.

Table 6: An example of the extracted features, matrix Z.

ID DATE activity last30days days since lastLogin DistinctCategories NrSessions 30days mostCommon Engagement Class

4456 2019-06-11 3 7 4 0 high

1234 2019-06-12 4 8 3 8 mid

2345 2019-08-30 5 9 6 6 low

... ... ... ... ... ... ...

n 2019-08-31 6 0 7 3 high

Since the project partly deals with time-series data within each individual user's data, the variables measure the user's interactions with the Swedish News-site over bounded time windows for each treatment date. Therefore, the historical features based on the raw data of matrices A and B were calculated for all treatments independently, resulting in two matrices W1 and W2 which represent the data as shifted lag features. Let W1 be the matrix of shifted lag features for matrix A, and W2 the matrix of shifted lag features for B.

Table 7: An example of how the shifted lag features were structured, for matrices W1 and W2.

DATE t t-1 t-2 t-3 . . . t-30

2019-10-09 1 - - - - -

2019-10-10 2 1 - - - -

2019-10-11 3 2 1 - - -

2019-10-12 4 3 2 1 - -

... ... ... ... ... . .. ...

2019-11-30 4 4 3 2 1 3
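Lag columns of this form can be produced with pandas' shift. The snippet below is a simplified, single-user illustration; on the real data the shifting would need a groupby on the user ID so that lags never cross users.

```python
import pandas as pd

views = pd.DataFrame({
    "DATE": pd.date_range("2019-10-09", periods=5, freq="D"),
    "ARTICLE_VIEWS": [1, 2, 3, 4, 5],
})

# Shifted lag features t-1 .. t-3, mirroring the Table 7 layout.
for lag in range(1, 4):
    views[f"ARTICLE_VIEWS_t-{lag}"] = views["ARTICLE_VIEWS"].shift(lag)
```

Early rows contain NaN for lags that reach before the start of the window, just like the "-" entries in Table 7.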

The lagged features from the matrices W1 and W2 and the extracted features in the matrix Z were merged, resulting in the final matrix used for modelling, denoted matrix X (Table 8). By modelling the response vector of churn (Table 5) as a function of the three data matrices Z, W1 and W2, consisting of the extracted and the lagged features, the function f is obtained:

Ŷ = f(Z, W1, W2) (28)

Equation 28 will be referred to when presenting a machine learning method to estimate the probabilities of churn in Section 5. The methods used to obtain the probability estimates of the response variable are defined in Theory, Section 2.


Table 8: An example of the final concatenated modelling matrix X with historical rolling window values.

DATE  R7 USAGE  R7 USAGE-1  ARTICLE VIEWS-29  ARTICLE VIEWS-30  engagementClass high  . . .  activity last30days  distinct categories30

2019-09-09 7 0 12 1 1 . . . 4 0

2019-09-10 4 7 8 3 1 . . . 5 4

2019-09-11 9 5 6 2 0 . . . 5 5

2019-09-12 6 3 1 1 0 . . . 6 7

... ... ... ... ... ... ... ... ...

2019-09-30 4 6 1 6 1 . . . 8 4

4.1.2 Classification Setup

The data is divided into several parts, the main partitioning being a split into a training and a testing dataset. The training dataset is used to fit the models and the testing dataset is used to evaluate the performance of the model. To prevent historical information from being lost, the partitioning was performed after matrix X was created.

In many machine learning applications it is common to split the data within a fixed time frame into 80% for training and 20% for validation (Bishop, C., 2006, p. 32-33). However, since the thesis project deals with time-related, dynamic observations that change over time, a time-based split was chosen to best simulate real-life scenarios.

The data is split with respect to time, more specifically at the date 2019-09-30 for the training data. The model is then tested on the test dataset, which covers the period from 2019-09-30 to 2019-12-31. 30 days of historical data are required to calculate the features for the first observation in the training dataset. The data setup used for classification analysis is visualized in Figure 10.

Figure 10: Visualization of how the data is split for training and testing the model.

To summarize, the implemented model will score users on a monthly basis by estimating the probability of churn within a month.
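The time-based split described above can be sketched in pandas as follows; the toy frame and column names are illustrative only.

```python
import pandas as pd

# Toy stand-in for the modelling matrix X with a DATE column (Table 8).
X_full = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["2019-05-02", "2019-09-29", "2019-09-30", "2019-10-01", "2019-11-15"]),
    "R7_USAGE": [3, 5, 2, 7, 4],
    "CHURN": [0, 0, 1, 0, 1],
})

split_date = pd.Timestamp("2019-09-30")
train = X_full[X_full["DATE"] <= split_date]  # up to and including 2019-09-30
test = X_full[X_full["DATE"] > split_date]    # 2019-10-01 .. 2019-12-31
```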

The setup for training the model can be divided into the following five parts.

1. Data collection and data pre-processing.

2. Splitting the total dataset into the training and testing sets.

3. Use under-sampling to balance the training dataset.


4. Training the model on the training dataset and tuning the model using randomized grid search.

5. Evaluating the best model on the testing dataset.

4.2 Learning Algorithms

To estimate the probability of churn, different learning algorithms were used and evaluated. As a baseline model a Logistic Regression, defined in Subsection 2.1, was fitted to the data. Then a Random Forest, defined in Subsection 2.2, was fitted, followed by the Extreme Gradient Boosting classifier, from now on referred to as XGBoost, defined in Subsection 2.3. These models are presented in the Results, Section 5.

A Decision Tree and several ensemble learners, such as Adaptive Boosting and Gradient Boosting, were also considered. Moreover, a Support Vector Machine and several Neural Networks with different architectures were tested and evaluated.

However, these models did not outperform the selected ones and are therefore not presented in the Results, Section 5.

Two settings of the extracted features were used to train the three selected classifiers. In the first setting the complete collection of features, (Z, W1, W2), was utilized, while the second setting excluded both matrices Z and W2, resulting in the feature matrix W1. These two settings will from now on be referred to as the complete and reduced models, respectively.

All the models were evaluated with the same performance metrics, such as ROC-AUC, Recall, Precision, F1-score, weighted F1-score and Lift Score.

Moreover, the metrics obtained from the confusion matrix were also used to evaluate the performance of the models, as defined in Subsection 2.7. To assure stability in the evaluation metrics, the complete model for each learning algorithm was re-evaluated with ten test batches of 45 000 randomly selected users.

4.2.1 Parameter tuning

The hyperparameters of the Random Forest and the Extreme Gradient Boosting classifiers were first tuned individually, and the best parameters were then found using cross-validation, defined in Subsection 2.6.
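With Scikit-Learn this tuning step can be sketched as a randomized grid search over a small parameter grid, cross-validated with stratified folds. The grid values and synthetic data below are illustrative, not those used in the thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the balanced training data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_dist = {
    "n_estimators": [100, 200],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist,
    n_iter=5,
    scoring="roc_auc",  # AUC is one of the metrics optimized in the thesis
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```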

After discussions with the Company and the Swedish News-site it was decided to select the final model by optimizing three performance metrics; AUC score, False Positive Rate and False Negative Rate. The AUC score should be maximized whereas False Positive Rate and False Negative Rate are minimized.

The reason behind this is that the cost of missing out on actual churning users is high. In other words, the Company would rather expose a retention strategy to too many users than to too few, instead of missing out on the actual churning users.

Once the best performing model has been selected, it can be used for churn prediction.

Below is a five-step guide for predicting with the classification model.

1. Collect new data from the current date and 30 days back.

2. Extract features and build the feature matrices.

3. Load the trained model.

4. Run the model on the collected data.

5. Use the model on each user to predict whether any of the users is at risk of churning.
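Steps 3 to 5 can be sketched as a small scoring helper. The joblib persistence, the function name and the 0.5 threshold are our own assumptions for illustration; the thesis does not specify how the trained model is stored.

```python
import joblib

def score_users(model_path, feature_matrix, threshold=0.5):
    """Steps 3-5: load the trained model, estimate the churn probability
    for each user, and flag those above the threshold as at risk."""
    model = joblib.load(model_path)                      # step 3
    churn_proba = model.predict_proba(feature_matrix)[:, 1]  # step 4
    return churn_proba >= threshold                      # step 5
```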

4.3 Software Tools

Throughout the project, Python has been the primary computational language, with Jupyter Lab as the interface. The data used for the project was accessed from the Company's database using SQL queries. The extracted data was then processed using the Pandas library in Python, which worked well for this purpose.

The statistical modelling and the tuning of the hyperparameters were performed with the Scikit-Learn library. The baseline Logistic Regression model, the Random Forest model and the other ensemble learners were built using Scikit-Learn. The library works well with Python and with libraries such as Pandas, and it offers a user-friendly experience. The Extreme Gradient Boosting model, however, was fitted using the xgboost library, with its parameters tuned with the help of Scikit-Learn.

The xgboost package was used since it has many computational advantages compared to other implementations of gradient boosting; it is relatively fast and memory-efficient, and uses parallel computation.

Matplotlib and Seaborn were the main tools used for visualization, whereas Scikit-Learn's built-in functions for F-score, Recall, Precision and ROC-AUC were used for evaluation.


5 Results

In this section the results of the thesis project are presented, based on data from a fixed time period, more precisely from the first of April 2019 to the last day of December 2019. As mentioned in Subsection 4.1, the dataset used for modelling consists of 45 000 randomly sampled users from this period, and the dataset was re-sampled for each of the different classification methods. The training set contains observations between 1st April and 30th September 2019, while the test set consists of observations from 1st October 2019 to 31st December 2019 (Figure 10).

As explained in Subsection 4.2, several classification methods were tested and evaluated for the thesis project. Initially a Logistic Regression classification algorithm was fitted as a baseline, followed by a Random Forest and an XGBoost.

The three classification methods presented are thus Logistic Regression, Random Forest and XGBoost. As mentioned in Subsection 4.2, these methods were trained with two different feature setups: for each classification method a complete and a reduced model were fitted, to investigate whether certain features could affect the response more than others. Referring to Equation 28, the complete model includes the shifted lagged features of the activity data, the aggregated historical features and the extracted features, with the response vector modelled by Y = f(Z, W1, W2), while the reduced model only contains the shifted lagged features of the activity data, with the response vector modelled as Y = f(W1).

Moreover, the results and performance of the different classification methods are based on the evaluation metrics mentioned in Subsection 4.2: Recall, Precision, F1-score, Fβ-score (β = 2), Lift Score, False Positive Rate and False Negative Rate, defined in Subsection 2.7. A table summarizing the values of these evaluation metrics for both the complete and reduced versions of all models can be seen in Table 9. To assure stability in the models, the complete models were re-evaluated using ten batches of test data with 45 000 randomly selected users. Further, for the complete model of each learning algorithm a 95% confidence interval for the AUC metric is presented in Table 10. The ideal ROC-curve should mimic the one presented in Figure 5 in Subsection 2.7. For each model a density plot of the estimated probabilities for each class is visualized, followed by a plot of the fifteen most important features for predicting churn for the non-parametric methods Random Forest and XGBoost.


Table 9: A summary of the evaluation metrics for the three classification methods and the two feature setups of the complete and reduced models. Log.Reg is short for Logistic Regression, RF for Random Forest and XGBoost for Extreme Gradient Boosting. For the complete models the average of each metric is presented, with the standard deviation in parentheses.

                             Complete                                       Reduced
Evaluation Metric \ Model    Log.Reg        RF             XGBoost          Log.Reg  RF     XGBoost
Recall                       0.678 (0.004)  0.650 (0.000)  0.667 (0.004)    0.75     0.63   0.65
Precision                    0.120 (0.000)  0.126 (0.005)  0.120 (0.000)    0.08     0.09   0.09
F1 score                     0.203 (0.004)  0.210 (0.000)  0.210 (0.000)    0.15     0.16   0.16
Fbeta score, beta=2          0.350 (0.000)  0.351 (0.003)  0.354 (0.005)    0.29     0.28   0.29
Lift score                   1.86 (0.013)   1.947 (0.009)  1.906 (0.015)    1.22     1.29   1.29
AUC                          0.722 (0.002)  0.728 (0.001)  0.728 (0.002)    0.604    0.605  0.604
False Negative Rate          0.322 (0.003)  0.350 (0.003)  0.335 (0.004)    0.25     0.37   0.35
False Positive Rate          0.342 (0.002)  0.313 (0.001)  0.317 (0.002)    0.60     0.48   0.49

Table 10: 95% confidence intervals for the AUC metric for the three complete models.

                           95% confidence interval for AUC
Logistic Regression        (0.7207, 0.7234)
Random Forest              (0.7278, 0.7292)
Extreme Gradient Boosting  (0.7269, 0.7295)

5.1 Logistic Regression

As the baseline model for each of the two feature setups, a Logistic Regression was fitted with Sklearn's default parameters, meaning that the model is penalized with Ridge Regression. The model was trained on a training set whose classes were balanced by random under-sampling.

On the left in Figure 11 the ROC-curve for the complete model is shown, and on the right the one for the reduced model. The ROC-curve of the complete model shows slightly more curvature than that of the reduced model, which might indicate that the complete model is somewhat better at separating the two classes, though still not perfectly. In the bottom right corner of each plot the AUC score for the model is presented. The complete model results in an average AUC score of 0.722, which means that the model has a 72.2% chance of ranking a random observation of the churning class higher than a random observation of the non-churning class. The reduced model gives an AUC score of 0.604, which is more than 0.1 lower than that of the complete model.
