DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2018

An analysis of customer retention using data mining

MOA BÄCK ENEROTH


Author

Moa Bäck Eneroth

Information and Communication Technology
KTH Royal Institute of Technology

Examiner

Henrik Boström

Professor of Computer Science - Data Science Systems
Department of Software and Computer Systems

KTH Royal Institute of Technology

Supervisor

Johan Montelius

Associate Professor in Communication Systems
Department of Software and Computer Systems
KTH Royal Institute of Technology


Abstract

This thesis aimed to answer the question whether the use of third-party applications, in addition to the original product, has an impact on customer retention at a digital rights management company. The research originated in the null hypothesis that there is no relationship between the dependent variable customer retention and the independent variable usage of third-party applications. To evaluate whether the hypothesis could be rejected or not, the relationship between the two variables was analyzed using logistic regression. The result showed that there was a positive impact for the chosen set of included variables. Consequently, the conclusion was that there could be a potential positive correlation between the two variables and the null hypothesis could, therefore, be rejected.

Keywords

Customer analysis, customer retention, data mining, machine learning, logis- tic regression.


Sammanfattning

Detta examensarbete hade som målsättning att svara på frågan huruvida användandet av tredjepartsapplikationer, utöver användandet av originalprodukten, har en inverkan på kundlojalitet hos ett företag som arbetar med att hantera digitala rättigheter. Studien utgick ifrån nollhypotesen att det inte finns en relation mellan den beroende variabeln kundlojalitet och den oberoende variabeln användandet av tredjepartsapplikationer. För att kunna utvärdera huruvida hypotesen kan förkastas eller inte, analyserades relationen mellan de två variablerna med hjälp av logistisk regression. Resultatet visade att det fanns en positiv inverkan för valt dataset. Följaktligen var slutsatsen att det potentiellt skulle kunna finnas en positiv korrelation mellan de två variablerna och nollhypotesen kunde därför förkastas.

Nyckelord

Kundanalys, kundlojalitet, data mining, maskininlärning, logistisk regression.


Acknowledgements

I would like to use this space to express my gratitude towards people who in different ways supported this project.

First of all, thank you Johan Montelius for all the support during the whole project; I could not have done it without you. Also, thank you Henrik Boström for your advice and for providing me with feedback on my work. Last but not least, thank you Gabriel Franco and Henrik Eriksson, for providing me with this opportunity and for believing in me.


Contents

1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Objectives
1.5 Methodology
1.6 Delimitations
1.7 Ethics and Sustainability
1.8 Report Structure

2 Extended Background
2.1 The Process of Data Mining
2.2 Statistical Definitions
2.3 Logistic Regression
2.4 Performance Measurements

3 Method

4 Results
4.1 Data
4.2 Data Analysis

5 Discussion

6 Conclusion


Chapter 1

Introduction

In the world as we know it today, technology is virtually everywhere. It is integrated into our lives in such a way that it would be rather hard to imagine one without it. And still, some people argue that this is just the beginning of what is to come. A couple of years ago, Klaus Schwab (2016), founder and executive chairman of the World Economic Forum, declared that the world is on the verge of a major technological revolution. He describes it as the Fourth Industrial Revolution, one that will fundamentally change the way humankind lives, works and relates to one another. With its scale, scope, and complexity, this transformation will be unlike anything the world has ever experienced before. In a world where billions of people are connected and where we have access to unprecedented processing power as well as storage capacity, the possibilities will be endless.

1.1 Background

There is already a major buzz around data, in terms of big data, data mining, data science and machine learning. However, it is not all new to the world. Data mining as a concept has been around for decades. One of the very first moments in the modern history of data mining occurred back in 1936, when Alan Turing introduced the idea of a machine that could execute computations, similar to the computers we have today (Marr, 2016). He later also created the so-called "Turing test", formed to determine whether a computer could be considered intelligent. To pass the test, it would have to fool a human into believing it was also human. The Samuel Checkers-playing Program, created in 1952 by a man named Arthur Samuel, is said to be the world's first self-learning program; it improved its strategies by observing successful moves (ibid.). A few years later, Frank Rosenblatt designed the very first neural network for computers, simulating the thought process of the human brain (ibid.).

We have come a long way since then. In 2016, an artificial intelligence algorithm created by Google beat a professional player at the Chinese board game Go, one of the most complex board games there is (Marr, 2016). The ability of computers today is growing at a remarkable rate. And as the quantities of data we produce continue to grow exponentially, so will the computers' ability to process and analyze that data. The term big data refers to massive quantities of data, and data mining to the process of analyzing all of that data. It is an interdisciplinary field, intersecting both machine learning and statistics, where a collection of techniques are applied to extract data patterns (Han, Kamber and Pei, 2011). Data mining is a tool, and as with any tool, it is not only sufficient to understand how it works; it is equally important to understand how it can be used. This thesis will explore the field of data mining and how it can unveil behavioral patterns and insights about customers when applied to large amounts of customer data.

The data is customer-generated data, provided by a digital rights management company. Due to confidentiality reasons, the name of the company will not be disclosed. Also, other details, such as the exact meaning of variables and their actual values, will not be published.

1.2 Problem

As the tech industry develops, alongside emerging technologies and innovative ideas, the competition for customers is intensifying. Not only do companies need to attract new customers, but more importantly, they also have to make sure to retain the already existing ones (Hassouna et al., 2015). So besides providing a viable product or service, companies also need to really understand who their customers are. More importantly, they must learn how to distinguish users who are about to churn and how to prevent them from leaving the service or stopping using the product.


The field of customer churn analysis is not unexplored (Burez and Van den Poel, 2007; Ahn, Han, and Lee, 2006; Miguéis, Van den Poel, Camanho, and Cunha, 2012; Faris, Al-Shboul, and Ghatasheh, 2014; Glady, Baesens, and Croux, 2009; Farquad, Ravi, and Raju, 2014). A customer churn analysis for a software-as-a-service company, which compared different prediction models, was able to identify the most important software usage features as well as classify which customers could be about to churn (Ge et al., 2017). Another study presents an approach to predicting game churn, based on a player's survival ensemble (Bertens, Guitart and Perianez, 2017). However, research that aims to investigate an already chosen independent variable's impact on a dependent variable appears to be rather uncommon, and this could be a research gap that this thesis could potentially fill.

1.3 Purpose

The digital rights management company already has a fairly good understanding of who their customers are and what impacts customer retention. This includes factors such as demographic features as well as what platform the customers usually use and how much time they spend engaged with the application. However, what they do not know is how the use of third-party applications, in addition to the company's own product, impacts customer retention. The company expects the impact to be positive, but there has been no research on the matter, until now. The purpose of this thesis is to use the tools that data mining provides to answer the following question.

Does the usage of one or several third-party applications have an impact on customer retention?

The results in this thesis will be handed over to the digital rights management company and will hopefully add to their knowledge about their customers, helping them make more strategic decisions about how to decrease the churn rate.

The study also encourages further research that could be done using a similar approach. As for external value, this study could serve as inspiration for research in a similar context, although the similarity would be hard to verify as the product in this thesis is never disclosed.


1.4 Objectives

In order to fulfill the purpose of this thesis, the question defined in the previous section will have to be answered. To better understand the customers and their behavior, this thesis will analyze data that the digital rights management company possesses, data that the customers have generated by engaging with the product. This data analysis will evaluate whether there is a relationship between customer retention and the usage of third-party applications in addition to being an active user of the original application.

1.5 Methodology

Statistical analysis is the science of gathering data and uncovering patterns and can be used to make predictions about the future, based on past behavior, and to test a hypothesis (Statistics How To, 2014). The latter, more commonly known as hypothesis testing, is where a null hypothesis can either be rejected or not, after an analysis has been performed on the gathered data (ibid.). This is the research approach that will be used in this thesis. It is of an empirical nature, as information will be obtained from observations to evaluate a hypothesis, rather than through logical reasoning alone. It originates in the null hypothesis that there is no relation between the dependent variable customer retention and the independent variable usage of one or several third-party applications, in addition to the original application. This hypothesis can and will only be rejected if the concluding result from the analysis reveals that there is, in fact, a relationship between the two variables.

To be able to answer the research question presented in the previous section, there are a few steps that have to be performed. The first and maybe most important step is the gathering of data, and that in itself includes several steps. First of all, contextual research must determine what factors to include in the data analysis. When this is done, the actual act of gathering the data can be performed, a task known to be very time-consuming. The form in which the data is presented is not always the same as the one preferred when applying the model, which means that some cleaning and altering of the data is required.

The following step involves implementing the chosen model. In the case of this thesis, the model of choice is regression analysis, and more specifically, logistic regression. There are a number of powerful tools and libraries available to make this step less of a hassle, and some of these will be used in the work of this thesis. The final step is when the model is run on the gathered data to generate a result. The result is then analyzed and evaluated with regard to previous steps and the model's performance. Hence, it is also important to include performance measurements when running the model and to take these into account when evaluating the result.

1.6 Delimitations

What is of interest in this thesis is the overall impact on customer retention of customers using one or several third-party applications in addition to the original product, i.e. the impact for all customers. However, in the collaboration with this digital rights management company, that means over one hundred million users and a tremendous amount of data. To fit the scale and scope of this thesis, limited in time and resources, the sample will hence only include newly registered customers.

1.7 Ethics and Sustainability

All user-generated data are to be handled with caution. Most companies aim to be transparent about how and why they collect this data as well as how it is later used.

Although precautions are being taken by most companies, data breaches are not unheard of. Recently, an analytics firm by the name of Cambridge Analytica was revealed to have harvested millions of Facebook profiles, using them to predict and influence voters during the 2016 United States presidential election (Graham-Harrison and Cadwalladr, 2018). One of the reasons this turned into a major scandal was the fact that Facebook was notified about the breach years before the reveal, but did not take action.

To prevent these things from happening, the European Union has a new regulation in place as of May 25th, 2018. The General Data Protection Regulation (GDPR) aims to protect all EU citizens' data privacy and reshape the way organizations across the region approach data privacy (EU GDPR Portal, 2018). Simply put, this forces companies to collect consent from all users when saving information and to be transparent about what data are being stored, where and for how long. The regulation is an example of an important, if not necessary, approach to responsible data management (Gregoire, 2018). However, companies collecting user-generated data is not necessarily a bad thing; in fact, this is what enables companies to deliver a customized user experience. What legislation like the GDPR ensures is that data is protected and hence that trust between users and companies can be preserved. Due to the GDPR, the digital rights management company has been compelled to make some changes in the way they handle and store the data they possess. Among these changes, the data have been anonymized, and the handling of data during this project has been done with the utmost caution and respect.

1.8 Report Structure

This thesis is structured as follows. This initial chapter has introduced the subject of research, together with a presentation of the problem and the purpose of this thesis. It has also touched upon ethics concerning the field of study, and delimitations. The second chapter will provide an in-depth presentation of theory and extended background. The third chapter will cover the method, and the results are then found in the fourth chapter. The fifth chapter will discuss the results, the method, and its limitations. This chapter will also present suggestions for future work. The sixth and final chapter will draw conclusions from the data analysis.


Chapter 2

Extended Background

Data mining is the practice of searching through large quantities of data to discover patterns that go beyond simple analysis. It uses sophisticated mathematical algorithms to segment the data, in order to predict the probability of future events based on historical events (Surampudi, 2018). There is a great deal of intersection between data mining and statistics; in fact, most of the techniques used in data mining can be located in a statistical framework (ibid.). However, statistical models usually make strong assumptions about the data, which in turn allow strong statements about the results. This means that if the assumptions were to be flawed, the validity of the model can be questioned.

The case is not the same for the machine learning models used in data mining. They, on the other hand, make weak assumptions about the data and therefore cannot make what are considered strong statements about the results. Yet, they can still produce very good results regardless of the data (Surampudi, 2018). Empirical observations can be qualified, but it is important to remember that the predictive relationships that are discovered are not causal relationships. It can never be assumed that a population identified through data mining has a specific behaviour just because it belongs to the population in question (ibid.). Data mining can provide probabilities, but never exact answers.

Data mining models and functions can generally be categorized as either supervised or unsupervised data mining, notions derived from the science of machine learning (Surampudi, 2018). Supervised learning, also known as directed learning, is when the learning process is directed by a known dependent attribute. It strives to explain the behaviour of the target using a function of independent predictors, making this approach common for predictive models, while unsupervised learning is used for pattern detection. Logistic regression, the machine learning method used in this thesis, is a supervised learning process.

Building a supervised model involves training, a process of analyzing data where the target is known. This is how the function learns how to make predictions. The data are divided into two sets, one for training and the other for testing. The latter is used for determining the performance of the model and whether it is generalizable to other data. This way, the risk of overfitting is decreased. Overfitting occurs when the model fits the training data too well and consequently is less likely to make good predictions for other data.
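The split into a training set and a test set can be illustrated with a few lines of scikit-learn. This is only a minimal sketch on synthetic data; the 80/20 ratio and all variable names are illustrative assumptions, not taken from the thesis.

```python
# Minimal sketch of a train/test split; X and y are synthetic placeholders
# for a feature matrix and a binary target.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                          # placeholder predictors
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # placeholder binary target

# Hold out 20% of the observations to estimate how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```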

2.1 The Process of Data Mining

The iterative process of data mining, presented in Figure 2.1, has four main phases (Surampudi, 2018). The first phase is all about understanding and defining the objectives and possible requirements. Once the business problem is defined, the data mining problem can be specified.

Figure 2.1: The data mining process, from (Surampudi, 2018).


When this is done, the data can be collected, which is the second phase. This stage is known to be very time-consuming, much due to the often very large amount of data, how it is stored, and how easy it is to find what one is looking for. It is important to only select data that address the problem and remove the data that do not, as well as to assess the quality of the data and make proper preparations for it to fit the model.

What model to choose should be based on the problem as well as the desired goal of the analysis. When the model has been chosen, proper cleaning and preparations can be made for the data to fit the model. The implementation of the model can be carried out in various ways. However, for the less experienced, there are several powerful software tools and libraries to use. The model is then used to analyze the data. Here, it is also important to evaluate the performance measurements of the model, and include these when making conclusions about the result. The results are interpreted and, in the last phase, the deployment phase, insights and actionable information can finally be retrieved.

2.2 Statistical Definitions

The fields of data mining and machine learning are greatly intertwined with the field of statistics. For this reason, there are some statistical concepts that need to be touched upon.

Population and Sample

Population and sample are two of the most fundamental concepts in statistics. A population is a collection of individuals or objects which are of interest when solving a research problem. Sometimes the desired measurements are obtained for every member of the population, but usually, only a subset of the individuals or objects of that population is observed. This is what is known as a sample.

”Population is the collection of all individuals or items under consideration in a statistical study. Sample is that part of the population from which infor- mation is collected.” (Weiss, 1999)


Confidence Interval

In statistics, a confidence interval is an interval estimate that is measured from observed data. The confidence level is the frequency of possible confidence intervals that contain the true value of the corresponding parameter.

”A confidence interval for a parameter is a range of numbers within which the parameter is believed to fall. The probability that the confidence interval contains the parameter is called the confidence coefficient. This is a chosen number close to 1, such as 0.95 or 0.99.” (Agresti and Finlay, 1997)

Correlation

Correlation is a measure of the extent to which a change in one variable is related to a change in another variable, on a range from -1 to 1 (Berry and Linoff, 2004). A correlation of 0 means no correlation, i.e. the variables are not related. A correlation of 1 means that the two variables will change in the same direction when one of them is altered, however, not necessarily by the same amount. The same holds for -1, but here the change will be in the opposite direction.

R² is another measure of correlation. It is the correlation squared and ranges from 0, no relationship, to 1, complete relationship.
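As a small numerical illustration of the two measures, the snippet below computes the correlation and the corresponding R² for two made-up variables; none of the values come from the thesis data.

```python
# Pearson correlation between two toy variables, and the corresponding R^2.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly 2*x, so strongly correlated

r = np.corrcoef(x, y)[0, 1]   # correlation, in the range [-1, 1]
r_squared = r ** 2            # R^2, in the range [0, 1]
print(round(r, 3), round(r_squared, 3))
```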

2.3 Logistic Regression

Logistic regression is used when examining the relationship between a dependent variable and one or several independent variables. The value of a dependent variable depends on the values of the independent variables. The dependent variable represents the output, whose variation is of interest, whereas the independent variables represent the input, i.e. the cause of variation. When talking about these variables in terms of data mining, the dependent variable is usually considered to be the so-called target variable and the independent variables regular ones.

A variable is considered categorical when it can only take on one out of a limited and often fixed number of possible values. Each observation is consequently assigned to a specific nominal category based on its qualitative property. A binary dependent variable is a variable limited to only two values, e.g. 0/1, pass/fail, or alive/dead.

The method is a mathematically oriented approach to analyzing the effect that variables have on one another. Predictions are made through a set of equations which connect input values with an output. The equations below provide the mathematical formulas for the logistic regression model (Nisbet, Elder and Miner, 2009).

$$p(y = 1 \mid x_1, \ldots, x_n) = f(y)$$

$$f(y) = \frac{1}{1 + e^{-y}}$$

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$$

Here, $y$ is the binary target variable for each individual; $\beta_0$ is a constant; $\beta_i$ is the weight given to the specific variable $i$ associated with each individual; and $x_1, \ldots, x_n$ are the predictor variables for each individual from which the prediction is to be made.
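The formulas translate directly into code. The following sketch evaluates the logistic function for a single individual; the coefficients and predictor values are made up purely for illustration.

```python
# Direct translation of the logistic regression equations:
#   y = beta_0 + beta_1*x_1 + ... + beta_n*x_n,   p(y = 1) = 1 / (1 + exp(-y))
import numpy as np

beta_0 = -1.0                       # illustrative intercept
betas = np.array([0.8, -0.3, 0.5])  # illustrative weights beta_1..beta_n
x = np.array([1.2, 0.7, 2.0])       # illustrative predictor values x_1..x_n

y = beta_0 + betas @ x              # linear combination
p = 1.0 / (1.0 + np.exp(-y))        # probability that the target equals 1
print(round(p, 3))
```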

2.4 Performance Measurements

Performance measurements are used to determine how well the model performs. This section will touch upon the confusion matrix, and the metrics that can be derived from it, as well as the receiver operating characteristic curve.

Classification Accuracy

A confusion matrix, seen in Table 2.1, is a tool used to measure the performance of a binary classification (Hassouna et al., 2015). It is a visual representation of information about the actual classifications as well as the predictions produced by the model.


Table 2.1: Confusion Matrix.

                Predicted True          Predicted False
Actual True     True Positive (TP)      False Negative (FN)
Actual False    False Positive (FP)     True Negative (TN)

From the confusion matrix, several other metrics can be derived. Classification accuracy, the percentage of the observations that are correctly classified, is determined using the equation that follows.

$$\text{Classification Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

Classification accuracy can be a vague indicator, particularly in the case of imbalanced data (Hassouna et al., 2015). For this reason, there are two complementary metrics to use as well: sensitivity and specificity. Sensitivity is the proportion of the actual positives that are correctly classified, determined by the following equation.

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

Specificity, on the other hand, is the proportion of the actual negatives that are correctly classified and is determined by the following equation.

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Last but not least, the precision is determined by the following equation.

$$\text{Precision} = \frac{TP}{TP + FP}$$
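All four metrics follow directly from the four counts in the confusion matrix. A small sketch, with placeholder counts:

```python
# Classification accuracy, sensitivity, specificity and precision derived from
# confusion matrix counts; the counts below are placeholders.
tp, fn, fp, tn = 80, 20, 10, 90

accuracy    = (tp + tn) / (tp + fp + tn + fn)
sensitivity = tp / (tp + fn)   # share of actual positives correctly classified
specificity = tn / (tn + fp)   # share of actual negatives correctly classified
precision   = tp / (tp + fp)   # share of predicted positives that are correct

print(accuracy, sensitivity, specificity, precision)
```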


Receiver Operating Characteristic Curve

A receiver operating characteristic curve, widely known as a ROC curve, is a depiction of the relation between the true positive rate and the false positive rate (Hassouna et al., 2015). Figure 2.2 presents an example of a ROC curve.

Figure 2.2: An example of a ROC Curve, from (Hassouna et al., 2015).

The best performing models are very close to the upper left corner, stretching for the coordinate (0,1). That requires the model to be close to a hundred percent on both sensitivity and specificity. The dotted line that divides the ROC space depicts a random predictor. ROC curves close to this line correspond to randomly guessing classifiers.
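Given true labels and predicted probabilities, the curve and the area under it can be obtained with scikit-learn; the sketch below uses synthetic values rather than the thesis data.

```python
# Sketch of computing a ROC curve and its area under the curve (AUC).
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])                   # synthetic labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5])  # synthetic probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))   # the closer to 1.0, the closer the curve is to the upper left corner
```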


Chapter 3

Method

The research problem in focus in this thesis is an increasingly competitive environment, forcing companies not only to attract new customers but also to make sure to retain the customers they already have. How to prevent a customer from churning is a hard nut to crack, and there are certainly a number of ways to do it. This thesis is carried out in collaboration with a major digital rights management company, and a factor that has been taken into account here is the huge amount of data that they have about their customers. Making use of the data can improve business (McAfee and Brynjolfsson, 2012), although most data possessed by companies are not fully utilized (Halaweh and Massry, 2015). The data might have been collected for another purpose, but could hold a lot of information about unknown customer behavior and patterns that are not directly interpretable. These could, however, be very useful when wanting to understand what keeps a customer from churning.

To unlock the full potential and to mine insights about their customers, one approach is to analyze historical customer data, data that describes past user behavior and whether the customer is considered to be an active user during the time period that is being analyzed. By applying data mining techniques to this data, it is possible to find indications of a correlation between specifics in behavior and retention. In other words, data mining can provide powerful tools when wanting to predict which customers have a higher probability of defecting in the future. This information could be absolutely essential when making strategic business decisions about what actions could be taken to prevent users from leaving the service (Brynjolfsson, Hitt and Kim, 2011).


Instead of looking at everything that could have an impact on customer retention, a single particular factor could be of interest, and that is the case here. The digital rights management company has a large number of people working on an application programming interface (API) and would like to know if the usage of this in some way impacts the overall customer retention for their product, of which the API is just a small part. In addition to their primary platforms, their product is also integrated into other applications. This enables users to interact with their service through a third party, meaning that a user can encounter the functionality provided by the company also outside of the primary applications, often with additional features. This raises the question of whether a user experiences additional value when using one or several of these third-party applications, in contrast to only using the primary application, and if it impacts the company's overall user retention.

So, the question aimed to be answered in this thesis is whether an independent variable has an impact on a dependent target variable. The dependent variable in that equation is not too hard to figure out. It is, of course, customer retention. It is a binary target variable, as each customer can either be an active user or a churned user, but not both at the same time, and there are no other options. Further, the independent variable of interest is the usage of one or several third-party applications, in addition to using the original application.

There are several different methods that could be used for analyzing the relationship between these two variables. One of the most common approaches is regression analysis (Hung, Yen, and Wang, 2006). It is not just a single method, but rather a collection of methods, and the one that is most common when making predictions of customer churn is logistic regression (Burez and Van den Poel, 2007; Ahn, Han, and Lee, 2006; Miguéis, Van den Poel, Camanho, and Cunha, 2012; Faris, Al-Shboul, and Ghatasheh, 2014; Glady, Baesens, and Croux, 2009; Farquad, Ravi, and Raju, 2014). It is based on a statistically oriented approach of analyzing the variables' impact on each other.

This understanding of how the different variables relate to one another can then be used for making predictions. The prediction is made by using a set of equations that connects input factors with the desired target variable.

The analysis involves a series of stages. Initially, it is important to have a perfectly clear picture of the problem and the research question that follows. This is then the starting point from which the locating and gathering of data will originate. How this stage is performed and completed depends on the context of the research. As this study is carried out in collaboration with a company, using their data, the act of locating and gathering data will be determined by how the company has chosen to store it and whether they have some system providing an overview of the data. Due to limitations in time and resources, the analysis has been chosen to only include newly registered customers. The sample will, therefore, include all customers who registered with the company on a specific date, and the period of time that will be included in the data analysis is set to approximately two months, starting from that date.

The logistic regression model will be implemented in Python, using the data science platform Anaconda. It includes several useful applications and libraries for data processing, out of which Jupyter Notebook, NumPy, SciKit-Learn, and Pandas will be used. Jupyter Notebook is a web application that allows interactive computing for use cases such as statistical modeling and machine learning. NumPy, SciKit-Learn, and Pandas are all powerful libraries. NumPy adds support for large multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions for operating on these arrays. SciKit-Learn builds on top of NumPy and another library called SciPy, and features various classification and regression algorithms, such as logistic regression, and is hence crucial for this model implementation. Pandas is used for data manipulation and analysis and offers data structures and operations for manipulating numerical tables and time series.
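A condensed sketch of how these libraries could fit together for the analysis described here is shown below. The file name and the column names are hypothetical placeholders, since the real variables cannot be disclosed; this is a sketch of the approach, not the exact implementation used in the thesis.

```python
# Sketch of the modelling pipeline: Pandas for loading the prepared data,
# SciKit-Learn for fitting and evaluating the logistic regression model.
# "customers.csv", "is_mau" and all column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")        # hypothetical prepared dataset
y = df["is_mau"]                         # target: monthly active user or not
X = df.drop(columns=["is_mau"])          # predictors, incl. third-party usage

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))       # classification accuracy on the test set
```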

To evaluate the implemented model's performance, there is a set of effective performance measurements that can be used, explained in 2.4 Performance Measurements. The ones that will be used in this analysis are a correlation heatmap, a receiver operating characteristic curve, and a confusion matrix. Additionally, a number of other measurement metrics can be derived from the confusion matrix. These are the classification accuracy, sensitivity, specificity, and the model's precision. When running the model on the gathered data, the model's output and performance should be evaluated together. If the model were to perform badly, the output should be seen as less likely to be truthful. If the model, on the other hand, turns out to perform well, then the output will be seen as sufficient data to analyze further.


Logistic regression is based on the equation $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$, where $y$, in this case, is whether a customer is considered to be a monthly active user. The null hypothesis in this thesis is that using one or several third-party applications has no impact on customer retention. In other words, the expected value of the $\beta$ that belongs to the factor of third-party application usage is zero. If it, however, is not zero, the null hypothesis can and will be rejected.
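To test such a hypothesis, the fitted coefficient needs to come with a standard error and a confidence interval, as reported in the results chapter. One library that produces this kind of summary is statsmodels; the sketch below fits a logistic regression on synthetic data, and all variable names are placeholders. Whether the thesis used statsmodels or another tool for this table is not stated, so this is only an illustration.

```python
# Sketch: fitting a logistic regression with statsmodels so that every
# coefficient is reported with a standard error and a confidence interval.
# All data are synthetic; "third_party_usage" and "is_mau" are placeholder names.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "third_party_usage": rng.integers(0, 2, size=n),
    "other_feature": rng.normal(size=n),
})
linear = -0.5 + 0.3 * df["third_party_usage"] + 0.8 * df["other_feature"]
df["is_mau"] = (rng.random(n) < 1 / (1 + np.exp(-linear))).astype(int)

X = sm.add_constant(df[["third_party_usage", "other_feature"]])
result = sm.Logit(df["is_mau"], X).fit(disp=0)
print(result.summary())   # coefficients, standard errors and 95% confidence intervals
```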

Logistic regression has proven to be useful for research with similar purposes (ibid.), but there are, of course, alternative approaches that could have been used to provide a solution to the problem presented in this thesis. In contrast to conducting this type of quantitative empirical study, another possible approach could have been to conduct a qualitative study. Instead of looking at the behavior of a very large number of customers, searching for a pattern, an alternative method could have been to conduct extensive customer interviews or a survey (Ghaleb Magatef and Fakhri Tomalieh, 2015). This strategy does, however, depend on finding representative customers, i.e. customers who can represent the whole population of customers. As this thesis is made in collaboration with a digital rights management company with customers all around the globe, it would be a very hard, not to say impossible, task to find a suitable sample of customers to interview. Regression, on the other hand, is reported to have high accuracy and interpretability for understanding the key drivers of retention, and for providing information to set up retention actions (Zhang et al., 2015). Furthermore, there is evidence that a logistic regression model, built on a well-prepared dataset, is just as viable as other, perhaps more advanced, algorithms, such as random forests and support vector machines, when it comes to classification performance in churn prediction models (Coussement, Lessmann and Verstraeten, 2017). Considering this, statistical analysis and the logistic regression model seemed to be the most appropriate research approach for this thesis and its purpose.


Chapter 4

Results

This chapter will present the results acquired in this thesis. The work covers two main phases. The first one includes the gathering and preparation of data, and the second, the analysis of that data.

4.1 Data

The process of gathering data can be quite extensive, and this project was no exception. One of the first steps is to get an understanding of what data to use in the analysis, which depends on both the model and, of course, the question one is seeking an answer to. In the case of this thesis, there was an opportunity to make use of knowledge already possessed by the digital rights management company. Interviews with employees and access to previous internal research provided support for initial assumptions about what variables could be of importance for the analysis. While navigating through and getting familiar with huge amounts of data, these variables were carefully selected and retrieved using Google BigQuery and SQL. A major obstacle here was the lack of documentation, making it hard to obtain an overview of the available data.

The ways of picking a sample were many, but due to reasons mentioned in 1.6 Delimitations, the sample was based on newly registered users. The sample therefore comprised all customers who registered with the digital rights management company on a certain date, in this case January 10th, 2018. The whole sample consisted of data collected for all of these customers over a time period of 60 days. The dataset contained a total of 483 106 observations.
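As an illustration of the retrieval step, a query like the one below could be run against BigQuery from Python with the official client library. The project, dataset, table and column names are entirely hypothetical, since the actual schema and queries cannot be disclosed.

```python
# Sketch of pulling a registration cohort from BigQuery into a Pandas DataFrame.
# The project, dataset, table and column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # assumes credentials are configured

query = """
    SELECT user_id, registration_date, is_mau_day_60, made_web_api_request
    FROM `my-project.analytics.new_registrations`
    WHERE registration_date = '2018-01-10'
"""
df = client.query(query).to_dataframe()
print(len(df))
```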


The complete list of the final variables that were selected cannot be disclosed due to confidentiality reasons. However, two variables are of particular interest: whether the customer was considered a monthly active user at the end of the chosen time period, and whether the customer made at least one web-API request during that period of time. For a customer to be recognized as a monthly active user, also known by the acronym MAU, the user has to be active at least once during each time period of thirty days. Further, having made a web-API request indicates that the customer has been using a third-party application.

4.2 Data Analysis

Before running the model, an initial correlation study was produced using a correlation heat map which visualizes the correlation between the different variables, seen in Figure 4.1. The two variables of interest are whether a customer was considered a monthly active user at the end of the chosen time period, and whether a customer made at least one web-API request during that period of time. These two variables intersect at the coordinates (3,29) and (29,3) and indicate a positive correlation with a value above 0. The full list of variables can be found in Appendix A, with some changes due to confidentiality reasons.

Figure 4.1: Correlation heatmap.
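A heat map of this kind can be produced from a Pandas correlation matrix. The sketch below assumes a prepared DataFrame of numeric variables; here it is replaced by toy data, so it only illustrates the technique, not the actual figure.

```python
# Sketch of a correlation heatmap over all numeric columns in a DataFrame.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))  # toy data

corr = df.corr()                                 # pairwise correlations in [-1, 1]
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.show()
```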


Figure 4.2: A sample of the logistic regression analysis result.

Figure 4.3: A sample of the logistic regression analysis result.

Further indications of a potential positive correlation between these two variables can be seen in the results from the logistic regression model, visible in Figure 4.2 and Figure 4.3, and fully presented in Appendix A.

Recall the equation $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$, where $y$, in this case, is whether a user is a monthly active user. Then $x_{146}$ is the usage of a third-party application and $\beta_{146}$ its weight, i.e. the coefficient, which turns out to be 0.0656 for this model and the selected variables. The positive sign implies a positive impact, meaning that a user is more likely to remain a monthly active user if using a third-party application, with a factor of nearly 0.07. The standard error is only 0.001, and the confidence interval lies between 0.063 and 0.068, measurements that indicate that the estimate is precise.


When predicting the test set results and calculating the accuracy of the logistic regression classifier, there is a score of 0.85. To avoid overfitting while still producing a prediction, 10-fold cross-validation is used to train the model. The cross-validation average accuracy is close to the logistic regression model accuracy, at 0.853.
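The cross-validated accuracy can be obtained with scikit-learn's cross_val_score. A minimal sketch, again on synthetic placeholders since the real predictors and target cannot be shown:

```python
# Sketch of 10-fold cross-validation of a logistic regression classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                          # placeholder predictors
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # placeholder binary target

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(scores.mean())   # average accuracy across the ten folds
```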

Another way of describing the performance of the logistic regression classification model is by using a confusion matrix. The result from using such a matrix is shown in Table 4.1 below. The total number of correct predictions is 123 536, while the total number of incorrect predictions is 21 396.

Table 4.1: Confusion Matrix.

                Predicted False    Predicted True
Actual False    76 307             6 847
Actual True     14 549             47 229

Correct predictions: 123 536
Incorrect predictions: 21 396

Using the information provided by the confusion matrix, scores for precision, sensitivity and f-measure are presented in Table 4.2 below.

Table 4.2: Result of precision, sensitivity, and f-measure.

               Precision    Sensitivity    f1-Score
0              0.84         0.92           0.88
1              0.87         0.76           0.82
Avg / Total    0.85         0.85           0.85

The f1-score is a measure that combines precision and sensitivity and when they are close, this measure is roughly the average of the two.
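More precisely, the f1-score is the harmonic mean of precision and sensitivity. As a worked check against the class 0 row of Table 4.2:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}} = 2 \cdot \frac{0.84 \cdot 0.92}{0.84 + 0.92} \approx 0.88$$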


The receiver operating characteristic (ROC) curve for the model is presented in Figure 4.4. The dotted line dividing the space in two represents the ROC curve of a purely random classifier, whereas the other line represents the result for the logistic regression model. A well performing logistic regression model should place itself as far away from the dotted line as possible, striving towards the top-left corner (Hassouna et al., 2015). Since the ROC curve for the logistic regression model is considerably closer to the upper left corner than the curve of the random classifier, it can be considered to be well performing.

Figure 4.4: Receiver Operating Characteristic curve.

All of the results indicate that the logistic regression model is successful and that there is a positive correlation between the two variables of interest, being whether a customer was considered a monthly active user at the end of the chosen time period, and whether a customer made at least one web-API request during that period of time.


Chapter 5

Discussion

This chapter will discuss the results presented in the previous chapter and the method of choice, with its pros and cons. Further, it will reflect upon future work and what could be considered the next steps going from here.

The initial phase of this thesis was the gathering of data, an extensive and very time-consuming process. In my opinion, this process is easily underestimated. It is the foundation of the data analysis and hence something that should be done thoroughly. This, however, can be much harder than one might think at first. It requires a comprehensive knowledge of what data could be of value for the analysis. This is, of course, much easier if one is familiar with the data to begin with. In this case, I was not. This made the process of locating and making decisions about what data to use so much harder. Luckily, employees at the digital rights management company could provide some insight about the data, but due to the lack of documentation, it was still quite hard to find.

One concern regarding the method is the grouping of users. The logistic regression analysis is based on two cohorts: one consisting of the people who are using third-party applications and the other one being the rest of the users. When comparing these two cohorts, some parts of the truth are certainly illuminated, but other parts are left in the dark. An argument for this comparison being untrustworthy is that the third-party application users probably have similar characteristics. They could, for example, be somewhat like the so-called early adopters, meaning individuals eager to start using a product or a technology as soon as it becomes available, whereas the rest is, well, the rest: surely an assembly of users with very different characteristics. Being an early adopter could actually be the very reason for a user to engage with these kinds of third-party applications, and what is known about these early adopters is that they are more engaged and hence also less likely to churn. This does, of course, complicate the process of isolating just the impact of the usage of third-party applications.

To eliminate irrelevant factors, the group containing the rest, i.e. the non-third-party-application users, would have to be refined. The purpose would be to have two almost identical groups, with only one factor making them notably different: whether they have used a third-party application or not. To achieve this, one would have to find the closest matching user in the group of non-third-party-application users for every user in the group of third-party-application users. However, this is not without complications either. This method raises the question of selection bias, where some users are more likely to be selected than others, thus biasing the sample. The data-gathering process will reflect the distortion in the sample and, since the sample is no longer a true representation of the population, biased estimates will be made regardless of the number of samples that would be collected (Bareinboim, Tian, and Pearl, 2014).

Whether the method of choice in this study was the right one is hard to say, again, since every method has its own advantages. However, this method does, without doubt, indicate that there is a positive relationship between the two variables of interest.

Regarding the data analysis, the level of certainty could be increased and the indication made stronger through further research. First of all, such research should explore the factor of time further. The data used in this study are bound to a time period of 60 days, gathered during a specific time of the year. There could, of course, be seasonal effects which are not at all included in this study.

Second, this study views all third-party applications as one factor. This variable can, of course, be divided into several variables. That way, the impact of a specific application could be identified, something that could also contribute to an increase in business value.

Additionally, regarding the limitations of the method, future work could make further attempts to isolate the sole impact of third-party applications.

Combining the advantages of various data analysis methods could possibly provide a result closer to the truth.


Chapter 6

Conclusion

This thesis was conducted in collaboration with a digital rights management company, with the aim to answer the following question about their customers.

Does the usage of one or several third-party applications have an impact on customer retention?

The work originated in the null hypothesis that there is no impact from using one or several third-party applications, in addition to the original application, on customer retention. In other words, that there is no relation between the independent variable usage of third-party applications and the dependent variable customer retention. Furthermore, the hypothesis can be rejected if the analysis concludes that there is, in fact, a relationship between the two variables. The data analysis revealed the impact of a customer using one or several third-party applications during the first two months to be a coefficient of nearly 0.07. It is positive, meaning that the impact is also positive. Therefore, there is most likely a positive correlation between customers using third-party applications and customer retention. The conclusion is consequently that the null hypothesis can be rejected. Although the method of choice was able to provide an answer to the question, this does not mean that it is the most appropriate approach. As discussed in the previous chapter, there are a number of ways to go about this problem. The truth is complex and to unveil it, more time and resources are required. Nonetheless, the results and conclusions provided by this thesis could be an important stepping stone towards initial insights and further research.


References

Agresti, A. and Finlay, B. (1997). Statistical Methods for the Social Sciences. 3rd ed. Prentice-Hall: New Jersey.

Ahn, J. H., Han, S. P., and Lee, Y. S. (2006). Customer churn analysis: Churn determinants and mediation effects of partial defection in the Korean mobile telecommunications service industry. Telecommunications Policy, 30(10), 552-568. http://dx.doi.org/10.1016/j.telpol.2006.09.006

Bareinboim, E., Tian, J., and Pearl, J. (2014). Recovering from selection bias in causal and statistical inference. In Brodley, C., and Stone, P., eds., Proceedings of the Twenty-Eighth National Conference on Artificial Intelligence (AAAI 2014), 2410-2416. Menlo Park, CA: AAAI Press.

Berry, M. and Linoff, G. (2004). Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. 2nd ed. Indianapolis, Ind.: Wiley Pub.

Bertens, P., Guitart, A. and Perianez, A. (2017). Games and big data: A scalable multi-dimensional churn prediction model. 2017 IEEE Conference on Computational Intelligence and Games (CIG).

Brynjolfsson, E., Hitt, L. and Kim, H. (2011). Strength in Numbers: How Does Data-Driven Decisionmaking Affect Firm Performance?. SSRN Electronic Journal.

Burez, J., and Van den Poel, D. (2007). CRM at a pay-TV company: Using analytical models to reduce customer attrition by targeted marketing for subscription services. Expert Systems with Applications, 32(2), 277-288. http://dx.doi.org/10.1016/j.eswa.2005.11.037

Coussement, K., Lessmann, S. and Verstraeten, G. (2017). A comparative analysis of data preparation algorithms for customer churn prediction: A case study in the telecommunication industry. Decision Support Systems, 95, pp. 27-36.

EU GDPR Portal. (2018). EU GDPR Information Portal. [online] Available at: https://www.eugdpr.org/ [Accessed 17 May 2018].

Faris, H., Al-Shboul, B., and Ghatasheh, N. (2014). A genetic programming based framework for churn prediction in telecommunication industry. Computational Collective Intelligence, Technologies and Applications (pp. 353-362). Springer. http://dx.doi.org/10.1007/978-3-319-11289-3_36

Farquad, M., Ravi, V., and Raju, S. B. (2014). Churn prediction using comprehensible support vector machine: An analytical CRM application. Applied Soft Computing, 19(1), 31-40. http://dx.doi.org/10.1016/j.asoc.2014.01.031

Ge, Y., He, S., Xiong, J. and Brown, D. (2017). Customer churn analysis for a software-as-a-service company. 2017 Systems and Information Engineering Design Symposium (SIEDS).

Ghaleb Magatef, S. and Fakhri Tomalieh, E. (2015). The Impact of Customer Loyalty Programs on Customer Retention. International Journal of Business and Social Science, 06(8).

Glady, N., Baesens, B., and Croux, C. (2009). Modeling churn using customer lifetime value. European Journal of Operational Research, 197(1), 402-411. http://dx.doi.org/10.1016/j.ejor.2008.06.027

Graham-Harrison, E. and Cadwalladr, C. (2018). Revealed: 50 million Facebook profiles harvested for Cambridge Analytica in major data breach. [online] The Guardian. Available at: https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election [Accessed 17 May 2018].

Gregoire, M. (2018). How the tech industry can restore trust. [online] World Economic Forum. Available at: https://www.weforum.org/agenda/2018/01/how-the-tech-industry-can-restore-trust/ [Accessed 17 May 2018].

Halaweh, M. and Massry, A. E. (2015). Conceptual Model for Successful Implementation of Big Data in Organizations. Journal of International Technology and Information Management, Vol. 24, Iss. 2, Article 2. Available at: http://scholarworks.lib.csusb.edu/jitim/vol24/iss2/2

Han, J., Kamber, M. and Pei, J. (2011). Data Mining: Concepts and Techniques. 3rd ed. San Francisco, Calif: Morgan Kaufmann.

Hassouna, M., Tarhini, A., Elyas, T. and AbouTrab, M. (2015). Customer Churn in Mobile Markets: A Comparison of Techniques. International Business Research, 8(6).

Hung, S. Y., Yen, D. C., and Wang, H. Y. (2006). Applying data mining to telecom churn management. Expert Systems with Applications, 31(3), 515-524. http://dx.doi.org/10.1016/j.eswa.2005.09.080

Marr, B. (2016). A Short History of Machine Learning – Every Manager Should Read. [online] Forbes.com. Available at: https://www.forbes.com/sites/bernardmarr/2016/02/19/a-short-history-of-machine-learning-every-manager-should-read/#6a15105d15e7 [Accessed 5 Sep. 2018].

McAfee, A. and Brynjolfsson, E. (2012). Big Data: The Management Revolution. [online] Harvard Business Review. Available at: https://hbr.org/2012/10/big-data-the-management-revolution [Accessed 16 May 2018].

Miguéis, V. L., Van den Poel, D., Camanho, A. S., and Falcão e Cunha, J. (2012). Modeling partial customer churn: On the value of first product-category purchase sequences. Expert Systems with Applications, 39(12), 11250-11256. http://dx.doi.org/10.1016/j.eswa.2012.03.073

Nisbet, R., Elder, J. and Miner, G. (2009). Handbook of Statistical Analysis and Data Mining Applications. Amsterdam: Academic Press/Elsevier.

Statistics How To. (2014). Statistical Analysis: Definition, Examples. [online] Available at: http://www.statisticshowto.com/statistical-analysis/ [Accessed 5 Sep. 2018].

Surampudi, S. (2018). Oracle Data Mining Concepts, 18c. [online] Oracle Help Center. Available at: https://docs.oracle.com/en/database/oracle/oracle-database/18/dmcon/index.html [Accessed 25 May 2018].

Weiss, N. (1999). Introductory Statistics. Reading, MA: Addison Wesley.

Zhang, Z., Wang, R., Zheng, W., Lan, S., Liang, D. and Jin, H. (2015). Profit Maximization Analysis Based on Data Mining and the Exponential Retention Model Assumption with Respect to Customer Churn Problems. 2015 IEEE International Conference on Data Mining Workshop (ICDMW).


Appendix A

Table 6.1: Table of variables shown in Figure 4.1, with some changes due to confidentiality reasons.

1 The total number of active days, 30 days after registration
2 The total number of hours of consumption, 30 days after registration
3 Have been active at least once a month, 60 days after registration
4 Have been active at least once a week, 30 days after registration
5 Product on day 1: type 1
6 Product on day 1: type 2
7 Product on day 1: type 3
8 Product on day 1: type 4
9 Product on day 1: type 5
10 Product on day 1: type 6
11 Product on day 1: type 7
12 Product on day 1: type 8
13 Product on day 30: type 1
14 Product on day 30: type 2
15 Product on day 30: type 3
16 Product on day 30: type 4
17 Product on day 30: type 5
18 Product on day 30: type 6
19 Product on day 30: type 7
20 Product on day 30: type 8
21 Product on day 60: type 1
22 Product on day 60: type 2
23 Product on day 60: type 3
24 Product on day 60: type 4
25 Product on day 60: type 5
26 Product on day 60: type 6
27 Product on day 60: type 7
28 Product on day 60: type 8
29 Have made a web-API request at least once, 60 days after registration


Figure 6.1: Logistic regression results, part 1 out of 3.


Figure 6.2: Logistic regression results, part 2 out of 3.


Figure 6.3: Logistic regression results, part 3 out of 3.


TRITA EECS-EX-2018:555
