Customer Lifetime Value Prediction Using Statistical Modeling

(1)

IN

DEGREE PROJECT TECHNOLOGY AND ECONOMICS, SECOND CYCLE, 30 CREDITS

STOCKHOLM SWEDEN 2018,

Customer Lifetime Value

Prediction Using Statistical

Modeling

Predicting Online Payments in an Industry Setting

ALINA SUSOYKINA

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF INDUSTRIAL ENGINEERING AND MANAGEMENT

(2)

(3)

Customer Lifetime Value Prediction

Using Statistical Modeling

Predicting Online Payments in an Industry Setting

by

Alina Susoykina

Master of Science Thesis TRITA-ITM-EX 2018:581 KTH Industrial Engineering and Management

Industrial Management SE-100 44 STOCKHOLM

(4)

Kundens Livstidsvärde Förutsägelse med Statistisk

Modellering

Predikera Online Betalningar i en Industriell Miljö

Alina Susoykina

Examensarbete TRITA-ITM-EX 2018:581 KTH Industriell teknik och management

Industriell ekonomi och organisation SE-100 44 STOCKHOLM

(5)

Master of Science Thesis TRITA-ITM-EX 2018:581

Customer Lifetime Value Prediction Using Statistical Modeling

Predicting Online Payments in an Industry Setting

Alina Susoykina

Approved

2018-06-04

Examiner

Hans Lööf

Supervisor

Ulrika Stavlöt

Commissioner Contact person

Abstract

Customer lifetime value (CLV) provides a measure of customer revenue contribution. It allows to justify marketing campaigns, overall budgeting, and strategy planning. CLV is an estimated cash flow attributed to the entire relationship with customers in the future.

Ability to utilize information gained from CLV analysis at the most efficient way, provides a strong competitive advantage.

The concept of CLV was studied and modeled with application to the online payments industry which is relatively new and at its growing phase. Ability to predict CLV

accurately conveys a great value for guiding the industry (i.e. emerging companies) to maturity. CLV analysis in this case becomes complex due to the fact that usually the databases of such companies are huge and include transactions from different industries: e-commerce, financial services, travel, gaming etc.

This paper aims to define an appropriate model for CLV prediction in the online payments setting. The proposed model segments customers first in order to improve performance of the predictive model. Then Pareto/NBD model was applied to predict CLV at the customer-level for each customer segment separately. Although the results show that it is possible to predict CLV at some extent, the model needs to be further improved and possible pitfalls need to be scrutinized. Discussion on these issues is provided in the following sections.

Key-words: customer lifetime value, statistical modeling, machine learning, customer relationship management, Pareto/NBD

(6)

Examensarbete TRITA-ITM-EX 2018:581

Kundens Livstidsvärde Förutsägelse med

Statistisk Modellering

Predikera Online Betalningar i en Industriell Miljö

Alina Susoykina

Godkänt

2018-06-04

Examinator

Hans Lööf

Handledare

Ulrika Stavlöt

Uppdragsgivare Kontaktperson

Sammanfattning

Kundens livstidsvärde (Customer lifetime value) är ett mått på hur en kund bidrar till företagets omsättning. Det tillåter att åskådliggöra försäljningskampanjer, företagets budget och företagets strategi. Kundens livstidsvärde är en estimering av

betalningsflöde som ett företag kan tjäna av kunder i framtiden. Möjligheten att nyttiggöra informationen från kundens livstidsvärde analys ger företag en starkt konkurrenskraftig fördel.

Kundens livstidsvärde var studerat och modellerat med anknytning till online

betalningstjänster industri, vilken har utvecklats kraftigt inom senaste åren. Möjligheten att predikera kundens livstidsvärde med hög noggranhet medför ett starkt värde för företag som erbjuder tjänster inom online betalningar och kan driva dessa till mognad.

Att predikera kundens livstidsvärde inom denna bransch anses vara en komplex process, då databaser hos såna föratag är stora och inkluderar information om

transaktioner från olika industrier såsom: elektronisk handel, finansiella tjänster, rese- och spelbolag.

I denna studie definieras en modell för att kunna predikera kundens livstidsvärde baserat på data från ett företag som tillhandahåller online betalningstjänster. För att uppnå bättre prestanda, segmenterar den föreslagna modellen kunder först. Därefter en Pareto/NBD modell används, för att predikera kundens livstidsvärde för varje

kundsegment. Trots att resultat visar att kundens livtidsvärde kan modelleras till en viss nivå, modellen behöver förbättras och möjliga blindskär måste granskas.

Nyckelord: kundens livstidsvärde, statistisk modellering, maskininlärning, hantering av kundrelationer, Pareto/NBD

(7)

Chapter 1 Introduction

There is a crucial aspect of any marketing policy to comprehend. The whole marketing strategy is built upon this aspect, namely the justification of investment budgets distributed among different marketing activities. This justification requires a measure which reflects a customer value entailed in the future relative to company performance [32]. Customer lifetime value (CLV) provides such a measure. CLV evaluates financial value of the relationship between a customer and a firm which is measured by present value of the predicted future cash flows from such a relationship [18].

Being aware of CLV, a firm is able to effectively decide on distribution of resources between different consumer segments, compare alternative investment opportunities (considering consumer base as an intangible asset [1]), and as a result adjust it’s marketing strategy. Continuous adjustment to dynamic customer behavior is called customer relationship management (CRM) whose goal is to establish profitable long-term relationship with customers [12]. In order to achieve the goal, CRM conducts a research on customer needs and preferences that provides insights for better customer targeting and potential product developments in favor of higher customer satisfaction. By identifying customer segments, CLV and specific needs of each segment, a company is able to construct efficient investment portfolio, prioritize marketing activities and improve CRM implementation [26]. Improved relationship with customers by the nature of the cause leads to greater customer retention and loyalty that implies competitive advantage acquisition [7, 38, 5].

Due to the contemporary state of technology, companies are able to store a huge amount of data. They often manage large transaction databases with millions of observations. This data convey a great important value constitut- ing knowledge about customers. Data mining techniques allow to extract this

1

(10)

2 CHAPTER 1. INTRODUCTION

value. Data mining tools serve as the backbone driving CRM systems to enable customer analysis which facilitate planing of effective marketing strategy [41].

These techniques have gained an increased attention from companies which seek to manage relationship with customers as efficient as possible. According to M. J. A. Berry and G. Linoff (2004) [4], data mining is a collection of techniques focused on discovering existing patterns in large data sets that includes methods of machine learning and statistics.

CLV is a complex measure that involves studying of consumer behavior in the form of customer lifetime, retention rate, churn rate, prediction of the future positive and negative cash flows, etc [13]. The most uncertain aspect of this evaluation, - prediction, can be effectively estimated by means of data mining and statistical learning. This paper illustrates the application of these methods in a case study of the online payments industry. First, customer segmentation by means of unsupervised machine learning, namely clustering [6, 36, 11], is performed. This allows to divide customers into groups according to trending behavioral patterns or other common characteristics. In this paper specifically, customers were segmented by similar purchasing behavior. Then, CLV for each customer segment is computed by means of a probabilistic model (Pareto/NBD [17]). Final step is to estimate whether the probabilistic model performs better either on each segment separately or on the whole customer population as it is given, make conclusions.

1.1 Problem Formulation and Motivation

The project is focused on the online payment industry that is already on a very high demand today, and continues to grow at a very high pace. According to World Payments Report 2017 [9], non-cash transaction volumes grew by 11.2% during 2014–2015, reaching 433.1 billion that is the highest growth during the past decade. It is also estimated that non-cash transaction volumes will continue to grow at a compound annual growth rate (CAGR) of 10.9% during 2015-2020. Developing countries such as China, India, Africa, Mexico, and Brazil are estimated to significantly contribute to this growth by a stable CAGR of 19.6% during this period and expected to become "power houses" [9, 35] of the future global non-cash transactions growth. On the other hand, mature markets (including Europe) are expected to grow much slower, by a modest 5.6%. Accordingly, most Nordic countries have taken a leading position in going cashless, with Sweden topping the list [9]. Online and mobile payments composed about 31.2% of the total cards transactions volume in 2015 and their

(11)

CHAPTER 1. INTRODUCTION 3

share is estimated to increase to 45% in 2019.

To summarize, the online payment industry is a relatively new financial service with a great potential for further growth. The industry requires proper investigation and calibrated prediction models for consumer behavior in purpose to maximize a firm’s net profit. A lack of consumer segmentation is inherent to the industry [10] leading to indifferent concern for different groups of customers. Aiming at improvement of a firm’s customer retention strategy, CLV must be evaluated and the involved future cash flows must be accurately predicted. For this purpose, relevant variables which describe CLV and appropriate statistical models should be selected, while precision evaluation should be systematized.

To the best of our knowledge, the research on CLV prediction in the online payments industry has never been done before. However, CLV estimation is often of a great interest to businesses. There are lots of studies on CLV estimation for different industries, where the most frequent are banking [13, 38, 5, 14] and e-commerce [8, 44]. Although the mechanisms behind these industries resemble at some extent those of online payment industry, there are, obviously, some significant differences. The most prominent attribute is that this industry serves transactions coming from (almost) all other industries, hence it encom- passes most of their specific patterns. Therefore, research described in this paper is relevant and valuable.

1.1.1 Purpose of the research

Following the problem and motivation described above, the purpose of this study is formulated. The study aims at investigating possible methods for estimation of potential revenue (CLV) generated by a certain group of active customers in the online payments industry in the future. To perform this estimation, the probabilistic model (Pareto/NBD [17]) has been applied to the case study in the industry. Customer segmentation by means of unsupervised machine learning was also performed in order to (1) show an efficient tool for strategy planning and (2) check whether customer segmentation improves prediction performance.

1.1.2 Hypotheses

Taking into consideration the problem described above, the following research hypotheses are formulated:

Hypothesis 1: It is impossible to accurately (at some extent) predict CLV due to totally

(12)

4 CHAPTER 1. INTRODUCTION

random behavior of customers in the online payment industry.

In order to reject or accept the hypothesis, it is essential to be able to track customer activities, extract inherent behavioral patterns, and measure an im- pact coming from these customers in the future. All of these essential parts are aimed to be employed throughout the paper. The hypothesis is going to be assessed at the final stage of the research project by evaluation of prediction performance. R-squared is taken as a measure of prediction ability of the applied models.

Hypothesis 2: Clustering executed beforehand improves performance of the probabilistic CLV models because customers with more or less similar purchasing behavior fall into the same segment reducing intra-group variance.

For the above hypotheses testing, comparisons of different method combinations need to be performed. These combinations are:

1. the probabilistic CLV models computed only;

2. clustering performed first, then the probabilistic CLV models computed based on each cluster (consumer segment) separately.

Comparison of these items enables to compare accuracy of the first and the second method combinations.

1.2 Research Questions

Throughout the paper, the following research questions are aimed to be an- swered to solve the stated problem and test the research hypotheses:

1. What is the current state of customer segmentation and CLV prediction in the e-payment industry today?

2. What are the components of a customer lifetime value modeling and customer segmentation, and how well could they be measured in the context of online payments?

3. Which statistical models are able to incorporate CLV components and accurately predict the future cash flows coming from customers? What is an appropriate model for customer segmentation?

4. Whether CLV model might be improved by preliminary customer segmentation? How the customer base might be segmented?

(13)

CHAPTER 1. INTRODUCTION 5

1.3 Limitations

There is one significant limitation in the scope of this study. Electronic payment systems (EPS) simply serve online transactions (purchases, transfers, etc) which occur in result of trade. The first and main incentive of customers to purchase is a desire of goods and services utilization while the choice of payment method comes at the final stage of an order [10]. EPS itself do provide a product in the form of a service but it’s not a driving force of the buyer-seller interaction. In other words, a transaction is motivated by trade initially. Only with this motivation, a customer faces a choice of payment method. Then the quality of a product provided by EPS influences it’s retention of these motivated customers.

Accordingly, this paper illustrates CLV computing for those merchants whose transactions appear in the database. With knowledge of revenue (CLV) generated by each merchant, a company is able to calculate approximate predicted profit coming from this revenue that is very important for decision making.

CLV calculation in this case does not reflect consumer preferences regarding EPS directly. One can only infer churners [8] of EPS product tracing frequency of transactions per each end user. It is unclear which product usage threshold to use for churn identification in the setting of EPS which serve a number of different verticals (travel, gaming, financial services, e-commerce, etc) where each vertical implies different purchasing behavior. One can compare, for example, the number of transactions and volumes of e-commerce and financial services verticals.

1.4 Structure of the Paper

The structure of the paper is organized as follows: in Chapter 2, scientific literature is reviewed and relevant theory for the scope of this paper is framed.

In Chapter 3, data exploration and data preprocessing are described; applied methods and their mathematical description are covered. In Chapter 4, the outcomes of the conducted case study are defined. In Chapter 5, the presented research is discussed following several aspects. Conclusions are provided in Chapter 6 with a review of the research, its findings and contributions.

(14)

Chapter 2 Theoretical Framework

This chapter provides relevant theory to motivate the choice of methods and concepts, with respect to industry specific features, limitations, the data set, and the proposed structure of the study.

2.1 Customer Lifetime Value

In recent years, marketing strategy and performance enhancement are increas- ingly organized around sustaining of the relationship with customers rather than around product design. In particular, "the importance of a customer is no longer judged by his or her single transaction with the firm, but rather by the series of transactions or potential transactions with the firm" [18]. The way of evaluation of the customer base shifted from transaction focus to the focus on ongoing relationships with customers. It is recognized, for example, that customers with longer transaction history with a bank have the higher probability of retention and thus have a longer life cycle compare to relatively newly acquired customers [5].

Therefore, the key to reasonable marketing strategy is to recognize customer segments according to their potential value of a relationship. The customer lifetime value (CLV) provides a measure of such potential that, therefore, can be defined as the discounted future income generated by the relationship with customers [13]. In other words, "the customer lifetime value is generally defined as the total net income that a company can expect from a customer" [18]

over the entire time of existence of any relationships with the company.

6

(15)

CHAPTER 2. THEORETICAL FRAMEWORK 7

2.1.1 Customer Lifetime Value Definition

This section provides an overview of different definitions of CLV described in the scientific literature. The overall idea behind these definitions is that CLV is considered as a tool which enables to appreciate the value of a customer in time, paying attention to frequency, duration, and the monetary value of the future transactions.

Consider different purchasing behavioral patterns depicted in Figure 2.1 [19].

This figure illustrates that customer behavior changes in time, leading to vari- ous purchasing patterns. Compare, for example, the first and the second customer from the top. The first customer have made purchases very frequently in the beginning of the considered period. However she is not longer active during the rest of the period. In contrast, the second customer in the list purchases rarely through the whole period. Therefore, the first customer is most likely churned, while the second customer is expected to continue to be active, though with lower purchase rate. The question that arises here is "Whose customer value is higher?".

Figure 2.1: Illustration of different purchasing behavior.

First of all, it is important to define the meaning of the word "value". The value of a customer is defined by the term "cash flow". The cash flow of a relationship with a customer is the sum of cash inflows and outflows induced by a certain customer or a customer segment [33]. The inflows associated with a customer could be the payment amount of executed transactions, while the outflows could be, for example, in the form of marketing costs directed to a customer retention.

(16)

8 CHAPTER 2. THEORETICAL FRAMEWORK

Therefore, the customer lifetime value is generally determined as the present value of all future cash flows generated by a customer over her entire life cycle associated with the firm [18]. While this notion is adopted from traditional marketing literature, it reflects the discounted cash flow approach well-known in finance [18]. In finance, the concept of present value (PV) points at the present worth of expected cash flows generated by an investment alternative in the future. "Present worth" means that the cash flows are discounted at the certain discount rate that enables investors to compare future cash flow streams generated by different alternatives [3]. An investor always has at least one investment alternative which is to deposit money at a bank and earn predefined interest rates for this deposit. Unlike other investment opportunities, a bank deposit is a relatively safe asset with predefined returns. This implies interest-earning potential in time, an attribute recognized as the time value of money:

PV = CF

(1 + r)ⁿ, (2.1)

where CF is expected cash flow in the future; r is the periodical rate of return (interest) or the discount rate; n is the number of periods. Therefore, CLV definition has to be compatible with the fundamental principles of finance, such as the present value.

The most fundamental model of CLV for a single customer is a function of expected positive (revenues) and negative (costs) cash flows in the future [29]:

CLV =

n

X

i=1

Vi− Ci

(1 + r)ⁱ, (2.2)

where i is a period when cash flows from a customer’s transactions occurred;

V_i is revenue from a customer in a period i; Ci is the total costs incurred by revenue Vi generation in a period i; n is the total number of periods expected during the lifetime of a customer.

Based on this fundamental model, CLV prediction using regression models is simply built upon cash flows associated with customers [18]. These predictive models train on changing purchasing behavior of customers in the past (based on the transaction history), and then compute the future worth of customers (CLV) given the past transactions. Hence, there are several potential drawbacks of these models. First, the mentioned predictive models are not ca- pable to calculate potential CLV for the customers whose transaction history is not available. For example, a customer has made the first transaction at the company just recently, implying the lack of information for training of predictive models. Second, these predictive models do not consider the probability

(17)

of churn [20]. While a customer has actually defected, the regression models can predict further purchases, leading to improper marketing strategies. Thus, churn detection plays a significant role in the CLV prediction.

Addressing the above mentioned drawbacks, the improved version of the CLV model is described in the literature [18, 20]:

CLV =

T^∗

X

t=0

(v_t− c_t)p_t

(1 + r)^t − AC, (2.3)

where vtis the payment amount of a customer at time t; ctare direct costs associated with a transaction at time t; AC are the costs of customer acquisition at time t0, vt− c_tis a customer margin contributed at time t; T^∗is time horizon for CLV estimation; ptis the probability that a customer not churned, i.e. will continue to purchase in the future. Accordingly, AC, acquisition costs are defined as the total marketing costs for a period divided by the number of customers acquired for that period. Note that the presence of AC in the model is motivated by the customers that "yet to be acquired" [18]. Hence, if the database consists of existing customers only, the term AC should be omitted.

Figure 2.2: Conceptual Framework for CLV Modeling.

The total sum of positive and negative values associated with all customer relationships can be recognized as an intangible asset to the firm that should be managed accordingly. Hence, the sum of the current and potential lifetime

(18)

values of all customers of the company is often referred as a customer equity (CE) [18]:

CE =

n

X

i=1

CLVi, (2.4)

where i is a single unique customer, CLVi - is the total CLV per customer i. CE eventually reflects a firm value or its stock price [20].

To summarize, the conceptual framework of CLV modeling is structured in Figure 2.2 [20]. In order to properly define the customer lifetime value and customer equity of a firm in total, it is important to incorporate in the analysis three main branches of marketing activities: customer acquisition, customer retention, and customer expansion (cross-selling). These activities in turn influence customer behavior and their implied CLV. Ignoring one of these branches may lead to significant bias in the modeling results which might lead to inefficient decision making.

2.1.2 CLV Utilization

CLV analysis equips a company for a range of important decision making areas. In purpose to rank these areas, Ekinci at al. [14] conducted in-depth inter- views with experts from the banking sector. The suggested survey consisted of different decision areas which were found in the scientific literature. Experts were asked to rank the areas by importance and also to propose their own aims of the analysis which are, in their opinion, important and not listed in the survey. The first four resulted purposes follow below, starting from the most appreciated option:

1. Segmentation and targeting, selection of prioritized customers;

2. Developing strategies for marketing mix;

3. Determining when to scale down or cease attention towards a group of customers;

4. Marketing budget allocation/optimal promotion determination maximizing CLV.

Consumer segmentation is the essential part of marketing because individual preferences vary widely and require different marketing mixes. Consumer segmentation is defined as "viewing a heterogeneous market as a number of smaller homogeneous markets, in response to differing preferences, attributable to the desires of consumers for more precise satisfaction of their wants" [40].

(19)

Thus, the main goal of segmentation is to develop business insights for a company that allow for diversification of product design and marketing activities, meeting the needs of different consumer segments [22]. This diversification creates a crucial competitive advantage.

Traditionally, companies segment their customers by four main characteristics:

demographic, geographic, psychographic, and previous purchasing behavior.

The two latter criteria are acknowledged to be more sensible for illuminating of consumer preferences variety [22]. CLV has been found to be another effective criteria for consumer segmentation as CLV itself reflects purchasing patterns of customers [13].

Once a customer base is segmented, a company needs to allocate its marketing resources across customer groups. Data mining approaches to marketing budget allocation attracted increased attention in the researching literature. A lot of studies are focused on developing of quantitative models for optimal budget allocation and showed remarkable results. For example, Labbi and Berrospi [27] demonstrated a three-step methodology for CLV calculation: customer segmentation, customer dynamics, and portfolio optimization. The authors indicated behavioral dynamics and estimated CLV by means of Markov decision-process models and Monte Carlo simulations. Labbi and Berrospi [27] applied this framework to an airline company, motivating movement from mileage-based management to value-based management of its frequent flying customers. Consequently, marketing costs declined by approximately 20 per cent and customer responsiveness to marketing actions was improved by 10 per cent.

To conclude, CLV is a very important tool which enables companies to effectively manage its marketing as well as business strategy. Being able to properly understand customer preferences and segment, companies are better equipped for decision-making processes focusing on the actual value of customers.

2.2 Recency, Frequency, Monetary Value Frame-

work

RFM framework originates from direct marketing [11]. Considering past purchases of customers, these models allow to choose optimized marketing pro- grams (e.g. direct mail) specified for distinguished groups of customers in purpose to increase their response propensity.

RFM models divide customers into groups, depending on three customer metrics:

(20)

• Recency: time between first and last transaction;

• Frequency: the number of purchases beyond the initial one, i.e. frequency = total number of transactions - 1;

• Monetary Value: the average payment amount of all transactions in the training period per customer. Note that in certain cases other metrics (e.g. median) are more reliable than mean.

Therefore, customers who started to purchase long time ago, purchase frequently, and spend large amounts of money are the most valuable to a company. A popular extension of RFM framework is to perform the whole analysis per product category that is evidently the same as analysis per each vertical (gaming, travel, e-commerce, etc).

Studies strongly suggest that the transaction history of customers is a better predictor of their future behavior than geographic or demographic characteristics mentioned in the previous section [20]. Studies also show that customer response propensity varies the most by their recency, followed by their purchase frequency and monetary value [20].

Fader, Hardie, and Lee [16] show that RFM variables which describe past patterns are important at describing the future patterns of purchasing behavior.

They took these variables as inputs and showed that they are sufficient statistics for the CLV modeling. The significant advantage of this type of modeling is that it uses nothing more than RFM variables which are easy to extract from the transactional databases. Fader, Hardie, and Lee [16] also showed in their study that the iso-CLV curves produce the same CLV for a customer with different values of RFM variables.

2.3 CLV Modeling

2.3.1 Pareto/NBD Model for Transaction Count Prediction

Due to a huge difference between business natures, CLV analysis is divided on two types of modeling: stochastic and deterministic models [13]. The latter type of modeling describes consumer behavior in the contractual setting that is relatively easier to handle. For example, in the subscription based companies, the expected time of payments is certain at some extent. For the existing customers in the customer base, there are two alternatives: a customer either recharges her account on the certain date or churns. In contrast, in the noncontractual setting which is described by stochastic models, transactions are

(21)

expected to happen at any point of time, i.e. randomly. Hence, in the noncontractual setting, it is necessary to compute probabilities on whether a customer will continue to be active and how she will behave [17]. One of the most popular stochastic models described and widely used in the scientific literature on the CLV studies is so called Pareto/NBD model.

The Pareto/NBD model was developed and published by Schmittlein, Morri- son, and Colombo [39] (1987), hereafter SMC. The model extracts and describes repeat-buying behavior of customers in a noncontractual setting. The indata to the model is RFM framework: frequency, recency, and T values which are computed for the selected observed period (0, T ]. The period starts on the date of a customer’s first purchase and ends at the end of the stated period. Let tibe the time of the last repeat purchase of a customer within an observed period, than for a customer i, it holds that (0 ≤ ti ≤ T ). The model trains on the data (transactional history) during training period (0, T ], and then provides predictions on the number of transactions during the subsequent validation period (T, T + T^∗]. Having transactional history during the validation period, we are able to evaluate the performance of the model, comparing predicted values to observed values. Than, if we can conclude that our predictions are accurate enough, we are able to provide predictions by means of the model for some unseen future period. All these periods are illustrated in Figure 2.3. Usually, the length of validation period is chosen as 1/3 (or at most 1/2) of the training period length. The length of forecast period depends on the case.

Figure 2.3: Training, validation, and forecast periods for Pareto/NBD estimation.

(22)

Before proceeding to the implications of the model, it is worth to consider assumed framework first.

2.3.2 Assumptions and Mathematical Formulation of Pareto/

NBD Model

The model relies on the number of underlying assumptions:

1. All customers follow the "life circle" of relationship with a company, which consists of two stages. During the first stage, customers are "alive" (i.e.

active) performing transactions at some frequency rate. After some time, they inevitably transfer to the next stage, where they churn or become

"dead" according to marketing literature. At this last stage, customers stop doing transactions permanently.

2. While a customer is active, her number of transactions follows a Poisson process with transaction rate λ. Hence, it turns out that the probability of observing a certain number of transactions, x, within the time interval (0, t]is given by

P (X(t) = x|λ) = (λt)^xe^−λt

x! , x = 0, 1, 2, ... (2.5) The expression above shows that the time between transactions has ex- ponential distribution with transaction frequency rate λ,

f (t_j − t_j−1|λ) = λe^−λ(t^j^−t^j−1⁾, t_j > t_j−1 > 0, (2.6) where tj is the time when the jth transaction has been done (time stamp).

3. Length of a customer’s lifetime, τ , is unobserved. In order to estimate this latent variable, we assume that it is exponentially distributed with dropout rate µ:

f (τ |µ) = µe^−µτ. (2.7)

4. All customers do not necessarily exhibit the same frequency transaction rates. Hence we assume that the transaction rate λ varies independently between customers. We assume that it follows a gamma distribution with shape and scale parameters, r and α respectively:

g(λ|r, α) = α^rλ^r−1e^−λα

Γ(r) . (2.8)

(23)

5. In conjunction with the previous assumption, customers are heterogeneous at their dropout (or "dying") rates. In other words, the dropout rate of a customer is not dependent on any characteristics or actions of other customers. It is also described by a gamma distribution with shape and scale parameters, s and β respectively:

g(µ|s, β) = β^sµ^s−1e^−µβ

Γ(s) . (2.9)

The second and the fourth assumptions taken together provide us with the NBD model which describes the distribution of the number of transactions during the active stage of customer’s lifetimes:

P (X(t) = x|r, α) = Z ∞

0

P (X(t) = x|λ)g(λ|r, α)dλ

= Γ(r + x) Γ(r)x!

α α + t

r t α + t

x

.

(2.10)

At the same time, the union of the third and the fifth assumptions provides Pareto distribution that is probability distribution on whether a customer is still alive:

f (τ |s, β) = Z ∞

0

f (τ |µ)g(µ|s, β)dµ

= s β

β β + τ

s+1

,

(2.11)

and cumulative distribution function:

F (τ |s, β) = Z ∞

0

F (τ |µ)g(µ|s, β)dµ

= 1 −

β

β + τ

s

.

(2.12)

Integration of the two sub-models (NBD and Pareto) reasonably constructs la- bel for the whole model, namely "Pareto/NBD model".

2.3.3 Implications of Pareto/NBD Model

In their paper [39] (SMC), the authors derive and show implications of the model [2]:

(i) the probability that a customer is still alive at the time T, conditioned on her time of the last transaction (ti) in the period, her total number of repeat transactions (xi), and her individually estimated parameters µ and λ(equation 1.A1 in SMC);

(24)

(ii) the probability that a randomly selected customer i is still alive at the time T, conditioned on her time of the last transaction (ti) in the period, her total number of repeat transactions (xi), and aggregated estimation of parameters s, β, r, and α (equation 1.A3 in SMC);

(iii) the mean and the variance of the number of repeat transactions of a randomly selected customer i within the period (0, T ] (equations 17 and 19 respectively in SMC);

(iv) the expected value of the number of future transactions x^∗_i during the subsequent time (validation) period (T, T + T^∗], conditioned on ti, x_i, T and the parameter values E(x^∗_i|s, β, r, α, x_i, t_i, T, T^∗) (equation 22 in SMC).

Hence, it turns out that in order to estimate the number of repeat transactions, we need to estimate aggregate-level parameters s, β, r, and α once. For a customer i with characteristic vector (xi, t_i, T ), the individual-level likelihood function takes the form [16]:

L(µ, λ|xi, ti, T ) = λ^xⁱe^−(µ+λ)T + λ^xⁱµ Z

t^T_i e^−(µ+λ)τdτ (2.13)

2.3.4 Gamma-Gamma Model for Monetary Value Prediction

In the gamma-gamma model, the observed average payment amount mx, cal- culated in the training period, is an imperfect metric of the latent mean transaction value E(M ). This metric is computed per customer level. The main assumption of the gamma-gamma model is that the average payment amount value per customer is distributed according to a gamma distribution of shape pand scale ν [15]:

p(m_x|p, ν, x) = (νx)^pxm^px−1_x e^−νxm^x

Γ(px) , (2.14)

where x is the total number of transactions, mxis the average payment amount, pis the shape of the gamma distribution (the model assumes that this parameter is the same for all customers), ν is the scale parameter (ν varies across customers and has a prior that is also gamma distributed with parameters (q, γ).

The expectation value of the gamma distribution, which is the mean purchase value is p/ν in this case [15].

(25)

2.4 Customer Segmentation Techniques

This section outlines popular customer segmentation techniques and explains why data mining methods are more favorable. The intention behind segmentation is to distinguish common purchasing patterns or trends prevailing among customers.

2.4.1 Popular Segmentation Techniques in Marketing

There are three segmentation techniques that are the most popular in business and in the direct marketing literature: RFM analysis, decision trees, and logistic regression [28].

RFM analysis is based on the three variables described earlier in this paper. The purpose of this analysis is to identify segments of customers according to their probability of responding to marketing activities. Marketing literature provides different versions of the analytical model. In the approach proposed by Hughes [23], values of the RFM variables take one of the discrete codes from the set 1, 2, 3. Each discrete code represents a single category for every RFM variable, where values of the variables do not overlap. Thus, every customer is assigned with one category. For example, for recency variable, customers are sorted by their value, divided into three equal customer groups, and then assigned to one of the three categories [11]. The model continues to divide customers within created groups by frequency leading to 9 (3x3) number of groups, and then divides further by monetary value ending up with 27 (3x3x3) number of groups. These groups are then used to rank customers according to expected marketing responsiveness. These ranks allow to formulate guide- lines on marketing activities, highlighting target groups. Despite simplicity, the model has several disadvantages:

1. it is not suited to cover other explanatory variables;

2. the number of categories is arbitrary, while an accurate number should be defined;

3. it ends up with a large number of groups (27 in the described example) that is sometimes hard to interpret. Moreover the difference between the closest groups might be insignificant.

Decision trees is the second segmentation technique considered in this section.

Tree-based methods are very popular due to its simplicity and interpretabil- ity [24]. In the customer segmentation setting, a decision tree is constructed by "subsequently splitting the entire group of heterogeneous customers into

(26)

smaller and more homogeneous subsets of customers" [11]. The splitting oc- curs in the top-down manner, where the top of the tree consists of all customers considered. Each node splits into two subsequent by one of the explanatory variable. In this way, the algorithm of this model identifies the best possible splitting decision at each node until a certain criteria is met that denotes that the tree has been fully grown [24]. The popular version of decision trees that is used in the literature for customer segmentation purposes is the chi-square automatic interaction detection (CHAID) decision tree [28]. According to this version, tree growing processes are based on the chi-square statistic criteria. It means that CHAID algorithm split the data based on independent variables into groups with approximately the same number of observations. In this way all possible cross tabulations for each independent variable are created. Mul- tiple pair-wise chi-square independence tests are conducted on each pair of tabulations to control for type I error.

CHAID decision tree is extensively used to construct rules which describe responsive and non-responsive customers. Hence, in order to implement the model, responsive and non-responsive customers should be defined and la- beled according to past marketing performance. The fact that we need such labels is a drawback of this method of customer segmentation because it is often simply unavailable (due to technical aspects, special features of the product, or absence of marketing campaigns in the past). Significant advantage of CHAID is that it can incorporate other than RFM variables such as demographic, location, and others.

Logisitic regression is a prominent method for a binary variable prediction [11].

Hence, in the context of marketing, logistic regression is able to predict responding and not responding customers to marketing campaigns. According to this technique, the probabilities that each customer belongs to one of the categories are estimated by fitting and maximizing the likelihood function. In statistical terms, the model estimates conditional distribution of the dependent variable given explanatory variables [24]. The training set (or input) for this model is the transactional history where some customer characteristics are reflected. The advantage of this model is that it is relatively (compare to RFM analysis) flexible for explanatory variables inclusion. Moreover, logistic regression is a popular segmentation technique in the marketing literature because of the three main reasons [11]:

1. The concept of this technique is simple.

2. It provides the probabilities that the response belongs to one of the groups rather than discrete groups assignment as in RFM and decision trees.

(27)

Moreover, it is allowed to set own thresholds of probabilities according to which a customer is assigned with a group. In the binary prediction, this threshold is 50% by default, meaning that if the posterior probability is higher than 50%, then a customer is assigned to a respective group.

3. The technique is not computationally expansive, while the performance is relatively robust.

While decision trees and logistic regression overcome some disadvantages of RFM analysis, these models require predefined labels of response variable for training. In the marketing analysis, it is not obvious which response variable to use in order to segment customers. One alternative is to use data on the results of marketing campaigns where labels such as responsive and non-responsive are provided. Therefore, some previous experience of marketing activities implementation is required for a company to fit the models. Reasonably, in case of newly founded companies or startups, this data do not exist. Resource allocation rules become extremely useful in this case, allowing to avoid blind planning. Hence, these companies need to utilize some other models of unsupervised learning.

2.4.2 Clustering for Customer Segmentation

Clustering is a method of unsupervised machine learning which implies that a model is built without any response (independent) variable [6]. Hence, in case of unsupervised learning, we do not try to explain a response variable by some other predictive variables as in case of, for example, fitting linear regression. Instead, we are searching for underlying patters in the vectors of measurements of these predictive variables itself. Such problem is referred to as unsupervised since there is a lack of a response variable that can navigate (or supervise) our analysis [21]. An ability to perform such analysis becomes very powerful in the setting of limited data (for example startups with no marketing campaigns history recorded).

Cluster analysis seeks to define possible relationships between the variables or between the observations. The goal of this analysis is to establish relatively distinct groups of observations, where observations in a single group share common characteristics or behavior [6]. For instance, in a market segmentation setting, we often have an access to some customer characteristics (variables), such as age, gender, zip code, family income etc. Intuitively we might believe that we can divide all customers on the groups according to their purchasing propensity, say big spenders versus low spenders. Having the information

(28)

(labels in database) about each customer’s behavior, we are able to perform supervised analysis. However, in reality we do not know whether each potential customer is a big spender or not [21]. In this case, we can try to cluster our customers based on above mentioned observed variables, obtaining different groups which probably reflect spending habits of customers.

Clustering methods can be divided on the two categories: partitioning and hierarchical. Hierarchical clustering recursively divide observations into clusters in either top-down or bottom-up way [24]. In the top-down approach (which works in an opposite direction as bottom-up approach), all data points belong to one cluster at first, and then the algorithm recursively splits each cluster into the smaller ones, minimizing cost function (Figure 2.4). Minimization of cost function (linkage criterion) provides decisions for splitting (top-down way) or merging (bottom-up way) a pair of observations. Linkage criterion considers the distance, or in other words similarity, between a pair of observations. The algorithm proceeds in such a way which includes the most similar observations in the same group or cluster, maximizing inter-cluster dissimilarity [24].

In contrast to hierarchical clustering, partitioning methods do not produce many splits that allow to construct a dendogram. Partitioning methods distinguish clusters only once. The most commonly used clustering algorithm, according to literature, is K-Means algorithm [21]. The popularity of this algorithm can be justified by its simplicity, simple implementation, and relative efficiency.

Figure 2.4: Hierarchical clustering represented by dendogram. Each data point placed in the bottom of dendogram (a, b, c, d, e) constitutes a single cluster, while all these points start in one cluster at the top. In this example, a top- down approach is shown. Applying a cut after the lowest row in the picture would yield 3 clusters: {a}, {b,c}, and {d,e}.

(29)

2.4.3 K-Means Algorithm

The K-means algorithm provides a simple way of partitioning a data set into K distinct, non-overlapping clusters [24]. To perform this type of clustering, we first need to state the number of clusters K which we want to end up with.

Then the algorithm will assign each observation to only one cluster from the range [1, K]. Each cluster is represented by a centroid, which is the "multi- dimensional average" of the cluster. It means that a centroid is represented by a vector consisting of averaged coordinates of each variable. The optimization problem of this algorithm is to divide observations in such a way that the distance between the centroid and observations assigned to this centroid is minimized. Hence, the idea behind K-means clustering is to reduce the within-cluster variation as much as possible. The most common way to measure this variation is squared Euclidean distance [24]:

d_Euclidean(x_ij, x_i⁰_j) = 1

|S_k| X

i,i⁰Ck

p

X

j=1

(x_ij − x_i⁰_j)², (2.15) where |Sk| denotes the number of data points in the Kth cluster.

Figure 2.5: Flowchart of the K-Means algorithm.

K-means algorithm is an iterative algorithm which follows the four steps (Figure 2.5). For each iteration, (K times n) comparisons are needed, where n

(30)

is the number of observations. The algorithm iterates until convergence, i.e.

until the coordinates of centroids are no longer changing through iterations.

Basically, a user of K-means algorithm needs to specify the two parameters:

the number of K clusters, and the distance function. The seeding of clusters might be either fixed, in order to be able to reproduce the results obtained later, or randomly generated. When centroids are seeded randomly each time of initialization, there is a potential to end up with different locations of centroids, leading to slightly different partitioning.

The most crucial parameter of the algorithm is the choice of K, which might be justified by different methods. Probably the most simple and common way is "elbow method" [43]. The mechanism behind this method is to run K-means clustering on a certain dataset with different values of K, storing the sum of squared errors (SSE) for each value [43]:

SSE =

K

X

K=1

X

xiSK

||x_i− C_K||², (2.16)

where CK is a Kth centroid. Therefore, elbow method is represented in the form of a graph with increasing number of clusters K on x-axis and SSE on y- axis. The sense behind this graph is that the SSE is decreasing with K toward 0. Logically, the SSE equals 0 when the number of clusters K is the same as the number of observations in a dataset that is each observation is represented by its own cluster, avoiding any error. Usually, a range of [1,10] clusters is used, and it’s enough to determine a trade-off between the number of clusters and SSE (i.e. clustering performance) [43]. Often the resulted line looks like an arm, where the "elbow" on this arm is taken as the value of K with the best trade-off.

(31)

Chapter 3 Method

This chapter provides description of the chosen method. It also illustrates the data set used and data preparation procedures for the analysis.

3.1 Dataset Description

The data set used in this study is provided by an online payments company.

The data set consists of the real transactions initialized by either individuals or legal entities. Each observation in the data set is a single transaction, served by the company, with a predefined set of characteristics (timestamps of transactions, payment amounts, verticals, date of birth of customers, gender, and other technical characteristics). Each customer has own unique identification number different from personal numbers, thereby avoiding disclosure of any personal information. The data set provided does not contain any tags showing exposure to marketing campaigns, hence it is not possible to apply the techniques described in the section 2.4.1 of this paper. In this case, unsupervised machine learning techniques such as clustering comes at handy.

Specifically, the data set consists of 28,334 unique customers with 553,623 transactions in total. All transactions were initialized and served in Sweden, with payments made in SEK. The time period of transactions is from January 1, 2016 to June 30, 2017. All customers were selected randomly in such a way that their first transaction appeared in the company’s transactional history happened during the first quarter of 2016. This way of selection allows to compare customers who started to use the service in more or less the same period of time, excluding external influential factors. This also allows to track the whole life circle of customers that helps to understand how customer behavior matures in time.

23

(32)

24 CHAPTER 3. METHOD

3.2 Proposed Structure of the Project

The proposed flow of the project is illustrated in Figure 3.1. Note that we aim to reject or accept hypotheses stated in this paper, i.e. we need to check whether CLV prediction model (Pareto/NBD) performs better on clusters (different customer segments) or on the dataset taken as a whole. Therefore, we have two variations of the flow denoted as 1 and 2 above the arrows in Figure 3.1. The very first step which is required for both versions is data preprocessing. Then, according to the first version, clustering and CLV prediction steps are following. The second version skips the clustering step, having CLV prediction right after the data preprocessing step. The results of two versions were compared afterwards.

Figure 3.1: Flowchart of the proposed model.

3.3 Data Preprocessing

Before starting to work with predictive and clustering models, the dataset should be prepared appropriately. This is the first step of the proposed model of the project, which consists, in turn, of cleaning and feature engineering parts.

3.3.1 Cleaning

Data cleaning aims to assure the quality of analysis. Irrelevant, incomplete, inaccurate or incorrect records should be identified and removed, corrected or replaced. The process was started from the standard procedures: check- ing whether the specified conditions for selecting data are met: time period

(33)

CHAPTER 3. METHOD 25

of all transactions, all customers have their first transaction in the first quarter 2016, location is Sweden, currency is SEK. Due to business nature, transactions with no specified payment amount are possible. Although the number of such transactions is small (less than 2% of the total number of transactions), they were found and deleted to avoid bias.

In the process of data exploration, new merchant effect was identified and ex- cluded. Merchants which started to be served by the online payment company after the first quarter 2016 were considered as new. Transactions coming from these new merchants contributed to the growth in the number of transactions per user. This growth is recognized as bias because customers increase their transaction frequency in the dataset not due to their needs or desires but due to broaden accessibility of the online payment company on merchant webpages.

Such growth in the number of transactions shows that the company more suc- cessfully penetrates customer wallet through new merchant acquisition. In other words, customers do not switch to new merchants once they have in access a familiar online payment method, holding their spendings constant.

Instead, a familiar online payment method begins to serve third parties with which customers have been having relationships before. Exclusion of the new merchant effect is necessary also because we do not know how customers (displayed in our dataset) behaved before we started to see them in our database.

After the cleaning has been applied, roughly 91% of unique customers and 82% of all transactions remained in the dataset.

3.3.2 Feature Engineering

The input data required for Pareto/NBD model is RFM framework, specifically frequency, recency, monetary value, and time since customer’s first purchase till the end of considered period T . Using the transactional history, these four discrete (excluding monetary value) variables were computed for each customer separately (Table 3.1). For simplicity, the discrete variables were computed using a day as a period. Frequency, recency, and T variables are expressed in days, while monetary value variable is expressed in SEK.

Table 3.1: Distribution of RFM variables computed for each customer.

RFM Variables Mean Standard Deviation

Frequency 3.98 13.95

Recency 95.21 122.58

Monetary Value 6377.49 32399.45

T 318.22 30.91

(34)

26 CHAPTER 3. METHOD

To perform clustering, only two variables were used, both per customer for 2016: total payment amount and the number of transactions (Table 3.2). Us- ing these variables helps to distinguish between high frequent, low frequent, high infrequent, and low infrequent spenders. Demographics characteristics such as gender and age were also checked for their possible influence on total payment amount and the number of transactions variables. Simple linear regression was run which showed that neither of demographics variables is statistically significant (p-values > 0.05). Consequently, these variables were not included in the clustering model.

Table 3.2: Distribution of variables (per user) included in clustering.

Exploratory Variables Mean Standard Deviation Total payment amount 27819.03 149433.03 Number of transactions 8.61 49.88

3.4 Customer Segmentation

As stated in the dataset description section in this chapter, there is no information provided on accomplished marketing activities. Therefore, customer segmentation has been done by means of unsupervised machine learning, specifically by clustering. There are two goals of application of clustering in this project: 1) to group a large set of customers according to similar purchasing behavior 2) to run the CLV predictive model on each cluster separately in order to check for performance improvement. Due to simplicity and prominence, K-means algorithm was chosen.

The clustering was done on the two continues variables showed in Table 3.2.

The Euclidean distance was used as the distance function, which is set in the utilized library (scikit-learn) by default. Scikit-learn is a free software machine learning library written for the Python programming language [31]. Scikit- learn features, among others, K-means algorithm, making application of the algorithm easier and faster.

The number of clusters K was chosen according to the "elbow method". The method implies that the number of clusters should be chosen as soon as growing the number of clusters K stops to contribute to SSE reduction significantly.

The plot showing dependence of SSE on K was drawn, where it is possible to decide on the value of K by human naked eye.

(35)

CHAPTER 3. METHOD 27

3.5 CLV Computing

Pareto/NBD and Gamma-Gamma models were compiled by means of pystan library in Python. Pystan provides an interface to Stan, a probabilistic language which is designed for statistical modeling [42]. First, Pareto/NBD model was implemented to predict the number of purchases. 500 warm up iterations and 1000 total iterations were used. We should keep in mind the limitation of Pareto/NBD model: it is assumed that customers perform only one transaction per day. It basically means that we predict the number of active days per customer. Second, the daily average monetary value per customer was estimated by means of gamma-gamma model. Finally, we obtain CLV predictions per customer by multiplying the number of active days predicted by the estimated monetary value. In this way, we avoid the problem of un- derestimating of the number of transactions caused by the above mentioned assumption.

The whole year of 2016 is chosen as a training period for the model, while the first half of 2017 is a validation period. The size of the validation period is 1/2 of the training period which is consistent with justified proportions.

Discounting rate for CLV computing was neglected since the interest rate in Sweden in 2016-2017 was fluctuating around 0, according to the Swedish Riks- bank. Therefore, it is assumed that the interest rate of that size and for this short period would not significantly influence calculations. This disregard simplifies calculations.

(36)

Chapter 4 Results

This chapter represents the results of the experiments performed in this project. The results of clustering are presented first, followed by the results of CLV prediction modeling. In the final section of this chapter, hybrid of the two models is evaluated.

4.1 Customer Segmentation

To decide on the number of clusters, the "elbow method" was applied (Figure 4.1):

Figure 4.1: Elbow method applied to define the number of clusters.

As clearly displayed in the figure, SSE is dramatically decreasing as the number of clusters grows. We can notice that the shape of the line resembles,

28

(37)

CHAPTER 4. RESULTS 29

at some extent, an arm with "elbow" at the point of four clusters on x-axis. Af- ter the threshold of four clusters, SSE continues to decrease but not as mush significant as before. Consequently, the number of clusters for K-means algorithm of four was chosen and applied to hold a trade-off between the number of clusters and SSE.

Table 4.1 provides a summary on the distribution of RFM variables per each cluster. The overall means of RFM variables for all customers in the dataset are presented in the last line. The means of RFM values per cluster are compared to those of overall. As we can see from the table, there is a huge difference between the clusters. The biggest by the number of customers, cluster 1, takes the majority of customer pool. This leads to the fact that the RFM variables for cluster 1 are the closest to overall values compare to other customer segments.

So small recency value for the cluster 1 can be explained by limited number of transactions coming from customers in this segment. In other words, they executed their first transaction in the first quarter 2016 and then stopped ex- ecuting transactions after 95 days on average. For the other clusters, recency values are high, showing that customers were performing their transactions till the end of considered period (2016). These conclusions are supported by frequency values, where cluster 3 shows customers with the highest total number of transactions. Cluster 4 shows extremely high monetary value, having only 3 customers in the segment. This cluster is considered as an outlier. And finally, T values are almost the same for all clusters that is expected because customers who made their first transaction in the same time period were chosen.

Table 4.1: Means of RFM variables per cluster compared to overall mean.

Number of Frequency Recency Monetary Value T customers

Cluster 1 25080 86% 97% 65% 100%

Cluster 2 525 635% 243% 1255% 100%

Cluster 3 65 1209% 245% 3429% 101%

Cluster 4 3 511% 245% 17525% 99%

Overall 25,673 3.98 95.21 6377.49 318.22

4.2 CLV Prediction Performance

This section aims to describe the results obtained for CLV prediction. Predic- tions were evaluated by the coefficient of determination (R-squared). As mentioned before, CLV prediction was computed through three steps: 1) prediction

(38)

30 CHAPTER 4. RESULTS

of the number of transactions for each customer by Pareto/NBD model; 2) prediction of the monetary value for each customer by Gamma-Gamma model; 3) to obtain final CLV values, the results from the first and the second steps were multiplied. Hence, we can evaluate performance of the models on each step (Table 4.2):

Table 4.2: R-squared for CLV prediction evaluation.

Pareto/NBD Gamma-Gamma CLV prediction (Number of (Monetary value)

transactions)

Training period 97% 99% -

Validation period 60% 0% 28%

According to the table above, both models, Pareto/NBD and gamma-gamma, perform very well on the training period. It means that the models are able to efficiently describe variation in the dataset. Pareto/NBD also shows quite good performance on the validation period with r-squared of 60%. However, gamma-gamma failed at prediction of the average monetary value per customer. This can be explained by very high variation in payment amounts of customers: this continues variable fluctuates from 0.04 to 1,000,000 SEK for a single transaction. Poor prediction results can be handled by adjusting parameters of the model, which is out of the scope of this project. Nevertheless, the final prediction of CLV for the validation period resulted with r-squared of 28% which is satisfying considering a huge variation variation in the dataset.

Hence, we can see that CLV can be predicted at least at some extent.

4.3 Hybrid Model of Customer Segmentation and

CLV Computing

The main experiment of this paper is organized to check whether clustering can improve CLV prediction performance. The reasoning behind this experiment is that clustering aims at dividing a dataset on groups according to some common characteristics. Hence, it is expected that predictive models run on the resulted clusters perform better because they are "trained" on groups of customers with reduced variance. The results of this experiment applied to the considered case study are presented in Table 4.3.

As Table 4.3 illustrates, Pareto/NBD model shows good results for the training period as well as for the validation period. The model performance on

(39)

CHAPTER 4. RESULTS 31

the training period for almost each cluster is slightly better than for those on overall dataset. However, we are more interested in the validation period performance since it reflects the possibility of predictions in the future. Cluster 3 shows significantly better performance than overall dataset. Clusters 1 and 2 has slightly higher r-squared compare to overall dataset. Only cluster 4 ex- hibits low r-squared that can be explained by very limited population in this group: only 3 unique customers has fallen in this segment.

Gamma-gamma model performs very well on the training period for almost all clusters. It means that the model predicts very close values to the ones that are actually observed (and on which it was trained). The "closeness" is of 99%

r-squared for three clusters. However the model failed at predictions for the validation period due to reasons stated in the previous section.

The final predictions of CLV are not that successful. Cluster 1 and 4 are pre- dictable at some extent, having r-squared almost the same as for overall dataset.

This result heavily depends on performances of Pareto/NBD and gamma- gamma models.

Table 4.3: R-squared for CLV prediction evaluation of hybrid model.

Time Period Pareto/NBD Gamma-Gamma CLV prediction

Cluster 1 Training 98% 99% -

Validation 60% 0% 28%

Overall Training 97% 99% -

To summarize, the raised hypotheses in 1.1.2 section of this paper are ad- dressed. Hypothesis 1 is rejected, at least at some extent: we have 28% of r-squared for overall dataset. This allows us to claim that it is not impossible to predict CLV. Hypothesis 2 is accepted: predictive models perform better on customer segments. Although gamma-gamma model failed for both method combinations, and we can not reject or accept the hypothesis, this statement is supported by Pareto/NBD model. Prediction performance on the validation period improved, on average, after clustering for Pareto/NBD model.

Customer Lifetime Value Prediction Using Statistical Modeling

Customer Lifetime Value

Prediction Using Statistical

Modeling

Predicting Online Payments in an Industry Setting

ALINA SUSOYKINA

Customer Lifetime Value Prediction

Using Statistical Modeling

Predicting Online Payments in an Industry Setting

by

Alina Susoykina

Kundens Livstidsvärde Förutsägelse med Statistisk

Modellering

Predikera Online Betalningar i en Industriell Miljö

Alina Susoykina

Kundens Livstidsvärde Förutsägelse med

Statistisk Modellering

Contents

Chapter 1

Introduction

1.1 Problem Formulation and Motivation

1.1.1 Purpose of the research

1.1.2 Hypotheses

1.2 Research Questions

1.3 Limitations

1.4 Structure of the Paper

Chapter 2

Theoretical Framework

2.1 Customer Lifetime Value

2.1.1 Customer Lifetime Value Definition

2.1.2 CLV Utilization

2.2 Recency, Frequency, Monetary Value Frame-

work

2.3 CLV Modeling

2.3.1 Pareto/NBD Model for Transaction Count Prediction

2.3.2 Assumptions and Mathematical Formulation of Pareto/

NBD Model

2.3.3 Implications of Pareto/NBD Model

2.3.4 Gamma-Gamma Model for Monetary Value Prediction

2.4 Customer Segmentation Techniques

2.4.1 Popular Segmentation Techniques in Marketing

2.4.2 Clustering for Customer Segmentation

2.4.3 K-Means Algorithm

Chapter 3

Method

3.1 Dataset Description

3.2 Proposed Structure of the Project

3.3 Data Preprocessing

3.3.1 Cleaning

3.3.2 Feature Engineering

3.4 Customer Segmentation

3.5 CLV Computing

Chapter 4

Results

4.1 Customer Segmentation

4.2 CLV Prediction Performance

4.3 Hybrid Model of Customer Segmentation and

CLV Computing