
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Machine Learning Based Prediction and Classification for Uplift Modeling

LOVISA BÖRTHAS
JESSICA KRANGE SJÖLANDER


Degree Project in Mathematical Statistics (30 ECTS credits)
Degree Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2020

Supervisor at KTH: Tatjana Pavlenko
Examiner at KTH: Tatjana Pavlenko


TRITA-SCI-GRU 2020:002
MAT-E 2020:02

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

The desire to model the true gain from targeting an individual for marketing purposes has led to the common use of uplift modeling. Uplift modeling requires the existence of a treatment group as well as a control group, and the objective hence becomes estimating the difference between the success probabilities in the two groups. Efficient methods for estimating the probabilities in uplift models are statistical machine learning methods. In this project the uplift modeling approaches Subtraction of Two Models, Modeling Uplift Directly and the Class Variable Transformation are investigated. The statistical machine learning methods applied are Random Forests and Neural Networks, along with the standard method Logistic Regression. The data is collected from a well-established retail company, and the purpose of the project is thus to investigate which uplift modeling approach and statistical machine learning method yield the best performance given the data used in this project. The variable selection step was shown to be a crucial component in the modeling process, as was the amount of control data in each data set. For the uplift modeling to be successful, the method of choice should be either Modeling Uplift Directly using Random Forests, or the Class Variable Transformation using Logistic Regression. Neural network based approaches are sensitive to uneven class distributions and are hence not able to obtain stable models given the data used in this project. Furthermore, the Subtraction of Two Models did not perform well, due to the fact that each model tended to focus too much on modeling the class in the two data sets separately instead of modeling the difference between the class probabilities. The conclusion is hence to use an approach that models the uplift directly, and also to use a large amount of control data in the data sets.

Keywords

Uplift Modeling, Data Pre-Processing, Predictive Modeling,

Random Forests, Ensemble Methods, Logistic Regression, Machine Learning, Multi-Layer Perceptron, Neural Networks.


Sammanfattning

The need to model the true gain from targeted marketing has led to the now commonly used method of incremental response analysis (uplift modeling). Performing this type of analysis requires the existence of a treatment group as well as a control group, and the goal is thus to estimate the difference between the positive outcomes in the two groups.

The probabilities of the positive outcomes in the two groups can be efficiently estimated with statistical machine learning methods. The incremental response analysis methods investigated in this project are subtraction of two models, modeling the incremental response directly, and a class variable transformation. The statistical machine learning methods applied are random forests and neural networks, along with the standard method logistic regression.

The data is collected from a well-established retail company, and the goal is thus to investigate which incremental response analysis method and machine learning method perform best given the data in this project. The most decisive aspects for obtaining a good result turned out to be the variable selection and the amount of control data in each data set. For a successful result, the method of choice should be random forests used to model the incremental response directly, or logistic regression together with a class variable transformation. Neural network methods are sensitive to uneven class distributions and are therefore not able to obtain stable models given the data.

Furthermore, subtraction of two models performed poorly because each model tended to focus too much on modeling the class in the two data sets separately, instead of modeling the difference between them. The conclusion is thus that a method which models the incremental response directly, together with a relatively large control group, is preferable for obtaining a stable result.


Acknowledgements

We would like to thank Mattias Andersson at Friends & Insights, who is the key person who made this project happen to begin with. A great thanks for introducing us to the uplift modeling technique, and for suggesting our thesis project to the CRM department at the retail company. We would also like to thank Elin Thiberg at the retail company, who supervised us when needed and who gladly answered every question we had regarding the structure of the different data sets. Another person at the retail company who supported us and guided us in the right direction was Sara Grünewald, and for that we are truly grateful.

Last but not least, we would like to send a great thank you to our examiner and supervisor, Professor Tatjana Pavlenko, for providing professional advice and for guiding us during our meetings.


Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose and Goal
    1.3.1 Ethics
  1.4 Data
  1.5 Methodology
  1.6 Delimitations and Challenges
  1.7 Outline

2 Theoretical Background and Related Work

3 Data
  3.1 Markets and Campaigns
  3.2 Variables

4 Methods and Theory
  4.1 Data Pre-Processing
    4.1.1 Data Cleaning
    4.1.2 Variable Selection and Dimension Reduction
    4.1.3 Binning of Variables
  4.2 Uplift Modeling
    4.2.1 Subtraction of Two Models
    4.2.2 Modeling Uplift Directly
    4.2.3 Class Variable Transformation
  4.3 Classification and Prediction
    4.3.1 Logistic Regression
    4.3.2 Random Forests
    4.3.3 Neural Networks
    4.3.4 Cross Validation
  4.4 Evaluation
    4.4.1 ROC Curve
    4.4.2 Qini Curve
  4.5 Programming Environment of Choice

5 Experiments and Results
  5.1 Data Pre-Processing
    5.1.1 Data Cleaning
  5.2 Uplift Modeling and Classification
    5.2.1 Random Forests
    5.2.2 Logistic Regression
    5.2.3 Neural Networks

6 Conclusions
  6.1 Discussion
  6.2 Future Work
  6.3 Final Words

References


1 Introduction

This thesis begins with a general introduction to the area for the degree project, presented in the following subsections.

1.1 Background

In retail and marketing, predictive modeling is a common tool used for targeting and evaluating the response from individuals when an action is taken. The action is normally referred to as a campaign or offer that is sent out to the customers, and the response to be modeled is the likelihood that a specific customer will act on the offer.

Put differently, in traditional response models, the objective is to predict the conditional class probability

P (Y = 1|X = x)

where the response Y ∈ {0, 1} reflects whether a customer responded positively (i.e. made a purchase) to an action or not (i.e. did not make a purchase). X = (X1, ..., Xp) are the quantitative and qualitative attributes of the customer and x is one observation.

Using traditional response modeling, the resulting classifier can then be used to select which customers to target when sending out campaigns or offers for marketing purposes. In reality, this is not always the desirable approach, since the targeted customers are those who are most likely to react positively to the offer after the offer has been sent out. The solution is thus to use a second-order approach known as uplift modeling.

The original idea behind uplift modeling is to use two separate train sets and test sets, namely one train and test set containing a treatment group and one train and test set containing a control group. The customers in the treatment group are subject to an action whereas the customers in the control group are not. Uplift modeling thus aims at modeling the difference between the conditional class probabilities in the control and treatment group, instead of just modeling one class probability:

P^T(Y = 1 | X = x) − P^C(Y = 1 | X = x)    (1)

where the superscript T denotes the treatment group and the superscript C denotes the control group. This method is called Subtraction of Two Models and is presented in Section 4.2.1. Each probability in (1) is estimated using the statistical machine learning methods presented in Section 1.5. If the result of (1) is negative, it indicates that the probability that a customer makes a purchase is larger when the customer belongs to the control group than when the customer belongs to the treatment group. This is called a negative effect and is very important to include in the models in order to investigate how the campaigns affect the customers; see Section 4.4.2 for more details. There also exist other approaches for uplift modeling; these are presented in Section 4.2.


Using the uplift modeling approach, the true gain from targeting an individual can be modeled. The purpose of using the uplift modeling approach is hence to optimize customer targeting when applying it in the marketing domain.

1.2 Problem

There is a problem that arises when using uplift modeling, i.e. when using one treatment group and one control group: for every individual in the experiment, only one outcome can be observed. Either the individual belongs to the treatment group or the individual belongs to the control group; one individual can never belong to both groups. Put differently, it is not possible to know for sure that there is a causal connection, i.e. that a customer in the treatment group responds because of the treatment, since the same customer cannot be in the control group at the same time. Thus it is not possible to evaluate the decisions at the individual observational unit, as is possible in for example classification problems where the class of the individual is actually known. This in turn makes the evaluation of uplift models somewhat trickier.

Furthermore, uplift modeling has not yet been tested on the data used in this project, or on similar data belonging to the company that owns this data. Thus it is not clear whether it is even possible to apply the uplift modeling technique to this data and obtain applicable results.

The question to be answered is hence how to optimize customer targeting in the marketing domain by using the uplift modeling approach, while at the same time being able to model the true gain from targeting one specific individual. Furthermore, how should the uplift modeling technique be implemented to obtain the most applicable results given this kind of data?

1.3 Purpose and Goal

The purpose of this thesis is to present methods for optimizing customer targeting in marketing campaigns related to the area of retail. The thesis presents investigations and discussions of different statistical machine learning methods that can be used when the aim is to estimate (1).

The goal of the degree project is to present the uplift modeling approach, in combination with the statistical machine learning method, that yields the best performance given the data used in this project. The result of the project should provide guidance on which approach is best suited, and thus is the best method of choice, for analyses that fall into the same category as those in this project.

1.3.1 Ethics

Today there exists a lot of data on the internet that provides powerful tools when it comes to marketing and other predictions of behaviour and personality. Our goal with this project is to find the subgroup of customers that will give the best response to retail campaigns.


This task can look quite harmless on its own. However, in recent years it has been shown that when similar techniques are used in other circumstances, they can have serious consequences.

For example [9], the company Cambridge Analytica used the behavioural data of millions of people from Facebook without their permission for the 2016 election in the USA. The data was then used to build models to find persuadable voters that could be manipulated through fake information from ads on Facebook without their knowledge. This is obviously a serious threat to democracy and a new, effective way of spreading propaganda.

The uplift modeling technique was also used in the Obama campaign in 2012 [20]. Using the technique for that campaign was considered acceptable at the time, since the data was not illegally collected. Also, the result from the models was used to choose which people to target with campaign commercials. Hence, it is important to question for which purposes it is ethically correct to use this technique. One also has to question whether the data that is used is acceptable to include in the models: is it legally collected, and would every person find it reasonable that their data is used for the purpose of the task?

The laws regarding personal data get tougher over time, which means that companies cannot use people's data any way they want. This makes it easier to draw boundaries for what kind of data can be used when applying the uplift modeling technique. It is nonetheless still important to always question the purpose and understand the power of the technique.

1.4 Data

The data is collected from the retail company's database and includes qualitative and quantitative attributes about the customers. The data describes, among other things, the behaviour of different customers in terms of how many purchases have been made in different time periods and how many returns have been made, as well as the different amounts that the customers have spent on online purchases and in stores. There is also one binary response variable that shows whether a customer has made a purchase during a campaign period or not.

Each data set used in this project corresponds to one specific campaign; hence one customer can occur in several data sets. There is one variable that describes whether a customer belongs to the control group or the treatment group. Customers belonging to the control group are customers who did not receive any campaign offer, while customers belonging to the treatment group did receive the offer.

1.5 Methodology

Using the uplift modeling approach assumes the use of a statistical machine learning method to model predictions of the actions of individuals in the treatment group as well as individuals in the control group.


There are three overall approaches. The first is the Subtraction of Two Models, in which two separate classification models are built for the treatment and control data. The second is Modeling Uplift Directly, which uses a conditional divergence measure as splitting criterion in a tree-based method. The third approach is to use a Class Variable Transformation that allows for a conversion of an arbitrary probabilistic classification model into a model that predicts uplift directly. As there are advantages and disadvantages with each of the uplift modeling approaches, all of them will be examined in this project.

Furthermore, each uplift modeling approach requires the use of a suitable statistical machine learning method. For both the Class Variable Transformation and the Subtraction of Two Models, it is possible to use almost any statistical machine learning method that can predict conditional class probabilities. Examples of such methods are Logistic Regression, Support Vector Machines, Multilayer Perceptrons (Neural Networks), tree-based methods and K-Nearest Neighbours. For the purpose of being able to compare the performance of a simple model with that of a more complex model, Logistic Regression and Multilayer Perceptrons will be used in these uplift modeling settings. Using a conditional divergence measure as a splitting criterion obviously requires the use of a tree-based method. The method of choice for this approach in this project is thus the ensemble learning method Random Forests.

1.6 Delimitations and Challenges

The first note that needs to be made is that to be able to use uplift modeling, there is a need for an existing treatment group and control group related to a certain campaign or offer¹. Not only must a control group exist, it also needs to be large enough for uplift modeling to be beneficial. The control group needs to be at least ten times larger than it needs to be when measuring simple incremental response. Also, when modeling binary outcomes, the treatment group and control group together need to be quite large.

An issue to take into consideration when performing uplift modeling is complex customer influences. If a customer interacts with a company in several ways, such as different marketing activities, advertisements, communications etc., it can be hard to isolate the effect of the specific marketing activity that one intends to model, unlike when there are fewer kinds of interactions between the customer and the company.

Lastly, uplift modeling models the difference between two outcomes rather than just one outcome. Radcliffe et al. [15] point out that this leads to a higher sensitivity to overfitting the data. Thus, even methods that are normally robust without variable selection are in need of it before any uplift modeling can be done.

1.7 Outline

In Section 2, the idea behind uplift modeling is explained along with some related work that has already been done in the area. Section 3 contains a detailed description of the data, i.e. statistics of the different campaigns that are used and what types of variables are collected in the different data sets. The variables are listed in a table where no variable is excluded, meaning that the table contains the list of all the variables that are used before any kind of variable selection is made.

¹ Since this thesis uses uplift modeling applied to the area of retail, the action that has a related response which is modeled will be referred to as a campaign or offer throughout the whole thesis.

The description of the data is followed by Section 4, which contains all the theory related to this thesis. Here, a theoretical description of how to pre-process data is presented along with some theory on variable selection. Furthermore, the three different approaches for uplift modeling, as well as the statistical machine learning methods that are used to perform uplift modeling, are described. The three uplift modeling approaches used in this thesis are Subtraction of Two Models, Modeling Uplift Directly and the Class Variable Transformation. The statistical machine learning methods used for uplift modeling are Logistic Regression, Random Forests and Neural Networks. Moreover, a description of the resampling method Cross Validation is presented. The evaluation metrics that are used in this project are also presented, namely Receiver Operating Characteristic curves and Qini curves. Finally, Section 4 ends with a description of the programming languages that are used for the different approaches and methods, and why these languages are well suited for these kinds of problems.

In Section 5 all the experimental results are presented. Firstly, the results from the pre-processing of the data are presented. Secondly, each implementation is described along with tables and figures of the results of the best performing models. The report ends with Section 6, in which the conclusions of the results are discussed.


2 Theoretical Background and Related Work

Machine learning is an area within computer science and statistics which often aims to, given some attributes, classify a specific instance into some category, or to estimate the conditional probability that it belongs to each of the classes. This technique can be used in a lot of areas, one of them being marketing. In reality, however, this regular kind of classification technique is not really well suited for marketing. For instance, consider a marketing campaign where an offer is sent out to a (randomly) selected subgroup of potential customers. Then, using the results of the actions taken by the customers, a classifier can be built. The resulting classifier is used to select which customers to send the campaign to. The result will be that the customers who are most likely to react positively to the offer after the campaign has been sent out will be used as targets. This is not desirable for the marketer.

Some customers would have made a purchase whether or not they were targeted with the campaign, and thus unnecessary expenses are wasted when sending the offer to this kind of customer. Then there are customers who actually react in a negative way to getting a campaign offer. Some might find it disturbing to receive campaign offers from the company in question, or stop being a customer for some other reason just because they received the offer. When a customer stops doing business with a company it is called customer churn.

Customer churn is something that the company in question really wants to avoid. In other words, this is not a customer the marketer wants to target, since it is an unnecessary expense to send out the campaign in this case, and the company needlessly loses a customer. The first kind of customer just described is called a Sure Thing and the second one is commonly referred to as a Do-Not-Disturb.

Then there are two more categories of customers, namely the Lost Cause and the Persuadable. You can tell by the name that the lost cause is someone who would react negatively, i.e. would not make any purchase at all, whether they were targeted or not. Reaching out to this kind of customer is also a waste of money. The persuadable, on the other hand, is the customer that the marketer wants to find and target. This kind of customer is a person who would not have made any purchase had they not received the campaign offer, but who would make a purchase if they did. These are the kind of customers the marketer can affect in a positive direction. An overview of the different types of customers can be seen in Table 2.1.

The solution to this kind of problem is called Uplift Modeling. The original idea behind uplift modeling is to use two separate training sets, namely one data set containing the control group and one containing the treatment group. The control group contains the customers who were not targeted by the campaign, and the treatment group contains the customers who received the campaign. Uplift modeling thus aims at modeling the difference between the conditional class probabilities in the control and treatment group, instead of just modeling one class probability. Thus the true gain from targeting an individual can be modeled. A more detailed and theoretical description of uplift modeling can be seen in Section 4.2. Uplift modeling is already applied frequently in the marketing domain according to [19], although it has not received as much attention in the literature as one might believe.


                                Response if not treated
                                Yes                No
Response if treated    No      Do-Not-Disturb     Lost Cause
                       Yes     Sure Thing         Persuadable

Table 2.1: The four categories of individuals considered when applying the uplift modeling technique.

In an article about uplift modeling in direct marketing, Rzepakowski et al. [17] use decision trees to model the uplift for e-mail campaigns. The decision tree based models are also compared to simpler, standard response based models, such that three uplift models and three standard response models are used in total. The data that is modeled reflects the customers of a retail company. The goal is thereby to classify customers as persuadables, where the response reflects whether or not they visit the retail company's website because of the campaign. The result of the study is that they find it possible, and more effective, to use uplift modeling rather than response models to predict the persuadables, i.e. which customers have a positive response to the campaigns. The standard response models were good at predicting whether a customer would go to the website or not, but performed very badly at predicting whether they responded to the campaign or not. Rzepakowski et al. also show that uplift modeling done with decision trees (Modeling Uplift Directly) yields a better result than using Subtraction of Two Models. This is hence the reason why this project will focus solely on comparing different approaches for uplift modeling, and not include traditional response or purchase models, since they have in many cases been proven to perform worse.

The same subject is discussed by Radcliffe et al. [15], who write about uplift modeling and why it performs better than other traditional response methods. They also discuss thoroughly many important aspects such as the evaluation of uplift models as well as variable selection for uplift modeling, which is very helpful for getting deeper insights into the matter. They also write about Subtraction of Two Models and why this approach does not work well compared to Modeling Uplift Directly. Radcliffe et al. indicate that it is important to understand that just because Subtraction of Two Models is capable of building two good separate models that perform well on unseen data, this does not necessarily yield a good uplift model when taking the difference of the two.


3 Data

The data used in all the statistical machine learning methods in this project is collected from a well established retail company that has physical stores as well as a website where the customers can make orders online.

The behaviour of the different customers can vary a lot when it comes to how a purchase is made. Some customers only make orders online, while some might only shop in a physical store. Furthermore, there are customers who make purchases both online and in a store. The data used in the methods in this project will consider all kinds of purchases (both store and online).

In the following subsections, the markets and campaigns used in this thesis will be presented, along with a table of descriptions of all the variables.

3.1 Markets and Campaigns

The customer base can be segmented into different categories depending on the customers' purchase behaviour, and the company works actively with encouraging frequent customers to make more purchases. For the purpose of not losing a frequent customer to a silent stage, the data used in the methods in this thesis will only consider campaigns sent to frequent customers. The focus will be on this category of customers since the wish is to use uplift modeling so that campaigns will mainly be sent to customers of the type persuadable.

Also, the campaigns differ depending on the stage of the customer, and thus by focusing on the frequent customers there will be consistency in what kind of campaign is used in the methods of this project.

Today the company is present in more than 70 retail markets. One specific market is chosen for this project, and thus all the data used in the uplift models will be generated from this market.

The campaigns in question that are sent to the customers are actual postcards. These postcards are sent to the customers' mailboxes and are only valid for one online purchase each. All the campaigns contain an offer of a 10% discount on one purchase at the company's webshop. The only thing that differs between campaigns is the time period in which they were sent out.

In this thesis, six different campaigns that were sent out to customers in the chosen market will be considered. The campaigns and their start dates, along with other information, can be seen in Table 3.1.

3.2 Variables

Following is a table of all the variables used in the data set before variable selection is made.

Each row is related to a different customer and each column in the data set contains one of the different variables.


Campaign   CampaignStartDate   AddressFileDate   N         T        C
1          2017-10-30          2017-09-25        181 221   97.24%   2.67%
2          2018-02-05          2018-01-18         82 828   90.34%   9.66%
3          2018-02-12          2018-01-18        155 096   90.34%   9.66%
4          2018-04-02          2018-03-05         62 121   90.34%   9.66%
5          2018-06-25          2018-05-31        310 607   90.34%   9.66%
6          2019-03-04          2019-02-18        207 071   90.34%   9.66%

Table 3.1: The six different campaigns used in the uplift models. AddressFileDate is the date the customers were chosen to be a part of the campaign and N is the total number of customers in each data set. T and C are the percentages of customers that belong to the treatment group and the control group, respectively.

If a variable is not specified to concern online purchases only, then it concerns purchases made both online and in stores. Note that the data concerning customers that have made a purchase in a store only covers customers who are also club members (or staff). This is because the company is not able to collect data about store customers who do not have a membership or are not staff. Purchases made online, on the other hand, can concern both members and non-members as well as staff.

All the amounts are in EUR, at the most recent currency rate. The gross amount is the amount a piece is set to cost before any kind of reduction or discount is made. If a reduction is made, i.e. if a specific item is on sale or has a new reduced price or equivalent, the new price is then the brutto amount. Discount implies that a customer has a personal discount that is used on a purchase, i.e. it is not the same as an overall reduction that is set on an item, but some kind of discount used by a specific customer. The final amount paid by the customer is then the net amount.

Variable                    Description

group                       0 if customer belongs to the control group, 1 if treatment group.
gender                      Gender of customer, 1 if female, 0 if male and set to missing if unknown.
age                         Age of the customer.
resp_dd_flag                Response flag which is 1 if customer has made a purchase online within the response window², 0 otherwise.
resp_dd_pieces              Number of pieces ordered online in total during the response window.
resp_dd_price_net           Total net price on online orders during the response window.
Clubmember                  Club membership which is 1 if customer was a member at address file date³, 0 otherwise.
IsStaff                     1 if customer is staff, 0 otherwise.
lastPurchaseDate            Date of the latest purchase in the observation period.
Has_Purch_i                 1 if customer has made a purchase within the last i = 3, 12, 24⁴ months before address file date, 0 otherwise.
Has_Purch_Child_i           1 if customer has made a purchase from children within the last i = 3, 12, 24 months before address file date, 0 otherwise.
Has_Purch_Ladies_i          1 if customer has made a purchase from ladies within the last i = 3, 12, 24 months before address file date, 0 otherwise.
Has_Purch_LadiesAcc_i       1 if customer has made a purchase from ladies accessories within the last i = 3, 12, 24 months before address file date, 0 otherwise.
Has_Purch_Men_i             1 if customer has made a purchase from men within the last i = 3, 12, 24 months before address file date, 0 otherwise.
orders_i                    Number of orders the past i = 3, 12, 24 months before address file date.
orders_red_or_dis_i         Number of orders with reduction or discount the past i = 3, 12, 24 months before address file date.
orders_ret_i                Number of returned orders the past i = 3, 12, 24 months before address file date.
share_red_or_dis_order_i    Share of orders with reduction or discount the past i = 3, 12, 24 months before address file date.
share_ret_order_i           Share of orders with returned pieces the past i = 3, 12, 24 months before address file date.
dd_pcs_i                    Number of pieces in total the past i = 3, 12, 24 months before address file date.
dd_net_amt_i                Net amount in total the past i = 3, 12, 24 months before address file date.
dd_red_or_dis_pcs_i         Number of pieces with reduction or discount in total the past i = 3, 12, 24 months before address file date.
dd_ret_pcs_i                Number of returned pieces the past i = 3, 12, 24 months before address file date.
dd_ret_net_amt_i            Returned net amount the past i = 3, 12, 24 months before address file date.

Table 3.2: Table of all the variables used in the data set, as well as the description for each variable. There are 54 variables in total in each data set.

⁴ i = 3, 12, 24 indicates that there are three different variables with the same kind of information, but for 3, 12 and 24 months.


4 Methods and Theory

Uplift modeling is a data mining/predictive modeling technique that directly models the incremental impact of a treatment on an individual's behaviour. This will be the underlying model for constructing the statistical machine learning methods. There is a great number of statistical machine learning methods that can be used for regression or classification. In this case, when using a statistical machine learning method with the purpose of applying it in an uplift modeling setting, suitable models are Logistic Regression, Random Forests and Multilayer Perceptrons (Neural Networks), as these perform binary classification.

The following sections will hence include the theoretical background for data pre-processing, uplift modeling, classification and evaluation metrics. Last but not least, the different programming environments of choice are presented, along with some arguments for their compatibility with the data and statistical machine learning methods used in this project.

In this project the input variables are denoted X_m ∈ {X_1, ..., X_p}, also called input "nodes" for Neural Networks, where p is the number of attributes in the data and m is an index corresponding to one variable. The response variable is denoted Y and a prediction is denoted Ỹ, which takes on values within [0, 1]. The values represent the probability that an observation belongs to a certain class (0 or 1). A vector with all the variables is defined as X = {X_1, ..., X_p}, where an observation x_i is a column vector of p elements.

Furthermore, a matrix with N observations and p variables is denoted with a bold letter X ∈ R^{N×p} and the response vector is denoted y = (y_1, ..., y_N). One observation of X is then the row vector x_i^T = (x_{i,1}, ..., x_{i,p})^T, where i = 1, ..., N.

4.1 Data Pre-Processing

The data produced nowadays is large in size and usually has a very high dimension. Also, the data most likely includes a lot of errors such as missing values and outliers. Pre-processing data is about removing and manipulating these values so that the data is a good representation of the desired objects. A part of the process may also include dimension reduction when needed. The management of data can be a very challenging task, since manual pre-processing of data takes a lot of time, see [8].

Moreover, the variables in the data can be in very different ranges and can have different amounts of impact on the prediction. When making a predictive analysis (and other analyses as well), it is of high importance to have a data set with good representation and quality to get an acceptable result. It is also important to choose the variables that are best associated with the response and not to have too large a dimension of the data. To obtain this, data pre-processing is performed in several ways to form a good data representation.


4.1.1 Data Cleaning

Cleaning the raw data is a crucial step in order to get good quality data representations. It is important to identify and remove incorrect and incomplete data and also, if needed, to replace and modify bad data points. In the following subsections, different ways to handle missing values and outliers will be presented.

Missing Values

It is very common that some features in a data set have missing values, and thus it is of high importance to handle the missing data somehow. Deleting certain columns or rows that have a missing value is one way to handle it, but depending on what kind of missing value it is, there exist other techniques that might be more suitable for handling these values.

Overall, the missing values can be divided into three different categories according to [5], namely missing at random (MAR), missing completely at random (MCAR) and missing not at random (NMAR).

The missing data is MAR if, for example, respondents in a certain profession are less likely to report their income in a survey; the missing value thus depends on other variables than the one that is missing. If the data is said to be MCAR, then the missing value does not depend on the rest of the data. This can for example be the case if some questionnaires in a survey accidentally get deleted. If the missing data depends on the variable that is missing, the data is said to be NMAR. An example of this can be if respondents with high income are less likely to report their income in a survey. Having this kind of missing data causes the observed training data to give a corrupted picture of the true population. Imputation methods are dangerous under these conditions.

It is possible to use imputation methods both on data that is under the assumption to be MAR as well as MCAR, although MCAR is a stronger assumption. Whether or not the data is MCAR often needs to be determined in the process of collecting the data.

As mentioned before, there are several ways to handle the missing data, and the simplest one is to delete the observations that contain the missing data. This method is usually called the listwise-deletion method and it is only workable if the proportion of deleted observations is small relative to the entire data set. Furthermore, it can only be used under the assumption that the missing values are MAR or MCAR.

On the other hand, if the amount of missing data is large compared to the entire data set, the method just mentioned is not good enough. In such cases it is possible to fill in an estimated value for each missing value by using a Single Imputation method such as Mean Imputation, which means that the missing value is replaced with the mean of all the completely recorded values for that variable. Another way to handle missing values is to use some more sophisticated algorithm such as the EM algorithm or Multiple Imputation. The latter fills in the missing values m > 1 times and thus creates m different data sets, which are analyzed separately; the m results are then combined to estimate the model parameters, standard errors and confidence intervals. Each time the values are imputed they are generated from a distribution that might be different for each missing value, see [6].
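As a rough illustration of the two simplest strategies just described (not the pre-processing pipeline used in this thesis, which is implemented in SAS), listwise deletion and mean imputation could be sketched in Python as follows; the data frame and column names are hypothetical:

```python
import pandas as pd

# Hypothetical customer data with missing values in "age" and "orders_12".
df = pd.DataFrame({
    "age":       [34, None, 51, 29, None],
    "orders_12": [2, 5, None, 1, 3],
    "group":     [1, 1, 0, 1, 0],
})

# Listwise deletion: drop every row that contains at least one missing value.
# Only reasonable when the share of deleted rows is small (MAR/MCAR assumption).
df_listwise = df.dropna()

# Mean imputation: replace each missing value with the mean of the
# completely recorded values for that variable.
df_mean = df.fillna(df.mean(numeric_only=True))

print(df_listwise)
print(df_mean)
```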


In this project, the statistical software suite SAS is used for all the pre-processing of the data, and hence the existing procedure MI is used for handling some of the missing values. The MI procedure is a multiple imputation method that has a few different statements to choose from, depending on what type of variables need to be imputed. The FCS statement is used in this project, along with the imputation methods LOGISTIC (used for binary classification variables) and REG (used for the continuous variables). The FCS statement stands for Fully Conditional Specification, and it determines how variables with an arbitrary missing data pattern are imputed; the methods LOGISTIC and REG are two of the available methods related to this statement.

The use of the procedure yields m separately imputed data sets with appropriate variability across the m imputations. These imputed data sets then need to be analyzed using a standard SAS procedure, which in this project is the MIXED procedure, since it is valid for a mixture of binary and continuous variables. Once the analyses from the m imputed data sets are obtained, they are combined in the MIANALYZE procedure to derive valid inferences. This procedure is described in detail in [18].

Outliers

Another important step in the process of cleaning the data is the handling of outliers. An outlier is an observation that is located at an abnormal distance from the rest of the data, i.e. the observation does not seem to fit the other data values. What is an abnormal distance or not can be decided by, for example, comparing the mean or median of the data with those of similar historical data sets, see [6]. The handling of outliers is not necessary for all kinds of statistical machine learning methods, as some methods are immune to the existence of outliers. In this project, the detection and handling of outliers is done for Logistic Regression and Neural Networks, as these methods are sensitive to the existence of predictor outliers. Decision trees are immune to outliers, and thus outlier detection is not done as a part of the data pre-processing step for Random Forests.

The method used for dealing with outliers in this project is Hidden Extrapolation, which can be used in multivariate regression cases. The idea is to define a convex set called the regressor variable hull (RVH). If an observation is outside this set, it can be confirmed to be an outlier. In Figure 4.1 it can be seen, for a two-variable case, that the point (x_{01}, x_{02}) lies within the range of the variables X_1 and X_2 but not within the convex area. Hence, this observation is an outlier with respect to the data set that is used to fit the model.

Figure 4.1: A visualization of the idea behind hidden extrapolation. The gray area is the ellipsoid that includes all observations of the RVH. The figure is taken from [11].

To determine the RVH, let us define the hat matrix

H = X(X^T X)^{-1} X^T    (2)

where X is the N × p matrix with the data set that is used to fit the model. The diagonal elements h_ii of the hat matrix can be used to determine if an observation is an outlier or not; h_ii depends on the Euclidean distance between the observation x_i and the centroid, and the largest diagonal element, h_max, defines an ellipsoid that encloses all observations of the RVH. If an observation x_i satisfies

x_i^T (X^T X)^{-1} x_i ≤ h_max    (3)

that observation lies within the ellipsoid that consists of all the observations in the RVH. For example, if the wish is to determine whether an observation x_0 is an outlier or not, h_00 can simply be calculated and the result can then be checked to see if it is smaller than or equal to h_max. If the following holds:

h_00 = x_0^T (X^T X)^{-1} x_0 ≤ h_max

the observation is not an outlier, since it lies within the RVH, see [11].
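A minimal numerical sketch of this hidden-extrapolation check, assuming a generic design matrix rather than the actual campaign data (the thesis performs this step as part of its SAS pre-processing), could look as follows:

```python
import numpy as np

def hat_diagonal(X):
    """Diagonal of the hat matrix H = X (X^T X)^{-1} X^T."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Hypothetical design matrix used to fit the model (N observations, p variables).
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(100, 3))

h = hat_diagonal(X_fit)
h_max = h.max()  # defines the ellipsoid enclosing the regressor variable hull (RVH)

# Check whether a new observation x0 lies inside the ellipsoid (not an outlier).
x0 = np.array([0.5, -1.0, 2.0])
h00 = x0 @ np.linalg.inv(X_fit.T @ X_fit) @ x0
print("inside RVH ellipsoid (not an outlier):", h00 <= h_max)
```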

4.1.2 Variable Selection and Dimension Reduction

Selection of variables is an important part of the data pre-processing step. Statistical machine learning methods are used to find relationships between the response variable and the input variables in the form of a function Y = f(X) + ε, where ε is an error term. If there are too many variables compared to the amount of training data, it is hard for the model to find the underlying function, and it then gets overfitted. If the final model only includes variables that are truly associated with the response, the model accuracy is improved by adding them. In reality this is usually not the case, since most variables are noisy and not completely associated with the response. Adding many noisy variables to the model deteriorates it, and it will as a consequence perform worse on unseen data.

Some statistical machine learning methods, like decision-tree learners, perform variable selection as a part of the modeling process and are thus often not in need of separate variable selection. However, for statistical machine learning methods used in an uplift modeling setting, variable selection needs to be done, as the difference between two outcomes is modeled and in many cases the uplift is small relative to the direct outcomes, which causes the risk of overfitting the data to increase heavily according to [15].


Net Information Value

A common technique for variable selection when performing uplift modeling, i.e. when estimating (1), is the Net Information Value, NIV, which is demonstrated in [15]. The method ranks the variables and is used for every method in this project.

The NIV is formed from the Weight of Evidence, WOE. Each continuous and categorical predictor is split into bins i, where i = 1, ..., G; G is the number of bins created for continuous predictors or the number of categories for categorical predictors. The predictors are thus turned into discrete predictors. For each bin i, the WOE is defined as

WOE_i = ln( P(X_m = i | Y = 1) / P(X_m = i | Y = 0) )

where Y ∈ {0, 1} is the label that tells whether a customer made a purchase or not and X_m is one predictor from the vector X = (X_1, ..., X_p) with index m. Further, the Net Weight of Evidence NWOE_i is defined as

NWOE_i = WOE_i^T − WOE_i^C

where T again denotes the treatment group and C denotes the control group. Using NWOE_i, the NIV for each variable in the data set can be calculated as

NIV = Σ_{i=1}^{G} NWOE_i · [ P^T(X_m = i | Y = 1) · P^C(X_m = i | Y = 0) − P^T(X_m = i | Y = 0) · P^C(X_m = i | Y = 1) ]

The uplift package [3] in R calculates the NIV in the following way:

Algorithm 1: Net Information Value in the uplift package [3].

1. Take B bootstrap samples and compute the NIV for each variable on each sample according to:

   NIV = 100 · Σ_{i=1}^{G} NWOE_i · [ P^T(X_m = i | Y = 1) · P^C(X_m = i | Y = 0) − P^T(X_m = i | Y = 0) · P^C(X_m = i | Y = 1) ]

2. Compute the average of the NIV (µ_NIV) and the sample standard deviation of the NIV (σ_NIV) for each variable over all the B bootstrap samples.

3. The adjusted NIV for a given variable is computed by adding a penalty term to µ_NIV:

   NIV_adj = µ_NIV − σ_NIV / √B

The larger the NIV is for a variable, the better predictor it can be considered to be.
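To make the computation concrete, the following is a rough Python sketch of the (un-bootstrapped) NIV for a single predictor; the thesis itself uses the niv function in the R uplift package [3], and the data, binning and variable names below are hypothetical:

```python
import numpy as np
import pandas as pd

def net_information_value(x, y, treat, bins=10):
    """Sketch of the NIV for one predictor x, binary response y and
    treatment indicator treat (1 = treatment, 0 = control)."""
    # Discretize the predictor into equally spaced bins (bucket binning).
    b = pd.cut(x, bins=bins, labels=False)
    eps = 1e-6  # avoids division by zero and log of zero in sparse bins
    niv = 0.0
    for i in np.unique(b):
        m = (b == i)
        # P(X = i | Y = 1) and P(X = i | Y = 0) within treatment and control.
        pt1 = (m & (y == 1) & (treat == 1)).sum() / max((y == 1)[treat == 1].sum(), 1) + eps
        pt0 = (m & (y == 0) & (treat == 1)).sum() / max((y == 0)[treat == 1].sum(), 1) + eps
        pc1 = (m & (y == 1) & (treat == 0)).sum() / max((y == 1)[treat == 0].sum(), 1) + eps
        pc0 = (m & (y == 0) & (treat == 0)).sum() / max((y == 0)[treat == 0].sum(), 1) + eps
        # NWOE_i = WOE_i^T - WOE_i^C, accumulated into the NIV sum.
        nwoe = np.log(pt1 / pt0) - np.log(pc1 / pc0)
        niv += nwoe * (pt1 * pc0 - pt0 * pc1)
    return 100 * niv

# Hypothetical predictor, response and treatment indicator.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.integers(0, 2, 1000)
treat = rng.integers(0, 2, 1000)
print(net_information_value(x, y, treat))
```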

Variable Selection using Random Forests

Random Forests performs variable selection as a part of the modeling process and can thus be used to evaluate the Variable Importance (VI) in a data set. Random Forests is an ensemble learning method that works by constructing multiple decision trees during training, and which outputs the most commonly occurring class among the different predictions (in classification settings) or the mean prediction (in regression settings). Using decision trees, one aims at creating a model that predicts the label/target using some input variables. Decision trees consist of a tree structure with one root node which is split into two daughter nodes, where node m represents the corresponding region R_m. The process is then repeated for all the new regions.

The splitting is based on a splitting criterion defined on the input variables. Put differently, the variable chosen at each step is the one that splits the region in the best manner. Using the so-called Gini index in a classification tree, it is possible to get an overall summary of the VI, which is an output of the Random Forests algorithm and which shows the variables that have been chosen at each split. The Gini index is thus used to evaluate the quality of each split and is defined in [5] in the following way, for each node m:

G_m = Σ_{k ≠ k'} p̂_mk p̂_mk' = Σ_{k=1}^{2} p̂_mk (1 − p̂_mk)

where p̂_mk is the proportion of the training observations from the kth class in the mth region. As the target only has two outcomes in this project, i.e. Y ∈ {0, 1}, there are only two classes k. The proportion p̂_mk is defined as:

p̂_mk = (1 / N_m) Σ_{x_i ∈ R_m} I(y_i = k)

where y_i is one response observation and x_i is one vector corresponding to one observation in the region R_m. The node m represents a region R_m with N_m observations, and an observation in node m is classified according to the majority class in node m:

k(m) = arg max_k p̂_mk

A large VI value indicates that the variable is an important predictor, and thus it is possible to rank the variables accordingly when the VI is measured using Random Forests. In this project, this method is used separately to rank the variables according to the VI, and the best-ranked variables are then used as input to the Random Forests method that performs uplift.
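As an illustration of this ranking step (the thesis does not tie it to a specific library; this sketch uses scikit-learn's Gini-based importances on hypothetical data), the variables could be ranked as follows:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature matrix X (customer attributes) and purchase label y.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=1000) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Gini-based variable importance, used here only to rank the variables;
# the top-ranked variables would then be fed into the uplift model.
ranking = np.argsort(rf.feature_importances_)[::-1]
print("variables ranked by importance:", ranking)
```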


Dimension Reduction using Principal Component Analysis (PCA)

Dimension reduction is performed using Principal Component Analysis (PCA) for both Neural Networks and Logistic Regression, on non-binary variables. PCA reduces the dimension of the data into the principal components, in the directions where the variance is maximized. The resulting components become orthogonal, i.e. they become mutually uncorrelated. The following theory is taken from [21].

Let us say the original data matrix is given by X with p variables and N observations, i.e. X ∈ R^{N×p}. A one-dimensional projection of the data, Xα with N elements, can be made using any unit-norm vector α ∈ R^{p×1}. The sample variance of that projection is given by equation (4), assuming the variables of X are centered, where the x_i are the observations from X, i.e. x_1, ..., x_N ∈ R^{p×1}.

V̂ar(Xα) = (1/N) Σ_{i=1}^{N} (x_i^T α)^2    (4)

The direction of the maximum sample variance, also called a loading vector, is given by v_1 in equation (5), where (X^T X)/N is the sample covariance.

v_1 = arg max_{||α||_2 = 1} { V̂ar(Xα) } = arg max_{||α||_2 = 1} { α^T (X^T X / N) α }    (5)

The loading vector v_1 is the eigenvector corresponding to the largest eigenvalue of the sample covariance, and it gives the first principal component z_1 = Xv_1. The next principal component is generated by calculating another vector v_2 using (5) that is orthogonal to v_1, so that the resulting components are uncorrelated. This is repeated r times and generates the following optimization problem, where the matrix V_r consists of all the optimal loading vectors.

V_r = arg max_{A : A^T A = I_r} trace(A^T X^T X A)    (6)

The matrix A consists of the unit-norm vectors α that optimize the problem, and the trace is the sum of the diagonal elements of the resulting matrix A^T X^T X A. V_r also maximizes the total variance of the resulting components, even though the loading vectors are defined sequentially.
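A minimal sketch of this dimension-reduction step, here using scikit-learn's PCA on a hypothetical matrix of non-binary variables (the number of retained components, r = 4, is an arbitrary choice for the example), could look as follows:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical matrix of non-binary customer variables (N observations, p variables).
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))

# Keep the r = 4 principal components with the largest variance.
pca = PCA(n_components=4)
Z = pca.fit_transform(X)          # the principal components z_1, ..., z_r

print(Z.shape)                    # (500, 4)
print(pca.explained_variance_ratio_)
```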

4.1.3 Binning of Variables

Binning of variables is the procedure of converting continuous variables into discrete variables. Usually, discretization of continuous variables can cause the variable to lose some information. In this project, binning is nevertheless implemented for some variables, since there is an advantage in doing so for linearly dependent variables. The binning is performed in SAS, using a procedure with an option called numbin, which is used to decide the number of bins, i.e. the number of categories that the variables are discretized into.

There exist several different binning methods, and in this project the binning is done using bucket binning. Bucket binning means that evenly spaced cut points are used in the binning process. For example, if the number of bins is 3 and the continuous variable is in the range [0, 1], the cut points are 0.33, 0.67 and 1. The resulting discrete variable then takes on values in the range [1, 3].
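The bucket-binning example above can be reproduced with a short Python sketch (the cut points 0.33, 0.67 and 1 follow directly from using three evenly spaced bins on [0, 1]; the values of the variable are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical continuous variable in the range [0, 1].
x = np.array([0.05, 0.20, 0.40, 0.55, 0.80, 0.99])

# Bucket binning with 3 bins: evenly spaced cut points at 0.33, 0.67 and 1.0.
bins = pd.cut(x, bins=[0.0, 1/3, 2/3, 1.0], labels=[1, 2, 3], include_lowest=True)
print(list(bins))   # [1, 1, 2, 2, 3, 3]
```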

4.2 Uplift Modeling

In this section, the problem formulation of the uplift modeling problem will be introduced; furthermore, three common approaches to the uplift problem are discussed.

To distinguish between the treatment group and the control group, notation with the superscript T will denote quantities related to the treatment group, while notation with the superscript C will denote quantities related to the control group. As an example, the probabilities in the treatment group will be denoted P^T and, likewise, the probabilities in the control group will be denoted P^C. In addition, the notation M_U will denote the resulting uplift model.

The response variable takes on values Y ∈ {0, 1}, where 1 corresponds to a positive response to the treatment while 0 corresponds to a negative response. Put differently, 1 means that the individual has made a purchase while 0 means that the individual has not made a purchase. The input attributes are the same for both models, i.e. for both the model containing the treatment data and the model containing the control data. The expected uplift is defined as the difference between the success probabilities in the treatment and control groups according to equation (1), i.e. the uplift caused by taking the action conditional on X = (X_1, ..., X_p). If the result is negative, it indicates that the probability that a customer makes a purchase is larger when the customer belongs to the control group than when the customer belongs to the treatment group. This is called a negative effect and is very important to include in the models in order to investigate how the campaigns affect the customers; see Section 4.4.2 for more details.

Whether uplift modeling is an instance of a classification or a regression problem is not fully clear, as it can be treated as both. Uplift modeling can be viewed as a regression task when the conditional net gain (1) is treated as a numerical quantity to be measured. It can also be viewed as a classification task, as the class to predict is whether a specific individual will respond positively to an action or not. Thus, if the expected uplift is greater than zero for a given individual, the action should be taken. However, as mentioned earlier, it is not possible to evaluate the uplift model correctness on an individual level, see [19]. For simplicity, uplift modeling will be referred to as a classifier throughout this thesis.


4.2.1 Subtraction of Two Models

When creating the algorithms for estimating equation (1) described in the introduction, there are three overall approaches that are commonly used. The first approach consists of building two separate classification models, one for the data in the treatment group, P^T, and one for the data in the control group, P^C. The uplift model approach Subtraction of Two Models can hence be defined as

M_U = P^T(Y = 1 | X = x) − P^C(Y = 1 | X = x)

which means that for each classified object, the class probability predicted by the model containing the data of the control group is subtracted from the class probability predicted by the model containing the data of the treatment group. This way, the difference in the class probabilities caused by the treatment is estimated directly (demonstrated in [7]). The input X = (X_1, ..., X_p) is the same for both models but originates from two different data sets. This means that the model parameters in P^T will be different from the model parameters in P^C.

The advantage of this approach is that it can be applied using any classification model and it is easy to estimate the uplift. The disadvantage is that this approach does not always work well in practice, since the difference between two independently accurate models does not necessarily lead to an accurate model itself, see [4]. Put differently, the risk is that each model might focus too much on modeling the class in the two data sets separately, instead of modeling the difference between the two class probabilities. Also, the variation in the difference between the class probabilities is usually much smaller than the variability in the class probabilities themselves. This in turn can lead to an even worse accuracy, see [17].

Despite the disadvantages just mentioned, there are some cases where this approach is competitive. According to Sołtys et al. [19], this can be either when the uplift is correlated with the class variable (e.g. when individuals that are likely to make a purchase also are likely to respond positively to an offer related to the purchase), or when the amount of training data is large enough to make a proper estimation of the conditional class probabilities in both groups.

Since this approach can be applied with any classification model, and for the purpose of having a simple approach to compare with when investigating more advanced approaches, this approach will be implemented using both Logistic Regression and Neural Networks.

Logistic Regression is a linear statistical machine learning method that is easy to implement, while the more complex Multilayer Perceptron (MLP) is a class of feedforward artificial Neural Networks. By implementing both of these methods, it is possible to analyze whether the simpler method Logistic Regression performs better or worse than the more complex Neural Networks.
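A minimal sketch of the Subtraction of Two Models approach, here using scikit-learn's logistic regression as the base classifier (the thesis also applies the approach with a Multilayer Perceptron), could look as follows; the treatment and control arrays are hypothetical stand-ins for the actual campaign data sets:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Hypothetical treatment and control data sets with the same attributes.
X_treat, y_treat = rng.normal(size=(800, 5)), rng.integers(0, 2, 800)
X_ctrl,  y_ctrl  = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)

# Fit one classification model per group.
model_T = LogisticRegression(max_iter=1000).fit(X_treat, y_treat)
model_C = LogisticRegression(max_iter=1000).fit(X_ctrl, y_ctrl)

# Predicted uplift for new customers: P^T(Y=1|x) - P^C(Y=1|x).
X_new = rng.normal(size=(10, 5))
uplift = model_T.predict_proba(X_new)[:, 1] - model_C.predict_proba(X_new)[:, 1]
print(uplift)
```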

4.2.2 Modeling Uplift Directly

The second approach is to model the uplift directly. The disadvantage of this approach is the need for modification, since the model of choice needs to be adapted to differentiate between samples belonging to the control and the treatment groups. The advantage, on the other hand, is the possibility to optimize the estimation of the uplift directly.

Decision trees are well suited for modeling uplift directly because of the nature of the splitting criteria in the trees. A splitting criterion is used to select the tests in non-leaf nodes of the tree. To maximize the differences between the class distributions in the control and treatment data sets, Rzepakowski et al. [16] propose that the splitting criterion should be based on conditional distribution divergences, which measure how two probability distributions differ. Put differently, using this approach, at each level of the tree the test is selected so that the divergence between the class distributions in the treatment group and the control group is maximized after a split has been made.

The divergence measure used for this project is the squared Euclidean distance. Given the probabilities P = {p_1, p_2} and Q = {q_1, q_2}, the divergence is defined as

E(P, Q) = Σ_{k=1}^{2} (p_k − q_k)^2

where k is equal to 1 and 2 for binary classification as in this project, i.e. the response has two classes Y ∈ {0, 1}. In this case, p_1 and p_2 are equal to the treatment probabilities P^T(Y = 0) and P^T(Y = 1), and q_1 and q_2 are equal to the control probabilities P^C(Y = 0) and P^C(Y = 1).

For any divergence measure D, the proposed splitting criteria is defined in (7) and the largest value of Dgaindecides the split of that node.

Dgain = Daf ter_split

(

PT(Y ), PC(Y )

)− Dbef ore_split

(

PT(Y ), PC(Y ) )

(7) PT and PCare the class probabilities in the treatment and control group before and after the split. The resulting divergence measure after a split has been made is defined as:

Daf ter_split

(

PT(Y ), PC(Y ) )

=

a2

a=a1

Na N D

(

PT(Y|a), PC(Y|a))

(8)

where N is the number of observations before the split has been made, a∈ {a1, a2} is the left and right leaf of that split and Na is the number of observations in each leaf after the split has been made. E.g. if the split is made out of a binary variable, A ∈ {0, 1}, the left leaf a1

corresponds to A = 0 and the right leaf a2corresponds to A = 1 in (8).

This uplift modeling approach will be implemented using decision tree learners which in this project is chosen to be the ensemble learning method Random Forests.
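To illustrate the splitting criterion, the following sketch computes D_gain for a single candidate binary split using the squared Euclidean divergence; it is a toy calculation on hypothetical data, not the Random Forests implementation used in the thesis (empty leaves are not handled):

```python
import numpy as np

def sq_euclidean(p, q):
    """Squared Euclidean divergence E(P, Q) between two class distributions."""
    return float(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def class_dist(y):
    """Empirical distribution (P(Y=0), P(Y=1)) of a binary response."""
    y = np.asarray(y)
    return np.array([(y == 0).mean(), (y == 1).mean()])

def divergence_gain(y_t, y_c, split_t, split_c):
    """D_gain for one candidate binary split.

    y_t, y_c        : responses in the treatment and control group at the node
    split_t, split_c: boolean masks sending each observation to the right leaf
    """
    d_before = sq_euclidean(class_dist(y_t), class_dist(y_c))
    n = len(y_t) + len(y_c)
    d_after = 0.0
    for leaf_t, leaf_c in [(~split_t, ~split_c), (split_t, split_c)]:
        n_a = leaf_t.sum() + leaf_c.sum()
        d_after += n_a / n * sq_euclidean(class_dist(y_t[leaf_t]), class_dist(y_c[leaf_c]))
    return d_after - d_before

# Example: gain from splitting on a hypothetical binary attribute A.
rng = np.random.default_rng(5)
y_t, y_c = rng.integers(0, 2, 500), rng.integers(0, 2, 100)
A_t, A_c = rng.integers(0, 2, 500).astype(bool), rng.integers(0, 2, 100).astype(bool)
print(divergence_gain(y_t, y_c, A_t, A_c))
```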

4.2.3 Class Variable Transformation

The third approach, like the one described in Section 4.2.2, models the uplift directly. Jaskowski et al. [7] propose the introduction of a Class Variable Transformation, i.e. let us define Z ∈ {0, 1} such that

Z = 1  if Y = 1 and the individual belongs to T,
Z = 1  if Y = 0 and the individual belongs to C,
Z = 0  otherwise.    (9)

where T denotes the treatment group data and C denotes the control group data. (9) allows for the conversion of an arbitrary probabilistic classification model into a model which predicts uplift. In other words, if the customer has made a purchase, i.e. Y = 1, and belongs to the treatment group, Z is set to 1. This kind of person is then either a sure thing or a persuadable, see Table 2.1. If the customer, on the other hand, has not made a purchase, i.e. Y = 0, and belongs to the control group, Z is also set to 1. The customer is then either a lost cause or a persuadable. In all other cases Z is set to 0, which means that all do-not-disturbs belong to this group, i.e. there will be no risk of approaching the do-not-disturbs with a campaign.

Note that this approach does not exclusively target the persuadables, as would be the optimal thing to do. The reason for this is simply that one individual can never belong to both the treatment group and the control group, and thus only one outcome for that individual can be observed. Therefore, it is not possible to use the Class Variable Transformation to target the persuadables exclusively.

By assuming that T and C are independent of X = (X_1, ..., X_p), and that P(C) = P(T) = 1/2, Jaskowski et al. show that

P^T(Y = 1 | X = x) − P^C(Y = 1 | X = x) = 2 P(Z = 1 | X = x) − 1

which means that modeling the conditional uplift of Y is the same as modeling the conditional distribution of Z (see [7] for more details). It is thereby possible to use (9), combine the treatment and control training data sets, and then apply any standard classification method to the new data set, thus obtaining an uplift model for Y. Jaskowski et al. also show that the assumption P(C) = P(T) = 1/2 does not have to hold in practice: it is possible to rewrite the training data sets so that the assumption becomes valid, and such a transformation does not affect the conditional class distributions. Put differently, this approach can still be beneficial in cases where the control and treatment groups are imbalanced.
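A rough sketch of the Class Variable Transformation, with scikit-learn's logistic regression as the underlying classifier and hypothetical data in place of the campaign data sets, could look as follows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# Hypothetical combined data set: attributes X, purchase label y and
# treatment indicator t (1 = treatment group, 0 = control group).
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, 1000)
t = rng.integers(0, 2, 1000)

# Class Variable Transformation (9): Z = 1 if (Y = 1 and treated) or (Y = 0 and control).
z = ((y == 1) & (t == 1)) | ((y == 0) & (t == 0))

clf = LogisticRegression(max_iter=1000).fit(X, z.astype(int))

# Under the assumption P(C) = P(T) = 1/2, the predicted uplift is 2 * P(Z = 1 | x) - 1.
X_new = rng.normal(size=(10, 5))
uplift = 2 * clf.predict_proba(X_new)[:, 1] - 1
print(uplift)
```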

In this project, the campaigns are actual postcards instead of the phone calls that are widely used in, for example, the insurance or telecommunication business. Many uplift modeling approaches rely on the fact that it is of great importance not to target the do-not-disturbs, since this group of individuals is most probably larger when customers are approached using actual phone calls instead of advertisements sent out by, for example, email or text message. Hence, in this project, the group of do-not-disturbs can be argued to be not as large as it might have been if the offer instead were given through an actual phone call. Furthermore, recall from Section 3 that the share of observations belonging to the control group in each data set is relatively small compared to the share of observations in the treatment group. Considering these two facts, the transformation suggested in (9) will be slightly modified to
