Prediction of Lead Conversion With Imbalanced Data: A method based on Predictive Lead Scoring

N/A
N/A
Protected

Academic year: 2021



Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

LIU-IDA/STAT-A–21/031–SE

Prediction of Lead Conversion

With Imbalanced Data

A method based on Predictive Lead Scoring

Ali Etminan

Supervisor: Joel Oskarsson
Examiner: Anders Grimvall



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

An ongoing challenge for most businesses is to filter out potential customers from their audience. This thesis proposes a method that takes advantage of user data to distinguish potential customers from random visitors to a website. The method is based on the Predictive Lead Scoring method that segments customers based on their likelihood of purchasing a product. Our method, however, aims to predict user conversion, that is, predicting whether a user has the potential to become a customer or not.

Six supervised machine learning models have been used to carry out the classification task. To account for the high imbalance in the input data, multiple resampling methods have been applied to the training data. The combination of classifier and resampling method with the highest average precision score has been selected as the best model.

In addition, this thesis tries to quantify the effect of feature weights by evaluating some feature ranking and weighting schemes. Using the schemes, several sets of weights have been produced and evaluated by training a KNN classifier on the weighted features. The change in average precision obtained from the original KNN (without weighting) is used as the reference for measuring the performance of ranking and weighting schemes.


Acknowledgments

First, I would like to give a special thanks to my supervisor, Joel Oskarsson, for giving me exceptional feedback throughout the entire thesis. His support has had a great impact on the quality of this report.

I would also like to thank my examiner, Anders Grimvall, and my opponent, Mudith Silva, for their constructive comments and valuable suggestions. I wish to express my gratitude to Samuel Jenks and Malin Schmidt for providing me with the necessary tools and information required for this thesis.

Finally, I would like to thank my lovely wife, Romina. Without her unconditional support, I would not have made it through this journey. And my little daughter, Nila, whose presence is the main source of my ambitions.


Contents

Abstract

Acknowledgments

Contents

List of Figures

List of Tables

1 Introduction
1.1 Background
1.2 Objectives

2 Theory
2.1 Lead Scoring and Predictive Lead Scoring
2.2 Data Preprocessing
2.3 Resampling
2.4 Classifiers
2.5 Feature Ranking
2.6 Weighting Schemes
2.7 Model Selection and Evaluation

3 Method
3.1 Data Description
3.2 Preprocessing
3.3 Classifiers
3.4 Oversampling
3.5 Ranking and Weighting
3.6 Evaluation

4 Results
4.1 Exploratory Data Analysis
4.2 Classification with Oversampling
4.3 Feature-weighted Classification

5 Discussion
5.1 Results
5.2 Method
5.3 The work in a wider context

6 Conclusion

7 Appendix
7.1 Classification with Oversampling
7.2 Feature-weighted Classification


List of Figures

2.1 Sales Funnel - Source: Duncan and Elkan, 2015 [10]
2.2 Random Oversampling
2.3 Random Oversampling Examples (ROSE)
2.4 Oversampling with SMOTE
2.5 Oversampling with K Means SMOTE
2.6 Oversampling with SVM SMOTE
2.7 Oversampling with Borderline SMOTE 1
2.8 Oversampling with Borderline SMOTE 2
2.9 Comparison of Random Oversampling with different variants of SMOTE
2.10 Support Vector Machine with linear decision boundary
2.11 Logistic curve
2.12 A comparison of ROC Curve (left) and Precision-Recall Curve (right)
4.1 Distribution of source and medium of referrals to the website
4.2 Average number of sessions and average bounce rate over the weekdays, separated by social/non-social media users (left); average number of sessions and average bounce rate over 24 hours, separated by social/non-social media users (right)
4.3 Distribution of the scaled average number of converted/not converted users by days of the week (left) and over 24 hours of the day (right)
4.4 Pageviews Per Session and Average Session Duration before removing outliers (top) and after removing outliers (bottom)
4.5 Correlation heatmap for numerical and binary variables in the data
4.6 Logistic Regression with SMOTE Tomek Link
4.7 Support Vector Classifier with SVM SMOTE
4.8 Decision Tree with no oversampling
4.9 Random Forest with Random Oversampling
4.10 K Nearest Neighbors without oversampling
4.11 Gradient Boosting without oversampling
4.12 Feature ranks calculated by Permutation Importance method with the top performing Gradient Boosting model as its estimator
4.13 Feature ranks calculated with Pearson Correlation Coefficient method (left) and Fisher Coefficient method (right)
4.14 Absolute values of feature weights calculated by applying NMF scheme to PCC and FC ranks
4.15 Absolute values of feature weights calculated by applying NRF scheme to PCC and FC ranks
4.16 KNN classifier performance using features weighted with PCC and NRF schemes
7.1 Logistic Regression with no oversampling
7.2 Logistic Regression with Random Oversampling
7.3 Logistic Regression with SMOTE
7.4 Logistic Regression with SVM SMOTE
7.6 Logistic Regression with Borderline SMOTE
7.7 Logistic Regression with ADASYN
7.8 Support Vector Classifier with no oversampling
7.9 Support Vector Classifier with Random Oversampling
7.10 Support Vector Classifier with SMOTE
7.11 Support Vector Classifier with SMOTE Tomek Link
7.12 Support Vector Classifier with K Means SMOTE
7.13 Support Vector Classifier with Borderline SMOTE
7.14 Support Vector Classifier with ADASYN
7.15 Decision Tree with Random Oversampling
7.16 Decision Tree with SMOTE
7.17 Decision Tree with SVM SMOTE
7.18 Decision Tree with SMOTE Tomek Link
7.19 Decision Tree with K Means SMOTE
7.20 Decision Tree with Borderline SMOTE
7.21 Decision Tree with ADASYN
7.22 Random Forest with no oversampling
7.23 Random Forest with SMOTE
7.24 Random Forest with SVM SMOTE
7.25 Random Forest with SMOTE Tomek Link
7.26 Random Forest with K Means SMOTE
7.27 Random Forest with Borderline SMOTE
7.28 Random Forest with ADASYN
7.29 K Nearest Neighbors with Random Oversampling
7.30 K Nearest Neighbors with SMOTE
7.31 K Nearest Neighbors with SVM SMOTE
7.32 K Nearest Neighbors with SMOTE Tomek Link
7.33 K Nearest Neighbors with K Means SMOTE
7.34 K Nearest Neighbors with Borderline SMOTE
7.35 K Nearest Neighbors with ADASYN
7.36 Gradient Boosting with Random Oversampling
7.37 Gradient Boosting with SMOTE
7.38 Gradient Boosting with SVM SMOTE
7.39 Gradient Boosting with SMOTE Tomek Link
7.40 Gradient Boosting with K Means SMOTE
7.41 Gradient Boosting with Borderline SMOTE
7.42 Gradient Boosting with ADASYN
7.43 Pearson Correlation Coefficient with Normalizing Max Filter
7.44 Fisher Coefficient with Normalizing Max Filter
7.45 Fisher Coefficient with Normalizing Range Filter
7.46 Permutation Importance with Normalizing Max Filter


List of Tables

2.1 Example of lead attributes and their rankings [22] (source: Aberdeen Group, 2008)
2.2 Suggested marketing actions for each lead score [19]
2.3 Confusion matrix for binary classification
3.1 Characteristics of the input data
4.1 Best hyperparameter values for Logistic Regression and SMOTE Tomek Link
4.2 Best hyperparameter values for Support Vector Classifier and SVM SMOTE
4.3 Best hyperparameter values for Decision Tree with no oversampling
4.4 Best hyperparameter values for Random Forest with Random Oversampling
4.5 Best hyperparameter values for K Nearest Neighbors without oversampling
4.6 Best hyperparameter values for Gradient Boosting without oversampling
4.7 ROC-AUC and Average Precision (AP) scores of all combinations of classifiers with resampling methods
4.8 Estimated run-time of classifiers with and without oversampling in seconds
4.9 Comparison of Average Precision obtained with KNN using different feature ranking and weighting schemes
7.1 Complete list of hyperparameters and their corresponding grid of values used for classifier tuning
7.2 Complete list of hyperparameters and their corresponding grid of values used for

1

Introduction

1.1

Background

The urge for digital transformation has swept across the entire business community in the past two decades. The rapid evolution of tools that retrieve, analyze, and transform business data has forced organizations to constantly shift and adapt their strategies to unravel consumer behavior. This has emphasized the use of consumer data to find existing patterns and associations in consumers' actions and characteristics. Organizations dedicate a good amount of budget, time, and human resources to building data infrastructures that track and make inferences about their consumers' habits. These, in turn, become the foundation of every subsequent decision made by marketing and sales teams.

Larger organizations gain their competitive advantage by tracking a diverse range of features about their users across their digital platforms to build sophisticated data models. In general, these models require a domain expert to determine certain parameters and to do the decision-making manually, following business objectives. The Lead Scoring methodology is widely practiced by marketing and sales professionals to score and prioritize users based on their data. Lead scoring assigns an importance score to lead (prospective user) interactions such as watching a demo, filling out a form, or opening an email. This can also be extended to demographic features such as the age, gender, and location of the lead. A total score is then calculated for every lead that determines the likelihood of the lead becoming a customer. Marketers use this score to segment leads and create a different marketing strategy for every segment. Leads with the highest scores will be contacted by the sales team, while leads with lower scores will be treated with proper marketing content and follow-ups.

While lead scoring is built upon data, it still depends on experts to manually set the importance scores, which need ongoing re-evaluation as business priorities and consumer behaviors evolve. Therefore, the method cannot be deemed fully data-driven. Over recent years, there has been a growing interest in using machine learning models to automate the process of scoring leads, an approach widely referred to as Predictive Lead Scoring or Automated Lead Scoring. Studies suggest that rather than relying on pure domain knowledge or gut feeling, organizations should pursue predictive lead scoring as a replacement for or complement to manual lead scoring [24].

Input data for predictive lead scoring can consist of website and social media data, as far as attributes can be traced back to a lead. Data generated through Customer Relationship Management tools (tools that allow businesses to manage and integrate marketing and sales activities) is another common source of input data. Numerous studies have attempted to investigate the performance of probabilistic and non-probabilistic models in this domain. Nygård et al. [24] compare the model performance of Logistic Regression, Decision Trees, Random Forests, and Neural Networks. The choice of models reveals how linear, non-linear, ensemble, and deep models perform on this particular problem. In contrast, Duncan et al. [10] compare two probabilistic models for this problem, and Benhaddou et al. [2] experiment with Bayesian Networks, which work best with small and expert-curated datasets.

The application of machine learning in lead scoring has been a subject of interest for many researchers in recent years. Current implementations are mostly patented and offered as commercial features. Therefore, there is an evident demand from the business community for further research and exploration of this methodology. This thesis aims to build a predictive model by investigating several linear and non-linear classifiers. It also studies the contribution of features to the prediction task by applying a number of feature ranking and weighting schemes. The weights are then applied to the features to find out the actual importance of each feature as well as to compare ranking and weighting methods. The explicit use of feature weights is commonly overlooked by similar studies, since the type and source of their input data demand a different data preparation approach.

1.2

Objectives

The data used in this thesis consists of 11 features and a highly imbalanced binary class variable. The extreme class imbalance puts the task in the same category as disease or fraud detection tasks. These tasks suffer from rare positive samples that make the learning process challenging. Therefore, as its primary objective, this thesis evaluates oversampling methods that handle problems with rare positive labels. It then aims to find the best combination of classifier and resampling method.

Considering the significance of feature weights in predictive lead scoring, the thesis defines its third objective as an evaluation of three feature ranking and two feature weighting schemes. The weighting schemes are used to normalize the ranks. In this stage, we introduce the combination of ranking and weighting scheme that best represents the features in the input data.

The outline of this report can be described as follows:

• Chapter 2. Theory: Description of the theories behind the methods used by this thesis

• Chapter 3. Methodology: The methodology used by this thesis

• Chapter 4. Results: Exploratory data analysis and a report of the achieved results with additional plots and tables

• Chapter 5. Discussion: Discussion of strengths and weaknesses concerning obtained results and comparison with related work

• Chapter 6. Conclusion: Conclusion on how this paper can provide a new reflection on the topic of predictive lead scoring


2

Theory

This chapter introduces the theory and mathematical formulations behind the concepts used by this thesis. This gives the reader intuition into the choice of certain models and parameters in subsequent chapters. It also enhances the reproducibility of the methodology used in this thesis.

2.1

Lead Scoring and Predictive Lead Scoring

As briefly discussed in the introduction chapter, identifying and prioritizing users with the potential of taking a desired action or making a purchase is a challenging task for many businesses. In business jargon, these users are referred to as leads. First, we discuss the lead scoring method that has helped many businesses tackle this challenge. Then we briefly review the theory behind machine-learning-assisted lead scoring, known as Predictive Lead Scoring (a.k.a. Automated Lead Scoring).

Lead Scoring

In a study performed by Aberdeen Group [22] lead scoring is defined as:

Lead scoring is a technique for quantifying the expected value of a lead or prospect based on the prospect’s profile, behavior (online and/or offline), demographics, and likelihood to purchase.

The quantification is done by assigning a score to each attribute of a lead. Scores are determined by marketing experts who value each attribute according to business priorities. [22] introduces a number of these attributes, presented in table 2.1.

Attribute Rank

Webinars attended 5.0
Purchase propensity scores 4.5
Email click-throughs 4.0
Website activity (pages visited and recency) 4.0
Website activity (type of activity) 3.5
Keywords clicked 3.5
Website activity (length of time each page was visited) 3.0
Attitudinal and lifestyle information 3.0

Table 2.1: Example of lead attributes and their rankings [22] (source: Aberdeen Group, 2008)

Figure 2.1: Sales Funnel - Source: Duncan and Elkan, 2015 [10]

The sum of the scores attributed to a single lead quantifies the lead's readiness (likelihood) to make a purchase. Leads can be divided into several groups depending on their level of readiness. Those with a high score are considered hot leads, while those that receive a smaller score are referred to as cold leads. Leads with the lowest scores are in the awareness stage, in which they have only an initial impression of the company and its products. Leads with higher scores are considered MQLs (Marketing Qualified Leads) or SQLs (Sales Qualified Leads) depending on their score. MQLs generally require nurturing to be prepared for sales. Nurturing may include contacting the lead with more marketing and promotional materials. SQLs, on the other hand, are deemed to be closer to a sales point and are thus contacted directly by the sales team. Figure 2.1 depicts a sales funnel that demonstrates the different stages of a lead from a business perspective. A study by [19] on a company that provides business solutions suggests how the business can approach leads at different stages. These suggestions are presented in table 2.2. In this table, leads are categorized based on their profiles and engagement. Leads that have provided many details in their profiles are categorized as target fit, while leads that have provided only basic information are labeled as potential fit. In addition, the level of the lead's past engagement with the company determines whether the lead has high interest, medium interest, or low engagement [19].

Predictive Lead Scoring

In the conventional lead scoring method introduced in the previous section, attribute scores are assigned manually. Scores need constant revision and adjustment as business priorities and objectives change. Moreover, since a human expert decides on the scores, the process becomes error-prone. Using machine learning to find potential prospects automates the lead scoring process. Instead of importance scores being explicitly assigned to lead attributes, they are learned from the data. Similar to lead scoring, predictive lead scoring can use any attribute as input. However, since this method involves constructing an automated model, the integration of data from various sources becomes a challenge.


Lead Description — Marketing Action

Target fit, High interest: Send email that encourages the lead to leave a phone number or call in directly, or propose to purchase the product directly.

Target fit, Medium interest: Send an offer of a free trial of the service or propose relevant material that is close to purchase of the product.

Target fit, Low engagement: Priority lead that needs further nurturing and a "why now" message.

Potential fit, High interest: Send email that encourages the lead to leave a phone number or to call in directly, or propose to purchase the product directly.

Potential fit, Medium interest: Continue to nurture with marketing materials that can increase interest; send an offer of a free trial of the service. Pursue information to evaluate if it is a good fit.

Potential fit, Low engagement: Send nurturing content that can create a demand for the product. Pursue information to evaluate if it is a good fit.

Table 2.2: Suggested marketing actions for each lead score [19]

Integration of data can be done by manually connecting different datasets. Some public platforms like Google Analytics also enable integration with certain CRM platforms like Hubspot. Hubspot provides a fully-featured digital marketing platform that also includes a CRM service. CRM-generated data is the most popular data source for predictive lead scoring. This data consists of features related to both personal and behavioral attributes. More importantly, it contains information about historical sales.

The predictive lead scoring model can be defined to solve a binary problem. In this setting, leads are classified as likely or not likely to purchase or convert, which can be addressed by investigating the performance of binary classifiers. A study by Benhaddou et al. [2] also defines a binary task but uses a Bayesian Network with binary features to tackle this problem. On the other hand, a study by Duncan et al. [10] uses the stages of the sales funnel (see fig 2.1) as class labels. This approach would also predict the level of a lead's readiness for sales. The selection of the modeling approach is highly influenced by the type of available input data.

2.2

Data Preprocessing

It is essential to create a proper representation of the data before proceeding to model building. Cleaning data can consist of simple steps like feature normalization or more sophisticated steps such as outlier detection, feature encoding, or handling missing values.

Handling Outliers

Outliers can produce erroneous results with certain algorithms and therefore need to be properly removed from the data. Below we introduce two statistical approaches for this task.

Standard Deviation Method

When dealing with normally distributed data, and depending on how sensitive we are to outliers, the mean and standard deviation are first estimated, and the cut-off threshold is set at:

• One standard deviation from the mean: covering about 68% of the data
• Two standard deviations from the mean: covering about 95% of the data


• Three standard deviations from the mean: covering about 99.7% of the data

Any data point below the lower or above the upper threshold is considered an outlier and is thus removed. Choosing a proper number of standard deviations depends on the domain and the size of the data [7].
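The standard-deviation rule can be sketched in a few lines of Python (the helper name and the two-standard-deviation cut-off are illustrative choices, not values prescribed by the thesis):

```python
from statistics import mean, stdev

def remove_outliers_std(values, n_std=2):
    """Keep only points within n_std standard deviations of the mean."""
    mu, sigma = mean(values), stdev(values)
    lower, upper = mu - n_std * sigma, mu + n_std * sigma
    return [v for v in values if lower <= v <= upper]

# 95 lies far outside two standard deviations of this sample and is dropped
print(remove_outliers_std([10, 11, 9, 10, 12, 11, 10, 95]))
# → [10, 11, 9, 10, 12, 11, 10]
```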

Interquartile Range Method

If data is not normally distributed, a different method has to be applied for outlier removal. The Interquartile Range (IQR) is obtained from the 25th and 75th percentiles of the feature. These percentiles are referred to as quartiles since they divide the data into four quarters.

IQR = (Q3 − Q1) · k    (2.1)

According to equation 2.1, by subtracting the first quartile Q1 from the third quartile Q3, we obtain the range of feature values between the two quartiles, i.e., the interquartile range. This range is then multiplied by a factor k to adjust the cut-off threshold. k typically takes a value of 1.5 or greater, depending on the range of values in the feature [7].
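A minimal sketch of the IQR method, using equation 2.1's scaled interquartile range as a fence around the quartiles (the quartile convention — medians of the lower and upper halves — and the function names are assumptions for illustration):

```python
def _median(xs):
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def remove_outliers_iqr(values, k=1.5):
    """Drop points outside [Q1 - k*(Q3 - Q1), Q3 + k*(Q3 - Q1)]."""
    s = sorted(values)
    half = len(s) // 2
    q1 = _median(s[:half])                 # median of the lower half
    q3 = _median(s[half + len(s) % 2:])    # median of the upper half
    fence = k * (q3 - q1)                  # equation 2.1's scaled IQR
    return [v for v in values if q1 - fence <= v <= q3 + fence]

print(remove_outliers_iqr([1, 2, 3, 4, 5, 6, 7, 8, 100]))
# → [1, 2, 3, 4, 5, 6, 7, 8]
```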

One-hot Encoding

Many machine learning models are not able to handle categorical data in its original form, so categorical features need to be represented in a way that models can interpret. In cases where categories are nominal and there is no ordering among them, the variable is converted to a one-hot representation. One-hot encoding expands a variable with k categories into k different binary variables, each representing one category. For instance, if Xj contains 3 categories A, B, and C, and Xij, the value of feature j at observation i, equals C (Xij = C), the one-hot representation would look like: [0, 0, 1].

While this approach is simple to implement, it comes with a major drawback when the data is high-dimensional, contains high-cardinality categorical features, or both. In such cases, the data dimension escalates considerably, occupying a larger amount of memory. In some cases, this may leave the model with a large number of parameters, which is computationally inefficient.

An alternative method for ordinal variables is conversion into integer factors. The order of the categories has to be taken into consideration. So, for instance, if Xj contains 3 ordered categories A, B, and C, they will be converted to the numerical values (A = 1, B = 2, C = 3). This avoids the increased-dimensionality issue but assumes that the categories are ordered.
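Both encodings from this section can be sketched directly in pure Python (the function names are illustrative):

```python
def one_hot(value, categories):
    """Expand a nominal value into k binary indicators, one per category."""
    return [1 if value == c else 0 for c in categories]

def ordinal_encode(value, ordered_categories):
    """Map an ordinal value to a 1-based integer reflecting its rank."""
    return ordered_categories.index(value) + 1

print(one_hot("C", ["A", "B", "C"]))         # → [0, 0, 1]
print(ordinal_encode("B", ["A", "B", "C"]))  # → 2
```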

2.3

Resampling

A common challenge in classification tasks is dealing with imbalanced classes in the dataset. In binary classification, the class that has a significantly smaller number of samples is called the minority class, while the other class is referred to as the majority class. Datasets with multiple classes (multinomial) can also suffer from an imbalanced class distribution. Overlooking this property can yield misleading results even if a sophisticated classifier is at play. It is therefore essential to experiment with resampling techniques to achieve results that are less biased towards the majority class(es). Depending on the task at hand, undersampling, oversampling, or a combination of both can be applied. Regardless of the method used, the resampling strategy can either be defined as a simple strategy (i.e., resample only the majority class or only the minority class) or customized by setting desired proportions for each class label.



Figure 2.2: Random Oversampling

Random Oversampling

Random oversampling involves duplication of samples from the minority class. The oversampling can be done iteratively to balance all classes with respect to the majority class. The duplication is carried out by sampling the minority data points with replacement (figure 2.2). Random oversampling is known to cause overfitting, since generated samples overlap with the original samples. Therefore, a certain amount of dispersion might be desired over producing exact copies of the original data points. Assume that we want to generate a synthetic sample based on the original sample xi. For this, we first sample x from a probability distribution K_Hj, where K_Hj is centered at xi and Hj is a matrix of scale parameters. K_Hj is usually chosen to be a unimodal and symmetric distribution that is scaled by Hj. In this regard, the new sample is generated in the neighborhood of xi, where the width of the neighborhood is determined by Hj. This method is also referred to as Random Oversampling Examples (ROSE) [21] (see figure 2.3).
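A sketch of ROSE-style generation with a diagonal smoothing matrix Hj = h·I (the scalar h and the use of Gaussian noise are simplifying assumptions; ROSE as described uses a general scale matrix):

```python
import random

def rose_sample(minority, h, n_new, seed=0):
    """Draw a minority point with replacement and perturb every coordinate
    with Gaussian noise of scale h, so new points fall in a neighborhood
    of the original rather than on top of it."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        synthetic.append([xj + rng.gauss(0, h) for xj in x])
    return synthetic

minority = [[3.9, 4.0], [4.0, 4.1]]
print(rose_sample(minority, h=0.05, n_new=3))  # three jittered copies
```

With h = 0 the method degenerates to plain random oversampling, i.e. exact duplicates of the original points.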

SMOTE

Synthetic Minority Oversampling Technique, or SMOTE, is another popular oversampling method that comes in a variety of implementations. As opposed to Random Oversampling, which simply duplicates data points, SMOTE uses the k nearest neighbors of a minority sample xi to generate synthetic samples.

xnew = xi + λ(xzi − xi)    (2.2)

xzi is one of the nearest neighbors of the sample xi from the minority class. λ is a parameter between 0 and 1 that determines the distance between the new and the original sample. Baseline SMOTE uses a uniform distribution to select xi when generating a new sample [8] (see figure 2.4). This makes SMOTE sensitive to noise: a noisy sample from the minority class that lies among the majority samples has an equal probability of being selected for resampling. This may result in generating more noisy samples where the majority samples have a high density [17]. Other implementations of SMOTE take a different strategy in selecting xi but still use equation 2.2 to generate new samples.

Figure 2.3: Random Oversampling Examples (ROSE)

While oversampling may balance the class distribution in the data, it might not compensate for the lack of information in some cases. One study shows that synthetic samples generated by SMOTE leave the expected value of the minority class unchanged while reducing its variance [4].
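The interpolation in equation 2.2 can be sketched as follows (the neighbor search is omitted; x_zi is assumed to already be one of xi's minority-class nearest neighbors):

```python
import random

def smote_sample(x_i, x_zi, rng=random):
    """x_new = x_i + lam * (x_zi - x_i) with lam ~ U(0, 1): the synthetic
    point lies on the line segment between the two minority samples."""
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(x_i, x_zi)]

new_point = smote_sample([3.8, 3.9], [4.0, 4.1])
```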

K Means SMOTE

K Means SMOTE does the oversampling in three stages:

• Clustering: the data is clustered into k groups using the k-means clustering algorithm

• Filtering: only clusters with a high proportion of minority samples are selected for oversampling

• Oversampling: the filtered clusters are oversampled using equation 2.2

First, the input space is clustered into k groups. Then, in the filtering stage, the algorithm finds clusters where minority samples make up more than 50 percent of the cluster's population. Applying SMOTE to these clusters reduces the chance of generating noisy samples. Moreover, the goal is also to achieve a balanced distribution of samples within the minority class. In other words, if there exist multiple clusters of minority samples, we want them to be oversampled to the same extent. Therefore, the filter step allocates more generated samples to sparse minority clusters rather than dense ones [17] (see figure 2.5). The imbalance ratio threshold (irt) hyperparameter sets the threshold for equation 2.3. Equation 2.3 divides the number of majority samples in cluster c by the number of minority samples in the same cluster. The counts are incremented by 1 to avoid division by zero or ir = 0.

Figure 2.4: Oversampling with SMOTE

ir = (majorityCount(c) + 1) / (minorityCount(c) + 1)    (2.3)

Lowering the threshold for ir makes the filtering more sensitive, requiring a cluster to have a higher proportion of minority instances to be selected; increasing the threshold has the opposite effect [17].
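The filtering stage built on equation 2.3 can be sketched as follows. This is an illustrative reimplementation with our own function and variable names, not the reference code of [17]; the imbalanced-learn package ships a complete KMeansSMOTE sampler:

```python
import numpy as np
from collections import Counter

def filter_clusters(cluster_labels, y, irt=1.0, minority_label=1):
    """Keep clusters whose imbalance ratio (equation 2.3) is below the
    threshold irt, i.e. clusters dominated by the minority class."""
    selected = []
    for c in np.unique(cluster_labels):
        counts = Counter(y[cluster_labels == c])
        majority = sum(n for label, n in counts.items() if label != minority_label)
        minority = counts.get(minority_label, 0)
        ir = (majority + 1) / (minority + 1)  # +1 avoids division by zero
        if ir < irt:
            selected.append(int(c))
    return selected

# toy example: cluster 1 is mostly minority, cluster 0 mostly majority
clusters = np.array([0, 0, 0, 1, 1, 1])
labels = np.array([0, 0, 1, 1, 1, 0])
kept = filter_clusters(clusters, labels, irt=1.0)
```

With the default threshold of 1.0, only clusters in which minority samples outnumber majority samples survive the filter, matching the "more than 50 percent" rule described above.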

SVM SMOTE

This method uses a Support Vector Machine (SVM) to select xi for resampling. In classification tasks, SVMs draw a hyperplane that has the maximum distance to the closest point(s) in each class. The hyperplane acts as an n-dimensional decision boundary that is usually accompanied by soft margins on each side. Points on or in between the soft margins and the hyperplane are referred to as support vectors. SVM SMOTE uses an SVM to first identify the support vectors and then uses them to generate new samples [23]. The motivation behind this is that support vectors have a significant effect on the position of the separating hyperplane and on classification performance in general (see figure 2.6). For further details on SVMs please refer to section 2.4.

Borderline SMOTE1

Borderline SMOTE1 first checks the labels of xi's m nearest neighbors. It then proceeds to classify xi as one of the following:


Figure 2.5: Oversampling with K Means SMOTE

Figure 2.6: Oversampling with SVM SMOTE


Figure 2.7: Oversampling with Borderline SMOTE 1

• noise - all of the m nearest neighbors are from a different class than xi.
• in danger - at least half of the m nearest neighbors are from a different class than xi.
• safe - more than half of the m nearest neighbors are from the same class as xi.

The algorithm selects a minority sample in danger and its k nearest neighbors from the same class to generate a new sample. This means new samples are generated closer to the minority class border. It also prevents selecting noisy samples for resampling, which occurs in the original SMOTE algorithm. However, it still uses equation 2.2 to generate new samples [11] (see figure 2.7).

Borderline SMOTE2

This algorithm operates similarly to Borderline SMOTE1, except that it considers the k nearest neighbors of xi to be from any class. The value of λ here ranges from 0 to 0.5, as opposed to 0 to 1 in Borderline SMOTE1. When λ < 0.5, the new sample is generated closer to the minority class [11]. An increased number of samples near the decision boundary tends to improve classification performance, which is the main motivation behind Borderline SMOTE2 (see figure 2.8). Figure 2.9 compares the behavior of Random Oversampling, ROSE, and different variants of SMOTE.


Figure 2.8: Oversampling with Borderline SMOTE 2

ADASYN

Adaptive Synthetic oversampling (ADASYN) is a special case of SMOTE. It generates new samples proportional to the number of samples that are not from the same class as xi in a given neighborhood [13]. Steps performed by ADASYN are described in algorithm 1.

Algorithm 1: ADASYN

1. Calculate the number of synthetic samples G to be generated: G = (ml − ms) × β

   a) ml and ms are the number of majority and minority samples respectively.

   b) β specifies the desired balance level. β = 1 creates a perfectly balanced dataset.

2. For each xi in the minority class calculate ri = Δi / K

   a) Δi is the number of samples among the K nearest neighbors of xi that belong to the majority class.

3. Calculate r̂i by normalizing ri as r̂i = ri / Σ_{i=1}^{ms} ri (the r̂i form a categorical distribution)

4. Calculate the number of samples to be generated for each minority sample as gi = r̂i × G

5. For each minority sample, generate gi new samples using equation 2.2

One can consider r̂i as a weight for each minority sample. It can thus be observed that ADASYN forces the learning algorithm to focus on samples that are originally more difficult to learn [13].
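Steps 1 to 4 of algorithm 1 can be sketched as below; this is a simplified illustration with our own function names (step 5 would then reuse the interpolation of equation 2.2), while imbalanced-learn provides a full ADASYN implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_allocation(X, y, K=5, beta=1.0, minority_label=1):
    """Compute how many synthetic samples each minority point should receive
    (steps 1-4 of algorithm 1)."""
    X_min = X[y == minority_label]
    m_l = np.sum(y != minority_label)          # number of majority samples
    m_s = len(X_min)                           # number of minority samples
    G = (m_l - m_s) * beta                     # step 1
    nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
    _, idx = nn.kneighbors(X_min)              # first neighbor is the point itself
    r = np.array([np.mean(y[nb[1:]] != minority_label) for nb in idx])  # step 2
    r_hat = r / r.sum()                        # step 3
    return np.rint(r_hat * G).astype(int)      # step 4

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 0.5, size=(10, 2))
X_min = np.array([[5.0, 5.0], [5.0, 5.5], [0.1, 0.1]])  # last point sits in majority territory
X = np.vstack([X_maj, X_min])
y = np.array([0] * 10 + [1] * 3)
g = adasyn_allocation(X, y, K=5)
```

In the toy data, the minority point placed among the majority cluster has all-majority neighborhoods and therefore receives the largest allocation, illustrating how ADASYN emphasizes hard-to-learn samples.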


Figure 2.9: Comparison of Random oversampling with different variants of SMOTE

Combination of Oversampling and Undersampling

Resampling methods can be combined to make a hybrid solution of over and under sampling. One popular combination is SMOTE-Tomek link that combines SMOTE oversampling with Tomek link undersampling.

Tomek link has been described by Batista et al. [1] as:

Given two examples x and y belonging to different classes, and let d(x, y) be the distance between x and y. A pair (x, y) is called a Tomek link if there is not a case z such that d(x, z) < d(x, y) and d(y, z) < d(y, x). If two examples form a Tomek link, then one of these examples is noise or both examples are borderline (near the class border).

By removing Tomek links, the space is cleaned of noisy data, thus producing better decision boundaries for oversampling.
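A minimal sketch of Tomek link detection, assuming Euclidean distance and the mutual-nearest-neighbor definition quoted above (the helper name is ours, not the implementation of [1]; imbalanced-learn provides TomekLinks and a combined SMOTETomek sampler):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_links(X, y):
    """Return index pairs (i, j) that are each other's nearest neighbor
    while carrying different class labels, i.e. Tomek links."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]  # column 0 is the point itself
    links = []
    for i, j in enumerate(nearest):
        # mutual nearest neighbors of opposite class; i < j avoids duplicates
        if i < j and y[i] != y[j] and nearest[j] == i:
            links.append((int(i), int(j)))
    return links

# points 0 and 1 are close together but labeled differently -> a Tomek link
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [10.0, 10.0]])
y = np.array([0, 1, 0, 1])
links = tomek_links(X, y)
```
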


Weight Correction

Training the model on a resampled synthetic dataset changes the class distribution in the training set, while the classes in the validation/test set still follow the original imbalanced distribution. Consider the scenario where we oversample the minority class in the training set until it has as many samples as the majority class: each class then gets a weight (prior probability) of 0.5. This misleads the model when predicting labels on unseen data, as it assumes both classes are equally probable, which is not true.

To compensate for this problem, one can take the posterior probabilities returned by the classifier, first divide them by the class fractions in the training set, and then multiply by the class fractions in the validation/test set. Finally, the new values need to be normalized to ensure that the new posterior probabilities sum to one [3]. This is mathematically represented in equation 2.4.

p(y|x) = p(x|y) p(y) / p(x) = [q(x) / p(x)] · q(y|x) · p(y) / q(y) = q(y|x) · [p(y) / q(y)] · Z    (2.4)

The term q(y|x) represents the posterior probabilities returned by the model, p(y) is the probability of the positive class in the test/validation set, and q(y) is the probability of the positive class in the training set. Therefore, p(y)/q(y) is the correction term that is multiplied by the posterior probabilities. Z contains the normalizing constant for every y. In a binary task, 1/Z is calculated according to equation 2.5.

1/Z = q(y=0|x) · p(y=0)/q(y=0) + q(y=1|x) · p(y=1)/q(y=1)    (2.5)
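Equations 2.4 and 2.5 amount to a rescale-and-renormalize operation on the predicted probabilities. A sketch (the function name is ours):

```python
import numpy as np

def correct_posteriors(q_posterior, q_prior, p_prior):
    """Rescale posteriors q(y|x) from a model trained under priors q(y) to
    the deployment priors p(y), then renormalize (equations 2.4 and 2.5)."""
    q_posterior = np.atleast_2d(np.asarray(q_posterior, dtype=float))
    unnormalized = q_posterior * (np.asarray(p_prior) / np.asarray(q_prior))
    # dividing by the row sums applies the normalizer Z
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

# model trained on a 50/50 resampled set; the real class balance is 90/10
corrected = correct_posteriors([[0.3, 0.7]], q_prior=[0.5, 0.5], p_prior=[0.9, 0.1])
```

Note how a raw posterior of 0.7 for the minority class drops below 0.5 once the true priors are taken into account, which can flip the predicted label.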

2.4 Classifiers

The theory in the preceding sections was concerned with handling the input data: preprocessing and resampling methods were discussed to show how such a dataset can be prepared as input to a machine learning model. The next objective of this thesis is to search, tune and evaluate several binary classifiers and find the best combination of oversampling method and classifier. In this section, a number of linear and non-linear classifiers will be discussed in detail.

Decision Trees

Tree-based models, also known as Decision Trees, are widely used for both classification and regression problems. Decision Trees are best known for their interpretability and their ability to handle non-linear problems. There are different decision tree algorithms, including CART (Classification And Regression Trees), ID3, and C4.5, which differ in several respects, including how they handle overfitting. In general, a decision tree follows a sequential binary splitting procedure to build the tree.

Consider a predictor space with N samples and p dimensions (X1, ..., Xp). The algorithm first chooses a splitting value s for the jth predictor Xj to divide the predictor space into R1 and R2 so that R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj ≥ s}. R1 and R2 are the two new subspaces, which are also known as terminal nodes. The algorithm repeats this step for every subspace until a stopping criterion is met. Stopping can be triggered, for instance, when the tree reaches a certain depth or when a leaf node contains fewer than a minimum number of samples. The choice of Xj and s at each step of the algorithm is made so that the error within the resulting subspaces is minimized.


In regression tasks, the predicted value of a new point is the mean value of the subspace it falls into. At every step, the model aims to minimize the Sum of Squared Residuals (RSS), presented in equation 2.6.

RSS = Σ_{i: xi ∈ R1(j,s)} (yi − ŷR1)² + Σ_{i: xi ∈ R2(j,s)} (yi − ŷR2)²    (2.6)

where ŷR1 and ŷR2 are the mean values within the R1 and R2 subspaces. Classification trees, on the other hand, use the majority class label within each subspace to predict the class of a new point. For choosing Xj and s, the model attempts to minimize either equation 2.7 (Entropy) or equation 2.8 (Gini Index).

D = −Σ_{k=1}^{K} p̂mk log p̂mk    (2.7)

G = Σ_{k=1}^{K} p̂mk (1 − p̂mk)    (2.8)

where p̂mk is the proportion of training data points in the mth subspace that are from the kth class. Both criteria try to form regions where the majority of points belong to one class [3]. In other words, they attempt to minimize node impurity. Although trees are easily interpreted, experiments show that learned trees are very sensitive to the details of the input data: a small change to the training data can result in a very different set of splits [14]. One way to tackle the sensitivity of the model towards the training data is pruning. In the cost complexity pruning method, trees with too many terminal nodes are penalized by a factor α. The value of α can control the bias-variance trade-off in the case of an overfit or underfit model. The complexity of the model is inversely proportional to the value of α. In contrast to the top-down approach for building trees, pruning begins at the bottom, from the leaf nodes back to the root of the tree [14].
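The two impurity criteria in equations 2.7 and 2.8 are straightforward to compute for a single node. A minimal sketch using natural logarithms, with our own function names:

```python
import numpy as np

def entropy(p_mk):
    """Entropy criterion, equation 2.7 (0 * log(0) is taken as 0)."""
    p = np.asarray(p_mk, dtype=float)
    p = p[p > 0]  # skip empty classes to avoid log(0)
    return -np.sum(p * np.log(p))

def gini(p_mk):
    """Gini index, equation 2.8."""
    p = np.asarray(p_mk, dtype=float)
    return np.sum(p * (1.0 - p))

pure = [1.0, 0.0]    # all points in the node belong to one class
mixed = [0.5, 0.5]   # maximally impure binary node
```

Both criteria are zero for a pure node and maximal for an even class split, which is why minimizing them pushes the tree toward single-class regions.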

Random Forest Classifier

Random Forest is an ensemble method that was introduced to address the overfitting issue of Decision Trees. Random Forest uses Bagging (Bootstrap Aggregating) with decision trees to provide more robust results. Consider an input dataset with N samples and p features. The Random Forest model is described in algorithm 2.

Algorithm 2: Random Forest

1. For b = 1, 2, ..., B repeat:

   a) Create a bootstrap sample b of size N by sampling from the original data with replacement.

   b) Fit a decision tree to b. At each split of the tree, choose m random features from X, where m ≤ p, and calculate the prediction f̂b.

2. For regression, calculate the average prediction

   f̂bag(X) = (1/B) Σ_{b=1}^{B} f̂b(X)

3. For classification, take the majority vote as the final prediction.

If the model trained on each bootstrap has variance σ², the variance of the mean over all bootstraps (f̂bag) is σ²/B, which is smaller than or at most equal to each individual variance.


Figure 2.10: Support Vector Machine with linear decision boundary

This shows that bagging reduces model variance. Random Forests are also able to estimate the expected error without cross-validation: on average, and depending on the size of the bootstrap, about one-third of the data points are not selected for a given bootstrap. These are referred to as out-of-bag samples, and they can in turn be used as an unseen test dataset to estimate the generalization error [14]. Random Forests and Decision Trees are highly flexible models. The depth of the tree(s), the number of leaf nodes, the number of samples in every leaf node, and the proportion m of features used to fit the tree(s) have significant impacts on the behavior of the final model.
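The out-of-bag error estimate described above is exposed directly by scikit-learn's RandomForestClassifier through the oob_score option; the dataset below is synthetic and only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# oob_score=True evaluates each tree on the ~1/3 of samples left out of its
# bootstrap, giving a built-in generalization estimate without a held-out set
forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # the m < p random features tried at each split
    oob_score=True,
    random_state=0,
)
forest.fit(X, y)
oob_estimate = forest.oob_score_
```
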

Support Vector Classifier

Support Vector Classifiers (SVCs) or Support Vector Machines (SVMs) were originally designed for linear classification. Consider an input space with n observations and j dimensions. The SVC tries to find a hyperplane that separates the observations. Ideally, the hyperplane has the largest possible margin to the observations within each class. A major drawback, however, is that such a hyperplane becomes extremely sensitive to the training observations: adding or removing observations from the training set can considerably change the position of the hyperplane. To address this, a soft margin is drawn parallel to the hyperplane on each side. Points that are either on or inside the soft margins are called Support Vectors. The role of the soft margins is to allow a desired level of misclassification in the model in order to reduce the generalization error [14].

As illustrated in figure 2.10, with soft margins we allow some observations (support vectors) to be on the wrong side of the margins or even on the wrong side of the decision boundary. This reduces the model's sensitivity to the training data. The size of the margins is controlled by C, the Cost parameter. When C → ∞, no misclassification is tolerated and the soft


margins are removed. This results in a model with high variance. On the contrary, when C → 0, the margins are widened, allowing more misclassified observations. This also implies that the number of support vectors is inversely proportional to the value of C [14].

Extending Support Vector Classifiers to non-linear problems can increase computational costs significantly. This is because training an SVC to find the optimal decision boundary involves solving a quadratic optimization problem. If, for instance, we decide to enlarge the feature space to a higher-order polynomial, we can end up with a huge number of terms to compute in the optimization problem. To overcome this issue, SVCs use kernel functions to handle non-linear problems efficiently [14].

Kernel Functions

Kernel functions quantify the similarity between two observations. The linear kernel calculates the inner product between pairs of training observations. It turns out, however, that only the inner products between pairs of support vectors have an impact on the model. Therefore, it is not required to compute the inner products between all possible pairs in the training set. Consider x to be a vector of P dimensions; equation 2.9 represents the linear kernel. Note that this thesis uses bold notation to represent vectors.

K(xi, xi′) = Σ_{j=1}^{P} xij xi′j    (2.9)

To move beyond linearity, a polynomial kernel (equation 2.10) of order d can be used instead.

K(xi, xi′) = (1 + Σ_{j=1}^{P} xij xi′j)^d    (2.10)

A more flexible kernel function is the radial kernel or Radial Basis Function (RBF), shown in equation 2.11.

K(xi, xi′) = exp(−γ Σ_{j=1}^{P} (xij − xi′j)²)    (2.11)

The RBF kernel is based on the Euclidean distance between two observations. If the distance is large, the exponential in equation 2.11 returns a small value, meaning observations far from the decision boundary are effectively ignored. γ = 1/(2σ²) is a smoothing coefficient, and the value of σ in the denominator controls the shape of the kernel.
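As a sanity check, the radial kernel of equation 2.11 can be computed by hand and compared against scikit-learn's rbf_kernel; the example vectors below are arbitrary:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 0.0]])
x_prime = np.array([[1.0, 2.0]])
gamma = 0.5

# equation 2.11 computed directly
manual = np.exp(-gamma * np.sum((x - x_prime) ** 2))

# the same value from scikit-learn's implementation
from_sklearn = rbf_kernel(x, x_prime, gamma=gamma)[0, 0]
```

The kernel value shrinks toward 0 as the squared Euclidean distance grows, which is the "far observations are ignored" behavior described above.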

Logistic Regression

Linear Regression uses the Ordinary Least Squares (OLS) method to fit a straight line to the training data. The predicted variable ŷ is modeled directly as a continuous variable. For every observation of X1, ..., Xp the target value is calculated using equation 2.12.

ŷ = β0 + β1x1 + ... + βPxP    (2.12)

where β0 is the y-intercept (the value of ŷ when all predictors are zero) and β1, ..., βP are the model coefficients. Every

βi determines how much ŷ changes for one unit of change in xi. Logistic Regression, on the other hand, calculates a conditional probability between 0 and 1. This makes it suitable for binary classification problems. For this purpose, it uses the logistic function to calculate the probabilities [14].

p(Y|X) = e^(β0 + β1X1 + ... + βPXP) / (1 + e^(β0 + β1X1 + ... + βPXP))    (2.13)


Figure 2.11: Logistic curve

Remember that X is a vector with p dimensions. The logistic function can be re-written in the log-odds form as in equation 2.14.

log[p(X) / (1 − p(X))] = β0 + β1X1 + ... + βPXP    (2.14)

The logistic function creates an S-shaped output that represents probabilities from 0 to 1. This indicates that although changes in X change the log-odds linearly, the relationship to p is non-linear. Logistic Regression uses Maximum Likelihood Estimation (MLE) to learn the model parameters [14]. When MLE is applied to the Logistic Regression model, it attempts to minimize the expression in equation 2.15.

L(θ) = −(y log(ŷ) + (1 − y) log(1 − ŷ))    (2.15)

Equation 2.15 is the negative log of the likelihood function, or the log loss. Note that y and ŷ indicate the true and predicted class labels respectively. θ is the vector of parameters that the model is trying to learn. Therefore, L(θ) quantifies the loss of the model. One common way to solve the optimization problem defined by MLE is through a technique known as gradient descent.

Gradient descent is an iterative process. It calculates the gradient of the loss function over the training observations and moves in the opposite direction (downhill) of the gradient to find a local minimum. If that local minimum is also lower than all other local minima, or the loss function is convex (has a single minimum), the algorithm may be able to find the global minimum. As expressed in equation 2.16, at each step the gradient ∇L(θ) is multiplied by a small step size η and subtracted from the current value of θ, until ∇L(θ) is smaller than a minimum threshold.

θ ← θ − η ∇L(θ)    (2.16)
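The update rule of equation 2.16 applied to the log loss of equation 2.15 yields a compact gradient descent trainer for logistic regression. This is a didactic sketch with our own names; library implementations use more robust optimizers:

```python
import numpy as np

def sigmoid(z):
    """Logistic function of equation 2.13 applied to the linear score."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, n_iter=5000):
    """Minimize the log loss (equation 2.15) with batch gradient descent:
    theta <- theta - eta * grad(L)  (equation 2.16)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)  # gradient of the log loss
        theta = theta - eta * grad
    return theta

# toy 1-D problem with an explicit intercept column
X = np.column_stack([np.ones(4), [-2.0, -1.0, 1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)
probs = sigmoid(X @ theta)
```
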


Regularization

In Linear and Logistic Regression models, the bias-variance trade-off is usually controlled by regularization. L1-regularization (Lasso Regression) and L2-regularization (Ridge Regression) are two basic regularization methods. Considering equation 2.12 for Linear Regression with N observations and p dimensions, L2-regularization is applied to the loss function as in equation 2.17. Note that both methods can also be applied to the logistic function in equation 2.13 for the Logistic Regression model.

L(θ) + λ Σ_{j=1}^{P} βj²    (2.17)

The first term in equation 2.17 is the log loss presented in equation 2.15 and the second term is the penalty. The penalty term becomes larger when the values of β are large. The λ parameter controls the level of regularization: when λ = 0 the penalty term vanishes and no regularization is performed, and as λ → ∞ the regularization becomes stronger, decreasing the complexity of the model. L1-regularization, presented in equation 2.18, works similarly to L2-regularization except that it uses the absolute values of β in the penalty term. This forces some of the β estimates to be exactly 0, while in L2-regularization the coefficients can become close to 0 but never exactly 0. This also makes L1-regularization suitable for the task of feature selection [14].

L(θ) + λ Σ_{j=1}^{P} |βj|    (2.18)

Another regularization method, the Elastic Net, combines the L1 and L2 regularization methods to overcome the limitations of both. While L2 is not able to perform feature selection like L1, there are certain limitations in how L1 removes features: for highly correlated features, for instance, L1 keeps one of the features and omits the rest. Equation 2.19 is the penalty term for the Elastic Net. The second term in the equation averages highly correlated features, while the first term provides a sparse solution in the coefficients of the averaged features [14].

Σ_{j=1}^{P} (α|βj| + (1 − α)βj²)    (2.19)

The α parameter in equation 2.19 controls the contribution of L1 and L2 methods in the regularization task.
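The Elastic Net penalty of equation 2.19, scaled by λ as in equations 2.17 and 2.18, is easy to evaluate directly (the function name is ours):

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """Penalty of equation 2.19 scaled by lambda: alpha weights the L1
    part, (1 - alpha) weights the L2 part."""
    beta = np.asarray(beta, dtype=float)
    return lam * np.sum(alpha * np.abs(beta) + (1.0 - alpha) * beta ** 2)

beta = [1.0, -2.0]
pure_l1 = elastic_net_penalty(beta, lam=1.0, alpha=1.0)  # lasso limit: |1| + |-2| = 3
pure_l2 = elastic_net_penalty(beta, lam=1.0, alpha=0.0)  # ridge limit: 1 + 4 = 5
```

Setting α = 1 or α = 0 recovers pure L1 and pure L2 regularization respectively, with intermediate values blending the two.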

K Nearest Neighbors (KNN)

K Nearest Neighbors (KNN) is a supervised non-parametric machine learning model. For a test observation xi and a given K, it finds the K nearest points to xi. In classification, xi is assigned the majority class label of its K neighbors; in regression, xi is assigned the average value of its K neighbors. When K is small, the model becomes more sensitive to the training observations and thus has higher variance.

The behavior of the KNN classifier can also be explained using Bayes' theorem [3]. Assume we have N observations and k classes. To classify xi using its K neighbors, we draw a sphere that is centered on xi and contains all K neighbors. Let V be the volume of the sphere and Nk the number of points belonging to class Ck. Also, Kk denotes the number of the K nearest neighbors that belong to class Ck. The probability density for each class (conditional probability) can be expressed as

p(X|Ck) = Kk / (Nk V)

and the overall density of X (unconditional probability) as


p(X) = K / (N V)

and the class prior probability as

p(Ck) = Nk / N

Now using equation 2.20 we can calculate the posterior probability density of the class label given the K neighbors.

p(Ck|X) = p(X|Ck) p(Ck) / p(X)    (2.20)

The class Ck with the highest posterior probability density is assigned to the test observation [3]. There are different metrics to calculate the distance between xi and its K nearest neighbors. Equation 2.21 represents the Minkowski distance; the Euclidean and Manhattan distance metrics are special cases of the Minkowski distance with P = 2 and P = 1 respectively.

d(p, q) = (Σ_{i=1}^{n} |pi − qi|^P)^(1/P)    (2.21)

KNNs can handle non-linear classification tasks quite well. However, they tend to underperform on imbalanced data. This is because samples from the majority class are more frequent, so it is likely that most of the K neighbors of a test observation are from the majority class. One way to tackle this problem is to use a weighted KNN. In this method, the label of each of the K neighbors is weighted proportionally to the inverse of its distance to the test observation. In non-weighted KNN, points among the K nearest neighbors are uniformly assigned a weight of 1 while the rest get a weight of 0 [12].
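Distance-weighted KNN as described above is available in scikit-learn through the weights parameter. In the toy example below (our own data), the two nearby minority neighbors outvote a distant majority neighbor:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# imbalanced toy data: three majority points near the origin, two minority points far away
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y = np.array([0, 0, 0, 1, 1])

# weights="distance" multiplies each neighbor's vote by the inverse of its
# distance; p=2 is the Euclidean special case of the Minkowski distance (eq. 2.21)
knn = KNeighborsClassifier(n_neighbors=3, weights="distance", p=2)
knn.fit(X, y)
prediction = knn.predict([[4.8, 5.2]])
```

With uniform weights the single far-away majority neighbor would count as much as each nearby minority neighbor; distance weighting makes its contribution negligible.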

Gradient Boosting

Gradient Boosting is an ensemble model. The idea of boosting is to combine the output of several weak learners to produce more robust results. Learners whose error rates are only slightly better than random guessing are considered weak learners. One of the most commonly used boosting methods is AdaBoost, which was designed for binary classification. Consider a classifier G(x)

for a binary task with N observations. In AdaBoost, a sequence of weak learners (for example classification trees) Gm(x), m = 1, 2, ..., M is created, one at every step. These trees have only a single split and are also known as stumps. The final prediction is a weighted average of the predictions of all the stumps.

G(x) = sign(Σ_{m=1}^{M} αm Gm(x))    (2.22)

αm determines the contribution of Gm(x). The algorithm starts by initializing the weights of the training points uniformly as 1/N. Then at every step a classifier is trained on the data; the weights of misclassified points are increased, while correctly classified points are given smaller weights. This forces the algorithm to focus on observations that are more difficult to learn [12].

Since AdaBoost gives a much higher influence to misclassified points, it is very sensitive to outliers, which degrades performance when the model is trained on noisy data. Gradient Boosting addresses this issue and also extends boosting to both regression and classification tasks. Gradient Boosting is a generalization of AdaBoost. It can


take any differentiable function as the loss function and use the gradient descent method for optimization (see equation 2.16). For the general loss function in equation 2.23

L(f) = Σ_{i=1}^{N} L(yi, f(xi))    (2.23)

the derivative (gradient) with respect to f(xi) is calculated at every observation according to equation 2.24.

g_im = ∂L(yi, f(xi)) / ∂f(xi)    (2.24)

Algorithm 3, taken from [12], describes Gradient Boosting for regression. In classification with k different classes, steps 2(a) to 2(d) are performed for each of the k classes. Note that for both regression and classification problems, Gradient Boosting grows a regression tree at each step. In step 1 of the algorithm, the initial predictions are computed; this can either be done through an external estimator or by computing the log-odds value of the target variable. In step 2(a) we calculate the pseudo-residuals, that is, the differences between the observed and the predicted values for each tree. In 2(b) the next regression tree is grown on the residuals of the previous tree. In regression, the residuals are also the predicted values for each step, whereas in classification we need to calculate probabilities for prediction. Steps 2(c) and 2(d) involve minimizing the loss function and updating the predictions for each grown tree. For classification with k classes, the final output in step 3 is k different tree expansions [12].

Algorithm 3: Gradient Boosting

1. Initialize f0(x) = argmin_γ Σ_{i=1}^{N} L(yi, γ)

2. For m = 1, ..., M:

   a) For i = 1, 2, ..., N compute:

      rim = −[∂L(yi, f(xi)) / ∂f(xi)], evaluated at f = f_{m−1}

   b) Fit a regression tree to the targets rim, giving terminal regions Rjm, j = 1, 2, ..., Jm

   c) For j = 1, 2, ..., Jm compute:

      γjm = argmin_γ Σ_{xi ∈ Rjm} L(yi, f_{m−1}(xi) + γ)

   d) Update fm(x) = f_{m−1}(x) + Σ_{j=1}^{Jm} γjm I(x ∈ Rjm)

3. Output f̂(x) = fM(x)

Trees grown with Gradient Boosting have more leaf nodes than the stumps in AdaBoost, which have only two. The algorithm can also be tuned to use only a portion of the training observations at each step, which results in Stochastic Gradient Boosting. This approach can control the bias-variance trade-off: a smaller sub-sample fraction decreases variance while increasing bias [12].
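Stochastic Gradient Boosting as described above corresponds to setting subsample below 1.0 in scikit-learn's GradientBoostingClassifier; the data and settings below are synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# subsample < 1.0 turns this into Stochastic Gradient Boosting:
# each tree is grown on a random 80% of the training points
gbc = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,       # trees deeper than AdaBoost's two-leaf stumps
    subsample=0.8,
    random_state=0,
)
gbc.fit(X, y)
train_accuracy = gbc.score(X, y)
```
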


2.5 Feature Ranking

The main purpose of using a data-driven model for lead scoring is to let data decide the weight of every user action or characteristic. This section introduces two feature importance calculation techniques and then proceeds to describe two correlation-based ranking methods.

Feature Importance with Decision Trees

The level of contribution of each feature can be estimated when building a decision tree. In short, the importance of a feature is calculated from the reduction of the Gini or entropy criterion at every split point [18] (see section 2.4 for more details on decision trees). However, this method does not work well with high-cardinality features (numerical or categorical features with many unique values): when the model is overfitting, it can give high importance to features that are not predictive on unseen data and thus have no generalization power [5]. Calculating feature importance values is also possible with other tree-based models like Random Forests. However, this thesis is not interested in discussing the details of tree-based feature importance methods. Instead, the Permutation Feature Importance method is introduced as a more practical solution for calculating feature importance scores.

Permutation Feature Importance

The Permutation Importance algorithm, proposed by Breiman et al. [5], addresses the generalization problem of the tree-based importance methods.

Algorithm 4: Permutation Importance

1. Inputs: dataset D, predictive model m

2. Compute a reference score s of the model m on D (accuracy, ROC-AUC, etc.)

3. For each feature j in D:

   a) For k in 1, ..., K:

      i. Randomly shuffle the values in column j to create an altered version of the dataset D̃j,k

      ii. Compute the score sk,j of model m on D̃j,k

   b) Compute the importance ij of the permuted feature as:

      ij = s − (1/K) Σ_{k=1}^{K} sk,j

The importance of the jth feature is determined by how much the model's base score s changes due to shuffling the values in j. The shuffling step breaks the association between the feature values and the target variable, so importance is measured solely by how much the model depends on that specific feature. Features may sometimes appear more important on the training set than on the test/validation set. Therefore, it is good practice to train the model on the training set and calculate the importance scores on a held-out set to improve generalization power [5].
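Algorithm 4 is implemented in scikit-learn as permutation_importance. Following the advice above, the sketch below fits the model on a training split and computes the scores on a held-out split (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# importance computed on the held-out set, with K=10 shuffles per feature
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
importances = result.importances_mean
```
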

Pearson Correlation Coefficient

Pearson correlation measures the linear correlation between each feature and the response variable. It can take values from -1 to 1. A value close to zero indicates an insignificant correlation, while a value of higher magnitude, irrespective of the sign, represents a higher


correlation between the feature and the response. The Pearson correlation is calculated by dividing the covariance of the feature-response pair by the product of their standard deviations.

Jcc(Xj) = [Σ_{i=1}^{N} (xij − X̄j)(ci − c̄)] / (σXj σc)    (2.25)

For every feature Xj and class label c, the covariance is calculated and summed over all N observations. This value is then divided by the product of the feature and class standard deviations (σXj σc). Note that in equation 2.25, c̄ denotes the probability of observing a sample

from class 1. Since rankings near -1 are as informative as rankings close to 1, we take the absolute values of the rankings [16].

Fisher Coefficient

This ranking method is based on Fisher's Discriminant Analysis, which reduces to Linear Discriminant Analysis in the case of binary problems. Fisher's discriminant separates classes based on their means and draws a linear discriminant that maximizes the separability while minimizing the within-class variance. The Fisher coefficient is calculated using equation 2.26.

JFSC(Xj) = [X̄j,1 − X̄j,2] / [σj,1 + σj,2]    (2.26)

X̄j,1 and X̄j,2 represent the means of the values in Xj corresponding to class 0 and class 1 respectively; the same holds for the standard deviations in the denominator [16]. As opposed to the feature importance techniques, both the Fisher coefficient and the Pearson correlation coefficient capture a linear relationship between the features and the target variable.
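Both ranking criteria can be computed in a few lines. The helper names are ours, and equation 2.26 is taken here with an absolute value in the numerator so that, like the Pearson rank, a larger value always means a more discriminative feature:

```python
import numpy as np

def pearson_rank(x, c):
    """Absolute Pearson correlation between one feature and the binary label
    (equation 2.25, with the absolute value taken as described above)."""
    return abs(np.corrcoef(x, c)[0, 1])

def fisher_rank(x, c):
    """Fisher coefficient (equation 2.26): class-mean separation divided by
    the summed class standard deviations."""
    x0, x1 = x[c == 0], x[c == 1]
    return abs(x0.mean() - x1.mean()) / (x0.std() + x1.std())

# a feature whose values are well separated by class scores highly on both
x = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 1.2])
c = np.array([0, 0, 0, 1, 1, 1])
```
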

2.6 Weighting Schemes

Different ranking methods produce different types of values and therefore need to be normalized by some scheme to be converted into weights [16]. Below are the two simple weighting schemes used in this thesis.

Normalized Max Filter (NMF)

Equation 2.27 states that for positive ranks (J+), the absolute value of the rank is divided by the maximum rank. For negative ranks (J−), the absolute value of the rank is subtracted from the sum of the maximum and minimum ranks, and the result is divided by the maximum rank.

W_NMF(J) = |J| / Jmax                      for J+
W_NMF(J) = [Jmax + Jmin − |J|] / Jmax      for J−    (2.27)

Normalized Range Filter (NRF)

This scheme is nearly identical to NMF, except that in both the positive and negative cases the minimum rank is added to the numerator and the denominator. With NRF, weights lie in [2Jmin / (Jmax + Jmin), 1] [16].

W_NRF(J) = [|J| + Jmin] / [Jmax + Jmin]              for J+
W_NRF(J) = [Jmax + 2Jmin − |J|] / [Jmax + Jmin]      for J−    (2.28)


                              True 0                 True 1
Predicted 0          True Negative (TN)    False Negative (FN)    N*
Predicted 1          False Positive (FP)   True Positive (TP)     P*
                              N                       P

Table 2.3: Confusion matrix for binary classification

2.7 Model Selection and Evaluation

The procedure of building a regression or classification model begins with shortlisting a set of candidate models that may suit the specifications of the problem. Then a number of model tuning and evaluation steps have to be taken to find the best setting of the best-performing model. The interpretation of "the best" depends entirely on the nature of the problem; there is no one-size-fits-all evaluation metric. Therefore, this section starts by introducing a number of essential metrics and motivating why some are preferred in this thesis. Finally, some popular model tuning and validation methods will be discussed.

Metrics

In classification problems, a common approach to evaluate model performance is through the confusion matrix. The confusion matrix of a binary classifier consists of the components displayed in table 2.3.

Several evaluation metrics can be calculated from the confusion matrix. Here a number of these metrics and their respective applications will be described.

FPR and FNR

    FPR = FP / N    (2.29)

    FNR = FN / P    (2.30)

FPR, also known as the probability of false alarm, specifies how often the model incorrectly classifies a sample from class 0 as class 1. In contrast, FNR shows the probability of the model incorrectly classifying a sample from class 1 as class 0. There is usually a trade-off between FPR and FNR, so the two scores cannot be improved simultaneously.
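These rates can be computed directly from the confusion matrix counts. Below is a minimal sketch on hypothetical predictions, with class 0 as the common (negative) class; the label vectors are made up for illustration.

```python
import numpy as np

# Hypothetical labels and predictions for a binary classifier.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
tp = np.sum((y_true == 1) & (y_pred == 1))

fpr = fp / (fp + tn)   # equation 2.29: FP over all true negatives N
fnr = fn / (fn + tp)   # equation 2.30: FN over all true positives P
```

Here 2 of the 6 true negatives are flagged as positives (FPR = 1/3) and 1 of the 4 true positives is missed (FNR = 1/4).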

Precision and Recall

The precision and recall metrics can be computed from the components in table 2.3. Precision is obtained by dividing the number of correctly predicted positive points by the number of all points predicted as positive. A larger value indicates that when the model predicts the positive class, the prediction is usually correct.

    Precision = TP / (TP + FP)    (2.31)

Recall is calculated by dividing the number of correctly predicted positive points by the number of all actual positive points (regardless of whether they were correctly predicted or not). This ratio indicates what portion of the true positive labels is identified by the model.

    Recall = TP / (TP + FN)    (2.32)


[Figure: two panels. Left: ROC Curve with False Positive Rate on the x-axis, True Positive Rate on the y-axis, and a random-guess diagonal. Right: Precision-Recall Curve with Recall on the x-axis and Precision on the y-axis.]

Figure 2.12: A comparison of ROC Curve (left) and Precision-Recall Curve (right)

Similar to FPR and FNR, the two scores trade off against each other. The preference for either score depends on how the results are to be interpreted and how the model is to be used in practice. An alternative solution is to use the F1 score.

F1 score

F1 aims to find the right balance between Precision and Recall by combining the two using equation 2.33.

    F1 = (2 · precision · recall) / (precision + recall)    (2.33)

F1 is widely used as an overall as well as a per-class score to evaluate a classification model.
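Equations 2.31 to 2.33 can be sketched on the same hypothetical predictions used earlier; the label vectors below are made up for illustration.

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

precision = tp / (tp + fp)                           # equation 2.31
recall = tp / (tp + fn)                              # equation 2.32
f1 = 2 * precision * recall / (precision + recall)   # equation 2.33
```

With 3 true positives out of 5 positive predictions and 4 actual positives, precision = 0.6, recall = 0.75, and F1 is their harmonic mean, about 0.667.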

Metrics for imbalanced data

Models trained on imbalanced data cannot simply be evaluated through their accuracy (the ratio of correctly predicted points over all points, regardless of the class label), since this score does not take the per-class performance into account. If a problem is imbalanced, with 0 as the most common class, the F1 score is a good alternative for evaluating the model. However, F1 does not distinguish between the severity of FNR and FPR, so one has to decide which score to trade off in favor of the other [20]. In other words, the F1 score does not prioritize either precision or recall.

An alternative solution is threshold tuning. In probabilistic approaches to binary classification, the model calculates the posterior probability of an observation belonging to each of the classes 0 and 1. By default, a fixed cut-off threshold on the probabilities is used to assign the class labels. In threshold tuning, instead of setting a single discriminative threshold, an FPR and a TPR are calculated for many thresholds in the [0, 1] interval. From the obtained values the Receiver Operating Characteristic Curve (ROC Curve) can be plotted. The curve has FPR on the x-axis and TPR on the y-axis (Figure 2.12).

The objective is to maximize the Area Under the (ROC) Curve (ROC-AUC). AUC can take any value between 0 and 1. A value of 0.5 amounts to a random guess, while a value of 1 (a curve hugging the top left corner) represents a perfect model. The suitability of ROC-AUC for imbalanced problems is still debated, because ROC-AUC measures a model's overall performance and does not reflect the per-class performance. A better alternative is the area under the precision-recall curve [20].
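The threshold sweep behind the ROC curve can be sketched as below. This is a minimal hand-rolled sketch on made-up scores; in practice a library routine such as scikit-learn's `roc_curve` would typically be used.

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    """Sweep cut-off thresholds over predicted probabilities and collect
    the (FPR, TPR) pairs that make up a ROC curve."""
    n_neg = np.sum(y_true == 0)
    n_pos = np.sum(y_true == 1)
    pts = []
    for t in thresholds:
        y_pred = (scores >= t).astype(int)
        fp = np.sum((y_true == 0) & (y_pred == 1))
        tp = np.sum((y_true == 1) & (y_pred == 1))
        pts.append((fp / n_neg, tp / n_pos))
    return pts

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
pts = roc_points(y_true, scores, thresholds=[0.0, 0.3, 0.5, 1.1])
```

A threshold of 0 classifies everything as positive, giving the (1, 1) corner, while a threshold above all scores gives (0, 0); the thresholds in between trace out the curve.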

The precision-recall curve (PR curve) has recall scores on the x-axis and precision scores on the y-axis. The PR curve of a good model gets closer to the top right corner of the plot. Depending


on the domain and sensitivity of the problem, a higher recall or precision may be preferable. In any case, the area under the PR curve can better reflect the model performance compared to ROC-AUC when working with imbalanced datasets.

Average Precision

The area under the PR curve is still under question. [9] argues that in ROC space it is possible to linearly connect (interpolate) a value of TPR to a value of FPR. However, the same does not hold for precision-recall space: precision is not necessarily linear in recall, since precision has FP rather than FN in its denominator. In this case, linear interpolation yields an overly optimistic estimate of performance. The incorrect interpolation is most problematic when recall and precision scores are far apart and the local skew is high. The Average Precision (AP) score overcomes this problem. The AP score over n thresholds is calculated as

    AP = Σ_n (R_n - R_{n-1}) P_n    (2.34)

According to equation 2.34, the precision (denoted by P) at the nth threshold is weighted by the change in recall (denoted by R) from threshold n-1 to threshold n. The final score is the sum of the weighted precision scores.
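Equation 2.34 can be sketched directly. This is a minimal sketch on made-up (precision, recall) pairs, assuming the pairs are ordered by increasing recall and that R_0 = 0; scikit-learn's `average_precision_score` implements the same idea.

```python
def average_precision(precisions, recalls):
    """AP per equation 2.34: each precision is weighted by the change in
    recall since the previous threshold (with R_0 taken as 0)."""
    ap = 0.0
    prev_r = 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# Hypothetical (precision, recall) pairs in increasing-recall order.
P = [1.0, 0.8, 0.6]
R = [0.2, 0.5, 1.0]
ap = average_precision(P, R)   # 0.2*1.0 + 0.3*0.8 + 0.5*0.6 = 0.74
```

Because each precision is weighted by a recall increment rather than interpolated linearly, the optimistic bias described above is avoided.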

Model Tuning

Identifying the proper metric to optimize is key to finding the right hyperparameter values. Searching for the best hyperparameter setting is a delicate task; it is not realistic to find the perfect model, since that would involve evaluating a large number of values for every hyperparameter. For large datasets this becomes impractical in terms of time and computational cost. Here, two model tuning approaches are discussed briefly.

Grid Search

Grid search is a popular approach for hyperparameter tuning. Given a set of candidate values for a set of hyperparameters, grid search performs an exhaustive search over all possible combinations of values to find the best setting. For instance, a grid search over 4 values of parameter A (A = (a, b, c, d)) and 3 values of parameter B (B = (e, f, g)) creates 12 unique combinations, so the objective model is fit to the training data 12 times. In every step, a score (average precision, ROC-AUC, etc.) is calculated on the held-out data. The combination that yields the highest score is selected and its corresponding hyperparameter values are returned.

To further minimize the generalization error, grid search can be combined with K-fold Cross Validation. The K-fold cross validation method can be summarized as follows:

• Shuffle the training set and divide it into K different folds
• For k = 1, ..., K repeat:
  – Select one fold as the held-out (test) set
  – Train the model on the remaining K - 1 folds
  – Evaluate the model on the held-out fold
  – Calculate the evaluation score
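The combination of grid search and K-fold cross validation described above can be sketched as below. This is a hand-rolled minimal sketch with a made-up one-parameter "model" (a threshold classifier) purely for illustration; in practice one would typically use scikit-learn's GridSearchCV instead.

```python
import itertools
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle indices and split them into k roughly equal folds."""
    return np.array_split(rng.permutation(n), k)

def grid_search_cv(fit, score, grid, X, y, k=3, seed=0):
    """Exhaustive search over every combination in `grid`, scoring each
    by K-fold cross validation. `fit(X, y, **params)` returns a model;
    `score(model, X, y)` returns a number to maximize."""
    rng = np.random.default_rng(seed)
    folds = kfold_indices(len(y), k, rng)
    best_params, best_score = None, -np.inf
    keys = list(grid)
    for values in itertools.product(*(grid[name] for name in keys)):
        params = dict(zip(keys, values))
        fold_scores = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(X[train_idx], y[train_idx], **params)
            fold_scores.append(score(model, X[test_idx], y[test_idx]))
        mean_score = float(np.mean(fold_scores))
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy "model": classify as 1 when the single feature exceeds a threshold.
def fit(X, y, threshold):
    return {"threshold": threshold}

def score(model, X, y):
    pred = (X[:, 0] >= model["threshold"]).astype(int)
    return float(np.mean(pred == y))   # accuracy on the held-out fold

X = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
params, cv_score = grid_search_cv(fit, score, {"threshold": [0.0, 0.5, 1.0]}, X, y)
```

Here the grid contains 3 candidate thresholds, each evaluated on 3 held-out folds; the threshold 0.5 separates the two classes perfectly and is returned as the best setting.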
