STOCKHOLM, SWEDEN 2020
Predicting Purchase of Airline Seating Using Machine Learning
SEBASTIAN EL-HAGE
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master in Computer Science
Date: June 23, 2020
Supervisor: Erik Fransén, Christer Ogenstad
Examiner: Olov Engwall
Host Company: Novatrox PolarMind AB
Swedish title: Förutsägelse på köp av sätesreservation med maskininlärning.
Abstract
With the continuing surge in digitalization within the travel industry and the increased demand for personalized services, understanding customer behaviour is becoming a requirement for travel agencies to survive. The number of studies addressing this problem is increasing, and machine learning is expected to be the enabling technique. This thesis trains two different models, a multi-layer perceptron and a support vector machine, to reliably predict whether a customer will add a seat reservation to their flight booking. The models are trained on a large dataset consisting of 69 variables and over 1.1 million historical records of bookings dating back to 2017. The results from the trained models are satisfactory, and the models are able to classify the data with an accuracy of around 70%. This shows that this type of problem is solvable with the techniques used. The results moreover suggest that further exploration of models and additional data could be of interest, since this could help increase the level of performance.
Keywords
Deep learning, Machine learning, Binary classification, Neural network, Support vector machine
Sammanfattning
With the continued increase of digitalization in the travel industry and the fact that customers today show a great need for tailored services, the demands on companies to understand their customers' behaviour in order to survive are also rising. A wealth of studies have attempted to tackle the problem of predicting customer behaviour, and machine learning has been identified as an enabling technology. Within machine learning there has been considerable progress, specifically in the area of deep learning. This has spread the use of these technologies for solving complex problems to an increasing number of industries. This study implements a Multi-Layer Perceptron and a Support Vector Machine and trains them on existing data in order to reliably determine whether a customer will purchase a seat reservation with their booking or not. The data used consisted of 69 variables and over 1.1 million historical bookings in the time span 2017 to 2020. The results of the study are satisfactory, as the models on average classify with an accuracy of 70%, but not optimal. The Multi-Layer Perceptron performs best on both metrics used to estimate model performance, accuracy and F1 score. The results also indicate that extending this study with more data and more classification models is of interest, since this could lead to a higher level of performance.
Nyckelord
Deep learning, Machine learning, Binary classification, Neural networks, Support vector machine
Contents
1 Introduction
1.1 Research question
1.2 Purpose
1.3 Scope and limitations
1.4 Thesis outline
2 Background
2.1 Machine learning
2.2 Artificial neural networks
2.2.1 The One-Neuron Perceptron
2.2.2 Multi-layer perceptron
2.3 Support vector machine
2.4 Pre-processing
2.4.1 Missing values
2.4.2 Outlier treatment
2.4.3 Encoding and normalization
2.4.4 High cardinality categorical variables
2.4.5 Feature selection
2.5 Verifying significance of results
2.5.1 Two-sample t-test
2.5.2 Kolmogorov-Smirnov two-sample test
2.6 Related work
3 Method
3.1 Models
3.2 Data handling
3.2.1 Pre-processing
3.2.2 Data partitioning
3.3 Experiments
3.3.1 Pilot experiments
3.3.2 Hyperparameter tuning
3.4 Model training and evaluation
4 Results
4.1 Pre-processing and pilot experiments
4.2 Results from hyperparameter tuning
4.3 Results from model training
5 Discussion
5.1 Pre-processing and pilot experiments
5.2 Model comparisons
5.3 Comments on the results
5.4 Limitations
5.5 Further research
5.6 Sustainability and ethics
6 Conclusions
Bibliography
Chapter 1 Introduction
With the increase in digitalization globally and across industries, the travel industry has been and continues to be transformed. In 2016, the World Economic Forum’s Digital Transformation initiative projected that continually growing digitalization in aviation, travel and tourism would create around US$305 billion of value for the industry through 2025 [1]. Further, the competition within these industries is expected to grow as new, more technically capable actors enter the market. The travel industry has shifted from physical travel agencies to online marketplaces where orders are processed in no time. With this transformation, a demand for personalized services and offers has developed among customers, introducing new and complex challenges for the existing actors [1]. Following the trends of digitalization, the importance of, and increase in, data gathering has become apparent.
Companies are now faced with the possibilities of using these large amounts of data to analyse and understand their customers’ needs, and many attempts at doing so have already been made. One technology that has especially been coupled with data mining tasks and is believed to be an enabler for utilizing large datasets is machine learning [2]–[5]. With the powerful tools of machine learning, models can be trained on large datasets to learn to identify customers who are likely to e.g. purchase a product.
In the travel industry, customers are presented with the possibility to add optional products, such as reserving a seat for some additional cost, during their booking. These additional products are often very profitable for the companies. If the companies are able to identify customers who are unlikely to buy, they can use this information to incentivize a purchase by e.g. offering a discount [6]. This thesis investigates the applicability of machine learning to predict a customer's intent to purchase seating during the booking process in the travel industry. The problem is approached by implementing two machine learning models that are trained to predict whether a customer will add a seat reservation to their booking based on several input parameters. The data used to train the models consists of historical records of bookings, including whether the customer added seating. The data is provided by a travel agency from Stockholm and dates back to 2017. The two models are evaluated and compared to each other in order to answer whether one of them performs significantly better at this task than the other.
1.1 Research question
The aim of this thesis is partly to evaluate to what extent MLPs or SVMs can reliably predict whether a customer will add a seating reservation to their flight booking, utilizing historical bookings data.
This will be investigated by implementing two models, one linear SVM and one MLP, both evaluated on two metrics: accuracy and F1 score. Further, the thesis investigates whether there is a significant difference in performance between the two models. This is summarized in the two research questions RQ1 and RQ2 below.
RQ1: To what degree can SVM or MLP models learn to predict whether a customer will purchase a seat reservation with their flight booking?
RQ2: Is there a significant difference in performance between the two models?
1.2 Purpose
The purpose of this thesis is to fully investigate whether it is possible to, with the usage of records from a bookings dataset, train a classifier that can reliably predict customer behaviour in the travel industry. A classifier as such can be used by the company to provide customers with personalized offers that could potentially lead to increased sales.
Further, the researched area aligns with the trend of increased demand for personalized offers, as mentioned in [1]. Also, since the study uses a transformation technique for high cardinality (HC) categorical variables, the result is interesting because it also hints at whether these are feasible to use instead of traditional encoding schemes such as dummy encoding. This is valuable since dummy encoding of HC variables quickly becomes impractical, as it leads to very sparse datasets and increased dimensionality, which in turn raises the hardware requirements [5].
1.3 Scope and limitations
The scope of this project is to implement and evaluate the models in the specific setting described in section 1.1, i.e. they will only predict the addition of a seating reservation and no other product that the customer is presented with. Further, only one SVM and one MLP will be implemented, and their specific configurations will be decided during the early part of the implementation. The data used to train the models is tabular, and the study will not investigate the usage of sequential data that some of the related work uses. Hence, the data will be general data gathered from a flight booking, e.g. the destination, booking time, etc., and some customer-specific information. Further, the data is sourced from only one actor in the travel industry, and this must be considered when analysing the results.
1.4 Thesis outline
The remainder of this thesis consists of five chapters. First, the theoretical concepts relevant to the thesis, together with similar work that has been performed, are described in chapter two. Second, the method used to carry out the implementation, the experiments, and the evaluation of the results from the experiments is described in chapter three. The results from the experiments are presented in chapter four and thoroughly discussed and compared to the results found during the literature study in chapter five.
Additionally, chapter five discusses the limitations of this study and presents propositions for future research, along with an analysis of the results from an ethical and sustainability perspective. Lastly, the research questions are answered and the thesis is concluded in chapter six.
Chapter 2 Background
This chapter presents the background and theory that the reader needs in order to comprehend the study performed. The concepts described in this chapter have been selected for their relevance to the project, or to present alternative methods for a certain procedure. How the chosen concepts are applied is then described in detail in chapter 3 (Method).
2.1 Machine learning
Machine learning (ML) is a term used to describe the subfield of artificial intelligence that focuses mainly on two interrelated questions: 1) “How can one construct computer systems that automatically improve through experience?” and 2) “What are the fundamental statistical-computational-information-theoretic laws that govern all learning systems, including computers, humans, and organizations?” [7]. The problems solved by ML algorithms and models are categorized as either supervised or unsupervised. In supervised problems, models are trained to detect patterns in the data that can then be used to predict other data. Unsupervised learning algorithms instead seek to find structural properties and cluster data based on these [7]. Further, supervised problems are split into two subproblems: classification and regression tasks. Both utilize labelled data; however, in classification problems the model learns to categorize data correctly and its output is discrete, while in regression problems the model calculates a continuous output based on the input data [8].
In supervised learning, models are trained using a labelled set of data, often referred to as training data [9]. The model is trained iteratively by predicting the output and correcting its parameters so that its output better matches the ground truth. The model’s prediction performance is often evaluated on a separate dataset, also called a test set, consisting of data that the model has not seen during training. The reason the prediction performance is not evaluated on the training set is to simulate the model’s performance on unseen data in a real-world use case [8]. This is referred to in the literature as the problem of generalization [8], [9].
To estimate a model’s generalization, one can use a test set that is classified after the training has finished. However, a lot of uncertainty is involved when estimating the true performance from a single model. Therefore, k-fold cross-validation (CV) is often used to train k classifiers. In some cases, a stratified CV can be used to ensure that each fold has a similar class distribution. The k-fold CV splits the data into k partitions; each classifier is trained on k−1 of these partitions and then evaluated on the remaining partition [10]. All these models are hence trained and evaluated on different parts of the data, and their performances are used to estimate the true performance of the classifier (see equation 2.1), i.e. its ability to generalize, more accurately than a single model would.
True performance = (1/k) · Σ_{i=1}^{k} Performance_i    (2.1)
Figure 2.1: Data partitioning in a 10-fold CV and the estimated error of the model (E) which is the mean of the error estimation for each fold (Ei).
Image from [11] under creative commons licence.
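The fold construction and the averaging in equation 2.1 can be sketched as follows. This is a minimal illustration with made-up helper names, not the thesis implementation; in practice a library routine such as scikit-learn's KFold or StratifiedKFold would be used.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Split shuffled sample indices into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    return np.array_split(idx, k)

def cross_validate(X, y, train_and_score, k=10):
    """Estimate true performance as the mean score over k folds (eq. 2.1)."""
    folds = kfold_indices(len(X), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        # train on the other k-1 partitions, evaluate on the held-out fold
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```

Note that the splitting above is not stratified; a stratified variant would shuffle and split each class separately before merging the folds.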
The intended goal of training is to produce a model which generalizes well to new data. Therefore, a model which performs well on training data but poorly on unseen data is undesirable; such a model is said to have overfitted on the training data [9]. Overfitting can depend on multiple factors, one being that the model used is too complex for the data or problem at hand. An overfitted model has a large variance in performance and depends heavily on the data it is fed during training [12]. This means that if the training data for some reason does not represent the real distribution of the data, the model will perform poorly. To counter or mitigate large variance, regularization techniques are used; which ones apply depends on the model. One regularization technique is L2/ridge regularization, which penalizes large parameter values during training and thereby keeps the model's parameters on a comparable scale, preventing overfitting [13]. On the other end of the spectrum, too much regularization can yield a model which does not learn sufficiently from the training data. This is called underfitting and can be addressed by increasing the model complexity or using less regularization. The problem of choosing a model whose variance is neither too large nor too small and that generalizes well is referred to as the “bias-variance trade-off”, see figure 2.2 [9], [14].
Figure 2.2: A graphic visualization of how variance and bias affect the training and validation error of a model. Illustration inspired by [15].
When evaluating models, it is important to choose a suitable performance metric for the problem at hand [16]. Most metrics are based on the confusion matrix which is illustrated in figure 2.3 below.
To visualize the effects of using the wrong metric, accuracy is often used as an example. The accuracy of a classifier is calculated as
Accuracy = (TP + TN) / (TP + FP + TN + FN)    (2.2)
In a situation where the positive class makes up 90% of all samples, a naïve classifier that always predicts the positive class would have an accuracy of 90% even though it never correctly classifies a negative sample. In a case where identifying the negative class is important (e.g. in anomaly detection), this classifier is worthless and should not be used despite its high accuracy. One metric that is often used, either as a complement to accuracy or on its own, is the F_β score, calculated as shown in equation 2.3 below. With β = 1 the score is called the F1 score, which is the harmonic mean of precision and recall [16].
F_β = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)    (2.3)
Figure 2.3: The confusion matrix for a binary classification problem.
Illustration based on [17].
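Equations 2.2 and 2.3 can be computed directly from the confusion-matrix counts. The sketch below is illustrative only (plain Python, invented function names); a library such as scikit-learn provides equivalent metrics.

```python
def confusion(y_true, y_pred):
    """Counts for the binary confusion matrix (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):                       # equation 2.2
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def f_beta(y_true, y_pred, beta=1.0):               # equation 2.3
    tp, _, fp, fn = confusion(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# The naive example from the text: 90% positives, always predict positive.
y_true = [1] * 9 + [0]
y_pred = [1] * 10
# accuracy(y_true, y_pred) is 0.9 even though no negative is ever found
```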
2.2 Artificial neural networks
Artificial neural networks (ANN) are biologically inspired networks that can be used to perform complex computations. The idea of constructing neural computation systems was first proposed by McCulloch and Pitts in 1943, and later by Metropolis et al. in 1953 [18].
2.2.1 The One-Neuron Perceptron
The simplest ANN is the one-neuron perceptron which, as the name implies, consists of a single neuron that outputs a thresholded response based on its input signals (see figure 2.4). The output of this network is given by

Output = φ(Σ_{i=0}^{m} w_i · x_i + b)    (2.4)
where φ(x) is some activation function. There are many activation functions, but some of the most common are ReLU (which was used in this study), the sigmoid function, tanh, and SoftMax (see equations 2.5–2.8) [19]. The network can be used to find solutions for any linearly separable problem [20]. However, in 1969, Minsky and Papert proved that the single perceptron could not find a correct solution when the problem was non-linear, such as the XOR problem [21]. This finding led to a sharp decrease in interest in perceptrons until around 1980, when significant developments had been made in the fields of computational techniques and parallel information systems [18].
ReLU: σ(x) = max(x, 0)    (2.5)

Sigmoid: σ(x) = 1 / (1 + e^{−x})    (2.6)

tanh: σ(x) = (e^x − e^{−x}) / (e^x + e^{−x})    (2.7)

SoftMax: σ(x_i) = e^{x_i} / Σ_j e^{x_j}, where x_i is the signal from output neuron i    (2.8)
Figure 2.4: The one-neuron perceptron showing the weights and biases of the network. Illustration based on [22].
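Equation 2.4 and the activation functions 2.5–2.8 can be written out in a few lines. This is a toy sketch with hypothetical names, shown on an AND gate, which a single neuron can solve because AND is linearly separable (XOR, by contrast, is not):

```python
import math

def relu(x):    return max(x, 0.0)                    # eq. 2.5
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))     # eq. 2.6
def tanh(x):    return math.tanh(x)                   # eq. 2.7

def softmax(xs):                                      # eq. 2.8
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def neuron(x, w, b, phi=sigmoid):
    """One-neuron perceptron: phi(sum_i w_i * x_i + b)  (eq. 2.4)."""
    return phi(sum(wi * xi for wi, xi in zip(w, x)) + b)

# AND gate with hand-picked weights: output > 0.5 only for input (1, 1)
w, b = [1.0, 1.0], -1.5
```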
2.2.2 Multi-layer perceptron
A Multi-layer perceptron (MLP) is an ANN that consists of one input and one output layer, both of arbitrary size, and between these layers there is some number of hidden layers (see figure 2.5) [18]. At the time of writing, this type of network is used in a wide range of domains such as computer vision and natural language processing [18]. The MLP can be used to solve non-linear problems and it is because of the hidden nodes and the nonlinear activation functions that this is possible.
According to [23], an appropriately set up MLP can to some degree approximate any function.
The MLP is trained using backpropagation, which propagates the output error back through the network’s layers and calculates the gradient for each of them. The gradient is then used to update all the trainable parameters (weights and biases) in each layer of the network.
This is a supervised learning process, meaning that the network adjusts its weights and biases so that the output better mimics the correct output. The training algorithm can, at a high level of abstraction, be described in six steps.
1) The network’s weights and biases are initialized using some appropriate initialization technique, e.g. Xavier or He initialization, two of the most widely used techniques [24].
2) The network performs a forward propagation on all training data (often in batches). This propagation results in an output for each sample in the batch.
3) The output is compared to the ground truth and an error is calculated using a function such as the Mean Square Error (MSE), Mean Absolute Error (MAE) or Cross-entropy loss.
4) The error is propagated backwards to calculate the gradients for each layer using backpropagation.
5) The parameters in the network are updated using the calculated gradients from step four.
6) Steps two through five are repeated until some stopping criterion is met, e.g. the loss does not change significantly between epochs (one epoch is one iteration over all the data), or the maximum number of epochs is reached.
The choice of the function used to calculate the error (often called loss function) is important since this value is what the network tries to minimize [18]. Further, according to [25], the loss function could affect the final performance of the network and how long it takes for the network to converge. The Cross-entropy function is often used for classification tasks. However, when the problem is binary a modified version called Binary Cross-entropy is instead used.
Figure 2.5: An MLP with one hidden layer. Illustration based on [26].
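The six training steps can be sketched in plain NumPy on the XOR problem, which a single perceptron cannot solve but a one-hidden-layer MLP can. This is an illustrative sketch, not the thesis implementation: plain Gaussian initialization stands in for Xavier/He, the architecture and learning rate are arbitrary choices, and the loss is the binary cross-entropy mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # XOR inputs
y = np.array([[0.], [1.], [1.], [0.]])                   # XOR labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# step 1: initialize weights and biases (plain Gaussian here, not Xavier/He)
W1 = rng.normal(0.0, 1.0, (2, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(0.0, 1.0, (8, 1)); b2 = np.zeros((1, 1))

lr = 0.5
for epoch in range(3000):                     # step 6: repeat until stopping criterion
    h = np.tanh(X @ W1 + b1)                  # step 2: forward propagation
    out = sigmoid(h @ W2 + b2)
    loss = -np.mean(y * np.log(out + 1e-12)   # step 3: binary cross-entropy error
                    + (1 - y) * np.log(1 - out + 1e-12))
    d_z2 = (out - y) / len(X)                 # step 4: backpropagate gradients
    dW2 = h.T @ d_z2; db2 = d_z2.sum(0, keepdims=True)
    d_z1 = (d_z2 @ W2.T) * (1 - h ** 2)       # tanh derivative
    dW1 = X.T @ d_z1; db1 = d_z1.sum(0, keepdims=True)
    W2 -= lr * dW2; b2 -= lr * db2            # step 5: update parameters
    W1 -= lr * dW1; b1 -= lr * db1

preds = (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int)
```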
2.3 Support vector machine
The concept of support vector machines (SVM) was introduced by Vapnik in 1995 and has since then gained a high status within the machine learning community [27]. An SVM can be used for tasks such as classification, regression, anomaly detection and more [27]. In the case of binary classification, the objective of the SVM is to find a decision boundary/hyperplane that separates samples from the two classes with the largest distance between them. This can be done by training an SVM in one of two ways. If we assume that the data is completely linearly separable and without any noise, a Hard-Margin SVM can be used. This SVM finds an optimal hyperplane where no points lie within the maximum margin (see figure 2.6). However, some points lie on the maximum margin and these points are what is called the support vectors [27]. If we instead assume that there is some noise in the data, which is mostly the case, then a Soft margin SVM can be used. The difference between the two types of SVM is that during training, the soft margin SVM allows for some misclassifications by introducing a slack variable into the optimization problem [27],[28].
ξ = (ξ_1, ξ_2, …, ξ_M) is the nonnegative slack variable; ξ_i describes how far from the maximum margin a point x_i lies, see figure 2.6. If 0 < ξ_i < 1, the point was correctly classified but lies within the maximum margin. However, if ξ_i ≥ 1, the point has been misclassified by the optimal hyperplane [28]. The optimization problems for the hard-margin SVM (equation 2.9) and the soft-margin SVM (equation 2.10) can be described respectively as
minimize Q(w, b) = (1/2)‖w‖²    (Hard-margin)    (2.9)
subject to y_i(wᵀx_i + b) ≥ 1 for i = 1, …, M

minimize Q(w, b, ξ) = (1/2)‖w‖² + (C/p) · Σ_{i=1}^{M} ξ_i^p    (Soft-margin)    (2.10)
subject to y_i(wᵀx_i + b) ≥ 1 − ξ_i for i = 1, …, M
Figure 2.6: The two types of SVM classifying 2-dimensional data with variables x1 and x2. Hard-margin SVM (left) and Soft-margin SVM (right)
with samples inside of the maximum margins. Illustration based on [28].
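A crude illustration of the soft-margin objective (2.10) with p = 1 is subgradient descent on the hinge loss. This is only a sketch under simplifying assumptions (constant learning rate, invented function name); real SVM training uses dedicated quadratic-programming or SMO solvers such as those behind scikit-learn's SVC/LinearSVC.

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b)),
    i.e. the p=1 soft-margin primal; labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1          # points with nonzero slack xi_i
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy 1-D data: negatives at 0 and 1, positives at 3 and 4
X = np.array([[0.0], [1.0], [3.0], [4.0]])
y = np.array([-1, -1, 1, 1])
w, b = train_soft_margin_svm(X, y)
```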
2.4 Pre-processing
In machine learning, the quality of the data is one of the deciding factors in how well a problem can be solved. Datasets are often incomplete, meaning that the data quality is poor; for example, the data can contain text attributes or missing entries, which most machine learning models cannot handle. Therefore, the data is usually manipulated in a pre-processing phase to handle these issues. Issues often handled during pre-processing are removal of redundant variables, handling of missing values, outlier treatment, and encoding of the data [29].
2.4.1 Missing values
Missing values can be handled in various ways. One approach is to remove all records that include missing data. However, this could lead to a large part of the data being removed: in [29] it is estimated that in a dataset of 30 variables where only 5% of the data values are missing, spread evenly throughout the data, almost 80% of all records would include a missing value. Therefore, it is sometimes not feasible to omit records with missing values; instead, they should be treated. They could be treated by filling them with an arbitrary constant, with the mean or the mode (if the variable is categorical) of the variable, or with randomly drawn samples from the distribution of the values that exist [29]. Depending on the problem, these techniques impact the performance of the model differently.
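Mean and mode imputation can be sketched in a few lines (an illustrative stub with an invented function name, not the thesis code). The ~80% figure quoted from [29] also follows from a one-line probability calculation, included below.

```python
from statistics import mean, mode

def impute(column, strategy="mean"):
    """Fill missing entries (None) with the column mean, or with the
    mode for categorical columns."""
    present = [v for v in column if v is not None]
    fill = mean(present) if strategy == "mean" else mode(present)
    return [fill if v is None else v for v in column]

# Sanity check of the estimate in [29]: with 30 variables each missing
# independently 5% of the time, the share of records with at least one
# missing value is 1 - 0.95^30, which is roughly 0.785.
p_any_missing = 1 - 0.95 ** 30
```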
2.4.2 Outlier treatment
The definition of an outlier is somewhat ambiguous in the literature. One definition, “… an object in a dataset that is abnormal in comparison to the assumptions of the data”, is presented in [30]. One assumption frequently made about data is that it is normally distributed. The process of detecting outliers is not always easy. In anomaly detection, a specific type of classification problem, the whole learning process is focused on identifying outliers. However, when it comes to removing samples from a dataset to exclude abnormal data, the task is less complex. One factor that affects how outliers can be detected is the type of data in the set. For example, if some of the attributes in the dataset are categorical, then standard detection techniques such as measuring Euclidean distances between samples might not be applicable. Instead, outliers can be detected on a single attribute or on a subset of attributes [30]. When performing outlier detection on a single attribute, a common method is to first standardize the variable according to equation 2.11. All records with a variable whose absolute Z-score is larger than 3.5 are then considered outliers and removed from the dataset [31].
However, if the outlier detection is performed on a multi-attribute level, an algorithm like k-modes can instead be used. The k-modes algorithm is an extension of the k-means clustering algorithm and can measure distance when some attributes are categorical [30]. The data is then clustered, and outliers are defined as samples that lie outside some predefined distance.
Z-score = (X − μ) / σ    (2.11)
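The single-attribute Z-score rule above amounts to one filtering step (a minimal sketch; the function name is invented):

```python
import numpy as np

def remove_outliers(x, threshold=3.5):
    """Keep only the values whose absolute Z-score (equation 2.11)
    is at most the threshold."""
    z = (x - x.mean()) / x.std()
    return x[np.abs(z) <= threshold]
```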
2.4.3 Encoding and normalization
In most real-world applications, raw data has attributes on different scales. For example, a dataset may include the buying price of a house, measured in millions, and the number of rooms in the house, ranging from 1 to 10. Because these features differ greatly in scale, the predictor could be much more affected by the buying price than by the number of rooms; this is called feature bias [32]. In addition, using equally scaled features can speed up the training and hence improve efficiency. To overcome feature bias, features are usually normalized so that all variables are on the same scale. This can be done in multiple ways, e.g. using a Z-score normalization, a min-max normalization, or some other method. Z-score normalization uses the variable mean and standard deviation to produce the normalized data, as can be seen in equation 2.11. Min-max normalization instead uses the min and max value of each variable; all values of the variable are then normalized according to equation 2.12, which results in a value between zero and one. When using a training and test set, the normalization statistics are computed on the values of the training set. It is therefore important that the min and max values are stored and used to transform the test set, since these are the representations that the models learn [3].
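The train-then-transform discipline can be made concrete with a small fit/transform object (a sketch only; the class name is invented, though the convention mirrors scikit-learn's MinMaxScaler):

```python
import numpy as np

class MinMaxScalerSketch:
    """Min-max normalization (equation 2.12) whose statistics are
    fitted on the training set only."""
    def fit(self, X):
        self.min_ = X.min(axis=0)
        self.max_ = X.max(axis=0)
        return self
    def transform(self, X):
        # test-set values outside the training range land outside [0, 1]
        return (X - self.min_) / (self.max_ - self.min_)

train = np.array([[0.0], [10.0]])
scaler = MinMaxScalerSketch().fit(train)
```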
Min-max normalization = (X − X_min) / (X_max − X_min)    (2.12)

2.4.4 High cardinality categorical variables
Categorical variables are variables, often of string type or with discrete values, indicating a specific category. Normally, these are encoded using a one-hot or a dummy encoding, which produce n and n−1 new variables respectively, where n is the number of distinct categories of the variable [5]. A high cardinality (HC) categorical variable is a categorical variable with a large number of distinct categories. In the HC case, it is infeasible to use a dummy encoding, since the dimensionality of the data would quickly become very large. Previous studies have tried different methods of handling HC categorical variables by applying some transformation technique like dummy encoding, semantic grouping, or a transformation that converts the data from categorical to continuous [5]. To avoid an explosion in dimensionality, the conversions to continuous values are the most interesting, since they preserve the dimensionality of the data. Two of the methods used in [5] are “Weight of Evidence” (WOE) and “Supervised Ratio” (SR).
1. Weight of Evidence (WOE): This transforms categorical variables into continuous ones by calculating the metric according to equation 2.13 on the part of the dataset reserved for this task. TC and TN denote the total number of samples belonging to the positive and negative class, respectively. C_i^x and N_i^x denote the number of samples with the ith value of variable X that belong to the positive and negative class, respectively. Since the WOE is calculated on a hold-out set, some categories of the variable may be unseen before the training starts. These values are encoded as zero, which corresponds to the category being equally likely to belong to either label.
2. Supervised Ratio (SR): The supervised ratio is more straightforward than the WOE; it is the probability that the ith value of variable X belongs to the positive class (see equation 2.14), based on the training data. This value always lies between zero and one, where a larger value means the category is more likely to belong to the positive class. As with WOE, unseen values can occur when encoding new data; these are encoded as TC / (TC + TN), representing the probability that any sample belongs to the positive class. In the case of a perfect class balance, this equates to 0.5.
WOE_i^x = ln( (C_i^x / TC) / (N_i^x / TN) )    (2.13)

SR_i^x = C_i^x / (C_i^x + N_i^x)    (2.14)
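Both encodings reduce to simple per-category counts. The sketch below (invented function name, 0/1 labels assumed) computes equations 2.13 and 2.14; note that WOE is undefined for a category with a zero count in either class, so such categories are skipped here, and unseen categories at encoding time would map to 0 (WOE) or TC/(TC+TN) (SR), as described in the text.

```python
import math
from collections import defaultdict

def woe_and_sr(categories, labels):
    """Per-category Weight of Evidence (eq. 2.13) and Supervised
    Ratio (eq. 2.14), computed from binary 0/1 labels."""
    pos, neg = defaultdict(int), defaultdict(int)
    for c, y in zip(categories, labels):
        (pos if y == 1 else neg)[c] += 1
    TC, TN = sum(pos.values()), sum(neg.values())
    cats = set(pos) | set(neg)
    woe = {c: math.log((pos[c] / TC) / (neg[c] / TN))
           for c in cats if pos[c] > 0 and neg[c] > 0}
    sr = {c: pos[c] / (pos[c] + neg[c]) for c in cats}
    return woe, sr
```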
2.4.5 Feature selection
In some cases, certain attributes of the data fed to the model during training are redundant and do not add any information. These attributes only increase the model complexity and the computations needed. It is even possible that some variables affect the model negatively and hence reduce performance [33]. As a countermeasure, feature selection is often applied to the data before training. The intended goals of feature selection are mainly 1) choosing the features that maximize the performance of the learning algorithm, 2) reducing the computational and/or storage requirements of the learning algorithm, or 3) detecting important features that relate to the natural problem being studied.
One algorithm that can be used for feature selection is Random forests (RF) [34], [35]. RF is an ensemble method consisting of a set of decision trees. The model uses bagging and feature randomness to create independent training data for each decision tree during the training phase [36]. This is beneficial since each of the trees learns to classify differently but, hopefully, on a population scale the majority will predict correctly for each sample fed into it. When an RF has been fit to the training set, the features are ranked based on their contribution to solving the problem, i.e. how much the inclusion of a feature improves performance. In a study from 2018 by Sylvester et al. [35], a standard implementation of RF was tested against the fixation index (Fst), a feature selection method used in biology. The outcome of the study was that predictions from the variables selected by the RF performed significantly better than those from the competing method. Further, a study by Shamsoddini et al. from 2017 [34] showed that combining either an MLP or an MLR with RF-based feature selection led to significantly lower prediction error in the context of air pollution prediction, compared to the models without feature selection.
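The importance-ranking idea can be illustrated with a deliberately tiny "forest" of depth-1 trees (stumps): each stump is fit on a bootstrap sample with a randomly chosen feature, and a feature's importance is its average decrease in Gini impurity. This is a toy sketch only; a real RF (e.g. scikit-learn's RandomForestClassifier with its feature_importances_ attribute) grows full trees and searches over split points.

```python
import numpy as np

def stump_importance(X, y, n_trees=100, seed=0):
    """Toy random-forest-style importance: average Gini decrease
    achieved by depth-1 trees on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    gain, used = np.zeros(d), np.zeros(d)

    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = labels.mean()
        return 2 * p * (1 - p)

    for _ in range(n_trees):
        idx = rng.integers(0, n, n)        # bootstrap sample (bagging)
        f = rng.integers(0, d)             # random feature for this stump
        xs, ys = X[idx, f], y[idx]
        thr = np.median(xs)                # crude split point
        left, right = ys[xs <= thr], ys[xs > thr]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        gain[f] += gini(ys) - child
        used[f] += 1
    return gain / np.maximum(used, 1)
```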
2.5 Verifying significance of results
When a study evaluates two models using CV, each model produces a result for each fold. The aim is usually to decide whether the models differ in performance, which is done through statistical hypothesis testing. To do this, a null hypothesis is normally formulated as there being no difference between the performance measurements produced by the models, i.e. the mean values of the result populations are not different [37]. An alternative hypothesis is also formulated, describing the alternative scenario that is accepted if the null hypothesis is rejected. The null hypothesis is tested using some statistical test and some predefined threshold value, often referred to as the significance level and denoted α. Commonly used significance levels are 0.05, 0.01 and 0.001 [37], [38]. Many tests can be used to establish statistical significance, and each is applied under certain assumptions. The two methods used in this study are the two-sample t-test and the Kolmogorov-Smirnov (KS) two-sample test.
2.5.1 Two-sample t-test
The most commonly used test is the two-sample t-test. The test is parametric and can be used under certain assumptions [37], [38]. If the test finds a significant difference, it can be determined that this is because of a difference in location (mean) [38]. This makes the test very powerful, since a conclusion about how the distributions differ can be drawn directly. However, the test assumes that the distributions tested are normally distributed, which is often not the case [38].
2.5.2 Kolmogorov-Smirnov two-sample test
The KS two-sample test is a non-parametric method that tests for differences between two distributions. However, unlike the two-sample t-test, in the case of a significant difference it does not tell whether the difference is due to location (mean), variation (standard deviation), the presence of outliers, or some other property [38]. The outcome of the KS two-sample test is therefore much weaker than that of the t-test, since it does not tell why the distributions differ. On the other hand, the KS two-sample test can be applied in more cases, since it makes fewer assumptions about the two distributions.
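The two test statistics can be computed directly from the per-fold scores. The sketch below is illustrative (pooled equal-size t statistic, invented function names); in practice scipy.stats.ttest_ind and scipy.stats.ks_2samp would supply p-values directly. For two sets of 10 fold scores, the pooled t-test has 18 degrees of freedom, and |t| would be compared against the corresponding critical value from a t table (≈2.10 at α = 0.05).

```python
import math

def two_sample_t(a, b):
    """Pooled two-sample t statistic for equal-sized samples
    (assumes normal distributions with equal variances)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((x - ma) ** 2 for x in a) / (n - 1)
    vb = sum((x - mb) ** 2 for x in b) / (n - 1)
    pooled = math.sqrt((va + vb) / 2)
    return (ma - mb) / (pooled * math.sqrt(2 / n))

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    cdf = lambda s, v: sum(1 for x in s if x <= v) / len(s)
    return max(abs(cdf(a, v) - cdf(b, v)) for v in sorted(set(a) | set(b)))
```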
2.6 Related work
The problem of predicting whether a customer will purchase a specific product has been researched extensively. A large spread of models has been tested for this problem, but few studies compare the performance of SVMs and MLPs. A study by Zhang et al. [39] addressed the prediction of ad click-through rate, a regression problem, using two deep neural networks (DNNs). The problem is similar to the one researched in this thesis in that it evaluates, in a static environment, how well a neural network can detect whether a customer will perform an action. However, it does so by implementing different DNN architectures, whereas this thesis compares one SVM and one MLP. In a study by Moeyersoms et al. [5], similar to this thesis, a linear SVM was trained to predict customer churn in the energy sector. In both [39] and [5], as in this thesis, many features were of categorical/discrete type, and in the latter also of high cardinality. To tackle the problem of exploding dimensionality, [39] suggested learning field-wise feature embeddings through a supervised process prior to the model training, which allows the sparse input space to be represented in a much lower dimension. The resulting models performed better than a logistic regressor using the sparse input. Moeyersoms et al. [5], in contrast to [39], suggested the use of mainly three transformations that map each categorical value to a continuous value, and compared the performance of these to either dropping or dummy encoding all HC categorical variables. The study concluded