STOCKHOLM, SWEDEN 2020
Predicting Purchase of Airline Seating Using Machine Learning
SEBASTIAN EL-HAGE
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master in Computer Science
Date: June 23, 2020
Supervisor: Erik Fransén, Christer Ogenstad
Examiner: Olov Engwall
Host Company: Novatrox PolarMind AB
Swedish title: Förutsägelse på köp av sätesreservation med maskininlärning.
Abstract
With the continuing surge in digitalization within the travel industry and the increased demand for personalized services, understanding customer behaviour is becoming a requirement for travel agencies to survive. The number of studies addressing this problem is increasing, and machine learning is expected to be the enabling technique. This thesis trains two different models, a multi-layer perceptron and a support vector machine, to reliably predict whether a customer will add a seat reservation to their flight booking. The models are trained on a large dataset consisting of 69 variables and over 1.1 million historical records of bookings dating back to 2017. The results from the trained models are satisfactory, and the models are able to classify the data with an accuracy of around 70%. This shows that this type of problem is solvable with the techniques used. The results moreover suggest that further exploration of models and additional data could be of interest, since this could help increase the level of performance.
Keywords
Deep learning, Machine learning, Binary classification, Neural network, Support vector machine
Sammanfattning
With the continued increase of digitalization in the travel industry and the fact that customers today show a great need for tailored services, the demands on companies to understand their customers' behaviour in order to survive are also rising. A wealth of studies have attempted to tackle the problem of predicting customer behaviour, and machine learning has been identified as an enabling technology. Within machine learning there has been considerable progress, specifically in the area of deep learning. This has spread the use of these technologies for solving complex problems to an increasing number of industries. This study implements a Multi-Layer Perceptron and a Support Vector Machine and trains them on existing data in order to reliably determine whether a customer will purchase a seat reservation with their booking or not. The data used consisted of 69 variables and over 1.1 million historical bookings in the time span 2017 to 2020. The results of the study are satisfactory, as the models on average classify with an accuracy of 70%, but not optimal. The Multi-Layer Perceptron performs best on both metrics used to estimate model performance, accuracy and F1 score. The results also indicate that extending this study with more data and more classification models is of interest, since this could lead to a higher level of performance.
Nyckelord
Deep learning, Machine learning, Binary classification, Neural networks, Support vector machine
Contents
1 Introduction
1.1 Research question
1.2 Purpose
1.3 Scope and limitations
1.4 Thesis outline
2 Background
2.1 Machine learning
2.2 Artificial neural networks
2.2.1 The One-Neuron Perceptron
2.2.2 Multi-layer perceptron
2.3 Support vector machine
2.4 Pre-processing
2.4.1 Missing values
2.4.2 Outlier treatment
2.4.3 Encoding and normalization
2.4.4 High cardinality categorical variables
2.4.5 Feature selection
2.5 Verifying significance of results
2.5.1 Two-sample t-test
2.5.2 Kolmogorov-Smirnov two-sample test
2.6 Related work
3 Method
3.1 Models
3.2 Data handling
3.2.1 Pre-processing
3.2.2 Data partitioning
3.3 Experiments
3.3.1 Pilot experiments
3.3.2 Hyperparameter tuning
3.4 Model training and evaluation
4 Results
4.1 Pre-processing and pilot experiments
4.2 Results from hyperparameter tuning
4.3 Results from model training
5 Discussion
5.1 Pre-processing and pilot experiments
5.2 Model comparisons
5.3 Comments on the results
5.4 Limitations
5.5 Further research
5.6 Sustainability and ethics
6 Conclusions
Bibliography
Chapter 1 Introduction
With the increase in digitalization globally and across industries, the travel industry has been and continues to be transformed. In 2016, the World Economic Forum’s Digital Transformation initiative projected that continually growing digitalization in aviation, travel and tourism would create around US$305 billion of value for the industry through 2025 [1]. Further, the competition within these industries is expected to grow as new, more technically capable actors enter the market. The travel industry has shifted from physical travel agencies to online marketplaces where orders are processed in no time. With this transformation, a demand for personalized services and offers has developed among customers, introducing new and complex challenges for the existing actors [1]. Following the trends of digitalization, the importance of, and increase in, data gathering has become apparent.
Companies are now faced with the possibilities of using these large amounts of data to analyse and understand their customers’ needs, and many attempts at doing so have already been made. One technology that has especially been coupled with data mining tasks and is believed to be an enabler for utilizing large datasets is machine learning [2]–[5]. With the powerful tools of machine learning, models can be trained on large datasets to learn to identify customers who are likely to e.g. purchase a product.
In the travel industry, customers are presented with the possibility to add optional products, such as reserving a seat for some additional cost, during their booking. These additional products are often very profitable for the companies. If the companies are able to identify customers who are unlikely to buy, they can use this information to incentivize a purchase by e.g. offering a discount [6]. This thesis investigates the applicability of machine learning to predict a customer's intent to purchase seating during the booking process in the travel industry. The problem is approached by implementing two machine learning models that are trained to predict whether a customer will add a seat reservation to their booking based on several input parameters. The data used to train the models consists of historical records of bookings, including whether the customer added seating. The data is provided by a travel agency from Stockholm and dates back to 2017. The two models are evaluated and compared to each other in order to answer whether one of them performs significantly better at this task than the other.
1.1 Research question
The aim of this thesis is partly to evaluate to what extent MLPs or SVMs can reliably predict whether a customer will add a seating reservation to their flight booking, utilizing historical bookings data.
This will be investigated by implementing two models, one linear SVM and one MLP, both evaluated on two metrics: accuracy and F1 score. Further, the thesis investigates whether there is a significant difference in performance between the two models. This is summarized in the two research questions RQ1 and RQ2 below.
RQ1: To what degree can SVM or MLP models learn to predict whether a customer will purchase a seat reservation with their flight booking?
RQ2: Is there a significant difference in performance between the two models?
1.2 Purpose
The purpose of this thesis is to fully investigate whether it is possible to, with the usage of records from a bookings dataset, train a classifier that can reliably predict customer behaviour in the travel industry. A classifier as such can be used by the company to provide customers with personalized offers that could potentially lead to increased sales.
Further, the researched area aligns with the trend of increased demand for personalized offers, as mentioned in [1]. Also, since the study uses a transformation technique for high cardinality (HC) categorical variables, the result is interesting because it also hints at whether these are feasible to use instead of traditional encoding schemes such as dummy encoding. This is valuable since dummy encoding of HC variables quickly becomes impractical, as it leads to very sparse datasets and increased dimensionality, which in turn raises the hardware requirements [5].
1.3 Scope and limitations
The scope of this project is to implement and evaluate the models in the specific setting described in section 1.1, i.e. they will only predict the addition of a seating reservation and no other product that the customer is presented with. Further, only one SVM and one MLP will be implemented, and their specific configurations will be decided during the early part of the implementation. The data used to train the models is tabular, and the study will not investigate the usage of sequential data that some of the related work uses. Hence, the data will be general data gathered from a flight booking, e.g. the destination, booking time, etc., and some customer-specific information. Further, the data is sourced from only one actor in the travel industry, and this must be considered when analysing the results.
1.4 Thesis outline
The remainder of this thesis consists of five chapters. First, the theoretical concepts relevant to the thesis, together with similar work that has been performed, are described in chapter two. Second, the method used to carry out the implementation, the experiments, and the evaluation of the results from the experiments is described in chapter three. The results from the experiments are presented in chapter four and thoroughly discussed and compared to the results found during the literature study in chapter five.
Additionally, chapter five discusses the limitations of this study and presents propositions for future research, along with an analysis of the results from an ethical and sustainability perspective. Lastly, the research questions are answered and the thesis is concluded in chapter six.
Chapter 2 Background
This chapter presents the background and theory that the reader needs in order to comprehend the study performed. The concepts described in this chapter have been selected for their relevance to the project, or to present alternative methods for a certain procedure. How the chosen concepts are applied is then described in detail in chapter 3 (Method).
2.1 Machine learning
Machine learning (ML) is a term used to describe the subfield of artificial intelligence that focuses mainly on two interrelated questions: 1) “How can one construct computer systems that automatically improve through experience?” and 2) “What are the fundamental statistical-computational-information-theoretic laws that govern all learning systems, including computers, humans, and organizations?” [7]. The problems solved by ML algorithms and models are categorized as either supervised or unsupervised. In supervised problems, models are trained to detect patterns in the data that can then be used to predict other data. Unsupervised learning algorithms instead seek to find structural properties and cluster data based on these [7]. Further, supervised problems are split into two subproblems: classification and regression tasks. Both utilize labelled data; however, in classification problems the model learns to categorize data correctly and its output is discrete, while in regression problems the model calculates a continuous output based on the input data [8].
In supervised learning, models are trained using a labelled set of data, often referred to as training data [9]. The model is trained iteratively by predicting the output and correcting its parameters so that its output better matches the ground truth. The model’s prediction performance is often evaluated on a separate dataset, also called a test set, consisting of data that the model has not seen during training. The reason the prediction performance is not evaluated on the training set is to simulate the model’s performance on unseen data in a real-world use case [8]. This is referred to in the literature as the problem of generalization [8], [9].
To estimate a model’s generalization, one can use a test set that is classified after the training has finished. However, a lot of uncertainty is involved when estimating the true performance from a single model. Therefore, k-fold cross-validation (CV) is often used to train k classifiers. In some cases, a stratified CV can be used to ensure that each fold has a similar class distribution. The k-fold CV splits the data into k partitions; each classifier is trained on k−1 of these partitions and then evaluated on the remaining partition [10]. All these models are hence trained and evaluated on different parts of the data, and their performances are used to estimate the true performance of the classifier (see equation 2.1), i.e. its ability to generalize, more accurately than a single model would.
True performance = (1/k) · Σ_{i=1}^{k} Performance_i    (2.1)
Figure 2.1: Data partitioning in a 10-fold CV and the estimated error of the model (E) which is the mean of the error estimation for each fold (Ei).
Image from [11] under creative commons licence.
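The fold construction and the averaging in equation 2.1 can be sketched as follows. This is a minimal illustration with made-up helper names, not the thesis implementation; in practice a library routine such as scikit-learn's KFold or StratifiedKFold would be used.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Split shuffled sample indices into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    return np.array_split(idx, k)

def cross_validate(X, y, train_and_score, k=10):
    """Estimate true performance as the mean score over k folds (eq. 2.1)."""
    folds = kfold_indices(len(X), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        # train on the other k-1 partitions, evaluate on the held-out fold
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```

Note that the splitting above is not stratified; a stratified variant would shuffle and split each class separately before merging the folds.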
The intended goal of training is to produce a model which generalizes well to new data. Therefore, a model which performs well on training data but poorly on unseen data is undesirable; such a model is said to have overfitted on the training data [9]. Overfitting can depend on multiple factors, one being that the model used is too complex for the data or problem at hand. An overfitted model has a large variance in performance and depends heavily on the data it is fed during training [12]. This means that if the training data for some reason does not represent the real distribution of the data, the model will perform poorly. To counter or mitigate large variance, regularization techniques are used; which ones apply depends on the model. One regularization technique is L2/ridge regularization, which penalizes large parameter values during training and thereby keeps the model's parameters on a comparable scale, preventing overfitting [13]. On the other end of the spectrum, too much regularization can yield a model which does not learn sufficiently from the training data. This is called underfitting and can be addressed by increasing the model complexity or using less regularization. The problem of choosing a model whose variance is neither too large nor too small and that generalizes well is referred to as the “bias-variance trade-off”, see figure 2.2 [9], [14].
Figure 2.2: A graphic visualization of how variance and bias affect the training and validation error of a model. Illustration inspired by [15].
When evaluating models, it is important to choose a suitable performance metric for the problem at hand [16]. Most metrics are based on the confusion matrix which is illustrated in figure 2.3 below.
To visualize the effects of using the wrong metric, accuracy is often used as an example. The accuracy of a classifier is calculated as
Accuracy = (TP + TN) / (TP + FP + TN + FN)    (2.2)
In a situation where the positive class makes up 90% of all samples, a naïve classifier that always predicts the positive class would have an accuracy of 90% even though it never correctly classifies a negative sample. In a case where identifying the negative class is important (e.g. in anomaly detection), this classifier is worthless and should not be used despite its high accuracy. One metric that is often used, either as a complement to accuracy or on its own, is the F_β score, calculated as shown in equation 2.3 below. With β = 1 the score is called the F1 score, which is the harmonic mean of precision and recall [16].
F_β = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)    (2.3)
Figure 2.3: The confusion matrix for a binary classification problem.
Illustration based on [17].
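Equations 2.2 and 2.3 can be computed directly from the confusion-matrix counts. The sketch below is illustrative only (plain Python, invented function names); a library such as scikit-learn provides equivalent metrics.

```python
def confusion(y_true, y_pred):
    """Counts for the binary confusion matrix (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):                       # equation 2.2
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def f_beta(y_true, y_pred, beta=1.0):               # equation 2.3
    tp, _, fp, fn = confusion(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# The naive example from the text: 90% positives, always predict positive.
y_true = [1] * 9 + [0]
y_pred = [1] * 10
# accuracy(y_true, y_pred) is 0.9 even though no negative is ever found
```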
2.2 Artificial neural networks
Artificial neural networks (ANN) are biologically inspired networks that can be used to perform complex computations. The idea of constructing neural computation systems was first proposed by McCulloch and Pitts in 1943, and later by Metropolis et al. in 1953 [18].
2.2.1 The One-Neuron Perceptron
The simplest ANN is the one-neuron perceptron which, as the name implies, consists of a single neuron that outputs a thresholded response based on its input signals (see figure 2.4). The output of this network is given by

Output = φ(Σ_{i=0}^{m} w_i · x_i + b)    (2.4)
where φ(x) is some activation function. There are many activation functions, but some of the most common are ReLU (which was used in this study), the sigmoid function, tanh, and SoftMax (see equations 2.5–2.8) [19]. The network can be used to find solutions for any linearly separable problem [20]. However, in 1969, Minsky and Papert proved that the single perceptron could not find a correct solution when the problem was non-linear, such as the XOR problem [21]. This finding led to a sharp decrease in interest in perceptrons until around 1980, when significant developments had been made in the fields of computational techniques and parallel information systems [18].
ReLU: σ(x) = max(x, 0)    (2.5)

Sigmoid: σ(x) = 1 / (1 + e^{−x})    (2.6)

tanh: σ(x) = (e^x − e^{−x}) / (e^x + e^{−x})    (2.7)

SoftMax: σ(x_i) = e^{x_i} / Σ_j e^{x_j}, where x_i is the signal from output neuron i    (2.8)
Figure 2.4: The one-neuron perceptron showing the weights and biases of the network. Illustration based on [22].
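Equation 2.4 and the activation functions 2.5–2.8 can be written out in a few lines. This is a toy sketch with hypothetical names, shown on an AND gate, which a single neuron can solve because AND is linearly separable (XOR, by contrast, is not):

```python
import math

def relu(x):    return max(x, 0.0)                    # eq. 2.5
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))     # eq. 2.6
def tanh(x):    return math.tanh(x)                   # eq. 2.7

def softmax(xs):                                      # eq. 2.8
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def neuron(x, w, b, phi=sigmoid):
    """One-neuron perceptron: phi(sum_i w_i * x_i + b)  (eq. 2.4)."""
    return phi(sum(wi * xi for wi, xi in zip(w, x)) + b)

# AND gate with hand-picked weights: output > 0.5 only for input (1, 1)
w, b = [1.0, 1.0], -1.5
```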
2.2.2 Multi-layer perceptron
A Multi-layer perceptron (MLP) is an ANN that consists of one input and one output layer, both of arbitrary size, and between these layers there is some number of hidden layers (see figure 2.5) [18]. At the time of writing, this type of network is used in a wide range of domains such as computer vision and natural language processing [18]. The MLP can be used to solve non-linear problems and it is because of the hidden nodes and the nonlinear activation functions that this is possible.
According to [23], an appropriately set up MLP can to some degree approximate any function.
The MLP is trained using backpropagation, which propagates the output error back through the network’s layers and calculates the gradient for each of them. The gradient is then used to update all the trainable parameters (weights and biases) in each layer of the network.
This is a supervised learning process, meaning that the network adjusts its weights and biases so that the output better mimics the correct output. The training algorithm can, at a high level of abstraction, be described in six steps.
1) The network’s weights and biases are initialized using some appropriate initialization technique, e.g. Xavier or He initialization, two of the most widely used techniques [24].
2) The network performs a forward propagation on all training data (often in batches). This propagation results in an output for each sample in the batch.
3) The output is compared to the ground truth and an error is calculated using a function such as the Mean Square Error (MSE), Mean Absolute Error (MAE) or Cross-entropy loss.
4) The error is propagated backwards to calculate the gradients for each layer using backpropagation.
5) The parameters in the network are updated using the calculated gradients from step four.
6) Steps two through five are repeated until some stopping criterion is met, e.g. the loss does not change significantly between epochs (one epoch is one iteration over all the data), or the maximum number of epochs is reached.
The choice of the function used to calculate the error (often called loss function) is important since this value is what the network tries to minimize [18]. Further, according to [25], the loss function could affect the final performance of the network and how long it takes for the network to converge. The Cross-entropy function is often used for classification tasks. However, when the problem is binary a modified version called Binary Cross-entropy is instead used.
Figure 2.5: An MLP with one hidden layer. Illustration based on [26].
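The six training steps can be sketched in plain NumPy on the XOR problem, which a single perceptron cannot solve but a one-hidden-layer MLP can. This is an illustrative sketch, not the thesis implementation: plain Gaussian initialization stands in for Xavier/He, the architecture and learning rate are arbitrary choices, and the loss is the binary cross-entropy mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # XOR inputs
y = np.array([[0.], [1.], [1.], [0.]])                   # XOR labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# step 1: initialize weights and biases (plain Gaussian here, not Xavier/He)
W1 = rng.normal(0.0, 1.0, (2, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(0.0, 1.0, (8, 1)); b2 = np.zeros((1, 1))

lr = 0.5
for epoch in range(3000):                     # step 6: repeat until stopping criterion
    h = np.tanh(X @ W1 + b1)                  # step 2: forward propagation
    out = sigmoid(h @ W2 + b2)
    loss = -np.mean(y * np.log(out + 1e-12)   # step 3: binary cross-entropy error
                    + (1 - y) * np.log(1 - out + 1e-12))
    d_z2 = (out - y) / len(X)                 # step 4: backpropagate gradients
    dW2 = h.T @ d_z2; db2 = d_z2.sum(0, keepdims=True)
    d_z1 = (d_z2 @ W2.T) * (1 - h ** 2)       # tanh derivative
    dW1 = X.T @ d_z1; db1 = d_z1.sum(0, keepdims=True)
    W2 -= lr * dW2; b2 -= lr * db2            # step 5: update parameters
    W1 -= lr * dW1; b1 -= lr * db1

preds = (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int)
```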
2.3 Support vector machine
The concept of support vector machines (SVM) was introduced by Vapnik in 1995 and has since then gained a high status within the machine learning community [27]. An SVM can be used for tasks such as classification, regression, anomaly detection and more [27]. In the case of binary classification, the objective of the SVM is to find a decision boundary/hyperplane that separates samples from the two classes with the largest distance between them. This can be done by training an SVM in one of two ways. If we assume that the data is completely linearly separable and without any noise, a Hard-Margin SVM can be used. This SVM finds an optimal hyperplane where no points lie within the maximum margin (see figure 2.6). However, some points lie on the maximum margin and these points are what is called the support vectors [27]. If we instead assume that there is some noise in the data, which is mostly the case, then a Soft margin SVM can be used. The difference between the two types of SVM is that during training, the soft margin SVM allows for some misclassifications by introducing a slack variable into the optimization problem [27],[28].
ξ = (ξ_1, ξ_2, …, ξ_M) is the nonnegative slack variable; ξ_i describes how far from the maximum margin a point x_i lies, see figure 2.6. If 0 < ξ_i < 1, the point was correctly classified but lies within the maximum margin. However, if ξ_i ≥ 1, the point has been misclassified by the optimal hyperplane [28]. The optimization problems for the hard-margin SVM (equation 2.9) and the soft-margin SVM (equation 2.10) can be described respectively as
minimize Q(w, b) = (1/2)‖w‖²    (Hard-margin)    (2.9)
subject to y_i(wᵀx_i + b) ≥ 1 for i = 1, …, M

minimize Q(w, b, ξ) = (1/2)‖w‖² + (C/p) · Σ_{i=1}^{M} ξ_i^p    (Soft-margin)    (2.10)
subject to y_i(wᵀx_i + b) ≥ 1 − ξ_i for i = 1, …, M
Figure 2.6: The two types of SVM classifying 2-dimensional data with variables x1 and x2. Hard-margin SVM (left) and Soft-margin SVM (right)
with samples inside of the maximum margins. Illustration based on [28].
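A crude illustration of the soft-margin objective (2.10) with p = 1 is subgradient descent on the hinge loss. This is only a sketch under simplifying assumptions (constant learning rate, invented function name); real SVM training uses dedicated quadratic-programming or SMO solvers such as those behind scikit-learn's SVC/LinearSVC.

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b)),
    i.e. the p=1 soft-margin primal; labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1          # points with nonzero slack xi_i
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy 1-D data: negatives at 0 and 1, positives at 3 and 4
X = np.array([[0.0], [1.0], [3.0], [4.0]])
y = np.array([-1, -1, 1, 1])
w, b = train_soft_margin_svm(X, y)
```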
2.4 Pre-processing
In machine learning, the quality of the data is one of the deciding factors in how well a problem can be solved. Datasets are often incomplete, meaning that the data quality is poor; for example, the data can contain text attributes or missing entries, which most machine learning models cannot handle. Therefore, the data is usually manipulated in a pre-processing phase to handle these issues. Issues often handled during pre-processing are removal of redundant variables, handling of missing values, outlier treatment, and encoding of the data [29].
2.4.1 Missing values
Missing values can be handled in various ways. One approach is to remove all records that include missing data. However, this could lead to a large part of the data being removed: in [29] it is estimated that in a dataset of 30 variables where only 5% of the data values are missing, spread evenly throughout the data, almost 80% of all records would include a missing value. Therefore, it is sometimes not feasible to omit records with missing values; instead, they should be treated. They could be treated by filling them with an arbitrary constant, with the mean or the mode (if the variable is categorical) of the variable, or with randomly drawn samples from the distribution of the values that exist [29]. Depending on the problem, these techniques impact the performance of the model differently.
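Mean and mode imputation can be sketched in a few lines (an illustrative stub with an invented function name, not the thesis code). The ~80% figure quoted from [29] also follows from a one-line probability calculation, included below.

```python
from statistics import mean, mode

def impute(column, strategy="mean"):
    """Fill missing entries (None) with the column mean, or with the
    mode for categorical columns."""
    present = [v for v in column if v is not None]
    fill = mean(present) if strategy == "mean" else mode(present)
    return [fill if v is None else v for v in column]

# Sanity check of the estimate in [29]: with 30 variables each missing
# independently 5% of the time, the share of records with at least one
# missing value is 1 - 0.95^30, which is roughly 0.785.
p_any_missing = 1 - 0.95 ** 30
```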
2.4.2 Outlier treatment
The definition of an outlier is somewhat ambiguous in the literature. One definition, “… an object in a dataset that is abnormal in comparison to the assumptions of the data”, is presented in [30]. One assumption frequently made about data is that it is normally distributed. The process of detecting outliers is not always easy. In anomaly detection, a specific type of classification problem, the whole learning process is focused on identifying outliers. However, when it comes to removing samples from a dataset to exclude abnormal data, the task is less complex. One factor that affects how outliers can be detected is the type of data in the set. For example, if some of the attributes in the dataset are categorical, then standard detection techniques such as measuring Euclidean distances between samples might not be applicable. Instead, outliers can be detected on a single attribute or on a subset of attributes [30]. When performing outlier detection on a single attribute, a common method is to first standardize the variable according to equation 2.11. All records with a variable whose absolute Z-score is larger than 3.5 are then considered outliers and removed from the dataset [31].
However, if the outlier detection is performed on a multi-attribute level, an algorithm like k-modes can instead be used. The k-modes algorithm is an extension of the k-means clustering algorithm and can measure distance when some attributes are categorical [30]. The data is then clustered, and outliers are defined as samples that lie outside some predefined distance.
Z-score = (X − μ) / σ    (2.11)
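The single-attribute Z-score rule above amounts to one filtering step (a minimal sketch; the function name is invented):

```python
import numpy as np

def remove_outliers(x, threshold=3.5):
    """Keep only the values whose absolute Z-score (equation 2.11)
    is at most the threshold."""
    z = (x - x.mean()) / x.std()
    return x[np.abs(z) <= threshold]
```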
2.4.3 Encoding and normalization
In most real-world applications, raw data has attributes on different scales. For example, a dataset may include the buying price of a house, measured in millions, and the number of rooms in the house, ranging from 1 to 10. Because these features differ greatly in scale, the predictor could be much more affected by the buying price than by the number of rooms; this is called feature bias [32]. In addition, using equally scaled features can speed up the training and hence improve efficiency. To overcome feature bias, features are usually normalized so that all variables are on the same scale. This can be done in multiple ways, e.g. using a Z-score normalization, a min-max normalization, or some other method. Z-score normalization uses the variable mean and standard deviation to produce the normalized data, as can be seen in equation 2.11. Min-max normalization instead uses the min and max value of each variable; all values of the variable are then normalized according to equation 2.12, which results in a value between zero and one. When using a training and test set, the normalization statistics are computed on the values of the training set. It is therefore important that the min and max values are stored and used to transform the test set, since these are the representations that the models learn [3].
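The train-then-transform discipline can be made concrete with a small fit/transform object (a sketch only; the class name is invented, though the convention mirrors scikit-learn's MinMaxScaler):

```python
import numpy as np

class MinMaxScalerSketch:
    """Min-max normalization (equation 2.12) whose statistics are
    fitted on the training set only."""
    def fit(self, X):
        self.min_ = X.min(axis=0)
        self.max_ = X.max(axis=0)
        return self
    def transform(self, X):
        # test-set values outside the training range land outside [0, 1]
        return (X - self.min_) / (self.max_ - self.min_)

train = np.array([[0.0], [10.0]])
scaler = MinMaxScalerSketch().fit(train)
```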
Min-max normalization = (X − X_min) / (X_max − X_min)    (2.12)

2.4.4 High cardinality categorical variables
Categorical variables are variables, often of string type or with discrete values, indicating a specific category. Normally, these are encoded using a one-hot or a dummy encoding, which produce n and n−1 new variables respectively, where n is the number of distinct categories of the variable [5]. A high cardinality (HC) categorical variable is a categorical variable with a large number of distinct categories. In the HC case, it is infeasible to use a dummy encoding, since the dimensionality of the data would quickly become very large. Previous studies have tried different methods of handling HC categorical variables by applying some transformation technique like dummy encoding, semantic grouping, or a transformation that converts the data from categorical to continuous [5]. To avoid an explosion in dimensionality, the conversions to continuous values are the most interesting, since they preserve the dimensionality of the data. Two of the methods used in [5] are “Weight of Evidence” (WOE) and “Supervised Ratio” (SR).
1. Weight of Evidence (WOE): This transforms categorical variables into continuous ones by calculating the metric according to equation 2.13 on the part of the dataset reserved for this task. TC and TN denote the total number of samples belonging to the positive and negative class, respectively. C_i^x and N_i^x denote the number of samples with the ith value of variable X that belong to the positive and negative class, respectively. Since the WOE is calculated on a hold-out set, some categories of the variable may be unseen before the training starts. These values are encoded as zero, which corresponds to the category being equally likely to belong to either label.
2. Supervised Ratio (SR): The supervised ratio is more straightforward than the WOE; it is the probability that the ith value of variable X belongs to the positive class (see equation 2.14), based on the training data. This value always lies between zero and one, where a larger value means the category is more likely to belong to the positive class. As with WOE, unseen values can occur when encoding new data; these are encoded as TC / (TC + TN), representing the probability that any sample belongs to the positive class. In the case of a perfect class balance, this equates to 0.5.
WOE_i^x = ln( (C_i^x / TC) / (N_i^x / TN) )    (2.13)

SR_i^x = C_i^x / (C_i^x + N_i^x)    (2.14)
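Both encodings reduce to simple per-category counts. The sketch below (invented function name, 0/1 labels assumed) computes equations 2.13 and 2.14; note that WOE is undefined for a category with a zero count in either class, so such categories are skipped here, and unseen categories at encoding time would map to 0 (WOE) or TC/(TC+TN) (SR), as described in the text.

```python
import math
from collections import defaultdict

def woe_and_sr(categories, labels):
    """Per-category Weight of Evidence (eq. 2.13) and Supervised
    Ratio (eq. 2.14), computed from binary 0/1 labels."""
    pos, neg = defaultdict(int), defaultdict(int)
    for c, y in zip(categories, labels):
        (pos if y == 1 else neg)[c] += 1
    TC, TN = sum(pos.values()), sum(neg.values())
    cats = set(pos) | set(neg)
    woe = {c: math.log((pos[c] / TC) / (neg[c] / TN))
           for c in cats if pos[c] > 0 and neg[c] > 0}
    sr = {c: pos[c] / (pos[c] + neg[c]) for c in cats}
    return woe, sr
```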
2.4.5 Feature selection
In some cases, certain attributes of the data fed to the model during training are redundant and do not add any information. These attributes only increase the model complexity and the computations needed. It is even possible that some variables affect the model negatively and hence reduce performance [33]. As a countermeasure, feature selection is often applied to the data before training. The intended goals of feature selection are mainly 1) choosing the features that maximize the performance of the learning algorithm, 2) reducing the computational and/or storage requirements of the learning algorithm, or 3) detecting important features that relate to the natural problem being studied.
One algorithm that can be used for feature selection is Random forests (RF) [34], [35]. RF is an ensemble method consisting of a set of decision trees. The model uses bagging and feature randomness to create independent training data for each decision tree during the training phase [36]. This is beneficial since each of the trees learns to classify differently but, hopefully, on a population scale the majority will predict correctly for each sample fed into it. When an RF has been fit to the training set, the features are ranked based on their contribution to solving the problem, i.e. how much the inclusion of a feature improves performance. In a study from 2018 by Sylvester et al. [35], a standard implementation of RF was tested against the fixation index (Fst), a feature selection method used in biology. The outcome of the study was that predictions from the variables selected by the RF performed significantly better than those from the competing method. Further, a study by Shamsoddini et al. from 2017 [34] showed that combining either an MLP or an MLR with RF-based feature selection led to significantly lower prediction error in the context of air pollution prediction, compared to the models without feature selection.
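The importance-ranking idea can be illustrated with a deliberately tiny "forest" of depth-1 trees (stumps): each stump is fit on a bootstrap sample with a randomly chosen feature, and a feature's importance is its average decrease in Gini impurity. This is a toy sketch only; a real RF (e.g. scikit-learn's RandomForestClassifier with its feature_importances_ attribute) grows full trees and searches over split points.

```python
import numpy as np

def stump_importance(X, y, n_trees=100, seed=0):
    """Toy random-forest-style importance: average Gini decrease
    achieved by depth-1 trees on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    gain, used = np.zeros(d), np.zeros(d)

    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = labels.mean()
        return 2 * p * (1 - p)

    for _ in range(n_trees):
        idx = rng.integers(0, n, n)        # bootstrap sample (bagging)
        f = rng.integers(0, d)             # random feature for this stump
        xs, ys = X[idx, f], y[idx]
        thr = np.median(xs)                # crude split point
        left, right = ys[xs <= thr], ys[xs > thr]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        gain[f] += gini(ys) - child
        used[f] += 1
    return gain / np.maximum(used, 1)
```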
2.5 Verifying significance of results
When a study evaluates two models using CV, each model produces a result for each fold. The aim is usually to decide whether the models differ in performance, which is done through statistical hypothesis testing. To do this, a null hypothesis is normally formulated as there being no difference between the performance measurements produced by the models, i.e. the mean values of the result populations are not different [37]. An alternative hypothesis is also formulated, describing the alternative scenario that is accepted if the null hypothesis is rejected. The null hypothesis is tested using some statistical test and some predefined threshold value, often referred to as the significance level and denoted α. Commonly used significance levels are 0.05, 0.01 and 0.001 [37], [38]. Many tests can be used to establish statistical significance, and each is applied under certain assumptions. The two methods used in this study are the two-sample t-test and the Kolmogorov-Smirnov (KS) two-sample test.
2.5.1 Two-sample t-test
The most commonly used test is the two-sample t-test. The test is parametric and can be used under certain assumptions [37], [38]. If the test finds a significant difference, it can be determined that this is because of a difference in location (mean) [38]. This makes the test very powerful, since a conclusion about how the distributions differ can be drawn directly. However, the test assumes that the distributions tested are normally distributed, which is often not the case [38].
2.5.2 Kolmogorov-Smirnov two-sample test
The KS two-sample test is a non-parametric method that tests for differences between two distributions. However, unlike the two-sample t-test, in the case of a significant difference it does not tell whether the difference is due to location (mean), variation (standard deviation), the presence of outliers, or some other property [38]. The outcome of the KS two-sample test is therefore much weaker than that of the t-test, since it does not tell why the distributions differ. On the other hand, the KS two-sample test can be applied in more cases, since it makes fewer assumptions about the two distributions.
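The two test statistics can be computed directly from the per-fold scores. The sketch below is illustrative (pooled equal-size t statistic, invented function names); in practice scipy.stats.ttest_ind and scipy.stats.ks_2samp would supply p-values directly. For two sets of 10 fold scores, the pooled t-test has 18 degrees of freedom, and |t| would be compared against the corresponding critical value from a t table (≈2.10 at α = 0.05).

```python
import math

def two_sample_t(a, b):
    """Pooled two-sample t statistic for equal-sized samples
    (assumes normal distributions with equal variances)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((x - ma) ** 2 for x in a) / (n - 1)
    vb = sum((x - mb) ** 2 for x in b) / (n - 1)
    pooled = math.sqrt((va + vb) / 2)
    return (ma - mb) / (pooled * math.sqrt(2 / n))

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    cdf = lambda s, v: sum(1 for x in s if x <= v) / len(s)
    return max(abs(cdf(a, v) - cdf(b, v)) for v in sorted(set(a) | set(b)))
```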
2.6 Related work
The problem of predicting whether a customer will purchase a specific product has been researched extensively. A large spread of models has been tested for this problem, but few studies compare the performance of SVMs and MLPs. A study by Zhang et al. [39] addressed the prediction of ad click-through rate, a regression problem, using two deep neural networks (DNNs). The problem is similar to the one researched in this thesis in that it evaluates, in a static environment, how well a neural network can detect whether a customer will perform an action. However, it does so by implementing different DNN architectures, whereas this thesis compares one SVM and one MLP. In a study by Moeyersoms et al. [5], similar to this thesis, a linear SVM was trained to predict customer churn in the energy sector. In both [39] and [5], as in this thesis, many features were of categorical/discrete type, and in the latter also of high cardinality. To tackle the problem of exploding dimensionality, [39] suggested learning field-wise feature embeddings through a supervised process prior to the model training, which allows the sparse input space to be represented in a much lower dimension. The resulting models performed better than a logistic regressor using the sparse input. Moeyersoms et al. [5], in contrast to [39], suggested the use of mainly three transformations that map each categorical value to a continuous value, and compared the performance of these to either dropping or dummy encoding all HC categorical variables. The study concluded