
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2020

Prediction of Optimal Packaging Solution using Supervised Learning Methods

ANIRUDH VENKAT CHARI

KTH ROYAL INSTITUTE OF TECHNOLOGY


Prediction of Optimal Packaging Solution using Supervised Learning Methods

ANIRUDH VENKAT CHARI

Degree Projects in Mathematical Statistics (30 ECTS credits)

Master's Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2020

Supervisor at Hennes & Mauritz AB: Anton Alund
Supervisor at KTH: Anastasiia Varava

Examiner at KTH: Anja Janssen


TRITA-SCI-GRU 2020:059
MAT-E 2020:022

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

This thesis investigates the feasibility of supervised learning models for the decision-making problem of packaging products and predicting an optimal packaging solution. The decision-making problem was broken down into a multi-class classification and a regression problem using relevant literature. Supervised learning models from the field of logistics were shortlisted, namely Generalized Linear Models, Support Vector Machines, Random Forest and Gradient Boosted Trees using CatBoost. The performance of the models was evaluated based on relevant metrics, interpretability and ease of implementation. The results of this thesis show that the Random Forest model had the best performance on all the aforementioned criteria in both the classification and regression problems.


Förutsägelse av Optimal Förpackningslösning med Övervakade Inlärningsmodeller

Sammanfattning

Denna avhandling undersöker möjligheten att genomföra övervakade inlärningsmodeller i syfte att förbättra beslutsprocessen kring produktpaketering samt att förutsäga en optimal förpackningslösning. Beslutsfattandeprocessen bröts ner i klassificeringsdelar samt ett regressionsproblem med hjälp av relevant litteratur. De övervakade inlärningsmodeller från logistikområdet som har använts är "Generalized Linear Models", "Support Vector Machines", "Random Forest" och "Gradient Boosted Trees using CatBoost". Modellerna utvärderades utifrån relevanta mätvärden, tolkbarhet och enkelhet avseende implementering. Resultaten i denna avhandling visar att "Random Forest"-modellen har bäst prestanda på alla ovannämnda kriterier, både vad gäller klassificerings- och regressionsproblemen.


Acknowledgements

I would like to express my utmost gratitude to my supervisors Anastasiia Varava and Anton Alund for their constant guidance and support with all aspects of this thesis. I would also like to extend my gratitude to Anja Janssen for addressing all queries regarding the thesis and for examining it. I would also like to thank my peers, Torbjörn Sjöberg and Anton Karlsson, for their constructive criticism and feedback during the peer-review seminars. Lastly, I would like to thank my family and my partner Lovisa Ohlson for being a constant pillar of support and encouragement.


List of Tables

1  Summary of the chosen classification and regression models
2  Kernel functions and their mathematical expressions. γ is a parameter of the radial basis kernel; c, d are parameters of the polynomial kernel
3  Confusion matrix with K classes
4  Description of Product 1 being packed in different packages having different fill-rates
5  Description of Product 1 being packed in different packages having different fill-rates
6  The number of levels of the categorical predictors
7  Summary of the modelling task, predictors and response labels
8  Hyper-parameter set for Generalized Linear Models
9  Hyper-parameter set for SVM
10 Hyper-parameter set for Random Forest
11 Hyper-parameter set for CatBoost
12 Libraries used in model implementation and their purpose
13 Macro-averaged metrics for different classifiers when using response label P broad
14 Macro-averaged metrics for different classifiers when using response label P deep
15 Regression metrics for different regressors when using response label P deep
16 Average F1 macro and Exc Time of the different classifiers when predicting P broad
17 Average F1 macro and Exc Time of the different classifiers when predicting P deep
18 Average RMSE, MAE and Exc Time of the different regressors when predicting Qua.


List of Abbreviations

MLR . . . Multinomial Logistic Regression
MAE . . . Mean Absolute Error
OTS . . . Ordered Target Statistics
RMSE . . . Root Mean Squared Error
SMOTE . . . Synthetic Minority Over-sampling Technique
SVC . . . Support Vector Classifier
SVM . . . Support Vector Machine


Contents

1 Introduction
  1.1 Background
  1.2 Case Study with Company A
  1.3 Purpose
  1.4 Delimitations
  1.5 Disposition
2 Literature Review
  2.1 Review of Supervised Learning Models in Decision-Making Problems in Manufacturing and Logistic Operations
    2.1.1 Multi-class Classification Models
    2.1.2 Regression Models
    2.1.3 Model Selection
  2.2 Metrics for Model Evaluation
3 Theory
  3.1 Supervised Learning
  3.2 Generalized Linear Models
    3.2.1 Multinomial Logistic Regression
    3.2.2 Linear Regression
    3.2.3 Regularization
  3.3 Support Vector Machines
  3.4 Decision Tree-Based Methods
    3.4.1 Decision Trees
    3.4.2 Random Forest
    3.4.3 Gradient Boosting using CatBoost
  3.5 Metrics
    3.5.1 Regression
    3.5.2 Multi-class Classification
  3.6 Handling Response Label Imbalance
  3.7 Cross-validation
  3.8 Categorical Predictor Encoding
4 Data-set Analysis
  4.1 Preparation of Optimal Packaging Solutions
  4.2 Response labels
  4.3 Predictors
5 Methodology
  5.1 Model Implementation
  5.2 Hyper-parameter Sets
  5.3 Software and Libraries
6 Results
  6.1 Multi-class classification
    6.1.1 Using Response Label P broad
    6.1.2 Using Response Label P deep
  6.2 Regression
7 Discussion
  7.1 Recap
  7.2 Model Comparison
    7.2.1 Multi-class Classification
    7.2.2 Regression
  7.3 Choice of Metrics
  7.4 Additional Observations
    7.4.1 Methods of Handling Imbalance among Response Labels and Encoding Categorical Predictors
    7.4.2 Predictor Importance Plots
    7.4.3 Impact of Results from a Business Perspective
  7.5 Future Work
8 Conclusion
A Appendix
  A.1 Box plots for the predictors in the classification of type of package


1 Introduction

1.1 Background

The efficiency of logistics plays a crucial role in the operation and profitability of retail companies, as mentioned by Hellström and Saghir [1]. The packaging of products to be shipped is an important part of their logistic process. Efficiency in this step could lower shipping costs, make warehouse handling easier and reduce wastage of packing materials, resulting in economic and environmental benefits, as mentioned by Hellström and Saghir [1] and Dubey et al. [2]. For a large product and package assortment, the decision-making process to package products is complex. Various constraints, such as the features of the products and packages, shipping regulations, weather conditions, etc., must be accounted for.

Decision-making problems to package products can be modelled as explicit optimization problems in order to obtain an optimal strategy, as mentioned by Bortfeldt and Wäscher [3]. They report that the class of problems commonly encountered and widely studied in logistic operations is the Container Loading Problem (CLP), viz. packing items into containers in order to minimize a specific objective function (e.g. the number of containers or the cost of containers). Several variations of this problem, as reported by Bortfeldt and Wäscher [3], account for items and containers of different sizes and orientations. Locally optimal solutions to such (combinatorial-integer) optimization problems are arrived at using tailor-made heuristic or approximation algorithms such as those mentioned by Correia et al. [4] and Maiza et al. [5]. However, the underlying theme in such problems is primarily focused on geometric constraints.

Additional non-geometric constraints can be added to the problem formulation as binary coded variables. This would in turn require careful adaptation and tuning of the algorithms, especially when having a large assortment of products and packages. Moreover, the task of solving such problems for individual instances is often repetitive, and incorporating prior knowledge from solved instances would potentially be beneficial. An alternative approach to this problem is data-driven: employing packaging experts. A packaging expert would use their domain knowledge, experience and intuition to assess the features of the products and the packages. They would then determine a packaging solution which satisfies all the constraints of the decision-making process. In other words, this expert acts as a heuristic, approximating optimal (with respect to performance indicators/constraints) solutions to the decision-making problem. Thus, it is of interest to develop and evaluate the feasibility of statistical models to mimic the behavior of this expert.

Whilst statistical models have been used widely in logistics and manufacturing operations, their application in the decision-making process of packaging products is under-studied, as reported by Knoll et al. [6, 7]. Knoll et al. [7] successfully developed and implemented an automated process to package car parts using a Random Forest, which is a type of statistical model. The process implemented by them consisted of a two-step approach: (1) predict a relevant package using product features; (2) incorporate the predicted package to predict the quantity of product packed. They concluded their work by suggesting further research to evaluate the feasibility of other statistical models. This thesis is inspired by their work to use the proposed two-step approach and evaluate the performance of different statistical models in both steps by performing a case study with Company A.

1.2 Case Study with Company A

Company A is a leading retailer who would like to improve the efficiency of their packaging process. Currently, packaging decisions are made through manual inspection on a case-by-case basis by different product suppliers. This often results in sub-optimal and varied packaging for the same set of products.

The performance indicator (PI) which Company A uses to assess the final quality of a packing solution is the fill rate. The fill rate is defined as the ratio between the total volume of the products packed and the volume of the packaging box, as shown in equation (1.1). Maximizing the fill rate implicitly involves selecting an appropriate packaging box for a product. Thus, the selection of the packaging box with the maximum fill rate defines an optimal packaging solution in this thesis. Company A has set up experts with certain suppliers to improve the packaging process. The data collected from those suppliers will be used in this thesis.

$$\mathrm{FR} = \frac{\text{Total volume of products packed}}{\text{Volume of packaging box}} \qquad (1.1)$$
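To make the selection criterion concrete, the following minimal Python sketch computes the fill rate of equation (1.1) for a set of candidate boxes and picks the feasible box with the highest fill rate. The product and box volumes are hypothetical illustration values, not data from Company A.

```python
# Hypothetical volumes in litres; illustration of eq. (1.1), not Company A data.
product_volume = 1.2          # volume of one product
quantity = 24                 # number of products to be packed
candidate_boxes = {"box_S": 25.0, "box_M": 32.0, "box_L": 48.0}

# Fill rate of each candidate box for this order quantity.
fill_rates = {name: (product_volume * quantity) / box_volume
              for name, box_volume in candidate_boxes.items()}

# The optimal packaging solution is the feasible box with the highest fill rate.
feasible = {name: fr for name, fr in fill_rates.items() if fr <= 1.0}
best_box = max(feasible, key=feasible.get)
```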

1.3 Purpose

Using the data collected by Company A and the two-step approach elucidated by Knoll et al. [7], the goal of this thesis is to evaluate different statistical models and help automate the packaging decision process to produce an optimal packaging solution for new, unseen products. Using labelled data to predict future labels falls under the purview of Supervised Learning. The main research question (RQ) and the corresponding sub-question (SQ) that will be addressed by this thesis are given below:

• RQ: Which supervised learning models are best suited to capture the relevant features and decisions involved in product packaging to predict an optimal packaging solution?

• SQ: What are the relevant metrics to compare the performance of such models?

Thus, different supervised learning models in the field of logistics will be reviewed. Chosen models will be shortlisted and appropriately adjusted for the problem. The performance of the chosen models will be documented with respect to suitable performance metrics for comparison. The successful analysis and implementation of such models could be used as a guiding tool to improve the packaging efficiency across all suppliers. The chosen model(s) could be implemented and updated (using new data) continuously to streamline and automate the process.

1.4 Delimitations

The thesis focuses only on clothing items. Items with different geometric orientations (such as household articles) were not considered for this project. While a wide variety of models exist in the field of supervised learning, only a few of them were studied in this thesis.

1.5 Disposition

Chapter 2 reviews the literature relating to supervised learning models in the field of logistics. Chapter 3 discusses the relevant theory behind the chosen models, metrics and other evaluation procedures. Chapter 4 describes the data-set used in the thesis and the corresponding predictors and responses used in the models. Chapter 5 describes the methodology for the implementation of the chosen models. The results from the model implementation are shown in Chapter 6. Chapter 7 discusses the results and potential future work. Chapter 8 reports the conclusion of the thesis.


2 Literature Review

This chapter provides a review of the application of supervised learning models in the field of logistics.

2.1 Review of Supervised Learning Models in Decision-Making Problems in Manufacturing and Logistic Operations

A detailed survey on the applicability of machine learning in the various stages of a decision-making process in operations research is presented by Bengio et al. [8]. They discuss the applicability of supervised learning in predicting optimal solutions from solved instances of combinatorial optimization problems by utilizing the input structure of the problem. One example they refer to is Larsen et al. [9], who successfully designed a classification and regression model to predict tactical solutions (brief, non-detailed versions of the optimal solution) using solved instances of a Container Loading Problem. They used a subset of features from the decision-making problem as inputs to their model. The model performed better in predicting tactical solutions in comparison to heuristic algorithms. Li and Olafsson [10] reported benefits of incorporating supervised learning to predict dispatch rules in scheduling problems. They used a decision tree, on synthesized data, to successfully recreate the set of decision rules used by an expert (which in this case was a set of predefined heuristic rules).

Another aspect of this thesis is the applicability of machine learning in manufacturing and logistics processes. Köksal et al. [11] provide a comprehensive review of supervised learning methods in manufacturing and logistic operations. They report that the most frequently used methods for supervised learning (prediction and classification) within logistics are Neural Networks, Support Vector Machines, Generalized Linear Models and Decision Tree-based methods. This review is complemented by the work of Knoll et al. [6], who discuss the applicability and benefits of supervised learning methods in logistic operations and quote an example of a package planning task. Such a model, as described by them, would utilize the existing data available about the material features and packaging options to predict an appropriate package type. As mentioned in section 1.1, Knoll et al. [7]


Figure 1: An overview of the package planning process as described by Knoll et al. [7]

2.1.1 Multi-class Classification Models

The first step of choosing a relevant package based on product features falls under the category of multi-class classification in supervised learning. Multinomial Logistic Regression (MLR) has shown promising results in logistics and manufacturing, such as in the identification of faults by Wang [12], prediction of customer satisfaction by Larasati et al. [13] and assessment of sustainability in food consumption by Abdella et al. [14], to name a few. The model is simple to train and the model parameters can be easily interpreted as the odds over the different package choices for a given product. Support Vector Machines (SVM) use kernel functions as described in section 3.3. As a result, they help in creating additional features and could aid in modelling the complexity of the package selection. Support Vector Machines have been used to successfully predict the logistics distribution mode of transport in e-commerce industries by Zheng et al. [15] and for quality control by Tseng et al. [16]. Random Forest is an ensemble technique that aggregates the predictions of several decision trees. This was used by Knoll et al. [7] in their modelling procedure. They describe the method as being robust and easy to implement. Thus, it is of interest to include it and evaluate whether it could be used to predict the package type for Company A. Random Forest also provides an importance measure which aggregates the impact of the individual features on the final prediction. CatBoost employs gradient boosted decision trees to make predictions, as described by Prokhorenkova et al. [17]. The model they describe incorporates categorical features in an innovative manner using ordered target statistics (see section 3.4.3). Their method also addresses the issue of gradient bias compared to popular gradient boosting methods like XGBoost or LightGBM. CatBoost has been used in predicting credit ratings by Al Daoud [18] and glass transition temperatures by Alcobaça et al. [19]. CatBoost also provides an importance measure which aggregates the impact of the individual features on the final prediction.

2.1.2 Regression Models

The second step of predicting the fill rate using the product and package features falls under the category of regression. The simplest and most commonly used model for regression in logistics and manufacturing is Linear Regression, as mentioned by Köksal et al. [11]. Similar to MLR, Linear Regression is simple to train and the model coefficients describe the relationship of the features with the fill rate. The Random Forest and CatBoost algorithms described in the multi-class classification setting can also be adapted for regression and are thus also included for predicting the fill rate. Table 1 summarizes all the models that are used for both classification of a package and regression of the fill rate.

Table 1: Summary of the chosen classification and regression models

Task           | Models
Classification | Multinomial Logistic Regression, Support Vector Machines, Random Forest, CatBoost
Regression     | Linear Regression, Random Forest, CatBoost

2.1.3 Model Selection

An ideal supervised learning model would be one which captures the complexity of the decision-making process and is interpretable (to understand the decisions involved in packing the products). The Generalized Linear Models, viz. Multinomial Logistic Regression and Linear Regression, represent the simplest and most interpretable models in this scenario. The model coefficients would describe the relationship between the various features in the packing process. The Decision Tree-based ensemble models, viz. Random Forest and CatBoost, offer more complexity than the generalized linear models. While they are less interpretable in comparison to the generalized linear models, they provide feature importance measures which help in understanding the packing process.


The Support Vector Machine offers greater complexity (by creating additional features using kernels) but does not offer interpretability (unless using a linear or low-degree polynomial kernel) compared to the aforementioned models. Another class of models that offer complexity but lack interpretability are Neural Networks (and their variants). Neural Networks show promising results in prediction and classification tasks within logistics and manufacturing, as mentioned by Köksal et al. [11]. However, they require considerable effort in setting up a problem-specific architecture and in parameter tuning, as mentioned by Ahmad et al. [20]. Fernández-Delgado et al. [21] conducted an extensive survey and recorded the performance of 17 families of classifiers on 121 different data-sets from the UCI data repository and did not observe a significant difference in performance between SVM and Neural Networks in classification tasks. Thus, the idea behind shortlisting the various models, as summarized in table 7, was to start with simple and interpretable models and gradually increase the complexity at the cost of interpretability. Neural Networks would be the next choice of models to investigate if none of the chosen models are able to capture the complexity of the decision-making process sufficiently.

2.2 Metrics for Model Evaluation

According to Sokolova and Lapalme [22], the most frequently used method for analyzing the performance of multi-class classification models is the confusion matrix. This matrix can further be used to evaluate model-specific and class-specific metrics which can be used to compare the different classification models. For regression, the most commonly reported metrics for model evaluation are the Root Mean Squared Error and the Mean Absolute Error, as reported by Cruz [23]. Intuitively, they measure the standard deviation and the average magnitude of the error in prediction. These metrics are chosen to evaluate and compare the different models and are further elaborated in section 3.5.


3 Theory

This chapter discusses the theoretical aspects behind the models and their evaluation. Random variables are denoted by upper-case letters. Vectors are denoted by bold letters.

3.1 Supervised Learning

Supervised learning consists of learning the relationship between one or more response variables and a set of predictor variables. Both are considered to be random variables with an unknown joint probability distribution. Instead, a sample from their joint distribution, called training data, is available and is used to learn the relationship.

In a multi-class classification scenario, the response variable $Y$ takes only one distinct value from a complete set of $\{1, 2, \ldots, K\}$ classes. In a regression scenario, the response variable $Y$ takes values in the space of real numbers $\mathbb{R}$. The predictor variables can be numeric or categorical. They are denoted compactly by a $d$-dimensional vector $\mathbf{X} = [X_1, X_2, \ldots, X_d]^T$, where $X_1, X_2, \ldots, X_d$ are the individual predictors.

The relationship between the predictors and the response is expressed through a deterministic function $f$ as in equation (3.1), where $\epsilon$ is the random error term which accounts for the inadequacy of the function $f(\mathbf{X})$ in capturing the response $Y$ [24, p. 16]. A training data set consisting of $n$ independent and identically distributed samples of $\mathbf{X}$ and $Y$, denoted by $T = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, is used for the estimation of $f(\mathbf{X})$. We denote the corresponding estimator of $f(\mathbf{X})$ by $\hat{f}(\mathbf{X})$.

$$Y = f(\mathbf{X}) + \epsilon \qquad (3.1)$$

Depending on the task, assumptions are made on the functional form of $f(\mathbf{X})$. If the function $f(\mathbf{X})$ is characterized by a fixed number of parameters $\Theta$, then it is referred to as a parametric function [24, p. 21]. The number of fixed parameters is independent of the size of the training data set. Thus, the estimation of $\hat{f}(\mathbf{X})$ can be reduced to estimating the parameters $\hat{\Theta}$ of the parametric function. A non-parametric function is not characterized by a fixed set of parameters; the number of parameters of such a function depends on the size of the training data [24, p. 23]. In both cases, the space of all functions that have the same functional form as $f(\mathbf{X})$ is denoted by $\mathcal{F}$.

Once the functional form of $f(\mathbf{X})$ is assumed and the corresponding function space $\mathcal{F}$ is defined, the next task is to estimate it using the training data. A loss function $L(Y, f(\mathbf{X}))$ is defined which quantifies the performance of the function $f(\mathbf{X})$ in predicting $Y$ by mapping it to a number in the real space $\mathbb{R}$. Ideally, one would like to find a function $f(\mathbf{X}) \in \mathcal{F}$ having the lowest loss over all the possible values that the random variables $\mathbf{X}$ and $Y$ can take.


However, this task is intractable as the underlying distribution of $\mathbf{X}$ and $Y$ is unknown. Instead, the empirical risk $R_{emp}$ is defined as the average point-wise loss over the training data-set, as shown in equation (3.2):

$$R_{emp} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i)) \qquad (3.2)$$

The estimator $\hat{f}(\mathbf{X})$ is found by minimizing the empirical risk, as shown in equation (3.3). The solution to this optimization problem gives the best estimator $\hat{f}(\mathbf{X}) \in \mathcal{F}$.

$$\hat{f}(\mathbf{X}) = \arg\min_{f(\mathbf{X}) \in \mathcal{F}} R_{emp} \qquad (3.3)$$
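As an illustration of empirical risk minimization, the following minimal sketch computes $R_{emp}$ for the squared loss and minimizes it numerically over the parameters of a linear function class. The synthetic data, the linear family and the choice of optimizer are illustrative assumptions, not the procedure used in this thesis.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical training data: n samples, d predictors (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def empirical_risk(beta, X, y):
    """R_emp of eq. (3.2): average point-wise squared loss L(y_i, f(x_i))."""
    preds = X @ beta                  # parametric family f(x) = beta^T x
    return np.mean((y - preds) ** 2)  # average squared loss over the sample

# ERM as in eq. (3.3): pick the member of the function class with lowest R_emp.
result = minimize(empirical_risk, x0=np.zeros(X.shape[1]), args=(X, y))
beta_hat = result.x
```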

3.2 Generalized Linear Models

This section discusses Multinomial Logistic Regression and Linear Regression and is inspired by the theory presented by Hastie et al. [25, Ch. 5, 12] and Jurafsky and Martin [26, Ch. 5].

3.2.1 Multinomial Logistic Regression

MLR is an extension of the two-class binary logistic regression model to multi-class classification. The response variable $Y$ takes one distinct value from a complete set of $\{1, 2, \ldots, K\}$ classes. Without loss of generality, the response variable $Y$ can be transformed to a $K$-dimensional vector $\mathbf{Y}$ over the $K$ classes by one-hot coding, with the value 1 in the $k$-th index if it belongs to class $k$ and 0 otherwise. Intuitively, each element $Y_k$ in $\mathbf{Y}$ represents the probability of $Y$ belonging to class $k$. The responses $y_i$ in the training data are also transformed into $K$-dimensional vectors $\mathbf{y}_i$ as described above. The conditional probability of $Y$ belonging to a certain class $k \in \{1, 2, \ldots, K\}$ given the predictor $\mathbf{X} = \mathbf{x}$ is modelled using the softmax function as shown in equation (3.4), where $\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_K$ are the class-specific regression coefficients, collectively denoted by $\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_K]$ [26, p. 89].

The function $f(\mathbf{X})$ is a $K$-dimensional vector $[f_1(\mathbf{X}), \ldots, f_K(\mathbf{X})]$, where $f_k(\mathbf{X}) = P(Y = k \mid \mathbf{X} = \mathbf{x})$ for all $k \in \{1, 2, \ldots, K\}$. The softmax function is the analogue of the sigmoid function in binary logistic regression and ensures that the elements of $f(\mathbf{X})$ conform to a probability distribution over the $K$ classes [26, p. 89]. Thus, the MLR is a parametric model which has a total of $(d+1) \times K$ model parameters that need to be estimated.

$$P(Y = k \mid \mathbf{X} = \mathbf{x}) = \mathrm{softmax}(\mathbf{x}, k, \boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_K) = \frac{e^{\boldsymbol{\beta}_k^T \mathbf{x}}}{\sum_{i=1}^{K} e^{\boldsymbol{\beta}_i^T \mathbf{x}}} \qquad (3.4)$$


The loss function $L(\mathbf{Y}, f(\mathbf{X}))$ that is used in estimating the model parameters is the cross-entropy function, as shown in (3.5). It is equivalent to the negative log-likelihood for multi-class classification problems [25, p. 349].

$$L(\mathbf{Y}, f(\mathbf{X})) = -\sum_{k=1}^{K} Y_k \log(f_k(\mathbf{X})) \qquad (3.5)$$

Using cross-entropy as the loss function, the empirical risk over the training data set can be expressed as in equations (3.6)-(3.8), where $y_{i,k}$ represents the $k$-th element of $\mathbf{y}_i$:

$$R_{emp} = \frac{1}{n}\sum_{i=1}^{n} L(\mathbf{y}_i, f(\mathbf{x}_i)) \qquad (3.6)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} -y_{i,k} \log f_k(\mathbf{x}_i) \qquad (3.7)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} -y_{i,k} \log \frac{e^{\boldsymbol{\beta}_k^T \mathbf{x}_i}}{\sum_{j=1}^{K} e^{\boldsymbol{\beta}_j^T \mathbf{x}_i}} \qquad (3.8)$$

The function $f(\mathbf{X})$ and $R_{emp}$ are completely defined by the parameters $\boldsymbol{\beta}$. Thus, the estimation of $\hat{f}(\mathbf{X})$ reduces to estimating $\hat{\boldsymbol{\beta}}$, which is done by minimizing the empirical risk with respect to $\boldsymbol{\beta}$ as shown in equation (3.9).

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} R_{emp} \qquad (3.9)$$

The cross-entropy loss is a convex function and the optimization problem (3.9) is solved using iterative algorithms to estimate the optimal parameters [26, p. 91]. The prediction $\hat{y}_0$ of a new, unseen observation $\mathbf{x}_0$ is the class having the maximum conditional probability, as shown in equation (3.10).

$$\hat{y}_0 = \arg\max_{k} P(Y = k \mid \mathbf{X} = \mathbf{x}_0) = \arg\max_{k} \hat{f}_k(\mathbf{x}_0) \qquad (3.10)$$
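The softmax probabilities of equation (3.4), the cross-entropy empirical risk of equations (3.6)-(3.8) and the prediction rule of equation (3.10) can be written compactly in NumPy. The sketch below is illustrative: the layout of the coefficient matrix `B` and the one-hot response matrix are assumptions, and the risk would in practice be minimized with an iterative optimizer, as in the sketch in section 3.1.

```python
import numpy as np

def softmax_probs(X, B):
    """Class probabilities per eq. (3.4); X is (n, d+1) with an intercept column,
    B is (d+1, K) holding the class-specific coefficients beta_1, ..., beta_K."""
    scores = X @ B                                # beta_k^T x for every class k
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_risk(B_flat, X, Y_onehot):
    """Empirical risk of eqs. (3.6)-(3.8) with one-hot responses Y_onehot (n, K)."""
    B = B_flat.reshape(X.shape[1], Y_onehot.shape[1])
    P = softmax_probs(X, B)
    return -np.mean(np.sum(Y_onehot * np.log(P + 1e-12), axis=1))

def predict(X, B):
    """Eq. (3.10): assign each observation to the most probable class."""
    return np.argmax(softmax_probs(X, B), axis=1)
```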

3.2.2 Linear Regression

Linear regression assumes a linear relation between the response variable $Y$ and the predictors $\mathbf{X}$, as shown in equation (3.11), where $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_d]^T$ are the model parameters or regression coefficients.

$$Y = \boldsymbol{\beta}^T \mathbf{X} \qquad (3.11)$$

The loss function that is used to estimate these parameters is the squared loss, as shown in equation (3.12). The corresponding empirical risk is shown in equation (3.13).

$$L(Y, f(\mathbf{X})) = (Y - \boldsymbol{\beta}^T \mathbf{X})^2 \qquad (3.12)$$
$$R_{emp} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \boldsymbol{\beta}^T \mathbf{x}_i)^2 \qquad (3.13)$$

The risk minimization to estimate the optimal parameters is a convex problem and a closed-form solution is computed as shown in equation (3.14), where $\hat{\boldsymbol{\beta}}$ represents the optimal regression coefficients, $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]^T$ is the matrix containing all $\mathbf{x}_i$ stacked row-wise and $\mathbf{Y} = [y_1, y_2, \ldots, y_n]^T$ is a column vector containing all $y_i$ [25, pp. 44-46].

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} \qquad (3.14)$$

The prediction $\hat{y}_0$ of a new, unseen observation $\mathbf{x}_0$ is computed as shown in equation (3.15).

$$\hat{y}_0 = \hat{\boldsymbol{\beta}}^T \mathbf{x}_0 \qquad (3.15)$$
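A minimal NumPy sketch of the closed-form estimator (3.14) and the prediction rule (3.15) is given below; the synthetic data merely stands in for real predictors and responses.

```python
import numpy as np

# Hypothetical design matrix X (n, d) and responses y (n,) -- illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(scale=0.2, size=200)

# Closed form of eq. (3.14): beta_hat = (X^T X)^{-1} X^T y.
# np.linalg.solve is used instead of an explicit inverse for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction for a new observation x0 as in eq. (3.15).
x0 = rng.normal(size=4)
y0_hat = beta_hat @ x0
```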

3.2.3 Regularization

Regularization of the model parameters can be achieved by penalizing the model coefficients. Regularization helps in reducing the variance of the estimated model parameters by shrinking them towards 0 [25, pp. 61-69]. This is essential in the presence of correlated or collinear predictors. The regularization function is denoted by $\Omega(\boldsymbol{\beta})$. The estimation of the model parameters $\boldsymbol{\beta}$ is now done by adjusting the optimization problem in equation (3.9) to that shown in equation (3.16), where $C$ is the magnitude of the penalty applied to the model parameters $\boldsymbol{\beta}$.

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \; R_{emp} + C \, \Omega(\boldsymbol{\beta}) \qquad (3.16)$$

When $C = 0$, the estimated model coefficients will be the same as those of a non-regularized model. The regularization functions which will be considered are the Ridge penalty $\Omega_R(\boldsymbol{\beta})$ and the Lasso penalty $\Omega_L(\boldsymbol{\beta})$, as shown in equations (3.17) and (3.18), where $||\cdot||_2$ and $||\cdot||_1$ are the L2 and L1 norms respectively [25, pp. 61-69].

$$\Omega_R(\boldsymbol{\beta}) = ||\boldsymbol{\beta}||_2^2 \qquad (3.17)$$
$$\Omega_L(\boldsymbol{\beta}) = ||\boldsymbol{\beta}||_1 \qquad (3.18)$$
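For the squared loss, the Ridge-penalized problem (3.16)-(3.17) has a closed form; a minimal sketch is given below. The $n \cdot C$ scaling follows from the $1/n$ factor in $R_{emp}$ and is an assumption about how the penalty is weighted. The Lasso penalty (3.18) has no closed form and is typically handled with iterative methods such as coordinate descent.

```python
import numpy as np

def ridge_fit(X, y, C):
    """Minimise R_emp + C * ||beta||_2^2 (eqs. (3.13), (3.16), (3.17)).

    With R_emp = (1/n) * sum_i (y_i - beta^T x_i)^2, setting the gradient to
    zero gives the closed form beta_hat = (X^T X + n*C*I)^{-1} X^T y.
    """
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * C * np.eye(d), X.T @ y)
```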

3.3 Support Vector Machines

Support Vector Machines (SVM) are an extension of the Support Vector Classifier (SVC) for binary classification of the response variable $Y$ using the predictor variable $\mathbf{X}$. The SVC is used when the two response classes cannot be completely separated by a linear decision boundary. This section is inspired by the theory presented in James et al. [24, Ch. 9] and Hastie et al. [25, Ch. 5, 12].

Let the response classes be represented by the set $\{-1, 1\}$. The function $f(\mathbf{x}) = \boldsymbol{\beta}^T \mathbf{x} + \beta_0 = 0$ defines a linear hyper-plane in the predictor space [25, pp. 417-419]. It is also assumed that $||\boldsymbol{\beta}|| = 1$. The hyper-plane divides the predictor space into two half-planes. The classifier predicts a response $\hat{y}_0$ by identifying the region in which the predictor $\mathbf{x}_0$ falls, as shown in equation (3.19) [25, pp. 417-419].

$$\hat{y}_0 = \begin{cases} 1, & \text{if } f(\mathbf{x}_0) \geq 0 \\ -1, & \text{if } f(\mathbf{x}_0) < 0 \end{cases} \qquad (3.19)$$

The goal is to identify an optimal hyper-plane $\hat{f}(\mathbf{x})$ that maximizes the separability between the two response classes using the training data set $T = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$. The distance from the optimal hyper-plane to the nearest predictor $\mathbf{x}_i$ is denoted by $M$ and is called the margin [25, pp. 417-419]. Thus, the optimization problem can be stated as shown in equation (3.20). The first constraint ensures that all the training predictors lie on the correct side of the margin and are at least $M$ units from the hyper-plane [25, pp. 417-419].

$$\begin{aligned} \max_{\boldsymbol{\beta}, \beta_0} \quad & M \\ \text{s.t.} \quad & y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) \geq M, \quad i = 1, \ldots, n, \\ & ||\boldsymbol{\beta}||_2 = 1 \end{aligned} \qquad (3.20)$$

This problem can be reformulated as a minimization problem, as shown in equation (3.21), by dropping the constraint $||\boldsymbol{\beta}|| = 1$ and setting $M = 1/||\boldsymbol{\beta}||$ [25, pp. 417-419].

$$\begin{aligned} \min_{\boldsymbol{\beta}, \beta_0} \quad & ||\boldsymbol{\beta}|| \\ \text{s.t.} \quad & y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) \geq 1, \quad i = 1, \ldots, n \end{aligned} \qquad (3.21)$$

As the predictors in the training set are not linearly separable, an optimal hyper-plane as in equation (3.21) cannot be found. Misclassification of some training data is therefore allowed in order to solve this problem. These misclassified data points lie on the wrong side of the margin. Thus, a slack variable $\xi_i$ is introduced for each training predictor $\mathbf{x}_i$, proportional to the amount by which the observation is on the wrong side of the margin [25, pp. 417-419]. If a predictor $\mathbf{x}_i$ is on the correct side of the margin, then its corresponding slack variable is $\xi_i = 0$. Thus, the optimization problem shown in equation (3.21) can be reformulated to that in equation (3.22), where $C$ is a cost parameter which can be interpreted as the penalty incurred for misclassification [25, pp. 417-419].


$$\begin{aligned} \min_{\boldsymbol{\beta}, \beta_0} \quad & \frac{1}{2}||\boldsymbol{\beta}||^2 + C\sum_{i=1}^{n} \xi_i \\ \text{s.t.} \quad & y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) \geq 1 - \xi_i, \quad i = 1, \ldots, n, \\ & \xi_i \geq 0, \quad i = 1, \ldots, n \end{aligned} \qquad (3.22)$$

The optimization problem described in equation (3.22) is a convex optimization problem and a global minimum is found by formulating the Lagrangian dual and using the Karush-Kuhn-Tucker optimality conditions [25, pp. 420-421]. The corresponding optimal hyper-plane is shown in equation (3.23), where $S$ is the set of training examples that lie exactly on the optimal margin, termed support vectors, and $\hat{\alpha}_i$ are the Lagrange multipliers [25, pp. 417-419].

$$\hat{f}(\mathbf{x}) = \hat{\beta}_0 + \sum_{i \in S} \hat{\alpha}_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle \qquad (3.23)$$

The SVC constructs linear decision boundaries in the predictor space. One way to construct non-linear boundaries is to enlarge the original predictor space by constructing additional features using a feature map function [25, pp. 423-425]. For example, let $\mathbf{x} = [x_1, x_2, \ldots, x_p]$ be a predictor that belongs to the $p$-dimensional predictor space $\mathbb{R}^p$ and $h(\mathbf{x})$ be a feature map that transforms $\mathbf{x}$ to $\mathbf{x}' = [x_1, x_2, \ldots, x_p, x_1^2, x_2^2, \ldots, x_p^2]$. Thus, the original predictor space $\mathbb{R}^p$ has been enlarged to $\mathbb{R}^{2p}$, which is $2p$-dimensional. Linear decision boundaries constructed in this enlarged space $\mathbb{R}^{2p}$ would translate to a quadratic decision boundary in the original space $\mathbb{R}^p$.

A linear hyper-plane in an arbitrary enlarged predictor space $\chi'$, obtained by a feature map function $h(\mathbf{x})$ on the original space $\chi$, is shown in equation (3.24), where $\tilde{\boldsymbol{\beta}} \in \chi'$ are the hyper-plane coefficients in the enlarged space. The optimal hyper-plane in the space $\chi'$ computed by the SVC is shown in equation (3.25) [25, pp. 423-425].

$$f(\mathbf{x}) = \tilde{\boldsymbol{\beta}}^T h(\mathbf{x}) + \beta_0 \qquad (3.24)$$
$$\hat{f}(\mathbf{x}) = \hat{\beta}_0 + \sum_{i \in S} \hat{\alpha}_i y_i \langle h(\mathbf{x}_i), h(\mathbf{x}) \rangle \qquad (3.25)$$

The optimal hyper-plane is related to $h(\mathbf{x})$ only through an inner product. This inner product can be replaced using a kernel function. A kernel function $K : \chi \times \chi \to \mathbb{R}$ is a positive definite and symmetric function which computes the inner product of the feature map in the enlarged predictor space $\chi'$ for all $\mathbf{x}, \tilde{\mathbf{x}} \in \chi$, as shown in equation (3.26) [25, pp. 167-170]. The kernel functions that will be investigated for this project are the polynomial kernel and the radial basis kernel, shown in table 2.


Table 2: Kernel functions and their mathematical expressions. $\gamma$ is a parameter of the radial basis kernel; $c$, $d$ are parameters of the polynomial kernel.

Kernel       | Expression
Radial Basis | $\exp(-\gamma ||\mathbf{x} - \mathbf{x}'||^2)$
Polynomial   | $(\mathbf{x}^T \mathbf{x}' + c)^d$

$$K(\mathbf{x}, \tilde{\mathbf{x}}) = \langle h(\mathbf{x}), h(\tilde{\mathbf{x}}) \rangle_{\chi'} \qquad (3.26)$$

Kernel functions provide a computational advantage by working in the original predictor space rather than the enlarged predictor space. For the radial basis kernel, the enlarged predictor space is infinite-dimensional and the explicit computation of the inner product would be infeasible [24, pp. 352-355].
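As a concrete illustration of equation (3.26) and Table 2, both kernels can be evaluated directly from the original predictors; no explicit feature map $h(\mathbf{x})$ is ever computed. This is a minimal sketch with illustrative parameter values, not the implementation used in this thesis.

```python
import numpy as np

def rbf_kernel(x, x2, gamma):
    """Radial basis kernel from Table 2: exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - x2) ** 2))

def polynomial_kernel(x, x2, c, d):
    """Polynomial kernel from Table 2: (x^T x' + c)^d."""
    return (x @ x2 + c) ** d
```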

Thus, by replacing the inner product in equation (3.25) with a suitable kernel $K$, the optimal hyper-plane in the enlarged predictor space can be expressed as shown in equation (3.27). An SVC which incorporates a kernel function to produce a non-linear decision boundary in the original predictor space is termed an SVM.

$$\hat{f}(\mathbf{x}) = \hat{\beta}_0 + \sum_{i \in S} \hat{\alpha}_i y_i K(\mathbf{x}_i, \mathbf{x}) \qquad (3.27)$$

The extension of the SVM to multi-class classification problems is done by transforming them into multiple binary classification problems. The One vs All (OVA) strategy converts the multi-class problem with $K$ classes into $K$ binary classification problems $P_k$, $\forall k \in \{1, 2, \ldots, K\}$ [24, p. 355]. For each problem $P_k$, an SVM $G_k$ is trained to predict whether the response $Y$ lies in class $k$ or not, by suitably encoding the multi-class response as a binary response. The training of $G_k$ results in estimating the corresponding optimal hyper-plane $\hat{f}_k(\mathbf{x})$. The prediction $\hat{y}_0$ of an unseen test observation $\mathbf{x}_0$ is done as shown in equation (3.28), as this amounts to having the highest confidence in $\mathbf{x}_0$ belonging to class $k$ [24, p. 355].

$$\hat{y}_0 = \arg\max_{i \in \{1, 2, \ldots, K\}} \hat{f}_i(\mathbf{x}_0) \qquad (3.28)$$

The One vs One (OVO) strategy converts the multi-class problem with $K$ classes into $K(K-1)/2$ binary classification problems $P_{k,j}$ [24, p. 355]. For each problem $P_{k,j}$, an SVM $G_{k,j}$ is trained to predict whether the response $Y$ lies in class $k$ or class $j$, using only the training samples from those two classes. The prediction $\hat{y}_0$ of an unseen test observation $\mathbf{x}_0$ is done by passing it through all the classifiers $G_{k,j}$ and assigning it to the most frequently predicted class [24, p. 355]. The OVO strategy is computationally expensive compared to OVA as it trains more models [24, p. 355].
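A minimal scikit-learn sketch of a kernelized SVM for multi-class classification is given below. The synthetic data merely stands in for the product features and the parameter values are illustrative, not the hyper-parameters tuned in this thesis.

```python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Hypothetical multi-class data standing in for product features (illustration only).
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Radial basis kernel from Table 2; C is the misclassification cost of eq. (3.22).
# decision_function_shape='ovr' exposes a One-vs-All style decision function,
# while scikit-learn's underlying multi-class SVC training is One-vs-One.
clf = SVC(kernel="rbf", gamma=0.1, C=1.0, decision_function_shape="ovr")
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```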


3.4 Decision Tree-Based Methods

This section is devoted to the theory behind decision trees and is inspired by the theory presented in [24, Ch. 8], [25, Ch. 9, 10, 16] and Prokhorenkova et al. [17].

3.4.1 Decision Trees

The classification and regression tree (CART) is built by recursively partitioning the predictor space $\chi$ into $m$ hyper-cuboidal regions $R_1, \ldots, R_m$ [25, pp. 305-307]. A decision tree can be represented by a hierarchical graph structure consisting of nodes and edges in a top-down manner. The topmost node is called the root of the tree. The partitioning of each node is done in a binary fashion, such that the regions are non-overlapping and span the entire predictor space, until a stopping criterion is met [25, pp. 305-307]. The terminal nodes are called the leaves and they represent the different partitions $R_1, \ldots, R_m$ of the input predictor space.

Each region $R_j$ is then assigned a local constant $c_j$, depending on whether it is a classification or regression task [25, pp. 305-307]. The functional form of a decision tree $f(\mathbf{x})$ is represented in equation (3.29), where the model parameters $\Theta$ are the set of all partitions and estimators $\{R_j, c_j\}_{j=1}^{m}$, and $\mathbb{1}\{\cdot\}$ is the indicator function which takes the value 1 if the condition is true and 0 otherwise. The number of partitions of a decision tree is not fixed and depends on the size of the data-set. Thus, decision trees are non-parametric models. The estimation of the parameters $\{R_j, c_j\}_{j=1}^{m}$ depends on the task at hand (regression or classification).

$$f(\mathbf{x}) = \sum_{j=1}^{m} c_j \mathbb{1}\{\mathbf{x} \in R_j\} \qquad (3.29)$$

The parameters $\{R_j, c_j\}_{j=1}^{m}$ of a regression tree are estimated using the squared loss, and the corresponding empirical risk $R_{emp}$ is shown in equations (3.30) and (3.31) [25, pp. 305-307].

$$R_{emp} = \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i))^2 \qquad (3.30)$$
$$= \sum_{i=1}^{n} \Big(y_i - \sum_{j=1}^{m} c_j \mathbb{1}\{\mathbf{x}_i \in R_j\}\Big)^2 \qquad (3.31)$$

It can be seen that the estimator $\hat{c}_j$ of $c_j$ that minimizes $R_{emp}$ is the average of the responses $y_i$ which fall in region $R_j$, as given by equation (3.32), where $N_j$ is the number of samples that fall in region $R_j$ [25, pp. 305-307].

$$\hat{c}_j = \frac{1}{N_j} \sum_{i=1}^{n} y_i \mathbb{1}\{\mathbf{x}_i \in R_j\} \qquad (3.32)$$


Unfortunately, the identification of the regions $R_1, \ldots, R_m$ which globally minimize $R_{emp}$ is computationally infeasible [25, pp. 305-307]. Thus, a decision tree is constructed with a greedy approach called recursive binary splitting in a top-down manner [24, pp. 304-307]. Starting at the root node with all the data, the predictor space $\chi$ is partitioned by identifying the splitting predictor $X_j$ and splitting threshold $s$ in a greedy manner. The splitting variable and threshold define two half-planes (two new nodes) $R_1(j, s)$ and $R_2(j, s)$ in the predictor space $\chi$, as shown in equations (3.33) and (3.34) [25, pp. 305-307].

$$R_1(j, s) = \{\mathbf{X} \mid X_j \leq s\} \qquad (3.33)$$
$$R_2(j, s) = \{\mathbf{X} \mid X_j > s\} \qquad (3.34)$$

The best splitting predictor and threshold are then identified by solving the optimization problem shown in equation (3.35).

$$\min_{(j,s)} \left[ \min_{c_1} \sum_{\mathbf{x}_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{\mathbf{x}_i \in R_2(j,s)} (y_i - c_2)^2 \right] \qquad (3.35)$$

The inner optimization problem of estimating $\hat{c}_1$ and $\hat{c}_2$ is similar to equation (3.32). The optimal estimates are shown in equations (3.36) and (3.37), where $N_1$ and $N_2$ are the number of training data predictors falling in regions $R_1(j, s)$ and $R_2(j, s)$ respectively [25, pp. 305-307]. After identifying a splitting variable and threshold, the predictor space $\chi$ is appropriately partitioned, resulting in two new nodes, and the procedure is repeated in each of the new nodes until a stopping criterion is met.

$$\hat{c}_1 = \frac{1}{N_1} \sum_{i=1}^{n} y_i \mathbb{1}\{\mathbf{x}_i \in R_1(j,s)\} \qquad (3.36)$$
$$\hat{c}_2 = \frac{1}{N_2} \sum_{i=1}^{n} y_i \mathbb{1}\{\mathbf{x}_i \in R_2(j,s)\} \qquad (3.37)$$

For regression trees under squared loss, the node impurity $Q_m$ at a node $m$ is defined as shown in equation (3.38) [25, pp. 305-307]. Thus, the optimization problem described in equation (3.35) can be redefined in terms of node impurity as shown in equation (3.39), where $N_{R_1(j,s)}$ and $N_{R_2(j,s)}$ are the number of samples falling in regions $R_1(j, s)$ and $R_2(j, s)$ respectively. The splitting variable and splitting threshold are thus identified such that the weighted node impurity of the two resulting nodes is minimized [25, pp. 305-307].

$$Q_m = \frac{1}{N_m} \sum_{\mathbf{x}_i \in R_m} (y_i - \hat{c}_m)^2 \qquad (3.38)$$

$$\min_{(j,s)} \; N_{R_1(j,s)} Q_{R_1(j,s)} + N_{R_2(j,s)} Q_{R_2(j,s)} \qquad (3.39)$$


The prediction $\hat{y}_0$ for a new observation $\mathbf{x}_0$ is done by identifying the region $R_m$ of the predictor space into which $\mathbf{x}_0$ falls and choosing the appropriate estimator $\hat{c}_m$. For a classification task with $K$ classes, the node impurity used in regression trees is not suitable and needs to be modified appropriately. Let $\hat{c}_{k,m}$ represent the proportion of samples of class $k$ in node $m$, as shown in equation (3.40), where $N_m$ represents the number of samples in node $m$ [25, pp. 307-310].

$$\hat{c}_{k,m} = \frac{1}{N_m} \sum_{\mathbf{x}_i \in R_m} \mathbb{1}\{y_i = k\} \qquad (3.40)$$

The node impurity measure $Q_m$ that is used to identify the splitting variables and thresholds in CART classification trees is called the Gini impurity, shown in equation (3.41). Intuitively, it measures the total variance across all the $K$ classes at node $m$ [25, pp. 307-310].

$$Q_m = \sum_{k=1}^{K} \hat{c}_{k,m}(1 - \hat{c}_{k,m}) \qquad (3.41)$$

Thus, the process of growing classification trees is similar to that of regression trees: the trees are grown by recursive binary splitting, and the splitting variable and splitting threshold are identified such that the weighted Gini impurity of the two resulting nodes is minimized.

The prediction $\hat{y}_0$ for a new observation $\mathbf{x}_0$ is done by identifying the region $R_m$ of the predictor space into which $\mathbf{x}_0$ falls and choosing the majority class in that region, as shown in equation (3.42).

$$\hat{y}_0 = \arg\max_{k} \hat{c}_{k,m} \qquad (3.42)$$
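A minimal sketch of a single greedy splitting step for a classification tree is given below, combining the Gini impurity of equation (3.41) with the weighted-impurity criterion of equation (3.39); a full CART would apply this recursively to each resulting node until a stopping criterion is met. The data layout (a NumPy feature matrix with integer class labels) is an assumption for illustration.

```python
import numpy as np

def gini(y):
    """Gini impurity of eq. (3.41) for the class labels falling in one node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1.0 - p))

def best_split(X, y):
    """Greedy search for the splitting predictor j and threshold s of eq. (3.39):
    minimise N_R1 * Q_R1 + N_R2 * Q_R2 over all candidate (j, s)."""
    best = (None, None, np.inf)
    n, d = X.shape
    for j in range(d):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            score = len(left) * gini(left) + len(right) * gini(right)
            if score < best[2]:
                best = (j, s, score)
    return best  # (splitting predictor index, threshold, weighted impurity)
```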

3.4.2 Random Forest

Decision trees are generally low-biased but suffer from high variance [25, pp. 587-590]. This implies that the splits and thresholds identified would vary considerably across different training data-sets. As a result, their performance on unseen test data can be poor. One way of tackling this problem is through an ensemble-based approach called bootstrap aggregation, or bagging [25, pp. 587-590].

Bagging involves creating $B$ bootstrap samples from the training data-set. Sampling is done with replacement and a decision tree is trained on each of the bootstrapped samples [25, pp. 587-590]. This results in a sequence $\{T_b\}_{b=1}^{B}$ of decision trees which are identically distributed [25, pp. 587-590]. Predictions for an unseen observation are made by aggregating the predictions of each tree in the sequence, depending on the nature of the task, i.e. regression (averaging) or classification (majority voting) [25, pp. 587-590].
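A minimal sketch of bagging with scikit-learn decision trees is given below; integer class labels are assumed for the majority vote. A Random Forest, as implemented for example in scikit-learn's RandomForestClassifier, additionally samples a random subset of predictors at each split to decorrelate the trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=100, random_state=0):
    """Train B decision trees, each on a bootstrap sample drawn with replacement."""
    rng = np.random.default_rng(random_state)
    trees, n = [], len(y)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)            # bootstrap sample of size n
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X_new):
    """Aggregate by majority vote (classification); use the mean for regression."""
    votes = np.stack([t.predict(X_new) for t in trees])   # shape (B, n_new)
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(),
                               axis=0, arr=votes)
```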
