
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2020

Prediction of Optimal Packaging Solution using Supervised Learning Methods

ANIRUDH VENKAT CHARI

KTH ROYAL INSTITUTE OF TECHNOLOGY


Prediction of Optimal Packaging Solution using Supervised Learning Methods

ANIRUDH VENKAT CHARI

Degree Projects in Mathematical Statistics (30 ECTS credits)

Master's Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2020

Supervisor at Hennes & Mauritz AB: Anton Alund
Supervisor at KTH: Anastasiia Varava

Examiner at KTH: Anja Janssen


TRITA-SCI-GRU 2020:059
MAT-E 2020:022

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

This thesis investigates the feasibility of supervised learning models for the decision-making problem of packaging products and predicting an optimal packaging solution. The decision-making problem was broken down into a multi-class classification and a regression problem using relevant literature. Supervised learning models from the field of logistics were shortlisted, namely Generalized Linear Models, Support Vector Machines, Random Forest and Gradient Boosted Trees using CatBoost. The performance of the models was evaluated based on relevant metrics, interpretability and ease of implementation. The results of this thesis show that the Random Forest model had the best performance on all the aforementioned criteria in both the classification and regression problems.


Förutsägelse av Optimal Förpackningslösning med Övervakade Inlärningsmodeller

Sammanfattning

Denna avhandling undersöker möjligheten att genomföra övervakade inlärningsmodeller i syfte att förbättra beslutsprocessen kring produktpaketering samt att förutsäga en optimal förpackningslösning. Beslutsfattandeprocessen bröts ner i klassificeringsdelar samt ett regressionsproblem med hjälp av relevant litteratur. De övervakade inlärningsmodeller från logistikområdet som har använts är "Generalized Linear Models", "Support Vector Machines", "Random Forest" och "Gradient Boosted Trees using CatBoost". Modellerna utvärderades utifrån relevanta mätvärden, tolkbarhet och enkelhet avseende implementering. Resultaten i denna avhandling visar att "Random Forest"-modellen har bäst prestanda på alla ovannämnda kriterier, både vad gäller klassificerings- och regressionsproblemen.


Acknowledgements

I would like to express my utmost gratitude to my supervisors Anastasiia Varava and Anton Alund for their constant guidance and support with all aspects of this thesis. I would also like to extend my gratitude to Anja Janssen for addressing all queries regarding the thesis and for examining it. I would also like to thank my peers, Torbjörn Sjöberg and Anton Karlsson, for their constructive criticism and feedback during the peer-review seminars. Lastly, I would like to thank my family and my partner Lovisa Ohlson for being a constant pillar of support and encouragement.


List of Tables

1  Summary of the chosen classification and regression models
2  Kernel functions and their mathematical expressions. γ is a parameter of the radial basis kernel; c, d are parameters of the polynomial kernel
3  Confusion matrix with K classes
4  Description of Product 1 being packed in different packages having different fill-rates
5  Description of Product 1 being packed in different packages having different fill-rates
6  The number of levels of the categorical predictors
7  Summary of the modelling task, predictors and response labels
8  Hyper-parameter set for Generalized Linear Models
9  Hyper-parameter set for SVM
10 Hyper-parameter set for Random Forest
11 Hyper-parameter set for CatBoost
12 Libraries used in model implementation and their purpose
13 Macro-averaged metrics for different classifiers when using response label P broad
14 Macro-averaged metrics for different classifiers when using response label P deep
15 Regression metrics for different regressors when using response label P deep
16 Average F1 macro and Exc Time of the different classifiers when predicting P broad
17 Average F1 macro and Exc Time of the different classifiers when predicting P deep
18 Average RMSE, MAE and Exc Time of the different regressors when predicting Qua.


List of Abbreviations

MLR . . . Multinomial Logistic Regression
MAE . . . Mean Absolute Error
OTS . . . Ordered Target Statistics
RMSE . . . Root Mean Squared Error
SMOTE . . . Synthetic Minority Over-sampling Technique
SVC . . . Support Vector Classifier
SVM . . . Support Vector Machine


Contents

1 Introduction
  1.1 Background
  1.2 Case Study with Company A
  1.3 Purpose
  1.4 Delimitations
  1.5 Disposition
2 Literature Review
  2.1 Review of Supervised Learning Models in Decision-Making Problems in Manufacturing and Logistic Operations
    2.1.1 Multi-class Classification Models
    2.1.2 Regression Models
    2.1.3 Model Selection
  2.2 Metrics for Model Evaluation
3 Theory
  3.1 Supervised Learning
  3.2 Generalized Linear Models
    3.2.1 Multinomial Logistic Regression
    3.2.2 Linear Regression
    3.2.3 Regularization
  3.3 Support Vector Machines
  3.4 Decision Tree-Based Methods
    3.4.1 Decision Trees
    3.4.2 Random Forest
    3.4.3 Gradient Boosting using CatBoost
  3.5 Metrics
    3.5.1 Regression
    3.5.2 Multi-class Classification
  3.6 Handling Response Label Imbalance
  3.7 Cross-validation
  3.8 Categorical Predictor Encoding
4 Data-set Analysis
  4.1 Preparation of Optimal Packaging Solutions
  4.2 Response labels
  4.3 Predictors
5 Methodology
  5.1 Model Implementation
  5.2 Hyper-parameter Sets
  5.3 Software and Libraries
6 Results
  6.1 Multi-class classification
    6.1.1 Using Response Label P broad
    6.1.2 Using Response Label P deep
  6.2 Regression
7 Discussion
  7.1 Recap
  7.2 Model Comparison
    7.2.1 Multi-class Classification
    7.2.2 Regression
  7.3 Choice of Metrics
  7.4 Additional Observations
    7.4.1 Methods of Handling Imbalance among Response Labels and Encoding Categorical Predictors
    7.4.2 Predictor Importance Plots
    7.4.3 Impact of Results from a Business Perspective
  7.5 Future Work
8 Conclusion
A Appendix
  A.1 Box plots for the predictors in the classification of type of package


1 Introduction

1.1 Background

The efficiency of logistics plays a crucial role in the operation and profitability of retail companies, as mentioned by Hellström and Saghir [1]. The packaging of products to be shipped is an important part of their logistic process. Efficiency in this step could lower shipping costs, make warehouse handling easier and reduce wastage of packing materials, resulting in economic and environmental benefits, as mentioned by Hellström and Saghir [1] and Dubey et al. [2]. For a large product and package assortment, the decision-making process to package products is complex. Various constraints, such as the features of the products and packages, shipping regulations, weather conditions, etc., must be accounted for.

Decision-making problems to package products can be modelled as explicit optimization problems in order to obtain an optimal strategy, as mentioned by Bortfeldt and Wäscher [3]. They report that the class of problems commonly encountered and widely studied in logistic operations is the Container Loading Problem (CLP), viz. packing items into containers in order to minimize a specific objective function (e.g. the number of containers or the cost of containers). Several variations of this problem, as reported by Bortfeldt and Wäscher [3], account for items and containers of different sizes and orientations. Locally optimal solutions to such (combinatorial-integer) optimization problems are arrived at using tailor-made heuristic or approximation algorithms such as those mentioned by Correia et al. [4] and Maiza et al. [5]. However, the underlying theme in such problems is primarily focused on geometric constraints.

Additional non-geometric constraints can be added to the problem formulation as binary coded variables. This would in turn require careful adaptation and tuning of the algorithms, especially when having a large assortment of products and packages. Moreover, the task of solving such problems for individual instances is often repetitive, and incorporating prior knowledge from solved instances would potentially be beneficial. An alternative approach to this problem is data-driven: employing packaging experts. A packaging expert would use their domain knowledge, experience and intuition to assess the features of the products and the packages. They would then determine a packaging solution which satisfies all the constraints of the decision-making process. In other words, this expert acts as a heuristic, approximating optimal (with respect to performance indicators/constraints) solutions to the decision-making problem. Thus, it is of interest to develop and evaluate the feasibility of statistical models to mimic the behavior of this expert.

Whilst statistical models have been used widely in logistics and manufacturing operations, their application in the decision-making process of packaging products is under-studied, as reported by Knoll et al. [6, 7]. Knoll et al. [7] successfully developed and implemented an automated process to package car parts using a Random Forest, which is a type of statistical model. The process implemented by them consisted of a two-step approach: (1) predict a relevant package using product features; (2) incorporate the predicted package to predict the quantity of product packed. They concluded their work by suggesting further research to evaluate the feasibility of other statistical models. This thesis is inspired by their work to use the proposed two-step approach and evaluate the performance of different statistical models in both steps by performing a case study with Company A.

1.2 Case Study with Company A

Company A is a leading retailer who would like to improve the efficiency of their packaging process. Currently, packaging decisions are made through manual inspection on a case-by-case basis by different product suppliers. This often results in sub-optimal and varied packaging for the same set of products.

The performance indicator (PI) which Company A uses to assess the final quality of a packing solution is the fill rate. The fill rate is defined as the ratio between the total volume of the products packed and the volume of the packaging box, as shown in equation (1.1). Maximizing the fill rate implicitly involves selecting an appropriate packaging box for a product. Thus, the selection of the packaging box with the maximum fill rate defines an optimal packaging solution in this thesis. Company A has set up experts with certain suppliers to improve the packaging process. The data collected from those suppliers will be used in this thesis.

$$\mathrm{FR} = \frac{\text{Total volume of products packed}}{\text{Volume of packaging box}} \qquad (1.1)$$
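To make the selection criterion concrete, the following minimal Python sketch computes the fill rate of equation (1.1) for a set of candidate boxes and picks the feasible box with the highest fill rate. The product and box volumes are hypothetical illustration values, not data from Company A.

```python
# Hypothetical volumes in litres; illustration of eq. (1.1), not Company A data.
product_volume = 1.2          # volume of one product
quantity = 24                 # number of products to be packed
candidate_boxes = {"box_S": 25.0, "box_M": 32.0, "box_L": 48.0}

# Fill rate of each candidate box for this order quantity.
fill_rates = {name: (product_volume * quantity) / box_volume
              for name, box_volume in candidate_boxes.items()}

# The optimal packaging solution is the feasible box with the highest fill rate.
feasible = {name: fr for name, fr in fill_rates.items() if fr <= 1.0}
best_box = max(feasible, key=feasible.get)
```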

1.3 Purpose

Using the data collected by Company A and the two-step approach elucidated by Knoll et al. [7], the goal of this thesis is to evaluate different statistical models and help automate the packaging decision process to produce an optimal packaging solution for new, unseen products. Using labelled data to predict future labels falls under the purview of Supervised Learning. The main research question (RQ) and the corresponding sub-question (SQ) that will be addressed by this thesis are given below:

• RQ: Which supervised learning models are best suited to capture the relevant features and decisions involved in product packaging to predict an optimal packaging solution?

• SQ: What are the relevant metrics to compare the performance of such models?

Thus, different supervised learning models in the field of logistics will be reviewed. Chosen models will be shortlisted and appropriately adjusted for the problem. The performance of the chosen models will be documented with respect to suitable performance metrics for comparison. The successful analysis and implementation of such models could be used as a guiding tool to improve the packaging efficiency across all suppliers. The chosen model(s) could be implemented and updated (using new data) continuously to streamline and automate the process.

1.4 Delimitations

The thesis focuses only on clothing items. Items with different geometric orientations (such as household articles) were not considered for this project. While a wide variety of models exist in the field of supervised learning, only a few of them were studied in this thesis.

1.5 Disposition

Chapter 2 reviews the literature relating to supervised learning models in the field of logistics. Chapter 3 discusses the relevant theory behind the chosen models, metrics and other evaluation procedures. Chapter 4 describes the data-set used in the thesis and the corresponding predictors and responses used in the models. Chapter 5 describes the methodology for the implementation of the chosen models. The results from the model implementation are shown in Chapter 6. Chapter 7 discusses the results and potential future work. Chapter 8 reports the conclusion of the thesis.


2 Literature Review

This chapter provides a review of the application of supervised learning models in the field of logistics.

2.1 Review of Supervised Learning Models in Decision-Making Problems in Manufacturing and Logistic Operations

A detailed survey on the applicability of machine learning in the various stages of a decision-making process in operations research is presented by Bengio et al. [8]. They discuss the applicability of supervised learning in predicting optimal solutions from solved instances of combinatorial optimization problems by utilizing the input structure of the problem. One example they refer to is Larsen et al. [9], who successfully designed a classification and regression model to predict tactical solutions (brief, non-detailed versions of the optimal solution) using solved instances of a Container Loading Problem. They used a subset of features from the decision-making problem as inputs to their model. The model performed better in predicting tactical solutions in comparison to heuristic algorithms. Li and Olafsson [10] reported benefits of incorporating supervised learning to predict dispatch rules in scheduling problems. They used a decision tree, on synthesized data, to successfully recreate the set of decision rules used by an expert (which in this case was a set of predefined heuristic rules).

Another aspect of this thesis is the applicability of machine learning in manufacturing and logistics processes. Köksal et al. [11] provide a comprehensive review of supervised learning methods in manufacturing and logistic operations. They report that the most frequently used methods for supervised learning (prediction and classification) within logistics are Neural Networks, Support Vector Machines, Generalized Linear Models and Decision Tree-based methods. This review is complemented by the work of Knoll et al. [6], who discuss the applicability and benefits of supervised learning methods in logistic operations and quote an example of a package planning task. Such a model, as described by them, would utilize the existing data available about the material features and packaging options to predict an appropriate package type. As mentioned in section 1.1, Knoll et al. [7]


Figure 1: An overview of the package planning process as described by Knoll et al. [7]

2.1.1 Multi-class Classification Models

The first step of choosing a relevant package based on product features falls under the category of multi-class classification in supervised learning. Multinomial Logistic Regression (MLR) has shown promising results in logistics and manufacturing, such as in the identification of faults by Wang [12], prediction of customer satisfaction by Larasati et al. [13] and assessment of sustainability in food consumption by Abdella et al. [14], to name a few. The model is simple to train and the model parameters can be easily interpreted as the odds over the different package choices for a given product. Support Vector Machines (SVM) use kernel functions as described in section 3.3. As a result, they help in creating additional features and could aid in modelling the complexity of the package selection. Support Vector Machines have been used to successfully predict the logistics distribution mode of transport in e-commerce industries by Zheng et al. [15] and for quality control by Tseng et al. [16]. Random Forest is an ensemble technique that aggregates the predictions of several decision trees. This was used by Knoll et al. [7] in their modelling procedure. They describe the method as being robust and easy to implement. Thus, it is of interest to include it and evaluate whether it could be used to predict the package type for Company A. Random Forest also provides an importance measure which aggregates the impact of the individual features on the final prediction. CatBoost employs gradient boosted decision trees to make predictions, as described by Prokhorenkova et al. [17]. The model they describe incorporates categorical features in an innovative manner using ordered target statistics (see section 3.4.3). Their method also addresses the issue of gradient bias compared to popular gradient boosting methods like XGBoost or LightGBM. CatBoost has been used in predicting credit ratings by Al Daoud [18] and glass transition temperatures by Alcobaça et al. [19]. CatBoost also provides an importance measure which aggregates the impact of the individual features on the final prediction.

2.1.2 Regression Models

The second step of predicting the fill rate using the product and package features falls under the category of regression. The simplest and most commonly used model for regression in logistics and manufacturing is Linear Regression, as mentioned by Köksal et al. [11]. Similar to MLR, Linear Regression is simple to train and the model coefficients describe the relationship of the features with the fill rate. The Random Forest and CatBoost algorithms described in the multi-class classification setting can also be adapted for regression and are thus also included for predicting the fill rate. Table 1 summarizes all the models that are used for both classification of a package and regression of the fill rate.

Table 1: Summary of the chosen classification and regression models

Task           | Models
Classification | Multinomial Logistic Regression, Support Vector Machines, Random Forest, CatBoost
Regression     | Linear Regression, Random Forest, CatBoost

2.1.3 Model Selection

An ideal supervised learning model would be one which captures the complexity of the decision-making process and is interpretable (to understand the decisions involved in packing the products). The Generalized Linear Models, viz. Multinomial Logistic Regression and Linear Regression, represent the simplest and most interpretable models in this scenario. The model coefficients would describe the relationship between the various features in the packing process. The Decision Tree-based ensemble models, viz. Random Forest and CatBoost, offer more complexity than the generalized linear models. While they are less interpretable in comparison to the generalized linear models, they provide feature importance measures which help in understanding the packing process.


The Support Vector Machine offers greater complexity (by creating additional features using kernels) but does not offer interpretability (unless using a linear or low-degree polynomial kernel) compared to the aforementioned models. Another class of models that offer complexity but lack interpretability are Neural Networks (and their variants). Neural Networks show promising results in prediction and classification tasks within logistics and manufacturing, as mentioned by Köksal et al. [11]. However, they require considerable effort in setting up a problem-specific architecture and in parameter tuning, as mentioned by Ahmad et al. [20]. Fernández-Delgado et al. [21] conducted an extensive survey and recorded the performance of 17 families of classifiers on 121 different data-sets from the UCI data repository and did not observe a significant difference in performance between SVM and Neural Networks in classification tasks. Thus, the idea behind shortlisting the various models, as summarized in table 7, was to start with simple and interpretable models and gradually increase the complexity at the cost of interpretability. Neural Networks would be the next choice of models to investigate if none of the chosen models are able to capture the complexity of the decision-making process sufficiently.

2.2 Metrics for Model Evaluation

According to Sokolova and Lapalme [22], the most frequently used method for analyzing the performance of multi-class classification models is the confusion matrix. This matrix can further be used to evaluate model-specific and class-specific metrics which can be used to compare the different classification models. For regression, the most commonly reported metrics for model evaluation are the Root Mean Squared Error and the Mean Absolute Error, as reported by Cruz [23]. Intuitively, they measure the standard deviation and the average magnitude of the error in prediction. These metrics are chosen to evaluate and compare the different models and are further elaborated in section 3.5.


3 Theory

This chapter discusses the theoretical aspects behind the models and their evaluation. Random variables are denoted by upper-case letters. Vectors are denoted by bold letters.

3.1 Supervised Learning

Supervised learning consists of learning the relationship between one or more response variables and a set of predictor variables. Both are considered to be random variables with an unknown joint probability distribution. Instead, a sample from their joint distribution, called training data, is available and is used to learn the relationship.

In a multi-class classification scenario, the response variable $Y$ takes only one distinct value from a complete set of $\{1, 2, \ldots, K\}$ classes. In a regression scenario, the response variable $Y$ takes values in the space of real numbers $\mathbb{R}$. The predictor variables can be numeric or categorical. They are denoted compactly by a $d$-dimensional vector $\mathbf{X} = [X_1, X_2, \ldots, X_d]^T$, where $X_1, X_2, \ldots, X_d$ are the individual predictors.

The relationship between the predictors and the response is expressed through a deterministic function $f$ as in equation (3.1), where $\epsilon$ is the random error term which accounts for the inadequacy of the function $f(\mathbf{X})$ in capturing the response $Y$ [24, p. 16]. A training data set consisting of $n$ independent and identically distributed samples of $\mathbf{X}$ and $Y$, denoted by $T = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, is used for the estimation of $f(\mathbf{X})$. We denote the corresponding estimator of $f(\mathbf{X})$ by $\hat{f}(\mathbf{X})$.

$$Y = f(\mathbf{X}) + \epsilon \qquad (3.1)$$

Depending on the task, assumptions are made on the functional form of $f(\mathbf{X})$. If the function $f(\mathbf{X})$ is characterized by a fixed number of parameters $\Theta$, then it is referred to as a parametric function [24, p. 21]. The number of fixed parameters is independent of the size of the training data set. Thus, the estimation of $\hat{f}(\mathbf{X})$ can be reduced to estimating the parameters $\hat{\Theta}$ of the parametric function. A non-parametric function is not characterized by a fixed set of parameters; the number of parameters of such a function depends on the size of the training data [24, p. 23]. In both cases, the space of all functions that have the same functional form as $f(\mathbf{X})$ is denoted by $\mathcal{F}$.

Once the functional form of $f(\mathbf{X})$ is assumed and the corresponding function space $\mathcal{F}$ is defined, the next task is to estimate it using the training data. A loss function $L(Y, f(\mathbf{X}))$ is defined which quantifies the performance of the function $f(\mathbf{X})$ in predicting $Y$ by mapping it to a number in the real space $\mathbb{R}$. Ideally, one would like to find a function $f(\mathbf{X}) \in \mathcal{F}$ having the lowest loss over all the possible values that the random variables $\mathbf{X}$ and $Y$ can take.


However, this task is intractable as the underlying distribution of $\mathbf{X}$ and $Y$ is unknown. Instead, the empirical risk $R_{emp}$ is defined as the average point-wise loss over the training data-set, as shown in equation (3.2):

$$R_{emp} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i)) \qquad (3.2)$$

The estimator $\hat{f}(\mathbf{X})$ is found by minimizing the empirical risk, as shown in equation (3.3). The solution to this optimization problem gives the best estimator $\hat{f}(\mathbf{X}) \in \mathcal{F}$.

$$\hat{f}(\mathbf{X}) = \arg\min_{f(\mathbf{X}) \in \mathcal{F}} R_{emp} \qquad (3.3)$$
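As an illustration of empirical risk minimization, the following minimal sketch computes $R_{emp}$ for the squared loss and minimizes it numerically over the parameters of a linear function class. The synthetic data, the linear family and the choice of optimizer are illustrative assumptions, not the procedure used in this thesis.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical training data: n samples, d predictors (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def empirical_risk(beta, X, y):
    """R_emp of eq. (3.2): average point-wise squared loss L(y_i, f(x_i))."""
    preds = X @ beta                  # parametric family f(x) = beta^T x
    return np.mean((y - preds) ** 2)  # average squared loss over the sample

# ERM as in eq. (3.3): pick the member of the function class with lowest R_emp.
result = minimize(empirical_risk, x0=np.zeros(X.shape[1]), args=(X, y))
beta_hat = result.x
```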

3.2 Generalized Linear Models

This section discusses Multinomial Logistic Regression and Linear Regression and is inspired by the theory presented by Hastie et al. [25, Ch. 5, 12] and Jurafsky and Martin [26, Ch. 5].

3.2.1 Multinomial Logistic Regression

MLR is an extension of the two-class binary logistic regression model to multi-class classification. The response variable $Y$ takes one distinct value from a complete set of $\{1, 2, \ldots, K\}$ classes. Without loss of generality, the response variable $Y$ can be transformed to a $K$-dimensional vector $\mathbf{Y}$ over the $K$ classes by one-hot coding, with the value 1 in the $k$-th index if it belongs to class $k$ and 0 otherwise. Intuitively, each element $Y_k$ in $\mathbf{Y}$ represents the probability of $Y$ belonging to class $k$. The responses $y_i$ in the training data are also transformed into $K$-dimensional vectors $\mathbf{y}_i$ as described above. The conditional probability of $Y$ belonging to a certain class $k \in \{1, 2, \ldots, K\}$ given the predictor $\mathbf{X} = \mathbf{x}$ is modelled using the softmax function as shown in equation (3.4), where $\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_K$ are the class-specific regression coefficients, collectively denoted by $\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_K]$ [26, p. 89].

The function $f(\mathbf{X})$ is a $K$-dimensional vector $[f_1(\mathbf{X}), \ldots, f_K(\mathbf{X})]$, where $f_k(\mathbf{X}) = P(Y = k \mid \mathbf{X} = \mathbf{x})$ for all $k \in \{1, 2, \ldots, K\}$. The softmax function is the analogue of the sigmoid function in binary logistic regression and ensures that the elements of $f(\mathbf{X})$ conform to a probability distribution over the $K$ classes [26, p. 89]. Thus, the MLR is a parametric model which has a total of $(d+1) \times K$ model parameters that need to be estimated.

$$P(Y = k \mid \mathbf{X} = \mathbf{x}) = \mathrm{softmax}(\mathbf{x}, k, \boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_K) = \frac{e^{\boldsymbol{\beta}_k^T \mathbf{x}}}{\sum_{i=1}^{K} e^{\boldsymbol{\beta}_i^T \mathbf{x}}} \qquad (3.4)$$


The loss function $L(\mathbf{Y}, f(\mathbf{X}))$ that is used in estimating the model parameters is the cross-entropy function, as shown in (3.5). It is equivalent to the negative log-likelihood for multi-class classification problems [25, p. 349].

$$L(\mathbf{Y}, f(\mathbf{X})) = -\sum_{k=1}^{K} Y_k \log(f_k(\mathbf{X})) \qquad (3.5)$$

Using cross-entropy as the loss function, the empirical risk over the training data set can be expressed as in equations (3.6)-(3.8), where $y_{i,k}$ represents the $k$-th element of $\mathbf{y}_i$:

$$R_{emp} = \frac{1}{n}\sum_{i=1}^{n} L(\mathbf{y}_i, f(\mathbf{x}_i)) \qquad (3.6)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} -y_{i,k} \log f_k(\mathbf{x}_i) \qquad (3.7)$$
$$= \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} -y_{i,k} \log \frac{e^{\boldsymbol{\beta}_k^T \mathbf{x}_i}}{\sum_{j=1}^{K} e^{\boldsymbol{\beta}_j^T \mathbf{x}_i}} \qquad (3.8)$$

The function $f(\mathbf{X})$ and $R_{emp}$ are completely defined by the parameters $\boldsymbol{\beta}$. Thus, the estimation of $\hat{f}(\mathbf{X})$ reduces to estimating $\hat{\boldsymbol{\beta}}$, which is done by minimizing the empirical risk with respect to $\boldsymbol{\beta}$ as shown in equation (3.9).

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} R_{emp} \qquad (3.9)$$

The cross-entropy loss is a convex function and the optimization problem (3.9) is solved using iterative algorithms to estimate the optimal parameters [26, p. 91]. The prediction $\hat{y}_0$ of a new, unseen observation $\mathbf{x}_0$ is the class having the maximum conditional probability, as shown in equation (3.10).

$$\hat{y}_0 = \arg\max_{k} P(Y = k \mid \mathbf{X} = \mathbf{x}_0) = \arg\max_{k} \hat{f}_k(\mathbf{x}_0) \qquad (3.10)$$
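The softmax probabilities of equation (3.4), the cross-entropy empirical risk of equations (3.6)-(3.8) and the prediction rule of equation (3.10) can be written compactly in NumPy. The sketch below is illustrative: the layout of the coefficient matrix `B` and the one-hot response matrix are assumptions, and the risk would in practice be minimized with an iterative optimizer, as in the sketch in section 3.1.

```python
import numpy as np

def softmax_probs(X, B):
    """Class probabilities per eq. (3.4); X is (n, d+1) with an intercept column,
    B is (d+1, K) holding the class-specific coefficients beta_1, ..., beta_K."""
    scores = X @ B                                # beta_k^T x for every class k
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_risk(B_flat, X, Y_onehot):
    """Empirical risk of eqs. (3.6)-(3.8) with one-hot responses Y_onehot (n, K)."""
    B = B_flat.reshape(X.shape[1], Y_onehot.shape[1])
    P = softmax_probs(X, B)
    return -np.mean(np.sum(Y_onehot * np.log(P + 1e-12), axis=1))

def predict(X, B):
    """Eq. (3.10): assign each observation to the most probable class."""
    return np.argmax(softmax_probs(X, B), axis=1)
```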

3.2.2 Linear Regression

Linear regression assumes a linear relation between the response variable $Y$ and the predictors $\mathbf{X}$, as shown in equation (3.11), where $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_d]^T$ are the model parameters or regression coefficients.

$$Y = \boldsymbol{\beta}^T \mathbf{X} \qquad (3.11)$$

The loss function that is used to estimate these parameters is the squared loss, as shown in equation (3.12). The corresponding empirical risk is shown in equation (3.13).

$$L(Y, f(\mathbf{X})) = (Y - \boldsymbol{\beta}^T \mathbf{X})^2 \qquad (3.12)$$
$$R_{emp} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \boldsymbol{\beta}^T \mathbf{x}_i)^2 \qquad (3.13)$$

The risk minimization to estimate the optimal parameters is a convex problem and a closed-form solution is computed as shown in equation (3.14), where $\hat{\boldsymbol{\beta}}$ represents the optimal regression coefficients, $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]^T$ is the matrix containing all $\mathbf{x}_i$ stacked row-wise and $\mathbf{Y} = [y_1, y_2, \ldots, y_n]^T$ is a column vector containing all $y_i$ [25, pp. 44-46].

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} \qquad (3.14)$$

The prediction $\hat{y}_0$ of a new, unseen observation $\mathbf{x}_0$ is computed as shown in equation (3.15).

$$\hat{y}_0 = \hat{\boldsymbol{\beta}}^T \mathbf{x}_0 \qquad (3.15)$$
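A minimal NumPy sketch of the closed-form estimator (3.14) and the prediction rule (3.15) is given below; the synthetic data merely stands in for real predictors and responses.

```python
import numpy as np

# Hypothetical design matrix X (n, d) and responses y (n,) -- illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(scale=0.2, size=200)

# Closed form of eq. (3.14): beta_hat = (X^T X)^{-1} X^T y.
# np.linalg.solve is used instead of an explicit inverse for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction for a new observation x0 as in eq. (3.15).
x0 = rng.normal(size=4)
y0_hat = beta_hat @ x0
```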

3.2.3 Regularization

Regularization of the model parameters can be achieved by penalizing the model coefficients. Regularization helps in reducing the variance of the estimated model parameters by shrinking them towards 0 [25, pp. 61-69]. This is essential in the presence of correlated or collinear predictors. The regularization function is denoted by $\Omega(\boldsymbol{\beta})$. The estimation of the model parameters $\boldsymbol{\beta}$ is now done by adjusting the optimization problem in equation (3.9) to that shown in equation (3.16), where $C$ is the magnitude of the penalty applied to the model parameters $\boldsymbol{\beta}$.

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \; R_{emp} + C \, \Omega(\boldsymbol{\beta}) \qquad (3.16)$$

When $C = 0$, the estimated model coefficients will be the same as those of a non-regularized model. The regularization functions which will be considered are the Ridge penalty $\Omega_R(\boldsymbol{\beta})$ and the Lasso penalty $\Omega_L(\boldsymbol{\beta})$, as shown in equations (3.17) and (3.18), where $||\cdot||_2$ and $||\cdot||_1$ are the L2 and L1 norms respectively [25, pp. 61-69].

$$\Omega_R(\boldsymbol{\beta}) = ||\boldsymbol{\beta}||_2^2 \qquad (3.17)$$
$$\Omega_L(\boldsymbol{\beta}) = ||\boldsymbol{\beta}||_1 \qquad (3.18)$$
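For the squared loss, the Ridge-penalized problem (3.16)-(3.17) has a closed form; a minimal sketch is given below. The $n \cdot C$ scaling follows from the $1/n$ factor in $R_{emp}$ and is an assumption about how the penalty is weighted. The Lasso penalty (3.18) has no closed form and is typically handled with iterative methods such as coordinate descent.

```python
import numpy as np

def ridge_fit(X, y, C):
    """Minimise R_emp + C * ||beta||_2^2 (eqs. (3.13), (3.16), (3.17)).

    With R_emp = (1/n) * sum_i (y_i - beta^T x_i)^2, setting the gradient to
    zero gives the closed form beta_hat = (X^T X + n*C*I)^{-1} X^T y.
    """
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * C * np.eye(d), X.T @ y)
```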

3.3 Support Vector Machines

Support Vector Machines (SVM) are an extension of the Support Vector Classifier (SVC) for binary classification of the response variable $Y$ using the predictor variable $\mathbf{X}$. The SVC is used when the two response classes cannot be completely separated by a linear decision boundary. This section is inspired by the theory presented in James et al. [24, Ch. 9] and Hastie et al. [25, Ch. 5, 12].

Let the response classes be represented by the set $\{-1, 1\}$. The function $f(\mathbf{x}) = \boldsymbol{\beta}^T \mathbf{x} + \beta_0 = 0$ defines a linear hyper-plane in the predictor space [25, pp. 417-419]. It is also assumed that $||\boldsymbol{\beta}|| = 1$. The hyper-plane divides the predictor space into two half-planes. The classifier predicts a response $\hat{y}_0$ by identifying the region in which the predictor $\mathbf{x}_0$ falls, as shown in equation (3.19) [25, pp. 417-419].

$$\hat{y}_0 = \begin{cases} 1, & \text{if } f(\mathbf{x}_0) \geq 0 \\ -1, & \text{if } f(\mathbf{x}_0) < 0 \end{cases} \qquad (3.19)$$

The goal is to identify an optimal hyper-plane $\hat{f}(\mathbf{x})$ that maximizes the separability between the two response classes using the training data set $T = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$. The distance from the optimal hyper-plane to the nearest predictor $\mathbf{x}_i$ is denoted by $M$ and is called the margin [25, pp. 417-419]. Thus, the optimization problem can be stated as shown in equation (3.20). The first constraint ensures that all the training predictors lie on the correct side of the margin and are at least $M$ units from the hyper-plane [25, pp. 417-419].

$$\begin{aligned} \max_{\boldsymbol{\beta}, \beta_0} \quad & M \\ \text{s.t.} \quad & y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) \geq M, \quad i = 1, \ldots, n, \\ & ||\boldsymbol{\beta}||_2 = 1 \end{aligned} \qquad (3.20)$$

This problem can be reformulated as a minimization problem, as shown in equation (3.21), by dropping the constraint $||\boldsymbol{\beta}|| = 1$ and setting $M = 1/||\boldsymbol{\beta}||$ [25, pp. 417-419].

$$\begin{aligned} \min_{\boldsymbol{\beta}, \beta_0} \quad & ||\boldsymbol{\beta}|| \\ \text{s.t.} \quad & y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) \geq 1, \quad i = 1, \ldots, n \end{aligned} \qquad (3.21)$$

As the predictors in the training set are not linearly separable, an optimal hyper-plane as in equation (3.21) cannot be found. Misclassification of some training data is therefore allowed in order to solve this problem. These misclassified data points lie on the wrong side of the margin. Thus, a slack variable $\xi_i$ is introduced for each training predictor $\mathbf{x}_i$, proportional to the amount by which the observation is on the wrong side of the margin [25, pp. 417-419]. If a predictor $\mathbf{x}_i$ is on the correct side of the margin, then its corresponding slack variable is $\xi_i = 0$. Thus, the optimization problem shown in equation (3.21) can be reformulated to that in equation (3.22), where $C$ is a cost parameter which can be interpreted as the penalty incurred for misclassification [25, pp. 417-419].


$$\begin{aligned} \min_{\boldsymbol{\beta}, \beta_0} \quad & \frac{1}{2}||\boldsymbol{\beta}||^2 + C\sum_{i=1}^{n} \xi_i \\ \text{s.t.} \quad & y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) \geq 1 - \xi_i, \quad i = 1, \ldots, n, \\ & \xi_i \geq 0, \quad i = 1, \ldots, n \end{aligned} \qquad (3.22)$$

The optimization problem described in equation (3.22) is a convex optimization problem and a global minimum is found by formulating the Lagrangian dual and using the Karush-Kuhn-Tucker optimality conditions [25, pp. 420-421]. The corresponding optimal hyper-plane is shown in equation (3.23), where $S$ is the set of training examples that lie exactly on the optimal margin, termed support vectors, and $\hat{\alpha}_i$ are the Lagrange multipliers [25, pp. 417-419].

$$\hat{f}(\mathbf{x}) = \hat{\beta}_0 + \sum_{i \in S} \hat{\alpha}_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle \qquad (3.23)$$

The SVC constructs linear decision boundaries in the predictor space. One way to construct non-linear boundaries is to enlarge the original predictor space by constructing additional features using a feature map function [25, pp. 423-425]. For example, let $\mathbf{x} = [x_1, x_2, \ldots, x_p]$ be a predictor that belongs to the $p$-dimensional predictor space $\mathbb{R}^p$ and $h(\mathbf{x})$ be a feature map that transforms $\mathbf{x}$ to $\mathbf{x}' = [x_1, x_2, \ldots, x_p, x_1^2, x_2^2, \ldots, x_p^2]$. Thus, the original predictor space $\mathbb{R}^p$ has been enlarged to $\mathbb{R}^{2p}$, which is $2p$-dimensional. Linear decision boundaries constructed in this enlarged space $\mathbb{R}^{2p}$ would translate to a quadratic decision boundary in the original space $\mathbb{R}^p$.

A linear hyper-plane in an arbitrary enlarged predictor space $\chi'$, obtained by a feature map function $h(\mathbf{x})$ on the original space $\chi$, is shown in equation (3.24), where $\tilde{\boldsymbol{\beta}} \in \chi'$ are the hyper-plane coefficients in the enlarged space. The optimal hyper-plane in the space $\chi'$ computed by the SVC is shown in equation (3.25) [25, pp. 423-425].

$$f(\mathbf{x}) = \tilde{\boldsymbol{\beta}}^T h(\mathbf{x}) + \beta_0 \qquad (3.24)$$
$$\hat{f}(\mathbf{x}) = \hat{\beta}_0 + \sum_{i \in S} \hat{\alpha}_i y_i \langle h(\mathbf{x}_i), h(\mathbf{x}) \rangle \qquad (3.25)$$

The optimal hyper-plane is related to $h(\mathbf{x})$ only through an inner product. This inner product can be replaced using a kernel function. A kernel function $K : \chi \times \chi \to \mathbb{R}$ is a positive definite and symmetric function which computes the inner product of the feature map in the enlarged predictor space $\chi'$ for all $\mathbf{x}, \tilde{\mathbf{x}} \in \chi$, as shown in equation (3.26) [25, pp. 167-170]. The kernel functions that will be investigated for this project are the polynomial kernel and the radial basis kernel, shown in table 2.


Table 2: Kernel functions and their mathematical expressions. $\gamma$ is a parameter of the radial basis kernel; $c$, $d$ are parameters of the polynomial kernel.

Kernel       | Expression
Radial Basis | $\exp(-\gamma ||\mathbf{x} - \mathbf{x}'||^2)$
Polynomial   | $(\mathbf{x}^T \mathbf{x}' + c)^d$

$$K(\mathbf{x}, \tilde{\mathbf{x}}) = \langle h(\mathbf{x}), h(\tilde{\mathbf{x}}) \rangle_{\chi'} \qquad (3.26)$$

Kernel functions provide a computational advantage by working in the original predictor space rather than the enlarged predictor space. For the radial basis kernel, the enlarged predictor space is infinite-dimensional and the explicit computation of the inner product would be infeasible [24, pp. 352-355].
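As a concrete illustration of equation (3.26) and Table 2, both kernels can be evaluated directly from the original predictors; no explicit feature map $h(\mathbf{x})$ is ever computed. This is a minimal sketch with illustrative parameter values, not the implementation used in this thesis.

```python
import numpy as np

def rbf_kernel(x, x2, gamma):
    """Radial basis kernel from Table 2: exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - x2) ** 2))

def polynomial_kernel(x, x2, c, d):
    """Polynomial kernel from Table 2: (x^T x' + c)^d."""
    return (x @ x2 + c) ** d
```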

Thus, by replacing the inner product in equation (3.25) with a suitable kernel $K$, the optimal hyper-plane in the enlarged predictor space can be expressed as shown in equation (3.27). An SVC which incorporates a kernel function to produce a non-linear decision boundary in the original predictor space is termed an SVM.

$$\hat{f}(\mathbf{x}) = \hat{\beta}_0 + \sum_{i \in S} \hat{\alpha}_i y_i K(\mathbf{x}_i, \mathbf{x}) \qquad (3.27)$$

The extension of the SVM to multi-class classification problems is done by transforming them into multiple binary classification problems. The One vs All (OVA) strategy converts the multi-class problem with $K$ classes into $K$ binary classification problems $P_k$, $\forall k \in \{1, 2, \ldots, K\}$ [24, p. 355]. For each problem $P_k$, an SVM $G_k$ is trained to predict whether the response $Y$ lies in class $k$ or not, by suitably encoding the multi-class response as a binary response. The training of $G_k$ results in estimating the corresponding optimal hyper-plane $\hat{f}_k(\mathbf{x})$. The prediction $\hat{y}_0$ of an unseen test observation $\mathbf{x}_0$ is done as shown in equation (3.28), as this amounts to having the highest confidence in $\mathbf{x}_0$ belonging to class $k$ [24, p. 355].

$$\hat{y}_0 = \arg\max_{i \in \{1, 2, \ldots, K\}} \hat{f}_i(\mathbf{x}_0) \qquad (3.28)$$

The One vs One (OVO) strategy converts the multi-class problem with $K$ classes into $K(K-1)/2$ binary classification problems $P_{k,j}$ [24, p. 355]. For each problem $P_{k,j}$, an SVM $G_{k,j}$ is trained to predict whether the response $Y$ lies in class $k$ or class $j$, using only the training samples from those two classes. The prediction $\hat{y}_0$ of an unseen test observation $\mathbf{x}_0$ is done by passing it through all the classifiers $G_{k,j}$ and assigning it to the most frequently predicted class [24, p. 355]. The OVO strategy is computationally expensive compared to OVA as it trains more models [24, p. 355].
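A minimal scikit-learn sketch of a kernelized SVM for multi-class classification is given below. The synthetic data merely stands in for the product features and the parameter values are illustrative, not the hyper-parameters tuned in this thesis.

```python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Hypothetical multi-class data standing in for product features (illustration only).
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Radial basis kernel from Table 2; C is the misclassification cost of eq. (3.22).
# decision_function_shape='ovr' exposes a One-vs-All style decision function,
# while scikit-learn's underlying multi-class SVC training is One-vs-One.
clf = SVC(kernel="rbf", gamma=0.1, C=1.0, decision_function_shape="ovr")
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```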


3.4 Decision Tree-Based Methods

This section is devoted to the theory behind decision trees and is inspired by the theory presented in [24, Ch. 8], [25, Ch. 9, 10, 16] and Prokhorenkova et al. [17].

3.4.1 Decision Trees

The classification and regression tree (CART) is built by recursively partitioning the predictor space $\chi$ into $m$ hyper-cuboidal regions $R_1, \ldots, R_m$ [25, pp. 305-307]. A decision tree can be represented by a hierarchical graph structure consisting of nodes and edges in a top-down manner. The topmost node is called the root of the tree. The partitioning of each node is done in a binary fashion, such that the regions are non-overlapping and span the entire predictor space, until a stopping criterion is met [25, pp. 305-307]. The terminal nodes are called the leaves and they represent the different partitions $R_1, \ldots, R_m$ of the input predictor space.

Each region $R_j$ is then assigned a local constant $c_j$, depending on whether it is a classification or regression task [25, pp. 305-307]. The functional form of a decision tree $f(\mathbf{x})$ is represented in equation (3.29), where the model parameters $\Theta$ are the set of all partitions and estimators $\{R_j, c_j\}_{j=1}^{m}$, and $\mathbb{1}\{\cdot\}$ is the indicator function which takes the value 1 if the condition is true and 0 otherwise. The number of partitions of a decision tree is not fixed and depends on the size of the data-set. Thus, decision trees are non-parametric models. The estimation of the parameters $\{R_j, c_j\}_{j=1}^{m}$ depends on the task at hand (regression or classification).

$$f(\mathbf{x}) = \sum_{j=1}^{m} c_j \mathbb{1}\{\mathbf{x} \in R_j\} \qquad (3.29)$$

The parameters $\{R_j, c_j\}_{j=1}^{m}$ of a regression tree are estimated using the squared loss, and the corresponding empirical risk $R_{emp}$ is shown in equations (3.30) and (3.31) [25, pp. 305-307].

$$R_{emp} = \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i))^2 \qquad (3.30)$$
$$= \sum_{i=1}^{n} \Big(y_i - \sum_{j=1}^{m} c_j \mathbb{1}\{\mathbf{x}_i \in R_j\}\Big)^2 \qquad (3.31)$$

It can be seen that the estimator $\hat{c}_j$ of $c_j$ that minimizes $R_{emp}$ is the average of the responses $y_i$ which fall in region $R_j$, as given by equation (3.32), where $N_j$ is the number of samples that fall in region $R_j$ [25, pp. 305-307].

$$\hat{c}_j = \frac{1}{N_j} \sum_{i=1}^{n} y_i \mathbb{1}\{\mathbf{x}_i \in R_j\} \qquad (3.32)$$


Unfortunately, the identification of the regions $R_1, \ldots, R_m$ which globally minimize $R_{emp}$ is computationally infeasible [25, pp. 305-307]. Thus, a decision tree is constructed with a greedy approach called recursive binary splitting in a top-down manner [24, pp. 304-307]. Starting at the root node with all the data, the predictor space $\chi$ is partitioned by identifying the splitting predictor $X_j$ and splitting threshold $s$ in a greedy manner. The splitting variable and threshold define two half-planes (two new nodes) $R_1(j, s)$ and $R_2(j, s)$ in the predictor space $\chi$, as shown in equations (3.33) and (3.34) [25, pp. 305-307].

$$R_1(j, s) = \{\mathbf{X} \mid X_j \leq s\} \qquad (3.33)$$
$$R_2(j, s) = \{\mathbf{X} \mid X_j > s\} \qquad (3.34)$$

The best splitting predictor and threshold are then identified by solving the optimization problem shown in equation (3.35).

$$\min_{(j,s)} \left[ \min_{c_1} \sum_{\mathbf{x}_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{\mathbf{x}_i \in R_2(j,s)} (y_i - c_2)^2 \right] \qquad (3.35)$$

The inner optimization problem of estimating $\hat{c}_1$ and $\hat{c}_2$ is similar to equation (3.32). The optimal estimates are shown in equations (3.36) and (3.37), where $N_1$ and $N_2$ are the number of training data predictors falling in regions $R_1(j, s)$ and $R_2(j, s)$ respectively [25, pp. 305-307]. After identifying a splitting variable and threshold, the predictor space $\chi$ is appropriately partitioned, resulting in two new nodes, and the procedure is repeated in each of the new nodes until a stopping criterion is met.

$$\hat{c}_1 = \frac{1}{N_1} \sum_{i=1}^{n} y_i \mathbb{1}\{\mathbf{x}_i \in R_1(j,s)\} \qquad (3.36)$$
$$\hat{c}_2 = \frac{1}{N_2} \sum_{i=1}^{n} y_i \mathbb{1}\{\mathbf{x}_i \in R_2(j,s)\} \qquad (3.37)$$

For regression trees under squared loss, the node impurity $Q_m$ at a node $m$ is defined as shown in equation (3.38) [25, pp. 305-307]. Thus, the optimization problem described in equation (3.35) can be redefined in terms of node impurity as shown in equation (3.39), where $N_{R_1(j,s)}$ and $N_{R_2(j,s)}$ are the number of samples falling in regions $R_1(j, s)$ and $R_2(j, s)$ respectively. The splitting variable and splitting threshold are thus identified such that the weighted node impurity of the two resulting nodes is minimized [25, pp. 305-307].

$$Q_m = \frac{1}{N_m} \sum_{\mathbf{x}_i \in R_m} (y_i - \hat{c}_m)^2 \qquad (3.38)$$

$$\min_{(j,s)} \; N_{R_1(j,s)} Q_{R_1(j,s)} + N_{R_2(j,s)} Q_{R_2(j,s)} \qquad (3.39)$$


The prediction $\hat{y}_0$ for a new observation $\mathbf{x}_0$ is done by identifying the region $R_m$ of the predictor space into which $\mathbf{x}_0$ falls and choosing the appropriate estimator $\hat{c}_m$. For a classification task with $K$ classes, the node impurity used in regression trees is not suitable and needs to be modified appropriately. Let $\hat{c}_{k,m}$ represent the proportion of samples of class $k$ in node $m$, as shown in equation (3.40), where $N_m$ represents the number of samples in node $m$ [25, pp. 307-310].

$$\hat{c}_{k,m} = \frac{1}{N_m} \sum_{\mathbf{x}_i \in R_m} \mathbb{1}\{y_i = k\} \qquad (3.40)$$

The node impurity measure $Q_m$ that is used to identify the splitting variables and thresholds in CART classification trees is called the Gini impurity, shown in equation (3.41). Intuitively, it measures the total variance across all the $K$ classes at node $m$ [25, pp. 307-310].

$$Q_m = \sum_{k=1}^{K} \hat{c}_{k,m}(1 - \hat{c}_{k,m}) \qquad (3.41)$$

Thus, the process of growing classification trees is similar to that of regression trees: the trees are grown by recursive binary splitting, and the splitting variable and splitting threshold are identified such that the weighted Gini impurity of the two resulting nodes is minimized.

The prediction $\hat{y}_0$ for a new observation $\mathbf{x}_0$ is done by identifying the region $R_m$ of the predictor space into which $\mathbf{x}_0$ falls and choosing the majority class in that region, as shown in equation (3.42).

$$\hat{y}_0 = \arg\max_{k} \hat{c}_{k,m} \qquad (3.42)$$
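A minimal sketch of a single greedy splitting step for a classification tree is given below, combining the Gini impurity of equation (3.41) with the weighted-impurity criterion of equation (3.39); a full CART would apply this recursively to each resulting node until a stopping criterion is met. The data layout (a NumPy feature matrix with integer class labels) is an assumption for illustration.

```python
import numpy as np

def gini(y):
    """Gini impurity of eq. (3.41) for the class labels falling in one node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1.0 - p))

def best_split(X, y):
    """Greedy search for the splitting predictor j and threshold s of eq. (3.39):
    minimise N_R1 * Q_R1 + N_R2 * Q_R2 over all candidate (j, s)."""
    best = (None, None, np.inf)
    n, d = X.shape
    for j in range(d):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            score = len(left) * gini(left) + len(right) * gini(right)
            if score < best[2]:
                best = (j, s, score)
    return best  # (splitting predictor index, threshold, weighted impurity)
```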

3.4.2 Random Forest

Decision trees are generally low-biased but suffer from high variance [25, pp. 587-590]. This implies that the splits and thresholds identified would vary considerably across different training data-sets. As a result, their performance on unseen test data can be poor. One way of tackling this problem is through an ensemble-based approach called bootstrap aggregation, or bagging [25, pp. 587-590].

Bagging involves creating $B$ bootstrap samples from the training data-set. Sampling is done with replacement and a decision tree is trained on each of the bootstrapped samples [25, pp. 587-590]. This results in a sequence $\{T_b\}_{b=1}^{B}$ of decision trees which are identically distributed [25, pp. 587-590]. Predictions for an unseen observation are made by aggregating the predictions of each tree in the sequence, depending on the nature of the task, i.e. regression (averaging) or classification (majority voting) [25, pp. 587-590].
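A minimal sketch of bagging with scikit-learn decision trees is given below; integer class labels are assumed for the majority vote. A Random Forest, as implemented for example in scikit-learn's RandomForestClassifier, additionally samples a random subset of predictors at each split to decorrelate the trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=100, random_state=0):
    """Train B decision trees, each on a bootstrap sample drawn with replacement."""
    rng = np.random.default_rng(random_state)
    trees, n = [], len(y)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)            # bootstrap sample of size n
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X_new):
    """Aggregate by majority vote (classification); use the mean for regression."""
    votes = np.stack([t.predict(X_new) for t in trees])   # shape (B, n_new)
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(),
                               axis=0, arr=votes)
```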
