Evolutionary Algorithms in Statistical Learning

Automating the Optimization Procedure

Niklas Sjöblom

Master thesis, 30 credits

M.Sc. in Industrial Engineering and Management, 300 credits
Industrial statistics

Spring term 2019


EVOLUTIONARY ALGORITHMS IN STATISTICAL LEARNING: AUTOMATING THE OPTIMIZATION PROCEDURE

Department of Mathematics and Mathematical Statistics Umeå University

901 87 Umeå, Sweden

Supervisors:

Håkan Lindkvist, Umeå University
Isolde Snellman, Scania CV AB

Examiner:

Jun Yu, Umeå University

Abstract

Scania has been working with statistics for a long time but has more recently invested in becoming a data-driven company, and now uses data science in almost all business functions. The algorithms developed by the data scientists need to be optimized to be fully utilized, and traditionally this is a manual and time-consuming process.

This thesis investigates whether and how well evolutionary algorithms can be used to automate the optimization process.

The evaluation was done by implementing and analyzing four variations of genetic algorithms with different levels of complexity and tuning parameters. The algorithm subject to optimization was XGBoost, a gradient boosted tree model, applied to data that had previously been modelled in a competition.

The results show that evolutionary algorithms are applicable in finding good models, but also emphasize the importance of proper data preparation.

Keywords: evolutionary algorithms, statistical learning, gradient boosting, automation, artificial intelligence

Sammanfattning (Swedish abstract)

Evolutionary algorithms in statistical learning: Automating the optimization process

Scania has long worked with statistics but has in recent years invested in becoming a more data-driven company, and now uses data science in almost every part of the organisation. The algorithms developed by data scientists must be optimized to be fully utilized, and this is traditionally a manual and time-consuming process. This thesis investigates whether and how well evolutionary algorithms can be used to automate the optimization process.

The evaluation was carried out by implementing and analyzing four variants of genetic algorithms with different degrees of complexity and tuning parameters. The algorithm targeted for optimization was XGBoost, a gradient boosted tree-based model, which was applied to data that had previously been modelled in a competition.

The results show that evolutionary algorithms are applicable for finding good models, but also demonstrate how fundamental proper data preparation is before modelling.

Keywords: evolutionary algorithms, statistical learning, gradient boosting, automation, artificial intelligence

Acknowledgements

I would like to thank my supervisor Håkan Lindkvist for guiding me through my thesis project. Håkan has provided me with a lot of important feedback on all levels: from guiding me through the research procedures, clarifying mathematical concepts and helping me analyze the results, to correcting minute misprints and making the thesis easier to read.

My thanks also go to Isolde Snellman, my supervisor at Scania, who has given me insight into the use of data science in industry, guided me through a large and complex organisation whenever I have needed help, and also helped me keep on track.

My gratitude also extends to the open source communities that developed R and LaTeX, providing amazing tools for statistical analysis and document preparation, without which this work would have been a lot more difficult and frustrating.

I am also immensely grateful to my wife, who brings me happiness and without whom I would not be the person I am today. All my love and gratitude go to you for always being there for me, through good times and hardships. Thanks also to my family, who have supported me along my many different roads in life.


Contents

1 Introduction
  1.1 Background
  1.2 Literature study
  1.3 Scania
    1.3.1 History
    1.3.2 Data science
  1.4 Aim
  1.5 Optimization
  1.6 Scope and limitation
  1.7 Outline

2 Theory
  2.1 Statistical learning
    2.1.1 Supervised learning
    2.1.2 Model selection
    2.1.3 Ensemble learning
    2.1.4 Performance indicators
  2.2 Imputation
    2.2.1 Multivariate imputation by chained equations
    2.2.2 Mode
  2.3 XGBoost
    2.3.1 Decision trees
    2.3.2 Boosting
    2.3.3 Gradient boosting
    2.3.4 XGBoost
  2.4 Principles of evolutionary algorithms
    2.4.1 Evolution
    2.4.2 Types of evolutionary algorithms
    2.4.3 Problem representation
    2.4.4 Other aspects
  2.5 Mechanisms of evolutionary algorithms
    2.5.1 Fitness function
    2.5.2 Selection
    2.5.3 Recombination
    2.5.4 Mutation

3 Method
  3.1 Interview study
  3.2 Data
    3.2.1 Transformation and imputation
    3.2.2 Partitioning
    3.2.3 Cost function
  3.3 XGBoost
  3.4 Evolutionary algorithms
    3.4.1 GA package
    3.4.2 Binary encoding
    3.4.3 Fitness function
  3.5 Environment and packages
  3.6 Computing power

4 Result
  4.1 Evolution
  4.2 Fitness
  4.3 Parameters
  4.4 Performance indicators
  4.5 Execution time
  4.6 Best models
  4.7 Competition contributions

5 Discussion
  5.1 Data preparation
  5.2 Tuning XGBoost
  5.3 Evolutionary algorithms
    5.3.1 Types and principles
    5.3.2 Mechanics
    5.3.3 Hyperparameters
    5.3.4 Binary encoding
    5.3.5 Parallelization
    5.3.6 Recommendation
  5.4 Best model

6 Conclusions
  6.1 Optimization tool
  6.2 Applicability and implementation
  6.3 Future work
  6.4 Academic research

References

Appendices
  A Complementary theory
    A.1 Random Bit Climbing
    A.2 Hamming distance
    A.3 Hamming-cliff
    A.4 Grey coding
  B Abbreviations and notation
    B.1 Evolutionary algorithm
    B.2 Variable notation
  C Result
  D Interview about data science at Scania
  E Data transformation

1 Introduction

In this section the problem of the thesis is presented together with its background. A literature study investigating related work is also presented, followed by the aim of the thesis and the result of a pilot study of what optimization means at Scania. The section ends with the scope and limitations, as well as an outline of the thesis.

1.1 Background

Big data is reigning the world right now and the amount of data is growing at an exponential rate, doubling almost every two years [17]. But data by itself is not very useful; tools are needed to extract information from it. This is where data science comes into the picture. A common tool used today to extract information from data is statistical learning, because we have reached the point where we have enough computational power to process the large amounts of data that exist. Statistical learning algorithms, however, need to be tuned to fit the specific data set to which they are applied, and one aspect of doing this is adjusting the so-called hyperparameters, which are parameters that are not adjusted by the model itself during the training process [19].

Model training and hyperparameter tuning are therefore two separate steps in building a statistical learning model and there are several methods for selecting appropriate values of the hyperparameters such as grid search and random search.

Since the underlying objective function is unknown, the best set of hyperparameters in the search space is also unknown [24]. A common procedure for tuning the hyperparameters is a manual process: standard initial settings of the hyperparameters are used, and grid search is then applied to sets of hyperparameters, or points, in proximity to that point in the search space. The selection of the best point is usually done with cross validation, where an evaluation metric is calculated and the point with the best value is chosen as the set of hyperparameters for the final model.

There are however methods for searching for optima in an automated and guided way in an unknown search space, and one that has successfully been applied to both test functions and statistical learning models is evolutionary algorithms [26]. Evolutionary algorithms are search methods in artificial intelligence which replicate natural evolution by assigning a higher probability of survival to good points in the search space, and by letting points recombine with others and/or mutate over several generations to find optimal parameters [4].

With the change of Scania's chief executive officer in 2016 to Henrik Henriksson, Scania made the decision to become a data driven company. Being data driven means both collecting large amounts of data and, more importantly, analysing that data to extract information. A common way of analysing data is to apply a statistical learning model, as such models usually perform well with large amounts of data.

The statistical learning models, however, need to be tuned to fit the specific data set. Automating and simplifying the optimization process is important because models are not just trained once during the initial build; deployed models can be retrained later when even more data has been collected. Having an automated optimization process will help improve the utilization of the data as well as free up time for the data scientists to work with new data sets instead of tuning existing models.

1.2 Literature study

Several studies regarding the use of evolutionary algorithms have been undertaken and there is a wide array of applications, including optimization of hyperparameters in statistical learning, optimizing test functions, geometric matching and even design of physical objects such as antennas. This section presents several studies that have been performed regarding usage in statistical learning and optimization. Theory for statistics, evolutionary algorithms and statistical learning can be found in Section 2, while other theory is covered in Appendix A.

Evolutionary algorithms have long been used for optimization and several articles cover this topic regarding test functions, which are functions often used in optimization problems because of complex properties such as many local optima [28]. In [27] several common test functions were used and three different genetic algorithms (ESGA, Genitor and CHC) were applied; since genetic algorithms are based on bit representations of parameters, the results were compared with the Random Bit Climber search method. The authors also applied different bit representation strategies such as Gray coding, one reason being the Hamming cliff problem. The result of the study was that genetic algorithms can be efficient in finding optima: the Random Bit Climber did not find the optimum in many cases, while CHC, the best performing genetic algorithm, found it in almost every instance. The study also showed that the result of the algorithms depends on how the bit representation is implemented, which can have a large impact both on the result and on the execution time.

In [28] three different genetic algorithms (ESGAT, Genitor and CHC) were compared with the local search method simple hill climbing for optimization on test functions as well as geometric matching. In addition to a common suite of test functions they also created new, more difficult, test functions. These better fit the purpose of evaluating how well these genetic algorithms work for optimization, because the local search methods perform very well on the standard test functions and give an unfair comparison. The result was that CHC performed best on the test functions, in the sense that it found the optima most often and also required fewer trials to do so. As for the geometric matching, CHC had difficulties and local search performed better than the genetic algorithms. An important takeaway from this study concerns the number of trials required to find the global optimum: most problems required several tens of thousands or hundreds of thousands of trials. Test functions are quick, requiring a few milliseconds to be calculated, whereas a statistical learning algorithm can require several minutes to train just one model with one set of hyperparameters.

Evolutionary algorithms have also been applied directly to different statistical learning algorithms. In [13] an elitist genetic algorithm was successfully implemented in R to train a random forest model. In addition to using a genetic algorithm the authors also used class decomposition, which is clustering of the response labels, but kept the results separate for comparability. These were compared with the standard settings for random forest as well as AdaBoost. In 18 out of 22 data sets the combination of class decomposition and genetic algorithm outperformed the standard settings of random forest, while in 17 cases it outperformed AdaBoost. In 10 out of the 22 cases the method using a genetic algorithm without class decomposition performed better than with class decomposition. The data sets contained between 3 and 33 features and 150 and 7,200 observations and were thus relatively small.

Meanwhile, in [16] evolutionary algorithms were unsuccessfully applied to a machine learning algorithm. Evolution is usually connected with Darwinism, but as the authors point out there are also contributions by Lamarck and Baldwin which represent different paradigms. Darwin's theory is based on survival of the fittest, while Lamarck and Baldwin believed in inheritance of characteristics acquired over time, learning and conscious adaptation. The different evolutionary theories were applied to a Naive Bayes algorithm in Java to set the weights of attributes. This method was compared to two other ways of optimizing a Naive Bayes algorithm using the Java machine learning software Weka. The result was that there was no improvement; the evolutionary approach was actually the worst of the three methods in terms of accuracy on the classification problem, and there was no difference between the paradigms of Darwin and those of Lamarck and Baldwin.

Besides hyperparameter optimization, there are other useful areas for evolutionary algorithms in statistical learning. One of them is feature selection, which is explored in [3]. A genetic algorithm was used to include or exclude features from a data set with 18 features. The study was successful and the performance of genetic algorithms was superior in comparison with other feature selection or extraction methods. Six different performance indicators were used and the model built using a genetic algorithm had the best value for all of them.

In [5] the authors came up with a novel combination of genetic algorithms and statistical learning. They designed an algorithm named genetic algorithm based random forest (GARF). Their implementation is based on first training a large random forest and then using a genetic algorithm to search for a forest using different combinations of trees from the larger forest. In eight out of the 15 data sets they used, GARF performed best among several types of random forest models, in some cases with quite a large margin in terms of accuracy.

1.3 Scania

Scania develops complete solutions of trucks, buses and engines of high quality using a modular system. They develop solutions for all industries within transportation and strive to take the lead in sustainable transport. As a business partner the aim is not only to sell the truck but to be a strategic partner for their customers and provide solutions in the aftermarket as well [2].

1.3.1 History

Scania originates from two companies that started independently. The first, Vagnfabriksaktiebolaget, abbreviated VABIS, was founded in 1891 in Södertälje and started off by making railroad cars. Meanwhile, Maskinfabriksaktiebolaget Scania was founded in Malmö in 1900 and started out building bicycles. The first truck was manufactured in 1902, and in 1911 the companies merged to become Scania-Vabis. Scania's modular system was born in 1939 when the company revealed its new diesel engine and standardized components. After years of global expansion with factories in Brazil, the Netherlands, Argentina and France, Scania made its one-millionth truck in 2000 and is now a subsidiary of the Volkswagen Group [1].

1.3.2 Data science

Scania has for a long time worked with statistics in more traditional ways. For over 15 years Scania has been working with statistics in analytics and data mining across different functions. It was however about ten years ago that the work with statistics entered the realm of data science with the introduction of statistical learning. This was initially done in the research and development department to develop better, more robust components for their trucks. In 2015 the initiative for a holistic approach was taken with the introduction of a data lake and distributed, shared computing. Today Scania has data science competence in most business functions, as well as groups with highly specialized skills to enable enhanced digitalization across the company. The challenges that Scania faces are some of the common challenges in the field of data science, but Scania is aiming for company-wide digitalization and investing in becoming a supplier of complete transport solutions [23].

1.4 Aim

This work is a study of the application of evolutionary algorithms for automating the optimization of statistical learning algorithms at Scania. The goal is to find out whether or not evolutionary algorithms are appropriate at the present time for Scania.

1.5 Optimization

Before evaluating evolutionary algorithms, optimization had to be defined. For this, three questions were formulated and answered by discussing them with data scientists and other relevant people at Scania. The questions were:

1. What does it mean that an algorithm is optimized?

2. When is the algorithm optimized?

3. How are algorithms optimized today?

As for the first question there was no single clear answer, but the answers were unanimous: there is not one measurement of what it means that an algorithm is optimized, as it depends on the project. In the end, however, a performance indicator will be used, and the exact one always depends on the business case. Sometimes it might be a standard one, such as accuracy, but other times it can be a custom cost function.

With the second question there was more clarity in the answers. There are two different criteria and usually just one is used per case. The first criterion is that there is usually a level that the performance indicator needs to reach for the model to be viable, for example that the accuracy needs to be at least 80 %. The second criterion is that the model performs better than what is currently used, whether that is a manual method or another algorithm.

To be able to fairly evaluate the performance of evolutionary algorithms for training a statistical learning algorithm, it should be compared to the current optimization procedure at Scania. The presently most common method is manual tuning of hyperparameters, using standard values as a baseline and then applying grid search over subjectively reasonable values with cross validation.

For this study, optimization is defined for the chosen data set and problem. Since this study is not performed with a business case in mind, the stopping criterion is based on available computational power. The results are then discussed with respect to how an appropriate implementation can be done in future projects. The evolutionary algorithms are compared to previous work done on the data set with conventional manual optimization and data cleaning.

1.6 Scope and limitation

The scope of this thesis is to evaluate if evolutionary algorithms are appropriate for Scania to use in optimizing the training of statistical learning models and see how they can be used to automate the training process on deployed models. What is outside the scope of this thesis is the implementation of evolutionary algorithms in a business case. The evolutionary algorithms are evaluated by using a relatively clean data set on which prior model training has been performed. This data set also provides a reference and benchmark for the performance of the evolutionary algorithms. The scope is also limited to using a single statistical learning algorithm and not working on the optimization of the data pipeline.

1.7 Outline

This section has given an introduction to the thesis; the outline of the remaining report is described here. The theory is presented in Section 2, while the data and methods are presented in Section 3. The results are presented in Section 4, the discussion can be found in Section 5 and the conclusions in Section 6. The references and appendices can be found at the end of the thesis. Throughout the thesis a lot of notation and some abbreviations are used; all of the commonly used ones are summarized in Appendix B.


2 Theory

In this section the theory of statistical learning, imputation and evolutionary algorithms is presented, as well as the statistical learning model used.

2.1 Statistical learning

Statistical learning is an approach for finding patterns in data. The goal is to estimate the underlying function in order to find relations between the input, the independent variables often called predictors or features, and the output, the dependent variable often called the response, for use in inference and prediction. Statistical learning methods are often divided into two categories: supervised learning, where the response is known, and unsupervised learning, where the response is unknown and the goal is often to explore the data and find clusters.

2.1.1 Supervised learning

Since the existing data, if of high quality, closely represents the truth, the goal of supervised learning is to find an approximation of the true relation between predictors and response. The method has different names depending on the data's response variable: if the response is a quantitative variable the problem is called a regression problem, and if the response is qualitative it is called a classification problem. The goal in both is to find an approximation $\hat{f}(x) \approx f(x)$, where $f(x)$ represents the true relation between the predictors and the response. Supervised learning can be applied both with and without domain knowledge, but excluding prior knowledge increases the search space and will increase both the bias and the variance of the prediction [15, p. 28].

The learning procedure is inspired by human behaviour. A mathematical representation is that an observation $i$, $\{x_i, y_i\}$, with one response, $y_i$, and $p$ predictors, $x_i$, is represented as a $(p+1)$-dimensional vector in Euclidean space. The model cannot be purely deterministic, either because reality itself is not deterministic or because there are measurement errors, so an error term $\varepsilon_i$ is also needed. The real model that one attempts to approximate is thus $y_i = f(x_i) + \varepsilon_i$ [15, pp. 29–30]. The true value is thus $y_i$ and the predicted value is $\hat{f}(x_i)$.

A problem that occurs with large data sets is the curse of dimensionality, which means that as the dimension of the problem increases, i.e. the number of features, the number of observations needed to cover most events become impossibly large to collect. An example in ten dimensions says that we need to examine 80 % of the range of all features to observe just 10 % of all possible data [15, pp. 22–23].


2.1.2 Model selection

An important step in statistical learning is the selection of the final model, where the goal is to obtain as good predictions as possible. How good a model is depends mainly on three general things, excluding potential hyperparameters of the model: the bias, the variance and the complexity of the model. In general a more complex model may end up overfitting the data, leading to low bias and high variance, while a simple model may underfit the data, leading to high bias and low variance. Low bias and high variance mean that the model is very flexible and will adjust to small changes in the data, while high bias and low variance mean that the model is rigid and will not change much from small changes in the data. Which combination results in the best model depends on the data [15, pp. 219–220].

To find a good model there are a couple of steps that need to be followed. First, the model needs to be trained and evaluated using independent data sets, and evaluated in multiple steps. For this the data set should be partitioned into three separate sets: the training set, the validation set and the testing set. The training set is the data set on which the model will learn the patterns. The validation set is used for model selection, the procedure in which a trained model predicts the outcome, which is then compared to the true values using a loss function. Several different models are trained using the training set and all of them perform prediction on the validation set. The testing set is kept separate until the last step, model assessment, and is only used once. In this step the model is applied to the testing set and the predictions are compared to the true values. The result of the model assessment is an independent evaluation of the model, as the test data was not seen until this stage. The reason why the model is assessed only on the unseen testing set is the bias towards both the training and validation sets that were used to optimize the model [15, p. 222].

Let $Y = f(X) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, be the true relation between the response $Y$ and predictors $X$. In a regression setting, the expected prediction error of a model $\hat{f}(X)$ for observation $x_0$ can then be calculated using the squared-error loss seen in (1). The first term cannot be reduced, as it is a natural phenomenon. However, the two latter ones can be optimized for the applied model on the used data. As mentioned earlier, bias usually decreases with model complexity while variance increases. This displays the concept of the bias-variance tradeoff, which needs to be considered for specific implementations [15, p. 223].

$$\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat{f}(x_0))^2 \mid X = x_0] \\
&= \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2 \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0)) \\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}
\end{aligned} \tag{1}$$
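To make the decomposition in (1) concrete, here is a minimal Monte Carlo sketch in R, assuming a toy data-generating function $f(x) = \sin(x)$ with noise level $\sigma_\varepsilon = 0.3$ and a polynomial model whose degree acts as the complexity knob; all values are illustrative.

```r
# Estimate bias^2 and variance of f-hat(x0) over repeated training sets.
set.seed(1)
f <- function(x) sin(x)        # assumed true relation
sigma_eps <- 0.3               # assumed noise standard deviation
x0 <- 1.5                      # evaluation point
degree <- 3                    # model complexity knob
preds <- replicate(2000, {
  x <- runif(50, 0, 2 * pi)    # a fresh training set each repetition
  y <- f(x) + rnorm(50, sd = sigma_eps)
  fit <- lm(y ~ poly(x, degree))
  predict(fit, newdata = data.frame(x = x0))
})
bias2    <- (mean(preds) - f(x0))^2
variance <- var(preds)
err_x0   <- sigma_eps^2 + bias2 + variance   # Err(x0) as in (1)
```

Increasing `degree` typically shifts the error from the bias term to the variance term, which is the tradeoff described above.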

2.1.3 Ensemble learning

In statistical learning there are many different kinds of models. Some are simple, both in terms of computation and in terms of assumptions about relationships and interpretability, while other models can be very complex, difficult to compute and hard to interpret. A concept in statistical learning that has shown to be quite powerful is ensemble learning, where the prediction model is built by combining several models. Implementations of ensembles vary slightly, with some being parallel and others sequential, but the concept is that instead of training one model, several are trained and used together for the prediction. Different models will sometimes also be assigned different weights depending on how well they classify certain observations. The two main parts in training an ensemble model are to build a population of the base learners and then to combine them for the final model [15, p. 605].

2.1.4 Performance indicators

In statistical learning it is important to measure how good a model is, and this can be done in several ways. A custom cost function for the specific problem can be used so that the measure represents a relevant real-world value, but statistical measures, here called performance indicators, can also be used. Several performance indicators for classification are presented below.

Before presenting the performance indicators, a tool often helpful for evaluating them is explained: the confusion matrix. The confusion matrix is, in the binary case, a two-by-two matrix which presents four different types of classifications. It can easily be expanded to a general case with more classes, and the binary confusion matrix is still valid for multi-class classification. Confusion matrices match the correct and incorrect classifications of each type and present them in different cells, see Figure 2.1. The four cells then represent the number of true positives, false positives, false negatives and true negatives. There are thus two different types of correct classifications and two different types of incorrect classifications. These are used to calculate several different measurements discussed below [14].

                                Reference
                        Positive              Negative
Predicted   Positive    True positive (TP)    False positive (FP)
            Negative    False negative (FN)   True negative (TN)

Figure 2.1: A 2x2 confusion matrix is a matrix where the numbers of correctly and incorrectly classified predictions are put into relation to the true values. Each cell contains the number of predictions that were made of that type.

First there is accuracy, which is the ratio of how many observations are correctly classified. This is calculated using (2).

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

Next there is the false alarm rate, or false positive rate, seen in (3).

$$\text{false alarm rate} = \frac{FP}{FP + TN} \tag{3}$$

Next there is sensitivity, or true positive rate (also sometimes known as recall or hit rate), which is the ratio of how many positives were correctly classified. This is calculated using (4).

$$\text{sensitivity} = \frac{TP}{TP + FN} \tag{4}$$

Then there is precision, which measures the ratio of true positives among all observations that were classified as positive, including the false positives. Equation (5) calculates precision.

$$\text{precision} = \frac{TP}{TP + FP} \tag{5}$$

Then there is specificity, which measures the ratio of observations correctly classified as negative among all negative values. This equation is shown in (6).

$$\text{specificity} = \frac{TN}{TN + FP} \tag{6}$$

A more aggregate performance indicator is Cohen's Kappa, $\kappa$, which is calculated using equation (7), where $p_0 = \text{accuracy}$ and $p_c$ is an aggregate statistic seen in equation (8). Cohen's Kappa is one of the most used statistics for evaluating agreement between different raters, which a classification model is. The range is $\kappa \in [-1, 1]$, where a value close to 1 means that the classifiers are in agreement, while a negative value means that there are discrepancies in how they classify. A value close to 0 means that the agreement might as well be random. A weakness of this metric is that it is very sensitive to the skewness of distributions, and work is being done to find a more robust metric, but until then Cohen's Kappa is still a good metric for evaluating a model [29].

$$\kappa = \frac{p_0 - p_c}{1 - p_c} \tag{7}$$

$$p_c = \frac{(TP + FN)(TP + FP) + (FP + TN)(FN + TN)}{(TP + FN + FP + TN)^2} \tag{8}$$
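A minimal R sketch computing the indicators in (2)-(8) from the four cells of a binary confusion matrix; the counts are hypothetical example values.

```r
TP <- 80; FP <- 15; FN <- 20; TN <- 85          # hypothetical counts
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # (2)
false_alarm <- FP / (FP + TN)                   # (3)
sensitivity <- TP / (TP + FN)                   # (4)
precision   <- TP / (TP + FP)                   # (5)
specificity <- TN / (TN + FP)                   # (6)
p0 <- accuracy
pc <- ((TP + FN) * (TP + FP) + (FP + TN) * (FN + TN)) /
      (TP + FN + FP + TN)^2                     # (8)
kappa <- (p0 - pc) / (1 - pc)                   # (7)
```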

2.2 Imputation

It is not uncommon for data sets to be incomplete, that is, to contain missing values. The reasons can vary, from a sensor failing in a machine, to a patient being too sick to give a blood sample in a medical trial, to human error and forgetfulness. Oftentimes it is desirable to fill in these values, and the process of doing so is called imputation. Before imputing it is important to know the reason why the data is missing, so that the value can be replaced with one that most likely represents the truth. There are generally three types of missing values: missing not at random (MNAR), missing at random (MAR) and missing completely at random (MCAR). When the values are randomly missing, MCAR requires a stronger assumption about the reason than MAR does. Even though MCAR requires a stronger assumption, it is the one often used in implementations [15, p. 332]. While the details of imputation are outside the scope of this work, a short introduction to two different methods is given below.

2.2.1 Multivariate imputation by chained equations

A complex and competent imputation algorithm for multivariate data sets is Multivariate Imputation by Chained Equations (MICE). In multivariate data the variables that do not have missing values can be used to impute the missing values by using joint distributions. MICE is also based on the idea of multiple imputation: instead of imputing once, imputation is performed several times. This allows different variations of the imputation to be analyzed before selecting the final values to impute with. There are several ways both of sampling for imputation and of analyzing the result; some examples follow. A simple method is to randomly sample existing observations. In binary cases logistic regression can be used, where a classification model is built to see which value is the most likely. For continuous values a linear regression model can be built for imputation, which allows generating new values [7].
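A hedged sketch of how this looks with the mice package in R, assuming a data frame `df` with missing values; the settings shown (five imputations, predictive mean matching) are illustrative defaults, not the configuration used in this thesis.

```r
library(mice)
# Run five chained imputations; "pmm" is predictive mean matching.
imp <- mice(df, m = 5, method = "pmm", seed = 123)
# Extract one completed data set (here the first) for further modelling.
df_complete <- complete(imp, action = 1)
```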


2.2.2 Mode

A simple imputation method is to use the mode of the distribution, or of the data. Imputing with the mode means replacing the missing values with the most common value. If the data is continuous it may be better to use the mean, but for categorical data there may not be a mean, or the mean is nonsensical, and the mode should be used as it represents a value known to exist. This method of imputation should only be used if the number of missing values is relatively small, though, so as not to skew the distribution.
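A minimal sketch of mode imputation for a categorical vector; the helper function and the example data are illustrative.

```r
# Replace NAs with the most frequent observed value.
impute_mode <- function(x) {
  mode_value <- names(which.max(table(x)))  # most common level
  x[is.na(x)] <- mode_value
  x
}
impute_mode(c("a", "b", NA, "b", "b", NA))  # NAs become "b"
```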

2.3 XGBoost

In this section the theory of the statistical learning model XGBoost, which is a tree based model, is presented. It starts with the theory of decision trees, continues with boosting and gradient boosting, and finishes with a presentation of the theory specific to XGBoost.

2.3.1 Decision trees

Decision tree learning is a tree based model built around partitioning the feature space into rectangular regions and assigning different values to the rectangles. The idea is very simple and powerful, and at the same time highly interpretable, as visualisations of decision trees often resemble natural thinking.

Tree based models can be used for both regression and classification, but the focus here is on classification. A tree consists of several nodes and edges, where the edges correspond to different paths and carry logical conditions, while nodes are gateways (split points) or results. The node at the top is called the root. Each node, except the end nodes, is split in a binary way, which means that it has two children, or nodes, connected by one edge each. A node that does not have any children, an end node, is called a leaf node and is the result, for example the classification in a classifying decision tree. The depth of a tree is an important tuning parameter: shallow trees become simple classifiers, while deep trees can find complex structures in the data but run the risk of overfitting [15, pp. 305–307]. A simple example of a decision tree can be seen in Figure 2.2. The way to use the tree is to check the logical statements until a leaf has been reached. Say that the conditions $x_1 < a$ and $x_2 \geq b$ are fulfilled; then the result would be the leaf node $m_2$ and the observation would be classified as class $k_2$.

Figure 2.2: Decision tree example where there are two variables, two classes, three leaf nodes and the depth of the tree is two. For example, if the two conditions necessary to reach the leaf node $m_1$ are fulfilled, the observation is classified as class $k_1$.

An important problem in growing decision trees is the partitioning, that is, how to split the data set for each part of the tree and what the logical conditions should be. The splits are usually binary [15, p. 305], but that does not impose a restriction on the model, as multiple splits can be simulated by making the tree deeper [15, p. 311]. An important consideration when making the splits is to try to keep nodes pure, in the sense that a node is pure if there are no or only a few misclassifications. A common way to do this is by using the Gini index, $G$, defined in equation (9), where $\hat{p}_{mk}$ is the probability that the observation belongs to class $k \in \{1, 2, \ldots, K\}$ in node $m$ [15, pp. 308–310].

$$G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}) \tag{9}$$
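A small sketch of the Gini index in (9) for a single node, given the class labels of the observations falling in that node; the labels are a made-up example.

```r
# Gini index of a node: sum over classes of p_mk * (1 - p_mk).
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions p_mk
  sum(p * (1 - p))
}
gini(c("k1", "k1", "k1", "k2"))  # 0.75*0.25 + 0.25*0.75 = 0.375
gini(c("k1", "k1", "k1", "k1"))  # 0: a pure node
```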

An advantage of tree based models over some other models is that they can natively handle missing values. One way of doing this is simply to have a category, or logical statement, for whether the value is missing or not. This way the model can even find possible patterns in when values are missing. Another is to have several ranked splits for the same node; depending on whether the observation is missing, either the primary or a secondary split is used [15, p. 311].

If different misclassifications have different weights, these can also be incorporated into the training algorithm in the evaluation of the purity of a node. By creating a loss matrix where different misclassifications carry different weights, the model can be guided to prioritize lowering misclassifications of certain types. Correct classifications usually carry no cost, or loss [15, pp. 310–311].

2.3.2 Boosting

Boosting is an ensemble concept in learning algorithms where, instead of having a single strong model, several weak learners are combined in sequence. A weak learner is a model that produces a result only slightly better than a random guess.

What boosting does is thus let the weak learners learn sequentially, where subsequent learners correct the previous learners' mistakes. This is a concept commonly used on tree based models and is a fundamental part of gradient boosted trees. Let $T_m(x)$, $m = 1, 2, \ldots, M$ each be a weak learner and $a_1, a_2, \ldots, a_M$ be weights for each model calculated during the learning procedure using (10), where $w_1, w_2, \ldots, w_N$ are weights applied to all observations $(x_i, y_i)$, $i = 1, 2, \ldots, N$ and calculated with (11), where $w_i'$ are the updated weights.

$$a_m = \log\left(\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}\right), \qquad \mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i I(y_i \neq T_m(x_i))}{\sum_{i=1}^{N} w_i} \tag{10}$$

$$w_i' = w_i \, e^{a_m I(y_i \neq T_m(x_i))}, \quad i = 1, 2, \ldots, N \tag{11}$$

Then the final model $T(x)$, using weights $a_1, a_2, \ldots, a_M$ calculated during the learning procedure, can be written as in equation (12) if the case is binary classification with labels $y \in \{-1, 1\}$ [15, pp. 337–339]. An example can be seen in Figure 2.3.

$$T(x) = \mathrm{sign}\left(\sum_{m=1}^{M} a_m T_m(x)\right) \tag{12}$$

The weights $a_1, a_2, \ldots, a_M$ for the models govern another set of weights, $w_1, w_2, \ldots, w_N$, applied to all observations $(x_i, y_i)$, $i = 1, 2, \ldots, N$. The purpose of the weights on the observations is to adjust them according to how difficult they are for the model to classify. Observations that get misclassified more often get higher weights and receive more focus during training; consequently, correctly classified observations get lower weights. Initially $w_i = \frac{1}{N}$, $\forall i \in [1, N]$, but the weights are updated during the boosting procedure dependent on the model weights $a_m$ [15, pp. 338–339].
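A compact sketch of one boosting round, following (10)-(11); `miss` is a hypothetical logical vector that is TRUE where $y_i \neq T_m(x_i)$, and the uniform starting weights match the initialization $w_i = 1/N$.

```r
# One AdaBoost-style round: weighted error, model weight, reweighting.
adaboost_step <- function(w, miss) {
  err_m <- sum(w * miss) / sum(w)      # err_m in (10)
  a_m   <- log((1 - err_m) / err_m)    # model weight a_m in (10)
  w_new <- w * exp(a_m * miss)         # updated observation weights, (11)
  list(a_m = a_m, w = w_new)
}
N <- 5
adaboost_step(rep(1 / N, N), miss = c(TRUE, FALSE, FALSE, TRUE, FALSE))
```

Misclassified observations come out of the round with larger weights and thus more focus in the next round, exactly the behaviour described above.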

Basis functions are often used for transforming the input variables to modify the function space so that the models used can be linear, for simplicity. Let $h_k(x): \Re^p \to \Re$ be the $k$th transformation of the input variables $X$, with $k = 1, 2, \ldots, K$. Then the linear basis expansion is the one seen in equation (13), where the $\beta_k$'s are expansion coefficients. This changes the dimension of the feature space from $p$ to $K$ and allows a linear model to be trained with the regular modelling procedure [15, pp. 139–141].

$$f(X) = \sum_{k=1}^{K} \beta_k h_k(X) \tag{13}$$

The reason why boosting works well is that it lets the learning model become additive through basis functions, where each weak model is a basis function [15, pp. 341–342]. Additivity means that a function $f(x)$ can be assumed to be a linear combination of the space bases and can thus be written in the form in equation (14), where $f_j(X_j)$ is a basis function [15, p. 140].

$$f(X) = \sum_{j=1}^{p} f_j(X_j) = \sum_{j=1}^{p} \sum_{k=1}^{K} \beta_{jk} h_{jk}(X_j) \tag{14}$$

Figure 2.3: Example of binary classification using boosted trees. An observation $x = (x_1 < a,\ x_2 > b,\ c < x_3 < d)$ is classified by three trees: tree 1 with weight $\alpha_1$ gives classification $k_2$, tree 2 with weight $\alpha_2$ gives classification $k_2$, while tree 3 with weight $\alpha_3$ gives classification $k_1$. If the classes are represented as $k_1 = -1$ and $k_2 = 1$, the model can be defined as $T(x) = k_1$ if $\sum_{i=1}^{3} \alpha_i T_i(x) < 0$, otherwise $T(x) = k_2$.

To achieve the best model, $T(x)$, it needs to be fit to the data, which is done by minimizing a loss function. This loss function is based on the basis functions, which are the individual classifiers in this case. The model as seen in equation (12) depends on multiple parameters, which is clarified by denoting the models as $T_m(x; \Theta_m)$, where $\Theta_m$ is the set of parameters for $T_m(x)$. To fit the model, the loss function $L$ in equation (15) needs to be minimized.

$$\min_{\{a_m, \Theta_m\}_1^M} \sum_{i=1}^{N} L\!\left(y_i, \sum_{m=1}^{M} a_m T_m(x_i; \Theta_m)\right) \tag{15}$$

This is usually too computationally intensive to be feasible and can often be simplified by solving the subproblem of fitting a single basis function. The problem then becomes the problem seen in (16) [15, pp. 341–342].

$$\min_{\Theta} \sum_{i=1}^{N} L(y_i, T(x_i; \Theta)) \tag{16}$$

Let $\Upsilon_j$, $j = 1, 2, \ldots, J$ be the $J$ regions that the tree model partitions the space into and $\zeta_j$ the value assigned to that region, i.e. the classification. The prediction will then be as seen in (17).

$$x \in \Upsilon_j \Rightarrow f(x) = \zeta_j \tag{17}$$

The parameters $\Theta_m = \{\Upsilon_j, \zeta_j\}$ for a tree model can be found by minimizing the empirical risk seen in (18), an optimization problem which can be divided into two parts. One problem is finding the value $\zeta_j$ to assign to a given $\Upsilon_j$, which is usually simple; for classification $\hat{\zeta}_j$ is often the mode of $\Upsilon_j$. The more difficult problem is finding the regions $\Upsilon_j$, which is often solved by greedy top-down methods of recursive partitioning.

$$\hat{\Theta} = \arg\min_{\Theta} \sum_{j=1}^{J} \sum_{x_i \in \Upsilon_j} L(y_i, \zeta_j) \tag{18}$$

The final model, the boosted tree model, is a sum of all trees and can be seen in equation (19).

$$f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m) \tag{19}$$

The model can be generated in a forward stagewise procedure, where (20) must be solved in each step with the region set constraint $\Theta_m = \{\Upsilon_{jm}, \zeta_{jm}\}_1^{J_m}$ for the following model $f_m(x)$ given the current $f_{m-1}(x)$ [15, pp. 353–357].

$$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\big(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)\big) \tag{20}$$

2.3.3 Gradient boosting

There is no fast way of solving (20) and an approximation is necessary. The solution is derived from numerical optimization with steepest descent. Again, the goal is to minimize the loss (21) of using $f(x)$ to predict $y$.

$$L(f) = \sum_{i=1}^{N} L(y_i, f(x_i)) \tag{21}$$

The true problem requires the constraint that $f(x)$ is a sum of trees, but without it the problem is similar to a numerical optimization problem as in equation (22), where $\mathbf{f} \in \Re^N$ are the parameters of the approximating function $f(x_i)$ at each data point $x_i$.

$$\hat{\mathbf{f}} = \arg\min_{\mathbf{f}} L(\mathbf{f}) \tag{22}$$

Solving (22) numerically means solving it as a sum of component vectors, as in equation (23), where $\mathbf{f}_0 = \mathbf{h}_0$ is an initial guess and the following $\mathbf{f}_m$, $m = 1, 2, \ldots, M$ are based on the previous $\mathbf{f}_{m-1}$.

$$\mathbf{f}_l = \sum_{m=0}^{l} \mathbf{h}_m, \quad l = 0, 1, 2, \ldots, M \tag{23}$$

Choosing the increment vector $\mathbf{h}_m \in \Re^N$ is done with steepest descent, with $\mathbf{h}_m = -\rho_m \mathbf{g}_m$, where $\rho_m$ is the step length calculated with (25) and $\mathbf{g}_m \in \Re^N$ is the gradient vector of the loss function $L(\mathbf{f})$ according to (24).

$$g_{im} = \left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x_i) = f_{m-1}(x_i)} \tag{24}$$

$$\rho_m = \arg\min_{\rho} L(\mathbf{f}_{m-1} - \rho \mathbf{g}_m) \tag{25}$$

The solution is then iteratively updated according to (26).

$$\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m \tag{26}$$
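A tiny numeric illustration of the steepest descent iteration (23)-(26) in R, assuming a toy squared-error loss $L(\mathbf{f}) = \sum_i (y_i - f_i)^2$ over the $N$ fitted values, and a fixed step length in place of the line search (25).

```r
y <- c(1, 2, 3)        # toy targets
f <- rep(0, length(y)) # initial guess f_0
for (m in 1:100) {
  g   <- -2 * (y - f)  # gradient vector g_m as in (24)
  rho <- 0.1           # fixed step length in place of the line search (25)
  f   <- f - rho * g   # update f_m = f_{m-1} - rho * g_m, (26)
}
f                      # converges towards y, the training-data minimiser
```

The sketch also shows why the unconstrained solution only fits the training points; gradient boosting instead fits a tree to the negative gradient in order to generalize, as described next.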

This, however, optimizes the model for the training data and not for future predictions on new data points. In an attempt to generalize the model to work well on new data, one solution is to generate a tree at iteration $m$ whose predictions are in close proximity to the negative gradient. For multiclass classification with $K$ classes the loss function is the multinomial deviance, and $K$ trees are grown in each iteration, where each tree $T_{km}$ is fit to its corresponding negative gradient vector $\mathbf{g}_{km}$, with gradients for each model based on all observations, as seen in (27) [15, pp. 358–360].

$$-g_{ikm} = \left[\frac{\partial L(y_i, f_1(x_i), \ldots, f_K(x_i))}{\partial f_k(x_i)}\right]_{\mathbf{f}(x_i) = \mathbf{f}_{m-1}(x_i)} = I(y_i = Z_k) - p_k(x_i) \tag{27}$$

The multinomial deviance can be seen in equation (28), where $Z = \{Z_1, Z_2, \ldots, Z_K\}$ are the $K$ unordered classes and $p_k(x)$ is a logistic model as seen in equation (29) [15, pp. 348–349].

$$L(y, p(x)) = -\sum_{k=1}^{K} I(y = Z_k)\log(p_k(x)) = -\sum_{k=1}^{K} I(y = Z_k) f_k(x) + \log\left(\sum_{l=1}^{K} e^{f_l(x)}\right) \tag{28}$$

$$p_k(x) = \frac{e^{f_k(x)}}{\sum_{l=1}^{K} e^{f_l(x)}} \tag{29}$$

2.3.4 XGBoost

XGBoost is a specific open-source implementation of gradient boosted trees, with a focus on efficiency and scalability. The name is an abbreviation of Extreme Gradient Boosting. All information about XGBoost is taken from the creator's article about the algorithm [8].

Multiple details of the XGBoost implementation are covered above, as XGBoost is a gradient boosted tree model. There are however aspects specific to XGBoost, and some will be presented here: the regularized learning objective, shrinkage, feature subsampling, approximation of split points, sparsity awareness and how it deals with missing data, as well as parallelization, cache awareness and out-of-core computation.

First, the learning objective of XGBoost uses not only a loss function to minimize but also a regularisation term which penalizes the complexity of the model. The learning objective is seen in equation (30), where $l$ is a differentiable and convex loss function and $\Omega(f)$ is a complexity penalty.

$$L(f) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k) \tag{30}$$

In the complexity penalty $\Omega(f)$ seen in (31), $f_k$ is a single tree, $\iota$ the number of leaf nodes and $w$ the leaf weights, while $\gamma$ and $\lambda$ are penalty weights.

$$\Omega(f) = \gamma \iota + \frac{1}{2}\lambda \lVert w \rVert^2 \tag{31}$$

XGBoost also uses shrinkage, as in regular gradient boosting, which is a technique to scale the weights by a factor $\eta$ after each boosting step. The purpose of the shrinkage is to leave room for improvement for future trees. Another technique is one that is used in random forests: subsampling of the features considered for each tree. This technique helps against overfitting and speeds up computation.
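A hedged sketch of how these concepts surface as hyperparameters when fitting XGBoost in R: `eta` is the shrinkage factor, `gamma` and `lambda` the penalty weights in (31), and `colsample_bytree` the feature subsampling rate. The data and all values are illustrative, not the settings used in this thesis.

```r
library(xgboost)
set.seed(1)
X <- matrix(rnorm(200 * 4), ncol = 4)     # toy feature matrix
y <- as.numeric(X[, 1] + rnorm(200) > 0)  # toy binary response
model <- xgboost(
  data = X, label = y,
  objective = "binary:logistic",
  nrounds = 50,            # number of boosting iterations
  eta = 0.1,               # shrinkage applied after each boosting step
  max_depth = 4,           # depth of each tree
  gamma = 1,               # penalty per leaf node in Omega(f), (31)
  lambda = 1,              # L2 penalty on leaf weights in Omega(f), (31)
  colsample_bytree = 0.8,  # feature subsampling per tree
  verbose = 0
)
```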

For finding the split points, XGBoost uses an approximate algorithm, because the exact greedy approach, i.e. considering all splits for all features, does not scale well when the data grows and no longer fits in memory. What XGBoost does is propose split points according to aggregate statistics derived from percentiles of the discretized feature distribution. Two options are available: a global method, which does this once for each tree, or a local method, which proposes new split points after each split.

Next, XGBoost is sparsity aware and works with missing data by default, which is not common in many implementations. A sparse data set can mean that there is a lot of missing data or that there are a lot of zero entries. Both of these can be a result of feature engineering, where some combinations may not produce a value. The way XGBoost handles missing data is that each node is assigned a default direction, which is taken if the data required at that node is missing. The direction is learned from the data where there are no missing values.

As for computational performance improvements, XGBoost has a column block design which allows for parallelization of sorting and split finding. Another technique is to make the algorithm cache aware and store gradient statistics in the CPU cache instead of memory, to reduce read and write times. The last is out-of-core computation, a technique that uses disk space to manage data that does not fit in memory. One method for this is compressing the data that does not fit in memory so that read and write times are shortened. Another is sharding the data, that is, splitting it so that it can be stored on separate drives, which allows for parallel read and write operations. These implementation details also allow larger data sets to be used.


2.4 Principles of evolutionary algorithms

Evolutionary algorithms are weak metaheuristic search methods in artificial intelligence that are based on natural evolution. Being metaheuristic means that the method is generalized rather than adapted to a single problem, and that it does not guarantee that the global optimum is found; instead, an approximate solution that is usually good enough is found by subsampling the search space. Being weak is a term in artificial intelligence meaning that the method does not use domain knowledge during the search by default [26].

The basis in natural evolution manifests in the algorithm through several principles and mechanisms: it utilizes a measure of how strong individuals are and prioritizes their survival through selection of the fittest, it allows individuals to recombine with others to mix their genes, and it lets them undergo mutation to incur random changes in the solution [4].

Evolutionary algorithms are often used in search applications, optimization problems, tuning of statistical learning algorithms, and design problems. The concepts of evolutionary algorithms were developed separately in Germany and the USA from the 1960s, where two different paradigms were used for how evolution should be represented digitally [26].

As genetic algorithms are used in the implementation of this work, they will be the evolutionary algorithm in focus in the succeeding sections, though most concepts apply generally.

2.4.1 Evolution

Before the formal explanation of how evolutionary algorithms work, a short description follows, together with the notation to be used.

Evolutionary algorithms follow the natural stages of evolution: there is an initial population, commonly randomly generated. All individuals in this population then have their fitness evaluated based on how well they solve the problem. Next, the best individuals are selected and undergo a mating procedure, called recombination or crossover, which is followed by random mutation. These steps are then repeated: all individuals in the population are evaluated, the best individuals are selected, they mate, and the offspring may also be mutated. The evolution stops when a termination criterion is fulfilled, usually either when the desired fitness has been reached or when a maximum number of iterations has been performed [4].

Let $\Pi$ be the individual space and $\pi_t$ the size of the population for generation $t$, where $\pi_\mu$ is the size of the parent population and $\pi_\lambda$ is the size of the child population. These can be the same or vary over generations. The population for generation $t$ is denoted as $P(t) = \{p_1(t), p_2(t), \ldots, p_{\pi_t}(t)\}$, where $p_i(t) \in \Pi$ is a single individual, $i = 1, 2, \ldots, \pi_t$. Several operators are also required: $\Phi: p_i(t) \to \Re$ is the fitness function, which evaluates individual $i$ in generation $t$ to a real-valued metric. The selection operator is $\Psi_{\Theta_\Psi}: P(t) \to P_\mu(t)$ with parameters $\Theta_\Psi$, where $P_\mu(t)$ are the selected parents for generation $t$. The recombination operator is $X_{\Theta_X}: P_\mu(t) \to P_\lambda^x(t)$ with parameters $\Theta_X$, where $P_\lambda^x(t)$ is the child population after the recombination procedure. Then $M_{\Theta_M}: P_\lambda^x(t) \to P_\lambda^m(t)$ is the mutation operator with parameters $\Theta_M$, where $P_\lambda^m(t)$ is the population after the mutation procedure, which may include mutated children. Lastly, let $\xi: P(t) \to \{\mathrm{true}, \mathrm{false}\}$ be the termination criterion. This can be either that a maximum number of generations has been reached or that the global optimum has been found. The formal algorithm for a general evolutionary algorithm can be seen in Algorithm 1 [4].

Algorithm 1: General evolutionary algorithm

    t ← 0
    P(t) ← {p_1(t), p_2(t), ..., p_{π_t}(t)} ∈ Π^{π_µ}
    Evaluate P(t): {Φ(p_1(t)), Φ(p_2(t)), ..., Φ(p_{π_t}(t))}
    while ξ(P(t)) ≠ true do
        t ← t + 1
        Select:    P_µ(t) ← Ψ_{Θ_Ψ}(P(t−1))
        Recombine: P_λ^x(t) ← X_{Θ_X}(P_µ(t))
        Mutate:    P_λ^m(t) ← M_{Θ_M}(P_λ^x(t))
        Evaluate P_λ^m(t): Φ(P_λ^m(t))
    end
    return {best fitness, best solution}
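A hedged sketch of running Algorithm 1 with the GA package (the R package used in this thesis; see Section 3.4.1), here maximising a toy "one-max" fitness function over a binary chromosome. All settings are illustrative.

```r
library(GA)
fitness <- function(bits) sum(bits)  # toy fitness Phi: count the 1-bits
res <- ga(
  type = "binary",     # bit-string chromosomes, cf. Section 2.4.3
  fitness = fitness,
  nBits = 16,          # chromosome length
  popSize = 50,        # population size
  pcrossover = 0.8,    # recombination probability
  pmutation = 0.1,     # mutation probability
  maxiter = 100        # termination criterion: maximum number of generations
)
res@solution           # best chromosome found
res@fitnessValue       # its fitness
```

The ga function also accepts a parallel argument for evaluating the population concurrently, which connects to the parallelization discussion in Section 2.4.4.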

2.4.2 Types of evolutionary algorithms

As mentioned earlier there are multiple different types of evolutionary algorithms. The differences are mainly conceptual, and actual implementations vary in how the mechanics are implemented and prioritized. There is also overlap, as concepts borrow ideas from one another. Here follows a short presentation of the variety that exists within evolutionary algorithms.

Two major kinds, the most popular and well known ones, are genetic algorithms and evolutionary strategies. Genetic algorithms use binary encoding to represent the solution, see Section 2.4.3, and emphasize recombination over mutation. Due to the binary representation, they also focus on changes to genotypes rather than phenotypes. Evolutionary strategies instead represent the solution as real values and therefore do not require any encoding or decoding of the solution. While the recombination mechanisms work similarly, though not necessarily identically, evolutionary strategies have different opportunities for mutation. As the solutions are represented with real values, mutation is often performed by sampling from a distribution, which offers more flexibility than binary encoding does. Evolutionary strategies therefore also emphasize mutation over recombination, in contrast to genetic algorithms.

Furthermore, there are evolutionary algorithms that are derived as combinations of both. Genitor is an example which uses a population management feature from evolutionary strategies but retains the principles of genetic algorithms. CHC is another example which uses binary representation and is one of the most aggressive evolutionary algorithms: it uses recombination as the search tool, while something called cataclysmic mutation is used as a restart mechanism for the population. There is thus a plethora of different evolutionary algorithms, which overlap in many ways but may be appropriate for different problems due to different mechanisms or function spaces [26].

2.4.3 Problem representation

In using evolutionary algorithms there are two aspects which are completely problem dependent: the encoding of the solution and the construction of the fitness function. Both of these can vary greatly from task to task. The encoding is discussed below, while the fitness function is discussed in Section 2.5.1.

As for the encoding of the problem, for genetic algorithms this is usually done in the form of a bit string. This bit string contains an appropriate number of bits to represent the solution with adequate precision [4]. If the problem has 16 different levels in a discrete parameter, then four bits are a perfect fit. For continuous numeric parameters the number of bits can be increased until the required precision has been reached. There are however problems with binary representation, and one of these is the inflexibility of bit strings. If a parameter has, for example, 327 distinct levels, then nine bits would be required to represent all of them, but nine bits give a total of 512 possible values. The remaining values need to be dealt with, either by mapping them to a default value or by allowing different probabilities for different values of the parameter [25]. A key concept in the encoding part of problem representation is that of viewing the solution as a chromosome, in analogy to nature [20], which is discussed below.


Chromosomes

As in molecular genetics, an individual's properties are represented by its DNA, but while humans have 23 chromosome pairs, in a genetic algorithm an individual usually has only a single chromosome. This chromosome consists of several genes, where a subsequence makes up a genotype [20]. The genotype is still only a bit string and needs to be decoded to make semantic sense, i.e. represent a value. In nature, genotypes are decoded by genetic mechanisms into amino acids and then proteins before their final form as phenotypes. In genetic algorithms, a decoding function decodes the genotype into the phenotype directly, which is the value of interest of that sequence of genes [4].

An example of chromosomes can be seen in Figure 2.4, where the chromosomes of two individuals, A and B, are shown. Individual A has the bit string 10001110 as its chromosome and individual B has 01110101. Parameter boundaries are not visualized but can be anywhere. A chromosome can represent anything from one parameter, with the whole chromosome encoding a single phenotype, up to eight parameters, if every bit represents its own phenotype with the simple encoding 'true' or 'false'.

Individual A:  1 0 0 0 1 1 1 0
Individual B:  0 1 1 1 0 1 0 1

Figure 2.4: Two individuals whose chromosomes are represented by bit strings. These chromosomes can represent from one up to eight different values, where the substrings can be split at any point.
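A small sketch of decoding a genotype into phenotypes, assuming a hypothetical split of the 8-bit chromosome into two 4-bit genes: one mapped onto a continuous range and one onto an integer range. binary2decimal is a helper from the GA package; the parameter ranges are made up for illustration.

```r
library(GA)
# Map an n-bit substring onto a continuous range [lo, hi].
decode <- function(bits, lo, hi) {
  lo + binary2decimal(bits) * (hi - lo) / (2^length(bits) - 1)
}
chromosome <- c(1, 0, 0, 0, 1, 1, 1, 0)        # individual A in Figure 2.4
eta   <- decode(chromosome[1:4], 0.01, 0.30)   # first gene -> continuous phenotype
depth <- 1 + binary2decimal(chromosome[5:8])   # second gene -> integer in 1-16
```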

2.4.4 Other aspects

Other than the critical parts of evolutionary algorithms, there are more advanced aspects to take into consideration, both for computational performance and for improved search performance.

One of these aspects is the possibility of parallel computing. As the entire population of each generation is independent, that is, no individual depends on another individual, all individuals can be evaluated in parallel. With minimal overhead this quickly speeds up computation: if a 64 core machine is used and the population size is set to 64, each generation can be evaluated in almost the same time it takes to evaluate a single individual in sequence [26].

A second aspect, which allows even more parallelization from a computational perspective but more importantly with respect to search performance, is the use of parallel islands in the parallel island model. The parallel island model is a coarse grained parallel model where the population is split into subpopulations that are separated from each other, metaphorically on islands. These subpopulations are kept separate for a certain number of generations. After that number of generations has passed, a number of individuals can migrate between the islands to mix their genetic material [26], which introduces a third concept: diversity.

Diversity is important in evolutionary algorithms because, while convergence means that an optimum has been found, there is no guarantee that it is global. It is possible that as the fitness in the population improves, the variation between individuals, the diversity, decreases. The similarity between the individuals makes the search stagnate. Keeping the population diverse helps avoid early convergence to local optima and aids continued search [26].

2.5 Mechanisms of evolutionary algorithms

As mentioned in Section 2.4 there are four main mechanics in evolutionary algorithms. They are the fitness function and the three operators for selection, recombination and mutation. They are discussed below.

2.5.1 Fitness function

What the fitness function actually calculates will, as previously said, depend on the problem. In a simple optimization of a mathematical function it will just evaluate the test function, and such problems are usually quite fast to execute. In a statistical learning application, on the other hand, the fitness function will not only evaluate the fitness of the solution, for example by calculating a cost function; it will also have to train the model and perform predictions to feed the cost function or performance measure [25].
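A hedged sketch of such a statistical learning fitness function, combining the hypothetical decode helper sketched in Section 2.4.3 with cross-validated XGBoost training; `dtrain` is assumed to be an xgb.DMatrix built from the training data, and all ranges are illustrative.

```r
library(xgboost)
fitness <- function(bits) {
  params <- list(
    objective = "binary:logistic",
    eta       = decode(bits[1:4], 0.01, 0.30),   # decoded shrinkage
    max_depth = 1 + binary2decimal(bits[5:8])    # decoded tree depth
  )
  cv <- xgb.cv(params = params, data = dtrain, nrounds = 50,
               nfold = 5, metrics = "error", verbose = 0)
  # ga maximises fitness, so return the negated cross-validated error.
  -min(cv$evaluation_log$test_error_mean)
}
```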

2.5.2 Selection

The selection operator has the function of selecting which individuals will be chosen as parents for the next generation. There are several different kinds, two of which are rank-based and tournament selection. In rank-based selection the individuals are sorted according to their fitness values and a certain number of the top performers are selected; this carries a bias towards the individuals with higher performance. In tournament selection, individuals are selected at random and the better one survives. The selection is done with resampling and also carries a bias towards the individuals with higher fitness [26]. There are however several more methods, which introduce different selection probabilities depending on the mechanics used and the individuals' fitness [21].
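A minimal sketch of tournament selection as described above: pairs are drawn at random with resampling and the fitter individual of each pair is kept.

```r
tournament_select <- function(fitness_values, n_parents) {
  replicate(n_parents, {
    pair <- sample(length(fitness_values), 2, replace = TRUE)  # random pair
    pair[which.max(fitness_values[pair])]                      # keep the fitter one
  })
}
tournament_select(c(0.2, 0.9, 0.5, 0.7), n_parents = 4)  # indices of chosen parents
```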
