
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

A comparison between Neural networks, Lasso regularized Logistic regression, and Gradient boosted trees in modeling binary sales

RICKARD STRANDBERG

JOHAN LÅÅS

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES


Degree Projects in Mathematical Statistics (30 ECTS credits)

Master's Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2019

Supervisor at Dustin: Omid Sedighi
Supervisor at KTH: Tatjana Pavlenko
Examiner at KTH: Tatjana Pavlenko


TRITA-SCI-GRU 2019:084 MAT-E 2019:40

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

The primary purpose of this thesis is to predict whether or not a customer will make a purchase from a specific item category. The historical data is provided by the Nordic online-based IT-retailer Dustin.

The secondary purpose is to evaluate how well a fully connected feed forward neural network performs as compared to Lasso regularized logistic regression and gradient boosted trees (XGBoost) on this task.

This thesis finds XGBoost to be superior to the two other methods in terms of prediction accuracy, as well as speed.


Sammanfattning (Swedish abstract, translated)

The primary purpose of this thesis is to predict whether or not a customer will buy a specific product. The historical data is provided by the Nordic online IT retailer Dustin. The secondary purpose of the thesis is to evaluate how well a deep neural network performs compared to Lasso regularized logistic regression and gradient boosted trees (XGBoost). The thesis found that XGBoost outperformed the two other methods in both accuracy and speed.


Contents

1 Introduction
2 Learning from data
  2.1 Machine learning
  2.2 Data
    2.2.1 Missing data and outliers
    2.2.2 Skewed data
    2.2.3 Scaling and encoding
    2.2.4 Dimensionality
  2.3 Performance metrics and the ROC curve
3 Artificial Neural Networks (ANNs)
  3.1 The Multilayer Perceptron
  3.2 Activation Functions
    3.2.1 Saturating activation functions
    3.2.2 Non-saturating activation functions
  3.3 Cost Functions
  3.4 Regularization
    3.4.1 Parameter norm constraints
    3.4.2 Dropout
    3.4.3 Early stopping
  3.5 Training
    3.5.1 Backpropagation and Gradient descent
    3.5.2 Convergence
4 Logistic Regression
  4.1 Generalized Linear Models
    4.1.1 From classical to generalized
    4.1.2 Goodness of fit
  4.2 Lasso and generalizations
5 Tree Methods
  5.1 CART
    5.1.1 Regression
    5.1.2 Classification
  5.2 Boosting
    5.2.1 Gradient Boosting
    5.2.2 XGBoost
6 Pre-processing, Model tuning, and Model validation
  6.1 Data and Pre-processing
  6.2 ANN
    6.2.1 Architecture
    6.2.2 Regularization
    6.2.3 Results
  6.3 Logistic Regression
    6.3.1 Tuning
    6.3.2 Results
  6.4 XGBoost
    6.4.1 Tuning
    6.4.2 Results
7 Summary of the Results
8 Discussion
References


1 Introduction

Recent years have shown impressive results from various neural network designs in solving tasks ranging from object recognition [1, 2], to speech recognition [3, 4], and language processing [5]. These results have been made possible by an increase in computational resources, data availability, and new methods developed to solve issues that arose when training early neural networks. The computational resources have not only increased by way of more sophisticated CPUs, but also through the realization that numerical computations can be performed on a Graphics processing unit (GPU) in a parallel fashion. These advances, combined with the possibility of parallel training, in which a cluster of computational units works on training the same model, have vastly reduced training times for neural networks. At the same time, the cost of data storage has decreased and the ability to gather data has improved; this combination of computational power and large amounts of training data has been crucial to the success of neural networks.

In light of the progress made in these fields, it is natural to ask how well neural networks perform when applied to the task of capturing and predicting consumer behavior. Companies concerned with the sale of wares and services often have access to large numbers of observations of historical transactions, as well as general customer information such as the number of employees, location, company growth, etc. This data may also be augmented with data describing the online behavior of customers navigating the company website.

The question of customer purchasing behavior is by no means a new one. Several models have a long history of service in this field; two of them, logistic regression and gradient boosted classification trees, will be examined and used in this thesis.

When deciding on which model to use, there is not only the accuracy to consider. Neural networks suffer from a lack of interpretability: the model does not provide details as to which variables make large contributions to the final result, nor in which way they do so. Logistic regression, on the other hand, produces coefficients that can be interpreted as weights explaining whether a feature makes the outcome more or less likely.

This model may in addition be modified to induce sparser models through the use of Lasso regularization, further contributing to an interpretable model. Classification and regression trees (CART) similarly have accessible interpretations, though the interpretability is lost when the method is used as a predictor in a sequence of additive weak learners, as in the boosting case. A further aspect to consider is the effort required to deploy a certain model. In addition to obtaining a low classification error and weighing that against how interpretable the model is, there is also the time and resources required to fit the model, both in terms of computational resources and in terms of time spent deciding on pre-processing and which hyperparameters to include.

The aim of this thesis is to construct a satisfactory classifier that is able to correctly classify which customers are likely to purchase a specific product given some specific customer information. The problem is thus one of supervised learning and binary classification; more details on this are given in Section 2, which also contains a brief discussion of issues such as missing data (Section 2.2.1) and unbalanced training sets (Section 2.2.2). The effect of the latter will be explored in some detail, as the models are fitted to two versions of the original training set in order to compare the effect of an uneven ratio between the binary responses with the result of perfectly balanced labels, achieved by augmenting the training data with synthetic data.

The subsequent sections review the aspects of theory pertinent to our stated objective for the three methods: fully connected feed forward neural networks (Section 3), logistic regression (Section 4), and gradient boosted classification trees (Section 5).

Section 6.1 is dedicated to data exploration and pre-processing such as encoding of categorical features, scaling, and examination and elimination of outliers. Sections 6.2, 6.3, and 6.4 are dedicated to fitting and tuning the neural network, logistic regression, and gradient boosting models respectively.

Section 7 presents the combined results of the three methods, and the final section, Section 8, delivers our conclusions.


2 Learning from data

2.1 Machine learning

Machine learning is a field concerned with learning to perform tasks without a prior set of specific instructions. These tasks can range from translating a sentence from one language to another, to detecting whether an image shows a raccoon or a car, to forecasting the number of phones sold one month from now. A formal definition is given by Tom M. Mitchell [6]:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P , if its performance at tasks in T , as measured by P , improves with experience E.”

A concrete example to illustrate the definition is the task T of deciding which integer is represented in an image showing a handwritten integer. The experience E is a set of labeled pictures of the different handwritten integers, the label being the integer the image shows, and the performance measure P the percentage of correctly classified images.

The example above is a case of supervised learning: the experience E comes in the form of examples (images) as well as the correct answer corresponding to each image. These correct answers will be referred to as labels, responses, or targets interchangeably throughout the paper. The experience E comes as a number of observations/instances.

The case where the experience lacks a label is commonly referred to as unsupervised learning. In that case the algorithm/method tries to discern patterns in the data without the help of a target. An example of this is clustering methods, which group data points by nearness, as measured by some metric.

An instance of experience will have some characteristics, which will be referred to as features. A feature may be of different types: numerical, such as a number representing income, temperature, height, or similar; or categorical, for properties such as country or gender. The experience E, as a set of instances, can be represented by a matrix of data where each row represents an observation. This matrix will be called the design matrix, or simply the data, symbolized by $\boldsymbol{X}$. If $\boldsymbol{X}$ contains n observations, each with p features, $\boldsymbol{X}$ will be of size $n \times p$. The i'th observation's j'th feature is denoted by $x_{ij}$. Any feature of an observation $\boldsymbol{x}_i$ can be used as the target, depending on what task T is set before the algorithm. To make this clear, the target will instead be referred to as $y_i$. Summarizing the notation, a training example composed of p features, with an accompanying target, forms the experience $(\boldsymbol{x}_i, y_i)$, $\boldsymbol{x}_i = (x_{i1}, \ldots, x_{ip})$.

The algorithm (referred to as method or model interchangeably) used to learn the task will have some parameters $\boldsymbol{\theta}$ that need to be learned by training on the experience E under the performance measure P. There will also be hyperparameters, which are parameters that are not learned directly from the experience but rather set beforehand.

Machine learning adapts a model to fit the patterns present in the experience, henceforth referred to simply as the data. If the structure is known, one can make predictions about previously unseen data by extrapolation/inference. If the model is too simplistic it may not capture the structures in the data; the result is underfitting. On the other hand, if the model is too complex it will incorporate structures that are there by coincidence, fitting random noise; the model is then overfitting the data.

This is summed up in the concept of generalization: the model should work well on hitherto unseen data. In terms of the previously declared terms, the experience, referred to as the data, is split into a training set and a test set. The training set is used to find suitable parameters $\boldsymbol{\theta}$, and the model is then evaluated on the test set to provide an approximation of the model's ability to generalize to previously unseen data. The hyperparameters are to be set before training the model; this can be accomplished by reserving a part of the training set as a validation set. The model is assigned some set of hyperparameters and allowed to learn $\boldsymbol{\theta}$ on the training set. The result is evaluated on the validation set in order to determine which hyperparameters work well. Once a satisfactory model is produced, the true generalization error is approximated by the error obtained from evaluating the model on the test set.

An alternative to reserving a portion of the training set for validation is to use cross-validation. The full data is divided into a training set and a test set. The training set is further divided into K equally large sets $\{S_i\}_{i=1}^{K}$. For a given choice of hyperparameters, the model is trained on the set $\{S_1, \ldots, S_{i-1}, S_{i+1}, \ldots, S_K\}$, and the validation error is evaluated on the subset $S_i$. This is done for each $i = 1, \ldots, K$, resulting in K values of the validation error for the given hyperparameters. This method allows the model to be trained on all available data (barring the test data) at the cost of additional computational expense.

The chosen hyperparameters are the values that provide the lowest validation error. In practice, these values are found by testing different combinations until one is satisfied with the result. One way of doing this is grid search: this approach makes use of a coarse grid of values believed to cover the optimal (or near optimal) hyperparameter combination, and iteratively refines the grid around the values that result in the smallest validation error. This method becomes infeasible if the number of hyperparameters is large, due to the curse of dimensionality. One solution is to instead use randomly drawn hyperparameters. This can in some cases [7] be faster than the more classic grid approach.

2.2 Data

The basis for any machine learning method is the data used to train it. Good results, in terms of finding patterns that generalize well to previously unseen instances, hinge on the quality of the data. What constitutes good data will differ depending on the task to be solved and on what method is used. Obviously, a first criterion is that there is indeed a pattern present in the data that is relevant to the task at hand. Less obvious issues include, for example, categorical features: tree methods can handle these without the need for any encoding, while neural networks and logistic regression cannot. The data might also contain missing values or outliers.

2.2.1 Missing data and outliers

Before dealing with missing values, one should consider the reason for their absence. If there is some systematic reason, this may distort the data [8]. Consider the case of a specific feature $Z_1$ that has a large number of missing values; this feature might in turn depend on yet another feature $Z_2$ in such a way that a certain outcome of $Z_2$ prevents the gathering of feature $Z_1$. Simply deleting these observations would in this case introduce bias. If the values are instead missing at random, there are two ways of dealing with them.

The first and easiest way is to omit the instances that have a missing value for some feature. This might be a sensible thing to do if there are only a few observations with missing values, and these observations do not differ too much from the others. If there is a particular feature where many values are missing, and it can be said with some certainty that the cause is not systematic, then one approach is to delete the feature or transform it into a categorical one.

The second approach is to fill in the missing data through imputation. This may be done in several ways: one is to use the mean or the median of the specific feature; another is to fit a model to the instances without missing values and sample from the resulting model.

Outliers are observations that differ markedly from the remaining observations. What counts as an outlier depends on the problem and on assumptions. An example of an outlier in the assumption setting would be a measurement of a person's height in the tens of meters. A model-specific outlier might be an observation that is correct but not representative of the phenomenon being modeled; such an observation might be a one-time sale resulting in a very large profit when the goal is to model the sales of everyday items.

Outliers may be the result of a measurement error, a human error, or a very unlikely event, and some of them may be discarded out of hand, based on reasoning.

One possible method to determine which observations might be outliers is the Mahalanobis distance [9], defined as

$$D_M(\boldsymbol{x}_i) = \sqrt{(\boldsymbol{x}_i - \hat{\boldsymbol{\mu}})^T \hat{\boldsymbol{\Sigma}}^{-1} (\boldsymbol{x}_i - \hat{\boldsymbol{\mu}})},$$

where $\boldsymbol{x}_i = (x_{i1}, \ldots, x_{ip})^T$ is an observation containing p features, $\hat{\boldsymbol{\mu}} = (\hat{\mu}_1, \ldots, \hat{\mu}_p)^T$ holds the sample means of the feature columns, and $\hat{\boldsymbol{\Sigma}}$ is the sample covariance matrix.

Observations for which the distance is greater than some cutoff limit are investigated and possibly disregarded. One drawback is that outliers themselves influence the sample mean $\hat{\boldsymbol{\mu}}$, and therefore also the covariance matrix. An extreme outlier may thus distort the result to the extent that smaller outliers are no longer considered outliers; this phenomenon is known as the masking effect.
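As an illustration, a minimal Python/NumPy sketch of this computation; the cutoff value of 3 is an arbitrary illustrative choice, not a value from the thesis:

```python
import numpy as np

def mahalanobis_outliers(X, cutoff=3.0):
    """Flag rows of X whose Mahalanobis distance exceeds `cutoff`.

    X: (n, p) array of n observations with p numeric features.
    Returns a boolean mask marking suspected outliers.
    """
    mu = X.mean(axis=0)                      # sample means of the feature columns
    sigma = np.cov(X, rowvar=False)          # sample covariance matrix
    sigma_inv = np.linalg.inv(sigma)
    diff = X - mu
    # D_M(x_i)^2 = (x_i - mu)^T Sigma^{-1} (x_i - mu), computed row-wise
    d2 = np.einsum('ij,jk,ik->i', diff, sigma_inv, diff)
    return np.sqrt(d2) > cutoff

# Example: 500 points from a 3-dimensional Gaussian plus one extreme point
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[0] = [10.0, 10.0, 10.0]
print(np.where(mahalanobis_outliers(X))[0])  # row 0 should be flagged
```

Note that, as discussed above, the extreme point itself inflates the estimated mean and covariance, so the computed distances are subject to the masking effect.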


2.2.2 Skewed data

When performing classification, it is preferable if the target classes are fairly balanced [10]. A commonly cited example is that of fraud detection: most transactions are not fraudulent, so a data set might have 99% legitimate transactions and 1% fraudulent cases. A model that outputs the prediction "legitimate" for all inputs would achieve an accuracy of 99%, while simultaneously misclassifying all the fraudulent cases.

Imbalanced training data can be handled by discarding observations from the majority class, duplicating samples from the minority class, creating synthetic data from the minority class, or penalizing misclassification of the minority class harder than misclassification of the majority class. The first approach, undersampling, randomly draws without replacement as many instances from the majority class as there are samples in the minority class. Oversampling draws instances from the minority class with replacement until the classes contain equal numbers of observations.

Creating synthetic data can be achieved by modeling the minority class and sampling from the resulting distribution; this will however result in a poor sample if the class originally contains too few samples. One approach is the SMOTE [11, 12] algorithm, which works by calculating the K nearest neighbors (KNN) of a minority-class point and randomly selecting one of them. The new synthetic data point is created by moving in the direction of the selected point by a random distance in the range (0, d), where d is the distance to the selected point. Since this method makes use of the KNN algorithm, it suffers as the dimension of the feature space increases.
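A minimal sketch of the interpolation step for continuous features, assuming NumPy; this is a simplified illustration of the idea rather than the reference SMOTE implementation:

```python
import numpy as np

def smote_sample(X_minority, k=5, rng=None):
    """Create one synthetic point by SMOTE-style interpolation.

    Picks a random minority observation, finds its k nearest minority
    neighbours, selects one at random, and moves a random fraction of
    the way towards it (a distance in (0, d), d being the distance to
    the selected neighbour).
    """
    rng = rng or np.random.default_rng()
    i = rng.integers(len(X_minority))
    x = X_minority[i]
    d = np.linalg.norm(X_minority - x, axis=1)   # distances to all minority points
    d[i] = np.inf                                # exclude the point itself
    neighbours = np.argsort(d)[:k]               # indices of the k nearest neighbours
    j = rng.choice(neighbours)                   # pick one neighbour at random
    lam = rng.uniform(0.0, 1.0)                  # random step-length fraction
    return x + lam * (X_minority[j] - x)         # synthetic point on the segment
```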

If any of the features are categorical, SMOTE handles them by penalizing the continuous features when there are mismatches in the categories between the sample in question and its K neighbors. This is done by first calculating the median of the standard deviations of the continuous features, and then, for each category that is mismatched between the sample and the neighbor in question, adding this median to the Euclidean distance between them as a penalty term.

The algorithm then behaves as in the case of continuous features, with the difference that the line segment might now be extended due to mismatches in categories. As for which categories the synthetic data point is assigned in its categorical features, the algorithm assigns whichever category is in the majority among the K nearest neighbors. Features with high cardinality might therefore need a high value of K in order to determine which category is in the majority.

2.2.3 Scaling and encoding

The data available for machine learning algorithms to train on is often a mix of categorical and numerical data. It is also not uncommon for the features to be on widely ranging scales: one feature might count the number of children, while another holds yearly income. Another example could be data that uses different units of measurement.

The two most common methods of scaling data are standardization and min-max scaling. In standardization, the values of a feature column have the column mean subtracted and are then divided by the standard deviation.

Let $x_{ij}$ denote the i'th observation of the j'th feature in a column containing n observations. The scaled values of the feature column are then

$$\tilde{x}_{ij} = \frac{x_{ij} - \hat{\mu}}{\hat{\sigma}}, \qquad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad \hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{ij} - \hat{\mu}\right)^2},$$

where $\hat{\mu}$ and $\hat{\sigma}$ denote sample-based approximations of the mean value and the standard deviation.

The second scaling method, min-max scaling, is performed, for a given feature column $\boldsymbol{x}_j$, by

$$\tilde{x}_{ij} = \frac{x_{ij} - x_{\min}}{x_{\max} - x_{\min}},$$

which results in all the values being in the range [0, 1].

This type of transformation has the advantage of preserving the relations between the observations of a feature column [13]. Since the minimum and maximum values must be calculated on the training set and carried over to scale the test set data, there is a chance that observations from the test set will fall outside of the [0, 1] range, making the method vulnerable to outliers.
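A small sketch illustrating the point that the scaling statistics are computed on the training set only and then carried over to the test set (NumPy assumed):

```python
import numpy as np

def fit_standardize(X_train):
    """Learn per-feature mean and standard deviation on the training set."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def apply_standardize(X, mu, sigma):
    """x_tilde = (x - mu_hat) / sigma_hat, using training-set statistics."""
    return (X - mu) / sigma

def fit_minmax(X_train):
    """Learn per-feature minimum and maximum on the training set."""
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_minmax(X, x_min, x_max):
    """Training-set statistics are re-used, so test values may leave [0, 1]."""
    return (X - x_min) / (x_max - x_min)
```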

For most methods, features containing categorical entries have to be assigned a numerical representation. Categorical features can be of different types. Some have an inherent ordering, such as "high income" versus "low income", while others have no hierarchy whatsoever. Even if there is an inherent ordering, that does not mean that the difference between levels in a categorical feature translates to exact numerical values. The naive approach of assigning consecutive integer codes to these features is therefore not appropriate, since it enforces a hierarchy that may not be present, or that is of the wrong magnitude.

One approach is to encode categorical features using one-hot encoding. Given a feature containing k unique categorical values, create k − 1 new features that take the value 0 if the observation is not of the categorical value in question, and 1 if it is. The reason for omitting the k'th feature column is that its value may be inferred from the remaining feature columns. This method is easy to implement but also has drawbacks: if the categorical feature contains many distinct classes, it leads to a very large design matrix. For example, if the original design matrix contained a total of 5 features, with a single categorical feature having 10 distinct values, the one-hot encoded design matrix would have 13 features.

A more elaborate method is that of target statistics [14]. The idea behind this encoding scheme is to map the categories of a categorical feature to numerical values on the basis of the relationship between the target and the categories. Consider a binary target $y \in \{0, 1\}$ and a feature column $\boldsymbol{x}_j = [x_{1j}, \ldots, x_{nj}]$ of length n, containing $m \leq n$ distinct categories. Each component belonging to a certain category $k \in \{1, \ldots, m\}$ is mapped to a scalar in the range [0, 1] by

$$s_k = P\left(Y = 1 \mid \boldsymbol{x}_j = k\right).$$

For a well balanced training set one would be able to use the approximation

$$s_k = \frac{n_{kY}}{n_k},$$

where $n_{kY}$ is the number of observations of category k where the target is Y = 1, and $n_k$ is the number of observations of category k.

In the case of high cardinality this estimate will not be sufficiently good. Instead, the estimate is formed by mixing the posterior probability and the prior probability:

$$s_k = \lambda(n_k)\,\frac{n_{kY}}{n_k} + \big(1 - \lambda(n_k)\big)\,\frac{n_Y}{n_{tot}},$$

where $n_Y$ is the total number of instances where Y = 1, $n_{tot}$ is the total number of instances in the training set, and $\lambda(n_k)$ is a monotonically increasing function of $n_k$ with values in the range (0, 1).

This method is based on empirical Bayes methods, where the prior probability is computed from the available data, in this case the training set. The encoding scheme can also be extended to the case with N + 1 different targets, $Y \in \{0, 1, \ldots, N\}$.
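A sketch of the smoothed estimate above, using pandas; the sigmoid-shaped choice of λ(n_k) and its constants a and b are illustrative assumptions, not values prescribed by the thesis:

```python
import numpy as np
import pandas as pd

def target_encode(categories, y, a=1.0, b=20.0):
    """Smoothed target statistics for a binary target y in {0, 1}.

    lambda(n_k) = 1 / (1 + exp(-(n_k - b) / a)) is one monotonically
    increasing function mapping counts into (0, 1); a and b are
    illustrative smoothing constants.
    """
    df = pd.DataFrame({"cat": categories, "y": y})
    prior = df["y"].mean()                             # n_Y / n_tot
    stats = df.groupby("cat")["y"].agg(["mean", "count"])
    lam = 1.0 / (1.0 + np.exp(-(stats["count"] - b) / a))
    s = lam * stats["mean"] + (1.0 - lam) * prior      # s_k per category
    return df["cat"].map(s).to_numpy()

# Example: rare categories are pulled towards the prior
print(target_encode(["a", "a", "b", "b", "b", "c"], [1, 0, 1, 1, 0, 1]))
```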

2.2.4 Dimensionality

Optimal data would contain only features that are causally related to the response. This is rarely, if ever, the case. The data will most likely contain superfluous features, or features that might be related to the response only when combined in certain ways. This leads to a feature space of higher dimension and thereby increases the computational cost. It also adds noise to the model, obfuscating the true underlying pattern in the data.

The problem of high dimensionality may also be aggravated by the need to encode categorical features. As mentioned in Section 2.2.3, one-hot encoding will transform a single feature containing m distinct levels into m − 1 separate features, increasing the feature space dimension by m − 2. One solution is to use target statistics if the method calls for numeric inputs. One may also use a model that does not require encoding of categorical features, such as a tree method; however, tree methods tend to favor splits on high cardinality categorical features.

The issue of high dimensionality also appears in model tuning. The hyperparameter space grows exponentially with each new hyperparameter, and the naive approach of grid search to find the optimal combination becomes prohibitively expensive as the number of hyperparameters increases. One way to handle this is to randomly sample the hyperparameter space instead of doing a grid search.

2.3 Performance metrics and the ROC curve

How well a model performs is measured using a certain metric. The metric is in turn chosen depending on the purpose of the model. If the purpose of the model is simply to assign the correct class to observations, one might use accuracy: the number of correctly classified instances divided by the total number of instances.

However, if the model is created to diagnose cancer, it would be advisable to use a metric that emphasizes the detection of cancers: it is far more serious to miss a cancer than to mistakenly diagnose a healthy individual.

This can be illustrated by a confusion matrix (Figure 1). For the two real outcomes $P_{true}$ and $N_{true}$, a true positive (TP) is when the model classifies $P_{true}$ as positive, and a true negative (TN) when the model classifies $N_{true}$ as negative. The off-diagonal elements show when the prediction of the model is wrong: if the model predicts positive for $N_{true}$, it is called a false positive (FP) (also known as a type I error), and if the model predicts negative for $P_{true}$, it is called a false negative (FN) (type II error).

Figure 1: An example of a confusion matrix: model predictions (positive/negative) against the true values, with TP and TN on the diagonal and FP and FN off the diagonal.

There are some useful expressions for classification that may be derived from this. The first is precision, defined as

$$\text{Precision} = \frac{TP}{TP + FP},$$

which is the ratio of true positives out of all instances classified as positive, i.e., the percentage of predicted positives that were actually correct.

The second useful expression is recall, defined as

$$\text{Recall} = \frac{TP}{TP + FN}.$$

The recall is the percentage of all the positive instances detected by the model.

Another name for recall is True Positive Rate (TPR). Conversely, one may define a False Positive Rate (FPR):

$$\text{FPR} = \frac{FP}{TN + FP}.$$

The Receiver operating characteristic curve (ROC curve) is a plot with True positive rate (TPR) on the y-axis, and False positive rate (FPR) on the x-axis. An example of a ROC curve is shown in Figure 2.


Figure 2: The Receiver operating characteristic (ROC) curve.

The curve is traced out by varying the probability threshold of the classifier; by varying the threshold, precision can be traded for recall. One way to summarize the result in a single number is to calculate the area under the curve (AUC). For a perfect classifier the AUC would be 1, represented by the plot y(x) = 1, x ∈ [0, 1].

This would mean that all true positives were classified as positives and all true negatives were classified as negatives. On the other hand, a curve corresponding to y(x) = x, x ∈ [0, 1] would be equivalent to letting a coin toss decide the classification of an observation. Furthermore, the interpretation of the curve is symmetric around the diagonal, since a classifier that consistently assigns the wrong class to each observation can be made to give correct predictions by simply switching the classes.

Unfortunately, the ROC curve is of less use in cases where there are more than two outcomes. One option is then to take a one-versus-all approach and create a ROC curve for each class.
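A sketch computing the TPR/FPR pairs over a grid of thresholds and the AUC by the trapezoidal rule (NumPy assumed; a library routine such as scikit-learn's roc_curve would be the usual choice in practice):

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    """Return arrays of (FPR, TPR) for each probability threshold."""
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        tpr.append(tp / (tp + fn))   # recall / true positive rate
        fpr.append(fp / (fp + tn))   # false positive rate
    return np.array(fpr), np.array(tpr)

# Toy example: two positives and two negatives with predicted scores
y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr = roc_points(y, p, thresholds=np.linspace(1.0, 0.0, 101))
auc = np.trapz(tpr, fpr)             # area under the ROC curve
print(round(auc, 3))                 # 0.75 for this example
```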


3 Artificial Neural Networks (ANNs)

3.1 The Multilayer Perceptron

Artificial Neural Networks (ANNs) were originally inspired by the human brain and its structure of interconnected neurons. Many variations of neural networks have since been proposed and adapted to solve different tasks. The convolutional neural network (CNN) was inspired by how human vision works, and attempts to detect rough features in early layers and from these building blocks construct more and more elaborate patterns. The recurrent neural network (RNN) allows outputs to be fed back into the model, and can be viewed as an instance of a deep neural network if unrolled in time. These types of networks and others are explained in great detail in the book Deep Learning (Adaptive Computation and Machine Learning) [15] by Goodfellow et al.

This paper will primarily focus on the fully connected feed forward network, also known as the multilayer perceptron (MLP). Despite this delimitation, many of the aspects discussed in this thesis are either applicable to other types of ANNs as well as MLPs, or have in fact originally been developed for these other types of networks. A schematic figure of an MLP network is shown in Figure 3.

Figure 3: Schematic view of a fully connected feed forward neural network with input layer size 4, width 5, 6 hidden layers, and output size 1. The connecting arrows between the input layer, the hidden layers, and the output layer represent the weights $\boldsymbol{w}$. To each hidden layer is also added a bias vector $\boldsymbol{b}$.

The fully connected feed forward neural network constructs a mapping from the input x to an output y; these may be vector valued or scalars. The network is composed of an input layer that takes a vector $\boldsymbol{x} = (x_1, \ldots, x_p)$ of fixed length p, where p is the number of features. Next is a hidden layer consisting of m nodes (the width of the network). Each node takes in a linear combination of the input $\boldsymbol{x}$ and applies some function g(z), referred to as an activation function; the resulting output of the hidden layer is $g(\boldsymbol{z}) = g(\boldsymbol{w}^T\boldsymbol{x} + \boldsymbol{b})$. The parameters to be learned, denoted by $\boldsymbol{\theta}$, are in this case the weights $\boldsymbol{w}$ and the biases $\boldsymbol{b}$. To be more precise, the weight $w_{ij}^{(l)}$ connects the output from neuron i in layer l − 1 to neuron j in layer l; neuron j also receives a bias term $b_j^{(l)}$.

In the case where the network has a single hidden layer with activation function g(z) and an output layer with activation function f(z), the output (response) would be $\boldsymbol{y} = f(g(\boldsymbol{w}^T\boldsymbol{x} + \boldsymbol{b}))$. The network can be made both deeper and wider: for a network with two hidden layers, where the first layer nodes have activation function $g^{(1)}(z)$ and the second hidden layer nodes $g^{(2)}(z)$, the output would be $\boldsymbol{y} = f(g^{(2)}(g^{(1)}(\boldsymbol{w}^T\boldsymbol{x} + \boldsymbol{b})))$, with the layer-specific weights and biases suppressed for brevity.

If the activation functions $g^{(i)}(z)$ are chosen to be linear transformations, $\boldsymbol{y}$ would be a linear transformation of $\boldsymbol{x}$. To be able to approximate a larger family of functions, the activation function is therefore often chosen to be a non-linear transformation. Which activation function is used will vary depending on the problem at hand, the type of network used, and whether the nodes are in a hidden layer or in the output layer. Activation functions are examined in more detail in Section 3.2.


The universal approximation theorem states that a neural network with a single hidden layer and a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^p$, as long as the activation functions are sufficiently smooth [16]. If yet another layer is added, the network is able to approximate any function [17], although not necessarily with a finite number of neurons.

While a neural network is able to approximate any function, the question remains of which size is optimal. The problem of deciding on the width and the depth of a neural network is still largely a matter of experiment. There is some evidence that a deep network may have higher parameter efficiency than a shallow one, at least in the case of multivariate polynomials [18]: in the polynomial case, the deep network's required number of neurons grows linearly with the number of features p, while the growth in the number of neurons is exponential in the case of a single hidden layer.
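To make the notation above concrete, a minimal NumPy sketch of a forward pass through a fully connected feed forward network; the layer sizes, the ReLU hidden activations, and the sigmoid output are illustrative choices, not the thesis's final architecture:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """One forward pass through a fully connected feed forward network.

    weights[l] has shape (n_{l-1}, n_l) and biases[l] shape (n_l,);
    hidden layers use g = ReLU, the output layer uses f = sigmoid.
    """
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)                           # g(w^T x + b) per hidden layer
    return sigmoid(a @ weights[-1] + biases[-1])      # output in (0, 1)

# A width-5 network with 2 hidden layers, input size 4, scalar output
rng = np.random.default_rng(0)
sizes = [4, 5, 5, 1]
Ws = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
print(forward(rng.normal(size=4), Ws, bs))
```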

3.2 Activation Functions

The activation function g(z) is chosen based on the problem at hand as well as training considerations. Early neural networks used a step function in order to mimic the behavior of brain neurons. Since the training of a neural network is performed by moving along the gradient (examined in more detail in Section 3.5), this proves problematic: the step function has derivative zero everywhere except at the origin, where it is undefined. Several other activation functions have been proposed, and many of them are in use today. The choices for the hidden layer activations as well as the cost function are mostly guided by training considerations, such as preventing vanishing or exploding gradients (more on this in Section 3.5).

3.2.1 Saturating activation functions

The family of saturating activation functions consists of functions that squeeze the input. In more rigorous terms, an activation function g(z) is non-saturating iff

$$\lim_{z\to-\infty} |g(z)| = \infty \quad \text{or} \quad \lim_{z\to+\infty} |g(z)| = \infty, \qquad (1)$$

and saturating otherwise.

A natural replacement for the previously mentioned step function is the sigmoid function, also called the logistic function, defined as

$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$

This function maps an input z to (0, 1) and has the derivative

$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right).$$

Another choice is the hyperbolic tangent function, defined as

$$\tanh(z) = 2\sigma(2z) - 1.$$

The graphs of the sigmoid function and the tanh function are shown in Figure 4.

Figure 4: Left: the sigmoid function $\sigma(z) = \frac{1}{1+e^{-z}}$. Right: the tanh function.


The softmax function maps an input vector of length K into a vector that may be interpreted as a probability distribution; that is, each component is positive and the components sum to 1:

$$\sigma(\boldsymbol{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \ldots, K.$$

The sigmoid function can be used either as an activation function in a hidden layer or in the output layer. Due to training issues explored further in Section 3.5, it is now more common to use it in the output layer, where it maps the output to the range (0, 1), which is suited to probabilities. In binary classification, where the goal is to model the probability p that the response belongs to the first of the two possible classes, the probability of the observation belonging to the second class is then 1 − p. For a classifier with multiple classes, the softmax function outputs a probability distribution over the different responses. In the special case of two classes, the softmax simplifies to the sigmoid function.

3.2.2 Non-saturating activation functions

In the previous section, the condition for a non-saturating function was defined in Equation (1). An example of this type of activation function is the rectified linear unit (ReLU) and its modified versions. The ReLU [19-21] is defined as

$$\text{ReLU}(z) = \max(0, z).$$

This activation function is fast to compute, and although only the one-sided derivatives exist at zero, this turns out not to pose a problem in practice: implementations can simply be defined to return one of the one-sided derivatives. Two justifications can be presented for this. Firstly, due to rounding errors it is unlikely that the input is ever exactly zero; secondly, the training is unlikely to arrive at an exact minimum but will instead arrive at a sufficiently low cost.

The ReLU does however introduce the problem of dying ReLUs: if the input to a neuron is less than or equal to zero, the output will be zero. This means that there is no path for a signal to propagate forward through the neuron and, later, no path for the gradient to flow through. A solution to this is a modified ReLU, the Leaky ReLU (LReLU) [22], defined as

$$\text{LeakyReLU}_\alpha(z) = \max(\alpha z, z),$$

where α is the slope of the function for z < 0.

The parameter α can be picked either through trial and error as a hyperparameter, or at random, in which case the unit is referred to as a Randomized Leaky ReLU (RReLU); during testing and inference α is then held fixed. α can also be learned as a parameter, a method named Parametric Leaky ReLU (PReLU) [23]. A further modification of the ReLU is the Exponential Linear Unit (ELU) [24], defined as

$$\text{ELU}_\alpha(z) = \begin{cases} \alpha\left(e^z - 1\right) & \text{if } z < 0, \\ z & \text{if } z \geq 0. \end{cases}$$

In general, ELUs work best but are slightly more costly to compute and therefore affect computation speed. Second best is the Leaky ReLU and its variations (PReLU), followed by the ReLU [25]. The sigmoid and tanh functions discussed previously can be used as activation functions in the hidden layers, but are outperformed by the different ReLU variations and are therefore more suited to output layers. Plots of the different ReLU variations are shown in Figure 5.


Figure 5: Left: the variations of the ReLU. If the slope for z < 0 is set to a constant α > 0, the graph depicts an LReLU; if the slope is chosen at random, the function is referred to as an RReLU; if the slope is treated as a trainable parameter, it is the PReLU; for α = 0 the regular ReLU is recovered. Right: the ELU activation function.
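The ReLU variants above, written out directly as a NumPy sketch (the default α values are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # fixed alpha: LReLU; alpha drawn at random: RReLU; learned: PReLU
    return np.where(z < 0, alpha * z, z)

def elu(z, alpha=1.0):
    return np.where(z < 0, alpha * (np.exp(z) - 1.0), z)
```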

3.3 Cost Functions

The purpose of the model is to approximate the true distribution responsible for generating the observed data. Let the model distribution of the response $\boldsymbol{y}$ given observations $\boldsymbol{x}$, where the model is parameterized by $\boldsymbol{\theta}$ (which, as an example, would be the mean and the variance in the case of the normal distribution), be denoted by

$$p(\boldsymbol{y} \mid \boldsymbol{x}).$$

One way to approximate this distribution is through maximum likelihood, where the cross-entropy between the training data and the model predictions is used as the cost function, with the aim of estimating the parameters $\boldsymbol{\theta}$ of the chosen distribution. To arrive at the cross-entropy, begin with the self-information of an event, defined as $I(x) = -\log p(x)$, where p(x) is some probability. If base 2 is used for the logarithm, the units are called Shannons or bits; if base e is used, the units are called nats. This definition of self-information is intuitive in the sense that an event that is certain to happen carries little information, whereas an unlikely event is more informative. If base e is used, one nat corresponds to the information gained from observing an event of probability $\frac{1}{e}$.

For a probability distribution, the idea is extended to what is called the Shannon entropy, defined as

$$H(x) = \mathbb{E}_p[I(x)] = -\mathbb{E}_p[\log p(x)],$$

where the subscript p specifies that the expected value is to be taken over the distribution p.

The Shannon entropy of a distribution p(x) (H(p) for brevity) is the expected amount of information gained by observing an event from the distribution p(x). The more uniform the distribution, the higher H(p) will be.

In the case of two probability distributions p(x) and q(x), a measure of how much they differ is the Kullback-Leibler divergence, defined as

$$D_{KL}(p\,\|\,q) = \mathbb{E}_p\left[\log\frac{p(x)}{q(x)}\right] = \mathbb{E}_p\left[\log p(x) - \log q(x)\right].$$

From this, the cross-entropy can be defined as

$$H(p, q) = H(p) + D_{KL}(p\,\|\,q),$$

which reduces to

$$H(p, q) = -\mathbb{E}_p[\log q(x)].$$

Minimizing cross-entropy with respect to q is thus equivalent to minimizing the Kullback-Leibler divergence.


To avoid the difficulties of the expression $0 \cdot \log 0$, it is defined as 0, since $\lim_{x\to 0} x \log x = 0$.

The cost function to be minimized is then simply the KL divergence between the empirical distribution $\hat{p}$, as approximated from the data sample, and the model distribution p believed to be generating the training sample:

$$L(\boldsymbol{\theta}) = -\mathbb{E}_{\hat{p}}\left[\log p(\boldsymbol{y} \mid \boldsymbol{x})\right].$$

This function will take different forms depending on our beliefs about the true distribution p(x). A common special case is the Gaussian model for continuous observations, where the underlying true distribution is assumed to be the normal distribution: $p(\boldsymbol{y} \mid \boldsymbol{x}) \sim \mathcal{N}(\boldsymbol{\mu}, \sigma^2\boldsymbol{I})$, while $\hat{p}(\boldsymbol{y} \mid \boldsymbol{x})$ is the empirical distribution obtained from the sample data. If one assumes that $\boldsymbol{\mu} = \hat{\boldsymbol{y}}$ is the expected value of the distribution, then

$$H(\hat{p}, p) = -\mathbb{E}_{\hat{p}}\left[\log p(\boldsymbol{y}|\boldsymbol{x})\right] = -\frac{1}{n}\sum_{i=1}^{n}\log p(y_i|x_i) = -\frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(y_i-\hat{y}_i)^2}{2\sigma^2}}\right) = -\log\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right) + \frac{1}{n}\sum_{i=1}^{n}\frac{(y_i-\hat{y}_i)^2}{2\sigma^2}.$$

The first term is simply a constant. For fixed σ, minimizing the expression is equivalent to minimizing

$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2,$$

which is the familiar mean squared error.

In the case of a binary response $y \in \{0, 1\}$, it is more reasonable to assume that the true distribution is a Bernoulli distribution. Denote the model distribution $p(y = 1|\boldsymbol{x}) = \hat{y}$ (which implies $p(y = 0|\boldsymbol{x}) = 1 - \hat{y}$) and the empirical distribution $\hat{p}(y = 1|\boldsymbol{x}) = y$; the cross-entropy is then

$$H(\hat{p}, p) = -\mathbb{E}_{\hat{p}}\left[\log p(y|\boldsymbol{x})\right] = -\left[y\log\hat{y} + (1-y)\log(1-\hat{y})\right],$$

which is recognizable as the logistic cost function for a single instance. The set of n observations would have the cost function

$$L(\boldsymbol{\theta}) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right].$$

There are more cost functions than the two mentioned in this section. The final shape of the cost function depends on which assumptions are made about the data-generating distribution, how the response is encoded (binary responses can be coded y ∈ {−1, 1} yielding a slightly different expression), and design choices such as weights penalizing certain misclassifications more than others.
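The binary cross-entropy cost above, computed directly (NumPy sketch; the clipping constant guarding against log 0 is an implementation choice, not part of the formula):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """L(theta) = -(1/n) * sum[ y*log(y_hat) + (1-y)*log(1-y_hat) ]."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))
```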

3.4 Regularization

Whereas minimizing the cost function reduces the training error, regularization is focused on reducing the generalization error of the model. Some of the methods discussed here also appear in other settings. The techniques of $\ell_1$ and $\ell_2$ norm restrictions on the weights (known as Lasso and Ridge respectively), and the combination of the two (Elastic net), are explored in greater detail in Section 4.

3.4.1 Parameter norm constraints

Restricting the size of the weights connecting the neurons has the effect of restricting the model, rendering it less complex. The constraint can be on the $\ell_1$ norm of the weights $\boldsymbol{w}$ or on the $\ell_2$ norm. One may also combine the two into the elastic net approach, which takes a combination of the two norm restrictions. Expressed more formally, regularization is the addition of a penalty term to the cost function:

$$\tilde{L}(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) = L(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) + \lambda\,\Omega(\boldsymbol{\theta}).$$

The $\ell_1$ regularization, often referred to as the Lasso [26], uses

$$\ell_1: \quad \Omega(\boldsymbol{\theta}) = \|\boldsymbol{w}\|_1 = \sum_i |w_i|,$$

while the $\ell_2$ regularization, known as either weight decay or ridge regularization [27], is the addition of the term

$$\ell_2: \quad \Omega(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{w}\|_2^2 = \frac{1}{2}\boldsymbol{w}^T\boldsymbol{w}.$$

The Elastic net [28] is a combination of the $\ell_1$ and $\ell_2$ regularizations:

$$\Omega(\boldsymbol{\theta}) = \alpha\|\boldsymbol{w}\|_1 + (1-\alpha)\,\frac{1}{2}\|\boldsymbol{w}\|_2^2, \qquad \alpha \in [0, 1].$$

The norm constraint tends to be applied to the weights only, and not the biases. The reasoning behind this decision is elaborated on in Deep Learning (Adaptive Computation and Machine Learning): since a weight affects the relationship between two neurons, while a bias term affects a single neuron, leaving the bias unregularized does not introduce much variance; furthermore, regularizing the bias term may result in underfitting. A further reason is that the gradient does not flow through the biases to earlier layers, so the biases do not participate in issues such as vanishing or exploding gradients (more on this in Section 3.5).

The effects of the $\ell_1$ and $\ell_2$ regularizations differ in some significant ways. Whereas the $\ell_2$ regularization shrinks all the weights by a certain multiplicative factor, the $\ell_1$ regularization reduces the weights by a constant amount. This difference has far reaching consequences for how the regularization affects the model in terms of sparsity. These regularizations have been studied to a far greater extent in the context of linear regression and its extension to generalized linear models, and will therefore be examined further in Section 4, which deals with logistic regression.

3.4.2 Dropout

One way of circumventing overfitting is to train multiple models and average the results, as is done in ensemble methods. For deep neural networks this approach is prohibitively expensive, partly due to the number of model parameters, but also due to the need to tune hyperparameters for each model. Dropout [30] is a method to achieve an ensemble effect in a slightly different way. The dropout method can be viewed as sampling from an exponential number of sub-networks, where each network contains only a subset of the full number of neurons and their corresponding connections. This is done by assigning each neuron a probability of being included during a training iteration. The probability can either be tuned as a hyperparameter or set to some number (0.5 for hidden neurons and close to 1 for input neurons are common, and also recommended by the authors of the original paper). The ensemble average can then be approximated by performing the test evaluation on the full network, with all neurons present but their weights scaled by the probability that they were kept.

A neural network with m neurons comprises a set of $2^m$ possible thinned subnetworks, with a maximum number of parameters of order $m^2$.

A dropout network is trained like a regular network, except that the dropped neurons no longer propagate a signal forward or backward through the network. However, the authors of the original paper on dropout found that constraining the max-norm of the incoming weights of a hidden (non-dropped) neuron gave better results than training without max-norm regularization.

One possible reason for the dropout method's success is that by making the presence of the other neurons unreliable, specific neurons cannot rely on other neurons to correct their mistakes. This forces each neuron to learn to perform well on its own, breaking co-adaptations, which in turn may prevent overfitting and thus improve the network's ability to generalize.
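A minimal sketch of the scheme described above: random masks during training and weight scaling at test time (NumPy; the keep probability of 0.5 follows the recommendation for hidden neurons quoted above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(a, keep_prob=0.5, training=True):
    """Apply dropout to the activations `a` of a hidden layer.

    During training each neuron is kept with probability keep_prob;
    at test time all neurons are kept and (as in the original paper)
    the weights feeding the next layer are scaled by keep_prob instead.
    """
    if training:
        mask = rng.random(a.shape) < keep_prob   # 1 = keep, 0 = drop
        return a * mask                          # dropped neurons output 0
    return a                                     # full network at test time

def scale_weights_for_test(W, keep_prob=0.5):
    """Test-time weight scaling approximating the ensemble average."""
    return W * keep_prob
```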

Note: the term weight decay is often used interchangeably with $\ell_2$ regularization; however, in the context of training neural networks they may not be the same. In plain SGD they can be made equivalent through a reparametrization that takes the learning rate into consideration, but this is not the case for the Adam algorithm [29].


3.4.3 Early stopping

The purpose of a model is in most cases to perform well on hitherto unseen data. Reducing the error on a training set is done in the belief that the training error and the generalization error are in some way connected, and that by reducing the error on the training set, the generalization error will also decrease.

However, given a set of training observations, it is possible to find a perfect fit if the model is sufficiently complex; that is, the training error may be reduced to zero. At some point the model is likely to overfit the data. This happens when the model fits the noise as well as the data. The noise can be caused by small errors in the measurements of the observations, a biased selection of training examples, etc. One way to examine whether the model is overfitting is to track the validation error as well as the training error. The validation error is used as an approximation of the test error.

Early stopping [31] has a hyperparameter referred to as patience. Let t be the number of iterations of the algorithm, in the case of neural networks the number of epochs. If the patience is set to m, and the validation error at epoch t + m is greater than or equal to the validation error at epoch t, the training ceases and the method uses the results obtained at iteration t.

If the validation error is a good approximation of the generalization error then this should result in a model that generalizes well to unseen data.
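A sketch of the patience rule; the step and val_error callables are hypothetical placeholders for the actual training and validation routines:

```python
def train_with_early_stopping(step, val_error, patience, max_epochs=1000):
    """Train until the validation error stops improving.

    step():      runs one epoch of training (hypothetical callable)
    val_error(): returns the current validation error (hypothetical callable)
    Returns the index of the best epoch, whose parameters should be kept.
    """
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        step()
        err = val_error()
        if err < best_err:
            best_err, best_epoch = err, epoch      # new best result so far
        elif epoch - best_epoch >= patience:
            break                                  # patience exhausted: stop
    return best_epoch
```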

3.5 Training

Broadly speaking, neural networks are trained by gradient descent or a variation thereof. The gradient is in turn computed by means of backpropagation. For a given training tuple $(\boldsymbol{x}_i, y_i)$, where $\boldsymbol{x}_i = (x_i^{(1)}, \ldots, x_i^{(p)})$ is the i'th observation consisting of p features and $y_i$ the i'th target, the network computes $\hat{y}_i$ and inserts it into the cost function in order to calculate the error. This is the forward pass. The gradient of the cost function with respect to the different parameters can then be calculated through backpropagation. The weights and biases are then updated simultaneously using gradient descent.

The number of observations used to calculate the gradient before the parameters are updated is called the batch size. A complete run through all the available training observations is referred to as an epoch.

3.5.1 Backpropagation and Gradient descent

In order to train the model, the weights $\boldsymbol{w}$ are updated in the direction that decreases the value of the cost function. This requires the gradient of the cost function with respect to the specific weights to be known.

Backpropagation [32] exploits the fact that the derivatives can be expanded using the chain rule, and that many of the terms found in the expression for the gradient with respect to a certain weight are also present in gradients previously calculated for other weights. This dramatically reduces the number of computations required.

In its simplest form, the cost function, denoted $L(\boldsymbol{w})$, is minimized with respect to the weights using the gradient calculated by backpropagation and evaluated using the result $\hat{y}_i$ obtained from the forward pass. The weights are then updated using gradient descent with a step size, or learning rate, η:

$$w^{(\text{next})} = w - \eta\,\frac{\partial L(\boldsymbol{w})}{\partial w}.$$

Gradient descent calculates the gradient of the loss function with respect to the parameters of the model. The gradient can be calculated using the entire training sample or a subset of it. The choices of sample size used in the calculation of the gradient fall into three groups [33]: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

The first, batch gradient descent, uses the whole training set of size n to compute the gradient; the weights are then updated using the calculated gradient. In stochastic gradient descent, a single instance is chosen at random and used to compute the gradient. The last version, mini-batch gradient descent, randomly chooses a subset of the full training data to compute the gradient.
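A sketch of one epoch of mini-batch gradient descent; the grad callable computing ∂L/∂w on a batch is a hypothetical placeholder for the backpropagation step, and setting batch_size to n or 1 recovers batch and stochastic gradient descent respectively:

```python
import numpy as np

def minibatch_sgd_epoch(w, X, y, grad, eta=0.01, batch_size=32, rng=None):
    """One epoch of mini-batch gradient descent.

    grad(w, X_batch, y_batch) returns dL/dw on the batch (hypothetical
    placeholder for the backpropagation step). batch_size = len(X)
    gives batch gradient descent; batch_size = 1 gives SGD.
    """
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(X))                  # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w = w - eta * grad(w, X[batch], y[batch])  # w_next = w - eta * dL/dw
    return w
```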


The larger the training set n, the more expensive the computation of the gradient will be with batch gradient descent; on the other hand, with stochastic gradient descent the solution will constantly shift around, never settling.

The training can be further modified by manipulating the learning rate/step size η, or even by using a separate learning rate for each parameter. Today the most commonly used method is named Adam [34] (Adaptive moment estimation), although there have been attempts to improve this optimizer as well, through the Nadam algorithm, which incorporates Nesterov momentum. Adam builds on previous modifications of gradient descent: Momentum, AdaGrad, and RMSProp.

The idea of momentum optimization is to take previous gradients into consideration. The momentum vector, which accumulates past gradients, is updated as

$$m_i = \beta m_i + \eta\,\frac{\partial L(\boldsymbol{w})}{\partial w_i},$$

where β is called the momentum and is a number between 0 and 1. The updated weights are given by

$$w_i^{(\text{next})} = w_i - m_i.$$

Nesterov momentum [35] improves on the idea of momentum by measuring the gradient of the cost function slightly ahead in the direction of the momentum, rather than at the current position:

$$m_i = \beta m_i + \eta\,\frac{\partial L(\boldsymbol{w} + \beta\boldsymbol{m})}{\partial w_i},$$

and, as above, the weights are updated according to

$$w_i^{(\text{next})} = w_i - m_i.$$

AdaGrad [36] adapts the learning rate over time and keeps a learning rate for each parameter. For each weight $w_i$:

$$s_i = s_i + \left(\frac{\partial L(\boldsymbol{w})}{\partial w_i}\right)^2, \qquad w_i^{(\text{next})} = w_i - \frac{\eta}{\sqrt{s_i + \epsilon}}\,\frac{\partial L(\boldsymbol{w})}{\partial w_i},$$

where ε is a small number that improves numerical stability. The result of this approach is that parameters with large partial derivatives of the cost function have a rapid decrease in their learning rate, while parameters with small partial derivatives have a small decrease. This optimization method suffers from stopping too early when training neural networks, since the learning rate becomes so scaled down that training stops before reaching an optimum.

RMSProp [37] remedies AdaGrad's issues by accumulating only the gradients from the most recent iterations, as opposed to all previous gradients:

$$s_i = \beta s_i + (1-\beta)\left(\frac{\partial L(\boldsymbol{w})}{\partial w_i}\right)^2, \qquad w_i^{(\text{next})} = w_i - \frac{\eta}{\sqrt{s_i + \epsilon}}\,\frac{\partial L(\boldsymbol{w})}{\partial w_i}.$$

As mentioned previously, the currently most widely used optimizer is the Adam optimizer. Adam combines the ideas of momentum optimization and RMSProp: it keeps a decaying memory of past gradients and at the same time adapts a separate learning rate for each parameter. The steps of the Adam optimizer are

$$1: \quad \boldsymbol{m}^{(\text{next})} = \beta_1\boldsymbol{m} + (1-\beta_1)\,\nabla_{\boldsymbol{w}}L(\boldsymbol{w}),$$
$$2: \quad \boldsymbol{s}^{(\text{next})} = \beta_2\boldsymbol{s} + (1-\beta_2)\,\nabla_{\boldsymbol{w}}L(\boldsymbol{w}) \otimes \nabla_{\boldsymbol{w}}L(\boldsymbol{w}),$$
$$3: \quad \boldsymbol{m}^{(\text{next})} = \frac{\boldsymbol{m}}{1 - \beta_1^T},$$
$$4: \quad \boldsymbol{s}^{(\text{next})} = \frac{\boldsymbol{s}}{1 - \beta_2^T},$$
$$5: \quad \boldsymbol{w}^{(\text{next})} = \boldsymbol{w} - \eta\,\boldsymbol{m} \oslash \sqrt{\boldsymbol{s} + \epsilon},$$

where ⊗ denotes elementwise multiplication, ⊘ denotes elementwise division, and T is the iteration number. The parameters $\boldsymbol{m}$ and $\boldsymbol{s}$ are initialized to zero, meaning that they would be biased towards zero at the beginning of training; steps 3 and 4 correct for this by boosting m and s in the early stages of training.

Steps 1, 2, and 5 are similar to momentum optimization, while steps 3 and 4 are inspired by RMSProp. Nadam [38] modifies Adam by incorporating Nesterov momentum in place of the standard momentum.
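The five steps above as a NumPy sketch; the default values of η, β₁, β₂, and ε follow common practice and are not taken from the thesis:

```python
import numpy as np

def adam_step(w, grad, m, s, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the (1-based) iteration counter T.

    grad is the gradient of the cost with respect to w at the current point.
    """
    m = beta1 * m + (1.0 - beta1) * grad               # step 1: decaying mean
    s = beta2 * s + (1.0 - beta2) * grad * grad        # step 2: decaying 2nd moment
    m_hat = m / (1.0 - beta1 ** t)                     # step 3: bias correction
    s_hat = s / (1.0 - beta2 ** t)                     # step 4: bias correction
    w = w - eta * m_hat / (np.sqrt(s_hat) + eps)       # step 5: parameter update
    return w, m, s
```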

3.5.2 Convergence

A convex optimization problem converges to a solution regardless of the initial parameters; for example, a linear model with a convex cost function is guaranteed to find a global minimum. Stochastic gradient descent applied to non-convex cost functions has no such convergence guarantee, and is therefore sensitive to the values of the initial parameters. Furthermore, there is no guarantee that the solution arrived at is a global minimum; more likely, it is a local minimum. Despite this, a local minimum is often sufficient. The previous section described modifications to the original batch gradient descent approach.

These modifications are in part intended to aim straighter towards a minimum, but also to be able to escape bad local minima in favor of better ones.

A prominent problem facing neural networks, especially deep ones, is that of vanishing or exploding gradients: cases where the gradient becomes very small or very large. The ReLU activation function discussed in Section 3.2 was introduced partly for this very reason, replacing the sigmoid function as the preferred activation function in the hidden layers. Since the sigmoid function is saturating, it would cause the gradient to become very small, eventually shrinking it towards zero. The ReLU instead has the potential of allowing the gradient to become very large (exploding gradient), or of prohibiting the gradient from flowing backwards through the network via the phenomenon of dying ReLUs, which happens when the input to the ReLU is 0 or less and the activation function therefore outputs zero. As shown in the section on non-saturating activation functions, the solution to dying ReLUs is the Leaky ReLU and its variants.

In particular, the exploding gradient problem [39] can be handled by re-scaling the norm of the gradient whenever it exceeds a certain threshold [40].
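The norm re-scaling just mentioned, as a NumPy sketch (the threshold is an illustrative value):

```python
import numpy as np

def clip_gradient_norm(grad, threshold=5.0):
    """Re-scale the gradient if its L2 norm exceeds `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)   # same direction, norm = threshold
    return grad
```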

In addition to new activation functions such as the Leaky ReLU, and norm constraints on the gradient, an important addition to training deep neural networks has been the question of how to initialize the parameters. The main parameters in a neural network are the weights $\boldsymbol{w}$ and the biases $\boldsymbol{b}$.

If the weights are initialized to zero, all the neurons of a layer compute the same expression. This symmetry will remain unbroken, and the network will amount to nothing more than a network of width one. Instead, several approaches have been proposed with variations on random initialization of the weights.

Xavier initialization [41] explores the idea that the weights should be initialized so as to keep the variance constant throughout the network, both going forward and when backpropagating the error. Under the assumption of an activation function f(x) that is symmetric and has f'(0) = 1, there are two conditions that should be met simultaneously, namely

$$n_i\,\mathrm{Var}[W_i] = 1 \quad \text{and} \quad n_{i+1}\,\mathrm{Var}[W_i] = 1,$$

where $n_i$ denotes the width of layer i, $n_{i+1}$ the width of layer i + 1, and $W_i$ the weights connecting layer i − 1 and layer i. If the two layers are of equal width this can be satisfied; if they are not, the harmonic mean is used as an approximation:

$$\mathrm{Var}[W_i] = \frac{2}{n_i + n_{i+1}}.$$

This leads to what the authors themselves refer to as normalized initialization, but which has later been referred to as Xavier initialization:

$$W_i \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}},\; \frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}\right].$$
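The normalized (Xavier) initialization above, as a NumPy sketch:

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Draw W_i ~ U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_uniform(4, 5)   # weights between a width-4 and a width-5 layer
```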
