
Improved personalized suggestions on websites using machine learning

FILIZ BOYRAZ

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Improved personalized suggestions on websites using machine learning

FILIZ BOYRAZ

Master in Computer Science
Date: January 26, 2020

CSC Supervisor: Lars Arvestad

Principal supervisor: Gustav Rengby, Random Forest
Examiner: Johan Håstad

Swedish title: Förbättrade personaliserade förslag på webbsidor genom maskininlärning

School of Computer Science and Communication


Abstract

Automated web personalization is a desired feature for both a website visitor and the website owner. The visitor is released from the burden of selecting settings or using search tools to adapt the website to their needs and find the right content. The owner benefits from having the visitors find content on the website they otherwise would have overlooked.

In this project an existing model for the program Red Pine was improved on. The program Red Pine is a solution which can provide web personalization to website visitors by suggesting content based on the available information on the visitor. To predict which offer a visitor should be suggested, the program uses a model with a classification algorithm. The model is trained with previous instances of made suggestions and the recorded responses from the visitors. The model was improved through selecting features, evaluating algorithm performance, and tuning algorithm parameters. Eight different algorithms from Azure Machine Learning Studio were used for the experiments. This resulted in finding an improved combination of features, algorithm parameters and algorithm for the model.

New features describing the current session were added to the model, and a correlation based filtering method for feature selection was used to discover relevant features. The identified features from the filtering were used in a greedy search to find an improved feature set. The default parameter values of the algorithms were tuned and the final combination of features, algorithm parameters and algorithm was cross-validated.

It is concluded that session data can be used as a replacement or complement for visitor data, when visitor data is unavailable, to get personalization with equal performance. Two of the algorithms that obtained the highest performance scores had a linear decision boundary.


Sammanfattning

Automated web personalization benefits both a website's visitors and its owner. Visitors can avoid having to choose settings or use search tools to adapt the page content and find the right information. The owner benefits from visitors finding the content they are looking for, which they otherwise would have risked missing.

In the project an existing model for the program Red Pine was optimized. The program Red Pine is a solution which can offer web personalization to visitors of a website by suggesting content on the website based on the information available about the visitor. To determine which offer a visitor should be shown, the program uses a model with a classification algorithm. The model is trained with previous instances of made suggestions and the recorded responses the suggestions generated from visitors. Three optimization steps for eight different algorithms from Azure Machine Learning Studio were used to discover an improved model combination of input features, algorithm parameters and algorithm.

New session features were included in the model, and a correlation based filtering method was used to select relevant input features. The features identified by the filtering were then used in a greedy search to find a better set of features. The default parameters of the algorithms were tuned, and the final combination of input features, algorithm parameters and algorithm was cross-validated.

The conclusions from the project are that session data can be used as a replacement or complement for visitor data, when visitor data is unavailable, to achieve personalization with equal performance. Two of the algorithms which reached the best results had linear decision boundaries.


Contents

Glossary

1 Introduction
1.1 Background
1.2 Problem description
1.3 Objective
1.4 Research questions
1.5 Limitations
1.6 Novelty

2 Background
2.1 Algorithm selection
2.1.1 Evaluation methods
2.1.2 Input
2.1.3 Parameter tuning
2.2 Microsoft Azure Machine Learning Studio
2.2.1 Binary classification algorithms in Azure
2.2.2 Parameters
2.3 Previous research
2.3.1 The StatLog legacy
2.3.2 Research with other performance measures
2.3.3 Effects of feature selection and parameter tuning
2.3.4 Examples of website visitors data mining
2.4 Summary

3 Method
3.1 Material
3.1.1 System
3.1.2 Dataset content
3.2 Execution
3.2.1 Dataset
3.2.2 Parameter tuning
3.2.3 Evaluation and algorithm selection
3.3 Method motivation
3.3.1 Correlation and greedy search
3.3.2 Evaluation method
3.3.3 Evaluation purpose

4 Results
4.1 Correlation
4.2 Filtered dataset
4.3 Algorithm parameters
4.4 Results with default parameters
4.5 Results with tuned parameters
4.6 Cross Validation
4.7 Runtime

5 Discussion
5.1 Experimentation results
5.1.1 Feature selection
5.1.2 Parameter tuning
5.1.3 Performance consistency
5.2 Features
5.2.1 Visiting pattern feature
5.2.2 Offer specific features
5.3 Feature selection
5.4 Linearity in higher dimensions
5.5 Runtime
5.6 Training intervals
5.7 Ethical aspects of the project
5.7.1 Handling of personal data in the system
5.7.2 Privacy concerns in web personalization

6 Conclusions

Bibliography


Glossary

AUC Area Under the Curve, a measure of the accuracy of a classification model. The value relates the True Positive Rate to the False Positive Rate at different threshold values in a classification.

Azure Microsoft Azure Machine Learning Studio, the tool for developing and testing the model.

NFL No Free Lunch theorem: there is no shortcut to discovering the optimal combination of dataset, algorithm, and algorithm parameters.

Offer Content on the client website which Red Pine can choose to suggest to a visitor.

Placement location The area on the client's website which Red Pine has control of.

Red Pine The product program developed by Random Forest.

Request A new request is sent every time a visitor loads a page which has at least one placement location. Red Pine answers the request by suggesting one offer for each placement location.

Response The visitor's response to the suggested content. The response is positive, 1, if the visitor has clicked on the suggested offer and negative, 0, otherwise. Anything from none to all of the suggestions in the request can get a positive response.


Session One session for a visitor of the website. The dynamic data is connected to specific sessions, but one visitor can be connected to several sessions.

Session data Also referred to as dynamic data: the information known about the current session, such as which pages the visitor has viewed in chronological order. The dynamic data is available for all visitors.

Suggestion When the model selects one of the offers, or several in order, and displays it to the visitor on a placement location.

Visitor A private person who is viewing the client's website. If the visitor is not logged in to the website they are an Anonymous visitor. If the visitor has been identified, for example by logging in to the website, they are a Registered visitor.

Visitor data Also referred to as static data: known information on a visitor connected to the current session, only available for registered visitors. The static data can change over time, but not as frequently as the dynamic data. One example would be a visitor data attribute which describes the membership connected to a customer; if the membership is changed, the data will naturally also change.


1 Introduction

1.1 Background

Website providers strive to improve user experience in different ways, for example by making the most visited pages accessible through fewer steps or by putting direct links on the main page. The website can also choose to produce more of the content that the visitors show most interest in by looking at the visitor statistics of different pages, for example on a news website. These methods can improve the website in general for the targeted user group, but it is also possible to use the data for web personalization, where parts of the content are adapted to each visitor who is viewing it, for example by giving recommendations in a webshop.

The company Random Forest has developed a solution, Red Pine, which makes suggestions, among a set of available offers, suited to a website visitor based on the visitor's stored data. This personalizes which content on the website the visitor has easiest access to from the current webpage. The goal is to guide the visitor to content that the visitor is likely to be interested in and which the website owner also is interested in making available to the visitor. The visitor data is available to the system only after the visitor has logged in to the website.

For anonymous visitors the previous version of the program was using default suggestions. Which suggestion the program should make to known visitors is predicted by a classification model using Boosted Decision Trees. A previous version of the program has already been used on the websites of Random Forest's clients.

In this project the dynamic session data of the classification model was extended and evaluated separately from the full dataset. This determined to what extent personalized suggestions can be made to anonymous visitors using only session data, without compromising on performance. The model was evaluated with either the full dataset or only the dynamic data, comparing the performance of new algorithms and tuned algorithm parameters.

Web personalization and algorithm optimization are common research problems. These research areas can be further divided into the fields of algorithm selection, parameter tuning, and user profiling, which are interesting to study both separately and in combination with each other, as this project intends to cover.

1.2 Problem description

The problem examined in this project concerns methods and strategies for how an optimal model can be created for a specific dataset. The developed model is used in a program for predicting the offer with the highest acceptance potential. There was an existing model which used boosted decision trees and known information about the website visitors.

Finding an improved algorithm and parameter combination is assumed to improve the user experience for visitors of the client webpage. This can be measured through the same evaluation method as the one used for the algorithms, since a higher degree of correct predictions indicates that a visitor would have accepted the suggestion.

The performance was optimized through three steps. In the first step, features from the session data were selected through a correlation based filtering method. In the second step, all algorithms were tuned to have suitable algorithm parameters for improving each algorithm's performance on the dataset. In the third step, the performance of each algorithm was evaluated on the new dataset. The challenge is that finding the right algorithm and parameters is a search problem which could have several optima, and likely also local optima. The project was executed mainly with the tools available in Azure.

The project also covers to what extent the session data, which are the only features the model can use for anonymous visitors, is sufficient to make accurate predictions. Session data makes the model applicable both to websites with a registration process and to systems without registered visitors. With session data it is possible to make suggestions to anonymous visitors, where the previous version of the program instead would be making default suggestions. Not all visitors will be logged in to the system on every visit, or the visitor might not yet be registered. The possibility to adapt the model for anonymous visitors as well would therefore broaden the use of the program to where it has not been usable until now. The model also benefits from information additional to the existing visitor data when making predictions.

1.3 Objective

The objective of the project is to improve the performance of the classification predictions personalized for both registered and anonymous website visitors. Achieving the objective benefits the website owner and visitors by ensuring that content is made easily accessible.

1.4 Research questions

To what extent can different factors influence the performance of the specific classification model? The results in this study are used to evaluate how much different aspects of the model construction affect performance. In order to answer that question the study will examine the following sub-questions:


• The use of correlation measures in feature selection for estimating the relevance of the new session features to the label.

• The different algorithms' characteristics and performance on the classification problem.

• The performance of algorithms using tuned parameters in comparison to their default values.

Can equal personalization be achieved for anonymous visitors as for registered visitors? The project examines what performance can be achieved from using only session data in comparison to the full dataset with both visitor and session data. Visitor data will not always be available, and a solution which is still able to offer personalization for anonymous visitors is advantageous.

1.5 Limitations

The project and evaluation of the algorithms used data from only one client. The program Red Pine is intended to be a general solution for use on different client websites where the content of the data could have different characteristics, but the primary goal of the project was to only examine the model using the current client data. The chosen algorithm is therefore not necessarily the optimal choice for any given dataset.

The project also focused on using the algorithms available in Azure, unless strong reasons to implement new algorithms appeared. The main interest lies with algorithms used for classification with supervised learning.

The evaluation considers the accuracy of the classification, where AUC is assumed to give the best indication of accuracy, as well as the runtime for a single prediction. Other measures were left out and were not part of the algorithm selection.


1.6 Novelty

The optimization techniques used in this project have been studied separately before, but here previous results were combined to document the total impact on performance, as well as each intermediate step, for a specific classification problem. It is expected that selecting the optimal algorithm with a pre-processed dataset and optimal parameter settings will yield better results than only selecting the optimal algorithm by itself. This project uses a single dataset to instead focus more on results achieved on a specific problem.

Correlation based feature selection is a filtering method for feature selection and is less computationally expensive when compared to wrapper methods. If a filtering method can recognize the weaker features for this problem adequately, it could in many cases, where an optimal subset is not necessary, be a preferred method over wrapper methods. The results from this project will show to what extent a correlation based filtering method is applicable to this problem.

Using session data for web personalization for anonymous visitors is not novel, but the result of the comparison to performances achieved with the full dataset should be of great interest. Furthermore, the methods used in the project are general and can be applied in new contexts or inspire similar projects. The results should therefore be of interest as a reference.


2 Background

This chapter introduces the background knowledge required to execute the project. Section 2.1 introduces machine learning methods used in this project. The tool Azure, which was used in this project, is presented in section 2.2. The resources of Azure are investigated and the components which are likely to be used are mentioned. In section 2.3 there is an overview of referenced literature and conclusions from previous studies in the field. A summary of the pilot study can be found in section 2.4.

2.1 Algorithm selection

When selecting an algorithm for a machine learning model it is necessary to first consider which aspects of the performance affect the evaluation. Each algorithm offers a tradeoff between the different metrics of performance, for example accuracy, runtime or required memory. The algorithm selection therefore depends on which performance metric has the highest priority. The problem the algorithm should solve is often unique and the performance of each algorithm will be different, with some performing better than others. The algorithms are, however, also adaptable to different problems, and to get a fair evaluation the algorithms should be compared under the best possible conditions. The algorithm selection problem was first formulated by Rice in 1976 [28]. The paper also discussed the idea of problem classification and the fact that algorithms are developed for a specific class of problems which might not be explicitly defined.

The No Free Lunch (NFL) theorem [30] states that it is not possible to know in advance which algorithm, under which conditions, will have the best performance on a problem. Experimentation is needed to discover the optimal combination of all conditions, since the unique problem cannot easily be compared to other problems.

2.1.1 Evaluation methods

There are several ways to evaluate the performance of a model. The evaluation methods can put special weight on different factors such as computation time, either when training or making a prediction, memory needed, or how correct the classification is.

In supervised learning, the model will be evaluated in the process of the development. The model will go through a training iteration followed by a testing iteration, with parts of the full dataset being used for testing and training respectively. This can be achieved either by splitting the data in two parts, one for training and one for testing, or by cross-validating the model. In cross-validation the model is trained with a partition of the labeled data and then validated with another, changing the partition of the dataset multiple times over folds of the complete dataset to further confirm that the model has not been overfitted to a specific subset of the data.
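As an illustration of the two evaluation setups, the sketch below uses scikit-learn as a stand-in for the corresponding Azure training and evaluation components; the feature matrix X and labels y are synthetic placeholders, not the project data.

```python
# Hedged sketch: train/test split versus k-fold cross-validation,
# using scikit-learn instead of the Azure components and synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                         # placeholder features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # placeholder labels

# Simple split: one part for training, one for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
split_score = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# Cross-validation: the train/test partition rotates over 5 folds, which
# guards against overfitting the evaluation to one particular subset.
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(split_score, cv_scores.mean())
```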

The classification outcome for each sample in the training data will be compared to the actual label field and grouped as: True Positive, TP, True Negative, TN, False Positive, FP, or False Negative, FN. These 4 groups of classification outcomes are displayed in a confusion matrix (Table 2.1) and are used in accuracy measures for classification models.


                          Actual
                   Positive    Negative
Classification
    Positive          TP           FP
    Negative          FN           TN

Table 2.1: A confusion matrix

The elements of the confusion matrix can be combined into performance metrics of a model, such as precision, $\frac{TP}{TP+FP}$, and recall, $\frac{TP}{TP+FN}$. The $F_1$ score, $\frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$, combines these two metrics for one data class into one value and could be used for easier comparison between models. Similarly there is Cohen's Kappa metric, $\frac{\mathrm{Accuracy} - \mathrm{randomAccuracy}}{1 - \mathrm{randomAccuracy}}$, which combines the values in the confusion matrix into one value to measure the agreement between the predicted values and their true labels. The metrics for each separate class could be combined as the mean value of all, or, if the occurrence of one of FPs or FNs is considered more damaging to the performance, as an adequately weighted mean value [7].
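A small sketch of these formulas, computed directly from the four confusion matrix counts; the counts used in the call are made up for illustration.

```python
# Hedged sketch: precision, recall, F1 and Cohen's kappa from the
# confusion matrix counts defined above. Example counts are invented.
def confusion_metrics(tp: int, fp: int, fn: int, tn: int):
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total
    # randomAccuracy: chance agreement if prediction and truth were
    # drawn independently from their marginal distributions.
    random_accuracy = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - random_accuracy) / (1 - random_accuracy)
    return precision, recall, f1, kappa

print(confusion_metrics(tp=40, fp=10, fn=20, tn=930))
```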

Area under the curve

The area under the Receiver Operating Characteristics (ROC) curve is used as a metric for evaluating the quality of binary classifiers. This metric is usually called Area Under the Curve (AUC) [23]. The ROC curve shows the tradeoff between the true positive (TP) rate and the false positive (FP) rate depending on the threshold function. The graph has the FP rate on the x-axis and the TP rate on the y-axis. The optimal outcome is if the model is able to achieve a minimal FP and maximal TP, meaning that the curve would reach the upper left corner of the graph. The AUC would then be 1 for a perfect model and 0.5 for a completely random classifier.

The general accuracy value can be misleading when there is a majority and a minority class in the group to be classified [2]. When the size of the minority class is much smaller than the majority class, even a classifier which predicts all data samples to be from the majority class can get a high evaluation score. The AUC score is therefore a more comprehensive measurement which could potentially discover a better tradeoff between TPR and FPR at a different threshold. The threshold does not require tuning to evaluate the model, and the model will not get away with predicting only the majority class, since that would result in either a low TPR or a low FPR.
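The sketch below illustrates the point with synthetic, heavily imbalanced labels: a classifier that always predicts the majority class reaches a high accuracy but only 0.5 AUC.

```python
# Hedged sketch: accuracy versus AUC on an imbalanced label distribution.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(1)
y_true = (rng.random(10_000) < 0.01).astype(int)   # roughly 1% positive class

always_negative = np.zeros_like(y_true)
print(accuracy_score(y_true, always_negative))     # about 0.99, looks excellent

# AUC is computed from scores; a constant score cannot separate the classes
# at any threshold, so the AUC stays at 0.5 regardless of the class balance.
print(roc_auc_score(y_true, np.zeros(len(y_true))))
```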

2.1.2 Input

Dataset characteristics

Dataset characteristics determine what type of problem it is that needs to be solved. What the content of the dataset is describing, for example either medical records of patients or pictures of different objects, is usually less important when compared to what the content of the dataset looks like. It is how the content can be described to the model which defines the type of classification problem and also determines what algorithm is the optimal choice [14].

The number of samples in the dataset could limit the use of some algorithms which require more examples to be able to converge. If there are few samples and many iterations, the algorithm is at higher risk of overfitting to the samples since there is less new information. This in turn affects how the learning rate parameter, the correction amount for each iteration in the model, should be configured. With a high learning rate the change between iterations is larger, which could cause the algorithm to miss optimal values. With a low learning rate and few iterations, however, it is possible that the optimal value is not reached before the execution terminates. Other examples of dataset characteristics are statistical descriptions of the data or the number of binary, continuous, discrete and empty values in the dataset. The types of features a dataset contains could be unsuitable for some algorithms and affect the performance. One example is categorical values in decision trees [10], where categorical values often are transformed to numeric features.

There exist methods for handling differences in the data values. For example, unbalanced datasets, where there is a clear minority and majority class, are often sampled according to the imbalance, such that either the minority class is increased or the majority decreased. The column type could also be changed or missing values replaced to make the dataset content more suitable to the model.
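A minimal sketch of these two adjustments, assuming a pandas DataFrame with hypothetical column names; Azure has built-in components for both steps, and the code below only shows the idea with simple median imputation and up-sampling.

```python
# Hedged sketch: replace missing values and up-sample the minority class.
# Column names and values are hypothetical placeholders.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "pageviews": [3, None, 7, 1, 2, 9],
    "label":     [0, 0, 1, 0, 0, 0],
})
df["pageviews"] = df["pageviews"].fillna(df["pageviews"].median())

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]
# Draw minority samples with replacement until the two classes are balanced.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["label"].value_counts())
```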

Feature selection

It is possible that not all of the data is relevant for prediction and that additional content then creates noise as well as unnecessary computations [15]. Weighting features differently or removing redundant features could improve the performance of a classifier. Reducing the dimensionality of the data also makes it easier to interpret the trained model. The process of discovering relevant content from the dataset is called Feature Subset Selection. To avoid overfitting the feature selection should be done on a separate dataset from the testing data [8].

Adding features to the model could increase performance as more information is available. But since the number of dataset samples is limited, the balance between samples and features becomes disproportionate, which is known as the curse of dimensionality, and the performance decreases. What is meant by high dimensionality is relative, but it mostly refers to problems that include over 100 features. Even for smaller dimensional problems, feature selection could still be relevant, both for reducing the computations and for improving the performance.

One way to do feature selection is through filtering methods [8]. A filtering method ranks all features based on statistical measures. Each feature is evaluated separately or in comparison with the label feature. Since the features are ranked, the user can select the desired number from the top. This method cannot discover multicollinearity between features, which should therefore be analyzed in another step of the subset selection of the data. The method is motivated by the fact that each feature in the data generally should be correlated to the class but uncorrelated to other features.
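A sketch of such a filter, scoring each feature by the absolute value of its correlation with the label and keeping the top of the ranking; the data and feature names are synthetic placeholders.

```python
# Hedged sketch: correlation-based filter ranking features against the label.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f1", "f2", "f3", "f4"])
y = (X["f1"] + 0.5 * X["f3"] + rng.normal(size=500) > 0).astype(int)

ranking = X.corrwith(y).abs().sort_values(ascending=False)   # one score per feature
selected = ranking.head(2).index.tolist()                    # keep the top of the ranking
print(ranking)
print(selected)
```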

Wrapper methods [8] are more complex than filtering methods and select the feature subset by actually searching for the subset which gives the best performance. The feature selection is then reduced to a search problem which can be solved with a greedy, heuristic or stochastic algorithm. This guarantees an improved performance, since the feature selection is based on the actual performance of the model with the selected features, but an optimal feature selection can only be achieved with a complete search of all feature combinations. An example method is to train a decision tree and then select the features which appear in the created tree [15]. The filtering method does not always yield an improved feature set, since the feature selection is not based on the performance of the model but instead on other metrics of the feature set, one example being the correlation between the features. The wrapper method does, however, depend on the used model to a greater extent and also increases the risk of overfitting, while the filtering method has the same selection method regardless of the algorithm selection.
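A greedy forward search is the simplest wrapper strategy; the sketch below assumes a pandas DataFrame X, labels y and a scikit-learn style estimator, and stops when no remaining feature improves the cross-validated AUC. It is an illustration of the idea, not the exact procedure used in the project.

```python
# Hedged sketch: greedy forward selection as a simple wrapper method.
# X is a pandas DataFrame of candidate features, y the labels.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def greedy_forward_selection(X, y, model=None, cv=5):
    model = model or LogisticRegression(max_iter=1000)
    selected, best = [], 0.0
    remaining = list(X.columns)
    while remaining:
        # Score every candidate extension of the current subset.
        scored = [(cross_val_score(model, X[selected + [f]], y,
                                   scoring="roc_auc", cv=cv).mean(), f)
                  for f in remaining]
        score, feature = max(scored)
        if score <= best:          # stop when no feature improves the score
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best
```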

2.1.3 Parameter tuning

Parameter tuning, the setting of algorithm parameters, affects the balance of exploration versus exploitation. Since the balance between exploration and exploitation greatly affects the performance, by overfitting or not fitting closely enough to the data, the chosen parameter values are of importance.

Each algorithm can have a few parameters that can take on a broad range of values. Parameter tuning can therefore become a difficult and time consuming task, but it has proven to be rewarding, as different parameter settings can improve or worsen the performance vastly [4]. However, finding the optimal settings manually by trial and error is not always possible if the number of parameters to tune is large.

Following the No Free Lunch (NFL) theorem it is known that the optimal parameter values differ depending on the problem the algorithm is used for. The task also requires sufficient domain knowledge. Ideally an algorithm would not need the user to manually select any parameters, thereby avoiding the need for domain knowledge. Parameter tuning requires an initial input value or range for each parameter, and the initial values used have an impact on the outcome, but the initial parameter values are, compared to only setting parameters manually, of less significance for the performance.

Automated parameter tuning has been experimented with in the field of Search Based Software Engineering (SBSE) and can be applied to any algorithm that has different optimal parameters depending on the application [4]. Some example studies within this field have used meta-learning or advanced search algorithms to tune the parameters of an algorithm [20]. Research in the field has shown some disappointing results, especially in comparison with the suggested default values of an algorithm, and there is not always a great increase in the performance [5]. However, results do show the vastly varying performances between optimal and worst-case parameter settings.

In addition to parameter tuning there is the more unexplored field of parameter control, which not only sets parameters at the start but also changes the parameters during the runtime to adapt to the current state. This idea is motivated by the fact that, just as the performance of an algorithm is affected by the initial parameter values, the optimally tuned parameter value could be suboptimal later in the execution. There could also be good reasons, inspired by the Simulated Annealing algorithm, to have for example a variable learning rate during the execution. Parameter control appears to be a promising field but has not yet yielded breakthrough results. In a survey from 2015 covering the current trends and challenges concerning parameter control [19] it was noted that there is an increase in publications on the topic, but that no parameter control method yet has been adopted by the community in general.

2.2 Microsoft Azure Machine Learning Studio

Azure [25] is a tool that simplifies development of predictive models. The model is built by connecting components of algorithms, datasets and training stages which are all already implemented in Azure. The tool is interactive and uses a visual workspace that separates the developer from the actual code. The model can be adapted to some extent through altering parameter values, choosing training methods and input values. When the model is completed it can be published as a web service and contacted for use by other applications. After the model has been published it does not change unless it is trained and republished again in Azure. Reinforcement learning can therefore not be used in Azure.

Datasets for training or testing a model can be imported into Azure from different sources, such as from a SQL database. Once the model has been published it can be contacted through HTTP connections.

Azure [25] offers implementations of some well known algorithms suitable for different problem types, for example regression, anomaly detection and classification.

2.2.1 Binary classification algorithms in Azure

The classification algorithms in Azure [25] all use supervised learning. There are 9 binary classifiers listed, with their properties summarized, in figure 2.1.

Figure 2.1: The figure shows the model recommendations in the Azure documentation for the Two-class Classification problem, given a general description of the problem context: linear/non-linear, number of features, restrictions on training time, memory footprint. [25]


Azure [25] includes a module, called Tune Model Hyperparameters, for determining the optimal parameters of an algorithm. Given a parameter range, the model is trained multiple times with varying parameters to find the best parameter values. The tuned parameters can be discovered using two different methods: a parameter sweep, where the model tries different values for unfixed parameters, or with cross validation, which can give more accurate results.
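As a rough outside-of-Azure equivalent, a random parameter sweep with cross-validation can be written with scikit-learn's RandomizedSearchCV; the algorithm and parameter ranges below are illustrative only and are not the ones used in the project.

```python
# Hedged sketch: random parameter sweep with cross-validation, analogous in
# spirit to the Tune Model Hyperparameters module. Ranges are illustrative.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_ranges = {
    "n_estimators": [50, 100, 200, 500],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4, 6],
}
search = RandomizedSearchCV(GradientBoostingClassifier(), param_ranges,
                            n_iter=20, scoring="roc_auc", cv=5, random_state=0)
# search.fit(X, y)    # X, y: the labeled training data (not included here)
# search.best_params_, search.best_score_
```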

Decision trees

Decision trees [27] use the training set to build nodes of conditions. The trees are built from the root. In each node the training set is split into two groups based on the condition with the greatest information gain, or entropy. The construction and method used are simple, but the size of the tree grows with the complexity of the data, which gives the model a variable memory usage, with larger trees requiring a lot of memory. The size of the trees can be limited by, for example, setting the maximum depth parameter.

Variations of the Decision tree algorithm are known to generally perform well on a wide variety of machine learning problems, even where the complexity is not linear. Decision trees are most suited to problems where the data consists of discrete values. A possible issue with decision trees is that they are prone to overfitting. Overfitting can happen, for example, when the tree is large enough to fully explain the training data, so that there is one, not necessarily unique, path for each sample leading to a node with the correct target value. The trees can be controlled by limiting the depth and requiring a minimum number of samples to form a new node, which would prevent the trees from learning from possibly noisy data samples. The trees are also easily affected by changes in the data and could generate vastly different trees if there is noise in the data. Having a forest of trees reduces this effect.
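The overfitting controls mentioned above map to explicit parameters in most implementations; the sketch uses scikit-learn names as stand-ins for the corresponding Azure parameters.

```python
# Hedged sketch: limiting tree depth and requiring a minimum number of
# samples per leaf keeps a decision tree from memorizing noisy samples.
from sklearn.tree import DecisionTreeClassifier

unconstrained = DecisionTreeClassifier()   # may grow until it fits the training data exactly
constrained = DecisionTreeClassifier(
    max_depth=6,            # cap on how deep the tree can grow
    min_samples_leaf=20,    # each leaf must generalize over at least 20 samples
)
```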

The algorithms implemented in Azure [25] all use more than one tree to predict the label of the test data. The Boosted Decision Tree (BDT) algorithm predicts a label for the input data with the first tree. The label can then be corrected by subsequent trees. Decision Forests (also known as Forest for the purpose of this thesis) instead make separate predictions with all used trees and determine the label of the data by combining the results from each tree, for example as a mean value of all predictions.

Decision Jungles [25] are similar to Decision Forests. What distinguishes them is the implemented data structure, which is a directed graph instead of a tree. This difference has a negative effect on the training time, but a positive effect on the memory usage. Since none of the affected factors are considered in this thesis, only Decision Forests were included in the project.

Support Vector Machines

Support Vector Machines (SVMs) [6] are suitable for binary classification and input data with continuous or categorical data. Simple SVMs are generally preferred when speed is more important than accuracy, but are a possible option if the dataset is less complex. Each sample in the training data is represented as a point in space; the SVM maximizes the separation between the points of different classes in a space with additional dimensions created using a kernel function. The kernel function is useful when the decision boundary in the original space is complex, see figure 2.2. The accuracy of an SVM is greatly affected by the kernel function, but choosing the type of kernel function requires domain knowledge and can be challenging.


Figure 2.2: The figure shows how the data points in the XOR problem, which have a complex decision boundary in 2 dimensions, can be separable in 3 dimensions. [26]

There are two SVM implementations in Azure [25]. The Support Vector Machine (SVM) works as described, using a linear kernel function. The Locally Deep Support Vector Machine (LD-SVM) [17] is an SVM which can handle more complex datasets and uses a kernel for nonlinear predictions.

Artificial Neural Networks

Artificial Neural Networks (ANNs) have contributed to the more recent developments in deep learning. Simpler versions of ANNs are suitable for linearly separable data and are preferable due to their speed. However, networks with many layers can show good accuracy on fairly complex data, but then require more computation time. ANNs are often applied to image recognition problems but might not be preferred for problems where simpler solutions are available; they are also less intuitive compared to other algorithms, and the solution model does not give much insight into the problem.

The Average Perceptron (AvgPerc) algorithm in Azure [25] is an example of a simple ANN implementation. It uses linear functions to separate the training data into two classes; the labels are combined using weights corresponding to each function. Azure also has a Neural Network (ANN) implementation which can be adapted by setting the number of hidden layers. The default implementation is a fully connected network.

Statistical methods

There are several ways to use statistical calculations to make a prediction based on the knowledge gained from the training data. There are two different algorithms which use this approach in Azure [25].

Logistic Regression (LogReg) assumes that the data is logistically distributed. The distribution function takes the input data as a vector and a number of parameters, as many as the dimension of the vector. The parameter values are optimized by using the Limited Memory BFGS method. New data is labeled with the class which has the highest probability. This algorithm only handles numeric values; other attributes in the input data would be converted by the algorithm.

Another algorithm in Azure [25] which uses statistical methods is the Bayes Point Machine (BPM) [16]. The implemented version takes relatively few parameters. Bayesian classifiers are generally common in text classification, and simpler Bayesian implementations, such as Naive Bayes, can perform well on a small training data set, and overfitting is rare.
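To make the comparison concrete, the sketch below instantiates rough scikit-learn analogues of the Azure two-class algorithms discussed in this section; the mapping is approximate (for example, GaussianNB is only a loose stand-in for the Bayes Point Machine, and SVC with an RBF kernel for LD-SVM) and is not the Azure implementation.

```python
# Hedged sketch: approximate open-source analogues of the Azure classifiers.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC

candidates = {
    "Boosted Decision Tree": GradientBoostingClassifier(),
    "Decision Forest": RandomForestClassifier(),
    "SVM (linear kernel)": LinearSVC(),
    "Locally Deep SVM (nonlinear)": SVC(kernel="rbf"),
    "Average Perceptron": Perceptron(),
    "Neural Network": MLPClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Bayes Point Machine": GaussianNB(),
}
# Each candidate can then be trained and evaluated on the same folds,
# e.g. with cross_val_score(model, X, y, scoring="roc_auc", cv=5).
```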

2.2.2 Parameters

A selection of the used parameters is explicitly mentioned below. The Azure documentation [25] can be referenced for a full list of all parameter descriptions.

There are several parameters which directly affect the precision of the algorithms. For tree algorithms there are for example parameters deciding the depth or number of trees, where a larger value allows for more precision. For the Minimum number of samples per leaf node parameter, a higher value results in more general rules and could eliminate the effect of deviant samples, while smaller values make the rules more specific.

Similarly in Neural Networks there are parameters for the number of layers or iterations in the learning which could affect how closely the model learns the training data.

Some parameters offering a direct tradeoff between optimization and overfitting are the learning rate and the number of iterations, which are applicable to most algorithms. For decision trees the corresponding parameters would be the depth of the tree and the number of trees constructed. The depth of the tree is limited in several ways. Besides the tree depth parameter, there are also parameters limiting the number of leaves and the number of samples per leaf, which indirectly limit the depth as well.

2.3 Previous research

A literature study [29] covering several studies on the topic of algorithm selection and meta-learning combines the progress made in the field. The study noted that the algorithm selection problem has taken several different directions within separate fields, mainly naming the fields machine learning, artificial intelligence and meta-heuristics. Each field uses separate vocabulary to describe the same concepts and has little knowledge of progress made in the research of other fields, since the referenced literature also stays within the field. However, the content in all fields is similar and the research could benefit greatly from findings in related research.

The NFL theorem is generally accepted and all mentioned studies have confirmed that a general optimal algorithm selection cannot be found. What has been shown, however, is that algorithms can be grouped to match certain conditions in the dataset. The research has therefore been focused on evaluating the performance of several different algorithms applied to different datasets.


2.3.1 The StatLog legacy

The StatLog project [24] compared learning approaches to classification to examine their strengths and weaknesses. 23 different algorithms were used, categorized into three groups: Machine learning, Neural Networks, and Statistical learners. The project also used results from experiments to relate performance to characteristics of the dataset. The top 5 algorithms, and the group they belong to, were listed for each dataset. The datasets were then plotted onto a 2D surface to visualize similarities. A decision tree algorithm, C4.5, was trained on the data and the model was used to present a binary outcome for each algorithm, either recommended or not, based on dataset characteristics. From the study, general rules connecting dataset characteristics to types of learning algorithms were presented. The idea to use machine learning algorithms to classify optimal algorithms for a machine learning problem was not novel, but the coverage and large-scale experiments were extensive and the research is a frequently referenced source in the field.

StatLog also introduced the concept of landmarking, which means that a simple algorithm, a landmarker, can be referenced to predict the performance of an algorithm. This does however require that the landmarkers have vastly distinctive learning biases. It is also not expected to be an accurate algorithm selection tool, but can be advantageous when other methods are computationally expensive [29].

There are additional research projects whose content builds directly on the observations in StatLog. The scope of the algorithms chosen to be examined, or the number of different datasets the algorithms are evaluated with, varies. The used evaluation measures and the measured factors also differ. One project [2] used the results from the algorithm evaluation to train a rule-based learning algorithm, C5.9, which resulted in rules of high confidence that, based on dataset properties, predicted the optimal algorithm selection for the specific dataset. The experiments were performed on algorithms with default parameter settings.

Recent studies have experimented with variations in the method of using results from algorithm evaluations to find a classifier of which algorithm suits the dataset best. In a project using the NOEMON approach [18], using the statistical significance between the performances of algorithms when comparing them is mentioned as a new direction relative to StatLog. NOEMON is a system which compares a new dataset to previously known datasets based on morphological similarities; it is then possible to select an algorithm for previously unseen datasets. NOEMON would be able to choose one algorithm over another, whereas the previous ruleset presented binary outcomes for each algorithm.

2.3.2 Research with other performance measures

In another project [22], algorithms were compared with respect to accuracy, complexity and training time. The conclusion stated that the difference in accuracy is statistically insignificant and that there is more to gain from selecting algorithms based on other metrics. The difference in training time was more noticeable. The study covered 22 tree, 9 statistical and 2 neural algorithms, most of which also appeared in the StatLog project.

2.3.3 Effects of feature selection and parameter tuning

Several previous studies have successfully applied feature selection to classification problems. One example [21], using ANNs as the classification algorithm, proposed and examined two different feature selection methods. The first method iteratively added features to a subset of the initial feature set if they contributed new information. This, being the simpler feature selection method, is easy to calculate and does not require substantial computation power, but has the weakness that the final subset is greatly influenced by the order in which the features are selected. The other method found important control variables, that influence the performance of the model, through an iterative process using an orthogonal array which assigns values to each feature. The conclusions from the project stated that each method had its benefits and that a combination of both could be considered as well. The results were likely to be applicable in contexts with classification algorithms other than ANNs.

The importance of parameter tuning can be noted in a study [5] which demonstrated the difference in performance between default, optimal and worst performing parameter combinations. From the results of the study, which evaluated the performance of algorithms on several different search problems, the conclusions stated that there was a great variation in performance depending on the used parameters, but that the default parameters could have an adequate performance.

In a study on parameter tuning for a classification system with SVMs [20] the experiments achieved successful results. The parameter tuning aimed to minimize the upper bound of errors made by the classifier. The optimal kernel parameters were discovered through a hybrid genetic method which optimizes the parameters for the performance measure for several subsets of the training data. The use of a hybrid method was motivated by the findings that starting from random initial values often converged to local optima. The average errors from the results were close to 0 for two different classification problems.

Since both feature selection and parameter tuning have been shown to improve the performance of algorithms in classification problems, they continue to be common research problems.

2.3.4 Examples of website visitors data mining

As Internet use has increased rapidly, and the amount of information available has grown with it, the interest has increased in using techniques for improving Internet services. Some examples of this can be seen in Facebook [13] and Amazon [3]. One common approach to this task is to personalize the content presented to each individual.

To achieve this, the individuals must first be identified, while still preserving their integrity, either by user registration or by recognition of a session ID or IP address. The system needs available visitor data, either information the user has contributed by selecting preferences or data collected by logging the activity of the user on the website, to analyze and predict behavior connected to the purpose of the website. The collected data could be used to know what content the visitor is searching for and present that content in an easily accessible location on the page.

A study from 2003 [12] covered this subject, and although Internet use has changed over time the fundamentals remain the same. The concepts mentioned above are referred to as user profiling (who the visitor is) and web usage mining (what the activity on the web page is like). The user profile fields can be either static or dynamic, the profile can cover individuals or user groups, and the data can be collected either explicitly or implicitly. User profiling can be challenging, especially when the visitor is not a registered user of the page and even the IP address is not sufficient to determine whether several visitor logs belong to the same visitor. In web usage mining it is generally assumed that the logs of each pageview and the visiting pattern of a visitor contain strong indications of the user's preferences. In a later stage the data must be prepared and preprocessed. Activities which do not contribute to the learning process can be filtered out, and extraction of relevant features from the raw data is necessary before using any machine learning model.

It was noted in a research paper [11] that there could be an alternative to only considering the pageviews of a visitor and comparing them to the visiting patterns of others to make predictions. The alternative approach would be to also include the structure of the website as a factor. The experiments in the paper therefore combined both visiting patterns and analysis of the links between pages. The link analysis used the PageRank algorithm, which is well known from search engine contexts. The results showed that this combined approach was superior to only using the visitor data.

Another relevant approach to this topic is covered in a study [1] which combines the existing techniques for promotional purposes. Through user profiling the content to promote to specific users was discovered, and analysing the web links aided in deciding the location on the website on which to promote the content. The same study could also be used to restructure the website topology for better user experiences. There could therefore be several use cases for analysing visitor data.


2.4 Summary

To summarize the insights gained from the surveyed literature in this chapter: the algorithm selection problem is not a black box, and no single algorithm can suit all problems optimally. The selection is therefore a learning problem. The dataset characteristics are crucial for determining what algorithm should be selected. Furthermore, the performance could be improved through feature selection. The optimal algorithm selection for a problem is also loosely connected to the type of classifier it belongs to. The internal differences between the algorithms within a classification type also make them specialized for certain tasks. The metric used to evaluate the performance of the classifier should reflect the desired results from the model and emphasise the values of importance.

Specifically for website visitors, data on the visitors' pageviews could be logged and used for user profiling and finding visiting patterns. By training a model on this processed data, the model would be able to predict the user's preferences and determine what content the visitor should be suggested.


3 Method

This chapter is divided into three sections. Section 3.1 gives the premise of the project execution and describes the existing content. Section 3.2 includes the details of the project execution. The used tools are described and the stages of the project are split into parts for processing the dataset, tuning the parameters and evaluating the results. Section 3.3 discusses the used methods and motivates the choices made.

3.1 Material

The existing material for this project is both the existing version of the program and the data collected from visitors to the client website.

3.1.1 System

The system the model is used in, the program Red Pine, is used as a component on a client website. The program is in charge of some sections on the client's webpage, for example on the homepage. In these sections personalized suggestions are displayed depending on the visitor data. The goal is to optimize the click-through rate of the content on the client website which the program has predicted the current visitor should be shown.

Figure 3.1: Example of Red Pine in use on a website. On loading the website, Red Pine is sent a request with all available data. The response, with as many offers as there are placement locations, is returned from Red Pine and displayed on the client website.

To achieve this, Red Pine has access to logged information from the session; the session can also be connected to a registered visitor, in which case the knowledge about the visitor is used by the prediction model. In the system it is also possible to limit the scope of an offer to target it to specific user groups. These offers are only used in the prediction model when the request is sent by a visitor matching the selection. The model ranks all possible offers and chooses to display the best possible offer in the first placement location on the webpage. In case there is more than one placement location on the page, the other offers are chosen following the ranking.


Figure 3.2: The components of Red Pine scoring. Each offer is scored separately with the input. The offers are ranked on their score and the selected offer is returned to the client website. The list of offers is determined by the client. Red Pine can be optimized by improving the used input parameters and the algorithm used for the AI model.

The model uses a two-class classifier which predicts the response given an offer combined with the input data. But the predicted label is only used comparatively, against the labels of other offers with the same input data. What the model in reality should be aiming to predict is which offers have the highest potential of getting a positive response when suggested to a visitor. The input data to the model would be the knowledge available about the visitor and the session. Additionally, all offers which match the visitor are used in separate inputs to get one output prediction for each.
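A minimal sketch of that comparative use of the classifier: the same visitor and session features are combined with each eligible offer, each combination is scored, and the offers are returned in score order. The function names and feature layout are hypothetical, not the Red Pine code.

```python
# Hedged sketch: score every eligible offer with the same context features
# and fill the placement locations in descending score order.
def rank_offers(score_positive_response, context_features, offers, n_placements):
    """score_positive_response(features) -> estimated probability of a click."""
    scored = sorted(
        offers,
        key=lambda offer: score_positive_response(
            {**context_features, "offer_id": offer["id"], "offer_category": offer["category"]}
        ),
        reverse=True,
    )
    return scored[:n_placements]
```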

The model is trained regularly with collected visitor data from the last two months. The time period of two months is assumed to be appropriate, considering that the number of samples with positive responses would be too small with a shorter time period and the dataset would contain too much outdated information with a longer time period. The collected data, for training and testing the model, contains information from previously logged instances of what responses visitors had to the suggestions they were shown. After the page has been loaded, the visitor's response to a suggestion is recorded as positive if that offer is clicked on. It is possible for more than one suggestion to get a positive response, for example if the visitor opens a new tab for each offer.


3.1.2 Dataset content

The dataset used for this project was collected in a time period of 36 days. During this time a total of 92074 data samples were collected from the client website, out of which 541 had a positive response label.

The dataset contains fields describing the request, session, and visitor. The request data describes the request being made, for example the time and origin of the request. The session data is dynamic and describes what is known about the current session the visitor is using. The raw session data contains the useragent and all pageviews of the session. If the visitor is registered in the system, the data also contains fields of what is known about the visitor in general, the static data, for example the age or home address.

The data does not contain any information on the content of the offers. The offers are only identified by their ID and category. When a completely new offer is created, the model can make predictions based on the category of the offer. But the predictions become more accurate, regarding what visitor groups would give positive responses to the offer, when more data samples are collected for the new offer. The training data for older offers is not always comparable to the new offers.

The dataset could contain exact duplicates if the time attribute is omitted. The request and session data are likely to change between requests, while it is possible for the visitor data to remain the same and be identical for different visitors.

The dataset consisted of categorical and numerical fields. Bit values, strings and all numeric values are considered to be categorical when there is no numeric distance relation between the values. An offer with ID 5 is for example not more similar to an offer with ID 6 than it is to an offer with ID 10, even though the integer 5 is closer to the value 6 than to 10. For a field describing for example the age of a visitor, the field becomes a numeric value, since the age 40 is in fact closer to the age 41 than it is to the age 50.
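The distinction can be made explicit in the pre-processing: the sketch below one-hot encodes the offer ID as a category while leaving age as a numeric column; column names are hypothetical.

```python
# Hedged sketch: categorical offer IDs are one-hot encoded (ID 5 is no
# "closer" to 6 than to 10), while age stays numeric where distance matters.
import pandas as pd

df = pd.DataFrame({"offer_id": [5, 6, 10], "age": [40, 41, 50]})
encoded = pd.get_dummies(df, columns=["offer_id"])
print(encoded)   # columns: age, offer_id_5, offer_id_6, offer_id_10
```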


Noise

Dealing with data influenced by human behavior makes noise inevitable. It can be assumed that the training data does contain a lot of noise, which makes the training process more difficult. The ideal dataset for this problem would contain data recordings strictly from visitors who actually pursue finding something specific on the website. The reality however is that many visitors do not have a clear goal and might even visit the client website by chance, immediately leaving after loading the homepage.

Another source of noise in the data is that it is possible to get a negative response even when the prediction was correct. Imagine the scenario where a regular visitor who is used to the website makes a visit with the purpose of finding content A. In the case where the model correctly predicts content A and displays it on a visible placement location, the visitor could still choose to navigate to the content by clicking on a tab instead, which would be recorded as a negative response to the suggestion. The negative responses are therefore uncertain to some degree, while the positive responses are less likely to be caused by noise.

These scenarios, where the visitor would not have clicked on a suggested offer even if it was the content they were pursuing, or where a visitor would have clicked on any suggested offer, create noise which makes it difficult to teach the model that the suggested offer should also have an impact on the outcome. The responses might seem sporadic and unrelated to the actual offer suggested. Such behaviour also works against the performance of the model.

Lastly, the dataset contains features which could add noise. Since the features have not been selected through a careful analysis, there is a possibility of features which only add to the complexity. These features would confuse the learning process. Removing redundant features is one of the goals of this thesis and is examined later in the report.


Other difficulties

The main difficulty with the used dataset is that the class labels are highly unbalanced; the positive responses are only a fraction of the total number of samples. For each value of a feature there are therefore few positive samples to indicate the effect of the value on the label.

Another difficulty is that two identical data samples might belong to different label classes. Identical data samples are possible when the date and time of the request are ignored and can occur due to the unpredictability of human behavior. Identical visitor and session data can belong to two different visitors, an online visitor could also be operated by two different real people with different behavior, and the same visitor could give either a positive or negative response for other reasons even if all prerequisites are the same. Overfitting the data would therefore damage the performance, as in all classification problems. If the model gets a previously seen data sample it cannot be assumed to know the right label. This is, however, not a major issue since all classification algorithms are able to handle this complexity.

3.2 Execution

The experiments were executed using tools from Azure Machine Learning Studio (Azure ML), the development tool used to create the model in Red Pine. An algorithm can be selected from a library, trained, and then published. The training process can either use a component for tuning the parameters of the algorithm through a random parameter sweep over a parameter range, or use algorithms with preselected parameters.

The experiments were performed by creating the necessary features and importing the full dataset to Azure ML, where missing values were replaced and the categorical and numerical features were separated.

All features were ranked by correlation scores and filtered from the dataset with a greedy search.


The model was trained and tested with labeled data. The data had been collected by logging visitor responses to suggestions on the client website. The data consists of columns for each static visitor attribute as well as the dynamic data of the session. The last part contains request fields such as the offer ID and the response label. The offer ID shows which suggestion the user has been given, and the data label is a bit value indicating whether the visitor, with the specific data values, gave a positive response to the suggestion or not. The response is positive if the visitor clicks on the content link suggested by the model. The request data is used to describe the request itself. The data creating personalization for each visitor is the session data for anonymous visitors, and the session data together with the visitor data for registered visitors.

3.2.1 Dataset

Feature creation

The previous implementation of the model has only been using the visitor data for training. The collected session information includes which pages have been loaded and at what time, as well as the useragent. From this information new features can be created, with multiple possible combinations. The types of features included in this project were visits to a specific page within a time limit, the latest page visiting patterns, and the browser and device of the session. A sketch of how these features could be derived from the raw pageviews follows Table 3.1.

All created session features were categorical.

Type of feature          Features  Value
Browser                  1         Browser name
Device                   1         Device type
Page visit within 1h     4         0/1
Page visit within 1 day  1         0/1
Last visited page        4         3, 2 or 1 full or shortened URLs

Table 3.1: The session data features used in the experiments.
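A hedged sketch of how the session features in Table 3.1 could be derived from a raw pageview log is given below. The column names, the example page URL, and the exact URL-stem rule are assumptions, not the implementation used in Red Pine.

```python
import pandas as pd

def session_features(pageviews: pd.DataFrame, now: pd.Timestamp) -> dict:
    """pageviews holds one session's rows with columns 'timestamp' and 'url',
    ordered oldest to newest."""
    urls = pageviews["url"].tolist()
    last_page = urls[-1] if urls else "Unknown"
    recent_1h = pageviews[pageviews["timestamp"] >= now - pd.Timedelta(hours=1)]
    return {
        # 0/1 flag: was a specific page visited within the last hour?
        "visited_page_A_1h": int(recent_1h["url"].eq("/offers/page-a").any()),
        # last visited page as the full URL and as a shortened URL stem
        "last_page": last_page,
        "last_page_stem": last_page.split("/")[1] if last_page.count("/") >= 2 else last_page,
        # visiting patterns: the last two and three pages joined into one category
        "last_2_pages": ">".join(urls[-2:]),
        "last_3_pages": ">".join(urls[-3:]),
    }
```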


The visiting pattern was stored as a sequence of URL locations. Since the client website has a large quantity of different pages, the combinations of visiting patterns contained a majority of unique values. The feature for the last three pages was noted to be of high cardinality; the other three features, using either the last two pages, one page, or one page with only a part of the URL stem rather than the full location path, were created to merge several categories and reduce the number of unique categories, giving more samples for each category.

A feature in the request data which could benefit from simplification was the field describing the date and time at which the request was sent to the model. A complete date field on its own would be unique for every millisecond, and when used to train the model it would be treated as a string value given its own category. Simply inputting the date would therefore not contribute greatly to the performance of the model, and valuable information would be overlooked. The date feature was divided into 6 new features: days since the offer start date, days to the offer end date, whether it is a weekend, the weekday, the time period in the day, and the absolute time as an integer.
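A sketch of this derivation, assuming Python datetime objects for the request time and the offer start and end dates; the boundaries chosen for the time-period feature and the granularity of the absolute time are illustrative assumptions.

```python
from datetime import datetime

def date_features(request_time: datetime, offer_start: datetime, offer_end: datetime) -> dict:
    hour = request_time.hour
    period = ("night" if hour < 6 else
              "morning" if hour < 12 else
              "day" if hour < 18 else "evening")  # assumed period boundaries
    return {
        "days_since_offer_start": (request_time - offer_start).days,
        "days_to_offer_end": (offer_end - request_time).days,
        "weekday": request_time.isoweekday(),          # 1 = Monday ... 7 = Sunday
        "is_weekend": int(request_time.isoweekday() >= 6),
        "time_period": period,
        "absolute_time": int(request_time.timestamp()),  # exact granularity is an assumption
    }
```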

In total, 11 features from the session data and 10 features from the request data, including date and time, were included in the training set.

Feature  Type         Unique Values  Description
r1       Numerical    46374          Request ID
r2       Categorical  3              Webpage placement location
r3       Categorical  17             Offer ID
r4       Categorical  12             Offer category
r5       Numerical    73             Days since offer start
r6       Numerical    73             Days to offer end
r7       Numerical    7              Weekday
r8       Categorical  2              Weekend
r9       Categorical  4              Time period in day (morning, day, evening, night)
r10      Numerical    37             Time (absolute time)

Table 3.2: All request features in the project.


Feature  Type         Unique Values  Description
s1       Categorical  935            Last visited page
s2       Categorical  444            Last visited page URL stem
s3       Categorical  3685           Visiting pattern, last 2 pages
s4       Categorical  7753           Visiting pattern, last 3 pages
s5       Categorical  8              Browser
s6       Categorical  5              Device
s7       Categorical  2              Visit of page A last 24h
s8       Categorical  2              Visit of page B last 24h
s9       Categorical  2              Visit of page C last 24h
s10      Categorical  2              Visit of page D last 24h
s11      Categorical  2              Visit of page E last 24h

Table 3.3: All session features in the project.


Feature  Type         Unique Values  Description
v1       Numerical    2              Subscription settings
v2       Numerical    3              Subscription settings
v3       Numerical    3              Subscription settings
v4       Numerical    6              Subscription settings
v5       Numerical    2              Subscription settings
v6       Numerical    4              Subscription settings
v7       Numerical    2              Subscription settings
v8       Numerical    3              Subscription settings
v9       Numerical    7              Subscription settings
v10      Categorical  2              Subscription settings
v11      Categorical  2              Subscription settings
v12      Categorical  2              Subscription settings
v13      Numerical    100            Churn rate
v14      Numerical    347            Monthly revenue
v15      Numerical    9              Subscription settings
v16      Numerical    2              Subscription settings
v17      Numerical    41             Bound months left
v18      Numerical    81             Customer age
v19      Numerical    214            Months as customer
v20      Numerical    67             Months since last sale
v21      Categorical  7              Subscription type
v22      Categorical  9              Subscription channel
v23      Numerical    13             Number of historic purchases
v24      Numerical    79             Time since latest customer service
v25      Categorical  35             Latest customer service type
v26      Categorical  161            Latest customer service sub type
v27      Numerical    46             Total customer service count
v28      Numerical    13             Subscription settings

Table 3.4: All visitor features in the project.


Dataset cleaning

Among the visitor data which should be registered in the website system there are values which could be unknown. The default in most cases is to assume that the feature does not exist for the visitor: numerical or bit values are given the value 0 and categorical values are set to Unknown. For the numerical features which could not easily be replaced by 0, for example age, the unknown values were given either the mean or the mode value of the whole feature column, as a best-effort guess. The choice depended on the general distribution of that feature. Four features (v13, v17, v18, v19) had the unknown values replaced with the mean value, and three features (v14, v20, v24) had the unknown values replaced with the mode value. A sketch of this replacement strategy follows Figure 3.3.

Figure 3.3: The diagrams show examples of value distribution in features where missing values are replaced by the mean value, in v19 to the left, or the mode value, in v20 to the right.
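The replacement strategy can be sketched as follows, assuming the dataset is available as a pandas DataFrame where unknown values are stored as NaN; the feature groupings follow the text above, but the code is an illustration rather than the Azure workflow actually used.

```python
import pandas as pd

mean_features = ["v13", "v17", "v18", "v19"]   # replaced with the column mean
mode_features = ["v14", "v20", "v24"]          # replaced with the most common value

def replace_missing(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in mean_features:
        df[col] = df[col].fillna(df[col].mean())
    for col in mode_features:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    # remaining numeric and bit features default to 0, categorical to "Unknown"
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(0)
    return df.fillna("Unknown")
```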

The training set did not use all logged information; the placement of the suggestion is for example recorded but not relevant in training, since the model places offers based on ranking rather than on the placement where they were predicted to have the highest potential. Features with only one unique value were not included in the dataset, since such a field would not contribute to the prediction. However, if there are reasons to believe that different values could appear, although infrequently, there is a point in keeping the field in the training of the model.

Descriptive statistics

With the additional features created from the session data and the removed or altered features, the final dataset used during the execution of the project is described in Table 3.5.

Dataset properties
Response (label)              0/1
Rows                          92074
...with positive responses    541
Columns                       49
...Categorical                23
...Numerical                  26
...Unbalanced                 7
Unique offers                 17
Unique requests               46374
...with positive responses    531
Unique visitor types          23532
...with positive responses    463

Request Data
Columns                       10
...Categorical                5
...Numerical                  5
...Unbalanced                 1

Session Data
Columns                       11
...Categorical                11
...Unbalanced                 1

Visitor Data
Columns                       29
...Categorical                8
...Numerical                  21
...Unbalanced                 5

Table 3.5: Descriptive statistics of the dataset content


Unique visitor types denotes the number of unique combinations of the 29 visitor features. Out of all visitor types, 11 occur more than 100 times in the dataset.

Unbalanced features denotes the features where one value of the feature occurs in more than 99% of the dataset samples. Features r2, s11, v1, v5, v7, v11, v16 were unbalanced in the dataset.
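A small sketch of how such unbalanced features could be identified, assuming the dataset is loaded as a pandas DataFrame:

```python
import pandas as pd

def unbalanced_features(df: pd.DataFrame, threshold: float = 0.99) -> list:
    """Return columns where a single value covers more than `threshold`
    of all samples."""
    return [col for col in df.columns
            if df[col].value_counts(normalize=True, dropna=False).iloc[0] > threshold]
```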

Feature selection

To attempt an automatic filtering method, the features were ranked by their correlation to the label with the Filter Based Feature Selection component in Azure; Fisher Score was used for numerical features and Chi Squared for categorical. Features with almost perfect correlation to each other were also recognized as clusters. The final feature subset was determined through a greedy search which iteratively excludes features and runs the full algorithm with each dataset version to evaluate the performance, similar to wrapper methods for feature selection.
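The Azure component itself is not reproduced here; an analogous ranking can be sketched with scikit-learn, using chi-squared scores for the encoded categorical features and ANOVA F-scores in place of the Fisher Score for the numerical ones. The column lists and the encoding step are assumptions.

```python
import pandas as pd
from sklearn.feature_selection import chi2, f_classif

def rank_features(df: pd.DataFrame, categorical, numerical, label="response") -> pd.Series:
    y = df[label]
    # chi2 requires non-negative values, so each category is mapped to an integer code
    X_cat = df[categorical].apply(lambda col: col.astype("category").cat.codes)
    cat_scores = pd.Series(chi2(X_cat, y)[0], index=categorical)
    num_scores = pd.Series(f_classif(df[numerical], y)[0], index=numerical)
    return pd.concat([cat_scores, num_scores]).sort_values(ascending=False)
```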

All features were initially included in the dataset. Then the features with low correlation were excluded from the dataset one at a time.

One highly correlating feature cluster was identified, where 4 of the numeric time features had a linear correlation above 0.98; at most one of these features was used in the greedy search.
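The greedy search can be summarized as the following sketch, where train_and_score stands in for training the Azure model on a feature subset and reading off its evaluation metric; it is a placeholder, not part of any real API.

```python
def greedy_feature_selection(features_by_rank, dataset, train_and_score):
    """features_by_rank: feature names sorted from lowest to highest
    correlation with the label."""
    selected = list(features_by_rank)
    best_score = train_and_score(dataset, selected)
    for feature in features_by_rank:
        candidate = [f for f in selected if f != feature]
        score = train_and_score(dataset, candidate)
        if score >= best_score:          # keep the removal if performance holds
            selected, best_score = candidate, score
    return selected, best_score
```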

3.2.2 Parameter tuning

The algorithm parameters were tuned with the Tune Model Hyperparameters component in Azure. Not all algorithm parameters were used in tuning; some were kept constant. Each algorithm was initially tuned with a broad value range, followed by a few attempts at narrowing the range based on the initial results. Each run with the component used 5 random sweeps over the parameter range and cross validated with 5 folds. The parameter values achieving the best performance were recorded.
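As an analogy to the Azure component, a random parameter sweep with 5-fold cross validation can be sketched with scikit-learn's RandomizedSearchCV; the algorithm and the parameter ranges below are illustrative assumptions only.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [50, 100, 200, 400],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4, 6],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions,
    n_iter=5,          # 5 random sweeps over the parameter range
    cv=5,              # 5-fold cross validation
    scoring="roc_auc",
)
# search.fit(X_train, y_train) would record the best parameter combination
# in search.best_params_ and its score in search.best_score_.
```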


Figure 3.4: The tables show example results from using Tune Model Hyper Parameters on BDT, to the left, and Forest, to the right.

3.2.3 Evaluation and algorithm selection

Each of the optimization stages was observed in isolation as well as in combination with the others. Three versions of the dataset were used in testing: the original dataset with only visitor data, the full dataset after feature selection, and the dataset with only session data after feature selection.

When selecting a data subset, the evaluation was done by splitting the dataset into a training set, the first 70% of the data samples, and a testing set, the last 30% of the data samples. This method was more efficient for testing many combinations. For tuning the algorithms and evaluating the performance of the final combinations, cross validation was used to get a more general performance score. Cross validation is done by partitioning the data into several sets where one set is used for testing in each fold of the evaluation. Each fold thus has a different split of training and testing data, and the mean and standard deviation of the results from all folds can be used to evaluate the performance of the algorithm.
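The non-random 70/30 split used for the quicker comparisons can be sketched as follows, assuming the dataset is a pandas DataFrame ordered by request time:

```python
def chronological_split(df, train_fraction=0.7):
    """Return (train, test): the first 70% and the last 30% of the rows."""
    cut = int(len(df) * train_fraction)
    return df.iloc[:cut], df.iloc[cut:]

# train_df, test_df = chronological_split(df)
```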

In tuning, cross validation with 5 folds was used, and the final evaluation used 10 folds. The partitioning of the dataset was, in both of these evaluation methods, not random, and the experiments therefore only needed one execution. Algorithm performance was evaluated
