
Author: Agustí Conesa Gago
Supervisor: Diego Pérez Palacín
Semester: HT 2020

Subject: Computer Science

Bachelor Degree Project

Methods to combine predictions from ensemble learning in multivariate forecasting


Abstract

Making predictions is nowadays of high importance for any company, whether small or large, because analyzing the available data makes it possible to find new market opportunities and to reduce risks and costs, among other benefits. Machine learning algorithms for time series can be used for predicting future values of interest. However, choosing the appropriate algorithm and tuning its meta-parameters require a great level of expertise. This creates an adoption barrier for small and medium enterprises that cannot afford to hire a machine learning expert for their IT team. For these reasons, this project studies different possibilities to make good predictions based on machine learning algorithms without requiring great theoretical knowledge from the users. Moreover, a software package that implements the prediction process has been developed. The software is an ensemble method that first predicts a value taking into account different algorithms at the same time, and then combines their results, considering also the previous performance of each algorithm, to obtain a final prediction of the value. The solution proposed and implemented in this project can also predict according to a concrete objective (e.g., optimize the prediction, or do not exceed the real value), because not every prediction problem is subject to the same constraints. We have experimented with and validated the implementation on three different cases. In all of them, a better performance has been obtained in comparison with each of the algorithms involved, reaching improvements of 45 to 95%.

Keywords: Machine learning, Online supervised learning, Ensemble method, Regression.


Preface

Before diving into the report, I would like to express my deep gratitude to the people who have helped me in some way throughout my studies; without them I could not have gotten to where I am today. First of all, I want to thank my supervisor Diego Pérez, who has provided valuable feedback and guidance throughout this whole thesis project. I would also like to highlight the help of the different professors that I have had during my Bachelor studies at the Polytechnic University of Catalonia. Last but not least, I want to give a special mention to all my family and friends, who have always supported me in good times and bad. Thank you all for your never-failing support.


Contents

1 Introduction
  1.1 Background
  1.2 Related work
  1.3 Problem formulation
  1.4 Motivation
  1.5 Results
  1.6 Scope/Limitation
  1.7 Target group
  1.8 Outline

2 Method
  2.1 Research Project
  2.2 Research methods
  2.3 Performance metrics
    2.3.1 Mean absolute error (MAE)
    2.3.2 Counter of equal values (EC)
    2.3.3 Counter of approximated values (OC)
    2.3.4 Absolute error
  2.4 Reliability and Validity
  2.5 Ethical Considerations

3 Theoretical Background
  3.1 Incremental learning
  3.2 Algorithms
    3.2.1 Decisional Tree
    3.2.2 Adaptive Decisional Tree
    3.2.3 KNN algorithm
    3.2.4 Persistence model
    3.2.5 ARIMA
  3.3 Feature selection techniques
  3.4 Ensemble method
  3.5 Uncertainty in models

4 Research project – Implementation
  4.1 Data preprocessing
    4.1.1 Dataset description
    4.1.2 Data preparation
    4.1.3 Correlation of variables
  4.2 Implementation of the algorithms to train different predictors
    4.2.1 Ensemble method
  4.3 Combinations of predictors output for improving the final result
    4.3.1 Method 1: Base ensemble case
    4.3.2 Method 2: Prediction using the best algorithm of the previous step
    4.3.3 Method 3: Ensemble learning increasing the weight to the best and decreasing to the rest by constant
    4.3.4 Method 4: Ensemble learning increasing the weight to the best and decreasing to the worst by constant
    4.3.5 Method 5: Ensemble learning increasing the weight to the best by constant and decreasing to the worst and perhaps to others too
    4.3.6 Method 6: Ensemble learning with feature selection techniques
    4.3.7 Method 7: Ensemble learning modifying weights of more than one algorithm
    4.3.8 Method 8: Ensemble learning modifying weights of more than one algorithm equally
    4.3.9 Method 9: Ensemble learning with multiple KNN
    4.3.10 Method 10: Ensemble learning modifying weights by proportion
    4.3.11 Method 11: Ensemble learning modifying weights with proportion and constant
  4.4 Combination of predictors for improving the result depending on the prediction constraints
    4.4.1 Method 12: Nail the maximum number of possible cases
    4.4.2 Method 13: Optimize
    4.4.3 Method 14: Seek overprediction
    4.4.4 Method 15: Seek underprediction

5 Results
  5.1 First dataset
    5.1.1 Data description
    5.1.2 Predicting the future percentage of CPU
    5.1.3 Predicting the future proportion of Memory used
  5.2 Second dataset
    5.2.1 Data description
    5.2.2 Predicting the number of people making BBQ

6 Analysis
  6.1 First dataset
    6.1.1 Analysis of the dataset
    6.1.2 Predicting the percentage of CPU used
    6.1.3 Predicting the proportion of Memory used
  6.2 Second dataset
    6.2.1 Analysis of the dataset
    6.2.2 Predicting the number of people making BBQ

7 Discussion
  7.1 Reliability and validity

8 Conclusions and Future Work
  8.1 Future work

References


1 Introduction

Predicting the future using historical data is increasingly important for organizations when making decisions. The organizations already have the historical data needed to achieve good predictions, the Machine Learning area [1] has provided methods for time series forecasting, and computational processing and storage are cheaper than they were years ago.

However, to date, significant expertise is required to appropriately choose, configure, and execute these machine learning-based methods. For example, selecting the method best suited for the current forecasting problem and the available data, or determining the best values for its meta-parameters, is not straightforward for a non-expert. These difficulties create an adoption barrier for forecasting methods in many organizations that cannot afford to hire an expert for this task (e.g., small and medium enterprises).

This project aims to reduce the adoption barrier by making the appropriate execution of forecasting methods based on machine learning more accessible for the final users.

After the study and experimentation activities of this project, solutions have been proposed, and software that implements these solutions has been developed. The solutions use machine learning algorithms and allow users to forecast the desired values simply, without requiring deep prior knowledge of machine learning. The implemented software includes an ensemble method [2] which predicts future values taking into account different algorithms for time series forecasting. It also implements novel methods for combining the results from the ensemble in an appropriate manner to produce the final forecasting result.

Moreover, the proposed solutions allow requesting a specific type of objective when predicting. The reason is that there are different contexts in which forecasting is used. The most usual and simple objective is to predict a value that is as close as possible to the real one, but that objective is not the only possibility. In some cases, the objective is to make the predicted value as close as possible to the actual value, but without going under it.

For example, when a power plant predicts the power demand for the next 30 minutes to produce such power, overprediction is less dangerous than underprediction: overpredicting will cause a waste of energy, while underpredicting will cause blackouts. There are also other types of objectives, and this project explores four of these types.

To evaluate the performance, accuracy, and error of the proposed solutions, we have worked on a case study related to the prediction of CPU and memory utilization of a web application.

1.1 Background

The background of this project comprises machine learning and incremental learning.

Machine learning is a branch of artificial intelligence whose objective is to learn from data samples, identifying general behaviors in order to analyze current situations or to support possible decisions.

This research also involves incremental learning, a machine learning approach in which the model never stops training as new data arrives. In this way, it continuously improves its knowledge and follows possible recent trends.

Companies use these two concepts to predict the future. For example, given a specific time, a web-services company may want to predict how many processors, how much memory capacity, and how much network bandwidth the website will need to work correctly. Alternatively, a retail company may want to forecast how many products of the same type will be sold shortly. To carry out this process, a prediction model is created, which is a file that contains an algorithm and is trained with data to recognize different patterns. After training, the model can be used to estimate the required resources in the future. At present, several approaches exist to carry out these predictions, such as Linear Regression, Random Forest, neural networks, and ARIMA, among others [3].

As we can see, there is uncertainty about which methods to use and about their prediction quality depending on the context, especially when these models are used in isolation. An expert in machine learning is most often required to make the prediction system work acceptably.

1.2 Related work

The use of different predictors in an ensemble method is not a novel idea. For example, this happens in Random Forests [4], where each tree is grown depending on the best split of each node, taking into account the features. Another example is Random Subspaces [5], which uses various sets of features among the models. These models have the data available from the beginning to train and then predict. However, our approach is different, since our ensemble method works with incremental learning on time-series data. This means that we obtain new information over time rather than having it all at first. Therefore, our model predicts and trains each time we obtain more data.

Furthermore, we will work with ensemble methods since they are commonly known to be one of the best ways to solve predictive tasks in a wide variety of domains. There are distinct studies [6] [7] that explain the main reasons.

Moreover, our ensemble learning will include distinct predictors, not repeating only one type of predictor, since we want to weigh up the advantages and disadvantages of different methods. This is because diversity is an important element to generate a successful model.

Apart from that, we will give more weight to one predictor or another depending on the results. This focus of our work is different from previous ones such as [8], since that research only combines ARIMA models with different meta-parameters, giving the same weight to all predictors.

We know that there are other approaches to make time-series predictions, such as Counterpropagation NN and Neuro-Fuzzy systems. Nevertheless, we have not followed those approaches, since we have preferred to build a model that involves different predictors so that it can work in more domains. At the same time, the ensemble method is more understandable and accessible for any company that wants to implement it. In addition to that, ensemble learning requires less prediction time and fewer resources than Counterpropagation NN and Neuro-Fuzzy systems.

1.3 Problem formulation

The starting point is the different investigations carried out at Linnéuniversitetet about predicting the resources that a web application needs at a given moment in time.

However, the research in this project can be generalized to any multivariate time series prediction problem using machine learning.

Prediction based on time series is not easy. It requires significant expertise. Nevertheless, it is helpful in a multitude of organizations. Small and medium enterprises see a potential benefit if they can use it. However, often they cannot take advantage of these theories because they cannot afford experts in the area who can manipulate the data and the algorithms to make them work suitably.

Considering the above, this problem needs to be investigated because it is not an isolated case: it occurs in the day-to-day operations of many companies, whether in web applications or in any other planning activity. Moreover, solving it can save a considerable amount of money [9].

Therefore, there is a gap between the existing theories for machine learning-based prediction and their “easy” utilization by users. This project aims to fill, at least partially, that gap. This is not the only area where this problem exists. For example, the Big Data area also suffers from a similar problem: many organizations are interested in analyzing their massive amounts of data, but this requires very specialized knowledge, and many of them cannot find enough expert analysts to hire.

On this basis, the goal of this degree project is to build an ensemble learning model that uses different algorithms concurrently to make good predictions in the vast majority of cases, whether at peak demand or in the rest of the cases. Therefore, different proposals will be explored, implementing different prediction algorithms and developing different methods to combine their results, to find a better way to predict the future values of the variables of interest. For example, the first attempt to combine the results of the prediction methods may be to compute their average value, using the basic model averaging uncertainty management technique [10]. However, this thesis also studies and proposes more elaborate approaches that behave appropriately if the patterns of the inputs change during run time.

In addition to that, this project also keeps in mind four different types of goals when making predictions: those that only want to hit the actual value precisely in some cases (called nail in this document), although they can make big prediction mistakes in the other cases; those that want to optimize, trying to err as little as possible on average in the long run (called optimize); those that prefer to predict upward (referred to as overpredicting); and those that prefer to predict lower values (referred to as underpredicting).

Furthermore, the solution will also aim for fast prediction times. This allows observing more recent data and foreseeing new trends, such as peaks of demand.

Following the guidelines to formulate research questions in [11], the research questions we want to answer in the scope of our project are the following:

• RQ1: Is it possible to create a prediction model based on machine learning that reduces the necessity of an expert in the subject and still makes accurate predictions?

• RQ2: What implementations are suitable for predicting with different goals in mind?

• RQ3: What is a good model for predicting in several domains at the same time?

To carry out this project, methods that use some meta-parameters will be proposed, and software will be developed that automatically adjusts and finds suitable values for each meta-parameter (for example, the number of past observations to take into account and the optimal prediction horizon). The purpose is to reduce the error of the overall combined predictor according to the different metrics used to quantify the error.

However, the gap will not be solved entirely, since it is a tough task to forecast demand, especially when a peak begins. In addition to that, it cannot be known in advance how much the prediction results can be improved.


1.4 Motivation

As introduced in the previous problem formulation, there is an industrial need to make acceptably good predictions without requiring too much expertise in machine learning algorithms.

Using the results of this project, a user would not need to have so much expertise in the appropriateness of each method for his/her problem because all of them will execute, and the system will automatically give a final prediction result considering which methods are executing better at that moment for that type of problem.

The rest of this section motivates the problem in the area of the prediction of computing resources for web applications and web services.

The resource predictions are relevant for web application owners who make these calculations in their cloud to figure out if they need to acquire or remove resources.

Making higher predictions can lead to paying for resources that are not used. However, making lower predictions makes the website work slowly, and users can become unhappy.

For this reason, it is essential to make accurate predictions.

In addition, these calculations are also made by Cloud Providers [12] (Amazon AWS, Microsoft Azure, Google App Engine, etc.). These enterprises are dedicated to providing computing services through the network, such as "Function as a Service" and "Serverless computing" [13]. These services are becoming increasingly popular over time. In these computing models, the web owner does not have to worry about freeing or acquiring resources, since Cloud Providers take care of these predictions by monitoring the use of CPU, RAM, Disk I/O, and Network I/O, among others. The Cloud Providers have to give just the necessary resources: giving an excess would lead to an expenditure of data center [14] resources, while giving fewer would make the client complain about the slowness of the website.

In support of the above, important companies such as Amazon are trying to lead this sector due to the relevance that it currently has and will have in the near future. This is why Amazon has created Amazon Forecast [15] in order that any enterprise can easily make forecasts.

Therefore, this report will focus on solving this problem as much as possible, trying to build an ensemble learning model that predicts correctly in the vast majority of cases, both peak demands and the rest of the cases.

1.5 Results

Different methods are proposed based on a good combination of the results from the ensemble learning. Software is produced for predicting values within an organization, or for an individual who wants to use it on their own. The software is an ensemble learning model involving different predictors whose results are combined to predict as well as possible in any field.

The software is evaluated by testing possible real cases and comparing it with slight modifications of the ensemble learning. To demonstrate the proper operation of the software, we have analyzed its performance with two distinct datasets. At the same time, we have also tested the ensemble learning keeping in mind not so much its performance, but the four different goals that may exist: overprediction, underprediction, nailing more cases, or optimizing.


1.6 Scope/Limitation

Due to the limited time of the project, we have focused on regression ensemble learning.

In addition, we have had some difficulties when it comes to predicting time series data, since there are not many libraries for it. Therefore, the variety of possible predictors to use has been small, but we created a proper ensemble learning. However, one of the predictors involved in the ensemble method, the VAR algorithm, has been created by hand, which makes it slower than it should be. This is because we have not found the VAR algorithm in the incremental learning library. Consequently, the VAR algorithm needs to be created and trained every time new data comes.

1.7 Target group

The final target of this research is the small and medium enterprises that want to make predictions based on machine learning methods easily. However, that is too ambitious a goal for a Bachelor Thesis project. We consider that this project is a step towards software that any company can easily use to forecast any value. In this way, the machine learning world will move closer to companies that do not have the capacity to implement it. Nevertheless, this project does not provide the final solution to the research problem.

Therefore, our realistic target group is the researchers who are currently working to make easier predictions with time-series data, more precisely, regression forecasting.

1.8 Outline

The structure of the report is the following. Section 2 discusses the methodological framework and the research methods used to answer the formulated problem. In addition to that, this section also contains a brief description of the reliability and validity of the project, describing in which way the work can be replicated, and also ethical considerations.

Section 3 explains the theoretical background and discusses the knowledge gap in this project. Section 4 focuses on describing in detail all the activities carried out in this project, such as the collection of data, the possible solutions, and the implementations. This is followed by the results obtained after executing all the software, presented in Section 5. Section 6 analyzes the results obtained in the previous section. Section 7 discusses the findings and compares the results of this project with the related work in Section 1. The last section, Section 8, presents the conclusions of the project, in which we assess whether the initial objectives have been accomplished, and future work is discussed.


2 Method

This section defines the research project, the research methods, the different performance metrics, the reliability, and the validity of the project together with its corresponding ethical considerations.

2.1 Research Project

This project follows a design science methodology. It is a research methodology that allows us to design a solution to an existing problem and iterate over it until we obtain a definitive solution. For this reason, we have opted for this procedure, as it allows us to understand the problem better and reevaluate the solution as we get results.

This methodology in our project has facilitated us to know the different steps we need to follow to achieve an accurate result. The different stages of the methodology can be seen clearly in Figure 2.1 below. First of all, we have identified the problem, which is trying to create an excellent predictor to be able to forecast in any field. To make and assess our predictor, we have focused on solving a real problem such as predicting the CPU and memory used by a website at any given time since this task is of high importance nowadays in any organization and every day even more. These different tasks have been performed in Stage 1.

Figure 2.1: Stages of the design science methodology.

After setting the requirements, we have chosen, developed, and optimized each of the algorithms individually and applied the dataset to them to evaluate their performance. We have carried out these tasks in Stages 2 and 3. Subsequently, we have proposed to do the base case of the ensemble method giving equal weight to all the algorithms, Stage 4, and we have implemented it in Stages 5 and 6. In this way, we have been able to assess a possible predictor. Nevertheless, after measuring the performance of this base case, we thought that the forecasting could be improved by making some slight modifications. For this reason, we have done another iteration in which we have designed another solution and have evaluated how well it works, Stages 4 to 6.

We have made these iterations as many times as we needed to make our ensemble learning as good as possible. After that, starting from the latest model, we have developed possible solutions for each type of objective that any person or company can have when predicting. Therefore, we have iterated as often as necessary to design all the options we want to offer in our model. Finally, we have wanted to demonstrate that this ensemble learning also works in other domains. For this reason, we have applied a synthetic dataset [16] to the ensemble method to validate the project carried out, performing Stages 4 to 6 again, thus obtaining a final design that has gone through various iterations until reaching its maximum performance. Thanks to this methodology, we have been able to create a possible solution to a real problem that is of great importance, according to the results that we were obtaining.

2.2 Research methods

We have planned to follow a design science methodology, since our objective in this project is to create a new algorithm, and we do not know from the beginning what needs to be done.

Therefore, we move according to the results that we obtain, as mentioned in the previous section. For this reason, we have not chosen to work with the controlled experiment methodology: although we have an apparent problem to solve, we do not initially know which procedure we have to carry out to obtain optimal results. Considering all these aspects, we saw that it was not appropriate to follow the controlled experiment method in this project.

On the other hand, we have also thought about using the case study methodology, since focusing on creating an algorithm for a specific problem could make it easier to create a possible general solution. Although it may seem that we have carried out a case study, since we have focused on one possible case, we did not follow this methodology. The reason is that algorithms that do not work very well for resource prediction have been kept in the ensemble learning, even though eliminating the worst ones could yield better performance for this particular case. However, if we think of extending this model to other areas, focusing on obtaining the best possible results for a single case study is not an adequate procedure. Therefore, we can affirm that we have created the model starting from a possible problem, but we have not followed a case study, since we have not focused on obtaining the best performance only for this case by removing the worst predictors from the ensemble learning. To minimize the problem of having some algorithms that are not very useful, we increase and decrease the weights of the different predictors involved. In this way, we have focused on creating a general solution.

2.3 Performance metrics

This section describes each of the metrics that will be taken into account when evaluating the performance of the algorithms within ensemble learning.

There are different metrics, but not all are considered simultaneously, because it depends on the objective we have when predicting. As mentioned before, there are four types of goals when predicting: underpredicting, optimizing, predicting upward, and nailing more cases.

These different metrics not only determine the model's performance after training, but also determine how much the ensemble learning takes one algorithm into consideration over the rest when forecasting resources.


2.3.1 Mean absolute error (MAE)

It is a measure of the error between the expected values and the predicted values. This metric is helpful to know which algorithm is generally predicting closer to the actual value.

$$MAE = \frac{\sum_{i=1}^{n} |y_{true_i} - y_{pred_i}|}{n} \qquad (1)$$

where $y_{true_i}$ is the real value of y, while $y_{pred_i}$ is the predicted value of y.

2.3.2 Counter of equal values (EC)

Knowing the number of times that an algorithm has predicted the exact same number as the real one is a reference parameter for problems in which the objective is to nail more cases.

$$EC = \sum_{i=1}^{n} \delta_{y_{true_i},\, y_{pred_i}} \qquad (2)$$

where $y_{true_i}$ is again the real value of y, $y_{pred_i}$ the predicted value of y, and $\delta_{x,y}$ is the Kronecker delta function.

2.3.3 Counter of approximated values (OC)

This metric relaxes the counter of equal values in previous Equation 2 by allowing, in the prediction, a tolerance of X% of the real value. Therefore, this counter increases for each prediction that is within the X% of the real value.

$$OC = \sum_{i=1}^{n} \left[ \frac{|y_{true_i} - y_{pred_i}|}{y_{true_i}} \cdot 100 \leq X \right] \qquad (3)$$

2.3.4 Absolute error

This metric contributes to knowing in each iteration of the ensemble learning prediction which algorithm is the closest to the real value. The error, in general, can be defined as the difference between the obtained value and the true one.

$$E_a(y) = |y_{true_i} - y_{pred_i}| \qquad (4)$$
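These four metrics are straightforward to compute. The following is a minimal NumPy sketch (the function names are ours, not part of the thesis software); the OC tolerance X is passed as a percentage.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, Equation (1)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def equal_counter(y_true, y_pred):
    """Counter of equal values (EC), Equation (2): exact matches only."""
    return int(np.sum(np.asarray(y_true) == np.asarray(y_pred)))

def approx_counter(y_true, y_pred, tolerance_pct):
    """Counter of approximated values (OC), Equation (3):
    predictions within tolerance_pct % of the real value."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    relative_error_pct = np.abs(y_true - y_pred) / y_true * 100
    return int(np.sum(relative_error_pct <= tolerance_pct))

def absolute_error(y_true_i, y_pred_i):
    """Absolute error of a single prediction, Equation (4)."""
    return abs(y_true_i - y_pred_i)
```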

2.4 Reliability and Validity

Reliability in our project means being able to replicate the same work again without any problem, obtaining the same results. The only difference that may exist when replicating is that, if we do not use the same data that we have applied, the model will give different results. This divergence can also occur if we change some meta-parameters of the predictors involved in the ensemble learning. Therefore, we can affirm that this project is reliable if the explanation of the implementation is followed step by step and the same data is used, thus obtaining very similar results.

Furthermore, validity in our case refers to the fact that the conclusions must be supported by the results obtained. The only validity problem that we can have in this project is external validity: we assume that our model can work for any case, having only observed that it works on two different datasets.


2.5 Ethical Considerations

The ethical considerations in this project are mostly related to the collection and processing of information. Anonymity must always be kept in the data that we enter into the model. It is very relevant to have as much information as possible, but always protecting the privacy of persons or companies. On the other side, the world of machine learning can be misused at the same time as it is valuable, since predictions can be used appropriately or negatively according to particular interests. However, if we use the model correctly, we can improve lives, helping to reduce unnecessary expenses, such as electricity in the case of predicting the resources used by a website, or in other cases with a positive impact on the environment.


3 Theoretical Background

This section presents the theoretical part necessary to carry out the project. It provides brief explanations for incremental learning, which is how the algorithms are trained, the different machine learning algorithms involved in the ensemble method, and the ensemble method itself.

3.1 Incremental learning

In this project, we use incremental learning, also called online learning; this is a machine learning method where the data is not all available at first, because it arrives in sequential order. Furthermore, every time there is new data, the predictor is updated. This method is the opposite of batch learning, which generates the predictor taking all data into account at once.

In this project, we have decided to use incremental learning for two reasons: the need to adapt to new patterns in data and also because the data is generated based on time.
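In code, this turns into a predict-then-train loop over the data stream. The sketch below is only an illustration of the idea, assuming a model object exposing predict_one/learn_one methods (in the style of incremental learning libraries such as river); the function name is ours.

```python
def run_incrementally(model, stream):
    """Prequential loop: for each new observation, first predict, then train.

    `stream` is assumed to yield (features, target) pairs in time order.
    """
    predictions = []
    for x, y in stream:
        predictions.append(model.predict_one(x))  # forecast before seeing the real value
        model.learn_one(x, y)                     # then update the model with the new data
    return predictions
```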

3.2 Algorithms

3.2.1 Decisional Tree

A decision tree [17] is a model that has a tree-shaped structure: it has branches, nodes, and leaves. Branches are subsections of the entire tree, while nodes can store a value or a condition. A node that stores a value is named a terminal node or leaf; otherwise, it is called a decision node.

This model begins with a root node. Branches develop from a set of questions asked of the data (represented in decision nodes), which unfolds even more branches. This process is carried out repeatedly until the model becomes confident enough to make a single prediction. When the model stops branching, the last unbranched nodes are named leaves. These leaves provide the values that result from the sequence of questions in the decision nodes.

3.2.2 Adaptive Decisional Tree

Adaptive Decisional Trees [18] are based on the previous decisional trees, although they also use ADWIN to detect drift and Perceptron to make predictions. The next paragraph explains these terms.

ADWIN, which stands for Adaptive Windowing [19], is an algorithm that aims to detect changes in the data and help predictors make accurate forecasts. Before describing how the algorithm works, we have to explain what a window is. A window is a fragment of the available data which is used to train predictors. Algorithms usually utilize a fixed-size observation window, where all the available data is taken into account in the training process, even if most of these observations are irrelevant because they do not capture the latest trends in the most recent data. This causes a decrease in prediction accuracy. To solve it, ADWIN proposes an adaptive window size selection method, where the window size is automatically modified according to new trends, always trying to keep as much recent data as possible, to provide at the same time both robustness and adaptability to changes.

The window adjustment process is carried out by comparing the averages of different windows as information is obtained, eliminating the oldest information that prevents the window from adapting to the new tendencies.


On the other hand, the adaptive decision tree also involves the perceptron [20]. The only difference in comparison with a common decision tree is that, instead of having leaves, there are Linear Threshold Units (LTU). On the one hand, common decision trees are based on boolean combinations of the features of the instances, which forces everything to be a boolean combination of the characteristics. On the other hand, the modification proposed in adaptive decision trees allows one to learn more than boolean combinations. This is a useful characteristic for some problems in which the common tree would be a poor choice.

3.2.3 KNN algorithm

K nearest neighbors (KNN) [21] is an algorithm that stores all available cases and predicts new cases based on the most similar old cases. The size of the similar cases depends on the variable k. This represents the number of the closest observations taken into account.

The value of the variable k can be chosen by the analyst according to their experience, or it can be set after analyzing which value of k gives the best performance within a specific range.

3.2.4 Persistence model

It is one of the simplest methods to predict the future in time series. This model is based on predicting the future values assuming that no condition of the present changes. Therefore, the current conditions will be taken into account in order to calculate the future values.

The following is the simplest formula to represent the persistence model, where t is the current time and (t + T_H) the future time. As we can see, the future value will be the current value.

$$y(t + T_H) = y(t) \qquad (5)$$

This is the baseline algorithm for predictions. In order to assess how well other time series prediction algorithms perform, their errors are compared with the errors of the persistence model.
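For illustration, a persistence forecaster can be written in a few lines; this is a sketch with our own class and method names, not the exact implementation of the thesis.

```python
class PersistenceModel:
    """Naive baseline: the forecast for time t + T_H is the last observed value y(t)."""

    def __init__(self):
        self.last_value = None

    def learn_one(self, y):
        # Remember only the most recent observation.
        self.last_value = y

    def predict_one(self):
        # Equation (5): y(t + T_H) = y(t).
        return self.last_value
```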

3.2.5 ARIMA

The autoregressive integrated moving average (ARIMA) [22] is a model that is created based only on the old data and not on external variables, as it only relies on finding some pattern in the previous instances to make future predictions. The following formula illustrates how the predicted value at a specific time, $y_t$, depends only on the previous data, combining old instances. This model is applied when there is evidence that the data are stationary.

$$y_t = \mu + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} - \theta_1 e_{t-1} - \dots - \theta_q e_{t-q} \qquad (6)$$

3.3 Feature selection techniques

Feature selection is a process that consists of reducing the number of input variables to have fewer redundant variables, leaving only the most relevant ones for prediction. This gives us less noise and more speed when forecasting.


3.4 Ensemble method

The ensemble method is a technique that uses different base models and combines their results, aiming to produce a better resulting prediction.

There are different ways to combine the algorithms, but the ones we use in this project are averaging and dynamic weighted voting. In the former, the final result is simply the average of the predictors' results. In the latter, weighted voting, each predictor has an "importance" value (or trustability value) used as a weight, initially the same for all.

This project combines the results of five distinct types of algorithms, namely, the decisional tree, the adaptive decisional tree, the persistence model, the KNN, and the ARIMA model.

The sum of the weights of all the algorithms must always be equal to 1. As we already mentioned, all predictors have the same weight at the beginning. Therefore, in the first iteration, the final prediction of the model is the average of the results of each of the algorithms involved. However, this is quickly improved by the dynamic update of the weights. The weights are modified throughout the run time depending on whether the predictions are close to the actual value (and the weight of the algorithm increases) or not (the weight of the algorithm decreases).
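For illustration only (not the exact thesis code), the weighted vote reduces to a weighted sum of the individual predictions, with the weights kept normalized so that they always sum to 1:

```python
def combine(predictions, weights):
    """Dynamic weighted voting: the final prediction is the weighted sum
    of the individual predictions (weights are assumed to sum to 1)."""
    return sum(p * w for p, w in zip(predictions, weights))

def normalize(weights):
    """Rescale the weights so that they sum to 1 after any update."""
    total = sum(weights)
    return [w / total for w in weights]

# With the five predictors of this project and equal initial weights,
# the first combined prediction is simply the average of their results.
initial_weights = [1 / 5] * 5
```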

3.5 Uncertainty in models

Literature and advances in the area of model uncertainty have also been helpful in this project [23]. Each of the methods in the ensemble previously explained can be seen as a different prediction model. Without looking at the internal details of each method, which is among the fundamental restrictions of this project (i.e., make accurate prediction easier for non-experts), the activity that combines the results sees its inputs as a set of different plausible results. Therefore, in this activity, there is uncertainty about which of the results is the correct one, if any. This project leverages the "model averaging" uncertainty management technique [10] to begin with a simple averaging of results, and then improves the combination method through weighted voting until reaching the final dynamic weighted voting.


4 Research project – Implementation

The different software artifacts that have been implemented during the project development can be classified into three different groups:

• Methods to preprocess the input data.

• Algorithms to train and execute different predictors.

• Algorithms to combine the outputs of the predictors to improve the final produced prediction.

Figure 4.2 illustrates this structure. All the software artifacts have been developed using the Python programming language (version 3.7) [24]. This programming language has been chosen because it is among the most popular in machine learning, for its simplicity, for the support it has in the community, and, more importantly, because it supports recent incremental learning frameworks. All the code made in this project can be seen on my GitHub account [25].

Figure 4.2: Structure of the ensemble method.

The following subsections explain the characteristics and problems faced during the implementation of each artifact.

4.1 Data preprocessing

This subsection describes the steps carried out during the data preprocessing activity. These steps comprise a description of the dataset, activities that modify the entries in the dataset to a more usable format, and activities that filter the information in the dataset.

These steps are the most dataset-dependent in the project. Other datasets will need their own data preprocessing activities.


4.1.1 Dataset description

The first dataset used in this project contains the resources used by a web application every 5 minutes from 12 August 2013 to 11 September of the same year. This is the dataset with which this project carried out the majority of its explorative and experimental work. More specifically, the resources registered in the dataset are: the number of CPUs available, their actual percentage of use, the memory capacity provisioned, the memory actually used, the rate of reading and writing from/to disk, and the bit transmission rate over the network, both incoming and outgoing.

4.1.2 Data preparation

Preprocessing the original data before starting to apply algorithms to the dataset is a mandatory task. It is necessary to generate the entries in a meaningful way for concrete prediction purposes and in the format accepted by the algorithms. This requires adding columns or removing unnecessary ones. This task will not only allow the prediction algorithms to work, but will also increase their accuracy. At the same time, the training time will be reduced, since we will not have as much information to process.

To begin this phase, we have created in our dataset a new column named Date, by transforming the timestamp values of the original dataset into date and time, in order to better understand how the data is distributed. After this, we have created another column for the Memory usage [%], dividing the memory usage column by the provisioned one. The reason is that each entry in the original dataset contained a number with the KB of provisioned memory and a number with the actual memory used in KB. However, the value with the % of memory used is more meaningful and summarizes in one column the information of two original columns. In this way, we already have the two columns that we need to predict: the CPU usage [%], which already existed, and the Memory usage [%], which is the column we have just created.

After having created these two columns, we have removed the columns that gave redundant information. These redundant columns were:

• CPU cores, which was a constant value in all entries in the dataset.

• CPU provisioned capacity [MHZ] and CPU usage [MHZ], because the new column of % of CPU usage summarizes them.

• Memory capacity provisioned [KB] and Memory usage [KB], because the new column of % of memory usage summarizes them.

After this preprocessing, our data is already able to train the different algorithms.
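A sketch of this preparation with pandas could look as follows; the file name and the exact column headers are illustrative assumptions, since the original dataset headers are not reproduced in this report.

```python
import pandas as pd

df = pd.read_csv("resources.csv")  # hypothetical file name

# New "Date" column from the original timestamp (assumed to be in seconds).
df["Date"] = pd.to_datetime(df["Timestamp"], unit="s")

# Memory usage [%] = memory actually used / memory provisioned.
df["Memory usage [%]"] = df["Memory usage [KB]"] / df["Memory capacity provisioned [KB]"] * 100

# Drop the columns that the two percentage columns now summarize.
df = df.drop(columns=[
    "CPU cores",                        # constant in every entry
    "CPU provisioned capacity [MHZ]",
    "CPU usage [MHZ]",
    "Memory capacity provisioned [KB]",
    "Memory usage [KB]",
])
```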

4.1.3 Correlation of variables

To further understand the dataset, a heat map has been implemented, in which the correlations between each pair of variables can be more easily observed. Heat maps show the correlation between every two variables in the dataset in the range [-1, 1]. A value close to 1 means that there is a positive correlation between the two variables; a value close to 0 means that there is no correlation; and a value close to -1 means that there is a negative correlation. Section 5 illustrates the result of a heat map study in Figure 5.4.
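Such a heat map can be produced directly from the pandas correlation matrix of the prepared dataframe, for instance with seaborn (a sketch, not necessarily the exact plotting code used in the project):

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()  # pairwise correlations of every two variables, in [-1, 1]
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between the dataset variables")
plt.show()
```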


4.2 Implementation of the algorithms to train different predictors

Once the dataset is pre-processed, we have implemented the different algorithms individually, and we have explored their behavior and performance before involving them in the ensemble method. The implemented algorithms are introduced in Section 3.2: the decisional tree, the adaptive decisional tree, the KNN regression, the persistence model, and the ARIMA model.

We have created the first three algorithms using specific Python libraries for incremental learning, such as scikit-multiflow [26]. The persistence model, due to its simplicity, has been implemented entirely by us. Finally, we have also used libraries to implement ARIMA, specifically statsmodels [27]. However, this library is not dedicated to incremental learning. Therefore, we have started from it, but we have adapted it to use ARIMA for the project purposes. The main problem arising from not finding any ARIMA library explicitly designed for incremental learning is that the solution we have designed is relatively slow, even if it makes accurate predictions. This is because a new ARIMA is created each time new data comes. Consequently, the model needs to be trained on all the old data each time before making the prediction. This is good enough for our exploratory work on an ensemble of methods making predictions, but it is not in full agreement with the incremental learning philosophy.
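A minimal sketch of that workaround could look like the following, using the ARIMA class from statsmodels with an arbitrary order; the class name and the warm-up threshold are our own assumptions. The whole history is refitted on every new observation, which explains the slowness mentioned above.

```python
from statsmodels.tsa.arima.model import ARIMA

class RefittingARIMA:
    """Wraps statsmodels' batch ARIMA so it can be used in an incremental loop:
    a new model is created and fitted on the full history at every step."""

    def __init__(self, order=(1, 1, 1)):
        self.order = order
        self.history = []

    def learn_one(self, y):
        self.history.append(y)

    def predict_one(self):
        if len(self.history) < 10:
            # Not enough data to fit yet: fall back to persistence.
            return self.history[-1] if self.history else 0.0
        fitted = ARIMA(self.history, order=self.order).fit()
        return float(fitted.forecast(steps=1)[0])
```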

For the KNN regression method, it is necessary to decide the value of k. Using the first dataset, we have performed several executions of the KNN method varying the k value to understand its behavior and the region of values for k in which the method provided better results.

After analyzing the KNN performances with distinct values of k, we have implemented only the KNN algorithm with the value of k that gives the smallest error. We know that we cannot always study the values of k in advance in order to choose the best k when creating the predictor, since we may not have data yet, and the best value of k can also vary because we are in an incremental learning context.

The proper implementation would be to have multiple KNN predictors running simultaneously with different values of k, combining them with the other predictors and modifying the importance of each one according to the performance shown at that moment, as we explain in the following sections. However, we have not done it this way, since having so many predictors would not facilitate the analysis of their combinations, which was our principal goal. In addition, with several KNN predictors the error would not have been reduced much compared to having only one. Therefore, we decided to implement only one KNN in the different models.

4.2.1 Ensemble method

When we have the five algorithms ready to run, we create the structure of the ensemble.

Figure 4.2 showed the different stages that exist for the proper functioning of the model. The words data preprocessing are in italics because this is the only part that has not been automated in this project. When the data is available to work with, the ensemble allows running all the prediction methods concurrently and provides an entry point for gathering each of their results. This set of results is then passed to the combination module, in which the results are combined according to the characteristics of the model to give a final prediction. We will see the different ways to combine the results in the following sections.


4.3 Combinations of predictors output for improving the final result

This is the most original contribution of the project. It needed exploration, theoretical development, implementation, experimentation, development, and validation activities.

We have researched and proposed several methods to combine the results from the different predictors. The next paragraphs explain them.

4.3.1 Method 1: Base ensemble case

The base case for the combination of results gives the same weight to each prediction algorithm without considering the performance that these algorithms have over time; i.e., simply computing the average of the result from each predictor. This base case of ensemble learning will help us assess whether our further proposals on the combination of results lead to better predictions. The following proposed methods will dynamically adjust the weight of each predictor by taking into account their previous performance and the different prediction objectives.

Starting from the base case, this project explores and experiments with different manners to combine the results produced by the ensemble of predictors.

The main objective of these proposals is to find a combination of algorithms that produces a prediction that beats the predictions of the base ensemble case explained above, which computed the final prediction as the average of the results.

A second, even more ambitious, objective for the proposed methods is to produce a result that also executes better than any of the predictors working in isolation. We remind the reader that the general motivation of this project is to reduce the level of expertise required from users of machine learning methods, and the ability to choose the correct method to use requires expertise from the user. Therefore, a solution that considers a set of predictors and can automatically decide which one would produce the best results already reduces the expertise required from the users. However, developing a solution that considers a set of possible methods and can automatically produce results that are better than all of them is a much more ambitious objective which certainly deserves research and exploration.

The next subsections explain the different methods explored in this project for the combination of results from predictors. The notation used in the description of the next methods is the following:

• N is the number of predictors used. Each predictor is called $p_n$ with n ∈ [1..N].

• t is the current time.

• $p_{nt}$ is the prediction result of predictor n for time t. Obviously, this prediction is made before time t (by default, at time t − 1), because at time t the real value can already be observed.

• $e_{nt}$ is the error of predictor n for time t. This value is known only at time t, when the real value is observed.

• $w_{nt}$ is the weight given to predictor n for its prediction for time t.

• $p_t$ is the prediction coming from the combination of the results $p_{nt}$ for time t.

• $o_t$ is the real observed value at time t.

In addition, let nBest ∈ [1..N] be a variable that stores the index of the predictor that showed the best performance for the prediction for time t; i.e., $nBest = \arg\min_n(e_{nt})$.


4.3.2 Method 2: Prediction using the best algorithm of the previous step

In this method, the final prediction is taken from the predictor that showed the best performance in the previous iteration. An iteration in this context means the process that is carried out each time new data comes. This refers both to the moment in which each algorithm predicts by itself and when each of the models is combined to make the final prediction of the ensemble learning. In addition, an iteration also includes the training with the new data after making the predictions.

This modification gives a weight of one to the algorithm that showed the smallest error, while the other algorithms are given a weight of zero. Therefore, this method can be understood as a dynamic predictor selection rather than a combination of results.

Although the predicted value only depends on one of the algorithms, all the algorithms in each iteration are trained at the end of the iteration with the new instance.

More formally,

• At time t − 1, all the N predictors create their $p_{nt}$. The final prediction for time t is $p_t = \sum_{n \in [1..N]} p_{nt} \, w_{nt}$.

• At time t, when the real value $o_t$ is observed, all predictors compute their prediction errors $e_{nt}$ using one of the metrics presented in Section 2.3. The weights $w_{n,t+1}$ are updated as follows:

  – $w_{nBest,t+1} = 1$; i.e., the predictor with minimum error gets weight 1, and

  – $\forall n \neq nBest,\; w_{n,t+1} = 0$; i.e., the rest of the predictors get weight 0.

Finally, each predictor executes its incremental learning task using the new observation and computes its new prediction $p_{n,t+1}$ for moment t + 1.
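In code terms, this weight update is a winner-takes-all assignment; a small sketch (the function name is ours):

```python
def update_weights_method2(errors):
    """Method 2: the predictor with the smallest error in the previous step
    gets weight 1; every other predictor gets weight 0."""
    best = min(range(len(errors)), key=lambda n: errors[n])
    return [1.0 if n == best else 0.0 for n in range(len(errors))]
```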


4.3.3 Method 3: Ensemble learning increasing the weight to the best and decreasing to the rest by constant

In this method, the weights are adjusted in each iteration in the following way. The predictor that showed the best performance in the previous iteration increases its weight by a constant factor α. The rest of the predictors decrease their weights by a total constant factor α. Obviously, the update of weights must consider that none of the weights can be larger than 1 or lower than 0. In some cases, this method cannot increase or decrease the total amount to the weight: for example, if the weight of the predictor to increase was already close to 1 or if the weight of some of the predictors to decrease was already close to 0. In that case, the total sum of weights would be different from 1. Therefore, after each update, the weights are normalized to make their sum 1.

In this method, the best predictor in the previous step is rewarded, and the rest are penalized, but not as much as in the previous Method 2, which gave total weight to the previous best predictor and ignored all the rest.

More formally, the behavior is the following:

• At time t − 1, all the N predictors create their $p_{nt}$. The final prediction for time t is $p_t = \sum_{n \in [1..N]} p_{nt} \, w_{nt}$.

• At time t, when the real value $o_t$ is observed, all predictors compute their prediction errors $e_{nt}$ using one of the metrics presented in Section 2.3. The weights $w_{n,t+1}$ are updated as follows:

  – $w_{nBest,t+1} = \min(1, w_{nBest,t} + \alpha)$; i.e., the predictor with minimum error increases its weight by α, at most.

  – $\forall n \neq nBest,\; w_{n,t+1} = \max(0, w_{nt} - \frac{\alpha}{N-1})$; i.e., the rest of the predictors decrease their weights by α/(N − 1) each, at most.

  – Normalization step: make the sum of weights equal to 1 by updating each of them, $w_{n,t+1} = \frac{w_{n,t+1}}{\sum_{\forall n} w_{n,t+1}}$.

Finally, each predictor executes its incremental learning task using the new observation and computes its new prediction $p_{n,t+1}$ for moment t + 1.
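A sketch of this update rule (our own code, with α as a free meta-parameter of the method) could be:

```python
def update_weights_method3(weights, errors, alpha=0.05):
    """Method 3: the best predictor gains alpha (capped at 1), every other
    predictor loses alpha/(N-1) (floored at 0), and the result is normalized."""
    n = len(weights)
    best = min(range(n), key=lambda i: errors[i])
    updated = [
        min(1.0, w + alpha) if i == best else max(0.0, w - alpha / (n - 1))
        for i, w in enumerate(weights)
    ]
    total = sum(updated)
    return [w / total for w in updated]
```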

4.3.4 Method 4: Ensemble learning increasing the weight to the best and decreasing to the worst by constant

This ensemble method is similar to the previous one. The difference is that this method penalizes only one of the predictors, the worst one. Therefore, the weight of N-2 predictors remains unchanged. With this method, we aimed to avoid penalizing methods that produced relatively good results in previous steps but not the best ones.

More formally, let nWorst ∈ [1..N] be a variable that stores the index of the predictor that showed the worst performance for the prediction for time t; i.e., $nWorst = \arg\max_n(e_{nt})$. Then, the only part that changes with respect to the previous Method 3 is the line

• $\forall n \neq nBest,\; w_{n,t+1} = \max(0, w_{nt} - \frac{\alpha}{N-1})$; i.e., the rest of the predictors decrease their weights by α/(N − 1) each, at most,

which becomes

• $w_{nWorst,t+1} = \max(0, w_{nWorst,t} - \alpha)$; i.e., the predictor with maximum error decreases its weight by α, at most.


4.3.5 Method 5: Ensemble learning increasing the weight to the best by constant and decreasing to the worst and perhaps to others too

The previous ensemble learning hardly affected the weights of bad predictors when the weight of the worst predictor was close to 0. This method avoids that situation. In case the weight of the worst predictor is less than α, we decrease the missing difference equally among the algorithms that are not the best and have a positive weight.

Comparing with the two previous methods:

• Method 3 decreased the weight of every predictor that was not the best one, which may be too pessimistic for predictors that worked relatively well but were not the best.

• Method 4 decreased at most quantity α to only one predictor, the worst one, which may become a minor decrement if the worst predictor had already 0 weight or a value close to 0.

This method proposes something in between. Then, the only part that changes concerning the previous Method 4 is the lines that reduce the weight of the worst predictor, becoming:

• if [w

nW orstt

> α] then w

t+1nW orst

= w

nW orstt

− α; like previous method.

• otherwise

– w

nW orstt+1

= 0, the worst predictor decreases its weight to 0

– ∀n 6= {nBest, nW orst} w

nt+1

= max(0, w

tn

α−wN −2nW orstt

); i.e., the other predictors decrease the missing difference to make the overall set of predictors decrease a quantity α
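The following sketch shows only this modified penalization step, again with NumPy arrays and illustrative names: when the worst predictor cannot absorb the full decrement α, the remainder is spread equally over the other non-best predictors.

import numpy as np

def penalize_method5(weights, n_best, n_worst, alpha=0.05):
    """Take alpha from the worst predictor; if its weight is smaller than alpha,
    set it to 0 and spread the missing part over the remaining predictors."""
    w = weights.copy()
    if w[n_worst] > alpha:
        w[n_worst] -= alpha                        # same as Method 4
    else:
        missing = alpha - w[n_worst]               # part the worst could not absorb
        w[n_worst] = 0.0
        others = [i for i in range(len(w)) if i not in (n_best, n_worst)]
        for i in others:                           # spread the missing difference equally
            w[i] = max(0.0, w[i] - missing / len(others))
    return w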

4.3.6 Method 6: Ensemble learning with feature selection techniques

Sometimes using more features in the dataset does not produce more accurate results.

Some of the features may only introduce “noise” because they are not correlated with the variable to predict, and the predictors may behave worse than if these features were not considered.

Thus, this method applies a feature filtering task to try to eliminate that noise. After carrying out the feature selection process on our dataset and observing which variables provide the most information, we execute the previous ensemble method using only the most relevant features as input.

Section 5 explains the functions used for the filtering, provides the feature relevance values, and compares the differences of the two method executions, Method 5 and this Method 6, which first selects a subset of the available features and then applies Method 5.
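As a hedged illustration of such a filtering step (the actual functions and relevance values used in this project are the ones reported in Section 5), a univariate filter such as scikit-learn's SelectKBest could be applied before running the ensemble; the data below is synthetic and only for demonstration.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic example: 8 candidate features, only columns 0 and 3 are informative
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=3)   # keep the 3 highest-scoring features
X_filtered = selector.fit_transform(X, y)

print("relevance scores:", selector.scores_.round(1))
print("selected columns:", selector.get_support(indices=True))
# X_filtered would then be fed to the ensemble of Method 5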

4.3.7 Method 7: Ensemble learning modifying weights of more than one algorithm

In the previous methods, we increased the weight of only one predictor, the best one, after each observation. However, we have observed that there may be two or more predictors that predict exactly equally well (i.e., at some point in time, they have the same error value $e^n_t$). Up to now, one of them was arbitrarily considered the $nBest$ and only its weight was increased, even if other predictors had performed equally well.

Therefore, this method considers all the algorithms that show the best performance and, for each of them, increases its weight while decreasing that of the worst-performing algorithm.

This approach leads to an ensemble learning that is more equitable with all the algorithms involved.

More formally, the only aspect that changes with respect to the previous method is the notation of $nBest$. Now we consider that $\{nBest\}$ is the set with all the indices of the predictors that showed the best performance for the prediction at time $t$; i.e., $\{nBest\} = \arg\min_n (e^n_t)$. The only part that changes with respect to the previous methods is the calculation of the increment in the weights in the line

• $w^{nBest}_{t+1} = \min(1, w^{nBest}_t + \alpha)$; i.e., the predictor with minimum error increases its weight by $\alpha$, at most,

which becomes

• $\forall nBest' \in \{nBest\}$, $w^{nBest'}_{t+1} = \min(1, w^{nBest'}_t + \alpha)$; i.e., all the predictors with minimum error increase their weight by $\alpha$, at most.
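A sketch of this modified reward step under the same assumptions, where several predictors may be tied for the minimum error:

import numpy as np

def reward_method7(weights, errors, alpha=0.05):
    """Increase the weight of every predictor whose error equals the minimum."""
    w = weights.copy()
    best_set = np.flatnonzero(errors == errors.min())   # indices of all tied-best predictors
    for i in best_set:
        w[i] = min(1.0, w[i] + alpha)                   # each of them gains alpha, at most
    return w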

4.3.8 Method 8: Ensemble learning modifying weights of more than one algorithm equally

This method proposes two updates. First, the weights will not be updated after the first iteration; i.e., $\forall n \in [1..N]$, $w^n_1 = w^n_0$. The reason is that, while experimenting with the previous ensemble method, we created a table with the prediction value of each algorithm, the prediction of the ensemble method, and the different weights of each of the algorithms throughout the whole process. This table showed some undesired behavior in the first iteration: the method modified the weights when there were not yet enough values to forecast. Therefore, we observed that we could still improve the way the weights are modified at the beginning of the prediction process.

Secondly, when reducing the weight of the worst algorithm, we consider that, when there are two equally bad algorithms, it is better to reduce the weight of the one that has more weight (in Method 4, for example, one of the equally bad algorithms was arbitrarily chosen to decrease its weight). In this way, both will have less weight in the subsequent iterations.

More formally,

• let $\{nWorst\}$ be the set with all the indices of the predictors that showed the worst performance; i.e., $\{nWorst\} = \arg\max_n (e^n_t)$,

• let $nHeaviest \in \{nWorst\}$ be the index of the predictor that had the highest weight among the predictors in $\{nWorst\}$; i.e., $nHeaviest = \arg\max_{n \in \{nWorst\}} (w^n_t)$,

• then, the difference with the previous method is that the first predictor that decreases its weight is $w^{nHeaviest}_{t+1} = \max(0, w^{nHeaviest}_t - \alpha)$.
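A sketch of this penalization rule, with the same illustrative array representation: among the predictors tied for the worst error, the one carrying the most weight is the one that loses α.

import numpy as np

def penalize_method8(weights, errors, alpha=0.05):
    """Among the tied-worst predictors, decrease the one with the highest weight."""
    w = weights.copy()
    worst_set = np.flatnonzero(errors == errors.max())  # indices of all tied-worst predictors
    n_heaviest = worst_set[np.argmax(w[worst_set])]     # heaviest among them
    w[n_heaviest] = max(0.0, w[n_heaviest] - alpha)
    return w

The other change of this method, not updating the weights after the first iteration, would simply be a guard (e.g., if t > 1) around the whole weight-update call.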


4.3.8.1 Results in real time of the algorithms involved in the ensemble learning

The predictions of each of the algorithms can be viewed in real time, where they can be compared with the actual value. Moreover, the respective errors of the algorithms can be seen increasing and decreasing during the learning process. Figure 4.3 shows an example of a possible visualization.

Figure 4.3: Results in real-time of the ensemble learning.

4.3.9 Method 9: Ensemble learning with multiple KNN

In the previous methods, only one instance of the KNN predictor was executed, that is, for a single value of k. The user had to initially choose a value of k for the KNN algorithm.

A natural extension for reducing the knowledge required from the user on machine learning algorithms is to relax that precondition. This Method 9 relaxes it by executing in parallel different KNN predictors, each of them with a different value of k. Therefore, the overall prediction no longer executes only five predictors and combines their results; it now executes four predictors plus the number of KNN instances and combines the results of all of them.

Regarding the number of KNN predictors executing in parallel, we have decided to instantiate a KNN every five values of $k$; that is, $k = 1 + 5k'$ for $k' = 0, 1, 2, \ldots$ This instantiates a KNN with $k = 1$, another with $k = 6$, another with $k = 11$, etc., until a predefined maximum value. For example, in our experiments we executed six different KNN predictors, with $k \in \{1, 6, 11, 16, 21, 26\}$. Therefore, the number of predictors executing in the ensemble will be ten: four different algorithms (the decision tree, the adaptive decision tree, the persistence model, and ARIMA) and the six instances of KNN.
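As a sketch of how this grid of k values could be instantiated, the snippet below uses scikit-learn's KNeighborsRegressor as a stand-in for the KNN predictor actually used in the implementation; the constant MAX_K is an illustrative name.

from sklearn.neighbors import KNeighborsRegressor

MAX_K = 26
# one KNN instance every five values of k: k = 1, 6, 11, 16, 21, 26
knn_predictors = [KNeighborsRegressor(n_neighbors=k) for k in range(1, MAX_K + 1, 5)]

# together with the four other algorithms, the ensemble holds 4 + len(knn_predictors) = 10 predictors
print([est.n_neighbors for est in knn_predictors])      # [1, 6, 11, 16, 21, 26]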

We only implemented one KNN algorithm in the previous ensemble methods because we wanted to focus on the way the algorithms were combined and to correctly analyze the behavior of the weight modification. Having multiple KNN instances simultaneously would have unnecessarily complicated this analysis: it is easier to analyze the results of five algorithms than of ten. We had also previously seen that executing multiple KNN predictors would not have decreased the error by much. Therefore, we decided not to instantiate more than one KNN in the ensemble until this Method 9.

In the previous cases we used only one KNN algorithm, with the value of k that worked best for us. This value was obtained by previously analyzing the KNN algorithm individually to see which k gave the best performance, as explained in Section 4.2.

4.3.10 Method 10: Ensemble learning modifying weights by proportion

This ensemble method is almost the same as Method 8. The only aspect that differs is that, instead of increasing or decreasing the weights by a constant value α, they are modified by a proportion β. This means that the weight of the best predictor is increased by a proportion β of its current weight, and the exact same amount is subtracted from the weight of the worst predictor.

If the amount to be decreased is greater than the weight of the worst algorithm, the remaining amount is subtracted from the other algorithms that are not the best and have positive weight. This characteristic was already proposed in Method 5.

More formally, the behavior is the following:

• At time $t-1$, all the $N$ predictors create their predictions $p^n_t$. The final prediction for time $t$ is $p_t = \sum_{n \in [1..N]} p^n_t \, w^n_t$.

• At time $t$, when the real value $o_t$ is observed, all predictors compute their prediction errors $e^n_t$ using one of the methods presented in Section 2.3. The weights $w^n_{t+1}$ are updated as follows:

– $w^{nBest}_{t+1} = \min(1, w^{nBest}_t (1 + \beta))$; i.e., the predictor with minimum error increases its weight by proportion $\beta$, at most.

– if $w^{nWorst}_t > w^{nBest}_t \beta$ then $w^{nWorst}_{t+1} = w^{nWorst}_t - w^{nBest}_t \beta$; i.e., the predictor with maximum error decreases its weight by the same amount as the best predictor was incremented.

– otherwise

* $w^{nWorst}_{t+1} = 0$; the worst predictor decreases its weight to 0, and

* $\forall n \neq \{nBest, nWorst\}$, $w^n_{t+1} = \max(0, w^n_t - \frac{w^{nBest}_t \beta - w^{nWorst}_t}{N-2})$; i.e., the other predictors decrease the missing difference.

– Normalization step: make the sum of weights equal to 1 by updating each of them, $w^n_{t+1} = \frac{w^n_{t+1}}{\sum_{\forall n} w^n_{t+1}}$.

Each predictor then executes its incremental learning task using the new observation and computes its new prediction $p^n_{t+1}$ for moment $t+1$.
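A minimal sketch of this proportional update, under the same array assumptions and with illustrative names:

import numpy as np

def update_weights_method10(weights, errors, beta=0.10):
    """Increase the best predictor's weight by a proportion beta of itself and
    subtract the same amount from the worst predictor, spilling over if needed."""
    w = weights.copy()
    n_best = int(np.argmin(errors))
    n_worst = int(np.argmax(errors))
    delta = w[n_best] * beta                       # amount gained by the best predictor
    w[n_best] = min(1.0, w[n_best] + delta)
    if w[n_worst] > delta:
        w[n_worst] -= delta                        # the worst absorbs the full decrement
    else:
        missing = delta - w[n_worst]               # part the worst could not absorb
        w[n_worst] = 0.0
        others = [i for i in range(len(w)) if i not in (n_best, n_worst)]
        for i in others:
            w[i] = max(0.0, w[i] - missing / len(others))
    return w / w.sum()                             # normalization step

# Example with four predictors
w = np.array([0.4, 0.3, 0.2, 0.1])
e = np.array([1.2, 0.3, 2.5, 0.9])
print(update_weights_method10(w, e, beta=0.2))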

We thought that this model could be a better option than the previous ones, because the weights are increased or decreased more quickly and more in accordance with the weight they already had. We are aware that the proportion β could become a parameter that needs to be set by the user, requiring some expertise from him/her. Nevertheless, we implemented the method and experimented with several values of β (Section 5 shows the results of the experiments), hoping to find a value for β that works acceptably well in different domains, so that it can be used as the default value.
