Evaluation of Adaptive random forest algorithm for classification of evolving data stream


DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

Evaluation of Adaptive random forest algorithm for classification of evolving data stream

AYHAM ALKAZAZ


Evaluation of Adaptive random forest algorithm for classification of evolving data stream

AYHAM ALKAZAZ & MARWA SAADO KHAROUKI

Degree Project in Computer Science

Date: August 16, 2020

Supervisor: Erik Fransén

Examiner: Pawel Herman

School of Electrical Engineering and Computer Science


Abstract

In the era of big data, online machine learning algorithms have gained more and more traction from both academia and industry. In multiple scenarios, decisions and predictions have to be made in near real-time as data is observed from continuously evolving data streams. Offline learning algorithms fall short in different ways when it comes to handling such problems. Apart from the costs and difficulties of storing these data streams in storage clusters, and the computational difficulties associated with retraining the models each time new data is observed in order to keep the model up to date, these methods also lack built-in mechanisms to handle seasonality and non-stationary data streams. In such streams, the data distribution might change over time in what is called concept drift. Adaptive random forests are well studied and effective for online learning and non-stationary data streams. By using bagging and drift detection mechanisms, adaptive random forests aim to improve the accuracy and performance of traditional random forests for online learning. In this study, we analyze the predictive classification accuracy of adaptive random forests when used in conjunction with different data streams and concept drifts. The data streams used to evaluate the accuracy are SEA and Agrawal. Each data stream is tested in three different concept drift configurations: gradual, sudden, and recurring. The results obtained from the performed benchmarks show that adaptive random forests have better accuracy handling SEA than Agrawal, which could be explained by the dimensionality and structure of the input attributes. Adaptive random forests showed no clear difference in accuracy between gradual and sudden concept drifts. However, recurring concept drifts had lower accuracy in the benchmarks than both the sudden and the gradual counterparts. This could be a result of the higher frequency of concept drifts within the same time period (number of observed samples).


Sammanfattning

I big data-eran har online-maskininlärningsalgoritmer fått mer och mer dragkraft från både akademin och industrin. I flera scenarier måste beslut och prediktioner göras i nära realtid när data observeras från dataströmmar som kontinuerligt utvecklas. Offline-inlärningsalgoritmer brister på olika sätt när det gäller att hantera sådana problem. Bortsett från kostnaderna och svårigheterna med att lagra dessa dataströmmar i ett lagringskluster, och de beräkningsmässiga svårigheterna förknippade med att träna om modellen varje gång ny data observeras för att hålla modellen uppdaterad, saknar dessa metoder även inbyggda mekanismer för att hantera säsongsbetonade och icke-stationära dataströmmar. I sådana strömmar kan datadistributionen förändras över tid i det som kallas konceptdrift. Anpassningsbara slumpmässiga skogar (Adaptive random forests) är väl studerade och effektiva modeller för online-inlärning och hantering av icke-stationära dataströmmar. Genom att använda mekanismer för att upptäcka konceptdrift och bagging syftar adaptiva slumpmässiga skogar till att förbättra noggrannheten och prestandan hos traditionella slumpmässiga skogar för online-inlärning. I denna studie analyserar vi den prediktiva klassificeringsnoggrannheten för adaptiva slumpmässiga skogar när de används i samband med olika dataströmmar och konceptdrifter. Dataströmmarna som används för att utvärdera noggrannheten är SEA och Agrawal. Varje dataström testas i tre olika konceptdriftkonfigurationer: gradvis, plötslig och återkommande. Resultaten som erhållits från de utförda experimenten visar att adaptiva slumpmässiga skogar har bättre noggrannhet för SEA än för Agrawal, vilket kan förklaras av antalet dimensioner och strukturen hos inmatningsattributen. Adaptiva slumpmässiga skogar visade dock ingen tydlig skillnad i noggrannhet mellan gradvisa och plötsliga konceptdrifter. Återkommande konceptdrifter hade emellertid lägre noggrannhet i experimenten än både de plötsliga och de gradvisa motsvarigheterna. Detta kan vara ett resultat av den högre frekvensen av konceptdrifter inom samma tidsperiod (antal observerade exempel).


Contents

Acronyms
1 Introduction
  1.1 Problem statement
  1.2 Scope
  1.3 Thesis outline
2 Background
  2.1 Offline and Online learning
  2.2 Data stream classification
  2.3 Ensemble methods
  2.4 Concept drifts
  2.5 Algorithms
    2.5.1 Decision Tree
    2.5.2 Hoeffding Tree
    2.5.3 Random forest
    2.5.4 Adaptive random forest
  2.6 Related Work
    2.6.1 Ensemble methods for data streams
    2.6.2 Ensemble methods with drift detectors
    2.6.3 Dynamic weighted majority
    2.6.4 Dynamic streaming random forest
3 Method
  3.1 Software frameworks
  3.2 Data streams
    3.2.1 SEA
    3.2.2 Agrawal
    3.2.3 Concept drift stream
  3.3 Experimental settings
  3.4 Training and Benchmarking
4 Results
  4.1 Gradual concept drift
  4.2 Sudden concept drift
  4.3 Recurring concept drift
  4.4 Statistical summary
5 Discussion
  5.0.1 Limitations
  5.0.2 Ethics and sustainability
  5.0.3 Future research
6 Conclusion
7 Appendix 1 - Plots
  7.1 SEA data stream
  7.2 AGRAWAL data stream
  7.3 Summary
8 Appendix 2 - Benchmark data
  8.1 All
  8.2 Summary


Acronyms

ADWIN  ADaptive WINdowing
ARF    Adaptive Random Forest
DSRF   Dynamic Streaming Random Forest
DT     Decision Tree
DWM    Dynamic Weighted Majority
HT     Hoeffding Tree


Chapter 1

Introduction

In some real-world applications of machine learning, data arrives in continuous streams. In such situations, the entire training data set is not available at the time the model is designed. The underlying data distribution of these streams might also change over time in response to various events. There is, therefore, a need for machine learning models that can learn on the fly from continuously evolving data streams and adapt to changes in the underlying probability distribution of the observed data. This means that the model needs to learn and adapt as new observations become available. In these situations, online machine learning approaches are required.

Many learning algorithms use one model to form a final prediction. However, it is also possible to combine several models in some manner to make predictions for new examples. A method that combines multiple base models is known as an ensemble method. There are many well-known ensemble methods, such as bagging [1] and random forests [2], that aim to improve generalization performance and model accuracy. The adaptive random forests (ARF) algorithm is an adaptation of batch-based random forests that aims to handle evolving data streams [3].

In this work we aim to study the behaviour of ARF in the challenging context of evolving data streams. More specifically, we aim to study ARF's accuracy when used to classify observations from evolving data streams with different kinds of concept drifts, where the data distribution changes over time in different patterns: gradual, sudden, or recurring.

1.1 Problem statement

Different online machine learning algorithms have different performance and accuracy characteristics when it comes to classifying non-stationary data streams. This study examines the behaviour of adaptive random forests when used to classify data streams with various forms of concept drifts, by comparing the changes in predictive accuracy across different data streams and concept drifts. The study aims to answer the following question: how do adaptive random forests perform when used to classify non-stationary data streams with sudden, gradual, and recurring concept drifts? Two synthetic data stream generators are used to evaluate the performance of the studied models: SEA and Agrawal.

1.2 Scope

In this study, we investigate the adaptive random forests (ARF) algorithm [3], an online ensemble method that deals with concept drifts using dynamic update methods and batch learning. There is an effectively unlimited number of possible data streams, concept drift setups, and model parameter permutations. The study therefore focuses on three types of concept drifts: gradual, sudden, and recurring.

All data streams and algorithms used in this paper are available in the scikit-multiflow framework [4]. The drift adaptation strategy we chose is ADaptive WINdowing (ADWIN) [5], with a threshold of 0.01 for warning detection and a threshold of 0.001 for drift detection. We also use 10 Hoeffding tree classifiers as base learners in the ensemble. Other model parameters are set to the predetermined defaults chosen by the framework designers.
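As a rough sketch, the configuration above could be expressed with the scikit-multiflow API as follows. This is an illustration of the described settings, not the authors' exact script; parameter names are taken from the library, and everything not mentioned in the text is left at its default.

```python
# Sketch of the experimental configuration described above (scikit-multiflow).
from skmultiflow.meta import AdaptiveRandomForestClassifier
from skmultiflow.drift_detection.adwin import ADWIN

arf = AdaptiveRandomForestClassifier(
    n_estimators=10,                             # 10 Hoeffding tree base learners
    warning_detection_method=ADWIN(delta=0.01),  # permissive warning threshold
    drift_detection_method=ADWIN(delta=0.001),   # stricter drift threshold
)
```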

The performance evaluation method is prequential evaluation [6] (the interleaved test-then-train method), and the evaluation measure used to assess performance is classification accuracy. The algorithm is evaluated in the immediate setting, where labels are presented to the learner before the next instance arrives. Situations where labels arrive with a delay are outside the scope of this work. The scope of this thesis is also limited to two data streams, SEA and Agrawal, with one test configuration for each type of concept drift.
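The prequential (test-then-train) protocol can be sketched in a few lines of plain Python. The majority-class learner here is only a hypothetical stand-in so the loop is self-contained; in the experiments the learner would be ARF.

```python
from collections import Counter

class MajorityClassLearner:
    """Toy incremental learner: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None
    def learn_one(self, x, y):
        self.counts[y] += 1

def prequential_accuracy(stream, learner):
    """Interleaved test-then-train: each instance is first predicted, then learned."""
    correct = total = 0
    for x, y in stream:
        if learner.predict(x) == y:   # test on an instance the learner has not seen
            correct += 1
        total += 1
        learner.learn_one(x, y)       # then train on that same instance
    return correct / total

stream = [({"f": i}, i % 2) for i in range(10)]  # toy stream with alternating labels
acc = prequential_accuracy(stream, MajorityClassLearner())
```

Because every prediction is made before training on the instance, the accuracy estimate never uses an example the model has already seen.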

1.3 Thesis outline

This report is divided into six chapters. The first chapter introduces the subject and the problem statement of the study. The second chapter presents background information related to online machine learning and the ARF algorithm, as well as previous work related to this thesis. The subsequent chapter covers the procedure of the thesis, describing the experimental settings and the data streams. The fourth chapter presents the results generated by our experiments, accompanied by the figures required to interpret the findings; the results are divided into three parts, one for each type of concept drift. Following that, the results are thoroughly discussed and critically analyzed in the fifth chapter, which also presents the limitations of the study, its ethical aspects, and our suggestions for future research. Lastly, the sixth chapter summarizes the discussion and answers the research question.


Chapter 2

Background

2.1 Offline and Online learning

Traditional machine learning systems that work in offline mode learn from fixed training sets where the identity of the data elements is known to the learner in advance. These systems are also memory-based and produce static models from finite data items produced from a stationary distribution.

However, in many real-world applications, such as web user tracking, fraud analysis, trading systems, radio frequency identification, and social network analysis, data is continuously generated at high speed in the form of data streams. By definition, a data stream is a real-time, ordered sequence of infinitely many data records, each with a timestamp. Moreover, the underlying data distribution of the stream may change on the fly in response to various events. This, in turn, requires continuous updates of the predictive models that learn from the stream.

Due to these characteristics of data streams, typical offline learning methods, such as batch classification algorithms, are not capable of successfully processing them. Instead, online learning is employed to solve problems that deal with data streams.

2.2 Data stream classification

In machine learning, classification algorithms are a type of supervised learning in which predictive models are built from labeled training data. The aim of a classification algorithm is to approximate the classification function that represents the underlying relationship between input instances and output instances (class labels). Thus, the predictive model of a classification problem, known as a classifier, can assign a class label to any new input example from the problem domain. A common example of a classifier is a system for classifying emails as "spam" or "not spam".


Stream classification algorithms can handle situations where data instances arrive at any time for prediction, because labeled and unlabeled data examples occur together in the stream [7]. Additionally, these algorithms use a limited amount of memory and do not store data prior to learning. They instead process each data instance as it arrives, using it first to test and then to update the model.

2.3 Ensemble methods

An ensemble method is a machine learning technique that creates multiple base learners and then combines them into one predictive model, in order to produce more accurate results than any of the single base learners could produce on its own [8].

The base models of an ensemble can be of different types or of the same type. For instance, the base model for a classification problem can be a Naive Bayes classifier, a Bayesian network, a decision tree, or a neural network [9]. During learning, each base model makes a prediction, called a vote, every time a new data instance arrives at the ensemble for classification. Thereafter a technique such as majority voting or weighted averaging is employed to combine the votes of the base classifiers and yield the ensemble's prediction.

Ensemble methods are able to selectively add, remove, reset, or update their base models and to achieve high learning performance, which in turn makes them appropriate for handling evolving data stream problems [10].
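The vote-combination step described above can be sketched as a small helper; the function name and the toy labels are illustrative, and the same routine covers both plain majority voting (all weights equal) and weighted voting.

```python
from collections import defaultdict

def combine_votes(votes, weights=None):
    """Combine base-classifier votes into an ensemble prediction.

    votes   -- list of class labels, one per base model
    weights -- optional per-model weights (defaults to plain majority voting)
    """
    weights = weights or [1.0] * len(votes)
    tally = defaultdict(float)
    for label, w in zip(votes, weights):
        tally[label] += w
    return max(tally, key=tally.get)

# Plain majority voting: two of three models vote "spam"
assert combine_votes(["spam", "spam", "ham"]) == "spam"
# Weighted voting: a single highly weighted model can override the majority
assert combine_votes(["spam", "spam", "ham"], weights=[1.0, 1.0, 5.0]) == "ham"
```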

2.4 Concept drifts

Learning from data streams that flow constantly and rapidly poses new learning challenges. A phenomenon known as concept drift is one major challenge that arises when the process generating the data evolves over time. Concept drift commonly occurs in many real-world applications. For instance, if customers' preferences evolve over time due to seasonal inclinations or fashion trends, a prediction system trained on past data becomes invalid for predictions.

In machine learning, concept drift refers to changes that occur over time in the underlying relationship between the data instances the model tries to classify and their corresponding class targets [11]. The problem of concept drift is of special interest when dealing with non-stationary data because of its direct impact on learning, which makes predictive models obsolete. This, in turn, implies the need to update the model to represent the actual concept that the current data reflects.

A very common categorization of concept drift depends on the pattern of the change from one concept to another [11]. Concept drift can be categorized into three patterns as shown in Figure 2.1, where the blue circles represent the old concept while the green ones are for the new concept.

(13)

Figure 2.1: Types of concept drift

• Gradual: these drifts result from a slow and gradual transition from the past concept to the new one. Both concepts may co-occur within the transition period, and the percentage of the data that follows the new concept gradually increases. This gradual transition makes it hard for the predictive system to observe a noticeable change in the incoming data and the underlying class distribution [12].

• Sudden: in this pattern the transition between the old concept and the new concept occurs immediately, within a very short time [11]. Namely, the streamed data suddenly reflects the new concept, and the old concept thus becomes invalid.

• Recurring: this type of drift represents a scenario where a previously active concept reappears as the stream progresses [13]. For example, in intrusion detection systems, the same incidents may appear again after some time.
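A common way synthetic generators realize such drifts is to blend two concepts, with the probability of the new concept following a sigmoid over the stream position (a narrow transition window approximates a sudden drift, a wide one a gradual drift). The sketch below illustrates this idea under assumed names and toy SEA-like threshold concepts; it is not the generator used in the experiments.

```python
import math
import random

def drifting_label(x, n, position, width, old_concept, new_concept, rng):
    """Label instance x at stream position n while drifting between two concepts.

    The probability of drawing the new concept follows a sigmoid centred at
    `position` with transition `width`; a small width approximates a sudden drift.
    """
    p_new = 1.0 / (1.0 + math.exp(-4.0 * (n - position) / width))
    return new_concept(x) if rng.random() < p_new else old_concept(x)

rng = random.Random(42)
old = lambda x: int(x[0] + x[1] > 1.0)   # toy SEA-like threshold concept
new = lambda x: int(x[0] + x[1] > 1.4)   # same form, shifted threshold

# Far before the drift position almost all labels come from the old concept,
# far after it almost all come from the new one.
x = (0.6, 0.6)
early = [drifting_label(x, 10, 500, 50, old, new, rng) for _ in range(100)]
late = [drifting_label(x, 990, 500, 50, old, new, rng) for _ in range(100)]
```

A recurring drift can be simulated by switching the roles of `old_concept` and `new_concept` back again later in the stream.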

2.5 Algorithms

2.5.1 Decision Tree

The Decision Tree (DT) algorithm is a technique for creating a predictive model that uses a tree representation to solve classification problems. The tree's data structure consists of internal nodes (decision nodes) that test data attributes, edges that branch on the outcomes of the tests, and leaves that represent the class labels of the problem.

The algorithm for constructing a decision tree identifies ways to divide the training data by determining the most informative data attribute on which to split the data at each node. The construction procedure is applied during the training phase and starts from the root of the tree, which contains the complete dataset. The procedure first finds the data attribute that best splits the training data at the root into subsets. All possible outcomes of the test assigned to the root are then represented as outgoing edges from the root.

Thereafter, the construction method recursively generates new nodes by splitting each sub-dataset that flows through each branch. The method does not extend a branch further if it reaches a maximum depth or when the remaining subset at the branch is almost entirely from the same class. The algorithm instead generates a leaf labeled with the majority class, representing the decision taken by the classifier.

A number of criteria and quality measurements can be used to determine the best attribute on which to divide the dataset available at each node, as well as the test based on that attribute. For instance, Gini impurity [14] and information gain [15] can be used for classification tasks.

To classify a newly arriving example, the decision tree first evaluates the test at the root using the value of the corresponding attribute. Based on the outcome of the test, the example follows one of the branches and moves to the next node. This process is repeated recursively at every node until the example reaches a leaf, which provides its classification.
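The recursive descent just described can be sketched with a minimal, hand-built node structure; the class names and the tiny tree are purely illustrative.

```python
class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    """Internal decision node: tests one attribute against a threshold."""
    def __init__(self, attribute, threshold, left, right):
        self.attribute, self.threshold = attribute, threshold
        self.left, self.right = left, right   # branches for the two test outcomes

def classify(node, example):
    """Follow the branches from the root until a leaf provides the label."""
    while isinstance(node, Node):
        node = node.left if example[node.attribute] <= node.threshold else node.right
    return node.label

# Tiny hand-built tree: split on attribute 0, then on attribute 1
tree = Node(0, 5.0,
            Leaf("A"),
            Node(1, 2.0, Leaf("B"), Leaf("C")))
assert classify(tree, {0: 3.0, 1: 9.9}) == "A"
assert classify(tree, {0: 7.0, 1: 1.0}) == "B"
assert classify(tree, {0: 7.0, 1: 3.0}) == "C"
```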

2.5.2 Hoeffding Tree

The Hoeffding Tree (HT) [16] is a state-of-the-art decision tree method designed for learning from massive data streams. It was first proposed by Domingos et al. in [16], and its name is derived from the Hoeffding bound used for split decisions in the tree. The main idea behind the algorithm is that a small subset of the training instances may be sufficient to choose the best splitting attribute at a given node. The Hoeffding bound supports this idea mathematically and is explained as follows.

Assume that the random variable r, with range R, is the attribute selection measure used to choose the best splitting attribute at each node in the Hoeffding tree. Suppose also that N independent observations of r have been made and that the estimated value of their mean is r̄. The Hoeffding bound then ensures that, with probability 1 − δ, the true mean of r is at least r̄ − ε, where δ is user specified and ε is:

ε = √(R² ln(1/δ) / 2N)    (2.1)

Concretely, the Hoeffding bound is applied when newly arrived examples reach a frontier node, a node at the frontier of the growing tree that has not yet been split during training. At a frontier node, the Hoeffding test is satisfied when there is a confidence of 1 − δ that the optimal attribute has been chosen by the quality measurement method. At this point the split is performed and the frontier node becomes an internal node.
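Equation (2.1) can be evaluated directly; the numbers below are illustrative, chosen only to show how the bound shrinks as more observations arrive, which is what eventually lets a split be accepted with confidence 1 − δ.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound (Eq. 2.1): with probability 1 - delta, the true mean of a
    random variable with range `value_range` lies within epsilon of the mean
    estimated from n independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# The bound tightens with more observations, so waiting for more instances
# eventually makes the current best split attribute acceptable.
eps_small_n = hoeffding_bound(value_range=1.0, delta=1e-7, n=100)
eps_large_n = hoeffding_bound(value_range=1.0, delta=1e-7, n=100_000)
```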

2.5.3 Random forest

Random forest (RF) [2] is an ensemble algorithm developed by Breiman and is widely used in batch (non-stream) learning for regression, classification, and other tasks. The algorithm grows a number of decision trees and combines them into a single model. For classification problems, the final prediction made by the algorithm is based on the majority of votes from all of the trees, where each individual tree makes a class prediction. The fundamental idea behind RF is to build a forest of trees that have low correlations with each other, or are uncorrelated, in order to avoid overfitting and make robust predictions. While some trees in the forest are sensitive to noise and predict incorrectly, many other trees can be right; as long as the trees are not correlated, the plurality of their votes is less prone to overfitting and the expected error is lower. To decrease the correlations between the trees and increase the diversity in the forest, the algorithm uses two methods when building each individual tree: random selection of features and bagging (bootstrap aggregating) [1].

Bagging [1] is an ensemble algorithm designed to generate a number of different datasets, each of N records, from a given training dataset of N records. The algorithm generates the datasets by randomly sampling the records with replacement. Hence, each record in the original dataset can be repeated in each bootstrapped sample K times, where P(K = k) follows a binomial distribution [17]. This binomial distribution tends to a Poisson(1) distribution when the size N of the bootstrapped sample is large. The classifiers (decision trees) in the bagging algorithm, as in RF, are then each trained on one of the generated datasets instead of on the original dataset. This results in different trees, because decision trees are very sensitive to their training data.

Random selection of features is used while growing the trees and splitting each node. In the standard decision tree algorithm, the entire set of features is inspected at each node to find the most informative feature on which to split. By contrast, to split each node in a random forest, a random subset of m < M features is considered, where M is the total number of features.

The use of bagging and random feature selection makes the RF algorithm more powerful than simple ensembles of trees and less prone to overfitting, provided the forest has enough trees. It has also been shown that the Random Forest algorithm is fast to train and effective on different data sets [2]. Additionally, the algorithm can handle many forms of data, including data with missing values, and produces highly accurate predictions.

2.5.4 Adaptive random forest

Adaptive random forest (ARF) [3] is an extension of the original Random Forest algorithm designed to deal with evolving data streams. The adaptive algorithm combines techniques from Random Forest with methods to dynamically cope with different kinds of concept drifts. The standard random forest is a batch ensemble trained on a complete, static data set; RF is thus not appropriate for learning from sequential streams of data that arrive continuously. To make the RF algorithm operate in online mode, some adaptations are required. One of them is that the base learners in ARF are Hoeffding trees, which are capable of learning from massive data streams, instead of the standard decision trees used in RF.

In the original RF, two methods, bagging [1] and random selection of features, are applied to decrease the correlations between the base models and increase the diversity of the ensemble. In the same way, ARF adds diversity by selecting a random subset of features for node splits while inducing each Hoeffding tree in the forest. This is achieved by modifying the Hoeffding tree algorithm, which otherwise considers all attributes to find the best one for splitting.

In online learning, it is infeasible to use the non-stream bagging method to draw random samples with replacement from the original training data, because the method needs multiple passes over the data, which in the online setting has an unknown size and arrives continuously. Instead, ARF includes an effective resampling algorithm based on the Online Bagging algorithm, a bootstrap aggregating process for data streams proposed in [18].

In the online bagging algorithm [18], as each new training instance arrives, and for each base model in the ensemble, the current instance is used to train the base model w times in a row. This amounts to weighting the instance with a value w, where w is a random number drawn from a Poisson(λ = 1) distribution. Hence, online bagging simulates sampling with replacement (original bagging) by using Poisson(1) to weight instances. This is based on the fact that the binomial distribution used in non-stream bagging tends to a Poisson(1) distribution when the size of the data stream tends to infinity, as is the case in online learning.

The change in ARF's resampling strategy is to increase the diversity parameter of the Poisson distribution to λ = 6. This assigns a different range of weights to the samples and thus increases the input-space diversity inside the ensemble.
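The Poisson-weighted resampling can be sketched as follows. The Poisson sampler below uses Knuth's multiplication method, since Python's standard `random` module does not provide one; with a large sample the empirical mean weight approaches λ, i.e. 1 for Online Bagging and 6 for ARF.

```python
import math
import random

def poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's algorithm (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def instance_weights(n_models, lam, rng):
    """Online bagging: each base model trains on the arriving instance w times,
    with w ~ Poisson(lam). Online Bagging uses lam = 1; ARF uses lam = 6."""
    return [poisson(lam, rng) for _ in range(n_models)]

rng = random.Random(0)
w_bagging = [poisson(1, rng) for _ in range(10_000)]
w_arf = [poisson(6, rng) for _ in range(10_000)]
mean_bag = sum(w_bagging) / len(w_bagging)   # approximately 1
mean_arf = sum(w_arf) / len(w_arf)           # approximately 6
```

Note that a weight of 0 means the model simply skips the instance, which mirrors a record being left out of a bootstrap sample.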

A common challenge in online learning, where data is collected over time, is that the concept underlying the data may unpredictably drift, which negatively impacts the performance of the predictive model and makes it obsolete over time. The objective of ARF is to include mechanisms to cope with different kinds of concept drifts. Concretely, ARF uses a drift detector for each tree in the ensemble to monitor warnings and drifts. As soon as a warning is detected in a tree, the algorithm creates a background tree and starts training it along with the ensemble. The background tree can later be used for predictions instead of the active tree if the warning escalates to a drift. This strategy differs from the default approach, which resets base trees immediately after detecting a drift and uses them without pre-training on any instances, which can negatively impact the model's predictions.

To summarize, the pseudo-code of the ARF algorithm is shown in Figure 2.2 and explained in the following.

1. The algorithm begins by initializing a defined number of trees. It then starts receiving the data stream, sending each newly arriving instance to each tree in the ensemble.

2. ARF works in the test-then-train setting: each new instance is first used to test the model, i.e. to make a prediction and estimate its performance, and is then used to train the model. This means that the learner is always tested on data it has not yet seen.

3. The tree training function is based on online bagging with λ = 6, selecting split attributes from a random subset of features of a user-specified size.

4. The algorithm then uses a permissive threshold to detect warnings in each tree, and creates a background tree when a warning is detected.

5. Later, if a drift is detected using the stricter drift threshold, the original tree is replaced by its background tree.

6. Finally, ARF trains all the background trees on the current instance.
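The six steps above can be condensed into a control-loop sketch. This is not the authors' implementation: the tree and detector classes are placeholders supplied by the caller, and only the warning/drift bookkeeping of ARF is shown.

```python
class ArfEnsemble:
    """Condensed ARF control flow: test-then-train, warning -> grow a background
    tree, drift -> promote the background tree. `make_tree` and `make_detector`
    are placeholder factories supplied by the caller."""
    def __init__(self, make_tree, make_detector, n_trees):
        self.make_tree = make_tree
        self.trees = [make_tree() for _ in range(n_trees)]
        self.warning = [make_detector(0.01) for _ in range(n_trees)]   # permissive
        self.drift = [make_detector(0.001) for _ in range(n_trees)]    # strict
        self.background = [None] * n_trees

    def process(self, x, y):
        for i, tree in enumerate(self.trees):
            correct = tree.predict(x) == y            # test first...
            tree.train(x, y)                          # ...then train
            if self.warning[i].detected(correct):     # warning: start a background tree
                self.background[i] = self.make_tree()
            if self.drift[i].detected(correct) and self.background[i]:
                self.trees[i] = self.background[i]    # drift: promote background tree
                self.background[i] = None
            if self.background[i]:                    # keep training background trees
                self.background[i].train(x, y)
```

In the real algorithm the per-tree detectors would be ADWIN instances monitoring the error rate, and each tree would also apply the Poisson(6) instance weighting during training.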

Figure 2.2: Adaptive random forest algorithm

2.6 Related Work

2.6.1 Ensemble methods for data streams

The two main strategies for ensembling online models are Online Bagging (OzaBag) and Online Boosting (OzaBoost) [18]. Both aim to improve the accuracy of base learners by combining them in an ensemble, but they use different techniques. Online bagging often outperforms online boosting according to early studies comparing them [19, 20], which can be attributed to the higher sensitivity of online boosting to noise.

Online bagging simulates the original bagging algorithm [1], which trains each base learner on a bootstrap sample generated by resampling from the original training data with repetitions. This resampling strategy is achieved online by training each classifier on the arriving data example k times in a row, where k is a random number generated by Poisson(λ = 1) and known as the instance's weight.

Similarly, Online Boosting [18] gives each newly arriving instance a weight k drawn from Poisson(λ = 1). However, online boosting increases λ in the Poisson distribution for an arriving instance if it is misclassified by the current learner, so that the misclassified instance receives more attention from the next learner. In this way, each succeeding model is given a varied amount of diversity depending on the performance of the previous model.

2.6.2 Ensemble methods with drift detectors

Many state-of-the-art ensemble algorithms for data streams are coupled with drift detectors in order to adapt to concept drifts. This approach is specifically useful for rapidly recovering from sudden drifts. ADWIN Online Bagging [19], Leveraging Bagging [17], FASE [21], and DDD [22] are well-known ensemble algorithms that use this approach.

ADWIN Online Bagging [19] is essentially the Online Bagging algorithm [18] equipped with the ADaptive WINdowing (ADWIN) algorithm [5] as a drift detector. Each base classifier of the ensemble has an instance of ADWIN that monitors its error rate. When an ADWIN instance detects a change in its classifier, the algorithm replaces that classifier with a new one.

Leveraging Bagging [17] is a modified version of Online Bagging with two main changes that add more randomization to the input and output of the classifiers. Leveraging Bagging increases the input-space diversity inside the ensemble by sampling the data stream with Poisson(λ = 6) instead of the Poisson(λ = 1) used in Online Bagging. Increasing the diversity value λ means using more training data instances relative to original Online Bagging.

In addition, Leveraging Bagging adds randomization to the output of the classifiers in order to decrease the correlations between them. This is achieved by changing the standard way the ensemble model makes predictions: in particular, Leveraging Bagging makes each classifier predict a different classification function, instead of having all classifiers predict the same function. This strategy helps to increase the diversity of the ensemble.

Leveraging Bagging [17] also includes the same adaptation technique used in ADWIN Bagging to handle concept drift problems. The experiments conducted in [17] show that Leveraging Bagging achieves better accuracy than ADWIN Online Bagging.

2.6.3 Dynamic weighted majority

Some online ensemble algorithms handle concept drifts by constantly resetting low-performing learners instead of using drift detection methods. This reactive approach is particularly appropriate for recovering from gradual drifts.


Dynamic weighted majority (DWM) [10] is an online weighted ensemble classification algorithm designed to deal with the problem of concept drift without explicitly detecting drifts. The main idea of DWM is to reduce the ensemble error by dynamically changing the weights of the classifiers according to their performance on the training data. DWM uses four techniques to cope with concept drift. First, the algorithm trains each weighted online classifier of the ensemble on the arriving data instance. Second, it reduces the weight of a learner if it predicts incorrectly. Third, if the global prediction of the ensemble is incorrect, it adds a new learner to the ensemble and initializes its weight to one. Finally, DWM removes classifiers whose weight has become very low due to low accuracy on many data examples.
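The four DWM techniques can be sketched as a single update step. The penalty factor `beta` and removal threshold `theta` below are assumed illustrative constants, not the values used in [10], and the expert interface is a placeholder.

```python
def dwm_step(experts, x, y, make_expert, beta=0.5, theta=0.01):
    """One DWM update on instance (x, y).

    experts -- mutable list of [model, weight] pairs
    beta    -- multiplicative penalty for a wrong expert vote (assumed value)
    theta   -- weight below which an expert is removed (assumed value)
    """
    # Weighted vote of all experts forms the global prediction
    tally = {}
    for model, weight in experts:
        pred = model.predict(x)
        tally[pred] = tally.get(pred, 0.0) + weight
    global_pred = max(tally, key=tally.get)

    for pair in experts:
        model, weight = pair
        if model.predict(x) != y:       # penalize experts that voted wrongly
            pair[1] = weight * beta
        model.train(x, y)               # train every expert on the instance

    if global_pred != y:                # ensemble was wrong: add a fresh expert
        experts.append([make_expert(), 1.0])

    experts[:] = [p for p in experts if p[1] >= theta]  # drop very weak experts
    return global_pred
```

Because weights only shrink on mistakes and new experts start at full weight, the ensemble gradually shifts influence toward models that fit the current concept.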

2.6.4 Dynamic streaming random forest

The Dynamic Streaming Random Forest (DSRF) [23] is a version of Random Forest [2] adapted for stream classification problems and designed to reflect concept changes in the underlying data. The DSRF algorithm sequentially trains a defined number of Hoeffding trees [16], each on a different number of data instances (the tree window). The value of the tree window is estimated using specific parameters, such as the tree threshold, which are dynamically updated. The dynamic update of the parameter values depends on the performance of the tree on the recent batch of training data. During each training phase, the current classification error of the tree is compared with the tree threshold parameter to decide whether to stop building the tree. This ensures that no tree performs worse than a random tree.

Similarly to the ARF algorithm [3], the DSRF algorithm adds diversity to the ensemble by evaluating a random subset of attributes for the split at every node in the Hoeffding tree. However, whenever a new batch of labeled instances arrives, the DSRF algorithm resets the 25% of its trees that have the highest classification error. At the same time, the algorithm estimates the extent of the concept drift in the data using an entropy-based technique [24]. Based on the significance of the concept drift, more trees are reset to reflect the new distribution, as are the values of the hyperparameters.


Chapter 3

Method

In the initial stage, the research was based on a deductive approach. We started by gathering information about the current state of the art in ensemble online machine learning algorithms for classification problems. A quantitative approach was then used for benchmarking the mean accuracy of ARF when tested on different types of concept drifts using two different data streams.

This chapter describes the method of the report and the technical implementation of six experiments using ARF. It begins with a section describing the data streams and concept drifts used to perform the experiments. The following section describes how ARF was configured and how drifts were detected and handled. The last section describes the learning process.

3.1 Software frameworks

In this study, all experiments have been configured and executed using the scikit-multiflow framework [25]. Scikit-multiflow is an open-source framework implemented in Python and intended for learning from data streams and adaptation to drift. It includes several state-of-the-art learning algorithms, generators for data streams and concept drifts, concept drift detectors, and metrics to evaluate stream learning.

3.2 Data streams

A data stream is defined as a real-time, ordered sequence of data items that carry a timestamp and potentially arrive continuously from their sources. Each data item is a vector of attributes that becomes available at a specific time, when an event occurs, and is used for building and maintaining models.

This work is based on two stream generator classes, SEA and Agrawal, which are implemented in the scikit-multiflow framework. Stream generators produce new samples of synthetic data on demand, using the next_sample method, so that no data has to be stored physically. This makes synthetic data generators a cheap source of data.
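On-demand generation can be mimicked with an ordinary Python generator. The two-attribute stream and the SEA-like labeling rule below are a toy illustration of ours, not the scikit-multiflow implementation:

```python
import random

def toy_stream(seed):
    """Lazily yields (attributes, label) pairs; no sample is ever stored."""
    rng = random.Random(seed)
    while True:
        att1, att2 = rng.uniform(0, 10), rng.uniform(0, 10)
        yield (att1, att2), int(att1 + att2 <= 8)

stream = toy_stream(seed=42)
x, y = next(stream)  # a sample only exists once it is requested
```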

3.2.1 SEA

The SEA generator is the class responsible for producing artificial data streams and was first described in [26]. Each generated sample of the stream contains one label and 3 numerical attributes with values in the range between 0 and 10. The first two attributes are relevant to the classification task while the third is a noise attribute, irrelevant to the target class.

Any instance generated by SEA is classified as one of two target classes, 0 or 1. The classification decision is made by the classification function, which can be chosen among four possible ones. For each classification function, the class label depends on comparing the sum of the first two attributes with a threshold value θ: if att1 + att2 ≤ θ the instance belongs to class 1, otherwise it belongs to class 0. The threshold is unique for each of the classification functions: 8 for function 1, 9 for function 2, 7 for function 3 and 9.5 for function 4. The classification function thus describes the target concept of the data, and changing the classification function introduces the concept drift phenomenon.
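The labeling rule can be written down directly. The thresholds are those listed above, the class convention (class 1 when the sum is at most the threshold) follows this section's description, and the 0-based function indexing mirrors how the functions are selected later in this chapter:

```python
# Thresholds for SEA classification functions 0-3 (functions 1-4 above).
SEA_THRESHOLDS = [8.0, 9.0, 7.0, 9.5]

def sea_label(att1, att2, att3, function=0):
    # att3 is the noise attribute and is deliberately ignored.
    return 1 if att1 + att2 <= SEA_THRESHOLDS[function] else 0

sea_label(3.0, 4.0, 9.9, function=0)  # 3 + 4 = 7 <= 8, so class 1
sea_label(6.0, 4.0, 0.1, function=2)  # 10 > 7, so class 0
```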

In brief, two main parameters play a role in generating data streams: the classification function, which decides the label of the data, and the random state, the seed used by the random number generator to produce the values of the three attributes of a sample. In this study, we always initialize the classification function with 0 for the original SEA stream and with 3 for the SEA drift stream. For the random state we tested 10 different values to reduce the effects of choosing a specific random seed on the results.

3.2.2 Agrawal

The Agrawal generator is the class responsible for producing artificial data streams and was first introduced by Agrawal et al. in [27]. The data instances generated by Agrawal represent hypothetical loan applications, where the attribute values are information from people applying for a loan and the classification function decides whether the loan is approved. Each generated instance contains one label and 9 input attributes, described in Figure 3.1 [27]. The values of the attributes are randomly selected by the random generator; 3 of the attributes are categorical (elevel, car, and zipcode) and the others are numerical.


Figure 3.1: Description of instance attributes in Agrawal data stream

The target class for an instance is determined according to one of 10 classification functions which map instances into two different classes. Table 3.1 describes the ten classification functions used in the Agrawal data stream.

Classification function   Description
Function 1                involves a predicate with ranges on Age
Function 2                involves predicates with ranges on Age and Salary
Function 3                involves predicates with ranges on Age and Elevel
Function 4                involves predicates with ranges on Age, Elevel and Salary
Function 5                involves predicates with ranges on Age, Salary and Loan
Function 6                involves predicates with ranges on Age, Salary and Commission
Function 7                linear function of Salary, Commission and Loan
Function 8                linear function of Salary, Commission and Elevel
Function 9                linear function of Salary, Commission, Elevel and Loan
Function 10               nonlinear function of Salary, Commission, Elevel, Hvalue and Hyears

Table 3.1: Description of classification functions in Agrawal data stream

Similarly to the SEA generator, two main parameters play a role in generating Agrawal data streams: the classification function and the random state. In this study, we always initialize the classification function with 0 for the original Agrawal stream and with 2 for the Agrawal drift stream. For the random state we tested 10 different values to reduce the effects of choosing a specific random seed on the results.
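As an illustration of the predicate-style functions in Table 3.1, one plausible reading of function 1 (a predicate with ranges on Age) is sketched below; the exact age boundaries and the mapping of the two groups to labels 0/1 are our assumptions, not taken from the generator's source:

```python
def agrawal_function_1(age):
    # Hypothetical reading of "a predicate with ranges on Age":
    # middle-aged applicants (40 <= age < 60) form one class,
    # everyone else the other.
    return 1 if 40 <= age < 60 else 0
```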


3.2.3 Concept drift stream

To generate a stream with concept drift, the stream generator ConceptDriftStream of scikit-multiflow is used. Given that the target concept before the drift is A and after the drift is B, the concept drift event is modeled as a weighted combination of two pure distributions that characterize A and B [17]. In other words, the transition from one concept to the other, the window of change, is modeled in scikit-multiflow by the sigmoid function. Thus, the probability that a new instance t of the stream belongs to B, the new concept after the drift, is given by the sigmoid function [11]:

f(t) = 1 / (1 + e^(−4(t−p)/w))    (3.1)

where p is the central point of the concept drift change and w is the width of the change. Based on this, the probability that t belongs to A is high at the start of the window, while the probability that t belongs to B increases towards the end of the window. All instances belong to B after the window is over.
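Equation 3.1 can be checked numerically, and the mixing of the two concepts expressed directly; concept_a and concept_b below are stand-ins of ours for the two pure generators:

```python
import math
import random

def drift_probability(t, p, w):
    """Probability that sample t is drawn from the new concept B (Eq. 3.1)."""
    return 1.0 / (1.0 + math.exp(-4.0 * (t - p) / w))

def next_label(t, p, w, rng):
    concept_a = lambda: 0  # stand-in for the pre-drift generator
    concept_b = lambda: 1  # stand-in for the post-drift generator
    return concept_b() if rng.random() < drift_probability(t, p, w) else concept_a()

drift_probability(5000, p=5000, w=1000)  # 0.5 exactly at the centre of the drift
rng = random.Random(0)
# Far before the drift the sample comes from A, far after from B.
labels = [next_label(t, p=5000, w=1000, rng=rng) for t in (0, 10000)]
```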

To generate data streams with concept drifts we first define the original stream concept and the drift stream concept. For the gradual drift, we configure a width w = 1000 instances over which the drift takes place, with the position of change, p, at sample number 5000. For the sudden drift we set w = 1 and p = 5000.

However, scikit-multiflow does not provide recurring drifts, so we implemented a new class, RecurringDriftStream, to generate recurring drifts. The sigmoid function we use here, for period_width = 1000 and w = 100, is:

f(t) = 1 / (1 + e^(−4(period_pos − period_drift_pos)/w))    (3.2)

where period_pos = t mod period_width and period_drift_pos = period_width // 2.
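The per-period probability of Equation 3.2 can be written as a direct translation; this is a sketch of the idea behind our RecurringDriftStream class, not its full implementation:

```python
import math

def recurring_drift_probability(t, period_width=1000, w=100):
    """Probability that sample t belongs to the drift concept; the sigmoid
    of Eq. 3.2 restarts at the beginning of every period."""
    period_pos = t % period_width
    period_drift_pos = period_width // 2
    return 1.0 / (1.0 + math.exp(-4.0 * (period_pos - period_drift_pos) / w))

recurring_drift_probability(500)   # 0.5 at mid-period
recurring_drift_probability(1500)  # 0.5 again, one full period later
```

Because the position is taken modulo period_width, the same transition repeats in every period, which is what makes the drift recurring.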

3.3 Experimental settings

The six experiments conducted in this study using ARF are:

1. SEA data stream with gradual drift
2. SEA data stream with sudden drift
3. SEA data stream with recurring drift
4. Agrawal data stream with gradual drift
5. Agrawal data stream with sudden drift
6. Agrawal data stream with recurring drift

The objective of these six experiments is to see how ARF performs in each of these scenarios. The evaluation measure used to assess the predictive performance of ARF is the classification mean accuracy. To reduce the effects of choosing a specific random seed on the results, each benchmark was repeated with 10 different random seeds: 110, 111, 112, 113, 114, 115, 116, 117, 118 and 119.

For consistency, the same configuration of ARF was used for all experiments. The configuration is as follows: ARF grows 10 trees in the ensemble, and the final prediction of the ensemble is based on the weighted vote method. This method gives a weight to each classifier and effectively counts the votes of the better classifiers multiple times.

The number of attributes considered for splitting at each node equals √M, where M is the total number of attributes. In addition, the warning and drift detection method used in all experiments is ADaptive WINdowing (ADWIN) [5], where the threshold for warning detection is 0.01 and the threshold for drift detection is 0.001.
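Putting the pieces together, the configuration above corresponds roughly to the following scikit-multiflow snippet; class and argument names assume scikit-multiflow 0.5 and may differ in other versions, so treat this as a sketch rather than the exact code used:

```python
from skmultiflow.meta import AdaptiveRandomForestClassifier
from skmultiflow.drift_detection import ADWIN

arf = AdaptiveRandomForestClassifier(
    n_estimators=10,                             # 10 trees in the ensemble
    max_features="auto",                         # sqrt(M) attributes per split
    warning_detection_method=ADWIN(delta=0.01),  # warning threshold
    drift_detection_method=ADWIN(delta=0.001),   # drift threshold
)
```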

3.4 Training and Benchmarking

In this section we present how we chose to train the ARF classifier on SEA and Agrawal data streams. We also show how accuracy was measured using prequential evaluation, which tests the model on the arriving data before using it for training [6]. The training and benchmarking process can be summarized by the following steps, which were repeated for each permutation of data streams and concept drifts we study.

1. Create an object of ConceptDriftStream class in order to generate a stream with concept drift, as described in section 3.2.3.

2. Instantiate the ARF classifier, as shown in section 3.3.

3. Train the classifier initially on 100 samples to enforce a ‘warm’ start before using the evaluation method.

4. Instantiate the EvaluatePrequential evaluator, and specify 10000 samples to evaluate during the evaluation process and 100 samples to process between each test.

5. Use the evaluate() function to start the evaluation process, which works as follows. For each new sample X generated by the stream:

(a) Make a prediction on the sample X using predict(), in order to test the classifier and update statistics about the performance.

(b) Train the classifier on the sample X.
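The test-then-train loop behind EvaluatePrequential can be sketched in a few lines; the constant stream and the trivial learner below are toy stand-ins used only to make the loop runnable:

```python
def prequential_accuracy(stream, learner, n_samples):
    """Prequential evaluation: score each sample before learning from it."""
    correct = 0
    for _ in range(n_samples):
        x, y = next(stream)
        if learner.predict_one(x) == y:  # test first...
            correct += 1
        learner.learn_one(x, y)          # ...then train
    return correct / n_samples

class LastLabelLearner:
    """Toy learner that predicts the most recent label it has seen."""
    def __init__(self):
        self.last = 0
    def predict_one(self, x):
        return self.last
    def learn_one(self, x, y):
        self.last = y

def constant_stream(label):
    while True:
        yield None, label

acc = prequential_accuracy(constant_stream(1), LastLabelLearner(), 100)
# Only the very first prediction is wrong, so acc == 0.99.
```

Because every sample is scored before it is learned, the resulting accuracy reflects how the model would have performed on genuinely unseen data, which is why prequential evaluation is standard for streams.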


Chapter 4

Results

In this chapter, we present the results of the various experiments performed to test and benchmark the predictive accuracy of ARF. The experiments have been configured and executed using the scikit-multiflow framework. To reduce the effects of choosing a specific random seed on the results, the same benchmarks have been repeated 10 times for each permutation of data streams and concept drifts.

4.1 Gradual concept drift

Figure 4.1: ARF accuracy over time for gradual data streams. The concept drift starts gradually after 4500 samples and ends after 5500 samples.


Comparing the results from the SEA and Agrawal data streams we see a number of differences. First, we notice that the drop in accuracy in the period corresponding to the concept drift is larger for Agrawal. Second, we notice that the recovery from the accuracy drop is somewhat faster for the SEA data stream compared to Agrawal. The mean accuracy over the course of the experiment is 0.9692 for SEA and 0.9104 for Agrawal. ARF therefore seems to show better accuracy with the SEA data stream in this experiment.

4.2 Sudden concept drift

Figure 4.2: ARF accuracy over time for sudden data streams. The concept drift happens abruptly after 5000 samples.

For the sudden concept drift experiment, we notice that the drop in accuracy is sharp for both SEA and Agrawal, although the drop is deeper for Agrawal. The recovery from the accuracy drop is steeper in both cases compared to the gradual concept drift experiment. The accuracy for both SEA and Agrawal stabilizes again after 1000 to 1500 samples. The final mean accuracy for the experiment is 0.9696 for SEA and 0.8964 for Agrawal. As in the gradual case, we observe that ARF performs better with SEA than with Agrawal.


4.3 Recurring concept drift

Figure 4.3: ARF accuracy over time for recurring data streams. A concept drift starts every 1000 samples and lasts for 100 samples.

In the recurring concept drift case, we observe that the accuracy oscillates for both SEA and Agrawal around the points of concept drift. The oscillations are wider for Agrawal, where the drop in accuracy is deeper. In both cases, the oscillations happen within a bounded range, without trending upwards or downwards. The mean accuracy for this experiment is 0.9295 for SEA and 0.7476 for Agrawal.

4.4 Statistical summary

stream    concept drift  accuracy mean  accuracy std
SEA       gradual        0.9692         0.0095
SEA       sudden         0.9696         0.0101
SEA       recurring      0.9295         0.0068
Agrawal   gradual        0.9104         0.0498
Agrawal   sudden         0.8964         0.0472
Agrawal   recurring      0.7476         0.0411

Table 4.1: Statistical summary of running the benchmarks with 10 different random seeds.


In Table 4.1, we present a statistical summary of the accuracy after running the benchmarks with 10 different random seeds. For a more detailed view, see Appendix 2, which contains the data for each benchmark execution.


Chapter 5

Discussion

In light of the obtained empirical results, we notice slight differences in the behaviour of ARF between the different test cases. The differences seem to depend on both the actual data stream (SEA or Agrawal) and the type of concept drift (gradual, sudden or recurring). In this section, we provide our interpretations of the results and of the differences in performance between the various test cases.

The accuracy of ARF seems to depend on the type and structure of the data stream. The accuracy for SEA has been consistently better than for Agrawal over all types of concept drifts. This observation is expected and not surprising. While both SEA and Agrawal have the same number of target classes (two), the input spaces differ in both dimensionality and type of attributes (numerical or categorical). SEA has 3 numerical attributes, one of which is a noise attribute [26]. Agrawal, on the other hand, has 9 input features: 6 numerical attributes and 3 categorical. In general, the higher the input dimensionality, the harder the classification task. This phenomenon is common in machine learning and is known as the curse of dimensionality [28]. It is unclear whether the differences in accuracy between SEA and Agrawal can be solely attributed to the differences in input space dimensions. The type of attributes and the complexity of the relationships between these attributes and the output target could also affect the model accuracy. Taking that into consideration, it is likely that the better accuracy results for SEA can be attributed to a combination of the smaller input space and less complex relationships between the input space and the output targets.

The type and form of the concept drift are other factors that seem to affect the accuracy of ARF. Comparing sudden and gradual concept drifts we observe relatively small differences in mean accuracy. For SEA, ARF accuracy was almost the same between the sudden and gradual cases; the mean accuracies are 0.9696 and 0.9692 respectively. ARF behaviour thus seems to be almost identical for sudden and gradual concept drifts with SEA. On the other hand, Agrawal performed better with the gradual drift: ARF's test accuracy was 0.9104 for the gradual case and 0.8964 for the sudden case. The difference for Agrawal is still small but more noticeable than for SEA. Comparing this to the results obtained by a similar study, which also examined the behaviour of ARF with different data streams and concept drifts, we notice a number of differences. There, ARF's accuracy is higher for the sudden case than for the gradual case for both SEA and Agrawal [3, p. 1490]: SEA reached an accuracy of 0.7811 for the sudden case and 0.7729 for the gradual case, while Agrawal reached 0.7721 for the sudden concept drift and 0.6649 for the gradual case. Nevertheless, it is not straightforward to compare the results obtained by our study with those obtained by [3]. Firstly, the implementation of ARF used in our study differs from the implementation used by [3]. Secondly, the experimental settings of the two studies differ in the number of base learners, model hyperparameters, configuration of concept drifts, and number of samples used to perform each benchmark. In light of this, the sudden concept drift could indeed be detected faster by the drift detector, forcing the model to react faster and update its parameters accordingly. This quick drop and recovery of accuracy in the sudden case could result in better accuracy than in the gradual case, where the drop in accuracy at the start of the drift is less noticeable, delaying the model adaptation. On the other hand, the transition period is longer for gradual drifts and could result in better trained background trees by the time the gradual concept drift is finally detected. Validating these hypotheses would require collecting additional data about how and when the model reacts to the concept drift. Another possible interpretation is that updating the parameters is more costly for Agrawal than for SEA. This could lead to a steeper drop in accuracy in the sudden case and a longer time to recover. This also needs to be studied more thoroughly to reach conclusive results. Lastly, the role of randomness in such small sample sizes must be taken into consideration.

Recurring concept drifts had the lowest accuracy of all concept drift types for both SEA and Agrawal: 0.9295 and 0.7476 respectively. This is likely a result of the higher frequency of concept drifts in this test configuration. Within the same period of time, more samples arrive during transition periods in the recurring case than in the gradual and sudden cases, where there is only one transition period. The model could therefore be struggling to keep a high accuracy while adapting to more frequent concept drifts. In a similar study [29, p. 11], ARF had an accuracy of 0.699 when tested with a recurring concept drift Agrawal stream. As for the gradual and sudden cases, it is hard to do a one-to-one comparison due to varying implementations and test settings.

5.0.1 Limitations

There are some limitations that have to be taken into consideration when interpreting the results of this study. Firstly, the test cases are far from exhaustive or adequate for drawing generalized conclusions. This study compares only 2 data streams and 3 types of concept drifts. The results obtained from the SEA and Agrawal data streams with these fixed concept drift configurations might therefore not extrapolate to other data streams and concept drift configurations. Secondly, there is an unbounded space of hyperparameters for tuning ARF and the drift detector. The default settings used in our study should not be taken as the optimal ones; a more systematic parameter search could yield better accuracy. Thirdly, there is also an unbounded number of ways to configure concept drifts. For example, changing the position, width, classification function, and frequency of the concept drift might yield different results. Finally, both of these streams are synthetic, and ARF might show different behaviour when dealing with data streams in real-world settings. In such settings, ARF would have to handle unclean data, noise, missing features, and other problems that emerge in practical streaming applications.

5.0.2 Ethics and sustainability

There are a number of general ethical considerations that have to be thought through carefully whenever machine learning and big data technologies are used for practical applications. These technologies have the potential to cause social disruption in the form of lost jobs due to automation. Privacy and integrity also have to be taken into consideration when using these algorithms to process personal information.

Online machine learning methods are in general more efficient than offline batch methods. Offline machine learning models usually have to redo the entire training process to keep the model up to date when new data becomes available. This process is usually computationally intensive and resource-heavy, leading to excess energy consumption. Online machine learning models, on the other hand, can update the model on the fly without retraining from scratch. This should theoretically reduce energy consumption and contribute to more sustainable energy usage.

5.0.3 Future research

There are many possible extensions for future research. Future work could address the limitations highlighted in this study by trying different data streams, different concept drift configurations, and different model tuning approaches. It would also be interesting to contrast the results of ARF with other state-of-the-art models. ARF models could also be improved with better observability, to understand when and how adaptations happen in the model. Given that recurring data streams were the hardest test, future research could study different ways to improve performance on recurring data streams, for example by caching model parameters from previous concepts and reusing them when a recurring concept drift is detected. This should theoretically make it easier for the model accuracy to recover after a concept drift. This has already been studied by [29] using a solution that utilizes prior learning to improve classification performance in the current stream. That study uses the Enhanced Concept Profiling Framework (ECPF), which recognizes recurring concepts and reuses a classifier trained previously. A custom implementation of ARF inspired by ECPF might enable more accurate classification following a drift, and thus improve the accuracy for recurring concept drifts.


Chapter 6

Conclusion

This study demonstrates the ability of adaptive random forests to handle different types of data streams and concept drifts. The experiments performed in this study show that ARF accuracy was lower for the Agrawal data stream than for SEA, which could be due to higher input space dimensionality, more complex relationships between input attributes, or a combination of both. ARF managed to detect concept drifts and recover its accuracy after all 3 types of concept drifts.

The differences between gradual and sudden concept drifts were subtle. Gradual concept drifts resulted in slightly better accuracy than sudden concept drifts for Agrawal, but the differences are small. A possible interpretation of these small differences is that ARF has more time to build the background trees and adapt the model during a gradual concept drift. On the other hand, sudden concept drifts gave slightly better accuracy than gradual concept drifts for SEA. A sudden concept drift happens faster than its gradual counterpart, forcing the model to adapt faster and update the model parameters accordingly. However, it is hard to verify these hypotheses without conducting further research. Recurring concept drifts had the lowest accuracy compared to the sudden and gradual cases. This could be a consequence of more frequent concept drifts: in theory, the more frequent the concept drifts, the more mistakes the model makes during the transitions, hence the low accuracy for recurring concept drifts.

It is unclear whether these observations can be extrapolated to other data streams and concept drift configurations. Further research has to be conducted to analyze how the frequency and shape of the concept drift affect the performance characteristics of ARF.


Bibliography

[1] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

[2] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[3] Heitor M Gomes, Albert Bifet, Jesse Read, Jean Paul Barddal, Fabrício Enembreck, Bernhard Pfahringer, Geoff Holmes, and Talel Abdessalem. Adaptive random forests for evolving data stream classification. Machine Learning, 106(9-10):1469–1495, 2017.

[4] Jacob Montiel, Jesse Read, Albert Bifet, and Talel Abdessalem. Scikit-multiflow: A multi-output streaming framework. The Journal of Machine Learning Research, 19(1):2915–2914, 2018.

[5] Albert Bifet and Ricard Gavalda. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM international con-ference on data mining, pages 443–448. SIAM, 2007.

[6] A Philip Dawid. Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society: Series A (General), 147(2):278–290, 1984.

[7] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Philipp Kranen, Hardy Kremer, Timm Jansen, and Thomas Seidl. MOA: Massive online analysis, a framework for stream classification and clustering. In Proceedings of the First Workshop on Applications of Pattern Analysis, pages 44–50, 2010.

[8] Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39, 2010.

[9] Mehmed Kantardzic. Data mining: concepts, models, methods, and algo-rithms. John Wiley & Sons, 2011.

[10] J Zico Kolter and Marcus A Maloof. Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research, 8(Dec):2755–2790, 2007.

[11] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):1–37, 2014.


[12] Yange Sun, Zhihai Wang, Jidong Yuan, and Wei Zhang. Tracking recurring concepts from evolving data streams using ensemble method. Int. Arab J. Inf. Technol., 16(6):1044–1052, 2019.

[13] Gerhard Widmer. Tracking context changes through meta-learning. Machine Learning, 27(3):259–286, 1997.

[14] Stephen Marsland. Machine learning: an algorithmic perspective. CRC press, 2015.

[15] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning, volume 112. Springer, 2013.

[16] P Domingos and G Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80, Boston, 2000.

[17] Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. Leveraging bagging for evolving data streams. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 135–150. Springer, 2010.

[18] Nikunj C Oza. Online bagging and boosting. In 2005 IEEE International Conference on Systems, Man and Cybernetics, volume 3, pages 2340–2345. IEEE, 2005.

[19] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavaldà. New ensemble methods for evolving data streams. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 139–148, 2009.

[20] Nikunj C Oza and Stuart Russell. Experimental comparisons of online and batch versions of bagging and boosting. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 359–364, 2001.

[21] Isvani Frías-Blanco, Alberto Verdecia-Cabrera, Agustín Ortiz-Díaz, and Andre Carvalho. Fast adaptive stacking of ensembles. In Proceedings of the 31st Annual ACM Symposium on Applied Computing, pages 929–934, 2016.

[22] Leandro L Minku and Xin Yao. Ddd: A new ensemble approach for dealing with concept drift. IEEE transactions on knowledge and data engineering, 24(4):619–633, 2011.

[23] Hanady Abdulsalam, David B Skillicorn, and Patrick Martin. Classifying evolving data streams using dynamic streaming random forests. In International Conference on Database and Expert Systems Applications, pages 643–651. Springer, 2008.


[24] Peter Vorburger and Abraham Bernstein. Entropy-based concept shift detection. In Sixth International Conference on Data Mining (ICDM'06), pages 1113–1118. IEEE, 2006.

[25] Jacob Montiel, Jesse Read, Albert Bifet, and Talel Abdessalem. Scikit-multiflow: A multi-output streaming framework. The Journal of Machine Learning Research, 19(1):2915–2914, 2018.

[26] W Nick Street and YongSeog Kim. A streaming ensemble algorithm (sea) for large-scale classification. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 377–382, 2001.

[27] Charu C Aggarwal, S Yu Philip, Jiawei Han, and Jianyong Wang. A framework for clustering evolving data streams. In Proceedings 2003 VLDB Conference, pages 81–92. Elsevier, 2003.

[28] Jerome H Friedman. On bias, variance, 0/1—loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1(1):55–77, 1997.

[29] Robert Anderson, Yun Sing Koh, Gillian Dobbie, and Albert Bifet. Recurring concept meta-learning for evolving data streams. Expert Systems with Applications, 138:112832, 2019.


Chapter 7

Appendix 1 - Plots

7.1 SEA data stream


Figure 7.2: ARF accuracy over time (SEA-Sudden)


7.2 AGRAWAL data stream

Figure 7.4: ARF accuracy over time (AGRAWAL-Gradual)


Figure 7.6: ARF accuracy over time (AGRAWAL-Recurring)

7.3 Summary


Chapter 8

Appendix 2 - Benchmark data

8.1 All

Stream    concept drift  random state  accuracy
SEA       gradual        110           0.9727
SEA       gradual        111           0.9732
SEA       gradual        112           0.9746
SEA       gradual        113           0.9678
SEA       gradual        114           0.9692
SEA       gradual        115           0.9692
SEA       gradual        116           0.9724
SEA       gradual        117           0.9736
SEA       gradual        118           0.9714
SEA       gradual        119           0.9685
SEA       recurring      110           0.9207
SEA       recurring      111           0.9228
SEA       recurring      112           0.9208
SEA       recurring      113           0.9217
SEA       recurring      114           0.925
SEA       recurring      115           0.925
SEA       recurring      116           0.9246
SEA       recurring      117           0.9191
SEA       recurring      118           0.9253
SEA       recurring      119           0.9231
SEA       sudden         110           0.9734
SEA       sudden         111           0.9722
SEA       sudden         112           0.973
SEA       sudden         113           0.9699
SEA       sudden         114           0.9763
SEA       sudden         115           0.9708
SEA       sudden         116           0.9749
SEA       sudden         117           0.9739
SEA       sudden         118           0.9734
SEA       sudden         119           0.9708
AGRAWAL   gradual        110           0.8983
AGRAWAL   gradual        111           0.8931
AGRAWAL   gradual        112           0.8806
AGRAWAL   gradual        113           0.8698
AGRAWAL   gradual        114           0.8796
AGRAWAL   gradual        115           0.8912
AGRAWAL   gradual        116           0.8908
AGRAWAL   gradual        117           0.8945
AGRAWAL   gradual        118           0.9316
AGRAWAL   gradual        119           0.9044
AGRAWAL   recurring      110           0.7241
AGRAWAL   recurring      111           0.7139
AGRAWAL   recurring      112           0.7086
AGRAWAL   recurring      113           0.7172
AGRAWAL   recurring      114           0.7244
AGRAWAL   recurring      115           0.7239
AGRAWAL   recurring      116           0.7333
AGRAWAL   recurring      117           0.7221
AGRAWAL   recurring      118           0.7409
AGRAWAL   recurring      119           0.7392
AGRAWAL   sudden         110           0.9023
AGRAWAL   sudden         111           0.8884
AGRAWAL   sudden         112           0.8676
AGRAWAL   sudden         113           0.9102
AGRAWAL   sudden         114           0.875
AGRAWAL   sudden         115           0.9163
AGRAWAL   sudden         116           0.8975
AGRAWAL   sudden         117           0.9372
AGRAWAL   sudden         118           0.9152
AGRAWAL   sudden         119           0.8942


8.2 Summary

stream    concept drift  accuracy mean  accuracy std
SEA       gradual        0.9692         0.0095
SEA       sudden         0.9696         0.0101
SEA       recurring      0.9295         0.0068
AGRAWAL   gradual        0.9104         0.0498
AGRAWAL   sudden         0.8964         0.0472
AGRAWAL   recurring      0.7476         0.0411

