(1)

Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Bachelor thesis, 16 ECTS | Datateknik

2018 | LIU-IDA/LITH-EX-G--18/060--SE

Machine learning algorithms

in a distributed context

Maskininlärningalgoritmer i en distribuerad kontext

Samuel Johansson and Karol Wojtulewicz

Supervisor: Niklas Carlsson
Examiner: Marcus Bendtsen



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Students in the 5 year Information Technology program complete a semester-long software development project during their sixth semester (third year). The project is completed in mid-sized groups, and the students implement a mobile application intended to be used in a multi-actor setting, currently a search and rescue scenario. In parallel they study several topics relevant to the technical and ethical considerations in the project. The project culminates by demonstrating a working product and a written report documenting the results of the practical development process including requirements elicitation. During the final stage of the semester, students create small groups and specialise in one topic, resulting in a bachelor thesis. The current report represents the results obtained during this specialisation work. Hence, the thesis should be viewed as part of a larger body of work required to pass the semester, including the conditions and requirements for a bachelor thesis.


Abstract

Interest in distributed approaches to machine learning has increased significantly in recent years due to continuously increasing data sizes for training machine learning models. In this thesis we describe three popular machine learning algorithms: decision trees, Naive Bayes and support vector machines (SVM), and present existing ways of distributing them. We also perform experiments with decision trees distributed with bagging, boosting and hard data partitioning and evaluate them in terms of performance measures such as accuracy, F1 score and execution time.

Our experiments show that the execution time of bagging and boosting increases linearly with the number of workers, and that boosting performs significantly better than bagging and hard data partitioning in terms of F1 score. The hard data partitioning algorithm works well for large datasets, where the execution time decreases as the number of workers increases without any significant loss in accuracy or F1 score, while the algorithm performs poorly on small datasets, with an increase in execution time and a loss in accuracy and F1 score as the number of workers increases.


Acknowledgments

We would like to thank our supervisor Niklas Carlsson for his guidance during the project. We would also like to thank Jesper Wrang, Eric Lindskog, Gustav Aaro and Daniel Roos for giving us useful feedback.


Contents

List of Figures
List of Tables

1 Introduction
   1.1 Motivation
   1.2 Aim and research questions
   1.3 Contribution and novelty
   1.4 Delimitations
   1.5 Outline

2 Theory
   2.1 What is machine learning?
   2.2 Learning styles
   2.3 Machine learning in a distributed context
   2.4 Ensemble algorithms
   2.5 Measures of performance

3 Machine learning algorithms
   3.1 Decision trees
   3.2 Naive Bayes classifiers
   3.3 Support vector machines

4 Experimental methodology
   4.1 Datasets and setup
   4.2 Experiment
   4.3 Downsampling

5 Results
   5.1 Adult dataset
   5.2 Car evaluation dataset
   5.3 Comparison
   5.4 Downsampling

6 Discussion
   6.1 Results
   6.2 Method
   6.3 The work in a wider context
   6.4 Future work

Bibliography

Appendices


List of Figures

2.1 Conceptual view of serialism.
2.2 Conceptual view of parallelism.
2.3 Conceptual view of combined serialism and parallelism.
2.4 Array partitioned among three workers. n = 8, k = 1.
2.5 Confusion matrix for a binary classifier.
3.1 Decision tree for discriminating between apples and lemons. Leaf nodes outlined with green.
3.2 Naive Bayes structure.
3.3 Visualization of SVM concepts.
3.4 Linearly and nonlinearly separable classes.
5.1 Bagging with adult dataset.
5.2 Boosting with adult dataset.
5.3 Hard data partitioning with adult dataset.
5.4 Bagging with car evaluation dataset.
5.5 Boosting with car evaluation dataset.
5.6 Hard data partitioning with car evaluation dataset.
5.7 F1 score set against time using Adult 32k and Car Evaluation datasets.
5.8 F1 score variations for different datasets and number of threads.


List of Tables

3.1 Training data example.
5.1 Base algorithm results with adult dataset.
5.2 Bagging results with adult dataset.
5.3 Boosting results with adult dataset.
5.4 Hard data partitioning results with adult dataset.
5.5 Base algorithm results with car evaluation dataset.
5.6 Bagging results with car evaluation dataset.
5.7 Boosting results with car evaluation dataset.
5.8 Hard data partitioning results with car evaluation dataset.


1

Introduction

1.1

Motivation

The need for data analysis on larger and larger data sets is ever increasing. Machine learning, algorithms that let computers make generalizations based upon some training data, can be an important tool for these analyses. The datasets that actors want to analyze are becoming so large, sometimes exceeding 1 PB [1], that running machine learning algorithms on a single computer is not always feasible. With tools like SDN and TensorFlow, GraphLab, DryadLINQ, MapReduce and Apache Hadoop it is possible to distribute the processing of machine learning algorithms between multiple computers, allowing more resources to be used on the training and testing of machine learning models and possibly making huge gains in terms of processing times. Other reasons for scaling up machine learning processing with distribution include high input dimensionality, high model and algorithm complexity and strict/tight time constraints [2].

Distribution of a machine learning algorithm does not always give beneficial results, though, since distributing the learning adds complexity such as the need to split data and/or tasks between different machines or threads and to combine their results. In cases where one is processing smaller amounts of data, the added complexity of a distributed implementation might not be worth it, especially if the accuracy of the model suffers from the distributed approach in comparison to running the training on a single computer. If the accuracy suffers badly from a distributed approach it might not be feasible to implement at all, even if there is a need to process large sets of data.

1.2

Aim and research questions

The broader question we wanted to investigate in this thesis was under which conditions (such as the algorithm used, the size of the dataset processed and the number of features in the dataset) it is appropriate and beneficial to distribute the processing of machine learning. That question is large enough to require a meta-study, or even a series of meta-studies, to be answered properly. Therefore, we narrow our thesis down to a set of research questions that, at least in part, help approach the answer to the broader question instead of answering it fully. We focus on binary classification, which is a subset of the supervised learning style of machine learning, and try to answer the following research questions:


• How can different classification algorithms be distributed?

• Are there any other potential benefits to distributing machine learning than time performance?

• When distributing a classification algorithm on a given dataset, what is the best number of machines to split the processing over?

1.3

Contribution and novelty

This report touches on topics that are relatively new to the computer science community and that are in need of continuous research. Several research papers have been released that introduce new ways of distributing machine learning algorithms, including papers that focus on time and performance gains [3, 4]. In this thesis, we study how using different numbers of machines impacts various performance measures (introduced in Chapter 2), and whether there is a sweet spot in the number of machines to use for a certain task which provides the best performance. Because of limited resources, we focus on a (small) targeted subset of all machine learning algorithms and make some simplifying delimitations.

1.4

Delimitations

The field of machine learning, especially in conjunction with distributed processing and computer networks, is vast. Thus, the scope of this thesis must be limited. For example, types of machine learning algorithms and areas relevant to distribution that were not considered in this thesis include:

• Regression analysis
• Dimensionality reduction
• Reinforcement learning
• Neural networks
• Deep learning
• Network protocols

• Network conditions (signal strength and other factors)

• Cloud computing

• Distribution frameworks for machine learning

Furthermore, the distribution will be simulated on a single machine, and only a limited number of methods for distribution will be tested.

1.5

Outline

In Chapter 2, we describe crucial aspects of machine learning and the ensemble algorithms used for distribution. In Chapter 3 we describe three algorithms (decision trees, Naive Bayes and support vector machines) and possible ways of distributing them. In Chapter 4 we describe the experiment we performed and the setup and datasets used. Chapter 5 goes through the results of the experiment, and in Chapter 6 we evaluate the results and compare them to other related work. Finally, we summarize and discuss some of the things that we have learned during this project and what we have not had time to explore.


2

Theory

This chapter gives a short overview of what machine learning is, a couple of different categories of machine learning, and machine learning in a distributed context.

2.1

What is machine learning?

If learning is the ability to change according to external stimuli and remembering most of all previous experiences, then machine learning is concerned with creating and implementing mathematical models in software that are able to adapt and remember sets of context-related inputs (training data) [5]. These models can be used to predict future events and make decisions based upon predictions without having knowledge of every influencing factor. In other words, the machine makes generalizations, based upon one set of data, that can then be applied to other sets of data.

2.2

Learning styles

There are three main styles of learning for machine learning algorithms: supervised learning, unsupervised learning and reinforcement learning.

Supervised learning

In the real world supervised learning would mean learning that occurs under the influence or leadership of a supervisor such as a teacher or instructor. In the context of machine learning the supervisor is not an active entity; instead the training data comes with labels containing the correct results [5]. The labels give the model the possibility to make absolute error measurements, which can be as simple as the number of incorrectly classified data points, or the sum of the differences between the predicted values and the correct values, or more complex, such as the mean square error (the sum of the squared differences between predicted and correct values, divided by the number of predictions), among others. Supervised machine learning algorithms try to minimize some relevant error measurement. The goal is to be able to train the model for use on other sets of data with unknown (at least to the model) correct answers. Thus a significant problem/risk of supervised learning algorithms is overfitting, where the model is "overtrained" on the training data such that it is very good at making predictions on the training data at the cost of generalization onto other sets of data. Another drawback is the requirement of labeled data and the dependence on the correctness of the labels. The two main categories of supervised learning are regression, where one tries to find the best fitting curve through a set of points and thus has a continuous output, and classification, which divides data points into different classes and thus has a discrete output.

A couple of examples of supervised learning algorithms are:

• Naive Bayes
• Support vector machines (SVM)
• Decision tree
• Linear regression
• Logistic regression

Overfitting and underfitting

Overfitting, as mentioned previously, occurs when a model is so tightly trained on the training data that it is unable to generalize onto new sets of data. An overfitted model is great at capturing the characteristics of the training data, but poor at capturing the characteristics of the real world. The opposite of overfitting is underfitting, where a model is unable to capture the characteristics of the training data. This often means that the model is poor at predicting new sets of data since it hasn't learned all that it can learn from the training data.

Unsupervised learning

Unsupervised machine learning works in the absence of a supervisor or labeled training data. Instead, unsupervised machine learning algorithms are used to group objects together in clusters based on similarity between objects, often measured as Euclidean distance in a vector space. Applications for unsupervised machine learning include object segmentation, similarity detection and automatic labeling [5]. Not needing labeled data is a big advantage for unsupervised learning, but it has the disadvantage of making evaluation of the model more difficult than in supervised learning, since there is no absolute error measurement and there might not exist a good approximate error measurement for the given application.

Reinforcement learning

Reinforcement learning is based on feedback from the environment, which separates it from unsupervised learning, but the feedback does not come from labeled data and precise error measurements, which separates it from supervised learning [5]. Reinforcement learning learns by adapting to the environment rather than learning generalizations from an already known environment. Reinforcement learning tries to minimize some approximate error measurement. The goal is often to find the most beneficial sequence of actions, which requires the model to look further into the future than just the next step. The next step could have some locally optimal action that also has cascading effects resulting in a lower than optimal result in the long term; that is, shortsightedness carries the risk of finding sub-optimal solutions. However, looking far into the future requires more computational power and adds uncertainty. Reinforcement learning is appropriate in non-deterministic and changing environments where good error measurements are difficult or impossible to obtain.

2.3

Machine learning in a distributed context

There are two main concepts in distributed processing: serialism and parallelism [2]. In the following paragraphs a node is a logical unit in a network; in simulation a node is an instance of a program, a programming object or a thread, and in emulation or implementation a node is a machine.

Figure 2.1: Conceptual view of serialism.

Figure 2.1 shows a conceptual picture of serial distribution, often referred to as vertical distribution or task distribution. Each node, except for the first one, takes input from the former node, performs some operations on the data and forwards it to the next node. The nodes are logically different in a serial distribution and make up a chain where the last node has the result of the algorithm [6].

Figure 2.2: Conceptual view of parallelism.

In parallel distribution, data is split among the nodes, as depicted in Figure 2.2. Each node performs the same set of operations (the nodes are logically equivalent), but on its respective data [6].

Figure 2.3: Conceptual view of combined serialism and parallelism.

There are, of course, pros and cons with both serialism and parallelism, and they are not necessarily mutually exclusive. Figure 2.3 shows a conceptual picture of a distributed system which uses both serialism and parallelism. One of the pros of serialism is that each node can be highly specialized to do only one or a couple of operations, and do them efficiently. Each node can be radically different from every other node. This can result in quicker execution times and might allow for a cheaper set-up, since easier and quicker operations can be done by cheaper and weaker hardware, while more complex operations can be done by better hardware without slowing the overall execution down. A major con with serialism is that the risk for bottlenecks is prevalent. Another con is that all data must be held by the node currently processing it, even if that processing only looks at a subset of the data, since the data eventually has to be forwarded to the next node. A simple measure against this might be to allow nodes to send data directly to nodes further down the line so that the data can "skip" nodes where it is not needed. A major pro of parallelism is that the data can be split among the nodes. This means that an overall larger set of data can be processed. Another pro is that parallelism has the potential to be scalable. Since each node is essentially identical and capable of doing the same operations, adding or removing nodes should not be problematic as long as the method of splitting the data between the nodes can handle it. A con of parallelism is that each node has to be logically identical. This means that each node has less potential to be specialized and optimized compared to serialism.

2.4

Ensemble algorithms

Most algorithms/models are meant to run as a single instance; these are called strong learners. Another category of machine learning algorithms utilizes multiple instances of weak learners with different parameters and combines their results in some way to come up with an output/solution. This category of machine learning algorithms is called ensemble algorithms, and the group of weak learners is called an ensemble. It is not necessary for each weak learner in an ensemble to be the same kind of learner, as long as each learner has the same domain and range as the other learners. For classification, the result of a query to the trained ensemble is typically the majority vote of the weak models, and for regression the result is usually the mean of the weak models.

Bootstrap aggregating (bagging)

If there are n points of training data, then each weak learner (bag) randomly gets n₁ points to use for its training. The n₁ points are given to the bag one at a time, each time choosing one data point at random from the whole set. This means that the same point can appear multiple times in any given bag. Typically n₁ is the same size as n, but it can be chosen to be smaller. For predictions and classification the same input is given to each weak learner. For numerical classes and regression the mean of the weak learners' outputs is considered the model's output. If the output is non-numerical the majority vote is the model's output [7].
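As a concrete illustration of the procedure described above, the sketch below trains one decision tree per bag on a bootstrap sample and aggregates predictions by majority vote. It is only a sketch: it assumes scikit-learn's DecisionTreeClassifier as the weak learner and NumPy arrays as input, whereas the experiments in this thesis use our own decision tree implementation, and the helper names (train_bagging, bagging_predict) are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagging(X, y, n_workers, bag_fraction=1.0, seed=0):
    # Each bag is drawn with replacement, so the same point can appear several times.
    rng = np.random.default_rng(seed)
    n = len(X)
    n1 = int(bag_fraction * n)          # bag size n1 (typically n1 = n, but it can be smaller)
    ensemble = []
    for _ in range(n_workers):
        idx = rng.choice(n, size=n1, replace=True)
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    # Majority vote over the weak learners (for non-numerical classes).
    votes = np.stack([tree.predict(X) for tree in ensemble])    # shape: (workers, samples)
    predictions = []
    for column in votes.T:
        labels, counts = np.unique(column, return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)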

Boosting

In contrast to bagging, where the whole ensemble is built before any evaluation, boosting makes evaluations on the current ensemble every time a new bag is added. Data points in the training data that give high error measures are weighted higher for the selection of data points for the next bag. By weighting data points in this selection process, boosting lowers the risk of completely missing some points. Some implementations also give the weak learners weights related to their accuracy that are used when aggregating the results of the weak learners [8].

Because boosting gives weights to all training data points based on some loss function, boosting is most appropriate for problems where the output of the learner is numerical, or for binary classification, since the two classes can be assigned -1 and 1, as in AdaBoost.

AdaBoost (adaptive boosting) is a popular boosting algorithm for binary classification introduced by Freund and Schapire [9]. As mentioned in the previous paragraph, the two classes are represented as 1 and -1 respectively. Each tuple in the training data is weighted; we call the weight for tuple i at iteration k δ_{i,k}. The weights are normalized such that they make up a distribution, i.e. their sum is 1. For the first iteration each weight δ_{i,1} is equal to 1/N, where N is the number of tuples in the training data. The weights can be used by the weak learners by making them favor splits on sets with high weights. In a data-distributed setting the weights can instead be used to favor tuples with higher weights when partitioning the data for each iteration.

For every iteration a weak learner is trained. Then the weak learner is evaluated on the training data. A weighted error rate E_k is calculated as

E_k = Σ_{i=0}^{N} δ_{i,k},

where the sum is taken over all tuples i for which the weak learner predicted wrong. Then the weak learner is given a weight

α_k = (1/2) · log((1 − E_k) / E_k),

which is used later when aggregating the results from the weak learners, so that predictions from weak learners with better predictions have a bigger impact on the end result than predictions from weak learners with worse predictions. A problem with the way α_k is calculated is that if the weak learner is able to correctly predict the class of every tuple in the training data, E_k will be 0, resulting in a division by 0. To get around this, α_k can instead be calculated as

α_k = (1/2) · log((1 − E_k + e) / (E_k + e)),

for some small number e, as suggested by Schapire and Singer [10]. Then new weights are calculated for each tuple in the training data as

δ_{i,k+1} = (δ_{i,k} · e^(−α_k)) / Z_k   if the prediction of the weak learner was correct for that tuple,
δ_{i,k+1} = (δ_{i,k} · e^(α_k)) / Z_k    if the tuple was not correctly predicted,

where Z_k is a normalization factor that is calculated as

Z_k = Σ_{i=0}^{N} δ_{i,k} · e^(−α_k · h_k(x_i) · c(x_i)).

Here h_k(x_i) is the predicted label for the i:th tuple and c(x_i) is the correct label. It can be shown that the expression for Z_k is equal to

Z_k = 2 · √((E_k + e)(1 − (E_k + e))).

Once all weak learners are trained, the model can be asked to make predictions on new tuples. The prediction of a model with m weak learners is

sgn(Σ_{k=0}^{m} α_k · h_k(x_i)).
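To make the update rules above concrete, the following sketch implements them with decision stumps as weak learners and labels in {-1, +1}. It is an illustration of the algorithm as described, assuming scikit-learn's DecisionTreeClassifier with per-sample weights rather than the data-partitioned variant used in our experiments; the function names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, n_rounds, eps=1e-10):
    # AdaBoost with decision stumps; y must contain the labels -1 and +1.
    n = len(X)
    delta = np.full(n, 1.0 / n)             # initial weights form a uniform distribution
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=delta)
        pred = stump.predict(X)
        err = delta[pred != y].sum()         # weighted error rate E_k
        alpha = 0.5 * np.log((1 - err + eps) / (err + eps))
        delta *= np.exp(-alpha * y * pred)   # down-weight correct tuples, up-weight wrong ones
        delta /= delta.sum()                 # renormalize so the weights stay a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Sign of the alpha-weighted vote of the weak learners.
    aggregated = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(aggregated)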

Hard data partitioning

Another way to create an ensemble is what we call hard data partitioning, where the dataset is split evenly among the weak learners without random selection. The simplest way to split the dataset for m weak learners and n data points is to give the first learner the first n/m data points, the second learner the next n/m, and so on. This can be built upon by adding some overlap, such that each learner is given n/m + k data points, where k is the number of overlapping data points. Just like bagging, the model's final output is either the mean of the outputs from the weak learners or the majority vote. Figure 2.4 shows an example of how an array can be split between three learners.


Figure 2.4: Array partitioned among three workers. n = 8, k = 1.
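A minimal sketch of this partitioning scheme could look as follows (hard_partition is a hypothetical helper, not the thesis' code); with n = 8 data points, three workers and overlap k = 1 it produces a split of the kind shown in Figure 2.4.

def hard_partition(data, n_workers, k=0):
    # Split data into n_workers contiguous chunks of roughly len(data)/n_workers items,
    # each extended by k overlapping items from the following chunk.
    n = len(data)
    size = n // n_workers
    parts = []
    for w in range(n_workers):
        start = w * size
        end = n if w == n_workers - 1 else (w + 1) * size + k
        parts.append(data[start:end])
    return parts

# Example: 8 data points, 3 workers, overlap k = 1.
print(hard_partition(list(range(8)), 3, k=1))   # [[0, 1, 2], [2, 3, 4], [4, 5, 6, 7]]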

2.5

Measures of performance

There are multiple aspects of the performance of an implementation. One aspect is time, where there are many different things one can look at, such as the time it takes to train the model, the time it takes to answer a query, and so on. Another aspect of performance is how good the predictions of the model are. Depending on what style of learning the algorithm uses and its particular purpose, different measures are applicable and/or suitable.

Confusion Matrix

The confusion matrix, also referred to as the error matrix [11], is an important concept that many of the measures are based on. This concept applies only to supervised machine learning algorithms since it involves both the predicted answers and the actual answers. As visualized in Figure 2.5 for a binary classifier, one edge of the matrix holds all values predicted by the algorithm, and the other the actual answers. In binary classification one class is seen as the positive class and the other as the negative class. The classification is seen from the perspective of one class, asking the question: is the data point of this class? Here 'yes' is seen as positive and 'no' is seen as negative.

Figure 2.5: Confusion matrix for a binary classifier.

To make it easier to follow, we introduce some terminology [12]: true positives (TP) are positive predictions that were correct, true negatives (TN) are negative predictions that were correct, false positives (FP) are positive predictions that were incorrect, and false negatives (FN) are negative predictions that were incorrect. In later sections we describe uses of confusion matrices for estimating the performance of machine learning implementations.

The confusion matrix can be generalized to any number of classes. For k classes the confusion matrix is a k × k matrix. With k classes the true positives are the correctly classified entries; every true positive for one class is a true negative in terms of every other class. From the perspective of one class, a false positive is every entry incorrectly classified as that class; note that this means the entry is a false negative in terms of the class it actually is.

Accuracy

To measure the accuracy of a machine learning algorithm we use data from the confusion matrix. Accuracy is the number of values that the algorithm got correct, divided by all of the predicted values. The equation looks as follows:

accuracy = (TP + TN) / (TP + TN + FP + FN).

The accuracy is calculated in the same way regardless of the number of classes: the sum of each entry where the predicted label is equal to the actual label (the top-left to bottom-right diagonal) divided by the total number of entries in the confusion matrix.

Precision

Another measure that can be derived from the confusion matrix is precision. Precision is the number of true positives divided by the sum of true positives and false positives:

precision = TP / (TP + FP)

and gives an estimate of to what degree we should trust the positively predicted values. In a multiclass setting a precision measure can be calculated for each class. There are two ways to add these per-class measures together: micro-averaging and macro-averaging. Micro-precision (using micro-averaging) uses the sum of the TP of each class divided by the sum of TP + FP for each class. Macro-precision is calculated as the sum of each class's precision divided by the number of classes.

Recall

Recall is another performance measure derived from the confusion matrix, and is an estimate of how large a share of the actually positive data points the algorithm predicted as positive:

recall = TP / (TP + FN).

In a multiclass setting a recall measure can be calculated for each class. Micro-recall and macro-recall are calculated analogously to micro- and macro-precision.

F1 score

The F1 score is the harmonic mean of the precision and recall. This is often a better measure than precision or recall alone. When there is a very unequal distribution of labels in the training and testing datasets, the F1 score provides a better idea of the algorithm's performance than accuracy does. The F1 score is calculated as:

F1 = 2 · (precision · recall) / (precision + recall).
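For a binary classifier these definitions translate directly into code. The helper below is an illustrative sketch (not part of the experiment code) that computes all four measures from predicted and actual labels.

def binary_metrics(y_true, y_pred, positive):
    # Accuracy, precision, recall and F1 computed from the confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Example with '>50K' as the positive class.
print(binary_metrics(['>50K', '<=50K', '>50K'], ['>50K', '>50K', '<=50K'], '>50K'))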

Other measures of performance

There are several other measures of performance and tests one can use while studying machine learning algorithms. In Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power [13], Garcia et al. review several other performance measures and tests, such as the Multiple Sign-test, the Friedman test and the Quade test; however, these are outside the scope of this thesis. In this thesis we use confusion matrix based measures due to their popularity and simplicity.


3

Machine learning algorithms

In this chapter we go through the machine learning algorithms focused on in this thesis. The chapter describes the concepts and mathematical expressions used in each algorithm. For each algorithm we first write about the basic ideas behind the algorithm and then go on to describe possible solutions for distributing it. The algorithms are presented in the following order: first decision trees, then Naive Bayes and lastly support vector machines.

3.1

Decision trees

Decision trees are a group of supervised learning algorithms that can be used on discrete or continuous data for classification or regression. There are multiple decision tree algorithms, including CART (Classification And Regression Trees), ID3, C4.5 and MART (Multiple Additive Regression Trees).

How it works

Each entry in the input for the training is a tuple with k features and a label for the correct answer. The first node in the tree contains all data. The data is then split according to some rule or question based on one feature. In most implementations the questions can be answered either true or false, which means that one true-branch and one false-branch are created from the split. To know which question and feature to split over, the best question is calculated for each feature along with the information gain of a split based on that question, and the question with the highest information gain is selected for the split. Information gain can be based either on information entropy (the average amount of data from a stochastic source [14]) or on Gini impurity (the frequency with which a randomly chosen tuple from the set would be misclassified if randomly classified according to the distribution of the classes in the data set [15]). The difference in information entropy or Gini impurity before and after a split is the information gain. The Gini impurity is the sum over the classes of the probability that a data item of that class is randomly selected, multiplied by the probability of misclassifying it. For classes {1, 2, ..., C} and p_c being the fraction of data items in the set with label c, the Gini impurity G can be written as

G = Σ_{c=1}^{C} p_c · Σ_{d≠c} p_d = 1 − Σ_{c=1}^{C} p_c².

The impurity after the split is calculated as the sum of the impurities of each new node, weighted by the number of items each new node contains. If f is the fraction of the items that end up in the left node, then the information gain is calculated as

Gain = G_before split − f · G_left − (1 − f) · G_right.

Splitting over features recurs until there is no more information to be gained or there are no more features to split over. If the information gain is zero, a leaf node is created and the class of the remaining data is set as the output. Depending on the way information gain is calculated, it might be possible for the information gain to be non-zero even if the remaining data is of only one class. In that case the node is considered pure and splitting is redundant, so a leaf should be created. If there are no more features, a leaf is created with the percentages of the remaining classes.
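As a small illustration of the formulas above (a sketch only; the decision tree used in our experiments is the implementation referenced in Chapter 4), Gini impurity and the gain of a candidate split can be computed like this.

from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class fractions.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def info_gain(parent_labels, left_labels, right_labels):
    # Impurity of the parent minus the size-weighted impurity of the two children.
    f = len(left_labels) / len(parent_labels)   # fraction of items in the left node
    return gini(parent_labels) - f * gini(left_labels) - (1 - f) * gini(right_labels)

# Example: a split that perfectly separates apples from lemons.
parent = ['Apple', 'Apple', 'Lemon', 'Lemon', 'Apple']
print(info_gain(parent, ['Apple', 'Apple', 'Apple'], ['Lemon', 'Lemon']))   # 0.48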

A small example of a decision tree can be a model that predicts whether a fruit is an apple or a lemon based on its color, diameter in centimeters and texture. The tuples may look like this: {"Green", 6, "Smooth", "Apple"}, {"Yellow", 3, "Coarse", "Lemon"}, {"Red", 4, "Coarse", "Apple"}, et cetera. Figure 3.1 is an example of a decision tree with depth 2 made from this kind of tuples; its internal nodes test questions such as Texture = "Coarse", Diameter <= 3 and Color = "Yellow", and its leaf nodes hold class percentages such as Apple: 70 %, Lemon: 30 %.

Figure 3.1: Decision tree for discriminating between apples and lemons. Leaf nodes outlined with green.

How decision tree algorithms could be used with distributed processing

Decision tree algorithms might not be suitable for serial distribution since recursion is used to build the tree. There are two main ways of distributing decision trees in parallel: data-parallel distribution and feature-parallel distribution. Common distribution of decision trees involves using ensemble algorithms such as bagging or boosting: one controller can run the ensemble algorithm and send only the designated training data to each worker, which runs the chosen decision tree algorithm. One way of feature distribution is to distribute the computation of the best feature to split on, which results in a model logically equivalent to a centralized algorithm [2]. Another way of feature distribution might be to give each node a different set of features to train on. This would not create a logically equivalent model; instead each node would have a subtree knowing different features.

AdaBoost can use decision trees as weak learners for binary classification problems, which is what we do in our experiment. When used for AdaBoost one might opt to use decision stumps as weak learners instead of full trees. A decision stump is a decision tree with a depth of 1. Another boosting implementation for decision trees is XGBoost (Extreme Gradient Boosting).

It might be possible to distribute the training of a single decision tree. If each node contains an array or an index for its position in the tree, then whenever there is a split the new right (or left) node can be sent to another machine for execution while the other is kept on the first machine. Each machine must then only be given the data points associated with its "starting" node. The problem that has to be solved for this approach is that the system must be able to dynamically assign new machines to the tree building process. For data sets with k features there are up to Σ_{i=0}^{k} 2^i = 2^(k+1) − 1 possible nodes and Σ_{i=0}^{k−1} 2^i = 2^k − 1 splits. Given that a machine continues down one of the new nodes while giving the other node to a new machine (and no machine is reused elsewhere when it has reached its leaf), one new machine is added for each split. Thus, 2^k − 1 machines have to be available, or some smarter protocol/algorithm for the assignment of tasks must be implemented.

XGBoost

XGBoost [4] is an algorithm that is popular among competitors in challenges and competitions such as the Netflix Prize or Kaggle. It has been created for the purpose of distributing the computation as well as speeding up the machine learning process on a single machine. Its main focus is gradient boosting; however, the algorithm also provides solutions for problems like out-of-core computing and cache- and sparsity-aware learning. In the article, Chen and Guestrin found that the algorithm performed more than ten times faster than then-existing single-machine machine learning implementations, and that XGBoost was very scalable. The algorithm uses multiple techniques to avoid overfitting, such as a regularized objective, shrinkage and column subsampling. It optimizes finding the best split with an exact greedy algorithm that sorts the data according to feature values. XGBoost handles distribution with column blocks, where each block might contain a subset of rows in a dataset and be distributed across multiple machines. This approach works well with sorting of data and finding the best split. The performance of this algorithm will be discussed in Chapter 6.

3.2

Naive Bayes classifiers

Naive Bayes is a supervised machine learning algorithm that is a popular tool for text categorization. The algorithm assumes independence of each feature in a given set of data. In other words it assumes that there is no correlation between the features, which might be a rather bold assumption. As an example, consider using Naive Bayes to predict whether a person will go to the beach or not. One feature might be the outlook, how the weather looks, with values such as sunny, rainy or snowy. Another feature might be the temperature. In reality the outlook and the temperature are related: sunny days are more often hot while snowy days are more often cold. Naive Bayes ignores this and assumes that the outlook and the temperature vary independently of one another.


As mentioned previously, Naive Bayes is a supervised algorithm and therefore labeled data is needed to train it. One application of a Naive Bayes classifier could be to classify what sentences are about. The training data would be labeled sentences; the label or tag for each sentence would be the class that the sentence belongs to. As an example, consider the sentence The tallest building in the world is Burj-Khalifa with the associated label building. Clearly this sentence is related to building. However, consider another sentence: Construction workers were on top of the building. This sentence is more related to construction than to building, and is therefore tagged as construction. Naive Bayes uses Bayes' theorem to calculate the probability that a set of features, like the words in a sentence, is of a given class. Figure 3.2 visualizes the class-attribute relationship [16, 17]. In Optimality of Naive Bayes [16] Zhang gives an example: given a feature, or in this case a word, W, the probability of it being in class c is

p(c|W) = p(W|c) · p(c) / p(W).   (3.1)

For multiple features this expands as follows. Given a sentence S containing the words W_1, W_2, ..., W_i, the probability of it being in class c is

p(c|S) = p(c) · Π_{k=1}^{i} p(W_k|c) / p(W_k).   (3.2)

Table 3.1: Training data example.

Sentence                                           | Label
The tallest building in the world is Burj-Khalifa  | Building
Construction workers were on top of the building   | Construction
Buildings are places for people to live            | Building
Building high buildings is a risky business        | Construction

Table 3.1 shows an example of the training data. In a real-life scenario there would be thousands or more labeled sentences, to make the model better at predicting the meaning of a sentence. After the model is trained it can estimate the probability of a sentence being in a certain class with Equation 3.2.
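A minimal word-based Naive Bayes classifier along these lines could be sketched as below. This is illustrative only: the function names are hypothetical, and add-one (Laplace) smoothing is used so that unseen words do not produce zero probabilities, which Equation 3.2 itself does not address.

import math
from collections import Counter, defaultdict

def train_nb(sentences, labels):
    # Count words per class and how often each class occurs in the labeled sentences.
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for sentence, label in zip(sentences, labels):
        word_counts[label].update(sentence.lower().split())
    return word_counts, class_counts

def predict_nb(sentence, word_counts, class_counts):
    # Pick the class maximizing log p(c) + sum over words of log p(W_k|c), with add-one smoothing.
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(class_counts.values())
    best_class, best_score = None, float('-inf')
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        n_words = sum(word_counts[c].values())
        for w in sentence.lower().split():
            score += math.log((word_counts[c][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class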

How Naive Bayes could be done with distributed processing

At the time of writing this thesis we were unable to find any work or articles that focus on a distributed version of the Naive Bayes algorithm. Kohavi [18] writes about scaling up Naive Bayes accuracy for big data sets with the help of decision trees. In his paper he created a hybrid algorithm called NBTree that showed impressive results on up to 45k data instances. Considering Kohavi's paper and the previously mentioned distribution methods for decision trees, one could distribute Naive Bayes by distributing the hybrid algorithm. This would, however, be just another distribution of decision trees and not of Naive Bayes, and is just a hypothetical method that needs further investigation.

3.3

Support vector machines

Support vector machines are a family of supervised machine learning algorithms used for classification and regression. In SVM the input data points are represented in a vector space where each feature corresponds to one dimension. The algorithms work by separating the classes into their own clusters with the best fitting hyperplane, a subspace with one less dimension than its corresponding vector space. Each hyperplane has a margin between the different clusters of points. The margin boundaries are determined by support vectors, which are the data points in each cluster that are closest to the hyperplane [5, 19]. The hyperplane can be constructed by creating a position vector w̄ to which the hyperplane is always perpendicular and where the hyperplane intersects the point of w̄, as visualized in Figure 3.3a. Finding the best fitting hyperplane is an optimization problem for w̄. For linearly separable classes, such as in Figure 3.4a, the problem is quite simple to solve. The classes are, however, often not linearly separable, as illustrated in Figure 3.4b. The solution to this problem is called the kernel trick or kernel method, in which an extra dimension is added over which the classes can be separated. The hyperplane in the new vector space is then projected onto the original vector space. Further details about SVM can be found in Appendix A.

Figure 3.3: Visualization of SVM concepts. (a) Hyperplane and the vector w̄. (b) Width estimation in SVM.

Figure 3.4: Linearly and nonlinearly separable classes. (a) Linearly separable classes in SVM. (b) Nonlinearly separable classes in SVM.

How SVM can be done with distributed processing

SVM is a widely used algorithm in machine learning; however, its design causes scalability problems, which is why several parallelization methods have been developed, such as PSO-SVM [20] and PSVM [2].

The kernel trick in SVM involves a matrix called the kernel matrix. This matrix is at the center of PSVM (P for parallel). PSVM utilizes matrix factorization of the kernel matrix as a means to parallelize the processing and has shown great improvements in memory complexity, processing time and scalability compared to regular SVM. More details can be found in Appendix A.


4

Experimental methodology

In this chapter we present the method used in the experiment that was performed. The goal of our experiment was to find out what kind of advantages can be obtained by distributing a decision tree algorithm. In the experiments we test and evaluate four different algorithms on two different datasets. First we test the base decision tree algorithm: its accuracy, precision, recall and F1 score. Next we use bagging and test its performance. The third algorithm is boosting, and lastly we distribute the entire dataset over the workers with hard data partitioning and look at the potential time performance increase as well as the other performance measures.

4.1

Datasets and setup

The datasets used in the experiments come from the UCI Machine Learning Repository [21]. The repository contains different types of data that are sorted and specifically prepared for training and evaluating machine learning models. The datasets used are the adult dataset (http://archive.ics.uci.edu/ml/datasets/Adult), donated by Ronny Kohavi and Barry Becker, and the car evaluation dataset, donated by Marko Bohanec and Blaz Zupan.

The adult dataset is about whether people earn more than $50,000 per year (labels '>50K' and '≤50K', where '>50K' is seen as the positive class). The dataset contains approximately 32,000 tuples with 14 features. The labels are distributed such that approximately 24% have the label '>50K' and the remaining 76% have the other. The adult dataset was provided in two parts: a training dataset and a testing dataset. Testing on this dataset was run only one time due to time constraints.

The car evaluation dataset contains approximately 2,000 tuples with 6 features and 4 labels: 'unacc' (unacceptable), 'acc' (acceptable), 'good' and 'vgood' (very good). The majority of the tuples are classed as 'unacc'. In order to be able to do binary classification on the car dataset, the classes 'good' and 'vgood' were changed to 'acc' (which is seen as the positive class), which is reasonable since good and very good are essentially better versions of acceptable. The label distribution of the car dataset is slightly more equal than the adult dataset, with approximately 30% having the label 'acc' (after changing 'good' and 'vgood' to 'acc') and 70% the other. We split the car dataset into a training dataset and a testing dataset by randomly designating about 20% of the tuples as the testing set and letting the remaining 80% be our training dataset. Training and testing with the car evaluation dataset was run 5 times due to the short time it took to perform each test.

For the experiments a desktop computer with an Intel Core i7 2600K processor at 3.5 GHz and 16 GB RAM was used.

4.2

Experiment

In the experiment we tested four different algorithms: the base algorithm, bagging, boosting and hard data partitioning. Due to a lack of resources we were unable to perform experiments in a physically distributed environment. Instead, we simulated the environment with user-level threads representing each worker (machine). The threaded simulations are logically equivalent to physically distributed implementations. In the base algorithm part we trained a single decision tree on the adult and car datasets and evaluated it on our testing dataset in terms of accuracy, precision, recall and F1 score, as well as recording the time it took to train the models. Each of the remaining algorithms was run on the following numbers of workers: 2, 5, 10, 20, 50, 100 and 200 (and 400 for hard data partitioning and bagging). Each model was tested on our testing datasets and evaluated on the same measures as the base algorithm. We trained on the adult dataset once for every algorithm and on the car evaluation dataset five times for each algorithm. For hard data partitioning the overlap constant k was set to 10 in each test. The reasons why we chose these three distribution techniques were:

• With bagging we wanted to see if training each worker node on a randomized 60% subset of the training dataset could give benefits in F1 score and accuracy.

• In boosting we wanted to improve upon the results of bagging by using AdaBoost as described in Chapter 2.

• In hard data partitioning we wanted to see the time performance change with partitioning the entire training dataset over the number of workers.

Testing of the trained model was done by voting: each worker thread votes on each test data point and then the label with the most votes from the workers is chosen. In case the same number of workers vote for two different labels, the first one is picked, i.e. the one with the lowest index.
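A sketch of how such a simulation can be wired together is shown below. It is illustrative only (the actual implementation is in the repository mentioned next): each worker is a thread training on its assigned data, each trained model is assumed to expose a predict method for a single data point, and voting ties are resolved in favor of the label with the lowest index.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def train_workers(train_fn, partitions):
    # Train one model per worker thread on its assigned partition of the data.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(train_fn, partitions))

def vote(models, x, labels):
    # Majority vote; on a tie, the label with the lowest index in `labels` (a list) wins.
    counts = Counter(model.predict(x) for model in models)
    return max(labels, key=lambda lab: (counts[lab], -labels.index(lab)))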

The Python 3 code for the decision tree, for the bagging and hard data partitioning implementations and for the boosting implementation (AdaBoost) can be found on GitHub. The implementations used for bagging, hard data partitioning and boosting use instances of the decision tree.

4.3

Downsampling

To further investigate the bagging algorithm we ran tests where we downsampled the adult dataset to contain 16,000 and 2,000 data points and ran the bagging model on the downsampled data. We ran the tests on both downsampled datasets 10 times, using 2, 5, 10, 20, 50 and 100 threads.


5

Results

In this chapter we present the results of our experiments. For each algorithm we present a table with the results and graphs (except for the base algorithm, which has just one result). There are two graphs for each algorithm: one showing time performance, where we use a logarithmic scale on both axes, and another showing accuracy and F1 score.

5.1

Adult dataset

Base algorithm

Table 5.1 shows the results from running our base algorithm, a single decision tree trained on the complete training data set. The metrics from this table provide a baseline for the upcoming tests. It is noteworthy that there is a large disparity between the accuracy and the F1 score. This indicates that the model classifies data points as the negative class ('≤50K') too frequently (always classifying data points as '≤50K' would yield an accuracy of approximately 76%).

Table 5.1: Base algorithm results with adult dataset.

Time (s) | Accuracy (%) | Precision | Recall | F1 score
2640     | 80.5         | 0.601     | 0.592  | 0.596

Bagging

The bagging algorithm shows a linear rise in execution time as more threads are used. The entire test run took around two weeks to perform, making it the most time demanding experiment in this thesis. The shortest run was with one worker, taking 1626 seconds to finish, and the longest with 400 workers, taking 507,248 seconds, or about a week, to finish. In Table 5.2 below we can see that precision is on average higher than recall: the proportion of people predicted to earn more than 50K that actually did so (precision) was higher than the proportion of people earning more than 50K that were classified correctly (recall). This means that when a person is predicted to earn more than 50K a year, the prediction is most likely correct.


Table 5.2: Bagging results with adult dataset.

Threads | Time (s) | Accuracy (%) | Precision | Recall | F1 score
2       | 2570     | 83.6         | 0.767     | 0.457  | 0.573
5       | 6110     | 83.4         | 0.671     | 0.605  | 0.636
10      | 11800    | 84.4         | 0.727     | 0.559  | 0.632
20      | 24000    | 85.1         | 0.741     | 0.584  | 0.653
50      | 60100    | 85.0         | 0.730     | 0.595  | 0.655
100     | 118000   | 84.7         | 0.720     | 0.595  | 0.652
200     | 243000   | 84.8         | 0.723     | 0.597  | 0.654
400     | 507000   | 85.2         | 0.738     | 0.600  | 0.662

Figure 5.1: Bagging with adult dataset. (a) Time performance: total execution time in relation to the number of worker threads. (b) Accuracy and F1 score.

Accuracy varies from 83.2% to 85.2% and the F1 score from 0.573 to 0.662. There seems to be an upward trend in both accuracy and F1 score, although it is not large. The accuracy is significantly higher than the F1 score, and we can see that the recall is lower than the precision. This indicates that the model has a bias towards predicting the negative class ('≤50K'), which might be expected considering that the dataset contains a lot more data points with that class. The accuracy is better than the base algorithm from 2 workers and onwards, and the F1 score is better from 5 workers.

Boosting

The boosting algorithm does not show any improvements as the number of workers/iterations increases beyond 5 iterations. This algorithm significantly outperforms the other algorithms on this dataset in terms of F1 score, which is due to higher values in both recall and precision for all numbers of workers. The accuracy is higher than the base algorithm for all but 2 workers.

In Figure 5.2a we observe an almost linear relationship between the execution time and the number of threads/iterations, which is similar to bagging, although the curve is steeper for boosting. The accuracy and F1 score, as shown in Figure 5.2b, are lower for 2 threads but otherwise constant. The constant accuracy and F1 score suggest that each new weak learner votes more or less the same way as the ensemble from the previous iteration. In other words, between iterations the ensemble makes the same prediction for the majority of data points in the testing dataset. It is expected that few predictions change between iterations, but not that no predictions change; one would hope that some predictions changed for the better.


Table 5.3: Boosting results with adult dataset.

Threads | Time (s) | Accuracy (%) | Precision | Recall | F1 score
2       | 2030     | 80.1         | 0.876     | 0.869  | 0.873
5       | 4630     | 82.9         | 0.879     | 0.898  | 0.889
10      | 9840     | 82.9         | 0.879     | 0.898  | 0.889
20      | 29400    | 82.9         | 0.879     | 0.898  | 0.889
50      | 52600    | 82.9         | 0.879     | 0.898  | 0.889
100     | 98900    | 82.9         | 0.879     | 0.898  | 0.889
200     | 180000   | 82.9         | 0.879     | 0.898  | 0.889

Figure 5.2: Boosting with adult dataset. (a) Time performance: total execution time in relation to the number of worker threads. (b) Accuracy and F1 score.

Hard data partitioning

In the hard data partitioning test on the adult dataset an almost linear decrease in execution time is observed, see Figure 5.3a. This trend is different from bagging and boosting, and quite interesting considering that we only simulate distribution. In this part of the experiment we train the model on much smaller subsets of data that in total are only slightly larger than the original adult data set, with total size n + k·m, where m is the number of worker threads and k is the overlap (k = 10 in our tests), whereas in bagging and boosting each worker is trained on a larger portion (60%) of the training data set, such that the total number of instances processed is 0.6·n·m. Thus, it is expected that the algorithm ran quicker than the other two, but not necessarily that it shows the observed trend. The trend might be explained by each decision tree becoming less complex when its training data is smaller; in other words, the trees reach pure subsets by asking fewer questions.

We can see in Figure 5.3b that the accuracy varies from 81% to 85% and shows noticeable improvements as the number of worker nodes increases; however, the accuracy does not vary by more than 0.4 percentage points between 5 and 200 workers. Another interesting result is the decrease in F1 score and accuracy at 400 worker nodes. Like bagging, the model is biased in favor of the class '≤50K', hence the high accuracy and low recall and F1 score.


Table 5.4: Hard data partitioning results with adult data set.

Threads | Time (s) | Accuracy (%) | Precision | Recall | F1 score
2       | 2310     | 83.3         | 0.747     | 0.459  | 0.569
5       | 1050     | 84.6         | 0.706     | 0.613  | 0.656
10      | 562      | 84.8         | 0.742     | 0.561  | 0.639
20      | 313      | 84.6         | 0.742     | 0.549  | 0.631
50      | 158      | 84.8         | 0.732     | 0.580  | 0.647
100     | 100      | 84.8         | 0.739     | 0.565  | 0.641
200     | 67       | 84.7         | 0.731     | 0.572  | 0.642
400     | 53       | 84.4         | 0.747     | 0.503  | 0.601

Figure 5.3: Hard data partitioning with adult dataset. (a) Time performance: total execution time in relation to the number of worker threads. (b) Accuracy and F1 score.

5.2

Car evaluation dataset

Experiments on the car evaluation dataset were run five times for each algorithm (due to the short time it took to perform each run), so in the tables we provide the average results over the runs with the standard deviation in parentheses.

Base algorithm

The base algorithm with the car evaluation dataset showed good time performance as well as high accuracy, precision and F1 score due to the small size of the dataset. The standard deviation of the execution time was marginal at 0.01 s, and for accuracy and F1 score it was zero. This is because the algorithm runs with the same settings each time, unlike bagging and boosting, where the data can be split differently among the worker threads in each run.

Table 5.5: Base algorithm results with car evaluation dataset.

Time (s)       Accuracy (%)    Precision    Recall    F1 score
0.106 (0.01)   99.4 (0)        1.0          0.980     0.990 (0)

Bagging

Bagging with a smaller dataset gave a time curve similar to that of the adult dataset. The accuracy and F1 score are significantly better than with the adult dataset, coming close to 100% for accuracy and 1 for F1 score. In Table 5.6 we notice a precision of 1 for 20 or more worker threads. The best trade-off between time and accuracy/F1 score is around 5 workers, where the accuracy is 99.1% with a standard deviation of 0.4 percentage points and the F1 score is 0.984 with a standard deviation of 0.006. Using more than 5 workers does not give any significant benefit in terms of accuracy, F1 score or time, although the deviation becomes smaller and eventually reaches 0 at 50 workers and beyond.

Table 5.6: Bagging results with car evaluation data set.

Threads    Time (s)          Accuracy (%)    Precision    Recall    F1 score
2          0.297 (0.023)     97.3 (0.9)      0.942        0.962     0.951 (0.017)
5          0.639 (0.04)      99.1 (0.4)      0.988        0.980     0.984 (0.006)
10         1.292 (0.053)     99.1 (0.4)      0.996        0.973     0.985 (0.006)
20         2.527 (0.084)     99.4 (0.2)      1.0          0.979     0.990 (0.003)
50         5.501 (0.167)     99.5 (0)        1.0          0.981     0.990 (0)
100        11.215 (0.27)     99.5 (0)        1.0          0.981     0.990 (0)
200        22.491 (0.255)    99.5 (0)        1.0          0.981     0.990 (0)
400        44.711 (0.365)    99.5 (0)        1.0          0.981     0.990 (0)

(a) Time performance; total execution time in relation to the number of worker threads.


(b) Accuracy and F1 score.

Figure 5.4: Bagging with car evaluation dataset.

Figure 5.4a shows the execution time in relation to the number of threads for the bagging algorithm run on the car evaluation dataset. Similarly to the test on the adult dataset, the relationship between the execution time and the number of threads is linear.

Boosting

For the boosting implementation on the car evaluation dataset, all models trained for a given number of workers made exactly the same predictions. Furthermore, every model with 10 or more workers made exactly the same predictions.

Figure 5.5a shows the relationship between the total execution time and the number of threads. Just like on the adult dataset, the execution time increases linearly with the number of threads. The small size of the dataset might be the reason for the high accuracy, precision, recall and F1 score.

Table 5.7: Boosting results with car evaluation data set.

Threads    Time (s)          Accuracy (%)    Precision    Recall    F1 score
2          0.188 (0.004)     98.0 (0)        1.0          0.972     0.986 (0)
5          0.459 (0.005)     99.1 (0)        0.996        0.992     0.994 (0)
10         0.954 (0.026)     99.4 (0)        1.0          0.992     0.996 (0)
20         1.917 (0.132)     99.4 (0)        1.0          0.992     0.996 (0)
50         4.897 (0.25)      99.4 (0)        1.0          0.992     0.996 (0)
100        10.057 (0.463)    99.4 (0)        1.0          0.992     0.996 (0)
200        19.816 (0.958)    99.4 (0)        1.0          0.992     0.996 (0)
400        37.617 (0.703)    99.4 (0)        1.0          0.992     0.996 (0)


(a) Time performance; total execution time in relation to the number of worker threads.


(b) Accuracy and F1 score.

Figure 5.5: Boosting with car evaluation dataset.

Hard data partitioning

The hard data partitioning on a small dataset performed poorly in terms of time, accuracy and F1 score. The biggest difference from the larger adult dataset is the time performance, which now more closely resembles that of bagging and boosting. One reason that the trend is the opposite of the adult dataset test might be that the subsets given to each weak learner are quite small from the start, so the complexity of the decision trees is not reduced as much by an increase in the number of threads. Another aspect might be that the overlap, which is constant, takes up a larger and larger portion of the subsets as the number of threads increases.

Table 5.8: Hard data partitioning results with car evaluation data set.

Threads    Time (s)         Accuracy (%)    Precision    Recall    F1 score
2          0.296 (0.02)     91.5 (2.8)      0.885        0.847     0.847 (0.082)
5          0.453 (0.024)    92.0 (0.3)      0.839        0.879     0.856 (0.004)
10         0.609 (0.033)    89.7 (0.4)      0.895        0.776     0.832 (0.005)
20         0.791 (0.054)    89.9 (0.2)      0.897        0.781     0.835 (0.003)
50         1.138 (0.125)    87.9 (0)        0.920        0.728     0.813 (0)
100        1.667 (0.155)    87.9 (0)        0.920        0.728     0.813 (0)
200        2.457 (0.157)    87.1 (0.3)      0.877        0.726     0.795 (0.004)
400        3.745 (0.14)     87.0 (0.2)      0.869        0.727     0.792 (0.005)

(a) Time performance; total execution time in relation to the number of worker threads.


(b) Accuracy and F1 score.

Figure 5.6: Hard data partitioning with car evaluation dataset.

The accuracy and F1 score are the worst of any test on the car evaluation dataset. The reason could be the smaller data subsets that the algorithm sends to the workers for training. As we can see in Figure 5.6b, the accuracy and F1 score have a downward trend, and the performance becomes gradually worse as more worker threads are used.

5.3 Comparison

In this section we look at the relationship between F1 score and execution time of the algorithms based on the results from our experiments. The experiments with more than 100 workers did not give any significantly different results, which is why we exclude 200 and 400 workers from the graphs in this section. As we can see in Figure 5.7a, hard data partitioning (HDP) gives better results than bagging in terms of F1 score versus time. Although bagging has a slightly better F1 score, the difference is not significant, and the time difference between the best F1 score points of HDP and bagging makes bagging the less suitable algorithm. Boosting reaches a much higher F1 score but takes much more time to run than hard data partitioning. Figure 5.7b shows the results from the tests on the smaller car evaluation dataset. Here bagging and boosting show the best performance: even though HDP is still faster than the other two algorithms, its F1 score decreases with increasing execution time.


(a) Time performance set against F1 score for comparison of the different algorithms with the adult 32k dataset.

(b) Time performance set against F1 score for comparison of the different algorithms with car evaluation dataset.

Figure 5.7: F1 score set against time using Adult 32k and Car Evaluation datasets.

5.4 Downsampling

Bagging showed poorer results than boosting on the adult dataset, but comparable results on the car evaluation dataset. Since the relative performance was so different and bagging has better potential for time benefits from distribution, we chose to test it further by downsampling the adult dataset. We downsampled the adult dataset (which we will call adult 32k here since it has approximately 32,000 data points) to 16k and 2k. The results are shown in Figure 5.8. As can be seen, the dataset size has a significant impact on F1 score performance. However, with higher numbers of threads all dataset sizes tend to converge towards an F1 score of 0.65, and the deviation for 16k and 2k diminishes. The average training time per weak learner was approximately 5 s for adult 2k, 280 s for 16k and 1200 s for 32k.
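The thesis text does not state exactly how the smaller variants were drawn; a minimal sketch, assuming uniform sampling without replacement and using pandas (both assumptions), could look as follows.

import pandas as pd

def downsample(df: pd.DataFrame, target_size: int, seed: int = 0) -> pd.DataFrame:
    """Return a uniformly sampled subset of `df` with `target_size` rows."""
    return df.sample(n=target_size, random_state=seed)

# Hypothetical usage for the adult 16k and 2k variants:
# adult_16k = downsample(adult_32k, 16000)
# adult_2k = downsample(adult_32k, 2000)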

Figure 5.8: F1 score in relation to the number of threads for the adult 32k, adult 16k and adult 2k datasets.

6 Discussion

In this chapter we discuss our findings from the experiments performed and compare some of the results with other related articles.

6.1 Results

Bagging

In contrast to the base algorithm, the bagging ensemble obtained a higher precision than recall on the adult dataset. The accuracy saw a slight increase with a higher number of weak learners, with the high points at 20 and 400 workers for the adult dataset and 50 workers on the car dataset. The F1 score also improved as the number of weak learners grew, mainly due to the recall increasing while the precision stayed roughly the same.

The time increased linearly with the number of workers in the experiment. In a real distributed implementation we still expect the time to increase linearly, since the data partitioning still has to run once per worker at the master node, although the training can be done in parallel. Thus, we expect a real distributed implementation to be a lot quicker than our implementation, but still to become slightly slower for every worker thread added.
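This expectation can be made concrete with a sketch of how a master node might drive a truly distributed bagging run: the per-worker sampling step remains sequential at the master, while the training itself is farmed out in parallel. The 60% sampling fraction matches our experiments; the pool-based API and the train_tree callable are assumptions for illustration.

import random
from concurrent.futures import ProcessPoolExecutor

def sample_subset(data, fraction=0.6, seed=None):
    """Draw a 60% subset for one worker; this runs once per worker at the master."""
    rng = random.Random(seed)
    return rng.sample(data, int(len(data) * fraction))

def train_bagging(data, n_workers, train_tree):
    """`train_tree` is a placeholder for the decision tree training routine."""
    # Sequential part: cost grows linearly with the number of workers.
    subsets = [sample_subset(data, seed=i) for i in range(n_workers)]
    # Parallel part: in a real cluster each subset is trained on its own node.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(train_tree, subsets))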

Downsampling decreased the F1 score significantly for low numbers of threads, while it had little to no effect for higher numbers of threads. It is not clear whether this effect is exclusive to the dataset used or if it is a general behaviour. Since downsampling also decreases the training time for each individual weak learner, there might be an optimal trade-off between the degree of downsampling and the number of weak learners trained (more downsampling requires more weak learners for the same F1 score).

Hard data partitioning

Just like the bagging ensemble, the hard partitioned ensemble obtained a significantly higher precision than recall, which might be expected since the algorithm is similar to the bagging algorithm. The hard partitioning algorithm stands out from the rest of the algorithms with the property that the training/execution time decreases as the number of nodes increases. However, this property only shows in the bigger adult dataset. This is most likely due to a decrease in the complexity of the weak learners' decision trees when fewer data points are given; that effect becomes negligible when the dataset is small enough from the beginning. There do, however, seem to be significant diminishing returns for the execution time with more weak learners, which is reasonable since the difference between 1/n and 1/(n + 1) decreases as n increases. There also seems to be little correlation between the strength of the ensemble and the number of weak learners for 5 or more weak learners.

In the car evaluation dataset we notice an increase in execution time with an increasing number of workers. This might indicate that the algorithm does not handle smaller datasets, or datasets with fewer features, well. The decreasing accuracy and F1 score with the number of workers is most likely also a result of the size of the dataset and/or the number of features. To get a better idea of what these results depend on, more tests are needed.

Boosting

The adaptive boosting algorithm showed poor time performance on both the adult and car evaluation datasets. On the adult dataset the accuracy was worse than for both bagging and hard data partitioning, but the more important F1 measure was a lot higher in all tests. The performance did not increase beyond 5 workers. On the car evaluation dataset we see a similar trend for accuracy and F1 score; we struggled to see any performance benefit from using more than 5 workers. This might be due to adaptive boosting increasing the prevalence of overfitting and the impact of outliers in the training dataset, especially when each weak learner is relatively strong [22].

The training of the boosting algorithm is inherently sequential, which means that there are almost no time benefits from parallel distribution during training. Thus, the only time benefits to be had are when querying the model. Whether boosting is suitable therefore comes down to a compromise between a long training time and a high F1 score. To decrease the training time, and possibly make some gains in accuracy, it might be better to use weaker decision trees such as decision stumps.
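As a sketch of what such a weaker learner could look like on categorical data, a one-level decision stump simply predicts the majority class for each value of a single feature. The interface below is hypothetical and not part of our implementation.

from collections import Counter, defaultdict

def train_stump(rows, labels, feature):
    """Train a one-level decision stump on a single categorical feature.

    `rows` is assumed to be a list of dicts mapping feature names to values;
    the returned callable predicts the majority label seen for a value, or the
    overall majority label for unseen values (illustrative sketch only)."""
    by_value = defaultdict(list)
    for row, label in zip(rows, labels):
        by_value[row[feature]].append(label)
    default = Counter(labels).most_common(1)[0][0]
    rule = {v: Counter(ls).most_common(1)[0][0] for v, ls in by_value.items()}
    return lambda row: rule.get(row[feature], default)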

6.2 Method

The algorithms used in the experiment originated from a single, basic decision tree algorithm that is described in Chapter 3, which means that the results are mostly dependent on its implementation. To be able to analyze the potential improvements in the distributed versions we needed to perform tests to get the accuracy and F1 scores. Due to time constraints we only did a single test for each number of workers on the adult dataset and five tests on the car evaluation dataset. Preferably we would have done at least ten tests for each number of workers on both datasets. In our method we did not use any widely adopted techniques for preventing overfitting, which could have had a significant impact on the final results. The previously described XGBoost uses three methods for preventing overfitting: a regularized objective, shrinkage and column subsampling, which could have been implemented and possibly given better accuracy and a less biased model.
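Of those three, column subsampling would be the simplest to retrofit onto our decision trees: each weak learner is only allowed to split on a random subset of the features. A minimal sketch, with a hypothetical helper name and fraction, is given below.

import random

def column_subsample(feature_names, fraction=0.8, seed=None):
    """Pick a random subset of feature columns for one tree, in the spirit of
    XGBoost's column subsampling (the fraction and helper are assumptions)."""
    rng = random.Random(seed)
    k = max(1, int(len(feature_names) * fraction))
    return rng.sample(feature_names, k)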

In our method only two datasets were tested, which is not enough to give a proper big-picture view of the algorithms' performance. Bagging and boosting both used 60% of the total training dataset for each weak learner, which had a significant impact on the time performance per worker node. Using 10% or 20% of the training dataset could give a time performance increase at the possible cost of accuracy and F1 score. Using a wider variety of datasets with different numbers of features would have given a better idea of the overall performance of the algorithms used.

A factor that had a significant impact on the results was that the experiments were performed in a simulated distribution environment on a single physical machine. We did not have control over the memory and CPU resources delegated to each thread. A Python 3 instance runs on one physical core independently of the number of virtual threads. With that
