Topological recursive fitting trees

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2019

Topological recursive fitting trees

A framework for interpretable regression extending decision trees

ALEXANDRE TADROS

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master in Computer Science

Date: Thursday 12th March, 2020

Supervisor: Pawel Herman

Examiner: Örjan Ekeberg

Swedish title: Topologiskt rekursiva anpassade träd : ett ramverk för tolkbar regression genom utvidgade beslutsträd


Abstract


Sammanfattning


Contents

1 Introduction
  1.1 Problem statement
  1.2 Outline
2 Background
  2.1 Decision trees
  2.2 Dimension reduction
  2.3 Linear regression
  2.4 Related Work
3 Methods
  3.1 Model representation
  3.2 Learning procedure
    3.2.1 Hyper-parameters
    3.2.2 Algorithm
  3.3 Evaluation
4 Results
  4.1 Toy problem
    4.1.1 1-dimensional regression
    4.1.2 2-dimensional binary classification
    4.1.3 Influence of hyper-parameters
  4.2 Complexity
  4.3 Experimental setup
  4.4 Results in numbers
5 Discussion
  5.1 Results summary
  5.2 Connection to the state of the art
  5.3 Sustainability
  5.4 Weaknesses and limitations
  5.5 Future work
6 Conclusion


Chapter 1 Introduction

Applications of machine learning in real-world problems are various and numerous [13]. Embedded image recognition for defense purposes, text classification for human resources decision support, and time series forecasting for business intelligence are some of the many fields where machine learning is being introduced [22, 1, 10].

Studies support that machine learning algorithms are likely to become inherent parts of processes in companies, governments and institutions [29, 1, 35, 24]. For many of these applications, an interpretation of the algorithm output is required. Indeed, a human-understandable interpretation of how an output is produced by the model, including the learning process, is as important as the actual performance [34, 30]. Because machine learning is being given power over critical processes, there is a renewed interest in interpretable and explainable machine learning [16]. Interpretability is hard to define due to the variety of model uses and purposes. Motives and technical descriptions of interpretability are diverse and occasionally discordant, suggesting that interpretability refers to more than one concept. Hence academic research is starting to propose a common framework to evaluate and analyze the interpretability of machine learning models [11, 3, 26]. Indeed, there is a gap in the literature between theoretical papers aiming at ever greater accuracy on narrow tasks and the interpretability-enabled actionability of models in real-world problems. Deep learning is probably the machine learning field that has suffered the most from this gap in recent years. Nonetheless, recent work has focused on the interpretation of deep networks, particularly in computer vision [2, 15, 23].


We can see interpretability as the result of one of the two following strategies: (1) use inherently interpretable models or (2) postprocess models in a way that yields insights. The choice was made to adopt the first strategy and build an algorithm made out of inherently "interpretable" bricks.

In the context of regression problems, linear regression and decision trees are widely used methods featuring easy interpretation. For both algorithms, their internal simplicity renders the learned models expressive and easily interpretable. For linear regression, the learned weights of the linear relation between the output and the input can be directly interpreted as the importance of the corresponding feature in the model. For decision trees, the learned node segmentations indicate both the most determining input feature and the most discriminant value of that feature.

1.1 Problem statement

structure prevent any solid interpretation (corresponding to the aforementioned definition of interpretation).

We aim at extending the decision tree framework with PCA to achieve higher performance without overly deteriorating interpretability. In this thesis we propose a procedure to learn a decision tree based on topological appreciation of the data, called Topological Recursive Fitting (TRF). Instead of dividing the data along a single feature as in basic decision tree learning, TRF cuts along the linear regression direction, learned on the principal component directions, at each node. In the tree leaves, linear regressions are then performed on subsets of data. We make the assumption that the underlying model to be fit can be considered locally linear and that linear models are interpretable according to the aforementioned criteria. By this procedure we try to find the best way to split the data to achieve local linearity. In the absence of direct and easy interpretation, we will explain how our approach is interpretation-friendly, in contrast with the previously mentioned black-box models. In summary, we aim at answering the question of whether the decision tree framework, extended through dimensionality reduction and linear regression (namely the TRF procedure), produces regression performance on par with low-interpretability state-of-the-art methods. The present work does not aim at giving a thorough definition of interpretability nor a way of evaluating it. We take for granted here that linear models are somewhat interpretable and that a procedure featuring linear models is more likely to be interpreted, but the current work does not extensively demonstrate the latter.

Our approach, which takes inspiration from previous work [4, 32, 20, 18, 28, 36], is analyzed on several benchmark datasets and compared with Linear Regression, Random Forests, Gradient Boosting Trees, Support Vector Machines and Neural Networks. A theoretical interpretability comparison is made as well as an evaluation of the performance.

1.2 Outline


Chapter 2 Background

The algorithms under investigation in the thesis combine concepts from well-established approaches in the domains of regression, dimension reduction, decision tree learning and boosting. Basic linear regression and Principal Component Analysis algorithms are detailed, and the decision tree learning algorithm with its most popular extensions is described here.

2.1 Decision trees

There are several ways to describe a decision tree. It can be viewed as a factored boolean expression [36] as well as a partitioning of the input space into cuboid regions. For the latter, simple models are individually assigned to each region [5]. Note that region edges are aligned with the axes in the basic version. The name "tree" comes from the representation as decision nodes that each consist of a threshold value comparison. The process of selecting a specific leaf model, given a new input, can be described by a sequential decision making process corresponding to the traversal of a tree [5]. We focus here on classification and regression trees, or CART [6].

Figure 2.1: Simple binary decision tree illustration: (a) Binary tree corresponding to the partitioning of input space in figure (b), (b) Illustration of a two-dimensional input space that has been partitioned into five regions using axis-aligned boundaries.

Such a model is learned from a training set by a greedy optimization that starts with a single root node (corresponding to the whole input space) and then grows the tree by adding nodes one at a time. At each step, the combination of an input variable and a corresponding threshold value is jointly determined by exhaustive search. Growing the tree until leaves contain a single data point would cause the global model to be very large and to considerably overfit the training set. A common practice is to grow a large tree using a stopping criterion based on the number of data points associated with the leaf nodes, and then prune back the resulting tree [5]. The pruning is based on a criterion that balances residual error against a measure of model complexity [5].
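The exhaustive search performed at a single node can be sketched as follows. This is a minimal illustration rather than the CART reference implementation: the function names `sse` and `best_axis_aligned_split` are hypothetical, the split cost is assumed to be the sum of squared errors of the two children around their means, and candidate thresholds are assumed to be the observed feature values.

```python
from statistics import fmean


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_axis_aligned_split(X, y):
    """Exhaustively search the (feature, threshold) pair minimizing children SSE.

    X is a list of rows (lists of floats), y a list of targets.
    Returns (feature_index, threshold, cost), or None if no split exists.
    """
    best = None
    n_features = len(X[0])
    for j in range(n_features):
        # Candidate thresholds: every distinct value of feature j except the largest,
        # so that the right child is never empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            cost = sse(left) + sse(right)
            if best is None or cost < best[2]:
                best = (j, t, cost)
    return best


# Toy data: the target jumps when feature 0 exceeds 1.0.
X = [[0.0, 5.0], [1.0, 3.0], [2.0, 4.0], [3.0, 1.0]]
y = [0.0, 0.0, 10.0, 10.0]
split = best_axis_aligned_split(X, y)
```

On this toy data the search selects feature 0 with threshold 1.0, which separates the two target levels exactly. Growing a full tree would apply the same search recursively to each child.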

The human interpretability of a tree model such as CART is often seen as its major strength. However, in practice it is found that the particular tree structure that is learned is very sensitive to the details of the data set, so that a small change to the training data can result in a very different set of splits [5]. Another problem is that the splits are aligned with the axes of the feature space, which may be suboptimal [19]. In the context of high-dimensional data, this limitation is intensified and prevents a decision tree model from achieving high accuracy.

2.2 Dimension reduction

have been developed to mitigate the curse of dimensionality when using decision trees.

Simple methods consist in dropping features having low variance or keeping only one of a group of highly correlated features. Forward Feature Construction and Backward Feature Elimination [25] iteratively keep the features with the most impact on the error rate and remove those with the least impact, respectively. This approach is computationally expensive, which limits scalability, and is restricted to already existing features. A more sophisticated approach that implements feature selection is the bagged extension of decision trees, namely Random Forests [18]. The latter grows multiple decision trees on randomly selected samples and randomly selected features. The outputs of the individual decision trees are then aggregated to provide the final output.

Instead of picking among the initially existing features, PCA builds fewer new features called components or factors. It can be seen as a technique that iteratively finds a set of orthonormal basis vectors along which the variance is maximized [5]. Equivalently, it can be defined as the linear projection that minimizes the average projection cost, defined as the mean squared distance between the data points and their projections [32]. Both formulations show that components encode the way data points are spread in the space. The dimension reduction is performed by selecting the k principal components. Usually k is chosen large enough to encode most of the explainable variance, meaning that if the gain in explained variance from one component to the next is not high enough, all subsequent components are discarded. An advantage of PCA is the statistical significance encapsulated in each component. Besides, components are built as linear combinations of the original features, which makes them easy to interpret.
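The selection of the k principal components by explained variance can be illustrated with a short NumPy sketch. It is an assumption-laden example, not the implementation used in this work: the function name `pca`, the `explained` parameter and the cumulative-variance rule for choosing k are introduced here for illustration only.

```python
import numpy as np


def pca(X, explained=0.95):
    """Return (components, ratios), keeping enough components to reach `explained` variance."""
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are orthonormal principal directions, ordered by decreasing variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (len(X) - 1)  # variance of the data along each direction
    ratios = var / var.sum()     # share of total variance per component
    # Assumed rule: smallest k whose cumulative explained variance reaches the target.
    k = int(np.searchsorted(np.cumsum(ratios), explained) + 1)
    return Vt[:k], ratios[:k]


# Three features, but almost all the variance lies along the direction (1, 2, 0).
rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=500), 0.01 * rng.normal(size=500)])
components, ratios = pca(X, explained=0.95)
```

Each row of `components` holds the weights of the original features in one component, which is what makes a component readable as a linear combination of features.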

2.3 Linear regression

Along with decision trees, linear regression [5] is considered an interpretable model. Indeed, in this framework the objective variable is expressed as a linear combination of the input feature variables.

variable from all of the others in the model.

Interpretation can then be done by reading the value of the fitted weight of each feature as the 'role' of this feature in the variations of the objective variable.

Let us consider the framework in which regression aims at defining a mapping $f$ from an input $x \in \mathbb{R}^d$ to an output space of dimension 1, $y \in \mathbb{R}$, based on sample data points $(x_i, y_i)_{i \in 1 \dots N}$.

Linear regression tries to fit a model of the form

\[ y = w^T x + \epsilon \]

where $\epsilon$ is a zero-mean Gaussian random variable with variance $\sigma^2$. We can derive the Maximum Likelihood Estimates of the parameters from the likelihood function:

\[ w_{ML} = (X^T X)^{-1} X^T Y \]

\[ \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} \left( y_n - w_{ML}^T x_n \right)^2 \]

where

\[ X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N,1} & x_{N,2} & \cdots & x_{N,d} \end{pmatrix} \quad \text{and} \quad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} \]

and $w_{ML}$ and $\sigma^2_{ML}$ are the Maximum Likelihood Estimates of the model parameters. We notice that the variance estimate is the residual variance of the target value around the regression function.

This regression method is also known as the least squares method since the maximum likelihood estimation can be seen as the minimization of the squared error between target values and predicted values. Like PCA, the linear regression framework provides a confidence interval around the model that helps interpretation.
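As a small numerical check of the maximum likelihood estimates, the following NumPy sketch fits a linear model to synthetic data. The function name `ml_linear_regression` and the synthetic weights are hypothetical. `numpy.linalg.lstsq` is used instead of forming the matrix inverse explicitly; for a full-rank $X$ it returns the same solution as $(X^T X)^{-1} X^T Y$.

```python
import numpy as np


def ml_linear_regression(X, y):
    """Return (w_ml, sigma2_ml) for the model y = w^T x + eps."""
    # Least-squares solution, equal to (X^T X)^{-1} X^T y when X has full column rank.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ w
    sigma2 = float(np.mean(residuals ** 2))  # (1/N) * sum of squared residuals
    return w, sigma2


# Synthetic data with known weights and noise variance 0.1^2 = 0.01.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=2000)
w_ml, sigma2_ml = ml_linear_regression(X, y)
```

The recovered weights can be read directly as the contribution of each feature to the objective, and `sigma2_ml` estimates the residual variance around the regression function.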


2.4 Related Work

The decision tree framework has been extended in several ways to gain performance. Random Forest is a widely used bagging variant of the decision tree framework. In this method, multiple decision trees are trained on random subsets of training samples with a random selection of features, thus forming a forest. The final prediction is computed by aggregating all individual tree results. An alternative named oblique Random Forest has also been developed, which separates the input space by randomly oriented hyperplanes. Gradient Boosting Trees is a boosting variant of the decision tree framework that also achieves high performance on many regression problems. The principle is to stack simple decision trees that are iteratively trained on residuals. The final prediction is the sum of the predictions of all stacked trees. Random Forests and Gradient Boosting Trees mitigate the poor performance of basic decision tree regression, but at the same time they give up the interpretation power of a single tree structure.

Postprocessing has been studied to improve the interpretability of those ensemble tree methods. For instance, it is possible to evaluate the weight of each input feature in a random forest [31]. Another method suggests learning a much simpler model while minimizing the Kullback-Leibler divergence between the distributions represented by the complex and the simple model [17]. While some analogies can be made between the ways the decision tree framework is extended by ensemble methods to achieve higher performance and the way a TRF tree is trained, the difference is that TRF is potentially inherently interpretable thanks to its linear internals. Those analogies are presented in the Discussion section.


Chapter 3 Methods

We propose to add topological modeling to the aforementioned decision-tree-based methods (Decision Trees, Random Forests, Gradient Boosting Trees) thanks to linear dimension reduction through PCA. PCA offers a statistical framework and a good interpretation thanks to its linearity. We allow decision tree input space boundaries to lie along a free direction, namely the orthogonal to the linear regression direction performed on significant main components.

3.1 Model representation

The learned model can still be seen as a decision tree. Decision nodes evaluate whether a linear combination of the projections of attributes on main components is greater than a learned value. Evaluation still consists in the traversal of the tree until reaching a leaf node, as illustrated in figure 3.1. In a leaf node, the final prediction is given by a linear regression model.

Figure 3.1: TRF evaluation of a data point $x = (x_1, x_2, \dots, x_d)$: Starting from the root node, linear combinations of the features of $x$ are successively compared to the learned threshold to decide whether to continue evaluation to the left child node or the right child node. The parameters $\{\{p_i\}, \{p_{l,i}, p_{r,i}\}, \dots, \{p_{lll,i}\}, \{p_{llr,i}\}, \dots, \{p_{rrl,i}\}, \{p_{rrr,i}\}\}$ are learned during the training phase as the main component of the considered data at each node. The notation $l$ stands for left and $r$ for right. A succession of $l$'s and $r$'s designates the path in the tree leading to a specific node. Linear regression parameters $\{\{w_{lll,i}\}, \{w_{llr,i}\}, \dots, \{w_{rrl,i}\}, \{w_{rrr,i}\}\}$ at each leaf node are also learned during the training phase and serve for prediction purposes in the evaluation. The main difference is that instead of evaluating whether a single feature value is greater than a threshold as in classical decision trees, here a linear combination of the features is compared to a threshold.
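The evaluation of a data point can be sketched in a few lines of Python. The `Node` and `Leaf` classes and the `predict` function below are hypothetical names introduced for illustration. The sketch assumes that points whose linear combination is at most the threshold go to the left child, and that each leaf stores a weight vector plus an intercept.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    weights: List[float]  # linear regression coefficients w of the leaf model
    intercept: float


@dataclass
class Node:
    direction: List[float]  # coefficients p of the linear combination tested at this node
    threshold: float
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def dot(a, b):
    """Inner product of two equally sized sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def predict(tree, x):
    """Traverse the tree from the root, then evaluate the linear model of the reached leaf."""
    node = tree
    while isinstance(node, Node):
        # Assumed convention: go left when the linear combination is at most the threshold.
        node = node.left if dot(node.direction, x) <= node.threshold else node.right
    return dot(node.weights, x) + node.intercept


# A depth-1 tree that splits the plane along the line x1 + x2 = 0.
tree = Node(
    direction=[1.0, 1.0],
    threshold=0.0,
    left=Leaf(weights=[1.0, 0.0], intercept=0.0),
    right=Leaf(weights=[0.0, 2.0], intercept=1.0),
)
```

For example, `predict(tree, [-1.0, 0.5])` follows the left branch and `predict(tree, [1.0, 1.0])` follows the right one. The path taken and the leaf weights together explain a given prediction.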

3.2 Learning procedure

the principal components. The threshold value is then assigned to any statistic or aggregate value of the projections of the data points. In the vanilla procedure, we chose to use the median value as the threshold in order to split the training subset in half at each step. Then two child nodes are added to the former node, respectively associated with the sub-threshold and over-threshold training samples. When a stopping condition is reached at a node, the latter becomes a leaf node and a linear regression is performed on the training samples associated with the node.

The stopping condition is typically a minimum number of training samples associated with a node. There could be other stopping criteria, such as goodness of fit, depending on the nature of the data. The stopping criterion should be such that the splitting stops when the local data can be linearly approximated. It is set by the hyper-parameters described hereafter.

3.2.1 Hyper-parameters

The hyper-parameters needed by the procedure are the following:

• Minimum population in a leaf: min.population

This parameter defines the stopping criterion. If splitting a node's data in half would result in child nodes populated with fewer than min.population points, the node becomes a leaf node and the input space is not split further. It is tightly linked to the final number of leaves, so it can be chosen according to the number of 'states' or 'modes' of the model: the number of leaves in the tree is the number of different models that represent the data in the global model.

• Number of predictors: qt.predictor

This parameter sets the number of principal components to be used for linear regression at each node. Note that the user can choose a percentage of explained variance of the data instead of the number of predictors at each step. Only significant factors are kept.

This parameter specifies the degree of additional polynomial features that are to be built from initial features.


3.2.2 Algorithm

In this section, we present the formal definition of TRF, an algorithm aiming at reaping the rewards of combining linear models inside a decision tree framework. The learning procedure of TRF is formalized in Algorithm 1. Let us consider a set of $N$ data points $x_i \in \mathbb{R}^d$ represented by the matrix $X \in \mathbb{R}^{N \times d}$, provided with a set of corresponding objectives $y_i \in \mathbb{R}$ represented by $y \in \mathbb{R}^N$. We define the $PCA(X, y, k)$ operator as a PCA performed on the data points in $X$ from which the $k$ factors most correlated with $y$ are kept.

Algorithm 1 Topological Recursive Fitting

1: procedure TRF(X, y)
2:     T ← Root node
3:     N ← number of data points in X
4:     if N > min.population then
5:         ŷ ← LinearRegression(PCA(X, y, k), y)
6:         Me ← median(ŷ)
7:         X_L ← X[ŷ_i ≤ Me]
8:         X_R ← X[ŷ_i > Me]
9:         Append nodes TRF(X_L, y_L), TRF(X_R, y_R) to T
10:    else
11:        L ← regression(X, y, non.linear.order)
12:        Append leaf L to T
13:    return T

Some of the steps of Algorithm 1 are detailed below.

5 : Compute linear regression on the k first principal components of the data, which gives the linear combination to be tested at this node

6 : Project the $x_i$'s on m, which results in a vector of size N. Those are the linear combination evaluations for the node

7 : Compute the threshold value for this node

8 : Create the left sub-population with data points in the half space before the median along the main direction. $X[\hat{y}_i > Me]$ is the notation selecting every data point $x_i$ for which $\hat{y}_i$ is greater than $Me$.

9 : Create the right sub-population with data points in the half space after the median along the main direction
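The recursive procedure of Algorithm 1 can be sketched with NumPy as below. This is a minimal sketch under stated assumptions, not the implementation evaluated in this work: all function names and the dictionary-based tree are hypothetical, the "k factors most correlated with y" are assumed to be chosen by absolute Pearson correlation, the leaves use a plain linear regression with intercept (no additional polynomial features), and a degenerate split falls back to a leaf.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with intercept; the last coefficient is the intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def apply_linear(coef, X):
    """Evaluate a model returned by fit_linear on the rows of X."""
    return np.column_stack([X, np.ones(len(X))]) @ coef


def split_direction(X, y, k):
    """Regress y on k principal components; return (direction, offset) in feature space."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    scores = (X - mean) @ Vt.T
    # Assumption: "most correlated with y" means largest absolute Pearson correlation.
    corr = np.nan_to_num([abs(np.corrcoef(scores[:, j], y)[0, 1])
                          for j in range(scores.shape[1])])
    keep = np.argsort(corr)[::-1][:k]
    coef = fit_linear(scores[:, keep], y)
    direction = Vt[keep].T @ coef[:-1]  # regression direction mapped back to the features
    return direction, coef[-1] - mean @ direction


def trf(X, y, min_population=50, k=1):
    """Recursive median split along the fitted direction, with linear models in the leaves."""
    if len(X) > min_population:
        direction, offset = split_direction(X, y, k)
        y_hat = X @ direction + offset
        me = np.median(y_hat)
        left = y_hat <= me
        if left.all() or not left.any():  # degenerate split: stop here (assumption)
            return {"leaf": fit_linear(X, y)}
        return {"direction": direction, "offset": offset, "threshold": me,
                "left": trf(X[left], y[left], min_population, k),
                "right": trf(X[~left], y[~left], min_population, k)}
    return {"leaf": fit_linear(X, y)}


def predict_one(tree, x):
    """Traverse the tree for a single point x and evaluate the leaf model."""
    while "leaf" not in tree:
        go_left = x @ tree["direction"] + tree["offset"] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return float(apply_linear(tree["leaf"], x[None, :])[0])


# Noise-free cubic signal: a piecewise linear model should beat a single linear fit.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(2000, 1))
y = X[:, 0] ** 3
tree = trf(X, y, min_population=100, k=1)
pred = np.array([predict_one(tree, x) for x in X])
rmse_trf = float(np.sqrt(np.mean((y - pred) ** 2)))
rmse_lin = float(np.sqrt(np.mean((y - apply_linear(fit_linear(X, y), X)) ** 2)))
```

The comparison between `rmse_trf` and `rmse_lin` on the training data only illustrates that the local linear models follow the curvature. It says nothing about generalization, which requires a held-out set.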

3.3 Evaluation

Measures used to assess any algorithm performance will be the Root-Mean-Squared-Error (RMSE) and the R-square error evaluated on the test set.

RMSE gives the $L_2$-averaged error between predictions and ground truth. It is defined by

\[ RMSE = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}} \]

where the square root ensures that the error is measured on the same scale and in the same units as the target variable.

R-squared ($R^2$) error (also known as coefficient of determination) gives the proportion of signal variance explained. The main term is the fraction of variance unexplained, comparing the variance of residuals and the variance of the signal itself. An $R^2$ value of 1 means a perfect fit, while a value of 0 indicates a poor fit.

\[ R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} \]
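Both measures follow directly from their definitions, as the short Python sketch below shows. The function names `rmse` and `r_squared` are chosen here for illustration. Note that $R^2$ equals 0 for a model that always predicts the mean of the targets and becomes negative for a model that does worse than that.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root-mean-squared error between targets and predictions."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 minus the fraction of variance unexplained."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot


# Three exact predictions and one prediction that is off by 2.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]
```

Here the residual sum of squares is 4 and the total sum of squares is 5, so the RMSE is 1 and $R^2$ is 0.2.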


Chapter 4 Results

4.1 Toy problem

Let us illustrate TRF's learning procedure and evaluation on simple problems for didactic purposes. First, 1-dimensional regression problems followed by a 2-dimensional classification problem serve to show the result of the fitting with TRF. Then a study of the influence of hyper-parameters is developed.

4.1.1 1-dimensional regression

Simple signal

We consider a set of 10000 1-dimensional data points following a normal distribution $\mathcal{N}(0, 1)$ with a non-linear objective $y = x^3 + e^{-x^2} + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$. We use a random selection of 8000 points for training and use the rest for testing.

Figure 4.1: TRF regression for 1-dimensional data with a non-linear objective. Black dots are training data points. The red line represents the continuous path of the model's predictions on the test data points.

The model shown in Figure 4.1 follows the data points in a simple way and is therefore visually satisfying. If we pay attention to the red line, we can see junctions between different models. The first model is a linear regression up to -1.5. The second is a linear regression beyond +1.5. The interval between -1.5 and +1.5 is split into some 50 regions within which individual linear models are fitted. Indeed, with a min.population of 300 and 8000 data points, dense regions are sliced multiple times (since the feature follows a normal distribution, there are more points around the mean of the distribution than towards the tails).

Heterogeneous signal

Figure 4.2: Heterogeneous dataset: The first part of the signal is linear, $y = x + 1 + \epsilon$, and then suddenly changes to a sinusoidal signal $y = \frac{\cos(2x) + 3}{4} + \epsilon$.

Figure 4.3: TRF regression on heterogeneous data : Dots are test set samples. Their color represents their leaf node. The red line corresponds to the test data predictions. Each color represents a leaf node of the learned decision tree that has a depth of 3 as we chose to set min.population to 30.


4.1.2 2-dimensional binary classification

We can apply the same procedure to a binary classification problem. Instead of performing a linear regression in the leaves of the tree, a logistic model is fitted.

Let us consider a set of 20000 data points following a 2-dimensional normal distribution. A selection of 15000 data points is used for training, and the remaining 5000 for testing. The objective is thus binary (0 or 1) classification, and the probability for a point to belong to class 1 follows a 2-dimensional normal distribution with an elliptical density centered around (1, 1) (see figure 4.4).

Figure 4.4: TRF binary classification learning. The 2-dimensional data is recursively sliced into sub-datasets until reaching groups of min.population = 400 data points on which to perform logistic regression. (a) Training set of 15000 points. Dark points are labeled 0, light points are labeled 1. (b) Test set of 5000 points. The blue shade represents the predicted probability.

4.1.3 Influence of hyper-parameters

sufficiently linear. The influence of this parameter on the previously described toy problem is illustrated in figure 4.5.

Figure 4.5: Minimum population hyper-parameter influence: (a) Minimum population per leaf: 50. (b) Minimum population per leaf: 3000. There are 8000 data points spread densely along the one feature. A small min.population results in overfitting in low-density regions (here both the left and right ends). The model generalizes badly because of the bias of the training data. A high min.population results in underfitting. The model fails to capture the non-linearity of the data.

It is common practice in machine learning to keep a part of the training set as a validation set to search for the best min.population hyper-parameter by cross-validation [5], and it can be applied here. The entire set of example data points is split into 3 parts: 70% is dedicated to training, 10% to validation and 20% to testing. Hyper-parameters are optimized to achieve the highest performance on the validation set, and then used on the test set to compute the final performance.
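The 70% / 10% / 20% split can be sketched as follows. The helper name `train_val_test_split`, the fixed seed and the use of a random shuffle are assumptions made for illustration; for a time series, a chronological split would be a more natural choice than shuffling.

```python
import random


def train_val_test_split(n, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle the indices 0..n-1 and cut them into three disjoint index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    # Whatever remains after training and validation goes to the test set.
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


train_idx, val_idx, test_idx = train_val_test_split(1000)
```

Each candidate value of min.population would then be fitted on the training indices and scored on the validation indices, with the test indices touched only once for the selected value.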

4.2 Complexity

The time complexity of TRF can be derived to compare it against state-of-the-art algorithms such as neural networks, gradient boosting trees and random forest. At each step, the computation of the first principal component of the considered population has a complexity of $O(nd^2)$ where the population is composed of n data

number of data points until the population is composed of at least min.population data points. Furthermore, the linear regression computed on each of the $\frac{N}{min.population}$ leaf populations has a complexity of $O(min.population \times d^2)$. Thus the overall complexity can be derived to be $O\left(Nd^2 + Nd^2 \log\left(\frac{N}{min.population}\right)\right)$.
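The overall bound can be retraced as a short derivation. It assumes perfectly balanced median splits, so that depth $j$ of the tree holds $2^j$ nodes of $N/2^j$ points each, and it writes $m$ for min.population and $D$ for the depth of the tree; both symbols are introduced here for brevity.

```latex
\[
\underbrace{\sum_{j=0}^{D-1} 2^{j} \cdot O\!\left(\frac{N}{2^{j}}\, d^{2}\right)}_{\text{splits}}
\;+\;
\underbrace{\frac{N}{m} \cdot O\!\left(m\, d^{2}\right)}_{\text{leaf regressions}}
\;=\; O\!\left(N d^{2} D\right) + O\!\left(N d^{2}\right),
\qquad D \approx \log_{2}\frac{N}{m}.
\]
```

Each level of the tree therefore costs $O(Nd^2)$ in total regardless of how many nodes it contains, and the logarithmic factor counts the number of levels.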

4.3 Experimental setup

We compared our approach to five other algorithms on four datasets. The chosen comparison algorithms are Support Vector Machines, Random Forest, Gradient Boosted Trees, feed-forward Neural Network and basic Linear Regression. They constitute a representative panel of widely used algorithms for regression, from simple and interpretable models to state-of-the-art, high-performing black-box ones.

We computed predictions on the four following datasets:

• Household power consumption time series [14]: One variable was considered to fit an auto-regressive model, simply using past aggregations to compute 1-day predictions from 2006 to 2010

• Residential building [33]: Predict the sale price of residential buildings based on 103 features for 372 data points

• Energy consumption [9]: Predict the energy consumption of a residence based on 26 features for 20000 data points

• Blog Feedback [8]: Predict the number of comments of a post in the upcoming 24 hours based on 280 features for some 50000 blog posts

The measures used to assess the performance of each algorithm are the Root-Mean-Squared-Error (RMSE) and the R-squared error evaluated on the test set. Since the range of RMSE values depends on the mean and standard deviation of the objective, these statistics are provided with the results.


4.4 Results in numbers

Predictions have been computed with R implementations of the mentioned algorithms from dedicated open source packages. $R^2$ and RMSE metrics are measured on the test sets and displayed in table 4.1. The mean $\mu$ and standard deviation $\sigma$ of the test set objective, the number of data points in the training set $n_{train}$ and the number of input features $d$ are displayed for each dataset.

Measure   Algorithm                Blog     Energy   Power    Residential
µ         (test set mean)          5.4      24.9     505.9    923.1
σ         (test set std. dev.)     30.6     14.4     186.4    542.6
n_train   (training points)        52896    13814    640      260
d         (input features)         62       26       7        103
RMSE      Linear Regression        25.34    14.43    156.12   354.51
RMSE      Neural Network           -        14.39    186.65   878.56
RMSE      Random Forest            22.90    14.64    153.45   418.31
RMSE      SVM                      -        14.56    158.00   217.24
RMSE      TRF                      25.29    14.83    155.82   383.41
RMSE      Gradient Boosted Trees   24.52    16.07    168.51   232.38
R²        Linear Regression        0.31     -0.01    0.30     0.57
R²        Neural Network           -        0.00     0.00     -1.64
R²        Random Forest            0.44     -0.03    0.32     0.40
R²        SVM                      -        -0.02    0.28     0.84
R²        TRF                      0.32     -0.06    0.30     0.50
R²        Gradient Boosted Trees   0.36     -0.25    0.18     0.81

Table 4.1: Prediction performance on 4 datasets: RMSE and $R^2$ error measures are provided for each of the evaluated algorithms. Computations taking more than 6 hours are not displayed, which is the case for the Neural Network and SVM algorithms on the blog dataset. The mean $\mu$ and standard deviation $\sigma$ computed on the test set, along with the number of training examples $n_{train}$ and the dimensionality of the input space $d$, are provided for each dataset.

was not the main concern as all the algorithms under investigation are fed with the same input and the same output. Thus for some datasets, all the algorithms achieve poor performance. The diversity of the selected datasets reflects the fact that algorithms can achieve high performance on one dataset but not on another. For instance, Neural Networks feature the lowest RMSE on the Energy dataset but the highest on the Residential dataset. TRF features performance measures between the worst and best performance for each dataset. In Figure 4.6 we plot the error measures with respect to the dimension of the input space. We can see that TRF's performance falls within the range of the other algorithms' performance on the 4 selected datasets.

Figure 4.7: Error comparison between TRF (in blue) and other algorithms on 4 benchmark datasets: (a) displays RMSE error, normalized by the mean of the objective. Results on the blog dataset have been discarded to zoom in on similar results. (b) displays $R^2$ error. Neural Network results have been discarded to zoom in on the area of interest.

Errors are compared in figure 4.7. The question to be answered is whether TRF provides the same results as the other benchmark algorithms, which is the null hypothesis in the following Student's t-tests, performed on both normalized RMSE and $R^2$ populations of size 4 for TRF and size 20 for other algorithms.

Comparing the TRF RMSE population to the other algorithms' errors results in a p-value of 0.71. For the $R^2$ error the p-value is 0.55. These high values imply that the null hypothesis cannot be rejected, largely due to the low number of instances in the statistical test.

Figure 4.8: Comparison of computation time in seconds between TRF (blue tiles) and other algorithms (red tiles) depending on the number of data points and the number of dimensions: (a) displays the computation time according to the number of features. (b) displays the computation time according to the number of data points in the train set.


Chapter 5 Discussion

5.1 Results summary

State-of-the-art methods such as Random Forests, Gradient Boosting Trees or neural networks achieve high accuracy for regression problems but at the cost of interpretation [16]. We proposed a topological extension of the interpretable decision tree framework to achieve higher performance than basic decision trees, while keeping a higher interpretation power than the aforementioned state-of-the-art black-box approaches. We tested its performance on several regression problems and compared it against other regression methods.

For each dataset, TRF achieves performance comparable with the other algorithms, in terms of RMSE or $R^2$ error measures. As expected from the dimension reduction brick and the adjustability of the model, other algorithms seem less robust to high-dimensional data and to datasets with few data points. For instance, the implementation of Neural Networks that was used does not converge for a 103-dimensional dataset of 260 data points.

Concerning the computational time aspect, as shown in figure 4.8, the learning time of TRF increases for a dataset with a high number of samples, but not as much as that of Neural Networks. For low-dimensional as well as high-dimensional datasets, TRF still has a reasonable computation time compared to SVM or Neural Networks.

We recall that all those considerations are made under the parametrization rules defined in section 4.3.


5.2 Connection to the state of the art

Similarly to Random Forest [18], TRF learns locally on subsets of training samples. In the Random Forest procedure, those subsets are randomly selected for each tree of the forest. In the proposed procedure, the data is recursively split in half. In both cases, several models are learned on subsets of training data points. A key difference with Random Forest is that TRF keeps a single tree structure, while multiple trees are grown in Random Forest, making interpretation harder.

Gradient Boosting Trees [12] iteratively try to fit the residuals of the previously fitted tree in order to capture weak signal, as simple decision trees are first fitted to capture the global shape of the data. This can be compared with TRF’s split of the dataset along the main component of the data. While other trees are then grown and stacked on the previous ones in Gradient Boosting, preventing easy interpretation of the global model, TRF keeps the single tree structure and uses a topological method to learn decision nodes.
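The split in half along the main component of the data can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis implementation: the first principal component is approximated by power iteration on the sample covariance matrix, and the function names `first_principal_component` and `median_split` are hypothetical.

```python
import math
import random
from statistics import median


def first_principal_component(points, iterations=200, seed=0):
    """Approximate the leading eigenvector of the covariance matrix by power iteration."""
    n, d = len(points), len(points[0])
    center = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - center[j] for j in range(d)] for p in points]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Random start vector, so that it is unlikely to be orthogonal to the solution.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate data with zero variance
            break
        v = [x / norm for x in w]
    return center, v


def median_split(points):
    """Project points on the first principal component and cut at the median score."""
    center, axis = first_principal_component(points)
    scores = [sum((p[j] - center[j]) * axis[j] for j in range(len(axis)))
              for p in points]
    cut = median(scores)
    left = [p for p, s in zip(points, scores) if s <= cut]
    right = [p for p, s in zip(points, scores) if s > cut]
    return left, right, axis, cut


# Toy data stretched along the first coordinate.
toy = [(float(x), 0.1 * (-1) ** x) for x in range(10)]
left, right, axis, cut = median_split(toy)
print(len(left), len(right), axis)
```

Because the cut is placed at the median of the projected scores, both children receive about half of the points, which is what keeps the tree balanced without any impurity or goodness-of-fit search.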

Most linear model tree approaches do not use a median split rule. Impurity checks and goodness-of-fit criteria have been explored the most, but they induce computation time issues when dealing with huge data sets [21].

The particularity of TRF is its mixture of mathematical and statistical notions with computer science principles. The recursiveness of the algorithm and its capacity to plug in other cut rules, dimension reduction methods and regression algorithmic bricks make it flexible.

5.3 Sustainability


integration projects.

Introducing AI into real-world processes comes with ethical considerations of how to do so. Two of the most important requirements on an algorithm are the mastering of the underlying mathematics and the understanding of the algorithm output. Black-box models fail to be used in real-world systems because the responsibility for their output cannot be assigned. For instance, a recruitment AI system suggesting to hire one candidate over another should be able to explain its suggestion, as human recruiters cannot blindly trust an output probability. Being able to track the path of a prediction in the tree, with statistical significance at each node, makes it possible to understand why given points of the space have been isolated in the same region. The interpretation of the leaf linear regression then helps in understanding the mapping between the prediction and the input features. In a real application, since the internals of the algorithm are simple and mastered, a prediction that does not fit human intuition or expert expectation is easier to investigate, in order to adjust hyper-parameters for instance. An algorithm with this capacity is more likely to be used in real-world applications.
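Tracking the path of a prediction can be sketched with a small tree structure. The `Node` class and the `predict_with_path` function below are hypothetical and deliberately simplified: each decision node projects the input on one axis and compares the score with a cut value, and each leaf holds a linear model. The per-node dimension reduction and significance statistics of the actual model are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Decision node fields: projection axis and cut value (hypothetical layout).
    axis: Optional[List[float]] = None
    cut: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    # Leaf fields: linear model y = intercept + sum(coef[i] * x[i]).
    coef: Optional[List[float]] = None
    intercept: float = 0.0


def predict_with_path(node, x):
    """Return the prediction for x and a readable trace of the decisions taken."""
    path = []
    while node.coef is None:  # a node without coefficients is a decision node
        score = sum(a * xi for a, xi in zip(node.axis, x))
        go_left = score <= node.cut
        sign = "<=" if go_left else ">"
        path.append(f"score={score:.3f} {sign} cut={node.cut:.3f}")
        node = node.left if go_left else node.right
    y = node.intercept + sum(c * xi for c, xi in zip(node.coef, x))
    return y, path


# Toy tree with one decision node and two linear leaves.
tree = Node(
    axis=[1.0, 0.0],
    cut=0.0,
    left=Node(coef=[2.0, 0.0], intercept=1.0),
    right=Node(coef=[0.0, 3.0], intercept=-1.0),
)
prediction, trace = predict_with_path(tree, [1.0, 2.0])
print(prediction, trace)
```

The trace gives one line per decision node, and the leaf coefficients give the local mapping from features to output, which are the two elements an expert would inspect when a prediction looks suspicious.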

5.4 Weaknesses and limitations

While TRF claims to be topologically aware, some steps can seem counter-intuitive for simple cases. The cut rule at the median may be adjusted to better match the intuition of where the cut should be. Further work is still needed to build a good interpretation of such a model. Black-box aspects remain in TRF. Training on complex and voluminous data results in a deep TRF tree, which loses its simplicity and interpretability. For fixed hyper-parameters, the model size grows with the number of training data points, which means that the model can get quite big for large data sets, in addition to requiring a longer computation time.
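The growth of the model size can be illustrated with a small count. The stopping rule used here, a minimum number of points per leaf named `min_leaf_size`, is a hypothetical stand-in for the actual hyper-parameters, and every node is assumed to be split at the median until that rule applies.

```python
def count_leaves(n_points, min_leaf_size):
    """Number of leaves when halving at the median until a child would be too small."""
    half = n_points // 2
    if half < min_leaf_size:
        return 1  # splitting further would create a leaf below the minimum size
    return count_leaves(half, min_leaf_size) + count_leaves(n_points - half, min_leaf_size)


for n in (100, 1000, 2000, 100000):
    print(n, count_leaves(n, min_leaf_size=100))
```

Under these assumptions the number of leaves, and therefore the number of local linear models, grows roughly linearly with the number of training points, while the depth grows logarithmically.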


concrete use case where interpretation is well defined.

5.5 Future work

As several models are learned separately on the different regions, the global model can show discontinuities at the connection between one region and its neighboring regions. To mitigate this problem and make the global model smoother, margins can be introduced when splitting data at each decision node. Data points neighboring a region can be included in that region to achieve better coherence between neighboring leaf node models.
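The margin idea can be sketched on the projected scores of a decision node. The function name `split_with_margin` and the symmetric margin are assumptions made for illustration; points whose score lies within the margin of the cut are given to both children, so that neighboring leaf models share some training data.

```python
def split_with_margin(scores, cut, margin):
    """Return index lists for both children; points near the cut belong to both."""
    left = [i for i, s in enumerate(scores) if s <= cut + margin]
    right = [i for i, s in enumerate(scores) if s > cut - margin]
    return left, right


scores = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
left, right = split_with_margin(scores, cut=2.5, margin=1.0)
shared = sorted(set(left) & set(right))
print(left, right, shared)
```

With a margin of zero this reduces to the plain split. A larger margin increases the overlap and the training cost, so the margin would act as one more hyper-parameter to tune.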


Chapter 6 Conclusion

While a great part of research is dedicated to improving state-of-the-art algorithms such as deep neural networks, which demonstrate impressive accuracy, we focus on other aspects of machine learning algorithms such as interpretability and how much computation power is needed. The proposed approach combines widely used and interpretable methods such as PCA and linear regression inside a decision tree framework. By recursively reducing dimensions and splitting the data, TRF isolates regions of the space where the data is considered locally linear. It achieves performance comparable to other state-of-the-art algorithms on machine learning benchmark regression problems. The proposed algorithm aims at being a first step towards a high-performing method upon which interpretation can be easily built.

We anticipate that the interpretability of TRF’s internals can be used to reach interpretability of the global algorithm. An initial step of interpretation is tracking down an output through the tree, with the statistical significance of the aforementioned simple linear methods used at each node. Yet a clear specification of interpretability is still under construction in machine learning research, so that the term currently takes ad hoc, case-by-case meanings.


Bibliography

[1] J.D. Allen Smith. Audit Annually to Catch Bias in Artificial Intelligence. 2018. URL: https://www.shrm.org/resourcesandtools/legal-and-compliance/employment-law/pages/artificial-intelligence-diversity.aspx.

[2] David Bau et al. “Network Dissection: Quantifying Interpretability of Deep Visual Representations”. In: CoRR abs/1704.05796 (2017). arXiv: 1704.05796. URL: http://arxiv.org/abs/1704.05796.

[3] Vanya Van Belle and Paulo Lisboa. “Research directions in interpretable machine learning models”. In: European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. 2013.

[4] Richard E. Bellman. Adaptive Control Processes: A Guided Tour. Ed. by Richard E. Bellman. MIT Press, 1961.

[5] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006. ISBN: 0387310738.

[6] L. Breiman et al. Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks, 1984.

[7] Carla E. Brodley and Paul E. Utgoff. “Multivariate Decision Trees”. In: Mach. Learn. 19.1 (Apr. 1995), pp. 45–77. ISSN: 0885-6125.

[8] Krisztian Buza. “Feedback Prediction for Blogs”. In: Data Analysis, Machine Learning and Knowledge Discovery (2014), pp. 145–152.


[9] Luis Candanedo Ibarra, Veronique Feldheim, and Dominique Deramaix. “Data driven prediction models of energy use of appliances in a low-energy house”. In: Energy and Buildings 140 (Jan. 2017), pp. 81–97.

[10] Taufik Djatna. “Progress in Business Intelligence System research: A literature Review”. In: International Journal of Basic & Applied Sciences IJBAS-IJENS Vol: 11 (July 2011), p. 96.

[11] Finale Doshi-Velez and Been Kim. “Towards A Rigorous Science of Interpretable Machine Learning”. In: arXiv (2017). URL: https://arxiv.org/abs/1702.08608.

[12] Jerome H. Friedman. “Greedy Function Approximation: A Gradient Boosting Machine”. In: Annals of Statistics 29 (2000), pp. 1189–1232.

[13] Zhiqiang Ge et al. “Data Mining and Analytics in the Process Industry: The Role of Machine Learning”. In: IEEE Access 5 (2017), pp. 20590–20616.

[14] Georges Hebrail and Alice Berard. “Individual household electric power consumption”. In: UCI Machine Learning Repository (2012).

[15] Amirata Ghorbani, Abubakar Abid, and James Zou. “Interpretation of Neural Networks is Fragile”. In: AAAI Technical Track: Machine Learning (2019), pp. 3681–3688.

[16] Leilani Gilpin et al. “Explaining Explanations: An Overview of Interpretability of Machine Learning”. In: IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) (Oct. 2018), pp. 80–89.

[17] Satoshi Hara and Kohei Hayashi. “Making Tree Ensembles Interpretable: A Bayesian Model Selection Approach”. In: Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics. Vol. 84. Proceedings of Machine Learning Research. Playa Blanca, Lanzarote, Canary Islands: PMLR, 2018, pp. 77–85.


[19] Gareth James et al. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated, 2014.

[20] I. T. Jolliffe. Principal Component Analysis. Ed. by Springer-Verlag. Springer Verlag, 1986.

[21] Aram Karalic Jozef. “Employing Linear Regression in Regression Tree Leaves”. In: Proceedings of ECAI-92. John Wiley & Sons, 1992, pp. 440–441.

[22] Shahid Karim, Ye Zhang, and Asif Laghari. “Image Processing Based Proposed Drone for Detecting and Controlling Street Crimes”. In: IEEE, Oct. 2017, pp. 1725–1730.

[23] Been Kim et al. “Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)”. In: ICML. Proceedings of Machine Learning Research 80 (2018), pp. 2673–2682.

[24] David Kirk. “Demystifying intelligent automation: the layman’s guide to the spectrum of robotics and automation in government”. In: KPMG (2017), p. 1.

[25] J. Kittler. Feature selection and extraction. Elsevier, 1986.

[26] Zachary Chase Lipton. “The Mythos of Model Interpretability”. In: CoRR abs/1606.03490 (2016). arXiv: 1606.03490. URL: http://arxiv.org/abs/1606.03490.

[27] Weiwei Liu and Ivor Tsang. “Sparse Perceptron Decision Tree for Millions of Dimensions”. In: AAAI (Feb. 2016), pp. 1881–1887.

[28] Weiwei Liu and Ivor W. Tsang. “Making Decision Trees Feasible in Ultrahigh Feature and Label Dimensions”. In: Journal of Machine Learning Research 18.81 (2017), pp. 1–36.

[29] D. Douglas Miller and Eric W. Brown. “Artificial Intelligence in Medical Practice: The Question to the Answer?” In: The American Journal of Medicine 131.2 (2018), pp. 129–133. ISSN: 0002-9343.


[31] Anna Palczewska et al. “Interpreting Random Forest Classification Models Using a Feature Contribution Method”. In: vol. abs/1312.1121. Proceedings of Machine Learning Research. 2013.

[32] K. Pearson. “On Lines and Planes of Closest Fit to Systems of Points in Space”. In: Philosophical Magazine 2 (6 1901), pp. 559–572.

[33] Mohammad H. Rafiei and Hojjat Adeli. A Novel Machine Learning Model for Estimation of Sale Prices of Real Estate Units. Aug. 2015.

[34] Wayne Reeves. Learner-Centered Design: A Cognitive View of Managing Complexity in Product, Information, and Environmental Design. Sage Publications, Inc., 1999.

[35] Martin Sokalski and Kelly Combs. “Intelligent automation takes flight: risk and governance will help you safely land your automation goal”. In: KPMG (2017), p. 1.


TRITA EECS-EX-2020:58
