DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Cardinality estimation with a machine learning approach

OLLE FALGÉN ENQVIST


Cardinality estimation with a machine learning approach

OLLE FALGÉN ENQVIST

Degree Projects in Systems Engineering (30 ECTS credits)

Degree Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2020

Supervisor at Polystar: Björn Julander
Supervisor at KTH: Per Enqvist
Examiner at KTH: Per Enqvist


TRITA-SCI-GRU 2020:394 MAT-E 2020:099

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)

SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Cardinality estimation with a machine learning approach

Olle Falgén Enqvist
oenqvist@kth.se


Abstract

This thesis investigates how three different machine learning models perform on cardinality estimation for SQL queries. All three models were evaluated on three different data sets. The models were tested both on estimating cardinalities when the query only takes information from one table and on a two way join case. PostgreSQL's own cardinality estimator was used as a baseline. The evaluated models were: artificial neural networks, random forests and extreme gradient boosted trees. The model that performed best was the extreme gradient boosted tree with a Tweedie regression loss function. To the author's knowledge, this is the first time an extreme gradient boosted tree has been used in this context.


Kardinalitetsuppskattning med maskininlärning

Olle Falgén Enqvist
oenqvist@kth.se


Sammanfattning

Denna uppsats undersöker hur tre olika maskininlärningsmodeller presterar på kardinalitetsuppskattning för SQL-förfrågningar till en databas. Alla tre modeller utvärderades på tre olika datauppsättningar. Modellerna fick både behandla förfrågningar mot en tabell samt en sammanslagning mellan två tabeller. PostgreSQL:s egna kardinalitetsestimerare användes som referenspunkt. De utvärderade modellerna var följande: artificiella neurala nätverk, random forests och extreme gradient boosted trees. En slutsats var att den modell som utförde uppgiften bäst var extreme gradient boosted trees med en tweedie-regression som förlustfunktion. Såvitt författaren vet är det här första gången den här typen av extreme gradient boosted tree används på denna typ av problem.


Acknowledgements

I would like to thank my supervisor Björn Julander at Polystar for letting me write my thesis at the company. I would also like to thank Björn once again for giving me the reins to turn this project into what it is today.

I would also like to thank my supervisor Per Enqvist at KTH for giving me feedback on my thesis.

Lastly I would like to thank my family, friends and girlfriend for all the support.


Contents

1 Introduction
  1.1 Relational databases
  1.2 Query optimization
    1.2.1 Cardinality Estimation
    1.2.2 Cost Model
    1.2.3 Plan Space Enumeration
    1.2.4 Different kinds of joins
  1.3 Goal and scope

2 Theory
  2.1 Machine Learning
  2.2 Supervised Learning
  2.3 Artificial Neural Networks
    2.3.1 Activation Functions
    2.3.2 Loss Functions
    2.3.3 Dense artificial neural networks
    2.3.4 Training of a neural network
    2.3.5 Regularization
  2.4 XGBoost and Random forests
    2.4.1 Regression trees
    2.4.2 Random forest
    2.4.3 Gradient Boosting
    2.4.4 XGBoost
  2.5 K-Fold Cross-validation
  2.6 Hyperparameter Optimization
    2.6.1 Grid Search
    2.6.2 Random Search
  2.7 Defining the estimation error

3 Method
  3.1 Experimental Setup
  3.2 Data used
    3.2.1 Shifted normal
    3.2.2 Department of Motor Vehicles
    3.2.3 IMDB
  3.3 Generating queries
    3.3.1 Scanning one table
    3.3.2 Two way joins
  3.4 Encoding sql-strings
  3.5 Evaluating the vectors with corresponding cardinality
  3.6 Cases
    3.6.1 One table scan
    3.6.2 Two way joins
  3.7 Hyperparameters
  3.8 Used loss functions and splitting criteria

4 Results
  4.1 One table scan
  4.2 Two way joins

5 Discussion
  5.3 Final thoughts on the models
  5.4 Future work

6 Conclusion

A Plots shifted normal
B Plots IMDB
C Plots DPMV


1 Introduction

This section gives an introduction to relational databases and query optimization, and describes the scope and aim of the study.

1.1 Relational databases

It is hard to come up with a business that does not use some sort of database. The purpose of a database is to store data about things of interest and, just as importantly, the relations between different pieces of data.

Say that you want to relate one of your customers to the cellphone they use; then you need a relation between customer and cellphone [16]. A database is built from tables containing the data you want to store: actors, customers, running trails, and so on. Each table in a database has a unique name and is made up of rows and columns [11].

In the 1970’s a paper called ”A relational model of data for large shared data banks” was released. In this paper the author stresses the benefits of storing the data in a relational type of way, the author called the language SEQUEL. Based on this IBM started to develop a language that almost all databases today shares:

Structured Query Language (SQL)[9]. In the beginning SQL was just used for IBM to do database operations in their own databases. During the 1979’s SQL was made public when the software company Relational Software (today Oracle) released their own version of SQL. Over the years more and more Relational Database Management Systems (RDBMS) have been released. A few examples of RDBMS used today are:

• MySQL

• PostgreSQL

• Microsoft SQLServer

• Oracle

• ClickHouse

SQL offers many operations that can be used in a database. One can create databases or objects inside a database, delete objects and databases, insert values, retrieve information and much more. A further description of what one can do in SQL can be found in the book "Learn SQL Database Programming" [9].

1.2 Query optimization

Query optimization for relational databases has been studied extensively for a long time [33]. First of all, it is important to define what a query is: a query is a request for data from a database [31]. The request can concern anything the database contains, for example all passengers above the age of 65 (if it is a database for an airline company). A good query optimizer should return the data of interest as fast as possible. Query optimization is traditionally divided into three parts, where each part is important in order to get a good query plan [40, 39]:

1. Cardinality estimation

2. Cost model

3. Plan space enumeration

As mentioned above, a query is a request to get data from a database. A good way to illustrate this is through an example. Say that we have three different tables: Q, W and E. The goal of the query is to select all values from Q, W and E where Q and W share the same values in column a1 and Q shares the same values in column a2 with E. The following SQL statement will produce this query:

SELECT * FROM Q, W, E WHERE
Q.a1 = W.a1 AND Q.a2 = E.a2;


This simple example query would produce a three-way join between the tables Q, W and E. Since Q shares values with both W and E, this results in two possible join orders, see Figure 1.


Figure 1: Resulting potential joins for Q, W and E

Because of the two possible join orders it is crucial to find which combination is the most efficient one. To find the optimal query plan, standard query optimization proceeds in three parts [40], see Figure 2.

Figure 2: Classical query optimization consisting of cardinality estimation, cost model and plan space enu- meration

1.2.1 Cardinality Estimation

Cardinality estimation is a critical part of query optimization. The estimation is so important because it approximates how many rows need to be processed at each step of the query execution [22]. Different databases use different cardinality estimators, and a good cardinality estimate will be close to the true cardinality [40]. If we once again take a look at the example query, we want to choose the query plan with the lowest cardinality. The example query gives two different cardinalities, see Table 1.

Join     True cardinality   Estimate
Q ⋈ W    6                  5
Q ⋈ E    8                  7

Table 1: Example of true cardinalities and estimates for the example query.

In this case the optimizer should execute the query plan where Q ⋈ W is joined first, since the estimated cardinality for this join is the lowest one. Because of this it is important that the estimated cardinality is as close to the true value as possible; otherwise a sub-optimal query plan might be chosen.


Every RDBMS has a cardinality estimator in order to plan queries as well as possible. These estimators produce good results overall, but they do have some problems, especially when it comes to queries with complex predicates and correlation between columns. Traditional cardinality estimators tend to use histograms and other statistical approaches to find the cardinality estimates [34].

1.2.2 Cost Model

The cost model guides the plan towards an optimal solution through the search space. These algorithms are usually sophisticated and the result of many years of fine tuning by many engineers. The goal of the cost model is to find which query plan has the lowest cost, which results in quicker query execution [39].

1.2.3 Plan Space Enumeration

The last part of query optimization is the plan space enumeration. Plan space enumeration evaluates all possible join orders that can be used to obtain the data the user wants to fetch. For Table 1, it is the plan space enumeration's job to choose the right query plan of the two listed [39].

1.2.4 Different kinds of joins

There are different kinds of joins in the SQL language. Suppose you have two tables Q and W and you want to extract data from both of them; then one can use different join operators. Commonly used joins are [8]:

1. An inner join returns all rows that are shared between tables Q and W.

2. A right join returns all rows in W and all common rows in Q.

3. A left join is the same as a right join but the other way around; thus, a left join returns all rows in Q and all common rows in W.

4. A full join returns all rows from both Q and W.

Figure 3: A visual representation of different joins.

1.3 Goal and scope

The goal of this thesis is to construct a cardinality estimator that can produce good cardinality estimates.

Why cardinality estimation, and not the cost model or plan space enumeration, one might ask? Because cardinality estimation has been described by Guy Lohman as "The root of all evil, the Achilles Heel of query optimization, is the estimation of the size of intermediate results, known as cardinalities" [8]. The developed cardinality estimators will be compared to PostgreSQL's own cardinality estimator for different queries and datasets.

This thesis takes a machine learning approach to the problem of cardinality estimation, since many RDBMS users have a fairly constant type of data in their databases; because of this, an ML approach might be beneficial since it can learn the distribution of the data in the database. The following machine learning methods were evaluated:


• Neural Networks

• XGBoost

• Random Forest

This thesis is limited to the accuracy of the cardinality estimates, because query optimization is a big and complex problem and constructing a whole query optimizer is out of scope for this thesis.

2 Theory

As mentioned in the introduction, this project takes a machine learning approach. Three types of ML approaches were considered: XGBoost, artificial neural networks and random forests. Neural networks are often viewed as the best machine learning approach due to their ability to deal with complexity. However, tree-based models such as XGBoost perform well on many different types of problems and win many Kaggle competitions [4] (Kaggle is a website that hosts machine learning competitions).

This section covers XGBoost, ANNs and random forests, as well as the concepts of validation and hyperparameter tuning, and how to judge whether an estimate is good or bad.

2.1 Machine Learning

Machine learning is an area of computer science where one tries to construct algorithms that improve with experience. The foundations of machine learning date back to the 18th and 19th centuries, and in the 1950s the field gained more attention with Alan Turing's paper "Computing Machinery and Intelligence" [12].

The basics of machine learning are the following:

1. Data input: past or new data that is used for future classification / prediction.

2. Abstraction: if the original input data cannot be used directly to improve the results, it needs to be converted into a form that can be used in a model.

3. Generalization: based on the abstraction, the model needs to be able to generalize, which means being able to use the knowledge gained in future predictions.

In machine learning there is no "one size fits all": just because a method works well on one data set does not mean that it will work well on another. Choosing the best method for a problem might be one of the hardest tasks in machine learning [20].

The errors of a machine learning model come from different sources: bias, variance and irreducible error. The irreducible error comes from variability in the system that the model cannot get rid of, so it is usually not taken into consideration when evaluating a model [13]. Bias comes from describing something complex with a model that is too simple to capture all the variability in the variable of interest; a high bias indicates underfitting. High variance is when the target function changes a lot with different training sets, which is an example of overfitting [20].

2.2 Supervised Learning

Supervised learning is used when one wants to gain insight from labeled data, i.e. data with known target values. The input data X is usually referred to as the features, and the output variables Y are usually referred to as the target variables. In supervised learning each example is a pair of input data X and desired output Y. The goal of supervised learning is to learn the connection between the features and the target variables, such that a subset x_j ∈ X and y_j ∈ Y can be used to create a function f(x) that predicts unseen target variables y ∈ Y from unseen features x ∈ X; hence it is important that the function f(x) generalizes well [29].


2.3 Artificial Neural Networks

Artificial neural networks (ANNs) have been successful in various fields, both in science and in industry. They are used for both classification and regression problems. An ANN is built from neurons, which are supposed to loosely mimic the way the human brain gathers knowledge. ANNs are suitable for high-dimensional data with large sample sizes [36]. To understand ANNs one has to grasp a few concepts, such as:

• Activation functions

• Loss functions

• Weights

• Connections

• Hidden layers

• Training ANNs

2.3.1 Activation Functions

Neural networks use activation functions in their neurons to determine whether a neuron should fire or remain passive; the activation function also performs a non-linear transformation of the input data. This allows the network to learn complex non-linear functions [10].

There are many different kinds of activation functions; a few examples are listed in Table 2.

Activation function   Formula
Softplus              $\ln(1 + e^x)$
ReLU                  $\max(0, x)$
Sinusoid              $\sin(x)$

Table 2: Examples of activation functions, where x represents the input value to the neuron.

In this project, ReLU was used for all neurons. The ReLU function can also be written as:

f(x) = \begin{cases} 0, & \text{if } x < 0 \\ x, & \text{otherwise} \end{cases}    (1)

As one can see, the ReLU function is zero for x < 0 and linear otherwise. ReLU has been widely used in many applications since its introduction in 2010. The activation function is popular for multiple reasons [10], for example:

• Fast computations - no divisions / exponential calculations

• Eliminating the vanishing gradient problem

• Easy to understand

• Generalizes well

In regression problems, which this thesis deals with, the output neuron usually has a linear activation function since the output is a numerical value. However, a cardinality cannot be negative; thus, the last neuron also uses a ReLU activation function.

2.3.2 Loss Functions

A loss function, also called an objective function, is used to indicate how well a model performs on the training data, where the labels are known. The loss function is optimized during the training of the model and quantifies the difference between the predicted value and the ground truth [21]. Thus, in each iteration the neural network tries to improve on the loss function. Just as with activation functions, there are many different variants of loss functions. A few examples can be seen in Table 3.

Loss function   Formula
MSE             $\frac{1}{N}\sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2$
MAE             $\frac{1}{N}\sum_{i=1}^{N}|Y_i - \hat{Y}_i|$
RMSE            $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2}$

Table 3: Examples of loss functions.

Based on the goal of the network, one has to carefully choose the type of loss function that suits the problem best.

2.3.3 Dense artificial neural networks

Dense ANNs are a type of neural network where every neuron in the previous layer is connected to every neuron in the next layer. The number of layers can differ from model to model. A common concept in ANNs is the hidden layer, which is a layer that lies between the input and output layers of a model [18]. In Figure 4 an example of an ANN can be seen.

Figure 4: Neural network with one hidden layer

What an ANN aims to achieve is to transform the input data into a desired output value; the output can be either a vector or a single number. Since this thesis deals with cardinalities, the desired output is a cardinality, which is a single number, hence there is only one output node. The output from one neuron acts as the input for the next [21]. To clarify this, let's use the ANN from Figure 4 as an example.

The ANN in Figure 4 has one hidden layer. It has r input nodes X = [x_1, x_2, ..., x_r]^T, m = 1, ..., r, one hidden layer of t hidden nodes Z = [z_1, z_2, ..., z_t]^T, j = 1, ..., t, and one output node Y. The output values of the x_m act as the input values of the hidden layer, X → Z, where β_{mj} represents the weight between X and Z. The weight multiplies the input value and thereby transforms it. β_{0j} represents the bias; each node in an ANN has a bias, allowing it to learn affine transformations of the output value from a neuron. So the input to the activation function is the bias plus the weights times the features, and its output becomes the input to the next layer. Thus, Z acts as the input between the model's hidden layer and the output node. Let λ_{jk} be the weight between Z → Y and λ_{0k} the bias [18]. Now, let V_j represent the transformation made from the input layer to the hidden layer:

V_j = \beta_{0j} + X^T \beta_j    (2)

Then, the output from Z is given by:

Z_j = \omega_j(V_j)    (3)

where ω_j represents the activation function in the hidden layer's nodes. Also, let W_k represent the transformation from the hidden layer to the output node:

W_k = \lambda_{0k} + Z^T \lambda_k    (4)

Then, the output from the last node is given by:

\mu_k(X) = \Omega_k(W_k)    (5)

Now, with β_j = [β_{1j}, β_{2j}, ..., β_{rj}]^T and λ_k = [λ_{1k}, λ_{2k}, ..., λ_{tk}]^T, putting equations (3) and (5) together yields the following expression for μ_k(X):

\mu_k(X) = \Omega_k\left(\lambda_{0k} + \sum_{j=1}^{t} \lambda_{jk}\, \omega_j\left(\beta_{0j} + \sum_{m=1}^{r} \beta_{mj} X_m\right)\right)    (6)

Equation (6) is the expression for the output from the last node in the ANN from Figure 4. As mentioned above, the network tries to approximate the target variable Y. The expression for Y_k is given by:

Y_k = \mu_k(X) + \varepsilon    (7)

where ε is the error term, which can be viewed as a normally distributed value ε ∼ N(0, σ²).

For more than one hidden layer, equation (6) can be rewritten as:

\mu(X) = \Omega\left(\lambda_0 + A\,\omega(\beta_0 + \beta X)\right)    (8)

where β = [β_{ij}] is an [s × t] matrix that represents the weights between the input nodes and the hidden layer, A is an [s × r] matrix consisting of the weights between the hidden layer and the output neurons, Ω ∈ R^s and ω ∈ R^r are vectors holding the activation functions, and λ_0 ∈ R^s and β_0 ∈ R^r are vectors of biases [18].
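To make equation (6) concrete, the following is a minimal NumPy sketch of the forward pass of a dense network with one hidden layer and a single output node, as in Figure 4. The layer sizes, the ReLU hidden activation and the random weights are illustrative assumptions, not the exact network used in this thesis.

import numpy as np

def relu(x):
    # ReLU activation, equation (1)
    return np.maximum(0.0, x)

def forward(X, beta0, beta, lam0, lam):
    """Forward pass of a one-hidden-layer dense network, equation (6).

    X:     input vector, shape (r,)
    beta0: hidden-layer biases, shape (t,)
    beta:  input-to-hidden weights, shape (r, t)
    lam0:  output bias (scalar)
    lam:   hidden-to-output weights, shape (t,)
    """
    V = beta0 + X @ beta          # equation (2)
    Z = relu(V)                   # equation (3), omega = ReLU
    W = lam0 + Z @ lam            # equation (4)
    return relu(W)                # equation (5), output kept non-negative

# Illustrative example: r = 4 input features, t = 8 hidden nodes (assumed sizes).
rng = np.random.default_rng(0)
r, t = 4, 8
X = rng.normal(size=r)
y_hat = forward(X, rng.normal(size=t), rng.normal(size=(r, t)),
                0.0, rng.normal(size=t))
print(y_hat)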

2.3.4 Training of a neural network

To increase performance, the ANN has to train on the data to learn its underlying patterns. When an ANN "learns", it finds the set of biases and weights that makes the function approximate the target variable as well as possible. To initialize the training process the ANN makes an initial guess of the weights and biases; usually this results in a bad prediction, and this is where the loss function comes into play. The loss function calculates the error of the first prediction and can tell, based on the error, whether the prediction is good or bad.

So, how does the ANN improve from the indication of whether a prediction is good or bad? The answer is gradient descent, which tries to find a solution to the unconstrained optimization problem:

\min_{\Theta} f(\Theta) = f(\Theta^*)    (9)

where Θ represents the vector of all weights and biases that determine the loss function f(Θ), and Θ* represents the optimal set of weights and biases. Since this is a nonlinear optimization problem there exist global and local minima. Gradient descent helps explore the parameter space to minimize equation (9); this puts an important constraint on the loss function: it has to be differentiable.
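As an illustration of equation (9), the sketch below runs plain gradient descent on a simple differentiable loss. The quadratic toy loss, the learning rate and the number of iterations are assumptions chosen only to show the update rule; in an ANN the gradient would be computed by backpropagation over all weights and biases.

import numpy as np

def gradient_descent(grad, theta0, learning_rate=0.1, n_iter=100):
    # Repeatedly step in the negative gradient direction of f(theta).
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        theta = theta - learning_rate * grad(theta)
    return theta

# Toy loss f(theta) = ||theta - [1, -2]||^2 with gradient 2 * (theta - [1, -2]).
target = np.array([1.0, -2.0])
grad = lambda theta: 2.0 * (theta - target)
theta_star = gradient_descent(grad, theta0=[0.0, 0.0])
print(theta_star)  # close to [1, -2]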

2.3.5 Regularization

A common problem that occurs with ANNs is overfitting. A way to detect this is if the model performs well on the training data but poorly on the test set. The reason the model performs so well on the training data, but not on the test set, is that the network has learned the noise and distributions of the training data to such an extent that it cannot handle unseen data [17]. There exist different methods to address this problem, for example dropout.

Dropout can be viewed as a method where the prediction of an ANN is a product of many different models. This is achieved by adding dropout probabilities between layers in the network. Dropout randomly turns off neurons in the network, so the next node cannot rely on the input from one single previous node [14]. When talking about "dropout" one means that all the connections to the neuron, as well as the connections from the neuron to the next ones, are cut off. This procedure is effective because in the training phase the ANN cannot depend on one single connection, but when the network is tested it can utilize all the neurons in the network [30]. In Figure 5 an example of dropout can be seen.

Figure 5: Simple example of dropout, where 1 represents the network before dropout and 2 after dropout. The grey neurons indicate that they are cut off.
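The snippet below is a small Keras sketch of a dense regression network with dropout between layers, in the spirit of Figure 5. The input size, layer widths, dropout rate and optimizer are illustrative assumptions and not the hyperparameters used in the thesis.

import tensorflow as tf
from tensorflow.keras import layers

# Dense regression network with dropout regularization (assumed sizes).
model = tf.keras.Sequential([
    layers.Input(shape=(10,)),           # 10 input features (assumption)
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),                 # randomly drops 20% of the neurons during training
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="relu"),  # single non-negative cardinality output
])
model.compile(optimizer="adam", loss="mse")
model.summary()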

2.4 XGBoost and Random forests

To be able to understand XGBoost and random forests one has to grasp the concept of regression trees and gradient boosting first.

2.4.1 Regression trees

Regression and classification trees were introduced in 1984 by Breiman. These methods are based on a "divide and conquer" strategy. The classification and regression tree (CART) approach is based on binary trees; through divide and conquer the training set is partitioned, divided, solved and compiled [23]. In Figure 6 a visual representation of a simple regression tree can be seen.


Figure 6: A simple visual representation of a regression tree.

In regression trees the target variable takes ordered values in the end nodes and a regression model is fitted to these nodes [25]. What happens in figure 6 is that the algorithm first checks the value of x1 and based on that evaluation, either returns a result or investigates the value of x2.

2.4.2 Random forest

Regression trees provide good results but have one major drawback, to quote: "Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy". They work well with the data that was used to create the trees but fail to predict new, unseen data [38]. There are methods to work around this; one is called bagging and the other is called boosting. Boosting is covered further down in the thesis.

Random forests (RF) use bagging to improve on the predictive power of regression trees. A random forest combines a number of independent trees that each predict the output value. The final output is determined by a majority vote over the trees for classification, and by the average output of all the trees for regression [24]. In Figure 7 a simple example of a RF can be seen.

Figure 7: A RF consisting of two trees.

The algorithm to create a RF goes as follows. Assume one has a set of labeled variables D = {(X_1, Y_1), ..., (X_n, Y_n)}, where X are the features, Y the target variable, D the space that contains these pairs and Z the number of trees one wants to create. For each tree a bootstrap sample is drawn from D, and at each split only a random subset of the features is considered when predicting the output value Y.

So, based on how many trees one wants to create for the RF, the algorithm works through them as described above, splitting based on some criterion [24].

An important parameter in RFs is the number of trees. One way to measure how well the model generalizes is the out-of-bag error (OOBE). When a RF creates its trees, each tree typically uses around 2/3 of the data; the remaining data that is not used to fit the tree is called the out-of-bag data. The out-of-bag data can afterwards be run through the trees in the forest to see if these samples generate the correct predictions [20].
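As a concrete illustration, the sketch below fits a random forest regressor with scikit-learn and reads out the out-of-bag score. The synthetic data, the number of trees and the other settings are assumptions for demonstration only, not the configuration used in the thesis.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption, only for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=1000)

# Bagging: each tree is fit on a bootstrap sample of the rows,
# and only a random subset of features is considered at each split.
forest = RandomForestRegressor(
    n_estimators=200,      # number of trees Z
    max_features="sqrt",   # random feature subset per split
    oob_score=True,        # evaluate on the out-of-bag samples
    random_state=0,
)
forest.fit(X, y)
print("OOB R^2:", forest.oob_score_)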

2.4.3 Gradient Boosting

Gradient boosting (GB) is a technique used in machine learning for both regression and classification. The purpose of gradient boosting is to take weak learners (such as regression and classification trees) and from these create a strong learner in an iterative procedure. Gradient boosting was introduced in 1999 by Jerome H. Friedman in the paper "Stochastic Gradient Boosting" [3]. The GB procedure goes as follows.

For any input data D with N samples and m features [37]:

D = \{(x_i, y_i)\}_{i=1}^{N} \quad (|D| = N,\; x_i \in \mathbb{R}^m,\; y_i \in \mathbb{R})    (10)

where x_i refers to the variables that predict the variable of interest y_i. What is also needed is a differentiable loss function:

L(y_i, F(x))    (11)

where y_i is the observed value at i and F(x_i) is the estimated value. In order to start the iterative process an initial guess has to be made, given by:

F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)    (12)

where γ is an initial guess for the first prediction. This is an easy equation to solve since we know the loss function is differentiable. Next, one computes the derivative of the loss function w.r.t. the predicted value:

r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} \quad \text{for } i = 1, \ldots, N    (13)

The r_{im} are called the pseudo residuals, where i is the sample number and m is the tree that is being built. The next step is to fit a regression tree to the residuals using the features, creating terminal regions R_{jm} for j = 1, ..., J_m. By terminal regions one means the leaves of the tree, where j is the number of the leaf and m is the number of the tree.

In the next step the algorithm determines the output value of each leaf:

\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma) \quad \text{for } j = 1, \ldots, J_m    (14)

where x_i ∈ R_{jm} indicates which x's are used to calculate the output of leaf R_{jm}. After this, the model is updated according to:

F_m(x) = F_{m-1}(x) + p \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})    (15)

where p is the learning rate, a value between 0 and 1. A small learning rate reduces the effect each tree has on the final prediction. The summation in the equation above says that all γ_{jm} should be added for all regions R_{jm} in which the feature vector x can be found.

This is how gradient boosting works; the procedure is performed for m = 1, ..., M depending on how many trees one wants to create.
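The steps in equations (10)–(15) can be written out directly for the squared-error loss, where the pseudo residuals r_im reduce to y_i − F_{m−1}(x_i) and each leaf output γ_jm is the mean residual in that leaf. The following is a minimal from-scratch sketch under those assumptions, using small regression trees from scikit-learn as the weak learners; it illustrates the procedure and is not meant to match any particular library implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=2):
    """Gradient boosting with squared-error loss (illustrative sketch)."""
    F0 = float(np.mean(y))        # equation (12): the mean minimizes the squared loss
    F = np.full(len(y), F0)
    trees = []
    for _ in range(n_trees):
        residuals = y - F         # equation (13): pseudo residuals for squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)    # terminal regions R_jm; leaf values are mean residuals, eq. (14)
        F = F + learning_rate * tree.predict(X)   # equation (15), p = learning_rate
        trees.append(tree)
    return F0, trees

def gradient_boost_predict(X, F0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], F0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Toy usage on assumed data.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
y = 10.0 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=500)
F0, trees = gradient_boost_fit(X, y)
print(gradient_boost_predict(X[:5], F0, trees))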


2.4.4 XGBoost

Everything stated in this section is taken from [37] and [5] unless otherwise mentioned.

eXtreme Gradient Boosting (XGBoost) was released in March 2014 by Tianqi Chen and Carlos Guestrin. Since its release, XGBoost (XGB) has become a popular method for both regression and classification. The technique is built on the same foundation as GB but differs on a few important points. For one, XGB parallelizes the construction of each tree (the search for the best splits is performed in parallel), whereas classical GB implementations build each tree sequentially; this results in faster training times for XGB.

Just like in GB, XGB has to make an initial prediction. This prediction is usually 0.5 for classification and the average of the target variables for regression.

XGB utilizes a loss function L(y_i, F(x_i)). This loss function is used to minimize the following expression:

\left[\sum_{i=1}^{N} L(y_i, F(x_i))\right] + \gamma T + \frac{1}{2}\lambda \Omega_{val}^2    (16)

where T is the number of leaves in a tree and γ is a user-defined penalty that encourages pruning. The term \frac{1}{2}\lambda \Omega_{val}^2 is a regularization term, where Ω_val is the output value of a leaf and λ is a scaling parameter. The parameter γ is covered further down in the thesis since it is used once the whole tree is built. Thus, for now we consider:

\left[\sum_{i=1}^{N} L(y_i, F(x_i))\right] + \frac{1}{2}\lambda \Omega_{val}^2    (17)

Now one wants to find the output value Ω_val that minimizes the loss function plus the regularization:

\arg\min_{\Omega_{val}} \left[\sum_{i=1}^{N} L(y_i, F(x_i) + \Omega_{val})\right] + \frac{1}{2}\lambda \Omega_{val}^2    (18)

To solve this equation XGB uses a second-order Taylor polynomial:

L(y_i, F(x_i) + \Omega_{val}) \approx L(y_i, F(x_i)) + \left[\frac{d}{dF(x_i)} L(y_i, F(x_i))\right]\Omega_{val} + \frac{1}{2}\left[\frac{d^2}{dF(x_i)^2} L(y_i, F(x_i))\right]\Omega_{val}^2    (19)

The first derivative of the loss is called the gradient (g) and the second derivative is called the Hessian (h); thus, equation (19) can be rewritten as:

L(y_i, F(x_i) + \Omega_{val}) \approx L(y_i, F(x_i)) + g\,\Omega_{val} + \frac{1}{2} h\,\Omega_{val}^2    (20)

Since L(y_i, F(x_i)) has no effect on the optimal output value, it can be removed from equation (20):

L(y_i, F(x_i) + \Omega_{val}) \approx g_i \Omega_{val} + \frac{1}{2} h_i \Omega_{val}^2 \quad \text{for } i = 1, \ldots, N    (21)

Now, combine all squared Ω_val terms and all non-squared terms, take the derivative w.r.t. the output value, set the derivative to zero and solve for Ω_val:

\frac{d}{d\Omega_{val}}\left[(g_1 + \cdots + g_N)\,\Omega_{val} + \frac{1}{2}(h_1 + \cdots + h_N + \lambda)\,\Omega_{val}^2\right] = 0    (22)

Solving equation (22) gives the optimal output value for the leaf. The gradients and Hessians of the loss function, g_i = \frac{d}{dF(x_i)} L(y_i, F(x_i)) and h_i = \frac{d^2}{dF(x_i)^2} L(y_i, F(x_i)), are then plugged in, and the output value of a leaf is given by:

\Omega_{val} = -\frac{g_1 + \cdots + g_N}{h_1 + \cdots + h_N + \lambda}    (23)

Then, the similarity score is given by:

S = \frac{(g_1 + \cdots + g_N)^2}{h_1 + \cdots + h_N + \lambda}    (24)

XGB utilizes pruning, and this is where the similarity score comes into play. XGB calculates a gain as:

G = S_{left} + S_{right} - S_{root}    (25)

where S_left is the similarity score of the left branch, S_right the similarity score of the right branch and S_root the similarity score of the root. In Figure 8 the similarity scores can be seen in a tree structure.

Figure 8: Visual representation of S_root, S_left and S_right, where root holds the similarity score of the root, left the similarity score of the left leaves and right the similarity score of the right leaves.

XGB prunes the tree if the gain minus the user-defined parameter γ is negative, and does not prune if it is positive. Looking at equation (24), it becomes clear that λ is a parameter that encourages pruning, since the gain becomes smaller for higher values of λ.

As mentioned above, XGB utilizes a loss function. There exist many different loss functions for regression, for example:

Loss function   Formula
MSE             $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
MAE             $\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
Tweedie         $\sum_{i=1}^{n}\left(-y_i\,\frac{\hat{y}_i^{\,1-p}}{1-p} + \frac{\hat{y}_i^{\,2-p}}{2-p}\right)$

Table 4: Examples of loss functions for XGBoost, where $\hat{y}_i$ represents the estimated value, $y_i$ the true value and p is a hyperparameter that controls the variance of the Tweedie distribution [27].
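Since the best-performing model in this thesis is an extreme gradient boosted tree with a Tweedie loss, the sketch below shows how such a model can be set up with the xgboost Python package. The toy data, the variance power p = 1.5 and the other hyperparameters are illustrative assumptions; the values actually used in the thesis were selected through hyperparameter tuning.

import numpy as np
import xgboost as xgb

# Assumed toy data: feature vectors (e.g. encoded queries) and non-negative targets (cardinalities).
rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 8))
y = rng.poisson(lam=50 * X[:, 0] + 5)

model = xgb.XGBRegressor(
    objective="reg:tweedie",      # Tweedie regression loss (Table 4)
    tweedie_variance_power=1.5,   # hyperparameter p in (1, 2), assumed value
    n_estimators=300,
    learning_rate=0.1,
    max_depth=6,
    reg_lambda=1.0,               # lambda in the regularization term
    gamma=0.0,                    # gamma, the pruning penalty
)
model.fit(X, y)
print(model.predict(X[:5]))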


2.5 K-Fold Cross-validation

To be able to say whether a model performs well, it is important to validate it in some way; one way to do this is to split the data into a training set and a test set. Say that a dataset on which an ML model should be trained and tested looks like this:

Figure 9: Example of a dataset divided into a training part and test part where the red represents testing and green training.

This single training / test split does not address the potential randomness in the quality of the model caused by the specific split. A good approach to neutralize this randomness is to repeatedly draw samples from the set and perform training / testing on different parts of the dataset to see if the model is robust.

One approach to deal with the randomness problem is k-fold cross-validation (CV). In CV the data is divided into k folds of equal size. The first fold is used as the test set and the remaining k−1 folds are used for training; this procedure is then iterated k times, with the training / test sets changing for each run [20]. In Figure 10 one can see a visual representation of CV on a dataset.

Figure 10: Graphic visualisation of k-fold cross-validation with k = 5. Green represents training and red testing.

As mentioned above, the validation is performed k times. Each run generates a test error based on some defined error Err, so the example above would generate five errors Err_1, Err_2, ..., Err_5, and the average is given by:

CV_{(k)} = \frac{1}{k}\sum_{j=1}^{k} Err_j    (26)
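The sketch below shows k-fold cross-validation with scikit-learn, computing the average in equation (26) over the k folds. The choice of k = 5, the model and the scoring metric are assumptions for illustration only.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
# One error per fold (negative MSE, since scikit-learn maximizes scores).
scores = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=0),
                         X, y, cv=kfold, scoring="neg_mean_squared_error")
cv_error = -scores.mean()   # CV_(k) in equation (26)
print(cv_error)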

2.6 Hyperparameter Optimization

Hyperparameter optimization is a crucial part of machine learning. The whole goal of a machine learning model is for the software to learn the structure of the data in order to be able to make classifications / predictions based on the data. The difference between the first-level parameters and the tuning parameters is that the first-level ones are learned in the training phase, while the tuning parameters need to be carefully evaluated and tuned by the one who runs the ML software. It is important to get the tuning right, since this is not something the model can handle on its own [35].

The ultimate objective of a learning algorithm Ω is to find a function ω that minimizes some expected loss F(x; ω) over samples x drawn from a distribution Θ_x. This is done by letting the learning algorithm Ω map a subset of Θ_x, X_train, to a function ω. The learning algorithm Ω produces, through a training objective, some parameters ξ (which are learned by the model) in order to best perform the task at hand. However, each model also has hyperparameters ζ, which are a subset of all possible hyperparameters Z [19]. Examples of hyperparameters are the depth and learning rate of a neural network. What one wants to do in practice is to find hyperparameters ζ such that the generalization error is minimized:

\mathbb{E}_{x \sim \Theta_x}\left[F(x; \Omega_{\zeta}(X_{train}))\right]    (27)

In order to do so, one has to find a solution to the following equation:

\zeta^* = \arg\min_{\zeta \in Z} \; \mathbb{E}_{x \sim \Theta_x}\left[F(x; \Omega_{\zeta}(X_{train}))\right]    (28)

where ζ* is the optimal choice of hyperparameters. Equation (28) cannot be solved directly for several reasons, a few of which are:

• There is no efficient algorithm for solving the optimization problem in the equation.

• The expectation is taken over the unknown distribution Θ_x.

Because of this it is important to be able to evaluate different sets of hyperparameters through other techniques [19]. In the following two subsections two such methods are covered.

2.6.1 Grid Search

Grid search is the "traditional" hyperparameter tuning approach in machine learning. When using grid search, you specify the different values of the hyperparameters you want to test for the model, and then all possible combinations of these values are evaluated to find the combination that produces the best results. A problem with grid search is that it quickly produces a high-dimensional search space, which leads to long computation times [32].

2.6.2 Random Search

In random search, not all possible combinations are tried. The difference between random and grid search is that in random search the user sets bounds on the hyperparameters of interest and a maximum number of iterations; the hyperparameter space is then sampled randomly until the iteration budget is exhausted [32].
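As an illustration of the two tuning strategies, the sketch below compares GridSearchCV and RandomizedSearchCV from scikit-learn on a random forest. The parameter grid, the sampling distributions and the budget of 20 iterations are assumptions, not the search spaces used in the thesis.

import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=400)

# Grid search: every combination in the grid is evaluated with cross-validation.
grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    param_grid={"n_estimators": [50, 100, 200],
                                "max_depth": [4, 8, None]},
                    cv=3)
grid.fit(X, y)

# Random search: 20 random draws from the specified bounds/distributions.
rand = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                          param_distributions={"n_estimators": randint(50, 300),
                                               "max_depth": randint(2, 12)},
                          n_iter=20, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)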

2.7 Defining the estimation error

To determine whether an estimated value is good or bad, one has to define some sort of evaluation metric. There are many ways of doing so, such as the mean squared error (MSE) and the mean absolute error (MAE). However, these metrics might not be well suited for cardinality estimation. A good way to measure cardinality estimation performance was investigated in [15], where the q-error was constructed. The above-mentioned evaluation metrics can be seen in Table 5.

Evaluation metric   Formula
MSE                 $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
MAE                 $\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
Q-error             $\sum_{i=1}^{n}\frac{\max(y_i, \hat{y}_i)}{\min(y_i, \hat{y}_i)}$

Table 5: Examples of evaluation metrics, where $\hat{y}_i$ represents the estimated value and $y_i$ the true value.

The q-error is superior to other ways of measuring cardinality estimation performance [15]. The q-error is given by the ratio of the estimated cardinality and the true cardinality:

q(card_{true}, card_{est}) = \frac{\max(card_{true}, card_{est})}{\min(card_{true}, card_{est})}    (29)

What one can see from the equation above is that if the estimate is perfect the resulting q-error is equal to one, and it can never go below this value. However, if the estimate is bad, then q → ∞. Why the q-error works well can be described through two examples:

Example 1: Say that a model estimates a cardinality card_est = 2 and the true cardinality is card_true = 1; this results in a q-error of 2.

Example 2: In this example the estimate is card_est = 100 and the true cardinality is card_true = 101; this generates a q-error of 1.01.

The q-error is a good choice when it comes to cardinality estimates, since one obviously wants to get as close as possible to the true cardinality, but the estimate needs to be "punished" when it is off by a large percentage from the true cardinality.
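The q-error is a one-line function; the following minimal sketch reproduces the two worked examples above.

def q_error(card_true, card_est):
    # Equation (29): ratio of the larger to the smaller of the two cardinalities.
    return max(card_true, card_est) / min(card_true, card_est)

print(q_error(1, 2))      # Example 1: q-error 2.0
print(q_error(101, 100))  # Example 2: q-error 1.01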

3 Method

As previously mentioned, the goal of this thesis is to create alternative cardinality estimators that can produce good estimates. To be able to use the methods described in Section 2, one must first transform the input data in such a way that it can be evaluated numerically. The method used to solve this problem is the following:

1. Transform the SQL strings to vectors. This defines the labeled input to the ML algorithms.

2. Compare the algorithms to each other and to a state-of-the-art DBMS. The evaluation metric used is the q-error, which was defined in Section 2.

In Figure 11 one can see how the problem was broken down.

Figure 11: Flow chart of how the problem was solved. Arbitrary SQL queries are first transformed to vectors, and both the true cardinality and PostgreSQL's cardinality estimate are measured. The vectorized query is then used as input to the ML algorithms. Lastly, the q-error is calculated for both the ML algorithms and PostgreSQL's optimizer based on the true cardinality.

3.1 Experimental Setup

All the experiments in this thesis were run on the following setup:

Computer            Acer Aspire
Memory              8 GB
Processor           Intel Core i7-6700HQ CPU @ 2.60 GHz × 8
Graphics            NVIDIA GeForce GTX 950M
Hard drive          128 GB LITEON CV1-8B128 (G871201)
Operating system    Ubuntu 20.04.1 LTS, 64-bit

Table 6: Computer setup for the experiments.

Python 3 (with different packages) was used to write the scripts for this project. PostgreSQL version 12.5 was used as the database.


3.2 Data used

The data used for this project is the same data that was used in [28]. Three different data sets were used and they will be covered in this section.

3.2.1 Shifted normal

The shifted normal data set consists of three tables with two columns each. The values in the columns are numbers drawn from normal distributions and rounded to the nearest integer, where corr1.c_i ∼ N(µ_1, Σ), corr2.c_i ∼ N(µ_2, Σ) and corr3.c_i ∼ N(µ_3, Σ) for i = 1, 2. The covariance matrix was set to:

\Sigma = \begin{pmatrix} 1200 & 1100 \\ 1100 & 1200 \end{pmatrix}

and the means were chosen as µ_1 = [500, 700], µ_2 = [600, 700] and µ_3 = [700, 700]. This data set is highly correlated.
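The sketch below shows how one of the three tables could be generated with NumPy under the description above; the number of rows is an assumption, since the thesis does not state it in this section.

import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1200.0, 1100.0],
                [1100.0, 1200.0]])
means = {"corr1": [500.0, 700.0], "corr2": [600.0, 700.0], "corr3": [700.0, 700.0]}

n_rows = 100_000  # assumption: the number of rows per table is not given here
corr1 = np.rint(rng.multivariate_normal(means["corr1"], cov, size=n_rows)).astype(int)
print(corr1[:5])  # columns c1 and c2 of table corr1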

3.2.2 Department of Motor Vehicles

This is a dataset consisting of real-world data from the Department of Motor Vehicles. It contains information about cars, owners, accidents and so on. In this thesis two tables were used: car and demographics.

Car is a table with 9 columns and 143,269 rows that describes information about the cars. Five of these columns were used in this thesis:

• id - a unique id in the database, 143269 unique values

• year - production year, 74 unique values

• model de - an integer indicating what type of car model it is, 36 unique values

• color de - an integer indicating what color the car has, 504 unique values

• ownerid - id of the car owner, 85437 unique values.

The demographics table consists of data about the owner of a car. It has five columns and 100,000 rows, and all five columns were used:

• id - a unique id in the database, 100k unique values

• age - the age of the owner, 81 unique values

• salary - the salary of the car owner, 10699 unique values

• assets - the assets of the car owner, 73690 unique values

• ownerid - id of the car owner, 100k unique values.

3.2.3 IMDB

The IMDB dataset consists of multiple tables storing real-world data about movies, series, documentaries, etc. The tables used were title and movie_companies. Title is a table containing approximately 3.6 million rows and 12 columns; three of these columns were used when evaluating query performance:

• kind_id - an id, 6 unique values

• production_year - the production year of the movie, 143 unique values

• id - a unique movie id, approximately 3.6 million unique values

The movie_companies table consists of data about the company behind a movie. The table stores 5 columns and approximately 3.9 million rows; three of these columns were used in this thesis:

• company_id - a company id, approximately 300k unique values

• movie_id - the id of the movie that was produced, close to 1.7 million unique values

• id - a unique id, approximately 3.9 million unique values

3.3 Generating queries

To train the machine learning algorithms and to be able to evaluate them, a large set of queries is needed. It is a grueling task to write these queries by hand, so an algorithm was constructed to generate unique queries with a cardinality greater than zero. Queries were generated for two different cases: looking at just one table, and a two way join between two tables.

3.3.1 Scanning one table

In the one table case, just one table is investigated. Let's use the shifted normal dataset as an example; in this case one can look at, for instance, table corr1. When generating queries for this case, the maximum number of predicates was set to two, hence a string can contain one WHERE and one AND statement. To make the algorithm fast and efficient, the maximum and minimum values of each column were stored in a dictionary. The following pseudocode describes how the query generating algorithm works:

Listing 1: Pseudocode for generating queries

import random

MAX_PREDICATES = 2                      # as many predicates as one likes
SIGNS = ['>', '<', '=', '>=', '<=', '!=']
COLUMNS = ['c1', 'c2']

sql_string = "SELECT * FROM corr1 WHERE "
nr_predicates = random.randint(1, MAX_PREDICATES)
signs = [random.choice(SIGNS) for _ in range(nr_predicates)]      # one sign per predicate
columns = [random.choice(COLUMNS) for _ in range(nr_predicates)]  # one column per predicate

for i in range(nr_predicates):
    # the min/max values of each column are stored in a dictionary beforehand
    value = random.randint(col_min[columns[i]], col_max[columns[i]])
    if i == 0:
        sql_string += columns[i] + " " + signs[i] + " " + str(value)
    else:
        sql_string += " AND " + columns[i] + " " + signs[i] + " " + str(value)

database.execute(sql_string)

If the execute command returns a cardinality equal to zero, the algorithm runs again. This code can be set to generate as many queries as the user wants.

3.3.2 Two way joins

The procedure to generate queries in this case is basically an extension of scanning one table. In two way joins the join is between two tables, and to do this a join id was used. The join id represents values that are shared between two tables. When two tables share columns, one can set constraints on the values in these columns. Taking the DPMV dataset as an example, the join can look like:

Listing 2: Example string from two way join for the DPMV dataset

SELECT * FROM car, demographics WHERE car.ownerid = demographics.ownerid;

Multiple queries of this type were generated.

3.4 Encoding sql-strings

In order to evaluate the queries numerically, they were converted into vectors based on their predicates. A predicate is a conditional expression that is either true or false. How the queries were converted to vectors was inspired by the work of [7]. Take, for example, the following query (from the shifted normal dataset):

Listing 3: Example string from two way join for the shifted normal dataset

SELECT COUNT(*) FROM corr1, corr1_2 WHERE corr1.c1 = corr1_2.c2 AND corr1.c1 > 300;

In this example query there are two predicates: the WHERE and AND statements. To transform the query, the mathematical operators were encoded as vectors using a dictionary.
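The table with the exact encoding is not reproduced in this extract. Purely as an illustration of the idea from [7], the hypothetical sketch below one-hot encodes the comparison operators of a query; the concrete dictionary and vector layout used in the thesis may differ.

# Hypothetical one-hot encoding of comparison operators (illustration only).
OPERATOR_ENCODING = {
    "=":  [1, 0, 0, 0, 0, 0],
    ">":  [0, 1, 0, 0, 0, 0],
    "<":  [0, 0, 1, 0, 0, 0],
    ">=": [0, 0, 0, 1, 0, 0],
    "<=": [0, 0, 0, 0, 1, 0],
    "!=": [0, 0, 0, 0, 0, 1],
}

# The two predicates of Listing 3 would then contribute, for example:
join_predicate = OPERATOR_ENCODING["="]    # corr1.c1 = corr1_2.c2
range_predicate = OPERATOR_ENCODING[">"]   # corr1.c1 > 300
print(join_predicate, range_predicate)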
