
Degree project

How to select the right machine learning approach?

Author: Yoel Sánchez Bermúdez
Supervisor: Welf Löwe


Abstract

In recent years, the use of machine learning methods has increased remarkably, and research in this field is therefore becoming more and more important. Despite this, a high degree of uncertainty remains when using machine learning models. We have a wide variety of machine learning approaches, such as decision trees or support vector machines, and many applications where machine learning has proven useful, like medical diagnosis or computer vision. All these possibilities make finding the best machine learning approach for a given application a time-consuming and ill-defined process, since there is no rule that tells us which method to use for a given type of data.

We attempt to build a system that, using machine learning, is capable of learning the best machine learning approach for a given application. For that, we work on the hypothesis that similar types of data will also have the same machine learning approach as their best learner. Classification algorithms are the main focus of this research, and different statistical measures are used in order to find these similarities among the data.

Keywords: machine learning, application, algorithm, machine learning approach,


Table of Contents

1 Introduction ... 1

1.1 Background ... 1

1.2 Project aim ... 1

1.3 Goal criteria ... 1

1.4 Methods and procedures ... 2

1.5 Report structure ... 2

2 Theory ... 3

2.1 Machine learning ... 3

2.1.1 Data representation ... 3

2.1.1.1 Dataset and features ... 3

2.1.1.2 Data properties ... 4

2.1.2 Supervised learning ... 4

2.1.2.1 Definition ... 5

2.1.2.2 Training and testing ... 5

2.1.3 Classification ... 5

2.1.3.1 k-nearest neighbors (k-NN) ... 5

2.1.3.1.1 Formal definition ... 6

2.1.3.1.2 Advantages and disadvantages ... 7

2.1.3.1.3 Extensions ... 7

2.1.3.1.4 Applications ... 7

2.1.3.2 Decision tree learning ... 7

2.1.3.2.1 Formal definition ... 7

2.1.3.2.2 Advantages and disadvantages ... 9

2.1.3.2.3 Extensions ... 9

2.1.3.2.4 Applications ... 9

2.1.3.3 Rule induction ... 10

2.1.3.3.1 Formal definition ... 10

2.1.3.3.2 Advantages and disadvantages ... 10

2.1.3.3.3 Extensions ... 10

2.1.3.3.4 Applications ... 11

2.1.3.4 Naive Bayes ... 11

2.1.3.4.1 Formal definition ... 11

2.1.3.4.2 Advantages and disadvantages ... 12

2.1.3.4.3 Extensions ... 12

2.1.3.4.4 Applications ... 13

2.1.3.5 Neural networks ... 13

2.1.3.5.1 Formal definition ... 13

2.1.3.5.2 Advantages and disadvantages ... 16

2.1.3.5.3 Extensions ... 16

2.1.3.5.4 Applications ... 16

2.1.3.6 Support vector machines ... 16

2.1.3.6.1 Formal definition ... 16

2.1.3.6.2 Advantages and disadvantages ... 18

2.1.3.6.3 Extensions ... 19

2.1.3.6.4 Applications ... 19

2.1.4 Regression ... 19

2.1.4.1 Linear regression ... 19


2.1.6 Binary to multiclass extensions ... 21

2.2 Statistical analysis ... 22

2.2.1 Correlation and dependence analysis ... 22

2.2.1.2 Correlation ... 22

2.2.1.3 Principal component analysis ... 23

2.2.2 Probability distribution ... 23

2.2.2.1 Kurtosis and skewness ... 24

2.2.2.2 Statistical dispersion ... 24

2.2.2.3 Kernel density estimation ... 24

2.2.3 Analysis of variance ... 25

3 Components used ... 26

3.1 RapidMiner ... 26

3.1.1 User interface and process creation ... 26

3.1.2 Operators ... 27

3.1.3 Modeling ... 28

3.2 Octave ... 28

3.3 UCI Machine Learning Repository ... 29

4 Solution ... 30

4.1 The problem ... 30

4.2 Our approach ... 30

4.3 Methodology ... 30

4.4 Classification algorithms ... 31

4.4.1 Lazy modeling ... 31

4.4.2 Bayesian modeling ... 31

4.4.3 Tree induction ... 32

4.4.4 Rule induction ... 32

4.4.5 Neural net training ... 32

4.4.6 Support vector modeling ... 33

4.5 Parameter selection ... 33

4.6 Architecture ... 34

4.7 Selected properties ... 35

4.8 Evaluation ... 35

5 Results ... 36

6 Conclusion ... 37

6.1 Future work ... 37

References ... 38

Appendices ... 43

Appendix A Properties of all datasets ... 43

Appendix B Table of datasets not processed ... 46

Appendix C Confusion matrix of the first approach ... 49


1 Introduction

In this introductory chapter we will give a short background on machine learning and the problems that have led to our research. We will discuss what our research topic is, how we will evaluate the success or failure of this project, and what methodology we will follow in order to do so. A short description of the structure of this report will be given as well.

1.1 Background

Machine learning is a field of study that focuses on the development of systems with the ability to learn, constructing algorithms that can perform tasks accurately by generalizing from examples, that is, training the systems to perform well on new data. In recent years, the use of machine learning has become more important in computer science and beyond. In a few words, machine learning is a family of algorithms that improve their performance with experience. Machine learning is used in applications where a manual solution is not an option; a good example of this are spam filters. See 2.1 for a detailed definition of machine learning.

A wide range of applications can be found. Machine learning is present in fields such as computer vision, speech recognition, medical diagnosis and search engines. This variety of applications, and therefore of the data to be analyzed, has led to a wide variety of machine learning approaches. Support vector machines, Bayesian methods, neural networks and decision trees are a few examples of the many existing machine learning approaches.

The first problem in machine learning is in fact choosing which machine learning approach to use in our application, because depending on the properties of the problem (data distribution, representation, etc.), one machine learning approach will perform better than the others. In this thesis we want to focus on this issue. Choosing the best machine learning approach is a key step that, performed manually, takes a lot of work.

1.2 Project aim

A machine learning system can be seen as a two-step process: training and deciding. In the training step, the training data (datasets) and the correct decision for each instance in the dataset are analyzed (an algorithm is used) and a decider is produced. This decider will take new input and make a decision based on the results of the previous step. We want to use this approach to build a system that, given an application, can learn the best machine learning approach for it.

We make the assumption that the best machine learning approach for a certain application will also be the best for applications with similar properties. We will build a system that takes different datasets as input, tries all machine learning approaches and finds the best one. It then builds a decider based on the properties of the data used in this process. The decider, given new data to process, will choose a machine learning approach based on the properties of that data, which we assume, based on the training, will be the best one.

1.3 Goal criteria

There are two obvious goals in this project that are related.


The other goal is accuracy. We want the machine learning approach chosen for an application to be the same one we would have chosen manually. In machine learning, 100% accuracy is rarely achieved, so we will have to decide which accuracy is considered a good enough success rate and, if the machine learning approach chosen is not the best one for the given input, to what degree the decision is still good enough. For some applications it may happen that the chosen machine learning approach has a really good performance even though it is not the best one for that context.

1.4 Methods and procedures

The data repository we will use for the training is the UC Irvine Machine Learning Repository (Bache & Lichman, 2013), which contains 239 datasets for machine learning. We will work with RapidMiner (Mierswa, et al., 2006), an open-source system for data mining used in research, education, application development and many other fields. This environment has many machine learning approaches implemented, and we will use those for our thesis.

To analyze the properties of the datasets we will use Octave (Eaton, 2002), an open-source tool for mathematical computations that provides all the statistical methods we need to apply to the data we will work with.

In section 3 we will describe these tools in more detail and explain why we have chosen them for our thesis.

1.5 Report structure


2 Theory

In this chapter we will give some background on the topics we will be dealing with in this thesis. We will introduce the concept of machine learning, explain how the learning process works, and summarize the most popular machine learning approaches. We will also give some insight into a few statistical measures that can be helpful to describe the datasets we will use to evaluate our research hypothesis.

2.1 Machine learning

Machine learning is a science that aims to build systems that can automatically learn from data. By learning, in this concrete field, we mean machines that have the ability to modify their behavior (as a response to external inputs) in a way that improves their performance. More concretely, machine learning is a collection of algorithms that are capable of generalizing behaviors from examples.

The reason why it is useful to give machines the ability to learn, rather than design machines that behave as desired from the beginning, is that some problems cannot be solved by humans because a well-defined relationship between input and output does not exist, other problems may be too complex to program by hand, and sometimes we want to discover new knowledge in large amounts of data.

Machine learning approaches have been extensively used in applications such as web search, spam detection, recommender systems, medical diagnosis, speech recognition and many others.

In this section we will talk about how data is usually represented in these systems, briefly mention the different types of learning and explain the learning process in detail. Later on we will focus on supervised learning, especially on classification algorithms, the most mature and widely used type of machine learning and the focus of our thesis. At the end of this section we will shortly introduce regression analysis, another supervised learning method, and finally talk about some techniques to evaluate and improve machine learning approaches.

2.1.1 Data representation

In the following sections we describe how data is usually represented for machine learning purposes and its basic properties. This is the most common practice, but many other representations or structures can be found in some machine learning applications, like relational or recommender systems; this, however, falls beyond the scope of our thesis.

2.1.1.1 Dataset and features

Machine learning data is usually described in a matrix called a dataset. This matrix is structured so that each row corresponds to an observation (example) of the data and each column represents a feature (also called variable or attribute) that describes the data. In this form of representation we may also find (in classification problems) a column indicating to which class each observation belongs.
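To make this representation concrete, here is a minimal Python sketch (illustrative only; the thesis itself works with RapidMiner and Octave) that stores a few Iris-style observations as a feature matrix with a separate class column. The feature values and labels shown are just example data.

```python
import numpy as np

# Each row is one observation, each column a feature; a separate
# array holds the class label of each row (values are illustrative,
# in the style of the Iris dataset).
X = np.array([
    [3.5, 1.4, 0.2],
    [3.0, 1.4, 0.2],
    [2.5, 4.9, 1.5],
    [3.3, 6.0, 2.5],
])
y = np.array(["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-virginica"])

print(X.shape)  # (4, 3): 4 observations, 3 features
print(y[0])     # class label of the first observation
```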


2.1.1.2 Data properties

Data values can take many representations. Data can be numerical (integer or real numbers) or nominal, where values are differentiated by name. Categorical data is a type of nominal data that, as its name indicates, can only take a fixed set of nominal values (or categories).

Missing values are common in machine learning datasets, either because some information about the data is missing or because for a particular attribute not all observations have a measurable value; either way, machine learning approaches have to deal with this issue. For nominal data, the usual approach is to treat the missing values as one more category or possible value that the data can take. For numerical data, a usual practice is to replace the missing values with the average value of the attribute (column) in question.
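As a hedged illustration of these two conventions, the following Python sketch replaces missing numeric values with the column mean and treats a missing nominal value as one more category; the column values are made up for the example.

```python
import numpy as np

# Numeric attribute with missing values (NaN): replace each missing
# entry with the mean of the observed values in that column.
col = np.array([4.2, np.nan, 3.9, 5.0, np.nan])
col[np.isnan(col)] = np.nanmean(col)
print(col)

# Nominal attribute: treat a missing value as one more category.
nominal = ["red", None, "blue", "red"]
nominal = [v if v is not None else "missing" for v in nominal]
print(nominal)
```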

It is also important (as we will see in the following sections) to talk about the number of classes (or categories) that a dataset has. In supervised learning problems (see 2.1.2.1), each observation in the dataset is labeled, i.e. it belongs to one of a fixed number of classes. We distinguish between binary and multiclass (or polynomial) problems. The difference between these two groups is easy to understand: the data can have two labels (binary) or more (multiclass). The reason for this distinction is that machine learning approaches are often designed to solve only binary problems (for example support vector machines), because the algorithm definition is usually not trivial to extend to multiclass problems. A usual approach for multiclass problems (if we are not using already existing multilabel algorithms) is the combination of binary machine learning approaches (see 2.1.6).

2.1.2 Supervised learning

Machine learning approaches can be classified according to their desired output and/or input values. The two main subfields of machine learning are supervised learning and unsupervised learning.

Supervised learning aims to find methods that can perform well on unseen data by previously learning on labeled data. Two popular supervised learning problems are classification (2.1.3) and regression (2.1.4).

Unsupervised learning deals with data about which we have no prior information. The goal of unsupervised learning algorithms is to extract accurate and descriptive information from this type of data. An example are clustering algorithms like k-means (Weisstein, 2013).

[Figure: excerpt of the Iris dataset, where each row is an observation with numeric feature values and a class label (Iris-setosa, Iris-versicolor or Iris-virginica).]


Other, less popular, learning styles exist, such as semi-supervised learning, where both labeled and unlabeled data are used, or reinforcement learning, where the algorithms learn through interactions with their environment, i.e. the learner must discover which actions lead to a better reward (trial-and-error search).

In this thesis we will focus on supervised learning, concretely on classification algorithms. Other learning methods are harder to evaluate mathematically, and classification algorithms are therefore better suited for testing the hypothesis of our thesis.

2.1.2.1 Definition

Supervised learning algorithms aim to build models that are capable of generating predictions (or outputs) as a response to previously unseen data, i.e. finding a function that establishes a correspondence between input and output data. This is possible by training the model with already classified data and later on testing the model on unseen (or unlabeled) data. This process is called generalization. Suppose, for example, that you have data about a number of patients with a known type of arrhythmia, including the symptoms they present. Supervised learning methods allow us to create models that will be able to predict (with a certain accuracy) the type of arrhythmia a patient has based on their symptoms.

2.1.2.2 Training and testing

Supervised learning methods are structured in two steps called training and testing. During the training phase, the algorithm takes the set of labeled observations and tries to find a function that correctly separates the observations into their categories. Once we have the predictor model, we want to test its efficiency on unseen data.

The problem this process presents is the choice of the data used for training and testing. Usually we only have one set of labeled data, and we have to split it in some way into training and testing data; how we split the data can affect the performance of the model. Techniques like cross-validation (see 2.1.5) aim to solve this issue.

2.1.3 Classification

We have already mentioned this type of supervised learning. In short, classification problems aim to build a model that is able to predict (classify) the categories of new unseen data observations. This generalization is possible thanks to a prior step called training, where the model learns to generalize from a set of already labeled (classified) data.

Classification problems work with data where the output is one of a discrete (and fixed) number of classes, and it is therefore easy to evaluate the performance of the algorithms: we simply have to compare the predicted label with the correct label for each tested data observation. We will mention regression methods, but their characteristics make the evaluation of regression models less trivial to measure. This factor, and the popularity of classification algorithms (although many of them can be extended for regression purposes), has led us to focus our thesis on classification approaches. In the following sections we describe a few of these algorithms, including k-NN, decision trees, rule induction models, naive Bayes, neural networks and support vector machines.

2.1.3.1 k-nearest neighbors (k-NN)


The k-nearest neighbors algorithm classifies a new observation according to the classes of the k training observations most similar to the new sample. In its most basic form, when k=1, the algorithm will classify the new data point with the class of its nearest neighbor. With greater values of k, the most predominant class among the neighbors will be assigned. Choosing the value of k is a key step when using this algorithm and will be discussed in the following sections, as well as how the voting approach may cause problems depending on the characteristics of the data and the value of k chosen.

2.1.3.1.1 Formal definition

The k-NN algorithm falls into the supervised learning category of machine learning approaches and is arguably the simplest among them. k-NN was first introduced by Fix and Hodges (1951). This method works under the assumption that observations close to each other (with similar feature values) will belong to the same class. In practice, however, even if the data points belonging to the same class are close to each other and far from points of different classes, the separation between classes (boundaries) is usually not perfect, and some observations may be misclassified by choosing only the nearest neighbor. That is why k-NN generalizes this idea: it finds the predominant class among the k neighbors of the new observation. Figure 2.2 shows a simple example of the functionality of this algorithm, finding the closest 5 neighbors.

The algorithm works as follows: in the training step, the labeled observations are stored in an n-dimensional space, and upon classification, each new observation is assigned the majority class among its k nearest neighbors. The simplicity of the algorithm is obvious, but there are two key decisions that are worth mentioning: neighbor selection and the choice of k.

Neighbor selection. We have talked about finding the closest neighbors, but we have not yet defined how to determine which ones are the closest. This notion of closeness is based on a distance metric between observations. The Euclidean distance is commonly used due to computational considerations, but other metrics can be used depending on the characteristics of the data, for example the Mahalanobis or Hamming distances.

Choice of k. Choosing the value of k when using the k-NN algorithm is a key decision that the user has to make. With k=1, noisy data or even a single badly labeled observation can limit the classification performance of the algorithm, while a large value of k can make the limits between classes less clear and lead to misclassifications. The performance of k-NN for a given choice of k varies depending on how the data is distributed, so there is no single value of k that is optimal across applications. Finding the optimal value for a given application can be achieved with techniques such as cross-validation, comparing the performance of the classifier with different values of k.
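As a small illustration of the algorithm just described (a sketch in Python, not the RapidMiner operator used later in this thesis), the following function classifies a new observation by a majority vote among its k nearest training observations under the Euclidean distance; the toy data is invented.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest
    training observations (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: two features, two classes.
X_train = np.array([[1.0, 1.2], [0.9, 1.0], [4.0, 4.2], [4.1, 3.9]])
y_train = ["A", "A", "B", "B"]

print(knn_predict(X_train, y_train, np.array([1.1, 1.1]), k=3))  # "A"
```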

2.1.3.1.2 Advantages and disadvantages

k-NN is very simple and intuitive, which makes its implementation very easy; it only has two parameters to tune (the distance metric and k), and updating the model with new training samples has a very low cost. This simplicity allows the approach to overcome certain restrictions that other techniques have, such as the data type (numerical, text, etc.) or whether the data is linearly separable, and it does not need to make assumptions about the data in order to learn.

k-NN can be really time-consuming with large datasets (in number of instances and attributes), since it has to measure the distance between the new instance and all the labeled observations. The algorithm also suffers a decrease in performance in the presence of noisy and unbalanced data. Another drawback is that it has no interpretability: no conceptual information about the data can be extracted through the learning process.

2.1.3.1.3 Extensions

The limitations of the basic algorithm have motivated the development of numerous extensions that aim to improve its performance. The lack of probabilistic semantics when making predictions, which prevents the posterior use of predictive probabilities, motivated a probabilistic nearest neighbor method (Manocha & Girolami, 2007) to overcome this issue. To improve the efficiency of k-NN on high-dimensionality problems, Hastie and Tibshirani (1996) proposed a discriminant adaptive nearest neighbor method (DANN). A weight-based k-NN algorithm was introduced (Han, et al., 2001) to overcome the drawbacks of using all attributes of the data to measure the distance between neighbors, associating different weights with the features.

2.1.3.1.4 Applications

k-NN approaches have been successful in areas like content-based data retrieval and protein structure prediction. k-NN is sometimes combined with other machine learning approaches in order to improve their efficiency, for example an SVM-KNN approach for visual category recognition (Zhang, et al., 2006) or a KNN/LSVM approach for gene expression analysis (Pan, et al., 2004).

2.1.3.2 Decision tree learning

Decision tree learning in data mining is a type of machine learning approach that uses a decision tree as a predictive model (for classification and regression purposes). First discussed by Belson (1959), a decision tree is a tree-like structure that represents a sequential decision process in a graphical way, which compared to other machine learning approaches makes the model easy to interpret and understand. The goal of a decision tree is to predict a target attribute (classification class or label attribute) given a set of input values.

2.1.3.2.1 Formal definition


In a decision tree, each interior node represents an input attribute of the training set and the edges correspond to the possible values of that attribute; the number of outgoing edges of a node therefore depends on the possible values of the attribute. Leaf nodes correspond to the values of the label attribute. A new example, based on its attribute values, will follow a unique path in the tree and will be classified according to the leaf node that is reached.

A decision tree is built by recursively partitioning the n-dimensional instance space, which means repeatedly splitting the tree depending on the values of the attributes (interior nodes). While building the tree, in each recursion the best attribute to split on is selected according to a selection criterion (a crucial step that will be discussed later on). The example set is then divided into subsets depending on the value of the selected attribute, and a tree is returned with one edge for each subset. Figure 2.3 shows an example of a decision tree in which a decision has to be made about playing golf or not based on weather conditions.

In the following sections we will discuss two key features of this algorithm: the splitting criterion and pruning, a technique to reduce overfitting. We will also mention some specific decision tree algorithms.

2.1.3.2.1.1 Splitting criterion

In each recursion while building a decision tree, the attribute that best splits the training set into subsets is selected. The quality of the split depends on the criterion chosen for splitting. There are many techniques that address this concern, but we will only describe the most common ones.

Perhaps the most widely used criterion, information gain, consists of selecting the attribute whose split yields the largest reduction in entropy, i.e. the minimum remaining entropy. In information theory, entropy is the amount of information (or uncertainty) of a random variable (Shannon, 1948). Sometimes the information gain ratio is used instead, to overcome the overfitting produced when information gain is used to split on attributes with a large number of distinct values. Information gain ratio is a variant of information gain in which the ratio between the information gain and the entropy of the attribute considered for splitting is calculated.

Another frequently used splitting technique is the Gini index, a measure of the impurity (inequality) of a data distribution: the probability of misclassifying a randomly chosen sample if it were labeled according to the class distribution.
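To make these criteria concrete, here is a short Python sketch (our own illustration, not tied to any specific decision tree implementation) that computes the entropy, the Gini impurity and the information gain of a candidate split on a toy set of labels.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: probability of misclassifying a random sample."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """Entropy of the parent minus the weighted entropy of the
    subsets produced by splitting on some attribute."""
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# Toy 'play golf' labels and a split induced by some weather attribute.
parent = ["yes", "yes", "no", "no", "yes", "no"]
split = [["yes", "yes", "yes"], ["no", "no", "no"]]
print(information_gain(parent, split))  # 1.0: a perfect split
print(gini(parent))                     # 0.5
```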


2.1.3.2.1.2 Pruning

As mentioned before, making the tree too complex may produce overfitting and cause poor performance of the predictive model. To overcome this, a technique named pruning is used. Pruning consists of reducing the size of a decision tree by removing sub-trees that do not contribute to the classification task, producing a simpler and more general tree and usually improving the performance of the model. Pruning can be performed in a top-down or bottom-up manner once the decision tree is built, but there is also a technique called pre-pruning, where pruning is performed in parallel with the tree creation. Pruning techniques differ in the measurement they use to optimize the model performance, for example reduced-error or pessimistic pruning.

2.1.3.2.1.3 Decision tree algorithms

There are many well-known decision tree algorithms that have proven useful in machine learning. ID3 (Quinlan, 1985) and its successor C4.5 (Quinlan, 1993) build decision trees using information gain as a splitting criterion; unlike ID3, C4.5 performs pruning on the resulting tree. CART, or Classification and Regression Trees (Breiman, et al., 1984), follows a similar approach to C4.5, and CHAID (Kass, 1980), which uses a chi-squared based criterion, is another example of a decision tree algorithm.

2.1.3.2.2 Advantages and disadvantages

Decision trees provide a graphical and easy to understand representation of a model from which knowledge about the data can be easily extracted. This type of machine learning approach can handle both numerical and categorical data, and it can deal with errors and missing values. Another advantage is that decision trees are considered a non-parametric type of algorithm: they make no assumptions about the data and require very little parameter tuning. There are of course some drawbacks we have to face when using decision trees. We already mentioned overfitting, and another major disadvantage is the instability of these algorithms: changing the input data can result in different splits and therefore a rather different tree structure. Large amounts of data can result in very complex trees, and when many relevant attributes exist the complexity of the resulting tree increases.

2.1.3.2.3 Extensions

A few techniques that use more than one decision tree have been shown to improve the performance of the basic decision tree algorithms. The Random Forest approach (Breiman, 2001) creates a fixed number of random trees (splitting on a random subset of attributes), and the result is a voting model over all the created trees. To improve the stability of the algorithm, bagging (Breiman, 1996) can be used: the chosen tree-construction algorithm is applied to different subsets of the example set and the resulting model, like in Random Forest, is a voting model. Another popular extension is Decision Graphs (Oliver, 1993), where a Join operation is included to solve fragmentation (attributes fragmented into many partitions) and replication problems (duplicated sub-trees).

2.1.3.2.4 Applications


2.1.3.3 Rule induction

The rule learning task consists of finding, given a set of labeled data, a set of rules that can be used to classify new unseen data. Rule induction is somewhat similar to decision trees, since the sets of rules produced can be expressed in the form of a decision tree. They differ in the way the sets of rules are usually built and how the results are interpreted; we will give a more detailed explanation in the following section.

2.1.3.3.1 Formal definition

Rule induction algorithms for binary classification try to classify new instances into positive and negative classes, in such a way that the learning task for the positive class tries to find a set of rules that covers all positive examples in the training set (is complete) and does not cover any negative one (is consistent). To deal with multiclass problems, a simple extension is applied, building a set of rules for each class where all non-members of the class are marked as negative examples.

A rule usually takes the form:

if (attribute-1, value-1) and (attribute-2, value-2) and … and (attribute-n, value-n) then (class, value)

How the different sets of rules are built depends on which rule induction algorithm we are using. Rule induction algorithms can be categorized as global (all possible attribute values are considered for each attribute) or local (attribute-value pairs are considered). For each attribute or set of attributes, all possible values are considered and a condition (rule) is selected based on some criterion; for example, LEM1 (Michalski, 2000) uses rough sets and RIPPER, Repeated Incremental Pruning to Produce Error Reduction (Cohen, 1995), uses information gain to determine the rules.
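As a minimal illustration of this representation (a sketch with made-up attribute names, not an implementation of LEM1 or RIPPER), a rule can be stored as a set of attribute-value conditions plus a predicted class, and an example is classified by the first rule whose conditions it satisfies:

```python
# Each rule: (conditions, predicted class). The last rule is a default
# that covers every remaining example. Attribute names are invented.
rules = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({}, "yes"),
]

def classify(example, rules):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(example.get(a) == v for a, v in conditions.items()):
            return label
    return None

print(classify({"outlook": "sunny", "humidity": "high"}, rules))   # "no"
print(classify({"outlook": "rain", "humidity": "normal"}, rules))  # "yes"
```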

2.1.3.3.2 Advantages and disadvantages

The similarities between decision trees and rule induction algorithms cause them to share some of the positive and negative characteristics of these algorithms.

Rule sets are easy to understand and provide a better understanding of the data. In some cases they can perform better than simple decision trees, and they are easy to implement because they can be expressed in first-order logic. Another important advantage is that prior knowledge about the data can be added, so not only the training data can provide information.

Like decision trees, these kinds of algorithms are sensitive to noisy and irrelevant data, so pruning techniques are usually needed to overcome this issue. Overfitting is a major drawback, noisy and irrelevant data can reduce the performance, and these algorithms scale poorly with the training data size.

2.1.3.3.3 Extensions

There are many algorithms that take different approaches to rule induction, but they often follow a similar structure, so we want to focus on approaches that use other fields of machine learning to improve these techniques.

A genetic algorithm for generalizing rules (Freitas, 1999) tries to simplify rule induction in knowledge discovery in order to return a smaller but more relevant set of rules.


2.1.3.3.4 Applications

Rule induction has proven very useful in data mining, in the discovery of knowledge in large databases, providing accurate rules that describe the data. These techniques have proven more useful for describing data than for classification purposes, for example in analyzing medical data (Słowiński, et al., 2002) or in text information extraction (Ciravegna, 2001).

2.1.3.4 Naive Bayes

Naive Bayes (Minsky, 1961) is a probabilistic classifier based on applying Bayes' theorem (which we will explain in the following section) with a naive assumption of independence among all features in the data. In simple terms, this model assumes that the presence or absence of a feature (i.e. an attribute value of a class) is not related to the presence or absence of any other feature. Despite its usually inaccurate assumptions and its simplicity, this approach has proven to perform really well in practice, on several occasions outperforming other, more sophisticated machine learning approaches.

2.1.3.4.1 Formal definition

As we mentioned before, Naive Bayes is a model based on Bayes' theorem, so to really understand this model we first have to introduce the theorem.

2.1.3.4.1.1 Bayes’ theorem

In probability theory, Bayes' theorem (Thomas Bayes, 1973), sometimes called Bayes' rule or Bayes' law, is a simple mathematical formula to calculate the conditional probability of an event A given B, based on the conditional probability of B given A and the probability of A. The formula, in its simplest form, is written as follows:

$P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$

P(A|B) is called the posterior probability, what we want to know in the end; P(B|A) is the likelihood, the probability of B given A; P(A) is called the prior, the a priori probability of A before B is observed (it expresses our uncertainty about A); and finally P(B) is the evidence.

This formula can be expressed in terms closer to our domain. Bayesian inference is a method in which observations are used to infer (update) the probability that a hypothesis is true. For machine learning, if we see the hypothesis as the classification label and the evidence as the attributes, the formula can be expressed as follows:

$P(C \mid F_1, \ldots, F_n) = \dfrac{P(F_1, \ldots, F_n \mid C)\, P(C)}{P(F_1, \ldots, F_n)}$

where C is the classification class and $F_1, \ldots, F_n$ is the set of features or attributes that describes each instance in a dataset. Given a new example, this formula is calculated for each existing class, and the example is classified into the one with the highest posterior probability.


2.1.3.4.1.2 Bayes’ theorem with strong independence

Looking at the previous formula, we can see that for machine learning purposes we only care, in practice, about the numerator, which by the definition of conditional probability is a joint probability:

$P(F_1, \ldots, F_n \mid C)\, P(C) = P(C, F_1, \ldots, F_n)$

which can be rewritten as:

$\begin{aligned} P(C, F_1, \ldots, F_n) &= P(C)\, P(F_1, \ldots, F_n \mid C) \\ &= P(C)\, P(F_1 \mid C)\, P(F_2, \ldots, F_n \mid C, F_1) \\ &= P(C)\, P(F_1 \mid C)\, P(F_2 \mid C, F_1)\, P(F_3, \ldots, F_n \mid C, F_1, F_2) \end{aligned}$

and so on.

The Naive Bayes classifier assumes strong (conditional) independence between all the features that define the data, $F_1, \ldots, F_n$.

Assuming conditional independence implies:

$P(F_i \mid C, F_j) = P(F_i \mid C) \quad \text{for } i \neq j$

With this definition we have:

$P(C, F_1, \ldots, F_n) = P(C) \prod_{i=1}^{n} P(F_i \mid C)$

The probability $P(F_1, \ldots, F_n)$ is a constant value (let us call it Z) that depends only on the values of $F_1, \ldots, F_n$, which are known, so finally the Naive Bayes classifier is defined as:

$P(C \mid F_1, \ldots, F_n) = \dfrac{1}{Z}\, P(C) \prod_{i=1}^{n} P(F_i \mid C)$
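The following Python sketch (an illustration under the simplifying assumptions of categorical features and no smoothing, not the RapidMiner operator used later in this thesis) estimates P(C) and P(F_i | C) by frequency counting and classifies a new example with the class that maximizes P(C) multiplied by the product of the P(F_i | C); the constant 1/Z is the same for every class and can be ignored.

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate P(C) and P(F_i = v | C) for categorical data
    by simple frequency counting (no smoothing, for brevity)."""
    class_counts = Counter(y)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for features, label in zip(X, y):
        for i, value in enumerate(features):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def predict(features, class_counts, feature_counts):
    """Pick the class maximizing P(C) * prod_i P(F_i | C)."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(C)
        for i, value in enumerate(features):
            score *= feature_counts[(label, i)][value] / count  # P(F_i | C)
        if score > best_score:
            best_class, best_score = label, score
    return best_class

# Toy weather data: (outlook, windy) -> play
X = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
y = ["yes", "no", "yes", "no"]
cc, fc = train_naive_bayes(X, y)
print(predict(("sunny", "no"), cc, fc))  # "yes"
```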

2.1.3.4.2 Advantages and disadvantages

The simplicity of this model, while providing more than acceptable performance in many cases, is perhaps its most important advantage. It is fast to train, does not need large amounts of data in the learning step and, due to its probabilistic approach, Naive Bayes is not sensitive to irrelevant features.

While these characteristics are obviously very interesting in machine learning, the assumption of feature independence does not usually hold, and the inability to deal with dependencies may yield inaccurate results.

2.1.3.4.3 Extensions

Different extensions have been presented to improve the Naive Bayes model.


A Hierarchical Naive Bayes model was introduced (Helge & Nielsen, 2006) to relax some of the independence assumptions of the original model.

Random Naive Bayes adopts some of the Random Forest basics, such as bagging and random feature selection.

2.1.3.4.4 Applications

Naive Bayes has proven successful in many areas such as text or document classification (Rennie, 2001), especially spam filters; image classification (Lowe, 2012); and some medical disciplines like treatment optimization.

2.1.3.5 Neural networks

Artificial neural networks (ANN), usually called neural networks (NN), are a class of mathematical models that attempt to replicate the structure and processing capabilities of biological neural networks. First discussed in 1943 (McCulloch & Pitts), neural networks are a group of interconnected nodes or artificial neurons, often organized into layers, in which the information is processed using a connectionist approach, i.e. the idea that mental and behavioral phenomena can be described by interconnected networks of simple units. The global behavior of the processing units is determined by the strength (or weights) of these connections. The structure of the network is changed during the learning step by modifying the strength of the connections in order to obtain the desired behavior or data flow.

2.1.3.5.1 Formal definition

Neural network implementations approach the biological model in a simpler and more practical way. Neural networks are defined as a group of nodes (neurons) that have connections between them (synapses, if we look at the biological model). To trigger a neuron to do something we provide some input to it, which in turn triggers the nodes it is connected to. This attempt to emulate the biological model is difficult to implement and to get results from in the way computers work, so neural networks are organized in a way that computers can work with, that is, into input, processing and output units.


What neural networks aim to do is to infer (approximate) a function from the observed data and then use this function. A simple example would be to learn a separating hyperplane (a linear function in a 2-dimensional space, for example) that correctly classifies the observations in a training set.

Each of the neurons in this topology computes a function (the activation function) of its inputs (incoming edges) and sends the output along its outgoing edges. The calculations are weighted by the weights of the edges and shifted by a bias factor attached to each node.

We can now see how the weights of the connections determine how good the approximation of the desired function is. So how do we find the weights that most accurately approximate this function? This is where a learning process is needed, to iteratively update the weights of the connections and find the best ones. There are different techniques to achieve this depending on the structure of the network. We will focus on the most common topology, feed-forward neural networks. To understand these networks better we will first introduce the concept of the perceptron, and then look at feed-forward neural networks and the back-propagation algorithm they use to learn the weights.

2.1.3.5.1.1 Perceptron

Invented by Frank Rosenblatt (1957), the perceptron is a class of neural network that can be considered the simplest kind of feed-forward neural network. It can be seen as the basic unit (or neuron) in larger neural networks. The key contribution by Rosenblatt was the introduction of weights and of the learning rule for training neural networks. The perceptron is a binary linear classifier that computes its predictions by combining a vector of inputs with a set of weights (usually shifted by some bias); an activation function is then applied to produce the perceptron output. We can see an example diagram in Figure 2.5.

[Figure 2.4: Neural network example, with input nodes and output nodes.]


In this example, a simple threshold function is used as the activation function, so the function could be defined, for example, as:

$f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases}$

where for each input $x_1, x_2, \ldots$ the corresponding weights $w_1, w_2, \ldots$ are applied and the result is shifted by the bias b. This value is passed to the activation function (the threshold function in Figure 2.5), whose output determines the classification result.

The learning algorithm of a perceptron is not as complex as, for example, back-propagation for the multilayer perceptron. First, all weights $w_1, w_2, \ldots$ (one for each input $x_1, x_2, \ldots$, corresponding to each feature of an observation in the data) and the bias b are initialized; then, for each example in the training set, the output is calculated by the activation function (the threshold function in Figure 2.5). If the result is not the desired output, the weights are updated according to a predefined learning rule. This process is repeated until the error of the model is less than some predefined error or until a predetermined number of iterations has been reached.

An example learning rule could be:

$w_i = w_i + lr\,(d_j - y_j)\, x_{j,i} \quad \text{for each input } i$

where $lr$ is a predetermined learning rate (between 0 and 1), $d_j$ is the desired output, $y_j$ the actual output and $x_{j,i}$ the value of input i for the example j in the training set.
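Putting the pieces together, here is a small Python sketch of perceptron training (our own illustration with a toy problem, not taken from the thesis): it uses the threshold activation and the learning rule above to learn the logical AND function.

```python
import numpy as np

def train_perceptron(X, d, lr=0.1, epochs=20):
    """Threshold perceptron: for each example, compute the output and
    move each weight by lr * (desired - actual) * input."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, desired in zip(X, d):
            y = 1 if np.dot(w, x) + b > 0 else 0
            error = desired - y
            w += lr * error * x
            b += lr * error
    return w, b

# Toy linearly separable problem: logical AND.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, d)
print([1 if np.dot(w, x) + b > 0 else 0 for x in X])  # [0, 0, 0, 1]
```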

2.1.3.5.1.2 Feed-forward neural networks and back-propagation

A feed-forward neural network (also called a multi-layer perceptron) is a type of neural network where the connections do not form any cycles; the information thus flows in one direction, from the input to the output nodes. See Figure 2.3.

In contrast with the single-layer perceptron, the complexity of the network requires more sophisticated learning algorithms, such as back-propagation.

[Figure 2.5: Perceptron diagram with inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, bias b and output.]

The back-propagation algorithm is divided into two phases: propagation and weight update. During propagation, all output values of the network are calculated and compared with the desired output values. The difference (a predefined error function) is then propagated backwards through the network and used to update the weight of each node based on the value of the old weight, the input value, the error and the learning rate. Once its weight is updated, each node calculates its error and pushes it back through the network. This process is repeated until the network converges to a user-defined error or until a certain number of iterations (or training cycles) is reached.
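The sketch below (a simplified illustration with one hidden layer, sigmoid activations and a squared-error loss; the sizes, learning rate and data are all made up) shows the two phases in code: a forward propagation pass followed by a backward pass that pushes the error through the network and updates the weights. With enough iterations the outputs for the XOR problem, which a single perceptron cannot learn, typically approach 0, 1, 1, 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, d, hidden=4, lr=1.0, epochs=10000, seed=0):
    """One-hidden-layer feed-forward network trained with
    back-propagation on a squared-error loss."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    for _ in range(epochs):
        # Propagation: compute the output of every layer.
        h = sigmoid(X @ W1 + b1)                  # hidden activations
        y = sigmoid(h @ W2 + b2)                  # network outputs
        # Weight update: push the error back through the network.
        err_out = (y - d) * y * (1 - y)           # delta at the output layer
        err_hid = (err_out @ W2.T) * h * (1 - h)  # delta at the hidden layer
        W2 -= lr * h.T @ err_out
        b2 -= lr * err_out.sum(axis=0)
        W1 -= lr * X.T @ err_hid
        b1 -= lr * err_hid.sum(axis=0)
    return W1, b1, W2, b2

# XOR: not linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1, W2, b2 = train_mlp(X, d)
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```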

2.1.3.5.2 Advantages and disadvantages

A neural network does not need the data to be linearly separable and can therefore approximate non-linear functions; this makes neural networks better suited than other machine learning approaches when the relationships in the data are non-linear or difficult to describe with conventional methods. Neural networks are flexible and robust to changes in the data (noise, for example), and they work well with both continuous and discrete data.

The major drawbacks of neural networks are that for some problems they are really slow to train and that they usually require a large amount of training data.

2.1.3.5.3 Extensions

There are many types of neural networks, such as feed-forward neural networks, recurrent neural networks and modular neural networks, but these approaches usually differ only in the topology of the network and their learning algorithm. They can also be found combined with other machine learning methods in order to improve their performance in certain fields, and one of the most interesting approaches is the combination of neural networks with genetic algorithms. For example, AutoMLP (Breuel & Shafait, 2010) is an algorithm for adjusting the learning rate and network size during training. The algorithm maintains a small set of networks that are trained in parallel with different learning rates and numbers of hidden nodes. After a fixed number of iterations, the error rate of all networks is calculated and the networks with the worst results are replaced with slightly modified copies of the best networks, with different rates and numbers of hidden nodes.

2.1.3.5.4 Applications

Neural networks are used in a wide range of areas. A classic application is image recognition problems, such as optical character recognition. Another popular application is data mining. Neural networks are best at identifying complex patterns in the data, so they can often be seen in fields like risk management, customer research or sales forecasting. In general, neural networks are best suited for applications where an identifiable model of the data is not available, for example pattern recognition problems, medicine-related applications and robotics.

2.1.3.6 Support vector machines

Support vector machines are a set of supervised learning algorithms developed by Vladimir Vapnik (1995). SVM (support vector machines) are originally binary linear classifiers, but they have been successfully extended to non-linear data. The basic idea is that an SVM builds a maximal-margin separating hyperplane in a high (or infinite) dimensional space that separates the data. SVM can generally deliver higher performance than other approaches, which makes them one of the most interesting machine learning approaches.

2.1.3.6.1 Formal definition


In a binary classification setting, the labeled training data can be separated by a hyperplane in the feature space that is then capable of predicting the correct class of a new data point. But which hyperplane best separates the feature space? This is the basic idea behind SVM. SVM are called maximum-margin classifiers, i.e. SVM try to find a hyperplane that maximizes the distance between the hyperplane and the closest members of both classes. In the next sections we will go into detail about this idea of the maximum-margin hyperplane, we will see how SVM finds a hyperplane when the data points cannot be separated entirely into two classes, and how SVM is extended to deal with non-linearly separable data. The mathematics behind SVM are not simple and require knowledge about optimization problems and Lagrange multipliers; this falls beyond the scope of this thesis, so it will not be treated here.

2.1.3.6.1.1 Linear SVM and soft margin

Assuming the data is linearly separable, the separating hyperplane for a set of points can be described as $w \cdot x + b = 0$, where $w$ is the normal to the hyperplane, x is a set of points and $\frac{b}{\lVert w \rVert}$ is the distance from the hyperplane to the origin. The examples closest to this hyperplane are called support vectors, and the goal of SVM is to find the hyperplane that is as far as possible from these support vectors. We can see an example in Figure 2.6.

The two selected separating hyperplanes can be described as $w \cdot x + b = 1$ and $w \cdot x + b = -1$, which are rewritten as $w \cdot x + b \geq 1$ and $w \cdot x + b \leq -1$ to prevent data points from falling into the margin. The distance between these two hyperplanes is $\frac{2}{\lVert w \rVert}$, so the optimization problem is to minimize $\lVert w \rVert$. This is a quadratic programming optimization problem, and its constraints are handled by introducing Lagrange multipliers. The solution of this problem, in its final form, is expressed only in terms of the dot products of the support vectors. For detailed information about how SVM algorithms solve the optimization problem, see for example Tristan Fletcher's paper (2009) “Support Vector Machines Explained”.



Obviously, it is not always possible to find a hyperplane that correctly separates all data points of both classes. SVM uses the concept of a soft margin to address this issue: if no hyperplane exists that splits all the data correctly, the soft margin technique will find the hyperplane that best splits the data while still maximizing the margin with respect to the correctly classified data.

2.1.3.6.1.2 Kernel trick

The original SVM was designed as a linear classifier, but years later a modified algorithm was presented to deal with non-linearly separable data. In essence the algorithm is the same, meaning that the methodology has not changed, but in the modified algorithm every dot product is replaced with a kernel function. This idea is based on the assumption that non-linearly separable data can become linearly separable in a higher dimensional space. The kernel trick thus consists of using kernel functions to calculate the inner products of the mapped vectors without actually applying a mapping function to the data. Figure 2.7 shows an example of this method.

A few common kernel functions are:

• Linear kernel: $K(x_i, x_j) = x_i^T x_j$

• Polynomial kernel: $K(x_i, x_j) = (\gamma\, x_i^T x_j + r)^d$, $\gamma > 0$

• RBF kernel (radial basis function): $K(x_i, x_j) = \exp(-\gamma\, \lVert x_i - x_j \rVert^2)$, $\gamma > 0$

• Sigmoid kernel: $K(x_i, x_j) = \tanh(\gamma\, x_i^T x_j + r)$
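The kernels listed above translate directly into code; the following Python sketch (with purely illustrative parameter values) evaluates each of them for a pair of vectors.

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, gamma=1.0, r=1.0, d=3):
    return (gamma * (xi @ xj) + r) ** d

def rbf_kernel(xi, xj, gamma=1.0):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def sigmoid_kernel(xi, xj, gamma=1.0, r=0.0):
    return np.tanh(gamma * (xi @ xj) + r)

x1 = np.array([1.0, 2.0])
x2 = np.array([2.0, 0.5])
for k in (linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(k.__name__, k(x1, x2))
```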

2.1.3.6.2 Advantages and disadvantages

SVM algorithms produce very efficient classifiers and are very robust: overfitting due to noisy data is not common, so SVM do not need a large number of training samples to generalize the problem. SVM do not suffer from the problem of local minima, due to the characteristics of the quadratic programming optimization, and the only parameters to choose are the kernel function and the error cost (C).

SVM are designed as binary classifiers; in order to build a multiclass SVM, the usual approach is to transform the multiclass problem into a series of binary classification problems (see section 2.1.6). SVM are slow learners, it is not easy to introduce prior knowledge of the data into the algorithm, and the parameters of the resulting model are difficult to interpret.


2.1.3.6.3 Extensions

In an attempt to incorporate prior knowledge into the problem, Schölkopf et al. (1996) presented the virtual support vector method. This technique incorporates known invariances of the problem into the training process by first training a model, creating new data by transforming the support vectors and finally training the system on this data.

In order to speed up the test phase of SVM algorithms, a reduced set method was proposed (Burges, 1996), which consists of approximating the decision function based on a reduced set of computed vectors (which are not support vectors and therefore not in the training set).

2.1.3.6.4 Applications

Support vector machines have been used successfully in many fields. Thorsten Joachims (2002) gives a detailed account of the use of SVM for text categorization, where SVM outperform other popular classification techniques. SVM are widely used in bioinformatics applications such as tissue classification (Furey, et al., 2000), gene function prediction (Pavlidis, et al., 2001) and protein fold recognition (Ding & Dubchak, 2001). SVM also have an important presence in image processing related problems like handwritten digit identification (Bahlmann, et al., 2002) or image-based gender identification (Moghaddam & Yang, 2002).

2.1.4 Regression

Regression analysis is a statistical method whose goal is to determine relationships between variables, more concretely between a dependent variable (label attribute) and a set of independent variables (regular attributes). Regression attempts to describe how the dependent variable is affected when the independent variables change their values. This is done by trying to fit a function to the observed data. While classification methods aim to predict categorical values (labels), regression is used to predict a continuous value. For example, we may want to predict the future sales of a product; regression will determine how factors such as the price of the product influence its sales.

Many techniques for regression analysis exist. Usually the regression function is defined in terms of a set of unknown parameters that are determined from the data, but there are other approaches like non-parametric regression, in which the regression function is estimated directly from the data rather than by estimating parameters. These models cannot be evaluated in the same way as classification models: since the output is a continuous value, methods to determine the error between the desired and the actual output are necessary.

Regression analysis is used in a wide variety of applications. This method has been successful in areas such as economics, psychology, sociology and medicine.

In the following sections we will give a short description of linear and nonlinear regression models.

2.1.4.1 Linear regression


Standard linear regression models make a few assumptions about the data: the relationship between the variables is linear, and the errors of the different response variables are independent of each other, have the same variance and are normally distributed.

The most common linear regression method to find the regression line is the least-squares method. Least squares calculates the regression line by minimizing the sum of the squared vertical deviations between the observed and the predicted responses, i.e. the distances of the observed data points to the predicted regression line.
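As a small worked example (with invented data points), the least-squares line can be obtained directly with NumPy by solving the corresponding linear system:

```python
import numpy as np

# Fit y = a*x + b by least squares: minimize the sum of squared
# vertical deviations between observed and predicted responses.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

A = np.column_stack([x, np.ones_like(x)])  # design matrix [x, 1]
(a, b), _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(f"regression line: y = {a:.2f} * x + {b:.2f}")
print("prediction at x = 6:", a * 6 + b)
```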

2.1.4.2 Nonlinear regression

This form of regression analysis aims to fit the observed data with a nonlinear function. The process of fitting the function to the data is similar to linear regression: the parameters of the fitting function are iteratively adjusted so that the function comes closer to the observed points (using sum of squares, for example) until the error is minimized. Many nonlinear functions exist, such as exponential, logarithmic and trigonometric functions.

2.1.5 Cross-validation

Training and testing a machine learning model on the same data is an obvious mistake that can give a false sense of achievement about the efficiency of the model. The simplest solution, if we do not already have separate training and testing data, is to split the data into two sets, train the model with one and test it with the other; but which observations end up in the training set and which in the test set may alter the evaluation of the model if the data splits are not representative enough or if too much data is used to train (overfitting). Model evaluation methods like cross-validation are necessary to evaluate how statistical models like machine learning methods will generalize to independent datasets.

The main idea of cross-validation is to iteratively split the data into different subsets and take the mean of the evaluations. Different approaches to cross-validation exist, and we will shortly summarize the most commonly used ones.

The holdout method is the simplest approach to cross-validation and consists of splitting the data into two non-overlapping sets, training and testing. As mentioned before, this approach can present a high variance in its results depending on how the data is split.

In the k-fold cross-validation method, the data is first partitioned into k parts (or folds) of the same size. Then k iterations of training and testing are performed, so that in each iteration one of the k subsets is used for validation and the remaining ones for training. Once all validations are completed, the average result of the k iterations is returned. A common value for k is 10. The subsets are selected so that all data observations are used for both learning and testing, but each instance is used for validation only once. The advantage of this method is that as k is increased, the variance of the evaluations is reduced. The major disadvantage is that the algorithm has to perform the training process k times, which for slow machine learning approaches like neural networks can result in very high computation times. Figure 2.9 shows one iteration of a 5-fold cross-validation method.
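A sketch of the splitting logic (the model calls are left as hypothetical placeholders, since any learner could be plugged in) might look as follows:

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    every observation is used for testing exactly once."""
    indices = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

scores = []
for train_idx, test_idx in k_fold_indices(20, k=5):
    # Hypothetical usage: model.fit(X[train_idx], y[train_idx])
    #                     scores.append(model.score(X[test_idx], y[test_idx]))
    scores.append(len(test_idx))  # placeholder "score" so the sketch runs
print("fold sizes:", scores, "average:", np.mean(scores))
```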

The leave-one-out method is a special case of k-fold cross-validation where k is equal to the number of observations in the dataset. This way, in each iteration all the data but one instance is used for training.

2.1.6 Binary to multiclass extensions

Many classification algorithms are originally designed for binary problems (SVM, for example); however, a few strategies exist to extend binary classifiers to multiclass classification problems. The general approach is to build the multiclass model as a combination of binary models. We will shortly describe the two most commonly used methods for multiclass extension.

The one-against-all method builds n binary models (one for each class) in such a way that, for each label i, all observations belonging to i are considered positive examples and the instances of the other classes are considered negative examples. This method builds one classifier per class, and a given unseen example is classified into the class of the model with the highest output value. The output value is algorithm dependent and must be calibrated into comparable scores. Figure 2.10 shows a diagram of this method.

Figure 2.9: One iteration of 5-fold cross-validation. The data is split into five folds (1st to 5th); one fold is used as test data and the remaining folds as training data.

The all-against-all (or one-versus-one) method is also based on a simple idea: one binary classifier is built for each pair of classes, which gives n(n-1)/2 classifiers. A new observation is classified by a voting mechanism, i.e. it is assigned the class that most classifiers have selected.
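The sketch below makes both strategies explicit using scikit-learn's multiclass wrappers around a binary (linear) SVM; the Wine dataset and the choice of base classifier are assumptions for the example.

from sklearn.datasets import load_wine
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# One-against-all: one binary SVM per class; the highest decision value wins.
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# All-against-all: one binary SVM per pair of classes; the majority vote wins.
aaa = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print("one-against-all predictions:", ova.predict(X[:3]))
print("all-against-all predictions:", aaa.predict(X[:3]))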

2.2 Statistical analysis

In order to evaluate our hypothesis, we need to extract some properties from all the datasets we have used to build our predictive model. We first have to consider some obvious characteristics that define the structure of the data: the number of samples and attributes in a dataset, the data types, and whether we are dealing with a binary or a multiclass problem are the most obvious properties we can extract. Although they are very important, they do not provide nearly enough information to build our model; therefore, other methods have to be taken into consideration.

We will describe the properties selected for our model later on (see 4.7), but we first summarize some statistical measures that can be applied to extract different kinds of information from the data.

2.2.1 Correlation and dependence analysis

The attributes (columns) in a dataset can be described as a set of random variables or features describing the data, where each row corresponds to a different observation. Measuring the statistical relationship (dependence) between these random variables may give us a better understanding of how the features of the data are related and what impact they have on the class feature. We will shortly introduce a few methods related to this concept that can give us important information about the data.

2.2.1.2 Correlation

Correlation is a measure that indicates the degree of dependence between two random variables. The most common measure of dependence is the Pearson correlation coefficient, which measures the strength of the linear correlation (linear dependence) between two variables. The correlation coefficient is a number between -1 and +1.


A positive correlation between two random variables X and Y implies that the values of the two variables increase together (a positive linear relationship), while a negative value implies that when one variable increases, the other decreases (a negative linear relationship). A value close or equal to zero implies no linear relationship.

The Pearson coefficient is calculated by dividing the covariance (Weisstein, 2013) of X and Y by the product of their standard deviations (Weisstein, 2013); it is therefore a normalized version of the covariance:

\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}

When we have more than two random variables (attributes), the correlation of n variables $X_1, \dots, X_n$ is an $n \times n$ matrix where position $(i, j)$ holds the correlation coefficient between $X_i$ and $X_j$.
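As a small illustration, the sketch below computes the Pearson coefficient as covariance over the product of standard deviations, as well as the full correlation matrix, with NumPy; the generated data is an assumption for the example.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
Y = 2.0 * X + rng.normal(scale=0.5, size=100)   # Y depends linearly on X

# Pearson coefficient: covariance divided by the product of the standard deviations.
rho = np.cov(X, Y)[0, 1] / (X.std(ddof=1) * Y.std(ddof=1))
print("rho(X, Y) =", rho)

# For a dataset with n attributes (columns), np.corrcoef returns the n x n matrix.
data = rng.normal(size=(100, 4))
print(np.corrcoef(data, rowvar=False))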

2.2.1.3 Principal component analysis

Principal component analysis (PCA) is a mathematical procedure for dimensionality reduction that tries to find the sources of variability in the data and sort them by their relevance.

When working with high-dimensional data we can expect some of the attributes to be redundant, meaning that they are correlated in some way. By removing this redundancy it should be possible to reduce the set of attributes to a smaller set of artificial attributes, uncorrelated with each other, that account for most of the variance of the data. These artificial attributes or variables are called principal components. Formally, PCA finds an orthogonal transformation (Rowland, 2013) of the original data into a new coordinate space that maximizes the variance. The largest variance in the data is captured along the first axis of the new space, the second largest along the second axis and so on, with the constraint that each component is orthogonal to (uncorrelated with) the previous components.

PCA is usually performed by computing the eigendecomposition (Weisstein, 2013) of the covariance (or correlation) matrix. The principal components are then the eigenvectors, and the eigenvalues tell us how much variance is accounted for by each component, allowing us to reduce the dimensionality of the data by keeping only the most relevant components (i.e. those with the most variance).
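The following sketch performs PCA through the eigendecomposition of the covariance matrix with NumPy; the synthetic data, the injected redundancy and the choice of keeping two components are assumptions for the example.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]        # introduce redundancy between two attributes

Xc = X - X.mean(axis=0)                        # center the data
cov = np.cov(Xc, rowvar=False)                 # covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)         # eigendecomposition of the symmetric matrix

order = np.argsort(eigvals)[::-1]              # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print("variance explained per component:", eigvals / eigvals.sum())

Z = Xc @ eigvecs[:, :2]                        # project onto the two most relevant components
print("reduced data shape:", Z.shape)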

2.2.2 Probability distribution

A probability distribution of a random variable is a function that describes the probability that the variable takes on a given value. Probability distributions are widely used in statistical studies and to draw general conclusions about the data. Different types of distribution functions exist depending on the type of variable. Perhaps the most common is the probability density function (pdf), which is used for continuous random variables (for discrete random variables, the probability mass function is used instead). We can see in Figure 2.11 an example pdf of the normal (Gaussian) distribution. Probability distribution functions have a series of associated measures that help describe how the data is distributed; the most common ones are the arithmetic mean, the variance and the standard deviation (Weisstein, 2013).
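As a small example, the sketch below evaluates the pdf of a normal distribution with SciPy together with its mean, variance and standard deviation; the parameter values are assumptions chosen for illustration.

import numpy as np
from scipy.stats import norm

dist = norm(loc=5.0, scale=2.0)        # Gaussian with mean 5 and standard deviation 2
x = np.linspace(-1.0, 11.0, 7)
print("pdf values:", dist.pdf(x))      # density evaluated on a small grid
print("mean:", dist.mean(), "variance:", dist.var(), "std:", dist.std())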


2.2.2.1 Kurtosis and skewness

Kurtosis and skewness are measures used to describe the shape of a probability distribution.

Kurtosis (Weisstein, 2013) measures the peakedness of a probability distribution. Kurtosis is commonly defined so that the kurtosis of a normal distribution equals zero. With this definition, kurtosis quantifies how closely the distribution matches the normal distribution: a normal distribution has a kurtosis of zero, a flatter distribution has negative kurtosis and a sharper (more peaked) distribution has positive kurtosis.

Skewness (Weisstein, 2013) measures the lack of symmetry of a distribution. Skewness describes how a probability distribution leans to one side or the other of its mean. A distribution that is symmetric around the mean has a skewness of zero. A distribution whose values are concentrated on the left of the mean, with a longer tail to the right, has positive skewness, while a distribution whose values are concentrated on the right, with a longer tail to the left, has negative skewness.
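The sketch below computes excess kurtosis and skewness for a normal and a right-skewed sample with SciPy; the generated samples are assumptions for the example.

import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=10_000)
right_skewed = rng.exponential(size=10_000)    # mass on the left, long tail to the right

# fisher=True (the default) subtracts 3, so a normal distribution scores close to 0.
print("normal:      kurtosis %.3f, skewness %.3f" % (kurtosis(normal_sample), skew(normal_sample)))
print("exponential: kurtosis %.3f, skewness %.3f" % (kurtosis(right_skewed), skew(right_skewed)))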

2.2.2.2 Statistical dispersion

Statistical dispersion is a measure of the degree of variation in a probability distribution, i.e. how spread out the data is. Dispersion measures are used, for example, to determine how reliable the average value of the data is, to facilitate comparison and as parameters to other statistical measures. A dispersion measure is a value that is zero if all data items have the same value and increases as the data becomes more spread out. Common dispersion measures are the standard deviation (Weisstein, 2013), the median absolute deviation (Weisstein, 2013), the interquartile range (Weisstein, 2013) and the variance.
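The sketch below computes these dispersion measures for one sample using NumPy and SciPy (median_abs_deviation requires a reasonably recent SciPy); the sample itself is an assumption for the example.

import numpy as np
from scipy.stats import iqr, median_abs_deviation

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=3.0, size=1_000)

print("variance:                 ", np.var(sample, ddof=1))
print("standard deviation:       ", np.std(sample, ddof=1))
print("median absolute deviation:", median_abs_deviation(sample))
print("interquartile range:      ", iqr(sample))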

2.2.2.3 Kernel density estimation

Kernel density estimation is a non-parametric approach to estimate probability density functions directly from the data, thereby avoiding assumptions about the form of the function. A kernel density estimator places a kernel function centered at each data point; if the kernel is smooth, the resulting estimate is smooth as well. The density estimate is calculated as a combination (sum) of the kernels over all data points. A kernel density estimator is defined by the type of kernel we choose and by the bandwidth (or smoothing parameter) of the kernel. The bandwidth determines the width of the kernel and therefore how smooth the density estimate will be: a large bandwidth over-smooths the estimate and hides the shape of the data, while a small bandwidth results in an under-smoothed and hard to interpret estimate. Some common kernels are the triangular, the cosine, the Gaussian and the Epanechnikov.
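A minimal sketch of kernel density estimation with a Gaussian kernel using SciPy's gaussian_kde follows; the bimodal sample and the bandwidth factor are assumptions for the example.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

# gaussian_kde places a Gaussian kernel on every data point; bw_method scales the
# bandwidth (a larger factor gives a smoother, possibly over-smoothed, estimate).
kde_default = gaussian_kde(sample)
kde_smooth = gaussian_kde(sample, bw_method=1.0)

grid = np.linspace(-5.0, 7.0, 200)
density = kde_default(grid)                      # density estimate evaluated on a grid
print("peak of the estimate on the grid:   ", density.max())
print("density at x=3 (default bandwidth): ", kde_default(3.0))
print("density at x=3 (large bandwidth):   ", kde_smooth(3.0))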

2.2.3 Analysis of variance

Analysis of variance (ANOVA) is a family of statistical models used to test the hypothesis that the means of two or more groups of data are equal, i.e. it analyzes the variation between the means of these groups. ANOVA is therefore a generalization of the t-test (Weisstein, 2013) to two or more groups. The model assumes that the data is normally distributed, that the groups have equal variances and that the errors are independent.

The hypothesis is tested by comparing mean square values through the F-test (Weisstein, 2013), which reflects the degree of resemblance among the means being compared. We distinguish between one-way and two-way ANOVA depending on whether one or two characteristics (factors) define the groups in the test.
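The sketch below runs a one-way ANOVA on three synthetic groups with SciPy's f_oneway; the group means and sizes are assumptions for the example.

import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.2, scale=1.0, size=30)
group_c = rng.normal(loc=7.0, scale=1.0, size=30)

# The F statistic compares the variation between group means with the variation
# within groups; a small p-value suggests the group means are not all equal.
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print("F = %.3f, p = %.4f" % (f_stat, p_value))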
