Polypharmacy Side Effect Prediction with Graph Convolutional Neural Network based on Heterogeneous Structural and Biological Data


JUAN SEBASTIAN DIAZ BOADA

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ENGINEERING SCIENCES


Degree Projects in Scientific Computing (30 ECTS credits)

Master’s Programme in Computer Simulations for Science and Engineering KTH Royal Institute of Technology year 2020

Supervisor at KI Algorithmic Dynamics Lab, Center for Molecular Medicine:

Narsis A. Kiani

Supervisor at KTH: Michael Hanke

Examiner at KTH: Michael Hanke


TRITA-SCI-GRU 2020:390 MAT-E 2020:097

Royal Institute of Technology School of Engineering Sciences KTH SCI

SE-100 44 Stockholm, Sweden

URL: www.kth.se/sci


Acknowledgements

This thesis and its experiments were performed in the Algorithmic Dynamics Lab of the Center for Molecular Medicine. Special thanks to Amir Amanzadi for creating the affinity score dataset, Jesper Tegnér for his comments analyzing results and Linus Johnson for his help with the Swedish translation.

Abstract

The prediction of polypharmacy side effects is crucial to reduce the mortality and morbidity of patients suffering from complex diseases. However, its experimental prediction is unfeasible due to the many possible drug combinations, leaving in silico tools as the most promising way of addressing this problem. This thesis improves the performance and robustness of a state-of-the-art graph convolutional network designed to predict polypharmacy side effects, by feeding it with complexity properties of the drug-protein network. The modifications also involve the creation of a direct pipeline to reproduce the results and test it with different datasets.

Summary (Sammanfattning)

Prediction of polypharmacy side effects with graph convolutional neural networks based on heterogeneous structural and biological data

To reduce mortality and morbidity in patients suffering from complex diseases, it is crucial to be able to predict polypharmacy side effects. Predicting the side effects experimentally is, however, unfeasible because of the large number of possible drug combinations, which leaves in silico tools as the most promising way of solving this problem. This work improves the performance and robustness of one of the latest graph convolutional networks designed to predict polypharmacy side effects, by feeding it with the complexity properties of the drug-protein network. The changes also involve the creation of a direct pipeline to reproduce the results and test it with different datasets.

Contents

Acknowledgements

1 Introduction
  1.1 Statement of the problem
  1.2 Thesis Objective
  1.3 Outline of Thesis

2 Theoretical Framework
  2.1 Supervised Learning
    2.1.1 Linear Models
    2.1.2 Tree-based Methods
    2.1.3 Support Vector Machines
    2.1.4 Bayesian Methods
  2.2 Deep Learning
    2.2.1 Feedforward Neural Networks
    2.2.2 Training Feed-forward Neural Networks
    2.2.3 Convolutional Neural Networks
    2.2.4 Graph Convolutional Networks
  2.3 Algorithmic Complexity

3 Related Work and State of the Art
  3.1 Traditional synergy calculations
  3.2 General Methods
  3.3 Trainable Methods
    3.3.1 Linear Regression Methods
    3.3.2 Tree-based methods
    3.3.3 Other Machine Learning Approaches
  3.4 Deep Learning Methods
    3.4.1 Standard Deep Learning Methods
    3.4.2 Decagon
    3.4.3 Decagon-based methods

4 Materials and Methods
  4.1 Datasets
  4.2 Original Implementation of Decagon
    4.2.1 Data structure organisation
  4.3 Contributions and improvements to Decagon
    4.3.1 Data Treatment and Preparation
    4.3.2 Implementation of Algorithmic Complexity Features
    4.3.3 Containers and GPU Configuration
    4.3.4 Minibatch sampling and the data leakage problem
    4.3.5 Incorporation of edge features
    4.3.6 Other improvements
    4.3.7 Overall Pipeline

5 Results and Discussion
  5.1 First experiments: Feature selection
  5.2 Node features as possible method stabilisers
  5.3 Experiments with side effects with the lowest performance
  5.4 Extension to experiments with a full graph

6 Conclusions

Bibliography

A Additional figures

Acronyms

ADR  adverse drug reaction
AI  Artificial intelligence
ANN  artificial neural network
AP  algorithmic probability
AP@K  average precision at k
API  application programming interface
ATC  Anatomical Therapeutic Chemical
AUPRC  area under the precision-recall curve
AUROC  area under the receiver operating characteristic curve
BDM  Block decomposition method, or KC features calculated with the block decomposition method
CNN  convolutional neural network
CPU  central processing unit
CSR  compressed sparse row
CTM  coding theorem method
DDI  drug-drug interaction
DL  Deep learning
DSE  Single drug side effects
DTI  drug-target interaction
EMI  Edge Minibatch Iterator
FN  false negatives
FP  false positives
GCN  graph convolutional network
GPU  graphics processing unit
KC  Kolmogorov complexity
MedDRA  Medical Dictionary for Regulatory Activities
ML  Machine learning
MSE  mean squared error
PPI  protein-protein interaction
ReLU  rectified linear unit
RF  Random forests
SGD  stochastic gradient descent
SVMs  Support vector machines
TN  true negatives
TP  true positives
UTM  universal Turing machine
w2  Simulations including DSE and BDM features

1 Introduction

1.1 Statement of the problem

In many complex diseases, single-drug therapies fall short in helping patients recover. This lower performance occurs because complex diseases, such as cancer or AIDS, involve processes controlled by multiple biochemical mechanisms, which give redundancy to their functioning [1–3]. Usually, only a few of the targets that a drug may have are known, and these give insight into which diseases the drug can treat. Single-drug therapies target only a limited number of pathways in the pathogenesis of a disease, which sometimes leads to an incomplete treatment and, therefore, perpetuates the disease.

As a result, new procedures have shifted towards multi-drug therapies, which have proven to boost the efficacy of cancer, AIDS and fungal infection treatments over single-drug therapies [4–7]. The potentiated polypharmacy effect comes from drug synergy, which occurs when multiple drugs tackle the same disease by simultaneously targeting different pathophysiological pathways [1, 5]. Due to this effect, the single-drug doses can be reduced [8, 9], which contributes to reducing the individual toxicity of the drugs [2, 4–6, 10] and can even reduce the drug resistance of the disease [6, 8, 11, 12].

Nevertheless, polypharmacy is associated with a much higher risk of adverse drug reactions (ADRs) due to drug-drug interactions [1, 4, 13]. Single drugs may modulate the activity of various untracked proteins in what is known as off-target interactions, which are challenging to trace [1, 14]. Multiple interactions of this kind could give rise to unexpected polypharmacy ADRs. These interactions usually go unnoticed in clinical trials, due to the limited time spent in testing drug combinations [15], and are most of the time discovered once the drug is already on the market [14]. As the mechanistic understanding of drug-drug interactions is low, it is difficult to predict these side effects [2]. Furthermore, polypharmacy therapies are becoming more common [16], making this a growing problem and the cause of a significant fraction of hospitalisations due to unexpected ADRs [17, 18].

Until recently, the prediction of polypharmacy ADRs was mainly based on clinical experience [2, 4] and medical expertise [8]. Some classical quantitative methods to predict the effect of drug combinations, such as the Loewe model (see section 3.1), were also used but failed to fully explain non-linear interactions such as synergy [4, 5, 19]. Clinical experiments can give a solution for a few combinations, but they are time-consuming and expensive [2, 5]. In vitro approaches like high-throughput screening can lead to cheaper procedures, but the vast combinatorial space of drugs makes it unfeasible to test all drug combinations [2, 4].

Therefore, progress in the preclinical stage is necessary to make the procedure more sustainable and efficient. In silico approaches come in handy to solve these problems: they are computational methods that simulate the effect of drug combinations rapidly and with low resource investment. While many studies address the problem of synergy or toxicity prediction, very few try to predict specific side effects arising from drug combinations. A method of this kind could reduce mortality and morbidity among polypharmacy treatment users and save considerable expenses in healthcare. As the nature of the problem involves interactions of agents in densely connected networks, a solution using a network approach could exploit this kind of structure and extract information that could be overlooked by other methods.

1.2 Thesis Objective

Decagon [1] is an algorithm that has been proposed in the literature to tackle the multiple-drug side effect prediction problem. This method formulates polypharmacy side effects as a graph learning problem and solves it using a graph convolutional network (GCN). However, its current implementation misses some key components in its pipeline, performs unnecessary calculations, and has too many parameters, which affects its efficiency and leads to overfitting. Moreover, its form of data manipulation is not robust and may lead to overestimated results. This work aims to improve its performance and generalizability.

More precisely, a current state-of-the-art deep learning method that aims to predict the side effects of a drug combination is modified and tuned to improve its performance. As this project is firmly based on the premise that the underlying mechanisms of undesired side effects come from the structure of the human interactome, an existing graph convolutional architecture (see section 2.2.4) is chosen as a suitable solution for the problem, among many other machine learning and deep learning techniques.

Additionally, the input data used to train the neural network is enriched with a wide variety of features extracted from the graph properties of the involved networks, namely the protein-protein interaction (PPI) network, the drug-target interaction (DTI) network and the drug-drug interaction (DDI) network or known polypharmacy side effects. These features include algorithmic complexity features of the involved networks, obtained by applying algorithmic numerical methods over the graphs before training the neural model. In this fashion, there is a redundancy in the learning data, as the GCN learns the main components of the network from the data while being fed with meta-features obtained independently from the network. Furthermore, tests with secondary structure protein features and drug-target affinity scores are performed to explore possible feature extraction domains.

1.3 Outline of Thesis

Chapter 2 gives the context of machine learning and explains the mathematical theory behind it. The chapter briefly covers the most common machine learning methods, going into more depth on artificial neural networks and the emergence of graph convolutional networks. It also explains the theory and motivation behind algorithmic complexity and its numerical implementation. Chapter 3 reviews some of the computational methods that have been used to address the problem of predicting the effects of drug combinations, covering trainable and non-trainable methods, but focusing on the deep learning approaches. The chapter closes with the introduction of the method used, Decagon. Chapter 4 gives details about the datasets and the method used. It includes an explanation of the original implementation and the modifications made to it. Chapter 5 presents the results of the experiments through all the development stages of the project. Finally, Chapter 6 presents the thesis's conclusions, including the contributions, the findings, the limitations and the future work.

2 Theoretical Framework

This chapter gives a broad summary of the theory behind some of the cutting edge models for solving the polypharmacy ADR prediction problem. It will start with a background of supervised learning and some of its fundamental methods. Afterwards comes an explanation of deep learning and its relevant variations for this study. Finally, the chapter closes with a brief description of algorithmic complexity.

2.1 Supervised Learning

Machine learning (ML) is transforming almost every sector, from medical diagnosis and stock market prediction to virtual personal assistants and social networks [20]. This branch of artificial intelligence (AI) has grown in popularity due to the increase in data availability and the rapid growth of computing power. The reason is that ML techniques analyze vast amounts of data and interpret them to discover patterns and make decisions that humans cannot, mainly due to the large size or complexity of the datasets. Any process that requires an unbiased analysis of numerous quantified factors to generate an outcome is suitable to be solved by ML.

Machine Learning categories

Machine learning is the name given to the branch of computational methods that learn from data without being explicitly programmed¹ [21]. More specifically, it is the family of computational methods with a data-driven approach, which extract patterns from data without knowing a precise mathematical model in advance. ML methods can be classified into two big groups, depending on the learning goal and the type of data: supervised and unsupervised learning. More recent developments have extended these categories, adding semi-supervised and reinforcement learning. Supervised learning uses labelled data to perform tasks like regression or classification, while unsupervised learning uses unlabelled data to perform clustering or dimensionality reduction. Semi-supervised learning deals with problems where labelled data is scarce and can be considered a middle ground between supervised and unsupervised learning. In contrast, reinforcement learning methods teach themselves based on their actions under specific circumstances, considering the penalties and rewards those actions may yield. As the present study focuses exclusively on supervised learning methods, this section gives an overview of the current approaches within this field.

¹ This definition was given in 1959 by Arthur Samuel, one of the pioneers of Artificial Intelligence.

Regression vs Classification

In practice, supervised learning methods are characterized by their predictive ability. This means that, generally², their primary function is to predict the label of an unseen instance. The labels of the data can be of different types, such as numerical or categorical. Generally speaking, the type of label defines the nature of the supervised problem: if the output is numerical, the problem is a regression problem; otherwise, it is a classification problem.

The challenge of overfitting

ML methods use statistical tools to find patterns or infer properties from the data. One of the greatest challenges of ML is to infer, from the limited samples available, inherent properties that represent the population correctly. When a model learns particular properties from the training samples that do not represent the general behaviour of the population, the model is said to be overfitted. In these cases, the performance measured on the training data is high, but the model performs poorly on new instances.

² Although prediction is the most practical application of supervised learning, it is sometimes necessary to know the mechanisms of learning, or what is known as interpretability or inference. Inference gives some insight into the nature of the problem and the path to the solution.

There are many approaches to solve the overfitting problem. The most general one involves increasing the number of data points for the learning, in an attempt to make the training dataset as diverse as possible. This enlarging can be done either by mining more data points or by augmenting the current dataset by artificial means. Other methods to reduce overfitting are more specific to the type of model or the optimisation method. The most suitable strategy to reduce overfitting will then depend on the ML algorithm used.
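The overfitting behaviour described in this section can be illustrated with a small numerical experiment. The sketch below is not part of the thesis experiments: the sine-shaped target, the noise level, the polynomial degrees and all function names are assumptions chosen only for illustration. It fits a low-degree and a high-degree polynomial to the same small training set and compares their errors on training and unseen data.

```python
import numpy as np
from numpy.polynomial import Polynomial


def make_data(rng, n, noise=0.2):
    """Draw n noisy samples of a hypothetical target function sin(pi * x)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=noise, size=n)
    return x, y


def mse(model, x, y):
    """Mean squared error of a fitted model on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


rng = np.random.default_rng(0)
x_train, y_train = make_data(rng, 12)   # small training set
x_test, y_test = make_data(rng, 200)    # unseen instances

# Least-squares polynomial fits of low and high flexibility.
simple = Polynomial.fit(x_train, y_train, deg=3)
flexible = Polynomial.fit(x_train, y_train, deg=10)

train_simple, train_flexible = mse(simple, x_train, y_train), mse(flexible, x_train, y_train)
test_simple, test_flexible = mse(simple, x_test, y_test), mse(flexible, x_test, y_test)

print(f"degree 3 : train MSE {train_simple:.4f}, test MSE {test_simple:.4f}")
print(f"degree 10: train MSE {train_flexible:.4f}, test MSE {test_flexible:.4f}")
```

The more flexible model can never have a larger training error than the simpler one, because the simpler model is a special case of it. Its error on the unseen points is typically much larger, although this is not guaranteed for every random draw.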

2.1.1 Linear Models

Linear regression is a simple yet very applicable method. The method finds an optimal coefficient $\beta_i$ for each feature or component $x_i$ of the data, plus an independent term or bias $\beta_0$. With these coefficients, it generates a linear combination that predicts the labels of the data points as follows:

$$ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p = \sum_{i=1}^{p} \beta_i x_i + \beta_0 = \boldsymbol{\beta} \cdot \mathbf{x} + \beta_0, \tag{2.1} $$

where $p$ is the number of input dimensions.
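As a minimal sketch of equation (2.1), the following snippet fits the coefficients by ordinary least squares and evaluates the linear combination on a new point. The text does not state how the optimal coefficients are found, so the use of least squares is an assumption, and the toy data and function names are hypothetical.

```python
import numpy as np


def predict_linear(beta0, beta, x):
    """Evaluate equation (2.1): y = beta . x + beta0."""
    return float(np.dot(beta, x) + beta0)


# Hypothetical toy data generated exactly by y = 1 + 2*x1 - 3*x2.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Ordinary least squares (assumed fitting criterion): a column of ones
# is prepended so that the bias beta0 is estimated with the other coefficients.
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
beta0_hat, beta_hat = coeffs[0], coeffs[1:]

# Prediction for an unseen point.
y_new = predict_linear(beta0_hat, beta_hat, np.array([3.0, 2.0]))
print(beta0_hat, beta_hat, y_new)
```

Because the toy labels are noise-free, the fit recovers the generating coefficients up to floating-point error.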

Equally important, logistic regression is a linear method of binary classification based on linear regression. This method uses the vector product calculated by linear regression as the argument of a sigmoid function (equation (2.2)), which gives an output between 0 and 1. This output is then rounded to the nearest integer to make the classification:

$$ \sigma(\boldsymbol{\beta} \cdot \mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\beta} \cdot \mathbf{x}}}. \tag{2.2} $$
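The classification rule of logistic regression can be sketched as follows, using the standard logistic function $1 / (1 + e^{-z})$ as the sigmoid. The coefficient values are hypothetical, and the choice to assign an output of exactly 0.5 to class 1 is an assumption, since the text does not specify the tie-breaking rule.

```python
import numpy as np


def sigmoid(z):
    """Standard logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_logistic(beta, x):
    """Return the sigmoid output and the class obtained by rounding it.

    Ties at exactly 0.5 are assigned to class 1 (an assumption).
    """
    p = float(sigmoid(np.dot(beta, x)))
    return p, int(p >= 0.5)


beta = np.array([2.0, -1.0])                                 # hypothetical coefficients
p_pos, label_pos = predict_logistic(beta, np.array([1.0, 0.5]))   # beta . x = 1.5
p_neg, label_neg = predict_logistic(beta, np.array([-1.0, 1.0]))  # beta . x = -3.0
print(p_pos, label_pos, p_neg, label_neg)
```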

To deal with overfitting, linear models often use regularisation techniques. These techniques impose limits on different norms of the $\boldsymbol{\beta}$ vector. As examples, the Lasso and the Ridge regularisations impose constraints on the $L_1$ and $L_2$ norms, respectively, while elastic net regularisation uses a linear combination of both $L_1$ and $L_2$ norm restrictions.
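For reference, these regularisations are often written in an equivalent penalised form, in which the norm constraint is replaced by a penalty term added to the loss. The sketch below assumes a squared-error loss over $N$ training points $(\mathbf{x}_n, y_n)$ and non-negative penalty weights $\lambda_1$ and $\lambda_2$; these symbols are introduced here only for illustration.

```latex
\hat{\beta}_0, \hat{\boldsymbol{\beta}}
  = \arg\min_{\beta_0, \boldsymbol{\beta}}
    \sum_{n=1}^{N} \left( y_n - \beta_0 - \boldsymbol{\beta} \cdot \mathbf{x}_n \right)^2
    + \lambda_1 \|\boldsymbol{\beta}\|_1
    + \lambda_2 \|\boldsymbol{\beta}\|_2^2
% Lasso:       \lambda_1 > 0, \lambda_2 = 0
% Ridge:       \lambda_1 = 0, \lambda_2 > 0
% Elastic net: \lambda_1 > 0, \lambda_2 > 0
```

Larger penalty weights shrink the coefficients more strongly, trading some training accuracy for a simpler model.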

The relevance of linear models lies in their simplicity and interpretability. When analyzing the result of a linear model, the vector $\boldsymbol{\beta}$ can directly quantify the relevance of each feature of the data. Nevertheless, most practical problems have non-linear behaviour, so the application of linear models can be limited.

2.1.2 Tree-based Methods

State-of-the-art tree-based methods are ensemble learning methods³ consisting of multiple decision trees, mainly random forests and boosted classifiers. A decision tree is a structure that successively splits the input data into groups based on some splitting criteria. At each level, a node evaluates a feature of the input group and classifies the instances accordingly. The goal is to divide the data repeatedly into groups until only pure nodes remain (nodes whose instances belong to a single category), called leaves. This data division can be interpreted as the partition of the feature space into a set of rectangle-like regions, each with a single category [22]. A trained decision tree consists of the optimal splitting criteria on the optimal features. The standard way to find the optimal splitting conditions is to use Gini impurity or entropy to measure the "purity" of each subset. Although the leaves are said to hold a category, decision trees can be used for both regression and classification problems. In the case of regression, the value on the leaves is a real number. It is important to note that the function modelled by a regression decision tree is piecewise constant, and its resolution depends on the number of leaves.

To overcome overfitting, decision trees are pruned: their branches are trimmed before the leaves hold a pure category. In this way, pruning may decrease the training accuracy but improve generalisation. However, the best way to overcome overfitting is to use ensemble methods of several trees.
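The purity measures used to choose the splits of a decision tree, Gini impurity and entropy, can be written in a few lines. The sketch below uses their standard definitions; the function names and the weighting of the two child subsets by their size are illustrative assumptions rather than a description of any particular library.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a non-empty list of labels: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of a non-empty list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_impurity(left, right, measure=gini_impurity):
    """Size-weighted impurity of the two subsets produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


mixed = ["a", "a", "b", "b"]
print(gini_impurity(mixed), entropy(mixed))        # maximally mixed binary node
print(split_impurity(["a", "a"], ["b", "b"]))      # a split into two pure leaves
```

A split is preferred when it lowers the weighted impurity the most; a value of zero means that both resulting subsets are pure.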

Random forests (RF) is a commonly used ensemble method. In short, RF grow several uncorrelated trees, each trained with a different subset of data points or data features [22]. The final output of the forest is computed by majority vote or average of the outputs of individual trees for classification and regression tasks, respectively.
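The aggregation step of a random forest can be sketched independently of how the individual trees are grown. In the snippet below the tree outputs are hypothetical values supplied by hand, and the function names are illustrative.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_outputs):
    """Majority vote over the class predicted by each tree."""
    return Counter(tree_outputs).most_common(1)[0][0]


def forest_regress(tree_outputs):
    """Average of the real-valued predictions of each tree."""
    return mean(tree_outputs)


# Hypothetical outputs of five trees for one input instance.
class_votes = ["toxic", "safe", "toxic", "toxic", "safe"]
value_votes = [1.0, 2.0, 3.0, 2.0, 2.0]

print(forest_classify(class_votes))
print(forest_regress(value_votes))
```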

While RF train each tree independently, boosted classifiers grow trees sequentially, each one correcting its predecessor. Gradient boosting methods build a tree adjusting its parameters based on an error function gradient, calculated over the previous tree's performance [22].

Tree-based methods have much more applicability than linear models without sacrificing their interpretability. Moreover, they can achieve state-of-the-art performance in many tasks. However, ensemble tree-based models are still very prone to overfitting and often lack robustness.

³ Ensemble learning methods consist of multiple individual models. The final output of these methods takes into account the outputs given by the different individual models.

2.1.3 Support Vector Machines

Support vector machines (SVMs) are binary maximal margin classifiers. In other words, they find an optimal decision boundary between two classes that maximises the distance between the boundary and the data points of the two classes. The decision boundary takes the form of a hyperplane, dividing the feature space linearly in two, corresponding to the desired categories. Formally, for $N$ data points, the task is to find the hyperplane described by the vector $[\beta_0, \boldsymbol{\beta}]$ such that

$$ \max_{\beta_0, \boldsymbol{\beta}, \|\boldsymbol{\beta}\| = 1} M \quad \text{subject to} \quad y_i (\mathbf{x}_i^T \boldsymbol{\beta} + \beta_0) \geq M, \quad i = 1, \ldots, N, \tag{2.3} $$

where $y_i \in \{1, -1\}$ is the label of the $i$-th data point, $\mathbf{x}_i$ is the $i$-th data point and $M$ is the margin [22].
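The quantity being maximised in equation (2.3) can be computed directly for any candidate hyperplane. The sketch below does not solve the optimisation problem; it only evaluates the margin of two hand-picked hyperplanes on a hypothetical toy dataset, normalising $\boldsymbol{\beta}$ so that the unit-norm condition holds.

```python
import numpy as np


def margin(beta0, beta, X, y):
    """Smallest signed distance y_i (x_i^T beta + beta0) / ||beta|| over all points.

    A positive value means that the hyperplane separates the two classes.
    """
    beta = np.asarray(beta, dtype=float)
    return float(np.min(y * (X @ beta + beta0)) / np.linalg.norm(beta))


# Hypothetical, linearly separable toy data with labels in {1, -1}.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

m_vertical = margin(0.0, [1.0, 0.0], X, y)     # hyperplane x1 = 0
m_horizontal = margin(0.0, [0.0, 1.0], X, y)   # hyperplane x2 = 0
print(m_vertical, m_horizontal)
```

Of the two candidates, the first leaves a margin of 2 while the second passes through two of the data points, so a maximal margin classifier would prefer the first.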

The real power of SVMs comes with the utilisation of kernels, which extend their scope beyond linearity. With kernels, data is transformed into a high-dimensional space where the linear classification takes place. This property makes SVMs capable of separating classes that are not linearly separable in the original feature space.

SVMs are based on the minimal distance between the data and the boundary, so only the data points closest to the boundary (the support vectors) determine the solution. This data selection makes the method computationally cheap and functional even with small datasets. Moreover, the use of the so-called kernel trick makes it possible to increase the dimensionality of the problem without running into complex calculations, as the only operations needed are inner products. The downside of this method is that the kernel has to be carefully chosen for the task, which sometimes reduces to trial-and-error selection. It may also perform poorly when the number of dimensions of the original data is larger than the number of data points.
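The kernel trick can be checked numerically on a small case. The sketch below uses a homogeneous polynomial kernel of degree 2 on two-dimensional inputs, chosen only as an illustration: evaluating the kernel in the original space gives the same value as an inner product in an explicit three-dimensional feature space, without ever constructing that space.

```python
import numpy as np


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x . z)^2, computed in the input space."""
    return float(np.dot(x, z) ** 2)


def poly2_features(x):
    """Explicit feature map phi such that phi(x) . phi(z) = (x . z)^2 for 2-D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])


x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly2_kernel(x, z)
k_explicit = float(np.dot(poly2_features(x), poly2_features(z)))
print(k_implicit, k_explicit)
```

For higher degrees or other kernels the explicit feature space grows quickly or becomes infinite-dimensional, while the kernel evaluation remains a single inner product.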

2.1.4 Bayesian Methods

In the previous methods (and in neural networks), the main idea was to update the parameters of the model iteratively, maximizing the probability that the parameters will explain the current data. As a new data point came, the parameters tried to adjust themselves closer to their "real" value. Bayesian methods change this paradigm. They use Bayes' Theorem shown in equation (2.4), where the left-hand side of the equation is the posterior distribution of the parameters given the data, and the right-hand side relates the likelihood of the data $p(x|\theta)$ and the prior of the parameters $p(\theta)$:

$$ p(\theta|x) \propto p(x|\theta)\, p(\theta). \tag{2.4} $$

The basic idea of Bayesian ML is that the parameters are treated as another random variable drawn from the prior distribution. This distribution holds previous knowledge about the parameters given by the problem. The goal then is to maximize the posterior distribution for a fixed dataset. This basic idea can be used to convert the previous methods into Bayesian methods. Examples include Bayesian linear and logistic regression and Bayesian networks.
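Equation (2.4) can be illustrated on a one-parameter problem. The sketch below is a hypothetical example, not taken from the thesis: the parameter $\theta$ is the success probability of a binary outcome, the data are 7 successes in 10 trials, and the posterior is evaluated on a grid for a flat prior and for a prior that favours values near 0.5.

```python
import numpy as np


def grid_posterior(theta, prior, k, n):
    """Normalised posterior on a grid: p(theta | x) ∝ p(x | theta) p(theta).

    The likelihood is that of k successes in n independent binary trials.
    """
    likelihood = theta**k * (1.0 - theta) ** (n - k)
    unnormalised = likelihood * prior
    return unnormalised / unnormalised.sum()


theta = np.linspace(0.0, 1.0, 101)   # candidate parameter values
k, n = 7, 10                         # hypothetical data

flat_prior = np.ones_like(theta)                   # no previous knowledge
informed_prior = theta**2 * (1.0 - theta) ** 2     # favours theta near 0.5

post_flat = grid_posterior(theta, flat_prior, k, n)
post_informed = grid_posterior(theta, informed_prior, k, n)

# Parameter values that maximise each posterior.
theta_map_flat = theta[np.argmax(post_flat)]
theta_map_informed = theta[np.argmax(post_informed)]
print(theta_map_flat, theta_map_informed)
```

With the flat prior the maximum of the posterior coincides with the observed frequency, while the informed prior pulls it towards 0.5, which shows how previous knowledge enters the estimate.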

Bayesian ML methods are more oriented towards inference. Although a model's prediction ability is the priority in practice, it is sometimes essential to understand the mechanisms rather than obtain a correct prediction with a black-box method. For these problems, Bayesian methods are very suitable. Their limitation is that most complex Bayesian ML methods require heavy numerical computations, and therefore vast computational resources.

2.2 Deep Learning

Deep learning (DL) is the name given to models based on multi-layered artificial neural networks. They have become one of the most used ML methods due to their high performance on diverse tasks and, above all, because they are self-learning models. In other words, the model finds the function that relates the input data and the output by itself, with no other input from the user than the complexity of the model.

Deep Learning: The Evolution of Machine Learning

In conventional ML models, the input gives direct insights to the model on how to perform on the assigned task. Thus, the choice and manipulation of the data are crucial in the performance of the method. As a consequence, a pre-training step known as feature engineering is, most of the time, a decisive phase of the algorithms. Feature engineering is the process of transforming or selecting relevant components of the data to feed an ML algorithm with the optimal inputs in order to achieve peak performance. This process requires enormous effort and sometimes expert knowledge on the application field to manipulate the data correctly so that the algorithm can extract the correct information from it [20].

DL models stand out from conventional ML techniques as they can receive raw data as input (without the need for feature engineering) and achieve great results. This considerable advantage, which is the core of DL, is achieved by sequentially transforming the inputs through layers, creating an internal representation of the data in each layer in what is known as representation learning [20]. By creating abstractions of higher or lower dimension, DL algorithms can decide autonomously which aspects of the data are representative to the problem and amplify them, while suppressing the least relevant components. Consequently, one can say that each layer performs an automatic feature engineering procedure. With the stacking of more layers, more definite or complex traits can be extracted from the data, facilitating the final task. Hence, the number of layers and learning units per layer can be chosen at discretion to adjust the complexity of the model. Furthermore, non-linearities added at the end of each layer give DL models the ability to model virtually any desired function with an arbitrary accuracy [23].

2.2.1 Feedforward Neural Networks

The perceptron: An artificial neural unit

Not surprisingly, artificial neural networks (ANNs) were inspired by biological neural networks. In an ANN, learning units are connected to mimic the behaviour of a biological neural network. Biological networks are composed of fundamental learning units called neurons. A neuron receives stimuli from other neurons through connections called synapses, which can be excitatory or inhibitory. Multiple synapses can stimulate a neuron simultaneously, having an overall effect equal to the sum of excitatory and inhibitory stimuli (Figure 2.1a). If a neuron receives enough inputs to reach a certain threshold, it will trigger a strong response in the neuron called an action potential, which is the primary way of transmitting information between neurons. The action potential is a binary response, which is why it is often described as an "all or none" response [24].

A perceptron is the artificial analogue of a biological neuron. It receives a set of $p$ inputs $x_i$ modulated by a corresponding set of weights $w_i$ that mimic the excitatory and inhibitory stimuli of a biological neuron (see Figure 2.1b).

Figure 2.1: The resemblance between biological neurons and perceptrons. (a) Stimulation of a biological neuron: red neurons give activation/inhibition stimuli to the blue neuron. When the sum of the stimuli reaches the threshold, the blue neuron generates an output. (b) Perceptron: the perceptron's inputs are combined linearly with some weights and introduced in a non-linear activation function.

Each input represents a feature of the dataset, meaning that a set of simultaneous inputs would represent a single data point being evaluated in the neuron. The inputs then are transformed by an inner vector product, often called pre-activation or logit $z$ (equation (2.5)). An additional term $b$ called bias is usually added to the vector product. Then, the "all or none" response $h$ is modelled evaluating the logit by a step function $\sigma$, which outputs 1 if the logit reaches the desired threshold $t$ or 0 otherwise:

$$ z = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p + b = \sum_{i=1}^{p} w_i x_i + b = \mathbf{w} \cdot \mathbf{x} + b, \tag{2.5} $$

$$ h = \sigma(z) = \begin{cases} 0, & \text{for } z \leq t \\ 1, & \text{for } z > t. \end{cases} \tag{2.6} $$
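Equations (2.5) and (2.6) translate directly into code. In the sketch below the weights, bias and threshold are hand-picked hypothetical values that make the perceptron behave as a logical AND of two binary inputs; no training is performed.

```python
import numpy as np


def perceptron(x, w, b, t=0.0):
    """Perceptron response: logit z = w . x + b (2.5), then a step at threshold t (2.6)."""
    z = float(np.dot(w, x) + b)
    return 1 if z > t else 0


# Hypothetical parameters implementing a logical AND.
w_and = np.array([1.0, 1.0])
b_and = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(np.array(x, dtype=float), w_and, b_and) for x in inputs]
print(dict(zip(inputs, outputs)))
```

The AND function is linearly separable, so a single hyperplane (here $x_1 + x_2 = 1.5$) is enough to reproduce it.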

The perceptron is a simple supervised learning method that can be trained to solve binary classification problems, as it divides the feature space in two by a hyperplane. The algorithm by itself can have a decent performance in simple linearly separable tasks. However, it fails when the data is not linearly separable. This substantial limitation motivates extending the power of perceptrons further.

Figure 2.2: A single-layer neural network with four input neurons, five hidden neurons and one output neuron. Each hidden neuron receives a linear combination of the inputs and transforms it non-linearly. The output neuron does the same procedure with the hidden neurons.

Artificial neural network: A perceptron network

In artificial and biological neural networks, a single neuron has no significant influence if it is not connected to others. Therefore, multiple perceptrons can be connected to tackle more complicated tasks. A set of connected artificial neurons working together to perform a specific task is known as an artificial neural network (ANN). The different ways in which the neurons are connected define different architectures of ANNs suitable for different tasks.

One way of integrating multiple neurons is by connecting them in parallel or layerwise. With this configuration, the different neurons receive a unique linear combination of the inputs since all the weight combinations are different. After the respective activation functions evaluate their logits, each neuron returns its binary response. A subsequent neuron, called output neuron, can then linearly combine these results to generate the output of the network (Figure 2.2).

The network as mentioned above would be composed of three parts: the input neurons, each of them representing a feature of the dataset; the set of neurons evaluating the inputs, known as the hidden layer; and the output neuron, which consolidates the responses of the neurons in the hidden layer. Such a network is called a single-layer network.

Figure 2.3: Multi-layered network for a multiple-label classification problem: Inputs are linearly combined by the Σ neuron and non-linearly transformed by the σ neuron in each layer. The outputs of a layer constitute the input of the next. Each output neuron gives the probability that the input belongs to a particular class.

Multi-layered networks: The heart of deep learning

A more complex model can be created by stacking multiple hidden layers of neurons between the input and the output, so that each layer transforms the output of the previous one. That is, the outputs of one layer are linearly combined with a different set of weights in the following layer, which in turn can be connected to a subsequent layer (Figure 2.3). The number of layers of the model can be chosen at the user’s discretion. The more layers a network has, the more complex the model and the higher its capability to solve intricate problems. In this way, each layer acts as a feature extractor, where the initial layers capture fine details and the last layers extract complex traits. This property of transforming the inputs non-linearly several times gives ANNs the ability to extract meaningful information from untreated features. This characteristic of networks having multiple layers is what gave deep learning its name.

In the network shown in Figure 2.3, every neuron in one layer transmits its output to every neuron in the next layer. These layers are called fully connected layers. It is also important to note that there are no direct connections between neurons of the same layer, nor loop connections of a neuron with itself. These properties are representative of a feedforward neural network.

Mathematically, the output of this network can be described with matrix products of the inputs with matrices of weights for each layer. In the perceptron, the weights are grouped in a vector w. In a layer of a feedforward network with m neurons and a p-dimensional input vector x, the weights are instead expressed by a matrix W of dimensions m × p, so that the product W · x is an m-dimensional vector. The logits z of the l^th layer can be expressed as the vector

\[ z^{(l)} = W^{(l)} \cdot x + \mathbf{b}^{(l)}, \tag{2.7} \]

while the outputs h of the l^th layer are just the non-linear activations of the logits

\[ h^{(l)} = \sigma(z^{(l)}). \tag{2.8} \]

Then, plugging the right hand side of equation (2.7) into equation (2.8) gives the expression for the next layer

\[ h^{(l+1)} = \sigma\left( W^{(l+1)} \cdot h^{(l)} + \mathbf{b}^{(l+1)} \right), \tag{2.9} \]

which is the general propagation rule for feed-forward ANNs.
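The propagation rule of equation (2.9) can be sketched as a loop over layers. This is a minimal illustration, assuming sigmoid activations in every layer and weight matrices stored with one row per neuron so that the product `W @ h` is defined; the layer sizes (4, 5, 1) mirror Figure 2.2, and the random weights are placeholders rather than trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activation=sigmoid):
    """Apply h^(l+1) = activation(W^(l+1) h^(l) + b^(l+1)) layer by layer."""
    h = x
    for W, b in zip(weights, biases):
        z = W @ h + b          # logits of the layer, equation (2.7)
        h = activation(z)      # activations of the layer, equation (2.8)
    return h

rng = np.random.default_rng(0)
sizes = [4, 5, 1]              # inputs, hidden neurons, output neurons
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

y_hat = forward(rng.normal(size=4), weights, biases)
```

Adding a further hidden layer only requires appending another entry to `sizes`; the loop itself does not change.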

Feedforward neural networks are suitable for both regression and classification tasks. For regression tasks, the output node of the network must return a numeric value. In binary classification, the output node returns either 0 or 1, depending on the predicted category. For multi-class classification, several output neurons are needed, as shown in Figure 2.3. In this case, each output neuron would represent a category, and it will output 1 when the input belongs to the corresponding category, and 0 otherwise. In practice, the values of the output neurons are real values between 0 and 1, which represent the probability of belonging to the given class.

2.2.2 Training Feed-forward Neural Networks

Previously it was assumed that the networks had optimal weights for performing the desired task. However, these parameters have to be trained to achieve their optimal values. These training parameters in ANNs are the weight matrices W and biases b of each neuron. For simplicity, the biases are usually included in the W matrix in the mathematical expressions.


Training dataset: Learning from examples

It is the essence of supervised learning methods that they require a training phase in which they learn from labelled examples. Although each ML method may have its peculiarities, all of them follow the same general procedure.

The process starts by selecting a representative subset of the available data known as the training dataset while leaving the remaining examples as the test dataset. The training dataset is used to tune the parameters, while the test dataset is left to evaluate the method’s generalisation ability after training. It is recommended to choose as the training set a significant proportion of the data, usually around 70%, to maximize the probability that the model captures the general traits of the population and not individual characteristics of instances.

Sometimes, an additional subset of the dataset known as the validation dataset is built to monitor the performance during training and change the value of non-training parameters, known as hyperparameters. The basic hyperparameters of a neural network are the number of layers and the number of neurons per layer.

Cost function: Quantification of the error

The next step is to measure the error between the predictions of the model and the ground truth. The ground truth of an instance $x_i$ is its label $y_i$. The prediction $\hat{y}_i$ for instance $x_i$ is the output given by the network upon evaluating $x_i$. This evaluation, which implies performing the corresponding linear combinations and activations sequentially through each layer, is known as forward propagation. The required error function should therefore measure the discrepancy between $y_i$ and $\hat{y}_i$ for all training examples. This function has to take high values when $y_i$ and $\hat{y}_i$ are very different, and values close to zero when they are alike. Such a function J is known as the cost function.

The specific cost function has to be chosen depending on the problem. Some cost functions are best suited for classification problems, and others for regression. One common choice of cost function for regression is the mean squared error (MSE)

\[ J = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2, \tag{2.10} \]

where N is the number of instances in the training dataset. For binary classification problems, cross-entropy is often used:

\[ J = -\sum_{i=1}^{N} \left[ y_i \log_2(\hat{y}_i) + (1 - y_i) \log_2(1 - \hat{y}_i) \right]. \tag{2.11} \]

For multi-class classification, the cross-entropy takes the form

\[ J = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log_2(\hat{y}_{i,k}), \tag{2.12} \]

where K is the number of different classes.
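The three cost functions of equations (2.10) to (2.12) can be written directly with NumPy. This is a sketch rather than a reference implementation: the clipping constant `eps`, which keeps the logarithm away from zero, is an addition of this example, and the multi-class labels are assumed to be one-hot encoded.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, equation (2.10)."""
    return np.mean((y_pred - y_true) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy with base-2 logarithms, equation (2.11)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log2(y_pred)
                   + (1.0 - y_true) * np.log2(1.0 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Multi-class cross-entropy for one-hot labels, equation (2.12)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log2(y_pred))

# An undecided prediction of 0.5 costs exactly one bit per instance.
bce_example = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.5, 0.5]))
```

As the text requires, each function is close to zero when predictions match the labels and grows as they diverge.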

This cost function depends on the values of $\hat{y}_i$, which in turn depend on every trainable parameter w. As the goal is to find the optimal value of each weight, the cost function has to be minimised, and the values of w at that minimum are then used. However, due to the high dimension of the parameter space and the non-linear activation functions in the hidden layers, the cost function is non-convex [25], so the minimum has to be estimated with iterative algorithms.

Gradient descent: Moving in the steepest slope direction

The preferred choice for optimising neural networks is gradient descent. This iterative algorithm finds a local minimum of a cost function by updating the current point based on the direction of the gradient. The algorithm displaces a chosen starting point $w^0$ in the negative direction of the gradient by a step scaled by η, called the learning rate, as shown in equation (2.13). This is the direction in which the function decreases at the highest rate (the steepest slope). As the cost function is a hypersurface in the parameter space, the starting point $w^0$ and the gradient $\nabla_w J(w)$ are d-dimensional vectors, where d is the number of trainable parameters of the model. Each entry in these two vectors represents the value of one of the weights and its derivative, respectively, as shown in equations (2.14) and (2.15):

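\[ w^{j+1} = w^{j} - \eta \nabla_{w^j} J(w^j) \tag{2.13} \]

\[ \nabla_{w^j} J(w^j) = \begin{pmatrix} \dfrac{\partial J}{\partial w_1^j} \\ \dfrac{\partial J}{\partial w_2^j} \\ \vdots \\ \dfrac{\partial J}{\partial w_d^j} \end{pmatrix} \tag{2.14} \]

\[ w^j = \begin{pmatrix} w_1^j \\ w_2^j \\ \vdots \\ w_d^j \end{pmatrix}, \tag{2.15} \]

where the superscript j is the number of the iteration.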

The steepest descent method is known to converge slowly, and the value of the learning rate is one of the factors that determine the convergence speed. A learning rate that is too small makes tiny steps towards the solution, so the algorithm needs too many iterations to reach the minimum. On the other hand, a learning rate that is too large may overshoot the desired minimum, causing the algorithm to diverge. The value of the learning rate is a hyperparameter that has to be tuned using the validation error. Sometimes, an adaptive learning rate is used to make large steps in the first iterations and smaller steps close to the solution. This adaptation can increase the convergence rate of the algorithm.

Because the parameters are updated in each iteration, the current solution moves to a new point of the cost function, so a new gradient must be calculated. The number of iterations depends on the number of batches, which are subsets of training data points used to estimate the cost function and update the network’s weights. Using the whole training dataset as the batch may result in a faster convergence when the dataset is big, but it may be slower when the number of features is large, and it risks getting stuck in a saddle point with a low-performing set of parameters. An alternative is stochastic gradient descent (SGD), where only one or a few instances are used to calculate the gradient. In this method, the cost function is estimated with fewer data points but over more iterations. This means that the gradient calculation may not be exact, but it has several iterations to correct itself, reducing the risk of getting stuck in a flat region [26]. However, this method may converge slowly and may even diverge close to the minimum. Usually, a middle-ground method is used, which divides the dataset into batches of a few hundred examples.

Most iterative algorithms stop when they find a value that is good enough given a tolerance. However, in neural network training, one usually sets the number of times the whole dataset will be used in gradient descent as the stopping criterion. Each of these complete passes is called an epoch. The number of iterations can then be calculated by multiplying the number of epochs by the number of batches into which the training dataset is divided.
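The interplay of learning rate, batches and epochs can be sketched with mini-batch gradient descent on a small regression problem. The example below is an illustration under stated assumptions: a linear model with the MSE cost of equation (2.10), synthetic noise-free data, and a hypothetical helper name `minibatch_gd`; none of these come from the thesis.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, epochs=50, batch_size=32, seed=0):
    """Minimise the MSE of a linear model with mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                       # starting point w^0
    iterations = 0
    for _ in range(epochs):               # one epoch = one pass over the data
        order = rng.permutation(n)        # reshuffle the batches every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)   # MSE gradient on the batch
            w = w - lr * grad             # update rule, equation (2.13)
            iterations += 1
    return w, iterations

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
w_true = np.array([2.0, -1.0, 0.5])       # illustrative ground-truth weights
y = X @ w_true

w_hat, n_iter = minibatch_gd(X, y)
```

With 256 instances and batches of 32 there are 8 batches per epoch, so 50 epochs give 50 × 8 = 400 iterations, as the text describes.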

Backpropagation: Applied chain rule of differentiation

Historically, calculating the gradient has been the most challenging aspect of SGD, as it can be a very computationally expensive procedure [23]. As a result, the development of an algorithm called backpropagation to calculate the gradient numerically was a significant achievement in the field. Backpropagation takes advantage of the recursive variable dependence in the network: forward propagation makes it possible to express each layer as a function of the previous ones.

Specifically, in a simple single-layer network (recall Figure 2.2), the output $\hat{y}$ can be defined as the activation of a logit $z^{(2)}$

\[ \hat{y} = \sigma(z^{(2)}). \tag{2.16} \]

The logit is a linear combination of the outputs $h_i^{(1)}$ of the neurons in the hidden layer

\[ z^{(2)} = \sum_{i=1}^{m_2} w_i^{(2)} h_i^{(1)} + b_2. \tag{2.17} \]

In turn, these neuron responses $h_i^{(1)}$ are activations of linear combinations of the inputs:

\[ h_i^{(1)} = \sigma(z_i^{(1)}); \quad z_i^{(1)} = \sum_{j=1}^{m_1} w_{ij}^{(1)} x_j + b_1, \tag{2.18} \]

where in equations (2.17) and (2.18), the index i goes over the $m_2$ neurons in the hidden layer and the index j goes over the $m_1$ inputs.

Following this idea, to calculate an entry of the gradient vector of equation (2.14), i.e. a partial derivative of the cost function with respect to a single weight, an expression using the chain rule of differentiation is derived:

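\[ \frac{\partial J}{\partial w_{11}^{(1)}} = \frac{\partial J}{\partial \hat{y}} \, \frac{\partial \hat{y}}{\partial z^{(2)}} \, \frac{\partial z^{(2)}}{\partial h_1^{(1)}} \, \frac{\partial h_1^{(1)}}{\partial z_1^{(1)}} \, \frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}}. \tag{2.19} \]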
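The chain rule for a single weight can be checked numerically. The sketch below assumes sigmoid activations, a squared-error cost for one training example, and one bias per hidden neuron; these choices, the random values and the helper names are assumptions of the example. It multiplies the partial derivatives along the path from the cost to the weight $w_{11}^{(1)}$ in the network of equations (2.16) to (2.18) and compares the result with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, w2, b2):
    z1 = W1 @ x + b1            # hidden logits, equation (2.18)
    h1 = sigmoid(z1)            # hidden activations
    z2 = w2 @ h1 + b2           # output logit, equation (2.17)
    return z1, h1, z2, sigmoid(z2)

def grad_w11(x, y, W1, b1, w2, b2):
    """Derivative of J = (y_hat - y)^2 with respect to W1[0, 0] by the chain rule."""
    _, h1, _, y_hat = forward(x, W1, b1, w2, b2)
    dJ_dyhat = 2.0 * (y_hat - y)
    dyhat_dz2 = y_hat * (1.0 - y_hat)      # derivative of the sigmoid
    dz2_dh1 = w2[0]
    dh1_dz1 = h1[0] * (1.0 - h1[0])
    dz1_dw11 = x[0]
    return dJ_dyhat * dyhat_dz2 * dz2_dh1 * dh1_dz1 * dz1_dw11

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), 1.0
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
w2, b2 = rng.normal(size=5), 0.1

analytic = grad_w11(x, y, W1, b1, w2, b2)

# Central finite difference on the same weight as an independent check.
step = 1e-6
W_plus, W_minus = W1.copy(), W1.copy()
W_plus[0, 0] += step
W_minus[0, 0] -= step
cost = lambda W: (forward(x, W, b1, w2, b2)[3] - y) ** 2
numeric = (cost(W_plus) - cost(W_minus)) / (2.0 * step)
```

Because each factor is a simple local derivative, the same intermediate products can be reused for every weight in a layer, which is what makes backpropagation efficient.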

Nevertheless, the previous explanation exposes an obvious problem with backpropagation: the non-linear function σ(·) (equation (2.6) and Figure 2.4a), defined previously as the step function, is not differentiable over its whole domain. Moreover, its derivative is zero wherever it is defined, which reduces all the chained derivatives to zero, so the weights are not updated in the gradient descent algorithm. For this reason, other non-linear functions have to be used. Differentiable alternatives to the step function, such as the sigmoid function (equation (2.2) and Figure 2.4b) or the inverse tangent (Figure 2.4c), can be used as substitutes. However, it has been shown [23] that using a rectified linear unit (ReLU) function (equation (2.20) and Figure 2.4d) as the activation function in hidden layers can achieve better performance than saturating functions ⁴.

\[ \sigma_{\mathrm{ReLU}}(x) = \max\{0, x\} \tag{2.20} \]

Figure 2.4: Some of the possible non-linear functions that can be used as activations for the neurons: (a) step function, (b) sigmoid, (c) inverse tangent, (d) rectified linear unit (ReLU).

The previously explained dependence of outer layer variables on inner layer variables makes ANNs suitable for being represented as acyclic computational graphs. In such graphs, each variable is a node, linked to other variables through edges representing operations. This representation is precisely what implementation packages such as TensorFlow use to perform the algorithm’s computations.

4 Saturating functions have horizontal asymptotes as their inputs approach infinity and negative infinity. In other words, a given increase or decrease in their inputs will not cause a significant increase or decrease of their value over a great part of their domains. As a consequence, their derivatives are close to zero approaching these values.

In those computational graphs, each edge represents an operation between two tensors. The training of neural networks can involve a considerable number of these operations. This heavy load of operations is why graphics processing units (GPUs) are commonly used to train neural networks instead of central processing units (CPUs). In brief, GPUs have a higher number of cores and higher bandwidth than CPUs, making them more suitable for performing operations in the computational graph, as many of them are calculated in parallel. Nevertheless, GPUs usually have more limited memory than CPUs, which is the bottleneck of their usage.

Training challenges

Gradient descent is used as a resource in the absence of an easy exact solution to the non-convex minimisation problem. This alternative gives successful results in a vast number of situations, but several theoretical and practical issues must be taken into consideration.

Among the many practical considerations, one important point is dealing with vanishing gradients in backpropagation. The gradient with respect to a parameter very deep in the network is usually a product of many partial derivative terms. If even one or a few of them are close to zero, the whole product may vanish and leave that parameter untrained. The deeper the layer is located, the more likely its derivative is to contain a value close to zero. Additionally, vanishing gradients are persistent when using saturating activation functions such as the sigmoid and the inverse tangent, due to the extensive part of their domain where their derivative is close to zero. Two main approaches can be used to solve this problem. The first uses non-saturating functions as activations of inner layers, such as ReLU or other functions that only have a finite number of points where the derivative is not defined [23]. The second option is to use a particular weight initialisation strategy. One of the most widely used initialisation strategies is the Glorot & Bengio initialisation [27], which samples the initial value of the weights from a uniform distribution bounded between

\[ \pm \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \tag{2.21} \]

where $n_{in}$ and $n_{out}$ are the number of inputs and outputs of the layer, respectively. The reader is referred to [27] for an in-depth explanation of this initialisation method.
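The bound of equation (2.21) translates into a short sampling routine. This is a minimal sketch with NumPy; the function name `glorot_uniform` and the layer sizes are illustrative choices, and the matrix is stored with one row per output neuron as an assumption of this example.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Sample a weight matrix uniformly within the bound of equation (2.21)."""
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = glorot_uniform(n_in=64, n_out=32, rng=rng)

# For 64 inputs and 32 outputs the bound is sqrt(6 / 96) = 0.25.
limit = np.sqrt(6.0 / (64 + 32))
```

Wider layers receive smaller initial weights, which keeps the scale of the logits roughly independent of the layer size.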

Among the theoretical considerations, the most notable is that gradient descent does not guarantee convergence to the absolute minimum, since the solution found depends on the starting point. Although this may seem troublesome, recent studies have shown that converging to a local minimum is not a severe issue since, for networks with many parameters, the cost function has abundant minima of similar quality to the absolute minimum [20]. However, the optimisation procedure may arrive at a saddle point, where most of the derivatives are close to zero, which could reduce the average speed of convergence. For these cases, momentum-based optimisation methods were developed, Adam being the most successful. Adam optimisation [28] differs from regular SGD in two ways: first, it keeps individual learning rates $\eta_i$ for every trainable parameter $w_i$; second, the learning rates are adaptively changed using exponentially decaying averages of the means and variances of the gradients. Equation (2.22) shows the updating of the mean m and uncentered variance s estimates of the gradients, and how they are involved in the parameter updating:

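\[ \begin{aligned} m_i^{(l)} &= \beta_1 m_i^{(l-1)} + (1 - \beta_1) \nabla_{w_i} J(w) \\ s_i^{(l)} &= \beta_2 s_i^{(l-1)} + (1 - \beta_2) \left( \nabla_{w_i} J(w) \right)^2 \\ w_i^{(l+1)} &= w_i^{(l)} - \eta_i \frac{m_i^{(l)}}{(1 - \beta_1^{l+1}) \left( \sqrt{\dfrac{s_i^{(l)}}{1 - \beta_2^{l+1}}} + \epsilon \right)}, \end{aligned} \tag{2.22} \]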

where l is the iteration, i is the index of the parameter, $\beta_1$ and $\beta_2$ are the exponential decay rates of the mean and the variance, respectively, and $\epsilon$ is a small number that prevents division by zero in the implementation. These last three variables are hyperparameters set by the user, but the values of $\beta_1$ and $\beta_2$ have to be between 0 and 1 to generate the exponential decay in the values of the mean and variance, and $\epsilon$ is usually of the order of $10^{-7}$. Adam optimisation helps gradient descent overcome saddle points efficiently by keeping the "momentum" of previous iterations.
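The Adam update can be sketched as follows, with the iteration counter starting at zero so that the bias corrections use the exponent l + 1. The example makes several assumptions that are not in the thesis: a single base learning rate `lr` shared by all parameters, commonly used default values for the decay rates, a toy quadratic cost, and the hypothetical function name `adam_minimise`.

```python
import numpy as np

def adam_minimise(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-7, steps=500):
    """Minimise a function with Adam, given a function returning its gradient."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)                  # decaying mean of the gradients
    s = np.zeros_like(w)                  # decaying uncentered variance
    for l in range(steps):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        s = beta2 * s + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** (l + 1))      # bias-corrected mean
        s_hat = s / (1.0 - beta2 ** (l + 1))      # bias-corrected variance
        w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
    return w

# Toy cost J(w) = sum((w - target)^2), whose gradient is 2 (w - target).
target = np.array([3.0, -2.0])
w_final = adam_minimise(lambda w: 2.0 * (w - target), w0=[0.0, 0.0])
```

Dividing by the square root of `s_hat` gives each parameter its own effective step size, which is how the per-parameter learning rates described above arise in practice.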

Finally, as for any other ML method, there is a constant threat of overfitting. Deep ANNs can have thousands or even millions of trainable parameters [20], which gives them a tremendous ability to model complex functions but also makes them exceptionally prone to overfitting. To overcome this drawback, some models implement a strategy called dropout. With this approach, each neuron in the ANN is turned off during a training iteration with a probability p, called the dropout rate. Consequently, the model trains with fewer parameters, reducing its complexity during the current iteration. The choice of dropped neurons is made anew in every iteration, so previously dropped neurons can reactivate. The probability p is a hyperparameter, but it usually takes values around 0.1.
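Dropout amounts to multiplying the activations of a layer by a random binary mask that is redrawn in every training iteration. The sketch below is an illustration with NumPy; the function name and the layer of 10 000 unit activations are assumptions of the example.

```python
import numpy as np

def dropout(h, p, rng):
    """Zero each activation independently with probability p (the dropout rate)."""
    keep = rng.random(h.shape) >= p       # True for neurons that stay active
    return h * keep, keep

rng = np.random.default_rng(0)
h = np.ones(10_000)                       # activations of a hypothetical layer
dropped, mask = dropout(h, p=0.1, rng=rng)
fraction_off = 1.0 - mask.mean()          # close to the dropout rate p
```

Many libraries additionally rescale the surviving activations by 1 / (1 − p) during training so that their expected sum is unchanged; that rescaling is omitted here for clarity.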

Performance Metrics

It is crucial to know how well the models are performing. The correct interpretation of performance measures can help diagnose all of the previously described problems that affect ANNs. In classification tasks, two measures are frequently used to describe a method’s performance: precision and recall. These metrics are mostly used for binary classification methods, but their use can be extended to multi-class classification by evaluating each class against the rest [29].

Precision measures the accuracy of the positive predictions. It is defined as the ratio of the instances correctly classified as positive (true positives, abbreviated TP) to the total number of instances (correctly and incorrectly) classified as positive, as shown in equation (2.23)

\[ P = \frac{TP}{TP + FP}, \tag{2.23} \]

where FP is the number of false positives, i.e., the number of negative instances classified as positive. Precision is useful to identify classifiers that are very good at identifying negative instances, i.e., that will rarely classify a negative instance as positive. However, it says nothing about the chance of a positive instance being classified as negative.

On the other hand, recall (also called sensitivity or true positive rate, abbreviated TPR) measures how many of the total positive instances were correctly classified as positive. It is defined as the ratio of TP to the total number of positive instances in the dataset, as shown in equation (2.24)

\[ R = \frac{TP}{TP + FN}, \tag{2.24} \]

where FN stands for false negatives, i.e., instances that are positive but were classified as negative. Thus, recall is useful to identify when a classifier can identify the positive class correctly, but it gives no information about the performance on the negative class. As a result, a classifier with good recall may not necessarily be a good classifier overall, as it may just perform optimally on the positive class. For instance, a classifier that assigns the positive class to every instance will have a recall of 1 but misclassify the opposite class completely.

Figure 2.5: Examples of a precision-recall curve (left) and a receiver operating characteristic curve (right). The precision-recall curve shows the tradeoff between precision and recall: when one of these quantities is 1, the other is 0. The receiver operating characteristic curve shows the opposite relation between the true positive and false positive rates.

The separate roles of precision and recall result in a tradeoff between them: in practice, one cannot achieve perfect recall and perfect precision at the same time. A classifier can then tune its parameters to favour one or the other, depending on the specific task. For example, a neural network with an output neuron using a sigmoid function will distinguish between two classes depending on the value of the logit. Tweaking the weights of the network to make the logits more negative (or simply displacing the sigmoid to change its threshold) will favour negative classification and, therefore, make positive classification more selective. This shift will reduce the number of false positives and hence increase the precision, but at the same time it will also increase the number of false negatives and, as a consequence, reduce recall.
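The tradeoff can be made concrete by computing precision and recall, equations (2.23) and (2.24), at two decision thresholds. The labels and scores below are made-up values for illustration only, and the conventions for empty denominators are an assumption of this sketch.

```python
import numpy as np

def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are classified as positive."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0   # convention: no positives predicted
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return float(precision), float(recall)

# Hypothetical labels and classifier scores (for example, sigmoid outputs).
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1])

low = precision_recall(y_true, scores, threshold=0.35)   # permissive threshold
high = precision_recall(y_true, scores, threshold=0.7)   # strict threshold
```

With the permissive threshold every positive is found at the cost of one false positive (precision 0.8, recall 1.0); with the strict threshold no false positives remain but half of the positives are missed (precision 1.0, recall 0.5).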

The precision-recall tradeoff can be visualised in a curve of precision vs recall as shown in Figure 2.5. One common way of evaluating the performance of the method is calculating the area under the precision-recall curve (AUPRC), which gives a value between 0 and 1. Although AUPRC is mostly used for binary classification, using the one-vs-rest algorithm and the AUPRC metric can give a more accurate evaluation of a method’s performance on heavily unbalanced datasets [30].


Theoretically, the AUPRC is the integral of precision as a function of recall. Computationally, the integral is approximated by the average precision, which is a finite sum of precision and recall values over some thresholds of the model [31]:

\[ AP = \sum_{n} (R_n - R_{n-1}) P_n, \tag{2.25} \]

where $P_n$ and $R_n$ are the precision and recall at threshold n.
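Equation (2.25) can be evaluated by sorting the instances by score and treating each score as a threshold. The sketch below assumes binary labels and no tied scores; the data are made-up values for illustration.

```python
import numpy as np

def average_precision(y_true, scores):
    """Average precision as the sum of (R_n - R_{n-1}) * P_n over score thresholds."""
    order = np.argsort(-scores)               # highest score first
    y_sorted = y_true[order]
    tp = np.cumsum(y_sorted)                  # true positives at each threshold
    ranks = np.arange(1, len(y_sorted) + 1)   # number of predicted positives
    precision = tp / ranks
    recall = tp / y_sorted.sum()
    recall_prev = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - recall_prev) * precision))

y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1])
ap = average_precision(y_true, scores)
```

Only thresholds at which recall increases contribute to the sum, so here the result is 0.25 × (1 + 1 + 1 + 0.8) = 0.95.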

Another important metric is the area under the receiver operating characteristic curve (AUROC) (Figure 2.5 right). The receiver operating characteristic curve is a graph of recall against the false positive rate (FPR), defined as

\[ FPR = \frac{FP}{FP + TN}, \tag{2.26} \]

where TN is the number of negative instances correctly classified. Similarly to AUPRC, the AUROC is a value between 0 and 1 that measures how well a model can distinguish between classes. Models with values close to one perform optimally, while a random classifier will have an AUROC of 0.5. AUROC is often used in multi-class classification tasks.
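The AUROC can be approximated by tracing the curve of recall against FPR over all score thresholds and integrating with the trapezoidal rule. This sketch assumes binary labels with at least one instance of each class and no tied scores; the data are made-up values for illustration.

```python
import numpy as np

def roc_auc(y_true, scores):
    """Area under the ROC curve using the trapezoidal rule over score thresholds."""
    order = np.argsort(-scores)               # highest score first
    y_sorted = y_true[order]
    tpr = np.concatenate(([0.0], np.cumsum(y_sorted) / y_sorted.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()))
    widths = fpr[1:] - fpr[:-1]
    heights = (tpr[1:] + tpr[:-1]) / 2.0
    return float(np.sum(widths * heights))

y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1])
auc = roc_auc(y_true, scores)
```

The result, 0.9375, equals the fraction of positive-negative pairs in which the positive instance receives the higher score (15 of 16 pairs), which is a common interpretation of the AUROC.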

An approach for multi-label prediction is the average precision at k (AP@K), where k is the number of labels predicted. First, the precision at each k (P@K) must be calculated, that is, the fraction of the first k predicted labels that are correct. The average precision at k is simply the average of P@K over several k, namely

\[ AP@K = \frac{1}{k} \sum_{i=1}^{k} P@i. \tag{2.27} \]

This metric may be useful when a known number of labels is more important than the others. However, it fails to describe the performance over the whole label space.
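A small sketch of equation (2.27) follows. The ranked predictions and the set of true labels are hypothetical examples, not data from the thesis, and the function names are illustrative.

```python
def precision_at(ranked_labels, true_labels, i):
    """P@i: fraction of the first i predicted labels that are correct."""
    top = ranked_labels[:i]
    return sum(label in true_labels for label in top) / i

def average_precision_at_k(ranked_labels, true_labels, k):
    """AP@K: mean of P@i for i = 1, ..., k, as in equation (2.27)."""
    return sum(precision_at(ranked_labels, true_labels, i)
               for i in range(1, k + 1)) / k

# Hypothetical labels ranked by predicted score, and the true label set.
ranked = ["nausea", "headache", "rash", "fatigue"]
truth = {"nausea", "rash"}
ap3 = average_precision_at_k(ranked, truth, k=3)
```

Here P@1 = 1, P@2 = 1/2 and P@3 = 2/3, so AP@3 = 13/18, roughly 0.72. The fourth prediction does not influence the result, which illustrates why the metric does not describe the whole label space.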

2.2.3 Convolutional Neural Networks

Despite the drastic improvements in the performance of conventional DL models, they still had some significant shortcomings. Networks with fully connected layers can have a number of parameters that grows intractably with the number of inputs and hidden neurons. This increase means that the models can become impractically large for problems with high-dimensional data. Moreover, in the case of images, standard neural networks work fine for datasets with simple images, i.e. small images with binary values, but as the input grows in dimension and complexity (as in high-resolution RGB images), the models do not manage to learn the proper components of the data.

Figure 2.6: A convolutional filter or kernel linearly combines the entries of a small section of the input map to form an entry of the output map. A non-linear activation such as ReLU usually follows the linear combination.

One of the reasons for this low performance was that the spatial structure of images was not taken into account. For recognition tasks, an object has to be identified in an image regardless of its location, orientation, or size. In images, local groups of values are highly correlated, forming distinctive motifs [20]. A standard neural network would flatten the image into a vector and treat it as a standard input vector. In this way, the information on pixel adjacency is lost.

Convolution: An artificial neuronal receptive field

Convolutional neural networks (CNNs) emerged to satisfy the need for a DL method that could capture information coming from 2-dimensional data, such as images. Just like conventional ANNs, CNNs mimic the functioning of biological visual structures. It was found that neurons in the visual cortex have a small receptive field which responds to specific local spatial patterns [32].

A CNN leaves behind the previous architecture of fully connected layers and takes a more local approach to imitate this local receptive field. A layer uses small trainable matrices called filters or kernels that go over the entire input, creating responses of local patches of the same size as the filter. A response involves an elementwise multiplication of the filter and the patch followed by a sum of the products, chained to a non-linear activation function as in traditional ANNs. The filter moves across the input generating responses of every possible patch, as in a 2-dimensional convolution. As a result, a 2-dimensional output is generated by arranging the responses accordingly in what is known as a feature map (Figure 2.6). Several different filters can be convolved over a data instance in a layer, each generating its characteristic feature map. Each filter is supposed to detect specific patterns in the data. As the weights that compose the filters are trained using backpropagation, the network decides which patterns are relevant.

Figure 2.7: Deep convolutional neural network: An image is put through several convolutional and pooling layers, then flattened and fed to a standard neural layer. The final layer performs the multi-label classification.
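The filter response described above can be sketched with an explicit loop over patches. This illustration assumes a single-channel input, no padding, a stride of one and a ReLU activation; as is common in deep learning code, the filter is not flipped, so strictly speaking the operation is a cross-correlation. The image and the hand-written edge filter are made-up examples, whereas in a CNN the filter weights would be learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over every patch of the image and apply ReLU to the responses."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)   # elementwise product, then sum
    return np.maximum(out, 0.0)                  # ReLU activation

image = np.zeros((5, 5))
image[:, 3:] = 1.0                     # dark left side, bright right side
kernel = np.array([[-1.0, 1.0]])       # responds to dark-to-bright transitions
feature_map = conv2d_valid(image, kernel)
```

The feature map is non-zero only in the column where the vertical border lies, in every row of the image, which is the translational behaviour that weight sharing provides.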

Deep CNNs: Convolution and pooling layers

In image recognition tasks, it is crucial to stack multiple layers to achieve good performance. The subsequent transformations of the feature maps reveal the intricate hierarchy in patterns that are characteristic of objects. Specifically, first convolutional layers identify the most primitive patterns in images, such as borders. The generated feature maps will then contain information on the presence or absence of borders in local patches. The following layers will try to identify patterns of borders that form motifs in the same way. Finally, the last convolutional layers will identify groups of these motifs located in specific ways that form recognisable objects. On the whole, multiple layers guarantee multiple levels of abstraction from fine details to composite objects in a hierarchical way. This multiple level architecture of CNNs is shown in Figure 2.7.

The algorithm’s convolutional nature guarantees that a structure can be identified with translational invariance, meaning that the object can be recognised in any location in the image, as many times as necessary. Meanwhile, pooling layers also play an essential role in CNNs. A pooling layer condenses a feature map by obtaining a descriptor of small patches in the map, usually the maximum or the average. In maximum pooling (max pooling for short), the maximum value of each patch is chosen to form the condensed feature map. This step helps to reduce the dimensionality of the data and improves the generalisation of the method. Max pooling layers can make the model more robust, overcome small positional variations and improve statistical efficiency [23].

After several sets of convolutional and pooling layers, the original high-dimensional input is reduced substantially in size. In the final phase of the algorithm, the resulting feature map is flattened so that fully connected layers can use the reduced feature vector to perform standard tasks like classification, as seen in Figure 2.7.
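Max pooling over non-overlapping patches can be written compactly with a reshape. This is a sketch under the assumption of square patches and a stride equal to the patch size; rows and columns that do not fill a complete patch are discarded.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size patch."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    trimmed = feature_map[:h_out * size, :w_out * size]
    blocks = trimmed.reshape(h_out, size, w_out, size)
    return blocks.max(axis=(1, 3))

fm = np.arange(16.0).reshape(4, 4)     # a made-up 4 x 4 feature map
pooled = max_pool2d(fm)                # condensed to 2 x 2
flattened = pooled.flatten()           # vector passed to fully connected layers
```

A 4 × 4 map is reduced to 2 × 2, so the fully connected layers that follow receive a quarter of the original values.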

Sparse interactions: fewer trainable parameters

The convolutional approach previously described successfully tackles the problem of spatial dependency in the data. Additionally, it provides a considerable reduction in the number of trainable parameters due to its sparse interactions, which brings several practical benefits. First of all, a method with fewer parameters is less complex and hence less prone to overfitting: the reduction in parameters, a consequence of the convolutional architecture and the max pooling layers, gives robustness to the algorithm. Second, having fewer parameters means lower memory requirements and fewer calculations, which makes the method faster and more efficient. Additionally, the characteristic weight sharing of the method improves the statistical efficiency: traditional DL methods use each weight only once, while CNNs reuse their parameters across the input, making each of them more significant.
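The reduction in trainable parameters can be illustrated with simple arithmetic. The layer sizes below (a 32 × 32 RGB image, 100 hidden neurons or 100 filters of size 3 × 3) are assumptions chosen for the example, and each neuron or filter is taken to have one bias.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus biases of a convolutional layer; independent of the image size."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

n_dense = dense_params(32 * 32 * 3, 100)   # 3072 inputs fully connected to 100 neurons
n_conv = conv_params(3, 3, 3, 100)         # 100 filters of size 3 x 3 over 3 channels
```

Under these assumptions the fully connected layer has 307 300 parameters while the convolutional layer has 2 800, and only the former grows with the resolution of the image.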

2.2.4 Graph Convolutional Networks

CNNs have proven to be successful in problems where data has a grid-like structure. However, there is a vast supply of problems that involve data on irregular or non-Euclidean domains, such as the ones structured by graphs. For example, data coming from social networks, log data on communication networks, bibliographical data [33], text data on word embeddings and biological interactions, involve pairwise relationships defined better by graphs [34]. Most of these problems cannot be appropriately addressed without considering their