DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Hyperparameter Optimization for Convolutional Neural Networks

CLÉMENT GOUSSEAU

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Hyperparameter Optimization for Convolutional Neural Networks

CLÉMENT GOUSSEAU

Master in Computer Science
Date: February 16, 2020
Supervisor: Erik Fransén
Examiner: Pawel Herman
School of Electrical Engineering and Computer Science
Host company: Orange Labs Lannion

Swedish title: Hyperparameteroptimering av faltningsnätverk


Abstract

Training algorithms for artificial neural networks depend on parameters called hyperparameters. These can have a strong influence on the trained model but are often chosen manually, through trial-and-error experiments.

This thesis, conducted at Orange Labs Lannion, presents and evaluates three algorithms that aim to automate this task: a naive approach (random search), a Bayesian approach (Tree Parzen Estimator) and an evolutionary approach (Particle Swarm Optimization). A well-known dataset for handwritten digit recognition (MNIST) is used to compare these algorithms.

These algorithms are also evaluated on audio classification, which is one of the main activities in the company team where the thesis was conducted.

The evolutionary algorithm (PSO) showed better results than the two other methods.

Sammanfattning

Hyperparameter optimization is an important but difficult task when training an artificial neural network.

This degree project, carried out at Orange Labs Lannion, presents and evaluates three algorithms that aim to solve this task: a naive strategy (random search), a Bayesian method (TPE) and an evolutionary strategy (PSO). The MNIST dataset has been used to compare these algorithms.

The algorithms are also evaluated on audio classification, which is the core activity of the company team where the degree project was carried out.

The evolutionary algorithm (PSO) gave better results than the two other methods.


Acknowledgements

I would like to thank:

Lionel-Delphin Poulat, Orange Labs Lannion, for his advice, support, and kindness.

Christian Grégoire, Orange Labs Lannion, for his welcome and his trust.

Pawel Herman, KTH, for his clear and useful guidelines.

Erik Fransén, KTH, for his advice and his support.


Contents

1 Introduction 1
    1.1 Research Question 2
    1.2 Scope and Limitations 2
    1.3 Thesis Outline 3

2 Background 4
    2.1 Artificial Neural Networks 4
        2.1.1 From Biological Neural Networks to Artificial Neural Networks 4
        2.1.2 ANN layers 5
    2.2 ANN Training 9
        2.2.1 Supervised Learning 9
        2.2.2 Loss Function 9
        2.2.3 Optimization of the Loss Function: Gradient Backpropagation 10
        2.2.4 Regularization Techniques 11
    2.3 Parameters and Hyperparameters 12

3 Related Work 14
    3.1 The Problem of Hyperparameter Optimization 14
        3.1.1 Definition of the problem 14
        3.1.2 Main issues 15
    3.2 Existing Solutions for Hyperparameter Optimization 16
        3.2.1 Model-free Algorithms 17
        3.2.2 Model-based Algorithms 20

4 Methods 26
    4.1 A Conventional Benchmark: MNIST Image Classification Task 26
        4.1.1 Data and Task 26
        4.1.2 Starting Point: LeNet-1 27
        4.1.3 Experiments 28
        4.1.4 Hyperparameter Optimization 32
        4.1.5 Metrics 34
    4.2 An Application to a Real-World Use Case: Audio Classification 36
        4.2.1 Data and Task 36
        4.2.2 Feature Engineering 36
        4.2.3 Model 39
        4.2.4 Optimization of the Model 40

5 Results 41
    5.1 Image Classification Task 41
        5.1.1 Exploration 41
        5.1.2 Exploitation 43
        5.1.3 Exploration/Exploitation Tradeoff 47
        5.1.4 Analysis 50
    5.2 Audio Classification Task 51

6 Discussion 53
    6.1 Implications of the Findings 53
    6.2 Limitations of the Work 54
        6.2.1 Impact of the Optimization Algorithms Initialization 54
        6.2.2 Generalization Issues 54
        6.2.3 Task Dependence 55
        6.2.4 Qualitative Hyperparameters Optimization 55
        6.2.5 Global Optimum and Local Optimum 55
        6.2.6 The Actual Usefulness of Hyperparameter Optimization 56
        6.2.7 Superiority of Hyperparameter Optimization Algorithms over Human Manual Search 56
    6.3 Discussion on Ethics and Sustainability 56
        6.3.1 Sustainability Issues 56
        6.3.2 Ethical Issues 57

7 Conclusion 58

Bibliography 59


Chapter 1

Introduction

Created in the 1940s and 1950s and inspired by biological neural networks [1],[2],[3], Artificial Neural Networks (ANN) are now widely used in the field of machine learning. The rise of computational resources, the release of public datasets [4],[5],[6] and the competition among researchers and engineers have enabled a lot of progress in the 2010s. Convolutional Neural Networks (CNN) are a specific type of ANN, which are particularly suitable for tasks such as image classification.

One issue with ANN and CNN is that they are defined by many hyperparameters, which define e.g. the topology of the network, its size and its training conditions. In practice, hyperparameter tuning is a tricky task for many reasons, such as the long training time, the multi-aspect nature of the search (both quantitative and qualitative hyperparameters can be involved) or the curse of dimensionality [7],[8]. This results in a very time-consuming process, which gives rise to the need for automated hyperparameter optimization.

Some solutions have been proposed in order to make the search more efficient. In 2011, Bergstra, Bardenet, Bengio and Kégl introduced a method based on a Bayesian framework [9]. In 2012, Bergstra and Bengio also showed the superiority of random search over grid search [10]. Since then, other methods have been proposed, such as evolutionary algorithms [11], Bayesian optimization [9], or radial basis functions [12].


1.1 Research Question

The research question is to quantify the ability of hyperparameter optimization algorithms to explore the search space efficiently. This involves choosing hyperparameter optimization algorithms from the literature and defining test cases and metrics to evaluate them.

The three hyperparameter optimization algorithms that will be evaluated are Random Search [10], Tree Parzen Estimator [9] and Particle Swarm Optimization [13]. These algorithms are defined in Chapter 3.

More specifically, we study the following question:

• Among Random Search, Tree Parzen Estimator and Particle Swarm Optimization, which hyperparameter optimization algorithm explores the search space most efficiently?

1.2 Scope and Limitations

The team of the company where this thesis was conducted works on audio classification and is confronted with the problem of hyperparameter optimization, which is a difficult and time-consuming task. CNN are particularly suitable for audio classification, and the same holds for image classification. In order to also evaluate hyperparameter optimization algorithms on a more well-known task and dataset, an image classification task will be tackled as well. Therefore this thesis focuses on this kind of ANN. However, the problem of hyperparameter optimization is not fundamentally different between generic ANN and CNN.

Because of time and computational resource constraints, this work is limited to two test cases (image classification and sound classification), which may limit the generality of the study. For the same reasons, not all hyperparameter optimization algorithms from the literature will be evaluated; a reasonable number of algorithms that rely on diverse approaches will be selected.

This project focuses on the hyperparameter optimization of convolutional neural networks and not on the network structure itself. Therefore rather simple architectures such as LeNet [14] and VGG nets [15] will be used.


1.3 Thesis Outline

This thesis intends to provide readers with a scientific background (students, engineers, researchers) with solutions to automatically optimize hyperparameters.

Given the complexity of the problem (high dimensionality, continuous variables), global optimization is almost impossible. Therefore local optimization, which relies on few observations, will be used.

The goal of this thesis is to present and evaluate solutions to automatically tune the hyperparameters. These solutions will be evaluated on their ability to explore the most promising areas of the search space, that is to say the areas of the hyperparameter search space that yield the best-performing CNN.

First, Chapter 2 gives a background on ANN and CNN. Then Chapter 3 presents the related work on hyperparameter optimization. Chapter 4 presents the experimental setup (experiments and metrics) used to evaluate these methods. Finally, the results are shown and analyzed in Chapter 5, and discussed in Chapter 6.


Chapter 2

Background

2.1 Artificial Neural Networks

2.1.1 From Biological Neural Networks to Artificial Neural Networks

Artificial neural networks [16] have become very popular in the field of machine learning. Introduced in the late 1940s and early 1950s by McCulloch and Pitts [1], Hebb [2] and Rosenblatt [3], they are inspired by biological neural networks. A biological neuron receives electrical or chemical signals from other neurons through its dendrites and synapses.

Then this input signal is processed and an output signal is transmitted to other neurons through an axon (A on Fig. 2.1). These neurons are interconnected to constitute a neural network (C on Fig. 2.1).

Similarly, an artificial neuron is a mathematical operator which receives inputs from other artificial neurons. Then, the artificial neuron processes these inputs to produce an output: a weighted sum of the inputs is computed, the weights representing the strength of the connections between the artificial neurons. Then a bias or a threshold can be applied to this weighted sum (B on Fig. 2.1). The resulting output can be transmitted to other neurons to create an artificial neural network (D on Fig. 2.1).


Figure 2.1: A: representation of a biological neuron. B: representation of an artificial neuron. C: representation of two interconnected biological neurons. D: representation of interconnected artificial neurons [17].

Artificial neural networks are usually composed of several layers which are stacked. The input layer receives the input data; it can be an image, a sequence of characters, a matrix or a tensor. The output layer produces the final result; it can be a vector of class probabilities, for instance. Between the input layer and the output layer, there are hidden layers. Each hidden layer receives as input the output of the previous layer and applies a transformation to this input to produce an output. The transformation from an input to an output depends on the nature of the layer (fully connected layer, convolutional layer, pooling layer, etc.).

The layers which will be used further in this thesis are defined in 2.1.2.

2.1.2 ANN layers

2.1.2.1 Fully Connected Layer

A fully connected layer composed of $m$ neurons takes as input a vector

$$x = \begin{pmatrix} x_1 \\ \vdots \\ x_{n-1} \\ 1 \end{pmatrix} \in \mathbb{R}^n$$

It returns an output vector

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} \in \mathbb{R}^m$$

such that

$$y = Wx \quad \text{with} \quad W = \begin{pmatrix} w_{1,1} & \dots & w_{1,n} \\ \vdots & \ddots & \vdots \\ w_{m,1} & \dots & w_{m,n} \end{pmatrix} \in \mathbb{R}^{m \times n}$$

Each of the $m$ rows of $W$ corresponds to a neuron and the coefficient $w_{i,j}$ defines the connection between the input $x_j$ and the output $y_i$. The term "1" in the vector $x$ is a way to include a bias.
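As an illustration, the computation above can be written in a few lines of NumPy; this is only a sketch of the operation, with arbitrary example sizes.

```python
import numpy as np

def fully_connected(x, W):
    """Apply a fully connected layer: y = W x.

    x is the input vector with a trailing 1 appended to absorb the bias,
    W has shape (m, n) where m is the number of neurons.
    """
    return W @ x

# Example: n = 4 inputs (3 features + the constant 1), m = 2 neurons
x = np.array([0.5, -1.2, 3.0, 1.0])   # last component is the bias term "1"
W = np.random.randn(2, 4)             # one row per neuron
y = fully_connected(x, W)
print(y.shape)                        # (2,)
```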

2.1.2.2 2D Convolution

2D convolution is the main component of Convolutional Neural Networks [16]. CNN belong to the family of Artificial Neural Networks and are composed of stacked convolutional layers. These layers are very good at feature extraction, which is useful for many machine learning tasks (e.g., classification, detection). CNN have shown the best results in recent machine learning challenges such as the ImageNet Large Scale Visual Recognition Challenge [18].

A convolutional layer composed of $n$ convolution filters takes as input a tensor $x \in \mathbb{R}^{w \times h \times c}$, where $w$ is the width of the tensor, $h$ its height and $c$ its number of channels.

For $1 \leq i \leq n$, the $i$-th convolutional filter is a tensor $f_i \in \mathbb{R}^{w' \times h' \times c}$ (also referred to as a 'kernel'). The number of channels of the convolutional filters must be the same as the number of channels of the input tensor.


It returns an output tensor $y \in \mathbb{R}^{w \times h \times n}$ [16, equation 9.7].

Each convolution filter "slides" over the input. For each area of the input, it returns the dot product between this area and the convolution filter. The figure below presents an example with a one-channel input.

Figure 2.2: A convolutional filter slides over an input image (on the left) to output another image (on the right).

The dimensions of the convolutional filters $w'$ and $h'$ are usually much smaller than the dimensions of the input $w$ and $h$. Therefore a convolutional layer contains much fewer parameters than a fully-connected layer. The smallness of the filters also makes it possible to extract local properties of the input (lines, corners, blobs, etc.), which is useful for feature extraction.
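To make the sliding dot product concrete, here is a minimal NumPy sketch for a single-channel input with no padding and stride 1; the sizes are illustrative.

```python
import numpy as np

def conv2d_single_channel(image, kernel):
    """Valid convolution of a 2D image with a 2D kernel (no padding, stride 1)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product between the kernel and the image patch it covers
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(28, 28)        # e.g. an MNIST-sized gray-scale image
kernel = np.random.rand(5, 5)         # a 5 x 5 filter as in LeNet-1
print(conv2d_single_channel(image, kernel).shape)   # (24, 24)
```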

2.1.2.3 Pooling Layer

Usually, for typical ANN tasks (image classification, sound tagging, etc.), the input space has a high dimension. For instance, a 32 × 32 pixel image is an element of $\mathbb{R}^{32 \times 32}$, which is in bijection with $\mathbb{R}^{1024}$. On the other hand, the output space usually has a low dimension (number of possible classes, number of possible tags, etc.).

Convolution layers increase the dimension of the input if the number of convolution filters is greater than the number of channels. Therefore it is useful to add layers which reduce the dimension of the data. Pooling layers are meant for that.

Like convolution layers, a pooling layer "slides" over the input. For each area of the input, it returns a single value and therefore reduces the dimension of the input. This value can be the mean of the area (average pooling), the maximum of the area (max pooling), etc.

Figure 2.3: Max Pooling returns the largest value of each area whereas Average Pooling returns the average value of each area.

2.1.2.4 Activation Layer

Fully-connected layers and convolution layers are linear transformations. Stacking fully-connected layers would result in another fully-connected layer, which would also be a linear transformation. The purpose of activation layers is to introduce non-linearity in ANN in order to approximate non-linear functions.

Different activation functions exist:

• sigmoid: $x \in \mathbb{R} \longmapsto \frac{1}{1+e^{-x}} \in [0, 1]$

• tanh: $x \in \mathbb{R} \longmapsto \frac{1-e^{-2x}}{1+e^{-2x}} \in [-1, 1]$

• relu: $x \in \mathbb{R} \longmapsto \max(0, x) \in [0, +\infty[$

A common problem with sigmoid and tanh is that they saturate for large input values, which may cause problems for gradient backpropagation (see 2.2.3). relu is designed to avoid this.


2.1.2.5 Batch Normalization Layer

During the training of an ANN (see 2.2.3), the parameters of the layers are modified. As a consequence, the distribution of the input of each layer changes during the training; this is the "internal covariate shift". Batch Normalization [19] normalizes the input data of each layer to solve this problem. Batch Normalization also helps the input values of activation functions stay out of the saturation areas of these functions.

2.1.2.6 Flatten Layer

This is a simple layer that takes a tensor as input and flattens it to output a vector. This layer is usually placed before a fully-connected layer since fully-connected layers take vectors as input.

2.2 ANN Training

2.2.1 Supervised Learning

In this thesis, all the machine learning tasks are done in the context of supervised learning. This means that the data used contain both input data (images, sounds, etc.) and their corresponding labels (classes, tags, etc.).

2.2.2 Loss Function

The purpose of the loss function is to quantify the error between the true output $y^{true}$ and the prediction of the ANN $y^{pred}$.

In the case of a multi-label classification task, $y^{true}, y^{pred} \in \mathbb{R}^{n \times d}$ where $n$ is the number of samples and $d$ the number of labels. $y^{true}_{i,j}$ is 1 if the label $j$ is present in the sample $i$, 0 if not. $y^{pred}_{i,j}$ is the predicted probability of presence of the label $j$ in the sample $i$.

There are different loss functions. One widely used loss function is binary crossentropy:

$$\mathrm{bce}(y^{pred}, y^{true}) = -\frac{1}{nd}\sum_{i=1}^{n}\sum_{j=1}^{d}\left[y^{true}_{i,j}\log(y^{pred}_{i,j}) + (1-y^{true}_{i,j})\log(1-y^{pred}_{i,j})\right]$$

This loss function will be optimized during the training (see 2.2.3).
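As an illustration, the binary crossentropy above can be computed with NumPy as follows; the clipping constant eps is an implementation detail added to avoid log(0), not part of the definition.

```python
import numpy as np

def binary_crossentropy(y_pred, y_true, eps=1e-12):
    """Mean binary crossentropy over n samples and d labels.

    y_pred and y_true have shape (n, d); y_pred contains probabilities.
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3]])
print(binary_crossentropy(y_pred, y_true))
```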


2.2.3 Optimization of the Loss Function: Gradient Backpropagation

The goal of the training is to minimize the loss function $L_W$ on a training set with respect to the parameters of the ANN, referred to as $W$. A widely used learning algorithm is gradient backpropagation [20]. First, the parameters $W$ are initialized randomly. Then, iteratively, inputs $x_i$ and their corresponding labels $y_i$ are randomly sampled and $\nabla_W L_W(x_i, y_i)$, the gradient of the loss at $(x_i, y_i)$ with respect to $W$, is computed. The weights $W$ are updated in the opposite direction of this gradient in order to minimize the loss.

Inputs: X = x_1, ..., x_n, the input data
        Y = y_1, ..., y_n, the labels
        η, the learning rate
        epochs, the number of iterations
Result: W, the model parameters

Initialize randomly W, the parameters of the ANN
for i ← 1 to epochs do
    Sample (x_i, y_i) from (X, Y)
    Update the parameters: W = W − η ∇_W L_W(x_i, y_i)
end
return W

Figure 2.4: Stochastic Gradient Descent Algorithm

$\eta$ is the learning rate, which defines "how much" is learned at every update. A low learning rate may result in slow learning whereas a high learning rate may result in unstable learning.

Figure 2.5: Impact of the learning rate on the training [21].
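As an illustration of the algorithm in Fig. 2.4, here is a NumPy sketch applied to a toy linear model with a squared loss; the model, loss and hand-derived gradient are placeholders chosen for brevity, not the CNNs studied in this thesis.

```python
import numpy as np

def sgd(X, Y, eta=0.01, epochs=1000):
    """Stochastic gradient descent on a linear model with squared loss.

    The gradient of L(W) = (W.x - y)^2 with respect to W is 2 (W.x - y) x.
    """
    n, d = X.shape
    W = np.random.randn(d)                     # random initialization
    for _ in range(epochs):
        i = np.random.randint(n)               # sample (x_i, y_i)
        grad = 2.0 * (W @ X[i] - Y[i]) * X[i]  # gradient of the loss at (x_i, y_i)
        W = W - eta * grad                     # step against the gradient
    return W

X = np.random.randn(100, 3)
true_W = np.array([1.0, -2.0, 0.5])
Y = X @ true_W
print(sgd(X, Y))                               # should approach true_W
```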


Stochastic gradient descent can also be improved by introducing a momentum: at each update, the parameters $W$ are updated by a moving average of the current gradient and the past gradients. This enables faster convergence, with fewer oscillations, towards a local minimum of the loss function. Adam [22] is one algorithm that includes this notion of momentum.

2.2.4 Regularization Techniques

A common problem when training an ANN is that the network performs very well on the training data but poorly on unseen data. This is called overfitting. There are several methods to avoid overfitting and build more robust networks.

Batch Learning

Instead of sampling input data one by one during stochastic gradient descent, data are sampled in batches. Then, at each epoch, the gradient of the loss is computed on a whole batch. This enables a more stable learning since it averages the gradient over several inputs and may attenuate the impact of very specific data.

Weight Decay

During the training, some weights of the ANN may become very large and this may be a sign of overfitting. Weight decay consists in adding a penalty term to the loss function so that the weights do not become too large.

In the case of an ANN with one fully-connected layer, with an input $x$, a predicted output $y^{pred} = Wx$ (see 2.1.2.1), a true output $y^{true}$, and $\mathrm{loss}$ as loss function, the loss becomes

$$\mathrm{loss}(y^{pred}, y^{true}) + \beta \|W\|_2^2$$

with

$$\|W\|_2^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} w_{i,j}^2 \quad \text{given} \quad W = \begin{pmatrix} w_{1,1} & \dots & w_{1,n} \\ \vdots & \ddots & \vdots \\ w_{m,1} & \dots & w_{m,n} \end{pmatrix} \quad (\ell_2\text{-norm})$$

Other norms (e.g., the $\ell_1$-norm) can be used.

The hyperparameter that quantifies the strength of the regularization is $\beta$. A large $\beta$ means a strong regularization.
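In Keras, which is used later in this thesis, weight decay as described above is typically expressed through a kernel regularizer. The following is only a minimal sketch; the layer sizes and the value of β are illustrative.

```python
from tensorflow.keras import layers, models, regularizers

beta = 1e-4  # regularization strength (illustrative value)

model = models.Sequential([
    layers.Input(shape=(784,)),
    # the L2 penalty beta * ||W||^2 is added to the loss for this layer
    layers.Dense(10, activation="sigmoid",
                 kernel_regularizer=regularizers.l2(beta)),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```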


Dropout

During the training, some connections between neurons are temporarily set to zero. This helps build a more robust network since it forces the information to flow through different paths. The dropout rate defines the proportion of connections which will be set to zero at each update [23].

Figure 2.6: an ANN with and without dropout [23].

2.3 Parameters and Hyperparameters

The performance of an ANN depends a lot on another class of parameters, which are called the hyperparameters. These hyperparameters must be initialized before the training and they are not modified during the training.

They can define the topology of the network (depth, number of units by layer, size of the convolutional filters, dropout rate) or the learning configuration (learning rate, number of epochs, size of the mini-batch).

When the complexity of the network increases, the number of hyperparameters also increases. The hyperparameter optimization task then becomes more complex since the search space becomes larger [7],[8]. Another issue is that a network may combine both qualitative and quantitative hyperparameters.

Without prior knowledge, it is difficult to know which areas of the search space will be the most promising (i.e. the areas where the hyperparameters yield well-performing CNN). Therefore hand-tuning may lead to hyperparameter evaluations of limited usefulness and, as a consequence, to a very time-consuming process.


Some solutions have been proposed in order to make the search more efficient [9],[10],[11],[12]. They are presented in the following chapter: Related Work.


Chapter 3

Related Work

Chapter 3 presents the related work on hyperparameter optimization. First, the problem of hyperparameter optimization is defined. Then, three hyperparameter optimization algorithms from the literature are presented.

3.1 The Problem of Hyperparameter Optimization

3.1.1 Definition of the problem

In a typical classification problem with ANN, the data is split into a training set $X_{train}$ and a validation set $X_{validation}$.

The goal is to find the parameters $W$ which minimize a loss function $L(W|X_{validation})$ on the validation set $X_{validation}$. The parameters $W$ are obtained by a learning algorithm $A$ defined by its hyperparameters $\lambda$ and trained on the training set $X_{train}$. Then $W = A(X_{train}|\lambda)$ [10].

The model parameterized by $W$ is trained on the training set $X_{train}$ but evaluated on its performance on the validation set $X_{validation}$. The motivation is to obtain a model with a good generalization capability, that is to say a model that performs well on unseen data. Indeed, it is a common issue in machine learning that the model learns too closely from the training data: this is called overfitting. The opposite phenomenon is underfitting, which occurs when the model does not capture the underlying structure of the training data. Usually, overfitting occurs when the model contains more parameters than can be justified by the data whereas underfitting occurs when the model contains too few parameters. A bad hyperparameter may result in either underfitting or overfitting.

The goal of hyperparameter selection is to find $\lambda^*$ which minimizes $L(W|X_{validation})$ [24]:

$$\lambda^* = \arg\min_{\lambda} L(W|X_{validation}) = \arg\min_{\lambda} L(A(X_{train}|\lambda)\,|\,X_{validation}) \ [10] \quad (3.1)$$

It is important to understand the difference between the parameters $W$ and the hyperparameters $\lambda$. The parameters $W$ are modified during the training whereas the hyperparameters $\lambda$ are constant during the training. So ANN training is a problem of learning whereas hyperparameter optimization is a problem of meta-learning: the goal is to learn to learn.

Usually, there are several hyperparameters in an ANN. From now on in this thesis the term λ will refer to a set of hyperparameters, that is to say, a vector of scalar hyperparameters.

3.1.2 Main issues

3.1.2.1 The Curse of Dimensionality

The curse of dimensionality is a phenomenon that appears when solving problems in high-dimensional spaces [7],[8]. The problem is that the volume of a space grows exponentially with its dimension. This makes the exploration of high-dimensional spaces difficult.

For instance, in order to sample $[0, 1]^2$ evenly with no more than a 0.1 distance between two points, $10^2 = 100$ points are required. In order to sample $[0, 1]^{10}$ evenly with no more than a 0.1 distance between two points, $10^{10} = 10$ billion points are required.

3.1.2.2 The Cost of Observations

In order to evaluate the performance of a set of hyperparameters, an artificial neural network parameterized with these hyperparameters must be trained on a training set. This training is very time-consuming since usually many layers with many parameters to train are involved. Even with large computational resources, this training may take minutes or hours. Therefore, in order to be done in a reasonable amount of time, the optimization process has to rely on few observations.

3.1.2.3 The Score Function

Usually, the score function which is used to evaluate the performance of a set of hyperparameters is a metric such as accuracy, F-score, AUC, etc. [25]. These metrics are not differentiable with respect to the hyperparameters. Therefore, gradient-based optimization methods like stochastic gradient descent (see 2.2.3) cannot be used.

3.2 Existing Solutions for Hyperparameter Optimization

When it comes to hyperparameter optimization, all approaches can be split into model-free and model-based. Model-based techniques build a model of the hyperparameter space during its exploration whereas model-free algorithms do not use knowledge about the solution space during the optimization. The terms 'model-free' and 'model-based' are also used in the field of reinforcement learning [26]. In this thesis, 'model-free' and 'model-based' have the meaning defined above.

The figure below shows different approaches and their corresponding algorithms: grid search [10], random search [10], Tree Parzen Estimator [9] (referred to as TPE), Spearmint [27], Radial Basis Functions (RBF) Surrogate Model [12], Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [28], and Particle Swarm Optimization (PSO) [13].


Figure 3.1: Approaches for the automated hyper-parameter selection [11]

This section presents two model-free solutions (grid search and random search) and two model-based solutions (TPE, PSO).

It has been chosen to focus on these solutions in order to study very diverse approaches: naive solutions as a starting point (grid search/random search) and more complex solutions (TPE for a Bayesian approach, PSO for a non-Bayesian approach).

3.2.1 Model-free Algorithms

3.2.1.1 Grid Search

A range of values is defined for each component of the hyperparameter and a grid is defined in the search space. Then the network is trained for every hyperparameter in the grid. A logarithmic scale can be used to cover a larger range of values.

Commonly, several grid searches are sequentially performed. The user first performs a grid search with a large range of values, then refines the search in the most promising area of the search space.

Grid search is easy to implement; however, it suffers from the curse of dimensionality. For a hyperparameter space of dimension $k$ with $n$ values tested for each dimension, $n^k$ networks need to be trained. Therefore the complexity grows exponentially with the dimension. Grid search is reliable in low-dimensional spaces [10].


Another practical issue with grid search is the lack of flexibility in the number of hyperparameters to be evaluated (it is $n^k$ with $k$ the search space dimension and $n$ an integer). For instance, with $k = 5$, the number of hyperparameters which are evaluated can be $2^5 = 32$, $3^5 = 243$ or $4^5 = 1024$. No intermediate values like 50 or 100 can be used.
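A minimal sketch of grid search over two hyperparameter dimensions is shown below; objective() is a placeholder for the expensive train-and-validate step, and the candidate values are illustrative.

```python
import itertools

def objective(hyperparams):
    """Placeholder for training a network with these hyperparameters
    and returning its validation error rate."""
    lr, batch_size = hyperparams
    return (lr - 0.01) ** 2 + (batch_size - 64) ** 2 / 1e4   # dummy score

# one list of candidate values per hyperparameter dimension
learning_rates = [0.1, 0.01, 0.001, 0.0001]        # logarithmic scale
batch_sizes = [16, 32, 64, 128]

grid = list(itertools.product(learning_rates, batch_sizes))  # n^k combinations
best = min(grid, key=objective)
print("best hyperparameters:", best)
```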

3.2.1.2 Random Search

A range of values is defined for each component of the hyperparameter and n hyperparameters are sampled randomly in this range. Then the network is trained for every hyperparameter. The benefit is that n different values are tested for each dimension.

Therefore random search gives a better coverage than grid search. Bergstra and Bengio showed that random search has a greater chance of finding effective values for each hyperparameter component [10] (see Fig. 3.2). Like grid search, random search is also easy to implement.

Fig. 3.2 shows the results of Bergstra and Bengio's experiments [10]. They created 100 search problems. For each problem, a hyper-rectangle of dimension 5 (to simulate a problem of hyperparameter optimization in dimension 5) is designed and a volume representing 1% of the hyper-rectangle is sampled. The goal is to find this volume. Then, for each problem, up to 512 points are sampled in the hyper-rectangle using a search algorithm (grid search, random search, quasi-random search). If the 1% volume is reached, it is a success. The success rate is the rate of success over the 100 problems. One can see that the success rate is much greater with random search and quasi-random search.


Figure 3.2: The efficiency in simulation of grid search, random search and quasi-random search [10]

If the number of samples is low, some areas of the search space may not be explored. This problem can be solved using quasi-random sequences, which are designed to cover the search space evenly. An example is Latin hypercube sampling: in order to sample n points from a d-dimensional space, each of the d dimensions is divided into n intervals and the sampling is done so that each of the n intervals is occupied by exactly one point.

Figure 3.3: Example of Latin hypercube sampling: each of the 4 rows and each of the 4 columns is occupied by a point.
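The two sampling schemes can be sketched as follows in NumPy; both return points in the unit hypercube [0, 1]^d, which is also the normalized search space used later in Chapter 4.

```python
import numpy as np

def random_search_samples(n, d, rng=None):
    """Sample n hyperparameter candidates uniformly in [0, 1]^d."""
    rng = np.random.default_rng(rng)
    return rng.random((n, d))

def latin_hypercube_samples(n, d, rng=None):
    """Latin hypercube sampling: each of the n intervals of each dimension
    is occupied by exactly one point."""
    rng = np.random.default_rng(rng)
    samples = np.empty((n, d))
    for j in range(d):
        # one point per interval [i/n, (i+1)/n), in shuffled order
        points = (rng.permutation(n) + rng.random(n)) / n
        samples[:, j] = points
    return samples

print(random_search_samples(5, 2, rng=0))
print(latin_hypercube_samples(4, 2, rng=0))
```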


3.2.2 Model-based Algorithms

3.2.2.1 Bayesian Optimization: Tree Parzen Estimator (TPE)

Sequential Model-Based Optimization (SMBO)

Bayesian approaches use Sequential Model-Based Optimization (SMBO) [9]. SMBO tries to minimize the validation loss y by sequentially selecting different hyperparameters using Bayesian reasoning.

The algorithm is composed of 3 main steps:

• step 1: given an approximation $\hat{y}$ of the validation loss, the most promising hyperparameter $\lambda$ is computed; it will be the next guess. The meaning of "most promising" is defined by an acquisition function S, which tries to find a balance between exploring new areas of the search space and exploiting areas that are already known to have favourable values.

• step 2: the loss function is evaluated for the hyperparameter λ. This is the most time-consuming step.

• step 3: given all the previous observations, the approximation of the loss function ˆy is updated.

This process is repeated until the maximum number of iterations is reached.


Algorithm: SMBO

Inputs: ŷ_0, the initial surrogate model
        max_iter, the maximum number of iterations
        S, the acquisition function
Result: λ_best, the best hyperparameter

H ← ∅                                   # H is the history of observations
for t ← 1 to max_iter do
    λ ← argmax_λ S(λ, ŷ_{t−1})          # step 1
    Evaluate y(λ)                        # step 2
    H ← H ∪ {(λ, y(λ))}
    Fit a new model ŷ_t to H             # step 3
end
λ_best = argmax_λ S(λ, ŷ_{max_iter})
return λ_best

Figure 3.4: SMBO algorithm

Example:

SMBO is illustrated in Fig. 3.5. The goal is to find the minimum of the solid line. After 5 observations, a model $\hat{y}_5$ has been fitted to the data; the dashed line is the expectation of $\hat{y}_5$. The acquisition function of $\hat{y}_5$ is computed, and the maximum of this acquisition function is the next point to be observed. After this observation, a new model $\hat{y}_6$ is computed and the process is repeated.

Figure 3.5: Example of an optimization problem using SMBO.


TPE

With TPE [9], the acquisition function is the Expected Improvement (EI):

$$EI_{y^*}(\lambda) = \int_{-\infty}^{y^*} (y^* - \hat{y})\, p(\hat{y}|\lambda)\, d\hat{y} \quad (3.2)$$

where $\lambda$ is a set of hyperparameters, $\hat{y}$ is a value of the validation loss, and $y^*$ is a value of the validation loss which is higher than the lowest observed.

Intuitively, one can see that the EI is high if both $(y^* - \hat{y})$ is high (which means a large improvement) and $p(\hat{y}|\lambda)$ is high (which means a likely improvement given the hyperparameter $\lambda$).

TPE does not model $p(\hat{y}|\lambda)$ directly. Instead it uses Bayes' rule:

$$p(\hat{y}|\lambda) = \frac{p(\lambda|\hat{y})\, p(\hat{y})}{p(\lambda)}$$

where:

$$p(\lambda|\hat{y}) = \begin{cases} l(\lambda) & \text{if } \hat{y} < y^* \\ g(\lambda) & \text{if } \hat{y} \geq y^* \end{cases} \quad (3.3)$$

and:

$$p(\lambda) = \int_{-\infty}^{+\infty} p(\lambda|\hat{y})\, p(\hat{y})\, d\hat{y} = \gamma\, l(\lambda) + (1-\gamma)\, g(\lambda) \quad \text{with } \gamma = p(\hat{y} < y^*) \quad (3.4)$$

Therefore,

$$\begin{aligned}
EI_{y^*}(\lambda) &= \int_{-\infty}^{y^*} (y^* - \hat{y})\, p(\hat{y}|\lambda)\, d\hat{y} \\
&= \int_{-\infty}^{y^*} (y^* - \hat{y})\, \frac{p(\lambda|\hat{y})\, p(\hat{y})}{p(\lambda)}\, d\hat{y} \\
&= \frac{1}{p(\lambda)} \int_{-\infty}^{y^*} (y^* - \hat{y})\, p(\lambda|\hat{y})\, p(\hat{y})\, d\hat{y} \\
&= \frac{1}{p(\lambda)}\, l(\lambda) \int_{-\infty}^{y^*} (y^* - \hat{y})\, p(\hat{y})\, d\hat{y} \\
&= \frac{1}{p(\lambda)}\, l(\lambda)\, \alpha, \quad \alpha \text{ a constant w.r.t. } \lambda,\ \geq 0 \\
&= \frac{l(\lambda)\, \alpha}{\gamma\, l(\lambda) + (1-\gamma)\, g(\lambda)} \\
&= \frac{\alpha}{\gamma + (1-\gamma)\frac{g(\lambda)}{l(\lambda)}}
\end{aligned} \quad (3.5)$$

So in order to maximize EI, the hyperparameters to be evaluated should have a low probability under $g(\lambda)$ and a high probability under $l(\lambda)$.


TPE in practice

First, a search space is defined for each hyperparameter component in order to define the range of the search and to sample the first hyperparameters. Then hyperparameters are iteratively sampled using SMBO. These samples are divided into two groups based on the validation loss. By default, the best 25% (which means γ = 0.25) go to the "good group" and the others go to the "bad group".

The two densities l(λ) and g(λ), respectively for the "good group" and the "bad group", are built: each sample defines a Gaussian distribution and all the Gaussian distributions of a group are stacked.

Figure 3.6: Construction of a density based on 3 samples

The hyperparameter that minimizes $\frac{g(\lambda)}{l(\lambda)}$ is the next guess, and the two densities l(λ) and g(λ) are updated until the maximum number of iterations is reached.
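The thesis uses the hyperopt library [35] for TPE (see Chapter 4). A minimal usage sketch is given below; the search space and the objective are illustrative placeholders, not the actual experiments.

```python
from hyperopt import fmin, tpe, hp, Trials

def objective(params):
    """Placeholder for training a network with these hyperparameters
    and returning its validation error rate (to be minimized)."""
    return (params["dropout_rate"] - 0.3) ** 2 + (params["log10_l2_reg"] + 4) ** 2 / 100

space = {
    "dropout_rate": hp.uniform("dropout_rate", 0.0, 1.0),
    "log10_l2_reg": hp.uniform("log10_l2_reg", -10.0, 0.0),
}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)
```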

3.2.2.2 Evolutionary Algorithm: Particle Swarm Optimization (PSO)

Particle Swarm Optimization (PSO) is a population-based optimization method. It was created by Kennedy, Eberhart and Shi [29][30] and it was inspired by animal social groups like herds, schools and flocks. The swarm is composed of particles which have a position in the search space. Particles move in the search space and cooperate according to simple mathematical formulas in order to find an optimal solution.

In the case of hyperparameter optimization, a particle’s coordinate corresponds to a hyperparameter in the hyperparameter search space.


In a D-dimensional search space:

• the position of the $i$-th particle at time $t$ is $\lambda_i(t) = (\lambda_{i,1}(t), ..., \lambda_{i,D}(t))$

• the best position so far of the $i$-th particle at time $t$ is $pbest_i(t) = (pbest_{i,1}(t), ..., pbest_{i,D}(t))$

• the best position so far of the whole swarm at time $t$ is $gbest(t) = (gbest_1(t), ..., gbest_D(t))$

First, the positions of the particles are randomly initialized in the search space and then the particles move in the search space. At each iteration, the position of the $i$-th particle is computed as follows:

$$\lambda_i(t+1) = \lambda_i(t) + v_i(t) \quad (3.6)$$

$$v_i(t+1) = \omega v_i(t) + c_1 r_1 \left(pbest_i(t) - \lambda_i(t)\right) + c_2 r_2 \left(gbest(t) - \lambda_i(t)\right) \quad (3.7)$$

$v_i(t)$ is the velocity of the $i$-th particle at time $t$, $\omega$ is an inertia weight scaling the previous time step velocity, $c_1$ and $c_2$ are two acceleration coefficients that scale the influence of the best personal position of the particle $pbest_i(t)$ and the best global position $gbest(t)$, and $r_1$ and $r_2$ are random variables within the range of 0 and 1.

The first term of the velocity, $\omega v_i(t)$, is an inertia term that creates a balance between the previous movement and the changes of direction caused by new information.

The second term, $c_1 r_1 (pbest_i(t) - \lambda_i(t))$, is the "cognition" part, which represents the private thinking of the particle itself.

The third term, $c_2 r_2 (gbest(t) - \lambda_i(t))$, is the "social" part, which represents the collaboration among the particles.


Figure 3.7: Representation of the update of a particle velocity [31]
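A compact NumPy sketch of the update rules (3.6) and (3.7) is given below; the coefficient values, the clipping to [0, 1] and the toy objective are illustrative choices, not prescribed by the thesis.

```python
import numpy as np

def pso(objective, dim, n_particles=5, n_iters=10,
        omega=0.5, c1=1.5, c2=1.5, rng=None):
    """Minimize `objective` over [0, 1]^dim with Particle Swarm Optimization."""
    rng = np.random.default_rng(rng)
    pos = rng.random((n_particles, dim))            # random initial positions
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()                              # best position of each particle
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[np.argmin(pbest_val)].copy()      # best position of the swarm

    for _ in range(n_iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # velocity update: inertia + cognition + social terms (eq. 3.7)
        vel = omega * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        # position update (eq. 3.6); clipping keeps particles in the unit
        # hypercube (a convenience added here, not part of the equations)
        pos = np.clip(pos + vel, 0.0, 1.0)
        values = np.array([objective(p) for p in pos])
        improved = values < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], values[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest

print(pso(lambda x: np.sum((x - 0.7) ** 2), dim=3, rng=0))
```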


Chapter 4

Methods

Chapter 4 presents the methodology that is used to evaluate and compare the hyperparameter optimization algorithms. First, the MNIST dataset, which is used to evaluate the algorithms, is introduced. Then, three hyperparameter optimization experiments are presented. Finally, the metrics that will be used for the analysis of the results are defined. Another use case, audio classification, is also introduced.

4.1 A Conventional Benchmark: MNIST Image Classification Task

4.1.1 Data and Task

In order to obtain comparable results and draw conclusions that may be generalized, it has been chosen to test these algorithms on a widely used dataset: the MNIST dataset [4].

The MNIST (Modified National Institute of Standards and Technology) dataset is composed of 70000 images of handwritten digits (0 to 9). Each image is a 28 × 28 pixel gray-scale image. The dataset is divided into a training set (60000 samples) and a validation set (10000 samples).

Given a 28 × 28 input image, the task is to predict the corresponding digit.


Figure 4.1: Examples of MNIST images [4]

4.1.2 Starting Point: LeNet-1

The major focus of this thesis is to study hyperparameter optimization rather than to study complex convolutional neural networks. Moreover, the computational resources provided for this thesis were limited. Therefore a rather simple network has been chosen as a starting point: LeNet-1 [32]. It is a simple network since it contains a reasonable number of parameters, which enables quick training and avoids memory issues.

Created in 1995 by LeCun et al., LeNet-1 was one of the first CNN tested on the MNIST dataset. It showed good results (1.7% error rate) although it is not state of the art now (a 0.23% error rate has been reached [33]).

By default, the network is composed of 7 layers:

| Layer | Type | # conv. filters | Kernel size | Padding | Activation | Number of dense units |
|-------|------|-----------------|-------------|---------|------------|-----------------------|
| 1 | Input | - | - | - | - | - |
| 2 | Conv2D | 4 | 5 × 5 | no padding | tanh | - |
| 3 | Average Pooling2D | - | 2 × 2 | - | - | - |
| 4 | Conv2D | 12 | 5 × 5 | no padding | tanh | - |
| 5 | Average Pooling2D | - | 2 × 2 | - | - | - |
| 6 | Flatten | - | - | - | - | - |
| 7 | Dense | - | - | - | sigmoid | 10 |

Table 4.1: Detailed architecture of LeNet-1


Figure 4.2: General Structure of LeNet-1 [32]

4.1.3 Experiments

4.1.3.1 Experiment 1: Optimizing LeNet-1 with Variable Complexity

Motivation

In this experiment, the overall structure of LeNet-1 (two convolutional layers and one dense layer) is fixed but the number of parameters of each layer can vary. The goal of this task is to find the network complexity that achieves the best results given a fixed architecture.

Network

For this experiment, no additional layer is added to LeNet-1 but the sizes of the two convolutional layers are not fixed. Four hyperparameters can vary:

• nconv1: the number of convolutional filters in the first convolutional layer (integer ranging from 1 to 100).

• size_conv1: the width of the convolutional filters in the first convolutional layer (integer ranging from 2 to 8). The convolutional filters are assumed to be square, thus the kernel size is size_conv1 × size_conv1.

• nconv2: the number of convolutional filters in the second convolutional layer (integer ranging from 1 to 100).


• size_conv2: the width of the convolutional filters in the second convolutional layer (integer ranging from 2 to 8). The convolutional filters are assumed to be square, thus the kernel size is size_conv2 × size_conv2.

| Layer | Type | # conv. filters | Kernel size | Padding | Activation | Number of dense units |
|-------|------|-----------------|-------------|---------|------------|-----------------------|
| 1 | Input | - | - | - | - | - |
| 2 | Conv2D | **nconv1** | **size_conv1 × size_conv1** | no padding | tanh | - |
| 3 | Average Pooling2D | - | 2 × 2 | - | - | - |
| 4 | Conv2D | **nconv2** | **size_conv2 × size_conv2** | no padding | tanh | - |
| 5 | Average Pooling2D | - | 2 × 2 | - | - | - |
| 6 | Flatten | - | - | - | - | - |
| 7 | Dense | - | - | - | sigmoid | 10 |

Table 4.2: Detailed architecture of LeNet-1_exp1 (variable hyperparameters are in bold)

Training

The network is implemented using the Keras library [34], the loss function is the categorical crossentropy and the optimizer is the Adam optimizer [22] with a learning rate equal to 0.0001 (other parameters are set to their default values [22]). The batch size is set to 32. The metric used to evaluate the network is the error rate.
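A sketch of how such a network could be built and compiled in Keras is shown below; it follows Table 4.2 but is not the exact training script used in the thesis, and the data-loading and fitting steps are omitted.

```python
from tensorflow.keras import layers, models, optimizers

def build_lenet1_exp1(nconv1, size_conv1, nconv2, size_conv2):
    """LeNet-1-style network with variable convolutional layer sizes
    (a sketch of the architecture of Table 4.2)."""
    return models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(nconv1, (size_conv1, size_conv1),
                      padding="valid", activation="tanh"),
        layers.AveragePooling2D((2, 2)),
        layers.Conv2D(nconv2, (size_conv2, size_conv2),
                      padding="valid", activation="tanh"),
        layers.AveragePooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(10, activation="sigmoid"),
    ])

model = build_lenet1_exp1(nconv1=4, size_conv1=5, nconv2=12, size_conv2=5)
model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=32, validation_data=(x_val, y_val))
```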

4.1.3.2 Experiment 2: Optimizing LeNet-1 with Variable Regularization Capacity

Motivation

In experiment 2, the overall structure of LeNet-1 (the convolutional layers and the dense layer) is fixed. However three changes are made to improve the generalization capacity of the network: the use of dropout, the use of weight decay (L2-regularization) and a variable batch size for the training.


Network

Three hyperparameters can vary:

• l2_reg: the coefficient $\beta$ that weighs the L2-regularization (see 2.2.4). It is a float ranging from $1$ to $10^{-10}$.

• dropout_rate: the dropout rate (see 2.2.4). It is a float ranging from 0 to 1.

• batch_size: the size of the mini-batches (see 2.2.4). It is an integer ranging from 1 to 100.

| Layer | Type | # conv. filters | Kernel size | Padding | Activation | Number of dense units | Dropout rate | Weight decay |
|-------|------|-----------------|-------------|---------|------------|-----------------------|--------------|--------------|
| 1 | Input | - | - | - | - | - | - | - |
| 2 | Conv2D | 4 | 5 × 5 | no padding | tanh | - | - | - |
| 3 | Average Pooling2D | - | 2 × 2 | - | - | - | - | - |
| 4 | Conv2D | 12 | 5 × 5 | no padding | tanh | - | - | - |
| 5 | Average Pooling2D | - | 2 × 2 | - | - | - | - | - |
| 6 | Flatten | - | - | - | - | - | - | - |
| 7 | Dropout | - | - | - | - | - | **dropout_rate** | - |
| 8 | Dense | - | - | - | sigmoid | 10 | - | **l2_reg** |

Table 4.3: Detailed architecture of LeNet-1_exp2 (variable hyperparameters are in bold)

Training

The network is implemented using the Keras library, the loss function is the categorical crossentropy and the optimizer is the Adam optimizer with a learning rate equal to 0.0001. The batch size batch_size is a variable hyperparameter. The metric used to evaluate the network is the error rate.


4.1.3.3 Experiment 3: Optimizing "augmented LeNet-1" with Variable Regularization Capacity

Motivation

In order to test the scalability of hyperparameter optimization, a network with more hyperparameters has been studied. Another reason is that optimization algorithms also include hyperparameters. For instance, PSO includes 4 hyperparameters (the number of particles, the three coefficients that weigh inertia, individual behaviour, collective behaviour). TPE also includes hyperparameters, such as the quantile γ (see 3.2.2.1) for instance.

As a consequence, in order to actually reduce the complexity of the problem, these algorithms have to be able to optimize more hyperparameters than they themselves have.

The network is composed of two convolutional layers and two dense layers, but also uses dropout, batch normalization and weight decay. Therefore there are more hyperparameters. The goal is to study how the performance scales with the dimensionality.

This network is referred as "augmented LeNet-1".

Network

Eight hyperparameters can vary (see the architecture in Table 4.4):

• dp1, dp2, dp3, dp4: the four dropout rates. They are floats ranging from 0 to 1.

• l2_reg1, l2_reg2: the coefficients $\beta$ that weight the two L2-regularizations. They are floats ranging from $1$ to $10^{-10}$.

• lr: the learning rate. It is a float ranging from $10^{-1}$ to $10^{-6}$. This range has been chosen since learning rates higher than $10^{-1}$ lead to unstable learning and learning rates lower than $10^{-6}$ lead to a very slow convergence.

• batch_size: the size of the mini-batches. It is an integer ranging from 16 to 512. This range has been chosen since batch sizes lower than 16 lead to a very long training whereas batch sizes higher than 512 lead to memory issues.

| Layer | Type | # conv. filters | Kernel size | Padding | Activation | Number of dense units | Dropout rate | Weight decay |
|-------|------|-----------------|-------------|---------|------------|-----------------------|--------------|--------------|
| 1 | Input | - | - | - | - | - | - | - |
| 2 | BatchNorm | - | - | - | - | - | - | - |
| 3 | Conv2D | 32 | 3 × 3 | no padding | relu | - | - | - |
| 4 | Dropout | - | - | - | - | - | **dp1** | - |
| 5 | MaxPooling2D | - | 2 × 2 | - | - | - | - | - |
| 6 | BatchNorm | - | - | - | - | - | - | - |
| 7 | Conv2D | 64 | 3 × 3 | no padding | relu | - | - | - |
| 8 | Dropout | - | - | - | - | - | **dp2** | - |
| 9 | MaxPooling2D | - | 2 × 2 | - | - | - | - | - |
| 10 | Flatten | - | - | - | - | - | - | - |
| 11 | BatchNorm | - | - | - | - | - | - | - |
| 12 | Dropout | - | - | - | - | - | **dp3** | - |
| 13 | Dense | - | - | - | relu | 150 | - | **l2_reg1** |
| 14 | BatchNorm | - | - | - | - | - | - | - |
| 15 | Dropout | - | - | - | - | - | **dp4** | - |
| 16 | Dense | - | - | - | sigmoid | 10 | - | **l2_reg2** |

Table 4.4: Detailed architecture of "augmented LeNet-1" (variable hyperparameters are in bold)

Training

The network is implemented using the Keras library, the loss function is the categorical crossentropy and the optimizer is the Adam optimizer with a variable learning rate equal to lr. The batch size batch_size is also variable. The metric used to evaluate the network is the error rate.

4.1.4 Hyperparameter Optimization

4.1.4.1 Normalization

The hyperparameters may vary in very different ranges and be of different data types (float, integer). To create a generic method, each hyperparameter is associated with a float ranging from 0 to 1.

For experiment 1,

$$[\lambda_1, \lambda_2, \lambda_3, \lambda_4] \longrightarrow [nconv1, size\_conv1, nconv2, size\_conv2]$$

such that:

• $nconv1 = 1 + \mathrm{round}(99\lambda_1)$

• $size\_conv1 = 2 + \mathrm{round}(6\lambda_2)$

• $nconv2 = 1 + \mathrm{round}(99\lambda_3)$

• $size\_conv2 = 2 + \mathrm{round}(6\lambda_4)$

So the initial hyperparameter optimization problem becomes a problem of optimization in $[0, 1]^4$.

For experiment 2,

$$[\lambda_1, \lambda_2, \lambda_3] \longrightarrow [l2\_reg, dropout\_rate, batch\_size]$$

such that:

• $l2\_reg = 10^{-10\lambda_1}$

• $dropout\_rate = \lambda_2$

• $batch\_size = 1 + \mathrm{round}(99\lambda_3)$

So the initial hyperparameter optimization problem becomes a problem of optimization in $[0, 1]^3$.

For experiment 3,

$$[\lambda_1, ..., \lambda_8] \longrightarrow [dp1, dp2, dp3, dp4, l2\_reg1, l2\_reg2, lr, batch\_size]$$

such that:

• $dp1 = \lambda_1$

• $dp2 = \lambda_2$

• $dp3 = \lambda_3$

• $dp4 = \lambda_4$

• $l2\_reg1 = 10^{-10\lambda_5}$

• $l2\_reg2 = 10^{-10\lambda_6}$

• $lr = 10^{-(1+5\lambda_7)}$

• $batch\_size = 16 + \mathrm{round}(496\lambda_8)$

So the initial hyperparameter optimization problem becomes a problem of optimization in $[0, 1]^8$.
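A small sketch of the experiment-3 mapping, written as a Python function (the function name is illustrative):

```python
def decode_exp3(lam):
    """Map a normalized vector lam in [0, 1]^8 to the experiment-3
    hyperparameters, following the transformations listed above."""
    return {
        "dp1": lam[0], "dp2": lam[1], "dp3": lam[2], "dp4": lam[3],
        "l2_reg1": 10 ** (-10 * lam[4]),
        "l2_reg2": 10 ** (-10 * lam[5]),
        "lr": 10 ** (-(1 + 5 * lam[6])),
        "batch_size": 16 + round(496 * lam[7]),
    }

print(decode_exp3([0.2, 0.4, 0.6, 0.8, 0.5, 0.5, 0.5, 0.5]))
```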

4.1.4.2 Optimization

For the three experiments, random search, TPE and PSO are used to optimize the hyperparameters.

The algorithms for random search and PSO were implemented in Python whereas the library hyperopt [35] was used for TPE. The convolutional neural networks were implemented and trained using Keras [34] and an NVIDIA GeForce GTX 1080.


For experiments 1 and 2, which take place in low dimension, 50 sets of hyperparameters are evaluated. Random search consists in randomly sampling 50 sets of hyperparameters. In PSO, 5 particles are randomly initialized in the search space and 10 iterations are done, so 50 sets of hyperparameters are evaluated. With TPE, before the algorithm runs, the first 5 sets of hyperparameters to be evaluated are the same as the PSO initialization, so PSO and TPE can be compared with the same initial conditions.

For experiment 3, which takes place in a higher dimension, 100 sets of hyperparameters are evaluated. Random search consists in randomly sampling 100 sets of hyperparameters. In PSO, 10 particles are randomly initialized in the search space and 10 iterations are done, so 100 sets of hyperparameters are evaluated. With TPE, 100 sets of hyperparameters are evaluated as well, with the same initialization as PSO.

4.1.5 Metrics

The goal of hyperparameter optimization is to explore the most promising areas of the search space without prior knowledge. The evaluation of the algorithms and the analysis of the results will be based upon two capabilities: exploration and exploitation. Metrics that quantify these capabilities were designed and are contributions of this thesis.

4.1.5.1 Exploration

Exploration is testing large portions of the search space with the hope of finding promising solutions. It is a strategy of diversification.

Metric 1: Dispersion

The dispersion is defined as the standard deviation of the coordinates of the hyperparameters in the hyperparameter search space:

If, during the optimization process in dimension $d$, the hyperparameters $[\lambda_1, ..., \lambda_n]$, with $\lambda_i \in \mathbb{R}^d$, $1 \leq i \leq n$, are evaluated, the dispersion is defined by:

$$\mathrm{dispersion} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \|\lambda_i - \bar{\lambda}\|^2}$$

with $\bar{\lambda} = \frac{1}{n}\sum_{i=1}^{n}\lambda_i$ and $\|\cdot\|$ the L2-norm, such that $\|x\| = \sqrt{\sum_{i=1}^{d} x_i^2}$ for $x = [x_1, ..., x_d] \in \mathbb{R}^d$.

Metric 2: Number of Intervals Explored

For each dimension of the search space, the range [0, 1] is divided into two intervals [0, 0.5] and [0.5, 1]. This division is applied for each dimension, so for a d-dimensional space, there are $2^d$ intervals.

Figure 4.3: Example of the "Number of Intervals Explored" in 2D ($[0, 1]^2$). The 5 hyperparameters explored 3 intervals out of 4.
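The two exploration metrics can be sketched in NumPy as follows; samples are the evaluated hyperparameters, normalized to the unit hypercube as in 4.1.4.1.

```python
import numpy as np

def dispersion(samples):
    """Standard deviation of the evaluated hyperparameters around their mean,
    with samples of shape (n, d) in the unit hypercube."""
    samples = np.asarray(samples)
    center = samples.mean(axis=0)
    return np.sqrt(np.mean(np.sum((samples - center) ** 2, axis=1)))

def n_intervals_explored(samples):
    """Number of the 2^d half-unit intervals that contain at least one sample."""
    cells = (np.asarray(samples) >= 0.5).astype(int)    # 0/1 code per dimension
    return len({tuple(row) for row in cells})

samples = np.random.rand(50, 3)
print(dispersion(samples), n_intervals_explored(samples))
```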

4.1.5.2 Exploitation

Exploitation is testing limited portions of the search space with the hope of improving a promising solution that we already have tested. It is a strategy of intensification.

Metric: Mean of the Error Rates

This is the mean of the error rate (in %) over all the hyperparameter settings evaluated during the optimization process. This error rate is computed on a validation set which has not been used for the training.


4.2 An Application to a Real-World Use Case: Audio Classification

Audio processing is one of the main research topics of the team where this thesis was conducted (team Ambient Intelligence of Orange Labs Lannion).

This section presents a real-world application of hyperparameter optimization. Hyperparameter optimization algorithms are applied to a task of audio tagging. This problem is made similar to image classification by using mel-spectrograms, a 2D representation of sounds.

4.2.1 Data and Task

The dataset is the SONYC dataset [36]. It is composed of 2351 recordings in the training set and 443 recordings in the validation set. Each recording is a 10-second audio segment recorded in the streets of New York City.

8 labels can be present in the audio recordings: 'engine', 'machinery impact', 'non-machinery impact', 'powered saw', 'alert signal', 'music', 'human voice' and 'dog barking'. Labels are non-exclusive: several classes can be present in each recording. The goal is to output a probability of presence for each class for each recording.

4.2.2 Feature Engineering

4.2.2.1 Definition of Melspectrograms

Since artificial neural networks have shown good performance on image classification tasks, audio inputs are often converted into images through mel-spectrograms. A spectrogram is a visual representation of a sound that contains information both in the time domain and the frequency domain. It represents the evolution of the frequency spectrum over time.

First, the original audio signal is divided into signals of length window_size. The overlap between two consecutive signals is the window_hop (1 in Fig. 4.4). The usual size of window_size and window_hop is a few milliseconds. Then, for each window, the Fourier Transform magnitude is computed (2 in Fig. 4.4).


Finally, the Fourier Transform magnitudes are stacked: the horizontal axis corresponds to the time domain and the vertical axis corresponds to the frequency domain (3 in Fig. 4.4).

A mel-spectrogram is a spectrogram where a mel scale is used in the frequency domain. A mel scale is a logarithmic scale based on human sound perception. There is no unique definition of the mel scale. In this project, the melspectrogram function of the librosa [37] library has been used.

Figure 4.4: Construction of a spectrogram


4.2.2.2 Feature Engineering for the Audio Classification Task

First, the recordings are re-sampled using a sampling rate of 22050 Hz, which preserves the information from frequencies lower than 11025 Hz and therefore keeps most of the information in the signal. Then three features are extracted from these signals:

• The mel-spectrograms, using 64 mel bands and a hop length of 512, resulting in a 64 rows × 431 columns image.

• The averaged value of the harmonic and percussive components (64 rows × 431 columns image).

• The derivative of the mel-spectrograms (64 rows × 431 columns image).

These spectrograms have been extracted using the librosa library [38]. The figure below represents the three spectrograms extracted from an alert-signal recording.

Figure 4.5: The three input images corresponding to a recording.
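A sketch of how the three features could be extracted with librosa [38] is given below; the stated parameters (22050 Hz, 64 mel bands, hop length 512) come from the text, while the exact way the harmonic and percussive components are averaged is an assumption.

```python
import numpy as np
import librosa

def extract_features(path):
    """Return the three 64 x T features described above for one recording."""
    y, sr = librosa.load(path, sr=22050)          # re-sample to 22050 Hz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64, hop_length=512)
    # harmonic / percussive decomposition, averaged into one image (assumption)
    y_harm, y_perc = librosa.effects.hpss(y)
    mel_harm = librosa.feature.melspectrogram(y=y_harm, sr=sr, n_mels=64, hop_length=512)
    mel_perc = librosa.feature.melspectrogram(y=y_perc, sr=sr, n_mels=64, hop_length=512)
    harm_perc = (mel_harm + mel_perc) / 2.0
    delta = librosa.feature.delta(mel)            # derivative of the mel-spectrogram
    return np.stack([mel, harm_perc, delta], axis=-1)   # shape (64, T, 3)

# features = extract_features("recording.wav")
```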


4.2.3 Model

4.2.3.1 Model structure

A VGG-style [39] convolutional neural network is used to detect the classes from the input spectrograms:

Input 64 × 431 × 3
64 × conv 3×3
64 × conv 3×3
MaxPooling 2×2
128 × conv 3×3
128 × conv 3×3
MaxPooling 2×2
256 × conv 3×3
256 × conv 3×3
256 × conv 3×3
MaxPooling 2×2
512 × conv 3×3
512 × conv 3×3
512 × conv 3×3
MaxPooling 2×2
512 × conv 3×3
512 × conv 3×3
512 × conv 3×3
MaxPooling 2×2
Flatten
1024-Fully Connected + L2-regularization
ReLU Activation
Dropout
1024-Fully Connected + L2-regularization
ReLU Activation
Dropout
512-Fully Connected + L2-regularization
ReLU Activation
Dropout
512-Fully Connected + L2-regularization
ReLU Activation
Dropout
8-Fully Connected + L2-regularization
Sigmoid Activation

All convolutional layers are initialized with the weights of VGG16 pre-trained on the ImageNet dataset [40], but they are not frozen during the training. This model has 30,188,360 parameters.


4.2.3.2 Data Augmentation

The training set is quite small (2351 samples). Data augmentation is a way to artificially increase the size of the training set and avoid overfitting. Mixup is a data augmentation method that has been experimented for this task. A new sample is created by linearly combining two samples. This linear combination is applied to both the melspectrograms and the corresponding labels. From two samples {input : x1, target : y1} and {input : x2, target : y2} a ’new’

sample is created: {input : x3 = λx1+ (1 − λ)x2, target : y3 = λy1+ (1 − λ)y2} where λ ∼ β(mixup_rate) [41].
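A NumPy sketch of mixup as described above; the symmetric Beta(mixup_rate, mixup_rate) parametrization is an assumption based on [41].

```python
import numpy as np

def mixup(x1, y1, x2, y2, mixup_rate=0.2, rng=None):
    """Create a new training sample by linearly combining two samples."""
    rng = np.random.default_rng(rng)
    # lambda drawn from a Beta distribution (symmetric parametrization assumed)
    lam = rng.beta(mixup_rate, mixup_rate)
    x3 = lam * x1 + (1.0 - lam) * x2
    y3 = lam * y1 + (1.0 - lam) * y2
    return x3, y3

x1, y1 = np.random.rand(64, 431, 3), np.array([1, 0, 0, 0, 0, 0, 0, 1])
x2, y2 = np.random.rand(64, 431, 3), np.array([0, 1, 0, 0, 0, 0, 0, 0])
x3, y3 = mixup(x1, y1, x2, y2)
```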

4.2.3.3 Model Training

Binary crossentropy (defined in 2.2.2) is the loss function, which is optimized using the Adam optimizer with a learning rate of 0.00001 for 100 epochs. The model is trained using a GPU (NVIDIA GeForce GTX 1080); the training takes about 1 hour.

4.2.4 Optimization of the Model

Four hyperparameters can vary:

• the dropout_rate of all the layers that use dropout

• l2_reg, the L2-regularization constant of all the layers that use L2-regularization

• batch_size, the size of the mini-batches

• the mixup_rate

For this task, the training is rather long (about 1 hour), so only 30 sets of hyperparameters are evaluated. The three algorithms presented earlier are evaluated: random search, PSO (using 5 particles) and TPE.

4.2.4.1 Model Evaluation

The performance of the model is measured using micro-averaged AUPRC (Area Under Precision-Recall Curve [42]). This is a good metric for this multi-label classification task since it depends on both precision and recall, and it does not depend on a predefined threshold. The AUPRC is computed on a validation set which has not been used for the training.
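Micro-averaged AUPRC can be computed, for instance, with scikit-learn's average_precision_score; the thesis does not name the implementation it used, so the snippet below is only an illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([[1, 0, 1, 0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 1, 0, 0, 0]])   # ground-truth labels (n_samples, 8)
y_score = np.random.rand(2, 8)                   # predicted probabilities
auprc = average_precision_score(y_true, y_score, average="micro")
print(auprc)
```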
