Transfer learning between domains: Evaluating the usefulness of transfer learning between object classification and audio classification

Bachelor Degree Project in Informatics Level ECTS

Spring term 2020

Tobias Frenger Johan Häggmark

Supervisor: Niclas Ståhl Examiner: Joe Steinhauer

Convolutional neural networks have been successfully applied to both object classification and audio classification. The aim of this thesis is to evaluate how well transfer learning of convolutional neural networks, trained in the object classification domain on large datasets (such as CIFAR-10 and ImageNet), can be applied to the audio classification domain when only a small dataset is available.

In this work, four different convolutional neural networks are tested with three configurations of transfer learning against a configuration without transfer learning. This allows for testing how transfer learning and the architectural complexity of the networks affect the performance. Two of the models, Inception-V3 and Inception-ResNet-V2, were developed by Google. These models are implemented using the Keras API, where they are pre-trained on the ImageNet dataset. This paper also introduces two new architectures developed by the authors of this thesis: Mini-Inception and Mini-Inception-ResNet. They are inspired by Inception-V3 and Inception-ResNet-V2 but have a significantly lower complexity. The audio classification dataset consists of audio from RC-boats which is transformed into mel-spectrogram images. For transfer learning to be possible, Mini-Inception and Mini-Inception-ResNet are pre-trained on the CIFAR-10 dataset. The results show that transfer learning is not able to increase the performance. However, transfer learning does in some cases enable models to obtain higher performance in the earlier stages of training.

Keywords - Convolutional Neural Networks, Object Classification, Audio Classification, Transfer learning, Inception-V3, Inception-ResNet-V2, Keras, ImageNet, Mini-Inception, Mini-Inception-ResNet, mel-spectrogram, CIFAR-10

The opportunity to conduct this study at Combitech in Skövde has been a privilege. We would like to thank Jonas Karlsson from Combitech in Skövde for always being helpful. His expertise has been a great resource. We would also like to thank Eva Nero from Combitech in Skövde for being a valuable provider of feedback. Finally, we would like to thank Niclas Ståhl at the University of Skövde for providing us with advice and feedback to accomplish this study.

1 Introduction

2 Background

2.1 Artificial intelligence

2.2 Machine learning

2.2.1 Supervised learning

2.2.2 Unsupervised learning

2.2.3 Semi-supervised learning

2.2.4 Reinforcement learning

2.3 Neural Networks

2.3.1 Activation functions

2.3.2 Hidden Layers

2.4 Teaching a neural network

2.4.1 Backpropagation

2.4.2 Batch Normalization

2.4.3 Transfer learning

2.5 Convolutional Neural Networks

2.6 Convolutional neural network features

2.6.1 Kernel

2.6.2 Spatial dimension reduction

2.6.3 Depth dimension reduction

2.6.4 Data representation in the architecture

2.7 Inception-V3

2.8 Inception-ResNet-V2

2.9 Audio classification with convolutional neural networks

2.10 Related work

3 Problem

3.1 Aim

3.2 Problem Area

3.3 Research problem

3.3.1 Motivation

3.3.2 Research Questions

3.3.3 Hypothesis

3.3.4 Objectives

4 Method

4.1.2 Alternative methodology

4.2 Implementation

4.2.1 Dataset

4.2.2 Data collection

4.2.3 Experimentation environments

4.2.4 CNN implementations

4.2.5 Optimizer

4.2.6 Configurations

4.3 Average mean amplitude classification

5 Evaluation

5.1 Introduction

5.2 Results

5.2.1 Test evaluation results

5.2.2 Training results

5.2.3 Average mean amplitude results

5.2.4 Confusion Matrices

5.3 Analysis

5.3.1 Summary

5.3.2 Average validation loss in test evaluation

5.3.3 Transfer learning with freezed convolutional layers

5.3.4 Mean result from training

5.3.5 Heat map

5.3.6 T-Test

5.3.7 Confusion matrix

5.3.8 Average mean amplitude

5.4 Threats to validity

5.4.1 Conclusion validity

5.4.2 Internal validity

5.4.3 Construct validity

5.4.4 External validity

5.5 Answers to research questions

5.6 Hypotheses evaluation

5.7 Conclusions

6 Discussion

6.3 Architectures

6.3.1 Network depths

6.3.2 Network widths

6.3.3 Neural network architecture choices

6.3.4 Time consumption in training

6.4 Transfer Learning

6.5 Statistical Significance

6.6 Related work

6.7 Research ethics

6.8 Research usefulness

6.9 Future work

7 References

Appendix A – Test evaluation

Appendix B – Training

Appendix C – Test evaluation result for each model

Appendix D – Training result for each model

1 Introduction

Convolutional neural networks (CNNs) have proven to be effective at recognizing objects in images (Szegedy et al. 2017). In image classification, a CNN’s objective is to find local patterns in images.

To start with, the network finds simple objects, such as horizontal and vertical lines. The deeper we get into the architecture of the network, the higher the abstraction level becomes for what the network is able to identify. In the last few layers, the network may for instance recognize entire faces, or objects such as eyes and ears.

Hershey et al. (2017) show that CNNs are capable audio classifiers. Performing audio classification with a CNN may at first seem counterintuitive, since a CNN is usually used to recognize patterns in images. However, audio can be transformed into spectrogram images, and the CNN can identify characteristics in these images, which represent audio in three dimensions (frequency, amplitude, and time). Depending on the audio source, different types of spectrograms may be of interest. The type of spectrogram used in this thesis is the mel-spectrogram. This type of spectrogram is inspired by the human auditory system, which distinguishes low frequencies more finely than high frequencies.
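As an illustration of why a mel-spectrogram gives more resolution to low frequencies, the sketch below converts between Hz and mels. It assumes the commonly used formula m = 2595 · log10(1 + f / 700); the thesis does not state which mel formula or band count its tooling uses, so the function names and the 0 to 8000 Hz range with eight bands are hypothetical.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """Convert Hz to mels (assumed formula, not taken from the thesis)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> list:
    """Return n_bands + 1 band edges in Hz that are equally spaced in mels."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / n_bands
    return [mel_to_hz(m_min + i * step) for i in range(n_bands + 1)]


# Hypothetical example: eight mel bands between 0 Hz and 8000 Hz.
edges = mel_band_edges(0.0, 8000.0, 8)
# Width of each band in Hz; the widths grow with frequency.
widths = [high - low for low, high in zip(edges, edges[1:])]
```

Because the bands are equally wide in mels, they become wider in Hz as the frequency increases, so the lower part of the spectrum is represented by more rows of the image.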

To evaluate the usefulness of transfer learning, four different CNNs are evaluated in four different configurations. Two of the chosen architectures are Google’s Inception-V3 and Inception-ResNet-V2. These architectures are complex, which means they have a high capacity for processing information. The other two architectures, called Mini-Inception and Mini-Inception-ResNet, have low complexity. They are developed by the authors of this thesis and provide a baseline for comparing the performance at different complexities. To specify the differences, the Inception-V3 architecture has 94 convolutional layers and the Inception-ResNet-V2 architecture has 244 convolutional layers. In comparison, the Mini-Inception architecture has only 12 convolutional layers and the Mini-Inception-ResNet architecture has 15.

In the experiment, each architecture is tested in four configurations. Three configurations use transfer learning, and the final configuration consists of an untrained architecture which must adapt to the dataset from randomly initialized weights. The transfer learning configurations use models which have been trained on datasets from the object classification domain. The first transfer learning configuration uses the weights of the pre-trained architecture as initial weights, while still allowing every layer in the architecture to be re-trained. The main reason for using this configuration is to see if the weights by themselves can help reduce the amount of training needed for the networks. The second transfer learning configuration has half of the convolutional layers set to be untrainable. This configuration provides a middle ground which further clarifies the usefulness of transfer learning. The third transfer learning configuration consists of a pre-trained model where all convolutional layers are set to be non-trainable. This approach makes it possible to see the initial capabilities of the pre-trained models in the new domain. Every transfer learning configuration also replaces the last fully connected layer with a new fully connected layer which is mapped to the number of classes in the targeted dataset.
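The four configurations can be summarized as a small planning function. This is only a sketch under stated assumptions: the configuration names, the `Plan` class and `plan_configuration` are hypothetical, and the text does not say which half of the convolutional layers is frozen, so the half closest to the input is assumed here.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    use_pretrained_weights: bool  # False means random initialization
    trainable: list               # one flag per layer, in network order


def plan_configuration(layer_kinds, configuration):
    """Decide which layers are trainable for one of the four configurations.

    layer_kinds lists "conv" or "dense" for every layer. The final dense
    layer stands for the new fully connected layer and is always trainable.
    """
    conv_positions = [i for i, kind in enumerate(layer_kinds) if kind == "conv"]
    if configuration == "no_transfer":
        frozen, pretrained = set(), False
    elif configuration == "all_trainable":
        frozen, pretrained = set(), True
    elif configuration == "half_frozen":
        # Assumption: the first half (closest to the input) is frozen.
        frozen = set(conv_positions[: len(conv_positions) // 2])
        pretrained = True
    elif configuration == "all_conv_frozen":
        frozen, pretrained = set(conv_positions), True
    else:
        raise ValueError(f"unknown configuration: {configuration}")
    trainable = [i not in frozen for i in range(len(layer_kinds))]
    return Plan(pretrained, trainable)


# Hypothetical toy network: four convolutional layers and a new output layer.
layers = ["conv", "conv", "conv", "conv", "dense"]
half_frozen_plan = plan_configuration(layers, "half_frozen")
```

In a Keras model the same decision would typically be applied by setting the `trainable` attribute of each layer before compiling the model.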

The audio classification dataset (RC-boats) used in this paper consists of mel-spectrogram representations of audio from a number of RC-boats which have been recorded in two different environments (one being a smaller lake, and the other being a pool). There is also one audio class representing noise, which simply contains the background noise from the different environments. The dataset consists of 6,674 images. In comparison, CIFAR-10 contains 60,000 images and ImageNet contains roughly 14 million images (Rosebrock 2017). The RC-boats dataset is significantly smaller than both of these datasets, and it is therefore justifiable to call RC-boats a small dataset.

2 Background

This chapter gives a brief history of artificial intelligence, which leads into a more technical description of neural networks. Later in the chapter, a description is given of how CNNs work. Concepts that are important for this thesis, such as learning strategies, transfer learning, and audio classification, are also covered.

2.1 Artificial intelligence

Negnevitsky (2005) describes intelligence as a philosophical term whose meaning can differ depending on who you ask. One definition that he mentions is that intelligence is a quality only humans can possess. When it comes to artificial intelligence (AI), the goal is to make computers perform tasks that require some form of intelligence which resembles human intelligence.

Rosebrock (2017) states that the first known work on AI was done by Warren McCulloch and Walter Pitts in 1943. The research was based on the central nervous system and resulted in the first artificial neural network. In the 1960s the view of AI was optimistic, and the research area flourished. The aim was to develop general-purpose solving techniques. The expectations of the 1960s did not, however, lead to the success that was expected during the following decade.

The AI researchers of the 1970s came to the realization that developing general-purpose solving techniques was too broad a problem. Instead of focusing on general-purpose techniques, the focus shifted to domain expert algorithms, thus restricting the problem area to specific tasks. Today’s most popular approach to AI is based on what is called ‘machine learning’.

2.2 Machine learning

Saravanan & Sujatha (2018) explain that machine learning enables a system to learn without having to be explicitly programmed for what it should learn. They further clarify that the main intention of machine learning is to minimize the amount of human assistance a system needs in order to learn. The algorithm has to figure out how to find a desired behaviour without human interference (e.g. being programmed for the specific task). There are three main learning styles of machine learning which, according to Brownlee (2019), are amongst the most popular today. These learning styles are supervised learning, unsupervised learning, and semi-supervised learning. There are other types as well that are more suited for other kinds of problems, for example reinforcement learning (Kaelbling et al. 1996).

2.2.1 Supervised learning

Supervised learning makes use of labeled data. This means that during training, the algorithm tries to predict the known output given the input data. Brownlee (2019) groups supervised learning into two subcategories: classification problems and regression problems. These subcategories describe situations where supervised learning is a viable method for solving problems. When the algorithm tries to assign the correct label to specific input data, the problem is a classification problem. Regression problems are described as situations where the output variable is a real value. A price in dollars and a weight are examples of variables suitable for regression problems.

2.2.2 Unsupervised learning

For situations where input data exists but there are no output variables which the input data can be mapped to, unsupervised learning may be useful. Brownlee (2019) explains that with unsupervised learning, the goal is to model the distribution of the data, or in other words its underlying structure, in order to obtain more information about the data. Unsupervised learning can be split into two subcategories: clustering problems and association problems.

Clustering problems are described as situations where it is desired to discover the intrinsic groupings in the data. Grouping customers by their purchasing behavior is an example of a clustering problem.

Association problems are situations where you want to discover rules that in general can describe large portions of the data. Such a rule may for example be that people who buy X also tend to buy Y.

2.2.3 Semi-supervised learning

Semi-supervised learning can be considered a mix of supervised and unsupervised learning, as Brownlee (2019) explains. The reason is that the problems which semi-supervised learning tries to solve are problems where there exist large amounts of input data, but only a smaller portion of the data is labeled. These datasets exist because it can be both expensive and time-consuming to label data, which in some cases may require domain experts. Unlabeled data is, on the other hand, both cheaper and easier to obtain. An example of a semi-supervised learning problem is a photo archive consisting of a large number of photos where several different categories may be present (e.g., frog, bird, and alligator). In order to predict which label the photos belong to, an unsupervised learning technique may be used to discover the structure of the data; supervised learning can then be used to make predictions and assign labels to the unlabeled data.

2.2.4 Reinforcement learning

According to Kaelbling et al. (1996), reinforcement learning is a learning strategy where an agent tries to find an optimal solution by trial and error. The desired behaviour might not be known, but by applying punishment for bad behaviour, the agent learns how to avoid bad decisions. This thesis faces a classification problem. The data is labeled, which makes the desired behaviour known. This makes reinforcement learning less suitable and supervised learning more suitable.

2.3 Neural Networks

The previously mentioned learning strategies are often used with artificial neural networks.

Negnevitsky (2005) explains that artificial neural networks are models of reasoning based on how the human brain operates. Nerve cells (neurons) in the brain can be considered densely coupled, basic information-processing units. The human brain consists of nearly 10 billion neurons and about 60 trillion connections, where each connection is referred to as a synapse. The brain can perform its functions very fast because it utilizes multiple neurons simultaneously. Biological neural networks have a fundamental and essential characteristic, which is the ability to learn. The simplicity of how neurons learn has led to the pursuit of emulating biological neural networks in computers. However, he points out that present-day artificial neural networks have as much resemblance to the human brain as a paper airplane has to a real airplane with supersonic capabilities. In other words, much more work is needed before artificial neural networks can be said to emulate the biological neural network of humans to any greater extent.

Neurons in an artificial neural network are connected to other neurons via links. Each link has a numerical weight attached. The value of the weight determines the importance of the originating neuron to the neuron it is linked to. Weights can be seen as a basic type of long-term memory in an artificial neural network. In order for the network to learn, it has to repeatedly adjust the values of these weights. The flow of information through a neuron and on to the next consists of several steps. First, inputs are received via the links. The value of each input is then multiplied by the weight of its link, and the products from all links attached to the neuron are added together. Finally, a bias value is added to the total sum; its purpose is to shift the sum independently of the inputs. The bias is often a critical component for training neural networks successfully (Rosebrock 2017). To make a comparison to the brain, the weights can be considered the synapses of the neural network (Negnevitsky 2005). Figure 1 shows how the biological neuron is related to the artificial neuron.

The values which the neuron outputs are strongly influenced by the activation function (Nwankpa et al. 2018).
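The computation performed by a single artificial neuron can be sketched as follows. The function names and the example values are hypothetical and only serve to illustrate the weighted sum, the bias and the activation function described above.

```python
def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


def identity(z):
    """Linear activation: passes the weighted sum through unchanged."""
    return z


def step(z):
    """Fires (1.0) when the weighted sum is non-negative, otherwise 0.0."""
    return 1.0 if z >= 0 else 0.0


# Hypothetical example with two inputs.
inputs = [1.0, 2.0]
weights = [0.5, -0.25]

# The weighted sum of the inputs is 0.5 - 0.5 = 0.0, so the bias alone
# decides whether the step activation fires.
fires = neuron_output(inputs, weights, 0.1, step)
silent = neuron_output(inputs, weights, -0.2, step)
```

The example shows how the bias shifts the sum independently of the inputs: the same inputs and weights make the neuron fire with one bias and stay silent with another.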

2.3.1 Activation functions

An activation function is a mathematical function used to model the activation of a neuron, i.e. whether the neuron should fire or not (Nwankpa et al. 2018). Different types of activation functions are suitable in different situations. The Softmax activation function computes a probability distribution from a vector of real numbers and is therefore suitable in the output layer of a neural network (Chollet et al. 2015). For this reason, Softmax is the activation function used in the output layer of the implementations in this thesis. The ReLU (rectified linear unit) activation function resembles a linear activation function but has a threshold at zero: negative input values are forced to zero, while positive input values pass through unchanged. Because positive values are not scaled down, ReLU is able to avoid what is called the vanishing gradient problem. Wang (2019) explains that the vanishing gradient problem occurs with activation functions that scale down input values, e.g. Sigmoid and the hyperbolic tangent function (Tanh). He further explains that the effect is that large changes in the input of a neural network have only a small effect on the output. The reason is that the derivatives of the values become very small during an optimization procedure called backpropagation (further explained in section 2.4.1). The problem grows with each layer of neurons that is added to the neural network. ReLU is presented by Wang (2019) as a viable option (see Figure 2) for countering the vanishing gradient problem. This activation function has become one of the most popular choices for neural networks (Ramachandran et al. 2017). However, the ReLU activation function suffers from its own set of issues and is not the final answer to optimizing neural networks.

Figure 1 - Parts of the biological neuron are mapped to the names used when building neurons for neural networks. Dendrites refer to the inputs the neuron is able to receive. The axon refers to the output the neuron sends to other connected neurons. The soma is the body of the neuron, where the activation function and the linear combiner are defined. The linear combiner takes all of the input values from connected nodes in the previous layer of the neural network and outputs a value which is then passed through the activation function.
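The activation functions discussed in this section can be sketched in a few lines. The `chained_gradient` helper is a deliberately simplified, hypothetical picture of the vanishing gradient problem: it multiplies the same local derivative once per layer and ignores the weights, which a real backpropagation pass would include.

```python
import math


def softmax(values):
    """Turn a vector of real numbers into a probability distribution."""
    largest = max(values)  # subtracted for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25


def relu(z):
    return max(0.0, z)


def relu_derivative(z):
    return 1.0 if z > 0 else 0.0


def chained_gradient(derivative, z, n_layers):
    """Simplified picture: the same local derivative multiplied per layer."""
    return derivative(z) ** n_layers


probabilities = softmax([1.0, 2.0, 3.0])
sigmoid_chain = chained_gradient(sigmoid_derivative, 0.0, 10)  # 0.25 ** 10
relu_chain = chained_gradient(relu_derivative, 2.0, 10)        # 1.0 ** 10
```

Even at its steepest point the Sigmoid derivative is only 0.25, so the product over ten layers is below one millionth, whereas the ReLU derivative for a positive input stays at 1.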

2.3.1.1 The Dying ReLU Problem

The main problem with ReLU is that once the input to a neuron stays below zero, the neuron outputs zero and its gradient is zero as well, so its weights are no longer updated and it can never be re-activated (Lu et al. 2019). This leaves dead nodes (nodes are also referred to as neurons) in the network which are unable to contribute to the network. This is not a desired behaviour, and there exist a few solutions that try to address the problem. One of the common replacements for ReLU which tries to mitigate the dying ReLU problem is the Leaky ReLU activation function.

2.3.1.2 Leaky ReLU

Leaky ReLU was introduced by Maas et al. (2013) and uses a small slope for negative inputs to avoid the dying ReLU problem, see Figure 3. The slope allows neurons with negative and close-to-zero values to stay alive, meaning that neurons with negative values are still able to output useful values. For this reason, the Leaky ReLU activation function is used in the implemented architectures of this thesis.
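A minimal sketch of Leaky ReLU and its derivative is given below. The slope of 0.01 is an assumed, commonly seen default; the value used in the implemented architectures is not stated in this section.

```python
def leaky_relu(z, negative_slope=0.01):
    """Pass positive values through; scale negative values by a small slope."""
    return z if z > 0 else negative_slope * z


def leaky_relu_derivative(z, negative_slope=0.01):
    """The derivative is never zero, so the neuron keeps receiving updates."""
    return 1.0 if z > 0 else negative_slope


# A negative input still yields a small non-zero output and gradient.
negative_output = leaky_relu(-2.0)
negative_gradient = leaky_relu_derivative(-2.0)
```

Since the gradient for negative inputs is small but not zero, backpropagation can still adjust the weights of a neuron whose input has become negative.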

2.3.2 Hidden Layers

Networks consisting of several hidden layers are often referred to as deep neural networks (Papernot et al. 2016). They also mention that the term ‘deep’ was coined in the mid-2000s. Hinton et al. (2006) developed a deep learning architecture consisting of three hidden layers, which may indicate the lower limit for a deep neural network (in terms of the number of hidden layers). However, there seems to be no definite answer to what is considered ‘deep’; Nwankpa et al. (2018) consider neural networks with more than one hidden layer to be deep learning architectures. The important aspect is that deep neural networks consist of several hidden layers. Deep neural networks are in some cases also referred to as multilayer perceptrons, see Figure 4. Hidden layers are used so that the networks are able to find more complex patterns in the data. Heaton (2017) points out that an important part of deciding the network architecture is to define the number of neurons in the hidden layers. Too few neurons in the hidden layers will cause the network to underfit, which means that the network is not capable of finding the patterns in a dataset of some complexity. If, on the other hand, there are too many neurons in the hidden layers, the network can overfit, meaning that the complexity of the dataset is not high enough for all of the neurons to be trained. The network’s capacity for processing information is in such a case too large for what is required by the dataset. The effect is that the network only performs accurately on the training data. This occurs because the network memorizes the training data instead of learning the underlying structures needed for the ability to generalize. In the rest of this report, a network’s capacity for processing information is referred to as the complexity of the network.

Figure 2 - ReLU activation function

Figure 3 - Leaky ReLU activation function

Figure 4 - Hidden layers

2.4 Teaching a neural network

There are many different settings that can be applied to a neural network and its learning implementation. These are referred to as hyperparameters. One of these hyperparameters is the optimizer, which holds the parameters needed for training a neural network. An example of such a parameter is the ‘learning rate’, which affects how much the weights are adjusted during training. When the hyperparameters have been configured, training rounds can be performed. The adjustment of the weights is most commonly done by the gradient backpropagation procedure (LeCun et al. 1998).

2.4.1 Backpropagation

As Werbos (1990) explains, backpropagation (backward propagation of errors) is a method which is commonly used by neural networks. The goal of backpropagation is to minimize the cost function of the network, where the cost function represents the deviation of the predicted value from the actual value. This is done by applying gradient descent, which uses the derivatives of the cost function with respect to the weights and biases. The derivatives reveal the direction in which to change these values. The derivatives are multiplied by the learning rate of the network, and the result is subtracted from the weights and biases in order to update them. This process is performed from the output layer back to the input. There are techniques which can make this process more effective; one such technique is batch normalization.
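The update step described above can be illustrated with a single linear neuron. This sketch is not full backpropagation, since there are no hidden layers to propagate the error through; it only shows gradient descent on a mean squared error cost. The training samples, the learning rate and the function name are hypothetical.

```python
def train_linear_neuron(samples, learning_rate=0.1, epochs=500):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        # Derivatives of the cost with respect to the weight and the bias.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in samples) / n
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical samples generated from y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
weight, bias = train_linear_neuron(samples)
```

With these samples the weight approaches 2 and the bias approaches 1. A larger learning rate takes bigger steps per update but can make the training diverge.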

2.4.2 Batch Normalization

Ioffe & Szegedy (2015) introduce and describe batch normalization. It has been shown to allow higher learning rates for neural networks and to reduce internal covariate shift. Internal covariate shift is the change in the distribution of a layer’s inputs that occurs during training when the parameters of the preceding layers are updated by the backpropagation procedure. A layer suffers from internal covariate shift when it has first been optimized by backpropagation and is later affected by the updates of the layers before it. The deeper the network is, the more inconsistent the outputs will be from the layers closer to the output layer in the architecture, see Figure 5.

Figure 5 - Internal covariate shift

The problem that batch normalization addresses is that backpropagation makes it harder for layers closer to the output of the network to produce reliable values than for layers closer to the input. Batch normalization normalizes the inputs of a layer over each mini-batch, so that the values a layer receives do not deviate as much after a backpropagation update as they would without the normalization. This makes the network more stable in its predictions (Ioffe & Szegedy 2015).
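The normalization step of batch normalization during training can be sketched for one feature of a mini-batch as below. The scale `gamma` and shift `beta` are learned parameters in a real network; here they are fixed to their usual initial values, and the batch values are hypothetical.

```python
import math


def batch_normalize(batch, gamma=1.0, beta=0.0, epsilon=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(variance + epsilon) + beta
            for x in batch]


# Hypothetical activations of one neuron for a mini-batch of four samples.
normalized = batch_normalize([10.0, 12.0, 14.0, 20.0])
normalized_mean = sum(normalized) / len(normalized)
normalized_variance = sum((x - normalized_mean) ** 2
                          for x in normalized) / len(normalized)
```

After normalization the values have a mean close to 0 and a variance close to 1, regardless of how the preceding layers have shifted their outputs.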

2.4.3 Transfer learning

Another technique for making the training of a neural network more effective is transfer learning. A neural network has a better opportunity to learn about a domain if the dataset is big enough (Szegedy et al. 2015). However, big datasets are not always available. According to Pan & Yang (2009), knowledge in one domain can sometimes be useful in another domain of interest. A model trained in one domain could thereby have an advantage when the obtained knowledge is applied to a new task in a new domain. This transferring of knowledge from one area to another is what is referred to as transfer learning. Shin et al. (2016) show that transfer learning from object classification to computer-aided detection is useful, especially for very complex architectures. Transfer learning can be applied in various ways. One approach is to keep the entire knowledge from the previous domain and use the data from the new domain only for remapping the output of the network, meaning that the stored knowledge is never forsaken in favor of new knowledge. It is also possible to use the previously obtained knowledge as a starting point from which the new model is trained. Finally, it is possible to set some parts of the new model as untrainable (protecting the knowledge from the previously trained model from alteration) and other parts as trainable.

2.5 Convolutional Neural Networks

LeCun et al. (1998) describe that basic neural networks use fully connected layers, which implies that every node (neuron) in a layer is connected to every node in the next layer. This enables the network to find any correlations between nodes, meaning that the network is not affected by the order of the input data. However, the use of fully connected layers leads to huge computational costs in larger networks. A CNN solves this problem by restricting the number of connections between the layers. The requirement is that the input data is represented so that important data structures can be found in the local space (i.e. within the restricted connections between the layers). They further explain that the component which searches for important data structures is called a feature extractor and is a part of a convolutional layer. A natural way to think of feature extractors is in face recognition, where the data is represented in a 2-dimensional structure. Rosebrock (2017) explains that feature extractors in layers closer to the output of the network generally find patterns of higher abstraction in images. An example of a more abstract pattern might be entire objects, such as eyes. The feature extractors in the earlier layers tend to be less domain specific, such as looking for vertical and horizontal lines.

LeCun et al. (1998) continue to explain that before the breakthrough of CNNs, hand-crafted feature extractors were a technique used for pattern recognition. The variety of patterns from different domains (speech, music, faces, etc.) makes this approach very difficult. One of the main limitations of this technique is the designer’s ability to come up with an appropriate set of features. CNNs have the ability to learn reliable feature extractors by applying the backpropagation procedure without having any knowledge of the domain. It is worth mentioning that a CNN does not necessarily have to be bound to a 2-dimensional data structure. It could embrace data structures of any number of dimensions.
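The difference in cost between fully connected and convolutional layers can be made concrete by counting trainable parameters. The sketch below uses the 32x32 RGB image size of CIFAR-10; the choice of 100 nodes, 100 kernels and a 3x3 kernel size is hypothetical.

```python
def dense_parameters(n_inputs, n_nodes):
    """One weight per input and node, plus one bias per node."""
    return n_inputs * n_nodes + n_nodes


def conv_parameters(kernel_height, kernel_width, input_depth, n_kernels):
    """Each kernel has its own weights and one bias, shared across positions."""
    return kernel_height * kernel_width * input_depth * n_kernels + n_kernels


image_values = 32 * 32 * 3  # a CIFAR-10 sized RGB image, flattened
dense_count = dense_parameters(image_values, 100)  # 307300
conv_count = conv_parameters(3, 3, 3, 100)         # 2800
```

The convolutional layer needs roughly a hundred times fewer parameters in this example, because each kernel only connects to a small local region and reuses the same weights at every position of the image.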

2.6 Convolutional neural network features

This section explains the mechanics of a CNN.

2.6.1 Kernel

Rosebrock (2017) describes the key features of CNNs, one of them being feature extractors. Feature extractors are referred to as “kernels” and can be seen as trainable weights in the network. In image recognition, where a 2-dimensional CNN is used, a kernel has a spatial dimension of NxM pixels. Every pixel of the kernel holds a value that has been learned through backpropagation.

The kernel is moved step by step over the whole image. For each step, the kernel multiplies its values by the overlapping pixel values of the image and sums the products. Every step thus produces a single new value in what is called an activation map, see Figure 6. When the process is done, the result is a new 2D representation. The activation maps are the outputs of a convolutional layer. The number of activation maps is equal to the number of kernels, which becomes the depth Z of the convolutional layer. When configuring a convolutional layer, the dimensions NxMxZ must be set.

Figure 6 - Kernels producing activation maps
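The sliding of a kernel over an image can be sketched in plain Python. The example assumes a single-channel image, no padding, and a hand-written 2x2 kernel that responds to vertical edges; in a trained CNN the kernel values would instead be learned.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and sum the element-wise products."""
    n, m = len(kernel), len(kernel[0])
    out_rows = (len(image) - n) // stride + 1
    out_cols = (len(image[0]) - m) // stride + 1
    activation_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0.0
            for i in range(n):
                for j in range(m):
                    total += kernel[i][j] * image[r * stride + i][c * stride + j]
            row.append(total)
        activation_map.append(row)
    return activation_map


# Hypothetical 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical kernel that reacts to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
activation_map = convolve2d(image, vertical_edge)
```

The resulting 3x3 activation map is zero everywhere except in the middle column, which is where the kernel overlaps the edge between the dark and the bright half of the image.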

2.6.2 Spatial dimension reduction

The spatial dimension of the data in the activation maps can be unnecessarily large, which then becomes a computational bottleneck. Pooling layers and strides can reduce the spatial dimension. The stride is the number of pixels the kernel moves for each step. A higher stride value results in a smaller activation map. A pooling layer slides a window over its input like a convolutional layer does, but instead of using trained weights it combines the pixels within the window into a single new pixel. An example is average pooling, which outputs the average pixel value. Another example is max pooling, which outputs the highest pixel value. The pooling layer reduces the spatial dimension significantly when a stride > 1 is applied. When using a pooling layer, padding must be set to “same” or “valid”. When padding is set to “same”, the kernel can be applied outside the spatial context of the input layer, i.e. outside of the image boundaries. If padding is set to “valid”, the kernel is never applied outside the spatial context of the input layer, see Figure 7.

Figure 7 - Padding

In Figure 8, the spatial dimension is reduced from 12x12 to 4x4 by a 3x3 average pooling layer. The stride is set to 3 and the padding is set to “same”.

Figure 8 - Average pooling
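The reduction from 12x12 to 4x4 can be checked with a short sketch. The output size formulas follow the usual convention for “same” and “valid” padding; the pooling function itself only covers the case without padding, which is sufficient here because 12 is evenly divisible by the stride of 3. The input values are hypothetical.

```python
import math


def pooled_size(input_size, pool_size, stride, padding):
    """Output size along one spatial dimension."""
    if padding == "same":
        return math.ceil(input_size / stride)
    if padding == "valid":
        return (input_size - pool_size) // stride + 1
    raise ValueError(f"unknown padding: {padding}")


def average_pool(activation_map, pool_size, stride):
    """Average pooling without padding."""
    rows = pooled_size(len(activation_map), pool_size, stride, "valid")
    cols = pooled_size(len(activation_map[0]), pool_size, stride, "valid")
    pooled = []
    for r in range(rows):
        pooled_row = []
        for c in range(cols):
            window = [activation_map[r * stride + i][c * stride + j]
                      for i in range(pool_size) for j in range(pool_size)]
            pooled_row.append(sum(window) / len(window))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 12x12 input filled with the values 0..143.
activation_map = [[float(r * 12 + c) for c in range(12)] for r in range(12)]
pooled = average_pool(activation_map, pool_size=3, stride=3)
```

With a stride of 1 the two padding modes differ: “same” keeps the 12x12 size while “valid” shrinks it to 10x10.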

2.6.3 Depth dimension reduction

Another way to reduce the computational bottleneck is to reduce the depth of the convolutional layers, i.e. the number of kernels. Figure 9 shows how the first convolutional layer increases the depth by extracting more activation maps with two kernels. The second convolutional layer then reduces the depth by using only one kernel. Szegedy et al. (2015) explain that this strategy of reducing the depth of convolutional layers prevents deeper CNNs from turning into computational explosions. Without this technique, more activation maps would be produced for every convolutional layer that is added to the CNN.

Figure 9 - Depth of convolutional layers


2.6.4 Data representation in the architecture

Figure 10 illustrates how the data representation is transformed as it passes through a CNN.

Table 1 refers to Figure 10.

Table 1 - Data representation in CNN

1 The input is an image.

2 An average pooling layer is applied to reduce the spatial dimension of the image.

3 The next layer is a convolutional layer which produces six activation maps.

4 The data representation of the activation maps is flattened into a 1-dimensional array, so it fits into a fully connected layer.

5 The fully connected layer has all nodes connected to every node in the output layer. The output layer’s values are the CNN’s classification.

Figure 10 - Convolutional neural network

2.7 Inception-V3

How to best design CNN architectures is currently a vibrant research area. The Inception-V3 architecture is described by Szegedy et al. (2016) and obtained high performance on the ImageNet dataset (ILSVRC 2012). Inception architectures are built using inception modules stacked as blocks. The concept of an inception module is to make it possible to find important data structures at different spatial dimensions at the same level in the network.

The original Inception module was first introduced by Szegedy et al. (2015), see Figure 11.

Figure 11 - Naive inception module

Another key feature is based on the CNN architecture “Network in Network” from Lin et al. (2013), where the 1x1 convolutional layers can be used to change the depth of the data representation.

For example, an input with 32 activation maps applied to a 1x1 convolutional layer with only one feature extractor will produce only one activation map as output. The reason for reducing the depth is to avoid a computational explosion from convolutional layers that are stacked one after another. This also reduces the computational cost of the more expensive convolutional layers with 3x3 and 5x5 feature extractors. Figure 12 shows how this technique is applied in the Inception-V3 architecture.

Figure 12 - Inception module
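
The saving from a 1x1 depth reduction can be made concrete by counting trainable parameters. The sketch below uses the standard count for a convolutional layer (kernel height x kernel width x input depth x number of kernels, plus one bias per kernel). The layer sizes are hypothetical and chosen only for illustration; they are not the sizes used in Inception-V3.

```python
def conv_params(kernel, in_depth, out_depth):
    """Trainable parameters of a conv layer: weights plus one bias per kernel."""
    return kernel * kernel * in_depth * out_depth + out_depth


# Hypothetical layer sizes, chosen only for illustration.
in_depth, reduced_depth, out_depth = 32, 8, 64

# A 5x5 convolution applied directly to all 32 activation maps.
direct = conv_params(5, in_depth, out_depth)

# A 1x1 convolution first reduces the depth to 8, then the 5x5 is applied.
reduced = conv_params(1, in_depth, reduced_depth) + conv_params(5, reduced_depth, out_depth)

print(direct, reduced)
```

In this example, the reduced variant needs roughly a quarter of the parameters of the direct variant while producing the same number of output activation maps.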

2.8 Inception-ResNet-V2

Residual networks were first introduced by He et al. (2016), who presented a 152-layer residual network that won the ILSVRC & COCO 2015 competitions. The concept of residual connections is to pass data structures directly through a shortcut connection, which enables the architecture to keep data structures to a greater depth in the network. This handles the degradation problem, which commonly appears in deeper neural networks. The degradation problem is an observed behaviour that occurs when the number of layers in the network is increased. Accuracy increases with the number of layers up to a certain point where it saturates, after which it degrades quickly and the training loss increases. The residual network consists of residual blocks that are stacked one after another. The residual block can be seen in Figure 13.

Figure 13 - Residual block
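
A residual block computes F(x) + x, where F is the transformation learned by the stacked layers and x is carried over by the shortcut connection. The following is a minimal sketch of that idea on plain Python lists. The two stand-in layers are hypothetical and only serve to show that the block falls back to the identity mapping when F contributes nothing. It assumes that F preserves the shape of its input.

```python
def residual_block(x, layers):
    """Return F(x) + x, where F is the stacked `layers` and x is the shortcut."""
    out = x
    for layer in layers:
        out = layer(out)
    # Element-wise addition of the transformed signal and the shortcut.
    return [o + s for o, s in zip(out, x)]


def zero_layer(values):
    # Stand-in for a layer whose weights have all been driven to zero.
    return [0.0 for _ in values]


def scale_layer(values):
    # Stand-in for a learned transformation (hypothetical, for illustration).
    return [0.5 * v for v in values]


x = [1.0, -2.0, 3.0]
print(residual_block(x, [zero_layer]))
print(residual_block(x, [scale_layer]))
```

Because the input is always added back, extra blocks can leave the signal unchanged, which is one way to read why residual connections counteract the degradation problem.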

The basic idea of Inception-ResNet-V2 is to combine residual blocks and inception blocks. Inception-ResNet-V2 was introduced by Szegedy et al. (2017) and outperformed Inception-V3, Inception-ResNet-V1, and Inception-V4 on the ILSVRC 2012 validation set. Inception-ResNet-V2 can be described as a costlier hybrid of Inception, but with significantly improved recognition performance. The architecture’s combination of residual blocks and inception blocks is composed into different modules, see Figure 14.

Figure 14 - Inception-ResNet-V2 modules (Szegedy et al. 2017)

The modules are set up in different configurations which use the Inception methodology internally, with a shortcut connection around the entire Inception part of the module, thus combining the Inception and residual approaches in one model.

2.9 Audio classification with convolutional neural networks

According to Hershey et al. (2017), human listeners are good at separating different sources from an acoustic mixture. A variety of computational approaches have been inspired by this fact. They further show that CNNs are successful in performing audio classification using spectrograms. A spectrogram can be produced from a digital audio signal. Applying a Fourier transform to a short audio sequence (time window) gives the amplitude of each frequency in that window. To create a spectrogram, multiple Fourier transforms are stacked one after another. The spectrogram then shows the amplitude of a set of frequencies at multiple time windows. Spectrograms are usually represented as images where the y-axis represents the frequency and the x-axis represents the time. According to Shen et al. (2018), a standard spectrogram has linear scaling on the frequency axis, see Figure 15.

Figure 15 - Linear spectrogram transformed from voice audio

A mel-frequency spectrogram, on the other hand, is a non-linear-frequency spectrogram. It is based on the human auditory system, such that more detail is represented at lower frequencies.

Hershey et al. (2017) have obtained high performance in audio classification with CNNs using mel-frequency spectrograms. See Figure 16.

Figure 16 - Mel-frequency spectrogram transformed from voice audio
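
The non-linear frequency axis can be illustrated with a conversion between hertz and mels. The sketch below assumes the widely used formula mel = 2595 * log10(1 + f / 700); libraries may use slightly different variants, so the exact numbers are illustrative only. The helper `mel_band_edges` is a name introduced here, not part of any library.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in hertz to mels (assumed formula, see lead-in)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min, f_max, n_bands):
    """Band edges in hertz that are equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]


edges = mel_band_edges(0.0, 8000.0, 8)
widths = [b - a for a, b in zip(edges, edges[1:])]
print([round(w) for w in widths])
```

The printed band widths grow with frequency: bands that are equally wide in mels are narrow in hertz at the low end and wide at the high end, which is why low frequencies receive more detail in a mel-frequency spectrogram.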


2.10 Related work

This section summarizes the related work and some aspects that have not yet been covered.

These works and their connections to this thesis are also further discussed in the background.

Complexity of CNN architectures

Real et al. (2017) show that more complex CNN architectures tend to obtain better performance than simpler CNN architectures. This study investigates how different complexities in CNN architectures affect the performance. Szegedy et al. (2015) state that more complex CNN architectures are more prone to overfitting, especially if the amount of data is restricted. Szegedy et al. (2016) introduce the Inception-V3 architecture and Szegedy et al. (2017) introduce the Inception-ResNet-V2 architecture. In this study, the distinguishing techniques of these architectures are applied to less complex architectures. No recent work has been found that applies these techniques to less complex CNN architectures.

Object classification

For object classification, Szegedy et al. (2017) compare the performance of Inception-V3 against Inception-ResNet-V2. Both papers use exceptionally large datasets, which is an important factor for reducing (or avoiding) the risk of overfitting for CNN architectures with greater information processing capacity (Heaton 2017; Szegedy et al. 2015). In this study, Inception-V3 and Inception-ResNet-V2 are tested on a smaller dataset, and their performance is also compared to that of simpler architectures.

Audio classification

The research area for different CNN architectures has been explored for object classification but also for audio classification. Hershey et al. (2017) show that Inception-V3 is able to perform well on audio classification. This study also uses the newer CNN architecture Inception-ResNet-V2 for audio classification.

Transfer learning

Shin et al. (2016) show a case where concepts of knowledge obtained from natural images are beneficial in medical image analysis. In this thesis, the chosen domains (object classification and audio classification) are considered to be disjoint, and it is therefore interesting to evaluate whether they share concepts of knowledge. No study has been found where transfer learning between object classification and audio classification is applied.

It is also interesting to explore to what degree transfer learning can benefit CNNs of different complexities in order to gain a deeper understanding of the usefulness of transfer learning.


3 Problem

This chapter presents the problem definition of this thesis. The first topic concerns the aim and its usefulness. Next is the problem area, which explains the broader aspect of the problem, followed by the research problem, which further narrows down the scope of this thesis. The final parts present the motivation for why the research is needed, along with the research questions, the hypotheses, and the objectives of this thesis.

3.1 Aim

The aim of this thesis is to evaluate the usefulness of applying transfer learning from models trained in the object classification domain to the audio classification domain.

This could help practitioners train higher-performing models when the size of the dataset is a limiting factor, or inform decisions about how training should be performed.

3.2 Problem Area

Training a CNN to learn about a domain of interest is a difficult task. Recently, researchers such as Szegedy et al. (2017) have succeeded in obtaining high accuracy in object classification by training complex CNN architectures on a large dataset (ImageNet) with a large amount of computational power and time. An advantage of a complex CNN architecture compared to simpler CNNs is the ability to learn more complex data structures. Higher complexity also makes the architectures more prone to overfitting, which may affect the generalizability (Szegedy et al. 2015). This could be solved by choosing a dataset suitable to the information processing capacity of the CNN. During this thesis, no method for calculating the required complexity of a dataset in relation to the capacity of a CNN to process information has been encountered, and none seems to exist. Finding this threshold seems to require a 'trial and error' approach, which is not performed in this thesis. However, a suitable dataset with enough structural complexity is not always possible to obtain. This problem might be reduced if feature extractors can be transferred between domains. In the best-case scenario, the training performance may even improve in terms of accuracy, loss, and the number of epochs required for training the CNN.

3.3 Research problem

It is difficult to determine an optimal CNN architecture. A CNN’s generalizability can be limited by its complexity but also by the limited amount of generalizable data structures that can be found in the dataset it is being applied to. However, it is unclear to what degree transfer learning from one domain to another can extend the generalizability of a CNN.

3.3.1 Motivation

According to Szegedy et al. (2015), the complexity of a CNN architecture has the biggest impact on its performance. A more complex CNN has the ability to find more complex data structures, but this also makes it more prone to overfitting, especially on smaller datasets. Well-reputed CNN architectures have often been proven to work well on larger datasets, as can be seen in the works by Szegedy et al. (2015, 2016, 2017) and He et al. (2016). For these well-established CNN architectures, there is less focus on how adaptable they are to smaller datasets. In this thesis, RC-boats is considered a small dataset compared to ImageNet and CIFAR-10, which are deemed to be large datasets. It is also unclear how well complex feature extractors from transfer learning can fit a dataset in another domain of interest. The same uncertainty applies to less complex CNN architectures.

3.3.2 Research Questions

1. To what extent do simple CNN architectures perform on a small dataset in comparison to more complex architectures such as Inception-V3 and Inception-ResNet-V2?

2. To what extent can simple CNN architectures perform transfer learning to a small dataset in another domain of interest?

3. To what extent can the Inception-V3 architecture perform transfer learning to a small dataset in another domain of interest?

4. To what extent can the Inception-ResNet-V2 architecture perform transfer learning to a small dataset in another domain of interest?

3.3.3 Hypothesis

1. The simple CNNs will be able to obtain results similar to those of Inception-V3 and Inception-ResNet-V2.

Our expectation is that a small dataset cannot represent enough complex data structures to provide a benefit for Inception-V3 and Inception-ResNet-V2.

2. Transfer learning will enable a CNN to obtain a higher performance with less training.

Our expectation is that concepts of knowledge in the domain of object classification will be useful in the domain of audio classification.

3.3.4 Objectives

To achieve the aim and answer the research questions, the following objectives need to be completed:

1. Implement a simple classifier (average mean amplitude classification).

2. Find two state-of-the-art CNN architectures.

3. Implement simplified versions of the state-of-the-art CNN architectures.

4. Pretrain the simplified CNNs on CIFAR-10.

5. Transform audio files to spectrogram images.

6. Develop training scripts.

7. Train the architectures with and without transfer learning.

8. Implement validation tools (heatmaps, confusion matrices).

9. Present and analyze the results.


4 Method

This chapter describes and motivates the chosen methodology and how it is able to provide answers to the research questions. Later, the chapter focuses on the implementations and the motivations for the choices that have to be made.

4.1 Choice of methodology

In this section the motivation for the chosen methodology is given. It also gives an explanation for why other methodologies are discarded.

4.1.1 Experiment

The chosen methodology is an experiment. Wohlin et al. (2012) describe the technology-oriented experiment as an investigation of how different tools affect objects in a controlled environment. In our case, the same dataset is applied to different kinds of CNN architectures. Using the same dataset for every CNN reduces the risk of giving any CNN an advantage. An experiment gives the opportunity to collect quantitative data in a controlled environment. Quantitative data is preferable because it allows the conclusions about the research questions to be supported statistically. The controlled environment makes it possible to run identical tests on different CNN architectures, which is important for the validity of the conclusions. Even though the experiment handles many threats to validity, other problems arise that need to be considered, some of which a case study, for example, might be able to solve.

4.1.2 Alternative methodology

Berndtsson et al. (2007) describe a case study as an in-depth exploration of a phenomenon in its natural setting, which means that the data collection is carried out in real-life environments. Such environments are not controlled, which results in external threats to validity. On the other hand, a case study provides the opportunity to present results that are not based on artificially produced data. The benefits of the chosen CNN architectures could be explored in a case study, which would reveal how the networks perform in a real environment. At the same time, it would create new obstacles for the generalizability.

Another well-established methodology to gain knowledge is the use of surveys. Surveys are used for extracting information from a population with experience from a specific domain. This method would have provided us with the opinions of the experts in the field but could possibly only reveal an inclination towards the real answer for the problem (Wohlin et al. 2012).

4.2 Implementation

This section presents how the experiment is implemented. It starts with the dataset and how it is built, followed by the parameters used for collecting data in the experiment. After this, the experimental environment is explained, followed by the architectural implementations. The final parts of the section describe the optimizer chosen for the architectures and the various ways in which the experiment is set up. Code is available at https://github.com/tobfrjohha/Transfer_learning_between_domains [2020-05-04]

4.2.1 Dataset

The dataset used in this thesis is provided by Combitech (2020). The data consists of audio from RC-boats recorded in two underwater environments: a pool (4x8 meters) and a small lake (Boulognersjön in the Swedish town of Skövde). The recording sessions used two hydrophones connected to a recorder. The recorder saved the audio as .wav files with a sample rate of 44100 hertz. In the pool environment, the hydrophones were placed at different locations but at the same depth. One hydrophone was placed on the longer side of the pool, close to a corner. The other hydrophone was placed in the pool’s center. In the lake environment, the two hydrophones were attached to a pier, three meters apart. There are no distinguishable differences between the audio qualities collected from these two hydrophones. The provided dataset consists of 3337 five-second samples. There are five classes: “Noise”, “Racer”, “Spy_Cam”, “Sub” and “Tugboat”. The RC-boat nicknamed Spy_Cam has only been recorded in the pool environment. All classes except “Noise” contain audio from RC-boats. The RC-boats are of different sizes and carry different equipment, which results in audio with different characteristics. Noise is simply the natural audio from the environments. To further increase the number of samples provided to the networks, each five-second sample was split in half, and each half was converted into a mel-spectrogram image.

This results in a dataset with 6674 images in total. These images are split into three different segments. One of these is a training set that is used to adjust the weights of the CNN. Another segment is a validation set that is used to validate the accuracy and loss for the latest adjustment of weights. Finally, there is a test set that is used to test the CNN’s performance, see Table 2.

Table 2 - RC-boats dataset

Data split of 6674 samples

Class     Training  Validation  Test
Noise     1409      152         223
Racer     965       99          152
Spy_Cam   601       71          95
Sub       1224      157         198
Tugboat   1057      105         166

Of the 6674 images, 5840 are reserved for training and validation. A validation split of 10% turns 584 of those images into validation images, leaving 5256 images for training. The validation split takes the last 584 images in the dataset and allocates them to the validation set, which is why the classes are not equally distributed in the validation set. The test set contains the remaining 834 images, which is roughly 12.5 percent of all images (or about 14 percent relative to the 5840 training and validation images). The test set is deemed large enough for evaluation while still allowing the various architectures to be trained on a relatively large number of samples, so that they can find valuable data structures. In later parts, the dataset is referred to as the RC-boats dataset.
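
The split sizes can be checked directly from the per-class counts in Table 2. The short script below is only an arithmetic check, not part of the thesis code. The test share is shown both relative to the full dataset and relative to the 5840 training and validation images, since the two readings give different percentages.

```python
# Per-class counts taken from Table 2: (training, validation, test).
split = {
    "Noise": (1409, 152, 223),
    "Racer": (965, 99, 152),
    "Spy_Cam": (601, 71, 95),
    "Sub": (1224, 157, 198),
    "Tugboat": (1057, 105, 166),
}

training = sum(counts[0] for counts in split.values())
validation = sum(counts[1] for counts in split.values())
test = sum(counts[2] for counts in split.values())
total = training + validation + test

print(training, validation, test, total)
# Validation share of the images reserved for training and validation.
print(round(100 * validation / (training + validation), 1))
# Test share of all images, and relative to the non-test images.
print(round(100 * test / total, 1), round(100 * test / (training + validation), 1))
```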


4.2.1.1 Mel-Frequency spectrogram

According to Hershey et al. (2017), mel-frequency spectrograms built from 960-millisecond frames and stored as 96x64 images are detailed enough for CNNs to accomplish excellent results in audio classification. Our mel-spectrogram images are produced from frames of approximately 2500 milliseconds and stored as 128x128 images in jpg format. The dimensions are not smaller because Inception-V3 and Inception-ResNet-V2 do not allow input resolutions lower than 75x75. It is also convenient to use a higher width resolution than Hershey et al. (2017) use, since our frames cover a longer time window. For each image, there are 128 short-time Fourier transforms and 128 frequency ranges. When producing a mel-frequency spectrogram, there is a trade-off: high precision in time results in lower precision in frequency, and high precision in frequency results in lower precision in time. This is illustrated with the sound of a sine wave switching from 200 hertz to 800 hertz, see Figure 17 and Figure 18.

Figure 17 - Prioritizing frequency precision

Figure 18 - Prioritizing time precision

The spectrograms are produced with settings that compromise as little as possible on both time and frequency precision. The configuration is established by comparing the level of detail in images from different configurations, and by listening to the audio reproduced from the spectrogram to find the smallest perceived loss in sound quality. The goal of the configuration is to retain as much quality in the data structures as possible, in order to provide a high amount of detail in the characteristics of the data. However, it is important to point out that this is not crucial for the result of the experiment. The mel-frequency spectrograms are produced with the Python library Librosa, version 0.7.2. Librosa is a Python package for analyzing music and audio. The building blocks that Librosa makes available enable the creation of music information retrieval systems (Librosa 2019). The produced spectrograms are stored on disk as jpg images with the Python library Matplotlib 3.1.1. See Table 3.

Table 3 - Librosa settings

Librosa settings

frame_size   55000
n_mels       128
hop_length   256
win_length   256
n_fft        1024
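
The settings in Table 3 map onto Librosa's mel-spectrogram function roughly as sketched below. This is a hedged illustration, not the script from the thesis repository. It assumes that `frame_size` is the number of audio samples per frame and that audio is loaded at 22050 hertz, which is Librosa's default loading rate; neither is stated in the text. Under that assumption, 55000 samples correspond to about 2494 milliseconds, which matches the frame length of roughly 2500 milliseconds mentioned above. The names `SETTINGS`, `ASSUMED_SR`, `frame_duration_ms` and `mel_spectrogram_db` are introduced here for illustration, and the rendering to a 128x128 jpg with Matplotlib is not covered.

```python
SETTINGS = {"frame_size": 55000, "n_mels": 128, "hop_length": 256,
            "win_length": 256, "n_fft": 1024}

# Assumption: Librosa's default loading rate; not stated in the text.
ASSUMED_SR = 22050


def frame_duration_ms(frame_size, sample_rate):
    """Duration in milliseconds of a frame of `frame_size` samples."""
    return 1000.0 * frame_size / sample_rate


def mel_spectrogram_db(path):
    """Load one frame of audio and return its mel-spectrogram in decibels."""
    # Imported here so the helper above can be used without Librosa installed.
    import numpy as np
    import librosa

    audio, sr = librosa.load(path, sr=ASSUMED_SR)
    frame = audio[:SETTINGS["frame_size"]]
    mel = librosa.feature.melspectrogram(
        y=frame, sr=sr,
        n_fft=SETTINGS["n_fft"],
        hop_length=SETTINGS["hop_length"],
        win_length=SETTINGS["win_length"],
        n_mels=SETTINGS["n_mels"],
    )
    # Convert power to decibels relative to the loudest bin.
    return librosa.power_to_db(mel, ref=np.max)


print(round(frame_duration_ms(SETTINGS["frame_size"], ASSUMED_SR)))
```

A call such as `mel_spectrogram_db("recording.wav")`, with a hypothetical file name, returns an array with 128 mel bands on the first axis and one column per hop on the second.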

4.2.2 Data collection

The data collected in this thesis consists of measurements of accuracy, loss, and training epoch. The data is collected during the training phase and the test evaluation phase. Accuracy and loss are collected with Keras (Abadi et al. 2016), using sparse categorical cross-entropy as the loss function.

Accuracy is the proportion of correct predictions among all elements in the training or test set. The loss is the accumulated error of the model’s predictions, covering correct classifications as well as misclassifications. The epoch reveals when during training a specific loss or accuracy occurs.
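
The two measurements can be written out in plain Python as follows. This is a minimal sketch under the assumption that the loss is averaged over the samples; the example probabilities are hypothetical Softmax outputs and not results from the thesis.

```python
import math


def accuracy(labels, probabilities):
    """Fraction of samples whose most probable class equals the true label."""
    correct = sum(
        1 for label, probs in zip(labels, probabilities)
        if probs.index(max(probs)) == label
    )
    return correct / len(labels)


def sparse_categorical_crossentropy(labels, probabilities):
    """Mean negative log-probability assigned to the true class."""
    losses = [-math.log(probs[label]) for label, probs in zip(labels, probabilities)]
    return sum(losses) / len(losses)


# Hypothetical Softmax outputs for three samples and five classes.
labels = [0, 2, 4]
probabilities = [
    [0.7, 0.1, 0.1, 0.05, 0.05],
    [0.2, 0.5, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.6],
]
print(accuracy(labels, probabilities))
print(round(sparse_categorical_crossentropy(labels, probabilities), 3))
```

The labels are "sparse" in the sense that each one is a single class index rather than a one-hot vector. Note that a confident wrong prediction (the second sample) contributes far more to the loss than a correct one, which is why loss and accuracy can move differently during training.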

Confusion matrices are used to show more thoroughly how well the models perform in the various configurations. Every confusion matrix presents how an architecture in a given configuration performs on average when thirty models are trained (Koehrsen 2018). The confusion matrices are set up according to the illustration in Figure 19.

Figure 19 - Confusion matrix
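
An averaged confusion matrix of this kind can be computed as sketched below. The orientation (rows for the true class, columns for the predicted class) is an assumption, and the predictions are hypothetical; only the class names are taken from the dataset description.

```python
CLASSES = ["Noise", "Racer", "Spy_Cam", "Sub", "Tugboat"]


def confusion_matrix(true_labels, predicted_labels, n_classes):
    """matrix[t][p] counts samples of true class t that were predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def average_matrices(matrices):
    """Element-wise mean of several equally sized confusion matrices."""
    n = len(matrices)
    size = len(matrices[0])
    return [
        [sum(m[r][c] for m in matrices) / n for c in range(size)]
        for r in range(size)
    ]


# Hypothetical predictions from two trained models on the same six test samples.
true_labels = [0, 0, 1, 2, 3, 4]
model_a = [0, 0, 1, 2, 3, 3]
model_b = [0, 1, 1, 2, 3, 4]
mean_matrix = average_matrices([
    confusion_matrix(true_labels, model_a, len(CLASSES)),
    confusion_matrix(true_labels, model_b, len(CLASSES)),
])
for name, row in zip(CLASSES, mean_matrix):
    print(name, row)
```

In the thesis setting, the average would be taken over thirty models instead of two.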

Two-sample Student’s t-tests with unequal variances (Welch’s t-tests) are applied to show whether there is statistical evidence for the differences in the results. This study uses a significance level of 0.001. A p-value below 0.001 means that, if there were no real difference between the compared results, a difference at least as large as the observed one would occur by chance with a probability of at most 0.1%. Such differences are treated as statistically significant.

Both one-tailed and two-tailed t-tests are applied. The calculations are performed in Microsoft Excel (Microsoft Corporation 2018).
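
For reference, the test statistic of a two-sample t-test with unequal variances can be computed as sketched below. This is an illustration in Python rather than the Excel calculation used in the thesis, and the accuracies are hypothetical. The p-value is then obtained from the t-distribution with the returned degrees of freedom, which is left out here to keep the sketch within the standard library.

```python
import math
from statistics import mean, variance


def welch_t_test(sample_a, sample_b):
    """Return the t statistic and degrees of freedom for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # statistics.variance is the unbiased sample variance (divides by n - 1).
    var_mean_a = variance(sample_a) / n_a
    var_mean_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_mean_a + var_mean_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_mean_a + var_mean_b) ** 2 / (
        var_mean_a ** 2 / (n_a - 1) + var_mean_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical test accuracies, not results from the thesis.
with_transfer = [0.91, 0.93, 0.92, 0.94, 0.90]
without_transfer = [0.89, 0.90, 0.88, 0.91, 0.87]
t, df = welch_t_test(with_transfer, without_transfer)
print(round(t, 3), round(df, 2))
```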

4.2.3 Experimentation environments

The experiment runs in parallel on two different setups, see Table 4 and Table 5.

Table 4 - Setup 1

OS      Ubuntu 18.04.4 LTS, 64-bit
Memory  62.9 GiB
CPU     AMD Ryzen 7 2700x eight-core processor x 16
GPU     GeForce GTX 1080 Ti/PCIe/SSE2

Table 5 - Setup 2

OS      Ubuntu 16.04 LTS, 64-bit
Memory  31.4 GiB
CPU     AMD Ryzen 7 2700x eight-core processor x 16
GPU     GeForce GTX 1080 Ti/PCIe/SSE2

4.2.3.1 Software

The experiment runs in Python 3.7. All code is written and executed in the editor Spyder3. All libraries are installed through the Anaconda3 prompt. Both systems use the Keras API from Tensorflow 2.1 to build the CNN architectures. GPU acceleration is used during training of the CNNs, which saves a considerable amount of time.

4.2.4 CNN implementations

It is preferable to use a CNN without unnecessary complexity since this reduces the amount of processing power needed for training. Rosebrock (2017) shows that a CNN called ShallowNet, with only one convolutional layer, is able to obtain 60% test accuracy on the dataset CIFAR-10 after 40 epochs. ShallowNet is considered a very low-complexity CNN and has 328,586 trainable parameters (weights that are adjustable) when it is configured for CIFAR-10. Another CNN, with four convolutional layers, that is provided by Keras (Chollet et al. 2015) has 1,256,858 trainable parameters and is able to obtain 79% test accuracy on CIFAR-10 after 50 epochs. Real et al. (2017) summarize the performance of a number of different CNNs on CIFAR-10. A CNN with 1.3 million trainable parameters is able to obtain 92.8% test accuracy. The research shows that there is a high correlation between test accuracy and the complexity of CNNs. CIFAR-10 serves as a benchmark for judging the usefulness of different CNN architectures. By running short training sessions, we are able to quickly determine the tendency of a CNN to overfit. If a CNN architecture has more complexity than ShallowNet, it should be able to reach a higher test accuracy than ShallowNet. That would indicate that the model is able to find more meaningful structures within the data before overfitting. For this study, we developed two different CNN architectures, “Mini-Inception” and “Mini-Inception-ResNet”.

4.2.4.1 Mini-Inception

The Mini-Inception architecture is inspired by Inception-V3 and the important features introduced by Szegedy et al. (2015). One feature that Mini-Inception inherits from Inception-V3 is the technique of spreading the input layer over different spatial dimensions. Another is the depth reduction obtained by using 1x1x32 feature extractors before the computationally more taxing 3x3 and 5x5 feature extractors. Batch normalization and the Leaky ReLU activation function are added after every convolutional layer. See Figure 20.

Figure 20 - Mini-Inception module

When the network is configured to CIFAR-10 (input dimension 32x32x3, 10 classes), it has 238,730 trainable parameters. Without the depth reduction from the 1x1 feature extractors, the number of trainable parameters would be 576,266. When the network is configured to RC-boats (input dimension 128x128x3, 5 classes), it has 371,205 trainable parameters. The network has three Mini-Inception modules placed on top of each other. On top of that, an average pooling layer is used followed by a fully connected layer and a Softmax activation function as the final output layer, see Figure 21.

Figure 21 - Mini-Inception architecture, config: RC-boats

The Mini-Inception architecture is able to obtain 85.7% test accuracy and 0.450 loss value on the dataset CIFAR-10, after 279 epochs. The architecture obtains 80% test accuracy after 43 epochs.

4.2.4.2 Mini-Inception-ResNet

The Mini-Inception-ResNet architecture is inspired by the Inception-ResNet-V2 architecture developed by Szegedy et al. (2017). The module first marks where the residual connection starts. It then splits the input from the previous layer into three parallel branches. The first step is to reduce the spatial dimensions, which allows for cheaper calculations when the 5x5 and 3x3 feature extractors are passed over the image. The next step is to upsize the spatial dimension back to its original size, which allows the residual connection to be applied after the concatenation, see Figure 22.

Figure 22 - Mini-Inception-ResNet module

When the network is configured for CIFAR-10, the model has 117,290 trainable parameters and 1,056 non-trainable parameters. When the model is configured for the RC-boats dataset, it has 139,045 trainable parameters and 1,056 non-trainable parameters. The network employs three of these modules. To reduce the spatial dimensions, a max pooling layer with a filter size of 3x3 and a stride length of 3 is placed between the modules. The top of the network consists of an average pooling layer, a fully connected layer, and a final Softmax activation layer, see Figure 23.

Figure 23 - Mini-Inception-ResNet architecture, config: RC-boats

The architecture is able to obtain 85.8% test accuracy and 0.433 loss value on the dataset CIFAR-10, after 195 epochs. The architecture obtains 81.76% test accuracy after 45 epochs.


4.2.5 Optimizer

The Adam optimizer is used by every architecture in this thesis. Kingma & Lei Ba (2014) explain that Adam has previously been shown to be a robust optimizer for training neural networks. The Adam optimizer is convenient to use since it is straightforward to implement and does not require any fine-tuning to work efficiently. Kingma & Lei Ba (2014) also show that Adam is very cost-effective in terms of memory usage and time consumption compared to other popular optimizers. A more thorough evaluation of different optimizers is out of scope for this thesis, due to the given time constraints.

4.2.6 Configurations

In every configuration, each CNN architecture has 30 independent training sessions on the RC-boats dataset. Every session has a maximum of 1000 epochs. A session is stopped earlier if the validation loss does not decrease within 125 epochs. The evaluation is made on the model with the lowest validation loss.
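
The stopping rule can be expressed as a small function over a sequence of validation losses. This is a sketch of the rule as described above, not the training script from the thesis; the function name `run_early_stopping` and the example losses are made up for illustration.

```python
def run_early_stopping(val_losses, patience=125, max_epochs=1000):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops when the validation loss has not improved for
    `patience` consecutive epochs, or after `max_epochs`.
    """
    best_loss = float("inf")
    best_epoch = 0
    stop_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        stop_epoch = epoch
        if loss < best_loss:
            # New lowest validation loss: this model would be kept for evaluation.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, stop_epoch


# Hypothetical validation losses: improving for 50 epochs, then flat.
losses = [1.0 / epoch for epoch in range(1, 51)] + [0.5] * 400
print(run_early_stopping(losses))
```

In this example, the best model is found at epoch 50 and training stops 125 epochs later, at epoch 175. The model from epoch 50 is the one that would be evaluated.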

Configuration 1 - RC-boats with random weights initialized

The models have randomly initialized weights in every layer, and every layer is trainable. This configuration is referred to as “random weights”.

Configuration 2 - RC-boats with transfer learning, unfreezed convolutional layers

The CNN architectures are pre-trained before training on the RC-boats dataset. Every layer is set to trainable. Inception-V3 and Inception-ResNet-V2 are pre-trained on the ImageNet dataset. Mini-Inception and Mini-Inception-ResNet are pre-trained on the CIFAR-10 dataset. This configuration is referred to as “tf, unfreezed conv”.

Configuration 3 - RC-boats with transfer learning, half freezed convolutional layers

The CNN architectures are pre-trained before training on the RC-boats dataset. The first half of each architecture is frozen (untrainable), while the deeper layers in the last half are set to trainable. Inception-V3 and Inception-ResNet-V2 are pre-trained on the ImageNet dataset. Mini-Inception and Mini-Inception-ResNet are pre-trained on the CIFAR-10 dataset. This configuration is referred to as “tf, half freezed conv”.

Configuration 4 - RC-boats with transfer learning, freezed convolutional layers

The CNN architectures are pre-trained before training on the RC-boats dataset. Only the final fully connected layer with the Softmax activation function is set to be trainable. Inception-V3 and Inception-ResNet-V2 are pre-trained on the ImageNet dataset. Mini-Inception and Mini-Inception-ResNet are pre-trained on the CIFAR-10 dataset. This configuration is referred to as “tf, freezed conv”.
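
The four configurations differ mainly in which layers are trainable. The sketch below shows one way to express this on a list of layer objects. It uses stand-in objects so that it runs without TensorFlow; Keras layers expose the same `trainable` attribute. Counting the "first half" by layer index is an assumption, and whether the weights start out random or pre-trained is decided when the model is created, which is not shown here.

```python
from types import SimpleNamespace


def apply_configuration(layers, configuration):
    """Set the `trainable` flag on each layer according to the configuration."""
    n = len(layers)
    if configuration in ("random weights", "tf, unfreezed conv"):
        frozen = 0          # every layer is trainable
    elif configuration == "tf, half freezed conv":
        frozen = n // 2     # assumption: "first half" counted by layer index
    elif configuration == "tf, freezed conv":
        frozen = n - 1      # only the final fully connected layer is trainable
    else:
        raise ValueError(f"unknown configuration: {configuration}")
    for index, layer in enumerate(layers):
        layer.trainable = index >= frozen
    return layers


# Stand-in layer objects, used here instead of a real Keras model.
layers = [SimpleNamespace(name=f"layer_{i}", trainable=True) for i in range(6)]
apply_configuration(layers, "tf, half freezed conv")
print([layer.trainable for layer in layers])
```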


4.3 Average mean amplitude classification

To make sure that an advanced approach such as a CNN is necessary for the problem of audio classification, a simple classifier is built which computes the mean amplitude of each class in the RC-boats dataset. A sample is then classified according to how close its average amplitude is to the mean amplitude of each class. The classifier is implemented as a Python script. The implementation is available in the script ‘simple_classifier.py’ at:

https://github.com/tobfrjohha/Transfer_learning_between_domains [2020-05-19].
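
The idea behind the classifier can be sketched as a nearest-mean rule. The code below is an illustration written for this text and is not the repository's ‘simple_classifier.py’. It assumes that amplitude means the absolute value of the audio samples, and the toy clips are hypothetical.

```python
def mean_amplitude(samples):
    """Mean absolute amplitude of one audio clip."""
    return sum(abs(s) for s in samples) / len(samples)


def fit_class_means(clips_by_class):
    """Average the mean amplitude over all training clips of each class."""
    return {
        label: sum(mean_amplitude(clip) for clip in clips) / len(clips)
        for label, clips in clips_by_class.items()
    }


def classify(clip, class_means):
    """Pick the class whose mean amplitude is closest to that of the clip."""
    amplitude = mean_amplitude(clip)
    return min(class_means, key=lambda label: abs(class_means[label] - amplitude))


# Hypothetical toy clips, not data from the RC-boats dataset.
training_clips = {
    "Noise": [[0.01, -0.02, 0.01], [0.02, -0.01, 0.02]],
    "Tugboat": [[0.4, -0.5, 0.45], [0.5, -0.4, 0.55]],
}
class_means = fit_class_means(training_clips)
print(classify([0.03, -0.02, 0.01], class_means))
print(classify([0.45, -0.5, 0.4], class_means))
```

Such a classifier can only separate classes that differ in loudness, which is what makes it a useful baseline for judging whether a CNN is needed.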


5 Evaluation

This chapter focuses on the results of the study. The first part presents the results. The second part is an analysis which focuses on the average performance for the different configurations and the performance of the highest performing models of the different configurations. Loss and accuracy are the parameters used for evaluating the performance of the models. The analysis part also investigates different convolutional layers' ability to find patterns within the data (heatmap). The last part contains the conclusions of the result.

5.1 Introduction

The experiment consists of measuring the adaptability of different CNN architectures in different configurations to the RC-boats dataset. This is done in order to reveal the level of impact that transfer learning has on the results. Each configuration involves thirty versions of each architecture. This makes it possible to present results with a reasonably high statistical power.

The results are collected from the measured variables of accuracy, loss, and the number of epochs required for finding the optimal model. The optimal model is considered to be the model with the lowest validation loss, since this indicates a higher capacity to generalize.