UPTEC F 20056
Examensarbete 30 hp (Degree project, 30 credits), November 2020
Transfer Learning on Ultrasound Spectrograms of Weld Joints for Predictive Maintenance
Joakim Bergström
Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH Division
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student
Abstract
Transfer Learning on Ultrasound Spectrograms of Weld Joints for Predictive Maintenance
Joakim Bergström
A big hurdle for many companies that want to start using machine learning is that trending techniques need a huge amount of structured data.
One potential way to reduce the need for data is to take advantage of previous knowledge from a related task, an approach known as transfer learning: a model trained on existing data is reused for another problem.
The purpose of this master thesis is to investigate whether transfer learning can reduce the need for data when faced with a new machine learning task; in particular, transfer learning is applied to ultrasound spectrograms of weld joints for predictive maintenance.
The base for transfer learning is VGGish, a convolutional neural network model trained on audio samples collected from YouTube videos. The pre-trained weights are kept, and the prediction layer is replaced with a new prediction layer consisting of two neurons.
The whole model is re-trained on the ultrasound spectrograms. The dataset is restricted to a minimum of ten and a maximum of 100 training samples. The results are evaluated and compared to a regular convolutional neural network trained on the same data. The results show that transfer learning improves the test accuracy compared to the regular convolutional neural network when the dataset is small.
This thesis project concludes that transfer learning can reduce the need for data when faced with a new machine learning task. The results indicate that transfer learning could be useful in industry.
Subject reader: Ping Wu
Supervisor: Adam Hedkvist
Populärvetenskaplig sammanfattning (Popular science summary)
Artificial intelligence (AI) has existed as a concept since the mid-twentieth century. In the beginning, AI consisted of hard-coded rules for the computers to follow. As computing power increased, the focus shifted toward machine learning (ML). The difference from before was that ML used labeled data to learn the rules on its own, without explicit human intervention.
Today, a major obstacle for companies interested in starting to use AI or ML is that the trending techniques require enormous amounts of well-structured data. The purpose of this thesis is to investigate a potential solution to this obstacle. The method is called transfer learning (överföringslärning) and consists of reusing previously gained knowledge for new problems.
In transfer learning, a pre-trained convolutional neural network (CNN) is used as a base, and only the last layer is replaced by one matching the new classification task.
A CNN is applied to image recognition and consists of several layers that find patterns in the image. The last layer takes the output values from the CNN and predicts which class the image belongs to.
The potential benefit of transfer learning lies in the assumption that the pre-trained model has picked up general patterns and features that can help solve the new problem. To investigate this, the results are compared with a new CNN trained only on this problem.
The results show that transfer learning gives higher accuracy when the amount of training data is very small. This indicates that the method has the potential to be useful for companies that want to start using ML or AI.
Preface
First, I would like to give my deepest thanks to my supervisor Adam Hedkvist at Syntronic AB in Gävle. This master thesis could not have been done without him, especially due to the extraordinary circumstances in the spring and summer of 2020. I would also like to thank my subject reader Ping Wu for always being there when I needed anything. Finally, I would like to send my warmest thanks to my closest family for being supportive and never giving up on me.
Uppsala, November 2020
Joakim Bergström
Contents
1 Introduction
  1.1 Background
  1.2 Prior work
  1.3 Purpose of the project
  1.4 Task and scope
  1.5 Outline
2 Theory
  2.1 Ultrasound and spectrograms
  2.2 Machine learning and deep learning
  2.3 Feedforward neural networks
    2.3.1 Weights and bias
    2.3.2 Loss/cost function
    2.3.3 Gradient descent using backpropagation
    2.3.4 Convolutional neural networks
    2.3.5 Activation functions
    2.3.6 Fully connected layer
    2.3.7 Regularization techniques
    2.3.8 Label representation
    2.3.9 Performance metrics
  2.4 Logistic Regression
  2.5 Transfer Learning
    2.5.1 What to transfer
    2.5.2 Negative transfer
    2.5.3 How to use transfer learning?
    2.5.4 VGGish
3 Implementation
  3.1 Software and development tools
    3.1.1 Python
    3.1.2 Visual Studio Code
    3.1.3 TensorFlow
    3.1.4 Keras
    3.1.5 NumPy
    3.1.6 Matplotlib
    3.1.7 Scikit-learn
  3.2 Data preprocessing
  3.3 Model construction and compilation
  3.4 Training and validation
  3.5 Testing and evaluation
4 Results and discussion
  4.1 Implementation of transfer learning
  4.2 Evaluation
  4.3 Proposed pre-trained models for company libraries
5 Conclusions and future work
References
List of Figures
2.1 Illustration of how AI, ML and deep learning are related to each other.
2.2 Feedforward neural network where, for each node, a_i is the input value, w_i is the weight, b is the bias and y is the output value.
2.3 Convolution with a filter size of 3 × 3 and a stride of 1, using zero padding, yields a new image with the same resolution as the original one.
2.4 Max pooling with a filter size of 2 × 2 and a stride of 2 reduces the image to half its original resolution.
2.5 The leftmost graph is an example of underfit, the middle graph is an example of a good fit, and the rightmost graph is an example of overfit.
2.6 The architecture of the convolutional neural network VGG.
2.7 The architecture of the convolutional neural network VGGish.
3.1 Overview of the implementation.
3.2 An example view of the code editor Visual Studio Code.
3.3 The Keras implementation of VGGish, built using the functional API.
3.4 Two sample spectrograms from the training set. Figure 3.4(a) is labeled as pass; Figure 3.4(b) is labeled as fail.
3.5 The architecture of the Hedkvist [1] model.
4.1 Test accuracy for VGGish. The model is trained 100 times for 100 epochs. The accuracy shown is the mean with one standard deviation for each of the ten training sizes.
4.2 Test accuracy for the Hedkvist [1] model. The model is trained 100 times for 30 epochs. The accuracy shown is the mean with one standard deviation for each of the ten training sizes.
4.3 Test accuracy for VGGish without the pre-trained weights. The model is trained 100 times for 100 epochs. The accuracy shown is the mean with one standard deviation for each of the ten training sizes.
List of Tables
2.1 One-hot encoding.
4.1 Test accuracy for Hedkvist [1] and VGGish, with 18956 training examples. All layers of VGGish were fine-tuned.
4.2 Test accuracy for logistic regression.
Abbreviations
AI    artificial intelligence
ANN   artificial neural network
API   application programming interface
CNN   convolutional neural network
CPU   central processing unit
FCL   fully connected layer
GPU   graphical processing unit
IDE   integrated development environment
ML    machine learning
NLP   natural language processing
ReLU  rectified linear unit
SGD   stochastic gradient descent
SOTA  state of the art
STFT  short time Fourier transform
VSC   Visual Studio Code
1 Introduction
A big hurdle for many companies that want to start using artificial intelligence (AI) is that trending techniques need a huge amount of structured data. Are there methods to reduce the need for data? One potential way is to take advantage of previous knowledge from a related task. This field within machine learning (ML) is called transfer learning. A basic description of it: a model trained on existing data is reused for another problem.
1.1 Background
Even though AI and ML are almost considered buzzwords of today, these concepts are not new at all. The idea of computers thinking for themselves has been around since the 1950s. At first, AI consisted of humans providing hard-coded rules for the computers to obey. Up until the 1980s, symbolic AI was the most popular approach for AI [2]. As computational power increased [3], the ML approach rose in popularity. In contrast to symbolic AI, ML uses labeled data to learn the rules explaining the underlying problem, without (explicit) human intervention [2].
Early machine learning used fully connected layers (FCLs) to classify images. However, the number of model parameters blew up as the image resolution increased.
Another drawback was the need to flatten the input before presenting it to the network, discarding all spatial information. This led to the invention of the convolutional neural network (CNN) [4]. This new type of network was able to recognize patterns no matter where in the image they appeared, and was inspired by the visual nervous system model proposed by [5], [6]. Modern CNN architectures are often very deep, but [7] shows that a shallow model can be taught to mimic a more complex teacher model using the logits of the teacher as labels. To solve the problem of having numerous model parameters, they used a bottleneck linear layer between the input and hidden layer. This allowed them to factorize the weight matrix, reducing convergence time as well as memory usage. However, until the need for a teacher model is eliminated, deep nets are still the way to go.
Deep neural network training involves a large amount of matrix multiplications.
The architecture of the central processing unit (CPU) is not well suited for doing
these types of calculations. A graphical processing unit (GPU) on the other hand
excels at the task, prompting many researchers to use them for development. This
is mainly because GPUs have larger memory bandwidth and can perform many
small computations in parallel [8], [9].
In [10], it is claimed that almost all neural networks trained on images learn first-layer features akin to Gabor filters. These are called general features, in contrast to the highly class-specific features learned by the last layers of a neural network.
Transfer learning has been called ”the next driver of ML commercial success” [11]. First described in [12], and later in [13], transfer learning was shown to improve learning time compared to randomly initialized neural networks. The theory is based on a concept in psychology known as adaptive generalization: the ability to generalize not only within the same domain, but across different domains [14].
Audio classification network architectures have not been as deeply investigated as their image counterparts. While there exist several state-of-the-art (SOTA) neural networks for images, such as ResNet [15] and VGG [16], for sound only VGGish [17] and YAMNet [17] can be called SOTA.
Ultrasonic waves, or ultrasound, have a frequency above the upper audible limit of human hearing. A common use of ultrasound is fetal ultrasound (sonograms). In medical settings, the frequency of the sound waves lies in the range of 3 to 10 MHz [18]. By sending sound waves and gathering the reflections, an image of the fetus inside the uterus can be produced [19]. Another use is within the field of predictive maintenance, which is defined as ”a condition based maintenance carried out following a forecast derived from the analysis and evaluation of the significant parameters of the degradation of the item” [20].
1.2 Prior work
This thesis project started from the work in [1], where ML was used to analyze data from ultrasonic scanning of weld joints for predictive maintenance. A portion of the same dataset, and the same network structure, were used in this project.
Spectrum analysis technology has been used extensively for the analysis of infant cries [21], EGG signals [22], motor fault diagnostics [23] and automatic speech recognition [24]. In [25], good results were achieved by combining sound with unlabeled video to learn natural sound representations. Other applications of transfer learning have had success within natural language processing (NLP) [26].
1.3 Purpose of the project
The main objective of the project is to investigate if transfer learning can reduce the need for data when faced with a new ML task, and more specifically, compare transfer learning to regular ML within the field of predictive maintenance using ultrasound. The questions to be answered are:
• Can transfer learning reduce the need for data when faced with a new ML task?
• Could this technique be useful in the industry?
• Are there any rules available for reducing one problem to another?
• What library of pre-trained models would be needed for a company to quickly get something up and running for a customer?
1.4 Task and scope
In this thesis project, one pre-trained neural network will be used as the base for transfer learning. The results of the new CNN, on the same dataset of spectrogram images, will be compared to the CNN design inspired by [1]. A comparison between the CNN trained with transfer learning and the CNN without transfer learning will be made.
This thesis project contains four major parts. The first part is a literature study, where knowledge about key concepts and prior work is gathered. The second part is to find a pre-trained network suitable for spectrogram images. These parts include the following tasks:
• Find a pre-trained CNN specialized in audio classification
• Implement the CNNs (both with and without transfer learning)
The third part is about preprocessing of the data used for part four, which is about training, evaluating and comparing pre-trained networks with self-made networks.
These parts include the following tasks:
• Train the networks on the spectrogram images
• Compare the results of the two neural networks
1.5 Outline
Chapter 1 is dedicated to the background and motivation for this thesis, along
with project specifications, tasks and methods. Chapter 2 describes the relevant
theory and concepts. Chapter 3 contains the implementation. Chapter 4 gives
a presentation of the results and discussion. Chapter 5 presents conclusions and
future work.
2 Theory
This section presents the relevant theory and concepts about ML, CNNs and transfer learning.
2.1 Ultrasound and spectrograms
Ultrasound can be used to non-invasively inspect material and weld joints and detect defects inside them. This is realized specifically by an ultrasonic device that generates ultrasound and sends it into the material, and then receives the reflections of the ultrasound from structures (e.g. weld joints, voids, cracks etc.) inside the material. The received ultrasound signals are continuous signals.
How to best represent ultrasound data to feed into the neural network is not as straightforward as for image data. The best representation may vary, since ultrasound has many defining properties. A raw ultrasound signal, with amplitude values at certain times, can only represent part of the total information carried by the signal. A transformation of the signal into the frequency domain reveals other properties unique to that signal. The short time Fourier transform (STFT) combines the time and frequency components of a signal into one representation called a spectrogram. Each spectrogram is a grey- or color-scaled plot of amplitude, with time on the x-axis and frequency on the y-axis [27].
The STFT is defined in [28] as

\mathrm{STFT}_x(t_0, f) = \int_{-\infty}^{\infty} \left[ x(t) \cdot w(t - t_0) \right] e^{-j 2 \pi f t} \, dt, \qquad (2.1)

and the spectrogram is defined as an intensity plot of the STFT magnitude [29]:

\mathrm{Spectrogram} = \left| \mathrm{STFT}_x(t_0, f) \right|^2. \qquad (2.2)
The STFT uses a window function w(t), e.g. a periodic Hann window, to slide
over the signal and perform a Fourier transform on the part of the signal x(t)
visible through the window, hopping forward in time after each computation. The
windows are controlled by the window length and hop-size, and they typically
overlap in time.
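As an illustration of Eqs. (2.1)–(2.2), a spectrogram can be computed with a sliding Hann window in plain NumPy. The window length, hop size and the 1 kHz test tone below are arbitrary illustrative choices, not the parameters used in this project:

```python
import numpy as np

def stft_spectrogram(x, win_len=256, hop=128):
    """Spectrogram as |STFT|^2, cf. Eqs. (2.1)-(2.2)."""
    window = np.hanning(win_len)                     # Hann window w(t)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])    # windowed signal segments
    stft = np.fft.rfft(frames, axis=1)               # Fourier transform per frame
    return np.abs(stft) ** 2                         # squared magnitude

# illustrative test tone: 1 kHz sine sampled at 8 kHz for one second
fs, win_len = 8000, 256
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)

spec = stft_spectrogram(x, win_len=win_len)
peak_bin = int(np.argmax(spec[0]))    # strongest frequency bin in the first frame
peak_hz = peak_bin * fs / win_len     # convert bin index to Hz
```

With these choices, overlapping windows hop 128 samples at a time, and the energy in each frame concentrates in the bin corresponding to the tone's frequency.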
2.2 Machine learning and deep learning
ML uses labeled data to learn the rules explaining the underlying problem, without explicit human intervention. [30] gives a formal definition of ML: ”A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.” These learning tasks are commonly grouped as follows:
• Supervised learning, or learning to imitate. The goal is to predict the output given some input data with known labels [31]. This is done by fitting a model on the input/output pairs, learning the relationship between them.
• Unsupervised learning, where the difference to supervised learning is that there are no known labels for the inputs. The goal is instead to find relevant relations between the inputs themselves. One method is called clustering, aiming to group inputs in a way that the members of each group are more similar to each other than to members of any other group [32].
• Reinforcement learning, or learning by trial and error. An agent inter- acts with an environment. Rewards are the result of an evaluation of the environment, and the agent’s goal is to maximize that reward [33].
Figure 2.1: Illustration of how AI, ML and deep learning are related to each other.
This thesis is restricted to the field of supervised learning.
Deep learning is about letting computers take a hierarchical approach to learning patterns. This way, the computer can learn from experience and break down difficult problems into simpler ones. The name deep learning comes from the deep, layered, hierarchical structure for learning [34]. An example of this is the deep neural network.
2.3 Feedforward neural networks
Neural networks are inspired by the neuron model in [35]. The first model that was trainable by only feeding it inputs was the perceptron [36], [37]. Feedforward neural networks are characterized by their layered structure and no feedback connections [38]. An example of a feedforward neural network is shown in Figure 2.2. If the network has at least one feedback connection it is called a recurrent network.
2.3.1 Weights and bias
All neurons are connected to the neurons in the previous layer, and the weights can be thought of as the strengths of those connections. The bias indicates how prone this particular neuron is to activate. The output of each neuron is defined as

y = \sigma\left( b + \sum_{i=1}^{n} w_i a_i \right), \qquad (2.3)

where a_i, w_i and n are the inputs, the weights and the number of nodes in the previous layer, respectively. The bias is denoted by b, as shown in Figure 2.2. The activation function \sigma is described further in Section 2.3.5. To summarize, the learning part of ML is to find weights and biases such that the output of the network correctly conforms with the label of the input.
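As a sketch of Eq. (2.3), the output of a single neuron is a weighted sum of its inputs plus a bias, passed through an activation function (here the sigmoid; all numbers are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical inputs, weights and bias for one neuron (n = 3)
a = np.array([0.5, -1.0, 2.0])   # inputs a_i from the previous layer
w = np.array([0.4, 0.3, -0.2])   # weights w_i
b = 0.1                          # bias

z = b + np.dot(w, a)             # b + sum_i w_i * a_i
y = sigmoid(z)                   # activation sigma applied to the weighted sum
```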
2.3.2 Loss/cost function
To score how well the network performs during training, the loss function is used as a similarity measure between the predicted label and the actual label.

Figure 2.2: Feedforward neural network where, for each node, a_i is the input value, w_i is the weight, b is the bias and y is the output value.

The
input to the loss function is all the weights and biases. The output of the loss function gives a value of how well the network is performing at that moment. The goal however is to find the choice of parameters which minimizes this loss function.
A common loss function for a binary classification problem is binary cross-entropy, defined as
C = -\left( y \log(p) + (1 - y) \log(1 - p) \right), \qquad (2.4)
where y is the binary label and p is the predicted probability of the input being
in class y = 1. Finding the global minimum of the loss function is hard, but there are many ways to find a local minimum.
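A small numeric sketch of Eq. (2.4) shows the intended behavior: a confident correct prediction is penalized lightly, a confident wrong one heavily (the probabilities are illustrative):

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Eq. (2.4): y is the true label in {0, 1}, p the predicted P(class 1)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

low_loss = binary_cross_entropy(1, 0.9)   # confident and correct: small loss
high_loss = binary_cross_entropy(1, 0.1)  # confident and wrong: large loss
```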
2.3.3 Gradient descent using backpropagation
The gradient gives the direction of steepest ascent, and the negative gradient gives the direction of steepest descent. In the negative gradient vector of the loss, a relatively large negative component indicates that decreasing this particular weight by a lot will also decrease the cost function by a lot. The preferred changes are those that give the most value for the money; that is, the negative gradient of the loss function tells the neurons how to change their weights in order to improve the predictive capabilities of the network. The gradients are computed using backpropagation [39], an algorithm that works its way backwards from the output loss by computing the partial derivative of the loss function with respect to each weight.
Neither the neural network nor the programmer can explicitly change the output of the neurons, only implicitly by changing the weights and biases. The neurons in the second-to-last layer that have a positive weight to the desired neuron in the last layer should have their activation increased. By symmetry, the neurons in the second-to-last layer with a negative weight to that output neuron should have their activations decreased. For a binary classification problem, the only other output neuron wants the opposite. Adding the two neurons' wishes gives a list of how all the weights in the second-to-last layer should change in order to minimize the loss.
The same reasoning holds for neurons in the previous layers, walking backwards through the network.
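The chain-rule reasoning above can be checked numerically. For a single sigmoid output neuron with binary cross-entropy loss, backpropagation collapses to the compact gradient dC/dw_i = (p − y) · a_i, which can be verified against a finite-difference approximation (all values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one sigmoid output neuron with binary cross-entropy loss (toy values)
a = np.array([0.5, -1.0, 2.0])   # inputs to the neuron
w = np.array([0.4, 0.3, -0.2])   # weights
b, y = 0.1, 1.0                  # bias and true label

def loss(weights):
    p = sigmoid(b + np.dot(weights, a))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# backpropagation: for sigmoid + cross-entropy the chain rule gives
# dC/dz = p - y, hence dC/dw_i = (p - y) * a_i
p = sigmoid(b + np.dot(w, a))
grad_backprop = (p - y) * a

# finite-difference approximation of the same gradient
eps = 1e-6
grad_numeric = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                         for e in np.eye(3)])
```

The two gradients agree to several decimal places, which is the standard sanity check for a backpropagation implementation.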
It is computationally expensive to compute the gradient over the whole training set.
A common way to overcome this is stochastic gradient descent (SGD), a gradient-based optimization technique suitable for machine learning problems [40].
The data is shuffled and then split into mini-batches, for each of which the negative gradient is computed. This is not the true gradient, but a good enough approximation [2]. The choice of optimizer in a neural network is essentially a choice of how the parameters of the network will be updated based on the loss.
The magnitude of the update step is called the learning rate. Going through the same procedure for all training samples in a batch, and saving each desired change to the weights and biases, the parameters are updated with the average of those changes.
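The shuffle/mini-batch/average-update loop described above can be sketched on a toy one-parameter regression problem (the data, learning rate and batch size are illustrative assumptions, not values from this project):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy regression data with a known answer: y = 3x
X = rng.normal(size=200)
Y = 3.0 * X

w = 0.0          # single trainable parameter
lr = 0.1         # learning rate: magnitude of each update step
batch_size = 20

for epoch in range(20):
    idx = rng.permutation(len(X))                 # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]     # one mini-batch
        pred = w * X[batch]
        # average gradient of the squared error over the mini-batch
        grad = np.mean(2.0 * (pred - Y[batch]) * X[batch])
        w -= lr * grad                            # step along the negative gradient
```

Each mini-batch gradient is only an approximation of the full gradient, yet the parameter still converges to the true value 3.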
Figure 2.3: Convolution with a filter size of 3 × 3 and a stride of 1, using zero padding, yields a new image with the same resolution as the original one.
2.3.4 Convolutional neural networks
A CNN is an artificial neural network (ANN) widely used for image analysis. It excels at detecting spatial patterns in the input data. The hidden layers are so-called convolutional layers.
Layers near the input are able to detect general features like edges and shapes.
Convolutional layers closer to the top of the network are more specialized and can detect complex textures and shapes. Between each pair of convolutional layers is an activation function, described in Section 2.3.5.
Each convolutional layer consists of filters that convolve across the input pixels until they have covered the whole image. The dot product of the filter and the input pixel values is stored as a new ”image” and passed to the next layer. This procedure reduces the dimensions of the image from (n × n) to (n − f + 1) × (n − f + 1) with a filter of size (f × f), unless padding is used. Zero padding inserts extra pixels with the value zero on the border of the image. This ensures that the filters can convolve over the pixels near the borders, thus preserving the shape of the original image. An example of the convolution operation can be seen in Figure 2.3.
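A minimal sketch of this convolution step with zero padding (using the elementwise cross-correlation convention standard in CNNs; the image and filter values are made up for illustration):

```python
import numpy as np

def conv2d_same(image, kernel):
    """Stride-1 'convolution' (cross-correlation, the CNN convention)
    with zero padding, so the output keeps the input resolution."""
    f = kernel.shape[0]
    pad = f // 2
    padded = np.pad(image, pad)                   # zero padding on the border
    out = np.zeros(image.shape)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            patch = padded[i:i + f, j:j + f]      # region under the filter
            out[i, j] = np.sum(patch * kernel)    # dot product with the filter
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
# made-up 3x3 filter responding to horizontal intensity changes
edge_filter = np.array([[0.0, 0.0, 0.0],
                        [-1.0, 1.0, 0.0],
                        [0.0, 0.0, 0.0]])
result = conv2d_same(img, edge_filter)
```

Because of the zero padding, the 4 × 4 input yields a 4 × 4 output, as described above.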
Max/average pooling, on the other hand, is used to purposely reduce the size of the image after a convolutional layer. A pool of size (m × m) moves across the image with a stride of g, where each pooling operation outputs the maximum/average value of the pixels in the (m × m) pool. The output is a lower-resolution version of the convolutional output, leading to a reduced number of parameters and computational cost. An example of the max pooling operation can be seen in Figure 2.4. The most active regions of the image are fed forward, while the non-active regions are discarded. This also makes the model sensitive to shifts in the input image [41], because pooling ignores the sampling theorem by not applying anti-aliasing via low-pass filters. However, [42] has managed to integrate anti-aliasing into the pooling operation, improving both robustness and accuracy.
For example, max pooling the 4 × 4 input

    4 0 4 3
    4 6 8 5
    3 1 6 7
    4 5 6 9

with a 2 × 2 filter and stride 2 yields the 2 × 2 output

    6 8
    5 9

Figure 2.4: Max pooling with a filter size of 2 × 2 and a stride of 2 reduces the image to half its original resolution.
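The max pooling operation can be sketched in a few lines; the 4 × 4 input below is the same worked example as in Figure 2.4:

```python
import numpy as np

def max_pool(image, size=2, stride=2):
    """Keep only the maximum of each (size x size) region."""
    h, w = image.shape[0] // stride, image.shape[1] // stride
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = image[i * stride:i * stride + size,
                              j * stride:j * stride + size].max()
    return out

img = np.array([[4, 0, 4, 3],
                [4, 6, 8, 5],
                [3, 1, 6, 7],
                [4, 5, 6, 9]])
pooled = max_pool(img)
```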
2.3.5 Activation functions
The output from a hidden layer in the network passes through an activation function before being used as input to the next layer. This function must be non-linear for the network to be able to learn the complex underlying patterns connecting inputs and the corresponding labels. Historically, the sigmoid function σ(x) = 1/(1 + e^{−x}) has been very popular. In recent years, the most common activation function for neural networks is the rectified linear unit (ReLU), defined as
\sigma(x) = \max(0, x). \qquad (2.5)
The prediction layer uses the softmax activation function [43], defined as
\sigma(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}.
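Both activation functions can be sketched directly from their definitions (the logit values below are made up; the two entries mirror the two-neuron pass/fail prediction layer used in this thesis):

```python
import numpy as np

def relu(x):
    """Eq. (2.5): pass positive values through, clamp the rest to zero."""
    return np.maximum(0.0, x)

def softmax(x):
    """Exponentiate the inputs, then normalize so the outputs sum to one."""
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / np.sum(e)

logits = np.array([2.0, -1.0])           # two made-up output-neuron values
probs = softmax(logits)                  # interpretable as class probabilities
activated = relu(np.array([-3.0, 0.5]))  # negative input clamped to zero
```

Because softmax normalizes the exponentiated values, the two outputs always sum to one and can be read as the predicted probabilities of the two classes.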