UPTEC F 20056
Examensarbete 30 hp (Degree project, 30 credits), November 2020
Transfer Learning on Ultrasound Spectrograms of Weld Joints for Predictive Maintenance
Joakim Bergström
Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH Division
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student
Abstract
Transfer Learning on Ultrasound Spectrograms of Weld Joints for Predictive Maintenance
Joakim Bergström
A big hurdle for many companies that want to start using machine learning is that trending techniques need a huge amount of structured data.
One potential way to reduce the need for data is to take advantage of previous knowledge from a related task, an approach known as transfer learning: a model trained on existing data is reused for another problem.
The purpose of this master thesis is to investigate whether transfer learning can reduce the need for data when faced with a new machine learning task; in particular, transfer learning is applied to ultrasound spectrograms of weld joints for predictive maintenance.
The base for transfer learning is VGGish, a convolutional neural network model trained on audio samples collected from YouTube videos. The pre-trained weights are kept, and the prediction layer is replaced with a new prediction layer consisting of two neurons.
The whole model is re-trained on the ultrasound spectrograms. The dataset is restricted to a minimum of ten and a maximum of 100 training samples. The results are evaluated and compared to a regular convolutional neural network trained on the same data. The results show that transfer learning improves the test accuracy compared to the regular convolutional neural network when the dataset is small.
This thesis project concludes that transfer learning can reduce the need for data when faced with a new machine learning task. The results indicate that transfer learning could be useful in industry.
Subject reader: Ping Wu
Supervisor: Adam Hedkvist
Populärvetenskaplig sammanfattning (Popular science summary)
Artificial intelligence (AI) has existed as a concept since the mid-twentieth century. In the beginning, AI consisted of hard-coded rules for the computers to follow. As computing power increased, the focus shifted toward machine learning (ML). The difference from before was that ML used labeled data to learn the rules on its own, without explicit human intervention.
Today, a major obstacle for companies interested in starting to use AI or ML is that the trending techniques require enormous amounts of well-structured data. The purpose of this thesis is to investigate a potential solution to this obstacle. The method is called transfer learning (överföringslärning) and consists of reusing previously gained knowledge for new problems.
In transfer learning, a pre-trained convolutional neural network (CNN) is used as a base, and only the last layer is replaced by one matching the new classification task.
A CNN is applied to image recognition and consists of several layers that find patterns in the image. The last layer takes the output values from the CNN and predicts which class the image belongs to.
The potential benefit of transfer learning lies in the assumption that the pre-trained model has picked up general patterns and features that can help solve the new problem. To investigate this, the results are compared with a new CNN trained only on this problem.
The results show that transfer learning gives higher accuracy when the amount of training data is very small. This indicates that the method has the potential to be useful for companies that want to start using ML or AI.
Preface
First, I would like to give my deepest thanks to my supervisor Adam Hedkvist at Syntronic AB in Gävle. This master thesis could not have been done without him, especially due to the extraordinary circumstances in the spring and summer of 2020. I would also like to thank my subject reader Ping Wu for always being there when I needed anything. Finally, I would like to send my warmest thanks to my closest family for being supportive and never giving up on me.
Uppsala, November 2020
Joakim Bergström
Contents
1 Introduction
  1.1 Background
  1.2 Prior work
  1.3 Purpose of the project
  1.4 Task and scope
  1.5 Outline
2 Theory
  2.1 Ultrasound and spectrograms
  2.2 Machine learning and deep learning
  2.3 Feedforward neural networks
    2.3.1 Weights and bias
    2.3.2 Loss/cost function
    2.3.3 Gradient descent using backpropagation
    2.3.4 Convolutional neural networks
    2.3.5 Activation functions
    2.3.6 Fully connected layer
    2.3.7 Regularization techniques
    2.3.8 Label representation
    2.3.9 Performance metrics
  2.4 Logistic Regression
  2.5 Transfer Learning
    2.5.1 What to transfer
    2.5.2 Negative transfer
    2.5.3 How to use transfer learning?
    2.5.4 VGGish
3 Implementation
  3.1 Software and development tools
    3.1.1 Python
    3.1.2 Visual Studio Code
    3.1.3 TensorFlow
    3.1.4 Keras
    3.1.5 NumPy
    3.1.6 Matplotlib
    3.1.7 Scikit-learn
  3.2 Data preprocessing
  3.3 Model construction and compilation
  3.4 Training and validation
  3.5 Testing and evaluation
4 Results and discussion
  4.1 Implementation of transfer learning
  4.2 Evaluation
  4.3 Proposed pre-trained models for company libraries
5 Conclusions and future work
References
List of Figures
2.1 Illustration of how AI, ML and deep learning are related to each other.
2.2 Feedforward neural network where, for each node, a_i is the input value, w_i is the weight, b is the bias and y is the output value.
2.3 Convolution with a filter size of 3 × 3 and a stride of 1, using zero padding, yields a new image with the same resolution as the original one.
2.4 Max pooling with a filter size of 2 × 2 and a stride of 2 reduces the image to half its original resolution.
2.5 The leftmost graph is an example of underfit, the middle graph is an example of a good fit, and the rightmost graph is an example of overfit.
2.6 The architecture of the convolutional neural network VGG.
2.7 The architecture of the convolutional neural network VGGish.
3.1 Overview of the implementation.
3.2 An example view of the code editor Visual Studio Code.
3.3 The Keras implementation of VGGish, built using the functional API.
3.4 Two sample spectrograms from the training set. Figure 3.4(a) is labeled as pass; Figure 3.4(b) is labeled as fail.
3.5 The architecture of the Hedkvist [1] model.
4.1 Test accuracy for VGGish. The model is trained 100 times for 100 epochs. The accuracy shown is the mean with one standard deviation for each of the ten training sizes.
4.2 Test accuracy for the Hedkvist [1] model. The model is trained 100 times for 30 epochs. The accuracy shown is the mean with one standard deviation for each of the ten training sizes.
4.3 Test accuracy for VGGish without the pre-trained weights. The model is trained 100 times for 100 epochs. The accuracy shown is the mean with one standard deviation for each of the ten training sizes.
List of Tables
2.1 One-hot encoding.
4.1 Test accuracy for Hedkvist [1] and VGGish, with 18956 training examples. All layers of VGGish were fine-tuned.
4.2 Test accuracy for logistic regression.
Abbreviations
AI    artificial intelligence
ANN   artificial neural network
API   application programming interface
CNN   convolutional neural network
CPU   central processing unit
FCL   fully connected layer
GPU   graphical processing unit
IDE   integrated development environment
ML    machine learning
NLP   natural language processing
ReLU  rectified linear unit
SGD   stochastic gradient descent
SOTA  state of the art
STFT  short time Fourier transform
VSC   Visual Studio Code
1 Introduction
A big hurdle for many companies that want to start using artificial intelligence (AI) is that trending techniques need a huge amount of structured data. Are there methods to reduce the need for data? One potential way is to take advantage of previous knowledge from a related task. This field within machine learning (ML) is called transfer learning. A basic description of it: a model trained on existing data is reused for another problem.
1.1 Background
Even though AI and ML are almost considered buzzwords of today, these concepts are not new at all. The idea of computers thinking for themselves has been around since the 1950s. At first, AI consisted of humans providing hard-coded rules for the computers to obey. Up until the 1980s, symbolic AI was the most popular approach for AI [2]. As computational power increased [3], the ML approach rose in popularity. In contrast to symbolic AI, ML uses labeled data to learn the rules explaining the underlying problem, without (explicit) human intervention [2].
Early machine learning used fully connected layers (FCLs) to classify images. However, the number of model parameters blew up as the image resolution increased.
Another drawback was the need to flatten the input before presenting it to the network, discarding all spatial information. This led to the invention of the convolutional neural network (CNN) [4]. This new type of network was able to recognize patterns no matter where in the image they appeared, and was inspired by the visual nervous system model proposed by [5], [6]. Modern CNN architectures are often very deep, but [7] shows that a shallow model can be taught to mimic a more complex teacher model using the logits of the teacher as labels. To solve the problem of having numerous model parameters, they used a bottleneck linear layer between the input and hidden layer. This allowed them to factorize the weight matrix, reducing convergence time as well as memory usage. However, until the need for a teacher model is eliminated, deep nets are still the way to go.
Deep neural network training involves a large amount of matrix multiplications.
The architecture of the central processing unit (CPU) is not well suited for doing
these types of calculations. A graphical processing unit (GPU) on the other hand
excels at the task, prompting many researchers to use them for development. This
is mainly because GPUs have larger memory bandwidth and can perform many
small computations in parallel [8], [9].
In [10], it is claimed that almost all neural networks trained on images learn first-layer features akin to Gabor filters. These are called general features, in contrast to the highly class-specific features learned by the last layers of a neural network.
Transfer learning has been called ”the next driver of ML commercial success” [11]. First described in [12], and later in [13], transfer learning was shown to improve learning time compared to randomly initialized neural networks. The theory is based on a concept in psychology known as adaptive generalization: the ability to generalize not only within the same domain, but across different domains [14].
Audio classification network architectures have not been as deeply investigated as their image counterparts. While there exist several state-of-the-art (SOTA) neural networks for images, such as ResNet [15] and VGG [16], for sound only VGGish [17] and YAMNet [17] can be called SOTA.
Ultrasonic waves, or ultrasound, have a frequency above the upper audible limit of human hearing. A common use of ultrasound is fetal ultrasound (sonograms). In medical settings, the frequency of the sound waves lies in the range of 3 to 10 MHz [18]. By sending sound waves and gathering the reflections, an image of the fetus inside the uterus can be produced [19]. Another use is within the field of predictive maintenance, which is defined as ”a condition based maintenance carried out following a forecast derived from the analysis and evaluation of the significant parameters of the degradation of the item” [20].
1.2 Prior work
This thesis project started from the work in [1], where ML was used to analyze data from ultrasonic scanning of weld joints for predictive maintenance. A portion of the same dataset, and the same network structure, were used in this project.
Spectrum analysis technology has been used extensively for the analysis of infant cries [21], EGG signals [22], motor fault diagnostics [23] and automatic speech recognition [24]. In [25], good results were achieved by combining sound with unlabeled video to learn natural sound representations. Other applications of transfer learning have had success within natural language processing (NLP) [26].
1.3 Purpose of the project
The main objective of the project is to investigate if transfer learning can reduce the need for data when faced with a new ML task, and more specifically, compare transfer learning to regular ML within the field of predictive maintenance using ultrasound. The questions to be answered are:
• Can transfer learning reduce the need for data when faced with a new ML task?
• Could this technique be useful in the industry?
• Are there any rules available for reducing one problem to another?
• What library of pre-trained models would be needed for a company to quickly get something up and running for a customer?
1.4 Task and scope
In this thesis project, one pre-trained neural network will be used as the base for transfer learning. The results of the new CNN, on the same dataset of spectrogram images, will be compared to the CNN design inspired by [1]. A comparison between the CNN trained with transfer learning and the CNN without transfer learning will be made.
This thesis project contains four major parts. The first part is a literature study, where knowledge about key concepts and prior work is gathered. The second part is to find a pre-trained network suitable for spectrogram images. These parts include the following tasks:
• Find a pre-trained CNN specialized in audio classification
• Implement the CNNs (both with and without transfer learning)
The third part is about preprocessing of the data used for part four, which is about training, evaluating and comparing pre-trained networks with self-made networks.
These parts include the following tasks:
• Train the networks on the spectrogram images
• Compare the results of the two neural networks
1.5 Outline
Chapter 1 is dedicated to the background and motivation for this thesis, along
with project specifications, tasks and methods. Chapter 2 describes the relevant
theory and concepts. Chapter 3 contains the implementation. Chapter 4 gives
a presentation of the results and discussion. Chapter 5 presents conclusions and
future work.
2 Theory
This section presents the relevant theory and concepts about ML, CNNs and transfer learning.
2.1 Ultrasound and spectrograms
Ultrasound can be used to non-invasively inspect material and weld joints and detect defects inside them. This is realized specifically by an ultrasonic device that generates ultrasound and sends it into the material, and then receives the reflections of the ultrasound from structures (e.g. weld joints, voids, cracks etc.) inside the material. The received ultrasound signals are continuous signals.
How to best represent ultrasound data to feed into the neural network is not as straightforward as for image data. The best representation may vary, since ultrasound has many defining properties. A raw ultrasound signal, with amplitude values at certain times, can only represent part of the total information carried by the signal. A transformation of the signal into the frequency domain reveals other properties unique to that signal. The short time Fourier transform (STFT) combines the time and frequency components of a signal into one representation called a spectrogram. Each spectrogram is a grey- or color-scaled plot of amplitude, with time on the x-axis and frequency on the y-axis [27].
The STFT is defined in [28] as

\mathrm{STFT}_x(t_0, f) = \int_{-\infty}^{\infty} \left[ x(t) \cdot w(t - t_0) \right] e^{-j 2 \pi f t} \, dt, \qquad (2.1)

and the spectrogram is defined as an intensity plot of the STFT magnitude [29]:

\mathrm{Spectrogram} = \left| \mathrm{STFT}_x(t_0, f) \right|^2. \qquad (2.2)
The STFT uses a window function w(t), e.g. a periodic Hann window, to slide
over the signal and perform a Fourier transform on the part of the signal x(t)
visible through the window, hopping forward in time after each computation. The
windows are controlled by the window length and hop-size, and they typically
overlap in time.
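As an illustration of Eqs. (2.1)–(2.2), a spectrogram can be computed with a sliding Hann window in plain NumPy. The window length, hop size and the 1 kHz test tone below are arbitrary illustrative choices, not the parameters used in this project:

```python
import numpy as np

def stft_spectrogram(x, win_len=256, hop=128):
    """Spectrogram as |STFT|^2, cf. Eqs. (2.1)-(2.2)."""
    window = np.hanning(win_len)                     # Hann window w(t)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])    # windowed signal segments
    stft = np.fft.rfft(frames, axis=1)               # Fourier transform per frame
    return np.abs(stft) ** 2                         # squared magnitude

# illustrative test tone: 1 kHz sine sampled at 8 kHz for one second
fs, win_len = 8000, 256
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)

spec = stft_spectrogram(x, win_len=win_len)
peak_bin = int(np.argmax(spec[0]))    # strongest frequency bin in the first frame
peak_hz = peak_bin * fs / win_len     # convert bin index to Hz
```

With these choices, overlapping windows hop 128 samples at a time, and the energy in each frame concentrates in the bin corresponding to the tone's frequency.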
2.2 Machine learning and deep learning
ML uses labeled data to learn the rules explaining the underlying problem, without explicit human intervention. [30] gives a formal definition of ML: ”A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.” These learning tasks are commonly grouped as follows:
• Supervised learning, or learning to imitate. The goal is to predict the output given some input data with known labels [31]. This is done by fitting a model on the input/output pairs, learning the relationship between them.
• Unsupervised learning, where the difference to supervised learning is that there are no known labels for the inputs. The goal is instead to find relevant relations between the inputs themselves. One method is called clustering, aiming to group inputs in a way that the members of each group are more similar to each other than to members of any other group [32].
• Reinforcement learning, or learning by trial and error. An agent inter- acts with an environment. Rewards are the result of an evaluation of the environment, and the agent’s goal is to maximize that reward [33].
Figure 2.1: Illustration of how AI, ML and deep learning are related to each other.
This thesis is restricted to the field of supervised learning.
Deep learning is about letting computers take a hierarchical approach to learning patterns. This way, the computer can learn from experience and break down difficult problems into simpler ones. The name deep learning comes from the deep, layered, hierarchical structure for learning [34]. An example of this is the deep neural network.
2.3 Feedforward neural networks
Neural networks are inspired by the neuron model in [35]. The first model that was trainable by only feeding it inputs was the perceptron [36], [37]. Feedforward neural networks are characterized by their layered structure and no feedback connections [38]. An example of a feedforward neural network is shown in Figure 2.2. If the network has at least one feedback connection it is called a recurrent network.
2.3.1 Weights and bias
All neurons are connected to the neurons in the previous layer, and the weights can be thought of as the strengths of those connections. The bias indicates how prone this particular neuron is to activate. The output of each neuron is defined as

y = \sigma\left( b + \sum_{i=1}^{n} w_i a_i \right), \qquad (2.3)

where a_i, w_i and n are the inputs, the weights and the number of nodes in the previous layer, respectively. The bias is denoted by b, as shown in Figure 2.2. The activation function \sigma is described further in Section 2.3.5. To summarize, the learning part of ML is to find weights and biases such that the output of the network correctly conforms with the label of the input.
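As a sketch of Eq. (2.3), the output of a single neuron is a weighted sum of its inputs plus a bias, passed through an activation function (here the sigmoid; all numbers are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical inputs, weights and bias for one neuron (n = 3)
a = np.array([0.5, -1.0, 2.0])   # inputs a_i from the previous layer
w = np.array([0.4, 0.3, -0.2])   # weights w_i
b = 0.1                          # bias

z = b + np.dot(w, a)             # b + sum_i w_i * a_i
y = sigmoid(z)                   # activation sigma applied to the weighted sum
```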
2.3.2 Loss/cost function
To score how well the network performs during training, the loss function is used as a similarity measure between the predicted label and the actual label.

Figure 2.2: Feedforward neural network where, for each node, a_i is the input value, w_i is the weight, b is the bias and y is the output value.

The
input to the loss function is all the weights and biases. The output of the loss function gives a value of how well the network is performing at that moment. The goal however is to find the choice of parameters which minimizes this loss function.
A common loss function for a binary classification problem is binary cross-entropy, defined as
C = -\left( y \log(p) + (1 - y) \log(1 - p) \right), \qquad (2.4)
where y is the binary label and p is the predicted probability of the input being
in class y = 1. Finding the global minimum of the loss function is hard, but there are many ways to find a local minimum.
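A small numeric sketch of Eq. (2.4) shows the intended behavior: a confident correct prediction is penalized lightly, a confident wrong one heavily (the probabilities are illustrative):

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Eq. (2.4): y is the true label in {0, 1}, p the predicted P(class 1)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

low_loss = binary_cross_entropy(1, 0.9)   # confident and correct: small loss
high_loss = binary_cross_entropy(1, 0.1)  # confident and wrong: large loss
```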
2.3.3 Gradient descent using backpropagation
The gradient gives the direction of steepest ascent, and the negative gradient gives the direction of steepest descent. In the negative gradient vector of the loss, a relatively large negative component indicates that decreasing this particular weight by a lot will also decrease the cost function by a lot. The preferred changes are those that give the most value for the money; that is, the negative gradient of the loss function tells the neurons how to change their weights in order to improve the predictive capabilities of the network. The gradients are computed using backpropagation [39], an algorithm that works its way backwards from the output loss by computing the partial derivative of the loss function with respect to each weight.
Neither the neural network nor the programmer can explicitly change the output of the neurons, only implicitly by changing the weights and biases. The neurons in the second-to-last layer that have a positive weight to the desired neuron in the last layer should have their activation increased. By symmetry, the neurons in the second-to-last layer with a negative weight to that output neuron should have their activations decreased. For a binary classification problem, the only other output neuron wants the opposite. Adding the two neurons' wishes gives a list of how all the weights in the second-to-last layer should change in order to minimize the loss.
The same reasoning holds for neurons in the previous layers, walking backwards through the network.
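The chain-rule reasoning above can be checked numerically. For a single sigmoid output neuron with binary cross-entropy loss, backpropagation collapses to the compact gradient dC/dw_i = (p − y) · a_i, which can be verified against a finite-difference approximation (all values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one sigmoid output neuron with binary cross-entropy loss (toy values)
a = np.array([0.5, -1.0, 2.0])   # inputs to the neuron
w = np.array([0.4, 0.3, -0.2])   # weights
b, y = 0.1, 1.0                  # bias and true label

def loss(weights):
    p = sigmoid(b + np.dot(weights, a))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# backpropagation: for sigmoid + cross-entropy the chain rule gives
# dC/dz = p - y, hence dC/dw_i = (p - y) * a_i
p = sigmoid(b + np.dot(w, a))
grad_backprop = (p - y) * a

# finite-difference approximation of the same gradient
eps = 1e-6
grad_numeric = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                         for e in np.eye(3)])
```

The two gradients agree to several decimal places, which is the standard sanity check for a backpropagation implementation.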
It is computationally expensive to compute the gradient over the whole training set.
A common way to overcome this is stochastic gradient descent (SGD), a gradient-based optimization technique suitable for machine learning problems [40].
The data is shuffled and then split into mini-batches, for each of which the negative gradient is computed. This is not the true gradient, but a good enough approximation [2]. The choice of optimizer in a neural network is essentially a choice of how the parameters of the network will be updated based on the loss.
The magnitude of the update step is called the learning rate. Going through the same procedure for all training samples in a batch, and saving each desired change to the weights and biases, the parameters are updated with the average of those changes.
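The shuffle/mini-batch/average-update loop described above can be sketched on a toy one-parameter regression problem (the data, learning rate and batch size are illustrative assumptions, not values from this project):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy regression data with a known answer: y = 3x
X = rng.normal(size=200)
Y = 3.0 * X

w = 0.0          # single trainable parameter
lr = 0.1         # learning rate: magnitude of each update step
batch_size = 20

for epoch in range(20):
    idx = rng.permutation(len(X))                 # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]     # one mini-batch
        pred = w * X[batch]
        # average gradient of the squared error over the mini-batch
        grad = np.mean(2.0 * (pred - Y[batch]) * X[batch])
        w -= lr * grad                            # step along the negative gradient
```

Each mini-batch gradient is only an approximation of the full gradient, yet the parameter still converges to the true value 3.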
Figure 2.3: Convolution with a filter size of 3 × 3 and a stride of 1, using zero padding, yields a new image with the same resolution as the original one.
2.3.4 Convolutional neural networks
A CNN is an artificial neural network (ANN) widely used for image analysis. It excels at detecting spatial patterns in the input data. The hidden layers are so-called convolutional layers.
Layers near the input are able to detect general features like edges and shapes.
Convolutional layers closer to the top of the network are more specialized and can detect complex textures and shapes. Between each pair of convolutional layers is an activation function, described in Section 2.3.5.
Each convolutional layer consists of filters that convolve across the input pixels until they have covered the whole image. The dot product of the filter and the input pixel values is stored as a new ”image” and passed to the next layer. This procedure reduces the dimensions of the image from (n × n) to (n − f + 1) × (n − f + 1) with a filter of size (f × f), unless padding is used. Zero padding inserts extra pixels with the value zero on the border of the image. This ensures that the filters can convolve over the pixels near the borders, thus preserving the shape of the original image. An example of the convolution operation can be seen in Figure 2.3.
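A minimal sketch of this convolution step with zero padding (using the elementwise cross-correlation convention standard in CNNs; the image and filter values are made up for illustration):

```python
import numpy as np

def conv2d_same(image, kernel):
    """Stride-1 'convolution' (cross-correlation, the CNN convention)
    with zero padding, so the output keeps the input resolution."""
    f = kernel.shape[0]
    pad = f // 2
    padded = np.pad(image, pad)                   # zero padding on the border
    out = np.zeros(image.shape)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            patch = padded[i:i + f, j:j + f]      # region under the filter
            out[i, j] = np.sum(patch * kernel)    # dot product with the filter
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
# made-up 3x3 filter responding to horizontal intensity changes
edge_filter = np.array([[0.0, 0.0, 0.0],
                        [-1.0, 1.0, 0.0],
                        [0.0, 0.0, 0.0]])
result = conv2d_same(img, edge_filter)
```

Because of the zero padding, the 4 × 4 input yields a 4 × 4 output, as described above.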
Max/average pooling, on the other hand, is used to purposely reduce the size of the image after a convolutional layer. A pool of size (m × m) moves across the image with a stride of g, where each pooling operation outputs the maximum/average value of the pixels in the (m × m) pool. The output is a lower-resolution version of the convolutional output, leading to a reduced number of parameters and computational cost. An example of the max pooling operation can be seen in Figure 2.4. The most active regions of the image are fed forward, while the non-active regions are discarded. This also makes the model sensitive to shifts in the input image [41], because pooling ignores the sampling theorem by not applying anti-aliasing via low-pass filters. However, [42] has managed to integrate anti-aliasing into the pooling operation, improving both robustness and accuracy.
For example, max pooling the 4 × 4 input

    4 0 4 3
    4 6 8 5
    3 1 6 7
    4 5 6 9

with a 2 × 2 filter and stride 2 yields the 2 × 2 output

    6 8
    5 9

Figure 2.4: Max pooling with a filter size of 2 × 2 and a stride of 2 reduces the image to half its original resolution.
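The max pooling operation can be sketched in a few lines; the 4 × 4 input below is the same worked example as in Figure 2.4:

```python
import numpy as np

def max_pool(image, size=2, stride=2):
    """Keep only the maximum of each (size x size) region."""
    h, w = image.shape[0] // stride, image.shape[1] // stride
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = image[i * stride:i * stride + size,
                              j * stride:j * stride + size].max()
    return out

img = np.array([[4, 0, 4, 3],
                [4, 6, 8, 5],
                [3, 1, 6, 7],
                [4, 5, 6, 9]])
pooled = max_pool(img)
```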
2.3.5 Activation functions
The output from a hidden layer in the network passes through an activation function before being used as input to the next layer. This function must be non-linear for the network to be able to learn the complex underlying patterns connecting inputs and the corresponding labels. Historically, the sigmoid function σ(x) = 1/(1 + e^{−x}) has been very popular. In recent years, the most common activation function for neural networks is the rectified linear unit (ReLU), defined as
\sigma(x) = \max(0, x). \qquad (2.5)
The prediction layer uses the softmax activation function [43], defined as
\sigma(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}.
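Both activation functions can be sketched directly from their definitions (the logit values below are made up; the two entries mirror the two-neuron pass/fail prediction layer used in this thesis):

```python
import numpy as np

def relu(x):
    """Eq. (2.5): pass positive values through, clamp the rest to zero."""
    return np.maximum(0.0, x)

def softmax(x):
    """Exponentiate the inputs, then normalize so the outputs sum to one."""
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / np.sum(e)

logits = np.array([2.0, -1.0])           # two made-up output-neuron values
probs = softmax(logits)                  # interpretable as class probabilities
activated = relu(np.array([-3.0, 0.5]))  # negative input clamped to zero
```

Because softmax normalizes the exponentiated values, the two outputs always sum to one and can be read as the predicted probabilities of the two classes.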