A Deep Learning Approach to Detection and Classification of Small Defects

on Painted Surfaces

A Study Made on Volvo GTO, Umeå

Authors:

Johannes Sjölund & Johannes Rönnqvist

Supervised by:

Eric Lindahl, Natalya Pya Arnqvist & Blaise Ngendangenzwa

Master Thesis, 30 Credits

Abstract

In this thesis we conclude that convolutional neural networks, together with phase-measuring deflectometry techniques, can be used to create models which detect and classify defects on painted surfaces very well, even compared to experienced humans. Further, we show which preprocessing measures enhance the performance of the models. We see that standardisation increases the classification accuracy of the models. We demonstrate that cleaning the data through relabelling and removing faulty images improves classification accuracy, and especially the models' ability to distinguish between different types of defects. We show that oversampling might be a feasible method to improve accuracy by increasing and balancing the data set through augmenting existing observations. Lastly, we find that combining many images with different patterns heavily increases the classification accuracy of the models. Our proposed approach is demonstrated to work well in a real-time factory environment. An automated quality control of the painted surfaces of Volvo Truck cabins could give great benefits in cost and quality.

The automated quality control could provide data for a root-cause analysis and a quick and efficient alarm system. This could significantly streamline production and at the same time reduce costs and errors in production. Corrections and optimisation of the processes could be made at earlier stages and with higher precision than today.

Summary

In this report we show that convolutional neural network models, together with phase-measuring deflectometry, can find and classify defects on painted surfaces with high precision, even compared with experienced operators. Further, we show which data-processing measures increase the performance of the models. We see that standardisation increases the models' classification ability. We show that cleaning the data through relabelling and removal of faulty images improves the classification ability, and especially the models' ability to distinguish between different types of defects. We show that oversampling can be a method to improve precision by increasing and balancing the data set through altering and duplicating existing observations. Finally, we find that combining several images with different patterns substantially increases the models' classification ability. Our proposed approach has been shown to work well in real time in a production environment. An automated quality control of the painted surfaces of Volvo's truck cabins could bring great benefits with respect to cost and quality. The automated quality control could provide data for a root-cause analysis and a quick and efficient alarm system. This could significantly streamline production while reducing costs and errors in production. Corrections and optimisation of the processes could be made at earlier stages and with higher precision than today.


We would like to start by thanking our supervisors Natalya Pya Arnqvist and Blaise Ngendangenzwa for their helpful guidance throughout the project, especially for their, at times, harsh and rightful reprimands regarding our writing. We also thank our supervisor at Volvo, Eric Lindahl, for his help and superb support throughout the project. Gratitude also goes to the workers at Volvo who collected the data and aided us during the project. Finally, we thank our thesis colleagues at Volvo for all the sharing of ideas and for tolerating our overwhelming number of bad jokes. We are proud of all of you for managing to get through these difficult times. All of the above-mentioned have made our time at Volvo GTO Umeå more than pleasant!


Contents

1 Introduction
   1.1 Background
   1.2 Purpose
   1.3 Goal
   1.4 Scope and Limitations
   1.5 Outline

2 Data
   2.1 Pilot Rig
       2.1.1 Phase-Measuring Deflectometry
   2.2 Types of Defects
   2.3 Data Acquisition

3 Neural Networks
   3.1 Neural Networks Basics
       3.1.1 Activation Functions
       3.1.2 Regularisation
       3.1.3 Backpropagation
       3.1.4 Performance Metrics
   3.2 Convolutional Neural Networks
       3.2.1 Convolutional Layers
       3.2.2 Pooling
       3.2.3 Inception Module

4 Method
   4.1 Programs and Hardware
   4.2 Data Preprocessing and Data Structuring
       4.2.1 Patches - Sliding Window Approach
       4.2.2 Preprocessing
       4.2.3 Combining Channels
   4.3 Modelling Approach

5 Result
   5.1 Reference Model Validation
   5.2 Comparing Preprocessing Measures
   5.3 Comparing Channel Combinations
   5.4 Pilot Test

6 Discussion

7 Conclusion

8 References

A Appendix
   A.1 Mislabelled Data - Training Set
   A.2 Confusion Matrix Combination
   A.3 Visualisation Program


1 Introduction

1.1 Background

Volvo Trucks is among the largest producers of heavy-duty trucks in the world [1]. Volvo's cabin factory in Umeå is among the biggest engineering industries in northern Sweden and, with its high degree of automation, it is also one of the most modern of its kind [2][3]. The factory in Umeå includes stamping, welding, pre-treatment, and painting processes. The Umeå factory is also Volvo's Cab Competence Center and shall therefore be at the technological forefront and develop these processes. The paint shop at the Umeå factory is already very modern and highly automated, with the exception of the inspection and quality control. After the painting, a manual quality control is carried out, and from that it is determined whether the cabin's painted surface is approved or if any further treatment such as polishing or repainting has to be done.

Today's manual control of the painted-surface quality is effective, but it has some drawbacks. For starters, it takes a lot of time to carefully examine a whole cabin, which leads to high labour costs. Another factor is precision: it is hard for a worker to be consistent, as daily form and tiredness can change a worker's judgement, which leads to less predictable and less consistent quality controls.

In order to evaluate the possibility of an automated surface inspection and quality control, AB Volvo (the parent company of Volvo Trucks) started a project in cooperation with Umeå University, Volvo Cars, and Vinnova called FIQA, Finish Inspection and Quality Analysis. The goal of FIQA is to investigate the possibilities of improving the quality inspection through automatic cabin quality inspection and analysis of production data. FIQA consists of two parts, called Work Package 1, WP1, and Work Package 2, WP2. WP1 consists of investigating the possibilities of automating the detection and classification of defects in the top coat, which is the outermost layer of the painted surface. WP2 is directed towards creating a root-cause analysis of the quality data, and also towards creating an alarm system which shall alert when there are systematic problems with the quality or an elevated risk of quality problems.

The hope is that these two work packages can be connected. The automatic defect detection and classification should provide data for the root-cause analysis and for a quick and efficient alarm system. This could significantly streamline production and at the same time reduce costs and errors in production. Corrections of the processes can be made at earlier stages and with higher precision than today. Further, the collected and stored quality data could be used in the optimisation of processes and standards.

Volvo Trucks strives to automate this quality control with an image capturing system and an algorithm which would automatically provide data about the quality of the top coat. However, the defects which are in consideration in the quality control can be very small, often under one millimetre, which has made this kind of automation non-trivial. There has been previous work performed within WP1. A functional image-acquisition rig, which we will call the Pilot Rig in this thesis, and a program for visualising eventual defects on the cabin have been made. There has also been work which attempted to detect and classify two common defects in the top coat, crater defects and dirt defects. This work was performed by Blaise Ngendangenzwa and Natalya Pya Arnqvist, who created classification algorithms [4]. These algorithms have reached some success in predicting and classifying defects, but improvements need to be made in prediction time and accuracy for practical use in the factory.

Our approach to detecting and classifying defects is based on deep learning algorithms. Deep learning is when a learning algorithm has multiple layers, where the deeper layers learn more abstract concepts [5]. Over the last years, deep learning has overtaken most machine learning areas in terms of performance [6]. One of the areas where machine learning has made the most progress in the last decade is image recognition, and the algorithms which have stood for this improvement are almost all deep learning algorithms [7]. These deep learning algorithms are all of the class deep neural networks, or more specifically deep convolutional neural networks. With the recent advancement of deep convolutional neural networks in the area of image recognition, it is reasonable to assume that they would provide great benefits in performing analyses and creating quality data about the cabin's painted surface. Deep learning has not yet been tried as a way of creating this quality data at Volvo Trucks' paint shop. We therefore see it as a prime candidate for evaluation. In this thesis we present a functional application of deep convolutional neural networks for defect detection and classification.

1.2 Purpose

The purpose of this thesis work is to find a method of prediction that can automatically detect, locate, and classify eventual defects in the top coat using images of the painted surface of a truck cabin. Further, the result will be used to investigate the possibilities of an automated quality inspection system which, in extension, can be connected to further analyses such as an extensive root-cause analysis.


1.3 Goal

The goal of this thesis work is to find an approach to detect and classify dirt and crater defects, with great accuracy, using convolutional neural networks. Our goal is also to find how much information is necessary to take into consideration in order to get a well-performing model with regard to accuracy and prediction time. We will also investigate which data preprocessing and data engineering measures improve the result, and to what degree. This approach should also work in a real-time production environment.

1.4 Scope and Limitations

This thesis will focus on the possibility of detecting and classifying defects on painted surfaces using convolutional neural networks only. Only a small part of the cabin will be investigated: the luggage lid. The luggage lid is flat, while many other areas of painted vehicles might be curved, uneven, or contain pressings or gaps. With the luggage lid being one of the easier areas to investigate, the results of this report might not be applicable to the whole cabin. Moreover, the cabins passing by the pilot rig were examined regardless of colour, and we will not examine the colour effects that the cabins might have on the models. The images that are evaluated are in grey scale. We will also only examine one neural network architecture, so the results may not be applicable to neural networks with different architectures. Further, each model was only trained once; more training instances of each model would statistically strengthen the result. Also, the test set which the models were compared against did not have fully correct labels, which means the performance evaluation does not represent the models' true capacity.

Relabelling data is a very time-consuming task, and also, if we were to do that task on the test labels, it could induce a sort of observation bias; we could be affected by our expectation of what the models would predict.

There is a possibility that data containing even more information can be obtained. In this thesis, only images with vertical sinusoidal stripes are evaluated. The addition of images with horizontal sinusoidal stripes, or possibly even other patterns, might extract even more information about the surface.

1.5 Outline

This thesis will start with an explanation of how the data have been collected, how they are structured, and the method used to generate the different images which were used throughout this thesis. Then, in the neural network section, we briefly explain the theory behind neural networks and convolutional neural networks in particular. The performance metrics we use to evaluate our models are also presented in this section.

The Method section covers what programs and hardware we used, our image-preprocessing measures, and how the models are constructed. The Result section is divided into four parts. First we examine the overall performance of one of our models, which we will use as the baseline to compare all the other models to. In the second part we compare the effects of the different preprocessing measures. In the third part we compare models trained on patches (see section 4.2.1) of different frequencies (see section 2.1.1), and determine the best combination of channels (see section 2.1.1) to use in the models. Finally, in the fourth part we test our approach in a real-time factory environment and evaluate the performance. We then discuss the implications of our results and draw conclusions.

2 Data

In the following section we present the method used when acquiring the data. We also present the pilot rig, which is used to acquire data and to scan the luggage lid in a real-time factory application scenario, i.e. attempting to detect and classify defects on the painted surface of the luggage lid as cabins pass the pilot rig.

2.1 Pilot Rig

The pilot rig has two grey-scale digital cameras, one 55-inch screen, and a powerful computer. The technique used for acquiring images of the luggage lid is phase-measuring deflectometry, which is presented in section 2.1.1. The process starts with a cabin coming in on the conveyor line; it then stops by the rig for some seconds while the cameras capture 16 images of the luggage lid. Those 16 images have different patterns reflected on the surface of the cabin. The different patterns are constructed with the function described in Section 2.1.1 and are of four different frequencies, where each frequency has four phases. We call these 16 images with different patterns reflected on the surface channels. Each image is a fusion of both cameras' captures, and each fused image has the resolution 4928 × 2056, or 10 megapixels, and covers a surface area of roughly 0.2 m².

If the rig is used for scanning the luggage lid for defects, as in the real-time factory application scenario, there is also a screen with a visualisation program which can plot the predicted defects in real time. The plot is shown just after the images have been captured and an algorithm has made a prediction of the locations and classes of eventual defects. The visualisation program which plots the defects is shown in Appendix A.3.

The hardware of the computer used in the pilot rig consisted of:

• an Intel Xeon E5-2637 v4 CPU, 3.50 GHz

• an Nvidia Quadro P4000 GPU

• 160 GB RAM memory

• 2x Basler acA2440-35um cameras

• 55" Sony TV-screen

The pilot rig uses the same software as presented in section 4.1. A picture of the pilot rig’s image capturing system is shown below in Figure 1.

Figure 1 – Pilot rig. Cabin to the left (shown in purple), screen to the right (shown in red) with the two cameras above the screen (shown in green).


2.1.1 Phase-Measuring Deflectometry

Phase-measuring deflectometry is used to enhance the contrast of the structure of a surface, so that surface deviations become clearer and more visible and as much information as possible can be extracted from the surface. In phase-measuring deflectometry, different patterns, in our case sinusoidal patterns varying in the horizontal direction, are displayed on a screen; the pattern is reflected on a surface and then captured by one or more cameras. The sinusoidal pattern, for every pixel column on the TV screen, is created as follows:

$$I = 0.5 + 0.5\sin(2\pi f k x + \theta), \qquad (1)$$

where I is the screen brightness for that pixel column, k is a constant, and x is the pixel column index. The two parameters which are changed to create different sinusoidal patterns are f, the frequency, and θ, the phase. Because all pixel rows in a pixel column have the same brightness, the effect of the sine pattern is that the screen appears to show vertical stripes which shift between bright and dark.
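As an illustration, a pattern like equation (1) can be generated with a few lines of NumPy. This is a minimal sketch: the screen resolution and the scaling of the constant k are our own assumptions, not values from the rig.

```python
import numpy as np

def sinusoidal_pattern(width, height, f, theta, k=1.0):
    """Vertical-stripe pattern following equation (1): every pixel column x
    gets the brightness I = 0.5 + 0.5*sin(2*pi*f*k*x + theta), repeated
    down all rows of that column."""
    x = np.arange(width)
    # x is scaled by the width so that f equals the number of bright
    # stripes shown across the screen (an assumption, see Section 2.1.1).
    brightness = 0.5 + 0.5 * np.sin(2 * np.pi * f * k * x / width + theta)
    return np.tile(brightness, (height, 1))

# 4 frequencies x 4 phases = the 16 different patterns (channels).
frequencies = [16, 32, 64, 128]
phases = np.deg2rad([0, 90, 180, 270])
patterns = [sinusoidal_pattern(1920, 1080, f, t)
            for f in frequencies for t in phases]
```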

A visual example of the steps in phase-measuring deflectometry is shown in Figure 2.


Figure 2 – Depiction of the measurement principle of phase-measuring deflectometry. The computer creates a pattern which is displayed on the monitor/screen. The camera captures the reflection of the pattern on the surface, which is then sent back to the computer [8]. An eventual defect may then show up as a distortion of the pattern in the image.

Frequencies

Increasing the frequency f in equation (1) shortens the period of the sine pattern. The effect of this is that the vertical stripes become thinner and closer together. The frequency corresponds to the number of vertical bright stripes that are displayed on the TV screen; the frequency f = 32 thus has 32 bright stripes displayed on the screen. Four different frequencies have been used at the pilot rig during the collection of the data set, namely f = {16, 32, 64, 128}. Figure 3 shows the same small section of an image but with the sinusoidal patterns from the four different frequencies reflected on the surface. The differences in the width of the stripes are clearly visible.



Figure 3 – Images showing the four different frequencies. (a) f = 16, (b) f = 32, (c) f = 64, (d) f = 128. The dot in the middle is a defect.

Phases

For each frequency, four different phases, θ in equation (1), have also been used at the pilot rig during the collection of the data set. Different phases have different horizontal offsets of the vertical lines in the patterns. The phases used are θ = {0, 90, 180, 270}.


Figure 4 – The four different phases for frequency f = 16. (a) θ = 0, (b) θ = 90, (c) θ = 180, (d) θ = 270. The dot in the middle is a defect.

Figure 4 shows how the pattern with vertical stripes is horizontally offset at phases θ = {90, 180, 270} compared to θ = 0.

2.2 Types of Defects

In this thesis, two types of defects have been considered: dirt and crater. This is because these defect types are two of the most common defects on the painted surface. A dirt defect is a small foreign particle in the paint, and a crater defect is a small bowl-shaped recess in the paint.



Figure 5 – Different types of defects. (a) Dirt defect, bright cabin, (b) Dirt defect, dark cabin, (c) Crater defect, bright cabin, (d) Crater defect, dark cabin.

Figure 5 shows representative examples of the two classes on both bright and dark cabins. The great majority of the defects considered in this thesis are 0.3-3 millimetres in diameter, with 0.3-1 millimetres being the most common defect size. 0.3 millimetres corresponds to 2 × 2 pixels in the images.

2.3 Data Acquisition

Data was collected in the FIQA project before our thesis work started. Roughly the first half of the collected data is considered training data and the other half test data. Mainly two experienced operators, one per work shift, have inspected the luggage lids visually and annotated the defects they have seen. They have annotated the coordinates of where the defects are located on the luggage lid of a specific cabin and the type of each defect. They have also annotated coordinates on the luggage lid where there are not supposed to be any defects as non-defect.

In the training data, 414 defects have been annotated with the label crater, 4826 with the label dirt, and 13559 have been annotated with the label non-defect. In the test data, there are 583 observations with label crater, 3861 with label dirt, and 12404 with label non-defect.

In this thesis, we divide the cabins into three different types with different-looking luggage lids: the FH cabin, FM long cabin, and FM short cabin, which are shown in Figure 6. Images of these cabins, taken by the two cameras at the pilot rig, are shown in Figure 7.


Figure 6 – Sketches of the three different cabin types. The red-marked area is the area that is being evaluated for defects. (a) Cabin type FH, (b) Cabin type FM long, (c) Cabin type FM short.



Figure 7 – Raw images of the different cabin types. The images are captured with the cameras at the pilot rig with the frequency f = 16 pattern reflected on the cabin surface. (a) Cabin type FH, (b) Cabin type FM long, (c) Cabin type FM short.

3 Neural Networks

A neural network is a machine learning algorithm used to create non-linear statistical models for regression and classification. Neural networks are made up of several different functions, which is why they are called networks. Neural networks are the most used deep learning method today [5]. In the following section, we present the basic structure of a "vanilla" neural network, also called a multilayer perceptron. Then we explain the different functions and general methods which we use to build, train, and evaluate these neural networks. We then present convolutional neural networks (CNN), which is the type of neural network that we used in this thesis and which is commonly used for many kinds of computer vision tasks, including image recognition.

3.1 Neural Networks Basics

A K-class classification neural network has the response variables $Y_k$, representing the classification model's probability of the observation X being of class k, k = 1, \ldots, K. A normal representation of a vanilla neural network consisting of a single fully connected hidden layer (also called a dense layer) is shown in Figure 8. The $Z_m$ nodes represent the hidden layer and the $X_p$ nodes represent the input observation.


Figure 8 – Schematic of a single-hidden-layer, feed-forward neural network [9].

This network can mathematically be represented as the two-step classification model in equation (2):

$$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \ldots, M,$$

$$T_k = \beta_{0k} + \beta_k^T Z, \quad k = 1, \ldots, K, \qquad (2)$$

$$f_k(X) = Y_k = g_k(T), \quad k = 1, \ldots, K,$$

where $Z = (Z_1, Z_2, \ldots, Z_M)$, $T = (T_1, T_2, \ldots, T_K)$, $\sigma(\cdot)$ is an activation function, and $g_k(T)$ is an output activation function which scales the outputs to represent probabilities (see 3.1.1). The parameters $\alpha_{0m}$ and $\beta_{0k}$ are called biases and the parameters $\alpha_m$ and $\beta_k$ are called weights. This complete set of weights and biases is usually denoted by $\theta$ [9]. The final classification of the object X is then calculated by:

$$G(X) = \arg\max_k f_k(X) \qquad (3)$$

3.1.1 Activation Functions

Activation functions are used to create non-linearity or to scale the outputs in neural networks. A neural network without activation functions will simply be a linear model. The most commonly used activation function in neural networks today is the ReLU, short for Rectified Linear Unit [5]. ReLU is defined as:

$$f(x) = \max(0, x),$$

where x is the weighted input to the node combined with the node-specific bias.

Another activation function, which is generally only used in the final layer of a neural network, is the softmax function. Softmax is used in many classification algorithms, such as multinomial logistic regression, to scale all outputs so that they are all positive and together sum to 1. For a class k, k = 1, \ldots, K, in a neural network as in equation (2), the softmax function is defined as:

$$g_k(T) = \frac{e^{T_k}}{\sum_{l=1}^{K} e^{T_l}},$$

where $T_k$ is the output node for class k. Because softmax always has positive outputs which sum to 1, the outputs can be viewed as probabilities of the observation belonging to each respective class [9].
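As a concrete illustration of equations (2) and (3) together with ReLU and softmax, here is a minimal NumPy sketch of a forward pass through a single-hidden-layer classifier. The layer sizes and random weights are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
P, M, K = 10, 5, 3            # input size, hidden nodes, classes (illustrative)

alpha0, alpha = rng.normal(size=M), rng.normal(size=(M, P))  # hidden biases/weights
beta0, beta = rng.normal(size=K), rng.normal(size=(K, M))    # output biases/weights

def relu(x):
    return np.maximum(0, x)

def softmax(t):
    e = np.exp(t - t.max())   # subtract max for numerical stability
    return e / e.sum()

X = rng.normal(size=P)        # one observation
Z = relu(alpha0 + alpha @ X)  # hidden layer, equation (2) with sigma = ReLU
T = beta0 + beta @ Z          # output nodes
Y = softmax(T)                # class probabilities, g_k(T)
G = np.argmax(Y)              # final classification, equation (3)
```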

3.1.2 Regularisation

A predictive statistical model should be able to generalise and predict the value or class of new objects which it has not been trained on. When a statistical model has too few observations compared to the number of parameters, it runs the risk of overfitting. When a model has been overfitted, it has adapted too well to the training data and cannot classify new observations effectively [10]. Regularisation is when a constraint is put on the model with the purpose of reducing overfitting [10].

A common method of regularisation when training statistical models is early stopping. Early stopping has some rules for when to stop training the model. The rule can be to train the network until it reaches a minimum of the validation loss function or a maximum validation accuracy, or until a certain predetermined number of epochs [10]. One epoch consists of all training observations passing through the model and their corresponding backpropagation (see section 3.1.3).

Batch normalisation is a transformation that scales the inputs to a node so that they have a mean of zero and unit variance. The scaling is computed from a subsample of the training set called a batch. Batch normalisation makes the loss surface smoother and thus the model can converge more easily [11]. It decreases the risk of the model being drawn towards an unwanted local minimum.

The mean and variance of the k:th input over the batch B can be described as:

$$\mu_{Bk} = \frac{1}{b} \sum_{i=1}^{b} \tau_{ik}, \qquad \sigma_{Bk}^2 = \frac{1}{b} \sum_{i=1}^{b} (\tau_{ik} - \mu_{Bk})^2,$$

where B is the batch of size b and $\tau_{ik}$ is the k:th input to the node for the i:th observation. For K input dimensions, k = 1, \ldots, K, and i = 1, \ldots, b, we have that:

$$\hat{\tau}_{ik} = \frac{\tau_{ik} - \mu_{Bk}}{\sqrt{\sigma_{Bk}^2 + \epsilon}},$$

where $\hat{\tau}_{ik}$ is the k:th input of the i:th observation after the normalisation step, and $\epsilon$ is a small value to prevent division by zero in case of zero-variance batches.

The final step of batch normalisation is to transform $\hat{\tau}_{ik}$ with the parameters $\gamma_k$ and $\nu_k$, which are learned through backpropagation. The transformation is performed as:

$$z_{ik} = \gamma_k \hat{\tau}_{ik} + \nu_k,$$

where $z_{ik}$ is the new k:th input for the i:th observation to the node [12].
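A minimal NumPy sketch of these three steps, assuming the node inputs for one batch are stored as a b × K array and that the learned parameters γ and ν are supplied:

```python
import numpy as np

def batch_normalise(tau, gamma, nu, eps=1e-5):
    """Batch normalisation of a (b, K) array of node inputs.

    Computes the batch mean and variance per input dimension, normalises
    to zero mean and unit variance, then applies the learned scale
    (gamma) and shift (nu) parameters."""
    mu = tau.mean(axis=0)                   # mu_Bk for every k
    var = tau.var(axis=0)                   # sigma^2_Bk for every k
    tau_hat = (tau - mu) / np.sqrt(var + eps)
    return gamma * tau_hat + nu             # z_ik
```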

Dropout is a standard regularisation method used in the final layers of a neural network. It prevents overfitting by removing a certain percentage of the links from a layer for each batch. The network can therefore not become reliant on certain links, or combinations of links, for its prediction, and thus the risk of overfitting is reduced.

3.1.3 Backpropagation

Backpropagation is the procedure for optimising neural networks. It consists of a "forward pass", where training data is predicted and a loss function is calculated, and a "backward pass", where the network weights are updated and tuned. These weights are tuned hierarchically, which means they are first tuned in the output layer, which then tunes the second-to-last hidden layer, and so on until all layers are tuned.


The loss function, categorical cross-entropy, $R(\theta)$, calculates the entropy between the different classes and is defined as:

$$R(\theta) \equiv \sum_{i=1}^{N} R_i = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log(f_k(x_i)),$$

where $\theta$ is the complete set of weights, K is the number of classes, N is the number of observations, $x_i = (x_{i1}, \ldots, x_{iP})$ is the i:th observation with P being the dimension of the input, $y_{ik}$ is the true probability of observation i belonging to class k, and $f_k$ is the classification function for class k [9]. $R_i$ is the categorical cross-entropy for observation i.
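As a small worked sketch, assuming one-hot labels and predictions that are already softmax probabilities:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """R(theta) summed over a batch; y_true is one-hot encoded and y_pred
    holds the predicted class probabilities f_k(x_i) per observation."""
    return -np.sum(y_true * np.log(y_pred + eps))

# Two observations, three classes (illustrative values).
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]])
loss = categorical_cross_entropy(y_true, y_pred)  # -(log 0.9 + log 0.7)
```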

The most common way of tuning the weights of the layers in a neural network is through gradient descent on the loss function. In gradient descent, the goal is to reach a minimum by following the function's steepest descent. In neural networks, we strive to minimise the loss function, in this case the categorical cross-entropy R(θ). The steepest descent is calculated separately through partial derivatives for every parameter. The steepest descent is traditionally calculated on the full training set; calculating it on a randomly selected subset of the training set (a batch) is called stochastic gradient descent [5]. When working with a large amount of data, the common practice is to use stochastic gradient descent.

For the same scenario as the two-step neural network at the beginning of section 3.1, let $\beta_k = (\beta_{k1}, \beta_{k2}, \ldots, \beta_{kM})^T$ and $\alpha_m = (\alpha_{m1}, \alpha_{m2}, \ldots, \alpha_{mP})^T$, where m = 1, \ldots, M and p = 1, \ldots, P. The partial derivatives of the loss function with respect to the parameters $\beta_{km}$ and $\alpha_{mp}$, for a batch of size N, are:

$$\sum_{i=1}^{N} \frac{\partial R_i}{\partial \beta_{km}^{(r)}}, \qquad (4)$$

$$\sum_{i=1}^{N} \frac{\partial R_i}{\partial \alpha_{mp}^{(r)}}. \qquad (5)$$

The β-parameters depend on the α-parameters, so the partial derivatives with respect to the α-parameters will depend on the partial derivatives of the β-parameters, which is why it is called backpropagation. Using these partial derivatives, the formulas for updating the parameters $\beta_{km}$ and $\alpha_{mp}$ are given in equations (6) and (7):

$$\beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \beta_{km}^{(r)}}, \qquad (6)$$

$$\alpha_{mp}^{(r+1)} = \alpha_{mp}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \alpha_{mp}^{(r)}}, \qquad (7)$$

where r is the iteration step and $\gamma_r$ is the learning rate, which is decided by the chosen optimisation algorithm (also called the optimiser).

Adam, or adaptive moment estimation, is an optimiser of the loss function based on stochastic gradient descent. It utilises stochastic gradient descent with momentum and an adaptive learning rate for each parameter. Adam is today considered a default optimiser when using neural networks.

3.1.4 Performance Metrics

Performance metrics are used to evaluate models after training. We use two different metrics in this thesis: accuracy and recall. These performance metrics are based on the classification criterion in equation (3).

Accuracy is the percentage of correctly classified objects and is calculated on the full set of predictions [5]:

$$\text{Accuracy} = \frac{\text{Number of correctly classified objects}}{\text{Total number of objects}}$$

Recall is usually calculated for each class separately and is defined as the percentage of observations from a certain class which were correctly predicted as coming from that class [5]. Recall for class k is defined as:

$$\text{Recall}_k = \frac{\text{Number of objects correctly classified as class } k}{\text{Total number of objects of class } k}$$
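Both metrics can be computed directly from arrays of true and predicted labels, as in this short sketch:

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of all objects whose predicted label matches the true label.
    return np.mean(y_true == y_pred)

def recall(y_true, y_pred, k):
    in_class = (y_true == k)               # all objects of class k
    return np.mean(y_pred[in_class] == k)  # fraction correctly classified as k
```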

3.2 Convolutional Neural Networks

Convolutional neural networks are neural networks which perform feature mappings with convolutional layers before using dense layers to classify the object. A feature mapping, when working with images, is a transformation of the images into some other feature space in order to better be able to extract information. Convolutional neural networks have, as mentioned earlier, become the standard in image recognition tasks due to their accuracy and efficiency.


3.2.1 Convolutional Layers

Convolutional layers consist of filters, called kernels. Kernels are transformation matrices used for feature mapping. Kernels can be of different sizes; they take values from nearby points in a matrix into consideration and use them to perform a linear transformation when feature mapping a point [5]. In our case, the matrix is a two-dimensional array of pixel values of a certain channel of a part of an image, and each point is one pixel value. Every kernel performs a feature mapping which corresponds to a feature map. Figures 9, 10, and 11 visualise the principle behind how a feature mapping procedure is performed with a kernel matrix of size 3 × 3.

Stride is the step length of the kernel in each dimension. The default stride is one through all dimensions, i.e. a feature mapping is performed on every data point/pixel, although it is not uncommon to use larger strides for the kernels in the feature mapping.

Figure 9 – Kernel matrix for performing a feature mapping to the left, and the matrix that is to be feature mapped to the right. The point to transform is shown in the black square.


Figure 10 – Highlighting area where kernel matrix is applied on the matrix and how the neighbouring points connect to the kernel matrix.

Figures 9 and 10 show the kernel matrix to the left and the matrix which has the point to transform to the right. Highlighted in blue in both images is the area which the kernel filter is going to be applied on in order to map the point containing the value 4, shown in the bold black square. The new value in the bold black square will then be the linear combination:

1 × 1 + 3 × 1 − 2 × 4 + 3 × 1 + 2 × 4 + 1 × 2 − 2 × 1 − 3 × 2 − 2 × 1 = −1

Figure 11 – Matrix after transformation of the highlighted point.

Figure 11 shows the image matrix after the kernel filter has been applied over the top left area and the value in the bold black square was transformed from 4 to −1.
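This sliding weighted sum is exactly what SciPy's 2D cross-correlation computes (the operation convolutional layers perform in practice). The sketch below uses made-up matrices, not the exact values from Figures 9-11:

```python
import numpy as np
from scipy.signal import correlate2d

# Illustrative 3x3 kernel and 4x4 input; the values are made up and do
# not correspond to the matrices shown in Figures 9-11.
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]])
image = np.array([[1, 3, 2, 0],
                  [3, 4, 2, 1],
                  [1, 2, 1, 0],
                  [0, 1, 3, 2]])

# 'valid' slides the kernel only over positions where it fits entirely,
# computing the weighted sum of each 3x3 neighbourhood (stride one).
feature_map = correlate2d(image, kernel, mode="valid")
```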

There is no clear answer as to what size of kernel to choose, but the most common when working with images is a square kernel with odd-numbered sides, i.e. 3 × 3 or 5 × 5. Larger kernels are more computationally costly. The kernels are automatically updated in the backpropagation operation, and thus one cannot really tell what features they learn. The number of kernels needed in a model structure depends on the complexity of the problem. The more complex a problem is, the more kernels are needed.

3.2.2 Pooling

Pooling is a dimension reduction technique, and the most common type of pooling is max pooling. Max pooling is a function which selects the largest value of a matrix of a specified size and retains only that value for the next layer. Max pooling also has a stride, and in order to reduce dimensions, the stride has to be larger than one in at least one dimension. The default stride is usually larger than one in every dimension. A visualisation of a dimension reduction via max pooling of size 2 × 2 with stride two in both dimensions is shown in Figure 12.

Figure 12 – Visualisation of the max pooling operation. Every sub-matrix only retains its highest number before the results are pooled in a new matrix.

When using max pooling on the raw greyscale-pixel values of images you only keep the brightest pixels in each small area after the max pooling. However, when max pooling is used after a convolutional layer, it is not clear what features are retained after the max pooling.

Average pooling is another pooling function used in this thesis. The procedure is the same as for max pooling, but it retains the average value of the matrix instead of the maximum value.
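A minimal sketch of 2 × 2 max pooling with stride two in both dimensions, as in Figure 12 (the input values are made up):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride two: each non-overlapping 2x2 block
    of the input is reduced to its largest value."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 1],
              [3, 4, 0, 8]])
pooled = max_pool_2x2(x)   # [[6, 4], [7, 9]]
```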

3.2.3 Inception Module

An inception module is a special recurring substructure within a convolutional neural network. It consists of multiple parallel layers with kernels of different sizes. A network which consists of these inception modules is commonly called an inception network. It most famously appeared in Google's network GoogLeNet, which won the ILSVRC (ImageNet Large Scale Visual Recognition Competition) in 2014 [13]. The idea behind inception modules was to capture details that are similar in appearance but of different sizes in an image, using the differently sized kernels [14]. See Section 4.3, Figures 22 and 23, for the architecture of the inception modules used.

4 Method

In this section, the methods and programs used in this thesis are presented.

4.1 Programs and Hardware

We chose to write all our programs and models in Python, version 3.5.6. We chose Python because it has many useful tools and frameworks for machine learning, and for neural networks especially, and we also had previous experience with the language.

We also used TensorFlow version 1.10.0, which is a powerful framework for computations in neural networks and other machine learning tasks. TensorFlow allows training neural networks on GPU cores, significantly reducing training and prediction times compared to traditional CPU training and prediction. TensorFlow is a standard framework for implementing neural networks today due to its efficiency. TensorFlow utilises tensors in its calculations; tensors, or arrays, can be seen as multidimensional vectors. All models in this thesis were trained on the GPU, which decreased training time significantly.

To make it easier to construct the neural networks, we used Keras as an API on top of TensorFlow. Keras was created with the goal of simplifying the process of creating neural networks [15]. Keras can also be used on top of the frameworks Theano and CNTK, but we have not tried to run the code on these frameworks. There is also the possibility to use Keras in the programming language R.

When training our networks, we used a PC which had the following specification:

• an Intel Xeon E5-2637 v4 CPU clocked at 3.50 GHz

• an Nvidia Quadro M4000 GPU, driver version 418.81 and CUDA version 10.0


• 112 GB RAM memory

• Windows 10 Enterprise

4.2 Data Preprocessing and Data Structuring

In this section we present how we structured the data for training and testing of the models in order to detect and classify defects. The preprocessing methods that will be evaluated in the Result section are also explained.

4.2.1 Patches - Sliding Window Approach

In order to scan the 16 channels of the full images for defects, as is the case in the real-time factory application, we chose to use a sliding window approach to divide our images into smaller sections, which we will call patches. Each patch contains all of the 16 channels. We then classify each patch separately as dirt, crater, or non-defect. In this thesis, we use non-overlapping windows. Non-overlapping windows have the advantage that fewer patches have to be classified, so fewer calculations have to be made compared to overlapping sliding windows. One issue with this approach is that a defect might show on two adjacent patches, so the models predict two defects when there is just one. Also, defects may be too small when they are split between two patches, so the model fails to classify them as defects. Figure 13 shows an image of a cabin being divided into patches, where classification can be made on each patch separately.

Figure 13 – Example of dividing an image into patches with non-overlapping sliding windows. Every square represents a patch.

To create our training set and test set, we first cropped patches of size 251 × 251 pixels at the labelled coordinates which were presented in section 2.3. These patches were created with the coordinate of an eventual defect in the centre, so that every large patch had a defect in the centre. However, when classifying patches in the real-time factory application, an eventual defect is not necessarily in the centre of a patch. Therefore, we cropped a smaller 128 × 128 patch from a random location of each 251 × 251 patch. The defect being in the centre of the big patch and the small patch being of size 128 × 128 ensure that an eventual defect will also be inside the small patch, but positioned anywhere within it, which emulates a real-time application scenario. The cropping is performed at the same location on the images for all channels, so the defect has the same position in all channels of the patch. Figure 14 shows an example of a randomly located 128 × 128 patch cropped from a 251 × 251 patch.

Figure 14 – Example of how we randomise the position of a defect in the patches by cropping a smaller patch from a bigger patch.
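A sketch of this random-crop step, assuming one labelled patch is stored as a (251, 251, 16) NumPy array with the defect at its centre:

```python
import numpy as np

rng = np.random.default_rng()

def random_crop(large_patch, size=128):
    """Crop a size x size patch at a random position from a larger patch,
    using the same position for all 16 channels. Because the defect sits
    at the centre of the large patch, any valid crop position of this
    size still contains it."""
    big = large_patch.shape[0]             # 251
    x = rng.integers(0, big - size + 1)    # random top-left corner
    y = rng.integers(0, big - size + 1)
    return large_patch[x:x + size, y:y + size, :]
```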

We chose 128 × 128 pixels as the size of the patches; it is equivalent to an area of 20 by 20 millimetres on the surface of the cabin. With this patch size, a full image contains 663 patches. This patch size offers a good trade-off between prediction speed and the precision of where the defect is located on the cabin. At most, a defect is located ≈ 14 millimetres away from the centre of the patch. Bigger patches would lose precision and run a bigger risk of having multiple defects in one patch. Conversely, smaller patches demand that more patches be classified when scanning a full image, and each classification made by a model runs a risk of misclassifying the patch; in this respect it is more beneficial to have bigger patches. The number 128 is also a power of 2, which works well with our most common dimension reduction, max pooling with stride (2, 2), which halves the number of pixels heightwise and lengthwise of the patch. See section 3.2.2 for an explanation of max pooling, and Figure 21 for the structure of our base model.

When scanning a full luggage lid for defects, as in the real-time factory application, we chose not to classify the non-square patches which are on the edge and are therefore not of size 128 × 128. We also did not classify patches which were not on the plain surface of the luggage lid. The number of patches considered on each luggage lid was around 600. The area of the FH cabin which we scan for defects is shown in Figure 15. Different cabin types have different areas which are scanned for defects.

Figure 15 – The area of the FH cabin which we scan for defects is within the green rectangles. Patches which have areas outside the rectangles are therefore not considered in the scanning and not classified.

4.2.2 Preprocessing

Since every patch is classified separately, many individual tests are performed when scanning a cabin's luggage lid. Because of this large number of tests, a very high accuracy is necessary to obtain a useful result. For example, a model with a 1% false positive rate would, with our number of patches, generate on average six false positives per luggage lid. This is significantly higher than the average number of defects per luggage lid.

Knowing that we had to create very accurate classification models, we chose to preprocess the data to make it easier for the neural network to represent. This section describes the preprocessing methods which have been used in this thesis.

Standardisation

It is a common measure when working with images in computer vision to standardise every image, so that all images have a similar average pixel value and a similar range of pixel values [5]. One of the most common forms of standardisation is Global Contrast Normalisation, GCN. In GCN, you subtract from each pixel $x_i$ the pixel mean of the whole image $\bar{x}$, and then divide each pixel by the image's standard deviation s. GCN gives every picture the same contrast and the same intensity. In theory, this should also remove most of the colour effects of the cabins.

In this thesis, GCN was applied to each channel of each patch separately. The formula for performing GCN on one patch is thus:

$$z_{ic} = \frac{x_{ic} - \bar{x}_c}{s_c},$$

where $z_{ic}$ is the new pixel value at the i:th pixel of the c:th channel, $x_{ic}$ is the i:th pixel of the c:th channel before the transformation, $\bar{x}_c$ is the average pixel value in the channel, and $s_c$ is the standard deviation of that same channel. Figure 16 below shows examples of non-standardised versus standardised patches containing a defect, for differently coloured cabin surfaces.



Figure 16 – Images which exemplify the reduction of colour effects after standardisation. Images (a-c) are images of cabins without standardisation: (a) is a bright-coloured cabin, (b) is an intermediate-coloured cabin, and (c) is a dark-coloured cabin. Images (d-f) are the same images as (a-c) but after standardisation.

As can be seen in Figure 16, the standardisation appears to remove any effect of colour. The brightness of the patches 16a-16c looks very different, and the defects are hard to distinguish visually, especially for the dark cabin. After standardisation, the brightness of the images looks very similar, as can be seen in 16d-16f. The defects also become easier to spot with the eye, and looking at 16d-16f, it is impossible to tell what their initial colours were.
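A sketch of per-channel GCN, assuming a patch is stored as a (128, 128, 16) array; the small epsilon guarding against zero variance is our own addition:

```python
import numpy as np

def gcn(patch, eps=1e-8):
    """Global Contrast Normalisation applied to each channel separately:
    subtract the channel's pixel mean and divide by its standard
    deviation, so every channel gets zero mean and unit variance."""
    mean = patch.mean(axis=(0, 1), keepdims=True)   # per-channel mean
    std = patch.std(axis=(0, 1), keepdims=True)     # per-channel std
    return (patch - mean) / (std + eps)
```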

Oversampling

Oversampling is when you increase the number of observations of some classes to reduce the imbalance in the data set. We chose to oversample using the same method as described in Section 4.2.1: we randomly cropped multiple smaller patches at randomised positions from a larger patch which has the defect centred. This allows the same defect to be represented in multiple smaller patches but at different locations. This method gives more instances of the defect but decreases the issues with overfitting that normal oversampling can have [16]. The method of duplicating data by changing some part of the images is also called data augmentation [5]. A visual example of how the oversampling is performed is shown below in Figure 17.

Figure 17 – Example of oversampling with four smaller patches randomly cropped from one larger patch, with the defect centred in the larger patch.

Data Cleaning

When cleaning the data, we chose to remove images which we considered faulty, and we also changed the labels of patches which we considered mislabelled by the operators. The cleaning was only done on the training data.

We removed images which had some peculiarities most likely associated with delays in the program or the cameras, for example images which did not contain any luggage lid or which had no sine pattern. There were also some images with a text on them saying "idle TV standby", which disrupted the sinusoidal pattern; these images were also removed. There were also images which were supposed to be of frequency f = 16 but in many cases were of frequency f = 128, or an intermediate between the two (shown in Figure 18). Around 400 cabins were removed from the training set in this way.


Figure 18 – Example of a patch with erroneous frequency: the image is labelled as frequency f = 16 but is in reality f = 128, or rather an intermediate between f = 16 and f = 128. Compare with Figure 3.

Then we looked at all the patches to ensure that they were labelled correctly. We did this because we had seen some incorrectly labelled defects just by skimming through the data set. If we were unsure what class a patch belonged to, we looked at all 16 channels of the patch in order to determine, as accurately as possible, whether the patch contained a defect and the type of that eventual defect. The patch was then annotated with the label that we considered correct. Examples of relabelled patches are shown in Appendix A.1. The number of labels changed for the observations in the training data can be viewed below in Table 1. After working with the data for this long, we believe that we have a good understanding of what dirt and crater defects look like, and we believe that our relabelling is correct. We do not believe we have more knowledge in the area than the operators have, but we had the benefit of being able to thoroughly look at all 16 channels and zoom in on the defects to view them better. In conversations with the operators, they also verified that it is a very hard task to distinguish between the two types of defects in certain cases, especially with limited time.


Table 1 – The distribution of each class in the original labels, the changes made to each label, and the distribution after relabelling.

Original labels:
    Non-defect  13559
    Dirt         4826
    Crater        414

Changes made during relabelling:
     38  Crater     ⇒ Dirt
     24  Non-defect ⇒ Dirt
      0  Non-defect ⇒ Crater
    170  Dirt       ⇒ Crater
    477  Dirt       ⇒ Non-defect
     54  Crater     ⇒ Non-defect
     28  Non-defect ⇒ Removed
     11  Dirt       ⇒ Removed
      2  Crater     ⇒ Removed

After relabelling:
    Non-defect  14038
    Dirt         4230
    Crater        490
    Removed        41

In total, 4.3% of the observations were relabelled, which means that 95.7% retained their label. 41 observations were removed and not used since they were faulty in some way, for example because they contained NaNs (Not a Number). We think that the test set and the training set have approximately the same percentage of mislabelling. Under this assumption, we believe that the best possible accuracy for a perfect model on the non-relabelled test set should be somewhere in the vicinity of the non-relabelled percentage in the data set, i.e. 95.7%. A model with a classification accuracy much higher than 95.7% on a data set with non-corrected labels is thus most likely not predicting correctly. A "true" accuracy could only be achieved with a thoroughly cleaned test set with proper labels.

4.2.3 Combining Channels

Since it is hard, not to say impossible, to see a defect in just one channel, we chose to use multiple channels when classifying a patch, to include more information about the surface in the patch. The case of a defect showing in some of the channels but not in others is visualised in Figure 19.



Figure 19 – Examples of the increasing information which multiple channels provide. All patches have f = 16. (a) θ = 0, (b) θ = 90, and (c) θ = 180. The defect is not visible in (a), but is visible in (b) and (c).

Sometimes it is also hard to distinguish the type of defect by looking at just one channel. A crater may look like a dirt defect in one phase of a frequency and like a crater in the remaining phases. Figure 20 exemplifies when the defect type is not visible in one channel (a), but can be distinguished in the other two (b-c). If only (a) were considered, the patch would very likely be classified as a dirt defect when it really is a crater defect.


Figure 20 – Examples of the increasing information which multiple channels provide. All patches have f = 16. (a) θ = 0, (b) θ = 90, and (c) θ = 180. The type of defect is unclear in (a), but is clear in (b) and (c).

These examples indicate that it is a good idea to utilise several channels in order to make an accurate prediction.
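In practice, utilising several channels simply means stacking the chosen pattern images into one multi-channel input array, as in this small sketch (the shapes are assumed):

```python
import numpy as np

def stack_channels(channels):
    """channels: list of 16 greyscale patches, one per frequency/phase
    combination, each of shape (128, 128). Stacking them along a last
    axis gives one (128, 128, 16) input for the network."""
    return np.stack(channels, axis=-1)
```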


4.3 Modelling Approach

Here we present the modelling approach and model structures used in this thesis, how the models are trained, the validation method, our method of ensembling multiple models together, and the reference model.

Model Structure

Our base model architecture is based on inception modules consisting of kernels of different sizes. First it has three convolution layers, where each convolution layer is immediately followed by a max pooling layer for dimension reduction to make the computations more efficient. This part is commonly called the stem of the network. After that, there are two inception modules containing four parallel convolution layers each. Then there is another max pooling layer followed by two more inception modules. Finally, there is an average pooling layer and then a dense layer before the output layer. The ReLU function is used as activation function for each convolutional and dense layer except the output layer, which uses the softmax activation function. Categorical cross-entropy was used as the loss function, and Adam was chosen as the optimisation algorithm. The motivation for these choices is that they are considered the default for this kind of image recognition and classification problem. Previous studies showed that this model structure performs well for this problem, and thus we did not experiment with different model architectures in order to obtain a better result.
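To illustrate the kind of structure described above, here is a much-simplified Keras sketch with a small stem and a single inception-style module. The filter counts, layer sizes, and input shape (four phase channels of one frequency) are made up for illustration and do not reproduce the exact architecture in Figures 21-23.

```python
from tensorflow import keras
from tensorflow.keras import layers

def inception_module(x, filters):
    # Parallel convolution branches with differently sized kernels,
    # concatenated along the channel axis.
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(filters, 5, padding="same", activation="relu")(x)
    pool = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    return layers.Concatenate()([b1, b3, b5, pool])

inputs = keras.Input(shape=(128, 128, 4))            # one patch, 4 channels assumed
x = layers.Conv2D(32, 3, activation="relu")(inputs)  # stem
x = layers.MaxPooling2D(2)(x)
x = inception_module(x, 32)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(3, activation="softmax")(x)   # non-defect, dirt, crater

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```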


Figure 21 – Base model architecture. The structures of the Inception-layers are shown in Figures 22 and 23.


Figure 22 – Architecture of the inception 1 module.

Figure 23 – Architecture of the inception 2 module.

The base model shown in Figure 21 has 3,641,027 parameters in total.

Model Training

Previous studies showed that 15 epochs were enough to yield a good result regarding validation accuracy; after 15 epochs, there were no further improvements. We also found that the default learning rate for Adam was not sensitive enough to yield the best result, so after training the model for 15 epochs, it was fine-tuned for one epoch with the learning rate decreased to a tenth of the default. More than one epoch of fine-tuning did not yield better results. In all, the models were trained for 16 epochs unless otherwise stated. Note that this is valid for the base models only. The ensembles were trained for one epoch only, since any more training than that generally yielded worse results.

Ensemble of Models

When utilising patches from more than one frequency, we used an ensemble of the base models. Preliminary studies showed that the obtained result was better when using ensembles of base models trained on patches of separate frequencies, rather than combining the frequencies in the same base model input line. This ensembling was performed by combining base models trained on patches of a single frequency, excluding the base models' output layers, and adding an extra dense layer just before the output layer. The weights of the convolutional layers of the base models included in the ensemble were transferred from the individual base models and set as non-trainable, so they were not updated when the ensemble was trained. Only the dense layers at the end of the ensemble were trained, for one epoch.
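A sketch of this ensembling in Keras, assuming four already-trained base models (one per frequency) whose output layers have been removed; the size of the extra dense layer is a placeholder:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_ensemble(base_models):
    """Join per-frequency base models (output layers removed) with a
    trainable dense head; the transferred base weights stay frozen."""
    for m in base_models:
        m.trainable = False                        # do not update base weights
    inputs = [keras.Input(shape=m.input_shape[1:]) for m in base_models]
    features = [m(inp) for m, inp in zip(base_models, inputs)]
    x = layers.Concatenate()(features)
    x = layers.Dense(64, activation="relu")(x)     # extra dense layer
    outputs = layers.Dense(3, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```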

Model Validation

The models were validated against the test set with the original labels. This will most likely yield a slightly misleading result if one looks only at the accuracy, since there are very likely a significant number of observations with wrong labels present in the test set, just as in the training set. In the training set we found that 4.3% of the observations had the wrong label, as shown earlier in section 4.2.2, and we believe the ratio of erroneous labels in the test set is about the same. We chose not to alter the test set labels in order to get an unbiased benchmark. The accuracies will thus most likely not represent the "true" accuracies of the models. We still think the accuracy of predictions on the uncleaned test data set is a good benchmark, since a better model should generally make fewer errors and thus have a higher accuracy and recall.

Reference Model

The reference model hereafter is the model using all 16 channels, with the patches from each frequency as a separate input line. Each input line consists of the base model excluding the output layer; the input lines are then joined as an ensemble with an extra dense layer at the end. The reference model is trained on oversampled and relabelled data with GCN standardisation. This is the baseline against which all other models were compared.


Figure 24 – Reference model architecture.

5 Result

The data referenced in this section is the test data set. The test data set has not been relabelled or cleaned. As mentioned earlier, all test patches were cropped using the same random sampling technique as described in Section 4.2.1 to get a test set that best matches a realistic real-time application scenario.

In this section we evaluate the performance of different models on the test data set. First, the performance of the reference model is evaluated through its accuracy. Then we examine those observations where the reference model's predictions deviated from the original labels, in order to determine whether our model or the original label was correct in those cases. Then we present a comparison between the reference model and models excluding some preprocessing measure, to find out whether each specific preprocessing step has any effect and offers an accuracy improvement. After that we compare which frequency produces the best model on its own, to establish a ranking order for ensembling. Then ensembles of base models with differing numbers of channels are compared against each other to find out how many channels are necessary to produce accurate predictions.

5.1 Reference Model Validation

The accuracy of the reference model was 95.82%. The confusion matrix of the original labels versus our predicted labels was:

                     Predicted
                   N        D       C
  Original    N    12365    38      1
              D    385      3328    148          (8)
              C    72       61      450

We see from the confusion matrix that our reference model accurately classifies observations from the three classes. Since an accuracy of 95.82% is around our theoretically estimated maximum accuracy, presented in Section 4.2.2, it is interesting to compare the observations where the human labels differ from the reference model's predictions. Looking at a sample of 150 of the observations where the predictions and original labels differ, we judged that the reference model was correct in 141 (94%) of those cases and the original manual labelling was correct in 9 (6%). We also looked at 500 of the observations where the model's predictions did not differ from the original labels, and we judged that the model's predictions were correct in all 500 cases. The recall for the reference model, shown in Figure 25 and based on the confusion matrix above, also shows that the model can detect defects and distinguish between the two defect types to a high degree.
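For concreteness, the quoted accuracy and per-class recall follow directly from the confusion matrix in (8): recall is the diagonal element divided by the row sum. A short numpy check (the variable names are ours):

```python
import numpy as np

# Confusion matrix from (8); rows: original labels, columns: predictions.
cm = np.array([[12365,   38,    1],    # N (non-defect)
               [  385, 3328,  148],    # D (dirt)
               [   72,   61,  450]])   # C (crater)

accuracy = np.trace(cm) / cm.sum()     # 16143 / 16848 ~ 95.82%
recall = np.diag(cm) / cm.sum(axis=1)  # per class: diagonal / row sum
for name, r in zip(["Non-defect", "Dirt", "Crater"], recall):
    print(f"{name}: {r:.1%}")          # ~99.7%, ~86.2%, ~77.2%
```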


5.2 Comparing Preprocessing Measures

We compared different preprocessing measures by training models on patches excluding one preprocessing measure at a time and comparing their performance with the reference model. We also trained one model on patches without any preprocessing. All these models use the same structure as the reference model. The performance of the differently preprocessed models is shown in Table 2 and in Figure 25.

Table 2 – Comparison of models excluding different preprocessing measures.

Experiment           Confusion matrix (rows: Original; columns: Predicted)   Accuracy (%)

Non-standardised           N        D       C
                      N    12353    47      4                                95.51
                      D    402      3304    155
                      C    66       83      434

Non-relabelled             N        D       C
                      N    12269    128     7                                95.16
                      D    357      3417    87
                      C    50       187     346

Non-oversampled            N        D       C
                      N    12358    44      2                                95.76
                      D    384      3343    134
                      C    77       74      432

Non-preprocessed           N        D       C
                      N    12325    79      0                                94.39
                      D    391      3433    37
                      C    59       380     144


Figure 25 – Recall (%) of each class (Non-defect, Dirt, Crater) for the reference, non-standardised, non-relabelled, non-oversampled and non-preprocessed models. Data is based on the confusion matrices in Table 2 and the confusion matrix in (8).

Non-Standardisation

As we reasoned earlier in Section 4.2, GCN standardisation might remove, or at least reduce, colour effects between brighter and darker colours and could thus improve model performance. To investigate this, we trained one model with the same architecture as the reference model on non-standardised data and compared its performance with the reference model.

Comparing the non-standardised model in Table 2 with the reference model in Section 5.1, GCN standardisation appears to improve the total accuracy. Another noteworthy point is that we had to depart from our training standard of 16 epochs for the base models trained on non-standardised data: convergence was distinctly slower, so much so that every base model had to be trained for over 30 epochs. The effect of standardisation is therefore twofold; it improves accuracy somewhat and speeds up convergence of the models by a great margin.
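The exact GCN variant we use is described in Section 4.2; as a reference point, a common minimal formulation (zero mean, unit contrast per patch) looks as follows. The epsilon floor is our assumption to guard against division by zero on flat patches.

```python
import numpy as np

def gcn(patch: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Global contrast normalisation: zero mean, unit contrast per patch."""
    patch = patch.astype(np.float32)
    centred = patch - patch.mean()          # remove the patch's mean intensity
    return centred / max(patch.std(), eps)  # scale by its standard deviation
```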

Non-Relabelling

Since manual annotation errors were found in the data set, we wanted to investigate whether correcting those mislabellings would have an effect on the prediction results. How this relabelling was done was explained in Section 4.2.2.

Looking at Table 2, we see that the correction of labels has a clear impact on the accuracy, especially on the ability to distinguish between crater and dirt (shown in Figure 25), where the difference between the model trained on non-relabelled data and the reference model is substantial.

Non-Oversampling

Since the original training data set was imbalanced, with 14,038 non-defect patches, 4,230 dirt patches, and 490 crater patches after relabelling, we oversampled dirt and crater in order to make the training data set more balanced, improving accuracy and the ability to differentiate between the two defect classes. We oversampled dirt to twice and crater to eight times the original amount, resulting in 8,460 dirt and 3,920 crater patches in the oversampled data set. We did not oversample further because we wanted to retain some of the class proportions of patches on the luggage lid: in reality there is a great majority of non-defect patches, and dirt is more common than crater.

Oversampling does increase the risk of overfitting, although no sign of overfitting was noticed in this test. More aggressive oversampling might have been better still, but was not tested.
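An illustrative sketch of such an oversampling step is shown below. The array names and shapes, and the use of random flips as the augmentation, are our assumptions for the example; the factors match the counts above (4,230 × 2 = 8,460 and 490 × 8 = 3,920).

```python
import numpy as np

def oversample(patches: np.ndarray, factor: int) -> np.ndarray:
    """Grow a class to `factor` times its size; extra copies are flipped."""
    copies = [patches]
    for _ in range(factor - 1):
        if np.random.rand() < 0.5:
            copies.append(patches[:, :, ::-1, :])  # horizontal flip
        else:
            copies.append(patches[:, ::-1, :, :])  # vertical flip
    return np.concatenate(copies, axis=0)

# Assumed arrays of shape (N, height, width, channels):
dirt_os = oversample(dirt_patches, 2)      # 4,230 -> 8,460
crater_os = oversample(crater_patches, 8)  # 490   -> 3,920
```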

The model trained on oversampled data has a slightly higher total accuracy than the model trained on non-oversampled data, see Table 2 and Section 5.1. Oversampling also makes the model better at distinguishing craters from dirt, given that the reference model has a higher recall for crater than the non-oversampled model (Figure 25).

Non-Preprocessing

We also tried fitting a model without any of the data preprocessing; the result is presented in Table 2. This model detects defects very well but has a very hard time distinguishing craters from dirt: its recall for crater is 24.70%, meaning it correctly finds and classifies 24.70% of the labelled craters. This shows that preprocessing is necessary for the model to differentiate accurately between craters and dirt. The model also had severe difficulties converging, so the number of training epochs had to be raised above the standard 15; in the end, each base model was trained for 30 epochs.

Preprocessing Conclusions

Relabelling appears to be the most important preprocessing measure for obtaining a really high accuracy. Standardisation and oversampling appear to increase the performance but are less important than relabelling. We still judge the oversampling as useful and will use it, since the increase in computational cost is small.

Without any preprocessing it is still possible for the model to determine whether there is a defect or not with high accuracy, but it loses the ability to distinguish between the two defect types. Since all preprocessing measures result in improved accuracy and a better ability to distinguish between the two defect types, all models hereafter are trained on patches that have been relabelled and standardised, with oversampling of the patches containing defects.

5.3 Comparing Channel Combinations

We wanted to find out how much information is necessary to create a model that predicts well. This was done by comparing models trained on different combinations of the 16 available channels. All models in this section were trained on standardised, oversampled, and relabelled data.

Frequency Choice

In order to find out how many channels are necessary to create a good model, a ranking order was established between base models trained on the different frequencies, so that we know in which order they are to be included in the ensemble models: the frequency with the best accuracy is combined with the second best in an ensemble of two; the first, second, and third form an ensemble of three, and so on. The models were trained on patches of all four phases for every frequency.

Table 3 – Accuracy and ranking order of base models trained on patches of different frequencies but on all phases.

Frequency    Accuracy (%)    Rank
16           95.33           3
32           95.60           1
64           95.48           2
128          94.81           4

Table 3 shows that the model trained on patches of frequency f = 32 with all four phases is the best, followed by the models trained on f = 64 and then f = 16. The model trained on patches of frequency f = 128 was clearly the worst, with an accuracy below those of the models trained on the other frequencies.


Combination Test

In order to determine how much information is needed for a highly accurate prediction result, base models were combined into ensembles in different constellations.

Base models were trained on patches of every frequency with every possible number of phases for each frequency, i.e. four models per frequency. These base models were then combined as ensembles in the established rank order. The number of phases in Table 4 is the number of phases of each frequency included in the ensemble; if the number of phases equals two, then the base models were trained on two phases for every frequency. Two frequencies and two phases thus constitute four channels in total.

Table 4 – Model accuracies of ensembles with different constellations of channels. Accuracies in %.

Frequencies \ no. of phases     1        2        3        4
{32}                            94.15    95.57    95.48    95.60
{32, 64}                        94.83    95.71    95.81    95.85
{16, 32, 64}                    94.88    95.79    95.90    95.86
{16, 32, 64, 128}               95.00    95.24    95.84    95.82


Figure 26 – Surface plot of accuracy for models with differing numbers of channels. Data is from Table 4.

Figure 26 and Table 4 show the accuracy of the ensemble models consisting of different numbers of channels. We see in Table 4 that the accuracy is highest with three frequencies, f = {16, 32, 64}, with three phases each, totalling nine channels. However, as can also be seen in Table 4, eight models reach a high accuracy of over 95.7%. Confusion matrices for all channel combinations in this section can be found in Table 6 in Appendix A.2.

5.4 Pilot Test

It was in our interest to see how our models performed in a real-time application at the rig in the factory setting; we validated only the reference model. The complete time for preprocessing and prediction with the reference model was just short of five seconds, of which the prediction itself accounted for 1.8 seconds.
